Merged
6 changes: 6 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,6 @@
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "weekly"
3 changes: 3 additions & 0 deletions .github/workflows/publish.yml
@@ -21,6 +21,9 @@ jobs:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: eu-west-2
- name: Prepare build
run: |
make all
- name: Build wheel
uses: PyO3/maturin-action@v1
with:
27 changes: 15 additions & 12 deletions .github/workflows/python-ci.yml
@@ -20,18 +20,17 @@ jobs:
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Install Poetry
uses: snok/install-poetry@v1
- name: Setup python
uses: actions/setup-python@v4
with:
python-version: '3.9'
cache: 'poetry'
- name: Install Deps
run: poetry install
- name: Run Black
run: "poetry run black --check --diff ."
- uses: pre-commit/action@v3.0.0
test:
strategy:
fail-fast: false
matrix:
py:
- "3.12"
- "3.11"
- "3.10"
- "3.9"
- "3.8"
runs-on: ubuntu-latest
steps:
- name: Checkout
@@ -41,11 +40,15 @@ jobs:
- name: Setup python
uses: actions/setup-python@v4
with:
python-version: '3.9'
python-version: ${{ matrix.py }}
cache: 'poetry'
- name: Install Deps
run: |
poetry install
- name: Prepare build
working-directory: .
run: |
make all
- name: Install csv_gp
run: |
poetry run maturin develop
16 changes: 6 additions & 10 deletions .pre-commit-config.yaml
@@ -1,15 +1,11 @@
repos:
- repo: https://github.com/psf/black
rev: "24.2.0"
- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: 'v0.6.4'
hooks:
- id: black
language_version: python3.10
args: ["--config", "csv_gp_python/pyproject.toml"]
- repo: https://github.com/pycqa/isort
rev: "5.13.2"
hooks:
- id: isort
args: ["--profile", "black"]
- id: ruff
args: [ "--fix" ]
- id: ruff-format

- repo: https://github.com/pre-commit/pre-commit-hooks
rev: "v4.5.0"
hooks:
4 changes: 2 additions & 2 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Xelix (GSPV Ltd)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
6 changes: 6 additions & 0 deletions Makefile
@@ -0,0 +1,6 @@
all: csv_gp_python/LICENSE csv_gp_python/README.md

csv_gp_python/LICENSE: LICENSE
	cp LICENSE csv_gp_python
csv_gp_python/README.md: README.md
	cp README.md csv_gp_python
34 changes: 31 additions & 3 deletions README.md
@@ -23,20 +23,48 @@ Add the following to your `Cargo.toml`:

### From package manager

The library is uploaded to the `xelix` codeartifact repository, once you are authenticated to use that you can install with:
The library is available on PyPI at https://pypi.org/project/csv-gp/, so you can install it with:

`pip install --index-url <codeartifact url> csv-gp`
`pip install csv-gp`

### Compiling from source

1. [Install rust](https://www.rust-lang.org/tools/install)
2. Install maturin (`pip install maturin`)
3. Clone the repo
4. `cd csv_gp_python && maturin develop`
4. Run `make all`
5. `cd csv_gp_python && maturin develop`

## Usage

### Rust standalone binary

After installing the binary, run `csv-gp $FILE` to print a diagnosis of the file. The command provides options to change the delimiter and the encoding of the file. See `csv-gp -h` for details.

The `--correct-rows-path` option exports only the correct rows to the given path.

### Python library

The Python library exposes two main functions, `check_file` and `get_rows`.

The `check_file` function takes the path to a file, the delimiter and the encoding (see https://github.com/xelixdev/csv-gp/blob/0f77c62841509c134a3bbe06ec178426e9c5aa10/csv_gp_python/csv_gp.pyi) and returns an instance of the `CSVDetails` class, which provides details about the file. The same stub file lists all available attributes with their names and types.
If the `valid_rows_output_path` argument is provided, only the correct rows are exported to that path.

The `get_rows` function takes the same path, delimiter and encoding, plus a set of row numbers, and returns the parsed cells for those rows. See the stub file linked above for the exact types of the parameters and return value.
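
As a rough illustration of how the two functions might be combined, the sketch below checks a file and then fetches the rows whose column count differs from the header. The helper names `summarise` and `diagnose`, the `","` and `"utf-8"` defaults, and the assumption that `get_rows` accepts the same numbers reported by `too_few_columns` and `too_many_columns` are all illustrative and not confirmed by the library. The `SimpleNamespace` object is a stand-in so that `summarise` can be tried without the package installed.

```python
from types import SimpleNamespace


def summarise(details):
    """Collect a few CSVDetails attributes (names taken from csv_gp.pyi) into a dict."""
    return {
        "columns": details.column_count,
        "too_few": sorted(details.too_few_columns),
        "too_many": sorted(details.too_many_columns),
        "valid": len(details.valid_rows),
        "header_ok": not details.header_messed_up,
    }


def diagnose(path, delimiter=",", encoding="utf-8"):
    """Hypothetical helper: check a file, then fetch the rows with a wrong column count."""
    # Imported lazily so summarise() can be exercised without csv_gp installed.
    import csv_gp

    details = csv_gp.check_file(path, delimiter, encoding)
    # Assumption: these line numbers are valid inputs for get_rows().
    suspect = set(details.too_few_columns) | set(details.too_many_columns)
    return summarise(details), csv_gp.get_rows(path, delimiter, encoding, suspect)


# Stand-in with the same attribute names, used only to demonstrate summarise().
fake_details = SimpleNamespace(
    column_count=3,
    too_few_columns=[4],
    too_many_columns=[2],
    valid_rows={0, 1, 3},
    header_messed_up=False,
)
summary = summarise(fake_details)
```

With the package installed, `diagnose("data.csv")` would return the summary together with the list of `(row_number, cells)` tuples that `get_rows` produces.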

## Releasing a new version of the Python lib

1. Update version numbers in `csv_gp_python/pyproject.toml` and `csv_gp_python/Cargo.toml`
2. Merge this change into main
3. Create a new release on GitHub, creating a tag in the form `vX.Y.Z`
4. The 'Publish' pipeline should begin running, and the new version will be published
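
The release steps above rely on the tag having the form `vX.Y.Z` and on the version files agreeing with it. A minimal sketch of a pre-release sanity check follows; `parse_release_tag` and `versions_match` are hypothetical helpers that are not part of this repository, and the version strings are passed in by hand rather than read from the files.

```python
import re

# Matches tags such as "v0.2.0"; anything else is rejected.
TAG_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_release_tag(tag):
    """Return (major, minor, patch) for a tag of the form vX.Y.Z."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"tag {tag!r} is not of the form vX.Y.Z")
    return tuple(int(part) for part in match.groups())


def versions_match(tag, *versions):
    """True when every given version string equals the tag without its leading 'v'."""
    expected = ".".join(str(part) for part in parse_release_tag(tag))
    return all(version == expected for version in versions)


# Example: the tag and both manifest versions agree.
ok = versions_match("v0.2.0", "0.2.0", "0.2.0")
# Example: one manifest was not bumped.
stale = versions_match("v0.2.0", "0.2.0", "0.1.12")
```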

## Running tests

### Running Rust tests

Run `cargo test`.

### Running Python tests

Follow the instructions on compiling from source. Then you can run `pytest`.
2 changes: 1 addition & 1 deletion csv_gp/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "csv-gp"
version = "0.1.12"
version = "0.2.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
2 changes: 2 additions & 0 deletions csv_gp_python/.gitignore
@@ -0,0 +1,2 @@
LICENSE
README.md
2 changes: 1 addition & 1 deletion csv_gp_python/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "csv-gp-python"
version = "0.1.12"
version = "0.2.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
23 changes: 2 additions & 21 deletions csv_gp_python/csv_gp.pyi
@@ -1,6 +1,4 @@
from typing import Optional

class UnknownEncoding(Exception):
class UnknownEncoding(Exception): # noqa: N818
pass

class CSVDetails:
@@ -9,70 +7,60 @@ class CSVDetails:
"""
Number of non-empty rows (including the header) in the file
"""
...

@property
def column_count(self) -> int:
"""
Number of columns according to the header
"""
...

@property
def invalid_character_count(self) -> int:
"""
Number of REPLACEMENT CHARACTERs (U+FFFD) in the file
"""
...

@property
def column_count_per_line(self) -> list[int]:
"""
Number of columns per line, the index corresponding to the line number
"""
...

@property
def too_few_columns(self) -> list[int]:
"""
List of line numbers that contain fewer columns than the header
"""
...

@property
def too_many_columns(self) -> list[int]:
"""
List of line numbers that contain more columns than the header
"""
...

@property
def quoted_delimiter(self) -> list[int]:
"""
List of line numbers that contain a correctly quoted delimiter
"""
...

@property
def quoted_newline(self) -> list[int]:
"""
List of line numbers that contain a correctly quoted newline
"""
...

@property
def quoted_quote(self) -> list[int]:
"""
List of line numbers that contain quoted-quotes ("")
"""
...

@property
def quoted_quote_correctly(self) -> list[int]:
"""
List of line numbers that contain correctly quoted-quotes (only contained within quoted cells)
"""
...

@property
def incorrect_cell_quote(self) -> list[int]:
@@ -83,46 +71,39 @@ class CSVDetails:
- Missing an opening or closing quote
- Containing unquoted quotes
"""
...

@property
def all_empty_rows(self) -> list[int]:
"""
List of line numbers where all cells in the row are empty (either zero characters or just `""`)
"""
...

@property
def blank_rows(self) -> list[int]:
"""
List of line numbers that are completely blank
"""
...

@property
def valid_rows(self) -> set[int]:
"""
Set of all row numbers that are valid in the file
"""
...

@property
def header_messed_up(self) -> bool:
"""
The header is considered messed up when none of the rows have the same number of columns as the header
"""
...

def check_file(path: str, delimiter: str, encoding: str, valid_rows_output_path: Optional[str] = None) -> CSVDetails:
def check_file(path: str, delimiter: str, encoding: str, valid_rows_output_path: str | None = None) -> CSVDetails:
"""
Check the file located at `path`, interpreting the file with `delimiter` and `encoding`

If `valid_rows_output_path` is passed, a file containing the valid rows will be written to the specified path
"""
...

def get_rows(path: str, delimiter: str, encoding: str, row_numbers: set[int]) -> list[tuple[int, list[str]]]:
"""
Returns all the rows in the file in the `row_numbers` set
"""
...
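
To make the column-count docstrings above concrete, here is a small standard-library sketch of how `column_count_per_line`, `too_few_columns`, `too_many_columns` and `header_messed_up` could be derived. It is not the library's implementation (which is written in Rust). It assumes 0-based line numbers with the header at index 0, and it treats a file with no data rows as having a messed-up header, which the real library may not do.

```python
import csv
import io


def classify_column_counts(text, delimiter=","):
    """Return (counts, too_few, too_many, header_messed_up) for CSV text."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], [], [], False
    header_width = len(rows[0])
    # One entry per line; the list index doubles as the line number.
    counts = [len(row) for row in rows]
    too_few = [i for i, n in enumerate(counts) if n < header_width]
    too_many = [i for i, n in enumerate(counts) if n > header_width]
    # "Messed up": no data row has the same width as the header.
    header_messed_up = not any(n == header_width for n in counts[1:])
    return counts, too_few, too_many, header_messed_up


sample = "a,b,c\n1,2,3\n1,2\n1,2,3,4\n"
counts, too_few, too_many, header_messed_up = classify_column_counts(sample)
```

For the sample above, line 2 has too few columns and line 3 has too many, while the header is considered fine because line 1 matches its width.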