---
title: "Why Pandera"
description: |
  A key functionality of Sprout is checking whether user-supplied data match the metadata that describe them.
  This post explains why we chose to use `pandera` for automating this process.
date: "2024-11-08"
categories:
- backend
- develop
- check
---

::: content-hidden
Use other decision posts as inspiration for writing these. Leave the
content-hidden sections in the text for future reference.
:::

## Context and problem statement

::: content-hidden
State the context and some background on the issue, then write a
statement in the form of a question for the problem.
:::

A key functionality of Sprout is checking whether user-supplied data
match the metadata that describe them. Metadata are stored in JSON files
following the [Data Package](https://datapackage.org/) standard.
Checking data against metadata has two components: verification and
validation. Verification involves checking whether the overall structure
of the data (e.g. the number of columns and their data types) is as
expected, while validation involves checking that all individual data
items meet the constraints listed in the metadata (e.g. maximum values,
specific formats).
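
To make the distinction concrete, here is a minimal, hypothetical sketch
of table metadata following the Data Package standard, written as a
Python dict; the field names and constraint values are our own
illustrative choices:

```python
# A hypothetical Data Package table schema, written as a Python dict.
# Verification checks the structure: which fields exist and their types.
# Validation checks the constraints on individual values.
table_schema = {
    "fields": [
        {
            "name": "id",  # structure: checked during verification
            "type": "integer",  # structure: checked during verification
            "constraints": {"unique": True},  # checked during validation
        },
        {
            "name": "weight_kg",
            "type": "number",
            "constraints": {"minimum": 0, "maximum": 500},  # validation
        },
    ]
}
```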

We are looking for a tool to automate the data verification and
validation process. The question then is:

*Which data verification and validation tools are available, and which
one should we use?*

## Decision drivers

::: content-hidden
List some reasons why we need to make this decision and what things
have arisen that impact this work.
:::

- The new tool should support both data verification and validation.
- Ideally, it should support multiple tabular data formats, including
  [`polars`](https://pola.rs/) data frames.
- It should be easy to transform JSON metadata into the representation
  required by the tool.
- The tool should be able to handle relatively large datasets
  efficiently.
- Support for extracting metadata from data would be a plus.

## Considered options

::: content-hidden
List and describe some of the options, as well as some of the benefits
and drawbacks for each option.
:::

- [`frictionless-py`](https://framework.frictionlessdata.io/)
- [`pandera`](https://pandera.readthedocs.io/en/stable/index.html)
- [Great
  Expectations](https://docs.greatexpectations.io/docs/core/introduction/)
- [Pydantic](https://docs.pydantic.dev/latest/)

### `frictionless-py`

[`frictionless-py`](https://framework.frictionlessdata.io/) is the
Python implementation of the Data Package standard, maintained by the
standard's parent organisation, and as such it would be the obvious
choice for our use case. Alongside data verification and validation, it
supports checking metadata against the Data Package standard and
building pipelines for transforming data.
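
As a rough sketch of how this looks in practice (the descriptor path is
a hypothetical placeholder), checking a data package against its
descriptor is a single call:

```python
# A minimal sketch of checking data with frictionless-py.
from frictionless import validate

# Checks the resources listed in the descriptor against their schemas.
report = validate("datapackage.json")

print(report.valid)  # True only if every check passed
# Flatten the report to inspect individual errors.
print(report.flatten(["rowNumber", "fieldName", "type"]))
```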

::::: columns
::: column
#### Benefits

- Supports both data verification and validation, although it is not
  possible to run these checks separately.
- Multiple tabular data formats are supported, including `pandas` data
  frames.
- Directly compatible with our JSON metadata, as it implements the
  Data Package standard.
- Supports large data files.
- Supports extracting metadata from data, matching the Data Package
  standard.
:::

::: column
#### Drawbacks

- The API suggests that it is possible to filter for specific errors,
  but this functionality does not seem to work fully.
- There are a number of different entry points to the
  verification/validation flow, and it is quite difficult to foresee
  how these differ in behaviour.
- `polars` data frames are not supported.
- So far, we've found it a bit difficult to navigate the
  `frictionless-py` codebase and documentation.
:::
:::::

### Pandera

[`pandera`](https://pandera.readthedocs.io/en/stable/index.html) is a
flexible data validation library operating on data frames. Its
validation mechanism is based on the concept of a schema expressing
expectations about the data. It also has capabilities for preprocessing
data and generating synthetic data from `pandera` schemas.
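
As a brief sketch (the column names and checks are our own illustrative
choices), a `pandera` schema for `polars` data frames might look like
this:

```python
# A minimal sketch of a pandera schema for polars data frames.
import polars as pl
import pandera.polars as pa

schema = pa.DataFrameSchema(
    {
        # Verification-style check: the column exists with this data type.
        "id": pa.Column(pl.Int64),
        # Validation-style check: individual values meet a constraint.
        "weight_kg": pa.Column(pl.Float64, checks=pa.Check.in_range(0, 500)),
    }
)

df = pl.DataFrame({"id": [1, 2], "weight_kg": [70.5, 80.2]})
validated = schema.validate(df, lazy=True)  # lazy=True collects all errors
```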

::::: columns
::: column
#### Benefits

- Supports both data verification and validation, and can run these
  checks separately.
- Supports `polars` data frames to a large extent (see the drawbacks
  column).
- Supports large datasets.
- Offers schema inference, although not with `polars`.
- `pandera` is widely used, extensively tested, and has good
  documentation.
:::

::: column
#### Drawbacks

- Only data frames are accepted as input, so other formats (e.g. CSV)
  have to be loaded into a data frame first.
- While `polars` is supported, the integration is [not yet
  complete](https://pandera.readthedocs.io/en/stable/index.html#supported-features).
  For example, it cannot yet extract metadata from `polars` data frames.
- We would need to write custom code to translate our table metadata
  from JSON to `pandera` schemas in Python (see the sketch below). For
  its own schemas, `pandera` provides JSON conversion out of the box.
:::
:::::
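
To illustrate the translation work this implies, here is a rough,
hypothetical sketch of mapping one Data Package field to a `pandera`
column; the helper and the type mapping are our own assumptions, not
Sprout code:

```python
# A hypothetical sketch of translating a Data Package field definition
# into a pandera column; not Sprout's actual implementation.
import polars as pl
import pandera.polars as pa

# Assumed mapping from Data Package field types to polars data types.
DTYPE_MAP = {"integer": pl.Int64, "number": pl.Float64, "string": pl.Utf8}

def field_to_column(field: dict) -> pa.Column:
    """Build a pandera Column from a Data Package field dict."""
    constraints = field.get("constraints", {})
    checks = []
    if "minimum" in constraints:
        checks.append(pa.Check.ge(constraints["minimum"]))
    if "maximum" in constraints:
        checks.append(pa.Check.le(constraints["maximum"]))
    return pa.Column(
        DTYPE_MAP[field["type"]],
        checks=checks,
        nullable=not constraints.get("required", False),
    )

# Usage with a single hypothetical field:
column = field_to_column(
    {"name": "weight_kg", "type": "number", "constraints": {"minimum": 0}}
)
```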

### Great Expectations

[Great
Expectations](https://docs.greatexpectations.io/docs/core/introduction/)
is a larger framework for testing and validating data. It also offers a
range of other functionality, including data visualisation, data
collation from remote sources, and statistical summary generation. It is
structured around expectations about the data, which are organised into
expectation suites.
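
As a rough sketch of this declaration style (the suite and column names
are hypothetical, and the setup details vary between Great Expectations
versions), expectations are declared one by one and collected into a
suite:

```python
# A rough sketch of declaring expectations with Great Expectations.
# The data source and batch setup needed to actually run the suite
# against data is omitted here.
import great_expectations as gx
import great_expectations.expectations as gxe

context = gx.get_context()
suite = context.suites.add(gx.ExpectationSuite(name="sprout_checks"))

# Each expectation encodes one column-level rule.
suite.add_expectation(
    gxe.ExpectColumnValuesToBeBetween(
        column="weight_kg", min_value=0, max_value=500
    )
)
suite.add_expectation(gxe.ExpectColumnValuesToNotBeNull(column="id"))
```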

::::: columns
::: column
#### Benefits

- Supports both data verification and validation, and can run these
  checks separately.
- Supports a wide range of data formats, although not `polars`.
- Supports large datasets.
- Can generate an expectation suite based on data.
:::

::: column
#### Drawbacks

- No support for `polars`.
- We would need to write custom code to translate our table metadata
  from JSON to expectations in Python. For its own expectation
  suites, Great Expectations provides JSON conversion out of the box.
- The API for declaring expectations matches the structure of the Data
  Package standard less closely than that of the other options.
- Significantly larger and more complex to set up than any of the
  other options.
:::
:::::

### Pydantic

[Pydantic](https://docs.pydantic.dev/latest/) is the most popular
library for matching data against a schema in Python. Its basic use case
is describing how data should be structured in a Pydantic model and
checking an object against this model to confirm that they match. Model
requirements are expressed using type hints, and the matching behaviour
is highly customisable.
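
As a brief sketch (the model and its fields are hypothetical), a
Pydantic model describing one row of a table might look like this:

```python
# A minimal sketch of row-level validation with Pydantic v2.
from pydantic import BaseModel, Field

class Row(BaseModel):
    id: int
    weight_kg: float = Field(ge=0, le=500)  # value constraints via Field

# Each row is fed to the model as a dictionary-like object.
row = Row.model_validate({"id": 1, "weight_kg": 70.5})
```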

::::: columns
::: column
#### Benefits

- Supports data validation.
:::

::: column
#### Drawbacks

- No out-of-the-box support for data verification.
- We would need to translate our JSON metadata into Pydantic models.
- Pydantic only accepts dictionary-like objects as input, so data
  files would need to be loaded into Python manually and fed to the
  Pydantic model row by row.
- The above means that support for large datasets would depend on our
  implementation.
- No support for inferring a model from existing data.
:::
:::::

## Decision outcome

::: content-hidden
State what decision was made, using the form "We decided on CHOICE
because of REASONS."
:::

We decided to use `pandera` because it is a great match for our use
case, has extensive documentation, and its behaviour is easy to tailor
to our needs. While `frictionless-py` is a direct implementation of the
Data Package standard, it is less mature and less widely used than
`pandera`. We have found some inconsistencies in its
verification/validation behaviour and feel that we would need to
customise it using somewhat brittle and inelegant workarounds for it to
fit into our workflow.

As for the remaining options, we decided not to go with Pydantic because
its use case is not verifying or validating datasets. Although Great
Expectations offers most of the functionality we need, it is a complete
framework with many parts we don't need, it is rather complex to set up,
and integrating with it would shape our codebase more than any of the
other tools would.

### Consequences

::: content-hidden
List some potential consequences of this decision.
:::

- We will have to write custom logic for transforming JSON metadata
  into `pandera` schemas.
- We will have to find a solution for extracting metadata from data,
  as `pandera` cannot currently infer schemas from `polars` data
  frames.
- If we want to add any checks or behaviours based on file-level
  properties of the data (e.g. file size, hash, or encoding), these
  will have to be implemented outside of `pandera`.