diff --git a/why-pandera/index.qmd b/why-pandera/index.qmd
new file mode 100644
index 0000000..c1512b5
--- /dev/null
+++ b/why-pandera/index.qmd
@@ -0,0 +1,242 @@
+---
+title: "Why Pandera"
+description: |
+  A key functionality of Sprout is checking whether user-supplied data match the metadata that describe them.
+  This post explains why we chose to use `pandera` for automating this process.
+date: "2024-11-08"
+categories:
+- backend
+- develop
+- check
+---
+
+::: content-hidden
+Use other decision posts as inspiration when writing these. Leave the
+content-hidden sections in the text for future reference.
+:::
+
+## Context and problem statement
+
+::: content-hidden
+State the context and some background on the issue, then phrase the
+problem as a question.
+:::
+
+A key functionality of Sprout is checking whether user-supplied data
+match the metadata that describe them. Metadata are stored in JSON files
+following the [Data Package](https://datapackage.org/) standard.
+Checking data against metadata has two components: verification and
+validation. Verification involves checking whether the overall structure
+of the data (e.g. the number of columns, column data types) is as
+expected, while validation involves checking that all individual data
+items meet constraints listed in the metadata (e.g. maximum values,
+specific formats). We are looking for a tool to automate the data
+verification and validation process. The question then is:
+
+*Which data verification and validation tools are available and which
+one should we use?*
+
+## Decision drivers
+
+::: content-hidden
+List some reasons why we need to make this decision and what things
+have arisen that impact work.
+:::
+
+- The new tool should support both data verification and validation.
+- Ideally, it should support multiple tabular data formats, including
+  [`polars`](https://pola.rs/) data frames.
+- It should be easy to transform JSON metadata into the representation
+  required by the tool.
+- The tool should be able to handle relatively large datasets
+  efficiently.
+- Support for extracting metadata from data would be a plus.
+
+## Considered options
+
+::: content-hidden
+List and describe some of the options, as well as some of the benefits
+and drawbacks of each option.
+:::
+
+- [`frictionless-py`](https://framework.frictionlessdata.io/)
+- [`pandera`](https://pandera.readthedocs.io/en/stable/index.html)
+- [Great
+  Expectations](https://docs.greatexpectations.io/docs/core/introduction/)
+- [Pydantic](https://docs.pydantic.dev/latest/)
+
+### `frictionless-py`
+
+[`frictionless-py`](https://framework.frictionlessdata.io/) is the
+official Python implementation of the Data Package standard, maintained
+by the organisation behind it, and as such it would be the obvious
+choice for our use case. As well as functionality for data verification
+and validation, it supports checking metadata against the Data Package
+standard and building pipelines for transforming data.
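+
+To illustrate, here is a minimal sketch of what checking a data file
+against a Table Schema could look like in `frictionless-py` (the file
+names are hypothetical):
+
+```python
+from frictionless import validate
+
+# Check a CSV file against a Table Schema descriptor.
+report = validate("data.csv", schema="schema.json")
+
+if not report.valid:
+    # Flatten the report into one row per error for inspection.
+    print(report.flatten(["rowNumber", "fieldNumber", "type", "note"]))
+```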
+
+::::: columns
+::: column
+#### Benefits
+
+- Supports both data verification and validation, although it is not
+  possible to run these checks separately.
+- Multiple tabular data formats are supported, including `pandas` data
+  frames.
+- Directly compatible with our JSON metadata, as it implements the
+  Data Package standard.
+- Supports large data files.
+- Supports extracting metadata from data, matching the Data Package
+  standard.
+:::
+
+::: column
+#### Drawbacks
+
+- The API suggests that it is possible to filter for specific errors,
+  but this functionality does not seem to work fully.
+- There are a number of different entry points to the
+  verification/validation flow and it is quite difficult to foresee
+  how these differ in behaviour.
+- `polars` data frames are not supported.
+- So far we've found it a bit difficult to navigate the
+  `frictionless-py` codebase and documentation.
+:::
+:::::
+
+### Pandera
+
+[`pandera`](https://pandera.readthedocs.io/en/stable/index.html) is a
+flexible data validation library operating on data frames. Its
+validation mechanism is based on the concept of a schema expressing
+expectations about the data. It also has capabilities for preprocessing
+data and generating synthetic data from `pandera` schemas.
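+
+As a rough sketch of the schema concept (assuming `pandera`'s `polars`
+integration; the column names and constraints are made up):
+
+```python
+import pandera.polars as pa
+import polars as pl
+
+# A schema combines structural expectations (column names and types)
+# with value constraints (checks).
+schema = pa.DataFrameSchema(
+    {
+        "id": pa.Column(int, pa.Check.ge(0)),
+        "name": pa.Column(str, pa.Check.str_length(min_value=1)),
+    }
+)
+
+df = pl.DataFrame({"id": [1, 2], "name": ["a", "b"]})
+
+# `lazy=True` collects all failures instead of raising on the first one.
+validated = schema.validate(df, lazy=True)
+```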
+
+::::: columns
+::: column
+#### Benefits
+
+- Supports both data verification and validation, and can run these
+  checks separately.
+- Supports `polars` data frames to a large extent (see the drawbacks
+  on the right).
+- Supports large datasets.
+- Offers schema inference, although not with `polars`.
+- `pandera` is widely used, extensively tested, and has good
+  documentation.
+:::
+
+::: column
+#### Drawbacks
+
+- Only data frames are accepted as input, so other formats (e.g. CSV)
+  have to be loaded into a data frame first.
+- While `polars` is supported, the integration is [not yet
+  complete](https://pandera.readthedocs.io/en/stable/index.html#supported-features).
+  For example, it cannot yet extract metadata from `polars` data
+  frames.
+- We would need to write custom code to translate our table metadata
+  from JSON to `pandera` schemas in Python. For its own schemas,
+  `pandera` provides JSON conversion out of the box.
+:::
+:::::
+
+### Great Expectations
+
+[Great
+Expectations](https://docs.greatexpectations.io/docs/core/introduction/)
+is a larger framework for testing and validating data. It also offers a
+range of other functionality, including data visualisation, data
+collation from remote sources, and statistical summary generation. It is
+structured around expectations about the data, which are organised into
+expectation suites.
+
+::::: columns
+::: column
+#### Benefits
+
+- Supports both data verification and validation, and can run these
+  checks separately.
+- Supports a wide range of data formats.
+- Supports large datasets.
+- Can generate an expectation suite based on data.
+:::
+
+::: column
+#### Drawbacks
+
+- No support for `polars`.
+- We would need to write custom code to translate our table metadata
+  from JSON to expectations in Python. For its own expectation
+  suites, Great Expectations provides JSON conversion out of the box.
+- The API for declaring expectations matches the structure of the Data
+  Package standard less closely than that of the other options.
+- Significantly larger and more complex to set up than any of the
+  other options.
+:::
+:::::
+
+### Pydantic
+
+[Pydantic](https://docs.pydantic.dev/latest/) is the most popular
+library for matching data against a schema in Python. Its basic use case
+is describing how data should be structured in a Pydantic model and
+checking an object against this model to confirm that it matches. Model
+requirements are expressed using type hints and the matching behaviour
+is highly customisable.
+
+::::: columns
+::: column
+#### Benefits
+
+- Supports data validation.
+:::
+
+::: column
+#### Drawbacks
+
+- No out-of-the-box support for data verification.
+- We would need to translate our JSON metadata into Pydantic models.
+- Pydantic only accepts dictionary-like objects as input, so data
+  files would need to be loaded into Python manually and fed to the
+  Pydantic model row by row.
+- The above means that support for large datasets would depend on our
+  implementation.
+- No support for extracting models from data.
+:::
+:::::
+
+## Decision outcome
+
+::: content-hidden
+State what decision was made, using the form "We decided on CHOICE
+because of REASONS."
+:::
+
+We decided to use `pandera` because it is a great match for our use
+case, has extensive documentation, and its behaviour is easy to tailor
+to our needs. While `frictionless-py` is a direct implementation of the
+Data Package standard, it is less mature and less widely used than
+`pandera`. We have found some inconsistencies in its
+verification/validation behaviour and feel that we would need to
+customise it using somewhat brittle and inelegant workarounds for it to
+fit into our workflow.
+
+As for the remaining options, we decided not to go with Pydantic because
+its use case is not verifying or validating datasets. Although Great
+Expectations offers most of the functionality we need, it is a complete
+framework with many parts we don't need, is rather complex to set up,
+and integrating with it would shape our codebase more than any of the
+other tools.
+
+### Consequences
+
+::: content-hidden
+List some potential consequences of this decision.
+:::
+
+- We will have to write custom logic for transforming JSON metadata
+  into `pandera` schemas (see the sketch after this list).
+- We will have to find a solution for extracting metadata from data,
+  as `pandera` cannot currently infer schemas from `polars` data
+  frames.
+- If we want to add any checks or behaviours based on file-level
+  properties of the data (e.g. file size, hash, encoding, etc.), these
+  will have to be implemented outside of `pandera`.
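+
+To make the first consequence concrete, here is a rough sketch of what
+that transformation could look like. The mapping is simplified and
+hypothetical: it covers only a few field types and constraints from the
+Table Schema specification, and the helper names are ours, not
+`pandera`'s.
+
+```python
+import pandera.polars as pa
+
+# Hypothetical, partial mapping from Table Schema field types to
+# Python types.
+FIELD_TYPES = {"integer": int, "number": float, "string": str,
+               "boolean": bool}
+
+
+def field_to_column(field: dict) -> pa.Column:
+    """Translate one Data Package field descriptor into a pandera column."""
+    constraints = field.get("constraints", {})
+    checks = []
+    if "maximum" in constraints:
+        checks.append(pa.Check.le(constraints["maximum"]))
+    if "pattern" in constraints:
+        checks.append(pa.Check.str_matches(constraints["pattern"]))
+    return pa.Column(
+        FIELD_TYPES[field["type"]],
+        checks=checks,
+        nullable=not constraints.get("required", False),
+    )
+
+
+def resource_to_schema(resource: dict) -> pa.DataFrameSchema:
+    """Build a pandera schema from a Data Package resource descriptor."""
+    fields = resource["schema"]["fields"]
+    return pa.DataFrameSchema({f["name"]: field_to_column(f) for f in fields})
+```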