From fd76b980071848eac4c18f61845ae28eaee97bca Mon Sep 17 00:00:00 2001 From: Marton Vago Date: Fri, 8 Nov 2024 15:04:06 +0100 Subject: [PATCH 1/4] docs: :memo: add Why Pandera post --- why-pandera/index.qmd | 239 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 239 insertions(+) create mode 100644 why-pandera/index.qmd diff --git a/why-pandera/index.qmd b/why-pandera/index.qmd new file mode 100644 index 0000000..eb15624 --- /dev/null +++ b/why-pandera/index.qmd @@ -0,0 +1,239 @@ +--- +title: "Why Pandera" +description: | + A key functionality of Sprout is checking whether user-supplied data match the metadata that describe them. + This post explains why we chose to use Pandera for automating this process. +date: "2024-11-08" +categories: +- backend +- develop +--- + +::: content-hidden +Use other decision posts as inspiration to writing these. Leave the +content-hidden sections in the text for future reference. +::: + +## Context and problem statement + +::: content-hidden +State the context and some background on the issue, then write a +statement in the form of a question for the problem. +::: + +A key functionality of Sprout is checking whether user-supplied data +match the metadata that describe them. Metadata are stored in JSON files +following the [Data Package](https://datapackage.org/) standard. +Checking data against metadata has two components: verification involves +checking whether the overall structure of the data (e.g. column number, +column data type) is as expected, while validation involves checking +that all individual data items meet constraints listed in the metadata +(e.g. maximum values, specific formats). We are looking for a tool to +automate the data verification and validation process. The question then +is: + +*Which data verification and validation tools are available and which +one should we use?* + +## Decision drivers + +::: content-hidden +List some reasons for why we need to make this decision and what things +have arisen that impact work. +::: + +- The new tool should support both data verification and validation. +- Ideally, it should support multiple tabular data formats, including + [`polars`](https://pola.rs/) data frames. +- It should be easy to transform JSON metadata into the representation + required by the tool. +- The tool should be able to handle relatively large datasets + efficiently. +- Support for extracting metadata from data would be a plus. + +## Considered options + +::: content-hidden +List and describe some of the options, as well as some of the benefits +and drawbacks for each option. +::: + +- [`frictionless-py`](https://framework.frictionlessdata.io/) +- [Pandera](https://pandera.readthedocs.io/en/stable/index.html) +- [Great + Expectations](https://docs.greatexpectations.io/docs/core/introduction/) +- [Pydantic](https://docs.pydantic.dev/latest/) + +### `frictionless-py` + +[`frictionless-py`](https://framework.frictionlessdata.io/) is the +Python implementation of the Data Package standard by its parent +organisation, and as such it would be the obvious choice for our use +case. As well as functionality for data verification and validation, it +supports checking metadata against the Data Package standard and +building pipelines for transforming data. + +::::: columns +::: column +#### Benefits + +- Supports both data verification and validation, although it is not + possible to run these checks separately. +- Multiple tabular data formats are supported, including `pandas` data + frames. 
+- Directly compatible with our JSON metadata, as it implements the + Data Package standard. +- Supports large data files. +- Supports extracting metadata from data, matching the Data Package + standard. +::: + +::: column +#### Drawbacks + +- The API suggests that it is possible to filter for specific errors, + but this functionality does not seem to work fully. +- There are a number of different entry points to the + verification/validation flow and it is quite difficult to foresee + how these differ in behaviour. +- `polars` data frames are not supported. +- So far we've found it a bit difficult to navigate the + `frictionless-py` codebase. +::: +::::: + +### Pandera + +[Pandera](https://pandera.readthedocs.io/en/stable/index.html) is a +flexible data validation library operating on data frames. Its +validation mechanism is based on the concept of a schema expressing +expectations about the data. It also has capabilities for preprocessing +data and generating synthetic data from Pandera schemas. + +::::: columns +::: column +#### Benefits + +- Supports both data verification and validation, and can run these + checks separately. +- Supports `polars` data frames to a large extent (see on the right). +- Supports large datasets. +- Offers schema inference, although not with `polars`. +- Pandera is widely used, extensively tested, and has good + documentation. +::: + +::: column +#### Drawbacks + +- Only data frames are accepted as input, so other formats (e.g. CSV) + have to be loaded into a data frame first. +- While `polars` is supported, the integration is [not yet + complete](https://pandera.readthedocs.io/en/stable/index.html#supported-features). + E.g., it cannot yet extract metadata from `polars` data frames. +- We would need to write custom code to translate our table metadata + from JSON to Pandera schemas in Python. For its own schemas, Pandera + provides JSON conversion out of the box. +::: +::::: + +### Great Expectations + +[Great +Expectations](https://docs.greatexpectations.io/docs/core/introduction/) +is a larger framework for testing and validating data. It also offers a +range of other functionality, which includes data visualisation, data +collation from remote sources, and statistical summary generation. It is +structured around expectations about the data, which are organised into +expectation suites. + +::::: columns +::: column +#### Benefits + +- Supports both data verification and validation, and can run these + checks separately. +- Supports a wide range of data formats, although not `polars`. +- Supports large datasets.\ +- Can generate an expectation suite based on data. +::: + +::: column +#### Drawbacks + +- No support for `polars`. +- We would need to write custom code to translate our table metadata + from JSON to expectations in Python. For its own expectations + suites, Great Expectations provides JSON conversion out of the box. +- The API for declaring expectations matches the structure of the Data + Package standard less closely than that of the other options. +- Significantly larger and more complex to set up than any of the + other options. +::: +::::: + +### Pydantic + +[Pydantic](https://docs.pydantic.dev/latest/) is the most popular +library for matching data against a schema in Python. Its basic use case +is describing how data should be structured in a Pydantic model and +checking an object against this model to confirm that they match. Model +requirements are expressed using type hints and the matching behaviour +is highly customisable. 
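To make this concrete, here is a minimal sketch of the model-based approach — with made-up field names and constraints, not code from Sprout or from our metadata:

```python
from pydantic import BaseModel, Field, ValidationError

# Hypothetical model describing one row of a participant table.
# The field names and constraints are invented purely for illustration.
class ParticipantRow(BaseModel):
    id: int
    age: int = Field(ge=0, le=120)  # roughly analogous to Data Package "minimum"/"maximum" constraints
    email: str | None = None

try:
    # Each row has to be passed to the model individually.
    ParticipantRow(id=1, age=300)
except ValidationError as error:
    print(error)  # reports that "age" is above the allowed maximum
```

Because a model checks one object at a time, a data file would have to be read and fed to such a model row by row, which is one of the drawbacks noted below.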
+ +::::: columns +::: column +#### Benefits + +- Supports data validation. +::: + +::: column +#### Drawbacks + +- No out-of-the-box support for data verification. +- We would need to translate our JSON metadata into Pydantic models. +- Pydantic only accepts dictionary-like objects as input, so data + files would need to be loaded into Python manually and fed to the + Pydantic model row by row. +- The above means that support for large datasets would depend on our + implementation. +- No support for model extraction. +::: +::::: + +## Decision outcome + +::: content-hidden +What decision was made, use the form "We decided on CHOICE because of +REASONS." +::: + +We decided to use Pandera because it is a great match for our use case, +has extensive documentation, and its behaviour is easy to tailor to our +needs. While `frictionless-py` is a direct implementation of the Data +Package standard, it is less mature and less widely used than Pandera. +We have found some inconsistencies in its verification/validation +behaviour and feel that we would need to customise it using somewhat +brittle and inelegant workarounds for it to fit into our workflow. + +As for the remaining options, we rejected Pydantic because its use case +is not verifying or validating datasets. Although Great Expectations +offers most of the functionality we need, it is a complete framework +with many parts we don't need, is rather complex to set up, and +integrating with it would shape our codebase more than any of the other +tools. + +### Consequences + +::: content-hidden +List some potential consequences of this decision. +::: + +- We will have to write custom logic for transforming JSON metadata + into Pandera schemas. +- We will have to find a solution for extracting metadata from data, + as Pandera cannot currently infer schemas from `polars` data frames. +- If we want to add any checks or behaviours based on file-level + properties of the data (e.g. file size, hash, encoding, etc.), these + will have to be implemented outside of Pandera. From 2a7d29de020c4ff30c0f5b7f9536b2ae20f97cea Mon Sep 17 00:00:00 2001 From: martonvago <57952344+martonvago@users.noreply.github.com> Date: Mon, 11 Nov 2024 15:13:53 +0100 Subject: [PATCH 2/4] apply suggestions from code review Co-authored-by: Luke W. Johnston --- why-pandera/index.qmd | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/why-pandera/index.qmd b/why-pandera/index.qmd index eb15624..1947e18 100644 --- a/why-pandera/index.qmd +++ b/why-pandera/index.qmd @@ -7,6 +7,7 @@ date: "2024-11-08" categories: - backend - develop +- check --- ::: content-hidden @@ -98,7 +99,7 @@ building pipelines for transforming data. how these differ in behaviour. - `polars` data frames are not supported. - So far we've found it a bit difficult to navigate the - `frictionless-py` codebase. + `frictionless-py` codebase and documentation. ::: ::::: @@ -154,7 +155,7 @@ expectation suites. - Supports both data verification and validation, and can run these checks separately. - Supports a wide range of data formats, although not `polars`. -- Supports large datasets.\ +- Supports large datasets. - Can generate an expectation suite based on data. 
::: From e8ef0b9d82ac4f26b02d4fc640bbfa621d2dc9cb Mon Sep 17 00:00:00 2001 From: martonvago <57952344+martonvago@users.noreply.github.com> Date: Tue, 12 Nov 2024 15:53:57 +0100 Subject: [PATCH 3/4] apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Signe Kirk Brødbæk <40836345+signekb@users.noreply.github.com> --- why-pandera/index.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/why-pandera/index.qmd b/why-pandera/index.qmd index 1947e18..474772c 100644 --- a/why-pandera/index.qmd +++ b/why-pandera/index.qmd @@ -218,7 +218,7 @@ We have found some inconsistencies in its verification/validation behaviour and feel that we would need to customise it using somewhat brittle and inelegant workarounds for it to fit into our workflow. -As for the remaining options, we rejected Pydantic because its use case +As for the remaining options, we decided not to go with Pydantic because its use case is not verifying or validating datasets. Although Great Expectations offers most of the functionality we need, it is a complete framework with many parts we don't need, is rather complex to set up, and From 8d5516288f61a05f2b49805d40bc6958116b0f50 Mon Sep 17 00:00:00 2001 From: Marton Vago Date: Tue, 12 Nov 2024 16:00:15 +0100 Subject: [PATCH 4/4] docs: :memo: review markups --- why-pandera/index.qmd | 64 ++++++++++++++++++++++--------------------- 1 file changed, 33 insertions(+), 31 deletions(-) diff --git a/why-pandera/index.qmd b/why-pandera/index.qmd index 474772c..c1512b5 100644 --- a/why-pandera/index.qmd +++ b/why-pandera/index.qmd @@ -2,7 +2,7 @@ title: "Why Pandera" description: | A key functionality of Sprout is checking whether user-supplied data match the metadata that describe them. - This post explains why we chose to use Pandera for automating this process. + This post explains why we chose to use `pandera` for automating this process. date: "2024-11-08" categories: - backend @@ -25,13 +25,13 @@ statement in the form of a question for the problem. A key functionality of Sprout is checking whether user-supplied data match the metadata that describe them. Metadata are stored in JSON files following the [Data Package](https://datapackage.org/) standard. -Checking data against metadata has two components: verification involves -checking whether the overall structure of the data (e.g. column number, -column data type) is as expected, while validation involves checking -that all individual data items meet constraints listed in the metadata -(e.g. maximum values, specific formats). We are looking for a tool to -automate the data verification and validation process. The question then -is: +Checking data against metadata has two components: verification and +validation. Verification involves checking whether the overall structure +of the data (e.g. column number, column data type) is as expected, while +validation involves checking that all individual data items meet +constraints listed in the metadata (e.g. maximum values, specific +formats). We are looking for a tool to automate the data verification +and validation process. The question then is: *Which data verification and validation tools are available and which one should we use?* @@ -60,7 +60,7 @@ and drawbacks for each option. 
::: - [`frictionless-py`](https://framework.frictionlessdata.io/) -- [Pandera](https://pandera.readthedocs.io/en/stable/index.html) +- [`pandera`](https://pandera.readthedocs.io/en/stable/index.html) - [Great Expectations](https://docs.greatexpectations.io/docs/core/introduction/) - [Pydantic](https://docs.pydantic.dev/latest/) @@ -105,11 +105,11 @@ building pipelines for transforming data. ### Pandera -[Pandera](https://pandera.readthedocs.io/en/stable/index.html) is a +[`pandera`](https://pandera.readthedocs.io/en/stable/index.html) is a flexible data validation library operating on data frames. Its validation mechanism is based on the concept of a schema expressing expectations about the data. It also has capabilities for preprocessing -data and generating synthetic data from Pandera schemas. +data and generating synthetic data from `pandera` schemas. ::::: columns ::: column @@ -120,7 +120,7 @@ data and generating synthetic data from Pandera schemas. - Supports `polars` data frames to a large extent (see on the right). - Supports large datasets. - Offers schema inference, although not with `polars`. -- Pandera is widely used, extensively tested, and has good +- `pandera` is widely used, extensively tested, and has good documentation. ::: @@ -133,8 +133,8 @@ data and generating synthetic data from Pandera schemas. complete](https://pandera.readthedocs.io/en/stable/index.html#supported-features). E.g., it cannot yet extract metadata from `polars` data frames. - We would need to write custom code to translate our table metadata - from JSON to Pandera schemas in Python. For its own schemas, Pandera - provides JSON conversion out of the box. + from JSON to `pandera` schemas in Python. For its own schemas, + `pandera` provides JSON conversion out of the box. ::: ::::: @@ -210,20 +210,21 @@ What decision was made, use the form "We decided on CHOICE because of REASONS." ::: -We decided to use Pandera because it is a great match for our use case, -has extensive documentation, and its behaviour is easy to tailor to our -needs. While `frictionless-py` is a direct implementation of the Data -Package standard, it is less mature and less widely used than Pandera. -We have found some inconsistencies in its verification/validation -behaviour and feel that we would need to customise it using somewhat -brittle and inelegant workarounds for it to fit into our workflow. - -As for the remaining options, we decided not to go with Pydantic because its use case -is not verifying or validating datasets. Although Great Expectations -offers most of the functionality we need, it is a complete framework -with many parts we don't need, is rather complex to set up, and -integrating with it would shape our codebase more than any of the other -tools. +We decided to use `pandera` because it is a great match for our use +case, has extensive documentation, and its behaviour is easy to tailor +to our needs. While `frictionless-py` is a direct implementation of the +Data Package standard, it is less mature and less widely used than +`pandera`. We have found some inconsistencies in its +verification/validation behaviour and feel that we would need to +customise it using somewhat brittle and inelegant workarounds for it to +fit into our workflow. + +As for the remaining options, we decided not to go with Pydantic because +its use case is not verifying or validating datasets. 
Although Great
Expectations offers most of the functionality we need, it is a complete
framework with many parts we don't need, is rather complex to set up,
and integrating with it would shape our codebase more than any of the
other tools.

### Consequences

::: content-hidden
List some potential consequences of this decision.
:::

- We will have to write custom logic for transforming JSON metadata
  into `pandera` schemas (see the sketch after this list).
- We will have to find a solution for extracting metadata from data,
  as `pandera` cannot currently infer schemas from `polars` data
  frames.
- If we want to add any checks or behaviours based on file-level
  properties of the data (e.g. file size, hash, encoding, etc.), these
  will have to be implemented outside of `pandera`.
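To make the first consequence more concrete, below is a rough sketch of how a single field descriptor from our JSON metadata might be turned into a `pandera` column and used to check a `polars` data frame. The descriptor, the type mapping, and all names are illustrative assumptions rather than our eventual implementation:

```python
import polars as pl
import pandera.polars as pa
from pandera.errors import SchemaError

# A made-up Data Package field descriptor, as it might appear in our JSON metadata.
field = {"name": "age", "type": "integer", "constraints": {"minimum": 0, "maximum": 120}}

# Illustrative translation: the Data Package "integer" type becomes an integer
# column and the "minimum"/"maximum" constraints become an in_range check.
constraints = field["constraints"]
schema = pa.DataFrameSchema(
    {
        field["name"]: pa.Column(
            int,
            pa.Check.in_range(constraints["minimum"], constraints["maximum"]),
        )
    }
)

data = pl.DataFrame({"age": [34, 56, 130]})

try:
    # A single call covers both structure (column presence, data type)
    # and value-level constraints.
    schema.validate(data)
except SchemaError as error:
    print(error)  # 130 fails the in_range check
```

A schema built this way covers both the verification-style checks (column names and data types) and the validation-style checks (constraints) described above.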