---
title: "Why Pandera"
description: |
A key functionality of Sprout is checking whether user-supplied data match the metadata that describe them.
This post explains why we chose to use `pandera` for automating this process.
date: "2024-11-08"
categories:
- backend
- develop
- check
---

::: content-hidden
Use other decision posts as inspiration to writing these. Leave the
content-hidden sections in the text for future reference.
:::

## Context and problem statement

::: content-hidden
State the context and some background on the issue, then write a
statement in the form of a question for the problem.
:::

A key functionality of Sprout is checking whether user-supplied data
match the metadata that describe them. Metadata are stored in JSON files
following the [Data Package](https://datapackage.org/) standard.
Checking data against metadata has two components: verification and
validation. Verification involves checking whether the overall structure
of the data (e.g. number of columns, column data types) is as expected, while
validation involves checking that all individual data items meet
constraints listed in the metadata (e.g. maximum values, specific
formats). We are looking for a tool to automate the data verification
and validation process. The question then is:

*Which data verification and validation tools are available and which
one should we use?*
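
To make the verification/validation distinction concrete, below is a
small, hypothetical table schema in the Data Package format, written as
a Python dictionary (the field names and constraints are invented for
illustration): the field names and types are what verification checks,
while the constraints are what validation checks.

```python
# A hypothetical Table Schema fragment (Data Package standard) written as a
# Python dict; the structure (field names, types) is verified, while the
# constraints are validated.
table_schema = {
    "fields": [
        {"name": "id", "type": "integer"},  # structure: verification
        {
            "name": "age",
            "type": "integer",  # structure: verification
            "constraints": {"minimum": 0, "maximum": 120},  # validation
        },
    ]
}
```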

## Decision drivers

::: content-hidden
List some reasons for why we need to make this decision and what things
have arisen that impact work.
:::

- The new tool should support both data verification and validation.
- Ideally, it should support multiple tabular data formats, including
[`polars`](https://pola.rs/) data frames.
- It should be easy to transform JSON metadata into the representation
required by the tool.
- The tool should be able to handle relatively large datasets
efficiently.
- Support for extracting metadata from data would be a plus.

## Considered options

::: content-hidden
List and describe some of the options, as well as some of the benefits
and drawbacks for each option.
:::

- [`frictionless-py`](https://framework.frictionlessdata.io/)
- [`pandera`](https://pandera.readthedocs.io/en/stable/index.html)
- [Great
Expectations](https://docs.greatexpectations.io/docs/core/introduction/)
- [Pydantic](https://docs.pydantic.dev/latest/)

### `frictionless-py`

[`frictionless-py`](https://framework.frictionlessdata.io/) is the
Python implementation of the Data Package standard, maintained by the
organisation behind the standard, and as such it would be the obvious
choice for our use case. As well as functionality for data verification
and validation, it
supports checking metadata against the Data Package standard and
building pipelines for transforming data.
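
As a rough sketch of how this looks in practice (assuming a local
`data.csv` file; the exact report layout varies between
`frictionless-py` versions), validation is a single call that returns a
report in which structural and constraint errors come back together:

```python
# A minimal sketch, assuming a local "data.csv"; verification and validation
# errors are reported together.
from frictionless import validate

report = validate("data.csv")
print(report.valid)
# Report layouts differ between frictionless-py versions; flatten() gives a
# compact view of the errors that were found.
print(report.flatten(["type", "message"]))
```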

::::: columns
::: column
#### Benefits

- Supports both data verification and validation, although it is not
possible to run these checks separately.
- Multiple tabular data formats are supported, including `pandas` data
frames.
- Directly compatible with our JSON metadata, as it implements the
Data Package standard.
- Supports large data files.
- Supports extracting metadata from data, matching the Data Package
standard.
:::

::: column
#### Drawbacks

- The API suggests that it is possible to filter for specific errors,
but this functionality does not seem to work fully.
- There are a number of different entry points to the
verification/validation flow and it is quite difficult to foresee
how these differ in behaviour.
- `polars` data frames are not supported.
- So far we've found it a bit difficult to navigate the
`frictionless-py` codebase and documentation.
:::
:::::

### Pandera

[`pandera`](https://pandera.readthedocs.io/en/stable/index.html) is a
flexible data validation library operating on data frames. Its
validation mechanism is based on the concept of a schema expressing
expectations about the data. It also has capabilities for preprocessing
data and generating synthetic data from `pandera` schemas.
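
As an illustration (assuming pandera's `polars` integration module; the
column names and the age constraint are hypothetical, not taken from
Sprout), a schema is declared up front and then applied to a data
frame:

```python
# A minimal sketch using pandera's polars integration; the column names and
# the age constraint are hypothetical.
import pandera.polars as pa
import polars as pl

schema = pa.DataFrameSchema(
    {
        "id": pa.Column(int),                            # verification: type
        "age": pa.Column(int, checks=pa.Check.le(120)),  # validation: constraint
    }
)

df = pl.DataFrame({"id": [1, 2], "age": [34, 56]})
schema.validate(df)  # raises a SchemaError if the data do not match
```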

::::: columns
::: column
#### Benefits

- Supports both data verification and validation, and can run these
checks separately.
- Supports `polars` data frames to a large extent (see the drawbacks on the right).
- Supports large datasets.
- Offers schema inference, although not with `polars`.
- `pandera` is widely used, extensively tested, and has good
documentation.
:::

::: column
#### Drawbacks

- Only data frames are accepted as input, so other formats (e.g. CSV)
have to be loaded into a data frame first.
- While `polars` is supported, the integration is [not yet
complete](https://pandera.readthedocs.io/en/stable/index.html#supported-features).
E.g., it cannot yet extract metadata from `polars` data frames.
- We would need to write custom code to translate our table metadata
from JSON to `pandera` schemas in Python (a rough sketch follows after
this comparison). For its own schemas, `pandera` provides JSON
conversion out of the box.
:::
:::::
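
As mentioned in the drawbacks above, we would need custom translation
code. Below is a rough sketch of what that could look like, assuming a
simplified subset of Data Package field descriptors (only a few types
and a `maximum` constraint are handled); it illustrates the shape of
the custom code rather than a complete mapping:

```python
# A rough sketch of translating (simplified) Data Package field descriptors
# into a pandera schema; only a few types and the "maximum" constraint are
# handled, as an illustration of the kind of custom code we would write.
import pandera.polars as pa

TYPE_MAP = {"integer": int, "number": float, "string": str}

def fields_to_pandera_schema(fields: list[dict]) -> pa.DataFrameSchema:
    columns = {}
    for field in fields:
        constraints = field.get("constraints", {})
        checks = []
        if "maximum" in constraints:
            checks.append(pa.Check.le(constraints["maximum"]))
        columns[field["name"]] = pa.Column(
            TYPE_MAP[field["type"]],
            checks=checks,
            nullable=not constraints.get("required", False),
        )
    return pa.DataFrameSchema(columns)

# Example metadata, following the Table Schema field format:
fields = [
    {"name": "id", "type": "integer", "constraints": {"required": True}},
    {"name": "age", "type": "integer", "constraints": {"maximum": 120}},
]
schema = fields_to_pandera_schema(fields)
```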

### Great Expectations

[Great
Expectations](https://docs.greatexpectations.io/docs/core/introduction/)
is a larger framework for testing and validating data. It also offers a
range of other functionality, which includes data visualisation, data
collation from remote sources, and statistical summary generation. It is
structured around expectations about the data, which are organised into
expectation suites.
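
As a rough illustration only: the snippet below uses the older
dataset-style API (`great_expectations.from_pandas`), since the exact
workflow differs considerably between Great Expectations versions, and
the column name and bounds are hypothetical.

```python
# A rough sketch using the legacy dataset-style API; newer Great Expectations
# releases organise this around data contexts and expectation suites instead.
import great_expectations as gx
import pandas as pd

df = gx.from_pandas(pd.DataFrame({"age": [34, 56, 150]}))
result = df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
print(result.success)  # False: 150 is out of range
```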

::::: columns
::: column
#### Benefits

- Supports both data verification and validation, and can run these
checks separately.
- Supports a wide range of data formats, although not `polars`.
- Supports large datasets.
- Can generate an expectation suite based on data.
:::

::: column
#### Drawbacks

- No support for `polars`.
- We would need to write custom code to translate our table metadata
from JSON to expectations in Python. For its own expectations
suites, Great Expectations provides JSON conversion out of the box.
- The API for declaring expectations matches the structure of the Data
Package standard less closely than that of the other options.
- Significantly larger and more complex to set up than any of the
other options.
:::
:::::

### Pydantic

[Pydantic](https://docs.pydantic.dev/latest/) is the most popular
library for matching data against a schema in Python. Its basic use case
is describing how data should be structured in a Pydantic model and
checking an object against this model to confirm that the object
conforms to it. Model requirements are expressed using type hints and
the matching behaviour
is highly customisable.
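
For illustration (the row model and its field names are hypothetical),
checking tabular data with Pydantic means defining a model for a single
row and validating records one at a time:

```python
# A minimal sketch with a hypothetical row model; Pydantic checks one
# dictionary-like record at a time.
from pydantic import BaseModel, Field

class Row(BaseModel):
    id: int
    age: int = Field(ge=0, le=120)

rows = [{"id": 1, "age": 34}, {"id": 2, "age": 150}]
for record in rows:
    Row.model_validate(record)  # raises ValidationError on the second row
```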

::::: columns
::: column
#### Benefits

- Supports data validation.
:::

::: column
#### Drawbacks

- No out-of-the-box support for data verification.
- We would need to translate our JSON metadata into Pydantic models.
- Pydantic only accepts dictionary-like objects as input, so data
files would need to be loaded into Python manually and fed to the
Pydantic model row by row.
- The above means that support for large datasets would depend on our
implementation.
- No support for inferring a model from existing data.
:::
:::::

## Decision outcome

::: content-hidden
What decision was made, use the form "We decided on CHOICE because of
REASONS."
:::

We decided to use `pandera` because it is a great match for our use
case, has extensive documentation, and its behaviour is easy to tailor
to our needs. While `frictionless-py` is a direct implementation of the
Data Package standard, it is less mature and less widely used than
`pandera`. We have found some inconsistencies in its
verification/validation behaviour and feel that we would need to
customise it using somewhat brittle and inelegant workarounds for it to
fit into our workflow.

As for the remaining options, we decided not to go with Pydantic because
its use case is not verifying or validating datasets. Although Great
Expectations offers most of the functionality we need, it is a complete
framework with many parts we don't need, is rather complex to set up,
and integrating with it would shape our codebase more than any of the
other tools.

### Consequences

::: content-hidden
List some potential consequences of this decision.
:::

- We will have to write custom logic for transforming JSON metadata
into `pandera` schemas.
- We will have to find a solution for extracting metadata from data,
as `pandera` cannot currently infer schemas from `polars` data
frames.
- If we want to add any checks or behaviours based on file-level
properties of the data (e.g. file size, hash, encoding, etc.), these
will have to be implemented outside of `pandera` (a rough sketch
follows below).
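
A minimal sketch of what such file-level checks could look like outside
`pandera`, using only the standard library (the size limit is an
invented example):

```python
# A minimal sketch of file-level checks implemented outside pandera; the
# 100 MB size limit is a hypothetical example.
import hashlib
from pathlib import Path

def file_properties(path: Path) -> dict:
    data = path.read_bytes()
    return {"bytes": len(data), "sha256": hashlib.sha256(data).hexdigest()}

properties = file_properties(Path("data.csv"))
assert properties["bytes"] < 100_000_000
```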