core and backend pandera API internals rewrite #913

Merged · 64 commits · Jan 24, 2023
Conversation

@cosmicBboy (Collaborator) commented Aug 12, 2022

fixes #381

Fundamentally, pandera is about defining types for statistical data containers (e.g. pandas DataFrames, xarray Datasets, SQL tables) that serve to:

  1. self-document the properties of data in code
  2. validate those properties at run-time
  3. provide some basic type-linting capabilities (currently still somewhat limited)

What

This PR introduces two new subpackages to pandera:

  • core: this defines schema specifications for particular families of data containers, e.g. "pandas-like dataframes". This module is responsible for defining the properties held by these data containers.
  • backends: this defines the underlying implementation of the validation logic given a particular schema specification. This module is responsible for actually verifying those properties for a specific type of data container (e.g. for pandas DataFrames, modin, dask, pyspark.pandas DataFrames, etc.)
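The core/backends split described above can be sketched as follows. This is a minimal illustration of the idea, not pandera's actual internals: the class names, the toy dict-of-lists container, and the `DictBackend` are all invented for this example.

```python
# Hypothetical sketch of the core/backends split. "Core" declares WHAT
# properties data must have; a "backend" implements HOW they are checked
# for one concrete container type. Names are illustrative only.
from abc import ABC, abstractmethod
from typing import Any


class BaseSchema:
    """Core: a specification of per-field types for some data container."""

    def __init__(self, columns: dict[str, type]):
        self.columns = columns


class BaseSchemaBackend(ABC):
    """Backend: verifies a schema against one family of containers."""

    @abstractmethod
    def validate(self, check_obj: Any, schema: BaseSchema) -> Any:
        ...


class DictBackend(BaseSchemaBackend):
    """Toy backend that validates plain dicts of lists."""

    def validate(self, check_obj, schema):
        for col, dtype in schema.columns.items():
            if col not in check_obj:
                raise ValueError(f"missing column: {col}")
            if not all(isinstance(v, dtype) for v in check_obj[col]):
                raise TypeError(f"column {col!r} is not {dtype.__name__}")
        return check_obj
```

A pandas backend, a dask backend, and so on would each subclass the same abstract base, so the schema object itself stays container-agnostic.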

Why?

The purpose of this PR is to:

  • decouple the schema specification from the logic that actually runs the validation rules.
  • provide base classes upon which the community can build schema specifications and backends for potentially any data structure, including xarray.Datasets, numpy arrays, tensor objects, etc., all with a focus on:
    • coercion/validation of data types per field
    • validation of arbitrary properties, in particular statistical properties across records or the container's equivalent of records.
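The two focus points above can be illustrated with plain functions: one coerces a field to a target dtype, the other validates a statistical property computed across all records of a field. The function names are hypothetical, not pandera's API.

```python
# Illustrative sketch of the two validation concerns listed above.
# These helpers are invented for this example, not pandera functions.
import statistics


def coerce_field(values: list, dtype: type) -> list:
    """Per-field dtype coercion: cast every value to the target dtype."""
    return [dtype(v) for v in values]


def check_mean_in_range(values: list, lo: float, hi: float) -> bool:
    """A statistical property checked across all records of a field,
    rather than record by record."""
    return lo <= statistics.fmean(values) <= hi
```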

This change will not affect the user-facing API of pandera and will not introduce any breaking changes.

Design Implications

  1. For each core schema specification, there may be multiple backends that can apply to it. For example, I can define a DataFrameSchema and, depending on the type of dataframe that I supply to schema.validate, pandera will delegate to a particular backend.
  2. Instead of trying to design "one schema specification to rule them all", pandera will try to strike a balance between keeping the API surface as small as possible while embracing the richness and diversity of dataframe-like objects that now exist.
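Design implication 1 can be sketched as a type-keyed backend registry: one schema, several backends, with the backend selected at validation time by the runtime type of the object passed to `validate`. The registry and registration function here are illustrative, not pandera's actual dispatch machinery.

```python
# Hedged sketch of backend dispatch: schema.validate delegates to whichever
# backend is registered for the container's type. Names are invented.
BACKEND_REGISTRY = {}


def register_backend(container_type, backend):
    """Associate a validation backend with a concrete container type."""
    BACKEND_REGISTRY[container_type] = backend


class DataFrameSchema:
    def validate(self, check_obj):
        # Delegate to the first backend whose registered type matches.
        for container_type, backend in BACKEND_REGISTRY.items():
            if isinstance(check_obj, container_type):
                return backend(check_obj, self)
        raise TypeError(
            f"no backend registered for {type(check_obj).__name__}"
        )
```

In this picture, registering a pandas backend, a modin backend, and a dask backend against the same `DataFrameSchema` is what lets one specification serve many dataframe flavors.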

Phases

This PR will be the first phase in a multi-phase approach to improving extensibility:

  1. [this PR] introduce decoupled architecture, with no fundamental changes to pandera's functionality and implementation
  2. introduce multiple backends for the dataframe object: clean up the dataframe validation code by having separate backends for modin, dask, pyspark.pandas, geopandas (the motivation here is to ensure the backend abstraction makes sense).
  3. introduce a schema specification and backend for SQL tables: borrow the specification from dataframe schemas to introduce SQL-native validation using SQLAlchemy. (the motivation here is to ensure the core + backend abstractions make sense from an extensibility standpoint)
  4. introduce a pandera-contrib ecosystem: this exists to host other pandera-compliant projects (e.g. https://github.com/carbonplan/xarray-schema) so that the broader community can use pandera's core and backend abstractions to build their own schema types.
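To make phase 3 concrete, a SQL backend might compile per-column checks into predicates so that validation runs natively in the database rather than in Python. The function below is a hypothetical illustration of that idea; the check format and names are invented here, and a real implementation would build expressions with SQLAlchemy rather than strings.

```python
# Hypothetical phase-3 sketch: compile per-column check predicates into a
# query counting violating rows. Format and names are invented for this
# example; a real backend would use SQLAlchemy expression constructs.
def checks_to_sql(table: str, checks: dict[str, str]) -> str:
    """Build a query counting rows that violate any column's check."""
    violations = " OR ".join(
        f"NOT ({col} {predicate})" for col, predicate in checks.items()
    )
    return f"SELECT COUNT(*) FROM {table} WHERE {violations}"
```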

@cosmicBboy cosmicBboy marked this pull request as ready for review January 23, 2023 23:58
@cosmicBboy cosmicBboy changed the title [wip] core and backend pandera API core and backend pandera API internals rewrite Jan 23, 2023
@cosmicBboy cosmicBboy changed the base branch from dev to main January 23, 2023 23:59
Development

Successfully merging this pull request may close these issues.

Abstract out validation logic to support non-pandas dataframes, e.g. spark, dask, etc