core and backend pandera API internals rewrite #913

Merged · 64 commits · Jan 24, 2023
Conversation

@cosmicBboy (Collaborator) commented Aug 12, 2022

fixes #381

Fundamentally, pandera is about defining types for statistical data containers (e.g. pandas DataFrames, xarray Datasets, SQL tables) that serve to:

  1. self-document the properties of data in code
  2. validate those properties at run-time
  3. provide some basic type-linting capabilities (currently still somewhat limited)

What

This PR introduces two new subpackages to pandera:

  • core: this defines schema specifications for particular families of data containers, e.g. "pandas-like dataframes". This module is responsible for defining the properties held by these data containers.
  • backends: this defines the underlying implementation of the validation logic given a particular schema specification. This module is responsible for actually verifying those properties for a specific type of data container (e.g. for pandas DataFrames, modin, dask, pyspark.pandas DataFrames, etc.)
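The core/backends split described above can be sketched as follows. This is a minimal illustration of the idea, not pandera's actual internals: the class names, the toy dict-of-lists container, and the `DictBackend` are all invented for this example.

```python
# Hypothetical sketch of the core/backends split. "Core" declares WHAT
# properties data must have; a "backend" implements HOW they are checked
# for one concrete container type. Names are illustrative only.
from abc import ABC, abstractmethod
from typing import Any


class BaseSchema:
    """Core: a specification of per-field types for some data container."""

    def __init__(self, columns: dict[str, type]):
        self.columns = columns


class BaseSchemaBackend(ABC):
    """Backend: verifies a schema against one family of containers."""

    @abstractmethod
    def validate(self, check_obj: Any, schema: BaseSchema) -> Any:
        ...


class DictBackend(BaseSchemaBackend):
    """Toy backend that validates plain dicts of lists."""

    def validate(self, check_obj, schema):
        for col, dtype in schema.columns.items():
            if col not in check_obj:
                raise ValueError(f"missing column: {col}")
            if not all(isinstance(v, dtype) for v in check_obj[col]):
                raise TypeError(f"column {col!r} is not {dtype.__name__}")
        return check_obj
```

A pandas backend, a dask backend, and so on would each subclass the same abstract base, so the schema object itself stays container-agnostic.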

Why?

The purpose of this PR is to:

  • decouple the schema specification from the logic that actually runs the validation rules.
  • provide base classes upon which the community can build schema specifications and backends for potentially any data structure, including xarray.Datasets, numpy arrays, tensor objects, etc., all with a focus on:
    • coercion/validation of data types per field
    • validation of arbitrary properties, in particular statistical properties across records or the container's equivalent of records.
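The two focus points above can be illustrated with plain functions: one coerces a field to a target dtype, the other validates a statistical property computed across all records of a field. The function names are hypothetical, not pandera's API.

```python
# Illustrative sketch of the two validation concerns listed above.
# These helpers are invented for this example, not pandera functions.
import statistics


def coerce_field(values: list, dtype: type) -> list:
    """Per-field dtype coercion: cast every value to the target dtype."""
    return [dtype(v) for v in values]


def check_mean_in_range(values: list, lo: float, hi: float) -> bool:
    """A statistical property checked across all records of a field,
    rather than record by record."""
    return lo <= statistics.fmean(values) <= hi
```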

This change will not affect the user-facing API of pandera and will not introduce any breaking changes.

Design Implications

  1. For each core schema specification, there may be multiple backends that can apply to it. For example, I can define a DataFrameSchema and, depending on the type of dataframe that I supply to schema.validate, pandera will delegate to a particular backend.
  2. Instead of trying to design "one schema specification to rule them all", pandera will try to strike a balance between keeping the API surface as small as possible while embracing the richness and diversity of dataframe-like objects that now exist.
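Design implication 1 can be sketched as a type-keyed backend registry: one schema, several backends, with the backend selected at validation time by the runtime type of the object passed to `validate`. The registry and registration function here are illustrative, not pandera's actual dispatch machinery.

```python
# Hedged sketch of backend dispatch: schema.validate delegates to whichever
# backend is registered for the container's type. Names are invented.
BACKEND_REGISTRY = {}


def register_backend(container_type, backend):
    """Associate a validation backend with a concrete container type."""
    BACKEND_REGISTRY[container_type] = backend


class DataFrameSchema:
    def validate(self, check_obj):
        # Delegate to the first backend whose registered type matches.
        for container_type, backend in BACKEND_REGISTRY.items():
            if isinstance(check_obj, container_type):
                return backend(check_obj, self)
        raise TypeError(
            f"no backend registered for {type(check_obj).__name__}"
        )
```

In this picture, registering a pandas backend, a modin backend, and a dask backend against the same `DataFrameSchema` is what lets one specification serve many dataframe flavors.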

Phases

This PR will be the first phase in a multi-phase approach to improving extensibility:

  1. [this PR] introduce decoupled architecture, with no fundamental changes to pandera's functionality and implementation
  2. introduce multiple backends for the dataframe object: clean up the dataframe validation code by having separate backends for modin, dask, pyspark.pandas, geopandas (the motivation here is to ensure the backend abstraction makes sense).
  3. introduce a schema specification and backend for SQL tables: borrow the specification from dataframe schemas to introduce SQL-native validation using SQLAlchemy. (the motivation here is to ensure the core + backend abstractions make sense from an extensibility standpoint)
  4. introduce a pandera-contrib ecosystem: this exists to host other pandera-compliant projects (e.g. https://github.com/carbonplan/xarray-schema) so that the broader community can use pandera's core and backend abstractions to build their own schema types.
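To make phase 3 concrete, a SQL backend might compile per-column checks into predicates so that validation runs natively in the database rather than in Python. The function below is a hypothetical illustration of that idea; the check format and names are invented here, and a real implementation would build expressions with SQLAlchemy rather than strings.

```python
# Hypothetical phase-3 sketch: compile per-column check predicates into a
# query counting violating rows. Format and names are invented for this
# example; a real backend would use SQLAlchemy expression constructs.
def checks_to_sql(table: str, checks: dict[str, str]) -> str:
    """Build a query counting rows that violate any column's check."""
    violations = " OR ".join(
        f"NOT ({col} {predicate})" for col, predicate in checks.items()
    )
    return f"SELECT COUNT(*) FROM {table} WHERE {violations}"
```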

@cosmicBboy cosmicBboy marked this pull request as ready for review January 23, 2023 23:58
@cosmicBboy cosmicBboy changed the title [wip] core and backend pandera API core and backend pandera API internals rewrite Jan 23, 2023
@cosmicBboy cosmicBboy changed the base branch from dev to main January 23, 2023 23:59
Development

Successfully merging this pull request may close these issues.

Abstract out validation logic to support non-pandas dataframes, e.g. spark, dask, etc