
Abstract out validation logic to support non-pandas dataframes, e.g. spark, dask, etc #381

Closed
cosmicBboy opened this issue Jan 12, 2021 · 16 comments · Fixed by #913
Labels: enhancement New feature or request

Comments

@cosmicBboy
Collaborator

cosmicBboy commented Jan 12, 2021

Is your feature request related to a problem? Please describe.

Extending pandera to non-pandas dataframe-like structures is a challenge today because the schema and schema component class definitions are strongly coupled to the pandas API. For example, the DataFrameSchema.validate method assumes that validated objects follow the pandas API.

Potential Solutions

  1. Abstract out the core pandera interface into Schema, SchemaComponent, and Check abstract base classes so that core and third-party pandera schemas can be easily developed on top of them. Subclasses of these base classes would implement the validation logic for a specific library, e.g. SparkSchema, PandasSchema, etc.
  2. Provide a validation engine interface where core and third-party developers can register and use different validation backends depending on the type of dataframe implementation (e.g. pandas, spark, dask, etc.) being used, similar to the proposal in Decouple pandera and pandas type systems #369. The public-facing API won't change: the appropriate engine would be selected via two (non-mutually exclusive) approaches (see the dispatch sketch after this list):
    • at validation time, pandera delegates to the appropriate engine based on the type of obj when schema.validate(obj) is called.
    • add an engine: str option to explicitly specify which engine to use (question: should this go in __init__, validate, or both?).
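For concreteness, here's a minimal sketch of what approach (2) could look like. All names below (ValidationEngine, register_validation_engine, get_engine) are hypothetical and only illustrate the registration-plus-dispatch idea, not actual pandera API:

from typing import Dict, Type

class ValidationEngine:
    """Hypothetical base class: one subclass per dataframe library."""

    # dataframe types this engine knows how to validate (set by subclasses)
    handles: tuple = ()

    def validate(self, schema, check_obj):
        raise NotImplementedError

_ENGINE_REGISTRY: Dict[type, ValidationEngine] = {}

def register_validation_engine(engine_cls: Type[ValidationEngine]) -> None:
    """Register an engine for every dataframe type it declares it handles."""
    engine = engine_cls()
    for df_type in engine_cls.handles:
        _ENGINE_REGISTRY[df_type] = engine

def get_engine(check_obj) -> ValidationEngine:
    """Pick the engine matching the runtime type of the object being validated."""
    for df_type, engine in _ENGINE_REGISTRY.items():
        if isinstance(check_obj, df_type):
            return engine
    raise TypeError(f"no validation engine registered for {type(check_obj)}")

With something like this, schema.validate(obj) would simply look up get_engine(obj) and delegate to it.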

Describe the solution you'd like

Because this is quite a momentous change in pandera's scope (supporting more than just pandas dataframes), I'll first reiterate pandera's design philosophy:

  1. minimize the proliferation of classes in the public-facing API
  2. the schema-definition interface should be isomorphic to the data structure being validated, i.e. defining a dataframe schema should feel like defining a dataframe
  3. prioritize flexibility/expressiveness of validation functions, and add built-ins for common checks (based on feature parity with other similar schema libraries, or by popular request)

In keeping with these principles, I propose going with solution (2), in order to prevent an increase in the complexity and surface area of the user-facing API (DaskSchema, PandasSchema, SparkSchema, VaexSchema, etc).

edit:
Actually, with solution (1), one approach that would keep the API surface area small is a subpackage pattern that replicates the pandera interface for an alternative backend:

import pandera.spark as pa

spark_schema = pa.DataFrameSchema({...})

class SparkSchema(pa.SchemaModel):
    ...

Etc...

import pandera.dask
import pandera.modin

Will need to think through the pros and cons of 1 vs 2 some more...

Re: data synthesis strategies, which are used purely for testing and not meant to generate massive amounts of data, we could just fall back on pandas and convert the synthesized data to the corresponding dataframe type, assuming the df library supports this, e.g. spark.createDataFrame.
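As a rough sketch of that fallback, assuming a backend like Spark that can ingest a pandas DataFrame directly (schema.example comes from pandera's strategies module and requires hypothesis to be installed):

import pandera as pa
from pyspark.sql import SparkSession

schema = pa.DataFrameSchema({
    "x": pa.Column(int, pa.Check.ge(0)),
    "y": pa.Column(float),
})

# pandera's strategies can synthesize a pandas DataFrame that satisfies the schema
pandas_df = schema.example(size=10)

# convert the synthesized pandas data to the backend's native structure
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(pandas_df)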

@cosmicBboy cosmicBboy added the enhancement New feature or request label Jan 12, 2021
@cosmicBboy cosmicBboy added this to the 0.8.0 release milestone Feb 5, 2021
@cosmicBboy cosmicBboy added this to 0.8.0 Release TODO in Release Roadmap Feb 19, 2021
@cosmicBboy cosmicBboy changed the title Refactor schema and schema components into base classes Support alternative validation backends Apr 10, 2021
@cosmicBboy cosmicBboy changed the title Support alternative validation backends Introduct validation engines to support alternative dataframe libraries Apr 10, 2021
@cosmicBboy cosmicBboy changed the title Introduct validation engines to support alternative dataframe libraries Introduce validation engines to support alternative dataframe libraries Apr 10, 2021
@cosmicBboy
Collaborator Author

cosmicBboy commented Apr 10, 2021

Initial Thoughts

Currently, the schema and check classes conflate the specification of schema properties with the validation of those properties on some data. We may want to separate these two concerns.

  • DataFrameSchema collects all column types and checks and does some basic schema validations to make sure the specification is valid (raises SchemaInitError if invalid).
  • DataFrameSchema.validate should delegate the validation of some input data to a ValidationEngine. The validation engine performs the following operations:
    • checks strictness criteria, i.e. only columns specified in schema are in the dataframe (optional)
    • checks dataframe column order against schema columns order (optional)
    • coerces columns to types specified (optional)
    • expands schema regex columns based on dataframe columns
    • run schema component (column/index) checks
      • check for nulls (optional)
      • check for duplicates (optional)
      • check datatype
      • run Check validations
    • run dataframe-level checks
  • _CheckBase needs to delegate the implementation of groupby, element_wise, agg, and potentially other modifiers to the underlying dataframe library via the ValidationEngine.
  • the ValidationEngine would also have to supply implementations for built-in Checks. This can happen incrementally, such that an error is raised if the implementation isn't available yet for a particular dataframe library.
  • the strategies module needs to be extended to support other dataframe types. Since hypothesis supports numpy and pandas it makes sense to use the existing strategies logic to generate a pandas dataframe and convert it to some other desired format (e.g. koalas, modin, dask, etc) and see how far that gets us.

Here's a high-level sketch of the API:

import pandera as pa

# pandera contributor to the codebase, or a custom third-party engine author
class MySpecialDataFrameValidationEngine(ValidationEngine):
    # implement the validation operations described above
    ...

register_validation_engine(MySpecialDataFrameValidationEngine)

# end-user interaction, with a hypothetical special_dataframe package
from special_dataframe import MySpecialDataFrame

special_df = MySpecialDataFrame(...)

schema = pa.DataFrameSchema({...})
schema.validate(special_df)
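To connect that sketch with the bulleted operations above, here's one hedged guess at the kind of interface a ValidationEngine might expose. Every method name below is hypothetical:

from abc import ABC, abstractmethod

class ValidationEngine(ABC):
    """Hypothetical interface; each backend (pandas, spark, dask, ...) would subclass it."""

    @abstractmethod
    def coerce_dtype(self, series, dtype):
        """Coerce a column to the schema-specified dtype (optional step)."""

    @abstractmethod
    def check_strict(self, df, schema_columns):
        """Verify that only schema-specified columns appear in the dataframe."""

    @abstractmethod
    def check_column_order(self, df, schema_columns):
        """Verify dataframe column order against the schema's column order."""

    @abstractmethod
    def expand_regex_columns(self, df, schema_columns):
        """Expand regex-named schema columns against the dataframe's columns."""

    @abstractmethod
    def run_column_checks(self, df, column):
        """Nullability, uniqueness, dtype, and Check validations for one column/index."""

    @abstractmethod
    def run_dataframe_checks(self, df, checks):
        """Dataframe-level Check validations."""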

@cosmicBboy cosmicBboy changed the title Introduce validation engines to support alternative dataframe libraries Support validation of non-pandas dataframes, e.g. spark, dask, modin Apr 11, 2021
@cosmicBboy cosmicBboy changed the title Support validation of non-pandas dataframes, e.g. spark, dask, modin Support validation of non-pandas dataframes, e.g. spark, dask, etc Apr 11, 2021
@jeffzi
Collaborator

jeffzi commented Apr 11, 2021

checks strictness criteria, i.e. only columns specified in schema are in the dataframe (optional)
checks dataframe column order against schema columns order (optional)
expands schema regex columns based on dataframe columns

I think those operations can be handled by DataFrameSchema, provided that the engine exposes get_columns(df)/set_columns(df). "columns" here refers to a list of pandera.Column.
edit: It occurred to me that this idea may be too restrictive for multi-dimensional dataframes (like xarray), unless DataFrameSchema knows about multiple dimensions.
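A hedged sketch of how DataFrameSchema could then implement one of those operations once, backend-agnostically (get_columns is the hypothetical hook from the comment above, and a real implementation would raise pandera's SchemaError rather than ValueError):

import pandera as pa

def check_strict(schema: pa.DataFrameSchema, df, engine) -> None:
    """Strictness check written once in DataFrameSchema on top of the engine hook."""
    # engine.get_columns(df) would return the dataframe's columns as pandera.Column objects
    df_columns = {col.name for col in engine.get_columns(df)}
    extra = df_columns - set(schema.columns)
    if extra:
        raise ValueError(f"columns not in schema: {sorted(extra)}")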

We could merge the idea of Backend outlined in #369 with ValidationEngine. That would add the responsibility of registering dtypes.

Question: What to do with pandera.Index? Most DataFrame libraries don't have this concept. If we want to minimize the proliferation of classes in the public-facing API, which I totally agree with, we need to keep set_index()/reset_index() on DataFrameSchema but raise an error if the engine does not support it.

@cosmicBboy cosmicBboy self-assigned this May 16, 2021
@cosmicBboy cosmicBboy changed the title Support validation of non-pandas dataframes, e.g. spark, dask, etc Abstract out validation logic to support non-pandas dataframes, e.g. spark, dask, etc May 16, 2021
@crypdick

Any ETA on Modin support?

@cosmicBboy
Collaborator Author

hey @crypdick, once #504 is merged (should be in the next few days) I'm going to tackle this issue.

The plan right now is to make a ValidationEngine base class and PandasValidationEngine with native support for pandas, modin, and koalas.

I've done a little bit of prototyping of the new validation engine, but it still needs a bunch of work... I'm going to push for a finished solution before scipy conf this year, so ETA mid-July?

@kvnkho
Contributor

kvnkho commented Sep 18, 2021

Went through the discussion and we'd certainly be interested in contributing a Fugue ValidationEngine. We'll keep an eye out for the PandasValidationEngine and the koalas/modin support and see if Fugue has direct mappings to the implementation you arrive at!

@JackKelly

Hi, I was just wondering if it's possible to use pandera to define schemas for n-dimensional numpy arrays; and hence to use pandera with xarray.DataArray objects, just as pandera is currently used for pandas.DataFrames?

@cosmicBboy
Collaborator Author

@JackKelly I'd love to add support for numpy+xarray, but unfortunately it's currently not possible.

After this PR is merged (still WIP) we'll have a much better interface for extending pandera to other non-pandas data structures; numpy and xarray would be natural to support in pandera.

Out of curiosity (looking at openclimatefix/nowcasting_dataset#211) is your primary use-case to check data types and dimensions of xarray objects?

@JackKelly

Thanks loads for the reply! No worries at all!

Yes, our primary use-case is to check the data type, dimensions, and values of xarray Datasets and DataArrays.

@cosmicBboy
Collaborator Author

cosmicBboy commented Oct 11, 2021

Thanks loads for the reply! No worries at all!

Yes, our primary use-case is to check the data type, dimensions, and values of xarray Datasets and DataArrays.

Great! will keep this in mind for when we get there.

Also, once pandera schemas can be used as valid pydantic types (#453), the solution you outline here would be pretty straightforward to port over to pandera, making for a pretty concise schema definition. I'm imagining a user API like:

import pandera as pa
import pydantic
from typing import Optional

class ImageDataset(pa.SchemaModel):
    # DataArray and NDField are hypothetical types in this imagined API
    data: DataArray[int] = NDField(dims=("time", "x", "y"))
    x_coords: Optional[DataArray[int]] = NDField(dims=("index",))
    y_coords: Optional[DataArray[int]] = NDField(dims=("index",))


class Example(pydantic.BaseModel):
    """A single machine learning training example."""
    satellite: Optional[ImageDataset]
    nwp: Optional[ImageDataset]

@JackKelly

That looks absolutely perfect, thank you!

This was referenced Oct 13, 2021
@jhamman

jhamman commented Dec 6, 2021

Hi all. I wanted to share a little experiment we've been playing with, xarray-schema, which provides schema validation logic for Xarray objects. We've been following this thread closely and we're looking at ways to integrate what we've done with pandera/pydantic.

@cosmicBboy
Collaborator Author

wow @jhamman this looks amazing! I'd love to integrate, do you want to find a time to chat?
https://calendly.com/niels-bantilan/30min

Also feel free to join the discord community if you want to discuss further there: https://discord.gg/vyanhWuaKB

@cosmicBboy cosmicBboy mentioned this issue Dec 11, 2021
@cosmicBboy
Collaborator Author

@jhamman made this issue #705 to track the xarray integration.

I'm planning on making a PR for this issue (#381) by end of year to make the xarray-schema integration as smooth as possible.

@blais

blais commented Feb 1, 2022

Thanks for your email Niels.
PETL allows one to process tables of data.
It has several differences from, and some advantages over, Pandas:

  • The data storage is a lot more straightforward - no indices, just regular Python objects (no Pandas-specific dtypes).
  • As a result, the code is much more predictable. Pandas is often quirky and leads to silent failures that are hard to predict. In comparison, I get things right the first time with PETL nearly 100% of the time (and I have solid experience with Pandas).
  • PETL allows you to keep only a portion of the dataframe in memory.
  • PETL is row-based, not column-based, so depending on the operation, some of the processing available in Pandas is not available. In-row and near-row operations are still possible, though.
  • PETL is lazily evaluated by default: it's only at the point of producing the output that the data is pulled through its processing pipeline. This has advantages - a small memory footprint - but also disadvantages - e.g. using a closure may have difficult-to-predict behavior because it actually gets executed well after its point of definition.

Overall, I think for 90% of the processing I've seen done in Pandas, PETL is a better choice. For the remaining 10%, Pandas is needed (used more like NumPy).

Having schemas for PETL would be awesome. Supporting it should be much easier than Pandas - as I mentioned, it doesn't define custom data types, and the data representation model is really straightforward: lists of (lists or tuples) of arbitrary Python objects.
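To illustrate how small that gap is, here's a hedged stopgap (not a native PETL engine): a row-oriented table with a header row can be bridged through pandas for validation until dedicated support exists.

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "name": pa.Column(str),
    "amount": pa.Column(float, pa.Check.ge(0)),
})

# PETL-style table: header row first, then data rows
table = [
    ("name", "amount"),
    ("alice", 10.0),
    ("bob", 2.5),
]

header, *rows = table
validated = schema.validate(pd.DataFrame(rows, columns=header))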

@andretheronsa

andretheronsa commented Apr 12, 2022

What would be required to ensure we can add a GeoDataFrame type from GeoPandas with a Pydantic BaseModel?
I am thinking it may not be as complex as support for spark/dask and new interfaces. If someone could point me in the right direction I could work on a PR.

I would like to do:

import pandera as pa
from pandera.typing.geopandas import GeoDataFrame, GeoSeries
from pandera.typing import Series
from typing import Optional
import pydantic
from shapely.geometry import Polygon

class BaseGeoDataFrameSchema(pa.SchemaModel):
    geometry: GeoSeries
    properties: Optional[Series[str]]

class Inputs(pydantic.BaseModel):
    gdf: GeoDataFrame[BaseGeoDataFrameSchema]
    # currently raises:
    # TypeError: Fields of type "<class 'pandera.typing.geopandas.GeoDataFrame'>" are not supported.

gdf = GeoDataFrame[BaseGeoDataFrameSchema](
    {"geometry": [Polygon(((0, 0), (0, 1), (1, 1), (1, 0)))], "extra": [1]}, crs=4326
)
validated_inputs = Inputs(gdf=gdf)

@cosmicBboy
Collaborator Author

hi all, pinging this issue to point everyone to this PR: #913

It's a WIP PR laying the groundwork for improving the extensibility of pandera's abstractions. I'd very much appreciate people's feedback on this; nothing is set in stone yet!

I'll be adding additional details to the PR description in the next few days, but for now it outlines the main changes at a high level. Please chime in with your thoughts/comments!
