Design Data Types Library That Supports Both PySpark & Pandas #1360

Open · lior5654 opened this issue Oct 1, 2023 · 10 comments
Labels: question (Further information is requested)

lior5654 commented Oct 1, 2023

Hi, I have multiple data types I commonly work with, sometimes in pandas and sometimes in pyspark.

I don't want to create two pandera DataFrameModels for each type; that seems like really bad practice.

What's currently the best way to do this?

Is there also a way to write code that works on both pyspark and pandas?

lior5654 added the question label on Oct 1, 2023
lior5654 (Author) commented Oct 1, 2023

Any comments on this?

cosmicBboy (Collaborator) commented Oct 1, 2023

Long story short, the primitives are there, it'll just be some work before we can realize the vision of "one DataFrameModel to rule them all" 💍.

This question can be broken down into two sub-problems:

  1. How to create a common DataFrameModel interface that can validate a suite of supported dataframe types.
  2. How to create a common type system, so that a single set of data types works across different dataframe types (while still using different DataFrameModel classes). This is what the pandera type system was designed for, but it will take some work to make the developer experience really nice.

> I don't want to create two pandera DataFrameModels for each type; that seems like really bad practice.

Agreed, but you might be surprised how challenging this is to get right 🙃.

I'd say (2) is a little easier to tackle right now. Basically we'd need to add the library-agnostic data types as supported equivalent types in the pyspark_engine module. This is already somewhat supported in the pandas_engine module but that also needs to be cleaned up.
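For concreteness, here's a minimal sketch of what (2) already looks like against the pandas engine, using the library-agnostic pandera dtypes (the pyspark_engine would need equivalent type mappings registered for the same model to work there):

import pandas as pd
import pandera as pa
from pandera.typing import Series


# pandera's library-agnostic dtypes (pa.String, pa.Int64, ...) are already
# registered as equivalent types in the pandas engine
class AgnosticSchema(pa.DataFrameModel):
    name: Series[pa.String]
    count: Series[pa.Int64]


AgnosticSchema.validate(pd.DataFrame({"name": ["a"], "count": [1]}))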

As for (1), that's going to take more work, but basically we'd need to create a generic DataFrameSchema and DataFrameModel interface that supports both pandas and pyspark (and e.g. polars, etc). This would require some pretty big internal changes to the way that DataFrameModel works (I'm not happy about its current state and will need to overhaul it), and perhaps use Pythonic typing conventions like Annotated[pd.DataFrame, DataFrameModel] instead of pandera.typing.pandas.DataFrame[DataFrameModel].
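To make the typing idea concrete, a sketch of the two conventions; the Annotated form is only the envisioned convention, not a working API today:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class Schema(pa.DataFrameModel):
    price: Series[int]


# today's convention: a pandera-specific generic type
@pa.check_types
def fn(df: DataFrame[Schema]) -> DataFrame[Schema]:
    return df


fn(pd.DataFrame({"price": [1]}))  # validated on the way in and out

# the envisioned convention (NOT implemented): standard typing machinery
# from typing import Annotated
# def fn(df: Annotated[pd.DataFrame, Schema]) -> Annotated[pd.DataFrame, Schema]:
#     return df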

@NeerajMalhotra-QB @jaskaransinghsidana FYI.

I guess to kick this effort off, would you mind sharing some example code of what you're doing today @lior5654 ?

lior5654 (Author) commented Oct 1, 2023

Thanks for the detailed answer! I really appreciate it.

I'll share a minimal PoC soon, but another question arises: do you recommend other libraries for this these days? Are you aware of any libraries that currently support such concepts? @cosmicBboy

cosmicBboy (Collaborator) commented

@lior5654 as far as I know, there are no other efforts to create a "unified dataframe model for schema validation"... pandera is the only such effort I'm aware of :) Happy to learn about others if any community members know of any, but I would love to work with you to figure out how we can achieve this vision with pandera.

The "one DataFrameModel to rule them all" really is the goal here, but for the longest time pandera only supported pandas-compliant APIs, e.g. modin, dask, pyspark.pandas. Recent support forpyspark-sql was the first experiment to see if pandera can really support other dataframe libraries (the answer is yes 😀). So as not to generalize too early @NeerajMalhotra-QB and team and I decided to essentially duplicate some of the code when we built out the pyspark-sql-native support.

Now with the efforts to support polars and ibis I think we're in a good position to generalize the API so we can have a generic DataFrameSchema and DataFrameModel base class, which can serve as the single entrypoint for validating dataframe-like objects, delegating to the appropriate backend and type system as needed.
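To illustrate where things stand today, the same spec still has to be declared once per backend; a sketch using the pandera.polars module (which shows up later in this thread):

import pandas as pd
import polars as pl
import pandera as pa
import pandera.polars as pa_pl

# the same column spec, declared twice because each backend currently has
# its own DataFrameSchema; a generic base class would collapse these into one
pandas_schema = pa.DataFrameSchema({"price": pa.Column(int, pa.Check.gt(0))})
polars_schema = pa_pl.DataFrameSchema({"price": pa_pl.Column(int, pa_pl.Check.gt(0))})

pandas_schema.validate(pd.DataFrame({"price": [1]}))
polars_schema.validate(pl.DataFrame({"price": [1]}))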

lior5654 (Author) commented Oct 3, 2023

Thanks again for the reply. Before I send code samples, I have another question: regarding modin, dask, and pyspark.pandas, which are all pandas-API compliant, can I use the same DataFrameModel to validate them all? I see that pyspark.pandas has its own Series class in pandera.

@cosmicBboy

lior5654 (Author) commented Oct 3, 2023

But I think my intentions are pretty clear; here's a code sample:

import pyspark.pandas as ps
import pandas as pd
import pandera as pa

# should be GENERIC, for ALL pandas-compliant APIs
# in the future - also for non-pandas-compliant APIs (where applicable)
# currently - this is for pandas itself only
from pandera.typing import DataFrame, Series


class Schema(pa.DataFrameModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})

# validate a pd.DataFrame (this will work)
# validate a ps.DataFrame (this has its own Series type, so you need to define the same class with a different Series)
# ...

I guess for a start, supporting all pandas-compliant APIs with a single class should be easy, right?

lior5654 (Author) commented Oct 3, 2023

Oh, I just tested, and it seems like the pandera pandas DataFrameModel works seamlessly with the pyspark.pandas API. Great to know!
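A minimal repro, reusing the Schema from the sample above (pyspark.pandas starts a local Spark session implicitly):

import pyspark.pandas as ps
import pandera as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


# the pandas-style model validates a pyspark.pandas DataFrame as-is
psdf = ps.DataFrame({"state": ["FL"], "city": ["Orlando"], "price": [8]})
Schema.validate(psdf)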

cosmicBboy (Collaborator) commented

> Oh, I just tested, and it seems like the pandera pandas DataFrameModel works seamlessly with the pyspark.pandas API. Great to know!

Yep! The pyspark.pandas integration has been around for longer.

cosmicBboy (Collaborator) commented

> supporting all pandas-compliant APIs with a single class should be easy, right?

Yes, this is possible today with the backend extensions plugin. This is currently done for dask, modin, and pyspark.pandas. The challenge is making pyspark.sql dataframe schemas use the same schema specification as the pandas one, just with a different backend.
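For example, a sketch with the dask integration; note that dask validation is lazy, so check errors only surface at compute time:

import dask.dataframe as dd
import pandas as pd
import pandera as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    price: Series[int] = pa.Field(gt=0)


pdf = pd.DataFrame({"price": [1, 2, 3]})

Schema.validate(pdf)  # eager validation on pandas
# the same class validates a dask dataframe; checks run at compute time
Schema.validate(dd.from_pandas(pdf, npartitions=1)).compute()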

tcsantini commented Jun 27, 2024

Is there any update now that polars (#1064) is also supported, @cosmicBboy?
It seems like your point (1) is pretty much done.

Poking around the API and pandera's source code, it seems like I'm already very close to being able to define a functional shared DataFrameModel that is engine-agnostic (at least between pandas and polars).

Here is a POC that seems to work, albeit I can't validate directly with the shared model. I guess it should be straightforward to enable validation with the shared model by having pandera.api.dataframe.model.DataFrameModel's validate delegate to the correct library-specific validate based on the type of check_obj? (A sketch of that delegation follows the output below.)

import pandas as pd
import polars as pl

# Pandera pandas
import pandera as pd_pa
# Pandera polars
import pandera.polars as pl_pa
# Pandera library-agnostic
from pandera.api.dataframe.model import DataFrameModel
from pandera.api.dataframe.model_components import Field
from pandera.dtypes import String, Int64
from pandera.errors import SchemaErrors


class SharedModel(DataFrameModel):
    a_str: String = Field(isin=["ok"])
    an_int: Int64 = Field(gt=0)


class PandasModel(SharedModel, pd_pa.DataFrameModel):
    ...


class PolarsModel(SharedModel, pl_pa.DataFrameModel):
    ...


def main():
    # define a pandas and a polars dataframe with the same content
    pa_df = pd.DataFrame({
        "a_str": [
            "ok",
            "not ok",  # should raise
        ],
        "an_int": [
            -1,  # should raise
             1,
        ]
    })
    pl_df = pl.from_pandas(pa_df)

    print("pandas dataframe / model works")
    try:
        PandasModel.validate(pa_df, lazy=True)
        assert False
    except SchemaErrors as e:
        print(e)

    print("polars dataframe / model works")
    try:
        PolarsModel.validate(pl_df, lazy=True)
        assert False
    except SchemaErrors as e:
        print(e)

    print("pandas / polars dataframe with shared model doesn't work: raises NotImplementedError")
    for df in (pa_df, pl_df):
        try:
            SharedModel.validate(df, lazy=True)
            assert False
        except NotImplementedError:
            ...


if __name__ == "__main__":
    main()

Which outputs:

pandas dataframe / model works
{
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "PandasModel",
                "column": "a_str",
                "check": "isin(['ok'])",
                "error": "Column 'a_str' failed element-wise validator number 0: isin(['ok']) failure cases: not ok"
            },
            {
                "schema": "PandasModel",
                "column": "an_int",
                "check": "greater_than(0)",
                "error": "Column 'an_int' failed element-wise validator number 0: greater_than(0) failure cases: -1"
            }
        ]
    }
}
polars dataframe / model works
{
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "PolarsModel",
                "column": "a_str",
                "check": "isin(['ok'])",
                "error": "Column 'a_str' failed validator number 0: <Check isin: isin(['ok'])> failure case examples: [{'a_str': 'not ok'}]"
            },
            {
                "schema": "PolarsModel",
                "column": "an_int",
                "check": "greater_than(0)",
                "error": "Column 'an_int' failed validator number 0: <Check greater_than: greater_than(0)> failure case examples: [{'an_int': -1}]"
            }
        ]
    }
}
pandas / polars dataframe with shared model doesn't work: raises NotImplementedError
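And a minimal sketch of the delegation suggested above, built on the POC's classes (hypothetical glue code, not pandera API; a real fix would live in DataFrameModel.validate itself):

# hypothetical dispatch: route to the backend-specific model based on the
# runtime type of check_obj, as DataFrameModel.validate could eventually do
def shared_validate(check_obj, **kwargs):
    if isinstance(check_obj, pd.DataFrame):
        return PandasModel.validate(check_obj, **kwargs)
    if isinstance(check_obj, pl.DataFrame):
        return PolarsModel.validate(check_obj, **kwargs)
    raise NotImplementedError(f"unsupported dataframe type: {type(check_obj)}")


# shared_validate(pa_df, lazy=True)  # raises SchemaErrors via PandasModel
# shared_validate(pl_df, lazy=True)  # raises SchemaErrors via PolarsModel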
