
Pandas2 / pyarrow backend support #1262

Closed
mattharrison opened this issue Jul 15, 2023 · 10 comments · Fixed by #1628
Labels
enhancement New feature or request

Comments

@mattharrison

Describe the bug
I can't generate a schema from a pyarrow-backed dataframe

Code Sample, a copy-pastable example

import io
import pandas as pd
import pandera 
data = 'id,date\n0e90a7243dbb433fbfb24e23f08b0684,08-05-2022\nb6242783029545f1ac86be6b950ed6d7,30-04-2023\n'

df = pd.read_csv(io.StringIO(data), engine='pyarrow', dtype_backend='pyarrow')
print(pd.__version__)
pandera.infer_schema(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[127], line 8
      6 df = pd.read_csv(io.StringIO(data), engine='pyarrow', dtype_backend='pyarrow')
      7 print(pd.__version__)
----> 8 pandera.infer_schema(df)

File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_inference/pandas.py:39, in infer_schema(pandas_obj)
     32 """Infer schema for pandas DataFrame or Series object.
     33 
     34 :param pandas_obj: DataFrame or Series object to infer.
     35 :returns: DataFrameSchema or SeriesSchema
     36 :raises: TypeError if pandas_obj is not expected type.
     37 """
     38 if isinstance(pandas_obj, pd.DataFrame):
---> 39     return infer_dataframe_schema(pandas_obj)
     40 elif isinstance(pandas_obj, pd.Series):
     41     return infer_series_schema(pandas_obj)

File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_inference/pandas.py:73, in infer_dataframe_schema(df)
     67 def infer_dataframe_schema(df: pd.DataFrame) -> DataFrameSchema:
     68     """Infer a DataFrameSchema from a pandas DataFrame.
     69 
     70     :param df: DataFrame object to infer.
     71     :returns: DataFrameSchema
     72     """
---> 73     df_statistics = infer_dataframe_statistics(df)
     74     schema = DataFrameSchema(
     75         columns={
     76             colname: Column(
   (...)
     84         coerce=True,
     85     )
     86     schema._is_inferred = True

File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_statistics/pandas.py:15, in infer_dataframe_statistics(df)
     13 """Infer column and index statistics from a pandas DataFrame."""
     14 nullable_columns = df.isna().any()
---> 15 inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
     16 column_statistics = {
     17     col: {
     18         "dtype": dtype,
   (...)
     22     for col, dtype in inferred_column_dtypes.items()
     23 }
     24 return {
     25     "columns": column_statistics if column_statistics else None,
     26     "index": infer_index_statistics(df.index),
     27 }

File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_statistics/pandas.py:15, in <dictcomp>(.0)
     13 """Infer column and index statistics from a pandas DataFrame."""
     14 nullable_columns = df.isna().any()
---> 15 inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
     16 column_statistics = {
     17     col: {
     18         "dtype": dtype,
   (...)
     22     for col, dtype in inferred_column_dtypes.items()
     23 }
     24 return {
     25     "columns": column_statistics if column_statistics else None,
     26     "index": infer_index_statistics(df.index),
     27 }

File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_statistics/pandas.py:184, in _get_array_type(x)
    181 def _get_array_type(x):
    182     # get most granular type possible
--> 184     data_type = pandas_engine.Engine.dtype(x.dtype)
    185     # for object arrays, try to infer dtype
    186     if data_type is pandas_engine.Engine.dtype("object"):

File ~/.envs/menv/lib/python3.10/site-packages/pandera/engines/pandas_engine.py:209, in Engine.dtype(cls, data_type)
    206         common_np_dtype = np.dtype(np_or_pd_dtype.name)
    207         np_or_pd_dtype = common_np_dtype.type
--> 209 return engine.Engine.dtype(cls, np_or_pd_dtype)

File ~/.envs/menv/lib/python3.10/site-packages/pandera/engines/engine.py:265, in Engine.dtype(cls, data_type)
    263     return registry.dispatch(data_type)
    264 except (KeyError, ValueError):
--> 265     raise TypeError(
    266         f"Data type '{data_type}' not understood by {cls.__name__}."
    267     ) from None

TypeError: Data type 'string[pyarrow]' not understood by Engine.

Expected behavior

I want to be able to use Pandera with pyarrow backed dataframes

Versions:

  • Pandas : 2.0.2
  • Pandera: 0.15.2
@mattharrison mattharrison added the bug Something isn't working label Jul 15, 2023
@cosmicBboy
Collaborator

@mattharrison I think this would be a feature request: the current scope of pandera is that it doesn't yet support pyarrow datatypes/backend. Gonna close #1162 and merge that with this issue

@cosmicBboy cosmicBboy added enhancement New feature or request and removed bug Something isn't working labels Jul 16, 2023
@franzoni315

Is there a workaround to make validation work with pyarrow types? Or do you have any idea when this will be implemented?

@OliverKleinBST

I would also second the request to support arrow datatypes, which I guess will become the new normal in pandas 2.
My current workaround is to convert the arrow dtypes to nullable numpy dtypes before running pandera:
df.convert_dtypes(infer_objects=False, dtype_backend='numpy_nullable')
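A minimal, self-contained sketch of this workaround (the DataFrame here is hypothetical; the key call is convert_dtypes):

```python
import pandas as pd

# Hypothetical example frame; a pyarrow-backed DataFrame converts the same way
df = pd.DataFrame({"foo": [1, 2], "bar": ["a", "b"]})

# Convert to pandas' nullable, numpy-backed dtypes before running pandera
converted = df.convert_dtypes(dtype_backend="numpy_nullable")
print(converted["foo"].dtype)  # Int64
```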

@juanarrivillaga

I just want to point out that pyarrow will become a required dependency in pandas 3.0, and the arrow string datatype will become the default string datatype (although numeric types will continue to default to numpy types, IIUC):

https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html

@cosmicBboy
Collaborator

Anyone who wants to create PR for this has my blessing!

A good place to start would be:

@aaravind100
Contributor

aaravind100 commented May 3, 2024

@cosmicBboy i took a quick stab at it

adding this to pandas_engine.py

@Engine.register_dtype(equivalents=["int", pd.ArrowDtype(pyarrow.int64())])
@immutable
class ArrowINT64(DataType, dtypes.Int):
    type = pd.ArrowDtype(pyarrow.int64())
    bit_width: int = 64


@Engine.register_dtype(equivalents=["string", pd.ArrowDtype(pyarrow.string())])
@immutable
class ArrowString(DataType, dtypes.String):
    type = pd.ArrowDtype(pyarrow.string())

this gets validated

import pandas as pd
import pandera as pa

df = pd.DataFrame(
    [
        {"foo": 123, "bar": "abc"},
    ],
)


class Schema(pa.DataFrameModel):
    foo: int
    bar: str


print("pandas:")
print(df.dtypes)
print()
print(Schema.validate(df))
print()

df = df.convert_dtypes(dtype_backend="pyarrow")
print("pandas[pyarrow]:")
print(df.dtypes)
print()
print(Schema.validate(df))

output:

pandas:
foo     int64
bar    object
dtype: object

   foo  bar
0  123  abc

pandas[pyarrow]:
foo     int64[pyarrow]
bar    string[pyarrow]
dtype: object

   foo  bar
0  123  abc

would you like me to continue this direction?

@cosmicBboy
Collaborator

cosmicBboy commented May 4, 2024

@aaravind100 the overall approach makes sense! Thanks for taking the initiative on this.

@Engine.register_dtype(equivalents=["int", pd.ArrowDtype(pyarrow.int64())])

Let's avoid overloading "int" here since it's already taken by the numpy int type: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/numpy_engine.py#L163-L165

For the equivalents, pandera has taken the philosophy of "accept whatever pandas (or the underlying dataframe library) accepts as dtypes." So this means:

  • the string alias, e.g. "int64[pyarrow]"
  • the ArrowDtype instance pd.ArrowDtype(pyarrow.int64())

Another thought here: instead of requiring users to wrap the pyarrow dtype in pd.ArrowDtype(...) up front when specifying a pandera schema, we could potentially do the wrapping in the background (would be curious on your thoughts here).

import pandera as pa
import pyarrow

pa.DataFrameSchema({
    "foo": pa.Column(pyarrow.int64()),
    "bar": pa.Column(pyarrow.timestamp(unit="s")),
})

The benefit is that it makes for more concise schemas. As mentioned in the docs, we'll need to make sure to wrap these in pd.ArrowDtype under the hood for parameterized types like pyarrow.timestamp. This is necessary to support DataFrameModel-style schemas:

class Model(pa.DataFrameModel):
    foo: pyarrow.int64  # these need to be types, so pyarrow.int64() is invalid
    bar: pyarrow.timestamp = pa.Field(dtype_kwargs={"unit": "s"})

    # or using typing.Annotated
    bar: Annotated[pyarrow.timestamp, "s"]

So something like:

@Engine.register_dtype(equivalents=["int64[pyarrow]", pyarrow.int64, pyarrow.int64()])  # this makes sure plain pyarrow.int64 is accepted as a dtype in the schema definition
@immutable
class ArrowInt64(DataType, dtypes.Int):
    type = pd.ArrowDtype(pyarrow.int64())  # we wrap this here
    bit_width: int = 64

For parameterized dtypes it'll be slightly more complicated:

@Engine.register_dtype(equivalents=[pyarrow.timestamp])  # pyarrow.timestamp requires a unit, so only the plain type is listed here
@immutable
class ArrowTimestamp(DataType, dtypes.Timestamp):
    type: Optional[pd.ArrowDtype] = dataclasses.field(default=None, init=False)  # we'll set this in __post_init__
    bit_width: int = 64

    unit: Optional[str] = None
    tz: Optional[datetime.tzinfo] = None

    def __post_init__(self):
        type_ = pd.ArrowDtype(pyarrow.timestamp(self.unit, self.tz))
        object.__setattr__(self, "type", type_)

    # this handles creating an instance of ArrowTimestamp in the DataFrameModel
    # schema definition
    @classmethod
    def from_parametrized_dtype(cls, pyarrow_dtype: pyarrow.TimestampType):
        return cls(unit=pyarrow_dtype.unit, tz=pyarrow_dtype.tz)  # type: ignore

@aaravind100
Contributor

pandera has taken the philosophy of "accept whatever pandas (or underlying dataframe library) accepts as dtypes"

thank you, that clears some confusion :)

The suggestion to use pyarrow.<type> does indeed make more sense to me. It also opens up schema/model interoperability with other dataframe libraries that use pyarrow types.

@cosmicBboy
Collaborator

@mattharrison you'll be pleased to learn that #1628 has been merged :) the 0.20.0 release will have these changes. will probably cut a beta release in the next week or so if you wanted to play around with it

@mattharrison
Author

mattharrison commented May 11, 2024 via email
