
Pandas2 / pyarrow backend support #1262

Closed
mattharrison opened this issue Jul 15, 2023 · 10 comments · Fixed by #1628
Labels
enhancement New feature or request

Comments

@mattharrison

Describe the bug
I can't generate a schema from a pyarrow-backed dataframe

Code Sample, a copy-pastable example

import io
import pandas as pd
import pandera 
data = 'id,date\n0e90a7243dbb433fbfb24e23f08b0684,08-05-2022\nb6242783029545f1ac86be6b950ed6d7,30-04-2023\n'

df = pd.read_csv(io.StringIO(data), engine='pyarrow', dtype_backend='pyarrow')
print(pd.__version__)
pandera.infer_schema(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[127], line 8
      6 df = pd.read_csv(io.StringIO(data), engine='pyarrow', dtype_backend='pyarrow')
      7 print(pd.__version__)
----> 8 pandera.infer_schema(df)

File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_inference/pandas.py:39, in infer_schema(pandas_obj)
     32 """Infer schema for pandas DataFrame or Series object.
     33 
     34 :param pandas_obj: DataFrame or Series object to infer.
     35 :returns: DataFrameSchema or SeriesSchema
     36 :raises: TypeError if pandas_obj is not expected type.
     37 """
     38 if isinstance(pandas_obj, pd.DataFrame):
---> 39     return infer_dataframe_schema(pandas_obj)
     40 elif isinstance(pandas_obj, pd.Series):
     41     return infer_series_schema(pandas_obj)

File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_inference/pandas.py:73, in infer_dataframe_schema(df)
     67 def infer_dataframe_schema(df: pd.DataFrame) -> DataFrameSchema:
     68     """Infer a DataFrameSchema from a pandas DataFrame.
     69 
     70     :param df: DataFrame object to infer.
     71     :returns: DataFrameSchema
     72     """
---> 73     df_statistics = infer_dataframe_statistics(df)
     74     schema = DataFrameSchema(
     75         columns={
     76             colname: Column(
   (...)
     84         coerce=True,
     85     )
     86     schema._is_inferred = True

File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_statistics/pandas.py:15, in infer_dataframe_statistics(df)
     13 """Infer column and index statistics from a pandas DataFrame."""
     14 nullable_columns = df.isna().any()
---> 15 inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
     16 column_statistics = {
     17     col: {
     18         "dtype": dtype,
   (...)
     22     for col, dtype in inferred_column_dtypes.items()
     23 }
     24 return {
     25     "columns": column_statistics if column_statistics else None,
     26     "index": infer_index_statistics(df.index),
     27 }

File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_statistics/pandas.py:15, in <dictcomp>(.0)
     13 """Infer column and index statistics from a pandas DataFrame."""
     14 nullable_columns = df.isna().any()
---> 15 inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
     16 column_statistics = {
     17     col: {
     18         "dtype": dtype,
   (...)
     22     for col, dtype in inferred_column_dtypes.items()
     23 }
     24 return {
     25     "columns": column_statistics if column_statistics else None,
     26     "index": infer_index_statistics(df.index),
     27 }

File ~/.envs/menv/lib/python3.10/site-packages/pandera/schema_statistics/pandas.py:184, in _get_array_type(x)
    181 def _get_array_type(x):
    182     # get most granular type possible
--> 184     data_type = pandas_engine.Engine.dtype(x.dtype)
    185     # for object arrays, try to infer dtype
    186     if data_type is pandas_engine.Engine.dtype("object"):

File ~/.envs/menv/lib/python3.10/site-packages/pandera/engines/pandas_engine.py:209, in Engine.dtype(cls, data_type)
    206         common_np_dtype = np.dtype(np_or_pd_dtype.name)
    207         np_or_pd_dtype = common_np_dtype.type
--> 209 return engine.Engine.dtype(cls, np_or_pd_dtype)

File ~/.envs/menv/lib/python3.10/site-packages/pandera/engines/engine.py:265, in Engine.dtype(cls, data_type)
    263     return registry.dispatch(data_type)
    264 except (KeyError, ValueError):
--> 265     raise TypeError(
    266         f"Data type '{data_type}' not understood by {cls.__name__}."
    267     ) from None

TypeError: Data type 'string[pyarrow]' not understood by Engine.

Expected behavior

I want to be able to use Pandera with pyarrow backed dataframes

Versions:

  • Pandas : 2.0.2
  • Pandera: 0.15.2
@mattharrison mattharrison added the bug Something isn't working label Jul 15, 2023
@cosmicBboy
Collaborator

@mattharrison I think this would be a feature request: the current scope of pandera is that it doesn't yet support pyarrow datatypes/backend. Gonna close #1162 and merge that with this issue

@cosmicBboy cosmicBboy added enhancement New feature or request and removed bug Something isn't working labels Jul 16, 2023
@franzoni315

Is there a workaround to make validation work with pyarrow types? Or do you have any idea when this will be implemented?

@OliverKleinBST

I would also second the request to support arrow datatypes, which I guess will become the new normal in pandas 2.
My current workaround is to convert the arrow dtypes to nullable numpy dtypes before running pandera:
df.convert_dtypes(infer_objects=False, dtype_backend='numpy_nullable')
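A minimal, self-contained sketch of this workaround (the DataFrame here is hypothetical; the key call is convert_dtypes):

```python
import pandas as pd

# Hypothetical example frame; a pyarrow-backed DataFrame converts the same way
df = pd.DataFrame({"foo": [1, 2], "bar": ["a", "b"]})

# Convert to pandas' nullable, numpy-backed dtypes before running pandera
converted = df.convert_dtypes(dtype_backend="numpy_nullable")
print(converted["foo"].dtype)  # Int64
```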

@juanarrivillaga

I just want to point out that pyarrow will become a required dependency in pandas 3.0, and the arrow string datatype will become the default string datatype (although numeric types will continue to default to numpy types, IIUC):

https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html

@cosmicBboy
Collaborator

Anyone who wants to create PR for this has my blessing!

A good place to start would be:

@aaravind100
Contributor

aaravind100 commented May 3, 2024

@cosmicBboy i took a quick stab at it

adding this to pandas_engine.py

@Engine.register_dtype(equivalents=["int", pd.ArrowDtype(pyarrow.int64())])
@immutable
class ArrowINT64(DataType, dtypes.Int):
    type = pd.ArrowDtype(pyarrow.int64())
    bit_width: int = 64


@Engine.register_dtype(equivalents=["string", pd.ArrowDtype(pyarrow.string())])
@immutable
class ArrowString(DataType, dtypes.String):
    type = pd.ArrowDtype(pyarrow.string())

this gets validated

import pandas as pd
import pandera as pa

df = pd.DataFrame(
    [
        {"foo": 123, "bar": "abc"},
    ],
)


class Schema(pa.DataFrameModel):
    foo: int
    bar: str


print("pandas:")
print(df.dtypes)
print()
print(Schema.validate(df))
print()

df = df.convert_dtypes(dtype_backend="pyarrow")
print("pandas[pyarrow]:")
print(df.dtypes)
print()
print(Schema.validate(df))

output:

pandas:
foo     int64
bar    object
dtype: object

   foo  bar
0  123  abc

pandas[pyarrow]:
foo     int64[pyarrow]
bar    string[pyarrow]
dtype: object

   foo  bar
0  123  abc

would you like me to continue this direction?

@cosmicBboy
Collaborator

cosmicBboy commented May 4, 2024

@aaravind100 the overall approach makes sense! Thanks for taking the initiative on this.

@Engine.register_dtype(equivalents=["int", pd.ArrowDtype(pyarrow.int64())])

Let's avoid overloading "int" here since it's already taken by the numpy int type: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/numpy_engine.py#L163-L165

For the equivalents, pandera has taken the philosophy of "accept whatever pandas (or the underlying dataframe library) accepts as dtypes." So this means:

  • the string alias, e.g. "int64[pyarrow]"
  • the ArrowDtype instance pd.ArrowDtype(pyarrow.int64())

Another thought here: instead of requiring users to wrap the pyarrow dtype in pd.ArrowDtype(...) up front when specifying a pandera schema, we could potentially do the wrapping in the background (would be curious on your thoughts here).

import pandera as pa
import pyarrow

pa.DataFrameSchema({
    "foo": pa.Column(pyarrow.int64()),
    "bar": pa.Column(pyarrow.timestamp(unit="s")),
})

The benefit is that it makes for more concise schemas. As mentioned in the docs, we'll need to make sure to wrap these in pd.ArrowDtype under the hood for parameterized types like pyarrow.timestamp. This is necessary to support DataFrameModel-style schemas:

class Model(pa.DataFrameModel):
    foo: pyarrow.int64  # these need to be types, so pyarrow.int64() is invalid
    bar: pyarrow.timestamp = pa.Field(dtype_kwargs={"unit": "s"})

    # or using typing.Annotated
    bar: Annotated[pyarrow.timestamp, "s"]

So something like:

@Engine.register_dtype(equivalents=["int64[pyarrow]", pyarrow.int64, pyarrow.int64()])  # this makes sure plain pyarrow.int64 is accepted as a dtype in the schema definition
@immutable
class ArrowInt64(DataType, dtypes.Int):
    type = pd.ArrowDtype(pyarrow.int64())  # we wrap this here
    bit_width: int = 64

For parameterized dtypes it'll be slightly more complicated:

@Engine.register_dtype(equivalents=[pyarrow.timestamp])  # pyarrow.timestamp requires a unit, so only the plain type is listed here
@immutable
class ArrowTimestamp(DataType, dtypes.Timestamp):
    type: Optional[pd.ArrowDtype] = dataclasses.field(default=None, init=False)  # we'll set this in __post_init__
    bit_width: int = 64

    unit: Optional[str] = None
    tz: Optional[datetime.tzinfo] = None

    def __post_init__(self):
        type_ = pd.ArrowDtype(pyarrow.timestamp(self.unit, self.tz))
        object.__setattr__(self, "type", type_)

    # this handles creating an instance of ArrowTimestamp in the DataFrameModel
    # schema definition
    @classmethod
    def from_parametrized_dtype(cls, pyarrow_dtype: pyarrow.TimestampType):
        return cls(unit=pyarrow_dtype.unit, tz=pyarrow_dtype.tz)  # type: ignore

@aaravind100
Contributor

pandera has taken the philosophy of "accept whatever pandas (or underlying dataframe library) accepts as dtypes"

thank you, that clears some confusion :)

The suggestion to use pyarrow.<type> does indeed make more sense to me. It also opens up schema/model interoperability with other dataframe libraries that use pyarrow types.

@cosmicBboy
Collaborator

@mattharrison you'll be pleased to learn that #1628 has been merged :) the 0.20.0 release will have these changes. will probably cut a beta release in the next week or so if you wanted to play around with it

@mattharrison
Author

mattharrison commented May 11, 2024 via email
