Pandas2 / pyarrow backend support #1262
Comments
@mattharrison I think this would be a feature request: the current scope of pandera is that it doesn't yet support pyarrow datatypes/backend. Gonna close #1162 and merge that with this issue |
Is there a workaround to make validation work with pyarrow types? Or do you have any idea when this will be implemented? |
I would also back the request to support Arrow datatypes, which I guess will become the new normal in pandas 2. |
I just want to point out that https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html |
Anyone who wants to create PR for this has my blessing! A good place to start would be:
|
@cosmicBboy I took a quick stab at it by adding this to pandas_engine.py:

@Engine.register_dtype(equivalents=["int", pd.ArrowDtype(pyarrow.int64())])
@immutable
class ArrowINT64(DataType, dtypes.Int):
    type = pd.ArrowDtype(pyarrow.int64())
    bit_width: int = 64

@Engine.register_dtype(equivalents=["string", pd.ArrowDtype(pyarrow.string())])
@immutable
class ArrowString(DataType, dtypes.String):
    type = pd.ArrowDtype(pyarrow.string())

This gets validated:

import pandas as pd
import pandera as pa
df = pd.DataFrame(
    [
        {"foo": 123, "bar": "abc"},
    ],
)

class Schema(pa.DataFrameModel):
    foo: int
    bar: str

print("pandas:")
print(df.dtypes)
print()
print(Schema.validate(df))
print()
df = df.convert_dtypes(dtype_backend="pyarrow")
print("pandas[pyarrow]:")
print(df.dtypes)
print()
print(Schema.validate(df))

Output:

pandas:
foo     int64
bar    object
dtype: object

   foo  bar
0  123  abc

pandas[pyarrow]:
foo     int64[pyarrow]
bar    string[pyarrow]
dtype: object

   foo  bar
0  123  abc

Would you like me to continue in this direction? |
@aaravind100 the overall approach makes sense! Thanks for taking the initiative on this.

@Engine.register_dtype(equivalents=["int", pd.ArrowDtype(pyarrow.int64())])

Let's avoid overloading "int" here since it's already taken by the numpy int type: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/numpy_engine.py#L163-L165

For the
Another thought here is instead of requiring users to wrap the pyarrow dtype in

import pandera as pa
import pyarrow

pa.DataFrameSchema({
    "foo": pa.Column(pyarrow.int64()),
    "bar": pa.Column(pyarrow.timestamp(unit="s")),
})

The benefit is that it is more concise. As mentioned in the docs, we'll need to make sure to wrap these in

class Model(pa.DataFrameModel):
    foo: pyarrow.int64  # these need to be types, so pyarrow.int64() is invalid
    bar: pyarrow.timestamp = pa.Field(dtype_kwargs={"unit": "s"})
    # or using typing.Annotated
    bar: Annotated[pyarrow.timestamp, "s"]

So something like:

# this makes sure plain pyarrow.int64 is accepted as dtype in the schema definition
@Engine.register_dtype(equivalents=["int", pyarrow.int64, pyarrow.int64()])
@immutable
class ArrowInt64(DataType, dtypes.Int):
    type = pd.ArrowDtype(pyarrow.int64())  # we wrap this here
    bit_width: int = 64

For parameterized dtypes it'll be slightly more complicated:

@Engine.register_dtype(equivalents=["int", pyarrow.timestamp, pyarrow.timestamp()])
@immutable
class ArrowTimestamp(DataType, dtypes.Timestamp):
    type: Optional[pd.ArrowDtype] = dataclasses.field(default=None, init=False)  # we'll set this in __post_init__
    bit_width: int = 64
    unit: Optional[str] = None
    tz: Optional[datetime.tzinfo] = None

    def __post_init__(self):
        type_ = pd.ArrowDtype(pyarrow.timestamp(self.unit, self.tz))
        object.__setattr__(self, "type", type_)

    # this handles creating an instance of ArrowTimestamp in the DataFrameModel
    # schema definition
    @classmethod
    def from_parametrized_dtype(cls, pyarrow_dtype: pyarrow.TimestampType):
        return cls(unit=pyarrow_dtype.unit, tz=pyarrow_dtype.tz)  # type: ignore
|
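The registration idea in the comment above (plain aliases and factory functions map to a dtype class, while parameterized instances go through `from_parametrized_dtype`) can be sketched with the standard library alone. This is not pandera's actual engine: `Registry`, `FakeTimestampType`, `fake_timestamp`, `ArrowTimestampSketch` and the `parametrized_type` attribute are all hypothetical names used only to illustrate the dispatch.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class FakeTimestampType:
    """Hypothetical stand-in for a parameterized pyarrow timestamp instance."""
    unit: str
    tz: Optional[str] = None


def fake_timestamp(unit: str, tz: Optional[str] = None) -> FakeTimestampType:
    """Hypothetical stand-in for a factory function such as pyarrow.timestamp."""
    return FakeTimestampType(unit, tz)


class Registry:
    """Toy dtype registry (an assumption of this sketch, not pandera's Engine)."""

    def __init__(self) -> None:
        self._equivalents: Dict[Any, type] = {}

    def register_dtype(self, equivalents):
        def decorator(cls):
            # Every alias or factory in `equivalents` resolves to this class.
            for key in equivalents:
                self._equivalents[key] = cls
            return cls
        return decorator

    def dtype(self, obj: Any, **kwargs):
        # Case 1: exact match on a registered alias or factory function.
        if obj in self._equivalents:
            return self._equivalents[obj](**kwargs)
        # Case 2: a parameterized instance, dispatched on its type.
        for cls in set(self._equivalents.values()):
            param_type = getattr(cls, "parametrized_type", None)
            if param_type is not None and isinstance(obj, param_type):
                return cls.from_parametrized_dtype(obj)
        raise TypeError(f"no registered dtype for {obj!r}")


registry = Registry()


@registry.register_dtype(equivalents=["arrow_timestamp", fake_timestamp])
@dataclass(frozen=True)
class ArrowTimestampSketch:
    unit: str = "ns"
    tz: Optional[str] = None
    # Un-annotated class attribute, so it is not a dataclass field.
    parametrized_type = FakeTimestampType

    @classmethod
    def from_parametrized_dtype(cls, dtype: FakeTimestampType):
        # Copy the parameters off the instance, mirroring the comment above.
        return cls(unit=dtype.unit, tz=dtype.tz)


from_factory = registry.dtype(fake_timestamp, unit="s")
from_instance = registry.dtype(fake_timestamp("ms", "UTC"))
from_alias = registry.dtype("arrow_timestamp")
```

In this sketch the unparameterized factory plays the role of a `DataFrameModel` annotation, where parameters arrive separately as keyword arguments, while the parameterized instance plays the role of a dtype passed to `pa.Column`.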
Thank you, that clears up some confusion :) The suggestion for using |
@mattharrison you'll be pleased to learn that #1628 has been merged :) The 0.20.0 release will have these changes. I'll probably cut a beta release in the next week or so if you want to play around with it. |
👍
…On Fri, May 10, 2024, 7:47 PM Niels Bantilan ***@***.***> wrote:
|
Describe the bug
I can't generate a schema from a pyarrow-backed dataframe
Code Sample, a copy-pastable example
Expected behavior
I want to be able to use Pandera with pyarrow backed dataframes
Versions: