Support pyspark.sql.DataFrame #1138
Hi @cosmicBboy, I am working on implementing `pyspark.sql.DataFrame` integration for pandera. As I understood the purpose of … With this in mind, I have added a new engine:

```python
"""PySpark engine and data types."""
# pylint:disable=too-many-ancestors

# docstrings are inherited
# pylint:disable=missing-class-docstring

# pylint doesn't know about __init__ generated with dataclass
# pylint:disable=unexpected-keyword-arg,no-value-for-parameter

import builtins
import dataclasses
import datetime
import decimal
import inspect
import warnings
from typing import (
    Any,
    Callable,
    Dict,
    Iterable,
    List,
    Optional,
    Type,
    Union,
    cast,
)

from pydantic import BaseModel, ValidationError

from pandera import dtypes, errors
from pandera.dtypes import immutable
from pandera.engines import engine

import pyspark.sql.types as pst

try:
    import pyarrow  # pylint:disable=unused-import

    PYARROW_INSTALLED = True
except ImportError:
    PYARROW_INSTALLED = False

try:
    from typing import Literal  # type: ignore
except ImportError:
    from typing_extensions import Literal  # type: ignore


@immutable(init=True)
class DataType(dtypes.DataType):
    """Base `DataType` for boxing PySpark data types."""

    type: Any = dataclasses.field(repr=False, init=False)
    """Native pyspark dtype boxed by the data type."""

    def __init__(self, dtype: Any):
        super().__init__()
        object.__setattr__(self, "type", dtype)
        dtype_cls = dtype if inspect.isclass(dtype) else dtype.__class__
        warnings.warn(
            f"'{dtype_cls}' support is not guaranteed.\n"
            + "Usage Tip: Consider writing a custom "
            + "pandera.dtypes.DataType or opening an issue at "
            + "https://github.com/pandera-dev/pandera"
        )

    def __post_init__(self):
        # this method isn't called if __init__ is defined
        object.__setattr__(self, "type", self.type)  # pragma: no cover

    def check(
        self,
        pandera_dtype: dtypes.DataType,
    ) -> Union[bool, Iterable[bool]]:
        try:
            pandera_dtype = Engine.dtype(pandera_dtype)
        except TypeError:
            return False
        # attempts to compare the pyspark native type if possible
        # to let subclasses inherit check
        # (super will compare that DataType classes are exactly the same)
        try:
            return self.type == pandera_dtype.type or super().check(pandera_dtype)
        except TypeError:
            return super().check(pandera_dtype)

    def __str__(self) -> str:
        return str(self.type)

    def __repr__(self) -> str:
        return f"DataType({self})"


class Engine(  # pylint:disable=too-few-public-methods
    metaclass=engine.Engine,
    base_pandera_dtypes=(DataType),
):
    """PySpark data type engine."""

    @classmethod
    def dtype(cls, data_type: Any) -> dtypes.DataType:
        """Convert input into a pyspark-compatible
        Pandera :class:`~pandera.dtypes.DataType` object."""
        try:
            return engine.Engine.dtype(cls, data_type)
        except TypeError:
            raise


###############################################################################
# boolean
###############################################################################


@Engine.register_dtype(
    equivalents=["bool", pst.BooleanType()],
)
@immutable
class Bool(DataType, dtypes.Bool):
    """Semantic representation of a :class:`pyspark.sql.types.BooleanType`."""

    type = pst.BooleanType()
    _bool_like = frozenset({True, False})

    def coerce_value(self, value: Any) -> Any:
        """Coerce a value to the specified boolean type."""
        if value not in self._bool_like:
            raise TypeError(f"value {value} cannot be coerced to type {self.type}")
        return super().coerce_value(value)


@Engine.register_dtype(
    equivalents=["string", pst.StringType()],  # type: ignore
)
@immutable
class String(DataType, dtypes.String):  # type: ignore
    """Semantic representation of a :class:`pyspark.sql.types.StringType`."""

    type = pst.StringType()  # type: ignore


@Engine.register_dtype(
    equivalents=["int", pst.IntegerType()],  # type: ignore
)
@immutable
class Int(DataType, dtypes.Int):  # type: ignore
    """Semantic representation of a :class:`pyspark.sql.types.IntegerType`."""

    type = pst.IntegerType()  # type: ignore


@Engine.register_dtype(
    equivalents=["float", pst.FloatType()],  # type: ignore
)
@immutable
class Float(DataType, dtypes.Float):  # type: ignore
    """Semantic representation of a :class:`pyspark.sql.types.FloatType`."""

    type = pst.FloatType()  # type: ignore
```

Let me know if this is in the right direction. Specifically, pay close attention to … Secondly, what is the difference between … vs …? Looking forward to your thoughts on the above questions. Thanks :)
Yes, this is correct! The …
Notes 3/6/2023 @NeerajMalhotra-QB @cosmicBboy
Notes 4/14/2023 @NeerajMalhotra-QB @cosmicBboy
Thanks Niels (@cosmicBboy) for summarizing our discussion.
Notes 4/20/2023
For pt 1, I think this is one of the areas that needs to be changed to support pyspark generics: …
Hi Niels (@cosmicBboy), here's a sample schema definition using … in … I guess this will need changes in common areas too (e.g. the pandas flows). WDYT?
In case we define the pandera schema with native pyspark types as follows, it gets past the previous issue, but … Now I feel that if we can find a way to set the type correctly with the aliases ('str', 'int', etc.) we defined in the pyspark engine for native pyspark.sql types, it will work.
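The alias idea described here could work roughly like the following stripped-down lookup. This is a sketch only: the table contents and function name are assumptions for illustration, not the engine's actual registry, which would map aliases to pyspark type instances rather than class names:

```python
# Hypothetical alias table: string aliases used in schema definitions
# mapped onto the pyspark.sql.types class names they would resolve to.
PYSPARK_ALIASES = {
    "str": "StringType",
    "string": "StringType",
    "int": "IntegerType",
    "float": "FloatType",
    "bool": "BooleanType",
}

def resolve_alias(alias):
    """Look up the pyspark type name for a schema alias."""
    try:
        return PYSPARK_ALIASES[alias]
    except KeyError as exc:
        raise TypeError(f"unknown alias: {alias!r}") from exc
```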
Hi @NeerajMalhotra-QB, will do a little digging and get back to you early next week!
Hi Niels (@cosmicBboy), I just pushed a PR to support … There are some areas we need to redesign under …
Hi @NeerajMalhotra-QB, check out this PR, which adds support for the bare data type (instead of using …). Re: your PR, I'd rather not introduce a …
Looking into this issue now.
Notes 4/27/2023 @NeerajMalhotra-QB @cosmicBboy
TODO:
Release Plan
@cosmicBboy, I will draft an initial version of the blog post by next week and we can collaborate on it. Let's plan it for right after we merge the changes in the beta release. :)
Awesome, thanks @NeerajMalhotra-QB!
Notes 5/4/2023 @NeerajMalhotra-QB @cosmicBboy
TODO:
Thanks @cosmicBboy for capturing notes.
Hi Niels (@cosmicBboy). I have a quick question on … I noticed that we need to use … But we follow a different style for schema definitions in the case of … So is my understanding correct in saying that if we have to support …? In other words, this is how I will add …
Correct. The reason behind this is that class attributes are reserved for fields, so …
Thanks Niels, it makes sense.
That said, I'm trying to think about other syntax for doing this... like perhaps dunder attributes like …
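The "class attributes are reserved for fields" point can be illustrated with a simplified, hypothetical stand-in for pandera's model machinery: a collector metaclass treats every annotated class attribute as a field, which is why non-field options have to live somewhere else, such as an inner `Config` class:

```python
class ModelMeta(type):
    """Collect annotated class attributes as schema fields (sketch)."""
    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # every annotated attribute becomes a field; anything else
        # (methods, nested classes) is ignored by the field collector
        cls.__fields__ = dict(namespace.get("__annotations__", {}))
        return cls

class BaseModel(metaclass=ModelMeta):
    pass

class UserSchema(BaseModel):
    name: str
    age: int

    class Config:  # non-field options live here, not as class attributes
        coerce = True
```

Here `UserSchema.__fields__` contains only `name` and `age`; `Config` stays out of the field namespace, which is the design constraint being discussed.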
Notes 6/1/2023 @NeerajMalhotra-QB @cosmicBboy
Notes 6/8/2023 @NeerajMalhotra-QB @cosmicBboy
Thanks @cosmicBboy for taking meeting notes :)
Notes 6/15/2023 @NeerajMalhotra-QB @cosmicBboy
Is your feature request related to a problem? Please describe.
Currently, pandera only supports validation of `pyspark.pandas.DataFrame` objects. This issue will track the work needed to support `pyspark.sql.DataFrame` objects.

Describe the solution you'd like
The solution will require work in three major areas:

- `engines`: create a `pyspark_engine.py` that implements just a few of the basic data types (bool, int, float, str). These can be expanded later when we have a working POC.
- `backends`: create a pyspark backend for validating dataframes and the columns contained in the dataframe.
- `api`: depending on how well the current `DataFrameSchema` and `Column` abstractions fit pyspark.sql, we can simply refactor the codebase so that we have a base class that both the pandas and pyspark implementations inherit from.

The initial POC for pyspark support can live in the main
`pandera` codebase itself, since the `pandera[pyspark]` extra already exists. If, after getting a working POC, we feel it would be easier to maintain a separate `pandera-pyspark` package, we can do so (see discussion here), either as a pandera monorepo or as separate repos per data framework (e.g. `pandera-pyspark`, `pandera-polars`, etc.).

Describe alternatives you've considered
NA

Additional context
NA
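The `backends` area described above (validating the columns contained in a dataframe) could be sketched, independent of Spark, as a loop over declared columns. This is a hedged sketch: `validate_columns` is a hypothetical name, and plain `column -> dtype-name` mappings stand in for a pandera schema and for pyspark's `DataFrame.schema`:

```python
def validate_columns(schema, df_dtypes):
    """Check that each declared column exists and has the expected dtype.

    `schema` and `df_dtypes` are plain column->dtype-name mappings,
    standing in for a pandera schema and a dataframe's actual schema.
    Returns a list of human-readable failure messages (empty on success).
    """
    failures = []
    for column, expected in schema.items():
        actual = df_dtypes.get(column)
        if actual is None:
            failures.append(f"column {column!r} is missing")
        elif actual != expected:
            failures.append(
                f"column {column!r}: expected {expected}, got {actual}"
            )
    return failures
```

Collecting failures rather than raising on the first one matches the goal of reporting all schema errors in a single validation pass.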