Releases: unionai-oss/pandera
0.11.0: Docs support dark mode, custom names and errors for built-in checks, bug fixes
Big shoutout to the contributors on this release!
Highlights
Docs Get Dark Mode 🌓
Just a little something for folks who prefer dark mode!
Enhancements
- Make DataFrameSchema respect subclassing #830
- Feature: Add support for Generic to SchemaModel #810
- feat: make schema available in SchemaErrors #831
- add support for custom name and error in builtin checks #843
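For #843, a minimal sketch of what custom names and errors on built-in checks look like (the schema and values here are illustrative, not from the release notes):
```python
import pandas as pd
import pandera as pa

# built-in check with a custom name and error message (#843)
schema = pa.DataFrameSchema({
    "price": pa.Column(
        int,
        pa.Check.in_range(
            5, 20,
            name="price_range",
            error="price must be between 5 and 20",
        ),
    )
})

try:
    schema.validate(pd.DataFrame({"price": [3]}))
except pa.errors.SchemaError as exc:
    print(exc)  # the custom error message shows up in the failure report
```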
Bugfixes
- Make DataFrameSchema respect subclassing #830
- fix pandas_engine.DateTime.coerce_value not consistent with coerce #827
- fix mypy 9c5eaa3
Documentation Improvements
- Dark docs #841
0.11.0b1: fix mypy error
Beta release v0.11.0b1.
0.11.0b0: Docs support dark mode, custom names and errors for built-in checks, bug fixes
v0.11.0b0 beta release for 0.11.0
0.10.1: Pyspark documentation fixes
Patch release v0.10.1.
0.10.0: Pyspark.pandas Support, PydanticModel datatype, Performance Improvements
Highlights
pandera now supports pyspark dataframe validation via `pyspark.pandas`, and the pandera koalas integration has been deprecated. You can now `pip install pandera[pyspark]` and validate `pyspark.pandas` dataframes:
```python
import pyspark.pandas as ps
import pandas as pd
import pandera as pa

from pandera.typing.pyspark import DataFrame, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


# create a pyspark.pandas dataframe that's validated on object initialization
df = DataFrame[Schema](
    {
        'state': ['FL', 'FL', 'FL', 'CA', 'CA', 'CA'],
        'city': [
            'Orlando',
            'Miami',
            'Tampa',
            'San Francisco',
            'Los Angeles',
            'San Diego',
        ],
        'price': [8, 12, 10, 16, 20, 18],
    }
)
print(df)
```
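Continuing the example above, the same schema also works with `@pa.check_types`-decorated functions, mirroring the dask/modin/koalas examples from the 0.8.0 release (`filter_ca` is a hypothetical function name):
```python
@pa.check_types
def filter_ca(df: DataFrame[Schema]) -> DataFrame[Schema]:
    # both the input and the returned pyspark.pandas dataframe
    # are validated against Schema at call time
    return df[df["state"] == "CA"]


print(filter_ca(df))
```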
PydanticModel DataType Enables Row-wise Validation with a pydantic model
Pandera now supports row-wise validation by applying a pydantic model as a dataframe-level dtype:
```python
import pandas as pd
import pandera as pa
from pandera.engines.pandas_engine import PydanticModel
from pydantic import BaseModel


class Record(BaseModel):
    name: str
    xcoord: str
    ycoord: int


class PydanticSchema(pa.SchemaModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = True  # this is required, otherwise a SchemaInitError is raised
```
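A usage sketch (not from the original notes), continuing the snippet above: each row is coerced into a `Record` instance during validation, so values that fail pydantic's type checks fail the schema:
```python
df = pd.DataFrame({
    "name": ["foo", "bar"],
    "xcoord": ["1.0", "2.0"],
    "ycoord": [2, 3],
})
PydanticSchema.validate(df)  # validates each row as a Record
```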
Improved conda installation experience
Before this release there were only two conda packages: one to install pandera-core, and another to install pandera (which would install all extras functionality). The conda packaging now supports finer-grained control:
```bash
conda install -c conda-forge pandera-hypotheses  # hypothesis checks
conda install -c conda-forge pandera-io          # yaml/script schema io utilities
conda install -c conda-forge pandera-strategies  # data synthesis strategies
conda install -c conda-forge pandera-mypy        # enable static type-linting of pandas
conda install -c conda-forge pandera-fastapi     # fastapi integration
conda install -c conda-forge pandera-dask        # validate dask dataframes
conda install -c conda-forge pandera-pyspark     # validate pyspark dataframes
conda install -c conda-forge pandera-modin       # validate modin dataframes
conda install -c conda-forge pandera-modin-ray   # validate modin dataframes with ray
conda install -c conda-forge pandera-modin-dask  # validate modin dataframes with dask
```
Enhancements
- Add option to disallow duplicate column names #758 (see the sketch after this list)
- Make SchemaModel use class name, define own config #761
- implement coercion-on-initialization for DataFrame[SchemaModel] types #772
- Update filtering columns for performance reasons. #777
- implement pydantic model data type #779
- make finding coerce failure cases faster #792
- add pyspark support, deprecate koalas #793
- Add overloads to schema.to_yaml #790
- Add overloads to infer_schema #789
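For #758, a minimal sketch, assuming the option is exposed as the `unique_column_names` constructor argument:
```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {"col": pa.Column(int)},
    unique_column_names=True,  # assumption: the option added in #758
)

df = pd.DataFrame([[1, 2]], columns=["col", "col"])
schema.validate(df)  # raises SchemaError due to the duplicated column name
```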
Docs Improvements
- add imports to fastapi docs
- add documentation for pandas_engine.DateTime #780
- update docs for 0.10.0 #795
- update docs with fastapi #804
0.9.0: FastAPI Integration, Support GeoPandas DataFrames
Highlights
FastAPI Integration [Docs]
pandera now integrates with fastapi. You can decorate app endpoint arguments with `DataFrame[Schema]` types and the endpoint will validate incoming and outgoing data.
```python
from typing import Optional

from pydantic import BaseModel, Field

import pandera as pa


# schema definitions
class Transactions(pa.SchemaModel):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float] = pa.Field(ge=0, le=1000)

    class Config:
        coerce = True


class TransactionsOut(Transactions):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float]
    name: pa.typing.Series[str]


class TransactionsDictOut(TransactionsOut):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}
```
App endpoint example:
```python
from fastapi import FastAPI, File

from pandera.typing import DataFrame

app = FastAPI()


@app.post("/transactions/", response_model=DataFrame[TransactionsDictOut])
def create_transactions(transactions: DataFrame[Transactions]):
    output = transactions.assign(name="foo")
    ...  # do other stuff, e.g. update backend database with transactions
    return output
```
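A hedged sketch (not from the release notes) of exercising the endpoint with FastAPI's test client, assuming the `DataFrame[Transactions]` request body accepts anything the `pd.DataFrame` constructor can consume, such as a column-oriented dict:
```python
from fastapi.testclient import TestClient

client = TestClient(app)
response = client.post(
    "/transactions/",
    json={"id": [1, 2], "cost": [10.5, 12.0]},
)
# TransactionsDictOut serializes the validated output as a list of records
print(response.json())
```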
Data Format Conversion [Docs]
The class-based API now supports automatically deserializing/serializing pandas dataframes in the context of `@pa.check_types`-decorated functions, `@pydantic.validate_arguments`-decorated functions, and fastapi endpoint functions.
```python
import pandera as pa
from pandera.typing import DataFrame, Series


# base schema definitions
class InSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
    int_col: Series[int]


class OutSchema(InSchema):
    float_col: pa.typing.Series[float]


# read and validate data from a parquet file
class InSchemaParquet(InSchema):
    class Config:
        from_format = "parquet"


# output data as a list of dictionary records
class OutSchemaDict(OutSchema):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}


@pa.check_types
def transform(df: DataFrame[InSchemaParquet]) -> DataFrame[OutSchemaDict]:
    return df.assign(float_col=1.1)
```
The `transform` function can then take a filepath or buffer containing a parquet file, which pandera automatically reads and validates:
```python
import io
import json

import pandas as pd

buffer = io.BytesIO()
data = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)})
data.to_parquet(buffer)
buffer.seek(0)

dict_output = transform(buffer)
print(json.dumps(dict_output, indent=4))
```
Output:
```json
[
    {
        "str_col": "a",
        "int_col": 0,
        "float_col": 1.1
    },
    {
        "str_col": "b",
        "int_col": 1,
        "float_col": 1.1
    },
    {
        "str_col": "c",
        "int_col": 2,
        "float_col": 1.1
    }
]
```
Data Validation with GeoPandas [Docs]
`DataFrameSchema`s can now validate `geopandas.GeoDataFrame` and `GeoSeries` objects:
```python
import geopandas as gpd
import pandas as pd
import pandera as pa
from shapely.geometry import Polygon

geo_schema = pa.DataFrameSchema({
    "geometry": pa.Column("geometry"),
    "region": pa.Column(str),
})

geo_df = gpd.GeoDataFrame({
    "geometry": [
        Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
        Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0))),
    ],
    "region": ["NA", "SA"],
})

geo_schema.validate(geo_df)
```
You can also define `SchemaModel` classes with a `GeoSeries` field type annotation to create validated `GeoDataFrame`s, or use them in `@pa.check_types`-decorated functions for input/output validation:
```python
from pandera.typing import Series
from pandera.typing.geopandas import GeoDataFrame, GeoSeries


class Schema(pa.SchemaModel):
    geometry: GeoSeries
    region: Series[str]


# create a geodataframe that's validated on object initialization
df = GeoDataFrame[Schema](
    {
        'geometry': [
            Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
            Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0))),
        ],
        'region': ['NA', 'SA'],
    }
)
```
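A minimal sketch of the `@pa.check_types` usage mentioned above, continuing the same example (`buffer_regions` is a hypothetical function name):
```python
@pa.check_types
def buffer_regions(df: GeoDataFrame[Schema]) -> GeoDataFrame[Schema]:
    # input and output geodataframes are validated against Schema
    return df.assign(geometry=df.geometry.buffer(0.1))


buffer_regions(df)
```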
Enhancements
- Support GeoPandas data structures (#732)
- Fastapi integration (#741)
- add title/description fields (#754) (see the sketch after this list)
- add nullable float dtypes (#721)
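For #754, a minimal sketch, assuming `title` and `description` are accepted as metadata arguments on `pa.Field` and `pa.Column`:
```python
import pandera as pa


class Transactions(pa.SchemaModel):
    cost: pa.typing.Series[float] = pa.Field(
        ge=0,
        title="transaction cost",  # assumption: metadata field added in #754
        description="cost of the transaction in USD",
    )


schema = pa.DataFrameSchema({
    "cost": pa.Column(float, title="transaction cost", description="cost in USD"),
})
```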
Bugfixes
- typed descriptors and setup.py only includes pandera (#739)
- `@pa.dataframe_check` works correctly on pandas==1.1.5 (#735)
- fix set_index with MultiIndex (#751)
- strategies: correctly handle StringArray null values (#748)
Docs Improvements
- fastapi docs, add to ci (#753)
Testing Improvements
- Add Python 3.10 to CI matrix (#724)
Contributors
Big shout out to the following folks for your contributions on this release 🎉🎉🎉
0.8.1: Mypy Plugin, Better Editor Type Annotation Autocomplete, Pickleable SchemaError(s), Improved Error-reporting, Bugfixes
Enhancements
- add `__all__` declaration to root module for better editor autocompletion 42e60c6
- fix: expose nullable boolean in pandera.typing 5f9c713
- type annotations for DataFrameSchema (#700)
- add head of coerce failure cases (#710)
- add mypy plugin (#701) (see the config sketch after this list)
- make SchemaError and SchemaErrors picklable (#722)
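For #701, enabling the plugin is a one-line addition to your mypy configuration; a minimal sketch, assuming the standard mypy plugin mechanism:
```ini
# mypy.ini (or the [mypy] section of setup.cfg)
[mypy]
plugins = pandera.mypy
```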
Bugfixes
- Only concat and drop_duplicates if more than one of {sample,head,tail} are present d3bc974, f756166, 20a631f
- fix field autocompletion (#702)
Docs Improvements
- Update contributing documentation: how to add dependencies #696
- update package description in setup.py eb130b4
- Fix broken links in dataframe_schemas.rst (#708)
Contributors
Big shout out to the following folks for your contributions on this release 🎉🎉🎉
0.8.0: Integrate with Dask, Koalas, Modin, Pydantic, Mypy
Community Announcements
Pandera now has a discord community! Join us if you need help, want to discuss features/bugs, or help other community members 🤝
Highlights
Schema support for Dask, Koalas, Modin
Excited to announce that 0.8.0 is the first release that adds built-in support for additional dataframe types beyond pandas: you can now use the exact same `DataFrameSchema` objects or `SchemaModel` classes to validate Dask, Modin, and Koalas dataframes.
```python
import dask.dataframe as dd
import pandas as pd
import pandera as pa

from pandera.typing import dask, koalas, modin, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


@pa.check_types
def dask_function(ddf: dask.DataFrame[Schema]) -> dask.DataFrame[Schema]:
    return ddf[ddf["state"] == "CA"]


@pa.check_types
def koalas_function(df: koalas.DataFrame[Schema]) -> koalas.DataFrame[Schema]:
    return df[df["state"] == "CA"]


@pa.check_types
def modin_function(df: modin.DataFrame[Schema]) -> modin.DataFrame[Schema]:
    return df[df["state"] == "CA"]
```
And `DataFrameSchema` objects will work on all dataframe types:
```python
schema: pa.DataFrameSchema = Schema.to_schema()
schema(dask_df)
schema(modin_df)
schema(koalas_df)
```
Pydantic Integration
`pandera.SchemaModel`s are fully compatible with pydantic:
```python
import pandas as pd
import pandera as pa
import pydantic

from pandera.typing import DataFrame, Series


class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)


class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]


valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
PydanticModel(x=1, df=valid_df)

invalid_df = pd.DataFrame({"str_col": ["hello", "hello"]})
PydanticModel(x=1, df=invalid_df)  # raises ValidationError
```
Error:
```
Traceback (most recent call last):
...
ValidationError: 1 validation error for PydanticModel
df
  series 'str_col' contains duplicate values:
  1    hello
  Name: str_col, dtype: object (type=value_error)
```
Mypy Integration
Pandera now supports static type-linting of `DataFrame` types with mypy out of the box, so you can catch certain classes of errors at lint-time.
```python
import pandas as pd

import pandera as pa
from pandera.typing import DataFrame, Series


class Schema(pa.SchemaModel):
    id: Series[int]
    name: Series[str]


class SchemaOut(pa.SchemaModel):
    age: Series[int]


class AnotherSchema(pa.SchemaModel):
    foo: Series[int]


def fn(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay


def fn_pipe_incorrect_type(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[AnotherSchema])  # mypy error
# error: Argument 1 to "pipe" of "NDFrame" has incompatible type "Type[DataFrame[Any]]";
# expected "Union[Callable[..., DataFrame[SchemaOut]], Tuple[Callable[..., DataFrame[SchemaOut]], str]]" [arg-type]  # noqa


schema_df = DataFrame[Schema]({"id": [1], "name": ["foo"]})
pandas_df = pd.DataFrame({"id": [1], "name": ["foo"]})

fn(schema_df)  # mypy okay
fn(pandas_df)  # mypy error
# error: Argument 1 to "fn" has incompatible type "pandas.core.frame.DataFrame";
# expected "pandera.typing.pandas.DataFrame[Schema]" [arg-type]
```
Enhancements
- 735e7fe implement dataframe types (#672)
- 46dc3a2 Support mypy (#650)
- 02063c8 Add Basic Dask Support (#665)
- b7f6516 Modin support (#660)
- cdf4667 Add Pydantic support (#659)
- 12378ea Support Koalas (#658)
- 62d689d improve lazy validation performance for nullable cases (#655)
Bugfixes
- 7a98e23 bugfix: support nullable empty strategies (#638)
- 5ec4611 Fix remaining unrecognized numpy dtypes (#637)
- 96d6516 Correctly handling single string constraints (#670)
Docs Improvements
- 1860685 add pyproject.toml, update doc typos
- 3c086a9 add discord link, update readme, docs (#674)
- d75298f more detailed docstring of pandera.model_components.Field (#671)
- 96415a0 Add strictly typed pandas to readme (#649)
Testing Improvements
Internals Improvements
- fdcdb91 Reuse coerce in engines.utils (#645)
- 655dd85 remove assumption from nullable strategies (#641)
Contributors
Big shout out to the following folks for your contributions on this release 🎉🎉🎉
- @sbrugman
- @rbngz
- @jeffzi
- @bphillips-exos
- @thorben-flapo
- @tfwillems: special shout out here for contributing a good chunk of the code for the pydantic plugin #659
0.7.2: Bugfixes
Bugfixes
- Strategies should not rely on pandas dtype aliases (#620)
- support timedelta in data synthesis strats (#621)
- fix multiindex error reporting (#622)
- Pin pylint (#629)
- exclude np.float128 type registration in MacM1 (#624)
- fix numpy_pandas_coercible bug dealing with single element (#626)
- update pylint (#630)
0.7.1: Add unique option to DataFrameSchema
Enhancements
- add support for Any annotation in schema model (#594)
- add support for timezone-aware datetime strategies (#595)
- `unique` keyword arg: replace and deprecate `allow_duplicates` (#580) (see the sketch after this list)
- Add support for empty data type annotation in SchemaModel (#602)
- support frictionless primary keys with multiple fields (#608)
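For #580, a minimal sketch, assuming `unique` is accepted both at the column level and on `DataFrameSchema` for joint uniqueness:
```python
import pandas as pd
import pandera as pa

# column-level uniqueness, previously allow_duplicates=False
schema = pa.DataFrameSchema({"id": pa.Column(int, unique=True)})

# dataframe-level uniqueness across a set of columns
joint_schema = pa.DataFrameSchema(
    {"a": pa.Column(int), "b": pa.Column(int)},
    unique=["a", "b"],  # assumption: rows must be jointly unique over a and b
)

schema.validate(pd.DataFrame({"id": [1, 2, 2]}))  # raises SchemaError
```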
Bugfixes
- unify `typing.DataFrame` class definitions (#576)
- schemas with multi-index columns correctly report errors (#600)
- strategies module supports undefined checks in regex columns (#599)
- fix validation of check raising error without message (#613)
Docs Improvements
- Tutorial: docs/scaling - Bring Pandera to Spark and Dask (#588)
Repo Improvements
- use virtualenv instead of conda in ci (#578)
Dependency Changes
Contributors
🎉🎉 Big shout out to all the contributors on this release 🎉🎉
- @admackin
- @jeffzi
- @tfwillems
- @fkrull8
- @kvnkho