Releases: unionai-oss/pandera
0.11.0: Docs support dark mode, custom names and errors for built-in checks, bug fixes
Big shoutout to the contributors on this release!
Highlights
Docs Get Dark Mode 🌓
Just a little something for folks who prefer dark mode!
Enhancements
- Make DataFrameSchema respect subclassing #830
- Feature: Add support for Generic to SchemaModel #810
- feat: make schema available in SchemaErrors #831
- add support for custom name and error in builtin checks #843
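For #843, a minimal sketch of what custom names and errors on built-in checks look like (the schema and values here are illustrative, not from the release notes):
```python
import pandas as pd
import pandera as pa

# built-in check with a custom name and error message (#843)
schema = pa.DataFrameSchema({
    "price": pa.Column(
        int,
        pa.Check.in_range(
            5, 20,
            name="price_range",
            error="price must be between 5 and 20",
        ),
    )
})

try:
    schema.validate(pd.DataFrame({"price": [3]}))
except pa.errors.SchemaError as exc:
    print(exc)  # the custom error message shows up in the failure report
```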
Bugfixes
- Make DataFrameSchema respect subclassing #830
- fix pandas_engine.DateTime.coerce_value not consistent with coerce #827
- fix mypy 9c5eaa3
Documentation Improvements
- Dark docs #841
0.11.0b1: fix mypy error
Beta release v0.11.0b1.
0.11.0b0: Docs support dark mode, custom names and errors for built-in checks, bug fixes
v0.11.0b0 beta release for 0.11.0
0.10.1: Pyspark documentation fixes
Patch release v0.10.1.
0.10.0: Pyspark.pandas Support, PydanticModel datatype, Performance Improvements
Highlights
pandera now supports pyspark dataframe validation via `pyspark.pandas`, and the pandera koalas integration has been deprecated. You can now `pip install pandera[pyspark]` and validate `pyspark.pandas` dataframes:
```python
import pyspark.pandas as ps
import pandas as pd
import pandera as pa

from pandera.typing.pyspark import DataFrame, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


# create a pyspark.pandas dataframe that's validated on object initialization
df = DataFrame[Schema](
    {
        'state': ['FL', 'FL', 'FL', 'CA', 'CA', 'CA'],
        'city': [
            'Orlando',
            'Miami',
            'Tampa',
            'San Francisco',
            'Los Angeles',
            'San Diego',
        ],
        'price': [8, 12, 10, 16, 20, 18],
    }
)
print(df)
```
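Continuing the example above, the same schema also works with `@pa.check_types`-decorated functions, mirroring the dask/modin/koalas examples from the 0.8.0 release (`filter_ca` is a hypothetical function name):
```python
@pa.check_types
def filter_ca(df: DataFrame[Schema]) -> DataFrame[Schema]:
    # both the input and the returned pyspark.pandas dataframe
    # are validated against Schema at call time
    return df[df["state"] == "CA"]


print(filter_ca(df))
```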
PydanticModel DataType Enables Row-wise Validation with a pydantic model
Pandera now supports row-wise validation by applying a pydantic model as a dataframe-level dtype:
```python
import pandas as pd
import pandera as pa
from pandera.engines.pandas_engine import PydanticModel
from pydantic import BaseModel


class Record(BaseModel):
    name: str
    xcoord: str
    ycoord: int


class PydanticSchema(pa.SchemaModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = True  # this is required, otherwise a SchemaInitError is raised
```
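A usage sketch (not from the original notes), continuing the snippet above: each row is coerced into a `Record` instance during validation, so values that fail pydantic's type checks fail the schema:
```python
df = pd.DataFrame({
    "name": ["foo", "bar"],
    "xcoord": ["1.0", "2.0"],
    "ycoord": [2, 3],
})
PydanticSchema.validate(df)  # validates each row as a Record
```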
Improved conda installation experience
Before this release there were only two conda packages: one to install pandera-core, and another to install pandera (which would install all extras functionality). The conda packaging now supports finer-grained control:
```bash
conda install -c conda-forge pandera-hypotheses  # hypothesis checks
conda install -c conda-forge pandera-io          # yaml/script schema io utilities
conda install -c conda-forge pandera-strategies  # data synthesis strategies
conda install -c conda-forge pandera-mypy        # enable static type-linting of pandas
conda install -c conda-forge pandera-fastapi     # fastapi integration
conda install -c conda-forge pandera-dask        # validate dask dataframes
conda install -c conda-forge pandera-pyspark     # validate pyspark dataframes
conda install -c conda-forge pandera-modin       # validate modin dataframes
conda install -c conda-forge pandera-modin-ray   # validate modin dataframes with ray
conda install -c conda-forge pandera-modin-dask  # validate modin dataframes with dask
```
Enhancements
- Add option to disallow duplicate column names #758 (see the sketch after this list)
- Make SchemaModel use class name, define own config #761
- implement coercion-on-initialization for DataFrame[SchemaModel] types #772
- Update filtering columns for performance reasons. #777
- implement pydantic model data type #779
- make finding coerce failure cases faster #792
- add pyspark support, deprecate koalas #793
- Add overloads to schema.to_yaml #790
- Add overloads to infer_schema #789
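For #758, a minimal sketch, assuming the option is exposed as the `unique_column_names` constructor argument:
```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {"col": pa.Column(int)},
    unique_column_names=True,  # assumption: the option added in #758
)

df = pd.DataFrame([[1, 2]], columns=["col", "col"])
schema.validate(df)  # raises SchemaError due to the duplicated column name
```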
Docs Improvements
- add imports to fastapi docs
- add documentation for pandas_engine.DateTime #780
- update docs for 0.10.0 #795
- update docs with fastapi #804
0.9.0: FastAPI Integration, Support GeoPandas DataFrames
Highlights
FastAPI Integration [Docs]
pandera now integrates with fastapi. You can decorate app endpoint arguments with `DataFrame[Schema]` types and the endpoint will validate incoming and outgoing data.
```python
from typing import Optional

from pydantic import BaseModel, Field

import pandera as pa


# schema definitions
class Transactions(pa.SchemaModel):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float] = pa.Field(ge=0, le=1000)

    class Config:
        coerce = True


class TransactionsOut(Transactions):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float]
    name: pa.typing.Series[str]


class TransactionsDictOut(TransactionsOut):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}
```
App endpoint example:
```python
from fastapi import FastAPI, File

from pandera.typing import DataFrame

app = FastAPI()


@app.post("/transactions/", response_model=DataFrame[TransactionsDictOut])
def create_transactions(transactions: DataFrame[Transactions]):
    output = transactions.assign(name="foo")
    ...  # do other stuff, e.g. update backend database with transactions
    return output
```
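A hedged sketch (not from the release notes) of exercising the endpoint with FastAPI's test client, assuming the `DataFrame[Transactions]` request body accepts anything the `pd.DataFrame` constructor can consume, such as a column-oriented dict:
```python
from fastapi.testclient import TestClient

client = TestClient(app)
response = client.post(
    "/transactions/",
    json={"id": [1, 2], "cost": [10.5, 12.0]},
)
# TransactionsDictOut serializes the validated output as a list of records
print(response.json())
```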
Data Format Conversion [Docs]
The class-based API now supports automatically deserializing/serializing pandas dataframes in the context of `@pa.check_types`-decorated functions, `@pydantic.validate_arguments`-decorated functions, and fastapi endpoint functions.
```python
import pandera as pa
from pandera.typing import DataFrame, Series


# base schema definitions
class InSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
    int_col: Series[int]


class OutSchema(InSchema):
    float_col: pa.typing.Series[float]


# read and validate data from a parquet file
class InSchemaParquet(InSchema):
    class Config:
        from_format = "parquet"


# output data as a list of dictionary records
class OutSchemaDict(OutSchema):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}


@pa.check_types
def transform(df: DataFrame[InSchemaParquet]) -> DataFrame[OutSchemaDict]:
    return df.assign(float_col=1.1)
```
The `transform` function can then take a filepath or buffer containing a parquet file, which pandera automatically reads and validates:
```python
import io
import json

import pandas as pd

buffer = io.BytesIO()
data = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)})
data.to_parquet(buffer)
buffer.seek(0)

dict_output = transform(buffer)
print(json.dumps(dict_output, indent=4))
```
Output:
```json
[
    {
        "str_col": "a",
        "int_col": 0,
        "float_col": 1.1
    },
    {
        "str_col": "b",
        "int_col": 1,
        "float_col": 1.1
    },
    {
        "str_col": "c",
        "int_col": 2,
        "float_col": 1.1
    }
]
```
Data Validation with GeoPandas [Docs]
`DataFrameSchema`s can now validate `geopandas.GeoDataFrame` and `GeoSeries` objects:
```python
import geopandas as gpd
import pandas as pd
import pandera as pa
from shapely.geometry import Polygon

geo_schema = pa.DataFrameSchema({
    "geometry": pa.Column("geometry"),
    "region": pa.Column(str),
})

geo_df = gpd.GeoDataFrame({
    "geometry": [
        Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
        Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0))),
    ],
    "region": ["NA", "SA"],
})

geo_schema.validate(geo_df)
```
You can also define `SchemaModel` classes with a `GeoSeries` field type annotation to create validated `GeoDataFrame`s, or use them in `@pa.check_types`-decorated functions for input/output validation:
```python
from pandera.typing import Series
from pandera.typing.geopandas import GeoDataFrame, GeoSeries


class Schema(pa.SchemaModel):
    geometry: GeoSeries
    region: Series[str]


# create a geodataframe that's validated on object initialization
df = GeoDataFrame[Schema](
    {
        'geometry': [
            Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
            Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0))),
        ],
        'region': ['NA', 'SA'],
    }
)
```
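A minimal sketch of the `@pa.check_types` usage mentioned above, continuing the same example (`buffer_regions` is a hypothetical function name):
```python
@pa.check_types
def buffer_regions(df: GeoDataFrame[Schema]) -> GeoDataFrame[Schema]:
    # input and output geodataframes are validated against Schema
    return df.assign(geometry=df.geometry.buffer(0.1))


buffer_regions(df)
```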
Enhancements
- Support GeoPandas data structures (#732)
- Fastapi integration (#741)
- add title/description fields (#754) (see the sketch after this list)
- add nullable float dtypes (#721)
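For #754, a minimal sketch, assuming `title` and `description` are accepted as metadata arguments on `pa.Field` and `pa.Column`:
```python
import pandera as pa


class Transactions(pa.SchemaModel):
    cost: pa.typing.Series[float] = pa.Field(
        ge=0,
        title="transaction cost",  # assumption: metadata field added in #754
        description="cost of the transaction in USD",
    )


schema = pa.DataFrameSchema({
    "cost": pa.Column(float, title="transaction cost", description="cost in USD"),
})
```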
Bugfixes
- typed descriptors and setup.py only includes pandera (#739)
- `@pa.dataframe_check` works correctly on pandas==1.1.5 (#735)
- fix set_index with MultiIndex (#751)
- strategies: correctly handle StringArray null values (#748)
Docs Improvements
- fastapi docs, add to ci (#753)
Testing Improvements
- Add Python 3.10 to CI matrix (#724)
Contributors
Big shout out to the following folks for your contributions on this release 🎉🎉🎉
0.8.1: Mypy Plugin, Better Editor Type Annotation Autocomplete, Pickleable SchemaError(s), Improved Error-reporting, Bugfixes
Enhancements
- add `__all__` declaration to root module for better editor autocompletion 42e60c6
- fix: expose nullable boolean in pandera.typing 5f9c713
- type annotations for DataFrameSchema (#700)
- add head of coerce failure cases (#710)
- add mypy plugin (#701) (see the config sketch after this list)
- make SchemaError and SchemaErrors picklable (#722)
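For #701, enabling the plugin is a one-line addition to your mypy configuration; a minimal sketch, assuming the standard mypy plugin mechanism:
```ini
# mypy.ini (or the [mypy] section of setup.cfg)
[mypy]
plugins = pandera.mypy
```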
Bugfixes
- Only concat and drop_duplicates if more than one of {sample,head,tail} are present d3bc974, f756166, 20a631f
- fix field autocompletion (#702)
Docs Improvements
- Update contributing documentation: how to add dependencies #696
- update package description in setup.py eb130b4
- Fix broken links in dataframe_schemas.rst (#708)
Contributors
Big shout out to the following folks for your contributions on this release 🎉🎉🎉
0.8.0: Integrate with Dask, Koalas, Modin, Pydantic, Mypy
Community Announcements
Pandera now has a discord community! Join us if you need help, want to discuss features/bugs, or help other community members 🤝
Highlights
Schema support for Dask, Koalas, Modin
Excited to announce that 0.8.0 is the first release that adds built-in support for additional dataframe types beyond pandas: you can now use the exact same `DataFrameSchema` objects or `SchemaModel` classes to validate Dask, Modin, and Koalas dataframes.
```python
import dask.dataframe as dd
import pandas as pd
import pandera as pa

from pandera.typing import dask, koalas, modin, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


@pa.check_types
def dask_function(ddf: dask.DataFrame[Schema]) -> dask.DataFrame[Schema]:
    return ddf[ddf["state"] == "CA"]


@pa.check_types
def koalas_function(df: koalas.DataFrame[Schema]) -> koalas.DataFrame[Schema]:
    return df[df["state"] == "CA"]


@pa.check_types
def modin_function(df: modin.DataFrame[Schema]) -> modin.DataFrame[Schema]:
    return df[df["state"] == "CA"]
```
And `DataFrameSchema` objects will work on all dataframe types:
```python
schema: pa.DataFrameSchema = Schema.to_schema()
schema(dask_df)
schema(modin_df)
schema(koalas_df)
```
Pydantic Integration
`pandera.SchemaModel`s are fully compatible with pydantic:
```python
import pandas as pd
import pandera as pa
import pydantic

from pandera.typing import DataFrame, Series


class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)


class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]


valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
PydanticModel(x=1, df=valid_df)

invalid_df = pd.DataFrame({"str_col": ["hello", "hello"]})
PydanticModel(x=1, df=invalid_df)  # raises ValidationError
```
Error:
```
Traceback (most recent call last):
...
ValidationError: 1 validation error for PydanticModel
df
  series 'str_col' contains duplicate values:
  1    hello
  Name: str_col, dtype: object (type=value_error)
```
Mypy Integration
Pandera now supports static type-linting of `DataFrame` types with mypy out of the box, so you can catch certain classes of errors at lint-time.
```python
import pandas as pd

import pandera as pa
from pandera.typing import DataFrame, Series


class Schema(pa.SchemaModel):
    id: Series[int]
    name: Series[str]


class SchemaOut(pa.SchemaModel):
    age: Series[int]


class AnotherSchema(pa.SchemaModel):
    foo: Series[int]


def fn(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay


def fn_pipe_incorrect_type(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[AnotherSchema])  # mypy error
# error: Argument 1 to "pipe" of "NDFrame" has incompatible type "Type[DataFrame[Any]]";
# expected "Union[Callable[..., DataFrame[SchemaOut]], Tuple[Callable[..., DataFrame[SchemaOut]], str]]" [arg-type]  # noqa


schema_df = DataFrame[Schema]({"id": [1], "name": ["foo"]})
pandas_df = pd.DataFrame({"id": [1], "name": ["foo"]})

fn(schema_df)  # mypy okay
fn(pandas_df)  # mypy error
# error: Argument 1 to "fn" has incompatible type "pandas.core.frame.DataFrame";
# expected "pandera.typing.pandas.DataFrame[Schema]" [arg-type]
```
Enhancements
- 735e7fe implement dataframe types (#672)
- 46dc3a2 Support mypy (#650)
- 02063c8 Add Basic Dask Support (#665)
- b7f6516 Modin support (#660)
- cdf4667 Add Pydantic support (#659)
- 12378ea Support Koalas (#658)
- 62d689d improve lazy validation performance for nullable cases (#655)
Bugfixes
- 7a98e23 bugfix: support nullable empty strategies (#638)
- 5ec4611 Fix remaining unrecognized numpy dtypes (#637)
- 96d6516 Correctly handling single string constraints (#670)
Docs Improvements
- 1860685 add pyproject.toml, update doc typos
- 3c086a9 add discord link, update readme, docs (#674)
- d75298f more detailed docstring of pandera.model_components.Field (#671)
- 96415a0 Add strictly typed pandas to readme (#649)
Testing Improvements
Internals Improvements
- fdcdb91 Reuse coerce in engines.utils (#645)
- 655dd85 remove assumption from nullable strategies (#641)
Contributors
Big shout out to the following folks for your contributions on this release 🎉🎉🎉
- @sbrugman
- @rbngz
- @jeffzi
- @bphillips-exos
- @thorben-flapo
- @tfwillems: special shout out here for contributing a good chunk of the code for the pydantic plugin #659
0.7.2: Bugfixes
Bugfixes
- Strategies should not rely on pandas dtype aliases (#620)
- support timedelta in data synthesis strats (#621)
- fix multiindex error reporting (#622)
- Pin pylint (#629)
- exclude np.float128 type registration in MacM1 (#624)
- fix numpy_pandas_coercible bug dealing with single element (#626)
- update pylint (#630)
0.7.1: Add unique option to DataFrameSchema
Enhancements
- add support for Any annotation in schema model (#594)
- add support for timezone-aware datetime strategies (#595)
- `unique` keyword arg: replace and deprecate `allow_duplicates` (#580) (see the sketch after this list)
- Add support for empty data type annotation in SchemaModel (#602)
- support frictionless primary keys with multiple fields (#608)
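For #580, a minimal sketch, assuming `unique` is accepted both at the column level and on `DataFrameSchema` for joint uniqueness:
```python
import pandas as pd
import pandera as pa

# column-level uniqueness, previously allow_duplicates=False
schema = pa.DataFrameSchema({"id": pa.Column(int, unique=True)})

# dataframe-level uniqueness across a set of columns
joint_schema = pa.DataFrameSchema(
    {"a": pa.Column(int), "b": pa.Column(int)},
    unique=["a", "b"],  # assumption: rows must be jointly unique over a and b
)

schema.validate(pd.DataFrame({"id": [1, 2, 2]}))  # raises SchemaError
```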
Bugfixes
- unify `typing.DataFrame` class definitions (#576)
- schemas with multi-index columns correctly report errors (#600)
- strategies module supports undefined checks in regex columns (#599)
- fix validation of check raising error without message (#613)
Docs Improvements
- Tutorial: docs/scaling - Bring Pandera to Spark and Dask (#588)
Repo Improvements
- use virtualenv instead of conda in ci (#578)
Dependency Changes
Contributors
🎉🎉 Big shout out to all the contributors on this release 🎉🎉
- @admackin
- @jeffzi
- @tfwillems
- @fkrull8
- @kvnkho