
Releases: unionai-oss/pandera

0.11.0: Docs support dark mode, custom names and errors for built-in checks, bug fixes

01 May 00:31
c494ee7

Big shoutout to the contributors on this release!

Highlights

Docs Get Dark Mode 🌓

Just a little something for folks who prefer dark mode!

[Screenshot: the pandera documentation rendered in dark mode]

Enhancements

  • Make DataFrameSchema respect subclassing #830
  • Feature: Add support for Generic to SchemaModel #810
  • feat: make schema available in SchemaErrors #831
  • add support for custom name and error in builtin checks #843 (sketch below)
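
As a quick illustration of #843, here's a minimal sketch (the column name and error message are illustrative, assuming the built-in check factories forward name/error keyword arguments to the Check constructor):

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "price": pa.Column(
        int,
        # built-in checks now accept a custom name and error message
        pa.Check.in_range(5, 20, name="price_range", error="price must be between 5 and 20"),
    )
})

schema.validate(pd.DataFrame({"price": [8, 12, 10]}))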

Bugfixes

  • Make DataFrameSchema respect subclassing #830
  • fix pandas_engine.DateTime.coerce_value not consistent with coerce #827
  • fix mypy 9c5eaa3


0.11.0b1: fix mypy error

30 Apr 13:39
9c5eaa3
Pre-release

release v0.11.0b1

0.11.0b0: Docs support dark mode, custom names and errors for built-in checks, bug fixes

29 Apr 21:13

0.10.1: Pyspark documentation fixes

04 Apr 03:13

release 0.10.1

0.10.0: Pyspark.pandas Support, PydanticModel datatype, Performance Improvements

01 Apr 13:56

Highlights

pandera now supports pyspark dataframe validation via pyspark.pandas

The pandera koalas integration is now deprecated.

You can now pip install pandera[pyspark] and validate pyspark.pandas dataframes:

import pyspark.pandas as ps
import pandas as pd
import pandera as pa

from pandera.typing.pyspark import DataFrame, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


# create a pyspark.pandas dataframe that's validated on object initialization
df = DataFrame[Schema](
    {
        'state': ['FL','FL','FL','CA','CA','CA'],
        'city': [
            'Orlando',
            'Miami',
            'Tampa',
            'San Francisco',
            'Los Angeles',
            'San Diego',
        ],
        'price': [8, 12, 10, 16, 20, 18],
    }
)
print(df)
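
Validation also composes with the decorator API; here's a minimal sketch (the function name is illustrative):

@pa.check_types
def filter_by_state(df: DataFrame[Schema], state: str) -> DataFrame[Schema]:
    # both the input and the returned dataframe are validated against Schema
    return df[df["state"] == state]

print(filter_by_state(df, "CA"))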

PydanticModel DataType Enables Row-wise Validation with a pydantic model

Pandera now supports row-wise validation by applying a pydantic model as a dataframe-level dtype:

import pandera as pa
from pandera.engines.pandas_engine import PydanticModel
from pydantic import BaseModel


class Record(BaseModel):
    name: str
    xcoord: str
    ycoord: int


class PydanticSchema(pa.SchemaModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = True  # this is required, otherwise a SchemaInitError is raised

⚠️ Warning: This may lead to performance issues for very large dataframes.
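
A minimal usage sketch (the data values are illustrative):

import pandas as pd

df = pd.DataFrame({
    "name": ["foo", "bar"],
    "xcoord": ["1.0", "2.0"],
    "ycoord": [1, 2],
})

# each row is coerced into and validated against the Record model
PydanticSchema.validate(df)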

Improved conda installation experience

Before this release there were only two conda packages: one to install pandera-core and another to install pandera (which installed all of the extras).

The conda packaging now supports finer-grained control:

conda install -c conda-forge pandera-hypotheses  # hypothesis checks
conda install -c conda-forge pandera-io          # yaml/script schema io utilities
conda install -c conda-forge pandera-strategies  # data synthesis strategies
conda install -c conda-forge pandera-mypy        # enable static type-linting of pandas
conda install -c conda-forge pandera-fastapi     # fastapi integration
conda install -c conda-forge pandera-dask        # validate dask dataframes
conda install -c conda-forge pandera-pyspark     # validate pyspark dataframes
conda install -c conda-forge pandera-modin       # validate modin dataframes
conda install -c conda-forge pandera-modin-ray   # validate modin dataframes with ray
conda install -c conda-forge pandera-modin-dask  # validate modin dataframes with dask


0.9.0: FastAPI Integration, Support GeoPandas DataFrames

09 Feb 00:34

Highlights

FastAPI Integration [Docs]

pandera now integrates with fastapi. You can annotate app endpoint arguments with DataFrame[Schema] types and the endpoint will validate incoming and outgoing data.

from typing import Optional

from pydantic import BaseModel, Field

import pandera as pa


# schema definitions
class Transactions(pa.SchemaModel):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float] = pa.Field(ge=0, le=1000)

    class Config:
        coerce = True

class TransactionsOut(Transactions):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float]
    name: pa.typing.Series[str]

class TransactionsDictOut(TransactionsOut):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}

App endpoint example:

from fastapi import FastAPI

from pandera.typing import DataFrame

app = FastAPI()

@app.post("/transactions/", response_model=DataFrame[TransactionsDictOut])
def create_transactions(transactions: DataFrame[Transactions]):
    output = transactions.assign(name="foo")
    ...  # do other stuff, e.g. update backend database with transactions
    return output
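
To exercise the endpoint, here's a hedged sketch using FastAPI's TestClient (assuming the request body is a column-to-values mapping that pandas can build a dataframe from):

from fastapi.testclient import TestClient

client = TestClient(app)
response = client.post("/transactions/", json={"id": [1], "cost": [10.99]})
print(response.json())  # e.g. [{"id": 1, "cost": 10.99, "name": "foo"}]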

Data Format Conversion [Docs]

The class-based API now supports automatically deserializing/serializing pandas dataframes in the context of @pa.check_types-decorated functions, @pydantic.validate_arguments-decorated functions, and fastapi endpoint functions.

import pandera as pa
from pandera.typing import DataFrame, Series

# base schema definitions
class InSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
    int_col: Series[int]

class OutSchema(InSchema):
    float_col: Series[float]

# read and validate data from a parquet file
class InSchemaParquet(InSchema):
    class Config:
        from_format = "parquet"

# output data as a list of dictionary records
class OutSchemaDict(OutSchema):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}

@pa.check_types
def transform(df: DataFrame[InSchemaParquet]) -> DataFrame[OutSchemaDict]:
    return df.assign(float_col=1.1)

The transform function can then take a filepath or buffer containing a parquet file that pandera automatically reads and validates:

import io
import json

import pandas as pd

buffer = io.BytesIO()
data = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)})
data.to_parquet(buffer)
buffer.seek(0)

dict_output = transform(buffer)
print(json.dumps(dict_output, indent=4))

Output:

[
    {
        "str_col": "a",
        "int_col": 0,
        "float_col": 1.1
    },
    {
        "str_col": "b",
        "int_col": 1,
        "float_col": 1.1
    },
    {
        "str_col": "c",
        "int_col": 2,
        "float_col": 1.1
    }
]

Data Validation with GeoPandas [Docs]

DataFrameSchemas can now validate geopandas.GeoDataFrame and GeoSeries objects:

import geopandas as gpd
import pandas as pd
import pandera as pa
from shapely.geometry import Polygon

geo_schema = pa.DataFrameSchema({
    "geometry": pa.Column("geometry"),
    "region": pa.Column(str),
})

geo_df = gpd.GeoDataFrame({
    "geometry": [
        Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
        Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
    ],
    "region": ["NA", "SA"]
})

geo_schema.validate(geo_df)

You can also define SchemaModel classes with a GeoSeries field type annotation to create validated GeoDataFrames, or use them in @pa.check_types-decorated functions for input/output validation:

from pandera.typing import Series
from pandera.typing.geopandas import GeoDataFrame, GeoSeries


class Schema(pa.SchemaModel):
    geometry: GeoSeries
    region: Series[str]


# create a geodataframe that's validated on object initialization
df = GeoDataFrame[Schema](
    {
        'geometry': [
            Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
            Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
        ],
        'region': ['NA','SA']
    }
)
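
And a minimal sketch of the @pa.check_types usage mentioned above (the function name is illustrative):

@pa.check_types
def filter_region(df: GeoDataFrame[Schema], region: str) -> GeoDataFrame[Schema]:
    # input and output geodataframes are validated against Schema
    return df[df["region"] == region]

filter_region(df, "NA")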

Enhancements

  • Support GeoPandas data structures (#732)
  • Fastapi integration (#741)
  • add title/description fields (#754)
  • add nullable float dtypes (#721)

Bugfixes

  • typed descriptors and setup.py only includes pandera (#739)
  • @pa.dataframe_check works correctly on pandas==1.1.5 (#735)
  • fix set_index with MultiIndex (#751)
  • strategies: correctly handle StringArray null values (#748)

Docs Improvements

  • fastapi docs, add to ci (#753)

Testing Improvements

  • Add Python 3.10 to CI matrix (#724)

Contributors

Big shout out to the following folks for your contributions on this release 🎉🎉🎉

0.8.1: Mypy Plugin, Better Editor Type Annotation Autocomplete, Pickleable SchemaError(s), Improved Error-reporting, Bugfixes

31 Dec 21:43
9448d0a

Enhancements

  • add __all__ declaration to root module for better editor autocompletion 42e60c6
  • fix: expose nullable boolean in pandera.typing 5f9c713
  • type annotations for DataFrameSchema (#700)
  • add head of coerce failure cases (#710)
  • add mypy plugin (#701)
  • make SchemaError and SchemaErrors picklable (#722)

Bugfixes

  • Only concat and drop_duplicates if more than one of {sample,head,tail} are present d3bc974, f756166, 20a631f
  • fix field autocompletion (#702)

Docs Improvements

  • Update contributing documentation: how to add dependencies #696
  • update package description in setup.py eb130b4
  • Fix broken links in dataframe_schemas.rst (#708)

Contributors

Big shout out to the following folks for your contributions on this release 🎉🎉🎉

0.8.0: Integrate with Dask, Koalas, Modin, Pydantic, Mypy

13 Nov 05:03

Community Announcements

Pandera now has a discord community! Join us if you need help, want to discuss features/bugs, or help other community members 🤝


Highlights

Schema support for Dask, Koalas, Modin

We're excited to announce that 0.8.0 is the first release with built-in support for dataframe types beyond pandas: you can now use the exact same DataFrameSchema objects or SchemaModel classes to validate Dask, Modin, and Koalas dataframes.

import dask.dataframe as dd
import pandas as pd
import pandera as pa

from pandera.typing import Series, dask, koalas, modin

class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})

@pa.check_types
def dask_function(ddf: dask.DataFrame[Schema]) -> dask.DataFrame[Schema]:
    return ddf[ddf["state"] == "CA"]

@pa.check_types
def koalas_function(df: koalas.DataFrame[Schema]) -> koalas.DataFrame[Schema]:
    return df[df["state"] == "CA"]

@pa.check_types
def modin_function(df: modin.DataFrame[Schema]) -> modin.DataFrame[Schema]:
    return df[df["state"] == "CA"]

And DataFrameSchema objects will work on all dataframe types:

schema: pa.DataFrameSchema = Schema.to_schema()

schema(dask_df)
schema(modin_df)
schema(koalas_df)

Pydantic Integration

pandera.SchemaModels are fully compatible with pydantic:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic


class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)


class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]


valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
PydanticModel(x=1, df=valid_df)

invalid_df = pd.DataFrame({"str_col": ["hello", "hello"]})
PydanticModel(x=1, df=invalid_df)

Error:

Traceback (most recent call last):
...
ValidationError: 1 validation error for PydanticModel
df
series 'str_col' contains duplicate values:
1    hello
Name: str_col, dtype: object (type=value_error)

Mypy Integration

Pandera now supports static type-linting of DataFrame types with mypy out of the box so you can catch certain classes of errors at lint-time.

import pandas as pd

import pandera as pa
from pandera.typing import DataFrame, Series

class Schema(pa.SchemaModel):
    id: Series[int]
    name: Series[str]

class SchemaOut(pa.SchemaModel):
    age: Series[int]

class AnotherSchema(pa.SchemaModel):
    foo: Series[int]

def fn(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay

def fn_pipe_incorrect_type(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[AnotherSchema])  # mypy error
    # error: Argument 1 to "pipe" of "NDFrame" has incompatible type "Type[DataFrame[Any]]";
    # expected "Union[Callable[..., DataFrame[SchemaOut]], Tuple[Callable[..., DataFrame[SchemaOut]], str]]"  [arg-type]  # noqa

schema_df = DataFrame[Schema]({"id": [1], "name": ["foo"]})
pandas_df = pd.DataFrame({"id": [1], "name": ["foo"]})

fn(schema_df)  # mypy okay
fn(pandas_df)  # mypy error
# error: Argument 1 to "fn" has incompatible type "pandas.core.frame.DataFrame";
# expected "pandera.typing.pandas.DataFrame[Schema]"  [arg-type]


Bugfixes

  • 7a98e23 bugfix: support nullable empty strategies (#638)
  • 5ec4611 Fix remaining unrecognized numpy dtypes (#637)
  • 96d6516 Correctly handling single string constraints (#670)

Docs Improvements

  • 1860685 add pyproject.toml, update doc typos
  • 3c086a9 add discord link, update readme, docs (#674)
  • d75298f more detailed docstring of pandera.model_components.Field (#671)
  • 96415a0 Add strictly typed pandas to readme (#649)


Contributors

Big shout out to the following folks for your contributions on this release 🎉🎉🎉

0.7.2: Bugfixes

25 Sep 02:06

Bugfixes

  • Strategies should not rely on pandas dtype aliases (#620)
  • support timedelta in data synthesis strats (#621)
  • fix multiindex error reporting (#622)
  • Pin pylint (#629)
  • exclude np.float128 type registration in MacM1 (#624)
  • fix numpy_pandas_coercible bug dealing with single element (#626)
  • update pylint (#630)

0.7.1: Add unique option to DataFrameSchema

13 Sep 00:28
f0ddcbf

Enhancements

  • add support for Any annotation in schema model (#594)
  • add support for timezone-aware datetime strategies (#595)
  • unique keyword arg: replace and deprecate allow_duplicates (#580) (sketch below)
  • Add support for empty data type annotation in SchemaModel (#602)
  • support frictionless primary keys with multiple fields (#608)
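
A minimal sketch of the new unique keyword (column names are illustrative; this assumes unique accepts a list of columns whose combination must be jointly unique):

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {"a": pa.Column(int), "b": pa.Column(int)},
    unique=["a", "b"],  # the (a, b) pairs must not repeat
)

# passes: duplicate values within a column are fine, pairs are distinct
schema.validate(pd.DataFrame({"a": [1, 1], "b": [2, 3]}))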

Bugfixes

  • unify typing.DataFrame class definitions (#576)
  • schemas with multi-index columns correctly report errors (#600)
  • strategies module supports undefined checks in regex columns (#599)
  • fix validation of check raising error without message (#613)

Docs Improvements

  • Tutorial: docs/scaling - Bring Pandera to Spark and Dask (#588)

Repo Improvements

  • use virtualenv instead of conda in ci (#578)

Dependency Changes

  • remove frictionless from core pandera deps (#609)
  • docs/requirements.txt pin setuptools (#611)

Contributors

🎉🎉 Big shout out to all the contributors on this release 🎉🎉