# Validating Pandas DataFrames with Pandera

Niels Bantilan

01/23/2020

## Problem: I have a dataset, and I want to make sure it has certain properties

In [1]:
import pandas as pd

from sklearn.datasets import load_iris

iris = load_iris()

def normalize_feature_name(x):
    return x.replace("(", "").replace(")", "").replace(" ", "_")

iris_dataset = pd.DataFrame(
    data=iris["data"],
    columns=[normalize_feature_name(x) for x in iris["feature_names"]])

iris_dataset["target"] = iris["target"]
iris_dataset["target_names"] = [iris["target_names"][i] for i in iris["target"]]

iris_dataset.head()

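| | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | target | target_names |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |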


## What do we want to validate?

- Certain columns are present in the dataframe
- Those columns have the expected data types
- Certain deterministic properties of the data are true
- Certain statistical properties of the data are true

## Validating Data Types

Use the `dtypes` attribute and compare it against your expectations.

In [2]:
expected_dtypes = pd.Series({
    "sepal_length_cm": "float64",
    "sepal_width_cm": "float64",
    "petal_length_cm": "float64",
    "petal_width_cm": "float64",
    "target": "int64",
    "target_names": "object",
})

(iris_dataset.dtypes == expected_dtypes).all()

True
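
The `.all()` comparison only reports whether every dtype matches, not which column is off. A minimal sketch in plain pandas that lists the deviating columns; the helper name `dtype_mismatches` and the small frame are hypothetical illustrations, not part of the original notebook:

```python
import pandas as pd


def dtype_mismatches(df, expected):
    """Return one row per column whose dtype differs from `expected`.

    `expected` maps column name -> dtype string. Columns absent from
    `df` are reported with an actual dtype of "<missing>".
    """
    rows = []
    for column, expected_dtype in expected.items():
        actual = str(df[column].dtype) if column in df.columns else "<missing>"
        if actual != expected_dtype:
            rows.append(
                {"column": column, "expected": expected_dtype, "actual": actual})
    return pd.DataFrame(rows, columns=["column", "expected", "actual"])


# Made-up data: "target" holds strings, and "name" is absent entirely.
toy = pd.DataFrame({"width": [0.2, 0.4], "target": ["0", "1"]})
report = dtype_mismatches(
    toy, {"width": "float64", "target": "int64", "name": "object"})
print(report)
```

An empty report means every expected column is present with the expected dtype.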

Alternatively, you can try coercing the columns into the expected data types.

In [3]:
iris_dataset_coerced = iris_dataset.astype(expected_dtypes.to_dict())
iris_dataset_coerced.head()

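| | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | target | target_names |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |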


With `pandera`, you can perform type checking and coercion by expressing
expectations as a `DataFrameSchema`.

In [4]:
import pandera as pa

iris_schema = pa.DataFrameSchema(
    columns={
        "sepal_length_cm": pa.Column(pa.Float),
        "sepal_width_cm": pa.Column(pa.Float),
        "petal_length_cm": pa.Column(pa.Float),
        "petal_width_cm": pa.Column(pa.Float),
        "target": pa.Column(pa.Int),
        "target_names": pa.Column(pa.String),
    },
    coerce=True
)

validated_iris_dataset = iris_schema(iris_dataset)
validated_iris_dataset.head()

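| | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | target | target_names |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |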


If the dataset becomes corrupted and data type coercion fails,
`pandera` will raise a `ValueError`.

In [5]:
corrupted_iris_dataset = iris_dataset.copy()
corrupted_iris_dataset["sepal_length_cm"] = "foo"

try:
    iris_schema(corrupted_iris_dataset)
except Exception as e:
    print(e)

could not convert string to float: 'foo'


If a column is missing, `pandera` raises a `SchemaError`.

In [6]:
missing_column_iris_dataset = iris_dataset.copy()
del missing_column_iris_dataset["sepal_length_cm"]

try:
    iris_schema(missing_column_iris_dataset)
except Exception as e:
    print(e)

column 'sepal_length_cm' not in dataframe
   sepal_width_cm  petal_length_cm  petal_width_cm  target target_names
0             3.5              1.4             0.2       0       setosa
1             3.0              1.4             0.2       0       setosa
2             3.2              1.3             0.2       0       setosa
3             3.1              1.5             0.2       0       setosa
4             3.6              1.4             0.2       0       setosa


## Validating Deterministic Properties

What if we wanted to check certain properties that we agree should always
be true? The `Check` object allows us to express these expectations.

In [8]:
from pandera import Check

iris_schema = pa.DataFrameSchema(
    columns={
        "sepal_length_cm": pa.Column(pa.Float, Check.greater_than(0)),
        "sepal_width_cm": pa.Column(pa.Float, Check.greater_than(0)),
        "petal_length_cm": pa.Column(pa.Float, Check.greater_than(0)),
        "petal_width_cm": pa.Column(pa.Float, Check.greater_than(0)),
        "target": pa.Column(pa.Int, Check.isin([0, 1, 2])),
        "target_names": pa.Column(
            pa.String, Check.isin(["setosa", "versicolor", "virginica"])),
    },
    coerce=True
)

validated_iris_dataset = iris_schema(iris_dataset)
validated_iris_dataset.head()

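| | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | target | target_names |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |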


If we corrupt a few of the values in the dataset, we get informative errors.

In [9]:
wrong_sepal_length_iris_dataset = iris_dataset.copy()
wrong_sepal_length_iris_dataset.loc[:5, "sepal_length_cm"] = -100
wrong_sepal_length_iris_dataset.loc[5:15, "sepal_length_cm"] = -10

try:
    iris_schema(wrong_sepal_length_iris_dataset)
except Exception as e:
    print(e)

<Schema Column: 'sepal_length_cm' type=float64> failed element-wise validator 0:
<Check _greater_than: greater_than(0)>
failure cases:
                                                index  count
failure_case                                                
-10.0         [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]     11
-100.0                                [0, 1, 2, 3, 4]      5


## Validating Statistical Hypotheses

What if we want to test the hypothesis that the petal width of `virginica` flowers
is larger than that of `setosa` flowers? `pandera` provides an intuitive interface for
expressing this in a `DataFrameSchema`.

In [10]:
from pandera import Hypothesis

hypothesis_test_schema = pa.DataFrameSchema({
    "petal_width_cm": pa.Column(
        pa.Float,
        Hypothesis.two_sample_ttest(
            sample1="virginica",
            relationship="greater_than",
            sample2="setosa",
            groupby="target_names",
            alpha=0.01)),
    "target_names": pa.Column(pa.String)
})

hypothesis_tested_iris_dataset = hypothesis_test_schema(iris_dataset)
hypothesis_tested_iris_dataset.head()

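| | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | target | target_names |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |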


`pandera` will tell us if this hypothesis test doesn't pass.

In [11]:
petal_width_corrupted_iris_dataset = iris_dataset.copy()

petal_width_corrupted_iris_dataset.loc[
    petal_width_corrupted_iris_dataset.target_names == "virginica",
    "petal_width_cm"] = -100

try:
    hypothesis_test_schema(petal_width_corrupted_iris_dataset)
except Exception as e:
    print(e)

<Schema Column: 'petal_width_cm' type=float64> failed series validator 0: <Check _hypothesis_check: failed two sample ttest between 'virginica' and 'setosa'>
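
To make the test itself less of a black box, here is a rough sketch of a one-sided two-sample t-test written directly with SciPy. It illustrates the kind of computation involved and is not taken from `pandera`'s internals; the sample values are made up rather than drawn from the iris dataset, and halving the two-sided p-value is one common way to obtain the one-sided result:

```python
from scipy import stats

# Made-up petal widths (cm) standing in for the two groups.
virginica_like = [2.5, 1.9, 2.1, 1.8, 2.2, 2.3]
setosa_like = [0.2, 0.2, 0.4, 0.3, 0.2, 0.1]

# Two-sided Student's t-test on independent samples.
t_stat, p_two_sided = stats.ttest_ind(virginica_like, setosa_like)

# One-sided decision: sample 1 counts as "greater" only if the statistic
# is positive and half the two-sided p-value falls below alpha.
alpha = 0.01
sample1_greater = bool(t_stat > 0 and p_two_sided / 2 < alpha)
print(sample1_greater)
```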


## Seamless Integration Into Your Data Pipelines

Finally, `pandera` integrates with your existing code via decorators: `check_output` validates what a function returns, and `check_input` validates what it receives.

In [12]:
from pandera import check_output, check_input

def normalize_feature_name(x):
    return x.replace("(", "").replace(")", "").replace(" ", "_")

@check_output(iris_schema)
def load_iris_dataframe():
    iris = load_iris()
    iris_dataset = pd.DataFrame(
        data=iris["data"],
        columns=[normalize_feature_name(x) for x in iris["feature_names"]])

    iris_dataset["target"] = iris["target"]
    iris_dataset["target_names"] = [
        iris["target_names"][i] for i in iris["target"]]
    return iris_dataset

@check_input(iris_schema)
def train_model(iris_dataset):
    ...


### Thanks!

github: https://github.com/pandera-dev/pandera

docs: https://pandera.readthedocs.io/en/latest

slides: https://github.com/pandera-dev/pandera-presentations

[@cosmicbboy](https://twitter.com/cosmicBboy)