# What's Pandera?

Pandera is an open source framework for precision data testing, built for
data scientists and ML engineers.

In this notebook, you'll learn how to:

> 1. Define Pandera schemas for your dataframe-like objects 📦
> 2. Integrate them seamlessly into your data pipelines 🔀
> 3. Ensure your data and data transformation functions are correct ✅

▶️ Run the code cells below to get a sense of how pandera works and how its
error reporting system shows you exactly which data values caused a
validation error.

First, install `pandera` in your notebook session:

In [None]:
import piplite

await piplite.install("pandera")

## What are Schemas?

With `pandera`, we can create schemas, which specify types for dataframe-like
objects. We can then use these schemas to assert properties about data at runtime
and try parsing it into a desired state.

Suppose you're working with a transactions dataset of grocery `item`s and their
associated `price`s. We can state our assumptions about the data upfront by
defining a `Schema`.

In [None]:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class Schema(pa.SchemaModel):
    item: Series[str] = pa.Field(isin=["apple", "orange"], coerce=True)
    price: Series[float] = pa.Field(gt=0, coerce=True)

You can see that the `Schema` class inherits from [`pandera.SchemaModel`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.model.SchemaModel.html#pandera.model.SchemaModel),
and defines two fields: `item` and `price`.

`pandera` gives you a flexible and concise way to specify not only the datatype of
each column but also other properties, such as set membership with `isin=...` and value ranges with `gt=...`.

Setting `coerce=True` will cause pandera to parse the columns into the expected data types, giving you the ability
to ensure that data flowing through your pipeline is of the expected type.

## Runtime DataFrame Value Checks

We can now use the `Schema` class to validate data passing through a function.

In [None]:
@pa.check_types(lazy=True)
def transform_data(data: DataFrame[Schema]):
    ...

Using the `@pa.check_types` decorator together with the `data: DataFrame[Schema]` annotation ensures that dataframe inputs are validated
at runtime before being passed into the `transform_data` function body.

By providing the `lazy=True` option in the `check_types` decorator, we're
telling pandera to validate all field properties before raising a `SchemaErrors`
exception.

With valid data, calling `transform_data` shouldn't be a problem.

In [None]:
valid_data = pd.DataFrame.from_records([
    {"item": "apple", "price": 0.5},
    {"item": "orange", "price": 0.75}
])
transform_data(valid_data)

With invalid data, however, pandera will raise a `SchemaErrors` exception. We can
catch the exception and identify all the failure cases:

In [None]:
invalid_data = pd.DataFrame.from_records([
    {"item": "applee", "price": 0.5},
    {"item": "orange", "price": -1000}
])


try:
    transform_data(invalid_data)
except pa.errors.SchemaErrors as exc:
    display(exc.failure_cases)

The `exc.failure_cases` attribute points to a dataframe that contains metadata
about the failure cases that occurred when validating the data.

We can see that row index `0` had a failure case in the `item` column, which
failed the `isin({"apple", "orange"})` check. The failure case value in question
is `applee`.

We can also see that row index `1` had a failure case of `-1000.0` in the `price`
column: it failed the `gt=0` check, since negative prices don't make sense in this context.

## In-line Validation

You can also use `Schema` classes to validate data in-line by calling the `validate` method:

In [None]:
Schema.validate(valid_data)

This gives you ultimate flexibility on where you want to validate data in your code.

## Schemas as Data Quality Checkpoints

With `pandera`, you can use inheritance to indicate changes in
the contents of a dataframe that some function has to implement.

In [None]:
class Schema(pa.SchemaModel):
    item: Series[str] = pa.Field(isin=["apple", "orange"], coerce=True)
    price: Series[float] = pa.Field(gt=0, coerce=True)

class TransformedSchema(Schema):
    expiry: Series[pd.Timestamp] = pa.Field(coerce=True)

`TransformedSchema` will inherit the class attributes defined in
`Schema`, with an additional `expiry` datetime field.

Now we can implement a function that performs the transformation needed to
connect these two schemas.

In [None]:
from datetime import datetime
from typing import List


@pa.check_types(lazy=True)
def transform_data(
    data: DataFrame[Schema],
    expiry: List[datetime],
) -> DataFrame[TransformedSchema]:
    return data.assign(expiry=expiry)


transform_data(valid_data, [datetime.now()] * valid_data.shape[0])

Now every time we call the `transform_data` function, not only is the
`data` input argument validated, but the output dataframe is validated
against `TransformedSchema`.

This means that you can catch bugs in your data transformation code
more easily:

In [None]:
@pa.check_types(lazy=True)
def transform_data(
    data: DataFrame[Schema],
    expiry: List[datetime],
) -> DataFrame[TransformedSchema]:
    return data.assign(expiryy=expiry)  # typo bug: 🐛


try:
    transform_data(valid_data, [datetime.now()] * valid_data.shape[0])
except pa.errors.SchemaErrors as exc:
    display(exc.failure_cases)

The `failure_cases` dataframe is telling us the core `column_in_dataframe` check
is failing because the `expiry` column is not present in the output dataframe.

## Bonus: The Object-based API

`pandera` also provides an object-based API for defining dataframe schemas.

While the [`SchemaModel`](https://pandera.readthedocs.io/en/stable/schema_models.html) class-based API is closer in spirit to `dataclasses` and `pydantic`, which use Python classes to express complex data types, the
object-based [`DataFrameSchema`](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html) API enables you to transform your schema definition on the fly.

In [None]:
# class-based API
class Schema(pa.SchemaModel):
    item: Series[str] = pa.Field(isin=["apple", "orange"], coerce=True)
    price: Series[float] = pa.Field(gt=0, coerce=True)

# the equivalent object-based API syntax
schema = pa.DataFrameSchema({
    "item": pa.Column(str, pa.Check.isin(["apple", "orange"]), coerce=True),
    "price": pa.Column(float, pa.Check.gt(0), coerce=True),
})

You can add, remove, and update columns as you want:

In [None]:
# each method returns a new schema and leaves `schema` itself unchanged
transformed_schema = schema.add_columns({"expiry": pa.Column(pd.Timestamp)})
schema_without_item = schema.remove_columns(["item"])  # remove the "item" column
schema_int_price = schema.update_column("price", dtype=int)  # update the datatype of the "price" column to integer

You can use `DataFrameSchema`s to validate data just like `SchemaModel` subclasses:

In [None]:
schema.validate(valid_data)

And, similar to the `check_types` decorator, you can use the `check_io` decorator
to validate inputs and outputs of your functions.

In [None]:
@pa.check_io(data=schema, out=transformed_schema)
def fn(data, expiry):
    return data.assign(expiry=expiry)


fn(valid_data, [datetime.now()] * valid_data.shape[0])

### When to Use `DataFrameSchema` vs. `SchemaModel`

Practically speaking, the two ways of writing pandera schemas are equivalent, and choosing one
over the other boils down to a few factors:

1. Preference: some developers are more comfortable with one syntax over the other.
2. The class-based API unlocks static type-checking of data via [mypy](https://pandera.readthedocs.io/en/stable/mypy_integration.html)
   and integrates well with Python's type hinting system.
3. The object-based API is good if you want to dynamically update your schema definition at runtime.

At the end of the day, you can use them interchangeably in your applications.

## What's Next?

This notebook gave you a brief intro to Pandera, but this
framework has a lot more to offer to help you test your data:

- [Create in-line custom checks](https://pandera.readthedocs.io/en/stable/checks.html)
- [Register custom checks](https://pandera.readthedocs.io/en/stable/extensions.html)
- [Define statistical hypothesis tests](https://pandera.readthedocs.io/en/stable/hypothesis.html)
- [Bootstrap schemas with data profiling](https://pandera.readthedocs.io/en/stable/schema_inference.html)
- [Synthesize fake data for unit testing](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html)
- [Scale validation with distributed dataframes](https://pandera.readthedocs.io/en/stable/supported_libraries.html#)
- [Integrate with the Python ecosystem](https://pandera.readthedocs.io/en/stable/integrations.html)