# Data Validation in Training Pipelines


In this notebook, we will go through the process of validating dataframes in a training pipeline using Pandera.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NatanMish/data_validation/blob/main/notebooks/2_training_pipeline_data_validation.ipynb)


#### Install the required packages and import them to the notebook

In [1]:
!pip install scikit-learn pandas pandera


In [2]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandera as pa

#### Load the data

In [3]:
home_data = pd.read_csv('https://github.com/NatanMish/data_validation/blob/a77b247b25c6622ce0c8f8cbc505228161c31a3c/data/train.csv?raw=true')

#### Train basic model
We'll start by setting up a training pipeline using scikit-learn's `Pipeline` class. For this example we only want a few basic features, so we'll write a custom pipeline step that selects just those columns.

In [4]:
class ChooseFeatures(BaseEstimator):
    def __init__(self, features=None):
        self.features = features
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X[self.features]

In [5]:
feature_names = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd', 'LotFrontage']

Now we set up the pipeline and fit it to the data.

In [6]:
pipe = Pipeline([
     ('feature_selection', ChooseFeatures(features=feature_names)),
     ('scaler', StandardScaler()),
     ('rf', RandomForestRegressor())
])

In [7]:
X = home_data
y = home_data.SalePrice
pipe.fit(X, y)

ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Looks like our data has null values and this causes the model to break. Let's take a look at Pandera to see how it can help us with this.
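
Before reaching for a validation library, a quick pandas check can show which of the selected features contain nulls. The sketch below runs on a small hand-made frame; `toy_data` and its values are illustrative assumptions, not rows from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for home_data[feature_names]; the values are made up for illustration.
toy_data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
    "LotFrontage": [65.0, np.nan, np.nan],
})

# Count missing values per column and keep only the columns that have any.
null_counts = toy_data.isna().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

Running the same two lines on `home_data[feature_names]` should point to the columns that trip up the random forest.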

<div>
<img src="https://raw.githubusercontent.com/pandera-dev/pandera/master/docs/source/_static/pandera-banner.png" width="500"/>
</div>

Pandera provides a flexible and expressive API for performing data validation on dataframes to make data processing pipelines more readable and robust. Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical data pipelines or reproducible research settings. We'll take a look at these Pandera features:

1. Check the types and properties of columns in a pd.DataFrame or values in a pd.Series.

2. Perform more complex statistical validation like hypothesis testing.

3. Integrate with existing data analysis/processing pipelines via function decorators.

4. Define schema models with the class-based API with pydantic-style syntax and validate dataframes using the typing syntax.

5. Synthesize data from schema objects for property-based testing with pandas data structures.

6. Lazily validate dataframes so that all validation rules are executed before raising an error.

For more information, see [Pandera's documentation](https://pandera.readthedocs.io/en/latest/).

#### 1. DataFrame Schemas - Type Validation

In [8]:
# We'll add one more feature to make it more interesting
feature_names.append('LotConfig')

In [9]:
# Create a basic schema for the home_data DataFrame that checks the types of just two of the features
basic_types_schema = pa.DataFrameSchema({
    "LotArea": pa.Column(int),
    "LotConfig": pa.Column(str),
    })

In [10]:
# Validate the home_data DataFrame against basic_types_schema
# Notice that although the schema defines only two of the columns, validation passes: Pandera ignores the rest.
basic_types_schema.validate(home_data[feature_names])

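| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd | LotFrontage | LotConfig |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8450 | 2003 | 856 | 854 | 2 | 3 | 8 | 65.0 | Inside |
| 1 | 9600 | 1976 | 1262 | 0 | 2 | 3 | 6 | 80.0 | FR2 |
| 2 | 11250 | 2001 | 920 | 866 | 2 | 3 | 6 | 68.0 | Inside |
| 3 | 9550 | 1915 | 961 | 756 | 1 | 3 | 7 | 60.0 | Corner |
| 4 | 14260 | 2000 | 1145 | 1053 | 2 | 4 | 9 | 84.0 | FR2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 7917 | 1999 | 953 | 694 | 2 | 3 | 7 | 62.0 | Inside |
| 1456 | 13175 | 1978 | 2073 | 0 | 2 | 3 | 7 | 85.0 | Inside |
| 1457 | 9042 | 1941 | 1188 | 1152 | 2 | 4 | 9 | 66.0 | Inside |
| 1458 | 9717 | 1950 | 1078 | 0 | 1 | 2 | 5 | 68.0 | Inside |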


`validate` returns the dataframe, which means the data is valid.
There are different ways we can specify the type:
- a string alias, as long as it is recognized by pandas.
- a Python type: int, float, bool, str
- a numpy data type
- a pandas extension type: it can be an instance (e.g. `pd.CategoricalDtype(["a", "b"])`) or a class (e.g. `pd.CategoricalDtype`) if it can be initialized with default values.
- a pandera DataType: it can also be an instance or a class.

In [11]:
# Now let's create a schema that does not fit the data types in home data
bad_types_schema = pa.DataFrameSchema({
    "LotArea": pa.Column(int),
    "LotConfig": pa.Column(float),
})

In [12]:
# The bad schema validation will throw an error
bad_types_schema.validate(home_data[feature_names])

SchemaError: expected series 'LotConfig' to have type float64, got object

#### 2. DataFrame Schemas - Value Ranges Validation

In [13]:
# Pandera also allows validating value ranges for numerical columns
value_range_schema = pa.DataFrameSchema({
    "LotArea": pa.Column(int, pa.Check(lambda s: s <= 1000000), nullable=False),
    "YearBuilt": pa.Column(int, [pa.Check.in_range(1800, 2022)]),
})

In [14]:
# Validate the home_data DataFrame against the value_range_schema
value_range_schema.validate(home_data[feature_names])

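| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd | LotFrontage | LotConfig |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8450 | 2003 | 856 | 854 | 2 | 3 | 8 | 65.0 | Inside |
| 1 | 9600 | 1976 | 1262 | 0 | 2 | 3 | 6 | 80.0 | FR2 |
| 2 | 11250 | 2001 | 920 | 866 | 2 | 3 | 6 | 68.0 | Inside |
| 3 | 9550 | 1915 | 961 | 756 | 1 | 3 | 7 | 60.0 | Corner |
| 4 | 14260 | 2000 | 1145 | 1053 | 2 | 4 | 9 | 84.0 | FR2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 7917 | 1999 | 953 | 694 | 2 | 3 | 7 | 62.0 | Inside |
| 1456 | 13175 | 1978 | 2073 | 0 | 2 | 3 | 7 | 85.0 | Inside |
| 1457 | 9042 | 1941 | 1188 | 1152 | 2 | 4 | 9 | 66.0 | Inside |
| 1458 | 9717 | 1950 | 1078 | 0 | 1 | 2 | 5 | 68.0 | Inside |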


#### 3. DataFrame Schemas - Catch Bad Data

What if, instead of stopping at the first error, we want to keep processing the dataframe or skip the bad rows? Passing `lazy=True` to `validate` makes Pandera run every check before raising, and the `failure_cases` attribute of the raised exception holds the offending values and their indices. We can capture it with a try-except block.

In [15]:
# We'll use a small sample of the data to make the example more clear
sample_data = home_data.sample(n=10)

In [16]:
# Create a schema that will fail on the first bad data point
catch_bad_data_schema = pa.DataFrameSchema({
    "LotArea": pa.Column(int, pa.Check(lambda s: s <= 1000000)),
    "YearBuilt": pa.Column(int, pa.Check.in_range(1900,1990)),  # notice that the year built has a restrictive range
})

In [17]:
# Validating the sample_data DataFrame against the catch_bad_data_schema will throw an error
catch_bad_data_schema.validate(sample_data[feature_names])

SchemaError: <Schema Column(name=YearBuilt, type=DataType(int64))> failed element-wise validator 0:
<Check in_range: in_range(1900, 1990)>
failure cases:
   index  failure_case
0    234          2002
1     81          1998
2    686          2007
3     60          2004

Now let's use a try-except block to catch the bad data indices. This is a common and accepted Python idiom known as EAFP, "easier to ask for forgiveness than permission", which might not be as well received in other languages.

In [18]:
try:
    catch_bad_data_schema.validate(sample_data[feature_names], lazy=True)
except pa.errors.SchemaErrors as e:
    failure_cases = e.failure_cases

# Failure cases is a dataframe of the bad data only
failure_cases.head()

| | schema_context | column | check | check_number | failure_case | index |
|---|---|---|---|---|---|---|
| 0 | Column | YearBuilt | in_range(1900, 1990) | 0 | 2002 | 234 |
| 1 | Column | YearBuilt | in_range(1900, 1990) | 0 | 1998 | 81 |
| 2 | Column | YearBuilt | in_range(1900, 1990) | 0 | 2007 | 686 |
| 3 | Column | YearBuilt | in_range(1900, 1990) | 0 | 2004 | 60 |


In [19]:
# We can easily filter out the bad data from the original dataframe using the failure_cases dataframe
filtered_df = sample_data[~sample_data.index.isin(failure_cases["index"])]

In [20]:
# Let's see that the filtered data passes the validation test
catch_bad_data_schema.validate(filtered_df[feature_names])

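| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd | LotFrontage | LotConfig |
|---|---|---|---|---|---|---|---|---|---|
| 372 | 7175 | 1984 | 752 | 0 | 1 | 2 | 4 | 50.0 | Inside |
| 298 | 11700 | 1968 | 1041 | 702 | 1 | 3 | 7 | 90.0 | Inside |
| 729 | 6240 | 1925 | 848 | 0 | 1 | 2 | 5 | 52.0 | Inside |
| 217 | 9906 | 1925 | 810 | 518 | 1 | 3 | 8 | 57.0 | Inside |
| 122 | 9464 | 1958 | 1080 | 0 | 1 | 3 | 5 | 75.0 | Corner |
| 1038 | 1533 | 1970 | 798 | 546 | 1 | 3 | 6 | 21.0 | Inside |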


#### 4. DataFrame Schemas - Validate acceptable categorical values

In [21]:
lot_config_values = ["Inside", "Corner", "CulDSac", "FR3"]

In [22]:
lot_config_values_schema = pa.DataFrameSchema({
    "LotArea": pa.Column(int, pa.Check(lambda s: s <= 1000000)),
    "LotConfig": pa.Column(str, pa.Check.isin(lot_config_values)),
})

In [23]:
# Validating the home_data DataFrame against the lot_config_values_schema will throw an error
lot_config_values_schema.validate(home_data[feature_names])

SchemaError: <Schema Column(name=LotConfig, type=DataType(str))> failed element-wise validator 0:
<Check isin: isin({'FR3', 'Corner', 'Inside', 'CulDSac'})>
failure cases:
    index failure_case
0       1          FR2
1       4          FR2
2      81          FR2
3     140          FR2
4     195          FR2
5     214          FR2
6     223          FR2
7     228          FR2
8     236          FR2
9     266          FR2
10    364          FR2
11    386          FR2
12    421          FR2
13    480          FR2
14    483          FR2
15    537          FR2
16    541          FR2
17    558          FR2
18    574          FR2
19    611          FR2
20    670          FR2
21    687          FR2
22    761          FR2
23    775          FR2
24    805          FR2
25    849          FR2
26    933          FR2
27    941          FR2
28    959          FR2
29    975          FR2
30    994          FR2
31   1018          FR2
32   1057          FR2
33   1117          FR2
34   1158          FR2
35   1164          FR2
36   1178          FR2
37   1193          FR2
38   1232          FR2
39   1237          FR2
40   1259          FR2
41   1362          FR2
42   1369          FR2
43   1436          FR2
44   1437          FR2
45   1444          FR2
46   1450          FR2

Other useful methods for `pa.Check` are:

- `pandera.checks.Check.eq`
- `pandera.checks.Check.equal_to`
- `pandera.checks.Check.ge`
- `pandera.checks.Check.greater_than`
- `pandera.checks.Check.greater_than_or_equal_to`
- `pandera.checks.Check.gt`
- `pandera.checks.Check.in_range`
- `pandera.checks.Check.isin`
- `pandera.checks.Check.le`
- `pandera.checks.Check.less_than`
- `pandera.checks.Check.less_than_or_equal_to`
- `pandera.checks.Check.lt`
- `pandera.checks.Check.ne`
- `pandera.checks.Check.not_equal_to`
- `pandera.checks.Check.notin`
- `pandera.checks.Check.str_contains`
- `pandera.checks.Check.str_endswith`
- `pandera.checks.Check.str_length`
- `pandera.checks.Check.str_matches`
- `pandera.checks.Check.str_startswith`
- `pandera.checks.Check.__call__`

#### 5. DataFrame Schemas - `Coerce`

`coerce` tells Pandera to cast columns to the dtype declared in the schema before validating them. We'll first see what happens without it.

In [24]:
home_data.LotArea.dtype

dtype('int64')

In [25]:
coerce_schema = pa.DataFrameSchema(
    columns={"LotArea": pa.Column(float)},
    coerce=False,
)

In [26]:
coerce_schema.validate(home_data)

SchemaError: expected series 'LotArea' to have type float64, got int64

In [27]:
# and if we set coerce to True, we can coerce the dataframe to the schema
coerce_schema = pa.DataFrameSchema(
    columns={"LotArea": pa.Column(float)},
    coerce=True,
)

In [28]:
coerce_schema.validate(home_data)

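| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450.0 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600.0 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250.0 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550.0 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260.0 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917.0 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175.0 | Pave | | Reg | Lvl | AllPub | ... | 0 | | MnPrv | | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042.0 | Pave | | Reg | Lvl | AllPub | ... | 0 | | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717.0 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 4 | 2010 | WD | Normal | 142125 |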


#### 6. DataFrame Schemas - `Strict`

In [29]:
# Using `strict` we can specify that the dataframe must have the exact columns specified in the schema
strict_schema = pa.DataFrameSchema(
    columns={"LotArea": pa.Column(int), "YearBuilt": pa.Column(int)},
    strict=True,
)

In [30]:
# Another useful feature is setting `strict` to 'filter' which will filter out any columns that are not in the schema
strict_filter_schema = pa.DataFrameSchema(
    columns={"LotArea": pa.Column(int), "YearBuilt": pa.Column(int)},
    strict="filter",
)
filtered_df = strict_filter_schema.validate(home_data)
filtered_df.head()

| | LotArea | YearBuilt |
|---|---|---|
| 0 | 8450 | 2003 |
| 1 | 9600 | 1976 |
| 2 | 11250 | 2001 |
| 3 | 9550 | 1915 |
| 4 | 14260 | 2000 |


### Exercise 1 - DataFrame Schemas

Create a `pa.DataFrameSchema` object for the `home_data` DataFrame. Not all of the required checks were shown above; for some of them you'll need a quick search in the Pandera documentation. The schema should have the following columns and rules:
1. Id is a required and unique column of an integer type and cannot be null.
2. MSZoning is a non-required column of a string type and can be null. If not null it can only accept these values - 'RL', 'RM', 'C (all)', 'RH' and 'FV'.
3. OverallQual is a required column of an integer type, cannot be null and must be in the range 1-10.
4. BsmtCond is a non-required column of a string type and can be null. If not null it can only accept a string of a length of 2.

Bonus:

5. Add the 1stFlrSF and 2ndFlrSF columns to the schema and validate that on average 1stFlrSF>=2ndFlrSF.

Create the schema such that it filters out any other columns that are not in the schema.


In [None]:
exercise_schema = pa.DataFrameSchema(
    columns={
        "Id": <YOUR ANSWER HERE>,
        "MSZoning": <YOUR ANSWER HERE>,
        "OverallQual": <YOUR ANSWER HERE>,
        "BsmtCond": <YOUR ANSWER HERE>,
        "1stFlrSF": <YOUR ANSWER HERE>,
        "2ndFlrSF": <YOUR ANSWER HERE>,
    },
    strict=<YOUR ANSWER HERE>,
    checks=<YOUR ANSWER HERE>,
)

In [None]:
exercise_schema.validate(home_data)

*Exercise solutions can be found in the exercise solutions file in the current directory.*
