# Data Validation with TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is part of TensorFlow Extended (TFX), an extension of TensorFlow that covers the steps needed to put a model into production, from validating the data to serving the model on a server.

## Library Imports

In [1]:
import pandas as pd
import tensorflow_data_validation as tfdv

## Load the dataset

In [2]:
# Load the dataset
dataset = pd.read_csv("resources/pollution_small.csv")

# Check the load
dataset.head()

,Date,pm10,no2,so2,soot
0,1/1/2009,98.67,14.1,44.38,34.81
1,1/2/2009,52.33,14.1,29.75,33.06
2,1/3/2009,74.67,20.5,36.25,39.25
3,1/4/2009,72.0,17.3,46.44,34.38
4,1/5/2009,81.0,25.64,56.56,45.59


In [3]:
# Check the data structure
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2188 entries, 0 to 2187
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    2188 non-null   object 
 1   pm10    2188 non-null   float64
 2   no2     2188 non-null   float64
 3   so2     2188 non-null   float64
 4   soot    2188 non-null   float64
dtypes: float64(4), object(1)
memory usage: 85.6+ KB


In [7]:
# Split the rows by position: the first 1600 for training, the rest for testing
training_dataset = dataset[:1600]
test_dataset = dataset[1600:]

In [8]:
# Check the summary statistics of the training dataset
training_dataset.describe()

,pm10,no2,so2,soot
count,1600.0,1600.0,1600.0,1600.0
mean,49.656494,30.980519,16.229981,21.551956
std,35.211906,12.400788,10.621896,12.127354
min,6.38,9.74,4.01,6.0
25%,28.345,22.5675,9.7775,14.4
50%,38.835,28.715,13.275,18.63
75%,58.05,36.37,19.2825,24.0725
max,277.25,138.01,123.13,107.65


In [9]:
# Check the summary statistics of the test dataset
test_dataset.describe()

,pm10,no2,so2,soot
count,588.0,588.0,588.0,588.0
mean,44.648248,37.296922,13.60517,18.44131
std,28.992087,10.94005,5.098944,6.596459
min,11.9,15.07,4.99,8.0
25%,28.3375,29.2175,10.1225,14.41
50%,35.555,35.815,12.345,17.09
75%,50.8125,43.8725,15.855,20.9625
max,273.77,106.03,38.03,87.21


## Data Analysis and Validation with TFDV

### Generate Training data statistics

These statistics are much more detailed than the output of the usual pandas describe() method.

In [10]:
# Generate the statistics from the training dataset only
train_stats = tfdv.generate_statistics_from_dataframe(dataframe=training_dataset)
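
To give a feel for what "more detailed" means, here is a minimal plain-Python sketch of per-feature statistics that go slightly beyond describe(), such as counts of missing and zero values. It is only an illustration, not how TFDV computes its statistics; the helper name `numeric_feature_stats` is hypothetical, and the sample values are the first five pm10 readings shown above.

```python
import statistics

def numeric_feature_stats(values):
    """Summarise one numeric feature, including missing and zero counts."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "zeros": sum(1 for v in present if v == 0),
        "mean": statistics.fmean(present),
        "std_dev": statistics.pstdev(present),
        "min": min(present),
        "median": statistics.median(present),
        "max": max(present),
    }

# First five pm10 readings from dataset.head()
pm10_sample = [98.67, 52.33, 74.67, 72.0, 81.0]
stats = numeric_feature_stats(pm10_sample)
print(stats)
```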

### Inferring the Schema

From the generated statistics, TFDV infers a schema object that describes the training dataset: each feature's name, type, and whether it is required.
All new data (here, the test data) is checked against this schema to flag anomalies before it is admitted to a pipeline.

In [13]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema)

Feature name,Type,Presence,Valency,Domain
'Date',BYTES,required,,-
'pm10',FLOAT,required,,-
'no2',FLOAT,required,,-
'so2',FLOAT,required,,-
'soot',FLOAT,required,,-
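
As a rough mental model of schema inference, the sketch below derives a type and a "required" flag for each column from a handful of rows. This is a toy under stated assumptions, not TFDV's inference logic; `infer_simple_schema` and the dictionary layout are hypothetical names chosen for illustration.

```python
from typing import Any, Dict, List

def infer_simple_schema(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Infer a toy schema: one type name and a presence flag per column."""
    # Collect column names in first-seen order
    columns: List[str] = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    schema: Dict[str, Dict[str, Any]] = {}
    for name in columns:
        present = [row[name] for row in rows if row.get(name) is not None]
        type_names = {type(v).__name__ for v in present}
        if len(type_names) == 1:
            type_name = type_names.pop()
        elif not type_names:
            type_name = "unknown"
        else:
            type_name = "mixed"
        # A column is "required" when every row has a value for it
        schema[name] = {"type": type_name, "required": len(present) == len(rows)}
    return schema

sample_rows = [
    {"Date": "1/1/2009", "pm10": 98.67, "soot": 34.81},
    {"Date": "1/2/2009", "pm10": 52.33, "soot": 33.06},
]
toy_schema = infer_simple_schema(sample_rows)
print(toy_schema)
```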


### Calculate Test Statistics

In [14]:
test_stats = tfdv.generate_statistics_from_dataframe(dataframe=test_dataset)

### Compare Test Statistics with Schema

In [15]:
# Checking for anomalies in new data
anomalies = tfdv.validate_statistics(statistics=test_stats, schema=schema)

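### Displaying all detected anomalies

Examples of the kinds of anomalies a schema check can report (illustrative, not specific to this dataset):

- Integer larger than 10
- STRING type when expected INT type
- FLOAT type when expected INT type
- Integer smaller than 0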

In [16]:
tfdv.display_anomalies(anomalies)

### New Data with Anomalies
We now introduce an anomaly by dropping a column and check whether TFDV detects it.

In [17]:
test_set_copy = test_dataset.copy()
test_set_copy.drop('soot', axis=1, inplace=True)

In [18]:
# Now we generate statistics
test_set_copy_stats = tfdv.generate_statistics_from_dataframe(dataframe=test_set_copy)
anomalies_copy = tfdv.validate_statistics(statistics=test_set_copy_stats, schema=schema)
tfdv.display_anomalies(anomalies_copy)

Feature name,Anomaly short description,Anomaly long description
'soot',Column dropped,Column is completely missing
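
The idea behind this check can be sketched in a few lines of plain Python: compare each column the schema expects with what the new batch actually contains. This is a simplified illustration, not TFDV's validation code; `validate_batch` and the toy schema (a mapping from column name to Python type) are hypothetical.

```python
def validate_batch(schema, rows):
    """Return a {column: description} dict of problems found in rows."""
    anomalies = {}
    for name, expected_type in schema.items():
        values = [row[name] for row in rows if name in row]
        if not values:
            # The column never appears in the batch
            anomalies[name] = "Column dropped"
        elif not all(isinstance(v, expected_type) for v in values):
            anomalies[name] = "Unexpected type"
    return anomalies

toy_schema = {"pm10": float, "no2": float, "soot": float}
# A batch without the 'soot' column, mirroring test_set_copy
batch = [
    {"pm10": 98.67, "no2": 14.1},
    {"pm10": 52.33, "no2": 14.1},
]
anomalies = validate_batch(toy_schema, batch)
print(anomalies)  # {'soot': 'Column dropped'}
```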


## Prepare the Schema for Serving

We can define different environments and specify which features of the schema are expected in each environment.

In [19]:
# Let's create the environments
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

### Remove the target column from the serving schema, since that is the column we will predict

In [20]:
tfdv.get_feature(schema, "soot").not_in_environment.append("SERVING") # target variable will not be present in the serving environment

### Check for anomalies between the SERVING environment and the new test set

In [22]:
serving_env_anomalies = tfdv.validate_statistics(test_set_copy_stats, schema=schema, environment="SERVING")
tfdv.display_anomalies(serving_env_anomalies)
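
Conceptually, an environment just changes which features a batch is expected to contain. The plain-Python sketch below mimics that behaviour with a toy schema; it is an illustration only, and `columns_expected_in` and the `not_in_environment` key of the toy dictionary are hypothetical stand-ins rather than TFDV objects.

```python
def columns_expected_in(schema, environment):
    """Return the columns a batch must contain in the given environment."""
    return [
        name
        for name, spec in schema.items()
        if environment not in spec.get("not_in_environment", [])
    ]

toy_schema = {
    "Date": {},
    "pm10": {},
    "no2": {},
    "so2": {},
    "soot": {"not_in_environment": ["SERVING"]},  # target column
}
serving_batch_columns = ["Date", "pm10", "no2", "so2"]

for env in ("TRAINING", "SERVING"):
    expected = columns_expected_in(toy_schema, env)
    missing = [c for c in expected if c not in serving_batch_columns]
    print(env, "missing:", missing)
```

With the target excluded from SERVING, the batch without soot is only flagged when it is validated as training data.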

## Freezing the schema for later use - Useful for end-to-end pipelines

In [23]:
tfdv.write_schema_text(schema=schema, output_path="pollution_schema.pbtxt")