# Using TensorFlow Data Validation (TFDV) to explore data

The [downloaded dataset](https://www.kaggle.com/datasets/whenamancodes/alcohol-effects-on-study) is explored with the TensorFlow package [**TFDV**](https://www.tensorflow.org/tfx/data_validation/get_started) in four steps:

1. Load the data and split it.
2. Compute descriptive statistics from the training data and infer its schema.
3. Detect anomalies in the remaining splits and resolve them.
4. Check for data drift and skew.

## Load packages

In [92]:
# Silence TensorFlow's C++ log messages; this must be set before TensorFlow is imported
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# Move to this notebook's folder inside the repository
CURRENT_PATH = os.getcwd()
os.chdir(CURRENT_PATH.split('Feature-engineering-with-TF', 1)[0] + 'Feature-engineering-with-TF/01-DataValidation_TF')

# Essential packages
import pandas as pd
from sklearn.model_selection import train_test_split

# TF package
import tensorflow_data_validation as tfdv

## 1. Load data

The description of the dataset and the code to download it are given in [README.md](https://github.com/saraalgo/Feature-engineering-with-TF/blob/main/README.md).

Load the raw data from the */exdata* folder. The selected problem, **Alcohol Effects On Study**, has two main files: **Portuguese** and **Maths**. The first step is to concatenate them to obtain a larger dataset.

In [93]:
# Read CSV files as dataframe
df_maths = pd.read_csv('../exdata/Alcohol-effects-study/Maths.csv', header=0)
df_port = pd.read_csv('../exdata/Alcohol-effects-study/Portuguese.csv', header=0)

# Add a column with the subject name to each dataframe, then check that both have the same columns
df_maths.insert(0, 'Subject', 'Maths')
df_port.insert(0, 'Subject', 'Portuguese')

print(f'Both dataframes have the same column names: {df_maths.columns.equals(df_port.columns)}')

# Concatenate and print dataset
df = pd.concat([df_maths, df_port], ignore_index=True)
df.head()

Both dataframes have the same column names: True


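| | Subject | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maths | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | Maths | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | Maths | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | Maths | GP | F | 15 | U | GT3 | T | 4 | 2 | health | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | Maths | GP | F | 16 | U | GT3 | T | 3 | 3 | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |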


### Data division

The data must be divided into the three classical splits:
- 75% training data
- 15% validation data
- 10% test data

The test data will not include the *output* variable, because labels are usually not available in production.

In [94]:
# Shuffle the rows reproducibly
df_shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)

# Separate the features from the output variable (G3)
Xdata = df_shuffled.drop('G3', axis=1)
ydata = df_shuffled['G3']

# Percentages for the division
train_perc = 0.75
validation_perc = 0.15
test_perc = 0.10

# First split: train keeps 75% of the entire dataset, the remaining 25% is held out
# (random_state makes the split reproducible)
x_train, x_test, y_train, y_test = train_test_split(Xdata, ydata, test_size=1 - train_perc, random_state=1)

# Second split of the held-out 25%: validation gets 15% and test gets 10% of the initial dataset
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_perc/(test_perc + validation_perc), random_state=1)

# Attach the output column to train and validation; assign() returns a new dataframe
train = x_train.assign(output=y_train)
val = x_val.assign(output=y_val)

# The test split keeps only the features
test = x_test

print(f'Train data: {train.shape}, validation data: {val.shape} and test data: {test.shape}')

Train data: (783, 34), validation data: (156, 34) and test data: (105, 33)
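
The row counts above follow from simple arithmetic on the two chained splits. The sketch below reproduces them in plain Python. The function name `two_step_split_sizes` is hypothetical, and the rounding rule (the held-out part is rounded up, the remainder goes to the other side) is an assumption about how scikit-learn treats a fractional `test_size`.

```python
import math


def two_step_split_sizes(n_rows, train_perc=0.75, validation_perc=0.15, test_perc=0.10):
    """Return (train, validation, test) row counts for the two chained splits.

    Assumption: the held-out size is rounded up and the remaining rows go
    to the other side of the split.
    """
    # Step 1: hold out everything that is not training data
    n_holdout = math.ceil(n_rows * (1 - train_perc))
    n_train = n_rows - n_holdout

    # Step 2: the test share is expressed relative to the held-out part,
    # not to the whole dataset (0.10 / 0.25 = 0.4)
    test_share = test_perc / (test_perc + validation_perc)
    n_test = math.ceil(n_holdout * test_share)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


# 1044 is the total number of rows reported by the shapes above
print(two_step_split_sizes(1044))  # (783, 156, 105)
```

The second `test_size` has to be rescaled because it applies to the 25% that was held out. Passing `test_perc` directly would leave only 2.5% of the initial data in the test split.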


## 2. Extract descriptive statistics and internal schema

Pipeline:
- Select the columns to use, removing irrelevant ones*.
- Compute the training statistics.
- Explore the training statistics interactively.
- Infer the schema of the training data.

*In this case no feature is obviously irrelevant, so nothing is removed before further exploration. The filter would nevertheless be implemented as follows.

In [95]:
# Set filter for columns with TFDV StatsOptions
remove_cols = []
remain_cols = [col for col in train.columns if (col not in remove_cols)]
stats_options = tfdv.StatsOptions(feature_allowlist=remain_cols)
print(stats_options.feature_allowlist)

['Subject', 'school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'output']


In [96]:
# Get descriptors for train data features
stats_train = tfdv.generate_statistics_from_dataframe(train, stats_options=stats_options)
tfdv.visualize_statistics(stats_train)

In [97]:
schema = tfdv.infer_schema(stats_train)
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'Subject' | STRING | required | | 'Subject' |
| 'school' | STRING | required | | 'school' |
| 'sex' | STRING | required | | 'sex' |
| 'age' | INT | required | | - |
| 'address' | STRING | required | | 'address' |
| 'famsize' | STRING | required | | 'famsize' |
| 'Pstatus' | STRING | required | | 'Pstatus' |
| 'Medu' | INT | required | | - |
| 'Fedu' | INT | required | | - |
| 'Mjob' | STRING | required | | 'Mjob' |


| Domain | Values |
|---|---|
| 'Subject' | 'Maths', 'Portuguese' |
| 'school' | 'GP', 'MS' |
| 'sex' | 'F', 'M' |
| 'address' | 'R', 'U' |
| 'famsize' | 'GT3', 'LE3' |
| 'Pstatus' | 'A', 'T' |
| 'Mjob' | 'at_home', 'health', 'other', 'services', 'teacher' |
| 'Fjob' | 'at_home', 'health', 'other', 'services', 'teacher' |
| 'reason' | 'course', 'home', 'other', 'reputation' |
| 'guardian' | 'father', 'mother', 'other' |


## 3. Detect anomalies

Taking the schema inferred from the training data as the reference, it is necessary to check whether the validation and test splits are representative of that data. Any anomalies found are then addressed before training the ML models.

Examples of how to solve those anomalies can be found [HERE](https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic#fix_evaluation_anomalies_in_the_schema).
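
To make the idea of checking a split against a schema concrete, here is a minimal plain-Python sketch. It is not how TFDV is implemented. The helpers `infer_domains` and `find_anomalies`, the toy rows, and the value `'XX'` are all hypothetical and only illustrate two kinds of findings: a categorical value never seen in training, and a column that is absent.

```python
def infer_domains(rows):
    """Collect, per column, the set of string values seen in the training rows."""
    domains = {}
    for row in rows:
        for name, value in row.items():
            if isinstance(value, str):
                domains.setdefault(name, set()).add(value)
    return domains


def find_anomalies(rows, domains, expected_columns):
    """Report values outside the learned domains and columns that never appear."""
    anomalies = {}
    seen_columns = set()
    for row in rows:
        seen_columns.update(row)
        for name, value in row.items():
            if name in domains and value not in domains[name]:
                anomalies.setdefault(name, set()).add(f'unexpected value {value!r}')
    for name in set(expected_columns) - seen_columns:
        anomalies.setdefault(name, set()).add('column missing')
    return anomalies


# Toy data: the "schema" is learned from the training rows only
train_rows = [
    {'school': 'GP', 'age': 18, 'output': 6},
    {'school': 'MS', 'age': 17, 'output': 10},
]
new_rows = [
    {'school': 'GP', 'age': 16},
    {'school': 'XX', 'age': 15},  # 'XX' was never seen in training
]

domains = infer_domains(train_rows)
anomalies = find_anomalies(new_rows, domains, expected_columns=train_rows[0].keys())
print(anomalies)
```

In this toy run, `school` is flagged for the unseen value and `output` is flagged as missing. These are the same two situations to look for in the TFDV reports of the next cells.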

### Validation data

In [98]:
stats_val = tfdv.generate_statistics_from_dataframe(val, stats_options=stats_options)

tfdv.visualize_statistics(lhs_statistics=stats_val, rhs_statistics=stats_train,
                          lhs_name='VALIDATION', rhs_name='TRAIN')

In [99]:
anomalies_val = tfdv.validate_statistics(stats_val, schema)

tfdv.display_anomalies(anomalies_val)

### Test data

In [100]:
stats_test = tfdv.generate_statistics_from_dataframe(test, stats_options=stats_options)

tfdv.visualize_statistics(lhs_statistics=stats_test, rhs_statistics=stats_train,
                          lhs_name='TEST', rhs_name='TRAIN')

In [101]:
anomalies_test = tfdv.validate_statistics(stats_test, schema)

tfdv.display_anomalies(anomalies_test)

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'output' | Column dropped | Column is completely missing |


As expected, the test split shows one anomaly: the *output* column is missing, because it is not needed when this split is passed to the model for prediction. To make TFDV accept the absence of the output column in the test split, the schema can declare separate environments, as in the following code.

In [102]:
# By default, all features are expected in both the TRAIN and TEST environments.
schema.default_environment.append('TRAIN')
schema.default_environment.append('TEST')

# Specify that the 'output' feature is not in the TEST environment.
tfdv.get_feature(schema, 'output').not_in_environment.append('TEST')

anomalies_test_env = tfdv.validate_statistics(stats_test, schema, environment='TEST')

tfdv.display_anomalies(anomalies_test_env)
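
The effect of the environment settings can be pictured with a small plain-Python sketch. It only mimics the rule "a feature is expected in every environment unless it is listed as not in that environment". The helper `expected_features` and the shortened feature list are hypothetical, not TFDV objects.

```python
def expected_features(features, not_in_environment, environment):
    """List the features a split validated under `environment` should contain."""
    return [
        name for name in features
        if environment not in not_in_environment.get(name, ())
    ]


# Shortened, illustrative feature list
features = ['G1', 'G2', 'output']
not_in_environment = {'output': ['TEST']}

print(expected_features(features, not_in_environment, 'TRAIN'))  # ['G1', 'G2', 'output']
print(expected_features(features, not_in_environment, 'TEST'))   # ['G1', 'G2']
```

Under this rule the missing `output` column is no longer an anomaly when the test statistics are validated with `environment='TEST'`, while it is still required for the training split.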

## 4. Check data drift and skew

These two risky phenomena can also be studied with TFDV. The thresholds of the metrics used can be changed, as described [HERE](https://www.tensorflow.org/tfx/data_validation/get_started#checking_data_skew_and_drift). This check is very useful for making sure that the data splits still share the same scope and internal properties.

In [103]:
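# Drift is checked against previous_statistics and skew against serving_statistics.
# Note: according to the TFDV guide linked above, these checks are driven by the
# drift_comparator / skew_comparator thresholds set per feature in the schema;
# none is set in this notebook.
skew_drift_anomalies = tfdv.validate_statistics(stats_train, schema,
                                                previous_statistics=stats_val,
                                                serving_statistics=stats_test)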


tfdv.display_anomalies(skew_drift_anomalies)
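
For categorical features, the TFDV guide expresses skew and drift in terms of the L-infinity distance between two value distributions, compared against a threshold set on the schema (for example `tfdv.get_feature(schema, 'school').skew_comparator.infinity_norm.threshold`, as in the guide linked above). The plain-Python sketch below shows only the distance itself. The helper `l_infinity_distance`, the toy samples, and the threshold value are illustrative assumptions, not TFDV internals.

```python
from collections import Counter


def l_infinity_distance(values_a, values_b):
    """Largest absolute difference between the relative frequencies of any category."""
    freq_a, freq_b = Counter(values_a), Counter(values_b)
    n_a, n_b = len(values_a), len(values_b)
    categories = set(freq_a) | set(freq_b)
    return max(abs(freq_a[c] / n_a - freq_b[c] / n_b) for c in categories)


# Toy samples of a categorical feature in two splits
train_values = ['GP', 'GP', 'GP', 'MS']    # 75% GP, 25% MS
serving_values = ['GP', 'GP', 'MS', 'MS']  # 50% GP, 50% MS

distance = l_infinity_distance(train_values, serving_values)
threshold = 0.1  # hypothetical threshold chosen for illustration

print(distance)              # 0.25
print(distance > threshold)  # True, so this feature would be flagged
```

A lower threshold makes the check stricter. Choosing it per feature is a judgment call that depends on how much variation between splits is acceptable for the problem.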