## EDA using TensorFlow Data Validation (TFDV)

One library that I find useful for speeding up EDA is **TFDV**.

As this example shows, TFDV quickly produces a complete statistical analysis of a dataset, rendered with clear interactive graphics (it is based on **Facets**). It also lets you quickly **compare Train and Test set statistics** to check whether the two datasets have the same distribution.

Is this the case?

In [None]:
print('Installing TensorFlow Data Validation')
!pip install -q tensorflow_data_validation[visualization] > /dev/null

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf

import tensorflow_data_validation as tfdv

In [None]:
INPUT_DIR = '/kaggle/input/tabular-playground-series-may-2021/'

FILE_TRAIN = INPUT_DIR + 'train.csv'
FILE_TEST = INPUT_DIR + 'test.csv'

In [None]:
# compute TRAIN stats with TFDV
train_stats = tfdv.generate_statistics_from_csv(data_location=FILE_TRAIN)

In [None]:
# visualize TRAIN STATS
tfdv.visualize_statistics(train_stats)

### Let's compare TRAIN and TEST stats to see if the two Datasets have the same distribution

In [None]:
# compute TEST stats with TFDV
test_stats = tfdv.generate_statistics_from_csv(data_location=FILE_TEST)

# visualize TEST and TRAIN stats side by side
tfdv.visualize_statistics(lhs_statistics=test_stats, rhs_statistics=train_stats,
                          lhs_name='TEST_DATASET', rhs_name='TRAIN_DATASET')

### In the report we find all the differences between the Train and the Test set features' distributions.
Some conclusions:
* all features are numeric
* there are no missing values
* many features have a very high percentage of zero values
* some features show differences of around 10% in their statistics between Train and Test (feature39, ...)
* the features do not follow a Gaussian distribution
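The first three points can also be cross-checked outside the TFDV report. The following is a minimal pandas sketch, not part of the TFDV workflow above: the helper `summarize_features` and the toy column names are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd


def summarize_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return, per column: numeric-dtype flag, missing-value count, share of zeros.

    Hypothetical helper for illustration; it is not a TFDV function.
    """
    return pd.DataFrame({
        # True when the column has an integer or float dtype
        'is_numeric': df.dtypes.map(pd.api.types.is_numeric_dtype),
        # number of NaN entries in each column
        'n_missing': df.isna().sum(),
        # fraction of rows that are exactly zero (NaN counts as non-zero)
        'zero_fraction': (df == 0).mean(),
    })


# Toy data standing in for the competition files (made-up values)
toy = pd.DataFrame({
    'feature_a': [0, 0, 3, 1],
    'feature_b': [0.0, 2.5, np.nan, 0.0],
})
summary = summarize_features(toy)
print(summary)
```

On the real data the same helper could be applied to `pd.read_csv(FILE_TRAIN)` and `pd.read_csv(FILE_TEST)`, assuming both files fit in memory.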

At this stage we don't know whether these differences are negligible. They may be worth revisiting if we later decide to remove some features.
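One simple way to rank features by how much they differ is to compare their means in the two sets. The sketch below assumes that "around 10%" refers to a relative difference in a summary statistic such as the mean; that reading, the helper name `relative_mean_shift`, and the toy columns are assumptions, not something taken from the TFDV report.

```python
import numpy as np
import pandas as pd


def relative_mean_shift(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Relative difference |mean_test - mean_train| / |mean_train| per shared column.

    Hypothetical helper for illustration. Columns present in only one
    DataFrame (for example the target) are ignored.
    """
    common = [c for c in train.columns if c in test.columns]
    train_mean = train[common].mean()
    test_mean = test[common].mean()
    # A zero train mean would make the ratio undefined, so report NaN there
    denom = train_mean.abs().replace(0, np.nan)
    shift = (test_mean - train_mean).abs() / denom
    return shift.sort_values(ascending=False)


# Toy data standing in for the competition files (made-up values)
toy_train = pd.DataFrame({'f1': [1, 2, 3], 'f2': [10, 10, 10], 'target': [0, 1, 0]})
toy_test = pd.DataFrame({'f1': [1, 2, 3], 'f2': [11, 11, 11]})
shift = relative_mean_shift(toy_train, toy_test)
print(shift)
```

The mean is only one possible statistic; because many features are dominated by zeros, comparing the share of zeros or the full histograms may be more informative.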