# Data Validation Tutorial
Welcome to the data validation tutorial using TensorFlow Data Validation! In this tutorial we will use TensorFlow Data Validation (or tfdv) to write data tests. The tutorial is divided into the following segments:

1. TFDV Exploration: we will start by exploring the fundamentals of tfdv.
2. Data validation: next we will write some *tests* for our data.

In [1]:
# put your import statements here
import tensorflow as tf
import tensorflow_data_validation as tfdv
print('TF version:', tf.__version__)
print('TFDV version:', tfdv.version.__version__)

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

TF version: 2.12.0
TFDV version: 1.13.0


## Part 1: TFDV Exploration
In this section we will explore the basics of tfdv. Follow the instructions below.

### 1.1 Load the data
Use `pandas` to load the csv file as a dataframe.

+ *Hint*: did you check the [10 minutes with pandas guide]? Did you read the *Getting data in/out* section?
+ *Hint*: did you import pandas as `pd`?

[10 minutes with pandas guide]: https://pandas.pydata.org/docs/user_guide/10min.html

In [2]:
# your code here
df = pd.read_csv("./data/data.csv")

In [3]:
print(df.shape)

(569, 33)


### 1.2 Create train-test split
Use sklearn to create a 75-25 train-test split (in favour of train).

**NOTE**: Use 42 as the `random_state` so that we have reproducible splits.

+ *Hint*: did you check the api documentation for [sklearn.model_selection]?
+ *Hint*: did you import the necessary function?

[sklearn.model_selection]: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection

In [11]:
# your code here
train_set, test_set = train_test_split(df, test_size=0.25, random_state=42)

In [10]:
print(train_set.shape)

(426, 33)
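
As a quick sanity check on the split, the sketch below uses a small synthetic dataframe (the column names and the 100-row size are made up for illustration and are not taken from `data.csv`). It confirms that `test_size=0.25` gives a 75-25 split and that a fixed `random_state` reproduces the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataframe
toy_df = pd.DataFrame({"feature": range(100), "label": ["B", "M"] * 50})

# Two independent splits that share the same random_state
train_a, test_a = train_test_split(toy_df, test_size=0.25, random_state=42)
train_b, test_b = train_test_split(toy_df, test_size=0.25, random_state=42)

print(train_a.shape, test_a.shape)          # (75, 2) (25, 2)
print(train_a.index.equals(train_b.index))  # True: same seed, same rows
```

With the 569 rows of the tutorial dataset, scikit-learn rounds the test set up to 143 rows, which leaves the 426 training rows printed above.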


In [7]:
print(test_set)

           id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
204     87930         B        12.47         18.60           81.09      481.9   
70     859575         M        18.94         21.31          123.60     1130.0   
131      8670         M        15.46         19.48          101.70      748.9   
431    907915         B        12.40         17.68           81.47      467.8   
540    921385         B        11.54         14.44           74.65      402.9   
..        ...       ...          ...           ...             ...        ...   
89     861598         B        14.64         15.24           95.77      651.9   
199    877500         M        14.45         20.22           94.49      642.7   
411    905520         B        11.04         16.83           70.92      373.2   
18     849014         M        19.81         22.15          130.00     1260.0   
390  90317302         B        10.26         12.22           65.75      321.6   

     smoothness_mean  compa

### 1.3 Use tfdv to compute statistics for the training set
Generate descriptive statistics for the training set using tfdv.

+ *Hint*: you already know what to do...[read the docs]! Look for the appropriate `generate_statistics_from...` method.
+ *Hint*: did you import tensorflow data validation as `tfdv`?

[read the docs]: https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv

In [12]:
# your code here
train_stats = tfdv.generate_statistics_from_dataframe(train_set)

### 1.4 Visualise training stats & make observations
Let's visualise the statistics for our training set and see what they look like. In particular, make the following observations:

1. Look at the distribution of the features
1. Look at the range of values for numerical features
1. Look at the values for categorical features
1. Do we have any missing values?
1. Explore the UI: change the feature sort order, change the chart type, etc.

+ *Hint*: there is only one method in tfdv that can do this!

In [14]:
# your code here
tfdv.visualize_statistics(train_stats)
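
The observations listed above can also be cross-checked numerically with pandas. The sketch below is a minimal example on a made-up dataframe (the values are hypothetical, and only the column names are borrowed from the dataset); it is not output from the tutorial data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical feature with a gap, one categorical feature
toy_df = pd.DataFrame({
    "radius_mean": [12.5, 18.9, np.nan, 11.5],
    "diagnosis": ["B", "M", "M", "B"],
})

# Number of missing values per column
missing = toy_df.isna().sum()
print(missing)

# Range (min and max) of every numerical feature
ranges = toy_df.select_dtypes(include="number").agg(["min", "max"])
print(ranges)

# Distinct values of the categorical feature
categories = sorted(toy_df["diagnosis"].unique())
print(categories)
```

On the real data, replacing `toy_df` with `train_set` should give numbers that agree with what the tfdv UI reports.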

***Reflect on the following questions:***

1. Is the visualisation provided by tfdv useful?
1. Can tfdv be used during data exploration & understanding?
1. Do we have any features that do not follow a *normal distribution*? Will this affect the model's performance?

### 1.5 Visualise testing stats & make observations
Let's now also make sure that our test set is *similar* to our training set. Generate the statistics for the test set and visualise them alongside those of the training set.

+ *Hint*: did you read the api documentation for the method you used to visualise statistics carefully? Can you visualise statistics of two datasets at the same time?

In [19]:
# your code here
test_stats = tfdv.generate_statistics_from_dataframe(test_set)

In [20]:
tfdv.visualize_statistics(lhs_statistics=test_stats, rhs_statistics=train_stats, lhs_name='TEST_DATASET', rhs_name='TRAIN_DATASET')

***Reflect on the following questions:***

1. Are the training & testing splits similar?
1. Can tfdv be used during data exploration & understanding?
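
One way to put a number on the first question is to compare the class proportions of the two splits directly. The sketch below assumes a categorical `diagnosis` column and uses two tiny made-up dataframes in place of `train_set` and `test_set`.

```python
import pandas as pd

# Hypothetical splits; in the tutorial these would be train_set and test_set
toy_train = pd.DataFrame({"diagnosis": ["B", "B", "B", "M", "M", "B", "M", "B"]})
toy_test = pd.DataFrame({"diagnosis": ["B", "M", "B", "B"]})

# Share of each class in either split, aligned on the class label
comparison = pd.DataFrame({
    "train": toy_train["diagnosis"].value_counts(normalize=True),
    "test": toy_test["diagnosis"].value_counts(normalize=True),
})

# Absolute gap between the two shares; large gaps hint at dissimilar splits
comparison["abs_diff"] = (comparison["train"] - comparison["test"]).abs()
print(comparison)
```

What counts as a "large" gap depends on the dataset size, so treat the result as a prompt for a closer look rather than a pass/fail check.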

## Part 2: Data Validation
In this section we will write tests for our data. TFDV does this by first creating a *schema* for our data, which specifies what we expect the data to look like. In software engineering terms, the schema is the *test oracle* against which we check whether our tests pass or fail.

### 2.1 Create a schema
Generate a schema from the training set.

In [21]:
# your code here
schema = tfdv.infer_schema(statistics=train_stats)

### 2.2 Inspect the schema
"Pretty print" the schema that was generated.

In [22]:
# your code here
tfdv.display_schema(schema=schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'id' | INT | required | | - |
| 'diagnosis' | STRING | required | | 'diagnosis' |
| 'radius_mean' | FLOAT | required | | - |
| 'texture_mean' | FLOAT | required | | - |
| 'perimeter_mean' | FLOAT | required | | - |
| 'area_mean' | FLOAT | required | | - |
| 'smoothness_mean' | FLOAT | required | | - |
| 'compactness_mean' | FLOAT | required | | - |
| 'concavity_mean' | FLOAT | required | | - |
| 'concave points_mean' | FLOAT | required | | - |

| Domain | Values |
|---|---|
| 'diagnosis' | 'B', 'M' |


In [44]:
tfdv.display_anomalies(tfdv.validate_statistics(train_stats, schema))

***Reflect on the following:***

+ Does the schema make sense?
+ Does it need to be modified?
+ What about this `Unnamed: 32` column?
+ Do we really need the `id` column?
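
A column named `Unnamed: N` usually appears when the header row of a CSV file contains an empty field, for example because every line ends with a trailing comma. Whether that is what happens in `data.csv` is an assumption; the sketch below only reproduces the effect on a made-up three-line CSV.

```python
import io

import pandas as pd

# Hypothetical CSV text: every line, including the header, ends with a comma
csv_text = "a,b,\n1,2,\n3,4,\n"

toy_df = pd.read_csv(io.StringIO(csv_text))

# pandas names the empty header field after its position
print(list(toy_df.columns))                # ['a', 'b', 'Unnamed: 2']

# The phantom column holds no data at all
print(toy_df["Unnamed: 2"].isna().all())   # True
```

A column like this carries no information, which makes it a natural candidate for the cleanup step that follows.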

#### 2.2.1 Clean up the schema
Drop the columns we don't need & recreate the schema.

In [50]:
# your code here
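# 'id' is a row identifier rather than a feature, so it does not belong in the schema
features_to_drop = ['id']

# Recreate the schema from the training set: it is the reference the test set is checked against
df_dropped = train_set.drop(features_to_drop, axis=1)
df_dropped_stats = tfdv.generate_statistics_from_dataframe(df_dropped)
schema_dropped = tfdv.infer_schema(statistics=df_dropped_stats)
tfdv.display_schema(schema=schema_dropped)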

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'diagnosis' | STRING | required | | 'diagnosis' |
| 'radius_mean' | FLOAT | required | | - |
| 'texture_mean' | FLOAT | required | | - |
| 'perimeter_mean' | FLOAT | required | | - |
| 'area_mean' | FLOAT | required | | - |
| 'smoothness_mean' | FLOAT | required | | - |
| 'compactness_mean' | FLOAT | required | | - |
| 'concavity_mean' | FLOAT | required | | - |
| 'concave points_mean' | FLOAT | required | | - |
| 'symmetry_mean' | FLOAT | required | | - |

| Domain | Values |
|---|---|
| 'diagnosis' | 'B', 'M' |


### 2.3 Inspect anomalies in test set
Now let's check whether our test set meets the expectations defined in our schema. First, use tfdv to find the *anomalies* (a.k.a. bugs) in our test set and then "pretty print" them.

+ *Hint*: I will give you the answer for this one! You need to use the `validate_statistics` method followed by the appropriate `display...` method.

In [51]:
# your code here
# Validate the untouched test set statistics against the cleaned-up schema
anomalies = tfdv.validate_statistics(statistics=test_stats, schema=schema_dropped)
tfdv.display_anomalies(anomalies)

### 2.4 Make the bugs go away!
Can you fix the issue here? We did this for the training set already!

In [None]:
# your code here
