In [1]:
try:
  import colab
  !pip install --upgrade pip
except ImportError:
  pass

In [2]:
print('Installing TensorFlow Data Validation')
!pip install --upgrade 'tensorflow_data_validation[visualization]<2'

Installing TensorFlow Data Validation
Collecting tensorflow_data_validation<2 (from tensorflow_data_validation[visualization]<2)
  Downloading tensorflow_data_validation-1.16.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Collecting pandas<2,>=1.0 (from tensorflow_data_validation<2->tensorflow_data_validation[visualization]<2)
  Downloading pandas-1.5.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting pyarrow<11,>=10 (from tensorflow_data_validation<2->tensorflow_data_validation[visualization]<2)
  Downloading pyarrow-10.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting pyfarmhash<0.4,>=0.2.2 (from tensorflow_data_validation<2->tensorflow_data_validation[visualization]<2)
  Downloading pyfarmhash-0.3.2.tar.gz (99 kB)
  Preparing metadata (setup.py) ...

In [2]:
import os
import pandas as pd
import tensorflow as tf
import tempfile, urllib, zipfile
import tensorflow_data_validation as tfdv


from tensorflow.python.lib.io import file_io
from tensorflow_data_validation.utils import slicing_util
from tensorflow_metadata.proto.v0.statistics_pb2 import DatasetFeatureStatisticsList, DatasetFeatureStatistics

tf.get_logger().setLevel('ERROR')

In [5]:
df = pd.read_csv("diabetic_data.csv")
df.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


#### Data splits
In a production ML system, the model performance can be negatively affected by anomalies and divergence between data splits for training, evaluation, and serving. To emulate a production system, you will split the dataset into:

1. 70% training set
2. 15% evaluation set
3. 15% serving set

You will then use TFDV to visualize, analyze, and understand the data. You will create a data schema from the training dataset, then compare the evaluation and serving sets with this schema to detect anomalies and data drift/skew.

#### Label column
This dataset has been prepared to analyze the factors related to diabetes outcome. In this notebook, you will treat the `readmitted` column as the target or label column.

The target (or label) is important to know while splitting the data into training, evaluation and serving sets. In supervised learning, you need to include the target in the training and evaluation datasets. **For the serving set however (i.e. the set that simulates the data coming from your users), the label column needs to be dropped since that is the feature that your model will be trying to predict.**

In [6]:
len(df) * 0.7

71236.2

In [8]:
def prepare_data_splits_from_dataframe(df):
    '''
    Splits a Pandas Dataframe into training, evaluation and serving sets.

    Parameters:
            df : pandas dataframe to split

    Returns:
            train_df: Training dataframe(70% of the entire dataset)
            eval_df: Evaluation dataframe (15% of the entire dataset)
            serving_df: Serving dataframe (15% of the entire dataset, label column dropped)
    '''

    # 70% of records for generating the training set
    train_len = int(len(df) * 0.7)

    # Remaining 30% of records for generating the evaluation and serving sets
    eval_serv_len = len(df) - train_len

    # Half of the 30%, which makes up 15% of total records, for generating the evaluation set
    eval_len = eval_serv_len // 2

    # Remaining 15% of total records for generating the serving set
    serv_len = eval_serv_len - eval_len

    # Split the dataframe into the three subsets
    train_df = df.iloc[:train_len].reset_index(drop=True)
    eval_df = df.iloc[train_len: train_len + eval_len].reset_index(drop=True)
    serving_df = df.iloc[train_len + eval_len: train_len + eval_len + serv_len].reset_index(drop=True)

    # Serving data emulates the data that would be submitted for predictions, so it should not have the label column.
    serving_df = serving_df.drop(['readmitted'], axis=1)

    return train_df, eval_df, serving_df

In [9]:
train_df, eval_df, serving_df = prepare_data_splits_from_dataframe(df)
print('Training dataset has {} records\nValidation dataset has {} records\nServing dataset has {} records'.format(len(train_df),len(eval_df),len(serving_df)))

Training dataset has 71236 records
Validation dataset has 15265 records
Serving dataset has 15265 records
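
The record counts printed above follow from the integer arithmetic in `prepare_data_splits_from_dataframe`. As a quick check, the sketch below reproduces that arithmetic without pandas. `split_sizes` is a hypothetical helper name introduced only for illustration, and 101766 is simply the sum of the three counts shown above.

```python
def split_sizes(n_records, train_fraction=0.7):
    """Return (train, eval, serving) sizes for a sequential split."""
    # Truncate toward zero, as int() does in the notebook's function.
    train_len = int(n_records * train_fraction)
    # Whatever is left is shared between evaluation and serving.
    remaining = n_records - train_len
    eval_len = remaining // 2
    # Serving gets the rest, so an odd remainder adds one record here.
    serv_len = remaining - eval_len
    return train_len, eval_len, serv_len


sizes = split_sizes(101766)
print(sizes)
```

Because the notebook's function slices with `iloc`, the rows are taken in their original file order rather than shuffled, which is worth keeping in mind when the later sections compare distributions across splits.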


#### 3 - Generate and Visualize Training Data Statistics

In this section, you will be generating descriptive statistics from the dataset. This is usually the first step when dealing with a dataset you are not yet familiar with. It is also known as performing an exploratory data analysis and its purpose is to understand the data types, the data itself and any possible issues that need to be addressed.

It is important to mention that exploratory data analysis should be performed on the training dataset only. This is because getting information out of the evaluation or serving datasets can be seen as "cheating", since this data is used to emulate data that you have not collected yet and will try to predict using your ML algorithm. In general, it is good practice to avoid leaking information from your evaluation and serving data into your model.

#### Removing Irrelevant Features
Before you generate the statistics, you may want to drop irrelevant features from your dataset. You can do that in TFDV with the `tfdv.StatsOptions` class. It is usually not a good idea to drop features without knowing what information they contain. However, there are times when this can be fairly obvious.

One of the important parameters of the `StatsOptions` class is `feature_allowlist`, which defines the features to include while calculating the data statistics. You can check the documentation to learn more about the class arguments.

In this case, you will omit the statistics for `encounter_id` and `patient_nbr` since they are part of the internal tracking of patients in the hospital and they don't contain valuable information for the task at hand.

In [10]:
features_to_remove = {'encounter_id', 'patient_nbr'}

allowed_cols = [col for col in df.columns if col not in features_to_remove]

stats_options = tfdv.StatsOptions(feature_allowlist=allowed_cols)

for feature in stats_options.feature_allowlist:
    print(feature)

race
gender
age
weight
admission_type_id
discharge_disposition_id
admission_source_id
time_in_hospital
payer_code
medical_specialty
num_lab_procedures
num_procedures
num_medications
number_outpatient
number_emergency
number_inpatient
diag_1
diag_2
diag_3
number_diagnoses
max_glu_serum
A1Cresult
metformin
repaglinide
nateglinide
chlorpropamide
glimepiride
acetohexamide
glipizide
glyburide
tolbutamide
pioglitazone
rosiglitazone
acarbose
miglitol
troglitazone
tolazamide
examide
citoglipton
insulin
glyburide-metformin
glipizide-metformin
glimepiride-pioglitazone
metformin-rosiglitazone
metformin-pioglitazone
change
diabetesMed
readmitted


#### Exercise 1: Generate Training Statistics
TFDV allows you to generate statistics from different data formats such as CSV or a Pandas DataFrame.

Since you already have the data stored in a DataFrame, you can use the function `tfdv.generate_statistics_from_dataframe()` which, given a DataFrame and `stats_options`, generates an object of type `DatasetFeatureStatisticsList`. This object includes the computed statistics of the given dataset.

Complete the cell below to generate the statistics of the training set. Remember to pass the training dataframe and the `stats_options` that you defined above as arguments.

In [11]:
train_stats = tfdv.generate_statistics_from_dataframe(train_df, stats_options=stats_options)

In [12]:
# get the number of features used to compute statistics
print(f"Number of features used: {len(train_stats.datasets[0].features)}")

# check the number of examples used
print(f"Number of examples used: {train_stats.datasets[0].num_examples}")

# check the column names of the first and last feature
print(f"First feature: {train_stats.datasets[0].features[0].path.step[0]}")
print(f"Last feature: {train_stats.datasets[0].features[-1].path.step[0]}")

Number of features used: 48
Number of examples used: 71236
First feature: race
Last feature: readmitted
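
To build some intuition for the kind of per-feature numbers these statistics contain, here is a plain-Python sketch that summarizes a single categorical column. It is only an illustration of the idea and does not reflect how TFDV computes statistics internally; `summarize_column` and the toy values are assumptions made for this example.

```python
from collections import Counter


def summarize_column(values):
    """Return basic descriptive statistics for a list of categorical values."""
    counts = Counter(values)
    # most_common(1) yields the single most frequent value and its count.
    top_value, top_freq = counts.most_common(1)[0]
    return {
        "count": len(values),
        "unique": len(counts),
        "top": top_value,
        "freq": top_freq,
    }


# Toy column, not taken from the diabetes dataset.
toy_gender = ["Female", "Male", "Female", "Female", "Male"]
summary = summarize_column(toy_gender)
print(summary)
```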


#### Exercise 2: Visualize Training Statistics
Now that you have the computed statistics in the `DatasetFeatureStatisticsList` instance, you will need a way to visualize these to get actual insights. TFDV provides this functionality through the method `tfdv.visualize_statistics()`.

Using this function in an interactive Python environment such as this one renders an interactive view of the descriptive statistics you generated earlier.

Try it out yourself! Remember to pass in the generated training statistics in the previous exercise as an argument.

In [13]:
tfdv.visualize_statistics(train_stats)

#### 4 - Infer a data schema

A schema defines the properties of the data and can thus be used to detect errors. Some of these properties include:

1. which features are expected to be present
2. the type of each feature
3. the number of values for a feature in each example
4. the presence of each feature across all examples
5. the expected domains of features

The schema is expected to be fairly static, whereas statistics can vary per data split. So, you will **infer the data schema from only the training dataset**. Later, you will generate statistics for evaluation and serving datasets and compare their state with the data schema to detect anomalies, drift and skew.
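
As a rough mental model of schema inference, the sketch below derives a type and, for string features, a domain of observed values from a handful of toy records. This is a simplified illustration under the assumption that only `str` and `int` values occur; `infer_toy_schema` is a hypothetical helper and is not part of TFDV.

```python
def infer_toy_schema(rows):
    """Infer a per-feature type and, for strings, a domain of seen values."""
    schema = {}
    for row in rows:
        for name, value in row.items():
            entry = schema.setdefault(name, {"type": None, "domain": set()})
            if isinstance(value, str):
                # String features collect every value seen as their domain.
                entry["type"] = "STRING"
                entry["domain"].add(value)
            else:
                # This toy version treats everything else as an integer.
                entry["type"] = "INT"
    return schema


# Toy records, not taken from the diabetes dataset.
toy_rows = [
    {"gender": "Female", "time_in_hospital": 1},
    {"gender": "Male", "time_in_hospital": 3},
    {"gender": "Female", "time_in_hospital": 2},
]
toy_schema = infer_toy_schema(toy_rows)
print(toy_schema["gender"]["type"], sorted(toy_schema["gender"]["domain"]))
print(toy_schema["time_in_hospital"]["type"])
```

A value that never appears in the training rows never enters the domain, which is why new categories in later splits surface as anomalies.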

#### Exercise 3: Infer the training set schema

Schema inference is straightforward using `tfdv.infer_schema()`. This function needs only the statistics (an instance of DatasetFeatureStatisticsList) of your data as input. The output will be a Schema protocol buffer containing the results.

A complementary function is `tfdv.display_schema()` for displaying the schema in a table. This accepts a Schema protocol buffer as input.

In [14]:
training_schema = tfdv.infer_schema(train_stats)
tfdv.display_schema(training_schema)

Unnamed: 0_level_0,Type,Presence,Valency,Domain
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
'race',STRING,required,,'race'
'gender',STRING,required,,'gender'
'age',STRING,required,,'age'
'weight',STRING,required,,'weight'
'admission_type_id',INT,required,,-
'discharge_disposition_id',INT,required,,-
'admission_source_id',INT,required,,-
'time_in_hospital',INT,required,,-
'payer_code',STRING,required,,'payer_code'
'medical_specialty',STRING,required,,'medical_specialty'


Unnamed: 0_level_0,Values
Domain,Unnamed: 1_level_1
'race',"'?', 'AfricanAmerican', 'Asian', 'Caucasian', 'Hispanic', 'Other'"
'gender',"'Female', 'Male', 'Unknown/Invalid'"
'age',"'[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)'"
'weight',"'>200', '?', '[0-25)', '[100-125)', '[125-150)', '[150-175)', '[175-200)', '[25-50)', '[50-75)', '[75-100)'"
'payer_code',"'?', 'BC', 'CH', 'CM', 'CP', 'DM', 'HM', 'MC', 'MD', 'MP', 'OG', 'OT', 'PO', 'SI', 'SP', 'UN', 'WC'"
'medical_specialty',"'?', 'AllergyandImmunology', 'Anesthesiology', 'Anesthesiology-Pediatric', 'Cardiology', 'Cardiology-Pediatric', 'Dentistry', 'Dermatology', 'Emergency/Trauma', 'Endocrinology', 'Family/GeneralPractice', 'Gastroenterology', 'Gynecology', 'Hematology', 'Hematology/Oncology', 'Hospitalist', 'InfectiousDiseases', 'InternalMedicine', 'Nephrology', 'Neurology', 'Obsterics&Gynecology-GynecologicOnco', 'Obstetrics', 'ObstetricsandGynecology', 'Oncology', 'Ophthalmology', 'Orthopedics', 'Orthopedics-Reconstructive', 'Osteopath', 'Otolaryngology', 'OutreachServices', 'Pathology', 'Pediatrics', 'Pediatrics-AllergyandImmunology', 'Pediatrics-CriticalCare', 'Pediatrics-EmergencyMedicine', 'Pediatrics-Endocrinology', 'Pediatrics-Hematology-Oncology', 'Pediatrics-InfectiousDiseases', 'Pediatrics-Neurology', 'Pediatrics-Pulmonology', 'Perinatology', 'PhysicalMedicineandRehabilitation', 'PhysicianNotFound', 'Podiatry', 'Proctology', 'Psychiatry', 'Psychiatry-Addictive', 'Psychiatry-Child/Adolescent', 'Psychology', 'Pulmonology', 'Radiologist', 'Radiology', 'Rheumatology', 'Speech', 'SportsMedicine', 'Surgeon', 'Surgery-Cardiovascular', 'Surgery-Cardiovascular/Thoracic', 'Surgery-Colon&Rectal', 'Surgery-General', 'Surgery-Maxillofacial', 'Surgery-Neuro', 'Surgery-Pediatric', 'Surgery-Plastic', 'Surgery-PlasticwithinHeadandNeck', 'Surgery-Thoracic', 'Surgery-Vascular', 'SurgicalSpecialty', 'Urology'"
'max_glu_serum',"'>200', '>300', 'None', 'Norm'"
'A1Cresult',"'>7', '>8', 'None', 'Norm'"
'metformin',"'Down', 'No', 'Steady', 'Up'"
'repaglinide',"'Down', 'No', 'Steady', 'Up'"


In [15]:
# Check number of features
print(f"Number of features in schema: {len(training_schema.feature)}")

# Check domain name of 2nd feature
print(f"Second feature in schema: {list(training_schema.feature)[1].domain}")

Number of features in schema: 48
Second feature in schema: gender


#### 5 - Calculate, Visualize and Fix Evaluation Anomalies

It is important that the schema of the evaluation data is consistent with the training data, since the data that your model is going to receive should be consistent with the data you used to train it.

Moreover, **it is also important that the features of the evaluation data belong roughly to the same range as the training data**. This ensures that the model will be evaluated on a similar loss surface covered during training.

#### Exercise 4: Compare Training and Evaluation Statistics

Now you are going to generate the evaluation statistics and compare them with the training statistics. You can use the `tfdv.generate_statistics_from_dataframe()` function for this, but this time you'll need to pass the evaluation data. For the `stats_options` parameter, the options you used before work here too.

Remember that to visualize the evaluation statistics you can use `tfdv.visualize_statistics()`.

However, it is impractical to visualize both statistics separately and do your comparison from there. Fortunately, TFDV has got this covered. You can use the `visualize_statistics` function and pass additional parameters to overlay the statistics from both datasets (referenced as left-hand side and right-hand side statistics). Let's see what these parameters are:

1. `lhs_statistics`: Required parameter. Expects an instance of `DatasetFeatureStatisticsList`.
2. `rhs_statistics`: Expects an instance of `DatasetFeatureStatisticsList`  to compare with `lhs_statistics`
3. `lhs_name`: Name of the `lhs_statistics` dataset.
4. `rhs_name`: Name of the `rhs_statistics` dataset.

For this case, pass `eval_stats` as the `lhs_statistics` argument and `train_stats` as the optional `rhs_statistics` argument.

Additionally, set the `lhs_name` and `rhs_name` arguments to `'EVAL_DATASET'` and `'TRAIN_DATASET'` respectively.


In [16]:
eval_stats = tfdv.generate_statistics_from_dataframe(eval_df, stats_options=stats_options)

tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')

In [17]:
# get the number of features used to compute statistics
print(f"Number of features: {len(eval_stats.datasets[0].features)}")

# check the number of examples used
print(f"Number of examples: {eval_stats.datasets[0].num_examples}")

# check the column names of the first and last feature
print(f"First feature: {eval_stats.datasets[0].features[0].path.step[0]}")
print(f"Last feature: {eval_stats.datasets[0].features[-1].path.step[0]}")

Number of features: 48
Number of examples: 15265
First feature: race
Last feature: readmitted


#### Exercise 5: Detecting Anomalies

At this point, you should ask if your evaluation dataset matches the schema from your training dataset. For instance, if you scroll through the output cell in the previous exercise, you can see that the categorical feature `glimepiride-pioglitazone` has 1 unique value in the training set while the evaluation dataset has 2. You can verify this with the built-in Pandas `describe()` method as well.

In [18]:
train_df["glimepiride-pioglitazone"].describe()

Unnamed: 0,glimepiride-pioglitazone
count,71236
unique,1
top,No
freq,71236


In [19]:
eval_df["glimepiride-pioglitazone"].describe()

Unnamed: 0,glimepiride-pioglitazone
count,15265
unique,2
top,No
freq,15264


It is possible but highly inefficient to visually inspect and determine all the anomalies. So, let's instead use TFDV functions to detect and display these.

You can use the function `tfdv.validate_statistics()` for detecting anomalies and `tfdv.display_anomalies()` for displaying them.

The `validate_statistics()` method has two required arguments:

1. an instance of `DatasetFeatureStatisticsList`
2. an instance of `Schema`


In [20]:
def calculate_and_display_anomalies(statistics, schema):
    '''
    Calculate and display anomalies.

            Parameters:
                    statistics : Data statistics in statistics_pb2.DatasetFeatureStatisticsList format
                    schema : Data schema in schema_pb2.Schema format

            Returns:
                    display of calculated anomalies
    '''
    anomalies = tfdv.validate_statistics(statistics, schema)

    # Display the anomalies calculated above
    tfdv.display_anomalies(anomalies)

In [21]:
calculate_and_display_anomalies(eval_stats, schema=training_schema)

Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'glimepiride-pioglitazone',Unexpected string values,Examples contain values missing from the schema: Steady (<1%).
'medical_specialty',Unexpected string values,Examples contain values missing from the schema: Neurophysiology (<1%).
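
The "Unexpected string values" anomalies reported above boil down to a membership check against the domain inferred from training. The sketch below illustrates that check in plain Python on toy data shaped like the `glimepiride-pioglitazone` case. `find_unexpected_values` is a hypothetical helper, and the toy counts are assumptions, not figures from the dataset.

```python
from collections import Counter


def find_unexpected_values(values, domain):
    """Return {value: fraction} for values that are not in the domain."""
    counts = Counter(values)
    total = len(values)
    return {
        value: count / total
        for value, count in counts.items()
        if value not in domain
    }


# Toy data: the training domain holds only 'No', and one evaluation
# record out of 1000 carries the unseen value 'Steady'.
train_domain = {"No"}
eval_values = ["No"] * 999 + ["Steady"]
unexpected = find_unexpected_values(eval_values, train_domain)
print(unexpected)
```

The reported fraction plays the same role as the `(<1%)` note in the anomaly description: it indicates how much of the data falls outside the expected domain.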


#### Exercise 6: Fix evaluation anomalies in the schema

The evaluation data has records with values for the features `glimepiride-pioglitazone` and `medical_specialty` that were not included in the schema generated from the training data. You can fix this by adding the new values that exist in the evaluation dataset to the domain of these features.

To get the `domain` of a particular feature you can use `tfdv.get_domain()`

You can call `append()` on the `value` property of the returned `domain` to add strings to the list of valid values. To be more explicit, given a domain you can do something like:



```
domain.value.append("feature_value")
```



In [22]:
gp_domain = tfdv.get_domain(feature_path="glimepiride-pioglitazone", schema=training_schema)
gp_domain.value.append("Steady")

ms_domain = tfdv.get_domain(feature_path="medical_specialty", schema=training_schema)
ms_domain.value.append("Neurophysiology")

calculate_and_display_anomalies(eval_stats, schema=training_schema)

#### 6 - Schema Environments

By default, all datasets in a pipeline should use the same schema. However, there are some exceptions.

For example, **the label column is dropped in the serving set** so this will be flagged when comparing with the training set schema.

#### Exercise 7: Check anomalies in the serving set

Now you are going to check for anomalies in the serving data. The process is very similar to the one you previously followed for the evaluation data, with one small change.

Let's create a new `StatsOptions` instance that is aware of the information provided by the schema and use it when generating statistics from the serving DataFrame.

In [23]:
options = tfdv.StatsOptions(
    schema = training_schema,
    infer_type_from_schema = True,
    feature_allowlist = allowed_cols
)

serving_stats = tfdv.generate_statistics_from_dataframe(serving_df, stats_options=options)
calculate_and_display_anomalies(serving_stats, schema=training_schema)

Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'payer_code',Unexpected string values,Examples contain values missing from the schema: FR (<1%).
'metformin-rosiglitazone',Unexpected string values,Examples contain values missing from the schema: Steady (<1%).
'medical_specialty',Unexpected string values,"Examples contain values missing from the schema: DCPTEAM (<1%), Endocrinology-Metabolism (<1%), Resident (<1%)."
'readmitted',Column dropped,Column is completely missing
'metformin-pioglitazone',Unexpected string values,Examples contain values missing from the schema: Steady (<1%).


You should see that the `metformin-rosiglitazone`, `metformin-pioglitazone`, `payer_code` and `medical_specialty` features each have an anomaly (`Unexpected string values`) affecting less than 1% of the examples.

Let's **relax the anomaly detection constraints** for the last two of these features by setting `min_domain_mass` in each feature's distribution constraints.

In [24]:
payer_code = tfdv.get_feature(schema=training_schema, feature_path="payer_code")
payer_code.distribution_constraints.min_domain_mass = 0.9

medical_specialty = tfdv.get_feature(schema=training_schema, feature_path="medical_specialty")
medical_specialty.distribution_constraints.min_domain_mass = 0.9

calculate_and_display_anomalies(serving_stats, schema=training_schema)

Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'metformin-pioglitazone',Unexpected string values,Examples contain values missing from the schema: Steady (<1%).
'metformin-rosiglitazone',Unexpected string values,Examples contain values missing from the schema: Steady (<1%).
'readmitted',Column dropped,Column is completely missing


If the `payer_code` and `medical_specialty` are no longer part of the output cell, then the relaxation worked!
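
The effect of `min_domain_mass` can be illustrated with a small plain-Python sketch: it is the minimum fraction of values that must come from the feature's domain. The helper `passes_domain_mass`, its default of 1.0, and the toy column are assumptions made for this example and are not TFDV code.

```python
def passes_domain_mass(values, domain, min_domain_mass=1.0):
    """Check whether enough values fall inside the domain."""
    in_domain = sum(1 for v in values if v in domain)
    # Fraction of all values that belong to the domain.
    mass = in_domain / len(values)
    return mass >= min_domain_mass, mass


# Toy serving column: 1 of 200 values ('FR') is outside the domain.
domain = {"MC", "HM", "?"}
serving_values = ["MC"] * 120 + ["HM"] * 79 + ["FR"]

strict_ok, mass = passes_domain_mass(serving_values, domain, 1.0)
relaxed_ok, _ = passes_domain_mass(serving_values, domain, 0.9)
print(mass, strict_ok, relaxed_ok)
```

With a threshold of 0.9, up to 10% of the values may fall outside the domain before the feature is flagged, so choose the threshold with the acceptable share of unseen categories in mind.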

#### Exercise 8: Modifying the Domain

In [25]:
tfdv.display_schema(training_schema)

Unnamed: 0_level_0,Type,Presence,Valency,Domain
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
'race',STRING,required,,'race'
'gender',STRING,required,,'gender'
'age',STRING,required,,'age'
'weight',STRING,required,,'weight'
'admission_type_id',INT,required,,-
'discharge_disposition_id',INT,required,,-
'admission_source_id',INT,required,,-
'time_in_hospital',INT,required,,-
'payer_code',STRING,required,,'payer_code'
'medical_specialty',STRING,required,,'medical_specialty'


Unnamed: 0_level_0,Values
Domain,Unnamed: 1_level_1
'race',"'?', 'AfricanAmerican', 'Asian', 'Caucasian', 'Hispanic', 'Other'"
'gender',"'Female', 'Male', 'Unknown/Invalid'"
'age',"'[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)'"
'weight',"'>200', '?', '[0-25)', '[100-125)', '[125-150)', '[150-175)', '[175-200)', '[25-50)', '[50-75)', '[75-100)'"
'payer_code',"'?', 'BC', 'CH', 'CM', 'CP', 'DM', 'HM', 'MC', 'MD', 'MP', 'OG', 'OT', 'PO', 'SI', 'SP', 'UN', 'WC'"
'medical_specialty',"'?', 'AllergyandImmunology', 'Anesthesiology', 'Anesthesiology-Pediatric', 'Cardiology', 'Cardiology-Pediatric', 'Dentistry', 'Dermatology', 'Emergency/Trauma', 'Endocrinology', 'Family/GeneralPractice', 'Gastroenterology', 'Gynecology', 'Hematology', 'Hematology/Oncology', 'Hospitalist', 'InfectiousDiseases', 'InternalMedicine', 'Nephrology', 'Neurology', 'Obsterics&Gynecology-GynecologicOnco', 'Obstetrics', 'ObstetricsandGynecology', 'Oncology', 'Ophthalmology', 'Orthopedics', 'Orthopedics-Reconstructive', 'Osteopath', 'Otolaryngology', 'OutreachServices', 'Pathology', 'Pediatrics', 'Pediatrics-AllergyandImmunology', 'Pediatrics-CriticalCare', 'Pediatrics-EmergencyMedicine', 'Pediatrics-Endocrinology', 'Pediatrics-Hematology-Oncology', 'Pediatrics-InfectiousDiseases', 'Pediatrics-Neurology', 'Pediatrics-Pulmonology', 'Perinatology', 'PhysicalMedicineandRehabilitation', 'PhysicianNotFound', 'Podiatry', 'Proctology', 'Psychiatry', 'Psychiatry-Addictive', 'Psychiatry-Child/Adolescent', 'Psychology', 'Pulmonology', 'Radiologist', 'Radiology', 'Rheumatology', 'Speech', 'SportsMedicine', 'Surgeon', 'Surgery-Cardiovascular', 'Surgery-Cardiovascular/Thoracic', 'Surgery-Colon&Rectal', 'Surgery-General', 'Surgery-Maxillofacial', 'Surgery-Neuro', 'Surgery-Pediatric', 'Surgery-Plastic', 'Surgery-PlasticwithinHeadandNeck', 'Surgery-Thoracic', 'Surgery-Vascular', 'SurgicalSpecialty', 'Urology', 'Neurophysiology'"
'max_glu_serum',"'>200', '>300', 'None', 'Norm'"
'A1Cresult',"'>7', '>8', 'None', 'Norm'"
'metformin',"'Down', 'No', 'Steady', 'Up'"
'repaglinide',"'Down', 'No', 'Steady', 'Up'"


Towards the bottom of the Domain-Values pairs of the cell above, you can see that many features (including '**metformin**') have the same values: `['Down', 'No', 'Steady', 'Up']`. These values are common to many features including the ones with missing values during schema inference.

TFDV allows you to modify the domains of some features to match an existing domain. To address the detected anomaly, you can **set the domain** of these features to the domain of the **metformin** feature.

For this, use the `tfdv.set_domain()` function, which has the following parameters:

1. `schema`: The schema.
2. `feature_path`: The name of the feature whose domain needs to be set.
3. `domain`: The name of a global string domain present in the input schema.

Using the function below, you can set the domain of the features listed in `domain_change_features` to the `metformin` domain to address the anomalies found.



In [26]:
def modify_domain_of_features(schema, features, convert_to_domain):
    for feature in features:
        tfdv.set_domain(schema=schema, feature_path=feature, domain=convert_to_domain)
    return schema


domain_change_features = ['repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
                          'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
                          'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide',
                          'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin',
                          'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone']

training_schema = modify_domain_of_features(training_schema, domain_change_features, 'metformin')



In [28]:
tfdv.display_schema(training_schema)

Unnamed: 0_level_0,Type,Presence,Valency,Domain
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
'race',STRING,required,,'race'
'gender',STRING,required,,'gender'
'age',STRING,required,,'age'
'weight',STRING,required,,'weight'
'admission_type_id',INT,required,,-
'discharge_disposition_id',INT,required,,-
'admission_source_id',INT,required,,-
'time_in_hospital',INT,required,,-
'payer_code',STRING,required,,'payer_code'
'medical_specialty',STRING,required,,'medical_specialty'


Unnamed: 0_level_0,Values
Domain,Unnamed: 1_level_1
'race',"'?', 'AfricanAmerican', 'Asian', 'Caucasian', 'Hispanic', 'Other'"
'gender',"'Female', 'Male', 'Unknown/Invalid'"
'age',"'[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)'"
'weight',"'>200', '?', '[0-25)', '[100-125)', '[125-150)', '[150-175)', '[175-200)', '[25-50)', '[50-75)', '[75-100)'"
'payer_code',"'?', 'BC', 'CH', 'CM', 'CP', 'DM', 'HM', 'MC', 'MD', 'MP', 'OG', 'OT', 'PO', 'SI', 'SP', 'UN', 'WC'"
'medical_specialty',"'?', 'AllergyandImmunology', 'Anesthesiology', 'Anesthesiology-Pediatric', 'Cardiology', 'Cardiology-Pediatric', 'Dentistry', 'Dermatology', 'Emergency/Trauma', 'Endocrinology', 'Family/GeneralPractice', 'Gastroenterology', 'Gynecology', 'Hematology', 'Hematology/Oncology', 'Hospitalist', 'InfectiousDiseases', 'InternalMedicine', 'Nephrology', 'Neurology', 'Obsterics&Gynecology-GynecologicOnco', 'Obstetrics', 'ObstetricsandGynecology', 'Oncology', 'Ophthalmology', 'Orthopedics', 'Orthopedics-Reconstructive', 'Osteopath', 'Otolaryngology', 'OutreachServices', 'Pathology', 'Pediatrics', 'Pediatrics-AllergyandImmunology', 'Pediatrics-CriticalCare', 'Pediatrics-EmergencyMedicine', 'Pediatrics-Endocrinology', 'Pediatrics-Hematology-Oncology', 'Pediatrics-InfectiousDiseases', 'Pediatrics-Neurology', 'Pediatrics-Pulmonology', 'Perinatology', 'PhysicalMedicineandRehabilitation', 'PhysicianNotFound', 'Podiatry', 'Proctology', 'Psychiatry', 'Psychiatry-Addictive', 'Psychiatry-Child/Adolescent', 'Psychology', 'Pulmonology', 'Radiologist', 'Radiology', 'Rheumatology', 'Speech', 'SportsMedicine', 'Surgeon', 'Surgery-Cardiovascular', 'Surgery-Cardiovascular/Thoracic', 'Surgery-Colon&Rectal', 'Surgery-General', 'Surgery-Maxillofacial', 'Surgery-Neuro', 'Surgery-Pediatric', 'Surgery-Plastic', 'Surgery-PlasticwithinHeadandNeck', 'Surgery-Thoracic', 'Surgery-Vascular', 'SurgicalSpecialty', 'Urology', 'Neurophysiology'"
'max_glu_serum',"'>200', '>300', 'None', 'Norm'"
'A1Cresult',"'>7', '>8', 'None', 'Norm'"
'metformin',"'Down', 'No', 'Steady', 'Up'"
'repaglinide',"'Down', 'No', 'Steady', 'Up'"


In [30]:
# check that the domain of some features is now switched to `metformin`
print(f"Domain name of 'chlorpropamide': {tfdv.get_feature(training_schema, 'chlorpropamide').domain}")
print(f"Domain values of 'chlorpropamide': {tfdv.get_domain(training_schema, 'chlorpropamide').value}")
print(f"Domain name of 'repaglinide': {tfdv.get_feature(training_schema, 'repaglinide').domain}")
print(f"Domain values of 'repaglinide': {tfdv.get_domain(training_schema, 'repaglinide').value}")
print(f"Domain name of 'nateglinide': {tfdv.get_feature(training_schema, 'nateglinide').domain}")
print(f"Domain values of 'nateglinide': {tfdv.get_domain(training_schema, 'nateglinide').value}")

Domain name of 'chlorpropamide': metformin
Domain values of 'chlorpropamide': ['Down', 'No', 'Steady', 'Up']
Domain name of 'repaglinide': metformin
Domain values of 'repaglinide': ['Down', 'No', 'Steady', 'Up']
Domain name of 'nateglinide': metformin
Domain values of 'nateglinide': ['Down', 'No', 'Steady', 'Up']


In [32]:
calculate_and_display_anomalies(serving_stats, schema=training_schema)

Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'readmitted',Column dropped,Column is completely missing


The `metformin-pioglitazone` and `metformin-rosiglitazone` features no longer appear in the anomalies output.
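
One way to picture why re-pointing a feature at the `metformin` domain clears the anomaly is to model domains as named value lists that features refer to by name. The sketch below is a plain-Python illustration, not TFDV's data model; in particular, the assumption that the inferred domain of `metformin-pioglitazone` contained only `'No'` is made up for the example.

```python
# Named domains live in one place; features refer to them by name.
domains = {
    "metformin": ["Down", "No", "Steady", "Up"],
    # Assumed toy domain: only 'No' was seen during inference.
    "metformin-pioglitazone": ["No"],
}
feature_domain = {
    "metformin": "metformin",
    "metformin-pioglitazone": "metformin-pioglitazone",
}


def allowed_values(feature):
    """Look up the values a feature may take via its domain name."""
    return domains[feature_domain[feature]]


before = "Steady" in allowed_values("metformin-pioglitazone")
# Re-point the feature at the shared domain, which is conceptually
# what setting its domain to 'metformin' does.
feature_domain["metformin-pioglitazone"] = "metformin"
after = "Steady" in allowed_values("metformin-pioglitazone")
print(before, after)
```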

#### Exercise 9: Detecting anomalies with environments

The `readmitted` feature (which is the label column) showed up as an anomaly ('Column dropped'). Since **labels are not expected in the serving data, let's tell TFDV to ignore this detected anomaly**.

This requirement of introducing slight schema variations can be expressed by using environments. In particular, features in the schema can be associated with a set of environments using `default_environment`, `in_environment` and `not_in_environment`.

To exclude the `readmitted` feature from the `SERVING` environment:

1. Use the `tfdv.get_feature()` function to get the `readmitted` feature from the inferred schema.
2. Append the environment name to the feature's `not_in_environment` attribute to specify that `readmitted` should be removed from the `SERVING` environment's schema: `feature.not_in_environment.append('NAME_OF_ENVIRONMENT')`

The function `tfdv.get_feature` receives the following parameters: `schema` and `feature_path`





In [33]:
# All features are by default in both TRAINING and SERVING environments.
training_schema.default_environment.append('TRAINING')
training_schema.default_environment.append('SERVING')

readmitted = tfdv.get_feature(training_schema, 'readmitted')
readmitted.not_in_environment.append('SERVING')

serving_anomalies_with_env = tfdv.validate_statistics(serving_stats, training_schema, environment='SERVING')


In [34]:
tfdv.display_anomalies(serving_anomalies_with_env)
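
The environment mechanism can be pictured as a small lookup: a feature is expected in an environment unless it opts out. The following plain-Python sketch models that idea with dictionaries. The precedence rules in `expected_in_environment` are an assumption of this illustration rather than a description of TFDV's internals.

```python
def expected_in_environment(feature, environment, default_environments):
    """Decide whether a feature is expected in a given environment."""
    # An explicit opt-out wins over everything else in this sketch.
    if environment in feature.get("not_in_environment", []):
        return False
    # An explicit opt-in applies even outside the defaults.
    if environment in feature.get("in_environment", []):
        return True
    # Otherwise fall back to the schema-wide default environments.
    return environment in default_environments


default_environments = ["TRAINING", "SERVING"]
readmitted = {"name": "readmitted", "not_in_environment": ["SERVING"]}
race = {"name": "race"}

readmitted_in_serving = expected_in_environment(readmitted, "SERVING", default_environments)
readmitted_in_training = expected_in_environment(readmitted, "TRAINING", default_environments)
race_in_serving = expected_in_environment(race, "SERVING", default_environments)
print(readmitted_in_serving, readmitted_in_training, race_in_serving)
```

Under this model, a missing `readmitted` column is only a problem when validating in an environment where the label is still expected, such as `TRAINING`.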

#### 7 - Check for Data Drift and Skew

During data validation, you also need to check for data drift and data skew between the training and serving data. You can do this by specifying the `skew_comparator` and `drift_comparator` in the schema.

Drift and skew are expressed in terms of the `L-infinity distance`, which evaluates the difference between vectors as the greatest of the differences along any coordinate dimension.
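
For a categorical feature, the vectors being compared can be thought of as the relative frequencies of each value in the two datasets. A minimal plain-Python sketch of that computation, using hypothetical helper names and toy data rather than TFDV code, looks like this:

```python
from collections import Counter


def normalized_counts(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(values_a, values_b):
    """Largest absolute difference in relative frequency over all values."""
    freq_a = normalized_counts(values_a)
    freq_b = normalized_counts(values_b)
    # Values missing from one side count as frequency 0.0 there.
    all_values = set(freq_a) | set(freq_b)
    return max(abs(freq_a.get(v, 0.0) - freq_b.get(v, 0.0)) for v in all_values)


# Toy columns, not taken from the diabetes dataset.
training = ["Yes"] * 80 + ["No"] * 20
serving = ["Yes"] * 75 + ["No"] * 25
distance = l_infinity_distance(training, serving)
print(round(distance, 4))
```

Here the share of each value shifts by about 0.05 between the two toy columns, so a threshold of 0.03 would flag this pair.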

You can set the threshold distance so that you receive warnings when the drift is higher than is acceptable. Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation.

Let's check for the skew in the `diabetesMed` feature and drift in the `payer_code` feature.

In [35]:
diabetes_med = tfdv.get_feature(schema=training_schema, feature_path="diabetesMed")
diabetes_med.skew_comparator.infinity_norm.threshold = 0.03

payer_code = tfdv.get_feature(schema=training_schema, feature_path="payer_code")
payer_code.drift_comparator.infinity_norm.threshold = 0.03

skew_drift_anomalies = tfdv.validate_statistics(
    train_stats,
    schema=training_schema,
    previous_statistics=eval_stats,
    serving_statistics=serving_stats,
)

tfdv.display_anomalies(skew_drift_anomalies)

Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'payer_code',High Linfty distance between current and previous,"The Linfty distance between current and previous is 0.451135 (up to six significant digits), above the threshold 0.03. The feature value with maximum difference is: ?"
'diabetesMed',High Linfty distance between training and serving,"The Linfty distance between training and serving is 0.0325464 (up to six significant digits), above the threshold 0.03. The feature value with maximum difference is: No"


#### 9 - Freeze the schema

In [36]:
# Use TensorFlow text output format pbtxt to store the schema
schema_file = os.path.join("./", 'schema.pbtxt')

# write_schema_text expects the schema and the output path as parameters
tfdv.write_schema_text(training_schema, schema_file)