<a href="https://colab.research.google.com/github/srivatsan88/YouTubeLI/blob/master/TFX_Visualize_Distribution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Link to the supporting video for the notebook walkthrough below - https://www.youtube.com/watch?v=eGIG_qHgQ08

In [0]:
from __future__ import print_function
import os, sys, tempfile
import urllib.request  # plain `import urllib` does not guarantee that urllib.request is loaded

Install the TensorFlow Data Validation (TFDV) library

In [0]:
!pip install -q tensorflow_data_validation
import tensorflow_data_validation as tfdv

print('TFDV version: {}'.format(tfdv.version.__version__))

TFDV version: 0.14.1


In [0]:
!pip install pandas-profiling

Define the local path where the downloaded dataset will be stored

In [0]:
BASE_DIR = '/tmp'
OUTPUT_FILE = os.path.join(BASE_DIR, 'churn_data.csv')

Download the Watson Telecom dataset and store it on local disk

In [0]:
churn_data=urllib.request.urlretrieve('https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/WA_Fn-UseC_-Telco-Customer-Churn.csv', OUTPUT_FILE)
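
Re-running the download cell fetches the file again every time. A minimal sketch of a cached download follows; `download_if_missing` is a hypothetical helper name that is not part of the original notebook, and it relies only on the standard-library `urllib.request.urlretrieve` already used above.

```python
import os
import urllib.request


def download_if_missing(url, dest_path):
    """Download url to dest_path unless a non-empty file is already there.

    Hypothetical helper for illustration. Returns True if a download
    happened and False if the existing file was reused.
    """
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return False
    urllib.request.urlretrieve(url, dest_path)
    return True
```

In this notebook it could be called with the dataset URL from the cell above and `OUTPUT_FILE` as the destination.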


Do not worry about the dataset details for now. Let us quickly run TensorFlow Data Validation (TFDV) to generate statistics on the file.

In [0]:
train_stats = tfdv.generate_statistics_from_csv(data_location=OUTPUT_FILE)

Visualize the generated statistics. TensorFlow Data Validation provides tools for visualizing the distribution of feature values, and examining these distributions can surface common problems with the data. One quick observation is that the SeniorCitizen column has around 84% zeros. Play around with the different chart options below, and use the feature search if you want to look at a particular feature.

In [0]:
tfdv.visualize_statistics(train_stats)
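
The share of zeros that the chart reports for a numeric feature can be cross-checked by hand. The sketch below uses only the standard library; `zero_fraction` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from the churn file.

```python
import csv
import io


def zero_fraction(csv_file, column):
    """Return the fraction of rows whose value in `column` equals zero."""
    reader = csv.DictReader(csv_file)
    total = zeros = 0
    for row in reader:
        total += 1
        if float(row[column]) == 0:
            zeros += 1
    # Guard against an empty file to avoid dividing by zero.
    return zeros / total if total else 0.0


# Made-up sample, not real rows from the dataset.
sample = io.StringIO("SeniorCitizen,tenure\n0,1\n0,34\n1,2\n0,45\n")
print(zero_fraction(sample, "SeniorCitizen"))  # 3 of 4 rows are zero -> 0.75
```

To run it against the downloaded data, open `OUTPUT_FILE` with `newline=""` and pass the file object together with the column name `"SeniorCitizen"`.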

Let us now create a schema for our data using the infer_schema method. A schema defines constraints for the data that are relevant for ML. Example constraints include the data type of each feature, whether it's numerical or categorical, or the frequency of its presence in the data. For categorical features the schema also defines the domain - the list of acceptable values. Since writing a schema can be a tedious task, especially for datasets with lots of features, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics.

Getting the schema right is important because the rest of our production pipeline relies on the schema that TFDV generates being correct. The schema also provides documentation for the data.

In [0]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

Feature name,Type,Presence,Valency,Domain
'TechSupport',STRING,required,,'TechSupport'
'MonthlyCharges',FLOAT,required,,-
'Churn',STRING,required,,'Churn'
'Contract',STRING,required,,'Contract'
'tenure',INT,required,,-
'SeniorCitizen',INT,required,,-
'StreamingTV',STRING,required,,'StreamingTV'
'PaymentMethod',STRING,required,,'PaymentMethod'
'OnlineBackup',STRING,required,,'OnlineBackup'
'PaperlessBilling',STRING,required,,'PaperlessBilling'


Domain,Values
'TechSupport',"'No', 'No internet service', 'Yes'"
'Churn',"'No', 'Yes'"
'Contract',"'Month-to-month', 'One year', 'Two year'"
'StreamingTV',"'No', 'No internet service', 'Yes'"
'PaymentMethod',"'Bank transfer (automatic)', 'Credit card (automatic)', 'Electronic check', 'Mailed check'"
'OnlineBackup',"'No', 'No internet service', 'Yes'"
'PaperlessBilling',"'No', 'Yes'"
'OnlineSecurity',"'No', 'No internet service', 'Yes'"
'InternetService',"'DSL', 'Fiber optic', 'No'"
'PhoneService',"'No', 'Yes'"
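
To make the idea of a type and a domain per feature concrete, here is a toy schema inference in plain Python. It is an illustration only and not how TFDV infers its schema; the function names `infer_simple_schema` and `find_violations` are hypothetical, and the three sample rows reuse values from the first rows of the churn file.

```python
def _infer_type(values):
    """Return 'INT', 'FLOAT' or 'STRING' for a list of string values."""
    for caster, label in ((int, "INT"), (float, "FLOAT")):
        try:
            for value in values:
                caster(value)
            return label
        except ValueError:
            continue
    return "STRING"


def infer_simple_schema(rows):
    """Infer a toy schema (type, plus domain for strings) from a list of dicts."""
    schema = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        entry = {"type": _infer_type(values)}
        if entry["type"] == "STRING":
            # The domain is the set of values seen so far.
            entry["domain"] = sorted(set(values))
        schema[name] = entry
    return schema


def find_violations(schema, row):
    """Return the names of features whose value falls outside the domain."""
    bad = []
    for name, entry in schema.items():
        domain = entry.get("domain")
        if domain is not None and row.get(name) not in domain:
            bad.append(name)
    return bad


rows = [
    {"Contract": "Month-to-month", "tenure": "1", "MonthlyCharges": "29.85"},
    {"Contract": "One year", "tenure": "34", "MonthlyCharges": "56.95"},
    {"Contract": "Month-to-month", "tenure": "2", "MonthlyCharges": "53.85"},
]
schema = infer_simple_schema(rows)
print(schema["Contract"])  # {'type': 'STRING', 'domain': ['Month-to-month', 'One year']}
```

A row with an unseen category, such as a `Contract` value of `'Weekly'`, would be reported by `find_violations`. That is the kind of check a schema enables later in a pipeline.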


Let us now load the file downloaded earlier into a pandas DataFrame and split the dataset so that the distributions and schemas of the parts can be compared against each other

In [0]:
import pandas as pd
import pandas_profiling

In [0]:
churn_df = pd.read_csv(OUTPUT_FILE)

In [0]:
churn_df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes
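
One way to split the DataFrame so that the two parts can be compared against each other is a random row split. This is a sketch and not necessarily the split used in the video: the 80/20 ratio, the seed, and the helper name `split_frame` are assumptions, and the demo frame is made up.

```python
import pandas as pd


def split_frame(df, train_frac=0.8, seed=42):
    """Randomly split df into two disjoint frames (train, test).

    Assumes df has a unique index, as a freshly read CSV does.
    """
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test


# Made-up frame standing in for churn_df.
demo = pd.DataFrame({"tenure": range(10), "Churn": ["No", "Yes"] * 5})
train_df, test_df = split_frame(demo)
print(len(train_df), len(test_df))  # 8 2
```

Each part could then be passed to `tfdv.generate_statistics_from_dataframe`, and the two results compared side by side through the `lhs_statistics` and `rhs_statistics` arguments of `tfdv.visualize_statistics`.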


In [0]:
profile = pandas_profiling.ProfileReport(churn_df.head(100))


In [0]:
profile

Statistic,Value
Number of variables,21
Number of observations,100
Total Missing (%),0.0%
Total size in memory,16.5 KiB
Average record size in memory,168.8 B

Variable type,Count
Numeric,2
Categorical,16
Boolean,1
Date,0
Text (Unique),2
Rejected,0
Unsupported,0

0,1
Distinct count,2
Unique (%),2.0%
Missing (%),0.0%
Missing (n),0

0,1
No,76
Yes,24

Value,Count,Frequency (%),Unnamed: 3
No,76,76.0%,
Yes,24,24.0%,

0,1
Distinct count,3
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0

0,1
Month-to-month,56
Two year,24
One year,20

Value,Count,Frequency (%),Unnamed: 3
Month-to-month,56,56.0%,
Two year,24,24.0%,
One year,20,20.0%,

0,1
Distinct count,2
Unique (%),2.0%
Missing (%),0.0%
Missing (n),0

0,1
No,65
Yes,35

Value,Count,Frequency (%),Unnamed: 3
No,65,65.0%,
Yes,35,35.0%,

0,1
Distinct count,3
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0

0,1
No,46
Yes,38
No internet service,16

Value,Count,Frequency (%),Unnamed: 3
No,46,46.0%,
Yes,38,38.0%,
No internet service,16,16.0%,

0,1
Distinct count,3
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0

0,1
Fiber optic,44
DSL,40
No,16

Value,Count,Frequency (%),Unnamed: 3
Fiber optic,44,44.0%,
DSL,40,40.0%,
No,16,16.0%,

0,1
Distinct count,93
Unique (%),93.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,68.223
Minimum,18.95
Maximum,113.25
Zeros (%),0.0%

0,1
Minimum,18.95
5-th percentile,20.15
Q1,49.2
Median,74.725
Q3,95.463
95-th percentile,107.55
Maximum,113.25
Range,94.3
Interquartile range,46.263

0,1
Standard deviation,29.177
Coef of variation,0.42767
Kurtosis,-1.1032
Mean,68.223
MAD,24.795
Skewness,-0.31206
Sum,6822.4
Variance,851.3
Memory size,880.0 B

Value,Count,Frequency (%),Unnamed: 3
20.15,3,3.0%,
80.65,2,2.0%,
90.25,2,2.0%,
79.85,2,2.0%,
20.65,2,2.0%,
99.65,2,2.0%,
74.85,1,1.0%,
99.3,1,1.0%,
84.0,1,1.0%,
89.9,1,1.0%,

Value,Count,Frequency (%),Unnamed: 3
18.95,1,1.0%,
19.8,1,1.0%,
19.95,1,1.0%,
20.15,3,3.0%,
20.2,1,1.0%,

Value,Count,Frequency (%),Unnamed: 3
108.45,1,1.0%,
110.5,1,1.0%,
111.05,1,1.0%,
111.6,1,1.0%,
113.25,1,1.0%,

0,1
Distinct count,3
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0

0,1
No,50
Yes,42
No phone service,8

Value,Count,Frequency (%),Unnamed: 3
No,50,50.0%,
Yes,42,42.0%,
No phone service,8,8.0%,

0,1
Distinct count,3
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0

0,1
No,42
Yes,42
No internet service,16

Value,Count,Frequency (%),Unnamed: 3
No,42,42.0%,
Yes,42,42.0%,
No internet service,16,16.0%,

0,1
Distinct count,3
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0

0,1
No,47
Yes,37
No internet service,16

Value,Count,Frequency (%),Unnamed: 3
No,47,47.0%,
Yes,37,37.0%,
No internet service,16,16.0%,

0,1
Distinct count,2
Unique (%),2.0%
Missing (%),0.0%
Missing (n),0

0,1
Yes,63
No,37

Value,Count,Frequency (%),Unnamed: 3
Yes,63,63.0%,
No,37,37.0%,

0,1
Distinct count,2
Unique (%),2.0%
Missing (%),0.0%
Missing (n),0

0,1
No,52
Yes,48

Value,Count,Frequency (%),Unnamed: 3
No,52,52.0%,
Yes,48,48.0%,

0,1
Distinct count,4
Unique (%),4.0%
Missing (%),0.0%
Missing (n),0

0,1
Electronic check,29
Credit card (automatic),28
Bank transfer (automatic),24

Value,Count,Frequency (%),Unnamed: 3
Electronic check,29,29.0%,
Credit card (automatic),28,28.0%,
Bank transfer (automatic),24,24.0%,
Mailed check,19,19.0%,

0,1
Distinct count,2
Unique (%),2.0%
Missing (%),0.0%
Missing (n),0

0,1
Yes,92
No,8

Value,Count,Frequency (%),Unnamed: 3
Yes,92,92.0%,
No,8,8.0%,

0,1
Distinct count,2
Unique (%),2.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.15

0,1
0,85
1,15

Value,Count,Frequency (%),Unnamed: 3
0,85,85.0%,
1,15,15.0%,

0,1
Distinct count,3
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0

0,1
No,45
Yes,39
No internet service,16

Value,Count,Frequency (%),Unnamed: 3
No,45,45.0%,
Yes,39,39.0%,
No internet service,16,16.0%,

0,1
Distinct count,3
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0

0,1
Yes,45
No,39
No internet service,16

Value,Count,Frequency (%),Unnamed: 3
Yes,45,45.0%,
No,39,39.0%,
No internet service,16,16.0%,

0,1
Distinct count,3
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0

0,1
No,55
Yes,29
No internet service,16

Value,Count,Frequency (%),Unnamed: 3
No,55,55.0%,
Yes,29,29.0%,
No internet service,16,16.0%,

First 3 values
1862.9
3487.95
2514.5

Last 3 values
1874.45
1426.4
2970.3

Value,Count,Frequency (%),Unnamed: 3
1009.25,1,1.0%,
1022.95,1,1.0%,
1057.0,1,1.0%,
108.15,1,1.0%,
1090.65,1,1.0%,

Value,Count,Frequency (%),Unnamed: 3
930.9,1,1.0%,
957.1,1,1.0%,
97.0,1,1.0%,
973.35,1,1.0%,
981.45,1,1.0%,

First 3 values
8865-TNMNX
9489-DEDVP
5919-TMRGD

Last 3 values
8028-PNXHQ
5698-BQJOH
8779-QRDMV

Value,Count,Frequency (%),Unnamed: 3
0191-ZHSKZ,1,1.0%,
0278-YXOOG,1,1.0%,
0280-XJGEX,1,1.0%,
0318-ZOPWS,1,1.0%,
0434-CSFON,1,1.0%,

Value,Count,Frequency (%),Unnamed: 3
9803-FTJCG,1,1.0%,
9848-JQJTX,1,1.0%,
9867-JCZSP,1,1.0%,
9919-YLNNG,1,1.0%,
9959-WOFKT,1,1.0%,

0,1
Distinct count,2
Unique (%),2.0%
Missing (%),0.0%
Missing (n),0

0,1
Female,55
Male,45

Value,Count,Frequency (%),Unnamed: 3
Female,55,55.0%,
Male,45,45.0%,

0,1
Distinct count,46
Unique (%),46.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,32.97
Minimum,1
Maximum,72
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,1.0
Q1,10.0
Median,30.0
Q3,52.0
95-th percentile,71.05
Maximum,72.0
Range,71.0
Interquartile range,42.0

0,1
Standard deviation,24.481
Coef of variation,0.74251
Kurtosis,-1.4279
Mean,32.97
MAD,21.889
Skewness,0.16405
Sum,3297
Variance,599.3
Memory size,880.0 B

Value,Count,Frequency (%),Unnamed: 3
1,9,9.0%,
2,5,5.0%,
72,5,5.0%,
52,4,4.0%,
49,4,4.0%,
10,4,4.0%,
34,3,3.0%,
25,3,3.0%,
46,3,3.0%,
47,3,3.0%,

Value,Count,Frequency (%),Unnamed: 3
1,9,9.0%,
2,5,5.0%,
3,2,2.0%,
5,2,2.0%,
7,1,1.0%,

Value,Count,Frequency (%),Unnamed: 3
66,2,2.0%,
69,2,2.0%,
70,1,1.0%,
71,3,3.0%,
72,5,5.0%,

