# TensorFlow Data Validation Example

This notebook shows how to explore and validate the Chicago Taxi dataset using TensorFlow Data Validation (TFDV).

# Setup

Import necessary packages and set up data paths.

In [None]:
import tensorflow_data_validation as tfdv
import os

In [2]:
BASE_DIR = os.getcwd()
DATA_DIR = os.path.join(BASE_DIR, 'data')
TRAIN_DATA = os.path.join(DATA_DIR, 'train.csv')
EVAL_DATA = os.path.join(DATA_DIR, 'eval.csv')

# Compute descriptive data statistics

TFDV can compute descriptive
[statistics](https://github.com/tensorflow/metadata/tree/v0.6.0/tensorflow_metadata/proto/v0/statistics.proto)
that provide a quick overview of the data in terms of the features that are
present and the shapes of their value distributions.

Internally, TFDV uses [Apache Beam](https://beam.apache.org)'s data-parallel
processing framework to scale the computation of statistics over large datasets.
For applications that wish to integrate deeper with TFDV (e.g., attach
statistics generation at the end of a data-generation pipeline), the API also
exposes a Beam PTransform for statistics generation.

In [None]:
train_stats = tfdv.generate_statistics_from_csv(TRAIN_DATA)

The statistics can be visualized with the [Facets Overview](https://pair-code.github.io/facets/) tool, which presents them in a succinct form for easy browsing. TFDV provides a utility method that renders the statistics using Facets.

In [4]:
tfdv.visualize_statistics(train_stats)

# Infer a schema

The
[schema](https://github.com/tensorflow/metadata/tree/v0.6.0/tensorflow_metadata/proto/v0/schema.proto)
describes the expected properties of the data. Some of these properties are:

*   which features are expected to be present
*   their type
*   the number of values for a feature in each example
*   the presence of each feature across all examples
*   the expected domains of features.

In short, the schema describes the expectations for "correct" data and can thus
be used to detect errors in the data (described below). 

Since writing a schema can be a tedious task, especially for datasets with lots
of features, TFDV provides a method to generate an initial version of the schema
based on the descriptive statistics.

In [5]:
schema = tfdv.infer_schema(train_stats)
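
To make the idea of schema inference concrete, the following is a toy sketch in plain Python. It is not TFDV's implementation and does not reproduce its heuristics; the function `infer_toy_schema`, the `max_domain_size` parameter, and the sample rows are all hypothetical and exist only for illustration. It derives a type, a presence, and (for low-cardinality string features) a domain from a handful of records.

```python
def infer_toy_schema(rows, max_domain_size=50):
    """Toy schema inference. NOT TFDV's heuristics, only the general idea."""
    names = sorted({name for row in rows for name in row})
    schema = {}
    for name in names:
        # Collect the non-missing values observed for this feature.
        values = [row[name] for row in rows if row.get(name) is not None]
        types = {type(v).__name__ for v in values}
        feature = {
            "type": types.pop() if len(types) == 1 else "mixed",
            # A feature is "required" only if every row has a value for it.
            "presence": "required" if len(values) == len(rows) else "optional",
        }
        # Treat low-cardinality string features as categorical and record
        # the set of observed values as their domain.
        if feature["type"] == "str" and len(set(values)) <= max_domain_size:
            feature["domain"] = sorted(set(values))
        schema[name] = feature
    return schema


# Hypothetical sample rows, loosely modeled on the taxi features.
rows = [
    {"fare": 12.5, "payment_type": "Cash", "company": "Top Cab Affiliation"},
    {"fare": 7.0, "payment_type": "Credit Card", "company": None},
    {"fare": 30.25, "payment_type": "Cash", "company": "Choice Taxi Association"},
]
toy_schema = infer_toy_schema(rows)
```

In this sketch `company` comes out as optional because one row lacks a value, while `fare` and `payment_type` are required.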

In general, TFDV uses conservative heuristics to infer stable data properties
from the statistics in order to avoid overfitting the schema to the specific
dataset. It is strongly advised to **review the inferred schema and refine
it as needed**, to capture any domain knowledge about the data that TFDV's
heuristics might have missed.

In [6]:
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
| --- | --- | --- | --- | --- |
| fare | Float | required | single | |
| trip_start_hour | Int | required | single | |
| dropoff_census_tract | Float | optional | single | |
| company | String | optional | single | company |
| trip_start_timestamp | Int | required | single | |
| pickup_longitude | Float | required | single | |
| trip_start_month | Int | required | single | |
| trip_miles | Float | required | single | |
| dropoff_longitude | Float | optional | single | |
| dropoff_community_area | Float | optional | single | |


| Domain | Values |
| --- | --- |
| company | "0118 - 42111 Godfrey S.Awir", "0694 - 59280 Chinesco Trans Inc", "1085 - 72312 N and W Cab Co", "2733 - 74600 Benny Jona", "2809 - 95474 C & D Cab Co Inc.", "3011 - 66308 JBL Cab Inc.", "3152 - 97284 Crystal Abernathy", "3201 - C&D Cab Co Inc", "3201 - CID Cab Co Inc", "3253 - 91138 Gaither Cab Co.", "3385 - 23210 Eman Cab", "3623 - 72222 Arrington Enterprises", "3897 - Ilie Malec", "4053 - Adwar H. Nikola", "4197 - 41842 Royal Star", "4615 - 83503 Tyrone Henderson", "4615 - Tyrone Henderson", "4623 - Jay Kim", "5006 - 39261 Salifu Bawa", "5006 - Salifu Bawa", "5074 - 54002 Ahzmi Inc", "5074 - Ahzmi Inc", "5129 - 87128", "5129 - 98755 Mengisti Taxi", "5129 - Mengisti Taxi", "5724 - KYVI Cab Inc", "585 - Valley Cab Co", "5864 - 73614 Thomas Owusu", "5864 - Thomas Owusu", "5874 - 73628 Sergey Cab Corp.", "5997 - 65283 AW Services Inc.", "5997 - AW Services Inc.", "6488 - 83287 Zuha Taxi", "6743 - Luhak Corp", "Blue Ribbon Taxi Association Inc.", "C & D Cab Co Inc", "Chicago Elite Cab Corp.", "Chicago Elite Cab Corp. (Chicago Carriag", "Chicago Medallion Leasing INC", "Chicago Medallion Management", "Choice Taxi Association", "Dispatch Taxi Affiliation", "KOAM Taxi Association", "Northwest Management LLC", "Taxi Affiliation Services", "Top Cab Affiliation" |
| payment_type | "Cash", "Credit Card", "Dispute", "No Charge", "Pcard", "Unknown" |


# Check evaluation data for errors

Given a schema, it is possible to check whether a dataset conforms to the
expectations set in the schema or whether there exist any data anomalies. TFDV
performs this check by matching the statistics of the dataset against the schema
and marking any discrepancies. 
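
As a rough illustration of what "matching statistics against the schema" means for a categorical feature, the sketch below compares observed string values with a domain and reports each unexpected value together with the fraction of examples that contain it. This is plain Python, not TFDV code; `find_unexpected_values` and the sample values are hypothetical, and TFDV works from aggregated statistics rather than from raw values as done here.

```python
from collections import Counter


def find_unexpected_values(values, domain):
    """Return {value: fraction of examples} for values outside the domain."""
    total = len(values)
    if total == 0:
        return {}
    # Count only the values that the schema's domain does not list.
    counts = Counter(v for v in values if v not in domain)
    return {value: count / total for value, count in counts.items()}


# Domain as it might appear in a schema for a payment-type feature.
payment_type_domain = {"Cash", "Credit Card", "Dispute", "No Charge", "Pcard", "Unknown"}

# Hypothetical evaluation data: 100 examples, one with a misspelled value.
eval_values = ["Cash"] * 60 + ["Credit Card"] * 39 + ["Prcard"]

unexpected = find_unexpected_values(eval_values, payment_type_domain)
```

Here `unexpected` maps `"Prcard"` to the fraction of examples in which it appears, which is the kind of information an "unexpected string values" anomaly conveys.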

In [7]:
eval_stats = tfdv.generate_statistics_from_csv(EVAL_DATA)

In [24]:
anomalies = tfdv.validate_statistics(eval_stats, schema)

In [25]:
tfdv.display_anomalies(anomalies)

| Feature name | Anomaly short description | Anomaly long description |
| --- | --- | --- |
| company | Unexpected string values | Examples contain values missing from the schema: 2092 - 61288 Sbeih company (<1%), 2192 - 73487 Zeymane Corp (<1%), 2192 - Zeymane Corp (<1%), 2823 - 73307 Seung Lee (<1%), 3094 - 24059 G.L.B. Cab Co (<1%), 3319 - CD Cab Co (<1%), 3385 - Eman Cab (<1%), 3897 - 57856 Ilie Malec (<1%), 4053 - 40193 Adwar H. Nikola (<1%), 4197 - Royal Star (<1%), 585 - 88805 Valley Cab Co (<1%), 5874 - Sergey Cab Corp. (<1%), 6057 - 24657 Richard Addo (<1%), 6574 - Babylon Express Inc. (<1%), 6742 - 83735 Tasha ride inc (<1%). |
| payment_type | Unexpected string values | Examples contain values missing from the schema: Prcard (<1%). |


The anomalies indicate that out-of-domain values were found for the features `company` and `payment_type`, each in fewer than 1% of the examples. If these values are expected, the schema can be updated as follows.

In [26]:
# Relax the minimum fraction of values that must come from the domain for feature company.
company = tfdv.get_feature(schema, 'company')
company.distribution_constraints.min_domain_mass = 0.9

# Add new value to the domain of feature payment_type.
payment_type_domain = tfdv.get_domain(schema, 'payment_type')
payment_type_domain.value.append('Prcard')
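
The effect of relaxing `min_domain_mass` can be sketched in plain Python. The helper names `domain_mass` and `satisfies_min_domain_mass` and the sample data are hypothetical, and the strict threshold of 1.0 used as the starting point is an assumption about the default; this is an illustration of the constraint, not TFDV's implementation.

```python
def domain_mass(values, domain):
    """Fraction of values that belong to the domain (1.0 for empty input)."""
    if not values:
        return 1.0
    return sum(1 for v in values if v in domain) / len(values)


def satisfies_min_domain_mass(values, domain, min_domain_mass=1.0):
    """True if at least `min_domain_mass` of the values come from the domain."""
    return domain_mass(values, domain) >= min_domain_mass


# Hypothetical data: 95 known companies and 5 previously unseen ones.
company_domain = {"Top Cab Affiliation", "Choice Taxi Association"}
company_values = ["Top Cab Affiliation"] * 95 + ["6574 - Babylon Express Inc."] * 5

strict_ok = satisfies_min_domain_mass(company_values, company_domain, 1.0)
relaxed_ok = satisfies_min_domain_mass(company_values, company_domain, 0.9)
```

With the strict threshold the check fails because 5% of the values are outside the domain; with the threshold relaxed to 0.9 the same data passes.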

In [27]:
updated_anomalies = tfdv.validate_statistics(eval_stats, schema)

In [28]:
tfdv.display_anomalies(updated_anomalies)

If an anomaly truly indicates a data error, then the underlying data should be fixed.