# DATA VALIDATION

In [60]:
# !pip install apache-beam[interactive]

In [1]:
import os
import tensorflow

Internally, TFDV uses Apache Beam's data-parallel processing framework to scale the computation of statistics over large datasets.

When we installed the tfx package introduced in Chapter 2, TFDV was already
installed as a dependency. If we would like to use TFDV as a standalone package, we
can install it with this command:
    `$ pip install tensorflow-data-validation`

### TensorFlow Data Validation (TFDV) can analyze training and serving data to:

- compute `descriptive statistics`,

- infer a `schema`,

- detect `data anomalies`.

In [3]:
# TFDV can compute descriptive statistics that give a quick overview of the data:
# which features are present and the shapes of their value distributions.
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_csv(data_location='data/housing.csv', delimiter=',')



Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`




In [5]:
# stats

We can generate feature statistics from TFRecord files in a very similar way using the
following code:

In [6]:
base_dir = os.getcwd()
data_dir = os.path.join(os.pardir, "tfrecord_data")
tf_record_location = os.path.join(base_dir, data_dir)
tf_record_location

'C:\\Users\\ASUS\\building-machine-learning-pipelines\\Untitled Folder\\..\\tfrecord_data'

In [7]:
os.listdir(tf_record_location)

['.ipynb_checkpoints', 'housing.tfrecord']

In [8]:
stats = tfdv.generate_statistics_from_tfrecord(
    data_location=os.path.join(tf_record_location, "housing.tfrecord"))
# The returned value is a DatasetFeatureStatisticsList protocol buffer.

In [9]:
# stats

In [10]:
tfdv.visualize_statistics(stats)

The previous example assumes that the data is stored in a TFRecord file. TFDV also supports the CSV input format and can be extended to other common formats; the available data decoders are listed in the TFDV documentation. In addition, TFDV provides the `tfdv.generate_statistics_from_dataframe` utility function for users with in-memory data represented as a pandas DataFrame.

In addition to computing a default set of data statistics, TFDV can also compute statistics for semantic domains (e.g., images, text). To enable computation of semantic domain statistics, pass a `tfdv.StatsOptions` object with `enable_semantic_domain_stats` set to `True` to `tfdv.generate_statistics_from_tfrecord`.

## Inferring a schema over the data


The schema describes the expected properties of the data. Some of these properties are:

- which features are expected to be present, and their types

- the number of values for a feature in each example

- the presence of each feature across all examples

- the expected domains of features
    
In short, the schema describes the expectations for "correct" data and can thus be used to detect errors in the data (described below). Moreover, the same schema can be used to set up TensorFlow Transform for data transformations. Note that the schema is expected to be fairly static, e.g., several datasets can conform to the same schema, whereas statistics (described above) can vary per dataset.

Since writing a schema can be a tedious task, especially for datasets with lots of features, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics:

In [12]:
schema = tfdv.infer_schema(stats)

In [9]:
schema

feature {
  name: "households"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "housing_median_age"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "latitude"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "longitude"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "median_house_value"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "median_income"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "ocean_proximity"
  type: BYTES
  domain: "ocean_proximity"
  presence {
    min_fra

In general, TFDV uses conservative heuristics to infer stable data properties from the statistics in order to avoid overfitting the schema to the specific dataset. It is strongly advised to review the inferred schema and refine it as needed, to capture any domain knowledge about the data that TFDV's heuristics might have missed.
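
To make the idea of inference heuristics concrete, here is a toy sketch in plain Python. It is not TFDV's actual algorithm: the `infer_toy_schema` helper, its dictionary output, and the sample rows are all hypothetical, and only illustrate how type, presence, and domain could be derived from observed data.

```python
# Toy illustration only: TFDV's real heuristics are more conservative
# and work from precomputed statistics, not from raw rows.

def _is_float(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_toy_schema(rows, max_domain_size=10):
    """Infer a simple schema (type, presence, domain) from a list of dicts.

    Missing values are represented by empty strings or absent keys.
    """
    feature_names = sorted({name for row in rows for name in row})
    schema = {}
    for name in feature_names:
        values = [row[name] for row in rows if row.get(name, "") != ""]
        is_numeric = bool(values) and all(_is_float(v) for v in values)
        entry = {
            "type": "FLOAT" if is_numeric else "BYTES",
            # A feature seen in every row is marked as required.
            "presence": "required" if len(values) == len(rows) else "optional",
        }
        distinct = sorted(set(values))
        # Only small sets of string values are treated as a domain.
        if not is_numeric and len(distinct) <= max_domain_size:
            entry["domain"] = distinct
        schema[name] = entry
    return schema


# Hypothetical sample rows loosely modelled on the housing data.
sample_rows = [
    {"total_rooms": "880.0", "total_bedrooms": "129.0", "ocean_proximity": "NEAR BAY"},
    {"total_rooms": "7099.0", "total_bedrooms": "", "ocean_proximity": "INLAND"},
    {"total_rooms": "1467.0", "total_bedrooms": "190.0", "ocean_proximity": "NEAR BAY"},
]
toy_schema = infer_toy_schema(sample_rows)
for feature, properties in toy_schema.items():
    print(feature, properties)
```

In this sample, `total_bedrooms` comes out as optional only because one row happens to be empty; with a different sample it would be marked required, which is why an inferred schema deserves a manual review.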

***`tfdv.infer_schema` generates a schema protocol buffer defined by TensorFlow.***

In [13]:
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'households' | FLOAT | required | | - |
| 'housing_median_age' | FLOAT | required | | - |
| 'latitude' | FLOAT | required | | - |
| 'longitude' | FLOAT | required | | - |
| 'median_house_value' | FLOAT | required | | - |
| 'median_income' | FLOAT | required | | - |
| 'ocean_proximity' | STRING | required | | 'ocean_proximity' |
| 'population' | FLOAT | required | | - |
| 'total_bedrooms' | FLOAT | required | | - |
| 'total_rooms' | FLOAT | required | | - |

| Domain | Values |
|---|---|
| 'ocean_proximity' | '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN' |


**By analyzing the statistical properties of the data, considering the frequency and distribution of values, and incorporating user-defined settings, TFDV's infer_schema function determines which features should be marked as required and which ones can be marked as optional in the inferred schema.**

In this visualization, Presence indicates whether the feature must be present in 100% of
data examples (required) or not (optional). Valency is the number of values
required per training example. For categorical features, `single` means that
each training example must have exactly one category for the feature.

The schema that has been generated here may not be exactly what we need because it
assumes that the current dataset is exactly representative of all future data as well. If a
feature is present in all training examples in this dataset, it will be marked as
required, but in reality it may be optional. We will show you how to update the
schema according to your own knowledge of the dataset.

**With the schema now defined, we can compare our training and evaluation datasets (shown later)
and check them for any problems that may affect our model.**

In [22]:
# Generate two files from the existing data file for demo purposes


In [14]:

# Define file paths
data_folder = '../Untitled Folder/data/data_validation/'
original_file_path = '../Untitled Folder/data/housing.csv'
dataset_1_path = os.path.join(data_folder, 'dataset_1.csv')
dataset_2_path = os.path.join(data_folder, 'dataset_2.csv')

# Create the data_validation folder if it doesn't exist
os.makedirs(data_folder, exist_ok=True)

# Read the original data file
with open(original_file_path, 'r') as original_file:
    lines = original_file.readlines()

# Write the first subset of data to dataset_1.csv
with open(dataset_1_path, 'w') as dataset_1_file:
    dataset_1_file.writelines(lines[:10000])

# Write the second subset of data to dataset_2.csv
with open(dataset_2_path, 'w') as dataset_2_file:
    dataset_2_file.writelines([lines[0]] + lines[10000:20000])




In [15]:
train_stats = tfdv.generate_statistics_from_csv(
    data_location=os.path.join(data_folder,"dataset_1.csv"),
    delimiter=',')
val_stats = tfdv.generate_statistics_from_csv(
    data_location=os.path.join(data_folder,"dataset_2.csv"),
    delimiter=',')


In [16]:
tfdv.visualize_statistics(lhs_statistics=val_stats, rhs_statistics=train_stats,
                          lhs_name='VAL_DATASET', rhs_name='TRAIN_DATASET')


## Check for evaluation anomalies
Does our evaluation dataset match the schema from our training dataset? This is especially important for categorical features, where we want to identify the range of acceptable values.

Key Point: What would happen if we tried to evaluate using data with categorical feature values that were not in our training dataset? What about numeric features that are outside the ranges in our training dataset?

In [18]:
# Anomalies in validation set can be detected using the following code:
anomalies = tfdv.validate_statistics(statistics=val_stats, schema=schema)
# And we can then display the anomalies with:
tfdv.display_anomalies(anomalies)

# The underlying anomaly protocol buffer (printed in the next cell) contains useful
# information that we can use to automate our machine learning workflow.

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'total_bedrooms' | Multiple errors | The feature has a shape, but it's not always present (if the feature is nested, then it should always be present at each nested level) or its value lengths vary. The feature was present in fewer examples than expected: minimum fraction = 1.000000, actual = 0.990100 |


In [19]:
anomalies

baseline {
  feature {
    name: "households"
    type: FLOAT
    presence {
      min_fraction: 1.0
      min_count: 1
    }
    shape {
      dim {
        size: 1
      }
    }
  }
  feature {
    name: "housing_median_age"
    type: FLOAT
    presence {
      min_fraction: 1.0
      min_count: 1
    }
    shape {
      dim {
        size: 1
      }
    }
  }
  feature {
    name: "latitude"
    type: FLOAT
    presence {
      min_fraction: 1.0
      min_count: 1
    }
    shape {
      dim {
        size: 1
      }
    }
  }
  feature {
    name: "longitude"
    type: FLOAT
    presence {
      min_fraction: 1.0
      min_count: 1
    }
    shape {
      dim {
        size: 1
      }
    }
  }
  feature {
    name: "median_house_value"
    type: FLOAT
    presence {
      min_fraction: 1.0
      min_count: 1
    }
    shape {
      dim {
        size: 1
      }
    }
  }
  feature {
    name: "median_income"
    type: FLOAT
    presence {
      min_fraction: 1.0
      min_count: 1

The provided information points out a problem with the "total_bedrooms" feature in your data:

1. **Description of the Problem:**
   - The "total_bedrooms" feature is supposed to be present in every example, but it is missing in some of them, or the number of values it holds is not the same everywhere. For example, some rows have a value while others are empty.
   - The expected rule was that this feature should be in all parts of your data (like in every row), but it's actually missing in around 0.99% of the places where it should be.

2. **Severity of the Problem:**
   - This is marked as a serious issue (ERROR), meaning it needs immediate attention because it affects the reliability of your data analysis or machine learning models.

3. **Reasons for the Problem:**
   - The first reason (INVALID_FEATURE_SHAPE) explains that while the feature "total_bedrooms" has a defined structure, it's not consistently present where it should be or doesn't always have the expected length.
   - The second reason (FEATURE_TYPE_LOW_FRACTION_PRESENT) tells us that this feature is missing more often than it should be based on what's expected from the data.

4. **Path of the Problem:**
   - The problem specifically relates to the "total_bedrooms" feature in your dataset.

To fix this, you might need to ensure that the "total_bedrooms" feature is consistently available in all parts of your data, or if there are variations in the length of this feature, you'll need to handle that consistently across your dataset.
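
One way to confirm such a report is to count how often the column is actually populated. The sketch below uses only the standard library; the `presence_fraction` helper and the inline sample CSV are hypothetical stand-ins for the real `dataset_2.csv`.

```python
import csv
import io


def presence_fraction(csv_file, column):
    """Return the fraction of rows in which `column` has a non-empty value."""
    reader = csv.DictReader(csv_file)
    total = 0
    present = 0
    for row in reader:
        total += 1
        value = row.get(column)
        if value is not None and value.strip() != "":
            present += 1
    return present / total if total else 0.0


# Hypothetical sample; in practice, pass an open file handle instead,
# e.g. open(dataset_2_path, newline="").
sample_csv = io.StringIO(
    "total_rooms,total_bedrooms,ocean_proximity\n"
    "880.0,129.0,NEAR BAY\n"
    "7099.0,,INLAND\n"
    "1467.0,190.0,NEAR BAY\n"
    "1274.0,235.0,NEAR OCEAN\n"
)
fraction = presence_fraction(sample_csv, "total_bedrooms")
print(f"total_bedrooms present in {fraction:.2%} of rows")
```

Comparing this fraction with the `min_fraction` in the schema shows whether the data needs cleaning or the schema needs relaxing.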

### Updating the Schema

In [50]:
# Our training set is missing about 1 percent of 'total_bedrooms' values, compared with 0.99 percent in the validation set. What if 'total_bedrooms' is an important
# column and we need it in most of the training examples? Let's update our schema.

In [47]:
schema_location = "../pipeline_root_learning/tfx_for_tfrecordData/SchemaGen/schema/15/schema.pbtxt"  #or schema = tfdv.infer_schema(stats)
# schema = tfdv.load_schema_text(schema_location)
schema = tfdv.infer_schema(stats)

In [20]:
schema

feature {
  name: "households"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "housing_median_age"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "latitude"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "longitude"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "median_house_value"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "median_income"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "ocean_proximity"
  type: BYTES
  domain: "ocean_proximity"
  presence {
    min_fra

In [21]:
# We update this particular feature so that it is required in only 99% of cases
total_bedroom_feature = tfdv.get_feature(schema, 'total_bedrooms')
total_bedroom_feature.presence.min_fraction = 0.99

In [22]:
# We could also update the 'ocean_proximity' domain to remove '<1H OCEAN':
ocean_pro = tfdv.get_domain(schema, 'ocean_proximity')
ocean_pro.value.remove('<1H OCEAN')

In [23]:
schema

feature {
  name: "households"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "housing_median_age"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "latitude"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "longitude"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "median_house_value"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "median_income"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "ocean_proximity"
  type: BYTES
  domain: "ocean_proximity"
  presence {
    min_fra

- Comparing the schema before and after the update, the `total_bedrooms` feature now only needs to be present in 99% of examples, and we have removed `'<1H OCEAN'` from the `ocean_proximity` domain.
- Once we are happy with the schema, we write the schema file to its serialized location with the following:

In [25]:
schema_location = "../pipeline_root_learning/tfx_for_tfrecordData/SchemaGen/schema/15/schema.pbtxt"
tfdv.write_schema_text(schema, schema_location)

We then need to revalidate the statistics to view the updated anomalies:

In [26]:
updated_anomalies = tfdv.validate_statistics(val_stats, schema)
tfdv.display_anomalies(updated_anomalies)

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'total_bedrooms' | Feature shape dropped | The feature has a shape, but it's not always present (if the feature is nested, then it should always be present at each nested level) or its value lengths vary. |
| 'ocean_proximity' | Unexpected string values | Examples contain values missing from the schema: <1H OCEAN (~35%). |


Since we removed one value from the `ocean_proximity` domain, validating against the updated schema produces a new anomaly.

In this way, we can adjust the anomalies to those that are appropriate for our dataset.
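
The same check can gate an automated workflow. The sketch below assumes the anomalies result exposes an `anomaly_info` mapping from feature name to an entry with a `short_description` field, as the TFDV anomalies protocol buffer does. The `find_blocking_anomalies` helper, its `allowed_features` parameter, and the stub object used in place of a real result are hypothetical.

```python
from types import SimpleNamespace


def find_blocking_anomalies(anomalies, allowed_features=()):
    """Return the sorted names of features with anomalies that are not allowed.

    `anomalies` is expected to expose `anomaly_info`, a mapping from
    feature name to an object with a `short_description` attribute.
    """
    blocking = []
    for feature_name, info in anomalies.anomaly_info.items():
        if feature_name in allowed_features:
            continue
        print(f"{feature_name}: {info.short_description}")
        blocking.append(feature_name)
    return sorted(blocking)


# Stub standing in for the result of tfdv.validate_statistics(...).
stub_anomalies = SimpleNamespace(
    anomaly_info={
        "total_bedrooms": SimpleNamespace(short_description="Feature shape dropped"),
        "ocean_proximity": SimpleNamespace(short_description="Unexpected string values"),
    }
)

blocking = find_blocking_anomalies(stub_anomalies, allowed_features={"total_bedrooms"})
if blocking:
    print("Stop the pipeline and review:", blocking)
```

Keeping the list of tolerated features explicit makes it visible which anomalies were accepted on purpose and which are new.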

## Check for drift and skew


TFDV provides a built-in “skew comparator” that detects large differences between
the statistics of two datasets. This isn’t the statistical definition of skew (a dataset that
is asymmetrically distributed around its mean). It is defined in TFDV as the L-infinity norm of the difference between the `serving_statistics` of two datasets. If
the difference between the two datasets exceeds the threshold of the L-infinity norm
for a given feature, TFDV highlights it as an anomaly using the anomaly detection shown earlier.
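
As a rough illustration of the quantity being thresholded, the following sketch computes the L-infinity distance between the relative value frequencies of a categorical feature in two datasets. The `linf_distance` helper and the sample values are hypothetical, and TFDV derives this number from its statistics protocol buffers rather than from raw lists.

```python
from collections import Counter


def linf_distance(values_a, values_b):
    """Return (distance, category): the largest absolute difference between
    the relative frequencies of any category in the two non-empty samples."""
    counts_a, counts_b = Counter(values_a), Counter(values_b)
    worst_category, worst_diff = None, 0.0
    for category in sorted(set(counts_a) | set(counts_b)):
        freq_a = counts_a[category] / len(values_a)
        freq_b = counts_b[category] / len(values_b)
        diff = abs(freq_a - freq_b)
        if diff > worst_diff:
            worst_category, worst_diff = category, diff
    return worst_diff, worst_category


# Hypothetical samples of a categorical feature in two datasets.
training = ["INLAND"] * 5 + ["NEAR BAY"] * 3 + ["<1H OCEAN"] * 2
serving = ["INLAND"] * 4 + ["NEAR BAY"] * 1 + ["<1H OCEAN"] * 5

distance, category = linf_distance(training, serving)
threshold = 0.01
if distance > threshold:
    print(f"Skew on {category!r}: {distance:.2f} exceeds {threshold}")
```

Because only the single largest per-category difference counts, one category that shifts strongly is enough to trigger the anomaly, even if all other categories are stable.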



### Compare the skew between datasets:

In [29]:
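# The L-infinity norm compares categorical distributions, and the anomaly reported
# below is for 'ocean_proximity', so the threshold is set on that feature.
tfdv.get_feature(schema, 'ocean_proximity').skew_comparator.infinity_norm.threshold = 0.01
skew_anomalies = tfdv.validate_statistics(
    statistics=train_stats, schema=schema, serving_statistics=val_stats)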

In [30]:
tfdv.display_anomalies(skew_anomalies)


| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'total_bedrooms' | Multiple errors | The feature has a shape, but it's not always present (if the feature is nested, then it should always be present at each nested level) or its value lengths vary. The feature was present in fewer examples than expected: minimum fraction = 0.990000, actual = 0.989999 |
| 'ocean_proximity' | Multiple errors | Examples contain values missing from the schema: <1H OCEAN (~53%). The Linfty distance between training and serving is 0.180354 (up to six significant digits), above the threshold 0.01. The feature value with maximum difference is: <1H OCEAN |


TFDV also provides a `drift_comparator` for comparing the statistics of two datasets
of the same type, such as two training sets collected on two different days. If drift is
detected, the data scientist should either check the model architecture or determine
whether feature engineering needs to be performed again.

Similar to this skew example, you should define your drift_comparator for the features
you would like to watch and compare. You can then call validate_statistics
with the two dataset statistics as arguments, one for your baseline (e.g., yesterday’s
dataset) and one for a comparison (e.g., today’s dataset):

In [34]:
# The L-infinity comparator applies to categorical features such as 'ocean_proximity'.
tfdv.get_feature(schema, 'ocean_proximity').drift_comparator.infinity_norm.threshold = 0.01
# drift_anomalies = tfdv.validate_statistics(
#     statistics=train_stats_today, schema=schema, previous_statistics=train_stats_yesterday)


## Biased Datasets

In [36]:
tfdv.visualize_statistics(lhs_statistics=val_stats, rhs_statistics=train_stats,
                          lhs_name='VAL_DATASET', rhs_name='TRAIN_DATASET')


In [38]:
# The data is biased in 'ocean_proximity': some categories are sampled far more heavily than others.
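
The imbalance can also be quantified without the visualization. The sketch below counts how often each category occurs; the `category_shares` helper and the sample values are hypothetical and merely mimic a skewed `ocean_proximity` column.

```python
from collections import Counter


def category_shares(values):
    """Return a dict mapping each category to its share of all values,
    ordered from most to least frequent."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.most_common()}


# Hypothetical sample values for a skewed categorical column.
sample = ["<1H OCEAN"] * 6 + ["INLAND"] * 3 + ["ISLAND"] * 1
shares = category_shares(sample)
for category, share in shares.items():
    print(f"{category:<12}{share:.0%}")
```

A category with a very small share, such as `ISLAND` here, gives the model few examples to learn from and may call for resampling or for collecting more data.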

## Integrating TFDV into Your Machine Learning Pipeline