<a href="https://colab.research.google.com/github/joanby/tensorflow2/blob/master/Collab%209%20-%20Validación%20de%20datos%20con%20TensorFlow%20Data%20Validation%20(TFDV).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Step 1: Install all dependencies and set up the environment

In [None]:
!apt-get install python-dev python-snappy

Reading package lists... Done
Building dependency tree       
Reading state information... Done
python-dev is already the newest version (2.7.15~rc1-1).
The following NEW packages will be installed:
  python-snappy
0 upgraded, 1 newly installed, 0 to remove and 22 not upgraded.
Need to get 10.8 kB of archives.
After this operation, 39.9 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 python-snappy amd64 0.5-1.1build2 [10.8 kB]
Fetched 10.8 kB in 0s (35.6 kB/s)
Selecting previously unselected package python-snappy.
(Reading database ... 144618 files and directories currently installed.)
Preparing to unpack .../python-snappy_0.5-1.1build2_amd64.deb ...
Unpacking python-snappy (0.5-1.1build2) ...
Setting up python-snappy (0.5-1.1build2) ...


In [None]:
!pip install -q --upgrade tensorflow_data_validation


In [None]:
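import os
# Kill the current process so that Colab restarts the runtime and picks up the newly installed packages.
os.kill(os.getpid(), 9)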

## Step 2: Import the libraries for the project

In [None]:
# __future__ imports must be the first statement in the cell; this one is a no-op on Python 3.
from __future__ import print_function

import pandas as pd
import tensorflow as tf
import tensorflow_data_validation as tfdv

## Step 3: Descriptive analysis of the dataset


In [None]:
dataset = pd.read_csv("https://raw.githubusercontent.com/joanby/tensorflow2/master/datasets/pollution-small.csv")

In [None]:
dataset.shape

(2188, 5)

In [None]:
training_data = dataset[:1600]

In [None]:
training_data.describe()

statistic,pm10,no2,so2,soot
count,1600.0,1600.0,1600.0,1600.0
mean,49.656494,30.980519,16.229981,21.551956
std,35.211906,12.400788,10.621896,12.127354
min,6.38,9.74,4.01,6.0
25%,28.345,22.5675,9.7775,14.4
50%,38.835,28.715,13.275,18.63
75%,58.05,36.37,19.2825,24.0725
max,277.25,138.01,123.13,107.65


In [None]:
test_set = dataset[1600:]

In [None]:
test_set.describe()

statistic,pm10,no2,so2,soot
count,588.0,588.0,588.0,588.0
mean,44.648248,37.296922,13.60517,18.44131
std,28.992087,10.94005,5.098944,6.596459
min,11.9,15.07,4.99,8.0
25%,28.3375,29.2175,10.1225,14.41
50%,35.555,35.815,12.345,17.09
75%,50.8125,43.8725,15.855,20.9625
max,273.77,106.03,38.03,87.21
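The two summaries can also be compared numerically. The sketch below is an illustration and not part of the original notebook: it copies the means from the two `describe()` outputs above and computes the relative change of each feature's mean from training to test. The helper name `relative_shift` is hypothetical.

```python
# Means copied from the describe() outputs above.
train_mean = {"pm10": 49.656494, "no2": 30.980519, "so2": 16.229981, "soot": 21.551956}
test_mean = {"pm10": 44.648248, "no2": 37.296922, "so2": 13.60517, "soot": 18.44131}


def relative_shift(train, test):
    """Return (test - train) / train for every feature present in both dicts."""
    return {
        name: (test[name] - train[name]) / train[name]
        for name in train
        if name in test
    }


shifts = relative_shift(train_mean, test_mean)

# Print the features ordered by the magnitude of the shift, largest first.
for name, shift in sorted(shifts.items(), key=lambda item: abs(item[1]), reverse=True):
    print(f"{name}: {shift:+.1%}")
```

With these numbers, `no2` shows the largest change in mean (roughly 20% higher in the test set), which is the kind of difference that the TFDV statistics in the next step make easier to spot.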


## Step 3b: Descriptive analysis and validation with TFDV

### Generate descriptive statistics for the dataset


In [None]:
train_stats = tfdv.generate_statistics_from_dataframe(dataframe=training_data)

### Infer the schema

In [None]:
schema = tfdv.infer_schema(statistics=train_stats)

In [None]:
tfdv.display_schema(schema)

Feature name,Type,Presence,Valency,Domain
'Date',BYTES,required,,-
'pm10',FLOAT,required,,-
'no2',FLOAT,required,,-
'so2',FLOAT,required,,-
'soot',FLOAT,required,,-
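To give an intuition for what a schema like the one above captures, here is a deliberately simplified sketch in plain Python. It is not how TFDV implements `infer_schema`; the function `infer_toy_schema` and the sample rows are hypothetical and exist only for illustration.

```python
def infer_toy_schema(rows):
    """Infer a type and a presence for each column from a list of dict rows.

    Simplified illustration only; this is NOT TFDV's infer_schema.
    """
    type_names = {str: "BYTES", float: "FLOAT", int: "INT"}

    # Collect column names in first-seen order.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    schema = {}
    for name in columns:
        present = [row[name] for row in rows if row.get(name) is not None]
        kinds = {type_names.get(type(value), "UNKNOWN") for value in present}
        schema[name] = {
            # A single observed type is reported as-is; anything else is "MIXED".
            "type": kinds.pop() if len(kinds) == 1 else "MIXED",
            # "required" means every row carried a value for this column.
            "presence": "required" if len(present) == len(rows) else "optional",
        }
    return schema


# Hypothetical rows shaped like the pollution dataset (the values are made up).
rows = [
    {"Date": "2014-01-01", "pm10": 30.5, "no2": 25.0, "so2": 10.0, "soot": 15.0},
    {"Date": "2014-01-02", "pm10": 41.2, "no2": 31.4, "so2": 12.5, "soot": 18.0},
]

toy_schema = infer_toy_schema(rows)
for name, spec in toy_schema.items():
    print(name, spec["type"], spec["presence"])
```

The real schema is a protocol buffer with many more fields (valency, domains, environments), but the core idea is the same: summarize what the training data looks like so that later data can be checked against it.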


### Compute descriptive statistics for the test set


In [None]:
test_stats = tfdv.generate_statistics_from_dataframe(dataframe=test_set)

## Step 4: Compare the test set statistics against the schema

### Look for anomalies in the new data

In [None]:
anomalies = tfdv.validate_statistics(statistics=test_stats, schema=schema)

### Show all detected anomalies

Typical examples of anomalies that validation can flag:

- An integer greater than 10
- A STRING type where an INT type was expected
- A FLOAT type where an INT type was expected
- An integer less than 0

In [None]:
tfdv.display_anomalies(anomalies)
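The kinds of anomalies listed above can be illustrated with a small stand-alone check. This is a toy sketch, not TFDV code: the function `check_value` and the assumed valid range of 0 to 10 are hypothetical, chosen only to match the examples in the list.

```python
def check_value(value, expected_type=int, minimum=0, maximum=10):
    """Return a description of the anomaly in `value`, or None if it is valid."""
    # Exact type comparison, so that bool (a subclass of int) is not accepted as int.
    if type(value) is not expected_type:
        return f"{type(value).__name__} found where {expected_type.__name__} was expected"
    if value > maximum:
        return f"value {value} is greater than {maximum}"
    if value < minimum:
        return f"value {value} is less than {minimum}"
    return None


# One valid value followed by one example of each anomaly from the list.
samples = [5, 12, "7", 3.5, -1]
for sample in samples:
    print(repr(sample), "->", check_value(sample) or "ok")
```

TFDV performs comparable checks at the level of whole-feature statistics rather than individual values, and it reports them through `tfdv.display_anomalies`.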

### New data WITH anomalies

In [None]:
test_set_copy = test_set.copy()

In [None]:
test_set_copy.drop("soot", axis=1, inplace=True)

### Statistics based on the data with anomalies

In [None]:
test_set_copy_stats = tfdv.generate_statistics_from_dataframe(dataframe=test_set_copy)

In [None]:
anomalies_new = tfdv.validate_statistics(statistics=test_set_copy_stats, schema=schema)

In [None]:
tfdv.display_anomalies(anomalies_new)


Feature name,Anomaly short description,Anomaly long description
'soot',Column dropped,Column is completely missing


## Step 5: Prepare the schema for production

In [None]:
schema.default_environment.append("TRAINING")
schema.default_environment.append("SERVING")

### Exclude the target column from the production schema

In [None]:
tfdv.get_feature(schema, "soot").not_in_environment.append("SERVING")

### Check for anomalies between the serving environment and new incoming data

In [None]:
serving_env_anomalies = tfdv.validate_statistics(test_set_copy_stats, schema, environment="SERVING")

In [None]:
tfdv.display_anomalies(serving_env_anomalies)
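The effect of environments can be pictured with a toy model in plain Python. This is a sketch under stated assumptions, not TFDV's implementation: the dictionary-based schema and the function `missing_features` are hypothetical stand-ins for the schema protocol buffer and `tfdv.validate_statistics`.

```python
def missing_features(schema, columns, environment=None):
    """List schema features absent from `columns` for the given environment."""
    missing = []
    for name, spec in schema.items():
        excluded_from = spec.get("not_in_environment", [])
        if environment is not None and environment in excluded_from:
            continue  # The feature is not expected in this environment.
        if name not in columns:
            missing.append(name)
    return missing


# Toy schema mirroring the notebook: "soot" is excluded from SERVING.
toy_schema = {
    "Date": {},
    "pm10": {},
    "no2": {},
    "so2": {},
    "soot": {"not_in_environment": ["SERVING"]},
}

# Columns of the incoming data after "soot" was dropped.
serving_columns = ["Date", "pm10", "no2", "so2"]

print(missing_features(toy_schema, serving_columns))             # ['soot']
print(missing_features(toy_schema, serving_columns, "SERVING"))  # []
```

Without an environment, the dropped target column counts as missing; once the check is scoped to `SERVING`, the same data passes, because the target is not expected at serving time.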

## Step 6: Freeze the schema

In [None]:
tfdv.write_schema_text(schema=schema, output_path="pollution_schema.pbtxt")