# Hands-on #3: Titanic disaster

In this notebook, we'll take a look at TensorFlow Data Validation, using the Titanic dataset.

## Step 1: Loading the data

In [None]:
import os
import pandas as pd
titanic_file = '../data/titanic/titanic.csv'
df = pd.read_csv(titanic_file)
df.head()


## Step 2: Let TensorFlow Data Validation analyze the data

The TensorFlow Data Validation (TFDV) library can read data in the form of TFRecord files, CSV files, or pandas DataFrames, and generate statistics from it.

In [None]:
import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_dataframe(df)

These statistics can be visualized directly in the notebook (you may need to use Chrome or Chromium for the visualization to render):

In [None]:
tfdv.visualize_statistics(stats)
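
To give a rough feel for the kind of per-feature summaries shown in such a visualization, here is a minimal plain-Python sketch. It is not how TFDV computes its statistics; the helper names and the toy rows are made up for illustration and are not taken from the Titanic file.

```python
import math
from collections import Counter

# Hypothetical toy rows standing in for a few passenger records (not real data).
rows = [
    {"Age": 22.0, "Sex": "male"},
    {"Age": 38.0, "Sex": "female"},
    {"Age": None, "Sex": "female"},
    {"Age": 30.0, "Sex": "female"},
]

def summarize_numeric(values):
    """Count, missing count, mean, min and max of a numeric feature."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": sum(present) / len(present) if present else math.nan,
        "min": min(present) if present else math.nan,
        "max": max(present) if present else math.nan,
    }

def summarize_categorical(values):
    """Count, missing count, number of unique values and most frequent value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "unique": len(counts),
        "top": counts.most_common(1)[0][0] if counts else None,
    }

age_stats = summarize_numeric([r["Age"] for r in rows])
sex_stats = summarize_categorical([r["Sex"] for r in rows])
print(age_stats)
print(sex_stats)
```

Numeric and categorical features get different summaries here, which loosely mirrors how the visualization groups them.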

## Step 3: Infer a schema and spot anomalies

From the statistics, we can infer a schema, which can later be used as a blueprint to check new data for anomalies:

In [None]:
schema = tfdv.infer_schema(stats)
schema

Let's create a sample with a missing value and see whether this gets detected:

In [None]:
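faulty_csv = '../data/faulty.csv'
# Take the first row, blank out Age, and write it without the index column
# so that the CSV has the same columns as the original data.
faulty_samples = df.iloc[[0], :].assign(Age=None)
faulty_samples.to_csv(faulty_csv, index=False)

# The result holds statistics over the examples that violate the schema
anomalies = tfdv.validate_examples_in_csv(faulty_csv, tfdv.StatsOptions(schema=schema))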

In [None]:
tfdv.visualize_statistics(anomalies)
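
The idea of inferring a schema from reference data and then checking new rows against it can be sketched in plain Python. This is only a conceptual illustration under simplifying assumptions, not TFDV's actual schema format or validation logic; `infer_toy_schema`, `find_toy_anomalies`, and the toy rows are hypothetical. In the toy reference data `Age` is always present, so the sketch treats it as required.

```python
# Hypothetical toy rows; the notebook itself works on the Titanic DataFrame.
reference_rows = [
    {"Age": 22.0, "Sex": "male", "Pclass": 3},
    {"Age": 38.0, "Sex": "female", "Pclass": 1},
    {"Age": 26.0, "Sex": "female", "Pclass": 3},
]

def infer_toy_schema(rows):
    """Record, per column, the observed type and whether a value was always present."""
    schema = {}
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        present = [v for v in values if v is not None]
        schema[column] = {
            "type": type(present[0]) if present else None,
            "required": len(present) == len(values),
        }
    return schema

def find_toy_anomalies(row, schema):
    """Return a dict mapping column name to a short description of the problem."""
    problems = {}
    for column, spec in schema.items():
        value = row.get(column)
        if value is None:
            if spec["required"]:
                problems[column] = "missing value in required column"
        elif spec["type"] is not None and not isinstance(value, spec["type"]):
            problems[column] = (
                f"expected {spec['type'].__name__}, got {type(value).__name__}"
            )
    # Columns that the schema has never seen are reported as well.
    for column in row:
        if column not in schema:
            problems[column] = "column not in schema"
    return problems

toy_schema = infer_toy_schema(reference_rows)
faulty_row = {"Age": None, "Sex": "male", "Pclass": 3}
problems = find_toy_anomalies(faulty_row, toy_schema)
print(problems)
```

Note that whether a missing value counts as an anomaly depends on what the reference data looked like: if the reference rows had themselves contained missing ages, this sketch would not mark `Age` as required.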

## Step 4: Train a simple estimator to predict survival

We now want to train a pre-built estimator on the dataset. First, we split the data:

In [None]:
from sklearn.model_selection import train_test_split
train, val = train_test_split(df, test_size=0.2)
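
Because the call above does not fix a seed, the split differs from run to run; `train_test_split` accepts a `random_state` argument to make it reproducible. The following plain-Python sketch shows the underlying idea of a seeded shuffle-and-split. It is a simplified illustration, not scikit-learn's implementation, and `toy_train_test_split` is a hypothetical helper.

```python
import math
import random

def toy_train_test_split(items, test_size=0.2, seed=0):
    """Shuffle a copy of items with a fixed seed and split off a test fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    # The first n_test shuffled items form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

toy_train, toy_val = toy_train_test_split(range(10), test_size=0.2, seed=42)
print(len(toy_train), len(toy_val))
```

Calling the helper twice with the same seed yields the same split, which makes later evaluation results comparable between runs.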

Next, we apply some preprocessing to make the pandas DataFrame digestible for pre-built TensorFlow estimators.

In [None]:
import tensorflow as tf

# A utility function to create a tf.data Dataset from a pandas DataFrame
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('Survived')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    # Batch regardless of whether the data was shuffled
    ds = ds.batch(batch_size)
    return ds

def input_fn():
    return df_to_dataset(train)

def input_fn_eval():
    # The validation data does not need to be shuffled
    return df_to_dataset(val, shuffle=False)


age = tf.feature_column.numeric_column('Age')
sex = tf.feature_column.categorical_column_with_vocabulary_list('Sex', df.Sex.unique())
sex_ohe = tf.feature_column.indicator_column(sex)
pclass = tf.feature_column.categorical_column_with_vocabulary_list('Pclass', df.Pclass.unique())
pclass_ohe = tf.feature_column.indicator_column(pclass)

feature_columns = [age, sex_ohe, pclass_ohe]
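
To make the role of the indicator columns more concrete, here is a minimal plain-Python sketch of one-hot encoding against a fixed vocabulary. It only illustrates the idea and is not how TensorFlow implements feature columns; the `one_hot` and `encode_passenger` helpers and the vocabulary orderings are assumptions for illustration.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of value in vocabulary.

    Values outside the vocabulary map to an all-zero vector in this sketch.
    """
    return [1.0 if value == entry else 0.0 for entry in vocabulary]

# Assumed vocabularies; in the notebook they come from df.Sex.unique()
# and df.Pclass.unique(), whose order depends on the data.
sex_vocabulary = ["male", "female"]
pclass_vocabulary = [3, 1, 2]

def encode_passenger(age, sex, pclass):
    """Concatenate the numeric age with the one-hot encoded sex and class."""
    return (
        [float(age)]
        + one_hot(sex, sex_vocabulary)
        + one_hot(pclass, pclass_vocabulary)
    )

encoded = encode_passenger(age=22, sex="male", pclass=3)
print(encoded)  # [22.0, 1.0, 0.0, 1.0, 0.0, 0.0]
```

Each passenger thus becomes a fixed-length numeric vector: one entry for the age, two for the sex, and three for the passenger class.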

Now comes the training...

In [None]:
classifier = tf.estimator.BoostedTreesClassifier(feature_columns, n_batches_per_layer=5)
classifier.train(input_fn)

And now the validation:

In [None]:
classifier.evaluate(input_fn_eval)


As a final step, it would be nice to analyze our estimator with the TensorFlow Model Analysis library. To do so, however, we would still need to export the validation data in the form of a TFRecord file, and it is too late for that right now... Goodbye!