<a href="https://colab.research.google.com/github/soerenml/tf2/blob/master/Introduction_TF_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# End to end ML project

Before you start with the project, make sure you **understand the underlying logic of the problem**. 

Think about the causal relationships first and look at the data later: there should be a clear logical representation of the problem **before** you jump into data analysis.

Furthermore, make sure you select a **performance measure** that suits your problem. For example, it makes a significant difference whether you use RMSE or MAE when sensitivity to outliers is an issue, because RMSE squares the errors and therefore penalizes large errors more heavily.
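
To make the outlier point concrete, here is a minimal sketch that compares RMSE with the mean absolute error (MAE) in plain Python. The targets and predictions are made up for illustration and are not taken from the Titanic data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: squares each error, so large misses dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every error contributes linearly."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and two sets of predictions.
y_true = [10.0, 12.0, 11.0, 13.0]
y_clean = [11.0, 11.0, 12.0, 12.0]    # every error has magnitude 1
y_outlier = [11.0, 11.0, 12.0, 33.0]  # the last error has magnitude 20

print("clean:   RMSE={:.3f} MAE={:.3f}".format(rmse(y_true, y_clean), mae(y_true, y_clean)))
print("outlier: RMSE={:.3f} MAE={:.3f}".format(rmse(y_true, y_outlier), mae(y_true, y_outlier)))
```

With equal-sized errors the two measures agree; a single large miss moves RMSE much further than MAE.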



In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
print("Tensorflow version: {}".format(tf.__version__))

%load_ext tensorboard

Tensorflow version: 2.2.0-rc2
The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


# Load the data

## Download and extract

The data is downloaded directly from a public Google Cloud Storage bucket. This tutorial does not aim to show production-ready ML pipelines; it is a gentle introduction to TF.

In [0]:
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
eval_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

print(train_file_path)

/root/.keras/datasets/train.csv


In [0]:
# Quick check that the file has been downloaded correctly.
!head {train_file_path}

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n


In [0]:
import pandas as pd  # must be imported before pd.read_csv is called

df_train = pd.read_csv(train_file_path)

## Check the data

In [0]:
import pandas as pd
df_train.iloc[1:10,:]

Unnamed: 0,survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
1,1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
2,1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
3,1,female,35.0,1,0,53.1,First,C,Southampton,n
4,0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
5,0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
6,1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
7,1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
8,1,female,4.0,1,1,16.7,Third,G,Southampton,n
9,0,male,20.0,0,0,8.05,Third,unknown,Southampton,y


In [0]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 627 entries, 0 to 626
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   survived            627 non-null    int64  
 1   sex                 627 non-null    object 
 2   age                 627 non-null    float64
 3   n_siblings_spouses  627 non-null    int64  
 4   parch               627 non-null    int64  
 5   fare                627 non-null    float64
 6   class               627 non-null    object 
 7   deck                627 non-null    object 
 8   embark_town         627 non-null    object 
 9   alone               627 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 49.1+ KB


In [0]:
df_train.describe()

Unnamed: 0,survived,age,n_siblings_spouses,parch,fare
count,627.0,627.0,627.0,627.0,627.0
mean,0.38756,29.631308,0.545455,0.379585,34.385399
std,0.487582,12.511818,1.15109,0.792999,54.59773
min,0.0,0.75,0.0,0.0,0.0
25%,0.0,23.0,0.0,0.0,7.8958
50%,0.0,28.0,0.0,0.0,15.0458
75%,1.0,35.0,1.0,0.0,31.3875
max,1.0,80.0,8.0,5.0,512.3292


In [0]:
df_train["embark_town"].value_counts()

Southampton    450
Cherbourg      123
Queenstown      53
unknown          1
Name: embark_town, dtype: int64

## Install facets

In [0]:
!pip install facets-overview



In [0]:
# Source: https://colab.research.google.com/github/PAIR-code/facets/blob/master/colab_facets.ipynb

from IPython.display import display, HTML

# Serialize the training data to JSON records for Facets Dive.
jsonstr = df_train.to_json(orient='records')
HTML_TEMPLATE = """
        <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))

## Load data into tensorflow
 
Using `tf.data` is highly recommended to get the best input-pipeline performance.

The following features have been excluded:
+ parch
+ deck
+ alone

In [0]:
df_train["sex"].value_counts()

In [0]:
LABEL_COLUMN = 'survived'
SELECTED_COLUMNS = ['survived','class','age','fare','embark_town','n_siblings_spouses','sex']

def get_dataset(file_path):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=5,
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True,
      select_columns=SELECTED_COLUMNS
      )
  return dataset

data_tf = get_dataset(train_file_path)

### Inspect tensorflow data

In [0]:
def show_batch(dataset):
  for batch, label in dataset.take(1):
    for key, value in batch.items():
      print("{:20s}: {}".format(key,value.numpy()))

show_batch(data_tf)

sex                 : [b'male' b'female' b'female' b'male' b'male']
age                 : [24. 21. 18. 71. 28.]
n_siblings_spouses  : [2 0 0 0 0]
fare                : [73.5    77.9583 13.     49.5042  7.8958]
class               : [b'Second' b'First' b'Second' b'First' b'Third']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Cherbourg' b'Southampton']


In [0]:
from tensorflow import feature_column

feature_columns = []

# Numeric cols
for header in ["fare"]:
  feature_columns.append(feature_column.numeric_column(header))

# Bucketized cols.
age = feature_column.numeric_column("age")
feature_columns.append(feature_column.bucketized_column(age, boundaries=[3, 10, 20, 50, 80]))

n_siblings_spouses = feature_column.numeric_column("n_siblings_spouses")
feature_columns.append(feature_column.bucketized_column(n_siblings_spouses, boundaries=[0, 1, 3]))

# Categorical columns.
ship_class = feature_column.categorical_column_with_vocabulary_list(
    'class', ["First", "Second", "Third"])
feature_columns.append(feature_column.indicator_column(ship_class))

embark_town = feature_column.categorical_column_with_vocabulary_list(
    'embark_town', ["Southampton", "Queenstown", "Cherbourg", "unknown"])
feature_columns.append(feature_column.indicator_column(embark_town))

sex = feature_column.categorical_column_with_vocabulary_list(
    'sex', ["male", "female"])
feature_columns.append(feature_column.indicator_column(sex))
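
Conceptually, wrapping a vocabulary-list column in an indicator column yields a one-hot vector per value. The plain-Python sketch below imitates that mapping for the `class` vocabulary. `one_hot` is a hypothetical helper, not a TensorFlow function, and it assumes that values outside the vocabulary map to an all-zero vector.

```python
def one_hot(value, vocabulary):
    """Return a one-hot list marking the position of `value` in `vocabulary`.

    Values that are not in the vocabulary produce an all-zero list.
    """
    return [1.0 if value == entry else 0.0 for entry in vocabulary]

class_vocabulary = ["First", "Second", "Third"]

for ship_class in ["Second", "First", "Third"]:
    print("{:8s}: {}".format(ship_class, one_hot(ship_class, class_vocabulary)))
```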

In [0]:
# See what the preprocessed data looks like
example_batch = next(iter(data_tf))[0]

def demo(column):
  # The parameter is named `column` so it does not shadow the imported
  # `feature_column` module.
  feature_layer = tf.keras.layers.DenseFeatures(column)
  print(feature_layer(example_batch).numpy())

n_siblings_spouses = feature_column.numeric_column("n_siblings_spouses")
demo(feature_column.bucketized_column(n_siblings_spouses, boundaries=[0, 1, 3]))

[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]
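
The output above has four columns because three boundaries split the number line into four buckets. Here is a minimal pure-Python sketch of that mapping, assuming each bucket includes its lower boundary and excludes its upper one. `bucketize` is a hypothetical helper, not part of TensorFlow.

```python
from bisect import bisect_right

def bucketize(value, boundaries):
    """One-hot encode `value` into len(boundaries) + 1 buckets."""
    # bisect_right counts how many boundaries are <= value,
    # which is exactly the index of the bucket the value falls into.
    index = bisect_right(boundaries, value)
    return [1.0 if i == index else 0.0 for i in range(len(boundaries) + 1)]

boundaries = [0, 1, 3]  # buckets: (-inf, 0), [0, 1), [1, 3), [3, +inf)

for n in [0, 1, 2, 5]:
    print(n, bucketize(n, boundaries))
```

With a boundary at 0, the first bucket only catches negative values, so it stays empty for a count such as `n_siblings_spouses`. That matches the all-zero first column in the output above.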


In [0]:
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

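# Select between different models.
def model_types(model_type):
  # Compare strings with ==; `is` checks object identity, not equality.
  if model_type == "DNN":
    model = tf.keras.Sequential(
        [
         feature_layer,
         tf.keras.layers.Dense(128, activation='relu'),
         tf.keras.layers.Dense(128, activation='relu'),
         tf.keras.layers.Dense(1)
         ]
    )
  elif model_type == "LR":
    model = tf.keras.Sequential(
        [
         feature_layer,
         tf.keras.layers.Dense(1)
         ]
    )
  else:
    raise ValueError("Unknown model_type: {}".format(model_type))
  return model

# Keep the returned model so it can be compiled and fitted below.
model = model_types("DNN")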
model.compile(optimizer='adam',
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=['accuracy'])

model.fit(
    get_dataset(train_file_path),
    validation_data=get_dataset(eval_file_path),
    epochs=10
    )

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7ffa7a7d0400>
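
Because the last layer is `Dense(1)` without an activation and the loss is configured with `from_logits=True`, the model outputs raw logits rather than probabilities. The sketch below shows, in plain Python and with made-up logit values, how a logit relates to a survival probability and to the binary cross-entropy. Both helper functions are illustrative and are not TensorFlow APIs.

```python
import math

def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def binary_cross_entropy_from_logit(label, logit):
    """Binary cross-entropy for one example, given a 0/1 label and a raw logit."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Hypothetical logits for a passenger who survived (label = 1).
for logit in [-2.0, 0.0, 2.0]:
    print("logit={:+.1f}  p(survived)={:.3f}  loss={:.3f}".format(
        logit, sigmoid(logit), binary_cross_entropy_from_logit(1, logit)))
```

A logit of 0 corresponds to a probability of 0.5. In the notebook, applying `tf.sigmoid` to the model's predictions should perform the same conversion.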