<a href="https://colab.research.google.com/github/senkmp/TensorFlow-2.0/blob/master/Tutorial_titanic_problem_estimator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to train a linear model and a boosted tree model in TensorFlow
In this tutorial, we will see how to use tf.estimator.LinearClassifier and tf.estimator.BoostedTreesClassifier to classify structured data (a pandas dataframe), building the input pipeline with feature columns (tf.feature_column) and tf.data.

You will learn:

* Creating different types of feature columns using tf.feature_column
* Creating input functions with tf.data for the train, validation and test sets
* Creating and training a tf.estimator model
* Evaluating the model
* Predicting on test data

## The Dataset

I have used the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview) dataset from Kaggle. You can [download](https://www.kaggle.com/c/3136/download-all) it and find a [description](https://www.kaggle.com/c/titanic/data) of the dataset on Kaggle. I ran this notebook on Google Colab, so I uploaded the data to Google Drive.

## Mount Google Drive
I have uploaded the data to **Google Drive**. Learn how to use data from Google Drive [here](https://medium.com/ml-book/simplest-way-to-open-files-from-google-drive-in-google-colab-fae14810674).

---

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Import TensorFlow and other libraries
I have used the TensorFlow 2.0 nightly preview, which was an unstable build at the time of writing (Aug 2019).

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd

# Install the TensorFlow 2.0 nightly preview and scikit-learn (notebook shell commands).
!pip install tf-nightly-2.0-preview
!pip install scikit-learn
import tensorflow as tf

from tensorflow import feature_column
from sklearn.model_selection import train_test_split
from IPython.display import clear_output

Collecting tf-nightly-2.0-preview
  Downloading https://files.pythonhosted.org/packages/63/2d/b478f71fd352b4b7c175b7bf04b74fd86e0e0f10e9acf148814a6e02fe42/tf_nightly_2.0_preview-2.0.0.dev20190829-cp36-cp36m-manylinux2010_x86_64.whl (89.1MB)
Collecting opt-einsum>=2.3.2 (from tf-nightly-2.0-preview)
  Downloading https://files.pythonhosted.org/packages/c0/1a/ab5683d8e450e380052d3a3e77bb2c9dffa878058f583587c3875041fb63/opt_einsum-3.0.1.tar.gz (66kB)
Collecting tensorflow-estimator-2.0-preview (from tf-nightly-2.0-preview)
  Downloading https://files.pythonhosted.org/packages/a2/96/497ea214af4bc285a95d749c8bfb6f246d79075cb585a21adcb3d18b991b/tensorflow_estimator_2.0_preview-1.14.0.dev2019082901-py2.py3-none-any.whl (450kB)
Collecting tb-nightly<1.16.0a0,>=1.15.0a0 (from tf-nightly-2.0-previe

# Load and preprocess Data

## Use Pandas to create a dataframe

[Pandas](https://pandas.pydata.org/) is a Python library with many helpful utilities for loading and working with structured data. We will use Pandas to read the dataset from the mounted Google Drive and load it into a dataframe.

In [0]:
data = pd.read_csv('drive/My Drive/collab data/titanic/train.csv')
data.head(5)

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [0]:
data.shape

(891, 12)

## Missing Data

### Check missing values

In [0]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

### Missing value handling

As you can see, there are missing values in 'Age', 'Embarked' and 'Cabin'. 'Cabin' has a large number of missing values, so we drop this column from the data. In 'Age' we fill missing values with the mean, and in 'Embarked' with the most frequent value.

In [0]:
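# Fill 'Age' with the rounded mean and 'Embarked' with the most frequent value.
mean_value = round(data['Age'].mean())
mode_value = data['Embarked'].mode()[0]

value = {'Age': mean_value, 'Embarked': mode_value}
data.fillna(value=value, inplace=True)

# Drop every column that still has missing values (only 'Cabin' at this point).
data.dropna(axis=1, inplace=True)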

In [0]:
data.shape

(891, 11)
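
To make the two steps above concrete, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the Titanic data). It assumes the same strategy as above: rounded mean for 'Age', most frequent value for 'Embarked', and dropping whatever column still has gaps.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [20.0, None, 40.0],
    "Embarked": ["S", "S", None],
    "Cabin": [None, "C85", None],
})

# Fill 'Age' with the rounded mean and 'Embarked' with the most frequent value.
fill_values = {
    "Age": round(toy["Age"].mean()),
    "Embarked": toy["Embarked"].mode()[0],
}
filled = toy.fillna(value=fill_values)

# 'Cabin' is not in fill_values, so it still has gaps and gets dropped here.
cleaned = filled.dropna(axis=1)

print(cleaned)
```

Because 'Cabin' is left out of the fill dictionary, it is the only column removed by `dropna(axis=1)`.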

## Explore the data with the pandas_profiling library

In [0]:
import pandas_profiling as pdpf
pdpf.ProfileReport(data)

# Train, val, test Split

We will split the data into train, validation and test sets in a 3:1:1 ratio.

In [0]:
train, test = train_test_split(data, test_size=0.2)
train, val = train_test_split(train, test_size=0.25)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

534 train examples
178 validation examples
179 test examples
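
The counts above can be reproduced with a little arithmetic. This is only a sketch: the helper `split_sizes` is a name I made up, and it assumes the held-out share of each split is rounded up to a whole row, which is consistent with the 534/178/179 printed above.

```python
import math


def split_sizes(n_rows, test_size=0.2, val_size=0.25):
    """Return (train, val, test) row counts for the two-step split."""
    # Assumption: the held-out share is rounded up to a whole row.
    n_test = math.ceil(test_size * n_rows)
    remaining = n_rows - n_test
    # The validation share is taken from what is left after the test split.
    n_val = math.ceil(val_size * remaining)
    return remaining - n_val, n_val, n_test


print(split_sizes(891))  # (534, 178, 179)
```

Taking 25% of the remaining 80% gives 20% of the original data, which is why the overall ratio is 3:1:1.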


# Input pipeline

## Feature columns
Learn more about feature columns [here](https://medium.com/ml-book/demonstration-of-tensorflow-feature-columns-tf-feature-column-3bfcca4ca5c4).

### Decide which types of features you have in data
During data exploration, note what types of features the data has. Is a feature numerical or categorical? If it is numerical, can it be grouped into buckets? If it is categorical, how many categories does it have, and should it be converted into an indicator column or an embedding column? Are there two features that can be combined into a new crossed feature? I recommend reading this simplified [tutorial on feature columns](https://medium.com/ml-book/demonstration-of-tensorflow-feature-columns-tf-feature-column-3bfcca4ca5c4).
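
As a rough illustration of what bucketizing and indicator (one-hot) encoding do to a single value, here is a library-free sketch. The helper names are hypothetical and this is not how TensorFlow implements it; it only assumes that each bucket includes its lower boundary and excludes its upper one. The boundaries are the ones used for the age buckets later in this tutorial.

```python
from bisect import bisect_right

AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucket_index(value, boundaries):
    """Index of the half-open interval [b_i, b_{i+1}) that contains value."""
    return bisect_right(boundaries, value)


def one_hot(index, depth):
    """Indicator vector with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(depth)]


# Ten boundaries produce eleven buckets (one below 18, one from 65 upwards).
n_buckets = len(AGE_BOUNDARIES) + 1

for age in (17, 18, 22, 70):
    idx = bucket_index(age, AGE_BOUNDARIES)
    print(age, idx, one_hot(idx, n_buckets))
```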

In [0]:
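num_c = ['Age','Fare','Parch','SibSp']   # numerical features
bucket_c  = ['Age']                      # numerical feature to bucketize

cat_i_c = ['Embarked', 'Pclass','Sex']   # categorical features for indicator columns
cat_e_c = ['Ticket']                     # categorical feature for an embedding column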

### Scaler function
It is very important to scale numerical variables. Here I have used min-max scaling. We create a function named 'get_scal' that takes the name of a numerical feature and returns a 'minmax' function, which is passed to tf.feature_column.numeric_column() as the normalizer_fn parameter. The 'minmax' function takes a value of that feature and returns its scaled value, using the minimum and maximum of the feature in the training set.

In [0]:
def get_scal(feature):
  # Take the minimum and maximum from the training set only.
  mini = train[feature].min()
  maxi = train[feature].max()
  def minmax(x):
    return (x - mini)/(maxi-mini)
  return minmax
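
To see the closure pattern in isolation, here is a small standalone sketch. `make_minmax` and the fare values are made up for illustration; unlike 'get_scal' it takes the values directly rather than a column name, and it adds a guard for constant columns that the function above does not have.

```python
def make_minmax(values):
    """Return a function that maps x to (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo

    def minmax(x):
        # Guard (my addition): a constant column would otherwise divide by zero.
        return (x - lo) / span if span else 0.0

    return minmax


train_fares = [0.0, 7.25, 53.1, 512.0]  # made-up training values
scale_fare = make_minmax(train_fares)

print([scale_fare(v) for v in train_fares])
# Values outside the training range are mapped outside [0, 1].
print(scale_fare(1024.0))
```

Since the statistics come from the training values only, validation or test values can fall below 0 or above 1 after scaling.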

### Creating feature columns


In [0]:
# Numerical columns
feature_columns = []
for header in num_c:
  scal_input_fn = get_scal(header)
  feature_columns.append(feature_column.numeric_column(header, normalizer_fn=scal_input_fn))

# Bucketized columns
Age = feature_column.numeric_column("Age")
age_buckets = feature_column.bucketized_column(Age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)

# Categorical indicator columns
for feature_name in cat_i_c:
  vocabulary = data[feature_name].unique()
  cat_c = tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary)
  one_hot = feature_column.indicator_column(cat_c)
  feature_columns.append(one_hot)

# Categorical embedding columns
for feature_name in cat_e_c:
  vocabulary = data[feature_name].unique()
  cat_c = tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary)
  embeding = feature_column.embedding_column(cat_c, dimension=50)
  feature_columns.append(embeding)

# Crossed columns
vocabulary = data['Sex'].unique()
Sex = tf.feature_column.categorical_column_with_vocabulary_list('Sex', vocabulary)

crossed_feature = feature_column.crossed_column([age_buckets, Sex], hash_bucket_size=1000)
crossed_feature = feature_column.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)
len(feature_columns)

10
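
The crossed column above hashes each (age bucket, sex) combination into one of `hash_bucket_size` slots. The sketch below imitates that idea with `hashlib.md5` as a stand-in; TensorFlow uses its own hash function, so the actual slot numbers will differ. `crossed_slot` is a hypothetical helper.

```python
import hashlib

HASH_BUCKET_SIZE = 1000
N_AGE_BUCKETS = 11  # ten boundaries give eleven age buckets
SEXES = ("male", "female")


def crossed_slot(age_bucket, sex, hash_bucket_size=HASH_BUCKET_SIZE):
    """Map one (age bucket, sex) pair to a slot in [0, hash_bucket_size)."""
    key = f"{age_bucket}_X_{sex}".encode("utf-8")
    # md5 is only a stand-in for the hash TensorFlow actually uses.
    return int(hashlib.md5(key).hexdigest(), 16) % hash_bucket_size


slots = {(b, s): crossed_slot(b, s) for b in range(N_AGE_BUCKETS) for s in SEXES}
print(len(slots), "combinations use", len(set(slots.values())), "distinct slots")
```

Only 22 combinations exist here, so most of the 1000 indicator positions stay unused. A smaller hash_bucket_size would give a narrower column, at the cost of a higher chance that two combinations collide.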



## Create an input data function using tf.data

Next, we will wrap the dataframes with [tf.data](https://www.tensorflow.org/guide/datasets). This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model. If we were working with a very large CSV file (so large that it does not fit into memory), we would use tf.data to read it from disk directly. That is not covered in this tutorial.

Here we define make_input_fn, which returns an input function for the given data. The input function specifies how the data is converted to a tf.data.Dataset that feeds the input pipeline in a streaming fashion. tf.data.Dataset can take in multiple sources such as a dataframe, a CSV-formatted file, and more. With tf.estimator we pass an input function to the model instead of the data itself, whereas with tf.keras we can pass the dataset returned by the input function directly.

In [0]:

def make_input_fn(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('Survived')
  def input_function():
    # Build a dataset of (feature dict, label) pairs from the dataframe.
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
      ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
  return input_function


train_input_fn = make_input_fn(train)
eval_input_fn = make_input_fn(val,shuffle=False)
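
Conceptually, the input function slices the dataframe into rows, optionally shuffles them, and groups them into batches. A plain-Python sketch of that behaviour (not of tf.data itself; `iter_batches` and the toy columns are made up) shows how many batches one pass over 534 training rows produces:

```python
import random


def iter_batches(features, labels, batch_size=32, shuffle=True, seed=None):
    """Yield (feature dict, label list) batches from column-oriented data."""
    n_rows = len(labels)
    order = list(range(n_rows))
    if shuffle:
        # Shuffle row positions once, then cut them into consecutive batches.
        random.Random(seed).shuffle(order)
    for start in range(0, n_rows, batch_size):
        rows = order[start:start + batch_size]
        batch_features = {name: [col[i] for i in rows]
                          for name, col in features.items()}
        batch_labels = [labels[i] for i in rows]
        yield batch_features, batch_labels


n_train = 534  # size of the training split in this tutorial
features = {"Age": [float(i % 80) for i in range(n_train)]}  # made-up column
labels = [i % 2 for i in range(n_train)]                     # made-up labels

batches = list(iter_batches(features, labels, batch_size=32, seed=0))
print(len(batches), len(batches[-1][1]))  # 17 22
```

Seventeen batches per pass is consistent with the global_step of 17 reported after training the linear model below; the last batch is smaller because 534 is not a multiple of 32.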

# Train linear model
After adding all the base features to the model, let's train the model. Training a model is just a single command using the tf.estimator API:

In [0]:
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)
linear_est.train(train_input_fn)
result = linear_est.evaluate(eval_input_fn)

clear_output()
print(result)



{'accuracy': 0.8033708, 'accuracy_baseline': 0.6235955, 'auc': 0.87380666, 'auc_precision_recall': 0.8492318, 'average_loss': 0.47463596, 'label/mean': 0.3764045, 'loss': 0.47913766, 'precision': 0.92105263, 'prediction/mean': 0.34997383, 'recall': 0.52238804, 'global_step': 17}


In [0]:
print(pd.Series(result))

accuracy                 0.803371
accuracy_baseline        0.623595
auc                      0.873807
auc_precision_recall     0.849232
average_loss             0.474636
label/mean               0.376404
loss                     0.479138
precision                0.921053
prediction/mean          0.349974
recall                   0.522388
global_step             17.000000
dtype: float64
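
These metrics are tied together by the confusion matrix. The counts below are not printed by the estimator; I inferred them from the reported values for the 178 validation rows, assuming the default 0.5 decision threshold, so treat them as a consistency check rather than an official output. `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Recompute a few of the reported metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Accuracy of always predicting the more frequent class.
        "accuracy_baseline": max(tp + fn, fp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "label/mean": (tp + fn) / total,
    }


# Inferred counts (assumption): 35 true positives, 3 false positives,
# 32 false negatives and 108 true negatives out of 178 validation rows.
linear = classification_metrics(tp=35, fp=3, fn=32, tn=108)
for name, value in linear.items():
    print(f"{name:20s}{value:.6f}")
```

Read this way, the linear model rarely predicts survival wrongly (high precision) but misses almost half of the actual survivors (low recall).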


# Train boosted tree model

The TensorFlow boosted trees model does not support embedding columns (as of Aug 2019), so we create a list of feature columns without the embedding column.

In [0]:
feature_columns1 = list(set(feature_columns)-set([embeding]))
feature_columns1

[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='Sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 NumericColumn(key='SibSp', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=<function get_scal.<locals>.minmax at 0x7f73e180bea0>),
 NumericColumn(key='Fare', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=<function get_scal.<locals>.minmax at 0x7f73e63b1e18>),
 IndicatorColumn(categorical_column=CrossedColumn(keys=(BucketizedColumn(source_column=NumericColumn(key='Age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(18, 25, 30, 35, 40, 45, 50, 55, 60, 65)), VocabularyListCategoricalColumn(key='Sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), hash_bucket_size=1000, hash_key=None)),
 NumericColumn(key='Parch', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=<function get_scal.<locals>.mi

In [0]:
n_batches = 1
est = tf.estimator.BoostedTreesClassifier(feature_columns1,
                                          n_batches_per_layer=n_batches)

# The model will stop training once the specified number of trees is built, not
# based on the number of steps.
est.train(train_input_fn)

# Eval.
result = est.evaluate(eval_input_fn)
clear_output()
print(pd.Series(result))


accuracy                 0.797753
accuracy_baseline        0.623595
auc                      0.858747
auc_precision_recall     0.809862
average_loss             0.478607
label/mean               0.376404
loss                     0.484489
precision                0.771930
prediction/mean          0.395406
recall                   0.656716
global_step             16.000000
dtype: float64
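
One thing to be aware of: as far as I understand the TensorFlow documentation, n_batches_per_layer is the number of batches used to collect statistics for each tree layer. With batch_size=32 and n_batches_per_layer=1, each layer would therefore be grown from a single batch of 32 rows. If you want each layer to see the whole training set once, a sketch of the arithmetic (`batches_per_epoch` is a hypothetical helper) is:

```python
import math


def batches_per_epoch(n_rows, batch_size):
    """Number of batches needed to cover n_rows once."""
    return math.ceil(n_rows / batch_size)


# 534 is the size of the training split in this tutorial.
print(batches_per_epoch(534, 32))   # 17
print(batches_per_epoch(534, 534))  # 1
```

Under that reading, one option would be to pass n_batches_per_layer=17 with the default batch size; another would be to build the training input function with batch_size=len(train) and keep n_batches_per_layer=1.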


# Problem test data

## Load and preprocess data

In [0]:
test_data = pd.read_csv('drive/My Drive/collab data/titanic/test.csv')
test_data.head()

|  | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 |  | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 |  | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 |  | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 |  | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 |  | S |


In [0]:
test_data.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [0]:
# Fill missing values in the test set with statistics from the labelled data.
mean_value = round(data['Age'].mean())
mean_value1 = data['Fare'].mean()

value = {'Age': mean_value, 'Fare': mean_value1}
test_data.fillna(value=value, inplace=True)

In [0]:
test_data.dropna(axis=1,inplace=True)

## Input function

In [0]:
def test_input_fn(features, batch_size=256):
  def input_fn():
    return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)
  return input_fn

test_input = test_input_fn(test_data)

## Prediction on linear model

In [0]:
pred_dicts1 = list(linear_est.predict(test_input))
probs1 = pd.Series([pred['probabilities'][1] for pred in pred_dicts1])

# Work on a copy so that adding a column does not trigger pandas' SettingWithCopyWarning.
predict_df_l = test_data[['PassengerId']].copy()
predict_df_l['Survived'] = (probs1>=.5).astype(int)
predict_df_l.head(10)

W0829 11:33:54.219903 140136900896640 base_layer.py:1820] Layer linear/linear_model is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.


|  | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 0 |
| 5 | 897 | 0 |
| 6 | 898 | 0 |
| 7 | 899 | 0 |
| 8 | 900 | 1 |
| 9 | 901 | 0 |
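
The thresholding step can be seen on its own in the following sketch. The prediction dictionaries are made up and only mimic the 'probabilities' key used above; `to_labels` is a hypothetical helper.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn survival probabilities into 0/1 labels."""
    return [int(p >= threshold) for p in probabilities]


# Made-up predictions: each entry holds [P(not survived), P(survived)].
pred_dicts = [
    {"probabilities": [0.9, 0.1]},
    {"probabilities": [0.3, 0.7]},
    {"probabilities": [0.5, 0.5]},
]

# Index 1 is the probability of the positive class (survived).
survival_probs = [pred["probabilities"][1] for pred in pred_dicts]
print(to_labels(survival_probs))  # [0, 1, 1]
```

A probability of exactly 0.5 counts as survived because the comparison is `>=`, matching the code above.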


## Prediction on boosted tree model

In [0]:
pred_dicts = list(est.predict(test_input))
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])
# Work on a copy so that adding a column does not trigger pandas' SettingWithCopyWarning.
predict_df_bt = test_data[['PassengerId']].copy()
predict_df_bt['Survived'] = (probs>=.5).astype(int)
predict_df_bt.head(10)


|  | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| 5 | 897 | 0 |
| 6 | 898 | 1 |
| 7 | 899 | 0 |
| 8 | 900 | 1 |
| 9 | 901 | 0 |
