<h1><center>CS 455/595a: MLP Demo using TensorFlow</center></h1>
<center>Richard S. Stansbury</center>

This notebook applies the ANN techniques covered in [1] to the [Titanic](https://www.kaggle.com/c/titanic/) and [Boston Housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html) data sets for MLP-based classification and regression, respectively.

Several different approaches to model construction are shown in the demos below.

References:

[1] Aurélien Géron. *Hands-On Machine Learning with Scikit-Learn and TensorFlow*. O'Reilly Media, Inc., 2017.

[2] Aurélien Géron. "ageron/handson-ml: A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in python using Scikit-Learn and TensorFlow." Github.com, online at: https://github.com/ageron/handson-ml [last accessed 2019-03-01]

**Table of Contents**
1. [Titanic Survivor ANN Classifiers](#Titanic-Survivor-Classifier)
2. [Boston Housing Cost ANN Regressor](#Boston-Housing-Cost-Estimator)

# Titanic Survivor Classifier

## Set up - Imports of libraries and Data Preparation

In [2]:
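from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import os
import tensorflow as tf  # TensorFlow 1.x API; needed by reset_graph() below

# From: https://github.com/ageron/handson-ml/blob/master/09_up_and_running_with_tensorflow.ipynb
def reset_graph():
    # Clear the default graph so each demo starts from an empty graph
    tf.reset_default_graph()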

Import the data and apply pipelines to pre-process the data.

In [16]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.model_selection import train_test_split

# Read data from input files into Pandas data frames
data_path = os.path.join("datasets","titanic")
train_filename = "train.csv"
test_filename = "test.csv"

def read_csv(data_path, filename):
    joined_path = os.path.join(data_path, filename)
    return pd.read_csv(joined_path)

# Read CSV file into Pandas Dataframes
train_df = read_csv(data_path, train_filename)

# Defining Data Pre-Processing Pipelines
class DataFrameSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self, attributes):
        self.attributes = attributes
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X[self.attributes]

class MostFrequentImputer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        self.most_frequent = pd.Series([X[c].value_counts().index[0] for c in X], 
                                       index = X.columns)
        return self
    
    def transform(self, X):
        return X.fillna(self.most_frequent)


numeric_pipe = Pipeline([
        ("Select", DataFrameSelector(["Age", "Fare", "SibSp", "Parch"])), # Selects Fields from dataframe
        ("Imputer", SimpleImputer(strategy="median")),   # Fills in NaN w/ median value for its column
        ("Scaler", StandardScaler()),
    ])

categories_pipe = Pipeline([
        ("Select", DataFrameSelector(["Pclass", "Sex"])), # Selects Fields from dataframe
        ("MostFreqImp", MostFrequentImputer()), # Fill in NaN with most frequent
        ("OneHot", OneHotEncoder(sparse=False, categories='auto')), # Onehot encode
    ])

preprocessing_pipe = FeatureUnion(transformer_list = [
        ("numeric pipeline", numeric_pipe), 
        ("categories pipeline", categories_pipe)
     ]) 

# Process input data using pipelines
X_data = preprocessing_pipe.fit_transform(train_df)
y_data = train_df["Survived"].values.reshape(-1,1)

# Names of the 9 output columns: 4 numeric, then one-hot Pclass (1, 2, 3) and Sex (female, male)
feature_names = ["Age", "Fare", "SibSp", "Parch",
                 "Pclass1", "Pclass2", "Pclass3", "SexFemale", "SexMale"]

print(X_data.shape)
print(y_data.shape)

(891, 9)
(891, 1)
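
The nine columns reported above come from four scaled numeric features plus one column per category value: three for Pclass and two for Sex. The following is a minimal NumPy sketch of that count. It assumes hypothetical sample rows rather than the contents of train.csv, and the helper name one_hot is illustrative and not part of the notebook.

```python
import numpy as np

def one_hot(column):
    """One-hot encode a 1-D array; categories are taken in sorted order."""
    categories, codes = np.unique(column, return_inverse=True)
    encoded = np.eye(len(categories))[codes]
    return categories, encoded

# Hypothetical sample rows (not read from train.csv)
pclass = np.array([3, 1, 3, 2, 1])
sex = np.array(["male", "female", "female", "male", "male"])
numeric = np.zeros((5, 4))  # stand-in for scaled Age, Fare, SibSp, Parch

pclass_cats, pclass_hot = one_hot(pclass)
sex_cats, sex_hot = one_hot(sex)

# Same column order as the FeatureUnion: numeric block, then categorical block
X = np.hstack([numeric, pclass_hot, sex_hot])
print(pclass_cats, sex_cats, X.shape)
```

Because the categories are sorted, the Pclass columns appear in the order 1, 2, 3 and the Sex columns in the order female, male.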


Split the data into a training and validation set.

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size = 0.33)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(596, 9) (596, 1) (295, 9) (295, 1)


Implementation using tf.estimator.DNNClassifier (formerly part of TF.Learn), with two hidden layers of 20 units each.

In [23]:
# Construction Phase

import tensorflow as tf

reset_graph()


feature_cols = [tf.feature_column.numeric_column("X", shape=[X_data.shape[1]])]

dnn_clf = tf.estimator.DNNClassifier(hidden_units=[20,20], n_classes=2,
                                     feature_columns=feature_cols)

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"X": X_train}, y=y_train, batch_size=50, num_epochs=400, shuffle=True)
dnn_clf.train(input_fn=train_input_fn)

test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"X": X_test}, y=y_test, shuffle=False)
eval_results = dnn_clf.evaluate(input_fn=test_input_fn)
                                
eval_results


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\richa\\AppData\\Local\\Temp\\tmp152d877p', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001DD025F89E8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create Checkp

{'accuracy': 0.8305085,
 'accuracy_baseline': 0.64406776,
 'auc': 0.85167915,
 'auc_precision_recall': 0.82776606,
 'average_loss': 0.4634297,
 'label/mean': 0.3559322,
 'loss': 45.570587,
 'precision': 0.7522936,
 'prediction/mean': 0.38629514,
 'recall': 0.7809524,
 'global_step': 4768}
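
The reported metrics can be cross-checked with a little arithmetic. The sketch below recovers the confusion-matrix counts from precision, recall, and label/mean, then recomputes accuracy, accuracy_baseline, global_step, and loss. Two points are assumptions rather than facts taken from the notebook: that evaluation used a batch size of 128 (the default of numpy_input_fn), and that 'loss' is the per-batch summed loss averaged over evaluation batches. Both are consistent with the numbers shown.

```python
# Values copied from the evaluation output above
precision = 0.7522936
recall = 0.7809524
label_mean = 0.3559322
average_loss = 0.4634297

n_train, n_test = 596, 295        # rows in X_train and X_test
train_batch, epochs = 50, 400     # arguments passed to train_input_fn
eval_batch = 128                  # assumption: numpy_input_fn default batch_size

# Recover the confusion-matrix counts
positives = round(label_mean * n_test)        # passengers who survived
true_pos = round(recall * positives)
pred_pos = round(true_pos / precision)
false_pos = pred_pos - true_pos
true_neg = n_test - positives - false_pos

accuracy_check = (true_pos + true_neg) / n_test
baseline_check = max(positives, n_test - positives) / n_test

# Training steps: every instance is seen `epochs` times in batches of 50
global_step_check = n_train * epochs // train_batch

# Assumed meaning of 'loss': total loss divided by the number of eval batches
eval_batches = -(-n_test // eval_batch)       # ceiling division
loss_check = average_loss * n_test / eval_batches

print(accuracy_check, baseline_check, global_step_check, loss_check)
```

The accuracy_baseline value is what a classifier would score by always predicting the majority class, so the network's accuracy should be read against that figure rather than against 0.5.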

In [84]:
reset_graph()

def get_batch(data, iteration, size):
    # Return the iteration-th contiguous slice of `size` rows
    return data[(iteration*size) : ((iteration+1)*size)]

learning_rate = 0.1
num_features = X_train.shape[1]
num_instances = X_train.shape[0]

# Construction
X = tf.placeholder(tf.float32, shape=(None, num_features), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("Titanic_MLP"):
    hidden1 = tf.layers.dense(X, 20, name="Hidden-1", activation = tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, 10, name="Hidden-2", activation=tf.nn.relu)
    hidden3 = tf.layers.dense(hidden2, 5, name="Hidden-3", activation=tf.nn.relu)
    logits = tf.layers.dense(hidden3, 2, name="Survived")
    
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                              logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    
with tf.name_scope("train"): 
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
    
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
init = tf.global_variables_initializer()

saver = tf.train.Saver()

n_epochs = 100
batch_size = 50


# Execution
with tf.Session() as sess:
    init.run()
    
    for epoch in range(n_epochs):
        
        for iteration in range(num_instances // batch_size + 1):
            X_batch = get_batch(X_train, iteration, batch_size)
            y_batch = get_batch(y_train, iteration, batch_size)
            
            sess.run(training_op, feed_dict={X: X_batch,
                                            y: y_batch.reshape(y_batch.shape[0])})
        # Note: training accuracy is measured on the last mini-batch of the epoch only
        acc_train = accuracy.eval(feed_dict={X: X_batch,
                                            y: y_batch.reshape(y_batch.shape[0])})
        acc_val = accuracy.eval(feed_dict={X:X_test, y: y_test.reshape(y_test.shape[0])})
        
        print("{}-Train: {} Test:{}".format(epoch,
                                           acc_train,
                                           acc_val))

0-Train: 0.6304348111152649 Test:0.6847457885742188
1-Train: 0.695652186870575 Test:0.7593220472335815
2-Train: 0.760869562625885 Test:0.7762711644172668
3-Train: 0.782608687877655 Test:0.806779682636261
4-Train: 0.804347813129425 Test:0.806779682636261
5-Train: 0.804347813129425 Test:0.8237287998199463
6-Train: 0.8478260636329651 Test:0.8305084705352783
7-Train: 0.8478260636329651 Test:0.8338983058929443
8-Train: 0.8260869383811951 Test:0.8338983058929443
9-Train: 0.8478260636329651 Test:0.8406779766082764
10-Train: 0.8478260636329651 Test:0.8406779766082764
11-Train: 0.8478260636329651 Test:0.8406779766082764
12-Train: 0.8478260636329651 Test:0.8406779766082764
13-Train: 0.8260869383811951 Test:0.8406779766082764
14-Train: 0.8260869383811951 Test:0.8338983058929443
15-Train: 0.804347813129425 Test:0.8305084705352783
16-Train: 0.8260869383811951 Test:0.8305084705352783
17-Train: 0.804347813129425 Test:0.8305084705352783
18-Train: 0.804347813129425 Test:0.8305084705352783
19-Train: 0.8
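
The Train values above move in coarse steps because the training accuracy is evaluated on the last mini-batch of each epoch, not on the whole training set. The sketch below reproduces the slicing for 596 training rows and a batch size of 50. The array is a stand-in for X_train, and the helper mirrors the notebook's get_batch.

```python
import numpy as np

def get_batch(data, iteration, size):
    """Return the iteration-th contiguous slice of `size` rows."""
    return data[iteration * size:(iteration + 1) * size]

num_instances, batch_size = 596, 50   # training rows and batch size used above
data = np.arange(num_instances)       # stand-in for the rows of X_train

n_batches = num_instances // batch_size + 1
sizes = [len(get_batch(data, i, batch_size)) for i in range(n_batches)]
print(sizes)

# Accuracy on a 46-row batch can only be a multiple of 1/46
print(29 / 46, 39 / 46)
```

The final batch holds the 46 leftover rows, so every Train value is some count divided by 46. Note that `num_instances // batch_size + 1` would yield an empty final batch whenever the row count is an exact multiple of the batch size; a ceiling division avoids that case.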

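
For readers who want to see what the loss and eval nodes of the hand-built graph compute, here is a NumPy sketch of sparse softmax cross-entropy and top-1 accuracy. The logits and labels are hypothetical, the function name is illustrative, and the argmax comparison matches in_top_k with k=1 only when there are no ties between logits.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    """Per-row cross-entropy between softmax(logits) and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

# Hypothetical logits for three passengers (columns: class 0, class 1)
logits = np.array([[2.0, 0.5],
                   [0.1, 1.2],
                   [0.3, 0.2]])
labels = np.array([0, 1, 1])   # integer labels, as in the Survived column

# Mean loss over the batch, as tf.reduce_mean does in the graph
loss = sparse_softmax_cross_entropy(logits, labels).mean()

# A prediction counts as correct when the largest logit sits at the label index
correct = logits.argmax(axis=1) == labels
accuracy = correct.astype(np.float32).mean()
print(loss, accuracy)
```

The "sparse" variant takes integer class indices directly, which is why the notebook flattens y_batch to a 1-D vector instead of one-hot encoding it.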

# Boston Housing Cost Estimator

Building on the classifier examples above, this section applies ANN-based regression to the Boston Housing data set.

## Setup

In [8]:
from sklearn import datasets
from sklearn.metrics import mean_squared_error

# Load Data Set
# Note: load_boston was removed in scikit-learn 1.2, so this call needs an older release
boston_housing_data = datasets.load_boston()

train_X, test_X, train_y, test_y = train_test_split(boston_housing_data.data,
                                                   boston_housing_data.target,
                                                   test_size=0.33)

def plot_learning_curves(model, X, y):
    """
    Plots performance on the training set and testing (validation) set.
    X-axis - number of training samples used
    Y-axis - RMSE
    """
    
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.20)
    
    training_errors, validation_errors = [], []
    
    for m in range(1, len(train_X)):
        
        # Fit on the first m training samples only
        model.fit(train_X[:m], train_y[:m])
        
        train_pred = model.predict(train_X)
        test_pred = model.predict(test_X)
        
        training_errors.append(np.sqrt(mean_squared_error(train_y, train_pred)))
        validation_errors.append(np.sqrt(mean_squared_error(test_y, test_pred)))
        
    plt.plot(training_errors, "r-+", label="train")
    plt.plot(validation_errors, "b-", label="test")
    plt.legend()
    plt.axis([0, 80, 0, 3])
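
The learning-curve idea used by plot_learning_curves can be illustrated without scikit-learn or the Boston data. The sketch below is only an illustration under stated assumptions: it uses synthetic one-feature data, a least-squares linear fit as a stand-in for the model, and hypothetical helper names (rmse, fit_linear, predict_linear). Unlike the notebook function, it measures training error on the first m samples only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def fit_linear(X, y):
    """Least-squares fit with a bias term; a stand-in for model.fit."""
    X_b = np.c_[np.ones(len(X)), X]
    weights, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return weights

def predict_linear(weights, X):
    """Apply the fitted weights; a stand-in for model.predict."""
    return np.c_[np.ones(len(X)), X] @ weights

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=100)

X_train, X_val = X[:80], X[80:]
y_train, y_val = y[:80], y[80:]

train_errors, val_errors = [], []
for m in range(1, len(X_train) + 1):
    w = fit_linear(X_train[:m], y_train[:m])
    train_errors.append(rmse(y_train[:m], predict_linear(w, X_train[:m])))
    val_errors.append(rmse(y_val, predict_linear(w, X_val)))

print(train_errors[0], train_errors[-1], val_errors[-1])
```

With a single training sample the fit is exact, so the training error starts near zero and rises toward the noise level, while the validation error typically falls as more samples are used.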