<h1><center>CS 455/595a: Decision Trees</center></h1>
<center>Richard S. Stansbury</center>

This notebook applies the decision tree concepts covered in [1] to the [Titanic](https://www.kaggle.com/c/titanic/) and [Boston Housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html) data sets for decision-tree-based classification and regression, respectively.



References:

[1] Aurélien Géron. *Hands-On Machine Learning with Scikit-Learn & TensorFlow*. O'Reilly Media Inc., 2017.

[2] Aurélien Géron. "ageron/handson-ml: A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in python using Scikit-Learn and TensorFlow." Github.com, online at: https://github.com/ageron/handson-ml [last accessed 2019-03-01]

**Table of Contents**
1. [Titanic Survivor Classifier w/ Decision Trees](#Titanic-Survivor-Classifier)
    * [Decision Tree Demonstration](#Decision-Tree-Demonstration)
 
2. [Boston Housing Cost Estimator w/ Decision Tree](#Boston-Housing-Cost-Estimator)

# Titanic Survivor Classifier

## Set up

In [27]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC, LinearSVC, SVR, LinearSVR
from sklearn import datasets

from matplotlib import pyplot as plt
%matplotlib inline 

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score, f1_score 
from sklearn.metrics import mean_squared_error, mean_absolute_error

import numpy as np
import pandas as pd
import os

# Read data from input files into Pandas data frames
data_path = os.path.join("datasets","titanic")
train_filename = "train.csv"
test_filename = "test.csv"

def read_csv(data_path, filename):
    joined_path = os.path.join(data_path, filename)
    return pd.read_csv(joined_path)

# Read CSV file into Pandas Dataframes
train_df = read_csv(data_path, train_filename)


# Defining Data Pre-Processing Pipelines
class DataFrameSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self, attributes):
        self.attributes = attributes
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X[self.attributes]

class MostFrequentImputer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        self.most_frequent = pd.Series([X[c].value_counts().index[0] for c in X], 
                                       index = X.columns)
        return self
    
    def transform(self, X):
        return X.fillna(self.most_frequent)

    
numeric_pipe = Pipeline([
        ("Select", DataFrameSelector(["Age", "Fare", "SibSp", "Parch"])), # Selects Fields from dataframe
        ("Imputer", SimpleImputer(strategy="median")),   # Fills in NaN w/ median value for its column
    ])

categories_pipe = Pipeline([
        ("Select", DataFrameSelector(["Pclass", "Sex"])), # Selects Fields from dataframe
        ("MostFreqImp", MostFrequentImputer()), # Fill in NaN with most frequent
        ("OneHot", OneHotEncoder(sparse=True)), # Onehot encode
    ])

preprocessing_pipe = FeatureUnion(transformer_list = [
        ("numeric pipeline", numeric_pipe), 
        ("categories pipeline", categories_pipe)
     ]) 

# Process Input Data Using Pipelines
train_X_data = preprocessing_pipe.fit_transform(train_df)

# Scale Input Data
#s = StandardScaler()
#train_X_data = s.fit_transform(train_X_data)

print(train_X_data[1])

train_y_data = train_df["Survived"]

  (0, 0)	38.0
  (0, 1)	71.2833
  (0, 2)	1.0
  (0, 4)	1.0
  (0, 7)	1.0


## KNN Classifier (for comparison)

In [28]:
# KNN classifier (k = 10 neighbors) evaluated with 5-fold cross-validation
k=10
clf = KNeighborsClassifier(n_neighbors=k)

y_pred = cross_val_predict(clf, train_X_data, train_y_data, cv=5)

print("Confusion Matrix:")
print(confusion_matrix(train_y_data, y_pred))
print("Accuracy Score = " + str(accuracy_score(train_y_data, y_pred)))
print("Precision Score = " + str(precision_score(train_y_data, y_pred)))
print("Recall Score = " + str(recall_score(train_y_data,y_pred)))
print("F1 Score = " + str(f1_score(train_y_data,y_pred)))


Confusion Matrix:
[[457  92]
 [178 164]]
Accuracy Score = 0.696969696969697
Precision Score = 0.640625
Recall Score = 0.47953216374269003
F1 Score = 0.5484949832775919
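
As a cross-check, the four scores above can be recomputed by hand from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and treats class 1 (survived) as the positive class; the variable names are illustrative only.

```python
# Counts copied from the KNN confusion matrix printed above.
# Row 0 = actually died, row 1 = actually survived.
tn, fp = 457, 92    # died:     predicted died / predicted survived
fn, tp = 178, 164   # survived: predicted died / predicted survived

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of all passengers classified correctly
precision = tp / (tp + fp)                   # of predicted survivors, how many survived
recall = tp / (tp + fn)                      # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The low recall shows that the KNN model misses more than half of the actual survivors, which is the main gap the decision tree narrows in the next section.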


## Decision Tree Demonstration

In [29]:
from sklearn.tree import DecisionTreeClassifier


clf2 = DecisionTreeClassifier(max_depth=10)

y_pred = cross_val_predict(clf2, train_X_data, train_y_data, cv=5)

print("Confusion Matrix:")
print(confusion_matrix(train_y_data, y_pred))
print("Accuracy Score = " + str(accuracy_score(train_y_data, y_pred)))
print("Precision Score = " + str(precision_score(train_y_data, y_pred)))
print("Recall Score = " + str(recall_score(train_y_data,y_pred)))
print("F1 Score = " + str(f1_score(train_y_data,y_pred)))

Confusion Matrix:
[[488  61]
 [106 236]]
Accuracy Score = 0.8125701459034792
Precision Score = 0.7946127946127947
Recall Score = 0.6900584795321637
F1 Score = 0.7386541471048513
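
`DecisionTreeClassifier` chooses its splits by minimizing Gini impurity by default (`criterion="gini"`). The sketch below illustrates that measure in plain Python. The root-node counts (549 died, 342 survived) are the row totals of the confusion matrix above; the child-node counts are made-up toy numbers, and the helper names `gini` and `weighted_gini` are hypothetical, not part of scikit-learn.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(left_counts, right_counts):
    """Impurity of a split: child impurities weighted by child size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

# Root node of the Titanic training set: 549 died, 342 survived.
root_impurity = gini([549, 342])
print(f"root impurity = {root_impurity:.4f}")

# Toy example (invented counts): a split that separates the classes fairly well.
parent = gini([50, 50])
split = weighted_gini([40, 10], [10, 40])
print(f"toy parent = {parent:.2f}, toy split = {split:.2f}")
```

A candidate split is attractive when its weighted impurity is well below the parent's impurity; a pure node (all samples in one class) has an impurity of 0.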


In [32]:
from sklearn.tree import export_graphviz

clf = DecisionTreeClassifier(max_depth=4)
clf.fit(train_X_data, train_y_data)

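# Column order follows the preprocessing pipeline: the numeric fields, then the
# one-hot columns for Pclass (1, 2, 3) and Sex (sorted categories: female, male)
feature_names = ["Age", "Fare", "SibSp", "Parch", "Class A", "Class B", "Class C", "F", "M"]
# class_names must be listed in ascending class order: 0 = died, 1 = survived
target_names = ["Died","Survived"]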

export_graphviz(
        clf,
        out_file="tree.dot",
        feature_names=feature_names,
        class_names=target_names,
        rounded=True,
        filled=True
    )
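
The `tree.dot` file needs Graphviz to be rendered. As an alternative that needs no extra tooling, `sklearn.tree.export_text` prints the learned rules as plain text. The sketch below is a minimal illustration on hypothetical toy data (not the Titanic set); the names `X_toy`, `y_toy`, `toy_clf`, and the column label `Sex_female` are assumptions made for the example.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (invented): columns are [Age, Sex_female].
X_toy = [[22, 0], [38, 1], [26, 1], [35, 0], [54, 0], [27, 1]]
# Labels are constructed to equal the Sex_female column, so a single
# split on that feature separates the classes perfectly.
y_toy = [0, 1, 1, 0, 0, 1]

toy_clf = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_clf.fit(X_toy, y_toy)

# feature_names must follow the column order of the training matrix.
tree_text = export_text(toy_clf, feature_names=["Age", "Sex_female"])
print(tree_text)
```

The same call should work on the Titanic classifier above by passing `clf` and the nine-entry `feature_names` list, provided the list matches the column order produced by the preprocessing pipeline.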

# Boston Housing Cost Estimator

## Setup

In [None]:
# Load Data Set
# Note: load_boston was removed in scikit-learn 1.2, so this cell requires an older release
boston_housing_data = datasets.load_boston()

# Build data frame for visualization
boston_df = pd.DataFrame(np.c_[boston_housing_data.data, boston_housing_data.target], 
                  columns=["CRIM", "ZN","INDUS","CHAS", "NOX","RM","AGE",
                           "DIS","RAD","TAX","PTRatio","BK", "LSTAT","MEDV"])

scaler = StandardScaler()
boston_data_set = scaler.fit_transform(boston_housing_data.data)
train_X, test_X, train_y, test_y = train_test_split(boston_data_set,
                                                   boston_housing_data.target,
                                                   test_size=0.33)


def plot_learning_curves(model, X, y):
    """
    Plots performance on the training set and testing (validation) set.
    X-axis - number of training samples used
    Y-axis - RMSE
    """
    
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.20)
    
    training_errors, validation_errors = [], []
    
    for m in range(1, len(train_X)):
        
        model.fit(train_X[:m], train_y[:m])
        
        # Training error is measured only on the m samples the model was fit on
        train_pred = model.predict(train_X[:m])
        test_pred = model.predict(test_X)
        
        training_errors.append(np.sqrt(mean_squared_error(train_y[:m], train_pred)))
        validation_errors.append(np.sqrt(mean_squared_error(test_y, test_pred)))
        
    plt.plot(training_errors, "r-+", label="train")
    plt.plot(validation_errors, "b-", label="test")
    plt.legend()
    plt.axis([0, 80, 0, 3])