<h1><center>CS 455/595a: Support Vector Machines Demos</center></h1>
<center>Richard S. Stansbury</center>

This notebook applies the support vector machine concepts covered in [1] with the [Titanic](https://www.kaggle.com/c/titanic/) and [Boston Housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html) data sets for SVM-based classification and regression, respectively.



References:

[1] Aurélien Géron. *Hands-On Machine Learning with Scikit-Learn & TensorFlow*. O'Reilly Media, Inc., 2017.

[2] Aurélien Géron. "ageron/handson-ml: A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in python using Scikit-Learn and TensorFlow." Github.com, online at: https://github.com/ageron/handson-ml [last accessed 2019-03-01]

**Table of Contents**
1. [Titanic Survivor Classifier w/ SVM](#Titanic-Survivor-Classifier)
    * [Linear SVC Demonstration](#Linear-SVC-Demonstration)
    * [SVC with Linear Kernel Demo](#SVC-with-Linear-Kernel-Demo)
    * [LinearSVC with Polynomial Features](#LinearSVC-with-Polynomial-Features)
    * [SVC Classifier with Polynomial Kernel](#SVC-Classifier-with-Polynomial-Kernel)
    * [SVC with RBF Kernel](#SVC-with-RBF-Kernel)

2. Boston Demo - Coming Soon

# Titanic Survivor Classifier

## Set up

In [1]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.svm import SVC, LinearSVC

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score, f1_score 

import numpy as np
import pandas as pd
import os

In [2]:
# Read data from input files into Pandas data frames
data_path = os.path.join("datasets","titanic")
train_filename = "train.csv"
test_filename = "test.csv"

def read_csv(data_path, filename):
    joined_path = os.path.join(data_path, filename)
    return pd.read_csv(joined_path)

# Read CSV file into Pandas Dataframes
train_df = read_csv(data_path, train_filename)
test_df = read_csv(data_path, test_filename)

In [3]:
# Defining Data Pre-Processing Pipelines

class DataFrameSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self, attributes):
        self.attributes = attributes
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X[self.attributes]

class MostFrequentImputer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        self.most_frequent = pd.Series([X[c].value_counts().index[0] for c in X], 
                                       index = X.columns)
        return self
    
    def transform(self, X):
        return X.fillna(self.most_frequent)

    
numeric_pipe = Pipeline([
        ("Select", DataFrameSelector(["Age", "Fare", "SibSp", "Parch"])), # Selects Fields from dataframe
        ("Imputer", SimpleImputer(strategy="median")),   # Fills in NaN w/ median value for its column
    ])

categories_pipe = Pipeline([
        ("Select", DataFrameSelector(["Pclass", "Sex", "Embarked"])), # Selects Fields from dataframe
        ("MostFreqImp", MostFrequentImputer()), # Fill in NaN with most frequent
        # One-hot encode to a dense array; scikit-learn >= 1.2 renames `sparse` to `sparse_output`
        ("OneHot", OneHotEncoder(sparse=False)),
    ])

preprocessing_pipe = FeatureUnion(transformer_list = [
        ("numeric pipeline", numeric_pipe), 
        ("categories pipeline", categories_pipe)
     ]) 

In [4]:
# Process Input Data Using Pipelines
train_X_data = preprocessing_pipe.fit_transform(train_df)
# Reuse the imputers and encoder fitted on the training set; do not refit on the test set
test_X_data = preprocessing_pipe.transform(test_df)
train_y_data = train_df["Survived"]

## KNN Classifier (for comparison)

In [5]:
# KNN Classifier 10-fold Validation
k=10

clf_pipe = Pipeline([
        ("Scaler", StandardScaler()),
        ("classifier", KNeighborsClassifier(n_neighbors=k)), 
    ])


y_pred = cross_val_predict(clf_pipe, train_X_data, train_y_data, cv=10)

print("Confusion Matrix:")
print(confusion_matrix(train_y_data, y_pred))
print("Accuracy Score = " + str(accuracy_score(train_y_data, y_pred)))
print("Precision Score = " + str(precision_score(train_y_data, y_pred)))
print("Recall Score = " + str(recall_score(train_y_data, y_pred)))
print("F1 Score = " + str(f1_score(train_y_data, y_pred)))

Confusion Matrix:
[[509  40]
 [133 209]]
Accuracy Score = 0.8058361391694725
Precision Score = 0.8393574297188755
Recall Score = 0.6111111111111112
F1 Score = 0.7072758037225043
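
As a quick check on how the four scores relate to the confusion matrix, the sketch below recomputes them by hand from the KNN matrix printed above. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes); the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix reported for the KNN classifier above.
# Rows = actual class (0, 1), columns = predicted class (0, 1).
cm = np.array([[509, 40],
               [133, 209]])

# ravel() flattens row by row: true neg, false pos, false neg, true pos
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # fraction of all passengers classified correctly
precision = tp / (tp + fp)               # of predicted survivors, fraction who survived
recall = tp / (tp + fn)                  # of actual survivors, fraction that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The same arithmetic applies to every confusion matrix in this notebook, which makes it easy to see whether a change in F1 comes from fewer false positives or fewer false negatives.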


## Linear SVC Demonstration

In [6]:
# LinearSVC Classifier - C=1 (softer margin: a smaller C tolerates more margin violations)
C=1

clf_pipe = Pipeline([
        ("Scaler", StandardScaler()),
        ("classifier", LinearSVC(C=C, loss="hinge")), 
    ])


y_pred = cross_val_predict(clf_pipe, train_X_data, train_y_data, cv=10)

print("Confusion Matrix:")
print(confusion_matrix(train_y_data, y_pred))
print("Accuracy Score = " + str(accuracy_score(train_y_data, y_pred)))
print("Precision Score = " + str(precision_score(train_y_data, y_pred)))
print("Recall Score = " + str(recall_score(train_y_data, y_pred)))
print("F1 Score = " + str(f1_score(train_y_data, y_pred)))

Confusion Matrix:
[[468  81]
 [109 233]]
Accuracy Score = 0.7867564534231201
Precision Score = 0.7420382165605095
Recall Score = 0.6812865497076024
F1 Score = 0.7103658536585367




In [7]:
# LinearSVC Classifier - C=100 (harder margin: a larger C penalizes margin violations more heavily)
C=100

clf_pipe = Pipeline([
        ("Scaler", StandardScaler()),
        ("classifier", LinearSVC(C=C, loss="hinge")), 
    ])


y_pred = cross_val_predict(clf_pipe, train_X_data, train_y_data, cv=10)

print("Confusion Matrix:")
print(confusion_matrix(train_y_data, y_pred))
print("Accuracy Score = " + str(accuracy_score(train_y_data, y_pred)))
print("Precision Score = " + str(precision_score(train_y_data, y_pred)))
print("Recall Score = " + str(recall_score(train_y_data, y_pred)))
print("F1 Score = " + str(f1_score(train_y_data, y_pred)))

Confusion Matrix:
[[464  85]
 [112 230]]
Accuracy Score = 0.7789001122334456
Precision Score = 0.7301587301587301
Recall Score = 0.672514619883041
F1 Score = 0.700152207001522
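
The two runs above differ only in `C`. As a rough illustration of what that parameter controls, the sketch below fits a linear SVM on a small synthetic, overlapping two-class data set (not the Titanic data) and reports the margin width `2 / ||w||` for a small and a large `C`. It uses `SVC(kernel="linear")` rather than `LinearSVC`, and the data set, seed, and `C` values are assumptions chosen only for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (illustrative only, not the Titanic data).
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=2.0, random_state=0)

margins = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]                        # weight vector of the separating hyperplane
    margins[C] = 2.0 / np.linalg.norm(w)    # distance between the two margin boundaries
    print(f"C={C}: margin width={margins[C]:.3f}, "
          f"support vectors={clf.support_.size}")
```

A smaller `C` weights the regularization term more heavily, so the fitted margin is wider and more points fall inside it; a larger `C` narrows the margin in an attempt to reduce violations.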


## SVC with Linear Kernel Demo

In [8]:
# SVC Classifier 
c=1

clf_pipe = Pipeline([
        ("Scaler", StandardScaler()),
        ("classifier", SVC(kernel="linear", C=c)), 
    ])


y_pred = cross_val_predict(clf_pipe, train_X_data, train_y_data, cv=10)

print("Confusion Matrix:")
print(confusion_matrix(train_y_data, y_pred))
print("Accuracy Score = " + str(accuracy_score(train_y_data, y_pred)))
print("Precision Score = " + str(precision_score(train_y_data, y_pred)))
print("Recall Score = " + str(recall_score(train_y_data, y_pred)))
print("F1 Score = " + str(f1_score(train_y_data, y_pred)))

Confusion Matrix:
[[468  81]
 [109 233]]
Accuracy Score = 0.7867564534231201
Precision Score = 0.7420382165605095
Recall Score = 0.6812865497076024
F1 Score = 0.7103658536585367


## LinearSVC with Polynomial Features

In [9]:
# LinearSVC Classifier with Polynomial Features Added
c=1
deg=3

clf_pipe = Pipeline([
        ("Polynomial", PolynomialFeatures(degree=deg)),
        ("Scaler", StandardScaler()),
        ("classifier", LinearSVC(loss="hinge", max_iter=10000, C=c)), 
    ])


y_pred = cross_val_predict(clf_pipe, train_X_data, train_y_data, cv=10)

print("Confusion Matrix:")
print(confusion_matrix(train_y_data, y_pred))
print("Accuracy Score = " + str(accuracy_score(train_y_data, y_pred)))
print("Precision Score = " + str(precision_score(train_y_data, y_pred)))
print("Recall Score = " + str(recall_score(train_y_data, y_pred)))
print("F1 Score = " + str(f1_score(train_y_data, y_pred)))

Confusion Matrix:
[[492  57]
 [117 225]]
Accuracy Score = 0.8047138047138047
Precision Score = 0.7978723404255319
Recall Score = 0.6578947368421053
F1 Score = 0.7211538461538463
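
To get a feel for the cost of explicit polynomial expansion, the sketch below counts the columns `PolynomialFeatures` produces. It assumes the preprocessing above yields 12 input columns (4 numeric plus 3 + 2 + 3 one-hot columns for `Pclass`, `Sex`, and `Embarked`); the random matrix is a stand-in for the real data.

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

n_features = 12  # assumed column count after preprocessing (4 numeric + 8 one-hot)
X = np.random.default_rng(0).normal(size=(5, n_features))  # stand-in data

counts = {}
for degree in (1, 2, 3):
    expanded = PolynomialFeatures(degree=degree).fit_transform(X)
    counts[degree] = expanded.shape[1]
    # Monomials of total degree <= d in n variables: C(n + d, d), bias term included
    print(f"degree={degree}: {counts[degree]} columns "
          f"(formula gives {comb(n_features + degree, degree)})")
```

The count grows combinatorially with the degree, which is the usual motivation for switching to a kernel that computes the same inner products without materializing the expanded features.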




## SVC Classifier with Polynomial Kernel

In [10]:
# SVC Classifier with Polynomial Kernel
c=10
deg=2
r=100

clf_pipe = Pipeline([
        ("Scaler", StandardScaler()),
        ("classifier", SVC(kernel="poly", degree=deg, coef0=r, C=c)), 
    ])


y_pred = cross_val_predict(clf_pipe, train_X_data, train_y_data, cv=10)

print("Confusion Matrix:")
print(confusion_matrix(train_y_data, y_pred))
print("Accuracy Score = " + str(accuracy_score(train_y_data, y_pred)))
print("Precision Score = " + str(precision_score(train_y_data, y_pred)))
print("Recall Score = " + str(recall_score(train_y_data, y_pred)))
print("F1 Score = " + str(f1_score(train_y_data, y_pred)))

Confusion Matrix:
[[516  33]
 [119 223]]
Accuracy Score = 0.8294051627384961
Precision Score = 0.87109375
Recall Score = 0.652046783625731
F1 Score = 0.7458193979933111
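
For reference, the polynomial kernel used by `SVC(kernel="poly")` has the form `(gamma * <x, x'> + coef0) ** degree`. The sketch below evaluates that expression by hand on a few random points and compares it with scikit-learn's `polynomial_kernel` helper. `degree` and `coef0` mirror the cell above; `gamma=0.5` and the random points are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # four illustrative points with three features each

# degree and coef0 mirror the SVC cell above; gamma is an arbitrary illustrative value
degree, gamma, coef0 = 2, 0.5, 100

# Kernel matrix computed directly from the formula
K_manual = (gamma * (X @ X.T) + coef0) ** degree

# Same matrix from scikit-learn's pairwise helper
K_sklearn = polynomial_kernel(X, X, degree=degree, gamma=gamma, coef0=coef0)

print("kernels match:", np.allclose(K_manual, K_sklearn))
```

A larger `coef0` increases the relative weight of the lower-degree terms in the expanded polynomial, which is one way to read the `r=100` setting used above.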


## SVC with RBF Kernel

In [11]:
# SVC Classifier with Gaussian Radial Basis Function Kernel
c=0.7
gamma=.1

clf_pipe = Pipeline([
        ("Scaler", StandardScaler()),
        ("classifier", SVC(kernel="rbf", C=c, gamma=gamma)), 
    ])


y_pred = cross_val_predict(clf_pipe, train_X_data, train_y_data, cv=10)

print("Confusion Matrix:")
print(confusion_matrix(train_y_data, y_pred))
print("Accuracy Score = " + str(accuracy_score(train_y_data, y_pred)))
print("Precision Score = " + str(precision_score(train_y_data, y_pred)))
print("Recall Score = " + str(recall_score(train_y_data, y_pred)))
print("F1 Score = " + str(f1_score(train_y_data, y_pred)))

Confusion Matrix:
[[520  29]
 [126 216]]
Accuracy Score = 0.8260381593714927
Precision Score = 0.8816326530612245
Recall Score = 0.631578947368421
F1 Score = 0.735945485519591
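
The Gaussian RBF kernel scores the similarity of two points as `exp(-gamma * ||x - x'||^2)`. The sketch below computes that by hand and checks it against scikit-learn's `rbf_kernel` helper. `gamma=0.1` mirrors the cell above; the random points are illustrative stand-ins, not the Titanic features.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # four illustrative points with three features each
gamma = 0.1                  # same value as the SVC cell above

# Pairwise squared Euclidean distances via broadcasting: shape (4, 4)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

K_manual = np.exp(-gamma * sq_dists)       # kernel from the formula
K_sklearn = rbf_kernel(X, X, gamma=gamma)  # kernel from scikit-learn

print("kernels match:", np.allclose(K_manual, K_sklearn))
print("self-similarity:", np.diag(K_manual))
```

Each point has similarity 1 with itself, and similarity decays toward 0 with distance. A larger `gamma` makes that decay faster, so each training point influences a smaller neighborhood.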
