# SIT307 T1 2021
# Assignment 3 - Machine Learning Challenge
***Group 5*** - Rhys McMillan (218335964), Brenton Fleming (217603898), Neb Miletic (218489118), Sean Pain (218137385), Oliver Bennett (218143462), Muhammad Sibtain (219345654), Asim Arshad (219337467)  
  
***Data*** - Titanic: Machine Learning From Disaster (https://www.kaggle.com/c/titanic/data)

## Table of Contents

* [1. Preparation](#1)
    * [1.1 Import Relevant Libraries](#1_1)
    * [1.2 Load Data from File](#1_2)
* [2. Data Overview](#2)
    * [2.1 Data Dictionary](#2_1)
    * [2.2 Data Preparation Summary](#2_2)
        * [2.2.1 Feature Engineering](#2_2_1)
        * [2.2.2 Data Cleaning](#2_2_2)
        * [2.2.3 Dimensionality Reduction](#2_2_3)
* [3. Machine Learning Experimentation](#3)
    * [3.1 Support Vector Machine](#3_1)
    * [3.2 AdaBoost](#3_2)
    * [3.3 K-Nearest Neighbours](#3_3)
    * [3.4 Random Forest](#3_4)
    * [3.5 Linear Regression](#3_5)
    * [3.6 Logistic Regression](#3_6)
    * [3.7 Naive Bayes](#3_7)

# 1. Preparation <a class="anchor" id="1"></a>

## 1.1 Import Relevant Libraries <a class="anchor" id="1_1"></a>

In [None]:
# data analysis
import pandas as pd
import numpy as np
from scipy import stats

# visualisation
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import plot_confusion_matrix
%matplotlib inline

# evaluation metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import r2_score

# preprocessing and model selection
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# classifiers and regressors
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

## 1.2 Load Data from File <a class="anchor" id="1_2"></a>
The source data for this machine learning experimentation (titanic_train_clean.csv) has been previously cleaned and pruned during the data preparation and feature selection stages completed in assignment 2. The original source data can be found at https://www.kaggle.com/c/titanic/data.

A summary of the data preparation and updated data dictionary can be found in [Data Overview](#2) below.

In [None]:
# load titanic_train_clean.csv into a pandas data frame, using 'PassengerId' as the index
titanic_df = pd.read_csv('../input/titanic-train-clean/titanic_train_clean.csv' , index_col='PassengerId')

# Preview the data
titanic_df.head()

# 2. Data Overview <a class="anchor" id="2"></a>
A brief overview of the dataset features.
## 2.1 Data Dictionary <a class="anchor" id="2_1"></a>
The following data dictionary has been updated to reflect the cleaned dataset:
<table>
    <tr>
        <th>Variable</th>
        <th>Definition</th>
        <th>Key</th>
    </tr>
    <tr>
        <td>Survived</td>
        <td>Did the passenger survive?</td>
        <td>1 = Yes, 0 = No</td>
    </tr>
    <tr>
        <td>Pclass</td>
        <td>Ticket class</td>
        <td>1 = 1st, 2 = 2nd, 3 = 3rd</td>
    </tr>
    <tr>
        <td>sex</td>
        <td>Sex</td>
        <td>1 = Female, 0 = Male</td>
    </tr>
    <tr>
        <td>Age</td>
        <td>Age in years</td>
        <td></td>
    </tr>
    <tr>
        <td>Fare</td>
        <td>Passenger fare</td>
        <td></td>
    </tr>
    <tr>
        <td>Embarked</td>
        <td>Port of Embarkation</td>
        <td>C = Cherbourg, Q = Queenstown, S = Southampton</td>
    </tr>
    <tr>
        <td>Title</td>
        <td>Title of the passenger (extracted from name)</td>
        <td></td>
    </tr>
    <tr>
        <td>UniqueTicket</td>
        <td>Was the passenger ticket number unique?</td>
        <td>1 = Yes, 0 = No</td>
    </tr>
    <tr>
        <td>IsChild</td>
        <td>Is the passenger a child (15 years or younger)?</td>
        <td>1 = Yes, 0 = No</td>
    </tr>
</table>

## 2.2 Data Preparation Summary <a class="anchor" id="2_2"></a>
A brief summary of the data cleaning, feature engineering and feature selection performed during assignment 2.

### 2.2.1 Feature Engineering <a class="anchor" id="2_2_1"></a>
Three of the features contained in this dataset were engineered from the original dataset:
* Title
* UniqueTicket
* IsChild

***Title*** - The original data included the passenger name which contained the title, first name and last name of the passenger. As each passenger name was unique the field offered very little information gain. The title of each passenger was extracted and then normalised to a defined list.

***UniqueTicket*** - The original dataset included the ticket number of each passenger. This field contained a significant percentage of unique values and offered little information gain. A new field was calculated to represent whether the passenger's ticket is unique within the dataset or a duplicate.

***IsChild*** - Analysis of survival across different age brackets found that children had a much higher survival rate than adults. A new feature was created to represent if the passenger is a child (i.e. 15 years or younger).
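The three engineered features described above can be sketched on a few toy rows. This is only an illustration of the idea, not the assignment 2 code: the sample rows and the regular expression are assumptions, and the normalisation of titles to a defined list is not shown.

```python
import pandas as pd

# Toy rows in the shape of the original Kaggle train.csv (values are illustrative only)
raw = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
        "Palsson, Mrs. Nils",
    ],
    "Ticket": ["A/5 21171", "STON/O2. 3101282", "349909", "349909"],
    "Age": [22.0, 26.0, 2.0, 29.0],
})

# Title: the text between the comma and the first full stop in Name (assumed pattern)
raw["Title"] = raw["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# UniqueTicket: 1 if the ticket number occurs only once in the dataset, else 0
raw["UniqueTicket"] = (~raw["Ticket"].duplicated(keep=False)).astype(int)

# IsChild: 1 if the passenger is 15 years or younger, else 0
raw["IsChild"] = (raw["Age"] <= 15).astype(int)

print(raw[["Title", "UniqueTicket", "IsChild"]])
```

Using `duplicated(keep=False)` marks every row that shares a ticket number, so both passengers on a shared ticket receive `UniqueTicket = 0`.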

### 2.2.2 Data Cleaning <a class="anchor" id="2_2_2"></a>

Two of the features in this dataset required cleaning:
* Age
* Embarked

***Age*** - Missing values were imputed using multivariate linear regression based on Title and Pclass.

***Embarked*** - As only 2 of 891 values were missing, these were simply filled using the most common embarked value.
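A minimal sketch of the two cleaning steps, assuming a toy data frame and assuming that Title is one-hot encoded before the regression (the encoding used in assignment 2 is not stated here, so treat this as one possible approach):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data with missing Age and Embarked values (illustrative only)
toy = pd.DataFrame({
    "Title": ["Mr", "Mrs", "Master", "Mr", "Miss", "Master", "Mr", "Miss"],
    "Pclass": [3, 1, 3, 1, 2, 2, 2, 3],
    "Age": [22.0, 38.0, 4.0, 54.0, 27.0, np.nan, np.nan, 18.0],
    "Embarked": ["S", "C", "S", "S", None, "Q", "S", "S"],
})

# One-hot encode Title so the regression does not assume an ordering of titles
features = pd.get_dummies(toy[["Title", "Pclass"]], columns=["Title"], dtype=float)

# Fit on rows where Age is known, then predict Age for the rows where it is missing
known = toy["Age"].notna()
age_model = LinearRegression().fit(features[known], toy.loc[known, "Age"])
toy.loc[~known, "Age"] = age_model.predict(features[~known])

# Fill missing Embarked values with the most common port
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```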

### 2.2.3 Dimensionality Reduction <a class="anchor" id="2_2_3"></a>
Five features were removed from the original data set as analysis determined they offered little or no information gain:
* Name
* SibSp (# of siblings or spouses also onboard)
* Parch (# of parents or children also onboard)
* Ticket
* Cabin

# 3. Machine Learning Experimentation <a class="anchor" id="3"></a>

## 3.1 Support Vector Machine <a class="anchor" id="3_1"></a>

**Data Preparation**  
There are two issues to be addressed in our dataset before we can use the SVM classifier:  
* The SVM classifier does not accept string-valued (categorical) features. Our dataset contains two such features: Embarked and Title.  
 * Previous examination of the data (assignment 2) found that Embarked's correlation with survival derives from Pclass and Sex. Embarked will be dropped, as Pclass and Sex are retained.  
 * Title will initially be remapped to numeric values. As this implies an ordinal relationship between the values, we should compare the performance of the classifier with Title mapped against its performance with Title dropped, to ensure we are not degrading our prediction.
* As support vector machines are not scale invariant, we can improve the accuracy of our model by preprocessing the dataset with scikit-learn's StandardScaler.  
(Ref: https://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use)
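A third option, not used in this notebook, is to one-hot encode Title, which avoids implying any order between titles. The sketch below assumes the six title values used in the mapping and a toy data frame; it is an alternative for comparison, not the approach taken here.

```python
import pandas as pd

# Toy Title column containing the six normalised title values (illustrative only)
titles = pd.DataFrame({"Title": ["Mr", "Miss", "Mrs", "Master", "Officer", "Royalty", "Mr"]})

# Each title becomes its own 0/1 column, so no ordering between titles is implied
encoded = pd.get_dummies(titles, columns=["Title"], prefix="Title", dtype=int)

print(encoded)
```

The trade-off is one extra column per title value, which is small here but grows with the number of categories.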

In [None]:
# create pipeline using standardScaler and SVM classifier
svm_clf = make_pipeline(StandardScaler(), SVC())

# create X and y
svm_X = titanic_df.drop(columns=['Survived', 'Embarked'])
svm_y = titanic_df.Survived

# map title to numeric values
svm_X.Title = svm_X.Title.map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Officer': 5, 'Royalty': 6})

# split data in to training and testing subsets
svm_X_train , svm_X_test , svm_y_train, svm_y_test = train_test_split(svm_X, svm_y, random_state=1)

# fit the data to the model
svm_clf.fit(svm_X_train, svm_y_train)

# predict our test data
svm_y_pred = svm_clf.predict(svm_X_test)

**Evaluation**  
The performance of an SVM classifier is normally evaluated using the classification rate (accuracy) or error rate. We will use scikit-learn's accuracy_score to evaluate the basic performance of our model.

In [None]:
# check the accuracy of our prediction
print('Accuracy: {0}%'.format((accuracy_score(svm_y_pred, svm_y_test)*100).round(2)))

**Retest with Dropping Title**

In [None]:
# drop title from our X data sets
svm_X_train = svm_X_train.drop(columns=['Title'])
svm_X_test = svm_X_test.drop(columns=['Title'])

# fit the data to the model
svm_clf.fit(svm_X_train, svm_y_train)

# predict our test data
svm_y_pred = svm_clf.predict(svm_X_test)

# check the accuracy of our prediction
print('Accuracy: {0}%'.format((accuracy_score(svm_y_pred, svm_y_test)*100).round(2)))

Dropping Title improves the accuracy of our SVM classifier.

**Further Evaluation**  
We can further evaluate the performance of our SVM classifier using scikit-learn's classification_report to see the precision, recall and f1-score for our model.

In [None]:
# generate classification report
print(classification_report(svm_y_test, svm_y_pred, target_names=['Died', 'Survived']))
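To make the figures in the classification report concrete, the sketch below derives precision, recall and f1-score for the positive class from a confusion matrix. The label arrays are made up for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 0 = Died, 1 = Survived
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of those predicted to survive, how many did
recall = tp / (tp + fn)     # of those who survived, how many were found
f1 = 2 * precision * recall / (precision + recall)

# The manual values agree with scikit-learn's own metric functions
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```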

## 3.2 AdaBoost <a class="anchor" id="3_2"></a>

**Data Preparation**

AdaBoost does not accept string-valued (categorical) features, so we need to either remove or encode those features.
* Previous analysis of the data in assignment 2 found that Embarked's correlation with survival derives from Sex and Pclass; therefore we will drop Embarked, as Sex and Pclass are kept.
* Title will be mapped to numeric values. As with the previous classifier, we will also compare the accuracy of mapping the values against dropping the Title feature.

In [None]:
# Separate predictors and response
ada_X = titanic_df.drop(columns=['Survived', 'Embarked'])
ada_y = titanic_df.Survived

# map title to numeric values
ada_X.Title = ada_X.Title.map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Officer': 5, 'Royalty': 6})

# split data in to training and testing subsets 80% train and 20% test
ada_X_train , ada_X_test , ada_y_train, ada_y_test = train_test_split(ada_X, ada_y, test_size = 0.2, shuffle=False)

# Create AdaBoost classifier object
ada_classifier = AdaBoostClassifier(n_estimators = 100, learning_rate = 1)

# Train AdaBoost classifier
ada_model = ada_classifier.fit(ada_X_train, ada_y_train)

# Predict the response for test dataset
ada_y_pred = ada_model.predict(ada_X_test)

**Evaluation**

As in the previous classifier's evaluation, we will use scikit-learn's accuracy_score to evaluate the basic performance.


In [None]:
# Model accuracy
print('Accuracy: {0}%'.format((accuracy_score(ada_y_pred, ada_y_test)*100).round(2)))

**Retest the accuracy after dropping Title.**

In [None]:
# drop title from X values
ada_X = ada_X.drop(columns='Title')

# split data in to training and testing subsets 80% train and 20% test
ada_X_train , ada_X_test , ada_y_train, ada_y_test = train_test_split(ada_X, ada_y, test_size = 0.2, shuffle=False)

# Train AdaBoost classifier
ada_model = ada_classifier.fit(ada_X_train, ada_y_train)

# Predict the response for test dataset
ada_y_pred = ada_model.predict(ada_X_test)

# Model accuracy
print('Accuracy: {0}%'.format((accuracy_score(ada_y_pred, ada_y_test)*100).round(2)))

**Further Evaluation**

We performed further evaluation of the AdaBoost classifier using scikit-learn's classification_report, as we did for the previous classifier.

In [None]:
# generate classification report
print(classification_report(ada_y_test, ada_y_pred, target_names=['Died', 'Survived']))

## 3.3 K-Nearest Neighbours <a class="anchor" id="3_3"></a>

Prepare the features and target, map Title to numeric values, and split the data into training and testing subsets.

In [None]:
# create X and y
knn_X = titanic_df.drop(columns=['Survived', 'Embarked'])
knn_y = titanic_df.Survived

# map title to numeric values
knn_X.Title = knn_X.Title.map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Officer': 5, 'Royalty': 6})

# split data in to training and testing subsets
knn_X_train , knn_X_test , knn_y_train, knn_y_test = train_test_split(knn_X, knn_y, test_size = 0.4, random_state=42, stratify = knn_y)

We will determine the most accurate value of k (the number of neighbors) by running tests with varying values.

In [None]:
#Setup sample arrays to test most accurate K value
neighbors = np.arange(1,20)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# fit and score a classifier for each k value from 1 to 19
for i, k in enumerate(neighbors):

    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(knn_X_train, knn_y_train)
    
    #Check accuracy on the training set
    train_accuracy[i] = knn.score(knn_X_train, knn_y_train)
    
    #Check accuracy on the test set
    test_accuracy[i] = knn.score(knn_X_test, knn_y_test)

In [None]:
#Generate plot
plt.title('k-NN Varying number of neighbors')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()

From this analysis we can identify two peaks in test accuracy: the first at k = 13 and the second at k = 18. Moving forward we will use k = 13.

In [None]:
#Load k neighbors to classifier
knn = KNeighborsClassifier(n_neighbors= 13)

#Fit data to model
knn.fit(knn_X_train, knn_y_train)

#Collect predictions from classifier
knn_y_pred = knn.predict(knn_X_test)

#Print classification report
print(classification_report(knn_y_test,knn_y_pred, target_names=['Died', 'Survived']))

# Model accuracy
print('Accuracy: {0}%'.format((accuracy_score(knn_y_pred, knn_y_test)*100).round(2)))
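The value of k above was chosen by comparing accuracy on the test set, which can make the final test accuracy optimistic. A common alternative is to choose k by cross-validation on the training data only. The sketch below uses synthetic data from `make_classification` rather than the Titanic features, and it adds a `StandardScaler` step because k-NN is distance based; both are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and target
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

# Scale features, then classify; the step name is the lowercased class name
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 20))}

# 5-fold cross-validation on the training data selects k
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = search.score(X_test, y_test)  # held-out data, used once

print("Best k:", best_k)
print("Test accuracy: {0}%".format(round(test_accuracy * 100, 2)))
```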

## 3.4 Random Forest <a class="anchor" id="3_4"></a>

**3.4.1 Create train and test sets for Random Forest Classifier**

In [None]:
# create features (rf_train) and target (rf_test)
rf_train = titanic_df.drop(columns=['Survived','Embarked'])
rf_test = titanic_df.Survived

# map title to numeric values
rf_train.Title = rf_train.Title.map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Officer': 5, 'Royalty': 6})
# split data in to training and testing subsets
rf_X_train , rf_X_test , rf_y_train, rf_y_test = train_test_split(rf_train, rf_test, random_state=1)

**3.4.2 Random Forest 1: Default hyperparameters**

In [None]:
# create a pipeline using MinMaxScaler and a Random Forest classifier with default hyperparameters
rf_clf = make_pipeline(MinMaxScaler(), RandomForestClassifier())

# fit the data to the model
rf_clf.fit(rf_X_train, rf_y_train)

# predict our test data
rf_y_pred = rf_clf.predict(rf_X_test)
#Classification report
print(classification_report(rf_y_test, rf_y_pred, target_names=['Died', 'Survived']))
#Accuracy Score
print('Accuracy: {0}%'.format((accuracy_score(rf_y_pred, rf_y_test)*100).round(2)))
labels = ['Died', 'Survived']  # display_labels follow class order: 0 = Died, 1 = Survived
plot_confusion_matrix(rf_clf, rf_X_test, rf_y_test, display_labels=labels, normalize=None)
plt.show()

**3.4.3 Random Forest: With hyperparameters (trial and error)**

In [None]:
# pipeline using MinMaxScaler and a Random Forest classifier with hand-picked hyperparameters
rf_clf = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(
        criterion='gini',
        n_estimators=700,
        min_samples_split=10,
        min_samples_leaf=1,
        max_features='sqrt',  # same behaviour as the former 'auto' setting for classifiers
        oob_score=True,
        random_state=1,
        n_jobs=-1,
    ),
)
rf_clf.fit(rf_X_train, rf_y_train)
rf_y_pred = rf_clf.predict(rf_X_test)

print(classification_report(rf_y_test, rf_y_pred, target_names=['Died', 'Survived']))
print('Accuracy: {0}%'.format((accuracy_score(rf_y_pred, rf_y_test)*100).round(2)))
labels = ['Died', 'Survived']  # display_labels follow class order: 0 = Died, 1 = Survived
plot_confusion_matrix(rf_clf, rf_X_test, rf_y_test, display_labels=labels, normalize=None)
plt.show()

**3.4.4 Random Forest: With Gridsearch tuned hyperparameters**

In [None]:
'''
# ***WARNING: The gridsearch may take a few minutes to run***
rf = RandomForestClassifier(max_features='auto', oob_score=True, random_state=1, n_jobs=-1)

param_grid = { "criterion" : ["gini", "entropy"], "min_samples_leaf" : [1, 5, 10], "min_samples_split" : [2, 4, 10, 12, 16], "n_estimators": [50, 100, 400, 700, 1000]}

rf_gs = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1)

rf_gs = rf_gs.fit(rf_X_train, rf_y_train)
# print(gs.best_score_)
print(rf_gs.best_params_)
# print(gs.cv_results_)

# new rf, with gridsearch hyperparameters
rf_clf = make_pipeline(MinMaxScaler(), RandomForestClassifier(**rf_gs.best_params_))
rf_clf.fit(rf_X_train, rf_y_train)
rf_y_pred = rf_clf.predict(rf_X_test)

print(classification_report(rf_y_test, rf_y_pred, target_names=['Died', 'Survived']))
print('Accuracy: {0}%'.format((accuracy_score(rf_y_pred, rf_y_test)*100).round(2)))
labels = ['Died', 'Survived']  # display_labels follow class order: 0 = Died, 1 = Survived
plot_confusion_matrix(rf_clf, rf_X_test, rf_y_test, display_labels=labels, normalize=None)
plt.show()
'''

**3.4.5.1 Visualizing feature importance**

In [None]:
'''
rf_clf = RandomForestClassifier(**rf_gs.best_params_)
rf_clf.fit(rf_X_train, rf_y_train)
rf_y_pred = rf_clf.predict(rf_X_test)

#Visualize the important features
feature_imp = pd.Series(rf_clf.feature_importances_, index=rf_train.columns).sort_values(ascending=False)
plt.figure(figsize=(10,6))
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.tight_layout()
'''

**3.4.5.2 Drop UniqueTicket and IsChild Features**

In [None]:
# create new train and test, dropping non-Important UniqueTicket and IsChild features too.
rf_new_train = titanic_df.drop(columns=['UniqueTicket','IsChild','Survived','Embarked'])
rf_new_test = titanic_df.Survived

# map title to numeric values
rf_new_train.Title = rf_new_train.Title.map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Officer': 5, 'Royalty': 6})
# split data in to new training and testing subsets
rf_new_X_train , rf_new_X_test , rf_new_y_train, rf_new_y_test = train_test_split(rf_new_train, rf_new_test, random_state=1)

**3.4.6 Random Forest without 'non-important' features & Gridsearch Tuned Hyperparameters**

In [None]:
'''
# ***WARNING: The gridsearch may take a few minutes to run***
rf = RandomForestClassifier(max_features='auto', oob_score=True, random_state=1, n_jobs=-1)

param_grid = { "criterion" : ["gini", "entropy"], "min_samples_leaf" : [1, 5, 10], "min_samples_split" : [2, 4, 10, 12, 16], "n_estimators": [50, 100, 400, 700, 1000]}

rf_gs = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1)

rf_gs = rf_gs.fit(rf_new_X_train, rf_new_y_train)

# new rf, with gridsearch hyperparameters
rf_clf = make_pipeline(MinMaxScaler(), RandomForestClassifier(**rf_gs.best_params_))
rf_clf.fit(rf_new_X_train, rf_new_y_train)
rf_new_y_pred = rf_clf.predict(rf_new_X_test)

print(rf_gs.best_params_)
print(classification_report(rf_new_y_test, rf_new_y_pred, target_names=['Died', 'Survived']))
print('Accuracy: {0}%'.format((accuracy_score(rf_new_y_pred, rf_new_y_test)*100).round(2)))
labels = ['Died', 'Survived']  # display_labels follow class order: 0 = Died, 1 = Survived
plot_confusion_matrix(rf_clf, rf_new_X_test, rf_new_y_test, display_labels=labels, normalize=None)
plt.show()
'''

## 3.5 Linear Regression <a class="anchor" id="3_5"></a>

In [None]:
# Separate predictors and response
LR_X = titanic_df.drop(columns=['Survived', 'Embarked'])
LR_y = titanic_df.Survived

# map title to numeric values
LR_X.Title = LR_X.Title.map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Officer': 5, 'Royalty': 6})

LR_X_train, LR_X_test, LR_y_train, LR_y_test = train_test_split(LR_X, LR_y, test_size = 0.2, random_state = 0)

In [None]:
LRegression = LinearRegression()
LRegression.fit(LR_X_train, LR_y_train)
LR_y_pred = LRegression.predict(LR_X_test) 

In [None]:
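# LinearRegression.score returns the coefficient of determination (R^2), not classification accuracy
result = round(LRegression.score(LR_X_train, LR_y_train) * 100, 2)
print("R^2 on training set: {0}%".format(result))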

In [None]:
# classification_report expects class labels, but LinearRegression predicts continuous values
#print(classification_report(LR_y_test, LR_y_pred, target_names=['Died', 'Survived']))

In [None]:
# Model accuracy
#print('Accuracy: {0}%'.format((accuracy_score(LR_y_pred, LR_y_test)*100).round(2)))
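Linear regression predicts a continuous value rather than a class, which is why the classification metrics above cannot be applied to its output directly. One way to compare it with the other classifiers is to threshold the prediction. The sketch below uses synthetic data from `make_classification` and a threshold of 0.5; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and 0/1 target
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

reg = LinearRegression().fit(X_train, y_train)

# Continuous predictions, not restricted to the range [0, 1]
y_score = reg.predict(X_test)

# Convert to class labels with an assumed threshold of 0.5
y_label = (y_score >= 0.5).astype(int)

r2 = r2_score(y_test, y_score)         # what LinearRegression.score reports
acc = accuracy_score(y_test, y_label)  # comparable with the other classifiers

print("R^2 on test set: {0}".format(round(r2, 3)))
print("Accuracy on test set: {0}%".format(round(acc * 100, 2)))
```

Logistic regression, used in the next section, models the class probability directly and is usually the more natural choice for a binary target.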

## 3.6 Logistic Regression <a class="anchor" id="3_6"></a>

In [None]:
# Removing unwanted string labels - They can be used if encoded into floats, but we will ignore it for now
copy_titanic_df = titanic_df.drop(labels=['Title' , 'Embarked'] , axis=1)

# splitting response and predictor variables
LogR_y = copy_titanic_df['Survived'].values
LogR_X = copy_titanic_df.iloc[:,1:].values

# splitting data into train and test
LogR_X_train, LogR_X_test, LogR_y_train, LogR_y_test = train_test_split(LogR_X, LogR_y, random_state=21, test_size=0.2)

In [None]:
# modelling Logistic Regression
LogR_clf = LogisticRegression()
LogR_clf.fit(LogR_X_train, LogR_y_train)
LogR_y_pred = LogR_clf.predict(LogR_X_test) 
print("Accuracy on our test set: {0}%".format((LogR_clf.score(LogR_X_test, LogR_y_test)*100).round(2)))
print("Accuracy on our train set: {0}%".format((LogR_clf.score(LogR_X_train, LogR_y_train)*100).round(2)))
print("\n",classification_report(LogR_y_test, LogR_y_pred, target_names=['Died', 'Survived']))

In [None]:
# Model Evaluation using visuals
# Predicting train data
train_preds = LogR_clf.predict(LogR_X_train)
cm = confusion_matrix(LogR_y_train, train_preds)

# Displaying confusion matrix on train data
plt.figure(figsize=(6,6))
plt.title('Confusion matrix on train data')
sns.heatmap(cm, annot=True, fmt='d', cmap=plt.cm.Greens, cbar=False)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# Predicting test data
test_preds = LogR_clf.predict(LogR_X_test)
cm = confusion_matrix(LogR_y_test, test_preds)

# Displaying confusion matrix on test data
plt.figure(figsize=(6,6))
plt.title('Confusion matrix on test data')
sns.heatmap(cm, annot=True, fmt='d', cmap=plt.cm.Reds, cbar=False)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

## 3.7 Naive Bayes <a class="anchor" id="3_7"></a>

In [None]:
# Separating data into predictors and response
NB_X = titanic_df.drop(columns=['Survived', 'Embarked'])
NB_Y = titanic_df.Survived

NB_X.Title = NB_X.Title.map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Officer': 5, 'Royalty': 6})

# splitting training and test data
NB_X_train , NB_X_test , NB_Y_train, NB_Y_test = train_test_split(NB_X, NB_Y, random_state=1)

# Modelling and prediction
gaussian = GaussianNB()
gaussian.fit(NB_X_train, NB_Y_train)
NB_Y_pred = gaussian.predict(NB_X_test)

print("Accuracy: {0}%\n".format((gaussian.score(NB_X_test, NB_Y_test)*100).round(2)))
print(classification_report(NB_Y_test, NB_Y_pred, target_names=['Died', 'Survived']))