# SIT307 T1 2021
# Assignment 3 - Machine Learning Challenge
***Group 5*** - Rhys McMillan (218335964), Brenton Fleming (217603898), Neb Miletic (218489118), Sean Pain (218137385), Oliver Bennett (218143462), Muhammad Sibtain (219345654), Asim Arshad (219337467)  
  
***Data*** - Titanic: Machine Learning From Disaster (https://www.kaggle.com/c/titanic/data)

## Table of Contents

* [1. Preparation](#1)
    * [1.1 Import Relevant Libraries](#1_1)
    * [1.2 Load Data from File](#1_2)
* [2. Data Overview](#2)
    * [2.1 Data Dictionary](#2_1)
    * [2.2 Data Preparation Summary](#2_2)
        * [2.2.1 Feature Engineering](#2_2_1)
        * [2.2.2 Data Cleaning](#2_2_2)
        * [2.2.3 Dimensionality Reduction](#2_2_3)
* [3. Machine Learning Experimentation](#3)
    * [3.1 Support Vector Machine](#3_1)
    * [3.2 Random Forest](#3_2)
    * [3.3 Classifier 3](#3_3)
    * [3.4 Classifier 4](#3_4)
    * [3.5 Classifier 5](#3_5)
    * [3.6 Classifier 6](#3_6)
    * [3.7 Classifier 7](#3_7)

# 1. Preparation <a class="anchor" id="1"></a>

## 1.1 Import Relevant Libraries <a class="anchor" id="1_1"></a>

In [None]:
# data analysis
import pandas as pd
import numpy as np
from scipy import stats

# visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## 1.2 Load Data from File <a class="anchor" id="1_2"></a>
The source data for this machine learning experimentation (titanic_train_clean.csv) has been previously cleaned and pruned during the data preparation and feature selection stages completed in assignment 2. The original source data can be found at https://www.kaggle.com/c/titanic/data.

A summary of the data preparation and updated data dictionary can be found in [Data Overview](#2) below.

In [None]:
# load titanic_train_clean.csv into a pandas DataFrame, using 'PassengerId' as the index
titanic_df = pd.read_csv('../input/titanic-train-clean/titanic_train_clean.csv', index_col='PassengerId')

# Preview the data
titanic_df.head()

# 2. Data Overview <a class="anchor" id="2"></a>
A brief overview of the dataset features.
## 2.1 Data Dictionary <a class="anchor" id="2_1"></a>
The following data dictionary has been updated to reflect the cleaned dataset:
<table>
    <tr>
        <th>Variable</th>
        <th>Definition</th>
        <th>Key</th>
    </tr>
    <tr>
        <td>Survived</td>
        <td>Did the passenger survive?</td>
        <td>1 = Yes, 0 = No</td>
    </tr>
    <tr>
        <td>Pclass</td>
        <td>Ticket class</td>
        <td>1 = 1st, 2 = 2nd, 3 = 3rd</td>
    </tr>
    <tr>
        <td>Sex</td>
        <td>Sex</td>
        <td>1 = Female, 0 = Male</td>
    </tr>
    <tr>
        <td>Age</td>
        <td>Age in years</td>
        <td></td>
    </tr>
    <tr>
        <td>Fare</td>
        <td>Passenger fare</td>
        <td></td>
    </tr>
    <tr>
        <td>Embarked</td>
        <td>Port of Embarkation</td>
        <td>C = Cherbourg, Q = Queenstown, S = Southampton</td>
    </tr>
    <tr>
        <td>Title</td>
        <td>Title of the passenger (extracted from name)</td>
        <td></td>
    </tr>
    <tr>
        <td>UniqueTicket</td>
        <td>Was the passenger ticket number unique?</td>
        <td>1 = Yes, 0 = No</td>
    </tr>
    <tr>
        <td>IsChild</td>
        <td>Is the passenger a child (15 years or younger)?</td>
        <td>1 = Yes, 0 = No</td>
    </tr>
</table>

## 2.2 Data Preparation Summary <a class="anchor" id="2_2"></a>
A brief summary of the data cleaning, feature engineering and feature selection performed during assignment 2.

### 2.2.1 Feature Engineering <a class="anchor" id="2_2_1"></a>
Three of the features contained in this dataset were engineered from the original dataset:
* Title
* UniqueTicket
* IsChild

***Title*** - The original data included the passenger name which contained the title, first name and last name of the passenger. As each passenger name was unique the field offered very little information gain. The title of each passenger was extracted and then normalised to a defined list.

***UniqueTicket*** - The original dataset included the ticket number of each passenger. This field contained a significant percentage of unique values and offered little information gain. A new field was calculated to represent whether the passenger's ticket is unique within the dataset or a duplicate.

***IsChild*** - Analysis of survival across different age brackets found that children had a much higher survival rate than adults. A new feature was created to represent if the passenger is a child (i.e. 15 years or younger).
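
The three engineered features described above can be sketched as follows. This is a minimal illustration on toy rows, not the assignment 2 code: the regular expression used to extract the title is an assumption, and the normalisation of rare titles to a defined list is omitted because the mapping is not given here.

```python
import pandas as pd

# Toy rows in the shape of the original Kaggle train.csv (illustrative values only)
raw = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Heikkinen, Miss. Laina',
             'Palsson, Master. Gosta Leonard',
             'Palsson, Mrs. Nils (Alma Cornelia Berglund)'],
    'Ticket': ['A/5 21171', 'STON/O2. 3101282', '349909', '349909'],
    'Age': [22.0, 26.0, 2.0, 29.0],
})

# Title: the text between the comma and the first full stop in Name (assumed pattern)
raw['Title'] = raw['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# UniqueTicket: 1 if the ticket number appears only once in the dataset, else 0
raw['UniqueTicket'] = (~raw['Ticket'].duplicated(keep=False)).astype(int)

# IsChild: 1 if the passenger is 15 years or younger, else 0
raw['IsChild'] = (raw['Age'] <= 15).astype(int)

print(raw[['Title', 'UniqueTicket', 'IsChild']])
```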

### 2.2.2 Data Cleaning <a class="anchor" id="2_2_2"></a>

Two of the features in this dataset required cleaning:
* Age
* Embarked

***Age*** - Missing values were imputed using multivariate linear regression based on Title and Pclass.

***Embarked*** - As only 2 of 891 values were missing, these were simply filled using the most common embarked value.
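
A minimal sketch of both cleaning steps is shown below, using toy data. The one-hot encoding of Title and the use of scikit-learn's `LinearRegression` are assumptions made for illustration; the exact approach taken in assignment 2 may differ.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (illustrative values only); None marks missing entries
df = pd.DataFrame({
    'Title': ['Mr', 'Mrs', 'Miss', 'Master', 'Mr', 'Miss', 'Mr', 'Master'],
    'Pclass': [3, 1, 3, 3, 1, 2, 2, 2],
    'Age': [22.0, 38.0, 26.0, 2.0, 54.0, None, None, 4.0],
    'Embarked': ['S', 'C', 'S', 'S', 'S', None, 'Q', 'S'],
})

# One-hot encode Title so the regression can use it alongside Pclass
features = pd.get_dummies(df[['Title', 'Pclass']], columns=['Title'], dtype=float)

# Fit on rows where Age is known, then predict the rows where it is missing
known = df['Age'].notna()
age_model = LinearRegression().fit(features[known], df.loc[known, 'Age'])
df.loc[~known, 'Age'] = age_model.predict(features[~known])

# Fill missing Embarked values with the most common port
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

print(df)
```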

### 2.2.3 Dimensionality Reduction <a class="anchor" id="2_2_3"></a>
Five features were removed from the original data set as analysis determined they offered little or no information gain:
* Name
* SibSp (# of siblings or spouses also onboard)
* Parch (# of parents or children also onboard)
* Ticket
* Cabin

# 3. Machine Learning Experimentation <a class="anchor" id="3"></a>

## 3.1 Support Vector Machine <a class="anchor" id="3_1"></a>

**Data Preparation**  
There are two issues to address in our dataset before we can use the SVM classifier:  
* The SVM classifier does not accept string-valued (categorical) input features. Our dataset contains two such features: Embarked and Title.  
    * Previous examination of the data (assignment 2) found Embarked's correlation with survival to be a by-product of Pclass and Sex. Embarked will be dropped, as Pclass and Sex are retained.  
    * Title will initially be remapped to numeric values. As this implies an ordinal relationship between the values, we should compare the classifier's performance with the mapped feature against its performance with the feature dropped, to ensure we are not degrading our prediction.
* As support vector machines are not scale invariant, we can improve the accuracy of our model by preprocessing the dataset with scikit-learn's StandardScaler.  
(Ref: https://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use)
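
To illustrate the concern about implied order, the sketch below contrasts the ordinal mapping with one-hot encoding on a toy frame. One-hot encoding is a third option that this notebook does not test; it is shown only to make the difference concrete, and the name `X_demo` is hypothetical.

```python
import pandas as pd

# Toy feature frame (illustrative values only)
X_demo = pd.DataFrame({
    'Pclass': [3, 1, 3, 2],
    'Title': ['Mr', 'Mrs', 'Miss', 'Master'],
})

# Ordinal mapping: compact, but implies Mr < Miss < Mrs < Master
ordinal = X_demo.assign(
    Title=X_demo['Title'].map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4})
)

# One-hot encoding: one 0/1 column per title, with no implied order
one_hot = pd.get_dummies(X_demo, columns=['Title'], dtype=int)

print(ordinal)
print(one_hot)
```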

In [None]:
# import libraries
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# create pipeline using standardScaler and SVM classifier
svm_clf = make_pipeline(StandardScaler(), SVC())

# create X and y
X = titanic_df.drop(columns=['Survived', 'Embarked'])
y = titanic_df.Survived

# map title to numeric values
X.Title = X.Title.map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Officer': 5, 'Royalty': 6})

# split data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the data to the model
svm_clf.fit(X_train, y_train)

# predict our test data
y_pred = svm_clf.predict(X_test)

**Evaluation**  
The performance of an SVM classifier is normally evaluated using the classification rate or error rate. We will use scikit-learn's accuracy_score to evaluate the basic performance of our model.
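
As a small worked example of these two measures, the sketch below computes them by hand on made-up labels and checks the result against `accuracy_score`. The arrays `y_true_demo` and `y_pred_demo` are illustrative only and are not taken from the Titanic data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative labels only (1 = survived, 0 = died)
y_true_demo = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred_demo = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# Classification rate: fraction of predictions that match the true label
accuracy = np.mean(y_true_demo == y_pred_demo)

# Error rate: fraction of predictions that do not match
error_rate = 1 - accuracy

# accuracy_score computes the same classification rate
assert accuracy == accuracy_score(y_true_demo, y_pred_demo)
print(accuracy, error_rate)  # 0.75 0.25
```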

In [None]:
# import libraries
from sklearn.metrics import accuracy_score

# check the accuracy of our prediction (accuracy_score expects y_true first, then y_pred)
print('Accuracy: {0}%'.format((accuracy_score(y_test, y_pred)*100).round(2)))

**Retest with Dropping Title**

In [None]:
# drop title from our X data sets
X_train = X_train.drop(columns=['Title'])
X_test = X_test.drop(columns=['Title'])

# fit the data to the model
svm_clf.fit(X_train, y_train)

# predict our test data
y_pred = svm_clf.predict(X_test)

# check the accuracy of our prediction (accuracy_score expects y_true first, then y_pred)
print('Accuracy: {0}%'.format((accuracy_score(y_test, y_pred)*100).round(2)))

Dropping Title improves the accuracy of our SVM classifier.
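
A comparison based on a single train/test split can be sensitive to which rows land in the test subset. One way to check such a comparison, sketched here under assumptions, is k-fold cross-validation with `cross_val_score`. The data below is synthetic and the column names `f0` to `f4` are hypothetical; this is not the Titanic result.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; column names f0..f4 are hypothetical
X_arr, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(5)])

clf = make_pipeline(StandardScaler(), SVC())

# Accuracy over 5 folds, with and without one candidate feature
with_feature = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')
without_feature = cross_val_score(clf, X_demo.drop(columns=['f4']), y_demo,
                                  cv=5, scoring='accuracy')

print('with f4:    {:.3f} +/- {:.3f}'.format(with_feature.mean(), with_feature.std()))
print('without f4: {:.3f} +/- {:.3f}'.format(without_feature.mean(), without_feature.std()))
```

If the difference between the two means is small relative to the spread across folds, the effect of dropping the feature is probably not distinguishable from noise.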

**Further Evaluation**  
We can further evaluate the performance of our SVM classifier using scikit-learn's classification_report to see the precision, recall and f1-score for our model.
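
For reference, the sketch below derives precision, recall and f1-score by hand for the positive class, using made-up labels, and compares them with scikit-learn's own functions. The arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative labels only (1 = survived, 0 = died)
y_true_demo = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred_demo = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# Counts for the positive class (survived)
tp = np.sum((y_pred_demo == 1) & (y_true_demo == 1))  # true positives
fp = np.sum((y_pred_demo == 1) & (y_true_demo == 0))  # false positives
fn = np.sum((y_pred_demo == 0) & (y_true_demo == 1))  # false negatives

precision = tp / (tp + fp)   # of those predicted to survive, how many did
recall = tp / (tp + fn)      # of those who survived, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The hand-computed values agree with scikit-learn
assert np.isclose(precision, precision_score(y_true_demo, y_pred_demo))
assert np.isclose(recall, recall_score(y_true_demo, y_pred_demo))
assert np.isclose(f1, f1_score(y_true_demo, y_pred_demo))
```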

In [None]:
# import libraries
from sklearn.metrics import classification_report

# generate classification report
print(classification_report(y_test, y_pred, target_names=['Died', 'Survived']))

## 3.2 Random Forest <a class="anchor" id="3_2"></a>

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# create X and y, dropping the target and Embarked (as for the SVM)
XO = titanic_df.drop(columns=['Survived', 'Embarked'])
yo = titanic_df.Survived

# map title to numeric values
XO.Title = XO.Title.map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Officer': 5, 'Royalty': 6})

# split data into training and testing subsets
XO_train, XO_test, yo_train, yo_test = train_test_split(XO, yo, random_state=1)


In [None]:
# view data
# yo_train
# XO_train
# XO_test
# yo_test

In [None]:
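# random forest with 700 trees; oob_score=True also computes an out-of-bag accuracy estimate
random_forest = RandomForestClassifier(criterion='gini',
    n_estimators=700,
    min_samples_split=10,
    min_samples_leaf=1,
    max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
    oob_score=True,
    random_state=1,
    n_jobs=-1)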


In [None]:
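# fit the model and predict the test data
random_forest.fit(XO_train, yo_train)
yo_pred = random_forest.predict(XO_test)

# accuracy on the training data (%); this is optimistic, as the model has already seen these rows
result = round(random_forest.score(XO_train, yo_train) * 100, 2)
result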
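
The forest above is created with `oob_score=True`, which makes an out-of-bag accuracy estimate available as `oob_score_` after fitting. The sketch below shows how training, out-of-bag and test accuracy can be read side by side. It uses synthetic data and a smaller forest, so the numbers it prints say nothing about the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

forest = RandomForestClassifier(n_estimators=100, min_samples_split=10,
                                oob_score=True, random_state=1)
forest.fit(X_tr, y_tr)

train_acc = forest.score(X_tr, y_tr)  # accuracy on rows the forest was fitted on
oob_acc = forest.oob_score_           # each row scored only by trees that did not sample it
test_acc = forest.score(X_te, y_te)   # accuracy on held-out rows

print('train: {:.3f}  oob: {:.3f}  test: {:.3f}'.format(train_acc, oob_acc, test_acc))
```

The out-of-bag estimate is usually closer to the test accuracy than the training accuracy is, which makes it a useful check when no separate validation subset is available.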

In [None]:
print(classification_report(yo_test, yo_pred, target_names=['Died', 'Survived']))

In [None]:
# check the accuracy of our prediction (accuracy_score expects y_true first, then y_pred)
print('Accuracy: {0}%'.format((accuracy_score(yo_test, yo_pred)*100).round(2)))

In [None]:
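# plot the confusion matrix for the held-out test data
# (plot_confusion_matrix was removed in scikit-learn 1.2)
from sklearn.metrics import ConfusionMatrixDisplay
labels = ['Died', 'Survived']  # display order follows the class order: 0, then 1
ConfusionMatrixDisplay.from_estimator(random_forest, XO_test, yo_test, display_labels=labels, normalize=None)
plt.show()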


## 3.3 <Classifier 3> <a class="anchor" id="3_3"></a>

## 3.4 <Classifier 4> <a class="anchor" id="3_4"></a>

## 3.5 <Classifier 5> <a class="anchor" id="3_5"></a>

## 3.6 <Classifier 6> <a class="anchor" id="3_6"></a>

## 3.7 <Classifier 7> <a class="anchor" id="3_7"></a>