$\newcommand{\xv}{\mathbf{x}}
\newcommand{\Xv}{\mathbf{X}}
\newcommand{\yv}{\mathbf{y}}
\newcommand{\zv}{\mathbf{z}}
\newcommand{\av}{\mathbf{a}}
\newcommand{\Wv}{\mathbf{W}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\tv}{\mathbf{t}}
\newcommand{\Tv}{\mathbf{T}}
\newcommand{\muv}{\boldsymbol{\mu}}
\newcommand{\sigmav}{\boldsymbol{\sigma}}
\newcommand{\phiv}{\boldsymbol{\phi}}
\newcommand{\Phiv}{\boldsymbol{\Phi}}
\newcommand{\Sigmav}{\boldsymbol{\Sigma}}
\newcommand{\Lambdav}{\boldsymbol{\Lambda}}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

# Final Project

# Dmitry Melnikov

## Introduction

The purpose of this project is to classify Titanic passengers as survivors based on the data provided about these passengers. The goal is to train various algorithms and to use them to predict whether the passenger would survive or not.

  * This project focuses on using data analytics and machine learning algorithms to predict the survival chances of Titanic passengers. The data comes from a public source and describes each passenger's class of travel, gender, age and survival outcome.
  * To analyze this data I used publicly available classification algorithms from the Scikit-learn toolkit [scikit-learn.org] and compared their performance to find the best one for this problem.
  * This project focuses on classifying passengers into survived and not-survived classes.
  * I was interested to see how well the algorithms could predict the survival of passengers from the information about them.
  * This was an individual project.
  
  

## Methods

The data for this project is available from the Kaggle Titanic competition. __[Data Source](https://www.kaggle.com/c/titanic/data)__ I use publicly available classification algorithms from __[Scikit](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)__. I trained seven different classifiers from that toolkit and compared their performance.

The following sections combine markdown cells and code cells to make it easier to follow the development process of this project and how I arrived at the results.

The first step in the project was to inspect the data and correctly identify which fields could be omitted, which fields were important for modeling, and which alphanumeric fields had to be converted to numeric fields so that they could be used as inputs to the modeling algorithms.

In [41]:
import pandas as pd
import numpy as np
import random as rnd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

In [42]:
#load the training data set
train_df = pd.read_csv('train.csv')
train_df.sample()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
765,766,1,1,"Hogeboom, Mrs. John C (Anna Andrews)",female,51.0,1,0,13502,77.9583,D11,S


**The first step is to load the provided file 'train.csv', parse the data with pandas and review the given fields. The fields are as follows:**
* PassengerId - ID of the passengers, just a key in the table
* Survived - zero or one, shows whether the passengers survived or not
* Pclass - class of travel, one is highest, equal to first class on modern airplanes
* Name - the name of passenger, with title
* Sex - the sex of passenger
* Age - the age of passenger
* SibSp - number of siblings/spouses aboard the ship
* Parch - number of parents/children aboard the ship
* Ticket - alphanumeric value of ticket
* Fare - cost of ticket
* Cabin - cabin/room number
* Embarked - which port the passenger embarked from, C = Cherbourg, Q = Queenstown, S = Southampton

**The next step is to start inspecting the data for unnecessary information.**
My first choice was to drop the Ticket and Cabin fields from the dataset, since they do not appear to convey information that is needed for the prediction.


In [43]:
#Ticket and Cabin are alphanumeric data fields with no correlation to the survival rate. Cabin information is already captured 
#in Pclass field. Drop both fields from data set
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
train_df.sample()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
320,321,0,3,"Dennis, Mr. Samuel",male,22.0,0,0,7.25,S


Additionally, I remove the PassengerId column, since it is just a counter or a key into the database with no value to our predictions.

In [44]:
#Additionally drop PassengerId since it's just a counter
train_df = train_df.drop(['PassengerId'], axis=1)
train_df.sample()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
229,0,3,"Lefebre, Miss. Mathilde",female,,3,1,25.4667,S


Next I identify empty fields and see which values need to be populated in order to fit the requirements of the modeling software.

In [45]:
#find empty fields
train_df.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

We see that Embarked has 2 empty fields. I chose to fill them in with the most common value, S, which should have minimal effect on the modeling results. Since we are already operating on the 'Embarked' column, we can also begin converting alphanumeric values to numeric codes in order to fit the requirements of the modeling software.

In [46]:
# convert Embarked to numeric fields. Fill in empty fields first

train_df['Embarked'] = train_df['Embarked'].fillna('S')
train_df['Embarked'] = train_df['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
train_df.sample()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,0
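The integer mapping above gives the ports an ordering (S < C < Q) that they do not really have. A minimal sketch of one alternative, one-hot encoding with `pd.get_dummies`, is shown here on a small made-up frame. The `toy` frame and its values are illustrative assumptions and are not part of the original analysis. The sketch also fills missing values with the column mode instead of a hard-coded 'S'.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_df; values are made up.
toy = pd.DataFrame({'Embarked': ['S', 'C', None, 'Q', 'S']})

# Fill missing ports with the most frequent value instead of hard-coding it.
most_common = toy['Embarked'].mode()[0]
toy['Embarked'] = toy['Embarked'].fillna(most_common)

# One indicator column per port, so no artificial ordering is introduced.
encoded = pd.get_dummies(toy, columns=['Embarked'], prefix='Embarked', dtype=int)
print(encoded)
```

Whether this changes the accuracy on the Titanic data was not tested in this project; tree-based models in particular may be indifferent to the choice.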


Next we need to verify that all of the 'Embarked' fields are filled in:

In [47]:
#find empty fields
train_df.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      0
dtype: int64

Now it's time to fill in the 'Age' column. This was a concern, considering how many values are missing (177 rows) and how critical age seems to be to the modeling. I researched and experimented with multiple ways of handling this issue. One way is to guess the age of the passengers from the other data, essentially using predictive modeling for one of the variables in the predictive model. Another way is to assign the mean value to all of the unknown ages. I chose to leave blank ages as unknown and populate those cells with -1. This preserves the integrity of the original data without compromising our ability to model it with the popular algorithms that require every row to be filled in.

In [48]:
train_df['Age'] = train_df['Age'].fillna(-1)
#verify that all missing Age values were replaced
A = train_df.query('Age != Age')
A.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
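One caveat with the -1 sentinel is that distance-based and linear models will treat it as a real age. As a rough sketch of an alternative that was not used in this project, the missing ages can be replaced with the median while a separate indicator column records which rows were missing. The `toy` frame and the `AgeMissing` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages; NaN marks a missing value as in the Titanic data.
toy = pd.DataFrame({'Age': [22.0, np.nan, 38.0, np.nan, 30.0]})

# Record which rows were missing before filling them in.
toy['AgeMissing'] = toy['Age'].isnull().astype(int)

# Replace missing ages with the median of the observed ages.
median_age = toy['Age'].median()
toy['Age'] = toy['Age'].fillna(median_age)
print(toy)
```

The indicator column keeps the "unknown" information that the -1 value was meant to preserve.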


We see that all of the empty 'Age' rows have been filled in. Next is to convert 'Sex' column values from male/female to numeric representation and verify correct table format.

In [49]:
#Convert Sex field to numeric value
train_df['Sex'] = train_df['Sex'].map( {'male': 0, 'female': 1} ).astype(int)

In [50]:
#verify correct table format
train_df.sample()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
735,0,3,"Williams, Mr. Leslie",0,28.5,0,0,16.1,0


The 'Fare' column can be reduced to bins, one for each quartile increment of the Fare value. This is convenient because the fare corresponds to the class of travel, so it makes sense to simplify it for our modeling purposes. We create a new column 'TicketPrice' and delete the 'Fare' column. We are also going to remove the 'Name' column. I experimented with parsing out the titles and processing them as a separate variable, but it made no change to the results. Also, the titles in this data set are mostly 'Mr' or 'Miss/Mrs/Ms', which correspond to the 'Sex' values we already have. The other values are one-off titles like 'Captn', which would not be helpful once the data set is split between training and testing sets. So I decided to remove that field from the modeling and focus on the other parameters.

In [51]:
#Separate ticket costs into 5 bins for easier computation: one bin for zero fares
#plus four bins at roughly quartile increments

bins = (-1, 0, 8, 15, 31, 1000)
groups = ['0', '1', '2', '3', '4']

train_df['TicketPrice'] = pd.cut(train_df['Fare'], bins, labels=groups)
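The bin edges above were chosen by hand. As a sketch of an alternative, `pd.qcut` can compute the quartile edges from the data itself. The `fares` series below is made-up toy data; in the notebook the input would be `train_df['Fare']`.

```python
import pandas as pd

# Hypothetical toy fares; the real column would be train_df['Fare'].
fares = pd.Series([0.0, 7.25, 7.925, 8.05, 13.0, 16.1, 26.55, 77.9583])

# Let pandas compute the quartile edges; labels=False returns integer bin codes.
codes, edges = pd.qcut(fares, q=4, labels=False, retbins=True)
print(edges)
print(codes.tolist())
```

With `labels=False` the result is already numeric, so no string labels need to be converted before modeling.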

In [52]:
train_df.sample()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,TicketPrice
90,0,3,"Christmann, Mr. Emil",0,29.0,0,0,8.05,0,2


In [53]:
# Drop Name and Fare column.

train_df = train_df.drop(['Name', 'Fare'], axis=1)
train_df.sample()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Embarked,TicketPrice
517,0,3,0,-1.0,0,0,2,3


**The data set is now cleaned up, formatted and ready to be used in our modeling tools**
The first step is to split the provided data into training and testing sets so that we can evaluate the prediction accuracy of our algorithms. This is done with the provided utility function 'train_test_split', which takes the X and Y matrices and randomly divides their rows using its default 75/25 train/test split.


In [54]:
# Create testing and training X and Y matrices

x = train_df.drop(['Survived'], axis=1)
y = train_df['Survived']

X_train, X_test, y_train, y_test = train_test_split(x, y)
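Because the split above is random, the accuracies change from run to run. A minimal sketch of a reproducible variant follows, using the `random_state` and `stratify` parameters of `train_test_split`. The toy data and the seed value 42 are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 rows, 2 features, 40% positive labels.
rng = np.random.default_rng(0)
x_toy = pd.DataFrame(rng.normal(size=(40, 2)), columns=['f1', 'f2'])
y_toy = pd.Series([1] * 16 + [0] * 24, name='Survived')

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

With a fixed seed, rerunning the notebook gives the same rows in each set, which makes the accuracy numbers comparable between runs.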

**Classifier algorithm selection and explanations**

The first set of algorithms used for this project is the SVM classifiers:

**SVM Classifier Summary:**

Support Vector Machines (SVMs) are supervised learning methods used for classification, regression and outlier detection. Their advantages are that they are effective in high-dimensional spaces and remain effective when the number of dimensions is greater than the number of samples. They are also memory efficient, because the decision function uses only a subset of the training points (the support vectors). Different kernel functions can be specified for the decision function, including custom kernels.

SVMs also have disadvantages. In particular, they do not provide probability estimates directly; in Scikit-learn these are calculated using an expensive five-fold cross-validation.

The first SVM classifier is the standard linear classifier. The classifier algorithms are taken from the Scikit-learn toolkit and their predictions are evaluated on accuracy. The 'accuracy_score' function provided with Scikit-learn compares the predicted Y values to the actual Y values (survived/not survived in this case). The accuracy percentage is stored in the 'performance' dictionary in order to compare the classification algorithms.


In [55]:
#SVM Classifier, linear

svclassifier = SVC(kernel='linear')  
svclassifier.fit(X_train, y_train)  
y_pred = svclassifier.predict(X_test)  
svm_liner=accuracy_score(y_test, y_pred)
performance={}
performance['SVM liner']=round(svm_liner*100,2)
svm_liner

0.7802690582959642

Next we run the other types of SVM classifiers, with a Gaussian kernel and with a sigmoid kernel. The code is very similar; the only change is the kernel parameter passed to the SVC constructor. In both cases the resulting accuracy score is again stored in the 'performance' dictionary.

In [56]:
#SVM Classifier, Gaussian 

svclassifier = SVC(kernel='rbf')  
svclassifier.fit(X_train, y_train)  
y_pred = svclassifier.predict(X_test)  
svm_gaus=accuracy_score(y_test, y_pred)
performance['SVM Gauss']=round(svm_gaus*100,2)
svm_gaus

0.8026905829596412

In [57]:
#SVM Classifier, Sigmoid 

svclassifier = SVC(kernel='sigmoid')  
svclassifier.fit(X_train, y_train)  
y_pred = svclassifier.predict(X_test)  
svm_sig=accuracy_score(y_test, y_pred)
performance['SVM Sigmoid']=round(svm_sig*100,2)
svm_sig

0.47533632286995514
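SVMs with Gaussian or sigmoid kernels are generally sensitive to the scale of the input features, and the features here range from 0/1 flags to ages in the tens. Whether scaling would improve the sigmoid result on this data was not tested in this project. The sketch below only shows how a `StandardScaler` could be combined with `SVC` in a pipeline; the toy data is made up for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: one informative small-scale feature and one
# uninformative large-scale feature (loosely like Sex next to Age).
rng = np.random.default_rng(0)
labels = np.array([0, 1] * 50)
features = np.column_stack([labels + rng.normal(0, 0.1, 100),
                            rng.normal(30, 15, 100)])

# The scaler is fitted inside the pipeline, so it only sees training data.
model = make_pipeline(StandardScaler(), SVC(kernel='sigmoid'))
model.fit(features[:80], labels[:80])
print(model.score(features[80:], labels[80:]))
```

The pipeline object can be used anywhere a plain classifier is used, so it would drop into the cells above in place of `svclassifier`.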


**k-Neighbor Classifier "nearest neighbor" Summary:**

Nearest-neighbor classification is a type of instance-based, or non-generalizing, learning: it stores instances of the training data and does not attempt to build a general internal model. Much like an election, classification is computed from a majority vote of the nearest neighbors of each point. A query point is assigned the class that has the most representatives among its nearest neighbors.
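The voting idea can be sketched by hand in a few lines. This is only an illustration of the principle, not how Scikit-learn implements it; the helper name `knn_predict` and the toy points are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point.
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k closest points.
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbors wins the vote.
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy points: two small clusters with labels 0 and 1.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(points, labels, np.array([0.5, 0.5])))  # prints 0
print(knn_predict(points, labels, np.array([5.5, 5.5])))  # prints 1
```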


In [58]:
#k-Neighbor Classifier

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
knn_acc=accuracy_score(y_test, y_pred)
performance['k-Neighbor']=round(knn_acc*100,2)
knn_acc

0.7533632286995515

**Gaussian Naive Bayes Summary:**

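This classifier implements the Gaussian Naive Bayes algorithm. Naive Bayes classifiers are a family of “probabilistic classifiers” that apply Bayes' theorem, which describes the probability of an event based on prior knowledge of related conditions, under the “naive” assumption that the features are independent given the class. The Gaussian variant assumes that the values of each feature within a class follow a normal distribution.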


In [59]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
y_pred = gaussian.predict(X_test)
gaus_acc=accuracy_score(y_test, y_pred)
performance['Gaussian Naive Bayes']=round(gaus_acc*100,2)
gaus_acc

0.7443946188340808

**Decision Tree Summary:**

Decision trees are also part of the family of supervised learning algorithms and can be used for both regression and classification problems. The algorithm creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. These trees can be easily visualized and understood in simple terms. [scikit-learn.org] Their disadvantage is that they can fail to generalize well in certain cases. [dataaspirant.com]


In [60]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)
tree_acc=accuracy_score(y_test, y_pred)
performance['Decision Tree']=round(tree_acc*100,2)
tree_acc

0.7443946188340808
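To illustrate how the learned rules of a tree can be inspected, here is a small sketch using `export_text` from `sklearn.tree`. The toy frame is made up so that 'Sex' alone determines the label, and `max_depth=2` is an arbitrary choice to keep the output short; neither reflects the fitted model above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data in which Sex alone determines the label.
toy = pd.DataFrame({'Sex': [0, 0, 1, 1, 0, 1],
                    'Pclass': [3, 1, 3, 1, 2, 2]})
target = [0, 0, 1, 1, 0, 1]

# A shallow tree keeps the printed rules short and easier to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(toy, target)

# Render the decision rules as indented plain text.
rules = export_text(tree, feature_names=list(toy.columns))
print(rules)
```

Limiting the depth is also one common way to reduce the overfitting that hurts generalization, although its effect on this dataset was not measured here.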

**Random Forest Summary:**

A random forest extends the decision tree classifier. The algorithm builds an estimator that fits multiple decision trees on various sub-samples of the dataset and uses averaging to improve accuracy. It is a type of ensemble learning method. This is a fairly recent technique, with the earliest example of it dating to 1995. [Wikipedia]

In [61]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=200)
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
rand_forest_acc=accuracy_score(y_test, y_pred)
performance['Random Forest']=round(rand_forest_acc*100,2)
rand_forest_acc

0.7892376681614349
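A fitted random forest also exposes `feature_importances_`, which gives a rough view of which inputs the trees relied on. The sketch below uses made-up data with one informative column ('signal') and one random column ('noise'); these names are hypothetical and the importances for the Titanic features were not computed in this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: 'signal' determines the label, 'noise' is random.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'signal': rng.integers(0, 2, 200),
                    'noise': rng.normal(size=200)})
target = toy['signal']

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(toy, target)

# Importances sum to 1; larger values mean the feature reduced impurity more.
importances = pd.Series(forest.feature_importances_, index=toy.columns)
print(importances.sort_values(ascending=False))
```

Applied to `random_forest` and `X_train.columns`, the same two lines would rank class of travel, age, gender and the other fields.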

## Results

I was hoping to find a correlation between passengers' information and their predicted survival in the Titanic disaster. I was especially interested in how likely passengers were to survive based on their class of travel, age and gender, and in which classification algorithm is best suited to evaluate this data.

The 'performance' dictionary built above summarizes the accuracy of the selected algorithms on this dataset:

In [62]:
performance

{'SVM liner': 78.03,
 'SVM Gauss': 80.27,
 'SVM Sigmoid': 47.53,
 'k-Neighbor': 75.34,
 'Gaussian Naive Bayes': 74.44,
 'Decision Tree': 74.44,
 'Random Forest': 78.92}

Because the train/test split and some of the modeling algorithms are randomized, the results vary slightly every time this notebook is run. However, some results are clear regardless of these factors. The worst performing algorithm is consistently the SVM with the sigmoid kernel, with accuracy around 50% (47.53% in the run shown above). All the other algorithms score above 70 percent, and sometimes up to 80 percent, in predicting survival. Considering how weakly related the fields of this dataset seemed to me, this level of accuracy surprised me. It was an important step in my understanding of data processing as it relates to neural networks and classifiers in particular.
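One way to reduce the run-to-run variation described above is k-fold cross-validation, which averages the accuracy over several splits. This is a sketch only and was not part of the original evaluation; the toy data is made up, and in the notebook the inputs would be `x` and `y`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; in the notebook this would be x and y.
rng = np.random.default_rng(0)
labels = np.array([0, 1] * 50)
features = np.column_stack([labels + rng.normal(0, 0.5, 100),
                            rng.normal(size=100)])

models = {'Gaussian Naive Bayes': GaussianNB(),
          'Decision Tree': DecisionTreeClassifier(random_state=0)}

# Five-fold cross-validation: every row is used for testing exactly once.
summary = {}
for name, model in models.items():
    scores = cross_val_score(model, features, labels, cv=5, scoring='accuracy')
    summary[name] = (scores.mean(), scores.std())
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Looping over a dictionary of models also avoids repeating the same fit/predict/score cell for each classifier.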

## Conclusion

This was an individual project and I was able to fully complete it by the deadline. The timeline was as follows:

* Nov 13 - initial set-up complete
* Nov 19 - neural networks trained 
* Nov 27 - data is tested and analyzed
* Dec 9 - final clean-up and submission of project

Changes that had to be made to the timeline were all due to the up-front work with the data. I spent three extra days "grooming" the dataset. I was surprised by how much preparation the data required in order to be modeled correctly. This was the most difficult part of the project for me. Decisions had to be made about how to handle empty 'Age' values. Learning about the various ways to handle this situation helped my understanding of data processing.
During this project I learned about multiple classification algorithms; some of them were covered in class and some of them were new to me. However, I feel that the CS440 course gave me a good understanding of the underlying principles and conceptual background of neural networks and data modeling, which allowed me to quickly understand the new algorithms. The implementation part was very easy thanks to the package I chose to use for this project.

I would like to thank Dr Anderson for his CS440 lectures, which helped me prepare for this project.


## References

1. “Supervised Learning.” Scikit-Learn 0.19.2 Documentation, https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
2. “How Decision Tree Algorithm Works.” Dataaspirant, 21 Apr. 2017, dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/.
3. “1.10. Decision Trees.” Scikit-Learn 0.19.2 Documentation, scikit-learn.org/stable/modules/tree.html#classification.
4. “Random Forest.” Wikipedia, 8 Dec. 2018, https://en.wikipedia.org/wiki/Random_forest


In [63]:
import io
from IPython.nbformat import current
import glob
nbfile = glob.glob('Melnikov-Project.ipynb')
if len(nbfile) > 1:
    print('More than one ipynb file. Using the first one.  nbfile=', nbfile)
with io.open(nbfile[0], 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')
word_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print('Word count for file', nbfile[0], 'is', word_count)

Word count for file Melnikov-Project.ipynb is 1872
