# Predicting Titanic Survivors
Like the Titanic, this is my maiden voyage, when it comes to Kaggle contests that is! I've completed the Data Science track on DataCamp, but I'm a relative newbie when it comes to machine learning. I'm going to attempt to work my way through the Titanic: Machine Learning contest. My aim is to submit an initial entry as quickly as possible to get a baseline score, and then attempt to improve on it by first looking at missing data, then engineering key features, before establishing a second baseline and trying to improve the model itself.

Please feel free to post comments or make suggestions as to what I may be doing wrong or could do better, and consider upvoting if you find the notebook useful!

# Import the Libraries and Data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

df_train=pd.read_csv('../input/train.csv',sep=',')
df_test=pd.read_csv('../input/test.csv',sep=',')

PassengerId = df_test['PassengerId']
Submission=pd.DataFrame()
Submission['PassengerId'] = df_test['PassengerId']

# Stage 1 : Explore the Data and create a basic model on raw data

# Explore the data Statistically

In [None]:
# How big are the training and test datasets
print(df_train.shape)
print("----------------------------")
print(df_test.shape)

In [None]:
# What are the column names 
print(df_train.columns)

In [None]:
# What type of data object are in each column and how many missing values are there
print(df_train.info())
print("----------------------------")
print(df_test.info())

In [None]:
#check for any other unusable values
print(pd.isnull(df_train).sum())
print("----------------------------")
print(pd.isnull(df_test).sum())

## Observations on missing data.

There are 177 missing ages in the training data and 86 missing ages in the test data. Age is an important feature, so it is worth spending time to address this properly.

There are 687 missing Cabin entries in the training data and 327 in the test data. At this stage I'm not sure how important this feature is, so I'm going to revisit it when I know more about it.
There are 2 missing Embarked data points in the training data and 1 missing Fare in the test data. At this stage this does not represent a problem.
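
To put counts like these in context, it can help to look at them as a share of each column. The following is a minimal sketch on a small made-up frame; the `toy` DataFrame and the `missing_summary` helper are illustrative names and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for df_train; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Show the columns with the most gaps first.
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling the same helper on `df_train` and `df_test` would show at a glance which columns are mostly empty and which only have a handful of gaps.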

In [None]:
# Get a statistical overview of the data
print(df_train.describe())
print("----------------------------")
print(df_test.describe())

In [None]:
# Take a look at some sample data
print(df_train.head(10))
print(df_train.tail(10))

# Explore Data Graphically

# Pairplots

To get a very basic idea of the relationships between the different features we can use pairplots from seaborn.

In [None]:
# `height` replaces the older `size` argument in seaborn 0.9 and later
g = sns.pairplot(df_train[['Survived', 'Pclass', 'Sex', 'Age', 'Parch', 'Fare', 'Embarked']], hue='Survived', palette='seismic', height=4, diag_kind='kde', diag_kws=dict(shade=True), plot_kws=dict(s=50))
g.set(xticklabels=[])

# Create simple model

Create a baseline score by using only the standard numeric data on a very basic model. This will be used to see how much any changes we make to the data or model improve performance.

In [None]:
NUMERIC_COLUMNS=['Pclass','Age','SibSp','Parch','Fare']
OTHER_COLUMNS=['Sex', 'Embarked','Name','Ticket','Cabin']

# create test and training data
data_to_train = df_train[NUMERIC_COLUMNS].fillna(-1000)
y=df_train['Survived']
X=data_to_train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=21, stratify=y)

clf = SVC()
# clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train, y_train)

# Check the Accuracy of model on test data


In [None]:
# Print the accuracy

print("Accuracy: {}".format(clf.score(X_test, y_test)))

# Create initial predictions

In [None]:
test = df_test[NUMERIC_COLUMNS].fillna(-1000)

Submission['Survived']=clf.predict(test)
print(Submission.head(5))

# Make first Submission

In [None]:
# write data frame to csv file
Submission.set_index('PassengerId', inplace=True)
Submission.to_csv('myfirstsubmission.csv',sep=',')

The result of this first submission was a score of 0.57894. This is barely better than random: if I'd simply flipped a fair coin for each passenger, I could have achieved a similar kind of score. So there is plenty of room for improvement.
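
For context, two trivial yardsticks are worth keeping in mind when reading a score like this: a fair coin, which is expected to be right about half the time, and always predicting the most common class. The sketch below computes the second one on made-up labels; the `y_true` array is illustrative only and is not the competition data.

```python
import numpy as np

# Made-up labels (1 = survived, 0 = did not); not the real competition data.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 1, 0, 0])

# Always predict the most common class.
majority_class = np.bincount(y_true).argmax()
majority_accuracy = np.mean(y_true == majority_class)

# A fair coin is right half the time on average, whatever the class balance.
coin_expected_accuracy = 0.5

print(majority_class, majority_accuracy, coin_expected_accuracy)
```

Running the same two lines on `df_train['Survived']` would give the majority-class accuracy for the training labels, which is a natural floor for any model.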

# Stage 2 : Clean Data & Engineer features to improve results

Here I am going to go with the principle that those who were in the lifeboats were more likely to survive, and from history we know that women and children were given priority for the lifeboat places. So I am going to try to engineer features to help the model find the women and children. I'm also going to use these features to further explore the data statistically and visually, to see if there are additional patterns that might explain anomalies to this. From the initial visualisations it appeared that class might also play a major role in whether certain groups of passengers survived or not.

# Feature Reduction

The Ticket and Cabin columns are not used as features in the rest of this notebook, so we can drop them.

In [None]:
# Feature reduction
drop_elements = ['Ticket','Cabin']
df_train = df_train.drop(drop_elements, axis = 1)
df_test = df_test.drop(drop_elements, axis = 1)

# Filling in the blanks

## Estimate missing Fare Data

In [None]:
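# Fill the NA values in Fare with the column median.
# fillna returns a new Series, so assign the result back; Series.median() skips NaN.
df_train["Fare"] = df_train["Fare"].fillna(df_train["Fare"].median())
df_test["Fare"] = df_test["Fare"].fillna(df_test["Fare"].median())

# Create a new variable called log_fare, because the Fare distribution is VERY skewed.
# log1p computes log(1 + Fare), which stays finite for any Fare of 0.
df_train["log_fare"] = np.log1p(df_train["Fare"])
df_test["log_fare"] = np.log1p(df_test["Fare"])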
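
Two details are easy to trip over when filling and log-transforming a column like Fare: `fillna` returns a new Series rather than changing the original, and the logarithm of 0 is not finite. A minimal sketch on made-up fares (the `fares` Series is illustrative, and using `np.log1p` is one possible choice, not the only one):

```python
import numpy as np
import pandas as pd

# Made-up fares, including a missing value and a zero fare.
fares = pd.Series([7.25, np.nan, 0.0, 71.28], name="Fare")

# fillna returns a new Series, so the result has to be assigned to a name.
# Series.median() skips NaN by default.
filled = fares.fillna(fares.median())

# log1p computes log(1 + x), so a fare of 0 maps to 0 rather than -inf.
log_fare = np.log1p(filled)

print(filled.tolist())
print(log_fare.round(3).tolist())
```

Note that `fares` itself still contains its missing value afterwards; only `filled` is complete.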

## Estimate missing Age Data

In [None]:
#Fill the missing Age values

# Age 
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,4))
axis1.set_title('Original Age values - Titanic')
axis2.set_title('New Age values - Titanic')

# get average, std, and number of NaN values in df_train
average_age_titanic   = df_train["Age"].mean()
std_age_titanic       = df_train["Age"].std()
count_nan_age_titanic = df_train["Age"].isnull().sum()

# get average, std, and number of NaN values in df_test
average_age_test   = df_test["Age"].mean()
std_age_test       = df_test["Age"].std()
count_nan_age_test = df_test["Age"].isnull().sum()

# generate random integers between (mean - std) & (mean + std)
# the bounds are cast to int, since np.random.randint works with integer limits
rand_1 = np.random.randint(int(average_age_titanic - std_age_titanic), int(average_age_titanic + std_age_titanic), size = count_nan_age_titanic)
rand_2 = np.random.randint(int(average_age_test - std_age_test), int(average_age_test + std_age_test), size = count_nan_age_test)

# plot original Age values
# NOTE: drop all null values, and convert to int
df_train['Age'].dropna().astype(int).hist(bins=70, ax=axis1)
# test_df['Age'].dropna().astype(int).hist(bins=70, ax=axis1)

# fill NaN values in Age column with random values generated
# use .loc rather than chained indexing so the assignment reliably updates the frame
df_train.loc[df_train["Age"].isnull(), "Age"] = rand_1
df_test.loc[df_test["Age"].isnull(), "Age"] = rand_2

# convert from float to int
df_train['Age'] = df_train['Age'].astype(int)
df_test['Age']    = df_test['Age'].astype(int)
        
# plot new Age Values
df_train['Age'].hist(bins=70, ax=axis2)
#df_test['Age'].hist(bins=70, ax=axis4)

# peaks for survived/not survived passengers by their age
facet = sns.FacetGrid(df_train, hue="Survived",palette = 'seismic',aspect=4)
facet.map(sns.kdeplot,'Age',shade= True)
facet.set(xlim=(0, df_train['Age'].max()))
facet.add_legend()
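
Because the fill values are random, each run of the notebook produces slightly different ages. The sketch below shows the same idea with a fixed seed so that results are repeatable. It is an illustration on made-up ages, and the seeded `np.random.default_rng` generator is an assumption of this sketch rather than what the cell above uses.

```python
import numpy as np
import pandas as pd

# Made-up ages with gaps; not the real passenger data.
ages = pd.DataFrame({"Age": [22.0, np.nan, 35.0, 54.0, np.nan, 2.0]})

rng = np.random.default_rng(seed=0)  # fixed seed so reruns give the same fill values

mean, std = ages["Age"].mean(), ages["Age"].std()
missing = ages["Age"].isnull()

# Draw one integer per missing row from [mean - std, mean + std).
low, high = int(mean - std), int(mean + std)
ages.loc[missing, "Age"] = rng.integers(low, high, size=missing.sum())

ages["Age"] = ages["Age"].astype(int)
print(ages)
```

The boolean mask `missing` is computed once, before the fill, so it can also be reused afterwards to inspect which rows received an estimated age.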

# Feature Engineering
This is based on info from 'introduction-to-ensembling-stacking'.

Convert male/female categories to columns for training data.
Once we have category data, the next stage is to make each category into a column. To do this we use the pandas function get_dummies, with the argument prefix_sep='_' setting the separator used when naming the new columns.

>Example : df.Sex.astype('category')

##  Gender Feature

In [None]:
# convert categories to Columns
dummies=pd.get_dummies(df_train[['Sex']], prefix_sep='_') #Gender
df_train = pd.concat([df_train, dummies], axis=1) 
testdummies=pd.get_dummies(df_test[['Sex']], prefix_sep='_') #Gender
df_test = pd.concat([df_test, testdummies], axis=1) 
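
One thing to watch when encoding the training and test data separately is that a category missing from one of them produces a different set of columns. A minimal sketch of one way to keep the two in step, using made-up frames (`toy_train` and `toy_test` are illustrative names):

```python
import pandas as pd

# Made-up frames; the test frame happens to lack one category.
toy_train = pd.DataFrame({"Sex": ["male", "female", "female"]})
toy_test = pd.DataFrame({"Sex": ["male", "male"]})

train_dummies = pd.get_dummies(toy_train[["Sex"]], prefix_sep="_")
test_dummies = pd.get_dummies(toy_test[["Sex"]], prefix_sep="_")

# Give the test dummies the same columns as the training dummies,
# filling any category that never appears in the test data with 0.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(test_dummies)
```

For Sex both categories appear in both files, so this is not an issue there, but the same pattern applies to rarer categories such as titles.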

## Title Feature

In [None]:
#Get titles
# raw strings keep the backslash in \. from being treated as a string escape
df_train["Title"] = df_train.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
df_test["Title"] = df_test.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

#Unify common titles. 
df_train["Title"] = df_train["Title"].replace('Mlle', 'Miss')
df_test["Title"] = df_test["Title"].replace('Mlle', 'Miss')
# 'Master' is kept as its own title, so no replacement is needed for it
df_train["Title"] = df_train["Title"].replace(['Mme', 'Dona', 'Ms'], 'Mrs')
df_test["Title"] = df_test["Title"].replace(['Mme', 'Dona', 'Ms'], 'Mrs')
df_train["Title"] = df_train["Title"].replace(['Jonkheer','Don'],'Mr')
df_test["Title"] = df_test["Title"].replace(['Jonkheer','Don'],'Mr')
df_train["Title"] = df_train["Title"].replace(['Capt','Major', 'Col','Rev','Dr'], 'Services')
df_test["Title"] = df_test["Title"].replace(['Capt','Major', 'Col','Rev','Dr'], 'Services')
df_train["Title"] = df_train["Title"].replace(['Lady', 'Countess','Sir'], 'Titled')
df_test["Title"] = df_test["Title"].replace(['Lady', 'Countess','Sir'], 'Titled')

# convert Title categories to Columns
titledummies=pd.get_dummies(df_train[['Title']], prefix_sep='_') #Title
df_train = pd.concat([df_train, titledummies], axis=1) 
ttitledummies=pd.get_dummies(df_test[['Title']], prefix_sep='_') #Title
df_test = pd.concat([df_test, ttitledummies], axis=1) 

In [None]:
df_train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
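
To see what the title pattern picks out, here is a minimal sketch on a few invented names written in the same "Surname, Title. Given names" layout as the Name column. The names are hypothetical and are not taken from the data set.

```python
import pandas as pd

# Invented names in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Smith, Mr. John",
    "Jones, Miss. Anna",
    "Martin, Mlle. Claire",
])

# Capture the run of letters that follows a space and ends with a full stop.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Fold the French form into its English equivalent, as in the cell above.
titles = titles.replace("Mlle", "Miss")

print(titles.tolist())
```

The pattern relies on the title being the first word that ends in a full stop, which holds for this layout.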

## Embarked Feature

In [None]:
#map each Embarked value to a numerical value
embarked_mapping = {"S": 1, "C": 2, "Q": 3}
df_train['Embarked'] = df_train['Embarked'].map(embarked_mapping)
df_test['Embarked'] = df_test['Embarked'].map(embarked_mapping)

In [None]:
df_train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean()

## Fare feature

In [None]:
#fill in any Fare value still missing in the test set with the mean training fare for that Pclass
class_mean_fare = df_train.groupby("Pclass")["Fare"].mean().round(4)
missing_fare = df_test["Fare"].isnull()
# .loc avoids chained indexing, so the assignment reliably updates df_test
df_test.loc[missing_fare, "Fare"] = df_test.loc[missing_fare, "Pclass"].map(class_mean_fare)
        
#map Fare values into groups of numerical values
df_train['FareBand'] = pd.qcut(df_train['Fare'], 4, labels = [1, 2, 3, 4])
df_test['FareBand'] = pd.qcut(df_test['Fare'], 4, labels = [1, 2, 3, 4])

#drop Fare values
df_train = df_train.drop(['Fare'], axis = 1)
df_test = df_test.drop(['Fare'], axis = 1)

In [None]:
df_train[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean()
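
Calling `pd.qcut` separately on the training and test data gives each its own quartile edges, so band 2 in one file need not cover the same fares as band 2 in the other. One alternative, sketched below on made-up fares, is to learn the edges from the training fares and reuse them for the test fares. This is an illustration of the idea, not what the cell above does; `train_fare` and `test_fare` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up fares; not the real data.
train_fare = pd.Series([5.0, 7.0, 8.0, 10.0, 15.0, 26.0, 40.0, 80.0])
test_fare = pd.Series([6.0, 9.0, 30.0, 500.0])

# Learn the quartile edges on the training fares only.
train_band, edges = pd.qcut(train_fare, 4, labels=[1, 2, 3, 4], retbins=True)

# Widen the outer edges so unseen test fares still fall into a band.
edges[0], edges[-1] = -np.inf, np.inf

# Apply the same edges to the test fares.
test_band = pd.cut(test_fare, bins=edges, labels=[1, 2, 3, 4])

print(train_band.tolist())
print(test_band.tolist())
```

With shared edges, a given band label means the same fare range in both frames.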

# Visualize the new features

In [None]:
grid = sns.FacetGrid(df_train, col = "Sex", row = "Pclass", hue = "Survived", palette = 'seismic')
grid = grid.map(plt.scatter, "Age", "log_fare")
grid.add_legend()
grid

In [None]:
# `height` replaces the older `size` argument in seaborn 0.9 and later
facet = sns.FacetGrid(data = df_train, hue = "Title", legend_out=True, height = 5)
facet = facet.map(sns.kdeplot, "Age")
facet.add_legend();

# Re-train the model on new features

In [None]:
# Re-evaluate factoring in gender of passenger

NUMERIC_COLUMNS=['Pclass','Age','SibSp','Parch','Sex_female','Sex_male','Title_Master', 'Title_Miss',
       'Title_Mr', 'Title_Mrs', 'Title_Services','Embarked']

# create test and training data
data_to_train = df_train[NUMERIC_COLUMNS].fillna(-1000)
y=df_train['Survived']
X=data_to_train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=21, stratify=y)

clf = SVC()
# clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train, y_train)

# Re-evaluate the model on new features

In [None]:
# Print the accuracy
print("Accuracy: {}".format(clf.score(X_test, y_test)))

# Reforecast predictions based on new features

In [None]:
test = df_test[NUMERIC_COLUMNS].fillna(-1000)
Submission['Survived']=clf.predict(test)

# Make revised submission

In [None]:
# write data frame to csv file
#Submission.set_index('PassengerId', inplace=True)
Submission.to_csv('revisedsubmission.csv',sep=',')

The second revised submission scored 0.75598, an improvement on the first revision, which scored 0.64593 and was itself an improvement on the original score of 0.57894. This advanced the submission to 9117th place on the leaderboard, from the starting point of 10599th place! Obviously a step in the right direction, but it still needs work.

# Stage 3 : Test Different Models and parameters

## Split data into test and training

In [None]:
from sklearn.model_selection import train_test_split
NUMERIC_COLUMNS=['Pclass','Age','SibSp','Parch','Sex_female','Sex_male','Title_Master', 'Title_Miss',
       'Title_Mr', 'Title_Mrs', 'Title_Services','Embarked']

# create test and training data
predictors = df_train.drop(['Survived', 'PassengerId'], axis=1)
data_to_train = df_train[NUMERIC_COLUMNS].fillna(-1000)
target = df_train["Survived"]
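# split on `target`, defined just above, so this cell does not rely on `y` from an earlier cell
x_train, x_val, y_train, y_val = train_test_split(data_to_train, target, test_size = 0.3, random_state = 0)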

# create test and training data
#data_to_train = df_train[NUMERIC_COLUMNS].fillna(-1000)
#y=df_train['Survived']
#X=data_to_train
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=21, stratify=y)




## Support Vector Classification

SVC is scikit-learn's C-support vector classifier, which uses an RBF kernel by default. Its fit time scales at least quadratically with the number of samples, which is not a concern for a data set of this size.
Multiclass support is handled according to a one-vs-one scheme, although that does not come into play for a binary target like Survived.

In [None]:
clf = SVC()
# fit on x_train from the split above, so the rows line up with y_train
clf.fit(x_train, y_train)
y_pred = clf.predict(x_val)
acc_clf = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_clf)

## Naive Bayes

In [None]:
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB

gaussian = GaussianNB()
gaussian.fit(x_train, y_train)
y_pred = gaussian.predict(x_val)
acc_gaussian = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_gaussian)

## Decision Tree

In [None]:
#Decision Tree
from sklearn.tree import DecisionTreeClassifier

decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train, y_train)
y_pred = decisiontree.predict(x_val)
acc_decisiontree = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_decisiontree)

## Random Forest

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

randomforest = RandomForestClassifier()
randomforest.fit(x_train, y_train)
y_pred = randomforest.predict(x_val)
acc_randomforest = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_randomforest)
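
The four cells above repeat the same fit, predict, and score steps. A minimal sketch of the same comparison written as a loop is shown below. It runs on synthetic data from `make_classification` rather than the Titanic features, and the names `X_demo`, `y_demo`, `models`, and `scores` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered Titanic features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
x_tr, x_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

models = {
    "SVC": SVC(),
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# Fit every model on the same training rows and score it on the same held-out rows.
scores = {}
for name, model in models.items():
    model.fit(x_tr, y_tr)
    scores[name] = round(accuracy_score(y_va, model.predict(x_va)) * 100, 2)

# Print the models from best to worst validation accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score}")
```

Keeping every model on one shared split makes the accuracies directly comparable, and fixing `random_state` on the tree-based models makes the ranking repeatable between runs.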

# Reforecast predictions based on best performing model

In [None]:
test = df_test[NUMERIC_COLUMNS].fillna(-1000)

Submission['Survived']=decisiontree.predict(test)
print(Submission.head(5))

# Make final submission

In [None]:
# write data frame to csv file
#Submission.set_index('PassengerId', inplace=True)
#Submission.to_csv('finalsubmission.csv',sep=',')
Submission.to_csv('decisiontreesubmission.csv',sep=',')

# Credit where credit's due

This competition is predominantly a training exercise, and as such I have tried to look at different approaches and try different techniques to see how they work. I have looked at some of the existing entries and adopted some of the techniques that I found interesting. So firstly, a huge thanks to everyone who took the time to document their code and explain step by step what they did and why.

To name names, some of the notebooks that I found most useful and think deserve special mentions are:

### Anisotropic
https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python/notebook

Introduction to Ensembling/Stacking in Python is a very useful project on many levels. In particular, I borrowed the pairplot idea for visualising the data.

### Henrique Mello 
https://www.kaggle.com/hrmello/introduction-to-data-exploration-using-seaborn/notebook

This was very helpful in getting title data while feature engineering. I also used some of its code to visualise the new features using FacetGrid from seaborn.

### Omar El Gabry
https://www.kaggle.com/omarelgabry/a-journey-through-titanic?scriptVersionId=447802/notebook

This kernel has an interesting section on estimating the missing ages and calculating Pearson coefficients for the features.

### Nadin Tamer
https://www.kaggle.com/nadintamer/titanic-survival-predictions-beginner/notebook

I found this to be another really useful kernel. It takes very much a step-by-step approach, with a particularly good section on different types of model and how they perform for this project.

# Summary

In this project we have explored the Titanic data set. We have identified missing data and filled it in as best we could, converted categorical data to columns of numeric features that we can use in machine learning, and engineered new features based on the data we had. We improved our score from a baseline of 0.57894 to a score of 0.75598.

We looked at a range of different models and compared the accuracy of each model on a held-out portion of the training data to decide which model to use. We then produced predictions from the best performing models, which we submitted to check that our models were not overfitting.

We certainly didn't come anywhere near winning this contest, but we survived our first Kaggle competition, and hopefully we had fun and learnt a lot along the way by looking at what other people were doing and trying different techniques.