# Titanic Survivor Predictions

The goal is to predict whether each passenger survived. To do this we look at the factors that may affect survival: gender, age, socio-economic class (which we may infer from ticket class) and the other fields included in the data. We first import the necessary packages:

In [1]:
import pandas as pd
from matplotlib import pyplot as plt

We then read both the training and testing csv datasets:

In [2]:
training = pd.read_csv('./train.csv')
testing = pd.read_csv('./test.csv')

In [3]:
training.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We check the training data for missing (NaN or null) values:

In [4]:
training.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

There are 177 null values in the Age column, 687 in the Cabin column, and 2 in the Embarked column. The Cabin column should not cause us much difficulty, since we do not use it as a feature below. However, the null values in the Age column may cause issues, because age is one of the factors we want to examine. We will leave them as is for now. Among the passengers with a recorded age, the numbers of children (under 18) and adults in the training data are:

In [5]:
num_children = len(training.loc[training.Age < 18])
num_adults = len(training.loc[training.Age >= 18])

print('Total number of children:', num_children)
print('Total number of adults:', num_adults)

Total number of children: 113
Total number of adults: 601
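
The two counts above sum to 714, so the 177 passengers with a missing Age fall into neither group: a comparison against NaN evaluates to False. A minimal sketch of this behaviour on a made-up six-row frame (the values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the six ages are missing.
toy = pd.DataFrame({"Age": [5.0, 17.0, 18.0, 40.0, np.nan, np.nan]})

is_child = toy["Age"] < 18     # NaN < 18 evaluates to False
is_adult = toy["Age"] >= 18    # NaN >= 18 also evaluates to False
is_unknown = toy["Age"].isna()

n_children = int(is_child.sum())
n_adults = int(is_adult.sum())
n_unknown = int(is_unknown.sum())

print("children:", n_children, "adults:", n_adults, "unknown age:", n_unknown)
```

Counting the unknown group explicitly makes it easier to see how many rows are silently dropped by the age filters.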


Next we determine how many of the child and adult passengers survived:

In [6]:
child_survivors = training.loc[training.Age < 18].Survived.sum()
adult_survivors = training.loc[training.Age >= 18].Survived.sum()

print('Child Survivors:', child_survivors)
print('Adult Survivors:', adult_survivors)

Child Survivors: 61
Adult Survivors: 229


We can break these survivors down further by sex:

In [7]:
male_child_survivors = training.loc[(training.Age < 18) & (training.Sex == 'male')].Survived.sum()
female_child_survivors = training.loc[(training.Age < 18) & (training.Sex == 'female')].Survived.sum()
male_adult_survivors = training.loc[(training.Age >= 18) & (training.Sex == 'male')].Survived.sum()
female_adult_survivors = training.loc[(training.Age >= 18) & (training.Sex == 'female')].Survived.sum()

print('Male Children survivors:', male_child_survivors)
print('Female Children survivors:', female_child_survivors)
print('Male Adults survivors:', male_adult_survivors)
print('Female Adults survivors:', female_adult_survivors)

Male Children survivors: 23
Female Children survivors: 38
Male Adults survivors: 70
Female Adults survivors: 159
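
The four filters above can also be expressed as a single `groupby`, which gives group sizes and survival rates alongside the raw counts. The sketch below uses a small made-up frame (not the Titanic data) and a hypothetical `AgeGroup` helper column; it assumes no ages are missing, since `np.where` would otherwise label NaN ages as "adult":

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "Age": [4, 10, 30, 45, 22, 60, 8, 35],
    "Sex": ["male", "female", "male", "female", "male", "female", "female", "male"],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Helper column (an assumption of this sketch): under 18 counts as a child.
toy["AgeGroup"] = np.where(toy["Age"] < 18, "child", "adult")

# One row per (AgeGroup, Sex): survivor count, group size and survival rate.
summary = toy.groupby(["AgeGroup", "Sex"])["Survived"].agg(
    survivors="sum", total="count", rate="mean"
)
print(summary)
```

Reporting the rate next to the count helps avoid reading a large group's survivor count as a high chance of survival.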


Females make up the majority of survivors in both age groups. In particular, more than twice as many adult females survived as adult males (159 vs 70). We now turn to the socio-economic factor, starting with the percentage of each passenger class that survived:

In [8]:
num_first = len(training[training.Pclass == 1])
num_second = len(training[training.Pclass == 2])
num_third = len(training[training.Pclass == 3])
# Despite the variable names, these hold the percentage of each class that survived.
first_survivors = training[training.Pclass == 1].Survived.sum() * 100 / num_first
second_survivors = training[training.Pclass == 2].Survived.sum() * 100 / num_second
third_survivors = training[training.Pclass == 3].Survived.sum() * 100 / num_third

print('First class survivors:', first_survivors)
print('Second class survivors:', second_survivors)
print('Third class survivors:', third_survivors)

First class survivors: 62.96296296296296
Second class survivors: 47.28260869565217
Third class survivors: 24.236252545824847


A higher percentage of passengers in the higher classes survived. Although the survivor counts are of similar magnitude across the three classes (136 vs 87 vs 119), the survival rates differ starkly (63% vs 47% vs 24%). Ticket class therefore appears to be strongly associated with survival.
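
To make the gap between counts and rates concrete, the arithmetic can be checked by hand. The class sizes used below (216, 184 and 491) are not printed in the notebook; they are back-computed here from the survivor counts and percentages quoted above, so treat them as derived values:

```python
# Survivor counts per class, as quoted in the text.
survivors = {1: 136, 2: 87, 3: 119}

# Class sizes back-computed from the counts and percentages (an assumption
# of this sketch; they are not shown in the notebook output).
class_sizes = {1: 216, 2: 184, 3: 491}

rates = {c: 100 * survivors[c] / class_sizes[c] for c in survivors}

for c, rate in rates.items():
    print(f"Class {c}: {survivors[c]} of {class_sizes[c]} survived ({rate:.1f}%)")
```

Third class has more than twice as many passengers as either of the other classes, which is why a similar survivor count corresponds to a much lower rate.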

As a first step towards predicting who survived, we start with a K-Nearest Neighbours (KNN) classifier, implemented with scikit-learn:

In [54]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score

We can now clean the data and split it into a training and a validation set:

In [158]:
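# Fill missing ages with the mean age. Assign the result back instead of calling
# fillna(inplace=True) on a column, which may not modify the DataFrame in recent pandas.
training['Age'] = training['Age'].fillna(training['Age'].mean())
# Encode sex numerically, restricted to the Sex column.
training['Sex'] = training['Sex'].map({'male': 1, 'female': 0})

reduced_features = training[['Pclass', 'Sex', 'Age', 'SibSp']]
targets = training['Survived']

# 70/30 split. No random_state is set, so the split (and the scores below) vary between runs.
X_train, X_validation, y_train, y_validation = train_test_split(reduced_features, targets, train_size=0.7, test_size=0.3)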
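
Because the split is random, rerunning the cell gives different validation scores. One possible variation, sketched here on synthetic data rather than the Titanic frame, fixes `random_state` for reproducibility and passes `stratify` so both subsets keep the same proportion of survivors. The seed value and the toy columns are assumptions of this sketch:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 40% positive labels.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=100),
    "Age": rng.uniform(1, 80, size=100),
})
y = pd.Series([0] * 60 + [1] * 40, name="Survived")

# random_state makes the split repeatable; stratify keeps the class balance.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print("train size:", len(X_tr), "validation size:", len(X_val))
print("positive rate (train):", y_tr.mean(), "positive rate (validation):", y_val.mean())
```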

We now fit a KNN model with 5 neighbours and score it on the validation set:

In [159]:
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
predicted_knn_survival = knn_model.predict(X_validation)
# scikit-learn metrics take (y_true, y_pred) in that order.
knn_f1_score = f1_score(y_validation, predicted_knn_survival)
knn_accuracy_score = accuracy_score(y_validation, predicted_knn_survival)

print('F1 score:', knn_f1_score)
print('Accuracy score:', knn_accuracy_score)

F1 score: 0.6524064171122994
Accuracy score: 0.7574626865671642
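
KNN relies on distances between feature vectors, so a feature with a wide numeric range (Age, in years) can outweigh features with a narrow range (Sex as 0/1, Pclass as 1 to 3). The notebook does not scale its features; the sketch below shows one common option, standardising inside a scikit-learn pipeline, on synthetic data where the label depends only on the 0/1 column. The data and variable names are assumptions made for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: a binary column and a wide-range column.
rng = np.random.default_rng(0)
n = 200
sex = rng.integers(0, 2, size=n)
age = rng.uniform(1, 80, size=n)
X = np.column_stack([sex, age])
y = (sex == 0).astype(int)  # label depends only on the binary column

X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

# Plain KNN: distances are computed on the raw, unscaled features.
plain = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Scaled KNN: each feature is standardised before distances are computed.
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled.fit(X_train, y_train)

plain_acc = plain.score(X_val, y_val)
scaled_acc = scaled.score(X_val, y_val)
print("unscaled accuracy:", plain_acc)
print("scaled accuracy:", scaled_acc)
```

Putting the scaler in the pipeline means it is fitted on the training rows only, so no information from the validation set leaks into the preprocessing.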


As a second step we fit a Support Vector Machine with a linear kernel:

In [160]:
from sklearn.svm import SVC

In [161]:
svc_model = SVC(kernel = 'linear', C = 1)
svc_model.fit(X_train, y_train)
predicted_svc_survival = svc_model.predict(X_validation)
# scikit-learn metrics take (y_true, y_pred) in that order.
svc_f1_score = f1_score(y_validation, predicted_svc_survival)
svc_accuracy_score = accuracy_score(y_validation, predicted_svc_survival)

print('F1 score:', svc_f1_score)
print('Accuracy score:', svc_accuracy_score)

F1 score: 0.7087378640776699
Accuracy score: 0.7761194029850746
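
The KNN and SVM scores above come from a single random 70/30 split, so a difference of a few points between them may partly reflect which rows landed in the validation set. A hedged sketch of how k-fold cross-validation could give a steadier comparison, using `cross_val_score` on synthetic data from `make_classification` (the data, fold count and seed are assumptions, not part of the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the four features.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svc": SVC(kernel="linear", C=1),
}

# Five-fold cross-validated F1 score for each model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean F1 = {scores.mean():.3f} (std {scores.std():.3f})")
```

The standard deviation across folds gives a rough sense of how much of the gap between two models is noise.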


As a final step we will implement a simple decision tree:

In [162]:
from sklearn.tree import DecisionTreeClassifier

In [163]:
decision_model = DecisionTreeClassifier(max_depth = 4)
decision_model.fit(X_train, y_train)
predicted_decision_survival = decision_model.predict(X_validation)
# scikit-learn metrics take (y_true, y_pred) in that order.
decision_f1_score = f1_score(y_validation, predicted_decision_survival)
decision_accuracy_score = accuracy_score(y_validation, predicted_decision_survival)

print('F1 score:', decision_f1_score)
print('Accuracy score:', decision_accuracy_score)

F1 score: 0.7623762376237623
Accuracy score: 0.8208955223880597
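
One advantage of a shallow decision tree is that its rules can be read directly. The sketch below fits a depth-4 tree on synthetic data and prints its rules with `export_text` along with the feature importances. The feature names are borrowed from the notebook purely as labels; the data itself is synthetic, so the printed rules say nothing about the Titanic:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Labels borrowed from the notebook; the values below are synthetic.
feature_names = ["Pclass", "Sex", "Age", "SibSp"]
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Human-readable if/else rules learned by the tree.
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Relative contribution of each feature to the tree's splits.
importances = dict(zip(feature_names, tree.feature_importances_))
print(importances)
```

Applied to the fitted `decision_model`, the same two calls would show which of the four features the tree actually splits on.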
