# Titanic Survivor Predictions

We need to predict whether each passenger survived. To do this, we look at the factors that may affect survival: sex, age, socio-economic class (which we may infer from ticket class), and the other columns included in the data. We first import the necessary packages:

In [1]:
import pandas as pd
from matplotlib import pyplot as plt

We then read both the training and testing CSV datasets:

In [80]:
training = pd.read_csv('./train.csv')
testing = pd.read_csv('./test.csv')

In [53]:
training.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We check the training data for missing (NaN or null) values:

In [63]:
training.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

There are 177 null values in the Age column, 687 in the Cabin column, and 2 in the Embarked column. The Cabin column may not cause us too much difficulty, but the null values in the Age column may cause issues. We will leave them as is for now. Note that a comparison against a missing Age evaluates to False, so the counts below cover only the passengers whose age is recorded. The numbers of children (under 18) and adults in the training data are:

In [59]:
num_children = len(training.loc[training.Age < 18])
num_adults = len(training.loc[training.Age >= 18])
print(num_children, num_adults)

113 601


As a first step we will determine the number of adult and child passengers that survived:

In [206]:
child_survivors = training.loc[training.Age < 18].Survived.sum()
adult_survivors = training.loc[training.Age >= 18].Survived.sum()
print(child_survivors, adult_survivors)

61 229
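
Raw counts are hard to compare because the two groups differ in size. As a small worked example, the counts printed above can be turned into survival rates (the variable names here simply mirror the cells above):

```python
# Counts copied from the two outputs above (passengers with a recorded age only).
num_children, num_adults = 113, 601
child_survivors, adult_survivors = 61, 229

# Survival rate = survivors in the group / passengers in the group.
child_rate = child_survivors / num_children
adult_rate = adult_survivors / num_adults
print(f"children: {child_rate:.1%}, adults: {adult_rate:.1%}")
# prints: children: 54.0%, adults: 38.1%
```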


Among the surviving children and adults, we can look at how many were male and how many were female:

In [61]:
male_child_survivors = training.loc[(training.Age < 18) & (training.Sex == 'male')].Survived.sum()
female_child_survivors = training.loc[(training.Age < 18) & (training.Sex == 'female')].Survived.sum()
male_adult_survivors = training.loc[(training.Age >= 18) & (training.Sex == 'male')].Survived.sum()
female_adult_survivors = training.loc[(training.Age >= 18) & (training.Sex == 'female')].Survived.sum()
print(male_child_survivors, female_child_survivors, male_adult_survivors, female_adult_survivors)

23 38 70 159
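
The four counts above could also be expressed as rates in a single table. The following is a minimal sketch using `pd.crosstab`; the helper name `survival_rate_table` and the tiny `toy` frame are made up for illustration and are not part of the notebook's data:

```python
import pandas as pd

def survival_rate_table(df, age_cutoff=18):
    """Mean of Survived per (age group, sex); rows with a missing Age are dropped."""
    known_age = df.dropna(subset=['Age'])
    # Label each passenger as 'child' (Age < cutoff) or 'adult'.
    age_group = known_age['Age'].lt(age_cutoff).map({True: 'child', False: 'adult'})
    return pd.crosstab(age_group.rename('AgeGroup'), known_age['Sex'],
                       values=known_age['Survived'], aggfunc='mean')

# Hypothetical toy data, only to show the shape of the result.
toy = pd.DataFrame({
    'Age': [5, 10, 30, 40, 25, None],
    'Sex': ['male', 'female', 'male', 'female', 'female', 'male'],
    'Survived': [1, 1, 0, 1, 0, 1],
})
rates = survival_rate_table(toy)
print(rates)
```

On the real data this would be called as `survival_rate_table(training)`.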


Females make up the majority of the survivors in both age groups (38 of 61 children and 159 of 229 adults). In particular, more than twice as many adult females survived as adult males. We now turn to the socio-economic factor, first determining which class of passenger was more likely to survive:

In [58]:
num_first = len(training[training.Pclass == 1])
num_second = len(training[training.Pclass == 2])
num_third = len(training[training.Pclass == 3])
first_survivors = training[training.Pclass == 1].Survived.sum() * 100 / num_first
second_survivors = training[training.Pclass == 2].Survived.sum() * 100 / num_second
third_survivors = training[training.Pclass == 3].Survived.sum() * 100 / num_third

print(first_survivors, num_first, training[training.Pclass == 1].Survived.sum())
print(second_survivors, num_second, training[training.Pclass == 2].Survived.sum())
print(third_survivors, num_third, training[training.Pclass == 3].Survived.sum())

62.96296296296296 216 136
47.28260869565217 184 87
24.236252545824847 491 119


A higher percentage of passengers in the higher classes survived. While the numbers of survivors across the three classes are of the same order (136 vs 87 vs 119), the difference in survival percentage is stark (63% vs 47% vs 24%). So we may conclude that passengers in a higher class were more likely to survive.
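
The per-class percentages can also be computed in one step with `groupby`. This is a sketch of an equivalent calculation rather than the code used above; the helper name `survival_by_class` and the `toy` frame are illustrative assumptions:

```python
import pandas as pd

def survival_by_class(df):
    """Survivors, passenger count and survival percentage for each Pclass."""
    summary = df.groupby('Pclass')['Survived'].agg(
        survivors='sum', passengers='count', rate='mean')
    summary['rate'] = summary['rate'] * 100  # express the rate as a percentage
    return summary

# Hypothetical toy data, not taken from train.csv.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 0, 1],
})
summary = survival_by_class(toy)
print(summary)
```

Calling `survival_by_class(training)` should reproduce the three rows printed above.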

As a first step to making predictions on who would survive we start with a K-Nearest Neighbours algorithm. We implement this with scikit-learn:

In [113]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [344]:
# Drop the rows with a missing Age; the commented-out alternative fills them with the mean instead.
clean_training_data = training.dropna(subset=['Age'])
#clean_training_data = training.fillna(value={'Age': training.Age.mean()})

features = ['Pclass', 'Sex', 'Age']
# Take an explicit copy so that encoding Sex does not raise SettingWithCopyWarning.
X = clean_training_data[features].copy()
# Encode Sex numerically: 1 for male, 0 for female.
X['Sex'] = (X['Sex'] == 'male').astype(int)
y = clean_training_data['Survived']

# No random_state is set, so the split and the score below change from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3)

knn_model = KNeighborsClassifier(5)
knn_model.fit(X_train, y_train)
predicted_survival = knn_model.predict(X_test)
# f1_score expects the true labels first, then the predictions.
f1_score(y_test, predicted_survival)


0.6625766871165644
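
K-Nearest Neighbours relies on distances between rows, so a wide-ranging feature such as Age (in years) can outweigh Pclass and the 0/1 Sex encoding. One possible refinement, not applied in this notebook, is to standardise the features and average the F1 score over several folds. The sketch below assumes synthetic data (`X_demo`, `y_demo`, and the labelling rule are invented) so that it runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three features used above (not real Titanic data).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=n),
    'Sex': rng.integers(0, 2, size=n),
    'Age': rng.uniform(1, 80, size=n),
})
# Made-up labelling rule, only so that there is something to learn.
y_demo = ((X_demo['Sex'] == 0) | (X_demo['Pclass'] == 1)).astype(int)

# The scaler is fitted inside each fold, so no test information leaks into training.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5, scoring='f1')
print(scores.mean())
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y`.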

As a second step we will implement a Support Vector Machine:

In [239]:
from sklearn.svm import SVC

In [345]:
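classifier = SVC(kernel = 'linear')
# Draw a fresh random 70/30 split; it differs from the one used for the KNN model.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3)
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
# f1_score expects the true labels first, then the predictions.
f1_score(y_test, predicted)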

0.6878980891719746
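
The two scores above come from different random splits, so part of the gap between them may be down to chance. A fairer comparison evaluates both models on the same held-out rows. The following is a minimal sketch under that assumption; the helper `compare_on_shared_split`, the `seed` parameter, and the synthetic arrays are hypothetical:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def compare_on_shared_split(X, y, seed=0):
    """Fit KNN and a linear SVM on one fixed 70/30 split and return their F1 scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.7, random_state=seed)
    models = {'knn': KNeighborsClassifier(5), 'svm': SVC(kernel='linear')}
    return {name: f1_score(y_test, model.fit(X_train, y_train).predict(X_test))
            for name, model in models.items()}

# Synthetic stand-in data (columns: Pclass, Sex, Age) with a made-up label rule.
rng = np.random.default_rng(1)
X_demo = np.column_stack([
    rng.integers(1, 4, size=200),
    rng.integers(0, 2, size=200),
    rng.uniform(1, 80, size=200),
])
y_demo = (X_demo[:, 1] == 0).astype(int)
results = compare_on_shared_split(X_demo, y_demo)
print(results)
```

With the notebook's data the call would be `compare_on_shared_split(X, y)`; repeating it over several seeds gives a sense of how much the scores vary.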