# Problem Definition

Given a labelled dataset of features describing passengers aboard the Titanic and whether or not they survived, can we train a classification model that accurately predicts the survival of passengers in an unlabelled dataset?

## Methodology

1. First, we will inspect the data and determine which features are important to include in our models.
2. We will then engineer the selected features so that they are in a form convenient for modelling.
3. Next, we will fit and evaluate a number of models.
4. Finally, we will compare the accuracy of the various models and select the best-performing model for submission.

# Data Inspection

In this section we will look carefully at the data and derive various statistical measures for each feature. The aim is to understand the data and try to select the features that have the strongest correlation with survival.

From the data description on Kaggle, we know that each row represents a single passenger and that the features are defined as follows:

- Survived is whether or not the person survived (0 = No, 1 = Yes).
- Pclass is the person's ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
- Sex is the sex of the person.
- Age is the age of the person in years.
- SibSp is the number of siblings or spouses onboard with the person.
- Parch is the number of parents or children onboard with the person.
- Ticket is the ticket number of the person.
- Fare is the price paid by the person.
- Cabin is the cabin number in which the person stayed.
- Embarked is the port where the person boarded the Titanic (C = Cherbourg, Q = Queenstown, S = Southampton).

In [1]:
# Data analysis tools
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
import random as rnd

# Data visualization tools
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


# Import data
train = pd.read_csv("/kaggle/input/titanic/train.csv")
test = pd.read_csv("/kaggle/input/titanic/test.csv")

# Display data information
train.info()
print('_' * 50)
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
__________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0  

From the above printouts we can see that in the train dataset:

- There are 6 numerical and 5 string-type features (not counting the Survived label).
- Age, Cabin and Embarked contain missing values (19.87%, 77.1% and 0.22% respectively).
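
The missing-value percentages can be computed directly rather than read off the `info()` printout. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not the Kaggle data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 54.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", "Q", "S"],
})

# isna() marks missing cells; the column-wise mean of that boolean mask
# is the fraction of missing values in each column.
missing_pct = (toy.isna().mean() * 100).round(2)
print(missing_pct)
```

Applied to the real data, `train.isna().mean() * 100` should give the same percentages as dividing the non-null counts above by 891.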

In [2]:
train.describe()

|  | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


From the descriptive statistics generated on the numerical features for the train dataset we can see that:

- 38.38% of passengers in the train dataset survived (roughly representative of the actual survival rate of 32%).
- The average age of passengers was about 30 years old.
- 75% of passengers with a recorded age were 38 or younger.
- Most passengers did not travel with parents or children.
- Most passengers (75%) paid no more than 31 in fares, but a small minority paid as much as 512.
- Most passengers travelled with at most one sibling or spouse.

In [3]:
train.describe(include = ['O'])

|  | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Strom, Miss. Telma Matilda | male | 347082 | C23 C25 C27 | S |
| freq | 1 | 577 | 7 | 4 | 644 |


From the descriptive statistics generated on the categorical features for the train dataset we can see that:

- All names were unique.
- About 65% of the passengers were male (577 of 891).
- Some ticket numbers were repeated (most likely due to families purchasing one ticket for multiple people).
- Some cabin numbers were repeated (for similar reasons as ticket numbers).
- The port of Southampton (S) was most frequently embarked from (72.44%).
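
The shares quoted above can be read from `freq / count` in the table, or computed with `value_counts(normalize=True)`. A minimal sketch on made-up values (the `toy` frame is an illustrative assumption):

```python
import pandas as pd

# Hypothetical toy frame; one Embarked value is deliberately missing.
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "male"],
    "Embarked": ["S", "S", "C", None],
})

# normalize=True returns proportions instead of counts.
sex_share = toy["Sex"].value_counts(normalize=True) * 100

# Missing values are excluded by default (dropna=True), so the
# denominator is the number of non-null entries.
embarked_share = toy["Embarked"].value_counts(normalize=True) * 100

print(sex_share)
print(embarked_share)
```

Because missing values are excluded by default, the Southampton share on the real data is taken over the 889 passengers with a known port (644 / 889), not over all 891 rows.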

## Feature Selection

From the above analysis we decide to remove:

- Name as it is unique for each row.
- Ticket as we cannot be sure that the repeated entries are not an error.
- Cabin for the same reason as above and also due to a large number of missing values.
- PassengerId as it is unique for each row.

In [4]:
train[["Pclass", "Survived"]].groupby(["Pclass"], as_index = False).mean().sort_values(by = "Survived", ascending = False)

|  | Pclass | Survived |
|---|---|---|
| 0 | 1 | 0.62963 |
| 1 | 2 | 0.472826 |
| 2 | 3 | 0.242363 |


In [5]:
train[["Sex", "Survived"]].groupby(["Sex"], as_index = False).mean().sort_values(by = "Survived", ascending = False)

|  | Sex | Survived |
|---|---|---|
| 0 | female | 0.742038 |
| 1 | male | 0.188908 |


In [6]:
train[["SibSp", "Survived"]].groupby(["SibSp"], as_index = False).mean().sort_values(by = "Survived", ascending = False)

|  | SibSp | Survived |
|---|---|---|
| 1 | 1 | 0.535885 |
| 2 | 2 | 0.464286 |
| 0 | 0 | 0.345395 |
| 3 | 3 | 0.25 |
| 4 | 4 | 0.166667 |
| 5 | 5 | 0.0 |
| 6 | 8 | 0.0 |


In [7]:
train[["Fare", "Survived"]].groupby(["Fare"], as_index = False).mean().sort_values(by = "Survived", ascending = False)

|  | Fare | Survived |
|---|---|---|
| 247 | 512.3292 | 1.0 |
| 196 | 57.9792 | 1.0 |
| 89 | 13.8583 | 1.0 |
| 88 | 13.7917 | 1.0 |
| 86 | 13.4167 | 1.0 |
| ... | ... | ... |
| 103 | 15.5500 | 0.0 |
| 180 | 47.1000 | 0.0 |
| 179 | 46.9000 | 0.0 |
| 178 | 42.4000 | 0.0 |
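
Grouping by every distinct fare produces many very small groups, which makes the table above hard to read. One possible alternative, sketched here on made-up values, is to bin the fares into quantile bands with `pd.qcut` before averaging. The `toy` frame, the `FareBand` column name and the choice of 4 bands are assumptions for illustration; whether this changes the picture for Fare would need to be checked on the real data.

```python
import pandas as pd

# Hypothetical toy data; fares and outcomes are made up.
toy = pd.DataFrame({
    "Fare": [7.25, 7.90, 8.05, 13.00, 15.50, 26.00, 31.00, 53.10, 71.28, 512.33],
    "Survived": [0, 0, 0, 1, 0, 1, 0, 1, 1, 1],
})

# Split the fares into 4 bands holding roughly equal numbers of passengers.
toy["FareBand"] = pd.qcut(toy["Fare"], 4)

# Mean of the 0/1 label within each band is that band's survival rate.
band_rates = toy.groupby("FareBand", observed=True)["Survived"].mean()
print(band_rates)
```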


In [8]:
train[["Parch", "Survived"]].groupby(["Parch"], as_index = False).mean().sort_values(by = "Survived", ascending = False)

|  | Parch | Survived |
|---|---|---|
| 3 | 3 | 0.6 |
| 1 | 1 | 0.550847 |
| 2 | 2 | 0.5 |
| 0 | 0 | 0.343658 |
| 5 | 5 | 0.2 |
| 4 | 4 | 0.0 |
| 6 | 6 | 0.0 |
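
The same group-by pattern is repeated for each feature, so it could be wrapped in a small helper that also reports how many passengers fall in each group. Group size matters here because rates such as 0.0 or 1.0 may rest on only a handful of rows. This is a sketch; `survival_rate_by` and the `toy` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

def survival_rate_by(df, feature):
    """Return the survival rate and group size for each value of `feature`."""
    return (
        df.groupby(feature)["Survived"]
        .agg(rate="mean", n="size")  # mean of the 0/1 label, and row count
        .sort_values("rate", ascending=False)
    )

# Hypothetical toy data; the values are made up.
toy = pd.DataFrame({
    "Parch": [0, 0, 0, 1, 1, 4],
    "Survived": [0, 1, 0, 1, 1, 0],
})

table = survival_rate_by(toy, "Parch")
print(table)
```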


From the above comparisons we can see that:

- Pclass correlates strongly with Survived.
- Sex correlates strongly with Survived.
- SibSp somewhat correlates with Survived.
- Fare, grouped by individual value, shows no clear relationship with Survived (so we remove it from the feature selection).
- Parch does not seem to correlate with Survived (so we remove it from the feature selection).

Therefore, we will select the following 5 features for modelling:

- Age
- Pclass
- Sex
- SibSp
- Embarked
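
The statements about correlation above are based on comparing group means. One way to put a number on them is the Pearson correlation between each feature and the label, sketched below with `DataFrame.corrwith`. The `toy` frame and the `SexCode` column are illustrative assumptions; Pearson correlation only captures linear relationships, so it should be read as a rough guide.

```python
import pandas as pd

# Hypothetical toy data; the values are made up.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Correlation needs numeric inputs, so encode Sex as 0/1 first.
toy["SexCode"] = toy["Sex"].map({"female": 1, "male": 0})

# Pearson correlation of each feature column with the Survived label.
corr = toy[["Pclass", "SexCode"]].corrwith(toy["Survived"])
print(corr)
```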

# Data Wrangling

We need to drop the features that we won't be using in our models, drop the rows that still contain missing values, and convert the remaining features to numerical types (most models require the data to be in numerical format).

In [9]:
# Drop the features we will not use
# (PassengerId stays in test for now; it is dropped when X_test is built)
train = train.drop(["Ticket", "Cabin", "PassengerId", "Name", "Parch", "Fare"], axis = 1)
test = test.drop(["Ticket", "Cabin", "Name", "Parch", "Fare"], axis = 1)

# Drop rows with missing values
# (note: any row dropped from test will not receive a prediction)
train = train.dropna()
test = test.dropna()
full_data = [train, test]

# Convert categorical features to integer codes
for dataset in full_data:
    dataset["Sex"] = dataset["Sex"].map({"female": 1, "male": 0}).astype(int)
    dataset["Embarked"] = dataset["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)
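
Dropping rows is the approach taken here. An alternative that keeps every row is to fill the gaps instead, sketched below on made-up data. This is not what the notebook does: the frames, the median/mode choices and the variable names are assumptions for illustration. The fill values are computed from the training data only and then reused for the test data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train and test; values are made up.
toy_train = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", None, "C", "S"],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 50.0],
    "Embarked": ["Q", "S"],
})

# Compute the fill values on the training data only.
age_median = toy_train["Age"].median()
port_mode = toy_train["Embarked"].mode()[0]

# assign() returns a new frame, so the originals are left unchanged.
filled = [
    df.assign(
        Age=df["Age"].fillna(age_median),
        Embarked=df["Embarked"].fillna(port_mode),
    )
    for df in (toy_train, toy_test)
]
toy_train_filled, toy_test_filled = filled
print(toy_test_filled)
```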

# Modelling

To determine an accurate model, we will evaluate the following classification models:

- Logistic Regression
- K-Nearest Neighbors
- Decision Tree Classification
- Random Forest Classification

In [10]:
# machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Formatting train and test sets
X_train = train.drop("Survived", axis = 1)
y_train = train["Survived"]
X_test = test.drop("PassengerId", axis = 1).copy()

In [11]:
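# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
prediction = logreg.predict(X_test)

# Note: score() here is the accuracy on the training data, not on held-out data
# (the same applies to the models in the cells below)
print("Logistic Regression Score:", round(logreg.score(X_train, y_train) * 100, 2))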

Logistic Regression Score: 80.76


In [12]:
# KNN
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)

print("K-Nearest Neighbors Score:", round(knn.score(X_train, y_train) * 100, 2))

K-Nearest Neighbors Score: 88.06


In [13]:
# Decision Tree
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
prediction = tree.predict(X_test)

print("Decision Tree Score:", round(tree.score(X_train, y_train) * 100, 2))

Decision Tree Score: 93.4


In [14]:
# Random Forest
forest = RandomForestClassifier(n_estimators = 100)
forest.fit(X_train, y_train)
forest_prediction = forest.predict(X_test)

print("Random Forest Score:", round(forest.score(X_train, y_train) * 100, 2))

Random Forest Score: 93.4
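
The scores above are computed on the same rows the models were fitted on, which tends to favour flexible models such as trees. A common way to estimate accuracy on unseen data is k-fold cross-validation, sketched below with scikit-learn's `cross_val_score`. The synthetic `X` and `y`, the fixed `random_state` values and the choice of 5 folds are assumptions for illustration; in the notebook, `X_train` and `y_train` would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (made-up data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # Each of the 5 folds is held out once while the model is fitted on the rest.
    scores = cross_val_score(model, X, y, cv=5)
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```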


# Results

The decision tree and random forest share the highest score on the training data (93.4), so we choose the random forest, as it tends to generalize better than a single tree. In conclusion, in this case at least, the random forest appears to be a good choice of classifier.
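
For the submission step mentioned in the methodology, a minimal sketch of building the output file follows. It assumes the usual Kaggle Titanic format of a `PassengerId` column and a `Survived` column; the ids and predictions below are made up, and in the notebook `test["PassengerId"]` and `forest_prediction` would take their place.

```python
import pandas as pd

# Hypothetical ids and predictions standing in for
# test["PassengerId"] and forest_prediction.
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = [0, 1, 0]

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions,
})

# With no path argument, to_csv returns the CSV text;
# pass a filename instead to write the file to disk.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Since rows with missing values were dropped from `test` earlier, a submission built this way would only cover the passengers that remain.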