# **Titanic Classifier - Report**

#### ***Introduction to the report***
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

In this report we will attempt to answer the question "what sorts of people were more likely to survive?" using 3 different machine learning approaches.<br><br>
We decided to use k-nearest neighbours (KNN), logistic regression and decision trees.

*DISCLAIMER: The number of comments inside of the actual code is kept to a minimum due to this being a report.*

Created by **Ron Ismaili** and **Alberto Gamez Gonzalez**.

#### ***First look at the dataset***

We start off by importing all of the libraries necessary to run our code.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

We follow up by loading the dataset *'titanic_train.csv'* into a dataframe and then taking a look at the **dataframe**, its **shape** as well as its **keys**.

In [None]:
df = pd.read_csv('titanic_train.csv')
display(df)
print("Data Shape:", df.shape)
print("Data Keys:\n", df.keys())

#### ***General dataset information***

In the above section, we can see that our dataset contains **1047 rows** x **14 columns**, meaning that for each of the passengers *(of which there are 1047)* we have **14** features each.<br>
The following is a list of all the available features in this dataset as well as their details:<br>
1.  **PClass** - *(Passenger class)* - as a means to identify the SES *(Socio-economic status)* of the passenger. *(1 - upper, 2 - middle, 3 - lower)*
2.  **Survived** - Survival status of the passenger. *(Has the passenger survived the tragedy)*
3.  **Name** - Full name of the passenger.
4.  **Sex** - Gender of the passenger.
5.  **Age** - Age of the passenger. *(In years)*
6.  **Sibsp** - Number of siblings / spouses aboard the Titanic.
7.  **Parch** - Number of parents / children aboard the Titanic.
8.  **Ticket** - Ticket number.
9.  **Fare** - Passenger fare. *(Amount paid for the ticket)*
10.  **Cabin** - Cabin number.
11.  **Embarked** - Port at which the passenger embarked. *(C = Cherbourg, Q = Queenstown, S = Southampton)*
12.  **Boat** - Identifier of the lifeboat the passenger boarded, if any.
13.  **Body** - Body identification number, recorded only if the passenger died and the body was recovered.
14.  **Homedest** - The passenger's home / destination.

#### ***Data pre-processing***

Before we get started, we have to take a closer look at our data and check for missing values, strings that need to be converted, etc.<br><br>
The first thing that jumped out at us when we saw the data was how much of it is missing. As an example, let's look at the **age** feature:

In [None]:
df["age"]

As we can see, there are plenty of **"?"** entries in this column. The **"?"** marks **missing data**, and the age column is not the only one affected. Before going any further, we need to clean up these entries.<br>

We will start off by dealing with the missing data in our dataset.<br>
There are multiple ways to locate missing data:
- **Heat maps**
- **Percentages**
- **Bar graphs**

In our case we think that **percentages** are the most convenient way of finding out how much data is missing in each of the columns.

We define and run the function for calculating the missing % of data in all of the columns.

In [None]:
def missingPercentages(data, marker="?"):
    # Print, for every column, the percentage of entries equal to the missing-data marker.
    for col in data.columns:
        pct_missing = np.mean(data[col] == marker)  # Use the dataframe passed in, not the global df.
        print('{} - {}%'.format(col, round(pct_missing*100)))

missingPercentages(df, "?") # Report the percentage of "?" entries in each column of the dataframe.

Here we can see the % of missing data in each column.<br><br>

There are **5 features** with missing data:
- **Age**
- **Cabin**
- **Boat**
- **Body**
- **Homedest**
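The same percentages could also be obtained without an explicit loop. The following is a minimal sketch, assuming that missing values appear either as the `"?"` marker or as `NaN`; the helper name `missing_percentages` and the toy frame are made up for illustration and are not part of the Titanic data.

```python
import pandas as pd

def missing_percentages(frame, marker="?"):
    """Return, per column, the percentage of entries equal to `marker` or NaN."""
    is_missing = frame.isin([marker]) | frame.isna()
    return (is_missing.mean() * 100).round(1)

# Toy frame for illustration only (not the Titanic data).
toy = pd.DataFrame({
    "age": ["22", "?", "35", "?"],
    "cabin": ["?", "?", "?", "C85"],
    "fare": [7.25, 71.28, 8.05, 53.1],
})

missing = missing_percentages(toy)
print(missing)
```

Returning a Series instead of printing makes it easy to sort the columns or to filter those above a chosen threshold.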

Now we will define some functions to clean the dataset, remove unnecessary data as well as deal with the missing data problem.

In [None]:
def embarkedData(input_data): # We need to change the embarked port format.
    if (input_data == "C"):
        return 1
    elif (input_data == "Q"):
        return 2
    else:
        return 3

def reconstructSex(sex_string): # We need to change the male and female format.
    if (sex_string == 'male'):
        return 1
    else:
        return 0

def cleanData(data_input, restriction = "?" , datasub = 0): # We substitute the missing data in the dataset with 0s.
    if (data_input == restriction):
        return datasub
    else:
        return data_input
        
def cleanDataString(data_input, restriction = "?" , datasub = 0): # We substitute any string value (including the missing-data marker) with 0.
    if (isinstance(data_input, str) or (data_input == restriction)):
        return datasub
    else:
        return data_input

We will now apply all of the necessary functions.

In [None]:
# We change the format of "Sex" and "Embarked" so that the algorithm can understand it.
df["sex"] = df["sex"].apply(reconstructSex)
df["embarked"] = df["embarked"].apply(embarkedData)

# We deal with the missing data in "Age", "Cabin", "Fare", "Body", "Boat" and "Ticket".
df["age"] = df["age"].apply(cleanData)

df["cabin"] = df["cabin"].apply(cleanData)

df["fare"] = df["fare"].apply(cleanData)
df["fare"] = df["fare"].apply(float)

df["body"] = df["body"].apply(cleanData)

def toIntOrDefault(value): # Convert to int where possible; any remaining string becomes 0.
    try:
        return int(value)
    except (ValueError, TypeError):
        return cleanDataString(value)

# A column-wise apply replaces the element-wise chained assignment df["boat"][i] = ..., which pandas discourages.
df["boat"] = df["boat"].apply(toIntOrDefault)
df["ticket"] = df["ticket"].apply(toIntOrDefault)
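The cleaning steps above can also be written with vectorized pandas operations. This is a sketch under stated assumptions rather than a drop-in replacement: the toy frame is invented for illustration, and `pd.to_numeric` keeps decimal strings such as `"2.5"` as floats, whereas the `int` conversion used in the report would turn them into 0.

```python
import pandas as pd

# Toy frame for illustration only (not the Titanic data).
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["C", "Q", "S"],
    "boat": ["2", "?", "A"],
    "fare": ["7.25", "?", "26"],
})

cleaned = toy.copy()

# male -> 1, anything else -> 0 (same encoding as reconstructSex).
cleaned["sex"] = (cleaned["sex"] == "male").astype(int)

# C -> 1, Q -> 2, anything else -> 3 (same encoding as embarkedData).
cleaned["embarked"] = cleaned["embarked"].map({"C": 1, "Q": 2}).fillna(3).astype(int)

# Non-numeric entries such as "?" or "A" become NaN and are then replaced by 0.
for column in ["boat", "fare"]:
    cleaned[column] = pd.to_numeric(cleaned[column], errors="coerce").fillna(0)

print(cleaned)
```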

Let's now take a look at the final result.

In [None]:
display(df)

As we can see, all of the data is now neat and clean, ready to be analyzed.

#### ***Exploratory Data Analysis (EDA)***

Now that everything is said and done, and we finally have our dataset ready to go, it's time to take a deep dive into the data and analyze it. It is our job now to look at these traits and decide which ones are more important than the others and if we can create any new features ourselves *(feature engineering)*.<br>

For starters, we know that the Titanic incident happened on **April 15, 1912**. We have to keep in mind the date all of this happened when taking into consideration the data.<br>
The Titanic had **2224** individuals on board, out of which **1502** died due to the incident, giving us a roughly **67.54%** death rate, or inversely a roughly **32.46%** survival rate. We believe that there is a reason why certain people survived and others did not.<br><br>
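The two rates follow directly from the counts quoted above, as this short calculation shows (the variable names are ours):

```python
total_on_board = 2224
deaths = 1502

death_rate = deaths / total_on_board
survival_rate = 1 - death_rate

print(f"Death rate: {death_rate:.2%}")
print(f"Survival rate: {survival_rate:.2%}")
```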

The first thing that came to mind was **gender**. We think that gender played a huge role in the survival chances of a person. Our initial assumption was that males were more likely to survive than females, because physical differences might give them an edge in cold open water. This is only a hypothesis: the evacuation reportedly prioritised women and children, which would push the survival rates in the opposite direction, so the survival rate per gender in the dataset should settle the question.<br><br>

Another very important feature in our opinion is the **Socio-Economic Status** *(SES)* of the passenger. We believe that a passenger with a higher **SES** had a higher chance of survival due to multiple factors. They probably had better access to lifeboats, better quality of food and an overall better experience on board, whereas those with a very low **SES** would stay in very poor conditions within the Titanic. All of this could contribute to the survival chances of the passengers.<br><br>
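Hypotheses like the ones about gender and class can be checked against the data by computing the survival rate per group. Below is a minimal sketch: the helper `survival_rate_by` is hypothetical and the toy values are made up, so the printed rates say nothing about the real passengers. In the cleaned dataframe, `sex` is already encoded as 1 *(male)* and 0 *(female)*, which works the same way.

```python
import pandas as pd

def survival_rate_by(frame, column, target="survived"):
    """Mean of the 0/1 target per group, i.e. the survival rate in each group."""
    return frame.groupby(column)[target].mean()

# Toy frame for illustration only (values are invented).
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 3, 3, 3],
    "sex":      ["female", "male", "female", "male", "male", "female"],
    "survived": [1, 0, 1, 0, 1, 0],
})

by_sex = survival_rate_by(toy, "sex")
by_class = survival_rate_by(toy, "pclass")
print(by_sex)
print(by_class)
```

On the real data, the call would be `survival_rate_by(df, "sex")` and `survival_rate_by(df, "pclass")`.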

**Lifeboats** also played a huge role in the survival chances of an individual. If there had been more lifeboats, the chances of survival would have gone up.<br><br>

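We experimented with **"body"** in our model and found that it also played an important role in predicting whether an individual survived. A likely explanation is that a body number is only recorded for passengers who died and whose body was recovered, so a non-missing value effectively reveals the outcome *(target leakage)*. We note it here because it affects how the scores should be read.<br><br>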
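One way to see whether a feature such as **"body"** is simply a proxy for the outcome is to cross-tabulate it against the target. The sketch below uses invented toy values and assumes, as in our cleaning step, that a missing body number has been replaced by 0.

```python
import pandas as pd

# Toy frame for illustration only (values are invented).
toy = pd.DataFrame({
    "body":     [0, 135, 0, 22, 0, 0],
    "survived": [1, 0, 1, 0, 0, 1],
})

# True where a body number was recorded, False where it was missing (encoded as 0).
has_body = (toy["body"] != 0).rename("has_body")

table = pd.crosstab(has_body, toy["survived"])
print(table)
```

If every row with `has_body == True` falls into the `survived == 0` column, the feature leaks the label and would not be available when predicting survival in advance.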

*We also played with many other features in many different combinations but we found that these 5 are the most effective at predicting the survival chances of a passenger.*

#### ***Preparing and testing the algorithms***
First we prepare all of our necessary features into one spot.

In [None]:
criterios_np = df[['pclass','sex','boat','body','embarked']].to_numpy() # We gather all of the most important features here.
survived_np = df.survived.to_numpy() # Change the format to numpy. Therefore we can use numpy libraries and avoid errors.

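Then we prepare the training and test sets, one split per algorithm. *(Note that the KNN split uses a different `random_state` than the other two, so KNN is scored on a different test set.)*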

In [None]:
X1_train, X1_test, y1_train, y1_test = train_test_split(
    criterios_np , survived_np , random_state=0)

X2_train, X2_test, y2_train, y2_test = train_test_split(
    criterios_np, survived_np, random_state=42)

X3_train, X3_test, y3_train, y3_test = train_test_split(
    criterios_np, survived_np, random_state=42)
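For a like-for-like comparison, all models could share a single split. The sketch below uses randomly generated stand-in data (not the Titanic features) and adds `stratify=y`, an optional `train_test_split` argument that keeps the class proportions similar in both parts; neither choice is taken from our notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data for illustration only: 200 rows, 5 integer features, binary target.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 5))
y = rng.integers(0, 2, size=200)

# One split shared by every model; by default 25% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)
```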

Then we do the fitting.

In [None]:
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X1_train , y1_train)

logreg = LogisticRegression().fit(X2_train, y2_train)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X3_train, y3_train)

***The KNN algorithm:***

In [None]:
print("KNN algorithm:")
print("Training set score: {:.2f}".format(knn.score(X1_train, y1_train)))
print("Test set score: {:.2f}".format(knn.score(X1_test, y1_test)))

***The Logistic regression algorithm:***

In [None]:
print("Logistic regression algorithm:")
print("Training set score: {:.3f}".format(logreg.score(X2_train, y2_train)))
print("Test set score: {:.3f}".format(logreg.score(X2_test, y2_test)))

***The tree algorithm:***

In [None]:
print("Tree algorithm:")
print("Training set score: {:.2f}".format(tree.score(X3_train, y3_train)))  # Score the fitted tree, not the KNN model.
print("Test set score: {:.2f}".format(tree.score(X3_test, y3_test)))
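A single train/test split gives a fairly noisy estimate, so a cross-validated comparison on identical folds is a common alternative. This is a minimal sketch on synthetic data from `make_classification`, not a re-run of our experiment; `max_iter=1000` is our own choice to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=6),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

# 5-fold cross-validation: every model is scored on the same five folds.
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

With our data, `X` and `y` would be `criterios_np` and `survived_np`.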

#### ***Conclusion***

The ranking of our 3 algorithms is as follows:
1. Tree algorithm - **0.94**
2. Logistic regression algorithm - **0.927**
3. KNN algorithm - **0.91**

**Tree algorithm** - This algorithm performed the best. It not only had the best score, but it also generalized the best out of the three.<br>
**Logistic regression algorithm** - This algorithm performed the second best, generalizing relatively well.<br>
**KNN algorithm** - This algorithm performed the worst *(although still a respectable 0.91)*. It not only had the lowest score, but it also generalized the worst out of the three, meaning there was some overfitting.<br>

*We have learned a lot during this project and, now that we have some more experience, we would do many things differently. It was fun and a bit challenging, and we hope you enjoyed the report.*