## **Titanic Project**

**PassengerId**: *Passenger Number*

**Survived**: *Identifies if the passenger survived or died*

		0  DIED
		1  ALIVE

**Pclass**: *The ticket class of the passenger*
		
		1  FIRST CLASS
		2  SECOND CLASS
		3  THIRD CLASS

**Name**: *Name of the passenger*

**Sex**: *Gender of the passenger*

		male
		female

**Age**: *Age of passenger*

**SibSp**: *Number of Siblings/Spouses Aboard*

**Parch**: *Number of Parents/Children Aboard*

**Ticket**: *Ticket Number*

**Fare**: *Passenger Fare*

**Cabin**: *Cabin*

**Embarked**: *Port of Embarkation*

		C = Cherbourg
		Q = Queenstown
		S = Southampton



The goal of this project is to find out whether we can predict a passenger's survival on the Titanic from their attributes, so that the findings can inform rescue efforts in future maritime disasters and improve survival rates.

In this notebook, we extract the data and perform data cleaning.

In [84]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics
from sklearn.model_selection import train_test_split

# Import Decision Tree Classifier model from Scikit-Learn
from sklearn.tree import DecisionTreeClassifier
# Plot the trained Decision Tree
from sklearn.tree import plot_tree
# for plotting confusion matrix
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore')

In [85]:
csv_data = pd.read_csv("datasets/OG-Titanic-Dataset.csv")  # forward slash works on Windows, macOS and Linux
csv_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [86]:
csv_data.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [87]:
csv_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## **Data Cleaning**

In [88]:
# Check for null values
for i in csv_data:
    nulls = csv_data[i].isnull().sum()
    print(f"Nulls for {i}: {nulls}")

Nulls for PassengerId: 0
Nulls for Survived: 0
Nulls for Pclass: 0
Nulls for Name: 0
Nulls for Sex: 0
Nulls for Age: 177
Nulls for SibSp: 0
Nulls for Parch: 0
Nulls for Ticket: 0
Nulls for Fare: 0
Nulls for Cabin: 687
Nulls for Embarked: 2


Age and Embarked contain some NULL values. Since Age is numerical and has outliers, our group decided to replace its NULL values with the median, which is less sensitive to outliers than the mean. Since Embarked is categorical, our group decided to replace its NULL values with the mode.

In [89]:
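# Replace the nulls in Age with the median
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
csv_data['Age'] = csv_data['Age'].fillna(csv_data['Age'].median())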

In [90]:
# Replace the nulls in Embarked with the mode
# (mode() returns a Series, so [0] picks the most frequent value)
csv_data['Embarked'] = csv_data['Embarked'].fillna(csv_data['Embarked'].mode()[0])

In [91]:
# Check again if null values are replaced
for i in csv_data:
    nulls = csv_data[i].isnull().sum()
    print(f"Nulls for {i}: {nulls}")

Nulls for PassengerId: 0
Nulls for Survived: 0
Nulls for Pclass: 0
Nulls for Name: 0
Nulls for Sex: 0
Nulls for Age: 0
Nulls for SibSp: 0
Nulls for Parch: 0
Nulls for Ticket: 0
Nulls for Fare: 0
Nulls for Cabin: 687
Nulls for Embarked: 0
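
As a small illustration of the imputation choices above, the sketch below uses toy values (assumed for the example, not taken from the dataset) to show how a single outlier pulls the mean but not the median, and how the mode fills a missing category.

```python
import numpy as np
import pandas as pd

# Toy ages (assumed values): one outlier (80) and one missing entry
ages = pd.Series([22.0, 24.0, 26.0, 28.0, 80.0, np.nan])

mean_age = ages.mean()      # pulled upward by the outlier
median_age = ages.median()  # middle of the non-missing values

# Fill the missing age with the median, assigning the result back
ages_filled = ages.fillna(median_age)

# Toy embarkation ports (assumed values) with one missing entry
ports = pd.Series(["S", "C", "S", None, "Q"])
most_common_port = ports.mode()[0]  # mode() ignores missing values by default
ports_filled = ports.fillna(most_common_port)

print(f"mean = {mean_age}, median = {median_age}")
print(ages_filled.tolist())
print(ports_filled.tolist())
```

Here the mean is 36.0 while the median is 26.0, so the median is the more representative fill value when a few passengers are much older than the rest.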


As for Cabin, because so many of its values are NULL (687 of 891), our group decided not to use that variable.
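
To put the null counts in proportion, the following sketch converts the counts printed above into percentages of the 891 rows. The counts are copied from the notebook output; the variable names are only illustrative.

```python
import pandas as pd

# Null counts copied from the check above; 891 is the row count reported by info()
null_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
total_rows = 891

# Share of missing values per column, as a percentage rounded to one decimal
missing_pct = (null_counts / total_rows * 100).round(1)
print(missing_pct)

# On the full DataFrame, csv_data.isnull().mean() * 100 would give the same shares directly
```

Roughly three quarters of the Cabin values are missing, compared with about a fifth for Age and well under one percent for Embarked, which is why imputation was used for the latter two but not for Cabin.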

In [92]:
numeric_data = pd.DataFrame(csv_data[['Age', 'SibSp', 'Parch', 'Fare']])
cat_data = pd.DataFrame(csv_data[['Survived', 'Pclass', 'Sex', 'Embarked', 'Name']])


In [93]:
# Adding a column Family_Size (the passenger plus siblings/spouses and parents/children aboard)
csv_data['Family_Size'] = csv_data['Parch'] + csv_data['SibSp'] + 1
numeric_data['Family_Size'] = csv_data['Family_Size']

# Adding a column Alone (1 if the passenger has no family aboard, otherwise 0)
csv_data['Alone'] = (csv_data['Family_Size'] == 1).astype(int)
cat_data['Alone'] = csv_data['Alone']

In [94]:
# Extract the title (e.g. Mr, Mrs, Miss): the first run of letters followed by a period in Name
cat_data['Initial'] = cat_data['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

In [95]:
cat_data.head()

| | Survived | Pclass | Sex | Embarked | Name | Alone | Initial |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | S | Braund, Mr. Owen Harris | 0 | Mr |
| 1 | 1 | 1 | female | C | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | Mrs |
| 2 | 1 | 3 | female | S | Heikkinen, Miss. Laina | 1 | Miss |
| 3 | 1 | 1 | female | S | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | Mrs |
| 4 | 0 | 3 | male | S | Allen, Mr. William Henry | 1 | Mr |


Our group did feature engineering and created three new variables that we think will help us better predict survivability: the size of the passenger's family aboard, counting the passenger (`Family_Size`); whether the passenger is travelling alone (`Alone`); and the title of each passenger, extracted from their name (`Initial`).
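
The three engineered features can be sketched in one self-contained helper. The rows below are copied from the `head()` output earlier in the notebook; `add_features` is a hypothetical name introduced only for this illustration, not a function used elsewhere in the project.

```python
import pandas as pd

# Three rows copied from the csv_data.head() output
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
    "SibSp": [1, 0, 1],
    "Parch": [0, 0, 0],
})

def add_features(df):
    """Return a copy of df with Family_Size, Alone and Initial added (hypothetical helper)."""
    out = df.copy()
    # The passenger plus siblings/spouses and parents/children aboard
    out["Family_Size"] = out["SibSp"] + out["Parch"] + 1
    # 1 when the passenger has no family aboard, otherwise 0
    out["Alone"] = (out["Family_Size"] == 1).astype(int)
    # First run of letters followed by a period, i.e. the title
    out["Initial"] = out["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)
    return out

features = add_features(sample)
print(features[["Family_Size", "Alone", "Initial"]])
```

For these rows the helper yields family sizes of 2, 1 and 2, `Alone` flags of 0, 1 and 0, and the titles Mr, Miss and Mrs, matching the `cat_data.head()` output.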

In [96]:
numeric_data = pd.DataFrame(numeric_data[['Age', 'SibSp', 'Parch', 'Fare', 'Family_Size']])
cat_data = pd.DataFrame(cat_data[['Survived', 'Pclass', 'Sex', 'Embarked', 'Alone', 'Initial']])


We have chosen to drop the following variables:

`Name` -- Reason: Apart from the title already extracted into `Initial`, a passenger's name should not affect their survivability

`Ticket` -- Reason: The strings appear to be random and do not tell us anything interesting

`PassengerId` -- Reason: It is just numbering the passengers and does not reflect survivability

`Cabin` -- Reason: The high number of nulls renders the variable not very useful

In [97]:
cleaned_data = pd.concat([numeric_data, cat_data], axis=1)
cleaned_data.to_csv("datasets/cleaned-data.csv")  # forward slash works on Windows, macOS and Linux

The cleaned data has been exported to `cleaned-data.csv` in the `datasets` folder.
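
When the exported file is read back in a later notebook, note that `to_csv` writes the row index as an unnamed first column by default. The sketch below shows the effect on a toy DataFrame written to a temporary folder; the values and the file name are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for cleaned_data (assumed values)
toy = pd.DataFrame({"Age": [22.0, 38.0], "Sex": ["male", "female"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-cleaned-data.csv")  # hypothetical file name
    toy.to_csv(path)  # default index=True also writes the row index

    plain = pd.read_csv(path)                  # the index comes back as a data column
    restored = pd.read_csv(path, index_col=0)  # the first column is used as the index again

print(plain.columns.tolist())     # ['Unnamed: 0', 'Age', 'Sex']
print(restored.columns.tolist())  # ['Age', 'Sex']
```

Passing `index_col=0` when loading, or `index=False` when saving, keeps the extra `Unnamed: 0` column out of the feature set.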