The sinking of the RMS Titanic is one of the most infamous tragedies in history. In this hands-on exercise, a machine learning system will be developed to predict which passengers survived the disaster.
![Image of Titanic](https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/RMS_Titanic_3.jpg/450px-RMS_Titanic_3.jpg)

In [1]:
import pandas as pd
from warnings import filterwarnings
filterwarnings('ignore')

df = pd.read_csv('titanic.csv')

In [2]:
# Preview the first 10 rows of the dataset
df.head(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [3]:
# Statistical summary
df.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [4]:
# Check for missing data
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
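
Raw counts of missing values are easier to judge as a share of all rows. The following is a minimal sketch on a small, hypothetical frame (the `toy` data and the name `missing_share` are illustrative assumptions, not part of the dataset above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (not the real file)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.93, 53.10],
})

# isnull() yields booleans, so mean() gives the fraction missing per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the counts above, Cabin is missing in 687 of 891 rows (about 77%) and Age in 177 of 891 (about 20%).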

In [6]:
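# Select the relevant columns only; .copy() makes df1 an independent frame,
# so later assignments do not trigger pandas' SettingWithCopyWarning
selected = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Survived']
df1 = df[selected].copy()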

To do: Replace the missing data (imputation) using a summary statistic (mean, median, etc.)

In [8]:
df1['Age'] = df1['Age'].fillna(df1['Age'].median())

To do: Check whether any missing data remains

In [9]:
df1.isnull().sum()

Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
Survived    0
dtype: int64
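
The imputation step above names the mean and the median as candidate statistics. The sketch below, on hypothetical toy ages, shows how the two differ when an outlier is present, and how scikit-learn's `SimpleImputer` can do the same job. The toy values and variable names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy ages with one outlier (80) and two gaps
ages = pd.DataFrame({'Age': [20.0, 22.0, np.nan, 24.0, 80.0, np.nan]})

# The median is less sensitive to the outlier than the mean
median_filled = ages['Age'].fillna(ages['Age'].median())
mean_filled = ages['Age'].fillna(ages['Age'].mean())

# SimpleImputer learns the statistic in fit(), so it could be fitted on
# training data only and then reused on test data
imputer = SimpleImputer(strategy='median')
imputed = imputer.fit_transform(ages)

print(median_filled.tolist())
print(mean_filled.tolist())
```

Which statistic suits the real Age column is a judgment call; the notebook uses the median.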

To do: 
- Remove the samples with missing data.
- Use the get_dummies() function from pandas to convert the categorical features into a one-hot encoding.

In [10]:
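# Drop the remaining rows with missing values
df1 = df1.dropna()
# dtype=int keeps the indicator columns as 0/1 (newer pandas versions default to booleans)
df2 = pd.get_dummies(df1, dtype=int)
df2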

,Pclass,Age,SibSp,Parch,Fare,Survived,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,3,22.0,1,0,7.2500,0,0,1,0,0,1
1,1,38.0,1,0,71.2833,1,1,0,1,0,0
2,3,26.0,0,0,7.9250,1,1,0,0,0,1
3,1,35.0,1,0,53.1000,1,1,0,0,0,1
4,3,35.0,0,0,8.0500,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
886,2,27.0,0,0,13.0000,0,0,1,0,0,1
887,1,19.0,0,0,30.0000,1,1,0,0,0,1
888,3,28.0,1,2,23.4500,0,1,0,0,0,1
889,1,26.0,0,0,30.0000,1,0,1,1,0,0
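
To make the one-hot encoding step concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` data is an assumption). It also shows the `drop_first` option, which the notebook does not use:

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Fare': [7.25, 71.28, 7.93],
})

# Only the string columns are expanded; the numeric column passes through
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())

# drop_first=True removes one redundant indicator per categorical feature
compact = pd.get_dummies(toy, dtype=int, drop_first=True)
print(compact.columns.tolist())
```

For a decision tree the redundant indicator columns are harmless, so keeping all of them, as the notebook does, is a reasonable choice.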


To do: 
- Store the features as variable X and targets as variable y
- Split the data into a training set and a testing set

In [11]:
from sklearn.model_selection import train_test_split as split

y = df2['Survived'].values
# drop() returns a new frame, so df2 stays intact and the cell can be re-run
X = df2.drop(columns='Survived').values
print(X.shape, y.shape)

X_train, X_test, y_train, y_test = split(X, y, random_state=42)

(889, 10) (889,)
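
One option the split above does not use is stratification, which keeps the class ratio the same in both subsets. A minimal sketch on hypothetical labels (the toy arrays and the explicit `test_size=0.25`, which is also scikit-learn's default, are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 negatives and 40 positives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy preserves the 60/40 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With the default test size of 0.25, the 889 rows above are divided into 666 training samples and 223 test samples.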


To do: Use grid search to fine-tune the hyperparameters of the Decision Tree Classifier (do this after Part 3)

In [12]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier().fit(X_train, y_train)
print(f'dtc accuracy: {dtc.score(X_test, y_test)}')

dtc accuracy: 0.7488789237668162
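
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below compares a decision tree with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The synthetic data and all variable names are assumptions, not the Titanic features:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: the label depends only on the first feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0.3).astype(int)

X_tr, X_te = X_toy[:150], X_toy[150:]
y_tr, y_te = y_toy[:150], y_toy[150:]

# Baseline that ignores the features and predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
tree_acc = tree.score(X_te, y_te)
print(f'baseline: {baseline_acc:.3f}, tree: {tree_acc:.3f}')
```

For the Titanic data, the mean of Survived in the summary above is about 0.38, so always predicting "did not survive" would be right for roughly 62% of all passengers.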


In [13]:
from sklearn.model_selection import GridSearchCV

# max_depth must be at least 1, so the range starts at 1 (scikit-learn rejects 0)
params = {
    'max_leaf_nodes': list(range(2, 11)),
    'max_depth': list(range(1, 11)),
}
gs = GridSearchCV(DecisionTreeClassifier(), params, cv=5, n_jobs=-1, verbose=2)
gs.fit(X_train, y_train)
print(gs.best_params_)

Fitting 5 folds for each of 90 candidates, totalling 450 fits
{'max_depth': 3, 'max_leaf_nodes': 7}
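
Beyond `best_params_`, a fitted grid search also exposes the cross-validated score of every candidate. The following sketch, on synthetic data with a deliberately small grid (both are assumptions for illustration), shows one way to inspect them:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the Titanic features)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Small illustrative grid: 3 x 3 = 9 candidates
grid = {'max_depth': [1, 2, 3], 'max_leaf_nodes': [2, 4, 8]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X_toy, y_toy)

# cv_results_ holds one row per candidate; rank 1 is the best mean CV score
results = pd.DataFrame(search.cv_results_)
top = results.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score']
].head(3)

print(search.best_score_)
print(top)
```

Note that `best_score_` is the mean accuracy over the validation folds of the training data, so it will generally differ from the accuracy measured on the held-out test set.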


In [15]:
dtc2 = DecisionTreeClassifier(**gs.best_params_).fit(X_train, y_train)
print(f'dtc2 accuracy: {dtc2.score(X_test, y_test)}')

dtc2 accuracy: 0.8161434977578476
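
Accuracy alone does not show which class the model gets wrong. As a possible next step, a confusion matrix and a per-class report can be produced as sketched below. The synthetic data and the fixed `max_depth=3` are assumptions for illustration, not the tuned Titanic model:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the Titanic features)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall and F1 score for each class
print(classification_report(y_te, y_pred))
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total equals the accuracy reported by `score()`.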
