# 1. Introduction

This mini project uses a gradient boosting classifier to predict whether a Titanic passenger survived the ship's sinking. The dataset contains demographic and ticket information (such as age, sex, passenger class, and fare) for Titanic passengers.

# 2. Sourcing and Loading

## 2a. Import relevant libraries

In [52]:
# Data handling and numerics
import numpy as np
import pandas as pd
import scipy

# Plotting and notebook display
import matplotlib.pyplot as plt
from IPython.display import Image
%matplotlib inline

# Modeling and evaluation
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (auc, classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.preprocessing import scale

## 2b. Load data

In [10]:
df = pd.read_csv('titanic.csv')
# Drop every row that has a missing value in any column.
df = df.dropna()
# Confirm that no missing values remain.
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
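
Because `dropna()` removes any row with at least one missing value, it can be worth checking how many rows each column would cost before dropping. The following is a minimal sketch on a made-up toy frame; the values and the name `toy` are illustrative assumptions, not taken from `titanic.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, "E46"],
    "Fare": [7.25, 71.28, 8.05, 51.86],
})

# Missing values per column, counted before any rows are dropped.
missing_before = toy.isnull().sum()

# dropna() keeps only the rows that are complete in every column.
complete = toy.dropna()
rows_lost = len(toy) - len(complete)

print(missing_before)
print(f"rows kept: {len(complete)}, rows lost: {rows_lost}")
```

If a single sparse column accounts for most of the lost rows, dropping that column first (or passing `subset=` to `dropna`) may preserve more data.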

## 2c. Preliminary data analysis

In [30]:
dfo = df.select_dtypes(include=['object'])
dfo.nunique()

Name        183
Sex           2
Ticket      127
Cabin       133
Embarked      3
dtype: int64

# 3. Modeling

## 3a. Create dummy features

In [31]:
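# One-hot encode Sex and Embarked; drop the high-cardinality text columns and the ID.
dummies = pd.get_dummies(dfo.drop(['Name', 'Cabin', 'Ticket'], axis=1))
df = df.drop(dfo.columns, axis=1).merge(dummies, left_index=True, right_index=True)
df = df.drop(['PassengerId'], axis=1)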
print(df.shape)
df.head()

(183, 11)


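|    | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 1 | 0 | 0 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 0 | 1 |
| 6 | 0 | 1 | 54.0 | 0 | 0 | 51.8625 | 0 | 1 | 0 | 0 | 1 |
| 10 | 1 | 3 | 4.0 | 1 | 1 | 16.7 | 1 | 0 | 0 | 0 | 1 |
| 11 | 1 | 1 | 58.0 | 0 | 0 | 26.55 | 1 | 0 | 0 | 0 | 1 |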


In [32]:
df.isnull().sum()

Survived      0
Pclass        0
Age           0
SibSp         0
Parch         0
Fare          0
Sex_female    0
Sex_male      0
Embarked_C    0
Embarked_Q    0
Embarked_S    0
dtype: int64
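
The dummy columns produced above are redundant within each group: `Sex_male` is always the complement of `Sex_female`. Tree-based models tolerate this, but if a leaner feature set is wanted, `pd.get_dummies` accepts `drop_first=True`. A small sketch on an invented toy frame (the name `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy categorical data; values are invented for illustration.
toy = pd.DataFrame({
    "Sex": ["female", "male", "female"],
    "Embarked": ["C", "S", "Q"],
})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(toy)

# drop_first=True omits the first category of each column,
# which is implied when all remaining dummies are zero.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```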

## 3b. Create the X and y matrices from the dataframe, where y = df.Survived

In [40]:
X = df.drop(columns=['Survived'])
y = df.Survived

## 3c. Scale data

In [45]:
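# Standardize each feature to zero mean and unit variance.
# The statistics are computed on the full dataset, before the train/test split.
X_scaled = scale(X)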

## 3d. Train/test split

In [47]:
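from sklearn.model_selection import train_test_split
# Hold out 25% of the rows for evaluation. No random_state is set, so the split differs between runs.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25)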
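
Scaling the whole dataset before splitting lets the test rows influence the scaling statistics. One possible alternative, sketched here on synthetic data, is to split first with a fixed `random_state` and then fit a `StandardScaler` on the training rows only. The arrays `X_demo` and `y_demo` and the seed values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 100 rows, 3 numeric features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first; random_state makes the split repeatable and
# stratify keeps the class balance similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

Gradient boosted trees split on feature thresholds, so scaling rarely changes their predictions; the ordering matters more when the same pipeline is reused with scale-sensitive models.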

## 3e. Learning rate experimentation

In [48]:
learning_rates = [0.05, 0.1, 0.25, 0.5, 0.75, 1]
for learning_rate in learning_rates:
    gb = GradientBoostingClassifier(n_estimators=20, learning_rate = learning_rate, max_features=2, max_depth = 2, random_state = 0)
    gb.fit(X_train, y_train)
    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb.score(X_train, y_train)))
    print("Accuracy score (validation): {0:.3f}".format(gb.score(X_test, y_test)))
    print()

Learning rate:  0.05
Accuracy score (training): 0.788
Accuracy score (validation): 0.783

Learning rate:  0.1
Accuracy score (training): 0.825
Accuracy score (validation): 0.826

Learning rate:  0.25
Accuracy score (training): 0.854
Accuracy score (validation): 0.674

Learning rate:  0.5
Accuracy score (training): 0.847
Accuracy score (validation): 0.739

Learning rate:  0.75
Accuracy score (training): 0.891
Accuracy score (validation): 0.717

Learning rate:  1
Accuracy score (training): 0.912
Accuracy score (validation): 0.696



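A learning rate of 0.1 gives the highest validation accuracy (0.826), and it nearly matches the training accuracy (0.825). At higher learning rates, training accuracy generally rises while validation accuracy drops to between 0.674 and 0.739, which suggests overfitting.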
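
The comparison above picks the learning rate using the same held-out rows that are later used to report performance. A hedged alternative is to choose it by cross-validation on the training rows and keep the test rows for the final score only. The sketch below uses `GridSearchCV` on synthetic data from `make_classification`; the data, the 5-fold setting, and the seeds are assumptions, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption), not the Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Same candidate learning rates as in the loop above.
param_grid = {"learning_rate": [0.05, 0.1, 0.25, 0.5, 0.75, 1.0]}

# 5-fold cross-validation on the training rows selects the learning rate.
search = GridSearchCV(
    GradientBoostingClassifier(
        n_estimators=20, max_features=2, max_depth=2, random_state=0
    ),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# The test rows are used once, to score the refitted best model.
test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(f"Test accuracy: {test_accuracy:.3f}")
```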

## 3f. Model with optimal learning rate

In [50]:
gb = GradientBoostingClassifier(n_estimators=20, learning_rate = 0.1, max_features=2, max_depth = 2, random_state = 0)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[ 8  6]
 [ 2 30]]
              precision    recall  f1-score   support

           0       0.80      0.57      0.67        14
           1       0.83      0.94      0.88        32

    accuracy                           0.83        46
   macro avg       0.82      0.75      0.77        46
weighted avg       0.82      0.83      0.82        46
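
The figures in the classification report follow directly from the confusion matrix printed above. As a worked check, this sketch recomputes them with NumPy, using scikit-learn's convention that rows are true labels and columns are predicted labels. The `baseline` value, the accuracy of always predicting class 1, is an added comparison and not part of the original output.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[8, 6],
               [2, 30]])
tn, fp, fn, tp = cm.ravel()

precision_0 = tn / (tn + fn)      # 8 / 10
recall_0 = tn / (tn + fp)         # 8 / 14
precision_1 = tp / (tp + fp)      # 30 / 36
recall_1 = tp / (tp + fn)         # 30 / 32
accuracy = (tp + tn) / cm.sum()   # 38 / 46

# Accuracy of always predicting "survived" (class 1): 32 / 46.
baseline = (tp + fn) / cm.sum()

print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"accuracy={accuracy:.2f}, majority-class baseline={baseline:.2f}")
```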



## 3g. ROC AUC calculation

In [53]:
y_pred_prob = gb.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_pred_prob)

0.8482142857142856
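
`roc_auc_score` returns a single number; the underlying curve can be obtained with `roc_curve` and integrated with `auc`, both of which are imported at the top of the notebook. A minimal sketch with invented labels and scores (the arrays `y_true` and `scores` are assumptions for illustration) shows that the two routes agree:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Invented labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the curve via the trapezoidal rule.
area = auc(fpr, tpr)

print(f"AUC from roc_curve + auc: {area:.4f}")
print(f"AUC from roc_auc_score:   {roc_auc_score(y_true, scores):.4f}")
```

In the notebook, passing `fpr` and `tpr` to `plt.plot` would draw the curve for the fitted model.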

# 4. Conclusion

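On the 46-passenger test set, the gradient boosting model reaches about 83% accuracy and an ROC AUC of about 0.85, so it predicts survival reasonably well. For context, 32 of the 46 test passengers survived, so always predicting survival would score about 70%, and the model correctly identifies only 57% of the passengers who died.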