# Titanic dataset - Binary classification

The Titanic dataset present a binary classification problem, where the goal is to predict whether a passenger survived or not.   
In this notebook, I will build a simple Random Forest model to predict the survival of passengers.

Will import the data cleaned in the previous step, with the following features:

| Variable | Definition                          | Value                        |
|----------|-------------------------------------|----------------------------|
| survival | Survival                            | 0=No, 1=Yes            |
| pclass   | Ticket class                        | 1=1st, 2=2nd, 3=3rd  |
| sex      | Sex                                 | 0=Female, 1=Male                           |
| Age      | Age in years                        |                            |
| sibsp    | # of siblings / spouses aboard the Titanic |                    |
| parch    | # of parents / children aboard the Titanic |                    |
| fare     | Passenger fare                      |                            |
| cabin    | Cabin number                        | 0=NaN/Unidentified, 1=Yes/Valid Cabin nr                           |
| embarked | Port of Embarkation                 | 0=Cherbourg, 1=Queenstown, 2=Southampton |

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.float_format', lambda x: '%.2f' % x)

## Load and inspect data

In [2]:
# load cleaned data
train_data = pd.read_csv('data/train_clean_subsetFeatures.csv')
test_data = pd.read_csv('data/test_clean_subsetFeatures.csv')

# print the shape of the data
print('Train dataset shape (rows, columns):', train_data.shape)
print('Test dataset shape (rows, columns):', test_data.shape)

train_data.head(3)

Train dataset shape (rows, columns): (891, 18)
Test dataset shape (rows, columns): (418, 17)


Unnamed: 0,PassengerId,Survived,Pclass_2,Pclass_3,Sex_1,SibSp_1,SibSp_2,SibSp_3,SibSp_4,SibSp_5,SibSp_8,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6,Parch_9
0,1,0,False,True,True,True,False,False,False,False,False,False,False,False,False,False,False,False
1,2,1,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
2,3,1,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [3]:
test_data.head(3)

Unnamed: 0,PassengerId,Pclass_2,Pclass_3,Sex_1,SibSp_1,SibSp_2,SibSp_3,SibSp_4,SibSp_5,SibSp_8,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6,Parch_9
0,892,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False
1,893,False,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False
2,894,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False


## Define models

1) Logistic Regression
2) Random Forest (subset of features)
3) Random Forest (all features)
4) Support Vector Machine (SVM)
5) Xtreme Gradient Boosting (XGBoost)

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

In [5]:
# load cleaned data
train_data = pd.read_csv('data/train_clean_allFeatures.csv')
test_data = pd.read_csv('data/test_clean_allFeatures.csv')

y = train_data["Survived"] # target variable

X = train_data.drop("Survived", axis=1)
X_test = test_data
X.head(3)

Unnamed: 0,PassengerId,Age,Fare,SibSp,Parch,Pclass_2,Pclass_3,Sex_1,Cabin_1,Embarked_1,Embarked_2
0,1,-0.58,-0.5,0.48,-0.44,False,True,True,False,False,True
1,2,0.66,0.73,0.48,-0.44,False,False,False,True,False,False
2,3,-0.27,-0.49,-0.48,-0.44,False,True,False,False,False,True


In [6]:
# Logistic Regression model
lr = LogisticRegression(max_iter = 2000, random_state=1)
cv_accuracy = cross_val_score(lr, X, y, cv=5, scoring='accuracy')
cv_f1 = cross_val_score(lr, X, y, cv=5, scoring='f1')

print(f'Cross-Validation Accuracy: {cv_accuracy} | Mean value: {round(cv_accuracy.mean(),3)}')
print(f'Cross-Validation F1:       {cv_f1} | Mean value: {round(cv_f1.mean(),3)}')

Cross-Validation Accuracy: [0.81005587 0.79213483 0.78651685 0.79213483 0.83146067] | Mean value: 0.802
Cross-Validation F1:       [0.74626866 0.72180451 0.72058824 0.69918699 0.7761194 ] | Mean value: 0.733


In [7]:
# Random Forest model - subset of features
train_data = pd.read_csv('data/train_clean_subsetFeatures.csv')
test_data = pd.read_csv('data/test_clean_subsetFeatures.csv')

y = train_data["Survived"] # target variable
X = train_data.drop("Survived", axis=1)
X_test = test_data

rf_subset = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
cv_accuracy = cross_val_score(rf_subset, X, y, cv=5, scoring='accuracy')
cv_f1 = cross_val_score(rf_subset, X, y, cv=5, scoring='f1')

print(f'Cross-Validation Accuracy: {cv_accuracy} | Mean value: {round(cv_accuracy.mean(),3)}')
print(f'Cross-Validation F1:       {cv_f1} | Mean value: {round(cv_f1.mean(),3)}')

Cross-Validation Accuracy: [0.6424581  0.80898876 0.80337079 0.7752809  0.80337079] | Mean value: 0.767
Cross-Validation F1:       [0.13513514 0.74242424 0.72       0.60784314 0.6728972 ] | Mean value: 0.576


In [9]:
# load cleaned data - back to using all features
train_data = pd.read_csv('data/train_clean_allFeatures.csv')
test_data = pd.read_csv('data/test_clean_allFeatures.csv')

y = train_data["Survived"] # target variable
X = train_data.drop("Survived", axis=1)
X_test = test_data

In [10]:
# Random Forest model - all features
rf = RandomForestClassifier(n_estimators=150, max_depth=10, random_state=1)
cv_accuracy = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
cv_f1 = cross_val_score(rf, X, y, cv=5, scoring='f1')

print(f'Cross-Validation Accuracy: {cv_accuracy} | Mean value: {round(cv_accuracy.mean(),3)}')
print(f'Cross-Validation F1:       {cv_f1} | Mean value: {round(cv_f1.mean(),3)}')

Cross-Validation Accuracy: [0.74301676 0.79213483 0.87078652 0.79213483 0.83146067] | Mean value: 0.806
Cross-Validation F1:       [0.55769231 0.71755725 0.82442748 0.69421488 0.765625  ] | Mean value: 0.712


In [11]:
# Support Vector Classifier (SVC) model
svc = SVC(probability = True)
cv_accuracy = cross_val_score(svc, X, y, cv=5, scoring='accuracy')
cv_f1 = cross_val_score(svc, X, y, cv=5, scoring='f1')

print(f'Cross-Validation Accuracy: {cv_accuracy} | Mean value: {round(cv_accuracy.mean(),3)}')
print(f'Cross-Validation F1:       {cv_f1} | Mean value: {round(cv_f1.mean(),3)}')

Cross-Validation Accuracy: [0.61452514 0.61797753 0.61797753 0.61797753 0.61235955] | Mean value: 0.616
Cross-Validation F1:       [0. 0. 0. 0. 0.] | Mean value: 0.0


In [12]:
# Xtreme Gradient Boosting (XGBoost) model
xgb = XGBClassifier(random_state =1)
cv_accuracy = cross_val_score(xgb, X, y, cv=5, scoring='accuracy')
cv_f1 = cross_val_score(xgb, X, y, cv=5, scoring='f1')

print(f'Cross-Validation Accuracy: {cv_accuracy} | Mean value: {round(cv_accuracy.mean(),3)}')
print(f'Cross-Validation F1:       {cv_f1} | Mean value: {round(cv_f1.mean(),3)}')

Cross-Validation Accuracy: [0.66480447 0.80898876 0.83707865 0.79213483 0.84269663] | Mean value: 0.789
Cross-Validation F1:       [0.25       0.74626866 0.78195489 0.74125874 0.78787879] | Mean value: 0.661


### Baseline Model Performance

| Model       | Accuracy | F1-score    |
|-------------|----------|-------------|
| Logistic Regression  | 80.2     | 73.3        |
| Random Forest subset | 76.7     | 57.6        |
| Random Forest all    | 80.6     | 71.2        |
| SVM                  | 61.6     | nan         |
| XGBoost              | 78.9     | 66.1        |


## Parameter tuning
Use GridSearchCV to find the best parameters for the models, while RandomizedSearchCV for Random Forest and XG Boost mopdels to reduce the computational time.


In [13]:
from sklearn.model_selection import GridSearchCV 
from sklearn.model_selection import RandomizedSearchCV 

In [14]:
# report stats for each model
def clf_performance(classifier, model_name):
    print(model_name)
    print('Best Score: ' + str(classifier.best_score_))
    print('Best Parameters: ' + str(classifier.best_params_))

In [15]:
# Logistic Regression model - hyperparameter tuning
lr = LogisticRegression(random_state = 1)
param_grid = {'max_iter': [2000],
                'C': np.logspace(-4, 4, 20),
                'penalty': ['l1', 'l2'],
                'solver': ['liblinear']}
# Create grid search using 5-fold cross validation
clf_lr = GridSearchCV(lr, param_grid=param_grid, cv=5, verbose=True, n_jobs=-1) # n_jobs=-1 means all processors
best_clf_lr = clf_lr.fit(X, y)
clf_performance(best_clf_lr,'Logistic Regression')

Fitting 5 folds for each of 40 candidates, totalling 200 fits
Logistic Regression
Best Score: 0.8047015253279769
Best Parameters: {'C': 0.615848211066026, 'max_iter': 2000, 'penalty': 'l2', 'solver': 'liblinear'}


In [16]:
# Random Forest model - hyperparameter tuning
# Because of the large potential search space, we 1) use RandomizedSearchCV to narrow down the search space
# and 2) use GridSearchCV on the smaller search space to find the best parameters

# 1) RandomizedSearchCV - to narrow down search space
rf = RandomForestClassifier(random_state = 1)
param_grid =  {'n_estimators': [100,500,1000], 
                'bootstrap': [True,False],
                'max_depth': [3,5,10,20,50,75,100,None],
                'max_features': ['auto','sqrt'],
                'min_samples_leaf': [1,2,4,10],
                'min_samples_split': [2,5,10]}

clf_rf_rnd = RandomizedSearchCV(rf, param_distributions=param_grid, n_iter=100, cv=5, verbose=True, n_jobs=-1)
best_rf_rnd = clf_rf_rnd.fit(X, y)
clf_performance(best_rf_rnd,'Random Forest')

Fitting 5 folds for each of 100 candidates, totalling 500 fits
Random Forest
Best Score: 0.8294143493817087
Best Parameters: {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 20, 'bootstrap': True}


240 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
229 fits failed with the following error:
Traceback (most recent call last):
  File "c:\ProgramData\miniconda3\envs\kaggle_ML_env\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\ProgramData\miniconda3\envs\kaggle_ML_env\Lib\site-packages\sklearn\base.py", line 1467, in wrapper
    estimator._validate_params()
  File "c:\ProgramData\miniconda3\envs\kaggle_ML_env\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "c:\ProgramData\miniconda3\envs\kaggle_ML_env\Lib\site-packages\sklearn\utils\_param_va

In [17]:
# Random Forest model
# 2) GridSearchCV on smaller search space
param_grid =  {'n_estimators': [400, 425],
                'bootstrap': [True],
                'max_depth': [4,5,6],
                'max_features': ['sqrt'],
                'min_samples_leaf': [2,3,4],
                'min_samples_split': [2,3,4]}
clf_rf = GridSearchCV(rf, param_grid=param_grid, cv=5, verbose=True, n_jobs=-1)
best_clf_rf = clf_rf.fit(X, y)
clf_performance(best_clf_rf,'Random Forest')

Fitting 5 folds for each of 54 candidates, totalling 270 fits
Random Forest
Best Score: 0.831648986253217
Best Parameters: {'bootstrap': True, 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 400}


In [18]:
# Support Vector Classifier (SVC) model - hyperparameter tuning
svc = SVC(probability = True)
param_grid = tuned_parameters = [
   #  {'kernel': ['rbf'], 'gamma': [0.0001, 0.1], 'C': [1e-06, 0.001, 0.1]},
   #  {'kernel': ['linear'], 'C': [1]}
    {'kernel': ['poly'], 'degree':[2], 'C': [1e04]}
]

clf_svc = GridSearchCV(svc, param_grid=param_grid, cv=5, verbose=True, n_jobs=-1)
best_clf_svc = clf_svc.fit(X, y)
clf_performance(best_clf_svc,'SVC')

Fitting 5 folds for each of 1 candidates, totalling 5 fits
SVC
Best Score: 0.7262507061703596
Best Parameters: {'C': 10000.0, 'degree': 2, 'kernel': 'poly'}


In [19]:
# XGBoost model - hyperparameter tuning

# 1) RandomizedSearchCV - to narrow down search space
xgb = XGBClassifier(random_state = 1)
param_grid = {
    'n_estimators': [15, 20, 25],
    'colsample_bytree': [0.6, 0.7],
    'max_depth': [12, 15, 18],
    'reg_alpha': [0, 0.1],
    'reg_lambda': [1.8, 2, 2.2],
    'subsample': [0.7, 0.9],
    'learning_rate':[.1, 0.3],
    'gamma':[0.6, 0.9],
    'min_child_weight':[2, 3, 4],
    # 'sampling_method': ['uniform']
}
clf_xgb_rnd = RandomizedSearchCV(xgb, param_distributions = param_grid, n_iter = 1000, cv = 4, verbose = True, n_jobs = -1)
best_clf_xgb_rnd = clf_xgb_rnd.fit(X,y)
clf_performance(best_clf_xgb_rnd,'XGB')

Fitting 4 folds for each of 1000 candidates, totalling 4000 fits
XGB
Best Score: 0.8283187896416596
Best Parameters: {'subsample': 0.7, 'reg_lambda': 2, 'reg_alpha': 0, 'n_estimators': 25, 'min_child_weight': 3, 'max_depth': 12, 'learning_rate': 0.1, 'gamma': 0.6, 'colsample_bytree': 0.6}


In [22]:
# XGBoost model - hyperparameter tuning

# 2) GridSearchCV on smaller search space
xgb = XGBClassifier(random_state = 1)

param_grid = {
    'n_estimators': [20, 22, 25],
    'colsample_bytree': [0.5, 0.6, 0.65],
    'max_depth': [10, 11, 13],
    'reg_alpha': [0, 0.01],
    'reg_lambda': [1.8, 2.0, 2.3],
    'subsample': [0.70, 0.8, 0.90],
    'learning_rate':[0.25, 0.3, 0.35],
    'gamma':[0.3, 0.4, 0.5],
    'min_child_weight':[3, 4, 5],
    'sampling_method': ['uniform']
}
clf_xgb = GridSearchCV(xgb, param_grid = param_grid, cv = 4, verbose = True, n_jobs = -1)
best_clf_xgb = clf_xgb.fit(X,y)
clf_performance(best_clf_xgb,'XGB')

Fitting 4 folds for each of 13122 candidates, totalling 52488 fits
XGB
Best Score: 0.8406556780996244
Best Parameters: {'colsample_bytree': 0.6, 'gamma': 0.5, 'learning_rate': 0.3, 'max_depth': 10, 'min_child_weight': 4, 'n_estimators': 22, 'reg_alpha': 0, 'reg_lambda': 2.0, 'sampling_method': 'uniform', 'subsample': 0.8}



### Tuned Model Performance

| Model                | Accuracy | Accuracy_tuned |  Accuracy Kaggle test_set | 
|----------------------|----------|----------------|-----------|
| Logistic Regression  | 80.2     | 80.5           | 74.9    |
| Random Forest subset | 76.7      | 57.6           | 77.8    |
| Random Forest all    | 80.6     | 83.2           | 76.315    |
| SVM                  | 61.6     | 72.6           | 67.0    |
| XGBoost              | 78.9      | 84.1           | 76.1    |

In [23]:
# Save predictions from the models
# Logistic Regression
predictions_lr = best_clf_lr.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions_lr})
output.to_csv('submissions/logistic_regression.csv', index=False)

# Random Forest
predictions_rf = best_clf_rf.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions_rf})
output.to_csv('submissions/random_forest_3.csv', index=False)

# SVC
predictions_svc = best_clf_svc.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions_svc})
output.to_csv('submissions/svc.csv', index=False)

# XGBoost
predictions_xgb = best_clf_xgb.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions_xgb})
output.to_csv('submissions/xgboost.csv', index=False)