# Titanic dataset - Binary classification

The Titanic dataset present a binary classification problem, where the goal is to predict whether a passenger survived or not.   
In this notebook, I will build a simple Random Forest model to predict the survival of passengers.

Will import the data cleaned in the previous step, with the following features:

| Variable | Definition                          | Value                        |
|----------|-------------------------------------|----------------------------|
| survival | Survival                            | 0=No, 1=Yes            |
| pclass   | Ticket class                        | 1=1st, 2=2nd, 3=3rd  |
| sex      | Sex                                 | 0=Female, 1=Male                           |
| Age      | Age in years                        |                            |
| sibsp    | # of siblings / spouses aboard the Titanic |                    |
| parch    | # of parents / children aboard the Titanic |                    |
| fare     | Passenger fare                      |                            |
| cabin    | Cabin number                        | 0=NaN/Unidentified, 1=Yes/Valid Cabin nr                           |
| embarked | Port of Embarkation                 | 0=Cherbourg, 1=Queenstown, 2=Southampton |

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.float_format', lambda x: '%.2f' % x)

## Load and inspect data

In [2]:
# load cleaned data
train_data = pd.read_csv('data/train_clean_subsetFeatures.csv')
test_data = pd.read_csv('data/test_clean_subsetFeatures.csv')

# print the shape of the data
print('Train dataset shape (rows, columns):', train_data.shape)
print('Test dataset shape (rows, columns):', test_data.shape)

train_data.head(3)

Train dataset shape (rows, columns): (891, 18)
Test dataset shape (rows, columns): (418, 17)


Unnamed: 0,PassengerId,Survived,Pclass_2,Pclass_3,Sex_1,SibSp_1,SibSp_2,SibSp_3,SibSp_4,SibSp_5,SibSp_8,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6,Parch_9
0,1,0,False,True,True,True,False,False,False,False,False,False,False,False,False,False,False,False
1,2,1,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
2,3,1,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [3]:
test_data.head(3)

Unnamed: 0,PassengerId,Pclass_2,Pclass_3,Sex_1,SibSp_1,SibSp_2,SibSp_3,SibSp_4,SibSp_5,SibSp_8,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6,Parch_9
0,892,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False
1,893,False,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False
2,894,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False


## Define models

1) Logistic Regression
2) Random Forest (subset of features)
3) Random Forest (all features)
4) Support Vector Machine (SVM)
5) Xtreme Gradient Boosting (XGBoost)

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

In [5]:
# load cleaned data
train_data = pd.read_csv('data/train_clean_allFeatures.csv')
test_data = pd.read_csv('data/test_clean_allFeatures.csv')

y = train_data["Survived"] # target variable

X = train_data.drop(["Survived", "PassengerId"], axis=1)
X_test = test_data.drop(["PassengerId"], axis=1)
X.head(3)

Unnamed: 0,Age,Fare,SibSp,Parch,Pclass_2,Pclass_3,Sex_1,Cabin_1,Embarked_1,Embarked_2
0,-0.58,-0.5,0.48,-0.44,False,True,True,False,False,True
1,0.66,0.73,0.48,-0.44,False,False,False,True,False,False
2,-0.27,-0.49,-0.48,-0.44,False,True,False,False,False,True


In [6]:
# Logistic Regression model
lr = LogisticRegression(max_iter = 2000, random_state=1)
cv_accuracy = cross_val_score(lr, X, y, cv=5, scoring='accuracy')
cv_f1 = cross_val_score(lr, X, y, cv=5, scoring='f1')

print(f'Cross-Validation Accuracy: {cv_accuracy} | Mean value: {round(cv_accuracy.mean(),3)}')
print(f'Cross-Validation F1:       {cv_f1} | Mean value: {round(cv_f1.mean(),3)}')

Cross-Validation Accuracy: [0.81564246 0.79213483 0.78651685 0.79213483 0.83707865] | Mean value: 0.805
Cross-Validation F1:       [0.75912409 0.72180451 0.72058824 0.69918699 0.77165354] | Mean value: 0.734


In [7]:
# Random Forest model - subset of features
train_data = pd.read_csv('data/train_clean_subsetFeatures.csv')
test_data = pd.read_csv('data/test_clean_subsetFeatures.csv')

y = train_data["Survived"] # target variable
X = train_data.drop(["Survived", "PassengerId"], axis=1)
X_test = test_data.drop(["PassengerId"], axis=1)

rf_subset = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
cv_accuracy = cross_val_score(rf_subset, X, y, cv=5, scoring='accuracy')
cv_f1 = cross_val_score(rf_subset, X, y, cv=5, scoring='f1')

print(f'Cross-Validation Accuracy: {cv_accuracy} | Mean value: {round(cv_accuracy.mean(),3)}')
print(f'Cross-Validation F1:       {cv_f1} | Mean value: {round(cv_f1.mean(),3)}')

Cross-Validation Accuracy: [0.73743017 0.80898876 0.80337079 0.76966292 0.81460674] | Mean value: 0.787
Cross-Validation F1:       [0.52525253 0.73846154 0.72440945 0.66666667 0.74015748] | Mean value: 0.679


In [8]:
# load cleaned data - back to using all features
train_data = pd.read_csv('data/train_clean_allFeatures.csv')
test_data = pd.read_csv('data/test_clean_allFeatures.csv')

y = train_data["Survived"] # target variable
X = train_data.drop(["Survived", "PassengerId"], axis=1)
X_test = test_data.drop(["PassengerId"], axis=1)

In [9]:
# Random Forest model - all features
rf = RandomForestClassifier(n_estimators=150, max_depth=10, random_state=1)
cv_accuracy = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
cv_f1 = cross_val_score(rf, X, y, cv=5, scoring='f1')

print(f'Cross-Validation Accuracy: {cv_accuracy} | Mean value: {round(cv_accuracy.mean(),3)}')
print(f'Cross-Validation F1:       {cv_f1} | Mean value: {round(cv_f1.mean(),3)}')

Cross-Validation Accuracy: [0.80446927 0.78651685 0.87078652 0.79775281 0.84831461] | Mean value: 0.822
Cross-Validation F1:       [0.73282443 0.71212121 0.82170543 0.69491525 0.80291971] | Mean value: 0.753


In [10]:
# Support Vector Classifier (SVC) model
svc = SVC(probability = True)
cv_accuracy = cross_val_score(svc, X, y, cv=5, scoring='accuracy')
cv_f1 = cross_val_score(svc, X, y, cv=5, scoring='f1')

print(f'Cross-Validation Accuracy: {cv_accuracy} | Mean value: {round(cv_accuracy.mean(),3)}')
print(f'Cross-Validation F1:       {cv_f1} | Mean value: {round(cv_f1.mean(),3)}')

Cross-Validation Accuracy: [0.82681564 0.82022472 0.79775281 0.81460674 0.85955056] | Mean value: 0.824
Cross-Validation F1:       [0.76691729 0.75384615 0.72307692 0.72268908 0.80916031] | Mean value: 0.755


In [11]:
# Xtreme Gradient Boosting (XGBoost) model
xgb = XGBClassifier(random_state =1)
cv_accuracy = cross_val_score(xgb, X, y, cv=5, scoring='accuracy')
cv_f1 = cross_val_score(xgb, X, y, cv=5, scoring='f1')

print(f'Cross-Validation Accuracy: {cv_accuracy} | Mean value: {round(cv_accuracy.mean(),3)}')
print(f'Cross-Validation F1:       {cv_f1} | Mean value: {round(cv_f1.mean(),3)}')

Cross-Validation Accuracy: [0.79329609 0.82022472 0.84831461 0.78089888 0.85955056] | Mean value: 0.82
Cross-Validation F1:       [0.72592593 0.75757576 0.8        0.69291339 0.81751825] | Mean value: 0.759


In [12]:
# Gradient Boosting Classifier
gbc = GradientBoostingClassifier(random_state =1)
cv_accuracy = cross_val_score(gbc, X, y, cv=5, scoring='accuracy')
cv_f1 = cross_val_score(gbc, X, y, cv=5, scoring='f1')

print(f'Cross-Validation Accuracy: {cv_accuracy} | Mean value: {round(cv_accuracy.mean(),3)}')
print(f'Cross-Validation F1:       {cv_f1} | Mean value: {round(cv_f1.mean(),3)}')

Cross-Validation Accuracy: [0.81005587 0.79775281 0.85393258 0.79213483 0.85955056] | Mean value: 0.823
Cross-Validation F1:       [0.734375   0.72727273 0.79365079 0.69918699 0.81751825] | Mean value: 0.754


### Baseline Model Performance

| Model       | Accuracy | F1-score    |
|-------------|----------|-------------|
| Logistic Regression  | 80.5     | 73.4        |
| Random Forest subset | 78.7     | 67.9        |
| Random Forest all    | 82.2     | 75.3        |
| SVM                  | 62.4     | 75.5        |
| XGBoost              | 82.0     | 75.9        |
| Gradient Boosting    | 82.3     | 75.4        |


## Parameter tuning
Use GridSearchCV to find the best parameters for the models, while RandomizedSearchCV for Random Forest and XG Boost mopdels to reduce the computational time.


In [13]:
from sklearn.model_selection import GridSearchCV 
from sklearn.model_selection import RandomizedSearchCV 

In [14]:
# report stats for each model
def clf_performance(classifier, model_name):
    print(model_name)
    print('Best Score: ' + str(classifier.best_score_))
    print('Best Parameters: ' + str(classifier.best_params_))

In [15]:
# Logistic Regression model - hyperparameter tuning
lr = LogisticRegression(random_state = 1)
param_grid = {'max_iter': [2000],
                'C': np.logspace(-4, 2, 20),
                'penalty': ['l2'],
                'solver': ['liblinear']}
# Create grid search using 5-fold cross validation
clf_lr = GridSearchCV(lr, param_grid=param_grid, cv=10, verbose=True, n_jobs=-1) # n_jobs=-1 means all processors
best_clf_lr = clf_lr.fit(X, y)
clf_performance(best_clf_lr,'Logistic Regression')

Fitting 10 folds for each of 20 candidates, totalling 200 fits
Logistic Regression
Best Score: 0.8125842696629213
Best Parameters: {'C': 0.14384498882876628, 'max_iter': 2000, 'penalty': 'l2', 'solver': 'liblinear'}


Best Score: 0.79686774213797
Best Parameters: {'C': 29.763514416313132, 'max_iter': 2000, 'penalty': 'l1', 'solver': 'liblinear'}

Best Score: 0.79686774213797
Best Parameters: {'C': 23.357214690901213, 'max_iter': 2000, 'penalty': 'l1', 'solver': 'liblinear'}

Best Score: 0.7957441466323519
Best Parameters: {'C': 11.288378916846883, 'max_iter': 2000, 'penalty': 'l2', 'solver': 'liblinear'}

Best Score: 0.7957303370786517
Best Parameters: {'C': 2.6366508987303554, 'max_iter': 2000, 'penalty': 'l2', 'solver': 'liblinear'}


In [16]:
# Random Forest model - hyperparameter tuning (subset of features)
train_data = pd.read_csv('data/train_clean_subsetFeatures.csv')
test_data = pd.read_csv('data/test_clean_subsetFeatures.csv')

y = train_data["Survived"] # target variable
X = train_data.drop(["Survived", "PassengerId"], axis=1)
X_test = test_data.drop(["PassengerId"], axis=1)

# Because of the large potential search space, we 1) use RandomizedSearchCV to narrow down the search space
# and 2) use GridSearchCV on the smaller search space to find the best parameters

# # 1) RandomizedSearchCV - to narrow down search space
# rf_subset = RandomForestClassifier(random_state = 1)
# param_grid =  {'n_estimators': [100,500,1000], 
#                 'bootstrap': [True,False],
#                 'max_depth': [3,5,10,20,50,75],
#                 'max_features': ['auto','sqrt'],
#                 'min_samples_leaf': [1,2,4,10],
#                 'min_samples_split': [2,5,10]}

# clf_rf_subset_rnd = RandomizedSearchCV(rf_subset, param_distributions=param_grid, n_iter=100, cv=5, verbose=True, n_jobs=-1)
# best_rf_subset_rnd = clf_rf_subset_rnd.fit(X, y)
# clf_performance(best_rf_subset_rnd,'Random Forest')

In [17]:
# Random Forest model (subset of features)
# 2) GridSearchCV on smaller search space
param_grid =  {'n_estimators': [400, 500],
                'bootstrap': [True],
                'max_depth': [3,4,5],
                'max_features': ['sqrt'],
                'min_samples_leaf': [2,3,4],
                'min_samples_split': [2,3]}
clf_rf = GridSearchCV(rf, param_grid=param_grid, cv=10, verbose=True, n_jobs=-1)
best_clf_rf = clf_rf.fit(X, y)
clf_performance(best_clf_rf,'Random Forest')

Fitting 10 folds for each of 36 candidates, totalling 360 fits
Random Forest
Best Score: 0.8002247191011236
Best Parameters: {'bootstrap': True, 'max_depth': 4, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 400}


Best Score: 0.7879354717218002
Best Parameters: {'bootstrap': True, 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300}

Best Score: 0.8013483146067415
Best Parameters: {'bootstrap': True, 'max_depth': 6, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 400}

Best Score: 0.8002247191011236
Best Parameters: {'bootstrap': True, 'max_depth': 4, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 400}

In [18]:
# Random Forest model - hyperparameter tuning (all features)
# load cleaned data - back to using all features
train_data = pd.read_csv('data/train_clean_allFeatures.csv')
test_data = pd.read_csv('data/test_clean_allFeatures.csv')

y = train_data["Survived"] # target variable
X = train_data.drop(["Survived", "PassengerId"], axis=1)
X_test = test_data.drop(["PassengerId"], axis=1)

# Because of the large potential search space, we 1) use RandomizedSearchCV to narrow down the search space
# and 2) use GridSearchCV on the smaller search space to find the best parameters

# # 1) RandomizedSearchCV - to narrow down search space
# rf = RandomForestClassifier(random_state = 1)
# param_grid =  {'n_estimators': [300, 400, 425],
#                 'bootstrap': [True],
#                 'max_depth': [4,5,6],
#                 'max_features': ['sqrt'],
#                 'min_samples_leaf': [2,3],
#                 'min_samples_split': [2,3,4]}

# clf_rf_rnd = RandomizedSearchCV(rf, param_distributions=param_grid, n_iter=100, cv=5, verbose=True, n_jobs=-1)
# best_rf_rnd = clf_rf_rnd.fit(X, y)
# clf_performance(best_rf_rnd,'Random Forest')

In [19]:
# Random Forest model (all features)
# 2) GridSearchCV on smaller search space
param_grid =  {'n_estimators': [400, 500],
                'bootstrap': [True],
                'max_depth': [4,5,6],
                'max_features': ['sqrt'],
                'min_samples_leaf': [2,3,4],
                'min_samples_split': [2,3]}
clf_rf = GridSearchCV(rf, param_grid=param_grid, cv=10, verbose=True, n_jobs=-1)
best_clf_rf = clf_rf.fit(X, y)
clf_performance(best_clf_rf,'Random Forest')

Fitting 10 folds for each of 36 candidates, totalling 360 fits
Random Forest
Best Score: 0.8260549313358302
Best Parameters: {'bootstrap': True, 'max_depth': 4, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 400}


Best Score: 0.831648986253217
Best Parameters: {'bootstrap': True, 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 400}

Best Score: 0.8249387985688281
Best Parameters: {'bootstrap': True, 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 400}

Best Score: 0.8260549313358302
Best Parameters: {'bootstrap': True, 'max_depth': 4, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 400}



In [20]:
# Support Vector Classifier (SVC) model - hyperparameter tuning
svc = SVC(probability = True)
param_grid = tuned_parameters = [
    {'kernel': ['rbf'], 'degree':[2], 'C': [0.5, 0.7, 1.0]},
    #  {'kernel': ['linear'], 'C': [1]}
    # {'kernel': ['poly'], 'degree':[2], 'C': [1e03]}
]
clf_svc = GridSearchCV(svc, param_grid=param_grid, cv=10, verbose=True, n_jobs=-1)
best_clf_svc = clf_svc.fit(X, y)
clf_performance(best_clf_svc,'SVC')

Fitting 10 folds for each of 3 candidates, totalling 30 fits
SVC
Best Score: 0.8215730337078652
Best Parameters: {'C': 1.0, 'degree': 2, 'kernel': 'rbf'}


Best Score: 0.823 | 
Best Parameters: {'C': 1.5, 'degree': 2, 'kernel': 'rbf'}

Best Score: 0.822 | 
Best Parameters: {'C': 1.0, 'degree': 2, 'kernel': 'rbf'}

Best Score: 0.798 | 
Best Parameters: {'C': 0.5, 'degree': 2, 'kernel': 'rbf'}

In [21]:
# XGBoost model - hyperparameter tuning

# 1) RandomizedSearchCV - to narrow down search space
xgb = XGBClassifier(random_state = 1)
param_grid = {
    'n_estimators': [150],
    'colsample_bytree': [0.3, 0.5],
    'max_depth': [9, 10, 11],
    'reg_alpha': [0.5, 0.9],
    'reg_lambda': [3, 4, 5],
    'subsample': [0.3, 0.4],
    'learning_rate':[0,2, 0.25],
    'gamma': [0.6, 0.8, 1],
    'min_child_weight':[2, 3],
    # 'sampling_method': ['uniform']
}
clf_xgb_rnd = RandomizedSearchCV(xgb, param_distributions = param_grid, n_iter = 1000, cv = 10, verbose = True, n_jobs = -1)
best_clf_xgb_rnd = clf_xgb_rnd.fit(X,y)
clf_performance(best_clf_xgb_rnd,'XGB')

Fitting 10 folds for each of 1000 candidates, totalling 10000 fits
XGB
Best Score: 0.8350686641697879
Best Parameters: {'subsample': 0.3, 'reg_lambda': 3, 'reg_alpha': 0.5, 'n_estimators': 150, 'min_child_weight': 2, 'max_depth': 10, 'learning_rate': 0.25, 'gamma': 0.6, 'colsample_bytree': 0.5}


Best Score: 0.8406556780996244
Best Parameters: {'colsample_bytree': 0.6, 'gamma': 0.5, 'learning_rate': 0.3, 'max_depth': 10, 'min_child_weight': 4, 'n_estimators': 22, 'reg_alpha': 0, 'reg_lambda': 2.0, 'subsample': 0.8}

Best Score: 0.8406741573033708
Best Parameters: {'subsample': 0.9, 'reg_lambda': 1.8, 'reg_alpha': 0.1, 'n_estimators': 25, 'min_child_weight': 2, 'max_depth': 12, 'learning_rate': 0.3, 'gamma': 0.6, 'colsample_bytree': 0.7}

Best Score: 0.8451560549313358
Best Parameters: {'subsample': 0.5, 'reg_lambda': 2, 'reg_alpha': 0.001, 'n_estimators': 50, 'min_child_weight': 4, 'max_depth': 10, 'learning_rate': 0.3, 'gamma': 1, 'colsample_bytree': 0.6}

Best Score: 0.8418102372034955
Best Parameters: {'subsample': 0.5, 'reg_lambda': 2.2, 'reg_alpha': 0.01, 'n_estimators': 100, 'min_child_weight': 3, 'max_depth': 10, 'learning_rate': 0.3, 'gamma': 0.6, 'colsample_bytree': 0.6}

Best Score: 0.8451810237203496
Best Parameters: {'subsample': 0.4, 'reg_lambda': 3, 'reg_alpha': 0.1, 'n_estimators': 100, 'min_child_weight': 3, 'max_depth': 10, 'learning_rate': 0.3, 'gamma': 0.8, 'colsample_bytree': 0.6}

Best Score: 0.8350686641697879
Best Parameters: {'subsample': 0.3, 'reg_lambda': 3, 'reg_alpha': 0.5, 'n_estimators': 150, 'min_child_weight': 2, 'max_depth': 11, 'learning_rate': 0.25, 'gamma': 0.6, 'colsample_bytree': 0.5}

In [22]:
# XGBoost model - hyperparameter tuning

# 2) GridSearchCV on smaller search space
xgb = XGBClassifier(random_state = 1)
param_grid = {
    'n_estimators': [150],
    'colsample_bytree': [0.3, 0.5],
    'max_depth': [9, 10, 11],
    'reg_alpha': [0.5, 0.6],
    'reg_lambda': [3, 4],
    'subsample': [0.3, 0.4],
    'learning_rate':[0,2, 0.25],
    'gamma': [0.6, 0.8, 1],
    'min_child_weight':[2, 3],
    # 'sampling_method': ['uniform']
}
clf_xgb = GridSearchCV(xgb, param_grid=param_grid, cv=10, verbose=True, n_jobs=-1)
best_clf_xgb = clf_xgb.fit(X,y)
clf_performance(best_clf_xgb,'XGB')

Fitting 10 folds for each of 864 candidates, totalling 8640 fits
XGB
Best Score: 0.8350686641697879
Best Parameters: {'colsample_bytree': 0.5, 'gamma': 0.6, 'learning_rate': 0.25, 'max_depth': 9, 'min_child_weight': 2, 'n_estimators': 150, 'reg_alpha': 0.5, 'reg_lambda': 3, 'subsample': 0.3}


Best Score: 0.8406556780996244
Best Parameters: {'colsample_bytree': 0.6, 'gamma': 0.5, 'learning_rate': 0.3, 'max_depth': 10, 'min_child_weight': 4, 'n_estimators': 22, 'reg_alpha': 0, 'reg_lambda': 2.0, 'sampling_method': 'uniform', 'subsample': 0.8}

Best Score: 0.8350686641697879
Best Parameters: {'colsample_bytree': 0.5, 'gamma': 0.6, 'learning_rate': 0.25, 'max_depth': 9, 'min_child_weight': 2, 'n_estimators': 150, 'reg_alpha': 0.5, 'reg_lambda': 3, 'subsample': 0.3}


In [23]:
# Gradient Boosting Classifier - hyperparameter tuning
gbc = GradientBoostingClassifier(random_state = 1)
param_grid = {                      # Overfitting options
    'n_estimators': [200],          # Increase number of trees
    'learning_rate': [0.08],        # Decrease learning rate
    'max_depth': [4, 5, 6],         # Reduce tree depth
    'min_samples_split': [4, 5],    # Increase min_samples_split
    'min_samples_leaf': [1, 2],     # Increase min_samples_leaf
    'subsample': [0.4, 0.5],        # Include more aggressive subsampling
    'max_features': ['sqrt'],       # Limit max_features for each split
    'loss': ['exponential']         # Consider different loss functions
}
clf_gbc = GridSearchCV(gbc, param_grid=param_grid, cv=10, verbose=True, n_jobs=-1)
best_clf_gbc = clf_gbc.fit(X, y)
clf_performance(best_clf_gbc,'Gradient Boosting Classifier')

Fitting 10 folds for each of 24 candidates, totalling 240 fits


Gradient Boosting Classifier
Best Score: 0.838414481897628
Best Parameters: {'learning_rate': 0.08, 'loss': 'exponential', 'max_depth': 6, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200, 'subsample': 0.4}


Best Score: 0.8440199750312111
Best Parameters: {'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}

Best Score: 0.8339150084740444
Best Parameters: {'learning_rate': 0.1, 'loss': 'exponential', 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 50, 'subsample': 0.7}

Best Score: 0.838390559286925
Best Parameters: {'learning_rate': 0.1, 'loss': 'exponential', 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 100, 'subsample': 0.7}

Best Score: 0.8350561797752809
Best Parameters: {'learning_rate': 0.1, 'loss': 'exponential', 'max_depth': 4, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 150, 'subsample': 0.6}

Best Score: 0.838414481897628
Best Parameters: {'learning_rate': 0.08, 'loss': 'exponential', 'max_depth': 6, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200, 'subsample': 0.4}


### Tuned Model Performance

| Model                | Accuracy | Accuracy_tuned |  Accuracy Kaggle test_set | 
|----------------------|----------|----------------|-----------|
| Logistic Regression  | 80.5     | 79.6           |     |
| Random Forest subset | 78.7      | 80.0          |     |
| Random Forest all    | 82.2     | 82.6           |     |
| SVM                  | 82.4     | 82.2           |     |
| XGBoost              | 82.0      | 83.5          |     |
| Gradient Boosting    | 82.3     | 83.8           |     |

In [24]:
# Save predictions from the models
# Logistic Regression
predictions_lr = best_clf_lr.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions_lr})
output.to_csv('submissions/logistic_regression.csv', index=False)

# Random Forest
predictions_rf = best_clf_rf.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions_rf})
output.to_csv('submissions/random_forest_3.csv', index=False)

# SVC
predictions_svc = best_clf_svc.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions_svc})
output.to_csv('submissions/svc.csv', index=False)

# XGBoost
predictions_xgb = best_clf_xgb.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions_xgb})
output.to_csv('submissions/xgboost.csv', index=False)

# Gradient Boosting Classifier
predictions_gbc = best_clf_gbc.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions_gbc})
output.to_csv('submissions/gradient_boosting.csv', index=False)