# Ensemble methods on Titanic 🚢🚢

## Introduction

This exercise is an opportunity to practice ensemble methods on a dataset you have worked with before: the Titanic dataset.

Let's start by importing the libraries that we will use in this exercise.

In [186]:
# Load in our libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import plotly.figure_factory as ff

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
# import ensemble methods
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

## Feature Exploration, Engineering and Cleaning 

1. Import the data using the following link: "https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+Supervis%C3%A9/stacking/titanic.csv", and display the first rows. Are there any missing values in the dataset?

In [187]:
dataset = pd.read_csv("https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+Supervis%C3%A9/stacking/titanic.csv")
dataset

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [188]:
dataset.describe(include="all")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


In [189]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
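
The `info()` output above already answers the missing-values question indirectly: `Age` has 714 non-null values out of 891, `Cabin` 204 and `Embarked` 889. A more direct check is to count nulls per column. The sketch below uses a small hypothetical frame (`toy`, not the real dataset) so that it runs without a download; the same two calls should apply unchanged to `dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Number of missing values per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)
# Share of missing values per column, in percent
missing_share = (toy.isna().mean() * 100).round(1)

print(pd.DataFrame({"n_missing": missing, "pct_missing": missing_share}))
```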


2. What types of variables are present in this dataset? What kind of preprocessing could you run on these variables?

3. Here are some guidelines you can follow to clean the dataset as well as create new variables (feature engineering).

a. Create a Name_length variable that measures the number of characters in the variable Name for each observation.

In [190]:
# Vectorized string length (same result as applying len to each name)
dataset["Name_length"] = dataset["Name"].str.len()

b. Create a variable Has_Cabin that indicates whether the passenger has a cabin or not.

Hint: [this method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.notna.html#pandas.DataFrame.notna) might be useful 😉

In [191]:
dataset["Has_Cabin"] = dataset["Cabin"].notna()

c. Create a variable FamilySize that gives the size of each passenger's family.

In [192]:
dataset["FamilySize"] = dataset["SibSp"] + dataset["Parch"] + 1

d. Create a variable IsAlone that indicates whether the passenger is traveling on their own.

In [193]:
# A passenger is alone when the family consists of that passenger only
dataset["IsAlone"] = dataset["FamilySize"] == 1

h. Extract the title from each passenger's name in order to create a variable Title.

Hint: You might consider _applying_ a function that calls the [str.split method](https://docs.python.org/3.3/library/stdtypes.html?highlight=split#str.split) 😉

In [194]:
dataset["Title"] = dataset["Name"].apply(lambda x : x.split(", ")[1].split(".")[0])

i. If some of these titles are equivalent, convert them so that they fall into the same category.

In [195]:
# Map French and alternative spellings onto their English equivalents
dataset["Title"] = dataset["Title"].replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

j. Are any of the remaining titles underrepresented among the observations? If so, group them into a single category "Rare".

In [196]:
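# "Mme", "Ms" and "Mlle" were already remapped in the previous step, so listing them here has no effect.
# Titles missing from this list (e.g. "Rev") keep their own category.
rare_titles = ["Don", "Mme", "Ms", "Major", "Lady", "Sir", "Mlle", "Col", "Capt", "the Countess", "Jonkheer"]
dataset["Title"] = dataset["Title"].replace(rare_titles, "Rare")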
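
The list above is written by hand, and the feature names printed further down still contain a `Title_Rev` column, which suggests that some infrequent titles were not covered. One possible alternative is to derive the rare titles from their frequencies. The sketch below is only an illustration: the `names` sample and the `min_count` threshold are assumptions, not values from the exercise.

```python
import pandas as pd

# Hypothetical sample of names in the same "Last, Title. First" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
    "Futrelle, Mrs. Jacques Heath",
    "Montvila, Rev. Juozas",
    "Graham, Miss. Margaret Edith",
    "Behr, Mr. Karl Howell",
    "Cumings, Mrs. John Bradley",
])

# Same extraction logic as in the exercise
titles = names.apply(lambda x: x.split(", ")[1].split(".")[0])

# Count how often each title occurs
counts = titles.value_counts()

# Assumed threshold: titles seen fewer than min_count times are "Rare"
min_count = 2
rare = counts[counts < min_count].index

# Replace every rare title by the single category "Rare"
grouped = titles.where(~titles.isin(rare), "Rare")
print(grouped.value_counts())
```

On the real data the threshold would need to be chosen by looking at `dataset["Title"].value_counts()` first.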

4. Drop the columns 'PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp' from the dataset. Why don't we need these columns for what's next?

In [197]:
to_drop = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
dataset = dataset.drop(to_drop, axis = 1)

5. Separate the features from the target and split the data into train and test sets (with random_state = 0).

In [198]:
target_variable = "Survived"

X = dataset.drop(target_variable, axis=1)
y = dataset[target_variable]

X_train_unproc, X_test_unproc, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

6. Using Pipeline and ColumnTransformer, apply all the preprocessing steps at once. Use the [KNN imputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) to handle the missing values in the numeric variables, and the SimpleImputer for categorical data.

In [199]:
numerical_features = [i for i in X.columns if X[i].dtype in ["int32", "float32", "int64", "float64"]]
categorical_features = [i for i in X.columns if X[i].dtype in ["object", "str", "category", "bool"]]

numerical_transformer = Pipeline(
    steps=[
        ("imputer", KNNImputer()),
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(drop="first"))
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

X_train = preprocessor.fit_transform(X_train_unproc)
X_test = preprocessor.transform(X_test_unproc)
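
To make the imputation step more concrete, here is a minimal sketch of what `KNNImputer` does on a tiny hypothetical array (`X_toy`, with `n_neighbors=2` chosen for illustration; the pipeline above uses the default settings). Each missing entry is replaced by the mean of that feature among the nearest rows, with distances computed on the features both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical array with two missing entries
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each gap is filled with the average of the 2 nearest rows for that column
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_toy)

print(X_filled)
```

Note that in the pipeline above the imputer runs before the scaler, so the distances are computed on the raw units of each feature.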

In [208]:
column_names = []

for name, step, features_list in preprocessor.transformers_:
    if name == 'num':
        features = features_list
    else :
        features = step.get_feature_names_out()
    
    column_names.extend(features)

column_names

['Pclass',
 'Age',
 'Parch',
 'Fare',
 'Name_length',
 'FamilySize',
 'Sex_male',
 'Embarked_Q',
 'Embarked_S',
 'Has_Cabin_True',
 'IsAlone_True',
 'Title_Master',
 'Title_Miss',
 'Title_Mr',
 'Title_Mrs',
 'Title_Rare',
 'Title_Rev']

### Pearson Correlation Heatmap

7. Produce a figure that contains the correlation table for all the explanatory variables of X_train. What do you think?

In [200]:
# Label the columns with the feature names so the heatmap axes are readable
corr_matrix = pd.DataFrame(X_train, columns=column_names).corr().round(2)

fig = ff.create_annotated_heatmap(corr_matrix.values,
                                  x = corr_matrix.columns.tolist(),
                                  y = corr_matrix.index.tolist())

fig.show()

**Correlations between the variables are not very high, so we can hope that each of them brings complementary information for predicting the target variable.**
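
A heatmap with many features can be hard to read, so one way to check this impression is to list the most strongly correlated pairs. The sketch below runs on hypothetical toy data in which `FamilySize` is built from `Parch`, as in the feature engineering above, so that pair is expected to stand out; the values themselves are random and not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical features: FamilySize is derived from Parch, Fare is independent
parch = rng.integers(0, 4, size=200)
sibsp = rng.integers(0, 3, size=200)
toy = pd.DataFrame({
    "Parch": parch,
    "FamilySize": parch + sibsp + 1,
    "Fare": rng.normal(30, 10, size=200),
})

corr = toy.corr()

# Keep the upper triangle only (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort pairs by absolute correlation, strongest first
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs.head())
```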

## Ensembling & Stacking models

Now that we have finished our preprocessing and made sure our data was fit for prediction, let's move on to creating our ensemble models. We'll train different models with different ensembling strategies and store their train and test scores for comparison.

### Random Forest
8. Train a Random Forest by tuning the hyperparameters with a grid search. Which ensemble method is related to random forests?

Evaluate the best model's accuracy on train and test sets. Save the scores into a pandas DataFrame.

In [201]:
scores_df = pd.DataFrame(columns = ['model', 'accuracy', 'set'])

In [202]:
random_forest = RandomForestClassifier()

params = {
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 2, 5],
    "min_samples_split": [2, 4, 8],
    "n_estimators": [10, 20, 40, 60, 80, 100]
}

gridsearch_rfc = GridSearchCV(random_forest, param_grid = params, cv = 3)
gridsearch_rfc.fit(X_train, y_train)

results = [
    {"model": "random_forest", "accuracy": gridsearch_rfc.score(X_train, y_train), "set": "train"},
    {"model": "random_forest", "accuracy": gridsearch_rfc.score(X_test, y_test), "set": "test"}
]

scores_df = pd.concat([scores_df, pd.DataFrame(results)], ignore_index=True)
scores_df

Unnamed: 0,model,accuracy,set
0,random_forest,0.889045,train
1,random_forest,0.843575,test
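
After a grid search it is often useful to look at which hyperparameters were selected and at the cross-validated score, not only at the train and test accuracy. The sketch below shows the relevant attributes on synthetic data (an assumption made so the example runs quickly; the grid is deliberately much smaller than the one above). By default `GridSearchCV` refits the best configuration on the whole training set, so `search.score` is the score of that refitted model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, not the Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 4], "n_estimators": [10, 20]},
    cv=3,
)
search.fit(X_tr, y_tr)

# Hyperparameters of the best configuration
print(search.best_params_)
# Mean cross-validated accuracy of that configuration
print(round(search.best_score_, 3))
# Accuracy of the refitted best model on held-out data
print(round(search.score(X_te, y_te), 3))
```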


9. Create your own bagging of decision trees (with the same hyperparameters as the optimal ones for the Random Forest) and check that you get comparable performance.

In [203]:
decision_tree = DecisionTreeClassifier(
    max_depth = gridsearch_rfc.best_params_["max_depth"],
    min_samples_leaf = gridsearch_rfc.best_params_["min_samples_leaf"],
    min_samples_split = gridsearch_rfc.best_params_["min_samples_split"]
)

bagging_dtc = BaggingClassifier(estimator=decision_tree, n_estimators=gridsearch_rfc.best_params_["n_estimators"])
bagging_dtc.fit(X_train, y_train)

results = [
    {"model": "bagging_dtc", "accuracy": bagging_dtc.score(X_train, y_train), "set": "train"},
    {"model": "bagging_dtc", "accuracy": bagging_dtc.score(X_test, y_test), "set": "test"}
]

scores_df = pd.concat([scores_df, pd.DataFrame(results)], ignore_index=True)
scores_df

Unnamed: 0,model,accuracy,set
0,random_forest,0.889045,train
1,random_forest,0.843575,test
2,bagging_dtc,0.918539,train
3,bagging_dtc,0.854749,test


10. Train an AdaBoost model by tuning the hyperparameters:
* With a logistic regression as base estimator
* With a decision tree as base estimator

For each model, evaluate the performance on the train and test sets.

In [204]:
logistic_regression = LogisticRegression(max_iter = 1000)

adaboost_lg = AdaBoostClassifier(logistic_regression)

params = {
    "estimator__C": [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0],
    "n_estimators": [5, 10, 20, 30, 40, 50, 60, 70]
}

gridsearch_a_lg = GridSearchCV(adaboost_lg, param_grid = params, cv = 3)
gridsearch_a_lg.fit(X_train, y_train)

results = [
    {"model": "adaboost_lg", "accuracy": gridsearch_a_lg.score(X_train, y_train), "set": "train"},
    {"model": "adaboost_lg", "accuracy": gridsearch_a_lg.score(X_test, y_test), "set": "test"}
]

scores_df = pd.concat([scores_df, pd.DataFrame(results)], ignore_index=True)
scores_df

Unnamed: 0,model,accuracy,set
0,random_forest,0.889045,train
1,random_forest,0.843575,test
2,bagging_dtc,0.918539,train
3,bagging_dtc,0.854749,test
4,adaboost_lg,0.827247,train
5,adaboost_lg,0.821229,test


11. Train scikit-learn's GradientBoosting model (tuning its hyperparameters) and evaluate its performance.

In [205]:
gradient_boost = GradientBoostingClassifier()

params = {
    "max_depth": [4, 6, 8, 10, 12],
    "min_samples_leaf": [1, 2, 3],
    "min_samples_split": [8, 10, 12, 14, 16],
    "n_estimators": [10, 12, 14, 16, 18]
}

gridsearch_gbc = GridSearchCV(gradient_boost, param_grid = params, cv = 3)
gridsearch_gbc.fit(X_train, y_train)

results = [
    {"model": "gradient_boost", "accuracy": gridsearch_gbc.score(X_train, y_train), "set": "train"},
    {"model": "gradient_boost", "accuracy": gridsearch_gbc.score(X_test, y_test), "set": "test"}
]

scores_df = pd.concat([scores_df, pd.DataFrame(results)], ignore_index=True)
scores_df

Unnamed: 0,model,accuracy,set
0,random_forest,0.889045,train
1,random_forest,0.843575,test
2,bagging_dtc,0.918539,train
3,bagging_dtc,0.854749,test
4,adaboost_lg,0.827247,train
5,adaboost_lg,0.821229,test
6,gradient_boost,0.865169,train
7,gradient_boost,0.815642,test


12. Train an XGBoost model (by tuning hyperparameters). Do you get better or similar results compared to scikit-learn's GradientBoosting?

In [206]:
xgboost = XGBClassifier()

# min_samples_leaf and min_samples_split are scikit-learn tree parameters that
# XGBoost does not use, so only parameters XGBClassifier recognizes are searched.
params = {
    "max_depth": [4, 6, 8, 10, 12],
    "n_estimators": [10, 12, 14, 16, 18]
}

gridsearch_xgb = GridSearchCV(xgboost, param_grid = params, cv = 3)
gridsearch_xgb.fit(X_train, y_train)

results = [
    {'model': 'xgboost', 'accuracy': gridsearch_xgb.score(X_train, y_train), 'set': 'train'},
    {'model': 'xgboost', 'accuracy': gridsearch_xgb.score(X_test, y_test), 'set': 'test'}
]

scores_df = pd.concat([scores_df, pd.DataFrame(results)], ignore_index=True)
scores_df

Unnamed: 0,model,accuracy,set
0,random_forest,0.889045,train
1,random_forest,0.843575,test
2,bagging_dtc,0.918539,train
3,bagging_dtc,0.854749,test
4,adaboost_lg,0.827247,train
5,adaboost_lg,0.821229,test
6,gradient_boost,0.865169,train
7,gradient_boost,0.815642,test
8,xgboost,0.910112,train
9,xgboost,0.826816,test


13. Compare all the models' performance in a bar chart and conclude. Which model is the best?

Hint: the option `barmode` in plotly's `px.bar()` might be useful 😇

In [207]:
px.bar(
    scores_df,
    x = "model",
    y = 'accuracy',
    color = 'set',
    barmode = 'group'
)
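
Alongside the chart, a small table can help with the conclusion. The sketch below re-enters the scores reported above, puts train and test accuracy side by side, and adds the train/test gap as a rough indicator of overfitting. The `summary` and `gap` names are introduced here for illustration only.

```python
import pandas as pd

# Scores copied from the results table above
scores = pd.DataFrame({
    "model": ["random_forest", "random_forest", "bagging_dtc", "bagging_dtc",
              "adaboost_lg", "adaboost_lg", "gradient_boost", "gradient_boost",
              "xgboost", "xgboost"],
    "accuracy": [0.889045, 0.843575, 0.918539, 0.854749,
                 0.827247, 0.821229, 0.865169, 0.815642,
                 0.910112, 0.826816],
    "set": ["train", "test"] * 5,
})

# One row per model, one column per set
summary = scores.pivot(index="model", columns="set", values="accuracy")

# Difference between train and test accuracy
summary["gap"] = summary["train"] - summary["test"]

# Rank models by test accuracy, best first
summary = summary.sort_values("test", ascending=False)
print(summary.round(3))
```

With these numbers `bagging_dtc` ranks first on the test set and `adaboost_lg` has the smallest gap. The differences should be read with caution: with an 80/20 split of 891 rows the test set holds 179 passengers, so the distance between `bagging_dtc` and `random_forest` corresponds to about two passengers.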