# Ensembles

# Table of contents <a class="anchor" id="toc"></a>
* [Load the Packages](#load_packages)
* [Functional Blocks](#functional_blocks)
* [Load the Dataset](#load_dataset)
* [Explore the dataset](#explore_dataset)
* [Preprocessing](#preprocessing)
    * [Missing values](#missing_values)
    * [Drop the irrelevant features](#drop_irrelevant)
    * [Replace missing values](#replace_missing)
    * [Feature Encoding](#feature_encoding)
* [Feature Selection](#feature_selection)
* [Feature Engineering](#feature_engineering)
    * [Binning](#binning)
* [Pipeline](#pipeline)
* [Data split: Train and Test sets](#data_split)
* [Modelling](#modelling)
    * [Decision Tree](#decision_tree)
        * [Classifier Pipeline](#dt_classifier_pipeline)
        * [Feature Importance](#feature_importance)
        * [Metrics](#dt_metrics)
            * [Classification Report](#dt_classification_report)
            * [Confusion Matrix](#dt_confusion_matrix)
        * [Tree Structure](#tree_structure)
        * [Tree Pruning](#tree_pruning)
            * [Retrain with Tree Pruning](#retrain_with_pruning)
            * [Metrics](#rp_metrics)
                * [Classification Report](#rp_classification_report)
                * [Confusion Matrix](#rp_confusion_matrix)
                * [Feature Importance](#rp_feature_imp)
        * [Hyper-parameters Tuning](#hyper_params_tuning)
            * [Retrain with Hyper-parameters Tuned](#hp_tuned_retrain)
            * [Metrics](#hp_tuned_metrics)
                * [Confusion Matrix](#hp_confusion_matrix)
    * [Random Forest](#random_forest)
        * [Metrics](#rf_metrics)
            * [Classification Report](#rf_classification_report)
            * [Confusion Matrix](#rf_confusion_matrix)
        * [Feature Importance](#rf_feature_imp)
    * [Gradient Boosting](#gradiet_boosting)
        * [Metrics](#gb_metrics)
            * [Classification Report](#gb_classification_report)
            * [Confusion Matrix](#gb_confusion_matrix)
        * [Feature Importance](#gb_feature_importance)
    * [AdaBoost](#AdaBoost)
        * [Metrics](#AB_metrics)
            * [Classification Report](#AB_classification_report)
            * [Confusion Matrix](#AB_confusion_matrix)
        * [Feature Importance](#AB_feature_importance)
* [Models Comparison](#models_comparison)

## Load the Packages <a class="anchor" id="load_packages"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import time

# sklearn
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, mean_squared_error
from sklearn.tree import plot_tree, DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn import tree, set_config, metrics
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

from IPython.display import HTML, display, Markdown as md, Latex
from ipywidgets import widgets, Layout
from ipywidgets import Output, Tab

# check scikit-learn version
from sklearn import __version__ as ver
print(f"scikit-learn version: {ver}")

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

## Functional Blocks <a class="anchor" id="functional_blocks"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
def feat_imp_chart(feat_imp, feat_names, plot_head):
    importances = pd.Series(feat_imp, index=feat_names).sort_values(ascending=True)
    importances.plot.barh()
    plt.title(plot_head)
    plt.show()

def get_top_5(data):
    return np.sort(data)[::-1][0:5]

## Load the Titanic Dataset (Source: openml.org) <a class="anchor" id="load_dataset"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

## Explore the dataset <a class="anchor" id="explore_dataset"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
X.info()

### Observations
- Features 'cabin', 'boat', 'body', and 'home.dest' have many missing (null) values.
- Features 'sex' and 'embarked' are of type category and will require encoding.

In [None]:
X.describe()

### Observations
- Features such as 'age', 'fare', and 'body' span much wider ranges than 'pclass', 'sibsp', and 'parch', so they might require scaling.
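
As a rough illustration of what scaling does to columns with different ranges, here is a minimal sketch using `StandardScaler` on a toy frame. The values are made up for the example and are not taken from the Titanic data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame (hypothetical values) with two columns on very different scales
toy = pd.DataFrame({
    "age": [22.0, 35.0, 58.0, 4.0],
    "fare": [7.25, 71.28, 512.33, 16.70],
})

# Standardize each column to zero mean and unit variance
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(scaled.round(3))
```

After scaling, both columns are centred on zero with unit variance, so neither dominates purely because of its units.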

In [None]:
y.info()

In [None]:
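# Sort by class label so the counts line up with the labels ('0' = not survived, '1' = survived)
plt.pie(y.value_counts().sort_index(), labels=['not-survived', 'survived'], autopct='%1.0f%%')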
plt.show()

## Preprocessing <a class="anchor" id="preprocessing"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

### Missing values <a class="anchor" id="missing_values"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# Check for missing values
X.isnull().any()

In [None]:
# Percentage of missing values per feature
X.isnull().sum() / len(X) * 100

In [None]:
# Get the data types of features
X.dtypes

### Observations
1. Features 'body', 'boat', and 'cabin' have more than 60% missing values, so we drop them. (More research would be needed to decide whether dropping or imputing is the better choice; for simplicity in this lab, they are removed.)
2. Features 'age', 'fare', and 'embarked' also have missing values. 'age' and 'fare' are numerical, so their missing values are replaced with the median; 'embarked' is categorical, so it makes more sense to replace its missing values with the mode.
3. Imputing the feature 'home.dest' (which represents the address) doesn't make much sense, so this feature is removed to keep the model simple.
4. Feature 'ticket' is not relevant to the objective of this report, so it is removed.
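
The median/mode strategy from observation 2 can be sketched on a toy frame. The column names mirror the dataset, but the values below are hypothetical and only serve to show the mechanics.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with a gap in a numerical and a categorical column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "embarked": ["S", "C", None, "S"],
})

# Numerical column: fill with the median, which is less sensitive to outliers than the mean
toy["age"] = toy["age"].fillna(toy["age"].median())

# Categorical column: fill with the most frequent value (the mode)
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode()[0])

print(toy)
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on a column selection, avoids pandas chained-assignment pitfalls.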

### Drop the irrelevant features <a class="anchor" id="drop_irrelevant"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# On basis of the above observations, dropping following irrelevant features
X.drop(["cabin", "boat", "body", "home.dest", "ticket"], axis=1, inplace=True)
X.head()

### Replace missing values <a class="anchor" id="replace_missing"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
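# Replacing missing values (assign back instead of using inplace=True on a column
# selection, which may not modify X under pandas copy-on-write)
X["age"] = X["age"].fillna(X["age"].median())
X["fare"] = X["fare"].fillna(X["fare"].median())
X["embarked"] = X["embarked"].fillna(X["embarked"].mode()[0])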

# Verify whether any missing value in any feature
X.isnull().any()

### Feature Encoding <a class="anchor" id="feature_encoding"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>
Convert categorical features to numerical

In [None]:
le = LabelEncoder()
X["sex"] = le.fit_transform(X["sex"])
X["embarked"] = le.fit_transform(X["embarked"])

# Verify the dataset
X.head()

## Feature Selection <a class="anchor" id="feature_selection"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>
Examine Feature Correlations

In [None]:
# Correlation needs numerical values only, so restrict the heatmap to the numerical (encoded) columns
sns.heatmap(X[["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

### Observation:
Features 'parch' and 'sibsp' are weakly correlated features, so we can combine them to get more meaningful information, which leads to feature engineering.

## Feature Engineering <a class="anchor" id="feature_engineering"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# Combining 'parch' and 'sibsp' to create new feature named as 'family_size' 

X['family_size'] = X['parch'] + X['sibsp']
X.drop(['parch', 'sibsp'], axis=1, inplace=True)

# Create a derived feature called 'is_alone' using the family_size feature:
# family_size = parch + sibsp excludes the passenger, so "alone" means family_size == 0
X['is_alone'] = (X['family_size'] == 0).astype(int)

# Print the head to verify the data
X.head()

In [None]:
# Using the 'name' feature we can derive a new feature called 'title', which
# will have values such as 'Mr', 'Mrs', 'Capt' and 'Dr';
# these values seem to be of interest for our model

# Create new feature
X['title'] =  X['name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

# Remove the 'name' feature
X.drop(["name"], axis=1, inplace=True)

# Print the head to verify the data
X.head()

In [None]:
# Verify the new feature 'title'
pd.crosstab(X['title'], X['sex'])

### Observation
There are many distinct titles, several of which occur only a few times, so we can perform binning (grouping).

### Binning <a class="anchor" id="binning"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# Flag titles that occur fewer than 10 times as rare
rare_titles = (X['title'].value_counts() < 10)

# Merge 'Miss' into 'Mrs', then replace every rare title with 'rare'
X.loc[X['title'] == 'Miss', 'title'] = 'Mrs'
X['title'] = X['title'].apply(lambda x: 'rare' if rare_titles[x] else x)

# Verify the new feature 'title'
pd.crosstab(X['title'], X['sex'])

## Pipeline <a class="anchor" id="pipeline"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>
Create preprocessing (numerical, categorical transform) pipeline

In [None]:
# here we call the new API set_config to tell sklearn we want to output a pandas DF
set_config(transform_output="pandas")

num_features = ['age', 'fare', 'family_size']
cat_features = ['embarked', 'sex', 'pclass', 'title', 'is_alone']

# creating the numerical pipeline
num_pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler())
])

#creating the transform to preprocess the data
transformer = ColumnTransformer(
    (
        ('numerical', num_pipe, num_features),
        ("categorical", 
             OneHotEncoder(sparse_output=False, 
                           drop="if_binary", 
                           handle_unknown="ignore"), 
             cat_features
        )
    ),
    verbose_feature_names_out=False,
)

## Data split: Train and Test sets <a class="anchor" id="data_split"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# split the data for training and testing the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=117)

## Modelling <a class="anchor" id="modelling"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>
The following algorithms are trained and compared:
- Decision Tree
- Random Forest
- Gradient Boosting
- AdaBoost

### Decision Tree <a class="anchor" id="decision_tree"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

### Classifier Pipeline <a class="anchor" id="dt_classifier_pipeline"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>
Create a classifier pipeline with a data preprocessing step and a decision tree classifier

In [None]:
# creating the classifier pipeline with a data preprocessing step and decision tree classifier
# (note: despite the 'rf_' prefix in the names, this pipeline wraps a decision tree)
rf_pipeline = Pipeline([
    ('dataprep', transformer),
    ('rf_clf', DecisionTreeClassifier(random_state=117))
])

In [None]:
# training the model
dt_train_start_time = time.time()
rf_pipeline.fit(X_train, y_train)
dt_train_end_time = time.time()

dt_train_time = round(dt_train_end_time - dt_train_start_time, 4)
print("Decision Tree Train time:", dt_train_time, "secs\n\n")

### Feature Importance <a class="anchor" id="feature_importance"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# retrieving the decision tree classifier from the model pipeline
clf = rf_pipeline[-1]

print(clf.feature_names_in_)
clf.feature_names_in_[clf.feature_names_in_ == 'sex_1'] = 'sex'

# making a pandas dataframe
data = list(zip(clf.feature_names_in_, clf.feature_importances_))
df_importances = pd.DataFrame(data, 
                              columns=['Feature', 'Importance']).sort_values(by='Importance', ascending=False)

df_importances

In [None]:
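# Plot feature importance chart
# Pair each importance with its feature name before taking the top 5,
# so the bars stay correctly labelled
top5 = pd.Series(clf.feature_importances_, index=clf.feature_names_in_).nlargest(5)
feat_imp_chart(top5.values, top5.index, 'Top 5 Features by Importance')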
clf_before_prune = clf

### Metrics <a class="anchor" id="dt_metrics"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
X_preproc = rf_pipeline[:-1].transform(X_train)
X_test_preproc = rf_pipeline[:-1].transform(X_test)

X_preproc.rename({'sex_1':'sex'}, axis=1, inplace=True)
X_test_preproc.rename({'sex_1':'sex'}, axis=1, inplace=True)

dt_train_acc = clf.score(X_preproc, y_train)
dt_test_acc = clf.score(X_test_preproc, y_test)
print(f"Train accuracy: {dt_train_acc:.3f}")
print(f"Test accuracy: {dt_test_acc:.3f}")

### Observation
The train accuracy is much higher than the test accuracy, which suggests the model overfits the Titanic dataset. We can verify this further using cross-validation.
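
A minimal sketch of such a cross-validation check is shown here. It uses synthetic data from `make_classification` as a stand-in (an assumption for self-containment); in the notebook, `X_preproc` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not the Titanic dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=117
)

tree_demo = DecisionTreeClassifier(random_state=117)

# Accuracy on the same data the tree was fitted on
train_acc = tree_demo.fit(X_demo, y_demo).score(X_demo, y_demo)

# 5-fold cross-validated accuracy: each fold is scored on data the tree never saw
cv_scores = cross_val_score(tree_demo, X_demo, y_demo, cv=5)

print(f"Train accuracy: {train_acc:.3f}")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

A large gap between the train accuracy and the mean cross-validated accuracy points to overfitting without touching the held-out test set.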

#### Classification Report <a class="anchor" id="dt_classification_report"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
cf_test_start_time = time.time()
yhat = clf.predict(X_test_preproc)
cf_test_end_time = time.time()

dt_test_time = round(cf_test_end_time - cf_test_start_time, 4)

print("Test time:", dt_test_time, "secs\n\n")

print(classification_report(y_test, yhat))

#### Confusion Matrix <a class="anchor" id="dt_confusion_matrix"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
cm = confusion_matrix(y_test, yhat)
cm_display = ConfusionMatrixDisplay(cm).plot()

#### Tree Structure <a class="anchor" id="tree_structure"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
plot_tree(clf)
plt.show()
print(f"Decision Tree Depth: {clf.get_depth()}")
print(f"Decision Tree Node Count: {clf.tree_.node_count}")

### Tree Pruning <a class="anchor" id="tree_pruning"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
path = clf.cost_complexity_pruning_path(X_preproc, 
                                        y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# We train a decision tree for each effective alpha. The last value
# in ``ccp_alphas`` is the alpha value that prunes the whole tree,
# leaving the tree, ``clfs[-1]``, with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    # Use a separate name so the original tree in ``clf`` is not overwritten
    pruned_clf = DecisionTreeClassifier(random_state=117, ccp_alpha=ccp_alpha)
    pruned_clf.fit(X_preproc, y_train)
    clfs.append(pruned_clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)

# We remove the last element in ``clfs`` and ``ccp_alphas``, 
# because it is the trivial tree with only one node. 
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

# Accuracy vs alpha for training and testing sets
train_scores = [clf.score(X_preproc, 
                          y_train) for clf in clfs]
test_scores = [clf.score(X_test_preproc, 
                         y_test) for clf in clfs]

fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='.', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='.', label="test", drawstyle="steps-post")
ax.legend()
ax.grid()
plt.tight_layout()
plt.show()
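
Reading a good alpha off the test curve uses the test set for model selection. As a hedged alternative, the sketch below picks `ccp_alpha` by cross-validation on training data only. It runs on synthetic data (an assumption for self-containment), and the names `X_demo`, `y_demo`, `candidate_alphas` and `best_alpha` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not the Titanic dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=117
)

# Candidate alphas from the pruning path; drop the last one (single-node tree)
# and clip tiny negative values that can arise from floating-point error
path = DecisionTreeClassifier(random_state=117).cost_complexity_pruning_path(X_demo, y_demo)
candidate_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

# Mean 5-fold cross-validated accuracy for each candidate alpha
cv_means = [
    cross_val_score(
        DecisionTreeClassifier(random_state=117, ccp_alpha=alpha), X_demo, y_demo, cv=5
    ).mean()
    for alpha in candidate_alphas
]

best_alpha = candidate_alphas[int(np.argmax(cv_means))]
print(f"Best ccp_alpha by CV: {best_alpha:.5f}")
```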

#### Retrain with Pruning <a class="anchor" id="retrain_with_pruning"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
clf_after_prune = DecisionTreeClassifier(random_state=117, ccp_alpha=0.01)

cfp_train_start_time = time.time()
clf_after_prune.fit(X_preproc, y_train)
cfp_train_end_time = time.time()

dtp_train_time = round(cfp_train_end_time-cfp_train_start_time, 4)

print("Training time:", dtp_train_time, "secs")

dtp_train_acc = clf_after_prune.score(X_preproc, y_train)
dtp_test_acc = clf_after_prune.score(X_test_preproc, y_test)

print(f"Train accuracy: {dtp_train_acc:.3f}")
print(f"Test accuracy: {dtp_test_acc:.3f}")

### Metrics <a class="anchor" id="rp_metrics"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
print(f"Decision Tree Depth: {clf_after_prune.get_depth()}")
print(f"Decision Tree Node Count: {clf_after_prune.tree_.node_count}")

In [None]:
plot_tree(clf_after_prune, feature_names=list(clf_after_prune.feature_names_in_), filled=True)
plt.show()

#### Classification Report <a class="anchor" id="rp_classification_report"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
cfp_test_start_time = time.time()
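# Predict with the pruned tree (not with ``clf``, which may have been reassigned earlier)
yhat_after_prune = clf_after_prune.predict(X_test_preproc)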
cfp_test_end_time = time.time()

dtp_test_time = round(cfp_test_end_time-cfp_test_start_time, 4)
print("Testing time:", dtp_test_time, "secs\n\n")

print(classification_report(y_test, yhat_after_prune))

#### Confusion Matrix <a class="anchor" id="rp_confusion_matrix"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
cm = confusion_matrix(y_test, yhat_after_prune)
cm_display = ConfusionMatrixDisplay(cm).plot()

#### Feature Importance <a class="anchor" id="rp_feature_imp"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
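# Plot feature importance chart
# Pair each importance with its feature name before taking the top 5,
# so the bars stay correctly labelled
top5_pruned = pd.Series(clf_after_prune.feature_importances_,
                        index=clf_after_prune.feature_names_in_).nlargest(5)
feat_imp_chart(top5_pruned.values, top5_pruned.index,
               'Top 5 Features by Importance (Post-Pruning)')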

### Hyper-parameters tuning <a class="anchor" id="hyper_params_tuning"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# List of hyper-parameters supported by decision tree
# 'ccp_alpha', 'class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 
# 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 
# 'random_state', 'splitter'

# Define the parameter grid to tune the hyperparameters
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 4, 6, 8, 10, 12, 20, 30, None],
    'min_samples_split': np.arange(2, 10, 1),  # an integer min_samples_split must be >= 2
    'min_samples_leaf': np.arange(1, 10, 1),
    'splitter': ["best", "random"]
}

# Grid search (5-fold CV by default) over the decision tree classifier taken from the pipeline
clf_GS = GridSearchCV(rf_pipeline[-1], param_grid)
clf_GS.fit(X_preproc, y_train)

best_dtree = clf_GS.best_estimator_  # Best estimator found by the grid search

y_pred = best_dtree.predict(X_test_preproc)

best_params = clf_GS.best_params_
print('Best Criterion:', best_params['criterion'])
print('Best max_depth:', best_params['max_depth'])
print('Min samples split:', best_params['min_samples_split'])
print('Min samples leaf:', best_params['min_samples_leaf'])
print('Best splitter:', best_params['splitter'])

print("\nBest score:", clf_GS.best_score_)

### Retrain with Hyper-parameters Tuned <a class="anchor" id="hp_tuned_retrain"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# Rebuild a decision tree with the best parameters found by the grid search
tuned_hyper_model = DecisionTreeClassifier(**clf_GS.best_params_, random_state=123)

In [None]:
# fitting model
tuned_hyper_model.fit(X_preproc, y_train)

# prediction 
tuned_pred=tuned_hyper_model.predict(X_test_preproc)

#### Metrics <a class="anchor" id="hp_tuned_metrics"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

#### Confusion Matrix <a class="anchor" id="hp_confusion_matrix"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# Plot Confusion Matrix after model hyper parameters tuning
ConfusionMatrixDisplay(confusion_matrix(y_test, tuned_pred)).plot()

## Random Forest <a class="anchor" id="random_forest"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# Build the Random Forest model
rf = RandomForestClassifier(random_state=117)

rf_train_start_time = time.time()

rf.fit(X_preproc, y_train)

rf_train_end_time = time.time()

rf_train_time = round(rf_train_end_time - rf_train_start_time, 4)

print("Training time:", rf_train_time, "secs")

rf_train_acc = rf.score(X_preproc, y_train)
rf_test_acc = rf.score(X_test_preproc, y_test)

print(f"Train accuracy: {rf_train_acc:.3f}")
print(f"Test accuracy: {rf_test_acc:.3f}")

### Metrics <a class="anchor" id="rf_metrics"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

#### Classification Report <a class="anchor" id="rf_classification_report"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
rf_test_start_time = time.time()

yhat = rf.predict(X_test_preproc)

rf_test_end_time = time.time()

rf_test_time = round(rf_test_end_time - rf_test_start_time, 4)

print("Testing time:", rf_test_time, "secs\n\n")

print(classification_report(y_test, yhat))

#### Confusion Matrix <a class="anchor" id="rf_confusion_matrix"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
cm = confusion_matrix(y_test, yhat)
cm_display = ConfusionMatrixDisplay(cm).plot()

### Feature Importance <a class="anchor" id="rf_feature_imp"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# feature importances (reuses the plotting helper defined in Functional Blocks)
feat_imp_chart(rf.feature_importances_, X_preproc.columns,
               'Feature Importances (Random Forest)')

### Gradient Boosting <a class="anchor" id="gradiet_boosting"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# Build the Gradient Boosting model
gb = GradientBoostingClassifier(random_state=117)

gb_train_start_time = time.time()

gb.fit(X_preproc, y_train)

gb_train_end_time = time.time()

gb_train_time = round(gb_train_end_time-gb_train_start_time, 4)

print("Training time:", gb_train_time, "secs")

gb_train_acc = gb.score(X_preproc, y_train)
gb_test_acc = gb.score(X_test_preproc, y_test)

print(f"Train accuracy: {gb_train_acc:.3f}")
print(f"Test accuracy: {gb_test_acc:.3f}")

### Metrics <a class="anchor" id="gb_metrics"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

### Classification Report <a class="anchor" id="gb_classification_report"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
gb_test_start_time = time.time()

yhat = gb.predict(X_test_preproc)

gb_test_end_time = time.time()

gb_test_time = round(gb_test_end_time - gb_test_start_time, 4)

print("Testing time:", gb_test_time, "secs\n\n")

print(classification_report(y_test, yhat))

### Confusion Matrix <a class="anchor" id="gb_confusion_matrix"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
cm = confusion_matrix(y_test, yhat)
cm_display = ConfusionMatrixDisplay(cm).plot()

### Feature Importance <a class="anchor" id="gb_feature_importance"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# feature importances (reuses the plotting helper defined in Functional Blocks)
feat_imp_chart(gb.feature_importances_, X_preproc.columns,
               'Feature Importances (Gradient Boosting)')

### AdaBoost <a class="anchor" id="AdaBoost"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# Build the AdaBoost model
ada = AdaBoostClassifier(algorithm="SAMME", random_state=117)

ab_train_start_time = time.time()

ada.fit(X_preproc, y_train)

ab_train_end_time = time.time()

ada_train_time = round(ab_train_end_time - ab_train_start_time, 4)
print("Training time:", ada_train_time, "secs")

ada_train_acc = ada.score(X_preproc, y_train)
ada_test_acc = ada.score(X_test_preproc, y_test)

print(f"Train accuracy: {ada_train_acc:.3f}")
print(f"Test accuracy: {ada_test_acc:.3f}")

### Metrics <a class="anchor" id="AB_metrics"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

### Classification Report <a class="anchor" id="AB_classification_report"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
ab_test_start_time = time.time()

yhat = ada.predict(X_test_preproc)

ab_test_end_time = time.time()

ada_test_time = round(ab_test_end_time - ab_test_start_time, 4)

print("Testing time:", ada_test_time, "secs\n\n")

print(classification_report(y_test, yhat))

### Confusion Matrix <a class="anchor" id="AB_confusion_matrix"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
cm = confusion_matrix(y_test, yhat)
cm_display = ConfusionMatrixDisplay(cm).plot()

### Feature Importance <a class="anchor" id="AB_feature_importance"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
# feature importances (reuses the plotting helper defined in Functional Blocks)
feat_imp_chart(ada.feature_importances_, X_preproc.columns,
               'Feature Importances (AdaBoost)')

## Models Comparison <a class="anchor" id="models_comparison"></a> <p style="text-align: right; color: blue; font-size: 15px"> [Go to Main Menu](#toc) </p>

In [None]:
bar_colors = ['tab:red', 'tab:blue', 'tab:green', 'tab:orange', '#A890F0']

models = ['Decision Tree without pruning',
          'Decision Tree with pruning', 
          'Random Forest', 
          'Gradient Boosted', 
          'Adaboost']

training_time = [dt_train_time, dtp_train_time, 
                 rf_train_time, gb_train_time, ada_train_time]

testing_time = [dt_test_time, dtp_test_time, rf_test_time, 
                gb_test_time, ada_test_time]
train_accuracy = [dt_train_acc, dtp_train_acc, rf_train_acc, 
                  gb_train_acc, ada_train_acc]
test_accuracy = [dt_test_acc, dtp_test_acc, rf_test_acc, 
                 gb_test_acc, ada_test_acc]

fig, ax = plt.subplots(4, figsize=(10, 20))

# (values, y-axis label, title, optional y-limits) for each of the four bar charts
panels = [
    (training_time, "Training Time", "Training time comparison of models", None),
    (testing_time, "Testing Time", "Testing time comparison of models", None),
    (train_accuracy, "Training Accuracy", "Training Accuracy comparison of models", (0.7, 1)),
    (test_accuracy, "Test Accuracy", "Test Accuracy comparison of models", (0.7, 0.9)),
]

for axis, (values, ylabel, title, ylim) in zip(ax, panels):
    # one bar per model; the legend replaces the x tick labels
    axis.bar(models, values, color=bar_colors, label=models, width=0.4)
    axis.set_xticks([])
    axis.set_ylabel(ylabel)
    axis.set_title(title)
    if ylim is not None:
        axis.set_ylim(*ylim)
    axis.legend(title="Models", loc='center left', bbox_to_anchor=(1, 0.5))

plt.show()

## Summary
- Random Forest shows the highest training accuracy but a lower test accuracy, an indication of overfitting.
- AdaBoost shows balanced training and test accuracy, with the lowest execution time (training and testing).
- Gradient Boosting has the highest test accuracy.

## Future Work:
- The hyper-parameters can be tuned for the Random Forest, AdaBoost, and Gradient Boosting algorithms.
- Further feature engineering can be performed to improve the performance of the algorithms.
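
As a starting point for the first item, here is a minimal sketch of tuning a Random Forest with `GridSearchCV`. The data is synthetic and the grid values are illustrative assumptions, not tuned recommendations; in the notebook, `X_preproc` and `y_train` would replace `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption; not the Titanic dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=117
)

# Deliberately small, illustrative grid
rf_param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

# 3-fold cross-validated grid search scored by accuracy
search = GridSearchCV(
    RandomForestClassifier(random_state=117), rf_param_grid, cv=3, scoring="accuracy"
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

The same pattern should carry over to `GradientBoostingClassifier` and `AdaBoostClassifier` by swapping the estimator and the parameter grid.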