<img src='../images/gdd-logo.png' width='300px' align='right' style="padding: 15px">


# Boosting

In this notebook, we discuss some of the best-performing machine learning models for tabular data and give you an opportunity to practise the skills we have been refreshing.

**Program**
- [Preparing the data](#explore)
- [Overview of Ensemble Methods](#ensemble)
- [Gradient Boosting](#boosting)
- [Make a baseline model](#baseline)
- [Sklearn transformers](#transformers)

## Preparing the data

Let's load in the packages:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import RocCurveDisplay

<a id=about></a>
## About the data

<img src='../images/who.png' width='500px' align='right' style="padding: 15px">

According to the World Health Organization (WHO), strokes are the second leading cause of death globally, responsible for approximately 11% of total deaths.

You will use this dataset to build a model that can **predict whether a patient is likely to have a `stroke`** (based on input parameters like gender, age and whether or not they smoke). 

Each row in the data provides relevant information about the patient.

### Features

1. `id`: unique identifier
1. `address`: a general address (city/county, state and postal code)
1. `gender`: "Male", "Female" or "Other"
1. `age`: age of the patient
1. `hypertension`: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
1. `heart_disease`: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
1. `ever_married`: "No" or "Yes"
1. `work_type`: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
1. `residence_type`: "Rural" or "Urban"
1. `avg_glucose_level`: average glucose level in blood
1. `bmi`: body mass index
1. `smoking_status`: "formerly smoked", "never smoked", "smokes" or "Unknown"
1. `stroke`: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.*

Let's prepare the data for your model:

In [None]:
stroke = pd.read_csv('../data/stroke.csv').rename(columns=str.lower)

In [None]:
# Variable definitions
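# 'ever_married' holds "No"/"Yes" strings, so it must be one-hot encoded with the other categoricals
categorical_cols = ['work_type', 'smoking_status', 'ever_married', 'gender', 'residence_type']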
missing_cols = ['age','bmi']
drop_cols = ['id','address']

target = 'stroke'

def create_Xy(df, drop_cols, target_col):
    df = df.drop(columns=drop_cols)
    return (
        df.drop(columns=target_col),
        df[target_col]
    )

# Create X and y
X, y = stroke.pipe(create_Xy, 
                   drop_cols=drop_cols, 
                   target_col=target,
                   )

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.25,
                                                    random_state = 123,
                                                    stratify = y,
                                                    )
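
The `stratify=y` argument keeps the class proportions of `y` the same in the training and test sets, which matters most when one class is rare. A minimal sketch on made-up labels (the 5% positive rate and the `_demo` names are assumptions for illustration, not taken from the stroke data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 50 positives out of 1000 (5%).
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same call pattern as above, once with and once without stratification.
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123, stratify=y_demo
)
_, _, y_tr_p, y_te_p = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123
)

# Share of positives in each split.
print("stratified:", y_tr_s.mean(), y_te_s.mean())
print("plain:     ", y_tr_p.mean(), y_te_p.mean())
```

With stratification, both shares stay close to 5%; without it, they can drift apart by chance.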

<a id=ensemble></a>

## Overview of Ensemble Methods

Ensemble methods improve predictive performance by combining multiple individual models into a single, more expressive and flexible model. Depending on how the models are combined, this can reduce bias, variance, or both, so the ensemble typically generalises better than any single member.

Two popular approaches to ensembling are **bagging** and **boosting**:

- **Bagging**: aggregates the predictions of individual models that were trained in *parallel* (on bootstrapped subsamples of the data).

- **Boosting**: individual models are trained *sequentially*, with each subsequent model learning from the mistakes of the one that came before.
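
To make the contrast concrete, here is a small sketch that cross-validates one ensemble of each kind. The synthetic data from `make_classification` and the hyperparameter values are assumptions chosen for illustration; the scores say nothing about the stroke data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, used only to compare the two approaches.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=123)

# Bagging: full-depth trees fitted independently on bootstrap samples,
# with their predictions aggregated at the end.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=123)

# Boosting: shallow trees fitted one after another,
# each correcting the errors of the ensemble built so far.
boosting = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=123)

scores = {}
for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")
```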

## Gradient Boosting

Gradient boosting is a form of boosting.

Individual trees are trained sequentially, each one fitted to the mistakes, or **residual error**, of the ensemble built so far.

<img src="../images/gradient-boosting.png" style="display: block;margin-left: auto;margin-right: auto;height: 400px"/>
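
The idea can be sketched by hand in a few lines. The example below is a simplified regression version with squared error, where the residual is exactly what each new tree is fitted to. It is not how `GradientBoostingClassifier` is implemented; the data, the `learning_rate` and the tree depth are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy sine data.
rng = np.random.default_rng(123)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model
errors = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residual = y_demo - prediction  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=123).fit(X_demo, residual)
    prediction += learning_rate * tree.predict(X_demo)  # small corrective step
    errors.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

Each round shrinks the training error a little; the learning rate controls how much of each tree's correction is applied.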

You're going to train a gradient boosting model. In order to do that, let's first create a preprocessing pipeline:

In [None]:
onehot = ColumnTransformer([
    ('onehot', OneHotEncoder(drop="if_binary"), categorical_cols)
], remainder='passthrough')

preprocessing = Pipeline(steps=[
    ('onehot', onehot),
    ('impute', SimpleImputer(strategy='mean')),
])
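
To see what these two steps do, it can help to run the same kind of pipeline on a tiny frame. The rows below are made up; only the column names mirror the stroke data, and `toy` / `toy_preprocessing` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "residence_type": ["Rural", "Urban", "Urban"],            # two levels
    "smoking_status": ["smokes", "never smoked", "Unknown"],  # three levels
    "bmi": [22.0, np.nan, 30.0],                              # numeric with a gap
})

toy_preprocessing = Pipeline(steps=[
    ("onehot", ColumnTransformer(
        [("onehot", OneHotEncoder(drop="if_binary"), ["residence_type", "smoking_status"])],
        remainder="passthrough",  # numeric columns are passed along unchanged
    )),
    ("impute", SimpleImputer(strategy="mean")),  # fills the missing bmi with the column mean
])

transformed = toy_preprocessing.fit_transform(toy)
print(transformed.shape)
print(transformed)
```

With `drop="if_binary"`, the two-level column becomes a single indicator column, the three-level column becomes three, and the passed-through `bmi` comes last with its gap filled by the mean of the observed values.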

You can use the `GradientBoostingClassifier` from sklearn. 

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

model_boosting = GradientBoostingClassifier(random_state=123)

And combine the two into one `Pipeline`:

In [None]:
pipeline_boosting = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('model', model_boosting)
])

Finally, fit and score the model (just like you did in the previous notebooks). For a classifier, `score` returns the accuracy:

In [None]:
pipeline_boosting.fit(X_train, y_train)

pipeline_boosting.score(X_train, y_train), pipeline_boosting.score(X_test, y_test)

In [None]:
from sklearn.metrics import classification_report

boosting_preds = pipeline_boosting.predict(X_test)

print(classification_report(y_test, boosting_preds))

In [None]:
fig, ax = plt.subplots()
RocCurveDisplay.from_estimator(pipeline_boosting, X_train, y_train, ax=ax, name='Train')
RocCurveDisplay.from_estimator(pipeline_boosting, X_test, y_test, ax=ax, name='Test')
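
A ROC curve can be summarised in one number, the area under the curve (AUC), which is often more informative than accuracy when classes are imbalanced. A minimal sketch on synthetic data (the roughly 5% positive rate and all `_demo` names are assumptions, not properties of the stroke data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: about 95% of samples in the negative class.
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.95], random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123, stratify=y_demo
)

clf = GradientBoostingClassifier(random_state=123).fit(X_tr, y_tr)

# Accuracy obtained by always predicting the majority class, for comparison.
baseline_acc = max(y_te.mean(), 1 - y_te.mean())
acc = accuracy_score(y_te, clf.predict(X_te))
# ROC AUC is computed from the predicted probability of the positive class.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"majority-class accuracy: {baseline_acc:.3f}")
print(f"model accuracy:          {acc:.3f}")
print(f"model ROC AUC:           {auc:.3f}")
```

Comparing the model's accuracy with the majority-class baseline shows how little accuracy alone reveals on imbalanced data.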

## XGBoost & LightGBM 

For larger datasets, you may want to consider using [XGBoost](https://xgboost.readthedocs.io/) or [LightGBM](https://lightgbm.readthedocs.io).

Both are gradient boosting libraries designed for computational speed and model performance.

Below we demonstrate an example using XGBoost.

In [None]:
from xgboost import XGBClassifier

In [None]:
# `stroke` is a binary target, so use the binary logistic objective
model_xgboost = XGBClassifier(objective="binary:logistic", random_state=123)

In [None]:
pipeline_xgboost = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('model', model_xgboost)
])

In [None]:
pipeline_xgboost.fit(X_train, y_train)

pipeline_xgboost.score(X_train, y_train), pipeline_xgboost.score(X_test, y_test)

In [None]:
xgboost_preds = pipeline_xgboost.predict(X_test)
print(classification_report(y_test, xgboost_preds))

In [None]:
fig, ax = plt.subplots()
RocCurveDisplay.from_estimator(pipeline_xgboost, X_train, y_train, ax=ax, name='Train')
RocCurveDisplay.from_estimator(pipeline_xgboost, X_test, y_test, ax=ax, name='Test')

### <mark>Exercise</mark>

Scikit-learn has a native implementation of [Histogram Boosting](https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting), which was inspired by LightGBM. Implement the model in your pipeline and investigate what level of performance you can achieve on the data.

**Steps:**
1. Define the model
2. Incorporate it into your pipeline
3. Tune the pipeline's parameters by cross-validating it on the training data
4. Use different metrics to evaluate it on the test set

In [None]:
# YOUR CODE HERE

In [None]:
# %load ../answers/04-histo-boosting.py

---


<a id=conc></a>

# Conclusion and Next Steps

This notebook has given an overview of high-performing boosting algorithms and provided an opportunity to practise best practices for training and evaluating ML models with Scikit-Learn.