# AdaBoost

In this lecture, we will talk about another type of ensemble learning: boosting.

More specifically, we will focus on the **AdaBoost** model.

AdaBoost is widely used in face detection, to assess whether or not a face is present in a video.

# I. Reminder: Decision Trees and Random Forests

## I.1. Decision Trees

As a reminder, a Decision Tree performs classification (or regression), and we can grow its depth to make the resulting decision boundary as complex as needed.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1GxvuU-RU8afLzYZc2eegZnRtyUzoL4AF">
</p>

To build the tree, at each node we choose the feature and threshold that split the data best, according to a disorder measure:
- Gini impurity or Entropy for classification
- Mean squared error for regression

Recall that the Gini impurity can be defined as follows:

$$
G(f) = 1 - \sum_i f_i^2
$$

Where $f_i$ is the proportion of class $i$ in the node/leaf.
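As a quick illustration of this formula, here is a minimal sketch in plain Python. The function name `gini_impurity` and the toy label lists are made up for the example.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity G = 1 - sum_i f_i^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    # f_i is the proportion of class i in the node
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

pure_node = gini_impurity([1, 1, 1, 1])    # a single class: no disorder
mixed_node = gini_impurity([0, 0, 1, 1])   # 50/50 split between two classes
```

A pure node has an impurity of 0, while a perfectly mixed binary node has an impurity of 0.5, the maximum for two classes.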

## I.2. Random Forest

In a random forest, we build an ensemble of decision trees in several key steps.

**Step 1**: Bootstrap subsampling

Pick *n* data points randomly, with replacement.

Since we sample with replacement, the same point can appear more than once in a given bootstrap sample. We build several bootstrap subsamples, say *K*.

<center>
<img src="https://drive.google.com/uc?export=view&id=1Cmh0STP2ZmfEnfe-BXuMLOjFgTeUKZDu" width=600>
</center>
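The bootstrap step can be sketched as follows. This is only an illustration using the standard library: the helper `bootstrap_samples`, its `seed` argument and the toy `data` list are assumptions made for the example.

```python
import random

def bootstrap_samples(data, k, seed=0):
    """Draw k bootstrap samples, each of size len(data), with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(k)]

data = list(range(10))                 # toy dataset: 10 points identified by their index
samples = bootstrap_samples(data, k=3)
```

Each sample has the same size as the original data, but some points typically appear several times while others are left out.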

**Step 2**: Decision trees training

Then, we train *K* decision trees, one for each bootstrap subsample, and use only a subspace of the features each time. 

This builds a wide variety of trees.

<center>
<img src="https://drive.google.com/uc?export=view&id=10bRgs3x9O4CoGXhQ0r_teSGzXPq15-2C" width=600>
</center>

**Step 3**: Aggregating with majority or soft vote

Finally, in order to predict a new sample, we run it through all the decision trees, and apply a vote on the output of each tree.

We first **bootstrapped** the data, and then **aggregated** the results. This is called Bagging, and stands for Bootstrap Aggregating.

# II. Limits of Bagging: same region same error

Let's consider a binary classification problem: 0 or 1.

Take 3 decision trees $h_1$, $h_2$ and $h_3$, each of which produces a classification that is either right or wrong.

The final prediction will be the majority vote $H_T$ of those three decision trees.

If we plot the results of the 3 classifiers, there are regions in which the classifiers will be wrong. These regions are represented in red.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1iF2nzHrwnMTXw7IDOh-l_UHvC80aF9jK">
</p>

In this example, the majority vote works perfectly.

In each region, at most one classifier is wrong, so the two others outvote it. The final prediction is therefore right everywhere, and our random forest makes no error.


But what if more than 1 decision tree out of 3 is wrong on a given region?

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1AXptTl3qP2HmaScqFpkXXME_V0dutUP9">
</p>

When two trees make the same error in the same region, the majority vote is wrong as well. This limitation suggests two changes:
- instead of training decision trees in parallel, we need to **train models sequentially**
- to avoid the "same region, same error" problem, each decision tree should focus on **correcting the errors of the previous trees**
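To make the "same region, same error" problem concrete, here is a small sketch on made-up predictions. The lists `h1`, `h2`, `h3` and the helper `majority_vote` are hypothetical and only serve the illustration.

```python
def majority_vote(predictions):
    """Majority vote of binary (0/1) predictions, given one list per classifier."""
    n_classifiers = len(predictions)
    return [int(2 * sum(votes) > n_classifiers) for votes in zip(*predictions)]

y_true = [1, 1, 1, 0, 0, 0]

# Case 1: each tree is wrong on a different sample
h1 = [0, 1, 1, 0, 0, 0]   # wrong on sample 0
h2 = [1, 0, 1, 0, 0, 0]   # wrong on sample 1
h3 = [1, 1, 1, 1, 0, 0]   # wrong on sample 3
disjoint_errors = majority_vote([h1, h2, h3])

# Case 2: two trees are wrong on the same sample (sample 0)
h2_same = [0, 1, 1, 0, 0, 0]
overlapping_errors = majority_vote([h1, h2_same, h3])
```

In the first case the vote recovers `y_true` exactly. In the second case, sample 0 is misclassified by the ensemble because two of the three trees agree on the wrong answer.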

# III. Adaptive Boosting: AdaBoost

## III.1. AdaBoost vs. Random Forests

There are several changes between AdaBoost and Random Forests:
- Decision trees have a very limited max depth of 1: they are thus called **stumps**
- Each stump may now have its own weight in the final prediction
- Decision trees are no longer independent: each stump is trained depending on the previous one

Since they have a depth of only 1, the stumps make a simple binary decision based on a single feature:

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1VwcO6eE83eHLzDIDGy8ncdLqUrny2Gq2">
</p>

The stumps are what we call **weak learners**. Weak learners are models that perform only slightly better than random guessing, i.e. whose error rate is slightly under 50% on a binary problem, as illustrated below:

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=17UvX3obE1yj47eAdkZzrZaw7uw_6uLVu">
</p>

The forest of stumps allows for different weights on each stump: some stumps may have a higher weight than others in the final vote.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1hka_fKK0fEA1VyWIfsiTmdmtp3a8kLRB">
</p>

## III.2. Step by step algorithm

Boosting sequentially trains a series of low-performing decision trees (weak learners), re-weighting the training samples at each step so that each new tree focuses on the errors of the previous ones.

Let's say we have 2 features $x_1$ and $x_2$ and we want to predict $y$.

We will see step by step how AdaBoost works.

**Step 1**: Initialize the weight given to each sample

When we start our AdaBoost algorithm, we assign a weight $\frac {1}{n}$ to each of the $n$ samples.

Indeed, in AdaBoost, not only do the stumps have a weight: the samples have one too.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1ECDQCid0Qj_3IPwuXU5h193Yf2dhynH5">
</p>

**Step 2**: Find the best stump

After initializing the weights, the aim is to find the best stump, i.e. the feature split that minimizes the Gini impurity.

For example, here, we would typically split along the $x_1$ variable.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1otRVrfmOubEzB4N5NRzmQl1NASJHGWtq">
</p>
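As a rough sketch of this search, the code below tries every feature and every midpoint between consecutive values, and keeps the split with the lowest weighted Gini impurity. The helper names (`weighted_gini`, `best_stump`), the choice of midpoints as candidate thresholds and the toy data are assumptions made for the example, not a reference implementation.

```python
def weighted_gini(labels, weights):
    """Gini impurity where each sample counts for its weight."""
    total = sum(weights)
    if total == 0:
        return 0.0
    mass = {}
    for label, w in zip(labels, weights):
        mass[label] = mass.get(label, 0.0) + w
    return 1.0 - sum((m / total) ** 2 for m in mass.values())

def best_stump(X, y, weights):
    """Return (impurity, feature_index, threshold) of the best single split."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [i for i, row in enumerate(X) if row[j] <= threshold]
            right = [i for i, row in enumerate(X) if row[j] > threshold]
            impurity = 0.0
            for side in (left, right):
                side_w = [weights[i] for i in side]
                side_y = [y[i] for i in side]
                # Each side contributes in proportion to its total weight
                impurity += sum(side_w) * weighted_gini(side_y, side_w)
            if best is None or impurity < best[0]:
                best = (impurity, j, threshold)
    return best

# Toy data with two features (x1, x2); only x1 separates the classes
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [6.0, 2.0], [7.0, 6.0], [8.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]
weights = [1 / len(X)] * len(X)
impurity, feature, threshold = best_stump(X, y, weights)
```

On this toy data, the best stump splits on the first feature at 4.5, which separates the two classes perfectly.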

**Step 3**: Assign a weight to this stump

As said before, **each stump has a different weight** in the final vote.

This weight depends on the classification performance of the stump: the more often the stump is right, the larger its weight.

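In our example, the stump misclassified 1 sample out of 12.

First, we need to compute the total error $\epsilon_t$ of the **stump $t$**, i.e. the sum of the weights of the samples it misclassifies. Here, a single sample of weight $\frac {1}{12}$ is misclassified:

$$
\epsilon_t = \sum_{i \,:\, h_t(x_i) \neq y_i} w_t(i) = \frac {1}{12}
$$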

Then, we compute the **amount of say** $\alpha_t$, i.e. the weight of the stump in the final vote:

$$ \alpha_t = \frac {1} {2} \ln \frac {1-\epsilon_{t}} {\epsilon_{t}} $$

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1V2Bo7c6M2rUEs0-6khO452X7gqFrNlcy">
</p>

We can distinguish three cases:
- if the stump is no better than a random classifier (an accuracy of 50%), its amount of say is 0
- if the total error is low (close to 0), the amount of say is large and positive
- conversely, if the total error is large (close to 1), the amount of say is large and **negative**: the vote of the stump then counts in reverse

This amount of say $\alpha_t$ is the weight of the stump $t$ in the final vote.
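The three cases above can be checked numerically with a short sketch. The function name `amount_of_say` is made up, and clipping the error away from 0 and 1 (the `eps` argument) is an assumption added here to avoid a division by zero; it is not part of the formula itself.

```python
import math

def amount_of_say(total_error, eps=1e-10):
    """alpha = 0.5 * ln((1 - error) / error), with the error clipped to [eps, 1 - eps]."""
    error = min(max(total_error, eps), 1 - eps)
    return 0.5 * math.log((1 - error) / error)

# Total error: sum of the weights of the misclassified samples
weights = [1 / 12] * 12
misclassified = [False] * 11 + [True]
total_error = sum(w for w, wrong in zip(weights, misclassified) if wrong)

alpha_good = amount_of_say(total_error)   # error of 1/12: positive amount of say
alpha_random = amount_of_say(0.5)         # coin flip: amount of say of 0
alpha_bad = amount_of_say(11 / 12)        # mostly wrong: negative amount of say
```

For our example with $\epsilon_t = \frac{1}{12}$, this gives $\alpha_t = \frac{1}{2} \ln 11 \approx 1.2$, and the symmetric error of $\frac{11}{12}$ gives the opposite value.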

**Step 4**: Update the weights of incorrectly classified samples

Now, we need to **update the weights of misclassified samples**.

By increasing their weight, the next stump will be more likely to correct those misclassifications.

To compute the weight $w_{t+1} (i)$ of a misclassified sample $i$ for the stump $t+1$, we again use the amount of say:

$$
\large w_{t+1}(i) = w_{t}(i) e^{\alpha_t } 
$$

Let's have a look at the function $e^{\text{amount of say}}$ to understand:

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1cYmpoHcW771OwP7WqQDv1PpCdhn1nDae">
</p>

With this update, the larger the amount of say, the more the weights of the misclassified samples increase.

Another way of saying it: 

> if a classifier was performing very well except for one sample, this one sample will have a really large weight for the next stump.

**Step 5**: Update the weights of correctly classified samples

Similarly, we need to **update the weights of the correctly classified samples**.

This is the same idea as before, the other way around: their weights decrease.

$$ 
\large w_{t+1}(i) = { w_{t}(i) } e ^{ - \alpha_t } 
$$


<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1jcghhM5RXLevq2zQCy-SO5K0odmrqiLh">
</p>

> If the classifier is good, and thus has a large amount of say, we will lower the weight of the correctly classified samples.

**Step 6**: Normalize the weights so that they sum to 1

After these updates, the weights of the samples no longer sum to 1. We just need to normalize them, dividing each weight by the sum of all the weights:

$$
w_{t+1}(i) \leftarrow \frac {w_{t+1}(i)} {\sum_j {w_{t+1}(j)}}
$$

We end up with a new dataset looking like this:

<center>
<img src="https://drive.google.com/uc?export=view&id=1TQmipTp6cpMk2e1n-1HkT4GwrqstPO3W" width=600>
</center>
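Steps 4 to 6 can be sketched together on our example of 12 samples with a single misclassification. The function name `update_weights` is hypothetical; the formulas are the ones given above.

```python
import math

def update_weights(weights, misclassified, alpha):
    """Scale weights by e^{+alpha} (misclassified) or e^{-alpha} (correct), then normalize."""
    scaled = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]

n = 12
weights = [1 / n] * n
misclassified = [False] * 11 + [True]            # the stump got only the last sample wrong
alpha = 0.5 * math.log((1 - 1 / 12) / (1 / 12))  # amount of say for an error of 1/12

new_weights = update_weights(weights, misclassified, alpha)
```

After normalization, the single misclassified sample carries a weight of 0.5, while each of the 11 correctly classified samples is left with $\frac{1}{22}$: the next stump can no longer afford to ignore that sample.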

**Step 7**: Rebalance the samples

Using the updated sample weights, we can go to the next iteration and fit the next stump.

There are 2 ways to handle the new imbalanced weights:
- Apply weighted Gini 
- Create duplicates of the observations by considering the weights as a distribution

To create duplicates using the weights as a distribution, we compute the cumulative sum of the weights, randomly pick a number between 0 and 1, and select the observation whose cumulative interval contains that number.

We do this until the new collection is the same size as the original.

A drawback of this approach is that some samples of the original dataset will be missing from the new one.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1-GmuTb_gbkl2aUr8aX00n9EPBqSG4h0S">
</p>
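The resampling procedure can be sketched with the standard library as follows. The helper `resample_by_weight` and its `seed` argument are assumptions made for the example; it returns the indices of the selected observations rather than the observations themselves.

```python
import bisect
import itertools
import random

def resample_by_weight(weights, seed=0):
    """Draw len(weights) indices, each with probability proportional to its weight."""
    rng = random.Random(seed)
    cumulative = list(itertools.accumulate(weights))
    n = len(weights)
    indices = []
    for _ in range(n):
        # Random number in [0, total weight), located in the cumulative sum
        u = rng.random() * cumulative[-1]
        index = bisect.bisect_right(cumulative, u)
        indices.append(min(index, n - 1))  # guard against floating-point edge cases
    return indices

# Weights from the running example: one sample at 0.5, eleven samples at 1/22
weights = [1 / 22] * 11 + [0.5]
indices = resample_by_weight(weights)
```

The heavily weighted sample (index 11) is expected to be drawn about half of the time, so it shows up as several duplicates while some of the other samples disappear from the new collection.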

**Step 8**: Compute the next stump

The next stump will then have to pay particular attention to correctly classifying the samples with a large weight.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1X_R3IQjypKHJ3VgnBFSZ947mLdBvJrLh">
</p>

**Step 9**: Iterate and build a forest of stumps

Repeat this process until the predefined number of iterations (the `n_iter` hyperparameter) has been reached.

**Step 10**: Once trained, classify with the forest of stumps

We just have to look at the amount of say and the prediction of each stump:
- we sum the amount of say $\alpha_t$ of the stumps that predict the new observation as class 1
- we sum the amount of say $\alpha_t$ of the stumps that classify the new observation as class 0

And finally predict the class that has the largest summed amount of say.

In other words, we compute a weighted average of the predictions $h_t(x) \in \{0, 1\}$ of the stumps, using their amounts of say $\alpha_t$ as weights:

$$
H(x) = \frac{\sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}
$$

And finally, we apply a threshold of 0.5 as usual:
- if $H(x) > 0.5$, we predict class 1
- otherwise, we predict class 0
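Summing the amount of say per class can be sketched as follows. The function name `predict_weighted_vote` and the values of `alphas` are made up for the illustration.

```python
def predict_weighted_vote(stump_predictions, alphas):
    """Return the class (0 or 1) with the largest summed amount of say."""
    say_for_1 = sum(a for p, a in zip(stump_predictions, alphas) if p == 1)
    say_for_0 = sum(a for p, a in zip(stump_predictions, alphas) if p == 0)
    return 1 if say_for_1 > say_for_0 else 0

# Three stumps: one strong (alpha = 1.2) and two weaker ones
alphas = [1.2, 0.4, 0.3]
strong_says_1 = predict_weighted_vote([1, 0, 0], alphas)
strong_says_0 = predict_weighted_vote([0, 1, 1], alphas)
```

In both cases the strong stump wins against the two weaker ones (1.2 against 0.7), which would not happen with a plain majority vote.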

### To summarize:

- Train the stump on the samples → $h_1$
- Train the stump with overweighted data on the regions in which $h_1$ performs poorly → $h_2$
- Train the stump with overweighted data on the regions in which $h_1$ ≠ $h_2$ → $h_3$
- ...

Instead of training the decision trees in **parallel** (as in bagging), we can train them **sequentially**.


<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1zxEJ_AoRCV_lwvLitK0stNSsnJa5ko7_">
</p>

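AdaBoost can also be generalized to regression, by aggregating the outputs of the forest of stumps, much as we did for random forests.

Note that the aggregation is usually weighted: scikit-learn's `AdaBoostRegressor`, for instance, takes a weighted median of the individual predictions rather than a plain average.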

# IV. Implementation

The AdaBoost classifier is available in scikit-learn, with the following signature:
```python
class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, *, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
```

With the following hyperparameters:
- `base_estimator=None`: the weak learner; by default, it is a decision tree with a max depth of 1, but other estimators can be provided (this parameter is named `estimator` since scikit-learn 1.2)
- `n_estimators=50`: the maximum number of boosting iterations, i.e. of stumps, by default 50
- `learning_rate=1.0`: though not covered above, a learning rate can be applied to shrink the weight of each stump in the final vote. It defaults to 1, meaning no shrinkage is applied.

Example on Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load and split the data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

```python
from sklearn.preprocessing import StandardScaler

# Rescale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

```python
from sklearn.ensemble import AdaBoostClassifier

# Use a default AdaBoost (stumps as weak learners)
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)
ada.score(X_test, y_test)
```

0.9333333333333333

```python
from sklearn.tree import DecisionTreeClassifier

# Use an AdaBoost with decision trees with max depth of 2
# Note: scikit-learn >= 1.2 names this parameter `estimator`
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2))
ada.fit(X_train, y_train)
ada.score(X_test, y_test)
```

0.9666666666666667

```python
from sklearn.linear_model import LogisticRegression

# Use an AdaBoost with a logistic regression model as weak learner
# Note: scikit-learn >= 1.2 names this parameter `estimator`
ada_lr = AdaBoostClassifier(base_estimator=LogisticRegression())
ada_lr.fit(X_train, y_train)
ada_lr.score(X_test, y_test)
```

0.8333333333333334