- Bagging vs boosting
- Stumps and AdaBoost
- Gradient boosting for classification
- Gradient boosting for regression
- Implementations of gradient boosting

## Boosting, Adaptive Boosting, Gradient Boosting

![](https://images.unsplash.com/photo-1508796079212-a4b83cbf734d?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [César Couto](https://unsplash.com/photos/hNZ6WOOnQpk)

Today's lesson is about Gradient Boosting, a powerful Machine Learning algorithm. Like Random Forest, it can be used for either regression or classification! We will start off with a gentle reminder on Decision Trees and Bagging techniques, before moving on to Adaptive Boosting, Gradient Boosting and XGBoost/LightGBM/CatBoost.

# I. Reminder : Decision Trees and Random Forests

## I.1 Decision Trees

Recall that in a Decision Tree, each node performs a binary split on one feature, and that growing the depth of the tree makes the resulting decision boundary more complex.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1GxvuU-RU8afLzYZc2eegZnRtyUzoL4AF">
</p>

To build the tree, at each node we choose the feature that splits our data the best way possible. How do we measure the quality of a split?
- cross-entropy
- Gini impurity
- classification error

Note that in order to grow a decision tree on numeric data, we usually sort the observations by the value of each feature, compute the midpoint between every pair of successive values, and evaluate the split quality measure (e.g. Gini) using each midpoint as a candidate threshold.
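The threshold search described above can be sketched in a few lines of pure Python. This is a minimal illustration, not a reference implementation: the function names `gini` and `best_threshold` and the toy data are assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best split on one numeric feature."""
    pairs = sorted(zip(values, labels))          # order the data by feature value
    n = len(pairs)
    best = (None, float("inf"))
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue                             # identical values give no usable midpoint
        thr = (v1 + v2) / 2                      # midpoint between successive values
        left = [y for v, y in pairs if v <= thr]
        right = [y for v, y in pairs if v > thr]
        # impurity of the split = size-weighted average of the two child impurities
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (thr, score)
    return best

# Toy data (hypothetical): small values are class 0, large values are class 1
values = [2.0, 3.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
threshold, impurity = best_threshold(values, labels)
print(threshold, impurity)  # 6.5 0.0
```

On this toy feature, the midpoint 6.5 separates the two classes perfectly, so the split impurity drops to 0.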

We stop the development of the tree when splitting a node does not lower the impurity.

Recall that the Gini impurity can be defined as follows. Let $p_{i}$ be the fraction of items labeled with class $i$ in the set, with $J$ classes in total :
$$ I_G = 1 - \sum_{i = 1}^{J} {p_i}^2 $$
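
As a quick worked example (the numbers are illustrative, not taken from the lesson's figures), consider a node that contains 3 items of one class and 1 item of another, so that $p_1 = 0.75$ and $p_2 = 0.25$ :

$$ I_G = 1 - (0.75^2 + 0.25^2) = 1 - 0.625 = 0.375 $$

A pure node (a single class) gives $I_G = 0$, while a perfectly mixed two-class node gives $I_G = 1 - (0.5^2 + 0.5^2) = 0.5$.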

## I.2 Random Forest

In Random Forests, we build forests of Decision Trees in several key steps :

**Step 1** : Bootstrap Sampling

Pick *n* data points randomly, with replacement: the same point can be picked more than once in a given bootstrap sample. We build several bootstrap samples, say K.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1Cmh0STP2ZmfEnfe-BXuMLOjFgTeUKZDu">
</p>
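
A minimal sketch of bootstrap sampling, using only Python's standard library. The helper name `bootstrap_samples` and the fixed seed are assumptions made for the example.

```python
import random

def bootstrap_samples(data, k, seed=0):
    """Draw k bootstrap samples, each of the same size as data, with replacement."""
    rng = random.Random(seed)        # fixed seed only to make the example reproducible
    n = len(data)
    return [rng.choices(data, k=n) for _ in range(k)]

data = list(range(10))               # 10 hypothetical observations
samples = bootstrap_samples(data, k=3)
for sample in samples:
    print(sorted(sample))            # some points appear several times, others not at all
```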

**Step 2** : Decision Tree

Then, we build a decision tree for each bootstrap sample, and use only a random subset of the variables each time. This builds a wide variety of trees.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=10bRgs3x9O4CoGXhQ0r_teSGzXPq15-2C">
</p>

**Step 3** : Majority Vote

Finally, in prediction, we get a new sample, run it through all the decision trees, and apply a majority vote on the output of each tree.

We first **bootstrapped** the data, and then **aggregated** the results. This is called Bagging, which stands for Bootstrap Aggregating.

# II. The limits of Bagging

## II.1 Same region, same mistake

For what comes next, consider a binary classification problem. We are either classifying an observation as 0 or as 1. In the formulas, the two classes are encoded as $-1$ and $+1$, so that a vote reduces to taking a sign. This is not the purpose of the lesson, but for the sake of clarity, let’s recall the concept of bagging.

Bagging is a technique that stands for “Bootstrap Aggregating”. The essence is to select $T$ bootstrap samples and fit a classifier $h_t$ on each of them. Since the classifiers do not depend on each other, they can be trained in parallel. Typically, in a Random Forest, decision trees are trained in parallel. The results of all classifiers are then averaged into a bagging classifier (i.e. we select the majority vote).

$$ H_T(x) = \text{sign}\left(\frac{1}{T} \sum_{t=1}^{T} {h_t(x)}\right) $$
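
The bagging vote can be sketched as follows, assuming each classifier returns $-1$ or $+1$ (the encoding under which the sign formula is a majority vote). The three threshold classifiers `h1`, `h2`, `h3` are hypothetical and only serve the illustration.

```python
def sign(z):
    """Sign function; ties are broken toward +1 (an arbitrary choice for this sketch)."""
    return 1 if z >= 0 else -1

def bagging_predict(classifiers, x):
    """Unweighted vote: H_T(x) = sign(1/T * sum_t h_t(x))."""
    T = len(classifiers)
    return sign(sum(h(x) for h in classifiers) / T)

# Three hypothetical classifiers on a scalar input, each returning -1 or +1
h1 = lambda x: 1 if x > 0.3 else -1
h2 = lambda x: 1 if x > 0.5 else -1
h3 = lambda x: 1 if x > 0.7 else -1

print(bagging_predict([h1, h2, h3], 0.6))  # 1  (two votes for +1, one for -1)
print(bagging_predict([h1, h2, h3], 0.4))  # -1 (one vote for +1, two for -1)
```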

This process can be illustrated the following way. Let’s consider 3 classifiers which produce a classification result and can be either right or wrong. If we plot the results of the 3 classifiers, there are regions in which the classifiers will be wrong. These regions are represented in red.



<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1iF2nzHrwnMTXw7IDOh-l_UHvC80aF9jK">
</p>

This example works perfectly, since when one classifier is wrong, the two others are correct. By taking the majority vote, you achieve great accuracy! But as you might guess, there are also cases in which Bagging does not work properly: when all classifiers are mistaken in the same region.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1AXptTl3qP2HmaScqFpkXXME_V0dutUP9">
</p>
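
The two situations can be reproduced on a small made-up example. The prediction lists below are invented for the illustration (they are not the data behind the figures), and `majority_vote` and `accuracy` are helper names assumed for the sketch.

```python
def majority_vote(predictions):
    """Majority vote per observation over several lists of 0/1 predictions."""
    return [1 if 2 * sum(col) > len(col) else 0 for col in zip(*predictions)]

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

truth = [1, 1, 1, 0, 0, 0]

# Case 1: each classifier is wrong on a different observation
spread = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
]
# Case 2: all classifiers are wrong on the same observation (the first one)
shared = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
]

print(accuracy(majority_vote(spread), truth))  # 1.0: the vote corrects every error
print(accuracy(majority_vote(shared), truth))  # 5/6: the shared mistake survives the vote
```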

For this reason, the intuition behind the discovery of Boosting was the following :

- instead of training parallel models, one needs to **train models sequentially**
- and **each model should focus on where the previous classifier performed poorly**

## II.2 Same voting power

You might guess from above that another issue of Bagging is that each tree has the same voting power, i.e. we take the majority vote without giving more weight to the trees that usually perform better.

# III. Adaptive Boosting : AdaBoost

## III.1 AdaBoost vs. Random Forests

There are some key changes between AdaBoost and Random Forests :
- The decision trees now have a limited depth: a single split on one feature. Such trees are called **stumps**, and each one leads to a simple binary classification :

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1VwcO6eE83eHLzDIDGy8ncdLqUrny2Gq2">
</p>

- the stumps are called weak learners. Weak learners are algorithms that perform only slightly better than random guessing, i.e. whose error rate is slightly under 50%, as illustrated below :

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=17UvX3obE1yj47eAdkZzrZaw7uw_6uLVu">
</p>

- the forest of stumps allows for different weights on each stump 

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1hka_fKK0fEA1VyWIfsiTmdmtp3a8kLRB">
</p>

- the individual trees are no longer independent. The error of the first stump will influence the way we build the second stump

## III.2 How does it work ?

Boosting trains a series of low performing algorithms (weak learners), adjusting the error metric at each round by reweighting the observations.

Suppose that we have 2 features $x_1$ and $x_2$ and we want to predict $y$.

**Step 1** : Initialize the weight given to each observation

When we start our AdaBoost algorithm, we should assign a weight $\frac {1}{n}$ to each of the $n$ observations.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1ECDQCid0Qj_3IPwuXU5h193Yf2dhynH5">
</p>

**Step 2** : Find the best stump

Once the weights are initialized, our aim is to find the best stump, i.e. the variable and the threshold that minimize the Gini index. For example, here, we would typically observe a split along the $x_1$ variable.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1otRVrfmOubEzB4N5NRzmQl1NASJHGWtq">
</p>
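
A minimal sketch of this search, assuming a weighted Gini index as the split criterion and midpoints between successive feature values as candidate thresholds. The function names and the toy data are made up for the illustration; they are not the data shown in the figure.

```python
def weighted_gini(labels, weights):
    """Gini impurity where each observation counts for its weight."""
    total = sum(weights)
    if total == 0:
        return 0.0
    impurity = 1.0
    for c in set(labels):
        p = sum(w for y, w in zip(labels, weights) if y == c) / total
        impurity -= p ** 2
    return impurity

def best_stump(X, y, w):
    """Try every feature and midpoint threshold; return (feature, threshold, score)."""
    best = (None, None, float("inf"))
    total = sum(w)
    for j in range(len(X[0])):
        vals = sorted(set(row[j] for row in X))
        for lo, hi in zip(vals, vals[1:]):
            thr = (lo + hi) / 2
            left = [i for i, row in enumerate(X) if row[j] <= thr]
            right = [i for i, row in enumerate(X) if row[j] > thr]
            score = 0.0
            for side in (left, right):
                side_w = [w[i] for i in side]
                side_y = [y[i] for i in side]
                score += sum(side_w) / total * weighted_gini(side_y, side_w)
            if score < best[2]:
                best = (j, thr, score)
    return best

# Toy data (hypothetical): the classes are separated along the first feature only
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [6.0, 2.0], [7.0, 5.0], [8.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]
w = [1 / len(y)] * len(y)            # uniform weights, as in Step 1
feature, threshold, score = best_stump(X, y, w)
print(feature, threshold, score)     # 0 4.5 0.0
```

With uniform weights this is the usual Gini search; the weights only start to matter once they have been updated.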

**Step 3** : Assign a weight to the stump

Remember that each stump has a different weight in the final vote. For this reason, we must now determine what weight to apply to the stump we just built. 

The weight depends on the classification performance of the stump. Here, we incorrectly classified 1 observation out of 12.

First, we need to compute the total error, i.e. the sum of the weights of the misclassified observations. The weight of the single misclassified observation is $\frac {1}{12}$, so the total error is ${\epsilon_{t}} = \frac {1}{12}$. If 2 observations had been misclassified, the total error would be ${\epsilon_{t}} = \frac {2}{12}$.

Then, we compute the "amount of say", i.e the weight of the stump on the final vote :

  
$$ \alpha_t = \frac {1} {2} \ln \frac {1-\epsilon_{t}} {\epsilon_{t}} $$

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1V2Bo7c6M2rUEs0-6khO452X7gqFrNlcy">
</p>

If the stump is no better than a coin flip and predicts the right class in 1 case out of 2, its amount of say is 0. If the total error is low, the amount of say is large and positive. If the total error is above 0.5, the amount of say is negative, which amounts to inverting the vote of the stump.
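
The amount of say can be computed directly from the formula. In the sketch below, clipping the error away from 0 and 1 is an added safeguard against division by zero (it is not part of the lesson), and the function name `amount_of_say` is an assumption.

```python
import math

def amount_of_say(total_error, eps=1e-10):
    """alpha = 1/2 * ln((1 - error) / error), with the error clipped to (eps, 1 - eps)."""
    e = min(max(total_error, eps), 1 - eps)   # safeguard, not part of the formula
    return 0.5 * math.log((1 - e) / e)

print(amount_of_say(1 / 12))   # about 1.199: a good stump gets a large say
print(amount_of_say(0.5))      # 0.0: a coin flip gets no say
print(amount_of_say(0.9))      # negative: the stump's vote is inverted
```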

**Step 4** : Modify the weights of incorrectly classified samples

Now, we need to *modify the weights of the observations that were not well classified* so that the next stump takes the errors of the previous stump into account. How do we do this?

We increase the weight of each incorrectly classified sample :

$$ w_{t+1}(i) = w_{t}(i) e^{\alpha_t } $$

What does this mean? The new weight is the previous weight multiplied by $e^{\text{(amount of say)}}$.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1cYmpoHcW771OwP7WqQDv1PpCdhn1nDae">
</p>

If the amount of say of the previous classifier is large, meaning the classifier performed well, then the new weight of the observation will be much larger than the previous one. We can interpret it the following way: if the classifier was good overall but misclassified one observation, we add much more weight to this observation for the next stump.

**Step 5** : Modify the weights of correctly classified samples

Now, we need to *modify the weights of the observations that were correctly classified*. This is done by adding a minus sign in front of the amount of say in the equation.

$$ w_{t+1}(i) = { w_{t}(i) } e ^{ - \alpha_t } $$


<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1jcghhM5RXLevq2zQCy-SO5K0odmrqiLh">
</p>

What this means is that if the classifier was good and had a large amount of say, we strongly reduce the weight of the observations that were well classified. If the amount of say was small, the new sample weight will be just a little smaller than the previous one, i.e. we don't attach too much credibility to the job the previous stump did.

**Step 6** : Normalize the weights

The problem is that, after these updates, the weights of the observations no longer sum to 1. We simply need to normalize the weights the following way, where $n$ is the number of observations :

$$ w_{t+1}(i) \leftarrow \frac {w_{t+1}(i)} {\sum_{j=1}^{n} w_{t+1}(j)} = \frac {w_{t+1}(i)} {Z} $$


<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1TQmipTp6cpMk2e1n-1HkT4GwrqstPO3W">
</p>
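
Steps 4 to 6 can be sketched together on the running example of 12 observations with a single misclassification. The function name `update_weights` and the choice of the first observation as the misclassified one are assumptions made for the illustration.

```python
import math

def update_weights(weights, misclassified, alpha):
    """Multiply by e^{+alpha} (misclassified) or e^{-alpha} (correct), then normalize."""
    raw = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
    z = sum(raw)                                  # normalization constant Z
    return [r / z for r in raw]

n = 12
weights = [1 / n] * n                             # Step 1: uniform weights
misclassified = [True] + [False] * (n - 1)        # assume the first observation was wrong
total_error = 1 / n
alpha = 0.5 * math.log((1 - total_error) / total_error)

new_weights = update_weights(weights, misclassified, alpha)
print(round(new_weights[0], 4))   # 0.5    -> the misclassified observation
print(round(new_weights[1], 4))   # 0.0455 -> each correctly classified observation
print(round(sum(new_weights), 4)) # 1.0
```

Note that after normalization the misclassified observation carries half of the total weight, which is why the next stump cannot afford to get it wrong again.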

**Step 7** : Rebalance the observations

Using the updated observation weights, we are ready to fit the next stump! There are 2 ways to deal with the new unbalanced weights :
- use a weighted split criterion, for example a weighted Gini index
- or create duplicates of the observations by considering the weights as a distribution

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1-GmuTb_gbkl2aUr8aX00n9EPBqSG4h0S">
</p>

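For the second option, we compute the cumulative sum of the weights, draw a random number between 0 and 1, and select the observation whose cumulative-weight interval contains this number. We repeat this until the new collection is the same size as the original. Observations with a large weight are therefore likely to appear several times.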
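
A minimal sketch of this resampling step, using only the standard library. The function name `resample`, the fixed seed and the observation labels are assumptions made for the example; the weights reuse the running example (one observation at 0.5, eleven at 1/22).

```python
import bisect
import itertools
import random

def resample(observations, weights, seed=0):
    """Draw len(observations) items, each picked with probability given by its weight."""
    rng = random.Random(seed)                     # fixed seed for reproducibility only
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    new_collection = []
    for _ in range(len(observations)):
        u = rng.random() * total                  # random number in [0, total)
        idx = bisect.bisect_right(cumulative, u)  # interval that contains u
        idx = min(idx, len(observations) - 1)     # guard against rounding at the upper end
        new_collection.append(observations[idx])
    return new_collection

observations = ["obs%d" % i for i in range(12)]
weights = [0.5] + [1 / 22] * 11                   # the first observation was misclassified
print(resample(observations, weights))            # "obs0" is expected to appear several times
```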

**Step 8** : Compute the new stump

The new stump will then be paying much more attention to the way it classifies the large weight observations.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1X_R3IQjypKHJ3VgnBFSZ947mLdBvJrLh">
</p>

**Step 9** : Build a forest of stumps...

Iterate this process until the pre-defined `n_iter` has been reached.

**Step 10** : Classify with the forest of stumps

Alright, we now have a forest of stumps. How can we use it to classify a new observation?

Well, we'll simply look at the amount of say of each stump :
- we sum the amount of say of the stumps that classify the new observation as 1
- we then sum the amount of say of the stumps that classify the new observation as 0

We then assign the class that has the largest summed amount of say. With the two classes encoded as $+1$ and $-1$, this is written :

$$ H(x) = \text{sign}(\alpha_1 h_1(x) + \alpha_2 h_2(x) + ... + \alpha_T h_T(x)) $$

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1FBuu8-rJifE3ubbYGQRFte64nvpA4C0N">
</p>
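
The weighted vote can be sketched as follows, assuming each stump returns $+1$ (class 1) or $-1$ (class 0). The three stumps and their amounts of say are made-up values chosen to show how the result can differ from a plain majority vote.

```python
def weighted_vote(stumps, alphas, x):
    """H(x) = sign(sum_t alpha_t * h_t(x)); stumps return -1 or +1."""
    score = sum(alpha * h(x) for alpha, h in zip(alphas, stumps))
    return 1 if score >= 0 else -1

# Hypothetical forest: two weak stumps vote +1, one strong stump votes -1
stumps = [lambda x: 1, lambda x: 1, lambda x: -1]
alphas = [0.3, 0.4, 1.2]

# Summed say for +1 is 0.7, summed say for -1 is 1.2, so the prediction is -1,
# even though a plain majority vote would have answered +1.
print(weighted_vote(stumps, alphas, x=None))  # -1
```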

In summary :

- Train the stump on the observations → $h_1$
- Train a stump on data reweighted to exaggerate the regions in which $h_1$ performs poorly → $h_2$
- Train a stump on data reweighted to exaggerate the regions in which $h_1$ ≠ $h_2$ → $h_3$
- ...

Instead of training the models in **parallel**, we can train them **sequentially**. This is the essence of Boosting ! Boosting is also an ensemble technique.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1zxEJ_AoRCV_lwvLitK0stNSsnJa5ko7_">
</p>

## III.3 Pseudo-Code

Let's wrap up in a small pseudo-code what we covered so far.

*Step 1* : Let $w_{1}(i) = \frac {1} {N}$ for every observation $i$, where $N$ denotes the number of training samples, and let $T$ be the chosen number of iterations. The labels are encoded as $y_i \in \{-1, +1\}$.

*Step 2* : For $t = 1, ..., T$ :

  a. Pick $h_t$, the weak classifier that minimizes $\epsilon_{t}$. Here we use a weighted error metric, where $[\cdot]$ equals 1 if the condition holds and 0 otherwise :

  $$ \epsilon_{t} = \sum _{i=1}^{N} w_{t}(i)[y_{i}\neq h_t(x_{i})] $$

  b. Compute the weight of the stump :

  $$ \alpha_t = \frac {1} {2} \ln \frac {1-\epsilon_{t}} {\epsilon_{t}} $$

  c. Update the weights of the training examples $w_{t+1}(i)$, where $Z_t$ is the normalization constant that makes the new weights sum to 1 :

  $$ w_{t+1}(i) = \frac { w_{t}(i) } { Z_t } e ^{- \alpha_t y_i h_t(x_i)} $$

  And go back to step a).

*Step 3* : Output the final classifier :

$$ H(x) = \text{sign}(\alpha_1 h_1(x) + \alpha_2 h_2(x) + ... + \alpha_T h_T(x)) $$

And we're done! This algorithm is called **AdaBoost**. It is the most important algorithm to understand before moving on to Gradient Boosting, which we introduce next.
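
The pseudo-code can be turned into a small runnable sketch in pure Python. Several choices below are assumptions made for the illustration rather than part of the lesson: labels are encoded as $-1$/$+1$, stumps are searched over midpoint thresholds plus one threshold below the smallest value (so that a constant stump is available), the error is clipped to avoid division by zero, and the toy data is made up.

```python
import math

def stump_predict(stump, x):
    """A stump is (feature, threshold, polarity); it returns -1 or +1."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity

def fit_stump(X, y, w):
    """Step 2.a: return the stump with the lowest weighted error, and that error."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        vals = sorted(set(row[j] for row in X))
        thresholds = [vals[0] - 1.0] + [(lo + hi) / 2 for lo, hi in zip(vals, vals[1:])]
        for thr in thresholds:
            for polarity in (1, -1):
                stump = (j, thr, polarity)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

def adaboost_fit(X, y, T=10):
    n = len(X)
    w = [1 / n] * n                                   # Step 1: uniform weights
    ensemble = []
    for _ in range(T):                                # Step 2
        stump, err = fit_stump(X, y, w)               # 2.a: best weak classifier
        err = min(max(err, 1e-10), 1 - 1e-10)         # guard against log(0) (assumption)
        alpha = 0.5 * math.log((1 - err) / err)       # 2.b: amount of say
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]          # 2.c: reweight the observations
        z = sum(w)
        w = [wi / z for wi in w]                      # normalize so the weights sum to 1
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, x):
    """Step 3: sign of the weighted vote."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data (hypothetical) that no single stump can classify perfectly
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, T=3)
predictions = [adaboost_predict(model, x) for x in X]
print(predictions)  # [1, 1, -1, -1, 1, 1]
```

On this toy data each stump alone misclassifies at least two observations, but the weighted vote of three stumps recovers all six labels.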

AdaBoost was long considered to be one of the few algorithms that do not overfit. It has since been shown that it can overfit at some point, and one should be aware of it. AdaBoost is widely used in face detection, to assess whether there is a face in an image or video frame.

AdaBoost can also be generalized to regression, by combining the outputs of the forest of stumps through an average (typically weighted by the amount of say) instead of a vote.