# What are Ensemble Methods?

> **Ensemble methods are meta-algorithms that combine several machine learning techniques (weak learners) into one predictive model in order to get better results.**

The main hypothesis is that when weak models are combined correctly, we can obtain more accurate and/or more robust models.

## Remember the Bias/Variance Tradeoff?

What do we want? $\longrightarrow$ **a model with low bias and low variance**

The difficulty is that bias and variance tend to move in opposite directions: reducing one usually increases the other.
To obtain good results, a model should have
* enough degrees of freedom to capture the underlying patterns in the data (low bias)
* not too many degrees of freedom, to avoid high variance (be more robust)

This is the bias-variance tradeoff.

![b_v](../Images/bias_variance.png)
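
For squared-error loss the tradeoff can be written explicitly. As a sketch, assume the data are generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^{2}$, and let $\hat{f}$ be the fitted model (these symbols are introduced here for illustration only). Taking the expectation over training sets and noise, the expected error at a point $x$ decomposes as:

$$
\mathbb{E}\left[(y-\hat{f}(x))^{2}\right]=\underbrace{\left(\mathbb{E}[\hat{f}(x)]-f(x)\right)^{2}}_{\text{bias}^{2}}+\underbrace{\mathbb{E}\left[\left(\hat{f}(x)-\mathbb{E}[\hat{f}(x)]\right)^{2}\right]}_{\text{variance}}+\underbrace{\sigma^{2}}_{\text{irreducible noise}}
$$

Only the first two terms depend on the model, which is why the methods below target bias, variance, or both.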

# So, what kinds of models are considered weak learners?

Weak learners (or base models) are models that do not perform well on their own, because they have either
* high bias (few degrees of freedom), or
* high variance (many degrees of freedom)

A strong learner is an ensemble (combination) of weak learners that reduces bias and/or variance and thus performs better.


### Homogeneous and Heterogeneous Ensembles

* **Homogeneous ensemble**: a single base learning algorithm is used for all weak learners
* **Heterogeneous ensemble**: different types of base learning algorithms are used


### Sequential and Parallel ensemble methods

* **Sequential ensemble methods**: the base learners are generated sequentially (e.g. AdaBoost). The motivation is to exploit the dependence between the base learners: overall performance can be boosted by giving higher weight to previously misclassified examples.
* **Parallel ensemble methods**: the base learners are generated in parallel (e.g. Random Forest). The motivation is to exploit the independence between the base learners, since averaging independent errors can reduce the overall error substantially.


### Coherence between the models and aggregation method

High-bias, low-variance models should be aggregated with a method that tries to reduce bias (such as boosting).
High-variance, low-bias models should be aggregated with a method that tries to reduce variance (such as bagging).

![ens_overview](../Images/ensemble_overview.png)

## Three Main Meta-Algorithms for Ensemble Learning

* **Bagging** 
    * aims to decrease the variance
    * considers homogeneous weak-learners
    * learns in parallel
    * combines with a deterministic averaging process
* **Boosting** 
    * aims to decrease bias
    * considers homogeneous weak-learners
    * learns sequentially
    * combines with a deterministic method
* **Stacking**
    * can decrease both bias and variance
    * considers heterogeneous weak learners
    * learns in parallel
    * combines by training a meta-model

# Bagging

#### *(stands for “bootstrap aggregating”)*

### What is Bootstrapping?

> **In statistics, bootstrapping is any test or metric that relies on random sampling with replacement.**

We generate samples of size B (called bootstrap samples) from an initial dataset of size N by randomly drawing B observations with replacement.

![btstrp](../Images/bootstrap.png)
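
Drawing one bootstrap sample can be sketched in a few lines of Python. The function name `bootstrap_sample` and the toy dataset are made up for illustration; `random.Random.choices` from the standard library performs the sampling with replacement.

```python
import random


def bootstrap_sample(data, size, rng):
    """Draw `size` observations from `data` uniformly at random, with replacement."""
    return rng.choices(data, k=size)


rng = random.Random(0)       # fixed seed so the sketch is reproducible
dataset = list(range(10))    # toy dataset, N = 10
sample = bootstrap_sample(dataset, size=5, rng=rng)  # one bootstrap sample, B = 5
```

Because the draws are made with replacement, the same observation may appear more than once in `sample`.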

Samples generated by bootstrapping can be treated as approximately representative and independent samples of the true data distribution (almost i.i.d. samples).
For this approximation to be valid, two hypotheses must hold:
1. **representativity** - N should be large enough to capture the full complexity of the underlying distribution
2. **independence** - N >> B, so that the samples are not too correlated

Bootstrapping is often used to estimate the variance or a confidence interval of a statistical estimator.
![est_var](../Images/est_var.png)
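
As a minimal sketch of this use, the snippet below estimates the standard error and a percentile interval for the mean of a synthetic dataset. The helper names, the synthetic Gaussian data, and the 95% percentile interval are all assumptions chosen for illustration. Each resample here has the same size as the data, which is the usual choice when estimating the variability of a statistic.

```python
import random
import statistics


def bootstrap_estimates(data, estimator, n_resamples, rng):
    """Recompute `estimator` on `n_resamples` bootstrap samples of `data`."""
    n = len(data)
    return [estimator(rng.choices(data, k=n)) for _ in range(n_resamples)]


def percentile_interval(estimates, alpha=0.05):
    """Interval covering roughly a 1 - alpha share of the bootstrap estimates."""
    ordered = sorted(estimates)
    low = ordered[int((alpha / 2) * len(ordered))]
    high = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
    return low, high


rng = random.Random(42)
data = [rng.gauss(10.0, 2.0) for _ in range(200)]  # synthetic data, illustration only

means = bootstrap_estimates(data, statistics.mean, n_resamples=1000, rng=rng)
std_error = statistics.stdev(means)           # bootstrap estimate of the standard error
ci_low, ci_high = percentile_interval(means)  # approximate 95% interval for the mean
```

Any other estimator (median, a quantile, a model coefficient) can be passed in place of `statistics.mean`.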

### Bagging Step by Step
The training dataset is itself a sample drawn from some true, unknown underlying distribution, so a model fitted on it inherits that sampling variability (**theoretical variance of the training dataset**). With this in mind, the idea of bagging becomes more intuitive.
> We want to fit several independent models and “average” their predictions in order to obtain a model with a lower variance.

1. **Create multiple bootstrap samples so that each new bootstrap sample acts as another (almost) independent dataset drawn from the true distribution.**

$$
\left\{z_{1}^{1}, z_{2}^{1}, \ldots, z_{B}^{1}\right\},\left\{z_{1}^{2}, z_{2}^{2}, \ldots, z_{B}^{2}\right\}, \ldots,\left\{z_{1}^{L}, z_{2}^{L}, \ldots, z_{B}^{L}\right\}
$$

where $z_{b}^{l} \equiv$  $b^{th}$ observation of the $l^{th}$ bootstrap sample

2. **Fit a weak learner for each of these samples.**

$w_{1}(.), w_{2}(.), \ldots, w_{L}(.)$

3. **Aggregate them so that we, in effect, “average” their outputs and thus obtain an ensemble model with less variance than its components.**

    * simple average for regression problems
    $$
    s_{L}(.)=\frac{1}{L} \sum_{l=1}^{L} w_{l}(.)
    $$

    * hard voting (majority vote) for classification problems
    $$
    s_{L}(.)=\underset{k}{\arg \max }\left[\operatorname{card}\left(l | w_{l}(.)=k\right)\right]
    $$
    
    * soft voting (average the class probabilities returned by the models and keep the most probable class)
    $$
    s_{L}(.)=\underset{k}{\arg \max }\left[\frac{1}{L} \sum_{l=1}^{L} P\left(w_{l}(.)=k\right)\right]
    $$
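
The three aggregation rules above can be sketched as plain Python functions. The function names and the example model outputs are hypothetical, and each model's class probabilities are assumed to be given as a dictionary.

```python
from collections import Counter


def average(predictions):
    """Simple average of numeric predictions (regression)."""
    return sum(predictions) / len(predictions)


def hard_vote(predicted_labels):
    """Majority vote: the class predicted by the largest number of base models."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(predicted_probas):
    """Average the per-class probabilities across models, then take the arg max.

    `predicted_probas` holds one dict {class: probability} per base model.
    """
    classes = predicted_probas[0].keys()
    mean_proba = {
        k: sum(p[k] for p in predicted_probas) / len(predicted_probas)
        for k in classes
    }
    return max(mean_proba, key=mean_proba.get)


# Hypothetical outputs of L = 3 base models for a single input.
regression_outputs = [2.0, 3.0, 4.0]
labels = ["a", "b", "b"]
probas = [{"a": 0.9, "b": 0.1}, {"a": 0.4, "b": 0.6}, {"a": 0.45, "b": 0.55}]

mean_prediction = average(regression_outputs)
hard_winner = hard_vote(labels)
soft_winner = soft_vote(probas)
```

In this toy example the two voting rules disagree: two of the three models prefer class `"b"`, so hard voting returns `"b"`, while the first model is confident enough in `"a"` that the averaged probabilities favour `"a"`.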

Under our hypotheses, the bootstrap samples are approximately i.i.d. $\Longrightarrow$ the learned base models are approximately i.i.d. too $\Longrightarrow$ averaging them does not change the expected value, but reduces the variance.
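
The effect on variance can be made concrete with a short calculation. As a sketch, assume that at a fixed input every base model's prediction has the same variance $\sigma^{2}$ and every pair of models has the same correlation $\rho$ (both symbols are introduced here for illustration only). Then:

$$
\operatorname{Var}\left(\frac{1}{L} \sum_{l=1}^{L} w_{l}(.)\right)=\frac{1}{L^{2}}\left[L \sigma^{2}+L(L-1) \rho \sigma^{2}\right]=\rho \sigma^{2}+\frac{1-\rho}{L} \sigma^{2}
$$

For truly independent models ($\rho=0$) this is $\sigma^{2} / L$, so the variance shrinks as $L$ grows. Bootstrap samples come from the same dataset, so in practice $\rho>0$ and the term $\rho \sigma^{2}$ is a floor that adding more models cannot remove.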


## Random Forests

> **A random forest is a model made up of many decision trees. Rather than simply averaging the predictions of the trees (which we could call a “forest”), this model uses two key concepts that give it the name random:**

1. Random sampling of training data points when building trees
2. Random subsets of features considered when splitting nodes

Trees that compose a forest can be chosen to be either **shallow** (few levels) or **deep** (many levels, if not fully grown). Shallow trees have less variance but higher bias, which makes them a better choice for the sequential methods described later. Deep trees, on the other hand, have low bias but high variance, and so are relevant choices for bagging, which mainly focuses on reducing variance.

![rf](../Images/rand_forest.png)

**The random forest combines hundreds or thousands of decision trees, trains each one on a slightly different set of the observations, and splits nodes in each tree considering only a limited number of the features. The final predictions of the random forest are made by averaging the predictions of each individual tree.**
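
The two sources of randomness can be sketched on their own, without the tree-building code. The helper names below are hypothetical, and taking the square root of the number of features is assumed here only as a common heuristic for the subset size, not as a rule stated in these notes.

```python
import math
import random


def bootstrap_rows(n_rows, rng):
    """Row indices used to train one tree: n_rows draws with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, max_features, rng):
    """Feature indices considered at one split: drawn without replacement."""
    return rng.sample(range(n_features), max_features)


rng = random.Random(0)
n_rows, n_features = 8, 9
max_features = int(math.sqrt(n_features))  # assumed heuristic for the subset size

rows_for_tree = bootstrap_rows(n_rows, rng)  # drawn once per tree
features_for_split = candidate_features(n_features, max_features, rng)  # redrawn at every split
```

Restricting each split to a random subset of features makes the trees less similar to one another, which helps the averaging step reduce variance.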

# Boosting

>Boosting consists of iteratively fitting a weak learner, aggregating it into the ensemble model, and “updating” the training dataset to better take into account the strengths and weaknesses of the current ensemble model when fitting the next base model.

![boosting](../Images/boosting.png)

### Boosting Algorithms

### 1. Adaboost
#### *(stands for “adaptive boosting”)*

![adaboost](../Images/adaboost.png)
![samme](../Images/samme.png)
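
As a rough illustration of the reweighting idea, here is a minimal sketch of binary AdaBoost with labels in {-1, +1} and one-dimensional decision stumps as weak learners. It follows the standard weight update with $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$; the function names and the toy data are made up for illustration, and the multi-class SAMME variant is not covered.

```python
import math


def stump_predict(x, threshold, polarity):
    """A stump predicts `polarity` when x > threshold and `-polarity` otherwise."""
    return polarity if x > threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the lowest-error stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost_fit(xs, ys, n_rounds):
    """Fit n_rounds stumps; returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n  # start from uniform weights
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points gain weight, correct ones lose weight, then renormalise.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted sum of the stump predictions."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data with a + - + pattern that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1]
model = adaboost_fit(xs, ys, n_rounds=3)
predictions = [adaboost_predict(model, x) for x in xs]
```

On this toy data every single stump misclassifies at least three points, while the weighted combination of three stumps classifies all nine training points correctly.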


## For more details check these papers:

* [A Short Introduction to Boosting (Freund, Schapire)](https://cseweb.ucsd.edu/~yfreund/papers/IntroToBoosting.pdf)
* [Multi-class AdaBoost (Zhu, Rosset, Zou, Hastie)](https://web.stanford.edu/~hastie/Papers/samme.pdf)

## Literature

* https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
* https://scikit-learn.org/stable/modules/ensemble.html#forest
* https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76
* https://en.wikipedia.org/wiki/Ensemble_learning