
# Ensemble Learning

<!-- TOC START min:2 max:4 link:true asterisk:true update:true -->
* [What you will learn in this course](#what-you-will-learn-in-this-course)
  * [Voting / Averaging](#voting--averaging)
  * [Stacking](#stacking)
<!-- TOC END -->



## What you will learn in this course ##

This course introduces two widely used ensemble techniques in data science, averaging/voting and stacking, which can make your models more stable and improve their performance. Both are simple to implement in practice and require very few new commands.

### Voting / Averaging

We speak of a **voting classifier** in the context of a classification problem, when the target variable is qualitative. The idea is simple: instead of training a single model to predict the target variable, we train several different models (for example, a Naive Bayes, an SVM and a Random Forest). Each model gives a prediction $\hat{y}_1,\hat{y}_2,\hat{y}_3$ and obtains a performance score $S_1,S_2,S_3$ (for example, its accuracy on validation data). The voting classifier is an aggregated model whose predictions are computed through a weighted vote of the different models: each model votes for the class it predicts, and its vote is weighted by its score, so the better a model performs, the more its vote counts.
Mathematically one can write: $\hat{y}=\arg\max_{m\in M}\left(S_1\times\mathbf{1}(\hat{y}_1=m)+S_2\times\mathbf{1}(\hat{y}_2=m)+S_3\times\mathbf{1}(\hat{y}_3=m)\right)$
 where $M$ is the set of classes (modalities) that $y$ can take, and $\mathbf{1}(\cdot)$ equals 1 when the condition holds and 0 otherwise.
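
As an illustration of the weighted vote, here is a minimal NumPy sketch for a single observation. The function name `weighted_vote`, the class labels and the scores are all made up for the example.

```python
import numpy as np

def weighted_vote(predictions, scores, classes):
    """Return the class with the largest score-weighted vote.

    predictions: class predicted by each model for ONE observation
    scores: weight S_i of each model (hypothetical values here)
    classes: the set M of possible classes
    """
    predictions = np.asarray(predictions)
    scores = np.asarray(scores, dtype=float)
    # For each class m, sum the scores of the models that voted for m.
    totals = [scores[predictions == m].sum() for m in classes]
    return classes[int(np.argmax(totals))]

classes = ["cat", "dog"]
# One strong model votes "cat" (0.9); two weaker ones vote "dog" (0.4 + 0.3).
winner = weighted_vote(["cat", "dog", "dog"], [0.9, 0.4, 0.3], classes)
print(winner)  # cat
```

Here the single high-scoring model outweighs the two weaker ones, which an unweighted majority vote would not allow.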

Averaging corresponds to the regression setting, when the target variable is quantitative. Here too we train several different models (for example, a linear regression, an SVM and a Random Forest). Each model gives a prediction for $y$, $\hat{y}_1,\hat{y}_2,\hat{y}_3$, and obtains a score $S_1,S_2,S_3$. The final prediction is the score-weighted average of the models' predictions: $\hat{y}=\frac{1}{S_1+S_2+S_3}\times(S_1\times\hat{y}_1+S_2\times\hat{y}_2+S_3\times\hat{y}_3)$.
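
A small numeric sketch of the score-weighted average, with made-up predictions and scores for three models on one observation:

```python
import numpy as np

# Hypothetical predictions of three regressors for one observation
predictions = np.array([10.0, 12.0, 20.0])
# Hypothetical scores S_1, S_2, S_3 of the three models
scores = np.array([0.8, 0.7, 0.5])

# Score-weighted average: sum(S_i * y_hat_i) / sum(S_i)
y_hat = np.sum(scores * predictions) / np.sum(scores)
print(y_hat)

# np.average computes the same quantity when given weights
same = np.average(predictions, weights=scores)
```

The result sits closer to the predictions of the higher-scoring models than the plain mean (14.0) would.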

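You can create a voting classifier in Python with scikit-learn's `VotingClassifier`. Its weights are not computed from the models' scores automatically; you supply them through the `weights` parameter. See: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html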
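
A minimal sketch of a voting classifier with scikit-learn, assuming the Iris dataset as toy data. The estimator names (`"nb"`, `"svm"`, `"rf"`) and the weights are arbitrary choices for illustration, not recommended values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

voting_clf = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("svm", SVC()),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="hard",      # each model votes for a class label
    weights=[1, 1, 2],  # hand-picked weights (assumption for this example)
)
voting_clf.fit(X_train, y_train)

accuracy = voting_clf.score(X_test, y_test)
print(f"Voting classifier accuracy: {accuracy:.3f}")
```

To weight each model by its performance, one option is to estimate the scores beforehand (for instance by cross-validation) and pass them as `weights`.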

For regression, scikit-learn provides `VotingRegressor`: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html#sklearn.ensemble.VotingRegressor
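
A comparable sketch for the regression case, assuming synthetic data from `make_regression`. The weights are again arbitrary; the last lines check by hand that the ensemble prediction is the weighted average of the fitted models' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

weights = [2, 1, 1]  # hand-picked weights (assumption for this example)
voting_reg = VotingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("svr", SVR()),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    weights=weights,
)
voting_reg.fit(X_train, y_train)
y_pred = voting_reg.predict(X_test)

# Recompute the weighted average from the individual fitted models
individual = np.column_stack([est.predict(X_test) for est in voting_reg.estimators_])
manual = np.average(individual, axis=1, weights=weights)
print(np.allclose(manual, y_pred))
```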


### Stacking

The idea of stacking is to train several layers of models in order to obtain a prediction for the target variable. This technique applies to regression as well as classification problems. The principle is simple. First, we split the training sample into two parts, called *part 1* and *part 2*, and train a number of different models on *part 1*. Once these first-level models are trained, we use them to predict the *part 2* observations and build a new training dataset in which the original explanatory variables are replaced by these predictions. A new model is then chosen and trained on this new dataset to predict the target variable. Splitting the training set in two ensures that the first-level and second-level models are trained on different observations, which limits overfitting. The second-level model can potentially perform better, since the explanatory variables it trains on are themselves predictions of $y$ and therefore carry information built specifically to describe $y$.
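
The two-part procedure can be sketched by hand with scikit-learn. This is only an illustration under assumptions: the Iris dataset, a 50/50 split, three arbitrary first-level models and a decision tree as second-level model are choices made for the example, and the helper `to_meta_features` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Split the training sample into part 1 and part 2
X_part1, X_part2, y_part1, y_part2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0, stratify=y_train
)

# First level: train several different models on part 1 only
first_level = [
    GaussianNB(),
    SVC(),
    RandomForestClassifier(n_estimators=100, random_state=0),
]
for model in first_level:
    model.fit(X_part1, y_part1)

def to_meta_features(models, X):
    """One column per first-level model, holding its predictions for X."""
    return np.column_stack([m.predict(X) for m in models])

# Second level: the predictions on part 2 replace the original variables
S_part2 = to_meta_features(first_level, X_part2)
meta_model = DecisionTreeClassifier(random_state=0)
meta_model.fit(S_part2, y_part2)

# To predict new data, go through both levels in turn
S_test = to_meta_features(first_level, X_test)
stacked_accuracy = meta_model.score(S_test, y_test)
print(f"Stacked accuracy: {stacked_accuracy:.3f}")
```

A tree is used as second-level model here because the meta-features are class labels; with probabilities as meta-features, a linear model would be a natural alternative.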

This describes two-level stacking: a first set of models (first level) and a final model (second level), also called a metamodel. The principle extends to more layers: it is sufficient to split the training set as many times as necessary, so that each new layer is trained on data that has not been used by any other layer.

To build a stacked model you can use the `stacking` function from the `vecstack` package (`from vecstack import stacking`), called as follows:

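```python
from sklearn.metrics import accuracy_score
from vecstack import stacking

# models, X_train, y_train and X_test are assumed to be defined beforehand.
# The function returns the first-level predictions for the train and test
# sets, which become the features of the second-level model.
S_train, S_test = stacking(
    models,
    X_train, y_train, X_test,
    regression=False,       # classification problem
    mode='oof_pred_bag',
    needs_proba=False,      # predict class labels, not probabilities
    save_dir=None,
    metric=accuracy_score,
    n_folds=4,
    stratified=True,
    shuffle=True,
    random_state=0,
    verbose=2,
)
```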

The stacking function takes the following arguments:

**models**: list of the instances of the models of the first layer

**X_train**, **y_train**, **X_test**: training features, training target, and test features

**regression**: boolean indicating whether it is a regression or classification problem

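**mode**: how the first-level predictions are produced. Here, `'oof_pred_bag'` uses the 'out of fold' technique: in each cross-validation fold, the model is trained on the other folds and predicts both the held-out part of the training set and the test set. The out-of-fold predictions form the new training features, and the test predictions from all folds are aggregated (mean, or most frequent class for labels).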

**needs_proba** : boolean indicating whether the first-level models output class probabilities instead of class labels

**save_dir**: path of an existing directory in which to save the results, or `None` to skip saving

**metric**: evaluation function used to score the cross-validation predictions (it does not change how the models are trained)

**n_folds**: number of folds for cross-validation

**stratified**: boolean allowing you to stratify your folds in relation to the modalities of the target variable or not

**shuffle**: boolean indicating whether to shuffle the observations before splitting them into folds

**random_state**: seed of the pseudo-random number generator, for reproducible results

**verbose** : int indicating the level of information returned when the command is executed, from 0 (no info) to 2 (maximum info).
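
To make the out-of-fold idea behind these arguments more concrete, here is a hand-written sketch using scikit-learn only. It is an approximation of the procedure as described above, not vecstack's actual implementation; the dataset, the two models and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = [GaussianNB(), RandomForestClassifier(n_estimators=50, random_state=0)]

# Counterpart of n_folds=4, stratified=True, shuffle=True, random_state=0
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

S_train = np.zeros((len(X_train), len(models)), dtype=int)
S_test = np.zeros((len(X_test), len(models)), dtype=int)

for j, model in enumerate(models):
    test_preds = []
    for fit_idx, oof_idx in folds.split(X_train, y_train):
        # Train on the other folds, predict the held-out ("out of fold") part
        fold_model = clone(model).fit(X_train[fit_idx], y_train[fit_idx])
        S_train[oof_idx, j] = fold_model.predict(X_train[oof_idx])
        test_preds.append(fold_model.predict(X_test))
    # Aggregate the test predictions of the fold models: most frequent class
    test_preds = np.array(test_preds)  # shape (n_folds, n_test)
    S_test[:, j] = [np.bincount(col).argmax() for col in test_preds.T]
    # Counterpart of metric=accuracy_score: score the out-of-fold predictions
    print(type(model).__name__, accuracy_score(y_train, S_train[:, j]))
```

`S_train` and `S_test` then play the role of the new training and test features for the second-level model.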
