
# Ensemble Methods

## Bagging

> *Means averaging slightly different versions of the same model to improve accuracy*

### Parameters that control bagging

* Changing the seed
* Row (sub)sampling or bootstrapping
* Shuffling
* Column (sub)sampling
* Model-specific parameters
* Number of models (or bags)
* (Optionally) parallelism

### Examples of bagging

> `BaggingClassifier` and `BaggingRegressor` from scikit-learn

```
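import numpy as np
from sklearn.ensemble import RandomForestRegressor

# train is the training data
# test is the test data
# y is the target variable

model = RandomForestRegressor()
bags = 10
seed = 1
# Accumulates the predictions of every bag
bagged_prediction = np.zeros(test.shape[0])
for n in range(bags):
    # A different seed per bag gives a slightly different model
    model.set_params(random_state=seed + n)
    model.fit(train, y)
    preds = model.predict(test)
    bagged_prediction += preds
# Take the average of the predictions
bagged_prediction /= bags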
```
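
The parameters listed above map fairly directly onto scikit-learn's `BaggingRegressor`. The following is only a sketch: the synthetic data from `make_regression` and the choice of a decision tree as the base model are assumptions made for illustration, not part of the original notes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for train / y / test (an assumption for illustration)
X, y_all = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
train, test, y = X[:150], X[150:], y_all[:150]

bagger = BaggingRegressor(
    DecisionTreeRegressor(),  # base model, passed positionally
    n_estimators=10,          # number of models (bags)
    max_samples=0.8,          # row subsampling
    bootstrap=True,           # sample rows with replacement
    max_features=0.8,         # column subsampling
    n_jobs=1,                 # parallelism (set to -1 to use all cores)
    random_state=1,           # seed
)
bagger.fit(train, y)
bagged_prediction = bagger.predict(test)
```

`BaggingRegressor` averages the predictions of its fitted models, which matches the manual loop above.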

## Boosting

> A form of weighted averaging of models where each model is built sequentially, taking into account the performance of the previous models.

### Weight based

[AdaBoost](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
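
A minimal usage sketch of the linked `AdaBoostClassifier`, assuming synthetic data from `make_classification` and arbitrary values for `n_estimators` and `learning_rate`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each round reweights the training rows so that the next weak learner
# focuses on the rows the previous ones misclassified.
clf = AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=0)
clf.fit(X[:200], y[:200])

accuracy = clf.score(X[200:], y[200:])
```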

### Residual based

* XGBoost
* LightGBM
* [H2O's GBM](https://github.com/h2oai/h2o-3)
* CatBoost
* scikit-learn's GBM
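
To illustrate what "residual based" means, here is a hand-rolled sketch for squared error: each new tree is fit to the residuals of the current ensemble. It is a simplification, not how any of the libraries above are implemented, and the data, tree depth, learning rate and number of rounds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

learning_rate = 0.1
n_rounds = 50

# Start from a constant model: the mean of the target
prediction = np.full(y.shape[0], y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    residuals = y - prediction                  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)                      # the next model targets the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)      # training error after boosting
```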

## Stacking

> Means making predictions with a number of models on a hold-out set and then training a different (meta) model on these predictions.

### Methodology
1. Split the train set into two disjoint sets
2. Train several base learners on the first part
3. Make predictions with the base learners on the second (validation) part
4. Use the predictions from (3) as the input to train a higher-level learner

```
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# train is the training data
# test is the test data
# y is the target variable for the train data

training,valid,ytraining,yvalid = train_test_split(train, y, test_size=0.5)

model1 = RandomForestRegressor()
model2 = LinearRegression()

model1.fit(training, ytraining)
model2.fit(training, ytraining)

preds1 = model1.predict(valid)
preds2 = model2.predict(valid)

test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)

stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))

meta_model = LinearRegression()
meta_model.fit(stacked_predictions, yvalid)

final_predictions = meta_model.predict(stacked_test_predictions)
```

### Pay attention
* With time-sensitive data, respect time
* Diversity is as important as performance (how different one model is from another)
* Diversity may come from:
   * Different algorithms
   * Different input features
* Performance plateaus after N models
* The meta model is normally modest (simple)

## StackNet

> A scalable meta-modelling methodology that uses stacking to combine multiple models in a neural network architecture of multiple levels.

### StackNet as a neural network
* In a neural network, every node is a simple linear model (like linear regression) with some non linear transformation.
* Instead of a linear model we could use any model

### How to train
* We cannot use Backpropagation (not all models are differentiable)
* We use stacking to link each model/node with target
* Use the K-Fold paradigm, which also lets us extend to many levels
* No epochs - different connections instead
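
The idea of linking models to the target through K-Fold stacking can be sketched with scikit-learn's `StackingRegressor`, used here as a stand-in rather than StackNet itself. The choice of base models, the `Ridge` meta model and the synthetic data are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=Ridge(alpha=1.0),  # the node of the next level
    cv=5,  # K-Fold predictions of the base models feed the final estimator
)
stack.fit(X[:200], y[:200])
test_predictions = stack.predict(X[200:])
```

More levels could in principle be added by using another `StackingRegressor` as the `final_estimator`.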


## Ensembling Tips and Tricks

### Level 1
* Diversity based on algorithms
   * 2-3 gradient boosted trees (lightgbm, xgboost, H2O, catboost)
   * 2-3 neural nets (keras, pytorch) (one with 1 hidden-layer, another with 2 and another with 3)
   * 1-2 ExtraTrees/Random Forest (sklearn)
   * 1-2 knn models (sklearn)
   * 1 Factorization machine (libfm)
   * 1 SVM with nonlinear kernel if size/memory allows (sklearn)
   
* Diversity based on input data:
   * Categorical features: One hot, label encoding, target encoding
   * Numerical features: outliers, binning derivatives, percentiles, scaling
   * Interactions: col1*/+-col2, groupby, unsupervised
   
### Subsequent level tips
* Simpler (or shallower) algorithms:
   * Gradient boosted trees with small depth (like 2 or 3)
   * Linear models with high regularization
   * ExtraTrees
   * Shallow networks (as in 1 hidden layer)
   * knn with Bray-Curtis distance
   * Brute forcing a search for best linear weights based on cv
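
The last bullet, brute forcing linear weights based on cv, could look like the sketch below for two models. The models, the grid of 21 weights, the MSE metric and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predictions of two first-level models
oof_rf = cross_val_predict(
    RandomForestRegressor(n_estimators=50, random_state=0), X, y, cv=cv
)
oof_lin = cross_val_predict(Ridge(alpha=1.0), X, y, cv=cv)

# Try every weight on a grid and keep the one with the lowest out-of-fold error
best_weight, best_mse = None, np.inf
for w in np.linspace(0.0, 1.0, 21):
    blend = w * oof_rf + (1.0 - w) * oof_lin
    mse = np.mean((y - blend) ** 2)
    if mse < best_mse:
        best_weight, best_mse = w, mse
```

Because the grid includes the weights 0 and 1, the blend is never worse on the out-of-fold predictions than the better of the two single models.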

### Feature engineering:
* pairwise differences between meta features
* row-wise statistics like averages or stds
* Standard feature selection techniques
* For every 7.5 models in previous level we add 1 in meta (subsequent layer)
* Be mindful of target leakage (keep the number of k-folds small)
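
The first two bullets above (pairwise differences and row-wise statistics) can be sketched with NumPy. The random `meta` matrix is a hypothetical stand-in for real meta features, with one column per previous-level model.

```python
from itertools import combinations

import numpy as np

# Hypothetical meta features: 6 rows, one column per previous-level model
rng = np.random.default_rng(0)
meta = rng.normal(size=(6, 3))

# Pairwise differences between every pair of meta features
pairwise_diffs = np.column_stack(
    [meta[:, i] - meta[:, j] for i, j in combinations(range(meta.shape[1]), 2)]
)

# Row-wise statistics across the models' predictions
row_mean = meta.mean(axis=1, keepdims=True)
row_std = meta.std(axis=1, keepdims=True)

# Original meta features plus the engineered ones
meta_extended = np.hstack([meta, pairwise_diffs, row_mean, row_std])
```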

### Software for stacking
* StackNet (https://github.com/kaz-anova/StackNet)
* Stacked ensembles from H2O
* Xcessiv (https://github.com/reiinakano/xcessiv)

### Interesting links:
[Important model parameters](https://github.com/kaz-Anova/StackNet/blob/master/parameters/PARAMETERS.MD)


# Other ideas

There are a number of ways to validate second-level models (meta-models). In this reading material you will find a description of the most popular ones. If not specified, we assume that the data does not have a time component. We also assume that the hyperparameters of the first-level models have already been validated and fixed.

### Simple holdout scheme

1. Split train data into three parts: partA, partB and partC.
2. Fit N diverse models on partA, predict for partB, partC and test_data, getting meta-features partB_meta, partC_meta and test_meta respectively.
3. Fit a meta-model to partB_meta while validating its hyperparameters on partC_meta.
4. When the meta-model is validated, fit it to [partB_meta, partC_meta] and predict for test_meta.
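
A sketch of the simple holdout scheme, keeping the names partA, partB, partC and test_meta from the description. The synthetic data, the two base models, the `Ridge` meta-model and its `alpha` grid are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the train and test data (an assumption)
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
train, test_data, y_train, _ = train_test_split(X, y, test_size=0.25, random_state=0)

# Split train into three disjoint parts
partA, rest, yA, y_rest = train_test_split(train, y_train, test_size=0.5, random_state=0)
partB, partC, yB, yC = train_test_split(rest, y_rest, test_size=0.5, random_state=0)

# Fit the first-level models on partA only
models = [RandomForestRegressor(n_estimators=50, random_state=0), LinearRegression()]
for m in models:
    m.fit(partA, yA)

def meta_features(data):
    # One column of predictions per first-level model
    return np.column_stack([m.predict(data) for m in models])

partB_meta, partC_meta, test_meta = map(meta_features, (partB, partC, test_data))

# Fit the meta-model on partB_meta, validate its hyperparameter on partC_meta
best_alpha, best_mse = None, np.inf
for alpha in (0.01, 1.0, 100.0):
    candidate = Ridge(alpha=alpha).fit(partB_meta, yB)
    mse = np.mean((yC - candidate.predict(partC_meta)) ** 2)
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

# Refit on [partB_meta, partC_meta] and predict for test_meta
meta_model = Ridge(alpha=best_alpha).fit(
    np.vstack([partB_meta, partC_meta]), np.concatenate([yB, yC])
)
final_predictions = meta_model.predict(test_meta)
```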

### Meta holdout scheme with OOF meta-features

1. Split train data into K folds. Iterate through each fold: retrain N diverse models on all folds except the current fold, predict for the current fold. After this step, for each object in train_data we will have N meta-features (also known as out-of-fold predictions, OOF). Let's call them train_meta.
2. Fit models to the whole train data and predict for test data. Let's call these features test_meta.
3. Split train_meta into two parts: train_metaA and train_metaB. Fit a meta-model to train_metaA while validating its hyperparameters on train_metaB.
4. When the meta-model is validated, fit it to train_meta and predict for test_meta.
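
The part of this scheme that produces train_meta and test_meta can be sketched as follows. It covers only the out-of-fold step and the refit on the whole train data, leaving the meta-model split out. The synthetic data, the two models and K = 5 are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data standing in for train_data and test_data (an assumption)
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
train_data, y_train, test_data = X[:200], y[:200], X[200:]

models = [RandomForestRegressor(n_estimators=50, random_state=0), LinearRegression()]
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold predictions: every row is predicted by models that never saw it
train_meta = np.zeros((train_data.shape[0], len(models)))
for fit_idx, oof_idx in kf.split(train_data):
    for j, model in enumerate(models):
        fold_model = clone(model).fit(train_data[fit_idx], y_train[fit_idx])
        train_meta[oof_idx, j] = fold_model.predict(train_data[oof_idx])

# Refit every model on the whole train data to get the test meta-features
test_meta = np.column_stack(
    [clone(m).fit(train_data, y_train).predict(test_data) for m in models]
)
```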

### Meta KFold scheme with OOF meta-features

1. Obtain OOF predictions train_meta and test meta-features test_meta using steps 1 and 2 of the Meta holdout scheme with OOF meta-features.
2. Use a KFold scheme on train_meta to validate hyperparameters for the meta-model. A common practice is to fix the seed for this KFold to be the same as the seed for the KFold used to get the OOF predictions.
3. When the meta-model is validated, fit it to train_meta and predict for test_meta.
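
A sketch of this scheme using `cross_val_predict` for the OOF predictions and `GridSearchCV` for the meta-model, with the same seed for both KFold splits. The data, the models, the `Ridge` meta-model and its `alpha` grid are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict

# Synthetic data standing in for train_data and test_data (an assumption)
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
train_data, y_train, test_data = X[:200], y[:200], X[200:]
models = [RandomForestRegressor(n_estimators=50, random_state=0), LinearRegression()]

SEED = 42  # shared by both KFold splits

# OOF predictions for train, then models refit on all of train for test
oof_cv = KFold(n_splits=5, shuffle=True, random_state=SEED)
train_meta = np.column_stack(
    [cross_val_predict(m, train_data, y_train, cv=oof_cv) for m in models]
)
test_meta = np.column_stack(
    [m.fit(train_data, y_train).predict(test_data) for m in models]
)

# Validate the meta-model's hyperparameters with KFold on train_meta
meta_cv = KFold(n_splits=5, shuffle=True, random_state=SEED)
search = GridSearchCV(
    Ridge(), {"alpha": [0.01, 1.0, 100.0]}, cv=meta_cv,
    scoring="neg_mean_squared_error",
)
search.fit(train_meta, y_train)  # refits the best meta-model on all of train_meta

final_predictions = search.predict(test_meta)
```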

### Holdout scheme with OOF meta-features

1. Split train data into two parts: partA and partB.
2. Split partA into K folds. Iterate through each fold: retrain N diverse models on all folds except the current fold, predict for the current fold. After this step, for each object in partA we will have N meta-features (also known as out-of-fold predictions, OOF). Let's call them partA_meta.
3. Fit models to the whole partA and predict for partB and test_data, getting partB_meta and test_meta respectively.
4. Fit a meta-model to partA_meta, using partB_meta to validate its hyperparameters.
5. When the meta-model is validated, repeat steps 2 and 3 without dividing train_data into parts, and then train a meta-model. That is, first get out-of-fold predictions train_meta for train_data using the models. Then train the models on train_data and predict for test_data, getting test_meta. Train the meta-model on train_meta and predict for test_meta.

### KFold scheme with OOF meta-features

1. To validate the model, repeat steps 1 to 4 of the Holdout scheme with OOF meta-features above, but divide the train data into partA and partB M times, using a KFold strategy with M folds.
2. When the meta-model is validated, do step 5 of that scheme.

## Validation in the presence of a time component

### KFold scheme in time series

In a time-series task we usually have a fixed period of time we are asked to predict, such as a day, a week, a month or an arbitrary period of duration T.

1. Split the train data into chunks of duration T. Select the first M chunks.
2. Fit N diverse models on those M chunks and predict for chunk M+1. Then fit those models on the first M+1 chunks and predict for chunk M+2, and so on, until you hit the end. After that, use all train data to fit models and get predictions for test. Now we will have meta-features for the chunks starting from number M+1 as well as meta-features for the test.
3. Now we can use meta-features from the first K chunks [M+1, M+2, ..., M+K] to fit level 2 models and validate them on chunk M+K+1. Essentially we are back to step 1 with fewer chunks and meta-features instead of features.
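
The expanding-window procedure described above can be sketched as follows. The synthetic data, the chunk sizes, M = 3 and the models are assumptions, and the final refit on all train data for the test predictions is left out for brevity. Chunks are numbered from 0 in the code, so the first predicted chunk has index M.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data, rows already in time order (an assumption for illustration)
rng = np.random.default_rng(0)
n_chunks, chunk_size, M = 8, 30, 3
X = rng.normal(size=(n_chunks * chunk_size, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=X.shape[0])
chunk_id = np.repeat(np.arange(n_chunks), chunk_size)

models = [RandomForestRegressor(n_estimators=30, random_state=0), LinearRegression()]

# Fit on all chunks before next_chunk, predict next_chunk, then move forward
meta_blocks = []
for next_chunk in range(M, n_chunks):
    past = chunk_id < next_chunk
    current = chunk_id == next_chunk
    meta_blocks.append(np.column_stack(
        [clone(m).fit(X[past], y[past]).predict(X[current]) for m in models]
    ))

train_meta = np.vstack(meta_blocks)   # meta-features for chunks M .. n_chunks - 1
meta_y = y[chunk_id >= M]
meta_chunk = chunk_id[chunk_id >= M]

# Level 2: fit on the earlier meta chunks, validate on the last one
is_last = meta_chunk == n_chunks - 1
meta_model = LinearRegression().fit(train_meta[~is_last], meta_y[~is_last])
valid_mse = np.mean((meta_y[is_last] - meta_model.predict(train_meta[is_last])) ** 2)
```
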

### KFold scheme in time series with limited amount of data

We may often encounter a situation where the KFold scheme in time series described above is not applicable, especially with a limited amount of data. For example, when we have only the years 2014, 2015 and 2016 in train and we need to predict for the whole year 2017 in test. In such cases the Meta KFold scheme with OOF meta-features could be of help, but with one constraint: the KFold split should be done with respect to the time component. For example, in the case of data with several years, we would treat each year as a fold.
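
Treating each year as a fold can be sketched with scikit-learn's `LeaveOneGroupOut`, passing the year of each row as the group. The random `train_meta` matrix is a hypothetical stand-in for real meta-features, and the `Ridge` meta-model and its `alpha` grid are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut

# Hypothetical meta-features with the year of each row (an assumption)
rng = np.random.default_rng(0)
years = np.repeat([2014, 2015, 2016], 50)
train_meta = rng.normal(size=(150, 3))
y = train_meta.mean(axis=1) + rng.normal(scale=0.1, size=150)

# Each split holds out exactly one year
cv = LeaveOneGroupOut()
held_out_years = [
    set(years[valid_idx]) for _, valid_idx in cv.split(train_meta, y, groups=years)
]

# Validate the meta-model's hyperparameters with one fold per year
search = GridSearchCV(
    Ridge(), {"alpha": [0.01, 1.0, 100.0]}, cv=cv,
    scoring="neg_mean_squared_error",
)
search.fit(train_meta, y, groups=years)
```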