# Ensemble 

## Bagging

It _always_ makes sense to do at least bagging. This is always going to be more reliable than just building one model.

Bagging means **averaging** slightly different versions of the same model to improve accuracy, e.g. a random forest averages many decision trees, each built on a different random sample of the data.

There are 2 main sources of errors when modelling:
1. Errors due to bias (underfitting)
2. Errors due to variance (overfitting)

High bias = the model is too simple to capture the real relationship. We are _biased_ by our assumptions and have made some wild and sweeping generalizations from the data that we have got.

High variance = the model is easily moved by small changes in the training data, perhaps because it has latched onto noise or irrelevant features, e.g. when predicting whether someone will buy a car, we do not need to know their eye color or the color of their house.
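To make the two error types concrete, here is a minimal sketch on made-up data (a noisy sine curve, which is my assumption and not from the course). It fits polynomials of increasing degree with NumPy and compares train and test error. The helper names `make_data` and `mse` are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n_rows):
    # Hypothetical toy data: a sine curve plus noise
    x = rng.uniform(0, 1, n_rows)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n_rows)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def mse(model, x, y):
    # Mean squared error of a fitted polynomial on (x, y)
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 4, 15):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree={degree:2d} train_mse={results[degree][0]:.3f} "
          f"test_mse={results[degree][1]:.3f}")
```

The straight line (degree 1) cannot follow the curve, so it has a high train error: that is bias. The degree 15 fit gets the lowest train error because it chases the noise, and its test error is typically much worse than its train error: that is variance.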

With bagging, we make slightly different models and average them, which mainly reduces the variance part of the error.

The guy teaching this part of the course (an ex-number-1-ranked Kaggler) said he never just fits a model but always uses bagging. There is no point just fitting one model.

Ahhhh shit, wait I've just thought I can manually create bagged GBDTs and LightGBM models myself. I just build say 20 of them with different random seeds and then write the prediction function myself. I don't just have to use the sklearn API to make predictions for me!! Light bulb moment!

### Parameters That Control Bagging

- Random seed
- Row (sub) sampling or bootstrapping
- Shuffling - some models will produce different results if the data is presented in a different order
- Column (sub) sampling
- Model-specific parameters e.g. change regularization strength in LogReg models.
- Bags (the number of models) - usually at least 10
- (Optional) parallelism

Sub-sampling means training each model on a random subset of the data. When we combine models that have each seen a different subset, they can be amazingly powerful.

Bootstrapping is where we randomly build a new dataset by sampling rows _with replacement_, usually to the same size as the original. The model will almost certainly see some data points more than once and others not at all, so every bag gets a slightly different dataset.
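A quick sketch of the difference between the two sampling schemes, using only NumPy. The row count and the 50% sub-sample size are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10_000
row_ids = np.arange(n_rows)

# Bootstrap: draw n_rows row indices WITH replacement
boot_ids = rng.choice(row_ids, size=n_rows, replace=True)

# Sub-sample: draw 50% of the rows WITHOUT replacement
sub_ids = rng.choice(row_ids, size=n_rows // 2, replace=False)

# Fraction of distinct original rows that made it into the bootstrap sample
unique_fraction = np.unique(boot_ids).size / n_rows
print(f"unique rows in bootstrap sample: {unique_fraction:.3f}")
```

For a large dataset, a bootstrap sample of the same size contains roughly 63% of the original rows (about `1 - 1/e`), with the rest of the sample made up of repeats.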

Woah. So yeah there are millions of things we could try.

In principle, more bags will never hurt you and will increase performance. But, after some point, you will start plateauing so there is a cost-benefit with time. 

With bagging, all models are completely independent of each other so you can make full use of all the cores in your machine.

The `BaggingClassifier` and `BaggingRegressor` from sklearn are both good.

Some typical bagging code. I will defo implement this!

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# train is the training data
# test is the test data
# y is the target variable
model = RandomForestRegressor()
# Specify bagging params
bags = 10
seed = 1
# create array object to hold bagged predictions
bagged_prediction = np.zeros(test.shape[0])
# loop for as many times as we want bags
for n in range(bags):
    model.set_params(random_state=seed + n)  # update seed
    model.fit(train, y)  # fit model
    preds = model.predict(test)  # predict on test data
    bagged_prediction += preds
# take average of predictions
bagged_prediction /= bags
```
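
For comparison with the manual loop above, here is a rough sketch of the same idea with sklearn's `BaggingRegressor`, showing how the bagging parameters listed earlier map onto its arguments. The data is synthetic (from `make_regression`) and the specific values (0.8, 10 bags) are placeholders of mine, not recommendations from the course.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the train / test / y used in the notes
X, y_all = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
train, test, y = X[:200], X[200:], y_all[:200]

bagger = BaggingRegressor(
    DecisionTreeRegressor(),  # the model to bag, passed positionally
    n_estimators=10,          # bags (the number of models)
    max_samples=0.8,          # row (sub) sampling
    bootstrap=True,           # sample rows with replacement
    max_features=0.8,         # column (sub) sampling
    n_jobs=1,                 # parallelism: -1 would use all cores
    random_state=1,           # random seed
)
bagger.fit(train, y)
bagged_prediction = bagger.predict(test)  # average over the 10 bags
```

The manual loop is still worth knowing because it works with any library (e.g. LightGBM), not just sklearn estimators.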

## Boosting

A form of weighted averaging of models where each model is built sequentially by taking into account past model performance.

Unlike bagging (which just builds loads of independent models), this iterates on one particular model over and over.

**Main Boosting Types**
1. Weight based
2. Residual based

The weight can be calculated as, for example, 1 + the absolute error between predictions and target. There are many ways to do it; this is just one.

The next model will then be fed the same features and target variable but also an additional `weight` column. The weight will be bigger if there was a bigger error. Thus we give more significance to data points the model cannot classify.
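
A minimal sketch of one round of this, assuming the `1 + absolute error` weighting from above and a shallow sklearn decision tree as the input model (both are illustrative choices, and the data is made up). Real implementations such as AdaBoost compute the weights differently.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)

# Round 1: fit with every row weighted equally
model_1 = DecisionTreeRegressor(max_depth=2, random_state=0)
model_1.fit(X, y)

# Weight = 1 + absolute error, so badly predicted rows get a bigger weight
abs_error = np.abs(y - model_1.predict(X))
weight = 1.0 + abs_error

# Round 2: same features and target, plus the weight for each row
model_2 = DecisionTreeRegressor(max_depth=2, random_state=0)
model_2.fit(X, y, sample_weight=weight)

hardest_row = int(np.argmax(abs_error))
print(f"row {hardest_row} has the largest weight: {weight[hardest_row]:.2f}")
```

Any model whose `fit` accepts `sample_weight` can be dropped in for the tree.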

### Weight Based Boosting Parameters

- Learning rate (or shrinkage or eta)
  `final_pred = pred_0 * eta + pred_1 * eta + ... + pred_n * eta`.
  The learning rate ensures that we don't trust any one model too much. We trust many models a little bit (important to control overfitting).
- Num. estimators - often an inverse relationship with the learning rate. If we have more estimators, we need a smaller learning rate.
- Input model - can be anything that accepts weights.
- Sub boosting type
  - AdaBoost - good implementation in sklearn
  - LogitBoost - good implementation in Weka (Java)

To find optimal LR and num estimators, use CV. Start with fixed number of estimators (e.g. 100) and find the optimal LR for that (with CV). Then if we double the estimators, we should halve the learning rate. This gives us solid ballpark figures for how to increase/decrease estimators and LR. Can lose a lot of time if you don't do this as you will faff around trying to find the best LR. 
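
A sketch of that search, assuming sklearn's `AdaBoostRegressor` and `GridSearchCV` on synthetic data. The grid of learning rates and the 3-fold CV are arbitrary choices of mine.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Step 1: fix the number of estimators and search the learning rate with CV
n_estimators = 100
learning_rates = [0.01, 0.05, 0.1, 0.5, 1.0]
search = GridSearchCV(
    AdaBoostRegressor(n_estimators=n_estimators, random_state=0),
    param_grid={"learning_rate": learning_rates},
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)
best_lr = search.best_params_["learning_rate"]

# Step 2: rule of thumb from the notes, double the estimators and halve the LR
next_try = {"n_estimators": n_estimators * 2, "learning_rate": best_lr / 2}
print(f"best LR at {n_estimators} estimators: {best_lr}, next try: {next_try}")
```

The halved value is only a starting point for the next round of CV, not a guaranteed optimum.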

## Residual Based Boosting

This has been the most dominant and winning-est algorithm of the last few years.

For the first model, you calculate predictions as normal. Then you calculate the error (not the absolute error, just the signed difference) between y and preds. Then you train the next model on the same features but use the error from model one as the target variable.

To get predictions for one row, you then add up the first model's prediction and all the predicted error values for that row. Super clever!
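
Here is a from-scratch sketch of the idea for squared-error regression. Assumptions of mine: the "first model" is just the mean of `y`, the later models are shallow sklearn trees, every correction is scaled by a learning rate `eta`, and the data is synthetic. `boosted_predict` is a hypothetical helper name.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

eta = 0.1          # learning rate
n_estimators = 50  # number of residual models

# First "model": predict the mean of the target for every row
base_prediction = y.mean()
current_pred = np.full_like(y, base_prediction)

trees, train_mse = [], []
for _ in range(n_estimators):
    residual = y - current_pred               # signed error, not absolute
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                     # same features, error as target
    current_pred += eta * tree.predict(X)     # add a small correction
    trees.append(tree)
    train_mse.append(float(np.mean((y - current_pred) ** 2)))

def boosted_predict(X_new):
    # Final prediction = first prediction + sum of all scaled corrections
    pred = np.full(X_new.shape[0], base_prediction)
    for tree in trees:
        pred += eta * tree.predict(X_new)
    return pred

print(f"train MSE after 1 tree: {train_mse[0]:.4f}, after 50: {train_mse[-1]:.4f}")
```

Libraries like XGBoost and LightGBM add a lot on top of this (regularization, other loss functions, faster tree building), so treat this as the bare mechanism only.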

### Residual Boosting Parameters

- Learning rate (or shrinkage or eta)
- Number of estimators
- Row (sub) sampling
- Column (sub) sampling
- Input model - can theoretically be done with anything but best performance so far has been with trees
- Sub boosting type (two most common below)
  - Fully gradient based
  - DART (particularly good with classification)

If the predicted error for a row is 0.2 and our LR is 0.1, we only apply 10% of that correction. So the new prediction is `old_pred + 0.2 * 0.1 = old_pred + 0.02`.

Normally more estimators is better, but you need to offset this with the right learning rate to ensure each model makes the right contribution. A high number of estimators means you need a very small LR.
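
As a rough illustration of how these parameters look in practice, here is a sketch with sklearn's `GradientBoostingRegressor` on synthetic data. The values are placeholders of mine, not tuned settings. `staged_predict` gives the predictions after each additional tree, which lets us see where the validation error stops improving.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingRegressor(
    learning_rate=0.05,  # learning rate (shrinkage / eta)
    n_estimators=200,    # number of estimators
    subsample=0.8,       # row (sub) sampling
    max_features=0.8,    # column (sub) sampling
    max_depth=3,         # depth of each tree
    random_state=0,
)
model.fit(X_train, y_train)

# Validation MSE after 1, 2, ..., 200 trees
valid_mse = [
    float(((y_valid - pred) ** 2).mean())
    for pred in model.staged_predict(X_valid)
]
best_n = valid_mse.index(min(valid_mse)) + 1
print(f"best number of trees on validation: {best_n}")
```

If `best_n` lands at the very end of the range, that is a hint to try more estimators (or a higher LR); if it lands very early, a smaller LR is probably worth trying.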

### Excellent Residual Based Boosting Implementations

- XGBoost
- LightGBM
- H2O's GBM
- CatBoost
- Sklearn's GBM - the boosted models are always decision trees, but the `init` parameter accepts an sklearn estimator to produce the initial predictions.

## Stacking

A super popular form of ensembling. In _all_ competitions, you will need to use stacking in the end to boost your performance as much as possible.

**Definition**: Making predictions with hold-out data sets and then collecting (stacking) these predictions to form a new dataset which you will fit a new model on to make the final predictions.

**Methodology**
1. Split the train set into two disjoint sets
2. Train several base learners on the first part
3. Make predictions with the base learners on the second (validation) part
4. Use the predictions from 3 as the inputs to train a higher level learner.

We call the first models base models/learners and the latter we call meta models.

Excellent explanation in the [video](https://www.coursera.org/learn/competitive-data-science/lecture/Qdtt6/stacking). See image below

<div>
    <img src="stacking.png" />
</div>

Note that it's called stacking because we stack the predictions next to each other to form the new datasets B1 and C1.

Let's do a code example

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.model_selection import train_test_split

# train is the training data
# y is the target variable for the training data
# test is the test data

training, valid, ytraining, yvalid = train_test_split(
                                        train, y,
                                        test_size=0.5)
# Specify models
model_1 = RandomForestRegressor()
model_2 = LinearRegression()
# Fit models
model_1.fit(training, ytraining)
model_2.fit(training, ytraining)
# Make predictions for validation
preds_1 = model_1.predict(valid)
preds_2 = model_2.predict(valid)
# Make predictions for test data
test_preds_1 = model_1.predict(test)
test_preds_2 = model_2.predict(test)
# Form a new dataset for valid and test by stacking
# the predictions
stacked_predictions = np.column_stack((preds_1, preds_2))
stacked_test_predictions = np.column_stack((test_preds_1, test_preds_2))
# Specify meta model
meta_model = LinearRegression()
# Fit meta model on stacked predictions
meta_model.fit(stacked_predictions, yvalid)
# Make predictions on the stacked predictions of the test data
final_predictions = meta_model.predict(stacked_test_predictions)
```

Seems like a lot but when you read through it, it's actually quite simple.

Woah. This is a crazily powerful model. 

### Things To Be Mindful Of

- With time sensitive data, respect time i.e. train in the past, val and test in future
- Diversity is as important as performance
- Diversity comes from
  - Different algorithms
  - Different input features - e.g. fewer features or completely different transformations of the input data, such as one-hot encoding (OHE) categorical features in one model and label encoding them in the other
- Performance plateauing after N models
- Meta model is usually modest
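
For the first point in the list above (respecting time), a minimal sketch of a time-ordered split with NumPy. The `day` column and the 80/20 cutoff are hypothetical; with real data you would sort on your own timestamp column.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 100

# Hypothetical data: one timestamp (day number) per row, in shuffled order
day = rng.permutation(n_rows)
X = rng.normal(size=(n_rows, 3))
y = rng.normal(size=n_rows)

# Sort rows by time, then cut: the earliest 80% for training the base
# models, the latest 20% for validation (and for fitting the meta model)
order = np.argsort(day)
cutoff = int(n_rows * 0.8)
train_idx, valid_idx = order[:cutoff], order[cutoff:]

X_train, y_train = X[train_idx], y[train_idx]
X_valid, y_valid = X[valid_idx], y[valid_idx]

print(f"train days up to {day[train_idx].max()}, "
      f"valid days from {day[valid_idx].min()}")
```

This replaces the random `train_test_split` in the stacking example whenever the rows have a time order, so that no model is trained on the future.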

Stacking is able to get the juice out of all of the models you put in. It is great if you combine loads of different models. Adding in weaker performing models will actually give stacking new features to work with as these models will probably be good in areas where the top performing models are poor. Combining 'weak' learners with the strong ones is going to make a super crazy strong learner.

Can't know beforehand when we will start plateauing but generally it is affected by how many features you have in your data, how much diversity you have included, how many rows of data you have. Tough to know beforehand. But basically just add models and think of new things to try until you cannot get any more value.

Meta model is basically only using predictions of the other models. The other models have done the deep work, thus the meta model doesn't have to be that deep. Normally you have predictions that are correlated with the target, you just need to find a way to combine them. So if you use RandomForest, you would use a lower depth than the best one you found in your base models.
