# <a id='toc1_'></a>[Ensemble methods](#toc0_)

> **Ensemble**: Borrowed from French ensemble. A group of separate things that contribute to a coordinated whole. [$^{[1]}$](https://en.wiktionary.org/wiki/ensemble#French)

**Table of contents**<a id='toc0_'></a>    
- [Ensemble methods](#toc1_)    
- [Baseline](#toc2_)    
- [Voting classifiers](#toc3_)    
- [Bagging (Bootstrap aggregation)](#toc4_)    
  - [Bootstrapping](#toc4_1_)    
  - [Pasting](#toc4_2_)    
  - [Patching](#toc4_3_)    
  - [Bagging in practice](#toc4_4_)    
  - [Random Forest](#toc4_5_)    
- [Boosting](#toc5_)    
  - [Adaboost (Adaptive Boosting)](#toc5_1_)    
  - [Gradient boosting](#toc5_2_)    
  - [Extreme Gradient Boosting](#toc5_3_)    
- [Resources](#toc6_)    
- [References](#toc7_)    
- [Acknowledgements](#toc8_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
from sklearn.datasets import fetch_openml

def load_boston_dataset():
    dataset = fetch_openml(name='boston', version=1)
    return dataset.data, dataset.target

features, labels = load_boston_dataset()
features = features.astype(float)

display(features.head())
display(labels.head())

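| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |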


0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64

Remember: before the train-test split we would normally do data review, exploratory data analysis, and some feature engineering. Here we skip those steps and go straight to the split:

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state=1)

# <a id='toc2_'></a>[Baseline](#toc0_)

> Here we fit a single decision tree on the training split of the 506 observations (non-optimized method, straight off the shelf):

In [4]:
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(random_state=1) # fixing random state because I'm a chicken and terrified that random variation screws up my example
tree.fit(X_train, y_train)
print('Train score:', tree.score(X_train, y_train))
print('Test score:', tree.score(X_test, y_test))

Train score: 1.0
Test score: 0.7920086354372333


# <a id='toc3_'></a>[Voting classifiers](#toc0_)

> A Voting Classifier is a machine learning model that trains on an ensemble of numerous models and predicts an output (class) based on their highest probability of chosen class as the output. [$^{[2]}$](https://medium.com/@imamitsingh/voting-classifiers-in-machine-learning-a532935fe592)  

![](https://miro.medium.com/v2/resize:fit:1100/format:webp/0*Lp5aIdSuk4uqGNwO.png)  
(Source: [Voting Classifiers in Machine Learning, Medium](https://medium.com/@imamitsingh/voting-classifiers-in-machine-learning-a532935fe592))

In [5]:
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
lin_reg = LinearRegression()
knn_reg = KNeighborsRegressor()
tree_reg = DecisionTreeRegressor(random_state=1)
voting_reg = VotingRegressor(
estimators=[('lr', lin_reg), ('dt', tree_reg), ('knn', knn_reg)])
voting_reg.fit(X_train, y_train)
print('Train score:', voting_reg.score(X_train, y_train))
print('Test score:', voting_reg.score(X_test, y_test))

Train score: 0.8974599905889998
Test score: 0.812370538133183


Let's have a look at the individual $R^2$ scores:

In [6]:
knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)
print('Train score:', knn_reg.score(X_train, y_train))
print('Test score:', knn_reg.score(X_test, y_test))

Train score: 0.6667843569655112
Test score: 0.5281871748119744


In [7]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
print('Train score:', lin_reg.score(X_train, y_train))
print('Test score:', lin_reg.score(X_test, y_test))

Train score: 0.7168057552393374
Test score: 0.7789410172622842


The mix of 2 decent models and one not so decent model fares better on the test set than each of the models on its own!
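
For regression there are no classes to vote on: as far as scikit-learn's `VotingRegressor` is concerned, "voting" means averaging the predictions of the fitted sub-models. The sketch below checks this on synthetic data; the `make_regression` dataset and the `*_demo` names are illustrative assumptions, not the Boston data used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

demo_voting = VotingRegressor(estimators=[
    ('lr', LinearRegression()),
    ('dt', DecisionTreeRegressor(random_state=1)),
    ('knn', KNeighborsRegressor()),
])
demo_voting.fit(X_demo, y_demo)

# Predictions of each fitted sub-model, one row per model
individual_preds = np.array([est.predict(X_demo) for est in demo_voting.estimators_])

# With no weights given, the ensemble prediction should be the plain mean
manual_average = individual_preds.mean(axis=0)
ensemble_preds = demo_voting.predict(X_demo)
print('Ensemble equals mean of sub-models:', np.allclose(manual_average, ensemble_preds))
```

Averaging is also why one weak model does not sink the ensemble: its errors are diluted by the other two.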

**Why not use the same model with different parameters in the same `VotingRegressor`?**

**Note:** For Voting Classifiers, we have 2 types of voting:

> - **Hard Voting** - make the final prediction by a simple majority vote for accuracy.   
> - **Soft Voting** - averaging out the probabilities calculated by individual algorithms then choosing the class based on the overall decision boundary. This can only be done when all your classifiers can calculate probabilities for the outcomes. [$^{[3]}$](https://towardsdatascience.com/ensemble-learning-in-machine-learning-getting-started-4ed85eb38e00)
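
The two voting modes can be reproduced by hand to see what they do. This is a minimal sketch on a synthetic binary problem; the dataset, the choice of three classifiers, and names such as `hard_clf` and `soft_clf` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: illustrative only)
X_clf, y_clf = make_classification(n_samples=300, n_features=6, random_state=1)

# Three classifiers that all expose predict_proba (required for soft voting)
base_estimators = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier(max_depth=3, random_state=1)),
    ('knn', KNeighborsClassifier()),
]

hard_clf = VotingClassifier(estimators=base_estimators, voting='hard').fit(X_clf, y_clf)
soft_clf = VotingClassifier(estimators=base_estimators, voting='soft').fit(X_clf, y_clf)

# Hard voting by hand: each model casts one vote (0 or 1), the majority of 3 wins
votes = np.array([est.predict(X_clf) for est in hard_clf.estimators_])
manual_hard = (votes.sum(axis=0) >= 2).astype(int)

# Soft voting by hand: average the class probabilities, then pick the most likely class
avg_proba = np.mean([est.predict_proba(X_clf) for est in soft_clf.estimators_], axis=0)
manual_soft = soft_clf.classes_[np.argmax(avg_proba, axis=1)]

print('Hard voting reproduced:', np.array_equal(manual_hard, hard_clf.predict(X_clf)))
print('Soft voting reproduced:', np.array_equal(manual_soft, soft_clf.predict(X_clf)))
```

Soft voting lets a very confident model outweigh two hesitant ones, which hard voting cannot do.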

# <a id='toc4_'></a>[Bagging (Bootstrap aggregation)](#toc0_)

> Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting.  [$^{[4]}$](https://en.wikipedia.org/wiki/Bootstrap_aggregating) 

![](https://github.com/sabinagio/data-analytics/blob/main/images/bagging.png?raw=true)  
(Source: [Bootstrap Aggregating, Wikipedia](https://en.wikipedia.org/wiki/Bootstrap_aggregating))

## <a id='toc4_1_'></a>[Bootstrapping](#toc0_)

> The bootstrap dataset is made by randomly picking objects from the original dataset. Also, it must be the same size as the original dataset. However, the difference is that the bootstrap dataset can have duplicate objects: [$^{[4]}$](https://en.wikipedia.org/wiki/Bootstrap_aggregating)   

![](https://upload.wikimedia.org/wikipedia/commons/f/fe/Bootstrap_Example_2.png)  
(Source: [Bootstrap aggregating, Wikipedia](https://en.wikipedia.org/wiki/Bootstrap_aggregating))

![](https://upload.wikimedia.org/wikipedia/commons/5/57/Complete_Example_2.png)  
(Source: [Bootstrap aggregating, Wikipedia](https://en.wikipedia.org/wiki/Bootstrap_aggregating))
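
Drawing a bootstrap sample takes one line of NumPy. The sketch below uses a toy "dataset" of 10 row indices (an assumption for illustration); rows that are never drawn are commonly called "out-of-bag", a term not used in the text above.

```python
import numpy as np

boot_rng = np.random.default_rng(1)
toy_rows = np.arange(10)  # a toy "dataset" of 10 row indices (assumption)

# Bootstrap: draw n rows WITH replacement, so the sample keeps the original size
bootstrap_sample = boot_rng.choice(toy_rows, size=toy_rows.size, replace=True)

# Rows that were never drawn are "out-of-bag" for this sample
out_of_bag = np.setdiff1d(toy_rows, bootstrap_sample)

print('Bootstrap sample:', np.sort(bootstrap_sample))
print('Out-of-bag rows: ', out_of_bag)
```

Because sampling is with replacement, some rows usually appear more than once and others not at all, which is what makes each tree in the ensemble see a slightly different dataset.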

## <a id='toc4_2_'></a>[Pasting](#toc0_)

> In case of Pasting, the same process as for Bootstrapping applies, only difference being that pasting doesn’t allow training instances to be sampled several times for the same predictors so the dataset sizes will not be the same as the original. [$^{[3]}$](https://archive.is/20210523073415/https://towardsdatascience.com/ensemble-learning-in-machine-learning-getting-started-4ed85eb38e00#selection-1085.149-1089.148)
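
For comparison, here is a pasting-style draw with NumPy. It is a sketch on a toy array of 10 row indices; the subset size of 6 is an arbitrary assumption.

```python
import numpy as np

paste_rng = np.random.default_rng(1)
toy_rows = np.arange(10)  # a toy "dataset" of 10 row indices (assumption)

# Pasting: draw WITHOUT replacement, so no row can repeat within a subset
pasting_sample = paste_rng.choice(toy_rows, size=6, replace=False)

print('Pasting sample:', np.sort(pasting_sample))
print('All rows distinct:', len(np.unique(pasting_sample)) == len(pasting_sample))
```

In scikit-learn's bagging estimators this corresponds to `bootstrap=False` combined with a `max_samples` smaller than the dataset.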

## <a id='toc4_3_'></a>[Patching](#toc0_)

## <a id='toc4_4_'></a>[Bagging in practice](#toc0_)

> Bagging means we train many "weak" predictors but then we combine their predictions and some will hopefully make up for the others' failures:


In [8]:
from sklearn.ensemble import BaggingRegressor

In [9]:
# Bagging w/ pasting
bagging_reg = BaggingRegressor(
    DecisionTreeRegressor(max_depth=3), # depth 3 to force tree to be "weak"
    n_estimators=10, # 10 trees
    max_samples=100, # we limit each weaker tree to 100 datapoints
    bootstrap=False, # by default, the bagging regressor does bootstrapping
    random_state=1) # fixing random state because I want my examples to work and to look smart

bagging_reg.fit(X_train, y_train)
print('Train score:', bagging_reg.score(X_train, y_train))
print('Test score:', bagging_reg.score(X_test, y_test))

Train score: 0.8114183129636663
Test score: 0.7943759617838979


In [10]:
# Bagging w/ bootstrapping
bagging_reg = BaggingRegressor(
    DecisionTreeRegressor(max_depth=3), # depth 3 to force tree to be "weak"
    n_estimators=10, # 10 trees
    max_samples=100, # we limit each weaker tree to 100 datapoints
    bootstrap=True,  # by default, the bagging regressor does bootstrapping
    random_state=1) # fixing random state because I want my examples to work and to look smart

bagging_reg.fit(X_train, y_train)
print('Train score:', bagging_reg.score(X_train, y_train))
print('Test score:', bagging_reg.score(X_test, y_test))

Train score: 0.8155793199161233
Test score: 0.8232359437063704


**Which one to choose? Bootstrapping or Pasting?**

It depends on the size of your dataset. 

> Since pasting is without replacement, each subset of the sample can be used once at most, which means that you need a big dataset for it to work. As a matter of fact, pasting was originally designed for large data-sets, when computing power is limited. Bagging, on the other hand, can use the same subsets many times, which is great for smaller sample sizes, in which it improves robustness. [$^{[5]}$](https://stats.stackexchange.com/questions/219193/when-should-the-pasting-ensemble-method-be-used-instead-of-bagging)

> **Note:** BaggingClassifier automatically performs soft voting if the classifier can calculate the probabilities for its predictions. [$^{[3]}$](https://archive.is/20210523073415/https://towardsdatascience.com/ensemble-learning-in-machine-learning-getting-started-4ed85eb38e00#selection-1085.149-1089.148)
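
The soft-voting behaviour in the note can be checked by averaging the trees' probabilities by hand. This is a sketch on synthetic data; the dataset and the names `bag_clf` and `mean_proba` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: illustrative only)
X_bag, y_bag = make_classification(n_samples=300, n_features=6, random_state=1)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3),  # trees can compute class probabilities
    n_estimators=10,
    random_state=1,
).fit(X_bag, y_bag)

# Average the per-tree class probabilities by hand
mean_proba = np.mean([est.predict_proba(X_bag) for est in bag_clf.estimators_], axis=0)

print('Ensemble probabilities equal the per-tree mean:',
      np.allclose(mean_proba, bag_clf.predict_proba(X_bag)))
```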

## <a id='toc4_5_'></a>[Random Forest](#toc0_)

> Random forests not only resample the rows of the dataset (bagging) but also randomly select a subset of the features to consider. They use bagging with patching internally on decision trees, so some trees focus on one part of the data, some on others, then they meet to vote and get a holistic result.

In [11]:
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=10, # same 10 trees
                               #max_samples=100,
                               #max_features=0.6,
                               max_depth=3, # depth 3 to force tree to be "weak"
                               random_state=1) # fixing rand because I'm insecure and afraid you will judge me if I get a bad random selection that does not prove my point
forest.fit(X_train, y_train)
print('Train score:', forest.score(X_train, y_train))
print('Test score:', forest.score(X_test, y_test))

Train score: 0.8568595898803911
Test score: 0.8617132400893801
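
In scikit-learn the random feature selection is controlled by `max_features` and is applied at each split. To the best of my knowledge `RandomForestRegressor` considers all features by default, so the value has to be lowered to get that extra randomness. The sketch below compares both settings on synthetic data; the dataset, the 0.5 fraction and the `*_rf` names are assumptions, and which setting scores better will depend on the data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Boston dataset)
X_rf, y_rf = make_regression(n_samples=400, n_features=10, n_informative=5,
                             noise=15.0, random_state=1)
X_rf_train, X_rf_test, y_rf_train, y_rf_test = train_test_split(
    X_rf, y_rf, test_size=0.25, random_state=1)

rf_scores = {}
for fraction in (1.0, 0.5):
    # max_features=1.0 -> every split sees all features (plain bagged trees)
    # max_features=0.5 -> every split sees a random half of the features
    rf = RandomForestRegressor(n_estimators=50, max_depth=3,
                               max_features=fraction, random_state=1)
    rf.fit(X_rf_train, y_rf_train)
    rf_scores[fraction] = rf.score(X_rf_test, y_rf_test)
    print(f'max_features={fraction}: test R^2 = {rf_scores[fraction]:.3f}')
```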


# <a id='toc5_'></a>[Boosting](#toc0_)

## <a id='toc5_1_'></a>[Adaboost (Adaptive Boosting)](#toc0_)

> Adaboost "directs" each subsequent tree to focus on the datapoints that the preceding trees got wrong. This way the trees try to compensate for each other's weaknesses.

Then, in the final vote, each tree's weight depends on how well it performed in its own training round: trees with a lower error get a higher weight. The order in which the trees were fit does not by itself decide who counts most.

In [12]:
from sklearn.ensemble import AdaBoostRegressor

ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=5), # you can overfit a bit because you compensate afterwards
                            n_estimators=10, # same 10 trees. You usually use faaaar more estimators
                            random_state=1 # once a coward, always a coward
                            )
ada_reg.fit(X_train, y_train)
print('Train score:', ada_reg.score(X_train, y_train))
print('Test score:', ada_reg.score(X_test, y_test))

Train score: 0.9539286914529287
Test score: 0.9013235759543973
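
To see how much say each tree gets in the final vote, the fitted booster's per-tree errors and weights can be inspected. This is a sketch on synthetic data; the dataset and the `*_ada` names are assumptions. In scikit-learn's `AdaBoostRegressor`, as far as I know, the weight is computed from the tree's error on the reweighted training data, so a lower error should mean a higher weight.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the Boston dataset)
X_ada, y_ada = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=1)

demo_ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3),
                             n_estimators=10, random_state=1)
demo_ada.fit(X_ada, y_ada)

# Boosting may stop early, so only look at the trees that were actually kept
n_fitted = len(demo_ada.estimators_)
tree_errors = demo_ada.estimator_errors_[:n_fitted]
tree_weights = demo_ada.estimator_weights_[:n_fitted]

for i, (err, weight) in enumerate(zip(tree_errors, tree_weights)):
    print(f'tree {i}: error={err:.3f}, weight={weight:.3f}')
```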


Increasing the number of estimators:

In [13]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=5),
                            n_estimators=50, # that's more like it
                            random_state=1 # once a coward, always a coward
                            )
ada_reg.fit(X_train, y_train)
print('Train score:', ada_reg.score(X_train, y_train))
print('Test score:', ada_reg.score(X_test, y_test))

Train score: 0.9668135534946342
Test score: 0.9091002589113557


AdaBoost is a way of training an ensemble rather than a model in itself, so it can be used with different weak estimators:

> The base model used in boosting should have relatively low variance and high bias, but this is just a rule of thumb. Boosting algorithms in Python usually use decision trees by default.

In [14]:
ada_reg = AdaBoostRegressor(KNeighborsRegressor(),
                            n_estimators=50, # that's more like it
                            random_state=1 # once a coward, always a coward
                            )
ada_reg.fit(X_train, y_train)
print('Train score:', ada_reg.score(X_train, y_train))
print('Test score:', ada_reg.score(X_test, y_test))

Train score: 0.851115605405711
Test score: 0.6243732011331289


Well, there's a reason we typically use these methods with Decision Trees :D

## <a id='toc5_2_'></a>[Gradient boosting](#toc0_)

> Gradient boosting also focuses on where the previous trees got it wrong, but instead of reweighting observations it fits each new tree to the errors that remain (the residuals, i.e. the negative gradient of the loss). Each tree is really just trying to "correct" the preceding ones, one small step at a time.

In [15]:
from sklearn.ensemble import GradientBoostingRegressor

gb_reg = GradientBoostingRegressor(max_depth=5, #gradient boosting always works with trees, no need to call the tree regressor
                                   n_estimators=50,
                                   random_state=1 # tastes like chicken
                                   )
gb_reg.fit(X_train, y_train)
print('Train score:', gb_reg.score(X_train, y_train))
print('Test score:', gb_reg.score(X_test, y_test))

Train score: 0.9927971890442564
Test score: 0.9092025261090009
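
The idea of "correcting the preceding trees" can be written out by hand for the squared-error loss, where the negative gradient is simply the residual. This is a simplified sketch, not scikit-learn's actual implementation; the synthetic dataset, the learning rate of 0.1 and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the Boston dataset)
X_gb, y_gb = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)

learning_rate = 0.1  # how big a correction each tree is allowed to make
n_rounds = 50

# Start from a constant prediction: the mean of the target
current_pred = np.full(y_gb.shape, y_gb.mean())
gb_trees = []
mse_history = [np.mean((y_gb - current_pred) ** 2)]

for _ in range(n_rounds):
    residuals = y_gb - current_pred  # what the ensemble still gets wrong
    residual_tree = DecisionTreeRegressor(max_depth=3, random_state=1)
    residual_tree.fit(X_gb, residuals)  # the new tree learns the remaining error
    current_pred += learning_rate * residual_tree.predict(X_gb)
    gb_trees.append(residual_tree)
    mse_history.append(np.mean((y_gb - current_pred) ** 2))

print(f'Training MSE: {mse_history[0]:.1f} -> {mse_history[-1]:.1f}')
```

Predicting on new data would follow the same recipe: start from the training mean and add `learning_rate` times each tree's prediction.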


## <a id='toc5_3_'></a>[Extreme Gradient Boosting](#toc0_)

![Evolution of tree-based algorithms](https://miro.medium.com/max/925/1*QJZ6W-Pck_W7RlIDwUIN9Q.jpeg)

> XGBoost is one of the best algorithms for tabular data. It is neither the only nor (IMO) the best implementation of gradient boosting; LightGBM is a popular alternative. Both accept NaNs in the features out of the box. Still a champ.

In [16]:
import xgboost

xgb_reg = xgboost.XGBRegressor(max_depth=5)
xgb_reg.fit(X_train, y_train)
print('Train score:', xgb_reg.score(X_train, y_train))
print('Test score:', xgb_reg.score(X_test, y_test))

Train score: 0.9999519467774459
Test score: 0.8839164225193332


**Imagine:** What could we have achieved if we had done feature engineering?

# <a id='toc6_'></a>[Resources](#toc0_)

- StatQuest
    - [Bootstrapping (10 min)](https://www.youtube.com/watch?v=Xz0x-8-cgaQ) - to understand the random forest video below 
    - [Random Forests Part 1: Building, Using and Evaluating (10 min)](https://www.youtube.com/watch?v=J4Wdy0Wc_xQ)
    - [Random Forests Part 2: Missing data and clustering (12 min)](https://www.youtube.com/watch?v=sQ870aTKqiM)

# <a id='toc7_'></a>[References](#toc0_)

[1] [Ensemble, Wiktionary](https://en.wiktionary.org/wiki/ensemble#French)  
[2] [Voting Classifiers in Machine Learning, Medium](https://medium.com/@imamitsingh/voting-classifiers-in-machine-learning-a532935fe592)  
[3] [Ensemble Learning in Machine Learning, Towards Data Science](https://towardsdatascience.com/ensemble-learning-in-machine-learning-getting-started-4ed85eb38e00)  
[4] [Bootstrap Aggregating, Wikipedia](https://en.wikipedia.org/wiki/Bootstrap_aggregating)  
[5] [When should pasting be used instead of bagging, Cross Validated (Stack Exchange)](https://stats.stackexchange.com/questions/219193/when-should-the-pasting-ensemble-method-be-used-instead-of-bagging)

# <a id='toc8_'></a>[Acknowledgements](#toc0_)

Thank you, David Henriques, for your awesome lesson structure and content!