# Bagging
ref: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205

## Base models

- homogeneous ensemble model: uses a single base learning algorithm; the homogeneous weak learners are each trained in a different way


- heterogeneous ensemble model: uses different types of base learning algorithms; the resulting heterogeneous weak learners are combined


- the choice of base models should be coherent with the way we aggregate them: if we choose base models with low bias but high variance, the aggregating method should tend to reduce variance; if we choose base models with low variance but high bias, it should tend to reduce bias.


## Combine base models

- bagging: usually considers homogeneous weak learners, trains them independently of each other in parallel, and combines them following some kind of deterministic averaging process


- boosting: usually considers homogeneous weak learners, trains them sequentially in an adaptive way (each base model depends on the previous ones), and combines them following a deterministic strategy


- stacking: usually considers heterogeneous weak learners, trains them in parallel, and combines them by training a meta-model that outputs a prediction based on the predictions of the different weak models

## Bagging

Bagging stands for bootstrap aggregating.

It helps to reduce variance and therefore to prevent overfitting.
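
The variance reduction can be made concrete with a standard identity. The symbols here are assumptions introduced for illustration and do not come from the referenced article: $M$ base models $f_1, \dots, f_M$, each with prediction variance $\sigma^2$ at a point $x$, and pairwise correlation $\rho$ between any two of them.

$$
\operatorname{Var}\left(\frac{1}{M}\sum_{m=1}^{M} f_m(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2
$$

Under these assumptions, the second term shrinks as $M$ grows, while the first term remains. Averaging therefore helps most when the base models are only weakly correlated, which is what training them on different bootstrapped samples aims for.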

### Bootstrapping

Generate samples of size B from an initial dataset of size N by randomly drawing B observations **with replacement**.

![image](./1.png)

Each resampled dataset reflects characteristics that are present in the original data as a whole.


The samples follow the distribution of the original data points, while still tending to differ from each other.

Training on different sets of data helps to avoid overfitting and makes the model more resilient.
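
A minimal sketch of bootstrapping with NumPy, assuming a toy dataset of the integers 0 to 9. The variable names (`data`, `n_bootstraps`, `bootstrap_samples`) are illustrative and not part of the referenced article. Because drawing is done with replacement, a sample usually repeats some points and leaves others out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset of size N (values are arbitrary, for illustration only).
N = 10
data = np.arange(N)

# Draw several bootstrap samples, each of size B, with replacement.
B = N
n_bootstraps = 3
bootstrap_samples = [
    rng.choice(data, size=B, replace=True) for _ in range(n_bootstraps)
]

for i, sample in enumerate(bootstrap_samples):
    # Duplicates are expected, so a sample rarely contains all N distinct points.
    n_unique = len(np.unique(sample))
    print(f"sample {i}: {np.sort(sample)} ({n_unique} distinct values)")
```

Each sample has the same size as the original data but a different composition, which is what makes the learners trained on them differ.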

### Steps

1. create bootstrapped samples

2. apply a regression or classification algorithm to each sample

3. aggregate the outputs: for regression, take the average of the outputs predicted by the individual learners;

    for classification, either accept the class with the most votes (hard voting), or take the class with the highest average predicted probability (soft voting).
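
The steps above can be sketched by hand for a binary classification task. This is an illustrative sketch, not the implementation used by any library: the synthetic dataset, the depth-limited decision trees, and the number of learners (25) are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, assumed here only for illustration.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_learners = 25  # odd, so a binary majority vote cannot tie

# Steps 1 and 2: fit one tree per bootstrapped sample.
learners = []
for _ in range(n_learners):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # with replacement
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    learners.append(tree.fit(X_train[idx], y_train[idx]))

# Step 3, hard voting: each learner casts one vote, the majority class wins.
votes = np.stack([m.predict(X_test) for m in learners])  # (n_learners, n_test)
hard_pred = np.array([np.bincount(col).argmax() for col in votes.T])

# Step 3, soft voting: average the class probabilities, then take the argmax.
mean_proba = np.mean([m.predict_proba(X_test) for m in learners], axis=0)
soft_pred = learners[0].classes_[mean_proba.argmax(axis=1)]

print("hard-voting accuracy:", (hard_pred == y_test).mean())
print("soft-voting accuracy:", (soft_pred == y_test).mean())
```

For regression, step 3 would reduce to taking `np.mean` over the individual predictions.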

### Advantages and Disadvantages

- works well when the learners are unstable and tend to overfit,

    i.e. when small changes in the training data lead to major changes in the predicted output.


- reduces the variance by aggregating individual learners that have different statistical properties,

    such as different means and standard deviations.


- works well for high variance models such as Decision Trees. The number of base learners (trees) to be chosen depends on the characteristics of the dataset. Using too many trees doesn’t lead to overfitting, but can consume a lot of computational power.


- performs well on high-dimensional data


- missing values in the dataset need not hurt performance, provided the base learner can handle them (this depends on the base learner rather than on bagging itself)



- when used with low-variance models such as linear regression, it brings little benefit, because there is not much variance left to reduce.


- the final prediction is an aggregate (the mean) of the predictions from the trees trained on the subsets, rather than the precise value output by a single classification or regression model.
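
The contrast between high-variance and low-variance base learners can be checked with a small experiment. This is a sketch under stated assumptions: the synthetic regression data, the choice of 50 estimators, and the helper `mean_r2` are illustrative and do not come from the referenced article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy regression data, assumed here only for illustration.
X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=20.0, random_state=0)


def mean_r2(model):
    """Mean R^2 over 5-fold cross-validation (hypothetical helper)."""
    return cross_val_score(model, X, y, scoring="r2", cv=5).mean()


scores = {
    # High-variance learner, alone and bagged.
    "tree": mean_r2(DecisionTreeRegressor(random_state=0)),
    "bagged trees": mean_r2(BaggingRegressor(
        DecisionTreeRegressor(random_state=0), n_estimators=50, random_state=0)),
    # Low-variance learner, alone and bagged.
    "linear": mean_r2(LinearRegression()),
    "bagged linear": mean_r2(BaggingRegressor(
        LinearRegression(), n_estimators=50, random_state=0)),
}

for name, score in scores.items():
    print(f"{name:>14}: R^2 = {score:.3f}")
```

On data like this, the bagged trees should score clearly higher than the single tree, while the bagged and plain linear regressions should score almost the same.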

## sklearn

BaggingRegressor API: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html

BaggingClassifier API: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html


- estimator (named base_estimator before scikit-learn 1.2): The algorithm to be fitted on each random subset of the dataset. Default value is a decision tree.


- n_estimators: The number of base estimators in the ensemble. Default value is 10.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)

# define the model (defaults: decision tree base estimator, 10 estimators)
model = BaggingClassifier()

# evaluate the model with repeated stratified 10-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv,
                           n_jobs=-1, error_score='raise')

# report the mean and standard deviation of the accuracy
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))
```

Output (the exact numbers can vary between runs, since the model has no fixed `random_state`):

```
Accuracy: 0.861 (0.042)
```
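
The two parameters listed above can also be set explicitly. The sketch below passes the base estimator positionally, because the keyword changed from `base_estimator` to `estimator` across scikit-learn versions. The choice of 50 estimators and the train/test split are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset as in the example above.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = BaggingClassifier(
    DecisionTreeClassifier(),  # base estimator, passed positionally
    n_estimators=50,           # illustrative value; the default is 10
    random_state=1,
)
model.fit(X_train, y_train)

# One fitted base estimator per bootstrapped sample.
print("fitted base estimators:", len(model.estimators_))
print("test accuracy:", model.score(X_test, y_test))
```

After fitting, `estimators_` holds the individual fitted trees, so their predictions can be inspected one by one if needed.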
