# Ensemble learning
- Bagging
- Boosting
- Stacking

## Bagging

In [1]:
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

In [2]:
# generate 1000 samples, each represented by 4 features
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
X.shape

(1000, 4)

In [3]:
# show the first 10 samples
X[:10].round(2)

array([[-1.67, -1.3 ,  0.27, -0.6 ],
       [-2.97, -1.09,  0.71,  0.42],
       [-0.6 , -1.37, -3.12,  0.64],
       [-1.07, -1.18, -1.91,  0.66],
       [-1.31, -0.97, -0.15,  1.19],
       [-2.18, -0.97, -0.1 , -0.89],
       [-1.25, -1.13, -0.15,  1.06],
       [-1.35, -1.07,  0.03, -0.11],
       [-1.13, -1.27,  0.74,  0.21],
       [-0.38, -1.09, -0.01,  1.37]])

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)
X_train.shape, X_test.shape

((670, 4), (330, 4))

- Model performance without bagging

In [5]:
SVC().fit(X_train,y_train).score(X_test, y_test)

0.9363636363636364

- Model performance with bagging

In [6]:
clf = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=0, 
                        max_samples=0.6, max_features=0.8, bootstrap=True)
clf.fit(X_train, y_train)

In [7]:
clf.score(X_test, y_test)

0.9424242424242424

In [8]:
# make prediction for a test sample
clf.predict([[0, 0, 0, 0]])

array([1])

In [9]:
# predicted probability
clf.predict_proba([[0, 0, 0, 0]])

array([[0.4, 0.6]])

In [10]:
clf.classes_

array([0, 1])

In [11]:
clf.estimators_

[SVC(random_state=2087557356),
 SVC(random_state=132990059),
 SVC(random_state=1109697837),
 SVC(random_state=123230084),
 SVC(random_state=633163265),
 SVC(random_state=998640145),
 SVC(random_state=1452413565),
 SVC(random_state=2006313316),
 SVC(random_state=45050103),
 SVC(random_state=395371042)]

In [12]:
for e in clf.estimators_:
    print(e.predict([[0, 0, 0, 0]]))

ValueError: X has 4 features, but SVC is expecting 3 features as input.

**Question:** can you tell what is wrong with the above implementation?

In [None]:
# your answer here: 


In [13]:
clf.estimators_features_

[array([1, 3, 2]),
 array([0, 2, 1]),
 array([0, 3, 2]),
 array([3, 2, 1]),
 array([0, 2, 3]),
 array([0, 3, 2]),
 array([3, 0, 2]),
 array([2, 1, 3]),
 array([1, 2, 0]),
 array([1, 0, 3])]

The clf.estimators_features_ output shows that, because max_features=0.8, each base estimator in the BaggingClassifier was trained on its own subset of 3 of the 4 features. The loop above passes all 4 features to every base estimator, so the number of input features does not match what each SVC expects. Each base estimator must be given only the columns listed in its own entry of clf.estimators_features_.
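A minimal sketch of the fix described above, assuming the same data and bagging settings as this section: each base estimator is called with only its own feature columns. The names `sample` and `votes` are illustrative and not part of the original notebook, and the base estimator is passed positionally so the sketch does not depend on the keyword name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Same synthetic data as above: 1000 samples, 4 features.
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# Base estimator passed positionally (illustrative choice).
clf = BaggingClassifier(SVC(), n_estimators=10, random_state=0,
                        max_samples=0.6, max_features=0.8, bootstrap=True)
clf.fit(X, y)

sample = np.zeros((1, 4))  # one test sample with all 4 features

# Give each base estimator only the columns it was trained on.
votes = []
for est, feats in zip(clf.estimators_, clf.estimators_features_):
    votes.append(int(est.predict(sample[:, feats])[0]))

print("individual votes:", votes)
print("share of votes for class 1:", sum(votes) / len(votes))
```

Calling `clf.predict` or `clf.predict_proba` on the ensemble does this column selection internally, which is why those calls work with the full 4-feature input.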

## Boosting

In [15]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

In [16]:
X, y = load_iris(return_X_y=True)

- Model performance without boosting

In [19]:
dt = DecisionTreeClassifier(max_depth=1, random_state=0)
dt_scores = cross_val_score(dt, X, y, cv=5)
dt_scores.mean()

0.6666666666666666

- Model performance using boosting

In [21]:
# create a boosting classifier
# check the documentation: what does "estimator=None" mean?
clf = AdaBoostClassifier(estimator=None, n_estimators=100, algorithm="SAMME")

If None, then the base estimator is DecisionTreeClassifier initialized with max_depth=1.

In [22]:
scores = cross_val_score(clf, X, y, cv=5)
scores.mean()

0.9533333333333334

**Question:** compare model performance before and after boosting, and refer to the lecture to explain the improvement

In [None]:
# your answer here: 

In [23]:
dt_scores

array([0.66666667, 0.66666667, 0.66666667, 0.66666667, 0.66666667])

In [24]:
scores

array([0.96666667, 0.96666667, 0.93333333, 0.9       , 1.        ])

Before boosting, the model was a single decision tree with a maximum depth of 1 (a decision stump). A stump has only two leaves, so it can predict at most two of the three Iris classes, which caps its accuracy at about 2/3 on this balanced dataset. That matches the 66.67% observed in every fold. This is underfitting: the model is too simple to capture the structure of the data, so its accuracy stays low even though it is cheap to train and unlikely to overfit.

AdaBoost, a boosting technique, builds a more expressive and less biased model by training weak learners (in this case, decision stumps) sequentially, each compensating for the shortcomings of its predecessors. After each round, the algorithm assigns more weight to the samples that the previous learners misclassified, forcing subsequent learners to focus on these harder-to-classify instances. The resulting ensemble, configured with 100 decision stumps, achieved an average accuracy of approximately 95.33% on the same dataset. By aggregating many simple models, AdaBoost reduced the bias of the single stump and produced a more accurate and robust predictor.
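One way to see this effect is to track accuracy as stumps are added. The sketch below is illustrative only: it uses a single held-out split rather than the 5-fold cross-validation above, leaves the `algorithm` argument at its default so it does not depend on a specific scikit-learn version, and the names `boost` and `staged` are not from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a stratified test split (an assumption of this sketch).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Default base estimator: a decision tree with max_depth=1.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
boost.fit(X_tr, y_tr)

# staged_score yields the test accuracy after each boosting round.
staged = list(boost.staged_score(X_te, y_te))

print("accuracy with 1 stump:   ", staged[0])
print("accuracy with all stumps:", staged[-1])
```

The first entry corresponds to a single stump, which cannot exceed 2/3 accuracy on three balanced classes; later entries show how the weighted combination of stumps changes that.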

## Single-layer stacking

In [25]:
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.neighbors import KNeighborsRegressor

# create multiple individual estimators
base_estimators = [('ridge', RidgeCV()),
              ('lasso', LassoCV(random_state=42)),
              ('knr', KNeighborsRegressor(n_neighbors=20, metric='euclidean'))]

In [26]:
from sklearn.ensemble import GradientBoostingRegressor

# create a final estimator
gb_reg = GradientBoostingRegressor(n_estimators=25, subsample=0.5, min_samples_leaf=25,
                                   max_features=1, random_state=42)

In [27]:
from sklearn.ensemble import StackingRegressor

# stack the individual estimators under the final estimator
stack_reg = StackingRegressor(estimators=base_estimators, final_estimator=gb_reg)

In [28]:
# prepare train/test data
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [29]:
# fit the stacked regressors
stack_reg.fit(X_train, y_train)

In [30]:
stack_reg.score(X_test, y_test)

0.5267013426135393

**Question:** what does the R^2 score mean? What value indicates the model is performing well?

In [None]:
# your answer here: 

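The R^2 score measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of the target, and negative values mean it does worse than that baseline, so values closer to 1 indicate better performance. A score of 0.527 is fair, but there is room for further improvement.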
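As a small worked example of the definition, R^2 can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The toy arrays below are made up for illustration and are not taken from the diabetes dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets and predictions (illustrative values only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: error left after using the model.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual)                 # → 0.925
print(r2_score(y_true, y_pred))  # same value from scikit-learn

# Predicting the mean everywhere gives R^2 = 0.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)
print(r2_baseline)
```

For regressors, the `score` method used above returns this same quantity.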

## Multi-layer stacking

In [None]:
from sklearn.ensemble import RandomForestRegressor

final_layer_rfr = RandomForestRegressor(n_estimators=10, max_features=1, max_leaf_nodes=5,random_state=42)
final_layer_gbr = GradientBoostingRegressor(n_estimators=10, max_features=1, max_leaf_nodes=5,random_state=42)
final_layer = StackingRegressor(estimators=[('rf', final_layer_rfr),('gbrt', final_layer_gbr)],
                                final_estimator=RidgeCV())

multi_layer_regressor = StackingRegressor(estimators=[('ridge', RidgeCV()),
                                                      ('lasso', LassoCV(random_state=42)),
                                                      ('knr', KNeighborsRegressor(n_neighbors=20,metric='euclidean'))],
                                          final_estimator=final_layer)

multi_layer_regressor.fit(X_train, y_train)

The three estimators in the base layer (ridge, lasso, and k-nearest neighbors) each make their own predictions.
These predictions become the input features of the final estimator, which is itself a stacking regressor: its random forest and gradient boosting models each produce a new prediction from them.
Finally, RidgeCV combines those two predictions into the final output.
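The flow through the layers can be inspected with the `transform` method of a fitted stacking regressor, which returns the predictions of its base estimators. The sketch below rebuilds the model from this section so that it is self-contained; the names `second_layer`, `model`, `layer1_out`, and `layer2_out` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Second layer: random forest and gradient boosting, combined by RidgeCV.
second_layer = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=10, max_features=1,
                                     max_leaf_nodes=5, random_state=42)),
        ("gbrt", GradientBoostingRegressor(n_estimators=10, max_features=1,
                                           max_leaf_nodes=5, random_state=42)),
    ],
    final_estimator=RidgeCV(),
)

# First layer: ridge, lasso, and k-nearest neighbors.
model = StackingRegressor(
    estimators=[
        ("ridge", RidgeCV()),
        ("lasso", LassoCV(random_state=42)),
        ("knr", KNeighborsRegressor(n_neighbors=20, metric="euclidean")),
    ],
    final_estimator=second_layer,
)
model.fit(X_train, y_train)

# One column per base estimator at each layer.
layer1_out = model.transform(X_test)
layer2_out = model.final_estimator_.transform(layer1_out)
print(layer1_out.shape, layer2_out.shape)

# R^2 of the full multi-layer model on the test split.
r2 = model.score(X_test, y_test)
print(r2)
```

The first shape has three columns (one per base regressor) and the second has two (random forest and gradient boosting), which mirrors the description above.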