# Ensemble Learning and Random Forests

Random Forest: A bunch of decision trees trained on random subsets of the data 
    - Tends to beat single trees trained on full dataset

A good idea is to train a few different good models and then combine them making one ensemble model

Can weight each model's contribution to the ensemble based on its test accuracy

## Voting Classifiers

Example: you have trained a logistic regression, svm classifier, and random forest. Each are around 80% accuracy

Hard Voting Classifier: each model counts as a vote, the ensemble prediction is whichever class gets the most votes (simple majority)

Voting works well if you have models that are "good" at different classes (diverse models)

Get more diverse models by training them differently. Play with hyperparams or train on different subsets of data.

In [None]:
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(500, noise=.3, random_state=42)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=42)

votingClf = VotingClassifier(
    estimators=[
        ("log", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(random_state=42))
    ]
)

votingClf.fit(Xtrain, ytrain)

In [32]:
for name, estimator in votingClf.named_estimators_.items():
    print(name, estimator.score(Xtest, ytest))

log 0.864
rf 0.896
svc 0.896


In [33]:
votingClf.score(Xtest, ytest)
# Hard voting gets about 1.5% better than individual models

0.912

#### Soft Voting

If the individual models have a .predict_proba() method, you can soft vote, which uses the averaged predicted probas to decide the correct class. This is usually better than hard voting.

In [34]:
votingClf.voting = "soft"
votingClf.named_estimators['svc'].probability = True
votingClf.fit(Xtrain, ytrain)
votingClf.score(Xtest, ytest)

# if predicted probs are not well calibrated, you can use sklearn.calibration.CalibratedClassifierCV to calibrate

0.92

## Bagging and Pasting
Idea: use multiple instances of the same training algorithm, but different subsets of the data

Bagging: Sampling data w/ replacement (different models can have very similar data subsets)

Pasting: sampling data without replacement (every data subset is fully unique)

For classification, take the mode prediction. For regression, take the average prediction.

In practice, this ensemble ends up with similar bias but lower variance than single models. So its a good idea to pick models with high variance/low bias to use with this model (trees)

Bagging > pasting when data is noisy or model overfits easily (deep trees). Otherwise, pasting > bagging.

In [35]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagClf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, n_jobs=-1, random_state=42)
bagClf.fit(Xtrain,ytrain)
# bagging classifiers automatically do soft voting

##### Out of Bag Evaluation

Each individual predictor only sees about 63% of the data, the remaining 37% are called out-of-bag instances. Every predictor has different oob instances.

You can evaluate a model with OOB score built-in:

In [36]:
bagClf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, n_jobs=-1, random_state=42, oob_score=True)
bagClf.fit(Xtrain,ytrain)
bagClf.oob_score_

0.9253333333333333

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(bagClf.predict(Xtest), ytest)

# The book showed OOB < true accuracy, but this test shows OOB was above true accuracy. Interesting that the bagClf's true performance is a little poor here.

0.904

In [None]:
bagClf.oob_decision_function_[:3] # predicted probs

array([[0.35579515, 0.64420485],
       [0.43513514, 0.56486486],
       [1.        , 0.        ]])

##### Random Patches and Random Subspaces
You can randomly sample features as well as instances. `max_features` and `bootstrap_features` work similarly to `max_samples` and `bootstrap`

This works well for high dimension data (images) to speed up training. 

A random patch is a subset of instances and features (for an n x m input, a x b random patch where a<n and b<m)

A random subspace is keeping all instances but sampling the features. set bootstrap=False and max_samples=1.0, and set bootstrap_features and max_features

This increases predictor diversity, lowering variance. As always, you are trading less variance for more bias.

## Random Forests

A random forest is an ensemble of decision trees, usually trained via bagging. 

RandomForestClassifier = DecisionTreeClassifier --> BaggingClassifier 

In [53]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest Classifier has all the arguments of DecisionTree and BaggingClassifier, so control regularization with max_depth, max_leaf_nodes, etc
rfClf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
rfClf.fit(Xtrain, ytrain)
preds = rfClf.predict(Xtest)

accuracy_score(preds, ytest)

0.912

Random Forest works a little differently than a regular decision tree. Instead of looking for the best split, it looks for the best split among a subset of features, in order to increase tree diversity. 

##### Extra-Trees

For each tree in a random forest, a random subset of features is considered when splitting. You can make trees even MORE random by setting `splitter=random` in DecisionTreeClassifier. This sets random thresholds for each feature rather than searching for the best ones.

Again, this trades lower variance for higher bias. Works well with noisy data or situations where RandomForest is overfitting.

In [None]:
extremelyRandomized = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", splitter="random", max_leaf_nodes=16),
    n_estimators=500,
    n_jobs=-1,
    random_state=42,
    bootstrap=False
)

# there is also an sklearn ExtraTreesClassifier

##### Feature Importance

Sklearn measures feature importance by looking at all the tree nodes that use that feature, and then averaging how much those nodes reduce impurity. Its a weighted avg, so more samples in node = feature is better.

Sklearn computes these automatically:

In [63]:
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
rfClf = RandomForestClassifier(n_estimators=1000)
rfClf.fit(iris.data, iris.target)

for score, name in zip(rfClf.feature_importances_, iris.data.columns):
    print(f"{name} : {round(score,2)}")

# this makes feature selection very easy! it looks like sepal width is the worst predictor by far

sepal length (cm) : 0.1
sepal width (cm) : 0.02
petal length (cm) : 0.44
petal width (cm) : 0.44


## Boosting

Boosting: any ensemble method that combines weak learners into a strong learner

Train predictors sequentially, where the next predictor learns where the previous one struggled

##### AdaBoost (adaptive boosting)

Idea: where the last model underfit, this model will focus. The newest predictors focus more and more on the hardest cases / decision boundaries.

Train a classifier > predict on training set > increase weight of training instances that were predicted wrong > go back to step 1

Important con: sequential instead of parallel training is slow.

Predictors are weighted based on their training set accuracy.

Weighted error rate of the j'th predictor:
$$r_j = {\sum_{i=1}^{m} w^{(j)}_i \cdot \mathbb{1}(y_i \neq \hat{y}_i^{(j)})}$$
${\hat{y}_j}^{(i)}$ is the j'th predictors prediction for the i'th instance

The weight is then calculated via: (alpha j is weight of predictor j, eta is the learning rate)
$$\alpha_j = \eta \log\left(\frac{1-r_j}{r_j}\right)$$

Weight update rute:
- if the prediction is right, no weight change
- if the prediction is wrong, ${w^{(i)}} = {w^{(i)}}{e^{(\alpha_j)}}$

Lastly, all weights get normalized so they sum to 1. Predictions are made by weighting the predictors by $\alpha_j$

$$\hat{y}(\mathbf{x}) = \underset{k}{argmax} \sum_{j=1}^{N} \alpha_j \cdot \mathbb{1}(\hat{y}_j(\mathbf{x}) = k)$$

In [73]:
from sklearn.ensemble import AdaBoostClassifier

adaClf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), # this is the default, a "decision stump"
    n_estimators=100,
    learning_rate=0.5,
    random_state=42,
    algorithm="SAMME" # this is the default, and is modified from the adaboost given above
)

adaClf.fit(Xtrain, ytrain)
adaClf.score(Xtest, ytest)

0.896

## Gradient Boosting

Similar conceptually to AdaBoost, but instead of tweaking weights, Gradient Boosting fits the new predictor to the *residual errors* made by the previous predictor.

In [76]:
# Gradient Boosting walkthru

import numpy as np
from sklearn.tree import DecisionTreeRegressor

m = 100 # n instances
rng = np.random.default_rng(seed=42)
X = rng.random((m,1)) - 0.5
noise = .05 * rng.standard_normal(m)
y = 3 * X[:,0]**2 + noise

treeReg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
treeReg1.fit(X,y)

y2 = y - treeReg1.predict(X)
treeReg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
treeReg2.fit(X,y2)

y3 = y2 - treeReg2.predict(X)
treeReg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
treeReg3.fit(X,y2)

# 1 predicts the val, 2 predicts residuals, 3 predicts residuals of residuals, and so on

In [77]:
# Making predictions
Xnew = np.array([[-.4], [0], [.5]])
sum(tree.predict(Xnew) for tree in [treeReg1, treeReg2, treeReg3])

array([0.54781022, 0.06427754, 0.98721801])

In [79]:
# Equivalent to above
from sklearn.ensemble import GradientBoostingRegressor

gradReg = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1, random_state=42)
gradReg.fit(X,y)
gradReg.predict(Xnew)
# performs just as poorly (this is a small n of estimators and depth)

array([0.57356534, 0.0405142 , 0.66914249])

`learning_rate` scales how much each tree contributes. Low eta = more trees needed, but generalizes better. This is called *shrinkage* and is regularization. Use cross-val to find best learning rate.

To find the optimal number of trees, set `n_iter_no_change` hyperparam. If you get no changes in 10 new trees, gradient boosting will stop.

In [82]:
gradStop = GradientBoostingRegressor(max_depth=2, learning_rate=.05, n_estimators=500, n_iter_no_change=10, random_state=42)
gradStop.fit(X,y)
gradStop.n_estimators_
# when n_iter_no_change is set, sklearn makes a small validation set with the training data to figure out if theres change. tol= hyperparam controls how much you need to see to count as change

53

In [83]:
# Stochastic Gradient Boosting
GradientBoostingRegressor(max_depth=2, learning_rate=.05, n_estimators=500, subsample=0.25, random_state=42)

#subsample trains each successive tree on that fraction of the data. High bias, lower variance, faster training.

#### Histogram-based Gradient Boosting

HGB is good for huge datasets. It bins the input features and replaces them with integers. N bins is controlled by max_bins= hyperparam, default is 255 (and it cant be higher than 255). Makes checking thresholds much faster.

HGB is O(b x m), grad boost is usually O(n x m x log(m)). b=bins, n=features, m=instances.


In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier
# Early stopping default turned on for m > 10000 
# No subsampling
# n_estimators becomes max_iter
# Minimal number of hyperparams to tweak

# Supports missing vals and categorical vals by default

In [None]:
# pipeline example for HGB -> based on california dataset ch2
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder

hgbReg = make_pipeline(
    make_column_transformer((OrdinalEncoder(), ["ocean_proximity"]),
                            remainder="passthrough"),
    HistGradientBoostingRegressor(categorical_features=[0], random_state=42)
)

### Other Gradient Boosting

XGBoost, CatBoost, and LightGBM are all gradient boosting libraries that perform very well.

## Stacking (stacked generalization)

What if instead of some simple voting algorithm, we trained a model to do the aggregation?

To train the "blender" (algo that replaces voting):
 - Use cross_val_predict() on every predictor in ensemble to get out of sample training set predictions
 - Feed these predictions into the blender -> try to predict OG target vals

You could also train multiple blenders on top of all the original predictions, and then a blender for the blenders. Keep in mind the accuracy gains for doing this are likely very small.

In [93]:
from sklearn.ensemble import StackingClassifier

stackingClf = StackingClassifier(
    estimators = [
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier()),
        ("svc", SVC(probability=True))
    ],
    final_estimator=RandomForestClassifier(),
    cv=5
)
stackingClf.fit(Xtrain, ytrain)
stackingClf.score(Xtest, ytest)
# Not a bad final score !

0.92

| Ensemble method | When to use it | Example use cases |
|---|---|---|
| Hard voting | Balanced classification dataset with multiple strong but diverse classifiers. | Spam detection, sentiment analysis, disease classification |
| Soft voting | Classification dataset with probabilistic models, where confidence scores matter. | Medical diagnosis, credit risk analysis, fake news detection |
| Bagging | Structured or semi-structured dataset with high variance and overfitting-prone models. | Financial risk modeling, ecommerce recommendation |
| Pasting | Structured or semi-structured dataset where more independent models are needed. | Customer segmentation, protein classification |
| Random forest | High-dimensional structured datasets with potentially noisy features. | Customer churn prediction, genetic data analysis, fraud detection |
| Extra-trees | Large structured datasets with many features, where speed is critical and reducing variance is important. | Real-time fraud detection, sensor data analysis |
| AdaBoost | Small to medium-sized, low-noise, structured datasets with weak learners (e.g., decision stumps), where interpretability is helpful. | Credit scoring, anomaly detection, predictive maintenance |
| Gradient boosting | Medium to large structured datasets where high predictive power is required, even at the cost of extra tuning. | Housing price prediction, risk assessment, demand forecasting |
| Histogram-based gradient boosting (HGB) | Large structured datasets where training speed and scalability are key. | Click-through rate prediction, ranking algorithms, real-time bidding in advertising |
| Stacking | Complex, high-dimensional datasets where combining multiple diverse models can maximize accuracy. | Recommendation engines, autonomous vehicle decision-making, Kaggle competitions |