<div style="text-align:center;">
    <img src="http://www.cs.wm.edu/~rml/images/wm_horizontal_single_line_full_color.png">
    <h1>CSCI 416-01/516-01, Fundamentals of AI/ML</h1>
    <h1>Fall 2025</h1>
    <h1>Ensemble learning</h1>
</div>

# Contents

- [Ensemble learning](#Ensemble-learning)
- [The wine data set](#The-wine-data-set)
- [A majority vote classifier](#A-majority-vote-classifier)
- [Bagging](#Bagging)
- [Random forests](#Random-forests)
- [Adaptive boosting of weak learners](#Adaptive-boosting-of-weak-learners)
- [Feature selection using random forests](#Feature-selection-using-random-forests)

<p style="text-align:center;">
<a href="http://www.despair.com">
  <img src="https://cdn.shopify.com/s/files/1/0535/6917/products/meetingsdemotivator_grande.jpeg"/>
</a>
</p>

# Ensemble learning

**Ensemble learning** refers to combining ML algorithms to create an algorithm that is better than any one of the constituent algorithms.

Ensemble learning is particularly common for classifiers, so we will discuss ensemble learning for classifiers.

We will look at the following approaches to ensemble learning:
* majority vote classifiers,
* bagging,
* random forests,
* boosting.

# The wine data set

We will use the classic [wine data set](https://archive.ics.uci.edu/ml/datasets/wine).  The features are a variety of chemical properties of the wine; the labels identify which of three cultivars (grape varieties) each wine was made from.

In [None]:
from sklearn.datasets import load_wine

wine = load_wine()

y = wine.target
X = wine.data

class_names = wine.target_names
print(wine.feature_names)
print(class_names)

In [None]:
print(wine.DESCR)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test =\
  train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
print(X[0:10,:])

The features differ in magnitude by several orders, which will likely give distance-based classifiers such as kNN trouble, so we will standardize the data.

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)

X_train_std = sc.transform(X_train)
X_test_std  = sc.transform(X_test)
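
For intuition, the transformation applied by `StandardScaler` can be sketched by hand. This is a minimal sketch, assuming a single feature column with made-up values (the helper names `fit_standardizer` and `standardize` are hypothetical): it subtracts the training mean, divides by the training (population) standard deviation, and reuses those training statistics for the test data.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Return (mean, std) of one feature column, using the population std."""
    return fmean(column), pstdev(column)

def standardize(column, mean, std):
    """Shift and scale values with statistics learned from the training data."""
    return [(v - mean) / std for v in column]

# Illustrative values only (not taken from the wine data).
train_col = [1000.0, 1200.0, 800.0, 1400.0, 600.0]
test_col = [900.0, 1500.0]

mean, std = fit_standardizer(train_col)
train_std = standardize(train_col, mean, std)
test_std = standardize(test_col, mean, std)   # training statistics, not test

print(train_std)
print(test_std)
```

After the transformation the training column has mean 0 and standard deviation 1; the test column generally does not, because it is scaled with the training statistics.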

# A majority vote classifier

The idea here is simple:
1. Build a collection of classifiers.
2. Poll the classifiers and use the "majority vote" as the classification.

We can either use the majority vote for the class label, or, if our classifiers produce probabilities for each class, we can take a weighted sum of the probabilities and take the largest value.
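
The two ways of combining votes can disagree. The following is a minimal sketch, assuming three classifiers with made-up class probabilities for a single test case; `hard_vote` and `soft_vote` are hypothetical helper names, not scikit-learn functions.

```python
from collections import Counter

def hard_vote(probas):
    """Each classifier votes for its most probable class; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas, weights=None):
    """Average the class probabilities (optionally weighted) and take the largest."""
    if weights is None:
        weights = [1.0] * len(probas)
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [sum(w * p[k] for w, p in zip(weights, probas)) / total
           for k in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

# Hypothetical class probabilities from three classifiers for one test case.
probas = [
    [0.9, 0.1, 0.0],   # very confident in class 0
    [0.4, 0.6, 0.0],   # mildly prefers class 1
    [0.4, 0.6, 0.0],   # mildly prefers class 1
]

print(hard_vote(probas))                        # class 1: two votes to one
print(soft_vote(probas)[0])                     # class 0: the confident model dominates
print(soft_vote(probas, weights=[1, 3, 3])[0])  # class 1: the weights shift the balance
```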

## Grab a passel of classifiers

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = KNeighborsClassifier(n_neighbors=1)
clf4 = SVC(kernel='rbf', random_state=1, gamma=0.05)
clf5 = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=0)

## Train and evaluate the individual classifiers

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

for clf in [clf1, clf2, clf3, clf4, clf5]:
  title = type(clf).__name__
  clf.fit(X_train_std, y_train)

  # from_estimator makes the predictions for the test set itself
  # and plots the row-normalized confusion matrix.
  disp = ConfusionMatrixDisplay.from_estimator(clf, X_test_std, y_test, normalize="true")
  disp.ax_.set_title(title)

## Compare with results for the training set

In [None]:

for clf in [clf1, clf2, clf3, clf4, clf5]:
  title = type(clf).__name__

  # from_estimator makes the predictions for the training set itself.
  disp = ConfusionMatrixDisplay.from_estimator(clf, X_train_std, y_train, normalize="true")
  disp.ax_.set_title(title)

## Train and evaluate some majority vote ensembles

We will use Scikit-Learn's [VotingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html).

In [None]:
from sklearn.ensemble import VotingClassifier

est = [('lr', clf1), ('rf', clf2), 
       ('knn', clf3), ('svm', clf4), ('tree', clf5)]

eclf1 = VotingClassifier(estimators=est, voting='hard')
eclf1 = eclf1.fit(X_train_std, y_train)
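# print(eclf1.predict(X_train_std))

# Soft voting needs class probabilities (predict_proba).  SVC does not provide
# them unless it is built with probability=True, so the SVM is left out here.
est = [('lr', clf1), ('rf', clf2), 
       ('knn', clf3), ('tree', clf5)]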

eclf2 = VotingClassifier(estimators=est, voting='soft')
eclf2 = eclf2.fit(X_train_std, y_train)

eclf3 = VotingClassifier(estimators=est, voting='soft', weights=[2,1,1,1])
eclf3 = eclf3.fit(X_train_std, y_train)

In [None]:
names = ["hard vote", "soft vote", "weighted soft vote"]
for name, clf in zip(names, [eclf1, eclf2, eclf3]):
  # from_estimator makes the predictions for the test set itself.
  disp = ConfusionMatrixDisplay.from_estimator(clf, X_test_std, y_test, normalize="true")
  disp.ax_.set_title(name)

## Compare with the results for the training set

In [None]:
names = ["hard vote", "soft vote", "weighted soft vote"]
for name, clf in zip(names, [eclf1, eclf2, eclf3]):
  # from_estimator makes the predictions for the training set itself.
  disp = ConfusionMatrixDisplay.from_estimator(clf, X_train_std, y_train, normalize="true")
  disp.ax_.set_title(name)

# Bagging

Majority vote ensembles are usually built from related models.  **Bagging**, a portmanteau of **bootstrap aggregating**, is one method for producing related models.

## Bootstrapping in statistics

In statistics, a bootstrap technique is one that is based on repeated sampling with replacement.

Suppose we want to estimate the percentage of trees in the US that are white oaks.

It is not feasible to survey every individual tree in the US (the *population*), so we rely on a sample of trees.  This leads to a fundamental problem in statistics: how to make inferences about the total population from a sample.

The elementary approach is to look at the percentage of white oaks in our sample and use that as an estimate of the population percentage.  This gives us a *single* estimate, with no indication of how much it would vary from one sample to another.

The bootstrapping idea is to mimic drawing new samples from the population by repeatedly resampling the sample we already have, and to use the variation among the resamples as a model of the variation among samples.

In our example, suppose we have a sample of $n$ trees.  In the resampling, we would draw $n$ times from our sample uniformly with replacement.  This means that most likely some trees will be repeated and some left out in any given resample.  The probability a tree is left out of a resample is
$$
\left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368
$$
if $n$ is large.
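
The approximation can be checked numerically. This short sketch (the function name `prob_left_out` is just for illustration) evaluates the probability for a few values of $n$ and compares it with $e^{-1}$.

```python
import math

def prob_left_out(n):
    """Probability that one particular item is missing from a bootstrap resample of size n."""
    return (1.0 - 1.0 / n) ** n

for n in (10, 100, 1000, 10000):
    print(f"n = {n:5d}: {prob_left_out(n):.4f}")

print(f"limit e^-1: {math.exp(-1):.4f}")
```

The values increase toward $e^{-1}$ as $n$ grows, so roughly a third of the sample is left out of any one resample.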

This resampling process is repeated a large number of times, say, 1,000 to 10,000 times.  For each of these bootstrap samples we compute the percentage of white oaks. From these percentages we can build a histogram of the bootstrap percentages. This provides an estimate of the shape of the distribution of the white oak percentage.
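
The procedure can be sketched with the standard library alone. The sample below is hypothetical (200 trees, 60 of them white oaks), and the choices of 2,000 resamples and a middle-95% summary are assumptions made for illustration.

```python
import random
from statistics import fmean

rng = random.Random(0)

# Hypothetical sample of n = 200 trees: 1 marks a white oak, 0 any other tree.
sample = [1] * 60 + [0] * 140
n = len(sample)

# Resample n trees with replacement and record the white oak percentage each time.
boot_percentages = []
for _ in range(2000):
    resample = rng.choices(sample, k=n)
    boot_percentages.append(100.0 * sum(resample) / n)

boot_percentages.sort()
lo = boot_percentages[int(0.025 * len(boot_percentages))]
hi = boot_percentages[int(0.975 * len(boot_percentages))]

print(f"sample percentage:          {100.0 * sum(sample) / n:.1f}")
print(f"mean bootstrap percentage:  {fmean(boot_percentages):.1f}")
print(f"middle 95% of bootstrap percentages: {lo:.1f} to {hi:.1f}")
```

The sorted list `boot_percentages` is what one would pass to a histogram routine to see the shape of the distribution.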

## Building an ensemble of classifiers from bootstrap samples

Let 
* $M$ be the number of models we wish to build for our ensemble,
* $T$ be the training set, and
* $n$ be the number of elements in $T$.

Then,

For $m = 1, \ldots, M$:
* Build a bootstrap sample $T_{m}$ by drawing $n$ training cases from $T$ with replacement.
* Use $T_{m}$ to build a classifier $C_{m}$.

The predictions of the $C_{m}$ are then combined, for example by majority vote, to make the prediction of the ensemble.
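
The loop above can be sketched in a few lines. This is an illustration under stated assumptions rather than scikit-learn's implementation: the base learner `fit_stump` is a hypothetical one-feature threshold rule, the training set is a toy one, and the predictions are combined by majority vote.

```python
import random
from collections import Counter

def fit_stump(cases):
    """Fit a 1-D threshold rule with the fewest training errors.

    cases is a list of (x, y) pairs with y in {0, 1}.
    """
    best = None
    for t in sorted({x for x, _ in cases}):
        for above in (0, 1):              # label predicted when x > t
            errors = sum((above if x > t else 1 - above) != y for x, y in cases)
            if best is None or errors < best[0]:
                best = (errors, t, above)
    _, t, above = best
    return lambda x: above if x > t else 1 - above

def bagging(T, M, fit, rng):
    """Build M classifiers, each from a bootstrap sample of the training set T."""
    n = len(T)
    classifiers = []
    for _ in range(M):
        T_m = rng.choices(T, k=n)         # n draws from T with replacement
        classifiers.append(fit(T_m))      # C_m
    return classifiers

def ensemble_predict(classifiers, x):
    """Combine the individual predictions by majority vote."""
    return Counter(C_m(x) for C_m in classifiers).most_common(1)[0][0]

# Toy training set: label 1 when x > 5, label 0 otherwise.
T = [(float(x), int(x > 5)) for x in range(11)]
classifiers = bagging(T, M=25, fit=fit_stump, rng=random.Random(0))

print(ensemble_predict(classifiers, 0.0))
print(ensemble_predict(classifiers, 10.0))
```

Any function that maps a list of training cases to a classifier could be passed as `fit`; the stump is used only to keep the sketch short.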

# Random forests

When decision trees are combined using bagging, the result is called a **random forest**.

In building random forests, bagging is often combined with another idea: build each tree from a different randomly chosen subset of the features.  This is called **subspace sampling** or **random subspaces**.
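
As a rough sketch of the per-tree version of this idea, the following draws a random subset of feature indices for each tree. The function name `random_subspaces` and the subset size of 4 are assumptions chosen for illustration; only the count of 13 features comes from the wine data.

```python
import random

def random_subspaces(n_features, n_trees, subset_size, rng):
    """Choose a random subset of feature indices (without replacement) for each tree."""
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_trees)]

# The wine data has 13 features; the subset size of 4 is an arbitrary illustration.
subspaces = random_subspaces(n_features=13, n_trees=5, subset_size=4,
                             rng=random.Random(0))
for m, features in enumerate(subspaces, start=1):
    print(f"tree {m}: features {features}")
```

Each tree would then be trained on its bootstrap sample restricted to the chosen columns.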

We will use the scikit-learn [BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) and [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

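We use subspace sampling with <code>max_features=1</code>.  In scikit-learn's <code>RandomForestClassifier</code> this parameter sets the number of randomly chosen features considered at each split, so every split in every tree looks at only a single feature.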

We also use **stumps** &ndash; trees of height 1.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

tree = DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=1)

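# Bag our own classifier: the stump is passed as the base estimator
# (first positional argument); without it, full-depth trees would be bagged.
bag = BaggingClassifier(tree, n_estimators=500, bootstrap=True, bootstrap_features=False, n_jobs=1, random_state=42)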

forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, max_features=1, random_state=1)

In [None]:
from sklearn.metrics import accuracy_score

print('Single tree')
tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred  = tree.predict(X_test)
tree_train = accuracy_score(y_train, y_train_pred)
tree_test  = accuracy_score(y_test,  y_test_pred)
print(f"Single tree train/test accuracies:   {tree_train:.3f}/{tree_test:.3f}")
cm = confusion_matrix(y_test, y_test_pred, normalize="true")
print('Normalized confusion matrix')
print(cm)
print(72*'=')

print('A bag of trees')
bag = bag.fit(X_train, y_train)
y_train_pred = bag.predict(X_train)
y_test_pred  = bag.predict(X_test)
bag_train = accuracy_score(y_train, y_train_pred) 
bag_test  = accuracy_score(y_test,  y_test_pred) 
print(f"Bag of trees train/test accuracies:  {bag_train:.3f}/{bag_test:.3f}")
cm = confusion_matrix(y_test, y_test_pred, normalize="true")
print("Normalized confusion matrix")
print(cm)
print(72*'=')

print('Random forest')
forest = forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred  = forest.predict(X_test)
forest_train = accuracy_score(y_train, y_train_pred) 
forest_test  = accuracy_score(y_test,  y_test_pred) 
print(f"Random forest train/test accuracies: {forest_train:.3f}/{forest_test:.3f}")
cm = confusion_matrix(y_test, y_test_pred, normalize="true")
print('Normalized confusion matrix')
print(cm)

# Adaptive boosting of weak learners

Some classifiers are called **weak learners**.
A weak learner is a classifier that is not necessarily very accurate, but that **does** consistently beat random guessing.

We can apply **adaptive boosting** to weak learners to greatly improve their performance.

In order for adaptive boosting to work well on weak learners,
1. we want to be able to train a weak learner quickly, since we are going to be building large numbers (hundreds to thousands) of them, and
2. we also want a weak learner that makes predictions quickly.

Decision trees make for good weak learners. By changing the maximum depth of the tree, we can control the training and prediction times, but also the potential for overfitting (variance).

The classic weak learner is a stump.

## The boosting idea

Boosting is similar to bagging, but it uses a more sophisticated technique than bootstrap resampling to create its training sets.

Suppose we train a classifier and find that its training error rate is $\varepsilon$.  We want to add another classifier to the ensemble that does better on the misclassifications of the first classifier.

One way to do this is to duplicate the misclassified cases in our resampled training set.  This will shift the new classifier's attention to fixing the mistakes of the previous classifier.

In practice, rather than duplicate misclassified cases, which makes the training set grow, we give them a higher weight.  For instance, in an SVM, we could add a weight term to the margin error for the misclassified cases.

## The weighting scheme in boosting

How much should the weights change?

The basic idea is that half of the total weight is assigned to the misclassified cases, and the other half to the rest.

We start with uniform weights that sum to 1.  The current weight assigned to the misclassified examples is the error rate $\varepsilon$, so we multiply their weights by $1/(2\varepsilon)$:
$$
\varepsilon \times \frac{1}{2\varepsilon} = \frac{1}{2}.
$$
Assuming $\varepsilon < 0.5$ this increases the weight, as desired.

The weights of the correctly classified examples are multiplied by $1/(2(1-\varepsilon))$, so that their total weight also becomes $1/2$.

In the next round we do the same, except we take the non-uniform weights into account when evaluating the error rate.
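
Two rounds of this update can be sketched as follows. The five training cases and the pattern of mistakes are made up, and `update_weights` is a hypothetical helper; exact fractions are used so that the halves are easy to see.

```python
from fractions import Fraction

def update_weights(weights, misclassified):
    """Rescale weights so the misclassified cases carry half of the total weight.

    weights sum to 1; misclassified[i] is True when case i was misclassified.
    """
    eps = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    up = 1 / (2 * eps)            # factor for misclassified cases
    down = 1 / (2 * (1 - eps))    # factor for correctly classified cases
    return eps, [w * (up if wrong else down)
                 for w, wrong in zip(weights, misclassified)]

# Round 1: five cases with uniform weights; the classifier gets case 0 wrong.
weights = [Fraction(1, 5)] * 5
eps1, weights = update_weights(weights, [True, False, False, False, False])
print(eps1, weights)

# Round 2: a new classifier gets case 1 wrong; eps now uses the updated weights.
eps2, weights = update_weights(weights, [False, True, False, False, False])
print(eps2, weights)
```

In round 2 the error rate is 1/8 rather than 1/5, because the misclassified case carries weight 1/8 after the first update.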

Suppose we have the classifier results
<pre>
              predicted pets    predicted food  total
actual pets   24                16               40
actual food    9                51               60
total         33                67              100
</pre>
The error rate is $\varepsilon = (9+16)/100 = 0.25$.  The weight update for the misclassified cases is $1/(2\varepsilon) = 2$ and for the correctly classified cases $1/(2(1-\varepsilon)) = 2/3$.

Using these weights leads to the reweighted confusion matrix
<pre>
              predicted pets    predicted food  total
actual pets   16                32               48
actual food   18                34               52
total         34                66              100
</pre>
Upon reweighting the error rate is 0.5.
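
The arithmetic of this example can be reproduced directly from the counts in the first table. This is only a check of the worked example; the variable names are chosen for illustration.

```python
from fractions import Fraction

# Counts from the original confusion matrix (rows: actual, columns: predicted).
pets_as_pets, pets_as_food = 24, 16
food_as_pets, food_as_food = 9, 51
total = pets_as_pets + pets_as_food + food_as_pets + food_as_food

eps = Fraction(pets_as_food + food_as_pets, total)   # 25/100
up = 1 / (2 * eps)            # multiplier for misclassified cases
down = 1 / (2 * (1 - eps))    # multiplier for correctly classified cases

reweighted = [
    [pets_as_pets * down, pets_as_food * up],
    [food_as_pets * up, food_as_food * down],
]
for name, row in zip(("actual pets", "actual food"), reweighted):
    print(name, [int(v) for v in row], "total", int(sum(row)))

new_eps = (reweighted[0][1] + reweighted[1][0]) / sum(map(sum, reweighted))
print("reweighted error rate:", new_eps)
```

The reweighted rows sum to 48 and 52, and the misclassified cells (32 and 18) account for exactly half of the total weight.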

The last piece of the boosting algorithm is a confidence factor $\alpha$ for each model in the ensemble.  This is used to compute the ensemble prediction, which is a weighted vote over the predictions of the individual models, with each model weighted by its $\alpha$.

We want $\alpha$ to increase as $\varepsilon$ decreases.  A common choice is 
$$
\alpha = \frac{1}{2} \ln\frac{1-\varepsilon}{\varepsilon}.
$$
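
To get a feel for this formula, the sketch below evaluates $\alpha$ for a few error rates and uses it in a weighted vote. The two-class setting with labels $+1$ and $-1$, the error rates, and the helper names `confidence` and `weighted_vote` are assumptions made for illustration.

```python
import math

def confidence(eps):
    """Confidence factor alpha for a model with weighted error rate eps."""
    return 0.5 * math.log((1.0 - eps) / eps)

for eps in (0.05, 0.25, 0.45, 0.5):
    print(f"eps = {eps:.2f}  alpha = {confidence(eps):.3f}")

def weighted_vote(predictions, alphas):
    """Binary ensemble prediction: sign of the alpha-weighted sum of +1/-1 votes."""
    score = sum(a * p for a, p in zip(alphas, predictions))
    return 1 if score >= 0 else -1

# Hypothetical error rates for three models and their votes on one case.
alphas = [confidence(e) for e in (0.10, 0.40, 0.40)]
print(weighted_vote([+1, -1, -1], alphas))   # the accurate model outvotes the other two
```

A model with $\varepsilon = 0.5$ gets $\alpha = 0$ and so has no say in the vote, while $\alpha$ grows without bound as $\varepsilon$ approaches 0.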

In [None]:
from sklearn.ensemble import AdaBoostClassifier

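tree = DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=42)

# Boost the stump defined above by passing it as the base estimator
# (first positional argument).
ada = AdaBoostClassifier(tree, n_estimators=5000, random_state=0)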

In [None]:
print("Single stump")
tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred  = tree.predict(X_test)
tree_train = accuracy_score(y_train, y_train_pred)
tree_test  = accuracy_score(y_test,  y_test_pred)
print(f"Decision tree train/test accuracies {tree_train:.3f}/{tree_test:.3f}")
cm = confusion_matrix(y_test, y_test_pred, normalize="true")
print("Normalized confusion matrix")
print(cm)
print(72*'=')

print("Boosted stumps")
ada = ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred  = ada.predict(X_test)
ada_train = accuracy_score(y_train, y_train_pred) 
ada_test  = accuracy_score(y_test,  y_test_pred) 
print(f"AdaBoost train/test accuracies      {ada_train:.3f}/{ada_test:.3f}")
cm = confusion_matrix(y_test, y_test_pred, normalize="true")
print("Normalized confusion matrix")
print(cm)

# Feature selection using random forests

**Feature selection** refers to the selection of the "best" or "most useful" subset of the raw features for use in an ML algorithm.

**Feature extraction** refers to deriving new features from the raw features for use in an ML algorithm.

Here we discuss feature selection using random forests.  Random forests can estimate the importance of a feature for classification by examining how much the decision tree nodes that use the feature reduce impurity.

In estimating feature importance we average across all the nodes in all the trees in the random forest.

In Scikit-Learn, this score is computed automatically for each feature, with the results being scaled to add to 1 when summed over all features.
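
The notion of impurity reduction can be made concrete with a toy node. This is a simplified sketch, not scikit-learn's computation: it assumes entropy as the impurity measure (Gini impurity works the same way), uses made-up splits attributed to three hypothetical features, and ignores details such as weighting each node by the share of training cases that reach it.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Toy node with 4 cases of class 0 and 4 of class 1, split three different ways.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
raw = {
    "feature A": impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1]),  # clean split
    "feature B": impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1]),  # partial split
    "feature C": impurity_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1]),  # useless split
}

# Scale the raw scores so that they add to 1 over all features.
total = sum(raw.values())
importances = {name: score / total for name, score in raw.items()}
for name, score in importances.items():
    print(f"{name}: {score:.3f}")
```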

In [None]:
forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=1)
forest = forest.fit(X_train, y_train)

In [None]:
print("Random forest")
# The forest was already fit in the previous cell, so it is not refit here.
y_train_pred = forest.predict(X_train)
y_test_pred  = forest.predict(X_test)

forest_train = accuracy_score(y_train, y_train_pred) 
forest_test  = accuracy_score(y_test,  y_test_pred) 
print(f"Random forest train/test accuracies:       {forest_train:.3f}/{forest_test:.3f}")

cm = confusion_matrix(y_test, y_test_pred, normalize="true")
print('Normalized confusion matrix')
print(cm)

In [None]:
for name, score in zip(wine.feature_names, forest.feature_importances_):
  print(f"{name:32s} {score:f}")

## A reduced model

Let's build a model using only a subset of the original features.

We will only keep features with importances of 0.06 or more.
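
Selecting by threshold amounts to a one-line filter. The sketch below uses a hypothetical helper `select_features` and made-up importance scores for six features; they are not the values produced by the forest above.

```python
def select_features(importances, threshold):
    """Return the indices of the features whose importance is at least threshold."""
    return [j for j, score in enumerate(importances) if score >= threshold]

# Made-up importance scores for six features (not the values from the wine forest).
example_importances = [0.30, 0.02, 0.25, 0.05, 0.08, 0.30]
kept = select_features(example_importances, threshold=0.06)
print(kept)   # [0, 2, 4, 5]
```

In this notebook the equivalent call would presumably be `select_features(forest.feature_importances_, 0.06)`, which avoids typing the column indices by hand.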

In [None]:
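# Column indices chosen from the importances printed above.
columns = [0,6,9,10,11,12]

# Keep X intact so that this cell can be re-run safely.
X_reduced = X[:,columns]

X_train, X_test, y_train, y_test =\
  train_test_split(X_reduced, y, test_size=0.3, random_state=42)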

In [None]:
print('Random forest with reduced number of features')
forest = forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred  = forest.predict(X_test)

forest_train = accuracy_score(y_train, y_train_pred) 
forest_test  = accuracy_score(y_test,  y_test_pred) 
print(f"Random forest train/test accuracies: {forest_train:.3f}/{forest_test:.3f}")

cm = confusion_matrix(y_test, y_test_pred, normalize="true")
print('Normalized confusion matrix')
print(cm)

We now have a simpler model that does just as well!

#### This notebook is brought to you by tolerance.

<blockquote>
What is tolerance? It is the consequence of humanity.  We are all formed of frailty and error;
let us pardon reciprocally one another’s folly - that is the first law of nature. <br/><br/>

Qu’est-ce que la tolérance? c’est l’apanage de l’humanité.  Nous sommes tous pétris de faiblesses et d’erreurs;
pardonnons-nous réciproquement nos sottises, c’est la première loi de la nature.<br/>
&ndash; Voltaire (1694-1778), Dictionnaire philosophique, “Tolérance” (1764)
</blockquote>