
# Linear AdaBoost Classifier — Theory, Implementation, and Visualization

**Date:** 2025-10-14 04:44

This notebook is a complete, self-contained guide to **AdaBoost** for (binary) classification, presented as a *linear additive model* in function space. It includes:
- A brief, math-first introduction with clean $$ ... $$ LaTeX blocks suitable for Jupyter
- The AdaBoost algorithm derived from exponential loss
- A from-scratch Python implementation using **decision stumps** as weak learners
- Visualizations of decision boundaries on synthetic datasets
- A comparison with `sklearn.ensemble.AdaBoostClassifier`
- Practical tips: regularization, class imbalance, and diagnostics

> **Notation.** We use training samples $(\mathbf{x}_i, y_i)$ with labels $y_i \in \{-1, +1\}$ and features $\mathbf{x}_i \in \mathbb{R}^d$.






## 1. AdaBoost as a Linear Additive Model in Function Space

AdaBoost combines many **weak learners** $h_t(\mathbf{x}) \in \{-1, +1\}$ into a strong classifier:
$$ F_T(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t\, h_t(\mathbf{x}). $$
The final prediction is the sign:
$$ \hat{y}(\mathbf{x}) = \mathrm{sign}\!\left(F_T(\mathbf{x})\right). $$

AdaBoost minimizes the **exponential loss**:
$$ \mathcal{L}(F) = \sum_{i=1}^n \exp\!\left(-y_i\,F(\mathbf{x}_i)\right). $$

At each round $t$, we fit a weak learner $h_t$ to the **weighted** sample distribution $D_t(i)$ and choose a weight $\alpha_t$ to reduce the loss.
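To make the exponential loss concrete, here is a minimal sketch on made-up labels and scores (the arrays `y` and `F` are illustrative assumptions, not data from this notebook). It evaluates $\mathcal{L}(F)$ and checks that it upper-bounds the number of misclassified points, which holds because $e^{-m} \ge \mathbb{1}\{m \le 0\}$ for every margin $m = y\,F(\mathbf{x})$.

```python
import numpy as np

# Hypothetical toy labels and ensemble scores F(x_i), chosen only for illustration.
y = np.array([+1, +1, -1, -1, +1])
F = np.array([2.0, 0.3, -1.5, 0.4, -0.2])

margins = y * F                       # y_i * F(x_i): positive means correctly classified
exp_loss = np.exp(-margins).sum()     # exponential loss L(F)
n_errors = int((margins <= 0).sum())  # 0-1 loss (a zero margin is counted as an error)

print(f"exponential loss = {exp_loss:.3f}, misclassified = {n_errors}")
```

Points with large positive margins contribute almost nothing to the loss, while confidently wrong points dominate it. This is why AdaBoost concentrates on hard examples, and also why it can be sensitive to label noise.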



### 1.1 Weighted error, learner weight, and distribution update

Given weights $D_t(i)$ with $\sum_i D_t(i)=1$, the **weighted error** of $h_t$ is
$$ \varepsilon_t = \sum_{i=1}^{n} D_t(i)\,\mathbb{1}\{y_i \neq h_t(\mathbf{x}_i)\}. $$

If $\varepsilon_t < \tfrac{1}{2}$, we set
$$ \alpha_t = \tfrac{1}{2}\,\ln\!\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right). $$

We then update the distribution
$$ D_{t+1}(i) = \frac{ D_t(i)\,\exp\!\left(-\alpha_t\,y_i\,h_t(\mathbf{x}_i)\right) }{ Z_t }, $$
where $Z_t$ is a normalization constant so that $\sum_i D_{t+1}(i)=1$.
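
The formula for $\alpha_t$ can be recovered from the exponential loss. The following is a short derivation sketch using only the definitions above, plus the convention $F_0 \equiv 0$ (an assumption made here for bookkeeping). Unrolling the update shows $D_t(i) \propto \exp\!\left(-y_i\,F_{t-1}(\mathbf{x}_i)\right)$, so adding $\alpha\,h_t$ to $F_{t-1}$ multiplies the training loss by

$$ Z_t(\alpha) = \sum_{i=1}^{n} D_t(i)\,\exp\!\left(-\alpha\,y_i\,h_t(\mathbf{x}_i)\right) = (1-\varepsilon_t)\,e^{-\alpha} + \varepsilon_t\,e^{\alpha}. $$

Setting the derivative with respect to $\alpha$ to zero (assuming $0 < \varepsilon_t < \tfrac{1}{2}$) gives

$$ -(1-\varepsilon_t)\,e^{-\alpha} + \varepsilon_t\,e^{\alpha} = 0 \;\Longleftrightarrow\; e^{2\alpha} = \frac{1-\varepsilon_t}{\varepsilon_t} \;\Longleftrightarrow\; \alpha_t = \tfrac{1}{2}\,\ln\!\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right). $$

Substituting $\alpha_t$ back yields the normalizer

$$ Z_t = 2\sqrt{\varepsilon_t\,(1-\varepsilon_t)}, $$

which is strictly less than $1$ whenever $\varepsilon_t \neq \tfrac{1}{2}$. Every round with $\varepsilon_t < \tfrac{1}{2}$ therefore shrinks the training exponential loss by the factor $Z_t$.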



### 1.2 Interpretation

- The model $F_T(\mathbf{x})$ is **linear** in the basis functions $\{h_t(\cdot)\}_{t=1}^T$.
- Misclassified points get **upweighted**, forcing subsequent weak learners to focus on harder cases.
- The sign of $F_T(\mathbf{x})$ yields the class; $|F_T(\mathbf{x})|$ acts as a confidence score, and the signed quantity $y\,F_T(\mathbf{x})$ is the (unnormalized) **margin**.
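
The upweighting effect can be checked by hand on a single round. The sketch below uses a made-up five-point example (the arrays `y`, `h`, and `D` are illustrative assumptions) and applies the formulas from Section 1.1 directly.

```python
import numpy as np

# Hypothetical toy round: 5 points, uniform weights, weak learner wrong on the last point.
y = np.array([+1, +1, -1, -1, +1])
h = np.array([+1, +1, -1, -1, -1])   # predictions h_t(x_i)
D = np.full(5, 1 / 5)                # current distribution D_t, sums to 1

miss = (h != y)
eps = D[miss].sum()                        # weighted error epsilon_t
alpha = 0.5 * np.log((1 - eps) / eps)      # learner weight alpha_t
D_next = D * np.exp(-alpha * y * h)        # unnormalized update
D_next /= D_next.sum()                     # divide by Z_t

print("eps =", eps, "| alpha =", round(float(alpha), 4))
print("D_next =", np.round(D_next, 4))
print("weight on misclassified points after update:", D_next[miss].sum())
```

After the update, the misclassified points carry exactly half of the total weight. In other words, $h_t$ has weighted error $\tfrac{1}{2}$ under $D_{t+1}$, so the next weak learner cannot simply repeat it and still do better than chance.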



## 2. Pseudocode (Binary AdaBoost with Decision Stumps)

**Input:** training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ with $y_i \in \{-1, +1\}$, number of rounds $T$.  
**Initialize:** $D_1(i) = \tfrac{1}{n}$ for all $i$.

For $t = 1, \ldots, T$:
1. Train weak learner $h_t$ using weights $D_t$ (e.g., a 1D threshold per feature, pick best).
2. Compute weighted error: $$ \varepsilon_t = \sum_i D_t(i)\,\mathbb{1}\{y_i \neq h_t(\mathbf{x}_i)\}. $$
3. Set $$ \alpha_t = \tfrac{1}{2}\ln\!\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right). $$
4. Update $$ D_{t+1}(i) \propto D_t(i)\,\exp\!\left(-\alpha_t y_i h_t(\mathbf{x}_i)\right). $$

**Output:** $F_T(\mathbf{x})=\sum_{t=1}^T \alpha_t h_t(\mathbf{x})$, classifier $\hat{y}(\mathbf{x})=\mathrm{sign}(F_T(\mathbf{x}))$.


In [None]:

# 3. From-scratch implementation of AdaBoost with decision stumps

import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass

np.random.seed(42)

@dataclass
class DecisionStump:
    feature: int = 0
    threshold: float = 0.0
    polarity: int = 1  # +1 or -1
    def predict(self, X):
        # X: (n_samples, n_features)
        feature_values = X[:, self.feature]
        preds = np.ones(X.shape[0], dtype=int)
        if self.polarity == 1:
            preds[feature_values < self.threshold] = -1
        else:
            preds[feature_values >= self.threshold] = -1
        return preds

class AdaBoostScratch:
    def __init__(self, n_estimators=50, learning_rate=1.0):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.alphas = []
        self.stumps = []

    def _best_stump(self, X, y, w):
        n_samples, n_features = X.shape
        best_stump = DecisionStump()
        min_error = np.inf

        for f in range(n_features):
            values = np.unique(X[:, f])
            if values.size == 1:
                thresholds = values
            else:
                thresholds = (values[:-1] + values[1:]) / 2.0

            for polarity in (+1, -1):
                for thr in thresholds:
                    stump = DecisionStump(feature=f, threshold=thr, polarity=polarity)
                    preds = stump.predict(X)
                    misclassified = (preds != y)
                    error = np.dot(w, misclassified.astype(float))
                    if error < min_error:
                        min_error = error
                        best_stump = stump
        return best_stump, min_error

    def fit(self, X, y):
        # Expect y in {-1, +1}
        n_samples = X.shape[0]
        w = np.ones(n_samples) / n_samples
        self.alphas = []
        self.stumps = []

        for t in range(self.n_estimators):
            stump, error = self._best_stump(X, y, w)
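            # Clamp into (0, 0.5): avoids division by zero for a perfect stump
            # and keeps alpha positive when the best stump is no better than chance.
            error = max(1e-10, min(0.4999999999, error))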
            alpha = 0.5 * np.log((1 - error) / error)
            alpha *= self.learning_rate  # shrinkage

            preds = stump.predict(X)
            w *= np.exp(-alpha * y * preds)
            w /= np.sum(w)

            self.alphas.append(alpha)
            self.stumps.append(stump)

        return self

    def decision_function(self, X):
        F = np.zeros(X.shape[0])
        for alpha, stump in zip(self.alphas, self.stumps):
            F += alpha * stump.predict(X)
        return F

    def predict(self, X):
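        # np.sign would return 0 for a score of exactly 0; break ties toward +1
        # so predictions always lie in {-1, +1}.
        return np.where(self.decision_function(X) >= 0, 1, -1).astype(int)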


In [None]:

# 4. Synthetic datasets and training

from sklearn.datasets import make_classification, make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.preprocessing import StandardScaler

def to_pm1(y01):
    return np.where(y01 == 1, 1, -1).astype(int)

# Dataset A: linearly separable-ish
X1, y1 = make_classification(n_samples=600, n_features=2, n_redundant=0, n_informative=2,
                             n_clusters_per_class=1, class_sep=1.2, flip_y=0.05, random_state=0)
y1_pm = to_pm1(y1)
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1_pm, test_size=0.3, random_state=0)

# Dataset B: moons (nonlinear structure)
X2, y2 = make_moons(n_samples=600, noise=0.25, random_state=0)
y2_pm = to_pm1(y2)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2_pm, test_size=0.3, random_state=0)

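# Standardize features. Threshold stumps are invariant to monotone rescaling of a
# feature, so this mainly keeps the plot axes comparable across datasets.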
scaler1 = StandardScaler().fit(X1_train)
X1_train_s = scaler1.transform(X1_train)
X1_test_s  = scaler1.transform(X1_test)

scaler2 = StandardScaler().fit(X2_train)
X2_train_s = scaler2.transform(X2_train)
X2_test_s  = scaler2.transform(X2_test)

# Train from-scratch AdaBoost
model1 = AdaBoostScratch(n_estimators=50, learning_rate=1.0).fit(X1_train_s, y1_train)
model2 = AdaBoostScratch(n_estimators=150, learning_rate=0.8).fit(X2_train_s, y2_train)

pred1 = model1.predict(X1_test_s)
pred2 = model2.predict(X2_test_s)

print("Dataset A (linear-ish) accuracy:", accuracy_score(y1_test, pred1))
print("Dataset B (moons) accuracy:", accuracy_score(y2_test, pred2))

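def prob_from_scores(scores):
    # Heuristic probability estimate: under exponential loss the population
    # minimizer is F(x) = 0.5 * log-odds, hence the factor 2. AUC depends only
    # on the ranking of the scores, so it is unaffected by this monotone map.
    return 1 / (1 + np.exp(-2.0 * scores))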

auc1 = roc_auc_score((y1_test==1).astype(int), prob_from_scores(model1.decision_function(X1_test_s)))
auc2 = roc_auc_score((y2_test==1).astype(int), prob_from_scores(model2.decision_function(X2_test_s)))
print("AUC A:", auc1, " | AUC B:", auc2)


In [None]:

# 5. Visualization helpers: decision boundaries

def plot_decision_boundary(model, X, y, title="Decision boundary", scaler=None, h=0.02):
    if scaler is not None:
        X_plot = scaler.transform(X)
    else:
        X_plot = X

    x_min, x_max = X_plot[:, 0].min() - 1.0, X_plot[:, 0].max() + 1.0
    y_min, y_max = X_plot[:, 1].min() - 1.0, X_plot[:, 1].max() + 1.0

    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    grid = np.c_[xx.ravel(), yy.ravel()]
    Z = model.predict(grid).reshape(xx.shape)

    plt.figure(figsize=(6, 5))
    # Use finite contour levels that bracket the two predicted labels in Z.
    plt.contourf(xx, yy, Z, alpha=0.3, levels=[-1.5, 0, 1.5])
    idx_pos = (y == 1)
    idx_neg = (y == -1)
    plt.scatter(X_plot[idx_pos, 0], X_plot[idx_pos, 1], s=20, label="+1")
    plt.scatter(X_plot[idx_neg, 0], X_plot[idx_neg, 1], s=20, marker="x", label="-1")
    plt.title(title)
    plt.legend(loc="best")
    plt.xlabel("x1 (scaled)" if scaler is not None else "x1")
    plt.ylabel("x2 (scaled)" if scaler is not None else "x2")
    plt.show()

plot_decision_boundary(model1, X1_train, y1_train, title="AdaBoost (scratch) on Dataset A", scaler=scaler1)
plot_decision_boundary(model2, X2_train, y2_train, title="AdaBoost (scratch) on Dataset B (moons)", scaler=scaler2)


In [None]:

# 6. Compare with scikit-learn's AdaBoostClassifier

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# The `algorithm` argument is left at the library default: "SAMME.R" was
# deprecated in scikit-learn 1.4 and removed in later releases, so passing it
# explicitly fails there. The `estimator=` keyword needs scikit-learn >= 1.2.
sk_model1 = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1.0,
    random_state=0
).fit(X1_train_s, (y1_train==1).astype(int))

sk_model2 = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=150,
    learning_rate=0.8,
    random_state=0
).fit(X2_train_s, (y2_train==1).astype(int))

sk_pred1 = np.where(sk_model1.predict(X1_test_s)==1, 1, -1)
sk_pred2 = np.where(sk_model2.predict(X2_test_s)==1, 1, -1)

print("Sklearn - Dataset A accuracy:", accuracy_score(y1_test, sk_pred1))
print("Sklearn - Dataset B accuracy:", accuracy_score(y2_test, sk_pred2))

class SKWrapper:
    def __init__(self, clf): self.clf = clf
    def predict(self, X):
        return np.where(self.clf.decision_function(X) >= 0, 1, -1)

plot_decision_boundary(SKWrapper(sk_model1), X1_train, y1_train, title="sklearn AdaBoost on Dataset A", scaler=scaler1)
plot_decision_boundary(SKWrapper(sk_model2), X2_train, y2_train, title="sklearn AdaBoost on Dataset B (moons)", scaler=scaler2)

print("Confusion Matrix (A):\n", confusion_matrix(y1_test, sk_pred1))
print("Confusion Matrix (B):\n", confusion_matrix(y2_test, sk_pred2))



## 7. Regularization, learning rate, and early stopping

- **Learning rate (shrinkage):** use $\nu \in (0,1]$ to scale each $\alpha_t \leftarrow \nu\,\alpha_t$ to reduce overfitting.
- **Early stopping:** monitor validation error or margin distribution and stop when performance plateaus.
- **Max depth of stumps/trees:** keep weak learners truly weak (e.g., depth-1 stumps) for the classic AdaBoost behavior.
- **Class imbalance:** initialize $D_1(i)$ inversely proportional to class frequency (so each class carries equal total weight), or use stratified sampling.
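
Two of the tips above are easy to sketch in code. The helpers below are illustrative assumptions and not part of the notebook's classes: `balanced_initial_weights` builds a class-balanced $D_1$, and `staged_error` computes the error after each boosting round from the learner weights and a matrix of weak-learner predictions, which is what an early-stopping rule would monitor on a validation set.

```python
import numpy as np

def balanced_initial_weights(y):
    """Return D_1 in which each class carries total weight 1/2.

    Hypothetical helper; assumes y is in {-1, +1} and both classes are present.
    """
    y = np.asarray(y)
    w = np.empty(y.shape[0], dtype=float)
    for label in (-1, +1):
        mask = (y == label)
        w[mask] = 0.5 / mask.sum()
    return w

def staged_error(alphas, H, y):
    """Error rate of sign(F_t) for t = 1..T.

    alphas: shape (T,), learner weights.
    H: shape (T, n), with H[t, i] = h_t(x_i) in {-1, +1}.
    """
    F = np.cumsum(np.asarray(alphas)[:, None] * np.asarray(H), axis=0)  # row t holds F_{t+1}(x_i)
    preds = np.where(F >= 0, 1, -1)
    return (preds != np.asarray(y)).mean(axis=1)

# Made-up validation example: 3 rounds, 4 points.
y_val = np.array([+1, +1, -1, -1])
alphas = np.array([0.8, 0.6, 0.5])
H = np.array([[+1, +1, +1, -1],
              [+1, -1, -1, -1],
              [+1, +1, -1, -1]])

errors = staged_error(alphas, H, y_val)
best_round = int(np.argmin(errors)) + 1   # earliest round with the lowest validation error

print("D_1 for labels [+1, +1, +1, -1]:", balanced_initial_weights([+1, +1, +1, -1]))
print("staged validation error:", errors, "| best round:", best_round)
```

With the from-scratch model in this notebook, `H` could be assembled as `np.array([s.predict(X_val) for s in model.stumps])` and paired with `model.alphas`. Using the balanced weights would require extending `fit` to accept an initial weight vector, which the current implementation does not do.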


In [None]:

# 8. Diagnostics: training margin distributions

def compute_margins(model, X, y):
    scores = model.decision_function(X)
    return y * scores

margins1 = compute_margins(model1, X1_train_s, y1_train)
margins2 = compute_margins(model2, X2_train_s, y2_train)

plt.figure(figsize=(6,4))
plt.hist(margins1, bins=30, alpha=0.7)
plt.title("Margin distribution — Dataset A (scratch model)")
plt.xlabel("y * F(x)"); plt.ylabel("count")
plt.show()

plt.figure(figsize=(6,4))
plt.hist(margins2, bins=30, alpha=0.7)
plt.title("Margin distribution — Dataset B (scratch model)")
plt.xlabel("y * F(x)"); plt.ylabel("count")
plt.show()



## 9. References

- Yoav Freund and Robert E. Schapire. *A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting.* Journal of Computer and System Sciences, 1997.
- Trevor Hastie, Robert Tibshirani, Jerome Friedman. *The Elements of Statistical Learning*, 2nd ed., Ch. 10.
- Robert E. Schapire. *Explaining AdaBoost.* 2013.


In [None]:

# 10. Minimal "API" for reuse

class LinearAdaBoostClassifier(AdaBoostScratch):
    '''
    A thin alias to highlight that F(x) is linear in weak learners.
    Usage:
        clf = LinearAdaBoostClassifier(n_estimators=100).fit(X, y_pm1)
        yhat = clf.predict(X_test)
    '''
    pass

print("Ready: LinearAdaBoostClassifier")
