# Adaptive Boosting from Scratch
***

In [92]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from numpy.typing import NDArray
from typing import Tuple

## 1. Introduction
Adaptive Boosting (AdaBoost) is a foundational ensemble learning algorithm that improves accuracy by combining multiple **weak classifiers** (often decision stumps, i.e. decision trees with a single split) into a single **strong classifier**. Although AdaBoost was designed for binary classification, some variants extend it to multiclass problems and regression tasks. Its core mechanism and main use case, however, remain binary classification.

### Advantages:
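- Turns weak models into a strong classifier.
- Often resists overfitting in practice, although noisy labels can still cause it.
- Needs little hyperparameter tuning: mainly the number of boosting rounds and the choice of weak learner.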

### Limitations:
- Sensitive to outliers and noisy labels, because misclassified samples receive higher weights in each round.
- Primarily for binary classification.

### Steps:
1. Initialise weights.
2. For each boosting round $m = 1, \dots, M$:
    - Train a weak learner (decision stump).
    - Compute the weighted error.
    - Calculate the learner weight $\alpha_m$.
    - Update the sample weights.
    - Stop once the maximum number of iterations is reached or the weighted error is sufficiently low.
3. Predict.
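
The steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the implementation built in this notebook: it swaps the hand-written stump for scikit-learn's `DecisionTreeClassifier(max_depth=1)`, it uses the standard exponential update for the sample weights, and the names `adaboost_sketch`, `predict_sketch`, `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_sketch(X, y, n_rounds=10, c=1e-10):
    """Illustrative AdaBoost loop; labels in y must be -1 or +1."""
    n_samples = len(y)
    weights = np.full(n_samples, 1 / n_samples)      # Step 1: equal weights
    learners, alphas = [], []

    for _ in range(n_rounds):                        # Step 2: boosting rounds
        # Assumption: a depth-1 scikit-learn tree stands in for the stump.
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        predictions = stump.predict(X)

        # Weighted error: sum of the weights of misclassified samples.
        error = np.sum(weights[predictions != y])
        alpha = 0.5 * np.log((1 - error + c) / (error + c))
        learners.append(stump)
        alphas.append(alpha)

        if error < c:                                # Nothing left to correct
            break

        # Raise the weights of misclassified samples, then renormalise.
        weights = weights * np.exp(-alpha * y * predictions)
        weights = weights / weights.sum()

    return learners, alphas


def predict_sketch(learners, alphas, X):
    """Step 3: sign of the alpha-weighted vote of all learners."""
    scores = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(scores)


# Tiny separable toy set (made up for illustration).
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([-1, -1, 1, 1])
learners, alphas = adaboost_sketch(X_toy, y_toy)
print(predict_sketch(learners, alphas, X_toy))
```

On this toy set a single stump already separates the classes, so the loop stops after the first round.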

## 2. Loading Data

In [81]:
data = load_breast_cancer()
X, y = data.data, data.target
y = np.where(y == 0, -1, 1)     # AdaBoost expects labels as -1 and +1

## 3. Initialising Weights
All training samples are initialised with equal weight:

\begin{align*}
    w_i = \dfrac{1}{N}
\end{align*}

where $N$ is the number of samples. For example, with $N = 5$, each sample's initial weight is:

\begin{align*}
    w_i = \dfrac{1}{5} = 0.2
\end{align*}

The `np.full` function from the NumPy library generates an array of the specified length with every entry set to the same value.

In [None]:
def initialise_weights(n_samples: int) -> NDArray[np.float64]:
    """
    Initialise all sample weights equally.
    """
    return np.full(n_samples, 1 / n_samples)

In [91]:
print(f'For N = 5: {initialise_weights(5)}')

For N = 5: [0.2 0.2 0.2 0.2 0.2]


## 4. Finding the Best Stump
The following `find_best_stump` function implements the decision stump. It exhaustively searches for the best one-level split across all features and candidate thresholds, considering both directions (polarities), and selects the split that minimises the weighted classification error.

1. Initialise variables.
2. Loop over all features and thresholds (unique values).
3. Loop over both polarities: $[1, -1]$.
4. Make predictions.
    - Initialise all predictions to $+1$.
    - For polarity $1$: set to $-1$ where $\text{value} < \text{threshold}$.
    - For polarity $-1$: set to $-1$ where $\text{value} > \text{threshold}$.
5. Calculate weighted error.
\begin{align*}
    \epsilon_m = \dfrac{\sum^{N}_{i=1} w_i \cdot \mathbb{I}(h_m(x_i) \neq y_i)}{\sum^{N}_{i=1}w_i}
\end{align*}

    where:
    - $h_m$: $m$-th weak learner.
    - $y_i$: True label.
    - $\mathbb{I}$: Indicator function.

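    When the sample weights sum to $1$ (as they do after initialisation), the denominator equals $1$ and the weighted error is simply the sum of the weights of the misclassified samples.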
6. If the error is smaller than `min_error`, update `min_error`, `best_stump` and `best_predictions`.
7. Return the `best_stump`, `min_error`, and `best_predictions` of the lowest-error split.
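
Steps 4 and 5 can be checked by hand on a tiny example. The sketch below evaluates one threshold with both polarities on five made-up samples with equal weights. The values in `feature_vals` and `y_toy`, and the threshold, are assumptions chosen only for illustration.

```python
import numpy as np

# Five made-up samples of a single feature, with labels in {-1, +1}.
feature_vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([-1, -1, 1, 1, -1])
weights = np.full(5, 1 / 5)      # equal initial weights, w_i = 0.2
threshold = 3.0

errors = {}
for polarity in [1, -1]:
    predictions = np.ones(5)     # start with +1 everywhere
    if polarity == 1:
        predictions[feature_vals < threshold] = -1
    else:
        predictions[feature_vals > threshold] = -1

    # Weighted error: misclassified weight divided by total weight.
    misclassified = predictions != y_toy
    errors[polarity] = np.sum(weights[misclassified]) / np.sum(weights)

print(errors)
```

With polarity $1$ only the last sample is misclassified, giving an error of $0.2$. With polarity $-1$ three samples are misclassified, giving an error of $0.6$, so the search would keep polarity $1$ for this threshold.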

In [None]:
def find_best_stump(X, y, sample_weights):
    """
    Find the decision stump (feature, threshold, polarity) that minimises weighted error.
    Returns: best_stump (dict with feature_index, threshold, polarity), min_error, best_predictions
    """
    n_samples, n_features = X.shape
    min_error = float('inf')
    best_stump = {}
    best_predictions = None

    for feature_i in range(n_features):  # Each feature
        feature_vals = X[:, feature_i]  # All values of the current feature
        thresholds = np.unique(feature_vals)  # Unique values in feature_vals
        for threshold in thresholds:
            for polarity in [1, -1]:
                # Predict: -1 if (polarity * feature) < (polarity * threshold), else +1
                predictions = np.ones(n_samples)
                if polarity == 1:
                    predictions[feature_vals < threshold] = -1
                else:
                    predictions[feature_vals > threshold] = -1

                # Calculate weighted error
                misclassified = predictions != y
                error = np.sum(sample_weights[misclassified])

                if error < min_error:
                    min_error = error
                    best_stump = {
                        "feature_index": feature_i,
                        "threshold": threshold,
                        "polarity": polarity
                    }
                    best_predictions = predictions.copy()
    return best_stump, min_error, best_predictions

## 5. Learner Weights
The weight $\alpha_m$ of the $m$-th learner is:

\begin{align*}
    \alpha_m = \dfrac{1}{2} \ln \left( \dfrac{1-\epsilon_m + c}{\epsilon_m + c} \right)
\end{align*}

where:
- $\epsilon_m$: Weighted error returned by the `find_best_stump()` function.
- $c$: Small constant added to avoid division by zero (when $\epsilon_m = 0$) and $\ln 0$ (when $\epsilon_m = 1$). Set to $1 \times 10^{-10}$.

In [None]:
def compute_alpha(error):
    """
    Compute the weight of the weak learner (alpha).
    """
    c = 1e-10  # constant
    return 0.5 * np.log((1 - error + c) / (error + c))
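
To get a feel for the formula, the sketch below evaluates it at a few error values. The helper `alpha_for` is a hypothetical name that repeats the formula from `compute_alpha` so that the snippet runs on its own.

```python
import numpy as np


def alpha_for(error, c=1e-10):
    # Same formula as compute_alpha, repeated so the snippet is self-contained.
    return 0.5 * np.log((1 - error + c) / (error + c))


for error in [0.0, 0.1, 0.5, 0.9]:
    print(f"error = {error:.1f} -> alpha = {alpha_for(error):.4f}")
```

A learner with error below $0.5$ gets a positive weight (about $1.10$ at $\epsilon_m = 0.1$), a learner no better than chance gets a weight of $0$, and a learner with error above $0.5$ gets a negative weight, which flips its vote. At $\epsilon_m = 0$ the constant $c$ keeps $\alpha_m$ finite (about $11.5$).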