# Boosting


## What Is Boosting?

**Boosting** is an *ensemble* technique that combines many weak predictive models—typically shallow decision trees—into one strong learner.

By fitting models **sequentially**, with each new model focusing on the errors of the previous ones, boosting steadily reduces bias and improves overall accuracy.

---

### How It Works

Conceptual algorithm (Adaptive Boosting, or AdaBoost):

1. Start with **uniform** observation weights.
2. Train a stump, i.e., a one-level tree (referred to as a weak learner in AdaBoost).
3. **Up-weight** the rows the stump misclassified.
4. Train the next stump on the re-weighted data.
5. Repeat steps 3 and 4 for $M$ rounds.
6. **Combine** the stumps in a weighted vote, where each stump's vote weight grows as its weighted error shrinks.
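
The procedure above can be sketched in a few lines for the two-class case. This is a minimal illustration of discrete AdaBoost, not scikit-learn's implementation: the helper names `adaboost_fit` and `adaboost_predict`, the toy data, and the choice of labels in {-1, +1} are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost with stumps; labels must be -1 or +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # uniform observation weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)        # train on the current weights
        pred = stump.predict(X)
        # Weighted error, clipped to avoid division by zero or log(0)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # vote weight: larger when err is small
        w = w * np.exp(-alpha * y * pred)       # up-weight the misclassified rows
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)


def adaboost_predict(X, stumps, alphas):
    """Weighted vote of all stumps."""
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(votes >= 0, 1, -1)


# Toy data (an assumption for this sketch): a diagonal boundary
# that no single axis-aligned stump can match.
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.uniform(0, 1, size=(300, 2))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 1, 1, -1)

stumps, alphas = adaboost_fit(X_demo, y_demo, n_rounds=50)
stump_acc = np.mean(stumps[0].predict(X_demo) == y_demo)
boosted_acc = np.mean(adaboost_predict(X_demo, stumps, alphas) == y_demo)
print(f"first stump: {stump_acc:.2f}, boosted ensemble: {boosted_acc:.2f}")
```

Note that the vote weight `alpha` depends on each stump's weighted error, not on its position in the sequence.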

---

### Why It Works

* Each tree focuses on what its predecessors missed ⇒ **reduces bias**.
* Shallow trees + small learning rate keep variance in check.
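
One way to see the bias reduction is to track accuracy round by round. The sketch below uses `AdaBoostClassifier.staged_score` with the default stump learner; the `make_moons` data, the 70/30 split, and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Hypothetical two-class data with a curved boundary
X_m, y_m = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_m, y_m, test_size=0.3, random_state=0
)

# The default weak learner is a depth-1 tree (a stump)
ada_demo = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada_demo.fit(X_tr, y_tr)

# staged_score yields the accuracy of the ensemble after each boosting round
train_acc = np.array(list(ada_demo.staged_score(X_tr, y_tr)))
test_acc = np.array(list(ada_demo.staged_score(X_te, y_te)))

for m in sorted({1, min(10, len(train_acc)), len(train_acc)}):
    print(f"rounds={m:3d}  train={train_acc[m - 1]:.3f}  test={test_acc[m - 1]:.3f}")
```

If held-out accuracy flattens or drops while training accuracy keeps rising, the extra rounds are adding variance rather than removing bias.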


---

### Popular Boosting Methods

| Algorithm                         | Core Idea                                                                                             |
| --------------------------------- | ----------------------------------------------------------------------------------------------------- |
| **AdaBoost**                      | Adjust sample weights via exponential loss; emphasize hard cases.                                     |
| **Gradient Boosting**             | Fit each learner to the *negative gradient* of a differentiable loss (e.g., squared, log-loss).       |
| **XGBoost / LightGBM / CatBoost** | Engineering-heavy gradient boosting variants with tree pruning, regularization, and GPU acceleration. |
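
To make the "negative gradient" idea concrete: for squared loss $\tfrac{1}{2}(y - f)^2$, the negative gradient with respect to the prediction $f$ is just the residual $y - f$, so each round fits a small tree to the current residuals. The following is a minimal sketch under that assumption; the sine-curve data, the learning rate, and the variable names are illustrative and do not come from any particular library's internals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D regression problem: noisy sine curve
rng_gb = np.random.default_rng(0)
X_gb = rng_gb.uniform(-3, 3, size=(300, 1))
y_gb = np.sin(X_gb[:, 0]) + rng_gb.normal(0, 0.1, size=300)

learning_rate = 0.1
n_rounds = 100

pred = np.full_like(y_gb, y_gb.mean())   # start from a constant model
trees = []
mse_per_round = []
for _ in range(n_rounds):
    residual = y_gb - pred               # negative gradient of 0.5 * (y - f)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_gb, residual)             # weak learner approximates the gradient step
    pred += learning_rate * tree.predict(X_gb)   # shrunken update
    trees.append(tree)
    mse_per_round.append(np.mean((y_gb - pred) ** 2))

print(f"training MSE after round 1: {mse_per_round[0]:.4f}")
print(f"training MSE after round {n_rounds}: {mse_per_round[-1]:.4f}")
```

Swapping in a different differentiable loss changes only the `residual` line, which is what makes the gradient view more general than AdaBoost's re-weighting.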

---

### Pros and Cons

✅ Often delivers *state-of-the-art* accuracy out-of-the-box.  
✅ **Reduces bias** by turning weak learners into a strong composite.  
✅ Handles **mixed feature types** and missing values (tree-based versions).  
✅ Offers many regularization knobs (learning rate, subsampling) to fight overfitting.  

❌ **Sequential training** means poor parallelism and longer training times.  
❌ Can *overfit* if the trees are too deep, the ensemble has too many rounds, or the learning rate is too high.  
❌ Model is less interpretable than a single tree (though feature importance helps).  
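
As a small illustration of the feature-importance point, the sketch below fits scikit-learn's `GradientBoostingClassifier` to synthetic data in which only the first of three features carries signal, then reads `feature_importances_`. The data and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical data: the label depends on feature 0 only; features 1 and 2 are noise
rng_fi = np.random.default_rng(0)
X_fi = rng_fi.normal(size=(500, 3))
y_fi = (X_fi[:, 0] > 0).astype(int)

gb = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
gb.fit(X_fi, y_fi)

# Impurity-based importances, normalized to sum to 1
importances = gb.feature_importances_
for i, value in enumerate(importances):
    print(f"feature {i}: {value:.3f}")
```

These importances summarize which features the trees split on; they do not explain individual predictions.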

### Key Hyper-parameters

| Parameter                      | Meaning             | Tips                                          |
| ------------------------------ | ------------------- | --------------------------------------------- |
| `n_estimators`                 | rounds $M$          | 100–500 common                                |
| `learning_rate`                | shrinkage per round | 0.01–0.1 (smaller = safer, needs more rounds) |
| `max_depth` / `max_leaf_nodes` | size of each tree   | 2–6 for tabular                               |
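
A quick way to get a feel for these settings is to cross-validate a couple of configurations side by side. The sketch below does this with `GradientBoostingClassifier`; the two configurations, the `make_moons` data, and the `subsample=0.8` setting are assumptions picked for illustration, not recommended defaults.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical noisy two-class data
X_hp, y_hp = make_moons(n_samples=400, noise=0.3, random_state=0)

# Two illustrative settings of the hyper-parameters in the table above
configs = {
    "small steps, shallow trees": dict(n_estimators=300, learning_rate=0.05, max_depth=2),
    "large steps, deep trees": dict(n_estimators=300, learning_rate=1.0, max_depth=6),
}

cv_scores = {}
for name, params in configs.items():
    # subsample < 1 fits each tree on a random fraction of the rows
    model = GradientBoostingClassifier(subsample=0.8, random_state=0, **params)
    cv_scores[name] = cross_val_score(model, X_hp, y_hp, cv=5).mean()
    print(f"{name}: mean CV accuracy = {cv_scores[name]:.3f}")
```

Which configuration wins depends on the data, so treat the printed scores as a starting point for tuning rather than a general ranking.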



> *Boosting “learns from its mistakes,” layering many simple models into a powerful predictor that often outperforms standalone learners—provided you tune it with care.*  --- Words of wisdom by ChatGPT 3o

We will examine the same simulated example. 

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated three-class example with more data points:
sample_size = 200
X = rng.uniform(0.1, 0.9, size=(sample_size, 2))
y = np.zeros(sample_size, dtype=int)
mask1 = X[:, 0] + X[:, 1] > 1.1
mask2 = (~mask1) & (X[:, 0] - X[:, 1] > 0.3)
y[mask1] = 1
y[mask2] = 0                     # already 0 from np.zeros; kept for readability
y[~(mask1 | mask2)] = 2
```


```python
# Relies on np, X, and y defined in the previous cell.
# --- numerical & plotting -------------------------------------------------
import inspect                                # to detect the supported keyword
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap   # used to define label_cmap
# --- machine learning -----------------------------------------------------
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# --- widgets / interactivity ---------------------------------------------
import ipywidgets as widgets                  # IntSlider
from ipywidgets import interact               # high-level helper


label_cmap = ListedColormap(["#1f77b4", "#ff7f0e", "#2ca02c"])

# scikit-learn 1.2 renamed `base_estimator` to `estimator` (old name removed in 1.4).
# Inspecting the signature avoids comparing version strings, which would
# rank "1.10" below "1.4".
_ADA_PARAMS = inspect.signature(AdaBoostClassifier).parameters
TREE_KWARG = "estimator" if "estimator" in _ADA_PARAMS else "base_estimator"


def plot_adaboost(max_depth=1, n_estimators=100):
    # Build AdaBoost with the right keyword for your scikit-learn version
    ada_kwargs = dict(
        n_estimators=n_estimators,
        learning_rate=0.5,
        random_state=42,
    )
    ada_kwargs[TREE_KWARG] = DecisionTreeClassifier(max_depth=max_depth)

    ada = AdaBoostClassifier(**ada_kwargs)
    ada.fit(X, y)

    # Create a fine mesh over the feature space
    x_min, x_max = X[:, 0].min() - 0.05, X[:, 0].max() + 0.05
    y_min, y_max = X[:, 1].min() - 0.05, X[:, 1].max() + 0.05
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, 400),
        np.linspace(y_min, y_max, 400)
    )
    Z = ada.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Plot the decision regions with the training points on top
    plt.figure(figsize=(6, 5))
    plt.contourf(xx, yy, Z, alpha=0.25, cmap=label_cmap)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=label_cmap,
                edgecolors="k", s=80)
    plt.title(f"AdaBoost: max_depth = {max_depth}, "
              f"n_estimators = {n_estimators}")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.tight_layout()
    plt.show()

# ── interactive sliders (depth 1-4, trees 100-500) ──────────────────
interact(
    plot_adaboost,
    max_depth=widgets.IntSlider(min=1, max=4, step=1, value=1,
                                description="max_depth"),
    n_estimators=widgets.IntSlider(min=100, max=500, step=100, value=100,
                                   description="n_estimators")
);
```

