<a href="https://colab.research.google.com/github/SzymonNowakowski/Machine-Learning-2024/blob/master/Lab07_gradient-boosting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 7 - Gradient Boosting

### Author: Szymon Nowakowski


# Introduction
-------------------------
Gradient Boosting is a powerful ensemble learning method that builds models sequentially to correct errors made by previous models. Unlike Random Forests, which construct trees independently and then aggregate their outputs, Gradient Boosting trains models in a stage-wise fashion, minimizing a loss function at each step.

Since you have already studied **CART (Classification and Regression Trees)** and **Random Forests**, this chapter introduces Gradient Boosting as another tree-based ensemble method, one that often outperforms Random Forests on predictive tasks when its hyperparameters are tuned well.


## The Idea Behind Gradient Boosting

Gradient Boosting builds an additive model of weak learners (typically decision trees) to minimize a given loss function. The model is updated iteratively:

1. Start with an initial model, typically a constant value:

   $$ F_0(x) = \arg\min_c \sum_{i=1}^{n} L(y_i, c) $$

   where $L(y_i, c)$ is the loss function (e.g., squared loss for regression or log loss for classification).

2. For each iteration $m = 1, \dots, M$, compute the pseudo-residuals:

   $$ r_{i}^{(m)} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F=F_{m-1}} $$

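   This is the negative gradient of the loss with respect to the current prediction, evaluated at each training point. For squared loss $L(y, F) = \tfrac{1}{2}(y - F)^2$ it reduces to the ordinary residual $y_i - F_{m-1}(x_i)$.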

3. Fit a new weak learner $h_m(x)$ (typically a shallow regression tree) to the pseudo-residuals $r_i^{(m)}$, using the inputs $x_i$ as features.

4. Update the model:

   $$ F_m(x) = F_{m-1}(x) + \eta h_m(x) $$

   where $\eta$ (learning rate) controls the contribution of each weak learner.
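
The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under simplifying assumptions, not a reference implementation: it uses squared loss $L(y, F) = \tfrac{1}{2}(y - F)^2$ (so $F_0$ is the mean of $y$ and the pseudo-residuals are $y_i - F_{m-1}(x_i)$), a single input feature, and depth-1 trees (stumps) as weak learners. The function names `fit_stump`, `predict_stump`, `fit_gradient_boosting`, `predict` and `mse` are hypothetical helpers introduced only for this example.

```python
import math


def fit_stump(xs, targets):
    """Least-squares depth-1 tree on one feature: (threshold, left, right)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    mean_all = sum(targets) / len(targets)
    # Fallback if no split is possible: predict the mean everywhere.
    best = (float("inf"), xs[order[-1]], mean_all, mean_all)
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # cannot split between identical feature values
        left = [targets[i] for i in order[:k]]
        right = [targets[i] for i in order[k:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left) + sum(
            (t - right_mean) ** 2 for t in right
        )
        if sse < best[0]:
            best = (sse, (lo + hi) / 2, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gradient_boosting(xs, ys, n_trees=50, learning_rate=0.1):
    # Step 1: for squared loss the best constant model is the mean of y.
    f0 = sum(ys) / len(ys)
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # Step 2: pseudo-residuals, equal to y - F_{m-1}(x) for squared loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        # Step 3: fit a weak learner to the pseudo-residuals.
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Step 4: F_m = F_{m-1} + eta * h_m.
        preds = [
            p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)
        ]
    return f0, learning_rate, stumps


def predict(model, x):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data, made up for illustration: a noiseless sine curve.
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]

model = fit_gradient_boosting(xs, ys)
mse_initial = mse(ys, [model[0]] * len(ys))
mse_boosted = mse(ys, [predict(model, x) for x in xs])
print(f"Training MSE with F_0 only:   {mse_initial:.4f}")
print(f"Training MSE after 50 stumps: {mse_boosted:.4f}")
```

With squared loss and a learning rate between 0 and 1, each stump can only lower the training error, which is why the second number is smaller than the first. For other losses, only the pseudo-residual line and the choice of $F_0$ would change.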



## Comparing Gradient Boosting to Random Forests

| Feature          | Random Forests                                        | Gradient Boosting                                   |
|------------------|-------------------------------------------------------|-----------------------------------------------------|
| Model Structure  | Ensemble of independent trees                         | Trees built sequentially                            |
| Training Process | Bagging (trees can be trained in parallel)            | Boosting (each tree corrects the previous ensemble) |
| Overfitting Risk | Lower (averaging reduces variance)                    | Higher (but can be controlled)                      |
| Performance      | Strong, but does not directly minimize a chosen loss  | Often superior for complex tasks                    |
| Speed            | Faster (can be parallelized)                          | Slower (sequential training)                        |
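
The "Model Structure" row can be checked directly on fitted models. The sketch below assumes `scikit-learn` and `numpy` are installed, uses a synthetic dataset chosen only for illustration, and relies on the default squared-error loss of `GradientBoostingRegressor`. It rebuilds each ensemble's prediction by hand: a plain average of independent trees for the Random Forest, and the constant initial model plus a learning-rate-scaled sum of sequential trees for Gradient Boosting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
gb = GradientBoostingRegressor(
    n_estimators=20, learning_rate=0.1, random_state=0
).fit(X, y)

# Random Forest: the prediction is the plain average of independent trees.
rf_by_hand = np.mean([tree.predict(X) for tree in rf.estimators_], axis=0)

# Gradient Boosting: F_0 plus eta times the sum of the sequentially fitted trees.
gb_by_hand = gb.init_.predict(X) + gb.learning_rate * sum(
    tree.predict(X) for tree in gb.estimators_[:, 0]
)

rf_matches = np.allclose(rf_by_hand, rf.predict(X))
gb_matches = np.allclose(gb_by_hand, gb.predict(X))
print("Random Forest is an average of its trees:", rf_matches)
print("Gradient Boosting is F_0 + eta * sum of trees:", gb_matches)
```

Removing one tree from the Random Forest only changes the average slightly, whereas each Gradient Boosting tree was fit to the residuals left by the trees before it, so the order of trees matters there.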





## Hyperparameters in Gradient Boosting

Tuning hyperparameters is crucial for Gradient Boosting performance. Key parameters include:

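- **Learning Rate $\eta$**: Controls how much each tree contributes. Smaller values (e.g., 0.01) shrink each update, so more trees are needed to reach the same fit.
- **Number of Trees $M$**: The number of boosting iterations. Too few trees underfit; too many can overfit, especially with a large learning rate.
- **Tree Depth $d$**: Controls the complexity of each tree. Shallow trees (e.g., depth 3 to 5) usually work well.
- **Min Samples Split & Min Samples Leaf**: The minimum number of samples required to split a node and to remain in a leaf. Larger values stop tree growth earlier.
- **Subsample**: Fraction of the training data used to fit each tree. Values below 1 introduce randomness (like in Random Forests) and can reduce overfitting.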
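
As a rough illustration of how these hyperparameters map onto code, the sketch below sets each of them on scikit-learn's `GradientBoostingRegressor` and uses `staged_predict` to see how the validation error changes with the number of trees. The dataset (`make_friedman1`), the split, and all parameter values are arbitrary choices made for this example, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data and a hold-out validation split, for illustration only.
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = GradientBoostingRegressor(
    learning_rate=0.1,     # eta: contribution of each tree
    n_estimators=200,      # M: number of trees
    max_depth=3,           # d: depth of each tree
    min_samples_split=10,  # minimum samples needed to split a node
    min_samples_leaf=5,    # minimum samples allowed in a leaf
    subsample=0.8,         # fraction of rows used to fit each tree
    random_state=0,
)
model.fit(X_train, y_train)

# Validation error after 1, 2, ..., M trees.
val_errors = [
    mean_squared_error(y_val, pred) for pred in model.staged_predict(X_val)
]
best_m = int(np.argmin(val_errors)) + 1
print(f"Lowest validation MSE is reached with {best_m} trees")
```

Tracking the staged validation error is one simple way to choose $M$ for a given learning rate: if the best value sits at the very end of the range, more trees (or a larger $\eta$) may help, while an early minimum followed by rising error suggests overfitting.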



In the next section, we will implement Gradient Boosting in Python using `scikit-learn` and `XGBoost`.
