# 🔹 1. Setup

We want to learn a function $F(x)$ that predicts $y$ given input $x$.

* Data: $\{(x_i, y_i)\}_{i=1}^N$
* Loss function: $L(y, F(x))$

Our goal is to minimize the **empirical risk**:

$$
\min_{F} \; \sum_{i=1}^N L(y_i, F(x_i))
$$
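As a small illustration of the objective above, the sketch below evaluates the empirical risk of two candidate models on toy data. The helper names (`squared_error`, `empirical_risk`) and the data are assumptions made for this example, and squared error is just one possible choice of $L$.

```python
def squared_error(y, pred):
    """Per-example loss L(y, F(x)) = 0.5 * (y - F(x))^2."""
    return 0.5 * (y - pred) ** 2


def empirical_risk(loss, F, xs, ys):
    """Sum of per-example losses over the training set."""
    return sum(loss(y, F(x)) for x, y in zip(xs, ys))


# Toy data (hypothetical): y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

constant_model = lambda x: 4.0     # always predicts the mean of ys
linear_model = lambda x: 2.0 * x   # fits this toy data exactly

risk_constant = empirical_risk(squared_error, constant_model, xs, ys)
risk_linear = empirical_risk(squared_error, linear_model, xs, ys)
print(risk_constant, risk_linear)
```

The model with the lower empirical risk fits the training data better; boosting searches for such a model one weak learner at a time.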

---

# 🔹 2. Idea of Gradient Boosting

Gradient Boosting builds $F(x)$ **additively** as:

$$
F_M(x) = F_0(x) + \sum_{m=1}^M \gamma_m h_m(x)
$$

* $F_0(x)$ = initial model, usually the constant that minimizes the loss (e.g., the mean of $y$ for squared error, the log-odds of the positive class for logistic loss).
* At each step $m$, we add a new weak learner $h_m(x)$ scaled by weight $\gamma_m$.
* Each $h_m(x)$ is chosen to reduce the loss as much as possible.
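To make the additive form concrete, here is a minimal sketch that evaluates $F_M(x)$ from a constant $F_0$, a list of weak learners, and their weights. The function `make_additive_model` and the two hand-written step functions are hypothetical; in real boosting the learners and weights are fitted, not chosen by hand.

```python
def make_additive_model(F0, learners, gammas):
    """Return F_M(x) = F0 + sum_m gamma_m * h_m(x)."""
    def F(x):
        return F0 + sum(g * h(x) for g, h in zip(gammas, learners))
    return F


# Two hand-picked "weak learners" (simple step functions), for illustration only.
h1 = lambda x: 1.0 if x > 2.0 else -1.0
h2 = lambda x: 1.0 if x > 1.0 else 0.0

F_M = make_additive_model(F0=3.0, learners=[h1, h2], gammas=[0.5, 0.25])

for x in (0.0, 1.5, 3.0):
    print(x, F_M(x))
```

Each learner on its own is crude, but the weighted sum already takes three distinct values across the input range.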

---

# 🔹 3. Gradient Descent Analogy

In optimization, gradient descent updates parameters in the **negative gradient direction**:

$$
\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta L
$$

In **Gradient Boosting**, we don’t have parameters like $\theta$. Instead, our "parameter" is the **function** $F(x)$. So we do **functional gradient descent**:

$$
F_{m}(x) = F_{m-1}(x) - \eta \cdot g_m(x)
$$

where

$$
g_m(x_i) = \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F=F_{m-1}}
$$

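That is, the gradient is taken with respect to the **predictions** $F(x_i)$, not with respect to model parameters. Because $g_m$ is only known at the training points $x_i$, boosting fits a weak learner to $-g_m$ so that the step can be applied at any $x$.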
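The update above can be sketched by treating the vector of training predictions as the thing being optimized. This example assumes squared error loss and a fixed step size, and it deliberately skips the weak learner: it only moves the predictions at the training points, so it cannot predict for new inputs. The variable names are illustrative.

```python
def grad_squared_error(y, pred):
    """dL/dF for L = 0.5 * (y - F)^2, evaluated at F = pred."""
    return pred - y


ys = [2.0, 4.0, 6.0]
preds = [0.0, 0.0, 0.0]   # F_0 evaluated at the training points
eta = 0.5                 # step size

history = []              # empirical risk before each step
for step in range(20):
    history.append(sum(0.5 * (y - p) ** 2 for y, p in zip(ys, preds)))
    grads = [grad_squared_error(y, p) for y, p in zip(ys, preds)]
    # Functional gradient step: F_m(x_i) = F_{m-1}(x_i) - eta * g_m(x_i)
    preds = [p - eta * g for p, g in zip(preds, grads)]

print(preds)
```

The predictions converge toward the targets and the risk falls at every step. Gradient boosting replaces the raw gradient vector with a weak learner fitted to it, which is what lets the model generalize beyond the training points.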

---

# 🔹 4. Step-by-Step Boosting Algorithm

At iteration $m$:

1. **Compute pseudo-residuals** (negative gradient):

   $$
   r_{im} = - \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F=F_{m-1}}
   $$

2. **Fit a weak learner** $h_m(x)$ to approximate $r_{im}$.

   * Usually $h_m(x)$ is a small decision tree.

3. **Compute optimal step size** (line search):

   $$
   \gamma_m = \arg\min_\gamma \sum_{i=1}^N L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big)
   $$

4. **Update the model**:

   $$
   F_m(x) = F_{m-1}(x) + \eta \gamma_m h_m(x)
   $$

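where $\eta \in (0, 1]$ is the learning rate (shrinkage), which scales down the line-search step $\gamma_m$. Smaller values of $\eta$ usually need more iterations $M$ to reach the same training loss.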
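The four steps can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes squared error loss, 1-D inputs, and depth-1 trees (stumps) as weak learners. The names `fit_stump`, `gradient_boost`, and `mse` are hypothetical helpers written for this example.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs: one threshold, two leaf means."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so the right leaf is never empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    if best is None:  # all inputs identical: fall back to a constant learner
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def gradient_boost(xs, ys, n_rounds=50, eta=0.1):
    """Gradient boosting for squared error loss with stumps as weak learners."""
    f0 = sum(ys) / len(ys)                 # F_0: constant minimizing squared error
    preds = [f0] * len(xs)
    ensemble = []                          # pairs (eta * gamma_m, h_m)
    for _ in range(n_rounds):
        # Step 1: pseudo-residuals (negative gradient of squared error)
        residuals = [y - p for y, p in zip(ys, preds)]
        # Step 2: fit a weak learner to the pseudo-residuals
        h = fit_stump(xs, residuals)
        hx = [h(x) for x in xs]
        # Step 3: line search; for squared error it has the closed form
        # gamma = sum(r * h) / sum(h^2)
        denom = sum(v * v for v in hx)
        gamma = sum(r * v for r, v in zip(residuals, hx)) / denom if denom > 0 else 0.0
        # Step 4: shrunken update of the training predictions
        preds = [p + eta * gamma * v for p, v in zip(preds, hx)]
        ensemble.append((eta * gamma, h))
    return lambda x: f0 + sum(w * h(x) for w, h in ensemble)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data (hypothetical): y = x^2 on ten points
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
mean_y = sum(ys) / len(ys)

baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted = gradient_boost(xs, ys)
boosted_mse = mse(boosted, xs, ys)
print(baseline_mse, boosted_mse)
```

For other losses the closed-form line search in step 3 no longer applies, and $\gamma_m$ would have to be found numerically or per leaf.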

---

# 🔹 5. Special Cases

* **Squared Error Loss** (Regression):

  $$
  L(y, F(x)) = \frac{1}{2}(y - F(x))^2
  $$

  Negative gradient (pseudo-residual):

  $$
  r_{im} = y_i - F_{m-1}(x_i) \quad (\text{residuals!})
  $$

  👉 Gradient Boosting reduces to "fit trees to residuals".

* **Logistic Loss** (Classification, labels $y \in \{-1, +1\}$):

  $$
  L(y, F(x)) = \log(1 + e^{-yF(x)})
  $$

  Negative gradient (pseudo-residual):

  $$
  r_{im} = \frac{y_i}{1 + e^{y_i F_{m-1}(x_i)}}
  $$

  👉 Trees are fit to these residual-like terms.
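Both pseudo-residual formulas can be sanity-checked numerically. The sketch below compares each closed form with a central finite-difference estimate of the negative gradient $-\partial L / \partial F$. It assumes labels $y \in \{-1, +1\}$ for the logistic loss, and the function names are made up for this example.

```python
import math


def squared_loss(y, F):
    return 0.5 * (y - F) ** 2


def squared_pseudo_residual(y, F):
    return y - F


def logistic_loss(y, F):
    return math.log(1.0 + math.exp(-y * F))


def logistic_pseudo_residual(y, F):
    return y / (1.0 + math.exp(y * F))


def numeric_negative_gradient(loss, y, F, eps=1e-6):
    """Central finite-difference estimate of -dL/dF."""
    return -(loss(y, F + eps) - loss(y, F - eps)) / (2.0 * eps)


points = [(1.0, 0.3), (-1.0, 0.3), (1.0, -2.0), (-1.0, 1.5)]

max_err_squared = max(
    abs(numeric_negative_gradient(squared_loss, y, F) - squared_pseudo_residual(y, F))
    for y, F in points
)
max_err_logistic = max(
    abs(numeric_negative_gradient(logistic_loss, y, F) - logistic_pseudo_residual(y, F))
    for y, F in points
)
print(max_err_squared, max_err_logistic)
```

Both errors should be tiny. For the logistic loss, the pseudo-residual has the same sign as the label and shrinks toward zero as the margin $y F(x)$ grows, so confidently correct examples contribute little to the next tree.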

---

# 🔹 6. Intuition in Words

* Gradient Boosting = **Gradient Descent in function space**.
* Each weak learner = an approximate "step in the negative gradient direction".
* Learning rate $\eta$ = shrinkage applied to every step, on top of the line-search weight $\gamma_m$.
* Pseudo-residuals = "errors the model still needs to fix".

---

✅ So, Gradient Boosting = iteratively fitting weak learners to the **negative gradient of the loss function** with respect to the current predictions.
