
## 1. Objective Function

At iteration $t$, the model prediction is:

$$
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)
$$

where $f_t$ is the new decision tree to be added.

The overall objective is:

$$
Obj^{(t)} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{k=1}^t \Omega(f_k)
$$

* $l$: loss function (e.g., log loss, MSE)
* $\Omega(f)$: regularization term that penalizes tree complexity, defined as

$$
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2
$$

where

* $T$: number of leaves
* $w_j$: weight of leaf $j$
* $\gamma, \lambda$: regularization parameters
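
As a minimal sketch of the regularization term above (the function name `tree_complexity` and the parameter values are hypothetical, chosen only for illustration), the penalty for a single tree can be computed directly from its leaf weights:

```python
def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2 for one tree."""
    num_leaves = len(leaf_weights)                      # T
    l2_penalty = 0.5 * lam * sum(w * w for w in leaf_weights)
    return gamma * num_leaves + l2_penalty


# Three leaves: gamma * 3 = 3.0, plus 0.5 * 2.0 * (0.25 + 0.25 + 1.0) = 1.5
omega = tree_complexity([0.5, -0.5, 1.0], gamma=1.0, lam=2.0)
print(omega)  # 4.5
```

Adding a leaf always costs $\gamma$, while large leaf weights are penalized quadratically through $\lambda$.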

---

## 2. Second-Order Taylor Expansion

Expand the loss around the previous prediction $\hat{y}_i^{(t-1)}$:

$$
l(y_i, \hat{y}_i^{(t)}) \approx l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2
$$

where both derivatives are evaluated at the previous prediction:

$$
g_i = \frac{\partial l(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}, \quad
h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}
$$

Thus, the loss enters the optimization of $f_t$ only through its **gradients (first derivatives)** $g_i$ and **Hessians (second derivatives)** $h_i$, so any twice-differentiable loss can be plugged in.
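
To make $g_i$ and $h_i$ concrete, here is a small sketch for the two losses mentioned earlier. It assumes squared error is scaled by $\frac{1}{2}$ and that, for log loss, the prediction is a raw margin passed through a sigmoid; the function names are hypothetical.

```python
import math


def squared_error_grad_hess(y, y_hat):
    """Assumes l = 0.5 * (y - y_hat)^2, so g = y_hat - y and h = 1."""
    return y_hat - y, 1.0


def log_loss_grad_hess(y, margin):
    """Assumes l = -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(margin).

    Differentiating with respect to the margin gives g = p - y, h = p * (1 - p).
    """
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - y, p * (1.0 - p)


print(squared_error_grad_hess(3.0, 2.5))  # (-0.5, 1.0)
print(log_loss_grad_hess(1.0, 0.0))       # (-0.5, 0.25)
```

In both cases a negative gradient means the prediction is too low, so the new tree should push it upward.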

---

## 3. Simplified Objective at Iteration $t$

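Dropping the terms that are fixed at iteration $t$ (the previous loss $l(y_i, \hat{y}_i^{(t-1)})$ and the regularization of earlier trees), and writing $f_t(x_i) = w_j$ for every sample $i$ that falls in leaf $j$, the objective can be grouped by leaf:

$$
Obj^{(t)} \approx \sum_{j=1}^T \left[ G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \right] + \gamma T
$$

where $I_j$ is the set of samples assigned to leaf $j$ and: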

* $G_j = \sum_{i \in I_j} g_i$
* $H_j = \sum_{i \in I_j} h_i$
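
The per-leaf sums can be sketched as a simple aggregation. This assumes each sample's leaf assignment is already known as an integer index; `leaf_sums` and `leaf_index` are hypothetical names used only for illustration.

```python
from collections import defaultdict


def leaf_sums(leaf_index, grads, hessians):
    """Accumulate G_j and H_j over the samples that fall in each leaf j."""
    G = defaultdict(float)
    H = defaultdict(float)
    for j, g, h in zip(leaf_index, grads, hessians):
        G[j] += g
        H[j] += h
    return dict(G), dict(H)


# Samples 0 and 1 fall in leaf 0, sample 2 falls in leaf 1.
G, H = leaf_sums([0, 0, 1], [1.0, -0.5, 2.0], [1.0, 1.0, 0.5])
print(G)  # {0: 0.5, 1: 2.0}
print(H)  # {0: 2.0, 1: 0.5}
```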

---

## 4. Optimal Leaf Weight

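For each leaf, the objective is quadratic in $w_j$. Setting its derivative $G_j + (H_j + \lambda) w_j$ to zero gives the optimal weight (a minimum provided $H_j + \lambda > 0$):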

$$
w_j^* = -\frac{G_j}{H_j + \lambda}
$$
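
A quick numerical check of this formula, as a sketch with made-up values for $G_j$, $H_j$, and $\lambda$ (the function names are hypothetical): the per-leaf objective evaluated at $w_j^*$ should be lower than at nearby weights.

```python
def leaf_objective(w, G, H, lam):
    """Per-leaf term of the simplified objective: G*w + 0.5*(H + lambda)*w^2."""
    return G * w + 0.5 * (H + lam) * w * w


def optimal_leaf_weight(G, H, lam):
    """Closed-form minimizer w* = -G / (H + lambda); assumes H + lambda > 0."""
    return -G / (H + lam)


G, H, lam = -4.0, 3.0, 1.0
w_star = optimal_leaf_weight(G, H, lam)
print(w_star)                             # 1.0
print(leaf_objective(w_star, G, H, lam))  # -2.0

# Moving away from w* in either direction increases the objective.
assert leaf_objective(w_star - 0.1, G, H, lam) > leaf_objective(w_star, G, H, lam)
assert leaf_objective(w_star + 0.1, G, H, lam) > leaf_objective(w_star, G, H, lam)
```

Note that a larger $\lambda$ shrinks $w_j^*$ toward zero, which is how the L2 penalty damps each tree's contribution.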

---

## 5. Optimal Value of a Tree

Substituting $w_j^*$ back into the simplified objective gives the best achievable value for a fixed tree structure:

$$
Obj^{(t)} = -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j + \lambda} + \gamma T
$$

This value scores a tree structure (lower is better), so comparing it before and after a candidate split tells us whether the split improves the objective.
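
As an illustrative sketch, the score can be computed from the per-leaf sums alone. The name `structure_score` and the numbers are assumptions made for this example.

```python
def structure_score(G_list, H_list, lam, gamma):
    """-0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T, with one entry per leaf."""
    num_leaves = len(G_list)
    fit_term = sum(G * G / (H + lam) for G, H in zip(G_list, H_list))
    return -0.5 * fit_term + gamma * num_leaves


# Two leaves: -0.5 * (16/4 + 4/2) + 0.5 * 2 = -3.0 + 1.0
print(structure_score([-4.0, 2.0], [3.0, 1.0], lam=1.0, gamma=0.5))  # -2.0
```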

---

## 6. Split Gain (Decision Rule)

When splitting a node into left ($L$) and right ($R$) children, the gain is the reduction in the tree score from Section 5:

$$
Gain = \frac{1}{2} \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right) - \gamma
$$

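If **Gain > 0**, the split lowers the regularized objective and is worth making. Because each split adds one leaf, $\gamma$ acts as the minimum improvement a split must deliver.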
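
The decision rule can be sketched as an exhaustive scan over one feature. This is a simplified illustration, not a description of any particular library's implementation: the function names are hypothetical, and placing the threshold at the midpoint between neighboring values is an assumption.

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain formula from this section, using sums over the left and right children."""
    def score(G, H):
        return G * G / (H + lam)

    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma


def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan sorted feature values and return (best_gain, threshold).

    Returns (0.0, None) when no candidate split has positive gain.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    G_total, H_total = sum(g), sum(h)
    G_L = H_L = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[nxt]:
            continue  # cannot place a threshold between equal values
        gain = split_gain(G_L, H_L, G_total - G_L, H_total - H_L, lam, gamma)
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (x[i] + x[nxt])  # midpoint (assumption)
    return best_gain, best_threshold


gain, threshold = best_split(
    x=[1.0, 2.0, 3.0, 4.0],
    g=[-1.0, -1.0, 1.0, 1.0],
    h=[1.0, 1.0, 1.0, 1.0],
)
print(threshold)  # 2.5
```

In this toy data the gradients change sign between $x = 2$ and $x = 3$, so the scan separates the two groups there. Raising `gamma` above the best gain would make the function report no useful split.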

---

## Intuition in Words

* Each step builds a tree that minimizes the **second-order approximation** of the loss.
* Gradients push predictions in the direction that reduces the loss.
* Hessians scale the step size: leaves with high curvature get smaller weights, which keeps updates stable.
* Regularization ($\lambda, \gamma$) penalizes complex trees, improving generalization.
* The final model is the sum of all trees built this way.

