# 🔹 Assumptions in Gradient Boosting

### 1. **Additive Model Assumption**

* The final predictor can be written as:

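  $$
  F_M(x) = F_0(x) + \sum_{m=1}^M \nu \cdot h_m(x)
  $$

  where $F_0(x)$ is an initial constant prediction, $h_m(x)$ are weak learners (often decision trees), and $\nu \in (0, 1]$ is the learning rate (shrinkage).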
* We assume that **incrementally adding weak models improves performance** → i.e., the target function is approximable by an additive expansion of weak learners.
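The additive expansion can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes squared loss, 1-D inputs, and decision stumps as weak learners, and the names `fit_stump`, `boost`, `predict`, and `mse` are hypothetical helpers introduced here.

```python
def fit_stump(xs, residuals):
    """Least-squares decision stump on 1-D inputs.

    Returns (threshold, left_value, right_value).
    Assumes xs contains at least two distinct values.
    """
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2.0  # candidate split between neighbouring points
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1], best[2], best[3]


def stump_predict(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv


def boost(xs, ys, n_rounds=50, nu=0.1):
    """Build F_M(x) = F_0 + sum_m nu * h_m(x) for squared loss."""
    f0 = sum(ys) / len(ys)  # initial constant prediction
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + nu * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return f0, nu, stumps


def predict(model, x):
    f0, nu, stumps = model
    return f0 + nu * sum(stump_predict(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: y = x^2 on a small grid.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
baseline = boost(xs, ys, n_rounds=0)   # only the constant F_0
model = boost(xs, ys, n_rounds=50, nu=0.1)
print("baseline MSE:", mse(baseline, xs, ys))
print("boosted MSE: ", mse(model, xs, ys))
```

Each stump on its own is a crude two-level function; the sum of many shrunken stumps is what tracks the curve.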

---

### 2. **Weak Learner Assumption**

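* Base learners (usually shallow trees) should be **slightly better than random guessing**; in gradient boosting terms, each one should reduce the training loss by at least a small amount.
* If weak learners are too strong (deep trees), GBM can overfit within a few iterations. If they are too weak (no better than random predictions), each round makes no progress and the loss stops decreasing.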

---

### 3. **Gradient Approximation Assumption**

* Boosting fits new learners to the **negative gradient of the loss function**.
* This assumes the approximation step is valid → i.e., weak learners can **reasonably approximate the gradient direction** in function space.
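To make the approximation step concrete, here is a short worked derivation. The symbols $L$ (loss), $r_{im}$ (pseudo-residual of sample $i$ at round $m$), $n$ (number of training samples), and $\nu$ (learning rate) are notation introduced here for illustration. At round $m$, the pseudo-residuals are the negative gradient of the loss evaluated at the current model:

$$
r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \qquad i = 1, \dots, n
$$

For squared loss $L(y, F) = \tfrac{1}{2}(y - F)^2$, this reduces to the ordinary residual, $r_{im} = y_i - F_{m-1}(x_i)$. The weak learner is then fit to these values by least squares and added with shrinkage:

$$
h_m = \arg\min_{h} \sum_{i=1}^{n} \left( r_{im} - h(x_i) \right)^2, \qquad F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)
$$

The assumption is that $h_m$ is close enough to $r_{im}$ on the training points for this step to decrease the loss.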

---

### 4. **Independent & Identically Distributed (i.i.d.) Data**

* Standard ML assumption: training samples are **independent** and come from the same distribution as test data.
* If the data distribution shifts (non-stationary data), a model that fits the training set well may still perform poorly on new data.

---

### 5. **Loss Function is Differentiable**

* Gradient Boosting assumes the **loss function is differentiable** with respect to the predictions, so we can compute gradients.

  * For regression: MSE, Huber loss, and MAE (MAE is not differentiable at a zero residual, so a subgradient, the sign of the residual, is used there).
  * For classification: Logistic loss, exponential loss.
* (There are also workarounds for non-differentiable objectives, such as ranking metrics in ranking problems.)
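As a rough illustration, the negative gradients (pseudo-residuals) for the losses listed above can be written as small functions. The function names and the `delta` parameter are hypothetical choices for this sketch; the logistic version assumes labels in {0, 1} and a prediction `f` on the log-odds scale.

```python
import math


def neg_grad_squared(y, f):
    # L = 0.5 * (y - f)^2  ->  -dL/df = y - f
    return y - f


def neg_grad_absolute(y, f):
    # L = |y - f|  ->  -dL/df = sign(y - f); 0 is used as a subgradient at the kink
    r = y - f
    return (r > 0) - (r < 0)


def neg_grad_huber(y, f, delta=1.0):
    # Quadratic for small residuals, linear beyond delta, so the gradient is clipped
    r = y - f
    if abs(r) <= delta:
        return r
    return delta if r > 0 else -delta


def neg_grad_logistic(y, f):
    # L = log(1 + exp(f)) - y * f  ->  -dL/df = y - sigmoid(f)
    return y - 1.0 / (1.0 + math.exp(-f))


for name, g in [("squared", neg_grad_squared),
                ("absolute", neg_grad_absolute),
                ("huber", neg_grad_huber)]:
    print(name, [g(y, 0.0) for y in (-5.0, -0.5, 0.5, 5.0)])
```

Note how the squared-loss gradient grows with the residual, while the absolute and Huber gradients are bounded, which is why the latter two are less sensitive to outliers.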

---

### 6. **Bias–Variance Tradeoff Assumption**

* Boosting assumes sequential corrections (reducing bias) improve generalization without exploding variance.
* Hence learning rate, tree depth, and number of iterations must balance bias vs variance.
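One common way to balance the number of iterations against variance is to track error on held-out data and keep the round where it is lowest. The sketch below assumes squared loss, 1-D inputs, stumps as weak learners, and synthetic noisy data; `fit_stump`, `boost_with_history`, and the chosen constants are hypothetical and only meant to illustrate the idea.

```python
import math
import random


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Least-squares stump on 1-D inputs (needs at least two distinct x values)."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1], best[2], best[3]


def boost_with_history(train, valid, n_rounds, nu):
    """Run boosting and record (round, train MSE, validation MSE) after each round."""
    (xs, ys), (vx, vy) = train, valid
    f0 = sum(ys) / len(ys)
    p_tr, p_va = [f0] * len(xs), [f0] * len(vx)
    history = []
    for m in range(1, n_rounds + 1):
        residuals = [y - p for y, p in zip(ys, p_tr)]
        t, lv, rv = fit_stump(xs, residuals)
        p_tr = [p + nu * (lv if x <= t else rv) for p, x in zip(p_tr, xs)]
        p_va = [p + nu * (lv if x <= t else rv) for p, x in zip(p_va, vx)]
        history.append((m, mse(ys, p_tr), mse(vy, p_va)))
    return history


# Synthetic data: a sine curve with Gaussian label noise.
random.seed(0)
xs = [random.uniform(0.0, 6.0) for _ in range(80)]
ys = [math.sin(x) + random.gauss(0.0, 0.3) for x in xs]
train, valid = (xs[:60], ys[:60]), (xs[60:], ys[60:])

n_rounds = 150
history = boost_with_history(train, valid, n_rounds=n_rounds, nu=0.1)
best_m = min(history, key=lambda h: h[2])[0]  # round with lowest validation MSE
print("best number of rounds:", best_m)
print("train/valid MSE at best round:", history[best_m - 1][1:])
print("train/valid MSE at last round:", history[-1][1:])
```

Training error keeps falling as rounds are added, so it cannot be used to pick the stopping point; the validation curve is the signal for where extra rounds start adding variance rather than removing bias.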

---

### 7. **Noisy Data Sensitivity Assumption**

* Gradient Boosting assumes **errors are learnable**.
* With heavy **label noise**, boosting tends to overfit, because each round keeps fitting the largest remaining errors even when they are due to noise rather than signal.

---

# ✅ Summary

Gradient Boosting assumes:

1. The target function can be approximated by an additive model.
2. Weak learners are slightly better than random.
3. Gradients can be reasonably approximated by trees.
4. Data is i.i.d.
5. Loss is differentiable.
6. Bias–variance tradeoff can be tuned.
7. Errors are meaningful (not pure noise).

