```{contents}
```

# Assumptions

XGBoost does not have strict **statistical assumptions** like linear regression, but it inherits assumptions from **decision trees** and **boosting frameworks**. These assumptions explain when and why XGBoost works well.

---

## 1. Additive Model Assumption

* XGBoost assumes the true function can be **approximated as a sum of weak learners (trees)**:

  $$
  \hat{y}_i = \sum_{m=1}^M f_m(x_i)
  $$
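* Each new tree is fit to the negative gradient of the loss with respect to the current ensemble's predictions; for squared error this is exactly the residual error of the previous ensemble.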
* Works best if the relationship between features and target is **non-linear and can be captured by recursive splits**.
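
The additive structure can be illustrated with a minimal sketch in plain Python. It is not XGBoost's implementation: it assumes 1-D inputs, squared error, and depth-1 trees (stumps), and the helper names (`fit_stump`, `boost`, `predict`) are hypothetical. Each $f_m$ here is a stump scaled by a learning rate.

```python
# Toy gradient boosting with decision stumps on 1-D data (squared error).
# Illustrative only: XGBoost uses deeper trees, Hessians, and regularization.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    # Skip the smallest value so that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x < threshold else right_value


def predict(stumps, x, learning_rate=0.5):
    # The ensemble prediction is the sum of the (shrunken) tree outputs.
    return sum(learning_rate * stump_predict(s, x) for s in stumps)


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    stumps = []
    preds = [0.0] * len(xs)
    for _ in range(n_trees):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return stumps


def mse(stumps, xs, ys):
    return sum((y - predict(stumps, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 10 for i in range(30)]
ys = [x ** 2 for x in xs]          # a non-linear target
one_tree = boost(xs, ys, n_trees=1)
many_trees = boost(xs, ys, n_trees=20)
print(mse(one_tree, xs, ys), mse(many_trees, xs, ys))
```

Adding trees lowers the training error because each stump is fit to what the current sum still gets wrong.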

---

## 2. Differentiable Loss Function

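* XGBoost optimizes a second-order approximation of the loss, so it assumes the **loss function is twice differentiable** and computes **gradients** and **Hessians** of the loss with respect to the predictions.
* Examples:

  * Regression → squared error, pseudo-Huber.
  * Classification → logistic loss, softmax cross-entropy.
* Losses that are not smooth everywhere, such as MAE, cannot be optimized this way directly. They need a smooth surrogate (for example pseudo-Huber) or special handling; recent XGBoost versions provide an absolute-error objective that refreshes leaf values after each tree is built.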
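
As a rough illustration of what "gradients and Hessians" means here, the sketch below writes them out for squared error and logistic loss and checks one gradient numerically. The function names are hypothetical. The leaf-weight formula $-G/(H+\lambda)$ is the standard second-order result with L2 regularization only, shown under the assumption $\alpha = 0$.

```python
import math


def squared_error_grad_hess(y, pred):
    # Loss: 0.5 * (pred - y)^2, differentiated with respect to pred.
    return pred - y, 1.0


def logistic_loss(y, margin):
    # y is 0 or 1; margin is the raw (pre-sigmoid) score.
    p = 1.0 / (1.0 + math.exp(-margin))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def logistic_grad_hess(y, margin):
    # Derivatives of the logistic loss with respect to the margin.
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - y, p * (1.0 - p)


def numeric_grad(loss, y, z, eps=1e-5):
    # Central finite difference, used only to sanity-check the formulas.
    return (loss(y, z + eps) - loss(y, z - eps)) / (2 * eps)


def optimal_leaf_weight(grads, hessians, reg_lambda=1.0):
    # Leaf value minimizing the second-order approximation: -G / (H + lambda).
    return -sum(grads) / (sum(hessians) + reg_lambda)


grad, hess = logistic_grad_hess(1, 0.3)
print(grad, numeric_grad(logistic_loss, 1, 0.3))
```

A loss with no usable second derivative leaves the denominator of the leaf weight uninformative, which is why non-smooth losses need separate treatment.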

---

## 3. Independent and Identically Distributed (i.i.d.) Data

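* Assumes training examples are **independent and drawn from the same distribution** as the test data.
* Violations, such as time-series data that is split randomly instead of chronologically, or domain shift between training and deployment, can inflate validation scores and reduce real-world performance.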
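
For time-ordered data, one common way to respect this assumption during evaluation is to split chronologically rather than at random. The sketch below assumes each row is a dict with a timestamp field; `chronological_split` and the field name `"t"` are hypothetical.

```python
def chronological_split(rows, time_key, test_fraction=0.2):
    """Put the most recent rows in the test set so the model never sees the future."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


rows = [{"t": t, "y": 2 * t} for t in [5, 1, 4, 2, 3, 9, 7, 8, 6, 10]]
train, test = chronological_split(rows, "t")
print([r["t"] for r in train], [r["t"] for r in test])
```

Every training timestamp precedes every test timestamp, which mimics how the model would be used after deployment.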

---

## 4. Weak Learner Assumption

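* Each individual tree is a **weak learner**: a relatively shallow tree with high bias.
* XGBoost assumes that boosting many weak learners produces a **strong learner**. Boosting mainly reduces bias; variance is kept in check by shrinkage (the learning rate), subsampling, and regularization.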

---

## 5. No Multicollinearity Requirement

* Unlike linear regression, XGBoost does **not assume features are independent**.
* However, highly correlated features can reduce interpretability: importance is shared among them, so each one may look less important than the signal they jointly carry.

---

## 6. Handling Missing Data

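* Assumes missing values can be assigned to a **default split direction** in trees.
* XGBoost automatically learns the best default branch for each split, so missing values need no imputation. This helps only if the pattern of missingness in training resembles the pattern at prediction time.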
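
The idea of a learned default direction can be sketched for a single candidate split: send the missing rows left, then right, and keep whichever gives the larger gain. This is a simplified illustration rather than XGBoost's actual code; the function names are hypothetical and the gain uses the usual second-order formula with L2 regularization.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(left, right, reg_lambda=1.0, gamma=0.0):
    """left and right are lists of (gradient, hessian) pairs."""
    gl, hl = sum(g for g, _ in left), sum(h for _, h in left)
    gr, hr = sum(g for g, _ in right), sum(h for _, h in right)
    children = leaf_score(gl, hl, reg_lambda) + leaf_score(gr, hr, reg_lambda)
    parent = leaf_score(gl + gr, hl + hr, reg_lambda)
    return 0.5 * (children - parent) - gamma


def best_default_direction(values, grads, hessians, threshold, reg_lambda=1.0):
    left, right, missing = [], [], []
    for v, g, h in zip(values, grads, hessians):
        if v is None:                 # None stands in for a missing value
            missing.append((g, h))
        elif v < threshold:
            left.append((g, h))
        else:
            right.append((g, h))
    gain_left = split_gain(left + missing, right, reg_lambda)
    gain_right = split_gain(left, right + missing, reg_lambda)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right


# Squared error with predictions at 0, so gradient = -y and Hessian = 1.
values = [1, 2, None, 8, 9, None]
targets = [1, 1, 5, 5, 5, 5]
grads = [-y for y in targets]
hessians = [1.0] * len(targets)
direction, gain = best_default_direction(values, grads, hessians, threshold=5)
print(direction, gain)
```

In this toy data the rows with missing values have targets like those of the right-hand rows, so routing them right yields the larger gain.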

---

## 7. Complexity vs. Generalization

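* Assumes a trade-off between tree complexity and generalization: more complex trees fit the training data better but are more likely to overfit.
* For this reason the objective function includes regularization terms: $\lambda$ (L2 penalty on leaf weights), $\alpha$ (L1 penalty on leaf weights), and $\gamma$ (a penalty per leaf, which acts as the minimum loss reduction required to make a split).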
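
For reference, the regularized objective is commonly written as follows. The symbols $l$ (per-example loss), $T$ (number of leaves in a tree), and $w_j$ (weight of leaf $j$) do not appear elsewhere on this page and are introduced here only for illustration:

$$
\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
$$

Larger values of $\lambda$ and $\alpha$ shrink leaf weights toward zero, while a larger $\gamma$ makes each additional leaf more expensive and so prunes splits with small gains.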

---

**Summary**

1. Target can be modeled as additive trees.
2. Loss function must be differentiable (twice, so that gradients and Hessians exist).
3. Data is i.i.d.
4. Weak learners combined form a strong model.
5. No strict assumptions on feature independence.
6. Missing values can be handled by default splits.
7. Model complexity must be controlled to prevent overfitting.

