# XGBoost

XGBoost is regularised, second-order gradient boosting with trees, engineered for scale.

Conceptually:

- Same additive model as Gradient Boosting
- Same idea: fit corrections to the current prediction
- Uses a 2nd-order Taylor expansion of the loss
- Explicit regularisation
- Optimizes tree structure and leaf values jointly

We build an additive model:

$$
\hat{y}_i = F(x_i) = \sum_{t=1}^{T} f_t(x_i)
$$

Each $f_t$ is a tree, and each tree maps an input to a leaf value.
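As a minimal sketch of the additive model, each tree can be treated as a function that returns a leaf value, and the prediction is the sum over trees. The `stump` helper, the thresholds and the leaf values are made up for illustration; real XGBoost trees are learned from data.

```python
def stump(threshold, left_value, right_value):
    """Hypothetical one-split tree: maps a scalar input to one of two leaf values."""
    def tree(x):
        return left_value if x < threshold else right_value
    return tree


def predict(x, trees):
    """F(x) = sum over t of f_t(x)."""
    return sum(tree(x) for tree in trees)


# Two hand-written trees standing in for f_1 and f_2.
trees = [stump(2.0, -1.0, 1.0), stump(5.0, -0.5, 0.5)]

low = predict(1.0, trees)    # -1.0 + -0.5
mid = predict(3.0, trees)    #  1.0 + -0.5
high = predict(6.0, trees)   #  1.0 +  0.5
```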

The XGBoost objective is as follows:

$$
\begin{align*}\mathcal{L}&=\sum_{i=1}^n L(y_i, \hat{y}_i)+\sum_{t=1}^T \Omega(f_t)\end{align*}
$$

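$$
\begin{align*}\Omega(f)&=\gamma J+\frac{1}{2}\lambda \sum_{j=1}^{J} w_j^2\end{align*}
$$

Here $J$ is the number of leaves in tree $f$ (not to be confused with $T$, the number of trees) and $w_j$ is the value of leaf $j$. $\gamma$ is the penalty per leaf (tree complexity) and $\lambda$ is the L2 penalty on leaf weights. This regularisation term is a major difference from vanilla GBM, whose objective has no such explicit penalty.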

This penalizes having too many leaves (which discourages deep trees) and large leaf values (which prevents aggressive corrections). In short: do not add complexity unless it helps enough to justify it.
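The penalty for a single tree can be computed directly from its leaf values. The sketch below assumes a tree is represented only by a list of leaf values; the function name and the example values are made up for illustration.

```python
def tree_complexity(leaf_values, gamma, lam):
    """Omega(f) = gamma * (number of leaves) + 0.5 * lambda * sum of squared leaf values."""
    n_leaves = len(leaf_values)
    l2_term = 0.5 * lam * sum(w * w for w in leaf_values)
    return gamma * n_leaves + l2_term


small_tree = [0.5, -0.5]              # two leaves, modest values
big_tree = [2.0, -1.5, 0.5, -0.5]     # four leaves, larger values

small_penalty = tree_complexity(small_tree, gamma=1.0, lam=1.0)  # 2 + 0.25
big_penalty = tree_complexity(big_tree, gamma=1.0, lam=1.0)      # 4 + 3.375
```

The bigger tree has to reduce the loss by correspondingly more to be worth its higher penalty.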

Learning is step-wise, one tree at a time. At iteration $t$:

$$
F^{(t)}(x) = F^{(t-1)}(x) + f_t(x)
$$

We want to choose $f_t$ to minimize:

$$
\begin{align*}\mathcal{L}^{(t)}&=\sum_i L\left(y_i, F^{(t-1)}(x_i) + f_t(x_i)\right)+\Omega(f_t)\end{align*}
$$

In other words: what small function should we add to improve the predictions? This is the gradient boosting logic.
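The second-order idea from the overview can be sketched as follows. The symbols $g_i$, $h_i$, $I_j$, $G_j$ and $H_j$ are notation introduced here for illustration, following the usual presentation of this derivation. Expanding the loss to second order around the current prediction gives approximately:

$$
\begin{align*}
\mathcal{L}^{(t)} &\approx \sum_i \left[ L\left(y_i, F^{(t-1)}(x_i)\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t) \\
g_i &= \left.\frac{\partial L(y_i, \hat{y})}{\partial \hat{y}}\right|_{\hat{y}=F^{(t-1)}(x_i)}, \qquad
h_i = \left.\frac{\partial^2 L(y_i, \hat{y})}{\partial \hat{y}^2}\right|_{\hat{y}=F^{(t-1)}(x_i)}
\end{align*}
$$

The first term inside the sum does not depend on $f_t$, so it can be ignored when choosing the tree. If $I_j$ is the set of rows that fall in leaf $j$, with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, then for a fixed tree structure the best value for leaf $j$ and that leaf's contribution to the objective are:

$$
w_j^{*} = -\frac{G_j}{H_j+\lambda}, \qquad -\frac{1}{2}\frac{G_j^2}{H_j+\lambda} + \gamma
$$

For the squared-error loss $L = \frac{1}{2}(y_i - \hat{y}_i)^2$, $g_i$ is minus the residual and $h_i = 1$, so $w_j^{*}$ is the sum of the residuals in the leaf divided by the number of residuals plus $\lambda$, and $G_j^2 / (H_j + \lambda)$ is the similarity score used in the notes below.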

---

**XGBoost - StatQuest Notes - Regression**

- We start with an initial prediction (0.5), regardless of whether the task is regression or classification.
- We compute the residuals and fit a regression tree (an XGBoost tree) to them.
- How do we build an XGBoost tree for regression?

Each tree starts as a single leaf that contains all the residuals. For that leaf we compute a similarity score:

$$
\text{Similarity Score} = \frac{(\text{Sum of Residuals})^2}{\text{Number of Residuals}+\lambda}
$$

We then try to split the residuals into two groups so that similar residuals end up together.

How much better do the leaves cluster similar residuals than the root node? We quantify this by calculating the Gain of splitting the residuals into two groups.

$$
\text{Gain} = (\text{Left}_{similarity} + \text{Right}_{similarity}) - \text{Root}_{similarity}
$$
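The similarity score and Gain formulas above can be checked with a few lines of code. This is a sketch with made-up residuals and one candidate split chosen by hand; a real tree would evaluate many candidate thresholds and keep the one with the largest Gain.

```python
def similarity_score(residuals, lam=0.0):
    """(sum of residuals)^2 / (number of residuals + lambda)."""
    total = sum(residuals)
    return total * total / (len(residuals) + lam)


def gain(left, right, lam=0.0):
    """Left similarity + right similarity - root similarity."""
    root = left + right  # the root holds all residuals from both children
    return (similarity_score(left, lam)
            + similarity_score(right, lam)
            - similarity_score(root, lam))


# Illustrative residuals, split by hand into two groups.
left = [-10.5]
right = [6.5, 7.5, -7.5]

root_similarity = similarity_score(left + right)  # (-4)^2 / 4 = 4.0
split_gain = gain(left, right)                    # 110.25 + 14.083... - 4.0
```

Residuals with opposite signs cancel in the sum, which is why the mixed root scores low and the split that separates them scores high.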

We pick a parameter called gamma ($\gamma$) and compare it with the Gain. If $\text{Gain}-\gamma$ is negative, we remove the branch; this is how we prune the tree. If we make $\gamma$ very large, the pruning becomes extreme and we may remove every branch, leaving only the initial prediction.
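The pruning rule is a single comparison. In this sketch the function name `keep_split` and the Gain value are illustrative assumptions.

```python
def keep_split(gain, gamma):
    """A branch survives pruning unless Gain - gamma is negative."""
    return gain - gamma >= 0


example_gain = 120.33  # illustrative Gain for one branch

kept_with_small_gamma = keep_split(example_gain, gamma=100.0)    # 20.33 is not negative
kept_with_large_gamma = keep_split(example_gain, gamma=130.0)    # -9.67 is negative
```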

Lambda ($\lambda$) is a regularisation parameter: it is intended to reduce the prediction's sensitivity to individual observations. If $\lambda > 0$, the similarity scores are smaller, and the reduction is proportionally largest for leaves that contain few residuals.
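To see the effect of lambda on leaves of different sizes, the sketch below compares the similarity score with $\lambda = 1$ against $\lambda = 0$. The residual values and the helper `shrink_ratio` are made up for illustration.

```python
def similarity_score(residuals, lam=0.0):
    """(sum of residuals)^2 / (number of residuals + lambda)."""
    total = sum(residuals)
    return total * total / (len(residuals) + lam)


def shrink_ratio(residuals, lam):
    """Similarity with lambda divided by similarity without it."""
    return similarity_score(residuals, lam) / similarity_score(residuals, 0.0)


single_residual_leaf = [-10.5]
three_residual_leaf = [6.5, 7.5, -7.5]

ratio_small = shrink_ratio(single_residual_leaf, lam=1.0)  # 1 / (1 + 1) = 0.5
ratio_large = shrink_ratio(three_residual_leaf, lam=1.0)   # 3 / (3 + 1) = 0.75
```

The leaf with a single residual loses half of its score, while the leaf with three residuals loses only a quarter.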

$$
\text{Output Value} = \frac{\text{Sum of Residuals}}{\text{Number of Residuals}+\lambda}
$$

The output value equation is like the similarity score except we do not square the sum of the residuals. 

The default learning rate of XGBoost is 0.3. It is called eta and scales each tree's output before it is added to the prediction.
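For orientation, the quantities in these notes correspond to parameters of the XGBoost library. The dictionary below is a sketch that lists the parameter names with what are, to the best of my knowledge, their library defaults; `num_boost_round` is an arbitrary illustrative value, not a default.

```python
# Parameter names as used by the XGBoost library; values are assumed library defaults.
xgb_params = {
    "eta": 0.3,       # learning rate (alias: learning_rate)
    "gamma": 0.0,     # minimum Gain needed to keep a split (alias: min_split_loss)
    "lambda": 1.0,    # L2 penalty on leaf values (alias: reg_lambda)
    "max_depth": 6,   # maximum depth of each tree
}

num_boost_round = 100  # maximum number of trees (illustrative value)
```

A dictionary like this is typically passed to `xgboost.train` together with a `DMatrix` holding the training data.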

We keep building trees until the residuals are very small or we have reached the maximum number of trees.
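One boosting step for regression can be sketched as follows. The targets and the tree (given only as a leaf index per row) are hypothetical, and $\lambda = 0$ is used to keep the arithmetic easy to follow.

```python
def output_value(residuals, lam=0.0):
    """Leaf output: sum of residuals / (number of residuals + lambda)."""
    return sum(residuals) / (len(residuals) + lam)


initial_prediction = 0.5
eta = 0.3                         # learning rate
y = [-10.0, 7.0, 8.0, -7.0]       # made-up targets
leaf_of_row = [0, 1, 1, 1]        # hypothetical tree: row 0 -> leaf 0, the rest -> leaf 1

predictions = [initial_prediction] * len(y)
residuals = [yi - pi for yi, pi in zip(y, predictions)]

# One output value per leaf, computed from the residuals that land in it.
leaf_outputs = {}
for leaf in set(leaf_of_row):
    in_leaf = [r for r, l in zip(residuals, leaf_of_row) if l == leaf]
    leaf_outputs[leaf] = output_value(in_leaf)

# New prediction = previous prediction + learning rate * tree output.
new_predictions = [p + eta * leaf_outputs[l] for p, l in zip(predictions, leaf_of_row)]
new_residuals = [yi - pi for yi, pi in zip(y, new_predictions)]

old_sse = sum(r * r for r in residuals)
new_sse = sum(r * r for r in new_residuals)
```

In this toy example the sum of squared residuals drops after one tree, although an individual residual can still grow when its leaf is shared with rows that pull the other way.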

**XGBoost - StatQuest Notes - Classification**

We again start with an initial prediction of 0.5, now interpreted as a probability.

We have a new formula for similarity scores in classification.

$$
\text{Similarity Score} = \frac{(\text{Sum of Residuals})^2}{\sum \left[ \text{Previous Probability} \times (1-\text{Previous Probability})\right]+ \lambda}
$$

The sum in the denominator is called the cover:

$$
\text{Cover} = \sum \left[ \text{Previous Probability} \times (1-\text{Previous Probability})\right]
$$

In regression, cover is just the number of residuals in the leaf. 
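The classification formulas can be sketched in the same way as the regression ones. The labels below are made up, and every row starts from the initial probability 0.5.

```python
def cover(prev_probs):
    """Sum of p * (1 - p) over the rows in the leaf."""
    return sum(p * (1.0 - p) for p in prev_probs)


def similarity_score_clf(residuals, prev_probs, lam=0.0):
    """(sum of residuals)^2 / (cover + lambda)."""
    total = sum(residuals)
    return total * total / (cover(prev_probs) + lam)


labels = [0, 1, 1, 0]                       # made-up class labels
prev_probs = [0.5] * len(labels)            # initial prediction for every row
residuals = [y - p for y, p in zip(labels, prev_probs)]  # [-0.5, 0.5, 0.5, -0.5]

root_cover = cover(prev_probs)                                          # 4 * 0.25
root_similarity = similarity_score_clf(residuals, prev_probs)           # 0^2 / 1.0
left_similarity = similarity_score_clf(residuals[:1], prev_probs[:1])   # 0.25 / 0.25
```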

To make a new prediction, we first convert the previous probability into a log(odds) value and add the tree's output, scaled by the learning rate. This gives a new log(odds) value.

We then convert this log(odds) value back into a probability and use it to compute new residuals.
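The update for one row can be sketched as below. The leaf output formula used here (sum of residuals divided by cover plus lambda) is an assumption carried over from the similarity score, since the notes above do not state it for classification; the label and the single-row leaf are made up.

```python
import math


def log_odds(p):
    """Convert a probability into log(odds)."""
    return math.log(p / (1.0 - p))


def probability(z):
    """Convert log(odds) back into a probability (logistic function)."""
    return 1.0 / (1.0 + math.exp(-z))


def leaf_output(residuals, prev_probs, lam=0.0):
    # Assumed classification output value: sum of residuals / (cover + lambda).
    return sum(residuals) / (sum(p * (1.0 - p) for p in prev_probs) + lam)


prev_prob = 0.5
label = 1
eta = 0.3                                                # learning rate

residual = label - prev_prob                             # 0.5
tree_output = leaf_output([residual], [prev_prob])       # 0.5 / 0.25 = 2.0
new_log_odds = log_odds(prev_prob) + eta * tree_output   # 0.0 + 0.3 * 2.0
new_prob = probability(new_log_odds)
new_residual = label - new_prob
```

The new probability moves from 0.5 towards the label, so the new residual is smaller than the old one.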