This notebook rederives the tree-splitting criterion described in Section 2.2 of the [XGBoost](https://dl.acm.org/doi/10.1145/2939672.2939785) paper.

### Boosting model

The boosting model can be written as 

$$\hat y_i = \phi(\mathbf{x}_i) = \sum_{k=1}^K f_k(\mathbf{x}_i), \;\;\; f_k \in \mathcal{F}$$

where 

* $\mathbf{x}_i$ is the feature vector of the $i$-th example.
* $f_k$ is the $k$-th regression tree, i.e. a base (weak) learner. The index $k$ here plays the same role as the step index $t$ used below.
* $K$ is the number of trees.
* $\mathcal{F}$ is the space of regression trees, defined below.

Note that shrinkage is not introduced here; it first appears in Section 2.3 of the paper.

The space of regression trees is

$$\mathcal{F} = \{f(\mathbf{x}) = w_{q(\mathbf{x})}\}(q : \mathbb{R}^m \rightarrow T, \mathbf{w} \in \mathbb{R}^T)$$

where

* $q$ represents the structure of a regression tree that maps an example to the corresponding leaf index.
* $m$ is the dimension of the feature space.
* $T$ is the number of leaves in a tree.
* $\mathbf{w}$ is the vector of leaf weights (i.e. output of each leaf, aka. scores in the paper) of a regression tree with $T$ leaves
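To make the notation $f(\mathbf{x}) = w_{q(\mathbf{x})}$ concrete, here is a minimal sketch that represents a tree as a structure function `q` plus a weight vector `w`, and the ensemble as a sum of such trees. The helper names (`make_stump`, `predict`) and the toy data are made up for illustration, the trees are restricted to depth-1 stumps ($T = 2$), and leaf indices are 0-based rather than 1-based as in the text.

```python
import numpy as np

def make_stump(feature, threshold, weights):
    """Return f(x) = w[q(x)] for a depth-1 tree with T = 2 leaves (illustrative only)."""
    w = np.asarray(weights, dtype=float)

    def q(X):
        # Tree structure: map each row of X to a leaf index (0 = left, 1 = right).
        return (X[:, feature] >= threshold).astype(int)

    def f(X):
        # Tree output: look up the weight of the leaf each example falls into.
        return w[q(X)]

    return f

def predict(trees, X):
    """Ensemble prediction: the sum of the outputs of all K trees."""
    return sum(f(X) for f in trees)

X = np.array([[0.5, 1.0], [2.0, 0.0], [3.0, 3.0]])
trees = [
    make_stump(feature=0, threshold=1.0, weights=[-1.0, 1.0]),
    make_stump(feature=1, threshold=2.0, weights=[0.25, 0.5]),
]
y_hat = predict(trees, X)
print(y_hat)
```

Separating `q` from `w` mirrors the derivation that follows: the structure is held fixed while the weights are optimized.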

### Regularized loss function

\begin{align*}
\mathcal{L}(\phi) 
&= \sum_{i=1}^n l \left(y_i, \hat y_i \right) + \sum_k \Omega(f_k) \\
&= \sum_{i=1}^n l \left(y_i, \hat y_i \right) + \sum_k \left( \gamma T_k + \frac{1}{2}\lambda ||\mathbf{w}_k||^2 \right)
\end{align*}

where

* $l$ is a differentiable convex loss function.
* $n$ is the number of examples over which the loss is evaluated.
* $\Omega$ is a regularization term that decomposes into two parts:
 * $\gamma T_k$ penalizes the number of leaves, favoring smaller trees when the fit is otherwise the same.
 * $\frac{1}{2}\lambda||\mathbf{w}_k||^2$ penalizes large leaf weights, favoring trees with smaller-magnitude weights.
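As a quick illustration of how the two penalty terms combine, the sketch below evaluates $\Omega(f) = \gamma T + \frac{1}{2}\lambda||\mathbf{w}||^2$ for a single tree given only its leaf weights. The function name `omega` and the numbers are assumptions chosen for the example.

```python
import numpy as np

def omega(w, gamma, lam):
    """Regularization of one tree: gamma * T + 0.5 * lam * ||w||^2, where T = len(w)."""
    w = np.asarray(w, dtype=float)
    return gamma * w.size + 0.5 * lam * np.sum(w ** 2)

# A 2-leaf tree with small weights versus a 4-leaf tree with larger weights.
penalty_small = omega([0.5, -0.5], gamma=1.0, lam=2.0)
penalty_large = omega([0.5, -0.5, 1.0, -1.0], gamma=1.0, lam=2.0)
print(penalty_small, penalty_large)
```

Both more leaves and larger weights increase the penalty, so the second tree has to reduce the training loss by more to be preferred.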

### Boosting procedure

The boosting model (an ensemble of trees) is trained in an additive manner. At the $t$-th step, the boosting procedure tries to minimize the loss function $\mathcal{L}^{(t)}$:

\begin{align*}
\mathcal{L}^{(t)} = \sum_{i=1}^n l\left(y_i, \hat y_i^{(t - 1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t)
\end{align*}

where

* $\hat{y}_i^{(t-1)} = \sum_{k=1}^{t-1} f_k(\mathbf{x}_i)$, i.e. the prediction of the $i$-th example at step $t - 1$.
* $f_t$ is the tree to be learned at step $t$; the previously learned trees are left unchanged.

Next, we approximate $\mathcal{L}^{(t)}$ with a second-order [Taylor series](https://mathworld.wolfram.com/MaclaurinSeries.html) around $f_t(\mathbf{x}_i) = 0$ (i.e. a Maclaurin series):

(Recall that around $x = 0$, $f(x) \approx f(0) + \frac{f'(0)}{1!}x + \frac{f''(0)}{2!}x^2 + \cdots$)

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^n \left[ l\left(y_i, \hat y_i^{(t - 1)} \right ) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right ] + \Omega(f_t)$$

Here, $f_t(\mathbf{x}_i)$ as a whole is treated as the expansion variable, so

\begin{align*} 
g_i &= \frac{\partial l\left(y_i, \hat y_i^{(t - 1)} + f_t(\mathbf{x}_i)\right)}{\partial f_t(\mathbf{x}_i)}\Bigg|_{f_t(\mathbf{x}_i)=0} \\
h_i &= \frac{\partial^2 l\left(y_i, \hat y_i^{(t - 1)} + f_t(\mathbf{x}_i)\right)}{\partial f_t(\mathbf{x}_i)^2}\Bigg|_{f_t(\mathbf{x}_i)=0}
\end{align*}

*Personally, I find the paper's notation for $g_i$ and $h_i$ ($g_i = \partial_{\hat y_i^{(t - 1)}} l(y_i, \hat y^{(t - 1)})$, $h_i = \partial^2_{\hat y_i^{(t - 1)}} l(y_i, \hat y^{(t - 1)})$) not very clear. In either form, $g_i$ and $h_i$ are constants at step $t$, so we don't need to write them out in full in the following derivations.*
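For concreteness, here is a sketch of $g_i$ and $h_i$ for two common losses, evaluated at the previous prediction $\hat y_i^{(t-1)}$. The function names are illustrative. The squared loss is taken as $(\hat y - y)^2$ without a $\frac{1}{2}$ factor, and the logistic loss assumes labels in $\{0, 1\}$ with $\hat y$ treated as a raw score (logit); both are assumptions of this example rather than anything the paper prescribes.

```python
import numpy as np

def squared_loss_grad_hess(y, y_hat):
    # l = (y_hat - y)^2  ->  g = 2 * (y_hat - y),  h = 2
    g = 2.0 * (y_hat - y)
    h = np.full_like(g, 2.0)
    return g, h

def logistic_loss_grad_hess(y, y_hat):
    # l = -[y * log(p) + (1 - y) * log(1 - p)] with p = sigmoid(y_hat)
    # ->  g = p - y,  h = p * (1 - p)
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)

y = np.array([1.0, 0.0, 1.0])
y_hat_prev = np.array([0.0, 0.0, 2.0])  # predictions after step t - 1

g_sq, h_sq = squared_loss_grad_hess(y, y_hat_prev)
g_lg, h_lg = logistic_loss_grad_hess(y, y_hat_prev)
print(g_sq, h_sq)
print(g_lg, h_lg)
```

Everything after this point in the derivation depends on the loss only through these two arrays.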

In the Taylor expansion of $\mathcal{L}^{(t)}$, the first term $l\left(y_i, \hat y_i^{(t - 1)} \right )$ is a constant with respect to $f_t$, so it can be dropped, which leaves the simplified objective

\begin{align*}
\tilde{\mathcal{L}}^{(t)} 
&= \sum_{i=1}^n \left[ g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right ] + \Omega(f_t) \\
&= \sum_{i=1}^n \left[ g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right ] + \gamma T_t + \frac{1}{2} \lambda ||\mathbf{w}_t||^2
\end{align*}

#### Optimal weights for minimal loss

Next, with the loss function at step $t$, $\tilde{\mathcal{L}}^{(t)}$, in hand, we derive the optimal weights to assign to the leaves in order to minimize it.

Define $I_j = \{i \mid q(\mathbf{x}_i) = j \}$ as the instance set of leaf $j$, then

\begin{align*}
\tilde{\mathcal{L}}^{(t)} 
&= \sum_{j=1}^T \left[ \sum_{i \in I_j} g_i f_t(\mathbf{x}_i) + \frac{1}{2}\sum_{i \in I_j} h_i f_t^2(\mathbf{x}_i) \right ] + \gamma T_t + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2 \\
&= \sum_{j=1}^T \left[ \sum_{i \in I_j} g_i w_j + \frac{1}{2}\sum_{i \in I_j} h_i w_j^2 \right ] + \gamma T_t + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2 \\  
&= \sum_{j=1}^T \left[ w_j \sum_{i \in I_j} g_i + \frac{1}{2} w_j^2  \left( \lambda + \sum_{i \in I_j} h_i \right )  \right ] + \gamma T_t \\  
\end{align*}

Note here

* We sum over tree leaves ($\sum_{j=1}^T$) instead of over individual examples ($\sum_{i=1}^n$).
* For the leaf $j$, given $i \in I_j$, then $f_t(\mathbf{x}_i) = w_j$.

Then, for a fixed tree structure $q$, the optimal weights are

\begin{align*}
w_j^* = - \frac{\sum_{i \in I_j} g_i}{\lambda + \sum_{i \in I_j} h_i}
\end{align*}

and the minimal loss is

$$\tilde{\mathcal{L}}^{(t)*} = - \frac{1}{2} \sum_{j=1}^T \frac{\left( \sum_{i \in I_j} g_i \right )^2}{\lambda + \sum_{i \in I_j} h_i} + \gamma T_t$$

Note that once the tree structure is fixed, i.e. the assignment of examples to leaves is fixed,

* The leaves are independent of one another, i.e. $\tilde{\mathcal{L}}^{(t)}$ is a sum of $T$ independent quadratic functions, one per leaf weight $w_j$.
* $T_t$ is a constant.
* The general pattern for minimizing a quadratic with $a > 0$: $\arg\min_{x} \left( \frac{1}{2} a x^2 + b x \right) = - \frac{b}{a}$, and the minimum value is $-\frac{b^2}{2a}$. Here,
 * $a = \lambda + \sum_{i \in I_j} h_i$,
 * $b = \sum_{i \in I_j} g_i$.
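The closed-form optimum for a single leaf can be checked numerically. The sketch below computes $w_j^*$ and the corresponding minimal per-leaf loss from the formulas above, then compares the result against a brute-force search over a grid of candidate weights. The function names and the toy values of `g`, `h`, and `lam` are assumptions made for the example, and the $\gamma$ term is left out because it does not depend on the weight.

```python
import numpy as np

def leaf_weight_and_loss(g, h, lam):
    """Closed form for one leaf: w* = -G / (lam + H), loss* = -0.5 * G^2 / (lam + H)."""
    G, H = np.sum(g), np.sum(h)
    return -G / (lam + H), -0.5 * G ** 2 / (lam + H)

def leaf_objective(w, g, h, lam):
    """Per-leaf part of the approximate loss, without the gamma term."""
    return w * np.sum(g) + 0.5 * w ** 2 * (lam + np.sum(h))

g = np.array([-2.0, 0.0, 2.0, 4.0])  # gradients of the examples in this leaf
h = np.array([2.0, 2.0, 2.0, 2.0])   # hessians of the examples in this leaf
lam = 1.0

w_star, loss_star = leaf_weight_and_loss(g, h, lam)

# Brute-force check: evaluate the objective on a fine grid and take the minimizer.
grid = np.linspace(-2.0, 2.0, 4001)
w_grid = grid[np.argmin(leaf_objective(grid, g, h, lam))]

print(w_star, loss_star, w_grid)
```

With $G = 4$ and $\lambda + H = 9$, the closed form gives $w^* = -4/9$ and a minimal loss of $-8/9$, and the grid search lands on the nearest grid point.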

#### Split evaluation

For a greedy tree growth algorithm, when trying to split a leaf ($I$) into two leaves ($I_L$, $I_R$), the performance of the split is evaluated by the reduction in loss before and after the split,

\begin{align*} 
\Delta &= \tilde{\mathcal{L}}^{(t)*}_{I} - \left( \tilde{\mathcal{L}}^{(t)*}_{I_L} + \tilde{\mathcal{L}}^{(t)*}_{I_R} \right) \\
&= - \frac{1}{2} \frac{\left( \sum_{i \in I} g_i \right )^2}{\lambda + \sum_{i \in I} h_i} + \gamma - \left[ - \frac{1}{2} \frac{\left( \sum_{i \in I_L} g_i \right )^2}{\lambda + \sum_{i \in I_L} h_i} + \gamma - \frac{1}{2} \frac{\left( \sum_{i \in I_R} g_i \right )^2}{\lambda + \sum_{i \in I_R} h_i} + \gamma \right ] \\
&= \frac{1}{2} \left[  \frac{\left( \sum_{i \in I_L} g_i \right )^2}{\lambda + \sum_{i \in I_L} h_i} + \frac{\left( \sum_{i \in I_R} g_i \right )^2}{\lambda + \sum_{i \in I_R} h_i} -  \frac{\left( \sum_{i \in I} g_i \right )^2}{\lambda + \sum_{i \in I} h_i} \right ] - \gamma
\end{align*}

Note, 

* We only consider the loss change for the three leaves involved ($I$, $I_L$, $I_R$), because the loss of the other leaves does not change during the split.
* This loss reduction plays a role similar to the decrease in Gini impurity or entropy, which apply only to classification trees.
* The formula shows that a split is only worthwhile when the bracketed gain term, including its factor of $\frac{1}{2}$, exceeds $\gamma$; otherwise $\Delta \le 0$ and the split should not be applied.
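The split criterion can be sketched as a small function together with a naive scan over candidate thresholds of one feature. This is only an illustration under stated assumptions: the names `score`, `split_gain`, and `best_split` are made up, candidates are midpoints between sorted distinct feature values, and the scan recomputes the sums for every candidate rather than accumulating them as an efficient implementation would.

```python
import numpy as np

def score(G, H, lam):
    # Structure score of one leaf: G^2 / (lam + H)
    return G ** 2 / (lam + H)

def split_gain(g, h, left_mask, lam, gamma):
    """Delta = 0.5 * [score(L) + score(R) - score(parent)] - gamma."""
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    gain = score(GL, HL, lam) + score(GR, HR, lam) - score(GL + GR, HL + HR, lam)
    return 0.5 * gain - gamma

def best_split(x, g, h, lam, gamma):
    """Try midpoints between distinct values of x; return (threshold, Delta)."""
    values = np.unique(x)  # sorted distinct values
    best = (None, -np.inf)
    for t in (values[:-1] + values[1:]) / 2.0:
        delta = split_gain(g, h, x < t, lam, gamma)
        if delta > best[1]:
            best = (t, delta)
    return best

x = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([-2.0, -2.0, 2.0, 2.0])
h = np.array([2.0, 2.0, 2.0, 2.0])

threshold, delta = best_split(x, g, h, lam=1.0, gamma=0.5)
print(threshold, delta)

# The same split with a large gamma has negative Delta, so it would be rejected.
delta_large_gamma = split_gain(g, h, x < 2.5, lam=1.0, gamma=5.0)
print(delta_large_gamma)
```

On this toy data the best threshold separates the negative gradients from the positive ones, which is where the per-leaf sums $G_L$ and $G_R$ are largest in magnitude.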

# Debug

\begin{align*}
\mathcal{L}^{(t)} = \sum_{i=1}^n l\left(y_i, \hat y_i^{(t - 1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t)
\end{align*}

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^n \left[ l\left(y_i, \hat y_i^{(t - 1)} \right ) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right ] + \Omega(f_t)$$

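If $l$ is the squared loss, $l(y, \hat y) = (\hat y - y)^2$, then $g_i = 2\left(\hat y_i^{(t - 1)} - y_i\right)$ and $h_i = 2$. The term $\Omega(f_t)$ is the same in both equations, so it is omitted below.

In the first equation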

\begin{align*}
\mathcal{L}^{(t)} 
&= \sum_{i=1}^n l\left(y_i, \hat y_i^{(t - 1)} + f_t(\mathbf{x}_i)\right) \\
&= \sum_{i=1}^n \left(\hat y_i^{(t - 1)} + f_t(\mathbf{x}_i) - y_i \right)^2 \\
&= \sum_{i=1}^n \left(f_t(\mathbf{x}_i) + \hat y_i^{(t - 1)} - y_i \right)^2 \\
\end{align*}

In the second equation

\begin{align*}
\mathcal{L}^{(t)} 
&\approx \sum_{i=1}^n \left[ l\left(y_i, \hat y_i^{(t - 1)} \right ) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right ] \\
&=\sum_{i=1}^n \left[ \left(\hat y_i^{(t - 1)} - y_i \right )^2 + 2\left(\hat y_i^{(t - 1)} - y_i \right ) f_t(\mathbf{x}_i) +  f_t^2(\mathbf{x}_i) \right ]
\end{align*}

They're identical because the second-order Taylor expansion is exact for squared loss: the loss is quadratic in $f_t(\mathbf{x}_i)$, so all derivatives of order three and higher vanish.
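The same comparison can be run numerically. The sketch below evaluates the exact loss and its second-order approximation for squared loss, then repeats the comparison for logistic loss as a contrast. The toy arrays are arbitrary, the logistic loss treats predictions as raw scores with labels in $\{0, 1\}$, and $\Omega(f_t)$ is omitted since it is identical on both sides; all of these are assumptions of the example.

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0])
y_hat_prev = np.array([0.0, 0.0, 2.0])  # predictions after step t - 1
f_t = np.array([1.0, 1.0, -0.5])        # outputs of the new tree

# Squared loss: exact value versus second-order approximation.
exact_sq = np.sum((y_hat_prev + f_t - y) ** 2)
g = 2.0 * (y_hat_prev - y)
h = 2.0
approx_sq = np.sum((y_hat_prev - y) ** 2 + g * f_t + 0.5 * h * f_t ** 2)

# Logistic loss: log(1 + exp(z)) - y * z, written with logaddexp for stability.
def logloss(y01, z):
    return np.sum(np.logaddexp(0.0, z) - y01 * z)

p = 1.0 / (1.0 + np.exp(-y_hat_prev))
g_lg, h_lg = p - y, p * (1.0 - p)
exact_lg = logloss(y, y_hat_prev + f_t)
approx_lg = logloss(y, y_hat_prev) + np.sum(g_lg * f_t + 0.5 * h_lg * f_t ** 2)

gap_sq = abs(exact_sq - approx_sq)
gap_lg = abs(exact_lg - approx_lg)
print(gap_sq, gap_lg)
```

The gap is zero (up to floating-point error) for squared loss, while for logistic loss it is small but nonzero because the higher-order terms are being dropped.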