# XGBoost Methodology and Application

*First Version Date: 2020-11-08   
Latest update on: 2020-11-15*

## 1. XGBoost vs. Gradient Boosting Machine (GBM)
The full name of XGBoost is "Extreme Gradient Boosting". It is a specific implementation of the gradient boosting method that uses more accurate approximations to find the best tree model. Specifically,

1. XGBoost takes the *Taylor expansion* of the loss function up to the second order (i.e. it uses the second partial derivatives of the loss function). This simplifies the cost function and makes it easy to compute.
2. XGBoost has advanced regularization (L1 & L2), which improves model generalization.
3. Training using XGBoost is very fast and can be parallelized / distributed across clusters.

Quote from XGBoost Author Tianqi Chen:

> *Both xgboost and gbm follow the principle of gradient boosting. There are however, the difference in modeling details. Specifically, xgboost used a more regularized model formalization to control over-fitting, which gives it better performance.*

> *The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms. Which is the reason why many people use xgboost. For model, it might be more suitable to be called as regularized gradient boosting.*

## 2. Mathematical Setup for Supervised Learning

1. **Model**  
A mathematical structure or mapping that generates a prediction or estimate for $y_i$ from the input $x_i$, where $i$ denotes the $i^{th}$ observation. Some examples:
  + Linear Regression Model: $\hat{y}_i = \sum_j\theta_j x_{ij}$, where $j$ indexes the $j^{th}$ coefficient and $i$ the $i^{th}$ observation.
  + Generalized Linear Model or GLM (Logistic Regression is one of them): $\hat{y}_i = g^{-1}\left(\sum_j\theta_j x_{ij}\right)$, where $g$ is the link function. Specifically, for logistic regression $g\left(\mu \right) = \log \left( \frac{\mu}{1-\mu} \right)$ and $g^{-1}\left(\eta \right)  = \frac{1}{1 + e^{-\eta}}$, where $\eta = \sum_j\theta_j x_{ij}$ is the linear predictor. For more information on GLM, please refer to notebook <mark> TBA </mark>.
  
2. **Parameters**  
Parameters are the part in the **Model** that needs to be determined in order to perform the prediction. In the Linear Regression Model and GLM case, $\theta$ is the set of parameters that needs to be "learnt" from the data. 

3. **Model Training**  
The process to find the **best** parameters $\theta$ that fit the training data $X$ and the target variable $y$.

4. **Objective Function**  
The next natural question is: how do we find the **best** parameters? We need to define an **objective function** that measures how well a given set of parameters fits the training data. The generic form of the objective function is as follows:  
$$\text{obj}(\theta) = L(\theta) + \Omega(\theta)$$ 
How to interpret the above function? There are two components which are illustrated as follows:
 + **Training Loss**  
 $L\left( \cdot \right) $ is the training loss function. It measures how accurately the model predicts the training data. Some examples:
     * The most commonly used one is the *squared error*, which equals the *mean squared error (MSE)* up to a factor of $1/n$: $L(\theta) = \sum_i (y_i-\hat{y}_i)^2$.
     * Another commonly used loss is the *logistic loss*, where $\hat{y}_i$ is the raw score before the sigmoid is applied:
     $L(\theta) = \sum_i\left[ y_i\ln (1+e^{-\hat{y}_i}) + (1-y_i)\ln (1+e^{\hat{y}_i})\right]$.
 
 + **Regularization Term**   
 $\Omega \left( \cdot \right)$ is the regularization term. It penalizes overly complicated models and helps to avoid overfitting. It follows from the principle of preferring a *parsimonious model*: a simple model with great explanatory and predictive power that explains the data with a minimum number of parameters or predictor variables. Generally speaking, we need to develop a model that balances *bias* and *variance*. For more on the **bias and variance tradeoff**, please refer to the notebook <mark> TBA</mark>.
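
The four ingredients above (model, parameters, loss, regularization) can be put together in a few lines. The following is a minimal sketch, assuming a tiny hand-made dataset and an L2 penalty with a hypothetical strength `lam` as the regularization term. All function names are illustrative and are not part of any library.

```python
import math

def predict_linear(theta, x):
    """Linear model: y_hat = sum_j theta_j * x_j."""
    return sum(t * xj for t, xj in zip(theta, x))

def sigmoid(z):
    """Inverse of the logit link used by logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))

def squared_error(y, y_hat):
    """Training loss: sum of squared errors over all observations."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))

def logistic_loss(y, score):
    """Training loss where `score` is the raw (pre-sigmoid) prediction."""
    return sum(
        yi * math.log(1 + math.exp(-si)) + (1 - yi) * math.log(1 + math.exp(si))
        for yi, si in zip(y, score)
    )

def l2_penalty(theta, lam):
    """Assumed regularization term: Omega(theta) = lam * sum_j theta_j^2."""
    return lam * sum(t * t for t in theta)

# Toy data (made up for illustration): three observations, two features.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 2.0, 5.0]
theta = [1.0, 2.0]   # candidate parameters
lam = 0.1            # hypothetical regularization strength

y_hat = [predict_linear(theta, x) for x in X]
obj = squared_error(y, y_hat) + l2_penalty(theta, lam)   # obj = L + Omega
print(obj)
```

Here the candidate `theta` fits the toy data perfectly, so the whole objective value comes from the penalty term. This illustrates the trade-off that the regularization term introduces.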

## 3. Ensemble Tree Method

### 3.1 Decision Trees
For more information on decision trees, refer to another notebook: *<mark>Tree_Based_Methods.ipynb</mark>*.  

### 3.2 Classification and Regression trees (CART)
In CART, a real-valued score is associated with each leaf. This gives richer interpretations than a plain class label and allows for a principled, unified approach to optimization.

### 3.3 Ensemble Methods
Normally a single CART is not accurate enough. In practice, several CARTs are combined to generate the final prediction, e.g. by summing the predictions of the individual trees:
$$\hat{y}_i = \sum_{k=1}^K f_k(x_i), f_k \in \mathcal{F}$$
where $K$ is the total number of trees, $f_k\left( \cdot \right)$ is a function in the function space $\mathcal{F}$ and $\mathcal{F}$ is the set of all CARTs.  
The objective function is then defined as follows:
$$\text{obj}(\theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k)$$
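
To make the sum over trees concrete, here is a small sketch in which each tree is represented as a plain Python function with a single split. The helper `make_stump`, the thresholds and the leaf scores are all made up for illustration.

```python
def make_stump(feature, threshold, left_score, right_score):
    """Return a one-split tree f(x) whose leaves hold real-valued scores."""
    def f(x):
        return left_score if x[feature] < threshold else right_score
    return f

# Two hypothetical trees f_1 and f_2 (K = 2).
trees = [
    make_stump(feature=0, threshold=15.0, left_score=2.0, right_score=-1.0),
    make_stump(feature=1, threshold=0.5, left_score=-0.9, right_score=0.9),
]

def predict(trees, x):
    """Ensemble prediction: y_hat = sum_k f_k(x)."""
    return sum(f(x) for f in trees)

x = [10.0, 1.0]            # one observation with two features
y_hat = predict(trees, x)  # leaf score of tree 1 plus leaf score of tree 2
print(y_hat)
```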

### 3.4 Boosting and Additive Training
Learning the tree structure is much harder than a traditional optimization problem where you can simply take the gradient. It is intractable to learn all the trees at once. Instead, we can use an additive strategy: fix what we have learned, and add one new tree at a time. Let's rewrite the prediction and the objective step by step:  
For prediction:
$$\begin{split}
\hat{y}_i^{(0)} &= 0\\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i)= \hat{y}_i^{(1)} + f_2(x_i)\\
\dots\\
\hat{y}_i^{(t)} &= \sum_{k=1}^t f_k(x_i)= \hat{y}_i^{(t-1)} + f_t(x_i)
\end{split}$$
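For the objective function at step $t$:
$$\begin{split}
\text{obj}^{(t)} &= \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{k=1}^t \Omega(f_k)\\
                         &= \sum_{i=1}^n l \left( y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{Constant}
\end{split}$$
Here the constant collects the regularization terms of the first $t-1$ trees, which are already fixed at step $t$.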
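
The additive strategy can be sketched as a short loop. The example below is a simplification under stated assumptions: it uses the squared error loss, one-split trees on a single feature, and no regularization term, so it corresponds to plain boosting on residuals rather than to the full XGBoost algorithm. The data and the helper names `fit_stump` and `boost` are made up.

```python
def fit_stump(x, residual):
    """Fit a one-split tree to the residuals on a 1-D feature.
    Each leaf predicts the mean residual of the points it contains."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2
        left = [r for xi, r in zip(x, residual) if xi < thr]
        right = [r for xi, r in zip(x, residual) if xi >= thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda xi: lm if xi < thr else rm

def boost(x, y, n_rounds):
    trees = []
    y_hat = [0.0] * len(y)                      # y_hat^(0) = 0
    losses = []
    for _ in range(n_rounds):
        residual = [yi - pi for yi, pi in zip(y, y_hat)]
        f_t = fit_stump(x, residual)            # earlier trees stay fixed
        trees.append(f_t)
        # y_hat^(t) = y_hat^(t-1) + f_t(x)
        y_hat = [pi + f_t(xi) for pi, xi in zip(y_hat, x)]
        losses.append(sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat)))
    return trees, y_hat, losses

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
trees, y_hat, losses = boost(x, y, n_rounds=5)
print(losses)   # training loss after each added tree
```

Because each new tree is fitted to what the previous trees left unexplained, the training loss cannot increase from one round to the next in this setup.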

## 4. What are the specifics for XGBoost?

### 4.1 Taylor Expansion
As you can imagine, if the loss function is simple (meaning it only has first-order and quadratic terms, like *MSE*), the optimization of the objective function is easier to solve. This motivates the idea in XGBoost of using the *Taylor expansion of the loss function up to the second order* at $\hat{y}_i^{(t-1)}$:
$$\begin{split}
\text{obj}^{(t)} &= \sum_{i=1}^n l \left( y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{Constant}\\
&\approx \sum_{i=1}^n \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) + \mathrm{Constant}
\end{split}$$
where $g_i$ and $h_i$ are the first order and second order derivatives of $l(\cdot)$ at $\hat{y}_i^{(t-1)}$, specifically:
$$\begin{split}
g_i &= \partial_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)\\
h_i &= \partial_{\hat{y}_i^{(t-1)}}^2 l\left(y_i, \hat{y}_i^{(t-1)}\right)
\end{split}$$
Don't forget the target of the optimization: we need to find the best $f_t(\cdot)$ that minimizes $\text{obj}^{(t)}$. Since $l(y_i, \hat{y}_i^{(t-1)})$ does not depend on $f_t$, we can simplify the objective function further by removing the constants:
$$\sum_{i=1}^n \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)$$
As the above equation shows, the objective function depends on the loss only through $g_i$ and $h_i$. **In this sense, XGBoost can support customized loss functions as long as the first and second order derivatives are provided!**  
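
As an illustration, the sketch below writes out $g_i$ and $h_i$ for the two losses introduced in Section 2 and compares the second-order approximation with the exact loss. The function names are made up for this example, and the numbers are arbitrary. For the squared error the approximation is exact because the loss is already quadratic. For the logistic loss it is only accurate for small steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error_grad_hess(y, y_hat):
    """l = (y - y_hat)^2  ->  g = 2 * (y_hat - y), h = 2."""
    return 2.0 * (y_hat - y), 2.0

def logistic_loss(y, score):
    """Logistic loss for one observation; `score` is the raw prediction."""
    return y * math.log(1 + math.exp(-score)) + (1 - y) * math.log(1 + math.exp(score))

def logistic_grad_hess(y, score):
    """With p = sigmoid(score):  g = p - y, h = p * (1 - p)."""
    p = sigmoid(score)
    return p - y, p * (1.0 - p)

def taylor_approx(loss_at_prev, g, h, step):
    """Second-order approximation of l(y, y_hat_prev + step)."""
    return loss_at_prev + g * step + 0.5 * h * step ** 2

# Arbitrary values: label, previous prediction, and the new tree's output.
y, prev, step = 1.0, 0.3, 0.2
g, h = logistic_grad_hess(y, prev)
approx = taylor_approx(logistic_loss(y, prev), g, h, step)
exact = logistic_loss(y, prev + step)
print(approx, exact)   # close, but not identical
```

A custom loss therefore only needs to supply these two quantities per observation. The XGBoost documentation describes how to pass such a function to the library.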

### 4.2 Regularization
What about the regularization term $\Omega(f_t)$? XGBoost redefines the tree $f_t(x)$ as:
$$f_t(x) = w_{q(x)}, w \in R^T, q:R^d\rightarrow \{1,2,\cdots,T\} $$
Here $w$ is the vector of scores on leaves, $q$ is a function assigning each data point to the corresponding leaf, and $T$ is the number of leaves.  
XGBoost defines the complexity or the regularization term as follows according to the re-written tree definition:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2$$
where $w_j$ is the score for the $j^{th}$ leaf in the tree.  
**Compared with other packages that treat regularization less carefully, XGBoost defines the regularization term formally, which helps to control overfitting in a principled way!**
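
The tree definition and its complexity can be written down directly. In this sketch the structure function `q`, the leaf scores `w`, and the values of `gamma` and `lam` are hypothetical, and leaves are indexed from 0 rather than from 1 as in the formulas.

```python
def q(x):
    """Hypothetical tree structure: maps a 1-D input to a leaf index in {0, 1, 2}."""
    if x < 1.0:
        return 0
    return 1 if x < 2.0 else 2

w = [2.0, 0.1, -1.0]   # one score per leaf (made-up values), so T = 3

def f(x):
    """Tree prediction f(x) = w_{q(x)}."""
    return w[q(x)]

def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * sum(wj * wj for wj in leaf_weights)

omega = tree_complexity(w, gamma=1.0, lam=1.0)
print(f(0.5), f(1.5), f(3.0), omega)
```

Both more leaves and larger leaf scores increase the penalty, which is how the term discourages overly complex trees.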

### 4.3 The new objective function and solution! 
After re-formulating the tree model, we can write the objective value with the $t^{th}$ tree as:
$$\begin{split}
\text{obj}^{(t)} &\approx \sum_{i=1}^n \left[g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2\\
&= \sum^T_{j=1} \left[\left(\sum_{i\in I_j} g_i\right) w_j + \frac{1}{2} \left(\sum_{i\in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T
\end{split}$$  
where $I_j = \{i|q(x_i)=j\}$ is the set of indices of data points assigned to the $j^{th}$ leaf.  

Notice that in the second line we have changed the index of the summation because all the data points on the same leaf get the same score.  

We could further compress the expression by defining $G_j = \sum_{i\in I_j} g_i$ and $H_j = \sum_{i\in I_j} h_i$ as follows:  
$$
\text{obj}^{(t)} = \sum^T_{j=1} \left[G_jw_j + \frac{1}{2} \left(H_j+\lambda \right) w_j^2 \right] +\gamma T$$

In this equation, the $w_j$ are independent of each other, and the form $G_jw_j+\frac{1}{2}\left(H_j+\lambda\right)w_j^2$ is quadratic in $w_j$. Setting its derivative $G_j + \left(H_j+\lambda\right)w_j$ to zero gives the best $w_j$ for a given structure $q(x)$, and substituting it back gives the best objective reduction we can get:
$$
\begin{split}w_j^\ast &= -\frac{G_j}{H_j+\lambda}\\
\text{obj}^\ast &= -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j+\lambda} + \gamma T
\end{split}$$
The last equation measures *how good* a tree structure $q(x)$ is.  
**Note**: This section closely follows the XGBoost online tutorial.
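
The closed-form solution can be checked numerically. The sketch below uses made-up gradient statistics for four data points and compares two candidate structures: a single leaf, and two leaves that separate the negative gradients from the positive ones. The helper names and the values of `lam` and `gamma` are assumptions for illustration.

```python
def leaf_sums(g, h, leaf_index, n_leaves):
    """G_j and H_j: sums of g_i and h_i over the points assigned to leaf j."""
    G = [0.0] * n_leaves
    H = [0.0] * n_leaves
    for gi, hi, j in zip(g, h, leaf_index):
        G[j] += gi
        H[j] += hi
    return G, H

def optimal_weights(G, H, lam):
    """w_j* = -G_j / (H_j + lambda)."""
    return [-Gj / (Hj + lam) for Gj, Hj in zip(G, H)]

def structure_score(G, H, lam, gamma):
    """obj* = -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T."""
    return -0.5 * sum(Gj ** 2 / (Hj + lam) for Gj, Hj in zip(G, H)) + gamma * len(G)

# Made-up first and second derivatives for four data points.
g = [-2.0, -1.0, 1.0, 2.0]
h = [1.0, 1.0, 1.0, 1.0]
lam, gamma = 1.0, 0.5

# Structure A: one leaf holding all points (leaf indices start at 0 here).
G_a, H_a = leaf_sums(g, h, [0, 0, 0, 0], 1)
# Structure B: two leaves, negative gradients left and positive gradients right.
G_b, H_b = leaf_sums(g, h, [0, 0, 1, 1], 2)

score_a = structure_score(G_a, H_a, lam, gamma)
score_b = structure_score(G_b, H_b, lam, gamma)
w_b = optimal_weights(G_b, H_b, lam)
print(score_a, score_b, w_b)
```

Structure B obtains the lower (better) score even though it pays `gamma` for an extra leaf, because grouping points with similar gradients lets each leaf take a score that reduces the loss.
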
### 4.4 Tree Pruning
TBA

## 5. Using XGBoost by hand - a 10 data-point example
Inspired by StatQuest

## 6. Hyper Parameters and Tuning for XGBoost
### 6.1 Types of Parameters
* **General Parameters**: which booster (tree or linear model) to use for boosting.
* **Booster Parameters**: parameters that depend on the booster you choose.
* **Task Parameters**: parameters that decide the learning scenario.
* **Command Line Parameters**: parameters that relate to the behavior of the CLI version of XGBoost.

The following sections describe the important and frequently used parameters of each type.
### 6.2 General Parameters
    
| Parameters | Description | Possible values |
| :---|:--- | :--- |
| `booster`(default = `gbtree`) | which booster to use | `gbtree`, `gblinear`, `dart`|
|`verbosity`(default = `1`)| verbosity of printing messages| `0`(silent), `1`(warning), `2`(info), `3`(debug)|
|`nthread`(default to maximum number of threads available if not set)|number of parallel threads used to run XGBoost| - |


### 6.3 Booster Parameters
#### 6.3.1 Tree Booster Parameters
| Parameters | Description | Possible values |
| :--|:-- | :-- |
|`eta`(default = 0.3, or `learning_rate`)|Step size shrinkage used in each update to prevent overfitting.|range: $[0, 1]$|
|`gamma`(default = 0, or `min_split_loss`)|Minimum loss reduction required to make a further partition on a leaf node of the tree.|range: $[0,\infty)$|
|`max_depth`(default=6)|Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit.|range: $[0,\infty)$|
|`min_child_weight`(default = 1)|Minimum sum of instance weight (hessian) needed in a child. The larger `min_child_weight` is, the more conservative the algorithm will be.|range: $[0,\infty)$|
|`max_delta_step`(default = 0)|Maximum delta step we allow each leaf output to be.|range: $[0,\infty)$|
|`subsample`(default = 1)|Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees.|range: $(0, 1]$|
|`sampling_method`(default = `uniform`)|The method to use to sample the training instances.|`uniform`, `gradient_based`|
|`lambda`(default = 1, `reg_lambda`)|L2 regularization term on weights. Increasing this value will make model more conservative.|range: $[0,\infty)$|
|`alpha`(default = 0, `reg_alpha`)|L1 regularization term on weights. Increasing this value will make model more conservative.|-|
|`scale_pos_weight`(default = 1)|Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: `sum(negative instances) / sum(positive instances)`.|-|
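
As a sketch of how these settings are typically collected, the snippet below builds a dictionary of tree booster parameters and derives `scale_pos_weight` from the class counts, following the rule of thumb in the table. The labels are made up and the chosen values are only illustrative, not recommendations.

```python
# Made-up, imbalanced binary labels: eight negatives and two positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
n_pos = sum(labels)
n_neg = len(labels) - n_pos

tree_params = {
    "eta": 0.1,               # smaller steps than the default 0.3
    "gamma": 0.0,             # minimum loss reduction required to split
    "max_depth": 4,
    "min_child_weight": 1,
    "subsample": 0.8,         # use 80% of the rows for each tree
    "lambda": 1.0,            # L2 regularization on leaf weights
    "alpha": 0.0,             # L1 regularization on leaf weights
    "scale_pos_weight": n_neg / n_pos,   # sum(negative) / sum(positive)
}
print(tree_params["scale_pos_weight"])
```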

#### 6.3.2 Linear Booster Parameters
| Parameters | Description | Possible values |
| :--|:-- | :-- |
|`lambda`(default = 1, `reg_lambda`)|L2 regularization term on weights. Increasing this value will make model more conservative.|range: $[0,\infty)$|
|`alpha`(default = 0, `reg_alpha`)|L1 regularization term on weights. Increasing this value will make model more conservative.|-|

### 6.4 Learning Task Parameters
| Parameters | Description | Possible values |
| :--|:-- | :-- |
|`objective`(default = `reg:squarederror`)|Learning objectives|`reg:squarederror`, `reg:logistic`, `binary:logistic`, `count:poisson`, `survival:cox`, `reg:gamma`|
|`base_score`(default = 0.5)|The initial prediction score of all instances|-|
|`eval_metric`(default according to objective)|Evaluation metrics for validation data|`rmse`, `mae`,`logloss`, `error`, `mlogloss`, `auc`, `poisson-nloglik`, `gamma-nloglik`, `cox-nloglik`|
|`seed`(default = 0)|Random number seed||
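
The three parameter groups usually end up in a single dictionary. The following is a minimal sketch, assuming the `xgboost` Python package is installed and that `X` and `y` are array-like training data supplied by the reader. The helper `train_booster` is a made-up name, and the import is placed inside it so that the dictionary can be inspected without the package.

```python
params = {
    # General parameters
    "booster": "gbtree",
    "verbosity": 1,
    # Tree booster parameters
    "eta": 0.3,
    "max_depth": 6,
    "subsample": 1.0,
    "lambda": 1.0,
    "alpha": 0.0,
    # Learning task parameters
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "seed": 0,
}

def train_booster(params, X, y, num_boost_round=10):
    """Train a model with XGBoost's native API (requires the xgboost package)."""
    import xgboost as xgb                     # lazy import, see lead-in
    dtrain = xgb.DMatrix(X, label=y)          # wrap features and labels
    return xgb.train(params, dtrain, num_boost_round=num_boost_round)
```

With real data, the call would look like `booster = train_booster(params, X, y)`. The number of boosting rounds is not part of `params` and is passed separately.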

### 6.5 Command Line Parameters
TBA


## 7. Real Data Example

## References
1. https://www.shirin-glander.de/2018/11/ml_basics_gbm/
2. https://xgboost.readthedocs.io/en/latest/index.html