# What is XGBoost?

XGBoost is an implementation of the **Gradient Boosted Decision Trees** algorithm. It was originally designed for large, complicated datasets, and it incorporates several different machine learning algorithms.

**These algorithms include:**
 - Decision Trees
 - Regularization (Ridge (L2) Regression)
 - Gradient Boost
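
To make the regularization item a little more concrete, here's a rough sketch of how an L2 penalty shrinks what a single tree leaf outputs. It assumes squared-error loss, where the regularized leaf value works out to the sum of the errors in the leaf divided by the number of rows plus the penalty. The function name `leaf_value`, the argument `l2_penalty` (playing the role of XGBoost's `lambda` parameter), and the two error values are made up for illustration.

```python
def leaf_value(errors, l2_penalty=0.0):
    """Output of one leaf under squared-error loss with an L2 penalty (illustrative helper)."""
    return sum(errors) / (len(errors) + l2_penalty)


# Example errors (observed minus predicted) for the rows that land in one leaf
errors_in_leaf = [-15.2, -14.2]

unregularized = leaf_value(errors_in_leaf)       # penalty of 0: the plain average
regularized = leaf_value(errors_in_leaf, 1.0)    # penalty of 1: pulled toward zero

print(round(unregularized, 2), round(regularized, 2))
```

The larger the penalty, the closer every leaf's output gets to zero, so each individual tree has less influence on the final prediction.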
 
 For optimal performance, XGBoost also allows users to fine-tune the model using *hyperparameters*. Hyperparameters act as knobs and dials for finding a middle ground between overfitting and underfitting the model. Tuning XGBoost is often done through experimentation.
 
**XGBoost Hyperparameters:**

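 - Booster
     - type of base learner (`gbtree`, `gblinear`, or `dart`); whether the task is regression or classification is set separately through the `objective` parameter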
     
 - <span style="color: red;">**Eta (learning rate)**</span>
     - <span style="color: red;">*how much each decision tree contributes to our model*</span>
     
 - Max_depth (maximum depth of each tree)
 
 - Min_child_weight
 
 - ...etc 
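
As a rough illustration, the hyperparameters above are usually collected in a plain dictionary. The parameter names below are real XGBoost parameters, but the values are only hypothetical starting points, not recommendations from this notebook.

```python
# Hypothetical starting values; good settings depend on the dataset.
params = {
    "booster": "gbtree",              # tree-based base learners
    "objective": "reg:squarederror",  # regression with squared-error loss
    "eta": 0.1,                       # learning rate: scales each tree's contribution
    "max_depth": 3,                   # maximum depth of each tree
    "min_child_weight": 1,            # minimum sum of instance weights needed in a child
}

# With the xgboost package installed, a dict like this is what
# xgboost.train(params, dtrain, num_boost_round=...) takes as its first argument.
for name, value in params.items():
    print(f"{name} = {value}")
```

Lowering `eta` or `max_depth` pushes the model toward underfitting, while raising them pushes it toward overfitting, which is why these values are normally found by experimentation.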
 
 
 # How does XGBoost work?
 
 There are many complicated steps involved in XGBoost, but for now we will focus on the most important step: **Gradient Boost**.
 
 Gradient Boost can be described as a 3-step cycle that starts with a "Naive Model", or initial guess. These steps are summarized in the diagram below.
<img src="graph1.png">

Let's run through a simple example using pandas.

# Example

In this example we will use height, favorite color, and gender to predict weight.

In [1]:
import pandas as pd

dummy_data = pd.read_csv('dummy_data.csv')

dummy_data

Unnamed: 0,Height,Fav_Color,Gender,Weight
0,1.6,Blue,M,88
1,1.6,Green,F,76
2,1.5,Blue,F,56
3,1.8,Red,M,73
4,1.5,Green,M,77
5,1.4,Blue,F,57


## Naive Model
Now that we have our data, we need to create our "naive model", which will serve as our first guess at someone's weight. In this case, we'll just take the average of the target variable.

In [2]:
avg_weight = dummy_data['Weight'].mean().round(1)

avg_weight

71.2

## Calculate Errors

Now that we have our naive model, we need to calculate the residuals (the same term used in linear regression), also known as the "errors".

We will use the following formula:

    (Observed Target - Predicted Target) = Errors
    
For the first record in our data, it will look something like this...

    88 - 71.2 = 16.8
    
We then do this for every row in our table.
Let's see what our table looks like now.

In [3]:
dummy_data_cycle1 = dummy_data.copy()
dummy_data_cycle1['Residuals'] = dummy_data['Weight'] - 71.2
dummy_data_cycle1

Unnamed: 0,Height,Fav_Color,Gender,Weight,Residuals
0,1.6,Blue,M,88,16.8
1,1.6,Green,F,76,4.8
2,1.5,Blue,F,56,-15.2
3,1.8,Red,M,73,1.8
4,1.5,Green,M,77,5.8
5,1.4,Blue,F,57,-14.2


## Build Model Predicting Errors

Our next step is to build a *decision tree* using Height, Fav_Color, and Gender. The purpose of this tree is to help us predict the residuals.

In this example we will only allow 4 leaves, but people usually choose anywhere from 8 to 32 leaves.

<img src="dec_tree1.png">

**Note:** Left = True and Right = False.
(e.g. The residual for the first row would be in the bottom right corner, since the person is male and the favorite color is Blue, so the "not Blue" test is false.)
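
To show where a tree's predicted residuals come from, here's a small sketch in plain Python (so it runs without the CSV file). It assumes the tree splits on gender and on whether the favorite color is Blue, as the note above suggests, and that each leaf predicts the average residual of the rows that land in it. The helper `leaf_of` is hypothetical and only stands in for the tree in the image.

```python
from collections import defaultdict

# Rows of the example table: (gender, favorite color, residual)
rows = [
    ("M", "Blue", 16.8),
    ("F", "Green", 4.8),
    ("F", "Blue", -15.2),
    ("M", "Red", 1.8),
    ("M", "Green", 5.8),
    ("F", "Blue", -14.2),
]


def leaf_of(gender, color):
    """Assumed splits: first on gender, then on whether the color is Blue."""
    return (gender, color == "Blue")


# Collect the residuals that fall into each of the 4 leaves
leaves = defaultdict(list)
for gender, color, residual in rows:
    leaves[leaf_of(gender, color)].append(residual)

# Each leaf predicts the average of its residuals
leaf_values = {leaf: round(sum(vals) / len(vals), 1) for leaf, vals in leaves.items()}

# The tree's prediction for every row of the table
tree_predictions = [leaf_values[leaf_of(g, c)] for g, c, _ in rows]
print(tree_predictions)
```

Rows that share a leaf get the same predicted residual, which is why the tree's output can differ from the exact residual of an individual row.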

## Add Last Model to Ensemble

Since we currently only have one decision tree made for our model, we simply save our last decision tree in memory. Congrats! We've completed one full cycle of the XGBoost algorithm.

## What next???
Now we have to start the cycle all over again. First, we will calculate the residuals using our new decision tree.

(This formula may seem familiar):

    Observed Target - Predicted Target = New Predicted Residual

Using our first record...

    88 - 71.2 =  16.8

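As you can see, this is the same residual we calculated from our initial model, and for this record the tree predicts it exactly.
If we added the tree's full prediction to our initial guess, our new prediction for weight would be: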
    
        Predicted Target + Residual = New Predicted Target
        
        71.2 + 16.8 = 88

Matching the training data this closely gives us **low bias**, but very **high variance** (overfitting).
XGBoost uses the **learning rate** parameter (a value between 0 and 1) to deal with this problem by scaling the contribution of each new tree. Here is our revised formula for predicting our weight:

    Predicted Target + (Learning Rate * Residual) = New Predicted Target

    Observed Target - (Predicted Target + Learning Rate * Residual) = New Predicted Residual
    
For this example, we will make our learning rate 0.1:

    71.2 + 0.1 * 16.8 = 72.9

    88 - (71.2 + 0.1 * 16.8) = 15.1
    
This new prediction is much further from the observed weight than the full-step prediction of 88, but it is a small step in the right direction from our initial guess of 71.2. In exchange for the higher bias, we have reduced some potential variance.
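
To see how these small steps add up, here's a sketch that repeats the update for the first record only. It assumes, purely for illustration, that every new tree predicts this record's current residual exactly, so each round removes 10% of the remaining error. The variable names are not from the notebook.

```python
observed = 88
prediction = 71.2      # initial guess (average weight)
learning_rate = 0.1

history = []
for round_number in range(1, 4):
    residual = observed - prediction          # error of the current prediction
    prediction += learning_rate * residual    # take a small step toward the target
    remaining = observed - prediction         # error left after the step
    history.append((round_number, round(prediction, 2), round(remaining, 2)))

for round_number, new_prediction, remaining in history:
    print(round_number, new_prediction, remaining)
```

Each round the remaining error is multiplied by 0.9, so the prediction creeps toward the observed weight instead of jumping straight to it.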

In [4]:
# Residuals predicted by the first tree (rounded)
r1 = [16.8, 4.8, -14.7, 3.8, 3.8, -14.7]
dummy_data_cycle1['Residuals'] = r1

# New predicted weight: initial guess plus the scaled tree prediction
dummy_data_cycle1['New_Predicted_Weight'] = (avg_weight + 0.1 * dummy_data_cycle1['Residuals']).round(1)

# Calculating new residuals against the updated predictions
dummy_data_cycle1['Residuals'] = dummy_data['Weight'] - dummy_data_cycle1['New_Predicted_Weight']

dummy_data_cycle1

Unnamed: 0,Height,Fav_Color,Gender,Weight,Residuals,New_Predicted_Weight
0,1.6,Blue,M,88,15.1,72.9
1,1.6,Green,F,76,4.3,71.7
2,1.5,Blue,F,56,-13.7,69.7
3,1.8,Red,M,73,1.4,71.6
4,1.5,Green,M,77,5.4,71.6
5,1.4,Blue,F,57,-12.7,69.7


## Build Model Predicting Errors Part 2

After calculating the errors again, it's time to build another decision tree based on our new errors.

Here is what it should look like:
<img src="dec_tree2.png">

**Note:** Left = True and Right = False.


## Add Last Model to Ensemble
Now we can combine the previous tree with our new tree. All trees are scaled by the same learning rate, which is 0.1 in this case. The equation for our new predicted weight should look like this:

    Predicted Target + (Learning Rate * Residual of Tree 1) + (Learning Rate * Residual of Tree 2) = New Predicted Target
    
Using our first row...

    71.2 + (0.1 * 16.8) + (0.1 * 15.1) = 74.4

This process repeats until a maximum number of cycles has been reached or adding more trees no longer significantly reduces the size of the residuals.

<img src="dec_tree3.png">

In [5]:
# Residuals predicted by the second tree (rounded)
r2 = [15.1, 4.3, -13.2, 3.4, 3.4, -13.2]

# Collect the predictions of both trees
dummy_data_cycle2 = dummy_data.copy()
dummy_data_cycle2['Residuals1'] = r1
dummy_data_cycle2['Residuals2'] = r2

# Calculate new predicted weight: initial guess plus both scaled trees
dummy_data_cycle2['New Predicted Weight'] = (
    avg_weight
    + 0.1 * dummy_data_cycle2['Residuals1']
    + 0.1 * dummy_data_cycle2['Residuals2']
)

dummy_data_cycle2

Unnamed: 0,Height,Fav_Color,Gender,Weight,Residuals1,Residuals2,New Predicted Weight
0,1.6,Blue,M,88,16.8,15.1,74.39
1,1.6,Green,F,76,4.8,4.3,72.11
2,1.5,Blue,F,56,-14.7,-13.2,68.41
3,1.8,Red,M,73,3.8,3.4,71.92
4,1.5,Green,M,77,3.8,3.4,71.92
5,1.4,Blue,F,57,-14.7,-13.2,68.41


## Conclusion

 ### Main Steps
 1) Make an initial guess
 
 2) Calculate the errors
 
 3) Build decision tree
 
 4) Add tree to ensemble (don't forget to scale tree)
 
 5) Repeat steps 2-4 until limit reached
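
The five steps above can be sketched as one loop. This is a simplified stand-in rather than XGBoost itself: each "tree" is a single split on height (a stump), there is no regularization, and the helper names `fit_stump` and `gradient_boost` as well as the `tolerance` stopping rule are assumptions made for this example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def gradient_boost(xs, ys, learning_rate=0.1, max_rounds=100, tolerance=1e-3):
    base = sum(ys) / len(ys)                                    # 1) initial guess
    predictions = [base] * len(ys)
    trees = []
    previous_error = float("inf")
    for _ in range(max_rounds):                                 # 5) repeat
        residuals = [y - p for y, p in zip(ys, predictions)]    # 2) calculate the errors
        tree = fit_stump(xs, residuals)                         # 3) build a tree on the errors
        trees.append(tree)                                      # 4) add the scaled tree
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]
        error = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        if previous_error - error < tolerance:                  # stop when gains are tiny
            break
        previous_error = error

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict, len(trees)


heights = [1.6, 1.6, 1.5, 1.8, 1.5, 1.4]
weights = [88, 76, 56, 73, 77, 57]
predict, n_trees = gradient_boost(heights, weights)
print(n_trees, round(predict(1.6), 1))
```

The loop stops either at `max_rounds` or when an extra tree barely lowers the mean squared error, which mirrors the two stopping conditions described earlier.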
 
 ### Main Purpose of Gradient Boosting
 
 It has been shown through empirical evidence that taking lots of *small steps* in the right direction provides better predictions on a testing dataset. In other words, iteratively adding scaled decision trees allows us to slowly approach our actual values while balancing both variance and bias.

# Citations

- https://www.kaggle.com/dansbecker/xgboost
- https://youtu.be/3CC4N4z3GJc
