# XGBoost

Let's learn about a very popular package for gradient boosting called `XGBoost`.

## What we will accomplish

In this notebook we will:
- Introduce the `XGBoost` package and point to the package installation process,
- Discuss what `XGBoost` is and why we might use it over `sklearn`, and
- Show how to implement gradient boosting regression in `XGBoost`:
    - Demonstrate `XGBoost` early stopping.

In [None]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
from seaborn import set_style

## This sets the plot style
## to have a grid on a white background
set_style("whitegrid")

## Gradient boosting reminder

In the previous notebook we learned what gradient boosting is and demonstrated how to implement it using `sklearn`'s `GradientBoostingRegressor` model object. Recall that this technique is a boosting approach where we iteratively train weak learners, fitting each new learner to the residuals left by the learners that came before it.
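To make the reminder concrete, here is a minimal sketch of the residual-fitting loop. It assumes squared-error loss and uses depth-1 `sklearn` decision trees as the weak learners. The function names `fit_boosted_stumps` and `predict_boosted_stumps` and the toy data are illustrative choices, not code from the previous notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_stumps(X, y, n_estimators=10, learning_rate=0.1):
    """Fit depth-1 trees one after another, each on the current residuals."""
    # Start from the constant prediction that minimizes squared error
    base = float(np.mean(y))
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_estimators):
        residuals = y - pred  # what the ensemble still gets wrong
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
        pred = pred + learning_rate * stump.predict(X)
        stumps.append(stump)
    return base, stumps


def predict_boosted_stumps(base, stumps, X, learning_rate=0.1):
    """Add the shrunken weak-learner predictions to the base value."""
    return base + learning_rate * sum(s.predict(X) for s in stumps)


# Toy data (hypothetical): a noisy parabola
rng = np.random.default_rng(0)
X_toy = np.linspace(-2, 2, 200).reshape(-1, 1)
y_toy = X_toy.ravel() ** 2 + rng.normal(size=200)

base, stumps = fit_boosted_stumps(X_toy, y_toy, n_estimators=50)
mse_base = np.mean((y_toy - base) ** 2)
mse_boosted = np.mean((y_toy - predict_boosted_stumps(base, stumps, X_toy)) ** 2)
```

Comparing `mse_boosted` with `mse_base` shows how much the sequence of weak learners improves on the constant starting prediction for the training data.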

## What is `XGBoost`?

While we implemented this algorithm using `sklearn`, another very popular package for gradient boosting is `XGBoost`, which stands for eXtreme Gradient Boosting. This package has often been used in winning data science competition entries, which likely contributed to its popularity.

### Installing `XGBoost`

A quick note! You likely do not have `XGBoost` already installed on your computer (at least I did not prior to writing this notebook). If you use `pip` to install Python packages, you can install `XGBoost` by following the instructions here, <a href="https://xgboost.readthedocs.io/en/latest/install.html#python">https://xgboost.readthedocs.io/en/latest/install.html#python</a>. If you use `conda` to install packages, this link should help, <a href="https://anaconda.org/conda-forge/xgboost">https://anaconda.org/conda-forge/xgboost</a>.

<i>Note: When I installed `XGBoost` on my machine it did not work at first; I had to install another piece of software onto my MacBook before it would work. Following the documentation from `XGBoost` worked for me.</i>

<i>Also Note: If you are running a Mac with an M1 chip the standard installation instructions may not work for you. In that case you should perform a web search to find the relevant instructions.</i>
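
Once installation finishes, a quick check from Python can confirm that the package is visible to the environment your notebook is running in. The sketch below uses only the standard library; the helper name `installed_version` is an illustrative choice.

```python
from importlib import metadata, util


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    if util.find_spec(package_name) is None:
        return None
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata was found
        return "unknown"


version = installed_version("xgboost")
if version is None:
    print("xgboost is not installed in this environment")
else:
    print(f"xgboost version: {version}")
```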

### Why `XGBoost`?

Why do so many people like using `XGBoost` over `sklearn`'s `GradientBoostingRegressor` and `GradientBoostingClassifier`? In comparison to `sklearn`'s implementation, `XGBoost`'s code for fitting gradient boosting models is typically much faster, and the resulting models often perform better. It also lets you train a model in parallel, which `sklearn`'s `GradientBoostingRegressor` and `GradientBoostingClassifier` do not currently offer.

## Implementing gradient boosting regression in `XGBoost`

With this motivation in mind, let's learn how to implement the same regression functionality we did with `sklearn` in the previous notebook. We will provide information on how to run gradient boosting classification and expand on the `XGBoost` syntax in the homework.

In [None]:
## First make our data set
np.random.seed(220)
X = np.linspace(-2,2,200)

y = X**2 + np.random.randn(200)

## Visualize the training data
plt.figure(figsize=(8,6))
plt.scatter(X,y)
plt.xlabel("$X$", fontsize=16)
plt.ylabel("$y$", fontsize=16)
plt.show()

One way to make a gradient boosting regressor in `XGBoost` is to use `XGBRegressor`, <a href="https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor">https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor</a>.

In [None]:
## import xgboost
import xgboost

In [None]:
## Let's recreate our learning_rate comparison

### Create an XGBRegressor object
### learning_rate=.1, max_depth=1, n_estimators=10
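xgb_reg1 = xgboost.XGBRegressor(learning_rate=.1,
                                max_depth=1,
                                n_estimators=10)

## fit it
xgb_reg1.fit(X.reshape(-1,1), y)

### Create an XGBRegressor object
### learning_rate=1, max_depth=1, n_estimators=10
xgb_reg2 = xgboost.XGBRegressor(learning_rate=1,
                                max_depth=1,
                                n_estimators=10)

## fit it
xgb_reg2.fit(X.reshape(-1,1), y)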


In [None]:
fig,ax = plt.subplots(1,2,figsize=(20,8))

ax[0].scatter(X,y,label='Training Points')
ax[0].plot(X, xgb_reg1.predict(X.reshape(-1,1)), 'k',label="Prediction")
ax[0].set_title("learning_rate=0.1", fontsize=18)
ax[0].legend(fontsize=14)
ax[0].set_xlabel("$X$", fontsize=16)
ax[0].set_ylabel("$y$", fontsize=16)

ax[1].scatter(X,y,label='Training Points')
ax[1].plot(X, xgb_reg2.predict(X.reshape(-1,1)), 'k',label="Prediction")
ax[1].set_title("learning_rate=1", fontsize=18)
ax[1].legend(fontsize=14)
ax[1].set_xlabel("$X$", fontsize=16)
ax[1].set_ylabel("$y$", fontsize=16)

plt.show()

A nice feature of `xgboost`'s model is that it automatically records the performance at each training step on a validation set, provided we give the model the validation set.

In [None]:
## Here I will generate a validation set because the data are randomly generated
## in practice you would need to split the data
X_val = np.linspace(-2,2,200)
y_val = X_val**2 + np.random.randn(200)

In [None]:
## make an XGBRegressor object
## n_estimators = 500, max_depth = 1, learning_rate = .1
xgb_reg = xgboost.XGBRegressor(n_estimators=500,
                          max_depth=1,
                          learning_rate=.1)

## fit the model, including an eval_set
xgb_reg.fit(X.reshape(-1,1), y, eval_set=[(X_val.reshape(-1,1), y_val)])

In [None]:
## demonstrate .evals_result()
xgb_reg.evals_result()

In [None]:
## get the 'rmse'
xgb_reg.evals_result()['validation_0']['rmse']

In [None]:
plt.figure(figsize=(10,8))

## store the validation rmse values and the weak learner counts
val_rmse = xgb_reg.evals_result()['validation_0']['rmse']
n_learners = range(1, len(val_rmse)+1)
best_idx = np.argmin(val_rmse)

plt.plot(n_learners, val_rmse)
plt.scatter([n_learners[best_idx]], [val_rmse[best_idx]], c='k')
plt.text(n_learners[best_idx], val_rmse[best_idx]-.05, "Min.", fontsize=14)

plt.title("Validation Error", fontsize=20)
plt.xlabel("Number of Weak Learners", fontsize=16)
plt.ylabel("RMSE", fontsize=16)

plt.yticks(fontsize=14)
plt.xticks(fontsize=14)

plt.show()

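Further, `XGBoost` allows us to implement early stopping without having to write our own code to do so. We just have to supply an `early_stopping_rounds` argument along with an `eval_set`. When this notebook was written the argument was passed during the `fit` step; since `XGBoost` 1.6 it can be set when creating the `XGBRegressor` object, and newer releases no longer accept it in `fit`.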

In [None]:
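## same xgb_reg as before, now with early_stopping_rounds
## Note: XGBoost 1.6+ takes early_stopping_rounds here,
## older versions took it as an argument to .fit()
## The value 10 is an illustrative choice
xgb_reg = xgboost.XGBRegressor(n_estimators = 500,
                                  max_depth = 1,
                                  learning_rate = .1,
                                  early_stopping_rounds = 10)


## Now show off early_stopping_rounds with eval_set
xgb_reg.fit(X.reshape(-1,1), y, eval_set=[(X_val.reshape(-1,1), y_val)])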

In [None]:
plt.figure(figsize=(10,8))

## store the validation rmse values and the weak learner counts
val_rmse = xgb_reg.evals_result()['validation_0']['rmse']
n_learners = range(1, len(val_rmse)+1)
best_idx = np.argmin(val_rmse)

plt.plot(n_learners, val_rmse)
plt.scatter([n_learners[best_idx]], [val_rmse[best_idx]], c='k')
plt.text(n_learners[best_idx], val_rmse[best_idx]-.05, "Min.", fontsize=14)

plt.title("Validation Error", fontsize=20)
plt.xlabel("Number of Weak Learners", fontsize=16)
plt.ylabel("RMSE", fontsize=16)

plt.yticks(fontsize=14)
plt.xticks(fontsize=14)

plt.show()

In [None]:
xgb_reg = xgboost.XGBRegressor(n_estimators = 220,
                                  max_depth = 1,
                                  learning_rate = .1)
xgb_reg.fit(X.reshape(-1,1), y)

plt.figure(figsize=(10,8))

plt.scatter(X,y,label='Training Points')
plt.plot(X, xgb_reg.predict(X.reshape(-1,1)), 'k',label="Prediction")
plt.legend(fontsize=14)
plt.xlabel("$X$", fontsize=16)
plt.ylabel("$y$", fontsize=16)

plt.show()

Here we have scratched the surface of what `XGBoost` can do. To learn more about the package check out the gradient boosting `Practice Problems` as well as the `XGBoost` documentation, <a href="https://xgboost.readthedocs.io/en/latest/index.html">https://xgboost.readthedocs.io/en/latest/index.html</a>.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)