<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# XGBoost

In [1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 

In [2]:
%matplotlib inline

"Gradient boosting," like bagging, is a general method for training decision tree ensembles.

XGBoost ("eXtreme Gradient Boosting") is a particular implementation of gradient boosted decision trees. It is popular on Kaggle because it is both fast to train and often gives excellent predictive performance.

<img src="https://miro.medium.com/max/1400/1*QJZ6W-Pck_W7RlIDwUIN9Q.jpeg" style="float: left;">

![](../../assets/xgboost.jpeg)

## Getting Started with XGBoost

We will use the `xgboost` library instead of scikit-learn for this lesson. Scikit-learn has [`GradientBoostingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) and [`GradientBoostingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) classes, but they lack some of the tricks that have made the `xgboost` library popular.

`xgboost` provides estimators that use the same interface as scikit-learn's, so we will not need to change our approach.

## XGBoost vs. Random Forests

### Similarities

Random forests and XGBoost both produce tree ensembles, and they provide many of the same parameters to reduce overfitting.

### Differences

#### Gradient Boosting vs. Bagging

- Bagging involves training each tree *independently* on a different *bootstrap sample*.
- Gradient boosting involves training each tree *sequentially to reduce the residual errors left by its predecessors*.

See [the official `xgboost` library documentation](https://xgboost.readthedocs.io/en/latest/tutorials/model.html) and Chapter 10 of [Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf) for details.
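To make the sequential idea concrete, here is a minimal sketch of plain gradient boosting for squared-error loss, using one-split "stumps" written in NumPy. It is a simplified illustration, not XGBoost's actual algorithm (which also uses second-order gradient information and regularization). The helper names `fit_stump` and `predict_stump` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Synthetic one-feature regression problem (hypothetical data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant prediction
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residual = y - prediction            # errors left by the trees so far
    stump = fit_stump(x, residual)       # the next tree is fit to those errors
    prediction += learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each stump is trained on what the previous stumps got wrong, which is the contrast with bagging, where every tree is trained independently on its own bootstrap sample.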

<b>Decision Tree:</b> Every hiring manager has a set of criteria, such as education level, years of experience, and interview performance. A decision tree is analogous to a hiring manager evaluating candidates against his or her own criteria.

<b>Bagging:</b> Now imagine that instead of a single interviewer there is an interview panel in which each interviewer has a vote. Bagging, or bootstrap aggregating, combines the inputs from all interviewers into a final decision through a democratic vote.

<b>Random Forest:</b> This is a bagging-based algorithm with one key difference: only a randomly selected subset of features is considered. In other words, every interviewer tests the interviewee only on certain randomly selected qualifications (e.g. a technical interview for testing programming skills and a behavioral interview for evaluating non-technical skills).

<b>Boosting:</b> This is an alternative approach where each interviewer alters the evaluation criteria based on feedback from the previous interviewer. This ‘boosts’ the efficiency of the interview process by deploying a more dynamic evaluation process.

#### Handling Missing Values

At each split for a given variable, XGBoost simply learns whether sending items with missing values left or right gives better results. This approach has a few advantages:

- It is automatic.
- Unlike dropping rows or columns, it allows you to use all of the values you do have.
- Unlike imputation, it treats "missing" as its own value rather than replacing it with some other value that might be wrong.
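The idea of learning a default direction for missing values can be sketched in a few lines of NumPy. This is a simplified illustration that compares squared error for one fixed split; XGBoost itself scores the two options using gradient statistics. The function names, the threshold, and the toy data are assumptions made for this example.

```python
import numpy as np

def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    return float(((values - values.mean()) ** 2).sum()) if values.size else 0.0

def best_default_direction(x, y, threshold):
    """Try sending rows with missing x left, then right; keep the better option."""
    missing = np.isnan(x)
    go_left = np.zeros(x.shape, dtype=bool)
    go_left[~missing] = x[~missing] <= threshold  # rows with a known value
    errors = {}
    for direction in ("left", "right"):
        left_mask = (go_left | missing) if direction == "left" else go_left
        errors[direction] = sse(y[left_mask]) + sse(y[~left_mask])
    return min(errors, key=errors.get), errors

# Toy data: the two rows with missing x have targets that resemble the right-hand group
x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, np.nan, np.nan])
y = np.array([10.0, 11.0, 9.0, 30.0, 31.0, 29.0, 32.0])

direction, errors = best_default_direction(x, y, threshold=5.0)
print(direction, errors)
```

Here the rows with missing `x` are sent right because their targets look like those of the right-hand group, so no imputed value ever has to be invented.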

#### Install the xgboost library

There is a lot to this! Take a look at the [installation guide](https://xgboost.readthedocs.io/en/latest/build.html) if you are interested. The easiest way to install is:

`pip install xgboost`

Didn't work?

- macOS: you might have to `brew install libomp` first
- Windows: you might have to clone the XGBoost repo from git. [Windows build guide](https://xgboost.readthedocs.io/en/latest/build.html#building-on-windows)


In [3]:
# Right now I am getting a useless warning every time I fit an `XGBoost` model.
# This line of code prevents warnings from being displayed. Not generally
# recommended.
warnings.filterwarnings(action='ignore')

In [4]:
# Import the xgboost package
import xgboost as xgb

In [23]:
# Instantiate an XGBoost regressor
xgbr = xgb.XGBRegressor(random_state=1)
xgbr

XGBRegressor(base_score=None, booster=None, colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=None, gamma=None,
             gpu_id=None, importance_type='gain', interaction_constraints=None,
             learning_rate=None, max_delta_step=None, max_depth=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, num_parallel_tree=None,
             objective='reg:squarederror', random_state=1, reg_alpha=None,
             reg_lambda=None, scale_pos_weight=None, subsample=None,
             tree_method=None, validate_parameters=None, verbosity=None)

## XG Boost

Extreme Gradient Boosting!

**Code along**

- Load the Ames housing dataset from `data/ames_train.csv` in this lesson's base directory.

In [6]:
ames_df = pd.read_csv('data/ames_train.csv')

- Create a feature matrix DataFrame `X` containing all of the numeric columns from the Ames dataset except "Id" and the target column "SalePrice". Drop "OverallQual" to make things more interesting -- that variable is very predictive but expensive to collect.

In [7]:
ames_df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


- Create a target vector Series `y` with the values of the variable "SalePrice".

In [17]:
ames_df2 = ames_df.select_dtypes(['int64', 'int32', 'float64', 'float32']).dropna(axis='columns')

In [19]:
ames_df.describe().columns

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

In [18]:
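# Target vector: the sale price of each house
y = ames_df.SalePrice
# Keep only the numeric columns (public alternative to the private `_get_numeric_data`)
ames_df1 = ames_df.select_dtypes(include='number')
# Drop the target, the row identifier, and the deliberately excluded "OverallQual"
X = ames_df1.drop(["OverallQual", 'SalePrice', 'Id'], axis=1)
X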

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,60,65.0,8450,5,2003,2003,196.0,706,0,150,...,548,0,61,0,0,0,0,0,2,2008
1,20,80.0,9600,8,1976,1976,0.0,978,0,284,...,460,298,0,0,0,0,0,0,5,2007
2,60,68.0,11250,5,2001,2002,162.0,486,0,434,...,608,0,42,0,0,0,0,0,9,2008
3,70,60.0,9550,5,1915,1970,0.0,216,0,540,...,642,0,35,272,0,0,0,0,2,2006
4,60,84.0,14260,5,2000,2000,350.0,655,0,490,...,836,192,84,0,0,0,0,0,12,2008
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,60,62.0,7917,5,1999,2000,0.0,0,0,953,...,460,0,40,0,0,0,0,0,8,2007
1456,20,85.0,13175,6,1978,1988,119.0,790,163,589,...,500,349,0,0,0,0,0,0,2,2010
1457,70,66.0,9042,9,1941,2006,0.0,275,0,877,...,252,0,60,0,0,0,0,2500,5,2010
1458,20,68.0,9717,6,1950,1996,0.0,49,1029,0,...,240,366,0,112,0,0,0,0,4,2010


- Do a simple train/test split on `X` and `y`.

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.2)

- Fit an `XGBRegressor` on the training data.

In [39]:
xgbr.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=1, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

- Get an R^2 score for the training set.

In [40]:
from sklearn.metrics import r2_score
yhat_train = xgbr.predict(X_train)
train_r2 = r2_score(y_train, yhat_train)
train_r2

0.9997132805072122

In [41]:
xgbr.score(X_train, y_train)

0.9997132805072122

- Get an R^2 score for the test set.

In [42]:
yhat_test = xgbr.predict(X_test)
test_r2 = r2_score(y_test, yhat_test)
test_r2

0.8460193656617238

In [43]:
xgbr.score(X_test, y_test)

0.8460193656617236

- Is your model overfitting, underfitting, both, or neither? How do you know?
- It is overfitting: the training R^2 is nearly perfect (about 0.9997), while the test R^2 is much lower (about 0.846).

In [67]:
# play with parameters n_estimators and learning_rate to improve score
xgbr = xgb.XGBRegressor(random_state=1, n_estimators=70, learning_rate=0.25)
xgbr.fit(X_train, y_train)
xgbr.score(X_train, y_train)
xgbr.score(X_test, y_test)

0.866711407605456

## Tuning XGBoost

`XGBoost` provides many options that you can tune to improve predictive performance.

### `n_estimators` and `learning_rate`

The `learning_rate` controls how "aggressive" each tree is in trying to correct the errors of its predecessors.

- If it is too low, then getting good predictive performance will require a large value for `n_estimators` (and thus a lot of time).
- If it is too high, then the algorithm will keep overshooting the target and won't converge to good results.

Unlike with a random forest, setting `n_estimators` too high can hurt predictive performance with boosting because it leads to overfitting.

### Addressing Overfitting

#### Reducing Model Complexity

One way to address overfitting is to restrict model complexity more or less directly. `xgboost` provides many options for this purpose:

- Restricting tree shape
    - `max_depth` / `max_leaves` puts a hard limit on the depth or number of leaves in each tree
    - `gamma` is the minimum loss reduction required in order to make another split
    - `min_child_weight` is the minimum sum of observation weights (Hessians) required in each child node in order to make a split; for squared-error loss with unit sample weights, this is simply the number of observations in the node
- Restricting sizes of weights: `reg_alpha` and `reg_lambda` provide L1 and L2 regularization on leaf weights, respectively
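For reference, these options correspond to terms in the regularized objective described in the `xgboost` documentation. The notation here is an assumption chosen for this summary: $T$ is the number of leaves in a tree, $w_j$ is the value (weight) of leaf $j$, and $G$ and $H$ are the sums of the first and second derivatives of the loss over the observations in a node.

$$
\text{Obj} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
$$

With $\alpha = 0$, the gain from splitting a node into left ($L$) and right ($R$) children is

$$
\text{Gain} = \tfrac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma
$$

A split is only worth making when the gain is positive, so `gamma` acts as a per-split penalty, `reg_lambda` shrinks leaf values toward zero, and `min_child_weight` sets a lower bound on $H_L$ and $H_R$.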

#### Adding Randomness

Another way to address overfitting when ensembling is to add randomness to the process of training each item in the ensemble.

- `subsample` specifies what proportion of the data is used to train each tree.
- `colsample_bytree`, `colsample_bylevel`, and `colsample_bynode` specify what proportion of the features are available for each tree, each depth level, and each split, respectively.
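As a rough illustration of what row and column subsampling mean, the NumPy sketch below draws a fresh subset of rows and features for each tree. It is a simplified picture rather than XGBoost's internal sampling code, and the helper name `sample_rows_and_columns` is hypothetical.

```python
import numpy as np

def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, rng):
    """Draw the row and column indices that one tree would be allowed to see."""
    n_row_draw = max(1, int(subsample * n_rows))
    n_col_draw = max(1, int(colsample_bytree * n_cols))
    rows = rng.choice(n_rows, size=n_row_draw, replace=False)
    cols = rng.choice(n_cols, size=n_col_draw, replace=False)
    return np.sort(rows), np.sort(cols)

rng = np.random.default_rng(0)
for tree_index in range(3):
    rows, cols = sample_rows_and_columns(1000, 20, 0.8, 0.5, rng)
    print(f"tree {tree_index}: {rows.size} rows, features {cols.tolist()}")
```

Because every tree sees a different slice of the data, no single tree can memorize the full training set, which tends to reduce overfitting in the ensemble.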

### Example

We will use this general approach to tune our model:

- Find the optimal number of trees with default learning rate.
- Tune additional parameters.
- Lower learning rate and increase the number of trees.

#### Find Optimal Number of Trees with Default Learning Rate

In [None]:
# Split data by column


In [None]:
# Instantiate model


In [None]:
# Fit and score on all data


In [None]:
# Score with 5-fold CV


In [None]:
# Vary number of trees


#### Tune Additional Parameters

scikit-learn has a `GridSearchCV` class that will run a model with various hyperparameter combinations and identify the combination that generated the best cross-validation scores.

In [None]:
# Try a few values for "max_depth" and "min_child_weight"


In [None]:
# Find out best parameters and their score


In [None]:
# Try a few values for "subsample"


In [None]:
# Find out best parameters and their score


In [None]:
# Get report on grid search results


The effect of one hyperparameter typically depends on the values of other hyperparameters -- for instance, increasing "max_depth" will have no effect if "min_child_weight" is sufficiently large. For this reason, it is generally valuable to do grid searches over multiple parameters simultaneously, rather than fixing one hyperparameter at a time. However, testing many combinations of many parameters can take a long time.

#### Lower Learning Rate and Increase the Number of Trees

In [None]:
# Divide the learning rate by 10 and vary number of trees


## Summary

- `XGBoost` is a popular decision tree ensemble algorithm.
- `XGBoost` uses gradient boosting, meaning that each tree attempts to correct the errors of previous trees.
- scikit-learn's `GridSearchCV` helps with testing hyperparameter values.