# XGBoost Tutorial

This notebook walks through using XGBoost as a predictive model for both regression and classification.
All examples use the diamonds dataset that ships with seaborn.

The code in this notebook follows a DataCamp tutorial, linked below:

https://www.datacamp.com/tutorial/xgboost-in-python

## Imports

The following cell imports every package needed to run this notebook.
It also loads the diamonds dataset from seaborn.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

diamonds = sns.load_dataset('diamonds')
diamonds.head()

   carat      cut  color  clarity  depth  table  price     x     y     z
0   0.23    Ideal      E      SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium      E      SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good      E      VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium      I      VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good      J      SI2   63.3   58.0    335  4.34  4.35  2.75


Isolate features into X and the target into y.

In [2]:
X, y = diamonds.drop('price', axis=1), diamonds[['price']]

XGBoost can handle categorical features natively.
To enable this, convert the non-numeric columns to the pandas `category` dtype.

In [3]:
cats = X.select_dtypes(exclude=np.number).columns.tolist()

for col in cats:
    X[col] = X[col].astype('category')

Check the data types of the features.

In [4]:
X.dtypes

carat       float64
cut        category
color      category
clarity    category
depth       float64
table       float64
x           float64
y           float64
z           float64
dtype: object

Split the data into train and test sets.

In [5]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
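
When no `test_size` is given, `train_test_split` holds out 25% of the rows for testing. The sketch below reproduces only the size arithmetic in plain Python. `split_sizes` is a hypothetical helper, not a scikit-learn function, and 53,940 is the row count of the seaborn diamonds dataset.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test size.

    Hypothetical helper: the test count is rounded up and the
    remaining rows are used for training.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(53940)
print(n_train, n_test)
```

The rows themselves are chosen by a shuffle that `random_state=1` makes reproducible; the sketch does not model that part.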

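XGBoost comes with its own class for storing datasets, `DMatrix`, which is optimized for memory and speed.
Convert the training and testing sets into DMatrices. Passing `enable_categorical=True` tells XGBoost to use the `category` columns directly.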

In [6]:
dtrain_reg = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_reg = xgb.DMatrix(X_test, y_test, enable_categorical=True)

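Define the hyperparameters for the model and train it. The `reg:squarederror` objective minimizes squared error, and `gpu_hist` runs the histogram-based tree method on a GPU, so this cell requires a CUDA-capable device. XGBoost 2.0 and later deprecate `gpu_hist` in favor of `'tree_method': 'hist'` together with `'device': 'cuda'`.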

In [7]:
params = {'objective': 'reg:squarederror', 'tree_method': 'gpu_hist'}
n = 100

model = xgb.train(params=params, dtrain=dtrain_reg, num_boost_round=n)

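Make predictions on the test set and compute the root mean squared error (RMSE). Passing `squared=False` makes `mean_squared_error` return the RMSE instead of the MSE; scikit-learn 1.4 and later provide `root_mean_squared_error` for the same purpose.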

In [8]:
preds = model.predict(dtest_reg)

rmse = mean_squared_error(y_test, preds, squared=False)
print(f'RMSE of the base model: {rmse:.3f}')

RMSE of the base model: 543.203
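
To make the metric concrete, here is a minimal sketch of the RMSE calculation in plain Python. The `rmse` helper and the four price values are hypothetical and only illustrate the formula; the notebook itself relies on scikit-learn.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices and predictions (errors of -6, 8, 0 and 0)
actual = [326.0, 334.0, 2757.0, 2759.0]
predicted = [332.0, 326.0, 2757.0, 2759.0]
example_rmse = rmse(actual, predicted)
print(f"RMSE: {example_rmse:.3f}")
```

Because the errors are squared before averaging, the RMSE is expressed in the same units as the target (here, the diamond price) and penalizes large misses more heavily than small ones.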


Monitor the model's performance during training by passing evaluation sets. With `verbose_eval=10`, the RMSE on each set is printed every 10 rounds.

In [9]:
evals = [(dtrain_reg, "train"), (dtest_reg, "validation")]

model = xgb.train(params=params, 
                  dtrain=dtrain_reg, 
                  num_boost_round=n, 
                  evals=evals, 
                  verbose_eval=10)

[0]	train-rmse:3985.18329	validation-rmse:3930.52457
[10]	train-rmse:550.08330	validation-rmse:590.15023
[20]	train-rmse:488.51248	validation-rmse:551.73431
[30]	train-rmse:463.13288	validation-rmse:547.87843
[40]	train-rmse:447.69788	validation-rmse:546.57096
[50]	train-rmse:432.91655	validation-rmse:546.22557
[60]	train-rmse:421.24046	validation-rmse:546.28601
[70]	train-rmse:408.64125	validation-rmse:546.78238
[80]	train-rmse:396.41125	validation-rmse:544.69846
[90]	train-rmse:386.87996	validation-rmse:543.82192
[99]	train-rmse:378.30590	validation-rmse:543.20278


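Use early stopping to find a good number of boosting rounds. With `early_stopping_rounds=50`, training stops once the validation RMSE has not improved for 50 consecutive rounds.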

In [10]:
n = 10000
model = xgb.train(params=params, 
                  dtrain=dtrain_reg, 
                  num_boost_round=n, 
                  evals=evals, 
                  verbose_eval=10,
                  early_stopping_rounds=50)

[0]	train-rmse:3985.18329	validation-rmse:3930.52457
[10]	train-rmse:550.08330	validation-rmse:590.15023
[20]	train-rmse:488.51248	validation-rmse:551.73431
[30]	train-rmse:463.13288	validation-rmse:547.87843
[40]	train-rmse:447.69788	validation-rmse:546.57096
[50]	train-rmse:432.91655	validation-rmse:546.22557
[60]	train-rmse:421.24046	validation-rmse:546.28601
[70]	train-rmse:408.64125	validation-rmse:546.78238
[80]	train-rmse:396.41125	validation-rmse:544.69846
[90]	train-rmse:386.87996	validation-rmse:543.82192
[100]	train-rmse:377.66173	validation-rmse:542.92457
[110]	train-rmse:367.76765	validation-rmse:542.64203
[120]	train-rmse:356.78793	validation-rmse:542.36125
[130]	train-rmse:346.40116	validation-rmse:543.35004
[140]	train-rmse:341.56915	validation-rmse:543.26361
[150]	train-rmse:334.27548	validation-rmse:542.79733
[160]	train-rmse:326.12247	validation-rmse:543.01177
[167]	train-rmse:321.04059	validation-rmse:543.35679
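
The stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the patience idea, not XGBoost's actual callback; `early_stopping_round` and the score list are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stop_round) for a list of validation errors.

    Training stops at the first round that lies `patience` rounds past
    the best (lowest) score seen so far. If that never happens, the
    last round is reported as the stop round.
    """
    best_round = 0
    best_score = float("inf")
    for current_round, score in enumerate(scores):
        if score < best_score:
            best_score, best_round = score, current_round
        elif current_round - best_round >= patience:
            return best_round, current_round
    return best_round, len(scores) - 1

# Hypothetical validation RMSE per round (not taken from the log above)
val_rmse = [600.0, 560.0, 548.0, 545.0, 546.0, 547.0, 546.5, 549.0]
best, stopped = early_stopping_round(val_rmse, patience=3)
print(best, stopped)
```

Under the same rule, a log that ends at round 167 with a patience of 50 would be consistent with the lowest validation RMSE having occurred around round 117.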


Run 5-fold cross-validation to get a more reliable estimate of the model's performance than a single train/test split provides.

In [11]:
n = 1000

results = xgb.cv(
   params, dtrain_reg,
   num_boost_round=n,
   nfold=5,
   early_stopping_rounds=20
)
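
As a rough illustration of what `nfold=5` means, the sketch below partitions row indices into five folds, each serving once as the held-out set. `kfold_indices` is a hypothetical helper written for this example; it uses contiguous folds and skips the shuffling that `xgb.cv` applies by default.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) lists for contiguous k-fold splits."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional row each
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row lands in exactly one test fold, so the per-round mean and standard deviation that `xgb.cv` reports are computed over five models that each saw a different 80% of the training data.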

View the results from the cross-validation. Each row is one boosting round; only the first ten rounds are shown here.

In [12]:
results

   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0      3985.916350       10.487016     3988.423951      41.574104
1      2849.172043        8.442412     2851.153868      27.958226
2      2061.848631        5.249746     2065.243634      20.725252
3      1519.083661        4.211126     1525.289339      15.322446
4      1153.624523        3.514243     1165.898398      11.494377
5       911.815939        3.338913      930.413380      10.731272
6       757.556653        3.018800      781.823935       9.983300
7       661.581176        2.924716      691.241342       9.533563
8       603.728460        3.366991      638.008955      10.043351
9       568.420969        3.286696      606.541825      11.651092


The lowest mean RMSE across the held-out folds is shown below.

In [13]:
best_rmse = results['test-rmse-mean'].min()
best_rmse

550.8959336674216

XGBoost also works for classification. Here the target is the diamond's `cut`, and `price` becomes one of the features.

In [14]:
from sklearn.preprocessing import OrdinalEncoder

X, y = diamonds.drop("cut", axis=1), diamonds[['cut']]

# Encode y to numeric
y_encoded = OrdinalEncoder().fit_transform(y)

# Extract text features
cats = X.select_dtypes(exclude=np.number).columns.tolist()

# Convert to pd.Categorical
for col in cats:
   X[col] = X[col].astype('category')

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, random_state=1, stratify=y_encoded)
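
The encoding step maps each cut label to an integer. The sketch below shows the idea in plain Python, assuming codes are assigned in sorted label order, which is how `OrdinalEncoder` behaves by default when it infers the categories. `ordinal_encode` and the sample list are hypothetical.

```python
def ordinal_encode(labels):
    """Map each distinct label to an integer code in sorted order."""
    categories = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(categories)}
    return [code_of[label] for label in labels], categories

# Hypothetical sample of cut labels
cuts = ["Ideal", "Premium", "Good", "Premium", "Good", "Very Good", "Fair"]
codes, categories = ordinal_encode(cuts)
print(categories)
print(codes)
```

Sorted order is alphabetical, so the codes do not follow the quality ranking of the cuts. That is fine for a classification target, where the integers only act as class identifiers.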

Convert the DataFrames into DMatrices.

In [15]:
# Create classification matrices
dtrain_clf = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_clf = xgb.DMatrix(X_test, y_test, enable_categorical=True)

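Define the model parameters and run 5-fold cross-validation with three metrics. The `multi:softprob` objective outputs a probability for each class, and `num_class` is 5 because there are five cuts.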

In [16]:
params = {"objective": "multi:softprob", "tree_method": "gpu_hist", "num_class": 5}
n = 1000

results = xgb.cv(
   params, dtrain_clf,
   num_boost_round=n,
   nfold=5,
   metrics=["mlogloss", "auc", "merror"],
)

List the metrics recorded during cross-validation.

In [17]:
results.keys()

Index(['train-mlogloss-mean', 'train-mlogloss-std', 'train-auc-mean',
       'train-auc-std', 'train-merror-mean', 'train-merror-std',
       'test-mlogloss-mean', 'test-mlogloss-std', 'test-auc-mean',
       'test-auc-std', 'test-merror-mean', 'test-merror-std'],
      dtype='object')
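
Two of these metrics are easy to compute by hand from per-class probabilities. The sketch below is a simplified illustration in plain Python, not XGBoost's implementation; the helper functions, the three-class probability rows, and the labels are all hypothetical (the notebook's model has five classes).

```python
import math

def mlogloss(y_true, probabilities):
    """Mean negative log of the probability given to the true class."""
    losses = [-math.log(row[label]) for label, row in zip(y_true, probabilities)]
    return sum(losses) / len(losses)

def merror(y_true, probabilities):
    """Fraction of rows whose most probable class is not the true class."""
    predicted = [row.index(max(row)) for row in probabilities]
    wrong = sum(1 for label, pred in zip(y_true, predicted) if label != pred)
    return wrong / len(y_true)

# Hypothetical softprob-style output: 4 rows, 3 classes, each row sums to 1
probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.30, 0.40, 0.30],
    [0.25, 0.25, 0.50],
]
labels = [0, 1, 2, 2]

loss_value = mlogloss(labels, probs)
error_rate = merror(labels, probs)
print(f"mlogloss: {loss_value:.4f}, merror: {error_rate:.2f}")
```

Lower is better for both `mlogloss` and `merror`, whereas a higher `auc` indicates a better model.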

A higher AUC is better, so the best score is the maximum of the mean test AUC.

In [18]:
results['test-auc-mean'].max()

0.9402233623451636