## Intro to XGBoost

This quick notebook shows how to use the very basics of XGBoost and why it's a useful open-source library.

Theo Baker | CSC630 | November 5, 2021

In [1]:
!pip install xgboost
!brew install libomp

Updating Homebrew...
==> Auto-updated Homebrew!
Updated 1 tap (homebrew/cask).
==> Updated Casks
Updated 1 cask.

To reinstall 13.0.0, run:
  brew reinstall libomp


In [2]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV


df = pd.read_csv("/Users/theobaker/Downloads/train.csv")
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


I've loaded a dataset from a Kaggle competition about the Titanic because it's freezing in my room and that feels appropriate. As you can see, each passenger has a number of attributes plus a binary outcome, Survived, recording whether or not they survived (0 = No, 1 = Yes). Pclass is the ticket class, SibSp is the number of siblings/spouses aboard the Titanic, Parch is the number of parents/children aboard, Fare is the passenger fare, Cabin is the cabin number, and Embarked is the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). For the purposes of this easy demo, we're just going to use Sex and Age. Let's start by turning Sex from text into binary indicator columns.

In [3]:
dummies = pd.get_dummies(df['Sex'])
df = pd.concat([df, dummies], axis=1)
X = df[['Age','female','male']]
y = df['Survived']
seed = 42
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

In [4]:
model = XGBClassifier(objective= 'reg:logistic')
model.fit(X_train, y_train)
print(model)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1,
              objective='reg:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
Accuracy: 76.49%




As we can see, these two predictors alone give us roughly 76.5% accuracy in predicting the survival of our subjects! And it took just two lines to create and fit the model, without tuning any of our hyperparameters or using the rest of the dataset! From here, the model is ready to make predictions! But wait, what were those hyperparameters I just mentioned...
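Before moving on, it can help to see what the accuracy number actually measures: the fraction of predictions that match the true labels. The sketch below computes it by hand and compares it against a "always guess the most common label" baseline. The labels are toy values invented for illustration, not rows from the Titanic data, and the `accuracy` helper is a hypothetical stand-in for what `accuracy_score` reports.

```python
# Toy labels, invented for illustration; not taken from the Titanic data.
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_hat = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0]


def accuracy(actual, predicted):
    """Fraction of positions where the prediction matches the label."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Baseline: always predict the most common label in y_true.
majority = max(set(y_true), key=y_true.count)
baseline = accuracy(y_true, [majority] * len(y_true))

print("Model accuracy:    %.2f%%" % (accuracy(y_true, y_hat) * 100.0))
print("Baseline accuracy: %.2f%%" % (baseline * 100.0))
```

An accuracy figure is easiest to judge next to a baseline like this one, since a model only earns its keep by beating the trivial guess.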

In [5]:
{"learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
 "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
 "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ] }

{'learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
 'max_depth': [3, 4, 5, 6, 8, 10, 12, 15],
 'min_child_weight': [1, 3, 5, 7],
 'gamma': [0.0, 0.1, 0.2, 0.3, 0.4],
 'colsample_bytree': [0.3, 0.4, 0.5, 0.7]}

Those are examples of some hyperparameters and values you might try for them. max_depth is the maximum depth per tree: a deeper tree might increase performance but could also lead to overfitting. The default is 6. learning_rate is the step size shrinkage applied at each boosting iteration. colsample_bytree is the fraction of features (columns) sampled for each tree, while the fraction of observations (rows) is controlled by a separate parameter, subsample; lower values of either help prevent overfitting but can lead to underfitting. gamma is a regularization parameter: it is the minimum loss reduction required to make a further split, so the higher it is, the stronger the regularization. Each hyperparameter (the full list is available [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn)) can have a significant effect on how your model behaves, and tuning those hyperparameters is a large part of a data scientist's job.
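Two sampling hyperparameters are easy to mix up: subsample works on rows (observations), while colsample_bytree works on columns (features). The library-free sketch below shows the idea for a single tree. It is only an illustration and not how XGBoost does it internally; `sample_for_tree` is a hypothetical helper, and the Pclass and Fare columns are added purely so the column sampling has something to choose from.

```python
import random


def sample_for_tree(n_rows, feature_names, subsample, colsample_bytree, rng):
    """Pick the row indices and feature names one tree would be allowed to see."""
    n_keep_rows = max(1, round(n_rows * subsample))
    n_keep_cols = max(1, round(len(feature_names) * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_keep_rows))
    cols = sorted(rng.sample(feature_names, n_keep_cols))
    return rows, cols


rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Pclass and Fare are included only for illustration.
features = ["Age", "female", "male", "Pclass", "Fare"]

rows, cols = sample_for_tree(10, features, subsample=0.4, colsample_bytree=0.6, rng=rng)
print("Rows this tree sees:   ", rows)  # 4 of the 10 row indices
print("Columns this tree sees:", cols)  # 3 of the 5 feature names
```

Each tree would get a fresh draw, so different trees see different slices of the data, which is where the protection against overfitting comes from.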

In [6]:
model = XGBClassifier(objective= 'reg:logistic', max_depth= 10, n_estimators = 100, subsample = 0.4)
model.fit(X_train, y_train)
print(model)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=10,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1,
              objective='reg:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=0.4,
              tree_method='exact', validate_parameters=1, verbosity=None)
Accuracy: 78.36%


Hmmm, well, that wasn't *quite* the jump we might've hoped for. But still! Things got better! To tune those hyperparameters, I used ~ intuition ~ and did my best to choose values that made sense for the data. The other thing you can do is search for good values algorithmically. Different search strategies work in different ways: some, like grid search, simply try every possible combination, brute-forcing the best solution, and some, like RandomizedSearchCV from scikit-learn, are less time-intensive. Random search takes a range (or distribution) of values for each hyperparameter and evaluates a specified number of randomly sampled combinations.
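To get a feel for why random search is cheaper than brute force, here is a small standard-library sketch that counts how many combinations an exhaustive search would need and then draws a fixed number of them at random. The search space is made up for illustration, and no models are trained; it only mimics the candidate-selection step, not scikit-learn's actual implementation.

```python
import itertools
import random

# Made-up search space, for illustration only.
space = {
    "max_depth": [3, 5, 6, 10, 15, 20],
    "learning_rate": [0.01, 0.1, 0.2, 0.3],
    "n_estimators": [100, 500, 1000],
}

# Exhaustive (grid) search would have to try every combination: 6 * 4 * 3.
grid = list(itertools.product(*space.values()))
print("Grid search candidates:", len(grid))

# Random search draws a fixed number of distinct combinations instead.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_iter = 25
candidates = [dict(zip(space.keys(), combo)) for combo in rng.sample(grid, n_iter)]
print("Random search candidates:", len(candidates))
print("First candidate:", candidates[0])
```

Every extra hyperparameter multiplies the size of the grid, while the cost of random search stays fixed at n_iter candidates (times the number of cross-validation folds).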

In [8]:
params = { 'max_depth': [3, 5, 6, 10, 15, 20],
           'learning_rate': [0.01, 0.1, 0.2, 0.3],
           'subsample': np.arange(0.5, 1.0, 0.1),
           'colsample_bytree': np.arange(0.4, 1.0, 0.1),
           'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
           'n_estimators': [100, 500, 1000]}
xgbr = XGBClassifier(objective= 'reg:logistic', seed = 20, use_label_encoder =False)
clf = RandomizedSearchCV(estimator=xgbr,
                         param_distributions=params,
                         scoring='accuracy',  # classification metric; ranks 0/1 predictions the same way as negative MSE
                         n_iter=25,
                         verbose=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Best parameters:", clf.best_params_)
print("Best Score: %.2f%%" % (accuracy * 100.0))

Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 125 out of 125 | elapsed:   13.1s finished


Best parameters: {'subsample': 0.7999999999999999, 'n_estimators': 1000, 'max_depth': 15, 'learning_rate': 0.01, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.7999999999999999}
Best Score: 80.60%


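Awesome! Note that the "Best Score" printed above is the accuracy on our held-out test set, up from 78.36% with the hand-picked values. The truth is, with such a small dataset and so few features, our optimization didn't have all that much to do. But this is an explanatory notebook, and for that I say job done!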