<h1 style="text-align:center;">XGBoost vs Gradient Boosting</h1>

## Introduction

XGBoost stands for "Extreme Gradient Boosting," and it is an optimized implementation of the gradient boosting technique. The fundamental idea is the same as in its predecessor (which we tackled in the previous notebook): each new decision tree is trained on the errors, or residuals, left by the trees before it, and the predictions of these weak learners are added together into a stronger model.

One thing you will notice in XGBoost is a terminology difference in the hyperparameters. The learning rate, called `learning_rate` in scikit-learn's gradient boosting, is natively called `eta` in XGBoost, which also accepts `learning_rate` as an alias. This small naming variation aside, XGBoost uses the same boosting procedure to iteratively improve the predictive accuracy of a model, which makes it a robust tool for machine learning and predictive analytics.
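To make the residual-fitting idea and the role of the learning rate concrete, here is a minimal sketch in plain Python. It is a toy, not how scikit-learn or XGBoost are implemented: the helpers `fit_stump`, `predict_stump`, and `boost` are hypothetical names, the data is made up, and each "tree" is a single-split stump on one feature.

```python
# Toy gradient boosting for squared error, using one-feature decision stumps.
# Illustrative only: fit_stump, predict_stump and boost are hypothetical
# helpers, not part of scikit-learn or XGBoost.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_estimators=50, eta=0.1):
    """Fit stumps one after another; each one targets the current residuals."""
    base = sum(ys) / len(ys)          # initial prediction: the mean of y
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate (eta).
        preds = [p + eta * predict_stump(stump, x) for x, p in zip(xs, preds)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Made-up data for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]

base, stumps, preds = boost(xs, ys, n_estimators=50, eta=0.1)
baseline_mse = mse(ys, [base] * len(ys))
boosted_mse = mse(ys, preds)
print(f"MSE with mean only: {baseline_mse:.3f}")
print(f"MSE after boosting: {boosted_mse:.3f}")
```

A smaller `eta` means each stump corrects only a fraction of the remaining error, so more estimators are needed. That is the same trade-off the `n_estimators` and `learning_rate`/`eta` settings control in the cells below.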

In [1]:
import pandas as pd
import numpy as np

import requests
import io

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from xgboost import XGBRegressor, XGBClassifier

from sklearn.metrics import accuracy_score
from zipfile import ZipFile
from time import time

import matplotlib.pyplot as plt
from helper_file import *
import warnings

warnings.filterwarnings('ignore')

In [2]:
df_bikes.sample(n=3, random_state=43)

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
335,336,2011-12-02,4.0,0.0,12.0,0.0,5.0,1.0,1,0.314167,0.331433,0.625833,0.100754,268,3672,3940
631,632,2012-09-23,4.0,1.0,9.0,0.0,0.0,0.0,1,0.529167,0.518933,0.467083,0.223258,2454,5453,7907
620,621,2012-09-12,3.0,1.0,9.0,0.0,3.0,1.0,1,0.599167,0.570075,0.577083,0.131846,1050,6820,7870


In [3]:
bikes_X, bikes_y = splitX_y(df_bikes, 'cnt')

print(f"shape of target vector: {bikes_y.shape}")
print(f"shape of feature matrix: {bikes_X.shape}")

shape of target vector: (731,)
shape of feature matrix: (731, 15)


In [4]:
bikes_X_train, bikes_X_test, bikes_y_train, bikes_y_test = train_test_split(
        bikes_X, bikes_y, random_state=43)

In [5]:
pipe = Pipeline(
    [('tweak', PrepDataTransformer()),
     ('imputer', SimpleImputer(strategy='median')),  # Imputing null values using the median
     ('scaler', StandardScaler())
    ]
)

X_train = pipe.fit_transform(bikes_X_train)
X_test = pipe.transform(bikes_X_test)

In [6]:
best_params = {'max_depth': 2, 
               'subsample': 0.5, 
               'n_estimators': 500, 
               'learning_rate': 0.05}

In [7]:
best_model = GradientBoostingRegressor(**best_params, random_state=43)
best_model.fit(X_train, bikes_y_train)
y_pred = best_model.predict(X_test)
rmse_test = MSE(bikes_y_test, y_pred)**0.5
print(f"Test set score: {rmse_test:.3f}")

Test set score: 598.540


In [8]:
best_params = {'max_depth': 2, 
               'subsample': 0.5, 
               'n_estimators': 500, 
               'eta': 0.05}

In [9]:
xg_reg = XGBRegressor(**best_params, random_state=43)

xg_reg.fit(X_train, bikes_y_train)

y_pred = xg_reg.predict(X_test)

rmse_test = MSE(bikes_y_test, y_pred)**0.5
print(f"Test set score: {rmse_test:.3f}")

Test set score: 604.429


These scores are close: a test RMSE of 598.5 for scikit-learn's gradient boosting versus 604.4 for XGBoost, a gap of about 1%. To see where `xgboost` really differs, we need to go deeper with a different dataset.
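To put the two test scores side by side, here is a small sketch that computes an RMSE by hand and then the relative gap between the two printed values. The `rmse` helper is a hypothetical stand-in for `MSE(...)**0.5`; the two scores are copied from the outputs above.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error, equivalent to MSE(...) ** 0.5 in the cells above."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Tiny made-up example: errors of 2 and 0 give sqrt((4 + 0) / 2).
toy_rmse = rmse([3, 5], [1, 5])
print(f"toy RMSE: {toy_rmse:.3f}")

# Scores copied from the two test-set outputs above.
gbr_rmse = 598.540
xgb_rmse = 604.429
relative_gap = (xgb_rmse - gbr_rmse) / gbr_rmse
print(f"relative gap: {relative_gap:.2%}")
```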

## Approaching big data – gradient boosting versus XGBoost

This time we will go looking for exoplanets. The exoplanet dataset contains 5,087 rows and 3,198 columns (one label plus 3,197 flux measurements), recording each star's light flux at successive points in time. Multiplying 5,087 rows by 3,198 columns gives roughly 16.3 million data points in the dataset.

Consider what a random forest with a baseline of 100 trees would mean here. The ensemble does not need more data, but it does need more work: each tree is built from a bootstrap sample of the rows and searches random subsets of the columns for splits, so the training cost grows with every tree added. Fitting 100 trees means working through the dataset on the order of 100 times, which is why training speed becomes the deciding factor on a table this wide.
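As a sanity check on the sizes involved, the arithmetic can be done directly from the shape that `df.info()` reports further down (5,087 rows by 3,198 columns). The 100-tree figure is only a rough upper bound, under the assumption that every tree looks at a sample about as large as the full table.

```python
n_rows = 5087
n_columns = 3198   # LABEL plus 3,197 flux columns, per df.info()
n_trees = 100      # baseline ensemble size discussed above

values_in_table = n_rows * n_columns
print(f"values in the table: {values_in_table:,}")

# Rough upper bound (assumption): each tree scans a sample the size of the table.
values_touched = values_in_table * n_trees
print(f"values touched by {n_trees} trees: {values_touched:,}")
```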

### Kepler's Starlight Dataset from Kaggle (circa 2017)

Here's this cool dataset on [Kaggle](https://www.kaggle.com/keplersmachines/kepler-labelled-time-series-data). Now, here's the neat bit:

- **What's Inside?** Patterns of starlight! Imagine every row as a unique star, and the columns? They're like chapters of a book, each telling a snippet of the star's luminous tale over time.
- **Got Planets?** Oh, absolutely! There's a column named `LABEL`. A '2' in there? That star's got a buddy planet! A '1'? It's the star shining solo.

And for those diving into the cosmic jargon, this dataset is all about the light these stars beam out. That's called **light flux** or, for the poetic ones among us, **luminous flux**. It's like the universe's brightness meter.

In terms of types, the flux columns are floats, while the `LABEL` column is an integer: `2` for an exoplanet star and `1` for a non-exoplanet star.
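Because the labels are coded `2` and `1` rather than `1` and `0`, they get remapped later on before being handed to `XGBClassifier`. As far as I know, recent XGBoost releases expect class labels to start at 0. Here is a minimal sketch of that remapping on a made-up list of labels, with a class count as a quick check. The cell further down does the same thing with `Series.map`.

```python
from collections import Counter

# Made-up labels for illustration: 2 = exoplanet star, 1 = non-exoplanet star.
raw_labels = [2, 1, 1, 2, 1, 1, 1]

# Same mapping as used later in the notebook: 1 -> 0, 2 -> 1.
label_map = {1: 0, 2: 1}
encoded = [label_map[value] for value in raw_labels]

counts = Counter(encoded)
print(f"encoded labels: {encoded}")
print(f"class counts: {dict(counts)}")
```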

In [10]:
url = "data/exoplanets.zip"

In [12]:
with ZipFile(url, 'r') as z:
    with z.open(z.namelist()[0]) as f:
        df = pd.read_csv(f)

df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,93.85,83.81,20.1,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,...,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,2,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,...,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.7,6.46,16.0,19.93
2,2,532.64,535.92,513.73,496.92,456.45,466.0,464.5,486.39,436.56,...,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.8,-28.91,-70.02,-96.67
3,2,326.52,347.39,302.35,298.13,317.74,312.7,322.33,311.31,312.42,...,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,2,-1107.21,-1112.59,-1118.95,-1095.1,-1057.55,-1034.48,-998.34,-1022.71,-989.57,...,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5087 entries, 0 to 5086
Columns: 3198 entries, LABEL to FLUX.3197
dtypes: float64(3197), int64(1)
memory usage: 124.1 MB


In [14]:
df.isnull().sum().sum()

0

In [15]:
X, y = splitX_y(df, 'LABEL')

X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=43)

print(f"shape of target vector: {y.shape}")
print(f"shape of feature matrix: {X.shape}")

shape of target vector: (5087,)
shape of feature matrix: (5087, 3197)


### Building gradient boosting classifiers

It's time to build a gradient boosting classifier to predict whether stars host exoplanets. Gradient boosting classifiers work in the same manner as gradient boosting regressors; the difference is primarily in the scoring, which here is accuracy rather than RMSE.
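For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below is a hand-rolled stand-in for `accuracy_score` on made-up labels. The `accuracy` helper is hypothetical and only meant to show what the score measures.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up labels: one of the two positives is missed.
y_true = [0, 0, 0, 1, 0, 1, 0, 0]
y_pred = [0, 0, 0, 1, 0, 0, 0, 0]

score = accuracy(y_true, y_pred)
print(f"accuracy: {score}")
```

Swapping the two arguments gives the same number, since the score only counts matches.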

Let's now compare `GradientBoostingClassifier` and `XGBClassifier` on the exoplanet dataset for speed, using `time()` to record how long each takes to fit and predict. We have set `max_depth=2` and `n_estimators=100` to limit the size of the model. Let's start with `GradientBoostingClassifier`.

In [16]:
start = time()

gbc = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=43)

gbc.fit(X_train, y_train)

y_pred = gbc.predict(X_test)

# accuracy_score takes the true labels first, then the predictions
score = accuracy_score(y_test, y_pred)

print(f'Score: {score}')

end = time()

elapsed = end - start

print(f'\nRun Time: {elapsed/60} minutes')

Score: 0.9889937106918238

Run Time: 4.937528161207835 minutes


In [17]:
start = time()
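# Remap the labels to 0 (no exoplanet) and 1 (exoplanet) for XGBClassifier
y_train = y_train.map({1: 0, 2: 1})
y_test = y_test.map({1: 0, 2: 1})

xgb_clf = XGBClassifier(n_estimators=100, max_depth=2, random_state=43)

xgb_clf.fit(X_train, y_train)

y_pred = xgb_clf.predict(X_test)

# accuracy_score takes the true labels first, then the predictions
score = accuracy_score(y_test, y_pred)

print(f'Score: {score}')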

end = time()

elapsed = end - start

print(f'\nRun Time: {elapsed} seconds')

Score: 0.9921383647798742

Run Time: 19.283082485198975 seconds


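Wow! The difference in timing is just staggering: about 4.9 minutes for `GradientBoostingClassifier` versus about 19 seconds for `XGBClassifier`, roughly 15 times faster, with a slightly higher accuracy (0.992 versus 0.989). In the world of big data, XGBoost is the bomb.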
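The start/end bookkeeping in the two timing cells above could be wrapped in a small helper. This is only a sketch using the standard library: `timed` is a hypothetical name, and the workload inside the `with` block is a placeholder rather than a model fit. The last lines recompute the speedup from the two run times printed above.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the with-block under `label`."""
    start = perf_counter()
    try:
        yield
    finally:
        results[label] = perf_counter() - start


timings = {}
with timed("toy workload", timings):
    # Placeholder work; in the notebook this would be fit + predict.
    total = sum(i * i for i in range(100_000))

print(f"toy workload: {timings['toy workload']:.4f} seconds")

# Run times copied from the two outputs above.
gbc_seconds = 4.937528161207835 * 60   # reported in minutes
xgb_seconds = 19.283082485198975       # reported in seconds
speedup = gbc_seconds / xgb_seconds
print(f"speedup: {speedup:.1f}x")
```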