<h1 style="text-align:center;">Early Stopping in XGBoost</h1>

# Introduction

Imagine you're trying to fill a glass of water. You want it to be full, but not overflowing. If you keep the tap on indefinitely, it'll spill, right? That's a bit like training a machine learning model. If you train it for too long, it might just "spill" or overfit. This is where "early stopping" comes into play.

Think of early stopping as a friend who tells you to stop pouring water once the glass is full. In machine learning, rather than pouring water, you're improving your model's performance. You keep on training until your model stops getting better, or even starts to get worse. 

Instead of saying, "I'll train for 1000 rounds no matter what," you say, "I'll train until my model stops improving." But how long should you wait before deciding it's not improving? 10 rounds? 100 rounds? That's up to you.

**Example**: Imagine you're climbing a hill. You want to get to the top (best performance). Each step you take is like a training round. Now, if you take 10 steps and don’t reach a higher point than before, maybe it's time to say, "I'm probably at the top." That's early stopping.

In XGBoost:
- `early_stopping_rounds=10` means training stops once 10 consecutive rounds pass without the evaluation metric beating its best value so far.
- `early_stopping_rounds=100` means you'll be more patient and wait for 100 such rounds.
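To make the patience rule concrete, here is a minimal pure-Python sketch of the counting logic. It is a simplified illustration, not XGBoost's internal code. The function name `simulate_early_stopping` is made up for this example, and the error values are the first 31 rounds of the validation log printed further down in this notebook.

```python
def simulate_early_stopping(val_errors, patience):
    """Apply a patience rule to a list of per-round validation errors.

    Returns (best_round, stop_round); stop_round is None if the rule never fired.
    """
    best_round, best_error = 0, val_errors[0]
    rounds_without_improvement = 0
    for round_idx, error in enumerate(val_errors[1:], start=1):
        if error < best_error:  # only a strict improvement resets the counter
            best_round, best_error = round_idx, error
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return best_round, round_idx
    return best_round, None


# Validation errors for rounds 0-30, copied from the notebook's training log.
val_errors = [
    0.22368, 0.18421, 0.21053, 0.19737, 0.18421, 0.15789, 0.15789, 0.15789,
    0.17105, 0.17105, 0.17105, 0.17105, 0.17105, 0.17105, 0.15789, 0.14474,
    0.13158, 0.14474, 0.13158, 0.13158, 0.14474, 0.14474, 0.14474, 0.13158,
    0.11842, 0.14474, 0.14474, 0.13158, 0.13158, 0.13158, 0.13158,
]

print(simulate_early_stopping(val_errors, patience=5))   # (5, 10)
print(simulate_early_stopping(val_errors, patience=10))  # (24, None)
```

With a patience of 5 the loop gives up at round 10 and settles for round 5. With a patience of 10 it survives the plateau between rounds 6 and 14 and goes on to find the better score at round 24.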

## **But How Do We Know If We're Improving?**

Great question! We need a way to measure. That's where `eval_set` and `eval_metric` come in.

### **1. `eval_set`**
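This is like having a checkpoint on your hill climb. You check your altitude (performance) here. In machine learning, it's a set of data you use to evaluate your model's performance after each round. In this notebook it's the held-out split (`X_test`, `y_test`); in practice a separate validation set is preferable, so that the final test score isn't influenced by the stopping decision.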
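Because early stopping chooses the number of rounds by looking at `eval_set`, one common remedy is to carve out a third split that is used only for the final score. The sketch below shows the idea on row indices alone. The helper `three_way_split` and the 20% fractions are assumptions made for illustration, not part of this notebook's pipeline.

```python
import random


def three_way_split(n_rows, val_frac=0.2, test_frac=0.2, seed=43):
    """Return disjoint (train, val, test) index lists that together cover range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


# 303 is the number of rows in the heart disease data used in this notebook.
train_idx, val_idx, test_idx = three_way_split(303)
print(len(train_idx), len(val_idx), len(test_idx))  # 183 60 60
```

The validation rows would go into `eval_set`, and the test rows would be touched only once, after training has stopped.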

### **2. `eval_metric`**
This is your altitude meter. It tells you how high you are (how good your model is). For instance:
- If you're classifying things (like apples vs oranges), you might use `error` as your metric.
- If you're predicting numbers (like house prices), `rmse` might be your pick.

In the XGBoost version used here (1.7.6), `eval_metric` goes into the `XGBClassifier` constructor and `eval_set` into `.fit()`, so the algorithm knows how it's doing after each round.
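To see what the two metrics named above actually measure, they can be computed by hand. This is a small sketch with made-up numbers. The function names are chosen for this example, and the 0.5 threshold follows the documented definition of XGBoost's `error` metric.

```python
import math


def classification_error(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    wrong = sum(int(p > threshold) != t for t, p in zip(y_true, y_prob))
    return wrong / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical labels and predicted probabilities for four examples.
labels = [1, 0, 1, 1]
probs = [0.9, 0.4, 0.3, 0.6]
print(classification_error(labels, probs))  # 0.25 (one of four is wrong)

# Hypothetical house prices (in thousands) and predictions.
prices = [200.0, 310.0]
predicted = [190.0, 330.0]
print(round(rmse(prices, predicted), 3))  # 15.811
```

For both metrics lower is better, which is why early stopping treats a drop in the logged value as an improvement.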

## **Wrapping Up**

Always remember, machine learning is like exploring nature. You're trying to find patterns, and sometimes you don't need to look forever to see them. Early stopping helps ensure you see enough, but not so much that you start imagining patterns that aren't there.

And as with many things in life and science, it's about balance. Train enough to understand, but stop before you over-complicate things.

In [1]:
import os
import warnings

os.environ['PYTHONWARNINGS'] = 'ignore::FutureWarning' 

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import (train_test_split, cross_val_score, 
                        StratifiedKFold, GridSearchCV, RandomizedSearchCV)

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

from helper_file import *  # local helper module; provides splitX_y, used below

warnings.filterwarnings("ignore", category=FutureWarning) 
# export PYTHONWARNINGS="ignore::FutureWarning"

In [2]:
import xgboost
print(xgboost.__version__)

1.7.6


In [3]:
data_path = "data/heart_disease.csv"
df = pd.read_csv(data_path)
df.sample(n=5, random_state=43)

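| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 242 | 64 | 1 | 0 | 145 | 212 | 0 | 0 | 132 | 0 | 2.0 | 1 | 2 | 1 | 0 |
| 130 | 54 | 0 | 2 | 160 | 201 | 0 | 1 | 163 | 0 | 0.0 | 2 | 1 | 2 | 1 |
| 208 | 49 | 1 | 2 | 120 | 188 | 0 | 1 | 139 | 0 | 2.0 | 1 | 3 | 3 | 0 |
| 160 | 56 | 1 | 1 | 120 | 240 | 0 | 1 | 169 | 0 | 0.0 | 0 | 0 | 2 | 1 |
| 124 | 39 | 0 | 2 | 94 | 199 | 0 | 1 | 179 | 0 | 0.0 | 2 | 0 | 2 | 1 |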


In [4]:
X, y = splitX_y(df, 'target')

print(f"shape of target vector: {y.shape}")
print(f"shape of feature matrix: {X.shape}")

shape of target vector: (303,)
shape of feature matrix: (303, 13)


In [5]:
# Declare eval_metric:
eval_metric = 'error'

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)

# set up your XGBoost model with eval_metric
model = XGBClassifier(booster='gbtree', objective='binary:logistic', 
                      eval_metric=eval_metric,
                      random_state=43)

In [8]:
# Declare eval_set:
eval_set = [(X_test, y_test)]

In [9]:
# Fit the model with an eval_set:
model.fit(X_train, y_train, eval_set=eval_set)

[0]	validation_0-error:0.22368
[1]	validation_0-error:0.18421
[2]	validation_0-error:0.21053
[3]	validation_0-error:0.19737
[4]	validation_0-error:0.18421
[5]	validation_0-error:0.15789
[6]	validation_0-error:0.15789
[7]	validation_0-error:0.15789
[8]	validation_0-error:0.17105
[9]	validation_0-error:0.17105
[10]	validation_0-error:0.17105
[11]	validation_0-error:0.17105
[12]	validation_0-error:0.17105
[13]	validation_0-error:0.17105
[14]	validation_0-error:0.15789
[15]	validation_0-error:0.14474
[16]	validation_0-error:0.13158
[17]	validation_0-error:0.14474
[18]	validation_0-error:0.13158
[19]	validation_0-error:0.13158
[20]	validation_0-error:0.14474
[21]	validation_0-error:0.14474
[22]	validation_0-error:0.14474
[23]	validation_0-error:0.13158
[24]	validation_0-error:0.11842
[25]	validation_0-error:0.14474
[26]	validation_0-error:0.14474
[27]	validation_0-error:0.13158
[28]	validation_0-error:0.13158
[29]	validation_0-error:0.13158
[30]	validation_0-error:0.13158
[31]	validation_0-

In [10]:
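# Get the best iteration.
# Note: no early stopping was configured for this fit, so in this XGBoost
# version (1.7.6) best_iteration is simply the index of the last round.
best_iteration = model.get_booster().best_iteration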

# Since iteration starts from 0, add 1 to get the number of trees
num_trees = best_iteration + 1

print(f"The best score was achieved using {num_trees} trees.")


The best score was achieved using 100 trees.


In [11]:
# Check the final score:

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 84.21%


### **`early_stopping_rounds`**

`early_stopping_rounds` is an optional parameter to include with `eval_metric` and `eval_set` when fitting a model.

Let's try `early_stopping_rounds=10`.

The previous code is repeated with `early_stopping_rounds=10` added in:

In [12]:
# Create the model with the necessary parameters
model = XGBClassifier(booster='gbtree', objective='binary:logistic', 
                      eval_metric=eval_metric, early_stopping_rounds=10,
                      random_state=43)

# early_stopping_rounds is set in the constructor above, so fit() only needs eval_set
model.fit(X_train, y_train, eval_set=eval_set, verbose=True)

# Predict using the trained model
y_pred = model.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"Accuracy: {accuracy * 100:.2f}%")


[0]	validation_0-error:0.22368
[1]	validation_0-error:0.18421
[2]	validation_0-error:0.21053
[3]	validation_0-error:0.19737
[4]	validation_0-error:0.18421
[5]	validation_0-error:0.15789
[6]	validation_0-error:0.15789
[7]	validation_0-error:0.15789
[8]	validation_0-error:0.17105
[9]	validation_0-error:0.17105
[10]	validation_0-error:0.17105
[11]	validation_0-error:0.17105
[12]	validation_0-error:0.17105
[13]	validation_0-error:0.17105
[14]	validation_0-error:0.15789
[15]	validation_0-error:0.14474
[16]	validation_0-error:0.13158
[17]	validation_0-error:0.14474
[18]	validation_0-error:0.13158
[19]	validation_0-error:0.13158
[20]	validation_0-error:0.14474
[21]	validation_0-error:0.14474
[22]	validation_0-error:0.14474
[23]	validation_0-error:0.13158
[24]	validation_0-error:0.11842
[25]	validation_0-error:0.14474
[26]	validation_0-error:0.14474
[27]	validation_0-error:0.13158
[28]	validation_0-error:0.13158
[29]	validation_0-error:0.13158
[30]	validation_0-error:0.13158
[31]	validation_0-

In [13]:
# Get the best iteration
best_iteration = model.get_booster().best_iteration

# Since iteration starts from 0, add 1 to get the number of trees
num_trees = best_iteration + 1

print(f"The best score was achieved using {num_trees} trees.")


The best score was achieved using 25 trees.


In [18]:
# Predict using the trained model
y_pred = model.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"Accuracy: {accuracy * 100:.2f}%")


Accuracy: 88.16%
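The 88.16% lines up with the lowest validation error in the log, 0.11842 at round 24. Since `error` is measured on the same held-out rows, accuracy is just one minus that error, which is consistent with predictions being made from the best round. The arithmetic below checks this. The test-set size of 76 is an assumption derived from `train_test_split`'s default `test_size` of 0.25 applied to 303 rows.

```python
import math

n_rows = 303
test_fraction = 0.25  # assumed: train_test_split's default test_size
n_test = math.ceil(n_rows * test_fraction)

best_error = 0.11842  # lowest validation_0-error in the log (round 24)
misclassified = round(best_error * n_test)
accuracy = 1 - best_error

print(n_test, misclassified, f"{accuracy * 100:.2f}%")  # 76 9 88.16%

# The earlier run without early stopping reported 84.21% accuracy,
# which corresponds to 12 misclassified rows out of the same 76.
print(round((1 - 0.8421) * n_test))  # 12
```

So early stopping is worth three extra correct predictions on this split: 9 mistakes instead of 12.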


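With XGBoost's default of `n_estimators=100`, setting `early_stopping_rounds=100` can never trigger a stop: there are not enough rounds for 100 of them to pass without improvement, so all 100 trees get built. To give early stopping room to work, the next cell raises `n_estimators` to 5000 and waits up to 100 rounds without improvement.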
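The guarantee follows from simple counting, which the sketch below checks on the worst possible case: a validation error that never improves after the first round. The function `rounds_trained` is a simplified stand-in written for this example, not XGBoost's own implementation.

```python
def rounds_trained(val_errors, patience):
    """Return how many boosting rounds run before a patience rule stops training."""
    best_error = float("inf")
    stale_rounds = 0
    for round_idx, error in enumerate(val_errors):
        if error < best_error:
            best_error = error
            stale_rounds = 0
        else:
            stale_rounds += 1
            if stale_rounds >= patience:
                return round_idx + 1  # rounds are 0-indexed
    return len(val_errors)


# Worst case: the very first round is the best one and nothing ever improves.
print(rounds_trained([0.5] * 100, patience=100))   # 100
print(rounds_trained([0.5] * 5000, patience=100))  # 101
print(rounds_trained([0.5] * 100, patience=10))    # 11
```

With a budget of 100 rounds there can be at most 99 rounds after the best one, so a patience of 100 is never exhausted. Once the budget grows to 5000, the same patience cuts a hopeless run off after 101 rounds.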

In [19]:
# Create the model with the necessary parameters
model = XGBClassifier(booster='gbtree', objective='binary:logistic', 
                      eval_metric=eval_metric, early_stopping_rounds=100,
                      random_state=43, n_estimators=5000)

# early_stopping_rounds is set in the constructor above, so fit() only needs eval_set
model.fit(X_train, y_train, eval_set=eval_set, verbose=True)

[0]	validation_0-error:0.22368
[1]	validation_0-error:0.18421
[2]	validation_0-error:0.21053
[3]	validation_0-error:0.19737
[4]	validation_0-error:0.18421
[5]	validation_0-error:0.15789
[6]	validation_0-error:0.15789
[7]	validation_0-error:0.15789
[8]	validation_0-error:0.17105
[9]	validation_0-error:0.17105
[10]	validation_0-error:0.17105
[11]	validation_0-error:0.17105
[12]	validation_0-error:0.17105
[13]	validation_0-error:0.17105
[14]	validation_0-error:0.15789
[15]	validation_0-error:0.14474
[16]	validation_0-error:0.13158
[17]	validation_0-error:0.14474
[18]	validation_0-error:0.13158
[19]	validation_0-error:0.13158
[20]	validation_0-error:0.14474
[21]	validation_0-error:0.14474
[22]	validation_0-error:0.14474
[23]	validation_0-error:0.13158
[24]	validation_0-error:0.11842
[25]	validation_0-error:0.14474
[26]	validation_0-error:0.14474
[27]	validation_0-error:0.13158
[28]	validation_0-error:0.13158
[29]	validation_0-error:0.13158
[30]	validation_0-error:0.13158
[31]	validation_0-

In [22]:
# Predict using the trained model
y_pred = model.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"Accuracy: {accuracy * 100:.2f}%")


Accuracy: 88.16%



In [23]:
# Get the best iteration
best_iteration = model.get_booster().best_iteration

# Since iteration starts from 0, add 1 to get the number of trees
num_trees = best_iteration + 1

print(f"The best score was achieved using {num_trees} trees.")


The best score was achieved using 25 trees.
