XGBoost can be slow to train, especially on moderately large datasets.

I was looking for ways to speed up an XGBoost model and read a post claiming that training with xgb.train is faster than using XGBoost's sklearn API, i.e. the fit method.

In this notebook, I compare these two methods.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.model_selection import KFold, cross_val_score, train_test_split
import xgboost as xgb

bold = "\033[1m"

In [None]:
train = pd.read_csv("../input/tabular-playground-series-aug-2021/train.csv")
print(bold + "Training Set :\n")
display(train.head())
print(bold + str(train.shape))

print(bold + "\nTest Set :\n")
test = pd.read_csv("../input/tabular-playground-series-aug-2021/test.csv")
display(test.head())
print(bold + str(test.shape))

# Data Preparation

In [None]:
target = "loss"
predictors = [x for x in train.columns if x not in ["id", "loss"]]

In [None]:
X = train[predictors]
y = train[target]
test = test[predictors]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.25, random_state = 42)

To use xgb.train, we need to convert the data to DMatrix form.

In [None]:
d_train = xgb.DMatrix(data = X_train.values, label = y_train.values, silent = True, feature_names = X_train.columns)

d_val = xgb.DMatrix(data = X_val.values, label = y_val.values, silent = True, feature_names = X_val.columns)

# 1) Default Parameters

The first comparison is training with default parameters. I only set the n_jobs and random_state parameters.

The default learning rate (eta) in XGBoost is 0.3, which is fairly large, so I set early stopping rounds to 10.
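To make the role of early stopping rounds concrete, here is a minimal pure-Python sketch of a patience rule: stop once the validation metric has not improved for a given number of consecutive rounds. It only illustrates the idea and is not XGBoost's actual implementation; the function name `early_stopping_round` and the sample scores are made up for illustration.

```python
def early_stopping_round(scores, patience):
    """Return (stop_round, best_round) for a metric where lower is better.

    `scores` holds one validation score per boosting round (hypothetical values).
    Training stops at the first round where the best score is `patience`
    rounds old; if that never happens, it runs through the last round.
    """
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(scores):
        if score < best_score:
            # New best score: remember it and reset the waiting period.
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here.
            return round_idx, best_round
    return len(scores) - 1, best_round


# Made-up RMSE values: improvement stalls after round 2.
rmse_per_round = [8.1, 7.9, 7.85, 7.86, 7.87, 7.86, 7.88]
stop_round, best_round = early_stopping_round(rmse_per_round, patience=3)
print(stop_round, best_round)
```

With a large learning rate the validation score tends to plateau quickly, which is why a short patience such as 10 rounds is a reasonable choice here.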

In [None]:
params = {"n_jobs": -1,
          "random_state": 42}

In [None]:
start = datetime.now()

xgb.train(params, d_train, num_boost_round = 1000, evals = [(d_val, "eval")], early_stopping_rounds = 10, verbose_eval = 10)

print(datetime.now() - start)

In [None]:
xgb_sklearn = xgb.XGBRegressor(n_estimators = 1000, **params)

In [None]:
start = datetime.now()

xgb_sklearn.fit(X_train.values, y_train.values,
                eval_set=[(X_val.values, y_val.values)],
                eval_metric = "rmse",
                early_stopping_rounds = 10,
                verbose = 10)

print(datetime.now() - start)

Both runs give the same eval scores and take almost the same time, around 2.07 (this may change on a rerun).
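Because a single wall-clock measurement can shift from run to run, one option is to repeat each measurement and compare medians. The sketch below uses only the standard library; `time_repeated` is a hypothetical helper, and the lambda is a stand-in workload where a call such as xgb.train(...) or model.fit(...) would go.

```python
import statistics
import time


def time_repeated(func, repeats=3):
    """Call `func` `repeats` times and return the list of elapsed seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload; replace with the training call being measured.
durations = time_repeated(lambda: sum(i * i for i in range(100_000)), repeats=5)
print(f"median: {statistics.median(durations):.4f}s, min: {min(durations):.4f}s")
```

Comparing medians over a few repeats makes it easier to tell a real speed difference from ordinary run-to-run noise.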

# 2) Initial Parameters

Second, I set some initial parameters and use 50 early stopping rounds.

In [None]:
initial_params = {"n_jobs": -1,
                  "random_state": 42,
                  "gamma": 0.25,
                  "max_depth": 12,
                  "min_child_weight": 8,
                  "subsample": 0.8,
                  "colsample_bytree": 0.7,
                  }

In [None]:
start = datetime.now()

xgb.train(initial_params, d_train, num_boost_round = 1000, evals = [(d_val, "eval")], early_stopping_rounds = 50, verbose_eval = 25)

print(datetime.now() - start)

In [None]:
xgb_initial_sklearn = xgb.XGBRegressor(n_estimators = 1000, **initial_params)

In [None]:
start = datetime.now()

xgb_initial_sklearn.fit(X_train.values, y_train.values,
                        eval_set=[(X_val.values, y_val.values)],
                        eval_metric = "rmse",
                        early_stopping_rounds = 50,
                        verbose = 25)

print(datetime.now() - start)

Again, we get the same eval scores and almost the same time.

# 3) Default Parameters with GPU

Using a GPU is the most effective way to speed up an XGBoost model.

To use the GPU, we only need to set the tree method to "gpu_hist".

I set early stopping rounds to 500 to see the performance of the GPU.
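One caveat when rerunning this on a newer installation: in XGBoost 2.0 and later, "gpu_hist" is deprecated in favor of tree_method "hist" combined with device "cuda". The sketch below picks the GPU settings from a version string; `gpu_params` is a hypothetical helper and the version strings are chosen only for illustration.

```python
def gpu_params(xgboost_version):
    """Return GPU-related parameters for a given XGBoost version string.

    Assumption: versions before 2.0 use tree_method="gpu_hist" (as in this
    notebook), while 2.0 and later use tree_method="hist" with device="cuda".
    """
    major = int(xgboost_version.split(".")[0])
    if major >= 2:
        return {"tree_method": "hist", "device": "cuda"}
    return {"tree_method": "gpu_hist"}


print(gpu_params("1.4.2"))  # version string chosen for illustration
print(gpu_params("2.0.3"))
```

In the notebook this could be applied with params.update(gpu_params(xgb.__version__)) instead of setting "tree_method" by hand.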

In [None]:
params["tree_method"] = "gpu_hist"

In [None]:
start = datetime.now()

xgb.train(params, d_train, num_boost_round = 1000, evals = [(d_val, "eval")], early_stopping_rounds = 500, verbose_eval = 50)

print(datetime.now() - start)

In [None]:
xgb_gpu_sklearn = xgb.XGBRegressor(n_estimators = 1000, **params)

In [None]:
start = datetime.now()

xgb_gpu_sklearn.fit(X_train.values, y_train.values,
                    eval_set=[(X_val.values, y_val.values)],
                    eval_metric = "rmse",
                    early_stopping_rounds = 500,
                    verbose = 50)

print(datetime.now() - start)

This process took five seconds. Again, everything looks the same.

# 4) Initial Parameters with GPU

In [None]:
initial_params["tree_method"] = "gpu_hist"
initial_params["eta"] = 0.1

In [None]:
start = datetime.now()

xgb.train(initial_params, d_train, num_boost_round = 1000, evals = [(d_val, "eval")], early_stopping_rounds = 500, verbose_eval = 100)

print(datetime.now() - start)

In [None]:
xgb_initial_gpu_sklearn = xgb.XGBRegressor(n_estimators = 1000, **initial_params)

In [None]:
start = datetime.now()

xgb_initial_gpu_sklearn.fit(X_train.values, y_train.values,
                            eval_set=[(X_val.values, y_val.values)],
                            eval_metric = "rmse",
                            early_stopping_rounds = 500,
                            verbose = 100)

print(datetime.now() - start)

With the initial parameters on the GPU, we again get the same results.

# Takeaways

It looks like there is no significant difference between **train** and **fit**.

**We get the same results in the same time.**

The only difference is that we need the DMatrix data format to use xgb.train.