# Introduction to XGBoost with RAPIDS

In this notebook, we'll show the speedup you can gain by training XGBoost on GPUs within the RAPIDS ecosystem.

## Hardware setup

To start, let's see what hardware we're working with.

In [1]:
!nvidia-smi

Fri Dec 28 00:55:46 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   36C    P0    28W / 250W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   36C    P0    27W / 250W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:84:00.0 Off |                    0 |
| N/A   

In [2]:
!nproc

32


## CUDA Version

Next, let's see what CUDA version we have.

In [3]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130


## Load our libraries

Let's load some of the libraries in the RAPIDS ecosystem and see which versions we have.

In [4]:
import cudf; print('cuDF Version:', cudf.__version__)
import cuml; print('cuML Version:', '0.2.0')  # hard-coded: this string is not read from the package
import dask; print('dask Version:', dask.__version__)
# import dask_gdf; print('dask_gdf Version:', dask_gdf.__version__)
# import dask_xgboost; print('dask_xgboost Version:', dask_xgboost.__version__)
import numba; print('numba Version:', numba.__version__)
import numpy; print('numpy Version:', numpy.__version__)
import matplotlib; print('matplotlib Version:', matplotlib.__version__)
import pandas; print('pandas Version:', pandas.__version__)
import pyarrow; print('pyarrow Version:', pyarrow.__version__)
import xgboost; print('XGBoost Version:', xgboost.__version__)

cuDF Version: 0+unknown
cuML Version: 0.2.0
dask Version: 0.19.2
numba Version: 0.40.0
numpy Version: 1.14.5
matplotlib Version: 3.0.2
pandas Version: 0.20.3
pyarrow Version: 0.10.0
XGBoost Version: 0.80


## Load/Simulate data

### Load data

We can load the data using `pandas.read_csv`.

### Simulate data

Alternatively, we can simulate data for our training and validation datasets. The simulated features are tabular, with `n_rows` rows and `n_columns` columns. Values are conceptually of type `np.float32` for numerical data and `np.uint8` for categorical data, although the helper below casts everything to `np.float32` before returning. Numerical and categorical features can also be combined in one dataset; this experiment does not use that combination.

In [5]:
import numpy as np
import pandas as pd


# helper function for simulating data
def simulate_data(m, n, k=2, numerical=False):
    if numerical:
        features = np.random.rand(m, n)
    else:
        features = np.random.randint(2, size=(m, n))
    labels = np.random.randint(k, size=m)
    return np.c_[labels, features].astype(np.float32)


# helper function for loading data
def load_data(filename, n_rows):
    if n_rows >= 1e9:
        df = pd.read_csv(filename)
    else:
        df = pd.read_csv(filename, nrows=n_rows)
    return df.values.astype(np.float32)
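
The combination of numerical and categorical features mentioned above is not used in this experiment, but it could be simulated along the following lines. This is a minimal sketch: `simulate_mixed_data` and its arguments are hypothetical names introduced here for illustration, and the layout (label in column 0, everything cast to `np.float32`) simply mirrors `simulate_data`.

```python
import numpy as np


# Hypothetical helper (not part of the original notebook): simulate a dataset
# whose first n_numerical feature columns are uniform floats and whose
# remaining n_categorical feature columns are integer category codes.
def simulate_mixed_data(m, n_numerical, n_categorical, k=2, n_levels=2, seed=None):
    rng = np.random.RandomState(seed)
    numerical = rng.rand(m, n_numerical)                          # floats in [0, 1)
    categorical = rng.randint(n_levels, size=(m, n_categorical))  # codes 0..n_levels-1
    labels = rng.randint(k, size=m)                               # class ids 0..k-1
    # Same layout as simulate_data: label in column 0, features after it.
    return np.c_[labels, numerical, categorical].astype(np.float32)


mixed = simulate_mixed_data(1000, n_numerical=3, n_categorical=2, seed=0)
print(mixed.shape, mixed.dtype)
```

Because the category codes end up as plain `np.float32` values, a model trained on this array would treat them as ordinary numbers.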

In [6]:
# settings
LOAD = False
n_gpus = 1
n_rows = int(1e5)
n_columns = int(100)
n_categories = 2

In [7]:
%%time

if LOAD:
    dataset = load_data('/tmp', n_rows)
else:
    dataset = simulate_data(n_rows, n_columns, n_categories)
print(dataset.shape)

(100000, 101)
CPU times: user 59.4 ms, sys: 38.2 ms, total: 97.7 ms
Wall time: 96.6 ms


### Split data

We'll split our dataset into an 80% training dataset and a 20% validation dataset by taking the first 80% of the rows for training and the rest for validation.

In [8]:
# identify shape and indices
n_rows, n_columns = dataset.shape
train_size = 0.80
train_index = int(n_rows * train_size)

# split X, y
X, y = dataset[:, 1:], dataset[:, 0]
del dataset

# split train data
X_train, y_train = X[:train_index, :], y[:train_index]

# split validation data
X_validation, y_validation = X[train_index:, :], y[train_index:]
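
Slicing off the first 80% of rows is fine for the simulated data, since every row is drawn independently. For a loaded file whose rows may be ordered in some way, shuffling before the split is a common precaution. The sketch below shows one way to do that; `shuffled_split` is a hypothetical helper, not part of the original notebook, and the small demo arrays are stand-ins so the block runs on its own.

```python
import numpy as np


# Hypothetical helper (not in the original notebook): shuffle the row order
# once, then cut at the same 80/20 boundary used above.
def shuffled_split(X, y, train_size=0.80, seed=0):
    rng = np.random.RandomState(seed)
    order = rng.permutation(X.shape[0])   # random row order
    cut = int(X.shape[0] * train_size)    # train/validation boundary
    train_rows, validation_rows = order[:cut], order[cut:]
    return X[train_rows], y[train_rows], X[validation_rows], y[validation_rows]


# Small stand-in arrays; row i of X_demo is [2*i, 2*i + 1] and y_demo[i] is i.
X_demo = np.arange(20, dtype=np.float32).reshape(10, 2)
y_demo = np.arange(10, dtype=np.float32)
X_tr, y_tr, X_va, y_va = shuffled_split(X_demo, y_demo)
print(X_tr.shape, X_va.shape)
```

Indexing `X` and `y` with the same row indices keeps each label attached to its feature row after the shuffle.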

### Check dimensions

We can check the dimensions and proportions of our training and validation datasets.

In [9]:
# print(X_train[:3, :], y_train[:3])

In [10]:
# check dimensions
print('X_train: ', X_train.shape, X_train.dtype, 'y_train: ', y_train.shape, y_train.dtype)
print('X_validation', X_validation.shape, X_validation.dtype, 'y_validation: ', y_validation.shape, y_validation.dtype)

# check the proportions
total = X_train.shape[0] + X_validation.shape[0]
print('X_train proportion:', X_train.shape[0] / total)
print('X_validation proportion:', X_validation.shape[0] / total)

X_train:  (80000, 100) float32 y_train:  (80000,) float32
X_validation (20000, 100) float32 y_validation:  (20000,) float32
X_train proportion: 0.8
X_validation proportion: 0.2


## Convert NumPy data to DMatrix format

With our data simulated and formatted as NumPy arrays, our next step is to convert it to a `DMatrix` object that XGBoost can work with. We can instantiate an `xgboost.DMatrix` object by passing in the feature matrix as the first argument, followed by the label vector using the `label=` keyword argument. To learn more about XGBoost's support for data structures other than NumPy arrays, see the documentation for the Data Interface:


https://xgboost.readthedocs.io/en/latest/python/python_intro.html#data-interface


In [11]:
%%time

import xgboost as xgb


dtrain = xgb.DMatrix(X_train, label=y_train)
dvalidation = xgb.DMatrix(X_validation, label=y_validation)

CPU times: user 38.9 ms, sys: 21.3 ms, total: 60.3 ms
Wall time: 58.9 ms


## Set parameters

XGBoost groups its parameters into three categories, all of which are set before training:

* General parameters select which booster to use, commonly a tree or linear model.
* Booster parameters depend on the booster you have chosen.
* Learning task parameters define the learning scenario. For example, regression tasks may use different parameters than ranking tasks.

For more information on the configurable parameters within the XGBoost module, see the documentation here:


https://xgboost.readthedocs.io/en/latest/parameter.html

In [12]:
# instantiate params
params = {}

# general params
general_params = {'silent': 1}
params.update(general_params)

# booster params
# n_gpus = 0
booster_params = {}

if n_gpus != 0:
    booster_params['tree_method'] = 'gpu_hist'
    booster_params['n_gpus'] = n_gpus
params.update(booster_params)

# learning task params
learning_task_params = {'eval_metric': 'auc', 'objective': 'binary:logistic'}
params.update(learning_task_params)
print(params)

{'n_gpus': 1, 'tree_method': 'gpu_hist', 'silent': 1, 'objective': 'binary:logistic', 'eval_metric': 'auc'}


## Train model

Now it's time to train our model! We can use the `xgb.train` function and pass in the parameters, the training dataset, the number of boosting iterations, and a list of `(DMatrix, name)` pairs to evaluate after each iteration. For more information on the parameters that can be passed into `xgb.train`, check out the documentation:


https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train

In [13]:
# model training settings
evallist = [(dvalidation, 'validation'), (dtrain, 'train')]
num_round = 10

In [14]:
%%time

bst = xgb.train(params, dtrain, num_round, evallist)

[0]	validation-auc:0.493903	train-auc:0.543352
[1]	validation-auc:0.494909	train-auc:0.558923
[2]	validation-auc:0.494512	train-auc:0.572057
[3]	validation-auc:0.494829	train-auc:0.581474
[4]	validation-auc:0.494723	train-auc:0.590397
[5]	validation-auc:0.494047	train-auc:0.598135
[6]	validation-auc:0.495006	train-auc:0.60541
[7]	validation-auc:0.493484	train-auc:0.611741
[8]	validation-auc:0.493427	train-auc:0.618586
[9]	validation-auc:0.494259	train-auc:0.623553
CPU times: user 9.85 s, sys: 553 ms, total: 10.4 s
Wall time: 886 ms
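
A note on reading the log above: `simulate_data` draws the labels independently of the features, so there is nothing for the model to learn. A validation AUC close to 0.5 is therefore expected, and the rising training AUC reflects the trees fitting noise. The sketch below illustrates this with a plain NumPy AUC computation. `pairwise_auc` is a hypothetical helper written for this illustration and is not how XGBoost computes its metric internally.

```python
import numpy as np


# Pairwise definition of AUC: the fraction of (positive, negative) pairs in
# which the positive example receives the higher score; ties count as one half.
def pairwise_auc(labels, scores):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]   # every positive score minus every negative score
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


rng = np.random.RandomState(0)
labels = rng.randint(2, size=2000)                 # random labels, as in simulate_data
noise_scores = rng.rand(2000)                      # scores unrelated to the labels
informed_scores = labels + 0.1 * rng.rand(2000)    # scores that separate the classes

print('uninformative scores:', pairwise_auc(labels, noise_scores))
print('separating scores:   ', pairwise_auc(labels, informed_scores))
```

The first value lands near 0.5, like the validation AUC in the log. The second is 1.0, because every positive example is scored above every negative one.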


In [15]:
# del bst