<a id="introduction"></a>
## Introduction to Dask XGBoost
#### By Paul Hendricks
-------

In this notebook, we will show how to work with Dask XGBoost in RAPIDS.

**Table of Contents**

* [Introduction to Dask XGBoost](#introduction)
* [Setup](#setup)
* [Load Libraries](#libraries)
* [Create a Cluster and Client](#cluster)
* [Generate Data](#generate)
  * [Load Data](#load)
  * [Simulate Data](#simulate)
  * [Split Data](#split)
  * [Check Dimensions](#check)
* [Distribute Data using Dask cuDF](#distribute)
* [Set Parameters](#parameters)
* [Train Model](#train)
* [Generate Predictions](#predict)
* [Evaluate Model](#evaluate)
* [Conclusion](#conclusion)

<a id="setup"></a>
## Setup

This notebook was tested using the following Docker containers:

* `nvcr.io/nvidia/rapidsai/rapidsai:0.8-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [NGC - rapidsai/rapidsai](https://ngc.nvidia.com/catalog/containers/nvidia:rapidsai:rapidsai)
* `rapidsai/rapidsai-nightly:0.9-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [DockerHub - rapidsai/rapidsai-nightly](https://hub.docker.com/r/rapidsai/rapidsai-nightly)

This notebook was run on NVIDIA Tesla V100 GPUs. Your system may differ, and you may need to modify the code or install packages to run the examples below.

If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks/issues

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [1]:
!nvidia-smi

Wed Jul 17 16:55:02 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM3...  On   | 00000000:34:00.0 Off |                    0 |
| N/A   38C    P0    54W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3...  On   | 00000000:36:00.0 Off |                    0 |
| N/A   35C    P0    52W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM3...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   

Next, let's see what CUDA version we have.

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130


<a id="libraries"></a>
## Load Libraries

Let's load some of the libraries within the RAPIDS ecosystem and see which versions we have.

In [3]:
import cudf; print('cuDF Version:', cudf.__version__)
import dask; print('Dask Version:', dask.__version__)
import dask_cudf; print('Dask cuDF Version:', dask_cudf.__version__)
import dask_xgboost; print('Dask XGBoost Version:', dask_xgboost.__version__)
import numpy as np; print('numpy Version:', np.__version__)
import pandas as pd; print('pandas Version:', pd.__version__)
import sklearn; print('Scikit-Learn Version:', sklearn.__version__)
# import xgboost as xgb; print('XGBoost Version:', xgb.__version__)

cuDF Version: 0.8.0+0.g8fa7bd3.dirty
Dask Version: 2.1.0
Dask cuDF Version: 0.8.0+0.g8fa7bd3.dirty
Dask XGBoost Version: 0.1.5
numpy Version: 1.16.2
pandas Version: 0.23.4
Scikit-Learn Version: 0.21.2


<a id="cluster"></a>
## Create a Cluster and Client

Let's start by creating a local cluster of workers and a client to interact with that cluster.

In [4]:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster


# create a local CUDA cluster
cluster = LocalCUDACluster()
client = Client(cluster)
client

| Client | Cluster |
| --- | --- |
| Scheduler: tcp://127.0.0.1:33567 | Workers: 16 |
| Dashboard: http://127.0.0.1:8787/status | Cores: 16 |
| | Memory: 1.62 TB |


<a id="generate"></a>
## Generate Data

<a id="load"></a>
### Load Data

We can load the data using `pandas.read_csv`. We've provided a helper function `load_data` that loads data from a CSV file. It reads only the first `n_rows` rows unless `n_rows` is 1 billion or more, in which case it reads the entire file.

In [5]:
# helper function for loading data
def load_data(filename, n_rows):
    if n_rows >= 1e9:
        df = pd.read_csv(filename)
    else:
        df = pd.read_csv(filename, nrows=n_rows)
    return df.values.astype(np.float32)

<a id="simulate"></a>
### Simulate Data

Alternatively, we can simulate data for our train and validation datasets. The features will be tabular, with `n_rows` rows and `n_columns` feature columns, and each value will be of type `np.float32`. We can simulate data for either classification or regression using the `make_classification` or `make_regression` functions from the Scikit-Learn package.

In [6]:
from sklearn.datasets import make_classification, make_regression


# helper function for simulating data
def simulate_data(m, n, k=2, random_state=None, classification=True):
    if classification:
        features, labels = make_classification(n_samples=m, n_features=n, 
                                               n_informative=int(n/5), n_classes=k, 
                                               random_state=random_state)
    else:
        features, labels = make_regression(n_samples=m, n_features=n, 
                                           n_informative=int(n/5), n_targets=1, 
                                           random_state=random_state)
    return np.c_[labels, features].astype(np.float32)

In [7]:
# settings
simulate = True
classification = True  # change this to False to use regression
n_rows = int(10_000_000)  # we'll use 10 million rows
n_columns = int(100)
n_categories = 2
random_state = np.random.RandomState(43210)

In [8]:
%%time

if simulate:
    dataset = simulate_data(n_rows, n_columns, n_categories, 
                            random_state=random_state, 
                            classification=classification)
else:
    dataset = load_data('/tmp', n_rows)
print(dataset.shape)

(10000000, 101)
CPU times: user 1min 38s, sys: 12.8 s, total: 1min 51s
Wall time: 1min 5s


<a id="split"></a>
### Split Data

We'll split our dataset into an 80% training dataset and a 20% validation dataset.

In [9]:
# identify shape and indices
n_rows, n_columns = dataset.shape
train_size = 0.80
train_index = int(n_rows * train_size)

# split X, y
X, y = dataset[:, 1:], dataset[:, 0]
del dataset

# split train data
X_train, y_train = X[:train_index, :], y[:train_index]

# split validation data
X_validation, y_validation = X[train_index:, :], y[train_index:]
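
The slicing above relies on the layout produced by `simulate_data`, which places the labels in column 0 and the features in the remaining columns. A minimal sketch on a toy array shows the same positional split. The 10-row array, its values, and the `toy_` names are made up for illustration and are not part of the notebook's data.

```python
import numpy as np

# toy stand-in for `dataset`: column 0 holds labels, columns 1..3 hold features
toy_labels = np.arange(10) % 2
toy_features = np.arange(30).reshape(10, 3)
toy_dataset = np.c_[toy_labels, toy_features].astype(np.float32)

# same logic as the cell above, applied to the toy array
toy_train_index = int(toy_dataset.shape[0] * 0.80)
toy_X, toy_y = toy_dataset[:, 1:], toy_dataset[:, 0]
toy_X_train, toy_y_train = toy_X[:toy_train_index, :], toy_y[:toy_train_index]
toy_X_validation, toy_y_validation = toy_X[toy_train_index:, :], toy_y[toy_train_index:]

print(toy_X_train.shape, toy_X_validation.shape)
```

Because the split is a plain slice, rows are not shuffled at this step, so the result depends on the row order of the dataset.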

<a id="check"></a>
### Check Dimensions

We can check the dimensions and proportions of our training and validation datasets.

In [10]:
# check dimensions
print('X_train: ', X_train.shape, X_train.dtype, 'y_train: ', y_train.shape, y_train.dtype)
print('X_validation', X_validation.shape, X_validation.dtype, 'y_validation: ', y_validation.shape, y_validation.dtype)

# check the proportions
total = X_train.shape[0] + X_validation.shape[0]
print('X_train proportion:', X_train.shape[0] / total)
print('X_validation proportion:', X_validation.shape[0] / total)

X_train:  (8000000, 100) float32 y_train:  (8000000,) float32
X_validation (2000000, 100) float32 y_validation:  (2000000,) float32
X_train proportion: 0.8
X_validation proportion: 0.2


<a id="distribute"></a>
## Distribute Data using Dask cuDF

Next, let's distribute our data across multiple GPUs using Dask cuDF.

In [11]:
# create Pandas DataFrames for X_train and X_validation
n_columns = X_train.shape[1]
X_train_pdf = pd.DataFrame(X_train)
X_train_pdf.columns = ['feature_' + str(i) for i in range(n_columns)]
X_validation_pdf = pd.DataFrame(X_validation)
X_validation_pdf.columns = ['feature_' + str(i) for i in range(n_columns)]

# create Pandas DataFrames for y_train and y_validation
y_train_pdf = pd.DataFrame(y_train)
y_train_pdf.columns = ['y']
y_validation_pdf = pd.DataFrame(y_validation)
y_validation_pdf.columns = ['y']

In [12]:
# Dask settings
npartitions = 8

# create Dask DataFrames for X_train and X_validation
X_train_dask_pdf = dask.dataframe.from_pandas(X_train_pdf, npartitions=npartitions)
X_validation_dask_pdf = dask.dataframe.from_pandas(X_validation_pdf, npartitions=npartitions)

# create Dask cuDF DataFrames for X_train and X_validation
X_train_dask_cudf = dask_cudf.from_dask_dataframe(X_train_dask_pdf)
X_validation_dask_cudf = dask_cudf.from_dask_dataframe(X_validation_dask_pdf)

# create Dask DataFrames for y_train and y_validation
y_train_dask_pdf = dask.dataframe.from_pandas(y_train_pdf, npartitions=npartitions)
y_validation_dask_pdf = dask.dataframe.from_pandas(y_validation_pdf, npartitions=npartitions)

# create Dask cuDF DataFrames for y_train and y_validation
y_train_dask_cudf = dask_cudf.from_dask_dataframe(y_train_dask_pdf)
y_validation_dask_cudf = dask_cudf.from_dask_dataframe(y_validation_dask_pdf)
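
When choosing `npartitions`, it can help to estimate how much memory each partition will occupy on a GPU. The following back-of-the-envelope sketch uses the shapes from this notebook. The helper name `partition_size_bytes` is hypothetical, and the estimate assumes rows are divided evenly across partitions and ignores index and framework overhead.

```python
import math


def partition_size_bytes(n_rows, n_columns, npartitions, bytes_per_value=4):
    """Rough size of one partition, assuming evenly divided float32 rows."""
    rows_per_partition = math.ceil(n_rows / npartitions)
    return rows_per_partition * n_columns * bytes_per_value


# training features in this notebook: 8,000,000 rows x 100 float32 columns
train_partition_bytes = partition_size_bytes(8_000_000, 100, 8)
print(train_partition_bytes / 1e6, 'MB per partition')
```

Under these assumptions, each of the 8 training partitions holds about 400 MB of feature values, which is small relative to the memory of the GPUs used here.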

In [13]:
# Optional: persist training and validation data into memory
X_train_dask_cudf = X_train_dask_cudf.persist()
X_validation_dask_cudf = X_validation_dask_cudf.persist()
y_train_dask_cudf = y_train_dask_cudf.persist()
y_validation_dask_cudf = y_validation_dask_cudf.persist()

  (         feature_0  feature_1  feature_2  feature ... 625fae9fda69e')
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  % (format_bytes(len(b)), s)


<a id="parameters"></a>
## Set Parameters

There are a number of parameters that can be set before XGBoost can be run. 

* General parameters select which booster to use, commonly a tree or linear model.
* Booster parameters depend on which booster you have chosen.
* Learning task parameters define the learning scenario. For example, regression tasks may use different parameters than ranking tasks.

In [14]:
# instantiate params
params = {}

# general params
general_params = {'silent': 1}
params.update(general_params)

# booster params
n_gpus = 1  # keep this at 1, even if using more than 1 GPU - Dask XGBoost uses 1 GPU per worker
booster_params = {}
booster_params['max_depth'] = 8
booster_params['grow_policy'] = 'lossguide'
booster_params['max_leaves'] = 2**8
booster_params['tree_method'] = 'gpu_hist'
booster_params['n_gpus'] = n_gpus
params.update(booster_params)

# learning task params
learning_task_params = {}
if classification:
    learning_task_params['eval_metric'] = 'auc'
    learning_task_params['objective'] = 'binary:logistic'
else:
    learning_task_params['eval_metric'] = 'rmse'
    learning_task_params['objective'] = 'reg:squarederror'
params.update(learning_task_params)
print(params)

{'silent': 1, 'max_depth': 8, 'grow_policy': 'lossguide', 'max_leaves': 256, 'tree_method': 'gpu_hist', 'n_gpus': 1, 'eval_metric': 'auc', 'objective': 'binary:logistic'}
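
A note on how `max_depth` and `max_leaves` relate in the settings above: a binary tree of depth `d` has at most `2**d` leaves, so `max_leaves = 2**8` equals the largest leaf count a depth-8 tree can reach. The sketch below only checks that arithmetic. `max_leaves_for_depth` and `example_params` are hypothetical names for illustration and are not part of XGBoost.

```python
def max_leaves_for_depth(depth):
    """Upper bound on the number of leaves in a binary tree of the given depth."""
    return 2 ** depth


example_params = {'max_depth': 8, 'max_leaves': 2**8}
leaf_bound = max_leaves_for_depth(example_params['max_depth'])
print(example_params['max_leaves'] <= leaf_bound)
```

Assuming both limits are enforced during training, the leaf limit here does not restrict the trees beyond what the depth limit already does. Setting `max_leaves` below 256 would.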


<a id="train"></a>
## Train Model

Now it's time to train our model! We can use the `dask_xgboost.train` function and pass in the client, the parameters, the training features and labels, and the number of boosting iterations.

In [15]:
# model training settings
num_round = 100

In [16]:
%%time


bst = dask_xgboost.train(client, params, X_train_dask_cudf, y_train_dask_cudf, num_boost_round=num_round)

CPU times: user 3.35 s, sys: 6.25 s, total: 9.61 s
Wall time: 27.4 s


<a id="predict"></a>
## Generate Predictions

We can generate predictions using the `dask_xgboost.predict` function and then use `dask.dataframe.multi.concat` to concatenate the resulting dataframes.

In [17]:
y_predictions = dask_xgboost.predict(client, bst, X_validation_dask_cudf)

In [18]:
y_predictions = dask.dataframe.multi.concat([y_predictions], axis=1)

<a id="evaluate"></a>
## Evaluate Model

Lastly, we can evaluate our model and calculate accuracy (for classification) or RMSE (for regression).

In [19]:
from sklearn.metrics import accuracy_score


if classification:
    thresholded_predictions = (y_predictions[0] > 0.5).compute().to_array() * 1.0
    accuracy = accuracy_score(y_validation, thresholded_predictions)
    print('Accuracy:', accuracy)
else:
    # bring predictions back to the host, mirroring the classification branch
    predictions = y_predictions[0].compute().to_array()
    rmse = np.sqrt(np.mean((predictions - y_validation)**2))
    print('Root Mean Squared Error:', rmse)

Accuracy: 0.9873415
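
To make the two metrics concrete, here is a small NumPy-only sketch on made-up values. The arrays below are illustrative and are not results from this notebook. It thresholds predicted probabilities at 0.5 to compute accuracy, and computes RMSE as the square root of the mean squared difference.

```python
import numpy as np

# made-up labels and predicted probabilities for a binary classifier
toy_labels = np.array([0, 1, 1, 0], dtype=np.float32)
toy_probabilities = np.array([0.2, 0.9, 0.4, 0.1], dtype=np.float32)

# accuracy: fraction of thresholded predictions that match the labels
toy_classes = (toy_probabilities > 0.5) * 1.0
toy_accuracy = np.mean(toy_classes == toy_labels)
print('Accuracy:', toy_accuracy)

# RMSE: made-up regression targets and predictions
toy_targets = np.array([1.0, 2.0, 3.0, 4.0])
toy_estimates = np.array([1.0, 2.0, 3.0, 6.0])
toy_rmse = np.sqrt(np.mean((toy_estimates - toy_targets) ** 2))
print('Root Mean Squared Error:', toy_rmse)
```

In the toy classification case, three of the four thresholded predictions match their labels. In the toy regression case, only the last estimate is off, by 2.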


<a id="conclusion"></a>
## Conclusion

In this notebook, we showed how to work with Dask XGBoost in RAPIDS.

To learn more about RAPIDS, be sure to check out: 

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)