<h1 align=center>Accelerating Data Science Workflows with RAPIDS</h1>

[RAPIDS](https://rapids.ai/) is a suite of open source software libraries that lets you execute end-to-end data science workflows entirely on GPUs. RAPIDS relies on NVIDIA CUDA® primitives for low-level compute optimization, and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces and APIs that are familiar to users of Pandas, Scikit-learn, and Dask.

## Introduction to cuDF and XGBoost

In this lab we will discuss a couple of RAPIDS packages: cuDF (a DataFrame library interoperable with Pandas) and GPU-accelerated XGBoost. You will work through a series of exercises to port and refactor CPU code to run on the GPU.

<a id="prerequisites"></a>
## Prerequisites

This lab is not an introduction to Data Science. We'll assume that you have a background in Data Science and experience with the following programming tools and techniques:

- [Python 3 programming language](https://docs.python.org/)
- [Pandas Data Analysis Library](https://pandas.pydata.org/)
- [NumPy Library for Numerical Programming](http://www.numpy.org/)
- Machine learning model training with [XGBoost](https://xgboost.readthedocs.io/)
- Python plotting with [Matplotlib](https://matplotlib.org/)

## Setup

Install the RAPIDS libraries (cuDF, cuML, cuGraph) and XGBoost:

In [None]:
!wget https://github.com/zronaghi/Clemson-workshop/raw/master/utils/rapids-install.sh
!bash rapids-install.sh

Before we begin, let's check our hardware setup by running the `nvidia-smi` command:

In [None]:
!nvidia-smi

Let's also check the CUDA version:

In [None]:
!nvcc --version

<a id="libraries"></a>
## Load Libraries

Let's load some of the RAPIDS libraries that we'll be using and check versions:

In [None]:
import cudf; print('cuDF version:', cudf.__version__)
import xgboost as xgb; print('XGBoost version:', xgb.__version__)

Additional libraries:

In [None]:
import time
import numpy as np; print('numpy version:', np.__version__)
import pandas as pd; print('pandas version:', pd.__version__)
import sklearn; print('Scikit-learn version:', sklearn.__version__)

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

<a id="generate"></a>
## Generate Data

We'll use `sklearn.datasets` to build two synthetic sub-datasets (Swiss rolls and blobs), combine them, and then train a model to predict which sub-dataset each sample came from.

In [None]:
#number of total samples
nSamples = 10000000

SamplesPerDatas = nSamples//2

swissrolls= datasets.make_swiss_roll( n_samples = SamplesPerDatas, noise = .005)[0]

blobs = datasets.make_blobs( n_samples = SamplesPerDatas, centers = 5,  n_features = 3, cluster_std = 0.25,  random_state = 0)[0] + [0, 1.5, 0]

In [None]:
#features 
X = np.vstack([blobs, swissrolls])

#generate labels 
blobsLabels = np.zeros(blobs.shape[0])
rollsLabels = 1 * np.ones(swissrolls.shape[0])

y = np.hstack( [blobsLabels, rollsLabels])
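
To sanity-check the construction before committing to ten million samples, the same steps can be run at a small scale. The sketch below assumes a hypothetical `n_small` of 1,000 and also fixes `random_state` for the Swiss roll (the cell above leaves it unseeded). It confirms that the feature matrix has three columns and that the labels are split evenly between 0 (blobs) and 1 (Swiss roll).

```python
import numpy as np
from sklearn import datasets

n_small = 1000  # hypothetical small sample count, for a quick check only
per_set = n_small // 2

# Same generators as above, scaled down
rolls_small = datasets.make_swiss_roll(n_samples=per_set, noise=0.005,
                                       random_state=0)[0]
blobs_small = datasets.make_blobs(n_samples=per_set, centers=5, n_features=3,
                                  cluster_std=0.25, random_state=0)[0] + [0, 1.5, 0]

# Stack features and build matching labels: 0 for blobs, 1 for Swiss roll
X_small = np.vstack([blobs_small, rolls_small])
y_small = np.hstack([np.zeros(per_set), np.ones(per_set)])

print(X_small.shape, y_small.shape)
print(np.unique(y_small, return_counts=True))
```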

## Train and Test Data

We'll split our dataset into a 75% training dataset and a 25% test dataset:

- Train Data (75% of total data) - Used to optimize the model's parameters
- Test Data (25% of total data) - Used to evaluate the trained model

In [None]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.25, random_state = 0, shuffle=True)

Let's check the dimensions and data types of these datasets:

In [None]:
print('X_train: ', X_train.shape, X_train.dtype, 'y_train: ', y_train.shape, y_train.dtype)
print('X_test: ', X_test.shape, X_test.dtype, 'y_test: ', y_test.shape, y_test.dtype)
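
Shapes alone do not show whether both sub-datasets are represented evenly in each split. Below is a minimal sketch of checking the class proportions, using a small hypothetical array rather than the full dataset. The `stratify` argument is optional and is not used in the cell above; it is shown here because it keeps the label ratio identical in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small stand-in data: 400 samples, 3 features, balanced 0/1 labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = np.repeat([0.0, 1.0], 200)

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, shuffle=True,
    stratify=y_demo)  # assumption: stratification is wanted

print(Xtr.shape, Xte.shape)
print('train fraction of class 1:', ytr.mean())
print('test fraction of class 1:', yte.mean())
```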

<a id="visualize"></a>
## Visualize Data

Let's define a function for plotting using matplotlib:

In [None]:
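def plot_data( data, colorStack = 'green',  
                  ax3D = False, markerScale=1):
    
    # create a new 3D axes unless the caller passed one in
    # (gca(projection='3d') fails on recent Matplotlib releases)
    if ax3D is False:
        ax3D = plt.figure(figsize=(12,12)).add_subplot(projection='3d')
        
    # plot only the first 10,000 points to keep rendering fast
    ax3D.scatter(data[0:10000,0], 
                 data[0:10000,1], 
                 data[0:10000,2], s = 20*markerScale, c=colorStack, depthshade=False)
    
    ax3D.view_init(elev=10, azim=95)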

In [None]:
plot_data(X_train)

## ETL

Let's write the dataset to disk as CSV (comma-separated values) files and demonstrate data loading:

In [None]:
%%time
pd.DataFrame(data = X_train).to_csv('X_train.csv', index = False)
pd.DataFrame(data = X_test).to_csv('X_test.csv', index = False)
pd.DataFrame(data = y_train).to_csv('y_train.csv', index = False)
pd.DataFrame(data = y_test).to_csv('y_test.csv', index = False)

In [None]:
#check size of data on disk
!du -h *csv

### Load Data CPU

In [None]:
%%time
startTime = time.time()

pd_X_train = pd.read_csv('X_train.csv',  delimiter=',')
pd_X_test = pd.read_csv('X_test.csv',  delimiter=',')
pd_y_train = pd.read_csv('y_train.csv',  delimiter=',')
pd_y_test = pd.read_csv('y_test.csv',  delimiter=',')

PandasIngestion = time.time() - startTime
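
The `startTime = time.time()` pattern used above recurs in every timed cell of this lab. One way to reduce the repetition is a small context manager that stores elapsed seconds in a dictionary. This is a sketch only: the helper `timed` and the `timings` dictionary are hypothetical names, not part of the original notebook.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical container: label -> elapsed seconds

@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Example: time a stand-in workload
with timed('demo'):
    total = sum(i * i for i in range(100_000))

print('demo took {:.4f} s'.format(timings['demo']))
```

A speedup could then be computed as the ratio of two recorded entries, for example `timings['cpu'] / timings['gpu']` for two blocks timed under those (hypothetical) labels.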

### Load Data GPU

RAPIDS enables reading data from disk directly to GPU memory using cuDF (DataFrame manipulation library) with a similar API to Pandas. 

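### Exercise
Use cuDF to load the data into GPU memory and record the elapsed time in `cuDFIngestion`, which the comparison cell below expects. See the [cuDF API Reference](https://rapidsai.github.io/projects/cudf/en/latest/api.html).

In [None]:
# read csv files into cudf_X_train, cudf_X_test, cudf_y_train, cudf_y_test
# and store the elapsed time (in seconds) in cuDFIngestion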

In [None]:
print("Data load on GPU is {:.2f}x faster than CPU".format(PandasIngestion/cuDFIngestion))

[Solution](#solution1)

## Model Training with XGBoost

### Prepare Data

Let's convert our DataFrames to DMatrix objects for XGBoost training. We can instantiate an `xgboost.DMatrix` by passing in the feature matrix as the first argument, followed by the label vector using the `label` keyword argument.

In [None]:
%%time
startTime = time.time()

train_DataAndLabelsCPU = xgb.DMatrix(pd_X_train, label=pd_y_train)
test_DataAndLabelsCPU = xgb.DMatrix(pd_X_test, label=pd_y_test)

CPUDMatrix = time.time() - startTime

In [None]:
%%time
startTime = time.time()

train_DataAndLabelsGPU = xgb.DMatrix(cudf_X_train, label=cudf_y_train)
test_DataAndLabelsGPU = xgb.DMatrix(cudf_X_test, label=cudf_y_test)

GPUDMatrix = time.time() - startTime

In [None]:
print("DMatrix conversion on GPU is {:.2f}x faster than CPU".format(CPUDMatrix/GPUDMatrix))

<a id="parameters"></a>
## Set Parameters

There are a number of parameters that can be set before training an XGBoost model:

* General parameters relate to which booster we are using, commonly tree or linear model
* Booster parameters depend on which booster you have chosen
* Learning task parameters decide on the learning scenario

For all available options, uncomment and execute the cell below:

In [None]:
#?xgb.XGBClassifier

### CPU Parameters

In [None]:
nCores = !nproc --all
nCores = int(nCores[0])
print(nCores)

In [None]:
# instantiate params
paramsCPU = {}

# booster params
booster_params = {
    'max_depth': 6,
    'num_class': 3,
    'tree_method':'hist',
    'random_state': 0,
    'n_jobs': nCores
}  
paramsCPU.update(booster_params)

# learning task params
learning_task_params = {
    'objective': 'multi:softmax'
}
paramsCPU.update(learning_task_params)

print(paramsCPU)

### GPU Parameters

Training XGBoost models on the GPU is very similar to training on the CPU. We only need to change a couple of parameters:

In [None]:
paramsGPU = {}

booster_params = {
    'max_depth': 6,
    'num_class': 3,
    'tree_method':'gpu_hist',
    'random_state': 0,
    'n_gpus': 1
}  
paramsGPU.update(booster_params)

learning_task_params = {
    'objective': 'multi:softmax'
}
paramsGPU.update(learning_task_params)

print(paramsGPU)
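
The CPU and GPU dictionaries above differ only in how the tree builder is selected. One way to keep the shared settings in a single place is a small helper. `make_params` and its `modern_api` flag below are hypothetical names, not part of the lab. The `modern_api` branch reflects the spelling recommended since XGBoost 2.0 (`device='cuda'` with `tree_method='hist'`), which may not match the release installed by the setup script, so treat it as an assumption to check against your version.

```python
def make_params(use_gpu, num_class, n_threads=1, modern_api=False):
    """Return an XGBoost parameter dict for CPU or GPU training (hypothetical helper)."""
    # Settings shared by both the CPU and the GPU configuration
    params = {
        'max_depth': 6,
        'num_class': num_class,
        'random_state': 0,
        'objective': 'multi:softmax',
    }
    if not use_gpu:
        params.update({'tree_method': 'hist', 'n_jobs': n_threads})
    elif modern_api:
        # Assumption: XGBoost >= 2.0, where the device is chosen separately
        params.update({'tree_method': 'hist', 'device': 'cuda'})
    else:
        # Spelling used in this lab (older XGBoost releases)
        params.update({'tree_method': 'gpu_hist', 'n_gpus': 1})
    return params

cpu_demo = make_params(use_gpu=False, num_class=3, n_threads=4)
gpu_demo = make_params(use_gpu=True, num_class=3)

# Keys whose values differ between the two configurations
print({k for k in set(cpu_demo) | set(gpu_demo) if cpu_demo.get(k) != gpu_demo.get(k)})
```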

<a id="train"></a>
## Train XGBoost Classification Model

Now it's time to train our model! We can use the `xgboost.train` function and pass in the parameters, the training dataset, the number of boosting iterations, and, optionally, a list of items to be evaluated during training.
The wall time output indicates how long it took to train the XGBoost model.

In [None]:
# model training settings
num_round = 10

### Train on CPU

In [None]:
%%time
startTime = time.time()

xgBoostModelCPU = xgb.train( dtrain = train_DataAndLabelsCPU, params = paramsCPU, num_boost_round = num_round )

CPUXGB = time.time() - startTime

### Train on GPU

### Exercise

In [None]:
# Use GPU DMatrix and parameters to train the model on GPU;
# store the model in xgBoostModelGPU and the elapsed time (in seconds) in GPUXGB

[Solution](#solution2)

In [None]:
print("Training on GPU is {:.2f}x faster than CPU".format(CPUXGB/GPUXGB))

<a id="predict"></a>
## Evaluate Model

Generate predictions and evaluate the model based on its accuracy score:

In [None]:
# Hint: use predict from xgboost and accuracy_score from sklearn

[Solution](#solution3)

## Conclusion

In this notebook, we showed how to use GPU DataFrames and XGBoost in RAPIDS.

To learn more about RAPIDS check out: 

* [RAPIDS Website](http://rapids.ai)
* [RAPIDS on GitHub](https://github.com/rapidsai/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)

## Solutions

<a id='solution1'></a>
#### Solution 1: Load Data GPU

In [None]:
%%time
startTime = time.time()

cudf_X_train = cudf.read_csv('X_train.csv', delimiter=',')
cudf_X_test = cudf.read_csv('X_test.csv', delimiter=',')
cudf_y_train = cudf.read_csv('y_train.csv', delimiter=',')
cudf_y_test = cudf.read_csv('y_test.csv', delimiter=',')

cuDFIngestion = time.time() - startTime

<a id='solution2'></a>
#### Solution 2: Train XGBoost on GPU

In [None]:
%%time
startTime = time.time()

xgBoostModelGPU = xgb.train( dtrain = train_DataAndLabelsGPU, params = paramsGPU, num_boost_round = num_round )

GPUXGB = time.time() - startTime

<a id='solution3'></a>
#### Solution 3: Evaluate Model

In [None]:
%%time
yPredTrainGPU = xgBoostModelGPU.predict(train_DataAndLabelsGPU)
yPredTestGPU = xgBoostModelGPU.predict(test_DataAndLabelsGPU)

In [None]:
print( 'GPU test accuracy: {0:.6f} '.format( accuracy_score(pd_y_test, yPredTestGPU) ))
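
Accuracy is a single number and does not show which class is being confused with which. `confusion_matrix`, imported at the top of this lab but not otherwise used, gives that breakdown. This is a minimal sketch on hand-made labels; `y_true_demo` and `y_pred_demo` are illustrative stand-ins for the true labels and the model's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made stand-ins for true labels and model predictions
y_true_demo = np.array([0, 0, 1, 1, 1])
y_pred_demo = np.array([0, 1, 1, 1, 0])

acc = accuracy_score(y_true_demo, y_pred_demo)   # fraction of matching labels
cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true class, columns: predicted class

print('accuracy:', acc)
print(cm)
```

With the lab's variables, the equivalent call would presumably be `confusion_matrix(pd_y_test, yPredTestGPU)`.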