# Supervised Learning
#### By Paul Hendricks
-------

In this notebook, we will show how to quickly set up Dask, load synthetic data into cuDF, and train an XGBoost model on the GPU.

**Table of Contents**

* Introduction to Supervised Learning
  * Classification
  * Regression
* Setup
* Classification
  * Generating Data
  * XGBoost
  * RandomForest
  * K Nearest Neighbors
* Regression
  * Generating Data
  * Linear Regression
  * Ridge Regression
  * Stochastic Gradient Descent
* Train Model
* Conclusion

Before going any further, let's make sure we have access to `matplotlib`, a popular Python library for data visualization.

In [None]:
import os

try:
    import matplotlib; print('Matplotlib Version:', matplotlib.__version__)
except ModuleNotFoundError:
    os.system('conda install -y matplotlib')

## Introduction to Supervised Learning

Supervised learning trains a model on examples whose correct outputs (labels) are already known, so that the model can predict the output for new, unseen inputs. When the output is a discrete category, the task is classification; when it is a continuous value, the task is regression.

## Setup

This notebook was tested using the following Docker containers:

* `rapidsai/rapidsai:0.6-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai)
* `rapidsai/rapidsai-nightly:0.6-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-nightly)

This notebook was run on the NVIDIA Tesla V100 GPU. Please be aware that your system may be different and you may need to modify the code or install packages to run the examples below.

If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks/issues

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [None]:
!nvidia-smi

Next, let's see what CUDA version we have:

In [None]:
!nvcc --version

Next, let's load some helper functions from `matplotlib` and configure the Jupyter Notebook for visualization.

In [None]:
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt


%matplotlib inline

## Setup Dask

Dask is a library that allows for parallelized computing. Written in Python, it schedules tasks dynamically and handles large data structures similar to those found in NumPy and Pandas. In the subsequent tutorials, we'll show how to use Dask with Pandas and cuDF and how we can use both to accelerate common ETL tasks as well as build ML models like XGBoost.

To learn more about Dask, check out the documentation here: http://docs.dask.org/en/latest/

Dask operates using a concept of a "Client" and "workers". The client tells the workers what tasks to perform and when to perform them. Typically, we set the number of workers to be equal to the number of computing resources we have available to us. For example, we might set `n_workers = 8` if we have 8 CPU cores on our machine that can each operate in parallel. This allows us to take advantage of all of our computing resources and get the most benefit from parallelization.
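
The client/worker split can be illustrated without a GPU. The sketch below is an analogy built on Python's standard-library `concurrent.futures`, not Dask itself: the main thread plays the role of the client, and a pool sized to the number of CPU cores plays the role of the workers. The `square` task and the variable names are hypothetical and chosen only for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """A stand-in task that a worker executes."""
    return x * x


# one worker per available CPU core (fall back to 1 if the count is unknown)
n_workers = os.cpu_count() or 1

# the main thread acts like the client: it hands tasks to the pool of workers
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(square, range(8)))

print('Workers:', n_workers)
print('Results:', results)
```

Sizing the pool to the number of cores follows the same rule of thumb described above; with Dask, the comparable choice is how many workers the cluster starts.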

Dask is a first-class citizen in the world of general-purpose GPU computing, and the RAPIDS ecosystem makes it very easy to use Dask with cuDF and XGBoost. As we see below, we can initiate a Cluster and Client using only a few lines of code.

In [None]:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster


# create Dask cluster and client
cluster = LocalCUDACluster()
client = Client(cluster)

Now, let's show our current Dask status. We should see the IP address of our Scheduler as well as the number of workers in our Cluster.

In [None]:
# show current Dask status
client

You can also see the status and more information at the Dashboard, found at `http://<scheduler_uri>/status`. You can ignore this for now; we'll dive into it in subsequent tutorials.

## Generating Data

We'll generate some fake data using the `make_moons` function from the `sklearn.datasets` module. This function generates data points from two equations, each describing a half circle with a unique center. Since each data point is generated by one of these two equations, the cluster each data point belongs to is clear. The ideal classification algorithm will identify two clusters and associate each data point with the equation that generated it.
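
To make the "two equations" concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation: the helper `make_two_moons` is a hypothetical name, and it draws angles at random rather than spacing them evenly. Class 0 lies on an upper half circle centered at (0, 0), class 1 on a lower half circle centered at (1, 0.5), and Gaussian noise is added to every coordinate.

```python
import math
import random


def make_two_moons(n_samples, noise, seed=0):
    """Sample points from two interleaving half circles plus Gaussian noise."""
    rng = random.Random(seed)
    n_outer = n_samples // 2
    n_inner = n_samples - n_outer
    points, labels = [], []
    for label, count in ((0, n_outer), (1, n_inner)):
        for _ in range(count):
            t = rng.uniform(0.0, math.pi)
            if label == 0:
                # upper half circle centered at (0, 0)
                x0, x1 = math.cos(t), math.sin(t)
            else:
                # lower half circle centered at (1, 0.5)
                x0, x1 = 1.0 - math.cos(t), 0.5 - math.sin(t)
            # perturb both coordinates with zero-mean Gaussian noise
            points.append((x0 + rng.gauss(0.0, noise), x1 + rng.gauss(0.0, noise)))
            labels.append(label)
    return points, labels


points, labels = make_two_moons(n_samples=1000, noise=0.5)
print(len(points), labels.count(0), labels.count(1))
```

With `noise=0.0` every point sits exactly on its half circle; larger values blur the two shapes into each other, which makes the classification task harder.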

In [None]:
import sklearn; print('Scikit-Learn Version:', sklearn.__version__)

In [None]:
from sklearn.datasets import make_moons

In [None]:
# settings
n_samples = int(1e5)
noise = 0.50
visualize = n_samples <= int(1e5)

# create data
X, y = make_moons(n_samples=n_samples, noise=noise, random_state=0)
print(X.shape)

In [None]:
if visualize:
    # plot each class with its own color and marker
    plt.scatter(X[y == 0, 0], X[y == 0, 1],
                edgecolor='black',
                c='red', marker='s', s=40, label='class 1')
    plt.scatter(X[y == 1, 0], X[y == 1, 1],
                edgecolor='black',
                c='lightblue', marker='o', s=40, label='class 2')
    plt.legend()
    plt.tight_layout()
    plt.show()

## Load Data into cuDF

Next, let's load our NumPy arrays into a Pandas DataFrame.

In [None]:
import pandas as pd; print('pandas Version:', pd.__version__)

In [None]:
# convert the NumPy arrays into a Pandas DataFrame
X_df, y_df = pd.DataFrame(X), pd.DataFrame(y)

In [None]:
X_df.columns = ['feature_' + str(i) for i in range(2)]
y_df.columns = ['label']

In [None]:
print(X_df.head())

In [None]:
print(y_df.head())

We can now load our data into a cuDF DataFrame.

In [None]:
import cudf; print('cuDF Version:', cudf.__version__)

In [None]:
# convert the Pandas DataFrame into a cuDF DataFrame
X_gdf = cudf.DataFrame.from_pandas(X_df)
y_gdf = cudf.DataFrame.from_pandas(y_df)

Lastly, let's load our data into a Dask cuDF DataFrame.

In [None]:
import dask_cudf; print('Dask cuDF Version:', dask_cudf.__version__)

In [None]:
# convert the cuDF DataFrame into a Dask cuDF DataFrame
X_dgdf = dask_cudf.from_cudf(X_gdf, npartitions=8)
y_dgdf = dask_cudf.from_cudf(y_gdf, npartitions=8)

## Set Parameters

There are a number of parameters that can be set before running XGBoost.

* General parameters select which booster to use, commonly a tree or linear model.
* Booster parameters depend on which booster you have chosen.
* Learning task parameters decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks.

For more information on the configurable parameters within the XGBoost module, see the documentation here:

https://xgboost.readthedocs.io/en/latest/parameter.html
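
As a rough illustration of the three groups, the sketch below sorts a handful of XGBoost parameters by type and then merges them into the single flat dictionary that XGBoost expects. The grouping variables are hypothetical names used only here, and the `binary:logistic` objective and `error` metric are illustrative choices, not the settings this notebook trains with.

```python
# general parameters: which booster to use
general_params = {'booster': 'gbtree'}

# booster parameters: control how each tree is grown
booster_params = {
    'max_depth': 8,
    'eta': 0.1,
    'gamma': 0.1,
    'subsample': 1,
    'min_child_weight': 30,
}

# learning task parameters: define the objective and the evaluation metric
task_params = {'objective': 'binary:logistic', 'eval_metric': 'error'}

# XGBoost takes one flat dictionary, so merge the three groups
example_params = {**general_params, **booster_params, **task_params}
print(example_params)
```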

In [None]:
dxgb_gpu_params = {
    'nround':            1000,
    'max_depth':         8,
    'max_leaves':        2**8,
    'alpha':             0.9,
    'eta':               0.1,
    'gamma':             0.1,
    'learning_rate':     0.1,
    'subsample':         1,
    'reg_lambda':        1,
    'scale_pos_weight':  2,
    'min_child_weight':  30,
    'tree_method':       'gpu_hist',
    'n_gpus':            1,
    'distributed_dask':  True,
    'loss':              'ls',
    'objective':         'gpu:reg:linear',
    'max_features':      'auto',
    'criterion':         'friedman_mse',
    'grow_policy':       'lossguide',
    'verbose':           True
}

## Training

Now it's time to train our model! We can use the `dxgb_gpu.train` function and pass in the Dask client, the parameters, the training features and labels, and the number of boosting iterations. For more information on the parameters that can be passed into `dxgb_gpu.train`, check out the documentation:

https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train

In [None]:
import dask_xgboost as dxgb_gpu

In [None]:
%%time
# train on the Dask cuDF features and labels for `nround` boosting iterations
bst = dxgb_gpu.train(client, dxgb_gpu_params, X_dgdf, y_dgdf, num_boost_round=dxgb_gpu_params['nround'])

In [None]:
bst

## Scoring

In [None]:
y_hat = dxgb_gpu.predict(client, bst, X_dgdf)

In [None]:
y_hat_np = y_hat.compute().to_array()

In [None]:
y_hat_np_thresholded = 1 * (y_hat_np > 0.5)

In [None]:
from sklearn.metrics import accuracy_score


print('Accuracy Score:', accuracy_score(y, y_hat_np_thresholded))
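
The thresholding and accuracy steps above can be sketched in plain Python on a toy example. The helpers `threshold` and `accuracy` and the toy values are hypothetical and used only for illustration: a score above 0.5 is mapped to class 1, and accuracy is the fraction of predictions that match the true labels.

```python
def threshold(scores, cutoff=0.5):
    """Map each continuous score to a 0/1 class prediction."""
    return [1 if score > cutoff else 0 for score in scores]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    matches = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return matches / len(y_true)


toy_scores = [0.1, 0.7, 0.4, 0.9]
toy_labels = [0, 1, 1, 1]

toy_pred = threshold(toy_scores)
print(toy_pred)                        # [0, 1, 0, 1]
print(accuracy(toy_labels, toy_pred))  # 0.75
```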

In [None]:
if visualize:
    # plot the points again, colored by predicted class
    plt.scatter(X[y_hat_np_thresholded == 0, 0],
                X[y_hat_np_thresholded == 0, 1],
                edgecolor='black',
                c='red', marker='s', s=40, label='class 1')
    plt.scatter(X[y_hat_np_thresholded == 1, 0],
                X[y_hat_np_thresholded == 1, 1],
                edgecolor='black',
                c='lightblue', marker='o', s=40, label='class 2')
    plt.legend()
    plt.tight_layout()
    plt.show()

## Conclusion

To learn more about RAPIDS, be sure to check out: 

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)
