<a id="introduction"></a>
## Introduction to Dask using cuDF DataFrames
#### By Paul Hendricks
-------

In this notebook, we will show how to work with cuDF DataFrames using Dask.

**Table of Contents**

* [Introduction to Dask using cuDF DataFrames](#introduction)
* [Setup](#setup)
* [Using cuDF DataFrames with Dask](#using)
* [Conclusion](#conclusion)

<a id="setup"></a>
## Setup

This notebook was tested using the following Docker containers:

* `rapidsai/rapidsai-dev-nightly:0.10-cuda10.0-devel-ubuntu18.04-py3.7` container from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-nightly)

This notebook was run on the NVIDIA GV100 GPU. Please be aware that your system may be different and you may need to modify the code or install packages to run the examples below.

If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks-contrib/issues

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [None]:
!nvidia-smi

Next, let's see what CUDA version we have:

In [None]:
!nvcc --version

## Install graphviz
The visualizations in this notebook require graphviz. Your environment may not have it installed, but don't worry! If it's missing, the cell below installs it. This can take a little while, so sit tight.

In [None]:
import os

try:
    import graphviz
except ModuleNotFoundError:
    # install the graphviz system binaries first, then the Python bindings
    os.system('apt update')
    os.system('apt install -y graphviz')
    os.system('conda install -c conda-forge graphviz -y')
    os.system('conda install -c conda-forge python-graphviz -y')

<a id="using"></a>
## Using cuDF DataFrames with Dask

Let's start by creating a local cluster of workers and a client to interact with that cluster.

In [None]:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster


# create a local CUDA cluster
cluster = LocalCUDACluster()
client = Client(cluster)
client

We'll define a function called `load_data` that will create a `cudf.DataFrame` with two columns, `key` and `value`. The column `key` will be randomly filled with either a 0 or a 1, with 50% probability of either number being selected. The column `value` will be randomly filled with numbers sampled from a normal distribution.

In [None]:
import cudf
import numpy as np

print('cuDF Version:', cudf.__version__)
print('NumPy Version:', np.__version__)


def load_data(n_rows):
    df = cudf.DataFrame()
    random_state = np.random.RandomState(43210)
    df['key'] = random_state.binomial(n=1, p=0.5, size=(n_rows,))
    df['value'] = random_state.normal(size=(n_rows,))
    return df

We'll also define a function `head` that takes a `cudf.DataFrame` and returns the first 5 rows.

In [None]:
def head(dataframe):
    return dataframe.head()

We'll define the number of workers as well as the number of rows each dataframe will have.

In [None]:
# define the number of workers
n_workers = 4  # feel free to change this depending on how many GPUs you have

# define the number of rows each dataframe will have
n_rows = 125000000  # we'll use 125 million rows in each dataframe

We'll create each dataframe using the `delayed` operator. 

In [None]:
from dask.delayed import delayed


# create each dataframe using a delayed operation
dfs = [delayed(load_data)(n_rows) for _ in range(n_workers)]
dfs

We see the result of this operation is a list of `Delayed` objects. It's important to note that these operations are "delayed" - nothing has been computed yet, meaning our data has not yet been created!
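
To make the "nothing has been computed yet" point concrete, here is a minimal CPU-only sketch. It assumes only that `dask` is installed; `make_partition` and the `calls` list are hypothetical stand-ins for `load_data`, and we pass `scheduler="synchronous"` so the work runs in the local process rather than on a cluster.

```python
from dask import delayed

calls = []  # records each time the function body actually runs


def make_partition(n_rows):
    # hypothetical stand-in for load_data; builds a plain Python list
    calls.append(n_rows)
    return list(range(n_rows))


# building the Delayed objects only records the work to be done
parts = [delayed(make_partition)(3) for _ in range(4)]
calls_before = list(calls)

# computing one of them runs the function for that object only
first = parts[0].compute(scheduler="synchronous")
calls_after = list(calls)

print(calls_before)  # no calls were recorded while building `parts`
print(calls_after)   # one call was recorded after computing parts[0]
```

The function body only runs once we ask for a result, and only for the `Delayed` object we asked about.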

We can apply the `head` function to each of our "delayed" dataframes.

In [None]:
head_dfs = [delayed(head)(df) for df in dfs]
head_dfs

As before, we see that the result is a list of `Delayed` objects - an important thing to note is that our "key", or unique identifier for each operation, has changed. You should see the name of the function `head` followed by a hyphen and a unique token. For example, one might see:

```
[Delayed('head-8e946db2-feaf-4e79-99ab-f732b6e28461'),
 Delayed('head-eb06bc77-9d5c-4a47-8c01-b5b36710b727'),
 Delayed('head-e1c976c8-3f94-4a01-8300-41def5117f93'),
 Delayed('head-7d0a7201-a973-4846-a68f-cb6f85b25076')]
```
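
If you'd like to inspect these keys directly, a `Delayed` object exposes one through its `key` attribute. The sketch below is CPU-only and uses a hypothetical list-based `head` rather than the cuDF version; it assumes Dask's default settings, under which every delayed call gets a fresh token.

```python
from dask import delayed


def head(rows):
    # hypothetical CPU stand-in for the notebook's `head`
    return rows[:5]


# two delayed calls with identical inputs
a = delayed(head)([1, 2, 3])
b = delayed(head)([1, 2, 3])

print(a.key)  # the function name, a hyphen, then a unique token
print(b.key)  # same prefix, but a different token
```

Even though both calls use the same function and the same input, each one is tracked as its own operation with its own key.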

Again, nothing has been computed - let's compute the results and execute the workflow using the `client.compute()` method.

In [None]:
from dask.distributed import wait


# use the client to compute - this means create each dataframe and take the head
futures = client.compute(head_dfs)
wait(futures)  # this will give Dask time to execute the work before moving to any subsequently defined operations
futures

We see that our results are a list of futures. Each object in this list tells us a bit of information about itself: the status (pending, error, finished), the type of the object, and the key (unique identifier).

We can use the `client.gather` method to collect the results of each of these futures.

In [None]:
# collect the results
results = client.gather(futures)
results
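
If futures are new to you, it may help to know that Dask's futures follow the same general model as Python's standard `concurrent.futures` module. The sketch below is only a rough analogy using a thread pool, not Dask itself: `pool.submit` plays the role of `client.compute`, `wait` plays the role of `dask.distributed.wait`, and calling `result()` on each future plays the role of `client.gather`. The `head` function and `partitions` list are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def head(rows):
    # hypothetical stand-in for the notebook's `head`
    return rows[:5]


partitions = [list(range(start, start + 10)) for start in (0, 100, 200, 300)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit returns immediately with one future per task
    futures = [pool.submit(head, part) for part in partitions]
    # block until every future has finished
    done, not_done = wait(futures)
    # pull the finished values back into this process
    results = [future.result() for future in futures]

print(results[0])
```

In both cases a future is a handle to work that may still be running, and collecting the value is a separate, explicit step.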

We see that our results are a list of cuDF DataFrames, each having 2 columns and 5 rows. Let's inspect the first dataframe:

In [None]:
# let's inspect the head of the first dataframe
print(results[0])

Voila! 

That was a pretty simple example. Let's see how we can use this to perform a more complex operation, like figuring out how many total rows we have across all of our dataframes. We'll define a function called `length` that will take a `cudf.DataFrame` and return the first value of the `shape` attribute, i.e. the number of rows for that particular dataframe.

In [None]:
def length(dataframe):
    return dataframe.shape[0]

We'll define our operation on the dataframes we've created:

In [None]:
lengths = [delayed(length)(df) for df in dfs]

And then use Python's built-in `sum` function to sum all of these lengths.

In [None]:
total_number_of_rows = delayed(sum)(lengths)

At this point, `total_number_of_rows` hasn't been computed yet. But we can still visualize the graph of operations we've defined using the `visualize()` method.

In [None]:
total_number_of_rows.visualize()

The graph can be read from bottom to top. We see that for each worker, we will first execute the `load_data` function to create each dataframe. Then the function `length` will be applied to each dataframe; the results from these operations on each worker will then be combined into a single result via the `sum` function. 
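
The same load, measure, and sum shape can be tried without a GPU. This is a minimal sketch under a few assumptions: `load_rows` is a hypothetical stand-in for `load_data` that returns a plain list, the partition sizes are made up, and `scheduler="synchronous"` keeps the computation in the local process.

```python
from dask import delayed


def load_rows(n_rows):
    # hypothetical stand-in for load_data
    return list(range(n_rows))


def length(rows):
    return len(rows)


# bottom of the graph: one load task per partition
parts = [delayed(load_rows)(n) for n in (3, 5, 7, 9)]
# middle of the graph: one length task per partition
lengths = [delayed(length)(part) for part in parts]
# top of the graph: a single sum over all of the lengths
total = delayed(sum)(lengths)

total_rows = total.compute(scheduler="synchronous")
print(total_rows)  # 3 + 5 + 7 + 9
```

Each layer only depends on the layer below it, which is why the per-partition work can run independently before the final `sum`.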

Let's now execute our workflow and compute a value for the `total_number_of_rows` variable.

In [None]:
# use the client to compute the result and wait for it to finish
future = client.compute(total_number_of_rows)
wait(future)
future

We see that our computation has finished - our result is of type `int`. We can collect our result using the `client.gather()` method.

In [None]:
# collect result
result = client.gather(future)
result

That's all there is to it! We can define even more complex operations and workflows using cuDF DataFrames by using the `delayed`, `client.compute()`, `wait`, and `client.gather()` workflow.

However, there can sometimes be a drawback to using this pattern. For example, consider a common operation such as a groupby - we might want to group on certain keys and aggregate the values to compute a mean, variance, or even more complex aggregations. Each dataframe is located on a different GPU - and we're not guaranteed that all of the keys necessary for that groupby operation are located on a single GPU; that is, keys may be scattered across multiple GPUs.

To make our problem even more concrete, let's consider the simple operation of grouping on our `key` column and calculating the mean of the `value` column. To solve this problem, we'd have to sort the data and transfer keys and their associated values from one GPU to another - a tricky thing to do using the delayed pattern. In the example below, we'll show this issue with the delayed pattern and motivate why one might consider using the `dask_cudf` API.

First, let's define a function `groupby` that takes a `cudf.DataFrame`, groups by the `key` column, and calculates the mean of the `value` column.

In [None]:
def groupby(dataframe):
    return dataframe.groupby('key')['value'].mean()

We'll apply the function `groupby` to each dataframe using the `delayed` operation.

In [None]:
groupbys = [delayed(groupby)(df) for df in dfs]

We'll then execute that operation:

In [None]:
# use the client to compute the result and wait for it to finish
groupby_dfs = client.compute(groupbys)
wait(groupby_dfs)
groupby_dfs

In [None]:
results = client.gather(groupby_dfs)
results

In [None]:
for i, result in enumerate(results):
    print('cuDF DataFrame:', i)
    print(result)

This isn't exactly what we wanted though - ideally, we'd get one dataframe where for each unique key (0 and 1), we get the mean of the `value` column.
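
To see why the per-dataframe results can't simply be averaged together, consider the plain-Python sketch below. It is one common way to combine partial aggregates (keep a sum and a count per key, then divide once at the end) and is not meant to describe how any particular library implements its groupby. The `partitions` data is made up for illustration.

```python
from collections import defaultdict

# two hypothetical partitions of (key, value) rows
partitions = [
    [(0, 1.0), (0, 3.0), (1, 10.0)],
    [(0, 5.0), (1, 20.0), (1, 30.0), (1, 40.0)],
]


def partial_aggregate(rows):
    # per-partition running sum and count for each key
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return {key: (s, c) for key, (s, c) in acc.items()}


def combine(partials):
    # add up sums and counts across partitions, then divide once
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            total[key][0] += s
            total[key][1] += c
    return {key: s / c for key, (s, c) in total.items()}


partials = [partial_aggregate(rows) for rows in partitions]

# what the delayed pattern gives us: one set of means per partition
per_partition_means = [
    {key: s / c for key, (s, c) in partial.items()} for partial in partials
]
# what we actually want: one mean per key over all rows
global_means = combine(partials)

print(per_partition_means)
print(global_means)
```

Here key `1` has means of 10.0 and 30.0 in the two partitions, but its mean over all rows is 25.0 rather than 20.0, because the partitions hold different numbers of rows for that key.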

We can use the `dask_cudf` API to help us solve this problem. First we'll import the `dask_cudf` library and then use the `dask_cudf.from_delayed` function to convert our list of delayed dataframes to an object of type `dask_cudf.core.DataFrame`. We'll use this object - `distributed_df` - along with the `dask_cudf` API to perform that "tricky" groupby operation.

In [None]:
import dask_cudf

print('Dask cuDF Version:', dask_cudf.__version__)


# create a distributed cuDF DataFrame using Dask
distributed_df = dask_cudf.from_delayed(dfs)
print('Type:', type(distributed_df))
distributed_df

The `dask_cudf` API closely mirrors the `cuDF` API. We can use a groupby similar to how we would with cuDF - but this time, our operation is distributed across multiple GPUs!

In [None]:
result = distributed_df.groupby('key')['value'].mean().compute()
result

Lastly, let's examine our result!

In [None]:
print(result)

<a id="conclusion"></a>
## Conclusion

In this notebook, we showed how to work with cuDF DataFrames using Dask.

To learn more about RAPIDS, be sure to check out: 

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)