<a id="introduction"></a>
## Introduction to Dask cuDF
#### By Paul Hendricks
-------

In this notebook, we will show how to work with cuDF DataFrames distributed across multiple GPUs using Dask.

**Table of Contents**

* [Introduction to Dask cuDF](#introduction)
* [Setup](#setup)
* [Dask cuDF DataFrame Basics](#daskcudfdataframes)
* [Input/Output](#io)
* [Dask cuDF API](#daskcudfapi)
* [Conclusion](#conclusion)

<a id="setup"></a>
## Setup

This notebook was tested using the following Docker containers:

* `rapidsai/rapidsai-nightly:0.8-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [DockerHub - rapidsai/rapidsai-nightly](https://hub.docker.com/r/rapidsai/rapidsai-nightly)

This notebook was run on NVIDIA Tesla V100 GPUs. Your system may differ, so you may need to modify the code or install additional packages to run the examples below.

If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks/issues

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [1]:
!nvidia-smi

Tue Jun 11 06:36:08 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0    41W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   

Next, let's see what CUDA version we have:

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
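
Outside a notebook, the two checks above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `run_if_available` is hypothetical and not part of RAPIDS. It assumes nothing about the machine and simply reports when `nvidia-smi` or `nvcc` is missing instead of raising an error.

```python
import shutil
import subprocess


def run_if_available(command):
    """Run `command` (a list of strings) and return its combined output.

    Returns None if the executable is not on PATH or exits with an error.
    This helper is hypothetical and only for illustration.
    """
    if shutil.which(command[0]) is None:
        return None
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    return result.stdout


# The same two commands the notebook runs with the `!` shell escape.
for command in (["nvidia-smi"], ["nvcc", "--version"]):
    output = run_if_available(command)
    if output is None:
        print(" ".join(command), "is not available on this system")
    else:
        print(output)
```

On a machine without a GPU driver or CUDA toolkit, the loop prints a short notice for each missing tool rather than failing.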


<a id="daskcudfdataframes"></a>
## Dask cuDF DataFrame Basics (coming soon!)

#### Creating a Dask cuDF DataFrame from a Dask DataFrame (coming soon!)

In [9]:
import pandas as pd

pandas_df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})
print(pandas_df)

     a    b
0    0  0.0
1    1  0.1
2    2  0.2
3    3  NaN
4    4  0.4
5    5  0.5
6    6  0.6
7    7  0.7
8    8  0.8
9    9  0.9
10  10  1.0


In [11]:
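import dask; print('Dask Version:', dask.__version__)
import dask.dataframe  # `import dask` alone does not load the dataframe submodule

# npartitions is a request; a small input may yield fewer partitions (5 here)
dask_df = dask.dataframe.from_pandas(pandas_df, npartitions=8)
dask_df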

Dask Version: 1.2.2


                   a        b
npartitions=5
0              int64  float64
2                ...      ...
...              ...      ...
8                ...      ...
10               ...      ...
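
The output above reports `npartitions=5` even though 8 were requested: for a small, sorted index, Dask may return fewer partitions than asked for. The index values listed in the output (0, 2, ..., 8, 10) are the partition boundaries, which Dask calls divisions. The sketch below is a plain-Python illustration of how divisions map index values to partitions; it is not Dask's implementation. The helper `partition_of` is hypothetical, and the middle boundaries (4 and 6) are an assumption, since the output elides them.

```python
from bisect import bisect_right

# Assumed boundaries: the output shows 0, 2, ..., 8, 10 for npartitions=5,
# so the evenly spaced middle values 4 and 6 are an assumption.
divisions = (0, 2, 4, 6, 8, 10)


def partition_of(index_value, divisions):
    """Return the number of the partition that holds `index_value`.

    Partition i covers [divisions[i], divisions[i + 1]); the last
    partition also includes the final boundary value.
    """
    if not divisions[0] <= index_value <= divisions[-1]:
        raise KeyError(index_value)
    last_partition = len(divisions) - 2
    return min(bisect_right(divisions, index_value) - 1, last_partition)


# Count how many of the 11 rows (index 0..10) land in each partition.
rows_per_partition = [0] * (len(divisions) - 1)
for index_value in range(11):
    rows_per_partition[partition_of(index_value, divisions)] += 1

print(rows_per_partition)
```

Under these assumed boundaries, six division values describe five partitions, and the last partition holds one extra row because its upper boundary is inclusive.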


In [13]:
import dask_cudf; print('Dask cuDF Version:', dask_cudf.__version__)


ddf = dask_cudf.from_dask_dataframe(dask_df)
ddf

Dask cuDF Version: 0.7.2+0.g3ebd286.dirty


                   a        b
npartitions=5
0              int64  float64
2                ...      ...
...              ...      ...
8                ...      ...
10               ...      ...


#### Creating a Dask cuDF DataFrame from a cuDF DataFrame (coming soon!)

In [14]:
import pandas as pd

pandas_df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})
print(pandas_df)

     a    b
0    0  0.0
1    1  0.1
2    2  0.2
3    3  NaN
4    4  0.4
5    5  0.5
6    6  0.6
7    7  0.7
8    8  0.8
9    9  0.9
10  10  1.0


In [15]:
import cudf

# Copy the pandas DataFrame into GPU memory as a cuDF DataFrame
df = cudf.from_pandas(pandas_df)
# df = cudf.DataFrame.from_pandas(pandas_df)  # alternative
print(df)

   a    b
0  0  0.0
1  1  0.1
2  2  0.2
3  3     
4  4  0.4
5  5  0.5
6  6  0.6
7  7  0.7
8  8  0.8
9  9  0.9
[1 more rows]


In [18]:
ddf = dask_cudf.from_cudf(df, npartitions=8)
ddf

                   a        b
npartitions=5
0              int64  float64
2                ...      ...
...              ...      ...
8                ...      ...
10               ...      ...


#### Inspecting a Dask cuDF DataFrame (coming soon!)

In [19]:
ddf

                   a        b
npartitions=5
0              int64  float64
2                ...      ...
...              ...      ...
8                ...      ...
10               ...      ...


In [23]:
print(ddf)

<dask_cudf.DataFrame | 5 tasks | 5 npartitions>


In [30]:
print(ddf.compute())

   a    b
0  0  0.0
1  1  0.1
2  2  0.2
3  3     
4  4  0.4
5  5  0.5
6  6  0.6
7  7  0.7
8  8  0.8
9  9  0.9
[1 more rows]


In [31]:
print(type(ddf.compute()))

<class 'cudf.dataframe.dataframe.DataFrame'>


In [22]:
type(ddf)

dask_cudf.core.DataFrame

In [21]:
ddf.npartitions

5

<a id="io"></a>
## Input/Output (coming soon!)

#### Writing and Loading CSV Files

<a id="daskcudfapi"></a>
## Dask cuDF API (coming soon!)

#### Selecting Rows or Columns (coming soon!)

#### Dropping Rows or Columns (coming soon!)

#### Defining New Columns (coming soon!)

#### Working with Missing Values (coming soon!)

#### Working with Indexes (coming soon!)

#### Sorting Values (coming soon!)

#### Merging DataFrames (coming soon!)

#### Concatenating DataFrames (coming soon!)

#### Aggregating with Groupbys (coming soon!)

#### One Hot Encoding (coming soon!)

<a id="conclusion"></a>
## Conclusion

In this notebook, we showed how to work with cuDF DataFrames distributed across multiple GPUs using Dask.

To learn more about RAPIDS, check out:

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)
