<a id="introduction"></a>
## Introduction to Dask cuDF
#### By Paul Hendricks
-------

In this notebook, we will show how to work with cuDF DataFrames distributed across multiple GPUs using Dask.

**Table of Contents**

* [Introduction to Dask cuDF](#introduction)
* [Setup](#setup)
* [Dask cuDF Series Basics (coming soon!)](#series)
* [Dask cuDF DataFrame Basics (coming soon!)](#dataframes)
* [Input/Output (coming soon!)](#io)
* [Dask cuDF API (coming soon!)](#daskcudfapi)
* [Conclusion](#conclusion)

<a id="setup"></a>
## Setup

This notebook was tested using the following Docker containers:

* `nvcr.io/nvidia/rapidsai/rapidsai:0.8-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [NGC - rapidsai/rapidsai](https://ngc.nvidia.com/catalog/containers/nvidia:rapidsai:rapidsai)
* `rapidsai/rapidsai-nightly:0.9-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [DockerHub - rapidsai/rapidsai-nightly](https://hub.docker.com/r/rapidsai/rapidsai-nightly)

This notebook was run on NVIDIA Tesla V100 GPUs. Your system may differ, so you may need to modify the code or install additional packages to run the examples below.

If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks/issues

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [1]:
!nvidia-smi

Wed Jul 17 08:37:59 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   37C    P0    58W / 300W |    861MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    57W / 300W |   2470MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   

Next, let's see what CUDA version we have:

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130


<a id="series"></a>
## Dask cuDF Series Basics (coming soon!)

<a id="dataframes"></a>
## Dask cuDF DataFrame Basics (coming soon!)

#### Creating a Dask cuDF DataFrame from Dask DataFrame (coming soon!)

In [3]:
import numpy as np; print('NumPy Version:', np.__version__)
import pandas as pd; print('Pandas Version:', pd.__version__)


pandas_df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})
print(pandas_df)

NumPy Version: 1.16.2
Pandas Version: 0.23.4
     a    b
0    0  0.0
1    1  0.1
2    2  0.2
3    3  NaN
4    4  0.4
5    5  0.5
6    6  0.6
7    7  0.7
8    8  0.8
9    9  0.9
10  10  1.0


In [4]:
import dask; print('Dask Version:', dask.__version__)
import dask.dataframe as dd


# Request 8 partitions; with only 11 rows, Dask may create fewer (5 here).
dask_df = dd.from_pandas(pandas_df, npartitions=8)
dask_df

Dask Version: 2.1.0


Unnamed: 0_level_0,a,b
npartitions=5,Unnamed: 1_level_1,Unnamed: 2_level_1
0,int64,float64
2,...,...
...,...,...
8,...,...
10,...,...
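
The cell above requests `npartitions=8`, yet the result reports `npartitions=5` with divisions `0, 2, 4, 6, 8, 10`. The plain-Python sketch below is a simplified model that reproduces this count; it is not Dask's actual implementation, and the helper name `split_points` is hypothetical. It assumes the rows-per-partition value is rounded up and that a final single-row chunk is folded into the preceding partition.

```python
from math import ceil


def split_points(n_rows, npartitions):
    """Hypothetical helper: row offsets at which a sorted index of
    n_rows rows is cut when npartitions partitions are requested."""
    chunksize = ceil(n_rows / npartitions)  # rows per partition, rounded up
    cuts = list(range(0, n_rows, chunksize))
    # Assumption: a cut that would leave only the final row on its own
    # is dropped, so that row joins the preceding partition.
    if len(cuts) > 1 and cuts[-1] == n_rows - 1:
        cuts.pop()
    return cuts + [n_rows]


cuts = split_points(11, 8)
print(cuts)           # row offsets of the partition boundaries
print(len(cuts) - 1)  # number of partitions actually produced
```

Under the same model, the 100-row DataFrames used later in this notebook keep all 8 requested partitions, because 100 rows split into chunks of 13 leave no single-row remainder.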


In [5]:
import dask_cudf; print('Dask cuDF Version:', dask_cudf.__version__)


ddf = dask_cudf.from_dask_dataframe(dask_df)
ddf

Dask cuDF Version: 0.8.0+0.g8fa7bd3.dirty


Unnamed: 0_level_0,a,b
npartitions=5,Unnamed: 1_level_1,Unnamed: 2_level_1
0,int64,float64
2,...,...
...,...,...
8,...,...
10,...,...


#### Creating a Dask cuDF DataFrame from cuDF DataFrame (coming soon!)

In [6]:
import pandas as pd


pandas_df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})
print(pandas_df)

     a    b
0    0  0.0
1    1  0.1
2    2  0.2
3    3  NaN
4    4  0.4
5    5  0.5
6    6  0.6
7    7  0.7
8    8  0.8
9    9  0.9
10  10  1.0


In [7]:
import cudf; print('cuDF Version:', cudf.__version__)


df = cudf.from_pandas(pandas_df)
# df = cudf.DataFrame.from_pandas(pandas_df)  # alternative
print(df)

cuDF Version: 0.8.0+0.g8fa7bd3.dirty
   a    b
0  0  0.0
1  1  0.1
2  2  0.2
3  3     
4  4  0.4
5  5  0.5
6  6  0.6
7  7  0.7
8  8  0.8
9  9  0.9
[1 more rows]


In [8]:
ddf = dask_cudf.from_cudf(df, npartitions=8)
ddf

Unnamed: 0_level_0,a,b
npartitions=5,Unnamed: 1_level_1,Unnamed: 2_level_1
0,int64,float64
2,...,...
...,...,...
8,...,...
10,...,...


#### Inspecting a Dask cuDF DataFrame (coming soon!)

In [9]:
ddf

Unnamed: 0_level_0,a,b
npartitions=5,Unnamed: 1_level_1,Unnamed: 2_level_1
0,int64,float64
2,...,...
...,...,...
8,...,...
10,...,...


In [10]:
print(ddf)

<dask_cudf.DataFrame | 5 tasks | 5 npartitions>


In [11]:
print(ddf.compute())

   a    b
0  0  0.0
1  1  0.1
2  2  0.2
3  3     
4  4  0.4
5  5  0.5
6  6  0.6
7  7  0.7
8  8  0.8
9  9  0.9
[1 more rows]


In [12]:
print(type(ddf.compute()))

<class 'cudf.dataframe.dataframe.DataFrame'>


In [13]:
type(ddf)

dask_cudf.core.DataFrame

In [14]:
ddf.npartitions

5

<a id="io"></a>
## Input/Output (coming soon!)

#### Writing and Loading CSV Files

<a id="daskcudfapi"></a>
## Dask cuDF API (coming soon!)

#### Selecting Rows or Columns (coming soon!)

In [15]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [16]:
# Label-based column selection uses .loc; .iloc expects integer positions.
# ddf.loc[:, 'a']

In [17]:
# ddf.loc[:, ['a']]

In [18]:
# ddf.loc[:, ['a', 'b']]

#### Dropping Rows or Columns (coming soon!)

In [19]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [20]:
# ddf.drop('a', axis=1)

#### Defining New Columns (coming soon!)

In [21]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [22]:
# ddf['d'] = 

#### Missing Data (coming soon!)

In [23]:
df = cudf.DataFrame({'a': [0, None, 2, 3, 4, 5, 6, 7, 8, None, 10],
                     'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], 
                     'c': [0.0, 0.1, None, None, 0.4, 0.5, None, 0.7, 0.8, 0.9, 1.0]})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [24]:
print(ddf.compute())

   a    b    c
0  0  0.0  0.0
1     0.1  0.1
2  2  0.2     
3  3          
4  4  0.4  0.4
5  5  0.5  0.5
6  6  0.6     
7  7  0.7  0.7
8  8  0.8  0.8
9     0.9  0.9
[1 more rows]


In [25]:
new_ddf = ddf.fillna(-1)

In [26]:
print(new_ddf.compute())

    a     b     c
0   0   0.0   0.0
1  -1   0.1   0.1
2   2   0.2  -1.0
3   3  -1.0  -1.0
4   4   0.4   0.4
5   5   0.5   0.5
6   6   0.6  -1.0
7   7   0.7   0.7
8   8   0.8   0.8
9  -1   0.9   0.9
[1 more rows]


#### Boolean Indexing (coming soon!)

In [27]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [28]:
mask = ddf['a'] == 2
subset = ddf[mask]

In [29]:
subset.compute()

<cudf.DataFrame ncols=4 nrows=25 >

#### Sorting Data (coming soon!)

In [30]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [31]:
result = ddf.sort_values('d').compute()
print(result.head())

   a  b   c  d
0  3  0  99  1
1  3  0  98  2
2  3  1  97  3
3  3  1  96  4
4  3  1  95  5


In [32]:
# result = ddf.sort_values('c', ascending=False).compute()
# print(result.head())

In [33]:
result = ddf.sort_values(['a', 'b']).compute()
print(result.head())

   a  b   c   d
0  0  0   1  99
1  0  0   4  96
2  0  0   8  92
3  0  0   9  91
4  0  0  10  90


In [34]:
# result = ddf.sort_values(['a', 'b'], ascending=False).compute()
# print(result.head())

In [35]:
# result = ddf.sort_values(['a', 'b'], ascending=[False, True]).compute()
# print(result.head())

#### Statistical Operations (coming soon!)

In [36]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [37]:
ddf['a'].sum().compute()

150

In [38]:
# ddf.sum().compute()

#### Histogramming (coming soon!)

In [39]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [40]:
result = ddf['a'].value_counts().compute()
print(result)

0    25
1    25
2    25
3    25
dtype: int64
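
Conceptually, a distributed `value_counts` can be thought of as counting within each partition and then merging the partial counts. The sketch below illustrates that idea with plain Python lists standing in for GPU partitions. It is not how Dask cuDF is implemented, and the chunk size of 13 rows is an assumption chosen to yield 8 partitions.

```python
from collections import Counter

# Column 'a' from the cell above: each of 0, 1, 2, 3 repeated 25 times.
a = [value for value in range(4) for _ in range(25)]

# Assumed partitioning: contiguous chunks of at most 13 rows each.
chunk = 13
partitions = [a[i:i + chunk] for i in range(0, len(a), chunk)]

# Step 1: count values independently within each partition.
partial_counts = [Counter(part) for part in partitions]

# Step 2: merge the per-partition counts into a single result.
total = sum(partial_counts, Counter())

print(sorted(total.items()))
```

Because each partial count depends only on its own partition, step 1 can run on every partition in parallel; only the small `Counter` objects need to be brought together in step 2.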


#### Concatenations (coming soon!)

In [41]:
df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
df2 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
ddf1 = dask_cudf.from_cudf(df1, npartitions=8)
ddf2 = dask_cudf.from_cudf(df2, npartitions=8)

In [42]:
ddf = dask_cudf.concat([ddf1, ddf2], axis=0)
ddf.compute()

<cudf.DataFrame ncols=4 nrows=200 >

In [43]:
# df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
#                       'b': np.random.randint(2, size=100).astype(np.int32), 
#                       'c': np.arange(0, 100).astype(np.int32), 
#                       'd': np.arange(100, 0, -1).astype(np.int32)})
# df2 = cudf.DataFrame({'e': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
#                       'f': np.random.randint(2, size=100).astype(np.int32), 
#                       'g': np.arange(0, 100).astype(np.int32), 
#                       'h': np.arange(100, 0, -1).astype(np.int32)})
# ddf1 = dask_cudf.from_cudf(df1, npartitions=8)
# ddf2 = dask_cudf.from_cudf(df2, npartitions=8)

In [44]:
# ddf = dask_cudf.concat([ddf1, ddf2], axis=1)
# ddf.compute()

#### Joins (coming soon!)

In [45]:
df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
df2 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'e': np.arange(0, 100).astype(np.int32), 
                      'f': np.arange(100, 0, -1).astype(np.int32)})
ddf1 = dask_cudf.from_cudf(df1, npartitions=8)
ddf2 = dask_cudf.from_cudf(df2, npartitions=8)

In [46]:
ddf1.merge(ddf2, on=['a'])

Unnamed: 0_level_0,a,b_x,c,d,b_y,e,f
npartitions=8,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
,int32,int32,int32,int32,int32,int32,int32
,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...
,...,...,...,...,...,...,...


In [47]:
ddf1.merge(ddf2, on=['a', 'b'])

Unnamed: 0_level_0,a,b,c,d,e,f
npartitions=8,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
,int32,int32,int32,int32,int32,int32
,...,...,...,...,...,...
...,...,...,...,...,...,...
,...,...,...,...,...,...
,...,...,...,...,...,...


In [48]:
dask_cudf.DataFrame.merge(ddf1, ddf2, on=['a'])

Unnamed: 0_level_0,a,b_x,c,d,b_y,e,f
npartitions=8,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
,int32,int32,int32,int32,int32,int32,int32
,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...
,...,...,...,...,...,...,...


In [49]:
dask_cudf.DataFrame.merge(ddf1, ddf2, on=['a', 'b'])

Unnamed: 0_level_0,a,b,c,d,e,f
npartitions=8,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
,int32,int32,int32,int32,int32,int32
,...,...,...,...,...,...
...,...,...,...,...,...,...
,...,...,...,...,...,...
,...,...,...,...,...,...
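
The merge cells above display only the lazy result, so no rows are actually computed. As a rough check of how large the computed result of `on=['a']` would be, the plain-Python sketch below counts matching row pairs. It assumes an inner join (the pandas-style default for `how`) and models only the key column; it does not call Dask cuDF.

```python
from collections import Counter

# Key column 'a' in both frames: each of 0, 1, 2, 3 repeated 25 times.
left_keys = [i // 25 for i in range(100)]
right_keys = [i // 25 for i in range(100)]

left_counts = Counter(left_keys)
right_counts = Counter(right_keys)

# In an inner join, every left row with key k pairs with every right row
# with key k, so each shared key contributes left_count * right_count rows.
shared_keys = left_counts.keys() & right_counts.keys()
n_rows = sum(left_counts[k] * right_counts[k] for k in shared_keys)

print(n_rows)
```

Joining on a key with many duplicates multiplies rows quickly, which is worth keeping in mind before calling `.compute()` on a merged result.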


#### Appends (coming soon!)

#### Groupbys (coming soon!)

In [50]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [51]:
result = ddf.groupby('a').sum().compute()
print(result)

    b     c     d
2  13  1550   950
0  17   300  2200
3   8  2175   325
1  10   925  1575
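
The `c` and `d` sums above can be reproduced with a two-step partial aggregation: sum per group within each partition, then combine the partial sums. The plain-Python sketch below illustrates the idea and is not Dask cuDF's implementation. The chunk size of 13 rows and the helper name `partial_sums` are assumptions for illustration, and column `b` is omitted because it is random in the notebook.

```python
from collections import defaultdict

# Rebuild columns 'a', 'c', and 'd' from the cell above as (a, c, d) tuples.
rows = [(i // 25, i, 100 - i) for i in range(100)]

# Assumed partitioning: contiguous chunks of at most 13 rows each.
chunk = 13
partitions = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]


def partial_sums(part):
    """Hypothetical helper: per-group [sum of c, sum of d] for one partition."""
    sums = defaultdict(lambda: [0, 0])
    for a, c, d in part:
        sums[a][0] += c
        sums[a][1] += d
    return sums


# Combine the per-partition results into the final per-group sums.
combined = defaultdict(lambda: [0, 0])
for part in partitions:
    for key, (c_sum, d_sum) in partial_sums(part).items():
        combined[key][0] += c_sum
        combined[key][1] += d_sum

# Sort the keys for display; the output above shows that group order
# is not guaranteed in the computed result.
for key in sorted(combined):
    print(key, combined[key])
```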


In [52]:
result = ddf.groupby(['a', 'b']).sum().compute().to_pandas()
print(result)

        c     d
a b            
2 1   822   478
1 0   554   946
  1   371   629
3 1   693   107
0 1   232  1468
3 0  1482   218
2 0   728   472
0 0    68   732


#### One Hot Encoding (coming soon!)

<a id="conclusion"></a>
## Conclusion

In this notebook, we showed how to work with cuDF DataFrames distributed across multiple GPUs using Dask.

To learn more about RAPIDS, be sure to check out: 

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)
