<a id="introduction"></a>
## Introduction to Dask cuDF
#### By Paul Hendricks
-------

In this notebook, we will show how to work with cuDF DataFrames distributed across multiple GPUs using Dask.

**Table of Contents**

* [Introduction to Dask cuDF](#introduction)
* [Setup](#setup)
* [Dask cuDF Series Basics](#series)
* [Dask cuDF DataFrame Basics](#dataframes)
* [Input/Output](#io)
* [Dask cuDF API](#daskcudfapi)
* [Conclusion](#conclusion)

<a id="setup"></a>
## Setup

This notebook was tested using the following Docker container:

* `rapidsai/rapidsai-dev-nightly:0.10-cuda10.0-devel-ubuntu18.04-py3.7` container from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-nightly)

This notebook was run on an NVIDIA GV100 GPU. Your system may differ, so you may need to modify the code or install additional packages to run the examples below.

If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks-contrib/issues

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [None]:
!nvidia-smi

Next, let's see what CUDA version we have:

In [None]:
!nvcc --version

#### Creating a Dask cuDF DataFrame from a Dask DataFrame (coming soon!)

In [None]:
import numpy as np; print('NumPy Version:', np.__version__)
import pandas as pd; print('Pandas Version:', pd.__version__)


pandas_df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})
print(pandas_df)

In [None]:
import dask; print('Dask Version:', dask.__version__)
import dask.dataframe as dd


dask_df = dd.from_pandas(pandas_df, npartitions=8)
dask_df

In [None]:
import dask_cudf; print('Dask cuDF Version:', dask_cudf.__version__)


ddf = dask_cudf.from_dask_dataframe(dask_df)
ddf

#### Creating a Dask cuDF DataFrame from a cuDF DataFrame (coming soon!)

In [None]:
import pandas as pd


pandas_df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})
print(pandas_df)

In [None]:
import cudf; print('cuDF Version:', cudf.__version__)


df = cudf.from_pandas(pandas_df)
# df = cudf.DataFrame.from_pandas(pandas_df)  # alternative
print(df)

In [None]:
ddf = dask_cudf.from_cudf(df, npartitions=8)
ddf

#### Inspecting a Dask cuDF DataFrame (coming soon!)

In [None]:
ddf

In [None]:
print(ddf)

In [None]:
# compute() executes the lazy task graph and gathers the partitions into a single cuDF DataFrame
print(ddf.compute())

In [None]:
print(type(ddf.compute()))

In [None]:
type(ddf)

In [None]:
ddf.npartitions
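
Note that `npartitions` is a request rather than a guarantee: `dask_cudf.from_cudf` builds on Dask's `from_pandas`, whose documentation states that the result may have fewer partitions than requested, depending on the size and index of the DataFrame. The pure-Python sketch below is a simplified model of that effect, assuming rows are cut into contiguous chunks of `ceil(n_rows / npartitions)` rows. The helper `estimate_partitions` is hypothetical and is not Dask's actual algorithm, so `ddf.npartitions` remains the authoritative value.

```python
import math


def estimate_partitions(n_rows, requested):
    """Simplified model (an assumption, not Dask's code): equal contiguous chunks."""
    chunk_size = math.ceil(n_rows / requested)  # rows per chunk, rounded up
    return math.ceil(n_rows / chunk_size)       # chunks needed to cover all rows


# The 11-row frame used above, with 8 partitions requested.
small = estimate_partitions(11, 8)
# The 100-row frames used later in this notebook, with 8 partitions requested.
large = estimate_partitions(100, 8)
print(small, large)
```

Under this model the 11-row frame ends up with fewer chunks than requested, because chunks of 2 rows cover all 11 rows before 8 chunks are reached.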

<a id="io"></a>
## Input/Output (coming soon!)

#### Writing and Loading CSV Files
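
As a rough sketch of the loading half of this workflow, the example below writes a few small CSV files with Python's standard `csv` module and then reads them all at once with `dask_cudf.read_csv`, which accepts a glob pattern. The temporary directory, the `part-N.csv` file names and the column values are made up for illustration. The `try`/`except` guard is only there so the file-writing part still runs on a machine without RAPIDS installed.

```python
import csv
import os
import tempfile

try:
    import dask_cudf  # needs the RAPIDS stack and a CUDA-capable GPU
except ImportError:
    dask_cudf = None  # lets the file-writing part run on CPU-only machines

# Write three small CSV files that share the same header (illustrative data).
csv_dir = tempfile.mkdtemp()
for part in range(3):
    path = os.path.join(csv_dir, f"part-{part}.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["a", "b"])
        for i in range(4):
            row_id = part * 4 + i
            writer.writerow([row_id, row_id / 10])

csv_files = sorted(os.listdir(csv_dir))
print(csv_files)

if dask_cudf is not None:
    # One glob pattern loads every file into a single lazy Dask cuDF DataFrame.
    ddf_csv = dask_cudf.read_csv(os.path.join(csv_dir, "part-*.csv"))
    print(ddf_csv.compute())
```

Writing is done with the standard library here on purpose: it keeps the sketch independent of which CSV writers a given RAPIDS release supports.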

<a id="daskcudfapi"></a>
## Dask cuDF API

#### Selecting Rows or Columns

In [None]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.int64), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [None]:
ddf.iloc[:, 1]

In [None]:
# iloc only accepts integer positions, so select columns by label with [] instead
ddf[['a']]

In [None]:
ddf[['a', 'b']]

#### Dropping Rows or Columns (coming soon!)

In [None]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [None]:
# ddf.drop('a', axis=1)

#### Defining New Columns (coming soon!)

In [None]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [None]:
# Illustrative example: derive a new column from two existing ones
ddf['d'] = ddf['a'] + ddf['b']

#### Missing Data (coming soon!)

In [None]:
df = cudf.DataFrame({'a': [0, None, 2, 3, 4, 5, 6, 7, 8, None, 10],
                     'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], 
                     'c': [0.0, 0.1, None, None, 0.4, 0.5, None, 0.7, 0.8, 0.9, 1.0]})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [None]:
print(ddf.compute())

In [None]:
new_ddf = ddf.fillna(-1)

In [None]:
print(new_ddf.compute())

#### Boolean Indexing (coming soon!)

In [None]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [None]:
mask = ddf['a'] == 2
subset = ddf[mask]

In [None]:
subset.compute()

#### Sorting Data (coming soon!)

In [None]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [None]:
result = ddf.sort_values('d').compute()
print(result.head())

In [None]:
# result = ddf.sort_values('c', ascending=False).compute()
# print(result.head())

In [None]:
result = ddf.sort_values(['a', 'b']).compute()
print(result.head())

In [None]:
# result = ddf.sort_values(['a', 'b'], ascending=False).compute()
# print(result.head())

In [None]:
# result = ddf.sort_values(['a', 'b'], ascending=[False, True]).compute()
# print(result.head())

#### Statistical Operations (coming soon!)

In [None]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [None]:
ddf['a'].sum().compute()
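
Reductions such as `sum` are typically evaluated on each partition first, and the partial results are then combined into the final answer. The pure-Python sketch below models that two-step pattern on the values of column `a`. The chunking helper `split_into_chunks` is hypothetical and only mirrors the idea; it is not how Dask assigns rows to partitions internally.

```python
def split_into_chunks(values, n_chunks):
    """Hypothetical helper: cut a list into contiguous chunks of near-equal size."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


# Same values as column 'a': each of 0, 1, 2, 3 repeated 25 times.
column_a = [value for value in (0, 1, 2, 3) for _ in range(25)]

# Step 1: one partial sum per chunk. Step 2: combine the partial sums.
partial_sums = [sum(chunk) for chunk in split_into_chunks(column_a, 8)]
total = sum(partial_sums)
print(partial_sums, total)
```

The combined total should match the value returned by `ddf['a'].sum().compute()`, since the order and grouping of the additions do not change an integer sum.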

In [None]:
# ddf.sum().compute()

#### Histogramming (coming soon!)

In [None]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [None]:
result = ddf['a'].value_counts().compute()
print(result)

#### Concatenations (coming soon!)

In [None]:
df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
df2 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
ddf1 = dask_cudf.from_cudf(df1, npartitions=8)
ddf2 = dask_cudf.from_cudf(df2, npartitions=8)

In [None]:
ddf = dask_cudf.concat([ddf1, ddf2], axis=0)
ddf.compute()

In [None]:
# df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
#                       'b': np.random.randint(2, size=100).astype(np.int32), 
#                       'c': np.arange(0, 100).astype(np.int32), 
#                       'd': np.arange(100, 0, -1).astype(np.int32)})
# df2 = cudf.DataFrame({'e': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
#                       'f': np.random.randint(2, size=100).astype(np.int32), 
#                       'g': np.arange(0, 100).astype(np.int32), 
#                       'h': np.arange(100, 0, -1).astype(np.int32)})
# ddf1 = dask_cudf.from_cudf(df1, npartitions=8)
# ddf2 = dask_cudf.from_cudf(df2, npartitions=8)

In [None]:
# ddf = dask_cudf.concat([ddf1, ddf2], axis=1)
# ddf.compute()

#### Joins (coming soon!)

In [None]:
df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
df2 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'e': np.arange(0, 100).astype(np.int32), 
                      'f': np.arange(100, 0, -1).astype(np.int32)})
ddf1 = dask_cudf.from_cudf(df1, npartitions=8)
ddf2 = dask_cudf.from_cudf(df2, npartitions=8)

In [None]:
# merge is lazy: this displays the resulting structure; call .compute() to materialize the rows
ddf1.merge(ddf2, on=['a'])

In [None]:
ddf1.merge(ddf2, on=['a', 'b'])

In [None]:
dask_cudf.DataFrame.merge(ddf1, ddf2, on=['a'])

In [None]:
dask_cudf.DataFrame.merge(ddf1, ddf2, on=['a', 'b'])
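
Because every key in column `a` occurs 25 times in each table, joining on `a` alone is a many-to-many match. The sketch below estimates how many rows an inner join (the default for `merge`) would return by multiplying the per-key counts from both sides. It uses plain Python lists that mirror column `a`, and `inner_join_rows` is a hypothetical helper rather than part of any library.

```python
from collections import Counter


def inner_join_rows(left_keys, right_keys):
    """Rows produced by an inner join: sum over keys of left_count * right_count."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Counter returns 0 for keys missing on the right, so unmatched keys add nothing.
    return sum(n * right_counts[key] for key, n in left_counts.items())


# Same values as column 'a' in both tables: each of 0, 1, 2, 3 repeated 25 times.
keys = [value for value in (0, 1, 2, 3) for _ in range(25)]
rows_on_a = inner_join_rows(keys, keys)
print(rows_on_a)
```

Joining on `['a', 'b']` narrows the matches, but the exact row count then depends on the random values drawn for column `b`.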

#### Appends (coming soon!)

#### Groupbys (coming soon!)

In [None]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
ddf = dask_cudf.from_cudf(df, npartitions=8)

In [None]:
result = ddf.groupby('a').sum().compute()
print(result)

In [None]:
result = ddf.groupby(['a', 'b']).sum().compute().to_pandas()
print(result)

#### One Hot Encoding (coming soon!)

<a id="conclusion"></a>
## Conclusion

In this notebook, we showed how to work with cuDF DataFrames distributed across multiple GPUs using Dask.

To learn more about RAPIDS, be sure to check out: 

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)
