<a id="introduction"></a>
## Introduction to Dask cuDF
#### By Paul Hendricks
-------

In this notebook, we will show how to work with cuDF DataFrames distributed across multiple GPUs using Dask.

**Table of Contents**

* [Introduction to Dask cuDF](#introduction)
* [Setup](#setup)
* [Dask cuDF DataFrames](#daskcudfdataframes)
* [Dask cuDF API](#daskcudfapi)
* [Conclusion](#conclusion)

<a id="setup"></a>
## Setup

This notebook was tested using the following Docker containers:

* `rapidsai/rapidsai-nightly:0.8-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [DockerHub - rapidsai/rapidsai-nightly](https://hub.docker.com/r/rapidsai/rapidsai-nightly)

This notebook was run on NVIDIA Tesla V100 GPUs. Your system may differ, and you may need to modify the code or install packages to run the examples below.

If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks/issues

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [1]:
!nvidia-smi

Tue Jun 11 01:34:19 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   39C    P0    44W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   39C    P0    42W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   

Next, let's see what CUDA version we have:

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
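
Note that `nvidia-smi` reports the highest CUDA version the installed driver supports (10.1 above), while `nvcc` reports the version of the installed toolkit (10.0). To check the toolkit version programmatically rather than by eye, a small parser is enough. The sketch below uses a hypothetical helper, `parse_cuda_release`, which is not part of RAPIDS, and runs it on the output shown above.

```python
import re

# Sample text copied from the `nvcc --version` output above.
NVCC_OUTPUT = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
"""


def parse_cuda_release(text):
    """Return (major, minor) from nvcc's 'release X.Y' line, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


cuda_release = parse_cuda_release(NVCC_OUTPUT)
print(cuda_release)  # (10, 0)
```

In a live notebook, the same function could be applied to text captured from the command instead of the hard-coded sample.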


<a id="daskcudfdataframes"></a>
## Dask cuDF DataFrames

#### Reading data

In [3]:
import dask_cudf; print('Dask cuDF Version:', dask_cudf.__version__)

Dask cuDF Version: 0.7.2+0.g3ebd286.dirty


#### Inspecting a Dask cuDF DataFrame

In [4]:
# performance_df

In [5]:
# type(performance_df)

In [6]:
# performance_df.npartitions

<a id="daskcudfapi"></a>
## Dask cuDF API

In [7]:
# performance_df.head()

In [8]:
# print(performance_df.head())

In [9]:
# # calculate number of rows
# performance_df.map_partitions(len).compute().sum()
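
The row count in the cell above works in two steps: `map_partitions(len)` yields one length per partition, and `.sum()` adds those lengths together once `.compute()` has produced them. The following plain-Python analogy uses ordinary lists as stand-in partitions with made-up values, so it needs neither a GPU nor Dask. It is only meant to illustrate the pattern, not how Dask cuDF is implemented.

```python
# Stand-in "partitions": each inner list plays the role of one cuDF DataFrame
# held on one GPU. The values are made up for illustration.
partitions = [
    [101, 102, 103],
    [104, 105],
    [106, 107, 108, 109],
]

# Step 1: apply len to every partition independently
# (the role played by map_partitions(len)).
partition_lengths = [len(partition) for partition in partitions]

# Step 2: combine the small per-partition results into one total
# (the role played by .sum()).
number_of_rows = sum(partition_lengths)

print(partition_lengths)  # [3, 2, 4]
print(number_of_rows)     # 9
```

Only the per-partition lengths have to be brought together, which is why this pattern stays cheap even when the partitions themselves are large.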

In [10]:
# !cat /datasets/rapids/mortgage/mortgage_2000_1gb/perf/* | wc -l

In [11]:
# aggregation = performance_df['loan_age'].mean()
# print(aggregation.compute())

In [12]:
# %%bash

# ls -alh /datasets/rapids/mortgage/mortgage_2000_1gb/perf

In [13]:
# from collections import OrderedDict
# import cudf; print('cuDF Version:', cudf.__version__)
# import dask_cudf; print('Dask cuDF Version:', dask_cudf.__version__)
# import utils





In [14]:
# import os

# base_path = os.path.join('/', 'datasets', 'rapids', 'mortgage', 'mortgage_2000_1gb')
# filepath = os.path.join(base_path, 'perf', 'Performance_*')
# # filepath = os.path.join(base_path, 'perf', 'Performance_2000Q1.txt_0')

In [15]:
# df = load_performance_dataset(filepath)
# df

#### Selecting Rows or Columns

In [16]:
# client.restart()

In [17]:
# print(type(df))

In [18]:
# # select rows
# df_subset = df[0:4]
# print(df_subset)

In [19]:
# print(type(df_subset))

In [20]:
# df_result = df_subset.compute()
# print(df_result)

In [21]:
# print(type(df_result))

In [22]:
# print(df_result.shape)

In [23]:
# df.npartitions * 5

In [24]:
# # select columns
# df_subset = df['loan_id']
# print(df_subset)

In [25]:
# print(type(df_subset))

In [26]:
# print(df_subset.head())
# print(type(df_subset.head()))

In [27]:
# df_subset = df[['loan_id', 'current_loan_delinquency_status']]
# print(df_subset)

In [28]:
# print(type(df_subset))

In [29]:
# print(df_subset.head())

In [30]:
# # select both rows and columns
# df_subset = df.loc[0:4, ['loan_id', 'current_loan_delinquency_status']]
# print(df_subset)

In [31]:
# print(type(df_subset))

In [32]:
# df_result = df_subset.compute()
# print(df_result)

In [33]:
# print(type(df_result))
# print(df_result.shape)

#### Dropping Rows or Columns

In [34]:
# client.restart()

In [35]:
# df.map_partitions(len).compute().sum()

In [36]:
# df.drop(0:100, axis=0)

In [37]:
# df.map_partitions(len).compute().sum()

In [38]:
# df.columns

In [39]:
# df.drop(['loan_age'], axis=1)

In [40]:
# df.columns

#### Manipulating Columns

In [41]:
# client.restart()

In [42]:
# df['new_column'] = df['loan_id']

In [43]:
# df.columns

In [44]:
# print(type(df))

In [45]:
# print(df['new_column'].head())

In [46]:
# # drop returns a new DataFrame, so reassign to keep the result
# df = df.drop(['new_column'], axis=1)

#### Transforming Columns

In [47]:
# client.restart()

In [48]:
# df['mean_loan_age'] = df['loan_age'].mean()

#### Renaming Columns

In [49]:
# client.restart()

In [50]:
# df.columns

In [51]:
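# # an Index does not support item assignment, so rename the column instead
# # (assumes rename(columns=...) is available in your Dask cuDF version)
# df = df.rename(columns={df.columns[9]: 'metropolitan_statistical_area'})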

In [52]:
# df.columns

In [53]:
# # rename by copying the column under a new name and dropping the original
# df['new_column'] = df['loan_id']
# df = df.drop('loan_id', axis=1)

In [54]:
# df.columns

#### Modifying Data Types

In [55]:
# client.restart()

In [56]:
# df.dtypes

In [57]:
# df.dtypes

In [58]:
# df.dtypes

#### Working with Missing Values

In [59]:
# client.restart()

In [60]:
# # calculate how many rows in each column have actual values
# # # ideal
# # column_counts = df.count()
# # column_counts

# # alternative
# column_counts = []
# for column in list(df.columns):
#     column_count = df[column].count().compute()
#     column_counts.append((column, column_count))

In [61]:
# number_of_rows = df.map_partitions(len).compute().sum()

In [62]:
# for column, count in column_counts:
#     print(column, ':', (count / number_of_rows) * 100)
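
The cells above compute, for each column, the share of rows that hold an actual value: `count()` counts non-missing entries, and dividing by the total row count gives a percentage. Here is a minimal plain-Python sketch of the same arithmetic. The column contents are made up, and `None` stands in for a missing value.

```python
# Made-up columns; None stands in for a missing value.
columns = {
    "loan_id": [1, 2, 3, 4],
    "loan_age": [12, None, 7, None],
}

number_of_rows = len(columns["loan_id"])

# count() only counts non-missing entries, so do the same here.
column_counts = [
    (name, sum(value is not None for value in values))
    for name, values in columns.items()
]

# Convert each count into a percentage of the total number of rows.
percent_present = {
    name: (count / number_of_rows) * 100 for name, count in column_counts
}

for name, percent in percent_present.items():
    print(name, ":", percent)
# loan_id : 100.0
# loan_age : 50.0
```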

In [63]:
# import numpy as np

# # loop over each column in the dataframe and each column's dtype
# for column, data_type in df.dtypes.items():
#     # if the data type is categorical, cast to int32 and fill with -1
#     if str(data_type) == "category":
#         df[column] = df[column].astype('int32').fillna(-1)

#     # if the data type is numeric, fill with -1 cast to the column's own type
#     elif str(data_type) in ['int8', 'int16', 'int32', 'int64', 'float32', 'float64']:
#         df[column] = df[column].fillna(np.dtype(data_type).type(-1))
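
The expression `np.dtype(data_type).type(-1)` in the cell above builds a `-1` of the same type as the column, so that filling does not change the column's dtype. The idea can be sketched in plain Python. The helper `fill_missing` is hypothetical, the data is made up, and `int` and `float` stand in for the NumPy dtypes.

```python
def fill_missing(values, kind):
    """Replace None with -1 cast to the column's kind (int or float)."""
    sentinel = kind(-1)
    return [sentinel if value is None else kind(value) for value in values]


filled_int = fill_missing([3, None, 5], int)
filled_float = fill_missing([0.5, None], float)

print(filled_int)    # [3, -1, 5]
print(filled_float)  # [0.5, -1.0]
```

Using `-1` as a sentinel assumes that `-1` never occurs as a legitimate value in the column; if it can, a different marker is needed.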

In [64]:
# df.persist()

In [65]:
# # calculate how many rows in each column have actual values
# # # ideal
# # column_counts = df.count()
# # column_counts

# # alternative
# column_counts = []
# for column in list(df.columns):
#     column_count = df[column].count().compute()
#     column_counts.append((column, column_count))

In [66]:
# for column, count in column_counts:
#     print(column, ':', (count / number_of_rows) * 100)

#### Working with Indexes

In [67]:
# client.restart()

#### Sorting Values

In [68]:
# client.restart()

#### Merging DataFrames

In [69]:
# client.restart()

#### Concatenating DataFrames

In [70]:
# client.restart()

In [71]:
# df_delayed = df.to_delayed()
# df_delayed

In [72]:
# from dask.delayed import delayed


# def head(dataframe):
#     return dataframe.head()


# dfs = [delayed(head)(d) for d in df_delayed]

In [73]:
# client.scheduler_info()

In [74]:
# workers = client.scheduler_info()['workers']
# print(len(workers))
# print(workers)

In [75]:
# worker_ids = [worker['id'] for worker in workers.values()]
# print(worker_ids)

In [76]:
# from dask.distributed import wait

# futures = client.compute(dfs)
# wait(futures)
# futures
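
The submit, wait, and collect pattern in the cell above can be tried without a cluster. The sketch below uses the standard library's `concurrent.futures` as a loose analogy. It is not Dask's API, the thread pool stands in for the GPU workers, and the lists are made-up partitions.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def head(rows, n=5):
    """Return the first n rows of a stand-in partition."""
    return rows[:n]


# Two made-up partitions of different sizes.
partitions = [list(range(0, 10)), list(range(10, 30))]

with ThreadPoolExecutor(max_workers=2) as executor:
    # Submit one task per partition; each call returns a future immediately.
    futures = [executor.submit(head, partition) for partition in partitions]

    # Block until every future has finished.
    done, not_done = wait(futures)

    # Collect the results in submission order.
    results = [future.result() for future in futures]

print(results)  # [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]
```

Waiting on the futures explicitly is more reliable than sleeping for a fixed time, because it returns as soon as the work is done and never earlier.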

In [77]:
# [(gpu_df, list(client.who_has(gpu_df).values())[0]) for gpu_df in gpu_dfs]

# partition_worker_map = [(partition, list(client.who_has(partition).values())[0]) for partition in df]
# [client.who_has(partition) for partition in df ]

In [78]:
# concatenations = []
# for worker, list_of_partitions_delayed in client.has_what().items():
#     concatenations.append(delayed(cudf.concat)(list_of_partitions_delayed))
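
The loop above groups partitions by the worker that holds them and concatenates each group, so that data does not need to move between workers. Below is a plain-Python sketch of that grouping step. The worker names and partition contents are made up, and `itertools.chain` stands in for `cudf.concat`.

```python
from itertools import chain

# Made-up mapping from worker name to the partitions it holds.
worker_partitions = {
    "worker-0": [[1, 2], [3]],
    "worker-1": [[4], [5, 6]],
}

# Concatenate the partitions of each worker into a single block per worker.
concatenations = {
    worker: list(chain.from_iterable(parts))
    for worker, parts in worker_partitions.items()
}

print(concatenations)  # {'worker-0': [1, 2, 3], 'worker-1': [4, 5, 6]}
```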

In [79]:
# concatenations

In [80]:
# futures = client.compute(concatenations)

In [81]:
# # results = [future.result() for future in futures]
# results = client.gather(futures)

In [82]:
# results[0]

#### Aggregating with Groupbys

In [83]:
# client.restart()

#### One Hot Encoding

In [84]:
# client.restart()

#### Custom Operations

In [85]:
# client.restart()

In [86]:
# performance_df_delayed = performance_df.to_delayed()
# performance_df_delayed

In [87]:
# head_dfs = [delayed(head)(d) for d in performance_df_delayed]

In [88]:
# # submit the delayed tasks, then block until the resulting futures finish
# futures = client.compute(head_dfs)
# wait(futures)
# futures

In [89]:
# results = client.gather(futures)
# results

In [90]:
# print(results[0])

In [91]:
# print(results[1])

<a id="conclusion"></a>
## Conclusion

In this notebook, we showed how to work with cuDF DataFrames distributed across multiple GPUs using Dask.

To learn more about RAPIDS, be sure to check out: 

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)
