# Multi-GPU NVTabular: An Introduction to NVTabular + dask_cudf

**Last Updated**: August 5th 2020

## Overview of dask_cudf Integration in NVTabular 0.2.0

As of the 0.2.0 release (nvtabular>=0.2.0), many components of NVTabular have been refactored to use [Dask](https://dask.org/).  More specifically, the following components are now based on the [RAPIDS](https://rapids.ai/) `dask_cudf` library:

- **`nvtabular.Dataset`**: Most NVTabular functionality requires the raw data to be represented as a `Dataset` object. A `Dataset` can be initialized using file/directory paths ("csv" or "parquet"), a `pyarrow.Table`, a pandas/cudf `DataFrame`, or a pandas/cudf-based *Dask* `DataFrame`.  The purpose of this "wrapper" class is to provide other NVTabular components with reliable mechanisms to (1) translate the target data into a `dask_cudf.DataFrame`, and to (2) iterate over the target data in small-enough chunks to fit in GPU memory.
- **`nvtabular.Workflow`**: The central class used in NVTabular to compose a GPU-accelerated preprocessing pipeline.  The Workflow class now keeps track of the state of the underlying data by applying all operations to an internal `dask_cudf.DataFrame` object (`ddf`). If the user specifically chooses to iterate over the data to apply transform operations, the iteration will be over partitions of the internal `ddf`.
- **`nvtabular.ops.StatOperator`**: All "statistics-gathering" operations must be designed to operate directly on the `Workflow` object's internal `ddf`.  This requirement facilitates the ability of NVTabular to handle the calculation of global statistics in a scalable way.
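
To make the idea of scalable "global statistics" concrete, here is a minimal sketch in plain Python (no GPU libraries). It assumes each partition can be reduced to a small summary that is later merged; the helper names `partial_stats` and `combine` are hypothetical and are not part of the NVTabular API.

```python
import math

def partial_stats(values):
    """Summarize one partition as (count, sum, sum of squares)."""
    return len(values), sum(values), sum(v * v for v in values)

def combine(partials):
    """Merge per-partition summaries into a global mean and population std."""
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    # Clamp at zero to guard against tiny negative values from rounding
    var = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(var)

# Three "partitions" standing in for the chunks of a larger dataset
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, std = combine([partial_stats(p) for p in partitions])
```

Because only the small summaries are combined, no single worker ever needs to hold the whole column in memory.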

**Big Picture**:  NVTabular is tightly integrated with `dask_cudf`.  By representing the underlying dataset as a (lazily-evaluated) collection of cudf DataFrame objects (i.e. a single `dask_cudf.DataFrame`), we can seamlessly scale our preprocessing workflow to multiple GPUs.

## Simple Multi-GPU Toy Example
In order to illustrate the `dask_cudf`-based functionality of NVTabular, we will walk through a simple preprocessing example using *toy* data.


#### Step 1:  Import Libraries and Cleanup Working Directories

In [1]:
# Standard Libraries
import os
import glob
import shutil

# External Dependencies
import cupy as cp
import cudf
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from dask.utils import parse_bytes

# NVTabular
import nvtabular as nvt
import nvtabular.ops as ops

In [2]:
# Choose a "fast" root directory for this example
basedir = "/raid/dask-space/rzamora"

# Define and clean our worker/output directories
dask_workdir = os.path.join(basedir, "workdir")
demo_output_path = os.path.join(basedir, "demo_output")
demo_dataset_path = os.path.join(basedir, "demo_dataset.parquet")

# Make sure we have a clean worker space for Dask
# (remove any old directory, then always recreate it)
if os.path.isdir(dask_workdir):
    shutil.rmtree(dask_workdir)
os.mkdir(dask_workdir)

# Make sure we have a clean output path
if os.path.isdir(demo_output_path):
    shutil.rmtree(demo_output_path)
os.mkdir(demo_output_path)

#### Step 2: Create a "Toy" Parquet Dataset
In order to illustrate the power of multi-GPU scaling without requiring an excessive runtime, we can use the `cudf.datasets.timeseries` API to generate a 20GB toy dataset...

In [3]:
if not os.path.exists(demo_dataset_path):
    # Write a "largish" dataset (~20GB)
    nwrites = 25
    pw = cudf.io.parquet.ParquetWriter(demo_dataset_path)
    for i in range(nwrites):
        df = cudf.datasets.timeseries(start='2000-01-01', end='2000-12-31', freq='1s', seed=i).reset_index(drop=False)
        df["name"] = df["name"].astype("object")
        df["label"] = cp.random.choice(cp.array([0, 1], dtype="uint8"), len(df))
        pw.write_table(df)
    pw.close()

#### Step 3: Create an NVTabular `Dataset` object

As discussed above, the `nvt.Workflow` class requires data to be represented as an `nvt.Dataset`. This convention allows NVTabular to abstract away the raw format of the data and convert everything to a consistent `dask_cudf.DataFrame` representation. Since the `Dataset` API effectively wraps functions like `dask_cudf.read_csv`, the syntax is very simple, and the execution time is insignificant.

**Important `Dataset` Considerations**:

- Can be initialized with the following objects:
    - 1+ file/directory paths. An `engine` argument is required to specify the file format (unless file names are appended with `csv` or `parquet`)
    - `cudf.DataFrame`. Internal `ddf` will have 1 partition.
    - `pandas.DataFrame`. Internal `ddf` will have 1 partition.
    - `pyarrow.Table`. Internal `ddf` will have 1 partition.
    - `dask_cudf.DataFrame`. Internal `ddf` will be a shallow copy of the input.
    - `dask.dataframe.DataFrame`. Internal `ddf` will be a direct pandas->cudf conversion of the input.
- For file-based data initialization, the size of the internal `ddf` partitions will be chosen according to the following arguments (in order of precedence):
    - `part_size`: Desired maximum size of each partition **in bytes**.  Note that you can also pass a string here, such as `"2GB"`.
    - `part_mem_fraction`: Desired maximum size of each partition as a **fraction of total GPU memory**.
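
The precedence rule above can be sketched in plain Python. This is an illustration only: `to_bytes` and `resolve_part_size` are hypothetical helpers, not NVTabular functions, and the sketch assumes decimal units (1 GB = 10^9 bytes), which is how I understand `dask.utils.parse_bytes` to treat a string like `"2GB"`.

```python
UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}

def to_bytes(size):
    """Convert an int or a string such as "2GB" to a byte count (decimal units)."""
    if isinstance(size, int):
        return size
    text = size.strip().upper()
    # Check two-letter suffixes before the bare "B" suffix
    for unit in ("KB", "MB", "GB", "B"):
        if text.endswith(unit):
            return round(float(text[: -len(unit)]) * UNITS[unit])
    return round(float(text))

def resolve_part_size(total_gpu_mem, part_size=None, part_mem_fraction=None):
    """Hypothetical helper: `part_size` wins; otherwise use the memory fraction."""
    if part_size is not None:
        return to_bytes(part_size)
    if part_mem_fraction is not None:
        return int(total_gpu_mem * part_mem_fraction)
    raise ValueError("need part_size or part_mem_fraction")

# With both arguments given, the explicit byte size takes precedence
chosen = resolve_part_size(32 * 10**9, part_size="1.2GB", part_mem_fraction=0.5)
```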

In [4]:
# Create a Dataset
# (`engine` argument optional if file names appended with `csv` or `parquet`)
%time ds = nvt.Dataset(demo_dataset_path, engine="parquet", part_size="1.2GB")

CPU times: user 774 ms, sys: 519 ms, total: 1.29 s
Wall time: 1.29 s


Once your data is converted to a `Dataset` object, it can be converted to a `dask_cudf.DataFrame` using the `to_ddf` method...

In [5]:
ds.to_ddf().head()

Unnamed: 0,timestamp,id,name,x,y,label
0,2000-01-01 00:00:00,1019,Michael,0.168205,-0.54723,1
1,2000-01-01 00:00:01,984,Patricia,-0.145077,-0.240521,0
2,2000-01-01 00:00:02,935,Victor,0.557024,-0.098855,1
3,2000-01-01 00:00:03,970,Alice,0.527366,-0.632569,1
4,2000-01-01 00:00:04,997,Dan,0.309193,0.704845,1


Note that the output of a Dataset (a `ddf`) can be used to initialize a new Dataset.  This means we can use `dask_cudf` to perform complex ETL on our data before we process it in a `Workflow`. For example, although NVTabular does not support global shuffling transformations (yet), these operations **can** be performed before (or after) a Workflow...

In [6]:
ddf = ds.to_ddf().shuffle("id", ignore_index=True)
ds = nvt.Dataset(ddf)
ds.to_ddf()

Unnamed: 0_level_0,timestamp,id,name,x,y,label
npartitions=29,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
,datetime64[us],int64,object,float64,float64,uint8
,...,...,...,...,...,...
...,...,...,...,...,...,...
,...,...,...,...,...,...
,...,...,...,...,...,...


In [7]:
del ds
del ddf
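
For intuition about what a key-based shuffle does, the following plain-Python sketch redistributes rows so that every row with the same `id` lands in the same output partition. It is a conceptual stand-in, not the `dask_cudf` implementation; `shuffle_by_key` is a hypothetical name, and hashing the key modulo the partition count is an assumption chosen for simplicity.

```python
def shuffle_by_key(partitions, key, npartitions):
    """Send each row to the output partition chosen by hashing its key."""
    out = [[] for _ in range(npartitions)]
    for part in partitions:
        for row in part:
            out[hash(row[key]) % npartitions].append(row)
    return out

# Twenty toy rows with five distinct ids, split across two input partitions
rows = [{"id": i % 5, "x": i} for i in range(20)]
inputs = [rows[:10], rows[10:]]
outputs = shuffle_by_key(inputs, "id", npartitions=3)
```

Every row has to be examined and possibly moved to a different partition, which is why a global shuffle is comparatively expensive on a large dataset.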

#### Step 4: Distributed Cluster Deployment

Before we walk through the rest of this multi-GPU preprocessing example, it is important to reiterate that `dask_cudf` is used extensively within NVTabular.  This essentially means that you do **not** need to do anything special to *use* Dask here.  With that said, the default behavior of NVTabular is to utilize Dask's ["synchronous"](https://docs.dask.org/en/latest/scheduling.html) task scheduler, which precludes distributed processing.  In order to properly utilize a multi-GPU system, you need to deploy a `dask.distributed` *cluster*.

There are many different ways to create a distributed Dask cluster.  In this notebook, we will focus only on the `LocalCUDACluster` API (which is provided by the RAPIDS [`dask_cuda`](https://github.com/rapidsai/dask-cuda) library). I also recommend that you check out [this blog article](https://blog.dask.org/2020/07/23/current-state-of-distributed-dask-clusters) to see a high-level summary of the (many) other cluster-deployment utilities.

For this example, we will assume that you want to perform preprocessing on a single machine with multiple GPUs. In this case, we can use `dask_cuda.LocalCUDACluster` to deploy a distributed cluster with each worker process pinned to a distinct GPU.  This class also provides our workers with mechanisms for device-host memory spilling, and (optionally) enables the use of NVLink and InfiniBand-based inter-process communication via UCX...

In [8]:
# Deploy a Single-Machine Multi-GPU Cluster
protocol = "tcp"                     # "tcp" or "ucx"
visible_devices = "0,1,2,3,4,5,6,7"  # Select devices on which to place workers
device_memory_limit = "28GB"         # Spill device mem to host at this limit
rmm_pool_size = "28GB"               # RMM pool size
memory_limit = "96GB"                # Spill host mem to disk near this limit
cluster = LocalCUDACluster(
    protocol = protocol,
    CUDA_VISIBLE_DEVICES = visible_devices,
    local_directory = dask_workdir,
    device_memory_limit = parse_bytes(device_memory_limit),
    memory_limit = parse_bytes(memory_limit),
)
client = Client(cluster)
client
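
One way to picture "pinning" a worker to a GPU: every worker process can be handed the same device list in a rotated order, so that the first device each worker sees is a different physical GPU. The sketch below only illustrates that idea in plain Python; `rotated_visible_devices` is a hypothetical helper, and the exact mechanism used by `dask_cuda` may differ.

```python
def rotated_visible_devices(worker_index, devices):
    """Return a CUDA_VISIBLE_DEVICES string with this worker's GPU listed first."""
    devices = list(devices)
    shift = worker_index % len(devices)
    return ",".join(str(d) for d in devices[shift:] + devices[:shift])

# A smaller device list than the cell above, to keep the result readable
ids = "0,1,2,3".split(",")
per_worker = [rotated_visible_devices(i, ids) for i in range(len(ids))]
```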


Since allocating memory is often a performance bottleneck, it is usually a good idea to initialize a memory pool on each of our workers...

In [None]:
# Initialize RMM pool on all workers
client.run(
    cudf.set_allocator,
    pool=True,
    initial_pool_size=parse_bytes(rmm_pool_size),
    allocator="default"
)

#### Step 5: Define our NVTabular `Workflow`

In [None]:
cat_names = ["name", "id"]
cont_names = ["x", "y", "timestamp"]
label_name = ["label"]

workflow = nvt.Workflow(cat_names=cat_names, cont_names=cont_names, label_name=label_name, client=client)
workflow.add_preprocess(ops.Normalize(columns=["x", "y"]))
workflow.add_preprocess(
    ops.Categorify(
        columns=["id", "name", ["id", "name"]],
        encode_type="combo",
        name_sep="+",
        out_path=demo_output_path,
        cat_cache="device",
        tree_width=1,
    )
)
workflow.finalize()
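
To show what these two operations compute conceptually, here is a plain-Python sketch of z-score normalization and of "combo" categorical encoding, where each distinct (`id`, `name`) pair gets its own integer code. The functions `normalize` and `categorify_combo` are hypothetical illustrations; the real `ops.Normalize` and `ops.Categorify` may differ in details such as code ordering, null handling, or the standard-deviation convention.

```python
import math

def normalize(values):
    """Z-score a column using its global mean and population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def categorify_combo(*columns):
    """Encode each distinct tuple of column values as an integer code."""
    keys = list(zip(*columns))
    codes = {k: i for i, k in enumerate(sorted(set(keys)))}
    return [codes[k] for k in keys]

ids = [1019, 984, 1019, 984]
names = ["Michael", "Patricia", "Michael", "Alice"]
combo = categorify_combo(ids, names)     # one code per distinct (id, name) pair
x_norm = normalize([1.0, 2.0, 3.0, 4.0])  # zero mean, unit variance
```

Both steps depend on global information (the mean and standard deviation, or the full set of unique keys), which is why statistics are gathered over the whole dataset before any transform is applied.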

#### Step 6: (Optional) Create a Dataset Object

Since we already created a Dataset above, we *could* still use it here.  However, let's create a new one since we unnecessarily shuffled that data (and deleted it anyway)...


In [None]:
dataset = nvt.Dataset(demo_dataset_path, part_size="1.2GB")

#### Step 7: Apply our Workflow

In [11]:
%%time
workflow.apply(
    dataset,
    output_format="parquet",
    output_path=os.path.join(demo_output_path,"processed"),
    shuffle=False,
    out_files_per_proc=1,
)

CPU times: user 6.24 s, sys: 3.94 s, total: 10.2 s
Wall time: 13.5 s


**1 GPU**:
```
CPU times: user 6.07 s, sys: 3.89 s, total: 9.96 s
Wall time: 33.7 s
```

**2 GPUs**:
```
CPU times: user 6.19 s, sys: 3.59 s, total: 9.78 s
Wall time: 20.1 s
```

**4 GPUs**:
```
CPU times: user 6.74 s, sys: 3.53 s, total: 10.3 s
Wall time: 21.4 s
```

**8 GPUs**:
```
CPU times: user 6.24 s, sys: 3.94 s, total: 10.2 s
Wall time: 13.5 s
```
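
The wall times above can be turned into speedup and parallel-efficiency figures with a few lines of arithmetic. This is a small sketch that uses only the numbers reported in this notebook; efficiency here means speedup divided by the number of GPUs.

```python
wall_times = {1: 33.7, 2: 20.1, 4: 21.4, 8: 13.5}  # seconds, from the runs above

baseline = wall_times[1]
speedup = {n: baseline / t for n, t in wall_times.items()}
efficiency = {n: s / n for n, s in speedup.items()}

for n in sorted(wall_times):
    print(f"{n} GPU(s): speedup {speedup[n]:.2f}x, efficiency {efficiency[n]:.0%}")
```

On this toy workload, eight GPUs finish roughly 2.5x faster than one, so the scaling is well short of linear, and the four-GPU run was slightly slower than the two-GPU run.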

In [12]:
dask_cudf.read_parquet(os.path.join(demo_output_path,"processed")).head()

Unnamed: 0,x,y,timestamp,name,id,label,id+name
0,0.242465,-0.144389,2000-10-04 21:50:24,16,157,1,3594
1,0.810014,-0.075761,2000-10-04 21:50:25,13,172,1,3981
2,1.283402,1.335227,2000-10-04 21:50:26,20,192,1,4508
3,-0.84154,0.419783,2000-10-05 00:33:36,13,152,1,3461
4,-1.290406,-0.667822,2000-10-05 03:32:48,12,212,0,5020


In [13]:
%%time
ddf = workflow.get_ddf()
ddf.to_parquet(os.path.join(demo_output_path, "dask_output"), write_index=False)

CPU times: user 519 ms, sys: 90.9 ms, total: 609 ms
Wall time: 4.72 s


In [15]:
glob.glob(os.path.join(demo_output_path, "dask_output/*"))

['/raid/dask-space/rzamora/demo_output/dask_output/part.1.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.25.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.19.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.10.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.24.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.20.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.11.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.12.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.21.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.5.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.17.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.27.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.3.parquet',
 '/raid/dask-space/rzamora/demo_output/dask_output/part.4.parquet',
 '/raid/dask-space/rzamora/demo_output