# Dask Overview

Dask is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple. On the CPU, Dask uses Pandas (NumPy) to execute operations in parallel on DataFrame (array) partitions.

Dask-cuDF extends Dask where necessary so that its DataFrame partitions are processed as cuDF GPU DataFrames rather than Pandas DataFrames. For instance, when you call `dask_cudf.read_csv(…)`, your cluster’s GPUs do the work of parsing the CSV file(s) by calling `cudf.read_csv()` under the hood. Dask also supports array-based workflows using CuPy.

## When to use Dask
If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF or CuPy. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask.

One additional benefit Dask provides is that it lets us easily spill data between device and host memory. This can be very useful when we need to do work that would otherwise cause out-of-memory errors.

In this brief notebook, you'll walk through an example of using Dask on a single GPU. Because we're using Dask, the same code would work on two, eight, 16, or hundreds of GPUs.

# Creating a Local Cluster

The easiest way to scale workflows on a single node is to use the `LocalCUDACluster` API. This lets us create a GPU cluster, using one worker per GPU by default.

In this case, we'll pass the following arguments:

- `CUDA_VISIBLE_DEVICES`, to limit our cluster to a single GPU (for demonstration purposes).
- `device_memory_limit`, to illustrate how we can spill data between GPU and CPU memory. Artificial memory limits like this reduce our performance if we don't actually need them, but can let us accomplish much larger tasks when we do.
- `rmm_pool_size`, to use the RAPIDS Memory Manager to allocate one big chunk of memory upfront rather than having our operations call `cudaMalloc` all the time under the hood. This improves performance, and is generally a best practice.
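
To make the size strings concrete: a value like `"4GB"` is converted to a plain byte count, with decimal suffixes meaning powers of 1000 and binary suffixes (such as `GiB`) meaning powers of 1024. The helper below, `size_to_bytes`, is a hypothetical pure-Python stand-in written only for this illustration. The notebook itself relies on `dask.utils.parse_bytes`, and this sketch covers just a few suffixes.

```python
# Hypothetical stand-in for illustration only; the notebook uses dask.utils.parse_bytes.
UNITS = {"kB": 10**3, "MB": 10**6, "GB": 10**9, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}


def size_to_bytes(text):
    """Convert a string such as '4GB' into an integer number of bytes."""
    text = text.strip()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # no recognized suffix: treat the text as a raw byte count


device_memory_limit = size_to_bytes("4GB")  # spilling threshold
rmm_pool_size = size_to_bytes("8GB")        # memory pool reserved up front
print(device_memory_limit, rmm_pool_size)
```

Note that the spilling threshold sits well below the pool size here, which is what makes spilling kick in early for demonstration purposes.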

In [None]:
from dask.distributed import Client, fire_and_forget, wait
from dask_cuda import LocalCUDACluster
from dask.utils import parse_bytes
import dask


cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0",  # a single GPU, for demonstration purposes
    device_memory_limit=parse_bytes("4GB"),
    rmm_pool_size=parse_bytes("8GB"),
)    

client = Client(cluster)
client

Click the **Dashboard** link above to view your Dask dashboard. 

## cuDF DataFrames to Dask DataFrames

Dask lets us scale our cuDF workflows. We'll walk through a couple of examples below, and then also highlight how Dask lets us spill data from GPU to CPU memory.

First, we'll create a dataframe with CPU Dask and then send it to the GPU.

In [None]:
import cudf
import dask_cudf

In [None]:
ddf = dask_cudf.from_dask_dataframe(dask.datasets.timeseries())
ddf.head()

### Example One: Groupby-Aggregations

In [None]:
ddf.groupby(["id", "name"]).agg({"x":['sum', 'mean']}).head()

Run the code above again.

If you look at the task stream in the dashboard, you'll notice that we're creating the data every time. That's because Dask is lazy. We need to `persist` the data if we want to cache it in memory.
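
The effect of laziness can be imitated in a few lines of plain Python. The `LazyFrame` class below is a toy written for this illustration, not a Dask API. It stores a recipe for building its data, rebuilds the data on every eager operation, and only stops doing so once we "persist" it.

```python
class LazyFrame:
    """Toy lazy collection: remembers how to build its data instead of holding it."""

    def __init__(self, build):
        self._build = build     # zero-argument function that creates the data
        self.build_calls = 0    # how many times this object's recipe has been run

    def _materialize(self):
        self.build_calls += 1
        return self._build()

    def total(self):
        # Every eager operation re-runs the whole recipe.
        return sum(self._materialize())

    def persist(self):
        # Build once, then hand back an object whose recipe returns the cached data.
        data = self._materialize()
        return LazyFrame(lambda: data)


lazy = LazyFrame(lambda: list(range(1_000)))
lazy.total()
lazy.total()
print(lazy.build_calls)  # the data was created once per call

cached = lazy.persist()
cached.total()
cached.total()
print(lazy.build_calls)  # only persist() added another creation
```

The two calls on `cached` reuse the stored list, so the original recipe is never run again.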

In [None]:
ddf = ddf.persist()
wait(ddf);

In [None]:
ddf.groupby(["id", "name"]).agg({"x":['sum', 'mean']}).head()

This is the same API as cuDF, except it works across many GPUs.
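
Why can a groupby work across partitions that live on different GPUs? One way to think about it is that each partition computes small partial results, which are then merged. The sketch below is a simplified pure-Python model of that idea, using made-up `(id, x)` rows. It is not a description of Dask's actual task graph.

```python
from collections import defaultdict

# Two "partitions" of (id, x) rows, as they might live on different workers.
partitions = [
    [(1, 0.5), (2, 1.0), (1, 1.5)],
    [(2, 3.0), (1, 1.0)],
]


def partial_aggregate(rows):
    """Per-partition step: running sum and count of x for each id."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, x in rows:
        acc[key][0] += x
        acc[key][1] += 1
    return acc


def combine(partials):
    """Final step: merge the partial results, then derive sum and mean."""
    merged = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (s, n) in part.items():
            merged[key][0] += s
            merged[key][1] += n
    return {key: {"sum": s, "mean": s / n} for key, (s, n) in merged.items()}


result = combine(partial_aggregate(p) for p in partitions)
print(result)
```

The mean is derived from the merged sums and counts rather than by averaging per-partition means, which would give the wrong answer when partitions hold different numbers of rows per key.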

### Example Two: Rolling Windows

We can also do things like rolling window calculations with Dask and GPUs.

In [None]:
ddf.head()

In [None]:
rolling = ddf[['x','y']].rolling(window=3)
type(rolling)

In [None]:
rolling.mean().head()
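
If the first rows of the output look empty, that's expected: with a window of 3, the first two rows don't have a full window yet, so they come back as missing values. The plain-Python function below is a hand-written sketch of that behavior (the name `rolling_mean` is ours), with `NaN` standing in for a missing value.

```python
import math


def rolling_mean(values, window):
    """Mean over the trailing `window` values; NaN until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(math.nan)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


means = rolling_mean([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
print(means)
```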

## Larger than GPU Memory Workflows

What if we needed to scale up even more, but didn't have enough GPU memory? Dask handles spilling for us, so we don't need to worry about it. The `device_memory_limit` parameter we used while creating the `LocalCUDACluster` determines when we should start spilling. In this case, we'll start spilling when we've used about 4 GB of GPU memory.

Let's create a larger dataframe to use as an example.

In [None]:
ddf = dask_cudf.from_dask_dataframe(dask.datasets.timeseries(start="2000-01-01", end="2000-12-31"))

ddf = ddf.persist()
len(ddf)
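
The row count reported by `len(ddf)` can be predicted with a little date arithmetic. The sketch below assumes the generator's defaults of one row per second and one partition per day, with the end date excluded.

```python
from datetime import datetime

start = datetime(2000, 1, 1)
end = datetime(2000, 12, 31)

# Assumption: one row per second and one partition per day (the generator's defaults).
n_days = (end - start).days
n_rows = n_days * 24 * 60 * 60

print(n_days, n_rows)
```

That works out to 365 daily partitions and roughly 31.5 million rows.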

In [None]:
print(f"{ddf.memory_usage(deep=True).sum().compute() / 1e9} GB of data")

In [None]:
ddf.head()

Let's imagine we have some downstream operations that require all the data from a given unique identifier in the same partition. We can repartition our data based on the `id` column using the `shuffle` API.

Repartitioning our 31 million row dataframe will spike GPU memory higher than 4GB, so we'll need to spill to CPU memory.
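
Conceptually, a shuffle routes every row to an output partition chosen from its key, so all rows sharing a key end up together. The sketch below is a toy model in plain Python with made-up rows. The `key % n_out` routing rule is an assumption chosen for readability, not the hash Dask actually uses.

```python
N_OUTPUT_PARTITIONS = 3

# Input partitions hold rows in arrival order, so a given id is scattered across them.
input_partitions = [
    [(7, "a"), (2, "b"), (7, "c")],
    [(2, "d"), (5, "e"), (7, "f")],
]


def shuffle_on_key(partitions, n_out):
    """Send each (key, value) row to the output partition selected by its key."""
    out = [[] for _ in range(n_out)]
    for part in partitions:
        for key, value in part:
            out[key % n_out].append((key, value))  # toy routing rule, not Dask's hash
    return out


shuffled = shuffle_on_key(input_partitions, N_OUTPUT_PARTITIONS)
for i, part in enumerate(shuffled):
    print(i, part)
```

Every input partition may contribute rows to every output partition, which is why a shuffle moves so much data at once. Notice also that an output partition can end up empty when no key maps to it.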

In [None]:
ddf = ddf.shuffle(on="id")
ddf = ddf.persist()

len(ddf)

Watch the Dask Dashboard while this runs. You should see a lot of tasks in the stream like `disk-read` and `disk-write`. Setting a `device_memory_limit` tells Dask to spill to CPU memory and potentially disk (if we overwhelm CPU memory). This lets us do these large computations even when we're almost out of memory (though in this case, we faked it).
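
To build some intuition for what spilling does, here is a toy two-tier store in plain Python. It tracks only the sizes of objects, and it assumes a least-recently-used eviction policy. The class `SpillingStore` and the keys `part-0`, `part-1`, `part-2` are hypothetical names for this sketch, which is not how the worker is actually implemented.

```python
from collections import OrderedDict


class SpillingStore:
    """Toy two-tier store: 'device' has a byte budget, 'host' takes the overflow."""

    def __init__(self, device_limit):
        self.device_limit = device_limit
        self.device = OrderedDict()  # key -> size in bytes, least recently used first
        self.host = {}

    def _device_bytes(self):
        return sum(self.device.values())

    def put(self, key, nbytes):
        self.device[key] = nbytes
        self.device.move_to_end(key)
        # Evict least recently used entries until we are back under the budget.
        while self._device_bytes() > self.device_limit and len(self.device) > 1:
            old_key, old_bytes = self.device.popitem(last=False)
            self.host[old_key] = old_bytes

    def get(self, key):
        if key in self.host:  # spilled earlier: bring it back to the device first
            self.put(key, self.host.pop(key))
        self.device.move_to_end(key)
        return self.device[key]


store = SpillingStore(device_limit=4_000_000_000)
for name in ["part-0", "part-1", "part-2"]:
    store.put(name, 1_500_000_000)

print(sorted(store.device), sorted(store.host))
```

Three 1.5 GB objects don't fit under a 4 GB limit, so the oldest one moves to the host tier. Reading it back with `get` pushes a different object out, which is the back-and-forth traffic that makes spilling slower than keeping everything on the GPU.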

# Dask Custom Functions

Dask DataFrames also provide a `map_partitions` API, which is very useful for parallelizing custom logic that doesn't fit neatly into the Dask DataFrame API or doesn't need it. Dask will `map` the function over every partition of the distributed DataFrame.

Now that we have all the rows of each `id` collected in the same partitions, what if we just wanted to sort **within each partition**? Avoiding global sorts is usually a good idea if possible, since they're very expensive operations.

In [None]:
sorted_ddf = ddf.map_partitions(lambda x: x.sort_values("id"))
len(sorted_ddf)
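
It's worth being clear about what a per-partition sort does and doesn't guarantee. The plain-Python sketch below uses lists as stand-ins for partitions, and `apply_to_partitions` is a toy helper written for this illustration. Each partition comes out sorted, but the dataset as a whole does not.

```python
# Each inner list stands in for one partition after the shuffle.
partitions = [
    [(7, 0.3), (4, 0.9), (7, 0.1), (4, 0.2)],
    [(5, 0.8), (2, 0.4), (5, 0.6)],
]


def apply_to_partitions(func, parts):
    """Apply func to every partition independently and keep the partition layout."""
    return [func(part) for part in parts]


sorted_parts = apply_to_partitions(
    lambda rows: sorted(rows, key=lambda row: row[0]), partitions
)

# Read the ids back in partition order to check for a global ordering.
flat_ids = [key for part in sorted_parts for key, _ in part]
print(sorted_parts)
print(flat_ids == sorted(flat_ids))
```

Because no rows cross partition boundaries, this is much cheaper than a global sort, and it's enough whenever the downstream logic only looks at one key at a time.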

We could also do something more complicated and wrap it into a function. Let's do a rolling window on the two value columns after sorting by the id column.

In [None]:
def sort_and_rolling_mean(df):
    df = df.sort_values("id")
    df = df.rolling(3)[["x", "y"]].mean()
    return df

In [None]:
result = ddf.map_partitions(sort_and_rolling_mean)
result = result.persist()
wait(result);

In [None]:
# let's look at a random partition
result.partitions[89].head()

Pretty cool. When we're using `map_partitions`, the function is executing on the individual cuDF DataFrames that make up our Dask DataFrame. This means we can do any cuDF operation, run CuPy array manipulations, or anything else we want.

# Dask Delayed

Dask also provides a `delayed` API, which is useful for parallelizing custom logic that doesn't quite fit into the DataFrame API.

Let's imagine we wanted to run over a thousand regression models on different combinations of two features. We can do this experiment super easily with `dask.delayed`.

In [None]:
from cuml.linear_model import LinearRegression
from dask import delayed
import dask
import numpy as np
from itertools import combinations

In [None]:
# Setup data
np.random.seed(12)

nrows = 1000000
ncols = 50
df = cudf.DataFrame({f"x{i}": np.random.randn(nrows) for i in range(ncols)})
df['y'] = np.random.randn(nrows)

In [None]:
feature_combinations = list(combinations(df.columns.drop("y"), 2))
feature_combinations[:10]

In [None]:
len(feature_combinations)
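
The number of models follows directly from the number of feature columns: choosing 2 features out of 50 gives "50 choose 2" pairs. The standalone check below rebuilds just the column names (not the data) to confirm the count.

```python
from itertools import combinations
from math import comb

ncols = 50
feature_names = [f"x{i}" for i in range(ncols)]

pairs = list(combinations(feature_names, 2))
print(len(pairs), comb(ncols, 2))
print(pairs[:3])
```

The count grows quadratically with the number of columns, so doubling the features roughly quadruples the number of regressions to fit.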

In [None]:
# Many calls to linear regression, parallelized with Dask
def fit_ols(df, feature_cols, target_col="y"):
    clf = LinearRegression()
    clf.fit(df[list(feature_cols)], df[target_col])
    return feature_cols, clf.coef_, clf.intercept_

In [None]:
# scatter the data to the workers beforehand
data_future = client.scatter(df, broadcast=True)

In [None]:
results = []

for features in feature_combinations:
    # Note that we pass the scattered data future, not the DataFrame itself
    res = delayed(fit_ols)(data_future, features)
    results.append(res)

res = dask.compute(results)
res = res[0]

print("Features\t\tCoefficients\t\t\tIntercept")
for i in range(5):
    print(res[i][0], res[i][1].values, res[i][2], sep="\t")

# Handling Parquet Files

Dask and cuDF provide accelerated Parquet readers and writers, and it's useful to take advantage of these tools.

To start, let's write out our DataFrame `ddf` to Parquet files using the `to_parquet` API and delete it from memory.

In [None]:
ddf.to_parquet("ddf.parquet")

In [None]:
del ddf

Let's take a look at what happened.

In [None]:
!ls ddf.parquet | head

We end up with 365 Parquet files and one metadata file, because Dask writes one file per partition.

Let's read the data back in with `dask_cudf.read_parquet`.

In [None]:
ddf = dask_cudf.read_parquet("ddf.parquet/")
ddf

Only about 210 partitions? It turns out that some of our partitions were empty. The `_metadata` file helps us avoid reading these files in. But we can still read them all if we want by using a `*` wildcard in the filepath and ignoring the metadata.

In [None]:
ddf = dask_cudf.read_parquet("ddf.parquet/*.parquet")
ddf

Let's now write one big parquet file and then read it back in. We can `repartition` our dataset down to a single partition.

In [None]:
ddf.repartition(npartitions=1).to_parquet("big_ddf.parquet")

In [None]:
dask_cudf.read_parquet("big_ddf.parquet/")

We still get 32 partitions? We can control the splitting behavior using the `split_row_groups` parameter.

In [None]:
dask_cudf.read_parquet("big_ddf.parquet/", split_row_groups=False)

In general, we want to avoid massive partitions. The sweet spot is probably around 2-3 GB of data per partition for a 32GB V100.
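
A quick way to turn that rule of thumb into a partition count is to divide the total data size by the target partition size. The helper `suggested_npartitions` and the 100 GB figure below are hypothetical, and the sketch assumes the data splits evenly across partitions.

```python
import math


def suggested_npartitions(total_bytes, target_partition_bytes):
    """Smallest partition count that keeps every partition at or under the target size."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# Hypothetical 100 GB dataset, aiming for roughly 2 GB per partition.
n = suggested_npartitions(100 * 10**9, 2 * 10**9)
print(n)
```

The result could be passed to `repartition(npartitions=...)`, though skewed data may still leave some partitions larger than the target.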

# Understanding Persist and Compute

Before we close, it's worth coming back to the concepts of `persist` and `compute`. We've seen them several times, but haven't gone into depth.

Most Dask operations are lazy. This is a common pattern in distributed computing, but is likely unfamiliar to those who primarily use single-machine libraries like pandas and cuDF. As a result, you'll usually need to call an **eager** operation like `len` or `persist` to actually trigger work.

In general, you should avoid calling `compute` except when collecting small datasets or scalars. The Python process in which we created the `Client` object above is the client process: it talks to the cluster but is separate from the workers. Calling `compute` brings all of the results back to a single cuDF DataFrame on the GPU in the client process, not in any of the worker processes. This means we're not using the same memory pool, so we could run out of memory if we're not careful.

For those of you with Spark experience, you can think of `persist` as triggering work and caching the dataframe in distributed memory and `compute` as collecting the data or results into a single GPU dataframe (cuDF) on the driver.


### Should I Persist My Data?

Persisting is generally a good idea if the data needs to be accessed multiple times, to avoid repeated computation. However, if the size of your data would lead to memory pressure, this could cause spilling, which hurts performance. As a best practice, we recommend persisting only when necessary or when you're using an eager operation in the middle of your workflow (to avoid repeating computation).

# Summary

RAPIDS lets us scale up and take advantage of GPU acceleration. Dask lets us scale out to multiple machines. Dask supports both cuDF DataFrames and CuPy arrays, with generally the same APIs as the single-machine libraries.

We encourage you to read the Dask [documentation](https://docs.dask.org/en/latest/) to learn more, and also look at our [10 Minute Guide to cuDF and Dask cuDF](https://docs.rapids.ai/api/cudf/nightly/10min.html).