<a href="https://colab.research.google.com/drive/1nCh7YtUZ4vqCxkdpV5PrqhfZIxvowMr_?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Accelerating Dask with GPUs (via RAPIDS)

We've seen in lecture how the RAPIDS libraries make it possible to accelerate common analytical workflows on GPUs using libraries like cudf (for GPU DataFrames) and cuml (for basic GPU machine learning operations on DataFrames). When your data gets especially large (e.g. exceeding the memory capacity of a single GPU) or your computations get especially cumbersome, Dask makes it possible to scale these workflows out even further -- distributing work out across a cluster of GPUs.

In AWS Academy, recall that we cannot create GPU clusters. However, this notebook should also be runnable on multi-GPU EC2 instances and clusters (on AWS) if you use a personal account to request these resources. Here (on Colab), we'll demonstrate using a single GPU. Note that the setup portion of this notebook draws on [a setup notebook](https://colab.research.google.com/drive/13sspqiEZwso4NYTbsflpPyNFaVAAxUgr) linked in the RAPIDS documentation and is meant to be run in a Colab notebook.

This demo is built off of the notebooks provided in the [RAPIDS notebook repositories](https://github.com/rapidsai/notebooks) on GitHub (and you can explore them further if you are interested! There are many other relevant libraries in the RAPIDS ecosystem -- e.g. `cugraph` which allows you to perform network analyses on GPUs).

## Setup

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.

In [1]:
!nvidia-smi

Mon Apr 28 18:53:45 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   49C    P8             12W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

Then we run the setup script below, which:

1. Checks to make sure that the GPU is RAPIDS compatible
1. Installs the **current stable version** of RAPIDSAI's core libraries using pip and **will complete in about 3-4 minutes**

In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

At this point, our RAPIDS libraries are now installed on Colab and we can import them into our session. Let's use `dask_cuda`'s API to launch a Dask GPU cluster and pass this cluster object to our `dask.distributed` client. `LocalCUDACluster()` will count each available GPU in our cluster (in this case, 1 GPU) as a Dask worker and assign it work.

In [3]:
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster() # Identify all available GPUs
client = Client(cluster)

INFO:distributed.scheduler:State start
INFO:distributed.scheduler:  Scheduler at:     tcp://127.0.0.1:46669
INFO:distributed.scheduler:  dashboard at:  http://127.0.0.1:8787/status
INFO:distributed.scheduler:Registering Worker plugin shuffle
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:38045'
INFO:distributed.scheduler:Register worker addr: tcp://127.0.0.1:41751 name: 0
INFO:distributed.scheduler:Starting worker compute stream, tcp://127.0.0.1:41751
INFO:distributed.core:Starting established connection to tcp://127.0.0.1:56224
INFO:distributed.scheduler:Receive client connection: Client-5fbf17e2-2462-11f0-80d0-0242ac1c000c
INFO:distributed.core:Starting established connection to tcp://127.0.0.1:56234


## GPU DataFrames

From here, we can use `dask_cudf` to automate the process of partitioning our data across our GPU workers and instantiating a GPU-based DataFrame on our GPU that we can work with. Let's load in the same AirBnB data that we were working with in the `numba` + `dask` CPU demonstration:

In [5]:
import dask_cudf

df = dask_cudf.read_csv('listings*.csv')
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,3781,HARBORSIDE-Walk to subway,4804,Frank,,East Boston,42.36413,-71.02991,Entire home/apt,125,32,19,2021-02-26,0.27,1,106
1,6695,$99 Special!! Home Away! Condo,8229,Terry,,Roxbury,42.32802,-71.09387,Entire home/apt,169,29,115,2019-11-02,0.81,4,40
2,10813,"Back Bay Apt-blocks to subway, Newbury St, The...",38997,Michelle,,Back Bay,42.35061,-71.08787,Entire home/apt,96,29,5,2020-12-02,0.08,11,307
3,10986,North End (Waterfront area) CLOSE TO MGH & SU...,38997,Michelle,,North End,42.36377,-71.05206,Entire home/apt,96,29,2,2016-05-23,0.03,11,293
4,13247,Back Bay studio apartment,51637,Susan,,Back Bay,42.35164,-71.08752,Entire home/apt,75,91,0,,,2,0


Once we have that data, we can perform many of the standard DataFrame operations we perform on CPUs -- just accelerated by our GPU cluster!

In [6]:
df.groupby(['neighbourhood', 'room_type']) \
  .price \
  .mean() \
  .compute()

neighbourhood       room_type      
Edgewater           Private room        79.964912
Outer Richmond      Entire home/apt    221.408163
Washington Heights  Private room        37.333333
North End           Entire home/apt    189.984375
West End            Entire home/apt    201.440000
                                          ...    
Leather District    Entire home/apt    199.000000
Mission Hill        Shared room         20.000000
Nob Hill            Private room       142.802198
Mission             Shared room         32.000000
Near West Side      Shared room         33.000000
Name: price, Length: 341, dtype: float64

## Training Machine Learning Models with `cuml`

In addition to preprocessing and analyzing data on GPUs, we can also train (a limited set of) Machine Learning models directly on our GPU cluster using the `cuml` library in the RAPIDS ecoystem as well. This can give us a significant speedup in training time over libraries like `sklearn` on CPUs for large datasets.

For instance, let's train a linear regression model based on our data from San Francisco, Chicago, and Boston to predict the price of an AirBnB based on other values in its listing information (e.g. \"reviews per month\" and \"minimum nights\"). We'll then use this model to make predictions about the price of AirBnBs in another city (NYC):

In [None]:
from cuml.dask.linear_model import LinearRegression
import numpy as np

X = df[['reviews_per_month', 'minimum_nights']].astype(np.float32).dropna()
y = df[['price']].astype(np.float32).dropna()
fit = LinearRegression().fit(X, y)

Then, we can read in the NYC dataset and make predictions about what prices will be in NYC on the basis of the model we trained on data from our three original cities:

In [9]:
df_nyc = dask_cudf.read_csv('test*.csv')
X_test = df_nyc[['reviews_per_month', 'minimum_nights']].astype(np.float32) \
                                                        .dropna()
fit.predict(X_test) \
   .compute() \
   .head()

0    184.802887
1    188.286636
2    184.802887
3    183.658218
4    186.646774
dtype: float32

If we take a look at other standard machine learning algorithms in the documentation (for instance [k means clustering](https://github.com/rapidsai/cuml/blob/branch-23.04/notebooks/kmeans_demo.ipynb)) as well, we can see significant speedups over performing the same operations on large datasets in scikit-learn on a CPU.

Note, though, that this is only true of larger data. For smaller data sizes, we will see comparable performance on CPU and GPUs. **Ask:** why?