## Simpler Multi-GPU ETL using Dask ##
A major focus of the RAPIDS project is easier scaling: up *and* out.

While multi-GPU, multi-node data processing can be difficult to manage, the [dask-cuda project](https://github.com/rapidsai/dask-cuda) automatically configures Dask worker processes to use the available GPUs, and [dask-cudf](https://github.com/rapidsai/cudf/tree/branch-21.06/python/dask_cudf) supports a variety of common ETL operations and friendlier parallel IO.

The rest of this notebook demonstrates just how simple parallel processing is with RAPIDS, and how you can scale your data science work to multiple GPUs with ease.

First, let's see what GPUs we have available...

In [1]:
import cudf
import dask, dask_cudf
import os
import urllib.request
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from dask.diagnostics import ProgressBar

# Use dask-cuda to start one worker per GPU on a single-node system
# When you shutdown this notebook kernel, the Dask cluster also shuts down.
cluster = LocalCUDACluster(ip='0.0.0.0')
client = Client(cluster)
# print client info
client

Connection method: Cluster object, Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://172.17.0.2:8787/status

Dashboard: http://172.17.0.2:8787/status, Workers: 6
Total threads: 6, Total memory: 220.05 GiB
Status: running, Using processes: True

Comm: tcp://172.17.0.2:36637, Workers: 6
Dashboard: http://172.17.0.2:8787/status, Total threads: 6
Started: Just now, Total memory: 220.05 GiB

Comm: tcp://172.17.0.2:39173, Total threads: 1
Dashboard: http://172.17.0.2:40233/status, Memory: 36.67 GiB
Nanny: tcp://172.17.0.2:37105
Local directory: /tmp/dask-scratch-space/worker-hqti_i_q

Comm: tcp://172.17.0.2:36185, Total threads: 1
Dashboard: http://172.17.0.2:37283/status, Memory: 36.67 GiB
Nanny: tcp://172.17.0.2:38357
Local directory: /tmp/dask-scratch-space/worker-68l8t9x3

Comm: tcp://172.17.0.2:32843, Total threads: 1
Dashboard: http://172.17.0.2:40347/status, Memory: 36.67 GiB
Nanny: tcp://172.17.0.2:37553
Local directory: /tmp/dask-scratch-space/worker-tq899dxl

Comm: tcp://172.17.0.2:46851, Total threads: 1
Dashboard: http://172.17.0.2:43919/status, Memory: 36.67 GiB
Nanny: tcp://172.17.0.2:39139
Local directory: /tmp/dask-scratch-space/worker-qt9hl3jr

Comm: tcp://172.17.0.2:44841, Total threads: 1
Dashboard: http://172.17.0.2:34689/status, Memory: 36.67 GiB
Nanny: tcp://172.17.0.2:35507
Local directory: /tmp/dask-scratch-space/worker-911rowxj

Comm: tcp://172.17.0.2:46541, Total threads: 1
Dashboard: http://172.17.0.2:37207/status, Memory: 36.67 GiB
Nanny: tcp://172.17.0.2:44231
Local directory: /tmp/dask-scratch-space/worker-dld36fot




Ok, we've got a cluster of GPU workers. Notice also the link to the Dask status dashboard. It provides lots of useful information while running data processing tasks.

Next, we install `s3fs` and make sure `matplotlib` is available. If `matplotlib` is not already installed in the environment, the cell below installs it and then imports it.

In [2]:
!mamba install -y -c conda-forge s3fs
import urllib.request

try:
    import matplotlib.pyplot as plt
except ImportError:
    !mamba install -y -c conda-forge matplotlib
    import matplotlib.pyplot as plt
from matplotlib.pyplot import *

conda-forge/linux-64                                        Using cache
conda-forge/noarch                                          Using cache
pytorch/linux-64                                            Using cache
pytorch/noarch                                              Using cache
nvidia/linux-64                                             Using cache
nvidia/noarch                                               Using cache



## Accessing Data

Now, let's download a dataset.

If you're working on a local machine, you'd normally use wget, Python's `urllib` package, or another tool to pull down the data you want to analyze.

For the sake of not making you wait for 200+ files to download, the cell below uses `urllib` to download just 20 years of weather records, plus a metadata file about the stations that recorded them. You can update the `years` list if you want to download more. It won't change the logic in the notebook; it'll just process more data.

*Note*: The rest of the markdown commentary in this notebook assumes you're operating on all 232 years of data.

### Make and set a home for your data

In [3]:
# create ./data/weather/ (and its parent ./data/) if it doesn't exist yet
data_dir = './data/weather/'
if not os.path.exists(data_dir):
    print('creating weather data directory')
    os.makedirs(data_dir, exist_ok=True)

### Choose and Download your data

This step may take a few minutes while 20 years of data are downloaded.

In [4]:
# download weather observations
base_url = 'ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/'
years = list(range(2000, 2020))
for year in years:
    fn = str(year) + '.csv.gz'
    if not os.path.isfile(data_dir+fn):
        print(f'Downloading {base_url+fn} to {data_dir+fn}')
        urllib.request.urlretrieve(base_url+fn, data_dir+fn)
        
# download weather station metadata
station_meta_url = 'https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt'
if not os.path.isfile(data_dir+'ghcnd-stations.txt'):
    print('Downloading station meta..')
    urllib.request.urlretrieve(station_meta_url, data_dir+'ghcnd-stations.txt')

## Alternatives to Pre-Downloading Data
**Please uncomment the code cell below if you intend on using this method**

While downloading or copying data to your local environment is a good way to get started, many users will want other options:

1. Reading directly from distributed storage, like HDFS
2. Reading from cloud storage (S3, GCS, ADLS, etc)

See [Dask Remote Data Services](http://docs.dask.org/en/latest/remote-data-services.html) for more details on supported providers, authentication, and other storage configuration options.

Here's an example of reading the same weather data, conveniently available in a public Amazon S3 bucket.

But first make sure your Python environment has the right packages to read from your storage system of choice.

For this example: ```conda install -y s3fs```

In [5]:
# these CSV files don't have headers, so we specify column names manually
names = ["station_id", "date", "type", "val"]
# there are more fields, but only the first 4 are relevant in this notebook
usecols = names[0:4]

# url = 's3://noaa-ghcn-pds/csv/1788.csv'
# dask_cudf.read_csv(url, names=names, usecols=usecols, storage_options={'anon': True})

## Reading Large & Multi-File Datasets

Wait... there are many weather files: one for each year going back to the 1780s.

To read all these files in, we can simply use `dask_cudf.read_csv`, which supports file globs, _and_ automatically splits files into chunks that can be processed serially when needed, so you're less likely to run out of memory.

When you call `dask_cudf.read_csv`, Dask reads metadata for each CSV file and tasks workers with lists of filenames & byte-ranges that they're responsible for loading with cuDF's GPU CSV reader.

*Note*: compressed files are not splittable on read, but you can [repartition](https://docs.dask.org/en/latest/dataframe-best-practices.html#repartition-to-reduce-overhead) them downstream.

In [6]:
weather_ddf = dask_cudf.read_csv(data_dir+'*.csv.gz', names=names, usecols=usecols, compression='gzip', blocksize=None)
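
To make the idea of byte-range chunking concrete, here is a minimal, CPU-only sketch in plain Python. It is not Dask's implementation: the `byte_ranges` helper and the file sizes are hypothetical, and a real reader also has to align chunks to line boundaries. It only shows how a file size and a block size turn into a list of `(offset, length)` tasks, and why `blocksize=None` (as used above for gzipped files) yields a single chunk per file.

```python
def byte_ranges(file_size, blocksize):
    """Split a file of `file_size` bytes into (offset, length) chunks.

    Hypothetical helper for illustration only; not part of Dask or cuDF.
    """
    if blocksize is None:
        # One chunk covering the whole file, e.g. for non-splittable gzip files
        return [(0, file_size)]
    return [
        (offset, min(blocksize, file_size - offset))
        for offset in range(0, file_size, blocksize)
    ]


# Made-up sizes: a 250-byte file read in 100-byte blocks
print(byte_ranges(250, 100))
# The same file treated as non-splittable
print(byte_ranges(250, None))
```

Each tuple corresponds to one unit of work that could be handed to a worker; with one chunk per file, the number of partitions equals the number of files.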

## Let's Process Some Data

Per the [readme](https://docs.opendata.aws/noaa-ghcn-pds/readme.html) for this dataset, multiple types of weather observations are in the same files, and each carries a different unit of measure:

| Observation Type | Measurement (Existing Units) | Action |
| ------------- | ------------- | ------------- |
| PRCP | Precipitation (tenths of mm) | convert to inches |
| SNWD | Snow depth (mm) | convert to inches |
| TMAX | Maximum temperature (tenths of degrees C) | convert to Fahrenheit |
| TMIN | Minimum temperature (tenths of degrees C) | convert to Fahrenheit |

There are even more observation types, each with its own units of measure, but I won't list them all. In this notebook, I'm going to focus specifically on precipitation.

The `type` column tells us what kind of weather observation each record represents. Ordinarily, you might use `query` to filter out subsets of records and apply different logic to each subset. However, [query doesn't support string datatypes yet](https://github.com/rapidsai/cudf/issues/111). Instead, you can use boolean indexing.

For numeric types, Dask with cuDF works mostly like regular Dask. For instance, you can define new columns as combinations of other columns:

In [7]:
precip_index = weather_ddf['type'] == 'PRCP'
precip_ddf = weather_ddf[precip_index]

# convert 10ths of mm to inches
mm_to_inches = 0.0393701
precip_ddf['val'] = precip_ddf['val'] * 1/10 * mm_to_inches
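
As a quick sanity check on the unit conversions in the table above, here is a small plain-Python sketch of the arithmetic. The function names are hypothetical, and the temperature helper is included only to illustrate the "convert to Fahrenheit" action; the notebook itself converts precipitation only.

```python
MM_TO_INCHES = 0.0393701  # same conversion factor as in the cell above


def tenths_mm_to_inches(val):
    """Convert a PRCP value stored in tenths of a millimeter to inches."""
    return val / 10 * MM_TO_INCHES


def tenths_c_to_fahrenheit(val):
    """Convert a TMAX/TMIN value stored in tenths of a degree C to Fahrenheit."""
    return val / 10 * 9 / 5 + 32


# A raw PRCP value of 10 means 1 mm of rain
print(tenths_mm_to_inches(10))
# A raw TMAX value of 250 means 25 degrees C
print(tenths_c_to_fahrenheit(250))
```

A raw value of 10 works out to about 0.03937 inches, which matches the smallest non-zero `val` entries that show up in the converted output.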

Note: Calling `.head()` reads the first few rows, usually from the first partition.

In the full dataset, the first partition represents weather data from 1788. Apparently, there wasn't _any_ precipitation data collected that year, so `.head()` would come back empty after filtering.

In your own analyses, make sure you call `.head()` on a partition that still has rows left after filtering! The cell below looks at the second partition instead:

In [8]:
precip_ddf.get_partition(1).head()

Unnamed: 0,station_id,date,type,val
27,AGM00060355,20010101,PRCP,0.03937
30,AGM00060360,20010101,PRCP,0.11811
33,AGM00060402,20010101,PRCP,0.161417
37,AGM00060419,20010101,PRCP,0.07874
47,AGM00060445,20010101,PRCP,0.03937


Ok, we have a lot of weather observations. Now what?

# Answering Questions With Data

For some reason, residents of particular cities like to lay claim to having the best, or the worst of something. For Los Angeles, it's having the worst traffic. New Yorkers and Chicagoans argue over who has the best pizza. [West Coasters argue about who has the most rain](https://twitter.com/MikeNiccoABC7/status/1105184947663396864).

Well... as a longtime Atlanta resident suffering from humidity exhaustion, I like to joke that with all the spring showers, _Atlanta_ is the new Seattle.

Does my theory hold water? Or will the data rain on my bad pun parade?

# How Can I Test My Theory?

We've already created `precip_ddf`, which contains only the precipitation observations, but it covers all 100k weather stations, most of them nowhere near Atlanta. It's also time-series data, so we'll need to aggregate over time ranges.

To get down to just Atlanta and Seattle precipitation records, we have to...

1. Extract year, month, and day from the compound "date" column, so that we can compare total rainfall across time.

2. Load up the station metadata file.

3. There's no "city" in the station metadata, so we'll do some geo-math and keep only stations near Atlanta and Seattle.

4. Use a groupby to compare changing precipitation patterns across time.

5. Use inner joins to filter the precipitation dataframe down to just Atlanta & Seattle data.

## 1. Extracting Finer Grained Date Fields

We can use cuDF's `to_datetime` function to map our date column into separate date parts. Dask's [map_partitions](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.Series.map_partitions) function applies a given Python function to all partitions of a distributed DataFrame or Series. When you do this on a dask_cudf DataFrame or Series, your input is a cuDF object.

In [9]:
#convert date column to a series of datetime objects
dates = precip_ddf['date'].map_partitions(cudf.to_datetime, format='%Y%m%d', meta=("date", "datetime64[ns]"))

#assign new columns to their respective date parts
precip_ddf['year'] = dates.dt.year
precip_ddf['month'] = dates.dt.month
precip_ddf['day'] = dates.dt.day

precip_ddf.head()

Unnamed: 0,station_id,date,type,val,year,month,day
14,AG000060390,20000101,PRCP,0.031496,2000,1,1
18,AG000060590,20000101,PRCP,0.0,2000,1,1
21,AG000060611,20000101,PRCP,0.0,2000,1,1
24,AG000060680,20000101,PRCP,0.0,2000,1,1
28,AGE00147718,20000101,PRCP,0.0,2000,1,1


The map_partitions pattern is also useful whenever there are cuDF specific functions without a direct mapping into Dask.
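
The following CPU-only sketch mimics the pattern with plain Python lists standing in for partitions. `apply_to_partitions` and `split_dates` are hypothetical names, not Dask or cuDF APIs; the point is only that the same function runs independently on each partition, and that the `%Y%m%d` format splits a value like `20000101` into year, month, and day.

```python
from datetime import datetime


def split_dates(partition):
    """Parse YYYYMMDD integers into (year, month, day) tuples."""
    parsed = [datetime.strptime(str(d), "%Y%m%d") for d in partition]
    return [(d.year, d.month, d.day) for d in parsed]


def apply_to_partitions(func, partitions):
    """Stand-in for map_partitions: call `func` once per partition."""
    return [func(partition) for partition in partitions]


# Two made-up partitions of the `date` column
partitions = [[20000101, 20000102], [20010101]]
result = apply_to_partitions(split_dates, partitions)
print(result)
```

In the real workflow each partition is a cuDF object living on a GPU, and Dask schedules the per-partition calls across workers instead of looping over them.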

## 2. Loading Station Metadata ##

In [10]:
!head -n 5 ./data/weather/ghcnd-stations.txt


Wait... That's no CSV file! It's fixed-width!

That's annoying because we don't have a reader for it. We could use CPU code to pre-process the file, making it friendlier for loading into a DataFrame, but RAPIDS is about end-to-end data processing without leaving the GPU.

This file is small enough that we can handle it directly with cuDF on a single GPU.

Here's how to clean up this metadata using cuDF and string operations:

In [11]:
import cudf

fn = data_dir+'ghcnd-stations.txt'
# There are no '|' chars in the file. Use that to read the file as a single column per line
# quoting=3 handles misplaced quotes in the `name` field 
station_df = cudf.read_csv(fn, sep='|', quoting=3, names=['lines'], header=None)

# you can use normal DataFrame .str accessor, and chain operators together
station_df['station_id'] = station_df['lines'].str.slice(0, 11).str.strip()
station_df['latitude'] = station_df['lines'].str.slice(12, 20).str.strip()
station_df['longitude'] = station_df['lines'].str.slice(21, 30).str.strip()
station_df = station_df.drop('lines', axis=1)

station_df.head()

Unnamed: 0,station_id,latitude,longitude
0,ACW00011604,17.1167,-61.7833
1,ACW00011647,17.1333,-61.7833
2,AE000041196,25.333,55.517
3,AEM00041194,25.255,55.364
4,AEM00041217,24.433,54.651
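
For readers without a GPU at hand, the same fixed-width slicing can be sketched with ordinary Python strings. The sample line below is constructed for illustration from the first row of the output above, padded so that it follows the column offsets used in the cuDF cell; `parse_station_line` is a hypothetical helper, and unlike the notebook it converts to floats in the same step.

```python
def parse_station_line(line):
    """Slice one fixed-width line into (station_id, latitude, longitude)."""
    station_id = line[0:11].strip()
    latitude = float(line[12:20].strip())
    longitude = float(line[21:30].strip())
    return station_id, latitude, longitude


# Illustrative line: 11-char id, then an 8-char and a 9-char numeric field
sample = f"{'ACW00011604':<11} {17.1167:>8.4f} {-61.7833:>9.4f}"
print(repr(sample))
print(parse_station_line(sample))
```

The slice boundaries are the same ones passed to `.str.slice` above, so the sketch doubles as a quick way to check offsets against a single line before running them over the whole file.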


# Managing Memory

While GPU memory is very fast, there's less of it than host RAM. It's a good idea to avoid storing lots of columns that aren't useful for what you're trying to do, especially when they're strings.

For example, for the station metadata, there are more columns than we parsed out above. In this workflow we only need `station_id`, `latitude`, and `longitude`, so we skipped parsing the rest of the columns.

We also need to convert latitude and longitude from strings to floats before we can do any distance math.

In [12]:
# you can cast string columns to numerics
station_df['latitude'] = station_df['latitude'].astype('float')
station_df['longitude'] = station_df['longitude'].astype('float')

In [13]:
station_df.head(20).to_csv("test.csv", index = False)

## 3. Filtering Weather Stations by Distance

We will be using cuSpatial to get the Haversine Distance and figure out which stations are within a given distance from a city.

For this scenario, we've manually looked up Atlanta and Seattle's city centers and will fill `cudf.Series` with their latitude and longitude values. Then we can call a cuSpatial function to compute the distance between each station and each city.

In [14]:
import cuspatial
import geopandas
from shapely.geometry import Point

#Let's create a cuSpatial GeoSeries with the station data
stations = cuspatial.GeoSeries.from_points_xy(station_df[['longitude','latitude']].interleave_columns())

# fill new GeoSeries with Atlanta lat/lng
station_df['atlanta_lat'] = 33.7490
station_df['atlanta_lng'] = -84.3880
atl = cuspatial.GeoSeries.from_points_xy(station_df[['atlanta_lng','atlanta_lat']].interleave_columns())

# compute distance from each station to Atlanta
station_df['atlanta_dist'] = cuspatial.haversine_distance(stations, atl)

# fill new GeoSeries with Seattle lat/lng
station_df['seattle_lat'] = 47.6219
station_df['seattle_lng'] = -122.3517
stl = cuspatial.GeoSeries.from_points_xy(station_df[['seattle_lng','seattle_lat']].interleave_columns())

# compute distance from each station to Seattle
station_df['seattle_dist'] = cuspatial.haversine_distance(stations, stl)
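
For reference, the haversine formula itself fits in a few lines of plain Python. This is a CPU-only sketch, not cuSpatial's code: `haversine_km` is a hypothetical helper, and the Earth radius of 6371 km is an assumption of the sketch. The station coordinates are those of `US1GACB0002` from the station metadata.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


atlanta = (33.7490, -84.3880)
seattle = (47.6219, -122.3517)

# Distance from Atlanta's city center to station US1GACB0002
print(haversine_km(*atlanta, 33.8939, -84.4938))
# Distance between the two city centers
print(haversine_km(*atlanta, *seattle))
```

The first value comes out at roughly 18.8, which suggests the distances used in this notebook are in kilometers.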

### Checking the Results

In [15]:
# Inspect the results:
atlanta_stations_df = station_df.query('atlanta_dist <= 25')
seattle_stations_df = station_df.query('seattle_dist <= 25')

print(f'Atlanta Stations: {len(atlanta_stations_df)}')
print(f'Seattle Stations: {len(seattle_stations_df)}')

atlanta_stations_df.head()

Atlanta Stations: 78
Seattle Stations: 156


Unnamed: 0,station_id,latitude,longitude,atlanta_lat,atlanta_lng,atlanta_dist,seattle_lat,seattle_lng,seattle_dist
64503,US1GACB0002,33.8939,-84.4938,33.749,-84.388,18.844744,47.6219,-122.3517,3489.923424
64505,US1GACB0004,33.9512,-84.4219,33.749,-84.388,22.700514,47.6219,-122.3517,3491.328996
64506,US1GACB0005,33.8274,-84.4988,33.749,-84.388,13.447851,47.6219,-122.3517,3494.054111
64508,US1GACB0007,33.8714,-84.5221,33.749,-84.388,18.404877,47.6219,-122.3517,3489.369691
64510,US1GACB0014,33.8907,-84.5946,33.749,-84.388,24.749221,47.6219,-122.3517,3482.751406


[Google tells me those station ids are from Smyrna](https://geographic.org/global_weather/georgia/smyrna_23_ne_002.html), a town just outside of Atlanta's perimeter. Our distance calculation worked!

## 4. Grouping & Aggregating by Time Range

Before using an inner join to filter down to city-specific precipitation data, we can use a groupby to sum the precipitation by station and year. That'll allow the join to proceed faster and use less memory.

One total precipitation record per station per year is relatively small, and we're going to need to graph this data, so we'll go ahead and `compute()` the result, asking Dask to aggregate across the 200+ years' worth of data and bring the results back to the client as a single-GPU cuDF DataFrame.

Note that with Dask, data is partitioned and distributed across multiple workers. Some operations require that workers "[shuffle](http://docs.dask.org/en/latest/dataframe-groupby.html#)" data from their partitions back and forth across the network, which has major performance implications. Today, join, groupby, and sort operations can be fairly network constrained.

See the [slides](https://www.slideshare.net/MatthewRocklin/ucxpython-a-flexible-communication-library-for-python-applications) from a recent talk at GTC San Jose to learn more about [ongoing efforts to integrate Dask with UCX](https://github.com/rapidsai/ucx-py/) and allow it to use accelerated networking hardware like Infiniband and [nvlink](https://www.nvidia.com/en-us/data-center/nvlink/).

In the meantime, distributed operators that require shuffling like joins, groupbys, and sorts work, albeit not as fast as we'd like.

In [16]:
precip_year_ddf = precip_ddf.groupby(by=['station_id', 'year']).val.sum()
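
To show what this aggregation produces, here is a tiny CPU-only sketch using a dictionary keyed by `(station_id, year)`. The station IDs are borrowed from the metadata above, but the precipitation values are made up, and `sum_by_station_year` is a hypothetical helper rather than anything Dask or cuDF provides.

```python
from collections import defaultdict


def sum_by_station_year(records):
    """Sum `val` for each (station_id, year) pair."""
    totals = defaultdict(float)
    for station_id, year, val in records:
        totals[(station_id, year)] += val
    return dict(totals)


# Made-up daily observations: (station_id, year, inches of rain)
records = [
    ("US1GACB0002", 2000, 0.5),
    ("US1GACB0002", 2000, 0.25),
    ("US1GACB0002", 2001, 1.0),
    ("US1GACB0004", 2000, 0.75),
]
totals = sum_by_station_year(records)
print(totals)
```

Four daily rows collapse into three yearly totals here; on the real data the reduction is far larger, which is why aggregating before the join keeps the join small.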

Note that the next cell calls `compute`. This tells Dask to actually start computing the full set of processing logic defined thus far:

1. Read and decompress 232 gzipped files (about 100 GB decompressed)
2. Send to the GPU and parse
3. Filter down to precipitation records
4. Apply a conversion to inches
5. Sum total inches of rain per year for each of the 108k weather stations
6. Combine and pull results into a single-GPU DataFrame on the client host

To wit... this will take some time.

In [17]:
%%time
precip_year_df = precip_year_ddf.compute()

# Convert the multi-indexed groupby result back to a normal DataFrame that we can use with merge
precip_year_df = precip_year_df.reset_index()


2025-02-11 20:09:08,010 - distributed.worker - ERROR - Compute Failed
Key:       ('read_csv-fused-operation-d4d9bc4e2557f2d53e8c288eacef307a', 18)
State:     executing
Task:  <Task ('read_csv-fused-operation-d4d9bc4e2557f2d53e8c288eacef307a', 18) _execute_subgraph(...)>
Exception: "RuntimeError('CUDF failure at: /opt/conda/conda-bld/work/cpp/src/io/comp/uncomp.cpp:284: ZLIB decompression failed')"
Traceback: '  File "/opt/conda/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File "/opt/conda/lib/python3.12/site-packages/cudf/io/csv.py", line 257, in read_csv\n    table_w_meta = plc.io.csv.read_csv(options)\n                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "csv.pyx", line 631, in pylibcudf.io.csv.read_csv\n  File "csv.pyx", line 649, in pylibcudf.io.csv.read_csv\n'



RuntimeError: CUDF failure at: /opt/conda/conda-bld/work/cpp/src/io/comp/uncomp.cpp:284: ZLIB decompression failed
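
A ZLIB decompression failure like the one above is often a sign of a truncated or corrupted `.csv.gz` download, although that is an assumption here rather than a confirmed diagnosis. One way to check is to stream every archive through Python's `gzip` module on the CPU before handing the files to Dask. The sketch below is a minimal version of that idea: `is_valid_gzip` is a hypothetical helper, and the demo files contain a made-up record.

```python
import gzip
import os
import tempfile
import zlib


def is_valid_gzip(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as f:
            # Stream through the file so large archives never sit in memory
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Demo on throwaway files: one intact archive and one cut in half
with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "good.csv.gz")
    bad_path = os.path.join(tmp, "bad.csv.gz")
    with gzip.open(good_path, "wb") as f:
        f.write(b"US1GACB0002,20000101,PRCP,8\n" * 1000)  # made-up record
    with open(good_path, "rb") as src, open(bad_path, "wb") as dst:
        data = src.read()
        dst.write(data[: len(data) // 2])  # simulate an interrupted download
    good_ok = is_valid_gzip(good_path)
    bad_ok = is_valid_gzip(bad_path)

print(good_ok, bad_ok)
```

In this notebook, the same helper could be run over every file matching `data_dir + '*.csv.gz'`; any file that fails the check can be deleted so that the download cell fetches it again.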

## 5. Using Inner Joins to Filter Weather Observations

We have separate DataFrames containing Atlanta and Seattle stations, and we have our total precipitation grouped by `station_id` and `year`. Inner joins let us compute total precipitation by year for just Atlanta and Seattle.

In [None]:
%time atlanta_precip_df = precip_year_df.merge(atlanta_stations_df, on=['station_id'], how='inner')

In [None]:
atlanta_precip_df.head()

In [None]:
%time seattle_precip_df = precip_year_df.merge(seattle_stations_df, on=['station_id'], how='inner')

In [None]:
seattle_precip_df.head()
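
The reason an inner join works as a filter is that rows without a matching key on the other side are dropped. Here is a minimal CPU-only sketch of that behavior with lists of dictionaries; `inner_join`, the station IDs, and all values are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def inner_join(left, right, key):
    """Keep only left rows whose key appears in right, merging the columns."""
    right_index = defaultdict(list)
    for row in right:
        right_index[row[key]].append(row)
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in right_index.get(l_row[key], [])
    ]


# Made-up yearly totals for three stations
precip_year = [
    {"station_id": "ATL_1", "year": 2000, "val": 48.0},
    {"station_id": "SEA_1", "year": 2000, "val": 37.0},
    {"station_id": "FAR_1", "year": 2000, "val": 12.0},
]
# Made-up list of stations near Atlanta
atlanta_stations = [{"station_id": "ATL_1", "atlanta_dist": 18.8}]

atlanta_precip = inner_join(precip_year, atlanta_stations, "station_id")
print(atlanta_precip)
```

Only the `ATL_1` row survives, and it picks up the extra columns from the station table, which mirrors what `merge(..., how='inner')` does in the cells above.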

Lastly, we need to normalize the total amount of rain in each city by the number of stations which collected rainfall: Seattle had twice as many stations collecting, but that doesn't mean more total rain fell! 

In [None]:
atlanta_rain = atlanta_precip_df.groupby(['year']).val.sum()/len(atlanta_stations_df)
atlanta_rain.head()

In [None]:
seattle_rain = seattle_precip_df.groupby(['year']).val.sum()/len(seattle_stations_df)

seattle_rain.head()
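
A small worked example shows why the normalization matters. The numbers below are invented, and `rain_per_station` is a hypothetical helper: a city with more stations can report a larger total while receiving less rain per station.

```python
from collections import defaultdict


def rain_per_station(rows, n_stations):
    """Sum `val` per year, then divide by the number of stations in the city."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["year"]] += row["val"]
    return {year: total / n_stations for year, total in totals.items()}


# City A: 2 stations, 40 inches each -> 80 inches in total
city_a = [{"year": 2000, "val": 40.0} for _ in range(2)]
# City B: 4 stations, 30 inches each -> 120 inches in total
city_b = [{"year": 2000, "val": 30.0} for _ in range(4)]

city_a_avg = rain_per_station(city_a, n_stations=2)
city_b_avg = rain_per_station(city_b, n_stations=4)
print(city_a_avg, city_b_avg)
```

City B's raw total is larger, yet its per-station average is lower, which is the comparison the cells above make between Atlanta and Seattle.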

## Visualizing the Answer

To generate the graphs in the cells below, first you'll need to ```conda install -y python-graphviz matplotlib```

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.close('all')
plt.rcParams['figure.figsize'] = [20, 10]

fig, ax = plt.subplots()

atlanta_rain.sort_index().to_pandas().plot(ax=ax)
seattle_rain.sort_index().to_pandas().plot(ax=ax)

ax.legend(['Atlanta', 'Seattle'])

# Results

It looks like I'm right (mostly)! At least for roughly the last 80 years, it rains more by volume in Atlanta than it does in Seattle. The data seems to confirm my suspicions.

But as usual the answer raises additional questions:

1. Without singling out Atlanta and Seattle, which city actually has the most precipitation by volume?

2. Why is there such a large increase in observed precipitation in the last 10 years?

3. One friend noted that it rains more frequently in Seattle, just not as hard. A contrarian was quick to point out that it mists a lot in Seattle. How often is it just "misty", but not really raining?

We'll revisit these questions in a future post, and look forward to seeing what kinds of analyses YOU come up with.

# Takeaways

We just showed some of the ways you can use Dask and cuDF to parallelize typical data processing tasks on multiple GPUs. Hopefully this notebook provides useful examples to refer to while doing your own ETL & analytics work.

For more info on what's working today with Dask and cuDF, see [our summary](https://docs.rapids.ai/api/cudf/stable/), and follow [our ongoing development](https://github.com/rapidsai/cudf).

Also check out other [community contributed notebooks](https://github.com/rapidsai/notebooks-contrib), and submit your own!