# GPU ETL with cuDF

<img src="https://raw.githubusercontent.com/rapidsai-community/tutorial/refs/heads/main/images/cudf-benchmark.png" style="float: right; margin-left: 15px;">

`cuDF` is a high-level Python GPU DataFrame library, built on the Apache Arrow columnar memory format, with a pandas-like API.

- Core functions for loading, filtering, aggregating, joining and manipulating data
- Numeric, datetime, categorical, string and nested data
- GPU accelerated I/O (e.g., CSV, Parquet, JSON)
- 10-100x faster than pandas*
- Implements a [subset](https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/#cudf-api) of the `pandas` API (60-75%), 
but it should be very familiar to `pandas` users.
- Built upon the `libcudf` CUDA C++ library
 
When using `cuDF`, data is loaded onto the GPU and all operations are performed with GPU compute.

**Two modes of usage:**
- Standalone library (this notebook)
- cudf.pandas (next notebook)

**Image Note:**
Benchmark on AMD EPYC 7642 (using 1x 2.3GHz CPU core) w/ 512GB and NVIDIA A100 80GB (1x GPU) w/ pandas v1.5 and cuDF v23.02

**Attribution:** This section of the tutorial is based on the cuDF notebook from [Accelerated Computing Hub GPU python-tutorial](https://github.com/NVIDIA/accelerated-computing-hub/blob/main/gpu-python-tutorial/6.0_cuDF.ipynb)

### Data 

If you are running this locally and you followed the steps in notebook [0.Welcome_and_Setup.ipynb](https://github.com/rapidsai-community/tutorial/blob/main/0.Welcome_and_Setup.ipynb), you should have the `data/` folder ready to go.

#### Google Colab Instructions

In the next step we download a script that will allow you to get the data for this notebook session.


In [None]:
# colab: uncomment next line to get the data setup script
#! wget https://raw.githubusercontent.com/rapidsai-community/tutorial/refs/heads/main/data_setup.py

In [None]:
# colab: uncomment next line to get the pageviews data set
#! python data_setup.py --pageviews

In [None]:
# Verify that you are running with an NVIDIA GPU
! nvidia-smi  # this should display information about available GPUs

The `pageviews_small.csv` file contains just over 1M records of pageview counts from Wikipedia in various languages.

The data we will use in this tutorial is too small to really benefit from GPU acceleration, but we will explore it 
anyway.

In [None]:
import cudf

In [None]:
pageviews = cudf.read_csv('data/pageviews_small.csv', sep=" ")
pageviews.head()


Let's rename the columns and drop the unused `x` column.

In [None]:
pageviews.columns = ['project', 'page', 'requests', 'x']

pageviews = pageviews.drop('x', axis=1)

pageviews

If we want to select only the records in English, we can do:

In [None]:
pageviews[pageviews.project == 'en']

**Exercise**: Find the number of English records in the dataset.

<details>
  <summary>Solution (click dropdown) </summary>
  <p>

```python
# to run this type it in a code cell
pageviews[pageviews.project == 'en'].count()
```
  </p>
</details>


In [None]:
# your solution here


We can group by `project` and get a count of the pages by language:

In [None]:
grouped_pageviews = pageviews.groupby('project').count().reset_index()
grouped_pageviews

**Exercise**: Get `grouped_pageviews` sorted by page count in descending order.
_Hint_: Check the [cudf docs](https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.dataframe.sort_values/)

<details>
  <summary>Solution (click dropdown) </summary>
  <p>

```python
# to run this type it in a code cell
grouped_pageviews.sort_values('page', ascending=False)
```
  </p>
</details>

In [None]:
# your solution here

We can also take a look at the result for English, French, Chinese and Spanish.

In [None]:
grouped_pageviews[grouped_pageviews.project.isin(['en', 'fr', 'zh', 'es'])]


If you are a `pandas` user this syntax should be very familiar to you. These are only 
a few examples of a large portion of the `pandas` API that is implemented in `cuDF`. 

The only difference is that all the operations we have run so far are running on the GPU. 
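
As a compact recap of the steps above, here is a minimal sketch that runs the same group, sort, and filter sequence on a tiny in-memory frame. The sample rows are made up for illustration, and the fallback to `pandas` when `cudf` is not installed is an assumption added here so the snippet can run on a machine without a GPU; it relies only on calls already used in this notebook.

```python
try:
    import cudf as xdf  # GPU path, as used in this notebook
except ImportError:
    import pandas as xdf  # CPU fallback, assumed here only for illustration

# Hypothetical sample rows with the same columns as the pageviews data
df = xdf.DataFrame({
    "project": ["en", "en", "en", "fr", "fr", "es"],
    "page": ["A", "B", "C", "D", "E", "F"],
    "requests": [10, 3, 7, 2, 8, 1],
})

# Count pages per project, sort by count, then keep two languages
counts = df.groupby("project").count().reset_index()
top = counts.sort_values("page", ascending=False)
selected = top[top.project.isin(["en", "fr"])]

# Copy the result to host memory when it is a cuDF object
selected_host = selected.to_pandas() if hasattr(selected, "to_pandas") else selected
print(selected_host["project"].tolist(), selected_host["page"].tolist())
```

The `to_pandas()` call at the end is the usual way to bring a small cuDF result back to the CPU for printing or plotting.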

### Strings

`cuDF` string operations are accelerated with specialized kernels. This means that 
operations like capitalizing strings can be parallelized on the GPU. 

In [None]:
pageviews[pageviews.project == 'en'].page.str.upper()

In [None]:
pageviews[pageviews.project == 'en'].page.str.replace('_', ' ')
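
String methods can also be chained, just as in `pandas`. The sketch below uses made-up page titles and, as an assumption added here, falls back to `pandas` when `cudf` is not importable so that it runs anywhere. Passing `regex=False` makes the replacement a literal match in either library.

```python
try:
    import cudf as xdf  # GPU path, as used in this notebook
except ImportError:
    import pandas as xdf  # CPU fallback, assumed here only for illustration

# Hypothetical page titles in the style of the pageviews data
pages = xdf.Series(["Main_Page", "Albert_Einstein", "GPU"])

# Replace underscores literally, then upper-case every title
cleaned = pages.str.replace("_", " ", regex=False).str.upper()

# Copy the result to host memory when it is a cuDF object
cleaned_host = cleaned.to_pandas() if hasattr(cleaned, "to_pandas") else cleaned
print(list(cleaned_host))  # ['MAIN PAGE', 'ALBERT EINSTEIN', 'GPU']
```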

### UDFs

`cuDF` also supports user-defined functions (UDFs) that can be mapped over a Series or DataFrame in parallel on the GPU.

UDFs can be defined as pure Python functions that take a single value. These functions are compiled by Numba at runtime into 
GPU-executable code when `.apply()` is called.

In [None]:
def udf(x):
    if x < 5:
        return 0
    return x

In [None]:
pageviews.requests = pageviews.requests.apply(udf)
pageviews.requests
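
To see what the UDF computes for each element, it can help to run the same function on the CPU over a plain Python list. This is only a reference sketch with made-up request counts; it does not involve Numba or the GPU.

```python
def udf(x):
    # Same rule as the notebook's UDF: values below 5 become 0
    if x < 5:
        return 0
    return x

# Hypothetical request counts
requests = [1, 4, 5, 12, 0, 7]

# Apply the function one element at a time, as .apply() does conceptually
clipped = [udf(value) for value in requests]
print(clipped)  # [0, 0, 5, 12, 0, 7]
```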

We can apply more than one filter too:

In [None]:
(pageviews[(pageviews.requests != 0) & (pageviews.project == 'en')]
 .sort_values('requests', ascending=False))

### Rolling windows

In `cuDF` you can also apply kernels over rolling windows. 

In [None]:
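def neighborhood_mean(window):
    # Average the values that fall inside the current window
    c = 0
    for val in window:
        c += val
    return c / len(window)

In [None]:
# Window of 3 rows, at least 1 observation required, centered on each row
pageviews.requests.rolling(window=3, min_periods=1, center=True).apply(neighborhood_mean)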
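
To make the window arguments concrete, here is a pure-Python reference for a centered rolling mean with a window of 3 and a minimum of 1 observation. It is a sketch assuming pandas-style semantics for odd window sizes, where windows are truncated at the edges of the data. The function name and sample values are hypothetical, and cuDF's exact boundary handling should be checked against its documentation.

```python
def centered_rolling_mean(values, window=3, min_periods=1):
    # Assumes an odd window size so the window is symmetric around each row
    half = (window - 1) // 2
    result = []
    for i in range(len(values)):
        # Clip the window to the bounds of the data
        start = max(0, i - half)
        stop = min(len(values), i + half + 1)
        chunk = values[start:stop]
        if len(chunk) >= min_periods:
            result.append(sum(chunk) / len(chunk))
        else:
            result.append(None)  # not enough observations in this window
    return result

print(centered_rolling_mean([1, 2, 3, 4, 5]))  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

The first and last rows average only two values because their windows extend past the ends of the data.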

## Conclusion

In this notebook, we explored the basics of cuDF, RAPIDS' GPU-accelerated DataFrame library. We learned how to:

- Load and manipulate data with cuDF DataFrames
- Filter and sort data efficiently
- Apply custom functions using UDFs
- Work with rolling windows

To learn more, check out the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable/).

In the next notebook, we will learn about the `cudf.pandas` accelerator and how to speed up your pandas
code with zero code changes.

[Colab Next Notebook: 2 cudf.pandas →](https://colab.research.google.com/github/rapidsai-community/tutorial/blob/main/2.cudf_pandas.ipynb)

