# RAPIDS cuDF's pandas accelerator mode (cudf.pandas)

<img src="https://raw.githubusercontent.com/rapidsai-community/tutorial/refs/heads/main/images/cudf-pandas-exec-flow.png" style="float: right; margin-left: 5px; width: 250px;">

In the previous notebook we learned how `cuDF` helps us leverage the GPU while keeping a familiar pandas API. But `cuDF` 
also has a zero-code-change feature that can help you get better performance from your existing pandas code.

`cuDF` provides a pandas accelerator mode (`cudf.pandas`), allowing you to bring accelerated computing to your pandas 
workflows without requiring any code change.


## Why should I use `cudf.pandas`?

- Requires no changes to existing pandas code. Just use one of:
    - `%load_ext cudf.pandas` (in a notebook, before importing pandas)
    - `$ python -m cudf.pandas <script.py>` (on the command line)
- Supports 100% of the pandas API
- Accelerates workflows up to [150x using the GPU](https://developer.nvidia.com/blog/rapids-cudf-accelerates-pandas-nearly-150x-with-zero-code-changes/)
- Compatible with code that uses third-party libraries
- Falls back to using pandas on the CPU for unsupported functions and methods

**Attribution:** This section of the tutorial is based on the `cudf.pandas` [quickstart notebook](https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/getting_started_tutorials/cudf_pandas_colab_demo.ipynb?ncid=ref-inor-554580) from the RAPIDS documentation.

### Data 

The data we'll be working with is the [Parking Violations Issued - Fiscal Year 2022](https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2022/7mxj-7a6y) 
dataset from NYC Open Data.

The dataset was downloaded during the setup step in the welcome and setup notebook, and it is a copy of the original dataset. 
The only differences are that it is hosted by NVIDIA on an S3 bucket, to provide faster download speeds, and that it is in `.parquet` format.

If you are running this locally, and you followed the steps in the [0.Welcome_and_Setup.ipynb](https://github.com/rapidsai-community/tutorial/blob/main/0.Welcome_and_Setup.ipynb) notebook, you should have the `/data` folder ready to go. 

#### Google Colab Instructions

In the next step we download a script that will allow you to get the data for this notebook session.

In [None]:

# colab: uncomment next line to get the data setup script
#! wget https://raw.githubusercontent.com/rapidsai-community/tutorial/refs/heads/main/data_setup.py

In [None]:
# colab: uncomment next line to get the NYC parking violations data set
#! python data_setup.py --nyc-parking 

In [None]:
# Verify that you are running with an NVIDIA GPU
! nvidia-smi  # this should display information about available GPUs

In [1]:
%load_ext cudf.pandas
import pandas as pd

In [None]:
# read some columns of the dataset
df = pd.read_parquet(
    "data/nyc_parking_violations_2022.parquet",
    columns=[
        "Registration State",
        "Violation Code", 
        "Vehicle Body Type",
        "Vehicle Make",
        "Violation Time",
        "Violation County", 
        "Vehicle Year",
        "Violation Description",
        "Issue Date",
        "Summons Number",
    ],
)

# view the first 5 rows:
df.head()

## Parking violations by Registration state 

Each record in our dataset contains the state of registration of the offending vehicle, and the type of parking violation. 
To get the most common type of violation for vehicles registered in different states, we use [value_counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) and [GroupBy.head](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.head.html):

In [None]:
%%time

(
    df[["Registration State", "Violation Description"]]  # get only these two columns
    .value_counts()  # get the count of violations per state and per type of offence
    .groupby("Registration State")  # group by state
    .head(1)  # get the first row in each group (the type of violation with the largest count)
    .sort_index()  # sort by state name
    .reset_index()
)

The code above uses [method chaining](https://tomaugspurger.net/posts/method-chaining/) to combine a series of operations 
into a single statement. You might find it useful to break the code up into multiple statements and inspect each of the intermediate results.
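
As a minimal sketch of that suggestion, the same chain can be split into named steps. The toy DataFrame and the variable names `toy`, `pair_counts`, and `top_per_state` are made up for illustration; with the real data you would start from `df` instead.

```python
import pandas as pd

# Toy data standing in for the two columns used above (values are made up).
toy = pd.DataFrame(
    {
        "Registration State": ["NY", "NY", "NY", "NJ", "NJ", "CT"],
        "Violation Description": [
            "NO PARKING",
            "NO PARKING",
            "EXPIRED METER",
            "EXPIRED METER",
            "EXPIRED METER",
            "NO PARKING",
        ],
    }
)

# Step 1: count each (state, violation) pair. The result is a Series with a
# two-level index, sorted by count in descending order.
pair_counts = toy.value_counts()
print(pair_counts)

# Step 2: keep the first (largest-count) row within each state.
top_per_state = pair_counts.groupby("Registration State").head(1)
print(top_per_state)

# Step 3: sort by state and turn the index levels back into columns.
result = top_per_state.sort_index().reset_index()
print(result)
```

Printing each intermediate result makes it easier to see why `head(1)` returns the most common violation: `value_counts` has already sorted the rows by count.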

## What types of vehicle are most frequently involved in parking violations?

In [None]:
%%time

(
    df.groupby(["Vehicle Body Type"])
    .agg({"Summons Number": "count"})
    .rename(columns={"Summons Number": "Count"})
    .sort_values(["Count"], ascending=False)
)

The body type codes are defined in the [Vehicle Body Type dictionary](https://data.ny.gov/api/assets/83055271-29A6-4ED4-9374-E159F30DB5AE) from the 
NYC Parking Data:

- SUBN: SUBURBAN
- 4DSD: FOUR-DOOR SEDAN
- VAN: VAN TRUCK
- DELV: DELIVERY TRUCK
- PICK: PICK-UP TRUCK

**Exercise:** Get the top 5 vehicle makes by number of parking violations. 

<details>
  <summary>Solution (click dropdown) </summary>
  <p>

```python
# to run this type it in a code cell
(df
 .groupby(["Vehicle Make"])
 .agg({"Summons Number": "count"})
 .rename(columns={"Summons Number": "Count"})
 .sort_values(["Count"], ascending=False)
 .head(5)
)
```
  </p>
</details>


In [6]:
## your solution here

## Day of the week when most parking violations occur

In [None]:
%%time
weekday_names = {
    0: "Monday",
    1: "Tuesday", 
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)

df.groupby(["issue_weekday"])["Summons Number"].count().sort_values(ascending=False)
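
As an aside, pandas also offers `Series.dt.day_name()`, which returns the weekday name directly. The sketch below compares it with the dictionary mapping used above on a few made-up dates; it is an alternative shown for illustration, not the approach this notebook relies on.

```python
import pandas as pd

# Made-up dates for illustration.
dates = pd.Series(pd.to_datetime(["2022-07-04", "2022-07-09", "2022-07-04"]))

weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

# Approach used in the cell above: integer weekday (Monday=0) mapped to a name.
mapped = dates.dt.weekday.map(weekday_names)

# Built-in alternative: ask pandas for the day name directly.
named = dates.dt.day_name()

# Both approaches should agree element by element.
print(list(mapped) == list(named))
```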

## What is the county where most of the parking violations happen? 

In [None]:
(
    df.groupby("Violation County")
    .size()
    .sort_values(ascending=False)
    .head(10)
)

**Exercise:** Find the top 5 most common parking violations for vehicles that are either SUVs (Vehicle Body Type = "SUBN")
or pickup trucks (Vehicle Body Type = "PICK"), but only for vehicles made after 2010, and show the count for each violation type.

<details>
  <summary>Solution (click dropdown) </summary>
  <p>

```python
# to run this type it in a code cell

# Filter for SUVs and pickup trucks made after 2010
recent_suv_pickup = df[
    (df["Vehicle Body Type"].isin(["SUBN", "PICK"])) & 
    (df["Vehicle Year"] > 2010)
]

# Group by violation type and count, then get top 5
(
    recent_suv_pickup
    .groupby("Violation Description")
    .size()
    .sort_values(ascending=False)
    .head(5)
    .rename("Number of Violations")
)
```
  </p>
</details>

In [None]:
# your solution here

## Using third-party libraries with cudf.pandas

You can pass Pandas objects to third-party libraries when using `cudf.pandas`, just like you would when using regular Pandas.

Below, we show an example of using [plotly-express](https://plotly.com/python/plotly-express/) to visualize the data we've been processing.

### Visualizing which states have more pickup trucks relative to other vehicles

In [None]:
import plotly.express as px

df = df.rename(
    columns={
        "Registration State": "reg_state",
        "Vehicle Body Type": "vehicle_type",
    }
)

# vehicle counts per state:
counts = df.groupby("reg_state").size().sort_index()
# vehicles with type "PICK" (Pickup Truck)
pickup_counts = df.where(df["vehicle_type"] == "PICK").groupby("reg_state").size()
# percentage of pickup trucks by state:
pickup_frac = ((pickup_counts / counts) * 100).rename("% Pickup Trucks")
del pickup_frac["MB"]  # (Manitoba CA is a huge outlier!)

# plot the results:
pickup_frac = pickup_frac.reset_index()
px.choropleth(
    pickup_frac,
    locations="reg_state",
    color="% Pickup Trucks",
    locationmode="USA-states",
    scope="usa",
)

# Understanding Performance

`cudf.pandas` provides profiling utilities to help you better understand performance. With these tools, you can identify which parts of your code ran on the GPU and which parts ran on the CPU.

They're accessible in the `cudf.pandas` namespace since the `cudf.pandas` extension was loaded above with `%load_ext cudf.pandas`.

#### Colab Note
If you're running in Colab, the first time you use the profiler it may take 10+ seconds due to Colab's debugger interacting with the built-in Python function [sys.settrace](https://docs.python.org/3/library/sys.html#sys.settrace) that we use for profiling. For demo purposes, this isn't an issue. Just run the cell again.

## Profiling Functionality

We can generate a per-function profile with `%%cudf.pandas.profile` and a per-line profile with `%%cudf.pandas.line_profile`:

In [None]:
%%cudf.pandas.profile

small_df = pd.DataFrame({"a": ["0", "1", "2"], "b": ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])

axis = 0
for i in range(0, 2):
    small_df.min(axis=axis)
    axis = 1  # the second iteration uses axis=1

counts = small_df.groupby("a").b.count()

In [None]:
%%cudf.pandas.line_profile

small_df = pd.DataFrame({"a": ["0", "1", "2"], "b": ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])

axis = 0
for i in range(0, 2):
    small_df.min(axis=axis)
    axis = 1

counts = small_df.groupby("a").b.count()

## Behind the scenes: What's going on here?

When you load `cudf.pandas`, Pandas types like `Series` and `DataFrame` are replaced by proxy objects that dispatch 
operations to cuDF when possible. We can verify that `cudf.pandas` is active by looking at our `pd` variable:

In [None]:
pd

As a result, all pandas functions, methods, and created objects are proxies:

In [None]:
type(pd.read_csv)

Operations supported by cuDF will be **very** fast:

In [None]:
%%time
df.count(axis=0)

Operations not supported by cuDF will be slower, as they fall back to using pandas (copying data between the CPU and GPU 
under the hood as needed). For example, cuDF does not currently support `axis=1` in the `count` method. So 
this operation will run on the CPU and be noticeably slower than the previous one.

In [None]:
%%time
df.count(axis=1) # This will use pandas, because cuDF doesn't support axis=1 for the .count() method
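
Outside a notebook the `%%time` magic is not available. A minimal sketch of the same comparison for a script is shown below; the helper `timed` and the toy DataFrame are hypothetical names introduced here for illustration. Run as written it uses plain pandas on the CPU, so it only demonstrates the measurement pattern. The GPU versus fallback difference described above would only show up with `cudf.pandas` active and a much larger DataFrame.

```python
import time

import pandas as pd


def timed(label, func):
    """Run func once, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.2f} ms")
    return result


# Toy frame with made-up values; the real dataset is far larger.
toy = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", None]})

# Non-null values per column (the GPU-supported form in the text above).
per_column = timed("count(axis=0)", lambda: toy.count(axis=0))

# Non-null values per row (the form that falls back to pandas).
per_row = timed("count(axis=1)", lambda: toy.count(axis=1))
```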

## FAQ

### When should I use cuDF (direct import) versus cudf.pandas?

**Use cudf.pandas if:**
- You have existing pandas code and you want to run it on GPUs without changing it
- The ability to run the same code on GPU-enabled as well as CPU-only systems is important

**Use cuDF (direct import) if:**
- You want everything to run on GPU (CPU fallback is prohibitively expensive)
- You need functionality that cuDF provides but pandas does not
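
To illustrate the point about running the same code on GPU-enabled and CPU-only systems, here is a minimal sketch for a standalone script. It assumes cuDF is simply not installed on the CPU-only machine, so the import raises `ImportError`; the `accelerated` flag and the toy data are made up for illustration. `cudf.pandas.install()` is the programmatic counterpart of `%load_ext cudf.pandas` and has to be called before pandas is imported.

```python
# Enable the accelerator if cuDF is installed, otherwise run on plain pandas.
# This must happen before pandas is imported.
try:
    import cudf.pandas

    cudf.pandas.install()
    accelerated = True
except ImportError:
    # Assumption: on a CPU-only system cuDF is not installed at all.
    accelerated = False

import pandas as pd

# Toy data with made-up values; the pandas code is identical on both paths.
toy = pd.DataFrame({"state": ["NY", "NJ", "NY"], "fine": [65, 115, 35]})
totals = toy.groupby("state")["fine"].sum()

print(f"accelerated={accelerated}")
print(totals)
```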

### How do you ensure pandas compatibility?

- We run the entire pandas unit test suite with cudf.pandas enabled
    - ~94% of the tests pass; the remainder fail because of a few minor differences
- We turn on cuDF’s “pandas compatibility mode” (ensures result ordering matches pandas, etc.)

    ```python
    cudf.set_option("mode.pandas_compatible", True)
    ```

## Tips and Tricks

- Use the profiler to learn which functions run on the CPU and which run on the GPU (it doesn't report CPU<->GPU transfers)
- CPU fallback involves copying data between CPU and GPU, twice in the worst case
- Use GPU-supported operations as much as possible
- GPU memory is limited compared to CPU RAM
    - If you run out of GPU memory, execution will fall back to the CPU (**unexpected slowdown**)
    - Keep only the data that you need
    - Monitor GPU usage (only on Jupyter - NVDashboard)
- When possible, use idiomatic pandas and avoid UDFs (user-defined functions)
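
The last tip can be sketched with a toy example (column names and values are made up). Both versions compute the same result, but the first calls a Python function once per row while the second is a single column-wise operation. Whether a particular UDF falls back to the CPU depends on the operation and the cuDF version, so treat this as an illustration and use the profiler to confirm what happens in your own code.

```python
import pandas as pd

toy = pd.DataFrame({"fine": [65, 115, 35], "late_fee": [10, 0, 25]})

# UDF style: a Python function called once per row.
total_udf = toy.apply(lambda row: row["fine"] + row["late_fee"], axis=1)

# Idiomatic style: one vectorized column operation.
total_vectorized = toy["fine"] + toy["late_fee"]

# Both approaches produce the same values.
print(list(total_udf) == list(total_vectorized))
```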


## Conclusion

In this notebook, we learned:
- How to use cudf.pandas to accelerate pandas code on GPUs without code changes
- The differences between direct cuDF import and cudf.pandas
- When operations fall back to CPU and the performance implications
- Best practices for using cudf.pandas effectively

With `cudf.pandas`, you can continue using pandas as your primary dataframe library. When things start to get a little 
slow, just load `cudf.pandas` and run your existing code on a GPU!

To learn more, we encourage you to visit [rapids.ai/cudf-pandas](https://rapids.ai/cudf-pandas).

In the next notebook, we will learn about the cuDF polars engine.

[Next Notebook: 3 cudf polars →](https://colab.research.google.com/github/rapidsai-community/tutorial/blob/main/3.cudf_polars_engine.ipynb)


