# 10 Minutes to cuDF's pandas accelerator mode (cudf.pandas)

cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a pandas-style DataFrame API.

As of the v23.10 release, cuDF provides a pandas accelerator mode (`cudf.pandas`) that brings accelerated computing to your pandas workflows without requiring any code changes.

This notebook is a short introduction to `cudf.pandas`.

# ⚠️ Verify your setup

First, we'll verify that you are running with an NVIDIA GPU and that cuDF is available.

If you haven't installed cuDF, please visit https://rapids.ai/#quick-start to choose your favorite installation method.

In [None]:
!nvidia-smi  # this should display information about available GPUs

Sat Jan 20 06:59:15 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P0              26W /  70W |      2MiB / 15360MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                         

With our GPU-enabled Databricks runtime active, we'll now install cuDF.

If you're interested in installing on other platforms, please visit https://rapids.ai/#quick-start to learn more.

In [None]:
!pip install --extra-index-url=https://pypi.nvidia.com cudf-cu11

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cudf-cu11
  Downloading https://pypi.nvidia.com/cudf-cu11/cudf_cu11-23.12.1-cp310-cp310-manylinux_2_28_x86_64.whl (506.4 MB)

In [None]:
import cudf  # this should work without any errors
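The same checks can be scripted, which is handy when a notebook runs unattended. The sketch below is a minimal illustration that uses only the standard library. The helper name `check_setup` is ours and is not part of cuDF. It only reports whether the `nvidia-smi` executable and the `cudf` package can be located; it does not confirm that the GPU is actually usable.

```python
import importlib.util
import shutil


def check_setup():
    """Report whether nvidia-smi and the cudf package can be located.

    Hypothetical helper for illustration; it never touches the GPU.
    """
    return {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "cudf_importable": importlib.util.find_spec("cudf") is not None,
    }


status = check_setup()
for name, ok in status.items():
    print(f"{name}: {'yes' if ok else 'no'}")
```

If either entry reports `no`, revisit the installation instructions linked above before continuing.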

We'll also install `plotly-express` for visualizing data.

In [None]:
!pip install --upgrade pip plotly_express==0.4.1 nbformat

Collecting pip
  Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
Collecting plotly_express==0.4.1
  Downloading plotly_express-0.4.1-py2.py3-none-any.whl (2.9 kB)
Collecting nbformat
  Downloading nbformat-5.9

# Download the data

The data we'll be working with is the [Parking Violations Issued - Fiscal Year 2022](https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2022/7mxj-7a6y) dataset from NYC Open Data.

To speed up the download, we'll fetch a copy of this dataset from an S3 bucket hosted by NVIDIA. This should take about 30 seconds.

## Data License and Terms
As this dataset originates from the NYC Open Data Portal, it's governed by their license and terms of use.

### Are there restrictions on how I can use Open Data?

> Open Data belongs to all New Yorkers. There are no restrictions on the use of Open Data. Refer to Terms of Use for more information.

### [Terms of Use](https://opendata.cityofnewyork.us/overview/#termsofuse)

> By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.

> The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.

> Submitting City Agencies are the authoritative source of data available on NYC Open Data. These entities are responsible for data quality and retain version control of data sets and feeds accessed on the Site. Data may be updated, corrected, or refreshed at any time.

In [None]:
!wget https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet

--2024-01-20 07:02:09--  https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet
Resolving data.rapids.ai (data.rapids.ai)... 204.246.191.16, 204.246.191.103, 204.246.191.75, ...
Connecting to data.rapids.ai (data.rapids.ai)|204.246.191.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 477161608 (455M) [binary/octet-stream]
Saving to: ‘nyc_parking_violations_2022.parquet’


2024-01-20 07:02:25 (29.2 MB/s) - ‘nyc_parking_violations_2022.parquet’ saved [477161608/477161608]
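Before moving on, it can be worth confirming that the download was not cut short. The sketch below is one possible check, assuming the file size should match the `Length` that `wget` reported above (477161608 bytes). The helper name `download_is_complete` is hypothetical.

```python
import os

EXPECTED_BYTES = 477161608  # "Length" reported by wget in the output above


def download_is_complete(path, expected_bytes=EXPECTED_BYTES):
    """Return True when `path` exists and has exactly the expected size."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


print(download_is_complete("nyc_parking_violations_2022.parquet"))
```

A size check only catches truncated downloads; it says nothing about whether the contents are valid Parquet.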



# Analysis using Standard Pandas

First, let's use Pandas to read in some columns of the dataset:

In [None]:
import pandas as pd



In [None]:
# read five columns of the data:
df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=[
        "Registration State",
        "Violation Description",
        "Vehicle Body Type",
        "Issue Date",
        "Summons Number",
    ],
)

# view a random sample of 10 rows:
df.sample(10)

| | Registration State | Violation Description | Vehicle Body Type | Issue Date | Summons Number |
|---|---|---|---|---|---|
| 3617922 | FL | 20A-No Parking (Non-COM) | 4DSD | 09/17/2021 | 8856492490 |
| 2676451 | NY | 46B-Double Parking (Com-100Ft) | VAN | 08/04/2021 | 8963434515 |
| 6750213 | NY | PHTO SCHOOL ZN SPEED VIOLATION | SUBN | 12/07/2021 | 4759508480 |
| 8116279 | NY | 21-No Parking (street clean) | 4DSD | 12/28/2021 | 8846568280 |
| 7593706 | NY | 38-Failure to Dsplay Meter Rec | SUBN | 12/16/2021 | 8885932939 |
| 9800674 | NY | 19-No Stand (bus stop) | SUBN | 01/27/2022 | 8894054690 |
| 2775199 | NY | 21-No Parking (street clean) | SUBN | 08/12/2021 | 8983980837 |
| 3879001 | NY | 50-Crosswalk | 4DSD | 09/16/2021 | 8954009300 |
| 6563990 | NY | 16A-No Std (Com Veh) Non-COM | 4DSD | 10/27/2021 | 8965979067 |
| 2605823 | NY | 48-Bike Lane | VAN | 08/03/2021 | 8949986000 |


Next, we'll try to answer a few questions using the data.

## Which parking violation is most commonly committed by vehicles from various U.S. states?

Each record in our dataset contains the state of registration of the offending vehicle, and the type of parking offence. Let's say we want to get the most common type of offence for vehicles registered in different states. We can do this in Pandas using a combination of [value_counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) and [GroupBy.head](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.head.html):

In [None]:
(
    df[["Registration State", "Violation Description"]]  # get only these two columns
    .value_counts()  # get the count of offences per state and per type of offence
    .groupby("Registration State")  # group by state
    .head(
        1
    )  # get the first row in each group (the type of offence with the largest count)
    .sort_index()  # sort by state name
    .reset_index()
)

| | Registration State | Violation Description | 0 |
|---|---|---|---|
| 0 | 99 | 74-Missing Display Plate | 835 |
| 1 | AB | 14-No Standing | 22 |
| 2 | AK | PHTO SCHOOL ZN SPEED VIOLATION | 125 |
| 3 | AL | PHTO SCHOOL ZN SPEED VIOLATION | 3668 |
| 4 | AR | PHTO SCHOOL ZN SPEED VIOLATION | 537 |
| ... | ... | ... | ... |
| 60 | VT | PHTO SCHOOL ZN SPEED VIOLATION | 3024 |
| 61 | WA | 21-No Parking (street clean) | 3732 |
| 62 | WI | 14-No Standing | 1639 |
| 63 | WV | PHTO SCHOOL ZN SPEED VIOLATION | 1185 |


The code above uses [method chaining](https://tomaugspurger.net/posts/method-chaining/) to combine a series of operations into a single statement. You might find it useful to break the code up into multiple statements and inspect each of the intermediate results!
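As a minimal sketch of that suggestion, the snippet below runs the same chain one step at a time on a tiny made-up DataFrame (the `toy` data and the intermediate variable names are ours, not part of the dataset), so that each intermediate result can be printed and inspected.

```python
import pandas as pd

# Tiny made-up stand-in for the two columns used above
toy = pd.DataFrame(
    {
        "Registration State": ["NY", "NY", "NY", "NJ", "NJ", "NJ"],
        "Violation Description": ["A", "A", "B", "C", "C", "C"],
    }
)

# Step 1: count each (state, violation) pair; results are sorted by count, descending
pair_counts = toy.value_counts()
print(pair_counts)

# Step 2: keep the first (largest-count) row within each state
top_per_state = pair_counts.groupby("Registration State").head(1)
print(top_per_state)

# Step 3: sort by state and turn the index levels back into columns
result = top_per_state.sort_index().reset_index()
print(result)
```

Step 2 relies on `value_counts` having already sorted the pairs by count, which is why `head(1)` picks the most common violation in each state.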

## Which vehicle body types are most frequently involved in parking violations?

We can also investigate which vehicle body types most commonly appear in parking violations:

In [None]:
(
    df.groupby(["Vehicle Body Type"])
    .agg({"Summons Number": "count"})
    .rename(columns={"Summons Number": "Count"})
    .sort_values(["Count"], ascending=False)
)

| Vehicle Body Type | Count |
|---|---|
| SUBN | 6449007 |
| 4DSD | 4402991 |
| VAN | 1317899 |
| DELV | 436430 |
| PICK | 429798 |
| ... | ... |
| CARY | 1 |
| ISUZ | 1 |
| IXMR | 1 |
| BILB | 1 |
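pandas also offers named aggregation, which folds the `agg` and `rename` steps into a single call. The sketch below compares the two spellings on a small made-up DataFrame (the `toy` data is ours). It is an alternative shown for illustration, not the form used in the rest of this notebook.

```python
import pandas as pd

# Small made-up stand-in for the two columns used above
toy = pd.DataFrame(
    {
        "Vehicle Body Type": ["SUBN", "SUBN", "4DSD", "VAN", "SUBN", "4DSD"],
        "Summons Number": [1, 2, 3, 4, 5, 6],
    }
)

# The chain used in this notebook: aggregate, then rename the column
chained = (
    toy.groupby(["Vehicle Body Type"])
    .agg({"Summons Number": "count"})
    .rename(columns={"Summons Number": "Count"})
    .sort_values(["Count"], ascending=False)
)

# Named aggregation: the output column name is given directly as a keyword
named = (
    toy.groupby("Vehicle Body Type")
    .agg(Count=("Summons Number", "count"))
    .sort_values("Count", ascending=False)
)

print(chained)
print(named)
```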


## How do parking violations vary across days of the week?

In [None]:
weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)

df.groupby(["issue_weekday"])["Summons Number"].count().sort_values()

Out[10]: issue_weekday
Sunday        462992
Saturday     1108385
Monday       2488563
Wednesday    2760088
Tuesday      2809949
Friday       2891679
Thursday     2913951
Name: Summons Number, dtype: int64

It looks like there are fewer violations on weekends, which makes sense! During the week, more people are driving in New York City.
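The explicit `weekday_names` dictionary makes the mapping easy to read. pandas also provides `Series.dt.day_name()`, which returns the same English names with the default locale. The sketch below checks that the two approaches agree on three made-up dates. It is an illustration only, and we make no claim about which form runs faster.

```python
import pandas as pd

weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

# Three made-up issue dates
dates = pd.Series(pd.to_datetime(["2021-09-17", "2021-12-28", "2022-01-27"]))

mapped = dates.dt.weekday.map(weekday_names)  # approach used in this notebook
named = dates.dt.day_name()  # built-in alternative

print(mapped.tolist() == named.tolist())
```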

## Let's time it!

Loading and processing this data took a little time. Let's measure how long these pipelines take in Pandas:

In [None]:
%%time

import pandas as pd

# read five columns of the data:
df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=[
        "Registration State",
        "Violation Description",
        "Vehicle Body Type",
        "Issue Date",
        "Summons Number",
    ],
)

# Which parking violation is most commonly committed by vehicles from various U.S. states?
(
    df[["Registration State", "Violation Description"]]
    .value_counts()
    .groupby("Registration State")
    .head(1)
    .sort_index()
    .reset_index()
)

CPU times: user 10.9 s, sys: 3.02 s, total: 13.9 s
Wall time: 11.2 s


| | Registration State | Violation Description | 0 |
|---|---|---|---|
| 0 | 99 | 74-Missing Display Plate | 835 |
| 1 | AB | 14-No Standing | 22 |
| 2 | AK | PHTO SCHOOL ZN SPEED VIOLATION | 125 |
| 3 | AL | PHTO SCHOOL ZN SPEED VIOLATION | 3668 |
| 4 | AR | PHTO SCHOOL ZN SPEED VIOLATION | 537 |
| ... | ... | ... | ... |
| 60 | VT | PHTO SCHOOL ZN SPEED VIOLATION | 3024 |
| 61 | WA | 21-No Parking (street clean) | 3732 |
| 62 | WI | 14-No Standing | 1639 |
| 63 | WV | PHTO SCHOOL ZN SPEED VIOLATION | 1185 |


In [None]:
%%time

# Which vehicle body types are most frequently involved in parking violations?
(
    df.groupby(["Vehicle Body Type"])
    .agg({"Summons Number": "count"})
    .rename(columns={"Summons Number": "Count"})
    .sort_values(["Count"], ascending=False)
)

CPU times: user 2.48 s, sys: 173 ms, total: 2.65 s
Wall time: 2.64 s


| Vehicle Body Type | Count |
|---|---|
| SUBN | 6449007 |
| 4DSD | 4402991 |
| VAN | 1317899 |
| DELV | 436430 |
| PICK | 429798 |
| ... | ... |
| CARY | 1 |
| ISUZ | 1 |
| IXMR | 1 |
| BILB | 1 |


In [None]:
%%time

# How do parking violations vary across days of the week?
weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)

df.groupby(["issue_weekday"])["Summons Number"].count().sort_values()

CPU times: user 5.44 s, sys: 725 ms, total: 6.17 s
Wall time: 6.15 s
Out[13]: issue_weekday
Sunday        462992
Saturday     1108385
Monday       2488563
Wednesday    2760088
Tuesday      2809949
Friday       2891679
Thursday     2913951
Name: Summons Number, dtype: int64
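The `%%time` magic only exists inside IPython and Jupyter. In a plain Python script, a similar wall-time measurement can be sketched with `time.perf_counter` from the standard library. The `wall_timer` helper below is a hypothetical name of ours, and the timed workload is a placeholder rather than the parking pipeline.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_timer(label, results):
    """Store the wall-clock seconds spent inside the `with` block in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with wall_timer("sum of squares", timings):
    # Placeholder workload; swap in a pandas pipeline when using this for real
    total = sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

Unlike `%%time`, this records wall time only and does not separate user and system CPU time.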

# Using cuDF's pandas accelerator mode (cudf.pandas)

Now, let's re-run the Pandas code above with the `cudf.pandas` extension loaded.

Typically, you should load the `cudf.pandas` extension as the first step in your notebook, before importing any modules. Here, we explicitly restart the kernel to simulate that behavior.

In [None]:
get_ipython().kernel.do_shutdown(restart=True)

Out[14]: {'status': 'ok', 'restart': True}

In [None]:
!pip install --extra-index-url=https://pypi.nvidia.com cudf-cu11

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cudf-cu11
  Using cached https://pypi.nvidia.com/cudf-cu11/cudf_cu11-23.12.1-cp310-cp310-manylinux_2_28_x86_64.whl (506.4 MB)
Collecting rich
  Using cached rich-13.7.0-py3-none-any.whl (240 kB)
Collecting fsspec>=0.6.0
  Using cached fsspec-2023.12.2-py3-none-any.whl (168 kB)
Collecting ptxcompiler-cu11
  Using cached https://pypi.nvidia.com/ptxcompiler-cu11/ptxcompiler_cu11-0.7.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.8 MB)
Collecting pyarrow<15.0.0a0,>=14.0.1
  Using cached pyarrow-14.0.2-cp310-cp310-manylinux_2_28_x86_64.whl (38.0 MB)
Collecting cubinlinker-cu11
  Using cached https://pypi.nvidia.com/cubinlinker-cu11/cubinlinker_cu11-0.3.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.8 MB)
Collecting numba<0.58,>=0.57
  Using cached numba-0.57.1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.6 MB)
Collecting nvtx>=0.2.1
 

In [None]:
!pip install --upgrade pip plotly_express==0.4.1 nbformat

Collecting pip
  Using cached pip-23.3.2-py3-none-any.whl (2.1 MB)
Collecting plotly_express==0.4.1
  Using cached plotly_express-0.4.1-py2.py3-none-any.whl (2.9 kB)
Collecting nbformat
  Using cached nbformat-5.9.2-py3-none-any.whl (77 kB)
Collecting patsy>=0.5
  Using cached patsy-0.5.6-py2.py3-none-any.whl (233 kB)
Collecting statsmodels>=0.9.0
  Using cached statsmodels-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.8 MB)
Collecting scipy>=0.18
  Using cached scipy-1.11.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.4 MB)
Collecting plotly>=4.1.0
  Using cached plotly-5.18.0-py3-none-any.whl (15.6 MB)
Collecting jsonschema>=2.6
  Using cached jsonschema-4.21.1-py3-none-any.whl (85 kB)
Collecting fastjsonschema
  Using cached fastjsonschema-2.19.1-py3-none-any.whl (23 kB)
Collecting attrs>=22.2.0
  Using cached attrs-23.2.0-py3-none-any.whl (60 kB)
Collecting referencing>=0.28.4
  Using cached referencing-0.32.1-py3-none-

In [None]:
%load_ext cudf.pandas
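Outside a notebook there is no `%load_ext` magic. The cuDF documentation describes two alternatives: running a script with `python -m cudf.pandas script.py`, or calling `cudf.pandas.install()` before pandas is imported. The sketch below uses the second form, wrapped in a guard of our own (the `accelerated` flag and the `demo` data are hypothetical) so the script still runs with plain pandas when cuDF is not installed. It catches `ImportError` only, so other failures, such as a missing GPU, would still surface as errors.

```python
try:
    import cudf.pandas  # only importable when cuDF is installed

    cudf.pandas.install()  # must be called before pandas is imported
    accelerated = True
except ImportError:
    # Assumption: falling back to CPU pandas is acceptable for this script
    accelerated = False

import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3]})
print(f"accelerated={accelerated}, total={demo['a'].sum()}")
```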

In [None]:
%%time

import pandas as pd

# read five columns of the data:
df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=[
        "Registration State",
        "Violation Description",
        "Vehicle Body Type",
        "Issue Date",
        "Summons Number",
    ],
)

# Which parking violation is most commonly committed by vehicles from various U.S. states?
(
    df[["Registration State", "Violation Description"]]
    .value_counts()
    .groupby("Registration State")
    .head(1)
    .sort_index()
    .reset_index()
)



CPU times: user 483 ms, sys: 254 ms, total: 737 ms
Wall time: 872 ms


| | Registration State | Violation Description | 0 |
|---|---|---|---|
| 0 | 99 | 74-Missing Display Plate | 835 |
| 1 | AB | 14-No Standing | 22 |
| 2 | AK | PHTO SCHOOL ZN SPEED VIOLATION | 125 |
| 3 | AL | PHTO SCHOOL ZN SPEED VIOLATION | 3668 |
| 4 | AR | PHTO SCHOOL ZN SPEED VIOLATION | 537 |
| ... | ... | ... | ... |
| 60 | VT | PHTO SCHOOL ZN SPEED VIOLATION | 3024 |
| 61 | WA | 21-No Parking (street clean) | 3732 |
| 62 | WI | 14-No Standing | 1639 |
| 63 | WV | PHTO SCHOOL ZN SPEED VIOLATION | 1185 |


In [None]:
%%time

# Which vehicle body types are most frequently involved in parking violations?
(
    df.groupby(["Vehicle Body Type"])
    .agg({"Summons Number": "count"})
    .rename(columns={"Summons Number": "Count"})
    .sort_values(["Count"], ascending=False)
)

CPU times: user 38.9 ms, sys: 7.33 ms, total: 46.3 ms
Wall time: 45.8 ms


| Vehicle Body Type | Count |
|---|---|
| SUBN | 6449007 |
| 4DSD | 4402991 |
| VAN | 1317899 |
| DELV | 436430 |
| PICK | 429798 |
| ... | ... |
| YANT | 1 |
| YBSD | 1 |
| YEL | 1 |
| YL | 1 |


In [None]:
%%time

# How do parking violations vary across days of the week?
weekday_names = {
    0: "Monday",
    1: "Tuesday",
    2: "Wednesday",
    3: "Thursday",
    4: "Friday",
    5: "Saturday",
    6: "Sunday",
}

df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)

df.groupby(["issue_weekday"])["Summons Number"].count().sort_values()

CPU times: user 181 ms, sys: 55.6 ms, total: 236 ms
Wall time: 237 ms
Out[6]: issue_weekday
Sunday        462992
Saturday     1108385
Monday       2488563
Wednesday    2760088
Tuesday      2809949
Friday       2891679
Thursday     2913951
Name: Summons Number, dtype: int64

Much faster! Pipelines that took roughly 3 to 11 seconds with pandas on the CPU (wall times of 11.2 s, 2.64 s, and 6.15 s) now finish in under a second (872 ms, 45.8 ms, and 237 ms), without changing any code.
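To put numbers on the comparison, the short calculation below divides the pandas wall times reported above by the corresponding `cudf.pandas` wall times. The figures are copied from the `%%time` outputs in this notebook; the dictionary labels are ours. The ratios come out at roughly 13x, 58x, and 26x for this particular run, and will differ on other hardware.

```python
# Wall times in seconds, copied from the %%time outputs above
pandas_wall = {
    "most common violation per state": 11.2,
    "vehicle body types": 2.64,
    "violations by weekday": 6.15,
}
cudf_pandas_wall = {
    "most common violation per state": 0.872,
    "vehicle body types": 0.0458,
    "violations by weekday": 0.237,
}

# Speedup = CPU wall time divided by accelerated wall time
speedups = {name: pandas_wall[name] / cudf_pandas_wall[name] for name in pandas_wall}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

Note that the first pipeline's timing includes reading the Parquet file, so it measures I/O as well as compute.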

# Understanding Performance

cuDF's pandas accelerator mode provides profiling utilities to help you better understand performance. With these tools, you can identify which parts of your code ran on the GPU and which parts ran on the CPU.

They're available under the `cudf.pandas` namespace (as the cell magics `%%cudf.pandas.profile` and `%%cudf.pandas.line_profile`) because the extension was loaded above with `%load_ext cudf.pandas`.

#### Colab Note
If you're running in Colab, the first time you use the profiler it may take 10+ seconds due to Colab's debugger interacting with the built-in Python function [sys.settrace](https://docs.python.org/3/library/sys.html#sys.settrace) that we use for profiling. For demo purposes, this isn't an issue. Just run the cell again.

## Profiling Functionality

We can generate a per-function profile:

In [None]:
%%cudf.pandas.profile

small_df = pd.DataFrame({'a': [0, 1, 2], 'b': ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])

axis = 0
for i in range(0, 2):
    small_df.min(axis=axis)
    axis = 1

counts = small_df.groupby("a").b.count()



Total time elapsed: 92.818 seconds
5 GPU function calls in 92.383 seconds
1 CPU function calls in 0.022 seconds

In [None]:
%%cudf.pandas.line_profile

small_df = pd.DataFrame({'a': [0, 1, 2], 'b': ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])

axis = 0
for i in range(0, 2):
    small_df.min(axis=axis)
    axis = 1

counts = small_df.groupby("a").b.count()



Total time elapsed: 1.508 seconds

Stats columns: Line no., Line, GPU TIME(s), CPU TIME(s)

## Behind the scenes: What's going on here?

When you load cuDF's pandas accelerator mode, Pandas types like `Series` and `DataFrame` are replaced by proxy objects that dispatch operations to cuDF when possible. We can verify that `cudf.pandas` is active by looking at our `pd` variable:

In [None]:
pd

Out[9]: <module 'pandas' (ModuleAccelerator(fast=cudf, slow=pandas))>

As a result, all pandas functions, methods, and created objects are proxies:

In [None]:
type(pd.read_csv)

Out[10]: cudf.pandas.fast_slow_proxy._FunctionProxy

Operations supported by cuDF will be **very** fast:

In [None]:
%%time
df.count(axis=0)

CPU times: user 393 ms, sys: 356 ms, total: 749 ms
Wall time: 749 ms
Out[11]: Registration State       15435607
Violation Description    15117819
Vehicle Body Type        15402365
Issue Date               15435607
Summons Number           15435607
issue_weekday            15435607
dtype: int64

Operations not supported by cuDF will be slower, as they fall back to pandas (copying data between the CPU and GPU under the hood as needed). For example, cuDF does not currently support `axis=1` for the `count` method, so this operation runs on the CPU and is noticeably slower than the previous one.

In [None]:
%%time
df.count(
    axis=1
)  # This will use pandas, because cuDF doesn't support axis=1 for the .count() method

CPU times: user 13.4 s, sys: 2.81 s, total: 16.2 s
Wall time: 16 s
Out[12]: 0           5
1           5
2           5
3           5
4           5
           ..
15435602    6
15435603    6
15435604    6
15435605    6
15435606    6
Length: 15435607, dtype: int64
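The "try the fast path, fall back to the slow path" behavior described above can be pictured with a toy dispatcher in plain Python. This is a conceptual sketch only: the names `make_fast_slow`, `fast_count`, and `slow_count` are hypothetical, and the code is not how `cudf.pandas` is implemented internally. It only mirrors the idea that a call the fast backend cannot handle is retried on the slow one.

```python
def make_fast_slow(fast, slow):
    """Build a callable that tries `fast` first and falls back to `slow`.

    Toy stand-in for the proxy idea; not cuDF's real dispatch code.
    """

    def dispatch(*args, **kwargs):
        try:
            return "fast", fast(*args, **kwargs)
        except NotImplementedError:
            # The fast path declined this call, so redo it on the slow path
            return "slow", slow(*args, **kwargs)

    return dispatch


def fast_count(rows, axis=0):
    """Pretend accelerated count: only column-wise counting is supported."""
    if axis != 0:
        raise NotImplementedError("toy fast path supports axis=0 only")
    return [sum(value is not None for value in column) for column in zip(*rows)]


def slow_count(rows, axis=0):
    """Reference count of non-missing values along either axis."""
    groups = zip(*rows) if axis == 0 else rows
    return [sum(value is not None for value in group) for group in groups]


count = make_fast_slow(fast_count, slow_count)
rows = [(1, "x"), (2, None), (3, "z")]

print(count(rows, axis=0))  # handled by the toy fast path
print(count(rows, axis=1))  # falls back to the toy slow path
```

The caller uses one `count` function in both cases, which is the property that lets existing pandas code run unchanged.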

But the story doesn't end here. We often need to mix our own code with third-party libraries that other people have written. Many of these libraries accept pandas objects as inputs.

# Using third-party libraries with cuDF's pandas accelerator mode

You can pass Pandas objects to third-party libraries when using `cudf.pandas`, just like you would when using regular Pandas.

Below, we show an example of using [plotly-express](https://plotly.com/python/plotly-express/) to visualize the data we've been processing:

## Which states have more pickup trucks relative to other vehicles?

In [None]:
import plotly.express as px

df = df.rename(
    columns={
        "Registration State": "reg_state",
        "Vehicle Body Type": "vehicle_type",
    }
)

# vehicle counts per state:
counts = df.groupby("reg_state").size().sort_index()
# vehicles with type "PICK" (Pickup Truck)
pickup_counts = df.where(df["vehicle_type"] == "PICK").groupby("reg_state").size()
# percentage of pickup trucks by state:
pickup_frac = ((pickup_counts / counts) * 100).rename("% Pickup Trucks")
del pickup_frac["MB"]  # (Manitoba is a huge outlier!)

# plot the results:
pickup_frac = pickup_frac.reset_index()
fig = px.choropleth(
    pickup_frac,
    locations="reg_state",
    color="% Pickup Trucks",
    locationmode="USA-states",
    scope="usa",
)

fig.show(renderer="databricks")
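To make the percentage calculation easier to follow, the sketch below repeats it on a tiny made-up DataFrame (the `toy` data is ours). It uses boolean indexing in place of `DataFrame.where`, and also shows a shortcut based on the mean of a boolean column. One caveat: for a state with no pickup trucks at all, the count-based version yields a missing value while the mean-based version yields 0.

```python
import pandas as pd

# Tiny made-up stand-in for the renamed columns used above
toy = pd.DataFrame(
    {
        "reg_state": ["NY", "NY", "NY", "NY", "TX", "TX"],
        "vehicle_type": ["PICK", "SUBN", "4DSD", "VAN", "PICK", "PICK"],
    }
)

is_pickup = toy["vehicle_type"] == "PICK"

# Same steps as the cell above, with boolean indexing instead of DataFrame.where
counts = toy.groupby("reg_state").size()
pickup_counts = toy[is_pickup].groupby("reg_state").size()
pct_from_counts = (pickup_counts / counts) * 100

# Shortcut: the mean of a boolean column is the fraction of True values
pct_from_mean = is_pickup.groupby(toy["reg_state"]).mean() * 100

print(pct_from_counts.to_dict())
print(pct_from_mean.to_dict())
```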

## Beyond just passing data: **Accelerating** third-party code

Being able to pass these proxy objects to libraries like Plotly is great, but the benefits don't end there.

When you enable cuDF's pandas accelerator mode, pandas operations running **inside the third-party library's functions** will also benefit from GPU acceleration where possible!

Below, you can see an image illustrating how `cudf.pandas` can accelerate the pandas backend in Ibis, a library that provides a unified DataFrame API to various backends. We ran this example on a system with an NVIDIA H100 GPU and an Intel Xeon Platinum 8480CL CPU.


By loading the `cudf.pandas` extension, pandas operations within Ibis can use the GPU with zero code change. It just works.

![ibis](https://drive.google.com/uc?id=1uOJq2JtbgVb7tb8qw8a2gG3JRBo72t_H)

# Conclusion

With cuDF's pandas accelerator mode, you can keep using pandas as your primary dataframe library. When things start to get a little slow, just load the `cudf.pandas` extension and run your existing code on a GPU!

To learn more about `cudf.pandas`, we encourage you to visit [rapids.ai/cudf-pandas](https://rapids.ai/cudf-pandas).