# ⚠️ Verify your setup

First, we'll verify that you are running with an NVIDIA GPU.

In [None]:
!nvidia-smi  # this should display information about available GPUs

Mon Nov 25 23:28:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
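The same check can be scripted from Python, which is handy outside a notebook. This is a minimal sketch: the helper name `list_gpus` is hypothetical, and it assumes only that the `nvidia-smi` CLI is on the `PATH` when a driver is installed. It returns an empty list rather than failing when no GPU is visible.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, total memory' string per visible GPU, or [] if none."""
    if shutil.which("nvidia-smi") is None:
        return []  # the CLI is not installed, so no NVIDIA driver is visible
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return []  # the CLI exists but could not talk to a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No NVIDIA GPU detected")
```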
                                                                    

# Download the data

The data we'll be working with is the [Parking Violations Issued - Fiscal Year 2022](https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2022/7mxj-7a6y) dataset from NYC Open Data.

We download a copy of this dataset from an S3 bucket hosted by NVIDIA, which provides faster download speeds. The download should take about 30 seconds.

## Data License and Terms
As this dataset originates from the NYC Open Data Portal, it's governed by their license and terms of use.

### Are there restrictions on how I can use Open Data?

> Open Data belongs to all New Yorkers. There are no restrictions on the use of Open Data. Refer to Terms of Use for more information.

### [Terms of Use](https://opendata.cityofnewyork.us/overview/#termsofuse)

> By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.

> The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.

> Submitting City Agencies are the authoritative source of data available on NYC Open Data. These entities are responsible for data quality and retain version control of data sets and feeds accessed on the Site. Data may be updated, corrected, or refreshed at any time.

In [None]:
!wget https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet -O /tmp/nyc_parking_violations_2022.parquet

--2024-11-25 23:28:10--  https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet
Resolving data.rapids.ai (data.rapids.ai)... 13.225.4.33, 13.225.4.96, 13.225.4.58, ...
Connecting to data.rapids.ai (data.rapids.ai)|13.225.4.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 474211285 (452M) [binary/octet-stream]
Saving to: ‘/tmp/nyc_parking_violations_2022.parquet’


2024-11-25 23:28:13 (179 MB/s) - ‘/tmp/nyc_parking_violations_2022.parquet’ saved [474211285/474211285]



# Let's try to load the file using just `cudf`

In [None]:
import cudf

In [None]:
# Read only the five columns we need
df = cudf.read_parquet(
    "/tmp/nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
)


df.head(5)

Unnamed: 0,Registration State,Violation Description,Vehicle Body Type,Issue Date,Summons Number
0,NY,,VAN,06/25/2021,1457617912
1,NY,,SUBN,06/25/2021,1457617924
2,TX,,SDN,06/17/2021,1457622427
3,MO,,SDN,06/16/2021,1457638629
4,NY,,TAXI,07/04/2021,1457639580


Performing a join on `df`, which does not put much pressure on GPU memory, works just fine.

In [None]:
df.merge(df, on="Summons Number")

Unnamed: 0,Registration State_x,Violation Description_x,Vehicle Body Type_x,Issue Date_x,Summons Number,Registration State_y,Violation Description_y,Vehicle Body Type_y,Issue Date_y
0,WA,,SDN,06/15/2021,1478653530,WA,,SDN,06/15/2021
1,TX,,SUBN,07/10/2021,1478653577,TX,,SUBN,07/10/2021
2,NY,,SDN,07/09/2021,1478653772,NY,,SDN,07/09/2021
3,NY,,SDN,07/09/2021,1478653784,NY,,SDN,07/09/2021
4,NY,,SUBN,07/10/2021,1478653978,NY,,SUBN,07/10/2021
...,...,...,...,...,...,...,...,...,...
15435602,NJ,16A-No Std (Com Veh) Non-COM,4DSD,06/06/2022,8991294911,NJ,16A-No Std (Com Veh) Non-COM,4DSD,06/06/2022
15435603,NY,10-No Stopping,4DSD,06/03/2022,8991294819,NY,10-No Stopping,4DSD,06/03/2022
15435604,NY,16A-No Std (Com Veh) Non-COM,MOPD,06/06/2022,8991294856,NY,16A-No Std (Com Veh) Non-COM,MOPD,06/06/2022
15435605,NY,16A-No Std (Com Veh) Non-COM,SUBN,06/06/2022,8991294870,NY,16A-No Std (Com Veh) Non-COM,SUBN,06/06/2022


# Let's try performing a join on a dataframe that's double the size

In [None]:
new_df = cudf.concat([df, df])

A join requires additional intermediate memory. Here the GPU runs out of it, so a `MemoryError` is raised.

In [None]:
new_df.merge(new_df, on="Summons Number")

MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /tmp/pip-build-env-z8x7r21l/normal/lib/python3.12/site-packages/librmm/include/rmm/mr/device/cuda_memory_resource.hpp:62: cudaErrorMemoryAllocation out of memory
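To see why doubling the input hurts so much, it helps to count the rows an inner join produces. The sketch below uses plain Python and a hypothetical helper, `join_output_rows`, with toy keys standing in for `Summons Number`. It assumes each key is unique in the original frame, which is an illustration rather than a verified property of the dataset.

```python
from collections import Counter


def join_output_rows(left_keys, right_keys):
    """Count the rows an inner join on a single key column would produce."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key contributes (occurrences on the left) x (occurrences on the right) rows.
    return sum(n * right_counts[key] for key, n in left_counts.items())


# Hypothetical toy keys standing in for "Summons Number".
keys = [101, 102, 103]
doubled = keys + keys  # mirrors cudf.concat([df, df])

single_rows = join_output_rows(keys, keys)
doubled_rows = join_output_rows(doubled, doubled)
print(single_rows, doubled_rows)
```

Under that uniqueness assumption, doubling the input makes every key match twice on each side, so the self-join output has four times as many rows as the original self-join, on top of the intermediate buffers the join itself needs.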

# Stepping it up even more with a very large dataframe

In [None]:
new_df = cudf.concat([df, df, df, df, df, df[:int(len(df)/2)]])

Let's try to write this huge dataframe back to a Parquet file. Here we run into a `MemoryError`: encoding the dataframe for writing needs roughly 3x more GPU memory than the dataframe itself, which exceeds what is available.

In [None]:
new_df.to_parquet("larger_df.parquet")

MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /tmp/pip-build-env-z8x7r21l/normal/lib/python3.12/site-packages/librmm/include/rmm/mr/device/cuda_memory_resource.hpp:62: cudaErrorMemoryAllocation out of memory
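A back-of-the-envelope check makes the failure easier to reason about. The sketch below is only an estimate: the helper `fits_on_gpu` is hypothetical, the base dataframe size is an assumed placeholder rather than a measured value, and the "roughly 3x" figure is treated as a total multiplier. The GPU capacity is the 15360 MiB reported by `nvidia-smi` earlier.

```python
MIB = 1024 ** 2


def fits_on_gpu(frame_bytes, overhead_factor, gpu_bytes):
    """Return True if frame_bytes * overhead_factor fits within gpu_bytes."""
    return frame_bytes * overhead_factor <= gpu_bytes


gpu_bytes = 15360 * MIB        # Tesla T4 capacity reported by nvidia-smi
base_frame_bytes = 1500 * MIB  # hypothetical size of df, not measured
copies = 5.5                   # five full copies plus half of one, as in the concat
overhead_factor = 3            # rough encoding overhead quoted in the text

large_frame_bytes = base_frame_bytes * copies
print(fits_on_gpu(base_frame_bytes, overhead_factor, gpu_bytes))
print(fits_on_gpu(large_frame_bytes, overhead_factor, gpu_bytes))
```

With these assumed numbers the single frame fits and the concatenated one does not. On real data, the placeholder could be replaced with `df.memory_usage(deep=True).sum()`.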

# Using cudf.pandas

Now, let's re-run the workload above as pandas code with the `cudf.pandas` extension loaded.

**Note:
Typically, you should load the `cudf.pandas` extension as the first step in your notebook, before importing any modules. Here, we explicitly restart the kernel to simulate that behavior.**


In [None]:
get_ipython().kernel.do_shutdown(restart=True)

{'status': 'ok', 'restart': True}

In [None]:
%load_ext cudf.pandas

`cudf.pandas` makes use of CUDA managed memory, which draws on both GPU and CPU memory whenever the GPU runs short of memory.
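For comparison, managed memory can also be requested explicitly when using plain `cudf`, through the RAPIDS Memory Manager (`rmm`). The sketch below is meant for a fresh process, not for this notebook, where `cudf.pandas` already handles the setup. The wrapper `enable_managed_memory` is a hypothetical helper, and it assumes `rmm.reinitialize` is called before any GPU memory has been allocated.

```python
def enable_managed_memory():
    """Try to switch RMM to CUDA managed memory; return True on success."""
    try:
        import rmm  # RAPIDS Memory Manager, installed alongside cudf
    except Exception:
        return False  # rmm is not available in this environment
    try:
        # Must run before any GPU allocations are made in this process.
        rmm.reinitialize(managed_memory=True)
    except Exception:
        return False  # for example, no usable GPU was found
    return True


managed = enable_managed_memory()
print("managed memory enabled" if managed else "managed memory not enabled")
```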

In [None]:
import pandas as pd

In [None]:
df = pd.read_parquet(
    "/tmp/nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"],
)


In [None]:
df.head(5)

Unnamed: 0,Registration State,Violation Description,Vehicle Body Type,Issue Date,Summons Number
0,NY,,VAN,06/25/2021,1457617912
1,NY,,SUBN,06/25/2021,1457617924
2,TX,,SDN,06/17/2021,1457622427
3,MO,,SDN,06/16/2021,1457638629
4,NY,,TAXI,07/04/2021,1457639580


The `merge` above that failed with a `MemoryError` under `cudf` now runs to completion on the GPU with `cudf.pandas`!

In [None]:
new_df = pd.concat([df, df])

In [None]:
new_df.merge(new_df, on="Summons Number")

Unnamed: 0,Registration State_x,Violation Description_x,Vehicle Body Type_x,Issue Date_x,Summons Number,Registration State_y,Violation Description_y,Vehicle Body Type_y,Issue Date_y
0,NY,,VAN,06/25/2021,1457617912,NY,,VAN,06/25/2021
1,NY,,VAN,06/25/2021,1457617912,NY,,VAN,06/25/2021
2,NY,,SUBN,06/25/2021,1457617924,NY,,SUBN,06/25/2021
3,NY,,SUBN,06/25/2021,1457617924,NY,,SUBN,06/25/2021
4,TX,,SDN,06/17/2021,1457622427,TX,,SDN,06/17/2021
...,...,...,...,...,...,...,...,...,...
61742423,NY,21-No Parking (street clean),2DSD,06/07/2022,8995222785,NY,21-No Parking (street clean),2DSD,06/07/2022
61742424,VA,21-No Parking (street clean),SUBN,06/07/2022,8995222827,VA,21-No Parking (street clean),SUBN,06/07/2022
61742425,VA,21-No Parking (street clean),SUBN,06/07/2022,8995222827,VA,21-No Parking (street clean),SUBN,06/07/2022
61742426,MD,14-No Standing,4DSD,06/07/2022,8995222839,MD,14-No Standing,4DSD,06/07/2022


In [None]:
new_df = pd.concat([df, df, df, df, df, df[:int(len(df)/2)]])

The same holds for I/O: CUDA managed memory now uses both GPU and CPU memory to encode the dataframe and write it to a Parquet file.

In [None]:
new_df.to_parquet("larger_df.parquet")

In [None]:
!ls -al larger_df.parquet

-rw-r--r-- 1 root root 852288745 Nov 25 23:29 larger_df.parquet


# Comparing the performance against `pandas`

In [None]:
get_ipython().kernel.do_shutdown(restart=True)

{'status': 'ok', 'restart': True}

In [None]:
import pandas as pd

In [None]:
df = pd.read_parquet(
    "/tmp/nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"],
)


In [None]:
df.head(5)

Unnamed: 0,Registration State,Violation Description,Vehicle Body Type,Issue Date,Summons Number
0,NY,,VAN,06/25/2021,1457617912
1,NY,,SUBN,06/25/2021,1457617924
2,TX,,SDN,06/17/2021,1457622427
3,MO,,SDN,06/16/2021,1457638629
4,NY,,TAXI,07/04/2021,1457639580


In [None]:
new_df = pd.concat([df, df])

In [None]:
%%time
new_df.merge(new_df, on="Summons Number")

In [None]:
new_df = pd.concat([df, df, df, df, df, df[:int(len(df)/2)]])

In [None]:
%%time
new_df.to_parquet("larger_df.parquet")
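The `%%time` magic only works inside a notebook. Outside one, a small helper can record wall-clock time so the `pandas` and `cudf.pandas` runs can be compared side by side. This is a minimal sketch with hypothetical helper names (`time_call`, `speedup`), and it times a stand-in workload rather than the dataframe operations above.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run was."""
    return baseline_seconds / accelerated_seconds


# Stand-in workload; in practice pass e.g. new_df.merge with its arguments.
result, seconds = time_call(sorted, range(100_000, 0, -1))
print(f"{seconds:.4f} s")
```

Timing the same call once in each environment and passing both durations to `speedup` gives a single number to report.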