# Processing large datasets (near GPU memory size) with cuDF pandas accelerator mode
<a href="https://github.com/rapidsai/cudf">cuDF</a> is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a pandas-style DataFrame API.

cuDF now provides a <a href="https://rapids.ai/cudf-pandas/">pandas accelerator mode</a> (`cudf.pandas`), allowing you to bring accelerated computing to your pandas workflows without requiring any code change.

This notebook demonstrates how the automatic memory management added to `cudf.pandas` accelerates processing of much larger datasets. `cudf.pandas` now uses a managed memory pool by default, which allows it to process datasets larger than the memory of the GPU it is running on. Managed memory prefetching is also enabled by default to improve memory access performance. For more information on CUDA Unified Memory (managed memory), performance, and prefetching, see this <a href="https://developer.nvidia.com/blog/improving-gpu-memory-oversubscription-performance/">NVIDIA Developer blog post</a>.

# ⚠️ Verify your setup

First, we'll verify that you are running with an NVIDIA GPU.

In [1]:
!nvidia-smi

Tue Aug  6 20:59:30 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       On  | 00000000:3B:00.0 Off |                    0 |
| N/A   32C    P8              10W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       On  | 00000000:5E:00.0 Off |  
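The same check can be done from Python, which is handy when the notebook has to decide programmatically whether a GPU is present. This is a minimal sketch, assuming `nvidia-smi` is on the `PATH` when a driver is installed; the helper name `list_gpus` is hypothetical and not part of any library.

```python
import shutil
import subprocess


def list_gpus():
    """Return (name, total_memory) pairs reported by nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []  # NVIDIA driver tools are not on PATH
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if proc.returncode != 0:
        return []  # the tool exists but could not talk to a GPU
    gpus = []
    for line in proc.stdout.strip().splitlines():
        name, _, memory = line.partition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


gpus = list_gpus()
if gpus:
    for name, memory in gpus:
        print(f"{name}: {memory}")
else:
    print("No NVIDIA GPU detected.")
```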

# Download the data

The data we will be working with lists approximately 90 million transactions with relatively higher illicit (HI) activity.

We're downloading a curated copy of <a href="https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml?select=HI-Large_Trans.csv">this Kaggle dataset</a> from a GCP bucket hosted by NVIDIA to provide faster download speeds. The download should take about 30 seconds.

**Data License and Terms** <br>
As this dataset originates from a Kaggle dataset, it's governed by that dataset's license and terms of use, which is the Open Data Commons license. Review it here: https://opendatacommons.org/licenses/by/1-0/index.html

**Are there restrictions on how I can use this data?** <br>
For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose.

In [2]:
# <TO-DO:> Download dataset from a GCP bucket-

Let's import the required Python libraries:

In [13]:
import pandas as pd
import numpy as np

# Analysis using Standard Pandas

First, let's use pandas to read in the dataset:

# Let's load the dataset using pandas and analyze it! 

### WARNING - Avoid running the below cell as it takes around 5 minutes to load the data!

In [4]:
%%time
df_transactions = pd.read_csv('/nvme/1/manass/notebooks/polars_exp/Data/HI-Large_Trans_reduced.csv')
df_transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89851114 entries, 0 to 89851113
Data columns (total 11 columns):
 #   Column              Dtype  
---  ------              -----  
 0   Timestamp           object 
 1   From Bank           int64  
 2   Account             object 
 3   To Bank             int64  
 4   Account.1           object 
 5   Amount Received     float64
 6   Receiving Currency  object 
 7   Amount Paid         float64
 8   Payment Currency    object 
 9   Payment Format      object 
 10  Is Laundering       int64  
dtypes: float64(2), int64(3), object(6)
memory usage: 7.4+ GB
CPU times: user 4min 32s, sys: 33.4 s, total: 5min 5s
Wall time: 5min 5s
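The `+` in `memory usage: 7.4+ GB` means the figure is a lower bound: by default pandas counts only the pointers held in `object` columns, not the Python strings they point to. The sketch below shows the difference on a tiny, made-up frame; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Tiny illustrative frame; the string columns are forced to object dtype
# to mirror the dtypes reported by df_transactions.info() above.
df = pd.DataFrame({
    "From Bank": [3196, 1208, 3203],
    "Account": pd.Series(["800107150", "80010E650", "80010EA80"], dtype="object"),
    "Payment Currency": pd.Series(["US Dollar"] * 3, dtype="object"),
})

shallow = df.memory_usage(deep=False).sum()  # object cells counted as pointers only
deep = df.memory_usage(deep=True).sum()      # also counts the string payloads

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

On the full dataset, `df_transactions.info(memory_usage="deep")` would report the deep figure, at the cost of a slower scan over every string.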


This 10 GB dataset takes around 5 minutes to load! See below for a data snapshot:

In [5]:
df_transactions.head()

| | Timestamp | From Bank | Account | To Bank | Account.1 | Amount Received | Receiving Currency | Amount Paid | Payment Currency | Payment Format | Is Laundering |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022/08/01 00:02 | 3196 | 800107150 | 3196 | 800107150 | 7739.29 | US Dollar | 7739.29 | US Dollar | Reinvestment | 0 |
| 1 | 2022/08/01 00:03 | 1208 | 80010E650 | 20 | 80010E6F0 | 73966883.0 | US Dollar | 73966883.0 | US Dollar | Cheque | 0 |
| 2 | 2022/08/01 00:27 | 3203 | 80010EA80 | 3203 | 80010EA80 | 13284.41 | US Dollar | 13284.41 | US Dollar | Reinvestment | 0 |
| 3 | 2022/08/01 00:09 | 1208 | 80010E430 | 1208 | 80010E430 | 7.66 | US Dollar | 7.66 | US Dollar | Reinvestment | 0 |
| 4 | 2022/08/01 00:06 | 1208 | 80010E650 | 1208 | 80010E650 | 4.86 | US Dollar | 4.86 | US Dollar | Reinvestment | 0 |


We can see that the dataset consists of bank information (account details), transaction details, and whether the transaction is associated with money laundering or not.

# Which banks have the most money laundering related money transferred between them?

Such an analysis helps identify the banks that are most strongly associated with money laundering, so that their transaction data can be examined more closely.

In [6]:
%%time
# Sum the amount received and the laundering flags per (sender bank, receiver bank, currency)
result=df_transactions.groupby(
    ["From Bank","To Bank","Payment Currency"]).agg({"Amount Received":"sum","Is Laundering":"sum"})

filtered_result = result[result["Is Laundering"] > 0].sort_values(by="Amount Received", ascending=False)
filtered_result.head(10)

CPU times: user 27.4 s, sys: 956 ms, total: 28.4 s
Wall time: 28.4 s


| From Bank | To Bank | Payment Currency | Amount Received | Is Laundering |
|---|---|---|---|---|
| 4011 | 4011 | US Dollar | 22875630000000.0 | 1 |
| 18824 | 18824 | US Dollar | 11814720000000.0 | 3 |
| 18184 | 4 | Rupee | 5257959000000.0 | 1 |
| 221118 | 221118 | US Dollar | 4857862000000.0 | 1 |
| 214853 | 214853 | US Dollar | 1908466000000.0 | 1 |
| 28781 | 28781 | US Dollar | 1458829000000.0 | 1 |
| 2310 | 2310 | US Dollar | 1359950000000.0 | 8 |
| 70 | 137768 | Yen | 1216654000000.0 | 9 |
| 5763 | 5763 | US Dollar | 1087834000000.0 | 2 |
| 76 | 27 | Rupee | 1017549000000.0 | 4 |


28 seconds is a long wait for a simple aggregation! Based on the data, it's interesting to note that most of the top bank pairs (7 of the 10 rows shown) involve transfers within the same bank (to and from)!
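The same-bank observation comes from eyeballing the top 10 rows. A more direct check is to compute the share of laundering-flagged transactions whose sender and receiver bank match. The sketch below does this on a small made-up frame (the bank IDs and flags are illustrative, not from the dataset); the same expressions should apply to `df_transactions`.

```python
import pandas as pd

# Illustrative rows only; the real data lives in df_transactions.
sample = pd.DataFrame({
    "From Bank": [10, 20, 30, 40, 50, 60],
    "To Bank": [10, 20, 31, 41, 50, 61],
    "Is Laundering": [1, 1, 1, 1, 1, 0],
})

# Keep only the laundering-flagged transactions
laundering = sample[sample["Is Laundering"] == 1]

# Fraction of those where the sender and receiver bank are the same
same_bank_share = (laundering["From Bank"] == laundering["To Bank"]).mean()
print(f"Share of laundering transactions within one bank: {same_bank_share:.0%}")
```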

# Which locations are most highly correlated with money laundering related transactions?

It is helpful to understand the locations that have the most money laundering related activity to take appropriate steps. 

Since we don't have the location information in the dataset, we will create a dummy dataset of random locations and then join them with our original dataset.

In [7]:
def create_location_dataset(df_transactions):
    """Generate a mock dataset that maps each sending bank to a random location."""
    # Unique bank IDs found in the 'From Bank' column
    unique_banks = df_transactions['From Bank'].unique()

    # One row per unique bank
    df_location = pd.DataFrame({'From Bank': unique_banks})

    # List of sample cities
    city_list = ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix",
                 "Philadelphia", "San Antonio", "San Diego", "Dallas", "San Jose"]

    # Draw one random city per bank (unseeded, so results differ between runs)
    df_location['location'] = np.random.choice(city_list, len(df_location))

    return df_location

`df_location` is the dataset with location details

In [8]:
df_location = create_location_dataset(df_transactions)
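Because the cities are drawn without a seed, the location table changes on every run, so the per-location results further down will not match from one run to the next. The variant below is a sketch of a reproducible version; the function name `create_location_dataset_seeded`, the `seed` parameter, and the shortened city list are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

CITY_LIST = ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"]


def create_location_dataset_seeded(bank_ids, seed=0):
    """Assign each unique bank ID a pseudo-random city, reproducibly for a given seed."""
    rng = np.random.default_rng(seed)  # local generator, independent of global state
    unique_banks = pd.unique(pd.Series(bank_ids))
    return pd.DataFrame({
        "From Bank": unique_banks,
        "location": rng.choice(CITY_LIST, size=len(unique_banks)),
    })


first = create_location_dataset_seeded([3196, 1208, 3203, 1208], seed=42)
second = create_location_dataset_seeded([3196, 1208, 3203, 1208], seed=42)
print(first.equals(second))  # same seed, same assignment
```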

In [9]:
%%time
df_merged = df_transactions.merge(df_location, how='left', on='From Bank')

CPU times: user 17.6 s, sys: 6.4 s, total: 24 s
Wall time: 23.9 s
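A left join on `From Bank` should keep exactly one output row per transaction, as long as the location table has one row per bank. The sketch below shows how pandas can enforce that with the `validate` argument of `merge`, using made-up frames. Whether this argument stays on the GPU under `cudf.pandas` is not checked here.

```python
import pandas as pd

# Illustrative frames standing in for df_transactions and df_location
transactions = pd.DataFrame({
    "From Bank": [1, 1, 2, 3],
    "Amount Received": [10.0, 5.0, 7.5, 1.0],
})
locations = pd.DataFrame({
    "From Bank": [1, 2, 3],
    "location": ["Chicago", "Dallas", "Houston"],
})

# validate="many_to_one" raises MergeError if a bank appears twice in `locations`
merged = transactions.merge(
    locations, how="left", on="From Bank", validate="many_to_one"
)

print(len(merged) == len(transactions))  # the join did not duplicate rows
print(merged["location"].notna().all())  # every transaction found a location
```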


In [12]:
%%time
# Sum the amount received and the laundering flags per location
result=df_merged.groupby(
    ["location"]).agg({"Amount Received":"sum","Is Laundering":"sum"})

filtered_result = result[result["Is Laundering"] > 0].sort_values(by="Amount Received", ascending=False)
filtered_result.head(10)

CPU times: user 14.6 s, sys: 538 ms, total: 15.1 s
Wall time: 15 s


| location | Amount Received | Is Laundering |
|---|---|---|
| Los Angeles | 137806400000000.0 | 9535 |
| New York | 69544460000000.0 | 11519 |
| San Jose | 64085180000000.0 | 10467 |
| Phoenix | 49666580000000.0 | 21231 |
| Philadelphia | 49099360000000.0 | 10273 |
| Dallas | 29940200000000.0 | 10711 |
| San Antonio | 26147540000000.0 | 9272 |
| Houston | 25792240000000.0 | 10911 |
| San Diego | 25651400000000.0 | 9882 |
| Chicago | 21794350000000.0 | 9152 |


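Banks in Los Angeles account for the largest total amount received, while banks in Phoenix have the most transactions flagged as money laundering (21,231). Keep in mind that the locations are randomly assigned mock data.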
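Because the table above is sorted by `Amount Received`, its top row is not necessarily the location with the most laundering-flagged transactions. The sketch below, on a small made-up frame, uses named aggregation to compute the total, the flag count, and the flag rate side by side so each can be ranked separately. The output column names are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative rows standing in for df_merged
merged = pd.DataFrame({
    "location": ["Phoenix", "Phoenix", "Dallas", "Dallas", "Dallas", "Chicago"],
    "Amount Received": [100.0, 50.0, 20.0, 30.0, 10.0, 500.0],
    "Is Laundering": [1, 1, 0, 1, 0, 0],
})

by_location = merged.groupby("location").agg(
    total_received=("Amount Received", "sum"),
    laundering_count=("Is Laundering", "sum"),
    laundering_rate=("Is Laundering", "mean"),  # share of flagged transactions
)

# Rank by the number of flagged transactions rather than by amount
print(by_location.sort_values("laundering_count", ascending=False))
```

In this toy data the location with the largest total (Chicago) has no flagged transactions at all, which is why the sort key matters when reading the result.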

# Analysis with cuDF Pandas

Let's first install cuDF (the command below pins the 24.08 release):

In [None]:
!pip install --extra-index-url=https://pypi.nvidia.com cudf-cu12==24.8.*

Typically, you should load the `cudf.pandas` extension as the first step in your notebook, before importing any modules. Here, we explicitly restart the kernel to simulate that behavior.

Note: We just added the `%load_ext cudf.pandas` line; the rest of the code remains the same.

In [2]:
%load_ext cudf.pandas

The cudf.pandas extension is already loaded. To reload it, use:
  %reload_ext cudf.pandas


Before we load data, let's use the `rmm` library to make sure we are tracking GPU memory utilization. It will be interesting to see whether we ever use the GPU to its full capacity, and what the speedups are when we do!


In [3]:
import rmm
stats_mr = rmm.mr.StatisticsResourceAdaptor(
    rmm.mr.get_current_device_resource())
rmm.mr.set_current_device_resource(stats_mr)

We are using the `rmm` library, which is responsible for managing memory during `cudf.pandas` operations.


In [4]:
import pandas as pd
import numpy as np  # needed again by create_location_dataset after the kernel restart

We wrap a `StatisticsResourceAdaptor` around our memory resource so that we can see the memory allocations made by the operations that follow.

#### We'll run the same code as above to get a feel for what GPU acceleration brings to pandas workflows!

In [5]:
%%cudf.pandas.profile
df_transactions = pd.read_csv('/nvme/1/manass/notebooks/polars_exp/Data/HI-Large_Trans_reduced.csv')

Nice! The data loading time has come down from 5 minutes to around 25 seconds for the 10 GB dataset.

# Can we handle workloads larger than GPU memory?

In [9]:
print(f"Total memory usage {round(stats_mr.allocation_counts.current_bytes/(1024**3),0)} GB")

Total memory usage 10.0 GB


The dataset is occupying 10 GB of GPU memory.

In [8]:
print(f"Peak memory usage {round(stats_mr.allocation_counts.peak_bytes/(1024**3),0)} GB")

Peak memory usage 28.0 GB


It's interesting to see that peak memory usage (28 GB) was much higher than the memory of a single T4 (15360 MiB, per the `nvidia-smi` output above), and yet the load time still dropped from about 5 minutes to well under a minute. This shows that `cudf.pandas` can process workloads larger than GPU memory.
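To put a number on the oversubscription, the peak reported above can be divided by the GPU's total memory. This is a small arithmetic sketch; it uses the rounded 28 GB peak printed by the notebook and the 15360 MiB shown by `nvidia-smi` for one T4, so the result is approximate. The helper name is hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def oversubscription_ratio(peak_bytes, gpu_total_bytes):
    """How many times larger the peak allocation was than GPU memory."""
    return peak_bytes / gpu_total_bytes


peak_bytes = 28 * GIB           # rounded peak printed by the notebook
gpu_total_bytes = 15360 * MIB   # total memory of one T4 per nvidia-smi

ratio = oversubscription_ratio(peak_bytes, gpu_total_bytes)
print(f"Peak allocation was about {ratio:.2f}x the GPU memory")
```

A ratio above 1 means part of the data had to live in host memory at the peak, which is what the managed memory pool makes possible.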


In [18]:
%%time
# Aggregate-
df_transactions.groupby(
    ["From Bank","To Bank","Payment Currency"]).agg({"Amount Received":"sum"})

CPU times: user 2.76 s, sys: 492 ms, total: 3.25 s
Wall time: 2.97 s


| From Bank | To Bank | Payment Currency | Amount Received |
|---|---|---|---|
| 0 | 0 | Bitcoin | 8.257410e-01 |
| 0 | 0 | US Dollar | 4.443487e+11 |
| 0 | 1 | Euro | 1.737009e+06 |
| 0 | 2 | Yuan | 1.555276e+06 |
| 0 | 3 | Euro | 4.393948e+05 |
| ... | ... | ... | ... |
| 3225441 | 1210700 | Bitcoin | 1.981200e-02 |
| 3225444 | 1217496 | Bitcoin | 4.904000e-02 |
| 3225451 | 78520 | US Dollar | 2.820300e+02 |
| 3225454 | 180343 | US Dollar | 3.459300e+03 |


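This was much faster than before! The aggregation went from 28.4 s with pandas to about 3 s here, a speedup of roughly 9-10x. Note that this cell aggregates only `Amount Received` and skips the filter and sort, so the two timings are not an exact like-for-like comparison.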


In [11]:
def create_location_dataset(df_transactions):
    """Generate a mock dataset that maps each sending bank to a random location."""
    # Unique bank IDs found in the 'From Bank' column
    unique_banks = df_transactions['From Bank'].unique()

    # One row per unique bank
    df_location = pd.DataFrame({'From Bank': unique_banks})

    # List of sample cities
    city_list = ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix",
                 "Philadelphia", "San Antonio", "San Diego", "Dallas", "San Jose"]

    # Draw one random city per bank (unseeded, so results differ between runs)
    df_location['location'] = np.random.choice(city_list, len(df_location))

    return df_location

Recreating the location dataset

In [16]:
df_location = create_location_dataset(df_transactions)

In [17]:
%%time
df_merged = df_transactions.merge(df_location, how='left', on='From Bank')

CPU times: user 3.22 s, sys: 1.5 s, total: 4.72 s
Wall time: 3.79 s


In [18]:
df_merged.head()

| | Timestamp | From Bank | Account | To Bank | Account.1 | Amount Received | Receiving Currency | Amount Paid | Payment Currency | Payment Format | Is Laundering | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022/08/01 00:02 | 3196 | 800107150 | 3196 | 800107150 | 7739.29 | US Dollar | 7739.29 | US Dollar | Reinvestment | 0 | Phoenix |
| 1 | 2022/08/01 00:03 | 1208 | 80010E650 | 20 | 80010E6F0 | 73966883.0 | US Dollar | 73966883.0 | US Dollar | Cheque | 0 | Houston |
| 2 | 2022/08/01 00:27 | 3203 | 80010EA80 | 3203 | 80010EA80 | 13284.41 | US Dollar | 13284.41 | US Dollar | Reinvestment | 0 | San Diego |
| 3 | 2022/08/01 00:09 | 1208 | 80010E430 | 1208 | 80010E430 | 7.66 | US Dollar | 7.66 | US Dollar | Reinvestment | 0 | Houston |
| 4 | 2022/08/01 00:06 | 1208 | 80010E650 | 1208 | 80010E650 | 4.86 | US Dollar | 4.86 | US Dollar | Reinvestment | 0 | Houston |


Nice! The join went from 23.9 s with pandas to 3.79 s here, a speedup of roughly 6x!
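The speedups can be computed directly from the wall times recorded in the cell outputs of this notebook. The sketch below does that arithmetic; keep in mind that the two group-by cells are not identical (the `cudf.pandas` one omits the laundering sum, filter, and sort), so that ratio is only indicative.

```python
# (pandas wall time, cudf.pandas wall time) in seconds, copied from the cell outputs
timings = {
    "groupby aggregation": (28.4, 2.97),
    "merge": (23.9, 3.79),
}

speedups = {
    step: pandas_s / cudf_s for step, (pandas_s, cudf_s) in timings.items()
}

for step, factor in speedups.items():
    print(f"{step}: {factor:.1f}x faster")
```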

It's clear that we are seeing speedups for large datasets with `cudf.pandas` with zero code change! This can be attributed to the better memory management provided by the managed memory pool and prefetching described at the beginning of the notebook. See this <a href="https://developer.nvidia.com/blog/improving-gpu-memory-oversubscription-performance/">NVIDIA Developer blog post</a> for more details.

But what happens if we switch that feature off?

In [23]:
# Restart notebook-
get_ipython().kernel.do_shutdown(restart=True)

{'status': 'ok', 'restart': True}

`cudf.pandas` provides an environment variable, `CUDF_PANDAS_RMM_MODE`, that you can set to `cuda` to turn off managed memory.

In [1]:
%env CUDF_PANDAS_RMM_MODE=cuda

import os
# Verify that the environment variable is set
print(os.environ['CUDF_PANDAS_RMM_MODE'])

env: CUDF_PANDAS_RMM_MODE=cuda
cuda
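Outside a notebook there is no `%env` or `%load_ext`, so the same setup has to be done in plain Python. The sketch below sets the variable before `cudf.pandas` is loaded and then calls `cudf.pandas.install()`, the script equivalent of the extension. The `find_spec` guard is an addition for illustration so that the snippet also runs on a machine without cuDF.

```python
import importlib.util
import os

# Set the mode before cudf.pandas is loaded; "cuda" turns off managed memory.
os.environ["CUDF_PANDAS_RMM_MODE"] = "cuda"

if importlib.util.find_spec("cudf") is not None:
    import cudf.pandas

    cudf.pandas.install()  # must run before pandas is imported
else:
    print("cudf is not installed; pandas will run on the CPU only.")

import pandas as pd  # accelerated only if the install() call above ran

print(os.environ["CUDF_PANDAS_RMM_MODE"])
```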


In [2]:
%load_ext cudf.pandas

In [3]:
import pandas as pd

In [4]:
%%cudf.pandas.profile
df_transactions = pd.read_csv('/nvme/1/manass/notebooks/polars_exp/Data/HI-Large_Trans_reduced.csv')

Disabling managed memory led to CPU fallback because of `Out of Memory` issues on the GPU.

In [5]:
%%cudf.pandas.profile
# Aggregate-
df_transactions.groupby(
    ["From Bank"]).agg({"Amount Received":"sum"})

| From Bank | Amount Received |
|---|---|
| 0 | 1.592349e+12 |
| 1 | 5.536070e+10 |
| 2 | 4.319278e+11 |
| 3 | 2.200754e+12 |
| 4 | 4.873087e+12 |
| ... | ... |
| 3225441 | 1.981200e-02 |
| 3225444 | 4.904000e-02 |
| 3225451 | 2.820300e+02 |
| 3225454 | 3.459300e+03 |


Even the group-by slowed down considerably once managed memory was switched off.

# Summary

With `cudf.pandas`, you can keep using pandas as your primary dataframe library. When things start to get a little slow, just load the `cudf.pandas` extension and enjoy the incredible speedups.

If you like Google Colab and want to get peak `cudf.pandas` performance to process even larger datasets, Google Colab's paid tier includes both L4 and A100 GPUs (in addition to the T4 GPU this demo notebook is using).

To learn more about cudf.pandas, we encourage you to visit rapids.ai/cudf-pandas.

# Do you have any feedback for us?

Fill out this quick survey <a href="https://www.surveymonkey.com/r/TX3QQQR">HERE</a>

Raise an issue on our github repo <a href="https://github.com/rapidsai/cudf/issues">HERE</a>