In [1]:
!nvidia-smi

Fri Dec  8 08:53:23 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro GV100        Off  | 00000000:15:00.0 Off |                  Off |
| 37%   50C    P2    42W / 250W |     10MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro GV100        Off  | 00000000:2D:00.0  On |                  Off |
| 44%   58C    P0    45W / 250W |    777MiB / 32768MiB |     12%      Default |
|       

## Part 1: Quick intro to cudf.pandas 🚀🐼

Comment this line in or out depending on whether you want to enable `cudf.pandas`:

In [1]:
%load_ext cudf.pandas

In [2]:
import pandas as pd
import numpy as np
import random
import string

In this section, we're running some basic pandas functions with randomly generated data and timing them.

In [None]:
STRINGS = ["".join(random.choices("abcdefg", k=5)) for _ in range(1000)] + [None]    

def make_df(size):
    return pd.DataFrame(
        {
            "id": np.random.randint(low=0, high=100, size=size),
            "x": np.random.rand(size),
            "y": np.random.rand(size),
            "s": random.choices(STRINGS, k=size)
        }
    )

df1 = make_df(10_000_000)
df2 = make_df(10_000)

In [None]:
%%time
df1.groupby("id").mean()

In [None]:
%%time
df1["s"].str.contains("a")

In [None]:
%%time
df1.merge(df2, on=["id", "s"], how="left")

In [None]:
%%time
df1.count(axis=0)

In [None]:
%%time
df1.count(axis=1)
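
Outside a notebook, where the `%%time` magic is not available, a similar measurement can be taken with the standard library. The snippet below is a minimal sketch: `time_call` is a hypothetical helper (not part of pandas or cuDF) that reports the best wall-clock time over a few repeats.

```python
import time


def time_call(fn, *args, repeat=3, **kwargs):
    """Run fn(*args, **kwargs) `repeat` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


# Stand-in workload; in this notebook it could be e.g. `df1.groupby("id").mean`.
elapsed, total = time_call(sum, range(1_000_000))
print(f"best of 3: {elapsed:.4f}s")
```

Taking the best of several runs reduces noise from one-off costs such as first-call initialization.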

### How does this work?

When we ran `%load_ext cudf.pandas`, we made it so that `import pandas` (or an import of any of its submodules) returns a proxy module:

In [None]:
pd

That proxy module is composed of proxy functions, and proxy types containing proxy methods:

In [None]:
print(type(pd.read_csv))
print(type(pd.DataFrame))
print(type(pd.DataFrame.max))

Operations on proxy functions and methods dispatch to cuDF or pandas:

<img src="how-cudf-pandas-works.png" width="700">
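
The dispatch idea can be illustrated with a toy model. This is only a sketch of the general try-fast-then-fall-back pattern, not how `cudf.pandas` is actually implemented; `FallbackProxy`, `FastBackend`, and `SlowBackend` are hypothetical names invented for this example.

```python
class FallbackProxy:
    """Toy proxy: try the fast backend first, fall back to the slow one."""

    def __init__(self, fast, slow):
        self._fast = fast
        self._slow = slow
        self.calls = []  # records which backend served each call

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself.
        def dispatch(*args, **kwargs):
            fast_fn = getattr(self._fast, name, None)
            if fast_fn is not None:
                try:
                    result = fast_fn(*args, **kwargs)
                    self.calls.append((name, "fast"))
                    return result
                except NotImplementedError:
                    pass  # unsupported on the fast path: fall through
            result = getattr(self._slow, name)(*args, **kwargs)
            self.calls.append((name, "slow"))
            return result

        return dispatch


class FastBackend:
    def total(self, values):
        return sum(values)

    def row_count(self, values):
        raise NotImplementedError("not supported on the fast path")


class SlowBackend:
    def total(self, values):
        return sum(values)

    def row_count(self, values):
        return len(values)


proxy = FallbackProxy(FastBackend(), SlowBackend())
proxy.total([1, 2, 3])      # served by the fast backend
proxy.row_count([1, 2, 3])  # falls back to the slow backend
print(proxy.calls)
```

In the real library, the fallback additionally involves moving data between GPU and CPU memory, which this toy model leaves out.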

### Why is `.count(axis=1)` slower when `cudf.pandas` is enabled?

As you can see from the diagram above, when an operation isn't supported by cuDF, we copy data from GPU to CPU and then use pandas for that operation. This copying can add significant overhead, especially if the data is large.
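
One way to sidestep such a fallback is to express the same result with different operations. As a sketch, in plain pandas a row-wise count of non-null values can also be written as `notna().sum(axis=1)`. Whether this form stays on the GPU under `cudf.pandas` is an assumption that depends on the cuDF version, so it is worth confirming with the profiler.

```python
import numpy as np
import pandas as pd

# Small frame with missing values in a float column and a string column.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "x": [0.5, np.nan, 1.5],
        "s": ["a", None, None],
    }
)

via_count = df.count(axis=1)        # row-wise count of non-null values
via_notna = df.notna().sum(axis=1)  # same result via a boolean mask and a sum

print(via_count.equals(via_notna))
```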

### Can we use `cudf.pandas` with other libraries?

When `cudf.pandas` is enabled, you can still pass DataFrames to other libraries and expect things to work:

In [None]:
import seaborn as sns

sns.scatterplot(x=df2.x[::10], y=df2.y[::10])

## Part 2: Understanding Performance

Let's generate some data and do some timeseries operations with it:

In [None]:
%%time

# generate some random timeseries data:
rng = pd.date_range("2023-01-01", "2023-02-01", freq="10ms")
data = pd.DataFrame(
    {
        "a": np.random.rand(len(rng)),
        "b": np.random.rand(len(rng))
    },
    index=rng
)

# filter the data to just between 9:30am and 4pm:
data = data.iloc[rng.indexer_between_time("09:30", "16:00")]

# get daily means:
results = data.groupby(pd.Grouper(freq="1D")).mean()
results.head()

That runs quite slowly, even when `cudf.pandas` is enabled. Notice what happens when you run the same code with the `%%cudf.pandas.profile` magic: 

### Using the profiler

In [None]:
%%cudf.pandas.profile

rng = pd.date_range("2023-01-01", "2023-02-01", freq="10ms")
data = pd.DataFrame(
    {
        "a": np.random.rand(len(rng)),
        "b": np.random.rand(len(rng))
    },
    index=rng
)
data = data.iloc[rng.indexer_between_time("09:30", "16:00")]
results = data.groupby(pd.Grouper(freq="1D")).mean()
results.head()

We get a report that shows which functions executed on the GPU and which executed on the CPU. In the code above, everything executed on the GPU except for the `indexer_between_time` function, which is [not supported by the cuDF library](https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/).

### Optimizing our code for GPU execution

The key to getting great performance with `cudf.pandas` is to minimize the number of operations that fall back to the CPU. Occasionally a fallback cannot be avoided, but it can often be eliminated with a simple rewrite of your code. Let's rewrite the code to use operations that _are_ supported by cuDF:

In [None]:
%%time

rng = pd.date_range("2023-01-01", "2023-02-01", freq="10ms")
data = pd.DataFrame(
    {
        "a": np.random.rand(len(rng)),
        "b": np.random.rand(len(rng))
    },
    index=rng
)

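# note: using datetime properties instead of `indexer_between_time`.
# Keep rows from 09:30 (inclusive) up to 16:00 (exclusive); unlike
# `indexer_between_time`, this drops the single 16:00:00.000 sample each day.
minutes = rng.hour * 60 + rng.minute
data = data.iloc[(minutes >= 9 * 60 + 30) & (minutes < 16 * 60)]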

results = data.groupby(pd.Grouper(freq="1D")).mean()
results.head()

Not only is this _much_ faster on the GPU, but it's also quite a bit faster on the CPU - a nice win-win!
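
When swapping `indexer_between_time` for datetime-property comparisons, it helps to check on a small index that both approaches select the same rows. The snippet below is a minimal sketch using plain pandas on a coarse one-minute index; the minutes-since-midnight mask is one possible formulation, and `include_end=False` is passed so that both sides exclude 16:00 itself.

```python
import numpy as np
import pandas as pd

# Small, coarse index so the check runs instantly.
rng = pd.date_range("2023-01-01", "2023-01-03", freq="1min")

# Reference: positions selected by indexer_between_time, as a boolean mask.
expected = np.zeros(len(rng), dtype=bool)
expected[rng.indexer_between_time("09:30", "16:00", include_end=False)] = True

# Rewrite: minutes since midnight, 09:30 inclusive to 16:00 exclusive.
minutes = rng.hour * 60 + rng.minute
mask = np.asarray((minutes >= 9 * 60 + 30) & (minutes < 16 * 60))

print(np.array_equal(mask, expected))
```

A mismatch here would show up as `False`, which is much cheaper to discover than a silently different daily mean on the full dataset.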

## Part 3: Third-party code acceleration

In this section, we'll demonstrate how `cudf.pandas` works with third-party libraries that depend on pandas.

We'll load some data into a DataFrame and use [langchain's Pandas integration](https://python.langchain.com/docs/integrations/toolkits/pandas) to answer questions about that data. We'll see that even though langchain doesn't know anything about cuDF, it will still automagically use the GPU to answer those questions much faster than with regular pandas!

⚠️ Note that at the time of writing, `langchain` is undergoing considerable changes (for example, see [here](https://github.com/langchain-ai/langchain/discussions/14243)). You may have to change some of the code in this notebook to make it work.

💰❗ Here we're using OpenAI's `gpt-4` model with langchain ([setup instructions](https://python.langchain.com/docs/integrations/llms/openai)). Note that at the time of writing, you need to buy credits from OpenAI to use this model via API.

In [None]:
from langchain.agents.agent_types import AgentType
from langchain.chat_models import ChatOpenAI
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain.llms import OpenAI

from langchain.cache import SQLiteCache
from langchain.globals import set_llm_cache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))

### Loading the data

In [None]:
df = pd.concat(
    [
        pd.read_parquet("yellow_tripdata_2021-{:02d}.parquet".format(i))
        for i in range(1, 13)
    ]
)

df = df[["VendorID",
         "tpep_pickup_datetime",
         "tpep_dropoff_datetime",
         "passenger_count",
         "tip_amount"]]

df.head()

### Asking questions about our data (note: responses are cached!)

In [None]:
agent = create_pandas_dataframe_agent(
    ChatOpenAI(temperature=0, model="gpt-4"),
    df,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    handle_parsing_errors=True,
)

In [None]:
%%time
print(agent.run("How many rows are there?"))

In [None]:
%%time
print(agent.run("Which vendor received the most tips?"))

In [None]:
%%time
print(
    agent.run(
        """
        Which 30-minute, 1-hour, 2-hour, 5-hour and 24-hour windows have the most trips?
        Don't use any inplace operations please!
        """
    )
)