# 2 minutes to cudf.pandas 🚀🐼

In this notebook, we run a few basic pandas operations on randomly generated data and time each one.

Comment the next line in or out to enable or disable `cudf.pandas`. When enabled, it has to run before pandas is imported:

In [None]:
%load_ext cudf.pandas

In [None]:
import pandas as pd
import numpy as np
import random

In [None]:
pd.DataFrame({'a': [1]})  # warm-up call, so one-time setup cost is not counted in the timings below

In [None]:
STRINGS = ["".join(random.choices("abcdefg", k=5)) for _ in range(1000)] + [None]

In [None]:
def make_df(size):
    return pd.DataFrame(
        {
            "id": np.random.randint(low=0, high=100, size=size),
            "x": np.random.rand(size),
            "y": np.random.rand(size),
            "s": random.choices(STRINGS, k=size)
        }
    )

In [None]:
df1 = make_df(10_000_000)
df2 = make_df(10_000)

In [None]:
%%time
df1.groupby("id").mean()

In [None]:
%%time
df1["s"].str.contains("a")

In [None]:
%%time
df1.merge(df2, on=["id", "s"], how="left")

In [None]:
%%time
df1.count(axis=0)

In [None]:
%%time
df1.count(axis=1)

## How does this work?

When we ran `%load_ext cudf.pandas`, it arranged for `import pandas` (or an import of any of its submodules) to return a proxy module instead of the regular one:

In [None]:
pd

That proxy module is composed of proxy functions, and proxy types containing proxy methods:

In [None]:
print(type(pd.read_csv))
print(type(pd.DataFrame))
print(type(pd.DataFrame.max))

Calls to proxy functions and methods are dispatched to cuDF when it supports the operation, and to pandas otherwise:

<img src="how-cudf-pandas-works.png" width="700">

### Why is `.count(axis=1)` slower when `cudf.pandas` is enabled?

As you can see from the diagram above, when an operation isn't supported by cuDF, we copy the data from GPU to CPU and then use pandas for that operation. That transfer adds overhead on top of the pandas computation itself, so the operation can end up slower than with plain pandas.

In [None]:
# importing cudf directly to show that .count(axis=1) is not supported

import cudf

df = cudf.DataFrame({'a': [1, 2], 'b': [2, 3]})
df.count(axis=1)
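The try-fast-then-fall-back pattern described above can be sketched without a GPU. This is a simplified illustration, not the actual `cudf.pandas` dispatch logic: `FastBackend`, `SlowBackend`, and `dispatch` are hypothetical names, and using `NotImplementedError` as the fallback signal is an assumption made for the example.

```python
class FastBackend:
    """Stand-in for an accelerated library that supports only some operations."""

    def total(self, values):
        return sum(values)

    def count_rows(self, values):
        # Pretend this operation has no accelerated implementation.
        raise NotImplementedError("count_rows is not supported by FastBackend")


class SlowBackend:
    """Stand-in for the reference library that supports everything."""

    def total(self, values):
        return sum(values)

    def count_rows(self, values):
        return len(values)


def dispatch(op, values, fast=FastBackend(), slow=SlowBackend()):
    """Try the fast backend first; fall back to the slow one if it refuses."""
    try:
        return getattr(fast, op)(values), "fast"
    except NotImplementedError:
        # Stand-in for the device-to-host copy; this is the extra cost
        # paid on every fallback.
        host_copy = list(values)
        return getattr(slow, op)(host_copy), "slow"


supported = dispatch("total", [1, 2, 3])
unsupported = dispatch("count_rows", [1, 2, 3])
print(supported, unsupported)
```

The second call still returns the right answer, but only after a failed attempt and a copy, which mirrors why `df1.count(axis=1)` was the slow cell in the timings.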

## Can we use `cudf.pandas` with other libraries?

When `cudf.pandas` is enabled, you can still pass DataFrames to other libraries and expect things to work:

In [None]:
import seaborn as sns

sns.scatterplot(x=df2.x[::10], y=df2.y[::10])