<a target="_blank" href="https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/getting_started_tutorials/10min_to_cudf_colab.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


# 10 Minutes to cuDF

Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF, geared mainly towards new users.

[cuDF](https://github.com/rapidsai/cudf) is a Python GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating tabular data through a DataFrame API in the style of [pandas](https://pandas.pydata.org).

cuDF runs on a single GPU. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you may want to use [Dask-cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf).


Before getting started, be sure to change your runtime to use a GPU hardware accelerator. Use the Runtime -> "Change runtime type" menu option to add a GPU.

# Let's get started using RAPIDS

In [1]:
!nvidia-smi

Fri May  9 00:22:58 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA H100 PCIe               Off |   00000000:01:00.0 Off |                    0 |
| N/A   36C    P0             45W /  310W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
import pandas as pd
import cudf

import cupy as cp
import os

cp.random.seed(0)

## Creating Series and DataFrame objects


Creating a `cudf.Series`.

In [3]:
s = cudf.Series([1, 2, 3, None, 4])
s

0       1
1       2
2       3
3    <NA>
4       4
dtype: int64

Creating a `cudf.DataFrame` by specifying values for each column.

In [4]:
df = cudf.DataFrame(
    {
        "a": list(range(20)),
        "b": list(reversed(range(20))),
        "c": list(range(20)),
    }
)
df

,a,b,c
0,0,19,0
1,1,18,1
2,2,17,2
3,3,16,3
4,4,15,4
5,5,14,5
6,6,13,6
7,7,12,7
8,8,11,8
9,9,10,9


Creating a `cudf.DataFrame` from a pandas `DataFrame`.

In [5]:
pdf = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0.1, 0.2, None, 0.3]})
gdf = cudf.DataFrame.from_pandas(pdf)
gdf

,a,b
0,0,0.1
1,1,0.2
2,2,
3,3,0.3


## Viewing Data

Viewing the top rows of a GPU dataframe.

In [6]:
df.head(2)

,a,b,c
0,0,19,0
1,1,18,1


# Selecting Data

## Getting

Selecting a single column, which yields a `cudf.Series`.

In [7]:
df["a"]

0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
Name: a, dtype: int64

## Selection by Label

Selecting rows from index 2 to index 5 from columns 'a' and 'b'.

In [8]:
df.loc[2:5, ["a", "b"]]

,a,b
2,2,17
3,3,16
4,4,15
5,5,14


## Selection by Position

Selecting via integers and integer slices, like numpy/pandas.

In [9]:
df.iloc[0]

a     0
b    19
c     0
Name: 0, dtype: int64

In [10]:
df.iloc[0:3, 0:2]

,a,b
0,0,19
1,1,18
2,2,17


You can also select elements of a `DataFrame` or `Series` with direct index access.

In [11]:
df[3:5]

,a,b,c
3,3,16,3
4,4,15,4


In [12]:
s[3:5]

3    <NA>
4       4
dtype: int64

## Boolean Indexing

Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing.

In [13]:
df[df.b > 15]

,a,b,c
0,0,19,0
1,1,18,1
2,2,17,2
3,3,16,3


Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API.

In [14]:
df.query("b == 3")

,a,b,c
16,16,3,16


In cuDF, you can reference a Python variable inside the query expression either by passing it through the `local_dict` keyword or by prefixing its name with `@`. Supported comparison operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`.

In [15]:
cudf_comparator = 3
df.query("b == @cudf_comparator")

,a,b,c
16,16,3,16
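
The cell above uses the `@` form. The `local_dict` form can be sketched as follows. This is a minimal illustration, assuming `query` accepts a `local_dict` mapping as described above; the names `frame` and `threshold` are hypothetical, and the import falls back to pandas (whose `query` accepts the same keyword) so the cell also runs on a machine without a GPU.

```python
# Use cuDF when it is available; otherwise fall back to pandas,
# whose query() accepts the same local_dict keyword.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

frame = xdf.DataFrame(
    {"a": list(range(20)), "b": list(reversed(range(20)))}
)

# "threshold" is a hypothetical name that exists only inside local_dict,
# not as a variable in the surrounding Python scope.
result = frame.query("b == @threshold", local_dict={"threshold": 3})
print(result)
```

Passing values through `local_dict` keeps the expression independent of whatever variables happen to be defined in the calling scope.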


Using the `isin` method for filtering.

In [16]:
df[df.a.isin([0, 5])]

,a,b,c
0,0,19,0
5,5,14,5


## MultiIndex

cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex.

In [17]:
arrays = [["a", "a", "b", "b"], [1, 2, 3, 4]]
tuples = list(zip(*arrays))
idx = cudf.MultiIndex.from_tuples(tuples)
idx

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 3),
            ('b', 4)],
           )

This index can back either axis of a DataFrame.

In [18]:
gdf1 = cudf.DataFrame(
    {"first": cp.random.rand(4), "second": cp.random.rand(4)}
)
gdf1.index = idx
gdf1

,,first,second
a,1,0.438451,0.053011
a,2,0.460365,0.337699
b,3,0.250215,0.396763
b,4,0.494744,0.874419


In [19]:
gdf2 = cudf.DataFrame(
    {"first": cp.random.rand(4), "second": cp.random.rand(4)}
).T
gdf2.columns = idx
gdf2

,a,a,b,b
,1,2,3,4
first,0.482167,0.04284,0.508414,0.65455
second,0.512604,0.2643,0.051981,0.578997


Accessing values of a DataFrame with a MultiIndex, both with `.loc`

In [20]:
gdf1.loc[("b", 3)]

first     0.250215
second    0.396763
Name: ('b', 3), dtype: float64

And `.iloc`

In [21]:
gdf1.iloc[0:2]

,,first,second
a,1,0.438451,0.053011
a,2,0.460365,0.337699


Missing Data
------------

Missing data can be replaced by using the `fillna` method.

In [22]:
s.fillna(999)

0      1
1      2
2      3
3    999
4      4
dtype: int64

# Operating on Data

## Stats

Calculating descriptive statistics for a `Series`.

In [23]:
s.mean(), s.var()

(np.float64(2.5), np.float64(1.6666666666666667))
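
The null in `s` does not contribute to either statistic, and `var` is the sample variance. A minimal sketch of that behaviour, assuming the default null handling (nulls skipped, as in pandas) and the default `ddof=1`; the names `with_null` and `without_null` are illustrative, and the import falls back to pandas so the cell also runs without a GPU.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

with_null = xdf.Series([1, 2, 3, None, 4])
without_null = with_null.dropna()

# Both reductions ignore the missing entry, so dropping it first
# gives the same results.
mean_with = float(with_null.mean())
mean_without = float(without_null.mean())
var_with = float(with_null.var())
var_without = float(without_null.var())

print(mean_with, mean_without)
print(var_with, var_without)
```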

## Applymap

Applying functions to a `Series`. Note that applying user-defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) to apply a function to each partition of the distributed dataframe.

In [24]:
def add_ten(num):
    return num + 10


df["a"].apply(add_ten)

0     10
1     11
2     12
3     13
4     14
5     15
6     16
7     17
8     18
9     19
10    20
11    21
12    22
13    23
14    24
15    25
16    26
17    27
18    28
19    29
Name: a, dtype: int64

## Histogramming

Counting the number of occurrences of each unique value in a column.

In [25]:
df.a.value_counts()

a
15    1
10    1
18    1
2     1
11    1
5     1
3     1
16    1
9     1
12    1
19    1
6     1
7     1
8     1
13    1
4     1
17    1
14    1
1     1
0     1
Name: count, dtype: int64

## String Methods

Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the [cuDF API documentation](https://docs.rapids.ai/api/cudf/stable/api_docs/series.html#string-handling) for more information.

In [26]:
s = cudf.Series(["A", "B", "C", "Aaba", "Baca", None, "CABA", "dog", "cat"])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: object

In addition to simple manipulation, we can also match strings using [regular expressions](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.core.column.string.StringMethods.match.html).

In [27]:
s.str.match("^[aAc].+")

0    False
1    False
2    False
3     True
4    False
5     <NA>
6    False
7    False
8     True
dtype: bool

## Concat

Concatenating `Series` and `DataFrames` row-wise.

In [28]:
s = cudf.Series([1, 2, 3, None, 5])
cudf.concat([s, s])

0       1
1       2
2       3
3    <NA>
4       5
0       1
1       2
2       3
3    <NA>
4       5
dtype: int64

## Join

Performing SQL-style merges. Note that the dataframe order is **not maintained**, but may be restored post-merge by sorting by the index.

In [29]:
df_a = cudf.DataFrame()
df_a["key"] = ["a", "b", "c", "d", "e"]
df_a["vals_a"] = [float(i + 10) for i in range(5)]

df_b = cudf.DataFrame()
df_b["key"] = ["a", "c", "e"]
df_b["vals_b"] = [float(i + 100) for i in range(3)]

merged = df_a.merge(df_b, on=["key"], how="left")
merged

,key,vals_a,vals_b
0,a,10.0,100.0
1,c,12.0,101.0
2,e,14.0,102.0
3,b,11.0,
4,d,13.0,
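
One way to make the original row order recoverable is to carry it through the merge as an ordinary column and sort on it afterwards. The sketch below assumes that approach; `row_order` is a hypothetical helper column, not part of the cuDF API, and the import falls back to pandas so the cell also runs without a GPU.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

df_a = xdf.DataFrame(
    {"key": ["a", "b", "c", "d", "e"],
     "vals_a": [10.0, 11.0, 12.0, 13.0, 14.0]}
)
df_b = xdf.DataFrame(
    {"key": ["a", "c", "e"], "vals_b": [100.0, 101.0, 102.0]}
)

# Record each left-hand row's position before merging.
left = df_a.reset_index(drop=True)
left["row_order"] = list(range(len(left)))

merged = left.merge(df_b, on="key", how="left")

# Sort on the recorded position, then drop the helper column.
restored = (
    merged.sort_values("row_order")
    .drop(columns=["row_order"])
    .reset_index(drop=True)
)
print(restored)
```

After the sort, the keys appear in the same order as in `df_a`, with nulls in `vals_b` for the keys that have no match in `df_b`.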


## Grouping

Like [pandas](https://pandas.pydata.org/docs/user_guide/groupby.html), cuDF and Dask-cuDF support the [Split-Apply-Combine groupby paradigm](https://doi.org/10.18637/jss.v040.i01).

In [30]:
df["agg_col1"] = [1 if x % 2 == 0 else 0 for x in range(len(df))]
df["agg_col2"] = [1 if x % 3 == 0 else 0 for x in range(len(df))]


Grouping and then applying the `sum` function to the grouped data.

In [31]:
df.groupby("agg_col1").sum()

agg_col1,a,b,c,agg_col2
1,90,100,90,4
0,100,90,100,3


Grouping hierarchically then applying the `sum` function to grouped data.

In [32]:
df.groupby(["agg_col1", "agg_col2"]).sum()

agg_col1,agg_col2,a,b,c
1,1,36,40,36
1,0,54,60,54
0,1,27,30,27
0,0,73,60,73


Grouping and applying statistical functions to specific columns, using `agg`.

In [33]:
df.groupby("agg_col1").agg({"a": "max", "b": "mean", "c": "sum"})

agg_col1,a,b,c
1,18,10.0,90
0,19,9.0,100


## Sorting

Sorting by values.

In [34]:
df.sort_values(by="b")

,a,b,c,agg_col1,agg_col2
19,19,0,19,0,0
18,18,1,18,1,1
17,17,2,17,0,0
16,16,3,16,1,0
15,15,4,15,0,1
14,14,5,14,1,0
13,13,6,13,0,0
12,12,7,12,1,1
11,11,8,11,0,0
10,10,9,10,1,0


## Transpose

Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF.

In [35]:
sample = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
sample

,a,b
0,1,4
1,2,5
2,3,6


In [36]:
sample.transpose()

,0,1,2
a,1,2,3
b,4,5,6
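
When the columns do not share a type, one option is to cast them to a common dtype before transposing. A minimal sketch, assuming `float64` is an acceptable common type for the data; the frame `mixed` is hypothetical, and the import falls back to pandas so the cell also runs without a GPU.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# One integer column and one floating-point column.
mixed = xdf.DataFrame({"a": [1, 2, 3], "b": [4.5, 5.5, 6.5]})

# Cast every column to a single dtype, then transpose.
uniform = mixed.astype("float64")
transposed = uniform.T
print(transposed)
```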


Time Series
------------

`DataFrames` support `datetime`-typed columns, which allow users to interact with and filter data based on specific timestamps.

In [37]:
import datetime as dt

date_df = cudf.DataFrame()
date_df["date"] = pd.date_range("11/20/2018", periods=72, freq="D")
date_df["value"] = cp.random.sample(len(date_df))

search_date = dt.datetime.strptime("2018-11-23", "%Y-%m-%d")
date_df.query("date <= @search_date")

,date,value
0,2018-11-20,0.385556
1,2018-11-21,0.908215
2,2018-11-22,0.64162
3,2018-11-23,0.283399


Categoricals
------------

`DataFrames` support categorical columns.

In [38]:
gdf = cudf.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "grade": ["a", "b", "b", "a", "a", "e"]}
)
gdf["grade"] = gdf["grade"].astype("category")
gdf

,id,grade
0,1,a
1,2,b
2,3,b
3,4,a
4,5,a
5,6,e


Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF.

In [39]:
gdf.grade.cat.categories

Index(['a', 'b', 'e'], dtype='object')

Accessing the underlying code values of each categorical observation.

In [40]:
gdf.grade.cat.codes

0    0
1    1
2    1
3    0
4    0
5    2
dtype: uint8
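
Each code is a position in `cat.categories`, so indexing the categories with the codes recovers the original labels. A small sketch of that relationship; the names `grades` and `decoded` are illustrative, the data is copied to host memory as plain Python lists to keep the mapping explicit, and the import falls back to pandas so the cell also runs without a GPU.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

grades = xdf.Series(["a", "b", "b", "a", "a", "e"]).astype("category")

categories = grades.cat.categories
codes = grades.cat.codes
if hasattr(categories, "to_pandas"):  # cuDF objects: copy to host first
    categories, codes = categories.to_pandas(), codes.to_pandas()

# Plain Python lists make the lookup explicit.
categories = list(categories)
codes = [int(c) for c in codes]

decoded = [categories[c] for c in codes]
print(codes)    # positions into `categories`
print(decoded)  # the original labels
```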

# Converting Data Representation

## Pandas

Converting a cuDF `DataFrame` to a pandas `DataFrame`.

In [41]:
df.head().to_pandas()

,a,b,c,agg_col1,agg_col2
0,0,19,0,1,1
1,1,18,1,0,0
2,2,17,2,1,0
3,3,16,3,0,1
4,4,15,4,1,0


## Numpy

Converting a cuDF or Dask-cuDF `DataFrame` to a numpy `ndarray`.

In [42]:
df.to_numpy()

array([[ 0, 19,  0,  1,  1],
       [ 1, 18,  1,  0,  0],
       [ 2, 17,  2,  1,  0],
       [ 3, 16,  3,  0,  1],
       [ 4, 15,  4,  1,  0],
       [ 5, 14,  5,  0,  0],
       [ 6, 13,  6,  1,  1],
       [ 7, 12,  7,  0,  0],
       [ 8, 11,  8,  1,  0],
       [ 9, 10,  9,  0,  1],
       [10,  9, 10,  1,  0],
       [11,  8, 11,  0,  0],
       [12,  7, 12,  1,  1],
       [13,  6, 13,  0,  0],
       [14,  5, 14,  1,  0],
       [15,  4, 15,  0,  1],
       [16,  3, 16,  1,  0],
       [17,  2, 17,  0,  0],
       [18,  1, 18,  1,  1],
       [19,  0, 19,  0,  0]])

Converting a cuDF or Dask-cuDF `Series` to a numpy `ndarray`.

In [43]:
df["a"].to_numpy()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

## Arrow

Converting a cuDF or Dask-cuDF `DataFrame` to a PyArrow `Table`.

In [44]:
df.to_arrow()

pyarrow.Table
a: int64
b: int64
c: int64
agg_col1: int64
agg_col2: int64
----
a: [[0,1,2,3,4,...,15,16,17,18,19]]
b: [[19,18,17,16,15,...,4,3,2,1,0]]
c: [[0,1,2,3,4,...,15,16,17,18,19]]
agg_col1: [[1,0,1,0,1,...,0,1,0,1,0]]
agg_col2: [[1,0,0,1,0,...,1,0,0,1,0]]

# Reading and Writing Data

## CSV

Writing to a CSV file.

In [45]:
if not os.path.exists("example_output"):
    os.mkdir("example_output")

df.to_csv("example_output/foo.csv", index=False)

Reading from a CSV file.

In [46]:
df = cudf.read_csv("example_output/foo.csv")
df

,a,b,c,agg_col1,agg_col2
0,0,19,0,1,1
1,1,18,1,0,0
2,2,17,2,1,0
3,3,16,3,0,1
4,4,15,4,1,0
5,5,14,5,0,0
6,6,13,6,1,1
7,7,12,7,0,0
8,8,11,8,1,0
9,9,10,9,0,1


Note that for the Dask-cuDF case, we use `dask_cudf.read_csv` in preference to `dask_cudf.from_cudf(cudf.read_csv)`, since the former can parallelize across multiple GPUs and handle CSV files larger than would fit in memory on a single GPU.

Reading all CSV files in a directory into a single `dask_cudf.DataFrame`, using the star wildcard.

## Parquet

Writing to parquet files with cuDF's GPU-accelerated parquet writer.

In [47]:
df.to_parquet("example_output/temp_parquet")

Reading parquet files with cuDF's GPU-accelerated parquet reader.

In [48]:
df = cudf.read_parquet("example_output/temp_parquet")
df

,a,b,c,agg_col1,agg_col2
0,0,19,0,1,1
1,1,18,1,0,0
2,2,17,2,1,0
3,3,16,3,0,1
4,4,15,4,1,0
5,5,14,5,0,0
6,6,13,6,1,1
7,7,12,7,0,0
8,8,11,8,1,0
9,9,10,9,0,1


## ORC

Writing ORC files.

In [49]:
df.to_orc("example_output/temp_orc")

And reading

In [50]:
df2 = cudf.read_orc("example_output/temp_orc")
df2

,a,b,c,agg_col1,agg_col2
0,0,19,0,1,1
1,1,18,1,0,0
2,2,17,2,1,0
3,3,16,3,0,1
4,4,15,4,1,0
5,5,14,5,0,0
6,6,13,6,1,1
7,7,12,7,0,0
8,8,11,8,1,0
9,9,10,9,0,1


# Performance with large dataframes

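cuDF is built to handle large dataframes. In this example we group 100 million rows by a key and average the values in each group, timing cuDF first and then pandas on the same data. In the run recorded here, on the H100 shown at the top of the notebook, the cuDF groupby took about 16 ms and the pandas groupby about 1.17 s: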

In [51]:
nr = 100_000_000
df = cudf.DataFrame({
    'key': cp.random.randint(0, 10, nr),
    'value': cp.random.random(nr)
})

%time df.groupby('key')['value'].mean()

pdf = df.to_pandas()
%time pdf.groupby('key')['value'].mean()

CPU times: user 13.7 ms, sys: 1.97 ms, total: 15.7 ms
Wall time: 15.7 ms
CPU times: user 1.08 s, sys: 81.8 ms, total: 1.17 s
Wall time: 1.17 s


key
0    0.500149
1    0.499845
2    0.499982
3    0.499855
4    0.499957
5    0.499990
6    0.500085
7    0.499953
8    0.500066
9    0.500020
Name: value, dtype: float64

cuDF also has efficient join algorithms. In this example we use a hash join to combine values from two dataframes based on a key:

In [52]:
nr = 50_000_000
df = cudf.DataFrame({
    'key': cp.random.randint(0, 10, nr),
    'value': cp.random.random(nr)
})
lookup = cudf.DataFrame({
    'key': range(10),
    'lookup': cp.random.random(10)
})

%time df.merge(lookup, on='key')

pdf = df.to_pandas()
plookup = lookup.to_pandas()
%time pdf.merge(plookup, on='key')

CPU times: user 18.2 ms, sys: 1.02 ms, total: 19.2 ms
Wall time: 19.3 ms
CPU times: user 1.74 s, sys: 333 ms, total: 2.07 s
Wall time: 2.08 s


,key,value,lookup
0,6,0.576150,0.086387
1,5,0.228107,0.430998
2,1,0.023034,0.031909
3,5,0.340157,0.430998
4,7,0.186632,0.114731
...,...,...,...
49999995,8,0.354744,0.068901
49999996,5,0.722694,0.430998
49999997,6,0.005692,0.086387
49999998,1,0.785842,0.031909


Computing and applying filters to cuDF dataframes is also efficient.

In [53]:
nr = 20_000_000
df = cudf.DataFrame({
    'rating_a': cp.random.randint(1, 5, nr),
    'rating_b': cp.random.randint(1, 5, nr),
    'rating_c': cp.random.randint(1, 5, nr),
})

%time df.where(df>3)

pdf = df.to_pandas()
%time pdf.where(pdf>3)

CPU times: user 7.06 ms, sys: 2 ms, total: 9.06 ms
Wall time: 9.1 ms
CPU times: user 267 ms, sys: 72.9 ms, total: 340 ms
Wall time: 341 ms


,rating_a,rating_b,rating_c
0,4.0,4.0,
1,,,
2,,,
3,,,4.0
4,,,
...,...,...,...
19999995,,,4.0
19999996,,,
19999997,,4.0,4.0
19999998,,4.0,4.0


Sorting is another operation that cuDF accelerates substantially.

In [54]:
nr = 10_000_000
df = cudf.DataFrame({
    'a': cp.random.rand(nr),
    'b': cp.random.rand(nr),
    'c': cp.random.rand(nr),
})

%time df.sort_values('a')

pdf = df.to_pandas()
%time pdf.sort_values('a')

CPU times: user 9.13 ms, sys: 2.01 ms, total: 11.1 ms
Wall time: 11.2 ms
CPU times: user 1.96 s, sys: 53.1 ms, total: 2.01 s
Wall time: 2.01 s


,a,b,c
7327996,5.919987e-08,0.103876,0.362604
2624195,4.288834e-07,0.952080,0.597901
7367962,6.171877e-07,0.592560,0.187291
5549822,7.305300e-07,0.384424,0.512523
4596785,7.623830e-07,0.084830,0.753644
...,...,...,...
1258033,9.999995e-01,0.267930,0.605742
7004706,9.999996e-01,0.300102,0.418403
1955702,9.999998e-01,0.428556,0.238820
7634263,9.999999e-01,0.025396,0.385628
