[Data] - Add `cudf` as a batch_format by goutamvenkat-anyscale · Pull Request #61329 · ray-project/ray

goutamvenkat-anyscale · 2026-02-25T22:28:10Z

Description

As the title states: this PR adds `cudf` as a `batch_format` for Ray Data.

Related issues

Closes #61325

Additional information


Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale · 2026-02-25T22:31:00Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces cudf as a new batch_format in Ray Data, enabling more efficient GPU-native data processing pipelines. The changes are extensive, touching core data block abstractions, expression evaluation, and various dataset operations to support cudf.DataFrame as a first-class citizen. This includes a new CudfBlockAccessor with optimized GPU-native implementations for sorting and partitioning. The addition of comprehensive tests and CI configuration for cuDF is also a great step. My feedback focuses on a few opportunities for performance improvement and code simplification.

python/ray/data/_internal/cudf_block.py

python/ray/data/block.py
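For orientation, here is a minimal sketch of how the new batch format might be used from a pipeline, assuming a GPU worker with `cudf` installed. The names `add_one_on_gpu` and `run_example` are hypothetical. A later commit in this PR mentions an experimental flag whose name is not shown in this thread, so extra configuration may be required.

```python
def add_one_on_gpu(batch):
    """Increment the `id` column; `batch` is expected to be a cudf.DataFrame."""
    # Column arithmetic stays on the GPU when `batch` is a cudf.DataFrame.
    batch["id"] = batch["id"] + 1
    return batch


def run_example():
    """Run a tiny pipeline; requires Ray, cudf, and at least one GPU."""
    import ray  # imported lazily so defining this sketch needs no GPU stack

    ds = ray.data.range(8)  # produces a single integer column named "id"
    ds = ds.map_batches(add_one_on_gpu, batch_format="cudf", num_gpus=1)
    return ds.take_all()
```

Calling `run_example()` on a machine without a GPU or without `cudf` is expected to fail. Only the shape of the `map_batches` call is the point here.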

gemini-code-assist

Code Review

This pull request introduces cudf as a new batch format for Ray Data, enabling GPU-native pipelines. The changes include adding CudfBlockAccessor, CudfBlockBuilder, and CudfBlockColumnAccessor implementations, integrating cudf into expression evaluation, and updating relevant docstrings and batch conversion utilities. A new test file test_cudf_batch_format.py has been added to cover the new functionality, and cudf has been added to the GPU requirements. Overall, the implementation is well-structured and considers performance aspects like lazy imports and GPU-native operations. One minor issue was found in the CudfRow.__getitem__ method's handling of single-item access.

python/ray/data/_internal/cudf_block.py


iamjustinhsu

didn't look at tests

python/ray/data/block.py

python/ray/data/_internal/cudf_block.py

iamjustinhsu · 2026-02-27T01:18:13Z

python/ray/data/_internal/table_block.py


+    def to_cudf(self) -> Any:
+        """Convert this block to a cudf.DataFrame (requires cudf to be installed)."""
+        import cudf


try_lazy_import?
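A minimal sketch of the lazy-import pattern this comment points at, using only the standard library. `lazy_import_or_none` and `to_cudf_or_raise` are hypothetical helpers written for illustration; the signature of Ray's own `try_lazy_import` is not shown in this thread and may differ.

```python
import importlib
from types import ModuleType
from typing import Optional


def lazy_import_or_none(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def to_cudf_or_raise(columns: dict):
    """Build a cudf.DataFrame from a dict of columns, with a clear error if cudf is missing."""
    cudf = lazy_import_or_none("cudf")
    if cudf is None:
        raise ImportError(
            "`cudf` is required for this conversion but is not installed."
        )
    return cudf.DataFrame(columns)
```

Deferring the import this way keeps CPU-only installations working, and the failure only surfaces when a cudf conversion is actually requested.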


iamjustinhsu · 2026-03-09T20:15:44Z

python/ray/data/tests/test_cudf_batch_format.py

+            assert list(batch.columns) == ["id"]
+        elif data_source == "range_tensor":
+            assert "data" in batch.columns


How come we don't also exhaust the iterator here, as in the else branch?


alexeykudinkin · 2026-03-11T00:49:59Z

python/ray/data/util/data_batch_conversion.py

            )
        return pyarrow.Table.from_pandas(data)
-
+    elif type == BatchFormat.CUDF:


Why are you making changes to these methods?

These are used by Train, and we should not accept "cudf" in iter_batches (explicitly disallowing it)

Scratch that actually, let's do it for consistency of expectations (to make it similar to every other format)


alexeykudinkin · 2026-03-11T01:20:47Z

python/ray/data/block.py

+        elif _is_cudf_dataframe(block):
+            from ray.data._internal.arrow_block import ArrowBlockAccessor
+
+            return ArrowBlockAccessor(block.to_arrow())


This shouldn't be possible, right?


alexeykudinkin · 2026-03-11T01:27:03Z

@goutamvenkat-anyscale LGTM, minor comments


cursor · 2026-03-11T19:08:47Z

python/ray/data/_internal/planner/plan_udf_map_op.py

            "`fn` to return a `pandas.DataFrame`, `pyarrow.Table`, "
-            "`numpy.ndarray`, `list`, or `dict[str, numpy.ndarray]`."
+            "`cudf.DataFrame`, `numpy.ndarray`, `list`, or "
+            "`dict[str, numpy.ndarray]`."


No cudf guard before Mapping check in validation

Low Severity

_validate_batch_output doesn't guard against cudf.DataFrame before the collections.abc.Mapping isinstance check at line 517. Since cudf.DataFrame implements the Mapping protocol (as noted in the comment in batch_to_block), a cudf batch enters the dict-validation path, where each column is checked via _is_valid_column_values. This is inconsistent with batch_to_block, which explicitly handles cudf before the Mapping check. If a cudf.Series ever fails is_ndarray_like, the error message would incorrectly reference "dict" values.


cursor · 2026-03-11T23:33:46Z

python/ray/data/_internal/planner/plan_udf_map_op.py

            "`fn` to return a `pandas.DataFrame`, `pyarrow.Table`, "
-            "`numpy.ndarray`, `list`, or `dict[str, numpy.ndarray]`."
+            "`cudf.DataFrame`, `numpy.ndarray`, `list`, or "
+            "`dict[str, numpy.ndarray]`."


cudf DataFrame fails Mapping validation in batch output

High Severity

A cudf.DataFrame implements the collections.abc.Mapping protocol (as explicitly noted in the batch_to_block comment). In _validate_batch_output, after passing the allowed check, a cudf.DataFrame will match isinstance(batch, collections.abc.Mapping) on line 517. This causes iteration over batch.items(), where each value is a cudf.Series. Since cudf.Series is not a list, np.ndarray, or ndarray-like, _is_valid_column_values returns False, and the function raises a misleading ValueError. The batch_to_block method correctly handles this by checking _is_cudf_dataframe before the Mapping branch, but the same guard is missing here.

Additional Locations (1)

python/ray/data/_internal/planner/plan_udf_map_op.py#L486-L498
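The ordering problem described in the two comments above can be sketched without a GPU. Everything below is a hypothetical stand-in: `FakeCudfDataFrame` only mimics the Mapping behaviour the comments attribute to `cudf.DataFrame`, and the validators are simplified and are not Ray's actual `_validate_batch_output`.

```python
import collections.abc


class FakeCudfDataFrame(collections.abc.Mapping):
    """Stand-in for cudf.DataFrame: dict-like, but its values are not lists or arrays."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, key):
        return self._columns[key]

    def __iter__(self):
        return iter(self._columns)

    def __len__(self):
        return len(self._columns)


def _is_valid_column_values(values) -> bool:
    # Simplified: the real check also accepts ndarray-like values.
    return isinstance(values, list)


def validate_without_guard(batch) -> None:
    # The DataFrame stand-in matches the Mapping branch and fails on its columns.
    if isinstance(batch, collections.abc.Mapping):
        for name, values in batch.items():
            if not _is_valid_column_values(values):
                raise ValueError(
                    f"dict value for column {name!r} has an unsupported type"
                )


def validate_with_guard(batch) -> None:
    # Handle the DataFrame type first, before the generic Mapping branch.
    if isinstance(batch, FakeCudfDataFrame):
        return
    validate_without_guard(batch)
```

With a frame such as `FakeCudfDataFrame({"id": object()})`, `validate_without_guard` raises a `ValueError` that talks about dict values, while `validate_with_guard` accepts it.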


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


[Data] - Add cudf as a batch_format

a62bac6

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale requested review from a team, matthewdeng and richardliaw as code owners February 25, 2026 22:28

goutamvenkat-anyscale added the `data` label (Ray Data-related issues) Feb 25, 2026

gemini-code-assist bot reviewed Feb 25, 2026

python/ray/data/_internal/cudf_block.py (outdated, resolved)

python/ray/data/_internal/cudf_block.py (outdated, resolved)

python/ray/data/block.py (outdated, resolved)

gemini-code-assist bot reviewed Feb 25, 2026

python/ray/data/_internal/cudf_block.py (outdated, resolved)

gemini comments

ee84a9d

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Feb 25, 2026

python/ray/data/block.py (outdated, resolved)

Some more comments

9fabcba

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Feb 25, 2026

python/ray/data/_internal/cudf_block.py (outdated, resolved)

python/ray/data/block.py (resolved)

goutamvenkat-anyscale added 3 commits February 25, 2026 16:04

Batch format only

036115b

Signed-off-by: Goutam <goutam@anyscale.com>

Clean up

135603d

Signed-off-by: Goutam <goutam@anyscale.com>

More cleanup

0903a61

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale changed the title ~~[Data] - Add cudf as a batch_format~~ [Data] - Add `cudf` as a batch_format Feb 26, 2026

cursor bot reviewed Feb 26, 2026

python/ray/data/_internal/cudf_block.py (outdated, resolved)

Test fixes

e3cfbe0

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Feb 26, 2026

python/ray/data/_internal/cudf_block.py (outdated, resolved)

goutamvenkat-anyscale added 2 commits February 26, 2026 10:05

fix tests

4944643

Signed-off-by: Goutam <goutam@anyscale.com>

Fix test

4d05618

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Feb 26, 2026

python/ray/data/_internal/planner/plan_expression/expression_evaluator.py (outdated, resolved)

python/ray/data/util/data_batch_conversion.py (resolved)

iamjustinhsu self-assigned this Feb 26, 2026

alexeykudinkin assigned alexeykudinkin and unassigned iamjustinhsu Feb 26, 2026

goutamvenkat-anyscale added 2 commits February 26, 2026 14:18

Merge from master

0e5dd7a

Signed-off-by: Goutam <goutam@anyscale.com>

Clean up

5db6225

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Feb 26, 2026

python/ray/data/_internal/cudf_block.py (outdated, resolved)

iamjustinhsu reviewed Feb 27, 2026

cursor bot reviewed Feb 27, 2026

python/ray/data/block.py (outdated, resolved)

python/ray/data/_internal/table_block.py (resolved)

goutamvenkat-anyscale added 4 commits February 27, 2026 10:18

Expr

4babd9a

Signed-off-by: Goutam <goutam@anyscale.com>

Fix tests

129b0db

Signed-off-by: Goutam <goutam@anyscale.com>

One more

0035242

Signed-off-by: Goutam <goutam@anyscale.com>

Add cudf tag

ac1f15c

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale added the `go` label (add ONLY when ready to merge, run all tests) Mar 2, 2026

goutamvenkat-anyscale added 3 commits March 2, 2026 13:20

arrow v9

1d57d07

Signed-off-by: Goutam <goutam@anyscale.com>

Merge branch 'master' into goutam/cudf_batch_format

9e1bac6

dep

6bfac8e

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Mar 3, 2026

.buildkite/data.rayci.yml (outdated, resolved)

Try again

6e51ff2

Signed-off-by: Goutam <goutam@anyscale.com>

iamjustinhsu approved these changes Mar 9, 2026

goutamvenkat-anyscale requested a review from alexeykudinkin March 10, 2026 18:07

alexeykudinkin reviewed Mar 11, 2026

goutamvenkat-anyscale added 2 commits March 11, 2026 11:40

Address comments

381901d

Signed-off-by: Goutam <goutam@anyscale.com>

one more comment

771d2d0

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Mar 11, 2026

python/ray/data/util/data_batch_conversion.py (resolved)

One more

5ffb607

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Mar 11, 2026


goutamvenkat-anyscale added 2 commits March 11, 2026 14:14

skip for_block check for cudf

170e90b

Signed-off-by: Goutam <goutam@anyscale.com>

Merge branch 'master' into goutam/cudf_batch_format

5cfea4a

cursor bot reviewed Mar 11, 2026


Remaining changes

639ff06

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Mar 12, 2026

python/ray/data/_internal/table_block.py (resolved)

goutamvenkat-anyscale added 3 commits March 11, 2026 18:36

Raise error

4831675

Signed-off-by: Goutam <goutam@anyscale.com>

Cursor

f49324c

Signed-off-by: Goutam <goutam@anyscale.com>

Add experimental flag

0b169cb

Signed-off-by: Goutam <goutam@anyscale.com>

richardliaw approved these changes Mar 12, 2026


richardliaw merged commit 63bc264 into ray-project:master Mar 12, 2026
6 checks passed

goutamvenkat-anyscale deleted the goutam/cudf_batch_format branch March 12, 2026 20:40
