Skip to content

[Data] - Add cudf as a batch_format#61329

Merged
richardliaw merged 30 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/cudf_batch_format
Mar 12, 2026
Merged

[Data] - Add cudf as a batch_format#61329
richardliaw merged 30 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/cudf_batch_format

Conversation

@goutamvenkat-anyscale
Copy link
Contributor

Description

As title states.

Related issues

Closes #61325

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale
Copy link
Contributor Author

/gemini review

@goutamvenkat-anyscale goutamvenkat-anyscale added the data Ray Data-related issues label Feb 25, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces cudf as a new batch_format in Ray Data, enabling more efficient GPU-native data processing pipelines. The changes are extensive, touching core data block abstractions, expression evaluation, and various dataset operations to support cudf.DataFrame as a first-class citizen. This includes a new CudfBlockAccessor with optimized GPU-native implementations for sorting and partitioning. The addition of comprehensive tests and CI configuration for cuDF is also a great step. My feedback focuses on a few opportunities for performance improvement and code simplification.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces cudf as a new batch format for Ray Data, enabling GPU-native pipelines. The changes include adding CudfBlockAccessor, CudfBlockBuilder, and CudfBlockColumnAccessor implementations, integrating cudf into expression evaluation, and updating relevant docstrings and batch conversion utilities. A new test file test_cudf_batch_format.py has been added to cover the new functionality, and cudf has been added to the GPU requirements. Overall, the implementation is well-structured and considers performance aspects like lazy imports and GPU-native operations. One minor issue was found in the CudfRow.__getitem__ method's handling of single-item access.

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale changed the title [Data] - Add cudf as a batch_format [Data] - Add cudf as a batch_format Feb 26, 2026
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Copy link
Contributor

@iamjustinhsu iamjustinhsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't look at tests


def to_cudf(self) -> Any:
"""Convert this block to a cudf.DataFrame (requires cudf to be installed)."""
import cudf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_lazy_import?

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Mar 2, 2026
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Comment on lines +66 to +68
assert list(batch.columns) == ["id"]
elif data_source == "range_tensor":
assert "data" in batch.columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how come we don't also exhaust the iterator like in the else statement

)
return pyarrow.Table.from_pandas(data)

elif type == BatchFormat.CUDF:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you making changes to these methods?

These are used by Train, and we should not accept "cudf" in iter_batches (explicitly disallowing it)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch that actually, let's do it for consistency of expectations (to make it similar to every other format)

Comment on lines +573 to +576
elif _is_cudf_dataframe(block):
from ray.data._internal.arrow_block import ArrowBlockAccessor

return ArrowBlockAccessor(block.to_arrow())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be possible, right?

@alexeykudinkin
Copy link
Contributor

@goutamvenkat-anyscale LGTM, minor comments

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
"`fn` to return a `pandas.DataFrame`, `pyarrow.Table`, "
"`numpy.ndarray`, `list`, or `dict[str, numpy.ndarray]`."
"`cudf.DataFrame`, `numpy.ndarray`, `list`, or "
"`dict[str, numpy.ndarray]`."
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No cudf guard before Mapping check in validation

Low Severity

_validate_batch_output doesn't guard against cudf.DataFrame before the collections.abc.Mapping isinstance check at line 517. Since cudf.DataFrame implements the Mapping protocol (as noted in the comment in batch_to_block), a cudf batch enters the dict-validation path, where each column is checked via _is_valid_column_values. This is inconsistent with batch_to_block, which explicitly handles cudf before the Mapping check. If a cudf.Series ever fails is_ndarray_like, the error message would incorrectly reference "dict" values.

Fix in Cursor Fix in Web

"`fn` to return a `pandas.DataFrame`, `pyarrow.Table`, "
"`numpy.ndarray`, `list`, or `dict[str, numpy.ndarray]`."
"`cudf.DataFrame`, `numpy.ndarray`, `list`, or "
"`dict[str, numpy.ndarray]`."
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cudf DataFrame fails Mapping validation in batch output

High Severity

A cudf.DataFrame implements the collections.abc.Mapping protocol (as explicitly noted in the batch_to_block comment). In _validate_batch_output, after passing the allowed check, a cudf.DataFrame will match isinstance(batch, collections.abc.Mapping) on line 517. This causes iteration over batch.items(), where each value is a cudf.Series. Since cudf.Series is not a list, np.ndarray, or ndarray-like, _is_valid_column_values returns False, and the function raises a misleading ValueError. The batch_to_block method correctly handles this by checking _is_cudf_dataframe before the Mapping branch, but the same guard is missing here.

Additional Locations (1)
Fix in Cursor Fix in Web

Signed-off-by: Goutam <goutam@anyscale.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@richardliaw richardliaw merged commit 63bc264 into ray-project:master Mar 12, 2026
6 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/cudf_batch_format branch March 12, 2026 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] - Add support for cudf as a batch_format

4 participants