[Data] - Add cudf as a batch_format#61329
Conversation
Signed-off-by: Goutam <goutam@anyscale.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces cudf as a new batch_format in Ray Data, enabling more efficient GPU-native data processing pipelines. The changes are extensive, touching core data block abstractions, expression evaluation, and various dataset operations to support cudf.DataFrame as a first-class citizen. This includes a new CudfBlockAccessor with optimized GPU-native implementations for sorting and partitioning. The addition of comprehensive tests and CI configuration for cuDF is also a great step. My feedback focuses on a few opportunities for performance improvement and code simplification.
There was a problem hiding this comment.
Code Review
This pull request introduces cudf as a new batch format for Ray Data, enabling GPU-native pipelines. The changes include adding CudfBlockAccessor, CudfBlockBuilder, and CudfBlockColumnAccessor implementations, integrating cudf into expression evaluation, and updating relevant docstrings and batch conversion utilities. A new test file test_cudf_batch_format.py has been added to cover the new functionality, and cudf has been added to the GPU requirements. Overall, the implementation is well-structured and considers performance aspects like lazy imports and GPU-native operations. One minor issue was found in the CudfRow.__getitem__ method's handling of single-item access.
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
cudf as a batch_format
Signed-off-by: Goutam <goutam@anyscale.com>
python/ray/data/_internal/planner/plan_expression/expression_evaluator.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Goutam <goutam@anyscale.com>
iamjustinhsu
left a comment
There was a problem hiding this comment.
didn't look at tests
|
|
||
| def to_cudf(self) -> Any: | ||
| """Convert this block to a cudf.DataFrame (requires cudf to be installed).""" | ||
| import cudf |
| assert list(batch.columns) == ["id"] | ||
| elif data_source == "range_tensor": | ||
| assert "data" in batch.columns |
There was a problem hiding this comment.
how come we don't also exhaust the iterator like in the else statement
| ) | ||
| return pyarrow.Table.from_pandas(data) | ||
|
|
||
| elif type == BatchFormat.CUDF: |
There was a problem hiding this comment.
Why are you making changes to these methods?
These are used by Train, and we should not accept "cudf" in iter_batches (explicitly disallowing it)
There was a problem hiding this comment.
Scratch that actually, let's do it for consistency of expectations (to make it similar to every other format)
python/ray/data/block.py
Outdated
| elif _is_cudf_dataframe(block): | ||
| from ray.data._internal.arrow_block import ArrowBlockAccessor | ||
|
|
||
| return ArrowBlockAccessor(block.to_arrow()) |
There was a problem hiding this comment.
This shouldn't be possible, right?
|
@goutamvenkat-anyscale LGTM, minor comments |
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
| "`fn` to return a `pandas.DataFrame`, `pyarrow.Table`, " | ||
| "`numpy.ndarray`, `list`, or `dict[str, numpy.ndarray]`." | ||
| "`cudf.DataFrame`, `numpy.ndarray`, `list`, or " | ||
| "`dict[str, numpy.ndarray]`." |
There was a problem hiding this comment.
No cudf guard before Mapping check in validation
Low Severity
_validate_batch_output doesn't guard against cudf.DataFrame before the collections.abc.Mapping isinstance check at line 517. Since cudf.DataFrame implements the Mapping protocol (as noted in the comment in batch_to_block), a cudf batch enters the dict-validation path, where each column is checked via _is_valid_column_values. This is inconsistent with batch_to_block, which explicitly handles cudf before the Mapping check. If a cudf.Series ever fails is_ndarray_like, the error message would incorrectly reference "dict" values.
Signed-off-by: Goutam <goutam@anyscale.com>
| "`fn` to return a `pandas.DataFrame`, `pyarrow.Table`, " | ||
| "`numpy.ndarray`, `list`, or `dict[str, numpy.ndarray]`." | ||
| "`cudf.DataFrame`, `numpy.ndarray`, `list`, or " | ||
| "`dict[str, numpy.ndarray]`." |
There was a problem hiding this comment.
cudf DataFrame fails Mapping validation in batch output
High Severity
A cudf.DataFrame implements the collections.abc.Mapping protocol (as explicitly noted in the batch_to_block comment). In _validate_batch_output, after passing the allowed check, a cudf.DataFrame will match isinstance(batch, collections.abc.Mapping) on line 517. This causes iteration over batch.items(), where each value is a cudf.Series. Since cudf.Series is not a list, np.ndarray, or ndarray-like, _is_valid_column_values returns False, and the function raises a misleading ValueError. The batch_to_block method correctly handles this by checking _is_cudf_dataframe before the Mapping branch, but the same guard is missing here.
Additional Locations (1)
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>


Description
As title states.
Related issues
Closes #61325
Additional information