feat: Add partition awareness to GroupBy to skip shuffle operations#60869
Annmool wants to merge 7 commits into ray-project:master
Conversation
Signed-off-by: Annmool <aydv.267@gmail.com>
Code Review
This pull request introduces a valuable optimization for GroupBy operations by adding partition awareness to skip unnecessary shuffles. The implementation, which includes new utility functions for partition detection and modifications to GroupedData, is well-structured. The addition of unit tests is also a positive step.
I have identified a critical issue in how block metadata is collected, which would likely lead to a runtime error. Additionally, I've suggested an improvement to the error handling to enhance observability when the partition awareness check fails. Overall, this is a strong contribution that should significantly improve performance for workloads with pre-partitioned data.
python/ray/data/grouped_data.py
Outdated
for ref, metadata in self._dataset.iter_internal_ref_bundles():
    if hasattr(metadata, 'blocks') and metadata.blocks:
        for block_ref, block_metadata in metadata.blocks:
            blocks_metadata.append(block_metadata)
The iteration over self._dataset.iter_internal_ref_bundles() appears to be incorrect. This method returns an iterator of RefBundle objects, not tuples. Each RefBundle can contain metadata for multiple blocks.
You should iterate over the RefBundle objects and then access their metadata property to get the list of BlockMetadata for all blocks within that bundle.
for bundle in self._dataset.iter_internal_ref_bundles():
    blocks_metadata.extend(bundle.metadata)

except Exception as e:
    # If anything goes wrong, fall back to regular shuffle
    return False, f"Partition awareness check failed: {str(e)}"
The broad exception handler correctly falls back to the shuffle-based approach, which is a safe default. However, it can hide underlying issues in the partition awareness check, since it only returns the error message as a string.
To improve observability and make debugging easier, I recommend logging the exception as a warning. This will ensure that failures in this optimization path are more visible in the logs.
except Exception as e:
    # If anything goes wrong, fall back to regular shuffle
    logger = logging.getLogger(__name__)
    logger.warning(
        "Partition awareness check failed, falling back to shuffle: %s",
        e,
        exc_info=True,
    )
    return False, f"Partition awareness check failed: {str(e)}"

…ion awareness check
- Fix iteration over iter_internal_ref_bundles() to properly access RefBundle objects
- Each RefBundle contains metadata property with blocks information
- Add logging for exception handling to improve observability
- Include stack trace in warning logs for debugging
Signed-off-by: Annmool <aydv.267@gmail.com>
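The metadata-collection fix described in this commit can be illustrated with a self-contained sketch. `RefBundle` and `BlockMetadata` below are minimal stand-ins for the Ray-internal types (hypothetical, for illustration only); the point is that each bundle carries a list of per-block metadata, which should be flattened with `extend` rather than unpacked as tuples:

```python
from dataclasses import dataclass, field
from typing import List

# Stand-in for Ray's internal BlockMetadata (illustrative only).
@dataclass
class BlockMetadata:
    num_rows: int

# Stand-in for Ray's internal RefBundle: one bundle may cover many blocks.
@dataclass
class RefBundle:
    metadata: List[BlockMetadata] = field(default_factory=list)

def collect_blocks_metadata(bundles):
    """Flatten per-bundle block metadata, as the review suggests."""
    blocks_metadata = []
    for bundle in bundles:
        # Each RefBundle exposes a list of BlockMetadata, one per block.
        blocks_metadata.extend(bundle.metadata)
    return blocks_metadata
```

With two bundles covering three blocks in total, `collect_blocks_metadata` returns a flat list of three `BlockMetadata` entries.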
… partition info, update tests, add logging Signed-off-by: Annmool <aydv.267@gmail.com>
…rsing, logging, and tests Signed-off-by: Annmool <aydv.267@gmail.com>
…uristic; update tests; tighten partition parsing Signed-off-by: Annmool <aydv.267@gmail.com>
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically closed because there has been no further activity in the last 14 days. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for your contribution!
Description
This PR adds partition-awareness optimization to Ray Data’s GroupBy operations.
Right now, groupby() always triggers a HashShuffleOperation, even when the dataset is already partitioned by the same column (for example, Hive-style partitioned Parquet data). In those cases, the shuffle is unnecessary and causes avoidable network transfer and memory overhead.

With this change, we first check whether the dataset is already partitioned by the groupby key using block metadata and input file paths. If it is, we skip the shuffle entirely and perform the groupby directly on the existing blocks. If not, the behavior remains unchanged and we fall back to the regular shuffle-based execution.
This optimization is fully backward compatible and does not introduce any API changes.
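As a rough illustration of the path-based check, a Hive-style partition parser could look like the following. This is a hypothetical sketch of the idea behind the PR's `extract_partition_values_from_paths()`, not its actual implementation:

```python
import re

def extract_partition_values_from_paths(paths):
    """Parse Hive-style ``key=value`` path segments into per-file dicts.

    Illustrative sketch: the real helper in partition_aware.py may handle
    more cases (encoding, type coercion, malformed segments).
    """
    results = []
    for path in paths:
        values = {}
        for segment in path.split("/"):
            # A Hive-style partition segment looks like "year=2024".
            m = re.fullmatch(r"([^=/]+)=([^=/]+)", segment)
            if m:
                values[m.group(1)] = m.group(2)
        results.append(values)
    return results
```

For a path like `s3://bucket/data/year=2024/month=01/part-0.parquet`, this yields `{"year": "2024", "month": "01"}`; if every input file of a block agrees on the groupby key's partition value, the shuffle for that key can be skipped.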
Changes
1. partition_aware.py (NEW)
   - extract_partition_values_from_paths(): parses Hive-style file paths and extracts partition column values.
   - is_partition_aware_groupby_possible(): validates whether the dataset blocks are already partitioned in a way that allows us to safely skip the shuffle phase.
2. grouped_data.py (MODIFIED)
   - Adds _check_partition_awareness() to the GroupedData class.
   - Updates map_groups() to evaluate partition awareness before triggering a shuffle.
3. test_partition_aware_groupby.py (NEW)

Impact