feat: Add partition awareness to GroupBy to skip shuffle operations #60869

Closed
Annmool wants to merge 7 commits into ray-project:master from Annmool:partition-awareness-groupby

@Annmool Annmool commented Feb 9, 2026

Description

This PR adds partition-awareness optimization to Ray Data’s GroupBy operations.

Right now, groupby() always triggers a HashShuffleOperation, even when the dataset is already partitioned by the same column (for example, Hive-style partitioned Parquet data). In those cases, the shuffle is unnecessary and causes avoidable network transfer and memory overhead.

With this change, we first check whether the dataset is already partitioned by the groupby key using block metadata and input file paths. If it is, we skip the shuffle entirely and perform the groupby directly on the existing blocks. If not, the behavior remains unchanged and we fall back to the regular shuffle-based execution.

This optimization is fully backward compatible and does not introduce any API changes.

Changes

1- partition_aware.py (NEW)

  • extract_partition_values_from_paths()
    Parses Hive-style file paths and extracts partition column values.

  • is_partition_aware_groupby_possible()
    Validates whether the dataset blocks are already partitioned in a way that allows us to safely skip the shuffle phase.
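As an illustration of the Hive-style path parsing described above, here is a minimal sketch. It is not the PR's implementation: `parse_hive_partitions` and `single_partition_value` are hypothetical names, and the sketch assumes partition directories use the `key=value` form and that values may be URL-encoded.

```python
from typing import Dict, List, Optional
from urllib.parse import unquote


def parse_hive_partitions(path: str) -> Dict[str, str]:
    """Return {column: value} for each `key=value` directory segment in `path`."""
    values: Dict[str, str] = {}
    # Skip the last segment (the file name) so a name containing "=" is not
    # mistaken for a partition directory.
    for segment in path.replace("\\", "/").split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = unquote(value)
    return values


def single_partition_value(paths: List[str], column: str) -> Optional[str]:
    """Return the value of `column` if every path agrees on it, else None."""
    found = {parse_hive_partitions(p).get(column) for p in paths}
    if len(found) == 1:
        # May still be None when the column is absent from every path.
        return next(iter(found))
    return None
```

A block whose input files disagree on the groupby column, or do not carry it at all, yields `None` here, which a caller would treat as "not partition-aware".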

2- grouped_data.py (MODIFIED)

  • Added _check_partition_awareness() to the GroupedData class.
  • Updated map_groups() to evaluate partition awareness before triggering a shuffle.
  • If the dataset is partition-aware, we run the groupby directly on the existing blocks without shuffling.
  • Otherwise, we fall back to the existing shuffle implementation.
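The check-then-fall-back decision can be sketched roughly as follows. This is an assumption about the safety condition, not the code in this PR: skipping the shuffle is only safe when every block has a known value for the groupby key and no key value appears in more than one block. `can_skip_shuffle` is a hypothetical helper, and its `(bool, reason)` return shape mirrors the tuple returned by the check quoted in the review below.

```python
from typing import List, Optional, Set, Tuple


def can_skip_shuffle(block_values: List[Optional[Set[str]]]) -> Tuple[bool, str]:
    """Decide whether a groupby can run on the blocks as they are.

    `block_values[i]` holds the groupby-key values found in block i's input
    paths, or None/empty when they could not be determined.
    """
    if not block_values:
        return False, "no blocks to inspect"
    seen: Set[str] = set()
    for index, values in enumerate(block_values):
        if not values:
            return False, f"block {index} has no partition value for the key"
        overlap = seen & values
        if overlap:
            # The same group would be split across blocks, so a shuffle is needed.
            return False, f"value(s) {sorted(overlap)} span more than one block"
        seen |= values
    return True, "each key value is confined to a single block"


# Usage: choose the execution path based on the check.
ok, reason = can_skip_shuffle([{"US"}, {"DE"}, {"FR"}])
plan = "group within existing blocks" if ok else "hash shuffle, then group"
```

Any `False` result, whatever the reason, leads to the existing shuffle path, which is why the optimization does not change results for datasets that are not partitioned by the key.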

3- test_partition_aware_groupby.py (NEW)

  • Added unit tests for partition value extraction.
  • Added tests for partition-awareness detection logic.
  • Covered edge cases such as inconsistent partitions, missing columns, and duplicate partition values.

Impact

  • Avoids unnecessary shuffle operations for pre-partitioned datasets.
  • Reduces network data movement.
  • Improves memory efficiency.
  • Provides better performance for Hive-partitioned workloads without changing existing APIs.

@Annmool Annmool requested a review from a team as a code owner February 9, 2026 17:11
Signed-off-by: Annmool <aydv.267@gmail.com>
@Annmool Annmool force-pushed the partition-awareness-groupby branch from 1d09b85 to 772d2b6 Compare February 9, 2026 17:13
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable optimization for GroupBy operations by adding partition awareness to skip unnecessary shuffles. The implementation, which includes new utility functions for partition detection and modifications to GroupedData, is well-structured. The addition of unit tests is also a positive step.

I have identified a critical issue in how block metadata is collected, which would likely lead to a runtime error. Additionally, I've suggested an improvement to the error handling to enhance observability when the partition awareness check fails. Overall, this is a strong contribution that should significantly improve performance for workloads with pre-partitioned data.

Comment on lines +69 to +72
            for ref, metadata in self._dataset.iter_internal_ref_bundles():
                if hasattr(metadata, 'blocks') and metadata.blocks:
                    for block_ref, block_metadata in metadata.blocks:
                        blocks_metadata.append(block_metadata)

critical

The iteration over self._dataset.iter_internal_ref_bundles() appears to be incorrect. This method returns an iterator of RefBundle objects, not tuples. Each RefBundle can contain metadata for multiple blocks.

You should iterate over the RefBundle objects and then access their metadata property to get the list of BlockMetadata for all blocks within that bundle.

            for bundle in self._dataset.iter_internal_ref_bundles():
                blocks_metadata.extend(bundle.metadata)
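To make the difference concrete without depending on Ray internals, here is a small sketch using stand-in classes. `FakeBlockMetadata` and `FakeRefBundle` are hypothetical and only mimic the shape described in this comment (a bundle holding `(block_ref, metadata)` pairs and exposing a `metadata` property); they are not Ray's real classes.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Tuple


@dataclass
class FakeBlockMetadata:
    num_rows: int
    input_files: List[str]


@dataclass
class FakeRefBundle:
    # Pairs of (block reference, metadata), one pair per block in the bundle.
    blocks: Tuple[Tuple[Any, FakeBlockMetadata], ...]

    @property
    def metadata(self) -> List[FakeBlockMetadata]:
        return [meta for _, meta in self.blocks]


def collect_block_metadata(bundles: Iterable[FakeRefBundle]) -> List[FakeBlockMetadata]:
    """Flatten per-bundle metadata into one list, as the suggestion does."""
    collected: List[FakeBlockMetadata] = []
    for bundle in bundles:
        collected.extend(bundle.metadata)
    return collected


bundle = FakeRefBundle(
    blocks=(
        ("ref-0", FakeBlockMetadata(10, ["/t/country=US/a.parquet"])),
        ("ref-1", FakeBlockMetadata(20, ["/t/country=DE/b.parquet"])),
    )
)

# Tuple-unpacking the bundle itself, as the original loop did, fails because
# the object is not iterable.
try:
    ref, metadata = bundle
    unpack_failed = False
except TypeError:
    unpack_failed = True
```

With these stand-ins, the original `for ref, metadata in ...` pattern raises `TypeError` on the first bundle, while the suggested loop collects metadata for both blocks.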

Comment on lines +84 to +86
        except Exception as e:
            # If anything goes wrong, fall back to regular shuffle
            return False, f"Partition awareness check failed: {str(e)}"

medium

The broad exception handler correctly falls back to the shuffle-based approach, which is a safe default. However, it currently might hide underlying issues in the partition awareness check by only returning the error message as a string.

To improve observability and make debugging easier, I recommend logging the exception as a warning. This will ensure that failures in this optimization path are more visible in the logs.

        except Exception as e:
            # If anything goes wrong, fall back to regular shuffle
            logger = logging.getLogger(__name__)
            logger.warning(
                "Partition awareness check failed, falling back to shuffle: %s",
                e,
                exc_info=True,
            )
            return False, f"Partition awareness check failed: {str(e)}"
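For reference, the same log-and-fall-back pattern as a self-contained sketch. Two things here are assumptions rather than part of the suggestion: the logger is created once at module level instead of inside the handler, and `safe_partition_check` with its `check` callable is a hypothetical wrapper standing in for the real method.

```python
import logging
from typing import Callable, Tuple

# Created once per module rather than on every failure.
logger = logging.getLogger(__name__)


def safe_partition_check(check: Callable[[], Tuple[bool, str]]) -> Tuple[bool, str]:
    """Run `check`; on any error, log it with a traceback and report False."""
    try:
        return check()
    except Exception as e:
        # A failed check must never break the groupby, so the caller falls
        # back to the regular shuffle when it sees False.
        logger.warning(
            "Partition awareness check failed, falling back to shuffle: %s",
            e,
            exc_info=True,
        )
        return False, f"Partition awareness check failed: {e}"


def broken_check() -> Tuple[bool, str]:
    raise ValueError("boom")


result = safe_partition_check(broken_check)
```

Passing `exc_info=True` attaches the stack trace to the warning, so the failure stays visible in logs even though the caller only sees the `(False, reason)` tuple.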

…ion awareness check

- Fix iteration over iter_internal_ref_bundles() to properly access RefBundle objects
- Each RefBundle contains metadata property with blocks information
- Add logging for exception handling to improve observability
- Include stack trace in warning logs for debugging

Signed-off-by: Annmool <aydv.267@gmail.com>
@Annmool Annmool force-pushed the partition-awareness-groupby branch from dcb1840 to e7bd41e Compare February 9, 2026 17:22
… partition info, update tests, add logging

Signed-off-by: Annmool <aydv.267@gmail.com>
…rsing, logging, and tests

Signed-off-by: Annmool <aydv.267@gmail.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

@ray-gardener ray-gardener bot added the community-contribution Contributed by the community label Feb 9, 2026
Annmool and others added 2 commits February 10, 2026 13:14
…uristic; update tests; tighten partition parsing

Signed-off-by: Annmool <aydv.267@gmail.com>
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Feb 24, 2026
@github-actions

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

@github-actions github-actions bot closed this Mar 11, 2026