[data] Update streaming_repartition and map_batches_fusion #59476
alexeykudinkin merged 22 commits into ray-project:master
Conversation
The GIL makes checking `self._serialize_cache is not None` atomic, so we don't need a lock. Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
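A minimal sketch of the lock-free caching pattern that commit relies on. The class and the `_compute_serialized` helper are illustrative names, not the actual Ray code; only the `_serialize_cache` attribute comes from the message above.

```python
class _Example:
    """Illustrative only: lazy, lock-free caching that is safe under CPython's GIL."""

    def __init__(self):
        self._serialize_cache = None

    def _compute_serialized(self) -> bytes:
        # Stand-in for an expensive serialization step.
        return b"expensive-serialization-result"

    def serialized(self) -> bytes:
        # Reading and assigning a single attribute are each atomic under the
        # GIL, so no lock is needed. In the worst case two threads both see
        # None and both compute the value, but they write identical results,
        # so the only cost is duplicated work, never a corrupted cache.
        if self._serialize_cache is None:
            self._serialize_cache = self._compute_serialized()
        return self._serialize_cache
```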
Code Review
This pull request updates the operator fusion logic for StreamingRepartition and MapBatches. The main change is to allow fusion of MapBatches -> StreamingRepartition when the batch_size of MapBatches is a multiple of target_num_rows_per_block of StreamingRepartition. The reverse fusion (StreamingRepartition -> MapBatches) is disabled. The changes look good and are well-tested. I've added one comment to improve code readability in a test file.
Bug: Fused operator uses wrong target size for output blocks
The StreamingRepartitionRefBundler is created with batch_size (from the upstream MapBatches operator) instead of target_num_rows_per_block (from the downstream StreamingRepartition operator). Previously this was correct because the fusion condition required batch_size == target_num_rows_per_block. After changing the condition to batch_size % target_num_rows_per_block == 0, fusion now occurs when batch_size is larger (e.g., batch_size=100, target_num_rows=20). The fused operator will produce batch_size-sized output blocks instead of target_num_rows_per_block-sized blocks, violating the expected repartition behavior.
python/ray/data/_internal/logical/rules/operator_fusion.py, lines 308 to 309 in 44380fa
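To make the size mismatch concrete, here is a small, self-contained illustration (not Ray code; the helper name is made up) of the block sizes a streaming repartition should emit versus what a bundler sized by `batch_size` would emit:

```python
def output_block_sizes(total_rows: int, rows_per_block: int) -> list[int]:
    """Row counts of the blocks produced when bundling `total_rows` rows
    into blocks of `rows_per_block` rows each."""
    full, remainder = divmod(total_rows, rows_per_block)
    return [rows_per_block] * full + ([remainder] if remainder else [])

# With batch_size=100 and target_num_rows_per_block=20 over 200 input rows:
# sizing the fused bundler by batch_size yields 2 blocks of 100 rows,
# while the repartition contract expects 10 blocks of 20 rows.
assert output_block_sizes(200, 100) == [100, 100]
assert output_block_sizes(200, 20) == [20] * 10
```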
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Analysis of the two operator patterns:
## Streaming_repartition → map_batches

| | Number of `map_batches` tasks |
|---------------|----------------------------------------------------------------------------------|
| **Fused**     | `num_input_blocks` (which is ≤ number of output blocks of StreamingRepartition) |
| **Not fused** | number of output blocks of StreamingRepartition |

When fused, the number of tasks equals the number of input blocks, which is ≤ the number of output blocks of StreamingRepartition. If StreamingRepartition is supposed to break down blocks to increase parallelism, that won't happen when fused. So we don't fuse.
## Map_batches → streaming_repartition

`batch_size % target_num_rows == 0`

| | Number of `map_batches` tasks |
|---------------|-------------------------------|
| **Fused**     | == total_rows / batch_size |
| **Not fused** | == total_rows / batch_size |

So, the fusion doesn't affect the parallelism.
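A short illustrative sketch of the rule above (helper names are made up for this example, not taken from the Ray codebase):

```python
def can_fuse_map_batches_into_streaming_repartition(
    batch_size: int, target_num_rows_per_block: int
) -> bool:
    # Fusion condition from this PR: the upstream batch size must be an
    # exact multiple of the downstream repartition target.
    return batch_size % target_num_rows_per_block == 0


def num_map_batches_tasks(total_rows: int, batch_size: int) -> int:
    # One task per batch, so the count is the same whether or not the
    # StreamingRepartition is fused in.
    return -(-total_rows // batch_size)  # ceiling division


assert can_fuse_map_batches_into_streaming_repartition(100, 20)
assert not can_fuse_map_batches_into_streaming_repartition(100, 30)
assert num_map_batches_tasks(1_000, 100) == 10  # fused or not
```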
Thus, we currently disable the `Streaming_repartition → map_batches` fusion and enable the fusion when `batch_size % target_num_rows == 0` for `Map_batches → streaming_repartition`.
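For context, a hedged end-to-end example of a pipeline matching the newly fusible pattern, assuming `Dataset.repartition(target_num_rows_per_block=...)` is the user-facing entry point that produces the StreamingRepartition operator, as in recent Ray Data releases:

```python
import ray

ds = ray.data.range(1_000)

# batch_size=100 is a multiple of target_num_rows_per_block=20, so with the
# rule in this PR the MapBatches -> StreamingRepartition pair is eligible for
# fusion, while parallelism (1000 / 100 = 10 map_batches tasks) is unchanged.
ds = ds.map_batches(lambda batch: batch, batch_size=100)
ds = ds.repartition(target_num_rows_per_block=20)

ds.materialize()
```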