[data] Update streaming_repartition and map_batches_fusion #59476

Merged

alexeykudinkin merged 22 commits into ray-project:master from xinyuangui2:update-fusion on Dec 19, 2025
Conversation

@xinyuangui2
Contributor

@xinyuangui2 xinyuangui2 commented Dec 16, 2025

Analysis of the two operator patterns:

## Streaming_repartition → map_batches

|                      | Number of `map_batches` tasks |
|----------------------|-------------------------------|
| **Fused**            | `num_input_blocks` (which is ≤ the number of output blocks of StreamingRepartition) |
| **Not fused**        | number of output blocks of StreamingRepartition |

When fused, the number of tasks equals the number of input blocks, which is ≤ the number of output blocks of StreamingRepartition. If StreamingRepartition is supposed to break blocks down to increase parallelism, that won't happen when fused. So we don't fuse.
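
To make the task-count argument concrete, here is a toy arithmetic sketch; the block counts and row sizes are assumed numbers for illustration, not values from this PR.

```python
# Toy numbers (assumptions, not from this PR) illustrating the argument above.
num_input_blocks = 4          # blocks entering StreamingRepartition
rows_per_input_block = 1_000
target_num_rows = 100         # StreamingRepartition's target rows per block

# Not fused: map_batches launches one task per repartitioned output block.
num_output_blocks = num_input_blocks * rows_per_input_block // target_num_rows
tasks_not_fused = num_output_blocks  # 40

# Fused: map_batches is tied to the original input blocks instead.
tasks_fused = num_input_blocks  # 4

# Fusion can only reduce parallelism here, which is why it is disabled.
assert tasks_fused <= tasks_not_fused
```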


## Map_batches → streaming_repartition

When `batch_size % target_num_rows == 0`:

|                      | Number of `map_batches` tasks |
|----------------------|-------------------------------|
| **Fused**            | `total_rows / batch_size` |
| **Not fused**        | `total_rows / batch_size` |

So, the fusion doesn’t affect the parallelism.
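
A similar toy sketch (again with assumed numbers) shows why the divisibility condition keeps the task count unchanged and lets each batch split evenly into target-sized blocks.

```python
# Assumed numbers illustrating the divisibility condition above.
total_rows = 10_000
batch_size = 100       # MapBatches batch size
target_num_rows = 20   # StreamingRepartition target rows per block

assert batch_size % target_num_rows == 0  # the fusion precondition

# Fused or not, map_batches processes one batch per task.
tasks = total_rows // batch_size  # 100 either way

# Each map_batches output batch splits into a whole number of
# target-sized blocks, with no ragged remainder.
blocks_per_batch = batch_size // target_num_rows  # 5
assert blocks_per_batch * target_num_rows == batch_size
```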


Thus, we currently disable the `Streaming_repartition → map_batches` fusion, and enable the `Map_batches → streaming_repartition` fusion when `batch_size % target_num_rows == 0`.
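
For reference, a pipeline shaped like the fusible pattern might look like the following. This is a hedged sketch against the public Ray Data API; whether these two operators actually fuse depends on the optimizer rules in your Ray version.

```python
import ray

ds = (
    ray.data.range(10_000)
    # MapBatches with batch_size=100 (identity function for illustration).
    .map_batches(lambda batch: batch, batch_size=100)
    # StreamingRepartition with target_num_rows_per_block=20.
    .repartition(target_num_rows_per_block=20)
)
# 100 % 20 == 0 satisfies the condition above, so this PR allows the
# optimizer to fuse the two operators without changing the task count.
```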

@xinyuangui2 xinyuangui2 requested a review from a team as a code owner December 16, 2025 18:47
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the operator fusion logic for `StreamingRepartition` and `MapBatches`. The main change is to allow fusion of `MapBatches → StreamingRepartition` when the `batch_size` of `MapBatches` is a multiple of the `target_num_rows_per_block` of `StreamingRepartition`. The reverse fusion (`StreamingRepartition → MapBatches`) is disabled. The changes look good and are well-tested. I've added one comment to improve code readability in a test file.


@cursor cursor bot left a comment


Bug: Fused operator uses wrong target size for output blocks

The StreamingRepartitionRefBundler is created with batch_size (from the upstream MapBatches operator) instead of target_num_rows_per_block (from the downstream StreamingRepartition operator). Previously this was correct because the fusion condition required batch_size == target_num_rows_per_block. After changing the condition to batch_size % target_num_rows_per_block == 0, fusion now occurs when batch_size is larger (e.g., batch_size=100, target_num_rows=20). The fused operator will produce batch_size-sized output blocks instead of target_num_rows_per_block-sized blocks, violating the expected repartition behavior.

python/ray/data/_internal/logical/rules/operator_fusion.py#L308-L309

```python
compute_strategy=compute,
ref_bundler=StreamingRepartitionRefBundler(batch_size),
```

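To illustrate the reported mismatch, here is a minimal stand-in for the bundler. The class and numbers are hypothetical illustrations, not the actual Ray implementation.

```python
class ToyBundler:
    """Stand-in for StreamingRepartitionRefBundler: splits a batch
    into blocks of at most `target_rows` rows (hypothetical)."""

    def __init__(self, target_rows: int):
        self.target_rows = target_rows

    def split(self, batch_rows: int) -> list:
        full, rem = divmod(batch_rows, self.target_rows)
        return [self.target_rows] * full + ([rem] if rem else [])

batch_size, target_num_rows = 100, 20  # 100 % 20 == 0, so fusion now triggers

buggy = ToyBundler(batch_size)        # sized by the upstream batch size
fixed = ToyBundler(target_num_rows)   # sized by the repartition target

assert buggy.split(batch_size) == [100]      # one oversized output block
assert fixed.split(batch_size) == [20] * 5   # the expected 20-row blocks
```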


Signed-off-by: xgui <xgui@anyscale.com>
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Dec 16, 2025
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 added the go add ONLY when ready to merge, run all tests label Dec 16, 2025
xinyuangui2 and others added 2 commits December 18, 2025 12:24
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) December 18, 2025 23:47
@alexeykudinkin alexeykudinkin merged commit e7ef57f into ray-project:master Dec 19, 2025
7 checks passed
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025
…ct#59476)

lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
…ct#59476)

