
[data] Disable block slicing for shuffle ops #40538

Merged
merged 5 commits into from
Oct 23, 2023

Conversation

@stephanie-wang (Contributor) commented Oct 20, 2023

Why are these changes needed?

#40248 changed output block creation so that when a task produces its output blocks, it tries to slice them before yielding, in order to respect the target block size. Unfortunately, all-to-all ops currently don't support dynamic block splitting, so if we fuse an upstream map iterator with an all-to-all op, the all-to-all task still has to fuse all of the sliced blocks back together again. This appears to increase memory usage significantly.
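A toy model of the problem described above, with plain Python lists standing in for Ray Data blocks (`slice_block` and `all_to_all_task` are illustrative stand-ins, not Ray internals):

```python
# The upstream map slices its output to the target block size, but the
# all-to-all task cannot split blocks dynamically, so it must concatenate
# the slices back into one block, paying for both copies at once.
def slice_block(block, target_size):
    """Yield slices of `block` no larger than `target_size`."""
    for i in range(0, len(block), target_size):
        yield block[i:i + target_size]


def all_to_all_task(slices):
    """No dynamic block splitting: re-fuse all slices into one block."""
    fused = []
    for s in slices:
        fused.extend(s)
    return fused


block = list(range(10))
slices = list(slice_block(block, 3))      # extra slicing by the map task...
fused = all_to_all_task(slices)           # ...only to be fused back together
```

The slicing work and the second copy held during re-fusing are pure overhead, which is why the fix below disables slicing entirely when a map is fused with an all-to-all op.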

This PR avoids the issue by overriding the upstream map iterator's target block size to infinity when it is fused with an all-to-all op. It also adds a logger warning describing how to work around the problem.

Related issue number

Closes #40518.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@stephanie-wang (Contributor Author)

We don't have a good way to unit test this right now, but I think with better perf introspection, we can check the peak memory usage.

@stephanie-wang (Contributor Author)

Running the failing test here.

@@ -33,22 +42,37 @@ def map(
    idx: int,
    block: Block,
    output_num_blocks: int,
    target_max_block_size: int,
Contributor

Suggested change:
-    target_max_block_size: int,
+    target_shuffle_max_block_size: int,

"This can lead to out-of-memory errors and can happen "
"when map tasks are fused to the shuffle operation. "
"To prevent fusion, call Dataset.materialize() on the "
"dataset before shuffling."
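The workaround in the warning above hinges on fusion only happening between adjacent lazy operators, so a materialization boundary between the map and the shuffle keeps the map's target block size intact. A toy sketch (the `plan_fusion` helper and operator names are illustrative, not Ray Data's real fusion logic):

```python
# Toy model: fuse adjacent (map, shuffle) pairs unless a materialize
# boundary separates them. In the fused case, the map's output blocks feed
# directly into the shuffle task, which is where the memory overhead arises.
def plan_fusion(ops):
    """Return the operator list after applying the toy fusion rule."""
    fused = []
    i = 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i] == "map" and ops[i + 1] == "shuffle":
            fused.append("map+shuffle")  # fused pair
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused
```

With this model, `["map", "shuffle"]` fuses into one operator, while `["map", "materialize", "shuffle"]` stays unfused, mirroring the effect of calling `Dataset.materialize()` before shuffling.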
Contributor

Setting different resources can also prevent fusion, without having to materialize all data.
Another workaround is to increase the parallelism to make each map op finer-grained.
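The resource-mismatch idea above can be sketched as a toy fusion-eligibility check (the `can_fuse` helper is hypothetical; Ray Data's actual fusion rules are more involved, but matching resource requests is typically one precondition):

```python
# Toy sketch: two adjacent operators are fusion candidates only if their
# resource requests match, so giving the map op different resources (e.g.
# fractional CPUs) prevents it from being fused into the shuffle.
def can_fuse(map_resources, shuffle_resources):
    """Return True if the two ops' resource requests allow fusion."""
    return map_resources == shuffle_resources


# Matching requests: fusion allowed.
same = can_fuse({"num_cpus": 1}, {"num_cpus": 1})
# Mismatched requests: fusion prevented, without materializing all data.
diff = can_fuse({"num_cpus": 0.5}, {"num_cpus": 1})
```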

Contributor Author

I think materializing is fine here because you have to materialize all data for an all-to-all op anyway.

Contributor

makes sense.

block = BlockAccessor.for_block(block)
if block.size_bytes() > target_max_block_size:
    logger.get_logger().warn(
Contributor

I'm worried that this warning might confuse users. Can we log it at the debug level instead?

Contributor

I think printing a warning makes sense, but we may want to make it less verbose by, for example, only warning when size_bytes > 2 * target_max_block_size.
Also, suggesting an increase in parallelism seems to make more sense than disabling fusion.

Contributor Author

I'll change to 2 * target_max_block_size.

But materialize() is the best solution here (see the other comment). It's hard to recommend an exact fix via increasing parallelism.
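A minimal sketch of the de-noised check being agreed on here (the `maybe_warn_large_block` wrapper is illustrative; only the 2x threshold and the warning text come from the discussion):

```python
import logging

logger = logging.getLogger("ray.data")


def maybe_warn_large_block(size_bytes, target_max_block_size):
    """Warn only when a block is well over the target (2x), to avoid
    spamming users for marginal overshoots. Returns True if a warning
    was emitted."""
    if size_bytes > 2 * target_max_block_size:
        logger.warning(
            "Block size %d exceeds 2x the target max block size %d. "
            "This can lead to out-of-memory errors and can happen "
            "when map tasks are fused to the shuffle operation. "
            "To prevent fusion, call Dataset.materialize() on the "
            "dataset before shuffling.",
            size_bytes,
            target_max_block_size,
        )
        return True
    return False
```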

Comment on lines +39 to +46
# NOTE(swang): We override the target block size with infinity, to
# prevent the upstream map from slicing its output into smaller
# blocks. Since the shuffle task will just fuse these back
# together, the extra slicing and re-fusing can add high memory
# overhead. This can be removed once dynamic block splitting is
# supported for all-to-all ops.
# See https://github.com/ray-project/ray/issues/40518.
map_transformer.set_target_max_block_size(float("inf"))
Contributor

nit: better to put this into a separate method, to avoid repeating this comment at each call site.
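One way the suggested helper could look (a sketch; the `MapTransformer` internals shown here and the method name `disable_block_slicing` are assumptions, not the actual Ray Data code):

```python
class MapTransformer:
    def __init__(self):
        # Illustrative default target block size (128 MiB).
        self._target_max_block_size = 128 * 1024 * 1024

    def set_target_max_block_size(self, size):
        self._target_max_block_size = size

    def disable_block_slicing(self):
        """Disable output slicing when fused with an all-to-all op.

        All-to-all ops don't support dynamic block splitting, so sliced
        outputs would just be re-fused by the shuffle task, adding memory
        overhead. See https://github.com/ray-project/ray/issues/40518.
        """
        self.set_target_max_block_size(float("inf"))
```

Call sites fusing a map with an all-to-all op would then call `map_transformer.disable_block_slicing()`, keeping the explanatory comment in one place.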

@stephanie-wang stephanie-wang added release-blocker P0 Issue that blocks the release v2.8.0-pick labels Oct 20, 2023
@stephanie-wang (Contributor Author)

Re-ran the failing release test here.

Although the run time still looks a bit higher than before the offending PR, the number of workers killed due to out-of-memory is back to normal, so it appears that the test is fixed.

@stephanie-wang stephanie-wang merged commit 1c405fc into ray-project:master Oct 23, 2023
30 of 35 checks passed
stephanie-wang added a commit to stephanie-wang/ray that referenced this pull request Oct 23, 2023
vitsai pushed a commit that referenced this pull request Oct 24, 2023
Successfully merging this pull request may close these issues.

Release test dataset_shuffle_random_shuffle_1tb.aws failed