
[data] Fix bug in memory reservation #43511

Merged — 13 commits merged into ray-project:master on Feb 29, 2024

Conversation

@raulchen (Contributor) commented Feb 28, 2024

Why are these changes needed?

Fix bugs in ReservationOpResourceAllocator (introduced by #43171)

  • We treat a map op and its following non-map ops as the same group. update_resources already handles this properly, but _should_unblock_streaming_output_backpressure and _op_outputs_reserved_remaining didn't take it into account.
  • Since we don't reserve any resources for limit and streaming_split, we should set num_cpus=0 for their tasks.
  • _reserved_for_op_outputs currently also includes the op's internal output buffers. This is incorrect: when preserve_order=True, task outputs accumulate in the op's internal output buffer and consume the entire _reserved_for_op_outputs budget, leaving no budget to pull blocks out of the internal output buffer. Excluding the internal output buffer from _reserved_for_op_outputs fixes this issue.
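The group-aware accounting described in the first and third bullets can be sketched roughly as follows. This is a minimal illustration with hypothetical names (`Op`, `build_groups`, `outputs_reserved_remaining`), not the actual Ray Data internals: a map op and its trailing non-map ops form one group, and the remaining output budget is checked against the usage of the whole group, with the internal output buffer excluded from the reservation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Op:
    name: str
    is_map: bool
    # Memory used by blocks already pulled out of the op (task outputs),
    # excluding the op's internal output buffer.
    outputs_usage: int = 0
    # Memory held in the op's internal output buffer. Per the fix above,
    # this is NOT counted against the _reserved_for_op_outputs budget.
    internal_buffer_usage: int = 0


def build_groups(ops: List[Op]) -> List[List[Op]]:
    """Group each map op with the non-map ops that immediately follow it."""
    groups: List[List[Op]] = []
    for op in ops:
        if op.is_map or not groups:
            groups.append([op])
        else:
            groups[-1].append(op)
    return groups


def outputs_reserved_remaining(group: List[Op], reserved_for_outputs: int) -> int:
    """Remaining output budget for a group: the reservation minus the
    outputs usage of every op in the group (internal buffers excluded)."""
    used = sum(op.outputs_usage for op in group)
    return max(0, reserved_for_outputs - used)
```

For example, `build_groups` on a `map1 -> limit -> map2` chain yields the groups `[[map1, limit], [map2]]`, so the output reservation of `map1` is shared with the `limit` that follows it.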

Also deflake test_backpressure_from_output and test_e2e_autoscaling_up, as they depend on the physical memory size of the node.

Related issue number

closes #43493
closes #43490

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Hao Chen <chenh1024@gmail.com>
@c21 (Contributor) left a comment

LG


E.g.,
- "cur_map->downstream_map" will return [downstream_map].
- "cur_map->limit1->limi2->downstream_map" will return [downstream_map].
Contributor:

limi2 typo
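The lookup described in the quoted docstring above could be sketched like this. The helper name `next_map_ops` and the dict-based DAG representation are hypothetical, not the actual Ray Data implementation: starting from a map op, walk downstream and skip non-map ops (such as limit) until the nearest map ops are found.

```python
from typing import Dict, List


def next_map_ops(op: str, downstream: Dict[str, List[str]],
                 is_map: Dict[str, bool]) -> List[str]:
    """Return the nearest downstream map ops, recursing past non-map ops."""
    found: List[str] = []
    for child in downstream.get(op, []):
        if is_map[child]:
            # A map op terminates the search along this branch.
            found.append(child)
        else:
            # Non-map ops (e.g. limit) are skipped; keep walking downstream.
            found.extend(next_map_ops(child, downstream, is_map))
    return found
```

With a chain `cur_map -> limit1 -> limit2 -> downstream_map`, the two limit ops are skipped and only `downstream_map` is returned, matching the docstring's examples.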

Comment on lines 121 to 131

```diff
 @pytest.mark.parametrize(
-    "cluster_cpus, cluster_obj_store_mem_mb",
+    "cluster_cpus, cluster_obj_store_mem_mb, insert_limit_op",
     [
-        (3, 500),  # CPU not enough
-        (4, 100),  # Object store memory not enough
-        (3, 100),  # Both not enough
+        (3, 500, False),  # CPU not enough
+        (3, 500, True),  # CPU not enough
+        (4, 100, False),  # Object store memory not enough
+        (4, 100, True),  # Object store memory not enough
+        (3, 100, False),  # Both not enough
+        (3, 100, True),  # Both not enough
     ],
 )
```
Member:
Nit: To test all existing combinations with insert_limit_op, it might be cleaner to use a second parametrize decorator:

```python
@pytest.mark.parametrize(
    "cluster_cpus, cluster_obj_store_mem_mb",
    [
        (3, 500),
        (4, 100),
        (3, 100),
    ],
)
@pytest.mark.parametrize("insert_limit_op", [False, True])
```

@can-anyscale (Collaborator) commented

Can you help check that the failing tests on premerge are the same as the ones failing in go/flaky? Thanks

@raulchen (Contributor, Author) commented

@can-anyscale checked. One is an existing flaky test; the other is because Hugging Face is down now.

@can-anyscale can-anyscale merged commit c61bbc0 into ray-project:master Feb 29, 2024
7 of 9 checks passed
@raulchen raulchen deleted the fix-memory-reservation branch February 29, 2024 01:04