
[data] Fix bug in memory reservation #43511

Merged — 13 commits merged into ray-project:master on Feb 29, 2024

Conversation

@raulchen (Contributor) commented Feb 28, 2024

Why are these changes needed?

Fix bugs in ReservationOpResourceAllocator (introduced by #43171)

  • We treat a map op and its following non-map ops as the same group. update_resources already handles this properly, but _should_unblock_streaming_output_backpressure and _op_outputs_reserved_remaining didn't take it into account.
  • Since we don't reserve any resources for limit and streaming_split, we should set num_cpus=0 for their tasks.
  • _reserved_for_op_outputs currently also includes the op's internal output buffers. This is incorrect: when preserve_order=True, task outputs accumulate in the op's internal output buffer and consume the entire _reserved_for_op_outputs budget, leaving no budget to pull blocks out of the internal output buffer. Excluding the internal output buffer from _reserved_for_op_outputs fixes this issue.
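The group-aware accounting described in the first and third bullets can be sketched roughly as follows. This is a minimal illustration with hypothetical names (`Op`, `build_groups`, `outputs_reserved_remaining`), not the actual Ray Data internals: a map op and its trailing non-map ops form one group, and the remaining output budget is checked against the usage of the whole group, with the internal output buffer excluded from the reservation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Op:
    name: str
    is_map: bool
    # Memory used by blocks already pulled out of the op (task outputs),
    # excluding the op's internal output buffer.
    outputs_usage: int = 0
    # Memory held in the op's internal output buffer. Per the fix above,
    # this is NOT counted against the _reserved_for_op_outputs budget.
    internal_buffer_usage: int = 0


def build_groups(ops: List[Op]) -> List[List[Op]]:
    """Group each map op with the non-map ops that immediately follow it."""
    groups: List[List[Op]] = []
    for op in ops:
        if op.is_map or not groups:
            groups.append([op])
        else:
            groups[-1].append(op)
    return groups


def outputs_reserved_remaining(group: List[Op], reserved_for_outputs: int) -> int:
    """Remaining output budget for a group: the reservation minus the
    outputs usage of every op in the group (internal buffers excluded)."""
    used = sum(op.outputs_usage for op in group)
    return max(0, reserved_for_outputs - used)
```

For example, `build_groups` on a `map1 -> limit -> map2` chain yields the groups `[[map1, limit], [map2]]`, so the output reservation of `map1` is shared with the `limit` that follows it.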

Also deflake test_backpressure_from_output and test_e2e_autoscaling_up, as they depend on the physical memory size of the node.

Related issue number

closes #43493
closes #43490

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Hao Chen <chenh1024@gmail.com>
@c21 (Contributor) left a comment

LG


E.g.,
- "cur_map->downstream_map" will return [downstream_map].
- "cur_map->limit1->limi2->downstream_map" will return [downstream_map].
Contributor:

limi2 typo
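The lookup described in the quoted docstring above could be sketched like this. The helper name `next_map_ops` and the dict-based DAG representation are hypothetical, not the actual Ray Data implementation: starting from a map op, walk downstream and skip non-map ops (such as limit) until the nearest map ops are found.

```python
from typing import Dict, List


def next_map_ops(op: str, downstream: Dict[str, List[str]],
                 is_map: Dict[str, bool]) -> List[str]:
    """Return the nearest downstream map ops, recursing past non-map ops."""
    found: List[str] = []
    for child in downstream.get(op, []):
        if is_map[child]:
            # A map op terminates the search along this branch.
            found.append(child)
        else:
            # Non-map ops (e.g. limit) are skipped; keep walking downstream.
            found.extend(next_map_ops(child, downstream, is_map))
    return found
```

With a chain `cur_map -> limit1 -> limit2 -> downstream_map`, the two limit ops are skipped and only `downstream_map` is returned, matching the docstring's examples.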

Comment on lines 121 to 131

```diff
 @pytest.mark.parametrize(
-    "cluster_cpus, cluster_obj_store_mem_mb",
+    "cluster_cpus, cluster_obj_store_mem_mb, insert_limit_op",
     [
-        (3, 500),  # CPU not enough
-        (4, 100),  # Object store memory not enough
-        (3, 100),  # Both not enough
+        (3, 500, False),  # CPU not enough
+        (3, 500, True),  # CPU not enough
+        (4, 100, False),  # Object store memory not enough
+        (4, 100, True),  # Object store memory not enough
+        (3, 100, False),  # Both not enough
+        (3, 100, True),  # Both not enough
     ],
 )
```
Member:
Nit: To test all existing combinations with insert_limit_op, it might be cleaner to use a second parametrize decorator:

```python
@pytest.mark.parametrize(
    "cluster_cpus, cluster_obj_store_mem_mb",
    [
        (3, 500),
        (4, 100),
        (3, 100),
    ],
)
@pytest.mark.parametrize("insert_limit_op", [False, True])
```

@can-anyscale (Collaborator) commented

Can you help check that the failing tests on premerge are the same as the ones failing in go/flaky? Thanks

@raulchen (Contributor, Author) commented

@can-anyscale checked. One is an existing flaky test; the other is because Hugging Face is down now.

@can-anyscale can-anyscale merged commit c61bbc0 into ray-project:master Feb 29, 2024
7 of 9 checks passed
@raulchen raulchen deleted the fix-memory-reservation branch February 29, 2024 01:04