[core] Release resources only after tasks have stopped executing #53660
Conversation
@edoakes Can you please review the test? This is how I have tried to repro the resource oversubscription issue caused by early resource release.
Force-pushed from cd69568 to 177840a.
python/ray/tests/test_scheduling.py (Outdated)

    time.sleep(0.3)

    # Sanity check: At this point, no CPUs should be available.
    assert ray.available_resources().get("CPU", 0) == 0
@edoakes can this API be stale? This queries the GCS for available resources; is it possible that the task has started executing but the GCS has not been updated yet?
You're right; there's no guarantee that this API returns the updated value immediately. It will only be updated once the raylet broadcasts its resource usage to the GCS.
So we need to wrap this in a `wait_for_condition`
(or drop the check, since it's non-essential).
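For illustration, a minimal sketch of that suggestion, using the `wait_for_condition` helper from `ray._private.test_utils` (the timeout value is an assumption, not taken from the original diff):

```python
import ray
from ray._private.test_utils import wait_for_condition

# Poll until the raylet's resource broadcast reaches the GCS, rather than
# asserting on a snapshot that may be stale. The timeout is illustrative.
wait_for_condition(
    lambda: ray.available_resources().get("CPU", 0) == 0, timeout=10
)
```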
Actually, that's the reason I had added some sleep. But now I've modified the test to assert the actual behavior using `SignalActor`s instead of asserting resource accounting that depends on these APIs.
python/ray/tests/test_scheduling.py (Outdated)

    # If the bug exists, at some point Ray's accounting will show that
    # 2 CPUs are in use, despite only 1 being in the cluster.
    # With the fix, this should never exceed 1.
    max_used_cpus = max(used_cpus_snapshots)
    print(f"Max CPUs reported as used by Ray: {max_used_cpus}")
    print(f"Total CPUs in cluster: {total_cpus}")
I think this test is getting too specific to Ray's implementation and will therefore be brittle.
What you actually want to enforce is that task2 does not start executing before task1 exits.
This is a good point; that's a better way to write the test.
Thanks for the suggestion of using `SignalActor`s. I have modified the test as follows (a sketch follows the list):
- Start task1 and wait for it to signal that it's running
- Submit task2 (it should be queued)
- Try to wait for task2 to start with a 1-second timeout:
  - No timeout: bug! task2 started immediately --> clean up and fail
  - Timeout occurs: correct! task2 is queued --> continue
- Complete task1 and assert the expected result
- Wait for task2 to start (this should happen immediately now)
- Complete task2 and assert the expected result
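A minimal sketch of this flow, assuming a 1-CPU cluster and the `SignalActor` helper from `ray._private.test_utils`; the task body and signal names are illustrative, not the exact test code:

```python
import pytest
import ray
from ray._private.test_utils import SignalActor

ray.init(num_cpus=1)

task1_running, task2_running = SignalActor.remote(), SignalActor.remote()
finish = SignalActor.remote()

@ray.remote(num_cpus=1)
def task(running_signal, finish_signal):
    ray.get(running_signal.send.remote())  # announce that we started executing
    ray.get(finish_signal.wait.remote())   # block until the driver releases us
    return "done"

ref1 = task.remote(task1_running, finish)
ray.get(task1_running.wait.remote())       # task1 is running and holds the CPU

ref2 = task.remote(task2_running, finish)  # task2 should stay queued

# If the bug exists, task2 starts while task1 still occupies the only CPU,
# so this wait would return before the timeout.
with pytest.raises(ray.exceptions.GetTimeoutError):
    ray.get(task2_running.wait.remote(), timeout=1)

ray.get(finish.send.remote())              # let task1 (and later task2) exit
assert ray.get(ref1) == "done"
ray.get(task2_running.wait.remote(), timeout=30)  # task2 starts only now
assert ray.get(ref2) == "done"
```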
Force-pushed from 177840a to a25a462.
looks good, minor nits
Force-pushed from 21336f7 to 5f48512.
Why are these changes needed?

In `CoreWorker::Exit()`, the code calls:
https://github.com/ray-project/ray/blob/c54437c42fa138580a0367f813b8c4bd9ca0b3e8/src/ray/core_worker/core_worker.cc#L1169

This tells the raylet to immediately release all resources that were allocated to this worker. However, the worker may still have tasks running at this point in the exit sequence.

I have added a test to reproduce the issue. Without the fix, it reproduces the issue in about 2 out of 5 runs locally. In the test:

0. Ray init with only 1 CPU.
1. Start task1 and wait for it to signal that it's running.
2. Submit task2 (it should be queued).
3. Try to wait for task2 to start with a 1-second timeout:
   - No timeout: bug! task2 started immediately --> clean up and fail.
   - Timeout occurs: correct! task2 is queued --> continue.
4. Complete task1 and assert the expected result.
5. Wait for task2 to start (this should happen immediately now).
6. Complete task2 and assert the expected result.

With the fix, the test always passes (no oversubscription detected).