[core] Release resources only after tasks have stopped executing #53660


Merged
merged 5 commits into from
Jun 16, 2025

Conversation

codope
Contributor

@codope codope commented Jun 9, 2025

Why are these changes needed?

In CoreWorker::Exit(), the code calls:

auto status = local_raylet_client_->NotifyDirectCallTaskBlocked();

This tells the raylet to immediately release all resources that were allocated to this worker. However, the worker may still have tasks running at this point in the exit sequence.

I have added a test to reproduce the issue; locally, the failure reproduces roughly 2 out of 5 times without the fix. In the test:

  1. Ray init with only 1 CPU.
  2. Start task1 and wait for it to signal it's running
  3. Submit task2 (should be queued)
  4. Try to wait for task2 to start with 1-second timeout:
    • No timeout: Bug! Task2 started immediately --> cleanup and fail
    • Timeout occurs: Correct! Task2 is queued --> continue
  5. Complete task1 and assert expected result
  6. Wait for task2 to start (should happen immediately now)
  7. Complete task2 and assert expected result

With the fix, the test always passes (no oversubscription detected).
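The steps above can be sketched in pure Python, using `threading.Event` as a stand-in for Ray's SignalActor and a one-slot semaphore as a stand-in for the single CPU (the names and timeouts here are illustrative, not the actual test code):

```python
# Sketch of the test's logic: with only one "CPU" slot, task2 must not
# start executing until task1 has finished and released its resources.
import threading

cpu = threading.Semaphore(1)          # models the single CPU in the cluster
task1_running = threading.Event()     # task1 signals it's executing
task2_running = threading.Event()     # task2 signals it's executing
task1_done = threading.Event()        # lets the driver complete task1

def task1():
    with cpu:
        task1_running.set()
        task1_done.wait()

def task2():
    with cpu:
        task2_running.set()

t1 = threading.Thread(target=task1)
t1.start()
assert task1_running.wait(timeout=5)       # step 1: task1 is running

t2 = threading.Thread(target=task2)        # step 2: task2 should be queued
t2.start()

# Step 3: task2 must NOT start within the 1-second window.
started_early = task2_running.wait(timeout=1)
assert not started_early, "oversubscription: task2 ran alongside task1"

task1_done.set()                           # step 4: complete task1
assert task2_running.wait(timeout=5)       # step 5: task2 starts promptly now
t1.join()
t2.join()
```

The key assertion is the timed wait in step 3: if resources are released while task1 is still executing, task2 acquires the slot early and the wait returns before the timeout.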

@codope
Contributor Author

codope commented Jun 9, 2025

@edoakes Can you please review the test? This is how I have tried to reproduce the resource oversubscription issue caused by the early local_raylet_client_->NotifyDirectCallTaskBlocked() call.

@codope codope force-pushed the cw-shutdown-oversub branch from cd69568 to 177840a Compare June 12, 2025 08:35
@codope codope marked this pull request as ready for review June 12, 2025 08:36
@codope codope added the go add ONLY when ready to merge, run all tests label Jun 12, 2025
time.sleep(0.3)

# Sanity check: At this point, no CPUs should be available.
assert ray.available_resources().get("CPU", 0) == 0
Contributor

@edoakes can this API be stale? This queries the GCS for available resources; is it possible that the task has started executing but the GCS has not been updated yet?

Collaborator

you're right; there's no guarantee that this API returns the updated value immediately. it will be updated once the raylet broadcasts its resource usage to the GCS.

so we need to wrap this in a wait_for_condition (or drop the check since it's non-essential)
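The pattern being suggested is to poll the possibly-stale value instead of asserting on it once. Ray's real helper lives in its test utilities; a minimal stand-in (hypothetical, not Ray's actual implementation) looks like:

```python
# Minimal stand-in for a wait_for_condition helper: poll a predicate
# until it returns True or a timeout elapses, instead of asserting
# once on a value (like GCS resource usage) that may lag behind reality.
import time

def wait_for_condition(predicate, timeout=10, retry_interval=0.1):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(retry_interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Usage in the test would look roughly like (illustrative, not the PR's code):
#   wait_for_condition(lambda: ray.available_resources().get("CPU", 0) == 0)
```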

Contributor Author

Actually, that's the reason I had added some sleep. But now I've modified the test to assert the actual behavior using SignalActors instead of asserting on resource accounting that depends on these APIs.

Comment on lines 794 to 799
# If the bug exists, at some point Ray's accounting will show that
# 2 CPUs are in use, despite only 1 being in the cluster.
# With the fix, this should never exceed 1.
max_used_cpus = max(used_cpus_snapshots)
print(f"Max CPUs reported as used by Ray: {max_used_cpus}")
print(f"Total CPUs in cluster: {total_cpus}")
Contributor

I think this test is getting too specific to Ray's implementation and therefore will be brittle.

What you actually want to enforce is that task2 does not start executing before task1 exits.

Collaborator

This is a good point -- better way to write the test

Contributor Author

Thanks for the suggestion of using SignalActor. I have modified the test as follows:

  1. Start task1 and wait for it to signal it's running
  2. Submit task2 (should be queued)
  3. Try to wait for task2 to start with 1-second timeout:
    • No timeout: Bug! Task2 started immediately --> cleanup and fail
    • Timeout occurs: Correct! Task2 is queued --> continue
  4. Complete task1 and assert expected result
  5. Wait for task2 to start (should happen immediately now)
  6. Complete task2 and assert expected result

@codope codope force-pushed the cw-shutdown-oversub branch from 177840a to a25a462 Compare June 16, 2025 06:16
Collaborator

@edoakes edoakes left a comment

looks good, minor nits

codope and others added 5 commits June 16, 2025 15:22
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…s other comments

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@codope codope force-pushed the cw-shutdown-oversub branch from 21336f7 to 5f48512 Compare June 16, 2025 15:37
@edoakes edoakes enabled auto-merge (squash) June 16, 2025 16:17
@edoakes edoakes merged commit 0ec2c47 into ray-project:master Jun 16, 2025
6 checks passed
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
rebel-scottlee pushed a commit to rebellions-sw/ray that referenced this pull request Jun 21, 2025
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025