[core] Release resources only after tasks have stopped executing #53660


Merged
merged 5 commits into from
Jun 16, 2025

Conversation

codope
Contributor

@codope codope commented Jun 9, 2025

Why are these changes needed?

In CoreWorker::Exit(), the code calls:

auto status = local_raylet_client_->NotifyDirectCallTaskBlocked();

This tells the raylet to immediately release all resources that were allocated to this worker. However, the worker may still have tasks running at this point in the exit sequence.

I have added a test to reproduce the issue; locally, the failure reproduces roughly 2 out of 5 times without the fix. In the test:

  1. Ray init with only 1 CPU.
  2. Start task1 and wait for it to signal it's running
  3. Submit task2 (should be queued)
  4. Try to wait for task2 to start with 1-second timeout:
    • No timeout: Bug! Task2 started immediately --> cleanup and fail
    • Timeout occurs: Correct! Task2 is queued --> continue
  5. Complete task1 and assert expected result
  6. Wait for task2 to start (should happen immediately now)
  7. Complete task2 and assert expected result

With the fix, the test always passes (no oversubscription detected).
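The steps above can be sketched in pure Python, using `threading.Event` as a stand-in for Ray's SignalActor and a one-slot semaphore as a stand-in for the single CPU (the names and timeouts here are illustrative, not the actual test code):

```python
# Sketch of the test's logic: with only one "CPU" slot, task2 must not
# start executing until task1 has finished and released its resources.
import threading

cpu = threading.Semaphore(1)          # models the single CPU in the cluster
task1_running = threading.Event()     # task1 signals it's executing
task2_running = threading.Event()     # task2 signals it's executing
task1_done = threading.Event()        # lets the driver complete task1

def task1():
    with cpu:
        task1_running.set()
        task1_done.wait()

def task2():
    with cpu:
        task2_running.set()

t1 = threading.Thread(target=task1)
t1.start()
assert task1_running.wait(timeout=5)       # step 1: task1 is running

t2 = threading.Thread(target=task2)        # step 2: task2 should be queued
t2.start()

# Step 3: task2 must NOT start within the 1-second window.
started_early = task2_running.wait(timeout=1)
assert not started_early, "oversubscription: task2 ran alongside task1"

task1_done.set()                           # step 4: complete task1
assert task2_running.wait(timeout=5)       # step 5: task2 starts promptly now
t1.join()
t2.join()
```

The key assertion is the timed wait in step 3: if resources are released while task1 is still executing, task2 acquires the slot early and the wait returns before the timeout.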

@codope
Contributor Author

codope commented Jun 9, 2025

@edoakes Can you please review the test? This is how I have tried to reproduce the resource oversubscription issue caused by the early local_raylet_client_->NotifyDirectCallTaskBlocked() call.

@codope codope force-pushed the cw-shutdown-oversub branch from cd69568 to 177840a Compare June 12, 2025 08:35
@codope codope marked this pull request as ready for review June 12, 2025 08:36
@codope codope added the go add ONLY when ready to merge, run all tests label Jun 12, 2025
time.sleep(0.3)

# Sanity check: At this point, no CPUs should be available.
assert ray.available_resources().get("CPU", 0) == 0
Contributor

@edoakes can this API be stale? This queries the GCS for available resources; is it possible that the task has started executing but the GCS has not been updated yet?

Collaborator

you're right; there's no guarantee that this API returns the updated value immediately. it will be updated once the raylet broadcasts its resource usage to the GCS.

so we need to wrap this in a wait_for_condition (or drop the check since it's non-essential)
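The pattern being suggested is to poll the possibly-stale value instead of asserting on it once. Ray's real helper lives in its test utilities; a minimal stand-in (hypothetical, not Ray's actual implementation) looks like:

```python
# Minimal stand-in for a wait_for_condition helper: poll a predicate
# until it returns True or a timeout elapses, instead of asserting
# once on a value (like GCS resource usage) that may lag behind reality.
import time

def wait_for_condition(predicate, timeout=10, retry_interval=0.1):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(retry_interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Usage in the test would look roughly like (illustrative, not the PR's code):
#   wait_for_condition(lambda: ray.available_resources().get("CPU", 0) == 0)
```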

Contributor Author

Actually, that's the reason I had added some sleep. But now I've modified the test to assert the actual behavior using SignalActors instead of asserting on resource accounting that depends on these APIs.

Comment on lines 794 to 799
# If the bug exists, at some point Ray's accounting will show that
# 2 CPUs are in use, despite only 1 being in the cluster.
# With the fix, this should never exceed 1.
max_used_cpus = max(used_cpus_snapshots)
print(f"Max CPUs reported as used by Ray: {max_used_cpus}")
print(f"Total CPUs in cluster: {total_cpus}")
Contributor

I think this test is getting too specific to Ray's implementation and therefore will be brittle.

What you actually want to enforce is that task2 does not start executing before task1 exits.

Collaborator

This is a good point -- better way to write the test

Contributor Author

Thanks for the suggestion of using SignalActor. I have modified the test as follows:

  1. Start task1 and wait for it to signal it's running
  2. Submit task2 (should be queued)
  3. Try to wait for task2 to start with 1-second timeout:
    • No timeout: Bug! Task2 started immediately --> cleanup and fail
    • Timeout occurs: Correct! Task2 is queued --> continue
  4. Complete task1 and assert expected result
  5. Wait for task2 to start (should happen immediately now)
  6. Complete task2 and assert expected result

@codope codope force-pushed the cw-shutdown-oversub branch from 177840a to a25a462 Compare June 16, 2025 06:16
Collaborator

@edoakes edoakes left a comment

looks good, minor nits

codope and others added 5 commits June 16, 2025 15:22
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…s other comments

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@codope codope force-pushed the cw-shutdown-oversub branch from 21336f7 to 5f48512 Compare June 16, 2025 15:37
@edoakes edoakes enabled auto-merge (squash) June 16, 2025 16:17
@edoakes edoakes merged commit 0ec2c47 into ray-project:master Jun 16, 2025
6 checks passed
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
rebel-scottlee pushed a commit to rebellions-sw/ray that referenced this pull request Jun 21, 2025
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025