
[core] uncap WaitPlacementGroupUntilReady to resolve pg.ready() deadlocks#62086

Merged
MengjinYan merged 2 commits into ray-project:master from rueian:pg-ready-uncapped
Mar 30, 2026
Conversation

@rueian
Contributor

@rueian rueian commented Mar 26, 2026

Description

I found this issue while debugging timeouts on the `test_incremental_pg_and_actor_scheduling` test.

The test creates many placement groups on a 0-CPU cluster, adds CPUs to the cluster one by one, and expects the corresponding `pg.ready()` calls to resolve one by one as well.

However, there is a cap on the `WaitPlacementGroupUntilReady` RPC handler, which is what `pg.ready()` uses under the hood. Capped `WaitPlacementGroupUntilReady` requests are only admitted once previously admitted requests have resolved, but the actual placement group scheduling order can differ from the order in which the requests are admitted, and this mismatch can deadlock. For example:

Let's say we have a cluster with 0 CPUs and a `WaitPlacementGroupUntilReady` cap of 1.
We create two placement groups, A and B, each requiring 1 CPU, and call `pg.ready()` on both of them.
One of the `pg.ready()` calls will be capped out; let's say it is B's.

Now we add a 1-CPU worker node to the cluster, and one of the placement groups gets scheduled. Let's say it is B as well.

At this point, even though placement group B is scheduled, the `pg.ready()` for B won't resolve: it can only be admitted once the `pg.ready()` for A resolves, but there is no CPU left for A to be scheduled.
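The capped-admission deadlock above can be sketched as a toy model. This is illustrative only; the FIFO admission loop, the cap value, and the names are assumptions for the sketch, not Ray's actual handler code:

```python
from collections import deque

def run(cap, schedule_order):
    """Toy model of a capped RPC handler.

    pg.ready() waits are admitted FIFO up to `cap`; a queued wait is only
    admitted once an in-flight one resolves. The scheduler, however, may
    satisfy placement groups in a different order.
    """
    pending = deque(["A", "B"])   # pg.ready() calls, in issue order
    in_flight = set()
    resolved = []
    for pg in schedule_order:     # order the scheduler satisfies PGs
        # Admit queued waits up to the cap.
        while pending and len(in_flight) < cap:
            in_flight.add(pending.popleft())
        # Only an in-flight wait can resolve; a PG scheduled while its
        # wait is still queued resolves nothing.
        if pg in in_flight:
            in_flight.remove(pg)
            resolved.append(pg)
    return resolved

# The scheduler happens to satisfy B first with the one available CPU:
print(run(cap=1, schedule_order=["B"]))  # → [] : deadlock, B's wait is queued
print(run(cap=2, schedule_order=["B"]))  # → ['B'] : with room, B resolves
```

With `cap=1`, only A's wait is in flight, so B's scheduling event is lost and neither call ever resolves.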

Solution

Make WaitPlacementGroupUntilReady uncapped.
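As a minimal sketch of why this helps (a toy model, not the actual GCS code): with no admission cap, every outstanding wait is in flight at once, so whichever placement group the scheduler satisfies first resolves immediately, regardless of request arrival order.

```python
# Toy model (names illustrative): with the cap removed, both pg.ready()
# waits are admitted immediately, so resolution no longer depends on the
# order in which the requests arrived.
in_flight = {"A", "B"}    # every wait admitted at once; no cap
scheduled = "B"           # the scheduler happens to place B first
resolved = [scheduled] if scheduled in in_flight else []
print(resolved)  # → ['B']
```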

Related issues

anyscale#891
anyscale#888



Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request uncaps the WaitPlacementGroupUntilReady RPC handler in src/ray/gcs/grpc_services.cc by setting its concurrency limit to -1. This change aims to prevent deadlocks that could arise from a capped handler when placement group scheduling order differs from client pg.ready() calls. The reviewer noted that while this change resolves deadlocks, it introduces a risk of resource exhaustion. They suggested updating the code comment to reflect this trade-off and adding a new metric to track active WaitPlacementGroupUntilReady requests for better production diagnostics.

Comment thread src/ray/gcs/grpc_services.cc
@rueian rueian added the labels core (Issues that should be addressed in Ray Core) and go (add ONLY when ready to merge, run all tests) Mar 26, 2026
@Yicheng-Lu-llll
Member

LGTM!

@rueian rueian marked this pull request as ready for review March 27, 2026 18:48
@rueian rueian requested a review from a team as a code owner March 27, 2026 18:48

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be4cc545f5


Comment thread src/ray/gcs/grpc_services.cc
@MengjinYan MengjinYan merged commit 945f423 into ray-project:master Mar 30, 2026
6 checks passed
mancfactor pushed a commit to mancfactor/ray that referenced this pull request Apr 2, 2026
