[core] uncap WaitPlacementGroupUntilReady to resolve pg.ready() deadlocks #62086
MengjinYan merged 2 commits into ray-project:master
Conversation
…ocks Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Code Review
This pull request uncaps the WaitPlacementGroupUntilReady RPC handler in src/ray/gcs/grpc_services.cc by setting its concurrency limit to -1. This change aims to prevent deadlocks that could arise from a capped handler when placement group scheduling order differs from client pg.ready() calls. The reviewer noted that while this change resolves deadlocks, it introduces a risk of resource exhaustion. They suggested updating the code comment to reflect this trade-off and adding a new metric to track active WaitPlacementGroupUntilReady requests for better production diagnostics.
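The metric suggested by the reviewer could be as simple as a gauge of in-flight requests that is incremented on handler entry and decremented on exit. Ray's GCS handlers are C++, so the following is only an illustrative Python sketch of the idea (the class name `InflightGauge` is hypothetical, not a Ray API):

```python
import threading

class InflightGauge:
    """Counts requests currently inside a handler (illustrative sketch only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0

    def __enter__(self):
        # Handler entry: one more request is in flight.
        with self._lock:
            self._active += 1
        return self

    def __exit__(self, *exc):
        # Handler exit (normal or exceptional): request is done.
        with self._lock:
            self._active -= 1
        return False

    @property
    def active(self):
        with self._lock:
            return self._active
```

Wrapping the handler body in `with gauge:` and exporting `gauge.active` would let operators see a buildup of blocked `WaitPlacementGroupUntilReady` calls before it turns into resource exhaustion.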
LGTM!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be4cc545f5
…ocks (ray-project#62086) Signed-off-by: Rueian Huang <rueiancsie@gmail.com> Signed-off-by: Frank Mancina <fmancina@haproxy.com>
Description

I found this issue while debugging timeouts on the `test_incremental_pg_and_actor_scheduling` test. The test creates many placement groups on a 0-CPU cluster, adds CPUs to the cluster one by one, and expects `pg.ready()` to resolve one by one too.

However, we have a cap on the `WaitPlacementGroupUntilReady` RPC handler, which is what `pg.ready()` uses under the hood. Capped `WaitPlacementGroupUntilReady` requests are only handled once previous requests resolve, but the actual placement group scheduling order can differ from the order in which these requests are handled, and this can cause deadlocks. For example:

Let's say we have a cluster with 0 CPUs and a `WaitPlacementGroupUntilReady` cap of 1. We create two placement groups, A and B, each requiring 1 CPU, and issue `pg.ready()` on both of them. One of the `pg.ready()` calls will be capped out; let's say it is B's.

Now we add a 1-CPU worker node to the cluster, and one of the placement groups gets scheduled; let's say it is B as well. At this point, even though placement group B is scheduled, the `pg.ready()` for B won't resolve. It can only resolve after the `pg.ready()` for A resolves, but there is no CPU left for A to be scheduled.

Solution

Make `WaitPlacementGroupUntilReady` uncapped.

Related issues

anyscale#891
anyscale#888

Additional information
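The deadlock mechanism described above does not depend on Ray itself; it can be simulated with a bounded semaphore standing in for the handler cap and an event standing in for a placement group being scheduled. This is a minimal sketch (the `CappedWaiter` class and timeouts are illustrative assumptions, not Ray code):

```python
import threading
import time

class CappedWaiter:
    """Simulates a capped WaitPlacementGroupUntilReady-style handler."""

    def __init__(self, cap):
        # The semaphore plays the role of the handler concurrency cap.
        self._slots = threading.Semaphore(cap)

    def wait_until_ready(self, ready_event, timeout):
        # A capped handler only admits a request when a slot is free.
        if not self._slots.acquire(timeout=timeout):
            return False  # never admitted: the deadlock described above
        try:
            # Wait for the "placement group scheduled" signal.
            return ready_event.wait(timeout)
        finally:
            self._slots.release()

def demo(cap):
    waiter = CappedWaiter(cap)
    a_ready, b_ready = threading.Event(), threading.Event()
    b_ready.set()  # B is "scheduled" first, as in the example above
    results = {}
    t_a = threading.Thread(
        target=lambda: results.setdefault("A", waiter.wait_until_ready(a_ready, 1.0)))
    t_b = threading.Thread(
        target=lambda: results.setdefault("B", waiter.wait_until_ready(b_ready, 0.5)))
    t_a.start()
    time.sleep(0.1)  # ensure A's request grabs a slot before B's arrives
    t_b.start()
    t_a.join()
    t_b.join()
    return results
```

With `cap=1`, B's wait fails even though B is already "scheduled", because A's pending request holds the only slot; with `cap=2` (effectively uncapped for two requests), B resolves immediately, which is the behavior this PR restores.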