[core] uncap WaitPlacementGroupUntilReady to resolve pg.ready() deadlocks #62086
MengjinYan merged 2 commits into ray-project:master
Conversation
…ocks Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Code Review
This pull request uncaps the WaitPlacementGroupUntilReady RPC handler in src/ray/gcs/grpc_services.cc by setting its concurrency limit to -1. This change aims to prevent deadlocks that could arise from a capped handler when placement group scheduling order differs from client pg.ready() calls. The reviewer noted that while this change resolves deadlocks, it introduces a risk of resource exhaustion. They suggested updating the code comment to reflect this trade-off and adding a new metric to track active WaitPlacementGroupUntilReady requests for better production diagnostics.
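The metric suggested by the reviewer could be as simple as a gauge of in-flight requests that is incremented on handler entry and decremented on exit. Ray's GCS handlers are C++, so the following is only an illustrative Python sketch of the idea (the class name `InflightGauge` is hypothetical, not a Ray API):

```python
import threading

class InflightGauge:
    """Counts requests currently inside a handler (illustrative sketch only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0

    def __enter__(self):
        # Handler entry: one more request is in flight.
        with self._lock:
            self._active += 1
        return self

    def __exit__(self, *exc):
        # Handler exit (normal or exceptional): request is done.
        with self._lock:
            self._active -= 1
        return False

    @property
    def active(self):
        with self._lock:
            return self._active
```

Wrapping the handler body in `with gauge:` and exporting `gauge.active` would let operators see a buildup of blocked `WaitPlacementGroupUntilReady` calls before it turns into resource exhaustion.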
LGTM!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be4cc545f5
…ocks (ray-project#62086) Signed-off-by: Rueian Huang <rueiancsie@gmail.com> Signed-off-by: Frank Mancina <fmancina@haproxy.com>
Description

I found this issue while debugging timeouts on the `test_incremental_pg_and_actor_scheduling` test. The test creates many placement groups on a 0-CPU cluster, adds CPUs to the cluster one by one, and expects `pg.ready()` to resolve one by one too.

However, we have a cap on the `WaitPlacementGroupUntilReady` RPC handler, which is what `pg.ready()` uses under the hood. Capped `WaitPlacementGroupUntilReady` requests are only handled once previous requests resolve, but the actual placement group scheduling order can differ from the order in which these requests are handled, and this can cause deadlocks. For example:

Let's say we have a cluster with 0 CPUs and a `WaitPlacementGroupUntilReady` cap of 1. We create two placement groups, A and B, each requiring 1 CPU, and issue `pg.ready()` on both of them. One of the `pg.ready()` calls will be capped out; let's say it is B's.

Now we add a 1-CPU worker node to the cluster, and one of the placement groups gets scheduled; let's say it is B as well. At this point, even though placement group B is scheduled, the `pg.ready()` for B won't resolve. It can only resolve after the `pg.ready()` for A resolves, but there is no CPU left for A to be scheduled.

Solution

Make `WaitPlacementGroupUntilReady` uncapped.

Related issues

anyscale#891
anyscale#888

Additional information
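The deadlock mechanism described above does not depend on Ray itself; it can be simulated with a bounded semaphore standing in for the handler cap and an event standing in for a placement group being scheduled. This is a minimal sketch (the `CappedWaiter` class and timeouts are illustrative assumptions, not Ray code):

```python
import threading
import time

class CappedWaiter:
    """Simulates a capped WaitPlacementGroupUntilReady-style handler."""

    def __init__(self, cap):
        # The semaphore plays the role of the handler concurrency cap.
        self._slots = threading.Semaphore(cap)

    def wait_until_ready(self, ready_event, timeout):
        # A capped handler only admits a request when a slot is free.
        if not self._slots.acquire(timeout=timeout):
            return False  # never admitted: the deadlock described above
        try:
            # Wait for the "placement group scheduled" signal.
            return ready_event.wait(timeout)
        finally:
            self._slots.release()

def demo(cap):
    waiter = CappedWaiter(cap)
    a_ready, b_ready = threading.Event(), threading.Event()
    b_ready.set()  # B is "scheduled" first, as in the example above
    results = {}
    t_a = threading.Thread(
        target=lambda: results.setdefault("A", waiter.wait_until_ready(a_ready, 1.0)))
    t_b = threading.Thread(
        target=lambda: results.setdefault("B", waiter.wait_until_ready(b_ready, 0.5)))
    t_a.start()
    time.sleep(0.1)  # ensure A's request grabs a slot before B's arrives
    t_b.start()
    t_a.join()
    t_b.join()
    return results
```

With `cap=1`, B's wait fails even though B is already "scheduled", because A's pending request holds the only slot; with `cap=2` (effectively uncapped for two requests), B resolves immediately, which is the behavior this PR restores.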