[core] fix placement groups with label domain being stuck on the infeasible queue #62483
Merged
MengjinYan merged 11 commits into ray-project:master on Apr 25, 2026
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Code Review
This pull request improves label-locality-aware rescheduling for placement groups in Ray by ensuring that if all bundles become unplaced due to a total domain failure, the placement group is moved back to the pending queue to allow for rescheduling on a new domain. The changes include adding a helper method to check for all unplaced bundles and updating the GCS placement group manager and scheduler logic. I have provided feedback to improve the readability of a lambda capture and to correct typos in a code comment.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: aaronscalene <aaron.li@anyscale.com>
Signed-off-by: Joshua Lee <joshlee@anyscale.com>
MengjinYan approved these changes on Apr 25, 2026
pushpavanthar pushed a commit to pushpavanthar/ray that referenced this pull request on Apr 25, 2026:
…asible queue (ray-project#62483)
Description
Suppose a placement group has its bundles scheduled on nodes in some label domain. Some of those nodes go down, so the scheduler tries to reschedule the affected bundles onto the same domain assignment. That attempt fails, and the placement group is placed into the infeasible queue.
Now suppose that, while the placement group is infeasible, the nodes for all of its bundles go down, leaving every bundle unplaced. Currently no code wakes up scheduling for this placement group from the infeasible queue, so it stays stuck there until OnNodeAdd clears the queue. This PR fixes that: when all bundles become unplaced, the placement group is moved back to the pending queue so it can be rescheduled on a new domain.
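The scenario above can be illustrated with a small toy model in Python. This is only a sketch of the queue transition described here, not Ray's GCS code (which is C++): `ToyPlacementGroup`, `ToyManager`, `all_bundles_unplaced`, `on_node_dead`, and the `requeue_when_all_unplaced` flag are all hypothetical names chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional


@dataclass
class ToyPlacementGroup:
    """Hypothetical stand-in for a placement group; not Ray's real class."""

    name: str
    # Maps bundle index -> node id, or None when the bundle is unplaced.
    bundles: Dict[int, Optional[str]] = field(default_factory=dict)

    def all_bundles_unplaced(self) -> bool:
        # True when no bundle is currently assigned to a node.
        return all(node is None for node in self.bundles.values())


class ToyManager:
    """Toy model of the pending and infeasible queues (assumed structure)."""

    def __init__(self) -> None:
        self.pending: Deque[ToyPlacementGroup] = deque()
        self.infeasible: Deque[ToyPlacementGroup] = deque()

    def on_node_dead(self, node_id: str, requeue_when_all_unplaced: bool = True) -> None:
        # Iterate over a copy because groups may be removed from the queue.
        for pg in list(self.infeasible):
            lost_bundle = False
            for index, node in pg.bundles.items():
                if node == node_id:
                    pg.bundles[index] = None  # the bundle lost its node
                    lost_bundle = True
            # Modeled fix: once every bundle is unplaced, the old domain
            # assignment no longer constrains the group, so retry scheduling.
            if lost_bundle and requeue_when_all_unplaced and pg.all_bundles_unplaced():
                self.infeasible.remove(pg)
                self.pending.append(pg)


# One bundle already lost its node; the group sits on the infeasible queue.
pg = ToyPlacementGroup("pg", {0: None, 1: "rack-a-node-2"})
manager = ToyManager()
manager.infeasible.append(pg)

# The last remaining node in the old domain dies.
manager.on_node_dead("rack-a-node-2")
```

With `requeue_when_all_unplaced=False` the toy group stays on the infeasible queue after the last node dies, which mirrors the stuck behavior described above.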
Testing
If you copy the test and run it without these changes, it currently fails with a timeout on the pg2.ready() line.
With these changes applied, the same test succeeds.
Command to run:
python -m pytest python/ray/tests/test_bundle_label_selector.py::test_scheduling_feasible_after_rack_kill