[core] fix placement groups with label domain being stuck on the infeasible queue #62483

Merged
MengjinYan merged 11 commits into ray-project:master from aaronscalene:aaron/pg-infeasible-queue-fix
Apr 25, 2026
@aaronscalene aaronscalene commented Apr 9, 2026

Description

Suppose a placement group has its bundles scheduled on nodes within some label domain. Some of those nodes go down, so the scheduler tries to reschedule the affected bundles onto the same domain assignment. That attempt fails, and the placement group is placed on the infeasible queue.

Now that the placement group is infeasible, suppose the nodes for all of its remaining bundles also go down, leaving every bundle unplaced. There is currently no code that wakes up scheduling for this placement group from the infeasible queue, so it stays stuck there until OnNodeAdd clears the queue. The changes in this PR address that case.
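The scenario above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions (one bundle per node, one domain label per node); `Bundle`, `PlacementGroup`, `candidate_domains`, and `kill_node` are hypothetical names for illustration and do not correspond to Ray's GCS code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Bundle:
    # Node the bundle currently occupies, or None when it is unplaced.
    node: Optional[str] = None


@dataclass
class PlacementGroup:
    bundles: List[Bundle] = field(default_factory=list)


def candidate_domains(pg: PlacementGroup, alive: Dict[str, str]) -> List[str]:
    """Return the domains that could host the unplaced bundles.

    `alive` maps each live node to its domain label. Assumption: while any
    bundle is still placed, the group stays pinned to that bundle's domain.
    """
    placed = [b.node for b in pg.bundles if b.node is not None]
    unplaced = len(pg.bundles) - len(placed)
    # Count free (unoccupied) nodes per domain.
    free: Dict[str, int] = {}
    for node, domain in alive.items():
        if node not in placed:
            free[domain] = free.get(domain, 0) + 1
    if placed:
        pinned = alive[placed[0]]
        return [pinned] if free.get(pinned, 0) >= unplaced else []
    # Nothing is placed any more, so any domain with enough room qualifies.
    return sorted(d for d, n in free.items() if n >= unplaced)


def kill_node(pg: PlacementGroup, alive: Dict[str, str], node: str) -> None:
    """Remove a node from the cluster and unplace any bundle it hosted."""
    alive.pop(node, None)
    for bundle in pg.bundles:
        if bundle.node == node:
            bundle.node = None


alive = {"a1": "rack-a", "a2": "rack-a", "b1": "rack-b", "b2": "rack-b"}
pg = PlacementGroup([Bundle("a1"), Bundle("a2")])

kill_node(pg, alive, "a1")
after_partial = candidate_domains(pg, alive)  # still pinned to rack-a

kill_node(pg, alive, "a2")
after_total = candidate_domains(pg, alive)  # no pin left, rack-b fits
```

In this model the group is infeasible after the partial failure but becomes schedulable again once every bundle is unplaced, which is why something has to re-trigger scheduling at that point.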

Testing

If you copy the test and run it without these changes, it should fail with a timeout on the pg2.ready() line.
With these changes applied, the same test should succeed.

Command to run:
python -m pytest python/ray/tests/test_bundle_label_selector.py::test_scheduling_feasible_after_rack_kill

Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
@aaronscalene aaronscalene requested a review from a team as a code owner April 9, 2026 21:58
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request improves label-locality-aware rescheduling for placement groups in Ray by ensuring that if all bundles become unplaced due to a total domain failure, the placement group is moved back to the pending queue to allow for rescheduling on a new domain. The changes include adding a helper method to check for all unplaced bundles and updating the GCS placement group manager and scheduler logic. I have provided feedback to improve the readability of a lambda capture and to correct typos in a code comment.
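As a rough illustration of the behavior summarized in this review, the sketch below models an "all bundles unplaced" check and the move from the infeasible queue back to the pending queue. `ToyPlacementGroup`, `ToyManager`, and their methods are hypothetical names chosen for this example; the actual change lives in the C++ GCS placement group manager and scheduler and may be structured differently.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class ToyPlacementGroup:
    name: str
    # One entry per bundle: the hosting node, or None when unplaced.
    bundle_nodes: List[Optional[str]]

    def all_bundles_unplaced(self) -> bool:
        """True when no bundle is assigned to a node."""
        return all(node is None for node in self.bundle_nodes)


@dataclass
class ToyManager:
    pending: Deque[ToyPlacementGroup] = field(default_factory=deque)
    infeasible: List[ToyPlacementGroup] = field(default_factory=list)

    def on_node_dead(self, node: str) -> None:
        """Unplace bundles on the dead node and requeue fully unplaced groups."""
        for pg in list(self.infeasible):
            pg.bundle_nodes = [None if n == node else n for n in pg.bundle_nodes]
            if pg.all_bundles_unplaced():
                # Assumed fix: the group is no longer tied to its old domain,
                # so it goes back to the pending queue for another attempt.
                self.infeasible.remove(pg)
                self.pending.append(pg)


pg = ToyPlacementGroup("pg1", [None, "a2"])
mgr = ToyManager(infeasible=[pg])

mgr.on_node_dead("b1")  # unrelated node: one bundle is still placed
still_stuck = pg in mgr.infeasible

mgr.on_node_dead("a2")  # last placed bundle is lost
requeued = [p.name for p in mgr.pending]
```

Here the group only leaves the infeasible queue once its last placed bundle is gone; a partially placed group is left where it is.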

Comment thread src/ray/gcs/gcs_placement_group_manager.cc Outdated
Comment thread src/ray/gcs/gcs_placement_group_scheduler.cc Outdated
@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label Apr 10, 2026
@aaronscalene aaronscalene marked this pull request as draft April 10, 2026 18:44
aaronscalene and others added 4 commits April 17, 2026 11:21
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: aaronscalene <aaron.li@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: aaronscalene <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
@aaronscalene aaronscalene marked this pull request as ready for review April 17, 2026 18:49
@aaronscalene aaronscalene changed the title [core] Fix placement groups with label domain being stuck on the infeasible queue [core] fix placement groups with label domain being stuck on the infeasible queue Apr 20, 2026
Signed-off-by: Joshua Lee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 left a comment

nice find! 🚢

@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Apr 24, 2026
Signed-off-by: Joshua Lee <joshlee@anyscale.com>
@MengjinYan MengjinYan merged commit 65d2640 into ray-project:master Apr 25, 2026
6 checks passed
pushpavanthar pushed a commit to pushpavanthar/ray that referenced this pull request Apr 25, 2026
…asible queue (ray-project#62483)

Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaronscalene <aaron.li@anyscale.com>
Signed-off-by: Joshua Lee <joshlee@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Joshua Lee <joshlee@anyscale.com>
Signed-off-by: Purushotham Pushpavanth <pushpavanthar@gmail.com>