Skip to content

[RLlib] Attempt aggregator actor / learner grouped scheduling by default#63223

Closed
ArturNiederfahrenhorst wants to merge 17 commits into
ray-project:masterfrom
ArturNiederfahrenhorst:rllib-aggregator-node-affinity
Closed

[RLlib] Attempt aggregator actor / learner grouped scheduling by default#63223
ArturNiederfahrenhorst wants to merge 17 commits into
ray-project:masterfrom
ArturNiederfahrenhorst:rllib-aggregator-node-affinity

Conversation

@ArturNiederfahrenhorst
Copy link
Copy Markdown
Contributor

@ArturNiederfahrenhorst ArturNiederfahrenhorst commented May 8, 2026

Prerequisite: #63303

Description

We currently don't force aggregator actors onto the same nodes as learners.
This means that they will be placed on arbitrary nodes (sometimes with a learner, sometimes a worker node).

The simple and deterministic thing to do is to place them next to their specific learners, which is what we do with this PR.
If they can not be placed there, we can place them elsewhere. This is a soft change in that it does not change requirements to cluster shape. This PR also gives a bit more control over placement by exposing custom resource requirements for aggregator actors.

From my own experience, this change improves stability in cases where the batches produced by aggregator actors are on the larger side. They can get queued up in object store with multiple in flight learner updates which can lead to instabilities, especially if this happens on "random EnvRunenr worker nodes" and needs to be serialized and sent to learners over network. This also offers a simple mental model (just colocate aggregator actors with "their" learners).

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements proactive co-location of AggregatorActors with their assigned Learners using Ray's NodeAffinitySchedulingStrategy, replacing the previous post-hoc matching logic. It introduces new configuration parameters for aggregator resources and node affinity softness, along with a multi-node test to verify placement. Feedback highlights a potential ValueError if 'CPU' is redundantly specified in custom resources and suggests optimizing actor creation by using .options() instead of re-wrapping the remote class within a loop.

Comment thread rllib/algorithms/algorithm.py Outdated
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
@ArturNiederfahrenhorst ArturNiederfahrenhorst marked this pull request as ready for review May 8, 2026 15:18
@ArturNiederfahrenhorst ArturNiederfahrenhorst requested a review from a team as a code owner May 8, 2026 15:18
Comment thread rllib/algorithms/algorithm.py Outdated
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
@ray-gardener ray-gardener Bot added the rllib RLlib related issues label May 8, 2026
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Comment thread rllib/algorithms/algorithm.py
@ArturNiederfahrenhorst ArturNiederfahrenhorst changed the title [RLlib] Simple node affinity scheduling for aggregator actors [RLlib] Attempt aggregator actor / learner grouped scheduling by default May 12, 2026
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Comment thread rllib/algorithms/utils.py Outdated
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Comment thread rllib/algorithms/algorithm.py Outdated
Mirror of `custom_resources_per_env_runner`: lets users attach
custom Ray resource requirements to each Learner worker. Plumbed
into `learner_group.py`'s `resources_per_learner` dict so the
custom resources are claimed alongside CPU/GPU when Ray Train
schedules Learners.

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit d757cac. Configure here.

"model",
"optimizer",
"custom_resources_per_env_runner",
"custom_resources_per_learner",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing custom_resources_per_aggregator_actor in _allow_unknown_subkeys

Medium Severity

custom_resources_per_learner was correctly added to _allow_unknown_subkeys, but custom_resources_per_aggregator_actor was not. This list governs whether arbitrary user-defined sub-keys (like {"my_resource": 0.5}) are accepted during dict-based config merges (used by deep_update in update_from_dict and Tune's param_space). Without it, users setting custom aggregator resources via dict config (common in Tune trials) will have those sub-keys rejected or silently dropped during config merging.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit d757cac. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rllib RLlib related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant