[RLlib] Attempt aggregator actor / learner grouped scheduling by default by ArturNiederfahrenhorst · Pull Request #63223 · ray-project/ray

ArturNiederfahrenhorst · 2026-05-08T08:48:22Z

Prerequisite: #63303

Description

We currently don't force aggregator actors onto the same nodes as learners.
As a result, they are placed on arbitrary nodes (sometimes next to a learner, sometimes on a worker node).

The simple and deterministic thing to do is to place them next to their specific learners, which is what this PR does.
If they cannot be placed there, they are placed elsewhere. This is a soft change in that it does not change the requirements on cluster shape. This PR also gives a bit more control over placement by exposing custom resource requirements for aggregator actors.

From my own experience, this change improves stability in cases where the batches produced by aggregator actors are on the larger side. With multiple in-flight learner updates, these batches can queue up in the object store, which can lead to instabilities, especially if this happens on "random EnvRunner worker nodes" and the batches need to be serialized and sent to learners over the network. This also offers a simple mental model (just colocate aggregator actors with "their" learners).
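
The colocation idea described above can be sketched roughly as follows. This is an illustration, not the code in this PR: `plan_aggregator_placement` and `to_ray_options` are hypothetical helpers, and the node IDs are placeholders (real Ray node IDs are hex strings). The only Ray API assumed is `NodeAffinitySchedulingStrategy(node_id=..., soft=...)`, where `soft=True` lets Ray fall back to another node if the preferred one cannot host the actor.

```python
from typing import Dict, List


def plan_aggregator_placement(
    learner_node_ids: List[str],
    num_aggregators_per_learner: int,
    soft: bool = True,
) -> List[Dict[str, object]]:
    """Return one placement record per aggregator actor (hypothetical helper).

    Each record names the learner the aggregator feeds and the node that
    learner runs on. soft=True means "prefer this node, but fall back".
    """
    plan = []
    for learner_idx, node_id in enumerate(learner_node_ids):
        for _ in range(num_aggregators_per_learner):
            plan.append(
                {"learner_idx": learner_idx, "node_id": node_id, "soft": soft}
            )
    return plan


def to_ray_options(record: Dict[str, object]) -> Dict[str, object]:
    """Translate a record into kwargs for `ActorClass.options(...)`.

    Ray is imported lazily so the planning logic stays usable without it.
    """
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    return {
        "scheduling_strategy": NodeAffinitySchedulingStrategy(
            node_id=record["node_id"], soft=record["soft"]
        )
    }


# Placeholder node IDs, two learners with two aggregators each.
plan = plan_aggregator_placement(["node-a", "node-b"], num_aggregators_per_learner=2)
print([(r["learner_idx"], r["node_id"]) for r in plan])
```

Keeping the affinity soft matches the description's point that cluster-shape requirements do not change: a cluster that could schedule the aggregators before can still schedule them.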

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

gemini-code-assist

Code Review

This pull request implements proactive co-location of AggregatorActors with their assigned Learners using Ray's NodeAffinitySchedulingStrategy, replacing the previous post-hoc matching logic. It introduces new configuration parameters for aggregator resources and node affinity softness, along with a multi-node test to verify placement. Feedback highlights a potential ValueError if 'CPU' is redundantly specified in custom resources and suggests optimizing actor creation by using .options() instead of re-wrapping the remote class within a loop.
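
The `ValueError` concern can be illustrated with a small guard. This sketch rests on an assumption about Ray's behavior: that `CPU` and `GPU` are expected via `num_cpus`/`num_gpus` and rejected inside the `resources` dict. `build_actor_options` is a hypothetical helper and `agg_slot` a made-up resource name; neither is taken from this PR.

```python
from typing import Dict, Optional


def build_actor_options(
    num_cpus: float,
    custom_resources: Optional[Dict[str, float]] = None,
) -> Dict[str, object]:
    """Build kwargs for `ActorClass.options(...)` (hypothetical helper).

    Fails early with a clear message if built-in resources are listed
    among the custom ones, instead of failing later at actor creation.
    """
    custom = dict(custom_resources or {})
    reserved = sorted({"CPU", "GPU"} & custom.keys())
    if reserved:
        raise ValueError(
            f"Pass {reserved} via num_cpus/num_gpus, not as custom resources."
        )
    return {"num_cpus": num_cpus, "resources": custom}


# "agg_slot" is an illustrative custom resource name.
options = build_actor_options(1, {"agg_slot": 0.5})
print(options)
```

With options built once per actor like this, the loop can call `.options(**options)` on an already wrapped actor class rather than calling `ray.remote(...)` again on each iteration, which is the optimization the review suggests.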

Mirror of `custom_resources_per_env_runner`: lets users attach custom Ray resource requirements to each Learner worker. Plumbed into `learner_group.py`'s `resources_per_learner` dict so the custom resources are claimed alongside CPU/GPU when Ray Train schedules Learners.

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit d757cac.

cursor · 2026-05-12T19:07:29Z

        "model",
        "optimizer",
        "custom_resources_per_env_runner",
+        "custom_resources_per_learner",


Missing custom_resources_per_aggregator_actor in _allow_unknown_subkeys

Medium Severity

custom_resources_per_learner was correctly added to _allow_unknown_subkeys, but custom_resources_per_aggregator_actor was not. This list governs whether arbitrary user-defined sub-keys (like {"my_resource": 0.5}) are accepted during dict-based config merges (used by deep_update in update_from_dict and Tune's param_space). Without it, users setting custom aggregator resources via dict config (common in Tune trials) will have those sub-keys rejected or silently dropped during config merging.
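
The effect of the allow-list can be shown with a simplified merge. This is a one-level sketch, not Ray's actual `deep_update` or RLlib's config code: `merge_config` is a hypothetical function, and it raises on unknown sub-keys where the real implementation may behave differently.

```python
from typing import Any, Dict, Iterable


def merge_config(
    base: Dict[str, Any],
    update: Dict[str, Any],
    allow_unknown_subkeys: Iterable[str] = (),
) -> Dict[str, Any]:
    """Merge `update` into a copy of `base` (simplified, one level deep).

    Sub-keys that do not exist in `base[key]` are accepted only if `key`
    is listed in `allow_unknown_subkeys`.
    """
    allowed = set(allow_unknown_subkeys)
    merged = dict(base)
    for key, value in update.items():
        if key not in base:
            raise KeyError(f"Unknown config key: {key!r}")
        if isinstance(base[key], dict) and isinstance(value, dict):
            unknown = sorted(set(value) - set(base[key]))
            if unknown and key not in allowed:
                raise KeyError(f"Unknown sub-keys {unknown} under {key!r}")
            merged[key] = {**base[key], **value}
        else:
            merged[key] = value
    return merged


defaults = {"custom_resources_per_aggregator_actor": {}, "lr": 0.001}
trial = {"custom_resources_per_aggregator_actor": {"my_resource": 0.5}}

# With the key on the allow-list, the user-defined resource survives the merge.
merged = merge_config(
    defaults, trial, allow_unknown_subkeys=["custom_resources_per_aggregator_actor"]
)
print(merged["custom_resources_per_aggregator_actor"])
```

Because custom resource names are user-defined, the default dict can never list them in advance, which is why such keys need to be on the allow-list.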

fix

d45a764

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

gemini-code-assist Bot reviewed May 8, 2026

Comment thread rllib/algorithms/algorithm.py Outdated

ArturNiederfahrenhorst added 2 commits May 8, 2026 16:03

Merge branch 'master' into rllib-aggregator-node-affinity

b3d08b4

fix

bf9446c

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

ArturNiederfahrenhorst marked this pull request as ready for review May 8, 2026 15:18

ArturNiederfahrenhorst requested a review from a team as a code owner May 8, 2026 15:18

cursor Bot reviewed May 8, 2026

Comment thread rllib/algorithms/algorithm.py Outdated

learner resources

6334e67

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

ray-gardener Bot added the rllib RLlib related issues label May 8, 2026

ArturNiederfahrenhorst added 3 commits May 11, 2026 22:53

fix

80bcf41

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Merge branch 'master' into rllib-aggregator-node-affinity

75731c8

wip

c0a5072

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor Bot reviewed May 12, 2026

Comment thread rllib/algorithms/algorithm.py

ArturNiederfahrenhorst changed the title ~~[RLlib] Simple node affinity scheduling for aggregator actors~~ [RLlib] Attempt aggregator actor / learner grouped scheduling by default May 12, 2026

wip

3497bb8

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor Bot reviewed May 12, 2026

Comment thread rllib/algorithms/utils.py Outdated

wip

2547625

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor Bot reviewed May 12, 2026

Comment thread rllib/algorithms/algorithm.py Outdated

ArturNiederfahrenhorst added 8 commits May 12, 2026 16:55

wip

240832c

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

polish

887f9a7

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

fix

c05e0c2

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

fix

5e705e6

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

better testing

e85cebf

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Merge branch 'custom-resources-per-learner' into rllib-aggregator-nod…

e6666d1

…e-affinity # Conflicts: # rllib/BUILD.bazel

better test

d757cac

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor Bot reviewed May 12, 2026

ArturNiederfahrenhorst closed this May 15, 2026

ArturNiederfahrenhorst wants to merge 17 commits into ray-project:master from ArturNiederfahrenhorst:rllib-aggregator-node-affinity
