[RLlib] Attempt aggregator actor / learner grouped scheduling by default #63223
ArturNiederfahrenhorst wants to merge 17 commits into
Code Review
This pull request implements proactive co-location of AggregatorActors with their assigned Learners using Ray's NodeAffinitySchedulingStrategy, replacing the previous post-hoc matching logic. It introduces new configuration parameters for aggregator resources and node affinity softness, along with a multi-node test to verify placement. Feedback highlights a potential ValueError if 'CPU' is redundantly specified in custom resources and suggests optimizing actor creation by using .options() instead of re-wrapping the remote class within a loop.
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Mirror of `custom_resources_per_env_runner`: lets users attach custom Ray resource requirements to each Learner worker. Plumbed into `learner_group.py`'s `resources_per_learner` dict so the custom resources are claimed alongside CPU/GPU when Ray Train schedules Learners. Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
…e-affinity # Conflicts: # rllib/BUILD.bazel
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d757cac.
    "model",
    "optimizer",
    "custom_resources_per_env_runner",
    "custom_resources_per_learner",
Missing custom_resources_per_aggregator_actor in _allow_unknown_subkeys
Medium Severity
custom_resources_per_learner was correctly added to _allow_unknown_subkeys, but custom_resources_per_aggregator_actor was not. This list governs whether arbitrary user-defined sub-keys (like {"my_resource": 0.5}) are accepted during dict-based config merges (used by deep_update in update_from_dict and Tune's param_space). Without it, users setting custom aggregator resources via dict config (common in Tune trials) will have those sub-keys rejected or silently dropped during config merging.
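To make the failure mode concrete, here is a minimal sketch of an allow-list-guarded deep merge. It is not RLlib's `deep_update`: the function `merge_config` and its signature are hypothetical, and the sketch assumes unknown sub-keys raise an error, whereas the real merge may instead drop them silently.

```python
from typing import Any, Dict, Iterable


def merge_config(
    base: Dict[str, Any],
    override: Dict[str, Any],
    allow_unknown_subkeys: Iterable[str] = (),
) -> Dict[str, Any]:
    """Hypothetical, simplified deep merge (NOT RLlib's deep_update).

    Keys missing from `base` are rejected, except inside dicts whose
    key is listed in `allow_unknown_subkeys`.
    """
    allowed = set(allow_unknown_subkeys)
    merged = dict(base)  # shallow copy so `base` is left untouched
    for key, value in override.items():
        if key not in merged:
            raise KeyError(f"Unknown config key: {key!r}")
        if isinstance(merged[key], dict) and isinstance(value, dict):
            if key in allowed:
                # User-defined sub-keys are accepted as-is.
                merged[key] = {**merged[key], **value}
            else:
                # Sub-keys must already exist in the defaults.
                merged[key] = merge_config(merged[key], value, allowed)
        else:
            merged[key] = value
    return merged


defaults = {
    "custom_resources_per_learner": {},
    "custom_resources_per_aggregator_actor": {},
}
# Mirrors the situation in the review: only the learner key is allow-listed.
allow = ["custom_resources_per_learner"]

ok = merge_config(
    defaults, {"custom_resources_per_learner": {"my_resource": 0.5}}, allow
)

try:
    merge_config(
        defaults,
        {"custom_resources_per_aggregator_actor": {"my_resource": 0.5}},
        allow,
    )
    rejected = False
except KeyError:
    rejected = True
```

In this sketch the learner resources merge cleanly while the aggregator resources are rejected, which is the asymmetry the review points out. Adding `"custom_resources_per_aggregator_actor"` to the allow-list removes it.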


Prerequisite: #63303
Description
We currently don't force aggregator actors onto the same nodes as learners.
This means that they are placed on arbitrary nodes (sometimes next to a learner, sometimes on a worker node).
The simple and deterministic thing to do is to place them next to their specific learners, which is what this PR does.
If they cannot be placed there, they are placed elsewhere. This makes the change soft: it does not change the requirements on cluster shape. This PR also gives a bit more control over placement by exposing custom resource requirements for aggregator actors.
From my own experience, this change improves stability when the batches produced by aggregator actors are on the larger side. These batches can queue up in the object store while multiple learner updates are in flight, which can lead to instabilities, especially if this happens on "random EnvRunner worker nodes" and the batches need to be serialized and sent to learners over the network. This also offers a simple mental model (just colocate aggregator actors with "their" learners).
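The "prefer the learner's node, fall back otherwise" rule described above can be modelled with a small toy sketch. This is not the PR's code and not Ray's scheduler: `place_aggregators`, the slot-counting model, and the node names are all assumptions made for illustration. In Ray itself, a soft preference of this kind is expressed with `NodeAffinitySchedulingStrategy(node_id=..., soft=True)`.

```python
from typing import Dict, List


def place_aggregators(
    learner_nodes: List[str],
    aggregators_per_learner: int,
    free_slots: Dict[str, int],
) -> Dict[str, str]:
    """Toy model of soft co-location (hypothetical helper, not Ray's scheduler).

    Each aggregator prefers its learner's node. If that node has no free
    slot, it falls back to the first node that still has one.
    """
    free = dict(free_slots)  # copy so the caller's dict is not mutated
    placement: Dict[str, str] = {}
    for learner_idx, preferred in enumerate(learner_nodes):
        for agg_idx in range(aggregators_per_learner):
            name = f"aggregator_{learner_idx}_{agg_idx}"
            if free.get(preferred, 0) > 0:
                target = preferred
            else:
                # Soft affinity: any node with spare capacity will do.
                target = next((n for n, s in free.items() if s > 0), None)
                if target is None:
                    raise RuntimeError(f"No capacity left for {name}")
            free[target] -= 1
            placement[name] = target
    return placement


placement = place_aggregators(
    learner_nodes=["node_a", "node_b"],
    aggregators_per_learner=2,
    free_slots={"node_a": 2, "node_b": 1, "node_c": 4},
)
```

Here both aggregators of the first learner land on `node_a`, while the second learner's node only has room for one aggregator, so its second aggregator spills over to `node_c`. No extra capacity is required on learner nodes, which is the sense in which the change leaves cluster-shape requirements unchanged.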