[train] support memory per worker #42999

matthewdeng · 2024-02-06T00:56:56Z

Why are these changes needed?

This lets users schedule Ray Train workers with a memory requirement by specifying "memory" in ScalingConfig.resources_per_worker. The additional logic is necessary because Ray actors take memory through the dedicated memory kwarg and reject a "memory" entry in the resources dictionary.

As part of this change, I am also changing the interface of BackendExecutor and WorkerGroup to simply take a dictionary for all resources. This matches the interface of ScalingConfig and PlacementGroup, and centralizes all logic to convert to Ray Actor/Task kwargs in WorkerGroup.

Example:

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


scaling_config = ScalingConfig(num_workers=2, resources_per_worker={"memory": 10_000})


def train_func():
    ...


trainer = TorchTrainer(train_func, scaling_config=scaling_config)
trainer.fit()

Before:

ValueError: The resources dictionary must not contain the key 'memory' or 'object_store_memory'

After:


(TorchTrainer pid=67598) Started distributed worker processes: 
(TorchTrainer pid=67598) - (ip=127.0.0.1, pid=67604) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=67598) - (ip=127.0.0.1, pid=67605) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=67604) Setting up process group for: env:// [rank=0, world_size=2]
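The conversion described above, from a single resources dictionary to Ray actor keyword arguments, can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `resources_to_actor_kwargs` is hypothetical, and defaulting missing "CPU" and "GPU" entries to 0 is an assumption made only for this sketch. The kwarg names `num_cpus`, `num_gpus`, `memory`, and `resources` are the ones Ray actor options accept.

```python
from typing import Any, Dict, Optional


def resources_to_actor_kwargs(
    resources_per_worker: Optional[Dict[str, float]],
) -> Dict[str, Any]:
    """Hypothetical helper: split a resources dict into Ray actor option kwargs."""
    # Copy so the caller's dict is not mutated; treat None as "no resources".
    remaining = dict(resources_per_worker or {})

    # Assumption for this sketch: missing CPU/GPU entries default to 0.
    kwargs: Dict[str, Any] = {
        "num_cpus": remaining.pop("CPU", 0),
        "num_gpus": remaining.pop("GPU", 0),
    }

    # "memory" has to be passed as its own kwarg, not inside `resources`.
    if "memory" in remaining:
        kwargs["memory"] = remaining.pop("memory")

    # Whatever is left is treated as custom resources.
    kwargs["resources"] = remaining
    return kwargs


actor_kwargs = resources_to_actor_kwargs({"CPU": 1, "memory": 10_000, "custom": 2})
print(actor_kwargs)
```

Keeping this mapping in one place means callers only ever deal with the dictionary form, which is the same shape that ScalingConfig and placement group bundles use.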

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Matthew Deng <matt@anyscale.com>

python/ray/train/_internal/worker_group.py

justinvyu

Thanks! See the comment about also needing to update ScalingConfig.as_placement_group_factory, and also a request for the unit test.
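To illustrate why the placement group side matters as well, here is a rough sketch of how per-worker resources, including "memory", might be expanded into placement group bundles, together with the kind of assertion a unit test could make. It is not the implementation of ScalingConfig.as_placement_group_factory: the helper `build_bundles`, its parameters, and the layout of one trainer bundle followed by one bundle per worker are assumptions made for this example.

```python
from typing import Dict, List


def build_bundles(
    trainer_resources: Dict[str, float],
    resources_per_worker: Dict[str, float],
    num_workers: int,
) -> List[Dict[str, float]]:
    """Hypothetical helper: one bundle for the trainer, then one per worker."""

    def non_zero(resources: Dict[str, float]) -> Dict[str, float]:
        # Drop zero-valued entries so a bundle never requests an empty resource.
        return {key: value for key, value in resources.items() if value != 0}

    trainer_bundle = non_zero(trainer_resources)
    worker_bundle = non_zero(resources_per_worker)

    # Skip the trainer bundle entirely if it requests nothing.
    bundles = [trainer_bundle] if trainer_bundle else []
    # Each worker gets its own copy, with "memory" kept alongside CPU/GPU.
    bundles += [dict(worker_bundle) for _ in range(num_workers)]
    return bundles


bundles = build_bundles({"CPU": 1}, {"CPU": 1, "memory": 10_000}, num_workers=2)

# The kind of check a unit test could make: every worker bundle carries memory.
assert all(bundle["memory"] == 10_000 for bundle in bundles[1:])
```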

python/ray/train/tests/test_worker_group.py

woshiyyya

Good to go!

Also, thanks for cleaning up the redundant input arguments.

matthewdeng added 4 commits February 5, 2024 16:55

[train] support memory per worker

d85f484

Signed-off-by: Matthew Deng <matt@anyscale.com>

lint

12c35bf

Signed-off-by: Matthew Deng <matt@anyscale.com>

handle None

88aa838

Signed-off-by: Matthew Deng <matt@anyscale.com>

lint

d8057fa

Signed-off-by: Matthew Deng <matt@anyscale.com>

matthewdeng assigned justinvyu Feb 6, 2024

matthewdeng requested a review from woshiyyya February 6, 2024 18:03

matthewdeng assigned woshiyyya Feb 6, 2024

matthewdeng marked this pull request as ready for review February 6, 2024 18:03

woshiyyya requested changes Feb 6, 2024

python/ray/train/_internal/worker_group.py (outdated, resolved)

justinvyu reviewed Feb 6, 2024

python/ray/train/tests/test_worker_group.py (outdated, resolved)

unify interface on resources_per_worker

fddecb4

Signed-off-by: Matthew Deng <matt@anyscale.com>

matthewdeng requested review from sven1977, avnishn, ArturNiederfahrenhorst, maxpumperla and kouroshHakha as code owners February 6, 2024 22:57

matthewdeng marked this pull request as draft February 6, 2024 22:58

matthewdeng added 5 commits February 6, 2024 15:51

fix tests

be7c441

Signed-off-by: Matthew Deng <matt@anyscale.com>

fix

2199de6

Signed-off-by: Matthew Deng <matt@anyscale.com>

fix rllib

972dc76

Signed-off-by: Matthew Deng <matt@anyscale.com>

fix rllib

259f379

Signed-off-by: Matthew Deng <matt@anyscale.com>

update test

d67801f

Signed-off-by: Matthew Deng <matt@anyscale.com>

matthewdeng marked this pull request as ready for review February 7, 2024 15:28

matthewdeng requested review from justinvyu and woshiyyya February 7, 2024 15:28

woshiyyya approved these changes Feb 23, 2024

kouroshHakha approved these changes Feb 23, 2024

matthewdeng merged commit 6908b12 into ray-project:master Feb 23, 2024
9 checks passed

matthewdeng deleted the train-worker-memory branch February 23, 2024 18:47
