
[Train] Support Accelerator Type in ScalingConfig #43090

Merged

Conversation


@woshiyyya woshiyyya commented Feb 11, 2024

Why are these changes needed?

Ray Core recently added support for specifying an accelerator type for remote tasks and actors [link].

This PR leverages this feature and allows users to specify an accelerator type, enabling Ray Train to schedule the trainer and workers onto nodes with the specified accelerators. Internally, Ray Train appends {"accelerator_type:A10G": 0.001} to all resource bundles of the trainer and workers.
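
To make this concrete, here is a minimal illustrative sketch (not the actual Ray Train implementation) of how a single worker bundle gets extended:

# Illustrative sketch only -- not the actual Ray Train internals.
accelerator_type = "A10G"
worker_bundle = {"GPU": 1}
if accelerator_type:
    # A tiny fractional amount acts as a placement constraint without
    # meaningfully consuming the node's "accelerator_type:A10G" resource.
    worker_bundle[f"accelerator_type:{accelerator_type}"] = 0.001
# worker_bundle is now {"GPU": 1, "accelerator_type:A10G": 0.001}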

This feature enables multiple use cases:

Example 1: Use a single accelerator type in a heterogeneous cluster.

For example, suppose you have a cluster with 16 x T4 and 16 x A10G GPUs. If you want to launch 16 workers on A10G GPUs instead of T4s, you can now specify the ScalingConfig as below:

scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
    accelerator_type="A10G"
)
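
As a usage sketch, the config above is passed to a trainer as usual; the training function here is a placeholder, not part of this PR:

from ray.train.torch import TorchTrainer

def train_func():
    # Placeholder per-worker training loop.
    ...

trainer = TorchTrainer(train_func, scaling_config=scaling_config)
result = trainer.fit()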

Example 2: Specify extra resources for global rank 0 worker.

For example, you are training with 16 x A10Gs, but want to launch the rank 0 worker on a node with more CPU memory for large model checkpointing.

You can specify accelerator_type="A10G", which ensures that the trainer is also scheduled on an A10G node. Since Ray Train always tries to colocate the trainer and the global rank 0 worker on the same node, you can leverage trainer_resources to allocate extra memory to the rank 0 worker:

scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
    trainer_resources={"memory": 200 * 1024 ** 3}, 
    accelerator_type="A10G"
)

In this case, the rank 0 worker and the trainer will be scheduled on the same A10G node, which has at least 200 GB of memory.
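
For intuition, a rough sketch (illustrative only, not the actual scheduling code) of the combined requirement that the rank 0 node has to satisfy under the config above:

# Illustrative only: the trainer and the rank 0 worker are colocated, so the
# node picked for rank 0 must roughly satisfy this combined set of resources.
combined_rank0_requirement = {
    "GPU": 1,                        # rank 0 worker
    "memory": 200 * 1024 ** 3,       # from trainer_resources (200 GiB)
    "accelerator_type:A10G": 0.001,  # accelerator constraint
}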

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

woshiyyya and others added 9 commits February 11, 2024 00:33
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
@woshiyyya woshiyyya marked this pull request as ready for review February 12, 2024 17:58
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

@justinvyu justinvyu left a comment


Nice testing! I have a few questions/small comments.

Also, can we add a section about why we chose to go with this solution over allowing users to pass in resources_per_worker as a list where the first bundle goes to rank 0?

  • trainer_resources as a concept is not necessary, and I would like to get rid of it in the long term.
  • So, this decision to double down on "colocating rank 0 with the trainer" makes it a bit harder to remove the trainer resources concept.
  • If we considered the problem without Tune Trainable creation in mind, would it make more sense to pass in a list of resources_per_worker?

python/ray/air/tests/test_api.py
python/ray/train/_internal/worker_group.py
python/ray/train/tests/test_data_parallel_trainer.py
python/ray/air/config.py
python/ray/air/config.py

woshiyyya commented Feb 12, 2024

trainer_resources as a concept is not necessary, and I would like to get rid of it in the long term.
So, this decision to double down on "colocating rank 0 with the trainer" makes it a bit harder to remove the trainer resources concept.

@justinvyu +1. I don't think trainer_resources should be a concept for Ray Train at all. However, since Ray Train is built on top of Tune Trainable, it is unavoidable to mention this concept, and this is the only way we can think of to support extra rank 0 resources without fully refactoring BackendExecutor and WorkerGroup.

If we considered the problem without Tune Trainable creation in mind, would it make more sense to pass in a list of resources_per_worker?

That was my initial idea, but there are a few limitations that blocked me from doing it:

  1. Ray Train always sorts the workers by node id and GPU id, and colocates the rank 0 worker with the trainer. So, under the current design, there's no guarantee that the global rank equals the corresponding bundle index. (major reason)
  2. Tune schedulers assume all workers have the same amount of resources. (implementation detail)

The conclusion is: if Ray Core cannot schedule workers in the order of GPU id, we cannot build a static mapping from rank to bundle, and thus we won't be able to use a list of resources_per_worker.

Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>

ericl commented Feb 13, 2024

In this case, the rank 0 worker and the trainer will be scheduled on the same A10G node, which has at least 200 GB of memory.

I want to understand the semantics here: it sounds like this is best effort, not a guarantee that the rank zero worker will be on the trainer node. If that's the case, it doesn't actually provide correctness for use cases that require more memory on rank 0.

Ray Train always sorts the workers by node id and GPU id, and colocates the rank 0 worker with the trainer. So, under the current design, there's no guarantee that the global rank equals the corresponding bundle index. (major reason)

We could disable the sorting if rank 0 resources are specified right?

Tune schedulers assume all workers have the same amount of resources.

Hmm not sure I get this, what's the concrete issue?

@ericl ericl self-assigned this Feb 13, 2024

woshiyyya commented Feb 13, 2024

I want to understand the semantics here: it sounds like this is best effort, not a guarantee that the rank zero worker will be on the trainer node. If that's the case, it doesn't actually provide correctness for use cases that require more memory on rank 0.

Yes, that's true. It will try to colocate the trainer and the rank 0 worker if it's feasible to accommodate the combined resource bundle on a single node.

We could disable the sorting if rank 0 resources are specified right?

Unfortunately, we can't. Libraries like DeepSpeed and Hugging Face Accelerate assume that the ranks are aligned with the GPU ids. More details in #40803.

Tune schedulers assume all workers have the same amount of resources.

It's an implementation detail. The Tune scheduler can dynamically adjust the workers' resources and assumes every worker has the same amount of base resources. It's not so important, since we can bypass this if necessary.

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

ericl commented Mar 1, 2024

Do we still need this PR to land the accelerator_type support?


woshiyyya commented Mar 1, 2024

@ericl Yes I'll try to merge this PR to land the accelerator_type support.

Actually, users can already specify an accelerator type in ScalingConfig today, e.g.:

ScalingConfig(
    ...,
    resources_per_worker={"accelerator_type:A100": 0.01, ...},
)

This PR provides a better user interface without the awkward 0.01 workaround.
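
For comparison, a sketch of the same intent with the new interface (the worker count here is illustrative):

from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
    accelerator_type="A100",  # replaces the manual {"accelerator_type:A100": 0.01} entry
)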

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
woshiyyya and others added 3 commits March 4, 2024 11:47
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

@justinvyu justinvyu left a comment


Does accelerator_type not need to be popped out of the dict and handled similarly to num_cpus, num_gpus, and memory? Is it ok to be left as an "additional resource"? ray.remote accepts accelerator_type as a named argument.

@woshiyyya

@justinvyu Yeah, I asked the core team and they said we can use it as an additional resource in Ray Train. With the current solution, I'm trying to avoid combining and splitting the accelerator_type key back and forth.
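
For reference, a small illustrative sketch of the named-argument form mentioned above (not code from this PR):

import ray

# ray.remote supports accelerator_type directly as a named argument.
@ray.remote(num_gpus=1, accelerator_type="A10G")
def gpu_task():
    return "ran on an A10G node"

# Ray Train instead keeps the constraint as an additional resource key on each
# bundle, e.g. {"GPU": 1, "accelerator_type:A10G": 0.001}, which avoids
# splitting and re-merging the key when building the bundles.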

Comment on lines 207 to 210
if self.accelerator_type:
    resources_per_worker[
        f"{RESOURCE_CONSTRAINT_PREFIX}{self.accelerator_type}"
    ] = 0.001
Contributor

Should we use setdefault, similar to GPU in the line above, in case the user manually sets resources_per_worker? Or validate that it's not set in the resources dict if we want this to be the only interface where the user can specify it.

Member Author

The "accelerator_type:{type}" key should only be used as internal api, users should not specify it in resource_per_worker, as we provided the ScalingConfig(accelerator_type=) api as the only entry.

For example, if a user needs two A100 GPUs, they can do:

ScalingConfig(
    num_workers=...,
    resources_per_worker={"GPU": 2},
    accelerator_type="A100",
)

But I agree with changing it to setdefault to provide more flexibility for advanced use cases.
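
A minimal sketch of the setdefault variant discussed here, mirroring the quoted snippet above (it assumes the same RESOURCE_CONSTRAINT_PREFIX constant and resources_per_worker dict):

if self.accelerator_type:
    accelerator_key = f"{RESOURCE_CONSTRAINT_PREFIX}{self.accelerator_type}"
    # Keep any value the user already set in resources_per_worker.
    resources_per_worker.setdefault(accelerator_key, 0.001)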

Comment on lines 240 to 243
if self.accelerator_type:
    trainer_resources[
        f"{RESOURCE_CONSTRAINT_PREFIX}{self.accelerator_type}"
    ] = 0.001
Contributor

Do we need to explicitly set it here too? Won't it get merged with rank 0?

Member Author

Nice catch. This shouldn't be here after we merged the previous colocation PR.

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Comment on lines +196 to +200
resources_per_worker = {"GPU": 1}
else:
return {"CPU": 1}
resources_per_worker = {
k: v for k, v in self.resources_per_worker.items() if v != 0
}
resources_per_worker = {"CPU": 1}
else:
resources_per_worker = {
Contributor

nit: I think we can clean up this branching logic a bit now since we don't return early anymore, but we can do it in a separate PR...

Member Author

OK, let's merge it first. I'll post a follow-up PR to remove the branching like colab & num_workers=None, etc.

@matthewdeng matthewdeng merged commit 1c9779a into ray-project:master Mar 8, 2024
9 checks passed