
[Train] Support Accelerator Type in ScalingConfig #43090

Merged

Conversation


@woshiyyya woshiyyya commented Feb 11, 2024

Why are these changes needed?

Ray Core recently added support for specifying an accelerator type for remote tasks and actors [link].

This PR leverages this feature and allows users to specify an accelerator type, enabling Ray Train to schedule the trainer and workers onto nodes with the specified accelerators. Internally, Ray Train appends {"accelerator_type:A10G": 0.001} to all resource bundles of the trainer and workers.
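
To make this concrete, here is a minimal illustrative sketch (not the actual Ray Train implementation) of how a single worker bundle gets extended:

# Illustrative sketch only -- not the actual Ray Train internals.
accelerator_type = "A10G"
worker_bundle = {"GPU": 1}
if accelerator_type:
    # A tiny fractional amount acts as a placement constraint without
    # meaningfully consuming the node's "accelerator_type:A10G" resource.
    worker_bundle[f"accelerator_type:{accelerator_type}"] = 0.001
# worker_bundle is now {"GPU": 1, "accelerator_type:A10G": 0.001}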

This feature enables multiple use cases:

Example 1: Use a single accelerator type in a heterogeneous cluster.

For example, suppose you have a cluster with 16 x T4 and 16 x A10G GPUs. If you want to launch 16 workers on A10G GPUs instead of T4s, you can now specify the ScalingConfig as below:

scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
    accelerator_type="A10G"
)
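
As a usage sketch, the config above is passed to a trainer as usual; the training function here is a placeholder, not part of this PR:

from ray.train.torch import TorchTrainer

def train_func():
    # Placeholder per-worker training loop.
    ...

trainer = TorchTrainer(train_func, scaling_config=scaling_config)
result = trainer.fit()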

Example 2: Specify extra resources for global rank 0 worker.

For example, you are training with 16 x A10Gs, but want to launch the rank 0 worker on a node with more CPU memory for large model checkpointing.

You can specify accelerator_type="A10G", which ensures that the trainer is also scheduled on an A10G node. Since Ray Train always tries to colocate the trainer and the global rank 0 worker on the same node, you can leverage trainer_resources to allocate extra memory to the rank 0 worker:

scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
    trainer_resources={"memory": 200 * 1024 ** 3}, 
    accelerator_type="A10G"
)

In this case, the rank 0 worker and the trainer will be scheduled on the same A10G node, which has at least 200 GB of memory.
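
For intuition, a rough sketch (illustrative only, not the actual scheduling code) of the combined requirement that the rank 0 node has to satisfy under the config above:

# Illustrative only: the trainer and the rank 0 worker are colocated, so the
# node picked for rank 0 must roughly satisfy this combined set of resources.
combined_rank0_requirement = {
    "GPU": 1,                        # rank 0 worker
    "memory": 200 * 1024 ** 3,       # from trainer_resources (200 GiB)
    "accelerator_type:A10G": 0.001,  # accelerator constraint
}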

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

woshiyyya and others added 9 commits February 11, 2024 00:33
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
@woshiyyya woshiyyya marked this pull request as ready for review February 12, 2024 17:58
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

@justinvyu justinvyu left a comment


Nice testing! I have a few questions/small comments.

Also, can we add a section about why we chose to go with this solution over allowing users to pass in resources_per_worker as a list where the first bundle goes to rank 0?

  • trainer_resources as a concept is not necessary, and I would like to get rid of it in the long term.
  • So, this decision to double down on "colocating rank 0 with the trainer" makes it a bit harder to remove the trainer resources concept.
  • If we considered the problem without Tune Trainable creation in mind, would it make more sense to pass in a list of resources_per_worker?

python/ray/air/tests/test_api.py
python/ray/train/_internal/worker_group.py
python/ray/train/tests/test_data_parallel_trainer.py
python/ray/air/config.py
python/ray/air/config.py

woshiyyya commented Feb 12, 2024

trainer_resources as a concept is not necessary, and I would like to get rid of it in the long term.
So, this decision to double down on "colocating rank 0 with the trainer" makes it a bit harder to remove the trainer resources concept.

@justinvyu +1. I don't think trainer_resources should be a concept for Ray Train at all. However, since Ray Train is built on top of Tune Trainable, it is unavoidable to mention this concept, and this is the only way we can think of to support extra rank 0 resources without fully refactoring BackendExecutor and WorkerGroup.

If we considered the problem without Tune Trainable creation in mind, would it make more sense to pass in a list of resources_per_worker?

That was my initial idea, but there are a few limitations that blocked me from doing it:

  1. Ray Train always sorts the workers by node id and GPU id, and colocates the rank 0 worker with the trainer. So, under the current design, there's no guarantee that the global rank equals the corresponding bundle index. (major reason)
  2. Tune schedulers assume all workers have the same amount of resources. (implementation detail)

The conclusion is: if Ray Core cannot schedule workers in the order of GPU id, we cannot build a static mapping from rank to bundle, and thus we won't be able to use a list of resources_per_worker.

Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>

ericl commented Feb 13, 2024

In this case, the rank 0 worker and the trainer will be scheduled on the same A10G node, which has at least 200 GB of memory.

I want to understand the semantics here: it sounds like this is best effort, not a guarantee that the rank zero worker will be on the trainer node. If that's the case, it doesn't actually provide correctness for use cases that require more memory on rank 0.

Ray Train always sorts the workers by node id and GPU id, and colocates the rank 0 worker with the trainer. So, under the current design, there's no guarantee that the global rank equals the corresponding bundle index. (major reason)

We could disable the sorting if rank 0 resources are specified right?

Tune schedulers assume all workers have the same amount of resources.

Hmm not sure I get this, what's the concrete issue?

@ericl ericl self-assigned this Feb 13, 2024

woshiyyya commented Feb 13, 2024

I want to understand the semantics here: it sounds like this is best effort, not a guarantee that the rank zero worker will be on the trainer node. If that's the case, it doesn't actually provide correctness for use cases that require more memory on rank 0.

Yes, that's true. It will try to colocate the trainer and the rank 0 worker if it's feasible to accommodate the combined resource bundle on a single node.

We could disable the sorting if rank 0 resources are specified right?

Unfortunately, we can't. Libraries like DeepSpeed and Hugging Face Accelerate assume that the ranks are aligned with the GPU ids. More details in #40803.

Tune schedulers assume all workers have the same amount of resources.

It's an implementation detail. The Tune scheduler can dynamically adjust the workers' resources and assumes every worker has the same amount of base resources. It's not so important, since we can bypass this if necessary.

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

ericl commented Mar 1, 2024

Do we still need this PR to land the accelerator_type support?


woshiyyya commented Mar 1, 2024

@ericl Yes I'll try to merge this PR to land the accelerator_type support.

Actually, users can already specify an accelerator type in ScalingConfig today, e.g.:

ScalingConfig(
    ...,
    resources_per_worker={"accelerator_type:A100": 0.01, ...},
)

This PR provides a better user interface without the awkward 0.01 workaround.
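
For comparison, a sketch of the same intent with the new interface (the worker count here is illustrative):

from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
    accelerator_type="A100",  # replaces the manual {"accelerator_type:A100": 0.01} entry
)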

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
woshiyyya and others added 3 commits March 4, 2024 11:47
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

@justinvyu justinvyu left a comment


Does accelerator_type not need to be popped out of the dict and handled similarly to num_cpus, num_gpus, and memory? Is it ok to be left as an "additional resource"? ray.remote accepts accelerator_type as a named argument.

@woshiyyya

@justinvyu Yeah, I asked the core team and they said we can use it as an additional resource in Ray Train. With the current solution, I'm trying to avoid combining and splitting the accelerator_type key back and forth.
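
For reference, a small illustrative sketch of the named-argument form mentioned above (not code from this PR):

import ray

# ray.remote supports accelerator_type directly as a named argument.
@ray.remote(num_gpus=1, accelerator_type="A10G")
def gpu_task():
    return "ran on an A10G node"

# Ray Train instead keeps the constraint as an additional resource key on each
# bundle, e.g. {"GPU": 1, "accelerator_type:A10G": 0.001}, which avoids
# splitting and re-merging the key when building the bundles.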

Comment on lines 207 to 210
if self.accelerator_type:
    resources_per_worker[
        f"{RESOURCE_CONSTRAINT_PREFIX}{self.accelerator_type}"
    ] = 0.001
Contributor

Should we use setdefault, similar to GPU in the line above, in case the user manually sets resources_per_worker? Or validate that it's not set in the resources dict if we want this to be the only interface where the user can specify it.

Member Author

The "accelerator_type:{type}" key should only be used as internal api, users should not specify it in resource_per_worker, as we provided the ScalingConfig(accelerator_type=) api as the only entry.

For example, if a user needs two A100 GPUs, they can do:

ScalingConfig(
    num_workers=...,
    resources_per_worker={"GPU": 2},
    accelerator_type="A100",
)

But I agree with changing it to setdefault to provide more flexibility for advanced use cases.
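
A minimal sketch of the setdefault variant discussed here, mirroring the quoted snippet above (it assumes the same RESOURCE_CONSTRAINT_PREFIX constant and resources_per_worker dict):

if self.accelerator_type:
    accelerator_key = f"{RESOURCE_CONSTRAINT_PREFIX}{self.accelerator_type}"
    # Keep any value the user already set in resources_per_worker.
    resources_per_worker.setdefault(accelerator_key, 0.001)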

Comment on lines 240 to 243
if self.accelerator_type:
    trainer_resources[
        f"{RESOURCE_CONSTRAINT_PREFIX}{self.accelerator_type}"
    ] = 0.001
Contributor

Do we need to explicitly set it here too? Won't it get merged with rank 0?

Member Author

Nice catch. This shouldn't be here after we merged the previous colocation PR.

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Comment on lines +196 to +200
resources_per_worker = {"GPU": 1}
else:
return {"CPU": 1}
resources_per_worker = {
k: v for k, v in self.resources_per_worker.items() if v != 0
}
resources_per_worker = {"CPU": 1}
else:
resources_per_worker = {
Contributor

nit: I think we can clean up this branching logic a bit now since we don't return early anymore, but we can do it in a separate PR...

Member Author

OK, let's merge it first. I'll post a follow-up PR to remove the branching like colab & num_workers=None, etc.

@matthewdeng matthewdeng merged commit 1c9779a into ray-project:master Mar 8, 2024
9 checks passed