[Train] Support Accelerator Type in ScalingConfig
#43090
Conversation
Nice testing! I have a few questions/small comments.
Also, can we add a section about why we chose this solution over letting users pass in resources_per_worker as a list, where the first bundle goes to rank 0?
- trainer_resources as a concept is not necessary, and I would like to get rid of it in the long term. So this decision to double down on "colocating rank 0 with the trainer" makes it a bit harder to remove the trainer resources concept.
- If we considered the problem without Tune Trainable creation in mind, would it make more sense to pass in a list of resources_per_worker? (See the sketch of that hypothetical interface below.)
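For concreteness, here is a purely hypothetical sketch of that list-based interface (this form does not exist in ScalingConfig today; the bundle contents are made up for illustration):

# Hypothetical only: resources_per_worker as a list, where index 0 is the
# bundle assigned to the global rank 0 worker.
resources_per_worker = [
    {"CPU": 8, "GPU": 1, "memory": 64 * 1024**3},  # rank 0 gets extra CPU/memory
    {"CPU": 2, "GPU": 1},  # bundles for the remaining ranks
    {"CPU": 2, "GPU": 1},
    {"CPU": 2, "GPU": 1},
]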
@justinvyu +1. I don't think
That was my initial idea, but there are a bunch of limitations that blocked me from doing it.
The conclusion is that if Ray Core cannot schedule workers in the order of GPU IDs, we cannot build a static mapping from …
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
I want to understand the semantics here. It sounds like this is best effort, i.e. not guaranteed that the rank zero worker will be on the trainer node. If that's the case, it doesn't actually provide correctness for use cases that require more memory on rank 0.
We could disable the sorting if rank 0 resources are specified, right?
Hmm, not sure I get this. What's the concrete issue?
Yes, it's true. It will try to colocate the trainer and the rank 0 worker if it's feasible to accommodate the combined resource bundle on a single node.
Unfortunately we can't. Libraries like DeepSpeed and Hugging Face Accelerate assume that the ranks are aligned with the GPU IDs. More details in #40803.
It's an implementation detail. The Tune scheduler can dynamically adjust the workers' resources and assumes every worker has the same amount of base resources. Not so important, since we can bypass this if necessary.
Do we still need this PR to land the accelerator_type support?
@ericl Yes, I'll try to merge this PR to land the accelerator_type support. Users can already specify the accelerator type today via the awkward 0.01 resource workaround; this PR provides a better user interface without it.
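For context, a sketch contrasting the existing workaround with the interface added in this PR (the resource key follows the "accelerator_type:<type>" convention described in this PR; the other field values are illustrative):

from ray.train import ScalingConfig

# Existing workaround: attach the accelerator constraint as a raw resource key.
old_style = ScalingConfig(
    num_workers=16,
    use_gpu=True,
    resources_per_worker={"GPU": 1, "accelerator_type:A10G": 0.01},
)

# Interface added in this PR: a dedicated accelerator_type argument.
new_style = ScalingConfig(
    num_workers=16,
    use_gpu=True,
    accelerator_type="A10G",
)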
Does accelerator_type not need to be popped out of the dict and handled similarly to num_cpus, num_gpus, and memory? Is it ok for it to be left as an "additional resource"? ray.remote accepts accelerator_type as a named argument.
@justinvyu Yeah, I've asked the core team and they said we can use it as an additional resource in Ray Train. With the current solution, I'm trying to avoid combining and splitting the accelerator_type key back and forth.
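To illustrate the distinction, a sketch of the two forms in Ray Core (ray.remote does accept accelerator_type as a named argument; the raw-resource form mirrors the 0.001 constraint used in this PR and is shown here only for comparison):

import ray

# Named-argument form: request a node with a specific accelerator type.
@ray.remote(num_gpus=1, accelerator_type="A10G")
def train_task():
    ...

# "Additional resource" form: the same constraint expressed as a raw
# resource key, which is how Ray Train carries it in this PR.
@ray.remote(num_gpus=1, resources={"accelerator_type:A10G": 0.001})
def train_task_raw():
    ...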
python/ray/air/config.py
if self.accelerator_type:
    resources_per_worker[
        f"{RESOURCE_CONSTRAINT_PREFIX}{self.accelerator_type}"
    ] = 0.001
Should we use setdefault, similar to GPU in the line above, in case the user manually sets it in resources_per_worker? Or validate that it's not set in the resources dict, if we want this to be the only interface where the user can specify it.
The "accelerator_type:{type}"
key should only be used as internal api, users should not specify it in resource_per_worker
, as we provided the ScalingConfig(accelerator_type=)
api as the only entry.
For example, if a user needs two A100 GPUs, they can do:
ScalingConfig(
    num_workers=...,
    resources_per_worker={"GPU": 2},
    accelerator_type="A100",
)
But I agree with changing it to setdefault to provide more flexibility for advanced use cases.
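A sketch of the setdefault variant being discussed, assuming the same RESOURCE_CONSTRAINT_PREFIX constant and 0.001 value as in the diff above:

if self.accelerator_type:
    # Only add the constraint if the user hasn't already set this key
    # explicitly in resources_per_worker.
    resources_per_worker.setdefault(
        f"{RESOURCE_CONSTRAINT_PREFIX}{self.accelerator_type}", 0.001
    )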
python/ray/air/config.py
if self.accelerator_type:
    trainer_resources[
        f"{RESOURCE_CONSTRAINT_PREFIX}{self.accelerator_type}"
    ] = 0.001
Do we need to explicitly set it here too? Won't it get merged with rank 0?
Nice catch. This shouldn't be here after we merged the previous colocate PR.
resources_per_worker = {"GPU": 1} | ||
else: | ||
return {"CPU": 1} | ||
resources_per_worker = { | ||
k: v for k, v in self.resources_per_worker.items() if v != 0 | ||
} | ||
resources_per_worker = {"CPU": 1} | ||
else: | ||
resources_per_worker = { |
nit: I think we can clean up this branching logic a bit now since we don't return early anymore, but we can do it in a separate PR...
OK, let's merge it first. I'll post a follow-up PR to remove the branching like colab & num_workers=None, etc.
Why are these changes needed?
Ray Core recently added support for specifying an accelerator type for remote tasks and actors [link].
This PR leverages that feature and allows users to specify accelerator types, enabling Ray Train to schedule trainers and workers onto nodes with the specified accelerators. Internally, Ray Train appends
{"accelerator_type:A10G": 0.001}
to all of the resource bundles of the trainer and workers.
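For illustration, a sketch of what a worker's bundle effectively looks like once the constraint is appended (assuming one GPU per worker and accelerator_type="A10G"):

# Effective per-worker bundle after Ray Train appends the accelerator constraint.
resources_per_worker = {"GPU": 1}
accelerator_type = "A10G"
resources_per_worker[f"accelerator_type:{accelerator_type}"] = 0.001
# -> {"GPU": 1, "accelerator_type:A10G": 0.001}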
This feature enables multiple use cases:
Example 1: Use a single accelerator type in a heterogeneous cluster.
For example, you have a cluster with 16 x T4 and 16 x A10G GPUs. If you want to launch 16 workers on A10G GPUs instead of T4s, you can now specify the ScalingConfig as below:
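A minimal sketch of such a ScalingConfig (the field values are illustrative):

from ray.train import ScalingConfig

# Launch 16 GPU workers, each scheduled onto a node with A10G accelerators.
scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
    accelerator_type="A10G",
)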
Example 2: Specify extra resources for the global rank 0 worker.
For example, you are training with 16 x A10G GPUs, but want to launch the rank 0 worker on a node with more CPU memory for large model checkpointing.
You can specify accelerator_type="A10G", which ensures that the trainer is also scheduled on an A10G node. Since Ray Train always tries to colocate the trainer and the global rank 0 worker on the same node, you can leverage trainer_resources to allocate extra memory to the rank 0 worker, as shown below:
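A minimal sketch of that configuration (the 200 GB memory figure follows the description below; the other values are illustrative):

from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
    accelerator_type="A10G",
    # Extra memory for the trainer bundle, which Ray Train colocates with rank 0.
    trainer_resources={"memory": 200 * 1024**3},  # ~200 GB
)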
In this case, the rank 0 worker and the trainer will be scheduled on the same A10G node, which has at least 200 GB of memory.
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.