[core] Fix placement groups scheduling when no resources are specified #39946 #43269
Conversation
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Did you include the revert-revert fix?
Nope, I reverted the unnecessary refactoring. I think there are still some failures I need to handle; I will finish it today.
@jjyao after this PR, all scheduling requests will include "bundle": 0.001. All the test failures happen because when you call
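To make the discussion concrete, here is a minimal sketch (using the public Ray Python API) of the case this PR changes: a task that requests no resources scheduled into a placement group. The `"bundle": 0.001` request mentioned above is internal to the scheduler and never appears in this user-facing code.

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_cpus=2)

pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())

@ray.remote(num_cpus=0)  # task explicitly requests no resources
def where_am_i():
    # Returns the ID of the node the task actually ran on.
    return ray.get_runtime_context().get_node_id()

# Even though the task requests no resources, after this PR its internal
# scheduling request carries a tiny slice of the PG's "bundle" resource,
# so it is placed on a node that holds one of the PG's bundles.
node_id = ray.get(
    where_am_i.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote()
)
print(node_id)
```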
IMO, this is an implementation detail and should not be exposed to the end user via a public API.
cc @jjyao for the review
python/ray/_private/utils.py (outdated)
# it is an implementation detail.
# This resource is automatically added to the resource
# request for all tasks that require placement groups.
result = PLACEMENT_GROUP_BUNDLE_KEY_RESOURCE_PATTERN.match(key)
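For context, a hedged sketch of the filtering this snippet belongs to: hiding the internal bundle resources from a user-facing resource dict. The regex and helper name below are illustrative assumptions, not the exact code in `python/ray/_private/utils.py`.

```python
import re

# Illustrative pattern for the internal PG bundle resources, e.g.
# "bundle_group_<pg_id>" or "bundle_group_<index>_<pg_id>" (assumed naming).
PLACEMENT_GROUP_BUNDLE_KEY_RESOURCE_PATTERN = re.compile(
    r"^bundle_group(_\d+)?_[0-9a-f]+$"
)

def strip_internal_bundle_resources(resources: dict) -> dict:
    """Drop internal placement-group bundle resources before exposing a
    resource dict through a public API."""
    return {
        key: value
        for key, value in resources.items()
        if not PLACEMENT_GROUP_BUNDLE_KEY_RESOURCE_PATTERN.match(key)
    }
```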
Nit: I don't think we need to introduce another regex; we can just skip if the parsed original resource == "bundle".
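The reviewer's alternative, sketched under the assumption that formatted PG resources follow an `<original>_group_<...>_<pg_id>` layout: parse out the original resource name and skip it when it equals `"bundle"`, with no extra regex.

```python
def original_resource_name(key: str) -> str:
    # "CPU_group_0_<pg_id>" -> "CPU"; "bundle_group_<pg_id>" -> "bundle".
    # Assumes the "<original>_group_..." layout; plain names pass through.
    return key.split("_group_")[0]

def strip_internal_bundle_resources(resources: dict) -> dict:
    # Same filtering as above, but keyed on the parsed original name.
    return {
        key: value for key, value in resources.items()
        if original_resource_name(key) != "bundle"
    }
```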
Probably not in this PR, but ideally we should do the same thing for Java and C++ as well; otherwise they will see the bundle resource.
Good point. Fixed it, and I also added input validation to disallow "bundle".
> added input validation to disallow "bundle"

Will this break current users if they don't use placement groups? Maybe we should rename our "bundle" to something more private, like `_pg_bundle_`.
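A sketch of what that validation might look like; the function name, message, and call site are assumptions, not the PR's actual code.

```python
def validate_user_resources(resources: dict) -> None:
    """Reject resource requests that use the reserved 'bundle' name."""
    if resources and "bundle" in resources:
        raise ValueError(
            "'bundle' is reserved for placement-group internals and "
            "cannot be requested as a custom resource."
        )

validate_user_resources({"CPU": 1})        # OK
# validate_user_resources({"bundle": 1})   # raises ValueError
```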
…sources. (#43448) #43269 starts respecting placement groups when resources are not specified for tasks, but it still doesn't respect the bundle index. This PR fixes the issue by always including bundle-index-formatted resources when a bundle index is specified. It also adds a unit test to check that the observability API does not display formatted resources.
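A small usage sketch of the fixed behavior via the public scheduling-strategy API: a zero-resource task pinned to a specific bundle index.

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_cpus=2)

pg = placement_group([{"CPU": 1}, {"CPU": 1}])
ray.get(pg.ready())

@ray.remote(num_cpus=0)
def pinned():
    return "ran in bundle index 1"

# Before #43448 the bundle index was ignored for zero-resource tasks;
# now the internal request also includes the index-formatted bundle
# resource, so the task lands in bundle 1 specifically.
print(ray.get(
    pinned.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=1,
        )
    ).remote()
))
```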
This PR automatically merges the trainer bundle with the rank 0 worker bundle, so that the trainer and the rank 0 worker can always colocate on the same node.

### Benefits:
- Enables users to specify additional resources for the rank 0 worker.
- Always colocates trainers and rank 0 workers, making the scheduling behavior deterministic.

### Major changes:

#### 1. Merge the trainer bundle and the first worker bundle.
Specifically, we build a placement group with bundles `[{}, {trainer+worker}, {worker}, ..., {worker}]` (sketched below), and schedule the `TrainTrainable` with the first non-empty bundle. When assigning worker ranks, we designate the worker with the smallest GPU ID on the same node as the trainer to be rank 0.

#### 2. Set `num_workers=1` by default in `ScalingConfig`.
Previously, setting `num_workers` to `None` resulted in launching a single `TrainTrainable` with zero workers. This no longer applies to the current Ray Train, as all Trainers now require at least one worker to execute the `train_func`. Additionally, this approach led to undefined behavior during the merging and separation of the first bundle. To ensure consistent behavior, we now set the default value of `num_workers` to 1.

#### 3. Forbid using `ScalingConfig` with `tune.with_resources`.
`ScalingConfig` should be a Ray Train-only utility and should not be used for Tune Trainables. For example, it doesn't make sense to provide a `ScalingConfig` for a function trainable, since there are no trainer and worker concepts for it.

Passed release test: https://buildkite.com/ray-project/release/builds/9650#018dee6e-e3ce-4376-9f3d-5ad7e250e513

## Related PRs:
The two PRs below enabled actors with empty resources to be launched on the node of a specific bundle in a placement group.
- #43269
- #43448

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
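The merged layout from change 1 above can be sketched as follows; the helper name and resource shapes are illustrative, not Ray Train's actual code.

```python
def build_train_bundles(num_workers: int, worker: dict, trainer: dict) -> list:
    """Build [{}, {trainer+worker}, {worker}, ..., {worker}] as described above."""
    # Merge the trainer's resources into the rank 0 worker's bundle.
    merged = {
        key: worker.get(key, 0) + trainer.get(key, 0)
        for key in set(worker) | set(trainer)
    }
    return [{}, merged] + [dict(worker) for _ in range(num_workers - 1)]

# num_workers=4, worker={"GPU": 1}, trainer={"CPU": 1} ->
# [{}, {"CPU": 1, "GPU": 1}, {"GPU": 1}, {"GPU": 1}, {"GPU": 1}]
print(build_train_bundles(4, {"GPU": 1}, {"CPU": 1}))
```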
Why are these changes needed?
It is a revert of the revert of #39946.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.