[core] Fix placement groups scheduling when no resources are specified #39946 #43269
Conversation
Signed-off-by: vitsai <vitsai@cs.stanford.edu>
Did you include the revert-revert fix?
Nope, I reverted the unnecessary refactoring. I think there are still some failures I need to handle; I will finish it today.
@jjyao after this PR, all scheduling requests will include "bundle": 0.001. All the test failures happen because when you call
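To make the discussion concrete, here is a minimal sketch (using the public Ray Python API) of the case this PR changes: a task that requests no resources scheduled into a placement group. The `"bundle": 0.001` request mentioned above is internal to the scheduler and never appears in this user-facing code.

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_cpus=2)

pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())

@ray.remote(num_cpus=0)  # task explicitly requests no resources
def where_am_i():
    # Returns the ID of the node the task actually ran on.
    return ray.get_runtime_context().get_node_id()

# Even though the task requests no resources, after this PR its internal
# scheduling request carries a tiny slice of the PG's "bundle" resource,
# so it is placed on a node that holds one of the PG's bundles.
node_id = ray.get(
    where_am_i.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote()
)
print(node_id)
```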
IMO, this is an implementation detail and should not be exposed to the end user via a public API.
cc @jjyao for the review
python/ray/_private/utils.py (outdated)
# it is an implementation detail.
# This resource is automatically added to the resource
# request for all tasks that require placement groups.
result = PLACEMENT_GROUP_BUNDLE_KEY_RESOURCE_PATTERN.match(key)
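For context, a hedged sketch of the filtering this snippet belongs to: hiding the internal bundle resources from a user-facing resource dict. The regex and helper name below are illustrative assumptions, not the exact code in `python/ray/_private/utils.py`.

```python
import re

# Illustrative pattern for the internal PG bundle resources, e.g.
# "bundle_group_<pg_id>" or "bundle_group_<index>_<pg_id>" (assumed naming).
PLACEMENT_GROUP_BUNDLE_KEY_RESOURCE_PATTERN = re.compile(
    r"^bundle_group(_\d+)?_[0-9a-f]+$"
)

def strip_internal_bundle_resources(resources: dict) -> dict:
    """Drop internal placement-group bundle resources before exposing a
    resource dict through a public API."""
    return {
        key: value
        for key, value in resources.items()
        if not PLACEMENT_GROUP_BUNDLE_KEY_RESOURCE_PATTERN.match(key)
    }
```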
Nit: I don't think we need to introduce another regex; we can just skip if the parsed original resource == "bundle".
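The reviewer's alternative, sketched under the assumption that formatted PG resources follow an `<original>_group_<...>_<pg_id>` layout: parse out the original resource name and skip it when it equals `"bundle"`, with no extra regex.

```python
def original_resource_name(key: str) -> str:
    # "CPU_group_0_<pg_id>" -> "CPU"; "bundle_group_<pg_id>" -> "bundle".
    # Assumes the "<original>_group_..." layout; plain names pass through.
    return key.split("_group_")[0]

def strip_internal_bundle_resources(resources: dict) -> dict:
    # Same filtering as above, but keyed on the parsed original name.
    return {
        key: value for key, value in resources.items()
        if original_resource_name(key) != "bundle"
    }
```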
Probably not in this PR, but ideally we should do the same thing for Java and C++ as well; otherwise they will see the bundle resource.
Good point. Fixed it, and I also added input validation to disallow "bundle".
> added input validation to disallow "bundle"

Will this break current users if they don't use placement groups? Maybe we should rename our "bundle" to something more private, like `_pg_bundle_`.
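A sketch of what that validation might look like; the function name, message, and call site are assumptions, not the PR's actual code.

```python
def validate_user_resources(resources: dict) -> None:
    """Reject resource requests that use the reserved 'bundle' name."""
    if resources and "bundle" in resources:
        raise ValueError(
            "'bundle' is reserved for placement-group internals and "
            "cannot be requested as a custom resource."
        )

validate_user_resources({"CPU": 1})        # OK
# validate_user_resources({"bundle": 1})   # raises ValueError
```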
…sources. (#43448) #43269 starts respecting placement groups when resources are not specified for tasks, but it still doesn't respect the bundle index. This PR fixes the issue by always including bundle-index-formatted resources when a bundle index is specified. It also adds a unit test to check that the observability API does not display formatted resources.
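A small usage sketch of the fixed behavior via the public scheduling-strategy API: a zero-resource task pinned to a specific bundle index.

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_cpus=2)

pg = placement_group([{"CPU": 1}, {"CPU": 1}])
ray.get(pg.ready())

@ray.remote(num_cpus=0)
def pinned():
    return "ran in bundle index 1"

# Before #43448 the bundle index was ignored for zero-resource tasks;
# now the internal request also includes the index-formatted bundle
# resource, so the task lands in bundle 1 specifically.
print(ray.get(
    pinned.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=1,
        )
    ).remote()
))
```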
This PR automatically merges the trainer bundle with the rank 0 worker bundle, so that the trainer and the rank 0 worker can always colocate on the same node.

### Benefits:
- Enables users to specify additional resources for the rank 0 worker.
- Always colocates trainers and rank 0 workers, making the scheduling behavior deterministic.

### Major changes:

#### 1. Merge the trainer bundle and the first worker bundle.
Specifically, we build a placement group with bundles `[{}, {trainer+worker}, {worker}, ..., {worker}]` (sketched below), and schedule the `TrainTrainable` with the first non-empty bundle. When assigning worker ranks, we designate the worker with the smallest GPU ID on the same node as the trainer to be rank 0.

#### 2. Set `num_workers=1` by default in `ScalingConfig`.
Previously, setting `num_workers` to `None` resulted in launching a single `TrainTrainable` with zero workers. This no longer applies to the current Ray Train, as all Trainers now require at least one worker to execute the `train_func`. Additionally, this approach led to undefined behavior during the merging and separation of the first bundle. To ensure consistent behavior, we now set the default value of `num_workers` to 1.

#### 3. Forbid using `ScalingConfig` with `tune.with_resources`.
`ScalingConfig` should be a Ray Train-only utility and should not be used for Tune Trainables. For example, it doesn't make sense to provide a `ScalingConfig` for a function trainable, since there are no trainer and worker concepts for it.

Passed release test: https://buildkite.com/ray-project/release/builds/9650#018dee6e-e3ce-4376-9f3d-5ad7e250e513

## Related PRs:
The two PRs below enabled actors with empty resources to be launched on the node of a specific bundle in a placement group.
- #43269
- #43448

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
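The merged layout from change 1 above can be sketched as follows; the helper name and resource shapes are illustrative, not Ray Train's actual code.

```python
def build_train_bundles(num_workers: int, worker: dict, trainer: dict) -> list:
    """Build [{}, {trainer+worker}, {worker}, ..., {worker}] as described above."""
    # Merge the trainer's resources into the rank 0 worker's bundle.
    merged = {
        key: worker.get(key, 0) + trainer.get(key, 0)
        for key in set(worker) | set(trainer)
    }
    return [{}, merged] + [dict(worker) for _ in range(num_workers - 1)]

# num_workers=4, worker={"GPU": 1}, trainer={"CPU": 1} ->
# [{}, {"CPU": 1, "GPU": 1}, {"GPU": 1}, {"GPU": 1}, {"GPU": 1}]
print(build_train_bundles(4, {"GPU": 1}, {"CPU": 1}))
```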
Why are these changes needed?
It is a revert of the revert of #39946.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.