[Core] respect bundle index when task is scheduled with pg with no resources. #43448

rkooo567 · 2024-02-27T00:15:28Z

Why are these changes needed?

#43269 made tasks respect their placement group (pg) when no resources are specified, but the bundle index was still ignored. This PR fixes that by always adding the bundle-index-formatted resource when a bundle index is specified.

This PR also adds a unit test that checks the observability API does not display the formatted resources.
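The fix described above can be sketched in pure Python. This is an illustration under stated assumptions, not Ray's implementation: the two functions are named after the C++ helpers in src/ray/common/bundle_spec.cc, but the Python bodies, the resource-name format (`<name>_group_<pg_id>` for the wildcard form, `<name>_group_<index>_<pg_id>` for the indexed form), the constants, and the placement group ID `abc123` are all assumed for the example.

```python
from typing import Dict

# Assumed values for illustration; the real constants live in Ray's C++ core.
BUNDLE_RESOURCE_LABEL = "bundle"
PLACEHOLDER_AMOUNT = 0.001


def format_placement_group_resource(name: str, pg_id: str, bundle_index: int = -1) -> str:
    """Build a placement-group-scoped resource name.

    A negative index yields the wildcard form (any bundle in the group);
    a non-negative index pins the resource to a single bundle.
    """
    if bundle_index >= 0:
        return f"{name}_group_{bundle_index}_{pg_id}"
    return f"{name}_group_{pg_id}"


def add_placement_group_constraint(
    resources: Dict[str, float], pg_id: str, bundle_index: int = -1
) -> Dict[str, float]:
    """Rewrite a task's resource request so it is scoped to a placement group."""
    new_resources: Dict[str, float] = {}
    for name, amount in resources.items():
        new_resources[format_placement_group_resource(name, pg_id)] = amount
        if bundle_index >= 0:
            indexed = format_placement_group_resource(name, pg_id, bundle_index)
            new_resources[indexed] = amount

    # Placeholder that ties the task to the group even when `resources` is empty.
    wildcard_key = format_placement_group_resource(BUNDLE_RESOURCE_LABEL, pg_id)
    new_resources[wildcard_key] = PLACEHOLDER_AMOUNT

    # The step this PR adds: also pin the placeholder to the requested bundle.
    if bundle_index >= 0:
        indexed_key = format_placement_group_resource(
            BUNDLE_RESOURCE_LABEL, pg_id, bundle_index
        )
        new_resources[indexed_key] = PLACEHOLDER_AMOUNT
    return new_resources
```

With an empty request and `bundle_index=1`, the result contains both the wildcard placeholder and the indexed one, so only the node holding bundle 1 can satisfy the request. Without a bundle index, only the wildcard placeholder is added and any bundle in the group qualifies.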

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

jjyao · 2024-02-27T16:24:32Z

python/ray/_private/utils.py

-        if result and len(result.groups()) == 2:
+        result = PLACEMENT_GROUP_INDEXED_BUNDLED_RESOURCE_PATTERN.match(key)
+        if result and len(result.groups()) == 3:
+            # This should be already skipped from the logic above.


This comment should be removed
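For context, a rough sketch of how formatted placement group resources might be hidden from an observability view. The regular expressions, the `bundle` label, and the function name `strip_placement_group_resources` are assumptions made for illustration. Only the pattern name `PLACEMENT_GROUP_INDEXED_BUNDLED_RESOURCE_PATTERN` and its three capture groups come from the diff above.

```python
import re
from typing import Dict

# Assumed patterns; the real definitions in Ray may differ.
PLACEMENT_GROUP_INDEXED_BUNDLED_RESOURCE_PATTERN = re.compile(
    r"(.+)_group_(\d+)_([0-9a-zA-Z]+)"  # name, bundle index, group id
)
PLACEMENT_GROUP_WILDCARD_RESOURCE_PATTERN = re.compile(
    r"(.+)_group_([0-9a-zA-Z]+)"  # name, group id
)
BUNDLE_RESOURCE_LABEL = "bundle"  # assumed placeholder label


def strip_placement_group_resources(resources: Dict[str, float]) -> Dict[str, float]:
    """Return user-facing resources with placement group formatting removed."""
    visible: Dict[str, float] = {}
    for key, value in resources.items():
        # Indexed entries duplicate the wildcard entries, so drop them.
        if PLACEMENT_GROUP_INDEXED_BUNDLED_RESOURCE_PATTERN.fullmatch(key):
            continue
        wildcard = PLACEMENT_GROUP_WILDCARD_RESOURCE_PATTERN.fullmatch(key)
        if wildcard:
            name = wildcard.group(1)
            # The bundle placeholder is internal bookkeeping; never show it.
            if name != BUNDLE_RESOURCE_LABEL:
                visible[name] = value
            continue
        # Plain resources pass through unchanged.
        visible[key] = value
    return visible
```

Under these assumptions, a task scheduled with a bundle index but no resources shows an empty resource map, which is the behavior the new unit test is described as checking.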

jjyao · 2024-02-27T16:26:58Z

src/ray/common/bundle_spec.cc

@@ -203,6 +203,11 @@ std::unordered_map<std::string, double> AddPlacementGroupConstraint(
  auto bundle_key =
      FormatPlacementGroupResource(kBundle_ResourceLabel, placement_group_id, -1);
  new_resources[bundle_key] = 0.001;
+  if (bundle_index >= 0) {


Actually I'm wondering if we should only add bundle resource when normal resources are empty.

We should add the indexed resource in case users want to use bundle_index.

This PR automatically merges the trainer bundle with the rank 0 worker bundle, so that the trainer and the rank 0 worker are always colocated on the same node.

### Benefits:

- Enables users to specify additional resources for the rank 0 worker.
- Always colocates the trainer and the rank 0 worker, which makes the scheduling behavior deterministic.

### Major changes:

#### 1. Merge trainer bundle and the first worker bundle.

Specifically, we build a placement group with bundles `[{}, {trainer+worker}, {worker}, ..., {worker}]`, and schedule the `TrainTrainable` with the first non-empty bundle. When assigning worker ranks, we designate the worker with the smallest GPU ID on the same node as the trainer to be rank 0.

#### 2. Set `num_workers=1` by default in `ScalingConfig`.

Previously, setting `num_workers` to `None` resulted in launching a single `TrainTrainable` with zero workers. This no longer applies to the current Ray Train, as all Trainers now require at least one worker to execute the `train_func`. Additionally, this approach led to undefined behavior during the merging and separation of the first bundle. To ensure consistent behavior, the default value of `num_workers` is now 1.

#### 3. Forbid using `ScalingConfig` with `tune.with_resources`.

`ScalingConfig` should be a Ray Train-only utility and should not be used for Tune Trainables. For example, it doesn't make sense to provide a `ScalingConfig` for a function trainable, since it has no concept of a trainer or workers.

Passed Release Test: https://buildkite.com/ray-project/release/builds/9650#018dee6e-e3ce-4376-9f3d-5ad7e250e513

## Related PRs:

The two PRs below enable actors with empty resources to be launched on the node of a specific bundle in a placement group.

- #43269
- #43448

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
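The bundle layout `[{}, {trainer+worker}, {worker}, ..., {worker}]` from major change 1 can be illustrated with a small sketch. The helper `build_train_bundles` and its parameters are hypothetical and are not part of Ray Train. The sketch only shows how the trainer's resources could be summed into the first worker bundle behind an empty leading bundle.

```python
from typing import Dict, List


def build_train_bundles(
    trainer_resources: Dict[str, float],
    worker_resources: Dict[str, float],
    num_workers: int = 1,
) -> List[Dict[str, float]]:
    """Return bundles laid out as [{}, {trainer+worker}, {worker}, ...]."""
    if num_workers < 1:
        raise ValueError("at least one worker is required")

    # Sum the trainer's resources into a copy of the first worker's bundle.
    merged = dict(worker_resources)
    for name, amount in trainer_resources.items():
        merged[name] = merged.get(name, 0) + amount

    # Empty leading bundle, merged bundle, then one bundle per remaining worker.
    remaining = [dict(worker_resources) for _ in range(num_workers - 1)]
    return [{}, merged] + remaining
```

In this layout the first non-empty bundle is at index 1, so an actor that requests no resources of its own can only be placed alongside it if the bundle index is respected, which is what #43269 and #43448 enable.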

SangBin Cho added 2 commits February 27, 2024 09:13

fixed.

14c5436

fixed.

729dbc1

rkooo567 assigned jjyao Feb 27, 2024

jjyao reviewed Feb 27, 2024

python/ray/_private/utils.py (outdated, resolved)

SangBin Cho added 2 commits February 27, 2024 11:04

Addressed code review.

d331c93

Merge branch 'master' into respect-bundle-pg

e8da645

jjyao approved these changes Feb 27, 2024

.

447988c

rkooo567 merged commit edc33b1 into ray-project:master Feb 27, 2024
8 of 9 checks passed

woshiyyya mentioned this pull request Feb 27, 2024

[Train] Colocate Trainer and rank 0 worker #43115

Merged

8 tasks
