
[autoscaler] Support legacy cluster configs with the new resource demand scheduler #11751

Merged (81 commits) into ray-project:master on Nov 4, 2020
Conversation

@AmeerHajAli (Contributor) commented Nov 2, 2020:

This PR adds support for legacy cluster configs to the new resource demand scheduler.

We rewrite the cluster configs from the legacy format into available node types.

Since the head node is already launched, we can read its resources directly. If there are remaining unfulfilled resources, we launch max(1, min_workers) workers to obtain the workers' resources, so that we can later launch based on demand.

We do not check whether the head node and worker nodes use the same instance type. I am not sure how we could determine this generically for any cloud, since the config field names differ: e.g., InstanceType for AWS vs. machineType for GCP.

Relevant tests were added to cover support for legacy node types.
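To make the rewrite concrete, here is a minimal sketch of converting a legacy config into the available node types format, assuming hypothetical type names, keys, and defaults; it is not the code added in this PR:

# Illustrative sketch only: the type names, keys, and defaults below are
# assumptions, not the PR's actual implementation.
LEGACY_HEAD = "ray-legacy-head-node-type"
LEGACY_WORKER = "ray-legacy-worker-node-type"

def rewrite_legacy_config(config: dict) -> dict:
    """Convert a legacy cluster config into the available_node_types format."""
    config = dict(config)
    config["available_node_types"] = {
        LEGACY_HEAD: {
            "node_config": config.get("head_node", {}),
            # Resources are left empty; they are read from the already
            # running head node.
            "resources": {},
            "min_workers": 0,
            "max_workers": 0,
        },
        LEGACY_WORKER: {
            "node_config": config.get("worker_nodes", {}),
            # Worker resources are also unknown, so the scheduler first
            # launches max(1, min_workers) workers to learn them.
            "resources": {},
            "min_workers": config.get("min_workers", 0),
            "max_workers": config.get("max_workers", 0),
        },
    }
    config["head_node_type"] = LEGACY_HEAD
    return config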

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -1053,8 +1053,6 @@ def testConfiguresNewNodes(self):

def testReportsConfigFailures(self):
    config = copy.deepcopy(SMALL_CLUSTER)
    config["provider"]["type"] = "external"
    config = prepare_config(config)
@AmeerHajAli (author) commented on the diff:
Removed: after we rewrite the config in prepare_config, the autofilling of instance resources fails because no provider module is given. This prepare_config call was redundant here anyway, and we cannot have a provider of type "external" without the appropriate module.
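For illustration, the failure mode being avoided is roughly the following; everything beyond the three test lines shown in the diff is a paraphrase of the explanation above:

import copy

SMALL_CLUSTER = {"provider": {"type": "aws", "region": "us-west-2"}}  # stand-in

config = copy.deepcopy(SMALL_CLUSTER)
config["provider"]["type"] = "external"
# After this PR, prepare_config() also rewrites legacy configs and autofills
# instance resources, which requires importing the provider's module. A bare
# "external" provider has no importable module, so the call would raise here
# and is therefore dropped from the test.
# config = prepare_config(config)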

@AmeerHajAli removed the @author-action-required ("the PR author is responsible for the next step") label on Nov 3, 2020
# node is connected and its resources show up in LoadMetrics.
return_immediately = self._handle_legacy_yaml(
    nodes, pending_nodes, static_node_resources)
if return_immediately:
A reviewer (Contributor) commented:
Can you give this a better name than return_immediately?

Actually how about we rename _handle_legacy_yaml to _infer_legacy_node_resources_if_needed(), remove the early return, and change the logic in line 110 to return empty dict if a worker is already pending?

(That way we never need to return early here)
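A sketch of the shape this suggestion implies, with assumed method and attribute names and a simplified signature (the real scheduler API differs):

class ResourceDemandSchedulerSketch:
    """Hypothetical sketch of the suggested refactor, not the merged code."""

    def _infer_legacy_node_resources_if_needed(self, nodes, pending_nodes,
                                               static_node_resources):
        """Fill in legacy head/worker node-type resources from live nodes."""
        if not self.is_legacy_yaml:  # assumed attribute
            return
        # ... infer node-type resources from the running head node and any
        # connected workers here ...

    def get_nodes_to_launch(self, nodes, pending_nodes, resource_demands,
                            static_node_resources):
        self._infer_legacy_node_resources_if_needed(
            nodes, pending_nodes, static_node_resources)
        # Instead of an early-return flag, request no new nodes while a
        # legacy worker is still pending and its resources are unknown.
        if self.is_legacy_yaml and self._worker_pending_with_unknown_resources(
                pending_nodes):  # assumed helper
            return {}
        # ... the normal bin-packing / scheduling logic continues here ...
        return {}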

@AmeerHajAli (author) replied:
@ericl better?

@ericl (Contributor) reviewed:
Looks good, just one suggested refactor

@ericl added the @author-action-required label (Nov 3, 2020)
@ericl removed the @author-action-required label (Nov 3, 2020)
@ericl (Contributor) reviewed:
It looks great. What are the follow-up PRs?

@AmeerHajAli (author) commented Nov 3, 2020:

  • Removing the legacy code.
  • Rewriting the alpha of max concurrent scale-up from target utilization and aggressiveness (roughly sketched below).
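As a rough illustration of the second item, an upscaling factor could be derived from the legacy settings along these lines; the parameter names and the formula are guesses, not the planned implementation:

# Purely illustrative: deriving a scale-up "alpha" from the legacy
# target_utilization_fraction and aggressive-autoscaling settings.
def max_concurrent_launches(num_connected_nodes: int,
                            target_utilization_fraction: float,
                            aggressive: bool) -> int:
    # A lower target utilization, or the aggressive setting, allows more
    # nodes to be launched concurrently.
    alpha = 2.0 if aggressive else 1.0 / max(target_utilization_fraction, 0.1)
    return max(1, int(alpha * max(num_connected_nodes, 1)))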

@ericl (Contributor) commented Nov 3, 2020:

Seems like there are many failing tests:

//python/ray/tests:test_autoscaler                                       FAILED in 3 out of 3 in 8.4s
  Stats over 3 runs: max = 8.4s, min = 8.3s, avg = 8.3s, dev = 0.1s
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler/test.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler/test_attempts/attempt_1.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler/test_attempts/attempt_2.log
//python/ray/tests:test_autoscaler_aws                                   FAILED in 3 out of 3 in 3.0s
  Stats over 3 runs: max = 3.0s, min = 1.8s, avg = 2.2s, dev = 0.5s
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler_aws/test.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler_aws/test_attempts/attempt_1.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler_aws/test_attempts/attempt_2.log
//python/ray/tests:test_autoscaler_yaml                                  FAILED in 3 out of 3 in 2.7s
  Stats over 3 runs: max = 2.7s, min = 1.8s, avg = 2.1s, dev = 0.5s
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler_yaml/test.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler_yaml/test_attempts/attempt_1.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler_yaml/test_attempts/attempt_2.log

@ericl added the @author-action-required label (Nov 3, 2020)
@AmeerHajAli (author) commented Nov 4, 2020:

@ericl, yeah, my tests passed, but changes in master broke them.
I am looking into this.
Update: the errors seem to have two causes:

  1. test_autoscaler*: we raise an error for any exception in the AWS instance autofill, but sometimes we shouldn't, since the cause is not a missing key but rather a provider package that is not installed (e.g., trying to autofill for staroid when it is not installed). See the sketch after this list.
  2. test_cli: we did not include the autofill log output in test_cli_patterns, which started failing after we rewrite the YAML, since the rewrite triggers more autofills. I fixed it now; it should be good to go!
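A sketch of the fix described in item 1; the wrapper below is an assumption about the shape of the change, not the exact code:

from typing import Callable

def try_autofill_instance_resources(
        config: dict, autofill: Callable[[dict], dict]) -> dict:
    """Run the provider-specific resource autofill, tolerating missing packages.

    Hypothetical sketch: `autofill` stands in for the provider's
    resource-autofill routine.
    """
    try:
        return autofill(config)
    except ImportError:
        # The provider's package (e.g. the staroid SDK) is not installed.
        # That is not a malformed config, so skip autofilling instead of
        # failing; a genuinely missing key would still raise a KeyError.
        return config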

@AmeerHajAli added the tests-ok ("test failures are certified unrelated") label and removed the @author-action-required label on Nov 4, 2020
@ericl merged commit ebdf8ba into ray-project:master on Nov 4, 2020