
[autoscaler] Support legacy cluster configs with the new resource demand scheduler #11751

Merged (81 commits) into ray-project:master on Nov 4, 2020
Conversation

@AmeerHajAli (Contributor) commented Nov 2, 2020:

This PR adds support for legacy cluster configs to the new resource demand scheduler.

We rewrite the cluster configs from the legacy format into available node types.

Since the head node is already launched, we can read its resources directly. If there are remaining unfulfilled resources, we launch max(1, min_workers) workers to obtain the workers' resources, so that we can later launch based on demand.

We do not check whether the head node and worker nodes use the same instance type. I am not sure how we could determine this generically for any cloud, since the config field names differ: e.g., InstanceType for AWS vs. machineType for GCP.

Relevant tests were added to cover support for legacy node types.
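To make the rewrite concrete, here is a minimal sketch of converting a legacy config into the available node types format, assuming hypothetical type names, keys, and defaults; it is not the code added in this PR:

# Illustrative sketch only: the type names, keys, and defaults below are
# assumptions, not the PR's actual implementation.
LEGACY_HEAD = "ray-legacy-head-node-type"
LEGACY_WORKER = "ray-legacy-worker-node-type"

def rewrite_legacy_config(config: dict) -> dict:
    """Convert a legacy cluster config into the available_node_types format."""
    config = dict(config)
    config["available_node_types"] = {
        LEGACY_HEAD: {
            "node_config": config.get("head_node", {}),
            # Resources are left empty; they are read from the already
            # running head node.
            "resources": {},
            "min_workers": 0,
            "max_workers": 0,
        },
        LEGACY_WORKER: {
            "node_config": config.get("worker_nodes", {}),
            # Worker resources are also unknown, so the scheduler first
            # launches max(1, min_workers) workers to learn them.
            "resources": {},
            "min_workers": config.get("min_workers", 0),
            "max_workers": config.get("max_workers", 0),
        },
    }
    config["head_node_type"] = LEGACY_HEAD
    return config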

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -1053,8 +1053,6 @@ def testConfiguresNewNodes(self):

def testReportsConfigFailures(self):
    config = copy.deepcopy(SMALL_CLUSTER)
    config["provider"]["type"] = "external"
    config = prepare_config(config)
@AmeerHajAli (author) commented on the diff:
Removed: after we rewrite the config in prepare_config, the autofilling of instance resources fails because no provider module is given. This prepare_config call was redundant here anyway, and we cannot have a provider of type "external" without the appropriate module.
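For illustration, the failure mode being avoided is roughly the following; everything beyond the three test lines shown in the diff is a paraphrase of the explanation above:

import copy

SMALL_CLUSTER = {"provider": {"type": "aws", "region": "us-west-2"}}  # stand-in

config = copy.deepcopy(SMALL_CLUSTER)
config["provider"]["type"] = "external"
# After this PR, prepare_config() also rewrites legacy configs and autofills
# instance resources, which requires importing the provider's module. A bare
# "external" provider has no importable module, so the call would raise here
# and is therefore dropped from the test.
# config = prepare_config(config)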

@AmeerHajAli removed the @author-action-required ("the PR author is responsible for the next step") label on Nov 3, 2020
# node is connected and its resources show up in LoadMetrics.
return_immediately = self._handle_legacy_yaml(
    nodes, pending_nodes, static_node_resources)
if return_immediately:
A reviewer (Contributor) commented:
Can you give this a better name than return_immediately?

Actually how about we rename _handle_legacy_yaml to _infer_legacy_node_resources_if_needed(), remove the early return, and change the logic in line 110 to return empty dict if a worker is already pending?

(That way we never need to return early here)
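A sketch of the shape this suggestion implies, with assumed method and attribute names and a simplified signature (the real scheduler API differs):

class ResourceDemandSchedulerSketch:
    """Hypothetical sketch of the suggested refactor, not the merged code."""

    def _infer_legacy_node_resources_if_needed(self, nodes, pending_nodes,
                                               static_node_resources):
        """Fill in legacy head/worker node-type resources from live nodes."""
        if not self.is_legacy_yaml:  # assumed attribute
            return
        # ... infer node-type resources from the running head node and any
        # connected workers here ...

    def get_nodes_to_launch(self, nodes, pending_nodes, resource_demands,
                            static_node_resources):
        self._infer_legacy_node_resources_if_needed(
            nodes, pending_nodes, static_node_resources)
        # Instead of an early-return flag, request no new nodes while a
        # legacy worker is still pending and its resources are unknown.
        if self.is_legacy_yaml and self._worker_pending_with_unknown_resources(
                pending_nodes):  # assumed helper
            return {}
        # ... the normal bin-packing / scheduling logic continues here ...
        return {}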

@AmeerHajAli (author) replied:
@ericl better?

@ericl (Contributor) reviewed:
Looks good, just one suggested refactor

@ericl added the @author-action-required label (Nov 3, 2020)
@ericl removed the @author-action-required label (Nov 3, 2020)
@ericl (Contributor) reviewed:
It looks great. What are the follow-up PRs?

@AmeerHajAli (author) commented Nov 3, 2020:

  • Removing the legacy code.
  • Rewriting the alpha of max concurrent scale-up from target utilization and aggressiveness (roughly sketched below).
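As a rough illustration of the second item, an upscaling factor could be derived from the legacy settings along these lines; the parameter names and the formula are guesses, not the planned implementation:

# Purely illustrative: deriving a scale-up "alpha" from the legacy
# target_utilization_fraction and aggressive-autoscaling settings.
def max_concurrent_launches(num_connected_nodes: int,
                            target_utilization_fraction: float,
                            aggressive: bool) -> int:
    # A lower target utilization, or the aggressive setting, allows more
    # nodes to be launched concurrently.
    alpha = 2.0 if aggressive else 1.0 / max(target_utilization_fraction, 0.1)
    return max(1, int(alpha * max(num_connected_nodes, 1)))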

@ericl (Contributor) commented Nov 3, 2020:

Seems like there are many failing tests:

//python/ray/tests:test_autoscaler                                       FAILED in 3 out of 3 in 8.4s
  Stats over 3 runs: max = 8.4s, min = 8.3s, avg = 8.3s, dev = 0.1s
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler/test.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler/test_attempts/attempt_1.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler/test_attempts/attempt_2.log
//python/ray/tests:test_autoscaler_aws                                   FAILED in 3 out of 3 in 3.0s
  Stats over 3 runs: max = 3.0s, min = 1.8s, avg = 2.2s, dev = 0.5s
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler_aws/test.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler_aws/test_attempts/attempt_1.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler_aws/test_attempts/attempt_2.log
//python/ray/tests:test_autoscaler_yaml                                  FAILED in 3 out of 3 in 2.7s
  Stats over 3 runs: max = 2.7s, min = 1.8s, avg = 2.1s, dev = 0.5s
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler_yaml/test.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler_yaml/test_attempts/attempt_1.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_autoscaler_yaml/test_attempts/attempt_2.log

@ericl added the @author-action-required label (Nov 3, 2020)
@AmeerHajAli (author) commented Nov 4, 2020:

@ericl, yeah, my tests passed, but changes in master broke them.
I am looking into this.
Update: the errors seem to have two causes:

  1. test_autoscaler*: we raise an error for any exception in the AWS instance autofill, but sometimes we shouldn't, since the cause is not a missing key but rather a provider package that is not installed (e.g., trying to autofill for staroid when it is not installed). See the sketch after this list.
  2. test_cli: we did not include the autofill log output in test_cli_patterns, which started failing after we rewrite the YAML, since the rewrite triggers more autofills. I fixed it now; it should be good to go!
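A sketch of the fix described in item 1; the wrapper below is an assumption about the shape of the change, not the exact code:

from typing import Callable

def try_autofill_instance_resources(
        config: dict, autofill: Callable[[dict], dict]) -> dict:
    """Run the provider-specific resource autofill, tolerating missing packages.

    Hypothetical sketch: `autofill` stands in for the provider's
    resource-autofill routine.
    """
    try:
        return autofill(config)
    except ImportError:
        # The provider's package (e.g. the staroid SDK) is not installed.
        # That is not a malformed config, so skip autofilling instead of
        # failing; a genuinely missing key would still raise a KeyError.
        return config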

@AmeerHajAli added the tests-ok ("test failures are certified unrelated") label and removed the @author-action-required label on Nov 4, 2020
@ericl merged commit ebdf8ba into ray-project:master on Nov 4, 2020