
[autoscaler] Handle node type key change/deletion #16691

Merged (34 commits) on Jul 6, 2021

Conversation

DmitriGekhtman (Contributor)

Why are these changes needed?

Currently, deleting a worker node type and then running ray up causes an autoscaler failure due to a KeyError.
Changing the name of the head node type without changing the head node's config also leads to unexpected behavior.

With the changes in this PR,

  • The autoscaler deletes worker nodes whose node type tag is not in the autoscaling config
  • commands.get_or_create_head_node now shuts down a head node whose node type name is outdated (the current behavior is to shut down the head node only when its config is outdated)

This PR also

  • Fixes a typo in a ResourceDemandScheduler warning message and upgrades the warning to an error
  • Makes a quick fix to validation of the node-type resources field.
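The worker-cleanup behavior described in the first bullet can be sketched as follows. This is a minimal hypothetical illustration, not the actual autoscaler code; the function and variable names are invented for the example.

```python
# Hypothetical sketch: find workers whose node-type tag no longer
# appears in the autoscaling config, so the autoscaler can delete them.

def workers_to_terminate(workers, available_node_types):
    """Return ids of workers whose node type was removed from the config.

    workers: dict mapping node id -> node type tag.
    available_node_types: set of node types currently in the config.
    """
    return [
        node_id
        for node_id, node_type in workers.items()
        if node_type not in available_node_types
    ]

workers = {"w1": "cpu-worker", "w2": "gpu-worker"}
config_types = {"head-node", "cpu-worker"}  # "gpu-worker" was deleted
print(workers_to_terminate(workers, config_types))  # ['w2']
```

Before this PR, looking up the removed type's config raised the KeyError; with the change, such workers are instead scheduled for termination.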

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Also tested manually.

@@ -757,6 +749,59 @@ def get_or_create_head_node(config: Dict[str, Any],
cli_logger.print(" {}", remote_shell_str.strip())


def _should_create_new_head(head_node: Optional[str], launch_hash: str,
head_node_type: str,
Contributor Author:

The only new logic here is the node type name check. Just extracted the if condition into its own function + added the extra check.

@@ -337,6 +337,7 @@
"min_workers": {"type": "integer"},
"max_workers": {"type": "integer"},
"resources": {
"type": "object",
Contributor Author:

I dropped this line in a PR a few months ago, oops.
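The restored `"type": "object"` line matters because, without it, JSON Schema validation places no type constraint on the `resources` field at all. A minimal illustration of the effect (hypothetical helper, not the real validator):

```python
# Hypothetical illustration: with {"type": "object"} present, a scalar
# value for `resources` is rejected; a dict (JSON object) passes.

schema_fragment = {"resources": {"type": "object"}}

def validate_resources(value):
    """Check `value` against the schema fragment's `type` keyword.

    In JSON Schema, "object" corresponds to a Python dict.
    """
    if schema_fragment["resources"].get("type") == "object":
        return isinstance(value, dict)
    # No type constraint declared: anything passes.
    return True

print(validate_resources({"CPU": 4}))  # True: a resources dict passes
print(validate_resources("CPU:4"))     # False: a scalar is rejected
```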

@DmitriGekhtman DmitriGekhtman changed the title [autoscaler] Handle node type key change [autoscaler] Handle node type key change/deletion Jun 27, 2021
@AmeerHajAli (Contributor):

@edoakes @richardliaw , you mentioned having such issues in the past and I want to kindly ask you if the changes this PR makes are what you expect/would like to happen.

@ijrsvt (Contributor) left a comment:

Looks good, just a few comments about testing

Review threads on python/ray/tests/test_autoscaler.py (outdated; resolved)
@AmeerHajAli (Contributor):

My main concern is that in some sense this is an API change and needs to get approved.
Can we please discuss this in open source slack or api changes?

@DmitriGekhtman (Contributor, Author) commented Jun 27, 2021

> My main concern is that in some sense this is an API change and needs to get approved.
> Can we please discuss this in open source slack or api changes?

Hmm, it is indeed an API change, but the previous state of the API was broken/undefined.
That said, a decision is being made here on how to fix the API, so I will open a discussion on the API changes channel.

@AmeerHajAli (Contributor):

@DmitriGekhtman I wouldn’t say it is broken because we never said we support modifying an existing cluster yaml. Now that we are proposing to “support it” we need to make sure we are doing the desired behavior. Thanks for starting the discussion.

@DmitriGekhtman (Contributor, Author):

> we never said we support modifying an existing cluster yaml

I disagree on this point. According to our docs, ray up creates or updates a cluster and users do use ray up to update.

@AmeerHajAli (Contributor):

But we do not explicitly say that we support modifying the available node types.
Anyway, let's make sure to circulate this in API changes, and whatever the decision ends up being, we should document it as well.
Thanks for fixing this.

Comment on lines -25 to -26
:ref:`worker_nodes <cluster-configuration-worker-nodes>`:
:ref:`node_config <cluster-configuration-node-config-type>`
Contributor:

did we forget to remove this?

Contributor Author:

Yeah, looks like it.

Comment on lines +369 to +373
If the field ``head_node_type`` is changed and an update is executed with :ref:`ray up<ray-up-doc>`, the currently running head node will
be considered outdated. The user will receive a prompt asking to confirm scale-down of the outdated head node, and the cluster will restart with a new
head node. Changing the :ref:`node_config<cluster-configuration-node-config>` of the :ref:`node_type<cluster-configuration-node-types-type>` with key ``head_node_type`` will also result in cluster restart after a user prompt.


Contributor:

Awesome.

@AmeerHajAli (Contributor) left a comment:

Looks good. Left a few comments. Thanks for doing this.

Comment on lines -367 to -372
.. _cluster-configuration-worker-nodes:

``worker_nodes``
~~~~~~~~~~~~~~~~

The configuration to be used to launch worker nodes on the cloud service provider. Generally, node configs are set in the :ref:`node config of each node type <cluster-configuration-node-config>`. Setting this property allows propagation of a default value to all the node types when they launch as workers (e.g., using spot instances across all workers can be configured here so that it doesn't have to be set across all instance types).
Contributor:

Thanks for removing these =)

Comment on lines 464 to 465
Also updates the counter dict (node_type_counts), which is passed in by
reference.
Contributor:

In the future I think it would be great if we just returned a new counts dict (as opposed to mutating the dict).
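The suggested change can be sketched as the difference between the two styles below. This is a hypothetical illustration with invented names, not the actual autoscaler code.

```python
# Hypothetical sketch: mutating a counts dict passed by reference vs.
# returning a fresh dict, as suggested in the review comment.

from collections import Counter

# Before: mutates the caller's dict, which is easy to miss at the
# call site since nothing is assigned.
def update_counts_inplace(node_type_counts, node_type):
    node_type_counts[node_type] = node_type_counts.get(node_type, 0) + 1

# After: a pure function; the caller rebinds the result explicitly.
def updated_counts(node_type_counts, node_type):
    new_counts = Counter(node_type_counts)
    new_counts[node_type] += 1
    return dict(new_counts)

counts = {"cpu-worker": 1}
counts = updated_counts(counts, "cpu-worker")
print(counts)  # {'cpu-worker': 2}
```

Returning a new dict makes the data flow explicit at every call site, at the cost of one shallow copy per update.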

Contributor Author:

I'll just make that change now, else this will be forgotten forever.

Contributor:

That's a pretty good point--thanks!

@ijrsvt (Contributor) left a comment:

Looks good now! Thanks for adding super clear docs :)

@ijrsvt (Contributor) left a comment:

Thanks!!

@DmitriGekhtman (Contributor, Author):

Looks good to merge -- test failures seem unrelated.

@DmitriGekhtman DmitriGekhtman added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jul 2, 2021
@ijrsvt (Contributor) commented Jul 6, 2021

Windows test is clearly unrelated!
