[autoscaler] Handle node type key change/deletion #16691
Conversation
@@ -757,6 +749,59 @@ def get_or_create_head_node(config: Dict[str, Any],
        cli_logger.print("  {}", remote_shell_str.strip())


def _should_create_new_head(head_node: Optional[str], launch_hash: str,
                            head_node_type: str,
The only new logic here is the node type name check. I just extracted the if condition into its own function and added the extra check.
@@ -337,6 +337,7 @@
    "min_workers": {"type": "integer"},
    "max_workers": {"type": "integer"},
    "resources": {
        "type": "object",
I dropped this line in a PR a few months ago, oops.
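To illustrate what the restored line buys: without the `"type": "object"` constraint, a scalar `resources` value would slip through schema validation silently. Here is a stdlib-only sketch of the equivalent check (`validate_node_type` is a hypothetical helper; the real autoscaler validates against a full JSON schema):

```python
from typing import Dict, List


def validate_node_type(node_type: Dict) -> List[str]:
    """Return schema violations for a node-type entry (hypothetical helper)."""
    errors = []
    for key in ("min_workers", "max_workers"):
        if key in node_type and not isinstance(node_type[key], int):
            errors.append(f"{key} must be an integer")
    # This is the constraint the dropped `"type": "object"` line restores:
    # `resources` must be a mapping, not a scalar.
    if "resources" in node_type and not isinstance(node_type["resources"], dict):
        errors.append("resources must be an object")
    return errors
```

For example, `validate_node_type({"resources": "CPU"})` reports a violation, while `{"resources": {"CPU": 4}}` passes.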
@edoakes @richardliaw, you mentioned having such issues in the past, and I'd kindly like to ask whether the changes this PR makes are what you expect/would like to happen.
Looks good, just a few comments about testing
My main concern is that, in some sense, this is an API change and needs to be approved.
Hmm, it is indeed an API change, but the previous state of the API was broken/undefined.
@DmitriGekhtman I wouldn't say it is broken, because we never said we support modifying an existing cluster YAML. Now that we are proposing to "support it", we need to make sure we implement the desired behavior. Thanks for starting the discussion.
I disagree on this point. According to our docs,
But we do not explicitly say that we support modifying the available node types. |
:ref:`worker_nodes <cluster-configuration-worker-nodes>`:
:ref:`node_config <cluster-configuration-node-config-type>`
did we forget to remove this?
Yeah, looks like it.
If the field ``head_node_type`` is changed and an update is executed with :ref:`ray up<ray-up-doc>`, the currently running head node will be considered outdated. The user will receive a prompt asking to confirm scale-down of the outdated head node, and the cluster will restart with a new head node. Changing the :ref:`node_config<cluster-configuration-node-config>` of the :ref:`node_type<cluster-configuration-node-types-type>` with key ``head_node_type`` will also result in a cluster restart after a user prompt.
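The "outdated head" behavior documented above can be sketched as a digest comparison: everything that defines the head node is hashed, and if the stored hash of the running head no longer matches, `ray up` prompts for a restart. The helper name `launch_hash` and the exact fields hashed here are assumptions for illustration, not the autoscaler's actual implementation:

```python
import hashlib
import json


def launch_hash(node_config: dict, head_node_type: str) -> str:
    """Hypothetical digest of what defines the head node: if either the
    node type's name or its config changes, the digest changes, and the
    running head is considered outdated."""
    payload = json.dumps({"type": head_node_type, "config": node_config},
                         sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()


old = launch_hash({"InstanceType": "m5.large"}, "head_node")
# Renaming the head node type alone changes the digest -> prompted restart.
new = launch_hash({"InstanceType": "m5.large"}, "head")
assert old != new
```

Note that hashing the type name together with the config is what makes a pure rename trigger a restart, even when the config itself is untouched.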
Awesome.
Looks good. Left a few comments. Thanks for doing this.
.. _cluster-configuration-worker-nodes:

``worker_nodes``
~~~~~~~~~~~~~~~~

The configuration to be used to launch worker nodes on the cloud service provider. Generally, node configs are set in the :ref:`node config of each node type <cluster-configuration-node-config>`. Setting this property allows propagation of a default value to all the node types when they launch as workers (e.g., using spot instances across all workers can be configured here so that it doesn't have to be set across all instance types).
Thanks for removing these =)
Also updates the counter dict (node_type_counts), which is passed in by reference.
In the future I think it would be great if we just returned a new counts dict (as opposed to mutating the dict).
I'll just make that change now, else this will be forgotten forever.
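The refactor being agreed on here is a standard mutate-vs-return-new choice. A minimal sketch of both styles (function names are illustrative, not the PR's actual identifiers):

```python
from typing import Dict


def count_node_types_inplace(node_type_counts: Dict[str, int],
                             node_type: str) -> None:
    """Mutating style: updates the caller's dict through the reference."""
    node_type_counts[node_type] = node_type_counts.get(node_type, 0) + 1


def count_node_types(node_type_counts: Dict[str, int],
                     node_type: str) -> Dict[str, int]:
    """Pure style suggested in review: copy, update, and return a new dict,
    leaving the caller's dict untouched."""
    updated = dict(node_type_counts)
    updated[node_type] = updated.get(node_type, 0) + 1
    return updated
```

The pure version is easier to test and reason about, since callers never observe counts changing behind their backs, at the small cost of a copy per call.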
That's a pretty good point--thanks!
Looks good now! Thanks for adding super clear docs :)
Thanks!!
Looks good to merge -- test failures seem unrelated.
Windows test is clearly unrelated!
Why are these changes needed?

Currently, deleting a worker node type and then running ray up would cause an autoscaler failure due to a KeyError. Changing the name of the head node type without changing the head node's config also leads to unexpected behavior.
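The KeyError failure mode can be sketched as follows: a running node's recorded type is looked up in the edited config's available node types, and the key no longer exists. The helper name `node_resources` and the fallback behavior are illustrative assumptions, not the PR's literal code:

```python
from typing import Dict


def node_resources(node_type: str,
                   available_node_types: Dict[str, dict]) -> dict:
    """Look up a running node's type in the (possibly edited) config.

    Pre-PR behavior was effectively `available_node_types[node_type]`,
    which raises KeyError once the type is deleted from the YAML.
    """
    node = available_node_types.get(node_type)
    if node is None:
        # Hypothetical fix: treat the node as outdated instead of crashing;
        # the autoscaler can then schedule it for replacement.
        return {}
    return node.get("resources", {})
```

With a `.get()`-style lookup, a deleted node type degrades to "node is outdated" rather than an unhandled exception during `ray up`.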
With the changes in this PR, deleting a node type or renaming the head node type is handled cleanly: outdated nodes (including the head, after a confirmation prompt) are replaced instead of crashing the autoscaler.

This PR also updates the cluster configuration documentation to describe the new behavior.
Related issue number

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- Also tested manually.