Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Autoscaler] Get_Head_Node should return an up-to-date node #14579

Merged
merged 7 commits into from
Mar 16, 2021

Conversation

ijrsvt
Copy link
Contributor

@ijrsvt ijrsvt commented Mar 9, 2021

Why are these changes needed?

  • If multiple head nodes exist (because one of them is in some sort of failed state), calling get_head_node can return the failed node. This PR ensures that get_head_node only returns a node in the up-to-date state.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@AmeerHajAli
Copy link
Contributor

Hi @ijrsvt , I realize this issue is happening and thanks for taking the initiative here.
My concern is that the tag status we are checking is not always the tag we are looking for. For example, immediately after successfully creating the head node the state is:

head_node_tags[TAG_RAY_NODE_STATUS] = STATUS_UNINITIALIZED

Checking that the status is up to date is not going to catch that.

@AmeerHajAli
Copy link
Contributor

I think the right solution is that when creating a head node, if it fails to create it, we should just terminate it.

@ijrsvt
Copy link
Contributor Author

ijrsvt commented Mar 12, 2021

Checking that the status is up to date is not going to catch that.

@AmeerHajAli The head node is set to STATUS_UP_TO_DATE before create_or_update_cluster returns. I don't think we need to be worrying about a use case where a user is trying to manipulate the cluster in an two separate places.

I think the right solution is that when creating a head node, if it fails to create it, we should just terminate it.

I don't think this is a good idea because this adds a lot of latency to iterating through writing setup/initialization commands. For example, if a user has a typo in their setup_commands, the behavior of leaving the node running and re-using it for the next attempt avoids the time of (node_shutdown + node_startup).

At it's core, this PR is defending against a bad implementation of a NodeProvider (one that does not properly re-use the failed head_node).

@AmeerHajAli
Copy link
Contributor

The issue is that if there was a bug in the setup commands, the head node will stay running and the status tag will be STATUS_UPDATE_FAILED but the node will be non terminated and the node can be reused in that case.

@AmeerHajAli
Copy link
Contributor

Oh, wait, @ijrsvt, this is not modifying create_or_update_cluster right?

@ijrsvt
Copy link
Contributor Author

ijrsvt commented Mar 12, 2021

Oh, wait, @ijrsvt, this is not modifying create_or_update_cluster right?

Yep, this does not change create_or_update_cluster!

@AmeerHajAli
Copy link
Contributor

Do you mind adding some warning that other head nodes exist in failed states if there are head nodes with non STATUS_UP_TO_DATE tag?

@AmeerHajAli AmeerHajAli added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 12, 2021
@ijrsvt
Copy link
Contributor Author

ijrsvt commented Mar 13, 2021

@AmeerHajAli I can add that, but let me switch my implementation to only call non_terminated_nodes once.

@ijrsvt ijrsvt removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 15, 2021
Copy link
Contributor

@AmeerHajAli AmeerHajAli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@ijrsvt ijrsvt merged commit d251bb6 into ray-project:master Mar 16, 2021
@ijrsvt ijrsvt deleted the rsync-to-non-failed-head branch March 16, 2021 00:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants