-
Notifications
You must be signed in to change notification settings - Fork 415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Trigger failover based on whether the previous cluster was ever UP #2977
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll! Some questions.
sky/backends/cloud_vm_ray_backend.py
Outdated
# If we don't refresh the state of the cluster and reset it | ||
# back to STOPPED, our failover logic will consider it as an | ||
# abnormal cluster after hitting resources capacity limit on | ||
# the cloud, and will start failover. This is not desired, | ||
# because the user may want to keep the data on the disk of | ||
# that cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since now we have the "ever up" logic, is this comment still accurate? Do we still need to force-refresh for INIT (I guess it doesn't harm to properly restore it to STOPPED for nicer UX, but perhaps the avoid-failover reason no longer applies)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Updated the comment. PTAL. : )
sky/backends/cloud_vm_ray_backend.py
Outdated
message = (f'Failed to launch the cluster {cluster_name!r}. ' | ||
'It is now stopped.\n\tTo remove the cluster ' | ||
f'please run: sky down {cluster_name}\n\t' | ||
'Try launching the cluster again with: ' | ||
f'sky start {cluster_name}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we use the same message as L1305? Seems applicable in both cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, they are different. Here it is the case when we restart/re-launch a cluster that was ever up and failed, the cluster is stopped and we fail immediately without triggerring the failover.
However, in the case at L1305, the previous cluster is terminated, and the failover is triggered. We do not need to print the hints, as the failover will automatically do the termination
and re-launch.
Minor UX: while testing
Shall we make the provisioner.py:425 message debug()? It seems too low-level to print. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll - some quick nits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @Michaelvll.
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Just tried to start a
Is this message expected? (Is this from GCP provisioner V2 PR?) The |
This is a great catch! It seems our original Tested:
|
The previous logic for trigger failover is based on the previous cluster status to be
None
orINIT
. This can be buggy, as if the previous cluster has been UP before, but is somehow inINIT
state at the moment, the provisioner may terminate the cluster when therun_instance
fails on the VM.This is to make the failover a bit more conservation and robust, to not terminate clusters that was ever UP before.
Tested (run the relevant ones):
bash format.sh
INIT
:sky launch -c test-vm --cloud gcp touch myfile.txt
;ssh test-vm
;ray stop
;sky status -r
raise RuntimeError
sky launch -c test-vm ls myfile.txt
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh