[Provisioner] Robustify the termination for provision failure to avoid leakage #2990
Good catch @Michaelvll! I assume provisioner V1 doesn't have this problem, since https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L1693-L1694 seems to throw?
…ify-the-termination-for-failure
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
…ypilot-org/skypilot into robustify-the-termination-for-failure
LGTM
sky/provision/provisioner.py
Outdated
f'{terminate_str.lower()} {cluster_name!r} failed. '
'This can cause resource leakage. Please check the '
'failure and the cluster status on the cloud, and '
'manually terminate the cluster. '
f'Details: {formatted_exception}')
Should we make L173 `as e` and raise this error `from e`?
We should probably raise from the teardown exception? I made it raise from the `e` raised by `teardown_cluster`.
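For readers following the thread, the chaining being discussed can be sketched as below. This is only an illustration: `TeardownError`, the stub `teardown_cluster`, and `terminate_after_provision_failure` are hypothetical names made up for this example, not the code in `sky/provision/provisioner.py`.

```python
class TeardownError(RuntimeError):
    """Hypothetical error type for this sketch; not SkyPilot's actual class."""


def teardown_cluster(cluster_name: str) -> None:
    # Stand-in for the real teardown call; it always fails in this sketch.
    raise RuntimeError('test: failed teardown')


def terminate_after_provision_failure(cluster_name: str) -> None:
    try:
        teardown_cluster(cluster_name)
    except Exception as e:  # bind the teardown exception to `e`
        # `from e` stores the teardown exception in __cause__, so the
        # traceback shows the teardown failure as the direct cause.
        raise TeardownError(
            f'Terminating {cluster_name!r} failed. '
            'This can cause resource leakage. '
            f'Details: {e}') from e
```

With `from e`, the teardown failure is reachable as `__cause__` on the new exception, so the original error is not lost when the higher-level message is raised.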
This is to fix the following case, inspired by #2975 (review):
sky launch -c test
skypilot/sky/provision/provisioner.py
Line 162 in eac8592
raise RuntimeError('test: failed teardown') at skypilot/sky/provision/provisioner.py, line 190 in eac8592
In this PR, we add retries for the termination. If the termination still fails after retrying three times, we error out instead of continuing to fail over to the following regions. The failed cluster also remains in the cluster table, so the user can check the cluster's status and terminate it manually.
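The retry-then-error-out behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `teardown_with_retries`, its parameters, and the reading of "three times" as three total attempts are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


def teardown_with_retries(teardown: Callable[[str], None],
                          cluster_name: str,
                          max_attempts: int = 3,
                          backoff_seconds: float = 0.0) -> int:
    """Call `teardown` until it succeeds or `max_attempts` is exhausted.

    Returns the attempt number that succeeded. Hypothetical helper for
    illustration only.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            teardown(cluster_name)
            return attempt
        except Exception as e:
            last_error = e
            if attempt < max_attempts:
                time.sleep(backoff_seconds)
    # All attempts failed: surface the error rather than silently moving
    # on to the next region, so the caller can stop the failover.
    raise RuntimeError(
        f'Terminating {cluster_name!r} failed after {max_attempts} attempts. '
        'This can cause resource leakage. Please check the cluster status '
        f'on the cloud and terminate it manually. Details: {last_error}'
    ) from last_error
```

A caller that catches this final error would stop trying further regions and leave the cluster record in place, which matches the intent that the user can inspect and terminate the cluster manually.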
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh