Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Provisioner] Robustify the termiantion for provision failure to avoid leakage #2990

Merged
merged 10 commits into from
Jan 21, 2024

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Jan 16, 2024

This is to fix the following case, inspired by #2975 (review):

  1. sky launch -c test
  2. The provision for the new VM fails with a VM actually created on the cloud (to simulate add `raise RuntimeError('test provisioning') at
    return _bulk_provision(cloud, region, zones, cluster_name,
    )
  3. The new provisioner tries to termiante the cluster and trigger failover.
  4. If the termination fails (e.g., the cloud API fails to terminate a VM that is in a state of being created), our current implementation will ignore the error and go to the next region, causing the leakage of previous VM. (to simulate, add raise RuntimeError('test: failed teardown') at )

In this PR, we add additional retries for the termination, and if the failure still occurs after retrying for three times, we error out, instead of keeps failover through the following regions. Also, the failed cluster will remain in the cluster table, and the user should be able to check the cluster's status and terminate it manually.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Test described above.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Copy link
Collaborator

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @Michaelvll! I assume provisioner V1 doesn't have this problem, since https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L1693-L1694 seems to throw?

sky/provision/provisioner.py Show resolved Hide resolved
sky/provision/provisioner.py Show resolved Hide resolved
sky/provision/provisioner.py Outdated Show resolved Hide resolved
sky/provision/provisioner.py Outdated Show resolved Hide resolved
sky/provision/common.py Outdated Show resolved Hide resolved
sky/provision/provisioner.py Outdated Show resolved Hide resolved
Michaelvll and others added 8 commits January 19, 2024 01:00
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
…ypilot-org/skypilot into robustify-the-termination-for-failure
Copy link
Collaborator

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

sky/provision/common.py Outdated Show resolved Hide resolved
f'{terminate_str.lower()} {cluster_name!r} failed. '
'This can cause resource leakage. Please check the '
'failure and the cluster status on the cloud, and '
'manually terminate the cluster. '
f'Details: {formatted_exception}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make L173 as e and raise this error from e?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should raise from the teardown exception? I make it from the e raised by the teardown_cluster

sky/provision/provisioner.py Outdated Show resolved Hide resolved
@Michaelvll Michaelvll merged commit cbd8fc8 into master Jan 21, 2024
19 checks passed
@Michaelvll Michaelvll deleted the robustify-the-termination-for-failure branch January 21, 2024 05:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants