[AWS] More logging for restarting an instance in STOPPING state and fail fast for the restart #2998

Merged
merged 18 commits into from
Feb 2, 2024

Collaborator

@Michaelvll Michaelvll commented Jan 18, 2024

A user reported that restarting a STOPPED AWS cluster took a long time (>40 mins) to surface the no-capacity error. This PR mitigates the problem in two ways: it adds more logging while we wait for an instance that is still in the STOPPING state, and it fails fast when the restart hits a capacity issue.
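
The waiting half of this change can be sketched roughly as below. This is only an illustration, not the code in sky/provision/aws/instance.py: the function name `wait_until_not_stopping`, the `StoppingTimeoutError` class, and the injected `get_state` callable are all hypothetical, and the only AWS-specific assumption is that EC2 reports the state name `stopping` while an instance shuts down.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


class StoppingTimeoutError(RuntimeError):
    """Hypothetical error: the instance stayed in 'stopping' for too long."""


def wait_until_not_stopping(
    get_state: Callable[[], str],
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll `get_state` until it stops returning 'stopping'.

    Logs on every poll so the user can see why the restart is blocked,
    and raises instead of waiting forever once `timeout_s` has elapsed.
    `get_state` is a stand-in for whatever queries the cloud provider.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state != 'stopping':
            return state
        remaining = deadline - clock()
        if remaining <= 0:
            raise StoppingTimeoutError(
                f'Instance still in STOPPING state after {timeout_s:.0f}s; '
                'giving up on the restart.')
        logger.info(
            'Instance is in STOPPING state; waiting for it to stop '
            '(%.0fs left before timeout).', remaining)
        # Never sleep past the deadline.
        sleep(min(poll_interval_s, remaining))
```

Passing `sleep` and `clock` as parameters is a design choice for this sketch only; it lets the timeout path be exercised without real waiting.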

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch --cloud aws --gpus V100:8 -i 0 echo hi; sky stop sky-8919-gcpuser
    • While the instance is stopping: sky start sky-8919-gcpuser. We show more logging and error out on timeout.
    • Once the instance is stopped: sky start sky-8919-gcpuser. The capacity issue is shown quickly.
  • All smoke tests: pytest tests/test_smoke.py --aws
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Collaborator

@cblmemo cblmemo left a comment

LGTM except for several nits 🫡

sky/provision/aws/instance.py: 4 review comments (outdated, resolved)
Michaelvll and others added 5 commits February 2, 2024 08:40
Co-authored-by: Tian Xia <cblmemo@gmail.com>
@Michaelvll Michaelvll merged commit 75ce821 into master Feb 2, 2024
19 checks passed
@Michaelvll Michaelvll deleted the fail-fast-for-restart branch February 2, 2024 19:41