[GCP] Add retry for transient error during launching GCP clusters #2669

Merged

merged 6 commits into master from retry-for-resource-not-found on Oct 6, 2023

Conversation

Collaborator

@Michaelvll Michaelvll commented Oct 6, 2023

GCP generates some flaky errors during launching, causing #2666. We should add retries for ray up during launching, as well as update the blocklist on such errors.
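
For context, the fix boils down to re-running ray up only when the captured stderr matches a known transient pattern, and otherwise falling through to the normal error handling (e.g. blocklisting the zone). Below is a minimal sketch of that idea; the function name, pattern list, and backoff policy are illustrative assumptions, not SkyPilot's actual code.

# Minimal sketch (illustrative only): retry `ray up` on transient stderr.
# The helper name, patterns, and backoff below are assumptions, not the
# PR's real implementation.
import subprocess
import time

# 'Head node fetch timed out' is the transient message discussed in this PR;
# other patterns would be added as they are observed.
_TRANSIENT_PATTERNS = ('Head node fetch timed out',)

def ray_up_with_retry(cluster_yaml: str, max_attempts: int = 3) -> int:
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.run(['ray', 'up', '-y', cluster_yaml],
                              capture_output=True, text=True)
        if proc.returncode == 0:
            return 0
        is_transient = any(p in proc.stderr for p in _TRANSIENT_PATTERNS)
        if is_transient and attempt < max_attempts:
            # Transient cloud-side failure: back off and run `ray up` again.
            time.sleep(10 * attempt)
            continue
        # Non-transient error (or retries exhausted): let the caller handle
        # it, e.g. by updating its blocklist and trying another zone.
        return proc.returncode
    return 1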

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch --cloud gcp --gpus A100:8 --image-id projects/deeplearning-platform-release/global/images/pytorch-2-0-gpu-v20230822-ubuntu-2004-py310 --down --retry-until-up
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll changed the title [GCP] Add retry for flaky error during launching GCP clusters [GCP] Add retry for transient error during launching GCP clusters Oct 6, 2023
Collaborator

@concretevitamin concretevitamin left a comment


LGTM, thanks @Michaelvll.

if 'Head node fetch timed out' in stderr:
    # Example: click.exceptions.ClickException: Head node fetch
    # timed out. Failed to create head node.
    # This is a transient error, but we have retried in need_ray_up
Collaborator

nit/Q: is this a transient error from the cloud, or due to certain image sizes being too large (which then makes the timeout in ray flaky)?

Collaborator Author

@Michaelvll Michaelvll Oct 6, 2023


It is likely that the issue is with the cloud: ray gets through its waiting loop once the VM enters the PROVISIONING state and has the tag set, while the image setup probably happens in the next state, STAGING. https://cloud.google.com/compute/docs/instances/instance-life-cycle

I have seen multiple times that when I run sky launch --gpus A100:8, our program gets the IP address for the VM even though the VM has just been deleted by GCP, as shown on the console.
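
To make the lifecycle point concrete, here is a minimal sketch (not part of this PR) of polling an instance's status through the Compute Engine API until it reaches RUNNING. It assumes google-api-python-client with application-default credentials; the project/zone/instance values are placeholders.

# Minimal sketch (illustrative only): wait for a GCP VM to reach RUNNING.
import time
import googleapiclient.discovery
from googleapiclient.errors import HttpError

def wait_until_running(project: str, zone: str, instance: str,
                       timeout: int = 300) -> bool:
    compute = googleapiclient.discovery.build('compute', 'v1')
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            info = compute.instances().get(project=project, zone=zone,
                                           instance=instance).execute()
        except HttpError as e:
            if e.resp.status == 404:
                # The VM no longer exists, e.g. GCP deleted it after we had
                # already recorded its IP address.
                return False
            raise
        # Lifecycle states: PROVISIONING -> STAGING -> RUNNING (see the
        # instance-life-cycle doc linked above).
        if info['status'] == 'RUNNING':
            return True
        time.sleep(5)
    return False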

Collaborator

SG

Collaborator

@concretevitamin concretevitamin left a comment


LGTM

@Michaelvll Michaelvll merged commit f8f613d into master Oct 6, 2023
18 checks passed
@Michaelvll Michaelvll deleted the retry-for-resource-not-found branch October 6, 2023 16:19
jc9123 pushed a commit to jc9123/skypilot that referenced this pull request Oct 11, 2023
…ypilot-org#2669)

* Add retry for flaky error during launching GCP clusters

* handle error

* format

* Do not log out stderr

* Add retry for gcloud crash

* fix retry return code