Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Exponential Backoff for Retry Until Up #2821

Conversation

rohanvaidya45
Copy link
Contributor

@rohanvaidya45 rohanvaidya45 commented Nov 25, 2023

Looks to resolve #2805
Replaces the use of exponential backoff with the static initial value each retry
Tested:

  • Code formatting: bash format.sh
  • All smoke tests: pytest tests/test_smoke.py

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this @rohanvaidya45! It might be better to remove the exponential backoff only and keep the random jitter.

@@ -2859,7 +2859,7 @@ def _provision(
# TODO(suquark): once we have sky on PyPI, we should directly
# install sky from PyPI.
local_wheel_path, wheel_hash = wheel_utils.build_sky_wheel()
backoff = common_utils.Backoff(_RETRY_UNTIL_UP_INIT_GAP_SECONDS)
jitter = common_utils.Jitter(_RETRY_UNTIL_UP_INIT_GAP_SECONDS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just use the Backoff with max_backoff_factor=1?

@@ -2859,7 +2859,9 @@ def _provision(
# TODO(suquark): once we have sky on PyPI, we should directly
# install sky from PyPI.
local_wheel_path, wheel_hash = wheel_utils.build_sky_wheel()
backoff = common_utils.Backoff(_RETRY_UNTIL_UP_INIT_GAP_SECONDS)
backoff = common_utils.Backoff(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a comment on why we disable exponential backoff. Something like:

The most frequent reason for the failure of a provision request is resource unavailability instead of rate limiting; to make users wait shorter, we do not make the backoffs exponential.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this @rohanvaidya45! LGTM.

# limiting; to make users wait shorter, we do not make
# backoffs exponential.
backoff = common_utils.Backoff(
initial_backoff=_RETRY_UNTIL_UP_INIT_GAP_SECONDS,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tried a few clouds and it seems we can also reduce this _RETRY_UNTIL_UP_INIT_GAP_SECONDS to 30. : )

@Michaelvll Michaelvll merged commit 84313ed into skypilot-org:master Dec 6, 2023
19 checks passed
remyleone pushed a commit to remyleone/skypilot that referenced this pull request Dec 26, 2023
* remove exponential backoff

* add back random jitter

* format

* use backoff

* format

* add comment

* rerun checks

* format

* pylint

* form

* change to 30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core] Do not use exponential backoff for --retry-until-up
3 participants