Moving retryable-err checks to errors.As, moving some to not-retryable #1167
Merged
Conversation
Groxx commented Jun 22, 2022
Comment on lines +14 to +17
// service-busy means "retry later", which is still transient/retryable.
// callers with retries MUST detect this separately and delay before retrying,
// and ideally we'll return a minimum time-to-wait in errors in the future.
&s.ServiceBusyError{},
tackled in a follow-up PR, as I want that in a separate commit/review
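For illustration, here is a minimal sketch of the caller-side detection the comment above asks for, using errors.As so wrapped errors still match. The package alias, import path, and helper name are assumptions for this sketch, not the client's actual API.

```go
// Illustrative sketch only (not the client's API): a caller with its own
// retry loop detects service-busy separately and waits before retrying.
package retrysketch

import (
	"errors"
	"time"

	// Assumed import path for the generated shared types; adjust to your setup.
	s "go.uber.org/cadence/.gen/go/shared"
)

// delayForError returns an extra wait to apply before the next retry attempt.
// Service-busy is still transient/retryable, but retrying immediately tends
// to amplify the overload that caused it.
func delayForError(err error) time.Duration {
	var busy *s.ServiceBusyError
	if errors.As(err, &busy) {
		// The error does not yet carry a minimum time-to-wait (see the
		// comment above), so fall back to a fixed pause.
		return time.Second
	}
	return 0
}
```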
Groxx requested review from davidporter-id-au, mantas-sidlauskas and vytautas-karpavicius June 22, 2022 02:01
Groxx added a commit to Groxx/cadence-client that referenced this pull request Jun 22, 2022

Builds on cadence-workflow#1167, but adds delay before retrying service-busy errors.

For now, since our server-side RPS quotas are calculated per second, this delays at least 1 second per service busy error. This is in contrast to the previous behavior, which would have retried up to about a dozen times in the same period, which is the cause of service-busy-based retry storms that cause lots more service-busy errors.

This also gives us an easy way to make use of "retry after" information in errors we return to the caller, though currently our errors do not contain that. Eventually this should probably come from the server, which has a global view of how many requests this service has sent, and can provide a more precise delay to individual callers. E.g. currently our server-side ratelimiter works in 1-second slices... but that isn't something that's guaranteed to stay true. The server could also detect truly large floods of requests, and return jittered values larger than 1 second to more powerfully stop the storm, or to allow prioritizing some requests (like activity responses) over others simply by returning a lower delay.
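As a rough, hypothetical sketch of the behavior that commit describes (not the follow-up PR's actual implementation), a retry loop might enforce at least a one-second pause after each service-busy error while keeping the normal backoff for other retryable failures:

```go
// Hypothetical sketch of the described behavior; names and structure are
// illustrative and not taken from the follow-up PR.
package retrysketch

import (
	"errors"
	"time"

	s "go.uber.org/cadence/.gen/go/shared" // assumed import path
)

// callWithBusyDelay retries op, pausing at least one second after each
// service-busy error so a single caller cannot burn a dozen attempts against
// the same exhausted per-second quota.
func callWithBusyDelay(op func() error, attempts int, baseBackoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		wait := baseBackoff
		var busy *s.ServiceBusyError
		if errors.As(err, &busy) && wait < time.Second {
			wait = time.Second // server-side RPS quotas are per second
		}
		time.Sleep(wait)
	}
	return err
}
```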
mantas-sidlauskas approved these changes Jun 22, 2022
Groxx added a commit that referenced this pull request Nov 7, 2022

Builds on #1167, but adds delay before retrying service-busy errors.
Part 1 of 2 for solving retry storms, particularly around incorrectly-categorized errors (e.g. limit exceeded) and service-busy.

This PR moves us to errors.As to support wrapped errors in the future, and re-categorizes some incorrectly-retried errors. This is both useful on its own, and helps make #1174 a smaller and clearer change.

Service-busy behavior is actually changed in #1174; this PR intentionally maintains its current (flawed) behavior.
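As a simplified, hypothetical sketch of the kind of categorization this PR describes (not the client's internal helper), an errors.As-based check can match wrapped errors and treat limit-exceeded as non-retryable while keeping service-busy retryable:

```go
// Simplified, hypothetical sketch of errors.As-based categorization; the
// function and package names are illustrative, not the client's internals.
package retrysketch

import (
	"errors"

	s "go.uber.org/cadence/.gen/go/shared" // assumed import path
)

// isTransient reports whether a failed call is worth retrying at all.
func isTransient(err error) bool {
	// Not retryable: the request itself is invalid or over a hard limit,
	// so repeating the identical call cannot succeed.
	var badRequest *s.BadRequestError
	var limitExceeded *s.LimitExceededError
	if errors.As(err, &badRequest) || errors.As(err, &limitExceeded) {
		return false
	}
	// Service-busy means "retry later": still transient, but callers should
	// delay before retrying (addressed separately in #1174).
	var busy *s.ServiceBusyError
	if errors.As(err, &busy) {
		return true
	}
	// Default: treat unrecognized errors (e.g. transient network failures)
	// as retryable.
	return true
}
```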
Commits are ordered for ease of reading / verifying, and the added tests pass at each stage (but I did not run the full suite each time):