
Moving retryable-err checks to errors.As, moving some to not-retryable #1167

Merged
8 commits merged into cadence-workflow:master from Groxx:retries on Jun 22, 2022

Conversation

@Groxx Groxx (Contributor) commented May 19, 2022

Part 1 of 2 for solving retry storms, particularly around incorrectly-categorized
errors (e.g. limit exceeded) and service-busy.

This PR moves us to errors.As to support wrapped errors in the future, and
re-categorizes some incorrectly-retried errors. This is both useful on its own,
and helps make #1174 a smaller and clearer change.

Service-busy behavior is actually changed in #1174, this PR intentionally
maintains its current (flawed) behavior.
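
For illustration, here is a minimal sketch of the kind of errors.As-based check described above; the helper name, the import alias, and the exact set of non-retryable types are assumptions for this example, not the PR's literal code:

```go
package example

import (
	"errors"

	s "go.uber.org/cadence/.gen/go/shared"
)

// isRetryable reports whether err is worth retrying.
// errors.As (rather than a direct type assertion on err) also matches
// wrapped errors, e.g. fmt.Errorf("call failed: %w", err).
func isRetryable(err error) bool {
	var (
		badRequest    *s.BadRequestError
		entityMissing *s.EntityNotExistsError
		limitExceeded *s.LimitExceededError
	)
	if errors.As(err, &badRequest) ||
		errors.As(err, &entityMissing) ||
		errors.As(err, &limitExceeded) {
		// Caller or quota errors: retrying will not help.
		return false
	}
	// Everything else, including service-busy (for now), is treated as
	// transient and retryable.
	return true
}
```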

Commits are ordered for ease of reading / verifying, and the added tests pass
at each stage (but I did not run the full suite each time):

  • just adds test
  • just moves to errors.As, tests still pass
  • changes which types are retried
  • (minor test fix)
  • changes from feedback

Comment on lines +14 to +17
// service-busy means "retry later", which is still transient/retryable.
// callers with retries MUST detect this separately and delay before retrying,
// and ideally we'll return a minimum time-to-wait in errors in the future.
&s.ServiceBusyError{},
@Groxx (Contributor, Author) replied:
tackled in a followup PR, as I want that to be a separate commit/review
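
For context, the "detect this separately" requirement in the comment above might look something like this hypothetical caller-side helper (reusing the errors and s imports from the earlier sketch; not code from this PR):

```go
// isServiceBusy lets a retrying caller single out service-busy responses so it
// can delay before the next attempt, even though the error counts as transient.
func isServiceBusy(err error) bool {
	var busy *s.ServiceBusyError
	return errors.As(err, &busy)
}
```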

Groxx added a commit to Groxx/cadence-client that referenced this pull request Jun 22, 2022
Builds on cadence-workflow#1167, but adds delay before retrying service-busy errors.

For now, since our server-side RPS quotas are calculated per second, this delays
at least 1 second per service-busy error.
This is in contrast to the previous behavior, which would have retried up to about
a dozen times in the same period; that rapid retrying is the cause of the
service-busy-based retry storms that generate many more service-busy errors.

---

This also gives us an easy way to make use of "retry after" information in errors
we return to the caller, though currently our errors do not contain that.

Eventually this should probably come from the server, which has a global view of
how many requests this service has sent, and can provide a more precise delay to
individual callers.
E.g. currently our server-side ratelimiter works in 1-second slices... but that
isn't something that's guaranteed to stay true.  The server could also detect truly
large floods of requests, and return jittered values larger than 1 second to more
powerfully stop the storm, or to allow prioritizing some requests (like activity
responses) over others simply by returning a lower delay.
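
As a rough sketch of the retry behavior described in that commit message, assuming the isRetryable helper from the earlier sketch (the function names, backoff values, and structure here are illustrative, not the followup PR's actual code):

```go
package example

import (
	"context"
	"errors"
	"time"

	s "go.uber.org/cadence/.gen/go/shared"
)

// Server-side RPS quotas are per second, so wait at least that long before
// retrying a service-busy response (assumed constant for this sketch).
const minServiceBusyDelay = time.Second

// callWithRetry retries op on transient errors with exponential backoff, and
// enforces the minimum service-busy delay so a burst of retries cannot
// immediately re-trigger the rate limit.
func callWithRetry(ctx context.Context, op func(context.Context) error) error {
	backoff := 100 * time.Millisecond // illustrative initial backoff
	for {
		err := op(ctx)
		if err == nil || !isRetryable(err) {
			return err
		}
		delay := backoff
		var busy *s.ServiceBusyError
		if errors.As(err, &busy) && delay < minServiceBusyDelay {
			delay = minServiceBusyDelay
		}
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
	}
}
```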
@coveralls

Pull Request Test Coverage Report for Build 0181880b-5443-4634-9400-8e5f001cb004

  • 43 of 43 (100.0%) changed or added relevant lines in 1 file are covered.
  • 11 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 63.851%

Files with coverage reduction:
  • internal/compatibility/thrift/history.go: 11 new missed lines, 57.65% file coverage

Totals:
  • Change from base Build 01816e39-350f-43ae-a79e-40eaad99261d: +0.1%
  • Covered Lines: 12433
  • Relevant Lines: 19472

💛 - Coveralls

@Groxx Groxx merged commit 1005ea5 into cadence-workflow:master Jun 22, 2022
@Groxx Groxx deleted the retries branch June 22, 2022 19:42
Groxx added a commit that referenced this pull request Nov 7, 2022
Builds on #1167, but adds delay before retrying service-busy errors.
