
Moving retryable-err checks to errors.As, moving some to not-retryable #1167

Merged
8 commits merged into cadence-workflow:master from Groxx:retries on Jun 22, 2022

Conversation

@Groxx Groxx (Contributor) commented May 19, 2022

Part 1 of 2 for solving retry storms, particularly around incorrectly-categorized
errors (e.g. limit exceeded) and service-busy.

This PR moves us to errors.As to support wrapped errors in the future, and
re-categorizes some incorrectly-retried errors. This is both useful on its own,
and helps make #1174 a smaller and clearer change.

Service-busy behavior is actually changed in #1174, this PR intentionally
maintains its current (flawed) behavior.
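
For illustration, here is a minimal sketch of the kind of errors.As-based check described above; the helper name, the import alias, and the exact set of non-retryable types are assumptions for this example, not the PR's literal code:

```go
package example

import (
	"errors"

	s "go.uber.org/cadence/.gen/go/shared"
)

// isRetryable reports whether err is worth retrying.
// errors.As (rather than a direct type assertion on err) also matches
// wrapped errors, e.g. fmt.Errorf("call failed: %w", err).
func isRetryable(err error) bool {
	var (
		badRequest    *s.BadRequestError
		entityMissing *s.EntityNotExistsError
		limitExceeded *s.LimitExceededError
	)
	if errors.As(err, &badRequest) ||
		errors.As(err, &entityMissing) ||
		errors.As(err, &limitExceeded) {
		// Caller or quota errors: retrying will not help.
		return false
	}
	// Everything else, including service-busy (for now), is treated as
	// transient and retryable.
	return true
}
```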

Commits are ordered for ease of reading / verifying, and the added tests pass
at each stage (but I did not run the full suite each time):

  • just adds test
  • just moves to errors.As, tests still pass
  • changes which types are retried
  • (minor test fix)
  • changes from feedback

Comment on lines +14 to +17
// service-busy means "retry later", which is still transient/retryable.
// callers with retries MUST detect this separately and delay before retrying,
// and ideally we'll return a minimum time-to-wait in errors in the future.
&s.ServiceBusyError{},
@Groxx (Contributor, Author) replied:
tackled in a followup PR, as I want that to be a separate commit/review
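
For context, the "detect this separately" requirement in the comment above might look something like this hypothetical caller-side helper (reusing the errors and s imports from the earlier sketch; not code from this PR):

```go
// isServiceBusy lets a retrying caller single out service-busy responses so it
// can delay before the next attempt, even though the error counts as transient.
func isServiceBusy(err error) bool {
	var busy *s.ServiceBusyError
	return errors.As(err, &busy)
}
```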

Groxx added a commit to Groxx/cadence-client that referenced this pull request Jun 22, 2022
Builds on cadence-workflow#1167, but adds delay before retrying service-busy errors.

For now, since our server-side RPS quotas are calculated per second, this delays
at least 1 second per service-busy error.
This is in contrast to the previous behavior, which would have retried up to about
a dozen times in the same period; that rapid retrying is the cause of the
service-busy-based retry storms that generate many more service-busy errors.

---

This also gives us an easy way to make use of "retry after" information in errors
we return to the caller, though currently our errors do not contain that.

Eventually this should probably come from the server, which has a global view of
how many requests this service has sent, and can provide a more precise delay to
individual callers.
E.g. currently our server-side ratelimiter works in 1-second slices... but that
isn't something that's guaranteed to stay true.  The server could also detect truly
large floods of requests, and return jittered values larger than 1 second to more
powerfully stop the storm, or to allow prioritizing some requests (like activity
responses) over others simply by returning a lower delay.
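
As a rough sketch of the retry behavior described in that commit message, assuming the isRetryable helper from the earlier sketch (the function names, backoff values, and structure here are illustrative, not the followup PR's actual code):

```go
package example

import (
	"context"
	"errors"
	"time"

	s "go.uber.org/cadence/.gen/go/shared"
)

// Server-side RPS quotas are per second, so wait at least that long before
// retrying a service-busy response (assumed constant for this sketch).
const minServiceBusyDelay = time.Second

// callWithRetry retries op on transient errors with exponential backoff, and
// enforces the minimum service-busy delay so a burst of retries cannot
// immediately re-trigger the rate limit.
func callWithRetry(ctx context.Context, op func(context.Context) error) error {
	backoff := 100 * time.Millisecond // illustrative initial backoff
	for {
		err := op(ctx)
		if err == nil || !isRetryable(err) {
			return err
		}
		delay := backoff
		var busy *s.ServiceBusyError
		if errors.As(err, &busy) && delay < minServiceBusyDelay {
			delay = minServiceBusyDelay
		}
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
	}
}
```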
@coveralls

Pull Request Test Coverage Report for Build 0181880b-5443-4634-9400-8e5f001cb004

  • 43 of 43 (100.0%) changed or added relevant lines in 1 file are covered.
  • 11 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 63.851%

Files with coverage reduction:
  • internal/compatibility/thrift/history.go: 11 new missed lines, 57.65% file coverage

Totals:
  • Change from base Build 01816e39-350f-43ae-a79e-40eaad99261d: +0.1%
  • Covered Lines: 12433
  • Relevant Lines: 19472

💛 - Coveralls

@Groxx Groxx merged commit 1005ea5 into cadence-workflow:master Jun 22, 2022
@Groxx Groxx deleted the retries branch June 22, 2022 19:42
Groxx added a commit that referenced this pull request Nov 7, 2022
Builds on #1167, but adds delay before retrying service-busy errors.
