
Simplify system retry logic: Part 1 #3172

Merged: yycptt merged 3 commits into temporalio:master on Aug 4, 2022
Conversation

yycptt (Member) commented on Aug 2, 2022

What changed?

  • One retry layer at the caller (service client), one retry layer at the handler (handler interceptor), and one retry layer when calling downstream dependencies (persistence client). Each retry layer has a maximum of 2 attempts (see the sketch after this list).
  • Resource exhausted errors are only retried at the caller layer (service client).
  • The frontend handler interceptor has more retry attempts than other handlers.
  • No more retry in history workflow context/transaction.
  • History handler no longer retries (conditionalRetryCount) for ErrConflict, ErrStaleState, or when unable to locate the current workflow run. Instead, when one of those errors happens, the history interceptor will attempt a retry.
  • Improve retry behavior for background task loading (previously there was no backoff for the timer and no special handling for service busy errors).
  • Task processor no longer retries a task while holding the goroutine. Instead, the task is immediately sent to the end of the task queue for retry.
  • Remove unnecessary retry layers in other miscellaneous places.
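
As a rough illustration of the layering described above (not the actual Temporal code), the sketch below shows a generic per-layer retry wrapper capped at 2 attempts, where only the outermost caller layer treats a resource-exhausted ("service busy") error as retryable. The names callWithRetry and errServiceBusy are hypothetical:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errServiceBusy stands in for a resource-exhausted error; in this sketch only
// the outermost (service client) layer retries it, mirroring the PR's rule.
var errServiceBusy = errors.New("service busy")

// callWithRetry is an illustrative per-layer retry wrapper: at most 2 attempts,
// with a short backoff before the second attempt.
func callWithRetry(ctx context.Context, op func(context.Context) error, retryable func(error) bool) error {
	const maxAttempts = 2
	var err error
	for attempt := 1; ; attempt++ {
		if err = op(ctx); err == nil || !retryable(err) || attempt >= maxAttempts {
			return err
		}
		// Backoff before the next attempt.
		select {
		case <-time.After(50 * time.Millisecond):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	// Handler layer: does NOT treat "service busy" as retryable.
	handler := func(ctx context.Context) error {
		return callWithRetry(ctx, func(context.Context) error { return errServiceBusy },
			func(err error) bool { return !errors.Is(err, errServiceBusy) })
	}
	// Caller (service client) layer: the only layer that retries "service busy".
	err := callWithRetry(context.Background(), handler,
		func(err error) bool { return errors.Is(err, errServiceBusy) })
	fmt.Println("final error:", err)
}
```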

NOTE:

  • Retry for visibility client is not done in this PR.
  • This PR mainly focuses on the core history service. Additional cleanup is still needed for the frontend, history replication, matching, and worker services.

Why?

  • Simplify system retry logic

How did you test it?

  • Existing unit/integration test + Canary.

Potential risks

  • API calls and background tasks are more likely to fail.

Is hotfix candidate?

  • No.

@yycptt yycptt requested a review from yiminc August 2, 2022 01:18
@yycptt yycptt requested a review from a team as a code owner August 2, 2022 01:18
}
)

var _ grpc.UnaryServerInterceptor = (*RetryableInterceptor)(nil).Intercept
Member commented:

This is great. Are you planning to consolidate the "retryable rpc clients" (client/*/retryable_client*.go) with a client interceptor as well?

yycptt (Member, Author) replied on Aug 2, 2022:

Oh that's a good idea. I wasn't planning for it, but yeah I can replace them with a client interceptor.

The metrics client probably also needs to be converted into a client interceptor, since currently the retryable client is layered on top of the metrics client.
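
For illustration, a minimal sketch of what such a retrying client interceptor might look like, assuming a plain gRPC status-code check for retryability; this is a hypothetical stand-in for, not a copy of, the client/*/retryable_client*.go logic or Temporal's backoff package:

```go
package rpc

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// NewRetryUnaryClientInterceptor returns a grpc.UnaryClientInterceptor that
// retries a failed call once (2 attempts total) if the error looks transient.
func NewRetryUnaryClientInterceptor() grpc.UnaryClientInterceptor {
	return func(
		ctx context.Context,
		method string,
		req, reply interface{},
		cc *grpc.ClientConn,
		invoker grpc.UnaryInvoker,
		opts ...grpc.CallOption,
	) error {
		err := invoker(ctx, method, req, reply, cc, opts...)
		if err == nil || !isTransient(err) {
			return err
		}
		// One backoff, then the second and final attempt.
		select {
		case <-time.After(100 * time.Millisecond):
		case <-ctx.Done():
			return ctx.Err()
		}
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}

// isTransient is a placeholder retryability check based on gRPC status codes.
func isTransient(err error) bool {
	switch status.Code(err) {
	case codes.Unavailable, codes.ResourceExhausted:
		return true
	default:
		return false
	}
}
```

It would be installed when dialing, e.g. with grpc.WithUnaryInterceptor(NewRetryUnaryClientInterceptor()).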

yycptt (Member, Author) added:

Created a task for this for now (#3183), as the replacement is not the main purpose of this PR.

Comment on lines +213 to +214
case *serviceerror.Unavailable:
return true
Member commented:

Under what conditions does the persistence layer return an Unavailable error?

yycptt (Member, Author) replied:

I couldn't find any usage for cassandra, but for sql, it's returned here: https://github.com/temporalio/temporal/blob/master/common/persistence/sql/common.go#L66

I removed the resource exhausted error from the existing IsPersistenceTransientError implementation, so only the Unavailable error remains here.
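
A sketch of what the narrowed check could look like after that change; the real IsPersistenceTransientError may also cover persistence-specific timeout types, so treat this as illustrative only:

```go
package persistence

import "go.temporal.io/api/serviceerror"

// isPersistenceTransientError sketches the narrowed check discussed above:
// with ResourceExhausted removed, only Unavailable is treated as transient.
func isPersistenceTransientError(err error) bool {
	switch err.(type) {
	case *serviceerror.Unavailable:
		return true
	default:
		return false
	}
}
```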

Comment on lines +265 to +270
currentRunID, err := c.getCurrentRunID(
ctx,
shardOwnershipAsserted,
namespaceID,
workflowID,
)
Member commented:

Why do we need this get again? It is already done at line 243.

yycptt (Member, Author) replied:

If the workflow is closed, we can't know whether a newer run was created after getting the runID on L243, so we need to get the runID again to validate that it's still the latest run. And since the workflow is locked before the second check, there won't be a newer run if the second check succeeds.
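
A simplified sketch of that check-lock-recheck pattern; getCurrentRunID and lockCurrentWorkflow are hypothetical stand-ins for the real workflow-context methods, passed in only to keep the sketch self-contained:

```go
package workflow

import (
	"context"

	"go.temporal.io/api/serviceerror"
)

// getCurrentWorkflowRunID sketches the "get, lock, get again" pattern described above.
func getCurrentWorkflowRunID(
	ctx context.Context,
	getCurrentRunID func(context.Context) (string, error),
	lockCurrentWorkflow func(context.Context, string) (func(), error),
) (string, func(), error) {
	// First read: find the run that is currently the latest.
	runID, err := getCurrentRunID(ctx)
	if err != nil {
		return "", nil, err
	}
	// Lock the workflow context for that run.
	unlock, err := lockCurrentWorkflow(ctx, runID)
	if err != nil {
		return "", nil, err
	}
	// Second read: a newer run may have been created between the first read and
	// acquiring the lock, so validate that runID is still the current run.
	latestRunID, err := getCurrentRunID(ctx)
	if err != nil {
		unlock()
		return "", nil, err
	}
	if latestRunID != runID {
		unlock()
		return "", nil, serviceerror.NewUnavailable("unable to locate current workflow execution")
	}
	// The lock is held and the run is confirmed current, so no newer run can be
	// created until unlock is called.
	return runID, unlock, nil
}
```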

// ErrMaxAttemptsExceeded is exported temporarily for integration test
ErrMaxAttemptsExceeded = errors.New("maximum attempts exceeded to update history")
// ErrLocateCurrentWorkflowExecution is the error returned when current workflow execution can't be located
ErrLocateCurrentWorkflowExecution = serviceerror.NewUnavailable("unable to locate current workflow execution")
Member commented:

why is this a retryable error?

yycptt (Member, Author) replied:

This only happens when the workflow creates a newer run between getting the current runID and loading the mutable state for that runID in getCurrentWorkflowContext(). So it's due to a race condition between two concurrent user requests, which can likely be resolved by a retry.
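
For context, a simplified sketch (not the RetryableInterceptor added in this PR) of how a handler-side interceptor could retry such a transient race; isRetryable is a stand-in for the real retryability check:

```go
package interceptor

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// NewRetryableServerInterceptor returns a grpc.UnaryServerInterceptor that
// re-invokes the handler once when it fails with an error the provided
// isRetryable check considers transient (e.g. the "unable to locate current
// workflow execution" race described above).
func NewRetryableServerInterceptor(isRetryable func(error) bool) grpc.UnaryServerInterceptor {
	return func(
		ctx context.Context,
		req interface{},
		info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler,
	) (interface{}, error) {
		resp, err := handler(ctx, req)
		if err == nil || !isRetryable(err) {
			return resp, err
		}
		// Brief backoff, then one more attempt; the concurrent request that won
		// the race has usually committed by then, so the retry sees the new run.
		select {
		case <-time.After(50 * time.Millisecond):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
		return handler(ctx, req)
	}
}
```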

service/history/workflow/transaction_impl.go (outdated comment, resolved)
return false
}

func IsServiceHandlerRetryableError(err error) bool {
switch err.(type) {
case *serviceerror.Internal,
Member commented:

I am surprised to see Internal here. It shouldn't be retryable.

yycptt (Member, Author) replied:

I agree, but I am not confident enough that we follow this requirement throughout our codebase to remove it from here. 😞

@yycptt yycptt merged commit 0b4bf47 into temporalio:master Aug 4, 2022
@yycptt yycptt deleted the simplify-retry branch August 4, 2022 21:39
yycptt added a commit that referenced this pull request Aug 12, 2022