
Simplify system retry logic: Part 1 #3172

Merged: yycptt merged 3 commits into temporalio:master on Aug 4, 2022
Conversation

yycptt (Member) commented on Aug 2, 2022

What changed?

  • One retry layer at the caller (service client), one retry layer at the handler (handler interceptor), and one retry layer when calling downstream dependencies (persistence client). Each retry layer has a maximum of 2 attempts (see the sketch after this list).
  • Resource exhausted errors are only retried at the caller layer (service client).
  • The frontend handler interceptor has more retry attempts than other handlers.
  • No more retry in history workflow context/transaction.
  • History handler no longer retries (conditionalRetryCount) for ErrConflict, ErrStaleState, or when unable to locate the current workflow run. Instead, when one of those errors happens, the history interceptor will attempt a retry.
  • Improve retry behavior for background task loading (previously there was no backoff for the timer and no special handling for service busy errors).
  • Task processor no longer retries a task while holding the goroutine. Instead, the task is immediately sent to the end of the task queue for retry.
  • Remove unnecessary retry layers in other miscellaneous places.
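
As a rough illustration of the layering described above (not the actual Temporal code), the sketch below shows a generic per-layer retry wrapper capped at 2 attempts, where only the outermost caller layer treats a resource-exhausted ("service busy") error as retryable. The names callWithRetry and errServiceBusy are hypothetical:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errServiceBusy stands in for a resource-exhausted error; in this sketch only
// the outermost (service client) layer retries it, mirroring the PR's rule.
var errServiceBusy = errors.New("service busy")

// callWithRetry is an illustrative per-layer retry wrapper: at most 2 attempts,
// with a short backoff before the second attempt.
func callWithRetry(ctx context.Context, op func(context.Context) error, retryable func(error) bool) error {
	const maxAttempts = 2
	var err error
	for attempt := 1; ; attempt++ {
		if err = op(ctx); err == nil || !retryable(err) || attempt >= maxAttempts {
			return err
		}
		// Backoff before the next attempt.
		select {
		case <-time.After(50 * time.Millisecond):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	// Handler layer: does NOT treat "service busy" as retryable.
	handler := func(ctx context.Context) error {
		return callWithRetry(ctx, func(context.Context) error { return errServiceBusy },
			func(err error) bool { return !errors.Is(err, errServiceBusy) })
	}
	// Caller (service client) layer: the only layer that retries "service busy".
	err := callWithRetry(context.Background(), handler,
		func(err error) bool { return errors.Is(err, errServiceBusy) })
	fmt.Println("final error:", err)
}
```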

NOTE:

  • Retry for visibility client is not done in this PR.
  • This PR mainly focuses on the core history service. Additional cleanup is still needed for the frontend, history replication, matching, and worker services.

Why?

  • Simplify system retry logic

How did you test it?

  • Existing unit/integration test + Canary.

Potential risks

  • API calls and background tasks are more likely to fail.

Is hotfix candidate?

  • No.

@yycptt yycptt requested a review from yiminc August 2, 2022 01:18
@yycptt yycptt requested a review from a team as a code owner August 2, 2022 01:18
}
)

var _ grpc.UnaryServerInterceptor = (*RetryableInterceptor)(nil).Intercept
Member commented:

This is great. Are you planning to consolidate the "retryable rpc clients" (client/*/retryable_client*.go) with a client interceptor as well?

yycptt (Member, Author) replied on Aug 2, 2022:

Oh that's a good idea. I wasn't planning for it, but yeah I can replace them with a client interceptor.

The metrics client probably also needs to be converted into a client interceptor, since currently the retryable client is layered on top of the metrics client.
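
For illustration, a minimal sketch of what such a retrying client interceptor might look like, assuming a plain gRPC status-code check for retryability; this is a hypothetical stand-in for, not a copy of, the client/*/retryable_client*.go logic or Temporal's backoff package:

```go
package rpc

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// NewRetryUnaryClientInterceptor returns a grpc.UnaryClientInterceptor that
// retries a failed call once (2 attempts total) if the error looks transient.
func NewRetryUnaryClientInterceptor() grpc.UnaryClientInterceptor {
	return func(
		ctx context.Context,
		method string,
		req, reply interface{},
		cc *grpc.ClientConn,
		invoker grpc.UnaryInvoker,
		opts ...grpc.CallOption,
	) error {
		err := invoker(ctx, method, req, reply, cc, opts...)
		if err == nil || !isTransient(err) {
			return err
		}
		// One backoff, then the second and final attempt.
		select {
		case <-time.After(100 * time.Millisecond):
		case <-ctx.Done():
			return ctx.Err()
		}
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}

// isTransient is a placeholder retryability check based on gRPC status codes.
func isTransient(err error) bool {
	switch status.Code(err) {
	case codes.Unavailable, codes.ResourceExhausted:
		return true
	default:
		return false
	}
}
```

It would be installed when dialing, e.g. with grpc.WithUnaryInterceptor(NewRetryUnaryClientInterceptor()).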

yycptt (Member, Author) added:

Created a task for this for now (#3183), as the replacement is not the main purpose of this PR.

Comment on lines +213 to +214
case *serviceerror.Unavailable:
return true
Member commented:

Under what conditions does the persistence layer return an Unavailable error?

yycptt (Member, Author) replied:

I couldn't find any usage for cassandra, but for sql, it's returned here: https://github.com/temporalio/temporal/blob/master/common/persistence/sql/common.go#L66

I removed the resource exhausted error from the existing IsPersistenceTransientError implementation, so only the Unavailable error remains here.
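
A sketch of what the narrowed check could look like after that change; the real IsPersistenceTransientError may also cover persistence-specific timeout types, so treat this as illustrative only:

```go
package persistence

import "go.temporal.io/api/serviceerror"

// isPersistenceTransientError sketches the narrowed check discussed above:
// with ResourceExhausted removed, only Unavailable is treated as transient.
func isPersistenceTransientError(err error) bool {
	switch err.(type) {
	case *serviceerror.Unavailable:
		return true
	default:
		return false
	}
}
```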

Comment on lines +265 to +270
currentRunID, err := c.getCurrentRunID(
ctx,
shardOwnershipAsserted,
namespaceID,
workflowID,
)
Member commented:

Why do we need this get again? It is already done at line 243.

yycptt (Member, Author) replied:

If the workflow is closed, we can't know whether a newer run was created after getting the runID on L243, so we need to get the runID again to validate that it's still the latest run. And since the workflow is locked before the second check, there won't be a newer run if the second check succeeds.
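
A simplified sketch of that check-lock-recheck pattern; getCurrentRunID and lockCurrentWorkflow are hypothetical stand-ins for the real workflow-context methods, passed in only to keep the sketch self-contained:

```go
package workflow

import (
	"context"

	"go.temporal.io/api/serviceerror"
)

// getCurrentWorkflowRunID sketches the "get, lock, get again" pattern described above.
func getCurrentWorkflowRunID(
	ctx context.Context,
	getCurrentRunID func(context.Context) (string, error),
	lockCurrentWorkflow func(context.Context, string) (func(), error),
) (string, func(), error) {
	// First read: find the run that is currently the latest.
	runID, err := getCurrentRunID(ctx)
	if err != nil {
		return "", nil, err
	}
	// Lock the workflow context for that run.
	unlock, err := lockCurrentWorkflow(ctx, runID)
	if err != nil {
		return "", nil, err
	}
	// Second read: a newer run may have been created between the first read and
	// acquiring the lock, so validate that runID is still the current run.
	latestRunID, err := getCurrentRunID(ctx)
	if err != nil {
		unlock()
		return "", nil, err
	}
	if latestRunID != runID {
		unlock()
		return "", nil, serviceerror.NewUnavailable("unable to locate current workflow execution")
	}
	// The lock is held and the run is confirmed current, so no newer run can be
	// created until unlock is called.
	return runID, unlock, nil
}
```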

// ErrMaxAttemptsExceeded is exported temporarily for integration test
ErrMaxAttemptsExceeded = errors.New("maximum attempts exceeded to update history")
// ErrLocateCurrentWorkflowExecution is the error returned when current workflow execution can't be located
ErrLocateCurrentWorkflowExecution = serviceerror.NewUnavailable("unable to locate current workflow execution")
Member commented:

why is this a retryable error?

yycptt (Member, Author) replied:

This only happens when the workflow creates a newer run between getting the current runID and loading the mutable state for that runID in getCurrentWorkflowContext(). So it's due to a race condition between two concurrent user requests, which can likely be resolved by a retry.
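
For context, a simplified sketch (not the RetryableInterceptor added in this PR) of how a handler-side interceptor could retry such a transient race; isRetryable is a stand-in for the real retryability check:

```go
package interceptor

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// NewRetryableServerInterceptor returns a grpc.UnaryServerInterceptor that
// re-invokes the handler once when it fails with an error the provided
// isRetryable check considers transient (e.g. the "unable to locate current
// workflow execution" race described above).
func NewRetryableServerInterceptor(isRetryable func(error) bool) grpc.UnaryServerInterceptor {
	return func(
		ctx context.Context,
		req interface{},
		info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler,
	) (interface{}, error) {
		resp, err := handler(ctx, req)
		if err == nil || !isRetryable(err) {
			return resp, err
		}
		// Brief backoff, then one more attempt; the concurrent request that won
		// the race has usually committed by then, so the retry sees the new run.
		select {
		case <-time.After(50 * time.Millisecond):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
		return handler(ctx, req)
	}
}
```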

service/history/workflow/transaction_impl.go (outdated comment, resolved)
return false
}

func IsServiceHandlerRetryableError(err error) bool {
switch err.(type) {
case *serviceerror.Internal,
Member commented:

I am surprised to see Internal here. It shouldn't be retryable.

yycptt (Member, Author) replied:

I agree, but I am not confident enough that we follow this requirement throughout our codebase to remove it from here. 😞

@yycptt yycptt merged commit 0b4bf47 into temporalio:master Aug 4, 2022
@yycptt yycptt deleted the simplify-retry branch August 4, 2022 21:39
yycptt added a commit that referenced this pull request Aug 12, 2022