Fix resilient shard #3584

Merged
merged 1 commit into temporalio:master on Nov 15, 2022
Conversation

@yiminc (Member) commented Nov 13, 2022

What changed?
Keep acquireShard retrying on all errors except ShardOwnershipLostError, or until the lifecycleCtx has ended.

Why?
A resilient shard should not give up on any error other than ShardOwnershipLostError; for example, it should keep retrying on timeout errors.
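
A minimal sketch of that change, assuming the surrounding acquireShard context (op, policy, and s.lifecycleCtx from the existing code); the errors.As check against persistence.ShardOwnershipLostError is illustrative and may not match the exact helper the PR uses:

```go
// Sketch only: retry on every error except shard ownership lost,
// and stop once the shard's lifecycle context has been cancelled.
acquireShardRetryable := func(err error) bool {
	if s.lifecycleCtx.Err() != nil {
		return false // shard is shutting down; stop retrying
	}
	var solErr *persistence.ShardOwnershipLostError
	return !errors.As(err, &solErr) // keep retrying on timeouts, Unavailable, etc.
}
err := backoff.ThrottleRetry(op, policy, acquireShardRetryable)
```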

How did you test it?
Tested manually locally with fault injection enabled (slightly modified so it does not return a shard ownership error).

Potential risks
No.

Is hotfix candidate?
Yes.

@yiminc yiminc requested a review from a team as a code owner November 13, 2022 07:07
@@ -1933,7 +1933,18 @@ func (s *ContextImpl) acquireShard() {
		return nil
	}

	err := backoff.ThrottleRetry(op, policy, common.IsPersistenceTransientError)
	// keep retrying except ShardOwnershipLostError or lifecycle context ended

Contributor

Should we update common.IsPersistenceTransientError to exclude shard ownership lost error?

Member Author

IsPersistenceTransientError only returns true for Unavailable and ResourceExhausted, so ShardOwnershipLostError is already excluded there.
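
For illustration, a predicate with that behavior might look like the sketch below; this is an assumption about its shape, not the actual common.IsPersistenceTransientError source (serviceerror.Unavailable and serviceerror.ResourceExhausted are the error types named in the comment above):

```go
// Sketch: only Unavailable and ResourceExhausted count as transient;
// ShardOwnershipLostError (and anything else) is not retried by this predicate.
func isPersistenceTransientError(err error) bool {
	switch err.(type) {
	case *serviceerror.Unavailable, *serviceerror.ResourceExhausted:
		return true
	default:
		return false
	}
}
```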

@yux0 (Contributor) left a comment

Should we worry about retrying too much on resource exhausted errors?

@yiminc (Member, Author) commented Nov 15, 2022

This retry policy only applies to acquireShard, and the initial backoff is 1s.
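
For context, a minimal sketch of a policy with a 1s initial backoff built with the backoff package (the actual acquireShard policy may configure additional options, e.g. a maximum interval):

```go
// Sketch only: exponential backoff starting at 1 second between attempts.
policy := backoff.NewExponentialRetryPolicy(time.Second)
err := backoff.ThrottleRetry(op, policy, acquireShardRetryable)
```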

@yiminc merged commit 76b58a7 into temporalio:master on Nov 15, 2022
alexshtin pushed a commit that referenced this pull request Nov 15, 2022
	err := backoff.ThrottleRetry(op, policy, common.IsPersistenceTransientError)
	// keep retrying except ShardOwnershipLostError or lifecycle context ended
	acquireShardRetryable := func(err error) bool {
		if s.lifecycleCtx.Err() != nil {

Member

You don't really have to check lifecycleCtx here: the first thing op does is check isValid, which checks whether state >= stopping and, if so, returns a ShardOwnershipLost error. state >= stopping iff lifecycleCtx is cancelled.

I'd prefer not doing this additional check since it makes the code more confusing (to me), but it doesn't hurt.
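
A hypothetical sketch of that argument (this is not the real op body, and errShardOwnershipLost stands in for the actual error value returned there):

```go
op := func() error {
	// isValid() is false once state >= stopping, which happens exactly when
	// lifecycleCtx is cancelled, so cancellation already surfaces as a
	// non-retryable shard-ownership-lost error.
	if !s.isValid() {
		return errShardOwnershipLost // stand-in for the ShardOwnershipLostError
	}
	// ... actual shard acquisition work ...
	return nil
}
```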
