New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix resilient shard #3584
Fix resilient shard #3584
Conversation
@@ -1933,7 +1933,18 @@ func (s *ContextImpl) acquireShard() { | |||
return nil | |||
} | |||
|
|||
err := backoff.ThrottleRetry(op, policy, common.IsPersistenceTransientError) | |||
// keep retrying except ShardOwnershipLostError or lifecycle context ended |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we update common.IsPersistenceTransientError to exclude shard ownership lost error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IsPersistenceTransientError only return true for Unavailable and ResourceExhausted. ShardOwnershipLostError is already excluded there.
06d92a5
to
fdef158
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we worry about retry too much on resource exhausted error?
This retry policy only apply to acquireShard, and initial backoff is 1s. |
err := backoff.ThrottleRetry(op, policy, common.IsPersistenceTransientError) | ||
// keep retrying except ShardOwnershipLostError or lifecycle context ended | ||
acquireShardRetryable := func(err error) bool { | ||
if s.lifecycleCtx.Err() != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't really have to check lifecycleCtx
here: the first thing op
does is check isValid
, which checks if state >= stopping, and if so returns a shardownershiplost error. state >= stopping iff lifecycleCtx is cancelled.
I'd prefer not doing this additional check since it makes the code more confusing (to me). but it doesn't hurt
What changed?
Keep acquireShard retrying except ShardOwnershipLostError or lifecycleCtx ended.
Why?
Resilient shard should not give up on other any other errors except ShardOwnershipLost error, for example should keep retry on timeout errors.
How did you test it?
Manual test locally with fault injection enabled (slightly modification so it does not return shard ownership error).
Potential risks
No.
Is hotfix candidate?
Yes.