Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix resilient shard #3584

Merged
merged 1 commit into from
Nov 15, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
15 changes: 13 additions & 2 deletions service/history/shard/context_impl.go
Original file line number Diff line number Diff line change
Expand Up @@ -1869,7 +1869,7 @@ func (s *ContextImpl) acquireShard() {
//
// We stop retrying on any of:
// 1. We succeed in acquiring the rangeid lock.
// 2. We get any error other than transient errors.
// 2. We get ShardOwnershipLostError or lifecycleCtx ended.
// 3. The state changes to Stopping or Stopped.
//
// If the shard controller sees that service resolver has assigned ownership to someone
Expand Down Expand Up @@ -1933,7 +1933,18 @@ func (s *ContextImpl) acquireShard() {
return nil
}

err := backoff.ThrottleRetry(op, policy, common.IsPersistenceTransientError)
// keep retrying except ShardOwnershipLostError or lifecycle context ended
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update common.IsPersistenceTransientError to exclude shard ownership lost error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsPersistenceTransientError only return true for Unavailable and ResourceExhausted. ShardOwnershipLostError is already excluded there.

acquireShardRetryable := func(err error) bool {
if s.lifecycleCtx.Err() != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't really have to check lifecycleCtx here: the first thing op does is check isValid, which checks if state >= stopping, and if so returns a shardownershiplost error. state >= stopping iff lifecycleCtx is cancelled.

I'd prefer not doing this additional check since it makes the code more confusing (to me). but it doesn't hurt

return false
}
switch err.(type) {
case *persistence.ShardOwnershipLostError:
return false
}
return true
}
err := backoff.ThrottleRetry(op, policy, acquireShardRetryable)
if err != nil {
// We got an unretryable error (perhaps context cancelled or ShardOwnershipLostError).
s.contextTaggedLogger.Error("Couldn't acquire shard", tag.Error(err))
Expand Down