Fix resilient shard #3584

yiminc · 2022-11-13T07:07:31Z

What changed?
Keep acquireShard retrying on every error except ShardOwnershipLostError, or until lifecycleCtx has ended.

Why?
A resilient shard should not give up on any error other than ShardOwnershipLostError. For example, it should keep retrying on timeout errors.

How did you test it?
Manually tested locally with fault injection enabled (slightly modified so that it does not return a shard ownership lost error).

Potential risks
No.

Is hotfix candidate?
Yes.

yux0 · 2022-11-14T17:21:17Z

service/history/shard/context_impl.go

@@ -1933,7 +1933,18 @@ func (s *ContextImpl) acquireShard() {
 		return nil
 	}

-	err := backoff.ThrottleRetry(op, policy, common.IsPersistenceTransientError)
+	// keep retrying except ShardOwnershipLostError or lifecycle context ended


Should we update common.IsPersistenceTransientError to exclude shard ownership lost error?

IsPersistenceTransientError only returns true for Unavailable and ResourceExhausted. ShardOwnershipLostError is already excluded there.

yux0

Should we worry about retrying too much on ResourceExhausted errors?

yiminc · 2022-11-15T01:40:37Z

This retry policy only applies to acquireShard, and the initial backoff is 1s.

dnr · 2022-11-24T09:39:43Z

service/history/shard/context_impl.go

-	err := backoff.ThrottleRetry(op, policy, common.IsPersistenceTransientError)
+	// keep retrying except ShardOwnershipLostError or lifecycle context ended
+	acquireShardRetryable := func(err error) bool {
+		if s.lifecycleCtx.Err() != nil {


You don't really have to check lifecycleCtx here: the first thing op does is check isValid, which checks whether state >= stopping and, if so, returns a ShardOwnershipLostError. state >= stopping iff lifecycleCtx is cancelled.

I'd prefer not to do this additional check, since it makes the code more confusing (to me), but it doesn't hurt.

yiminc requested a review from a team as a code owner November 13, 2022 07:07

yux0 reviewed Nov 14, 2022

Fix resilient shard

fdef158

yiminc force-pushed the fix_resilient_shard branch from 06d92a5 to fdef158 Compare November 14, 2022 17:25

yux0 approved these changes Nov 14, 2022

yiminc merged commit 76b58a7 into temporalio:master Nov 15, 2022

yiminc added the release/1.18.5 label Nov 15, 2022

alexshtin pushed a commit that referenced this pull request Nov 15, 2022

Fix resilient shard (#3584)

077c5e0

dnr reviewed Nov 24, 2022
