improve replication/handover check logging and readability #9844
Merged
thestephenstanton merged 10 commits into main on Apr 8, 2026
Conversation
```go
shardID:      localShard.GetShardId(),
laggingTasks: laggingTasks,
timeLag:      timeLag,
isReady:      fullyCaughtUp || (passedRequiredMinimum && withinLagTolerance),
```
Contributor
For a future PR: fullyCaughtUp guarantees withinLagTolerance is true, but I don't want the scope of this PR to get too huge.
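For reference, that implication lets the expression factor. A minimal sketch of the equivalence, using the names from the diff above:

```go
// As written in this PR.
func isReadyCurrent(fullyCaughtUp, passedRequiredMinimum, withinLagTolerance bool) bool {
	return fullyCaughtUp || (passedRequiredMinimum && withinLagTolerance)
}

// If fullyCaughtUp always implies withinLagTolerance, distributing the OR
// over the AND gives this equivalent, flatter form.
func isReadyFactored(fullyCaughtUp, passedRequiredMinimum, withinLagTolerance bool) bool {
	return (fullyCaughtUp || passedRequiredMinimum) && withinLagTolerance
}
```

When fullyCaughtUp is true the implication forces withinLagTolerance to be true, so both forms return true; when it is false, both reduce to passedRequiredMinimum && withinLagTolerance.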
```go
isReady := notReadyShardCount == 0

if !isReady {
	a.logger.Info("Wait catchup not ready",
```
Contributor
(Future PR problem) This feels like it wants a "TotalTimeSpentWaiting" kind of tag, since we expect to hit this about once per second.
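A rough sketch of what such a tag could look like, assuming the activity records when it began waiting; startedWaiting and the tag constructors here are illustrative, not the actual API:

```go
// Illustrative only: record when waiting began, then tag each not-ready
// log line with the elapsed total.
startedWaiting := time.Now()
for !isReady {
	// ... run the catchup check ...
	a.logger.Info("Wait catchup not ready",
		tag.NewDurationTag("TotalTimeSpentWaiting", time.Since(startedWaiting)), // hypothetical tag
	)
	time.Sleep(time.Second) // matches the roughly once-per-second cadence noted above
}
```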
…r first main loop
25a051e to 536018c
```diff
 // Check if remote cluster has caught up on all shards on replication tasks from target replica.
-func (a *activities) checkReplicationOnRemoteCluster(ctx context.Context, waitRequest waitCatchupRequest, targetAckIDOnShard map[int32]int64) (bool, error) {
+func (a *activities) checkReplicationOnRemoteCluster(ctx context.Context, waitRequest waitCatchupRequest, requiredMinTaskIDPerShard map[int32]int64) (bool, error) {
```
Contributor
Future PR problem: Figure out what's different between this and checkHandoverOnce, and deduplicate these functions
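One possible shape for that deduplication: a shared per-shard loop that both checks delegate to. Everything below is hypothetical, sketched only to show the idea:

```go
// Hypothetical: both functions iterate shards, decide per-shard readiness,
// and count the stragglers, so that loop could live in one helper.
type shardReadyFn func(shardID int32) (bool, error)

func (a *activities) checkShardsOnce(shardIDs []int32, isShardReady shardReadyFn) (bool, error) {
	notReadyShardCount := 0
	for _, shardID := range shardIDs {
		ready, err := isShardReady(shardID)
		if err != nil {
			return false, err
		}
		if !ready {
			notReadyShardCount++
		}
	}
	// The per-invocation summary log added in this PR would also fit here.
	return notReadyShardCount == 0, nil
}
```

Replication and handover would then differ only in the readiness closure each passes in.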
```go
	tag.Int64("ActualLaggingTasks", shard.MaxReplicationTaskId-clusterInfo.AckedTaskId),
)
// If the target acked task ID is NOT found, the shard is considered ready, as the remote ack level
// is assumed to be more up-to-date than the active ack level.
```
Contributor
Future PR problem: This comment is pretty vague. Why is this ok? How does "more up to date" imply that the task ID might be missing? I have some guesses as to why, but we should write it authoritatively here.
temporal-nick approved these changes on Apr 7, 2026
yux0 approved these changes on Apr 7, 2026
536018c to 40b84bb
Continuing from this PR: #9787
What changed?
Replaced the log dump that happened on the first non-ready shard with a single summary log per invocation that includes total, ready, and not-ready shard counts and the slowest shard by task lag and time lag, making it easier to diagnose stalled replication and handover.
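For illustration, the summary entry might carry fields along these lines (the tag names and slowestShard bookkeeping are illustrative, not the PR's exact code):

```go
// Illustrative sketch of a per-invocation summary log.
a.logger.Info("Wait catchup not ready",
	tag.NewInt("TotalShardCount", totalShardCount),
	tag.NewInt("ReadyShardCount", totalShardCount-notReadyShardCount),
	tag.NewInt("NotReadyShardCount", notReadyShardCount),
	tag.NewInt32("SlowestShardID", slowestShard.shardID),
	tag.NewInt64("SlowestShardLaggingTasks", slowestShard.laggingTasks),
	tag.NewDurationTag("SlowestShardTimeLag", slowestShard.timeLag),
)
```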
Refactored checkReplicationOnce, checkHandoverOnce, and checkReplicationOnRemoteCluster to use guard clauses, eliminating the nested logic and the logged-guard pattern, and renamed some variables, all for improved readability.
Also fixed a subtle bug: if the remote cluster shard progress lookup (the map inside the shard loop) had no data for a shard and we had already logged previous non-ready shards, we would not return an error.
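As a generic before/after illustration of the guard-clause shape (not the PR's actual code):

```go
// Before: the result is decided inside nested branches.
func checkNested(ok, caughtUp bool) bool {
	if ok {
		if caughtUp {
			return true
		}
	}
	return false
}

// After: guard clauses exit early, so the happy path reads top to bottom.
func checkGuarded(ok, caughtUp bool) bool {
	if !ok {
		return false
	}
	if !caughtUp {
		return false
	}
	return true
}
```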
Why?
Log change: The old log only captured one shard's state at a time, making it hard to understand overall progress during a stalled catchup or handover. A single summary with counts and the slowest shards gives you the full picture in one entry.
Refactor: The nested if/logged-guard pattern was harder to follow than it needed to be; guard clauses flatten the logic instead of nesting it deeply. The renamed variables also make intent clearer.
How did you test it?
Potential risks
Anyone who depends on the previous per-shard log fields will need to adjust, since those entries are replaced by the single summary log.