Ensure WD Version properly revives if recreated after deletion#9382

Merged
ShahabT merged 12 commits into main from revive-version
Mar 26, 2026

Conversation

Contributor

@ShahabT ShahabT commented Feb 23, 2026

What changed?

  • Ensure the TQs receive and apply the right version data after revive.
  • Made delete propagation always happen serially with respect to other propagations; this ensures all other propagations are cancelled before delete propagation starts.
  • Deprecated the deleted flag in version data and the GC logic around it. We now use the existing forgetVersion path, which immediately removes the version data from the TQ.
  • Ensure version state is reset after revive, in case the recreation happened before workflow close.
  • Workflows now CaN based on the SDK suggestion if no pending Signal or Update is present.
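The new CaN trigger can be modeled as a plain predicate. This is a hedged sketch combining the two clauses quoted later in the review; the function and parameter names are illustrative, not the actual implementation (the real code reads these values from workflow state and `workflow.GetInfo(ctx)`):

```go
package main

import "fmt"

// shouldContinueAsNew models the CaN condition discussed in this PR:
// the workflow may only CaN when no Signal/Update work is pending, and
// either a force-CaN was requested, a state change has finished
// propagating, or the SDK suggests CaN because history grew too large.
func shouldContinueAsNew(
	hasPendingSignals bool,
	processingSignals int,
	allHandlersFinished bool,
	forceCAN bool,
	stateChanged bool,
	asyncPropagationsInProgress int,
	continueAsNewSuggested bool, // e.g. workflow.GetInfo(ctx).GetContinueAsNewSuggested()
) bool {
	// First clause: no pending or in-flight Signal/Update work.
	quiescent := !hasPendingSignals && processingSignals == 0 && allHandlersFinished
	// Second clause: something actually warrants a CaN.
	trigger := forceCAN ||
		(stateChanged && asyncPropagationsInProgress == 0) ||
		continueAsNewSuggested
	return quiescent && trigger
}

func main() {
	// SDK suggests CaN, but a signal is still being processed: no CaN.
	fmt.Println(shouldContinueAsNew(false, 1, true, false, false, 2, true))
	// Quiescent and SDK suggests CaN: CaN proceeds even with in-flight
	// propagations, the point examined in the review discussion.
	fmt.Println(shouldContinueAsNew(false, 0, true, false, false, 2, true))
}
```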

Why?

The version could get stuck in the deleted state, from the TQ's point of view, if revived before the (now deprecated) GC logic cleaned it up.

How did you test it?

  • [ ] built
  • [ ] run locally and tested manually
  • [ ] covered by existing tests
  • [x] added new unit test(s)
  • [x] added new functional test(s)

Potential risks

None

@ShahabT ShahabT requested review from a team as code owners February 23, 2026 02:45
ShahabT and others added 7 commits March 13, 2026 12:00
- Use InDelta instead of Equal for float comparison
- Use s.Require().NoError for error assertions in callbacks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ShahabT ShahabT changed the title from "Reset WD Version state when revived" to "Ensure WD Version properly revives if recreated after deletion" on Mar 25, 2026
}
}()

if c.deploymentVersionRegistered {
Contributor Author

@Shivs11 note that this was a separate obstacle preventing the TQ from being registered after revival if the partition has not been reloaded since the last registration.

Member

I wonder: I know the userData calls below that you mentioned are all in-mem calls, but if a partition has a ton of pollers, would that be a lot of load on our matching engine pod?

Contributor Author

The user data is in memory, and accessing it should not be expensive for each poll.

Contributor Author

Actually, I don't think every poll needs to wait for the lock. I added a user data check before the lock.
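The "check before the lock" optimization mentioned here can be sketched as a classic double-checked fast path. This is a hypothetical model (the types, field names, and `ensureRegistered` helper are invented for illustration; the real matching-engine code differs): pollers take a cheap atomic read first, and only the first one pays for the lock and the registration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// versionRegistry models a partition's in-memory user data guarded by a lock.
// registered is read atomically on the fast path so most polls never block.
type versionRegistry struct {
	mu            sync.Mutex
	registered    atomic.Bool
	registrations int // how many times the slow path actually ran
}

// ensureRegistered is the double-checked pattern: cheap atomic read first,
// then lock and re-check before doing the expensive registration once.
func (r *versionRegistry) ensureRegistered(register func()) {
	if r.registered.Load() { // fast path: no lock for the common case
		return
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.registered.Load() { // re-check: another poller may have won the race
		return
	}
	register()
	r.registrations++
	r.registered.Store(true)
}

func main() {
	r := &versionRegistry{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // 100 concurrent "polls"
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.ensureRegistered(func() {})
		}()
	}
	wg.Wait()
	fmt.Println(r.registrations) // registration ran exactly once
}
```

The re-check under the lock is what makes this safe: without it, two pollers passing the fast path simultaneously would both run the registration.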

Member

@Shivs11 Shivs11 left a comment

A few comments/thinking points.

asyncPropagationsInProgress int
// When true, all the ongoing propagations should cancel themselves
// Deprecated. With version data revision number, we don't need to cancel propagations anymore.
// Used when delete happens while there are ungoing propagations.
Member

nit: on-going

Comment on lines +281 to +282
// And there is a force CaN or a propagated state change or history got too large
(d.forceCAN || (d.stateChanged && d.asyncPropagationsInProgress == 0) || workflow.GetInfo(ctx).GetContinueAsNewSuggested()))
Member

I'm confused here: why did we never have this before? workflow.GetInfo(ctx).GetContinueAsNewSuggested()

Moreover, aren't we always bound to CaN before we actually hit this limit? Or did you get this idea from the recent investigation, where we noticed we had 10,000 or so signals but sadly were not CaN'ing?

Member

one other point:

When CaN triggers via this path, d.stateChanged might be false and asyncPropagationsInProgress might be > 0. The condition allows CaN even with in-flight propagations when history is too large. Is that intentional? I wonder if this could leave our poor TQ in a really bad state.

Contributor Author

The condition allows CaN even with in-flight propagations

I don't think that's the case; the previous clause is still in effect:
(!d.signalHandler.signalSelector.HasPending() && d.signalHandler.processingSignals == 0 && workflow.AllHandlersFinished(ctx) &&

}
} else if v := req.GetForgetVersion(); v != nil {
if idx := worker_versioning.FindDeploymentVersion(deploymentData, v); idx >= 0 {
// Go through the new and old deployment data format for this deployment and remove the version if present.
Member

In the helper function, right now, we return early if workerDeploymentData == nil. In that case, we won't go through the old format, no? (Say the version was present in the deploymentsData format only.)

AFAIK, you are trying to merge these two deletion paths (which is great), but this could be something you wanna think about.

Contributor Author

Good point, I'm going to fix that case.
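The fix being agreed on here, checking both formats instead of returning early, can be sketched like this. The types and field names below are hypothetical stand-ins (the real deployment-data protos and the FindDeploymentVersion helper differ); the point is only that removal runs against both representations:

```go
package main

import "fmt"

// deploymentData models task-queue user data that may carry a version in
// either the new (workerDeploymentData-style) or the old (legacy) format.
type deploymentData struct {
	newFormat []string // new-style version entries
	oldFormat []string // legacy entries for the same deployment
}

// forgetVersion removes the version from BOTH formats. The bug under
// discussion: returning early when newFormat is nil would skip oldFormat.
func forgetVersion(d *deploymentData, version string) {
	d.newFormat = removeString(d.newFormat, version)
	d.oldFormat = removeString(d.oldFormat, version) // still runs if newFormat was nil
}

// removeString filters v out of s in place.
func removeString(s []string, v string) []string {
	out := s[:0]
	for _, x := range s {
		if x != v {
			out = append(out, x)
		}
	}
	return out
}

func main() {
	// Version present only in the old format; new format is nil.
	d := &deploymentData{oldFormat: []string{"v1", "v2"}}
	forgetVersion(d, "v1")
	fmt.Println(d.oldFormat) // [v2]
}
```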

ShahabT added 2 commits March 26, 2026 14:13
…ersion

# Conflicts:
#	service/worker/workerdeployment/version_workflow_test.go
#	service/worker/workerdeployment/workflow_test.go
@ShahabT ShahabT requested review from a team as code owners March 26, 2026 21:18
@ShahabT ShahabT merged commit d344f08 into main Mar 26, 2026
46 checks passed
@ShahabT ShahabT deleted the revive-version branch March 26, 2026 23:01
chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026
…ralio#9382)

## What changed?
- Ensure the TQs receive and apply the right version data after revive.
- Made delete propagation to always happen serial to other propagations.
It ensures all other propagations are cancelled before starting delete
propagation.
- Deprecate the `deleted` flag in version data and the GC logic around
it. Now we use the good old forgetVersion path which immediately removes
the version data from TQ.
- Ensure version state is reset after revive, in case the recreation
happened before workflow close.
- Also, now workflows CaN based on SDK suggestion if no pending Signal
or Update is present.

## Why?
The version could stuck at deleted state from TQ POV if revived before
the (now deprecated) GC logic cleans it up.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [x] added new functional test(s)

## Potential risks
None

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>