Ensure WD Version properly revives if recreated after deletion#9382

Merged
ShahabT merged 12 commits into main from revive-version
Mar 26, 2026

Conversation

Contributor

@ShahabT ShahabT commented Feb 23, 2026

What changed?

  • Ensure the TQs receive and apply the right version data after revive.
  • Made delete propagation always happen serially with respect to other propagations; this ensures all other propagations are cancelled before delete propagation starts.
  • Deprecated the deleted flag in version data and the GC logic around it. We now use the existing forgetVersion path, which immediately removes the version data from the TQ.
  • Ensure version state is reset after revive, in case the recreation happened before workflow close.
  • Workflows now CaN based on the SDK suggestion if no pending Signal or Update is present.
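The new CaN trigger can be modeled as a plain predicate. This is a hedged sketch combining the two clauses quoted later in the review; the function and parameter names are illustrative, not the actual implementation (the real code reads these values from workflow state and `workflow.GetInfo(ctx)`):

```go
package main

import "fmt"

// shouldContinueAsNew models the CaN condition discussed in this PR:
// the workflow may only CaN when no Signal/Update work is pending, and
// either a force-CaN was requested, a state change has finished
// propagating, or the SDK suggests CaN because history grew too large.
func shouldContinueAsNew(
	hasPendingSignals bool,
	processingSignals int,
	allHandlersFinished bool,
	forceCAN bool,
	stateChanged bool,
	asyncPropagationsInProgress int,
	continueAsNewSuggested bool, // e.g. workflow.GetInfo(ctx).GetContinueAsNewSuggested()
) bool {
	// First clause: no pending or in-flight Signal/Update work.
	quiescent := !hasPendingSignals && processingSignals == 0 && allHandlersFinished
	// Second clause: something actually warrants a CaN.
	trigger := forceCAN ||
		(stateChanged && asyncPropagationsInProgress == 0) ||
		continueAsNewSuggested
	return quiescent && trigger
}

func main() {
	// SDK suggests CaN, but a signal is still being processed: no CaN.
	fmt.Println(shouldContinueAsNew(false, 1, true, false, false, 2, true))
	// Quiescent and SDK suggests CaN: CaN proceeds even with in-flight
	// propagations, the point examined in the review discussion.
	fmt.Println(shouldContinueAsNew(false, 0, true, false, false, 2, true))
}
```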

Why?

The version could get stuck in the deleted state, from the TQ's point of view, if revived before the (now deprecated) GC logic cleaned it up.

How did you test it?

  • [ ] built
  • [ ] run locally and tested manually
  • [ ] covered by existing tests
  • [x] added new unit test(s)
  • [x] added new functional test(s)

Potential risks

None

@ShahabT ShahabT requested review from a team as code owners February 23, 2026 02:45
ShahabT and others added 7 commits March 13, 2026 12:00
- Use InDelta instead of Equal for float comparison
- Use s.Require().NoError for error assertions in callbacks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ShahabT ShahabT changed the title from "Reset WD Version state when revived" to "Ensure WD Version properly revives if recreated after deletion" on Mar 25, 2026
}
}()

if c.deploymentVersionRegistered {
Contributor Author

@Shivs11 note that this was a separate obstacle preventing the TQ from being registered after revival if the partition has not been reloaded since the last registration.

Member

I wonder: I know the userData calls below that you mentioned are all in-mem calls, but if a partition has a ton of pollers, would that be a lot of load on our matching engine pod?

Contributor Author

The user data is in memory, and accessing it should not be expensive for each poll.

Contributor Author

Actually, I don't think every poll needs to wait for the lock. I added a user data check before the lock.
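The "check before the lock" optimization mentioned here can be sketched as a classic double-checked fast path. This is a hypothetical model (the types, field names, and `ensureRegistered` helper are invented for illustration; the real matching-engine code differs): pollers take a cheap atomic read first, and only the first one pays for the lock and the registration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// versionRegistry models a partition's in-memory user data guarded by a lock.
// registered is read atomically on the fast path so most polls never block.
type versionRegistry struct {
	mu            sync.Mutex
	registered    atomic.Bool
	registrations int // how many times the slow path actually ran
}

// ensureRegistered is the double-checked pattern: cheap atomic read first,
// then lock and re-check before doing the expensive registration once.
func (r *versionRegistry) ensureRegistered(register func()) {
	if r.registered.Load() { // fast path: no lock for the common case
		return
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.registered.Load() { // re-check: another poller may have won the race
		return
	}
	register()
	r.registrations++
	r.registered.Store(true)
}

func main() {
	r := &versionRegistry{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // 100 concurrent "polls"
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.ensureRegistered(func() {})
		}()
	}
	wg.Wait()
	fmt.Println(r.registrations) // registration ran exactly once
}
```

The re-check under the lock is what makes this safe: without it, two pollers passing the fast path simultaneously would both run the registration.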

Member

@Shivs11 Shivs11 left a comment

A few comments/thinking points.

asyncPropagationsInProgress int
// When true, all the ongoing propagations should cancel themselves
// Deprecated. With version data revision number, we don't need to cancel propagations anymore.
// Used when delete happens while there are ungoing propagations.
Member

nit: on-going

Comment on lines +281 to +282
// And there is a force CaN or a propagated state change or history got too large
(d.forceCAN || (d.stateChanged && d.asyncPropagationsInProgress == 0) || workflow.GetInfo(ctx).GetContinueAsNewSuggested()))
Member

I'm confused here: why did we never have this before? workflow.GetInfo(ctx).GetContinueAsNewSuggested()

Moreover, aren't we always bound to CaN before we actually hit this limit? Or did you get this idea from the recent investigation, where we noticed we had 10,000 or so signals but sadly were not CaN'ing?

Member

one other point:

When CaN triggers via this path, d.stateChanged might be false and asyncPropagationsInProgress might be > 0. The condition allows CaN even with in-flight propagations when history is too large. Is that intentional? I wonder if this could leave our poor TQ in a really bad state.

Contributor Author

The condition allows CaN even with in-flight propagations

I don't think that's the case; the previous clause is still in effect:
(!d.signalHandler.signalSelector.HasPending() && d.signalHandler.processingSignals == 0 && workflow.AllHandlersFinished(ctx) &&

}
} else if v := req.GetForgetVersion(); v != nil {
if idx := worker_versioning.FindDeploymentVersion(deploymentData, v); idx >= 0 {
// Go through the new and old deployment data format for this deployment and remove the version if present.
Member

In the helper function, right now, we return early if workerDeploymentData == nil. In that case, we won't go through the old format, no? (Say the version was present in the deploymentsData format only.)

AFAIK, you are trying to merge these two deletion paths (which is great), but this could be something you wanna think about.

Contributor Author

Good point, I'm going to fix that case.
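The fix being agreed on here, checking both formats instead of returning early, can be sketched like this. The types and field names below are hypothetical stand-ins (the real deployment-data protos and the FindDeploymentVersion helper differ); the point is only that removal runs against both representations:

```go
package main

import "fmt"

// deploymentData models task-queue user data that may carry a version in
// either the new (workerDeploymentData-style) or the old (legacy) format.
type deploymentData struct {
	newFormat []string // new-style version entries
	oldFormat []string // legacy entries for the same deployment
}

// forgetVersion removes the version from BOTH formats. The bug under
// discussion: returning early when newFormat is nil would skip oldFormat.
func forgetVersion(d *deploymentData, version string) {
	d.newFormat = removeString(d.newFormat, version)
	d.oldFormat = removeString(d.oldFormat, version) // still runs if newFormat was nil
}

// removeString filters v out of s in place.
func removeString(s []string, v string) []string {
	out := s[:0]
	for _, x := range s {
		if x != v {
			out = append(out, x)
		}
	}
	return out
}

func main() {
	// Version present only in the old format; new format is nil.
	d := &deploymentData{oldFormat: []string{"v1", "v2"}}
	forgetVersion(d, "v1")
	fmt.Println(d.oldFormat) // [v2]
}
```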

ShahabT added 2 commits March 26, 2026 14:13
…ersion

# Conflicts:
#	service/worker/workerdeployment/version_workflow_test.go
#	service/worker/workerdeployment/workflow_test.go
@ShahabT ShahabT requested review from a team as code owners March 26, 2026 21:18
@ShahabT ShahabT merged commit d344f08 into main Mar 26, 2026
46 checks passed
@ShahabT ShahabT deleted the revive-version branch March 26, 2026 23:01
chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026
…ralio#9382)

## What changed?
- Ensure the TQs receive and apply the right version data after revive.
- Made delete propagation to always happen serial to other propagations.
It ensures all other propagations are cancelled before starting delete
propagation.
- Deprecate the `deleted` flag in version data and the GC logic around
it. Now we use the good old forgetVersion path which immediately removes
the version data from TQ.
- Ensure version state is reset after revive, in case the recreation
happened before workflow close.
- Also, now workflows CaN based on SDK suggestion if no pending Signal
or Update is present.

## Why?
The version could stuck at deleted state from TQ POV if revived before
the (now deprecated) GC logic cleans it up.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [x] added new functional test(s)

## Potential risks
None

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>