Trampolining fix: Suppress targetWorkerDeploymentVersionChanged on inline WFT path#10217
Conversation
…path
The inline WFT path in RespondWorkflowTaskCompleted (both bypassTaskGeneration
and speculative branches) calls AddWorkflowTaskStartedEvent with
targetDeploymentVersion=nil and targetRevisionNumber=0 because matching is
never consulted. The state machine's case 3 compares buildId+deploymentName
only, so nil target fails that case and the default fires the flag, also
corrupting LastNotifiedTargetVersion with {nil, 0}. For update-driven PINNED
workflows this manifests as a continuous CaN loop.
Strengthen the state machine in two places:
- Case 3 now also requires targetRevisionNumber >= highestSeenRevNumber
before clearing declined/notified state. A stale partition that happens
to match by buildId no longer wipes legitimate notifications.
- Default now breaks out early when targetRevisionNumber <=
highestSeenRevNumber. Stale matching reports (or non-routing-derived
revision-0 calls like the inline path) cannot fire the flag or corrupt
LastNotifiedTargetVersion.
highestSeenRevNumber is computed as max(LastNotifiedTargetVersion.Revision,
DeclinedTargetVersionUpgrade.Revision). No new persistence fields, no proto
changes, no extra RPCs to matching, and no changes to inline call sites
which keep passing nil/0.
Also closes the TODO at line 516 (case 3 revision strengthening).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reproduces the lab scenario where a PINNED workflow on stable routing
hits the inline path in RespondWorkflowTaskCompleted (via a buffered
signal during WFT processing + ReturnNewWorkflowTask=true) and asserts
the resulting WFT-Started event does NOT carry
targetWorkerDeploymentVersionChanged=true.
The test self-verifies that it exercises the inline path by asserting
requestId == "request-from-RespondWorkflowTaskCompleted" on the WFT-Started
event before checking the core flag assertion. Without that, a future
change to the test setup could silently stop hitting the buggy code path
without anyone noticing.
Confirmed:
- Pre-fix (HEAD~1): test fails on the flag assertion (with the inline
marker requestId already present, proving the test reaches the bug).
- Post-fix: test passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cosmetic — gci rewrites the test's doc comment numbered list to use two-space indentation per godoc convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous default-branch strengthening used `<=` against highestSeenRevNumber, which incorrectly suppressed legitimate re-firings on subsequent WFTs that reported the same target/revision. This broke TestPinnedCaN_UpgradeOnCaN_TransientWFT_WithSuggest (all 8 sub-variants) because the transient retry has the same target/revision as the first attempt — `M <= M` is true, so the retry's WFT-Started event no longer carried TargetWorkerDeploymentVersionChanged=true. Switch to strict `<` and pass a -1 revision sentinel from callers that don't consult matching: - workflow_task_state_machine.go default branch: `<=` → `<`. Same-revision re-firings now fire as the original design intended. - respondworkflowtaskcompleted/api.go inline call sites (lines 606, 732): pass `-1` for targetRevisionNumber instead of `0`. This is an explicit "no routing info" sentinel that lives below the range of real matching revisions, so `<` cleanly suppresses inline-path calls without affecting legitimate same-revision re-fires. - workflow_resetter.go: same -1 sentinel for the synthetic reset WFT, for symmetry. This is a synthetic event that never had real routing info either. Verified via the 12-test suite (PR #9895's 8 tests + new TestInlinePath_StableRouting_NoSpuriousFlag + 3 _UpgradeOnCaN_*_WithSuggest variants). All pass after this change, including the previously-failing transient variant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symmetric with the other synthetic-WFT call sites (inline path in
RespondWorkflowTaskCompleted, workflow resetter). The eager-exec path
also creates a WFT-Started event without consulting matching for
routing info.
Today the outer gate at workflow_task_state_machine.go:496 skips the
flag-firing block when behavior is UNSPECIFIED — which is the typical
case for fresh top-level workflow starts. But for eager exec on a
workflow with inherited versioning info (cron next-run, retries,
parent-inherited child runs), the block does run, and passing `0` could
falsely fire the flag on the synthetic WFT-Started event.
The TODO at line 105 ("get build ID from Starter so eager workflows
can be versioned") indicates the team plans to extend eager exec to
versioned workflows. Using `-1` here today is forward-compatible.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| // Strict `<` (not `<=`) so legitimate same-revision re-firings (e.g., transient retries, repeated updates) still fire; inline path uses revision=-1 sentinel to be caught here. | ||
| if targetRevisionNumber < highestSeenRevNumber { |
There was a problem hiding this comment.
@ShahabT / @carlydf - this proved to be quite important since TestPinnedCaN_UpgradeOnCaN_TransientWFT_WithSuggest test started failing.
Basically, that test was forcefully failing a couple of updates and it tests out if transient workflow tasks still have the SDK flag correctly set to true. With the <= check, it was failing since this constraint is basically saying that if we have ever seen this, we drop the toggling of this flag (which is not great)
s.WaitForChannel (inherited from FunctionalTestBase) is marked deprecated in favor of (*TestEnv).WaitForChannel, but Versioning3Suite tests universally use the suite-level method. Add //nolint:staticcheck to match the existing pattern in the file (SA1019 nolints are used in 4+ other places in versioning_3_test.go). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| true, | ||
| nil, | ||
| 0, | ||
| -1, // sentinel: synthetic event, no routing info |
There was a problem hiding this comment.
the idea of sending a -1 here is so that this path does not affect the path that workflow tasks take normally (coming from matching)
a -1 represents a sentinel value since these in-line workflow tasks don't really have up to date routing information
…line WFT path (#10217) ## Summary - Added revision number mechanics to strengthen checks while deciding between clearing/setting the LastNotifiedTargetVersion! ## Problem - The inline WFT path in `RespondWorkflowTaskCompleted` (both `bypassTaskGeneration` at line 596 and speculative at line 722) calls `AddWorkflowTaskStartedEvent` with `targetDeploymentVersion=nil` and `targetRevisionNumber=0` because matching is never consulted on this path. - In AddWorkflowTaskStartedEvent, case 3 compared only `buildId + deploymentName`, so a `nil` target fails that case and the default fires — also corrupting `LastNotifiedTargetVersion` with `{nil, 0}`. For update- or signal-driven PINNED workflows (where the updates and the signals were buffered on workflow task completion) this manifested as a continuous CaN loop on a stable deployment. - A separate but related hazard: even on the normal poll path, a stale matching partition that happens to report `target == effective` by buildId would spuriously hit case 3 and clear legitimate notifications, silently defeating the trampolining-suppression mechanism from #9895. ## Test plan - [x] `TestInlinePath_StableRouting_NoSpuriousFlag` — new integration test that exercises the user-reported scenario: - Sends a buffered signal during a regular WFT, completes with `ReturnNewWorkflowTask=true`. - Asserts the inline follow-up WFT's WFT-Started event has `requestId == "request-from-RespondWorkflowTaskCompleted"` (self-verification that the test actually exercises the inline code path). - Asserts `TargetWorkerDeploymentVersionChanged == false` on the inline WFT-Started event. - Confirmed to **fail** before the state machine change (test commit is on top of the fix commit) and **pass** after. - [x] All 8 related existing tests still pass: - `TestStalePartition_RevisionSuppressesTrampolining` (from #9895) - `TestPinnedCaN_NoAUOnCaN_NoInfiniteLoop` - `TestOverride_SuppressesTargetVersionChangedSignal` - `TestAutoUpgrade_SuppressesTargetVersionChangedSignal` - `TestPinnedCaN_TargetChangesAgain_SignalsTrue` - `TestRemoveOverride_ClearsDeclinedState` - `TestRetryOfDeclinedCaN_SignalsOnNewTarget` - `TestPinnedCaN_RollbackResetsDeclined` 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Touches workflow-task start/versioning decision logic in history service; mistakes could suppress legitimate upgrade notifications or alter trampolining behavior, but changes are localized and covered by a new integration test. > > **Overview** > Fixes a trampolining loop where inline workflow tasks created in `RespondWorkflowTaskCompleted` could incorrectly set `targetWorkerDeploymentVersionChanged` and corrupt `LastNotifiedTargetVersion` despite not consulting matching. > > Inline/eager/synthetic `AddWorkflowTaskStartedEvent` call sites now pass `targetRevisionNumber=-1` as a sentinel, and the workflow-task state machine strengthens its decision logic by tracking the *highest seen* matching revision and refusing to clear/set notification state based on stale (older-revision) reports. > > Adds `TestInlinePath_StableRouting_NoSpuriousFlag` to reproduce the buffered-signal inline-WFT scenario and assert the flag remains false on stable routing. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit baf0f63. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…line WFT path (#10217) ## Summary - Added revision number mechanics to strengthen checks while deciding between clearing/setting the LastNotifiedTargetVersion! ## Problem - The inline WFT path in `RespondWorkflowTaskCompleted` (both `bypassTaskGeneration` at line 596 and speculative at line 722) calls `AddWorkflowTaskStartedEvent` with `targetDeploymentVersion=nil` and `targetRevisionNumber=0` because matching is never consulted on this path. - In AddWorkflowTaskStartedEvent, case 3 compared only `buildId + deploymentName`, so a `nil` target fails that case and the default fires — also corrupting `LastNotifiedTargetVersion` with `{nil, 0}`. For update- or signal-driven PINNED workflows (where the updates and the signals were buffered on workflow task completion) this manifested as a continuous CaN loop on a stable deployment. - A separate but related hazard: even on the normal poll path, a stale matching partition that happens to report `target == effective` by buildId would spuriously hit case 3 and clear legitimate notifications, silently defeating the trampolining-suppression mechanism from #9895. ## Test plan - [x] `TestInlinePath_StableRouting_NoSpuriousFlag` — new integration test that exercises the user-reported scenario: - Sends a buffered signal during a regular WFT, completes with `ReturnNewWorkflowTask=true`. - Asserts the inline follow-up WFT's WFT-Started event has `requestId == "request-from-RespondWorkflowTaskCompleted"` (self-verification that the test actually exercises the inline code path). - Asserts `TargetWorkerDeploymentVersionChanged == false` on the inline WFT-Started event. - Confirmed to **fail** before the state machine change (test commit is on top of the fix commit) and **pass** after. - [x] All 8 related existing tests still pass: - `TestStalePartition_RevisionSuppressesTrampolining` (from #9895) - `TestPinnedCaN_NoAUOnCaN_NoInfiniteLoop` - `TestOverride_SuppressesTargetVersionChangedSignal` - `TestAutoUpgrade_SuppressesTargetVersionChangedSignal` - `TestPinnedCaN_TargetChangesAgain_SignalsTrue` - `TestRemoveOverride_ClearsDeclinedState` - `TestRetryOfDeclinedCaN_SignalsOnNewTarget` - `TestPinnedCaN_RollbackResetsDeclined` 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Touches workflow-task start/versioning decision logic in history service; mistakes could suppress legitimate upgrade notifications or alter trampolining behavior, but changes are localized and covered by a new integration test. > > **Overview** > Fixes a trampolining loop where inline workflow tasks created in `RespondWorkflowTaskCompleted` could incorrectly set `targetWorkerDeploymentVersionChanged` and corrupt `LastNotifiedTargetVersion` despite not consulting matching. > > Inline/eager/synthetic `AddWorkflowTaskStartedEvent` call sites now pass `targetRevisionNumber=-1` as a sentinel, and the workflow-task state machine strengthens its decision logic by tracking the *highest seen* matching revision and refusing to clear/set notification state based on stale (older-revision) reports. > > Adds `TestInlinePath_StableRouting_NoSpuriousFlag` to reproduce the buffered-signal inline-WFT scenario and assert the flag remains false on stable routing. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit baf0f63. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Problem
RespondWorkflowTaskCompleted(bothbypassTaskGenerationat line 596 and speculative at line 722) callsAddWorkflowTaskStartedEventwithtargetDeploymentVersion=nilandtargetRevisionNumber=0because matching is never consulted on this path.buildId + deploymentName, so aniltarget fails that case and the default fires — also corruptingLastNotifiedTargetVersionwith{nil, 0}. For update- or signal-driven PINNED workflows (where the updates and the signals were buffered on workflow task completion) this manifested as a continuous CaN loop on a stable deployment.target == effectiveby buildId would spuriously hit case 3 and clear legitimate notifications, silently defeating the trampolining-suppression mechanism from Fix pinned workflow trampolining via revision-based signal suppression #9895.Test plan
TestInlinePath_StableRouting_NoSpuriousFlag— new integration test that exercises the user-reported scenario:ReturnNewWorkflowTask=true.requestId == "request-from-RespondWorkflowTaskCompleted"(self-verification that the test actually exercises the inline code path).TargetWorkerDeploymentVersionChanged == falseon the inline WFT-Started event.TestStalePartition_RevisionSuppressesTrampolining(from Fix pinned workflow trampolining via revision-based signal suppression #9895)TestPinnedCaN_NoAUOnCaN_NoInfiniteLoopTestOverride_SuppressesTargetVersionChangedSignalTestAutoUpgrade_SuppressesTargetVersionChangedSignalTestPinnedCaN_TargetChangesAgain_SignalsTrueTestRemoveOverride_ClearsDeclinedStateTestRetryOfDeclinedCaN_SignalsOnNewTargetTestPinnedCaN_RollbackResetsDeclined🤖 Generated with Claude Code
Note
Medium Risk
Touches workflow-task start/versioning decision logic in history service; mistakes could suppress legitimate upgrade notifications or alter trampolining behavior, but changes are localized and covered by a new integration test.
Overview
Fixes a trampolining loop where inline workflow tasks created in
RespondWorkflowTaskCompletedcould incorrectly settargetWorkerDeploymentVersionChangedand corruptLastNotifiedTargetVersiondespite not consulting matching.Inline/eager/synthetic
AddWorkflowTaskStartedEventcall sites now passtargetRevisionNumber=-1as a sentinel, and the workflow-task state machine strengthens its decision logic by tracking the highest seen matching revision and refusing to clear/set notification state based on stale (older-revision) reports.Adds
TestInlinePath_StableRouting_NoSpuriousFlagto reproduce the buffered-signal inline-WFT scenario and assert the flag remains false on stable routing.Reviewed by Cursor Bugbot for commit baf0f63. Bugbot is set up for automated code reviews on this repo. Configure here.