Skip to content

Trampolining fix: Suppress targetWorkerDeploymentVersionChanged on inline WFT path#10217

Merged
Shivs11 merged 6 commits into
mainfrom
inline-path-flag-fix
May 11, 2026
Merged

Trampolining fix: Suppress targetWorkerDeploymentVersionChanged on inline WFT path#10217
Shivs11 merged 6 commits into
mainfrom
inline-path-flag-fix

Conversation

@Shivs11
Copy link
Copy Markdown
Member

@Shivs11 Shivs11 commented May 11, 2026

Summary

  • Added revision number mechanics to strengthen checks while deciding between clearing/setting the LastNotifiedTargetVersion!

Problem

  • The inline WFT path in RespondWorkflowTaskCompleted (both bypassTaskGeneration at line 596 and speculative at line 722) calls AddWorkflowTaskStartedEvent with targetDeploymentVersion=nil and targetRevisionNumber=0 because matching is never consulted on this path.
  • In AddWorkflowTaskStartedEvent, case 3 compared only buildId + deploymentName, so a nil target fails that case and the default fires — also corrupting LastNotifiedTargetVersion with {nil, 0}. For update- or signal-driven PINNED workflows (where the updates and the signals were buffered on workflow task completion) this manifested as a continuous CaN loop on a stable deployment.
  • A separate but related hazard: even on the normal poll path, a stale matching partition that happens to report target == effective by buildId would spuriously hit case 3 and clear legitimate notifications, silently defeating the trampolining-suppression mechanism from Fix pinned workflow trampolining via revision-based signal suppression #9895.

Test plan

  • TestInlinePath_StableRouting_NoSpuriousFlag — new integration test that exercises the user-reported scenario:
    • Sends a buffered signal during a regular WFT, completes with ReturnNewWorkflowTask=true.
    • Asserts the inline follow-up WFT's WFT-Started event has requestId == "request-from-RespondWorkflowTaskCompleted" (self-verification that the test actually exercises the inline code path).
    • Asserts TargetWorkerDeploymentVersionChanged == false on the inline WFT-Started event.
    • Confirmed to fail before the state machine change (test commit is on top of the fix commit) and pass after.
  • All 8 related existing tests still pass:
    • TestStalePartition_RevisionSuppressesTrampolining (from Fix pinned workflow trampolining via revision-based signal suppression #9895)
    • TestPinnedCaN_NoAUOnCaN_NoInfiniteLoop
    • TestOverride_SuppressesTargetVersionChangedSignal
    • TestAutoUpgrade_SuppressesTargetVersionChangedSignal
    • TestPinnedCaN_TargetChangesAgain_SignalsTrue
    • TestRemoveOverride_ClearsDeclinedState
    • TestRetryOfDeclinedCaN_SignalsOnNewTarget
    • TestPinnedCaN_RollbackResetsDeclined

🤖 Generated with Claude Code


Note

Medium Risk
Touches workflow-task start/versioning decision logic in history service; mistakes could suppress legitimate upgrade notifications or alter trampolining behavior, but changes are localized and covered by a new integration test.

Overview
Fixes a trampolining loop where inline workflow tasks created in RespondWorkflowTaskCompleted could incorrectly set targetWorkerDeploymentVersionChanged and corrupt LastNotifiedTargetVersion despite not consulting matching.

Inline/eager/synthetic AddWorkflowTaskStartedEvent call sites now pass targetRevisionNumber=-1 as a sentinel, and the workflow-task state machine strengthens its decision logic by tracking the highest seen matching revision and refusing to clear/set notification state based on stale (older-revision) reports.

Adds TestInlinePath_StableRouting_NoSpuriousFlag to reproduce the buffered-signal inline-WFT scenario and assert the flag remains false on stable routing.

Reviewed by Cursor Bugbot for commit baf0f63. Bugbot is set up for automated code reviews on this repo. Configure here.

Shivs11 and others added 2 commits May 11, 2026 12:52
…path

The inline WFT path in RespondWorkflowTaskCompleted (both bypassTaskGeneration
and speculative branches) calls AddWorkflowTaskStartedEvent with
targetDeploymentVersion=nil and targetRevisionNumber=0 because matching is
never consulted. The state machine's case 3 compares buildId+deploymentName
only, so nil target fails that case and the default fires the flag, also
corrupting LastNotifiedTargetVersion with {nil, 0}. For update-driven PINNED
workflows this manifests as a continuous CaN loop.

Strengthen the state machine in two places:

  - Case 3 now also requires targetRevisionNumber >= highestSeenRevNumber
    before clearing declined/notified state. A stale partition that happens
    to match by buildId no longer wipes legitimate notifications.

  - Default now breaks out early when targetRevisionNumber <=
    highestSeenRevNumber. Stale matching reports (or non-routing-derived
    revision-0 calls like the inline path) cannot fire the flag or corrupt
    LastNotifiedTargetVersion.

highestSeenRevNumber is computed as max(LastNotifiedTargetVersion.Revision,
DeclinedTargetVersionUpgrade.Revision). No new persistence fields, no proto
changes, no extra RPCs to matching, and no changes to inline call sites
which keep passing nil/0.

Also closes the TODO at line 516 (case 3 revision strengthening).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reproduces the lab scenario where a PINNED workflow on stable routing
hits the inline path in RespondWorkflowTaskCompleted (via a buffered
signal during WFT processing + ReturnNewWorkflowTask=true) and asserts
the resulting WFT-Started event does NOT carry
targetWorkerDeploymentVersionChanged=true.

The test self-verifies that it exercises the inline path by asserting
requestId == "request-from-RespondWorkflowTaskCompleted" on the WFT-Started
event before checking the core flag assertion. Without that, a future
change to the test setup could silently stop hitting the buggy code path
without anyone noticing.

Confirmed:
  - Pre-fix (HEAD~1): test fails on the flag assertion (with the inline
    marker requestId already present, proving the test reaches the bug).
  - Post-fix: test passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Shivs11 Shivs11 requested review from a team as code owners May 11, 2026 17:23
@Shivs11 Shivs11 changed the title Suppress spurious targetWorkerDeploymentVersionChanged on inline WFT path Trampolining fix: Suppress targetWorkerDeploymentVersionChanged on inline WFT path May 11, 2026
Cosmetic — gci rewrites the test's doc comment numbered list
to use two-space indentation per godoc convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Shivs11 and others added 2 commits May 11, 2026 14:50
The previous default-branch strengthening used `<=` against
highestSeenRevNumber, which incorrectly suppressed legitimate re-firings
on subsequent WFTs that reported the same target/revision. This broke
TestPinnedCaN_UpgradeOnCaN_TransientWFT_WithSuggest (all 8 sub-variants)
because the transient retry has the same target/revision as the first
attempt — `M <= M` is true, so the retry's WFT-Started event no longer
carried TargetWorkerDeploymentVersionChanged=true.

Switch to strict `<` and pass a -1 revision sentinel from callers that
don't consult matching:

- workflow_task_state_machine.go default branch: `<=` → `<`. Same-revision
  re-firings now fire as the original design intended.

- respondworkflowtaskcompleted/api.go inline call sites (lines 606, 732):
  pass `-1` for targetRevisionNumber instead of `0`. This is an explicit
  "no routing info" sentinel that lives below the range of real matching
  revisions, so `<` cleanly suppresses inline-path calls without
  affecting legitimate same-revision re-fires.

- workflow_resetter.go: same -1 sentinel for the synthetic reset WFT,
  for symmetry. This is a synthetic event that never had real routing
  info either.

Verified via the 12-test suite (PR #9895's 8 tests + new
TestInlinePath_StableRouting_NoSpuriousFlag + 3 _UpgradeOnCaN_*_WithSuggest
variants). All pass after this change, including the previously-failing
transient variant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symmetric with the other synthetic-WFT call sites (inline path in
RespondWorkflowTaskCompleted, workflow resetter). The eager-exec path
also creates a WFT-Started event without consulting matching for
routing info.

Today the outer gate at workflow_task_state_machine.go:496 skips the
flag-firing block when behavior is UNSPECIFIED — which is the typical
case for fresh top-level workflow starts. But for eager exec on a
workflow with inherited versioning info (cron next-run, retries,
parent-inherited child runs), the block does run, and passing `0` could
falsely fire the flag on the synthetic WFT-Started event.

The TODO at line 105 ("get build ID from Starter so eager workflows
can be versioned") indicates the team plans to extend eager exec to
versioned workflows. Using `-1` here today is forward-compatible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +535 to +536
// Strict `<` (not `<=`) so legitimate same-revision re-firings (e.g., transient retries, repeated updates) still fire; inline path uses revision=-1 sentinel to be caught here.
if targetRevisionNumber < highestSeenRevNumber {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShahabT / @carlydf - this proved to be quite important since TestPinnedCaN_UpgradeOnCaN_TransientWFT_WithSuggest test started failing.

Basically, that test was forcefully failing a couple of updates and it tests out if transient workflow tasks still have the SDK flag correctly set to true. With the <= check, it was failing since this constraint is basically saying that if we have ever seen this, we drop the toggling of this flag (which is not great)

s.WaitForChannel (inherited from FunctionalTestBase) is marked deprecated
in favor of (*TestEnv).WaitForChannel, but Versioning3Suite tests
universally use the suite-level method. Add //nolint:staticcheck to
match the existing pattern in the file (SA1019 nolints are used in 4+
other places in versioning_3_test.go).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
true,
nil,
0,
-1, // sentinel: synthetic event, no routing info
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea of sending a -1 here is so that this path does not affect the path that workflow tasks take normally (coming from matching)

a -1 represents a sentinel value since these in-line workflow tasks don't really have up to date routing information

@Shivs11 Shivs11 enabled auto-merge (squash) May 11, 2026 21:50
@Shivs11 Shivs11 merged commit 246d752 into main May 11, 2026
49 checks passed
@Shivs11 Shivs11 deleted the inline-path-flag-fix branch May 11, 2026 21:55
lilydoar pushed a commit that referenced this pull request May 12, 2026
…line WFT path (#10217)

## Summary
- Added revision number mechanics to strengthen checks while deciding
between clearing/setting the LastNotifiedTargetVersion!

## Problem
- The inline WFT path in `RespondWorkflowTaskCompleted` (both
`bypassTaskGeneration` at line 596 and speculative at line 722) calls
`AddWorkflowTaskStartedEvent` with `targetDeploymentVersion=nil` and
`targetRevisionNumber=0` because matching is never consulted on this
path.
- In AddWorkflowTaskStartedEvent, case 3 compared only `buildId +
deploymentName`, so a `nil` target fails that case and the default fires
— also corrupting `LastNotifiedTargetVersion` with `{nil, 0}`. For
update- or signal-driven PINNED workflows (where the updates and the
signals were buffered on workflow task completion) this manifested as a
continuous CaN loop on a stable deployment.
- A separate but related hazard: even on the normal poll path, a stale
matching partition that happens to report `target == effective` by
buildId would spuriously hit case 3 and clear legitimate notifications,
silently defeating the trampolining-suppression mechanism from #9895.

## Test plan
- [x] `TestInlinePath_StableRouting_NoSpuriousFlag` — new integration
test that exercises the user-reported scenario:
- Sends a buffered signal during a regular WFT, completes with
`ReturnNewWorkflowTask=true`.
- Asserts the inline follow-up WFT's WFT-Started event has `requestId ==
"request-from-RespondWorkflowTaskCompleted"` (self-verification that the
test actually exercises the inline code path).
- Asserts `TargetWorkerDeploymentVersionChanged == false` on the inline
WFT-Started event.
- Confirmed to **fail** before the state machine change (test commit is
on top of the fix commit) and **pass** after.
- [x] All 8 related existing tests still pass:
  - `TestStalePartition_RevisionSuppressesTrampolining` (from #9895)
  - `TestPinnedCaN_NoAUOnCaN_NoInfiniteLoop`
  - `TestOverride_SuppressesTargetVersionChangedSignal`
  - `TestAutoUpgrade_SuppressesTargetVersionChangedSignal`
  - `TestPinnedCaN_TargetChangesAgain_SignalsTrue`
  - `TestRemoveOverride_ClearsDeclinedState`
  - `TestRetryOfDeclinedCaN_SignalsOnNewTarget`
  - `TestPinnedCaN_RollbackResetsDeclined`

🤖 Generated with [Claude Code](https://claude.com/claude-code)


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches workflow-task start/versioning decision logic in history
service; mistakes could suppress legitimate upgrade notifications or
alter trampolining behavior, but changes are localized and covered by a
new integration test.
> 
> **Overview**
> Fixes a trampolining loop where inline workflow tasks created in
`RespondWorkflowTaskCompleted` could incorrectly set
`targetWorkerDeploymentVersionChanged` and corrupt
`LastNotifiedTargetVersion` despite not consulting matching.
> 
> Inline/eager/synthetic `AddWorkflowTaskStartedEvent` call sites now
pass `targetRevisionNumber=-1` as a sentinel, and the workflow-task
state machine strengthens its decision logic by tracking the *highest
seen* matching revision and refusing to clear/set notification state
based on stale (older-revision) reports.
> 
> Adds `TestInlinePath_StableRouting_NoSpuriousFlag` to reproduce the
buffered-signal inline-WFT scenario and assert the flag remains false on
stable routing.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
baf0f63. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@temporal-cicd temporal-cicd Bot mentioned this pull request May 12, 2026
temporal-cicd Bot pushed a commit that referenced this pull request May 12, 2026
…line WFT path (#10217)

## Summary
- Added revision number mechanics to strengthen checks while deciding
between clearing/setting the LastNotifiedTargetVersion!

## Problem
- The inline WFT path in `RespondWorkflowTaskCompleted` (both
`bypassTaskGeneration` at line 596 and speculative at line 722) calls
`AddWorkflowTaskStartedEvent` with `targetDeploymentVersion=nil` and
`targetRevisionNumber=0` because matching is never consulted on this
path.
- In AddWorkflowTaskStartedEvent, case 3 compared only `buildId +
deploymentName`, so a `nil` target fails that case and the default fires
— also corrupting `LastNotifiedTargetVersion` with `{nil, 0}`. For
update- or signal-driven PINNED workflows (where the updates and the
signals were buffered on workflow task completion) this manifested as a
continuous CaN loop on a stable deployment.
- A separate but related hazard: even on the normal poll path, a stale
matching partition that happens to report `target == effective` by
buildId would spuriously hit case 3 and clear legitimate notifications,
silently defeating the trampolining-suppression mechanism from #9895.

## Test plan
- [x] `TestInlinePath_StableRouting_NoSpuriousFlag` — new integration
test that exercises the user-reported scenario:
- Sends a buffered signal during a regular WFT, completes with
`ReturnNewWorkflowTask=true`.
- Asserts the inline follow-up WFT's WFT-Started event has `requestId ==
"request-from-RespondWorkflowTaskCompleted"` (self-verification that the
test actually exercises the inline code path).
- Asserts `TargetWorkerDeploymentVersionChanged == false` on the inline
WFT-Started event.
- Confirmed to **fail** before the state machine change (test commit is
on top of the fix commit) and **pass** after.
- [x] All 8 related existing tests still pass:
  - `TestStalePartition_RevisionSuppressesTrampolining` (from #9895)
  - `TestPinnedCaN_NoAUOnCaN_NoInfiniteLoop`
  - `TestOverride_SuppressesTargetVersionChangedSignal`
  - `TestAutoUpgrade_SuppressesTargetVersionChangedSignal`
  - `TestPinnedCaN_TargetChangesAgain_SignalsTrue`
  - `TestRemoveOverride_ClearsDeclinedState`
  - `TestRetryOfDeclinedCaN_SignalsOnNewTarget`
  - `TestPinnedCaN_RollbackResetsDeclined`

🤖 Generated with [Claude Code](https://claude.com/claude-code)


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches workflow-task start/versioning decision logic in history
service; mistakes could suppress legitimate upgrade notifications or
alter trampolining behavior, but changes are localized and covered by a
new integration test.
> 
> **Overview**
> Fixes a trampolining loop where inline workflow tasks created in
`RespondWorkflowTaskCompleted` could incorrectly set
`targetWorkerDeploymentVersionChanged` and corrupt
`LastNotifiedTargetVersion` despite not consulting matching.
> 
> Inline/eager/synthetic `AddWorkflowTaskStartedEvent` call sites now
pass `targetRevisionNumber=-1` as a sentinel, and the workflow-task
state machine strengthens its decision logic by tracking the *highest
seen* matching revision and refusing to clear/set notification state
based on stale (older-revision) reports.
> 
> Adds `TestInlinePath_StableRouting_NoSpuriousFlag` to reproduce the
buffered-signal inline-WFT scenario and assert the flag remains false on
stable routing.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
baf0f63. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants