Fail page request if query mutable state returns missing events#9389

Merged

spkane31 merged 36 commits into main from spk/update-premature-eos

Mar 10, 2026

Contributor

spkane31 commented Feb 24, 2026 (edited)

What changed?

Fix a race condition in GetWorkflowExecutionHistory that caused SDK workers to receive incomplete history and fail with "premature end of stream". On the last page of a paginated GetWorkflowExecutionHistory response, re-query mutable state to detect events that were committed to the DB between the first and last page fetches. If freshNextEventId > continuationToken.NextEventId, the gap is fetched from persistence and appended to the response before transient/speculative events are added. The continuation token is updated with the fresh boundary so appendTransientTasks validates against the correct NextEventId. If the re-query itself fails, the request returns an error so the client retries.
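The last-page flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `fakeStore`, `lastPage`, and the `requery` callback are hypothetical stand-ins for persistence and the mutable-state lookup, and events are reduced to their IDs.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a stand-in for a persisted history event; only the ID matters here.
type Event struct{ ID int64 }

// fakeStore is a hypothetical in-memory persistence layer (not Temporal's).
type fakeStore struct{ events []Event }

var errUnavailable = errors.New("mutable state unavailable")

// nextEventID returns the ID the next committed event would receive.
func (s *fakeStore) nextEventID() int64 { return int64(len(s.events)) + 1 }

// readRange returns events with IDs in [from, to).
func (s *fakeStore) readRange(from, to int64) []Event {
	var out []Event
	for _, e := range s.events {
		if e.ID >= from && e.ID < to {
			out = append(out, e)
		}
	}
	return out
}

// lastPage assembles the final page. tokenNext is the NextEventId captured on
// page 1; requery is the fresh lookup, and its failure fails the whole request.
func lastPage(s *fakeStore, from, tokenNext int64, requery func() (int64, error)) ([]Event, int64, error) {
	page := s.readRange(from, tokenNext)
	fresh, err := requery()
	if err != nil {
		return nil, 0, fmt.Errorf("re-query mutable state: %w", err)
	}
	if fresh > tokenNext {
		// Close the gap, then move the boundary forward for later validation.
		page = append(page, s.readRange(tokenNext, fresh)...)
		tokenNext = fresh
	}
	return page, tokenNext, nil
}

func main() {
	s := &fakeStore{}
	for id := int64(1); id <= 9; id++ {
		s.events = append(s.events, Event{ID: id})
	}
	page, next, err := lastPage(s, 6, 8, func() (int64, error) { return s.nextEventID(), nil })
	fmt.Println(len(page), next, err)

	_, _, err = lastPage(s, 6, 8, func() (int64, error) { return 0, errUnavailable })
	fmt.Println(err)
}
```

With events 1 through 9 committed and a stale boundary of 8, the sketch returns events 6 through 9 and advances the boundary to 10; when the re-query fails, nothing is returned and the caller sees the error.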

Also adds a nil check in ValidateTransientWorkflowTaskEvents, preventing a possible nil-pointer dereference.
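For illustration, a nil guard of this kind might look like the sketch below. The types and the function `validateTransientEvents` are simplified, hypothetical stand-ins; the real signature of `ValidateTransientWorkflowTaskEvents` is not shown in this conversation, so the contiguity check here is an assumption about what such validation does.

```go
package main

import "fmt"

// Event and TransientInfo are simplified, hypothetical stand-ins for the real types.
type Event struct{ ID int64 }

type TransientInfo struct{ HistorySuffix []*Event }

// validateTransientEvents checks that transient events continue the history
// contiguously from nextEventID. A nil info means there is nothing to validate.
func validateTransientEvents(nextEventID int64, info *TransientInfo) error {
	if info == nil {
		return nil // guard: reading info.HistorySuffix below would otherwise panic
	}
	for i, e := range info.HistorySuffix {
		want := nextEventID + int64(i)
		if e == nil || e.ID != want {
			return fmt.Errorf("transient event %d: want ID %d", i, want)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateTransientEvents(8, nil))

	ok := &TransientInfo{HistorySuffix: []*Event{{ID: 8}, {ID: 9}}}
	fmt.Println(validateTransientEvents(8, ok))

	stale := &TransientInfo{HistorySuffix: []*Event{{ID: 6}}}
	fmt.Println(validateTransientEvents(8, stale))
}
```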

Why?

A speculative WFT is converted to normal (e.g., by an incoming signal), committing 1–2 new events. The continuation token from page 1 points to NextEventId=8; the DB range [6, 8) on page 2 returns only events 6–7, missing the newly-committed events 8 and 9. appendTransientTasks finds no transient tasks (speculative was committed), so the assembled history is missing 2 events.
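The arithmetic of the race can be replayed with a small simulation. This is only a sketch: events are reduced to IDs, `readRange` and `missing` are hypothetical helpers, and the assumption that page 1 returned events 1 through 5 is inferred from page 2 starting at 6.

```go
package main

import "fmt"

// readRange returns committed event IDs in [from, to), like a bounded DB read.
// committedNext is the first ID that has not been committed yet.
func readRange(from, to, committedNext int64) []int64 {
	var ids []int64
	for id := from; id < to && id < committedNext; id++ {
		ids = append(ids, id)
	}
	return ids
}

// missing lists committed IDs that the assembled history does not reach.
func missing(assembled []int64, committedNext int64) []int64 {
	next := int64(1)
	if n := len(assembled); n > 0 {
		next = assembled[n-1] + 1
	}
	return readRange(next, committedNext, committedNext)
}

func main() {
	// Page 1: events 1-5 are returned and the token records NextEventId=8.
	tokenNext := int64(8)
	history := readRange(1, 6, 8)

	// A signal converts the speculative WFT to normal, committing events 8 and 9.
	committedNext := int64(10)

	// Page 2 still reads only up to the stale token boundary.
	history = append(history, readRange(6, tokenNext, committedNext)...)

	fmt.Println(history)                         // [1 2 3 4 5 6 7]
	fmt.Println(missing(history, committedNext)) // [8 9]
}
```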

How did you test it?

Potential risks

The re-query on the last history page adds a GetMutableState RPC to each paginated GetWorkflowExecutionHistory call. It runs on final-page responses only, and the existing path already made this call inside appendTransientTasks, so the net overhead is one additional call, incurred only when a gap is detected.
Returning an error when the fresh mutable-state re-query fails changes the previous behavior of silently continuing. Clients will retry, which is correct, but retry storms are possible if persistence is consistently unavailable; the client's existing backoff mitigates this.
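To make the retry-storm concern concrete, here is a minimal sketch of capped exponential backoff on the client side. It is not the SDK's retry policy: `retryWithBackoff`, its parameters, and `errTransient` are hypothetical, and a production policy would normally add jitter as well.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTransient = errors.New("mutable state re-query failed")

// retryWithBackoff runs op until it succeeds or attempts are exhausted. The delay
// doubles after each failure, capped at maxMillis. sleep is injected for testing.
func retryWithBackoff(op func() error, attempts int, baseMillis, maxMillis int64, sleep func(ms int64)) error {
	if attempts < 1 {
		attempts = 1
	}
	delay := baseMillis
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			sleep(delay)
			delay *= 2
			if delay > maxMillis {
				delay = maxMillis
			}
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	op := func() error {
		calls++
		if calls < 3 {
			return errTransient // first two page requests fail
		}
		return nil
	}
	sleep := func(ms int64) { time.Sleep(time.Duration(ms) * time.Millisecond) }

	err := retryWithBackoff(op, 5, 10, 1000, sleep)
	fmt.Println(err, calls)
}
```

Because the delay grows and is capped, a persistently failing re-query produces a bounded request rate per client rather than a tight loop.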

spkane31 added 8 commits

February 23, 2026 17:15


          Race condition fix for update workflow

e793050


          Merge branch 'main' of github.com:temporalio/temporal into spk/update-premature-eos

fb82558


          use extra logging

6e0b11a


          adding logging around shard changes, ms clearing, validation around transient wft being dropped, and to appendtransienttasks

8d52ff2


          adding more logging events

f857103


          updating logging

0ae3626


          confirming seans changes work

f9c9b18


          push the docker build for this branch only

9a0e8b9

semgrep-managed-scans bot reviewed

.github/workflows/docker-build-manual.yml Outdated


          changes to build

fbad037

semgrep-managed-scans bot reviewed

.github/actions/build-docker-images/action.yml Outdated

spkane31 added 10 commits

February 27, 2026 09:24


          adding context around dropped tasks

42039c3


          more logging

6a70894


          adding better logging

6fbc7ae


          Merge branch 'main' of github.com:temporalio/temporal into spk/update-premature-eos

85bf11e


          adding debug logging around update

c3a69a6


          logging fix and also test update

1bd9e93


          implement test to repro

1befdf5


          fail request if query fails

16f6618


          add logs to show exact wft failures

c8699dc


          adding metrics on worker and server side

f28690e

spkane31 changed the title ~~[DO NOT MERGE] Race condition fix for update workflow~~ Fail page request if query mutable state returns missing events

spkane31 added 6 commits

March 7, 2026 19:50


          removing logging

59cf74c


          cleanup

d525f84


          more cleanup


          keep only the single test that will fail if this behavior is removed

d572847


          bit more cleanup

3c3a1f9


          pull in main

6a2255a

spkane31 marked this pull request as ready for review

March 8, 2026 03:35

spkane31 requested a review from a team as a code owner

March 8, 2026 03:35

spkane31 requested a review from a team as a code owner

March 8, 2026 03:35

spkane31 added 6 commits

March 8, 2026 09:49


          linters

8c837c9


          linters again

3fb0095


          memory monitor change


          dont use math.MaxInt, use max page size

fa73cfc


          fixing page request size

4dd4493


          Merge branch 'main' of github.com:temporalio/temporal into spk/update-premature-eos

1a3b8ef

yycptt approved these changes

Member

yycptt left a comment

Do you plan to have a separate PR for improving the observability into recorded premature EOS workflow task failures?

service/history/api/getworkflowexecutionhistory/api.go Outdated

@@ -50,6 +50,7 @@ func appendTransientTasks(
	// Check this FIRST before doing any work
	clientName, _ := headers.GetClientNameAndVersion(ctx)
	if clientName == headers.ClientNameCLI || clientName == headers.ClientNameUI {

Member

yycptt Mar 10, 2026

Technically that's a breaking change, since we now start returning non-durable events for normal get-history calls from the SDK.

I don't know how many people really care, but shall we have a feature flag controlling whether transient/speculative events are returned when they are not part of the continuation token? We can default it to true to fix the bug, and turn it off if any customer complains.

I feel we need a better story here...
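A flag of the kind suggested here could gate the append step along these lines. This is a sketch under assumptions: `BoolSetting`, `assemble`, and the `inToken` parameter are hypothetical names, events are reduced to IDs, and nothing here reflects the dynamic config actually added later in this PR.

```go
package main

import "fmt"

// BoolSetting mimics a dynamically reloadable boolean; the name is hypothetical.
type BoolSetting func() bool

// assemble returns the events for a page. Transient events that were not
// announced by the continuation token are appended only when the flag allows it.
func assemble(durable, transient []int64, inToken bool, returnUnannounced BoolSetting) []int64 {
	out := append([]int64(nil), durable...)
	if len(transient) == 0 {
		return out
	}
	if inToken || returnUnannounced() {
		out = append(out, transient...)
	}
	return out
}

func main() {
	on := BoolSetting(func() bool { return true })   // default: fix enabled
	off := BoolSetting(func() bool { return false }) // escape hatch

	fmt.Println(assemble([]int64{6, 7}, []int64{8, 9}, false, on))  // [6 7 8 9]
	fmt.Println(assemble([]int64{6, 7}, []int64{8, 9}, false, off)) // [6 7]
}
```

Reading the flag at call time, rather than caching it, is what lets an operator turn the behavior off without a redeploy.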

service/history/api/getworkflowexecutionhistory/api.go Outdated

service/history/api/getworkflowexecutionhistory/api.go Outdated

service/history/api/getworkflowexecutionhistory/api.go Outdated

service/history/api/get_history_util.go Outdated

service/history/api/getworkflowexecutionhistory/api.go

+              	// fetchGapEvents fetches events in [fromEventID, toEventID) from persistence and appends
+              	// them to the current response (history or historyBlob). Used to close gaps that form
+              	// when events are committed to DB between paginated GetWorkflowExecutionHistory calls.
+              	fetchGapEvents := func(fromEventID, toEventID int64, branchToken []byte) error {

Member

yycptt Mar 10, 2026

well, I guess technically this will cause the max page size specified in the request to be exceeded, but probably not a big deal... we probably should track this somewhere.
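One way to track this, sketched under assumptions: count how often, and by how much, the appended gap pushes a page past the requested size. `pageStats` and `appendGap` are hypothetical, page size is measured in events for simplicity, and a real server would emit a metric rather than bump a struct field.

```go
package main

import "fmt"

// pageStats is a hypothetical counter set; a real server would emit metrics instead.
type pageStats struct {
	oversizedPages int
	extraEvents    int
}

// appendGap appends gap events to a page and records by how much the result
// exceeds maxPageSize, so the overflow is visible rather than silent.
func appendGap(page, gap []int64, maxPageSize int, stats *pageStats) []int64 {
	page = append(page, gap...)
	if over := len(page) - maxPageSize; maxPageSize > 0 && over > 0 {
		stats.oversizedPages++
		stats.extraEvents += over
	}
	return page
}

func main() {
	stats := &pageStats{}
	page := appendGap([]int64{6, 7}, []int64{8, 9}, 2, stats)
	fmt.Println(page, stats.oversizedPages, stats.extraEvents) // [6 7 8 9] 1 2
}
```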

service/history/api/getworkflowexecutionhistory/api.go Outdated

Member

yycptt Mar 10, 2026

I don't really follow the change you made here in a previous PR. Maybe we should find some time to chat, I think I am missing something here. :)

service/history/api/getworkflowexecutionhistory/api.go

+              				// those events are committed to DB with IDs < continuationToken.NextEventId but were excluded
+              				// because the DB fetch was capped at the original boundary. Fetch the gap now, then update
+              				// the nextEventID boundary so appendTransientTasks validates against the correct ID.
+              				_, _, _, freshNextEventID, freshIsRunning, freshVersionHistoryItem, freshVersionedTransition, freshTransientTasks, freshErr :=

Member

yycptt Mar 10, 2026

Now that I think about it, I feel we can avoid doing this on the last page.

What I am thinking is that on the first page we already know the nextEventID and whether there are transient/speculative events. I think the problem with the implementation on main is that the transient/speculative event info is in memory only, so on later pages it can be gone. If we record it in the continuation token, the problem is solved as well and we don't need to load additional events from the DB.

I don't want to block this PR because we need the fix, but I think this is something we can potentially improve.
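The alternative suggested in this comment can be sketched as a token that carries the transient events seen on the first page. Everything here is an assumption made for illustration: `pageToken` and its fields are hypothetical, JSON stands in for whatever serialization the real token uses, and events are reduced to IDs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pageToken is a hypothetical continuation token; the field names are illustrative.
type pageToken struct {
	NextEventID     int64   `json:"next_event_id"`
	TransientEvents []int64 `json:"transient_events,omitempty"`
}

func encodeToken(t pageToken) ([]byte, error) { return json.Marshal(t) }

func decodeToken(b []byte) (pageToken, error) {
	var t pageToken
	err := json.Unmarshal(b, &t)
	return t, err
}

// finishHistory builds the last page from durable events plus whatever transient
// events the first page recorded in the token, without re-querying mutable state.
func finishHistory(durable []int64, token []byte) ([]int64, error) {
	t, err := decodeToken(token)
	if err != nil {
		return nil, fmt.Errorf("decode token: %w", err)
	}
	out := append([]int64(nil), durable...)
	return append(out, t.TransientEvents...), nil
}

func main() {
	// First page: remember the boundary and the in-memory transient events.
	tok, err := encodeToken(pageToken{NextEventID: 8, TransientEvents: []int64{8, 9}})
	if err != nil {
		panic(err)
	}
	// Last page: the in-memory copy may be gone, but the token still has it.
	history, err := finishHistory([]int64{6, 7}, tok)
	fmt.Println(history, err)
}
```

The trade-off is a larger token in exchange for a response that stays consistent with what page 1 observed, with no extra persistence read on the last page.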


          review

c06b6a6

yiminc reviewed

tests/premature_eos_test.go

Comment on lines +55 to +57

+              	//   4: WorkflowTaskCompleted  (ForceCreateNewWorkflowTask=true)
+              	//   5: WorkflowTaskScheduled  (force-created)
+              	//   6: WorkflowTaskStarted

Member

yiminc Mar 10, 2026

I think these 3 events would be in one batch

tests/premature_eos_test.go Outdated

+              	// This drops the speculative WFT from memory, but the pending update survives in the
+              	// update registry (no ShardFinalizerTimeout=0 override). On the next mutable state
+              	// load the pending update will cause a normal WFT_SCHEDULED to be written as event 8.
+              	// closeShard(s, tv.WorkflowID())

Member

yiminc Mar 10, 2026

should you uncomment this?

Contributor Author

spkane31 Mar 10, 2026

Actually can remove it entirely, found this wasn't caused by shard closure

tests/premature_eos_test.go Outdated

+              	// Simulate shard movement after the signal to clear mutable state again before page 2
+              	// is fetched. This ensures the gap-detection fix works even when the shard is reloaded
+              	// between the signal and the subsequent GetWorkflowExecutionHistory call.
+              	// closeShard(s, tv.WorkflowID())

Member

yiminc Mar 10, 2026

should you uncomment this?

Contributor Author

spkane31 Mar 10, 2026

Removing

spkane31 added 4 commits

March 10, 2026 10:02


          remove commented out code

1ab2a6a


          add a dynamic config and logs for when wft fails with prem eos

71fd43f


          linters

ee991c8


          Merge branch 'main' of github.com:temporalio/temporal into spk/update-premature-eos

ce6c331

spkane31 merged commit b030bb3 into main

46 of 48 checks passed

spkane31 deleted the spk/update-premature-eos branch

March 10, 2026 23:56

birme pushed a commit to eyevinn-osaas/temporal that referenced this pull request


          Fail page request if query mutable state returns missing events (temporalio#9389)

2d89991

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [X] added new functional test(s)

