Fix TestWorkflowStartConflict flaky test by spkane31 · Pull Request #9776 · temporalio/temporal

spkane31 · 2026-04-01T22:08:25Z

What changed?

Fix the flaky test TestUpdateWithStartSuite/TestWorkflowStartConflict/workflow_id_conflict_policy_fail:_use-existing by handling both possible orderings of the race condition it tests.

Why?

The test injects a hook (UpdateWithStartInBetweenLockAndStart) that fires a concurrent StartWorkflowExecution between the update-with-start lock acquisition and start attempt to simulate a race condition. The test assumes the update always lands in a second speculative WFT:

WFT #1 polled with no messages → completed (empty)
UWS retry creates speculative WFT #2 → polled with update → accepted/completed
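As a rough illustration of the hook mechanism described above, here is a toy model in Go. It is not the Temporal implementation: the `store` type, the `betweenLockAndStart` field, `errUnavailable`, and the retry helper are all hypothetical names invented for this sketch, and the real lock, history client, and workflow task machinery are omitted. It only shows how a hook that fires between "lock acquired" and "start attempted" forces the conflict, and how an immediate retry then finds the existing workflow.

```go
package main

import "errors"

// errUnavailable is a hypothetical stand-in for the retryable Unavailable error.
var errUnavailable = errors.New("unavailable: workflow started concurrently")

// store is a toy model of one workflow ID; it is not a Temporal type.
type store struct {
	running             bool
	betweenLockAndStart func() // hypothetical test hook, fired at most once
}

// startWorkflow fails with errUnavailable if the workflow is already running.
func (s *store) startWorkflow() error {
	if s.running {
		return errUnavailable
	}
	s.running = true
	return nil
}

// updateWithStart applies the update to a running workflow, or starts one.
// The hook runs after the "not running" check and before the start attempt,
// which is the window the test wants to exercise.
func (s *store) updateWithStart() (string, error) {
	if s.running {
		return "update-existing", nil
	}
	if hook := s.betweenLockAndStart; hook != nil {
		s.betweenLockAndStart = nil
		hook()
	}
	if err := s.startWorkflow(); err != nil {
		return "", err
	}
	return "started", nil
}

// updateWithStartRetry retries only on errUnavailable, up to the given attempts.
func updateWithStartRetry(s *store, attempts int) (string, error) {
	var err error
	for i := 0; i < attempts; i++ {
		var out string
		out, err = s.updateWithStart()
		if !errors.Is(err, errUnavailable) {
			return out, err
		}
	}
	return "", err
}
```

In this model, installing a hook that calls `startWorkflow` makes the first attempt fail with `errUnavailable` and the retry succeed against the now-running workflow. What the model cannot show is the timing of that retry relative to the worker's poll, which is the part that makes the real test nondeterministic.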

This assumption broke after the parallelsuite migration, which gives each test its own isolated testcore.NewEnv running in parallel. With a dedicated env, the retryable history client's immediate retry of the Unavailable error races tightly with RecordWorkflowTaskStarted:

Ordering A (retry wins the lock first): the update is admitted while WFT #1 is still scheduled, so it attaches to WFT #1's messages when the worker polls it. No WFT #2 is ever created.
Ordering B (RecordWorkflowTaskStarted wins): WFT #1 starts and completes with no messages, then the retry runs with no pending WFT, creating speculative WFT #2 with the update.

The original two-poll design panicked in ordering A (index out of range [0] with length 0 on task.Messages[0] in the empty-response first poll). The initial single-poll fix panicked in ordering B for the same reason. Neither ordering is guaranteed, so the test must handle both.

How did you test it?

built
run locally and tested manually - tested with -count=50 locally
covered by existing tests
added new unit test(s)
added new functional test(s)

Potential risks

None, test only


Fix TestWorkflowStartConflict flaky test

fc0ece7

spkane31 requested review from a team as code owners April 1, 2026 22:08

spkane31 requested a review from stephanos April 1, 2026 22:09

stephanos approved these changes Apr 1, 2026


spkane31 enabled auto-merge (squash) April 1, 2026 22:31

Merge branch 'main' of github.com:temporalio/temporal into spk/update-with-start-failure

ecb44ba

spkane31 force-pushed the spk/update-with-start-failure branch from a8bf294 to ecb44ba Compare April 2, 2026 16:26

spkane31 merged commit f09abd1 into main Apr 2, 2026
46 checks passed

spkane31 deleted the spk/update-with-start-failure branch April 2, 2026 16:48

spkane31 merged 2 commits into main from spk/update-with-start-failure
