Fix two flaky/broken functional tests by spkane31 · Pull Request #9660 · temporalio/temporal

spkane31 · 2026-03-25T17:55:13Z

What changed?

tests/task_queue_stats_test.go — Remove the t *testing.T field from taskQueueStatsSuite and replace all s.t usages with s.T(). Critically, fix validateDescribeWorkerDeploymentVersion to use a.NoError(err) instead of require.NoError(s.T(), err) inside the callback.
tests/xdc/history_replication_signals_and_updates_test.go — Fix pollWorkflowResult to handle errors from GetWorkflowExecutionHistory gracefully: return the error from the inner function, retry with a nil page token on transient errors (e.g. CurrentBranchChanged after conflict resolution), and abort cleanly when the context expires.

Why?

TestTaskQueueStats panic: Calling require.NoError(s.T(), err) inside an EventuallyWithT callback is unsafe. EventuallyWithT runs its callback in a separate goroutine on a ticker. If the outer test context expires and the test completes while this callback is still running, the require.NoError call invokes t.Fail() on a completed test, which panics the entire test binary with panic: Fail in goroutine after TestXxx has completed. The fix uses the CollectT-scoped a.NoError(err) so assertions are buffered and safe.
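
The buffering idea behind the fix can be sketched with the standard library alone. This is a simplified model, not testify's implementation: `collector`, `eventually`, and `runCollectorDemo` are hypothetical names standing in for `CollectT`, `EventuallyWithT`, and a test body. The point it illustrates is that the callback only ever writes to a per-attempt collector, so a callback that outlives the caller has nothing to panic on.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// collector is a hypothetical, simplified stand-in for testify's CollectT:
// it records failures for one attempt instead of failing the test directly.
type collector struct {
	mu   sync.Mutex
	errs []error
}

// NoError records err if it is non-nil and reports whether the check passed.
func (c *collector) NoError(err error) bool {
	if err == nil {
		return true
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.errs = append(c.errs, err)
	return false
}

func (c *collector) failed() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.errs) > 0
}

// eventually runs condition in its own goroutine once per tick until an
// attempt records no failures or the timeout elapses. A callback that is
// still running at the deadline can only write to its own collector.
func eventually(condition func(c *collector), timeout, tick time.Duration) bool {
	deadline := time.After(timeout)
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for {
		select {
		case <-deadline:
			return false
		case <-ticker.C:
			c := &collector{}
			done := make(chan struct{})
			go func() {
				defer close(done)
				condition(c)
			}()
			select {
			case <-done:
				if !c.failed() {
					return true
				}
			case <-deadline:
				return false
			}
		}
	}
}

// runCollectorDemo simulates a call that fails twice before succeeding.
func runCollectorDemo() (bool, int) {
	var calls atomic.Int32
	ok := eventually(func(c *collector) {
		var err error
		if calls.Add(1) < 3 {
			err = errors.New("not ready")
		}
		c.NoError(err) // buffered per attempt, never touches a *testing.T
	}, 2*time.Second, 5*time.Millisecond)
	return ok, int(calls.Load())
}

func main() {
	ok, attempts := runCollectorDemo()
	fmt.Println(ok, attempts)
}
```

In the real test the equivalent change is the one described above: the callback asserts through its `CollectT` argument (`a.NoError(err)`) rather than through `s.T()`.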

TestConflictResolutionGetResult hang/crash: pollWorkflowResult was calling responseInner.History.Events without checking if responseInner was nil. When GetWorkflowExecutionHistory returns an error (e.g. CurrentBranchChanged after conflict resolution resets the branch token), the response is nil, causing a nil pointer dereference that panics the goroutine. Because the goroutine crashed before sending to workflowResultCh, the main test goroutine blocked on <-workflowResultCh for the entire 30-minute CI timeout. The fix returns the error from the inner function and retries with a nil page token on transient errors, which allows the poll to succeed on the updated branch.
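
A minimal sketch of the retry pattern described above, under stated assumptions: `page`, `fetchFunc`, `pollEvents`, and `runPollDemo` are hypothetical names, the fetch function stands in for `GetWorkflowExecutionHistory`, and discarding already-collected events on retry is an assumption of this sketch rather than something the PR description confirms.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// page is a hypothetical stand-in for one history response.
type page struct {
	Events        []string
	NextPageToken []byte
}

// fetchFunc is a hypothetical stand-in for GetWorkflowExecutionHistory.
type fetchFunc func(ctx context.Context, token []byte) (*page, error)

// pollEvents pages through history. On a fetch error it restarts from a nil
// page token (assumption: partial results are discarded too) and stops only
// when the context is done.
func pollEvents(ctx context.Context, fetch fetchFunc, backoff time.Duration) ([]string, error) {
	var events []string
	var token []byte
	for {
		if err := ctx.Err(); err != nil {
			return nil, err
		}
		resp, err := fetch(ctx, token)
		if err != nil {
			// Do not touch resp here: it is nil when err is non-nil.
			events, token = nil, nil
			select {
			case <-ctx.Done():
				return nil, ctx.Err()
			case <-time.After(backoff):
			}
			continue
		}
		events = append(events, resp.Events...)
		token = resp.NextPageToken
		if len(token) == 0 {
			return events, nil
		}
	}
}

// runPollDemo simulates a branch change between the first and second page.
func runPollDemo() ([]string, error) {
	calls := 0
	fetch := func(ctx context.Context, token []byte) (*page, error) {
		calls++
		switch calls {
		case 1:
			return &page{Events: []string{"started"}, NextPageToken: []byte("p2")}, nil
		case 2:
			return nil, errors.New("current branch changed")
		default:
			return &page{Events: []string{"started", "completed"}}, nil
		}
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return pollEvents(ctx, fetch, time.Millisecond)
}

func main() {
	events, err := runPollDemo()
	fmt.Println(events, err)
}
```

Because every exit path returns either a result or an error, a caller waiting on a result channel always receives something to act on, which is the property the original code lost when the goroutine panicked.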

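How did you test it?

- [ ] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
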
Potential risks

The XDC fix changes pollWorkflowResult to retry indefinitely on any non-context error. If a non-transient error were to occur repeatedly, this could loop until the context expires rather than failing immediately. This is acceptable since the context timeout provides a hard bound, and the test failure message via c.t.s.NoError(ctx.Err(), ...) will make the root cause clear.

spkane31 · 2026-03-25T17:56:05Z

tests/task_queue_stats_test.go

 	// taskQueueStatsSuite encapsulates the test environment and parameters for task queue stats tests.
 	taskQueueStatsSuite struct {
 		testcore.Env
 		usePriMatcher   bool


We could add a lint to prevent embedding testing.T in new suites; this is causing issues for our tests.

Since #9536 decouples Env and assertions, that might already help a lot.
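
As a rough illustration of what such a lint could look like, here is a sketch using only `go/ast` and `go/parser`. Everything in it is hypothetical (`findTestingTFields`, `isTestingT`, the `sample` source), and the check is purely syntactic: it matches the literal selector `testing.T` and would miss aliased imports. A production check would more likely be written as a proper analyzer with type information.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// sample is hypothetical input containing one offending struct.
const sample = `package tests

import "testing"

type goodSuite struct {
	name string
}

type badSuite struct {
	t             *testing.T
	usePriMatcher bool
}
`

// findTestingTFields returns "Struct.field" for every struct field, named or
// embedded, whose type is written as testing.T or *testing.T.
func findTestingTFields(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "suite.go", src, 0)
	if err != nil {
		return nil, err
	}
	var found []string
	ast.Inspect(file, func(n ast.Node) bool {
		ts, ok := n.(*ast.TypeSpec)
		if !ok {
			return true
		}
		st, ok := ts.Type.(*ast.StructType)
		if !ok {
			return true
		}
		for _, f := range st.Fields.List {
			if !isTestingT(f.Type) {
				continue
			}
			name := "(embedded)"
			if len(f.Names) > 0 {
				name = f.Names[0].Name
			}
			found = append(found, ts.Name.Name+"."+name)
		}
		return true
	})
	return found, nil
}

// isTestingT reports whether e is the selector testing.T, optionally behind
// a pointer. It does not resolve import aliases.
func isTestingT(e ast.Expr) bool {
	if star, ok := e.(*ast.StarExpr); ok {
		e = star.X
	}
	sel, ok := e.(*ast.SelectorExpr)
	if !ok {
		return false
	}
	pkg, ok := sel.X.(*ast.Ident)
	return ok && pkg.Name == "testing" && sel.Sel.Name == "T"
}

func main() {
	found, err := findTestingTFields(sample)
	fmt.Println(found, err)
}
```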

stephanos · 2026-03-25T21:46:12Z

tests/task_queue_stats_test.go


 		req.ReportTaskQueueStats = true
 		resp, err := s.FrontendClient().DescribeWorkerDeploymentVersion(ctx, req)
-		require.NoError(s.T(), err)


This would be impossible with #9490 :D

stephanos · 2026-03-25T21:49:29Z

tests/xdc/history_replication_signals_and_updates_test.go

+	if len(allEvents) == 0 {
+		return nil
+	}


Why do we still need this one if the assertion checks it's 1 already?


Fixes for functional-test-crash

82712b6

merge conflicts

42517d6

spkane31 marked this pull request as ready for review March 25, 2026 21:43

spkane31 requested review from a team as code owners March 25, 2026 21:43

spkane31 requested a review from stephanos March 25, 2026 21:43

stephanos reviewed Mar 25, 2026

stephanos approved these changes Mar 25, 2026

spkane31 and others added 2 commits March 25, 2026 16:14

remove extra

beaa764

Merge branch 'main' into functional-test-crash

055d95d

spkane31 enabled auto-merge (squash) March 25, 2026 22:18

Merge branch 'main' into functional-test-crash

012d87b

spkane31 merged commit 4c756bb into main Mar 26, 2026
68 of 70 checks passed

spkane31 deleted the functional-test-crash branch March 26, 2026 15:18

ShahabT mentioned this pull request Apr 2, 2026

Serverless Feature Integration #9779

Merged

spkane31 merged 5 commits into main from functional-test-crash
