Skip to content

Fix two flaky/broken functional tests#9660

Merged
spkane31 merged 5 commits intomainfrom
functional-test-crash
Mar 26, 2026
Merged

Fix two flaky/broken functional tests#9660
spkane31 merged 5 commits intomainfrom
functional-test-crash

Conversation

@spkane31
Copy link
Copy Markdown
Contributor

What changed?

  1. tests/task_queue_stats_test.go — Remove the t *testing.T field from taskQueueStatsSuite and replace all s.t usages with s.T(). Critically, fix validateDescribeWorkerDeploymentVersion to use a.NoError(err) instead of require.NoError(s.T(), err) inside the callback.
  2. tests/xdc/history_replication_signals_and_updates_test.go — Fix pollWorkflowResult to handle errors from GetWorkflowExecutionHistory gracefully: return the error from the inner function, retry with a nil page token on transient errors (e.g. CurrentBranchChanged after conflict resolution), and abort cleanly when the context expires.

Why?

TestTaskQueueStats panic: Calling require.NoError(s.T(), err) inside an EventuallyWithT callback is unsafe. EventuallyWithT runs its callback in a separate goroutine on a ticker. If the outer test context expires and the test completes while this callback is still running, the require.NoError call invokes t.Fail() on a completed test, which panics the entire test binary with panic: Fail in goroutine after TestXxx has completed. The fix uses the CollectT-scoped a.NoError(err) so assertions are buffered and safe.

TestConflictResolutionGetResult hang/crash: pollWorkflowResult was calling responseInner.History.Events without checking if responseInner was nil. When GetWorkflowExecutionHistory returns an error (e.g. CurrentBranchChanged after conflict resolution resets the branch token), the response is nil, causing a nil pointer dereference that panics the goroutine. Because the goroutine crashed before sending to workflowResultCh, the main test goroutine blocked on <-workflowResultCh for the entire 30-minute CI timeout. The fix returns the error from the inner function and retries with a nil page token on transient errors, which allows the poll to succeed on the updated branch.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

The XDC fix changes pollWorkflowResult to retry indefinitely on any non-context error. If a non-transient error were to occur repeatedly, this could loop until the context expires rather than failing immediately. This is acceptable since the context timeout provides a hard bound, and the test failure message via c.t.s.NoError(ctx.Err(), ...) will make the root cause clear.

// taskQueueStatsSuite encapsulates the test environment and parameters for task queue stats tests.
taskQueueStatsSuite struct {
testcore.Env
usePriMatcher bool
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a lint to prevent embedding testing.T in new suites, this is causing issues for our tests

Copy link
Copy Markdown
Contributor

@stephanos stephanos Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since #9536 decouples Env and assertions, that might already help a lot.

@spkane31 spkane31 marked this pull request as ready for review March 25, 2026 21:43
@spkane31 spkane31 requested review from a team as code owners March 25, 2026 21:43
@spkane31 spkane31 requested a review from stephanos March 25, 2026 21:43

req.ReportTaskQueueStats = true
resp, err := s.FrontendClient().DescribeWorkerDeploymentVersion(ctx, req)
require.NoError(s.T(), err)
Copy link
Copy Markdown
Contributor

@stephanos stephanos Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be impossible with #9490 :D

Comment on lines +1088 to +1090
if len(allEvents) == 0 {
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we still need this one if the assertion checks it's 1 already?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@spkane31 spkane31 enabled auto-merge (squash) March 25, 2026 22:18
@spkane31 spkane31 merged commit 4c756bb into main Mar 26, 2026
68 of 70 checks passed
@spkane31 spkane31 deleted the functional-test-crash branch March 26, 2026 15:18
awln-temporal pushed a commit that referenced this pull request Mar 26, 2026
## What changed?
1. tests/task_queue_stats_test.go — Remove the `t *testing.T` field from
`taskQueueStatsSuite` and replace all `s.t` usages with `s.T()`.
Critically, fix `validateDescribeWorkerDeploymentVersion` to use
`a.NoError(err)` instead of `require.NoError(s.T(), err)` inside the
callback.
2. tests/xdc/history_replication_signals_and_updates_test.go — Fix
`pollWorkflowResult` to handle errors from `GetWorkflowExecutionHistory`
gracefully: return the error from the inner function, retry with a nil
page token on transient errors (e.g. CurrentBranchChanged after conflict
resolution), and abort cleanly when the context expires.

## Why?
TestTaskQueueStats panic: Calling require.NoError(s.T(), err) inside an
EventuallyWithT callback is unsafe. EventuallyWithT runs its callback in
a separate goroutine on a ticker. If the outer test context expires and
the test completes while this callback is still running, the
require.NoError call invokes t.Fail() on a completed test, which panics
the entire test binary with panic: Fail in goroutine after TestXxx has
completed. The fix uses the CollectT-scoped a.NoError(err) so assertions
are buffered and safe.

TestConflictResolutionGetResult hang/crash: pollWorkflowResult was
calling responseInner.History.Events without checking if responseInner
was nil. When GetWorkflowExecutionHistory returns an error (e.g.
CurrentBranchChanged after conflict resolution resets the branch token),
the response is nil, causing a nil pointer dereference that panics the
goroutine. Because the goroutine crashed before sending to
workflowResultCh, the main test goroutine blocked on <-workflowResultCh
for the entire 30-minute CI timeout. The fix returns the error from the
inner function and retries with a nil page token on transient errors,
which allows the poll to succeed on the updated branch.

## How did you test it?
- [ ] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
The XDC fix changes `pollWorkflowResult` to retry indefinitely on any
non-context error. If a non-transient error were to occur repeatedly,
this could loop until the context expires rather than failing
immediately. This is acceptable since the context timeout provides a
hard bound, and the test failure message via `c.t.s.NoError(ctx.Err(),
...)` will make the root cause clear.
chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026
## What changed?
1. tests/task_queue_stats_test.go — Remove the `t *testing.T` field from
`taskQueueStatsSuite` and replace all `s.t` usages with `s.T()`.
Critically, fix `validateDescribeWorkerDeploymentVersion` to use
`a.NoError(err)` instead of `require.NoError(s.T(), err)` inside the
callback.
2. tests/xdc/history_replication_signals_and_updates_test.go — Fix
`pollWorkflowResult` to handle errors from `GetWorkflowExecutionHistory`
gracefully: return the error from the inner function, retry with a nil
page token on transient errors (e.g. CurrentBranchChanged after conflict
resolution), and abort cleanly when the context expires.

## Why?
TestTaskQueueStats panic: Calling require.NoError(s.T(), err) inside an
EventuallyWithT callback is unsafe. EventuallyWithT runs its callback in
a separate goroutine on a ticker. If the outer test context expires and
the test completes while this callback is still running, the
require.NoError call invokes t.Fail() on a completed test, which panics
the entire test binary with panic: Fail in goroutine after TestXxx has
completed. The fix uses the CollectT-scoped a.NoError(err) so assertions
are buffered and safe.

TestConflictResolutionGetResult hang/crash: pollWorkflowResult was
calling responseInner.History.Events without checking if responseInner
was nil. When GetWorkflowExecutionHistory returns an error (e.g.
CurrentBranchChanged after conflict resolution resets the branch token),
the response is nil, causing a nil pointer dereference that panics the
goroutine. Because the goroutine crashed before sending to
workflowResultCh, the main test goroutine blocked on <-workflowResultCh
for the entire 30-minute CI timeout. The fix returns the error from the
inner function and retries with a nil page token on transient errors,
which allows the poll to succeed on the updated branch.

## How did you test it?
- [ ] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
The XDC fix changes `pollWorkflowResult` to retry indefinitely on any
non-context error. If a non-transient error were to occur repeatedly,
this could loop until the context expires rather than failing
immediately. This is acceptable since the context timeout provides a
hard bound, and the test failure message via `c.t.s.NoError(ctx.Err(),
...)` will make the root cause clear.
chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026
## What changed?
1. tests/task_queue_stats_test.go — Remove the `t *testing.T` field from
`taskQueueStatsSuite` and replace all `s.t` usages with `s.T()`.
Critically, fix `validateDescribeWorkerDeploymentVersion` to use
`a.NoError(err)` instead of `require.NoError(s.T(), err)` inside the
callback.
2. tests/xdc/history_replication_signals_and_updates_test.go — Fix
`pollWorkflowResult` to handle errors from `GetWorkflowExecutionHistory`
gracefully: return the error from the inner function, retry with a nil
page token on transient errors (e.g. CurrentBranchChanged after conflict
resolution), and abort cleanly when the context expires.

## Why?
TestTaskQueueStats panic: Calling require.NoError(s.T(), err) inside an
EventuallyWithT callback is unsafe. EventuallyWithT runs its callback in
a separate goroutine on a ticker. If the outer test context expires and
the test completes while this callback is still running, the
require.NoError call invokes t.Fail() on a completed test, which panics
the entire test binary with panic: Fail in goroutine after TestXxx has
completed. The fix uses the CollectT-scoped a.NoError(err) so assertions
are buffered and safe.

TestConflictResolutionGetResult hang/crash: pollWorkflowResult was
calling responseInner.History.Events without checking if responseInner
was nil. When GetWorkflowExecutionHistory returns an error (e.g.
CurrentBranchChanged after conflict resolution resets the branch token),
the response is nil, causing a nil pointer dereference that panics the
goroutine. Because the goroutine crashed before sending to
workflowResultCh, the main test goroutine blocked on <-workflowResultCh
for the entire 30-minute CI timeout. The fix returns the error from the
inner function and retries with a nil page token on transient errors,
which allows the poll to succeed on the updated branch.

## How did you test it?
- [ ] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
The XDC fix changes `pollWorkflowResult` to retry indefinitely on any
non-context error. If a non-transient error were to occur repeatedly,
this could loop until the context expires rather than failing
immediately. This is acceptable since the context timeout provides a
hard bound, and the test failure message via `c.t.s.NoError(ctx.Err(),
...)` will make the root cause clear.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants