Skip to content

Update agent code review guidelines flaky test findings#9669

Merged
spkane31 merged 4 commits intomainfrom
spk/fix-common-testing-issues
Mar 30, 2026
Merged

Update agent code review guidelines flaky test findings#9669
spkane31 merged 4 commits intomainfrom
spk/fix-common-testing-issues

Conversation

@spkane31
Copy link
Copy Markdown
Contributor

What changed?

Add new review guidelines for test reliability and fix instances found in the current codebase.

Why?

I have found these to be the most common patterns across the last two weeks of targeting flaky tests.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

Minimal

@spkane31 spkane31 requested review from a team as code owners March 25, 2026 23:03
@spkane31 spkane31 requested a review from stephanos March 25, 2026 23:03
select {
case <-cancelCh:
case <-ctx.Done():
s.Fail("Test timed out after 90s waiting for activity cancellation", ctx.Err())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's not hard-code timeouts (90s) as they might change

case <-listReqDone:
case <-ctx.Done():
s.Fail("timed out waiting for list nexus endpoints request to complete")
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking musings: I wonder if this could be a helper on TestEnv; like s.WaitForChn and it uses the right context automatically. I didn't think if happens that often, but now seeing multiple examples ...

- Use `require.ErrorAs(t, err, &specificErr)` for specific error type checks
- Prefer `require` over `assert` - it's rarely useful to continue a test after a failed assertion
- Add comments explaining why `Eventually` is needed (e.g., eventual consistency)
- Do not use single-value type assertions on errors (`err.(*T)`); this panics instead of failing the test when the type doesn't match. Use `errors.As` with a guarded return, or the two-value form `v, ok := err.(*T)` followed by an explicit `require.True(t, ok)`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest only errors.As and not the longer/less helpful version of checked type casting.

Copy link
Copy Markdown
Contributor

@stephanos stephanos Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This 1-liner is now possible in latest Go:

s.ErrorAs(err, new(*serviceerror.NotFound))

That might be actually the best advice IMO.

- Never call testify assertions (`s.NoError`, `s.Equal`, `require.NoError`, even `assert.NoError`) inside a `go func()` — if the goroutine outlives the test, the assertion panics the binary with `panic: Fail in goroutine after TestXxx has completed`. Move assertions to the test goroutine or use a buffered error channel.
- Any `<-ch` that isn't inside a `select` with `ctx.Done()` will hang indefinitely if the sender never sends. Always provide a context cancellation fallback.
- Never write to package-level or global variables in tests — parallel tests share the same process; thread values through function parameters instead.
- Proto message fields accessed outside the workflow lock must be cloned, not aliased: use `common.CloneProto(...)` rather than returning the pointer directly.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test specific? or does it belong to Concurrency and Safety?

- Never write to package-level or global variables in tests — parallel tests share the same process; thread values through function parameters instead.
- Proto message fields accessed outside the workflow lock must be cloned, not aliased: use `common.CloneProto(...)` rather than returning the pointer directly.
- Never use `time.Sleep` or `time.Since(start) > threshold` to enforce ordering — use channels, `sync.WaitGroup`, or `EventuallyWithT` instead.
- Do not use `RealTimeSource` with a zero-duration delay in unit tests — it creates an infinite spin loop. Use a mock/event time source with a non-zero delay and advance the clock explicitly.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite follow that one; is that referring to NewTimer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm removing this one, it was a one-off that I don't think happens frequently enough for a rule

- Proto message fields accessed outside the workflow lock must be cloned, not aliased: use `common.CloneProto(...)` rather than returning the pointer directly.
- Never use `time.Sleep` or `time.Since(start) > threshold` to enforce ordering — use channels, `sync.WaitGroup`, or `EventuallyWithT` instead.
- Do not use `RealTimeSource` with a zero-duration delay in unit tests — it creates an infinite spin loop. Use a mock/event time source with a non-zero delay and advance the clock explicitly.
- When combining a short-lived background operation with a propagation wait, verify the timeouts are compatible: if the background op can time out before the condition is satisfied, the propagation wait will never succeed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe AI will get it, but I don't :D What's "propagation wait" meaning here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning an Eventually that is waiting on an async operation to take affect. Rewording to be more clear

@spkane31 spkane31 requested a review from a team as a code owner March 26, 2026 15:41
@spkane31 spkane31 requested a review from stephanos March 30, 2026 19:13
- Prefer `sync.Mutex` over `sync.RWMutex` almost always, except when reads are much more common than writes (>1000×) or readers hold the lock for significant time
- Don't do IO while holding locks - use side effect tasks
- Clone data before releasing locks if it might be modified
- Be suspicious of `go s.someHelper(ctx, ...)` calls where the goroutine runs exactly once and the test then immediately waits for something that helper was supposed to cause. If the operation can fail transiently (network, tight deadline, busy CI), the single attempt may fail silently and the wait will never succeed. Either loop the goroutine until `ctx.Done()`, or check that the operation succeeded before proceeding.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should this be in ## 3. Testify Suite Correctness and Reliability ?

@spkane31 spkane31 enabled auto-merge (squash) March 30, 2026 19:21
@spkane31 spkane31 merged commit 820d7d2 into main Mar 30, 2026
46 checks passed
@spkane31 spkane31 deleted the spk/fix-common-testing-issues branch March 30, 2026 19:52
chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026
)

## What changed?
Add new review guidelines for test reliability and fix instances found
in the current codebase.

## Why?
I have found these to be the most common patterns across the last two
weeks of targeting flaky tests.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Minimal
chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026
)

## What changed?
Add new review guidelines for test reliability and fix instances found
in the current codebase.

## Why?
I have found these to be the most common patterns across the last two
weeks of targeting flaky tests.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Minimal
chaptersix pushed a commit that referenced this pull request Apr 2, 2026
## What changed?
Add new review guidelines for test reliability and fix instances found
in the current codebase.

## Why?
I have found these to be the most common patterns across the last two
weeks of targeting flaky tests.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Minimal
chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026
)

## What changed?
Add new review guidelines for test reliability and fix instances found
in the current codebase.

## Why?
I have found these to be the most common patterns across the last two
weeks of targeting flaky tests.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Minimal
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants