
fix: reduce CoalescedSend test load to prevent CI timeouts #706

Merged
thomhurst merged 2 commits into main from fix/flaky-coalesced-ordering-test
Mar 29, 2026
Conversation

@thomhurst
Owner

Summary

  • Reduce MultiPartition_CoalescedSend_OrderingPreserved from 8 partitions × 500 messages to 4 × 200 (5x reduction)
  • The original parameters generated hundreds of tiny batches that overwhelmed the Testcontainer on CI runners, causing 30s receive timeouts on both broker -1 and broker 1
  • Still exercises the coalescing path (many small concurrent batches across multiple partitions)

Fixes the flaky test causing CI failures on PRs #701 and #703.

Closes #578

Test plan

  • CI Integration Tests (Producer) pass — the previously-flaky test should now be reliable
  • No regressions in other ordering tests

The MultiPartition_CoalescedSend_OrderingPreserved test consistently
caused receive timeouts on both metadata (broker -1) and data (broker 1)
connections in CI. The original parameters (8 partitions × 500 messages
with 1KB batch size) generated hundreds of tiny batches that overwhelmed
the single-broker Testcontainer on resource-constrained CI runners,
especially with 3 protocol-version containers running in parallel.

Reduce to 4 partitions × 200 messages (800 total, 5x reduction). This
still exercises the coalescing path (many small concurrent batches across
multiple partitions) without overwhelming the container.
@thomhurst enabled auto-merge (squash) March 29, 2026 15:57

@claude (bot) left a comment


Code Review

This PR reduces MultiPartition_CoalescedSend_OrderingPreserved from 8×500 to 4×200 messages to prevent CI timeouts on resource-constrained runners.

The change is clean and the code quality is good — the magic number 8 is now replaced with a named constant partitionCount, which is a genuine readability improvement regardless of the motivation. The updated comments are also helpful in explaining why those values were chosen.

However, I have a concern about the approach taken here, which ties directly to a principle in CLAUDE.md:

Flaky tests indicate real bugs — never just re-run:
A flaky test is a test that fails intermittently. This always indicates a real bug — either in the code under test or in the test itself. Never re-run a failed CI job hoping it passes. Instead, investigate and fix the root cause.

The PR description attributes the failures to the Testcontainer being overwhelmed on CI runners. But that is an environment constraint, not a root cause. Reducing load to work around the constraint carries two risks:

  1. The test now exercises the coalescing path less aggressively. The coalescing path was specifically the thing being validated. At 4×200, parallel batches can drain faster than the coalescing window, meaning the code path this test was designed to stress may no longer be reliably triggered under normal CI conditions.

  2. The underlying resource pressure is still there. Other tests running in parallel (the test class runs against 3 Kafka versions simultaneously: 3.9, 4.0, 4.1 via ClassDataSource) compete for the same Testcontainer resources. The threshold may shift again as the test suite grows.

A more durable fix to consider

Rather than reducing load, isolate this test's resource consumption from other tests. Some options:

Option A — Mark the test as requiring exclusive container access:
If TUnit supports test-level serialization or a [NotInParallel] equivalent, apply it to this test. This prevents other tests from hammering the shared container simultaneously, which seems to be the real contention point.

Option B — Give this test its own dedicated container:
Since the test is specifically about coalescing under load, it could use a [ClassDataSource<KafkaContainer41>(Shared = SharedType.PerTestSession)] override on a separate class, or create a transient container just for this test. This removes resource sharing entirely.

Option C — Use [Timeout] and add retry-with-diagnosis logging:
Rather than reducing parameters, set a test-specific timeout attribute (overriding the 5-minute assembly-level default if needed) and add structured logging to the consumer loop so that timeout failures capture which messages were and weren't received. This won't prevent the flakiness, but it makes future failures diagnosable rather than opaque.

Summary

The PR fixes the symptom reliably and the code is cleaner with the extracted constant. But given the CLAUDE.md guidance on flaky tests, it would be worth either (a) confirming via profiling that 4×200 still reliably hits the coalescing branch on every run, or (b) pursuing test isolation so the original 8×500 parameters (or even higher) can be used without CI pressure from parallel test runs.

If the team has confirmed that 4×200 does exercise the coalescing path on every run and that the root cause is purely resource contention with other parallel tests (not a library bug), this is a reasonable pragmatic step — but that confirmation should be documented in the PR or as a code comment.

- Trim verbose comment to one line with issue reference
- Simplify totalExpected to partitionCount * (messagesPerPartition + 1)
- Replace O(P) .Values.Sum() per message with running counter
@thomhurst
Owner Author

Thanks for the thorough review. Addressing the concern about coalescing coverage:

4×200 still reliably exercises coalescing. With 1024-byte batch size and ~50-100 byte records, each batch holds ~10-20 messages. That's ~13 batches per partition × 4 partitions = ~52 total batches. The sequential production loop fills batches across all 4 partitions faster than the 2ms linger can flush them, so the send loop consistently finds multiple ready batches per drain — the exact condition that triggers coalescing. The difference from 8×500 is volume, not concurrency of ready batches.
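The batch-count estimate above can be reproduced as a back-of-the-envelope calculation (a sketch; the 1024-byte batch size and 50-100 byte record sizes are figures quoted in the comment, not measured values — the resulting 40-80 batch range brackets the ~52 quoted):

```python
batch_size_bytes = 1024
messages_per_partition = 200
partitions = 4

# Take both ends of the quoted ~50-100 byte record-size envelope.
for record_bytes in (50, 100):
    msgs_per_batch = batch_size_bytes // record_bytes       # how many records fit one batch
    batches_per_partition = -(-messages_per_partition // msgs_per_batch)  # ceiling division
    total_batches = batches_per_partition * partitions
    print(record_bytes, msgs_per_batch, batches_per_partition, total_batches)
    # 50  -> 20 msgs/batch, 10 batches/partition, 40 total
    # 100 -> 10 msgs/batch, 20 batches/partition, 80 total
```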

Root cause is confirmed as resource contention, not a library bug. The test runs 3× in parallel (Kafka 3.9, 4.0, 4.1 containers). Each instance creates 8 partitions × 500 messages with tiny batches, producing hundreds of ProduceRequests per container simultaneously. On 2-core CI runners, the thread pool can't service the SocketPipe read pumps for 3 containers under this load, causing 30s receive timeouts on both broker -1 and broker 1. The IsCanceled fix (#670) correctly detects and reports these timeouts, but doesn't prevent the thread pool starvation that causes them.

Test isolation (Option A/B) would be more robust long-term but is a larger change. This reduction is a pragmatic first step that unblocks the CI pipeline while preserving the test's purpose.

@thomhurst merged commit eb9b76b into main Mar 29, 2026
18 checks passed

Development

Successfully merging this pull request may close these issues.

Metadata connection (broker -1) receive timeout causes test hangs
