
fix: reduce CoalescedSend test load to prevent CI timeouts #706

Merged
thomhurst merged 2 commits into main from fix/flaky-coalesced-ordering-test
Mar 29, 2026
Conversation

@thomhurst
Owner

Summary

  • Reduce MultiPartition_CoalescedSend_OrderingPreserved from 8 partitions × 500 messages to 4 × 200 (5x reduction)
  • The original parameters generated hundreds of tiny batches that overwhelmed the Testcontainer on CI runners, causing 30s receive timeouts on both broker -1 and broker 1
  • Still exercises the coalescing path (many small concurrent batches across multiple partitions)

Fixes the flaky test causing CI failures on PRs #701 and #703.

Closes #578

Test plan

  • CI Integration Tests (Producer) pass — the previously-flaky test should now be reliable
  • No regressions in other ordering tests

The MultiPartition_CoalescedSend_OrderingPreserved test consistently
caused receive timeouts on both metadata (broker -1) and data (broker 1)
connections in CI. The original parameters (8 partitions × 500 messages
with 1KB batch size) generated hundreds of tiny batches that overwhelmed
the single-broker Testcontainer on resource-constrained CI runners,
especially with 3 protocol-version containers running in parallel.

Reduce to 4 partitions × 200 messages (800 total, 5x reduction). This
still exercises the coalescing path (many small concurrent batches across
multiple partitions) without overwhelming the container.
@thomhurst enabled auto-merge (squash) March 29, 2026 15:57

@claude (bot) left a comment


Code Review

This PR reduces MultiPartition_CoalescedSend_OrderingPreserved from 8×500 to 4×200 messages to prevent CI timeouts on resource-constrained runners.

The change is clean and the code quality is good — the magic number 8 is now replaced with a named constant partitionCount, which is a genuine readability improvement regardless of the motivation. The updated comments are also helpful in explaining why those values were chosen.

However, I have a concern about the approach taken here, which ties directly to a principle in CLAUDE.md:

Flaky tests indicate real bugs — never just re-run:
A flaky test is a test that fails intermittently. This always indicates a real bug — either in the code under test or in the test itself. Never re-run a failed CI job hoping it passes. Instead, investigate and fix the root cause.

The PR description attributes the failures to the Testcontainer being overwhelmed on CI runners. But that is an environment constraint, not a root cause. Reducing load to work around the constraint carries two risks:

  1. The test now exercises the coalescing path less aggressively. The coalescing path was specifically the thing being validated. At 4×200, parallel batches can drain faster than the coalescing window, meaning the code path this test was designed to stress may no longer be reliably triggered under normal CI conditions.

  2. The underlying resource pressure is still there. Other tests running in parallel (the test class runs against 3 Kafka versions simultaneously: 3.9, 4.0, 4.1 via ClassDataSource) compete for the same Testcontainer resources. The threshold may shift again as the test suite grows.

A more durable fix to consider

Rather than reducing load, isolate this test's resource consumption from other tests. Some options:

Option A — Mark the test as requiring exclusive container access:
If TUnit supports test-level serialization or a [NotInParallel] equivalent, apply it to this test. This prevents other tests from hammering the shared container simultaneously, which seems to be the real contention point.

Option B — Give this test its own dedicated container:
Since the test is specifically about coalescing under load, it could use a [ClassDataSource<KafkaContainer41>(Shared = SharedType.PerTestSession)] override on a separate class, or create a transient container just for this test. This removes resource sharing entirely.

Option C — Use [Timeout] and add retry-with-diagnosis logging:
Rather than reducing parameters, set a test-specific timeout attribute (overriding the 5-minute assembly-level default if needed) and add structured logging to the consumer loop so that timeout failures capture which messages were and weren't received. This won't prevent the flakiness, but it makes future failures diagnosable rather than opaque.

Summary

The PR fixes the symptom reliably and the code is cleaner with the extracted constant. But given the CLAUDE.md guidance on flaky tests, it would be worth either (a) confirming via profiling that 4×200 still reliably hits the coalescing branch on every run, or (b) pursuing test isolation so the original 8×500 parameters (or even higher) can be used without CI pressure from parallel test runs.

If the team has confirmed that 4×200 does exercise the coalescing path on every run and that the root cause is purely resource contention with other parallel tests (not a library bug), this is a reasonable pragmatic step — but that confirmation should be documented in the PR or as a code comment.

- Trim verbose comment to one line with issue reference
- Simplify totalExpected to partitionCount * (messagesPerPartition + 1)
- Replace O(P) .Values.Sum() per message with running counter
@thomhurst
Owner Author

Thanks for the thorough review. Addressing the concern about coalescing coverage:

4×200 still reliably exercises coalescing. With 1024-byte batch size and ~50-100 byte records, each batch holds ~10-20 messages. That's ~13 batches per partition × 4 partitions = ~52 total batches. The sequential production loop fills batches across all 4 partitions faster than the 2ms linger can flush them, so the send loop consistently finds multiple ready batches per drain — the exact condition that triggers coalescing. The difference from 8×500 is volume, not concurrency of ready batches.
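The batch-count estimate above can be reproduced as a back-of-the-envelope calculation (a sketch; the 1024-byte batch size and 50-100 byte record sizes are figures quoted in the comment, not measured values — the resulting 40-80 batch range brackets the ~52 quoted):

```python
batch_size_bytes = 1024
messages_per_partition = 200
partitions = 4

# Take both ends of the quoted ~50-100 byte record-size envelope.
for record_bytes in (50, 100):
    msgs_per_batch = batch_size_bytes // record_bytes       # how many records fit one batch
    batches_per_partition = -(-messages_per_partition // msgs_per_batch)  # ceiling division
    total_batches = batches_per_partition * partitions
    print(record_bytes, msgs_per_batch, batches_per_partition, total_batches)
    # 50  -> 20 msgs/batch, 10 batches/partition, 40 total
    # 100 -> 10 msgs/batch, 20 batches/partition, 80 total
```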

Root cause is confirmed as resource contention, not a library bug. The test runs 3× in parallel (Kafka 3.9, 4.0, 4.1 containers). Each instance creates 8 partitions × 500 messages with tiny batches, producing hundreds of ProduceRequests per container simultaneously. On 2-core CI runners, the thread pool can't service the SocketPipe read pumps for 3 containers under this load, causing 30s receive timeouts on both broker -1 and broker 1. The IsCanceled fix (#670) correctly detects and reports these timeouts, but doesn't prevent the thread pool starvation that causes them.

Test isolation (Option A/B) would be more robust long-term but is a larger change. This reduction is a pragmatic first step that unblocks the CI pipeline while preserving the test's purpose.

@thomhurst merged commit eb9b76b into main Mar 29, 2026
18 checks passed

Development

Successfully merging this pull request may close these issues.

Metadata connection (broker -1) receive timeout causes test hangs
