
KAFKA-19452: Fix flaky test LogRecoveryTest.testHWCheckpointWithFailuresMultipleLogSegments #20121


Open
karuturi wants to merge 1 commit into trunk from fix/KAFKA-19452-flaky-log-recovery-test

Conversation

@karuturi (Member) commented Jul 7, 2025

The test was flaky because awaitLeaderChange was timing out.

The original code called awaitLeaderChange(servers, ...) where servers
is a list containing both server1 and server2. This was problematic
because server1 was still offline. The function was trying to poll a
dead broker, which led to unpredictable timeouts and caused the test to
fail intermittently.

The fix is to make the awaitLeaderChange call more specific, so it only
polls the broker that is actually running (server2).
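
Roughly, the change looks like this (the Seq(server2) form below is a sketch of the intended change; the exact argument in the commit may differ):

    // Before (flaky): `servers` includes server1, which is still shut down, so the
    // helper can spend its whole 30s budget polling a dead broker.
    leader = awaitLeaderChange(servers, topicPartition, oldLeaderOpt = Some(leader), timeout = 30000L)

    // After (assumed shape of the fix): only the running broker is polled, so the
    // leader change is observed promptly on server2.
    leader = awaitLeaderChange(Seq(server2), topicPartition, oldLeaderOpt = Some(leader), timeout = 30000L)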

Ran the test in a loop to make sure it passes consistently:

    for i in {1..10}; do ./gradlew core:test --tests kafka.server.LogRecoveryTest && echo "Run $i passed" || exit 1; done

@github-actions bot added the core (Kafka Broker), tests (Test fixes, including flaky tests), and small (Small PRs) labels Jul 7, 2025
@junrao (Contributor) left a comment

@karuturi : Thanks for the PR. Left a comment.

@@ -224,6 +224,9 @@ class LogRecoveryTest extends QuorumTestHarness {
server1.startup()
updateProducer()

waitUntilTrue(() => server2.replicaManager.onlinePartition(topicPartition).get.inSyncReplicaIds.size == 2,
@junrao (Contributor)

Hmm, why is this needed? Later on, there is another waitUntilTrue(). That should be enough, right?

    TestUtils.waitUntilTrue(() =>
      server1.replicaManager.localLogOrException(topicPartition).highWatermark == hw,
      "Failed to update high watermark for follower after timeout")

Also, from the JIRA, the failure happened on the following line.

leader = awaitLeaderChange(servers, topicPartition, oldLeaderOpt = Some(leader), timeout = 30000L)
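
For reference, a simplified sketch of why polling a stopped broker alongside the live one can exhaust the timeout. This is illustrative only, not the actual TestUtils.awaitLeaderChange; leaderIdOf is a hypothetical accessor standing in for however the real helper reads partition leadership from a broker.

    import org.apache.kafka.common.TopicPartition
    import kafka.server.KafkaBroker

    // Illustrative sketch only, not the real TestUtils.awaitLeaderChange.
    // leaderIdOf is a hypothetical stand-in for reading a broker's leader view.
    def awaitLeaderChangeSketch(brokers: Seq[KafkaBroker],
                                tp: TopicPartition,
                                oldLeader: Int,
                                timeoutMs: Long)
                               (leaderIdOf: (KafkaBroker, TopicPartition) => Option[Int]): Int = {
      val deadline = System.currentTimeMillis() + timeoutMs
      var newLeader: Option[Int] = None
      while (newLeader.isEmpty && System.currentTimeMillis() < deadline) {
        // Every broker in the list is queried on each pass. A broker that is shut
        // down either blocks the query or reports no leader, so including it only
        // delays (or prevents) the wait from completing within the time budget.
        newLeader = brokers.iterator
          .flatMap(b => leaderIdOf(b, tp))
          .find(_ != oldLeader)
        if (newLeader.isEmpty) Thread.sleep(100)
      }
      newLeader.getOrElse(
        throw new AssertionError(s"Leader of $tp did not move off broker $oldLeader within ${timeoutMs}ms"))
    }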

@karuturi (Member, Author) commented Jul 8, 2025

@junrao My previous fix did not address the specific failure point mentioned in the JIRA issue, and the waitUntilTrue I added was likely redundant for the reasons you mentioned. My apologies.

I updated the commit with a fix for the actual issue.

@karuturi (Member, Author)

@junrao Can you please take a look again?

@karuturi force-pushed the fix/KAFKA-19452-flaky-log-recovery-test branch from 6c911ef to 755494a on July 8, 2025 04:28
@karuturi force-pushed the fix/KAFKA-19452-flaky-log-recovery-test branch from 755494a to 8c32f2b on July 8, 2025 09:30