
KAFKA-19452: Fix flaky test LogRecoveryTest.testHWCheckpointWithFailuresMultipleLogSegments #20121


Open
karuturi wants to merge 1 commit into trunk from fix/KAFKA-19452-flaky-log-recovery-test

Conversation

@karuturi (Member) commented Jul 7, 2025

The test was flaky because awaitLeaderChange was timing out.

The original code called awaitLeaderChange(servers, ...) where servers
is a list containing both server1 and server2. This was problematic
because server1 was still offline. The function was trying to poll a
dead broker, which led to unpredictable timeouts and caused the test to
fail intermittently.

The fix is to make the awaitLeaderChange call more specific, so it only
polls the broker that is actually running (server2).
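
Roughly, the change looks like this (the Seq(server2) form below is a sketch of the intended change; the exact argument in the commit may differ):

    // Before (flaky): `servers` includes server1, which is still shut down, so the
    // helper can spend its whole 30s budget polling a dead broker.
    leader = awaitLeaderChange(servers, topicPartition, oldLeaderOpt = Some(leader), timeout = 30000L)

    // After (assumed shape of the fix): only the running broker is polled, so the
    // leader change is observed promptly on server2.
    leader = awaitLeaderChange(Seq(server2), topicPartition, oldLeaderOpt = Some(leader), timeout = 30000L)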

Ran the test in a loop to make sure it passes consistently:

    for i in {1..10}; do ./gradlew core:test --tests kafka.server.LogRecoveryTest && echo "Run $i passed" || exit 1; done

@github-actions bot added the core (Kafka Broker), tests (Test fixes, including flaky tests), and small (Small PRs) labels Jul 7, 2025
@junrao (Contributor) left a comment

@karuturi : Thanks for the PR. Left a comment.

@@ -224,6 +224,9 @@ class LogRecoveryTest extends QuorumTestHarness {
server1.startup()
updateProducer()

waitUntilTrue(() => server2.replicaManager.onlinePartition(topicPartition).get.inSyncReplicaIds.size == 2,
@junrao (Contributor)

Hmm, why is this needed? Later on, there is another waitUntilTrue(). That should be enough, right?

    TestUtils.waitUntilTrue(() =>
      server1.replicaManager.localLogOrException(topicPartition).highWatermark == hw,
      "Failed to update high watermark for follower after timeout")

Also, from the JIRA, the failure happened on the following line.

leader = awaitLeaderChange(servers, topicPartition, oldLeaderOpt = Some(leader), timeout = 30000L)
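
For reference, a simplified sketch of why polling a stopped broker alongside the live one can exhaust the timeout. This is illustrative only, not the actual TestUtils.awaitLeaderChange; leaderIdOf is a hypothetical accessor standing in for however the real helper reads partition leadership from a broker.

    import org.apache.kafka.common.TopicPartition
    import kafka.server.KafkaBroker

    // Illustrative sketch only, not the real TestUtils.awaitLeaderChange.
    // leaderIdOf is a hypothetical stand-in for reading a broker's leader view.
    def awaitLeaderChangeSketch(brokers: Seq[KafkaBroker],
                                tp: TopicPartition,
                                oldLeader: Int,
                                timeoutMs: Long)
                               (leaderIdOf: (KafkaBroker, TopicPartition) => Option[Int]): Int = {
      val deadline = System.currentTimeMillis() + timeoutMs
      var newLeader: Option[Int] = None
      while (newLeader.isEmpty && System.currentTimeMillis() < deadline) {
        // Every broker in the list is queried on each pass. A broker that is shut
        // down either blocks the query or reports no leader, so including it only
        // delays (or prevents) the wait from completing within the time budget.
        newLeader = brokers.iterator
          .flatMap(b => leaderIdOf(b, tp))
          .find(_ != oldLeader)
        if (newLeader.isEmpty) Thread.sleep(100)
      }
      newLeader.getOrElse(
        throw new AssertionError(s"Leader of $tp did not move off broker $oldLeader within ${timeoutMs}ms"))
    }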

@karuturi (Member, Author) commented Jul 8, 2025

@junrao My previous fix did not address the specific failure point mentioned in the JIRA issue, and the waitUntilTrue I added was likely redundant for the reasons you mentioned. My apologies.

I updated the commit with a fix for the actual issue.

@karuturi (Member, Author)

@junrao Can you please take a look again?

@karuturi force-pushed the fix/KAFKA-19452-flaky-log-recovery-test branch from 6c911ef to 755494a on July 8, 2025 04:28
@karuturi force-pushed the fix/KAFKA-19452-flaky-log-recovery-test branch from 755494a to 8c32f2b on July 8, 2025 09:30