test: bump wait_for_peer_online timeout for dirty-shard crash-loop test by qdrant-cloud-bot · Pull Request #9124 · qdrant/qdrant

qdrant-cloud-bot · 2026-05-21T18:56:18Z

Summary

The test_dirty_shard_survives_update_collection test occasionally fails in CI with:

Exception: Timeout waiting for condition peer_is_online to be satisfied in 30 seconds

at the first wait_for_peer_online(sync_uri) after the target peer is restarted to sync the UpdateCollection entry. Under pytest-xdist parallel load (introduced in #8717) the rejoining peer needs more than the default 30s budget to replicate consensus and report /readyz.

This mirrors #8963, where the same fix (60s timeout) was applied to a different test that was hitting an analogous CPU-contention flake.

Changes

Bump wait_for_peer_online to 60s for the post-kill sync restart.
Bump wait_for_peer_online_or_crash to 60s for the dirty-shard restart, since it has the same underlying constraint.

Test plan

CI passes consensus tests including test_dirty_shard_survives_update_collection.

Made with Cursor

Under parallel pytest-xdist load the rejoining peer needs more than the default 30s budget to replicate the UpdateCollection entry and report ready, causing intermittent CI failures: Exception: Timeout waiting for condition peer_is_online to be satisfied in 30 seconds Bump both peer-online waits in the dirty shard crash-loop test to 60s, matching the precedent set by #8963 for similar CPU-contention flakes. Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai · 2026-05-21T18:58:09Z

📝 Walkthrough

Walkthrough

The PR adjusts timeout parameters in a regression test for consensus cluster behavior. The test_dirty_shard_crash_loop.py test now waits 60 seconds (instead of defaults) for rejoiners to report ready after UpdateCollection requests and after cluster restarts. Comments document that parallel pytest-xdist execution can delay replica synchronization. The changes replace two wait-for-peer-online calls with longer, explicit timeouts to reduce test flakiness under load.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

qdrant/qdrant#8963: Adjusts consensus test readiness polling with similar wait_for_peer_online timeout increases to avoid test flakes.
qdrant/qdrant#8824: Adjusts consensus rejoin/crash-loop test timing by adding waits for peers to become ready after cluster membership changes.
qdrant/qdrant#9121: Changes UpdateCollection behavior to trigger optimizer recreation asynchronously, which likely motivates this PR's timeout increases.

Suggested reviewers

agourlay
generall

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: bumping timeouts in the dirty-shard crash-loop test to address flakiness.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description clearly explains the timeout issue, references a similar fix, and specifies the exact changes made to address the flaky test.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix-dirty-shard-test-timeout

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…st (#9124) Under parallel pytest-xdist load the rejoining peer needs more than the default 30s budget to replicate the UpdateCollection entry and report ready, causing intermittent CI failures: Exception: Timeout waiting for condition peer_is_online to be satisfied in 30 seconds Bump both peer-online waits in the dirty shard crash-loop test to 60s, matching the precedent set by #8963 for similar CPU-contention flakes. Co-authored-by: Cursor Agent <agent@cursor.com> Co-authored-by: Cursor <cursoragent@cursor.com>

* test: wait for WAL flock release on peer restart in dirty-shard test The flake addressed by #9124 (bumping `wait_for_peer_online` to 60s) was misdiagnosed as CPU contention. Logs show the restarted peer panics within ~1s of startup at `consensus_wal.rs:36` with: Wal error: Can't init WAL: Kind(WouldBlock) Panic: Can't open consensus WAL: Kind(WouldBlock) `wal::Wal::open` calls `fs4::FileExt::try_lock` (non-blocking `flock`) on the WAL directory fd. After `p.kill()` (SIGKILL + waitpid) the kernel normally releases the killed peer's flock immediately, but under pytest-xdist load there is a small window where it lags. The fresh peer's startup then races and panics. After the panic `/readyz` never returns 200, so neither 30s nor 60s rescues the test. Fix: add a `wait_for_wal_unlocked` helper that polls both the consensus WAL directory and the local-shard WAL directory with the same exclusive non-blocking flock that qdrant uses, and call it after every `p.kill()` that is followed by a `start_peer` on the same `peer_dir`. The two 60s timeouts are restored to the default 30s now that the underlying race is gone. Co-authored-by: Cursor <cursoragent@cursor.com> * test: fail fast if sync restart crashes in dirty-shard test The sync restart in `test_dirty_shard_survives_update_collection` used plain `wait_for_peer_online(sync_uri)`, which only polls `/readyz`. If the freshly started peer panics on startup (e.g. WAL `WouldBlock`), the test waits the full 30s timeout and then reports a `/readyz` timeout instead of the actual panic message and exit code. Switch the sync restart to `wait_for_peer_online_or_crash(...)` (same helper already used for the dirty restart). On crash it dumps the peer log tail so the next CI failure shows the real reason instead of a generic timeout. Co-authored-by: Cursor <cursoragent@cursor.com> * test: drop speculative WAL flock wait, keep crash-detection switch Reverts the `wait_for_wal_unlocked` helper that polled the WAL directories with non-blocking flock. The kernel-side flock race it was guarding against could not be reproduced in isolation (0/300 iters of SIGKILL+wait+re-flock on bare Linux), so it was speculative. Kept: - Sync restart now uses `wait_for_peer_online_or_crash(...)` instead of plain `wait_for_peer_online`, so any startup panic surfaces fast with the actual log tail instead of hiding behind a 30s `/readyz` timeout. - The two 60s timeouts bumped in #9124 are restored to the default 30s. If the flake recurs in CI, the new failure output will tell us the actual cause (panic message + exit code), which is more useful than papering over it. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor Agent <agent@cursor.com> Co-authored-by: Cursor <cursoragent@cursor.com>

coszio approved these changes May 21, 2026

View reviewed changes

coszio merged commit cd9fbc6 into dev May 21, 2026
15 checks passed

coszio deleted the fix-dirty-shard-test-timeout branch May 21, 2026 20:41

qdrant-cloud-bot mentioned this pull request May 25, 2026

test: surface peer startup crashes in dirty-shard test #9161

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: bump wait_for_peer_online timeout for dirty-shard crash-loop test#9124

test: bump wait_for_peer_online timeout for dirty-shard crash-loop test#9124
coszio merged 1 commit into
devfrom
fix-dirty-shard-test-timeout

qdrant-cloud-bot commented May 21, 2026 •

edited by coszio

Loading

Uh oh!

coderabbitai Bot commented May 21, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

qdrant-cloud-bot commented May 21, 2026 • edited by coszio Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

qdrant-cloud-bot commented May 21, 2026 •

edited by coszio

Loading

coderabbitai Bot commented May 21, 2026 •

edited

Loading