test: bump wait_for_peer_online timeout for dirty-shard crash-loop test#9124
Conversation
Under parallel pytest-xdist load the rejoining peer needs more than the default 30s budget to replicate the UpdateCollection entry and report ready, causing intermittent CI failures: Exception: Timeout waiting for condition peer_is_online to be satisfied in 30 seconds Bump both peer-online waits in the dirty shard crash-loop test to 60s, matching the precedent set by #8963 for similar CPU-contention flakes. Co-authored-by: Cursor <cursoragent@cursor.com>
📝 WalkthroughWalkthroughThe PR adjusts timeout parameters in a regression test for consensus cluster behavior. The Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…st (#9124) Under parallel pytest-xdist load the rejoining peer needs more than the default 30s budget to replicate the UpdateCollection entry and report ready, causing intermittent CI failures: Exception: Timeout waiting for condition peer_is_online to be satisfied in 30 seconds Bump both peer-online waits in the dirty shard crash-loop test to 60s, matching the precedent set by #8963 for similar CPU-contention flakes. Co-authored-by: Cursor Agent <agent@cursor.com> Co-authored-by: Cursor <cursoragent@cursor.com>
* test: wait for WAL flock release on peer restart in dirty-shard test The flake addressed by #9124 (bumping `wait_for_peer_online` to 60s) was misdiagnosed as CPU contention. Logs show the restarted peer panics within ~1s of startup at `consensus_wal.rs:36` with: Wal error: Can't init WAL: Kind(WouldBlock) Panic: Can't open consensus WAL: Kind(WouldBlock) `wal::Wal::open` calls `fs4::FileExt::try_lock` (non-blocking `flock`) on the WAL directory fd. After `p.kill()` (SIGKILL + waitpid) the kernel normally releases the killed peer's flock immediately, but under pytest-xdist load there is a small window where it lags. The fresh peer's startup then races and panics. After the panic `/readyz` never returns 200, so neither 30s nor 60s rescues the test. Fix: add a `wait_for_wal_unlocked` helper that polls both the consensus WAL directory and the local-shard WAL directory with the same exclusive non-blocking flock that qdrant uses, and call it after every `p.kill()` that is followed by a `start_peer` on the same `peer_dir`. The two 60s timeouts are restored to the default 30s now that the underlying race is gone. Co-authored-by: Cursor <cursoragent@cursor.com> * test: fail fast if sync restart crashes in dirty-shard test The sync restart in `test_dirty_shard_survives_update_collection` used plain `wait_for_peer_online(sync_uri)`, which only polls `/readyz`. If the freshly started peer panics on startup (e.g. WAL `WouldBlock`), the test waits the full 30s timeout and then reports a `/readyz` timeout instead of the actual panic message and exit code. Switch the sync restart to `wait_for_peer_online_or_crash(...)` (same helper already used for the dirty restart). On crash it dumps the peer log tail so the next CI failure shows the real reason instead of a generic timeout. Co-authored-by: Cursor <cursoragent@cursor.com> * test: drop speculative WAL flock wait, keep crash-detection switch Reverts the `wait_for_wal_unlocked` helper that polled the WAL directories with non-blocking flock. The kernel-side flock race it was guarding against could not be reproduced in isolation (0/300 iters of SIGKILL+wait+re-flock on bare Linux), so it was speculative. Kept: - Sync restart now uses `wait_for_peer_online_or_crash(...)` instead of plain `wait_for_peer_online`, so any startup panic surfaces fast with the actual log tail instead of hiding behind a 30s `/readyz` timeout. - The two 60s timeouts bumped in #9124 are restored to the default 30s. If the flake recurs in CI, the new failure output will tell us the actual cause (panic message + exit code), which is more useful than papering over it. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor Agent <agent@cursor.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
The
test_dirty_shard_survives_update_collectiontest occasionally fails in CI with:at the first
wait_for_peer_online(sync_uri)after the target peer is restarted to sync theUpdateCollectionentry. Under pytest-xdist parallel load (introduced in #8717) the rejoining peer needs more than the default 30s budget to replicate consensus and report/readyz.This mirrors #8963, where the same fix (60s timeout) was applied to a different test that was hitting an analogous CPU-contention flake.
Changes
wait_for_peer_onlineto 60s for the post-kill sync restart.wait_for_peer_online_or_crashto 60s for the dirty-shard restart, since it has the same underlying constraint.Test plan
test_dirty_shard_survives_update_collection.Made with Cursor