Skip to content

test: bump wait_for_peer_online timeout for dirty-shard crash-loop test#9124

Merged
coszio merged 1 commit into
devfrom
fix-dirty-shard-test-timeout
May 21, 2026
Merged

test: bump wait_for_peer_online timeout for dirty-shard crash-loop test#9124
coszio merged 1 commit into
devfrom
fix-dirty-shard-test-timeout

Conversation

@qdrant-cloud-bot
Copy link
Copy Markdown
Contributor

@qdrant-cloud-bot qdrant-cloud-bot commented May 21, 2026

Summary

The test_dirty_shard_survives_update_collection test occasionally fails in CI with:

Exception: Timeout waiting for condition peer_is_online to be satisfied in 30 seconds

at the first wait_for_peer_online(sync_uri) after the target peer is restarted to sync the UpdateCollection entry. Under pytest-xdist parallel load (introduced in #8717) the rejoining peer needs more than the default 30s budget to replicate consensus and report /readyz.

This mirrors #8963, where the same fix (60s timeout) was applied to a different test that was hitting an analogous CPU-contention flake.

Changes

  • Bump wait_for_peer_online to 60s for the post-kill sync restart.
  • Bump wait_for_peer_online_or_crash to 60s for the dirty-shard restart, since it has the same underlying constraint.

Test plan

  • CI passes consensus tests including test_dirty_shard_survives_update_collection.

Made with Cursor

Under parallel pytest-xdist load the rejoining peer needs more than the
default 30s budget to replicate the UpdateCollection entry and report
ready, causing intermittent CI failures:

  Exception: Timeout waiting for condition peer_is_online to be satisfied
  in 30 seconds

Bump both peer-online waits in the dirty shard crash-loop test to 60s,
matching the precedent set by #8963 for similar CPU-contention flakes.

Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

The PR adjusts timeout parameters in a regression test for consensus cluster behavior. The test_dirty_shard_crash_loop.py test now waits 60 seconds (instead of defaults) for rejoiners to report ready after UpdateCollection requests and after cluster restarts. Comments document that parallel pytest-xdist execution can delay replica synchronization. The changes replace two wait-for-peer-online calls with longer, explicit timeouts to reduce test flakiness under load.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • qdrant/qdrant#8963: Adjusts consensus test readiness polling with similar wait_for_peer_online timeout increases to avoid test flakes.
  • qdrant/qdrant#8824: Adjusts consensus rejoin/crash-loop test timing by adding waits for peers to become ready after cluster membership changes.
  • qdrant/qdrant#9121: Changes UpdateCollection behavior to trigger optimizer recreation asynchronously, which likely motivates this PR's timeout increases.

Suggested reviewers

  • agourlay
  • generall
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: bumping timeouts in the dirty-shard crash-loop test to address flakiness.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description clearly explains the timeout issue, references a similar fix, and specifies the exact changes made to address the flaky test.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-dirty-shard-test-timeout

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coszio coszio merged commit cd9fbc6 into dev May 21, 2026
15 checks passed
@coszio coszio deleted the fix-dirty-shard-test-timeout branch May 21, 2026 20:41
generall pushed a commit that referenced this pull request May 22, 2026
…st (#9124)

Under parallel pytest-xdist load the rejoining peer needs more than the
default 30s budget to replicate the UpdateCollection entry and report
ready, causing intermittent CI failures:

  Exception: Timeout waiting for condition peer_is_online to be satisfied
  in 30 seconds

Bump both peer-online waits in the dirty shard crash-loop test to 60s,
matching the precedent set by #8963 for similar CPU-contention flakes.

Co-authored-by: Cursor Agent <agent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
coszio pushed a commit that referenced this pull request May 25, 2026
* test: wait for WAL flock release on peer restart in dirty-shard test

The flake addressed by #9124 (bumping `wait_for_peer_online` to 60s) was
misdiagnosed as CPU contention. Logs show the restarted peer panics within
~1s of startup at `consensus_wal.rs:36` with:

    Wal error: Can't init WAL: Kind(WouldBlock)
    Panic: Can't open consensus WAL: Kind(WouldBlock)

`wal::Wal::open` calls `fs4::FileExt::try_lock` (non-blocking `flock`) on
the WAL directory fd. After `p.kill()` (SIGKILL + waitpid) the kernel
normally releases the killed peer's flock immediately, but under
pytest-xdist load there is a small window where it lags. The fresh peer's
startup then races and panics. After the panic `/readyz` never returns
200, so neither 30s nor 60s rescues the test.

Fix: add a `wait_for_wal_unlocked` helper that polls both the consensus
WAL directory and the local-shard WAL directory with the same exclusive
non-blocking flock that qdrant uses, and call it after every `p.kill()`
that is followed by a `start_peer` on the same `peer_dir`. The two
60s timeouts are restored to the default 30s now that the underlying
race is gone.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test: fail fast if sync restart crashes in dirty-shard test

The sync restart in `test_dirty_shard_survives_update_collection` used
plain `wait_for_peer_online(sync_uri)`, which only polls `/readyz`. If
the freshly started peer panics on startup (e.g. WAL `WouldBlock`), the
test waits the full 30s timeout and then reports a `/readyz` timeout
instead of the actual panic message and exit code.

Switch the sync restart to `wait_for_peer_online_or_crash(...)` (same
helper already used for the dirty restart). On crash it dumps the peer
log tail so the next CI failure shows the real reason instead of a
generic timeout.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test: drop speculative WAL flock wait, keep crash-detection switch

Reverts the `wait_for_wal_unlocked` helper that polled the WAL
directories with non-blocking flock. The kernel-side flock race it was
guarding against could not be reproduced in isolation (0/300 iters of
SIGKILL+wait+re-flock on bare Linux), so it was speculative.

Kept:
- Sync restart now uses `wait_for_peer_online_or_crash(...)` instead of
  plain `wait_for_peer_online`, so any startup panic surfaces fast with
  the actual log tail instead of hiding behind a 30s `/readyz` timeout.
- The two 60s timeouts bumped in #9124 are restored to the default 30s.

If the flake recurs in CI, the new failure output will tell us the
actual cause (panic message + exit code), which is more useful than
papering over it.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor Agent <agent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants