feat(jepsen): complete kubevirt release gate#71
Conversation
Stabilize the KubeVirt Jepsen runner, add the final gate report, and keep the generated artifacts repo-local but ignored while preserving the fully green 15-run matrix evidence.
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
WalkthroughThis pull request enhances observability and introduces KubeVirt testbed infrastructure. Changes include: adding comprehensive logging throughout protocol handling, replica state transitions, and expiration processing in the allocdb-node binary; introducing a new public kubevirt_testbed module for KubeVirt cluster configuration and layout management; expanding documentation with KubeVirt-specific testing workflows and Jepsen gate reports; and refactoring protocol request handling into dedicated functions. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (5)
crates/allocdb-node/src/replica.rs (1)
1676-1678: Consider consolidating duplicateformat_optional_lsnimplementations.This function has the same signature and implementation as the one in
crates/allocdb-node/src/bin/allocdb-jepsen.rs(lines 3826-3828 per relevant code snippets). Consider extracting this to a shared utility module to avoid duplication, though this is a minor concern given the function's simplicity.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/replica.rs` around lines 1676 - 1678, The helper function format_optional_lsn (fn format_optional_lsn(lsn: Option<Lsn>) -> String) is duplicated; extract it into a shared utility (e.g., a new module or existing utils lib) and replace the local implementations with a single use/import of that function; ensure the new shared function has the same signature and uses Lsn::get().to_string(), update both references (the one in replica.rs and the one in allocdb-jepsen.rs) to call the consolidated function, and add any needed pub(crate) visibility and module imports so the callers compile.docs/kubevirt-jepsen-report.md (1)
114-124: Consider clarifying that artifact links are local-only references.The links to
../.artifacts/kubevirt-jepsen-20260315/paths reference git-ignored local artifacts. These links will not resolve for users who clone the repository fresh. Consider adding a note that these are representative path examples for local development, not clickable links in the repository.📝 Suggested clarification
## Representative Evidence +Note: The paths below reference local `.artifacts/` directories that are git-ignored. They serve as +representative examples for developers who have run the gate locally. + Representative final status files: - [`reservation_contention-crash-restart-status.txt`](../.artifacts/kubevirt-jepsen-20260315/lane-a/reservation_contention-crash-restart-status.txt)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/kubevirt-jepsen-report.md` around lines 114 - 124, The artifact links under the representative final artifact bundles section point to ../.artifacts/kubevirt-jepsen-20260315/... which are local, git-ignored build artifacts and won't resolve for fresh clones; update the docs/kubevirt-jepsen-report.md text near those bullet lists to add a short clarifying note that these are local-only representative paths (e.g., "These links point to local, git-ignored artifacts in .artifacts and will not be accessible from a fresh clone; see instructions below to reproduce or where to find uploaded bundles."), and optionally suggest how readers can obtain the artifacts (reproduce the run, or link to external storage) so users know why the links may be broken.crates/allocdb-node/src/kubevirt_testbed.rs (1)
147-152: Keepload()read-only.Right now
load()goes throughnew(), which creates the workspace on disk and re-absolutizes relative paths against the current working directory. For a read path, that’s a surprising side effect and it can mask a typoed or stale archived layout. I’d split validation/normalization from on-disk workspace preparation.Also applies to: 271-284, 324-368
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/kubevirt_testbed.rs` around lines 147 - 152, load() currently triggers on-disk workspace creation and path re-absolutization via new(); change load() to be pure/read-only by having it only read the file and call decode_layout(&bytes) (or a new pure parse function) without creating the workspace or mutating paths; extract the validation/normalization logic currently embedded in new() into a separate pure method (e.g., normalize_or_validate_layout or normalize_paths) that new() will call before performing on-disk workspace preparation, and update the code paths in the other affected areas (the blocks around the original new() usage at the ranges noted: 271-284 and 324-368) to use the new pure normalization function where they should not touch disk, reserving on-disk creation exclusively for a clearly named method like create_workspace_on_disk().crates/allocdb-node/src/bin/allocdb-local-cluster.rs (2)
811-899: Please add direct regression coverage for the extracted protocol helpers.This refactor moves substantive behavior into
handle_prepare_protocol_request,handle_commit_protocol_request, andcommit_tick_entry, but the new tests only pin ACK matching and oversized reads. A few focused negative-path tests around view mismatch, append/commit rejection, and quorum-unavailable tick commits would make this much safer to evolve.As per coding guidelines
**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.Also applies to: 1065-1110, 2304-2357
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-local-cluster.rs` around lines 811 - 899, Add regression tests that directly exercise the new protocol helper functions to cover negative paths: write unit/integration tests that call handle_prepare_protocol_request, handle_commit_protocol_request, and commit_tick_entry (or the public API that triggers them) to assert behavior on view mismatch, append_prepared_entry rejection, commit_prepared_through rejection, and tick-commit when quorum is unavailable; for each test, construct a ReplicaNode/LocalClusterReplicaConfig in the failing state, invoke the helper (or simulate the incoming protocol message), assert the returned Err (and/or ReplicaProtocolResponse) matches the expected protocol_error_message or view-mismatch string, and verify side-effects/logs/metadata (e.g., no commit applied, commit_lsn unchanged) so these negative-paths are explicitly covered alongside the existing ACK/oversize tests.
118-121: Use leveled logging here, but keep the replication hot path out ofinfo.
init_logging()defaults toinfo, so every prepare/commit/quorum/tick trace below becomes always-on during Jepsen runs, while several rejection paths still bypass the logger witheprintln!. That will make long-lane artifacts noisy and hard to filter. I’d keep startup/role/view transitions atinfo, move per-request traces todebug, and route expected rejections/quorum loss throughwarn!/error!.As per coding guidelines
**/*.rs: Use the right log level:infofor meaningful lifecycle and state-transition events;debugfor detailed execution traces useful in development. Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.Also applies to: 828-847, 862-893, 908-923, 958-969, 983-998, 1021-1049, 1098-1108, 1274-1289, 1329-1407
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-local-cluster.rs` around lines 118 - 121, init_logging currently sets a global default that makes hot-path traces always-on; update the init_logging function (the env_logger::Builder::from_env call) to use a global default of "info" but add module-level filters so per-request/replication hot-path logs are at "debug" (use Builder::filter_module or parse_filters to set e.g. "allocdb_node::replication=debug" and any request-handling modules to debug) and keep lifecycle/startup paths at info; also ensure expected rejection paths in the codebase use warn!/error! (adjust those call sites) rather than eprintln! so logs follow the leveled configuration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/allocdb-node/src/bin/allocdb-local-cluster.rs`:
- Around line 1122-1130: The structs PrepareQuorumResult and
CommitBroadcastResult currently populate acked_peers by pre-seeding with
replica.replica_id (so acked_peers/acked_prepare_peers/commit_ack_peers include
the local replica), causing over-reporting of remote acknowledgements; fix by
initializing the acked_peers Vecs empty (or if semantics require including local
replica, rename the fields to something like all_acked_replicas) and ensure code
paths that push acknowledgements (the logic around replica.replica_id) only add
remote replicas so that acked_replicas and acked_peers accurately represent
remote ack counts and peers (check usages at the
PrepareQuorumResult/CommitBroadcastResult construction sites and the consumers
acked_prepare_peers and commit_ack_peers).
In `@crates/allocdb-node/src/kubevirt_testbed.rs`:
- Around line 114-139: new() currently validates replica_guests but allows a
control_guest that may carry a replica_id; add a guard in KubevirtTestbed::new
(and the other constructors/creators that mirror this pattern, e.g., the
functions around the other affected ranges) to reject any KubevirtGuestConfig
where control_guest.replica_id.is_some(), returning a
KubevirtTestbedLayoutError; reuse or mirror the validation logic from
validate_replica_guests (or extract a helper) so that control_guest with a
non-None replica_id is treated as invalid and the error prevents creating the
layout, ensuring persist() / load() round-trips remain consistent.
- Around line 185-245: The cluster builder maps fixed ports onto each guest IP
causing duplicate socket addresses when multiple replicas share the same IP;
update validation to detect and reject overlapping addresses by checking
LocalClusterLayout.replicas' LocalClusterReplicaConfig.control_addr, client_addr
and protocol_addr for duplicates (in validate_replica_guests or immediately
after build_guest_local_cluster_layout) and return an error (or panic) with a
clear message listing the conflicting ReplicaId(s) and address(es) so callers
cannot create an invalid LocalClusterLayout.
In `@docs/testing.md`:
- Around line 447-455: Update the documented command examples (the jepsen CLI
listings that include flags like the --output-root <artifacts> usage such as in
the allocdb-jepsen commands shown and the other occurrences around lines
referenced) to explicitly state and recommend the repo-local output root
`.artifacts/` (e.g., replace or annotate `--output-root <artifacts>` with
`--output-root .artifacts/` and add a short note that evidence should be
archived under the repo-local `.artifacts/` directory). Ensure you update every
occurrence (including the other block noted) of `--output-root <artifacts>` in
this section so docs match the PR-standardized location and avoid archiving
outside the ignored path.
---
Nitpick comments:
In `@crates/allocdb-node/src/bin/allocdb-local-cluster.rs`:
- Around line 811-899: Add regression tests that directly exercise the new
protocol helper functions to cover negative paths: write unit/integration tests
that call handle_prepare_protocol_request, handle_commit_protocol_request, and
commit_tick_entry (or the public API that triggers them) to assert behavior on
view mismatch, append_prepared_entry rejection, commit_prepared_through
rejection, and tick-commit when quorum is unavailable; for each test, construct
a ReplicaNode/LocalClusterReplicaConfig in the failing state, invoke the helper
(or simulate the incoming protocol message), assert the returned Err (and/or
ReplicaProtocolResponse) matches the expected protocol_error_message or
view-mismatch string, and verify side-effects/logs/metadata (e.g., no commit
applied, commit_lsn unchanged) so these negative-paths are explicitly covered
alongside the existing ACK/oversize tests.
- Around line 118-121: init_logging currently sets a global default that makes
hot-path traces always-on; update the init_logging function (the
env_logger::Builder::from_env call) to use a global default of "info" but add
module-level filters so per-request/replication hot-path logs are at "debug"
(use Builder::filter_module or parse_filters to set e.g.
"allocdb_node::replication=debug" and any request-handling modules to debug) and
keep lifecycle/startup paths at info; also ensure expected rejection paths in
the codebase use warn!/error! (adjust those call sites) rather than eprintln! so
logs follow the leveled configuration.
In `@crates/allocdb-node/src/kubevirt_testbed.rs`:
- Around line 147-152: load() currently triggers on-disk workspace creation and
path re-absolutization via new(); change load() to be pure/read-only by having
it only read the file and call decode_layout(&bytes) (or a new pure parse
function) without creating the workspace or mutating paths; extract the
validation/normalization logic currently embedded in new() into a separate pure
method (e.g., normalize_or_validate_layout or normalize_paths) that new() will
call before performing on-disk workspace preparation, and update the code paths
in the other affected areas (the blocks around the original new() usage at the
ranges noted: 271-284 and 324-368) to use the new pure normalization function
where they should not touch disk, reserving on-disk creation exclusively for a
clearly named method like create_workspace_on_disk().
In `@crates/allocdb-node/src/replica.rs`:
- Around line 1676-1678: The helper function format_optional_lsn (fn
format_optional_lsn(lsn: Option<Lsn>) -> String) is duplicated; extract it into
a shared utility (e.g., a new module or existing utils lib) and replace the
local implementations with a single use/import of that function; ensure the new
shared function has the same signature and uses Lsn::get().to_string(), update
both references (the one in replica.rs and the one in allocdb-jepsen.rs) to call
the consolidated function, and add any needed pub(crate) visibility and module
imports so the callers compile.
In `@docs/kubevirt-jepsen-report.md`:
- Around line 114-124: The artifact links under the representative final
artifact bundles section point to ../.artifacts/kubevirt-jepsen-20260315/...
which are local, git-ignored build artifacts and won't resolve for fresh clones;
update the docs/kubevirt-jepsen-report.md text near those bullet lists to add a
short clarifying note that these are local-only representative paths (e.g.,
"These links point to local, git-ignored artifacts in .artifacts and will not be
accessible from a fresh clone; see instructions below to reproduce or where to
find uploaded bundles."), and optionally suggest how readers can obtain the
artifacts (reproduce the run, or link to external storage) so users know why the
links may be broken.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 721952da-9385-4193-9f72-c02be491cb17
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (13)
.gitignorecrates/allocdb-node/Cargo.tomlcrates/allocdb-node/src/bin/allocdb-jepsen.rscrates/allocdb-node/src/bin/allocdb-local-cluster.rscrates/allocdb-node/src/engine.rscrates/allocdb-node/src/kubevirt_testbed.rscrates/allocdb-node/src/lib.rscrates/allocdb-node/src/replica.rsdocs/README.mddocs/kubevirt-jepsen-report.mddocs/operator-runbook.mddocs/status.mddocs/testing.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level:errorfor invariant breaks, corruption, and failed operations that require intervention;warnfor degraded but expected conditions such as overload, lag, or rejected requests;infofor meaningful lifecycle and state-transition events;debugfor detailed execution traces useful in development;traceonly for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.
Files:
crates/allocdb-node/src/lib.rscrates/allocdb-node/src/engine.rscrates/allocdb-node/src/bin/allocdb-local-cluster.rscrates/allocdb-node/src/replica.rscrates/allocdb-node/src/kubevirt_testbed.rs
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.
Files:
docs/status.mddocs/kubevirt-jepsen-report.mddocs/README.mddocs/operator-runbook.mddocs/testing.md
docs/status.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep
docs/status.mdcurrent as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.
Files:
docs/status.md
🧠 Learnings (4)
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.rs : Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: `error` for invariant breaks, corruption, and failed operations that require intervention; `warn` for degraded but expected conditions such as overload, lag, or rejected requests; `info` for meaningful lifecycle and state-transition events; `debug` for detailed execution traces useful in development; `trace` only for very high-volume diagnostic detail.
Applied to files:
crates/allocdb-node/src/engine.rscrates/allocdb-node/src/bin/allocdb-local-cluster.rscrates/allocdb-node/src/replica.rs
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Use the GitHub Project `AllocDB` as the operational work board. Keep planned work on the board, not only in milestone pages or local docs.
Applied to files:
docs/status.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to docs/status.md : Keep [`docs/status.md`](./docs/status.md) current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.
Applied to files:
docs/status.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.rs : Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.
Applied to files:
crates/allocdb-node/src/bin/allocdb-local-cluster.rs
🧬 Code graph analysis (4)
crates/allocdb-node/src/engine.rs (4)
crates/allocdb-node/src/replica.rs (1)
get(36-38)crates/allocdb-node/src/bounded_queue.rs (1)
len(26-28)crates/allocdb-core/src/state_machine_metrics.rs (2)
config(31-33)last_applied_lsn(21-23)crates/allocdb-core/src/state_machine.rs (1)
deadline_slot(225-230)
crates/allocdb-node/src/bin/allocdb-local-cluster.rs (4)
crates/allocdb-node/src/local_cluster.rs (7)
replica(303-307)encode_role(1261-1269)from(69-71)from(119-121)from(188-190)from(233-235)from(1440-1462)crates/allocdb-node/src/jepsen.rs (2)
encode_role(1408-1416)kind(514-521)crates/allocdb-core/src/state_machine_metrics.rs (1)
last_applied_lsn(21-23)crates/allocdb-node/src/engine_observe.rs (1)
active_snapshot_lsn(43-45)
crates/allocdb-node/src/replica.rs (3)
crates/allocdb-node/src/replica_tests.rs (1)
identity(61-66)crates/allocdb-node/src/bin/allocdb-jepsen.rs (1)
format_optional_lsn(3827-3829)crates/allocdb-node/src/engine_observe.rs (1)
active_snapshot_lsn(43-45)
crates/allocdb-node/src/kubevirt_testbed.rs (1)
crates/allocdb-node/src/local_cluster.rs (2)
default_local_cluster_core_config(457-468)default_local_cluster_engine_config(471-477)
🔇 Additional comments (15)
.gitignore (1)
17-17: LGTM! Correctly excludes repo-local Jepsen artifacts.The pattern
/.artifacts/appropriately ignores the root-level artifacts directory while keeping Jepsen evidence repo-local. The placement in the "Test and benchmark artifacts" section is logical.crates/allocdb-node/src/engine.rs (3)
655-662: LGTM: Structured logging for expiration tick planning.The
infolevel is appropriate for this lifecycle event per coding guidelines. The log includes all relevant diagnostic fields (current_slot, request_slot, due_candidates, max_per_tick, queue_depth) in a structured format.
682-688: LGTM: Completion log complements the planning log.Logging both the planning and completion of expiration ticks provides good observability for diagnosing expiration processing issues.
1082-1095: LGTM: Per-command logging for internal expiration application.Logging individual expiration commands at
infolevel is appropriate given the operational significance of expiration state transitions. The conditional logging only forCommand::Expirevariants keeps the log focused.docs/README.md (1)
22-22: LGTM: Documentation index updated with new report link.The KubeVirt Jepsen Report link is correctly placed in the Engineering Docs section, maintaining consistency with the existing documentation structure.
docs/kubevirt-jepsen-report.md (1)
1-74: LGTM: Comprehensive and well-structured Jepsen gate report.The report clearly documents:
- Execution profile and topology
- What the gate proved (and importantly, what it does not prove)
- Artifact policy rationale
This provides valuable context for understanding the validation performed in this PR.
crates/allocdb-node/src/replica.rs (3)
1064-1072: LGTM: Appropriate warn-level logging for protocol rejections.Using
warnfor rejected role transitions aligns with coding guidelines (degraded but expected conditions). The structured format includes all relevant diagnostic context.
1637-1664: LGTM: Well-designed transition logging helper.The
log_metadata_transitionhelper consolidates transition logging with before/after state snapshots, reducing duplication acrossconfigure_normal_role,record_durable_vote, andenter_view_uncertain. The format provides clear visibility into state changes.
1498-1530: LGTM: Commit-through logging provides good operational visibility.The logging before and after commit operations captures the full state transition, which is valuable for debugging replication issues. The early-return log when commit is already satisfied helps trace no-op paths.
docs/status.md (1)
206-219: LGTM: Status accurately reflects the completed KubeVirt release gate.The status update correctly captures:
- The completed 15-run Jepsen matrix across 3 lanes
- Isolation improvements and debugging enhancements
- Clear articulation of the next decision point
This maintains
docs/status.mdas the authoritative progress snapshot per coding guidelines.docs/operator-runbook.md (3)
124-132: LGTM: New KubeVirt commands documented with consistent formatting.The new commands (
capture-kubevirt-layout,verify-kubevirt-surface,run-kubevirt,watch-kubevirt,watch-kubevirt-fleet,archive-kubevirt) follow the existing documentation patterns and provide operators with clear command surfaces.
134-136: LGTM: Fault window override documented with appropriate caveats.The documentation correctly warns that
ALLOCDB_JEPSEN_FAULT_WINDOW_SECS_OVERRIDEis for debugging only and that release-gate evidence requires the full 1800s minimum.
227-242: LGTM: Comprehensive KubeVirt operational guidance.The documentation covers:
- Watch commands for monitoring runs
- Fleet aggregation for multi-lane batches
- Storage and anti-affinity configuration
- Lane-scoped temp path handling
- Debugging guidance with fault window override
This provides operators with actionable guidance for the KubeVirt testbed.
crates/allocdb-node/Cargo.toml (1)
10-10: LGTM: env_logger dependency addition supports the logging infrastructure.The
env_loggercrate enables the structured logging introduced throughout the replica and engine code paths. This is appropriate for the operational observability goals of the PR.crates/allocdb-node/src/lib.rs (1)
5-5: LGTM: New public module for KubeVirt testbed support.The
kubevirt_testbedmodule is correctly exported as public, following the same pattern asqemu_testbedandlocal_cluster. The alphabetical ordering with other modules is maintained.
Summary
.artifacts/while leaving it ignored by gitValidation
./scripts/preflight.shlonghorn-strict-local-wffcprofileFollow-up