refactor(jepsen): split CLI into focused modules#72

Merged
skel84 merged 4 commits into main from issue-70-jepsen-refactor
Mar 15, 2026

Conversation

@skel84
Owner

@skel84 skel84 commented Mar 15, 2026

Summary

  • split the allocdb-jepsen binary into focused modules for args, watch/rendering, remote backends, run orchestration, cluster helpers, and scenario support
  • add the Jepsen refactor plan doc and keep the docs index/status snapshot aligned with issue #70
  • preserve CLI behavior while shrinking the root binary from a multi-thousand-line god file to a smaller dispatcher-oriented entrypoint

Validation

  • ./scripts/preflight.sh
  • short live KubeVirt smoke across the documented scenario set with ALLOCDB_JEPSEN_FAULT_WINDOW_SECS_OVERRIDE=180 for faulted runs

Break the allocdb-jepsen binary into smaller modules for args, runs, watch, remote backends, scenario support, and KubeVirt helpers while preserving behavior.

Add the refactor plan and keep the docs/status snapshot aligned with issue #70.

Refs #70
@coderabbitai

coderabbitai bot commented Mar 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f256be35-ec15-4a89-bde4-96c77d1a8605

📥 Commits

Reviewing files that changed from the base of the PR and between 8f95c51 and acbb408.

📒 Files selected for processing (7)
  • crates/allocdb-node/src/bin/allocdb-jepsen/events.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs
  • docs/jepsen-refactor-plan.md
📜 Recent review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.

Files:

  • docs/jepsen-refactor-plan.md
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: error for invariant breaks, corruption, and failed operations that require intervention; warn for degraded but expected conditions such as overload, lag, or rejected requests; info for meaningful lifecycle and state-transition events; debug for detailed execution traces useful in development; trace only for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.

Files:

  • crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/events.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs
🧠 Learnings (4)
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.md : Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.

Applied to files:

  • docs/jepsen-refactor-plan.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Keep the system buildable after each chunk whenever feasible. If a larger refactor is required, stage it as a short sequence of intermediate, verifiable steps.

Applied to files:

  • docs/jepsen-refactor-plan.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.

Applied to files:

  • crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.rs : Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: `error` for invariant breaks, corruption, and failed operations that require intervention; `warn` for degraded but expected conditions such as overload, lag, or rejected requests; `info` for meaningful lifecycle and state-transition events; `debug` for detailed execution traces useful in development; `trace` only for very high-volume diagnostic detail.

Applied to files:

  • crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs
🧬 Code graph analysis (4)
crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (9)
crates/allocdb-node/src/jepsen.rs (5)
  • analyze_history (744-753)
  • create_artifact_bundle (784-833)
  • load_history (686-690)
  • persist_history (698-703)
  • render_analysis_report (756-775)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
  • prepare_kubevirt_helper (146-169)
  • value (288-290)
crates/allocdb-node/src/bin/allocdb-jepsen/scenarios.rs (2)
  • execute_control_run (19-41)
  • execute_crash_restart_run (43-86)
crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs (1)
  • execute_mixed_failover_run (64-102)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (1)
  • load_qemu_layout (14-18)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (1)
  • ensure_runtime_cluster_ready (54-104)
crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (3)
  • workspace_root (703-705)
  • new (654-695)
  • backend_name (699-701)
crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs (6)
  • workspace_root (28-28)
  • workspace_root (48-50)
  • workspace_root (79-81)
  • backend_name (27-27)
  • backend_name (44-46)
  • backend_name (75-77)
crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (2)
  • value (50-51)
  • value (61-62)
crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs (6)
crates/allocdb-node/src/kubevirt_testbed.rs (1)
  • kubevirt_testbed_layout_path (162-164)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
  • load_kubevirt_layout (140-144)
  • prepare_kubevirt_helper (146-169)
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (1)
  • request_remote_control_status (23-44)
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (7)
  • maybe_load_run_status_snapshot (385-402)
  • run_events_path (299-301)
  • run_status_path (295-297)
  • snapshot (208-210)
  • new (29-38)
  • new (165-206)
  • encode_tracker_field (442-444)
crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (3)
  • parse_watch_event_line (131-144)
  • render_kubevirt_fleet_watch (64-114)
  • render_kubevirt_watch (17-62)
crates/allocdb-node/src/replica.rs (1)
  • status (959-961)
crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (4)
crates/allocdb-node/src/api_codec.rs (2)
  • decode_response (165-210)
  • encode_response (96-158)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (2)
  • send_remote_api_request (20-29)
  • run_remote_tcp_request (31-38)
crates/allocdb-node/src/api.rs (1)
  • from_client_request (42-47)
crates/allocdb-core/src/result.rs (1)
  • with_reservation (41-51)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
crates/allocdb-node/src/kubevirt_testbed.rs (1)
  • kubevirt_testbed_layout_path (162-164)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (2)
  • build_remote_tcp_probe_command (40-49)
  • encode_hex (112-119)
🔇 Additional comments (7)
docs/jepsen-refactor-plan.md (1)

1-131: LGTM!

The refactor plan documentation is well-structured with clear goals, constraints, non-goals, and acceptance criteria. The Target Shape section (lines 26-47) accurately reflects the module structure implemented in this PR, including the actual modules like tracker.rs, support.rs, surface.rs, watch_render.rs, runs.rs, etc.

crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (1)

1-551: LGTM!

This module is well-implemented with proper error handling throughout. Key improvements from past reviews have been addressed:

  • Child process reaping on stdin-write failures (lines 187-208, 364-382)
  • Shell quoting via shell_single_quote for path safety (lines 437-443, 474-487)
  • BatchMode=yes added to SSH args (lines 538-539)
  • kubevirt_helper_phase now returns errors for non-NotFound failures (lines 312-322)

The guard pattern for KubevirtHelperGuard::drop and systematic error propagation throughout are solid.
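As a minimal, self-contained sketch of that guard pattern (the struct name comes from the review; the fields, and the atomic flag standing in for the real kubectl delete, are illustrative only):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for the real `kubectl delete` side effect so the sketch is
// self-contained and observable.
static CLEANED_UP: AtomicBool = AtomicBool::new(false);

// Illustrative RAII guard: cleanup runs in Drop, so the binding must stay
// alive for the helper pod's whole lifetime.
struct KubevirtHelperGuard {
    pod_name: String,
    delete_on_drop: bool,
}

impl Drop for KubevirtHelperGuard {
    fn drop(&mut self) {
        if self.delete_on_drop {
            // Drop cannot propagate errors; a real implementation would log
            // a failed delete rather than panic.
            eprintln!("deleting helper pod {}", self.pod_name);
            CLEANED_UP.store(true, Ordering::SeqCst);
        }
    }
}

fn main() {
    {
        // Underscore-prefixed so it compiles without warnings, but the
        // binding must stay alive: dropping it early would tear down the
        // helper pod mid-run.
        let _kubevirt_helper = KubevirtHelperGuard {
            pod_name: "allocdb-jepsen-helper".to_string(),
            delete_on_drop: true,
        };
        println!("probing guest VMs while the helper pod exists");
    } // guard dropped here; cleanup runs
    assert!(CLEANED_UP.load(Ordering::SeqCst));
}
```

The trade-off of Drop-based cleanup is that errors cannot surface to the caller, which is why best-effort logging inside `drop` is the usual choice.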

crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (1)

1-1001: LGTM!

This module provides comprehensive Jepsen event handling with solid error propagation and test coverage. Key observations:

  • Resource ID mismatch validation added at lines 300-306 (addresses past review)
  • drain_expiration_until_resource_available properly bounds iterations with MAX_EXPIRATION_RECOVERY_DRAIN_TICKS and returns errors appropriately
  • Test suite (lines 633-1001) includes a FakeExternalTestbed implementation and covers key scenarios: result code mapping, retry exhaustion, resource availability detection, and resource ID mismatch rejection
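
The bounded drain in the second bullet can be sketched as a capped retry loop; the constant name is taken from the review, while the signature and the closure-based tick are assumptions:

```rust
// Cap the number of recovery ticks so a stuck resource produces an error
// instead of an infinite loop. The constant name follows the review.
const MAX_EXPIRATION_RECOVERY_DRAIN_TICKS: u32 = 32;

// `tick` advances the expiration machinery once and reports whether the
// resource has become available.
fn drain_until_available<F>(mut tick: F) -> Result<u32, String>
where
    F: FnMut() -> bool,
{
    for ticks in 0..MAX_EXPIRATION_RECOVERY_DRAIN_TICKS {
        if tick() {
            return Ok(ticks + 1); // number of ticks it took
        }
    }
    Err("resource did not become available within the drain bound".to_string())
}

fn main() {
    let mut attempts = 0;
    let result = drain_until_available(|| {
        attempts += 1;
        attempts >= 2 // resource becomes available on the second tick
    });
    assert_eq!(result, Ok(2));
}
```
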
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (1)

1-481: LGTM!

This module provides robust run tracking with proper encoding/decoding for persisted fields. Key improvements from past reviews:

  • encode_tracker_field at line 443 properly escapes backslashes before newlines
  • decode_tracker_field at lines 446-467 now handles escaped backslashes symmetrically (line 454-455)
  • Round-trip test at lines 473-480 validates both backslash and newline handling

The RequestNamespace design using time-based client IDs and saturating arithmetic is sound for avoiding collisions.
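The escaping scheme those bullets describe can be sketched as a symmetric encode/decode pair. Names follow the review, but the real functions may differ; the key point is that backslashes must be escaped before newlines, otherwise a literal backslash-n in the input would decode ambiguously:

```rust
// Encode: escape backslashes first, then newlines, so every '\' in the
// output is the start of a two-character escape sequence.
fn encode_tracker_field(value: &str) -> String {
    value.replace('\\', "\\\\").replace('\n', "\\n")
}

// Decode in a single left-to-right pass, inverting the encoding.
fn decode_tracker_field(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    let mut chars = value.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('n') => out.push('\n'),
                Some('\\') => out.push('\\'),
                Some(other) => out.push(other), // tolerate unknown escapes
                None => out.push('\\'),         // trailing lone backslash
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    let original = "error: disk full\nretry \\ scheduled";
    let encoded = encode_tracker_field(original);
    assert!(!encoded.contains('\n')); // safe as a single status-file line
    assert_eq!(decode_tracker_field(&encoded), original);
}
```
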

crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (1)

1-951: LGTM!

This module provides a comprehensive terminal watch UI with proper handling of interactive vs non-interactive output. Key improvements from past reviews:

  • Screen clearing only when interactive (lines 29-31, 81-83)
  • Multiline text flattened via flatten_watch_text before rendering (lines 212, 219, 544-547, 590)
  • Event details truncated in single-watch mode (lines 542-548) with truncate_for_watch
  • watch_interactive_output() and watch_color_enabled() properly check TTY and NO_COLOR (lines 644-650)

The rendering helpers are well-organized with consistent styling patterns.

crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (1)

1-418: LGTM!

This module provides solid run orchestration with proper error handling and fault window enforcement. Key improvements from past reviews:

  • effective_minimum_fault_window_secs (lines 117-123) now uses early return (?) to ensure the override only applies to runs that already have a fault window, not control runs
  • tracker.complete() is called at line 218 before enforce_minimum_fault_window at line 219, ensuring completion metadata is persisted even if the minimum-window check fails
  • Redundant tracker.fail for release gate blocked was removed (lines 224-228 just return Err)
  • Regression test at lines 397-409 validates that fault window override doesn't change control runs
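
The early-return behavior in the first bullet can be sketched as follows; the function name comes from the review, while the Option-based signature and the override semantics are assumptions:

```rust
// The env override (e.g. ALLOCDB_JEPSEN_FAULT_WINDOW_SECS_OVERRIDE) should
// only apply to runs that already define a fault window. Control runs carry
// None, so the `?` early return leaves them untouched.
fn effective_minimum_fault_window_secs(
    base_fault_window_secs: Option<u64>,
    override_secs: Option<u64>,
) -> Option<u64> {
    // Early return: a control run has no fault window, so no override applies.
    let base = base_fault_window_secs?;
    Some(override_secs.unwrap_or(base))
}

fn main() {
    // Control run: override ignored.
    assert_eq!(effective_minimum_fault_window_secs(None, Some(180)), None);
    // Faulted run: override shortens the window.
    assert_eq!(effective_minimum_fault_window_secs(Some(600), Some(180)), Some(180));
}
```
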
crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs (1)

1-372: LGTM!

This module provides robust watch functionality with proper control flow and error handling. Key improvements from past reviews:

  • One-shot mode now exits immediately via should_stop_watch returning true when follow == false (line 267)
  • Lane preparation retries on failure since errors don't set contexts[index] to Some (lines 95-100)
  • Real lane errors are preserved and displayed, with a fallback message only when no specific error exists (lines 117-122)
  • Test coverage (lines 314-371) validates:
    • One-shot vs follow mode termination logic
    • Event limit enforcement (keeps only last N events)
    • Graceful degradation on missing files and malformed lines

The collect_replica_watch_snapshots function properly handles nested error cases without panicking, returning partial results with per-replica error fields.
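The one-shot termination and event-limit behaviors the tests above are said to pin down can be sketched as two small helpers; the function names echo the review text, and the real signatures are assumptions:

```rust
// One-shot mode (follow == false) exits after the first render; follow mode
// never stops on its own.
fn should_stop_watch(follow: bool, renders_done: u32) -> bool {
    !follow && renders_done >= 1
}

// Keep only the most recent `limit` events, dropping the oldest entries.
fn keep_last_events(mut events: Vec<String>, limit: usize) -> Vec<String> {
    if events.len() > limit {
        events.drain(..events.len() - limit);
    }
    events
}

fn main() {
    assert!(should_stop_watch(false, 1));
    assert!(!should_stop_watch(true, 100));
    let events: Vec<String> = (1..=5).map(|i| format!("event {i}")).collect();
    assert_eq!(
        keep_last_events(events, 2),
        vec!["event 4".to_string(), "event 5".to_string()]
    );
}
```
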


Summary by CodeRabbit

  • New Features
    • Major Jepsen refactor: modular test runner, QEMU/KubeVirt backends, runtime probing, remote probing, nemeses/scenarios, run tracking, archiving, and watch/fleet support with terminal UI.
  • Bug Fixes
    • SSH client now includes fail-fast options for more robust remote commands.
  • Documentation
    • Added Jepsen refactor plan and updated project status notes.

Walkthrough

Refactors the monolithic allocdb-jepsen binary into many focused modules (args, cluster, common, events, kubevirt, nemesis, remote, runs, runtime, scenarios, support, surface, tracker, watch, watch_render), adding CLI parsing, test orchestration, QEMU/KubeVirt integrations, a terminal watch UI, run tracking, and supporting utilities; documentation updated with a Jepsen refactor plan.

Changes

Cohort / File(s) Summary
CLI and Argument Parsing
crates/allocdb-node/src/bin/allocdb-jepsen/args.rs
New CLI parser and ParsedCommand enum with per-subcommand parsers, helpers (workspace/history flags, lane spec parsing), usage text and validation; returns structured commands or usage-tagged errors.
Runtime Cluster Orchestration
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs
Replica lifecycle and runtime orchestration: probing/wait helpers, crash/restart/isolate/heal, staged-replica summaries and rewrite logic, failover and rejoin workflows, plus tests.
Utilities and Support
crates/allocdb-node/src/bin/allocdb-jepsen/common.rs, crates/allocdb-node/src/bin/allocdb-jepsen/support.rs
Small typed parsers, atomic text I/O, time/ID utilities (common.rs); staging, tar import/export, HistoryBuilder, RunExecutionContext, and staged-replica workspace management (support.rs).
Jepsen Event Modeling
crates/allocdb-node/src/bin/allocdb-jepsen/events.rs
Remote API request/response handling, outcome classification, reserve/tick/read event implementations, expiration-drain plans, and mapping of node responses to Jepsen events (new module).
Test Scenarios & Nemesis
crates/allocdb-node/src/bin/allocdb-jepsen/scenarios.rs, crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs
Scenario executors and nemesis sequences for workload families: control/crash-restart runs and partition-heal/mixed-failover flows producing Jepsen history events.
Remote / QEMU Testbed
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs, crates/allocdb-node/src/bin/allocdb-qemu-testbed.rs
QEMU remote TCP/SSH probe helpers, API encoding/decoding, probe validation/commit extraction, SSH arg updates (fail-fast options) and corresponding tests.
KubeVirt Support
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs
KubeVirt layout capture, helper-pod lifecycle, SSH/tcp remote execution against guest VMs, layout loading and helper guard management.
Run Orchestration & Analysis
crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs, crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs
End-to-end run orchestration, archive/verify flows for QEMU & KubeVirt, run-spec resolution, fault-window enforcement, history persistence, run-tracking and status snapshot management (file-backed RunTracker).
Runtime Probing
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs
RuntimeReplicaProbe/Topology types, remote control status requests, live-role enrichment, probe summaries, and helpers to select active replicas.
Surface Verification
crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs
Introduces ExternalTestbed trait with QEMU/KubeVirt implementations and a unified external-surface verification flow including protocol probes and resource checks.
Watch & Monitoring
crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs, crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs
KubeVirt lane/fleet watch loops, per-replica metrics collection, recent-event loading, and a terminal colorized UI renderer with progress bars, tables, and event formatting.
Documentation
docs/README.md, docs/jepsen-refactor-plan.md, docs/status.md
Adds Jepsen refactor plan doc and README link; updates status note to track maintainability follow-up.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI
    participant RunTracker
    participant Orchestrator as Runs/Scenarios/Nemesis
    participant Testbed as ExternalTestbed (QEMU/KubeVirt)
    participant Replica
    CLI->>RunTracker: start run (run_id, output_root)
    RunTracker->>Orchestrator: set phase / provide RunExecutionContext
    Orchestrator->>Testbed: load layout / prepare helper
    Orchestrator->>Testbed: run_remote_host_command / run_remote_tcp_request
    Testbed->>Replica: SSH/TCP probe / control command
    Replica-->>Testbed: status/metrics/ApiResponse
    Testbed-->>Orchestrator: response bytes
    Orchestrator->>RunTracker: append history events / update status
    RunTracker->>CLI: write status, artifacts, completion
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 7.69%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The PR title clearly and concisely describes the main change: refactoring the jepsen binary into focused modules, which matches the primary objective of the changeset.
  • Description check: ✅ Passed. The PR description includes a summary, validation steps, and addresses the required sections comprehensively, matching the repository template structure.


@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 13

🧹 Nitpick comments (7)
docs/jepsen-refactor-plan.md (1)

20-24: Optional: Consider varying sentence structure in Non-Goals.

Three consecutive bullets start with "No" (lines 20-22), which creates slight repetition. This is a minor stylistic point and doesn't affect clarity.

✍️ Optional rewrite for variety
 ## Non-Goals
 
-- No new Jepsen workloads or backend features.
-- No Hetzner backend work in this task.
-- No analyzer/history model redesign in [`jepsen.rs`](../crates/allocdb-node/src/jepsen.rs)
+- New Jepsen workloads or backend features are out of scope.
+- Hetzner backend work is deferred.
+- Analyzer/history model redesign in [`jepsen.rs`](../crates/allocdb-node/src/jepsen.rs)
   unless a minimal extraction requires a purely mechanical move.
-- No dedicated `allocdb-validation` crate yet.
+- Dedicated `allocdb-validation` crate is deferred to a later decision point.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/jepsen-refactor-plan.md` around lines 20 - 24, Edit the "Non-Goals"
bullet list to reduce repetition by rephrasing at least two of the three
consecutive bullets that start with "No" (for example, change one to "Does not
include ..." and another to "This task will not cover ...") while keeping the
same meanings; specifically update the bullets referencing "Jepsen workloads or
backend features", "Hetzner backend work", and the mention of "`jepsen.rs`" so
that one or two use alternative sentence starts for variety but retain the
original constraints and links.
crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs (1)

138-144: Kubevirt helper guard is held but named _helper.

The _helper variable holds a KubevirtHelperGuard that likely performs cleanup on drop. The underscore prefix conventionally suggests an unused variable, but here it's essential to keep the guard alive. Consider renaming to _kubevirt_helper or adding a comment to clarify intent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs` around lines 138 -
144, The local binding _helper in verify_kubevirt_surface holds a
KubevirtHelperGuard returned by prepare_kubevirt_helper and must be kept alive
for its Drop cleanup, but the underscore prefix falsely signals it's unused;
rename _helper to a clear binding like kubevirt_helper (or _kubevirt_helper)
and/or add an inline comment above the binding noting "keep guard alive for
cleanup on drop" so the intent is explicit while still preventing
unused-variable warnings.
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (1)

46-60: Side effect: runtime_replica_probes mutates cloned replica configs.

The function clones each replica and updates replica.role based on remote status. This is a useful enrichment but the function name doesn't hint at mutation. Consider documenting this behavior or renaming to runtime_replica_probes_with_updated_roles.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs` around lines 46 - 60,
The function runtime_replica_probes currently clones each replica and mutates
the clone's role based on request_remote_control_status before returning
RuntimeReplicaProbe, which is a surprising side-effect; either rename the
function to runtime_replica_probes_with_updated_roles or document the mutation
in the function's docstring and public API, and ensure callers expecting
unchanged replica data are updated to use the new name or behavior; locate
runtime_replica_probes, the mapping that sets replica.role from status (via
request_remote_control_status) and change the identifier and/or add a comment
explaining the enrichment so the mutation is explicit to users.
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (1)

77-80: Tracker errors silently ignored in HistoryBuilder::push.

If set_history_events or append_event fails, the error is discarded with let _ =. This could hide I/O failures during history recording. Consider logging at warn level when these operations fail.

🔧 Proposed improvement
         if let Some(tracker) = &self.tracker {
-            let _ = tracker.set_history_events(self.next_sequence);
-            let _ = tracker.append_event(&format!("history sequence={}", self.next_sequence));
+            if let Err(e) = tracker.set_history_events(self.next_sequence) {
+                log::warn!("failed to update history events count: {e}");
+            }
+            if let Err(e) = tracker.append_event(&format!("history sequence={}", self.next_sequence)) {
+                log::warn!("failed to append history event: {e}");
+            }
         }

As per coding guidelines: "Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: warn for degraded but expected conditions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/support.rs` around lines 77 - 80,
The calls to tracker.set_history_events and tracker.append_event in
HistoryBuilder::push currently swallow errors with `let _ =`, hiding failures;
change these to check the Result and log a warning on Err using the project's
logging facility (e.g., log::warn! or tracing::warn!), referencing the same
context (include the sequence number and the operation name). Specifically,
replace the `let _ = tracker.set_history_events(self.next_sequence)` and `let _
= tracker.append_event(...)` with error-aware handling that logs a warn message
containing the error and the history sequence (and continue operation if
desired).
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (1)

273-276: Event detail field is not escaped.

Similar to last_error, the detail string is written directly without escaping. A detail containing = or newlines would make the log harder to parse reliably.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs` around lines 273 -
276, The append_event function writes detail directly into the log causing
parsing issues for values containing '=' or newlines; update append_event (and
the formatted line creation using events_path and append_text_line) to escape or
serialize the detail string (e.g., replace newlines, escape '=' and backslashes,
or use a quoted/percent-encoded representation) before embedding it in
format!("time_millis={} detail={detail}\n", ...), mirroring how last_error is
handled so the output remains unambiguous and parseable.
crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs (1)

429-434: JepsenExpiredReservation constructed inline multiple times.

The same JepsenExpiredReservation is constructed identically in multiple places within the same function. Consider extracting to a local variable for clarity.

Also applies to: 702-707

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs` around lines 429 -
434, The code repeatedly constructs the same JepsenExpiredReservation inline;
extract a single local variable (e.g., let expired_reservation =
JepsenExpiredReservation { resource_id, holder_id: 606, reservation_id:
reserve_commit.reservation_id.get(), released_lsn: None };) near the first use
and reuse that variable for subsequent places (including the other occurrence
around lines 702-707) instead of reconstructing it to improve clarity and avoid
duplication; ensure you reference the same reservation_id expression
(reserve_commit.reservation_id.get()) and keep released_lsn as None.
crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (1)

327-329: The hardcoded /var/lib/allocdb-qemu path is intentional and used for both QEMU and KubeVirt backends, but the "qemu" naming is misleading.

The fetch_external_logs_archive function is generic across ExternalTestbed implementations and is called by both backends. The directory path is consistent with the allocdb-qemu-control script, which also names itself with "qemu" despite being deployed to both QEMU VMs and KubeVirt Pods. The directory is created on-demand by the control script's collect-logs function via mkdir -p.

Consider renaming the script and base directory to use a backend-agnostic name (e.g., allocdb-control, /var/lib/allocdb-control) to clarify that both backends share the same design, or add a comment documenting why "qemu" is retained as the naming convention across both backends.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs` around lines 327 - 329,
The hardcoded path assigned to remote_dir inside fetch_external_logs_archive
uses "/var/lib/allocdb-qemu" which is misleading for a backend-agnostic routine;
either rename the directory and related script references to a neutral name
(e.g., "/var/lib/allocdb-control" and "allocdb-control") across the codebase and
deployment scripts, or add a clarifying comment where remote_dir is built (and
in the allocdb-qemu-control script) explaining that the "qemu" name is
historical and intentionally shared by both QEMU and KubeVirt backends; update
any references to allocdb-qemu-control/remote_dir in deploy/collect-logs logic
to match the chosen approach so names remain consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/allocdb-node/src/bin/allocdb-jepsen/args.rs`:
- Around line 76-79: The match arm for "plan" currently returns
ParsedCommand::Plan without validating any trailing arguments; update the
parsing logic so that when subcommand == "plan" you verify there are no extra
CLI arguments (e.g., check the remaining args iterator or slice used by the
parser) and return an Err (parse error) if any unknown/trailing flags are
present instead of accepting them silently; modify the "plan" arm that now
returns ParsedCommand::Plan to perform that check before returning to ensure
typos like `allocdb-jepsen plan --workspace foo` fail fast.
- Around line 71-115: Add comprehensive negative-path unit tests for the CLI by
exercising parse_args and parse_watch_kubevirt_lane_spec: assert that parse_args
returns Err for unknown subcommands (e.g., "foo"), returns Help when no
subcommand is provided, and that commands requiring flags (e.g., Analyze,
VerifyQemuSurface/VerifyKubevirtSurface) return Err when their required flags
are missing or malformed; separately add tests for
parse_watch_kubevirt_lane_spec to assert Err on empty parts, too few
comma-separated fields, or whitespace-only fields and Ok on well-formed
"<name,workspace,output-root>" inputs. Use the public function names parse_args
and parse_watch_kubevirt_lane_spec to locate code and cover all reported
negative cases and regression scenarios.
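
A hedged sketch of the lane-spec validation those tests would exercise, assuming a "<name,workspace,output-root>" format; `parse_lane_spec` is a hypothetical stand-in for the actual parse_watch_kubevirt_lane_spec:

```rust
// Parse "<name,workspace,output-root>" into its three fields, rejecting the
// negative cases named above: wrong field count and whitespace-only fields.
fn parse_lane_spec(spec: &str) -> Result<(String, String, String), String> {
    let parts: Vec<&str> = spec.split(',').map(str::trim).collect();
    if parts.len() != 3 {
        return Err(format!(
            "expected 3 comma-separated fields, got {}",
            parts.len()
        ));
    }
    if parts.iter().any(|p| p.is_empty()) {
        return Err("lane spec fields must be non-empty".to_string());
    }
    Ok((parts[0].to_string(), parts[1].to_string(), parts[2].to_string()))
}

fn main() {
    assert!(parse_lane_spec("lane-a,/ws/a,/out/a").is_ok());
    assert!(parse_lane_spec("lane-a,/ws/a").is_err()); // too few fields
    assert!(parse_lane_spec("lane-a, ,/out/a").is_err()); // whitespace-only field
}
```
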

In `@crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs`:
- Around line 401-539: Add comprehensive tests covering perform_failover and
perform_rejoin to pin down source selection, target_commit_lsn derivation,
staged rewrites, and role/view transitions: create unit/integration tests that
use StagedReplicaWorkspace and staged_replica_summary to craft scenarios (higher
LSN on supporting vs new_primary, equal commit LSN but higher apply LSN, missing
LSNs) and assert the chosen source, computed target_commit_lsn, and new_view
returned by perform_failover; assert that rewrite_replica_from_source produced
expected staged summaries and that after import/restart
wait_for_runtime_replica_role observes ReplicaRole::Primary/Backup as expected;
also add negative-path/recovery tests (e.g., missing source, conflicting
summaries, failed import/restart) to ensure errors are propagated and invariants
hold for perform_rejoin (check it uses source_summary fields correctly and sets
target view/commit LSN), placing tests alongside other allocdb-node tests for
*.rs per guidelines.
- Around line 40-42: The loop currently swallows all errors from heal_replica
(and similarly in the later remediation calls around lines 56–69); update the
calls to handle and log failures instead of ignoring them: call
heal_replica(layout, replica.replica_id), check its Result, and on Err emit an
error-level log via the crate logger that includes the replica.replica_id, the
layout id/context, and the error details (use debug/span context if available);
do the same for the other remediation functions referenced in that region so
failures are visible rather than producing only a later “did not converge”
timeout.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/events.rs`:
- Around line 299-308: In classify_resource_read_outcome, ensure the
ResourceResponse::Found actually matches the requested resource_id: inside the
RemoteApiOutcome::Api(ApiResponse::GetResource(ResourceResponse::Found(resource)))
arm compare resource.id (or the actual identifier field on resource) with the
function parameter resource_id and if they differ return an error (e.g., Err
with a descriptive message) instead of treating it as a successful read; only
construct ResourceReadObservation::Available or ::Held when the ids match.
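The id check can be sketched as follows; the `Resource` struct and observation enum here are illustrative stand-ins for the real types in events.rs:

```rust
// Hypothetical sketch: reject a Found response whose id does not match the
// request instead of recording it as a successful read.
#[derive(Debug, PartialEq)]
enum ResourceReadObservation {
    Available,
    Held,
}

struct Resource {
    id: u64,
    held: bool,
}

fn classify_found(
    requested_resource_id: u64,
    resource: &Resource,
) -> Result<ResourceReadObservation, String> {
    if resource.id != requested_resource_id {
        return Err(format!(
            "resource read returned id {} but {} was requested",
            resource.id, requested_resource_id
        ));
    }
    Ok(if resource.held {
        ResourceReadObservation::Held
    } else {
        ResourceReadObservation::Available
    })
}
```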

In `@crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs`:
- Around line 403-445: prepare_kubevirt_helper_stage_dir and
chmod_kubevirt_helper_ssh_key interpolate user-controlled helper_stage_dir into
sh -lc strings unquoted, allowing shell injection; update the format!() calls in
those functions to wrap the stage_dir in single quotes (and escape any embedded
single quotes in the path) so the shell treats it as a single literal argument
(e.g., change the command templates used by prepare_kubevirt_helper_stage_dir
and chmod_kubevirt_helper_ssh_key to quote {stage_dir} and
{stage_dir}/id_ed25519 respectively); copy_kubevirt_helper_ssh_key needs no
change because it passes the path as a kubectl arg.
- Around line 149-157: The phase-detection currently treats any non-zero kubectl
get pod exit as Ok(None) which hides real kubectl failures; update
kubevirt_helper_phase to return Err for non-zero kubectl exit codes (propagating
the underlying error) instead of Ok(None), and ensure the caller (the prepare
function that calls kubevirt_helper_phase) propagates that error rather than
treating None as "pod missing"; keep the existing logic that when phase is
Some("Running") it skips recreate, and continue using
delete_kubevirt_helper_pod, apply_kubevirt_helper_pod, and
kubevirt_helper_should_delete_on_drop unchanged — only change
kubevirt_helper_phase's error behavior and the caller to propagate errors.
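The quoting recommendation for the stage-dir commands can be sketched with a plain helper, rather than the real `format!` call sites, which are an assumption here:

```rust
// Hypothetical sketch of single-quote shell escaping for a path interpolated
// into an `sh -lc` command string.
fn shell_single_quote(path: &str) -> String {
    // An embedded single quote ends the quoted span, emits an escaped quote,
    // and reopens quoting: ' becomes '\''
    format!("'{}'", path.replace('\'', "'\\''"))
}

fn mkdir_command(stage_dir: &str) -> String {
    format!("mkdir -p {}", shell_single_quote(stage_dir))
}
```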

In `@crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs`:
- Around line 137-178: The qemu_ssh_args used by run_qemu_remote_host_command is
missing SSH options that prevent hangs/prompts and noisy logs; update the
qemu_ssh_args function to include the same fail-fast flags used in
kubevirt_ssh_args (add "ConnectTimeout=5", "BatchMode=yes", and
"LogLevel=ERROR") so SSH will time out, avoid interactive prompts, and suppress
verbose output; also add these options to the ssh_args definition in
allocdb-qemu-testbed.rs for consistency.
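The fail-fast option set suggested above can be sketched as a small helper; the real `qemu_ssh_args` presumably builds host and key arguments around these:

```rust
// Hypothetical sketch of the fail-fast SSH options.
fn fail_fast_ssh_options() -> Vec<String> {
    [
        "-o", "ConnectTimeout=5", // give up quickly on unreachable hosts
        "-o", "BatchMode=yes",    // never fall back to interactive prompts
        "-o", "LogLevel=ERROR",   // suppress banner/warning noise in logs
    ]
    .iter()
    .map(|s| s.to_string())
    .collect()
}
```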

In `@crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs`:
- Around line 214-219: The code currently calls tracker.complete(...) and then,
when report.release_gate_passed() is false, calls
tracker.fail(RunTrackerPhase::Completed, ...) which can overwrite the state;
update the logic in the block handling the end-of-run so that you call either
tracker.complete(...) (which should set success/failure based on
report.release_gate_passed()) or tracker.fail(...) but not both—specifically
remove the redundant tracker.fail(RunTrackerPhase::Completed, "Jepsen release
gate is blocked") when tracker.complete(...) has already been invoked, or
alternatively adjust tracker.complete to return/propagate the error so you only
call fail when complete was not called.
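The intended end-of-run transition can be sketched as a single decision, with `TrackerCall` as an illustrative stand-in for the real tracker API:

```rust
// Hypothetical sketch: exactly one terminal tracker call is made, chosen by
// the release-gate result, instead of complete() followed by fail().
#[derive(Debug, PartialEq)]
enum TrackerCall {
    Complete,
    Fail(&'static str),
}

fn terminal_tracker_call(release_gate_passed: bool) -> TrackerCall {
    if release_gate_passed {
        TrackerCall::Complete
    } else {
        TrackerCall::Fail("Jepsen release gate is blocked")
    }
}
```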

In `@crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs`:
- Around line 360-366: The current serialization of snapshot.last_error inside
the format! call emits raw multiline text which breaks line-based parsing in
decode_run_status_snapshot; update the serialization in the format! that writes
last_error to either escape newlines (e.g., replace '\n' with '\\n') or encode
the field as base64, and update decode_run_status_snapshot to perform the
corresponding unescape/base64-decode so parsing remains line-oriented and
robust; make sure to modify the references to snapshot.last_error (the format!
writer) and the decode_run_status_snapshot reader together so they remain
symmetric.
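The escaping side can be sketched in one line; this assumes the decoder applies the exact inverse mapping:

```rust
// Hypothetical sketch: keep the snapshot line-oriented by escaping the escape
// character first, then newlines, so decoding stays unambiguous.
fn encode_snapshot_field(raw: &str) -> String {
    raw.replace('\\', "\\\\").replace('\n', "\\n")
}
```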

In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs`:
- Around line 23-27: The render currently unconditionally emits full-screen ANSI
sequences (see the print!("\x1B[2J\x1B[H") and related println! calls
surrounding watch_color_enabled, watch_spinner_frame, and watch_pulse_frame),
which pollutes redirected output; fix by detecting whether stdout is a TTY
(e.g., using atty::is(atty::Stream::Stdout) or an equivalent helper) and only
emit the clear/reset escape sequences and any interactive-only output when that
check passes, applying the same guard to the other renderer block referenced
(lines 64-76).
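One way to express the guard without the `atty` dependency is `std::io::IsTerminal` (stabilized in Rust 1.70); this is a sketch, not the renderer's actual structure:

```rust
use std::io::IsTerminal;

// Hypothetical sketch: only emit the clear-screen escape when stdout is an
// interactive terminal, so redirected output stays clean.
fn clear_sequence(stdout_is_tty: bool) -> &'static str {
    if stdout_is_tty { "\x1B[2J\x1B[H" } else { "" }
}

fn render_frame(body: &str) -> String {
    let prefix = clear_sequence(std::io::stdout().is_terminal());
    format!("{prefix}{body}")
}
```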

In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs`:
- Around line 94-116: The code currently discards the real error by calling
try_prepare_kubevirt_watch_lane_context(spec).ok(), causing all failures to be
reported as "workspace is not ready yet"; instead capture and preserve the
original error and propagate it into the KubevirtWatchLaneSnapshot.lane_error.
Change the storage in contexts to keep the Result (or alongside the prepared
context keep an Option<String> for the error), replace .ok() with matching the
Err(e) to save e.to_string(), and in the branch where contexts[index] is None
return a KubevirtWatchLaneSnapshot whose lane_error is the preserved error
string (fallback to the existing workspace-not-ready message only if no specific
error was captured). Ensure you reference
try_prepare_kubevirt_watch_lane_context, contexts (per-index storage),
KubevirtWatchLaneSnapshot, and kubevirt_testbed_layout_path/spec.workspace_root
when making these changes.
- Around line 59-79: The loop in watch.rs never exits for one-shot because it
only breaks when a non-running snapshot appears; change the control flow so that
when follow == false you perform a single iteration and return immediately after
calling render_kubevirt_watch instead of sleeping/waiting for a state change.
Locate the loop that calls maybe_load_run_status_snapshot,
collect_replica_watch_snapshots, load_recent_run_events, and
render_kubevirt_watch and adjust it to either (a) wrap the current loop body in
an if follow { loop { ... } } else { ...single iteration... } or (b)
unconditionally render once and then if follow is true enter the repeating loop;
ensure no sleep occurs for the one-shot path.
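The one-shot control flow can be sketched with closures standing in for the real snapshot/render/sleep pipeline:

```rust
// Hypothetical sketch: render once, and only loop (sleeping between
// iterations) when follow mode is requested. `render` returns whether the
// watched run is still active.
fn watch_loop(
    follow: bool,
    mut render: impl FnMut() -> bool,
    mut sleep: impl FnMut(),
) -> usize {
    let mut iterations = 0;
    loop {
        let still_running = render();
        iterations += 1;
        // One-shot mode returns immediately after the first render, no sleep.
        if !follow || !still_running {
            return iterations;
        }
        sleep();
    }
}
```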

---

Nitpick comments:
In `@crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs`:
- Around line 429-434: The code repeatedly constructs the same
JepsenExpiredReservation inline; extract a single local variable (e.g., let
expired_reservation = JepsenExpiredReservation { resource_id, holder_id: 606,
reservation_id: reserve_commit.reservation_id.get(), released_lsn: None };) near
the first use and reuse that variable for subsequent places (including the other
occurrence around lines 702-707) instead of reconstructing it to improve clarity
and avoid duplication; ensure you reference the same reservation_id expression
(reserve_commit.reservation_id.get()) and keep released_lsn as None.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs`:
- Around line 327-329: The hardcoded path assigned to remote_dir inside
fetch_external_logs_archive uses "/var/lib/allocdb-qemu" which is misleading for
a backend-agnostic routine; either rename the directory and related script
references to a neutral name (e.g., "/var/lib/allocdb-control" and
"allocdb-control") across the codebase and deployment scripts, or add a
clarifying comment where remote_dir is built (and in the allocdb-qemu-control
script) explaining that the "qemu" name is historical and intentionally shared
by both QEMU and KubeVirt backends; update any references to
allocdb-qemu-control/remote_dir in deploy/collect-logs logic to match the chosen
approach so names remain consistent.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs`:
- Around line 46-60: The function runtime_replica_probes currently clones each
replica and mutates the clone's role based on request_remote_control_status
before returning RuntimeReplicaProbe, which is a surprising side-effect; either
rename the function to runtime_replica_probes_with_updated_roles or document the
mutation in the function's docstring and public API, and ensure callers
expecting unchanged replica data are updated to use the new name or behavior;
locate runtime_replica_probes, the mapping that sets replica.role from status
(via request_remote_control_status) and change the identifier and/or add a
comment explaining the enrichment so the mutation is explicit to users.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/support.rs`:
- Around line 77-80: The calls to tracker.set_history_events and
tracker.append_event in HistoryBuilder::push currently swallow errors with `let
_ =`, hiding failures; change these to check the Result and log a warning on Err
using the project's logging facility (e.g., log::warn! or tracing::warn!),
referencing the same context (include the sequence number and the operation
name). Specifically, replace the `let _ =
tracker.set_history_events(self.next_sequence)` and `let _ =
tracker.append_event(...)` with error-aware handling that logs a warn message
containing the error and the history sequence (and continue operation if
desired).

In `@crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs`:
- Around line 138-144: The local binding _helper in verify_kubevirt_surface
holds a KubevirtHelperGuard returned by prepare_kubevirt_helper and must be kept
alive for its Drop cleanup, but the underscore prefix falsely signals it's
unused; rename _helper to a clear binding like kubevirt_helper (or
_kubevirt_helper) and/or add an inline comment above the binding noting "keep
guard alive for cleanup on drop" so the intent is explicit while still
preventing unused-variable warnings.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs`:
- Around line 273-276: The append_event function writes detail directly into the
log causing parsing issues for values containing '=' or newlines; update
append_event (and the formatted line creation using events_path and
append_text_line) to escape or serialize the detail string (e.g., replace
newlines, escape '=' and backslashes, or use a quoted/percent-encoded
representation) before embedding it in format!("time_millis={}
detail={detail}\n", ...), mirroring how last_error is handled so the output
remains unambiguous and parseable.

In `@docs/jepsen-refactor-plan.md`:
- Around line 20-24: Edit the "Non-Goals" bullet list to reduce repetition by
rephrasing at least two of the three consecutive bullets that start with "No"
(for example, change one to "Does not include ..." and another to "This task
will not cover ...") while keeping the same meanings; specifically update the
bullets referencing "Jepsen workloads or backend features", "Hetzner backend
work", and the mention of "`jepsen.rs`" so that one or two use alternative
sentence starts for variety but retain the original constraints and links.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 127c87e6-72f8-4229-89e2-5a8b551f190d

📥 Commits

Reviewing files that changed from the base of the PR and between 65a8953 and b258871.

📒 Files selected for processing (19)
  • crates/allocdb-node/src/bin/allocdb-jepsen.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/args.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/common.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/events.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/scenarios.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/support.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs
  • docs/README.md
  • docs/jepsen-refactor-plan.md
  • docs/status.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.

Files:

  • docs/status.md
  • docs/jepsen-refactor-plan.md
  • docs/README.md
docs/status.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep docs/status.md current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.

Files:

  • docs/status.md
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: error for invariant breaks, corruption, and failed operations that require intervention; warn for degraded but expected conditions such as overload, lag, or rejected requests; info for meaningful lifecycle and state-transition events; debug for detailed execution traces useful in development; trace only for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.

Files:

  • crates/allocdb-node/src/bin/allocdb-jepsen/scenarios.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/args.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/common.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/support.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/events.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs
🧠 Learnings (2)
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to docs/status.md : Keep [`docs/status.md`](./docs/status.md) current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.

Applied to files:

  • docs/status.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Keep the system buildable after each chunk whenever feasible. If a larger refactor is required, stage it as a short sequence of intermediate, verifiable steps.

Applied to files:

  • docs/jepsen-refactor-plan.md
🧬 Code graph analysis (7)
crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs (4)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (6)
  • first_backup_replica (110-118)
  • heal_replica (188-204)
  • isolate_replica (170-186)
  • maybe_crash_replica (131-150)
  • perform_failover (401-491)
  • perform_rejoin (493-540)
crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (2)
  • backup_process_name (381-383)
  • primary_process_name (377-379)
crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (1)
  • unique_probe_resource_id (135-139)
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (3)
  • new (27-33)
  • new (55-61)
  • new (95-122)
crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (1)
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (2)
  • as_str (77-83)
  • as_str (107-117)
crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (8)
crates/allocdb-node/src/jepsen.rs (6)
  • analyze_history (744-753)
  • create_artifact_bundle (784-833)
  • load_history (686-690)
  • persist_history (698-703)
  • release_gate_plan (109-154)
  • render_analysis_report (756-775)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
  • load_kubevirt_layout (140-144)
  • prepare_kubevirt_helper (146-169)
crates/allocdb-node/src/bin/allocdb-jepsen/scenarios.rs (2)
  • execute_control_run (19-41)
  • execute_crash_restart_run (43-86)
crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs (1)
  • execute_mixed_failover_run (64-102)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (2)
  • load_qemu_layout (14-18)
  • sanitize_run_id (121-135)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (1)
  • ensure_runtime_cluster_ready (39-73)
crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs (7)
  • workspace_root (28-28)
  • workspace_root (48-50)
  • workspace_root (79-81)
  • verify_external_surface (146-211)
  • backend_name (27-27)
  • backend_name (44-46)
  • backend_name (75-77)
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (2)
  • new (29-38)
  • new (165-206)
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (3)
crates/allocdb-node/src/local_cluster.rs (1)
  • decode_control_status_response (782-786)
crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (1)
  • replica_role_label (817-825)
crates/allocdb-node/src/replica.rs (1)
  • status (959-961)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (3)
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (5)
  • render_runtime_probe_summary (106-136)
  • runtime_probe_is_active (62-67)
  • runtime_replica_probes (46-60)
  • summarize_runtime_probes (69-90)
  • probes (107-134)
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (7)
  • copy_file_or_remove (277-297)
  • prepare_log_path_for (264-275)
  • run_remote_control_command (177-188)
  • new (27-33)
  • new (55-61)
  • new (95-122)
  • from_export (124-136)
crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (1)
  • replica_role_label (817-825)
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (2)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (1)
  • sanitize_run_id (121-135)
crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (1)
  • current_time_millis (128-133)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
crates/allocdb-node/src/kubevirt_testbed.rs (1)
  • kubevirt_testbed_layout_path (162-164)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (2)
  • build_remote_tcp_probe_command (40-49)
  • encode_hex (112-119)
🪛 LanguageTool
docs/jepsen-refactor-plan.md

[style] ~22-~22: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...No Hetzner backend work in this task. - No analyzer/history model redesign in [`je...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🔇 Additional comments (24)
docs/status.md (1)

218-220: LGTM! Status narrative updated to reflect the refactor plan.

The updated text correctly captures the immediate next step (issue #70 Jepsen refactor) and properly defers the larger architectural decision (dedicated validation crate vs. Hetzner follow-on) until after the maintainability improvement is complete. This aligns well with the new jepsen-refactor-plan.md document.

docs/jepsen-refactor-plan.md (1)

1-121: LGTM! Well-structured refactor plan.

The document clearly defines goals, constraints, and a phased execution plan. The emphasis on keeping the repo buildable after each slice (line 15) and including validation steps for each slice aligns well with the learning about incremental, verifiable steps. The decision point at the end (lines 116-121) appropriately defers the crate extraction choice until after the internal seams are proven.

docs/README.md (1)

21-21: LGTM! Documentation index updated.

The new Jepsen Refactor Plan entry is properly added to the Engineering Docs section. The placement between Fault Model and Operator Runbook is reasonable, though you could also consider positioning it adjacent to "KubeVirt Jepsen Report" (line 23) since both documents relate to Jepsen tooling.

crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (3)

69-81: Sequence starts at 1, not 0, after the first push.

In HistoryBuilder::push (referenced from support.rs), next_sequence is incremented before assigning to the event. If initialized with next_sequence = 0, the first event gets sequence = 1. This appears intentional but verify it matches the expected Jepsen history indexing convention.


28-38: LGTM: Unique ID generation is well-designed.

The combination of milliseconds timestamp with an atomic nonce provides good uniqueness guarantees. The use of saturating_mul and saturating_add prevents overflow panics.


164-206: LGTM: RunTracker initialization is robust.

Proper directory creation, atomic file writes, and initial status snapshot emission. The error messages are descriptive with path context.

crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs (2)

146-210: LGTM: Comprehensive surface verification logic.

The verification flow is well-structured:

  1. Ensures cluster readiness
  2. Probes each replica's metrics and protocol endpoints
  3. Validates write/read roundtrip on primary

Error messages include backend name and replica ID for debugging.


26-41: LGTM: ExternalTestbed trait design.

Clean abstraction over backend differences. The trait provides the minimal surface needed for Jepsen operations while keeping backend-specific logic encapsulated.

crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (3)

153-157: LGTM: Safe Drop implementation for staged workspace.

Silently ignoring remove_dir_all failure is acceptable here since cleanup failures during drop shouldn't panic. The temp directory will eventually be cleaned up by the OS.


159-175: LGTM: Robust temp directory naming.

Combines prefix, PID, atomic sequence counter, and milliseconds to ensure uniqueness across concurrent processes and within the same process.


277-297: LGTM: copy_file_or_remove handles NotFound gracefully.

When source doesn't exist, the destination is cleaned up if it exists. This prevents stale files from persisting when the source is absent.

crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (3)

93-110: LGTM: Atomic file write implementation.

Correctly implements write-to-temp, sync, rename pattern for crash-safe file updates. Also handles cleanup on rename failure.


135-139: Verify: unique_probe_resource_id bit layout.

The ID is (millis << 32) | nonce. With millis being a u128 timestamp (~44 bits needed for current epoch), the left-shift by 32 is safe. The nonce occupies the lower 32 bits. This provides good uniqueness but verify callers don't assume any particular bit layout.


46-75: LGTM: Optional parsing helpers are consistent.

All optional parsers handle the "none" sentinel uniformly and provide clear error messages including the invalid value.

crates/allocdb-node/src/bin/allocdb-jepsen/scenarios.rs (3)

19-41: LGTM: Clean dispatch pattern for control runs.

Properly resolves primary/backup replicas and dispatches to workload-specific handlers. Error propagation is consistent.


43-86: LGTM: Crash-restart scenarios properly orchestrate failover.

The pattern of identifying primary, failover target, and supporting backup before dispatching to handlers is consistent and correct.


248-251: Consistent error handling for expected committed reserves.

All workloads that require a committed reserve use ok_or_else with descriptive error messages indicating which scenario expected the commit. This is good defensive programming.

Also applies to: 331-333, 533-535, 599-601

crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (2)

106-136: LGTM: Compact probe summary rendering.

The rendering handles all three states (active, faulted, down) with appropriate truncation of long error messages to the first line. Good for operational visibility.


69-90: LGTM: Topology summarization is correct.

Only counts active replicas and correctly distinguishes primaries from backups. Non-active roles are ignored as expected.

crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs (2)

74-76: LGTM: Clear error for unsupported combination.

Returns an explicit error when ReservationContention is used with mixed failover runs, which is documented as undefined for this workload.


104-180: LGTM: Partition-heal reservation contention scenario.

Well-structured sequence: create resource → isolate backups → reserve → heal → crash primary → failover → retry reserve → second reserve → rejoin. The orchestration correctly exercises the partition-heal nemesis.

crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (3)

133-220: LGTM: Comprehensive run orchestration flow.

The run_external function properly sequences all phases: surface verification → execution → persistence → analysis → archiving → completion. Error handling at each phase updates the tracker appropriately.


222-277: LGTM: Fault window iteration loop is well-designed.

The loop correctly:

  • Ensures cluster readiness before each nemesis iteration
  • Creates fresh request namespaces per iteration
  • Validates non-empty history output
  • Tracks sequence numbers across iterations
  • Respects fault window duration constraints

253-258: Good defensive check for empty iteration history.

Explicitly failing when an iteration produces no events prevents runs from passing vacuously and aids debugging.

Apply the substantive CodeRabbit feedback for the refactor PR: tighten CLI validation, improve tracker/watch encoding and one-shot behavior, harden KubeVirt helper handling, add failover plan coverage, and make QEMU SSH helpers fail fast.

Validation: ./scripts/preflight.sh

Refs #70
@skel84
Owner Author

skel84 commented Mar 15, 2026

Addressed the substantive CodeRabbit follow-ups: tightened CLI negative-path handling/tests, hardened tracker/watch encoding and one-shot behavior, improved KubeVirt helper error handling and quoting, added failover-plan coverage, and added fail-fast SSH options for QEMU helpers. Reran: ./scripts/preflight.sh


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (2)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (1)

166-171: Route lifecycle events through log instead of eprintln!.

These state-transition messages bypass the configured logger, so they will not honor filters/formatting and are harder to correlate with the log::error! output already used earlier in this file. begin/complete events look like info!; the detailed plan summaries look like debug!.

As per coding guidelines, **/*.rs: "Add extensive logging where it materially improves debuggability or operational clarity." and "Logging must be structured and purposeful."

Also applies to: 187-192, 205-210, 223-228, 353-366, 374-384, 478-553, 563-603

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs` around lines
166-171, replace the eprintln! lifecycle messages with structured log
macros so they go through the configured logger; for the shown call that prints
backend={}, workspace={}, replica={} (using layout.backend_name(),
layout.workspace_root(), replica_id.get()) switch to info!(...) for
begin/complete events and debug!(...) for detailed plan summaries, preserving
the same message text and fields, and apply the same change to the other
occurrences you flagged (lines around the calls that use layout.backend_name(),
layout.workspace_root(), replica_id.get(), and similar site-specific variables
at the ranges you noted) so all lifecycle output respects filters/formatting and
correlates with existing log::error! usage.
crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (1)

94-200: Add focused regression tests for the outcome-mapping helpers.

map_reserve_submit_response, classify_resource_read_outcome, and drain_expiration_until_resource_available now define how Jepsen history is recorded. Please pin the negative/recovery branches here (the "not primary" text fallback, mismatched resource_id handling, result-code mapping, and drain retry exhaustion) with unit tests so these semantics do not drift silently.

As per coding guidelines, **/*.rs: "Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage."

Also applies to: 294-338, 404-464, 482-631

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/events.rs` around lines 94 - 200,
The PR lacks focused regression/unit tests for the outcome-mapping helpers that
define Jepsen history semantics; add tests that pin negative/recovery branches
for map_reserve_submit_response, classify_resource_read_outcome, and
drain_expiration_until_resource_available: specifically, write tests asserting
the "not primary" Text fallback path, mismatched resource_id handling in
map_reserve_submit_response, explicit result-code -> JepsenEventOutcome mappings
for classify_resource_read_outcome (including rejection codes), and
exhaustion/retry behavior for drain_expiration_until_resource_available; place
tests alongside the module (covering the functions by name) and include
negative-path, recovery, and invariant assertions so these behaviors cannot
regress.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs`:
- Around line 494-520: kubevirt_helper_ssh_args builds SSH arguments for the
helper pod but is missing the non-interactive flag BatchMode=yes; update
kubevirt_helper_ssh_args to include the "-o" "BatchMode=yes" option (same
style/location as other "-o" options) so SSH won't prompt for
passwords/passphrases and will behave like qemu_ssh_args when stdin is piped;
modify the args vector returned by kubevirt_helper_ssh_args to insert the
BatchMode=yes pair alongside the existing
StrictHostKeyChecking/UserKnownHostsFile/LogLevel/ConnectTimeout options.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs`:
- Around line 117-122: The function effective_minimum_fault_window_secs
currently applies debug_fault_window_override_secs() even when
run_spec.minimum_fault_window_secs is None; change it so the override only
applies if run_spec.minimum_fault_window_secs.is_some() — i.e., return
Some(override) only when the run already has a minimum window, otherwise return
None. Update effective_minimum_fault_window_secs to check
JepsenRunSpec.minimum_fault_window_secs first, then consult
debug_fault_window_override_secs(); also add a regression test that sets the env
override, constructs a control run with minimum_fault_window_secs = None, and
asserts the effective_minimum_fault_window_secs (and control-run behavior)
remains None/one-shot.
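The gating logic reduces to a small pure function; `configured` and `debug_override` here are stand-ins for `run_spec.minimum_fault_window_secs` and `debug_fault_window_override_secs()`:

```rust
// Hypothetical sketch: the debug override only adjusts an existing fault
// window; a run configured without one (e.g. a control run) stays one-shot.
fn effective_minimum_fault_window_secs(
    configured: Option<u64>,
    debug_override: Option<u64>,
) -> Option<u64> {
    configured.map(|secs| debug_override.unwrap_or(secs))
}
```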

In `@crates/allocdb-node/src/bin/allocdb-jepsen/support.rs`:
- Around lines 212-243: run_local_tar_extract currently returns early when
child.stdin.write_all(...) fails, leaving the spawned tar process (child)
unreaped. Change the function so that on any write_all error it still waits
for and reaps the child (e.g., via child.wait() or child.wait_with_output()),
captures the child's exit result (and stderr if desired), and only then
returns the original write error (or a combined error message). The cleanup
path belongs between the write_all call and the subsequent wait_with_output
call.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs`:
- Around lines 442-463: decode_tracker_field is not symmetric with
encode_tracker_field: when it sees a backslash followed by another character
it currently emits both, so an encoded "\\" becomes "\\\\" on decode. Update
decode_tracker_field to treat "\\" as a single literal backslash: match the
next character, pushing '\n' for 'n', '\\' for '\\', and the character itself
otherwise; keep the None case pushing a single trailing backslash. Add a
small round-trip unit test asserting that encode_tracker_field ->
decode_tracker_field returns the original for strings containing backslashes
and newlines (e.g., "C:\\tmp\\foo" and "line1\nline2") to prevent
regressions.
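A symmetric pair can be sketched as below; the real codec lives in tracker.rs and these bodies are assumptions about its shape, not the repo's code:

```rust
// Hypothetical sketch of a symmetric field codec:
// '\n' <-> "\\n" and '\\' <-> "\\\\".
fn encode_tracker_field(s: &str) -> String {
    let mut out = String::new();
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

fn decode_tracker_field(s: &str) -> String {
    let mut out = String::new();
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('\\') => out.push('\\'), // one literal backslash, not two
            Some(other) => out.push(other),
            None => out.push('\\'), // trailing backslash preserved as-is
        }
    }
    out
}

fn main() {
    // Round trips for the examples the review comment names.
    for s in ["C:\\tmp\\foo", "line1\nline2"] {
        assert_eq!(decode_tracker_field(&encode_tracker_field(s)), s);
    }
}
```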

In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs`:
- Around lines 212-220: the summary rendering uses decoded tracker text with
embedded newlines, which breaks the watch layout. Before styling and
truncation (the calls around render_summary_row, watch_style, and
truncate_for_watch with WATCH_RULE_WIDTH on snapshot.detail,
snapshot.last_error, and the similar lane-note/recent-event paths), collapse
multiline text into a single line (e.g., replace '\n' with ' ' or take the
first line), then pass the flattened string into truncate_for_watch and
watch_style so the renderer stays stable. Apply the same fix to the other
occurrences referenced in the comment (the summary, lane-note, and
recent-event code paths).
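The flattening step can be sketched as a small helper; the name and the choice to join lines with a space (rather than take only the first line) are assumptions:

```rust
// Hypothetical helper: collapse decoded tracker text to one line before it
// reaches truncate_for_watch/watch_style in the renderer.
fn flatten_for_watch(text: &str) -> String {
    text.split('\n')
        .map(str::trim)
        .filter(|part| !part.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    assert_eq!(flatten_for_watch("disk full\nretrying"), "disk full retrying");
    assert_eq!(flatten_for_watch("single line"), "single line");
}
```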

In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs`:
- Around lines 96-119: the loop currently stores Some(Err(...)) in contexts
when try_prepare_kubevirt_watch_lane_context fails, making the failure
permanent. Instead, do not save preparation errors as Some(Err): leave
contexts[index] as None so the next loop iteration retries, and set
contexts[index] = Some(Ok(context)) only on success. Adjust the match arm
that handles the Err variant (and the KubevirtWatchLaneSnapshot
construction) accordingly, so transient preparation errors are treated like
runtime collection errors and the watcher retries preparing the lane on
subsequent iterations (see try_prepare_kubevirt_watch_lane_context,
contexts, collect_kubevirt_watch_lane_snapshot, and
KubevirtWatchLaneSnapshot).

In `@docs/jepsen-refactor-plan.md`:
- Around lines 26-39: update the "Target Shape" module map so it matches the
modules implemented in this PR rather than the stale plan. Replace references
to artifacts.rs and tests.rs with the module names actually introduced (e.g.,
tracker.rs, support.rs, surface.rs, watch_render.rs, runs.rs, and any other
new helpers), list
allocdb-jepsen/{tracker.rs,support.rs,surface.rs,watch_render.rs,runs.rs}
with brief role descriptions consistent with the existing bullet style, and
make sure the prose and sequencing reflect the current implementation and
intended follow-ups.

---

Nitpick comments:
In `@crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs`:
- Around lines 166-171: replace the eprintln! lifecycle messages in
cluster.rs with structured log macros so they go through the configured
logger. For the shown call that prints backend={}, workspace={}, replica={}
(using layout.backend_name(), layout.workspace_root(), replica_id.get()),
switch to info!(...) for begin/complete events and debug!(...) for detailed
plan summaries, preserving the same message text and fields. Apply the same
change to the other flagged occurrences (the calls that use
layout.backend_name(), layout.workspace_root(), replica_id.get(), and
similar site-specific variables at the noted ranges) so all lifecycle output
respects filters and formatting and correlates with the existing log::error!
usage.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/events.rs`:
- Around lines 94-200: the PR lacks focused regression/unit tests for the
outcome-mapping helpers that define Jepsen history semantics. Add tests that
pin the negative and recovery branches of map_reserve_submit_response,
classify_resource_read_outcome, and
drain_expiration_until_resource_available: the "not primary" Text fallback
path and mismatched resource_id handling in map_reserve_submit_response, the
explicit result-code -> JepsenEventOutcome mappings (including rejection
codes) in classify_resource_read_outcome, and the exhaustion/retry behavior
of drain_expiration_until_resource_available. Place the tests alongside the
module, covering the functions by name, and include negative-path, recovery,
and invariant assertions so these behaviors cannot regress.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 42a74032-05a4-47ae-ac64-ec9a8086cce3

📥 Commits

Reviewing files that changed from the base of the PR and between b258871 and 8ae9daf.

📒 Files selected for processing (15)
  • crates/allocdb-node/src/bin/allocdb-jepsen.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/args.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/events.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/support.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs
  • crates/allocdb-node/src/bin/allocdb-qemu-testbed.rs
  • docs/jepsen-refactor-plan.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/allocdb-node/src/bin/allocdb-jepsen/args.rs
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: error for invariant breaks, corruption, and failed operations that require intervention; warn for degraded but expected conditions such as overload, lag, or rejected requests; info for meaningful lifecycle and state-transition events; debug for detailed execution traces useful in development; trace only for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.

Files:

  • crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs
  • crates/allocdb-node/src/bin/allocdb-qemu-testbed.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/support.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/events.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs
**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.

Files:

  • docs/jepsen-refactor-plan.md
🧠 Learnings (2)
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Keep the system buildable after each chunk whenever feasible. If a larger refactor is required, stage it as a short sequence of intermediate, verifiable steps.

Applied to files:

  • docs/jepsen-refactor-plan.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.

Applied to files:

  • crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs
🧬 Code graph analysis (6)
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (2)
crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (17)
  • append_text_line (112-126)
  • current_time_millis (128-133)
  • parse_optional_bool (68-75)
  • parse_optional_path (77-83)
  • parse_optional_string (85-91)
  • parse_optional_u64 (46-55)
  • parse_optional_usize (57-66)
  • parse_required_u32 (19-26)
  • parse_required_u64 (28-35)
  • parse_required_u128 (37-44)
  • required_field (9-17)
  • required_field (23-24)
  • required_field (32-33)
  • required_field (41-42)
  • write_text_atomically (93-110)
  • value (50-51)
  • value (61-62)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (1)
  • sanitize_run_id (121-135)
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (7)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (1)
  • sanitize_run_id (121-135)
crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (1)
  • current_time_millis (128-133)
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (3)
  • new (29-38)
  • new (165-206)
  • slot (52-54)
crates/allocdb-core/src/state_machine_metrics.rs (1)
  • operation (72-84)
crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs (3)
  • backend_name (27-27)
  • backend_name (44-46)
  • backend_name (75-77)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
  • drop (46-65)
  • args (92-104)
crates/allocdb-node/src/jepsen_tests.rs (1)
  • temp_dir (16-22)
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (3)
crates/allocdb-node/src/local_cluster.rs (1)
  • decode_control_status_response (782-786)
crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (1)
  • replica_role_label (829-837)
crates/allocdb-node/src/replica.rs (1)
  • status (959-961)
crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs (6)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
  • load_kubevirt_layout (140-144)
  • prepare_kubevirt_helper (146-169)
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (1)
  • request_remote_control_status (23-44)
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (6)
  • maybe_load_run_status_snapshot (385-402)
  • run_events_path (299-301)
  • run_status_path (295-297)
  • snapshot (208-210)
  • new (29-38)
  • new (165-206)
crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (3)
  • parse_watch_event_line (131-144)
  • render_kubevirt_fleet_watch (64-114)
  • render_kubevirt_watch (17-62)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (2)
  • decode_external_api_response (99-110)
  • send_remote_api_request (20-29)
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (3)
  • new (27-33)
  • new (55-61)
  • new (109-136)
crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (3)
crates/allocdb-node/src/api_codec.rs (1)
  • decode_response (165-210)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (1)
  • send_remote_api_request (20-29)
crates/allocdb-node/src/api.rs (1)
  • from_client_request (42-47)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (7)
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (5)
  • render_runtime_probe_summary (109-139)
  • runtime_probe_is_active (65-70)
  • runtime_replica_probes_with_live_roles (46-63)
  • summarize_runtime_probes (72-93)
  • probes (110-137)
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (3)
  • copy_file_or_remove (291-311)
  • prepare_log_path_for (278-289)
  • run_remote_control_command (191-202)
crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (1)
  • replica_role_label (829-837)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (1)
  • value (275-277)
crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (2)
  • value (50-51)
  • value (61-62)
crates/allocdb-node/src/engine_observe.rs (1)
  • active_snapshot_lsn (43-45)
crates/allocdb-node/src/replica.rs (2)
  • metadata (954-956)
  • highest_prepared_lsn (998-1002)
🔇 Additional comments (1)
crates/allocdb-node/src/bin/allocdb-qemu-testbed.rs (1)

1026-1043: Nice SSH hardening coverage.

The shared SSH arg builder now fails fast instead of hanging on prompts/timeouts, and the new unit test pins that contract in place.

Also applies to: 1190-1197

@skel84
Copy link
Owner Author

skel84 commented Mar 15, 2026

Addressed the remaining review follow-ups in :

- fixed KubeVirt helper SSH parity, control-run fault-window override semantics, tar child cleanup, tracker backslash symmetry, watch retry behavior, and single-line watch rendering
- moved cluster lifecycle output onto structured logs
- added focused outcome/drain regression tests plus the stale refactor-plan module map update

Validation rerun:

preflight: cargo fmt --all --check
preflight: cargo clippy --all-targets --all-features -- -D warnings
preflight: cargo test

running 5 tests
test tests::benchmark_options_reject_degenerate_values ... ok
test tests::benchmark_options_reject_oversized_derived_capacities ... ok
test tests::high_retry_pressure_reports_retry_cache_and_retirement_recovery ... ok
test tests::all_selection_runs_both_scenarios ... ok
test tests::one_resource_many_contenders_reports_expected_contention ... ok

test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.23s

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

running 77 tests
test command_codec::tests::decoder_rejects_unknown_command_tag ... ok
test command_codec::tests::decoder_rejects_truncated_payload ... ok
test command_codec::tests::client_request_round_trips ... ok
test command_codec::tests::internal_command_round_trips ... ok
test fixed_map::tests::duplicate_key_is_rejected ... ok
test fixed_map::tests::deletion_preserves_probe_chains ... ok
test fixed_map::tests::full_capacity_fails_explicitly ... ok
test fixed_map::tests::insert_get_and_remove_round_trip ... ok
test fixed_map::tests::deletion_preserves_wrapped_probe_chain_past_home_bucket ... ok
test fixed_map::tests::repeated_removals_preserve_lookup_for_operation_like_hashes ... ok
test recovery::issue_30_tests::recover_allocdb_rejects_non_monotonic_lsn ... ok
test retire_queue::tests::queue_round_trips_entries ... ok
test retire_queue::tests::queue_wraps_without_allocation ... ok
test snapshot::issue_30_tests::from_snapshot_rejects_duplicate_operation_ids ... ok
test snapshot::issue_30_tests::from_snapshot_rejects_duplicate_reservation_ids ... ok
test snapshot::issue_30_tests::from_snapshot_rejects_duplicate_resource_ids ... ok
test snapshot::issue_30_tests::from_snapshot_rejects_inconsistent_progress_watermarks ... ok
test snapshot::issue_30_tests::from_snapshot_rejects_missing_active_reservation_reference ... ok
test snapshot::issue_30_tests::from_snapshot_rejects_operation_table_over_capacity ... ok
test snapshot::issue_30_tests::from_snapshot_rejects_reservation_table_over_capacity ... ok
test snapshot::issue_30_tests::from_snapshot_rejects_resource_table_over_capacity ... ok
test snapshot::issue_30_tests::from_snapshot_rejects_terminal_reservation_without_retirement ... ok
test snapshot::tests::from_snapshot_rejects_wheel_size_mismatch ... ok
test snapshot::tests::snapshot_decode_accepts_legacy_v1_layout ... ok
test snapshot::tests::snapshot_decode_rejects_corruption ... ok
test snapshot::tests::snapshot_restores_retired_lookup_watermark ... ok
test snapshot::tests::snapshot_round_trips_allocator_state ... ok
test snapshot::tests::snapshot_round_trips_slot_overflow_operation_result ... ok
test recovery::issue_31_tests::recover_allocdb_rejects_client_slot_overflow_in_replayed_wal ... ok
test recovery::issue_30_tests::recover_allocdb_rejects_rewound_request_slot ... ok
test recovery::tests::recover_allocdb_replays_wal_without_snapshot ... ok
test recovery::tests::recover_allocdb_replays_internal_commands ... ok
test state_machine::issue_31_tests::config_validation_rejects_operation_window_overflow ... ok
test state_machine::issue_31_tests::create_rejects_operation_window_overflow_without_advancing_progress ... ok
test state_machine::issue_31_tests::internal_expire_rejects_history_window_overflow_without_mutating_state ... ok
test state_machine::issue_31_tests::reserve_rejects_large_request_slot_without_mutating_resource ... ok
test state_machine::issue_32_tests::operation_reuse_conflict_survives_probe_chain_retirement_gap ... ok
test state_machine::issue_33_tests::reservation_lookup_conservatively_marks_older_ids_retired ... ok
test state_machine::issue_33_tests::reservation_lookup_keeps_unknown_future_ids_not_found ... ok
test state_machine::issue_33_tests::retired_reservation_lookup_survives_unrelated_later_write ... ok
test state_machine::observe_tests::health_metrics_report_due_expiration_backlog ... ok
test state_machine::observe_tests::logical_slot_lag_saturates_at_zero ... ok
test state_machine::observe_tests::operation_table_utilization_drops_after_retry_window_retirement ... ok
test state_machine::observe_tests::stale_confirm_cannot_confirm_a_newer_reservation ... ok
test state_machine::tests::config_validation_rejects_invalid_history_window ... ok
test state_machine::tests::confirm_requires_matching_holder ... ok
test state_machine::tests::create_resource_is_idempotent_with_same_operation_id ... ok
test state_machine::tests::create_resource_rejects_conflicting_reuse_of_operation_id ... ok
test state_machine::tests::due_reservations_are_bucketed_by_deadline_slot ... ok
test state_machine::tests::expiration_bucket_capacity_fails_fast ... ok
test state_machine::tests::expire_is_noop_after_confirm ... ok
test state_machine::tests::expire_releases_reserved_resource ... ok
test state_machine::tests::operation_dedupe_expires_after_window ... ok
test state_machine::tests::release_returns_resource_to_available_and_retains_history ... ok
test state_machine::tests::reserve_assigns_deterministic_reservation_id_and_deadline ... ok
test state_machine::tests::reserve_rejects_busy_resource ... ok
test state_machine::tests::resource_table_capacity_fails_fast ... ok
test wal::tests::corrupted_checksum_is_rejected ... ok
test wal::tests::frame_round_trips ... ok
test wal::tests::scanner_stops_at_invalid_frame ... ok
test wal::tests::scanner_stops_at_torn_tail ... ok
test wal::tests::truncated_frame_is_rejected ... ok
test snapshot_file::tests::snapshot_file_returns_none_when_missing ... ok
test recovery::tests::recover_allocdb_fails_closed_on_mid_log_corruption ... ok
test recovery::tests::recover_allocdb_truncates_torn_tail ... ok
test recovery::issue_30_tests::recover_allocdb_rejects_semantically_invalid_snapshot ... ok
test wal_file::tests::wal_file_rejects_large_payloads ... ok
test wal_file::tests::wal_file_round_trips_frames ... ok
test recovery::tests::recover_allocdb_marks_empty_snapshot_as_loaded ... ok
test wal_file::tests::wal_file_recovery_reports_torn_tail ... ok
test wal_file::tests::wal_file_truncate_rejects_mid_log_corruption ... ok
test snapshot_file::tests::snapshot_file_rejects_corruption ... ok
test snapshot_file::tests::snapshot_file_round_trips ... ok
test wal_file::tests::wal_file_truncates_to_valid_prefix ... ok
test recovery::tests::recover_allocdb_skips_frames_covered_by_snapshot ... ok
test wal_file::tests::wal_file_replace_with_frames_rewrites_contents ... ok
test snapshot_file::tests::snapshot_file_replaces_existing_contents ... ok

test result: ok. 77 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.12s

running 151 tests
test api::tests::api_codec_rejects_truncated_and_trailing_frames ... ok
test api::tests::request_codec_round_trips_all_variants ... ok
test api::tests::api_bytes_round_trip_metrics_response ... ok
test api::tests::api_submit_maps_invalid_payload_to_definite_failure ... ok
test api::tests::submit_request_round_trips_through_wire_codec ... ok
test api::tests::response_codec_round_trips_all_variants ... ok
test bounded_queue::tests::queue_round_trips_entries ... ok
test bounded_queue::tests::queue_wraps_without_allocation ... ok
test engine::checkpoint_tests::checkpoint_rejects_queued_submissions ... ok
test api::tests::api_submit_maps_crash_injected_commit_to_indefinite_storage_failure_wire_value ... ok
test api::tests::api_reads_reject_when_engine_is_halted ... ok
test api::tests::api_submit_commits_and_exposes_retry_cache ... ok
test api::tests::api_reads_enforce_fence_and_return_views ... ok
test api::tests::api_bytes_recovery_preserves_state_and_retry_cache ... ok
test api::tests::api_tick_expirations_commits_due_internal_expire ... ok
test engine::issue_31_tests::submit_rejects_deadline_overflow_before_wal_append ... ok
test api::tests::api_reservation_reports_retired_history ... ok
test engine::tests::engine_config_rejects_zero_bounds ... ok
test engine::issue_31_tests::tick_expirations_rejects_history_overflow_before_internal_commit ... ok
test engine::tests::conflicting_retry_returns_conflict_without_wal_growth ... ok
test engine::tests::duplicate_retry_returns_cached_result_without_wal_growth ... ok
test engine::tests::enqueue_respects_bounded_queue_capacity ... ok
test engine::tests::encoded_submission_rejects_malformed_payload_before_commit ... ok
test engine::tests::encoded_submission_round_trips_same_as_typed_submit ... ok
test engine::issue_31_tests::recover_rejects_exhausted_next_lsn ... ok
test engine::issue_31_tests::engine_returns_lsn_exhausted_after_last_representable_commit ... ok
test engine::tests::expiration_tick_commits_internal_expire_and_frees_overdue_resource ... ok
test engine::tests::expiration_tick_drains_queued_confirm_before_expiring_same_reservation ... ok
test engine::tests::metrics_include_queue_depth_and_core_health ... ok
test engine::tests::oversized_submission_is_rejected_before_commit ... ok
test engine::tests::halted_engine_rejects_reads_until_recovery ... ok
test engine::tests::expiration_tick_post_append_failure_requires_recovery_for_expired_state ... ok
test engine::tests::expiration_tick_pre_append_failure_leaves_expiration_uncommitted_after_recovery ... ok
test engine::tests::queued_duplicate_retry_reuses_original_submission ... ok
test engine::tests::queued_conflicting_retry_returns_conflict_without_second_entry ... ok
test engine::tests::recover_restores_state_and_retry_cache ... ok
test engine::tests::retry_after_failed_post_append_attempt_returns_cached_result_after_recovery ... ok
test engine::tests::submission_errors_have_explicit_indefinite_category ... ok
test jepsen::tests::analysis_flags_duplicate_committed_execution ... ok
test jepsen::tests::analysis_flags_early_expiration_release ... ok
test jepsen::tests::analysis_flags_stale_successful_read ... ok
test jepsen::tests::analysis_resolves_ambiguous_write_through_retry_cache ... ok
test engine::tests::expiration_tick_preserves_monotonic_slots_after_draining_queued_release ... ok
test jepsen::tests::history_codec_round_trips_none_lsn_and_tick_expired_without_resource_id ... ok
test jepsen::tests::release_gate_plan_matches_documented_matrix ... ok
test engine::tests::retry_after_failed_pre_append_attempt_executes_once_after_recovery ... ok
test engine::tests::strict_read_fence_requires_applied_lsn ... ok
test kubevirt_testbed::tests::layout_rejects_wrong_replica_count ... ok
test kubevirt_testbed::tests::layout_round_trips_through_text ... ok
test kubevirt_testbed::tests::layout_rejects_control_guest_replica_id ... ok
test engine::tests::retry_resolution_is_only_guaranteed_within_the_dedupe_window ... ok
test engine::tests::submit_sequences_wal_and_applies_state ... ok
test kubevirt_testbed::tests::replica_layout_uses_guest_ips_for_all_surfaces ... ok
test local_cluster::tests::control_command_parser_rejects_unknown_requests ... ok
test local_cluster::tests::decode_control_status_response_surfaces_remote_error ... ok
test local_cluster::tests::decode_control_status_response_round_trips_through_text_encoding ... ok
test local_cluster::tests::fault_state_round_trips_through_text_encoding ... ok
test local_cluster::tests::request_control_status_retries_transient_decode_error_then_succeeds ... ok
test local_cluster::tests::layout_round_trips_through_text_encoding ... ok
test local_cluster::tests::request_control_status_returns_non_transient_error_without_retry ... ok
test local_cluster::tests::request_control_status_retries_truncated_field_value_then_succeeds ... ok
test local_cluster::tests::status_response_round_trips_through_text_encoding ... ok
test local_cluster::tests::timeline_round_trips_through_text_encoding ... ok
test kubevirt_testbed::tests::layout_rejects_duplicate_replica_addresses ... ok
test qemu_testbed::tests::qemu_command_contains_expected_networks_and_seed_assets ... ok
test qemu_testbed::tests::control_guest_user_data_contains_control_script_and_private_key ... ok
test engine::tests::recovery_metrics_treat_loaded_empty_snapshot_as_snapshot_startup ... ok
test kubevirt_testbed::tests::load_is_read_only_for_missing_workspace_root ... ok
test qemu_testbed::tests::qemu_testbed_layout_round_trips_through_text_encoding ... ok
test qemu_testbed::tests::replica_guest_user_data_contains_replica_service_and_layout ... ok
test qemu_testbed::tests::replica_layout_uses_management_for_control_and_cluster_for_data ... ok
test kubevirt_testbed::tests::persist_and_load_round_trip ... ok
test engine::checkpoint_tests::recovery_survives_new_snapshot_before_wal_rewrite ... ok
test qemu_testbed::tests::persisted_qemu_layout_round_trips_from_disk ... ok
test jepsen::tests::history_codec_round_trips_and_artifact_bundle_is_written ... ok
test replica::tests::replica_metadata_file_rejects_oversized_sidecar ... ok
test engine::tests::recovery_metrics_report_snapshot_and_wal_replay ... ok
test engine::checkpoint_tests::checkpoint_rewrites_wal_with_one_checkpoint_overlap ... ok
test replica::tests::replica_open_faults_on_corrupt_metadata_bytes ... ok
test replica::tests::replica_metadata_file_rejects_previous_role_encoding_version ... ok
test replica::tests::replica_metadata_file_round_trips ... ok
test replica::tests::replica_open_bootstraps_missing_metadata ... ok
test replica::tests::replica_first_uncommitted_prepared_entry_is_none_when_empty ... ok
test replica::tests::replica_metadata_file_overwrite_replaces_prior_contents ... ok
test replica::tests::replica_open_faults_on_metadata_identity_mismatch ... ok
test replica::tests::replica_open_faults_on_metadata_shard_mismatch ... ok
test replica::tests::replica_open_faults_when_last_normal_view_exceeds_current_view ... ok
test replica::tests::replica_open_faults_when_commit_lsn_is_ahead_of_local_state ... ok
test replica::tests::replica_open_faults_when_snapshot_anchor_is_not_local ... ok
test replica::tests::replica_first_uncommitted_prepared_entry_returns_lowest_lsn ... ok
test replica::tests::replica_open_faults_on_corrupt_prepare_log_bytes ... ok
test replica::tests::replica_open_faults_when_snapshot_exceeds_commit_lsn ... ok
test replica::tests::replica_open_faults_when_snapshot_has_no_commit_lsn ... ok
test replica::tests::replica_first_uncommitted_prepared_entry_is_stable_after_mutation ... ok
test replica::tests::replica_open_faults_when_vote_view_precedes_last_normal_view ... ok
test replica::tests::replica_open_faults_on_empty_prepare_log_file ... ok
test replica::tests::replica_open_faults_when_vote_view_precedes_current_view ... ok
test replica::tests::replica_recover_faults_when_local_apply_lags_commit_lsn ... ok
test replica::tests::replica_recover_honors_persisted_faulted_role ... ok
test replica::tests::replica_recover_faults_when_local_apply_exceeds_commit_lsn ... ok
test replica::tests::replica_prepare_expiration_tick_rejects_pending_prepared_suffix ... ok
test replica::tests::replica_recover_bootstraps_metadata_from_local_durable_state ... ok
test replica::tests::replica_prepare_and_commit_keep_apply_gated_by_commit ... ok
test replica::tests::replica_recover_faults_when_prepare_log_lsn_skips_commit_boundary ... ok
test replica::tests::replica_recover_restores_prepared_suffix_from_prepare_log ... ok
test replicated_simulation::tests::connectivity_matrix_controls_delivery_until_partition_heals ... ok
test replica::tests::replica_vote_persists_view_uncertainty_and_blocks_view_regression ... ok
test replica::tests::replica_commit_expiration_tick_persists_and_recovers_expired_state ... ok
test replica::tests::replica_reconstructs_committed_prefix_and_discards_uncommitted_suffix ... ok
test replicated_simulation::tests::crash_and_restart_keep_replica_workspace_stable ... ok
test replica::tests::replica_prepare_expiration_tick_emits_internal_entries ... ok
test replicated_simulation::tests::faulted_replica_rejoin_is_rejected ... ok
test replicated_simulation::tests::queue_and_deliver_reject_crashed_replicas ... ok
test replicated_simulation::tests::invalid_replica_access_returns_unknown_replica ... ok
test replicated_simulation::tests::higher_view_takeover_does_not_promote_primary_only_prepared_suffix ... ok
test replicated_simulation::tests::minority_partition_keeps_quorum_and_healed_backup_catches_up ... ok
test replicated_simulation::tests::quorum_lost_primary_fails_closed_for_reads_and_writes ... ok
test replicated_simulation::tests::reads_are_served_only_from_the_primary_after_local_commit ... ok
test replicated_simulation::tests::higher_view_takeover_recovers_missing_prepared_suffix_from_another_voter ... ok
test replicated_simulation::tests::quorum_write_publishes_after_majority_append_and_backups_wait_for_commit ... ok
test replicated_simulation::tests::replicated_harness_bootstraps_three_real_replicas_with_independent_workspaces ... ok
test replicated_simulation::tests::restart_rejects_running_replica ... ok
test replicated_simulation::tests::higher_view_takeover_reconstructs_prefix_and_rejects_stale_primary_reads ... ok
test simulation::tests::crash_plan_seed_is_reproducible_and_order_independent ... ok
test simulation::tests::explicit_slot_advancement_controls_lag_and_backlog ... ok
test simulation::tests::harness_storage_fault_append_failure_halts_engine_for_negative_path_tests ... ok
test replicated_simulation::tests::primary_crash_after_reply_preserves_read_and_retry_on_new_primary ... ok
test simulation::tests::schedule_exploration_rejects_duplicate_labels - should panic ... ok
test simulation::tests::schedule_exploration_rejects_empty_candidate_slots - should panic ... ok
test replicated_simulation::tests::primary_crash_before_quorum_append_retries_once_after_failover ... ok
test replicated_simulation::tests::primary_crash_after_quorum_append_retries_from_reconstructed_commit ... ok
test simulation::tests::seeded_client_post_sync_crash_recovers_via_real_engine ... ok
test replicated_simulation::tests::rejoin_rejects_replica_that_knows_a_higher_view_than_the_primary ... ok
test simulation::tests::seeded_checkpoint_crash_after_snapshot_write_is_recoverable ... ok
test simulation::tests::seeded_checkpoint_crash_after_wal_rewrite_is_recoverable ... ok
test simulation::tests::seeded_ready_batch_transcript_is_reproducible ... ok
test simulation::tests::seeded_recovery_replay_crash_can_target_later_replayed_frame ... ok
test replicated_simulation::tests::retry_cache_hits_do_not_overwrite_canonical_published_results ... ok
test simulation::tests::seeded_schedule_explores_ingress_contention_reproducibly ... ok
test simulation::tests::seeded_schedule_explores_retry_timing_reproducibly ... ok
test simulation::tests::simulated_sync_failure_recovers_retryable_write_from_real_wal ... ok
test simulation::tests::simulated_checksum_corruption_fails_closed_during_restart ... ok
test simulation::tests::seeded_schedule_explores_due_expiration_order_reproducibly ... ok
test simulation::tests::simulated_slot_driver_handles_expiration_restart_path ... ok
test simulation::tests::simulated_torn_tail_recovers_from_snapshot_and_retries_once ... ok
test replicated_simulation::tests::replicated_schedule_transcript_is_reproducible ... ok
test simulation::tests::seeded_recovery_boundary_crashes_are_reproducible_and_resumable ... ok
test replicated_simulation::tests::split_cluster_into_non_quorum_minorities_fails_closed_until_heal_and_rejoin ... ok
test replicated_simulation::tests::stale_replica_rejoins_by_snapshot_transfer_when_primary_pruned_older_history ... ok
test replicated_simulation::tests::stale_replica_rejoins_by_suffix_and_drops_uncommitted_suffix ... ok
test simulation::tests::seeded_schedule_preserves_earliest_deadline_priority_under_bounded_ticks ... ok

test result: ok. 151 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 4.75s

running 48 tests
test runs::tests::fault_window_override_changes_faulted_runs ... ok
test events::tests::response_text_not_primary_is_detected ... ok
test events::tests::outcome_from_submission_failure_maps_rejection_codes ... ok
test cluster::tests::plan_rejoin_uses_source_commit_and_view ... ok
test cluster::tests::plan_failover_prefers_new_primary_when_commits_and_prepares_match_or_lead ... ok
test cluster::tests::plan_failover_prefers_supporting_backup_with_newer_commit ... ok
test cluster::tests::plan_failover_prefers_supporting_backup_with_higher_prepare_on_equal_commit ... ok
test runs::tests::fault_window_override_does_not_change_control_runs ... ok
test tests::classify_resource_read_outcome_distinguishes_available_and_held_states ... ok
test tests::classify_resource_read_outcome_maps_not_primary_text ... ok
test events::tests::map_reserve_submit_response_maps_expected_result_codes ... ok
test tests::classify_resource_read_outcome_rejects_mismatched_resource ... ok
test tests::compact_counter_formats_large_values ... ok
test tests::compact_fault_window_progress_formats_control_and_faulted_runs ... ok
test tests::expired_reservation_maps_to_released_read_state ... ok
test tests::fault_window_completion_distinguishes_control_and_long_fault_runs ... ok
test tests::history_builder_preserves_nonzero_sequence_offsets ... ok
test tests::indefinite_submission_failure_maps_to_ambiguous_outcome ... ok
test tests::live_runtime_replica_matching_ignores_down_and_faulted_replicas ... ok
test tests::parse_args_rejects_missing_required_flags ... ok
test tests::parse_args_returns_help_without_subcommand ... ok
test tests::parse_args_rejects_unknown_subcommand ... ok
test tests::parse_args_rejects_trailing_plan_arguments ... ok
test tests::analyzer_accepts_failover_read_fence_history_once_ambiguity_is_retried ... ok
test tests::parse_watch_event_line_extracts_timestamp_and_detail ... ok
test tests::parse_watch_event_line_unescapes_multiline_detail ... ok
test tests::parse_watch_kubevirt_lane_spec_extracts_name_workspace_and_output_root ... ok
test tests::parse_watch_kubevirt_lane_spec_rejects_blank_fields ... ok
test tests::parse_watch_kubevirt_lane_spec_rejects_missing_fields ... ok
test tests::probe_submit_and_read_validation_cover_pass_and_fail_paths ... ok
test tests::progress_bar_clamps_to_requested_width ... ok
test tests::protocol_probe_rejects_placeholder_responses ... ok
test tests::release_gate_plan_includes_faulted_qemu_runs ... ok
test tests::remote_tcp_probe_command_places_args_before_heredoc ... ok
test tests::remote_tcp_probe_command_preserves_empty_payload_argument ... ok
test tests::render_runtime_probe_summary_marks_live_faulted_and_down_replicas ... ok
test tests::request_namespace_monotonicity_covers_verify_then_execute_ordering ... ok
test tests::resolve_run_spec_and_minimum_fault_window_are_enforced ... ok
test tests::summarize_runtime_probes_counts_only_active_roles ... ok
test tests::tracker_field_round_trips_newlines_and_backslashes ... ok
test tests::run_status_snapshot_round_trips_through_text_codec ... ok
test tests::unique_probe_resource_id_is_monotonic ... ok
test tests::run_status_snapshot_round_trips_multiline_error_and_detail ... ok
test tests::copy_file_or_remove_copies_and_removes_stale_destination ... ok
test tests::temp_staging_dir_uses_unique_paths_for_same_prefix ... ok
test tests::disable_local_tar_copyfile_metadata_sets_expected_env ... ok
test events::tests::drain_expiration_until_resource_available_reports_retry_exhaustion ... ok
test events::tests::drain_expiration_until_resource_available_stops_once_resource_is_available ... ok

test result: ok. 48 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.02s

running 2 tests
test tests::prepare_ack_requires_expected_replica_identity ... ok
test tests::read_stream_bytes_rejects_oversized_payload ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

running 4 tests
test tests::parse_qemu_image_format_extracts_format_line ... ok
test tests::ssh_args_include_fail_fast_options ... ok
test tests::start_prepared_testbed_rolls_back_readiness_failure_and_allows_retry ... ok
test tests::start_prepared_testbed_rolls_back_spawn_failure_and_allows_retry ... ok

test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

running 7 tests
test local_cluster_client_transport_commits_reads_and_retries ... ok
test local_cluster_client_transport_retries_same_ambiguous_write ... ok
test local_cluster_fault_harness_crashes_restarts_and_records_isolation ... ok
test local_cluster_runner_starts_stops_and_reuses_stable_layout ... ok
test local_cluster_tick_expirations_replicates_internal_commands ... ok
test local_cluster_submit_malformed_payload_returns_invalid_request_response ... ok
test local_cluster_tick_retry_drains_pending_internal_suffix_after_quorum_recovers ... ok

test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 8.08s

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

preflight: scripts/check_repo.sh
repo-check: ok
preflight: ok


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
docs/jepsen-refactor-plan.md (1)

26-43: ⚠️ Potential issue | 🟡 Minor

Sync the module map and slice ownership with the code that landed.

The plan still omits landed modules like cluster.rs, events.rs, kubevirt.rs, nemesis.rs, remote.rs, and scenarios.rs, and Slice 3 still describes RunExecutionContext as tracker work even though it lives in support.rs. As written, this doc is already stale against the implementation in this PR and will point the next refactor step at the wrong seams.

As per coding guidelines, **/*.md: "Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR."

Also applies to: 71-77

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/jepsen-refactor-plan.md` around lines 26 - 43, The module map and slice
ownership in the "Target Shape" section are stale; update the list to include
the landed modules cluster.rs, events.rs, kubevirt.rs, nemesis.rs, remote.rs,
and scenarios.rs, and correct the slice ownership that currently attributes
RunExecutionContext to tracker work by noting that RunExecutionContext lives in
support.rs (adjust any "Slice 3" wording accordingly); ensure the new module
list and slice descriptions match the actual symbols and file boundaries (e.g.,
allocdb-jepsen/support.rs contains RunExecutionContext) so the doc aligns with
the implementation introduced in this PR.
🧹 Nitpick comments (3)
crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (1)

294-306: Add a mismatch-id regression test here.

This guard is important, but the test module still doesn't cover the negative path where GetResource(Found(...)) returns a different resource_id. One focused unit test would keep the Jepsen history from silently regressing back to accepting the wrong object.

As per coding guidelines, **/*.rs: "Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/events.rs` around lines 294 - 306,
Add a focused unit test for classify_resource_read_outcome that exercises the
negative path where
RemoteApiOutcome::Api(ApiResponse::GetResource(ResourceResponse::Found(...)))
contains a Resource whose resource_id does not match the requested resource_id;
call classify_resource_read_outcome with these mismatched ids and assert it
returns Err (and optionally that the error string mentions both ids or the
mismatch), placing the test in the same test module as the existing tests so
future changes cannot regress acceptance of mismatched resources.
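The mismatch guard described above can be pinned with a negative-path test. A minimal illustrative sketch follows; the `Resource` struct and `classify_resource_read_outcome` signature here are stand-ins, not the real `ApiResponse`/`ResourceResponse` types from `events.rs`:

```rust
// Illustrative stand-in types: the real classifier works on the remote API
// response enums, but the invariant is the same — a Found response whose
// resource_id differs from the requested id must be rejected, not accepted.
struct Resource {
    resource_id: u64,
}

fn classify_resource_read_outcome(requested_id: u64, found: &Resource) -> Result<(), String> {
    if found.resource_id != requested_id {
        // Surface both ids so a mismatch in a Jepsen history is diagnosable.
        return Err(format!(
            "resource id mismatch: requested {requested_id}, got {}",
            found.resource_id
        ));
    }
    Ok(())
}

fn main() {
    // Happy path: matching id is accepted.
    assert!(classify_resource_read_outcome(7, &Resource { resource_id: 7 }).is_ok());
    // Negative path the review asks to pin: mismatched id must error,
    // and the message should mention both ids.
    let err = classify_resource_read_outcome(7, &Resource { resource_id: 9 }).unwrap_err();
    assert!(err.contains("7") && err.contains("9"));
    println!("ok");
}
```

In the real test module this would construct the actual `Found(...)` response with a mismatched `resource_id` and assert the classifier returns `Err`.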
crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs (2)

247-255: Avoid full-file materialization on every refresh tick.

Line 247–Line 255 re-read and re-parse the entire events file each loop; with 250ms refresh this can become expensive for long runs. Stream lines and keep only a bounded tail in-memory.

♻️ Proposed refactor
-    let Ok(bytes) = fs::read_to_string(&events_path) else {
+    let Ok(file) = fs::File::open(&events_path) else {
         return Vec::new();
     };
-    let mut events = bytes
-        .lines()
-        .filter_map(|line| parse_watch_event_line(line).ok())
-        .collect::<Vec<_>>();
-    if events.len() > limit {
-        events.drain(..events.len().saturating_sub(limit));
+    let reader = std::io::BufReader::new(file);
+    let mut events = std::collections::VecDeque::with_capacity(limit);
+    for line in std::io::BufRead::lines(reader).map_while(Result::ok) {
+        if let Ok(event) = parse_watch_event_line(&line) {
+            if events.len() == limit {
+                events.pop_front();
+            }
+            events.push_back(event);
+        }
     }
-    events
+    events.into_iter().collect()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs` around lines 247 - 255,
The loop currently re-reads and re-parses the whole events file into a Vec each
tick (using events_path, parse_watch_event_line, and limit) which is expensive;
change it to stream new lines from the file (use a BufReader with seek to the
previous read position or tail via File::open and read_to_end incrementally) and
append parsed events to a bounded in-memory buffer (use a VecDeque or retain
only the last limit items) instead of re-materializing the entire file on every
refresh; ensure you still call parse_watch_event_line per incoming line and drop
older entries so events never grows beyond limit.

251-253: Don’t silently discard malformed event lines.

Line 252 suppresses parse failures with .ok(), which hides format regressions and corrupt data. Emit a warn with path/context when parsing fails.

🔎 Proposed logging improvement
-    let mut events = bytes
-        .lines()
-        .filter_map(|line| parse_watch_event_line(line).ok())
-        .collect::<Vec<_>>();
+    let mut events = bytes
+        .lines()
+        .filter_map(|line| match parse_watch_event_line(line) {
+            Ok(event) => Some(event),
+            Err(error) => {
+                log::warn!(
+                    "failed to parse watch event line from {}: {}",
+                    events_path.display(),
+                    error
+                );
+                None
+            }
+        })
+        .collect::<Vec<_>>();
As per coding guidelines, `**/*.rs`: Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs` around lines 251 - 253,
The code currently swallows parse errors by using .filter_map(|line|
parse_watch_event_line(line).ok()) which hides malformed event lines; replace
this with handling the Result from parse_watch_event_line so that Err cases emit
a structured warn (including the offending line and any surrounding
context/path) and only Ok values are collected into the Vec; specifically update
the iterator chain that calls parse_watch_event_line to log with warn!(…) on Err
while collecting Ok(event) into the final collection (keep
parse_watch_event_line and the existing collect::<Vec<_>>() usage).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs`:
- Around line 179-195: The stdin write path for the spawned kubectl child in the
block (inside apply_kubevirt_helper_pod / the kubectl Command that spawns
`child`) currently returns immediately on write_all() error and leaks the child;
change the error path to first capture the child, attempt to wait_with_output()
to reap the process (using the same wait-and-collect pattern as elsewhere), then
propagate an error that includes the child's stderr/exit info; do the same fix
for the other identical block around lines 338-359 so every write_all() failure
always calls child.wait_with_output() before returning.
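The reap-on-failure pattern the comment asks for can be sketched generically. This is an illustrative helper (the name `run_with_stdin` and the use of `cat` are assumptions for demonstration, not code from `kubevirt.rs`): the key point is that `wait_with_output()` runs unconditionally, even when the stdin write fails, so the child is never leaked and its stderr can be folded into the error:

```rust
use std::io::Write;
use std::process::{Command, Stdio};

// Hypothetical sketch of the fix: capture the write result, always reap the
// child via wait_with_output(), then report the write error with the child's
// stderr attached instead of returning early and leaking the process.
fn run_with_stdin(program: &str, args: &[&str], payload: &[u8]) -> Result<String, String> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()
        .map_err(|e| format!("spawn failed: {e}"))?;

    // Write the payload; dropping the ChildStdin at the end of the closure
    // closes the pipe so the child can terminate.
    let write_result = child
        .stdin
        .take()
        .ok_or_else(|| "stdin not captured".to_string())
        .and_then(|mut stdin| {
            stdin
                .write_all(payload)
                .map_err(|e| format!("stdin write failed: {e}"))
        });

    // Always reap, even when the write failed.
    let output = child
        .wait_with_output()
        .map_err(|e| format!("wait failed: {e}"))?;

    write_result.map_err(|write_error| {
        format!(
            "{write_error}; child stderr: {}",
            String::from_utf8_lossy(&output.stderr)
        )
    })?;

    if output.status.success() {
        Ok(String::from_utf8_lossy(&output.stdout).into_owned())
    } else {
        Err(format!(
            "child failed: {}",
            String::from_utf8_lossy(&output.stderr)
        ))
    }
}

fn main() {
    let echoed = run_with_stdin("cat", &[], b"hello").expect("cat should echo stdin");
    assert_eq!(echoed, "hello");
    println!("ok");
}
```

Applied to the two `kubectl` blocks, the same shape means the error path carries the child's exit info instead of silently orphaning the process.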

In `@crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs`:
- Around line 218-223: The run exits early when
enforce_minimum_fault_window(...) fails but skips persisting final metadata;
move or add a call to tracker.complete(&history_path, &bundle_dir,
&logs_archive, &report) so the history_file, artifact bundle, logs_archive,
blocker count and release-gate status are persisted before returning, then call
tracker.fail(RunTrackerPhase::Completed, &error) and return Err(error) as
currently done; specifically update the block that calls
enforce_minimum_fault_window, started_at.elapsed(), tracker.fail and
tracker.complete so tracker.complete runs prior to tracker.fail when
enforce_minimum_fault_window returns Err.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs`:
- Around line 534-543: In render_recent_events(), flatten the multiline detail
(flatten_watch_text(&event.detail)) and truncate it to a fixed max width (e.g.
define MAX_WATCH_DETAIL_LEN) before passing to style_event_detail so extremely
long "error:" paths don't wrap the layout; implement truncation that appends an
ellipsis when truncated and use the truncated string in the println call (keep
references to event_icon, pad_watch_cell, watch_style and style_event_detail
unchanged).

In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs`:
- Around line 44-83: Add unit/integration tests around watch_kubevirt and the
helper functions to cover one-shot vs follow control flow, event truncation,
retry/error handling, and error surfaces: write tests that call watch_kubevirt
(or extract its loop body into a testable helper) to assert that when follow is
false it returns after one iteration and when follow is true it continues until
maybe_load_run_status_snapshot returns a snapshot whose RunTrackerState !=
RunTrackerState::Running; add tests for load_recent_run_events to verify it
enforces the N-event limit and degrades on missing or malformed event files; add
tests for prepare_kubevirt_helper / lane initialization to simulate transient
failures and verify retry behavior and per-lane error handling; and add tests
for collect_replica_watch_snapshots and any replica-metric fetchers to simulate
missing metrics, unexpected API responses, and I/O/parsing errors and assert
graceful degradation (no panics, appropriate error propagation or partial
results). Use mocks or temp files to simulate file-not-found and parse failures
and ensure refresh_millis / sleep behavior is controlled (inject a short/mocked
delay) so tests run fast.

---

Duplicate comments:
In `@docs/jepsen-refactor-plan.md`:
- Around line 26-43: The module map and slice ownership in the "Target Shape"
section are stale; update the list to include the landed modules cluster.rs,
events.rs, kubevirt.rs, nemesis.rs, remote.rs, and scenarios.rs, and correct the
slice ownership that currently attributes RunExecutionContext to tracker work by
noting that RunExecutionContext lives in support.rs (adjust any "Slice 3"
wording accordingly); ensure the new module list and slice descriptions match
the actual symbols and file boundaries (e.g., allocdb-jepsen/support.rs contains
RunExecutionContext) so the doc aligns with the implementation introduced in
this PR.

---

Nitpick comments:
In `@crates/allocdb-node/src/bin/allocdb-jepsen/events.rs`:
- Around line 294-306: Add a focused unit test for
classify_resource_read_outcome that exercises the negative path where
RemoteApiOutcome::Api(ApiResponse::GetResource(ResourceResponse::Found(...)))
contains a Resource whose resource_id does not match the requested resource_id;
call classify_resource_read_outcome with these mismatched ids and assert it
returns Err (and optionally that the error string mentions both ids or the
mismatch), placing the test in the same test module as the existing tests so
future changes cannot regress acceptance of mismatched resources.

In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs`:
- Around line 247-255: The loop currently re-reads and re-parses the whole
events file into a Vec each tick (using events_path, parse_watch_event_line, and
limit) which is expensive; change it to stream new lines from the file (use a
BufReader with seek to the previous read position or tail via File::open and
read_to_end incrementally) and append parsed events to a bounded in-memory
buffer (use a VecDeque or retain only the last limit items) instead of
re-materializing the entire file on every refresh; ensure you still call
parse_watch_event_line per incoming line and drop older entries so events never
grows beyond limit.
- Around line 251-253: The code currently swallows parse errors by using
.filter_map(|line| parse_watch_event_line(line).ok()) which hides malformed
event lines; replace this with handling the Result from parse_watch_event_line
so that Err cases emit a structured warn (including the offending line and any
surrounding context/path) and only Ok values are collected into the Vec;
specifically update the iterator chain that calls parse_watch_event_line to log
with warn!(…) on Err while collecting Ok(event) into the final collection (keep
parse_watch_event_line and the existing collect::<Vec<_>>() usage).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 30f478ce-a3f8-44b5-980a-52ddaebbc7ff

📥 Commits

Reviewing files that changed from the base of the PR and between 8ae9daf and 8f95c51.

📒 Files selected for processing (10)
  • crates/allocdb-node/src/bin/allocdb-jepsen.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/events.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/support.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs
  • docs/jepsen-refactor-plan.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.

Files:

  • docs/jepsen-refactor-plan.md
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: error for invariant breaks, corruption, and failed operations that require intervention; warn for degraded but expected conditions such as overload, lag, or rejected requests; info for meaningful lifecycle and state-transition events; debug for detailed execution traces useful in development; trace only for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.

Files:

  • crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/support.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/events.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs
  • crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs
🧠 Learnings (4)
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.md : Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.

Applied to files:

  • docs/jepsen-refactor-plan.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to docs/status.md : Keep [`docs/status.md`](./docs/status.md) current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.

Applied to files:

  • docs/jepsen-refactor-plan.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Keep the system buildable after each chunk whenever feasible. If a larger refactor is required, stage it as a short sequence of intermediate, verifiable steps.

Applied to files:

  • docs/jepsen-refactor-plan.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.

Applied to files:

  • crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs
🧬 Code graph analysis (4)
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (5)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (1)
  • sanitize_run_id (121-135)
crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (1)
  • current_time_millis (128-133)
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (2)
  • new (29-38)
  • new (165-206)
crates/allocdb-node/src/replica.rs (2)
  • paths (974-976)
  • status (959-961)
crates/allocdb-node/src/jepsen_tests.rs (1)
  • temp_dir (16-22)
crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (5)
crates/allocdb-node/src/jepsen.rs (5)
  • analyze_history (744-753)
  • create_artifact_bundle (784-833)
  • load_history (686-690)
  • persist_history (698-703)
  • render_analysis_report (756-775)
crates/allocdb-node/src/bin/allocdb-jepsen/scenarios.rs (1)
  • execute_control_run (19-41)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (1)
  • ensure_runtime_cluster_ready (54-104)
crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (3)
  • workspace_root (703-705)
  • new (654-695)
  • backend_name (699-701)
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (2)
  • new (29-38)
  • new (165-206)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
crates/allocdb-node/src/kubevirt_testbed.rs (2)
  • kubevirt_testbed_layout_path (162-164)
  • map (376-390)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (2)
  • build_remote_tcp_probe_command (40-49)
  • encode_hex (112-119)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (3)
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (6)
  • live_runtime_replica_matching (95-107)
  • render_runtime_probe_summary (109-139)
  • runtime_probe_is_active (65-70)
  • runtime_replica_probes_with_live_roles (46-63)
  • summarize_runtime_probes (72-93)
  • probes (110-137)
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (5)
  • copy_file_or_remove (309-329)
  • prepare_log_path_for (296-307)
  • new (27-33)
  • new (55-61)
  • new (109-136)
crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (1)
  • replica_role_label (842-850)
🔇 Additional comments (4)
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (1)

442-446: Add a round-trip regression test for tracker-field escaping.

decode_tracker_field() now has custom handling for \\ and \n, but this file still has no test pinning the encode_tracker_field()/decode_tracker_field() symmetry. A case like C:\tmp\foo plus a multiline error would catch the exact regression fixed here.

As per coding guidelines, **/*.rs: "Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage."
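The escaping symmetry at stake can be sketched outside the codebase. These reimplementations are illustrative, not the real `encode_tracker_field`/`decode_tracker_field` from `tracker.rs`, but they show the invariant the requested round-trip test would pin: backslashes must be escaped before newlines so a Windows-style path plus a multiline error decodes back exactly:

```rust
// Hedged sketch of a line-safe text codec: escape '\' first, then '\n', so
// decode can unambiguously reverse both. Ordering matters — escaping '\n'
// first would let the inserted backslashes be double-escaped incorrectly.
fn encode_tracker_field(value: &str) -> String {
    value.replace('\\', "\\\\").replace('\n', "\\n")
}

fn decode_tracker_field(encoded: &str) -> String {
    let mut out = String::with_capacity(encoded.len());
    let mut chars = encoded.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('n') => out.push('\n'),
                Some('\\') => out.push('\\'),
                // Preserve unknown escapes rather than dropping bytes.
                Some(other) => {
                    out.push('\\');
                    out.push(other);
                }
                None => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    // The exact case the review calls out: a backslashed path plus a
    // multiline error, including a literal "\n" two-character sequence.
    let tricky = "C:\\tmp\\foo\nsecond line with literal \\n";
    let encoded = encode_tracker_field(tricky);
    assert!(!encoded.contains('\n'), "encoded field must stay single-line");
    assert_eq!(decode_tracker_field(&encoded), tricky);
    println!("ok");
}
```

A test in this shape, placed next to the real codec, would catch any future asymmetry between the two functions.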

crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (1)

376-385: The control-run regression test still doesn't exercise the env override path.

fault_window_override_does_not_change_control_runs() never sets FAULT_WINDOW_OVERRIDE_ENV, so it only checks the default case. The override bug can come back without this test ever failing.

As per coding guidelines, **/*.rs: "Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage."

crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs (2)

59-73: One-shot watch exit is now correct.

Line 71–Line 73 cleanly enforce single-iteration behavior when follow is false, so the one-shot path no longer hangs.


100-127: Lane preparation error handling is materially improved.

Line 103 preserves the real preparation error for rendering, and keeping contexts[index] as None on failure allows retry on the next loop.

@skel84
Owner Author

skel84 commented Mar 15, 2026

Applied the remaining review-driven fixes:

  • reap spawned kubectl children on stdin-write failures
  • persist completion metadata before minimum-window failure reporting
  • add watcher regression tests and stabilize single-watch event rendering
  • add module-local tracker/event regression coverage
  • refresh the refactor plan module map

Validation rerun: ./scripts/preflight.sh

@skel84 skel84 merged commit 3eb9299 into main Mar 15, 2026
2 checks passed
@skel84 skel84 deleted the issue-70-jepsen-refactor branch March 15, 2026 20:38