refactor(jepsen): split CLI into focused modules#72
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
📜 Recent review details🧰 Additional context used📓 Path-based instructions (2)**/*.md📄 CodeRabbit inference engine (AGENTS.md)
Files:
**/*.rs📄 CodeRabbit inference engine (AGENTS.md)
Files:
🧠 Learnings (4)📚 Learning: 2026-03-12T15:18:53.086ZApplied to files:
📚 Learning: 2026-03-12T15:18:53.086ZApplied to files:
📚 Learning: 2026-03-12T15:18:53.086ZApplied to files:
📚 Learning: 2026-03-12T15:18:53.086ZApplied to files:
🧬 Code graph analysis (4)crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (9)
crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs (6)
crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (4)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
🔇 Additional comments (7)
Summary by CodeRabbit
WalkthroughRefactors the monolithic allocdb-jepsen binary into many focused modules (args, cluster, common, events, kubevirt, nemesis, remote, runs, runtime, scenarios, support, surface, tracker, watch, watch_render), adding CLI parsing, test orchestration, QEMU/KubeVirt integrations, a terminal watch UI, run tracking, and supporting utilities; documentation updated with a Jepsen refactor plan. Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
Actionable comments posted: 13
🧹 Nitpick comments (7)
docs/jepsen-refactor-plan.md (1)
20-24: Optional: Consider varying sentence structure in Non-Goals.Three consecutive bullets start with "No" (lines 20-22), which creates slight repetition. This is a minor stylistic point and doesn't affect clarity.
✍️ Optional rewrite for variety
## Non-Goals -- No new Jepsen workloads or backend features. -- No Hetzner backend work in this task. -- No analyzer/history model redesign in [`jepsen.rs`](../crates/allocdb-node/src/jepsen.rs) +- New Jepsen workloads or backend features are out of scope. +- Hetzner backend work is deferred. +- Analyzer/history model redesign in [`jepsen.rs`](../crates/allocdb-node/src/jepsen.rs) unless a minimal extraction requires a purely mechanical move. -- No dedicated `allocdb-validation` crate yet. +- Dedicated `allocdb-validation` crate is deferred to a later decision point.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/jepsen-refactor-plan.md` around lines 20 - 24, Edit the "Non-Goals" bullet list to reduce repetition by rephrasing at least two of the three consecutive bullets that start with "No" (for example, change one to "Does not include ..." and another to "This task will not cover ...") while keeping the same meanings; specifically update the bullets referencing "Jepsen workloads or backend features", "Hetzner backend work", and the mention of "`jepsen.rs`" so that one or two use alternative sentence starts for variety but retain the original constraints and links.crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs (1)
138-144: Kubevirt helper guard is held but named_helper.The
_helpervariable holds aKubevirtHelperGuardthat likely performs cleanup on drop. The underscore prefix conventionally suggests an unused variable, but here it's essential to keep the guard alive. Consider renaming to_kubevirt_helperor adding a comment to clarify intent.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs` around lines 138 - 144, The local binding _helper in verify_kubevirt_surface holds a KubevirtHelperGuard returned by prepare_kubevirt_helper and must be kept alive for its Drop cleanup, but the underscore prefix falsely signals it's unused; rename _helper to a clear binding like kubevirt_helper (or _kubevirt_helper) and/or add an inline comment above the binding noting "keep guard alive for cleanup on drop" so the intent is explicit while still preventing unused-variable warnings.crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (1)
46-60: Side effect:runtime_replica_probesmutates cloned replica configs.The function clones each replica and updates
replica.rolebased on remote status. This is a useful enrichment but the function name doesn't hint at mutation. Consider documenting this behavior or renaming toruntime_replica_probes_with_updated_roles.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs` around lines 46 - 60, The function runtime_replica_probes currently clones each replica and mutates the clone's role based on request_remote_control_status before returning RuntimeReplicaProbe, which is a surprising side-effect; either rename the function to runtime_replica_probes_with_updated_roles or document the mutation in the function's docstring and public API, and ensure callers expecting unchanged replica data are updated to use the new name or behavior; locate runtime_replica_probes, the mapping that sets replica.role from status (via request_remote_control_status) and change the identifier and/or add a comment explaining the enrichment so the mutation is explicit to users.crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (1)
77-80: Tracker errors silently ignored inHistoryBuilder::push.If
set_history_eventsorappend_eventfails, the error is discarded withlet _ =. This could hide I/O failures during history recording. Consider logging atwarnlevel when these operations fail.🔧 Proposed improvement
if let Some(tracker) = &self.tracker { - let _ = tracker.set_history_events(self.next_sequence); - let _ = tracker.append_event(&format!("history sequence={}", self.next_sequence)); + if let Err(e) = tracker.set_history_events(self.next_sequence) { + log::warn!("failed to update history events count: {e}"); + } + if let Err(e) = tracker.append_event(&format!("history sequence={}", self.next_sequence)) { + log::warn!("failed to append history event: {e}"); + } }As per coding guidelines: "Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level:
warnfor degraded but expected conditions."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-jepsen/support.rs` around lines 77 - 80, The calls to tracker.set_history_events and tracker.append_event in HistoryBuilder::push currently swallow errors with `let _ =`, hiding failures; change these to check the Result and log a warning on Err using the project's logging facility (e.g., log::warn! or tracing::warn!), referencing the same context (include the sequence number and the operation name). Specifically, replace the `let _ = tracker.set_history_events(self.next_sequence)` and `let _ = tracker.append_event(...)` with error-aware handling that logs a warn message containing the error and the history sequence (and continue operation if desired).crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (1)
273-276: Event detail field is not escaped.Similar to
last_error, thedetailstring is written directly without escaping. A detail containing=or newlines would make the log harder to parse reliably.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs` around lines 273 - 276, The append_event function writes detail directly into the log causing parsing issues for values containing '=' or newlines; update append_event (and the formatted line creation using events_path and append_text_line) to escape or serialize the detail string (e.g., replace newlines, escape '=' and backslashes, or use a quoted/percent-encoded representation) before embedding it in format!("time_millis={} detail={detail}\n", ...), mirroring how last_error is handled so the output remains unambiguous and parseable.crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs (1)
429-434: JepsenExpiredReservation constructed inline multiple times.The same
JepsenExpiredReservationis constructed identically in multiple places within the same function. Consider extracting to a local variable for clarity.Also applies to: 702-707
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs` around lines 429 - 434, The code repeatedly constructs the same JepsenExpiredReservation inline; extract a single local variable (e.g., let expired_reservation = JepsenExpiredReservation { resource_id, holder_id: 606, reservation_id: reserve_commit.reservation_id.get(), released_lsn: None };) near the first use and reuse that variable for subsequent places (including the other occurrence around lines 702-707) instead of reconstructing it to improve clarity and avoid duplication; ensure you reference the same reservation_id expression (reserve_commit.reservation_id.get()) and keep released_lsn as None.crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (1)
327-329: The hardcoded/var/lib/allocdb-qemupath is intentional and used for both QEMU and KubeVirt backends, but the "qemu" naming is misleading.The
fetch_external_logs_archivefunction is generic acrossExternalTestbedimplementations and is called by both backends. The directory path is consistent with theallocdb-qemu-controlscript, which also names itself with "qemu" despite being deployed to both QEMU VMs and KubeVirt Pods. The directory is created on-demand by the control script'scollect-logsfunction viamkdir -p.Consider renaming the script and base directory to use a backend-agnostic name (e.g.,
allocdb-control,/var/lib/allocdb-control) to clarify that both backends share the same design, or add a comment documenting why "qemu" is retained as the naming convention across both backends.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs` around lines 327 - 329, The hardcoded path assigned to remote_dir inside fetch_external_logs_archive uses "/var/lib/allocdb-qemu" which is misleading for a backend-agnostic routine; either rename the directory and related script references to a neutral name (e.g., "/var/lib/allocdb-control" and "allocdb-control") across the codebase and deployment scripts, or add a clarifying comment where remote_dir is built (and in the allocdb-qemu-control script) explaining that the "qemu" name is historical and intentionally shared by both QEMU and KubeVirt backends; update any references to allocdb-qemu-control/remote_dir in deploy/collect-logs logic to match the chosen approach so names remain consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/allocdb-node/src/bin/allocdb-jepsen/args.rs`:
- Around line 76-79: The match arm for "plan" currently returns
ParsedCommand::Plan without validating any trailing arguments; update the
parsing logic so that when subcommand == "plan" you verify there are no extra
CLI arguments (e.g., check the remaining args iterator or slice used by the
parser) and return an Err (parse error) if any unknown/trailing flags are
present instead of accepting them silently; modify the "plan" arm that now
returns ParsedCommand::Plan to perform that check before returning to ensure
typos like `allocdb-jepsen plan --workspace foo` fail fast.
- Around line 71-115: Add comprehensive negative-path unit tests for the CLI by
exercising parse_args and parse_watch_kubevirt_lane_spec: assert that parse_args
returns Err for unknown subcommands (e.g., "foo"), returns Help when no
subcommand is provided, and that commands requiring flags (e.g., Analyze,
VerifyQemuSurface/VerifyKubevirtSurface) return Err when their required flags
are missing or malformed; separately add tests for
parse_watch_kubevirt_lane_spec to assert Err on empty parts, too few
comma-separated fields, or whitespace-only fields and Ok on well-formed
"<name,workspace,output-root>" inputs. Use the public function names parse_args
and parse_watch_kubevirt_lane_spec to locate code and cover all reported
negative cases and regression scenarios.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs`:
- Around line 401-539: Add comprehensive tests covering perform_failover and
perform_rejoin to pin down source selection, target_commit_lsn derivation,
staged rewrites, and role/view transitions: create unit/integration tests that
use StagedReplicaWorkspace and staged_replica_summary to craft scenarios (higher
LSN on supporting vs new_primary, equal commit LSN but higher apply LSN, missing
LSNs) and assert the chosen source, computed target_commit_lsn, and new_view
returned by perform_failover; assert that rewrite_replica_from_source produced
expected staged summaries and that after import/restart
wait_for_runtime_replica_role observes ReplicaRole::Primary/Backup as expected;
also add negative-path/recovery tests (e.g., missing source, conflicting
summaries, failed import/restart) to ensure errors are propagated and invariants
hold for perform_rejoin (check it uses source_summary fields correctly and sets
target view/commit LSN), placing tests alongside other allocdb-node tests for
*.rs per guidelines.
- Around line 40-42: The loop currently swallows all errors from heal_replica
(and similarly in the later remediation calls around lines 56–69); update the
calls to handle and log failures instead of ignoring them: call
heal_replica(layout, replica.replica_id), check its Result, and on Err emit an
error-level log via the crate logger that includes the replica.replica_id, the
layout id/context, and the error details (use debug/span context if available);
do the same for the other remediation functions referenced in that region so
failures are visible rather than producing only a later “did not converge”
timeout.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/events.rs`:
- Around line 299-308: In classify_resource_read_outcome, ensure the
ResourceResponse::Found actually matches the requested resource_id: inside the
RemoteApiOutcome::Api(ApiResponse::GetResource(ResourceResponse::Found(resource)))
arm compare resource.id (or the actual identifier field on resource) with the
function parameter resource_id and if they differ return an error (e.g., Err
with a descriptive message) instead of treating it as a successful read; only
construct ResourceReadObservation::Available or ::Held when the ids match.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs`:
- Around line 403-445: prepare_kubevirt_helper_stage_dir and
chmod_kubevirt_helper_ssh_key interpolate user-controlled helper_stage_dir into
sh -lc strings unquoted, allowing shell injection; update the format!() calls in
those functions to wrap the stage_dir in single quotes (and escape any embedded
single quotes in the path) so the shell treats it as a single literal argument
(e.g., change the command templates used by prepare_kubevirt_helper_stage_dir
and chmod_kubevirt_helper_ssh_key to quote {stage_dir} and
{stage_dir}/id_ed25519 respectively); copy_kubevirt_helper_ssh_key needs no
change because it passes the path as a kubectl arg.
- Around line 149-157: The phase-detection currently treats any non-zero kubectl
get pod exit as Ok(None) which hides real kubectl failures; update
kubevirt_helper_phase to return Err for non-zero kubectl exit codes (propagating
the underlying error) instead of Ok(None), and ensure the caller (the prepare
function that calls kubevirt_helper_phase) propagates that error rather than
treating None as "pod missing"; keep the existing logic that when phase is
Some("Running") it skips recreate, and continue using
delete_kubevirt_helper_pod, apply_kubevirt_helper_pod, and
kubevirt_helper_should_delete_on_drop unchanged — only change
kubevirt_helper_phase's error behavior and the caller to propagate errors.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs`:
- Around line 137-178: The qemu_ssh_args used by run_qemu_remote_host_command is
missing SSH options that prevent hangs/prompts and noisy logs; update the
qemu_ssh_args function to include the same fail-fast flags used in
kubevirt_ssh_args (add "ConnectTimeout=5", "BatchMode=yes", and
"LogLevel=ERROR") so SSH will time out, avoid interactive prompts, and suppress
verbose output; also add these options to the ssh_args definition in
allocdb-qemu-testbed.rs for consistency.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs`:
- Around line 214-219: The code currently calls tracker.complete(...) and then,
when report.release_gate_passed() is false, calls
tracker.fail(RunTrackerPhase::Completed, ...) which can overwrite the state;
update the logic in the block handling the end-of-run so that you call either
tracker.complete(...) (which should set success/failure based on
report.release_gate_passed()) or tracker.fail(...) but not both—specifically
remove the redundant tracker.fail(RunTrackerPhase::Completed, "Jepsen release
gate is blocked") when tracker.complete(...) has already been invoked, or
alternatively adjust tracker.complete to return/propagate the error so you only
call fail when complete was not called.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs`:
- Around line 360-366: The current serialization of snapshot.last_error inside
the format! call emits raw multiline text which breaks line-based parsing in
decode_run_status_snapshot; update the serialization in the format! that writes
last_error to either escape newlines (e.g., replace '\n' with '\\n') or encode
the field as base64, and update decode_run_status_snapshot to perform the
corresponding unescape/base64-decode so parsing remains line-oriented and
robust; make sure to modify the references to snapshot.last_error (the format!
writer) and the decode_run_status_snapshot reader together so they remain
symmetric.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs`:
- Around line 23-27: The render currently unconditionally emits full-screen ANSI
sequences (see the print!("\x1B[2J\x1B[H") and related println! calls
surrounding watch_color_enabled, watch_spinner_frame, and watch_pulse_frame),
which pollutes redirected output; fix by detecting whether stdout is a TTY
(e.g., using atty::is(atty::Stream::Stdout) or an equivalent helper) and only
emit the clear/reset escape sequences and any interactive-only output when that
check passes, applying the same guard to the other renderer block referenced
(lines 64-76).
In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs`:
- Around line 94-116: The code currently discards the real error by calling
try_prepare_kubevirt_watch_lane_context(spec).ok(), causing all failures to be
reported as "workspace is not ready yet"; instead capture and preserve the
original error and propagate it into the KubevirtWatchLaneSnapshot.lane_error.
Change the storage in contexts to keep the Result (or alongside the prepared
context keep an Option<String> for the error), replace .ok() with matching the
Err(e) to save e.to_string(), and in the branch where contexts[index] is None
return a KubevirtWatchLaneSnapshot whose lane_error is the preserved error
string (fallback to the existing workspace-not-ready message only if no specific
error was captured). Ensure you reference
try_prepare_kubevirt_watch_lane_context, contexts (per-index storage),
KubevirtWatchLaneSnapshot, and kubevirt_testbed_layout_path/spec.workspace_root
when making these changes.
- Around line 59-79: The loop in watch.rs never exits for one-shot because it
only breaks when a non-running snapshot appears; change the control flow so that
when follow == false you perform a single iteration and return immediately after
calling render_kubevirt_watch instead of sleeping/waiting for a state change.
Locate the loop that calls maybe_load_run_status_snapshot,
collect_replica_watch_snapshots, load_recent_run_events, and
render_kubevirt_watch and adjust it to either (a) wrap the current loop body in
an if follow { loop { ... } } else { ...single iteration... } or (b)
unconditionally render once and then if follow is true enter the repeating loop;
ensure no sleep occurs for the one-shot path.
---
Nitpick comments:
In `@crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs`:
- Around line 429-434: The code repeatedly constructs the same
JepsenExpiredReservation inline; extract a single local variable (e.g., let
expired_reservation = JepsenExpiredReservation { resource_id, holder_id: 606,
reservation_id: reserve_commit.reservation_id.get(), released_lsn: None };) near
the first use and reuse that variable for subsequent places (including the other
occurrence around lines 702-707) instead of reconstructing it to improve clarity
and avoid duplication; ensure you reference the same reservation_id expression
(reserve_commit.reservation_id.get()) and keep released_lsn as None.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs`:
- Around line 327-329: The hardcoded path assigned to remote_dir inside
fetch_external_logs_archive uses "/var/lib/allocdb-qemu" which is misleading for
a backend-agnostic routine; either rename the directory and related script
references to a neutral name (e.g., "/var/lib/allocdb-control" and
"allocdb-control") across the codebase and deployment scripts, or add a
clarifying comment where remote_dir is built (and in the allocdb-qemu-control
script) explaining that the "qemu" name is historical and intentionally shared
by both QEMU and KubeVirt backends; update any references to
allocdb-qemu-control/remote_dir in deploy/collect-logs logic to match the chosen
approach so names remain consistent.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs`:
- Around line 46-60: The function runtime_replica_probes currently clones each
replica and mutates the clone's role based on request_remote_control_status
before returning RuntimeReplicaProbe, which is a surprising side-effect; either
rename the function to runtime_replica_probes_with_updated_roles or document the
mutation in the function's docstring and public API, and ensure callers
expecting unchanged replica data are updated to use the new name or behavior;
locate runtime_replica_probes, the mapping that sets replica.role from status
(via request_remote_control_status) and change the identifier and/or add a
comment explaining the enrichment so the mutation is explicit to users.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/support.rs`:
- Around line 77-80: The calls to tracker.set_history_events and
tracker.append_event in HistoryBuilder::push currently swallow errors with `let
_ =`, hiding failures; change these to check the Result and log a warning on Err
using the project's logging facility (e.g., log::warn! or tracing::warn!),
referencing the same context (include the sequence number and the operation
name). Specifically, replace the `let _ =
tracker.set_history_events(self.next_sequence)` and `let _ =
tracker.append_event(...)` with error-aware handling that logs a warn message
containing the error and the history sequence (and continue operation if
desired).
In `@crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs`:
- Around line 138-144: The local binding _helper in verify_kubevirt_surface
holds a KubevirtHelperGuard returned by prepare_kubevirt_helper and must be kept
alive for its Drop cleanup, but the underscore prefix falsely signals it's
unused; rename _helper to a clear binding like kubevirt_helper (or
_kubevirt_helper) and/or add an inline comment above the binding noting "keep
guard alive for cleanup on drop" so the intent is explicit while still
preventing unused-variable warnings.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs`:
- Around line 273-276: The append_event function writes detail directly into the
log causing parsing issues for values containing '=' or newlines; update
append_event (and the formatted line creation using events_path and
append_text_line) to escape or serialize the detail string (e.g., replace
newlines, escape '=' and backslashes, or use a quoted/percent-encoded
representation) before embedding it in format!("time_millis={}
detail={detail}\n", ...), mirroring how last_error is handled so the output
remains unambiguous and parseable.
In `@docs/jepsen-refactor-plan.md`:
- Around line 20-24: Edit the "Non-Goals" bullet list to reduce repetition by
rephrasing at least two of the three consecutive bullets that start with "No"
(for example, change one to "Does not include ..." and another to "This task
will not cover ...") while keeping the same meanings; specifically update the
bullets referencing "Jepsen workloads or backend features", "Hetzner backend
work", and the mention of "`jepsen.rs`" so that one or two use alternative
sentence starts for variety but retain the original constraints and links.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 127c87e6-72f8-4229-89e2-5a8b551f190d
📒 Files selected for processing (19)
crates/allocdb-node/src/bin/allocdb-jepsen.rscrates/allocdb-node/src/bin/allocdb-jepsen/args.rscrates/allocdb-node/src/bin/allocdb-jepsen/cluster.rscrates/allocdb-node/src/bin/allocdb-jepsen/common.rscrates/allocdb-node/src/bin/allocdb-jepsen/events.rscrates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rscrates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rscrates/allocdb-node/src/bin/allocdb-jepsen/remote.rscrates/allocdb-node/src/bin/allocdb-jepsen/runs.rscrates/allocdb-node/src/bin/allocdb-jepsen/runtime.rscrates/allocdb-node/src/bin/allocdb-jepsen/scenarios.rscrates/allocdb-node/src/bin/allocdb-jepsen/support.rscrates/allocdb-node/src/bin/allocdb-jepsen/surface.rscrates/allocdb-node/src/bin/allocdb-jepsen/tracker.rscrates/allocdb-node/src/bin/allocdb-jepsen/watch.rscrates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rsdocs/README.mddocs/jepsen-refactor-plan.mddocs/status.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.
Files:
docs/status.mddocs/jepsen-refactor-plan.mddocs/README.md
docs/status.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep
docs/status.mdcurrent as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.
Files:
docs/status.md
**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level:errorfor invariant breaks, corruption, and failed operations that require intervention;warnfor degraded but expected conditions such as overload, lag, or rejected requests;infofor meaningful lifecycle and state-transition events;debugfor detailed execution traces useful in development;traceonly for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.
Files:
crates/allocdb-node/src/bin/allocdb-jepsen/scenarios.rscrates/allocdb-node/src/bin/allocdb-jepsen/tracker.rscrates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rscrates/allocdb-node/src/bin/allocdb-jepsen/args.rscrates/allocdb-node/src/bin/allocdb-jepsen/surface.rscrates/allocdb-node/src/bin/allocdb-jepsen/common.rscrates/allocdb-node/src/bin/allocdb-jepsen/runs.rscrates/allocdb-node/src/bin/allocdb-jepsen/runtime.rscrates/allocdb-node/src/bin/allocdb-jepsen/remote.rscrates/allocdb-node/src/bin/allocdb-jepsen/cluster.rscrates/allocdb-node/src/bin/allocdb-jepsen/support.rscrates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rscrates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rscrates/allocdb-node/src/bin/allocdb-jepsen/events.rscrates/allocdb-node/src/bin/allocdb-jepsen/watch.rs
🧠 Learnings (2)
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to docs/status.md : Keep [`docs/status.md`](./docs/status.md) current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.
Applied to files:
docs/status.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Keep the system buildable after each chunk whenever feasible. If a larger refactor is required, stage it as a short sequence of intermediate, verifiable steps.
Applied to files:
docs/jepsen-refactor-plan.md
🧬 Code graph analysis (7)
crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs (4)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (6)
first_backup_replica(110-118)heal_replica(188-204)isolate_replica(170-186)maybe_crash_replica(131-150)perform_failover(401-491)perform_rejoin(493-540)crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (2)
backup_process_name(381-383)primary_process_name(377-379)crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (1)
unique_probe_resource_id(135-139)crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (3)
new(27-33)new(55-61)new(95-122)
crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (1)
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (2)
as_str(77-83)as_str(107-117)
crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (8)
crates/allocdb-node/src/jepsen.rs (6)
analyze_history(744-753)create_artifact_bundle(784-833)load_history(686-690)persist_history(698-703)release_gate_plan(109-154)render_analysis_report(756-775)crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
load_kubevirt_layout(140-144)prepare_kubevirt_helper(146-169)crates/allocdb-node/src/bin/allocdb-jepsen/scenarios.rs (2)
execute_control_run(19-41)execute_crash_restart_run(43-86)crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs (1)
execute_mixed_failover_run(64-102)crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (2)
load_qemu_layout(14-18)sanitize_run_id(121-135)crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (1)
ensure_runtime_cluster_ready(39-73)crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs (7)
workspace_root(28-28)workspace_root(48-50)workspace_root(79-81)verify_external_surface(146-211)backend_name(27-27)backend_name(44-46)backend_name(75-77)crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (2)
new(29-38)new(165-206)
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (3)
crates/allocdb-node/src/local_cluster.rs (1)
decode_control_status_response(782-786)crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (1)
replica_role_label(817-825)crates/allocdb-node/src/replica.rs (1)
status(959-961)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (3)
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (5)
render_runtime_probe_summary(106-136)runtime_probe_is_active(62-67)runtime_replica_probes(46-60)summarize_runtime_probes(69-90)probes(107-134)crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (7)
copy_file_or_remove(277-297)prepare_log_path_for(264-275)run_remote_control_command(177-188)new(27-33)new(55-61)new(95-122)from_export(124-136)crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (1)
replica_role_label(817-825)
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (2)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (1)
sanitize_run_id(121-135)crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (1)
current_time_millis(128-133)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
crates/allocdb-node/src/kubevirt_testbed.rs (1)
kubevirt_testbed_layout_path(162-164)crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (2)
build_remote_tcp_probe_command(40-49)encode_hex(112-119)
🪛 LanguageTool
docs/jepsen-refactor-plan.md
[style] ~22-~22: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...No Hetzner backend work in this task. - No analyzer/history model redesign in [`je...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
🔇 Additional comments (24)
docs/status.md (1)
218-220: LGTM! Status narrative updated to reflect the refactor plan.The updated text correctly captures the immediate next step (issue
#70Jepsen refactor) and properly defers the larger architectural decision (dedicated validation crate vs. Hetzner follow-on) until after the maintainability improvement is complete. This aligns well with the newjepsen-refactor-plan.mddocument.docs/jepsen-refactor-plan.md (1)
1-121: LGTM! Well-structured refactor plan.The document clearly defines goals, constraints, and a phased execution plan. The emphasis on keeping the repo buildable after each slice (line 15) and including validation steps for each slice aligns well with the learning about incremental, verifiable steps. The decision point at the end (lines 116-121) appropriately defers the crate extraction choice until after the internal seams are proven.
docs/README.md (1)
21-21: LGTM! Documentation index updated.The new Jepsen Refactor Plan entry is properly added to the Engineering Docs section. The placement between Fault Model and Operator Runbook is reasonable, though you could also consider positioning it adjacent to "KubeVirt Jepsen Report" (line 23) since both documents relate to Jepsen tooling.
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (3)
69-81: Sequence starts at 1, not 0 after firstpush.In
HistoryBuilder::push(referenced from support.rs),next_sequenceis incremented before assigning to the event. If initialized withnext_sequence = 0, the first event getssequence = 1. This appears intentional but verify it matches the expected Jepsen history indexing convention.
28-38: LGTM: Unique ID generation is well-designed.The combination of milliseconds timestamp with an atomic nonce provides good uniqueness guarantees. The use of
saturating_mulandsaturating_addprevents overflow panics.
164-206: LGTM: RunTracker initialization is robust.Proper directory creation, atomic file writes, and initial status snapshot emission. The error messages are descriptive with path context.
crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs (2)
146-210: LGTM: Comprehensive surface verification logic.The verification flow is well-structured:
- Ensures cluster readiness
- Probes each replica's metrics and protocol endpoints
- Validates write/read roundtrip on primary
Error messages include backend name and replica ID for debugging.
26-41: LGTM: ExternalTestbed trait design.Clean abstraction over backend differences. The trait provides the minimal surface needed for Jepsen operations while keeping backend-specific logic encapsulated.
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (3)
153-157: LGTM: Safe Drop implementation for staged workspace.Silently ignoring
remove_dir_allfailure is acceptable here since cleanup failures during drop shouldn't panic. The temp directory will be cleaned up by OS eventually.
159-175: LGTM: Robust temp directory naming.Combines prefix, PID, atomic sequence counter, and milliseconds to ensure uniqueness across concurrent processes and within the same process.
277-297: LGTM:copy_file_or_removehandles NotFound gracefully.When source doesn't exist, the destination is cleaned up if it exists. This prevents stale files from persisting when the source is absent.
crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (3)
93-110: LGTM: Atomic file write implementation.Correctly implements write-to-temp, sync, rename pattern for crash-safe file updates. Also handles cleanup on rename failure.
135-139: Verify:unique_probe_resource_idbit layout.The ID is
(millis << 32) | nonce. With millis being a u128 timestamp (~44 bits needed for current epoch), the left-shift by 32 is safe. The nonce occupies the lower 32 bits. This provides good uniqueness but verify callers don't assume any particular bit layout.
46-75: LGTM: Optional parsing helpers are consistent.All optional parsers handle the "none" sentinel uniformly and provide clear error messages including the invalid value.
crates/allocdb-node/src/bin/allocdb-jepsen/scenarios.rs (3)
19-41: LGTM: Clean dispatch pattern for control runs.Properly resolves primary/backup replicas and dispatches to workload-specific handlers. Error propagation is consistent.
43-86: LGTM: Crash-restart scenarios properly orchestrate failover.The pattern of identifying primary, failover target, and supporting backup before dispatching to handlers is consistent and correct.
248-251: Consistent error handling for expected committed reserves.All workloads that require a committed reserve use
ok_or_elsewith descriptive error messages indicating which scenario expected the commit. This is good defensive programming.Also applies to: 331-333, 533-535, 599-601
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (2)
106-136: LGTM: Compact probe summary rendering.The rendering handles all three states (active, faulted, down) with appropriate truncation of long error messages to the first line. Good for operational visibility.
69-90: LGTM: Topology summarization is correct.Only counts active replicas and correctly distinguishes primaries from backups. Non-active roles are ignored as expected.
crates/allocdb-node/src/bin/allocdb-jepsen/nemesis.rs (2)
74-76: LGTM: Clear error for unsupported combination.Returns an explicit error when
ReservationContentionis used with mixed failover runs, which is documented as undefined for this workload.
104-180: LGTM: Partition-heal reservation contention scenario.Well-structured sequence: create resource → isolate backups → reserve → heal → crash primary → failover → retry reserve → second reserve → rejoin. The orchestration correctly exercises the partition-heal nemesis.
crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (3)
133-220: LGTM: Comprehensive run orchestration flow.The
run_externalfunction properly sequences all phases: surface verification → execution → persistence → analysis → archiving → completion. Error handling at each phase updates the tracker appropriately.
222-277: LGTM: Fault window iteration loop is well-designed.The loop correctly:
- Ensures cluster readiness before each nemesis iteration
- Creates fresh request namespaces per iteration
- Validates non-empty history output
- Tracks sequence numbers across iterations
- Respects fault window duration constraints
253-258: Good defensive check for empty iteration history.Explicitly failing when an iteration produces no events prevents silent test failures and aids debugging.
Apply the substantive CodeRabbit feedback for the refactor PR: tighten CLI validation, improve tracker/watch encoding and one-shot behavior, harden KubeVirt helper handling, add failover plan coverage, and make QEMU SSH helpers fail fast. Validation: ./scripts/preflight.sh Refs #70
|
Addressed the substantive CodeRabbit follow-ups: tightened CLI negative-path handling/tests, hardened tracker/watch encoding and one-shot behavior, improved KubeVirt helper error handling and quoting, added failover-plan coverage, and added fail-fast SSH options for QEMU helpers. Reran: ./scripts/preflight.sh |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (2)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (1)
166-171: Route lifecycle events throughloginstead ofeprintln!.These state-transition messages bypass the configured logger, so they will not honor filters/formatting and are harder to correlate with the
log::error!output already used earlier in this file.begin/completeevents look likeinfo!; the detailed plan summaries look likedebug!.As per coding guidelines,
**/*.rs: "Add extensive logging where it materially improves debuggability or operational clarity." and "Logging must be structured and purposeful."Also applies to: 187-192, 205-210, 223-228, 353-366, 374-384, 478-553, 563-603
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs` around lines 166 - 171, Replace eprintln! lifecycle messages in cluster.rs with structured log macros so they go through the configured logger; for the shown call that prints backend={}, workspace={}, replica={} (using layout.backend_name(), layout.workspace_root(), replica_id.get()) switch to info!(...) for begin/complete events and debug!(...) for detailed plan summaries, preserving the same message text and fields, and apply the same change to the other occurrences you flagged (lines around the calls that use layout.backend_name(), layout.workspace_root(), replica_id.get(), and similar site-specific variables at the ranges you noted) so all lifecycle output respects filters/formatting and correlates with existing log::error! usage.crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (1)
94-200: Add focused regression tests for the outcome-mapping helpers.
map_reserve_submit_response,classify_resource_read_outcome, anddrain_expiration_until_resource_availablenow define how Jepsen history is recorded. Please pin the negative/recovery branches here—textnot primaryfallback, mismatchedresource_id, result-code mapping, and drain retry exhaustion—with unit tests so these semantics do not drift silently.As per coding guidelines,
**/*.rs: "Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage."Also applies to: 294-338, 404-464, 482-631
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-jepsen/events.rs` around lines 94 - 200, The PR lacks focused regression/unit tests for the outcome-mapping helpers that define Jepsen history semantics; add tests that pin negative/recovery branches for map_reserve_submit_response, classify_resource_read_outcome, and drain_expiration_until_resource_available: specifically, write tests asserting the "not primary" Text fallback path, mismatched resource_id handling in map_reserve_submit_response, explicit result-code -> JepsenEventOutcome mappings for classify_resource_read_outcome (including rejection codes), and exhaustion/retry behavior for drain_expiration_until_resource_available; place tests alongside the module (covering the functions by name) and include negative-path, recovery, and invariant assertions so these behaviors cannot regress.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs`:
- Around line 494-520: kubevirt_helper_ssh_args builds SSH arguments for the
helper pod but is missing the non-interactive flag BatchMode=yes; update
kubevirt_helper_ssh_args to include the "-o" "BatchMode=yes" option (same
style/location as other "-o" options) so SSH won't prompt for
passwords/passphrases and will behave like qemu_ssh_args when stdin is piped;
modify the args vector returned by kubevirt_helper_ssh_args to insert the
BatchMode=yes pair alongside the existing
StrictHostKeyChecking/UserKnownHostsFile/LogLevel/ConnectTimeout options.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs`:
- Around line 117-122: The function effective_minimum_fault_window_secs
currently applies debug_fault_window_override_secs() even when
run_spec.minimum_fault_window_secs is None; change it so the override only
applies if run_spec.minimum_fault_window_secs.is_some() — i.e., return
Some(override) only when the run already has a minimum window, otherwise return
None. Update effective_minimum_fault_window_secs to check
JepsenRunSpec.minimum_fault_window_secs first, then consult
debug_fault_window_override_secs(); also add a regression test that sets the env
override, constructs a control run with minimum_fault_window_secs = None, and
asserts the effective_minimum_fault_window_secs (and control-run behavior)
remains None/one-shot.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/support.rs`:
- Around line 212-243: run_local_tar_extract currently returns early when
child.stdin.write_all(...) fails, leaving the spawned tar process (child)
unreaped; modify the function so that on any write_all error you still wait
for/reap the child (e.g., call child.wait() or child.wait_with_output()),
capture that child's exit result (and stderr if desired), then return the
original write error (or a combined error message) after reaping; reference
run_local_tar_extract, the child variable, the write_all call, and the
subsequent wait_with_output call to locate where to add the cleanup path.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs`:
- Around line 442-463: The decoder decode_tracker_field is not symmetric with
encode_tracker_field: when it sees a backslash then another char it currently
emits both, so encoded "\\" becomes "\\\\" on decode; update
decode_tracker_field to treat "\\" as a single literal backslash (i.e., match
the next char: if 'n' push '\n', else if '\\' push '\\', else push the next char
directly), and keep the None case pushing a single trailing backslash; add a
small round-trip unit test that asserts encode_tracker_field ->
decode_tracker_field returns the original for strings containing backslashes and
newlines (e.g., "C:\\tmp\\foo" and "line1\nline2") to prevent regressions.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs`:
- Around line 212-220: The summary rendering is using decoded tracker text with
embedded newlines which breaks the watch layout; before styling/truncation
(calls around render_summary_row, watch_style, truncate_for_watch with
WATCH_RULE_WIDTH and fields snapshot.detail, snapshot.last_error and similar
lane-note/recent-event paths), collapse multiline text into a single-line (e.g.,
replace '\n' with ' ' or take the first line) and then pass that flattened
string into truncate_for_watch and watch_style so the renderer stays stable;
apply the same fix to the other occurrences referenced in the comment (summary,
lane-note, recent-event code paths).
In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs`:
- Around line 96-119: The loop currently stores Some(Err(...)) in contexts when
try_prepare_kubevirt_watch_lane_context fails, making the failure permanent;
instead, change the logic so that try_prepare_kubevirt_watch_lane_context errors
are not saved as Some(Err) but leave contexts[index] as None (so the next loop
iteration will retry), and only set contexts[index]=Some(Ok(context)) on
success; adjust the match arm that handles Context Err variants (and
KubevirtWatchLaneSnapshot construction) accordingly so transient preparation
errors are treated like runtime collection errors and the watcher will retry
preparing the lane on subsequent iterations (referencing
try_prepare_kubevirt_watch_lane_context, contexts,
collect_kubevirt_watch_lane_snapshot, and KubevirtWatchLaneSnapshot).
In `@docs/jepsen-refactor-plan.md`:
- Around line 26-39: Update the "Target Shape" module map so it matches the
modules implemented in this PR rather than the stale plan: replace references to
artifacts.rs and tests.rs with the actual module names introduced (e.g.,
tracker.rs, support.rs, surface.rs, watch_render.rs, runs.rs and any other new
helpers), list
allocdb-jepsen/{tracker.rs,support.rs,surface.rs,watch_render.rs,runs.rs} with
brief role descriptions consistent with the existing bullet style, and ensure
the prose and sequencing reflect the current implementation and intended
follow-ups.
---
Nitpick comments:
In `@crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs`:
- Around line 166-171: Replace eprintln! lifecycle messages in cluster.rs with
structured log macros so they go through the configured logger; for the shown
call that prints backend={}, workspace={}, replica={} (using
layout.backend_name(), layout.workspace_root(), replica_id.get()) switch to
info!(...) for begin/complete events and debug!(...) for detailed plan
summaries, preserving the same message text and fields, and apply the same
change to the other occurrences you flagged (lines around the calls that use
layout.backend_name(), layout.workspace_root(), replica_id.get(), and similar
site-specific variables at the ranges you noted) so all lifecycle output
respects filters/formatting and correlates with existing log::error! usage.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/events.rs`:
- Around line 94-200: The PR lacks focused regression/unit tests for the
outcome-mapping helpers that define Jepsen history semantics; add tests that pin
negative/recovery branches for map_reserve_submit_response,
classify_resource_read_outcome, and drain_expiration_until_resource_available:
specifically, write tests asserting the "not primary" Text fallback path,
mismatched resource_id handling in map_reserve_submit_response, explicit
result-code -> JepsenEventOutcome mappings for classify_resource_read_outcome
(including rejection codes), and exhaustion/retry behavior for
drain_expiration_until_resource_available; place tests alongside the module
(covering the functions by name) and include negative-path, recovery, and
invariant assertions so these behaviors cannot regress.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 42a74032-05a4-47ae-ac64-ec9a8086cce3
📒 Files selected for processing (15)
crates/allocdb-node/src/bin/allocdb-jepsen.rscrates/allocdb-node/src/bin/allocdb-jepsen/args.rscrates/allocdb-node/src/bin/allocdb-jepsen/cluster.rscrates/allocdb-node/src/bin/allocdb-jepsen/events.rscrates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rscrates/allocdb-node/src/bin/allocdb-jepsen/remote.rscrates/allocdb-node/src/bin/allocdb-jepsen/runs.rscrates/allocdb-node/src/bin/allocdb-jepsen/runtime.rscrates/allocdb-node/src/bin/allocdb-jepsen/support.rscrates/allocdb-node/src/bin/allocdb-jepsen/surface.rscrates/allocdb-node/src/bin/allocdb-jepsen/tracker.rscrates/allocdb-node/src/bin/allocdb-jepsen/watch.rscrates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rscrates/allocdb-node/src/bin/allocdb-qemu-testbed.rsdocs/jepsen-refactor-plan.md
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/allocdb-node/src/bin/allocdb-jepsen/args.rs
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level:errorfor invariant breaks, corruption, and failed operations that require intervention;warnfor degraded but expected conditions such as overload, lag, or rejected requests;infofor meaningful lifecycle and state-transition events;debugfor detailed execution traces useful in development;traceonly for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.
Files:
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rscrates/allocdb-node/src/bin/allocdb-qemu-testbed.rscrates/allocdb-node/src/bin/allocdb-jepsen/support.rscrates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rscrates/allocdb-node/src/bin/allocdb-jepsen/runtime.rscrates/allocdb-node/src/bin/allocdb-jepsen/remote.rscrates/allocdb-node/src/bin/allocdb-jepsen/surface.rscrates/allocdb-node/src/bin/allocdb-jepsen/watch.rscrates/allocdb-node/src/bin/allocdb-jepsen/runs.rscrates/allocdb-node/src/bin/allocdb-jepsen/events.rscrates/allocdb-node/src/bin/allocdb-jepsen/cluster.rscrates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.
Files:
docs/jepsen-refactor-plan.md
🧠 Learnings (2)
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Keep the system buildable after each chunk whenever feasible. If a larger refactor is required, stage it as a short sequence of intermediate, verifiable steps.
Applied to files:
docs/jepsen-refactor-plan.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Applied to files:
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs
🧬 Code graph analysis (6)
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (2)
crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (17)
append_text_line(112-126)current_time_millis(128-133)parse_optional_bool(68-75)parse_optional_path(77-83)parse_optional_string(85-91)parse_optional_u64(46-55)parse_optional_usize(57-66)parse_required_u32(19-26)parse_required_u64(28-35)parse_required_u128(37-44)required_field(9-17)required_field(23-24)required_field(32-33)required_field(41-42)write_text_atomically(93-110)value(50-51)value(61-62)crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (1)
sanitize_run_id(121-135)
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (7)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (1)
sanitize_run_id(121-135)crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (1)
current_time_millis(128-133)crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (3)
new(29-38)new(165-206)slot(52-54)crates/allocdb-core/src/state_machine_metrics.rs (1)
operation(72-84)crates/allocdb-node/src/bin/allocdb-jepsen/surface.rs (3)
backend_name(27-27)backend_name(44-46)backend_name(75-77)crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
drop(46-65)args(92-104)crates/allocdb-node/src/jepsen_tests.rs (1)
temp_dir(16-22)
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (3)
crates/allocdb-node/src/local_cluster.rs (1)
decode_control_status_response(782-786)crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (1)
replica_role_label(829-837)crates/allocdb-node/src/replica.rs (1)
status(959-961)
crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs (6)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
load_kubevirt_layout(140-144)prepare_kubevirt_helper(146-169)crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (1)
request_remote_control_status(23-44)crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (6)
maybe_load_run_status_snapshot(385-402)run_events_path(299-301)run_status_path(295-297)snapshot(208-210)new(29-38)new(165-206)crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (3)
parse_watch_event_line(131-144)render_kubevirt_fleet_watch(64-114)render_kubevirt_watch(17-62)crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (2)
decode_external_api_response(99-110)send_remote_api_request(20-29)crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (3)
new(27-33)new(55-61)new(109-136)
crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (3)
crates/allocdb-node/src/api_codec.rs (1)
decode_response(165-210)crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (1)
send_remote_api_request(20-29)crates/allocdb-node/src/api.rs (1)
from_client_request(42-47)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (7)
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (5)
render_runtime_probe_summary(109-139)runtime_probe_is_active(65-70)runtime_replica_probes_with_live_roles(46-63)summarize_runtime_probes(72-93)probes(110-137)crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (3)
copy_file_or_remove(291-311)prepare_log_path_for(278-289)run_remote_control_command(191-202)crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (1)
replica_role_label(829-837)crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (1)
value(275-277)crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (2)
value(50-51)value(61-62)crates/allocdb-node/src/engine_observe.rs (1)
active_snapshot_lsn(43-45)crates/allocdb-node/src/replica.rs (2)
metadata(954-956)highest_prepared_lsn(998-1002)
🔇 Additional comments (1)
crates/allocdb-node/src/bin/allocdb-qemu-testbed.rs (1)
1026-1043: Nice SSH hardening coverage.The shared SSH arg builder now fails fast instead of hanging on prompts/timeouts, and the new unit test pins that contract in place.
Also applies to: 1190-1197
|
Addressed the remaining review follow-ups in :\n- fixed KubeVirt helper SSH parity, control-run fault-window override semantics, tar child cleanup, tracker backslash symmetry, watch retry behavior, and single-line watch rendering\n- moved cluster lifecycle output onto structured logs\n- added focused outcome/drain regression tests plus the stale refactor-plan module map update\n\nValidation rerun: preflight: cargo fmt --all --check running 5 tests test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.23s running 0 tests test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s running 77 tests test result: ok. 77 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.12s running 151 tests test result: ok. 151 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 4.75s running 48 tests test result: ok. 48 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.02s running 2 tests test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s running 4 tests test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s running 7 tests test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 8.08s running 0 tests test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s running 0 tests test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s running 0 tests test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s preflight: scripts/check_repo.sh |
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
docs/jepsen-refactor-plan.md (1)
26-43:⚠️ Potential issue | 🟡 MinorSync the module map and slice ownership with the code that landed.
The plan still omits landed modules like
cluster.rs,events.rs,kubevirt.rs,nemesis.rs,remote.rs, andscenarios.rs, and Slice 3 still describesRunExecutionContextas tracker work even though it lives insupport.rs. As written, this doc is already stale against the implementation in this PR and will point the next refactor step at the wrong seams.As per coding guidelines,
**/*.md: "Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR."Also applies to: 71-77
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/jepsen-refactor-plan.md` around lines 26 - 43, The module map and slice ownership in the "Target Shape" section are stale; update the list to include the landed modules cluster.rs, events.rs, kubevirt.rs, nemesis.rs, remote.rs, and scenarios.rs, and correct the slice ownership that currently attributes RunExecutionContext to tracker work by noting that RunExecutionContext lives in support.rs (adjust any "Slice 3" wording accordingly); ensure the new module list and slice descriptions match the actual symbols and file boundaries (e.g., allocdb-jepsen/support.rs contains RunExecutionContext) so the doc aligns with the implementation introduced in this PR.
🧹 Nitpick comments (3)
crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (1)
294-306: Add a mismatch-id regression test here.This guard is important, but the test module still doesn't cover the negative path where
GetResource(Found(...))returns a differentresource_id. One focused unit test would keep the Jepsen history from silently regressing back to accepting the wrong object.As per coding guidelines,
**/*.rs: "Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-jepsen/events.rs` around lines 294 - 306, Add a focused unit test for classify_resource_read_outcome that exercises the negative path where RemoteApiOutcome::Api(ApiResponse::GetResource(ResourceResponse::Found(...))) contains a Resource whose resource_id does not match the requested resource_id; call classify_resource_read_outcome with these mismatched ids and assert it returns Err (and optionally that the error string mentions both ids or the mismatch), placing the test in the same test module as the existing tests so future changes cannot regress acceptance of mismatched resources.crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs (2)
247-255: Avoid full-file materialization on every refresh tick.Line 247–Line 255 re-read and re-parse the entire events file each loop; with 250ms refresh this can become expensive for long runs. Stream lines and keep only a bounded tail in-memory.
♻️ Proposed refactor
- let Ok(bytes) = fs::read_to_string(&events_path) else { + let Ok(file) = fs::File::open(&events_path) else { return Vec::new(); }; - let mut events = bytes - .lines() - .filter_map(|line| parse_watch_event_line(line).ok()) - .collect::<Vec<_>>(); - if events.len() > limit { - events.drain(..events.len().saturating_sub(limit)); + let reader = std::io::BufReader::new(file); + let mut events = std::collections::VecDeque::with_capacity(limit); + for line in std::io::BufRead::lines(reader).map_while(Result::ok) { + if let Ok(event) = parse_watch_event_line(&line) { + if events.len() == limit { + events.pop_front(); + } + events.push_back(event); + } } - events + events.into_iter().collect()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs` around lines 247 - 255, The loop currently re-reads and re-parses the whole events file into a Vec each tick (using events_path, parse_watch_event_line, and limit) which is expensive; change it to stream new lines from the file (use a BufReader with seek to the previous read position or tail via File::open and read_to_end incrementally) and append parsed events to a bounded in-memory buffer (use a VecDeque or retain only the last limit items) instead of re-materializing the entire file on every refresh; ensure you still call parse_watch_event_line per incoming line and drop older entries so events never grows beyond limit.
251-253: Don’t silently discard malformed event lines.Line 252 suppresses parse failures with
.ok(), which hides format regressions and corrupt data. Emit awarnwith path/context when parsing fails.As per coding guidelines, `**/*.rs`: Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.🔎 Proposed logging improvement
- let mut events = bytes - .lines() - .filter_map(|line| parse_watch_event_line(line).ok()) - .collect::<Vec<_>>(); + let mut events = bytes + .lines() + .filter_map(|line| match parse_watch_event_line(line) { + Ok(event) => Some(event), + Err(error) => { + log::warn!( + "failed to parse watch event line from {}: {}", + events_path.display(), + error + ); + None + } + }) + .collect::<Vec<_>>();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs` around lines 251 - 253, The code currently swallows parse errors by using .filter_map(|line| parse_watch_event_line(line).ok()) which hides malformed event lines; replace this with handling the Result from parse_watch_event_line so that Err cases emit a structured warn (including the offending line and any surrounding context/path) and only Ok values are collected into the Vec; specifically update the iterator chain that calls parse_watch_event_line to log with warn!(…) on Err while collecting Ok(event) into the final collection (keep parse_watch_event_line and the existing collect::<Vec<_>>() usage).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs`:
- Around line 179-195: The stdin write path for the spawned kubectl child in the
block (inside apply_kubevirt_helper_pod / the kubectl Command that spawns
`child`) currently returns immediately on write_all() error and leaks the child;
change the error path to first capture the child, attempt to wait_with_output()
to reap the process (using the same wait-and-collect pattern as elsewhere), then
propagate an error that includes the child's stderr/exit info; do the same fix
for the other identical block around lines 338-359 so every write_all() failure
always calls child.wait_with_output() before returning.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs`:
- Around line 218-223: The run exits early when
enforce_minimum_fault_window(...) fails but skips persisting final metadata;
move or add a call to tracker.complete(&history_path, &bundle_dir,
&logs_archive, &report) so the history_file, artifact bundle, logs_archive,
blocker count and release-gate status are persisted before returning, then call
tracker.fail(RunTrackerPhase::Completed, &error) and return Err(error) as
currently done; specifically update the block that calls
enforce_minimum_fault_window, started_at.elapsed(), tracker.fail and
tracker.complete so tracker.complete runs prior to tracker.fail when
enforce_minimum_fault_window returns Err.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs`:
- Around line 534-543: In render_recent_events(), flatten the multiline detail
(flatten_watch_text(&event.detail)) and truncate it to a fixed max width (e.g.
define MAX_WATCH_DETAIL_LEN) before passing to style_event_detail so extremely
long "error:" paths don't wrap the layout; implement truncation that appends an
ellipsis when truncated and use the truncated string in the println call (keep
references to event_icon, pad_watch_cell, watch_style and style_event_detail
unchanged).
In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs`:
- Around line 44-83: Add unit/integration tests around watch_kubevirt and the
helper functions to cover one-shot vs follow control flow, event truncation,
retry/error handling, and error surfaces: write tests that call watch_kubevirt
(or extract its loop body into a testable helper) to assert that when follow is
false it returns after one iteration and when follow is true it continues until
maybe_load_run_status_snapshot returns a snapshot whose RunTrackerState !=
RunTrackerState::Running; add tests for load_recent_run_events to verify it
enforces the N-event limit and degrades on missing or malformed event files; add
tests for prepare_kubevirt_helper / lane initialization to simulate transient
failures and verify retry behavior and per-lane error handling; and add tests
for collect_replica_watch_snapshots and any replica-metric fetchers to simulate
missing metrics, unexpected API responses, and I/O/parsing errors and assert
graceful degradation (no panics, appropriate error propagation or partial
results). Use mocks or temp files to simulate file-not-found and parse failures
and ensure refresh_millis / sleep behavior is controlled (inject a short/mocked
delay) so tests run fast.
---
Duplicate comments:
In `@docs/jepsen-refactor-plan.md`:
- Around line 26-43: The module map and slice ownership in the "Target Shape"
section are stale; update the list to include the landed modules cluster.rs,
events.rs, kubevirt.rs, nemesis.rs, remote.rs, and scenarios.rs, and correct the
slice ownership that currently attributes RunExecutionContext to tracker work by
noting that RunExecutionContext lives in support.rs (adjust any "Slice 3"
wording accordingly); ensure the new module list and slice descriptions match
the actual symbols and file boundaries (e.g., allocdb-jepsen/support.rs contains
RunExecutionContext) so the doc aligns with the implementation introduced in
this PR.
---
Nitpick comments:
In `@crates/allocdb-node/src/bin/allocdb-jepsen/events.rs`:
- Around line 294-306: Add a focused unit test for
classify_resource_read_outcome that exercises the negative path where
RemoteApiOutcome::Api(ApiResponse::GetResource(ResourceResponse::Found(...)))
contains a Resource whose resource_id does not match the requested resource_id;
call classify_resource_read_outcome with these mismatched ids and assert it
returns Err (and optionally that the error string mentions both ids or the
mismatch), placing the test in the same test module as the existing tests so
future changes cannot regress acceptance of mismatched resources.
In `@crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs`:
- Around line 247-255: The loop currently re-reads and re-parses the whole
events file into a Vec each tick (using events_path, parse_watch_event_line, and
limit) which is expensive; change it to stream new lines from the file (use a
BufReader with seek to the previous read position or tail via File::open and
read_to_end incrementally) and append parsed events to a bounded in-memory
buffer (use a VecDeque or retain only the last limit items) instead of
re-materializing the entire file on every refresh; ensure you still call
parse_watch_event_line per incoming line and drop older entries so events never
grows beyond limit.
- Around line 251-253: The code currently swallows parse errors by using
.filter_map(|line| parse_watch_event_line(line).ok()) which hides malformed
event lines; replace this with handling the Result from parse_watch_event_line
so that Err cases emit a structured warn (including the offending line and any
surrounding context/path) and only Ok values are collected into the Vec;
specifically update the iterator chain that calls parse_watch_event_line to log
with warn!(…) on Err while collecting Ok(event) into the final collection (keep
parse_watch_event_line and the existing collect::<Vec<_>>() usage).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 30f478ce-a3f8-44b5-980a-52ddaebbc7ff
📒 Files selected for processing (10)
crates/allocdb-node/src/bin/allocdb-jepsen.rscrates/allocdb-node/src/bin/allocdb-jepsen/cluster.rscrates/allocdb-node/src/bin/allocdb-jepsen/events.rscrates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rscrates/allocdb-node/src/bin/allocdb-jepsen/runs.rscrates/allocdb-node/src/bin/allocdb-jepsen/support.rscrates/allocdb-node/src/bin/allocdb-jepsen/tracker.rscrates/allocdb-node/src/bin/allocdb-jepsen/watch.rscrates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rsdocs/jepsen-refactor-plan.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.
Files:
docs/jepsen-refactor-plan.md
**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level:errorfor invariant breaks, corruption, and failed operations that require intervention;warnfor degraded but expected conditions such as overload, lag, or rejected requests;infofor meaningful lifecycle and state-transition events;debugfor detailed execution traces useful in development;traceonly for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.
Files:
crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rscrates/allocdb-node/src/bin/allocdb-jepsen/support.rscrates/allocdb-node/src/bin/allocdb-jepsen/watch.rscrates/allocdb-node/src/bin/allocdb-jepsen/runs.rscrates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rscrates/allocdb-node/src/bin/allocdb-jepsen/events.rscrates/allocdb-node/src/bin/allocdb-jepsen/tracker.rscrates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs
🧠 Learnings (4)
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.md : Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.
Applied to files:
docs/jepsen-refactor-plan.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to docs/status.md : Keep [`docs/status.md`](./docs/status.md) current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.
Applied to files:
docs/jepsen-refactor-plan.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Keep the system buildable after each chunk whenever feasible. If a larger refactor is required, stage it as a short sequence of intermediate, verifiable steps.
Applied to files:
docs/jepsen-refactor-plan.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Applied to files:
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs
🧬 Code graph analysis (4)
crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (5)
crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (1)
sanitize_run_id(121-135)crates/allocdb-node/src/bin/allocdb-jepsen/common.rs (1)
current_time_millis(128-133)crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (2)
new(29-38)new(165-206)crates/allocdb-node/src/replica.rs (2)
paths(974-976)status(959-961)crates/allocdb-node/src/jepsen_tests.rs (1)
temp_dir(16-22)
crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (5)
crates/allocdb-node/src/jepsen.rs (5)
analyze_history(744-753)create_artifact_bundle(784-833)load_history(686-690)persist_history(698-703)render_analysis_report(756-775)crates/allocdb-node/src/bin/allocdb-jepsen/scenarios.rs (1)
execute_control_run(19-41)crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (1)
ensure_runtime_cluster_ready(54-104)crates/allocdb-node/src/bin/allocdb-jepsen/events.rs (3)
workspace_root(703-705)new(654-695)backend_name(699-701)crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (2)
new(29-38)new(165-206)
crates/allocdb-node/src/bin/allocdb-jepsen/kubevirt.rs (2)
crates/allocdb-node/src/kubevirt_testbed.rs (2)
kubevirt_testbed_layout_path(162-164)map(376-390)crates/allocdb-node/src/bin/allocdb-jepsen/remote.rs (2)
build_remote_tcp_probe_command(40-49)encode_hex(112-119)
crates/allocdb-node/src/bin/allocdb-jepsen/cluster.rs (3)
crates/allocdb-node/src/bin/allocdb-jepsen/runtime.rs (6)
live_runtime_replica_matching(95-107)render_runtime_probe_summary(109-139)runtime_probe_is_active(65-70)runtime_replica_probes_with_live_roles(46-63)summarize_runtime_probes(72-93)probes(110-137)crates/allocdb-node/src/bin/allocdb-jepsen/support.rs (5)
copy_file_or_remove(309-329)prepare_log_path_for(296-307)new(27-33)new(55-61)new(109-136)crates/allocdb-node/src/bin/allocdb-jepsen/watch_render.rs (1)
replica_role_label(842-850)
🔇 Additional comments (4)
crates/allocdb-node/src/bin/allocdb-jepsen/tracker.rs (1)
442-446: Add a round-trip regression test for tracker-field escaping.
decode_tracker_field()now has custom handling for\\and\n, but this file still has no test pinning theencode_tracker_field()/decode_tracker_field()symmetry. A case likeC:\tmp\fooplus a multiline error would catch the exact regression fixed here.As per coding guidelines,
**/*.rs: "Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage."crates/allocdb-node/src/bin/allocdb-jepsen/runs.rs (1)
376-385: The control-run regression test still doesn't exercise the env override path.
fault_window_override_does_not_change_control_runs()never setsFAULT_WINDOW_OVERRIDE_ENV, so it only checks the default case. The override bug can come back without this test ever failing.As per coding guidelines,
**/*.rs: "Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage."crates/allocdb-node/src/bin/allocdb-jepsen/watch.rs (2)
59-73: One-shot watch exit is now correct.Line 71–Line 73 cleanly enforce single-iteration behavior when
followis false, so the one-shot path no longer hangs.
100-127: Lane preparation error handling is materially improved.Line 103 preserves the real preparation error for rendering, and keeping
contexts[index]asNoneon failure allows retry on the next loop.
|
Applied the remaining review-driven fixes:
Validation rerun: |
Summary
allocdb-jepsenbinary into focused modules for args, watch/rendering, remote backends, run orchestration, cluster helpers, and scenario support#70Validation
./scripts/preflight.shALLOCDB_JEPSEN_FAULT_WINDOW_SECS_OVERRIDE=180for faulted runs