feat(simulation): inject seeded crash points#42
Conversation
Add deterministic one-shot crash plans across engine submit, checkpoint, and recovery boundaries, wire them through the promoted simulation harness, and cover restart behavior with seeded regression tests. Refs #17
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
WalkthroughThis PR introduces a deterministic crash injection framework across the allocdb system. It adds recovery boundary observation points, defines crash plan infrastructure, integrates crash checks throughout the engine and checkpoint lifecycle, and extends error handling to propagate crash injection events. The simulation harness gains support for planning crashes at recovery and engine boundaries. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Engine as Engine<br/>(recover_with_crash_plan)
participant Core as Recovery Core<br/>(recover_allocdb_impl)
participant Observer as Crash Observer<br/>(callback)
User->>Engine: recover_with_crash_plan(crash_plan)
Engine->>Engine: Setup observer closure
Engine->>Core: recover_allocdb_with_observer(observer_fn)
Core->>Core: Load snapshot
Core->>Observer: RecoveryBoundary::AfterSnapshotLoad
Observer->>Observer: Check if crash matches
alt Crash Injected
Observer-->>Core: Err(CrashInjected)
Core-->>Engine: RecoveryObserverError::Observer(CrashInjected)
else No Crash
Observer-->>Core: Ok(())
Core->>Core: Truncate WAL
Core->>Observer: RecoveryBoundary::AfterWalTruncate
Observer-->>Core: Ok(()) or Err(CrashInjected)
end
alt Continue Replay
Core->>Core: Replay frame (LSN, record_type)
Core->>Observer: RecoveryBoundary::AfterReplayFrame{lsn, record_type}
Observer->>Observer: Check if crash matches
Observer-->>Core: Ok(()) or Err(CrashInjected)
Core-->>Core: Next frame...
end
alt Final Result
Core-->>Engine: RecoveryResult or RecoveryObserverError
Engine-->>User: Ok(Engine) or Err(RecoverEngineError::CrashInjected)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
# Conflicts: # docs/status.md
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/allocdb-node/src/engine_checkpoint.rs (1)
51-54:⚠️ Potential issue | 🟡 MinorDocument the new
CrashInjectedfailure mode.
checkpoint()can now returnCheckpointError::CrashInjected(_), but the# Errorssection still only describes halt, queue, WAL, and snapshot failures.Suggested doc fix
- /// Returns [`CheckpointError`] if the engine is halted, queued writes still exist, the WAL is - /// not clean, or snapshot/WAL persistence fails. + /// Returns [`CheckpointError`] if the engine is halted, queued writes still exist, the WAL is + /// not clean, snapshot/WAL persistence fails, or a seeded crash plan aborts the checkpoint.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/engine_checkpoint.rs` around lines 51 - 54, Update the function-level docs for checkpoint() to mention the new CrashInjected failure mode: explain that checkpoint() may return CheckpointError::CrashInjected(reason) when an injected crash or simulated failure is detected, and include that it carries context about the injection; keep the existing descriptions for halted engine, queued writes, dirty WAL, and persistence failures (snapshot/WAL) intact and add CrashInjected alongside them in the # Errors section referencing CheckpointError::CrashInjected to make the possible return variants comprehensive.
🧹 Nitpick comments (4)
crates/allocdb-node/src/simulation_tests.rs (1)
387-457: The other recovery boundaries still need direct restart tests.This only drives
CrashPoint::RecoveryAfterReplayFrame, but Lines 25-29 incrates/allocdb-core/src/recovery.rsalso addAfterSnapshotLoadandAfterWalTruncate. Those abort points happen before any frame replay, so they can regress independently of the replay-frame case.As per coding guidelines,
**/*.rs: Write extensive tests for every meaningful behavior change.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/simulation_tests.rs` around lines 387 - 457, The test only verifies CrashPoint::RecoveryAfterReplayFrame; add explicit subtests that drive and assert behavior for each recovery boundary (CrashPoint::RecoveryAfterSnapshotLoad and CrashPoint::RecoveryAfterWalTruncate) so they are exercised independently. For each point in the recovery_points array (use seed_for_point(point, &recovery_points)), create two SimulationHarness instances (SimulationHarness::new), perform the same submit/checkpoint/advance sequence, call arm_next_recovery_crash and assert the planned point equals the current point, call restart to capture the injected crash (unwrap_err and match RecoverEngineError::CrashInjected), then call restart again to get the recovered harness and assert expected recovery fields (startup_kind, loaded_snapshot_lsn, replayed_wal_frame_count/last_lsn) and resource presence via first.engine().db().resource(ResourceId(...)). Repeat this logic for CrashPoint::RecoveryAfterSnapshotLoad and CrashPoint::RecoveryAfterWalTruncate so each abort boundary is covered.crates/allocdb-node/src/api.rs (1)
157-159: Add an API regression for the newCrashInjectedmapping.This changes the wire contract:
SubmissionError::CrashInjected(_)now surfaces asStorageFailure. Please pin both that code and the preservedIndefinitecategory so client retry behavior cannot drift later.As per coding guidelines,
**/*.rs: Write extensive tests for every meaningful behavior change.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/api.rs` around lines 157 - 159, SubmissionError::CrashInjected(_) was recently remapped to SubmissionFailureCode::StorageFailure which changes the public wire contract; update the mapping to explicitly pin both the CrashInjected -> StorageFailure mapping and the preserved Indefinite category and add tests that assert the serialized wire values cannot change. Concretely: in the match that maps SubmissionError to SubmissionFailureCode (referencing SubmissionError::CrashInjected and SubmissionFailureCode::StorageFailure and the Indefinite variant), add explicit match arms and a comment documenting the intended wire value, then add unit/integration tests that serialize the resulting SubmissionFailureCode (and the Indefinite variant) to the wire format and assert the exact expected bytes/values so future edits will fail if the mapping drifts.crates/allocdb-node/src/engine.rs (2)
370-373: Log recovery-side crash injections too.Runtime injection goes through
maybe_inject_crash()and emits a warning, but this branch returnsCrashInjected(plan)silently.recover_allocdb_with_observer()explicitly skips loggingRecoveryObserverError::Observer(_), so the seed/point context disappears right where restart triage needs it.Proposed patch
- .map_err(|error| match error { - RecoveryObserverError::Recovery(error) => RecoverEngineError::Recovery(error), - RecoveryObserverError::Observer(plan) => RecoverEngineError::CrashInjected(plan), - })?; + .map_err(|error| match error { + RecoveryObserverError::Recovery(error) => RecoverEngineError::Recovery(error), + RecoveryObserverError::Observer(plan) => { + warn!( + "halting engine recovery on injected crash: seed={} point={:?}", + plan.seed, plan.point + ); + RecoverEngineError::CrashInjected(plan) + } + })?;As per coding guidelines, "Add extensive logging where it materially improves debuggability or operational clarity" and "Logging must be structured and purposeful."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/engine.rs` around lines 370 - 373, The mapping from RecoveryObserverError::Observer to RecoverEngineError::CrashInjected currently drops contextual info; before returning that error in the closure (the map_err branch that matches RecoveryObserverError::Observer(plan)), call the same logging used by maybe_inject_crash() (e.g., emit a warning with the seed/point/plan context) so the crash-injection context is preserved for restart triage; update the logic around recover_allocdb_with_observer()/maybe_inject_crash() to log a structured warning including the plan and any seed/point identifiers immediately before constructing RecoverEngineError::CrashInjected.
211-224: Pin seed selection to a stable sort key, not derived enum order.
from_seed()canonicalizes withsort_unstable()onCrashPoint's derivedOrd, so reordering or inserting enum variants later silently changes the same seed → point mapping for the same enabled set. If that mapping is part of the reproducibility contract, prefer explicit stable ids or preserve a caller-owned canonical order.Also applies to: 260-262
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/engine.rs` around lines 211 - 224, The current from_seed() uses sort_unstable() on CrashPoint's derived Ord, which makes the seed→point mapping unstable when enum variants are reordered; add a stable sort key and use it instead of derived Ord—e.g. add a CrashPoint::stable_id() returning a fixed integer for each variant and change all sort_unstable()/sorting usages in from_seed (and the other occurrences around lines 260-262) to sort_by_key(|p| p.stable_id()), ensuring the seed selection is pinned to those stable ids (or alternatively accept a caller-provided canonical ordering) so future enum reorders/inserts won't change mappings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/allocdb-node/src/engine.rs`:
- Around line 221-223: The observer for RecoveryBoundary::AfterReplayFrame is
collapsing all per-frame events into a single CrashPoint
(RecoveryAfterReplayFrame) and consuming the one-shot plan at the first match,
so only the first replayed frame is reachable; modify the plan and observer to
carry a replay ordinal/LSN (e.g., add an ordinal field to the CrashPoint variant
RecoveryAfterReplayFrame or include an lsn/seq in the one-shot plan) and have
the observer compare the current frame's ordinal/LSN before consuming the plan;
update the matching logic in the observer code paths referenced by
RecoveryAfterSnapshotLoad, RecoveryAfterWalTruncate, and
RecoveryAfterReplayFrame so each AfterReplayFrame event checks the ordinal/LSN
and only consumes the plan when it matches the intended frame (or derive the
ordinal deterministically from the seed inside the observer if you prefer not to
modify the plan structure).
In `@crates/allocdb-node/src/simulation_tests.rs`:
- Around line 343-385: The test
seeded_checkpoint_crash_after_snapshot_write_is_recoverable currently only
exercises CrashPoint::CheckpointAfterSnapshotWrite; add a sibling test (or
extend this one) to cover CrashPoint::CheckpointAfterWalRewrite by using
seed_for_point with CrashPoint::CheckpointAfterWalRewrite (and the same
checkpoint_points array), arm the crash via harness.arm_next_engine_crash,
assert the checkpoint unwrap_err matches the planned crash, then restart and
assert the recovered.recovery fields reflect the WAL-rewrite path (e.g.,
appropriate RecoveryStartupKind and
loaded_snapshot_lsn/replayed_wal_frame_count) and that resources ResourceId(11)
and ResourceId(12) exist; reuse the same helpers (SimulationHarness::new,
harness.advance_to, harness.submit, harness.checkpoint, harness.restart) and
mirror the assertions structure from the existing
seeded_checkpoint_crash_after_snapshot_write_is_recoverable test.
---
Outside diff comments:
In `@crates/allocdb-node/src/engine_checkpoint.rs`:
- Around line 51-54: Update the function-level docs for checkpoint() to mention
the new CrashInjected failure mode: explain that checkpoint() may return
CheckpointError::CrashInjected(reason) when an injected crash or simulated
failure is detected, and include that it carries context about the injection;
keep the existing descriptions for halted engine, queued writes, dirty WAL, and
persistence failures (snapshot/WAL) intact and add CrashInjected alongside them
in the # Errors section referencing CheckpointError::CrashInjected to make the
possible return variants comprehensive.
---
Nitpick comments:
In `@crates/allocdb-node/src/api.rs`:
- Around line 157-159: SubmissionError::CrashInjected(_) was recently remapped
to SubmissionFailureCode::StorageFailure which changes the public wire contract;
update the mapping to explicitly pin both the CrashInjected -> StorageFailure
mapping and the preserved Indefinite category and add tests that assert the
serialized wire values cannot change. Concretely: in the match that maps
SubmissionError to SubmissionFailureCode (referencing
SubmissionError::CrashInjected and SubmissionFailureCode::StorageFailure and the
Indefinite variant), add explicit match arms and a comment documenting the
intended wire value, then add unit/integration tests that serialize the
resulting SubmissionFailureCode (and the Indefinite variant) to the wire format
and assert the exact expected bytes/values so future edits will fail if the
mapping drifts.
In `@crates/allocdb-node/src/engine.rs`:
- Around line 370-373: The mapping from RecoveryObserverError::Observer to
RecoverEngineError::CrashInjected currently drops contextual info; before
returning that error in the closure (the map_err branch that matches
RecoveryObserverError::Observer(plan)), call the same logging used by
maybe_inject_crash() (e.g., emit a warning with the seed/point/plan context) so
the crash-injection context is preserved for restart triage; update the logic
around recover_allocdb_with_observer()/maybe_inject_crash() to log a structured
warning including the plan and any seed/point identifiers immediately before
constructing RecoverEngineError::CrashInjected.
- Around line 211-224: The current from_seed() uses sort_unstable() on
CrashPoint's derived Ord, which makes the seed→point mapping unstable when enum
variants are reordered; add a stable sort key and use it instead of derived
Ord—e.g. add a CrashPoint::stable_id() returning a fixed integer for each
variant and change all sort_unstable()/sorting usages in from_seed (and the
other occurrences around lines 260-262) to sort_by_key(|p| p.stable_id()),
ensuring the seed selection is pinned to those stable ids (or alternatively
accept a caller-provided canonical ordering) so future enum reorders/inserts
won't change mappings.
In `@crates/allocdb-node/src/simulation_tests.rs`:
- Around line 387-457: The test only verifies
CrashPoint::RecoveryAfterReplayFrame; add explicit subtests that drive and
assert behavior for each recovery boundary
(CrashPoint::RecoveryAfterSnapshotLoad and CrashPoint::RecoveryAfterWalTruncate)
so they are exercised independently. For each point in the recovery_points array
(use seed_for_point(point, &recovery_points)), create two SimulationHarness
instances (SimulationHarness::new), perform the same submit/checkpoint/advance
sequence, call arm_next_recovery_crash and assert the planned point equals the
current point, call restart to capture the injected crash (unwrap_err and match
RecoverEngineError::CrashInjected), then call restart again to get the recovered
harness and assert expected recovery fields (startup_kind, loaded_snapshot_lsn,
replayed_wal_frame_count/last_lsn) and resource presence via
first.engine().db().resource(ResourceId(...)). Repeat this logic for
CrashPoint::RecoveryAfterSnapshotLoad and CrashPoint::RecoveryAfterWalTruncate
so each abort boundary is covered.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 89cc4bb8-36b5-4e7c-8584-824bc37980b1
📒 Files selected for processing (9)
crates/allocdb-core/src/recovery.rscrates/allocdb-node/src/api.rscrates/allocdb-node/src/engine.rscrates/allocdb-node/src/engine_checkpoint.rscrates/allocdb-node/src/engine_tests.rscrates/allocdb-node/src/simulation.rscrates/allocdb-node/src/simulation_tests.rsdocs/status.mddocs/testing.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level:errorfor invariant breaks, corruption, and failed operations that require intervention;warnfor degraded but expected conditions such as overload, lag, or rejected requests;infofor meaningful lifecycle and state-transition events;debugfor detailed execution traces useful in development;traceonly for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.
Files:
crates/allocdb-node/src/engine_checkpoint.rscrates/allocdb-node/src/api.rscrates/allocdb-node/src/simulation.rscrates/allocdb-node/src/simulation_tests.rscrates/allocdb-node/src/engine_tests.rscrates/allocdb-core/src/recovery.rscrates/allocdb-node/src/engine.rs
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.
Files:
docs/status.mddocs/testing.md
docs/status.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep
docs/status.mdcurrent as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.
Files:
docs/status.md
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Treat CodeRabbit as part of the required review path when it is enabled on the repository. Wait for its status to complete before merge. If it completes without a visible review comment or review thread, request visible output with `coderabbitai summary`. Address every substantive CodeRabbit comment explicitly before merge by either applying the change or documenting why it is not being applied. Apply correctness, safety, recovery, test, and docs-alignment feedback by default; document why you reject suggestions that would weaken determinism, boundedness, or trusted-core discipline.
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Applied to files:
docs/status.mdcrates/allocdb-node/src/simulation_tests.rscrates/allocdb-node/src/engine_tests.rs
🧬 Code graph analysis (4)
crates/allocdb-node/src/simulation.rs (1)
crates/allocdb-node/src/engine.rs (3)
recover_with_crash_plan(348-387)from_seed(254-269)is_recovery_boundary(228-235)
crates/allocdb-node/src/engine_tests.rs (1)
crates/allocdb-node/src/engine.rs (1)
category(148-157)
crates/allocdb-core/src/recovery.rs (3)
crates/allocdb-core/src/snapshot_file.rs (3)
from(19-21)from(25-27)context(141-146)crates/allocdb-core/src/wal_file.rs (2)
from(34-36)frame(229-236)crates/allocdb-core/src/wal.rs (1)
frame(230-237)
crates/allocdb-node/src/engine.rs (2)
crates/allocdb-core/src/recovery.rs (1)
recover_allocdb_with_observer(162-190)crates/allocdb-core/src/wal_file.rs (1)
open(45-54)
🔇 Additional comments (8)
docs/status.md (1)
28-30: Status snapshot stays aligned with the crash-plan work.The milestone shift, validation commands, and recovery-coverage summary all line up with the code in this PR.
Also applies to: 86-97, 103-106
crates/allocdb-core/src/recovery.rs (1)
143-153: Good layering for recovery observers.
recover_allocdb()stays as the no-observer surface, whilerecover_allocdb_with_observer()plugs the boundary callback straight into the real recovery loop.Also applies to: 162-190, 192-255
crates/allocdb-node/src/engine_checkpoint.rs (1)
93-110: Crash injection is placed at the two checkpoint durability boundaries.The post-snapshot-write and post-WAL-rewrite hooks line up with the two distinct on-disk states that recovery has to tolerate.
docs/testing.md (1)
100-107: Testing docs capture the promoted crash-plan model clearly.The added seed semantics, named boundaries, and restart-path evidence match the harness and regression tests in this PR.
Also applies to: 114-121
crates/allocdb-node/src/engine_tests.rs (1)
874-881: Good regression forCrashInjectedcategorization.This pins the new indefinite classification at the engine boundary, which is the contract callers need for retry-after-restart handling.
crates/allocdb-node/src/simulation_tests.rs (1)
271-341: Nice coverage for deterministic selection and post-sync submit recovery.These tests pin both the seed/order invariants and the persisted-but-unapplied restart path through the real engine.
crates/allocdb-node/src/simulation.rs (1)
96-117: Recovery crash plans remain one-shot and use the real restart path.Stashing the plan in
pending_recovery_crashand consuming it fromrestart()keeps the injected failure deterministic without adding a simulator-only recovery branch.Also applies to: 175-215
crates/allocdb-node/src/engine.rs (1)
645-701: Nice one-shot placement around the ambiguous commit boundaries.The before-append / after-sync / after-apply hooks cover the three materially different recovery states, and centralizing the consume-and-halt logic in
maybe_inject_crash()keeps each seeded failure one-shot.Also applies to: 767-821, 951-965
Stabilize seeded crash-point selection, target replay-frame recovery crashes deterministically across later replayed frames, add recovery/API regression coverage, and tighten checkpoint docs/logging. Refs #17
Summary
Validation
Closes #17