Skip to content

feat(simulation): inject seeded crash points#42

Merged
skel84 merged 3 commits intomainfrom
issue-17-crash-points
Mar 12, 2026
Merged

feat(simulation): inject seeded crash points#42
skel84 merged 3 commits intomainfrom
issue-17-crash-points

Conversation

@skel84
Copy link
Owner

@skel84 skel84 commented Mar 12, 2026

Summary

  • add seeded one-shot crash plans across engine submit, checkpoint, and recovery boundaries
  • expose crash-plan control through the promoted simulation harness and add restart-path regression coverage
  • update testing/status docs to record M4-T02 completion and move focus to M4-T03

Validation

  • cargo test -p allocdb-node simulation -- --nocapture
  • cargo test -p allocdb-node engine -- --nocapture
  • cargo test -p allocdb-core recovery -- --nocapture
  • ./scripts/preflight.sh

Closes #17

Add deterministic one-shot crash plans across engine submit, checkpoint, and recovery boundaries, wire them through the promoted simulation harness, and cover restart behavior with seeded regression tests.

Refs #17
@coderabbitai
Copy link

coderabbitai bot commented Mar 12, 2026

Warning

Rate limit exceeded

@skel84 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 11 minutes and 53 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2ec610a2-966c-4062-b860-f7c6d5c98974

📥 Commits

Reviewing files that changed from the base of the PR and between 7a068fa and 44fe047.

📒 Files selected for processing (7)
  • crates/allocdb-core/src/recovery.rs
  • crates/allocdb-node/src/api.rs
  • crates/allocdb-node/src/api_tests.rs
  • crates/allocdb-node/src/engine.rs
  • crates/allocdb-node/src/engine_checkpoint.rs
  • crates/allocdb-node/src/simulation_tests.rs
  • docs/status.md

Walkthrough

This PR introduces a deterministic crash injection framework across the allocdb system. It adds recovery boundary observation points, defines crash plan infrastructure, integrates crash checks throughout the engine and checkpoint lifecycle, and extends error handling to propagate crash injection events. The simulation harness gains support for planning crashes at recovery and engine boundaries.

Changes

Cohort / File(s) Summary
Recovery Observer Infrastructure
crates/allocdb-core/src/recovery.rs
Introduces RecoveryBoundary enum (AfterSnapshotLoad, AfterWalTruncate, AfterReplayFrame) and RecoveryObserverError for wrapping observer or recovery errors. Adds recover_allocdb_with_observer<E, F>() public API accepting a user-supplied boundary observer callback, delegates from recover_allocdb, and internally implements observer invocations at key recovery checkpoints with LSN and record_type metadata.
Error Mapping & Classification
crates/allocdb-node/src/api.rs
Expands SubmissionFailure::from_submission_error match arm to classify SubmissionError::CrashInjected(_) as StorageFailure (alongside WalFile errors).
Crash Plan Infrastructure & Engine Integration
crates/allocdb-node/src/engine.rs
Introduces public CrashPoint enum (11 variants spanning client, internal, checkpoint, and recovery phases) and public CrashPlan struct. Adds CrashInjected variants to SubmissionError and RecoverEngineError. Wires recover_with_crash_plan() to use recovery observer callbacks, maps RecoveryObserverError to CrashInjected, and injects crash checks at 6 engine submission/apply points via maybe_inject_crash(). Adds armed_crash field to SingleNodeEngine and arm_next_crash() test helper.
Checkpoint Crash Hooks
crates/allocdb-node/src/engine_checkpoint.rs
Adds CrashInjected variant to CheckpointError and injects crash checks after snapshot write (CheckpointAfterSnapshotWrite) and WAL rewrite (CheckpointAfterWalRewrite) events.
Test Infrastructure & Assertions
crates/allocdb-node/src/engine_tests.rs, crates/allocdb-node/src/simulation_tests.rs
Extends imports and adds test assertion verifying CrashInjected submission errors categorize as Indefinite. Introduces 4 new deterministic crash/recovery tests (reproducibility, seeded engine crash recovery, checkpoint snapshot-write crash recovery, replay-interrupted recovery) with seed_for_point helper and coverage for WalOnly and SnapshotAndWal recovery kinds.
Simulation Harness
crates/allocdb-node/src/simulation.rs
Adds pending_recovery_crash field to SimulationHarness and modifies restart() to call recover_with_crash_plan(). Introduces arm_next_engine_crash() and arm_next_recovery_crash() test APIs that construct CrashPlans from seeds and enabled points, with recovery variant asserting boundary constraints.
Documentation
docs/status.md, docs/testing.md
Updates validation/recovery description to highlight seeded crash-point injection across submit, checkpoint, and recovery boundaries. Documents deterministic simulation crash planning, restart coverage scenarios (post-sync replay, snapshot-written before WAL rewrite, replay-interrupted recovery), and new test commands. Clarifies crash-plan reproducibility and boundary filtering behavior.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Engine as Engine<br/>(recover_with_crash_plan)
    participant Core as Recovery Core<br/>(recover_allocdb_impl)
    participant Observer as Crash Observer<br/>(callback)

    User->>Engine: recover_with_crash_plan(crash_plan)
    Engine->>Engine: Setup observer closure
    Engine->>Core: recover_allocdb_with_observer(observer_fn)
    
    Core->>Core: Load snapshot
    Core->>Observer: RecoveryBoundary::AfterSnapshotLoad
    Observer->>Observer: Check if crash matches
    
    alt Crash Injected
        Observer-->>Core: Err(CrashInjected)
        Core-->>Engine: RecoveryObserverError::Observer(CrashInjected)
    else No Crash
        Observer-->>Core: Ok(())
        Core->>Core: Truncate WAL
        Core->>Observer: RecoveryBoundary::AfterWalTruncate
        Observer-->>Core: Ok(()) or Err(CrashInjected)
    end
    
    alt Continue Replay
        Core->>Core: Replay frame (LSN, record_type)
        Core->>Observer: RecoveryBoundary::AfterReplayFrame{lsn, record_type}
        Observer->>Observer: Check if crash matches
        Observer-->>Core: Ok(()) or Err(CrashInjected)
        Core-->>Core: Next frame...
    end
    
    alt Final Result
        Core-->>Engine: RecoveryResult or RecoveryObserverError
        Engine-->>User: Ok(Engine) or Err(RecoverEngineError::CrashInjected)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • PR #38: The simulation harness changes in this PR (pending_recovery_crash field, arm_next_recovery_crash method) build directly on the simulation infrastructure introduced there, extending it with recovery-specific crash planning capability.
  • PR #39: Both PRs modify crates/allocdb-core/src/recovery.rs recovery internals; this PR adds RecoveryBoundary observer callbacks and RecoveryObserverError wrapper on top of the refactored recovery flow introduced there.
  • PR #12: The RecoveryBoundary::AfterReplayFrame observer point directly relates to recovery-level replay frame metadata tracking changes, both affecting the same recovery replay logic and checkpoint state.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main objective: introducing seeded crash point injection into the simulation framework.
Description check ✅ Passed The description covers the summary, linked issue reference, validation steps, and addresses documentation updates as required by the template.
Linked Issues check ✅ Passed The PR successfully implements all coding requirements from issue #17: deterministic crash points across submit/checkpoint/recovery boundaries, reproducibility via seeding, and recovery path validation through the real engine.
Out of Scope Changes check ✅ Passed All changes are directly aligned with issue #17 objectives. The PR introduces crash injection infrastructure, integrates observers into recovery, expands simulation harness capabilities, and updates documentation—all within scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch issue-17-crash-points
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@skel84
Copy link
Owner Author

skel84 commented Mar 12, 2026

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Mar 12, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/allocdb-node/src/engine_checkpoint.rs (1)

51-54: ⚠️ Potential issue | 🟡 Minor

Document the new CrashInjected failure mode.

checkpoint() can now return CheckpointError::CrashInjected(_), but the # Errors section still only describes halt, queue, WAL, and snapshot failures.

Suggested doc fix
-    /// Returns [`CheckpointError`] if the engine is halted, queued writes still exist, the WAL is
-    /// not clean, or snapshot/WAL persistence fails.
+    /// Returns [`CheckpointError`] if the engine is halted, queued writes still exist, the WAL is
+    /// not clean, snapshot/WAL persistence fails, or a seeded crash plan aborts the checkpoint.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/engine_checkpoint.rs` around lines 51 - 54, Update
the function-level docs for checkpoint() to mention the new CrashInjected
failure mode: explain that checkpoint() may return
CheckpointError::CrashInjected(reason) when an injected crash or simulated
failure is detected, and include that it carries context about the injection;
keep the existing descriptions for halted engine, queued writes, dirty WAL, and
persistence failures (snapshot/WAL) intact and add CrashInjected alongside them
in the # Errors section referencing CheckpointError::CrashInjected to make the
possible return variants comprehensive.
🧹 Nitpick comments (4)
crates/allocdb-node/src/simulation_tests.rs (1)

387-457: The other recovery boundaries still need direct restart tests.

This only drives CrashPoint::RecoveryAfterReplayFrame, but Lines 25-29 in crates/allocdb-core/src/recovery.rs also add AfterSnapshotLoad and AfterWalTruncate. Those abort points happen before any frame replay, so they can regress independently of the replay-frame case.

As per coding guidelines, **/*.rs: Write extensive tests for every meaningful behavior change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/simulation_tests.rs` around lines 387 - 457, The test
only verifies CrashPoint::RecoveryAfterReplayFrame; add explicit subtests that
drive and assert behavior for each recovery boundary
(CrashPoint::RecoveryAfterSnapshotLoad and CrashPoint::RecoveryAfterWalTruncate)
so they are exercised independently. For each point in the recovery_points array
(use seed_for_point(point, &recovery_points)), create two SimulationHarness
instances (SimulationHarness::new), perform the same submit/checkpoint/advance
sequence, call arm_next_recovery_crash and assert the planned point equals the
current point, call restart to capture the injected crash (unwrap_err and match
RecoverEngineError::CrashInjected), then call restart again to get the recovered
harness and assert expected recovery fields (startup_kind, loaded_snapshot_lsn,
replayed_wal_frame_count/last_lsn) and resource presence via
first.engine().db().resource(ResourceId(...)). Repeat this logic for
CrashPoint::RecoveryAfterSnapshotLoad and CrashPoint::RecoveryAfterWalTruncate
so each abort boundary is covered.
crates/allocdb-node/src/api.rs (1)

157-159: Add an API regression for the new CrashInjected mapping.

This changes the wire contract: SubmissionError::CrashInjected(_) now surfaces as StorageFailure. Please pin both that code and the preserved Indefinite category so client retry behavior cannot drift later.

As per coding guidelines, **/*.rs: Write extensive tests for every meaningful behavior change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/api.rs` around lines 157 - 159,
SubmissionError::CrashInjected(_) was recently remapped to
SubmissionFailureCode::StorageFailure which changes the public wire contract;
update the mapping to explicitly pin both the CrashInjected -> StorageFailure
mapping and the preserved Indefinite category and add tests that assert the
serialized wire values cannot change. Concretely: in the match that maps
SubmissionError to SubmissionFailureCode (referencing
SubmissionError::CrashInjected and SubmissionFailureCode::StorageFailure and the
Indefinite variant), add explicit match arms and a comment documenting the
intended wire value, then add unit/integration tests that serialize the
resulting SubmissionFailureCode (and the Indefinite variant) to the wire format
and assert the exact expected bytes/values so future edits will fail if the
mapping drifts.
crates/allocdb-node/src/engine.rs (2)

370-373: Log recovery-side crash injections too.

Runtime injection goes through maybe_inject_crash() and emits a warning, but this branch returns CrashInjected(plan) silently. recover_allocdb_with_observer() explicitly skips logging RecoveryObserverError::Observer(_), so the seed/point context disappears right where restart triage needs it.

Proposed patch
-        .map_err(|error| match error {
-            RecoveryObserverError::Recovery(error) => RecoverEngineError::Recovery(error),
-            RecoveryObserverError::Observer(plan) => RecoverEngineError::CrashInjected(plan),
-        })?;
+        .map_err(|error| match error {
+            RecoveryObserverError::Recovery(error) => RecoverEngineError::Recovery(error),
+            RecoveryObserverError::Observer(plan) => {
+                warn!(
+                    "halting engine recovery on injected crash: seed={} point={:?}",
+                    plan.seed, plan.point
+                );
+                RecoverEngineError::CrashInjected(plan)
+            }
+        })?;

As per coding guidelines, "Add extensive logging where it materially improves debuggability or operational clarity" and "Logging must be structured and purposeful."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/engine.rs` around lines 370 - 373, The mapping from
RecoveryObserverError::Observer to RecoverEngineError::CrashInjected currently
drops contextual info; before returning that error in the closure (the map_err
branch that matches RecoveryObserverError::Observer(plan)), call the same
logging used by maybe_inject_crash() (e.g., emit a warning with the
seed/point/plan context) so the crash-injection context is preserved for restart
triage; update the logic around
recover_allocdb_with_observer()/maybe_inject_crash() to log a structured warning
including the plan and any seed/point identifiers immediately before
constructing RecoverEngineError::CrashInjected.

211-224: Pin seed selection to a stable sort key, not derived enum order.

from_seed() canonicalizes with sort_unstable() on CrashPoint's derived Ord, so reordering or inserting enum variants later silently changes the same seed → point mapping for the same enabled set. If that mapping is part of the reproducibility contract, prefer explicit stable ids or preserve a caller-owned canonical order.

Also applies to: 260-262

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/engine.rs` around lines 211 - 224, The current
from_seed() uses sort_unstable() on CrashPoint's derived Ord, which makes the
seed→point mapping unstable when enum variants are reordered; add a stable sort
key and use it instead of derived Ord—e.g. add a CrashPoint::stable_id()
returning a fixed integer for each variant and change all
sort_unstable()/sorting usages in from_seed (and the other occurrences around
lines 260-262) to sort_by_key(|p| p.stable_id()), ensuring the seed selection is
pinned to those stable ids (or alternatively accept a caller-provided canonical
ordering) so future enum reorders/inserts won't change mappings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/allocdb-node/src/engine.rs`:
- Around line 221-223: The observer for RecoveryBoundary::AfterReplayFrame is
collapsing all per-frame events into a single CrashPoint
(RecoveryAfterReplayFrame) and consuming the one-shot plan at the first match,
so only the first replayed frame is reachable; modify the plan and observer to
carry a replay ordinal/LSN (e.g., add an ordinal field to the CrashPoint variant
RecoveryAfterReplayFrame or include an lsn/seq in the one-shot plan) and have
the observer compare the current frame's ordinal/LSN before consuming the plan;
update the matching logic in the observer code paths referenced by
RecoveryAfterSnapshotLoad, RecoveryAfterWalTruncate, and
RecoveryAfterReplayFrame so each AfterReplayFrame event checks the ordinal/LSN
and only consumes the plan when it matches the intended frame (or derive the
ordinal deterministically from the seed inside the observer if you prefer not to
modify the plan structure).

In `@crates/allocdb-node/src/simulation_tests.rs`:
- Around line 343-385: The test
seeded_checkpoint_crash_after_snapshot_write_is_recoverable currently only
exercises CrashPoint::CheckpointAfterSnapshotWrite; add a sibling test (or
extend this one) to cover CrashPoint::CheckpointAfterWalRewrite by using
seed_for_point with CrashPoint::CheckpointAfterWalRewrite (and the same
checkpoint_points array), arm the crash via harness.arm_next_engine_crash,
assert the checkpoint unwrap_err matches the planned crash, then restart and
assert the recovered.recovery fields reflect the WAL-rewrite path (e.g.,
appropriate RecoveryStartupKind and
loaded_snapshot_lsn/replayed_wal_frame_count) and that resources ResourceId(11)
and ResourceId(12) exist; reuse the same helpers (SimulationHarness::new,
harness.advance_to, harness.submit, harness.checkpoint, harness.restart) and
mirror the assertions structure from the existing
seeded_checkpoint_crash_after_snapshot_write_is_recoverable test.

---

Outside diff comments:
In `@crates/allocdb-node/src/engine_checkpoint.rs`:
- Around line 51-54: Update the function-level docs for checkpoint() to mention
the new CrashInjected failure mode: explain that checkpoint() may return
CheckpointError::CrashInjected(reason) when an injected crash or simulated
failure is detected, and include that it carries context about the injection;
keep the existing descriptions for halted engine, queued writes, dirty WAL, and
persistence failures (snapshot/WAL) intact and add CrashInjected alongside them
in the # Errors section referencing CheckpointError::CrashInjected to make the
possible return variants comprehensive.

---

Nitpick comments:
In `@crates/allocdb-node/src/api.rs`:
- Around line 157-159: SubmissionError::CrashInjected(_) was recently remapped
to SubmissionFailureCode::StorageFailure which changes the public wire contract;
update the mapping to explicitly pin both the CrashInjected -> StorageFailure
mapping and the preserved Indefinite category and add tests that assert the
serialized wire values cannot change. Concretely: in the match that maps
SubmissionError to SubmissionFailureCode (referencing
SubmissionError::CrashInjected and SubmissionFailureCode::StorageFailure and the
Indefinite variant), add explicit match arms and a comment documenting the
intended wire value, then add unit/integration tests that serialize the
resulting SubmissionFailureCode (and the Indefinite variant) to the wire format
and assert the exact expected bytes/values so future edits will fail if the
mapping drifts.

In `@crates/allocdb-node/src/engine.rs`:
- Around line 370-373: The mapping from RecoveryObserverError::Observer to
RecoverEngineError::CrashInjected currently drops contextual info; before
returning that error in the closure (the map_err branch that matches
RecoveryObserverError::Observer(plan)), call the same logging used by
maybe_inject_crash() (e.g., emit a warning with the seed/point/plan context) so
the crash-injection context is preserved for restart triage; update the logic
around recover_allocdb_with_observer()/maybe_inject_crash() to log a structured
warning including the plan and any seed/point identifiers immediately before
constructing RecoverEngineError::CrashInjected.
- Around line 211-224: The current from_seed() uses sort_unstable() on
CrashPoint's derived Ord, which makes the seed→point mapping unstable when enum
variants are reordered; add a stable sort key and use it instead of derived
Ord—e.g. add a CrashPoint::stable_id() returning a fixed integer for each
variant and change all sort_unstable()/sorting usages in from_seed (and the
other occurrences around lines 260-262) to sort_by_key(|p| p.stable_id()),
ensuring the seed selection is pinned to those stable ids (or alternatively
accept a caller-provided canonical ordering) so future enum reorders/inserts
won't change mappings.

In `@crates/allocdb-node/src/simulation_tests.rs`:
- Around line 387-457: The test only verifies
CrashPoint::RecoveryAfterReplayFrame; add explicit subtests that drive and
assert behavior for each recovery boundary
(CrashPoint::RecoveryAfterSnapshotLoad and CrashPoint::RecoveryAfterWalTruncate)
so they are exercised independently. For each point in the recovery_points array
(use seed_for_point(point, &recovery_points)), create two SimulationHarness
instances (SimulationHarness::new), perform the same submit/checkpoint/advance
sequence, call arm_next_recovery_crash and assert the planned point equals the
current point, call restart to capture the injected crash (unwrap_err and match
RecoverEngineError::CrashInjected), then call restart again to get the recovered
harness and assert expected recovery fields (startup_kind, loaded_snapshot_lsn,
replayed_wal_frame_count/last_lsn) and resource presence via
first.engine().db().resource(ResourceId(...)). Repeat this logic for
CrashPoint::RecoveryAfterSnapshotLoad and CrashPoint::RecoveryAfterWalTruncate
so each abort boundary is covered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 89cc4bb8-36b5-4e7c-8584-824bc37980b1

📥 Commits

Reviewing files that changed from the base of the PR and between a33b738 and 7a068fa.

📒 Files selected for processing (9)
  • crates/allocdb-core/src/recovery.rs
  • crates/allocdb-node/src/api.rs
  • crates/allocdb-node/src/engine.rs
  • crates/allocdb-node/src/engine_checkpoint.rs
  • crates/allocdb-node/src/engine_tests.rs
  • crates/allocdb-node/src/simulation.rs
  • crates/allocdb-node/src/simulation_tests.rs
  • docs/status.md
  • docs/testing.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: error for invariant breaks, corruption, and failed operations that require intervention; warn for degraded but expected conditions such as overload, lag, or rejected requests; info for meaningful lifecycle and state-transition events; debug for detailed execution traces useful in development; trace only for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.

Files:

  • crates/allocdb-node/src/engine_checkpoint.rs
  • crates/allocdb-node/src/api.rs
  • crates/allocdb-node/src/simulation.rs
  • crates/allocdb-node/src/simulation_tests.rs
  • crates/allocdb-node/src/engine_tests.rs
  • crates/allocdb-core/src/recovery.rs
  • crates/allocdb-node/src/engine.rs
**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.

Files:

  • docs/status.md
  • docs/testing.md
docs/status.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep docs/status.md current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.

Files:

  • docs/status.md
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Treat CodeRabbit as part of the required review path when it is enabled on the repository. Wait for its status to complete before merge. If it completes without a visible review comment or review thread, request visible output with `coderabbitai summary`. Address every substantive CodeRabbit comment explicitly before merge by either applying the change or documenting why it is not being applied. Apply correctness, safety, recovery, test, and docs-alignment feedback by default; document why you reject suggestions that would weaken determinism, boundedness, or trusted-core discipline.
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.

Applied to files:

  • docs/status.md
  • crates/allocdb-node/src/simulation_tests.rs
  • crates/allocdb-node/src/engine_tests.rs
🧬 Code graph analysis (4)
crates/allocdb-node/src/simulation.rs (1)
crates/allocdb-node/src/engine.rs (3)
  • recover_with_crash_plan (348-387)
  • from_seed (254-269)
  • is_recovery_boundary (228-235)
crates/allocdb-node/src/engine_tests.rs (1)
crates/allocdb-node/src/engine.rs (1)
  • category (148-157)
crates/allocdb-core/src/recovery.rs (3)
crates/allocdb-core/src/snapshot_file.rs (3)
  • from (19-21)
  • from (25-27)
  • context (141-146)
crates/allocdb-core/src/wal_file.rs (2)
  • from (34-36)
  • frame (229-236)
crates/allocdb-core/src/wal.rs (1)
  • frame (230-237)
crates/allocdb-node/src/engine.rs (2)
crates/allocdb-core/src/recovery.rs (1)
  • recover_allocdb_with_observer (162-190)
crates/allocdb-core/src/wal_file.rs (1)
  • open (45-54)
🔇 Additional comments (8)
docs/status.md (1)

28-30: Status snapshot stays aligned with the crash-plan work.

The milestone shift, validation commands, and recovery-coverage summary all line up with the code in this PR.

Also applies to: 86-97, 103-106

crates/allocdb-core/src/recovery.rs (1)

143-153: Good layering for recovery observers.

recover_allocdb() stays as the no-observer surface, while recover_allocdb_with_observer() plugs the boundary callback straight into the real recovery loop.

Also applies to: 162-190, 192-255

crates/allocdb-node/src/engine_checkpoint.rs (1)

93-110: Crash injection is placed at the two checkpoint durability boundaries.

The post-snapshot-write and post-WAL-rewrite hooks line up with the two distinct on-disk states that recovery has to tolerate.

docs/testing.md (1)

100-107: Testing docs capture the promoted crash-plan model clearly.

The added seed semantics, named boundaries, and restart-path evidence match the harness and regression tests in this PR.

Also applies to: 114-121

crates/allocdb-node/src/engine_tests.rs (1)

874-881: Good regression for CrashInjected categorization.

This pins the new indefinite classification at the engine boundary, which is the contract callers need for retry-after-restart handling.

crates/allocdb-node/src/simulation_tests.rs (1)

271-341: Nice coverage for deterministic selection and post-sync submit recovery.

These tests pin both the seed/order invariants and the persisted-but-unapplied restart path through the real engine.

crates/allocdb-node/src/simulation.rs (1)

96-117: Recovery crash plans remain one-shot and use the real restart path.

Stashing the plan in pending_recovery_crash and consuming it from restart() keeps the injected failure deterministic without adding a simulator-only recovery branch.

Also applies to: 175-215

crates/allocdb-node/src/engine.rs (1)

645-701: Nice one-shot placement around the ambiguous commit boundaries.

The before-append / after-sync / after-apply hooks cover the three materially different recovery states, and centralizing the consume-and-halt logic in maybe_inject_crash() keeps each seeded failure one-shot.

Also applies to: 767-821, 951-965

Stabilize seeded crash-point selection, target replay-frame recovery crashes deterministically across later replayed frames, add recovery/API regression coverage, and tighten checkpoint docs/logging.

Refs #17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

M4-T02 Inject crash points

1 participant