fix(sync): apply deltas before WAL commit so rollback undo works #985

Merged
scarmuega merged 1 commit into main from fix/wal-apply-order-rollback-undo
Apr 29, 2026

@scarmuega scarmuega commented Apr 29, 2026

Summary

The roll work-unit lifecycle wrote the WAL before apply_entities ran, so every on-disk delta carried prev_*: None. When core/sync.rs::rollback later deserialized those rows and called delta.undo(), the first non-trivial delta (typically ControlledAmountInc) panicked with:

panicked at crates/cardano/src/model/accounts.rs:329:47: apply captured stake

This affected anyone who bootstrapped from a snapshot, started syncing, and then hit any peer-driven rollback past blocks dolos had already roll-forwarded since startup.

The ordering itself is pre-existing (set when work units were formalized in #842), but the panic was latent until #971 replaced no-op undo stubs with real expect("apply captured ...") calls. The proptest harness in #971 only round-trips deltas in memory, so it never exercised the WAL serde path.
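
The failure mode described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the dolos code: `Account`, `AmountInc`, `try_undo`, and `wal_snapshots` are hypothetical names, a `clone()` stands in for the WAL write, and `try_undo` returns a `Result` where the real `undo` calls `expect(...)` and panics.

```rust
// Hypothetical, simplified stand-ins; not the actual dolos types.
#[derive(Debug, Clone, PartialEq)]
pub struct Account {
    pub controlled_amount: u64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct AmountInc {
    pub amount: u64,
    // Undo state: filled in by `apply`, consumed by `try_undo`.
    pub prev_amount: Option<u64>,
}

impl AmountInc {
    pub fn new(amount: u64) -> Self {
        Self { amount, prev_amount: None }
    }

    // Captures the pre-apply value, then mutates the entity.
    pub fn apply(&mut self, account: &mut Account) {
        self.prev_amount = Some(account.controlled_amount);
        account.controlled_amount += self.amount;
    }

    // Fallible stand-in for `undo`: fails when `apply` never captured state.
    pub fn try_undo(&self, account: &mut Account) -> Result<(), &'static str> {
        let prev = self.prev_amount.ok_or("apply captured amount")?;
        account.controlled_amount = prev;
        Ok(())
    }
}

// Returns (record as written before apply, record as written after apply).
// `clone()` is a stand-in for serializing the delta into the WAL.
pub fn wal_snapshots(account: &mut Account, amount: u64) -> (AmountInc, AmountInc) {
    let mut delta = AmountInc::new(amount);
    let before = delta.clone(); // old ordering: prev_amount is still None
    delta.apply(account);
    let after = delta.clone(); // new ordering: prev_amount is Some(..)
    (before, after)
}
```

In this sketch, the record snapshotted before `apply` cannot be undone, while the one snapshotted afterwards restores the original amount.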

What changed

  • RollWorkUnit lifecycle reshuffle (crates/cardano/src/roll/work_unit.rs): load_entities moved into load, apply_entities moved into compute. commit_wal now serializes deltas that already carry prev_* undo state. commit_state becomes thin (just persist).
  • Invariant guard (crates/cardano/src/roll/batch.rs): new applied: bool flag on WorkBatch, set by apply_entities, asserted by commit_wal via debug_assert!. A future re-ordering trips locally rather than in production rollback.
  • Reverse undo order (crates/core/src/sync.rs): the inner loop in rollback iterated deltas forward when undoing. Multiple deltas keyed to the same entity must be reversed last-first to walk back through the apply chain correctly. This was masked while prev_* was None and became load-bearing once it was populated.
  • New proptest helper (crates/cardano/src/model/testing.rs): assert_delta_serde_roundtrip serializes the post-apply delta with bincode (the WAL's encoding), deserializes, and asserts undo restores the original entity.
  • Serde-roundtrip proptests for ControlledAmountInc, ControlledAmountDec, StakeRegistration, StakeDelegation, StakeDeregistration, VoteDelegation, WithdrawalInc.
  • Integration test (tests/bootstrap.rs::test_rollback_after_full_sync_lifecycle): feeds blocks through the full sync lifecycle and rolls back. On the un-fixed code it dies with the exact panic from the bug report; with the fix it passes cleanly.

Verification

  • The integration test reproduces the user's exact panic when run against the un-fixed lifecycle (crates/cardano/src/model/accounts.rs:329:47: apply captured stake).
  • With the fix, full workspace test suite passes — 0 failures across all crates plus integration tests, including 117 cardano lib tests.

Out of scope (flagged for follow-up)

  • Boundary work units (Ewrap, Estart, Rupd) don't write deltas to the WAL today (they inherit the default no-op commit_wal), so rollbacks across an epoch boundary still don't undo the boundary deltas. Separate gap.
  • State catch-up from WAL after a crash between commit_wal and commit_state — same recovery window as before this fix; current recovery relies on the peer re-sending the block.

Test plan

  • cargo test --test bootstrap — passes (3/3, including the new regression)
  • cargo test -p dolos-cardano --lib — passes (117/117, including the new serde-roundtrip proptests)
  • cargo test --workspace — passes, 0 failures
  • Confirm the new integration test fails with the exact production panic when applied against an un-fixed RollWorkUnit
  • Smoke-test on a live snapshot bootstrap + sync (recommended before merge — the unit/integration coverage validates the lifecycle but exercising a real chainsync rollback against a peer is the final gate)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved rollback consistency by ensuring deltas are undone in the correct order during state recovery.
  • Tests

    • Added comprehensive serialization roundtrip validation tests.
    • Added integration test for rollback recovery after full sync lifecycle.

The roll work-unit lifecycle wrote the WAL before `apply_entities` ran, so
each on-disk delta carried `prev_*: None`. When `core/sync.rs::rollback`
later deserialized those rows and called `delta.undo()`, the first
non-trivial delta (typically `ControlledAmountInc`) panicked with
`panicked at crates/cardano/src/model/accounts.rs:329:47: apply captured stake`.
Pre-existing — the ordering was set when work units were formalized
(#842) but the panic was latent until #971 replaced no-op `undo`
stubs with real `expect("apply captured ...")` calls.

Reshuffles `RollWorkUnit` so `load_entities` runs in `load`,
`apply_entities` runs in `compute`, and `commit_wal` then serializes
deltas that already carry their `prev_*` undo state. `commit_state`
becomes a thin "persist what's already in memory" pass.

Adds an `applied: bool` invariant flag on `WorkBatch` with a
`debug_assert!` in `commit_wal` so a future re-ordering trips
locally instead of in production rollback.
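
A rough sketch of the guard pattern described here, assuming a heavily simplified batch: `SketchBatch`, its `deltas` payload, and its in-memory `wal` vector are hypothetical and do not mirror the real `WorkBatch` fields beyond the `applied` flag.

```rust
// Hypothetical, simplified stand-in for a work batch; not the real `WorkBatch`.
#[derive(Debug, Default)]
pub struct SketchBatch {
    pub deltas: Vec<u64>, // placeholder payload instead of real deltas
    pub applied: bool,    // set once entity deltas have been applied
    pub wal: Vec<u64>,    // in-memory stand-in for the write-ahead log
}

impl SketchBatch {
    pub fn apply_entities(&mut self) {
        // Real code would mutate entities and capture prev_* state here.
        self.applied = true;
    }

    pub fn commit_wal(&mut self) {
        // Trips in builds with debug assertions if the lifecycle is re-ordered.
        debug_assert!(self.applied, "commit_wal called before apply_entities");
        self.wal.extend(self.deltas.iter().copied());
    }
}
```

Because `debug_assert!` is only checked when debug assertions are enabled, a guard like this costs nothing in optimized release builds and fails fast in tests and development builds.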

Also fixes a related correctness issue exposed once `prev_*` actually
becomes load-bearing: `core/sync.rs::rollback` iterated deltas in
forward order when undoing. Multiple deltas keyed to the same entity
have to be reversed last-first to walk back through the apply chain
correctly. Inner loop is now `.iter_mut().rev()`.
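
To see why the order matters, consider two increments applied to the same balance, each restoring its own captured pre-apply value on undo. The sketch below is illustrative only: `Inc`, `apply_all`, `undo_forward`, and `undo_reverse` are hypothetical names, and a plain `u64` stands in for the entity.

```rust
// Hypothetical delta: undo restores the value captured at apply time.
#[derive(Debug, Clone)]
pub struct Inc {
    pub amount: u64,
    pub prev: Option<u64>,
}

impl Inc {
    pub fn new(amount: u64) -> Self {
        Self { amount, prev: None }
    }

    pub fn apply(&mut self, balance: &mut u64) {
        self.prev = Some(*balance);
        *balance += self.amount;
    }

    pub fn undo(&mut self, balance: &mut u64) {
        *balance = self.prev.take().expect("apply captured balance");
    }
}

pub fn apply_all(deltas: &mut [Inc], balance: &mut u64) {
    for delta in deltas.iter_mut() {
        delta.apply(balance);
    }
}

// Old behaviour: undo in apply order, which ends on an intermediate value.
pub fn undo_forward(deltas: &mut [Inc], balance: &mut u64) {
    for delta in deltas.iter_mut() {
        delta.undo(balance);
    }
}

// Fixed behaviour: last applied, first undone.
pub fn undo_reverse(deltas: &mut [Inc], balance: &mut u64) {
    for delta in deltas.iter_mut().rev() {
        delta.undo(balance);
    }
}
```

Starting from 100 and applying increments of 10 and 5, the second delta captures 110 as its previous value. Undoing forward therefore leaves the balance at 110, whereas undoing in reverse walks back through 110 to the original 100.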

New regression coverage:
- `assert_delta_serde_roundtrip` proptest helper that serializes the
  post-apply delta with bincode (the WAL's encoding), deserializes,
  and asserts `undo` restores the original entity. The existing
  `assert_delta_roundtrip` only round-trips in memory, so it never
  exercised the wire format.
- Serde-roundtrip proptests for `ControlledAmountInc`,
  `ControlledAmountDec`, `StakeRegistration`, `StakeDelegation`,
  `StakeDeregistration`, `VoteDelegation`, `WithdrawalInc`.
- Integration test `test_rollback_after_full_sync_lifecycle` that
  feeds blocks through the full sync lifecycle and rolls back. On
  the un-fixed code it dies with the exact panic from the bug
  report; with the fix it passes cleanly.
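
The shape of a serde-roundtrip check can be sketched as follows. This is not the actual `assert_delta_serde_roundtrip` helper: the real one uses bincode, while this sketch substitutes a hand-rolled byte encoding so that it needs no external crates. `WithdrawalSketch`, `read_u64`, and `assert_serde_roundtrip_sketch` are hypothetical names.

```rust
// Hypothetical delta with captured undo state.
#[derive(Debug, Clone, PartialEq)]
pub struct WithdrawalSketch {
    pub amount: u64,
    pub prev_total: Option<u64>,
}

// Reads a little-endian u64 at `at`, or None if the buffer is too short.
fn read_u64(bytes: &[u8], at: usize) -> Option<u64> {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes.get(at..at + 8)?);
    Some(u64::from_le_bytes(buf))
}

impl WithdrawalSketch {
    pub fn apply(&mut self, total: &mut u64) {
        self.prev_total = Some(*total);
        *total += self.amount;
    }

    pub fn undo(&self, total: &mut u64) {
        *total = self.prev_total.expect("apply captured total");
    }

    // Stand-in wire format (NOT bincode): amount, tag byte, optional prev.
    pub fn encode(&self) -> Vec<u8> {
        let mut out = self.amount.to_le_bytes().to_vec();
        match self.prev_total {
            Some(prev) => {
                out.push(1);
                out.extend_from_slice(&prev.to_le_bytes());
            }
            None => out.push(0),
        }
        out
    }

    pub fn decode(bytes: &[u8]) -> Option<Self> {
        let amount = read_u64(bytes, 0)?;
        let prev_total = match *bytes.get(8)? {
            0 => None,
            1 => Some(read_u64(bytes, 9)?),
            _ => return None,
        };
        Some(Self { amount, prev_total })
    }
}

// Apply, push the post-apply delta through the wire format, then undo
// using only the decoded copy and check the original value is restored.
pub fn assert_serde_roundtrip_sketch(original: u64, amount: u64) {
    let mut total = original;
    let mut delta = WithdrawalSketch { amount, prev_total: None };
    delta.apply(&mut total);
    let restored = WithdrawalSketch::decode(&delta.encode()).expect("decodes");
    assert_eq!(restored, delta);
    restored.undo(&mut total);
    assert_eq!(total, original);
}
```

The key point is that `undo` runs on the decoded copy rather than the in-memory delta, so a field that fails to survive encoding shows up as a test failure.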

Out of scope (flagged for follow-up):
- Boundary work units (Ewrap, Estart, Rupd) don't write deltas to
  the WAL, so rollbacks across an epoch boundary still don't undo
  the boundary deltas.
- State catch-up from WAL after a crash between commit_wal and
  commit_state — same recovery window as before this fix; relies
  on the peer re-sending the block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Apr 29, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

The pull request reorganizes the roll pipeline to apply entity deltas earlier during computation rather than at commit time, ensuring undo (prev_*) state is captured before WAL serialization. Comprehensive serde roundtrip tests validate delta undo behavior through bincode serialization, and rollback undo logic now traverses deltas in reverse order for correctness. An integration test verifies full rollback lifecycle.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Dependencies**: `crates/cardano/Cargo.toml` | Added bincode as workspace-managed development dependency for serde roundtrip testing. |
| **Test Infrastructure & Helpers**: `crates/cardano/src/model/testing.rs`, `crates/cardano/src/model/accounts.rs`, `tests/bootstrap.rs` | Introduced `assert_delta_serde_roundtrip` helper for validating WAL-style serialization/deserialization of deltas; added seven property tests in accounts module exercising undo after bincode roundtrips; added integration test for full rollback lifecycle after block sync. |
| **Roll Pipeline Reorganization**: `crates/cardano/src/roll/work_unit.rs`, `crates/cardano/src/roll/batch.rs` | Shifted entity loading to `load()` phase and delta application to `compute()` phase; `commit_wal()` no longer performs slot sorting or entity application; added `applied` flag tracking in `WorkBatch` with assertion ensuring deltas are applied before WAL commit; comments updated to reflect control flow. |
| **Rollback Undo Ordering**: `crates/core/src/sync.rs` | Modified rollback to traverse `log.delta` in reverse order when undoing, ensuring last-applied-first-undone semantics; added inline comments describing ordering requirement tied to each delta's pre-apply captured state. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 The deltas now apply before they dance with WAL,
In reverse we undo them, heeding logic's call,
Bincode tests the round-trip through serialize's hall,
Prev-state captured early, undo restores all!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main fix: reordering the WAL commit lifecycle so deltas carry undo state, enabling correct rollback behavior. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@scarmuega scarmuega merged commit 5ee5215 into main Apr 29, 2026
11 of 12 checks passed
@scarmuega scarmuega deleted the fix/wal-apply-order-rollback-undo branch April 29, 2026 19:45