fix(sync): apply deltas before WAL commit so rollback undo works#985
fix(sync): apply deltas before WAL commit so rollback undo works#985
Conversation
The roll work-unit lifecycle wrote the WAL before `apply_entities` ran, so each on-disk delta carried `prev_*: None`. When `core/sync.rs::rollback` later deserialized those rows and called `delta.undo()`, the first non-trivial delta (typically `ControlledAmountInc`) panicked with `panicked at crates/cardano/src/model/accounts.rs:329:47: apply captured stake`. Pre-existing — the ordering was set when work units were formalized (#842) but the panic was latent until #971 replaced no-op `undo` stubs with real `expect("apply captured ...")` calls. Reshuffles `RollWorkUnit` so `load_entities` runs in `load`, `apply_entities` runs in `compute`, and `commit_wal` then serializes deltas that already carry their `prev_*` undo state. `commit_state` becomes a thin "persist what's already in memory" pass. Adds an `applied: bool` invariant flag on `WorkBatch` with a `debug_assert!` in `commit_wal` so a future re-ordering trips locally instead of in production rollback. Also fixes a related correctness issue exposed once `prev_*` actually becomes load-bearing: `core/sync.rs::rollback` iterated deltas in forward order when undoing. Multiple deltas keyed to the same entity have to be reversed last-first to walk back through the apply chain correctly. Inner loop is now `.iter_mut().rev()`. New regression coverage: - `assert_delta_serde_roundtrip` proptest helper that serializes the post-apply delta with bincode (the WAL's encoding), deserializes, and asserts `undo` restores the original entity. The existing `assert_delta_roundtrip` only round-trips in memory, so it never exercised the wire format. - Serde-roundtrip proptests for `ControlledAmountInc`, `ControlledAmountDec`, `StakeRegistration`, `StakeDelegation`, `StakeDeregistration`, `VoteDelegation`, `WithdrawalInc`. - Integration test `test_rollback_after_full_sync_lifecycle` that feeds blocks through the full sync lifecycle and rolls back. On the un-fixed code it dies with the exact panic from the bug report; with the fix it passes cleanly. Out of scope (flagged for follow-up): - Boundary work units (Ewrap, Estart, Rupd) don't write deltas to the WAL, so rollbacks across an epoch boundary still don't undo the boundary deltas. - State catch-up from WAL after a crash between commit_wal and commit_state — same recovery window as before this fix; relies on the peer re-sending the block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughWalkthroughThe pull request reorganizes the roll pipeline to apply entity deltas earlier during computation rather than at commit time, ensuring undo ( Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 60 minutes.Comment |
Summary
The roll work-unit lifecycle wrote the WAL before
apply_entitiesran, so every on-disk delta carriedprev_*: None. Whencore/sync.rs::rollbacklater deserialized those rows and calleddelta.undo(), the first non-trivial delta (typicallyControlledAmountInc) panicked with:This affected anyone who bootstrapped from a snapshot, started syncing, and then hit any peer-driven rollback past blocks dolos had already roll-forwarded since startup.
The ordering itself is pre-existing (set when work units were formalized in #842), but the panic was latent until #971 replaced no-op
undostubs with realexpect("apply captured ...")calls. The proptest harness in #971 only round-trips deltas in memory, so it never exercised the WAL serde path.What changed
RollWorkUnitlifecycle reshuffle (crates/cardano/src/roll/work_unit.rs):load_entitiesmoved intoload,apply_entitiesmoved intocompute.commit_walnow serializes deltas that already carryprev_*undo state.commit_statebecomes thin (just persist).crates/cardano/src/roll/batch.rs): newapplied: boolflag onWorkBatch, set byapply_entities, asserted bycommit_walviadebug_assert!. A future re-ordering trips locally rather than in production rollback.crates/core/src/sync.rs): the inner loop inrollbackiterated deltas forward when undoing. Multiple deltas keyed to the same entity must be reversed last-first to walk back through the apply chain correctly. Was masked whileprev_*wasNone; load-bearing once it's populated.crates/cardano/src/model/testing.rs):assert_delta_serde_roundtripserializes the post-apply delta with bincode (the WAL's encoding), deserializes, and assertsundorestores the original entity.ControlledAmountInc,ControlledAmountDec,StakeRegistration,StakeDelegation,StakeDeregistration,VoteDelegation,WithdrawalInc.tests/bootstrap.rs::test_rollback_after_full_sync_lifecycle): feeds blocks through the full sync lifecycle and rolls back. On the un-fixed code it dies with the exact panic from the bug report; with the fix it passes cleanly.Verification
crates/cardano/src/model/accounts.rs:329:47: apply captured stake).Out of scope (flagged for follow-up)
Ewrap,Estart,Rupd) don't write deltas to the WAL today (they inherit the default no-opcommit_wal), so rollbacks across an epoch boundary still don't undo the boundary deltas. Separate gap.commit_walandcommit_state— same recovery window as before this fix; current recovery relies on the peer re-sending the block.Test plan
cargo test --test bootstrap— passes (3/3, including the new regression)cargo test -p dolos-cardano --lib— passes (117/117, including the new serde-roundtrip proptests)cargo test --workspace— passes, 0 failuresRollWorkUnit🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests