feat(wal): version on-disk schema and force-reset on incompatibility#984
feat(wal): version on-disk schema and force-reset on incompatibility#984
Conversation
Adds a WAL_METADATA table to the redb WAL DB tracking a schema version, checked at open(). Older or missing versions trigger a wipe-and-stamp; newer versions return WalError::IncompatibleVersion to prevent an old binary from destroying newer data. After a wipe, check_wal_in_sync_with_state auto-reseeds the WAL from the state cursor (when fully defined) instead of erroring, eliminating the manual `dolos doctor reset-wal` step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis pull request adds WAL schema version tracking with on-disk metadata and enhances bootstrap state-WAL synchronization. It introduces a new Changes
Sequence Diagram(s)sequenceDiagram
participant Init as Store<br/>Initialization
participant Meta as wal_metadata<br/>Table
participant Check as Compatibility<br/>Check
participant WAL as wal<br/>Table
participant Action as Resolution<br/>Action
Init->>Meta: Read version
Meta-->>Check: Return stored version
Check->>Check: Compare to CURRENT_WAL_VERSION
alt Versions Match
Check->>Action: No-op (log debug)
Action-->>Init: Continue
else Found > Expected
Check->>Action: Return IncompatibleVersion error
Action-->>Init: Fail with diagnostic
else Found < Expected (WAL empty)
Check->>Meta: Stamp new version
Meta-->>Action: Version updated
Action-->>Init: Continue (log info)
else Found < Expected (WAL non-empty)
Check->>WAL: Wipe wal table
WAL-->>Meta: Confirm
Meta->>Meta: Stamp new version
Meta-->>Action: Version updated
Action-->>Init: Continue (log warn)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 60 minutes.Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/bin/dolos/common.rs (1)
65-73: Clarify that version numbers refer to WAL schema, not software releases.The help message uses
v{found}andv{expected}which could mislead users into thinking these are software versions (like v1.0.3). These are actually WAL schema versions (integers like 1, 2). Consider rephrasing to avoid confusion:✏️ Suggested clarification
Error::WalError(WalError::IncompatibleVersion { found, expected }) => miette::miette!( help = format!( - "WAL was created by a newer dolos version (v{found}) than this binary supports (v{expected}); upgrade dolos or run `dolos bootstrap --force` to wipe storage and re-bootstrap", + "WAL schema version {found} is newer than this binary supports (schema version {expected}); upgrade dolos or run `dolos bootstrap --force` to wipe storage and re-bootstrap", ), - "incompatible WAL version: found v{found}, expected v{expected}", + "incompatible WAL schema version: found {found}, expected {expected}", ),crates/redb3/src/wal/mod.rs (1)
777-784:reset_to()is not atomic — document or fix.This method performs two separate transactions (
remove_entriesthenappend_entries). A crash or error between them leaves the WAL empty. Consider combining into a single transaction for consistency withwipe_and_stamp().♻️ Proposed atomic implementation
fn reset_to(&self, point: &ChainPoint) -> Result<(), RedbWalError> { - self.remove_entries(&ChainPoint::Origin)?; - - let entry = (point.clone(), LogValue::origin()); - self.append_entries(&[entry])?; - + let mut wx = self.db.begin_write()?; + wx.set_quick_repair(true); + { + let after = DbChainPoint::from(ChainPoint::Origin); + let mut table = wx.open_table(WAL)?; + // Remove all entries after Origin (i.e., everything) + let bounds = after.clone()..; + let range = table.extract_from_if(bounds, |current, _| current > after)?; + for entry in range { + let _ = entry?; + } + // Insert the new origin entry + let key = DbChainPoint::from(point.clone()); + let value = to_raw_log_value(&LogValue::<T>::origin()); + table.insert(key, value)?; + } + wx.commit()?; Ok(()) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/redb3/src/wal/mod.rs` around lines 777 - 784, reset_to currently performs two separate operations (remove_entries then append_entries) so a crash between them can leave the WAL empty; update reset_to to perform both actions in a single atomic transaction similar to wipe_and_stamp. Concretely, either reuse the internal transaction logic from wipe_and_stamp (or extract its transaction block into a shared helper) and perform remove_entries(&ChainPoint::Origin) and the append of (point.clone(), LogValue::origin()) inside the same transaction, or add a new transactional method that runs both operations together and call that from reset_to to ensure atomicity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/core/src/bootstrap.rs`:
- Around line 79-86: The call to domain.wal().reset_to(state) can leave the WAL
empty if reset_to() executes remove_entries() and append_entries() in separate
transactions and the second transaction fails; either make reset_to() atomic
(combine remove_entries + append_entries into a single write transaction inside
the WAL implementation) or, if atomicity is not feasible, add a clear comment
here in bootstrap.rs alongside the warn!/reset_to call explaining that
reset_to() performs two separate transactions, may transiently leave the WAL
empty on partial failure, and that the bootstrap path will retry on next startup
(reference reset_to, remove_entries, append_entries, and the match arm handling
None/Some(state)).
---
Nitpick comments:
In `@crates/redb3/src/wal/mod.rs`:
- Around line 777-784: reset_to currently performs two separate operations
(remove_entries then append_entries) so a crash between them can leave the WAL
empty; update reset_to to perform both actions in a single atomic transaction
similar to wipe_and_stamp. Concretely, either reuse the internal transaction
logic from wipe_and_stamp (or extract its transaction block into a shared
helper) and perform remove_entries(&ChainPoint::Origin) and the append of
(point.clone(), LogValue::origin()) inside the same transaction, or add a new
transactional method that runs both operations together and call that from
reset_to to ensure atomicity.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0dc211d2-5519-4553-b6e1-e11d33515df7
📒 Files selected for processing (4)
crates/core/src/bootstrap.rscrates/core/src/lib.rscrates/redb3/src/wal/mod.rssrc/bin/dolos/common.rs
| (None, Some(ref state)) if state.is_fully_defined() => { | ||
| // WAL is empty but state has a fully-defined cursor — likely a | ||
| // post-bootstrap or post-version-wipe situation. Auto-reseed the | ||
| // WAL from the state cursor so the node can resume syncing | ||
| // without manual intervention. | ||
| warn!(%state, "WAL is empty but state cursor is fully defined; reseeding WAL from state"); | ||
| domain.wal().reset_to(state)?; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify reset_to implementation atomicity in both backends
echo "=== redb3 reset_to implementation ==="
rg -nA 10 'fn reset_to.*point.*ChainPoint' crates/redb3/src/wal/
echo ""
echo "=== fjall reset_to implementation (if exists) ==="
rg -nA 10 'fn reset_to.*point.*ChainPoint' crates/fjall/Repository: txpipe/dolos
Length of output: 1691
🏁 Script executed:
# Find WAL trait definition and understand transaction semantics
echo "=== WalStore trait definition ==="
rg -nB 2 -A 15 'trait.*Wal.*Store' crates/
echo ""
echo "=== fjall WAL implementation search ==="
fd -t f '*.rs' crates/fjall | xargs rg -l 'reset_to\|WalStore' | head -5
echo ""
echo "=== Check remove_entries and append_entries transaction semantics ==="
rg -nB 2 -A 5 'fn remove_entries|fn append_entries' crates/redb3/src/wal/mod.rs | head -40Repository: txpipe/dolos
Length of output: 2690
🏁 Script executed:
# Search fjall directory structure
echo "=== fjall directory contents ==="
fd -t f '\.rs$' crates/fjall/src/ | head -20
echo ""
echo "=== Search for fjall WalStore implementation ==="
rg -n 'impl.*WalStore' crates/fjall/
echo ""
echo "=== Search for reset_to in fjall ==="
rg -nA 8 'fn reset_to' crates/fjall/src/
echo ""
echo "=== Check if fjall backend exists and has wal module ==="
find crates/fjall -type f -name '*.rs' | xargs grep -l 'reset_to' 2>/dev/null || echo "No reset_to found in fjall"Repository: txpipe/dolos
Length of output: 584
🏁 Script executed:
# Check available Domain implementations and their backends
echo "=== Domain trait implementations ==="
rg -n 'impl.*Domain' crates/core/src/ | head -10
echo ""
echo "=== Bootstrap context - check what domain type is used ==="
rg -nB 5 'domain\.wal\(\)\.reset_to' crates/core/src/bootstrap.rs
echo ""
echo "=== Check if fjall implements WalStore ==="
rg -rn 'impl.*WalStore.*fjall\|struct.*Fjall.*Wal' crates/
echo ""
echo "=== List all backend crates ==="
ls -la crates/ | grep -E '^d' | awk '{print $NF}'Repository: txpipe/dolos
Length of output: 1444
🏁 Script executed:
# Check if fjall is expected to implement WalStore at all
echo "=== Search all crates for WalStore implementations ==="
rg -r 'impl.*WalStore' crates/
echo ""
echo "=== Check fjall lib.rs for what it exports ==="
head -50 crates/fjall/src/lib.rs
echo ""
echo "=== Check cargo features and optional dependencies ==="
rg -A 20 '\[features\]' crates/fjall/Cargo.toml crates/redb3/Cargo.toml 2>/dev/null | head -50Repository: txpipe/dolos
Length of output: 11494
🏁 Script executed:
# Check Domain trait's wal() method signature and WalStore usage
echo "=== Domain trait definition for wal() ==="
rg -nB 2 -A 2 'fn wal\(' crates/core/src/lib.rs
echo ""
echo "=== Check if there are multiple WAL implementations or only redb3 ==="
rg -r 'impl.*WalStore.*for' crates/
echo ""
echo "=== Verify remove_entries and append_entries use separate transactions ==="
rg -nB 1 -A 8 'fn remove_entries\|fn append_entries' crates/redb3/src/wal/mod.rs | grep -A 8 'fn remove_entries'Repository: txpipe/dolos
Length of output: 10670
🏁 Script executed:
# Get clean search for WalStore implementations
echo "=== All WalStore trait implementations ==="
rg -n '^impl.*WalStore' crates/ --type rust
echo ""
echo "=== Recheck reset_to implementation in full ==="
sed -n '777,784p' crates/redb3/src/wal/mod.rs
echo ""
echo "=== Check if this pattern (two transactions) is intentional/documented ==="
rg -B 5 'fn reset_to' crates/redb3/src/wal/mod.rs | head -20Repository: txpipe/dolos
Length of output: 1155
🏁 Script executed:
# Understand transaction semantics - are they truly independent or is there recovery?
echo "=== Check remove_entries implementation ==="
sed -n '749,775p' crates/redb3/src/wal/mod.rs
echo ""
echo "=== Check append_entries implementation ==="
sed -n '730,748p' crates/redb3/src/wal/mod.rs
echo ""
echo "=== Look for error handling patterns in WAL module ==="
rg -B 2 -A 2 'begin_write' crates/redb3/src/wal/mod.rs | head -30Repository: txpipe/dolos
Length of output: 2445
🏁 Script executed:
# Check if there's any idempotency or recovery documentation
echo "=== Check comments around reset_to ==="
sed -n '770,790p' crates/redb3/src/wal/mod.rs
echo ""
echo "=== Check if redb has transaction rollback/recovery guarantees ==="
rg -B 3 'set_quick_repair' crates/redb3/src/wal/mod.rs | head -20
echo ""
echo "=== Verify the scenario - can reset_to fail between operations? ==="
rg -n 'append_entries.*Result' crates/redb3/src/wal/mod.rsRepository: txpipe/dolos
Length of output: 1640
reset_to() performs two separate transactions; partial failure leaves WAL empty.
reset_to() calls remove_entries() then append_entries() as separate write transactions. If remove_entries() succeeds but append_entries() fails (e.g., disk full), the WAL is left empty while the state cursor persists. This is recoverable on the next startup (the same code path retries), but the non-atomic pattern should be documented or refactored for clarity:
- Make
reset_to()atomic within a single transaction, or - Add a comment in bootstrap.rs explaining the retry-on-failure semantics
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/core/src/bootstrap.rs` around lines 79 - 86, The call to
domain.wal().reset_to(state) can leave the WAL empty if reset_to() executes
remove_entries() and append_entries() in separate transactions and the second
transaction fails; either make reset_to() atomic (combine remove_entries +
append_entries into a single write transaction inside the WAL implementation)
or, if atomicity is not feasible, add a clear comment here in bootstrap.rs
alongside the warn!/reset_to call explaining that reset_to() performs two
separate transactions, may transiently leave the WAL empty on partial failure,
and that the bootstrap path will retry on next startup (reference reset_to,
remove_entries, append_entries, and the match arm handling None/Some(state)).
Summary
wal_metadataredb table tracking a schema version (CURRENT_WAL_VERSION = 1), checked onRedbWalStore::open/memory(). Older or missing versions trigger a wipe-and-stamp in a single atomic transaction; a newer on-disk version returnsWalError::IncompatibleVersionso an old binary can't destroy newer data.check_wal_in_sync_with_statewhen the WAL is empty and the state cursor is fully defined, replacing the manualdolos doctor reset-walstep. Partial-cursor cases still error.IncompatibleVersionfromsetup_domainwith a help message pointing atdolos bootstrap --forcefor accidental downgrades.Motivation
Backward compatibility of the WAL with v1.0.3 broke unexpectedly. Without a version marker, mismatched on-disk data can cause bincode panics or silent corruption. This change makes incompatibility detectable and recoverable: nodes upgrading silently wipe and reseed, nodes downgrading refuse to start.
Test plan
cargo build --workspacecargo test -p dolos-redb3(5 new tests for fresh init, legacy upgrade, downgrade refusal, idempotent reopen)cargo test --workspace --libcargo test --test bootstrap(existing integration tests pass)CURRENT_WAL_VERSIONlocally, confirmIncompatibleVersionerror surfaces insetup_domain🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes