Skip to content

feat(wal): version on-disk schema and force-reset on incompatibility#984

Merged
scarmuega merged 1 commit intomainfrom
feat/wal-versioning
Apr 29, 2026
Merged

feat(wal): version on-disk schema and force-reset on incompatibility#984
scarmuega merged 1 commit intomainfrom
feat/wal-versioning

Conversation

@scarmuega
Copy link
Copy Markdown
Member

@scarmuega scarmuega commented Apr 29, 2026

Summary

  • Adds a wal_metadata redb table tracking a schema version (CURRENT_WAL_VERSION = 1), checked on RedbWalStore::open / memory(). Older or missing versions trigger a wipe-and-stamp in a single atomic transaction; a newer on-disk version returns WalError::IncompatibleVersion so an old binary can't destroy newer data.
  • Auto-reseeds the WAL in check_wal_in_sync_with_state when the WAL is empty and the state cursor is fully defined, replacing the manual dolos doctor reset-wal step. Partial-cursor cases still error.
  • Surfaces IncompatibleVersion from setup_domain with a help message pointing at dolos bootstrap --force for accidental downgrades.

Motivation

Backward compatibility of the WAL with v1.0.3 broke unexpectedly. Without a version marker, mismatched on-disk data can cause bincode panics or silent corruption. This change makes incompatibility detectable and recoverable: nodes upgrading silently wipe and reseed, nodes downgrading refuse to start.

Test plan

  • cargo build --workspace
  • cargo test -p dolos-redb3 (5 new tests for fresh init, legacy upgrade, downgrade refusal, idempotent reopen)
  • cargo test --workspace --lib
  • cargo test --test bootstrap (existing integration tests pass)
  • Manual: build new binary, point at an existing on-disk v1.0.3 instance, observe wipe + reseed warn logs, verify sync resumes
  • Manual: simulate downgrade by bumping CURRENT_WAL_VERSION locally, confirm IncompatibleVersion error surfaces in setup_domain

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • WAL schema version tracking prevents incompatible database versions from corrupting data by validating compatibility at startup.
  • Bug Fixes

    • Enhanced WAL state consistency checks with more informative error messages. Users encountering version incompatibilities now receive explicit recovery instructions including upgrade or force bootstrap options.

Adds a WAL_METADATA table to the redb WAL DB tracking a schema version,
checked at open(). Older or missing versions trigger a wipe-and-stamp;
newer versions return WalError::IncompatibleVersion to prevent an old
binary from destroying newer data. After a wipe, check_wal_in_sync_with_state
auto-reseeds the WAL from the state cursor (when fully defined) instead of
erroring, eliminating the manual `dolos doctor reset-wal` step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Walkthrough

This pull request adds WAL schema version tracking with on-disk metadata and enhances bootstrap state-WAL synchronization. It introduces a new WalError::IncompatibleVersion variant, implements version compatibility checking during store initialization, updates the bootstrap integrity check to distinguish fully-defined vs partially-defined state cursors, and provides user-facing diagnostics for version mismatches in the CLI.

Changes

Cohort / File(s) Summary
Core Error Handling
crates/core/src/lib.rs
Added WalError::IncompatibleVersion variant carrying both found and expected version values for format compatibility errors.
Bootstrap State-WAL Synchronization
crates/core/src/bootstrap.rs
Enhanced integrity check to distinguish fully-defined vs partially-defined state cursors; now reseeds/resets WAL for fully-defined cursors instead of treating as inconsistent state.
WAL Version Tracking
crates/redb3/src/wal/mod.rs
Introduced wal_metadata table for on-disk version tracking; added ensure_version() compatibility check during store initialization with branching logic for stamp, wipe, or error outcomes; includes helper functions and unit tests for version handling.
CLI Error Diagnostics
src/bin/dolos/common.rs
Added targeted error handling in setup_domain to intercept incompatible WAL version errors and provide formatted miette diagnostic with upgrade/recovery guidance.

Sequence Diagram(s)

sequenceDiagram
    participant Init as Store<br/>Initialization
    participant Meta as wal_metadata<br/>Table
    participant Check as Compatibility<br/>Check
    participant WAL as wal<br/>Table
    participant Action as Resolution<br/>Action

    Init->>Meta: Read version
    Meta-->>Check: Return stored version
    Check->>Check: Compare to CURRENT_WAL_VERSION
    
    alt Versions Match
        Check->>Action: No-op (log debug)
        Action-->>Init: Continue
    else Found > Expected
        Check->>Action: Return IncompatibleVersion error
        Action-->>Init: Fail with diagnostic
    else Found < Expected (WAL empty)
        Check->>Meta: Stamp new version
        Meta-->>Action: Version updated
        Action-->>Init: Continue (log info)
    else Found < Expected (WAL non-empty)
        Check->>WAL: Wipe wal table
        WAL-->>Meta: Confirm
        Meta->>Meta: Stamp new version
        Meta-->>Action: Version updated
        Action-->>Init: Continue (log warn)
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • fix: avoid slot-only chain intersection #948: Directly related—both PRs modify the WAL-vs-state synchronization logic in bootstrap, specifically the distinction between fully-defined and partially-defined state cursors and WAL reseeding behavior.
  • fix(cli): reset wal after any kind of bootstrap #950: Directly related—both PRs modify bootstrap WAL reseeding logic and touch the same core bootstrap code paths for state cursor handling.
  • chore: unify both redb crates #798: Related through WAL module scope—the earlier PR moved/renamed WAL functionality into dolos\_redb3, while this PR implements WAL on-disk schema versioning inside the redb3 WAL module.

Poem

🐰 Hoppy hop! The WAL tracks well,
Version stamps in metadata dwell,
When states and cursors both align,
The rabbit reseeds with glee so fine—
Consistency blooms, hooray! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title directly and accurately reflects the main changes: adding versioning to on-disk WAL schema and implementing force-reset behavior on version incompatibility.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/wal-versioning

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 60 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/bin/dolos/common.rs (1)

65-73: Clarify that version numbers refer to WAL schema, not software releases.

The help message uses v{found} and v{expected} which could mislead users into thinking these are software versions (like v1.0.3). These are actually WAL schema versions (integers like 1, 2). Consider rephrasing to avoid confusion:

✏️ Suggested clarification
         Error::WalError(WalError::IncompatibleVersion { found, expected }) => miette::miette!(
             help = format!(
-                "WAL was created by a newer dolos version (v{found}) than this binary supports (v{expected}); upgrade dolos or run `dolos bootstrap --force` to wipe storage and re-bootstrap",
+                "WAL schema version {found} is newer than this binary supports (schema version {expected}); upgrade dolos or run `dolos bootstrap --force` to wipe storage and re-bootstrap",
             ),
-            "incompatible WAL version: found v{found}, expected v{expected}",
+            "incompatible WAL schema version: found {found}, expected {expected}",
         ),
crates/redb3/src/wal/mod.rs (1)

777-784: reset_to() is not atomic — document or fix.

This method performs two separate transactions (remove_entries then append_entries). A crash or error between them leaves the WAL empty. Consider combining into a single transaction for consistency with wipe_and_stamp().

♻️ Proposed atomic implementation
     fn reset_to(&self, point: &ChainPoint) -> Result<(), RedbWalError> {
-        self.remove_entries(&ChainPoint::Origin)?;
-
-        let entry = (point.clone(), LogValue::origin());
-        self.append_entries(&[entry])?;
-
+        let mut wx = self.db.begin_write()?;
+        wx.set_quick_repair(true);
+        {
+            let after = DbChainPoint::from(ChainPoint::Origin);
+            let mut table = wx.open_table(WAL)?;
+            // Remove all entries after Origin (i.e., everything)
+            let bounds = after.clone()..;
+            let range = table.extract_from_if(bounds, |current, _| current > after)?;
+            for entry in range {
+                let _ = entry?;
+            }
+            // Insert the new origin entry
+            let key = DbChainPoint::from(point.clone());
+            let value = to_raw_log_value(&LogValue::<T>::origin());
+            table.insert(key, value)?;
+        }
+        wx.commit()?;
         Ok(())
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/redb3/src/wal/mod.rs` around lines 777 - 784, reset_to currently
performs two separate operations (remove_entries then append_entries) so a crash
between them can leave the WAL empty; update reset_to to perform both actions in
a single atomic transaction similar to wipe_and_stamp. Concretely, either reuse
the internal transaction logic from wipe_and_stamp (or extract its transaction
block into a shared helper) and perform remove_entries(&ChainPoint::Origin) and
the append of (point.clone(), LogValue::origin()) inside the same transaction,
or add a new transactional method that runs both operations together and call
that from reset_to to ensure atomicity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/core/src/bootstrap.rs`:
- Around line 79-86: The call to domain.wal().reset_to(state) can leave the WAL
empty if reset_to() executes remove_entries() and append_entries() in separate
transactions and the second transaction fails; either make reset_to() atomic
(combine remove_entries + append_entries into a single write transaction inside
the WAL implementation) or, if atomicity is not feasible, add a clear comment
here in bootstrap.rs alongside the warn!/reset_to call explaining that
reset_to() performs two separate transactions, may transiently leave the WAL
empty on partial failure, and that the bootstrap path will retry on next startup
(reference reset_to, remove_entries, append_entries, and the match arm handling
None/Some(state)).

---

Nitpick comments:
In `@crates/redb3/src/wal/mod.rs`:
- Around line 777-784: reset_to currently performs two separate operations
(remove_entries then append_entries) so a crash between them can leave the WAL
empty; update reset_to to perform both actions in a single atomic transaction
similar to wipe_and_stamp. Concretely, either reuse the internal transaction
logic from wipe_and_stamp (or extract its transaction block into a shared
helper) and perform remove_entries(&ChainPoint::Origin) and the append of
(point.clone(), LogValue::origin()) inside the same transaction, or add a new
transactional method that runs both operations together and call that from
reset_to to ensure atomicity.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0dc211d2-5519-4553-b6e1-e11d33515df7

📥 Commits

Reviewing files that changed from the base of the PR and between 86cc287 and 30ac13b.

📒 Files selected for processing (4)
  • crates/core/src/bootstrap.rs
  • crates/core/src/lib.rs
  • crates/redb3/src/wal/mod.rs
  • src/bin/dolos/common.rs

Comment on lines +79 to +86
(None, Some(ref state)) if state.is_fully_defined() => {
// WAL is empty but state has a fully-defined cursor — likely a
// post-bootstrap or post-version-wipe situation. Auto-reseed the
// WAL from the state cursor so the node can resume syncing
// without manual intervention.
warn!(%state, "WAL is empty but state cursor is fully defined; reseeding WAL from state");
domain.wal().reset_to(state)?;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify reset_to implementation atomicity in both backends
echo "=== redb3 reset_to implementation ==="
rg -nA 10 'fn reset_to.*point.*ChainPoint' crates/redb3/src/wal/

echo ""
echo "=== fjall reset_to implementation (if exists) ==="
rg -nA 10 'fn reset_to.*point.*ChainPoint' crates/fjall/

Repository: txpipe/dolos

Length of output: 1691


🏁 Script executed:

# Find WAL trait definition and understand transaction semantics
echo "=== WalStore trait definition ==="
rg -nB 2 -A 15 'trait.*Wal.*Store' crates/

echo ""
echo "=== fjall WAL implementation search ==="
fd -t f '*.rs' crates/fjall | xargs rg -l 'reset_to\|WalStore' | head -5

echo ""
echo "=== Check remove_entries and append_entries transaction semantics ==="
rg -nB 2 -A 5 'fn remove_entries|fn append_entries' crates/redb3/src/wal/mod.rs | head -40

Repository: txpipe/dolos

Length of output: 2690


🏁 Script executed:

# Search fjall directory structure
echo "=== fjall directory contents ==="
fd -t f '\.rs$' crates/fjall/src/ | head -20

echo ""
echo "=== Search for fjall WalStore implementation ==="
rg -n 'impl.*WalStore' crates/fjall/

echo ""
echo "=== Search for reset_to in fjall ==="
rg -nA 8 'fn reset_to' crates/fjall/src/

echo ""
echo "=== Check if fjall backend exists and has wal module ==="
find crates/fjall -type f -name '*.rs' | xargs grep -l 'reset_to' 2>/dev/null || echo "No reset_to found in fjall"

Repository: txpipe/dolos

Length of output: 584


🏁 Script executed:

# Check available Domain implementations and their backends
echo "=== Domain trait implementations ==="
rg -n 'impl.*Domain' crates/core/src/ | head -10

echo ""
echo "=== Bootstrap context - check what domain type is used ==="
rg -nB 5 'domain\.wal\(\)\.reset_to' crates/core/src/bootstrap.rs

echo ""
echo "=== Check if fjall implements WalStore ==="
rg -rn 'impl.*WalStore.*fjall\|struct.*Fjall.*Wal' crates/

echo ""
echo "=== List all backend crates ==="
ls -la crates/ | grep -E '^d' | awk '{print $NF}'

Repository: txpipe/dolos

Length of output: 1444


🏁 Script executed:

# Check if fjall is expected to implement WalStore at all
echo "=== Search all crates for WalStore implementations ==="
rg -r 'impl.*WalStore' crates/

echo ""
echo "=== Check fjall lib.rs for what it exports ==="
head -50 crates/fjall/src/lib.rs

echo ""
echo "=== Check cargo features and optional dependencies ==="
rg -A 20 '\[features\]' crates/fjall/Cargo.toml crates/redb3/Cargo.toml 2>/dev/null | head -50

Repository: txpipe/dolos

Length of output: 11494


🏁 Script executed:

# Check Domain trait's wal() method signature and WalStore usage
echo "=== Domain trait definition for wal() ==="
rg -nB 2 -A 2 'fn wal\(' crates/core/src/lib.rs

echo ""
echo "=== Check if there are multiple WAL implementations or only redb3 ==="
rg -r 'impl.*WalStore.*for' crates/

echo ""
echo "=== Verify remove_entries and append_entries use separate transactions ==="
rg -nB 1 -A 8 'fn remove_entries\|fn append_entries' crates/redb3/src/wal/mod.rs | grep -A 8 'fn remove_entries'

Repository: txpipe/dolos

Length of output: 10670


🏁 Script executed:

# Get clean search for WalStore implementations
echo "=== All WalStore trait implementations ==="
rg -n '^impl.*WalStore' crates/ --type rust

echo ""
echo "=== Recheck reset_to implementation in full ==="
sed -n '777,784p' crates/redb3/src/wal/mod.rs

echo ""
echo "=== Check if this pattern (two transactions) is intentional/documented ==="
rg -B 5 'fn reset_to' crates/redb3/src/wal/mod.rs | head -20

Repository: txpipe/dolos

Length of output: 1155


🏁 Script executed:

# Understand transaction semantics - are they truly independent or is there recovery?
echo "=== Check remove_entries implementation ==="
sed -n '749,775p' crates/redb3/src/wal/mod.rs

echo ""
echo "=== Check append_entries implementation ==="
sed -n '730,748p' crates/redb3/src/wal/mod.rs

echo ""
echo "=== Look for error handling patterns in WAL module ==="
rg -B 2 -A 2 'begin_write' crates/redb3/src/wal/mod.rs | head -30

Repository: txpipe/dolos

Length of output: 2445


🏁 Script executed:

# Check if there's any idempotency or recovery documentation
echo "=== Check comments around reset_to ==="
sed -n '770,790p' crates/redb3/src/wal/mod.rs

echo ""
echo "=== Check if redb has transaction rollback/recovery guarantees ==="
rg -B 3 'set_quick_repair' crates/redb3/src/wal/mod.rs | head -20

echo ""
echo "=== Verify the scenario - can reset_to fail between operations? ==="
rg -n 'append_entries.*Result' crates/redb3/src/wal/mod.rs

Repository: txpipe/dolos

Length of output: 1640


reset_to() performs two separate transactions; partial failure leaves WAL empty.

reset_to() calls remove_entries() then append_entries() as separate write transactions. If remove_entries() succeeds but append_entries() fails (e.g., disk full), the WAL is left empty while the state cursor persists. This is recoverable on the next startup (the same code path retries), but the non-atomic pattern should be documented or refactored for clarity:

  1. Make reset_to() atomic within a single transaction, or
  2. Add a comment in bootstrap.rs explaining the retry-on-failure semantics
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/core/src/bootstrap.rs` around lines 79 - 86, The call to
domain.wal().reset_to(state) can leave the WAL empty if reset_to() executes
remove_entries() and append_entries() in separate transactions and the second
transaction fails; either make reset_to() atomic (combine remove_entries +
append_entries into a single write transaction inside the WAL implementation)
or, if atomicity is not feasible, add a clear comment here in bootstrap.rs
alongside the warn!/reset_to call explaining that reset_to() performs two
separate transactions, may transiently leave the WAL empty on partial failure,
and that the bootstrap path will retry on next startup (reference reset_to,
remove_entries, append_entries, and the match arm handling None/Some(state)).

@scarmuega scarmuega merged commit 7548171 into main Apr 29, 2026
12 checks passed
@scarmuega scarmuega deleted the feat/wal-versioning branch April 29, 2026 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant