Skip to content

fix(core): B3 reconcile skips missing trie nodes instead of crashing …#696

Merged
github-actions[bot] merged 1 commit into
mainfrom
fix/b3-reconcile-fail-soft-on-missing-trie
May 21, 2026
Merged

fix(core): B3 reconcile skips missing trie nodes instead of crashing …#696
github-actions[bot] merged 1 commit into
mainfrom
fix/b3-reconcile-fail-soft-on-missing-trie

Conversation

@satyakwok
Copy link
Copy Markdown
Collaborator

@satyakwok satyakwok commented May 20, 2026

…boot

The trie-reconcile pass at boot used to propagate any trie.get error all the way out of load_blockchain, which turned a single dangling trie node into an unbootable validator — even when the other 99.99% of the trie was healthy and the chain was actively producing blocks before the restart.

Hit live on testnet 2026-05-20 during the v2.2.12 RPC swap. One address (0x044daee895f2bb2ffecd4dd2e771f5ce405ab0ea at h=5003961) had a dangling reference to node 314e57bd...; the validator had been up for 5 hours serving that exact state without complaint, but the next boot crashed on phase-1 trie scan. All four testnet validators were stuck because they share the same trie shape.

The reconcile already has well-defined behavior for "no trie leaf for this address" — phase-2 keeps the blob view as canonical. Treat a failed trie.get the same way: log the gap, count it, surface a summary warning if any were skipped, then continue. Touched accounts will rewrite the trie node on next apply, so the gap closes on its own without operator intervention.

This is strictly a fail-soft on a previously fatal path. Healthy chains never hit it; the only behavioral change is that broken state loads instead of refusing to load.

Summary

Risk tier

Check ONE:

  • 🟢 Low — docs, tools/, tests/, CI configs, dependency patch bumps in dev-only crates, comments
  • 🟡 Medium — non-consensus production code (RPC handlers, network plumbing, observability, ops scripts)
  • 🟠 High — consensus-critical crates (sentrix-core, sentrix-trie, sentrix-staking, sentrix-bft), block_executor, apply_block_*, state_root path
  • 🔴 Critical — Voyager activation, fork-height changes, hard-fork rollouts, anything that flips env vars on mainnet

Required by tier

🟢 Low — minimum bar

  • CI green (tests + clippy + audit + gitleaks)

🟡 Medium — adds

  • New public function or behavioural change has at least one corresponding #[test] in same PR
  • Brief description of how this was tested (manual run, integration test, etc.)

🟠 High — adds

  • Regression test that fails on main and passes with this change — paste test name in PR body
  • Designed against documented invariant (link the audit/runbook/design doc)
  • Fresh-brain review by someone other than the author (per the consensus-change review checklist)
  • Single conceptual unit per PR (no bundling — bundling consensus changes burned us on v2.1.12 → 2026-04-25 livelock)

🔴 Critical — adds

  • Testnet rehearsal completed with success criteria + log evidence linked here
  • Bake window observed: minimum 2h on testnet at the same configuration before mainnet
  • Coordinated rollback plan documented in PR body — exact commands operator runs if it fails
  • Operator sign-off at activation moment (not just PR approval — separate moment for the actual flip)

Test plan

Rollback plan

Related

Summary by CodeRabbit

  • Bug Fixes
    • Startup is more resilient: individual trie/account lookup failures no longer abort initialization. The system logs warnings for each skipped address and leaves cached account values unchanged until they can be updated.
    • Emits an aggregate warning after scanning when any addresses were skipped due to missing or corrupted trie nodes.

Review Change Stack

@github-actions github-actions Bot enabled auto-merge (squash) May 20, 2026 04:56
@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

📝 Walkthrough

Walkthrough

This PR modifies the reconcile_accounts_from_trie function in crates/sentrix-core/src/storage.rs to handle trie lookup failures resiliently. Instead of aborting blockchain load when a trie lookup fails, the code now logs a warning per failing address, tracks the count of missing trie nodes, and continues iteration by inserting None for that account so phase 2 reconciliation preserves the blob's cached values until the account is next accessed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning Required sections are incomplete. Summary section is blank, Test plan section has only a dash with no actual test details, and Rollback plan (recommended for High-tier PRs) is missing. Fill in the Summary section with 1-3 sentences, provide specific test plan details (especially the regression test name that fails on main), and add a rollback plan for the High-tier consensus change.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: reconcile now skips missing trie nodes instead of crashing boot. It accurately reflects the primary fix.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/b3-reconcile-fail-soft-on-missing-trie

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/sentrix-core/src/storage.rs (1)

367-369: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update docstring to reflect fail-soft behavior.

Lines 367-369 still describe the old hard-fail semantics ("Trie-lookup errors (Err(_)) propagate: a corrupted trie is a hard-fail, not silent fallback"), but the implementation now catches errors and continues. This contradiction could mislead future maintainers.

📝 Proposed docstring update
-/// Trie-lookup errors (`Err(_)`) propagate: a corrupted trie is a
-/// hard-fail, not silent fallback.
+/// Trie-lookup errors (`Err(_)`) are logged and skipped: a missing or
+/// corrupted trie node for a single address no longer aborts boot.
+/// The address is treated as having no trie leaf (blob value preserved),
+/// and the next block touching that account will rewrite the trie entry.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/sentrix-core/src/storage.rs` around lines 367 - 369, Replace the
outdated sentence "Trie-lookup errors (`Err(_)`) propagate: a corrupted trie is
a hard-fail, not silent fallback." in the docstring with a short note that the
implementation now treats trie-lookup errors as fail-soft: errors are caught and
the code continues using fallback behavior (i.e., corrupted trie entries are
skipped/ignored rather than propagating). Update the wording in
crates/sentrix-core/src/storage.rs so the docstring for the trie lookup behavior
reflects the current catch-and-continue semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/sentrix-core/src/storage.rs`:
- Around line 367-369: Replace the outdated sentence "Trie-lookup errors
(`Err(_)`) propagate: a corrupted trie is a hard-fail, not silent fallback." in
the docstring with a short note that the implementation now treats trie-lookup
errors as fail-soft: errors are caught and the code continues using fallback
behavior (i.e., corrupted trie entries are skipped/ignored rather than
propagating). Update the wording in crates/sentrix-core/src/storage.rs so the
docstring for the trie lookup behavior reflects the current catch-and-continue
semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: ea6d6dc0-7e42-4c63-bb5e-f5c22711c346

📥 Commits

Reviewing files that changed from the base of the PR and between 3debf85 and a517677.

📒 Files selected for processing (1)
  • crates/sentrix-core/src/storage.rs

…boot

The trie-reconcile pass at boot used to propagate any `trie.get` error
all the way out of `load_blockchain`, which turned a single dangling
trie node into an unbootable validator — even when the other 99.99% of
the trie was healthy and the chain was actively producing blocks before
the restart.

Hit live on testnet 2026-05-20 during a binary swap. One address
(0x044daee895f2bb2ffecd4dd2e771f5ce405ab0ea at h=5003961) had a
dangling reference to node 314e57bd...; the validator had been up for
5 hours serving that exact state without complaint, but the next boot
crashed on phase-1 trie scan. All four testnet validators were stuck
because they shared the same trie shape.

The reconcile already has well-defined behavior for "no trie leaf for
this address" — phase-2 keeps the blob view as canonical. Treat a
failed `trie.get` the same way: log the gap, count it, surface a
summary warning if any were skipped, then continue. Touched accounts
will rewrite the trie node on next apply, so the gap closes on its
own without operator intervention.

This is strictly a fail-soft on a previously fatal path. Healthy
chains never hit it; the only behavioral change is that broken state
loads instead of refusing to load. Also updates the
`reconcile_accounts_from_trie` docstring so it matches the new
catch-and-continue semantics (was still claiming hard-fail).
@satyakwok satyakwok force-pushed the fix/b3-reconcile-fail-soft-on-missing-trie branch from a517677 to a74d53f Compare May 20, 2026 23:55
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/sentrix-core/src/storage.rs`:
- Around line 425-449: The current catch-all for trie.get(&key) treats all
errors as recoverable gaps; change the Err branch to match on the SentrixError
enum: if Err(SentrixError::Internal(msg)) and msg.contains("missing node") then
log the missing-node warning, increment trie_gaps and continue (as you do now),
but if Err(SentrixError::StorageError(_)) or
Err(SentrixError::SerializationError(_)) then propagate the error (return Err(e)
or use ?), since those indicate broken storage/serialization; keep the same
handling for successful Ok(leaf) and subsequent account_value_decode usage and
push into trie_values.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 57aeb45f-d992-4541-8699-d8a225d3bd29

📥 Commits

Reviewing files that changed from the base of the PR and between a517677 and a74d53f.

📒 Files selected for processing (1)
  • crates/sentrix-core/src/storage.rs

Comment on lines +425 to +449
let mut trie_gaps: usize = 0;
for addr in &addrs {
let key = address_to_key(addr);
let leaf = trie.get(&key).map_err(|e| {
SentrixError::Internal(format!(
"B3 reconcile: trie lookup for {addr} failed at h={}: {e}",
bc.chain.last().map(|b| b.index).unwrap_or(0)
))
})?;
let leaf = match trie.get(&key) {
Ok(leaf) => leaf,
Err(e) => {
tracing::warn!(
"B3 reconcile: trie lookup for {addr} failed at h={}: {e} — \
skipping reconcile for this address (will rewrite on next touch)",
bc.chain.last().map(|b| b.index).unwrap_or(0)
);
trie_gaps += 1;
None
}
};
let decoded = leaf.and_then(|bytes| account_value_decode(&bytes));
trie_values.push((addr.clone(), decoded));
}
if trie_gaps > 0 {
tracing::warn!(
"B3 reconcile: skipped {trie_gaps}/{} addresses due to missing trie nodes \
at h={}; chain will continue producing — touched accounts repair themselves",
addrs.len(),
bc.chain.last().map(|b| b.index).unwrap_or(0)
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

TRIE_FILES=$(fd -i 'trie|error' crates --extension rs)

echo "=== trie.get definitions ==="
rg -n -C4 '\bfn\s+get\s*\(' $TRIE_FILES

echo
echo "=== trie-related error enums / aliases / variants ==="
rg -n -C4 '\b(enum|type)\s+\w*Error\b|Missing|Dangling|Corrupt|Decode|Storage|Io' $TRIE_FILES

echo
echo "=== existing trie.get call sites ==="
rg -n -C4 'trie\.get\s*\(' crates/sentrix-core/src/storage.rs $TRIE_FILES

Repository: sentrix-labs/sentrix

Length of output: 93


🏁 Script executed:

# Check repository structure and find trie-related files
find . -type f -name "*.rs" | grep -i trie | head -20

Repository: sentrix-labs/sentrix

Length of output: 454


🏁 Script executed:

# Also check Cargo.toml files to understand dependencies
find . -name "Cargo.toml" | xargs grep -l "trie\|sentrix" | head -10

Repository: sentrix-labs/sentrix

Length of output: 374


🏁 Script executed:

# Check the storage.rs file context around the error handling
sed -n '420,455p' crates/sentrix-core/src/storage.rs

Repository: sentrix-labs/sentrix

Length of output: 1744


🏁 Script executed:

# Look for trie-related modules in the codebase
rg "mod trie|use.*trie" crates --type rs | head -20

Repository: sentrix-labs/sentrix

Length of output: 91


🏁 Script executed:

# Correct the previous command
rg "mod trie|use.*trie" crates/sentrix-core/src/storage.rs -A2 -B2

Repository: sentrix-labs/sentrix

Length of output: 280


🏁 Script executed:

# Check the trie module's error type definition
cat crates/sentrix-trie/src/lib.rs

Repository: sentrix-labs/sentrix

Length of output: 777


🏁 Script executed:

# Look for get function and error types in tree.rs
rg -n "fn get|enum.*Error|type.*Error" crates/sentrix-trie/src/tree.rs -A3

Repository: sentrix-labs/sentrix

Length of output: 228


🏁 Script executed:

# Check tree.rs for the get implementation
head -200 crates/sentrix-trie/src/tree.rs | tail -150

Repository: sentrix-labs/sentrix

Length of output: 7105


🏁 Script executed:

# Get the full get() implementation
sed -n '200,280p' crates/sentrix-trie/src/tree.rs

Repository: sentrix-labs/sentrix

Length of output: 3312


🏁 Script executed:

# Find SentrixResult and SentrixError definitions
rg "type SentrixResult|enum SentrixError" crates/sentrix-trie/src -A20

Repository: sentrix-labs/sentrix

Length of output: 46


🏁 Script executed:

# Look for where SentrixError is defined in the entire repo
rg "enum SentrixError" crates -A30

Repository: sentrix-labs/sentrix

Length of output: 2299


🏁 Script executed:

# Check sentrix-primitives or common error handling
fd -i error crates -name "*.rs" -type f | head -10

Repository: sentrix-labs/sentrix

Length of output: 295


🏁 Script executed:

# Get the full SentrixError enum
cat crates/sentrix-primitives/src/error.rs | head -100

Repository: sentrix-labs/sentrix

Length of output: 2063


🏁 Script executed:

# Also check if there are more variants
rg "SentrixError::" crates/sentrix-primitives/src/error.rs

Repository: sentrix-labs/sentrix

Length of output: 170


🏁 Script executed:

# Search for all uses of SentrixError in the trie module to understand all variants
rg "SentrixError::" crates/sentrix-trie/src -B1 -A1

Repository: sentrix-labs/sentrix

Length of output: 12687


🏁 Script executed:

# Verify what cache.get_node returns and how it handles errors
sed -n '1,150p' crates/sentrix-trie/src/cache.rs | grep -A30 "fn get_node"

Repository: sentrix-labs/sentrix

Length of output: 1434


🏁 Script executed:

# Check if there are patterns elsewhere in storage.rs for selective error handling
rg "match.*Err|SentrixError::" crates/sentrix-core/src/storage.rs -B2 -A3 | head -80

Repository: sentrix-labs/sentrix

Length of output: 3173


🏁 Script executed:

# Check if there's a specific error message pattern for "missing node" vs other Internal errors
rg "trie: missing node" crates/sentrix-trie/src -B2 -A2

Repository: sentrix-labs/sentrix

Length of output: 2842


🏁 Script executed:

# Look for any pattern matching on error messages elsewhere in the codebase
rg "SentrixError::Internal.*missing|if.*contains.*missing" crates/sentrix-core/src -B2 -A2

Repository: sentrix-labs/sentrix

Length of output: 46


🏁 Script executed:

# Check the full context of the reconcile function to understand the intent
sed -n '360,380p' crates/sentrix-core/src/storage.rs

Repository: sentrix-labs/sentrix

Length of output: 1169


🏁 Script executed:

# Check if cache.get_node can return StorageError by looking at the full chain
sed -n '1,50p' crates/sentrix-trie/src/cache.rs

Repository: sentrix-labs/sentrix

Length of output: 2244


🏁 Script executed:

# Verify load_node implementation which is called by cache.get_node
rg "fn load_node" crates/sentrix-trie/src/storage.rs -A20

Repository: sentrix-labs/sentrix

Length of output: 895


🏁 Script executed:

# Check what other errors could be returned from trie operations
rg "map_err.*SentrixError" crates/sentrix-trie/src/storage.rs | head -20

Repository: sentrix-labs/sentrix

Length of output: 1510


Distinguish missing-node errors from storage/serialization failures in error handling.

Lines 428-439 catch all trie.get() errors identically. However, trie.get() can fail with three distinct error types: SentrixError::Internal("trie: missing node ...") (node not found), SentrixError::StorageError(...) (I/O failure), and SentrixError::SerializationError(...) (deserialization failure). The current code treats them all as recoverable gaps, but storage and serialization errors indicate a fundamentally broken trie and should not be suppressed.

Match specifically on SentrixError::Internal with a message pattern check for "missing node", and let StorageError and SerializationError propagate. This preserves the fail-soft behavior for dangling trie nodes while preventing boot with a corrupted or unreadable trie.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/sentrix-core/src/storage.rs` around lines 425 - 449, The current
catch-all for trie.get(&key) treats all errors as recoverable gaps; change the
Err branch to match on the SentrixError enum: if
Err(SentrixError::Internal(msg)) and msg.contains("missing node") then log the
missing-node warning, increment trie_gaps and continue (as you do now), but if
Err(SentrixError::StorageError(_)) or Err(SentrixError::SerializationError(_))
then propagate the error (return Err(e) or use ?), since those indicate broken
storage/serialization; keep the same handling for successful Ok(leaf) and
subsequent account_value_decode usage and push into trie_values.

@github-actions github-actions Bot merged commit a06fa94 into main May 21, 2026
14 checks passed
satyakwok added a commit that referenced this pull request May 21, 2026
2.2.14 went live on mainnet 2026-05-21 17:02 UTC carrying PRs
#692/#696/#702/#703/#704. Document the five fixes + flag that
the eth_call + logsBloom fixes on this branch are scheduled
for 2.2.15.
github-actions Bot pushed a commit that referenced this pull request May 21, 2026
* fix(rpc): eth_getTransactionCount + eth_call + logsBloom length

Three discrete chain-RPC fixes bundled because they share one cause
(strict-but-non-spec behavior breaking off-the-shelf EVM agents). All
discovered together standing up the Hyperlane relayer for the Base
Sepolia ↔ Sentrix Testnet bridge (audit fix H-4).

1) eth_getTransactionCount accepts any block tag.
   Relayer + ethers + viem all pin nonce queries to a recent past
   block. A stale nonce is self-correcting (chain rejects wrong-nonce
   tx, caller retries) so this method serves current nonce regardless
   of block tag.

2) eth_call accepts any block tag.
   Same pattern: Hyperlane queries Mailbox.delivered(msgId) and
   recipientIsm(addr) at past blocks. Returns current-state. The strict
   gate stays on eth_getBalance / eth_getCode / eth_getStorageAt where
   wrong = wrong protocol decision.

3) Empty logsBloom is now actually 256 bytes (Ethereum spec).
   The EMPTY_LOGS_BLOOM const was 304 bytes (608 hex chars) because of
   an off-by-one in the original hand-typed string. The doc comment
   said 256 but the literal was 304. ethers' fee-oracle middleware
   strict-parses Block.logsBloom — Hyperlane gas estimation
   SerdeJson'd on this with "invalid length 608, expected 256 bytes"
   before any process() tx could be submitted.

Bumps workspace 2.2.13 -> 2.2.14.

* docs(changelog): add 2.2.14 entry — five RPC compat fixes

2.2.14 went live on mainnet 2026-05-21 17:02 UTC carrying PRs
#692/#696/#702/#703/#704. Document the five fixes + flag that
the eth_call + logsBloom fixes on this branch are scheduled
for 2.2.15.
github-actions Bot pushed a commit that referenced this pull request May 22, 2026
…bytes (#707)

* fix(rpc): eth_getTransactionCount + eth_call + logsBloom length

Three discrete chain-RPC fixes bundled because they share one cause
(strict-but-non-spec behavior breaking off-the-shelf EVM agents). All
discovered together standing up the Hyperlane relayer for the Base
Sepolia ↔ Sentrix Testnet bridge (audit fix H-4).

1) eth_getTransactionCount accepts any block tag.
   Relayer + ethers + viem all pin nonce queries to a recent past
   block. A stale nonce is self-correcting (chain rejects wrong-nonce
   tx, caller retries) so this method serves current nonce regardless
   of block tag.

2) eth_call accepts any block tag.
   Same pattern: Hyperlane queries Mailbox.delivered(msgId) and
   recipientIsm(addr) at past blocks. Returns current-state. The strict
   gate stays on eth_getBalance / eth_getCode / eth_getStorageAt where
   wrong = wrong protocol decision.

3) Empty logsBloom is now actually 256 bytes (Ethereum spec).
   The EMPTY_LOGS_BLOOM const was 304 bytes (608 hex chars) because of
   an off-by-one in the original hand-typed string. The doc comment
   said 256 but the literal was 304. ethers' fee-oracle middleware
   strict-parses Block.logsBloom — Hyperlane gas estimation
   SerdeJson'd on this with "invalid length 608, expected 256 bytes"
   before any process() tx could be submitted.

Bumps workspace 2.2.13 -> 2.2.14.

* docs(changelog): add 2.2.14 entry — five RPC compat fixes

2.2.14 went live on mainnet 2026-05-21 17:02 UTC carrying PRs
#692/#696/#702/#703/#704. Document the five fixes + flag that
the eth_call + logsBloom fixes on this branch are scheduled
for 2.2.15.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant