
fix: avoid excessive mem increase during boundary#905

Merged
scarmuega merged 7 commits into main from dolos-boundary-mem
Feb 17, 2026

Conversation

@scarmuega (Member) commented Feb 17, 2026

Summary by CodeRabbit

  • New Features

    • Lazy streaming entity iteration to drastically lower memory usage during scans.
    • Deterministic work-buffer/state machine for improved block-boundary handling and work emission.
  • Refactor

    • Simplified block-processing pipeline and unified work unit handling; delta computation relocated for clearer flow.
    • Minor API adjustments to support config/cache-aware work units.
  • Tests

    • Added memory-focused tests validating lazy iteration across backends.
  • Chores

    • Added development test tooling for memory instrumentation.


coderabbitai bot commented Feb 17, 2026

📝 Walkthrough


Introduces a WorkBuffer state machine for deterministic block/boundary handling, refactors RollWorkUnit construction to take CardanoConfig and Cache and moves delta computation into load, converts fjall EntityIterator to lazy streaming, and adds memory-usage tests plus dev-dependencies for stats_alloc and dolos-testing.

Changes

  • Work Buffer & Internal Work (crates/cardano/src/work.rs): Adds WorkBuffer and InternalWorkUnit enums plus a full buffering API and extensive tests for boundary detection, cursor advancement, and restart semantics.
  • Roll work construction & loading (crates/cardano/src/lib.rs, crates/cardano/src/roll/work_unit.rs, crates/cardano/src/roll/batch.rs): Replaces the previous WorkBuffer/InternalWorkUnit usage in lib; RollWorkUnit::new now requires config: CardanoConfig and cache: Cache; delta computation moved into load; WorkBatch::decode_utxos no longer needs CardanoLogic and decodes outputs directly.
  • Eras type change (crates/cardano/src/eras.rs): ChainSummary now derives Clone.
  • Fjall entity iteration (crates/fjall/src/state/entities.rs): Replaces the eager Vec-based EntityIterator with a streaming fjall::Iter-backed iterator and adds EmptyEntityValueIterator, changing the memory profile from O(N) to O(1).
  • Tests & dev-deps (tests/memory.rs, Cargo.toml, crates/cardano/Cargo.toml): Adds tests/memory.rs memory-usage tests (stats_alloc instrumented), adds stats_alloc and dolos-testing as dev-dependencies, and a new [[test]] entry for the memory test.
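The WorkBuffer described above is a state machine whose transitions consume the buffer by value and return the successor state, so a stale state can never be reused. A minimal, hypothetical sketch of that pattern follows; the names (`Buffer`, `receive`, `pop`) and the fixed epoch length are illustrative, not the PR's actual API, which has more states and era-aware boundary detection.

```rust
/// Hypothetical by-value state machine in the style described above.
#[derive(Debug, PartialEq)]
enum Buffer {
    Empty,
    Open(Vec<u64>), // open batch of block slots
    Boundary(u64),  // epoch boundary detected at this slot
}

impl Buffer {
    /// Consuming transition: feed a slot, get the successor state back.
    fn receive(self, slot: u64, epoch_len: u64) -> Self {
        match self {
            Buffer::Empty => Buffer::Open(vec![slot]),
            Buffer::Open(mut slots) => {
                let prev = *slots.last().unwrap();
                // crossing into a new epoch closes the batch
                if slot / epoch_len > prev / epoch_len {
                    Buffer::Boundary(slot)
                } else {
                    slots.push(slot);
                    Buffer::Open(slots)
                }
            }
            other => other, // Boundary must be popped before receiving more
        }
    }

    /// Emits pending work (if any) together with the successor state.
    fn pop(self) -> (Option<Vec<u64>>, Self) {
        match self {
            Buffer::Open(slots) => (Some(slots), Buffer::Empty),
            other => (None, other),
        }
    }
}
```

Because each transition takes `self`, the type system forces callers to thread the returned state forward, which is what makes the boundary handling deterministic.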

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant WB as WorkBuffer
    participant Caller as CardanoLogic/Coordinator
    participant RWU as RollWorkUnit
    participant Delta as roll::compute_delta

    Client->>WB: receive_block(block, eras, stability_window)
    activate WB
    WB->>WB: detect boundaries / extend batch
    WB-->>Client: updated buffer
    deactivate WB

    Caller->>WB: pop_work(stop_epoch)
    activate WB
    alt work available
        WB-->>Caller: Some(InternalWorkUnit::Blocks or boundary)
    else
        WB-->>Caller: None
    end
    deactivate WB

    Caller->>RWU: RollWorkUnit::new(batch, genesis, live, config, cache)
    activate RWU
    RWU-->>Caller: constructed RWU (holds config & cache)
    deactivate RWU

    Caller->>RWU: load(domain)
    activate RWU
    RWU->>RWU: decode UTxOs from batch
    RWU->>Delta: compute_delta::<D>(..., config, cache)
    activate Delta
    Delta-->>RWU: deltas/result
    deactivate Delta
    RWU-->>Caller: ready work unit
    deactivate RWU

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Poem

🐰 I hop through buffers, boundaries, and beams,
I stitch config and cache into rolling dreams.
Lazy streams nibble memory light and small,
Tests count the carrots — fifty thousand and all! 🥕✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 75.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
  • Title Check (❓ Inconclusive): the title 'fix: avoid excessive mem increase during boundary' is somewhat vague. It references memory concerns and boundaries, but does not convey the primary changes: lazy streaming entity iteration and the work-buffer state machine refactor. The term 'boundary' also lacks context for someone unfamiliar with the codebase. Consider a more descriptive title such as 'fix: implement lazy entity iteration and refactor work buffer state machine' or 'fix: reduce memory usage through lazy streaming and work unit refactoring'.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/cardano/src/lib.rs (1)

8-12: ⚠️ Potential issue | 🟡 Minor

Fix unused import flagged by CI: Block as _.

The pipeline reports clippy: unused import 'Block' at line 9. Remove the unused trait import.

🔧 Proposed fix
 use dolos_core::{
-    config::CardanoConfig, Block as _, BlockSlot, ChainError, ChainPoint, Domain, DomainError,
+    config::CardanoConfig, BlockSlot, ChainError, ChainPoint, Domain, DomainError,
     EntityKey, EraCbor, Genesis, MempoolAwareUtxoStore, MempoolTx, MempoolUpdate, RawBlock,
     StateStore, TipEvent, WorkUnit,
 };

As per coding guidelines: "Run cargo clippy --workspace --all-targets --all-features and resolve all clippy warnings before committing changes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/lib.rs` around lines 8 - 12, Remove the unused trait
import `Block as _` from the use list in lib.rs to satisfy Clippy; edit the
import statement that currently brings in `Block as _` (alongside symbols like
CardanoConfig, BlockSlot, ChainError, etc.) and delete the `Block as _` entry so
only actually used symbols remain imported.
🧹 Nitpick comments (6)
tests/memory.rs (1)

36-55: Subtle test ordering: the memory assertion only covers iterator creation, not consumption.

The Region tracks allocations between lines 36 and 42, which only covers the iter_entities() call — not the subsequent iter.count() on line 54 that actually drives the iterator. This correctly validates that the iterator is lazily initialized (the core fix), but it does not assert that streaming through all 50k entities stays within bounds.

If you also want to assert bounded memory during full iteration, move reg.change() after iter.count():

♻️ Proposed: assert memory across full iteration
     let reg = Region::new(&GLOBAL);
 
     let iter = store
         .iter_entities(NS, EntityKey::full_range())
         .expect("iter_entities failed");
 
-    let stats = reg.change();
-    let heap_delta = stats.bytes_allocated as usize;
-
-    let threshold = 10 * 1024 * 1024; // 10 MB
-    assert!(
-        heap_delta < threshold,
-        "iter_entities should allocate O(1) memory (lazy). \
-         Allocated {} bytes but threshold is {} bytes.",
-        heap_delta,
-        threshold,
-    );
-
     let count = iter.count();
     assert_eq!(count, ENTITY_COUNT as usize, "iterator should yield all entities");
+
+    let stats = reg.change();
+    let heap_delta = stats.bytes_allocated as usize;
+
+    let threshold = 10 * 1024 * 1024; // 10 MB
+    assert!(
+        heap_delta < threshold,
+        "iter_entities should allocate O(1) memory (lazy). \
+         Allocated {} bytes but threshold is {} bytes.",
+        heap_delta,
+        threshold,
+    );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/memory.rs` around lines 36 - 55, The memory assertion currently
measures allocations only during iterator creation (Region::new / reg.change
around iter_entities), not during consumption; to ensure bounded memory for full
streaming, move the reg.change()/let stats.../let heap_delta... block to after
iter.count() (or add a second reg.change() and assertion after iter.count()) so
that the heap_delta reflects allocations during iter.count() (the Entity
iterator produced by store.iter_entities) and assert it against threshold.
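The test measures allocations with the stats_alloc crate's Region API. The underlying technique, a wrapping global allocator that counts bytes handed out, can be sketched with the standard library alone. This is an illustrative stand-in for stats_alloc, not that crate's implementation; `Counting` and `bytes_allocated_during` are invented names.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wrapping allocator that counts bytes requested from the system
/// allocator. Illustrative stand-in for `stats_alloc`, not its API.
struct Counting;

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: Counting = Counting;

/// Heap bytes allocated while running `f`, analogous to creating a
/// `Region` before the measured code and calling `change()` after it.
fn bytes_allocated_during<F: FnOnce()>(f: F) -> usize {
    let before = ALLOCATED.load(Ordering::Relaxed);
    f();
    ALLOCATED.load(Ordering::Relaxed) - before
}
```

Where the measurement window opens and closes relative to the measured code is exactly what the review comment above is about: the window must span the iterator's consumption, not only its construction, to bound streaming memory.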
crates/fjall/src/state/entities.rs (1)

63-113: Good refactor to lazy streaming — consider logging malformed keys.

Both the fjall and redb3 backends now implement lazy streaming iteration: fjall wraps fjall::Iter and redb3 wraps redb::Range, each decoding items one at a time in next(). This maintains consistency between backends and keeps memory usage O(1) as verified by the memory tests.

The guard.into_inner() API usage is correct for fjall 3.0.0.

One concern: silently skipping malformed keys (lines 102–107) could mask data corruption. A tracing::warn! on the skip path would aid diagnosis without affecting performance in the normal case.

🔧 Suggested: log malformed keys
                 if key_bytes.len() >= PREFIXED_KEY_SIZE {
                     let entity_key = decode_entity_key(&key_bytes);
                     let entity_value = value_bytes.to_vec();
                     return Some(Ok((entity_key, entity_value)));
                 }
-                // Skip malformed keys (too short), continue to next
+                // Skip malformed keys (too short), continue to next
+                tracing::warn!(key_len = key_bytes.len(), "skipping malformed entity key (too short)");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/fjall/src/state/entities.rs` around lines 63 - 113, In the Iterator
impl for EntityIterator (fn next()), add a tracing::warn! when you hit the
malformed-key branch (where key_bytes.len() < PREFIXED_KEY_SIZE) so skipped keys
are visible; include the offending key length and a short hex or base64 snippet
of key_bytes to aid debugging, e.g. use hex::encode(&key_bytes) or
base64::engine::general_purpose::STANDARD.encode(&key_bytes) in the message, and
leave the loop behavior unchanged (continue to next item) so normal streaming
semantics in EntityIterator::next() are preserved.
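The streaming shape described above can be illustrated with a std-only sketch (not the fjall-backed implementation): the iterator owns the inner cursor and decodes one record per `next()` call, skipping malformed keys, so memory stays O(1) regardless of entity count. `LazyEntities` and `min_key_len` are illustrative stand-ins for the real iterator and `PREFIXED_KEY_SIZE`.

```rust
/// Streaming iterator that yields one decoded record per `next()` call
/// instead of materializing a Vec of all entities up front.
struct LazyEntities<I> {
    inner: I,
    min_key_len: usize,
}

impl<I: Iterator<Item = (Vec<u8>, Vec<u8>)>> Iterator for LazyEntities<I> {
    type Item = (Vec<u8>, Vec<u8>);

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            let (key, value) = self.inner.next()?;
            if key.len() >= self.min_key_len {
                // decode lazily, one entry at a time
                return Some((key, value));
            }
            // malformed key (too short): skip and keep streaming; the
            // review suggests logging here (e.g. tracing::warn!) first
        }
    }
}
```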
crates/cardano/src/work.rs (4)

253-295: Significant duplication between feed_and_drain and feed_drain_with_cursors.

Both helpers share identical drain/feed loop structures. Consider refactoring into a single core function that always tracks cursors, with feed_and_drain as a thin wrapper that discards them.

Also applies to: 302-349

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/work.rs` around lines 253 - 295, Extract the common
feed/drain loop into a single core function (e.g., feed_drain_core) that accepts
and returns cursor/state info along with the Vec<WorkTag>, reusing the same
logic currently in feed_and_drain and feed_drain_with_cursors (including the
pop_work/receive_block interactions and the InternalWorkUnit::ForcedStop
handling using tag_from_internal). Then implement feed_drain_with_cursors to
call feed_drain_core and return both tags and cursors, and make feed_and_drain a
thin wrapper that calls feed_drain_core and drops/ignores cursors, preserving
existing signatures/behavior for block creation (make_block),
eras/stability_window use, and stop_epoch handling.

189-191: ForcedStop is an absorbing state that emits infinitely — caller must break the loop.

pop_work on ForcedStop returns (Some(ForcedStop), ForcedStop) indefinitely. This is handled correctly in the test helpers (line 285-288), but any future caller of pop_work in a loop must explicitly check for ForcedStop to avoid an infinite loop.

A brief doc comment on pop_work noting this contract would be helpful.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/work.rs` around lines 189 - 191, Add a doc comment to the
pop_work function documenting that WorkBuffer::ForcedStop (mapped to
InternalWorkUnit::ForcedStop) is an absorbing state and pop_work will repeatedly
return (Some(InternalWorkUnit::ForcedStop), WorkBuffer::ForcedStop) until the
caller breaks out; instruct callers to check for InternalWorkUnit::ForcedStop /
WorkBuffer::ForcedStop and explicitly stop their loops to avoid infinite
emission (refer to the existing test helper behavior around ForcedStop for
example).
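The caller-side contract described here can be sketched as follows. `Unit` and `drain` are hypothetical stand-ins for InternalWorkUnit and a coordinator's drain loop; the explicit break is what prevents spinning on the absorbing state.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Unit {
    Blocks,
    ForcedStop,
}

/// Hypothetical drain loop honoring the documented contract: once the
/// source reaches its absorbing state it yields ForcedStop forever.
fn drain(mut queue: Vec<Unit>) -> Vec<Unit> {
    let mut out = Vec::new();
    loop {
        // an exhausted queue models the absorbing state: every further
        // call keeps yielding ForcedStop
        let unit = queue.pop().unwrap_or(Unit::ForcedStop);
        out.push(unit);
        if unit == Unit::ForcedStop {
            break; // without this check the loop would spin forever
        }
    }
    out
}
```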

67-83: Empty arm in extend_batch is dead code via receive_block.

receive_block always intercepts Empty at line 119 and routes to on_genesis_boundary, so extend_batch is never called in the Empty state through the public API. Consider removing it or adding a comment explaining if it's intentionally kept for future use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/work.rs` around lines 67 - 83, The match arm handling
WorkBuffer::Empty inside extend_batch is dead because receive_block already
routes Empty to on_genesis_boundary; remove the WorkBuffer::Empty arm from
extend_batch (leaving Restart and OpenBatch handling and the unreachable!() for
other cases) or, if you want to keep it for future use, replace the arm with a
clear comment above extend_batch referencing receive_block and
on_genesis_boundary to explain why Empty is present; update any affected pattern
coverage and add a short comment in extend_batch mentioning the decision.

13-34: Consider deriving Debug on InternalWorkUnit and WorkBuffer.

Neither enum derives Debug, which makes it harder to diagnose state machine issues at runtime (e.g., in logs or panic messages). The unreachable!() calls throughout would benefit from a debug representation of the current state.

Proposed change
+#[derive(Debug)]
 pub(crate) enum InternalWorkUnit {
     Genesis,
     Blocks(WorkBatch),
     EWrap(BlockSlot),
     EStart(BlockSlot),
     Rupd(BlockSlot),
     ForcedStop,
 }

+#[derive(Debug)]
 pub(crate) enum WorkBuffer {

This assumes WorkBatch and OwnedMultiEraBlock implement Debug. If not, a manual Debug impl that prints the variant name and key metadata (e.g., slot numbers) would still be valuable.

#!/bin/bash
# Check if WorkBatch and OwnedMultiEraBlock implement Debug
echo "=== WorkBatch ==="
rg -n "struct WorkBatch" --type rust -A 3
echo "=== OwnedMultiEraBlock ==="
rg -n "struct OwnedMultiEraBlock" --type rust -A 3
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/work.rs` around lines 13 - 34, The enums InternalWorkUnit
and WorkBuffer currently lack Debug implementations which makes logging and
panic traces unhelpful; add derives for Debug (e.g., #[derive(Debug, Clone,
PartialEq, Eq)] as appropriate) to both InternalWorkUnit and WorkBuffer,
ensuring WorkBatch and OwnedMultiEraBlock also implement Debug (or provide a
manual Debug impl that prints variant names and key metadata like slot numbers)
so the compiler can derive Debug successfully; update the enum declarations
(InternalWorkUnit, WorkBuffer) to include the derive attributes and, if
necessary, implement Debug for WorkBatch/OwnedMultiEraBlock or write custom
Debug impls that surface useful state for logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/cardano/src/work.rs`:
- Around line 634-653: The replay filter currently excludes the boundary slot
because replay_slots is built with .filter(|&s| s > cursor.slot()), but for a
ChainPoint::Slot cursor (established by update_cursor at EStart) semantics are
inclusive; change the filter to .filter(|&s| s >= cursor.slot()) when
constructing replay_slots (located adjacent to estart_idx, full[idx], and
WorkBuffer::Restart usage) and keep using feed_and_drain to produce
restart_tags; additionally tighten the assertion to ensure the boundary block
110 is actually the start of a Blocks unit by asserting restart_tags contains a
WorkTag::Blocks { first, .. } with first == 110.


Quoted context (crates/cardano/src/work.rs):

    WorkBuffer::EwrapBoundary(block, _) => block.point(),
    WorkBuffer::EstartBoundary(block, _) => block.point(),
    WorkBuffer::PreForcedStop(block) => block.point(),
    WorkBuffer::ForcedStop => unreachable!(),

⚠️ Potential issue | 🟡 Minor

unreachable!() on ForcedStop is a latent panic for crate-internal callers.

last_point_seen is a pub method. While receive_block guards against this via the can_receive_block assert, any other crate-internal caller hitting ForcedStop gets an unhelpful unreachable panic. Consider returning the last known point or making this method fallible.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/work.rs` at line 53, The match arm for
WorkBuffer::ForcedStop uses unreachable!() which can panic for crate-internal
callers of the pub method last_point_seen; change this to a safe, non-panicking
behavior: either return the last known point (e.g., clone and return the stored
tip) or make last_point_seen fallible (return Option<Point> or Result<Point,
WorkError>) and propagate that through callers such as receive_block (which
already uses can_receive_block) so you remove unreachable!(); update the
WorkBuffer::ForcedStop arm to return the chosen value/type and adjust all call
sites of last_point_seen (or its signature) accordingly.
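A minimal sketch of the Option-returning alternative the comment proposes, with hypothetical names: `WorkState` stands in for WorkBuffer, and a plain slot number stands in for ChainPoint.

```rust
/// Fallible accessor: return None instead of panicking on the
/// terminal variant.
#[derive(Debug, PartialEq)]
enum WorkState {
    Open(u64), // last slot seen
    ForcedStop,
}

impl WorkState {
    /// None on ForcedStop, where `unreachable!()` would previously panic.
    fn last_point_seen(&self) -> Option<u64> {
        match self {
            WorkState::Open(slot) => Some(*slot),
            WorkState::ForcedStop => None,
        }
    }
}
```

Callers then decide explicitly what a missing point means, rather than inheriting a panic from an invariant they may not know about.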

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/cardano/src/lib.rs (1)

277-283: ⚠️ Potential issue | 🟠 Major

Silently swallowing refresh_cache errors may cause stale era data for subsequent work units.

If refresh_cache fails (e.g., state store I/O error), the error is logged but needs_cache_refresh is still set to false, so the stale cache will be used for all subsequent Blocks / boundary detection without any retry. This could cause incorrect epoch boundary detection or wrong stability window.

Consider propagating the error (changing pop_work to return Result<Option<…>, …>) or at least keeping needs_cache_refresh = true so the next call retries:

Proposed minimal fix (keep flag set on failure)
         if self.needs_cache_refresh {
             if let Err(e) = self.refresh_cache::<D>(domain.state()) {
                 tracing::error!(error = %e, "failed to refresh cache after era-modifying work unit");
-            }
-            self.needs_cache_refresh = false;
+            } else {
+                self.needs_cache_refresh = false;
+            }
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/lib.rs` around lines 277 - 283, The current code clears
self.needs_cache_refresh even when self.refresh_cache::<D>(domain.state())
returns Err, which swallows errors and prevents retries; update the logic in
lib.rs so that self.needs_cache_refresh is only set to false on successful
refresh (i.e., move the assignment into the Ok branch) or, if you prefer a
stronger fix, propagate the error by changing pop_work to return
Result<Option<...>, E> and bubble up refresh_cache errors from
refresh_cache::<D>(domain.state()) instead of just logging them; refer to the
refresh_cache function and the needs_cache_refresh flag and adjust pop_work's
return type/signature if choosing propagation.
🧹 Nitpick comments (4)
crates/cardano/src/work.rs (3)

143-193: Consider exhaustive matching in pop_work instead of _ => unreachable!().

Lines 144–146 handle Empty and Restart with an early return, then line 191 uses _ => unreachable!() as a catch-all. If a new WorkBuffer variant is added later, this silently panics at runtime instead of producing a compile-time error.

Suggested: replace early return + catch-all with exhaustive arms
     pub fn pop_work(self, stop_epoch: Option<Epoch>) -> (Option<InternalWorkUnit>, Self) {
-        if matches!(self, WorkBuffer::Restart(..)) || matches!(self, WorkBuffer::Empty) {
-            return (None, self);
-        }
-
         match self {
+            WorkBuffer::Empty => (None, self),
+            WorkBuffer::Restart(..) => (None, self),
             WorkBuffer::Genesis(block) => (
                 // ... existing arms unchanged ...
             ),
             // ... other arms ...
-            _ => unreachable!(),
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/work.rs` around lines 143 - 193, The current pop_work
implementation returns early for WorkBuffer::Empty and WorkBuffer::Restart then
uses a catch-all `_ => unreachable!()`, which hides missing-variant bugs; change
pop_work to match exhaustively on WorkBuffer by adding explicit arms for Empty
and Restart (returning (None, self)) alongside the existing arms (Genesis,
OpenBatch, PreRupdBoundary, RupdBoundary, PreEwrapBoundary, EwrapBoundary,
EstartBoundary, PreForcedStop, ForcedStop) and remove the `_ => unreachable!()`
arm so the compiler will force you to handle any new WorkBuffer variants
(preserve behavior with stop_epoch handling in the EstartBoundary arm).

108-141: receive_block only detects the first boundary between two consecutive blocks.

If a single block jump spans multiple epoch boundaries (e.g., prev_slot=50 → next_slot=350 crossing boundaries at 100, 200, 300), only the first epoch boundary is detected. Later boundaries would go unnoticed since the intermediate block (at the detected boundary) transitions to EwrapBoundary, and the subsequent pop_work → EStart → OpenBatch path uses the same block without re-checking for additional boundaries.

In practice this likely requires extremely sparse chains. If this is a known constraint, consider adding a brief comment documenting the assumption that blocks arrive with at most one boundary gap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/work.rs` around lines 108 - 141, receive_block currently
only detects the first epoch/rollback boundary between prev_slot and next_slot;
if a block jump spans multiple boundaries you must either document the
single-boundary assumption or handle multiple boundaries. Fix by changing
receive_block to loop: compute boundary = pallas_extras::epoch_boundary(eras,
prev_slot, next_slot) (and rupd via rupd_boundary) repeatedly, advancing
prev_slot to the detected boundary slot and invoking the appropriate handler
(on_ewrap_boundary/on_rupd_boundary/on_genesis_boundary) for each detected
boundary until none remain, then call extend_batch; reference functions/ids:
receive_block, last_point_seen, slot, pallas_extras::epoch_boundary,
pallas_extras::rupd_boundary, on_ewrap_boundary, on_rupd_boundary,
on_genesis_boundary, extend_batch.
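The looping fix amounts to enumerating every boundary between the two slots rather than only the first. Under the simplifying assumption of a fixed epoch length (the real code derives era-dependent boundaries via pallas_extras, not a constant), the arithmetic for the example in the comment is:

```rust
/// Every epoch-boundary slot strictly greater than `prev_slot` and at
/// most `next_slot`, assuming a fixed epoch length. Illustrative only.
fn boundaries_crossed(prev_slot: u64, next_slot: u64, epoch_len: u64) -> Vec<u64> {
    let first_epoch = prev_slot / epoch_len + 1;
    let last_epoch = next_slot / epoch_len;
    (first_epoch..=last_epoch)
        .map(|epoch| epoch * epoch_len)
        .collect()
}
```

For prev_slot=50, next_slot=350, epoch_len=100 this yields the three boundaries 100, 200, 300 that the single-check implementation would partially miss.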

252-295: feed_and_drain helper: good but inner loop may spin on stuck states.

In the inner loop (lines 264–275), if can_receive_block() returns false and pop_work returns None, the loop breaks — correct. But if pop_work keeps returning Some(...) without ever transitioning to a receivable state (a hypothetical bug in the state machine), this loop would spin indefinitely.

Since this is test-only code, it's low risk. Just noting for awareness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/work.rs` around lines 252 - 295, The inner loop in
feed_and_drain can spin indefinitely if buf.can_receive_block() stays false
while buf.pop_work(...) keeps returning Some repeatedly; add a simple safety
guard: introduce a max iteration counter (e.g. const MAX_DRAINS: usize =
10_000), increment it each time you pop_work inside feed_and_drain, and if it
exceeds the limit break the loop (or push a ForcedStop work tag via
tag_from_internal(InternalWorkUnit::ForcedStop) before breaking) to avoid an
infinite spin. Update the loop that calls buf.pop_work(stop_epoch) and
references buf.can_receive_block(), ensuring the counter resets each outer-slot
iteration and that you still collect tags via tag_from_internal for drained
units.
crates/cardano/src/lib.rs (1)

302-309: Consider wrapping config and cache in Arc to reduce cloning overhead per work unit.

Every Blocks work unit clones both CardanoConfig and Cache (which includes ChainSummary). While Cardano's era list is bounded (~14 entries), cloning still represents unnecessary allocations on each batch. Since both are shared references and lifetimes permit, wrapping them in Arc would make clones cheap reference-count bumps, aligning with the PR's goal of reducing memory pressure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/lib.rs` around lines 302 - 309, Change CardanoConfig and
Cache to be stored and passed as Arc to avoid expensive deep clones: update the
types of self.config and self.cache to Arc<CardanoConfig> and Arc<Cache> (and
any inner ChainSummary references), adjust constructors and fields that build
RollWorkUnit so roll::RollWorkUnit::new accepts Arc<CardanoConfig> and
Arc<Cache> (or clones the Arcs), and replace .clone() calls when creating
CardanoWorkUnit::Roll in InternalWorkUnit::Blocks with cheap Arc clone (or
remove explicit clone if moving the Arc). Ensure all usages that previously
expected owned CardanoConfig/Cache are updated to dereference or clone the Arc
as needed.
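The suggested Arc change can be sketched with a stand-in type (`Config` here plays the role of CardanoConfig/Cache): cloning the Arc bumps a reference count instead of deep-copying the payload.

```rust
use std::sync::Arc;

/// Illustrative stand-in for CardanoConfig/Cache.
#[derive(Clone)]
struct Config {
    eras: Vec<String>, // deep-copied by Config::clone, shared via Arc
}

/// Each work unit receives an Arc clone: an O(1) reference-count bump
/// rather than a copy of the era list.
fn config_for_work_unit(shared: &Arc<Config>) -> Arc<Config> {
    Arc::clone(shared)
}
```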
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@crates/cardano/src/lib.rs`:
- Around line 277-283: The current code clears self.needs_cache_refresh even
when self.refresh_cache::<D>(domain.state()) returns Err, which swallows errors
and prevents retries; update the logic in lib.rs so that
self.needs_cache_refresh is only set to false on successful refresh (i.e., move
the assignment into the Ok branch) or, if you prefer a stronger fix, propagate
the error by changing pop_work to return Result<Option<...>, E> and bubble up
refresh_cache errors from refresh_cache::<D>(domain.state()) instead of just
logging them; refer to the refresh_cache function and the needs_cache_refresh
flag and adjust pop_work's return type/signature if choosing propagation.
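A minimal sketch of the "clear the flag only on Ok" fix. Worker, refresh_cache, and pop_work are toy stand-ins for the real types (the real refresh_cache takes a state store, not a bool), shown only to capture the control-flow change:

```rust
// Sketch: only clear the refresh flag when the refresh actually
// succeeded, so a failed refresh is retried on the next call.
struct Worker {
    needs_cache_refresh: bool,
}

impl Worker {
    fn refresh_cache(&mut self, should_fail: bool) -> Result<(), String> {
        if should_fail {
            Err("state store unavailable".to_string())
        } else {
            Ok(())
        }
    }

    fn pop_work(&mut self, should_fail: bool) {
        if self.needs_cache_refresh {
            match self.refresh_cache(should_fail) {
                // clear the flag only in the Ok branch
                Ok(()) => self.needs_cache_refresh = false,
                // leave the flag set so the next call retries
                Err(e) => eprintln!("cache refresh failed, will retry: {e}"),
            }
        }
    }
}
```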

---

Duplicate comments:
In `@crates/cardano/src/work.rs`:
- Around line 41-55: The method last_point_seen currently panics on
WorkBuffer::ForcedStop via unreachable!(); change it to be fallible by returning
Option<ChainPoint> (i.e., pub fn last_point_seen(&self) -> Option<ChainPoint>)
and return Some(...) for all existing variants and None for ForcedStop, then
update callers to handle the Option (or propagate it) instead of relying on
can_receive_block; alternatively, if you prefer keeping the non-fallible API,
add a field to the ForcedStop variant to store the last known ChainPoint and
return that in last_point_seen (replace unreachable!() with that stored point) —
choose one approach and apply consistently across uses of last_point_seen and
WorkBuffer::ForcedStop.
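The Option-returning variant of last_point_seen might look like the sketch below. WorkBuffer is reduced to three illustrative variants and ChainPoint to a bare slot number, so this shows the shape of the change only, not the real API:

```rust
#[derive(Clone, Debug, PartialEq)]
struct ChainPoint(u64);

enum WorkBuffer {
    Empty,
    OpenBatch { last: ChainPoint },
    ForcedStop,
}

impl WorkBuffer {
    // Fallible version: ForcedStop yields None instead of panicking
    // via unreachable!(), and callers decide how to handle it.
    fn last_point_seen(&self) -> Option<ChainPoint> {
        match self {
            WorkBuffer::OpenBatch { last } => Some(last.clone()),
            WorkBuffer::Empty => None,
            WorkBuffer::ForcedStop => None,
        }
    }
}
```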

---

Nitpick comments:
In `@crates/cardano/src/lib.rs`:
- Around lines 302-309: change CardanoConfig and Cache to be stored and passed as
Arc to avoid expensive deep clones: update the types of self.config and
self.cache to Arc<CardanoConfig> and Arc<Cache> (and any inner ChainSummary
references), adjust constructors and fields that build RollWorkUnit so
roll::RollWorkUnit::new accepts Arc<CardanoConfig> and Arc<Cache> (or clones the
Arcs), and replace .clone() calls when creating CardanoWorkUnit::Roll in
InternalWorkUnit::Blocks with cheap Arc clone (or remove explicit clone if
moving the Arc). Ensure all usages that previously expected owned
CardanoConfig/Cache are updated to dereference or clone the Arc as needed.

In `@crates/cardano/src/work.rs`:
- Around line 143-193: The current pop_work implementation returns early for
WorkBuffer::Empty and WorkBuffer::Restart then uses a catch-all `_ =>
unreachable!()`, which hides missing-variant bugs; change pop_work to match
exhaustively on WorkBuffer by adding explicit arms for Empty and Restart
(returning (None, self)) alongside the existing arms (Genesis, OpenBatch,
PreRupdBoundary, RupdBoundary, PreEwrapBoundary, EwrapBoundary, EstartBoundary,
PreForcedStop, ForcedStop) and remove the `_ => unreachable!()` arm so the
compiler will force you to handle any new WorkBuffer variants (preserve behavior
with stop_epoch handling in the EstartBoundary arm).
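A sketch of the exhaustive-match shape, with WorkBuffer reduced to four illustrative variants and the work payload to a string tag. The benefit is that adding a new variant later produces a compile error instead of hitting a hidden `unreachable!()`:

```rust
enum WorkBuffer {
    Empty,
    Restart,
    OpenBatch,
    ForcedStop,
}

// Explicit arms for every variant instead of early returns plus a
// `_ => unreachable!()` catch-all; the compiler now enforces coverage.
fn pop_work(buf: WorkBuffer) -> (Option<&'static str>, WorkBuffer) {
    match buf {
        WorkBuffer::Empty => (None, WorkBuffer::Empty),
        WorkBuffer::Restart => (None, WorkBuffer::Restart),
        WorkBuffer::OpenBatch => (Some("blocks"), WorkBuffer::Empty),
        WorkBuffer::ForcedStop => (Some("stop"), WorkBuffer::ForcedStop),
    }
}
```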
- Around line 108-141: receive_block currently only detects the first
epoch/rollback boundary between prev_slot and next_slot; if a block jump spans
multiple boundaries you must either document the single-boundary assumption or
handle multiple boundaries. Fix by changing receive_block to loop: compute
boundary = pallas_extras::epoch_boundary(eras, prev_slot, next_slot) (and rupd
via rupd_boundary) repeatedly, advancing prev_slot to the detected boundary slot
and invoking the appropriate handler
(on_ewrap_boundary/on_rupd_boundary/on_genesis_boundary) for each detected
boundary until none remain, then call extend_batch; reference functions/ids:
receive_block, last_point_seen, slot, pallas_extras::epoch_boundary,
pallas_extras::rupd_boundary, on_ewrap_boundary, on_rupd_boundary,
on_genesis_boundary, extend_batch.
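The multi-boundary loop can be sketched as below. Here epoch_boundary is a toy stand-in for pallas_extras::epoch_boundary that assumes a fixed epoch length of 10 slots; the real function would consult the era summary, and each crossed boundary would fire the appropriate on_ewrap_boundary/on_rupd_boundary handler:

```rust
// First epoch boundary strictly greater than prev_slot and <= next_slot,
// assuming epochs of 10 slots (toy schedule for illustration).
fn epoch_boundary(prev_slot: u64, next_slot: u64) -> Option<u64> {
    let candidate = (prev_slot / 10 + 1) * 10;
    (candidate <= next_slot).then_some(candidate)
}

// Loop until no further boundary falls in (prev_slot, next_slot],
// advancing past each one, so a large slot jump handles every boundary
// it spans instead of only the first.
fn boundaries_crossed(mut prev_slot: u64, next_slot: u64) -> Vec<u64> {
    let mut crossed = Vec::new();
    while let Some(b) = epoch_boundary(prev_slot, next_slot) {
        crossed.push(b); // real code: invoke the matching boundary handler
        prev_slot = b;   // advance past the handled boundary
    }
    crossed
}
```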
- Around line 252-295: The inner loop in feed_and_drain can spin indefinitely if
buf.can_receive_block() stays false while buf.pop_work(...) keeps returning Some
repeatedly; add a simple safety guard: introduce a max iteration counter (e.g.
const MAX_DRAINS: usize = 10_000), increment it each time you pop_work inside
feed_and_drain, and if it exceeds the limit break the loop (or push a ForcedStop
work tag via tag_from_internal(InternalWorkUnit::ForcedStop) before breaking) to
avoid an infinite spin. Update the loop that calls buf.pop_work(stop_epoch) and
references buf.can_receive_block(), ensuring the counter resets each outer-slot
iteration and that you still collect tags via tag_from_internal for drained
units.
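The guard can be sketched like this. Buf is a deliberately pathological stand-in whose can_receive_block never flips and whose pop_work never runs dry, so the counter is the only thing that terminates the loop; the real feed_and_drain would also emit a ForcedStop tag before bailing:

```rust
const MAX_DRAINS: usize = 10_000;

struct Buf;

impl Buf {
    fn can_receive_block(&self) -> bool {
        false // pathological: never becomes ready
    }
    fn pop_work(&mut self) -> Option<&'static str> {
        Some("unit") // pathological: always yields more work
    }
}

// Returns (units drained, whether the safety guard fired).
fn feed_and_drain(buf: &mut Buf) -> (usize, bool) {
    let mut drained = 0;
    let mut guard_tripped = false;
    while !buf.can_receive_block() {
        if buf.pop_work().is_none() {
            break; // buffer ran dry normally
        }
        drained += 1;
        if drained >= MAX_DRAINS {
            guard_tripped = true; // real code: push ForcedStop, then bail
            break;
        }
    }
    (drained, guard_tripped)
}
```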

@scarmuega scarmuega merged commit ec2eb41 into main Feb 17, 2026
12 checks passed
@scarmuega scarmuega deleted the dolos-boundary-mem branch February 17, 2026 12:27