Skip to content

M7-T01 Add replicated node state and durable protocol metadata#60

Merged
skel84 merged 3 commits intomainfrom
issue-49-replicated-node-metadata
Mar 13, 2026
Merged

M7-T01 Add replicated node state and durable protocol metadata#60
skel84 merged 3 commits intomainfrom
issue-49-replicated-node-metadata

Conversation

@skel84
Copy link
Owner

@skel84 skel84 commented Mar 13, 2026

Summary

  • add a real allocdb-node::replica wrapper with durable replica metadata, explicit faulted startup state, and bootstrap paths for fresh-open and recover flows
  • validate persisted replica identity, view/vote ordering, commit-versus-snapshot consistency, and local applied state before a replica can become active
  • document the concrete M7-T01 metadata contract and update the status snapshot to reflect replicated implementation work

Linked Issue

Changes

  • add ReplicaNode, ReplicaMetadata, ReplicaMetadataFile, and startup fault-reason types in crates/allocdb-node
  • add regression tests for metadata round-trip, fresh bootstrap, recover bootstrap, corrupt metadata, identity mismatch, commit mismatch, and persisted faulted role
  • expose SingleNodeEngine::active_snapshot_lsn() for wrapper validation and export the new replica surface from the crate root
  • update docs/replication.md and docs/status.md for the persisted metadata file and faulted-startup behavior

Validation

  • scripts/preflight.sh
  • narrower targeted commands added when they materially strengthen confidence
  • cargo test -p allocdb-node
  • cargo clippy -p allocdb-node --all-targets --all-features -- -D warnings

Docs

  • docs updated if behavior, invariants, failure modes, or operator semantics changed

CodeRabbit Triage

  • CodeRabbit status completed
  • requested @coderabbitai summary if no visible review comment or thread appeared
  • applied relevant correctness, safety, recovery, testing, and docs suggestions
  • documented any intentionally rejected suggestions

@coderabbitai
Copy link

coderabbitai bot commented Mar 13, 2026

Summary by CodeRabbit

  • New Features

    • Added a replica node with durable metadata, bootstrap, validation, recovery, identity and durable-vote handling.
    • Exposed an accessor to read the active snapshot LSN.
  • Tests

    • Added comprehensive tests for replica metadata, bootstrap, recovery, fault scenarios, and non-Unix durability behavior.
  • Documentation

    • Expanded replication docs and project status to reflect M7 replicated implementation progress.

Walkthrough

Adds a new public replica module implementing durable replica metadata, validation, fault handling, and a ReplicaNode wrapper with open/recover flows; exposes SingleNodeEngine::active_snapshot_lsn; adds tests and docs describing persisted metadata and startup validation. (34 words)

Changes

Cohort / File(s) Summary
Engine Observability & Re-exports
crates/allocdb-node/src/engine_observe.rs, crates/allocdb-node/src/lib.rs
Adds pub fn active_snapshot_lsn(&self) -> Option<Lsn> to SingleNodeEngine and re-exports the new replica module from the crate root.
Replica Implementation
crates/allocdb-node/src/replica.rs
New comprehensive replica module: identity/role types, durable ReplicaMetadata, encoding/decoding, ReplicaMetadataFile I/O, bootstrap/validate logic, ReplicaNode open/recover lifecycle, status/error enums, and platform parent-dir sync helpers.
Tests
crates/allocdb-node/src/replica_tests.rs, crates/allocdb-node/src/non_unix_tests.rs
Adds extensive tests for metadata round-trip, bootstrap, recovery, multiple faulting scenarios, persisted-fault handling, and a non-Unix durability behavior test.
Docs & Status
docs/replication.md, docs/status.md
Documents persisted replica metadata fields, startup validation rules and failure modes; updates project status and milestone progress for replicated-node work.

Sequence Diagrams

sequenceDiagram
    participant User as User
    participant ReplicaNode as ReplicaNode
    participant ReplicaMetadataFile as ReplicaMetadataFile
    participant SingleNodeEngine as SingleNodeEngine
    participant Filesystem as Filesystem

    User->>ReplicaNode: open(core_config, engine_config, identity, paths)
    ReplicaNode->>ReplicaMetadataFile: load_metadata()
    ReplicaMetadataFile->>Filesystem: read metadata file
    alt metadata missing
        ReplicaNode->>SingleNodeEngine: open(engine_config)
        ReplicaNode->>ReplicaMetadataFile: write_metadata(metadata.bootstrap(...))
        ReplicaMetadataFile->>Filesystem: write temp file
        ReplicaMetadataFile->>Filesystem: fsync temp file
        ReplicaMetadataFile->>Filesystem: rename temp -> metadata
        ReplicaMetadataFile->>Filesystem: sync parent dir
    else metadata exists
        ReplicaNode->>ReplicaNode: validate(metadata vs engine state)
        alt validation success
            ReplicaNode->>SingleNodeEngine: open(engine_config)
        else validation failure
            ReplicaNode-->>User: Err(Faulted)
        end
    end
    ReplicaNode-->>User: Result<ReplicaNode>
Loading
sequenceDiagram
    participant User as User
    participant ReplicaNode as ReplicaNode
    participant ReplicaMetadataFile as ReplicaMetadataFile
    participant SingleNodeEngine as SingleNodeEngine
    participant Filesystem as Filesystem

    User->>ReplicaNode: recover(core_config, engine_config, identity, paths)
    ReplicaNode->>ReplicaMetadataFile: load_metadata()
    ReplicaMetadataFile->>Filesystem: read metadata file
    alt metadata indicates fault
        ReplicaNode-->>User: Err(Faulted)
    else metadata present
        ReplicaNode->>SingleNodeEngine: recover(engine_config)
        ReplicaNode->>ReplicaNode: validate(recovered engine state)
        alt validation success
            ReplicaNode->>ReplicaMetadataFile: persist_metadata()
        else validation failure
            ReplicaNode-->>User: Err(Faulted)
        end
    end
    ReplicaNode-->>User: Result<ReplicaNode>
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title 'M7-T01 Add replicated node state and durable protocol metadata' directly and comprehensively summarizes the main changes: addition of replica wrapper, durable metadata, and related types.
Description check ✅ Passed The pull request description covers all required template sections with concrete details: Summary clearly states objectives, Linked Issue (#49) is provided, Changes lists specific additions, Validation includes both required and extra test commands, and Docs confirms updates.
Linked Issues check ✅ Passed The pull request fulfills all acceptance criteria from #49: implements durable replica metadata (ReplicaMetadata, ReplicaMetadataFile), defines startup validation path with faulted-state entry via ReplicaFault types, preserves allocator apply path, provides comprehensive tests, and updates documentation.
Out of Scope Changes check ✅ Passed All code changes are within scope of #49: replica wrapper implementation (replica.rs +859 lines), tests (replica_tests.rs +666, non_unix_tests.rs +44), new public accessor (engine_observe.rs +5), exports (lib.rs +7), and documentation updates (replication.md +11, status.md +14/-4).
Docstring Coverage ✅ Passed Docstring coverage is 97.44% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch issue-49-replicated-node-metadata
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@skel84
Copy link
Owner Author

skel84 commented Mar 13, 2026

@coderabbitai summary

@coderabbitai
Copy link

coderabbitai bot commented Mar 13, 2026

✅ Actions performed

Summary regeneration triggered.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/allocdb-node/src/replica_tests.rs`:
- Around line 105-325: The test suite is missing coverage for several failure
branches in ReplicaMetadata::validate(); add focused unit/integration tests that
assert ReplicaNode::open or ReplicaNode::recover faults for each invariant:
shard mismatch (identity.shard_id mismatch vs identity()),
snapshot-without-commit (active_snapshot_lsn set but commit_lsn None),
snapshot-beyond-commit (active_snapshot_lsn > commit_lsn), last_normal_view
ordering (last_normal_view > current_view or inconsistent), durable_vote
ordering (durable_vote.view < current_view or inconsistent voted_for),
active-snapshot mismatch (active_snapshot_lsn not present in engine snapshot
state), and metadata rewrite/overwrite behavior (writing new metadata that
violates invariants or concurrent overwrite). Use
ReplicaMetadataFile::new(...).write_metadata(...) to seed on-disk metadata, then
call ReplicaNode::open or ::recover and assert ReplicaNodeStatus::Faulted with
the specific ReplicaStartupValidationError variants produced by
ReplicaMetadata::validate() to cover each branch.

In `@crates/allocdb-node/src/replica.rs`:
- Around line 551-553: The method engine_mut() exposes mutable access to
SingleNodeEngine and must not be public because callers can advance allocator
state without updating ReplicaMetadata; make engine_mut non-public (remove the
pub visibility) or restrict it to crate/internal use (e.g., change to pub(crate)
or rename to engine_mut_internal) and keep direct mutable access hidden until
you implement coordinated mutation APIs that update ReplicaMetadata
(commit_lsn/active_snapshot_lsn) alongside engine changes; update any call sites
or tests that relied on the public API to use the new internal name or the
future coordinated methods.
- Around line 358-379: The current write_metadata() (and its
temp_path()/sync_parent_dir() usage) advertises Unix-level durable replace but
on non-Unix builds the directory-sync branch is a no-op and std::fs::rename does
not provide the same replace-safety; to fix this either implement a
platform-specific safe-replace for non-Unix targets or make the stronger
guarantee explicit by gating the implementation to Unix-only: add #[cfg(unix)]
to the write_metadata()/temp_path()/sync_parent_dir() implementation and add a
#[cfg(not(unix))] compile_error or documented stub that clearly states non-Unix
targets are unsupported for durable replace, referencing the write_metadata,
temp_path, and sync_parent_dir symbols so reviewers can locate the
platform-specific change.
- Around line 346-355: The load_metadata function currently uses
file.read_to_end which can OOM on a huge/corrupt sidecar; instead bound the read
before decoding by checking the file size and/or using
File::take(MAX_METADATA_BYTES) (or reading into a fixed-size buffer) and
erroring if the sidecar exceeds a small MAX_METADATA_BYTES constant; update the
logic around File::open + read_to_end/bytes to use file.metadata().len() (or
file.take(...).read_to_end) and return an appropriate ReplicaMetadataFileError
when the file is too large, then pass the safely-sized bytes into
decode_replica_metadata to decode the ReplicaMetadata.

In `@docs/replication.md`:
- Around line 152-155: Update the docs text to state that startup faults not
only on decode/metadata inconsistency but also on unreadable metadata due to I/O
errors (e.g., permission denied) as implemented in
crates/allocdb-node/src/replica.rs; mention that the replica startup validation
in the replica initialization path (see replica.rs startup/validation logic that
treats metadata read errors other than NotFound as faulted) will leave the
replica in a `faulted` state until repaired, and ensure the paragraph explicitly
lists "unreadable or permission-denied sidecars" alongside decode failures and
persisted `faulted` roles.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4a8b9cb2-64e8-4434-9105-b16dfc6acfe9

📥 Commits

Reviewing files that changed from the base of the PR and between 5bb40d7 and 427ab04.

📒 Files selected for processing (6)
  • crates/allocdb-node/src/engine_observe.rs
  • crates/allocdb-node/src/lib.rs
  • crates/allocdb-node/src/replica.rs
  • crates/allocdb-node/src/replica_tests.rs
  • docs/replication.md
  • docs/status.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.

Files:

  • docs/replication.md
  • docs/status.md
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: error for invariant breaks, corruption, and failed operations that require intervention; warn for degraded but expected conditions such as overload, lag, or rejected requests; info for meaningful lifecycle and state-transition events; debug for detailed execution traces useful in development; trace only for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.

Files:

  • crates/allocdb-node/src/lib.rs
  • crates/allocdb-node/src/engine_observe.rs
  • crates/allocdb-node/src/replica_tests.rs
  • crates/allocdb-node/src/replica.rs
docs/status.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep docs/status.md current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.

Files:

  • docs/status.md
🧠 Learnings (2)
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to docs/status.md : Keep [`docs/status.md`](./docs/status.md) current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.

Applied to files:

  • docs/status.md
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.

Applied to files:

  • crates/allocdb-node/src/replica_tests.rs
🧬 Code graph analysis (1)
crates/allocdb-node/src/replica_tests.rs (3)
crates/allocdb-node/src/replica.rs (11)
  • path (337-339)
  • engine (547-549)
  • metadata_path (557-559)
  • paths (562-564)
  • new (183-193)
  • new (330-334)
  • new (758-760)
  • open (412-461)
  • status (542-544)
  • metadata (537-539)
  • bootstrap (67-81)
crates/allocdb-node/src/engine_observe.rs (1)
  • active_snapshot_lsn (43-45)
crates/allocdb-core/src/state_machine_metrics.rs (1)
  • last_applied_lsn (21-23)

@skel84
Copy link
Owner Author

skel84 commented Mar 13, 2026

Fixes Applied Successfully

Applied the substantive CodeRabbit fixes in commit 018edad.

Files modified:

  • crates/allocdb-node/src/replica.rs
  • crates/allocdb-node/src/replica_tests.rs
  • docs/replication.md

Applied

  • bounded replica metadata reads and added explicit oversized-sidecar rejection
  • removed the public mutable-engine escape hatch from ReplicaNode
  • made the durable replace contract explicit for non-Unix targets by returning Unsupported
  • expanded validation coverage for shard mismatch, snapshot/commit invariants, vote/view ordering, active-snapshot mismatch, and overwrite behavior
  • documented unreadable or permission-denied metadata sidecars as faulting startup

Not applied

  • the suggested "inconsistent voted_for" test was not added because the current DurableVote model has no separate membership or candidate-consistency invariant on voted_for; the new tests cover the actual validation branches implemented today

Validation rerun:

  • cargo test -p allocdb-node
  • cargo clippy -p allocdb-node --all-targets --all-features -- -D warnings
  • ./scripts/preflight.sh

The latest autofix changes are on the issue-49-replicated-node-metadata branch.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/allocdb-node/src/replica_tests.rs`:
- Around line 128-605: The tests call ReplicaMetadataFile::write_metadata (which
returns an Unsupported error on non-Unix) and unwrap its result, causing panics
off Unix; fix by gating the whole test module to Unix with #[cfg(unix)] (and
#[cfg(test)] as appropriate) or, if you need cross-platform tests, update each
test that calls write_metadata / ReplicaMetadataFile to handle the Unsupported
error path (assert that write_metadata returns
Err(ReplicaMetadataFileError::Unsupported) on non-unix) and skip the unix-only
assertions when not on unix; locate symbols ReplicaMetadataFile::write_metadata,
replica_* tests (e.g., replica_open_bootstraps_missing_metadata,
replica_metadata_file_round_trips) and apply the cfg or conditional assertions
accordingly.

In `@crates/allocdb-node/src/replica.rs`:
- Around line 154-169: The current validation only errors when
local_last_applied_lsn > commit_lsn, which allows a stale local state when
commit_lsn is ahead; change the check to require the local applied LSN to
exactly match the committed LSN: compare self.commit_lsn (commit_lsn) to
local_last_applied_lsn (last_applied_lsn) and if they differ return a validation
error (create or reuse an appropriate ReplicaStartupValidationError variant,
e.g. CommitLsnMismatch or extend AppliedLsnExceedsCommitLsn semantics) that
includes both self.commit_lsn and local_last_applied_lsn so startup is rejected
when commit_lsn != last_applied_lsn, handling Option cases (None vs Some)
consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c0218093-9dcb-4d43-8fd9-47bee161e9ed

📥 Commits

Reviewing files that changed from the base of the PR and between 427ab04 and 018edad.

📒 Files selected for processing (3)
  • crates/allocdb-node/src/replica.rs
  • crates/allocdb-node/src/replica_tests.rs
  • docs/replication.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: error for invariant breaks, corruption, and failed operations that require intervention; warn for degraded but expected conditions such as overload, lag, or rejected requests; info for meaningful lifecycle and state-transition events; debug for detailed execution traces useful in development; trace only for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.

Files:

  • crates/allocdb-node/src/replica_tests.rs
  • crates/allocdb-node/src/replica.rs
**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.

Files:

  • docs/replication.md
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Treat CodeRabbit as part of the required review path when it is enabled on the repository. Wait for its status to complete before merge. If it completes without a visible review comment or review thread, request visible output with `coderabbitai summary`. Address every substantive CodeRabbit comment explicitly before merge by either applying the change or documenting why it is not being applied. Apply correctness, safety, recovery, test, and docs-alignment feedback by default; document why you reject suggestions that would weaken determinism, boundedness, or trusted-core discipline.
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.

Applied to files:

  • crates/allocdb-node/src/replica_tests.rs
🧬 Code graph analysis (2)
crates/allocdb-node/src/replica_tests.rs (3)
crates/allocdb-node/src/replica.rs (13)
  • path (340-342)
  • engine (568-570)
  • metadata_path (573-575)
  • paths (578-580)
  • status (563-565)
  • new (184-194)
  • new (333-337)
  • new (781-783)
  • open (433-482)
  • metadata (558-560)
  • bootstrap (68-82)
  • load_metadata (350-376)
  • recover (493-555)
crates/allocdb-node/src/engine_observe.rs (1)
  • active_snapshot_lsn (43-45)
crates/allocdb-core/src/state_machine_metrics.rs (1)
  • last_applied_lsn (21-23)
crates/allocdb-node/src/replica.rs (3)
crates/allocdb-node/src/replica_tests.rs (4)
  • identity (58-63)
  • metadata_path (25-27)
  • snapshot_path (29-31)
  • wal_path (33-35)
crates/allocdb-node/src/engine_observe.rs (1)
  • active_snapshot_lsn (43-45)
crates/allocdb-core/src/state_machine_metrics.rs (1)
  • last_applied_lsn (21-23)
🔇 Additional comments (1)
docs/replication.md (1)

140-155: Good docs/code alignment on startup faults.

This now mirrors the sidecar fields and fail-closed startup cases implemented in crates/allocdb-node/src/replica.rs, including unreadable sidecars and persisted faulted state.

@skel84
Copy link
Owner Author

skel84 commented Mar 13, 2026

Fixes Applied Successfully

Applied the remaining substantive CodeRabbit fixes in commit c775800.

Files modified:

  • crates/allocdb-node/src/replica.rs
  • crates/allocdb-node/src/replica_tests.rs
  • crates/allocdb-node/src/non_unix_tests.rs
  • docs/replication.md

Applied

  • reject replicas whose local applied state is behind the persisted commit_lsn, while preserving the more specific snapshot-anchor fault when both invariants fail
  • add startup regression coverage for commit_lsn ahead of both fresh-open and recovered local state
  • gate the existing replica metadata test suite to Unix, and add an explicit non-Unix assertion that durable metadata writes return Unsupported
  • document that startup requires the local applied point and snapshot anchor to match the persisted replica metadata exactly

Validation rerun:

  • cargo test -p allocdb-node
  • cargo clippy -p allocdb-node --all-targets --all-features -- -D warnings
  • ./scripts/preflight.sh

The latest autofix changes are on the issue-49-replicated-node-metadata branch.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/allocdb-node/src/non_unix_tests.rs (1)

42-43: Minor: Temp file cleanup path construction may be incorrect.

The with_extension("replica.tmp") call on line 43 replaces the entire extension, producing a path like foo.replica.tmp. However, the temp_path method in ReplicaMetadataFile (lines 426-434 in replica.rs) formats the extension as {ext}.tmp, which for a .replica extension would produce .replica.tmp. In this case the behavior happens to match, but the cleanup logic relies on implementation details.

Since write actually fails before creating the temp file on non-Unix (the error occurs in replace_file_durably), the temp file may not exist anyway—but the parent directory and partial temp file from File::create at line 417 could exist.

💡 Suggestion: Consider using the same temp_path logic
-    let _ = fs::remove_file(file.path().with_extension("replica.tmp"));
+    // Temp file uses the pattern: {original}.tmp
+    let mut temp_path = path.clone();
+    temp_path.set_extension("replica.tmp");
+    let _ = fs::remove_file(&temp_path);

Or expose a method to get the temp path for testing purposes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/non_unix_tests.rs` around lines 42 - 43, The cleanup
currently uses file.path().with_extension("replica.tmp") which relies on
implementation detail; instead construct the temp path using the same logic as
ReplicaMetadataFile::temp_path (i.e., take the original file path, read its
extension and set the extension to "{ext}.tmp") or call/expose
ReplicaMetadataFile::temp_path from the test, then remove that exact temp path
(and ensure the parent directory is removed if needed) so the test mirrors the
real temp-file naming used by replace_file_durably/File::create.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/allocdb-node/src/non_unix_tests.rs`:
- Around line 42-43: The cleanup currently uses
file.path().with_extension("replica.tmp") which relies on implementation detail;
instead construct the temp path using the same logic as
ReplicaMetadataFile::temp_path (i.e., take the original file path, read its
extension and set the extension to "{ext}.tmp") or call/expose
ReplicaMetadataFile::temp_path from the test, then remove that exact temp path
(and ensure the parent directory is removed if needed) so the test mirrors the
real temp-file naming used by replace_file_durably/File::create.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ac36fa1b-5120-4af5-bf2d-ae155fdfc56a

📥 Commits

Reviewing files that changed from the base of the PR and between 018edad and c775800.

📒 Files selected for processing (4)
  • crates/allocdb-node/src/non_unix_tests.rs
  • crates/allocdb-node/src/replica.rs
  • crates/allocdb-node/src/replica_tests.rs
  • docs/replication.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: error for invariant breaks, corruption, and failed operations that require intervention; warn for degraded but expected conditions such as overload, lag, or rejected requests; info for meaningful lifecycle and state-transition events; debug for detailed execution traces useful in development; trace only for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.

Files:

  • crates/allocdb-node/src/non_unix_tests.rs
  • crates/allocdb-node/src/replica_tests.rs
  • crates/allocdb-node/src/replica.rs
**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.

Files:

  • docs/replication.md
🧠 Learnings (3)
📓 Common learnings
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Treat CodeRabbit as part of the required review path when it is enabled on the repository. Wait for its status to complete before merge. If it completes without a visible review comment or review thread, request visible output with `coderabbitai summary`. Address every substantive CodeRabbit comment explicitly before merge by either applying the change or documenting why it is not being applied. Apply correctness, safety, recovery, test, and docs-alignment feedback by default; document why you reject suggestions that would weaken determinism, boundedness, or trusted-core discipline.
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.md : Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.

Applied to files:

  • docs/replication.md
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.

Applied to files:

  • crates/allocdb-node/src/replica_tests.rs
🧬 Code graph analysis (1)
crates/allocdb-node/src/replica_tests.rs (1)
crates/allocdb-node/src/replica.rs (13)
  • path (363-365)
  • engine (591-593)
  • metadata_path (596-598)
  • paths (601-603)
  • status (586-588)
  • new (203-213)
  • new (356-360)
  • new (804-806)
  • open (456-505)
  • metadata (581-583)
  • bootstrap (72-86)
  • load_metadata (373-399)
  • recover (516-578)
🔇 Additional comments (23)
docs/replication.md (1)

140-156: LGTM!

The M7-T01 documentation accurately describes the persisted metadata fields and startup validation behavior. The fault conditions (decode failure, metadata inconsistency, unreadable/permission-denied sidecars, applied state mismatch, persisted faulted role) correctly reflect the implementation in replica.rs. The documentation is well-aligned with the code.

crates/allocdb-node/src/replica.rs (11)

1-22: LGTM!

Imports are minimal and appropriate. The metadata size bound (256 bytes) is reasonable for the fixed-size format, and the magic/version scheme supports forward compatibility. Test module gating (#[cfg(all(test, unix))] and #[cfg(all(test, not(unix)))]) correctly separates platform-specific test behavior.


23-68: LGTM!

Type definitions are clean and well-structured. Derives are appropriate (Clone, Copy, Debug, Eq, PartialEq), and the public fields on ReplicaMetadata allow straightforward construction and inspection. The types align with the documented persisted fields in docs/replication.md.


70-192: LGTM!

The bootstrap constructor correctly initializes a replica in Recovering state with view 0 and no vote. The validate method is comprehensive—it checks identity match, snapshot-versus-commit consistency, view/vote ordering invariants, active snapshot anchor match, and the full spectrum of applied-LSN-versus-commit-LSN states (none/none, behind, equal, ahead). The match arms at lines 165-188 correctly handle all combinations.


194-347: LGTM!

Error types are well-designed with appropriate granularity. ReplicaMetadataLoadError wisely stores std::io::ErrorKind rather than std::io::Error to support Copy. The From implementations enable ergonomic error propagation. Validation error variants capture all relevant context fields for debugging.


349-435: LGTM!

The bounded read in load_metadata correctly guards against oversized files with both the pre-read metadata length check and post-read length verification. The write_metadata flow (create parent → temp write → sync_data → durable rename) follows best practices for atomic durable writes. The temp-path naming scheme is deterministic, which is acceptable since a ReplicaMetadataFile is typically owned by a single ReplicaNode instance.


437-578: LGTM!

The ReplicaNode::open and ReplicaNode::recover constructors correctly implement the documented startup paths. The three-way handling via LoadedMetadata (Missing → bootstrap, Faulted → skip engine, Ready → validate) is clean. Logging levels are appropriate: info for successful bootstrap, warn for pre-engine faults, error for post-validation faults. Both paths correctly persist bootstrapped metadata before returning.


580-643: LGTM!

Accessors are appropriately read-only. The removal of engine_mut (per past review feedback) prevents callers from advancing allocator state without updating replica metadata. The persist_metadata method enables explicit metadata sync for future protocol-level operations.


645-710: LGTM!

The load_or_fault_existing_metadata helper correctly handles all load outcomes: missing file, persisted faulted role, successful load, and load errors. On load failure, it appropriately constructs synthetic faulted metadata using the expected identity (not trusting corrupted file contents). The error logging includes sufficient context for debugging.


712-796: LGTM!

The encoding functions are straightforward and consistent. All integers use little-endian encoding. Optional fields use a tag byte (0 for None, 1 for Some) followed by the payload. The maximum encoded size (~74 bytes) is well under the 256-byte bound.


727-859: LGTM!

The MetadataCursor abstraction provides clean bounds-checked reading. Decoding validates magic, version, and option tags, and the finish() call ensures no trailing bytes. Error variants (BufferTooShort, InvalidMagic, UnsupportedVersion, InvalidRole, InvalidOptionTag, InvalidLayout) cover all malformed input cases.


861-882: LGTM!

The Unix durability path correctly performs rename followed by parent directory sync_all to ensure the rename is durable. The non-Unix stub returns Unsupported as documented. The sync_parent_dir gracefully handles edge cases where parent() returns None.

crates/allocdb-node/src/non_unix_tests.rs (1)

19-41: LGTM!

The test correctly verifies that write_metadata returns ReplicaMetadataFileError::Io with ErrorKind::Unsupported on non-Unix platforms, matching the documented behavior and the replace_file_durably implementation.

crates/allocdb-node/src/replica_tests.rs (10)

1-127: LGTM!

Test helpers are well-designed and reduce boilerplate. The test_path function ensures unique paths via nanosecond timestamps. Config values are minimal but sufficient for the test scenarios. The assert_faulted helper cleanly verifies both status and engine absence.


128-151: LGTM!

The round-trip test exercises all metadata fields including optional ones with Some values, verifying the complete encode/decode path.


153-178: LGTM!

Tests the missing-metadata bootstrap path: fresh engine, bootstrapped metadata (view 0, Recovering role, no LSNs), and metadata persistence. Verification that the metadata file was created confirms durability.


180-204: LGTM!

Tests recover-path bootstrap: prepares local state via engine + checkpoint, then verifies that ReplicaNode::recover bootstraps metadata from the recovered engine state (commit_lsn=2, active_snapshot_lsn=2).


206-298: LGTM!

Fault tests cover corrupt metadata (decode failure), replica ID mismatch, and shard ID mismatch. Each correctly constructs the fault condition and verifies the expected ReplicaFaultReason variant.


300-352: LGTM!

Tests cover the snapshot-versus-commit invariants: snapshot without commit and snapshot exceeding commit. Both correctly trigger the expected validation errors.


354-447: LGTM!

Tests verify view ordering invariants: last_normal_view ≤ current_view, vote.view ≥ current_view, and vote.view ≥ last_normal_view. Each constructs the violation and asserts the correct error variant.


449-475: LGTM!

Tests the active snapshot anchor validation: metadata claims a snapshot at LSN 3, but the fresh engine has no snapshot. The test correctly expects ActiveSnapshotMismatch.


477-580: LGTM!

Tests comprehensively cover the applied-versus-commit LSN validation: exceeds (lines 478-519), behind on fresh open (lines 521-546), and behind on recover (lines 548-580). Both directions and both open/recover paths are exercised.


582-666: LGTM!

Tests cover: oversized sidecar rejection (verifies the 256-byte bound), overwrite semantics (ensures writes replace rather than append), and persisted faulted role (verifies that a Faulted role in metadata triggers PersistedFaultState without further validation). Good negative-path coverage.

@skel84 skel84 merged commit 722d50c into main Mar 13, 2026
2 checks passed
@skel84 skel84 deleted the issue-49-replicated-node-metadata branch March 13, 2026 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

M7-T01 Add replicated node state and durable protocol metadata

1 participant