test: promote replicated simulation scenarios#65
Conversation
Add deterministic partition and primary-crash regression coverage to the real three-replica harness, plus the minimal retry-aware harness support needed to replay ambiguous client outcomes after failover. Closes #54
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📜 Recent review details🧰 Additional context used📓 Path-based instructions (3)**/*.rs📄 CodeRabbit inference engine (AGENTS.md)
Files:
**/*.md📄 CodeRabbit inference engine (AGENTS.md)
Files:
docs/status.md📄 CodeRabbit inference engine (AGENTS.md)
Files:
🧠 Learnings (2)📚 Learning: 2026-03-12T15:18:53.086ZApplied to files:
📚 Learning: 2026-03-12T15:18:53.086ZApplied to files:
🧬 Code graph analysis (2)crates/allocdb-node/src/replicated_simulation_tests.rs (4)
crates/allocdb-node/src/replicated_simulation.rs (2)
🔇 Additional comments (9)
Summary by CodeRabbit
WalkthroughAdds a retry-aware client submit flow to the replicated simulation, exposes Changes
Sequence DiagramsequenceDiagram
participant Client
participant Simulation as Replicated Simulation
participant RetryCache as Retry Lookup
participant Engine as Primary Engine State
Client->>Simulation: client_submit_or_retry(primary, slot, payload)
activate Simulation
Simulation->>RetryCache: lookup_retry_result(primary, slot, payload)
activate RetryCache
RetryCache->>Engine: fetch active primary engine state
Engine-->>RetryCache: engine state / prepared & published records
alt Retry Found (Fingerprint match -> published)
RetryCache-->>Simulation: SubmissionResult (published)
Simulation-->>Client: Published(SubmissionResult)
else No usable retry result
RetryCache-->>Simulation: None
deactivate RetryCache
Simulation->>Simulation: client_submit() -> produce Prepared entry
Simulation-->>Client: Prepared(ReplicaPreparedEntry)
end
deactivate Simulation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/allocdb-node/src/replicated_simulation_tests.rs`:
- Around line 1029-1045: The test currently observes the outcome via
published_result(entry.lsn) which is set by handle_prepare_ack() as soon as
quorum append completes, so it does not model a distinct “reply not yet
published” boundary; either change the test to explicitly model pre-reply by
removing/avoiding the published_result check and instead assert quorum-append
via a lower-level indicator (e.g., check a quorum-appended helper or persisted
log state) before crashing, or explicitly publish the reply before the crash if
you intend the “reply exposed” scenario (call the harness method that publishes
replies, e.g., harness.publish_reply(entry.lsn) or equivalent), and update the
test name/docs to match the chosen behavior; locate changes around
client_submit, deliver_protocol_message, published_result(entry.lsn), and
handle_prepare_ack() to implement this fix.
In `@crates/allocdb-node/src/replicated_simulation.rs`:
- Around line 1325-1329: The patch missed a regression test where a newly
elected primary must fetch a prepared suffix from another voter: add a test in
replicated_simulation.rs that simulates a 3-replica view where only replica 3
holds the prepared entry and replica 2 wins the new view, forcing the path that
calls highest_prepared_lsn() and then find_prepared_entry() to reconstruct the
missing entries; exercise the failover/recovery flow, assert that replica 2
successfully copies the prepared suffix (verify commit LSNs and that the
prepared entry appears on replica 2), and ensure the test covers the
negative/recovery case so this regression cannot recur.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8665918b-a8fd-42d1-80f6-de232bca5797
📒 Files selected for processing (5)
crates/allocdb-node/src/replica.rscrates/allocdb-node/src/replicated_simulation.rscrates/allocdb-node/src/replicated_simulation_tests.rsdocs/status.mddocs/testing.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level:errorfor invariant breaks, corruption, and failed operations that require intervention;warnfor degraded but expected conditions such as overload, lag, or rejected requests;infofor meaningful lifecycle and state-transition events;debugfor detailed execution traces useful in development;traceonly for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.
Files:
crates/allocdb-node/src/replica.rscrates/allocdb-node/src/replicated_simulation_tests.rscrates/allocdb-node/src/replicated_simulation.rs
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.
Files:
docs/status.mddocs/testing.md
docs/status.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep
docs/status.mdcurrent as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.
Files:
docs/status.md
🧠 Learnings (2)
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Applied to files:
crates/allocdb-node/src/replicated_simulation_tests.rs
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to docs/status.md : Keep [`docs/status.md`](./docs/status.md) current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.
Applied to files:
docs/status.md
🧬 Code graph analysis (2)
crates/allocdb-node/src/replicated_simulation_tests.rs (1)
crates/allocdb-node/src/replicated_simulation.rs (2)
replica(535-540)configured_primary(578-580)
crates/allocdb-node/src/replicated_simulation.rs (2)
crates/allocdb-node/src/replica.rs (5)
engine(719-721)new(210-220)new(452-456)new(539-543)new(1376-1378)crates/allocdb-node/src/simulation.rs (4)
engine(341-343)new(86-88)new(114-119)new(166-185)
|
Addressed CodeRabbit feedback in commit d8348da. Changes:
Validation rerun:
|
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
docs/testing.md (1)
261-263:⚠️ Potential issue | 🟡 MinorDon't describe an unmodeled pre-reply crash window.
The harness still publishes the client outcome as soon as quorum append completes in
handle_prepare_ack(), so this scenario only models interrupted commit propagation. Please remove the “reply delivery” wording here to keep the docs aligned with the executable behavior.As per coding guidelines,
**/*.md: Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/testing.md` around lines 261 - 263, Update the test description to remove the unmodeled "reply delivery" wording so it matches the harness behavior: change the second bullet to state that the primary is crashed after quorum append (and after reply) modeling only interrupted commit propagation, and explicitly reference the existing behavior in handle_prepare_ack() which publishes client outcome as soon as quorum append completes; remove any claim that reply delivery can be interrupted because that pre-reply crash window is not modeled.
🧹 Nitpick comments (1)
crates/allocdb-node/src/replicated_simulation.rs (1)
1206-1243: Log retry-cache hits and misses in this lookup path.This is the new ambiguity-resolution surface, but hit/miss/conflict outcomes are silent today. A structured
debug!here withprimary,operation_id,request_slot,applied_lsn, andresult_codewould make failover regressions much easier to triage.As per coding guidelines,
**/*.rs: Add extensive logging where it materially improves debuggability or operational clarity.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/replicated_simulation.rs` around lines 1206 - 1243, The lookup_retry_result path lacks observability for retry-cache behavior; add a structured debug log in lookup_retry_result (after decoding request and after retrieving the record) that logs primary, request.operation_id, request_slot, and whether the record was found (hit/miss); when found include applied_lsn, record.result_code and whether the outcome was a conflict (compare record.command_fingerprint vs request.command.fingerprint()), and indicate from_retry_cache=true in the log so operators can trace retry-cache hits/misses and conflict outcomes for SubmissionResult.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/allocdb-node/src/replicated_simulation.rs`:
- Around line 1323-1327: The code is incorrectly using
node.highest_prepared_lsn() from every voter when computing target_commit_lsn,
which allows a prepare that exists only on the currently-configured primary to
be promoted into committed history; change the logic that sets target_commit_lsn
(the two max(...) lines) to ignore the still-configured primary (i.e., skip the
node whose metadata().is_configured_primary or matching view/primary identifier)
when calling highest_prepared_lsn(), and ensure
find_prepared_entry()/complete_view_change() logic only considers prepared LSNs
from non-primary voters; add a regression test that simulates a "primary-only
prepared entry" (force complete_view_change() while primary is still in config)
and asserts that such an LSN is not copied/committed on the new primary.
---
Duplicate comments:
In `@docs/testing.md`:
- Around line 261-263: Update the test description to remove the unmodeled
"reply delivery" wording so it matches the harness behavior: change the second
bullet to state that the primary is crashed after quorum append (and after
reply) modeling only interrupted commit propagation, and explicitly reference
the existing behavior in handle_prepare_ack() which publishes client outcome as
soon as quorum append completes; remove any claim that reply delivery can be
interrupted because that pre-reply crash window is not modeled.
---
Nitpick comments:
In `@crates/allocdb-node/src/replicated_simulation.rs`:
- Around line 1206-1243: The lookup_retry_result path lacks observability for
retry-cache behavior; add a structured debug log in lookup_retry_result (after
decoding request and after retrieving the record) that logs primary,
request.operation_id, request_slot, and whether the record was found (hit/miss);
when found include applied_lsn, record.result_code and whether the outcome was a
conflict (compare record.command_fingerprint vs request.command.fingerprint()),
and indicate from_retry_cache=true in the log so operators can trace retry-cache
hits/misses and conflict outcomes for SubmissionResult.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c6085148-432e-4d28-aace-28b83c28e380
📒 Files selected for processing (4)
crates/allocdb-node/src/replicated_simulation.rscrates/allocdb-node/src/replicated_simulation_tests.rsdocs/status.mddocs/testing.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level:errorfor invariant breaks, corruption, and failed operations that require intervention;warnfor degraded but expected conditions such as overload, lag, or rejected requests;infofor meaningful lifecycle and state-transition events;debugfor detailed execution traces useful in development;traceonly for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.
Files:
crates/allocdb-node/src/replicated_simulation_tests.rscrates/allocdb-node/src/replicated_simulation.rs
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.
Files:
docs/status.mddocs/testing.md
docs/status.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep
docs/status.mdcurrent as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.
Files:
docs/status.md
🧠 Learnings (2)
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Applied to files:
crates/allocdb-node/src/replicated_simulation_tests.rsdocs/status.md
📚 Learning: 2026-03-12T15:18:53.086Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.086Z
Learning: Applies to docs/status.md : Keep [`docs/status.md`](./docs/status.md) current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.
Applied to files:
docs/status.md
🧬 Code graph analysis (1)
crates/allocdb-node/src/replicated_simulation.rs (2)
crates/allocdb-node/src/simulation.rs (1)
engine(341-343)crates/allocdb-node/src/replica.rs (1)
engine(719-721)
…icated-tests # Conflicts: # docs/status.md
|
Addressed the latest CodeRabbit finding in Changes:
Reran:
|
Summary
Validation
Closes #54