feat: route topology handoffs via Nexus IPC (#165) #255
Merged
windoliver merged 7 commits into main on Apr 15, 2026
…o Grove (#165)
Unify the EventBus and Nexus IPC delivery paths so topology-driven handoffs flow through Nexus IPC with traceable delivery state.
Architecture:
- EventBus.publish() is now async, returning PublishResult with IPC message ID (NexusEventBus uses NexusIpcClient; LocalEventBus returns {ok: true} synchronously)
- TopologyRouter.route() is async, returns RouteResult[] with per-target message IDs, sends all targets in parallel via Promise.all
- NexusIpcClient extracts the shared POST /api/v2/ipc/send logic from NexusEventBus and NexusWsBridge (DRY)
Handoff state machine:
- Add processed and dead_lettered to HandoffStatus enum
- Add canTransition(from, to) state machine with exhaustive tests
- Add ipcMessageId field to Handoff interface
- Add markProcessed(), markDeadLettered(), setIpcMessageId() to stores
- Happy path: pending_pickup → delivered → processed → replied
- Failure: pending_pickup → dead_lettered, delivered → expired
Storage:
- NexusHandoffStore gains in-memory cache with 5s TTL, invalidated on writes and available for SSE event invalidation
- Rename casUpdate → readModifyWrite (honest naming: CAS is broken)
- IPC message IDs linked back to handoffs after routing (best-effort)
MCP tools:
- grove_list_handoffs description updated for IPC state awareness
- Status enum extended with processed and dead_lettered
- New grove_list_dead_letters tool for DLQ visibility
Cleanup:
- Replace 6 inline appendFileSync debug blocks in NexusWsBridge with shared debugLog() from tui/debug-log.ts
Tests: 166 new/updated tests across 9 test files, all passing.
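The handoff state machine described in this commit can be sketched as a lookup table. This is a minimal illustration, not the project's code: it assumes that only the transitions named in the commit message are allowed, and the `ALLOWED` table name is hypothetical. The real `canTransition()` may permit additional edges.

```typescript
// The six statuses named in this PR (processed and dead_lettered are the new ones).
type HandoffStatus =
  | "pending_pickup"
  | "delivered"
  | "processed"
  | "replied"
  | "expired"
  | "dead_lettered";

// Assumption: only the edges listed in the commit message are valid.
// Happy path: pending_pickup → delivered → processed → replied
// Failure:    pending_pickup → dead_lettered, delivered → expired
const ALLOWED: Record<HandoffStatus, readonly HandoffStatus[]> = {
  pending_pickup: ["delivered", "dead_lettered"],
  delivered: ["processed", "expired"],
  processed: ["replied"],
  replied: [],
  expired: [],
  dead_lettered: [],
};

// Returns true only when `to` is listed as a successor of `from`.
function canTransition(from: HandoffStatus, to: HandoffStatus): boolean {
  return ALLOWED[from].includes(to);
}
```

A table keyed by the full status union makes the compiler flag any status that is added later without a row, which fits the exhaustive 6×6 test approach mentioned in the test plan.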
…g, agent ack (#165)
Wire the remaining functional gaps in the IPC handoff lifecycle:
SSE → handoff status updates (Gap 1):
- NexusWsBridge.handleEvent now calls handoffStore.markDelivered() when a message_delivered SSE event arrives, matching by ipcMessageId
- Bridge accepts handoffStore and ipcClient via options
Dead-lettering on IPC failure (Gap 2):
- contribute.ts routing block checks RouteResult.ok; when false, marks the handoff as dead_lettered with stderr warning
- RouteResult now includes ok + error fields from PublishResult
Agent acknowledgment (Gap 3):
- New grove_ack_handoff MCP tool transitions delivered → processed
- Uses canTransition() to enforce state machine rules
- Returns previous and new status for visibility
NexusWsBridge uses NexusIpcClient (Gap 4):
- send() delegates to ipcClient when injected, falls back to inline fetch for backward compat
Integration test (Gap 5):
- 10 tests covering: contribute → handoff → IPC → status updates, dead-letter path, state machine transitions, multi-target parallel routing, LocalEventBus vs NexusEventBus behavior
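The acknowledgment step (Gap 3) might look roughly like the sketch below. It is an assumption-laden illustration: `ackHandoff`, `AckResult`, and the `Map`-based store are hypothetical names, and the `canTransition` here is a stand-in that models only the single rule the ack path needs.

```typescript
type HandoffStatus =
  | "pending_pickup" | "delivered" | "processed"
  | "replied" | "expired" | "dead_lettered";

interface Handoff {
  id: string;
  status: HandoffStatus;
}

// Hypothetical result shape: reports previous and new status on success.
type AckResult =
  | { ok: true; previousStatus: HandoffStatus; newStatus: HandoffStatus }
  | { ok: false; error: string };

// Stand-in for the project's canTransition(); only the ack rule is modelled.
function canTransition(from: HandoffStatus, to: HandoffStatus): boolean {
  return from === "delivered" && to === "processed";
}

// Moves a handoff from delivered to processed, refusing any other starting state.
function ackHandoff(store: Map<string, Handoff>, id: string): AckResult {
  const handoff = store.get(id);
  if (!handoff) return { ok: false, error: `handoff ${id} not found` };

  const previousStatus = handoff.status;
  if (!canTransition(previousStatus, "processed")) {
    return { ok: false, error: `cannot transition ${previousStatus} → processed` };
  }

  store.set(id, { ...handoff, status: "processed" });
  return { ok: true, previousStatus, newStatus: "processed" };
}
```

Returning both statuses lets the calling agent confirm which transition actually happened instead of inferring it from a bare success flag.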
When the Nexus IPC endpoint (/api/v2/ipc/send) returns 404 or is unreachable, that's an infrastructure issue (endpoint doesn't exist on this Nexus version), not a delivery rejection. Handoffs should stay in their current status and fall back to the session orchestrator's polling path, not be dead-lettered.
Changes:
- IpcSendResult gains infrastructureError flag, set on 404/405/502/503 and connection errors
- NexusIpcClient caches endpoint unavailability after first failure to avoid repeated failed fetches on every contribution
- contribute.ts routing block skips dead-letter when infrastructureError is true; it only dead-letters on actual delivery rejections
- PublishResult and RouteResult propagate the flag through the chain
- NexusWsBridge accepts handoffStore + ipcClient for SSE delivery tracking and DRY send path
- 3 new integration tests: infra error skip, delivery rejection, endpoint caching
This fixes the false dead_lettered status seen in TUI e2e when Nexus VFS is available but the IPC endpoint is not.
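The split between infrastructure errors and delivery rejections, as this commit describes it, could be sketched as follows. Only `IpcSendResult` and `infrastructureError` come from the commit message; the other field names and the helper functions are hypothetical. A later commit in this PR widens the rule so that all non-2xx responses count as infrastructure errors.

```typescript
// Assumed result shape; only infrastructureError is named in the commit message.
interface IpcSendResult {
  ok: boolean;
  messageId?: string;
  error?: string;
  infrastructureError?: boolean;
}

// Statuses this commit treats as "endpoint missing or unavailable".
const INFRA_STATUSES = new Set<number>([404, 405, 502, 503]);

// Maps an HTTP status from POST /api/v2/ipc/send to a send result.
function classifyHttpStatus(status: number, messageId?: string): IpcSendResult {
  if (status >= 200 && status < 300) return { ok: true, messageId };
  return {
    ok: false,
    error: `HTTP ${status}`,
    infrastructureError: INFRA_STATUSES.has(status),
  };
}

// Connection failures (fetch threw) are always infrastructure errors.
function classifyConnectionError(err: unknown): IpcSendResult {
  const message = err instanceof Error ? err.message : String(err);
  return { ok: false, error: message, infrastructureError: true };
}

// The routing block dead-letters only on failures that are not infrastructure errors.
function shouldDeadLetter(result: IpcSendResult): boolean {
  return !result.ok && result.infrastructureError !== true;
}
```

With this split, a 404 leaves the handoff in its current status for the polling path to pick up, while a non-infrastructure failure still reaches the dead-letter queue.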
…t writes, SSE race, silent errors
Round 1 fixes from Codex adversarial review:
1. Transient IPC outage no longer permanently cached (HIGH)
- Only 404/405 permanently disable endpoint (doesn't exist)
- 502/503/network errors use 30s backoff, then retry
- Prevents IPC blackhole after brief Nexus restart
2. Concurrent handoff status writes validated (HIGH)
- NexusHandoffStore.transitionHandoff() checks canTransition()
inside readModifyWrite, rejects stale transitions
- Prevents concurrent SSE delivery + ack from clobbering state
3. SSE delivery race with ipcMessageId write (HIGH)
- updateHandoffDeliveryStatus falls back to matching by
(toRole, pending/delivered, most recent) when ipcMessageId
hasn't been written yet by the fire-and-forget routing block
4. Silent IPC bookkeeping errors now logged (MEDIUM)
- contribute.ts catch block logs handoff ID, target role, and
error via console.warn instead of silently discarding
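The first fix above (permanent disable on 404/405, 30s backoff otherwise) can be sketched as a small availability tracker. The class and method names are hypothetical, and the injectable clock exists only to make the sketch testable; the real NexusIpcClient may track this state differently.

```typescript
type Clock = () => number;

// Tracks whether the IPC endpoint is worth calling right now.
class EndpointAvailability {
  private permanentlyDisabled = false;
  private retryAfterMs = 0;
  private readonly backoffMs: number;
  private readonly now: Clock;

  constructor(backoffMs: number = 30000, now: Clock = () => Date.now()) {
    this.backoffMs = backoffMs;
    this.now = now;
  }

  // False while permanently disabled or inside the backoff window.
  canAttempt(): boolean {
    return !this.permanentlyDisabled && this.now() >= this.retryAfterMs;
  }

  // 404/405: endpoint does not exist, never retry.
  // 502/503/network: transient, retry once the backoff has elapsed.
  recordFailure(status: number | "network"): void {
    if (status === 404 || status === 405) {
      this.permanentlyDisabled = true;
    } else if (status === 502 || status === 503 || status === "network") {
      this.retryAfterMs = this.now() + this.backoffMs;
    }
  }
}
```

Keeping the permanent and transient cases separate is what prevents a brief Nexus restart from turning into a process-wide IPC blackhole.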
…SE matching, expiry scope
1. All non-2xx IPC responses are infrastructureError (HIGH)
- Only explicit delivery rejections (future: 2xx with reject body) dead-letter
- 429/500/401/403 are retryable, not permanent failures
2. No-op transitions skip write entirely (HIGH)
- transitionHandoff reads, validates canTransition, writes only on valid change
- Prevents stale snapshot from clobbering concurrent updates
3. SSE fallback constrained by sender + unlinked filter (HIGH)
- Matches by fromRole===sender and !ipcMessageId to avoid cross-matching
- Prevents wrong handoff correlation under concurrent delivery
4. Transient backoff cleared on success (MEDIUM)
- A successful send proves endpoint is healthy, clears the 30s backoff
- Prevents mixed-outcome batches from causing process-wide outage
5. expireStale covers delivered + processed states (MEDIUM)
- All three store implementations updated: pending_pickup, delivered,
processed are now expirable per the state machine
- Conformance test updated to match
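The constrained SSE fallback (item 3 above, building on the round 1 fallback) might be sketched like this. `fromRole`, `toRole`, and `ipcMessageId` come from the commit messages; the event shape, the `createdAt` field, and the function name are assumptions made for illustration.

```typescript
type HandoffStatus =
  | "pending_pickup" | "delivered" | "processed"
  | "replied" | "expired" | "dead_lettered";

interface Handoff {
  id: string;
  fromRole: string;
  toRole: string;
  status: HandoffStatus;
  ipcMessageId?: string;
  createdAt: number; // assumption: epoch milliseconds
}

// Hypothetical shape of a message_delivered SSE event.
interface DeliveredEvent {
  messageId: string;
  sender: string;
  recipient: string;
}

// Prefers an exact ipcMessageId match; otherwise falls back to the most recent
// unlinked handoff between the same sender and recipient.
function findHandoffForDelivery(
  handoffs: readonly Handoff[],
  event: DeliveredEvent,
): Handoff | undefined {
  const exact = handoffs.find((h) => h.ipcMessageId === event.messageId);
  if (exact) return exact;

  let best: Handoff | undefined;
  for (const h of handoffs) {
    const eligible =
      h.toRole === event.recipient &&
      h.fromRole === event.sender &&
      !h.ipcMessageId && // already-linked handoffs belong to another message
      (h.status === "pending_pickup" || h.status === "delivered");
    if (eligible && (!best || h.createdAt > best.createdAt)) best = h;
  }
  return best;
}
```

Excluding handoffs that already carry an `ipcMessageId` is what keeps a late event from being attributed to a handoff that was linked to a different message.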
Summary
- NexusIpcClient shared abstraction eliminates duplicate IPC send paths
- New MCP tools: grove_list_dead_letters, grove_ack_handoff

What changed
Core refactor:
- EventBus.publish() → async, returns PublishResult with IPC message ID
- TopologyRouter.route() → async, returns RouteResult[], parallel sends via Promise.all
- NexusEventBus delegates to NexusIpcClient; LocalEventBus returns {ok: true}
- NexusWsBridge.send() uses NexusIpcClient when injected (DRY)
Handoff state machine:
- HandoffStatus: added processed, dead_lettered
- canTransition(from, to) enforces valid state transitions
- ipcMessageId field on Handoff for IPC traceability
- markProcessed(), markDeadLettered(), setIpcMessageId() on all 3 store impls
IPC lifecycle wiring:
- message_delivered → handoffStore.markDelivered() in NexusWsBridge
- markDeadLettered() in contribute.ts routing
- grove_ack_handoff tool for agent delivered → processed acknowledgment
Cleanup:
- Replaced inline appendFileSync debug blocks with debugLog()
- casUpdate → readModifyWrite (honest naming: CAS is broken)

Test plan
- HandoffStore conformance suite runs against InMemory + Nexus implementations
- canTransition() tests for all 6×6 transitions
- tmux capture-pane: full coder-reviewer loop, both handoffs show 📬 delivered, zero dead_lettered

Closes #165