feat: thread primitive — ThreadInfo, ThreadRegistry, structural inversion#32
feat: thread primitive — ThreadInfo, ThreadRegistry, structural inversion#32dcetlin wants to merge 4 commits into
Conversation
…sion Threads become durable entities that persist across session deaths. ThreadInfo carries: topic, anchorState, respawnCount, sessionHistory, currentSessionId, threadUrl. Persisted to threads.json. ThreadRegistry handles boot (load/migrate/reconcile), detachedThreads() query for crash recovery, and lifecycle co-updates. Structural inversion: router resolves threads via threadRegistry first (fallback to legacy registry.getByThread for backward compat). All lifecycle callsites co-update ThreadInfo: - doSpawnSession: creates/updates ThreadInfo, sets anchor live/zombie - killSession: calls detachSession, sets anchor killed - bridge-server death detection: detachSession + anchor crashed - anchor-state.ts: co-updates ThreadInfo on every state change Command files updated: resume/respawn/recover use threadRegistry as primary lookup, spawn checks threadRegistry for dead-session-in-thread, health shows thread counts, forks/handoff use threadRegistry for topic. Backward compatible: registry.threadToSession and resolveThreadSession still work. ThreadMember tracking unchanged. Sam's review/build/state- machine untouched. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Hey Sam — wanted to frame why I think this is worth landing now rather than deferring. The way I see hydra's internals, there are three distinct concerns:
Right now, #2 is scattered across SessionInfo fields that get pruned when sessions die. That's what caused the three bugs we hit in one session — orphaned sessions with no ThreadInfo, recovery manifests that couldn't find crashed threads, and anchor state that diverged from session state. This PR makes threads a first-class entity. The key design choices:
The PR is +387/-29, and I've been running this on |
…x cleanup
1. Add 'resurrect' to ThreadSessionEntry.originType and SessionInfo.originType
type unions — doSpawnSession sets originType='resurrect' for existing-thread
spawns, which would violate the type under strict checking.
2. Consolidate double-persist in killSession: detachSession now accepts
{ skipPersist } option. killSession uses it to batch the thread detach +
anchorState='killed' update into a single threadRegistry.persist() call.
Eliminates the crash window where thread state could be inconsistent.
3. Add stale tmux cleanup in recover's threadRegistry path — the legacy
recovery path killed lingering tmux sessions before spawning, but the
threadRegistry path skipped this step.
- Remove dead second routing check (only reachable via threadReply) - Guard threadReply thread reuse against active session threads - Gate bare kill command by msg.isThread Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Decouple thread descriptors from session routing by extracting `currentSessionId` from ThreadInfo into a standalone binding map on ThreadRegistry. Before: ThreadInfo owned `currentSessionId`, routing reached through ThreadInfo to find sessions, and SessionRegistry maintained a parallel `threadToSession` Map. Two sources of truth for one relationship. After: ThreadInfo is a pure descriptor (topic, anchor, history). ThreadRegistry owns a `bindings` Map — the sole source of truth for which session is live on which thread. All routing resolves bindings, never ThreadInfo. Binding lifecycle: spawn → bind, kill → unbind. This eliminates the structural root cause of the threadReply routing bug class: a threadReply thread has no binding, so routing never finds a session. No guard needed — the bug disappears structurally. Changes: - ThreadInfo: remove `currentSessionId` field - ThreadRegistry: add `_bindings` Map with bind/unbind/getBoundSession/isBound - ThreadRegistry: persist bindings to bindings.json, migrate old format on load - SessionRegistry: remove `threadToSession`, `getByThread`, `setThread`, `deleteThread` - resolveThreadSession: queries bindings instead of ThreadInfo - resolveSessionForThread: queries bindings instead of ThreadInfo - session-lifecycle: spawn uses bind(), kill uses unbind() - All command handlers: use getBoundSession() instead of legacy lookups - Tests updated Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Decouple thread descriptors from session routing by extracting `currentSessionId` from ThreadInfo into a standalone binding map on ThreadRegistry. Routing resolves bindings, never ThreadInfo. See PR #32 commit for full rationale. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sf8193
left a comment
There was a problem hiding this comment.
Thanks for the thorough work here — the thread-primary concept is solid. A few questions on specifics before we merge.
| _threadRegistry = mod.threadRegistry | ||
| } | ||
| return _threadRegistry! | ||
| } |
There was a problem hiding this comment.
Could we avoid the lazy dynamic import here? It adds a circular dep workaround that's easy to break silently. Would it work to have session-lifecycle.ts handle the ThreadRegistry co-update instead (it already imports both), or pass threadRegistry as a parameter to setAnchorState?
| thread.anchorState = state | ||
| tr.persist() | ||
| } | ||
| } catch {} |
There was a problem hiding this comment.
Should we surface failures here instead of bare catch {}? If this silently fails, anchor emoji and threads.json diverge without any signal. Even a process.stderr.write would help.
| } | ||
| threadRegistry.persist() | ||
| } | ||
|
|
There was a problem hiding this comment.
This is a broader question — detachSession (line 44) already calls threadRegistry.unbind(), then killSession calls it again here. The double-unbind is harmless but it makes ownership unclear: who's responsible for unbinding?
More generally, does the complexity of spreading ThreadRegistry co-updates across detachSession, killSession, setAnchorState, and bridge-server death detection pay for itself vs. having a single ThreadRegistry.onSessionDeath(sessionId) method that handles all the state transitions in one place?
| } | ||
| } | ||
| if (!skipPersist) threadRegistry.persist() | ||
| } |
There was a problem hiding this comment.
Can detachSession be called twice for the same session (e.g. bridge-server death detection + killSession race)? If so, endedAt gets overwritten silently — does it make sense to guard with if (!histEntry.endedAt)?
| existing.totalMessages += (session.messageCount ?? 0) | ||
| if (session.lastActive > existing.lastActive) { | ||
| existing.lastActive = session.lastActive | ||
| } |
There was a problem hiding this comment.
Does it make sense to keep ~60 lines of migration code that only runs once (first boot after merge) and then sits as dead code? Could we handle this as a one-time migration script instead, or at least add a log + deletion plan?
| if (session.isJoinMember) continue | ||
| if (this.threads.has(session.threadId)) continue | ||
| this.threads.set(session.threadId, { | ||
| threadId: session.threadId, |
There was a problem hiding this comment.
The reconcile loop creates ThreadInfo for every live session not in threads.json. But if threads.json was deleted or corrupted, this recreates entries for sessions that may be zombies (tmux alive but bridge dead). Should we also check bridge health here, or is that overkill for a recovery path?
| if (msg.hasExistingThread && msg.existingThreadId) { | ||
| const existingIsSession = msg.hasExistingThread && msg.existingThreadId | ||
| && threadRegistry.isBound(msg.existingThreadId) | ||
| if (msg.hasExistingThread && msg.existingThreadId && !existingIsSession) { |
There was a problem hiding this comment.
Can we confirm this behavior change is intentional? Previously, if a message had an existing thread, it would reuse it. Now if that thread is bound to a session, we skip it and create a new thread instead. Is there a case where we reach this point with a session-bound existingThreadId that wasn't caught by the routing guard above?
| // --------------------------------------------------------------------------- | ||
|
|
||
| /** Resolve a thread/channel ID to a session ID via the binding map */ | ||
| function resolveSessionForThread(channelId: string, existingThreadId?: string): string | undefined { |
There was a problem hiding this comment.
Two questions on routing:
-
The old code had a second fallback block (lines 446-455 on main) that handled
threadReply-created threads wherechat_idwas rewritten butmsg.isThreadis false. This PR removes it. DoesresolveSessionForThreadin themsg.isThreadblock above cover that case?threadReplymessages aren'tmsg.isThread— so this might silently break threadReply channel routing. -
Does it make sense to also check
resolveThreadSessionFromMsg(the newer method usingeffectiveThreadIdfor Slack compat)? This PR'sresolveSessionForThreaduseschannelId/existingThreadIdwhich is the older pattern — wanted to confirm Slack DM routing still works correctly.
|
|
||
| void setAnchorState(threadId!, respawnCount > 0 ? 'zombie' : 'live', respawnCount).catch(() => {}) | ||
|
|
||
| return { name: tmuxName, sessionId, threadId: threadId!, url } |
There was a problem hiding this comment.
Should we batch the threadRegistry.persist() calls? A session crash can trigger: death detection → detachSession → persist → setAnchorState → persist → killSession → detachSession → persist. That's 3+ writes of two files each in quick succession. Not a correctness issue, but worth noting.
|
Closing this in favor of a lighter approach. Your review surfaced the core tension: the thread-primary structural inversion adds real complexity (lazy imports, double-unbind ownership, migration code, persist batching) to make ThreadInfo load-bearing for routing. The concrete value — persistent metadata across session deaths (respawnCount, sessionHistory, anchor state, topic) — doesn't require that routing depend on ThreadInfo. The replacement PR will introduce ThreadInfo as metadata-only: it observes and records, but routing never consults it. This gives us the durable thread metadata that recovery features need (emoji state machine, respawn counts, session history) without the structural inversion or the routing complexity you flagged. On your specific comments:
|
|
Superseded — replacing with metadata-only ThreadInfo approach. |
- Refuse `hydra up` when platform is already running (prevents gotcha #32 duplicate-main ping-pong). Points to restart/down. - Extract stop-byte.sh for clean byte teardown — eliminates fragile shell function coupling and command injection vector in lifecycleDown. - Replace execSync('sleep 0.5') with Bun.sleep(500). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pre-flight in `up` checks for orphaned claude processes (not just tmux sessions) to prevent gotcha #32 ping-pong - Validate byte script exists before starting daemon — prevents unknown platforms from leaving a half-started daemon - Replace VALID_PLATFORMS enum with filesystem-based validation: check for start-{platform}-byte.sh, discord falls back to start-byte-v2.sh (legacy name) - `down` removes daemon.sock + daemon.pid so discoverSockets() doesn't show phantom daemons Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add daemon+byte lifecycle commands to the hydra CLI. Platform is always required (no default). - `hydra up <platform>` — validate byte script exists, check for running tmux sessions and orphaned claude processes (prevents gotcha #32 ping-pong), start daemon, wait for socket, start byte - `hydra down <platform>` — stop byte via stop-byte.sh (orphan cleanup), stop daemon, remove stale socket + PID file - `hydra restart <platform>` — restart daemon only (picks up code changes) No hardcoded platform enum — uses filesystem-based validation (does start-{platform}-byte.sh exist?). New platforms work with zero CLI changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Why
Sessions die routinely. When they do, their thread context scatters — topic, anchor state, session history all live on SessionInfo, which gets pruned. Recovery means reconstructing thread context from fragments across sessions.json and threadToSession.
This PR introduces threads as a durable entity. A thread knows its own state independently of any session. Sessions attach and detach. "Dead session" becomes "thread between sessions."
Architecture: Three concerns, three owners
The thread-primary architecture separates three concerns that were previously entangled on SessionInfo:
Map<threadId, sessionId>persisted tobindings.json. Spawn creates a binding; kill removes it.Routing resolves bindings, not ThreadInfo. A thread with no binding has no session — no guard logic needed, no ambiguity possible. This eliminates the structural root cause of the threadReply routing bug class (where auto-created threads could accidentally match session-spawned threads).
What
B2: ThreadInfo + ThreadRegistry (sessions.ts)
ThreadInfo: threadId, anchorState, topic, respawnCount, sessionHistory, threadUrlThreadRegistry: load/migrate/reconcile on boot, detachedThreads() for crash recovery, persist to threads.jsonB4: Binding extraction (sessions.ts, all consumers)
ThreadRegistry._bindings:Map<threadId, sessionId>— the routing lookupbind(),unbind(),getBoundSession(),isBound()— the binding APIbindings.json, migrated from old-formatcurrentSessionIdon loadSessionRegistry.threadToSessionremoved (was a parallel source of truth)resolveSessionForThread()andresolveThreadSession()query bindings onlydetachSession() (session-lifecycle.ts)
Lifecycle co-updates (session-lifecycle.ts)
doSpawnSession: creates/updates ThreadInfo, binds session, sets anchor live/zombiekillSession: calls detachSession, unbinds, sets anchor killedConsumer updates (6 files)
resolveSessionForThread()queriesgetBoundSession()isBound()instead ofcurrentSessionIdgetBoundSession()getBoundSession()getBoundSession()getBoundSession()for thread ownershipKey design decisions
currentSessionIdon ThreadInfo — the binding map IS the relationship. ThreadInfo is purely descriptive.currentSessionIdis read, extracted into bindings, and the field is stripped on load.resolveSessionForThreadstays in router.ts — it's a convenience wrapper overgetBoundSession, not a data-model concern.Test plan
bun build daemon.ts+bun build bridge.tscompile clean🤖 Generated with Claude Code