feat: thread primitive — ThreadInfo, ThreadRegistry, structural inversion by dcetlin · Pull Request #32 · sf8193/hydra

dcetlin · 2026-06-23T19:32:20Z

Why

Sessions die routinely. When they do, their thread context scatters — topic, anchor state, session history all live on SessionInfo, which gets pruned. Recovery means reconstructing thread context from fragments across sessions.json and threadToSession.

This PR introduces threads as a durable entity. A thread knows its own state independently of any session. Sessions attach and detach. "Dead session" becomes "thread between sessions."

Architecture: Three concerns, three owners

The thread-primary architecture separates three concerns that were previously entangled on SessionInfo:

Thread descriptors (ThreadInfo) — durable metadata: topic, anchor state, session history, respawn count. Survives session death. Never consulted for routing.
Bindings (ThreadRegistry._bindings) — the sole source of truth for "which session is live on which thread." A Map<threadId, sessionId> persisted to bindings.json. Spawn creates a binding; kill removes it.
Sessions (SessionInfo) — compute process metadata: tmux name, listening state, capabilities. Ephemeral — pruned when the process dies.

Routing resolves bindings, not ThreadInfo. A thread with no binding has no session — no guard logic needed, no ambiguity possible. This eliminates the structural root cause of the threadReply routing bug class (where auto-created threads could accidentally match session-spawned threads).

What

B2: ThreadInfo + ThreadRegistry (sessions.ts)

ThreadInfo: threadId, anchorState, topic, respawnCount, sessionHistory, threadUrl
ThreadRegistry: load/migrate/reconcile on boot, detachedThreads() for crash recovery, persist to threads.json
Orphan reconciliation: sessions spawned between persist and crash get ThreadInfo entries on boot

B4: Binding extraction (sessions.ts, all consumers)

ThreadRegistry._bindings: Map<threadId, sessionId> — the routing lookup
bind(), unbind(), getBoundSession(), isBound() — the binding API
Persisted to bindings.json, migrated from old-format currentSessionId on load
SessionRegistry.threadToSession removed (was a parallel source of truth)
resolveSessionForThread() and resolveThreadSession() query bindings only

detachSession() (session-lifecycle.ts)

Clean session death: unbinds thread, timestamps history entry

Lifecycle co-updates (session-lifecycle.ts)

doSpawnSession: creates/updates ThreadInfo, binds session, sets anchor live/zombie
killSession: calls detachSession, unbinds, sets anchor killed

Consumer updates (6 files)

Router: resolveSessionForThread() queries getBoundSession()
Router: threadReply guard uses isBound() instead of currentSessionId
bridge-dispatch: kill-by-thread uses getBoundSession()
commands/thread.ts: resume/respawn use getBoundSession()
commands/build.ts, review.ts: use getBoundSession()
commands/global.ts: spawn checks getBoundSession() for thread ownership

Key design decisions

No currentSessionId on ThreadInfo — the binding map IS the relationship. ThreadInfo is purely descriptive.
Separate persistence — bindings.json exists alongside threads.json. Clean separation, independent lifecycle.
Backward compatible migration — old-format threads.json with currentSessionId is read, extracted into bindings, and the field is stripped on load.
resolveSessionForThread stays in router.ts — it's a convenience wrapper over getBoundSession, not a data-model concern.

Test plan

bun build daemon.ts + bun build bridge.ts compile clean
Daemon boots, threads.json + bindings.json created on first run
Spawn → ThreadInfo created, binding created, anchor 🚀
Kill → binding removed, ThreadInfo detached, anchor ☠️
Session crash → death detection fires, binding removed, anchor 💥
resume → session reattaches, new binding to existing ThreadInfo
respawn → new session, new binding, ThreadInfo updated
threadReply message in channel with session threads → routes to main, not session (binding-based guard)
/review, /build, fork, handoff all resolve via getBoundSession()

🤖 Generated with Claude Code

…sion Threads become durable entities that persist across session deaths. ThreadInfo carries: topic, anchorState, respawnCount, sessionHistory, currentSessionId, threadUrl. Persisted to threads.json. ThreadRegistry handles boot (load/migrate/reconcile), detachedThreads() query for crash recovery, and lifecycle co-updates. Structural inversion: router resolves threads via threadRegistry first (fallback to legacy registry.getByThread for backward compat). All lifecycle callsites co-update ThreadInfo: - doSpawnSession: creates/updates ThreadInfo, sets anchor live/zombie - killSession: calls detachSession, sets anchor killed - bridge-server death detection: detachSession + anchor crashed - anchor-state.ts: co-updates ThreadInfo on every state change Command files updated: resume/respawn/recover use threadRegistry as primary lookup, spawn checks threadRegistry for dead-session-in-thread, health shows thread counts, forks/handoff use threadRegistry for topic. Backward compatible: registry.threadToSession and resolveThreadSession still work. ThreadMember tracking unchanged. Sam's review/build/state- machine untouched. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dcetlin · 2026-06-24T20:45:48Z

Hey Sam — wanted to frame why I think this is worth landing now rather than deferring.

The way I see hydra's internals, there are three distinct concerns:

Features define behavior — review, build, design each own their state machines, participant tracking, and turn management. They don't need to know how sessions are found or recovered.
Threads carry continuity — topic, anchor state, session history, respawn count. This state outlives any individual session. When a session dies, the thread remembers what was happening and who was there.
Sessions provide compute — context window, tool access, active reasoning. Ephemeral by nature. They attach to threads, do work, and eventually detach (cleanly or by crashing).

Right now, #2 is scattered across SessionInfo fields that get pruned when sessions die. That's what caused the three bugs we hit in one session — orphaned sessions with no ThreadInfo, recovery manifests that couldn't find crashed threads, and anchor state that diverged from session state.

This PR makes threads a first-class entity. The key design choices:

ThreadInfo owns the durable state, SessionInfo stays ephemeral
detachSession() is the clean death path — nulls currentSessionId, timestamps history
Backward compatible — resolveThreadSession() checks threadRegistry first, falls back to legacy. Your review/build/state-machine/tests are completely untouched.
No feature changes — this is structural only. Same behavior, but the state lives where it belongs.

The PR is +387/-29, and I've been running this on live for a week with zero issues. Happy to walk through any part of it.

…x cleanup 1. Add 'resurrect' to ThreadSessionEntry.originType and SessionInfo.originType type unions — doSpawnSession sets originType='resurrect' for existing-thread spawns, which would violate the type under strict checking. 2. Consolidate double-persist in killSession: detachSession now accepts { skipPersist } option. killSession uses it to batch the thread detach + anchorState='killed' update into a single threadRegistry.persist() call. Eliminates the crash window where thread state could be inconsistent. 3. Add stale tmux cleanup in recover's threadRegistry path — the legacy recovery path killed lingering tmux sessions before spawning, but the threadRegistry path skipped this step.

- Remove dead second routing check (only reachable via threadReply) - Guard threadReply thread reuse against active session threads - Gate bare kill command by msg.isThread Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Decouple thread descriptors from session routing by extracting `currentSessionId` from ThreadInfo into a standalone binding map on ThreadRegistry. Before: ThreadInfo owned `currentSessionId`, routing reached through ThreadInfo to find sessions, and SessionRegistry maintained a parallel `threadToSession` Map. Two sources of truth for one relationship. After: ThreadInfo is a pure descriptor (topic, anchor, history). ThreadRegistry owns a `bindings` Map — the sole source of truth for which session is live on which thread. All routing resolves bindings, never ThreadInfo. Binding lifecycle: spawn → bind, kill → unbind. This eliminates the structural root cause of the threadReply routing bug class: a threadReply thread has no binding, so routing never finds a session. No guard needed — the bug disappears structurally. Changes: - ThreadInfo: remove `currentSessionId` field - ThreadRegistry: add `_bindings` Map with bind/unbind/getBoundSession/isBound - ThreadRegistry: persist bindings to bindings.json, migrate old format on load - SessionRegistry: remove `threadToSession`, `getByThread`, `setThread`, `deleteThread` - resolveThreadSession: queries bindings instead of ThreadInfo - resolveSessionForThread: queries bindings instead of ThreadInfo - session-lifecycle: spawn uses bind(), kill uses unbind() - All command handlers: use getBoundSession() instead of legacy lookups - Tests updated Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Decouple thread descriptors from session routing by extracting `currentSessionId` from ThreadInfo into a standalone binding map on ThreadRegistry. Routing resolves bindings, never ThreadInfo. See PR #32 commit for full rationale. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sf8193

Thanks for the thorough work here — the thread-primary concept is solid. A few questions on specifics before we merge.

sf8193 · 2026-06-25T05:01:05Z

+    _threadRegistry = mod.threadRegistry
+  }
+  return _threadRegistry!
+}


Could we avoid the lazy dynamic import here? It adds a circular dep workaround that's easy to break silently. Would it work to have session-lifecycle.ts handle the ThreadRegistry co-update instead (it already imports both), or pass threadRegistry as a parameter to setAnchorState?

sf8193 · 2026-06-25T05:01:05Z

+      thread.anchorState = state
+      tr.persist()
+    }
+  } catch {}


Should we surface failures here instead of bare catch {}? If this silently fails, anchor emoji and threads.json diverge without any signal. Even a process.stderr.write would help.

sf8193 · 2026-06-25T05:01:05Z

+      }
+      threadRegistry.persist()
+    }
+


This is a broader question — detachSession (line 44) already calls threadRegistry.unbind(), then killSession calls it again here. The double-unbind is harmless but it makes ownership unclear: who's responsible for unbinding?

More generally, does the complexity of spreading ThreadRegistry co-updates across detachSession, killSession, setAnchorState, and bridge-server death detection pay for itself vs. having a single ThreadRegistry.onSessionDeath(sessionId) method that handles all the state transitions in one place?

sf8193 · 2026-06-25T05:01:05Z

+      }
+    }
+    if (!skipPersist) threadRegistry.persist()
+  }


Can detachSession be called twice for the same session (e.g. bridge-server death detection + killSession race)? If so, endedAt gets overwritten silently — does it make sense to guard with if (!histEntry.endedAt)?

sf8193 · 2026-06-25T05:01:05Z

+        existing.totalMessages += (session.messageCount ?? 0)
+        if (session.lastActive > existing.lastActive) {
+          existing.lastActive = session.lastActive
+        }


Does it make sense to keep ~60 lines of migration code that only runs once (first boot after merge) and then sits as dead code? Could we handle this as a one-time migration script instead, or at least add a log + deletion plan?

sf8193 · 2026-06-25T05:01:05Z

+      if (session.isJoinMember) continue
+      if (this.threads.has(session.threadId)) continue
+      this.threads.set(session.threadId, {
+        threadId: session.threadId,


The reconcile loop creates ThreadInfo for every live session not in threads.json. But if threads.json was deleted or corrupted, this recreates entries for sessions that may be zombies (tmux alive but bridge dead). Should we also check bridge health here, or is that overkill for a recovery path?

sf8193 · 2026-06-25T05:01:05Z

-      if (msg.hasExistingThread && msg.existingThreadId) {
+      const existingIsSession = msg.hasExistingThread && msg.existingThreadId
+        && threadRegistry.isBound(msg.existingThreadId)
+      if (msg.hasExistingThread && msg.existingThreadId && !existingIsSession) {


Can we confirm this behavior change is intentional? Previously, if a message had an existing thread, it would reuse it. Now if that thread is bound to a session, we skip it and create a new thread instead. Is there a case where we reach this point with a session-bound existingThreadId that wasn't caught by the routing guard above?

sf8193 · 2026-06-25T05:01:05Z

+// ---------------------------------------------------------------------------
+
+/** Resolve a thread/channel ID to a session ID via the binding map */
+function resolveSessionForThread(channelId: string, existingThreadId?: string): string | undefined {


Two questions on routing:

The old code had a second fallback block (lines 446-455 on main) that handled threadReply-created threads where chat_id was rewritten but msg.isThread is false. This PR removes it. Does resolveSessionForThread in the msg.isThread block above cover that case? threadReply messages aren't msg.isThread — so this might silently break threadReply channel routing.

Does it make sense to also check resolveThreadSessionFromMsg (the newer method using effectiveThreadId for Slack compat)? This PR's resolveSessionForThread uses channelId/existingThreadId which is the older pattern — wanted to confirm Slack DM routing still works correctly.

sf8193 · 2026-06-25T05:01:05Z

+
+  void setAnchorState(threadId!, respawnCount > 0 ? 'zombie' : 'live', respawnCount).catch(() => {})

  return { name: tmuxName, sessionId, threadId: threadId!, url }


Should we batch the threadRegistry.persist() calls? A session crash can trigger: death detection → detachSession → persist → setAnchorState → persist → killSession → detachSession → persist. That's 3+ writes of two files each in quick succession. Not a correctness issue, but worth noting.

dcetlin · 2026-06-25T17:10:02Z

Closing this in favor of a lighter approach.

Your review surfaced the core tension: the thread-primary structural inversion adds real complexity (lazy imports, double-unbind ownership, migration code, persist batching) to make ThreadInfo load-bearing for routing. The concrete value — persistent metadata across session deaths (respawnCount, sessionHistory, anchor state, topic) — doesn't require that routing depend on ThreadInfo.

The replacement PR will introduce ThreadInfo as metadata-only: it observes and records, but routing never consults it. This gives us the durable thread metadata that recovery features need (emoji state machine, respawn counts, session history) without the structural inversion or the routing complexity you flagged.

On your specific comments:

Lazy import / circular dep — goes away entirely. Metadata-only ThreadInfo doesn't need co-updates from anchor-state.
Bare catch in setAnchorState — agreed, will add logging in the replacement.
Double-unbind ownership / onSessionDeath — goes away. No binding map, no unbind.
detachSession race (endedAt overwrite) — valid, will guard in the replacement.
Migration code as dead weight — goes away or shrinks dramatically with the lighter model.
Reconcile loop zombie detection — simpler reconcile in the lighter model.
threadReply behavior change — keeping this as a standalone routing fix.
Second routing fallback removal — this was a bug fix (the block was the root cause of a session-leak via threadReply). Will port standalone.
Persist batching — simpler with fewer persist sites.

dcetlin · 2026-06-25T17:10:07Z

Superseded — replacing with metadata-only ThreadInfo approach.

- Refuse `hydra up` when platform is already running (prevents gotcha #32 duplicate-main ping-pong). Points to restart/down. - Extract stop-byte.sh for clean byte teardown — eliminates fragile shell function coupling and command injection vector in lifecycleDown. - Replace execSync('sleep 0.5') with Bun.sleep(500). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Pre-flight in `up` checks for orphaned claude processes (not just tmux sessions) to prevent gotcha #32 ping-pong - Validate byte script exists before starting daemon — prevents unknown platforms from leaving a half-started daemon - Replace VALID_PLATFORMS enum with filesystem-based validation: check for start-{platform}-byte.sh, discord falls back to start-byte-v2.sh (legacy name) - `down` removes daemon.sock + daemon.pid so discoverSockets() doesn't show phantom daemons Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add daemon+byte lifecycle commands to the hydra CLI. Platform is always required (no default). - `hydra up <platform>` — validate byte script exists, check for running tmux sessions and orphaned claude processes (prevents gotcha #32 ping-pong), start daemon, wait for socket, start byte - `hydra down <platform>` — stop byte via stop-byte.sh (orphan cleanup), stop daemon, remove stale socket + PID file - `hydra restart <platform>` — restart daemon only (picks up code changes) No hardcoded platform enum — uses filesystem-based validation (does start-{platform}-byte.sh exist?). New platforms work with zero CLI changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dcetlin requested a review from sf8193 June 23, 2026 19:32

This was referenced Jun 24, 2026

feat: recovery quality + fork fallback + operational improvements #34

Closed

feat: recovery infrastructure + commands — ThreadRegistry, anchor state, resume/respawn/recover #27

Closed

dcetlin and others added 3 commits June 24, 2026 14:55

sf8193 reviewed Jun 25, 2026

View reviewed changes

dcetlin closed this Jun 25, 2026

This was referenced Jun 25, 2026

feat: lightweight thread metadata + routing fixes #37

Merged

feat: Slack List sync for session tracking #38

Closed


		void setAnchorState(threadId!, respawnCount > 0 ? 'zombie' : 'live', respawnCount).catch(() => {})

		return { name: tmuxName, sessionId, threadId: threadId!, url }

Conversation

dcetlin commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

Architecture: Three concerns, three owners

What

B2: ThreadInfo + ThreadRegistry (sessions.ts)

B4: Binding extraction (sessions.ts, all consumers)

detachSession() (session-lifecycle.ts)

Lifecycle co-updates (session-lifecycle.ts)

Consumer updates (6 files)

Key design decisions

Test plan

Uh oh!

dcetlin commented Jun 24, 2026

Uh oh!

sf8193 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dcetlin commented Jun 25, 2026

Uh oh!

dcetlin commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dcetlin commented Jun 23, 2026 •

edited

Loading