
feat: thread primitive — ThreadInfo, ThreadRegistry, structural inversion #32

Closed
dcetlin wants to merge 4 commits into main from thread-primary-v2

Conversation


@dcetlin dcetlin commented Jun 23, 2026

Collaborator

Why

Sessions die routinely. When they do, their thread context scatters — topic, anchor state, session history all live on SessionInfo, which gets pruned. Recovery means reconstructing thread context from fragments across sessions.json and threadToSession.

This PR introduces threads as a durable entity. A thread knows its own state independently of any session. Sessions attach and detach. "Dead session" becomes "thread between sessions."

Architecture: Three concerns, three owners

The thread-primary architecture separates three concerns that were previously entangled on SessionInfo:

  1. Thread descriptors (ThreadInfo) — durable metadata: topic, anchor state, session history, respawn count. Survives session death. Never consulted for routing.
  2. Bindings (ThreadRegistry._bindings) — the sole source of truth for "which session is live on which thread." A Map<threadId, sessionId> persisted to bindings.json. Spawn creates a binding; kill removes it.
  3. Sessions (SessionInfo) — compute process metadata: tmux name, listening state, capabilities. Ephemeral — pruned when the process dies.

Routing resolves bindings, not ThreadInfo. A thread with no binding has no session — no guard logic needed, no ambiguity possible. This eliminates the structural root cause of the threadReply routing bug class (where auto-created threads could accidentally match session-spawned threads).
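A minimal sketch of the descriptor/binding split described above. The method names (`bind`, `unbind`, `getBoundSession`, `isBound`) and the ThreadInfo field names come from this PR; the field types and the in-memory-only storage are assumptions, and persistence to threads.json and bindings.json is left out.

```typescript
// Hypothetical shapes: field names follow the PR description, types are assumed.
type AnchorState = "live" | "zombie" | "killed" | "crashed";

interface ThreadInfo {
  threadId: string;
  topic: string;
  anchorState: AnchorState;
  respawnCount: number;
}

class ThreadRegistry {
  // Durable descriptors: survive session death, never consulted for routing.
  readonly threads = new Map<string, ThreadInfo>();
  // threadId -> sessionId: the only structure routing reads.
  private readonly _bindings = new Map<string, string>();

  bind(threadId: string, sessionId: string): void {
    this._bindings.set(threadId, sessionId);
  }
  unbind(threadId: string): boolean {
    return this._bindings.delete(threadId);
  }
  getBoundSession(threadId: string): string | undefined {
    return this._bindings.get(threadId);
  }
  isBound(threadId: string): boolean {
    return this._bindings.has(threadId);
  }
}

// Spawn binds, kill unbinds; the descriptor outlives both.
const registry = new ThreadRegistry();
registry.threads.set("t1", { threadId: "t1", topic: "demo", anchorState: "live", respawnCount: 0 });
registry.bind("t1", "s1");
const routedWhileLive = registry.getBoundSession("t1");
registry.unbind("t1");
const routedAfterKill = registry.getBoundSession("t1");
```

After the unbind, the thread is still present in `threads` but routing finds no session for it, which is the "thread between sessions" state.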

What

B2: ThreadInfo + ThreadRegistry (sessions.ts)

  • ThreadInfo: threadId, anchorState, topic, respawnCount, sessionHistory, threadUrl
  • ThreadRegistry: load/migrate/reconcile on boot, detachedThreads() for crash recovery, persist to threads.json
  • Orphan reconciliation: sessions spawned between persist and crash get ThreadInfo entries on boot

B4: Binding extraction (sessions.ts, all consumers)

  • ThreadRegistry._bindings: Map<threadId, sessionId> — the routing lookup
  • bind(), unbind(), getBoundSession(), isBound() — the binding API
  • Persisted to bindings.json, migrated from old-format currentSessionId on load
  • SessionRegistry.threadToSession removed (was a parallel source of truth)
  • resolveSessionForThread() and resolveThreadSession() query bindings only

detachSession() (session-lifecycle.ts)

  • Clean session death: unbinds thread, timestamps history entry

Lifecycle co-updates (session-lifecycle.ts)

  • doSpawnSession: creates/updates ThreadInfo, binds session, sets anchor live/zombie
  • killSession: calls detachSession, unbinds, sets anchor killed

Consumer updates (6 files)

  • Router: resolveSessionForThread() queries getBoundSession()
  • Router: threadReply guard uses isBound() instead of currentSessionId
  • bridge-dispatch: kill-by-thread uses getBoundSession()
  • commands/thread.ts: resume/respawn use getBoundSession()
  • commands/build.ts, review.ts: use getBoundSession()
  • commands/global.ts: spawn checks getBoundSession() for thread ownership

Key design decisions

  • No currentSessionId on ThreadInfo — the binding map IS the relationship. ThreadInfo is purely descriptive.
  • Separate persistence — bindings.json exists alongside threads.json. Clean separation, independent lifecycle.
  • Backward compatible migration — old-format threads.json with currentSessionId is read, extracted into bindings, and the field is stripped on load.
  • resolveSessionForThread stays in router.ts — it's a convenience wrapper over getBoundSession, not a data-model concern.
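The backward-compatible migration could look roughly like the following. This is a sketch under assumptions: the record shapes are simplified, and `migrateThreads` is a hypothetical name, not a function from the PR.

```typescript
// Assumed old on-disk shape: the descriptor still carries currentSessionId.
interface LegacyThreadRecord {
  threadId: string;
  topic: string;
  currentSessionId?: string | null;
}

// Assumed new shape: purely descriptive, no session reference.
interface ThreadRecord {
  threadId: string;
  topic: string;
}

function migrateThreads(legacy: LegacyThreadRecord[]): {
  threads: ThreadRecord[];
  bindings: Map<string, string>;
} {
  const bindings = new Map<string, string>();
  const threads = legacy.map(({ currentSessionId, ...rest }) => {
    // A non-empty currentSessionId becomes a binding; the field itself is dropped.
    if (currentSessionId) bindings.set(rest.threadId, currentSessionId);
    return rest;
  });
  return { threads, bindings };
}
```

Records that are already in the new format pass through unchanged, so running the function on every load is harmless.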

Test plan

  • bun build daemon.ts + bun build bridge.ts compile clean
  • Daemon boots, threads.json + bindings.json created on first run
  • Spawn → ThreadInfo created, binding created, anchor 🚀
  • Kill → binding removed, ThreadInfo detached, anchor ☠️
  • Session crash → death detection fires, binding removed, anchor 💥
  • resume → session reattaches, new binding to existing ThreadInfo
  • respawn → new session, new binding, ThreadInfo updated
  • threadReply message in channel with session threads → routes to main, not session (binding-based guard)
  • /review, /build, fork, handoff all resolve via getBoundSession()

🤖 Generated with Claude Code

…sion

Threads become durable entities that persist across session deaths.

ThreadInfo carries: topic, anchorState, respawnCount, sessionHistory,
currentSessionId, threadUrl. Persisted to threads.json. ThreadRegistry
handles boot (load/migrate/reconcile), detachedThreads() query for
crash recovery, and lifecycle co-updates.

Structural inversion: router resolves threads via threadRegistry first
(fallback to legacy registry.getByThread for backward compat). All
lifecycle callsites co-update ThreadInfo:
- doSpawnSession: creates/updates ThreadInfo, sets anchor live/zombie
- killSession: calls detachSession, sets anchor killed
- bridge-server death detection: detachSession + anchor crashed
- anchor-state.ts: co-updates ThreadInfo on every state change

Command files updated: resume/respawn/recover use threadRegistry as
primary lookup, spawn checks threadRegistry for dead-session-in-thread,
health shows thread counts, forks/handoff use threadRegistry for topic.

Backward compatible: registry.threadToSession and resolveThreadSession
still work. ThreadMember tracking unchanged. Sam's review/build/state-
machine untouched.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dcetlin commented Jun 24, 2026

Collaborator Author

Hey Sam — wanted to frame why I think this is worth landing now rather than deferring.

The way I see hydra's internals, there are three distinct concerns:

  1. Features define behavior — review, build, design each own their state machines, participant tracking, and turn management. They don't need to know how sessions are found or recovered.

  2. Threads carry continuity — topic, anchor state, session history, respawn count. This state outlives any individual session. When a session dies, the thread remembers what was happening and who was there.

  3. Sessions provide compute — context window, tool access, active reasoning. Ephemeral by nature. They attach to threads, do work, and eventually detach (cleanly or by crashing).

Right now, concern 2 (thread continuity) is scattered across SessionInfo fields that get pruned when sessions die. That's what caused the three bugs we hit in one session: orphaned sessions with no ThreadInfo, recovery manifests that couldn't find crashed threads, and anchor state that diverged from session state.

This PR makes threads a first-class entity. The key design choices:

  • ThreadInfo owns the durable state, SessionInfo stays ephemeral
  • detachSession() is the clean death path — nulls currentSessionId, timestamps history
  • Backward compatible: resolveThreadSession() checks threadRegistry first, falls back to legacy. Your review/build/state-machine/tests are completely untouched.
  • No feature changes — this is structural only. Same behavior, but the state lives where it belongs.

The PR is +387/-29, and I've been running this on live for a week with zero issues. Happy to walk through any part of it.

dcetlin and others added 3 commits June 24, 2026 14:55
…x cleanup

1. Add 'resurrect' to ThreadSessionEntry.originType and SessionInfo.originType
   type unions — doSpawnSession sets originType='resurrect' for existing-thread
   spawns, which would violate the type under strict checking.

2. Consolidate double-persist in killSession: detachSession now accepts
   { skipPersist } option. killSession uses it to batch the thread detach +
   anchorState='killed' update into a single threadRegistry.persist() call.
   Eliminates the crash window where thread state could be inconsistent.

3. Add stale tmux cleanup in recover's threadRegistry path — the legacy
   recovery path killed lingering tmux sessions before spawning, but the
   threadRegistry path skipped this step.
- Remove dead second routing check (only reachable via threadReply)
- Guard threadReply thread reuse against active session threads
- Gate bare kill command by msg.isThread

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
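The double-persist consolidation from the cleanup commit (item 2 above) can be sketched as follows. The `{ skipPersist }` option is from the commit message; the `Registry` class, its counter, and the simplified thread shape are stand-ins invented for illustration.

```typescript
interface Thread {
  anchorState: "live" | "killed";
  detached: boolean;
}

class Registry {
  persistCount = 0;
  readonly threads = new Map<string, Thread>();
  persist(): void {
    this.persistCount += 1; // stand-in for writing threads.json
  }
}

function detachSession(reg: Registry, threadId: string, opts: { skipPersist?: boolean } = {}): void {
  const thread = reg.threads.get(threadId);
  if (thread) thread.detached = true;
  if (!opts.skipPersist) reg.persist();
}

function killSession(reg: Registry, threadId: string): void {
  // Defer the write so the detach and the anchor update land together.
  detachSession(reg, threadId, { skipPersist: true });
  const thread = reg.threads.get(threadId);
  if (thread) thread.anchorState = "killed";
  reg.persist(); // one write covers both changes
}
```

With a single write there is no point at which the file on disk shows a detached thread whose anchor is still live.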
Decouple thread descriptors from session routing by extracting
`currentSessionId` from ThreadInfo into a standalone binding map
on ThreadRegistry.

Before: ThreadInfo owned `currentSessionId`, routing reached through
ThreadInfo to find sessions, and SessionRegistry maintained a parallel
`threadToSession` Map. Two sources of truth for one relationship.

After: ThreadInfo is a pure descriptor (topic, anchor, history).
ThreadRegistry owns a `bindings` Map — the sole source of truth for
which session is live on which thread. All routing resolves bindings,
never ThreadInfo. Binding lifecycle: spawn → bind, kill → unbind.

This eliminates the structural root cause of the threadReply routing
bug class: a threadReply thread has no binding, so routing never finds
a session. No guard needed — the bug disappears structurally.

Changes:
- ThreadInfo: remove `currentSessionId` field
- ThreadRegistry: add `_bindings` Map with bind/unbind/getBoundSession/isBound
- ThreadRegistry: persist bindings to bindings.json, migrate old format on load
- SessionRegistry: remove `threadToSession`, `getByThread`, `setThread`, `deleteThread`
- resolveThreadSession: queries bindings instead of ThreadInfo
- resolveSessionForThread: queries bindings instead of ThreadInfo
- session-lifecycle: spawn uses bind(), kill uses unbind()
- All command handlers: use getBoundSession() instead of legacy lookups
- Tests updated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dcetlin added a commit that referenced this pull request Jun 25, 2026
Decouple thread descriptors from session routing by extracting
`currentSessionId` from ThreadInfo into a standalone binding map
on ThreadRegistry. Routing resolves bindings, never ThreadInfo.

See PR #32 commit for full rationale.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@sf8193 sf8193 left a comment

Owner

Thanks for the thorough work here — the thread-primary concept is solid. A few questions on specifics before we merge.

Comment thread daemon/anchor-state.ts
_threadRegistry = mod.threadRegistry
}
return _threadRegistry!
}

Owner

Could we avoid the lazy dynamic import here? It adds a circular dep workaround that's easy to break silently. Would it work to have session-lifecycle.ts handle the ThreadRegistry co-update instead (it already imports both), or pass threadRegistry as a parameter to setAnchorState?
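One way to read the second suggestion (passing the registry in) is sketched below. It is simplified and synchronous, whereas the real setAnchorState appears to be async and also takes a respawn count; the `ThreadStore` interface is a hypothetical name.

```typescript
type AnchorState = "live" | "zombie" | "killed" | "crashed";

// anchor-state.ts would depend only on this narrow interface,
// so it no longer needs to import the sessions module at all.
interface ThreadStore {
  get(threadId: string): { anchorState: AnchorState } | undefined;
  persist(): void;
}

function setAnchorState(store: ThreadStore, threadId: string, state: AnchorState): boolean {
  const thread = store.get(threadId);
  if (!thread) return false; // unknown thread: nothing to co-update
  thread.anchorState = state;
  store.persist();
  return true;
}
```

The caller (for example session-lifecycle.ts, which already imports both modules) supplies the store, which removes the import cycle without a lazy dynamic import.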

Comment thread daemon/anchor-state.ts
thread.anchorState = state
tr.persist()
}
} catch {}

Owner

Should we surface failures here instead of bare catch {}? If this silently fails, anchor emoji and threads.json diverge without any signal. Even a process.stderr.write would help.
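A small sketch of what surfacing the failure might look like. `coUpdateAnchor` and the log message format are hypothetical; the logger is injected so the daemon could pass something like a stderr writer.

```typescript
// Runs the ThreadInfo co-update and reports failures instead of swallowing them.
function coUpdateAnchor(update: () => void, log: (msg: string) => void): boolean {
  try {
    update();
    return true;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log(`[anchor-state] ThreadInfo co-update failed: ${reason}`);
    return false;
  }
}
```

The anchor update itself can still proceed when this returns false; the difference is that the divergence from threads.json now leaves a trace.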

}
threadRegistry.persist()
}

Owner

This is a broader question — detachSession (line 44) already calls threadRegistry.unbind(), then killSession calls it again here. The double-unbind is harmless but it makes ownership unclear: who's responsible for unbinding?

More generally, does the complexity of spreading ThreadRegistry co-updates across detachSession, killSession, setAnchorState, and bridge-server death detection pay for itself vs. having a single ThreadRegistry.onSessionDeath(sessionId) method that handles all the state transitions in one place?
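For discussion, a rough sketch of the single-owner shape. `onSessionDeath` is the name proposed in this comment; the signature, the `cause` and `now` parameters, and the simplified thread shape are assumptions.

```typescript
type DeathCause = "killed" | "crashed";

interface Thread {
  anchorState: "live" | DeathCause;
  sessionHistory: { sessionId: string; endedAt?: number }[];
}

class ThreadRegistry {
  readonly threads = new Map<string, Thread>();
  private readonly bindings = new Map<string, string>(); // threadId -> sessionId
  persistCount = 0;

  bind(threadId: string, sessionId: string): void {
    this.bindings.set(threadId, sessionId);
  }
  isBound(threadId: string): boolean {
    return this.bindings.has(threadId);
  }
  persist(): void {
    this.persistCount += 1; // stand-in for writing threads.json and bindings.json
  }

  // Single owner for every death-related transition. Returns false when the
  // session has no binding, which also makes a repeated call a no-op.
  onSessionDeath(sessionId: string, cause: DeathCause, now: number): boolean {
    const hit = Array.from(this.bindings.entries()).find(([, bound]) => bound === sessionId);
    if (!hit) return false;
    const threadId = hit[0];
    this.bindings.delete(threadId);
    const thread = this.threads.get(threadId);
    if (thread) {
      thread.anchorState = cause;
      const entry = thread.sessionHistory.find((h) => h.sessionId === sessionId);
      if (entry) entry.endedAt = now;
    }
    this.persist();
    return true;
  }
}
```

Because the binding is removed on the first call, a second caller (say, killSession racing death detection) finds nothing to do, so unbind ownership and the write count are both settled in one place.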

}
}
if (!skipPersist) threadRegistry.persist()
}

Owner

Can detachSession be called twice for the same session (e.g. bridge-server death detection + killSession race)? If so, endedAt gets overwritten silently — does it make sense to guard with if (!histEntry.endedAt)?
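The guard being suggested amounts to something like this. `markEnded` and the `HistoryEntry` shape are hypothetical names used only to isolate the check.

```typescript
interface HistoryEntry {
  sessionId: string;
  startedAt: number;
  endedAt?: number;
}

// Stamps endedAt only once; a later caller keeps the original timestamp.
function markEnded(entry: HistoryEntry, now: number): boolean {
  if (entry.endedAt !== undefined) return false;
  entry.endedAt = now;
  return true;
}
```

The check is against `undefined` rather than truthiness so that a timestamp of 0 would still count as set.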

Comment thread daemon/sessions.ts
existing.totalMessages += (session.messageCount ?? 0)
if (session.lastActive > existing.lastActive) {
existing.lastActive = session.lastActive
}

Owner

Does it make sense to keep ~60 lines of migration code that only runs once (first boot after merge) and then sits as dead code? Could we handle this as a one-time migration script instead, or at least add a log + deletion plan?

Comment thread daemon/sessions.ts
if (session.isJoinMember) continue
if (this.threads.has(session.threadId)) continue
this.threads.set(session.threadId, {
threadId: session.threadId,

Owner

The reconcile loop creates ThreadInfo for every live session not in threads.json. But if threads.json was deleted or corrupted, this recreates entries for sessions that may be zombies (tmux alive but bridge dead). Should we also check bridge health here, or is that overkill for a recovery path?
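If the health check is worth having, one possible policy is to skip sessions whose bridge is not responding. The sketch below assumes an injected `isBridgeHealthy` predicate and simplified `Session` and `ThreadInfo` shapes; none of these names are from the PR beyond `isJoinMember` and `threadId`.

```typescript
interface Session {
  sessionId: string;
  threadId: string;
  isJoinMember?: boolean;
}

interface ThreadInfo {
  threadId: string;
  respawnCount: number;
}

// Returns the thread IDs for which a ThreadInfo entry was created.
function reconcile(
  threads: Map<string, ThreadInfo>,
  sessions: Session[],
  isBridgeHealthy: (s: Session) => boolean,
): string[] {
  const created: string[] = [];
  for (const session of sessions) {
    if (session.isJoinMember) continue;
    if (threads.has(session.threadId)) continue;
    if (!isBridgeHealthy(session)) continue; // the extra check under discussion
    threads.set(session.threadId, { threadId: session.threadId, respawnCount: 0 });
    created.push(session.threadId);
  }
  return created;
}
```

Skipping is only one option; recording the thread with a distinct state would keep it visible to recovery, at the cost of more states to handle.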

Comment thread daemon/router.ts
if (msg.hasExistingThread && msg.existingThreadId) {
const existingIsSession = msg.hasExistingThread && msg.existingThreadId
&& threadRegistry.isBound(msg.existingThreadId)
if (msg.hasExistingThread && msg.existingThreadId && !existingIsSession) {

Owner

Can we confirm this behavior change is intentional? Previously, if a message had an existing thread, it would reuse it. Now if that thread is bound to a session, we skip it and create a new thread instead. Is there a case where we reach this point with a session-bound existingThreadId that wasn't caught by the routing guard above?
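To make the behavior change concrete, the new condition reduces to the following decision. `reusableThreadId` and `InboundMsg` are hypothetical names; `hasExistingThread`, `existingThreadId`, and `isBound` are taken from the diff.

```typescript
interface InboundMsg {
  hasExistingThread: boolean;
  existingThreadId?: string;
}

// Returns the thread to reuse, or undefined when a new thread should be created.
function reusableThreadId(
  msg: InboundMsg,
  isBound: (threadId: string) => boolean,
): string | undefined {
  if (!msg.hasExistingThread || !msg.existingThreadId) return undefined;
  // Session-bound threads are never reused by this path.
  return isBound(msg.existingThreadId) ? undefined : msg.existingThreadId;
}
```

The old behavior corresponds to dropping the `isBound` check, so the only inputs that change outcome are messages whose existing thread is bound to a session.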

Comment thread daemon/router.ts
// ---------------------------------------------------------------------------

/** Resolve a thread/channel ID to a session ID via the binding map */
function resolveSessionForThread(channelId: string, existingThreadId?: string): string | undefined {

Owner

Two questions on routing:

  1. The old code had a second fallback block (lines 446-455 on main) that handled threadReply-created threads where chat_id was rewritten but msg.isThread is false. This PR removes it. Does resolveSessionForThread in the msg.isThread block above cover that case? threadReply messages aren't msg.isThread — so this might silently break threadReply channel routing.

  2. Does it make sense to also check resolveThreadSessionFromMsg (the newer method using effectiveThreadId for Slack compat)? This PR's resolveSessionForThread uses channelId/existingThreadId which is the older pattern — wanted to confirm Slack DM routing still works correctly.


void setAnchorState(threadId!, respawnCount > 0 ? 'zombie' : 'live', respawnCount).catch(() => {})

return { name: tmuxName, sessionId, threadId: threadId!, url }

Owner

Should we batch the threadRegistry.persist() calls? A session crash can trigger: death detection → detachSession → persist → setAnchorState → persist → killSession → detachSession → persist. That's 3+ writes of two files each in quick succession. Not a correctness issue, but worth noting.
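One lightweight way to batch would be a dirty flag with an explicit flush, sketched here. `BatchedPersister` is a hypothetical helper, and where to call `flush()` (end of the death-handling path, a timer, and so on) is left open.

```typescript
class BatchedPersister {
  private dirty = false;
  private readonly write: () => void;

  constructor(write: () => void) {
    this.write = write;
  }

  // Called wherever persist() is called today.
  markDirty(): void {
    this.dirty = true;
  }

  // Writes at most once per batch; returns whether a write happened.
  flush(): boolean {
    if (!this.dirty) return false;
    this.write();
    this.dirty = false;
    return true;
  }
}
```

The trade-off is that state marked dirty but not yet flushed is lost if the daemon crashes in between, so the flush point should sit as close to the end of the transition as possible.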


dcetlin commented Jun 25, 2026

Collaborator Author

Closing this in favor of a lighter approach.

Your review surfaced the core tension: the thread-primary structural inversion adds real complexity (lazy imports, double-unbind ownership, migration code, persist batching) to make ThreadInfo load-bearing for routing. The concrete value — persistent metadata across session deaths (respawnCount, sessionHistory, anchor state, topic) — doesn't require that routing depend on ThreadInfo.

The replacement PR will introduce ThreadInfo as metadata-only: it observes and records, but routing never consults it. This gives us the durable thread metadata that recovery features need (emoji state machine, respawn counts, session history) without the structural inversion or the routing complexity you flagged.

On your specific comments:

  • Lazy import / circular dep — goes away entirely. Metadata-only ThreadInfo doesn't need co-updates from anchor-state.
  • Bare catch in setAnchorState — agreed, will add logging in the replacement.
  • Double-unbind ownership / onSessionDeath — goes away. No binding map, no unbind.
  • detachSession race (endedAt overwrite) — valid, will guard in the replacement.
  • Migration code as dead weight — goes away or shrinks dramatically with the lighter model.
  • Reconcile loop zombie detection — simpler reconcile in the lighter model.
  • threadReply behavior change — keeping this as a standalone routing fix.
  • Second routing fallback removal — this was a bug fix (the block was the root cause of a session-leak via threadReply). Will port standalone.
  • Persist batching — simpler with fewer persist sites.


dcetlin commented Jun 25, 2026

Collaborator Author

Superseded — replacing with metadata-only ThreadInfo approach.

@dcetlin dcetlin closed this Jun 25, 2026
dcetlin added a commit that referenced this pull request Jul 1, 2026
- Refuse `hydra up` when platform is already running (prevents
  gotcha #32 duplicate-main ping-pong). Points to restart/down.
- Extract stop-byte.sh for clean byte teardown — eliminates fragile
  shell function coupling and command injection vector in lifecycleDown.
- Replace execSync('sleep 0.5') with Bun.sleep(500).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dcetlin added a commit that referenced this pull request Jul 1, 2026
- Pre-flight in `up` checks for orphaned claude processes (not just
  tmux sessions) to prevent gotcha #32 ping-pong
- Validate byte script exists before starting daemon — prevents
  unknown platforms from leaving a half-started daemon
- Replace VALID_PLATFORMS enum with filesystem-based validation:
  check for start-{platform}-byte.sh, discord falls back to
  start-byte-v2.sh (legacy name)
- `down` removes daemon.sock + daemon.pid so discoverSockets()
  doesn't show phantom daemons

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dcetlin added a commit that referenced this pull request Jul 1, 2026
Add daemon+byte lifecycle commands to the hydra CLI. Platform is
always required (no default).

- `hydra up <platform>` — validate byte script exists, check for
  running tmux sessions and orphaned claude processes (prevents
  gotcha #32 ping-pong), start daemon, wait for socket, start byte
- `hydra down <platform>` — stop byte via stop-byte.sh (orphan
  cleanup), stop daemon, remove stale socket + PID file
- `hydra restart <platform>` — restart daemon only (picks up code
  changes)

No hardcoded platform enum — uses filesystem-based validation
(does start-{platform}-byte.sh exist?). New platforms work with
zero CLI changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
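The filesystem-based validation described in these commits might be sketched like this. The script names `start-{platform}-byte.sh` and `start-byte-v2.sh` are from the commit messages; `resolveByteScript` and the injected `exists` check are assumptions made so the example does not depend on a real directory.

```typescript
// Returns the byte script to launch for a platform, or undefined if the
// platform is unknown (no matching script on disk).
function resolveByteScript(
  platform: string,
  exists: (file: string) => boolean,
): string | undefined {
  const primary = `start-${platform}-byte.sh`;
  if (exists(primary)) return primary;
  // Legacy name kept for discord only.
  if (platform === "discord" && exists("start-byte-v2.sh")) return "start-byte-v2.sh";
  return undefined;
}
```

Checking this before starting the daemon is what prevents an unknown platform from leaving a half-started daemon behind.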

2 participants