Summary
When a Cursor session ends (any reason — archiveReason set to "User terminated", "Session crashed", "Local launch failed: …", etc.), the hub's sessions.metadata row is rewritten by the runner's archive transition. If the CLI's local cached metadata is stale or null at that moment, the rewrite ships a sparse blob and cursorSessionId (plus other flavor resume tokens — cursorSessionProtocol, codexSessionId, claudeSessionId, geminiSessionId, opencodeSessionId, kimiSessionId) gets cleared from the row. The transcript and on-disk Cursor chat data remain intact, but the hub no longer has a resume token to hand a re-spawned cursor-agent, so subsequent POST /api/sessions/:id/resume calls fail with resume_unavailable (or worse, ACP session/load rejects the legacy UUID and the launcher exits).
This is the systemic cause of the "session crashed → cannot resume even though chat is on disk" failure mode. Recovery today requires direct sqlite surgery on sessions.metadata to restore cursorSessionId/cursorSessionProtocol before resume succeeds.
Repro
- Start any flavored session (e.g. hapi cursor) and wait for cursor-agent to report its session UUID via onSessionFoundWithProtocol(...) so metadata.cursorSessionId is populated. Confirm with sqlite3 ~/.hapi/hapi.db "SELECT json_extract(metadata,'$.cursorSessionId') FROM sessions WHERE id='…'".
- End the session via any path that reaches runnerLifecycle.archiveAndClose (terminate, crash, local-launch failure, handoff). The archive write spreads the CLI's locally-cached Metadata and overrides lifecycleState/archivedBy/archiveReason.
- Re-query sessions.metadata. In a non-trivial fraction of cases on long-running installs, cursorSessionId is gone.
The "non-trivial fraction" is real: a DB audit on one operator's machine (~99 inactive sessions) found multiple lifecycleState=archived archiveReason='Session crashed' rows whose cursorSessionId was missing despite the on-disk Cursor chat store still being present and the transcript being intact.
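An audit like the one described above can be approximated with a small helper. This is a minimal sketch, not the script that was actually used: the SessionRow shape and the function name are hypothetical, and it assumes the rows have already been read out of sessions (id plus the raw metadata JSON text).

```typescript
// Hypothetical row shape: `metadata` is the raw JSON text stored in sessions.metadata.
interface SessionRow {
    id: string
    metadata: string | null
}

// Session-id fields named in this issue (cursorSessionProtocol is a marker, not an id).
const RESUME_TOKEN_FIELDS = [
    'claudeSessionId',
    'codexSessionId',
    'geminiSessionId',
    'opencodeSessionId',
    'cursorSessionId',
    'kimiSessionId'
] as const

// Returns the ids of archived rows that carry none of the resume tokens.
function findArchivedWithoutResumeToken(rows: SessionRow[]): string[] {
    const suspects: string[] = []
    for (const row of rows) {
        let parsed: unknown
        try {
            parsed = row.metadata === null ? null : JSON.parse(row.metadata)
        } catch (_err) {
            continue // unparseable JSON is a separate problem; skip it here
        }
        if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) continue
        const meta = parsed as Record<string, unknown>
        if (meta.lifecycleState !== 'archived') continue
        const hasToken = RESUME_TOKEN_FIELDS.some(
            (field) => typeof meta[field] === 'string' && meta[field] !== ''
        )
        if (!hasToken) suspects.push(row.id)
    }
    return suspects
}
```

A flagged row is only a candidate: a session archived before the agent ever reported an id would also show up, so each hit still needs to be checked against the on-disk chat store.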
Root cause
Two layers conspire:
1. CLI archive write replaces the entire metadata blob
cli/src/agent/runnerLifecycle.ts archiveAndClose:
    options.session.updateMetadata((currentMetadata) => ({
        ...currentMetadata,
        lifecycleState: 'archived',
        lifecycleStateSince: Date.now(),
        archivedBy: 'cli',
        archiveReason
    }))
The spread intends to preserve prior fields, but currentMetadata is the CLI's locally-cached snapshot (ApiSessionClient.metadata). That cache can be:
- null when MetadataSchema.safeParse(raw.metadata) rejected the row at session bootstrap: getOrCreateSession / getSession in cli/src/api/api.ts (lines 71-75 and 121-125) silently null out the metadata on parse failure. In that case current = this.metadata ?? ({} as Metadata) in cli/src/api/apiSession.ts updateMetadata (lines 611-645) yields {}, and the archive payload becomes only {lifecycleState:'archived', lifecycleStateSince, archivedBy:'cli', archiveReason}.
- stale if a hub-side update-session event was missed (or arrived at a lower metadataVersion) and the local cache never picked up a cursorSessionId write committed by another path.
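The null-cache case can be reproduced in isolation. The sketch below is a simplified model, not the CLI code: the Metadata type is cut down to the fields named in this issue, buildArchivePayload is a hypothetical stand-in for the updateMetadata fallback plus the archive updater, and 'example-uuid' is a placeholder value.

```typescript
// Simplified stand-in for the real Metadata type; only fields named in this issue.
type Metadata = {
    lifecycleState?: string
    lifecycleStateSince?: number
    archivedBy?: string
    archiveReason?: string
    cursorSessionId?: string
}

// Models the fallback described above: a null cache becomes {} before the
// archive updater spreads it, so nothing from the hub row survives.
function buildArchivePayload(cached: Metadata | null, archiveReason: string): Metadata {
    const current = cached ?? ({} as Metadata)
    return {
        ...current,
        lifecycleState: 'archived',
        lifecycleStateSince: Date.now(),
        archivedBy: 'cli',
        archiveReason
    }
}

const healthy = buildArchivePayload({ cursorSessionId: 'example-uuid' }, 'User terminated')
const sparse = buildArchivePayload(null, 'Session crashed')

console.log('cursorSessionId' in healthy) // true
console.log('cursorSessionId' in sparse)  // false
```

With a populated cache the spread works as intended; with a null cache the payload contains only the four archive fields, which is the sparse blob the hub then stores.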
2. Hub update-metadata is an unconditional REPLACE
hub/src/store/sessions.ts updateSessionMetadata (called from hub/src/socket/handlers/cli/sessionHandlers.ts handleUpdateMetadata) hands the incoming blob to updateVersionedField which writes metadata = @field_value verbatim. There is no carve-out for protocol resume tokens. Whatever the CLI sends fully replaces the prior row.
So a sparse archive payload from the CLI lands in the DB as-is, and the resume token vanishes.
Fix proposal
The principled fix is at the hub layer because it's the single chokepoint for every metadata write (CLI, web, future surfaces) and protects against any caller that drops a resume token, intentionally or otherwise.
In hub/src/store/sessions.ts, change updateSessionMetadata to read the prior metadata, then carry forward a small allowlist of resume-token fields when the incoming write lacks them:
    const PROTOCOL_RESUME_FIELDS = [
        'claudeSessionId',
        'codexSessionId',
        'geminiSessionId',
        'opencodeSessionId',
        'cursorSessionId',
        'cursorSessionProtocol',
        'kimiSessionId'
    ] as const

    function preserveProtocolResumeFields(prior: unknown, next: unknown): unknown {
        if (!isPlainObject(prior) || !isPlainObject(next)) return next
        const merged: Record<string, unknown> = { ...(next as Record<string, unknown>) }
        for (const field of PROTOCOL_RESUME_FIELDS) {
            if (merged[field] === undefined && (prior as Record<string, unknown>)[field] !== undefined) {
                merged[field] = (prior as Record<string, unknown>)[field]
            }
        }
        return merged
    }
Wrap the SELECT-then-UPDATE in a db.transaction so the merge is atomic with the version check.
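How the merge sits next to the version check can be sketched with an in-memory stand-in for the sessions table. Everything here beyond the field list and the merge rule is an assumption made for illustration: the Map-based store, the StoredSession and UpdateResult shapes, updateMetadataWithMerge, and the isPlainObject implementation (the proposal references that helper without showing it). The synchronous function body stands in for the db.transaction.

```typescript
const PROTOCOL_RESUME_FIELDS = [
    'claudeSessionId', 'codexSessionId', 'geminiSessionId', 'opencodeSessionId',
    'cursorSessionId', 'cursorSessionProtocol', 'kimiSessionId'
] as const

type Blob = Record<string, unknown>

// Assumed implementation of the helper the proposal relies on.
function isPlainObject(value: unknown): value is Blob {
    return typeof value === 'object' && value !== null && !Array.isArray(value)
}

function preserveProtocolResumeFields(prior: unknown, next: unknown): unknown {
    if (!isPlainObject(prior) || !isPlainObject(next)) return next
    const merged: Blob = { ...next }
    for (const field of PROTOCOL_RESUME_FIELDS) {
        if (merged[field] === undefined && prior[field] !== undefined) {
            merged[field] = prior[field]
        }
    }
    return merged
}

// Hypothetical in-memory model of one sessions row.
interface StoredSession {
    metadata: unknown
    metadataVersion: number
}

// Hypothetical result shape, loosely modeled on a versioned update.
type UpdateResult =
    | { result: 'success'; version: number; value: unknown }
    | { result: 'version-mismatch'; version: number; value: unknown }
    | { result: 'error' }

// Read the prior row, check the version, merge, write: all in one synchronous
// step, which is what the transaction guarantees against the real database.
function updateMetadataWithMerge(
    store: Map<string, StoredSession>,
    id: string,
    next: unknown,
    expectedVersion: number
): UpdateResult {
    const row = store.get(id)
    if (!row) return { result: 'error' }
    if (row.metadataVersion !== expectedVersion) {
        return { result: 'version-mismatch', version: row.metadataVersion, value: row.metadata }
    }
    const merged = preserveProtocolResumeFields(row.metadata, next)
    const version = expectedVersion + 1
    store.set(id, { metadata: merged, metadataVersion: version })
    return { result: 'success', version, value: merged }
}
```

In this model a sparse archive payload sent at the current version succeeds and keeps cursorSessionId, while the same payload sent at an older version is rejected before any merge happens.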
Properties
- A caller that intentionally changes a flavor session id (e.g. a fresh bootstrap that just discovered a new id) still wins: merged[field] = next[field] if next has it.
- A caller that omits the field (the bug today) gets the prior value preserved.
- Other metadata fields are untouched: lifecycleState/archiveReason/etc. still replace as today.
- The version check in updateVersionedField still runs, so concurrent writers still see version-mismatch.
- The list mirrors pickExistingSessionMetadata in cli/src/agent/sessionFactory.ts (which already documents the same set as "native resume metadata").
Why merge-not-replace at the hub vs. carry-forward at the CLI
CLI-side carry-forward (e.g. fetching from hub before each archive write) is racy and adds a network round-trip in the cleanup path. The hub already has the prior row in hand for the version check; preserving the seven resume-token fields there is one extra read in a transaction and protects every write surface.
Routing-default note
Preserving just cursorSessionId is sufficient for legacy routing because isLegacyCursorSession() in cli/src/cursor/utils/cursorProtocol.ts defaults to legacy when cursorSessionProtocol is unset and cursorSessionId is truthy. So resume is unblocked even for sessions whose protocol marker is already gone: as long as the id survives (or is restored), the id alone is enough.
Tests
Regression coverage at hub/src/store/sessions.test.ts:
- Cursor session archive preserves cursorSessionId (write archive payload without cursorSessionId, assert prior value carried through)
- Cursor session archive preserves cursorSessionProtocol (same shape)
- Codex session archive preserves codexSessionId (proves the fix is generic)
- Generic across all seven fields (claude/codex/gemini/opencode/cursor/kimi): table-driven test
- Explicit-overwrite path: when next blob includes a different cursorSessionId, next wins (no false preservation)
- Explicit-clear path: this is intentionally NOT supported. Clearing a resume token requires a separate dedicated mutation if a use case ever shows up; today no caller needs it
- Version mismatch behavior unchanged
- Crash-path: simulate archiveReason='Session crashed' payload, id preserved end-to-end
- Round-trip: archive → fetch → resume payload still has the id
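The table-driven case from the list above could look roughly like this. It is a framework-free sketch rather than the actual test file: mergeResumeFields is a compact restatement of the proposed merge so the snippet runs on its own, runResumeFieldTable is a hypothetical name, and a real test in hub/src/store/sessions.test.ts would go through updateSessionMetadata and the project's test runner instead.

```typescript
const RESUME_FIELDS_UNDER_TEST = [
    'claudeSessionId', 'codexSessionId', 'geminiSessionId', 'opencodeSessionId',
    'cursorSessionId', 'cursorSessionProtocol', 'kimiSessionId'
] as const

type MetadataBlob = Record<string, unknown>

// Compact restatement of the proposed merge rule (assumes both sides are plain objects).
function mergeResumeFields(prior: MetadataBlob, next: MetadataBlob): MetadataBlob {
    const merged: MetadataBlob = { ...next }
    for (const field of RESUME_FIELDS_UNDER_TEST) {
        if (merged[field] === undefined && prior[field] !== undefined) {
            merged[field] = prior[field]
        }
    }
    return merged
}

// One table row per field: an archive payload that omits the field must keep
// the prior value, and a payload that sets a different value must win.
function runResumeFieldTable(): string[] {
    const failures: string[] = []
    for (const field of RESUME_FIELDS_UNDER_TEST) {
        const prior: MetadataBlob = { [field]: 'prior-value', lifecycleState: 'running' }
        const archive: MetadataBlob = {
            lifecycleState: 'archived',
            archivedBy: 'cli',
            archiveReason: 'Session crashed'
        }

        const kept = mergeResumeFields(prior, archive)
        if (kept[field] !== 'prior-value') failures.push(`${field}: prior value not preserved`)
        if (kept.lifecycleState !== 'archived') failures.push(`${field}: lifecycleState not replaced`)

        const overwritten = mergeResumeFields(prior, { ...archive, [field]: 'new-value' })
        if (overwritten[field] !== 'new-value') failures.push(`${field}: explicit overwrite lost`)
    }
    return failures
}

console.log(runResumeFieldTable()) // an empty array means every row passed
```

Keeping the field list as the table keeps the test in step with the allowlist: adding an eighth resume field means adding one entry, not one more hand-written case.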
A patch is on the way (fork PR coming).