Skip to content

fix(hub): preserve session metadata across archive transitions#825

Merged
tiann merged 4 commits into
tiann:mainfrom
heavygee:fix/preserve-cursor-session-id-on-crash
Jun 8, 2026
Merged

fix(hub): preserve session metadata across archive transitions#825
tiann merged 4 commits into
tiann:mainfrom
heavygee:fix/preserve-cursor-session-id-on-crash

Conversation

@heavygee
Copy link
Copy Markdown
Collaborator

@heavygee heavygee commented Jun 6, 2026

Summary

Hub-side preservation of session-routing and flavor resume fields across update-metadata writes, so the CLI's archive transition can no longer silently clear cursorSessionId (or any other flavor's resume id, the routing fields flavor/machineId, or the schema-required path/host) from a session row.

Closes #820

Problem

cli/src/agent/runnerLifecycle.ts archiveAndClose does:

options.session.updateMetadata((currentMetadata) => ({
    ...currentMetadata,
    lifecycleState: 'archived',
    lifecycleStateSince: Date.now(),
    archivedBy: 'cli',
    archiveReason
}))

The spread is supposed to preserve prior fields, but currentMetadata is the local cache in cli/src/api/apiSession.ts (current = this.metadata ?? ({} as Metadata)). That cache is null when MetadataSchema.safeParse rejected the row at session bootstrap (cli/src/api/api.ts silently nulls out metadata on parse failure), and can also be stale relative to the hub.

hub/src/store/sessions.ts updateSessionMetadata calls updateVersionedField, which does an unconditional metadata = @field_value replacement. So a sparse archive payload from the CLI becomes the new row contents and the resume identifiers, routing keys, and parse-required fields are all gone.

DB audits on long-running installs showed lifecycleState=archived / archiveReason='Session crashed' rows missing cursorSessionId despite the on-disk Cursor chat store still being present. This is the systemic cause of "session crashed → cannot resume even though chat data is on disk".

Approach

Single-chokepoint fix at the hub: the only place every metadata write goes through is updateSessionMetadata. Wrap the existing updateVersionedField call in a db.transaction that first reads the prior row's metadata, then carries fields forward when the incoming payload omits them.

Three tiers, each addressing a distinct downstream failure mode:

const PARSE_IDENTITY_FIELDS = ['path', 'host'] as const
const ROUTING_FIELDS        = ['flavor', 'machineId'] as const
const SIMPLE_RESUME_TOKENS  = [
    'claudeSessionId',
    'codexSessionId',
    'geminiSessionId',
    'opencodeSessionId',
    'cursorSessionId',
    'kimiSessionId'
] as const
  • PARSE_IDENTITY_FIELDS — required by MetadataSchema in shared/src/schemas.ts. If they're missing, hub session cache and CLI getSession reject the row with safeParse → null and resume can't even find the session, regardless of whether the resume token survived.
  • ROUTING_FIELDSflavor is what hub/src/web/routes/sessions.ts (multiple sites) and sync/syncEngine.ts use to pick which flavor session-id field to read; with flavor missing, the ?? 'claude' fallback misroutes a Cursor/Codex/Gemini session as Claude and the preserved token is ignored. machineId is the filter the CLI's resumable listing uses to scope rows to the current host.
  • SIMPLE_RESUME_TOKENS — flavor-specific resume identifiers with write-once-keep semantics. Mirror of pickExistingSessionMetadata in cli/src/agent/sessionFactory.ts (which already preserves the same set on bootstrap).

Plus pair-aware handling for cursorSessionId / cursorSessionProtocol: protocol is tied to a specific chat id, so when next explicitly sets a new cursorSessionId the prior cursorSessionProtocol is dropped (callers can include the new protocol with the new id); when next is silent on both, the prior protocol carries with the prior id.

The update-metadata socket handler now broadcasts the persisted (merged) value rather than the pre-merge input, so peer CLIs in the same session room don't overwrite their local cache with a tokenless blob.

Properties:

  • Caller that explicitly sets any of these fields wins — next[field] defined always wins.
  • Caller that omits a field (the bug) gets the prior value preserved.
  • Other metadata fields (lifecycleState, archiveReason, name, summary, tools, slashCommands, …) replace as today.
  • Version check in updateVersionedField is unchanged; concurrent writers still see version-mismatch.
  • Cost: one extra SELECT metadata per metadata write inside a SQLite transaction.

Why hub vs CLI: CLI-side carry-forward (e.g. fetching from hub before each write) is racy and adds a network round-trip in the cleanup path. The hub already reads the prior row for the version check, so the merge there is a strictly smaller surface and protects every write surface (CLI, web, future).

Routing default: preserving cursorSessionId is sufficient for legacy resume because cli/src/cursor/utils/cursorProtocol.ts isLegacyCursorSession() defaults to legacy when cursorSessionProtocol is unset and cursorSessionId is truthy. So this fix unblocks resume even for sessions that lost both the id and the protocol marker before the patch landed.

Testing

hub/src/store/sessions.test.ts — 25 tests, all passing:

  • Cursor archive preserves cursorSessionId (crash-archive)
  • Cursor archive preserves cursorSessionProtocol
  • Codex archive preserves codexSessionId (proves generic across flavors)
  • Table-driven coverage of all six flavor resume tokens
  • path / host preserved on sparse archive (parse identity)
  • flavor / machineId preserved on sparse archive (routing)
  • No-prior-value cases for each tier (doesn't invent fields)
  • Explicit-override cases for each tier (next wins)
  • cursorSessionProtocol dropped when next sets a new cursorSessionId
  • cursorSessionProtocol preserved when next is silent on both id and protocol
  • Explicit cursorSessionProtocol honored when id is unchanged
  • Empty payload (stale-cache failure mode): id preserved
  • Multi-write: id survives consecutive metadata replacements
  • Version mismatch behavior unchanged
  • Missing session row returns error
  • Round-trip: archive payload → read-back is sufficient for legacy routing
  • result.value returned by updateSessionMetadata is the merged value (not the pre-merge input)

hub/src/socket/handlers/cli/sessionHandlers.test.ts — new test asserts the update-metadata ack and the room-event broadcast both carry the merged value, not the pre-merge payload.

Full hub suite: 297 passed, 0 failed. bun typecheck clean across cli/web/hub.

Risks

  • Wire-level semantics of update-metadata change for the preserved fields: a write that wants to clear one of them by omission can no longer do so. No code path in this repo does that today — these fields are all write-once-keep semantics. If a future caller needs an explicit clear, it should be a dedicated mutation.
  • One extra SELECT per metadata write inside a transaction. Negligible for the sessions table size; sessions are not write-heavy.

heavygee and others added 3 commits June 6, 2026 19:16
…sitions

When a session ends (terminate, crash, local-launch failure, handoff),
the runner's archive write replaces sessions.metadata wholesale. If the
CLI's locally cached Metadata is null (e.g. Zod parse failed at bootstrap
and api.ts nulled it out) or stale, the spread in archiveAndClose ships
a sparse blob and the resume token (cursorSessionId, codexSessionId,
claudeSessionId, etc.) gets cleared from the row even though the
on-disk chat data is still intact.

Fix at the hub layer because update-metadata is the single chokepoint
for every metadata write surface (CLI, web, future): in the store-level
updateSessionMetadata, read the prior row's metadata inside a
transaction and carry forward a small allowlist of flavor resume tokens
when the incoming write omits them. Explicit overwrites still win.

The allowlist mirrors pickExistingSessionMetadata in sessionFactory.ts
which already preserves the same fields on bootstrap.

Closes tiann#820

Co-authored-by: Cursor <cursoragent@cursor.com>
Three bot findings on the initial patch:

1. (P1) Sparse archive payloads still resulted in metadata blobs that
   failed MetadataSchema parse downstream — required `path`/`host` were
   not in the carry-forward set, so even though the resume token
   survived, hub session cache and CLI getSession nulled-out the row
   and resume_unavailable came back. Add PARSE_IDENTITY_FIELDS = `path`,
   `host` to the carry-forward.

2. (P2) Preserving `cursorSessionProtocol` whenever it was omitted
   carried a stale protocol over to a freshly written `cursorSessionId`,
   misrouting a future remote resume. Pair-aware logic: drop the prior
   protocol when next sets a new id; preserve the protocol only when
   next is silent on both id and protocol.

3. (P2) The successful update-metadata broadcast emitted the pre-merge
   payload to other CLIs in the session room, so even though the DB row
   was preserved, peer caches diverged. Switch the broadcast value to
   `result.value` (the persisted merged value) so live caches stay in
   sync with the truth.

Refactor preserveProtocolResumeFields into mergeSessionMetadata with
two tiers (PARSE_IDENTITY_FIELDS + SIMPLE_RESUME_TOKENS) plus the
cursor pair handler. 6 new tests cover the regressions; existing 16
still pass plus 1 new socket-level test for the broadcast.

Co-authored-by: Cursor <cursoragent@cursor.com>
Bot P2 on the prior fix: PARSE_IDENTITY_FIELDS (path, host) made the
blob parseable and SIMPLE_RESUME_TOKENS preserved the chat-id, but
flavor and machineId were still being dropped by sparse archive
payloads. Consequences:

- flavor: hub/src/web/routes/sessions.ts and sync/syncEngine.ts read
  `metadata?.flavor ?? 'claude'` to pick which session id field to
  resume. With flavor missing, a Cursor/Codex/Gemini session was
  routed as Claude and the preserved cursorSessionId was ignored.

- machineId: telegram/bot.ts and the CLI's resumable listing read
  `metadata?.machineId` to scope sessions to the current host. With
  machineId missing, the row dropped out of the resume picker.

Add a third carry-forward tier ROUTING_FIELDS = [flavor, machineId]
between PARSE_IDENTITY_FIELDS and SIMPLE_RESUME_TOKENS in
mergeSessionMetadata. 3 new tests cover preservation, no-invention,
and explicit override.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Preserved resume tokens can no longer be cleared — carryForwardIfMissing restores prior token fields whenever the next metadata blob omits them, but an existing Codex path intentionally clears codexSessionId by deleting it before updateMetadata (cli/src/codex/session.ts:104-107). After this change the hub merges the old codexSessionId back into the row and returns it in the ack, so resetCodexThread() cannot actually reset the Codex thread and later resume/routing can keep using the stale thread id. Evidence: hub/src/store/sessions.ts:68.
    Suggested fix:
    const EXPLICIT_CLEAR_FIELDS = ['codexSessionId'] as const
    
    function shouldCarryForward(field: string, next: Record<string, unknown>): boolean {
        return next[field] === undefined && !EXPLICIT_CLEAR_FIELDS.some((clearField) => clearField === field)
    }
    Or add a dedicated clear sentinel/API and update resetCodexThread() to use it, with a regression test proving codexSessionId is removed.

Questions

  • None.

Summary

  • Review mode: initial
  • One Major issue found: the new preservation rule blocks an existing metadata clear path. Residual risk: I did not run the full Bun suite in this review pass.

Testing

  • Not run (automation)

HAPI Bot

Comment thread hub/src/store/sessions.ts
Upstream cold-review (Major): the carry-forward semantics introduced
in the prior commits ("omit field → preserve from prior") collide
with cli/src/codex/session.ts resetCodexThread(), which intentionally
clears codexSessionId by deleting it from the metadata blob before
calling updateMetadata. With omit-as-preserve, the cleared id was
restored from the prior row and /clear on a Codex session no longer
dropped the persisted thread.

Add an explicit-clear sentinel: when next sets a carry-forward field
to `null`, the merge drops the key entirely from the persisted blob
(key removed; not stored as null since MetadataSchema fields are
`string().optional()`). `undefined` (key missing from next) keeps its
"carry forward" meaning. The two semantics now compose cleanly:

  - next.field = "x"   → next wins (caller sets a new value)
  - next.field = null  → drop the field (caller intentionally clears)
  - next omits field   → carry forward prior (caller didn't touch it)

Update resetCodexThread() to send `codexSessionId: null` so the
reset actually drops the persisted thread under the new merge.

4 new hub tests cover: explicit clear of a single token, clear-one-
preserve-others independence, no-op clear on a never-set field, and
the success-ack value reflects the cleared blob. cli/src/codex tests
(224/224) and hub suite (301/301) green; bun typecheck clean.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • None.

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • No issues found in the latest diff. The prior Major about intentional metadata clears appears addressed by the Codex null sentinel and hub-side removal behavior. Residual risk: static review only; full suite results were not independently run in this review pass.

Testing

  • Not run (automation)

HAPI Bot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cursor sessionId can be cleared from session metadata on crash-archive, blocking future resume even when chat data is on disk

2 participants