Cursor sessionId can be cleared from session metadata on crash-archive, blocking future resume even when chat data is on disk #820

@heavygee

Summary

When a Cursor session ends for any reason (archiveReason set to "User terminated", "Session crashed", "Local launch failed: …", etc.), the runner's archive transition rewrites the hub's sessions.metadata row. If the CLI's locally cached metadata is stale or null at that moment, the rewrite ships a sparse blob, and cursorSessionId is cleared from the row along with the other flavor resume tokens (cursorSessionProtocol, codexSessionId, claudeSessionId, geminiSessionId, opencodeSessionId, kimiSessionId). The transcript and on-disk Cursor chat data remain intact, but the hub no longer has a resume token to hand to a re-spawned cursor-agent, so subsequent POST /api/sessions/:id/resume calls fail with resume_unavailable (or worse, ACP session/load rejects the legacy UUID and the launcher exits).

This is the systemic cause of the "session crashed → cannot resume even though chat is on disk" failure mode. Recovery today requires direct sqlite surgery on sessions.metadata to restore cursorSessionId/cursorSessionProtocol before resume succeeds.

Repro

  1. Start any flavored session (e.g. hapi cursor) — wait for cursor-agent to report its session UUID via onSessionFoundWithProtocol(...) so metadata.cursorSessionId is populated. Confirm with sqlite3 ~/.hapi/hapi.db "SELECT json_extract(metadata,'$.cursorSessionId') FROM sessions WHERE id='…'".
  2. End the session via any path that reaches runnerLifecycle.archiveAndClose (terminate, crash, local-launch failure, handoff). The archive write spreads the CLI's locally-cached Metadata and overrides lifecycleState/archivedBy/archiveReason.
  3. Re-query sessions.metadata. In a non-trivial fraction of cases on long-running installs, cursorSessionId is gone.

The "non-trivial fraction" is real: a DB audit on one operator's machine (~99 inactive sessions) found multiple lifecycleState=archived archiveReason='Session crashed' rows whose cursorSessionId was missing despite the on-disk Cursor chat store still being present and the transcript being intact.

Root cause

Two layers conspire:

1. CLI archive write replaces the entire metadata blob

cli/src/agent/runnerLifecycle.ts archiveAndClose:

options.session.updateMetadata((currentMetadata) => ({
    ...currentMetadata,
    lifecycleState: 'archived',
    lifecycleStateSince: Date.now(),
    archivedBy: 'cli',
    archiveReason
}))

The spread intends to preserve prior fields, but currentMetadata is the CLI's locally-cached snapshot (ApiSessionClient.metadata). That cache can be:

  • null when MetadataSchema.safeParse(raw.metadata) rejected the row at session bootstrap (cli/src/api/api.ts getOrCreateSession / getSession — lines 71-75 and 121-125 — silently null out the metadata on parse failure). In that case current = this.metadata ?? ({} as Metadata) in cli/src/api/apiSession.ts updateMetadata (lines 611-645) yields {}, and the archive payload becomes {lifecycleState:'archived', archivedBy:'cli', archiveReason} only.
  • stale if a hub-side update-session event was missed (or arrived at a lower metadataVersion) and the local cache never picked up a cursorSessionId write committed by another path.
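The null-cache case can be reproduced in isolation. The following is a minimal sketch, assuming a trimmed-down Metadata type and a hypothetical buildArchivePayload helper that mirrors the `this.metadata ?? {}` fallback followed by the archive spread; neither name is taken from the codebase.

```typescript
// Hypothetical, trimmed-down Metadata type; the real schema has more fields.
type Metadata = {
    cursorSessionId?: string
    cursorSessionProtocol?: string
    lifecycleState?: string
    lifecycleStateSince?: number
    archivedBy?: string
    archiveReason?: string
}

// Hypothetical helper: the `?? {}` fallback plus the archive spread, in one place.
function buildArchivePayload(cached: Metadata | null, archiveReason: string): Metadata {
    const current: Metadata = cached ?? {}
    return {
        ...current,
        lifecycleState: 'archived',
        lifecycleStateSince: Date.now(),
        archivedBy: 'cli',
        archiveReason
    }
}

// Healthy cache: the resume token rides along in the spread.
const healthy = buildArchivePayload({ cursorSessionId: 'abc-123' }, 'Session crashed')

// Null cache (parse failure at bootstrap): nothing to spread, so the token is absent.
const sparse = buildArchivePayload(null, 'Session crashed')

console.log('cursorSessionId' in healthy) // true
console.log('cursorSessionId' in sparse)  // false
```

The stale-cache case behaves the same way: whatever the cache never saw cannot appear in the payload.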

2. Hub update-metadata is an unconditional REPLACE

hub/src/store/sessions.ts updateSessionMetadata (called from hub/src/socket/handlers/cli/sessionHandlers.ts handleUpdateMetadata) hands the incoming blob to updateVersionedField which writes metadata = @field_value verbatim. There is no carve-out for protocol resume tokens. Whatever the CLI sends fully replaces the prior row.

So a sparse archive payload from the CLI lands in the DB as-is, and the resume token vanishes.
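To make the REPLACE semantics concrete, here is a small in-memory sketch. The Map stands in for the sessions table, and replaceMetadata, its return values, and the version numbers are illustrative assumptions rather than the hub's actual updateVersionedField; it also assumes the sparse write arrives with a matching version.

```typescript
type Row = { metadata: unknown; metadataVersion: number }

// Hypothetical in-memory stand-in for the sessions table.
const rows = new Map<string, Row>()

// Version check, then a verbatim overwrite of the whole blob.
function replaceMetadata(
    id: string,
    next: unknown,
    expectedVersion: number
): 'success' | 'version-mismatch' {
    const row = rows.get(id)
    if (!row || row.metadataVersion !== expectedVersion) return 'version-mismatch'
    rows.set(id, { metadata: next, metadataVersion: expectedVersion + 1 })
    return 'success'
}

// Prior row holds a resume token.
rows.set('s1', {
    metadata: { cursorSessionId: 'abc-123', lifecycleState: 'running' },
    metadataVersion: 4
})

// Sparse archive payload, as produced from a null CLI cache.
const result = replaceMetadata(
    's1',
    { lifecycleState: 'archived', archivedBy: 'cli', archiveReason: 'Session crashed' },
    4
)

const after = rows.get('s1')?.metadata as Record<string, unknown>
console.log(result, after['cursorSessionId']) // success undefined
```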

Fix proposal

The principled fix is at the hub layer because it's the single chokepoint for every metadata write (CLI, web, future surfaces) and protects against any caller that drops a resume token, intentionally or otherwise.

In hub/src/store/sessions.ts, change updateSessionMetadata to read the prior metadata, then carry forward a small allowlist of resume-token fields when the incoming write lacks them:

const PROTOCOL_RESUME_FIELDS = [
    'claudeSessionId',
    'codexSessionId',
    'geminiSessionId',
    'opencodeSessionId',
    'cursorSessionId',
    'cursorSessionProtocol',
    'kimiSessionId'
] as const

// Carries forward any resume token that the incoming blob omits.
// Non-object inputs are passed through untouched.
function preserveProtocolResumeFields(prior: unknown, next: unknown): unknown {
    if (!isPlainObject(prior) || !isPlainObject(next)) return next
    const merged: Record<string, unknown> = { ...(next as Record<string, unknown>) }
    for (const field of PROTOCOL_RESUME_FIELDS) {
        // Only fill gaps: a value present in `next` always wins.
        if (merged[field] === undefined && (prior as Record<string, unknown>)[field] !== undefined) {
            merged[field] = (prior as Record<string, unknown>)[field]
        }
    }
    return merged
}

Wrap the SELECT-then-UPDATE in a db.transaction so the merge is atomic with the version check.
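The read, version check, merge, and write can be pictured as one indivisible step. The sketch below models that with an in-memory Map; updateMerged, the Merge type, and keepCursorId are hypothetical names, and in the hub the body of updateMerged would run inside db.transaction rather than relying on single-threaded execution.

```typescript
type Row = { metadata: Record<string, unknown>; version: number }
type Merge = (
    prior: Record<string, unknown>,
    next: Record<string, unknown>
) => Record<string, unknown>

// Hypothetical in-memory table standing in for sessions.
const table = new Map<string, Row>()

function updateMerged(
    id: string,
    next: Record<string, unknown>,
    expectedVersion: number,
    merge: Merge
): boolean {
    const prior = table.get(id)                                   // SELECT
    if (!prior || prior.version !== expectedVersion) return false // version check
    table.set(id, {                                               // UPDATE
        metadata: merge(prior.metadata, next),
        version: prior.version + 1
    })
    return true
}

// Minimal merge for the demo: carry one token forward when the write omits it.
const keepCursorId: Merge = (prior, next) =>
    next['cursorSessionId'] === undefined && prior['cursorSessionId'] !== undefined
        ? { ...next, cursorSessionId: prior['cursorSessionId'] }
        : next

table.set('s1', { metadata: { cursorSessionId: 'abc-123' }, version: 7 })

const stale = updateMerged('s1', { lifecycleState: 'archived' }, 6, keepCursorId)
const fresh = updateMerged('s1', { lifecycleState: 'archived' }, 7, keepCursorId)

console.log(stale, fresh) // false true
console.log(table.get('s1')?.metadata['cursorSessionId']) // abc-123
```

Because the merge reads the same row that the version check validated, a writer with a stale version is rejected before it can influence the merged result.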

Properties

  • A caller that intentionally changes a flavor session id (e.g. a fresh bootstrap that just discovered a new id) still wins — merged[field] = next[field] if next has it.
  • A caller that omits the field (the bug today) gets the prior value preserved.
  • Other metadata fields are untouched: lifecycleState/archiveReason/etc. still replace as today.
  • The version check in updateVersionedField still runs, so concurrent writers still see version-mismatch.
  • The list mirrors pickExistingSessionMetadata in cli/src/agent/sessionFactory.ts (which already documents the same set as "native resume metadata").
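The first three properties can be checked directly against the proposed helper. This sketch restates it in self-contained form; the isPlainObject definition, the 'proto-a' marker value, and the `note` field are placeholders chosen for illustration, not values from the codebase.

```typescript
const PROTOCOL_RESUME_FIELDS = [
    'claudeSessionId',
    'codexSessionId',
    'geminiSessionId',
    'opencodeSessionId',
    'cursorSessionId',
    'cursorSessionProtocol',
    'kimiSessionId'
] as const

// Assumed helper; the hub's own isPlainObject may differ.
function isPlainObject(value: unknown): value is Record<string, unknown> {
    return typeof value === 'object' && value !== null && !Array.isArray(value)
}

function preserveProtocolResumeFields(prior: unknown, next: unknown): unknown {
    if (!isPlainObject(prior) || !isPlainObject(next)) return next
    const merged: Record<string, unknown> = { ...next }
    for (const field of PROTOCOL_RESUME_FIELDS) {
        if (merged[field] === undefined && prior[field] !== undefined) {
            merged[field] = prior[field]
        }
    }
    return merged
}

const prior = {
    cursorSessionId: 'old-id',
    cursorSessionProtocol: 'proto-a', // placeholder marker value
    lifecycleState: 'running',
    note: 'not a resume field'        // hypothetical non-allowlisted field
}

// Omission: both tokens are carried forward, other fields simply replace.
const archived = preserveProtocolResumeFields(prior, { lifecycleState: 'archived' }) as Record<string, unknown>

// Explicit overwrite: the incoming id wins, the omitted marker is still preserved.
const rebooted = preserveProtocolResumeFields(prior, { cursorSessionId: 'new-id' }) as Record<string, unknown>

console.log(archived['cursorSessionId'], archived['lifecycleState'], archived['note'])
// old-id archived undefined
console.log(rebooted['cursorSessionId'], rebooted['cursorSessionProtocol'])
// new-id proto-a
```

One detail worth noting: the guard compares against undefined only, so an incoming blob that carries an explicit null for a resume field would overwrite the prior value as written.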

Why merge-not-replace at the hub vs. carry-forward at the CLI

CLI-side carry-forward (e.g. fetching from hub before each archive write) is racy and adds a network round-trip in the cleanup path. The hub already has the prior row in hand for the version check; preserving the seven resume-token fields there is one extra read in a transaction and protects every write surface.

Routing-default note

Preserving just cursorSessionId is sufficient for legacy routing, because isLegacyCursorSession() in cli/src/cursor/utils/cursorProtocol.ts defaults to legacy when cursorSessionProtocol is unset and cursorSessionId is truthy. So a session with no protocol marker can still resume as long as its cursorSessionId is present: for rows that lost both the id and the protocol marker before this patch, restoring the id alone is enough.

Tests

Regression coverage at hub/src/store/sessions.test.ts:

  • Cursor session archive preserves cursorSessionId (write archive payload without cursorSessionId, assert prior value carried through)
  • Cursor session archive preserves cursorSessionProtocol (same shape)
  • Codex session archive preserves codexSessionId (proves the fix is generic)
  • Generic across all seven fields (the claude/codex/gemini/opencode/kimi session ids plus cursorSessionId and cursorSessionProtocol), as a table-driven test
  • Explicit-overwrite path: when next blob includes a different cursorSessionId, next wins (no false preservation)
  • Explicit-clear path: this is intentionally NOT supported — clearing a resume token requires a separate dedicated mutation if a use case ever shows up; today no caller needs it
  • Version mismatch behavior unchanged
  • Crash-path: simulate archiveReason='Session crashed' payload — id preserved end-to-end
  • Round-trip: archive → fetch → resume payload still has the id

A patch is on the way (fork PR coming).
