Cursor sessionId can be cleared from session metadata on crash-archive, blocking future resume even when chat data is on disk

## Summary

When a Cursor session ends for any reason (`archiveReason` set to `"User terminated"`, `"Session crashed"`, `"Local launch failed: …"`, etc.), the runner's archive transition rewrites the hub's `sessions.metadata` row. If the CLI's locally cached `metadata` is stale or `null` at that moment, the rewrite ships a sparse blob, and `cursorSessionId` is cleared from the row along with the other flavor resume tokens (`cursorSessionProtocol`, `codexSessionId`, `claudeSessionId`, `geminiSessionId`, `opencodeSessionId`, `kimiSessionId`). The transcript and on-disk Cursor chat data remain intact, but the hub no longer has a resume token to hand to a re-spawned `cursor-agent`. Subsequent `POST /api/sessions/:id/resume` calls then fail with `resume_unavailable`, or worse, ACP `session/load` rejects the legacy UUID and the launcher exits.

This is the systemic cause of the "session crashed → cannot resume even though chat is on disk" failure mode. Recovery today requires direct sqlite surgery on `sessions.metadata` to restore `cursorSessionId`/`cursorSessionProtocol` before resume succeeds.

## Repro

1. Start any flavored session (e.g. `hapi cursor`) — wait for `cursor-agent` to report its session UUID via `onSessionFoundWithProtocol(...)` so `metadata.cursorSessionId` is populated. Confirm with `sqlite3 ~/.hapi/hapi.db "SELECT json_extract(metadata,'$.cursorSessionId') FROM sessions WHERE id='…'"`.
2. End the session via any path that reaches `runnerLifecycle.archiveAndClose` (terminate, crash, local-launch failure, handoff). The archive write spreads the CLI's locally-cached `Metadata` and overrides `lifecycleState`/`archivedBy`/`archiveReason`.
3. Re-query `sessions.metadata`. In a non-trivial fraction of cases on long-running installs, `cursorSessionId` is gone.

The "non-trivial fraction" is real: a DB audit on one operator's machine (~99 inactive sessions) found multiple `lifecycleState=archived archiveReason='Session crashed'` rows whose `cursorSessionId` was missing despite the on-disk Cursor chat store still being present and the transcript being intact.

## Root cause

Two layers conspire:

### 1. CLI archive write replaces the entire metadata blob

`cli/src/agent/runnerLifecycle.ts` `archiveAndClose`:

```ts
options.session.updateMetadata((currentMetadata) => ({
    ...currentMetadata,
    lifecycleState: 'archived',
    lifecycleStateSince: Date.now(),
    archivedBy: 'cli',
    archiveReason
}))
```

The spread *intends* to preserve prior fields, but `currentMetadata` is the CLI's locally-cached snapshot (`ApiSessionClient.metadata`). That cache can be:

- **`null`** when `MetadataSchema.safeParse(raw.metadata)` rejected the row at session bootstrap (`cli/src/api/api.ts` `getOrCreateSession` / `getSession` — lines 71-75 and 121-125 — silently null out the metadata on parse failure). In that case `current = this.metadata ?? ({} as Metadata)` in `cli/src/api/apiSession.ts` `updateMetadata` (lines 611-645) yields `{}`, and the archive payload becomes `{lifecycleState:'archived', archivedBy:'cli', archiveReason}` only.
- **stale** if a hub-side `update-session` event was missed (or arrived at a lower `metadataVersion`) and the local cache never picked up a `cursorSessionId` write committed by another path.
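
A minimal sketch of the `null`-cache case, assuming a heavily simplified `Metadata` type and a hypothetical `buildArchivePayload` helper that stands in for the `updateMetadata` callback above. The id and protocol values are placeholders, not real ones:

```ts
// Simplified stand-in for the CLI's Metadata type (assumption, not the real schema).
type Metadata = {
    cursorSessionId?: string
    cursorSessionProtocol?: string
    lifecycleState?: string
    lifecycleStateSince?: number
    archivedBy?: string
    archiveReason?: string
}

// Hypothetical helper mirroring the archive write: a null cache falls back to {}.
function buildArchivePayload(cached: Metadata | null, archiveReason: string): Metadata {
    const current = cached ?? ({} as Metadata)
    return {
        ...current,
        lifecycleState: 'archived',
        lifecycleStateSince: Date.now(),
        archivedBy: 'cli',
        archiveReason
    }
}

// Healthy cache: the spread carries the resume token through.
const healthy = buildArchivePayload(
    { cursorSessionId: 'abc-123', cursorSessionProtocol: 'placeholder-protocol' },
    'Session crashed'
)

// Null cache: only the four archive fields survive.
const sparse = buildArchivePayload(null, 'Session crashed')

console.log('cursorSessionId' in healthy) // true
console.log(Object.keys(sparse)) // [ 'lifecycleState', 'lifecycleStateSince', 'archivedBy', 'archiveReason' ]
```

The spread is only as good as the snapshot it spreads, which is why the CLI-side code looks correct in review while still producing a sparse blob.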

### 2. Hub `update-metadata` is an unconditional REPLACE

`hub/src/store/sessions.ts` `updateSessionMetadata` (called from `hub/src/socket/handlers/cli/sessionHandlers.ts` `handleUpdateMetadata`) hands the incoming blob to `updateVersionedField` which writes `metadata = @field_value` verbatim. There is no carve-out for protocol resume tokens. Whatever the CLI sends fully replaces the prior row.

So a sparse archive payload from the CLI lands in the DB as-is, and the resume token vanishes.
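
The effect of a versioned REPLACE can be sketched with an in-memory map standing in for the `sessions` table. Everything here is an assumption for illustration: `rows`, `replaceMetadata`, and the result shape are hypothetical and only model the behavior described above, not the hub's actual code.

```ts
// In-memory stand-in for the sessions table (assumption for illustration).
type Row = { metadata: unknown; metadataVersion: number }
type UpdateResult =
    | { result: 'success'; version: number }
    | { result: 'version-mismatch'; version: number }

const rows = new Map<string, Row>()

// Models an unconditional replace guarded only by a version check.
function replaceMetadata(id: string, next: unknown, expectedVersion: number): UpdateResult {
    const row = rows.get(id)
    if (!row || row.metadataVersion !== expectedVersion) {
        return { result: 'version-mismatch', version: row?.metadataVersion ?? 0 }
    }
    // The incoming blob is stored verbatim; nothing from the prior row survives.
    rows.set(id, { metadata: next, metadataVersion: expectedVersion + 1 })
    return { result: 'success', version: expectedVersion + 1 }
}

rows.set('s1', { metadata: { path: '/work', cursorSessionId: 'abc-123' }, metadataVersion: 3 })

const outcome = replaceMetadata(
    's1',
    { lifecycleState: 'archived', archivedBy: 'cli', archiveReason: 'Session crashed' },
    3
)
const stored = rows.get('s1')?.metadata as Record<string, unknown>

console.log(outcome.result, 'cursorSessionId' in stored) // success false
```

The version check passes, so the write is reported as successful even though it discarded the token.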

## Fix proposal

The principled fix is **at the hub layer** because it's the single chokepoint for every metadata write (CLI, web, future surfaces) and protects against any caller that drops a resume token, intentionally or otherwise.

In `hub/src/store/sessions.ts`, change `updateSessionMetadata` to read the prior metadata, then carry forward a small allowlist of resume-token fields when the incoming write lacks them:

```ts
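const PROTOCOL_RESUME_FIELDS = [
    'claudeSessionId',
    'codexSessionId',
    'geminiSessionId',
    'opencodeSessionId',
    'cursorSessionId',
    'cursorSessionProtocol',
    'kimiSessionId'
] as const

// Assumed helper (not shown in the original snippet); use the repo's own guard if one exists.
function isPlainObject(value: unknown): value is Record<string, unknown> {
    return typeof value === 'object' && value !== null && !Array.isArray(value)
}

// Carry resume tokens forward from the prior row when the incoming blob omits them.
function preserveProtocolResumeFields(prior: unknown, next: unknown): unknown {
    if (!isPlainObject(prior) || !isPlainObject(next)) return next
    const merged: Record<string, unknown> = { ...next }
    for (const field of PROTOCOL_RESUME_FIELDS) {
        // Only a missing (undefined) field triggers preservation; any value present in `next` wins.
        if (merged[field] === undefined && prior[field] !== undefined) {
            merged[field] = prior[field]
        }
    }
    return merged
}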
```

Wrap the SELECT-then-UPDATE in a `db.transaction` so the merge is atomic with the version check.
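
As a rough sketch of how the read, version check, and merge fit together, the following uses an in-memory map instead of sqlite. `table` and `updateMetadataPreservingTokens` are hypothetical names; in the hub the same three steps would sit inside one `db.transaction` callback, whereas here the synchronous function body plays that role.

```ts
const RESUME_FIELDS = [
    'claudeSessionId',
    'codexSessionId',
    'geminiSessionId',
    'opencodeSessionId',
    'cursorSessionId',
    'cursorSessionProtocol',
    'kimiSessionId'
] as const

// In-memory stand-in for the sessions table (assumption for illustration).
type StoredRow = { metadata: Record<string, unknown>; version: number }
const table = new Map<string, StoredRow>()

function updateMetadataPreservingTokens(
    id: string,
    next: Record<string, unknown>,
    expectedVersion: number
): 'success' | 'version-mismatch' {
    // Step 1: read the prior row (the SELECT).
    const prior = table.get(id)
    // Step 2: version check, unchanged from a plain replace.
    if (!prior || prior.version !== expectedVersion) return 'version-mismatch'
    // Step 3: merge, then write (the UPDATE).
    const merged: Record<string, unknown> = { ...next }
    for (const field of RESUME_FIELDS) {
        if (merged[field] === undefined && prior.metadata[field] !== undefined) {
            merged[field] = prior.metadata[field]
        }
    }
    table.set(id, { metadata: merged, version: expectedVersion + 1 })
    return 'success'
}

table.set('s1', { metadata: { path: '/work', cursorSessionId: 'abc-123' }, version: 3 })

const status = updateMetadataPreservingTokens(
    's1',
    { lifecycleState: 'archived', archivedBy: 'cli', archiveReason: 'Session crashed' },
    3
)
const after = table.get('s1')

console.log(status, after?.metadata.cursorSessionId, after?.metadata.path)
// success abc-123 undefined
```

Note that `path` is still dropped: only the allowlisted resume tokens are carried forward, so the write keeps its replace semantics for every other field.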

### Properties

- A caller that intentionally **changes** a flavor session id (e.g. a fresh bootstrap that just discovered a new id) still wins — `merged[field] = next[field]` if next has it.
- A caller that **omits** the field (the bug today) gets the prior value preserved.
- Other metadata fields are untouched: lifecycleState/archiveReason/etc. still replace as today.
- The version check in `updateVersionedField` still runs, so concurrent writers still see `version-mismatch`.
- The list mirrors `pickExistingSessionMetadata` in `cli/src/agent/sessionFactory.ts` (which already documents the same set as "native resume metadata").

### Why merge-not-replace at the hub vs. carry-forward at the CLI

CLI-side carry-forward (e.g. fetching from hub before each archive write) is racy and adds a network round-trip in the cleanup path. The hub already has the prior row in hand for the version check; preserving the seven resume-token fields there is one extra read in a transaction and protects every write surface.

### Routing-default note

Preserving just `cursorSessionId` is sufficient for legacy routing because `cli/src/cursor/utils/cursorProtocol.ts` `isLegacyCursorSession()` defaults to legacy when `cursorSessionProtocol` is unset and `cursorSessionId` is truthy. So a session whose protocol marker was already dropped still resumes as long as the id is present. The fix cannot recover an id that was cleared before the patch; those rows still need the id restored manually, after which the id alone is enough.

## Tests

Regression coverage at `hub/src/store/sessions.test.ts`:

- Cursor session archive preserves `cursorSessionId` (write archive payload without `cursorSessionId`, assert prior value carried through)
- Cursor session archive preserves `cursorSessionProtocol` (same shape)
- Codex session archive preserves `codexSessionId` (proves the fix is generic)
- Generic across all seven fields (the claude/codex/gemini/opencode/kimi session ids plus `cursorSessionId` and `cursorSessionProtocol`) via a table-driven test
- Explicit-overwrite path: when next blob includes a different `cursorSessionId`, next wins (no false preservation)
- Explicit-clear path: this is intentionally NOT supported — clearing a resume token requires a separate dedicated mutation if a use case ever shows up; today no caller needs it
- Version mismatch behavior unchanged
- Crash-path: simulate `archiveReason='Session crashed'` payload — id preserved end-to-end
- Round-trip: archive → fetch → resume payload still has the id
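
The table-driven and explicit-overwrite cases could look roughly like the following framework-free sketch. It assumes a local `preserve` function equivalent to the proposed merge and does not use the repo's actual test runner or fixtures.

```ts
const FIELDS = [
    'claudeSessionId',
    'codexSessionId',
    'geminiSessionId',
    'opencodeSessionId',
    'cursorSessionId',
    'cursorSessionProtocol',
    'kimiSessionId'
] as const

// Local stand-in for the proposed merge (assumption: same semantics as the hub helper).
function preserve(
    prior: Record<string, unknown>,
    next: Record<string, unknown>
): Record<string, unknown> {
    const merged: Record<string, unknown> = { ...next }
    for (const field of FIELDS) {
        if (merged[field] === undefined && prior[field] !== undefined) {
            merged[field] = prior[field]
        }
    }
    return merged
}

const archivePayload = {
    lifecycleState: 'archived',
    archivedBy: 'cli',
    archiveReason: 'Session crashed'
}

const failures: string[] = []
for (const field of FIELDS) {
    // Omitted field: the prior value must be carried through.
    const kept = preserve({ [field]: `prior-${field}` }, archivePayload)
    if (kept[field] !== `prior-${field}`) failures.push(`${field}: not preserved`)

    // Explicit overwrite: the incoming value must win.
    const overwritten = preserve({ [field]: 'old' }, { ...archivePayload, [field]: 'new' })
    if (overwritten[field] !== 'new') failures.push(`${field}: overwrite lost`)
}

console.log(failures.length === 0 ? 'all cases pass' : failures.join('\n'))
```

Driving both cases from the same field list keeps the test in step with the allowlist when a new flavor is added.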

A patch is on the way (fork PR coming).

