fix(hub): preserve session metadata across archive transitions#825
Merged
tiann merged 4 commits intoJun 8, 2026
Merged
Conversation
…sitions When a session ends (terminate, crash, local-launch failure, handoff), the runner's archive write replaces sessions.metadata wholesale. If the CLI's locally cached Metadata is null (e.g. Zod parse failed at bootstrap and api.ts nulled it out) or stale, the spread in archiveAndClose ships a sparse blob and the resume token (cursorSessionId, codexSessionId, claudeSessionId, etc.) gets cleared from the row even though the on-disk chat data is still intact. Fix at the hub layer because update-metadata is the single chokepoint for every metadata write surface (CLI, web, future): in the store-level updateSessionMetadata, read the prior row's metadata inside a transaction and carry forward a small allowlist of flavor resume tokens when the incoming write omits them. Explicit overwrites still win. The allowlist mirrors pickExistingSessionMetadata in sessionFactory.ts which already preserves the same fields on bootstrap. Closes tiann#820 Co-authored-by: Cursor <cursoragent@cursor.com>
Three bot findings on the initial patch: 1. (P1) Sparse archive payloads still resulted in metadata blobs that failed MetadataSchema parse downstream — required `path`/`host` were not in the carry-forward set, so even though the resume token survived, hub session cache and CLI getSession nulled-out the row and resume_unavailable came back. Add PARSE_IDENTITY_FIELDS = `path`, `host` to the carry-forward. 2. (P2) Preserving `cursorSessionProtocol` whenever it was omitted carried a stale protocol over to a freshly written `cursorSessionId`, misrouting a future remote resume. Pair-aware logic: drop the prior protocol when next sets a new id; preserve the protocol only when next is silent on both id and protocol. 3. (P2) The successful update-metadata broadcast emitted the pre-merge payload to other CLIs in the session room, so even though the DB row was preserved, peer caches diverged. Switch the broadcast value to `result.value` (the persisted merged value) so live caches stay in sync with the truth. Refactor preserveProtocolResumeFields into mergeSessionMetadata with two tiers (PARSE_IDENTITY_FIELDS + SIMPLE_RESUME_TOKENS) plus the cursor pair handler. 6 new tests cover the regressions; existing 16 still pass plus 1 new socket-level test for the broadcast. Co-authored-by: Cursor <cursoragent@cursor.com>
Bot P2 on the prior fix: PARSE_IDENTITY_FIELDS (path, host) made the blob parseable and SIMPLE_RESUME_TOKENS preserved the chat-id, but flavor and machineId were still being dropped by sparse archive payloads. Consequences: - flavor: hub/src/web/routes/sessions.ts and sync/syncEngine.ts read `metadata?.flavor ?? 'claude'` to pick which session id field to resume. With flavor missing, a Cursor/Codex/Gemini session was routed as Claude and the preserved cursorSessionId was ignored. - machineId: telegram/bot.ts and the CLI's resumable listing read `metadata?.machineId` to scope sessions to the current host. With machineId missing, the row dropped out of the resume picker. Add a third carry-forward tier ROUTING_FIELDS = [flavor, machineId] between PARSE_IDENTITY_FIELDS and SIMPLE_RESUME_TOKENS in mergeSessionMetadata. 3 new tests cover preservation, no-invention, and explicit override. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Findings
- [Major] Preserved resume tokens can no longer be cleared —
carryForwardIfMissingrestores prior token fields whenever the next metadata blob omits them, but an existing Codex path intentionally clearscodexSessionIdby deleting it beforeupdateMetadata(cli/src/codex/session.ts:104-107). After this change the hub merges the oldcodexSessionIdback into the row and returns it in the ack, soresetCodexThread()cannot actually reset the Codex thread and later resume/routing can keep using the stale thread id. Evidence:hub/src/store/sessions.ts:68.
Suggested fix:Or add a dedicated clear sentinel/API and updateconst EXPLICIT_CLEAR_FIELDS = ['codexSessionId'] as const function shouldCarryForward(field: string, next: Record<string, unknown>): boolean { return next[field] === undefined && !EXPLICIT_CLEAR_FIELDS.some((clearField) => clearField === field) }
resetCodexThread()to use it, with a regression test provingcodexSessionIdis removed.
Questions
- None.
Summary
- Review mode: initial
- One Major issue found: the new preservation rule blocks an existing metadata clear path. Residual risk: I did not run the full Bun suite in this review pass.
Testing
- Not run (automation)
HAPI Bot
Upstream cold-review (Major): the carry-forward semantics introduced
in the prior commits ("omit field → preserve from prior") collide
with cli/src/codex/session.ts resetCodexThread(), which intentionally
clears codexSessionId by deleting it from the metadata blob before
calling updateMetadata. With omit-as-preserve, the cleared id was
restored from the prior row and /clear on a Codex session no longer
dropped the persisted thread.
Add an explicit-clear sentinel: when next sets a carry-forward field
to `null`, the merge drops the key entirely from the persisted blob
(key removed; not stored as null since MetadataSchema fields are
`string().optional()`). `undefined` (key missing from next) keeps its
"carry forward" meaning. The two semantics now compose cleanly:
- next.field = "x" → next wins (caller sets a new value)
- next.field = null → drop the field (caller intentionally clears)
- next omits field → carry forward prior (caller didn't touch it)
Update resetCodexThread() to send `codexSessionId: null` so the
reset actually drops the persisted thread under the new merge.
4 new hub tests cover: explicit clear of a single token, clear-one-
preserve-others independence, no-op clear on a never-set field, and
the success-ack value reflects the cleared blob. cli/src/codex tests
(224/224) and hub suite (301/301) green; bun typecheck clean.
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Findings
- None.
Questions
- None.
Summary
- Review mode: follow-up after new commits
- No issues found in the latest diff. The prior Major about intentional metadata clears appears addressed by the Codex null sentinel and hub-side removal behavior. Residual risk: static review only; full suite results were not independently run in this review pass.
Testing
- Not run (automation)
HAPI Bot
This was referenced Jun 7, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hub-side preservation of session-routing and flavor resume fields across
update-metadatawrites, so the CLI's archive transition can no longer silently clearcursorSessionId(or any other flavor's resume id, the routing fieldsflavor/machineId, or the schema-requiredpath/host) from a session row.Closes #820
Problem
cli/src/agent/runnerLifecycle.tsarchiveAndClosedoes:The spread is supposed to preserve prior fields, but
currentMetadatais the local cache incli/src/api/apiSession.ts(current = this.metadata ?? ({} as Metadata)). That cache isnullwhenMetadataSchema.safeParserejected the row at session bootstrap (cli/src/api/api.tssilently nulls out metadata on parse failure), and can also be stale relative to the hub.hub/src/store/sessions.tsupdateSessionMetadatacallsupdateVersionedField, which does an unconditionalmetadata = @field_valuereplacement. So a sparse archive payload from the CLI becomes the new row contents and the resume identifiers, routing keys, and parse-required fields are all gone.DB audits on long-running installs showed
lifecycleState=archived/archiveReason='Session crashed'rows missingcursorSessionIddespite the on-disk Cursor chat store still being present. This is the systemic cause of "session crashed → cannot resume even though chat data is on disk".Approach
Single-chokepoint fix at the hub: the only place every metadata write goes through is
updateSessionMetadata. Wrap the existingupdateVersionedFieldcall in adb.transactionthat first reads the prior row's metadata, then carries fields forward when the incoming payload omits them.Three tiers, each addressing a distinct downstream failure mode:
PARSE_IDENTITY_FIELDS— required byMetadataSchemainshared/src/schemas.ts. If they're missing, hub session cache and CLIgetSessionreject the row withsafeParse → nulland resume can't even find the session, regardless of whether the resume token survived.ROUTING_FIELDS—flavoris whathub/src/web/routes/sessions.ts(multiple sites) andsync/syncEngine.tsuse to pick which flavor session-id field to read; withflavormissing, the?? 'claude'fallback misroutes a Cursor/Codex/Gemini session as Claude and the preserved token is ignored.machineIdis the filter the CLI's resumable listing uses to scope rows to the current host.SIMPLE_RESUME_TOKENS— flavor-specific resume identifiers with write-once-keep semantics. Mirror ofpickExistingSessionMetadataincli/src/agent/sessionFactory.ts(which already preserves the same set on bootstrap).Plus pair-aware handling for
cursorSessionId/cursorSessionProtocol: protocol is tied to a specific chat id, so whennextexplicitly sets a newcursorSessionIdthe priorcursorSessionProtocolis dropped (callers can include the new protocol with the new id); whennextis silent on both, the prior protocol carries with the prior id.The
update-metadatasocket handler now broadcasts the persisted (merged) value rather than the pre-merge input, so peer CLIs in the same session room don't overwrite their local cache with a tokenless blob.Properties:
next[field]defined always wins.updateVersionedFieldis unchanged; concurrent writers still seeversion-mismatch.SELECT metadataper metadata write inside a SQLite transaction.Why hub vs CLI: CLI-side carry-forward (e.g. fetching from hub before each write) is racy and adds a network round-trip in the cleanup path. The hub already reads the prior row for the version check, so the merge there is a strictly smaller surface and protects every write surface (CLI, web, future).
Routing default: preserving
cursorSessionIdis sufficient for legacy resume becausecli/src/cursor/utils/cursorProtocol.tsisLegacyCursorSession()defaults to legacy whencursorSessionProtocolis unset andcursorSessionIdis truthy. So this fix unblocks resume even for sessions that lost both the id and the protocol marker before the patch landed.Testing
hub/src/store/sessions.test.ts— 25 tests, all passing:cursorSessionId(crash-archive)cursorSessionProtocolcodexSessionId(proves generic across flavors)path/hostpreserved on sparse archive (parse identity)flavor/machineIdpreserved on sparse archive (routing)cursorSessionProtocoldropped when next sets a newcursorSessionIdcursorSessionProtocolpreserved when next is silent on both id and protocolcursorSessionProtocolhonored when id is unchangederrorresult.valuereturned byupdateSessionMetadatais the merged value (not the pre-merge input)hub/src/socket/handlers/cli/sessionHandlers.test.ts— new test asserts theupdate-metadataack and the room-event broadcast both carry the merged value, not the pre-merge payload.Full hub suite: 297 passed, 0 failed.
bun typecheckclean across cli/web/hub.Risks
update-metadatachange for the preserved fields: a write that wants to clear one of them by omission can no longer do so. No code path in this repo does that today — these fields are all write-once-keep semantics. If a future caller needs an explicit clear, it should be a dedicated mutation.