feat(agents): durable tool execution + transactional resume [#2835] by viniciusdacal · Pull Request #2841 · vertz-dev/vertz

viniciusdacal · 2026-04-19T16:24:10Z

Summary

Closes #2835. Ships durable tool execution for @vertz/agents — at-most-once handler execution across crashes, with automatic resume when run() is called with store + sessionId against a durable store. Production triagebot path is now unblocked (#2834 Anthropic adapter + this feature).

Public API Changes

Added (patch bump):

AgentStore.appendMessagesAtomic(sessionId, messages, session) — the durability primitive. Implemented on memory (throw), SQLite (db.transaction), and D1 (db.batch).
MemoryStoreNotDurableError — thrown at run() entry when memoryStore() is paired with sessionId. Catches chat-only agents that would silently lose data.
ToolDurabilityError — surfaced as a tool_result when an orphaned non-safeToRetry call is detected on resume. Exported from the main barrel so callers can pattern-match.
tool({ safeToRetry: true, ... }) — optional per-tool flag. When true, the framework may re-invoke the handler on resume. Default assumes side effects.
@vertz/agents/testing subpath — crashAfterToolResults(store, N) harness for writing resume tests.

Removed (pre-v1, no shim):

AgentLoopConfig.checkpointInterval — was a notification callback, not durability.
ReactLoopOptions.onCheckpoint — same.

Design

Three agent sign-offs (DX, Product, Technical) + human sign-off on the Rev 2 design doc before implementation. See:

Design: plans/agents-durable-resume.md
Phases: plans/agents-durable-resume/phase-01..05.md
Retrospective: plans/post-implementation-reviews/agents-durable-resume.md

Phases summary

Phase 1 — appendMessagesAtomic interface + 3 impls, MemoryStoreNotDurableError at run() entry, @vertz/agents/testing subpath with crash harness, E2E test as TDD RED + perf gate (~4ms for 10-step in-memory SQLite loop, well under 200ms budget).
Phase 2 — ReAct loop writes per-step atomically via a persistStep callback. End-of-run flush kept only for the non-durable path. checkpointInterval deleted across the package.
Phase 3 — Resume detection: orphaned assistant-with-toolCalls + missing tool_result → synthetic ToolDurabilityError tool_result. E2E flips GREEN. MVP complete.
Phase 4 — tool({ safeToRetry }) opt-in. Resume re-invokes handlers declared safe; falls back to the error for the rest.
Phase 5 — durable-resume guide in packages/mint-docs/, changeset, retrospective with CF DO manual-verification checklist.

E2E acceptance test status

Both scenarios green at packages/agents/src/__tests__/durable-resume.test.ts:

Side-effecting tool + crash → handler runs once, ToolDurabilityError in resumed history.
safeToRetry: true tool + crash → handler re-invokes on resume, real result persisted.

Plus memory-store guard fires at run() entry (before any LLM call) for both tool-calling and chat-only agents.

Test plan

vtz test in packages/agents — 243 pass, 13 skipped (unrelated to this PR).
vtz run typecheck in packages/agents — clean.
Perf gate (durable-resume.perf.test.ts) — ~4ms, < 200ms budget.
Rebased onto latest main (resolved an index.ts conflict with #2838, "feat(agents): export d1Store from @vertz/agents public API").
CF DO manual verification against triagebot staging — post-merge, per retro. Framework tests are merge-gating; DO verification is release-gating.

Breaking changes

Pre-v1, so technically none external. For existing callers:

checkpointInterval / onCheckpoint removed. Any caller using them must migrate — resume is the replacement if that's what they wanted.
memoryStore() + sessionId now throws at entry. Callers wanting in-process session continuity should use sqliteStore({ path: ':memory:' }); those wanting stateless should drop sessionId. All affected tests in this repo were migrated in this PR.

🤖 Generated with Claude Code

Rev 2 of the design for @vertz/agents durable tool execution and transactional resume. Approved via three agent reviews (DX / Product / Technical) + human sign-off on 2026-04-19. Implementation broken into five phase files; Phases 1-3 are the shippable MVP that closes the P0 correctness hole for side-effecting tool calls (no double-fire on resume after a crash between write phases).

Key design decisions locked:

- Activation is implicit: store + sessionId + non-memory store → durable. No flag.
- Tool opt-in named `safeToRetry` (not `idempotent`) to avoid the Stripe-idempotency-key semantic collision; default is side-effecting.
- Durability primitive is a single `AgentStore.appendMessagesAtomic()` method; two atomic writes per step (pre-dispatch / post-dispatch). No `toolCallStatus` field: the orphan sentinel is message history alone.
- Memory store under durable execution throws `MemoryStoreNotDurableError` at run() entry, not lazily.
- Deletes `checkpointInterval` + `onCheckpoint` pre-v1, no shim.

Related: #2834 (Anthropic adapter, merged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e guard [#2835]

Phase 1 Task 1 of the durable resume feature. Adds the durability primitive to the AgentStore contract so Phase 2 can rely on per-step atomic writes, and introduces MemoryStoreNotDurableError plus an isMemoryStore() brand so run() can fail fast when memoryStore + sessionId are combined.

- Extend AgentStore with appendMessagesAtomic(sessionId, messages, session). Implementations must run as one driver-level transaction over already-resolved data (no awaits between statements).
- memoryStore().appendMessagesAtomic() always throws MemoryStoreNotDurableError: memory is in-process and cannot provide the guarantee.
- sqlite-store + d1-store get stubbed throws; real implementations land in Phase 1 Tasks 2 & 3. The two no-throw-plain-error lint warnings are transient and disappear in the follow-up commits.
- Export MemoryStoreNotDurableError from @vertz/agents public barrel.
- Export MEMORY_STORE_KIND + isMemoryStore() as module-level helpers for run.ts to consume in Phase 1 Task 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 1 Task 2. Real implementation of the durability primitive for the SQLite store.

Session upsert + all message inserts run inside a single db.transaction(() => { ... }) callable, over already-resolved data, with no await inside the transaction (the @vertz/sqlite driver is sync). If any statement throws, the whole transaction rolls back, so readers never see partial state.

Covered by three tests:

- happy path: session + messages visible after one call
- rollback: a circular-reference toolCalls payload fails JSON.stringify mid-batch; no messages land and session.updatedAt is unchanged
- monotonic seq: successive calls continue the sequence numbering

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 1 Task 3. Real implementation of the durability primitive for the Cloudflare D1 store.

A single db.batch([...]) wraps the session upsert + every message INSERT in one implicit transaction. Each INSERT derives its seq from a subquery (COALESCE(MAX(seq), 0) + 1) so no pre-batch SELECT is required: D1 statements in the same batch see each other's writes, which gives the INSERTs monotonically increasing seq values.

Because D1 batch() is documented as implicitly transactional (https://developers.cloudflare.com/d1/worker-api/prepared-statements/#batch-statements), the whole batch commits atomically or rolls back on any statement failure.

Covered by four tests:

- happy path: session + messages visible after one call
- monotonic seq across two successive atomic appends
- rollback: a failing batch() rejects, no partial state visible
- toolCall metadata (toolCallId, toolName, toolCalls) round-trips

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
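A rough sketch of the batch shape this commit describes. The table and column names (`sessions`, `messages`, `seq`, `updated_at`) and the helper names are hypothetical, and `Db`/`Stmt` are minimal structural stand-ins for the D1 binding; only `prepare().bind()` and `batch()` mirror the documented D1 API. It is not the code in d1-store.

```typescript
// Minimal structural stand-ins for the D1 binding (prepare/bind/batch only).
interface Stmt { bind(...values: unknown[]): Stmt }
interface Db {
  prepare(sql: string): Stmt;
  batch(stmts: Stmt[]): Promise<unknown[]>;
}

interface Msg { role: string; content: string }

// Hypothetical schema: sessions(id, updated_at), messages(session_id, seq, role, content).
const UPSERT_SESSION =
  "INSERT INTO sessions (id, updated_at) VALUES (?1, ?2) " +
  "ON CONFLICT(id) DO UPDATE SET updated_at = excluded.updated_at";

// seq is derived by a subquery, so no SELECT has to run before the batch.
const INSERT_MESSAGE =
  "INSERT INTO messages (session_id, seq, role, content) " +
  "SELECT ?1, COALESCE(MAX(seq), 0) + 1, ?2, ?3 FROM messages WHERE session_id = ?1";

// Builds one session upsert followed by one INSERT per message.
function buildAtomicAppend(db: Db, sessionId: string, updatedAt: string, messages: Msg[]): Stmt[] {
  return [
    db.prepare(UPSERT_SESSION).bind(sessionId, updatedAt),
    ...messages.map((m) => db.prepare(INSERT_MESSAGE).bind(sessionId, m.role, m.content)),
  ];
}

// A single batch() call: the statements commit together or not at all.
function appendMessagesAtomic(
  db: Db,
  sessionId: string,
  updatedAt: string,
  messages: Msg[],
): Promise<unknown[]> {
  return db.batch(buildAtomicAppend(db, sessionId, updatedAt, messages));
}
```

Because every value is resolved before `batch()` is called, nothing awaits between statements, which is the constraint the AgentStore contract places on implementations.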

Phase 1 Task 4. Fail fast at run() entry, before any LLM call or store access, whenever a sessionId is paired with memoryStore(). The memory store cannot provide the durable per-step writes that resume requires, so silently allowing it would lose data on restart (especially for chat-only agents that never call a tool and never exercise appendMessagesAtomic otherwise).

- run() at the top of the hasStore branch checks isMemoryStore(store) when sessionId is present and throws MemoryStoreNotDurableError synchronously.
- Added three tests covering: tool-calling agent throws before LLM; chat-only agent throws (no silent loss); memoryStore without sessionId still works normally.
- Migrated 12 existing run.test.ts uses of memoryStore() + sessionId to sqliteStore({ path: ':memory:' }). Equivalent in-process behavior, transactional by construction.
- Same migration in create-agent-runner.test.ts (3 uses).
- types.test-d.ts left untouched: its memoryStore() call doesn't pass a sessionId and is a pure type check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
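The guard can be pictured roughly as follows. This is a standalone sketch, not the code in run.ts: the symbol-based brand, the `StoreLike` shape, and `assertDurableStore` are assumptions made for illustration; only the names `MEMORY_STORE_KIND`, `isMemoryStore`, and `MemoryStoreNotDurableError` come from the commits.

```typescript
// Hypothetical brand: the commits name MEMORY_STORE_KIND and isMemoryStore()
// but do not say how the brand is represented.
const MEMORY_STORE_KIND = Symbol("memory-store");

interface StoreLike { readonly kind?: symbol }

class MemoryStoreNotDurableError extends Error {
  constructor() {
    super("memoryStore() cannot be combined with sessionId: it is not durable");
    this.name = "MemoryStoreNotDurableError";
  }
}

function memoryStore(): StoreLike {
  return { kind: MEMORY_STORE_KIND };
}

function isMemoryStore(store: StoreLike): boolean {
  return store.kind === MEMORY_STORE_KIND;
}

// Intended to run first in run(), before any LLM call or store access.
function assertDurableStore(store: StoreLike | undefined, sessionId: string | undefined): void {
  if (store !== undefined && sessionId !== undefined && isMemoryStore(store)) {
    throw new MemoryStoreNotDurableError();
  }
}
```

Checking at entry rather than on the first atomic write is what catches chat-only agents, which would otherwise never reach appendMessagesAtomic.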

[#2835]

Phase 1 Task 5. Adds the test-only subpath export '@vertz/agents/testing' and its first helper, crashAfterToolResults(store, failOnCallNumber = 2), used by durable-resume.test.ts to simulate a crash between the pre-dispatch write and the post-dispatch write. The helper wraps any AgentStore, counts appendMessagesAtomic calls, and throws a sentinel Error on the Nth call; all other methods pass through unchanged.

- src/testing/crash-harness.ts: the factory.
- src/testing/index.ts: barrel for the subpath.
- src/testing/crash-harness.test.ts: 4 tests covering pass-through behavior, Nth-call fail, and delegation of non-atomic methods.
- package.json exports: ./testing entry with dist/testing/index.{js,d.ts}.
- build.config.ts: add src/testing/index.ts as an entry so dts runs.

Verified: `dist/testing/index.js` and `dist/testing/index.d.ts` produced by the build. Lint clean (the sentinel plain-Error throw has an inline disable comment with rationale). 227/227 agent tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
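The wrapping technique behind the harness can be sketched like this. The `Store` interface here is a deliberately simplified, synchronous stand-in (the real AgentStore has more methods and a different signature), and `inMemoryFake` and `crashOnNthAtomicWrite` are hypothetical names, not the exported helper.

```typescript
interface Message { role: string; content: string }

// Simplified, synchronous store shape for illustration only.
interface Store {
  appendMessagesAtomic(sessionId: string, messages: Message[]): void;
  loadMessages(sessionId: string): Message[];
}

// Tiny fake backing store so the sketch is self-contained.
function inMemoryFake(): Store {
  const rows = new Map<string, Message[]>();
  return {
    appendMessagesAtomic(sessionId, messages) {
      rows.set(sessionId, [...(rows.get(sessionId) ?? []), ...messages]);
    },
    loadMessages(sessionId) {
      return rows.get(sessionId) ?? [];
    },
  };
}

// Wraps a store: the Nth appendMessagesAtomic call throws before reaching the
// inner store; every other call and method is delegated unchanged.
function crashOnNthAtomicWrite(inner: Store, failOnCallNumber = 2): Store {
  let calls = 0;
  return {
    appendMessagesAtomic(sessionId, messages) {
      calls += 1;
      if (calls === failOnCallNumber) {
        throw new Error("simulated crash");
      }
      inner.appendMessagesAtomic(sessionId, messages);
    },
    loadMessages: (sessionId) => inner.loadMessages(sessionId),
  };
}
```

With the default of 2, the first write (assistant message with tool calls) lands and the second (tool results) is lost, which is the crash window the E2E test targets.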

Phase 1 Task 6. Lands the feature-level E2E test that drives Phases 2 and 3 to GREEN, plus a perf regression gate for the durable-resume write pattern.

__tests__/durable-resume.test.ts is the MVP contract. Three cases:

1. "Does not re-invoke handler" (THE key assertion). Scripted LLM asks for postSlack; crash harness throws on the 2nd appendMessagesAtomic call (simulating a crash AFTER the handler dispatched). Resume with the same sessionId must NOT re-invoke the handler; the stored message history must include a ToolDurabilityError tool_result.
   Currently RED: Phase 1 doesn't call appendMessagesAtomic in the loop so the crash never fires. Phase 2 wires the atomic writes (still RED because no resume logic). Phase 3 surfaces the error → GREEN. The feature branch carries the intermediate RED commits by design; the final PR to main is green. .claude/rules/tdd.md forbids .skip, so the test stays un-skipped.
2. "memoryStore + sessionId throws at entry": GREEN already (Task 4 landed that path). Doubles as an integration assertion that the guard fires before any LLM call.
3. Type-level @ts-expect-error on `safeToRetry: true`: GREEN until Phase 4 adds the field. If Phase 4's type wiring is missed, this directive fires "unused" and alerts.

__tests__/durable-resume.perf.test.ts gates a 10-step scripted loop on sqliteStore(:memory:) under 200ms. Current measurement: ~4ms. If Phase 2's per-step atomic writes regress this badly, CI alarms.

Scope boundary: the current run() requires the session row to exist before run() with a fixed sessionId, so the test pre-seeds it via saveSession(). This matches the DO walkthrough pattern and is not a framework change in scope for Phase 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 2 of durable tool execution. Wires appendMessagesAtomic into the loop so tool-calling steps commit durably in two atomic writes per step (assistant-with-toolCalls, then tool_results), and deletes the obsolete checkpointInterval + onCheckpoint pair.

reactLoop:

- Remove checkpointInterval + onCheckpoint from ReactLoopOptions (pre-v1, no shim).
- Add persistStep callback: fires with phase='assistant-with-tool-calls' after the assistant message is pushed, then with phase='tool-results' after all result messages for the step are pushed. If persistStep rejects, the error propagates out.

run.ts:

- Detect durable mode: store + sessionId + !isMemoryStore.
- When durable, provide persistStep that calls appendMessagesAtomic with a fresh session snapshot each write. The new user message is bundled into the FIRST persistStep call (so a crash before any step leaves no partial state; nothing persists until work begins).
- Skip the end-of-run saveSession + appendMessages pair when durable; flush any trailing messages (e.g. text-only final assistant) via one last atomic call.
- Non-durable path (stateless or no sessionId) unchanged.

Config:

- Delete checkpointInterval from AgentLoopConfig + agent() defaults + all tests that referenced the obsolete callback. The types.test-d.ts regression guard now asserts that checkpointInterval is rejected.

E2E test update:

- durable-resume.test.ts is still RED on the single "ToolDurabilityError surfaces in history" assertion; Phase 3 adds the resume logic that writes the synthetic error tool_result. Everything else in the file passes: handler runs once pre-crash, crash trips correctly on write #2, second run() completes cleanly (scripted LLM says "Done" rather than re-requesting the tool).

232 pass, 1 fail (the planned Phase-3-target assertion), 0 lint/type regressions. The 8 pre-existing no-throw-plain-error warnings in react-loop.ts / react-loop.test.ts / agent.ts are unchanged by this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
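The two-writes-per-step ordering can be illustrated with a small standalone sketch. It is synchronous and heavily simplified; `runToolStep`, the `Message` shape, and the `pending` parameter are assumptions for illustration, and only the two phase names come from the commit message.

```typescript
type Phase = "assistant-with-tool-calls" | "tool-results";

interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
  toolCallId?: string;
}

type PersistStep = (phase: Phase, messages: Message[]) => void;

// One tool-calling step: write #1, dispatch the handler, write #2.
function runToolStep(
  pending: Message[], // not-yet-persisted messages, e.g. the new user message
  toolCallId: string,
  handler: () => string,
  persistStep: PersistStep,
): void {
  const assistant: Message = { role: "assistant", content: `call ${toolCallId}` };

  // Write #1 bundles pending messages with the assistant message, so a crash
  // before this point leaves nothing persisted.
  persistStep("assistant-with-tool-calls", [...pending, assistant]);

  // The side effect happens between the two writes.
  const output = handler();

  // Write #2: if this throws, the error propagates and the stored history
  // ends with an assistant tool call that has no result.
  persistStep("tool-results", [{ role: "tool", content: output, toolCallId }]);
}
```

A failure in the second write is exactly the state Phase 3 has to recognize on resume: the handler has run, but the history does not show its result.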

Phase 3 of durable tool execution: the MVP completes here.

When run() resumes a session whose last assistant message has unmatched tool_call ids (the crash signature for a side-effecting tool that ran but whose tool_result was lost before write #2 committed), the framework now:

1. Constructs a ToolDurabilityError per missing tool_call.
2. Serializes each as a `tool` role message (same JSON shape the loop already uses for handler errors, plus a `kind: 'tool-durability-error'` discriminator so callers + LLMs can pattern-match).
3. Commits the synthetic tool_results atomically via appendMessagesAtomic before the loop's first LLM call.
4. Extends previousMessages with those synthetic rows so the LLM sees the error in-band and decides recovery (check external state, ask user, abort: its call).

Also adds `findOrphanAssistantWithToolCalls()`, a pure message-history scan, no new schema column required. The sentinel is "assistant with toolCalls + no matching tool_result" per the design's crash taxonomy.

Export:

- ToolDurabilityError class from @vertz/agents barrel (so callers inspecting resumed history can pattern-match).

Tests:

- errors.test.ts: class shape + serialized encoding.
- durable-resume.test.ts: the main E2E is now fully GREEN. Handler runs exactly once across crash + resume; ToolDurabilityError surfaces in history with the correct toolName/toolCallId.

Total: 236/236 tests pass. Typecheck clean. Lint clean on Phase 3 additions.

Phases 1–3 = MVP. Phase 4 adds the safeToRetry opt-in; Phase 5 ships docs + changeset + PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
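A minimal sketch of the orphan scan, assuming a simplified message shape. `findOrphanedToolCalls` and `durabilityErrorMessage` are hypothetical names (not the framework's `findOrphanAssistantWithToolCalls()`), and the payload fields other than the `kind` discriminator are guesses at a plausible encoding.

```typescript
interface ToolCall { id: string; name: string }

interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
  toolCalls?: ToolCall[];
  toolCallId?: string;
}

// Returns the tool calls of the last assistant message that have no matching
// tool message after it. Pure scan over history; no status column involved.
function findOrphanedToolCalls(history: Message[]): ToolCall[] {
  let lastAssistant = -1;
  for (let i = history.length - 1; i >= 0; i--) {
    const m = history[i];
    if (m !== undefined && m.role === "assistant") {
      lastAssistant = i;
      break;
    }
  }
  if (lastAssistant === -1) return [];

  const calls = history[lastAssistant]?.toolCalls ?? [];
  const answered = new Set<string>();
  for (const m of history.slice(lastAssistant + 1)) {
    if (m.role === "tool" && m.toolCallId !== undefined) {
      answered.add(m.toolCallId);
    }
  }
  return calls.filter((call) => !answered.has(call.id));
}

// Synthetic tool message for an orphaned call. Only the `kind` value is taken
// from the commit message; the rest of the payload is illustrative.
function durabilityErrorMessage(call: ToolCall): Message {
  return {
    role: "tool",
    toolCallId: call.id,
    content: JSON.stringify({
      kind: "tool-durability-error",
      toolName: call.name,
      toolCallId: call.id,
    }),
  };
}
```

A caller inspecting resumed history could parse each `tool` message's content and branch on `kind === 'tool-durability-error'`.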

Phase 4 of durable tool execution. Pure-read tools or handlers that are safe to run twice can now declare `safeToRetry: true` to opt out of the conservative ToolDurabilityError path. On resume with a missing tool_result, the framework re-invokes those handlers and persists the real result instead of asking the LLM to decide recovery.

Types:

- ToolConfig.safeToRetry?: boolean. Public, documented with an explicit callout that this is about resume replay, NOT HTTP retry.
- ToolDefinition.safeToRetry?: boolean. Forwarded by tool().

run.ts (resume dispatch):

- Orphan handling moves from the session-load branch to after ctx/agents/resolvedTools are built, so executeToolCall can run with a real ToolContext.
- Per missing tool_call: if resolvedTools[name]?.safeToRetry, call executeToolCall, which handles input validation + output validation + handler errors identically to the loop. Otherwise surface the ToolDurabilityError as before.
- Re-invocations + durability-error messages persist atomically in a single appendMessagesAtomic call (batch per resume, not per tool).

react-loop.ts:

- Export executeToolCall + ToolCallResult so run.ts can reuse the same code path (tool-not-found / no-handler / input/output validation / handler errors all encoded the same way).

Tests:

- tool.test.ts: safeToRetry forwards correctly (true → true, omitted → undefined).
- durable-resume.test.ts: new E2E scenario. A safeToRetry tool crashes mid-step, handler re-invokes on resume, real result lands, NO ToolDurabilityError appears in history. Handler count goes 1 → 2, which is exactly the point.
- Updated the type-level test: safeToRetry: true compiles (negative was removed), safeToRetry: 'yes' is rejected.

240/240 tests pass. 2 pre-existing no-throw-plain-error warnings unchanged by this commit.

Phases 1–4 deliver the full feature. Phase 5 wraps docs + changeset + retrospective + PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
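The per-call decision on resume might look roughly like this. It is a simplified, synchronous sketch: `resolveOrphans`, `ToolDef`, and the message shape are hypothetical, and the input/output validation and handler-error encoding that the real executeToolCall performs are omitted.

```typescript
interface ToolCall { id: string; name: string; input: unknown }

interface ToolDef {
  safeToRetry?: boolean;
  handler: (input: unknown) => unknown;
}

interface ToolMessage { role: "tool"; toolCallId: string; content: string }

// For each orphaned call: replay the handler if the tool is declared
// safeToRetry, otherwise surface a durability error for the LLM to handle.
function resolveOrphans(orphans: ToolCall[], tools: Record<string, ToolDef>): ToolMessage[] {
  return orphans.map((call): ToolMessage => {
    const def = tools[call.name];
    if (def !== undefined && def.safeToRetry === true) {
      // Replay: the handler may run a second time, which the author opted into.
      return {
        role: "tool",
        toolCallId: call.id,
        content: JSON.stringify(def.handler(call.input)),
      };
    }
    // Default: assume side effects and do not re-invoke.
    return {
      role: "tool",
      toolCallId: call.id,
      content: JSON.stringify({
        kind: "tool-durability-error",
        toolName: call.name,
        toolCallId: call.id,
      }),
    };
  });
}
```

The returned array would then be committed in one atomic write, matching the "batch per resume, not per tool" note above.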

Phase 5: the release wrap for durable tool execution.

- packages/mint-docs/guides/agents/durable-resume.mdx: a full user-facing guide covering activation (store + sessionId on durable store), the safeToRetry flag with an explicit "this is NOT network retry" callout, the crash-window taxonomy table in user terms, cost guidance (~2 writes/step on D1), and the ToolDurabilityError inspection pattern for resumed history.
- docs.json nav updated to include the new page.
- .changeset/agents-durable-resume.md: patch bump with the full public-surface diff (new + removed).
- plans/post-implementation-reviews/agents-durable-resume.md: retro covering what shipped, what worked, what didn't, manual-verification checklist for the CF DO staging run, and open follow-ups.

The manual CF DO verification is release-gating, not merge-gating: the framework tests all pass, and the retro captures the checklist to run post-merge against triagebot staging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pure oxfmt pass on the new durable-resume guide to unblock CI. No content change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…drift)

These two files are identical to origin/main but fail `oxfmt --check`, blocking CI on this PR. Pure whitespace/wrap normalization, no content change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

viniciusdacal and others added 12 commits April 19, 2026 13:52

style(mint-docs): oxfmt durable-resume guide [#2835]

3902d93

viniciusdacal force-pushed the feat/agents-durable-resume branch from e883e53 to 3902d93 Compare April 19, 2026 16:53

viniciusdacal merged commit 091282b into main Apr 19, 2026
6 checks passed

github-actions Bot mentioned this pull request Apr 19, 2026

chore: version packages #2854

Merged

viniciusdacal merged 13 commits into main from feat/agents-durable-resume
