
feat(agents): durable tool execution + transactional resume [#2835] #2841

Merged
viniciusdacal merged 13 commits into main from feat/agents-durable-resume on Apr 19, 2026



@viniciusdacal
Contributor

Summary

Closes #2835. Ships durable tool execution for @vertz/agents: at-most-once handler execution across crashes for tools not marked safeToRetry, with automatic resume when run() is called with store + sessionId against a durable store. The production triagebot path is now unblocked (#2834 Anthropic adapter + this feature).

Public API Changes

Added (patch bump):

  • AgentStore.appendMessagesAtomic(sessionId, messages, session): the durability primitive. Implemented for memory (always throws), SQLite (db.transaction), and D1 (db.batch).
  • MemoryStoreNotDurableError — thrown at run() entry when memoryStore() is paired with sessionId. Catches chat-only agents that would silently lose data.
  • ToolDurabilityError — surfaced as a tool_result when an orphaned non-safeToRetry call is detected on resume. Exported from the main barrel so callers can pattern-match.
  • tool({ safeToRetry: true, ... }) — optional per-tool flag. When true, the framework may re-invoke the handler on resume. Default assumes side effects.
  • @vertz/agents/testing subpath — crashAfterToolResults(store, N) harness for writing resume tests.
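The new store contract can be sketched as follows. This is a simplified illustration, not the package's real typings: the `Message`, `ToolCall`, and `SessionRecord` shapes and the `sketchMemoryStore()` helper are hypothetical, and only `appendMessagesAtomic` and `MemoryStoreNotDurableError` are names taken from this PR.

```typescript
// Hypothetical, simplified shapes; the real @vertz/agents types carry more fields.
interface ToolCall {
  id: string;
  name: string;
  input: unknown;
}

interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
  toolCalls?: ToolCall[]; // present on assistant messages that request tools
  toolCallId?: string; // present on tool messages; links back to ToolCall.id
}

interface SessionRecord {
  id: string;
  updatedAt: number;
}

interface AgentStore {
  // Upsert `session` and append every message as ONE transaction:
  // readers observe all of the writes or none of them.
  appendMessagesAtomic(
    sessionId: string,
    messages: Message[],
    session: SessionRecord,
  ): Promise<void>;
}

class MemoryStoreNotDurableError extends Error {
  constructor() {
    super("memoryStore() cannot provide durable writes; use a SQLite or D1 store.");
    this.name = "MemoryStoreNotDurableError";
  }
}

// An in-process store cannot survive a crash, so it refuses the durable
// primitive outright instead of pretending to be transactional.
function sketchMemoryStore(): AgentStore {
  return {
    async appendMessagesAtomic(): Promise<void> {
      throw new MemoryStoreNotDurableError();
    },
  };
}
```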

Removed (pre-v1, no shim):

  • AgentLoopConfig.checkpointInterval — was a notification callback, not durability.
  • ReactLoopOptions.onCheckpoint — same.

Design

Three agent sign-offs (DX, Product, Technical) + human sign-off on the Rev 2 design doc before implementation. See:

Phases summary

  • Phase 1: appendMessagesAtomic interface + 3 impls, MemoryStoreNotDurableError at run() entry, @vertz/agents/testing subpath with crash harness, E2E test as TDD RED + perf gate (~4ms for a 10-step in-memory SQLite loop, well under the 200ms budget).
  • Phase 2: ReAct loop writes per-step atomically via a persistStep callback. End-of-run flush kept only for the non-durable path. checkpointInterval deleted across the package.
  • Phase 3: resume detection. Orphaned assistant-with-toolCalls + missing tool_result → synthetic ToolDurabilityError tool_result. E2E flips GREEN. MVP complete.
  • Phase 4: tool({ safeToRetry }) opt-in. Resume re-invokes handlers declared safe and falls back to the error for the rest.
  • Phase 5: durable-resume guide in packages/mint-docs/, changeset, retrospective with CF DO manual-verification checklist.

E2E acceptance test status

Both scenarios green at packages/agents/src/__tests__/durable-resume.test.ts:

  1. Side-effecting tool + crash → handler runs once, ToolDurabilityError in resumed history.
  2. safeToRetry: true tool + crash → handler re-invokes on resume, real result persisted.

Plus: the memory-store guard fires at run() entry (before any LLM call) for both tool-calling and chat-only agents.

Test plan

  • vtz test in packages/agents — 243 pass, 13 skipped (unrelated to this PR).
  • vtz run typecheck in packages/agents — clean.
  • Perf gate (durable-resume.perf.test.ts) — ~4ms, < 200ms budget.
  • Rebased onto latest main (resolved an index.ts conflict with #2838, feat(agents): export d1Store from @vertz/agents public API).
  • CF DO manual verification against triagebot staging — post-merge, per retro. Framework tests are merge-gating; DO verification is release-gating.

Breaking changes

Pre-v1, so technically there are no external breaking changes. For existing callers:

  • checkpointInterval / onCheckpoint removed. Any caller using them must migrate; durable resume is the replacement if durability was the goal.
  • memoryStore() + sessionId now throws at entry. Callers wanting in-process session continuity should use sqliteStore({ path: ':memory:' }); those wanting stateless should drop sessionId. All affected tests in this repo were migrated in this PR.

🤖 Generated with Claude Code

viniciusdacal and others added 12 commits April 19, 2026 13:52

Rev 2 of the design for @vertz/agents durable tool execution and
transactional resume. Approved via three agent reviews (DX / Product /
Technical) + human sign-off on 2026-04-19. Implementation broken into
five phase files; Phases 1-3 are the shippable MVP that closes the P0
correctness hole for side-effecting tool calls (no double-fire on
resume after a crash between write phases).

Key design decisions locked:
- Activation is implicit: store + sessionId + non-memory store → durable.
  No flag.
- Tool opt-in named `safeToRetry` (not `idempotent`) to avoid the
  Stripe-idempotency-key semantic collision; default is side-effecting.
- Durability primitive is a single `AgentStore.appendMessagesAtomic()`
  method; two atomic writes per step (pre-dispatch / post-dispatch).
  No `toolCallStatus` field — orphan sentinel is message history alone.
- Memory store under durable execution throws `MemoryStoreNotDurableError`
  at run() entry, not lazily.
- Deletes `checkpointInterval` + `onCheckpoint` pre-v1, no shim.
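A sketch of the implicit activation rule from the decisions above. `MEMORY_STORE_KIND` and `isMemoryStore()` are names this PR exports, but their shape here (a symbol brand on the store object) is an assumption, as are `StoreLike`, `RunMode`, and `resolveRunMode()`.

```typescript
// Assumed brand: the PR exports MEMORY_STORE_KIND and isMemoryStore(),
// but modeling them as a symbol property is this sketch's assumption.
const MEMORY_STORE_KIND = Symbol("vertz.memoryStore");

interface StoreLike {
  readonly [MEMORY_STORE_KIND]?: true;
}

class MemoryStoreNotDurableError extends Error {
  constructor() {
    super("memoryStore() cannot back a sessionId; use a durable store.");
    this.name = "MemoryStoreNotDurableError";
  }
}

function isMemoryStore(store: StoreLike): boolean {
  return store[MEMORY_STORE_KIND] === true;
}

type RunMode = "non-durable" | "durable";

// Implicit activation: no flag, the arguments alone decide the mode.
function resolveRunMode(store: StoreLike | undefined, sessionId: string | undefined): RunMode {
  // No store, or a store without a sessionId: the existing non-durable path.
  if (store === undefined || sessionId === undefined) return "non-durable";
  // memoryStore() + sessionId fails fast, before any LLM call.
  if (isMemoryStore(store)) throw new MemoryStoreNotDurableError();
  return "durable";
}
```

Because the decision depends only on the arguments, it can run synchronously at run() entry, which is what lets chat-only agents fail fast instead of silently losing data.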

Related: #2834 (Anthropic adapter — merged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e guard [#2835]

Phase 1 Task 1 of the durable resume feature. Adds the durability primitive
to the AgentStore contract so Phase 2 can rely on per-step atomic writes,
and introduces MemoryStoreNotDurableError plus an isMemoryStore() brand
so run() can fail fast when memoryStore + sessionId are combined.

- Extend AgentStore with appendMessagesAtomic(sessionId, messages, session).
  Implementations must run as one driver-level transaction over
  already-resolved data (no awaits between statements).
- memoryStore().appendMessagesAtomic() always throws
  MemoryStoreNotDurableError — memory is in-process, cannot provide the
  guarantee.
- sqlite-store + d1-store get stubbed throws; real implementations land in
  Phase 1 Tasks 2 & 3. The two no-throw-plain-error lint warnings are
  transient and disappear in the follow-up commits.
- Export MemoryStoreNotDurableError from @vertz/agents public barrel.
- Export MEMORY_STORE_KIND + isMemoryStore() as module-level helpers for
  run.ts to consume in Phase 1 Task 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 Task 2. Real implementation of the durability primitive for the
SQLite store. Session upsert + all message inserts run inside a single
db.transaction(() => { ... }) callable, over already-resolved data — no
await inside the transaction (the @vertz/sqlite driver is sync). If any
statement throws, the whole transaction rolls back, so readers never see
partial state.

Covered by three tests:
- happy path: session + messages visible after one call
- rollback: a circular-reference toolCalls payload fails JSON.stringify
  mid-batch; no messages land and session.updatedAt is unchanged
- monotonic seq: successive calls continue the sequence numbering
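To illustrate the all-or-nothing contract without a database, the in-memory stand-in below mimics what `db.transaction(() => { ... })` gives the SQLite store: every row, including its `JSON.stringify` output, is resolved before any state changes, and the commit is a pair of synchronous writes that cannot partially fail. It is a hypothetical model of the three tested behaviors (happy path, rollback on a circular payload, monotonic `seq`); the class and field names are invented, and a real store relies on the driver's rollback rather than on staging.

```typescript
interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
  toolCalls?: unknown[];
}

interface SessionRecord {
  id: string;
  updatedAt: number;
}

interface StoredRow {
  seq: number;
  role: Message["role"];
  content: string;
  toolCallsJson: string | null;
}

// Hypothetical in-memory model of the SQLite store's atomic append.
class InMemoryAtomicStore {
  private readonly rows = new Map<string, StoredRow[]>();
  private readonly sessions = new Map<string, SessionRecord>();

  appendMessagesAtomic(sessionId: string, messages: Message[], session: SessionRecord): void {
    const existing = this.rows.get(sessionId) ?? [];
    // Continue numbering from the current maximum, like MAX(seq) + 1.
    let seq = existing.reduce((max, row) => Math.max(max, row.seq), 0);
    // Resolve every row first. A circular toolCalls payload makes
    // JSON.stringify throw here, before any state has been touched.
    const staged = messages.map((m): StoredRow => ({
      seq: ++seq,
      role: m.role,
      content: m.content,
      toolCallsJson: m.toolCalls ? JSON.stringify(m.toolCalls) : null,
    }));
    // "Commit": synchronous writes with no await in between.
    this.sessions.set(sessionId, { ...session });
    this.rows.set(sessionId, [...existing, ...staged]);
  }

  messages(sessionId: string): readonly StoredRow[] {
    return this.rows.get(sessionId) ?? [];
  }

  updatedAt(sessionId: string): number | undefined {
    return this.sessions.get(sessionId)?.updatedAt;
  }
}
```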

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 Task 3. Real implementation of the durability primitive for the
Cloudflare D1 store. A single db.batch([...]) wraps the session upsert +
every message INSERT in one implicit transaction. Each INSERT derives its
seq from a subquery (COALESCE(MAX(seq), 0) + 1) so no pre-batch SELECT is
required — D1 statements in the same batch see each other's writes, which
gives the INSERTs monotonically increasing seq values.

Because D1 batch() is documented as implicitly transactional
(https://developers.cloudflare.com/d1/worker-api/prepared-statements/#batch-statements),
the whole batch commits atomically or rolls back on any statement failure.
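A sketch of how such a batch might be assembled with the Workers D1 API (`prepare`, `bind`, `batch`). `D1Like` is a minimal structural slice of that API so the code type-checks without Cloudflare's typings; the `agent_sessions` / `agent_messages` tables, their columns, and `appendMessagesAtomicD1()` are hypothetical, not the store's actual schema.

```typescript
// Minimal structural slice of the Workers D1 API used below.
interface D1StatementLike {
  bind(...values: unknown[]): D1StatementLike;
}

interface D1Like {
  prepare(sql: string): D1StatementLike;
  batch(statements: D1StatementLike[]): Promise<unknown[]>;
}

interface Message {
  role: string;
  content: string;
}

interface SessionRecord {
  id: string;
  updatedAt: number;
}

// Hypothetical schema; the real d1-store tables and columns may differ.
const UPSERT_SESSION =
  "INSERT INTO agent_sessions (id, updated_at) VALUES (?1, ?2) " +
  "ON CONFLICT(id) DO UPDATE SET updated_at = excluded.updated_at";

// seq is derived inside the INSERT, so no SELECT is needed before the batch.
// Later statements in the batch see earlier ones, so each gets MAX + 1.
const INSERT_MESSAGE =
  "INSERT INTO agent_messages (session_id, seq, role, content) " +
  "SELECT ?1, COALESCE(MAX(seq), 0) + 1, ?2, ?3 " +
  "FROM agent_messages WHERE session_id = ?1";

function appendMessagesAtomicD1(
  db: D1Like,
  sessionId: string,
  messages: Message[],
  session: SessionRecord,
): Promise<unknown[]> {
  const statements = [
    db.prepare(UPSERT_SESSION).bind(session.id, session.updatedAt),
    ...messages.map((m) => db.prepare(INSERT_MESSAGE).bind(sessionId, m.role, m.content)),
  ];
  // batch() is implicitly transactional: every statement commits, or none do.
  return db.batch(statements);
}
```

An aggregate `SELECT` without `GROUP BY` always returns exactly one row, so `COALESCE(MAX(seq), 0) + 1` yields `1` for the first message of a new session.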

Covered by four tests:
- happy path: session + messages visible after one call
- monotonic seq across two successive atomic appends
- rollback: a failing batch() rejects, no partial state visible
- toolCall metadata (toolCallId, toolName, toolCalls) round-trips

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 Task 4. Fail fast at run() entry — before any LLM call or store
access — whenever a sessionId is paired with memoryStore(). The memory
store cannot provide the durable per-step writes that resume requires,
so silently allowing it would lose data on restart (especially for
chat-only agents that never call a tool and never exercise
appendMessagesAtomic otherwise).

- run() at the top of the hasStore branch checks isMemoryStore(store)
  when sessionId is present and throws MemoryStoreNotDurableError
  synchronously.
- Added three tests covering: tool-calling agent throws before LLM; chat-
  only agent throws (no silent loss); memoryStore without sessionId still
  works normally.
- Migrated 12 existing run.test.ts uses of memoryStore() + sessionId to
  sqliteStore({ path: ':memory:' }). Equivalent in-process behavior,
  transactional by construction.
- Same migration in create-agent-runner.test.ts (3 uses).
- types.test-d.ts left untouched — its memoryStore() call doesn't pass a
  sessionId and is a pure type check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#2835]

Phase 1 Task 5. Adds the test-only subpath export '@vertz/agents/testing'
and its first helper, crashAfterToolResults(store, failOnCallNumber = 2),
used by durable-resume.test.ts to simulate a crash between the pre-
dispatch write and the post-dispatch write. The helper wraps any
AgentStore, counts appendMessagesAtomic calls, and throws a sentinel
Error on the Nth call; all other methods pass through unchanged.
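A hypothetical re-creation of the harness, using a `Proxy` so that every method other than `appendMessagesAtomic` passes through untouched. The simplified `AgentStore` interface (including its `loadMessages` method), `SIMULATED_CRASH`, and `crashAfterToolResultsSketch()` are illustrative names, not the real `crashAfterToolResults` export.

```typescript
interface Message {
  role: string;
  content: string;
}

interface SessionRecord {
  id: string;
  updatedAt: number;
}

// Simplified store surface; loadMessages stands in for "every other method".
interface AgentStore {
  appendMessagesAtomic(sessionId: string, messages: Message[], session: SessionRecord): Promise<void>;
  loadMessages(sessionId: string): Promise<Message[]>;
}

const SIMULATED_CRASH = "crash harness: simulated crash";

function crashAfterToolResultsSketch<S extends AgentStore>(store: S, failOnCallNumber = 2): S {
  let calls = 0;
  return new Proxy(store, {
    get(target, prop, receiver) {
      // Everything except the durability primitive delegates unchanged.
      if (prop !== "appendMessagesAtomic") return Reflect.get(target, prop, receiver);
      return async (...args: Parameters<AgentStore["appendMessagesAtomic"]>): Promise<void> => {
        calls += 1;
        // Sentinel failure on the Nth atomic write; the wrapped store never sees it.
        if (calls === failOnCallNumber) throw new Error(SIMULATED_CRASH);
        return target.appendMessagesAtomic(...args);
      };
    },
  });
}
```

With the default `failOnCallNumber = 2`, the first atomic write (the assistant message with its tool calls) succeeds and the second (the tool results) fails, which reproduces the window between handler dispatch and the post-dispatch write.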

- src/testing/crash-harness.ts: the factory.
- src/testing/index.ts: barrel for the subpath.
- src/testing/crash-harness.test.ts: 4 tests covering pass-through
  behavior, Nth-call fail, and delegation of non-atomic methods.
- package.json exports: ./testing entry with dist/testing/index.{js,d.ts}.
- build.config.ts: add src/testing/index.ts as an entry so dts runs.

Verified: `dist/testing/index.js` and `dist/testing/index.d.ts` produced
by the build. Lint clean (the sentinel plain-Error throw has an inline
disable comment with rationale). 227/227 agent tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 Task 6. Lands the feature-level E2E test that drives Phases 2
and 3 to GREEN, plus a perf regression gate for the durable-resume
write pattern.

__tests__/durable-resume.test.ts — the MVP contract. Three cases:

1. "Does not re-invoke handler" (THE key assertion). Scripted LLM asks
   for postSlack; crash harness throws on the 2nd appendMessagesAtomic
   call (simulating a crash AFTER the handler dispatched). Resume with
   the same sessionId must NOT re-invoke the handler; the stored
   message history must include a ToolDurabilityError tool_result.
   --- Currently RED: Phase 1 doesn't call appendMessagesAtomic in the
   loop so the crash never fires. Phase 2 wires the atomic writes
   (still RED because no resume logic). Phase 3 surfaces the error →
   GREEN. The feature branch carries the intermediate RED commits by
   design; the final PR to main is green. .claude/rules/tdd.md forbids
   .skip, so the test stays un-skipped.

2. "memoryStore + sessionId throws at entry" — GREEN already (Task 4
   landed that path). Doubles as an integration assertion that the
   guard fires before any LLM call.

3. Type-level @ts-expect-error on `safeToRetry: true`: GREEN until
   Phase 4 adds the field. Once Phase 4 wires the type, the directive
   becomes unused and typecheck fails, forcing this assertion to be
   updated instead of going stale.

__tests__/durable-resume.perf.test.ts — gates a 10-step scripted loop
on sqliteStore(:memory:) under 200ms. Current measurement: ~4ms. If
Phase 2's per-step atomic writes regress this badly, CI alarms.

Scope boundary: the current run() requires the session row to exist
before run() is called with a fixed sessionId, so the test pre-seeds it
via saveSession(). This matches the DO walkthrough pattern; changing it
is a framework change outside Phase 1's scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of durable tool execution. Wires appendMessagesAtomic into the
loop so tool-calling steps commit durably in two atomic writes per step
(assistant-with-toolCalls, then tool_results), and deletes the obsolete
checkpointInterval + onCheckpoint pair.

reactLoop:
- Remove checkpointInterval + onCheckpoint from ReactLoopOptions (pre-v1,
  no shim).
- Add persistStep callback: fires with phase='assistant-with-tool-calls'
  after the assistant message is pushed, then with phase='tool-results'
  after all result messages for the step are pushed. If persistStep
  rejects, the error propagates out.

run.ts:
- Detect durable mode: store + sessionId + !isMemoryStore.
- When durable, provide persistStep that calls appendMessagesAtomic
  with a fresh session snapshot each write. The new user message is
  bundled into the FIRST persistStep call (so a crash before any step
  leaves no partial state — nothing persists until work begins).
- Skip the end-of-run saveSession + appendMessages pair when durable;
  flush any trailing messages (e.g. text-only final assistant) via one
  last atomic call.
- Non-durable path (stateless or no sessionId) unchanged.
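The durable-mode `persistStep` described above might look roughly like this. `makePersistStep()`, `AtomicStore`, and the injectable `now` clock are hypothetical; the sketch only shows the two behaviors stated in the commit: a fresh session snapshot on every write, and the new user message riding along with the first write.

```typescript
interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

interface SessionRecord {
  id: string;
  updatedAt: number;
}

interface AtomicStore {
  appendMessagesAtomic(sessionId: string, messages: Message[], session: SessionRecord): Promise<void>;
}

type PersistPhase = "assistant-with-tool-calls" | "tool-results";
type PersistStep = (phase: PersistPhase, messages: Message[]) => Promise<void>;

// Hypothetical factory for the callback run() passes to the loop in durable mode.
function makePersistStep(
  store: AtomicStore,
  sessionId: string,
  userMessage: Message,
  now: () => number = Date.now,
): PersistStep {
  // The new user message waits for the first step, so a crash before any
  // step has started leaves nothing behind.
  let pending: Message[] = [userMessage];
  return async (_phase, messages) => {
    const batch = [...pending, ...messages];
    // Fresh session snapshot on every write.
    await store.appendMessagesAtomic(sessionId, batch, { id: sessionId, updatedAt: now() });
    pending = []; // cleared only once the write has committed
  };
}
```

Both phases go through the same atomic call; the phase argument is accepted for parity with the loop's callback but is unused in this sketch.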

Config:
- Delete checkpointInterval from AgentLoopConfig + agent() defaults +
  all tests that referenced the obsolete callback. The types.test-d.ts
  regression guard now asserts that checkpointInterval is rejected.

E2E test update:
- durable-resume.test.ts is still RED on the single "ToolDurabilityError
  surfaces in history" assertion; Phase 3 adds the resume logic that
  writes the synthetic error tool_result. Everything else in the file
  passes — handler runs once pre-crash, crash trips correctly on write
  #2, second run() completes cleanly (scripted LLM says "Done" rather
  than re-requesting the tool).

232 pass, 1 fail (the planned Phase-3-target assertion), 0 lint/type
regressions. The 8 pre-existing no-throw-plain-error warnings in
react-loop.ts / react-loop.test.ts / agent.ts are unchanged by this
commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of durable tool execution — the MVP completes here. When run()
resumes a session whose last assistant message has unmatched tool_call
ids (the crash signature for a side-effecting tool that ran but whose
tool_result was lost before write #2 committed), the framework now:

1. Constructs a ToolDurabilityError per missing tool_call.
2. Serializes each as a `tool` role message (same JSON shape the loop
   already uses for handler errors, plus a `kind: 'tool-durability-error'`
   discriminator so callers + LLMs can pattern-match).
3. Commits the synthetic tool_results atomically via
   appendMessagesAtomic before the loop's first LLM call.
4. Extends previousMessages with those synthetic rows so the LLM sees
   the error in-band and decides recovery (check external state, ask
   user, abort — its call).

Also adds `findOrphanAssistantWithToolCalls()` — a pure message-history
scan, no new schema column required. The sentinel is "assistant with
toolCalls + no matching tool_result" per the design's crash taxonomy.

Export:
- ToolDurabilityError class from @vertz/agents barrel (so callers
  inspecting resumed history can pattern-match).

Tests:
- errors.test.ts: class shape + serialized encoding.
- durable-resume.test.ts: the main E2E is now fully GREEN. Handler runs
  exactly once across crash + resume; ToolDurabilityError surfaces in
  history with the correct toolName/toolCallId.

Total: 236/236 tests pass. Typecheck clean. Lint clean on Phase 3
additions.

Phases 1–3 = MVP. Phase 4 adds the safeToRetry opt-in; Phase 5 ships
docs + changeset + PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 of durable tool execution. Pure-read tools or handlers that
are safe to run twice can now declare `safeToRetry: true` to opt out of
the conservative ToolDurabilityError path. On resume with a missing
tool_result, the framework re-invokes those handlers and persists the
real result instead of asking the LLM to decide recovery.

Types:
- ToolConfig.safeToRetry?: boolean — public, documented with explicit
  callout that this is about resume replay, NOT HTTP retry.
- ToolDefinition.safeToRetry?: boolean — forwarded by tool().

run.ts (resume dispatch):
- Orphan handling moves from the session-load branch to after
  ctx/agents/resolvedTools are built, so executeToolCall can run with
  a real ToolContext.
- Per missing tool_call: if resolvedTools[name]?.safeToRetry, call
  executeToolCall — which handles input validation + output validation +
  handler errors identically to the loop. Otherwise surface the
  ToolDurabilityError as before.
- Re-invocations + durability-error messages persist atomically in a
  single appendMessagesAtomic call (batch per resume, not per tool).
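The resume dispatch can be sketched as below. `resolveOrphansSketch()` and `ToolDef` are hypothetical simplifications: the real path calls `executeToolCall`, which also performs input and output validation, whereas this sketch calls the handler directly and encodes a thrown error as a plain `{ error }` payload.

```typescript
interface ToolCall {
  id: string;
  name: string;
  input: unknown;
}

interface ToolMessage {
  role: "tool";
  toolCallId: string;
  toolName: string;
  content: string;
}

interface ToolDef {
  safeToRetry?: boolean;
  handler(input: unknown): Promise<unknown>;
}

async function resolveOrphansSketch(
  orphans: ToolCall[],
  tools: Record<string, ToolDef | undefined>,
  persist: (messages: ToolMessage[]) => Promise<void>,
): Promise<ToolMessage[]> {
  const results: ToolMessage[] = [];
  for (const call of orphans) {
    const tool = tools[call.name];
    let payload: unknown;
    if (tool !== undefined && tool.safeToRetry === true) {
      // Opted in: replay the handler and keep its real result (or its error).
      try {
        payload = await tool.handler(call.input);
      } catch (err) {
        payload = { error: err instanceof Error ? err.message : String(err) };
      }
    } else {
      // Default: assume side effects and let the LLM decide how to recover.
      payload = {
        kind: "tool-durability-error",
        error: `"${call.name}" may have run before a crash; not re-invoked.`,
      };
    }
    results.push({ role: "tool", toolCallId: call.id, toolName: call.name, content: JSON.stringify(payload ?? null) });
  }
  // One atomic write for the whole resume, not one per tool.
  await persist(results);
  return results;
}
```

Handlers run one at a time, and nothing is persisted until every orphan has a result, so the resume commits a single batch.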

react-loop.ts:
- Export executeToolCall + ToolCallResult so run.ts can reuse the same
  code path (tool-not-found / no-handler / input/output validation /
  handler errors all encoded the same way).

Tests:
- tool.test.ts: safeToRetry forwards correctly (true → true, omitted →
  undefined).
- durable-resume.test.ts: new E2E scenario — a safeToRetry tool crashes
  mid-step, handler re-invokes on resume, real result lands, NO
  ToolDurabilityError appears in history. Handler count goes 1 → 2 —
  exactly the point.
- Updated the type-level test: safeToRetry: true now compiles (its
  @ts-expect-error negative was removed), and safeToRetry: 'yes' is rejected.

240/240 tests pass. 2 pre-existing no-throw-plain-error warnings
unchanged by this commit.

Phases 1–4 deliver the full feature. Phase 5 wraps docs + changeset +
retrospective + PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 5 — the release wrap for durable tool execution.

- packages/mint-docs/guides/agents/durable-resume.mdx: a full
  user-facing guide covering activation (store + sessionId on durable
  store), the safeToRetry flag with an explicit "this is NOT network
  retry" callout, the crash-window taxonomy table in user terms, cost
  guidance (~2 writes/step on D1), and the ToolDurabilityError
  inspection pattern for resumed history.
- docs.json nav updated to include the new page.
- .changeset/agents-durable-resume.md: patch bump with the full
  public-surface diff (new + removed).
- plans/post-implementation-reviews/agents-durable-resume.md: retro
  covering what shipped, what worked, what didn't, manual-verification
  checklist for the CF DO staging run, and open follow-ups.

The manual CF DO verification is release-gating, not merge-gating —
the framework tests all pass; the retro captures the checklist to run
post-merge against triagebot staging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure oxfmt pass on the new durable-resume guide to unblock CI. No
content change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
viniciusdacal force-pushed the feat/agents-durable-resume branch from e883e53 to 3902d93 on April 19, 2026, 16:53
…drift)

These two files are identical to origin/main but fail `oxfmt --check`,
blocking CI on this PR. Pure whitespace/wrap normalization — no
content change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
viniciusdacal merged commit 091282b into main on Apr 19, 2026
6 checks passed


Linked issue: feat(agents): idempotent tool execution and transactional tool_result persistence for durable runs
