feat(checkpoint): wire checkpointing into agent event loop #2190

Draft
JackYPCOnline wants to merge 2 commits into strands-agents:main from JackYPCOnline:checkpoint_1

Conversation

@JackYPCOnline
Contributor

Description

Wires the Checkpoint data model (landed in #2181) into the agent runtime so that an opt-in checkpointing=True agent pauses at ReAct cycle boundaries and resumes cleanly from persisted checkpoints, including across fresh process boundaries. That cross-process resume is what makes durability providers like Temporal, Dapr, and AWS Step Functions usable with Strands.

The design mirrors the existing interrupt pattern by construction — stop_reason="checkpoint", checkpointResume content block for resume, snapshot-based state transfer. Users who know interrupts know this.

User-facing API (zero breaking changes — opt-in only):

agent = Agent(tools=[...], checkpointing=True)
result = await agent.invoke_async("do the thing")

while result.stop_reason == "checkpoint":
    # persist anywhere: Temporal Event History, DB, file, etc.
    save_somewhere(result.checkpoint.to_dict())

    # resume in a fresh process / Agent instance
    result = await fresh_agent.invoke_async(
        [{"checkpointResume": {"checkpoint": result.checkpoint.to_dict()}}]
    )

print(result.message)  # stop_reason == "end_turn"
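
For the "file" option mentioned in the loop above, a minimal sketch of what save_somewhere could look like. The names save_checkpoint and load_checkpoint are hypothetical helpers, not part of the SDK, and the sketch assumes that checkpoint.to_dict() returns a JSON-serializable dict.

```python
import json
import os
import tempfile
from typing import Any


def save_checkpoint(path: str, checkpoint_dict: dict[str, Any]) -> None:
    """Write a checkpoint dict to a JSON file (hypothetical helper).

    Writes to a temp file in the same directory and renames it over the
    target, so a crash mid-write does not leave a truncated checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(checkpoint_dict, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict[str, Any]:
    """Read a checkpoint dict back from disk (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A fresh process would call load_checkpoint and pass the result inside the checkpointResume block.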

What changed:

  • Agent.__init__ — new checkpointing: bool = False parameter and two internal fields (_checkpointing, _checkpoint_resume_context). Default False: zero behavioral change for existing callers.
  • Agent._convert_prompt_to_messages — detects checkpointResume content blocks, validates shape (mirrors _InterruptState.resume() conventions: TypeError for shape, KeyError for lookup, ValueError for misconfig), loads the snapshot, and stashes the resume context.
  • event_loop_cycle — one priming block (reads + one-shot clears resume context) plus two checkpoint emission points (after_model and after_tools), all gated behind agent._checkpointing.
  • AgentResult — new checkpoint: Checkpoint | None = None field; to_dict / from_dict round-trip it.
  • EventLoopStopEvent — extended constructor with checkpoint kwarg; the 7-tuple matches AgentResult field order for positional unpacking.
  • strands.experimental.checkpoint — new CheckpointResumeContent / CheckpointResumeDict TypedDicts (parallel to InterruptResponseContent in types/interrupt.py).
  • checkpoint.py docstring — adds "Interaction with interrupts" section explaining precedence (interrupts win over checkpoints when both would fire in the same cycle, by design).
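
To make the validation bullet concrete, here is a standalone sketch of the resume detection. It is not the code in agent.py: the function name extract_checkpoint_resume is hypothetical, and the mapping of each failure to TypeError, KeyError, or ValueError is an assumption based on the convention stated above. Only the KeyError message is taken from the diff.

```python
from __future__ import annotations

from typing import Any


def extract_checkpoint_resume(prompt: Any, checkpointing: bool) -> dict[str, Any] | None:
    """Return the raw checkpoint dict from a resume prompt, or None for a normal prompt."""
    if not (isinstance(prompt, list) and prompt):
        return None

    resume_blocks = [c for c in prompt if isinstance(c, dict) and "checkpointResume" in c]
    if not resume_blocks:
        return None

    # Misconfiguration: a resume block was sent to an agent that did not opt in.
    if not checkpointing:
        raise ValueError("checkpointing=<False> | checkpointResume requires checkpointing=True")

    # Shape errors: the resume block must be the only content block.
    if len(resume_blocks) != len(prompt):
        raise TypeError("prompt | checkpointResume cannot be mixed with other content blocks")
    if len(resume_blocks) > 1:
        raise TypeError("prompt | only one checkpointResume block is allowed")

    resume_block = resume_blocks[0]["checkpointResume"]
    if not isinstance(resume_block, dict):
        raise TypeError("checkpointResume | must be a dict")

    # Lookup error: message copied from the diff in this PR.
    if "checkpoint" not in resume_block:
        raise KeyError("checkpoint | missing required key in checkpointResume block")

    return resume_block["checkpoint"]
```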

State-machine verification. Four scenarios traced against the code and covered by tests:

  1. Fresh call, checkpointing=False → identical to pre-change.
  2. Fresh call, checkpointing=True, tool_use → after_model checkpoint at cycle_index=0.
  3. Resume from after_model → snapshot restored, model call skipped (assistant tool_use is already last message), tools run, after_tools checkpoint at cycle_index=0.
  4. Resume from after_tools at cycle_index=N → primes invocation_state["_checkpoint_cycle_index"]=N+1, model runs, next after_model checkpoint carries cycle_index=N+1.
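
The transitions in scenarios 2 to 4 can be summarized as a small pure function. This is a toy model, not the event loop code: next_checkpoint is a hypothetical name, and the sketch assumes the model keeps returning tool_use (an end_turn response would end the run instead of producing another checkpoint).

```python
from typing import Optional, Tuple

Position = Tuple[str, int]  # (position, cycle_index)


def next_checkpoint(resumed_from: Optional[Position]) -> Position:
    """Return the next checkpoint emitted, given where the run resumed from."""
    if resumed_from is None:
        # Scenario 2: fresh call, first model response requests tools.
        return ("after_model", 0)

    position, cycle_index = resumed_from
    if position == "after_model":
        # Scenario 3: model call is skipped, tools run, same cycle index.
        return ("after_tools", cycle_index)
    if position == "after_tools":
        # Scenario 4: cycle index is primed to N+1 before the model runs.
        return ("after_model", cycle_index + 1)
    raise ValueError(f"position=<{position}> | unknown checkpoint position")


# Walk a run through two full cycles.
trace = []
state: Optional[Position] = None
for _ in range(4):
    state = next_checkpoint(state)
    trace.append(state)
```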

Durability proof — the killer test. test_crash_after_tools_does_not_rerun_completed_tools: three tools with independent call counters, agent runs through after_tools, the Agent instance is discarded entirely (del), a fresh Agent resumes from the persisted checkpoint, and the post-crash model returns end_turn. Assertion: each tool's counter is exactly 1. Completed work survives worker loss.
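
The shape of that test, reduced to a self-contained toy. ToyAgent is a stand-in and not the Strands Agent class; the checkpoint layout (position, cycle_index, snapshot) is simplified for illustration. The point it demonstrates is the same: completed tool results travel in the persisted checkpoint, so the fresh instance does not call the tools again.

```python
import json
from collections import Counter

calls: Counter = Counter()


def make_tool(name: str):
    """Build a tool that counts how many times it was invoked."""
    def tool() -> str:
        calls[name] += 1
        return f"{name}-result"
    return tool


TOOLS = {name: make_tool(name) for name in ("a", "b", "c")}


class ToyAgent:
    """Stand-in agent holding only a message list (hypothetical)."""

    def __init__(self) -> None:
        self.messages: list = []

    def run_tools(self, requested: list) -> dict:
        for name in requested:
            self.messages.append({"tool": name, "result": TOOLS[name]()})
        # after_tools checkpoint: tool results live inside the snapshot.
        return {"position": "after_tools", "cycle_index": 0,
                "snapshot": {"messages": self.messages}}

    def resume(self, checkpoint: dict) -> dict:
        # Restore state; the results are already present, so no tool is re-run.
        self.messages = list(checkpoint["snapshot"]["messages"])
        return {"stop_reason": "end_turn", "tool_results": len(self.messages)}


agent = ToyAgent()
persisted = json.dumps(agent.run_tools(["a", "b", "c"]))
del agent  # simulate worker loss

fresh = ToyAgent()
result = fresh.resume(json.loads(persisted))
```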

V0 known limitations (documented in checkpoint.py module docstring, not blockers):

  • Metrics reset on each resume call — the orchestrator is responsible for aggregating metrics across a durable run.
  • OpenAIResponsesModel(stateful=True) not supported — _model_state is not in take_snapshot(preset="session"). Follow-up issue to extend the snapshot preset.
  • AgentResult.message at after_tools is the assistant message that requested the tools (tool results are inside checkpoint.snapshot).
  • BeforeInvocationEvent / AfterInvocationEvent fire on every resume call (same as interrupts — hooks counting invocations see each resume as a separate invocation).
  • Per-tool granularity within a cycle requires a custom ToolExecutor (e.g. a future TemporalToolExecutor). The SDK checkpoint operates at cycle boundaries.
  • Streaming callbacks do not re-emit on replay.
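
For the first limitation, orchestrator-side aggregation can be as simple as summing per-call counters. A sketch, assuming each resume call yields a flat dict of numeric metrics; the key names used in the example are illustrative and not the SDK's metrics schema.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_metrics(per_call_metrics: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum numeric metrics across every invoke/resume call of one durable run."""
    total: Counter = Counter()
    for metrics in per_call_metrics:
        total.update(metrics)  # Counter.update adds counts key by key
    return dict(total)


run_total = aggregate_metrics([
    {"inputTokens": 10, "outputTokens": 5},  # initial call
    {"inputTokens": 7, "outputTokens": 3},   # first resume
])
```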

Related Issues

Documentation PR

Type of Change

New feature

Testing

Verified the changes do not break functionality or introduce warnings in consuming repositories.

  • I ran hatch run prepare

Evidence from fresh runs:

  • hatch test — 2641 passed, 0 failed (up from 2620 baseline: 21 new tests).
  • hatch run hatch-static-analysis:lint-check: ruff check and mypy both clean (145 source files).
  • hatch run hatch-static-analysis:format-check — 366/366 files formatted.

New tests added:

  • Unit — CheckpointResume types: tests/strands/experimental/checkpoint/test_types.py (2 tests, shape of the new TypedDicts).
  • Unit — AgentResult.checkpoint: tests/strands/agent/test_agent_result.py (6 new tests folded in — field default, accepts checkpoint, to_dict includes/omits checkpoint, from_dict round-trip, missing-checkpoint resilience).
  • Unit — EventLoopStopEvent checkpoint kwarg: tests/strands/types/test__events.py (2 new tests folded in — tuple length, default None).
  • Unit — Agent.__init__ flag: tests/strands/agent/test_agent.py (2 new tests).
  • Unit — _convert_prompt_to_messages validation: tests/strands/agent/test_agent.py (5 new tests — checkpointing=False error, mixed content, multiple blocks, missing key, schema mismatch).
  • Cycle-level — event loop emission: tests/strands/event_loop/test_event_loop.py (3 new tests — after_model emission, after_tools emission, cycle-index continuity across resume).
  • Integration — end-to-end durability: tests/strands/experimental/checkpoint/test_checkpoint.py (2 new tests folded in — round-trip across three cycles through fresh Agent instances, and the killer crash-after-tools test).

The 7-tuple shape change to EventLoopStopEvent required updating 10 pre-existing test-side tuple unpackers and two assertions in test__events.py. Done mechanically (add one slot each); all pre-existing tests still pass.

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly (user-guide page is a follow-up PR in agent-docs; module-level docstring in checkpoint.py updated with V0 limitations and interrupt interaction)
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed (reference Temporal / Dapr / Step Functions examples are the next milestone — M1/M2/M3 in the durable-execution tracking plan)
  • My changes generate no new warnings
  • Any dependent changes have been merged and published (Part A — feat: introduce checkpoint in experimental #2181 — is merged on main; this PR builds on it)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@JackYPCOnline JackYPCOnline marked this pull request as draft April 22, 2026 20:39
@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 98.41270% with 1 line in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/strands/event_loop/event_loop.py | 96.00% | 0 Missing and 1 partial ⚠️

Comment thread src/strands/types/_events.py
Comment thread src/strands/event_loop/event_loop.py Outdated
Comment thread src/strands/agent/agent.py Outdated

# Resume detection — must run before existing shape handling so checkpointResume
# blocks aren't misinterpreted as content blocks. Mirrors _InterruptState.resume()
# conventions (TypeError for shape, KeyError for lookup, ValueError for misconfig;

Issue: The comment says this mirrors _InterruptState.resume() conventions: "TypeError for shape, KeyError for lookup, ValueError for misconfig". However, Checkpoint.from_dict() on line 1047 raises CheckpointException (not ValueError) when the schema version doesn't match. This means a corrupted or version-mismatched checkpoint will surface as a CheckpointException, breaking the stated convention.

Suggestion: Either catch CheckpointException here and re-raise as ValueError to maintain the documented convention, or update the comment to reflect the actual exception types (including CheckpointException).

Contributor Author

The SDK convention is standalone per-domain exceptions (SessionException, SnapshotException, and now CheckpointException). Catching ValueError should not accidentally catch checkpoint-specific failures — that would hide schema-version problems inside generic error handlers. Callers who need to distinguish "checkpoint restore failed" from "bad input" should catch CheckpointException explicitly; this is the pattern the SDK already uses.

Makes sense — the per-domain exception convention is clear. The updated comment in the code now documents CheckpointException explicitly, which addresses the concern. 👍
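
A sketch of what the per-domain convention means for callers. CheckpointException is redefined locally here as a stand-in, and the schema version value and function names are assumptions. Because the exception does not derive from ValueError, a generic ValueError handler does not swallow a schema-version failure.

```python
class CheckpointException(Exception):
    """Local stand-in for the SDK's per-domain exception (hypothetical)."""


CHECKPOINT_SCHEMA_VERSION = "1.0"  # assumed value, for illustration only


def checkpoint_from_dict(data: dict) -> dict:
    """Mimic the version check quoted in this PR's diff."""
    version = data.get("schema_version", "")
    if version != CHECKPOINT_SCHEMA_VERSION:
        raise CheckpointException(f"Unsupported checkpoint schema version: {version!r}")
    return data


def classify_resume_failure(data: dict) -> str:
    """Show which handler fires for a given checkpoint payload."""
    try:
        checkpoint_from_dict(data)
    except ValueError:
        return "bad input"
    except CheckpointException:
        return "checkpoint restore failed"
    return "ok"
```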

tool_executor: ToolExecutor | None = None,
retry_strategy: ModelRetryStrategy | _DefaultRetryStrategySentinel | None = _DEFAULT_RETRY_STRATEGY,
concurrent_invocation_mode: ConcurrentInvocationMode = ConcurrentInvocationMode.THROW,
checkpointing: bool = False,

Issue: This introduces a new public API surface on Agent.__init__ (checkpointing parameter), a new AgentResult field, new StopReason value, and new content block types. Per the API Bar Raising process, this falls under "moderate changes" (adding a new class/abstraction customers use to achieve new behavior) and should have the needs-api-review label.

Suggestion: Add the needs-api-review label to this PR and ensure an API reviewer evaluates the public API design before merge. Key API decisions to review:

  • Is checkpointing: bool the right granularity, or should it be an enum/config object for future extensibility (e.g., checkpoint only at after_tools, custom positions)?
  • Is checkpointResume as a content block the right abstraction for resume, vs. a dedicated method like agent.resume_from_checkpoint(checkpoint)?
  • Should Checkpoint be a frozen dataclass to prevent accidental mutation?

Comment thread src/strands/agent/agent.py Outdated
# conventions (TypeError for shape, KeyError for lookup, ValueError for misconfig;
# error messages use the SDK's key=<value> | message format).
if isinstance(prompt, list) and prompt:
    has_checkpoint_resume = any(isinstance(c, dict) and "checkpointResume" in c for c in prompt)

Issue: The checkpointResume content block approach requires users to construct a raw dict/TypedDict to resume: [{"checkpointResume": {"checkpoint": checkpoint.to_dict()}}]. This is error-prone and not as discoverable as a method call. The validation code (lines 1026-1045) exists primarily because the dict-based API has many ways to be malformed.

Per the SDK tenet "The obvious path is the happy path" — consider whether a dedicated Agent.resume_from_checkpoint(checkpoint) method would better guide users toward correct usage while keeping the content block as a low-level API for advanced use cases. This would align with the "Provide Both Low-Level and High-Level APIs" decision record.
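
One possible shape for such a high-level helper, sketched as free functions so it runs without the SDK. Neither build_resume_prompt nor resume_from_checkpoint exists in this PR; both names are hypothetical, and the only thing assumed about the agent is an invoke_async coroutine like the one in the PR description.

```python
import asyncio
from typing import Any


def build_resume_prompt(checkpoint_dict: dict[str, Any]) -> list[dict[str, Any]]:
    """Wrap a checkpoint dict in the checkpointResume content block."""
    return [{"checkpointResume": {"checkpoint": checkpoint_dict}}]


async def resume_from_checkpoint(agent: Any, checkpoint_dict: dict[str, Any]) -> Any:
    """High-level path: callers never hand-build the content block."""
    return await agent.invoke_async(build_resume_prompt(checkpoint_dict))


class EchoAgent:
    """Test double that returns whatever prompt it receives."""

    async def invoke_async(self, prompt: Any) -> Any:
        return prompt


echoed = asyncio.run(resume_from_checkpoint(EchoAgent(), {"schema_version": "1.0"}))
```

The content block stays available as the low-level API; the helper only removes the need to remember its nesting.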

if resume_pos == "after_model":
    pass  # Just resumed here: skip re-checkpoint, proceed to tools
else:
    cycle_index = invocation_state.get("_checkpoint_cycle_index", 0)

Issue: When checkpointing=True and the model returns end_turn (no tool_use), the agent completes normally but _checkpoint_cycle_index in invocation_state defaults to 0 throughout. There's no after_model checkpoint emitted for end_turn — is this intentional? If the model returns end_turn on the very first call, the user gets no checkpoint at all. This seems correct for durability semantics (nothing to resume from), but it's worth documenting explicitly that checkpoints are only emitted when there's a tool_use cycle.

Also: the _checkpoint_cycle_index key is initialized lazily via invocation_state.get("_checkpoint_cycle_index", 0). If this key is not consumed on an end_turn path, it's harmless but could be confusing during debugging.

version = data.get("schema_version", "")
if version != CHECKPOINT_SCHEMA_VERSION:
-    raise ValueError(
+    raise CheckpointException(

Issue: Checkpoint.from_dict was changed from raising ValueError to CheckpointException, but the docstring for _convert_prompt_to_messages in agent.py still documents the convention as "ValueError for misconfig". Additionally, users who were catching ValueError (based on the Part A docs) would now miss this exception.

Suggestion: Since this is still experimental, the change is reasonable — but ensure the exception hierarchy is intentional. Consider whether CheckpointException should be a subclass of ValueError to maintain backward compatibility with code that catches ValueError.
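
The trade-off in that suggestion can be seen in a few lines. Both classes below are hypothetical stand-ins defined only for this comparison: a ValueError subclass keeps old except ValueError handlers working, while a standalone exception keeps checkpoint failures out of generic handlers.

```python
class StandaloneCheckpointException(Exception):
    """Per-domain style: unrelated to ValueError (hypothetical stand-in)."""


class CompatCheckpointException(ValueError):
    """Backward-compatible style: also a ValueError (hypothetical stand-in)."""


def caught_by_value_error(exc_type: type) -> bool:
    """Return True if a plain `except ValueError` would catch exc_type."""
    try:
        raise exc_type("schema version mismatch")
    except ValueError:
        return True
    except Exception:
        return False
```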

Comment thread tests/strands/experimental/checkpoint/test_checkpoint.py Outdated
raise KeyError("checkpoint | missing required key in checkpointResume block")

checkpoint = Checkpoint.from_dict(resume_block["checkpoint"])
self.load_snapshot(Snapshot.from_dict(checkpoint.snapshot))

Issue: The Snapshot.from_dict(checkpoint.snapshot) call on line 1048 could fail if the snapshot data is corrupted or from an incompatible snapshot schema version. This error would bubble up as whatever Snapshot.from_dict raises, which may not have a clear error message about the checkpoint resume context.

Suggestion: Consider wrapping lines 1047-1048 in a try/except that provides a clear error message, e.g., "Failed to restore agent state from checkpoint snapshot: {original_error}".

Contributor Author

SnapshotException already carries a clear, structured message ("Unsupported snapshot schema version: '{v}'. Current version: {c}"). Adding a wrapper layer at the checkpoint level introduces indirection without adding information. If a user hits SnapshotException during checkpoint resume, the message + the call stack are enough to diagnose.

Fair — SnapshotException is already descriptive and the stack trace provides the checkpoint resume context. No wrapping needed.

@github-actions

Assessment: Comment

This is a well-structured PR that wires checkpoint functionality into the agent loop with a clean opt-in design. The state machine is carefully reasoned and the integration tests (especially the crash-after-tools test) are compelling. Four themes warrant attention before merge:

Review Themes
  • API Review Required: This introduces meaningful new public API surface (Agent parameter, AgentResult field, new StopReason, content block types). Per the API Bar Raising process, it needs a needs-api-review label and reviewer sign-off. Key design questions: is checkpointing: bool the right level of configurability, and should there be a high-level resume_from_checkpoint() method alongside the content-block primitive?

  • Error Contract Consistency: The resume validation comments claim to mirror _InterruptState.resume() conventions (TypeError/KeyError/ValueError), but Checkpoint.from_dict now raises CheckpointException. The exception hierarchy should be consistent and documented.

  • Coupling Pattern: The event loop directly accesses private agent attributes (_checkpointing, _checkpoint_resume_context). This mirrors the existing interrupt pattern but extends the coupling surface. Consider exposing checkpoint config as an explicit parameter or read-only property.

  • Test Coverage Gap: Missing a test for the checkpointing=True + end_turn (no tool use) path, and the Codecov report shows 1 partial line in event_loop.py.

The feature design, state-machine logic, and durability proof are solid. The integration tests are particularly well-designed.

# Checkpoint after model call, before tool execution.
# One-shot pop: safe because after_model always returns before reaching
# after_tools, so the stashed position is only consumed once.
if agent._checkpointing:

Issue: When checkpointing=True and the cancel_signal is set, the after_model checkpoint emission on line 230 runs before the cancel check inside _handle_tool_execution (line 554). This means cancellation during a checkpointing agent's tool-use cycle would produce stop_reason="checkpoint" instead of stop_reason="cancelled". On resume, the cancel signal wouldn't be set, so the agent would proceed normally.

This is probably fine for durable-execution semantics (the orchestrator decides whether to re-cancel), but it's an unspecified interaction that could surprise users who expect cancel() to always produce stop_reason="cancelled".

Suggestion: Document the cancel-vs-checkpoint precedence (checkpoint wins, cancel is ignored) either in the module docstring or add a test that specifies the expected behavior.
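
A test that pins this down could start from a decision table like the one below. This is a toy model of the ordering described in this comment, not the event loop itself: stop_reason_after_model is a hypothetical name, "continue_to_tools" is a placeholder for "proceed with tool execution", and the resume-from-after_model case (where the checkpoint is not re-emitted) is left out.

```python
def stop_reason_after_model(
    model_stop_reason: str, checkpointing: bool, cancel_requested: bool
) -> str:
    """Model the order of checks after the model call in one cycle."""
    # No tool_use means no checkpoint: the run ends with the model's reason.
    if model_stop_reason != "tool_use":
        return model_stop_reason
    # The after_model checkpoint is emitted before the cancel check is reached.
    if checkpointing:
        return "checkpoint"
    # Without checkpointing, the cancel check in tool handling applies.
    if cancel_requested:
        return "cancelled"
    return "continue_to_tools"
```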

@github-actions

Assessment: Comment

Good progress since the last round — the frozen=True dataclass, _build_checkpoint_stop_event extraction, and updated error convention documentation address several prior concerns. A few new items surfaced:

New Review Items
  • Docstring accuracy: event_loop_cycle Yields docstring still documents a 4-element tuple but the actual event is now 7 elements. The cancel() docstring uses "checkpoint" in a way that now conflicts with the durable-execution Checkpoint concept introduced here.
  • AGENTS.md update: The directory structure section needs to be updated to include experimental/checkpoint/ per the repo's own guidelines.
  • Cancel + checkpoint interaction: When both checkpointing=True and cancel_signal are set, checkpoint emission takes precedence over cancel. This is probably correct but should be documented or tested.

The prior-round items around API review (needs-api-review label, high-level resume method question) remain open for maintainer decision. The core state-machine logic and test coverage are solid.
