feat(checkpoint): wire checkpointing into agent event loop#2190
feat(checkpoint): wire checkpointing into agent event loop#2190JackYPCOnline wants to merge 2 commits intostrands-agents:mainfrom
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
|
||
| # Resume detection — must run before existing shape handling so checkpointResume | ||
| # blocks aren't misinterpreted as content blocks. Mirrors _InterruptState.resume() | ||
| # conventions (TypeError for shape, KeyError for lookup, ValueError for misconfig; |
There was a problem hiding this comment.
Issue: The comment says this mirrors _InterruptState.resume() conventions: "TypeError for shape, KeyError for lookup, ValueError for misconfig". However, Checkpoint.from_dict() on line 1047 raises CheckpointException (not ValueError) when the schema version doesn't match. This means a corrupted or version-mismatched checkpoint will surface as a CheckpointException, breaking the stated convention.
Suggestion: Either catch CheckpointException here and re-raise as ValueError to maintain the documented convention, or update the comment to reflect the actual exception types (including CheckpointException).
There was a problem hiding this comment.
The SDK convention is standalone per-domain exceptions (SessionException, SnapshotException, and now CheckpointException). Catching ValueError should not accidentally catch checkpoint-specific failures — that would hide schema-version problems inside generic error handlers. Callers who need to distinguish "checkpoint restore failed" from "bad input" should catch CheckpointException explicitly; this is the pattern the SDK already uses.
There was a problem hiding this comment.
Makes sense — the per-domain exception convention is clear. The updated comment in the code now documents CheckpointException explicitly, which addresses the concern. 👍
| tool_executor: ToolExecutor | None = None, | ||
| retry_strategy: ModelRetryStrategy | _DefaultRetryStrategySentinel | None = _DEFAULT_RETRY_STRATEGY, | ||
| concurrent_invocation_mode: ConcurrentInvocationMode = ConcurrentInvocationMode.THROW, | ||
| checkpointing: bool = False, |
There was a problem hiding this comment.
Issue: This introduces a new public API surface on Agent.__init__ (checkpointing parameter), a new AgentResult field, new StopReason value, and new content block types. Per the API Bar Raising process, this falls under "moderate changes" (adding a new class/abstraction customers use to achieve new behavior) and should have the needs-api-review label.
Suggestion: Add the needs-api-review label to this PR and ensure an API reviewer evaluates the public API design before merge. Key API decisions to review:
- Is
checkpointing: boolthe right granularity, or should it be an enum/config object for future extensibility (e.g., checkpoint only atafter_tools, custom positions)? - Is
checkpointResumeas a content block the right abstraction for resume, vs. a dedicated method likeagent.resume_from_checkpoint(checkpoint)? - Should
Checkpointbe a frozen dataclass to prevent accidental mutation?
| # conventions (TypeError for shape, KeyError for lookup, ValueError for misconfig; | ||
| # error messages use the SDK's key=<value> | message format). | ||
| if isinstance(prompt, list) and prompt: | ||
| has_checkpoint_resume = any(isinstance(c, dict) and "checkpointResume" in c for c in prompt) |
There was a problem hiding this comment.
Issue: The checkpointResume content block approach requires users to construct a raw dict/TypedDict to resume: [{"checkpointResume": {"checkpoint": checkpoint.to_dict()}}]. This is error-prone and not as discoverable as a method call. The validation code (lines 1026-1045) exists primarily because the dict-based API has many ways to be misformed.
Per the SDK tenet "The obvious path is the happy path" — consider whether a dedicated Agent.resume_from_checkpoint(checkpoint) method would better guide users toward correct usage while keeping the content block as a low-level API for advanced use cases. This would align with the "Provide Both Low-Level and High-Level APIs" decision record.
| if resume_pos == "after_model": | ||
| pass # Just resumed here — skip re-checkpoint, proceed to tools | ||
| else: | ||
| cycle_index = invocation_state.get("_checkpoint_cycle_index", 0) |
There was a problem hiding this comment.
Issue: When checkpointing=True and the model returns end_turn (no tool_use), the agent completes normally but _checkpoint_cycle_index in invocation_state defaults to 0 throughout. There's no after_model checkpoint emitted for end_turn — is this intentional? If the model returns end_turn on the very first call, the user gets no checkpoint at all. This seems correct for durability semantics (nothing to resume from), but it's worth documenting explicitly that checkpoints are only emitted when there's a tool_use cycle.
Also: the _checkpoint_cycle_index key is initialized lazily via invocation_state.get("_checkpoint_cycle_index", 0). If this key is not consumed on an end_turn path, it's harmless but could be confusing during debugging.
| version = data.get("schema_version", "") | ||
| if version != CHECKPOINT_SCHEMA_VERSION: | ||
| raise ValueError( | ||
| raise CheckpointException( |
There was a problem hiding this comment.
Issue: Checkpoint.from_dict was changed from raising ValueError to CheckpointException, but the docstring for _convert_prompt_to_messages in agent.py still documents the convention as "ValueError for misconfig". Additionally, users who were catching ValueError (based on the Part A docs) would now miss this exception.
Suggestion: Since this is still experimental, the change is reasonable — but ensure the exception hierarchy is intentional. Consider whether CheckpointException should be a subclass of ValueError to maintain backward compatibility with code that catches ValueError.
| raise KeyError("checkpoint | missing required key in checkpointResume block") | ||
|
|
||
| checkpoint = Checkpoint.from_dict(resume_block["checkpoint"]) | ||
| self.load_snapshot(Snapshot.from_dict(checkpoint.snapshot)) |
There was a problem hiding this comment.
Issue: The Snapshot.from_dict(checkpoint.snapshot) call on line 1048 could fail if the snapshot data is corrupted or from an incompatible snapshot schema version. This error would bubble up as whatever Snapshot.from_dict raises, which may not have a clear error message about the checkpoint resume context.
Suggestion: Consider wrapping line 1047-1048 in a try/except that provides a clear error message, e.g., "Failed to restore agent state from checkpoint snapshot: {original_error}".
There was a problem hiding this comment.
SnapshotException already carries a clear, structured message ("Unsupported snapshot schema version: '{v}'. Current version: {c}"). Adding a wrapper layer at the checkpoint level introduces indirection without adding information. If a user hits SnapshotException during checkpoint resume, the message + the call stack are enough to diagnose.
There was a problem hiding this comment.
Fair — SnapshotException is already descriptive and the stack trace provides the checkpoint resume context. No wrapping needed.
|
Assessment: Comment This is a well-structured PR that wires checkpoint functionality into the agent loop with a clean opt-in design. The state machine is carefully reasoned and the integration tests (especially the crash-after-tools test) are compelling. Two themes warrant attention before merge: Review Themes
The feature design, state-machine logic, and durability proof are solid. The integration tests are particularly well-designed. |
| # Checkpoint after model call, before tool execution. | ||
| # One-shot pop: safe because after_model always returns before reaching | ||
| # after_tools, so the stashed position is only consumed once. | ||
| if agent._checkpointing: |
There was a problem hiding this comment.
Issue: When checkpointing=True and the cancel_signal is set, the after_model checkpoint emission on line 230 runs before the cancel check inside _handle_tool_execution (line 554). This means cancellation during a checkpointing agent's tool-use cycle would produce stop_reason="checkpoint" instead of stop_reason="cancelled". On resume, the cancel signal wouldn't be set, so the agent would proceed normally.
This is probably fine for durable-execution semantics (the orchestrator decides whether to re-cancel), but it's an unspecified interaction that could surprise users who expect cancel() to always produce stop_reason="cancelled".
Suggestion: Document the cancel-vs-checkpoint precedence (checkpoint wins, cancel is ignored) either in the module docstring or add a test that specifies the expected behavior.
|
Assessment: Comment Good progress since the last round — the New Review Items
The prior-round items around API review ( |
Description
wires the
Checkpointdata model (landed in #2181) into the agent runtime so an opt-incheckpointing=Trueagent pauses at ReAct cycle boundaries and resumes cleanly from persisted checkpoints — including across fresh process boundaries, which is what makes durability providers like Temporal, Dapr, and AWS Step Functions usable with Strands.The design mirrors the existing interrupt pattern by construction —
stop_reason="checkpoint",checkpointResumecontent block for resume, snapshot-based state transfer. Users who know interrupts know this.User-facing API (zero breaking changes — opt-in only):
What changed:
Agent.__init__— newcheckpointing: bool = Falseparameter and two internal fields (_checkpointing,_checkpoint_resume_context). Default False: zero behavioral change for existing callers.Agent._convert_prompt_to_messages— detectscheckpointResumecontent blocks, validates shape (mirrors_InterruptState.resume()conventions:TypeErrorfor shape,KeyErrorfor lookup,ValueErrorfor misconfig), loads the snapshot, and stashes the resume context.event_loop_cycle— one priming block (reads + one-shot clears resume context) plus two checkpoint emission points (after_modelandafter_tools), all gated behindagent._checkpointing.AgentResult— newcheckpoint: Checkpoint | None = Nonefield;to_dict/from_dictround-trip it.EventLoopStopEvent— extended constructor withcheckpointkwarg; the 7-tuple matchesAgentResultfield order for positional unpacking.strands.experimental.checkpoint— newCheckpointResumeContent/CheckpointResumeDictTypedDicts (parallel toInterruptResponseContentintypes/interrupt.py).checkpoint.pydocstring — adds "Interaction with interrupts" section explaining precedence (interrupts win over checkpoints when both would fire in the same cycle, by design).State-machine verification. Four scenarios traced against the code and covered by tests:
checkpointing=False→ identical to pre-change.checkpointing=True, tool_use →after_modelcheckpoint atcycle_index=0.after_model→ snapshot restored, model call skipped (assistant tool_use is already last message), tools run,after_toolscheckpoint atcycle_index=0.after_toolsatcycle_index=N→ primesinvocation_state["_checkpoint_cycle_index"]=N+1, model runs, nextafter_modelcheckpoint carriescycle_index=N+1.Durability proof — the killer test.
test_crash_after_tools_does_not_rerun_completed_tools: three tools with independent call counters, agent runs throughafter_tools, the Agent instance is discarded entirely (del), a fresh Agent resumes from the persisted checkpoint, and the post-crash model returnsend_turn. Assertion: each tool's counter is exactly 1. Completed work survives worker loss.V0 known limitations (documented in
checkpoint.pymodule docstring, not blockers):OpenAIResponsesModel(stateful=True)not supported —_model_stateis not intake_snapshot(preset="session"). Follow-up issue to extend the snapshot preset.AgentResult.messageatafter_toolsis the assistant message that requested the tools (tool results are insidecheckpoint.snapshot).BeforeInvocationEvent/AfterInvocationEventfire on every resume call (same as interrupts — hooks counting invocations see each resume as a separate invocation).ToolExecutor(e.g. a futureTemporalToolExecutor). The SDK checkpoint operates at cycle boundaries.Related Issues
Documentation PR
Type of Change
New feature
Testing
Verified the changes do not break functionality or introduce warnings in consuming repositories.
hatch run prepareEvidence from fresh runs:
hatch test— 2641 passed, 0 failed (up from 2620 baseline: 21 new tests).hatch run hatch-static-analysis:lint-check—ruff checkandmypyboth clean (145 source files).hatch run hatch-static-analysis:format-check— 366/366 files formatted.New tests added:
CheckpointResumetypes:tests/strands/experimental/checkpoint/test_types.py(2 tests, shape of the new TypedDicts).AgentResult.checkpoint:tests/strands/agent/test_agent_result.py(6 new tests folded in — field default, accepts checkpoint,to_dictincludes/omits checkpoint,from_dictround-trip, missing-checkpoint resilience).EventLoopStopEventcheckpoint kwarg:tests/strands/types/test__events.py(2 new tests folded in — tuple length, default None).Agent.__init__flag:tests/strands/agent/test_agent.py(2 new tests)._convert_prompt_to_messagesvalidation:tests/strands/agent/test_agent.py(5 new tests —checkpointing=Falseerror, mixed content, multiple blocks, missing key, schema mismatch).tests/strands/event_loop/test_event_loop.py(3 new tests —after_modelemission,after_toolsemission, cycle-index continuity across resume).tests/strands/experimental/checkpoint/test_checkpoint.py(2 new tests folded in — round-trip across three cycles through fresh Agent instances, and the killer crash-after-tools test).The 7-tuple shape change to
EventLoopStopEventrequired updating 10 pre-existing test-side tuple unpackers and two assertions intest__events.py. Done mechanically (add one slot each); all pre-existing tests still pass.Checklist
agent-docs; module-level docstring incheckpoint.pyupdated with V0 limitations and interrupt interaction)main; this PR builds on it)By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.