From b24cd634c5deaca77a709e85cb3741b30dcd3a64 Mon Sep 17 00:00:00 2001 From: Jack Yuan Date: Tue, 3 Mar 2026 12:33:19 -0500 Subject: [PATCH 01/11] feat: durable integration doc --- team/DURABILITY_PROVIDER_INTEGRRATION.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 team/DURABILITY_PROVIDER_INTEGRRATION.md diff --git a/team/DURABILITY_PROVIDER_INTEGRRATION.md b/team/DURABILITY_PROVIDER_INTEGRRATION.md new file mode 100644 index 000000000..e69de29bb From 890185830dc727469f92d572c67b417c5e7d5281 Mon Sep 17 00:00:00 2001 From: Jack Yuan Date: Tue, 3 Mar 2026 12:51:51 -0500 Subject: [PATCH 02/11] feat: add missing doc --- team/DURABILITY_PROVIDER_INTEGRRATION.md | 301 +++++++++++++++++++++++ 1 file changed, 301 insertions(+) diff --git a/team/DURABILITY_PROVIDER_INTEGRRATION.md b/team/DURABILITY_PROVIDER_INTEGRRATION.md index e69de29bb..bc540a167 100644 --- a/team/DURABILITY_PROVIDER_INTEGRRATION.md +++ b/team/DURABILITY_PROVIDER_INTEGRRATION.md @@ -0,0 +1,301 @@ +# Durable Execution Provider Integration + +**Status**: Proposed +**Date**: 2026-03-01 +**Issue**: https://github.com/strands-agents/sdk-python/issues/1369 +**Target Release**: 2.0 + +--- + +## Context + +Durable agents are designed for production workloads that need resilience across failures, long waits, and restarts. Unlike ephemeral agents that lose all progress when a process dies, durable agents persist their state and resume from where they left off. + +The Strands SDK currently runs the agent reasoning loop entirely in-process: a `while` loop inside `invoke_async` that alternates between calling the LLM and executing tools. All state lives in memory for the duration of a single call. 
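In outline, that loop behaves like the following sketch. This is a simplified model, not the real `invoke_async` internals; `call_llm`, `run_tool`, and `Response` are illustrative stand-ins for the actual model and tool plumbing:

```python
# Simplified sketch of the in-process agent loop. All names here
# (call_llm, run_tool, Response) are illustrative stand-ins, not
# the real Strands internals.
from dataclasses import dataclass, field


@dataclass
class Response:
    text: str
    tool_calls: list = field(default_factory=list)


def call_llm(messages):
    # Stub model: request one tool call, then finish.
    if any(m["role"] == "tool" for m in messages):
        return Response(text="done")
    return Response(text="", tool_calls=[{"name": "echo", "input": "hi"}])


def run_tool(tool_call):
    return f"echo: {tool_call['input']}"


def invoke(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    while True:  # state lives only in this process
        response = call_llm(messages)
        if not response.tool_calls:
            return response.text
        for tool_call in response.tool_calls:
            messages.append({"role": "tool", "content": run_tool(tool_call)})
```

Everything accumulated in `messages` exists only inside this one call.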
+ +``` +Today: Single invoke_async call + + User Prompt + │ + ▼ +┌─────────────────────────────────────┐ +│ In-Process Loop │ +│ │ +│ ┌──────┐ ┌──────┐ ┌──────┐ │ +│ │ LLM │──▶│ Tool │──▶│ LLM │──▶ │ AgentResult +│ └──────┘ └──────┘ └──────┘ │ +│ │ +│ State: agent.messages (in memory) │ +└─────────────────────────────────────┘ + Process dies here? + → Everything lost +``` + +This doc covers two providers: [Temporal](https://temporal.io/) and [AWS Lambda Durable Execution](https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html). + +--- + +## 1. How We Can Integrate Them into Our SDK Today + +### Temporal: Agent as Activity + +Wrap `agent("prompt")` as a Temporal Activity. Temporal retries the Activity on failure. The `session_manager` restores conversation history across retries. + +```python +from temporalio import activity +from strands import Agent +from strands.session import S3SessionManager + + +@activity.defn +async def run_agent_activity(session_id: str, prompt: str) -> str: + agent = Agent( + model="us.anthropic.claude-sonnet-4-5", + tools=[migrate_db, send_email], + session_manager=S3SessionManager(bucket="state", session_id=session_id), + ) + result = agent(prompt) + return str(result.message) +``` + +The Temporal workflow handles this activity. + +```python +from temporalio import workflow + + +@workflow.defn +class AgentWorkflow: + @workflow.run + async def run(self, session_id: str, prompt: str) -> str: + return await workflow.execute_activity( + run_agent_activity, + args=[session_id, prompt], + schedule_to_close_timeout=timedelta(minutes=10), + retry_policy=RetryPolicy(maximum_attempts=3), + ) +``` + +Temporal retries the activity if the worker crashes. `S3SessionManager` restores conversation history on each retry. + +The entire agent loop is one atomic Activity. If the process crashes after tool call 2 but before tool call 3, the whole activity retries. 
+ +``` + Activity: run_agent_activity <-- one black box + [LLM]─[Tool 1]─[Tool 2]─[CRASH] <-- retry from here leads to re-execution from beginning. +``` + +--- + +### AWS Lambda Durable: Agent as Durable Step + +AWS Lambda Durable Functions (launched December 2025) supports wrapping any callable as a `@durable_step`. You can wrap `agent("prompt")` as a single step today with no SDK changes required. + +```python +from aws_durable_execution_sdk_python import durable_execution, durable_step, DurableContext +from strands import Agent +from strands.session import S3SessionManager + + +@durable_step +def run_strands_agent(ctx, prompt: str, session_id: str) -> str: + agent = Agent( + model="us.anthropic.claude-sonnet-4-5", + tools=[migrate_db, send_email], + session_manager=S3SessionManager(bucket="state", session_id=session_id), + ) + return str(agent(prompt).message) + + +@durable_execution +def handler(event: dict, context: DurableContext) -> dict: + result = context.step( + run_strands_agent, event["prompt"], event["session_id"] + ) + return {"result": result} +``` + +Lambda's 15-minute ceiling no longer applies to the overall execution. Each step can run for up to 15 minutes and the `context.step()` call itself can wait indefinitely. If Lambda crashes after a step completes, AWS replays the handler and injects the cached result so the step does not re-execute. + +The fundamental problem is still the same as the Temporal pattern. The entire agent loop is treated as one step. Crash recovery only happens at step boundaries, so if Lambda dies mid-loop, the entire step retries and already-completed tool calls end up re-executing. + +``` +Lambda Durable sees: + + context.step(run_strands_agent) <-- one checkpoint + [LLM]─[Tool 1]─[Tool 2]─[CRASH] <-- Step retries → Tool 1 and Tool 2 re-execute +``` + +#### Integrating with an Existing Lambda Layer + +Lambda Durable can only be enabled on new functions. 
The migration path is to add the SDK to the existing Layer build and deploy a new function that references that Layer. + +Existing functions are untouched. The new durable function shares the same Layer. The only additions needed are `aws_durable_execution_sdk_python` in `requirements.txt` and the `@durable_execution` / `@durable_step` decorators in the new handler. + +--- + +### Pattern Comparison + +| | Temporal | AWS Lambda Durable | +|---|---|---| +| No SDK changes needed | ✅ | ✅ | +| Survives Lambda 15-min ceiling | ✅ (via Activity timeout) | ✅ (native) | +| Crash recovery at invocation level | ✅ | ✅ | +| Mid-loop crash recovery | ❌ | ❌ | +| Non-idempotent tools safe | ❌ | ❌ | +| Per-LLM / per-tool checkpointing | ❌ | ❌ | +| Step-level visibility | ❌ | ❌ | + +Both current patterns solve the execution ceiling problem. Neither solves mid-loop durability. + +--- + +## 2. Gap If We Want Native Integration + +The key takeaway is both providers need the handler to own the loop. They cannot checkpoint what they cannot see. If the agent loop is hidden inside our SDK, these platforms have no way to hook into individual steps. + +**Temporal needs:** Each LLM call and each tool call as a separate Activity. Temporal records each Activity result permanently. On crash, completed Activities are replayed from history and the loop resumes at the step that was interrupted. + +**Lambda Durable needs:** Each LLM call and each tool call wrapped in its own `context.step()`. AWS records each step result. On Lambda interruption, completed steps are skipped and the loop resumes at the step that was interrupted. + +Both require the handler to control the loop: + +``` +What both need: What the current SDK gives: + + while True: context.step( + llm = checkpoint(call_llm) agent("prompt") ← one black box + tool = checkpoint(call_tool) ) + if done: break +``` + +The current SDK owns the loop inside `invoke_async`. 
There is no mechanism for external code to inject a checkpoint between iterations. The event loop cannot call back into the durable platform between steps. + +This is the single root cause of all current integration limitations, and it surfaces as three gaps: + +**Gap 1. Event loop does not use `invoke_callbacks_async` at step fire points** + +`HookRegistry` already ships a fully working `invoke_callbacks_async()` method. The gap is that `event_loop.py` does not yet call it at `AfterToolCallEvent` and `AfterModelCallEvent`. Those two call sites still use the sync `invoke_callbacks()`, which raises a `RuntimeError` at runtime if an async callback is registered. Until those call sites are updated, there is no way to write an async checkpoint hook that actually fires mid-loop. + +**Gap 2. No serializable agent configuration** + +When a Temporal worker or a Lambda Durable handler replays the loop on a new process, it needs to reconstruct the agent from scratch each iteration. The current stateful agent accumulates `self.messages` and cannot be reconstructed from configuration alone. This will be addressed by the stateless agent proposal (@Patrick). + +**Gap 3. No `DurableBackend` abstraction on `Agent`** + +There is no way to tell `Agent` to dispatch its loop to an external platform. A `durable_backend` parameter and an async dispatch method are needed to make the integration work. + +--- + +## 3. Proposed Solution + +The solution opens the event loop so that durable platforms can observe and checkpoint each step, without changing the existing `agent("prompt")` call signature. 
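Concretely, once the loop awaits hook callbacks, an external integration could register an async checkpoint hook like the sketch below. The registry, the event type, and the checkpoint list here are simplified illustrations, not the actual Strands API:

```python
# Minimal async hook registry sketch. AfterToolCallEvent and the
# checkpoint list are illustrative; the real Strands types differ.
import asyncio
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AfterToolCallEvent:
    tool_name: str
    result: str


class HookRegistry:
    def __init__(self):
        self._callbacks = defaultdict(list)

    def add_callback(self, event_type, callback):
        self._callbacks[event_type].append(callback)

    async def invoke_callbacks_async(self, event):
        # Await every callback so async checkpoint hooks can fire mid-loop.
        for callback in self._callbacks[type(event)]:
            await callback(event)


checkpoints = []


async def checkpoint_hook(event: AfterToolCallEvent):
    # Stand-in for workflow.execute_activity(...) or context.step(...)
    checkpoints.append((event.tool_name, event.result))


registry = HookRegistry()
registry.add_callback(AfterToolCallEvent, checkpoint_hook)
asyncio.run(registry.invoke_callbacks_async(AfterToolCallEvent("migrate_db", "ok")))
```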
+ +### Architecture + +``` +┌─────────────────────────────────────────────────────┐ +│ Agent (2.0) │ +│ │ +│ agent("prompt") → AgentResult (unchanged) │ +│ agent.dispatch_async() → ExecutionHandle (new) │ +│ │ +│ durable_backend: DurableBackend | None │ +└─────────────────┬───────────────────────────────────┘ + │ + ┌──────────┴───────────┐ + │ │ + backend = None backend set + │ │ + in-process loop dispatch to platform + (unchanged) → block → AgentResult + or + → ExecutionHandle (non-blocking) + │ + ┌───────────────┴──────────────┐ + │ │ + TemporalBackend LambdaDurableBackend +``` + +### Core SDK Changes + +**1. Wire `invoke_callbacks_async` into the event loop at step fire points.** + +`HookRegistry.invoke_callbacks_async()` already exists and already handles both sync and async callbacks correctly. The only change needed is in `event_loop.py`: replace the two `invoke_callbacks()` call sites at `AfterToolCallEvent` and `AfterModelCallEvent` with `await invoke_callbacks_async()`. This is a one-line change per call site and has no impact on existing sync-only hooks. + +```python +# strands/hooks/registry.py +class HookRegistry: + async def invoke_callbacks_async(self, event: HookEvent) -> None: + for callback in self._callbacks.get(type(event), []): + await callback(event) +``` + +**2. `AgentSpec`:** A frozen, JSON-safe dataclass (`model_id`, `tool_names`, `tool_schemas`, `system_prompt`, etc.) built via `Agent._build_spec()` at dispatch time. This is the only thing that crosses the process boundary to a remote worker or Lambda handler, never a live `Agent` object. + +```python +@dataclass(frozen=True) +class AgentSpec: + model_id: str + system_prompt: str + tool_names: list[str] + tool_schemas: dict[str, dict] + session_id: str +``` + +**3. `DurableBackend` + `ExecutionHandle`:** Two ABCs in a new `strands.agent.backends` module. `DurableBackend.dispatch(spec, prompt)` returns an `ExecutionHandle`. 
The actual implementations live in `strands-temporal` and `strands-aws` as separate packages, so the core SDK has no runtime dependency on either. + +```python +# strands/agent/backends.py +class ExecutionHandle(ABC): + async def result(self) -> AgentResult: ... + +class DurableBackend(ABC): + async def dispatch(self, spec: AgentSpec, prompt: str) -> ExecutionHandle: ... +``` + +**4. `durable_backend` on `Agent`:** A single optional constructor parameter (default `None`). When set, `invoke_async` delegates to the backend and still returns `AgentResult` as before. A new `dispatch_async()` method returns an `ExecutionHandle` for callers that want non-blocking control. + +```python +agent = Agent(tools=[...], durable_backend=LambdaDurableBackend()) + +# Blocking — same call signature as today +result = await agent.invoke_async("prompt") + +# Non-blocking — get a handle and await later +handle = await agent.dispatch_async("prompt") +result = await handle.result() +``` + + +Both `strands-temporal` and `strands-aws` share the same goal: checkpoint after every LLM call and every tool call so a crash at any point resumes from the last completed step rather than from the beginning. The mechanism differs per platform (Temporal uses `@activity.defn`, Lambda Durable uses `context.step()`) but the contract is identical. Each agent loop iteration is two checkpointed units: one for the LLM call, one for the tool call. On replay, completed units are skipped and the loop continues from where it stopped. 
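That skip-on-replay contract can be sketched with an in-memory step cache keyed by deterministic step names. Everything here is illustrative; real backends persist results to the Temporal event history or the Lambda checkpoint store rather than a dict:

```python
# Skip-on-replay sketch: each checkpointed unit is keyed by a
# deterministic name; on replay, cached results are returned
# instead of re-executing. All names are illustrative.
step_cache = {}
executions = []


def step(name, fn):
    if name in step_cache:      # replay: skip the completed unit
        return step_cache[name]
    result = fn()               # first run: execute and record
    executions.append(name)
    step_cache[name] = result
    return result


def run_loop(crash_at=None):
    for i in range(2):          # two agent-loop iterations
        step(f"llm-{i}", lambda: f"llm-result-{i}")
        if crash_at == f"tool-{i}":
            raise RuntimeError("crash")
        step(f"tool-{i}", lambda: f"tool-result-{i}")
    return "done"


try:
    run_loop(crash_at="tool-1")  # crashes before the last tool step
except RuntimeError:
    pass

result = run_loop()              # replay: completed steps are skipped
```

Every step executes exactly once across the crash and the replay.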
+ +### Proposal + +| Gap | Fix | Who | +|---|---|---| +| Event loop calls sync `invoke_callbacks` at step fire points | Replace two call sites in `event_loop.py` with `await invoke_callbacks_async()` | Core SDK | +| No serializable config | `AgentSpec` frozen dataclass | Core SDK | +| No extension point on `Agent` | `durable_backend` param + `dispatch_async()` | Core SDK | +| No remote tool resolution | `ToolRegistry.resolve(name)` + `all_registered()` | Core SDK | +| No Temporal worker | `StrandsWorkflow`, `create_strands_worker()` | `strands-temporal` package | +| No Lambda Durable handler | `@durable_execution` handler + `@durable_step` wrappers | `strands-aws` package | + + +--- + +## Action Items + +1. Aws Durable Lambda as entry point +2. Update event_loop.py to call invoke_callbacks_async() +3. Add AgentSpec frozen dataclass ( Proposed by @Patrick) +4. Add durable_backend param and dispatch_async() to Agent +5. Implement provider packages + + +## Willingness to Implement + +Yes. The core SDK gaps are the prerequisite and are owned by the core team. The `strands-temporal` and `strands-aws` provider packages are parallel efforts, each depending on the stateless agent proposal landing first. + +--- \ No newline at end of file From 6d1083f89b3eb2e158c71cab658fd9890e9c5225 Mon Sep 17 00:00:00 2001 From: Jack Yuan Date: Tue, 3 Mar 2026 12:53:22 -0500 Subject: [PATCH 03/11] fix: update willingness --- team/DURABILITY_PROVIDER_INTEGRRATION.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/team/DURABILITY_PROVIDER_INTEGRRATION.md b/team/DURABILITY_PROVIDER_INTEGRRATION.md index bc540a167..b94b7e100 100644 --- a/team/DURABILITY_PROVIDER_INTEGRRATION.md +++ b/team/DURABILITY_PROVIDER_INTEGRRATION.md @@ -296,6 +296,6 @@ Both `strands-temporal` and `strands-aws` share the same goal: checkpoint after ## Willingness to Implement -Yes. The core SDK gaps are the prerequisite and are owned by the core team. 
The `strands-temporal` and `strands-aws` provider packages are parallel efforts, each depending on the stateless agent proposal landing first. +TBD. Maybe start AWS durable first. --- \ No newline at end of file From 0b766b1993b870804ec7f2dfff1ac0e50e7601ec Mon Sep 17 00:00:00 2001 From: Jack Yuan Date: Wed, 4 Mar 2026 00:16:49 -0500 Subject: [PATCH 04/11] fix: update docs based on new finding --- team/DURABILITY_PROVIDER_INTEGRRATION.md | 138 +++++++++++++++++++---- 1 file changed, 119 insertions(+), 19 deletions(-) diff --git a/team/DURABILITY_PROVIDER_INTEGRRATION.md b/team/DURABILITY_PROVIDER_INTEGRRATION.md index b94b7e100..74270d5db 100644 --- a/team/DURABILITY_PROVIDER_INTEGRRATION.md +++ b/team/DURABILITY_PROVIDER_INTEGRRATION.md @@ -7,7 +7,7 @@ --- -## Context +## 1. Context Durable agents are designed for production workloads that need resilience across failures, long waits, and restarts. Unlike ephemeral agents that lose all progress when a process dies, durable agents persist their state and resume from where they left off. @@ -32,11 +32,111 @@ Today: Single invoke_async call → Everything lost ``` -This doc covers two providers: [Temporal](https://temporal.io/) and [AWS Lambda Durable Execution](https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html). +This doc covers two providers: [Temporal](https://temporal.io/), [Dapr](https://dapr.io/) and [AWS Lambda Durable Execution](https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html). ---- +### 1. How Durable Providers Orchestrate AI Agents + +Before diving into integration, Let's go through the shared architecture these providers use and how existing agent frameworks build on them. + + +Temporal, Dapr, and Lambda Durable all share the same recovery mechanism: record the result of each completed unit, skip it on replay, resume from where execution stopped. Then the granularity is the key. 
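To see why granularity matters, compare a coarse checkpoint (the whole loop is one unit) with fine checkpoints (one per step) when a crash interrupts the run. This is a self-contained toy, not any provider's API:

```python
# Toy comparison: how many tool executions repeat after one crash,
# under coarse (whole-loop) vs fine (per-step) checkpointing.
def run_once(cache, executed, crash):
    """One attempt over three tools; raises at tool-2 if crash is set."""
    for i in range(3):
        key = f"tool-{i}"
        if key in cache:
            continue            # already checkpointed: skip
        executed.append(key)
        if crash and i == 2:
            raise RuntimeError("crash at tool-2")
        cache[key] = "ok"


# Coarse granularity: the whole loop is one unit, so the retry
# starts with an empty cache and re-runs every tool.
coarse_executed = []
try:
    run_once(cache={}, executed=coarse_executed, crash=True)
except RuntimeError:
    run_once(cache={}, executed=coarse_executed, crash=False)

# Fine granularity: the per-step cache survives the crash, so only
# the interrupted tool re-runs.
fine_cache = {}
fine_executed = []
try:
    run_once(cache=fine_cache, executed=fine_executed, crash=True)
except RuntimeError:
    run_once(cache=fine_cache, executed=fine_executed, crash=False)
```

Coarse checkpointing executes six tool calls to finish three; fine checkpointing executes four.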
+Each provider names the pieces differently but the model is the same: + +| Concept | Temporal | Dapr | Lambda Durable | +|---|---|---|---| +| Orchestrator | Workflow | Workflow | `@durable_execution` handler | +| Checkpointable unit | Activity | Activity | `context.step()` | +| Replay mechanism | Event History | State Store | Cached step results | +| Human-in-the-loop | Signal | Signal / PubSub event | (not yet supported) | + +Here is what a 3-step agent loop looks like when the provider can see each step vs. when it cannot: + +``` + +Provider sees each step (native): Provider sees one black box (wrapper): + + ┌─ LLM call ──── ✅ saved ─┐ ┌─ agent("prompt") ── ? ─┐ + │ │ │ │ + ├─ Tool 1 ──── ✅ saved ─┤ │ LLM call │ + │ │ │ Tool 1 │ + ├─ Tool 2 ──── 💥 CRASH │ │ Tool 2 💥 CRASH │ + │ │ │ │ + │ Resume here ─▶ Tool 2 │ │ Resume ─▶ LLM call │ + │ │ │ (start over) │ + ├─ LLM call ──── ✅ saved ─┤ └────────────────────────┘ + │ │ + └─ Done + +``` +The left side is what Temporal AI Agent and Dapr's Durable Agent achieve. The right side is what happens if we wrap the Strands agent loop as a single unit. + +### 2. How Those Providers Build Their First-class AI agent + + +**Temporal AI Agent** [temporal-community/temporal-ai-agent](https://github.com/temporal-community/temporal-ai-agent) puts the agent loop *inside the Workflow*. Each LLM call and each tool call is dispatched as a separate Activity. The Workflow is deterministic — it just decides "call LLM next" or "call tool next." The Activities do the actual I/O. On crash, Temporal replays completed Activities from event history and the loop resumes mid-conversation. 
See the example code below, and read the [docs](https://docs.temporal.io/ai-cookbook/agentic-loop-tool-call-claude-python) for more.

```python
# Temporal: orchestrator owns the loop, each step is an Activity
@workflow.defn
class AgentWorkflow:
    @workflow.run
    async def run(self, prompt: str):
        messages = [{"role": "user", "content": prompt}]
        while True:
            # Each LLM call = checkpointed Activity
            response = await workflow.execute_activity(
                call_llm, args=[messages], ...
            )
            if not response.tool_calls:
                return response.text

            # Each tool call = checkpointed Activity (uses dynamic activities)
            for tool_call in response.tool_calls:
                result = await workflow.execute_activity(
                    tool_call.name, args=[tool_call.input], ...
                )
                messages.append(tool_result(result))
```

**Dapr Agents** offer both a general agent and a workflow-backed `DurableAgent`: each LLM call and tool execution automatically runs as a durable activity, and the `AgentRunner` handles the workflow lifecycle.

```python
from dapr_agents.workflow.runners import AgentRunner


async def main():
    travel_planner = DurableAgent(
        name="TravelBuddy",
        role="Travel Planner",
        goal="Help users find flights and remember preferences",
        instructions=["Help users find flights and remember preferences"],
        tools=[search_flights],
        memory=AgentMemoryConfig(
            store=ConversationDaprStateMemory(
                store_name="conversationstore",
                session_id="travel-session",
            )
        ),
    )

    runner = AgentRunner()

    try:
        itinerary = await runner.run(
            travel_planner,
            payload={"task": "Plan a 3-day trip to Paris"},
        )
        print(itinerary)
    finally:
        runner.shutdown(travel_planner)
```

Note the `memory` field here: Dapr externalizes conversation state to a pluggable state store keyed by `session_id`. This is functionally equivalent to our SessionManager; once we introduce checkpoint-based snapshots, we will close this gap.
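The session-keyed pattern that both Dapr state-store memory and a SessionManager follow can be sketched in a few lines. The class and method names below are illustrative, not either library's API:

```python
# Sketch of a session-keyed conversation store: history is persisted
# under a session_id so a new process can restore it. Names are
# illustrative, not Dapr's or Strands' actual API.
class InMemorySessionStore:
    def __init__(self):
        self._data = {}

    def load(self, session_id):
        return list(self._data.get(session_id, []))

    def save(self, session_id, messages):
        self._data[session_id] = list(messages)


store = InMemorySessionStore()
store.save("travel-session", [{"role": "user", "content": "Plan a trip"}])

# A new process (or a retried activity) restores history by key.
restored = store.load("travel-session")
```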
+ +We will discuss AWS Durable in next section since they all face the same granularity issue. + +## 2. How We Can Integrate Them into Our SDK Today: Wrapper Pattern ### Temporal: Agent as Activity @@ -77,18 +177,18 @@ class AgentWorkflow: ) ``` -Temporal retries the activity if the worker crashes. `S3SessionManager` restores conversation history on each retry. - -The entire agent loop is one atomic Activity. If the process crashes after tool call 2 but before tool call 3, the whole activity retries. +Temporal retries the activity if the worker crashes. `S3SessionManager` restores conversation history on each retry. The entire agent loop is one atomic Activity. If the process crashes after tool call 2 but before tool call 3, the whole activity retries. ``` Activity: run_agent_activity <-- one black box [LLM]─[Tool 1]─[Tool 2]─[CRASH] <-- retry from here leads to re-execution from beginning. ``` +Let us skip Dapr general agent for now since it has the same pattern. + --- -### AWS Lambda Durable: Agent as Durable Step +### AWS Lambda Durable: Agent as Durable Step [Diagram and docs](https://docs.aws.amazon.com/lambda/latest/dg/durable-functions.html) AWS Lambda Durable Functions (launched December 2025) supports wrapping any callable as a `@durable_step`. You can wrap `agent("prompt")` as a single step today with no SDK changes required. @@ -127,9 +227,9 @@ Lambda Durable sees: [LLM]─[Tool 1]─[Tool 2]─[CRASH] <-- Step retries → Tool 1 and Tool 2 re-execute ``` -#### Integrating with an Existing Lambda Layer +### Integrating with an Existing Lambda Layer -Lambda Durable can only be enabled on new functions. The migration path is to add the SDK to the existing Layer build and deploy a new function that references that Layer. +[Lambda Durable](https://github.com/aws/aws-durable-execution-sdk-python) can only be enabled on new functions. The migration path is to add the SDK to the existing Layer build and deploy a new function that references that Layer. 
Existing functions are untouched. The new durable function shares the same Layer. The only additions needed are `aws_durable_execution_sdk_python` in `requirements.txt` and the `@durable_execution` / `@durable_step` decorators in the new handler. @@ -137,15 +237,15 @@ Existing functions are untouched. The new durable function shares the same Layer ### Pattern Comparison -| | Temporal | AWS Lambda Durable | -|---|---|---| -| No SDK changes needed | ✅ | ✅ | -| Survives Lambda 15-min ceiling | ✅ (via Activity timeout) | ✅ (native) | -| Crash recovery at invocation level | ✅ | ✅ | -| Mid-loop crash recovery | ❌ | ❌ | -| Non-idempotent tools safe | ❌ | ❌ | -| Per-LLM / per-tool checkpointing | ❌ | ❌ | -| Step-level visibility | ❌ | ❌ | +| | Temporal | Dapr | AWS Lambda Durable | +|---|---|---|---| +| No SDK changes needed | ✅ | ✅ | ✅ | +| Survives Lambda 15-min ceiling | ✅ (via Activity timeout) | ✅ (via Activity timeout) | ✅ (native) | +| Crash recovery at invocation level | ✅ | ✅ | ✅ | +| Mid-loop crash recovery | ❌ | ❌ | ❌ | +| Non-idempotent tools safe | ❌ | ❌ | ❌ | +| Per-LLM / per-tool checkpointing | ❌ | ❌ | ❌ | +| Step-level visibility | ❌ | ❌ | ❌ | Both current patterns solve the execution ceiling problem. Neither solves mid-loop durability. From 040899b330046df25eabc9c9186031f028e93881 Mon Sep 17 00:00:00 2001 From: Jack Yuan Date: Wed, 4 Mar 2026 13:12:50 -0500 Subject: [PATCH 05/11] 2nd updates to include more patterns --- team/DURABILITY_PROVIDER_INTEGRRATION.md | 173 ++++++++++++++++++++--- 1 file changed, 150 insertions(+), 23 deletions(-) diff --git a/team/DURABILITY_PROVIDER_INTEGRRATION.md b/team/DURABILITY_PROVIDER_INTEGRRATION.md index 74270d5db..534a14772 100644 --- a/team/DURABILITY_PROVIDER_INTEGRRATION.md +++ b/team/DURABILITY_PROVIDER_INTEGRRATION.md @@ -190,7 +190,7 @@ Let us skip Dapr general agent for now since it has the same pattern. 
### AWS Lambda Durable: Agent as Durable Step [Diagram and docs](https://docs.aws.amazon.com/lambda/latest/dg/durable-functions.html) -AWS Lambda Durable Functions (launched December 2025) supports wrapping any callable as a `@durable_step`. You can wrap `agent("prompt")` as a single step today with no SDK changes required. +AWS Lambda Durable Function supports wrapping any callable as a `@durable_step`. We can wrap `agent("prompt")` as a single step today with no SDK changes required. ```python from aws_durable_execution_sdk_python import durable_execution, durable_step, DurableContext @@ -227,9 +227,8 @@ Lambda Durable sees: [LLM]─[Tool 1]─[Tool 2]─[CRASH] <-- Step retries → Tool 1 and Tool 2 re-execute ``` -### Integrating with an Existing Lambda Layer -[Lambda Durable](https://github.com/aws/aws-durable-execution-sdk-python) can only be enabled on new functions. The migration path is to add the SDK to the existing Layer build and deploy a new function that references that Layer. +[Lambda Durable](https://github.com/aws/aws-durable-execution-sdk-python) is sync only [issue](https://github.com/aws/aws-durable-execution-sdk-python/issues/316), and it can only be enabled on new functions. The migration path is to add the SDK to the existing Layer build and deploy a new function that references that Layer. Existing functions are untouched. The new durable function shares the same Layer. The only additions needed are `aws_durable_execution_sdk_python` in `requirements.txt` and the `@durable_execution` / `@durable_step` decorators in the new handler. @@ -247,19 +246,13 @@ Existing functions are untouched. The new durable function shares the same Layer | Per-LLM / per-tool checkpointing | ❌ | ❌ | ❌ | | Step-level visibility | ❌ | ❌ | ❌ | -Both current patterns solve the execution ceiling problem. Neither solves mid-loop durability. +None solve mid-loop durability. --- -## 2. Gap If We Want Native Integration +## 2. 
Gaps If We Want Native Integration -The key takeaway is both providers need the handler to own the loop. They cannot checkpoint what they cannot see. If the agent loop is hidden inside our SDK, these platforms have no way to hook into individual steps. - -**Temporal needs:** Each LLM call and each tool call as a separate Activity. Temporal records each Activity result permanently. On crash, completed Activities are replayed from history and the loop resumes at the step that was interrupted. - -**Lambda Durable needs:** Each LLM call and each tool call wrapped in its own `context.step()`. AWS records each step result. On Lambda interruption, completed steps are skipped and the loop resumes at the step that was interrupted. - -Both require the handler to control the loop: +The provider owns the loop and checkpoints each LLM call and tool call individually. This is how Temporal AI Agent and Dapr DurableAgent work natively. It's what we need to build into Strands. The core requirement is the same across all three providers: the handler must control the loop so it can insert a checkpoint between every step. ``` What both need: What the current SDK gives: @@ -272,7 +265,7 @@ What both need: What the current SDK gives: The current SDK owns the loop inside `invoke_async`. There is no mechanism for external code to inject a checkpoint between iterations. The event loop cannot call back into the durable platform between steps. -This is the single root cause of all current integration limitations, and it surfaces as three gaps: +This is the single root cause of all current integration limitations, and it surfaces as **three gaps**: **Gap 1. Event loop does not use `invoke_callbacks_async` at step fire points** @@ -292,17 +285,150 @@ There is no way to tell `Agent` to dispatch its loop to an external platform. A The solution opens the event loop so that durable platforms can observe and checkpoint each step, without changing the existing `agent("prompt")` call signature. 
### Why `dispatch_async`?

`agent("prompt")` blocks the caller until the full agent loop finishes. That works fine in-process, but when the loop runs on a remote platform (Temporal worker, Dapr sidecar, Lambda), the execution can take far longer than a caller should reasonably block.

```python
# ── With durable_backend: still blocking (backward compatible) ───
agent = Agent(tools=[...], durable_backend=TemporalBackend(...))
result = agent("prompt")  # dispatches to Temporal, blocks until done

# ── With dispatch_async: non-blocking ──────────────────────────── 
handle = await agent.dispatch_async("prompt")

# handle is an ExecutionHandle — a reference to a running execution
print(handle.execution_id)  # "temporal-wf-abc123" or "lambda-exec-xyz"

# Do other work...

# When we need the result:
result = await handle.result()  # blocks until execution finishes
```

### Why the Telemetry Pattern Doesn't Work

A natural question: why can't we bolt this on from the outside, the way OpenTelemetry tracing wraps calls? Because tracing only observes, while durable execution must control:

```
OTel (works from anywhere):

  ┌─ Your Python process ──────────────────────────────┐
  │                                                    │
  │  with trace_api.use_span(span):   ← inline wrap    │
  │      result = call_llm(...)                        │
  │                                                    │
  │  Span data → Exporter → Collector (Jaeger, etc.)   │
  │  (just data, fire-and-forget)                      │
  └────────────────────────────────────────────────────┘

  OTel only OBSERVES. It ships telemetry data.
  If the process crashes, the spans are lost too.
  That's fine — tracing is best-effort.


Temporal (must run inside Workflow sandbox):

  ┌─ Temporal Worker process ───────────────────────────┐
  │                                                     │
  │   ┌─ Workflow (deterministic sandbox) ──────────┐   │
  │   │                                             │   │
  │   │  result = await workflow.execute_activity(  │   │
  │   │      call_llm, ...                          │   │
  │   │  )                  ↑                       │   │
  │   │                     │                       │   │
  │   └─────────────────────┼───────────────────────┘   │
  │                         │                           │
  │   Temporal Server ◄─────┘  records Activity result  │
  │   (gRPC service)           in Event History         │
  │                            before returning         │
  │                            to Workflow              │
  └─────────────────────────────────────────────────────┘

  Temporal CONTROLS EXECUTION. It persists + replays.
  If the process crashes, the server still has the history.
  On restart, it replays completed Activities from history.
```

The key difference: OTel's exporter is fire-and-forget (just data going out), so it works from any Python process. Temporal's `workflow.execute_activity()` is a two-way contract: the Temporal Server must acknowledge the Activity result before the Workflow proceeds.

What does it look like?

```
 ┌─ Temporal Worker ──────────────────────────────────────────┐
 │                                                            │
 │  ┌─ Workflow sandbox ────────────────────────────────┐     │
 │  │                                                   │     │
 │  │  Strands event_loop.py (unchanged):               │     │
 │  │                                                   │     │
 │  │  while True:                                      │     │
 │  │    ┌─────────────────────────────────────────┐    │     │
 │  │    │ hook: on_before_model_call              │    │     │
 │  │    │  → workflow.execute_activity(call_llm)  │ ───┼──►Temporal Server
 │  │    │  ← result (checkpointed ✅)             │    │     │   (persists)
 │  │    └─────────────────────────────────────────┘    │     │
 │  │                                                   │     │
 │  │    ┌─────────────────────────────────────────┐    │     │
 │  │    │ hook: on_before_tool_call               │    │     │
 │  │    │  → workflow.execute_activity(call_tool) │ ───┼──►Temporal Server
 │  │    │  ← result (checkpointed ✅)             │    │     │   (persists)
 │  │    └─────────────────────────────────────────┘    │     │
 │  │                                                   │     │
 │  │  if done: break                                   │     │
 │  │                                                   │     │
 │  └───────────────────────────────────────────────────┘     │
 │                                                            │
 │  On crash: Temporal replays Workflow, hooks fire again,    │
 │  but execute_activity returns cached results for completed │
 │  Activities → loop resumes from where it stopped.          │
 └────────────────────────────────────────────────────────────┘
```

What about Lambda?

Lambda Durable's Python SDK is sync-only: `context.step()` is a blocking call, no `async def`, no `await`. So we can't call `context.step()` from inside an async Strands hook.
Instead, the Lambda handler replaces the Strands event loop entirely and runs its own sync loop with `context.step()` calls.

What we need is:

```python
@durable_execution
def handler(event: dict, context: DurableContext):
    """Lambda Durable owns the loop. Each LLM/tool call is a step."""
    spec = AgentSpec.from_dict(event["spec"])
    prompt = event["prompt"]
    messages = [{"role": "user", "content": prompt}]

    while True:
        # ✅ Checkpoint: LLM call is a durable step
        response = context.step(
            lambda _: call_llm(spec.model_id, messages, spec.tool_schemas),
            name=f"llm-call-{len(messages)}",
        )

        if not response["tool_calls"]:
            return response["text"]

        # ✅ Checkpoint: each tool call is a durable step
        for tool_call in response["tool_calls"]:
            tool_result = context.step(
                lambda _, tc=tool_call: call_tool(tc["name"], tc["input"]),
                name=f"tool-{tool_call['name']}-{len(messages)}",
            )
            messages.append({"role": "tool", "content": tool_result})
```

If Lambda Durable adds `async def` handler / `await context.step()` support in the future, it could switch to the hook approach too. 
Until then, loop replacement is the correct pattern + ### Architecture ``` ┌─────────────────────────────────────────────────────┐ -│ Agent (2.0) │ -│ │ -│ agent("prompt") → AgentResult (unchanged) │ -│ agent.dispatch_async() → ExecutionHandle (new) │ -│ │ -│ durable_backend: DurableBackend | None │ -└─────────────────┬───────────────────────────────────┘ +│ Agent (2.0) │ +│ │ +│ agent("prompt") → AgentResult (unchanged) │ +│ agent.dispatch_async() → ExecutionHandle (new) │ +│ │ +│ durable_backend: DurableBackend | None │ +└─────────────────┬────────────────────────────────────┘ │ ┌──────────┴───────────┐ │ │ @@ -313,9 +439,10 @@ The solution opens the event loop so that durable platforms can observe and chec or → ExecutionHandle (non-blocking) │ - ┌───────────────┴──────────────┐ - │ │ - TemporalBackend LambdaDurableBackend + ┌───────────────┼──────────────┐ + │ │ │ + TemporalBackend DaprBackend LambdaDurableBackend + (hook-based) (hook-based) (handler-owns-loop) ``` ### Core SDK Changes From 29200ed4676035ac7ddac6ca1039fd60d97b26d3 Mon Sep 17 00:00:00 2001 From: Jack Yuan Date: Thu, 5 Mar 2026 17:48:36 -0500 Subject: [PATCH 06/11] feat: update doc after POC --- team/DURABILITY_PROVIDER_INTEGRRATION.md | 494 ++++++++++++----------- 1 file changed, 257 insertions(+), 237 deletions(-) diff --git a/team/DURABILITY_PROVIDER_INTEGRRATION.md b/team/DURABILITY_PROVIDER_INTEGRRATION.md index 534a14772..98728d8c2 100644 --- a/team/DURABILITY_PROVIDER_INTEGRRATION.md +++ b/team/DURABILITY_PROVIDER_INTEGRRATION.md @@ -1,9 +1,9 @@ # Durable Execution Provider Integration **Status**: Proposed -**Date**: 2026-03-01 +**Date**: 2026-03-04 **Issue**: https://github.com/strands-agents/sdk-python/issues/1369 -**Target Release**: 2.0 +**Target Release**: TBD --- @@ -32,14 +32,18 @@ Today: Single invoke_async call → Everything lost ``` -This doc covers two providers: [Temporal](https://temporal.io/), [Dapr](https://dapr.io/) and [AWS Lambda Durable 
Execution](https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html). +This doc covers three providers: [Temporal](https://temporal.io/), [Dapr](https://dapr.io/) and [AWS Lambda Durable Execution](https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html). -### 1. How Durable Providers Orchestrate AI Agents +This doc covers three providers: [Temporal](https://temporal.io/), [Dapr](https://dapr.io/) and [AWS Lambda Durable Execution](https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html). -Before diving into integration, Let's go through the shared architecture these providers use and how existing agent frameworks build on them. +--- + +## 1. How Durable Providers Orchestrate AI Agents + +Before diving into integration, let's go through the shared architecture these providers use and how existing agent frameworks build on them. +Temporal, Dapr, and Lambda Durable all share the same recovery mechanism: record the result of each completed unit, skip it on replay, resume from where execution stopped. The **granularity** of those units is the key. -Temporal, Dapr, and Lambda Durable all share the same recovery mechanism: record the result of each completed unit, skip it on replay, resume from where execution stopped. Then the granularity is the key. Each provider names the pieces differently but the model is the same: | Concept | Temporal | Dapr | Lambda Durable | @@ -52,30 +56,28 @@ Each provider names the pieces differently but the model is the same: Here is what a 3-step agent loop looks like when the provider can see each step vs. when it cannot: ``` - Provider sees each step (native): Provider sees one black box (wrapper): ┌─ LLM call ──── ✅ saved ─┐ ┌─ agent("prompt") ── ? 
─┐ - │ │ │ │ + │ │ │ │ ├─ Tool 1 ──── ✅ saved ─┤ │ LLM call │ - │ │ │ Tool 1 │ + │ │ │ Tool 1 │ ├─ Tool 2 ──── 💥 CRASH │ │ Tool 2 💥 CRASH │ - │ │ │ │ + │ │ │ │ │ Resume here ─▶ Tool 2 │ │ Resume ─▶ LLM call │ - │ │ │ (start over) │ + │ │ │ (start over) │ ├─ LLM call ──── ✅ saved ─┤ └────────────────────────┘ - │ │ - └─ Done - + │ │ + └─ Done ``` -The left side is what Temporal AI Agent and Dapr's Durable Agent achieve. The right side is what happens if we wrap the Strands agent loop as a single unit. -### 2. How Those Providers Build Their First-class AI agent +The left side is what Temporal AI Agent and Dapr's Durable Agent achieve. The right side is what happens if we wrap the Strands agent loop as a single unit. +### How Those Providers Build Their First-class AI Agent -**Temporal AI Agent** [temporal-community/temporal-ai-agent](https://github.com/temporal-community/temporal-ai-agent) puts the agent loop *inside the Workflow*. Each LLM call and each tool call is dispatched as a separate Activity. The Workflow is deterministic — it just decides "call LLM next" or "call tool next." The Activities do the actual I/O. On crash, Temporal replays completed Activities from event history and the loop resumes mid-conversation. +**Temporal AI Agent** ([temporal-community/temporal-ai-agent](https://github.com/temporal-community/temporal-ai-agent)) puts the agent loop *inside the Workflow*. Each LLM call and each tool call is dispatched as a separate Activity. The Workflow is deterministic — it just decides "call LLM next" or "call tool next." The Activities do the actual I/O. On crash, Temporal replays completed Activities from event history and the loop resumes mid-conversation. -See example code, more to read (docs)[https://docs.temporal.io/ai-cookbook/agentic-loop-tool-call-claude-python] +See example code, more to read at [Temporal AI Cookbook](https://docs.temporal.io/ai-cookbook/agentic-loop-tool-call-claude-python). 
```python # Temporal: orchestrator owns the loop, each step is an Activity @@ -99,7 +101,8 @@ class AgentWorkflow: ) messages.append(tool_result(result)) ``` -**Dapr Agents** They offer both general agent and `DurableAgent` is workflow-backed, LLM call and tool execution uses a durable activity automatically, and the `AgentRunner` handles the workflow lifecycle. + +**Dapr Agents** offers both a general agent and `DurableAgent`. `DurableAgent` is workflow-backed, where LLM calls and tool execution use durable activities automatically. The `AgentRunner` handles the workflow lifecycle. ```python from dapr_agents.workflow.runners import AgentRunner @@ -129,14 +132,15 @@ async def main(): print(itinerary) finally: runner.shutdown(travel_planner) - ``` -Highlight the `memory` here, Dapr externalizes conversation state to a pluggable state store keyed by session_id. This is functionally equivalent to our SessionManager, after we introduce checkpoint-based snapshot, we will close this gap. +Highlight the `memory` here. Dapr externalizes conversation state to a pluggable state store keyed by session_id. This is functionally equivalent to our `SessionManager`. After we introduce checkpoint-based snapshot, we will close this gap. + +We will discuss AWS Durable in the next section since they all face the same granularity issue. -We will discuss AWS Durable in next section since they all face the same granularity issue. +--- -## 2. How We Can Integrate Them into Our SDK Today: Wrapper Pattern +## 2. Level 1: Wrap Whole Agent Invoke (Works Today, No SDK Changes) ### Temporal: Agent as Activity @@ -186,9 +190,9 @@ Temporal retries the activity if the worker crashes. `S3SessionManager` restores Let us skip Dapr general agent for now since it has the same pattern. 
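The resume-from-history behavior that the Level 1 wrapper depends on can be sketched without any Temporal or Strands dependency. This is an illustrative toy, not real SDK code: `SessionStore`, `run_agent_once`, and `run_with_retries` are hypothetical stand-ins for `S3SessionManager`, the activity body, and Temporal's `RetryPolicy`.

```python
# Toy illustration: why externalized session state lets a retried,
# whole-agent activity resume its conversation instead of starting cold.

class SessionStore:
    """Stands in for S3SessionManager: messages survive process death."""
    def __init__(self):
        self._sessions = {}

    def load(self, session_id):
        return list(self._sessions.get(session_id, []))

    def save(self, session_id, messages):
        self._sessions[session_id] = list(messages)


def run_agent_once(store, session_id, prompt, fail_after_first_turn):
    """Stands in for the activity body wrapping agent("prompt")."""
    messages = store.load(session_id)             # restore history on retry
    user_turn = {"role": "user", "content": prompt}
    if not messages or messages[-1] != user_turn:
        messages.append(user_turn)
    store.save(session_id, messages)              # persisted before the "crash"
    if fail_after_first_turn:
        raise RuntimeError("worker crashed")      # the platform retries the call
    messages.append({"role": "assistant", "content": "done"})
    store.save(session_id, messages)
    return messages


def run_with_retries(store, session_id, prompt, max_attempts=3):
    """Stands in for Temporal's RetryPolicy around the activity."""
    for attempt in range(max_attempts):
        try:
            return run_agent_once(store, session_id, prompt,
                                  fail_after_first_turn=(attempt == 0))
        except RuntimeError:
            continue
    raise RuntimeError("retries exhausted")
```

The point of the sketch: because history is persisted outside the process, the second attempt starts from the first attempt's messages instead of an empty conversation, which is what makes the whole-agent-as-activity pattern safe to retry.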
---- +### AWS Lambda Durable: Agent as Durable Step -### AWS Lambda Durable: Agent as Durable Step [Diagram and docs](https://docs.aws.amazon.com/lambda/latest/dg/durable-functions.html) +[Diagram and docs](https://docs.aws.amazon.com/lambda/latest/dg/durable-functions.html) AWS Lambda Durable Function supports wrapping any callable as a `@durable_step`. We can wrap `agent("prompt")` as a single step today with no SDK changes required. @@ -227,14 +231,7 @@ Lambda Durable sees: [LLM]─[Tool 1]─[Tool 2]─[CRASH] <-- Step retries → Tool 1 and Tool 2 re-execute ``` - -[Lambda Durable](https://github.com/aws/aws-durable-execution-sdk-python) is sync only [issue](https://github.com/aws/aws-durable-execution-sdk-python/issues/316), and it can only be enabled on new functions. The migration path is to add the SDK to the existing Layer build and deploy a new function that references that Layer. - -Existing functions are untouched. The new durable function shares the same Layer. The only additions needed are `aws_durable_execution_sdk_python` in `requirements.txt` and the `@durable_execution` / `@durable_step` decorators in the new handler. - ---- - -### Pattern Comparison +### Level 1 Summary | | Temporal | Dapr | AWS Lambda Durable | |---|---|---|---| @@ -250,279 +247,302 @@ None solve mid-loop durability. --- -## 2. Gaps If We Want Native Integration +## 3. Level 2: Native Integration (Requires SDK Changes) -The provider owns the loop and checkpoints each LLM call and tool call individually. This is how Temporal AI Agent and Dapr DurableAgent work natively. It's what we need to build into Strands. The core requirement is the same across all three providers: the handler must control the loop so it can insert a checkpoint between every step. +Instead of handing our loop to the durable provider, we keep `agent.invoke()` and wrap the individual I/O calls (LLM, tools) with the platform's checkpoint primitive. 
The user sets up the durable infrastructure (Temporal Worker, Lambda Durable function, etc.) and passes context into our SDK. Our SDK wraps each call so the provider can checkpoint it. ``` -What both need: What the current SDK gives: +Level 1 (wrapper): Level 2 (native): - while True: context.step( - llm = checkpoint(call_llm) agent("prompt") ← one black box - tool = checkpoint(call_tool) ) - if done: break + checkpoint( while True: + agent("prompt") ← one box wrappedCallModel() ← checkpointed + ) wrappedCallTools() ← checkpointed + if done: break + + Our loop is inside their checkpoint. Our loop stays ours. + They can't see individual steps. They checkpoint each step. ``` -The current SDK owns the loop inside `invoke_async`. There is no mechanism for external code to inject a checkpoint between iterations. The event loop cannot call back into the durable platform between steps. +### Current Gaps + +Two things are needed in our SDK to make this work: -This is the single root cause of all current integration limitations, and it surfaces as **three gaps**: +**Gap 1. Event loop async hooks** (easy fix) -**Gap 1. Event loop does not use `invoke_callbacks_async` at step fire points** +`HookRegistry` already ships a fully working `invoke_callbacks_async()` method. The event loop just needs two call sites updated from `invoke_callbacks()` to `await invoke_callbacks_async()`. One-line change per site, no impact on existing sync hooks. After this fix, async checkpoint hooks fire mid-loop, which is what the durability wrappers need. -`HookRegistry` already ships a fully working `invoke_callbacks_async()` method. The gap is that `event_loop.py` does not yet call it at `AfterToolCallEvent` and `AfterModelCallEvent`. Those two call sites still use the sync `invoke_callbacks()`, which raises a `RuntimeError` at runtime if an async callback is registered. Until those call sites are updated, there is no way to write an async checkpoint hook that actually fires mid-loop. +**Gap 2. 
No `Durability` abstraction on `Agent`** -**Gap 2. No serializable agent configuration** +There is no way to tell `Agent` to wrap its I/O calls with a durable provider's checkpoint primitive. A `durability` parameter that wraps `callModel` and `callTools` is needed. -When a Temporal worker or a Lambda Durable handler replays the loop on a new process, it needs to reconstruct the agent from scratch each iteration. The current stateful agent accumulates `self.messages` and cannot be reconstructed from configuration alone. This will be addressed by the stateless agent proposal (@Patrick). +Hooks alone cannot fill this gap. `BeforeModelCallEvent` and `AfterModelCallEvent` are notification-only — the only writable field on `AfterModelCallEvent` is `retry`. There is no way to inject a cached result or skip the actual model call from a hook. The event loop calls `stream_messages` unconditionally after `BeforeModelCallEvent` fires regardless of what any hook does. Compare this to `AfterToolCallEvent`, which has a writable `result` field — tools could theoretically be intercepted via hooks today, but model calls cannot. -**Gap 3. No `DurableBackend` abstraction on `Agent`** +This means the `Durability` abstraction must be wired directly into the event loop's call sites for `stream_messages` and tool execution, not layered on top via hooks. -There is no way to tell `Agent` to dispatch its loop to an external platform. A `durable_backend` parameter and an async dispatch method are needed to make the integration work. +Once both gaps are closed, the proposed solution below becomes possible. --- -## 3. Proposed Solution +## 4. Proposed Solution -The solution opens the event loop so that durable platforms can observe and checkpoint each step, without changing the existing `agent("prompt")` call signature. +Below is the diagram showing how this will look like in Strands Agent. 
More to read about how Temporal skips completed activities: [Event History](https://docs.temporal.io/workflow-execution/event). -### Why `dispatch_async`? -agent("prompt") blocks the caller until the full agent loop finishes. That works fine in-process. But when the loop runs on a remote platform (Temporal worker, Dapr sidecar, Lambda), the execution could take longer than expectation. +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Temporal Worker (your process) │ +│ │ +│ ┌─ Workflow (deterministic sandbox) ──────────────────────────────────┐ │ +│ │ │ │ +│ │ Strands agent loop (via Durability wiring): │ │ +│ │ │ │ +│ │ while True: │ │ +│ │ │ │ +│ │ ① durability.wrap_model_call (event loop call site) │ │ +│ │ └─ workflow.execute_activity(call_llm, args=[xxx]) ────┼───┼──► Temporal Server +│ │ │ │ ┌──────────────┐ +│ │ ◄─ LLM response (replayed from event history on crash ✅) ─┼───┼──◄ │ Event History│ +│ │ Note: hooks cannot do this — BeforeModelCallEvent has no │ │ │ │ +│ │ result injection; stream_messages fires unconditionally. │ │ │ [call_llm]✅ │ +│ │ │ │ │ │ +│ │ ② durability.wrap_tool_call (event loop call site) │ │ │ [call_tool]✅│ +│ │ └─ workflow.execute_activity(call_tool, args=[tool_input]) ──┼───┼──► │ [call_llm] ✅│ +│ │ │ │ │ ... │ +│ │ ◄─ tool result (replayed from event history on crash ✅) ─┼───┼──◄ └──────────────┘ +│ │ │ │ +│ │ if stop_reason == "end_turn": break │ │ +│ │ │ │ +│ └──────────────────────────────────────────────────────────────────────┘ │ +│ │ +│ On crash → Worker restarts → Temporal replays Workflow from event history │ +│ Completed Activities return cached results, skipping re-execution│ +│ Loop resumes exactly where it crashed │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### How Users Will Use This + +The user owns the Durable Workflow (Temporal or Dapr Workflow, or Lambda Durable handler). 
They instantiate our Agent inside it, passing a `Durability` object that knows how to wrap functions for that platform. Our event loop runs as normal, but `callModel` and `callTools` are wrapped versions in runtime that checkpoint through the platform. + +Let's imagine how a user uses this pattern: ```python -# ── With durable_backend: still blocking (backward compatible) ─── -agent = Agent(tools=[...], durable_backend=TemporalBackend(...)) -result = agent("prompt") # dispatches to Temporal, blocks until done +# ─── User's code (they own the Temporal setup) ────────────────── -# ── With dispatch_async: non-blocking ──────────────────────────── -handle = await agent.dispatch_async("prompt") +from temporalio import workflow +from temporalio.client import Client +from strands import Agent +from strands_temporal import TemporalDurability -# handle is an ExecutionHandle — a reference to a running execution -print(handle.execution_id) # "temporal-wf-abc123" or "lambda-exec-xyz" -# Do other work... +async def main(): + client = await Client.connect("localhost:7233") + await client.execute_workflow( + MyAgentWorkflow.run, + args=["Plan a trip to Paris"], + id="trip-planner", + task_queue="strands-agents", + ) + -# When we need the result: -result = await handle.result() # blocks until execution finishes - +@workflow.defn +class MyAgentWorkflow: + """User defines this. They control the Workflow shape.""" + @workflow.run + async def run(self, prompt: str) -> str: + # User creates Agent with our durability wrapper. + # TemporalDurability receives the workflow context so it can + # call workflow.execute_activity() to wrap I/O calls. 
+ agent = Agent( + tools=[search_flights, book_hotel], + durability=TemporalDurability(model_id="us.anthropic.claude-sonnet-4-5"), # ← wraps callModel/callTools + ) + result = await agent.invoke_async(prompt) + return str(result.message) ``` -### Why Telemetry pattern doesn't work +### The Code We Will Own -``` -OTel (works from anywhere): - - ┌─ Your Python process ──────────────────────────────┐ - │ │ - │ with trace_api.use_span(span): ← inline wrap │ - │ result = call_llm(...) │ - │ │ - │ Span data → Exporter → Collector (Jaeger, etc.) │ - │ (just data, fire-and-forget) │ - └─────────────────────────────────────────────────────┘ - - OTel only OBSERVES. It ships telemetry data. - If the process crashes, the spans are lost too. - That's fine — tracing is best-effort. - - -Temporal (must run inside Workflow sandbox): - - ┌─ Temporal Worker process ───────────────────────────┐ - │ │ - │ ┌─ Workflow (deterministic sandbox) ──────────┐ │ - │ │ │ │ - │ │ result = await workflow.execute_activity( │ │ - │ │ call_llm, ... │ │ - │ │ ) ↑ │ │ - │ │ │ │ │ - │ └──────────────────┼───────────────────────────┘ │ - │ │ │ - │ Temporal Server ◄─┘ records Activity result │ - │ (gRPC service) in Event History │ - │ before returning to Workflow │ - └────────────────────────────────────────────────────────┘ - - Temporal CONTROLS EXECUTION. It persists + replays. - If the process crashes, the server still has the history. - On restart, it replays completed Activities from history. -``` +```python +# ─── strands-temporal package ──────────────────────────────────── -The key difference: OTel's exporter is fire-and-forget (just data going out), so it works from any Python process. Temporal's workflow.execute_activity() is a two-way contract,the Temporal Server must acknowledge the Activity result before the Workflow proceeds. +from temporalio import workflow, activity +from strands.agent.durability import Durability -What it looks like ? 
-``` +class TemporalDurability(Durability): + """Wraps I/O calls as Temporal Activities.""" - ┌─ Temporal Worker ──────────────────────────────────────────┐ - │ │ - │ ┌─ Workflow sandbox ────────────────────────────────┐ │ - │ │ │ │ - │ │ Strands event_loop.py (unchanged): │ │ - │ │ │ │ - │ │ while True: │ │ - │ │ ┌─────────────────────────────────────────┐ │ │ - │ │ │ hook: on_before_model_call │ │ │ - │ │ │ → workflow.execute_activity(call_llm) │ ──┼──►Temporal Server - │ │ │ ← result (checkpointed ✅) │ │ (persists) - │ │ └─────────────────────────────────────────┘ │ │ - │ │ │ │ - │ │ ┌─────────────────────────────────────────┐ │ │ - │ │ │ hook: on_before_tool_call │ │ │ - │ │ │ → workflow.execute_activity(call_tool) │ ──┼──►Temporal Server - │ │ │ ← result (checkpointed ✅) │ │ (persists) - │ │ └─────────────────────────────────────────┘ │ │ - │ │ │ │ - │ │ if done: break │ │ - │ │ │ │ - │ └────────────────────────────────────────────────────┘ │ - │ │ - │ On crash: Temporal replays Workflow, hooks fire again, │ - │ but execute_activity returns cached results for completed │ - │ Activities → loop resumes from where it stopped. │ - └─────────────────────────────────────────────────────────────┘ - -``` + def __init__(self, model_id: str): + self.model_id = model_id -what about Lambda ? + def wrap_model_call(self, call_model_fn): + """Wrap the raw model call so it runs as a Temporal Activity.""" + # Activities must be pre-registered on the worker at startup. + # wrap_model_call returns a dispatcher that calls the registered activity by reference. 
+ model_id = self.model_id + async def wrapped(messages, system_prompt, tool_specs): + result = await workflow.execute_activity( + call_model_activity, # pre-registered on worker + args=[model_id, messages, system_prompt, tool_specs], + start_to_close_timeout=timedelta(minutes=5), + ) + return result["stop_reason"], result["message"] + return wrapped + + def wrap_tool_call(self, call_tool_fn): + """Wrap the raw tool call so it runs as a Temporal Activity.""" + async def wrapped(tool_name, tool_input, tool_use_id): + return await workflow.execute_activity( + call_tool_activity, # pre-registered on worker + args=[tool_name, tool_input, tool_use_id], + start_to_close_timeout=timedelta(minutes=10), + ) + return wrapped +``` -Lambda Durable's Python SDK is sync-only — context.step() is a blocking call, no async def, no await. So we can't call context.step() from inside an async Strands hook. Instead, the Lambda handler replaces the Strands event loop entirely and runs its own sync loop with context.step() calls: +### The Durability Class (Core SDK) -What we need is : ```python -@durable_execution -def handler(event: dict, context: DurableContext): - """Lambda Durable owns the loop. Each LLM/tool call is a step.""" - spec = AgentSpec.from_dict(event["spec"]) - prompt = event["prompt"] - messages = [{"role": "user", "content": prompt}] - - while True: - # ✅ Checkpoint: LLM call is a durable step - response = context.step( - lambda _: call_llm(spec.model_id, messages, spec.tool_schemas), - name=f"llm-call-{len(messages)}", - ) +# ─── strands/agent/durability.py ───────────────────────────────── - if not response["tool_calls"]: - return response["text"] +class Durability: + """Base class. 
Each provider implements wrap_model_call / wrap_tool_call.""" - # ✅ Checkpoint: each tool call is a durable step - for tool_call in response["tool_calls"]: - tool_result = context.step( - lambda _, tc=tool_call: call_tool(tc["name"], tc["input"]), - name=f"tool-{tc['name']}-{len(messages)}", - ) - messages.append({"role": "tool", "content": tool_result}) + def wrap_model_call(self, call_model_fn): + """Override to wrap the model call with checkpointing.""" + return call_model_fn # default: no wrapping + + def wrap_tool_call(self, call_tool_fn): + """Override to wrap the tool call with checkpointing.""" + return call_tool_fn # default: no wrapping ``` -If Lambda Durable adds async def handler / await context.step() support in the future, it could switch to the hook approach too. Until then, loop replacement is the correct pattern +### How Agent Uses It -### Architecture +```python +# ─── Inside Agent / event_loop.py (simplified) ────────────────── -``` -┌─────────────────────────────────────────────────────┐ -│ Agent (2.0) │ -│ │ -│ agent("prompt") → AgentResult (unchanged) │ -│ agent.dispatch_async() → ExecutionHandle (new) │ -│ │ -│ durable_backend: DurableBackend | None │ -└─────────────────┬────────────────────────────────────┘ - │ - ┌──────────┴───────────┐ - │ │ - backend = None backend set - │ │ - in-process loop dispatch to platform - (unchanged) → block → AgentResult - or - → ExecutionHandle (non-blocking) - │ - ┌───────────────┼──────────────┐ - │ │ │ - TemporalBackend DaprBackend LambdaDurableBackend - (hook-based) (hook-based) (handler-owns-loop) -``` +class Agent: + def __init__(self, tools, durability=None, ...): + self.durability = durability or Durability() # no-op default + ... -### Core SDK Changes + async def invoke_async(self, prompt: str) -> AgentResult: + # Same loop as today. The only difference: + # callModel and callTools are wrapped if durability is set. -**1. 
Wire `invoke_callbacks_async` into the event loop at step fire points.** + wrapped_call_model = self.durability.wrap_model_call(self._call_model) + wrapped_call_tools = self.durability.wrap_tool_call(self._call_tools) -`HookRegistry.invoke_callbacks_async()` already exists and already handles both sync and async callbacks correctly. The only change needed is in `event_loop.py`: replace the two `invoke_callbacks()` call sites at `AfterToolCallEvent` and `AfterModelCallEvent` with `await invoke_callbacks_async()`. This is a one-line change per call site and has no impact on existing sync-only hooks. + messages = [{"role": "user", "content": prompt}] -```python -# strands/hooks/registry.py -class HookRegistry: - async def invoke_callbacks_async(self, event: HookEvent) -> None: - for callback in self._callbacks.get(type(event), []): - await callback(event) -``` + while True: + response = await wrapped_call_model(messages, self.system_prompt, self.tool_specs) -**2. `AgentSpec`:** A frozen, JSON-safe dataclass (`model_id`, `tool_names`, `tool_schemas`, `system_prompt`, etc.) built via `Agent._build_spec()` at dispatch time. This is the only thing that crosses the process boundary to a remote worker or Lambda handler, never a live `Agent` object. + if not response.tool_calls: + return AgentResult(message=response.text) -```python -@dataclass(frozen=True) -class AgentSpec: - model_id: str - system_prompt: str - tool_names: list[str] - tool_schemas: dict[str, dict] - session_id: str + for tool_call in response.tool_calls: + tool_result = await wrapped_call_tools(tool_call.name, tool_call.input) + messages.append({"role": "tool", "content": tool_result}) ``` -**3. `DurableBackend` + `ExecutionHandle`:** Two ABCs in a new `strands.agent.backends` module. `DurableBackend.dispatch(spec, prompt)` returns an `ExecutionHandle`. The actual implementations live in `strands-temporal` and `strands-aws` as separate packages, so the core SDK has no runtime dependency on either. 
+### What Happens on Crash and Replay -```python -# strands/agent/backends.py -class ExecutionHandle(ABC): - async def result(self) -> AgentResult: ... - -class DurableBackend(ABC): - async def dispatch(self, spec: AgentSpec, prompt: str) -> ExecutionHandle: ... ``` +Step 1: Workflow starts, loop begins +Step 2: wrapped_call_model() → Temporal records ActivityTaskCompleted ✅ +Step 3: wrapped_call_tools("search_flights") → Temporal records ActivityTaskCompleted ✅ +Step 4: wrapped_call_model() → 💥 Worker crashes mid-Activity + +─── Worker restarts, Temporal replays the Workflow ─── + +Step 1: Workflow starts, loop begins (Workflow code runs from top) +Step 2: wrapped_call_model() → Temporal sees ActivityTaskCompleted in history + → returns cached result instantly, NO re-execution +Step 3: wrapped_call_tools() → Temporal sees ActivityTaskCompleted in history + → returns cached result instantly, NO re-execution +Step 4: wrapped_call_model() → No history for this → executes the Activity for real +Step 5: continues from here... +``` + +The same applies to Lambda Durable's `context.step()`. Completed steps return their cached results on replay. But Lambda Durable does not support `async` today ([tracking issue](https://github.com/aws/aws-durable-execution-sdk-python/issues/316)). + +So there will be two options: -**4. `durable_backend` on `Agent`:** A single optional constructor parameter (default `None`). When set, `invoke_async` delegates to the backend and still returns `AgentResult` as before. A new `dispatch_async()` method returns an `ExecutionHandle` for callers that want non-blocking control. 
+**Option A: Sync wrapper (works today)** ```python -agent = Agent(tools=[...], durable_backend=LambdaDurableBackend()) +class LambdaDurability(Durability): + def __init__(self, context: DurableContext): + self.context = context + + def wrap_model_call(self, call_model_fn): + def wrapped(messages, system_prompt, tool_specs): + return self.context.step( + lambda _: call_model_fn(messages, system_prompt, tool_specs), + name=f"call-model-{len(messages)}", + ) + return wrapped -# Blocking — same call signature as today -result = await agent.invoke_async("prompt") + def wrap_tool_call(self, call_tool_fn): + def wrapped(tool_name, tool_input, tool_use_id): + return self.context.step( + lambda _: call_tool_fn(tool_name, tool_input, tool_use_id), + name=f"call-tool-{tool_name}", + ) + return wrapped +``` -# Non-blocking — get a handle and await later -handle = await agent.dispatch_async("prompt") -result = await handle.result() +```python +# User's Lambda handler +@durable_execution +def handler(event: dict, context: DurableContext): + agent = Agent( + tools=[search_flights, book_hotel], + durability=LambdaDurability(context), + ) + result = agent(event["prompt"]) # sync call + return {"result": str(result.message)} ``` +**Option B: Wait for async Lambda Durable support** -Both `strands-temporal` and `strands-aws` share the same goal: checkpoint after every LLM call and every tool call so a crash at any point resumes from the last completed step rather than from the beginning. The mechanism differs per platform (Temporal uses `@activity.defn`, Lambda Durable uses `context.step()`) but the contract is identical. Each agent loop iteration is two checkpointed units: one for the LLM call, one for the tool call. On replay, completed units are skipped and the loop continues from where it stopped. +If Lambda Durable adds `async def` handler / `await context.step()` in the future, the integration becomes identical to the Temporal pattern. 
-### Proposal +Lambda Durable can only be enabled on new functions. We cannot add durable configuration to an existing function after creation. The migration path is to deploy a new durable-enabled function alongside existing ones. The new function can share the same Lambda Layer; only `aws-durable-execution-sdk-python` and the decorators are added. -| Gap | Fix | Who | -|---|---|---| -| Event loop calls sync `invoke_callbacks` at step fire points | Replace two call sites in `event_loop.py` with `await invoke_callbacks_async()` | Core SDK | -| No serializable config | `AgentSpec` frozen dataclass | Core SDK | -| No extension point on `Agent` | `durable_backend` param + `dispatch_async()` | Core SDK | -| No remote tool resolution | `ToolRegistry.resolve(name)` + `all_registered()` | Core SDK | -| No Temporal worker | `StrandsWorkflow`, `create_strands_worker()` | `strands-temporal` package | -| No Lambda Durable handler | `@durable_execution` handler + `@durable_step` wrappers | `strands-aws` package | +### What Each Side Owns +``` +┌─────────────────────────────────┐ ┌──────────────────────────────────┐ +│ User owns │ │ We own (Strands SDK) │ +│ │ │ │ +│ • Temporal/Dapr/Lambda setup │ │ • Agent class │ +│ • Worker / Workflow definition │ │ • Event loop (invoke_async) │ +│ • Infrastructure (containers, │ │ • Durability ABC │ +│ task queues, state stores) │ │ • Model + tool call wrapping │ +│ • Passing durability into │ │ • Provider packages: │ +│ Agent constructor │ │ strands-temporal │ +│ │ │ strands-dapr │ +│ │ │ strands-aws │ +└─────────────────────────────────┘ └──────────────────────────────────┘ +``` --- ## Action Items -1. Aws Durable Lambda as entry point -2. Update event_loop.py to call invoke_callbacks_async() -3. Add AgentSpec frozen dataclass ( Proposed by @Patrick) -4. Add durable_backend param and dispatch_async() to Agent -5. Implement provider packages - +1. Fix async hooks in `event_loop.py` (easy, two one-line changes) +2. 
Add `Durability` base class in `strands/agent/durability.py` +3. Add `durability` param to `Agent`, apply wrapping in `invoke_async` +4. Implement `strands-aws` package with `LambdaDurability` (start here) +5. Implement `strands-temporal` package with `TemporalDurability` +6. Implement `strands-dapr` package with `DaprDurability` ## Willingness to Implement -TBD. Maybe start AWS durable first. +TBD. Start with `strands-aws` (Lambda Durable) since the sync wrapper pattern is simplest to validate. ---- \ No newline at end of file +--- From a6b7b250bb7b591339078f69463aa53eaa436b11 Mon Sep 17 00:00:00 2001 From: Jack Yuan Date: Thu, 5 Mar 2026 19:25:58 -0500 Subject: [PATCH 07/11] Call out limitation in this doc --- team/DURABILITY_PROVIDER_INTEGRRATION.md | 67 ++++++++++++++++++++++++ 1 file changed, 67 insertions(+) diff --git a/team/DURABILITY_PROVIDER_INTEGRRATION.md b/team/DURABILITY_PROVIDER_INTEGRRATION.md index 98728d8c2..1b511b6ff 100644 --- a/team/DURABILITY_PROVIDER_INTEGRRATION.md +++ b/team/DURABILITY_PROVIDER_INTEGRRATION.md @@ -530,8 +530,75 @@ Lambda Durable can only be enabled on new functions. We cannot add durable confi └─────────────────────────────────┘ └──────────────────────────────────┘ ``` +## **Known Gaps & Open Questions** + +**1. Model instantiation per activity call** + +In the PoC, `call_model_activity` constructs a new `BedrockModel(model_id=model_id)` on every invocation. This is intentional — model objects hold boto3 clients and cannot be serialized across the activity boundary. + +The cleaner design your senior proposed is a `TemporalModelProvider` that the user subclasses. The provider knows how to reconstruct the model from serializable config inside the activity, and also knows when it is running inside a workflow (dispatch to activity) vs. outside (call model directly): + +```python +class TemporalModelProvider: + def stream_data(self, ...): + if in_workflow_process: + start_activity(...) 
# dispatch to Temporal activity
+        else:
+            model = self.create_model(...)
+            model.stream(...)
+
+    def create_model(self, params):
+        ...  # subclass implements this
+
+
+class MyTemporalModelProvider(TemporalModelProvider):
+    temperature: float = 0.7
+
+    def create_model(self, serialized):
+        from strands.models.bedrock import BedrockModel
+        return BedrockModel(temperature=serialized.temperature)
+
+# This DevX is subject to change
+agent = Agent(
+    model=MyTemporalModelProvider(),
+    durability=TemporalDurability(),
+)
+```
+
+This keeps model config serializable (plain fields on the provider subclass) and lets users bring any model provider, not just Bedrock. The exact interface for `TemporalModelProvider` is a design decision we need to finalize.
+
+**2. `AgentState` is not deterministic across replay**
+
+Tools can write to `agent.state` during execution. On Temporal replay, tool calls return cached results (the tool function does not re-execute), but `agent.state` mutations inside the tool *do* re-execute because they happen in the workflow, not the activity. If a tool's state update depends on something non-deterministic (timestamp, random value, external read), the state after replay may differ from the original run.
+
+This is a correctness risk. Three options are on the table: document it as a constraint (tools that write to `agent.state` must be deterministic), checkpoint `agent.state` as part of the activity result and restore it on replay, or treat `agent.state` as out-of-scope for durable execution in v1. Needs a decision before we ship.
+
+**3. Human-in-the-loop is different**
+
+Strands has its own `Interrupt` / `InterruptException` mechanism for pausing the agent and waiting for human input. In Temporal, the correct pattern for human-in-the-loop is a Signal. A user who wants human-in-the-loop with Temporal durability can't use Strands' `Interrupt`; they need to use Temporal Signals directly.
+
+**4. Streaming callbacks are meaningless during replay**
+
+During Temporal replay, the model activity returns a cached result instantly: no stream, no tokens. Any UI or logging built on streaming callbacks will see nothing on replay. This isn't a correctness issue, but it's a confusing UX gap worth calling out.
+
+**5. MCP Limitation**
+
+The Temporal agent samples include an MCP example. After taking a closer look, I found that MCP works, but Strands' `MCPClient` integration pattern doesn't map directly. `MCPClient` is designed to be constructed once and passed as a tool to `Agent`. In a durable context, the MCP connection must live inside the activity worker and be managed there (not constructed in the workflow and passed in). The user can't just do `Agent(tools=[my_mcp_client], durability=...)` and have it work, because `MCPClient` holds a live background thread that can't cross the activity boundary.
+
+**6. Effort estimate and comparison with Lambda Durable**
+
+Lambda Durable (Level 1, no SDK changes) can be validated in ~1 week: wrap `agent(prompt)` as a `@durable_step`, confirm crash recovery at the invocation level. The limitation is mid-loop granularity, but it's a real working integration.
+
+Level 2 (native Temporal/Dapr) is significantly more work:
+
+| | Lambda Durable Level 1 | Temporal Level 2 |
+|---|---|---|
+| SDK changes needed | None | ~4 (hooks, `Durability` class, `Agent` wiring, event loop) |
+| New packages | None | `strands-temporal`, `strands-aws`, `strands-dapr` |
+| Mid-loop crash recovery | ❌ | ✅ |
+| Estimated effort | ~1 week | ~4 weeks |
+
+
 ---
+
 ## Action Items
 1. 
Fix async hooks in `event_loop.py` (easy, two one-line changes) From 3d5879b6204497d7a3f9ebd8ea11540827bcac38 Mon Sep 17 00:00:00 2001 From: Jack Yuan Date: Thu, 5 Mar 2026 19:29:26 -0500 Subject: [PATCH 08/11] fix: fix wording --- team/DURABILITY_PROVIDER_INTEGRRATION.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/team/DURABILITY_PROVIDER_INTEGRRATION.md b/team/DURABILITY_PROVIDER_INTEGRRATION.md index 1b511b6ff..f50379f72 100644 --- a/team/DURABILITY_PROVIDER_INTEGRRATION.md +++ b/team/DURABILITY_PROVIDER_INTEGRRATION.md @@ -273,11 +273,11 @@ Two things are needed in our SDK to make this work: **Gap 2. No `Durability` abstraction on `Agent`** -There is no way to tell `Agent` to wrap its I/O calls with a durable provider's checkpoint primitive. A `durability` parameter that wraps `callModel` and `callTools` is needed. +`Agent` has no way to wrap its I/O calls with a durable provider's checkpoint primitive today. We need a `durability` parameter that intercepts `callModel` and `callTools` before they execute. -Hooks alone cannot fill this gap. `BeforeModelCallEvent` and `AfterModelCallEvent` are notification-only — the only writable field on `AfterModelCallEvent` is `retry`. There is no way to inject a cached result or skip the actual model call from a hook. The event loop calls `stream_messages` unconditionally after `BeforeModelCallEvent` fires regardless of what any hook does. Compare this to `AfterToolCallEvent`, which has a writable `result` field — tools could theoretically be intercepted via hooks today, but model calls cannot. +Hooks won't work here. `BeforeModelCallEvent` and `AfterModelCallEvent` are notification-only — the only writable field on `AfterModelCallEvent` is `retry`. The event loop calls `stream_messages` unconditionally after `BeforeModelCallEvent` fires, so there's no way to inject a cached result or skip the actual model call from a hook. 
`AfterToolCallEvent` does have a writable `result` field, so tools could theoretically be intercepted today, but model calls cannot.
 
-This means the `Durability` abstraction must be wired directly into the event loop's call sites for `stream_messages` and tool execution, not layered on top via hooks.
+The `Durability` abstraction needs to be wired directly into the event loop's call sites for `stream_messages` and tool execution. Hooks can't get there.
 
 Once both gaps are closed, the proposed solution below becomes possible.

From fa236307b59fe4fa7a03142764fc5c5ee3fae0ba Mon Sep 17 00:00:00 2001
From: Jack Yuan
Date: Thu, 5 Mar 2026 19:55:48 -0500
Subject: [PATCH 09/11] fix: update the doc

---
 team/DURABILITY_PROVIDER_INTEGRRATION.md | 28 +++++++++++++++---------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/team/DURABILITY_PROVIDER_INTEGRRATION.md b/team/DURABILITY_PROVIDER_INTEGRRATION.md
index f50379f72..c64a6051b 100644
--- a/team/DURABILITY_PROVIDER_INTEGRRATION.md
+++ b/team/DURABILITY_PROVIDER_INTEGRRATION.md
@@ -38,6 +38,8 @@ This doc covers three providers: [Temporal](https://temporal.io/), [Dapr](https:
 
 ---
 
+**Note**: All APIs and pseudocode I propose in this doc are for demo purposes; they are subject to change.
+
 ## 1. How Durable Providers Orchestrate AI Agents
 
 Before diving into integration, let's go through the shared architecture these providers use and how existing agent frameworks build on them.
 
@@ -350,12 +352,10 @@ class MyAgentWorkflow:
 
     @workflow.run
     async def run(self, prompt: str) -> str:
-        # User creates Agent with our durability wrapper.
-        # TemporalDurability receives the workflow context so it can
-        # call workflow.execute_activity() to wrap I/O calls.
agent = Agent( + model=MyTemporalModelProvider(), # ← serializable config, reconstructed inside activity tools=[search_flights, book_hotel], - durability=TemporalDurability(model_id="us.anthropic.claude-sonnet-4-5"), # ← wraps callModel/callTools + durability=TemporalDurability(), ) result = await agent.invoke_async(prompt) return str(result.message) @@ -401,7 +401,7 @@ class TemporalDurability(Durability): return wrapped ``` -### The Durability Class (Core SDK) +### The Durability Base Class (Core SDK) ```python # ─── strands/agent/durability.py ───────────────────────────────── @@ -444,7 +444,7 @@ class Agent: return AgentResult(message=response.text) for tool_call in response.tool_calls: - tool_result = await wrapped_call_tools(tool_call.name, tool_call.input) + tool_result = await wrapped_call_tools(tool_call.name, tool_call.input, tool_call.toolUseId) messages.append({"role": "tool", "content": tool_result}) ``` @@ -601,15 +601,23 @@ Level 2 (native Temporal/Dapr) is significantly more work: ## Action Items -1. Fix async hooks in `event_loop.py` (easy, two one-line changes) +1. Fix async hooks in `event_loop.py` (two one-line changes) + +After this, two open questions must be resolved before implementation starts: + +- Finalize `TemporalModelProvider` interface (Gap 1) +- Decide `AgentState` replay strategy (Gap 2) + +Once resolved: + 2. Add `Durability` base class in `strands/agent/durability.py` 3. Add `durability` param to `Agent`, apply wrapping in `invoke_async` -4. Implement `strands-aws` package with `LambdaDurability` (start here) -5. Implement `strands-temporal` package with `TemporalDurability` +4. Implement `strands-temporal` package with `TemporalDurability` and `TemporalModelProvider` +5. Implement `strands-aws` package with `LambdaDurability` 6. Implement `strands-dapr` package with `DaprDurability` ## Willingness to Implement -TBD. Start with `strands-aws` (Lambda Durable) since the sync wrapper pattern is simplest to validate. +TBD. 
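
Known Gap 2 (non-deterministic `AgentState` across replay) is easier to see with a small standalone simulation. Everything below is invented for illustration — `run_step`, the `history` cache, and the `clock` argument are not Temporal or Strands APIs — it only models the behavior the gap describes: tool results are served from workflow history on replay, while workflow-side state mutations re-execute.

```python
# Toy model of Temporal replay semantics -- NOT real Temporal or Strands code.
def run_step(state, clock, history=None):
    """One tool step. `clock` stands in for any non-deterministic input."""
    replaying = history is not None
    if not replaying:
        history = [{"rows_migrated": 100}]  # first run: the tool really executes
    result = history[0]                      # replay: cached result, tool is skipped
    state["last_migration"] = result         # deterministic: identical on both runs
    state["updated_at"] = clock()            # workflow-side mutation: re-executes on replay
    return state, history

# First run at t=1000; process dies; replay at t=2000.
first_state, history = run_step({}, clock=lambda: 1000)
replay_state, _ = run_step({}, clock=lambda: 2000, history=history)

assert first_state["last_migration"] == replay_state["last_migration"]  # cached
assert first_state["updated_at"] != replay_state["updated_at"]          # diverged
```

The checkpointing option from the list above would fold `updated_at` into the cached activity result, removing the divergence at the cost of a larger checkpoint payload.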
---

From 7537dd9cf92b90092ae09584fb6bb745d27d8 Mon Sep 17 00:00:00 2001
From: Jack Yuan
Date: Thu, 5 Mar 2026 20:01:14 -0500
Subject: [PATCH 10/11] update: add POC code

---
 team/DURABILITY_PROVIDER_INTEGRRATION.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/team/DURABILITY_PROVIDER_INTEGRRATION.md b/team/DURABILITY_PROVIDER_INTEGRRATION.md
index c64a6051b..f0e36c5b0 100644
--- a/team/DURABILITY_PROVIDER_INTEGRRATION.md
+++ b/team/DURABILITY_PROVIDER_INTEGRRATION.md
@@ -600,6 +600,7 @@ Level 2 (native Temporal/Dapr) is significantly more work:
 
 ## Action Items
 
+POC code: https://github.com/JackYPCOnline/sdk-python/tree/POC/real_sdk_durable
 
 1. Fix async hooks in `event_loop.py` (two one-line changes)
 
@@ -620,4 +621,4 @@ Once resolved:
 
 TBD.
 
----
+---
\ No newline at end of file

From 1592a7196c568651a817d376215e08d0966ec76d Mon Sep 17 00:00:00 2001
From: Jack Yuan <94985218+JackYPCOnline@users.noreply.github.com>
Date: Fri, 6 Mar 2026 13:25:49 -0500
Subject: [PATCH 11/11] Update DURABILITY_PROVIDER_INTEGRRATION.md

---
 team/DURABILITY_PROVIDER_INTEGRRATION.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/team/DURABILITY_PROVIDER_INTEGRRATION.md b/team/DURABILITY_PROVIDER_INTEGRRATION.md
index f0e36c5b0..98d059b9a 100644
--- a/team/DURABILITY_PROVIDER_INTEGRRATION.md
+++ b/team/DURABILITY_PROVIDER_INTEGRRATION.md
@@ -34,8 +34,6 @@ Today: Single invoke_async call
 
 This doc covers three providers: [Temporal](https://temporal.io/), [Dapr](https://dapr.io/) and [AWS Lambda Durable Execution](https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html).
 
-This doc covers three providers: [Temporal](https://temporal.io/), [Dapr](https://dapr.io/) and [AWS Lambda Durable Execution](https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html).
-
 ---
 
 **Note**: All APIs and pseudocode I propose in this doc are for demo purposes; they are subject to change.
@@ -536,7 +534,7 @@ Lambda Durable can only be enabled on new functions. We cannot add durable confi
 
 In the PoC, `call_model_activity` constructs a new `BedrockModel(model_id=model_id)` on every invocation. This is intentional — model objects hold boto3 clients and cannot be serialized across the activity boundary.
 
-The cleaner design your senior proposed is a `TemporalModelProvider` that the user subclasses. The provider knows how to reconstruct the model from serializable config inside the activity, and also knows when it is running inside a workflow (dispatch to activity) vs. outside (call model directly):
+A cleaner design is a `TemporalModelProvider` that the user subclasses. The provider knows how to reconstruct the model from serializable config inside the activity, and also knows when it is running inside a workflow (dispatch to activity) vs. outside (call model directly):
 
 ```python
 class TemporalModelProvider:
@@ -621,4 +619,4 @@ Once resolved:
 
 TBD.
 
----
\ No newline at end of file
+---
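
Appendix: patch 08 claims tools could theoretically be intercepted via hooks today because `AfterToolCallEvent` has a writable `result` field, while model calls have no equivalent. A minimal sketch of that interception pattern follows. The event and registry classes below are simplified stand-ins invented for this sketch, not the real Strands hook API, whose signatures may differ.

```python
# Hedged sketch: injecting a cached tool result by overwriting the writable
# `result` field on an after-tool-call event. Stand-in classes, not real SDK.
from dataclasses import dataclass

@dataclass
class AfterToolCallEvent:
    tool_name: str
    result: dict  # writable: hook callbacks may replace this

class HookRegistry:
    """Minimal stand-in for a hook registry that fires callbacks on an event."""
    def __init__(self):
        self._callbacks = []

    def add_callback(self, cb):
        self._callbacks.append(cb)

    def invoke(self, event):
        for cb in self._callbacks:
            cb(event)
        return event

# Pretend this is a durable provider's cache of previously recorded results.
cache = {"search_flights": {"status": "cached", "flights": ["UA123"]}}

def replay_from_cache(event: AfterToolCallEvent) -> None:
    # On replay, swap the freshly computed result for the cached one.
    if event.tool_name in cache:
        event.result = cache[event.tool_name]

registry = HookRegistry()
registry.add_callback(replay_from_cache)
event = registry.invoke(AfterToolCallEvent("search_flights", {"status": "live"}))
# event.result is now the cached value, not the live one.
```

There is no analogous writable field on `AfterModelCallEvent`, which is exactly why the doc argues the `Durability` abstraction must be wired into the event loop rather than layered on hooks.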