feat: durable providers integration proposal #584
JackYPCOnline wants to merge 11 commits into strands-agents:main
Conversation
| This doc covers three providers: [Temporal](https://temporal.io/), [Dapr](https://dapr.io/) and [AWS Lambda Durable Execution](https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html). |
| ### How Those Providers Build Their First-class AI Agent |
| **Temporal AI Agent** ([temporal-community/temporal-ai-agent](https://github.com/temporal-community/temporal-ai-agent)) puts the agent loop *inside the Workflow*. Each LLM call and each tool call is dispatched as a separate Activity. The Workflow is deterministic — it just decides "call LLM next" or "call tool next." The Activities do the actual I/O. On crash, Temporal replays completed Activities from event history and the loop resumes mid-conversation. |
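The deterministic-loop-plus-activities split described above can be sketched in plain Python. This is not the temporal-ai-agent code and uses no Temporal SDK; `call_model_activity` and `call_tool_activity` are hypothetical stand-ins for real Activities that do the I/O, while the loop itself stays deterministic:

```python
# Sketch: the "workflow" only decides "call LLM next" or "call tool next";
# the injected activity functions do the actual I/O and are the units that
# a durable runtime would checkpoint and replay.
def agent_workflow(prompt, call_model_activity, call_tool_activity, max_turns=5):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        reply = call_model_activity(messages)              # Activity: LLM I/O
        messages.append({"role": "assistant", "content": reply["content"]})
        if reply.get("tool") is None:
            return messages                                # loop decides: done
        tool_result = call_tool_activity(reply["tool"])    # Activity: tool I/O
        messages.append({"role": "tool", "content": tool_result})
    return messages
```

With the I/O pushed behind those two call sites, the loop body itself contains nothing nondeterministic, which is what makes replay possible.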
I personally like this because it forces us into thinking of the agent loop more as a state machine, which it is.
And this stuff comes up a lot. That is why there are dedicated services like Step Functions and Airflow. The one downside is these orchestrators can be heavy. Step Functions is a whole dedicated service. Airflow can run in a single process, but it is still not meant to be something that you import into an application and call like Strands.
With all that said, curious how process/memory heavy temporal is.
Good call out. That's something I didn't check, but I'm kinda curious too. It's more about whether we want to offer this option to customers than about performance, I guess.
see related comment https://github.com/strands-agents/docs/pull/584/changes#r2897275989
| return result["stop_reason"], result["message"] |
| return wrapped |
| def wrap_tool_call(self, call_tool_fn): |
This means the tool definition and tool spec are accessible to both the workflow and the activity; i.e. they need the same set of tools.
The Workflow is just the code entry point.
| **Gap 2. No `Durability` abstraction on `Agent`** |
| `Agent` has no way to wrap its I/O calls with a durable provider's checkpoint primitive today. We need a `durability` parameter that intercepts `callModel` and `callTools` before they execute. |
Can we just wrap the model provider as a Durable model provider?
from strands import Agent
from strands.models import BedrockModel, DurableWrapper
model = DurableWrapper(BedrockModel())
agent = Agent(model=model)
Yes, this is exactly the same idea @zastrowm has.
Worth noting that we still have the gap of needing to wrap/abstract tools in this case
(although... maybe something new that lets model providers handle the tool calling?)
| | Orchestrator | Workflow | Workflow | `@durable_execution` handler | |
| | Checkpointable unit | Activity | Activity | `context.step()` | |
| | Replay mechanism | Event History | State Store | Cached step results | |
| | Human-in-the-loop | Signal | Signal / PubSub event | (not yet supported) | |
Surprising that LDF does not support HIL. Is that coming?
Actually, LDF has human-in-the-loop in different ways: https://github.com/aws-samples/serverless-patterns/tree/main/lambda-durable-human-approval-sam
Yeah, was going to say something similar - this talks about a "pause"/waitForCallback which might be equivalent
| --- |
| ## 2. Level 1: Wrap Whole Agent Invoke (Works Today, No SDK Changes) |
IOW the user of the 3P framework uses that SDK (e.g. Temporal) as designed?
Yes, they must use their SDK to orchestrate.
| ) |
| ``` |
| Temporal retries the activity if the worker crashes. `S3SessionManager` restores conversation history on each retry. The entire agent loop is one atomic Activity. If the process crashes after tool call 2 but before tool call 3, the whole activity retries. |
The entire agent loop is one atomic Activity. If the process crashes after tool call 2 but before tool call 3, the whole activity retries.
Being that the entire agent retries, why would a customer want to use this as opposed to just using the regular Strands session manager?
Yes, they can achieve this using Strands graph directly, but it's more a question of at what level you handle retry and persistence.
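The tradeoff in this thread (whole-loop retry vs per-step resume) can be illustrated with a toy retry wrapper. `run_with_retries` is purely illustrative, not a Strands or Temporal API; the point is that at Level 1 the entire invoke is the retry unit, so every completed tool call is redone on failure:

```python
# Sketch: Level-1 durability retries the whole agent invoke as one atomic
# unit. Per-step checkpointing (Level 2) would instead resume at the step
# that failed.
def run_with_retries(invoke_agent, max_attempts: int = 3):
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke_agent()        # the entire loop reruns on failure
        except RuntimeError:
            if attempt == max_attempts:
                raise
```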
| # User's Lambda handler |
| @durable_execution |
| def handler(event: dict, context: DurableContext): |
| agent = Agent( |
I think this might benefit from an Agent subclass here. Given how invasive it is, I wonder if we just need to design the Agent a little more abstractly for durability.
That might allow us to streamline some of the API, but then transitioning from normal Agent -> specific providers is not as slick.
| @workflow.run |
| async def run(self, prompt: str) -> str: |
| agent = Agent( |
I admit that I may not be fully internalizing the flow in this document.
But, code smells usually indicate a problem with the underlying design. Here, what smells is that to use durability I need to use both TemporalDurability and a MyTemporalModelProvider.
What happens if I use MyTemporalModelProvider without TemporalDurability? Or what happens if I use TemporalDurability with a non temporal model provider?
Yea, I agree with this. I've seen a few similar comments too.
Calling back to @pgrayy's design to split the agent runtime and agent state: if we turned the agent loop into a proper state machine, we could basically turn each state execution into a "durable" step to make replay a lot easier. That way side effects are not re-computed, and we can more easily restart.
In Temporal, you need to instantiate a model object inside Activities, and our model is not JSON-serializable. So if users want to pass configuration to models, they have to define their own TemporalModelProvider.
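The workaround described in this reply can be sketched in plain Python: only JSON-serializable config crosses the workflow/activity boundary, and the live model object is built inside the activity. `TemporalModelProvider` and `make_model` are hypothetical names, not real Strands or Temporal APIs:

```python
# Sketch: carry only serializable config across the boundary; instantiate
# the non-serializable model lazily where live objects are allowed.
import json


class TemporalModelProvider:
    """Holds JSON-serializable config; builds the real model on demand."""

    def __init__(self, model_cls, **config):
        self.model_cls = model_cls
        self.config = config

    def to_json(self) -> str:
        # Only plain config is serialized, never the live model object.
        return json.dumps(self.config)

    def make_model(self):
        # Called inside the activity, where live objects are allowed.
        return self.model_cls(**self.config)
```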
| return str(agent(prompt).message) |
| @durable_execution |
Where this pattern gets tricky for me is when you have to layer on another SDK like AgentCore SDK, or whatever the users chosen runtime environment provides. The developer now has to navigate several layers of wrapping to achieve a working agent.
This is durability at the agent level, treating agent/multi-agent execution as an atomic operation. Yes, I agree, this is meaningless.
My point is that they'll have to think about the AgentCore @entrypoint decorator /and/ how each durability solution works. I think for users of these frameworks that are highly motivated, that's fine.
I'm not sure the DX is good enough to attract someone who wants durability guarantees but hasn't gone deep in the SDKs.
| During Temporal replay, the model activity returns a cached result instantly: no stream, no tokens. Any UI or logging built on streaming callbacks will see nothing on replay. This isn't a correctness issue, but it's a confusing UX gap worth calling out. |
| **5. MCP Limitation** |
| The Temporal agent has an MCP example; after taking a closer look, I found that MCP works, but Strands' MCPClient integration pattern doesn't map directly. Strands' MCPClient is designed to be constructed once and passed as a tool to Agent. In a durable context, the MCP connection must live inside the activity worker and be managed there (not constructed in the workflow and passed in). The user can't just do `Agent(tools=[my_mcp_client], durability=...)` and have it work, because MCPClient holds a live background thread that can't cross the activity boundary. |
This reads like a big limitation. What can we do about it?
Also is this specific to Temporal or also DAPR & LDF?
| --- |
| ## Action Items |
I'd encourage validation of your prototype in EKS and AgentCore runtime scenarios as well, in case there are any complications in the implementation or developer experience.
| messages = [{"role": "user", "content": prompt}] |
| while True: |
| response = await wrapped_call_model(messages, self.system_prompt, self.tool_specs) |
What happens if the whole process shuts down let's say during a model call? Will temporal allow us to pick up from here without the user having to prompt the agent again?
| Step 3: wrapped_call_tools("search_flights") → Temporal records ActivityTaskCompleted ✅ |
| Step 4: wrapped_call_model() → 💥 Worker crashes mid-Activity |
| ─── Worker restarts, Temporal replays the Workflow ─── |
It has to replay the whole workflow? Or can it pick up from where it left off?
It actually reruns the workflow, then returns the cached value directly from each successful activity.
| ```python |
| # ─── strands/agent/durability.py ───────────────────────────────── |
| class Durability: |
I think at a high level, and this is perfectly understandable, this feels like we are solving the problem that is immediately in front of us instead of taking a step back to look at the bigger picture.
For me, a prerequisite for durability is a state machine implementation of the loop. I believe if we first solve that, durability becomes a much smaller problem.
For example, if we have a state machine, each step in that machine can be tied to the corresponding Hook. The hook can then represent the step that is about to be taken. Meaning, suppose the BeforeModelCallHook looked like the following now:
@dataclass
class BeforeModelCallEvent(HookEvent):
    step: Step
    invocation_state: dict[str, Any] = field(default_factory=dict)
If we allowed for mutation of the step, then I think there is a world where durability is just a plugin that does the following:
before_model_call_event.step = TemporalDurability.wrap(step)
The overall gut feeling I have here is that this feature doesn't need to be quite as deep in the SDK as is proposed.
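The plugin idea in this comment can be made concrete with a small runnable sketch. `Step`, `BeforeModelCallEvent`, and `TemporalDurability` follow the commenter's hypothetical names; none of them are real Strands APIs, and the "durability" here just records results to keep the example self-contained:

```python
# Sketch: the loop emits a hook event carrying the next step, and a
# durability plugin may replace that step with a wrapped version before
# the loop executes it.
from dataclasses import dataclass, field
from typing import Any, Callable

Step = Callable[[], Any]


@dataclass
class BeforeModelCallEvent:
    step: Step
    invocation_state: dict[str, Any] = field(default_factory=dict)


class TemporalDurability:
    def __init__(self):
        self.recorded = []

    def wrap(self, step: Step) -> Step:
        def durable_step():
            result = step()            # stand-in for checkpointing logic
            self.recorded.append(result)
            return result
        return durable_step


def run_step(event: BeforeModelCallEvent, plugin: TemporalDurability):
    # The hook mutates the step; the loop then executes the wrapped step.
    event.step = plugin.wrap(event.step)
    return event.step()
```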
| | | Lambda Durable Level 1 | Temporal Level 2 | |
| |---|---|------------------------------------------------------| |
| | SDK changes needed | None | ~4 (hooks, Durability class, Agent wiring, event loop) | |
| | New packages | None | `strands-temporal`, `strands-aws`, `strands-dapr` | |
Do we see strands-aws as extending to hold other AWS integrations? If not, I think we should use more specific naming.
Description
Adds a new internal team doc covering the design and integration plan for durable execution support in the Strands SDK.
The doc covers:
- Why the current in-process agent loop breaks under production failure scenarios (mid-loop crash, Lambda 15-min ceiling)
- How teams can integrate today using Temporal and AWS Lambda Durable as a single coarse-grained step, including a working code example and the known limitations of each approach
- The three gaps in the current SDK that block native per-step checkpointing
- The proposed 2.0 solution: async hooks, AgentSpec, DurableBackend/ExecutionHandle ABCs, and a durable_backend parameter on Agent, with API sketches for each
Related Issues
Type of Change
Checklist
mkdocs serve
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.