
feat: durable providers integration proposal#584

Open
JackYPCOnline wants to merge 11 commits into strands-agents:main from JackYPCOnline:durable

Conversation

@JackYPCOnline
Contributor

Description

Adds a new internal team doc covering the design and integration plan for durable execution support in the Strands SDK.

The doc covers:

  • Why the current in-process agent loop breaks under production failure scenarios (mid-loop crash, Lambda 15-min ceiling)

  • How teams can integrate today using Temporal and AWS Lambda Durable as a single coarse-grained step, including a working code example and the known limitations of each approach

  • The three gaps in the current SDK that block native per-step checkpointing

  • The proposed 2.0 solution: async hooks, AgentSpec, DurableBackend/ExecutionHandle ABCs, and a durable_backend parameter on Agent, with API sketches for each

Related Issues

Type of Change

  • New content
  • Content update/revision
  • Structure/organization improvement
  • Typo/formatting fix
  • Bug fix
  • Other (please describe):

Checklist

  • I have read the CONTRIBUTING document
  • My changes follow the project's documentation style
  • I have tested the documentation locally using mkdocs serve
  • Links in the documentation are valid and working

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@JackYPCOnline JackYPCOnline marked this pull request as draft March 3, 2026 17:34
@JackYPCOnline JackYPCOnline marked this pull request as ready for review March 3, 2026 17:52

This doc covers three providers: [Temporal](https://temporal.io/), [Dapr](https://dapr.io/) and [AWS Lambda Durable Execution](https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html).

This doc covers three providers: [Temporal](https://temporal.io/), [Dapr](https://dapr.io/) and [AWS Lambda Durable Execution](https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html).
Member

Nit: Duplicate line.


### How Those Providers Build Their First-class AI Agent

**Temporal AI Agent** ([temporal-community/temporal-ai-agent](https://github.com/temporal-community/temporal-ai-agent)) puts the agent loop *inside the Workflow*. Each LLM call and each tool call is dispatched as a separate Activity. The Workflow is deterministic — it just decides "call LLM next" or "call tool next." The Activities do the actual I/O. On crash, Temporal replays completed Activities from event history and the loop resumes mid-conversation.
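The replay mechanics described above can be sketched without Temporal itself. The following is a framework-free illustration of the cache-and-replay idea, with a hypothetical `ReplayingExecutor` standing in for the Worker and a plain list standing in for the event history (none of these names are Temporal or Strands APIs):

```python
# Framework-free sketch of replay-from-history. Completed "Activity" results
# are recorded; on restart, recorded steps return cached results instead of
# re-running their I/O.
class ReplayingExecutor:
    def __init__(self, history=None):
        self.history = list(history or [])  # results of completed steps
        self.cursor = 0

    def step(self, fn, *args):
        # On replay, return the recorded result instead of re-running the I/O.
        if self.cursor < len(self.history):
            result = self.history[self.cursor]
        else:
            result = fn(*args)           # first execution: do the real work
            self.history.append(result)  # ...and record it
        self.cursor += 1
        return result

calls = []

def call_model(msgs):
    calls.append("model")
    return {"stop_reason": "end_turn", "text": "done"}

# First run executes the model call; a "restarted" executor fed the same
# history replays it without invoking the function again.
ex1 = ReplayingExecutor()
ex1.step(call_model, [])
ex2 = ReplayingExecutor(ex1.history)
replayed = ex2.step(call_model, [])
# `calls` shows the model ran exactly once across both runs.
```

This is why the Workflow must stay deterministic: on restart it re-executes the decision logic, while every completed Activity is answered from history.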
Member

I personally like this because it forces us into thinking of the agent loop more as a state machine, which it is.

Member

@pgrayy pgrayy Mar 6, 2026

And this stuff comes up a lot. That is why there are dedicated services like Step Functions and Airflow. The one downside is that these orchestrators can be heavy. Step Functions is a whole dedicated service. Airflow can run in a single process, but it is still not meant to be something that you import into an application and call like Strands.

With all that said, curious how process/memory heavy temporal is.

Contributor Author

Good call out. That's something I didn't check, but I'm also kind of curious. It is more a question of whether we want to offer this option to customers than a performance question, I guess.

Member

Yep understandable.


```python
            return result["stop_reason"], result["message"]
        return wrapped

    def wrap_tool_call(self, call_tool_fn):
```
Member

This means the tool definition and tool spec are accessible to both the workflow and the activity; e.g. they need the same set of tools

Contributor Author

Workflow is just the code entry point


**Gap 2. No `Durability` abstraction on `Agent`**

`Agent` has no way to wrap its I/O calls with a durable provider's checkpoint primitive today. We need a `durability` parameter that intercepts `callModel` and `callTools` before they execute.
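For illustration, a minimal sketch of what such an interception layer could look like, assuming the `Durability`/`wrap_model_call`/`wrap_tool_call` names from this proposal; none of this is current Strands API:

```python
# Hypothetical sketch: a Durability object gets a chance to wrap the model
# and tool call functions before the event loop uses them.
class Durability:
    def wrap_model_call(self, call_model_fn):
        return call_model_fn  # default: pass through unchanged

    def wrap_tool_call(self, call_tool_fn):
        return call_tool_fn


class LoggingDurability(Durability):
    """Toy provider that just records each intercepted call."""

    def __init__(self):
        self.intercepted = []

    def wrap_model_call(self, call_model_fn):
        def wrapped(*args, **kwargs):
            # A real provider would checkpoint here (e.g. run the call as a
            # Temporal Activity); this toy only logs the interception.
            self.intercepted.append("model")
            return call_model_fn(*args, **kwargs)
        return wrapped


durability = LoggingDurability()
call_model = durability.wrap_model_call(lambda messages: {"stop_reason": "end_turn"})
result = call_model([])
```

A real provider would replace the logging with its checkpoint primitive, which is exactly the hook the gap above says is missing today.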
Member

Can we just wrap the model provider as a Durable model provider?

```python
from strands import Agent
from strands.models import BedrockModel, DurableWrapper

model = DurableWrapper(BedrockModel())
agent = Agent(model=model)
```

Contributor Author

Yes, this is exactly the same idea @zastrowm has

Member

Worth noting that we still have the gap of needing to wrap/abstract tools in this case

(although... maybe something new that lets model providers handle the tool calling?)

| | Temporal | Dapr | Lambda Durable |
|---|---|---|---|
| Orchestrator | Workflow | Workflow | `@durable_execution` handler |
| Checkpointable unit | Activity | Activity | `context.step()` |
| Replay mechanism | Event History | State Store | Cached step results |
| Human-in-the-loop | Signal | Signal / PubSub event | (not yet supported) |
Member

Surprising that LDF does not support HIL. Is that coming?

Member

Yeah, was going to say something similar - this talks about a "pause"/waitForCallback which might be equivalent

https://dev.to/dobeerman/pause-your-lambda-building-a-slack-approval-workflow-with-aws-durable-functions-17jo


---

## 2. Level 1: Wrap Whole Agent Invoke (Works Today, No SDK Changes)
Member

IOW the user of the 3P framework uses that SDK (e.g. Temporal) as designed?

Contributor Author

Yes, they must use their SDK to orchestrate.

)
```

Temporal retries the activity if the worker crashes. `S3SessionManager` restores conversation history on each retry. The entire agent loop is one atomic Activity. If the process crashes after tool call 2 but before tool call 3, the whole activity retries.
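This coarse-grained behaviour can be sketched in plain Python, with a hypothetical `SessionStore` standing in for `S3SessionManager` and a fake agent that crashes on its first attempt (all names here are illustrative, not Strands or Temporal APIs):

```python
# Level 1 sketch: the entire agent invoke is one retryable unit. On failure
# the whole invoke reruns; the session store persists across attempts.
class SessionStore:
    def __init__(self):
        self.messages = []


class CrashOnce:
    """Simulates a worker that crashes mid-loop on the first attempt."""

    def __init__(self):
        self.attempts = 0

    def invoke_agent(self, store, prompt):
        self.attempts += 1
        store.messages.append({"role": "user", "content": prompt})
        if self.attempts == 1:
            raise RuntimeError("worker crashed mid-loop")
        return "ok"


def run_with_retries(agent, store, prompt, max_attempts=3):
    # Coarse-grained retry, as Temporal does for a failed Activity:
    # the *entire* invoke reruns, not just the step that failed.
    for _ in range(max_attempts):
        try:
            return agent.invoke_agent(store, prompt)
        except RuntimeError:
            continue
    raise RuntimeError("exhausted retries")


agent = CrashOnce()
store = SessionStore()
result = run_with_retries(agent, store, "book a flight")
# The retry succeeds, but the session saw the prompt appended twice:
# the duplicate-work hazard of treating the loop as one atomic step.
```

The doubled session entry is the point: with the whole loop as one Activity, every completed model and tool call before the crash is repeated on retry.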
Contributor

The entire agent loop is one atomic Activity. If the process crashes after tool call 2 but before tool call 3, the whole activity retries.

Being that the entire agent retries, why would a customer want to use this as opposed to just using the regular Strands session manager?

Contributor Author

Yes, they can achieve this using Strands graph directly; it's more a question of at what level you handle retry & persistence.

```python
# User's Lambda handler
@durable_execution
def handler(event: dict, context: DurableContext):
    agent = Agent(
```
Member

I think this might benefit from an Agent subclass here. Given how invasive it is, I wonder if we just need to design the Agent a little more abstractly for durability.

That might allow us to streamline some of the API, but then transitioning from a normal Agent -> specific providers is not as slick.


```python
@workflow.run
async def run(self, prompt: str) -> str:
    agent = Agent(
```
Member

I admit that I may not be fully internalizing the flow in this document.

But, code smells usually indicate a problem with the underlying design. Here, what smells is that to use durability I need to use both TemporalDurability and a MyTemporalModelProvider.

What happens if I use MyTemporalModelProvider without TemporalDurability? Or what happens if I use TemporalDurability with a non temporal model provider?

Member

Yea, I agree with this. I've seen a few similar comments too.

Calling back to @pgrayy's design to split the agent runtime and agent state: if we turned the agent loop into a proper state machine, we could basically turn each state execution into a "durable" step to make replay a lot easier. That way side effects are not re-computed, and we can more easily restart.

Contributor Author

In Temporal, you need to instantiate a model object in Activities, but our models are not JSON serializable. If users want to pass configuration to models, they have to define their own TemporalModelProvider.

```python
    return str(agent(prompt).message)


@durable_execution
```
Member

Where this pattern gets tricky for me is when you have to layer on another SDK like AgentCore SDK, or whatever the users chosen runtime environment provides. The developer now has to navigate several layers of wrapping to achieve a working agent.

Contributor Author

This is durability at the agent level, treating agent/multiagent execution as an atomic operation. Yes, I agree, this is meaningless.

Member

My point is that they'll have to think about the AgentCore @entrypoint decorator /and/ how each durability solution works. I think for users of these frameworks that are highly motivated, that's fine.

I'm not sure the DX is good enough to attract someone who wants durability guarantees but hasn't gone deep in the SDKs.

During Temporal replay, the model activity returns a cached result instantly, no stream, no tokens. Any UI or logging built on streaming callbacks will see nothing on replay. This isn't a correctness issue but it's a confusing UX gap worth calling out.
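The streaming gap can be shown with a small sketch: a cached step invokes its streaming callback on first execution but returns the cached result wholesale on replay, so the callback never fires again. `CachedModelStep` and its names are illustrative, not Strands or Temporal APIs:

```python
# Sketch of the replay UX gap: streaming callbacks fire on the first
# execution only; replay returns the cached final result with no tokens.
class CachedModelStep:
    def __init__(self):
        self.cache = None

    def call(self, on_token):
        if self.cache is not None:
            return self.cache      # replay: no tokens are streamed
        tokens = ["Hel", "lo"]
        for t in tokens:
            on_token(t)            # first run: the UI sees each token
        self.cache = "".join(tokens)
        return self.cache


step = CachedModelStep()
seen = []
first = step.call(seen.append)   # streams token-by-token
replay = step.call(seen.append)  # cached: nothing new is appended
```

Both calls return the same final text, but only the first produces streaming events, which is why UIs built on token callbacks see nothing during replay.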

**5. MCP Limitation**
The Temporal agent has an MCP example. After taking a closer look, I found that MCP works, but Strands' MCPClient integration pattern doesn't map directly. Strands' MCPClient is designed to be constructed once and passed as a tool to Agent. In a durable context, the MCP connection must live inside the activity worker and be managed there (not constructed in the workflow and passed in). The user can't just do `Agent(tools=[my_mcp_client], durability=...)` and have it work, because MCPClient holds a live background thread that can't cross the activity boundary.
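The boundary constraint can be demonstrated without Temporal or MCP: objects that own live threads cannot be serialized across a process boundary, so they must be constructed inside the worker-side function. `FakeMCPClient` below is a stand-in, not the Strands MCPClient API:

```python
import pickle
import threading


class FakeMCPClient:
    """Stand-in for a client that owns a live background thread."""

    def __init__(self):
        self.thread = threading.Thread(target=lambda: None, daemon=True)
        self.thread.start()

    def call_tool(self, name: str) -> str:
        return f"result of {name}"


def tool_activity(tool_name: str) -> str:
    # Correct pattern: construct the client inside the activity worker,
    # so the live connection never crosses the workflow/activity boundary.
    client = FakeMCPClient()
    return client.call_tool(tool_name)


out = tool_activity("search_flights")

# Why the client can't be passed across the boundary: live threads (and
# the locks inside them) are not serializable.
try:
    pickle.dumps(FakeMCPClient().thread)
    thread_serializable = True
except TypeError:
    thread_serializable = False
```

The same construct-inside-the-worker pattern would apply to any provider that ships inputs across a serialization boundary, not just Temporal.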
Member

This reads like a big limitation. What can we do about it?

Member

Also is this specific to Temporal or also DAPR & LDF?

---


## Action Items
Member

I'd encourage validation of your prototype in EKS and AgentCore runtime scenarios as well, in case there are any complications in the implementation or developer experience.

```python
messages = [{"role": "user", "content": prompt}]

while True:
    response = await wrapped_call_model(messages, self.system_prompt, self.tool_specs)
```
Member

@pgrayy pgrayy Mar 6, 2026

What happens if the whole process shuts down let's say during a model call? Will temporal allow us to pick up from here without the user having to prompt the agent again?

Step 3: wrapped_call_tools("search_flights") → Temporal records ActivityTaskCompleted ✅
Step 4: wrapped_call_model() → 💥 Worker crashes mid-Activity

─── Worker restarts, Temporal replays the Workflow ───
Member

It has to replay the whole workflow? Or can it pick up from where it left off?

Contributor Author

It actually reruns the workflow, then returns the cached value directly for each successful activity.

```python
# ─── strands/agent/durability.py ─────────────────────────────────

class Durability:
```
Member

I think at a high level, and this is perfectly understandable, this feels like we are solving the problem that is immediately in front of us instead of taking a step back to look at the bigger picture.

For me, a pre requisite for durability is a state machine implementation of the loop. I believe if we first solve that durability becomes a much smaller problem.

For example, if we have a state machine each step in that machine can be tied to the corresponding Hook. The hook can then represent the step that is about to be taken. Meaning suppose the BeforeModelCallHook looked like the following now

```python
@dataclass
class BeforeModelCallEvent(HookEvent):
    step: Step
    invocation_state: dict[str, Any] = field(default_factory=dict)
```

if we allowed for mutation of the step, then I think there is a world where durability is just a plugin that does the following

```python
before_model_call_event.step = TemporalDurability.wrap(step)
```

The overall gut feeling I have here, is this feature doesn't need to be quite as deep in the sdk as is proposed.

Member

Plus one on the state machine approach.

| | Lambda Durable Level 1 | Temporal Level 2 |
|---|---|---|
| SDK changes needed | None | ~4 (hooks, Durability class, Agent wiring, event loop) |
| New packages | None | `strands-temporal`, `strands-aws`, `strands-dapr` |
Member

Do we see strands-aws as extending to hold other AWS integrations? If not, I think we should use more specific naming.

