feat: durable providers integration proposal #584
JackYPCOnline wants to merge 11 commits into strands-agents:main
Conversation
| This doc covers three providers: [Temporal](https://temporal.io/), [Dapr](https://dapr.io/) and [AWS Lambda Durable Execution](https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html). |
| ### How Those Providers Build Their First-class AI Agent |
| **Temporal AI Agent** ([temporal-community/temporal-ai-agent](https://github.com/temporal-community/temporal-ai-agent)) puts the agent loop *inside the Workflow*. Each LLM call and each tool call is dispatched as a separate Activity. The Workflow is deterministic — it just decides "call LLM next" or "call tool next." The Activities do the actual I/O. On crash, Temporal replays completed Activities from event history and the loop resumes mid-conversation. |
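The deterministic-loop-plus-activities split described above can be sketched in plain Python. This is not the temporal-ai-agent code and uses no Temporal SDK; `call_model_activity` and `call_tool_activity` are hypothetical stand-ins for real Activities that do the I/O, while the loop itself stays deterministic:

```python
# Sketch: the "workflow" only decides "call LLM next" or "call tool next";
# the injected activity functions do the actual I/O and are the units that
# a durable runtime would checkpoint and replay.
def agent_workflow(prompt, call_model_activity, call_tool_activity, max_turns=5):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        reply = call_model_activity(messages)              # Activity: LLM I/O
        messages.append({"role": "assistant", "content": reply["content"]})
        if reply.get("tool") is None:
            return messages                                # loop decides: done
        tool_result = call_tool_activity(reply["tool"])    # Activity: tool I/O
        messages.append({"role": "tool", "content": tool_result})
    return messages
```

With the I/O pushed behind those two call sites, the loop body itself contains nothing nondeterministic, which is what makes replay possible.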
I personally like this because it forces us into thinking of the agent loop more as a state machine, which it is.
And this stuff comes up a lot. That is why there are dedicated services like Step Functions and Airflow. The one downside is these orchestrators can be heavy. Step Functions is a whole dedicated service. Airflow can run in a single process, but it is still not meant to be something that you import into an application and call like Strands.
With all that said, curious how process/memory heavy temporal is.
Good call out. That's something I didn't check, but I'm kinda curious too. It's more about whether we want to offer this option to customers than about performance, I guess.
see related comment https://github.com/strands-agents/docs/pull/584/changes#r2897275989
| return result["stop_reason"], result["message"] |
| return wrapped |
| def wrap_tool_call(self, call_tool_fn): |
This means the tool definition and tool spec are accessible to both the workflow and the activity; i.e. they need the same set of tools.
The Workflow is just the code entry point.
| **Gap 2. No `Durability` abstraction on `Agent`** |
| `Agent` has no way to wrap its I/O calls with a durable provider's checkpoint primitive today. We need a `durability` parameter that intercepts `callModel` and `callTools` before they execute. |
Can we just wrap the model provider as a Durable model provider?
from strands import Agent
from strands.models import BedrockModel, DurableWrapper
model = DurableWrapper(BedrockModel())
agent = Agent(model=model)
Yes, this is exactly the same idea @zastrowm has.
Worth noting that we still have the gap of needing to wrap/abstract tools in this case
(although... maybe something new that lets model providers handle the tool calling?)
| | Orchestrator | Workflow | Workflow | `@durable_execution` handler | |
| | Checkpointable unit | Activity | Activity | `context.step()` | |
| | Replay mechanism | Event History | State Store | Cached step results | |
| | Human-in-the-loop | Signal | Signal / PubSub event | (not yet supported) | |
Surprising that LDF does not support HIL. Is that coming?
Actually, LDF has human-in-the-loop in different ways: https://github.com/aws-samples/serverless-patterns/tree/main/lambda-durable-human-approval-sam
Yeah, was going to say something similar - this talks about a "pause"/waitForCallback which might be equivalent
| --- |
| ## 2. Level 1: Wrap Whole Agent Invoke (Works Today, No SDK Changes) |
IOW the user of the 3P framework uses that SDK (e.g. Temporal) as designed?
Yes, they must use their SDK to orchestrate.
| ) |
| ``` |
| Temporal retries the activity if the worker crashes. `S3SessionManager` restores conversation history on each retry. The entire agent loop is one atomic Activity. If the process crashes after tool call 2 but before tool call 3, the whole activity retries. |
The entire agent loop is one atomic Activity. If the process crashes after tool call 2 but before tool call 3, the whole activity retries.
Being that the entire agent retries, why would a customer want to use this as opposed to just using the regular Strands session manager?
Yes, they can achieve this using Strands graph directly, but it's more a question of at what level you handle retry and persistence.
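The tradeoff in this thread (whole-loop retry vs per-step resume) can be illustrated with a toy retry wrapper. `run_with_retries` is purely illustrative, not a Strands or Temporal API; the point is that at Level 1 the entire invoke is the retry unit, so every completed tool call is redone on failure:

```python
# Sketch: Level-1 durability retries the whole agent invoke as one atomic
# unit. Per-step checkpointing (Level 2) would instead resume at the step
# that failed.
def run_with_retries(invoke_agent, max_attempts: int = 3):
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke_agent()        # the entire loop reruns on failure
        except RuntimeError:
            if attempt == max_attempts:
                raise
```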
| # User's Lambda handler |
| @durable_execution |
| def handler(event: dict, context: DurableContext): |
| agent = Agent( |
I think this might benefit from an Agent subclass here. Given how invasive it is, I wonder if we just need to design the Agent a little more abstractly for durability.
That might allow us to streamline some of the API, but then transitioning from normal Agent -> specific providers is not as slick.
| @workflow.run |
| async def run(self, prompt: str) -> str: |
| agent = Agent( |
I admit that I may not be fully internalizing the flow in this document.
But, code smells usually indicate a problem with the underlying design. Here, what smells is that to use durability I need to use both TemporalDurability and a MyTemporalModelProvider.
What happens if I use MyTemporalModelProvider without TemporalDurability? Or what happens if I use TemporalDurability with a non temporal model provider?
Yea, I agree with this. I've seen a few similar comments too.
Calling back to @pgrayy's design to split the agent runtime and agent state: if we turned the agent loop into a proper state machine, we could basically turn each state execution into a "durable" step to make replay a lot easier. That way side effects are not re-computed, and we can more easily restart.
In Temporal, you need to instantiate a model object inside Activities, and our model is not JSON-serializable. So if users want to pass configuration to models, they have to define their own TemporalModelProvider.
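The workaround described in this reply can be sketched in plain Python: only JSON-serializable config crosses the workflow/activity boundary, and the live model object is built inside the activity. `TemporalModelProvider` and `make_model` are hypothetical names, not real Strands or Temporal APIs:

```python
# Sketch: carry only serializable config across the boundary; instantiate
# the non-serializable model lazily where live objects are allowed.
import json


class TemporalModelProvider:
    """Holds JSON-serializable config; builds the real model on demand."""

    def __init__(self, model_cls, **config):
        self.model_cls = model_cls
        self.config = config

    def to_json(self) -> str:
        # Only plain config is serialized, never the live model object.
        return json.dumps(self.config)

    def make_model(self):
        # Called inside the activity, where live objects are allowed.
        return self.model_cls(**self.config)
```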
| return str(agent(prompt).message) |
| @durable_execution |
Where this pattern gets tricky for me is when you have to layer on another SDK like AgentCore SDK, or whatever the users chosen runtime environment provides. The developer now has to navigate several layers of wrapping to achieve a working agent.
This is durability at the agent level, treating agent/multi-agent execution as an atomic operation. Yes, I agree, this is meaningless.
My point is that they'll have to think about the AgentCore @entrypoint decorator /and/ how each durability solution works. I think for users of these frameworks that are highly motivated, that's fine.
I'm not sure the DX is good enough to attract someone who wants durability guarantees but hasn't gone deep in the SDKs.
| During Temporal replay, the model activity returns a cached result instantly: no stream, no tokens. Any UI or logging built on streaming callbacks will see nothing on replay. This isn't a correctness issue, but it's a confusing UX gap worth calling out. |
| **5. MCP Limitation** |
| The Temporal agent has an MCP example; after taking a closer look, I found that MCP works, but Strands' MCPClient integration pattern doesn't map directly. Strands' MCPClient is designed to be constructed once and passed as a tool to Agent. In a durable context, the MCP connection must live inside the activity worker and be managed there (not constructed in the workflow and passed in). The user can't just do `Agent(tools=[my_mcp_client], durability=...)` and have it work, because MCPClient holds a live background thread that can't cross the activity boundary. |
This reads like a big limitation. What can we do about it?
Also is this specific to Temporal or also DAPR & LDF?
| --- |
| ## Action Items |
I'd encourage validation of your prototype in EKS and AgentCore runtime scenarios as well, in case there are any complications in the implementation or developer experience.
| messages = [{"role": "user", "content": prompt}] |
| while True: |
| response = await wrapped_call_model(messages, self.system_prompt, self.tool_specs) |
What happens if the whole process shuts down let's say during a model call? Will temporal allow us to pick up from here without the user having to prompt the agent again?
| Step 3: wrapped_call_tools("search_flights") → Temporal records ActivityTaskCompleted ✅ |
| Step 4: wrapped_call_model() → 💥 Worker crashes mid-Activity |
| ─── Worker restarts, Temporal replays the Workflow ─── |
It has to replay the whole workflow? Or can it pick up from where it left off?
It actually reruns the workflow, then returns the cached value directly from each successful activity.
| ```python |
| # ─── strands/agent/durability.py ───────────────────────────────── |
| class Durability: |
I think at a high level, and this is perfectly understandable, this feels like we are solving the problem that is immediately in front of us instead of taking a step back to look at the bigger picture.
For me, a prerequisite for durability is a state machine implementation of the loop. I believe if we first solve that, durability becomes a much smaller problem.
For example, if we have a state machine, each step in that machine can be tied to the corresponding Hook. The hook can then represent the step that is about to be taken. Meaning, suppose the BeforeModelCallHook looked like the following now:
@dataclass
class BeforeModelCallEvent(HookEvent):
    step: Step
    invocation_state: dict[str, Any] = field(default_factory=dict)
If we allowed for mutation of the step, then I think there is a world where durability is just a plugin that does the following:
before_model_call_event.step = TemporalDurability.wrap(step)
The overall gut feeling I have here is that this feature doesn't need to be quite as deep in the SDK as is proposed.
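The plugin idea in this comment can be made concrete with a small runnable sketch. `Step`, `BeforeModelCallEvent`, and `TemporalDurability` follow the commenter's hypothetical names; none of them are real Strands APIs, and the "durability" here just records results to keep the example self-contained:

```python
# Sketch: the loop emits a hook event carrying the next step, and a
# durability plugin may replace that step with a wrapped version before
# the loop executes it.
from dataclasses import dataclass, field
from typing import Any, Callable

Step = Callable[[], Any]


@dataclass
class BeforeModelCallEvent:
    step: Step
    invocation_state: dict[str, Any] = field(default_factory=dict)


class TemporalDurability:
    def __init__(self):
        self.recorded = []

    def wrap(self, step: Step) -> Step:
        def durable_step():
            result = step()            # stand-in for checkpointing logic
            self.recorded.append(result)
            return result
        return durable_step


def run_step(event: BeforeModelCallEvent, plugin: TemporalDurability):
    # The hook mutates the step; the loop then executes the wrapped step.
    event.step = plugin.wrap(event.step)
    return event.step()
```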
| | | Lambda Durable Level 1 | Temporal Level 2 | |
| |---|---|------------------------------------------------------| |
| | SDK changes needed | None | ~4 (hooks, Durability class, Agent wiring, event loop) | |
| | New packages | None | `strands-temporal`, `strands-aws`, `strands-dapr` | |
Do we see strands-aws as extending to hold other AWS integrations? If not, I think we should use more specific naming.
Description
Adds a new internal team doc covering the design and integration plan for durable execution support in the Strands SDK.
The doc covers:
- Why the current in-process agent loop breaks under production failure scenarios (mid-loop crash, Lambda 15-min ceiling)
- How teams can integrate today using Temporal and AWS Lambda Durable as a single coarse-grained step, including a working code example and the known limitations of each approach
- The three gaps in the current SDK that block native per-step checkpointing
- The proposed 2.0 solution: async hooks, AgentSpec, DurableBackend/ExecutionHandle ABCs, and a durable_backend parameter on Agent, with API sketches for each
Related Issues
Type of Change
Checklist
mkdocs serve
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.