temporal-spring-ai: add side-effect replay tests for chat, activity tools, and @SideEffectTool (#2856)
temporal-spring-ai: add side-effect replay tests for chat, activity tools, and @SideEffectTool

Three new tests under src/test/.../replay/:

- ChatModelSideEffectTest: register a ChatModel with an AtomicInteger counter. Run a workflow that makes one chat call, assert counter=1. Replay the captured history and assert the counter is still 1: the activity result comes from history, not from re-invoking the ChatModel.
- ActivityToolSideEffectTest: activity-backed @Tool whose impl increments a counter. ToolCallingStubChatModel asks for the tool on the first call and returns final text on the second. Same assertion shape: counter=1 after run, counter=1 after replay.
- SideEffectToolReplayTest: @SideEffectTool body increments a counter via a file-scope static. The workflow drives a tool call through ToolCallingStubChatModel. The assertion proves that Workflow.sideEffect's marker is what is consulted on replay, rather than re-invoking the @Tool method.

MCP is intentionally omitted: spring-ai-mcp is compileOnly, and adding it just for one test isn't worth the dependency weight. MCP tool calls go through the same Temporal activity machinery as ChatModel, which ChatModelSideEffectTest already covers.

I verified that SideEffectToolReplayTest catches a real regression by temporarily dropping the Workflow.sideEffect wrap in SideEffectToolCallback; the test correctly failed with `expected: <1> but was: <2>`. Restored before this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
assertEquals(
    1,
    model.callCount.get(),
    "ChatModel must not be re-invoked during replay — activity results come from history");
That sounds like you're testing Temporal itself rather than the plugin's code.
I understand that's not really your intent, but the distinction is thin, and future readers of this code (both humans and agents) could easily misunderstand it.
Yeah, agreed. The assertion messages have been reworked to make this clearer.
    ChatWorkflow.class, WorkflowOptions.newBuilder().setTaskQueue(TASK_QUEUE).build());
assertEquals("pong", workflow.chat("ping"));
assertEquals(
    1, model.callCount.get(), "ChatModel should be called once during the initial run");
The test methodology isn't exact, which may result in flakes. It is actually possible that the activity function might end up being executed more than once, due to retry, and so the counter might be higher than one here.
A scheduled activity with maxAttempts > 1 completes exactly once as observed by the Temporal server, but the activity implementation can still execute more than once: the worker has no way to know whether the server actually observed a given task completion, so the task may be redelivered and run again.
Instead of this atomic counter, you can use a concurrency-safe Set that records the activity ID of all activity tasks that were executed. The size of that set is what you are looking for.
Alternatively, you could use a very different test approach: look at the workflow history and filter on ActivityTaskScheduled events. The expectation is that there should be one ActivityTaskScheduled event for this specific activity type.
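The retry-flake argument above can be made concrete with a small plain-Java toy (no Temporal SDK involved; the names `deliveredTasks` and `activity-1` are purely illustrative): simulate one scheduled activity whose task is delivered to the worker twice, and compare what a raw invocation counter reports against a set keyed by activity ID.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class RetrySignalDemo {
    public static void main(String[] args) {
        // One scheduled activity, but its task is delivered twice
        // (e.g. the first completion was never observed by the server).
        List<String> deliveredTasks = List.of("activity-1", "activity-1");

        AtomicInteger implCounter = new AtomicInteger();
        Set<String> executedIds = ConcurrentHashMap.newKeySet();

        for (String activityId : deliveredTasks) {
            implCounter.incrementAndGet(); // counts every raw execution
            executedIds.add(activityId);   // collapses redeliveries of the same task
        }

        // The raw counter conflates legitimate retries with plugin bugs,
        // while the ID set reflects how many activities were actually scheduled.
        System.out.println("impl executions: " + implCounter.get());
        System.out.println("distinct activities: " + executedIds.size());
    }
}
```

The same one-per-scheduled-activity property is what the history-scan approach asserts, by counting ActivityTaskScheduled events instead of observing the worker.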
That's a very good point. I'll fix this, but I'll also work on updating the docs, since they are really vague about the right way to test this:
Check for duplicate side effects or other types of failures.
https://docs.temporal.io/develop/plugins-guide#side-effects-tests
Ok, ChatModelSideEffectTest and ActivityToolSideEffectTest have been changed to inspect the workflow history.
SideEffectToolReplayTest has been left as-is; I think that's fine for Workflow.sideEffect.
This PR definitely needs more work on the tests. Will rework it a bit.
Configure TestWorkflowEnvironment with WorkflowCacheSize(0) so the worker replays from history on every workflow task instead of resuming from in-memory cached state. That is the regime in which side-effect safety actually has to hold: a missing Workflow.sideEffect wrap or an un-guarded in-workflow mutation would run on each replay, which cached resumes mask. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
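A minimal sketch of that configuration, assuming the standard Java SDK plumbing for TestWorkflowEnvironment (the exact wiring in the plugin's tests may differ; the class name ReplayEveryTaskEnv is made up for this sketch):

```java
import io.temporal.testing.TestEnvironmentOptions;
import io.temporal.testing.TestWorkflowEnvironment;
import io.temporal.worker.WorkerFactoryOptions;

final class ReplayEveryTaskEnv {
    // Cache size 0 forces the worker to rebuild workflow state from history
    // on every workflow task instead of resuming from the in-memory cache,
    // so a missing Workflow.sideEffect wrap surfaces immediately.
    static TestWorkflowEnvironment newEnv() {
        return TestWorkflowEnvironment.newInstance(
            TestEnvironmentOptions.newBuilder()
                .setWorkerFactoryOptions(
                    WorkerFactoryOptions.newBuilder().setWorkflowCacheSize(0).build())
                .build());
    }
}
```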
Reviewer pointed out that "ChatModel must not be re-invoked during replay" reads as testing Temporal's activity replay guarantee rather than the plugin's behavior. Reword class-level javadoc and assertion messages so the property under test is explicit: the plugin places ChatModel calls, activity-stub tool calls, and @SideEffectTool bodies behind the correct Temporal boundary. Temporal's replay/memoization semantics are assumed correct. The counter-stays-at-1 observation is the signal we use to detect a plugin regression, not the thing being tested. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unters
Reviewer noted that counting invocations of the backing impl can
conflate two signals: the plugin inlining a call (what we want to
catch) vs. Temporal re-delivering an activity task to the worker
(which can legitimately happen with maxAttempts > 1, producing >1
impl executions for a single scheduled activity).
Switch ChatModelSideEffectTest and ActivityToolSideEffectTest to
scan workflow history for ActivityTaskScheduled events of the
expected activity type. That count is invariant under activity-task
redelivery and speaks directly to the plugin property under test
("the plugin scheduled the call as an activity"). Drops the impl
counters and the WorkflowReplayer pass — the former is no longer
needed, the latter is redundant with WorkflowDeterminismTest.
SideEffectToolReplayTest intentionally keeps its counter: the
Workflow.sideEffect path runs in workflow code under deterministic
replay semantics, not as an activity, so the retry-induced flake
mode doesn't apply.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
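Why the counter stays safe on the Workflow.sideEffect path can be seen with a toy model of its marker semantics (plain Java, not the Temporal SDK; the real implementation differs in detail): on the first execution the supplier runs and its result is recorded as a marker, while on replay the recorded marker is served and the supplier is never re-invoked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class SideEffectReplayDemo {
    // Toy model of sideEffect's marker recording; all state is illustrative.
    static final List<Object> markers = new ArrayList<>();
    static int cursor = 0;
    static boolean replaying = false;

    static <T> T sideEffect(Supplier<T> fn) {
        if (replaying) {
            @SuppressWarnings("unchecked")
            T recorded = (T) markers.get(cursor++); // serve the marker
            return recorded;                        // body NOT re-invoked
        }
        T value = fn.get(); // first run: execute the body
        markers.add(value); // and record its result as a marker
        return value;
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger();
        sideEffect(counter::incrementAndGet); // first run: executes the body

        replaying = true;
        cursor = 0;
        sideEffect(counter::incrementAndGet); // replay: served from the marker

        System.out.println(counter.get()); // stays at 1, not 2
    }
}
```

Under this model a counter in the side-effect body is a deterministic signal: it can only exceed 1 if the body is wrongly re-invoked on replay, which is exactly the plugin regression the test guards against.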
@mjameswh PR has been improved, and I also opened a PR on the docs to explain more about how to write these types of tests: temporalio/documentation#4481
What was changed
Three new tests under src/test/.../replay/:

- ChatModelSideEffectTest: register a ChatModel with an AtomicInteger counter. Run a workflow that makes one chat call, assert counter=1. Replay the captured history and assert the counter is still 1: the activity result comes from history, not from re-invoking the ChatModel.
- ActivityToolSideEffectTest: activity-backed @Tool whose impl increments a counter. A tool-calling stub ChatModel asks for the tool on the first call and returns final text on the second. Same shape: counter=1 after run, counter=1 after replay.
- SideEffectToolReplayTest: @SideEffectTool body increments a file-scope counter. The workflow drives one tool call. The assertion proves that Workflow.sideEffect's marker is consulted on replay, rather than re-invoking the @Tool method.

MCP is intentionally omitted for this PR. spring-ai-mcp is compileOnly, and adding it just for one replay test isn't worth the dependency weight; MCP tool calls go through the same Temporal activity machinery as ChatModel, which ChatModelSideEffectTest already exercises. The plan leaves this as an acceptable substitution.

Why?

The existing WorkflowDeterminismTest catches history-vs-command mismatches but does not prove the plugin isn't re-invoking user side effects on replay. Temporal's AI partner review standards require side-effect safety tests, and a regression (someone adding an in-workflow cache or dropping the Workflow.sideEffect wrap) would be easy to slip in without these.

I verified that SideEffectToolReplayTest actually catches a real regression: temporarily dropping the Workflow.sideEffect(...) wrap in SideEffectToolCallback.call(...) produces `expected: <1> but was: <2>` exactly as intended. Restored before this commit.