Conversation
…flow execution failures When an activity or child workflow result fails to deserialize (e.g. broken UnmarshalJSON), the workflow now blocks with repeated WorkflowTaskFailed events instead of terminating with WorkflowExecutionFailed. This matches the behavior of non-determinism errors: the workflow stays open until a fixed worker is deployed. The mechanism: - Activity.Run() and Workflow.RunChild() detect decode errors (errors that are not temporal.ActivityError, ChildWorkflowExecutionError, or CanceledError) and wrap them in DecodeError. - The WithImplementation wrapper checks the returned error for DecodeError and panics outside the user's workflow function scope, causing the SDK to fail the workflow task rather than the workflow execution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sean-george-anchor
left a comment
There was a problem hiding this comment.
Hey, I understand the desire to reduce the blast radius when workflows can't deserialize activity outputs.
That said, I don't think this is the right fix, and I worry anything in this direction is going to be brittle.
-
The SDK doesn't give us a way to identify decode errors.
decodeArg()returns plain unmarshaling errors, no typed wrapper, no exported concept. The exclusion check works for v1.25.1 (I verified every error path ininternal_event_handlers.go), but it's not a public contract. If a future SDK version introduces a new unwrapped error type, we'd silently misclassify it and panic on errors that should fail the workflow. -
Silent retries trade one problem for another. Failed workflows are loud. They surface in error tracking, monitoring, and alerts. Silently retrying workflows are invisible unless you're specifically watching task failure counts. Changing the failure mode here could actually make incidents like this harder to detect, not easier.
-
The
Execute()path isn't covered, and closing that gap adds a lot of machinery (future wrappers, context propagation,ChildWorkflowFuturewrapping) that needs to stay in sync as the API grows.
I think the real value is in prevention and detection: smoke tests before promotion, replay test enforcement, workflow failure alerting, deployment gating on error rates. Those keep the serialization mismatch from reaching production in the first place, which is more valuable than changing what happens after it does. Changing tempts' error semantics under the hood makes the library harder to reason about for anyone who knows Temporal's model, and any implementation here is going to be coupled to SDK internals in ways that are hard to maintain long-term.
Happy to discuss.
Replace elimination-based heuristics (isActivityDecodeError, isChildWorkflowDecodeError) with a single isDecodeError that checks errors.Is(err, converter.ErrUnableToDecode). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The sentinel survives through the error chain via %w, so the intermediate DecodeError wrapper and the detection/wrapping in Activity.Run and RunChild are unnecessary. panicOnDecodeError now checks for converter.ErrUnableToDecode directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sean-george-anchor
left a comment
There was a problem hiding this comment.
Nice, I like this approach, it's much simpler!
When an activity or child workflow result fails to deserialize (e.g. broken
UnmarshalJSON), the workflow now blocks with repeatedWorkflowTaskFailedevents instead of terminating withWorkflowExecutionFailed. This matches the behavior of non-determinism errors: the workflow stays open until a fixed worker is deployed.The mechanism:
Activity.Run()andWorkflow.RunChild()detect decode errors (errors that are nottemporal.ActivityError,ChildWorkflowExecutionError, orCanceledError) and wrap them inDecodeError.One thing missed is if the tempts user creates a future and directly calls
Get()on it. Intercepting that requires writing a future wrapper, so I left that out for now.