Skip to content

Treat payload decode errors as workflow task failures instead of work flow execution failures#8

Merged
vikstrous merged 8 commits intomasterfrom
decode-error-blocks-workflow
Feb 20, 2026
Merged

Treat payload decode errors as workflow task failures instead of work flow execution failures#8
vikstrous merged 8 commits intomasterfrom
decode-error-blocks-workflow

Conversation

@vikstrous
Copy link
Owner

@vikstrous vikstrous commented Feb 19, 2026

When an activity or child workflow result fails to deserialize (e.g. broken UnmarshalJSON), the workflow now blocks with repeated WorkflowTaskFailed events instead of terminating with WorkflowExecutionFailed. This matches the behavior of non-determinism errors: the workflow stays open until a fixed worker is deployed.

The mechanism:

  • Activity.Run() and Workflow.RunChild() detect decode errors (errors that are not temporal.ActivityError, ChildWorkflowExecutionError, or CanceledError) and wrap them in DecodeError.
  • The WithImplementation wrapper checks the returned error for DecodeError and panics outside the user's workflow function scope, causing the SDK to fail the workflow task rather than the workflow execution.

One thing missed is if the tempts user creates a future and directly calls Get() on it. Intercepting that requires writing a future wrapper, so I left that out for now.

…flow execution failures

When an activity or child workflow result fails to deserialize (e.g. broken
UnmarshalJSON), the workflow now blocks with repeated WorkflowTaskFailed
events instead of terminating with WorkflowExecutionFailed. This matches
the behavior of non-determinism errors: the workflow stays open until a
fixed worker is deployed.

The mechanism:
- Activity.Run() and Workflow.RunChild() detect decode errors (errors that
  are not temporal.ActivityError, ChildWorkflowExecutionError, or
  CanceledError) and wrap them in DecodeError.
- The WithImplementation wrapper checks the returned error for DecodeError
  and panics outside the user's workflow function scope, causing the SDK
  to fail the workflow task rather than the workflow execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vikstrous vikstrous marked this pull request as ready for review February 19, 2026 20:44
Copy link

@sean-george-anchor sean-george-anchor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I understand the desire to reduce the blast radius when workflows can't deserialize activity outputs.

That said, I don't think this is the right fix, and I worry anything in this direction is going to be brittle.

  • The SDK doesn't give us a way to identify decode errors. decodeArg() returns plain unmarshaling errors, no typed wrapper, no exported concept. The exclusion check works for v1.25.1 (I verified every error path in internal_event_handlers.go), but it's not a public contract. If a future SDK version introduces a new unwrapped error type, we'd silently misclassify it and panic on errors that should fail the workflow.

  • Silent retries trade one problem for another. Failed workflows are loud. They surface in error tracking, monitoring, and alerts. Silently retrying workflows are invisible unless you're specifically watching task failure counts. Changing the failure mode here could actually make incidents like this harder to detect, not easier.

  • The Execute() path isn't covered, and closing that gap adds a lot of machinery (future wrappers, context propagation, ChildWorkflowFuture wrapping) that needs to stay in sync as the API grows.

I think the real value is in prevention and detection: smoke tests before promotion, replay test enforcement, workflow failure alerting, deployment gating on error rates. Those keep the serialization mismatch from reaching production in the first place, which is more valuable than changing what happens after it does. Changing tempts' error semantics under the hood makes the library harder to reason about for anyone who knows Temporal's model, and any implementation here is going to be coupled to SDK internals in ways that are hard to maintain long-term.

Happy to discuss.

vikstrous and others added 5 commits February 20, 2026 15:44
Replace elimination-based heuristics (isActivityDecodeError,
isChildWorkflowDecodeError) with a single isDecodeError that checks
errors.Is(err, converter.ErrUnableToDecode).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The sentinel survives through the error chain via %w, so the
intermediate DecodeError wrapper and the detection/wrapping in
Activity.Run and RunChild are unnecessary. panicOnDecodeError
now checks for converter.ErrUnableToDecode directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

@sean-george-anchor sean-george-anchor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I like this approach, it's much simpler!

@vikstrous vikstrous merged commit 7d1e6b7 into master Feb 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants