Standalone Activity: preserve server-generated request IDs across restarts #9724
Merged
dandavison merged 4 commits into main, Apr 2, 2026
stephanos
reviewed
Mar 27, 2026
    if req.GetRequestId() == "" {
        // Since this mutates the request, we clone it first so that any retries use the original request.
        req = common.CloneProto(req)
Contributor
fwiw, in my PR now I've moved this into the validation and normalization methods like this:
    if req.GetRequestId() == "" {
        req.RequestId = uuid.NewString()
    } else if len(req.GetRequestId()) > config.MaxIDLengthLimit() {
        return serviceerror.NewInvalidArgumentf("request_id exceeds length limit. Length=%d Limit=%d",
            len(req.GetRequestId()), config.MaxIDLengthLimit())
    }
Contributor
Author
I've opened a pure refactoring PR to make SAA more consistent with SNO: #9795
bergundy
approved these changes
Mar 27, 2026
Member
bergundy
left a comment
Just FTR, the duplicates can easily happen if a client retries and doesn't specify a request ID. This makes it a bit worse.
Contributor
Author
@stephanos I've added a non-mergeable manual repro test to commit history and to the PR description (which will become the commit message).
Calls validateAndPopulateStartRequest twice with the same request pointer (simulating RetryableInterceptor retry behavior) and asserts both clones carry the same server-generated RequestId. Fails without the fix in the preceding commit: each clone got a fresh UUID because the RequestId was generated after the clone.
Use a package-level atomic to fail StartActivityExecution once after the activity is created at history, triggering the RetryableInterceptor. Without the fix, the retry generates a new request ID and gets ActivityExecutionAlreadyStarted. With the fix, the retry reuses the same request ID and the dedup succeeds.
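The fail-once injection described above can be sketched in isolation. This is a toy model, not the server code: `history`, `handle`, `retry`, `runStable`, and `runFresh` are hypothetical names, and the in-memory map only stands in for the dedup that history performs. It assumes a retry loop that re-invokes the handler on an "unavailable" error, roughly as the RetryableInterceptor is described here.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var (
	errUnavailable    = errors.New("unavailable (injected)")
	errAlreadyStarted = errors.New("activity execution already started")
)

// failOnce models a package-level test hook: when true, the next handler call
// fails after the activity has been created, then the flag resets itself.
var failOnce atomic.Bool

// history is a hypothetical in-memory stand-in for request-ID dedup.
type history struct{ requestIDs map[string]string }

// start creates the activity, or treats a repeat of the same request ID as a no-op.
func (h *history) start(activityID, requestID string) error {
	existing, ok := h.requestIDs[activityID]
	if !ok {
		h.requestIDs[activityID] = requestID
		return nil
	}
	if existing == requestID {
		return nil // same request ID: deduplicated
	}
	return errAlreadyStarted
}

// handle creates the activity and then, if the hook is armed, drops the response once.
func handle(h *history, activityID, requestID string) error {
	if err := h.start(activityID, requestID); err != nil {
		return err
	}
	if failOnce.CompareAndSwap(true, false) {
		return errUnavailable
	}
	return nil
}

// retry re-invokes attempt while it returns errUnavailable, up to maxAttempts times.
func retry(maxAttempts int, attempt func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = attempt(); !errors.Is(err, errUnavailable) {
			return err
		}
	}
	return err
}

// runStable retries with one request ID shared by every attempt.
func runStable() error {
	h := &history{requestIDs: map[string]string{}}
	failOnce.Store(true)
	return retry(3, func() error { return handle(h, "a1", "stable-id") })
}

// runFresh retries with a new request ID on every attempt.
func runFresh() error {
	h := &history{requestIDs: map[string]string{}}
	failOnce.Store(true)
	n := 0
	return retry(3, func() error {
		n++
		return handle(h, "a1", fmt.Sprintf("fresh-%d", n))
	})
}

func main() {
	fmt.Println("stable request ID:", runStable())
	fmt.Println("fresh request ID per attempt:", runFresh())
}
```

In this model the stable-ID run ends without an error, while the fresh-ID run ends with the already-started error, which mirrors the two outcomes the commit message describes.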
Move RequestId generation before the proto clone so that the
RetryableInterceptor, which retries with the same request pointer,
carries the same server-generated RequestId on every attempt. Previously
the RequestId was generated on the clone, so each retry got a fresh UUID
that could not match the stored RequestId, breaking dedup and potentially
causing a double-create if the activity completed between attempts.
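The ordering issue can be illustrated with a minimal sketch. `StartRequest`, `newID`, `populateBuggy`, and `populateFixed` are hypothetical stand-ins (a plain struct copy replaces the proto clone, and random hex replaces the UUID); only the order of "generate ID" and "clone" is meant to match the description above.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// StartRequest is a hypothetical stand-in for the start request proto.
type StartRequest struct {
	ActivityID string
	RequestID  string
}

// newID returns a random 128-bit hex string (stand-in for a UUID).
func newID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// populateBuggy clones first and generates the ID on the clone, so the
// original request never learns the ID and every call produces a new one.
func populateBuggy(req *StartRequest) *StartRequest {
	clone := *req
	if clone.RequestID == "" {
		clone.RequestID = newID()
	}
	return &clone
}

// populateFixed generates the ID on the original before cloning, so every
// later call with the same pointer sees (and copies) the same ID.
func populateFixed(req *StartRequest) *StartRequest {
	if req.RequestID == "" {
		req.RequestID = newID()
	}
	clone := *req
	return &clone
}

func main() {
	// Two calls with the same pointer simulate an initial attempt plus one retry.
	buggy := &StartRequest{ActivityID: "a1"}
	b1, b2 := populateBuggy(buggy), populateBuggy(buggy)
	fmt.Println("clone-then-generate IDs match:", b1.RequestID == b2.RequestID)

	fixed := &StartRequest{ActivityID: "a1"}
	f1, f2 := populateFixed(fixed), populateFixed(fixed)
	fmt.Println("generate-then-clone IDs match:", f1.RequestID == f2.RequestID)
}
```

The trade-off in the fixed ordering is that the caller's request is mutated; in this sketch that mutation is exactly what lets a retry with the same pointer reuse the ID.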
Fix Terminate/CancelActivity request-ID dedup across server-side retries
Same issue as StartActivityExecution: the RequestId was generated on a
clone, so RetryableInterceptor retries got a fresh UUID each time. For
Terminate, this caused a spurious FailedPrecondition error on retry
("already terminated with request ID ..."). For RequestCancel, same
pattern ("cancellation already requested with request ID ...").
Fix: set RequestId on the original request before any clone, matching
the OSS pattern used by workflow handlers.
These handlers had no other mutations requiring a clone, so the clone is
simply removed.
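The Terminate behaviour described above can be modelled with a small sketch. It is not the handler code: `activityState`, `terminate`, and `isFailedPrecondition` are hypothetical names, and the error text only imitates the quoted message. It assumes that a repeated terminate with the stored request ID is treated as a no-op, while a different request ID is rejected.

```go
package main

import (
	"errors"
	"fmt"
)

// errFailedPrecondition is a hypothetical stand-in for the FailedPrecondition service error.
var errFailedPrecondition = errors.New("failed precondition")

// activityState is a toy model of the state consulted for terminate dedup.
type activityState struct {
	terminated         bool
	terminateRequestID string
}

// terminate records the first request ID; a repeat with the same ID is a no-op,
// while a different ID is rejected because the activity is already terminated.
func (a *activityState) terminate(requestID string) error {
	if a.terminated {
		if a.terminateRequestID == requestID {
			return nil // retry of the same logical request
		}
		return fmt.Errorf("%w: already terminated with request ID %s",
			errFailedPrecondition, a.terminateRequestID)
	}
	a.terminated = true
	a.terminateRequestID = requestID
	return nil
}

// isFailedPrecondition reports whether err wraps errFailedPrecondition.
func isFailedPrecondition(err error) bool {
	return errors.Is(err, errFailedPrecondition)
}

func main() {
	a := &activityState{}
	fmt.Println("first attempt:", a.terminate("req-1"))
	fmt.Println("retry, same ID:", a.terminate("req-1"))
	fmt.Println("retry, fresh ID:", a.terminate("req-2"))
}
```

A retry that carries a freshly generated ID falls into the rejected branch, which corresponds to the spurious error that the commit message attributes to generating the ID on a clone.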
…de retries" This reverts commit 885673a.

What changed?
When generating a request ID server-side, set it on the request struct before any cloning so that the mutation is re-used by all retries.
Why?
Without this, there is a bug, although I have not attempted to repro it:
- […] requestID
- requestID is generated and set on a cloned copy
- […] RetryableInterceptor not to receive the Ack (it sees a context expiry)
How did you test it?
commit fa2476c0b235a665f8846e91c0052825317be4d7
Author: Dan Davison <dan.davison@temporal.io>
Date:   2 days ago

    Add functional test for request-ID stability across server-side retries

    Use a package-level atomic to fail StartActivityExecution once after the
    activity is created at history, triggering the RetryableInterceptor.
    Without the fix, the retry generates a new request ID and gets
    ActivityExecutionAlreadyStarted. With the fix, the retry reuses the same
    request ID and the dedup succeeds.

diff --git a/chasm/lib/activity/frontend.go b/chasm/lib/activity/frontend.go
index c013b17c2..2a7b96bad 100644
--- a/chasm/lib/activity/frontend.go
+++ b/chasm/lib/activity/frontend.go
@@ -2,6 +2,7 @@

 import (
 	"context"
+	"sync/atomic"

 	"github.com/google/uuid"
 	apiactivitypb "go.temporal.io/api/activity/v1" //nolint:importas
@@ -35,6 +36,10 @@ type FrontendHandler interface {

 var ErrStandaloneActivityDisabled = serviceerror.NewUnimplemented("Standalone activity is disabled")

+// TestStartFailOnce, when set to true, causes the next StartActivityExecution to return Unavailable
+// after the activity is created. It fires once (CAS to false).
+var TestStartFailOnce atomic.Bool
+
 type frontendHandler struct {
 	FrontendHandler
 	client activitypb.ActivityServiceClient
@@ -100,6 +105,13 @@ func (h *frontendHandler) StartActivityExecution(ctx context.Context, req *workf
 		NamespaceId: namespaceID.String(),
 		FrontendRequest: modifiedReq,
 	})
+	if err != nil {
+		return nil, err
+	}
+
+	if TestStartFailOnce.CompareAndSwap(true, false) {
+		return nil, serviceerror.NewUnavailable("test: injected failure after successful creation")
+	}

 	return resp.GetFrontendResponse(), err
 }
diff --git a/tests/standalone_activity_test.go b/tests/standalone_activity_test.go
index 6afc7b606..d5ded28c1 100644
--- a/tests/standalone_activity_test.go
+++ b/tests/standalone_activity_test.go
@@ -270,6 +270,36 @@ func (s *standaloneActivityTestSuite) TestIDConflictPolicy() {
 	})
 }

+func (s *standaloneActivityTestSuite) TestServerGeneratedRequestIDStableAcrossRetries() {
+	t := s.T()
+	ctx, cancel := context.WithTimeout(t.Context(), 10*time.Second)
+	defer cancel()
+
+	activityID := testcore.RandomizeStr(t.Name())
+	taskQueue := testcore.RandomizeStr(t.Name())
+
+	// Make the handler fail once with a retryable error so the RetryableInterceptor retries.
+	activity.TestStartFailOnce.Store(true)
+
+	resp, err := s.FrontendClient().StartActivityExecution(ctx, &workflowservice.StartActivityExecutionRequest{
+		Namespace: s.Namespace().String(),
+		ActivityId: activityID,
+		ActivityType: s.tv.ActivityType(),
+		Identity: s.tv.WorkerIdentity(),
+		Input: defaultInput,
+		TaskQueue: &taskqueuepb.TaskQueue{
+			Name: taskQueue,
+		},
+		StartToCloseTimeout: durationpb.New(defaultStartToCloseTimeout),
+		// No RequestId — server generates one.
+	})
+	// With the fix, the retry uses the same request ID, so history recognizes it as a dedup
+	// and succeeds (with Started=false). Without the fix, the retry generates a new request ID
+	// and gets ActivityExecutionAlreadyStarted.
+	require.NoError(t, err)
+	require.NotNil(t, resp)
+}
+
 func (s *standaloneActivityTestSuite) TestPollActivityTaskQueue() {
 	t := s.T()
 	ctx, cancel := context.WithTimeout(t.Context(), 10*time.Second)
Potential risks
Could introduce incorrectness into Standalone Activity
Note
Medium Risk
Changes request ID generation semantics for standalone activity Start/Cancel/Terminate paths to improve deduplication across retries; risk is moderate because it touches request mutation behavior that affects idempotency and retry interactions.
Overview
Ensures standalone activity requests reuse a single RequestId across frontend retries by generating the server-side ID before cloning/mutating the request (so subsequent retry attempts see the same ID).
Removes the prior pre-mutation cloning for TerminateActivityExecution and RequestCancelActivityExecution request-ID population, and adds a unit test (frontend_test.go) asserting StartActivityExecution keeps a stable RequestId across multiple validateAndPopulateStartRequest calls.
Written by Cursor Bugbot for commit 885f60d.