Standalone Activity: preserve server-generated request IDs across restarts by dandavison · Pull Request #9724 · temporalio/temporal

dandavison · 2026-03-27T16:44:30Z

What changed?

When generating a request ID server-side, set it on the request struct before any cloning so that the mutation is re-used by all retries.

Why?

Without this, there is a bug, although I have not attempted to repro it:

Request arrives at Frontend without requestID
Inside retry interceptor loop, requestID is generated and set on a cloned copy
Request proceeds to history, starts the execution, but then some networking condition in the cell causes RetryableInterceptor not to receive the Ack (it sees a context expiry)
Frontend retries, generating a new request ID. But meanwhile the activity has completed. This would be rare, but technically possible.
The default reuse policy permits a second execution to be started. This would be a bug: the second start should have been prevented by the request ID. If the user's activity lacks idempotency protection it will lead to incorrectness in the user's systems.

How did you test it?

built
added weak new unit test(s) for the Start case.
manually tested:

commit fa2476c0b235a665f8846e91c0052825317be4d7
Author: Dan Davison <dan.davison@temporal.io>
Date:   2 days ago

    Add functional test for request-ID stability across server-side retries
    
    Use a package-level atomic to fail StartActivityExecution once after the
    activity is created at history, triggering the RetryableInterceptor.
    Without the fix, the retry generates a new request ID and gets
    ActivityExecutionAlreadyStarted. With the fix, the retry reuses the
    same request ID and the dedup succeeds.

diff --git a/chasm/lib/activity/frontend.go b/chasm/lib/activity/frontend.go
index c013b17c2..2a7b96bad 100644
--- a/chasm/lib/activity/frontend.go
+++ b/chasm/lib/activity/frontend.go
@@ -2,6 +2,7 @@
 
 import (
 	"context"
+	"sync/atomic"
 
 	"github.com/google/uuid"
 	apiactivitypb "go.temporal.io/api/activity/v1" //nolint:importas
@@ -35,6 +36,10 @@ type FrontendHandler interface {
 
 var ErrStandaloneActivityDisabled = serviceerror.NewUnimplemented("Standalone activity is disabled")
 
+// TestStartFailOnce, when set to true, causes the next StartActivityExecution to return Unavailable
+// after the activity is created. It fires once (CAS to false).
+var TestStartFailOnce atomic.Bool
+
 type frontendHandler struct {
 	FrontendHandler
 	client            activitypb.ActivityServiceClient
@@ -100,6 +105,13 @@ func (h *frontendHandler) StartActivityExecution(ctx context.Context, req *workf
 		NamespaceId:     namespaceID.String(),
 		FrontendRequest: modifiedReq,
 	})
+	if err != nil {
+		return nil, err
+	}
+
+	if TestStartFailOnce.CompareAndSwap(true, false) {
+		return nil, serviceerror.NewUnavailable("test: injected failure after successful creation")
+	}
 
 	return resp.GetFrontendResponse(), err
 }
diff --git a/tests/standalone_activity_test.go b/tests/standalone_activity_test.go
index 6afc7b606..d5ded28c1 100644
--- a/tests/standalone_activity_test.go
+++ b/tests/standalone_activity_test.go
@@ -270,6 +270,36 @@ func (s *standaloneActivityTestSuite) TestIDConflictPolicy() {
 	})
 }
 
+func (s *standaloneActivityTestSuite) TestServerGeneratedRequestIDStableAcrossRetries() {
+	t := s.T()
+	ctx, cancel := context.WithTimeout(t.Context(), 10*time.Second)
+	defer cancel()
+
+	activityID := testcore.RandomizeStr(t.Name())
+	taskQueue := testcore.RandomizeStr(t.Name())
+
+	// Make the handler fail once with a retryable error so the RetryableInterceptor retries.
+	activity.TestStartFailOnce.Store(true)
+
+	resp, err := s.FrontendClient().StartActivityExecution(ctx, &workflowservice.StartActivityExecutionRequest{
+		Namespace:    s.Namespace().String(),
+		ActivityId:   activityID,
+		ActivityType: s.tv.ActivityType(),
+		Identity:     s.tv.WorkerIdentity(),
+		Input:        defaultInput,
+		TaskQueue: &taskqueuepb.TaskQueue{
+			Name: taskQueue,
+		},
+		StartToCloseTimeout: durationpb.New(defaultStartToCloseTimeout),
+		// No RequestId — server generates one.
+	})
+	// With the fix, the retry uses the same request ID, so history recognizes it as a dedup
+	// and succeeds (with Started=false). Without the fix, the retry generates a new request ID
+	// and gets ActivityExecutionAlreadyStarted.
+	require.NoError(t, err)
+	require.NotNil(t, resp)
+}
+
 func (s *standaloneActivityTestSuite) TestPollActivityTaskQueue() {
 	t := s.T()
 	ctx, cancel := context.WithTimeout(t.Context(), 10*time.Second)

Potential risks

Could introduce incorrectness into Standalone Activity

Note

Medium Risk
Changes request ID generation semantics for standalone activity Start/Cancel/Terminate paths to improve deduplication across retries; risk is moderate because it touches request mutation behavior that affects idempotency and retry interactions.

Overview
Ensures standalone activity requests reuse a single RequestId across frontend retries by generating the server-side ID before cloning/mutating the request (so subsequent retry attempts see the same ID).

Removes the prior pre-mutation cloning for TerminateActivityExecution and RequestCancelActivityExecution request-ID population, and adds a unit test (frontend_test.go) asserting StartActivityExecution keeps a stable RequestId across multiple validateAndPopulateStartRequest calls.

^{Written by Cursor Bugbot for commit 885f60d. This will update automatically on new commits. Configure here.}

stephanos · 2026-03-27T17:48:48Z

chasm/lib/activity/frontend.go


 	if req.GetRequestId() == "" {
-		// Since this mutates the request, we clone it first so that any retries use the original request.
-		req = common.CloneProto(req)


fwiw, in my PR now I've moved this into the validation and normalization methods like this:

if req.GetRequestId() == "" { req.RequestId = uuid.NewString() } else if len(req.GetRequestId()) > config.MaxIDLengthLimit() { return serviceerror.NewInvalidArgumentf("request_id exceeds length limit. Length=%d Limit=%d", len(req.GetRequestId()), config.MaxIDLengthLimit()) }

I've opened a pure refactoring PR to make SAA more consistent with SNO #9795

bergundy

Just FTR, the duplicates can easily happen if a client retries and doesn't specify a request ID. This makes it a bit worse.

dandavison · 2026-04-02T21:18:35Z

@stephanos I've added a non-mergeable manual repro test to commit history and to the PR description (which will become the commit message).

Calls validateAndPopulateStartRequest twice with the same request pointer (simulating RetryableInterceptor retry behavior) and asserts both clones carry the same server-generated RequestId. Fails without the fix in the preceding commit: each clone got a fresh UUID because the RequestId was generated after the clone.

Use a package-level atomic to fail StartActivityExecution once after the activity is created at history, triggering the RetryableInterceptor. Without the fix, the retry generates a new request ID and gets ActivityExecutionAlreadyStarted. With the fix, the retry reuses the same request ID and the dedup succeeds.

Move RequestId generation before the proto clone so that the RetryableInterceptor, which retries with the same request pointer, carries the same server-generated RequestId on every attempt. Previously the RequestId was generated on the clone, so each retry got a fresh UUID that could not match the stored RequestId, breaking dedup and potentially causing a double-create if the activity completed between attempts. Fix Terminate/CancelActivity request-ID dedup across server-side retries Same issue as StartActivityExecution: the RequestId was generated on a clone, so RetryableInterceptor retries got a fresh UUID each time. For Terminate, this caused a spurious FailedPrecondition error on retry ("already terminated with request ID ..."). For RequestCancel, same pattern ("cancellation already requested with request ID ..."). Fix: set RequestId on the original request before any clone, matching the OSS pattern used by workflow handlers. These handlers had no other mutations requiring a clone, so the clone is simply removed.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

chasm/lib/activity/frontend.go

…de retries" This reverts commit 885673a.

dandavison requested a review from a team as a code owner March 27, 2026 16:44

dandavison requested review from fretz12 and stephanos March 27, 2026 16:50

dandavison force-pushed the saa-retain-request-ids branch from f39f920 to 38bd190 Compare March 27, 2026 16:55

stephanos reviewed Mar 27, 2026

View reviewed changes

bergundy approved these changes Mar 27, 2026

View reviewed changes

dandavison force-pushed the saa-retain-request-ids branch 3 times, most recently from dd58d7d to 7d4a409 Compare April 2, 2026 21:13

dandavison requested review from a team as code owners April 2, 2026 21:13

dandavison force-pushed the saa-retain-request-ids branch 2 times, most recently from 013ac41 to 6b6085c Compare April 2, 2026 21:15

dandavison added 3 commits April 2, 2026 17:22

dandavison force-pushed the saa-retain-request-ids branch 2 times, most recently from b80584f to 1446e11 Compare April 2, 2026 21:25

cursor bot reviewed Apr 2, 2026

View reviewed changes

chasm/lib/activity/frontend.go Outdated Show resolved Hide resolved

Revert "Add functional test for request-ID stability across server-si…

885f60d

…de retries" This reverts commit 885673a.

dandavison merged commit f8bbf7c into main Apr 2, 2026
48 checks passed

dandavison deleted the saa-retain-request-ids branch April 2, 2026 23:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standalone Activity: preserve server-generated request IDs across restarts#9724

Standalone Activity: preserve server-generated request IDs across restarts#9724
dandavison merged 4 commits intomainfrom
saa-retain-request-ids

dandavison commented Mar 27, 2026 •

edited by cursor bot

Loading

Uh oh!

stephanos Mar 27, 2026

Uh oh!

dandavison Apr 2, 2026

Uh oh!

bergundy left a comment

Uh oh!

dandavison commented Apr 2, 2026

Uh oh!

cursor bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dandavison commented Mar 27, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed?

Why?

How did you test it?

Potential risks

Uh oh!

stephanos Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

dandavison Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

bergundy left a comment

Choose a reason for hiding this comment

Uh oh!

dandavison commented Apr 2, 2026

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dandavison commented Mar 27, 2026 •

edited by cursor bot

Loading