Skip to content

Standalone Activity: preserve server-generated request IDs across restarts#9724

Merged
dandavison merged 4 commits intomainfrom
saa-retain-request-ids
Apr 2, 2026
Merged

Standalone Activity: preserve server-generated request IDs across restarts#9724
dandavison merged 4 commits intomainfrom
saa-retain-request-ids

Conversation

@dandavison
Copy link
Copy Markdown
Contributor

@dandavison dandavison commented Mar 27, 2026

What changed?

When generating a request ID server-side, set it on the request struct before any cloning so that the mutation is re-used by all retries.

Why?

Without this, there is a bug, although I have not attempted to repro it:

  1. Request arrives at Frontend without requestID
  2. Inside retry interceptor loop, requestID is generated and set on a cloned copy
  3. Request proceeds to history, starts the execution, but then some networking condition in the cell causes RetryableInterceptor not to receive the Ack (it sees a context expiry)
  4. Frontend retries, generating a new request ID. But meanwhile the activity has completed. This would be rare, but technically possible.
  5. The default reuse policy permits a second execution to be started. This would be a bug: the second start should have been prevented by the request ID. If the user's activity lacks idempotency protection it will lead to incorrectness in the user's systems.

How did you test it?

  • built
  • added weak new unit test(s) for the Start case.
  • manually tested:
commit fa2476c0b235a665f8846e91c0052825317be4d7
Author: Dan Davison <dan.davison@temporal.io>
Date:   2 days ago

    Add functional test for request-ID stability across server-side retries
    
    Use a package-level atomic to fail StartActivityExecution once after the
    activity is created at history, triggering the RetryableInterceptor.
    Without the fix, the retry generates a new request ID and gets
    ActivityExecutionAlreadyStarted. With the fix, the retry reuses the
    same request ID and the dedup succeeds.

diff --git a/chasm/lib/activity/frontend.go b/chasm/lib/activity/frontend.go
index c013b17c2..2a7b96bad 100644
--- a/chasm/lib/activity/frontend.go
+++ b/chasm/lib/activity/frontend.go
@@ -2,6 +2,7 @@
 
 import (
 	"context"
+	"sync/atomic"
 
 	"github.com/google/uuid"
 	apiactivitypb "go.temporal.io/api/activity/v1" //nolint:importas
@@ -35,6 +36,10 @@ type FrontendHandler interface {
 
 var ErrStandaloneActivityDisabled = serviceerror.NewUnimplemented("Standalone activity is disabled")
 
+// TestStartFailOnce, when set to true, causes the next StartActivityExecution to return Unavailable
+// after the activity is created. It fires once (CAS to false).
+var TestStartFailOnce atomic.Bool
+
 type frontendHandler struct {
 	FrontendHandler
 	client            activitypb.ActivityServiceClient
@@ -100,6 +105,13 @@ func (h *frontendHandler) StartActivityExecution(ctx context.Context, req *workf
 		NamespaceId:     namespaceID.String(),
 		FrontendRequest: modifiedReq,
 	})
+	if err != nil {
+		return nil, err
+	}
+
+	if TestStartFailOnce.CompareAndSwap(true, false) {
+		return nil, serviceerror.NewUnavailable("test: injected failure after successful creation")
+	}
 
 	return resp.GetFrontendResponse(), err
 }
diff --git a/tests/standalone_activity_test.go b/tests/standalone_activity_test.go
index 6afc7b606..d5ded28c1 100644
--- a/tests/standalone_activity_test.go
+++ b/tests/standalone_activity_test.go
@@ -270,6 +270,36 @@ func (s *standaloneActivityTestSuite) TestIDConflictPolicy() {
 	})
 }
 
+func (s *standaloneActivityTestSuite) TestServerGeneratedRequestIDStableAcrossRetries() {
+	t := s.T()
+	ctx, cancel := context.WithTimeout(t.Context(), 10*time.Second)
+	defer cancel()
+
+	activityID := testcore.RandomizeStr(t.Name())
+	taskQueue := testcore.RandomizeStr(t.Name())
+
+	// Make the handler fail once with a retryable error so the RetryableInterceptor retries.
+	activity.TestStartFailOnce.Store(true)
+
+	resp, err := s.FrontendClient().StartActivityExecution(ctx, &workflowservice.StartActivityExecutionRequest{
+		Namespace:    s.Namespace().String(),
+		ActivityId:   activityID,
+		ActivityType: s.tv.ActivityType(),
+		Identity:     s.tv.WorkerIdentity(),
+		Input:        defaultInput,
+		TaskQueue: &taskqueuepb.TaskQueue{
+			Name: taskQueue,
+		},
+		StartToCloseTimeout: durationpb.New(defaultStartToCloseTimeout),
+		// No RequestId — server generates one.
+	})
+	// With the fix, the retry uses the same request ID, so history recognizes it as a dedup
+	// and succeeds (with Started=false). Without the fix, the retry generates a new request ID
+	// and gets ActivityExecutionAlreadyStarted.
+	require.NoError(t, err)
+	require.NotNil(t, resp)
+}
+
 func (s *standaloneActivityTestSuite) TestPollActivityTaskQueue() {
 	t := s.T()
 	ctx, cancel := context.WithTimeout(t.Context(), 10*time.Second)

Potential risks

Could introduce incorrectness into Standalone Activity


Note

Medium Risk
Changes request ID generation semantics for standalone activity Start/Cancel/Terminate paths to improve deduplication across retries; risk is moderate because it touches request mutation behavior that affects idempotency and retry interactions.

Overview
Ensures standalone activity requests reuse a single RequestId across frontend retries by generating the server-side ID before cloning/mutating the request (so subsequent retry attempts see the same ID).

Removes the prior pre-mutation cloning for TerminateActivityExecution and RequestCancelActivityExecution request-ID population, and adds a unit test (frontend_test.go) asserting StartActivityExecution keeps a stable RequestId across multiple validateAndPopulateStartRequest calls.

Written by Cursor Bugbot for commit 885f60d. This will update automatically on new commits. Configure here.

@dandavison dandavison requested a review from a team as a code owner March 27, 2026 16:44
@dandavison dandavison requested review from fretz12 and stephanos March 27, 2026 16:50
@dandavison dandavison force-pushed the saa-retain-request-ids branch from f39f920 to 38bd190 Compare March 27, 2026 16:55

if req.GetRequestId() == "" {
// Since this mutates the request, we clone it first so that any retries use the original request.
req = common.CloneProto(req)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, in my PR now I've moved this into the validation and normalization methods like this:

    if req.GetRequestId() == "" {
		req.RequestId = uuid.NewString()
	} else if len(req.GetRequestId()) > config.MaxIDLengthLimit() {
		return serviceerror.NewInvalidArgumentf("request_id exceeds length limit. Length=%d Limit=%d",
			len(req.GetRequestId()), config.MaxIDLengthLimit())
	}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened a pure refactoring PR to make SAA more consistent with SNO #9795

Copy link
Copy Markdown
Member

@bergundy bergundy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FTR, the duplicates can easily happen if a client retries and doesn't specify a request ID. This makes it a bit worse.

@dandavison dandavison force-pushed the saa-retain-request-ids branch 3 times, most recently from dd58d7d to 7d4a409 Compare April 2, 2026 21:13
@dandavison dandavison requested review from a team as code owners April 2, 2026 21:13
@dandavison dandavison force-pushed the saa-retain-request-ids branch 2 times, most recently from 013ac41 to 6b6085c Compare April 2, 2026 21:15
@dandavison
Copy link
Copy Markdown
Contributor Author

@stephanos I've added a non-mergeable manual repro test to commit history and to the PR description (which will become the commit message).

Calls validateAndPopulateStartRequest twice with the same request
pointer (simulating RetryableInterceptor retry behavior) and asserts
both clones carry the same server-generated RequestId.

Fails without the fix in the preceding commit: each clone got a fresh
UUID because the RequestId was generated after the clone.
Use a package-level atomic to fail StartActivityExecution once after the
activity is created at history, triggering the RetryableInterceptor.
Without the fix, the retry generates a new request ID and gets
ActivityExecutionAlreadyStarted. With the fix, the retry reuses the
same request ID and the dedup succeeds.
Move RequestId generation before the proto clone so that the
RetryableInterceptor, which retries with the same request pointer,
carries the same server-generated RequestId on every attempt. Previously
the RequestId was generated on the clone, so each retry got a fresh UUID
that could not match the stored RequestId, breaking dedup and potentially
causing a double-create if the activity completed between attempts.

Fix Terminate/CancelActivity request-ID dedup across server-side retries

Same issue as StartActivityExecution: the RequestId was generated on a
clone, so RetryableInterceptor retries got a fresh UUID each time. For
Terminate, this caused a spurious FailedPrecondition error on retry
("already terminated with request ID ..."). For RequestCancel, same
pattern ("cancellation already requested with request ID ...").

Fix: set RequestId on the original request before any clone, matching
the OSS pattern used by workflow handlers.

These handlers had no other mutations requiring a clone, so the clone is
simply removed.
@dandavison dandavison force-pushed the saa-retain-request-ids branch 2 times, most recently from b80584f to 1446e11 Compare April 2, 2026 21:25
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

@dandavison dandavison merged commit f8bbf7c into main Apr 2, 2026
48 checks passed
@dandavison dandavison deleted the saa-retain-request-ids branch April 2, 2026 23:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants