
Fix data race on nexusEndpointsOwnershipLostCh #9602

Merged
bergundy merged 2 commits into main from paul/fix-nexus-endpoints-race
Mar 27, 2026
Conversation

@paulnpdev
Member

@paulnpdev paulnpdev commented Mar 20, 2026

What changed?

Use atomic.Value with Swap to please the race detector for the race between the watchMembership goroutine (writer) and gRPC goroutines (readers).

Why?

Data race in integration test

How did you test it?

  - [x] built
  - [x] run locally and tested manually
  - [x] covered by existing tests
  - [ ] added new unit test(s)
  - [ ] added new functional test(s)

Potential risks (from @paulnpdev's original commit)

Hard to evaluate whether the code utilizing this field correctly handles the real-world race involving mutation of this field in real time, especially as I am not very familiar with this code. Counting on my reviewers to pay close attention here. Here is Claude's analysis:

Writer (notifyNexusEndpointsOwnershipChange, single goroutine via watchMembership):

  • Loads the current channel (line 2769)
  • Closes it
  • Stores a new channel (line 2770)

Reader A (checkNexusEndpointsOwnership, any gRPC goroutine):

  • Loads the channel atomically (line 2751)
  • Returns it to the caller

Reader B (ListNexusEndpoints, uses the returned channel):

  • Calls checkNexusEndpointsOwnership, gets ownershipLostCh (line 2703)
  • Selects on ownershipLostCh in long-poll loop (line 2735)
Correctness evaluation:

The critical sequence in the writer (lines 2769-2770) is a load-close-store — two separate atomic operations, not one. There's a window between close and store where a concurrent reader could load the already-closed channel. This is actually fine for the intended semantics:

  • If a reader loads the old channel (before or during close): it will see the close signal and correctly report ownership lost.
  • If a reader loads the new channel (after store): it will block until the next ownership change — also correct.
  • If a reader loads the old channel between close and store: it gets a closed channel, which immediately unblocks the select — correctly signaling ownership lost.
One real concern: line 2769 does a second Load() to get the channel to close. If two membership changes fired in rapid succession (the channel is buffered with size 1), could notifyNexusEndpointsOwnershipChange race with itself? No — the comment on line 2761 states this method is only called from the single watchMembership goroutine (the for range loop on line 359). So the writer is single-threaded, and the load-close-store sequence is safe from self-races.

Verdict: The atomic fix is correct. The close-then-store non-atomicity is benign because any interleaving produces the correct outcome (the closed channel signals ownership loss regardless of when a reader observes it).
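The Swap-based version of the writer/reader pattern described above can be sketched as follows. The type and method names here (ownershipSignal, notifyOwnershipChange) are illustrative stand-ins, not the actual identifiers in the PR:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Minimal sketch of the Swap-based fix; names are hypothetical.
type ownershipSignal struct {
	lostCh atomic.Value // always holds a chan struct{}
}

func newOwnershipSignal() *ownershipSignal {
	s := &ownershipSignal{}
	s.lostCh.Store(make(chan struct{}))
	return s
}

// Writer side (the single watchMembership goroutine in the PR): Swap
// publishes the new channel and hands back the old one in a single
// atomic step, replacing the separate Load/Store pair that the race
// detector flagged.
func (s *ownershipSignal) notifyOwnershipChange() {
	old := s.lostCh.Swap(make(chan struct{}))
	close(old.(chan struct{})) // wake every goroutine selecting on it
}

// Reader side (any gRPC goroutine): Load is safe against a concurrent
// Swap and returns either the old (possibly closed) channel or the new
// one; both outcomes are correct, as the analysis above argues.
func (s *ownershipSignal) ownershipLostCh() chan struct{} {
	return s.lostCh.Load().(chan struct{})
}

func main() {
	s := newOwnershipSignal()
	ch := s.ownershipLostCh() // reader subscribes first
	go s.notifyOwnershipChange()
	<-ch // unblocks once the writer closes the old channel
	fmt.Println("observed ownership loss")
}
```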

@paulnpdev paulnpdev requested review from a team as code owners March 20, 2026 17:30
@bergundy bergundy self-requested a review March 23, 2026 17:40
@bergundy
Member

I'm not quite following why the fix is necessary. Isn't the fact that there's a single goroutine that manipulates this value, and does so in a specific order enough to guarantee correctness and eliminate the concern of a data race? What prompted you to make this change in the first place?

@dnr
Contributor

dnr commented Mar 24, 2026

In general, if one goroutine is modifying a value, and another one is reading it, you need some synchronization to ensure that the write is visible to the reader(s), otherwise they may read an arbitrarily stale value, due to cpu caches and other hardware stuff.

The pattern this was copied from, the userdata changed channel, uses a mutex since the broadcast channel is also synchronized with the data itself. It seems weird to have a "something changed" broadcast channel that's not synchronized with the data it's talking about, and if that applies here then this should use a mutex too. If it doesn't apply for some reason, then this could use a mutex or atomic. If it uses an atomic, it should Swap instead of Load/Store.
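For illustration, the mutex-based variant described here, where the broadcast channel is guarded by the same lock as the data it announces, might look like this. All names (endpointState, setOwned, snapshot) are made up for the sketch, not the actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical sketch: the channel and the data it talks about share
// one mutex, so readers always see a consistent pair.
type endpointState struct {
	mu        sync.Mutex
	owned     bool          // stand-in for the data guarded by mu
	changedCh chan struct{} // closed (and replaced) on every change
}

func newEndpointState() *endpointState {
	return &endpointState{owned: true, changedCh: make(chan struct{})}
}

// Writer: mutate the data and signal watchers under one lock, so the
// close is never observed without the data change that caused it.
func (s *endpointState) setOwned(owned bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.owned = owned
	close(s.changedCh)
	s.changedCh = make(chan struct{})
}

// Reader: take a consistent snapshot of the data plus the channel that
// will be closed on the next change.
func (s *endpointState) snapshot() (bool, chan struct{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.owned, s.changedCh
}

func main() {
	s := newEndpointState()
	owned, ch := s.snapshot()
	fmt.Println("owned:", owned)
	go s.setOwned(false)
	<-ch // woken by the change; the data write happened before the close
	owned, _ = s.snapshot()
	fmt.Println("owned:", owned)
}
```

Because the close happens under the same lock as the data write, a reader woken by the channel is guaranteed to observe the updated data on its next snapshot, which is the property a bare atomic on the channel alone does not give you.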

@paulnpdev
Member Author

> I'm not quite following why the fix is necessary. Isn't the fact that there's a single goroutine that manipulates this value, and does so in a specific order enough to guarantee correctness and eliminate the concern of a data race? What prompted you to make this change in the first place?

motivation: tests failing repeatedly in CI with data race errors :(

@bergundy
Member

> In general, if one goroutine is modifying a value, and another one is reading it, you need some synchronization to ensure that the write is visible to the reader(s), otherwise they may read an arbitrarily stale value, due to cpu caches and other hardware stuff.
>
> The pattern this was copied from, the userdata changed channel, uses a mutex since the broadcast channel is also synchronized with the data itself. It seems weird to have a "something changed" broadcast channel that's not synchronized with the data it's talking about, and if that applies here then this should use a mutex too. If it doesn't apply for some reason, then this could use a mutex or atomic. If it uses an atomic, it should Swap instead of Load/Store.

I chatted with Claude on this a bit because this didn't seem like a real issue to me: it's okay if a reader gets the old value while a swap is happening, because that channel will immediately be closed and the reader will resubscribe.
I'm going to fix the CI for this PR and use atomic swaps just to please the Go race detector.

Here's what Claude has to say:

A chan struct{} is a single pointer-sized value. On all Go target architectures, pointer-sized reads and writes are hardware-atomic — you'll never see a torn/partial pointer. And Go 1.19's memory model explicitly guarantees single-word races don't produce "out of thin air" values: the reader will see some value that was actually written.

So in practice, the reader in checkNexusEndpointsOwnership will always see either the old channel or the new channel, never garbage. And both outcomes are semantically correct:

  • Old channel (already closed or about to be closed) → signals ownership lost ✓
  • New channel (not yet closed) → blocks until next ownership change ✓

The honest answer is: on current hardware and current Go compilers, this race is almost certainly benign. There is no realistic scenario where this produces incorrect behavior for this specific pattern.

The reasons to fix it are:

  1. Go's contract says don't. The spec defines unsynchronized concurrent access as a data race, full stop. The compiler is allowed to assume it doesn't happen and optimize accordingly (e.g., caching the pointer in a register, reordering across the read). No current Go compiler does this aggressively, but relying on that is relying on implementation details.
  2. Race detector hygiene. If you tolerate "benign" races, the race detector becomes useless — you can't distinguish them from real bugs. The CI is failing, and there's no way to annotate "this one's fine."
  3. Maintenance hazard. The "it's benign" reasoning is fragile. It depends on the channel being pointer-sized, the read being a single load, and the close-then-replace semantics being idempotent-safe. A future change could break any of these assumptions without anyone re-verifying the reasoning.

So: the synchronization is needed primarily for reason #2 (CI is broken) and secondarily as good hygiene (#1, #3), not because there's a concrete correctness bug on any real platform today.

Use atomic.Value with Swap to fix the data race between the
watchMembership goroutine (writer) and gRPC goroutines (readers).
@bergundy bergundy force-pushed the paul/fix-nexus-endpoints-race branch from 964c518 to 094fb50 Compare March 27, 2026 00:22
@bergundy bergundy requested a review from a team as a code owner March 27, 2026 00:22
@bergundy bergundy changed the title fix data race on unsynchronized channel var Fix data race on nexusEndpointsOwnershipLostCh Mar 27, 2026
@dnr
Contributor

dnr commented Mar 27, 2026

I don't think Claude's analysis is quite right here: without synchronization, there's nothing that guarantees that a reader will ever read the new value, even after an arbitrary amount of time. Yes, it'll read either the new or old value, but if it reads the old value, sees it's closed, and then tries to read it again, it can still get the old value.

This is probably unlikely or maybe impossible on amd64 and arm64 (I forgot the details), but in theory it's possible, and that's why the Go memory model is written that way, so we should follow it.

@bergundy
Member

Yeah, I highly doubt that this problem is ever going to manifest. Also consider that when ownership changes, requests should eventually go to a new host (they may come back to the original host). I'm ready to merge this change because it obviously doesn't hurt to have this fix in.

@bergundy bergundy force-pushed the paul/fix-nexus-endpoints-race branch from 9d822a0 to a656267 Compare March 27, 2026 21:40
@bergundy bergundy merged commit 0b5543c into main Mar 27, 2026
75 of 79 checks passed
@bergundy bergundy deleted the paul/fix-nexus-endpoints-race branch March 27, 2026 22:21
@paulnpdev
Member Author

> Yeah, I highly doubt that this problem is ever going to manifest. Also consider that when ownership changes, requests should eventually go to a new host (they may come back to the original host). I'm ready to merge this change because it obviously doesn't hurt to have this fix in.

if by "manifest" you include "cause CDS integration tests to fail" then it absolutely manifests (as race detection failure). That's what brought me here in the first place. If you only include "manifest in meaningful way in production", I'm inclined to agree :)

chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026
Co-authored-by: Roey Berman <roey.berman@gmail.com>
chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026
chaptersix pushed a commit that referenced this pull request Apr 2, 2026
chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026