
Fix data race on nexusEndpointsOwnershipLostCh #9602

Merged
bergundy merged 2 commits into main from paul/fix-nexus-endpoints-race
Mar 27, 2026
Conversation

@paulnpdev
Member

@paulnpdev paulnpdev commented Mar 20, 2026

What changed?

Use atomic.Value with Swap to please the race detector for the race between the watchMembership goroutine (writer) and gRPC goroutines (readers).

Why?

Data race in integration test

How did you test it?

  - [x] built
  - [x] run locally and tested manually
  - [x] covered by existing tests
  - [ ] added new unit test(s)
  - [ ] added new functional test(s)

Potential risks (from @paulnpdev's original commit)

Hard to evaluate whether the code utilizing this field correctly handles the real-world race involving mutation of this field in real time, especially as I am not very familiar with this code. Counting on my reviewers to pay close attention here. Here is Claude's analysis:

Writer (notifyNexusEndpointsOwnershipChange, single goroutine via watchMembership):

  • Loads the current channel (line 2769)
  • Closes it
  • Stores a new channel (line 2770)

Reader A (checkNexusEndpointsOwnership, any gRPC goroutine):

  • Loads the channel atomically (line 2751)
  • Returns it to the caller

Reader B (ListNexusEndpoints, uses the returned channel):

  • Calls checkNexusEndpointsOwnership, gets ownershipLostCh (line 2703)
  • Selects on ownershipLostCh in long-poll loop (line 2735)
Correctness evaluation:

The critical sequence in the writer (lines 2769-2770) is a load-close-store — two separate atomic operations, not one. There's a window between close and store where a concurrent reader could load the already-closed channel. This is actually fine for the intended semantics:

  • If a reader loads the old channel (before or during close): it will see the close signal and correctly report ownership lost.
  • If a reader loads the new channel (after store): it will block until the next ownership change — also correct.
  • If a reader loads the old channel between close and store: it gets a closed channel, which immediately unblocks the select — correctly signaling ownership lost.
One real concern: line 2769 does a second Load() to get the channel to close. If two membership changes fired in rapid succession (the channel is buffered with size 1), could notifyNexusEndpointsOwnershipChange race with itself? No — the comment on line 2761 states this method is only called from the single watchMembership goroutine (the for range loop on line 359). So the writer is single-threaded, and the load-close-store sequence is safe from self-races.

Verdict: The atomic fix is correct. The close-then-store non-atomicity is benign because any interleaving produces the correct outcome (the closed channel signals ownership loss regardless of when a reader observes it).
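The Swap-based version of the writer/reader pattern described above can be sketched as follows. The type and method names here (ownershipSignal, notifyOwnershipChange) are illustrative stand-ins, not the actual identifiers in the PR:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Minimal sketch of the Swap-based fix; names are hypothetical.
type ownershipSignal struct {
	lostCh atomic.Value // always holds a chan struct{}
}

func newOwnershipSignal() *ownershipSignal {
	s := &ownershipSignal{}
	s.lostCh.Store(make(chan struct{}))
	return s
}

// Writer side (the single watchMembership goroutine in the PR): Swap
// publishes the new channel and hands back the old one in a single
// atomic step, replacing the separate Load/Store pair that the race
// detector flagged.
func (s *ownershipSignal) notifyOwnershipChange() {
	old := s.lostCh.Swap(make(chan struct{}))
	close(old.(chan struct{})) // wake every goroutine selecting on it
}

// Reader side (any gRPC goroutine): Load is safe against a concurrent
// Swap and returns either the old (possibly closed) channel or the new
// one; both outcomes are correct, as the analysis above argues.
func (s *ownershipSignal) ownershipLostCh() chan struct{} {
	return s.lostCh.Load().(chan struct{})
}

func main() {
	s := newOwnershipSignal()
	ch := s.ownershipLostCh() // reader subscribes first
	go s.notifyOwnershipChange()
	<-ch // unblocks once the writer closes the old channel
	fmt.Println("observed ownership loss")
}
```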

@paulnpdev paulnpdev requested review from a team as code owners March 20, 2026 17:30
@bergundy bergundy self-requested a review March 23, 2026 17:40
@bergundy
Member

I'm not quite following why the fix is necessary. Isn't the fact that there's a single goroutine that manipulates this value, and does so in a specific order enough to guarantee correctness and eliminate the concern of a data race? What prompted you to make this change in the first place?

@dnr
Contributor

dnr commented Mar 24, 2026

In general, if one goroutine is modifying a value, and another one is reading it, you need some synchronization to ensure that the write is visible to the reader(s), otherwise they may read an arbitrarily stale value, due to cpu caches and other hardware stuff.

The pattern this was copied from, the userdata changed channel, uses a mutex since the broadcast channel is also synchronized with the data itself. It seems weird to have a "something changed" broadcast channel that's not synchronized with the data it's talking about, and if that applies here then this should use a mutex too. If it doesn't apply for some reason, then this could use a mutex or atomic. If it uses an atomic, it should Swap instead of Load/Store.
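For illustration, the mutex-based variant described here, where the broadcast channel is guarded by the same lock as the data it announces, might look like this. All names (endpointState, setOwned, snapshot) are made up for the sketch, not the actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical sketch: the channel and the data it talks about share
// one mutex, so readers always see a consistent pair.
type endpointState struct {
	mu        sync.Mutex
	owned     bool          // stand-in for the data guarded by mu
	changedCh chan struct{} // closed (and replaced) on every change
}

func newEndpointState() *endpointState {
	return &endpointState{owned: true, changedCh: make(chan struct{})}
}

// Writer: mutate the data and signal watchers under one lock, so the
// close is never observed without the data change that caused it.
func (s *endpointState) setOwned(owned bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.owned = owned
	close(s.changedCh)
	s.changedCh = make(chan struct{})
}

// Reader: take a consistent snapshot of the data plus the channel that
// will be closed on the next change.
func (s *endpointState) snapshot() (bool, chan struct{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.owned, s.changedCh
}

func main() {
	s := newEndpointState()
	owned, ch := s.snapshot()
	fmt.Println("owned:", owned)
	go s.setOwned(false)
	<-ch // woken by the change; the data write happened before the close
	owned, _ = s.snapshot()
	fmt.Println("owned:", owned)
}
```

Because the close happens under the same lock as the data write, a reader woken by the channel is guaranteed to observe the updated data on its next snapshot, which is the property a bare atomic on the channel alone does not give you.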

@paulnpdev
Member Author

> I'm not quite following why the fix is necessary. Isn't the fact that there's a single goroutine that manipulates this value, and does so in a specific order enough to guarantee correctness and eliminate the concern of a data race? What prompted you to make this change in the first place?

motivation: tests failing repeatedly in CI with data race errors :(

@bergundy
Member

> In general, if one goroutine is modifying a value, and another one is reading it, you need some synchronization to ensure that the write is visible to the reader(s), otherwise they may read an arbitrarily stale value, due to cpu caches and other hardware stuff.
>
> The pattern this was copied from, the userdata changed channel, uses a mutex since the broadcast channel is also synchronized with the data itself. It seems weird to have a "something changed" broadcast channel that's not synchronized with the data it's talking about, and if that applies here then this should use a mutex too. If it doesn't apply for some reason, then this could use a mutex or atomic. If it uses an atomic, it should Swap instead of Load/Store.

I chatted with Claude on this a bit because this didn't seem like a real issue to me: it's okay if a reader gets the old value while a swap is happening, because that channel will immediately be closed and the reader will resubscribe.
I'm going to fix the CI for this PR and use atomic swaps just to please the Go race detector.

Here's what Claude has to say:

A chan struct{} is a single pointer-sized value. On all Go target architectures, pointer-sized reads and writes are hardware-atomic — you'll never see a torn/partial pointer. And Go 1.19's memory model explicitly guarantees single-word races don't produce "out of thin air" values: the reader will see some value that was actually written.

So in practice, the reader in checkNexusEndpointsOwnership will always see either the old channel or the new channel, never garbage. And both outcomes are semantically correct:

  • Old channel (already closed or about to be closed) → signals ownership lost ✓
  • New channel (not yet closed) → blocks until next ownership change ✓

The honest answer is: on current hardware and current Go compilers, this race is almost certainly benign. There is no realistic scenario where this produces incorrect behavior for this specific pattern.

The reasons to fix it are:

  1. Go's contract says don't. The spec defines unsynchronized concurrent access as a data race, full stop. The compiler is allowed to assume it doesn't happen and optimize accordingly (e.g., caching the pointer in a register, reordering across the read). No current Go compiler does this aggressively, but relying on that is relying on implementation details.
  2. Race detector hygiene. If you tolerate "benign" races, the race detector becomes useless — you can't distinguish them from real bugs. The CI is failing, and there's no way to annotate "this one's fine."
  3. Maintenance hazard. The "it's benign" reasoning is fragile. It depends on the channel being pointer-sized, the read being a single load, and the close-then-replace semantics being idempotent-safe. A future change could break any of these assumptions without anyone re-verifying the reasoning.

So: the synchronization is needed primarily for reason #2 (CI is broken) and secondarily as good hygiene (#1, #3), not because there's a concrete correctness bug on any real platform today.

Use atomic.Value with Swap to fix the data race between the
watchMembership goroutine (writer) and gRPC goroutines (readers).
@bergundy bergundy force-pushed the paul/fix-nexus-endpoints-race branch from 964c518 to 094fb50 Compare March 27, 2026 00:22
@bergundy bergundy requested a review from a team as a code owner March 27, 2026 00:22
@bergundy bergundy changed the title fix data race on unsynchronized channel var Fix data race on nexusEndpointsOwnershipLostCh Mar 27, 2026
@dnr
Contributor

dnr commented Mar 27, 2026

I don't think Claude's analysis is quite right here: without synchronization, there's nothing that guarantees that a reader will ever read the new value, even after an arbitrary amount of time. Yes, it'll read either the new or old value, but if it reads the old value, sees it's closed, and then tries to read it again, it can still get the old value.

This is probably unlikely or maybe impossible on amd64 and arm64 (I forgot the details), but in theory it's possible, and that's why the Go memory model is written that way, so we should follow it.

@bergundy
Member

Yeah, I highly doubt that this problem is ever going to manifest. Also consider that when ownership changes, requests should eventually go to a new host (they may come back to the original host). I'm ready to merge this change because it obviously doesn't hurt to have this fix in.

@bergundy bergundy force-pushed the paul/fix-nexus-endpoints-race branch from 9d822a0 to a656267 Compare March 27, 2026 21:40
@bergundy bergundy merged commit 0b5543c into main Mar 27, 2026
75 of 79 checks passed
@bergundy bergundy deleted the paul/fix-nexus-endpoints-race branch March 27, 2026 22:21
@paulnpdev
Member Author

> Yeah, I highly doubt that this problem is ever going to manifest. Also consider that when ownership changes, requests should eventually go to a new host (they may come back to the original host). I'm ready to merge this change because it obviously doesn't hurt to have this fix in.

if by "manifest" you include "cause CDS integration tests to fail" then it absolutely manifests (as race detection failure). That's what brought me here in the first place. If you only include "manifest in meaningful way in production", I'm inclined to agree :)

chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026
Co-authored-by: Roey Berman <roey.berman@gmail.com>
chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026
chaptersix pushed a commit that referenced this pull request Apr 2, 2026
chaptersix pushed a commit to chaptersix/temporal that referenced this pull request Apr 2, 2026