Fix data race on nexusEndpointsOwnershipLostCh #9602
Conversation
I'm not quite following why the fix is necessary. Isn't the fact that there's a single goroutine that manipulates this value, and does so in a specific order, enough to guarantee correctness and eliminate the concern of a data race? What prompted you to make this change in the first place?
In general, if one goroutine is modifying a value and another one is reading it, you need some synchronization to ensure that the write is visible to the reader(s); otherwise they may read an arbitrarily stale value, due to CPU caches and other hardware details. The pattern this was copied from, the userdata changed channel, uses a mutex, since the broadcast channel is also synchronized with the data itself. It seems weird to have a "something changed" broadcast channel that's not synchronized with the data it's talking about, and if that applies here then this should use a mutex too. If it doesn't apply for some reason, then this could use a mutex or an atomic. If it uses an atomic, it should Swap instead of Load/Store.
Motivation: tests failing repeatedly in CI with data race errors :(
I chatted with Claude about this a bit because it didn't seem like a real issue to me: it's okay if a reader gets the old value while a swap is happening, because that channel will immediately be closed and the reader will resubscribe. Here's what Claude has to say:
Use atomic.Value with Swap to fix the data race between the watchMembership goroutine (writer) and gRPC goroutines (readers).
I don't think Claude's analysis is quite right here: without synchronization, there's nothing that guarantees that a reader will ever read the new value, even after an arbitrary amount of time. Yes, it'll read either the new or the old value, but if it reads the old value, sees it's closed, and then tries to read it again, it can still get the old value. This is probably unlikely, or maybe impossible, on amd64 and arm64 (I forget the details), but in theory it can happen, and that's why the Go memory model is written the way it is, so we should follow it.
Yeah, I highly doubt that this problem is ever going to manifest. Also consider that when ownership changes, requests should eventually go to a new host (they may come back to the original host). I'm ready to merge this change because it obviously doesn't hurt to have this fix in.
If by "manifest" you include "cause CDS integration tests to fail", then it absolutely manifests (as a race-detection failure). That's what brought me here in the first place. If you only include "manifest in a meaningful way in production", I'm inclined to agree :)
## What changed?
Use atomic.Value with Swap to please the race detector for the race between the watchMembership goroutine (writer) and gRPC goroutines (readers).
## Why?
Data race in integration test
## How did you test it?

- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## Potential risks (from @paulnpdev's original commit)
Hard to evaluate whether the code utilizing this field correctly handles the real-world race involving mutation of this field in real time, especially as I am not very familiar with this code. Counting on my reviewers to pay close attention here. Here is Claude's analysis:
**Writer** (`notifyNexusEndpointsOwnershipChange`, single goroutine via `watchMembership`):
- Loads the current channel (line 2769)
- Closes it
- Stores a new channel (line 2770)

**Reader A** (`checkNexusEndpointsOwnership`, any gRPC goroutine):
- Loads the channel atomically (line 2751)
- Returns it to the caller

**Reader B** (`ListNexusEndpoints`, uses the returned channel):
- Calls `checkNexusEndpointsOwnership`, gets `ownershipLostCh` (line 2703)
- Selects on `ownershipLostCh` in the long-poll loop (line 2735)
Correctness evaluation:
The critical sequence in the writer (lines 2769-2770) is load-close-store: the load and store are two separate atomic operations, not a single atomic swap. There's a window between the close and the store where a concurrent reader could load the already-closed channel. This is actually fine for the intended semantics:
- If a reader loads the old channel (before or during close): it will see the close signal and correctly report ownership lost.
- If a reader loads the new channel (after store): it will block until the next ownership change, which is also correct.
- If a reader loads the old channel between close and store: it gets a closed channel, which immediately unblocks the select, correctly signaling ownership lost.
One real concern: line 2769 does a second Load() to get the channel to close. If two membership changes fired in rapid succession (the channel is buffered with size 1), could notifyNexusEndpointsOwnershipChange race with itself? No — the comment on line 2761 states this method is only called from the single watchMembership goroutine (the for range loop on line 359). So the writer is single-threaded, and the load-close-store sequence is safe from self-races.
Verdict: The atomic fix is correct. The close-then-store non-atomicity is benign because any interleaving produces the correct outcome (the closed channel signals ownership loss regardless of when a reader observes it).