-
Notifications
You must be signed in to change notification settings - Fork 51
Add fail safe in pcstore.WatchAndSync to protect missing pod clusters. #755
Add fail safe in pcstore.WatchAndSync to protect missing pod clusters. #755
Conversation
To protect against data loss or misfiring consul queries, this commit adds a failsafe to the pcstore.WatchAndSync function in which it will not call "SyncCluster" or "DeleteCluster" when the pod cluster watch returns no data. This is because a syncer may wish to perform destructive action at the time of a pod cluster deletion, and if the entire set goes away it may be disastrous.
}() | ||
|
||
select { | ||
case <-time.After(1 * time.Second): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
writing a test that must take a second to pass makes me sad but short of passing an error on a channel within the failsafe if
to the syncing routine i'm not sure how to avoid it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I also decided I could do no better in https://github.com/square/p2/pull/750/files#diff-465f690c6c15563e83acbe3a7e9a4e2cR986 or https://github.com/square/p2/pull/751/files#diff-465f690c6c15563e83acbe3a7e9a4e2cR990
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracting the contents of the for { select {
in pcstore/consul_store.go
into a helper function would allow for unit testing its behaviour. The method looks a bit like channel soup to me, so idk if it's worth the trade.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could do better, yeah. The behavior I don't want to happen (DeleteCluster()
called) is in a different function though that is communicated with by a channel. I think it'd be fairly invasive to reorganize. Something we should look at in a later PR i think
go func() { | ||
err := store.WatchAndSync(syncer, quit) | ||
if err != nil { | ||
t.Fatalf("Couldn't start WatchAndSync(): %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
funny story - Fatalf
calls FailNow
, and https://golang.org/pkg/testing/#B.FailNow "FailNow must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test."
Now, I'm not sure what happens if some other goroutine calls it. Perhaps we could try?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, i've observed that behavior in the past but never seen an explanation. Yeah i think it won't work in its current state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case it happens that err
can't be nil because only GetInitialClusters()
returning an error can cause it, and our mock implementation unconditionally returns a nil
error
To protect against data loss or misfiring consul queries, this commit
adds a failsafe to the pcstore.WatchAndSync function in which it will
not call "SyncCluster" or "DeleteCluster" when the pod cluster watch
returns no data.
This is because a syncer may wish to perform destructive action at the
time of a pod cluster deletion, and if the entire set goes away it may
be disastrous.