This repository has been archived by the owner on Apr 29, 2020. It is now read-only.

Add fail safe in pcstore.WatchAndSync to protect missing pod clusters. #755

@mpuncel mpuncel commented Feb 9, 2017

To protect against data loss or misfiring Consul queries, this commit
adds a failsafe to the pcstore.WatchAndSync function: it will not call
"SyncCluster" or "DeleteCluster" when the pod cluster watch returns no
data.

A syncer may perform a destructive action when a pod cluster is
deleted, so if the entire set appears to go away at once, the result
could be disastrous.
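
As a rough illustration of the failsafe described above, here is a minimal sketch. It is not the code from this PR: PodCluster, Syncer, applyWatchResult and recordingSyncer are hypothetical, simplified stand-ins, and the real pcstore types and signatures may differ. The only point is the early return that refuses to sync or delete when the watch result is empty.

```go
package main

import "fmt"

// PodCluster is a hypothetical stand-in for the real pod cluster type.
type PodCluster struct{ ID string }

// Syncer is an assumed, simplified version of the syncer interface.
type Syncer interface {
	SyncCluster(pc PodCluster) error
	DeleteCluster(id string) error
}

// applyWatchResult (hypothetical helper) reconciles one watch result against
// the previously known set. It returns the set to remember for the next round
// and whether the failsafe was triggered.
func applyWatchResult(s Syncer, prev map[string]PodCluster, watched []PodCluster) (map[string]PodCluster, bool) {
	if len(watched) == 0 {
		// Failsafe: an empty result is treated as suspect, so neither
		// SyncCluster nor DeleteCluster is called and prev is kept as-is.
		return prev, true
	}
	next := make(map[string]PodCluster, len(watched))
	for _, pc := range watched {
		next[pc.ID] = pc
		_ = s.SyncCluster(pc) // error handling omitted in this sketch
	}
	for id := range prev {
		if _, ok := next[id]; !ok {
			_ = s.DeleteCluster(id) // error handling omitted in this sketch
		}
	}
	return next, false
}

// recordingSyncer records calls so the behaviour can be inspected.
type recordingSyncer struct{ synced, deleted []string }

func (r *recordingSyncer) SyncCluster(pc PodCluster) error {
	r.synced = append(r.synced, pc.ID)
	return nil
}

func (r *recordingSyncer) DeleteCluster(id string) error {
	r.deleted = append(r.deleted, id)
	return nil
}

func main() {
	s := &recordingSyncer{}
	known, _ := applyWatchResult(s, map[string]PodCluster{}, []PodCluster{{ID: "a"}, {ID: "b"}})
	// The watch now returns nothing: the failsafe keeps both clusters.
	known, skipped := applyWatchResult(s, known, nil)
	fmt.Println(skipped, len(known), len(s.deleted)) // prints "true 2 0"
}
```

One consequence of this design is that deleting the very last pod cluster is never propagated to the syncer, which seems to be the intended trade-off here.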

}()

select {
case <-time.After(1 * time.Second):
Collaborator Author

Writing a test that must take a full second to pass makes me sad, but short of passing an error on a channel from inside the failsafe "if" to the syncing routine, I'm not sure how to avoid it.

Contributor

Extracting the contents of the for { select { in pcstore/consul_store.go into a helper function would allow for unit testing its behaviour. The method looks a bit like channel soup to me, so idk if it's worth the trade.

Collaborator Author

I think we could do better, yeah. The behavior I don't want to happen (DeleteCluster() being called) lives in a different function, though, one that is communicated with over a channel. I think it'd be fairly invasive to reorganize. Something we should look at in a later PR, I think.
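
For what it's worth, the "pass an error on a channel from inside the failsafe" idea floated in this thread could look roughly like the sketch below. Everything here is an assumption for illustration: watchLoop, errEmptyWatch and the channel layout are hypothetical and much simpler than the real loop in pcstore/consul_store.go. The point is only that a test can block on the error channel and observe the failsafe right away, instead of waiting on time.After(1 * time.Second) to see that nothing happened.

```go
package main

import (
	"errors"
	"fmt"
)

// errEmptyWatch is a hypothetical sentinel reported when the failsafe triggers.
var errEmptyWatch = errors.New("pod cluster watch returned no data; skipping sync")

// watchLoop is a hypothetical, simplified watch loop. Non-empty results are
// forwarded on synced; empty results trigger the failsafe and are reported on
// errs instead of being forwarded.
func watchLoop(results <-chan []string, quit <-chan struct{}, synced chan<- []string, errs chan<- error) {
	for {
		select {
		case <-quit:
			return
		case ids := <-results:
			if len(ids) == 0 {
				// Failsafe: report the empty result rather than syncing.
				select {
				case errs <- errEmptyWatch:
				case <-quit:
					return
				}
				continue
			}
			select {
			case synced <- ids:
			case <-quit:
				return
			}
		}
	}
}

func main() {
	results := make(chan []string)
	quit := make(chan struct{})
	synced := make(chan []string)
	errs := make(chan error)
	go watchLoop(results, quit, synced, errs)

	results <- nil // simulate a watch that returned no pod clusters

	// No sleep needed: exactly one of these channels becomes ready.
	select {
	case err := <-errs:
		fmt.Println("failsafe reported:", err)
	case ids := <-synced:
		fmt.Println("unexpected sync of", ids)
	}
	close(quit)
}
```

The cost is an extra channel that every caller has to drain, which is part of why this was judged too invasive for this PR.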

go func() {
	err := store.WatchAndSync(syncer, quit)
	if err != nil {
		t.Fatalf("Couldn't start WatchAndSync(): %s", err)
Collaborator

Funny story: Fatalf calls FailNow, and per https://golang.org/pkg/testing/#B.FailNow, "FailNow must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test."

Now, I'm not sure what happens if some other goroutine calls it. Perhaps we could try?

Collaborator Author

Interesting, I've observed that behavior in the past but never seen an explanation. Yeah, I think it won't work in its current state.

Collaborator Author

In this case it happens that err can never be non-nil, because only GetInitialClusters() returning an error can cause that, and our mock implementation unconditionally returns a nil error.
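
If this does get revisited, one common way around the FailNow restriction is to hand the goroutine's error back over a channel and let the test goroutine decide whether to fail. The sketch below is an assumption-laden illustration, not code from this repository: startAsync and errStartFailed are hypothetical names, and main stands in for the test function. The testing package docs also note that the Error and Log variants may be called from multiple goroutines, so t.Errorf inside the goroutine would be another option when stopping the test immediately is not required.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errStartFailed is a hypothetical error used only for this demonstration.
var errStartFailed = errors.New("couldn't start watch")

// startAsync (hypothetical helper) runs fn in its own goroutine and returns a
// channel that receives fn's result. The channel is buffered so the goroutine
// can exit even if nobody ever reads the result.
func startAsync(fn func() error) <-chan error {
	errCh := make(chan error, 1)
	go func() { errCh <- fn() }()
	return errCh
}

func main() {
	// Stand-in for: go func() { err := store.WatchAndSync(syncer, quit) ... }()
	errCh := startAsync(func() error { return errStartFailed })

	select {
	case err := <-errCh:
		if err != nil {
			// In a real test this is where t.Fatalf would go, because this
			// code runs on the goroutine that runs the test function.
			fmt.Println("would fail the test:", err)
		}
	case <-time.After(time.Second):
		fmt.Println("watch still running after 1s")
	}
}
```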

@mpuncel mpuncel merged commit c4d7769 into square:master Feb 9, 2017
@mpuncel mpuncel deleted the mpuncel/dont-sync-pod-clusters-if-no-pod-clusters branch March 8, 2017 13:36