Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

overlord/standby: fix a race between standby goroutine and stop #6042

Merged
merged 5 commits into from Oct 26, 2018

Conversation

bboozzoo
Copy link
Collaborator

When standby.Stop() is called before the goroutine enters the select {} block,
m.stopCh will be nil. Due to short circuit evaluation in select, the stopCh case
will never be evaluated and we will always wait for the timer to expire instead
of exiting right away.

We should probably use tomb and make Stop() wait for the goroutine to exit.

When standby.Stop() is called before the goroutine enters the select {} block,
m.stopCh will be nil. Due to short circuit evaluation in select, the stopCh case
will never be evaluated and we will always wait for the timer to expire instead
of exiting right away.

Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Copy link
Collaborator

@zyga zyga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the patch and for the explanation. Tricky stuff

@chipaca
Copy link
Contributor

chipaca commented Oct 25, 2018

I think this is the wrong approach. As it's late, I'll work on this in a while and if i get anywhere I'll push what I think might be the right approach, and if I'm wrong we can revert in the morn.

This should do the trick. But I might've gone a little overboard.

Without this change, StandbyOpinions had a bunch of attributes that
were changed for synchronization with no locking. I think that's
problematic, so with this change it's a lot more like a tomb.

I also made it so Stop would wait for the goroutine to exit, as
@bboozzoo suggested. Tomblike, also, that.

I *also* also reuse the timer instead of creating a new one every
iteration. As this loop runs forever, this might make the garbage
collector less busy.

I refrained from changing the protocol so CanStandby returns
yes/no/never, with yes/no being the current True/False, and never
stopping the whole thing. Maybe next time.
m.Start()

// wait enough so the actual Opinionator is called
time.Sleep(10 * time.Millisecond)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should sync with the opinionator being called and make sure that Stop actually completes. Otherwise this feels like the issue I tried to fix in #6041. I tried the diff below locally, what do you think?

diff --git a/overlord/standby/standby_test.go b/overlord/standby/standby_test.go
index 5427814cf..a669a9dbd 100644
--- a/overlord/standby/standby_test.go
+++ b/overlord/standby/standby_test.go
@@ -153,22 +153,44 @@ func (s *standbySuite) TestStartChecks(c *C) {
 
 func (s *standbySuite) TestStopWaits(c *C) {
        defer standby.MockStandbyWait(time.Millisecond)()
-       defer standby.MockStateRequestRestart(func(*state.State, state.RestartType) {})()
+       defer standby.MockStateRequestRestart(func(*state.State, state.RestartType) {
+               c.Fatal("request restart should have not been called")
+       })()
 
        ch := make(chan struct{})
+       opineReady := make(chan struct{})
+       opineGo := make(chan struct{})
        m := standby.New(s.state)
+       synced := false
        m.AddOpinion(opine(func() bool {
+               if !synced {
+                       // synchronize with the main goroutine only at the
+                       // beginning
+                       opineReady <- struct{}{}
+                       <-opineGo
+                       synced = true
+               }
                time.Sleep(200 * time.Millisecond)
                return false
        }))
 
        m.Start()
 
-       // wait enough so the actual Opinionator is called
-       time.Sleep(10 * time.Millisecond)
+       <-opineReady
 
        go func() {
-               m.Stop()
+               stopDone := make(chan struct{})
+
+               // let the opinionator start its delay
+               opineGo <- struct{}{}
+               go func() {
+                       // this will block until standby stops
+                       m.Stop()
+                       close(stopDone)
+               }()
+
+               <-stopDone
+
                close(ch)
        }()
 
@@ -178,4 +200,12 @@ func (s *standbySuite) TestStopWaits(c *C) {
        case <-ch:
                c.Fatal("stop should have blocked and didn't")
        }
+
+       // wait for Stop to complete now
+       select {
+       case <-ch:
+               // nothing to do here
+       case <-time.After(10 * time.Second):
+               c.Fatal("stop did not complete")
+       }
 }

@bboozzoo
Copy link
Collaborator Author

@chipaca thank you for the changes! Looks much better than what I proposed.

bboozzoo and others added 3 commits October 26, 2018 09:03
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
chipaca
chipaca previously approved these changes Oct 26, 2018
Copy link
Contributor

@chipaca chipaca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quickly +1'ing it before we do finer finer finer tuing

@chipaca chipaca dismissed their stale review October 26, 2018 09:28

i kinda had my feet in this too much to +1 for real

Copy link
Contributor

@stolowski stolowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

return
}
timer.Reset(wait)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above logic looks good.

}
<-m.stoppedCh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic looks fine.

@chipaca chipaca merged commit 9584b67 into snapcore:master Oct 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants