New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
overlord/standby: fix a race between standby goroutine and stop #6042
Conversation
When standby.Stop() is called before the goroutine enters the select {} block, m.stopCh will be nil. Due to short circuit evaluation in select, the stopCh case will never be evaluated and we will always wait for the timer to expire instead of exiting right away. Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the patch and for the explanation. Tricky stuff
I think this is the wrong approach. As it's late, I'll work on this in a while and if i get anywhere I'll push what I think might be the right approach, and if I'm wrong we can revert in the morn. |
This should do the trick. But I might've gone a little overboard. Without this change, StandbyOpinions had a bunch of attributes that were changed for synchronization with no locking. I think that's problematic, so with this change it's a lot more like a tomb. I also made it so Stop would wait for the goroutine to exit, as @bboozzoo suggested. Tomblike, also, that. I *also* also reuse the timer instead of creating a new one every iteration. As this loop runs forever, this might make the garbage collector less busy. I refrained from changing the protocol so CanStandby returns yes/no/never, with yes/no being the current True/False, and never stopping the whole thing. Maybe next time.
overlord/standby/standby_test.go
Outdated
m.Start() | ||
|
||
// wait enough so the actual Opinionator is called | ||
time.Sleep(10 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should sync with the opinionator being called and make sure that Stop actually completes. Otherwise this feels like the issue I tried to fix in #6041. I tried the diff below locally, what do you think?
diff --git a/overlord/standby/standby_test.go b/overlord/standby/standby_test.go
index 5427814cf..a669a9dbd 100644
--- a/overlord/standby/standby_test.go
+++ b/overlord/standby/standby_test.go
@@ -153,22 +153,44 @@ func (s *standbySuite) TestStartChecks(c *C) {
func (s *standbySuite) TestStopWaits(c *C) {
defer standby.MockStandbyWait(time.Millisecond)()
- defer standby.MockStateRequestRestart(func(*state.State, state.RestartType) {})()
+ defer standby.MockStateRequestRestart(func(*state.State, state.RestartType) {
+ c.Fatal("request restart should have not been called")
+ })()
ch := make(chan struct{})
+ opineReady := make(chan struct{})
+ opineGo := make(chan struct{})
m := standby.New(s.state)
+ synced := false
m.AddOpinion(opine(func() bool {
+ if !synced {
+ // synchronize with the main goroutine only at the
+ // beginning
+ opineReady <- struct{}{}
+ <-opineGo
+ synced = true
+ }
time.Sleep(200 * time.Millisecond)
return false
}))
m.Start()
- // wait enough so the actual Opinionator is called
- time.Sleep(10 * time.Millisecond)
+ <-opineReady
go func() {
- m.Stop()
+ stopDone := make(chan struct{})
+
+ // let the opinionator start its delay
+ opineGo <- struct{}{}
+ go func() {
+ // this will block until standby stops
+ m.Stop()
+ close(stopDone)
+ }()
+
+ <-stopDone
+
close(ch)
}()
@@ -178,4 +200,12 @@ func (s *standbySuite) TestStopWaits(c *C) {
case <-ch:
c.Fatal("stop should have blocked and didn't")
}
+
+ // wait for Stop to complete now
+ select {
+ case <-ch:
+ // nothing to do here
+ case <-time.After(10 * time.Second):
+ c.Fatal("stop did not complete")
+ }
}
@chipaca thank you for the changes! Looks much better than what I proposed. |
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
Signed-off-by: Maciej Borzecki <maciej.zenon.borzecki@canonical.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quickly +1'ing it before we do finer finer finer tuing
i kinda had my feet in this too much to +1 for real
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
return | ||
} | ||
timer.Reset(wait) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above logic looks good.
} | ||
<-m.stoppedCh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic looks fine.
When standby.Stop() is called before the goroutine enters the select {} block,
m.stopCh will be nil. Due to short circuit evaluation in select, the stopCh case
will never be evaluated and we will always wait for the timer to expire instead
of exiting right away.
We should probably use tomb and make Stop() wait for the goroutine to exit.