overlord: fix issue with concurrent execution of two snapd processes #11146
Conversation
Force-pushed from 12624bd to 893c541
Force-pushed from 1abacf1 to 482d916
Codecov Report
@@ Coverage Diff @@
## master #11146 +/- ##
==========================================
+ Coverage 78.26% 78.32% +0.05%
==========================================
Files 917 920 +3
Lines 104099 104922 +823
==========================================
+ Hits 81475 82176 +701
- Misses 17528 17615 +87
- Partials 5096 5131 +35
lgtm but some small comments. To be clear though, did you run the spread test a bunch of times again and find that you can't reproduce the error anymore with this PR (I assume so, just want to make sure)?
overlord/overlord.go
Outdated
logger.Noticef("Acquiring lock file")
// This will cause the process to block, if the lock is currently held
if err := o.flock.Lock(); err != nil {
it would be kind of nice to have something which indicates on the log whether we block here for a significant amount of time, something like
gotLockCh := make(chan struct{})
go func() {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	start := time.Now()
	for {
		select {
		case <-ticker.C:
			logger.Noticef("Still waiting for state lock file after %s", time.Since(start))
		case <-gotLockCh:
			return
		}
	}
}()
if err := o.flock.Lock(); err != nil {
	logger.Noticef("Failed to lock file")
	return nil, fmt.Errorf("fatal: could not lock state file: %v", err)
}
gotLockCh <- struct{}{}
logger.Noticef("Acquired lock file")
I realize the likelihood of deadlocking between two snapds is probably quite low, but since snap-failure and the ephemeral snapd are known to step on each other's toes from the bug report, if one of them somehow does get stuck holding the lock, we will at least be able to tell from the logs that one of them is stuck with the lock, because the other will keep telling us how long it has been waiting for the lock.
I suppose the counterpoint to this is that we are already logging what's happening inside snapd, and we can probably just use journalctl's timestamps to correlate things
yes, my preference would still be to actually do some polling and have a timeout after which we fail
that would also let us use sd-notify EXTEND_TIMEOUT_USEC if the systemd timeout is too short
I took Ian's suggestion; I actually tried to add a LockWithTimeout() method to flock.go, but I didn't find a way to cancel the lock attempt when the timeout expires. This would leave the background goroutine running, and eventually (maybe) acquiring the lock after we had already reported an error and given up.
thanks for taking my suggestion, though it's unclear to me if @pedronis wanted the inverse of what I proposed, where we actually fail at some point if we don't get the lock (i.e. implement a proper timeout). As it is, my proposal and this implementation will never actually fail; it will just loop indefinitely, trying to acquire the lock and complaining about it every minute.
I personally am okay with the current implementation now, just wondering if Samuele is too
I think we agreed elsewhere to switch to a simpler approach using TryLock, which doesn't really need a goroutine or a ticker even
I changed it to a retry mechanism. I'm not really happy with the unit tests I wrote, I would rather have an interface for the FileLock, but if this is the preferred way here, so be it. :-)
overlord/overlord.go
Outdated
if o.flock != nil {
	// This will also unlock the file
	o.flock.Close()
	os.Remove(o.flock.Path())
is this necessary? this feels somewhat racy where we are releasing the lock, then the other snapd acquires the lock, then this snapd is trying to delete that file?
I don't see us removing the lock in the runinhibit lock code.
yes, I think we should recreate it if missing but not remove it
It's not needed, I only thought that it was cleaner. I've removed it now.
overlord/snapstate/handlers.go
Outdated
@@ -1630,6 +1630,8 @@ func (m *SnapManager) maybeRestart(t *state.Task, info *snap.Info, rebootRequire

	typ := info.Type()

	// TODO: even knowing that "bp" stands for "boot participant" doesn't help
	// in making this comment more clear:
	// if bp is non-trivial then either we're not on classic, or the snap is
haha yeah sorry, this comment is quite unhelpful; I think this is what the comment is getting at:
// if the type of the snap requesting this start is non-trivial that either means we
// are on ubuntu core and the type is a base/kernel/gadget which requires a reboot of the
// system, or that the type is snapd in which case we just do a restart of snapd itself.
// In these cases restartReason will be non-empty and thus we will perform a restart.
// If restartReason is empty, then the snap requesting the restart was not a boot participant
// and thus we don't need to do any sort of restarts as a result of updating this snap.
Additionally, the doc-comment for maybeRestart should indicate that this function is called for all snap refreshes, not just those that may trigger a restart. If rebootRequired is true, then we know a priori when calling this function that a reboot is needed, and thus this function just always does the reboot; otherwise this function will figure out if a reboot is needed depending on the type of the snap that was just updated/linked.
This doesn't need to be updated in this PR to be clear, we can fix this in a different one, your choice
Applied, thanks!
thanks for this, bunch of comments/questions
overlord/overlord.go
Outdated
}
o.flock = flock

logger.Noticef("Acquiring lock file")
snaps also have locks. We probably want to be explicit: "Acquiring state lock file". Same with the rest of the logging.
@@ -385,6 +409,9 @@ func (o *Overlord) Loop() {
	if preseed {
		o.runner.OnTaskError(preseedExitWithError)
	}
	if o.loopTomb == nil {
can you explain the reason for these changes?
I kept this change in a separate commit, with a description which should explain the reason:

Do not initialize the loopTomb member until we are actually asking it to run some routine. Otherwise the call to loopTomb.Wait() which we do when stopping the Overlord could block forever. This seems also to be the suggestion given in go-tomb/tomb#21 for a similar problem.

It's a change I had to do when updating the unit tests, otherwise they were hanging forever.
overlord/overlord.go
Outdated
	logger.Noticef("Failed to lock state file")
	return nil, fmt.Errorf("fatal: could not lock state file: %v", err)
}
gotLockCh <- struct{}{}
you don't need to send anything, just close(gotLockCh) to have the desired effect of waking up the goroutine.
LGTM
still lgtm, one question though and I agree with Maciej's suggestion to close the channel
overlord/overlord.go
Outdated
o.flock = flock

logger.Noticef("Acquiring state lock file")
if err := lockWithTimeout(o.flock, 20*time.Second); err != nil {
maybe you could extract the timeout into a variable:
const overlordStateLockTimeout = 20 * time.Second
...
I understand that in the tests there's always a per-test root dir, so we will not hit the timeouts unless in a specific test that exercises this code path?
Done. Yes, I didn't see the unit tests failing on this lock.
// Wait until the shell command prints "acquired"
buf := make([]byte, 8)
bytesRead, err := io.ReadAtLeast(stdout, buf, len(buf))
linter complains about an ineffectual err assignment; missing c.Assert(err, IsNil)?
Added!
cmd := exec.Command("flock", "-w", "2", f.Name(), "-c", "echo acquired && sleep 1")
stdout, err := cmd.StdoutPipe()
c.Assert(err, IsNil)
err = cmd.Start()
we're missing a cmd.Wait() somewhere
Yes and no :-) I don't want to add it, otherwise this test will always take at least a second to complete. I was looking for a way to kill the process instead, but I didn't find it. Ah! I now see that there's an os.Process member in cmd. I'll kill it :-)
Done!
this has changed since
thank you, overall looks good, some detail comments/questions
overlord/overlord.go
Outdated
@@ -76,6 +81,8 @@ var pruneTickerC = func(t *time.Ticker) <-chan time.Time {
// Overlord is the central manager of a snappy system, keeping
// track of all available state managers and related helpers.
type Overlord struct {
	flock *osutil.FileLock
this should be called stateFLock I think
Updated
overlord/overlord.go
Outdated
@@ -56,6 +57,10 @@ import (
	"github.com/snapcore/snapd/timings"
)

const (
	overlordStateLockTimeout = 20 * time.Second
I wonder if this is enough on all slow devices
Bumped it to a minute
overlord/overlord.go
Outdated
if time.Since(startTime) >= timeout {
	return errors.New("timeout for state lock file expired")
}
time.Sleep(retryInterval)
we don't hit this bit in tests, usually what we do to avoid making the tests too slow is add Mock functions to manipulate the relevant time parameters and tweaks them in the test to make them faster, see MockPruneInterval for example
Done
overlord/overlord.go
Outdated
if o.flock != nil {
	// This will also unlock the file
	o.flock.Close()
	logger.Noticef("Released lock file")
state lock file
Updated
Force-pushed from 0c89837 to ff1caf4
thank you
Do not initialize the loopTomb member until we are actually asking it to run some routine. Otherwise the call to loopTomb.Wait() which we do when stopping the Overlord could block forever. This seems also to be the suggestion given in go-tomb/tomb#21 to a similar problem.
This will release any contended resources; it's especially important now that we are starting to use a file lock on the snap state.
Now that we have a file lock on the snapstate when creating an Overlord object, we must be careful not to instantiate more than one overlord in the same process. Slightly refactor the tests to:
- ensure that Overlord.Stop() is called when the object is no longer used;
- move the creation of the Overlord from the SetUpTest() method to each individual test which needs this object.
Thanks Ian for the reformulation.
There can be situations where there exist two snapd processes in the system: that's the case when one of them is invoked as a subprocess by the snap-failure systemd unit (which runs cmd/snap-failure/cmd_snap.go), which we refer to as the "ephemeral snapd", and the other one is the new, repaired snapd which is started by the ephemeral one. When this happens, we do not want both of these processes to operate on the snapstate at the same time; therefore, create a file-based lock to make the accesses mutually exclusive. A file-based lock will cause the second process to block and wait until the lock is released (which can happen either as a result of an explicit release operation, or at the termination of the process holding it). The process starting snapd therefore needs to invoke the "systemctl start snapd.service" command in non-blocking mode, or it would get itself blocked, too. Fixes: https://bugs.launchpad.net/snapd/+bug/1952404
Force-pushed from ff1caf4 to a956893
lgtm one small nitpick
(I first went for an alternative solution involving deferring the snapd start, but I couldn't really follow the code flow. Along the way, I noticed a comment which turned out more confusing than clarifying, so I'm adding a TODO on making it better. But if you already have a suggestion on how to rephrase it, please let me know and I'll update it on the spot)