
overlord: fix issue with concurrent execution of two snapd processes #11146

Merged
merged 6 commits into snapcore:master from mardy's restore-snapd-failover-test branch on Jan 10, 2022

Conversation

Contributor

@mardy mardy commented Dec 5, 2021

There can be situations where two snapd processes exist in the
system: that's the case when one of them is invoked as a subprocess by
the snap-failure systemd unit (which runs cmd/snap-failure/cmd_snap.go),
which we refer to as the "ephemeral snapd", and the other one is the
new, repaired snapd started by the ephemeral one. When this happens, we
do not want both of these processes to operate on the snapstate at the
same time; therefore, create a file-based lock to make the accesses
mutually exclusive.

A file-based lock will cause the second process to block and wait until
the lock is released (which can happen either as a result of an explicit
release operation, or at the termination of the process holding it). The
process starting snapd therefore needs to invoke the "systemctl start
snapd.service" command in non-blocking mode, or it would get itself
blocked, too.
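
For illustration, a minimal sketch of what the file-based lock boils down to
at the flock(2) level (snapd's actual code goes through its osutil helpers,
and the lock path below is made up):

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// hypothetical lock path, for illustration only
	f, err := os.OpenFile("/run/snapd/state.lock", os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// LOCK_EX blocks until any other holder releases the lock, either by
	// unlocking explicitly or by exiting
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		panic(err)
	}
	fmt.Println("state lock acquired")
	// ... safe to operate on the snap state here ...
}
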

Fixes: https://bugs.launchpad.net/snapd/+bug/1952404

(I first went for an alternative solution involving deferring the snapd start, but I couldn't really follow the code flow. Along the way, I noticed a comment which turned out to be more confusing than clarifying, so I'm adding a TODO on making it better. But if you already have a suggestion on how to rephrase it, please let me know and I'll update it on the spot.)

@mardy mardy force-pushed the restore-snapd-failover-test branch from 12624bd to 893c541 on December 5, 2021 13:26
@mardy mardy changed the title Restore snapd failover test overlord: fix issue with concurrent execution of two snapd processes Dec 6, 2021
@mardy mardy force-pushed the restore-snapd-failover-test branch 3 times, most recently from 1abacf1 to 482d916 on December 6, 2021 08:19

codecov-commenter commented Dec 6, 2021

Codecov Report

Merging #11146 (ff1caf4) into master (46cd020) will increase coverage by 0.05%.
The diff coverage is 83.07%.


@@            Coverage Diff             @@
##           master   #11146      +/-   ##
==========================================
+ Coverage   78.26%   78.32%   +0.05%     
==========================================
  Files         917      920       +3     
  Lines      104099   104922     +823     
==========================================
+ Hits        81475    82176     +701     
- Misses      17528    17615      +87     
- Partials     5096     5131      +35     
Flag Coverage Δ
unittests 78.32% <83.07%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                            Coverage Δ
wrappers/core18.go                        0.00% <0.00%> (ø)
overlord/overlord.go                      80.56% <81.13%> (-0.33%) ⬇️
dirs/dirs.go                              94.77% <100.00%> (+0.08%) ⬆️
overlord/snapstate/handlers.go            71.15% <100.00%> (-0.59%) ⬇️
snap/helpers.go                           4.16% <0.00%> (-95.84%) ⬇️
interfaces/kmod/backend.go                70.00% <0.00%> (-9.67%) ⬇️
overlord/snapstate/backend/copydata.go    51.82% <0.00%> (-2.03%) ⬇️
interfaces/kmod/spec.go                   83.78% <0.00%> (-1.64%) ⬇️
systemd/emulation.go                      40.54% <0.00%> (-0.75%) ⬇️
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 46cd020...ff1caf4.

@pedronis pedronis self-requested a review December 6, 2021 14:13
@pedronis pedronis added the "Needs Samuele review" label (Needs a review from Samuele before it can land) Dec 6, 2021
Member

@anonymouse64 anonymouse64 left a comment

lgtm but some small comments. To be clear though, did you run the spread test a bunch of times again and find that you can't reproduce the error anymore with this PR (I assume so, just want to make sure)?


logger.Noticef("Acquiring lock file")
// This will cause the process to block, if the lock is currently held
if err := o.flock.Lock(); err != nil {
Member

it would be kind of nice to have something that indicates in the log whether we block here for a significant amount of time, something like

	gotLockCh := make(chan struct{})
	go func() {
		ticker := time.NewTicker(time.Minute)
		defer ticker.Stop()
		start := time.Now()
		for {
			select {
			case <-ticker.C:
				logger.Noticef("Still waiting for state lock file after %s", time.Since(start))
			case <-gotLockCh:
				return
			}
		}
	}()
	if err := o.flock.Lock(); err != nil {
		logger.Noticef("Failed to lock file")
		return nil, fmt.Errorf("fatal: could not lock state file: %v", err)
	}
	gotLockCh <- struct{}{}
	logger.Noticef("Acquired lock file")

I realize the likelihood of deadlocking between two snapds is probably quite low, but the bug report shows that snap-failure and the ephemeral snapd are known to step on each other's toes. If one of them somehow does get stuck holding the lock, we will at least be able to tell from the logs, because the other will keep telling us how long it has been waiting for the lock.

I suppose the counterpoint to this is that we are already logging what's happening inside snapd, and we can probably just use journalctl's timestamps to correlate things

Collaborator

yes, my preference would still be to actually do some polling and have a timeout after which we fail

Collaborator

that would also let us use sd-notify EXTEND_TIMEOUT_USEC if the systemd timeout is too short
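For context, EXTEND_TIMEOUT_USEC is just an sd_notify datagram written to the
socket named in $NOTIFY_SOCKET; a rough, illustrative sketch (snapd has its own
systemd notification helpers, this is not their actual code):

package main

import (
	"fmt"
	"net"
	"os"
	"strings"
	"time"
)

// extendStartTimeout asks systemd for extra startup time via the
// EXTEND_TIMEOUT_USEC notification message.
func extendStartTimeout(extra time.Duration) error {
	socketPath := os.Getenv("NOTIFY_SOCKET")
	if socketPath == "" {
		return fmt.Errorf("not started by systemd with a notify socket")
	}
	// systemd reports abstract socket addresses with a leading "@"
	if strings.HasPrefix(socketPath, "@") {
		socketPath = "\x00" + socketPath[1:]
	}
	conn, err := net.Dial("unixgram", socketPath)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = fmt.Fprintf(conn, "EXTEND_TIMEOUT_USEC=%d", extra.Microseconds())
	return err
}

func main() {
	if err := extendStartTimeout(30 * time.Second); err != nil {
		fmt.Println("cannot extend start timeout:", err)
	}
}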

Contributor Author

I took Ian's suggestion; I actually tried to add a LockWithTimeout() method to flock.go, but I didn't find a way to cancel the lock attempt when the timeout expires. That would leave the background goroutine running and eventually (maybe) acquiring the lock, after we had already reported an error and given up.

Member

thanks for taking my suggestion, though it's unclear to me if @pedronis wanted the inverse of what I proposed, where we actually fail at some point if we don't get the lock (i.e. implement a proper timeout), since my proposal and this implementation will never actually fail; they will just loop indefinitely trying to acquire the lock and complaining about it every minute.

I personally am okay with the current implementation now, just wondering if Samuele is too

Collaborator

I think we agreed elsewhere to switch to a simpler approach using TryLock, which doesn't really need a goroutine or a ticker even

Contributor Author

I changed it to a retry mechanism. I'm not really happy with the unit tests I wrote; I would rather have an interface for the FileLock, but if this is the preferred way here, so be it. :-)
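For reference, the retry mechanism boils down to polling a non-blocking lock
attempt until a deadline; a minimal sketch using a raw non-blocking flock(2),
with made-up names (the PR itself goes through osutil's file lock and its
TryLock, as suggested above):

package statelock // illustrative package name

import (
	"errors"
	"os"
	"syscall"
	"time"
)

// lockWithTimeout keeps retrying a non-blocking exclusive lock until it
// succeeds or the timeout expires.
func lockWithTimeout(f *os.File, timeout, retryInterval time.Duration) error {
	start := time.Now()
	for {
		err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
		if err == nil {
			return nil
		}
		if err != syscall.EWOULDBLOCK {
			// a real error, not just "somebody else holds the lock"
			return err
		}
		if time.Since(start) >= timeout {
			return errors.New("timeout for state lock file expired")
		}
		time.Sleep(retryInterval)
	}
}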

if o.flock != nil {
	// This will also unlock the file
	o.flock.Close()
	os.Remove(o.flock.Path())
Member

Is this necessary? This feels somewhat racy: we release the lock, the other snapd acquires it, and then this snapd goes on to delete that file?

I don't see us removing the lock in the runinhibit lock code:

// Unlock truncates the run inhibition lock, for the given snap.
//
// An empty inhibition lock means uninhibited "snap run".
func Unlock(snapName string) error {
	flock, err := openHintFileLock(snapName)
	if os.IsNotExist(err) {
		return nil
	}
	if err != nil {
		return err
	}
	defer flock.Close()
	if err := flock.Lock(); err != nil {
		return err
	}
	f := flock.File()
	return f.Truncate(0)
}

Collaborator

yes, I think we should recreate it if missing but not remove it

Contributor Author

It's not needed, I only thought that it was cleaner. I've removed it now.

@@ -1630,6 +1630,8 @@ func (m *SnapManager) maybeRestart(t *state.Task, info *snap.Info, rebootRequire

typ := info.Type()

// TODO: even knowing that "bp" stands for "boot participant" doesn't help
// in making this comment more clear:
// if bp is non-trivial then either we're not on classic, or the snap is
Member

haha yeah sorry this comment is quite unhelpful, I think this is what the comment is getting at:

// if the type of the snap requesting this start is non-trivial, that either means we
// are on Ubuntu Core and the type is a base/kernel/gadget, which requires a reboot of
// the system, or the type is snapd, in which case we just restart snapd itself.
// In these cases restartReason will be non-empty and thus we will perform a restart.
// If restartReason is empty, then the snap requesting the restart was not a boot
// participant and thus we don't need any sort of restart as a result of updating this snap.

Additionally, the doc comment for maybeRestart should indicate that this function is called for all snap refreshes, not just those that may trigger a restart. If rebootRequired is true, then we know a priori when calling this function that a reboot is needed, so the function always performs the reboot; otherwise the function figures out whether a reboot is needed based on the type of the snap that was just updated/linked.

To be clear, this doesn't need to be updated in this PR; we can fix it in a different one, your choice.

Contributor Author

Applied, thanks!

wrappers/core18.go: resolved review thread (outdated)
Collaborator

@pedronis pedronis left a comment

thanks for this, bunch of comments/questions

}
o.flock = flock

logger.Noticef("Acquiring lock file")
Collaborator

@pedronis pedronis Dec 8, 2021

snaps also have locks. We probably want to be explicit: "Acquiring state lock file". Same for the rest of the logging.


logger.Noticef("Acquiring lock file")
// This will cause the process to block, if the lock is currently held
if err := o.flock.Lock(); err != nil {
Collaborator

yes, my preference would still be to actually do some polling and have a timeout after which we fail

@@ -385,6 +409,9 @@ func (o *Overlord) Loop() {
if preseed {
o.runner.OnTaskError(preseedExitWithError)
}
if o.loopTomb == nil {
Collaborator

can you explain the reason for these changes?

Contributor Author

@mardy mardy Dec 8, 2021

I kept this change in a separate commit, with a description which should explain the reason:

Do not initialize the loopTomb member until we are actually asking it to
run some routine. Otherwise the call to loopTomb.Wait() which we do when
stopping the Overlord could block forever.

This also seems to be the suggestion given in
go-tomb/tomb#21 for a similar problem.

It's a change I had to make when updating the unit tests; otherwise they would hang forever.

if o.flock != nil {
	// This will also unlock the file
	o.flock.Close()
	os.Remove(o.flock.Path())
Collaborator

yes, I think we should recreate it if missing but not remove it

wrappers/core18.go: resolved review thread
logger.Noticef("Failed to lock state file")
return nil, fmt.Errorf("fatal: could not lock state file: %v", err)
}
gotLockCh <- struct{}{}
Collaborator

you don't need to send anything, just close(gotLockCh) to have the desired effect of waking up the goroutine.
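Small illustration of the difference (not code from the PR): a receive on a
closed channel returns immediately, so close() wakes every waiting receiver and
never blocks, whereas a send wakes exactly one receiver and blocks if nobody is
receiving.

package main

import (
	"fmt"
	"sync"
)

func main() {
	gotLockCh := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-gotLockCh // returns as soon as the channel is closed
		fmt.Println("woken up without any value being sent")
	}()
	close(gotLockCh) // wakes every waiting receiver and never blocks
	wg.Wait()
}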

overlord/overlord.go: resolved review thread
MiguelPires previously approved these changes Dec 8, 2021
Contributor

@MiguelPires MiguelPires left a comment

LGTM

anonymouse64 previously approved these changes Dec 8, 2021
Member

@anonymouse64 anonymouse64 left a comment

still lgtm; one question though, and I agree with Maciej's suggestion to close the channel

overlord/overlord.go: resolved review thread
@pedronis pedronis self-requested a review December 8, 2021 20:57
o.flock = flock

logger.Noticef("Acquiring state lock file")
if err := lockWithTimeout(o.flock, 20*time.Second); err != nil {
Collaborator

maybe you could extract the timeout into a variable

const overlordStateLockTimeout = 20 * time.Second
...

I understand that in the tests there's always a per-test root dir, so we will not hit the timeouts unless in a specific test that exercises this code path?

Contributor Author

Done. Yes, I didn't see the unit tests failing on this lock.

overlord/overlord.go: resolved review thread

// Wait until the shell command prints "acquired"
buf := make([]byte, 8)
bytesRead, err := io.ReadAtLeast(stdout, buf, len(buf))
Collaborator

linter complains about ineffective err, missing c.Assert(err, IsNil)?

Contributor Author

Added!

cmd := exec.Command("flock", "-w", "2", f.Name(), "-c", "echo acquired && sleep 1")
stdout, err := cmd.StdoutPipe()
c.Assert(err, IsNil)
err = cmd.Start()
Collaborator

we're missing a cmd.Wait() somewhere

Contributor Author

Yes and no :-) I don't want to add it, otherwise this test will always take at least a second to complete. I was looking for a way to kill the process instead, but I didn't find it. Ah! I now see that there's an os.Process member in cmd. I'll kill it :-)

Contributor Author

Done!

@pedronis pedronis dismissed stale reviews from anonymouse64 and MiguelPires December 9, 2021 10:40

this has changed since

Collaborator

@pedronis pedronis left a comment

thank you, overall looks good, some detail comments/questions

@@ -76,6 +81,8 @@ var pruneTickerC = func(t *time.Ticker) <-chan time.Time {
// Overlord is the central manager of a snappy system, keeping
// track of all available state managers and related helpers.
type Overlord struct {
flock *osutil.FileLock
Collaborator

this should be called stateFLock I think

Contributor Author

Updated

@@ -56,6 +57,10 @@ import (
"github.com/snapcore/snapd/timings"
)

const (
overlordStateLockTimeout = 20 * time.Second
Collaborator

I wonder if this is enough on all slow devices

Contributor Author

Bumped it to a minute

overlord/overlord.go: resolved review thread
if time.Since(startTime) >= timeout {
	return errors.New("timeout for state lock file expired")
}
time.Sleep(retryInterval)
Collaborator

we don't hit this bit in tests; usually what we do to avoid making the tests too slow is add Mock functions to manipulate the relevant time parameters and tweak them in the test to make them faster, see MockPruneInterval for example
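A sketch of what such a mock helper usually looks like in snapd-style code; the
variable and function names here are made up for illustration:

package overlord

import "time"

// stateLockRetryInterval is a package-level timing knob so that tests can shorten it.
var stateLockRetryInterval = time.Second

// MockStateLockRetryInterval swaps the retry interval and returns a restore
// function for the test to defer.
func MockStateLockRetryInterval(d time.Duration) (restore func()) {
	old := stateLockRetryInterval
	stateLockRetryInterval = d
	return func() {
		stateLockRetryInterval = old
	}
}

A test would then call restore := overlord.MockStateLockRetryInterval(time.Millisecond)
and defer restore(), so the timeout path can be exercised quickly.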

Contributor Author

Done

if o.flock != nil {
	// This will also unlock the file
	o.flock.Close()
	logger.Noticef("Released lock file")
Collaborator

state lock file

Contributor Author

Updated

Collaborator

@pedronis pedronis left a comment

thank you

Do not initialize the loopTomb member until we are actually asking it to
run some routine. Otherwise the call to loopTomb.Wait() which we do when
stopping the Overlord could block forever.

This also seems to be the suggestion given in
go-tomb/tomb#21 for a similar problem.

This will release any contended resources; it's especially important now
that we are starting to use a file lock on the snap state.

Now that we have a file lock on the snapstate when creating an
Overlord object, we must be careful not to instantiate more than one
overlord in the same process.

Slightly refactor the tests to:
- ensure that Overlord.Stop() is called when the object is no longer
  used;
- move the creation of the Overlord from the SetUpTest() method to each
  individual test which needs this object.

There can be situations where two snapd processes exist in the
system: that's the case when one of them is invoked as a subprocess by
the snap-failure systemd unit (which runs cmd/snap-failure/cmd_snap.go),
which we refer to as the "ephemeral snapd", and the other one is the
new, repaired snapd started by the ephemeral one. When this happens, we
do not want both of these processes to operate on the snapstate at the
same time; therefore, create a file-based lock to make the accesses
mutually exclusive.

A file-based lock will cause the second process to block and wait until
the lock is released (which can happen either as a result of an explicit
release operation, or at the termination of the process holding it). The
process starting snapd therefore needs to invoke the "systemctl start
snapd.service" command in non-blocking mode, or it would get itself
blocked, too.

Fixes: https://bugs.launchpad.net/snapd/+bug/1952404
@mardy mardy force-pushed the restore-snapd-failover-test branch from ff1caf4 to a956893 on December 13, 2021 10:12
Member

@anonymouse64 anonymouse64 left a comment

lgtm one small nitpick

overlord/overlord.go: resolved review thread
@mvo5 mvo5 added this to the 2.55 milestone Dec 16, 2021
@mvo5 mvo5 merged commit 167d216 into snapcore:master Jan 10, 2022
Labels
Needs Samuele review (Needs a review from Samuele before it can land)
Projects
None yet
7 participants