
overlord: fix issue with concurrent execution of two snapd processes #11146

Merged
merged 6 commits into snapcore:master from mardy's restore-snapd-failover-test branch on Jan 10, 2022

Conversation

Contributor

@mardy mardy commented Dec 5, 2021

There can be situations where two snapd processes exist in the
system: that's the case when one of them is invoked as a subprocess by
the snap-failure systemd unit (which runs cmd/snap-failure/cmd_snap.go),
which we refer to as the "ephemeral snapd", and the other one is the
new, repaired snapd started by the ephemeral one. When this happens, we
do not want both of these processes to operate on the snapstate at the
same time; therefore, create a file-based lock to make the accesses
mutually exclusive.

A file-based lock will cause the second process to block and wait until
the lock is released (which can happen either as a result of an explicit
release operation, or at the termination of the process holding it). The
process starting snapd therefore needs to invoke the "systemctl start
snapd.service" command in non-blocking mode, or it would get itself
blocked, too.
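
For illustration, a minimal sketch of what the file-based lock boils down to
at the flock(2) level (snapd's actual code goes through its osutil helpers,
and the lock path below is made up):

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// hypothetical lock path, for illustration only
	f, err := os.OpenFile("/run/snapd/state.lock", os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// LOCK_EX blocks until any other holder releases the lock, either by
	// unlocking explicitly or by exiting
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		panic(err)
	}
	fmt.Println("state lock acquired")
	// ... safe to operate on the snap state here ...
}
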

Fixes: https://bugs.launchpad.net/snapd/+bug/1952404

(I first went for an alternative solution involving deferring the snapd start, but I couldn't really follow the code flow. Along the way, I noticed a comment which turned out to be more confusing than clarifying, so I'm adding a TODO on making it better. But if you already have a suggestion on how to rephrase it, please let me know and I'll update it on the spot.)

@mardy mardy force-pushed the restore-snapd-failover-test branch from 12624bd to 893c541 on December 5, 2021 13:26
@mardy mardy changed the title Restore snapd failover test overlord: fix issue with concurrent execution of two snapd processes Dec 6, 2021
@mardy mardy force-pushed the restore-snapd-failover-test branch 3 times, most recently from 1abacf1 to 482d916 on December 6, 2021 08:19

codecov-commenter commented Dec 6, 2021

Codecov Report

Merging #11146 (ff1caf4) into master (46cd020) will increase coverage by 0.05%.
The diff coverage is 83.07%.


@@            Coverage Diff             @@
##           master   #11146      +/-   ##
==========================================
+ Coverage   78.26%   78.32%   +0.05%     
==========================================
  Files         917      920       +3     
  Lines      104099   104922     +823     
==========================================
+ Hits        81475    82176     +701     
- Misses      17528    17615      +87     
- Partials     5096     5131      +35     
Flag Coverage Δ
unittests 78.32% <83.07%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                            Coverage Δ
wrappers/core18.go                        0.00% <0.00%> (ø)
overlord/overlord.go                      80.56% <81.13%> (-0.33%) ⬇️
dirs/dirs.go                              94.77% <100.00%> (+0.08%) ⬆️
overlord/snapstate/handlers.go            71.15% <100.00%> (-0.59%) ⬇️
snap/helpers.go                           4.16% <0.00%> (-95.84%) ⬇️
interfaces/kmod/backend.go                70.00% <0.00%> (-9.67%) ⬇️
overlord/snapstate/backend/copydata.go    51.82% <0.00%> (-2.03%) ⬇️
interfaces/kmod/spec.go                   83.78% <0.00%> (-1.64%) ⬇️
systemd/emulation.go                      40.54% <0.00%> (-0.75%) ⬇️
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 46cd020...ff1caf4.

@pedronis pedronis self-requested a review December 6, 2021 14:13
@pedronis pedronis added the "Needs Samuele review" label (Needs a review from Samuele before it can land) Dec 6, 2021
Member

@anonymouse64 anonymouse64 left a comment

lgtm but some small comments. To be clear though, did you run the spread test a bunch of times again and find that you can't reproduce the error anymore with this PR (I assume so, just want to make sure)?


logger.Noticef("Acquiring lock file")
// This will cause the process to block, if the lock is currently held
if err := o.flock.Lock(); err != nil {
Member

it would be kind of nice to have something that indicates in the log whether we block here for a significant amount of time, something like

	gotLockCh := make(chan struct{})
	go func() {
		ticker := time.NewTicker(time.Minute)
		defer ticker.Stop()
		start := time.Now()
		for {
			select {
			case <-ticker.C:
				logger.Noticef("Still waiting for state lock file after %s", time.Since(start))
			case <-gotLockCh:
				return
			}
		}
	}()
	if err := o.flock.Lock(); err != nil {
		logger.Noticef("Failed to lock file")
		return nil, fmt.Errorf("fatal: could not lock state file: %v", err)
	}
	gotLockCh <- struct{}{}
	logger.Noticef("Acquired lock file")

I realize the likelihood of deadlocking between two snapds is probably quite low, but the bug report shows that snap-failure and the ephemeral snapd are known to step on each other's toes. If one of them somehow does get stuck holding the lock, we will at least be able to tell from the logs, because the other will keep telling us how long it has been waiting for the lock.

I suppose the counterpoint to this is that we are already logging what's happening inside snapd, and we can probably just use journalctl's timestamps to correlate things

Collaborator

yes, my preference would still be to actually do some polling and have a timeout after which we fail

Collaborator

that would also let us use sd-notify EXTEND_TIMEOUT_USEC if the systemd timeout is too short
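For context, EXTEND_TIMEOUT_USEC is just an sd_notify datagram written to the
socket named in $NOTIFY_SOCKET; a rough, illustrative sketch (snapd has its own
systemd notification helpers, this is not their actual code):

package main

import (
	"fmt"
	"net"
	"os"
	"strings"
	"time"
)

// extendStartTimeout asks systemd for extra startup time via the
// EXTEND_TIMEOUT_USEC notification message.
func extendStartTimeout(extra time.Duration) error {
	socketPath := os.Getenv("NOTIFY_SOCKET")
	if socketPath == "" {
		return fmt.Errorf("not started by systemd with a notify socket")
	}
	// systemd reports abstract socket addresses with a leading "@"
	if strings.HasPrefix(socketPath, "@") {
		socketPath = "\x00" + socketPath[1:]
	}
	conn, err := net.Dial("unixgram", socketPath)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = fmt.Fprintf(conn, "EXTEND_TIMEOUT_USEC=%d", extra.Microseconds())
	return err
}

func main() {
	if err := extendStartTimeout(30 * time.Second); err != nil {
		fmt.Println("cannot extend start timeout:", err)
	}
}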

Contributor Author

I took Ian's suggestion; I actually tried to add a LockWithTimeout() method to flock.go, but I didn't find a way to cancel the lock attempt when the timeout expires. That would leave the background goroutine running and eventually (maybe) acquiring the lock, after we had already reported an error and given up.

Member

thanks for taking my suggestion, though it's unclear to me if @pedronis wanted the inverse of what I proposed, where we actually fail at some point if we don't get the lock (i.e. implement a proper timeout), since my proposal and this implementation will never actually fail; they will just loop indefinitely trying to acquire the lock and complaining about it every minute.

I personally am okay with the current implementation now, just wondering if Samuele is too

Collaborator

I think we agreed elsewhere to switch to a simpler approach using TryLock, which doesn't really need a goroutine or a ticker even

Contributor Author

I changed it to a retry mechanism. I'm not really happy with the unit tests I wrote; I would rather have an interface for the FileLock, but if this is the preferred way here, so be it. :-)
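For reference, the retry mechanism boils down to polling a non-blocking lock
attempt until a deadline; a minimal sketch using a raw non-blocking flock(2),
with made-up names (the PR itself goes through osutil's file lock and its
TryLock, as suggested above):

package statelock // illustrative package name

import (
	"errors"
	"os"
	"syscall"
	"time"
)

// lockWithTimeout keeps retrying a non-blocking exclusive lock until it
// succeeds or the timeout expires.
func lockWithTimeout(f *os.File, timeout, retryInterval time.Duration) error {
	start := time.Now()
	for {
		err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
		if err == nil {
			return nil
		}
		if err != syscall.EWOULDBLOCK {
			// a real error, not just "somebody else holds the lock"
			return err
		}
		if time.Since(start) >= timeout {
			return errors.New("timeout for state lock file expired")
		}
		time.Sleep(retryInterval)
	}
}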

if o.flock != nil {
	// This will also unlock the file
	o.flock.Close()
	os.Remove(o.flock.Path())
Member

Is this necessary? This feels somewhat racy: we release the lock, the other snapd acquires it, and then this snapd goes on to delete that file?

I don't see us removing the lock in the runinhibit lock code:

// Unlock truncates the run inhibition lock, for the given snap.
//
// An empty inhibition lock means uninhibited "snap run".
func Unlock(snapName string) error {
	flock, err := openHintFileLock(snapName)
	if os.IsNotExist(err) {
		return nil
	}
	if err != nil {
		return err
	}
	defer flock.Close()
	if err := flock.Lock(); err != nil {
		return err
	}
	f := flock.File()
	return f.Truncate(0)
}

Collaborator

yes, I think we should recreate it if missing but not remove it

Contributor Author

It's not needed, I only thought that it was cleaner. I've removed it now.

@@ -1630,6 +1630,8 @@ func (m *SnapManager) maybeRestart(t *state.Task, info *snap.Info, rebootRequire

typ := info.Type()

// TODO: even knowing that "bp" stands for "boot participant" doesn't help
// in making this comment more clear:
// if bp is non-trivial then either we're not on classic, or the snap is
Member

haha yeah sorry this comment is quite unhelpful, I think this is what the comment is getting at:

// if the type of the snap requesting this start is non-trivial, that either means we
// are on Ubuntu Core and the type is a base/kernel/gadget, which requires a reboot of
// the system, or the type is snapd, in which case we just restart snapd itself.
// In these cases restartReason will be non-empty and thus we will perform a restart.
// If restartReason is empty, then the snap requesting the restart was not a boot
// participant and thus we don't need any sort of restart as a result of updating this snap.

Additionally, the doc comment for maybeRestart should indicate that this function is called for all snap refreshes, not just those that may trigger a restart. If rebootRequired is true, then we know a priori when calling this function that a reboot is needed, so the function always performs the reboot; otherwise the function figures out whether a reboot is needed based on the type of the snap that was just updated/linked.

To be clear, this doesn't need to be updated in this PR; we can fix it in a different one, your choice.

Contributor Author

Applied, thanks!

wrappers/core18.go: resolved review thread (outdated)
Collaborator

@pedronis pedronis left a comment

thanks for this, bunch of comments/questions

}
o.flock = flock

logger.Noticef("Acquiring lock file")
Collaborator

@pedronis pedronis Dec 8, 2021

snaps also have locks. We probably want to be explicit: "Acquiring state lock file". Same for the rest of the logging.


logger.Noticef("Acquiring lock file")
// This will cause the process to block, if the lock is currently held
if err := o.flock.Lock(); err != nil {
Collaborator

yes, my preference would still be to actually do some polling and have a timeout after which we fail

@@ -385,6 +409,9 @@ func (o *Overlord) Loop() {
if preseed {
o.runner.OnTaskError(preseedExitWithError)
}
if o.loopTomb == nil {
Collaborator

can you explain the reason for these changes?

Contributor Author

@mardy mardy Dec 8, 2021

I kept this change in a separate commit, with a description which should explain the reason:

Do not initialize the loopTomb member until we are actually asking it to
run some routine. Otherwise the call to loopTomb.Wait() which we do when
stopping the Overlord could block forever.

This also seems to be the suggestion given in
go-tomb/tomb#21 for a similar problem.

It's a change I had to make when updating the unit tests; otherwise they would hang forever.

if o.flock != nil {
	// This will also unlock the file
	o.flock.Close()
	os.Remove(o.flock.Path())
Collaborator

yes, I think we should recreate it if missing but not remove it

wrappers/core18.go: resolved review thread
logger.Noticef("Failed to lock state file")
return nil, fmt.Errorf("fatal: could not lock state file: %v", err)
}
gotLockCh <- struct{}{}
Collaborator

you don't need to send anything, just close(gotLockCh) to have the desired effect of waking up the goroutine.
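Small illustration of the difference (not code from the PR): a receive on a
closed channel returns immediately, so close() wakes every waiting receiver and
never blocks, whereas a send wakes exactly one receiver and blocks if nobody is
receiving.

package main

import (
	"fmt"
	"sync"
)

func main() {
	gotLockCh := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-gotLockCh // returns as soon as the channel is closed
		fmt.Println("woken up without any value being sent")
	}()
	close(gotLockCh) // wakes every waiting receiver and never blocks
	wg.Wait()
}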

overlord/overlord.go: resolved review thread
MiguelPires previously approved these changes Dec 8, 2021
Contributor

@MiguelPires MiguelPires left a comment

LGTM

anonymouse64 previously approved these changes Dec 8, 2021
Member

@anonymouse64 anonymouse64 left a comment

still lgtm; one question though, and I agree with Maciej's suggestion to close the channel

overlord/overlord.go: resolved review thread
@pedronis pedronis self-requested a review December 8, 2021 20:57
o.flock = flock

logger.Noticef("Acquiring state lock file")
if err := lockWithTimeout(o.flock, 20*time.Second); err != nil {
Collaborator

maybe you could extract the timeout into a variable

const overlordStateLockTimeout = 20 * time.Second
...

I understand that in the tests there's always a per-test root dir, so we will not hit the timeouts unless in a specific test that exercises this code path?

Contributor Author

Done. Yes, I didn't see the unit tests failing on this lock.

overlord/overlord.go: resolved review thread

// Wait until the shell command prints "acquired"
buf := make([]byte, 8)
bytesRead, err := io.ReadAtLeast(stdout, buf, len(buf))
Collaborator

linter complains about ineffective err, missing c.Assert(err, IsNil)?

Contributor Author

Added!

cmd := exec.Command("flock", "-w", "2", f.Name(), "-c", "echo acquired && sleep 1")
stdout, err := cmd.StdoutPipe()
c.Assert(err, IsNil)
err = cmd.Start()
Collaborator

we're missing a cmd.Wait() somewhere

Contributor Author

Yes and no :-) I don't want to add it, otherwise this test will always take at least a second to complete. I was looking for a way to kill the process instead, but I didn't find it. Ah! I now see that there's an os.Process member in cmd. I'll kill it :-)

Contributor Author

Done!

@pedronis pedronis dismissed stale reviews from anonymouse64 and MiguelPires December 9, 2021 10:40

this has changed since

Collaborator

@pedronis pedronis left a comment

thank you, overall looks good, some detail comments/questions

@@ -76,6 +81,8 @@ var pruneTickerC = func(t *time.Ticker) <-chan time.Time {
// Overlord is the central manager of a snappy system, keeping
// track of all available state managers and related helpers.
type Overlord struct {
flock *osutil.FileLock
Collaborator

this should be called stateFLock I think

Contributor Author

Updated

@@ -56,6 +57,10 @@ import (
"github.com/snapcore/snapd/timings"
)

const (
overlordStateLockTimeout = 20 * time.Second
Collaborator

I wonder if this is enough on all slow devices

Contributor Author

Bumped it to a minute

overlord/overlord.go: resolved review thread
if time.Since(startTime) >= timeout {
	return errors.New("timeout for state lock file expired")
}
time.Sleep(retryInterval)
Collaborator

we don't hit this bit in tests; usually what we do to avoid making the tests too slow is add Mock functions to manipulate the relevant time parameters and tweak them in the test to make them faster, see MockPruneInterval for example
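A sketch of what such a mock helper usually looks like in snapd-style code; the
variable and function names here are made up for illustration:

package overlord

import "time"

// stateLockRetryInterval is a package-level timing knob so that tests can shorten it.
var stateLockRetryInterval = time.Second

// MockStateLockRetryInterval swaps the retry interval and returns a restore
// function for the test to defer.
func MockStateLockRetryInterval(d time.Duration) (restore func()) {
	old := stateLockRetryInterval
	stateLockRetryInterval = d
	return func() {
		stateLockRetryInterval = old
	}
}

A test would then call restore := overlord.MockStateLockRetryInterval(time.Millisecond)
and defer restore(), so the timeout path can be exercised quickly.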

Contributor Author

Done

if o.flock != nil {
	// This will also unlock the file
	o.flock.Close()
	logger.Noticef("Released lock file")
Collaborator

state lock file

Contributor Author

Updated

Collaborator

@pedronis pedronis left a comment

thank you

Do not initialize the loopTomb member until we are actually asking it to
run some routine. Otherwise the call to loopTomb.Wait() which we do when
stopping the Overlord could block forever.

This also seems to be the suggestion given in
go-tomb/tomb#21 for a similar problem.

This will release any contended resources; it's especially important now
that we are starting to use a file lock on the snap state.

Now that we have a file lock on the snapstate when creating an
Overlord object, we must be careful not to instantiate more than one
overlord in the same process.

Slightly refactor the tests to:
- ensure that Overlord.Stop() is called when the object is no longer
  used;
- move the creation of the Overlord from the SetUpTest() method to each
  individual test which needs this object.

There can be situations where two snapd processes exist in the
system: that's the case when one of them is invoked as a subprocess by
the snap-failure systemd unit (which runs cmd/snap-failure/cmd_snap.go),
which we refer to as the "ephemeral snapd", and the other one is the
new, repaired snapd started by the ephemeral one. When this happens, we
do not want both of these processes to operate on the snapstate at the
same time; therefore, create a file-based lock to make the accesses
mutually exclusive.

A file-based lock will cause the second process to block and wait until
the lock is released (which can happen either as a result of an explicit
release operation, or at the termination of the process holding it). The
process starting snapd therefore needs to invoke the "systemctl start
snapd.service" command in non-blocking mode, or it would get itself
blocked, too.

Fixes: https://bugs.launchpad.net/snapd/+bug/1952404
@mardy mardy force-pushed the restore-snapd-failover-test branch from ff1caf4 to a956893 on December 13, 2021 10:12
Member

@anonymouse64 anonymouse64 left a comment

lgtm one small nitpick

overlord/overlord.go: resolved review thread
@mvo5 mvo5 added this to the 2.55 milestone Dec 16, 2021
@mvo5 mvo5 merged commit 167d216 into snapcore:master Jan 10, 2022
Labels
Needs Samuele review (Needs a review from Samuele before it can land)
Projects
None yet
7 participants