snapd: initial implementation for systemd software watchdog for snapd #3111

Closed
wants to merge 14 commits into
from

Collaborator

mvo5 commented Mar 29, 2017

The CE team asked for a software watchdog implementation for snapd in the team call we had yesterday. This is a first draft. If it looks reasonable, I'm happy to add a proper spread test and more unit testing.

The bug in question: https://bugs.launchpad.net/snapd/+bug/1674506

Some comments. Very happy to see this; it is an essential feature for improving reliability.

data/systemd/snapd.service
@@ -6,6 +6,7 @@ Requires=snapd.socket
ExecStart=/usr/lib/snapd/snapd
EnvironmentFile=/etc/environment
Restart=always
+WatchdogSec=5m
@zyga

zyga Mar 29, 2017

Contributor

The watchdog should be poked at twice the frequency specified here, from what I recall. How do we ensure that?

systemd/sdnotify.go
+ if err != nil {
+ return nil, fmt.Errorf("cannot parse WATCHDOG_USEC: %s", err)
+ }
+ dur := time.Duration(usec/2) * time.Microsecond
@zyga

zyga Mar 29, 2017

Contributor

Aha :-)

+ }
+
+ raddr := &net.UnixAddr{
+ Name: e,
@zyga

zyga Mar 29, 2017

Contributor

Perhaps heavyweight, but how about a watchdog manager that opens the socket once at startup and handles this all internally?

@mvo5

mvo5 Mar 30, 2017

Collaborator

I was following the code in libsystemd/sd-daemon/sd-daemon.c - it uses this structure, i.e. open socket, send data and cleanup_close the fd.

mvo5 and others added some commits Apr 3, 2017

Contributor

zyga commented Apr 13, 2017

This is failing on [ 1235.846963] audit: type=1326 audit(1492083223.916:961): auid=1000 uid=0 gid=0 ses=3 pid=11968 comm="snap-exec" exe="/usr/lib/snapd/snap-exec" sig=31 arch=c000003e syscall=49 compat=0 ip=0x561dc6ddc1f4 code=0x0. (bind)

mvo5 added some commits Apr 21, 2017

SGTM, with a question about goroutine longevity :-)

systemd/sdnotify.go
+ if wu == "" {
+ return nil, fmt.Errorf("cannot get WATCHDOG_USEC environment")
+ }
+ usec, err := strconv.Atoi(wu)
@chipaca

chipaca Apr 24, 2017

Member

Not sure it's wanted/helpful for this case, but note you have osutil.GetenvInt64.

systemd/sdnotify.go
+ select {
+ case <-wt.C:
+ sdNotify("WATCHDOG=1")
+ }
@chipaca

chipaca Apr 24, 2017

Member

Shouldn't RunWatchdog take a tomb, and check it here?
E.g., func RunWatchdog(d *daemon.Daemon) and then case <-d.Dying()?

Contributor

pedronis commented Apr 28, 2017

Not a blocker, but as I mentioned, a slightly more meaningful implementation would involve a new (settable) timer or ticker/callback pair in the overlord loop:

Overlord.SetHeartbeat(interval, cb)

or something like that

then the OverlordLoop would have (sketching):

case <-heartbeatC:
     cb()
     continue ...

Because we don't want to run Ensure on each heartbeat tick, the loop would need a bit of reorganization though.

zyga and others added some commits May 10, 2017

I think there's one thing that's wrong about error handling. Please have a look.

cmd/snapd/main.go
+ if ticker, err := runWatchdog(d); err != nil {
+ logger.Noticef("cannot run software watchdog: %s", err)
+ } else {
+ defer ticker.Stop()
@zyga

zyga May 15, 2017

Contributor

Hmm, am I missing something? If we fail you just log the message and if we don't fail you ... stop the ticker?

@mvo5

mvo5 Jun 6, 2017

Collaborator

Thanks! You are right about the notify, this is silly because if snapd fails to check in with systemd, it will be killed anyway. The ticker is stopped in a defer (just like the daemon is stopped). But I changed the code now to make it more obvious.

systemd/sdnotify.go
+// inspired by libsystemd/sd-daemon/sd-daemon.c from the systemd source
+func SdNotify(state string) error {
+ if state == "" {
+ return fmt.Errorf("cannot use empty state")
@zyga

zyga May 15, 2017

Contributor

The term state is misleading as it may imply the snapd state. Not sure what to call this to make it better, though.

@mvo5

mvo5 Jun 6, 2017

Collaborator

Thanks! I changed this now.

Member

chipaca commented May 23, 2017

what's the state of this PR?

mvo5 added some commits Jun 6, 2017

die if the watchdog cannot be run (systemd will kill it anyway later because it will fail to check in with systemd)
Collaborator

mvo5 commented Jun 6, 2017

I updated the PR and addressed the review feedback. It does not address the concerns from @pedronis yet, if we want this in this PR I think we should close and reopen only once it is reworked.

Contributor

pedronis commented Jun 6, 2017

@mvo5 since I wrote my comment I noticed that, especially because of the auto-refresh code, we might do things in Ensure that can take a fairly long time, relatively speaking. To use the Overlord loop for the watchdog functionality we would need to bound them somehow precisely, or move them out again.

Collaborator

mvo5 commented Jun 7, 2017

Given that we have actually had no instance of the system becoming unresponsive, I will close this PR. The current form will not actually help much: all it proves is that a single goroutine in snapd is still alive, which is not super helpful.

@mvo5 mvo5 closed this Jun 7, 2017
