Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
snapd: initial implementation for systemd software watchdog for snapd #3111
Conversation
zyga
reviewed
Mar 29, 2017
Some comments. Very happy to see this, essential feature for improving reliability
| @@ -6,6 +6,7 @@ Requires=snapd.socket | ||
| ExecStart=/usr/lib/snapd/snapd | ||
| EnvironmentFile=/etc/environment | ||
| Restart=always | ||
| +WatchdogSec=5m |
zyga
Mar 29, 2017
Contributor
The watchdog should be poked at twice the frequency specifies here from what I recall. How do we ensure that?
| + if err != nil { | ||
| + return nil, fmt.Errorf("cannot parse WATCHDOG_USEC: %s", err) | ||
| + } | ||
| + dur := time.Duration(usec/2) * time.Microsecond |
| + } | ||
| + | ||
| + raddr := &net.UnixAddr{ | ||
| + Name: e, |
zyga
Mar 29, 2017
Contributor
Perhaps heavyweight but how about a watchdog manager that opens th socket once at startup and handles this all internally?
mvo5
Mar 30, 2017
Collaborator
I was following the code in libsystemd/sd-daemon/sd-daemon.c - it uses this structure, i.e. open socket, send data and cleanup_close the fd.
mvo5
and others
added some commits
Apr 3, 2017
|
This is failing on |
mvo5
added some commits
Apr 21, 2017
chipaca
approved these changes
Apr 24, 2017
SGTM, with a question about goroutine longevity :-)
| + if wu == "" { | ||
| + return nil, fmt.Errorf("cannot get WATCHDOG_USEC environment") | ||
| + } | ||
| + usec, err := strconv.Atoi(wu) |
chipaca
Apr 24, 2017
Member
Not sure it's wanted/helpful for this case, but note you have osutil.GetenvInt64.
| + select { | ||
| + case <-wt.C: | ||
| + sdNotify("WATCHDOG=1") | ||
| + } |
chipaca
Apr 24, 2017
Member
shouldn't RunWatchdog take a tomb, and check it here?
E.g, func RunWatchdog(d *daemon.Daemon) and then case <-d.Dying()?
|
not a blocker but as I mentioned a slightly more meaningful implementation would involve a new pair of (settable) timer or ticker/callback in the overlord loop Overlord.SetHeartbeat(interval, cb) or something like that then the OverlordLoop would have (sketching):
because we don't want to run Ensure on each hearbeat tick the Loop would need a bit of reorg though |
zyga
and others
added some commits
May 10, 2017
zyga
requested changes
May 15, 2017
I think there's one thing that's wrong about error handling. Please have a look.
| + if ticker, err := runWatchdog(d); err != nil { | ||
| + logger.Noticef("cannot run software watchdog: %s", err) | ||
| + } else { | ||
| + defer ticker.Stop() |
zyga
May 15, 2017
Contributor
Hmm, am I missing something? If we fail you just log the message and if we don't fail you ... stop the ticker?
mvo5
Jun 6, 2017
Collaborator
Thanks! You are right about the notify, this is silly because if snapd fails to check-in with systemd, it will be killed anyway. The ticker is stopped in a defer (just like the daemon is stopped). But I chagned the code now to make it more obvious.
| +// inspired by libsystemd/sd-daemon/sd-daemon.c from the systemd source | ||
| +func SdNotify(state string) error { | ||
| + if state == "" { | ||
| + return fmt.Errorf("cannot use empty state") |
zyga
May 15, 2017
Contributor
The term state is misleading as it may imply the snapd state. Not sure how to call this to make it better though
|
what's the state of this PR? |
mvo5
added some commits
Jun 6, 2017
|
I updated the PR and addressed the review feedback. It does not address the concerns from @pedronis yet, if we want this in this PR I think we should close and reopen only once it is reworked. |
|
@mvo5 since I wrote my comment I noticed that especially because of auto-refresh code we might do in Ensure things that can take a fairly long time relatively speaking, we would need to bound them somehow precisely or move them out again to use the Overlord loop for the watchdog functionaility |
|
Given that we actually had no instance of when the system became unresponsive I will close this PR. The current form will not actually help much, all it proves is that a single go-routine in snapd is still alive which is not super helpful. |
mvo5 commentedMar 29, 2017
The CE team was asking for a software watchdog implementation for snapd in the teamcall we had yesterday. This is a first draft. If it looks reasonable I'm happy to add a proper spread test and more unit testing.
The bug in question: https://bugs.launchpad.net/snapd/+bug/1674506