
watchdog: Still fed when rootfs is non-responsive #21083

Open
Malvineous opened this issue Oct 21, 2021 · 16 comments
Labels
needs-discussion 🤔 · pid1 · RFE 🎁 (Request for Enhancement, i.e. a feature request)

Comments

@Malvineous (Contributor)

The systemd watchdog code does not check the filesystem when it feeds the watchdog. This means that if a catastrophic failure stops the root filesystem from responding, the whole machine becomes unusable, but systemd continues to feed the watchdog, so the broken system never recovers. In this situation it is not possible to log in to the machine (locally or remotely), and if there is an active session, no commands can be run, as anything that touches the root filesystem simply freezes.

In my case I am running some Raspberry Pi devices with their root filesystems on NFS, and when I try to reboot them, the network goes away in the middle of the shutdown process. The whole device then freezes waiting for the rootfs to return, which never happens. The watchdog timer is active and should take over and reboot the machine, as it does in other situations (such as a kernel panic); however, in the dead-rootfs scenario systemd continues to feed the watchdog, so the broken system never recovers and requires a power cycle to return to service, defeating the purpose of the watchdog.

I believe that systemd should read a non-cached file from the filesystem each time it tries to feed the watchdog, so a frozen rootfs would cause this process to also freeze, thus preventing the watchdog from being fed and causing a reset.

Note that the file being read cannot be cached from a previous read, as reads of still-cached files do not freeze. And it must be on the root filesystem, as accesses to other filesystems like tmpfs still succeed. Reading the systemctl binary or another systemd-related file would be the ideal choice, provided any disk cache for it can be discarded before the read.
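
For illustration, an uncached health-check read could look something like this (a minimal sketch, assuming O_DIRECT is usable on the filesystem in question, which it is not in every NFS configuration):

# Read one block of a rootfs-backed file with O_DIRECT, so the page
# cache cannot satisfy the read; on a dead rootfs this hangs or fails.
dd if=/usr/bin/systemctl of=/dev/null bs=4096 count=1 iflag=direct || exit 1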

systemd version the issue has been seen with

249.5

Used distribution

Arch Linux ARM

Linux kernel version used (uname -a)

Linux 5.10.74-1-raspberrypi-ARCH #1 SMP Mon Oct 18 17:51:34 UTC 2021 armv7l GNU/Linux

CPU architecture issue was seen on

aarch64

Expected behaviour you didn't see

System should have been rebooted by the watchdog when it stopped responding

Unexpected behaviour you saw

systemd kept feeding the watchdog even though the system was completely broken

Steps to reproduce the problem

  1. Boot a machine with the watchdog timer active
  2. Cause the root filesystem to fail (e.g. root on NFS and unplug network)
  3. Observe the system is unusable (can't log in or run any programs) but despite this it never reboots
@poettering (Member) commented Oct 26, 2021

It sounds weird to involve the watchdog logic in this. If NFS hangs, then NFS should support a policy to detect that and reboot. That is, it's a software problem, detectable by software, and there's no reason to involve discrete hardware like a watchdog in it. NFS hangs should be detectable and actionable on systems lacking a hw watchdog too.

Hence, I am not convinced this is a problem we should make ours, or involve the hw watchdog in.

Or in other words: I'd really talk to the NFS people about this. It sounds far more appropriate to ask them for some policy logic that executes some operation on hangs, or at least a way for them to talk to userspace and request that some action be taken.

poettering added the needs-discussion 🤔, pid1, and RFE 🎁 (Request for Enhancement, i.e. a feature request) labels on Oct 26, 2021
@Malvineous (Contributor, Author)

Thanks for your response!

You make a good point, however as far as I'm aware this is by design from the NFS standpoint. When you mount an NFS filesystem you have two choices: hard or soft. Mounted hard, you get the behaviour described above: I/O operations block indefinitely until the NFS server comes back online. Mounted soft, I/O operations instead return errors after some number of seconds have passed without a response from the NFS server.
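
For reference, the two behaviours are selected at mount time (server:/export is a placeholder):

# "hard" (the default): I/O blocks indefinitely until the server returns.
mount -t nfs -o hard server:/export /mnt
# "soft": I/O fails with an error after retrans retries, each waiting
# timeo tenths of a second for a reply.
mount -t nfs -o soft,timeo=100,retrans=3 server:/export /mnt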

I think the problem is not unique to NFS though. What happens if you boot from a normal storage device and then remove it? (Or the kernel driver crashes, or the device fails, etc.) Once the root filesystem has gone away, you can't run systemctl and you can't initiate the reboot process, but the watchdog process, already in memory and not touching the root filesystem, will continue to feed the watchdog, so the system will never reboot despite being completely unusable. I've had this happen a few times before with a failing disk, and there's nothing you can do in this situation except reach for the reset button, which is exactly what you want the watchdog to do for you automatically.

I take your point that you don't want to involve the watchdog in too many system components, however I believe the root filesystem is a special case because without it, there's no way to reboot the system. I could load my own program to check the root filesystem is available, but if I detect it has gone away, what can I do? I can't tell systemd to reboot because the reboot process fails when the root filesystem is not present, so the only option is to somehow trigger a hard reset, which the watchdog is designed precisely for.

So to me having the systemd watchdog process also check the root filesystem is important, because it's something that is very difficult if not impossible for any other userspace program to handle.

If you are not keen on adding special code just for the root filesystem, what about adding the ability for the watchdog to run user scripts? If I could write my own shell scripts that check whatever is important to me, and have the systemd watchdog process only feed the watchdog if all the scripts terminate successfully, then not only could I have a script that checks root filesystem availability but I could check anything else critical as well. I think this could prove even more useful for embedded systems where they could just have a bunch of scripts checking critical components and if any of them fail, the watchdog quickly resets the device, skipping a potentially slow shutdown process.

What do you think?

@poettering (Member)

I am not disagreeing that the issue is relevant. I just disagree that it has to be addressed in the watchdog subsystem of PID 1, or in any way touch /dev/watchdog0 and similar devices.

It could be a simple service that every now and then checks whether disk I/O still works, and if not, reboots. This can be encapsulated very nicely as an entirely isolated service; no need to involve systemd in that?
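
As a sketch of what such an isolated service could look like (the unit names and the dd-based probe are assumptions for illustration, not an existing systemd facility):

# /etc/systemd/system/rootfs-check.service (hypothetical)
[Unit]
Description=Reboot if rootfs I/O stops responding
FailureAction=reboot-immediate

[Service]
Type=oneshot
TimeoutStartSec=30
ExecStart=/usr/bin/dd if=/usr/bin/systemctl of=/dev/null bs=4096 count=1 iflag=direct

# /etc/systemd/system/rootfs-check.timer (hypothetical)
[Timer]
OnBootSec=1min
OnUnitActiveSec=1min

[Install]
WantedBy=timers.target

If the probe hangs on a dead rootfs, TimeoutStartSec= eventually fails the unit, and FailureAction=reboot-immediate makes PID 1 call reboot(2) directly, without touching the rootfs.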

@Malvineous (Contributor, Author)

How would such a service reliably trigger a reboot? I am happy to do some experiments if you think it can be done without involving the watchdog. Are you thinking of one of the signals listed in the manpage, e.g. SIGRTMIN+15 which "immediately reboots the machine"?

How would this service handle lockups that happen during the startup and particularly the shutdown phase? At the moment every time I try to reboot cleanly with systemctl reboot it locks up towards the end of the process, as it does something to disable the network before it has finished using the root filesystem. Can I tell systemd not to terminate my process even during a shutdown, so that it remains running right up until the actual hardware reset?

From what I have read, during the shutdown phase PID 1 is replaced with a dedicated shutdown binary. The manpage for it is light on detail: will it also respond to SIGRTMIN+15, or is there another method by which a running process can tell it to reboot immediately?

@poettering (Member)

> How would such a service reliably trigger a reboot? I am happy to do some experiments if you think it can be done without involving the watchdog. Are you thinking of one of the signals listed in the manpage, e.g. SIGRTMIN+15 which "immediately reboots the machine"?

Yes.
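
From a root shell that looks like (bash understands the RTMIN+n signal names):

# Ask PID 1 for an immediate reboot; see the SIGRTMIN+15 entry in systemd(1).
kill -s SIGRTMIN+15 1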

> How would this service handle lockups that happen during the startup and particularly the shutdown phase? At the moment every time I try to reboot cleanly with systemctl reboot it locks up towards the end of the process, as it does something to disable the network before it has finished using the root filesystem. Can I tell systemd not to terminate my process even during a shutdown, so that it remains running right up until the actual hardware reset?

Yes, you can do that. See https://systemd.io/ROOT_STORAGE_DAEMONS

That said, we don't actually need to make use of that: the shutdown logic enables the hw watchdog anyway, to ensure that we'll reboot sooner or later.
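
The mechanism that document describes is setting the first character of the daemon's argv[0] to '@', which marks the process to be spared during the shutdown killing spree. From bash that could look like (the daemon path is a hypothetical example):

# Per systemd.io/ROOT_STORAGE_DAEMONS, argv[0][0] == '@' means "leave
# this process alone at shutdown"; bash's `exec -a` sets argv[0].
exec -a '@fs-watchdog' /run/wd/fs-watchdog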

> From what I have read, during the shutdown phase PID 1 is replaced with a dedicated shutdown binary. The manpage for it is light on detail: will it also respond to SIGRTMIN+15, or is there another method by which a running process can tell it to reboot immediately?

First of all, the daemon could then just call the reboot() syscall on its own, without calling sync() or the like. But I think the hw watchdog handling that is done during the final shutdown logic anyway is a fine safety net already.

Also note that in order to implement properly what you are looking for, you probably need a statically linked binary that is locked into memory and no longer backed by files on disk. I.e. the main daemon binary needs to be moved to a 'ramfs' or so, so that it is never backed by storage, and then executed from there. Once it detects hangs on the file systems it cares about, it should first issue SIGRTMIN+15, and then maybe fall back to issuing reboot() itself after a timeout.
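
In shell terms, roughly (a sketch only; the real thing would be the statically linked daemon described above, and since shell has no direct reboot(2), sysrq 'b' stands in for it):

# Hypothetical setup: run the checker from a ramfs so it is never
# backed by rootfs storage (ramfs, unlike tmpfs, cannot be swapped out).
mkdir -p /run/wd
mount -t ramfs ramfs /run/wd
cp /usr/local/bin/fs-watchdog /run/wd/

# ...and inside that daemon, once a hang is detected:
kill -s SIGRTMIN+15 1         # ask PID 1 for an immediate reboot
sleep 30                      # grace period (arbitrary)
echo b > /proc/sysrq-trigger  # unclean reboot if PID 1 never reacted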

@yangm97 commented Nov 9, 2021

Another scenario where this would be rather useful is when hardware I/O errors lock up the filesystem. I can almost reliably reproduce said "I/O issue" by using a knockoff micro SD card as the rootfs. Issuing a TRIM command against one of these is almost certain to kill it in a way that, afaik, only a power cycle can fix.

Despite this reproduction being a very specific corner case, "more legitimate" environments do face similar issues, granted not as often and/or not as easily reproduced.

> Also note that in order to implement properly what you are looking for, you probably need a statically linked binary that is locked into memory and no longer backed by files on disk. I.e. the main daemon binary needs to be moved to a 'ramfs' or so, so that it is never backed by storage, and then executed from there. Once it detects hangs on the file systems it cares about, it should first issue SIGRTMIN+15, and then maybe fall back to issuing reboot() itself after a timeout.

Say you have a service with ExecStart=/bin/touch /.foo and FailureAction=reboot-immediate, triggered constantly by a timer. Shouldn't it be able to "fail successfully" and trigger the reboot?
I'm not well versed in the internals of systemd but, if the solution above is theoretically supposed to work, it would seem to me that systemd already has all (or most of) the machinery needed to provide this functionality natively.
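
Spelled out, that idea would look roughly like this (file names hypothetical; whether the timeout machinery itself still makes progress once the rootfs is dead is exactly what would need testing):

# /etc/systemd/system/rootfs-touch.service
[Unit]
Description=Reboot if the rootfs stops accepting writes
FailureAction=reboot-immediate

[Service]
Type=oneshot
TimeoutStartSec=15
ExecStart=/bin/touch /.foo

# /etc/systemd/system/rootfs-touch.timer
[Timer]
OnBootSec=30s
OnUnitActiveSec=30s

[Install]
WantedBy=timers.target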

@trentbuck

I use /dev/watchdog to detect when the rootfs (NFS squashfs + tmpfs cow) goes away or becomes corrupt (e.g. filesystem.squashfs is updated on the NFS server).
When this happens, /sbin/reboot often will not work, because either /sbin/reboot isn't already cached in RAM, or (somehow) the system D-Bus is dead. In such cases ctrl+alt+del also does not work, because systemd can't find enough of itself to do a clean reboot. The system then hangs forever: powered on, but unusable and unrebootable.

This is why /dev/watchdog is used, because it works even when the system is too damaged to reboot cleanly.

As a proxy for "is NFS dead?" I ping an httpd on the NFS server.
Very roughly, I do this:

modprobe iTCO_wdt nowayout=1 heartbeat=60 &
modprobe softdog  nowayout=1   timeout=60 &

exec 99>/dev/watchdog
echo pet >&99
while curl -sSfL https://nfs
do   echo pet >&99
     sleep 10
done
# try to do a clean reset
reboot &
# try to do a hard reset (sysrq 'b' reboots immediately; 'o' would power off)
echo b >/proc/sysrq-trigger
# wait and hope the watchdog will save us
sleep infinity

This is sufficient for my needs even when using the in-kernel watchdog "softdog".

Re "It could be a simple service that every now and then checks if disk IO still works, and if not reboots.",
I do not see how this could work, because nfs (-o hard) blocks in D state (i.e. kernel-side).
The process cannot detect react to a dead NFS server because it cannot run until NFS comes back.

@trentbuck

PS (getting off-topic): re "I believe that systemd should read a non-cached file from the filesystem" -- AFAIK the in-kernel Linux NFS client caches pretty aggressively and doesn't provide any way to explicitly flush the cache of a single file.

Currently I do "umount -a -t nfs && mount -a -t nfs" to work around cache-related database corruption in sqlite journal_mode=WAL. That MIGHT just be close-to-open caching, though -- I have not diagnosed it properly.
You could also write to /proc/sys/vm/drop_caches, which is system-wide -- even worse than filesystem-wide.
This random search result suggests an unnecessary opendir()+closedir() to flush the CTO cache for a directory: https://stackoverflow.com/questions/8311710/nfs-cache-cleaning-command
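
In shell terms, the two blunt options above:

# Remount all NFS filesystems to discard their caches:
umount -a -t nfs && mount -a -t nfs
# Drop page cache, dentries and inodes system-wide (every filesystem):
echo 3 > /proc/sys/vm/drop_caches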

@EternityForest

On the Raspberry Pi it appears that a bad power supply can cause some kind of unrecoverable state of I/O errors. There are so many unusual combinations of failures that can happen that I don't think it makes sense to depend on anything but hardware.

Any process you create to catch a given failure could itself fail, whereas the watchdog hardware keeps running.

I'd like to see some kind of watchdog callback mechanism that lets you stop feeding it if a user-defined script, with a reasonable default, fails.

@Malvineous (Contributor, Author)

I think that's probably the most flexible solution. Your health-check script could then check whatever is important to you (disk, network, external USB device, etc.) and respond OK if all is good, letting systemd feed the watchdog. If your script fails to report in for any reason, the watchdog will eventually expire. Potentially you could have a number of watchdog.d-style scripts, each of which must report OK (or "not applicable") for the watchdog to be fed; a rough sketch follows below.

The only trick then is what to do during the shutdown/restart phase, e.g. if shutdown takes longer than the maximum watchdog expiry, as it can on the Pi.
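
A rough shell sketch of that watchdog.d contract (entirely hypothetical; no such directory exists in systemd today, and Debian's run-parts is assumed for its --exit-on-error flag):

# Feed /dev/watchdog only while every check in /etc/watchdog.d passes.
exec 3>/dev/watchdog
while run-parts --exit-on-error /etc/watchdog.d
do   echo feed >&3
     sleep 10
done
# a check failed: stop feeding and let the hardware timer expire
sleep infinity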

@EternityForest

Yeah, the ability to have multiple scripts seems like a good idea.

I suppose you could have a new unit type for checkers that specifies when it is active, along the lines of "run this check as long as this other unit is meant to be running".

You could then add other nice-to-have features, like allowing N failures before actually taking action.

You could poll only the units that actually have a watchdog flag enabled, so it could double as a general-purpose alert system that other services poll, to act on less severe alerts, or on alerts that haven't been in the failed state long enough to trigger a reset.

@tomasz-torcz-airspace-intelligence

No need for a new unit type. We already have FailureAction=, which can reboot the system on unit failure.

@EternityForest

FailureAction would still require that the code to reboot actually be functional, right?

What about a new RequiredByWatchdog for timers? That way the hardware can reset even if things are so corrupt that "sudo reboot now" would just give an IOError.

@Malvineous (Contributor, Author)

Yeah, FailureAction= only works if the unit exits and cannot be restarted. If the program freezes (due to a logic error, or a blocked syscall) then the process will never exit, so FailureAction= will never trigger. But the watchdog will keep being fed, and in this scenario we want the watchdog to stop being fed when the process freezes.

@tomasz-torcz-airspace-intelligence

If the program freezes, it will stop pinging the per-unit watchdog (sd_notify(3) with WatchdogSec=), and systemd will trigger the failure action.
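
Concretely (WatchdogSec=, NotifyAccess= and FailureAction= are real options; the checker script itself is a hypothetical sketch, and NotifyAccess=all is needed because systemd-notify sends from a short-lived child process):

# /etc/systemd/system/health-check.service
[Unit]
Description=Health checks supervised by the per-unit watchdog
FailureAction=reboot-immediate

[Service]
Type=notify
NotifyAccess=all
WatchdogSec=30
Restart=no
ExecStart=/usr/local/bin/health-loop.sh

# /usr/local/bin/health-loop.sh (hypothetical)
#!/bin/sh
systemd-notify --ready
while dd if=/usr/bin/systemctl of=/dev/null bs=4096 count=1 iflag=direct
do   systemd-notify WATCHDOG=1
     sleep 10
done
# stop pinging: after WatchdogSec= systemd aborts and fails the unit,
# and FailureAction= then reboots
sleep infinity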

