
watchdog: Still fed when rootfs is non-responsive #21083

Open
Malvineous opened this issue Oct 21, 2021 · 16 comments
Labels
needs-discussion 🤔 · pid1 · RFE 🎁 (Request for Enhancement, i.e. a feature request)

Comments

@Malvineous (Contributor)

The systemd watchdog code does not check the filesystem when it feeds the watchdog. This means that if a catastrophic failure stops the root filesystem from responding, the whole machine becomes unusable, but systemd continues to feed the watchdog, so the broken system never recovers. In this situation it is not possible to log in to the machine (locally or remotely), and if there is an active session, no commands can be run, as anything that touches the root filesystem simply freezes.

In my case I am running some Raspberry Pi devices with their root filesystems on NFS, and when I try to reboot them, the network goes away in the middle of the shutdown process. The whole device then freezes waiting for the rootfs to return, which never happens. The watchdog timer is active and should take over and reboot the machine, as it does in other situations (such as a kernel panic); however, in the dead-rootfs scenario systemd continues to feed the watchdog, so the broken system never recovers and requires a power cycle to return to service, defeating the purpose of the watchdog.

I believe that systemd should read a non-cached file from the filesystem each time it tries to feed the watchdog, so a frozen rootfs would cause this process to also freeze, thus preventing the watchdog from being fed and causing a reset.

Note that the file being read cannot be cached from a previous read, as reads of still-cached files do not freeze. And it must be on the root filesystem, as accesses to other filesystems like tmpfs still succeed. Reading the systemctl binary or another systemd-related file would be the ideal choice, provided any disk cache for it can be discarded before the read.
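
For illustration, an uncached health-check read could look something like this (a minimal sketch, assuming O_DIRECT is usable on the filesystem in question, which it is not in every NFS configuration):

# Read one block of a rootfs-backed file with O_DIRECT, so the page
# cache cannot satisfy the read; on a dead rootfs this hangs or fails.
dd if=/usr/bin/systemctl of=/dev/null bs=4096 count=1 iflag=direct || exit 1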

systemd version the issue has been seen with

249.5

Used distribution

Arch Linux ARM

Linux kernel version used (uname -a)

Linux 5.10.74-1-raspberrypi-ARCH #1 SMP Mon Oct 18 17:51:34 UTC 2021 armv7l GNU/Linux

CPU architecture issue was seen on

aarch64

Expected behaviour you didn't see

System should have been rebooted by the watchdog when it stopped responding

Unexpected behaviour you saw

systemd kept feeding the watchdog even though the system was completely broken

Steps to reproduce the problem

  1. Boot a machine with the watchdog timer active
  2. Cause the root filesystem to fail (e.g. root on NFS and unplug network)
  3. Observe the system is unusable (can't log in or run any programs) but despite this it never reboots
@poettering (Member) commented Oct 26, 2021

It sounds weird to involve the watchdog logic in this. If NFS hangs, then NFS should support a policy to detect that and reboot. That is, it's a software problem, detectable by software, and there's no reason to involve discrete hardware like a watchdog in it. NFS hangs should be detectable and actionable on systems lacking a hw watchdog too.

Hence, I am not convinced this is a problem we should make ours, or involve the hw watchdog in.

Or in other words: I'd really talk to the NFS people about this. It sounds far more appropriate to ask them for some policy logic that executes some operation on hangs, or at least a way for them to talk to userspace and request that some action be taken.

poettering added the needs-discussion 🤔, pid1, and RFE 🎁 (Request for Enhancement, i.e. a feature request) labels on Oct 26, 2021
@Malvineous (Contributor, Author)

Thanks for your response!

You make a good point, however as far as I'm aware this is by design from the NFS standpoint. When you mount an NFS filesystem you have two choices: hard or soft. Mounted hard, you get the behaviour described above: I/O operations block indefinitely until the NFS server comes back online. Mounted soft, I/O operations instead return errors after some number of seconds have passed without a response from the NFS server.
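
For reference, the two behaviours are selected at mount time (server:/export is a placeholder):

# "hard" (the default): I/O blocks indefinitely until the server returns.
mount -t nfs -o hard server:/export /mnt
# "soft": I/O fails with an error after retrans retries, each waiting
# timeo tenths of a second for a reply.
mount -t nfs -o soft,timeo=100,retrans=3 server:/export /mnt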

I think the problem is not unique to NFS though. What happens if you boot from a normal storage device and then remove it? (Or the kernel driver crashes, or the device fails, etc.) Once the root filesystem has gone away, you can't run systemctl and you can't initiate the reboot process, but the watchdog process, already in memory and not touching the root filesystem, will continue to feed the watchdog, so the system will never reboot despite being completely unusable. I've had this happen a few times before with a failing disk, and there's nothing you can do in this situation except reach for the reset button, which is exactly what you want the watchdog to do for you automatically.

I take your point that you don't want to involve the watchdog in too many system components, however I believe the root filesystem is a special case because without it, there's no way to reboot the system. I could load my own program to check the root filesystem is available, but if I detect it has gone away, what can I do? I can't tell systemd to reboot because the reboot process fails when the root filesystem is not present, so the only option is to somehow trigger a hard reset, which the watchdog is designed precisely for.

So to me having the systemd watchdog process also check the root filesystem is important, because it's something that is very difficult if not impossible for any other userspace program to handle.

If you are not keen on adding special code just for the root filesystem, what about adding the ability for the watchdog to run user scripts? If I could write my own shell scripts that check whatever is important to me, and have the systemd watchdog process only feed the watchdog if all the scripts terminate successfully, then not only could I have a script that checks root filesystem availability but I could check anything else critical as well. I think this could prove even more useful for embedded systems where they could just have a bunch of scripts checking critical components and if any of them fail, the watchdog quickly resets the device, skipping a potentially slow shutdown process.

What do you think?

@poettering (Member)

I am not disagreeing that the issue is relevant. I just disagree that it has to be addressed in the watchdog subsystem of PID 1, or in any way touch /dev/watchdog0 and similar devices.

It could be a simple service that every now and then checks whether disk I/O still works, and if not, reboots. This can be encapsulated very nicely as an entirely isolated service; no need to involve systemd in that?
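
As a sketch of what such an isolated service could look like (the unit names and the dd-based probe are assumptions for illustration, not an existing systemd facility):

# /etc/systemd/system/rootfs-check.service (hypothetical)
[Unit]
Description=Reboot if rootfs I/O stops responding
FailureAction=reboot-immediate

[Service]
Type=oneshot
TimeoutStartSec=30
ExecStart=/usr/bin/dd if=/usr/bin/systemctl of=/dev/null bs=4096 count=1 iflag=direct

# /etc/systemd/system/rootfs-check.timer (hypothetical)
[Timer]
OnBootSec=1min
OnUnitActiveSec=1min

[Install]
WantedBy=timers.target

If the probe hangs on a dead rootfs, TimeoutStartSec= eventually fails the unit, and FailureAction=reboot-immediate makes PID 1 call reboot(2) directly, without touching the rootfs.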

@Malvineous (Contributor, Author)

How would such a service reliably trigger a reboot? I am happy to do some experiments if you think it can be done without involving the watchdog. Are you thinking of one of the signals listed in the manpage, e.g. SIGRTMIN+15 which "immediately reboots the machine"?

How would this service handle lockups that happen during the startup and particularly the shutdown phase? At the moment every time I try to reboot cleanly with systemctl reboot it locks up towards the end of the process, as it does something to disable the network before it has finished using the root filesystem. Can I tell systemd not to terminate my process even during a shutdown, so that it remains running right up until the actual hardware reset?

From what I have read, during the shutdown phase PID 1 is replaced with a dedicated shutdown binary. The manpage for it is light on detail: will it also respond to SIGRTMIN+15, or is there another method by which a running process can tell it to reboot immediately?

@poettering (Member)

> How would such a service reliably trigger a reboot? I am happy to do some experiments if you think it can be done without involving the watchdog. Are you thinking of one of the signals listed in the manpage, e.g. SIGRTMIN+15 which "immediately reboots the machine"?

Yes.
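
From a root shell that looks like (bash understands the RTMIN+n signal names):

# Ask PID 1 for an immediate reboot; see the SIGRTMIN+15 entry in systemd(1).
kill -s SIGRTMIN+15 1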

> How would this service handle lockups that happen during the startup and particularly the shutdown phase? At the moment every time I try to reboot cleanly with systemctl reboot it locks up towards the end of the process, as it does something to disable the network before it has finished using the root filesystem. Can I tell systemd not to terminate my process even during a shutdown, so that it remains running right up until the actual hardware reset?

Yes, you can do that. See https://systemd.io/ROOT_STORAGE_DAEMONS

That said, we don't actually need to make use of that: the shutdown logic enables the hw watchdog anyway, to ensure that we'll reboot sooner or later.
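
The mechanism that document describes is setting the first character of the daemon's argv[0] to '@', which marks the process to be spared during the shutdown killing spree. From bash that could look like (the daemon path is a hypothetical example):

# Per systemd.io/ROOT_STORAGE_DAEMONS, argv[0][0] == '@' means "leave
# this process alone at shutdown"; bash's `exec -a` sets argv[0].
exec -a '@fs-watchdog' /run/wd/fs-watchdog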

> From what I have read, during the shutdown phase PID 1 is replaced with a dedicated shutdown binary. The manpage for it is light on detail: will it also respond to SIGRTMIN+15, or is there another method by which a running process can tell it to reboot immediately?

First of all, the daemon could then just call the reboot() syscall on its own, without calling sync() or the like. But I think the hw watchdog handling that is done during the final shutdown logic anyway is a fine safety net already.

Also note that in order to implement properly what you are looking for, you probably need a statically linked binary that is locked into memory and no longer backed by files on disk. I.e. the main daemon binary needs to be moved to a 'ramfs' or so, so that it is never backed by storage, and then executed from there. Once it detects hangs on the file systems it cares about, it should first issue SIGRTMIN+15, and then maybe fall back to issuing reboot() itself after a timeout.
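
In shell terms, roughly (a sketch only; the real thing would be the statically linked daemon described above, and since shell has no direct reboot(2), sysrq 'b' stands in for it):

# Hypothetical setup: run the checker from a ramfs so it is never
# backed by rootfs storage (ramfs, unlike tmpfs, cannot be swapped out).
mkdir -p /run/wd
mount -t ramfs ramfs /run/wd
cp /usr/local/bin/fs-watchdog /run/wd/

# ...and inside that daemon, once a hang is detected:
kill -s SIGRTMIN+15 1         # ask PID 1 for an immediate reboot
sleep 30                      # grace period (arbitrary)
echo b > /proc/sysrq-trigger  # unclean reboot if PID 1 never reacted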

@yangm97 commented Nov 9, 2021

Another scenario where this would be rather useful is when hardware I/O errors lock up the filesystem. I can almost reliably reproduce said "I/O issue" by using a knockoff micro SD card as the rootfs. Issuing a TRIM command against one of these is almost certain to kill it in a way that, afaik, only a power cycle can fix.

Despite this reproduction being a very specific corner case, "more legitimate" environments do face similar issues, granted not as often and/or not as easily reproduced.

> Also note that in order to implement properly what you are looking for, you probably need a statically linked binary that is locked into memory and no longer backed by files on disk. I.e. the main daemon binary needs to be moved to a 'ramfs' or so, so that it is never backed by storage, and then executed from there. Once it detects hangs on the file systems it cares about, it should first issue SIGRTMIN+15, and then maybe fall back to issuing reboot() itself after a timeout.

Say you have a service with ExecStart=/bin/touch /.foo and FailureAction=reboot-immediate, triggered constantly by a timer. Shouldn't it be able to "fail successfully" and trigger the reboot?
I'm not well versed in the internals of systemd but, if the solution above is theoretically supposed to work, it would seem to me that systemd already has all (or most of) the machinery needed to provide this functionality natively.
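
Spelled out, that idea would look roughly like this (file names hypothetical; whether the timeout machinery itself still makes progress once the rootfs is dead is exactly what would need testing):

# /etc/systemd/system/rootfs-touch.service
[Unit]
Description=Reboot if the rootfs stops accepting writes
FailureAction=reboot-immediate

[Service]
Type=oneshot
TimeoutStartSec=15
ExecStart=/bin/touch /.foo

# /etc/systemd/system/rootfs-touch.timer
[Timer]
OnBootSec=30s
OnUnitActiveSec=30s

[Install]
WantedBy=timers.target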

@trentbuck

I use /dev/watchdog to detect when the rootfs (NFS squashfs + tmpfs cow) goes away or becomes corrupt (e.g. filesystem.squashfs is updated on the NFS server).
When this happens, /sbin/reboot often will not work, because either /sbin/reboot isn't already cached in RAM, or (somehow) the system D-Bus is dead. In such cases ctrl+alt+del also does not work, because systemd can't find enough of itself to do a clean reboot. The system then hangs forever: powered on, but unusable and unrebootable.

This is why /dev/watchdog is used, because it works even when the system is too damaged to reboot cleanly.

As a proxy for "is NFS dead?" I ping an httpd on the NFS server.
Very roughly, I do this:

modprobe iTCO_wdt nowayout=1 heartbeat=60 &
modprobe softdog  nowayout=1   timeout=60 &

exec 99>/dev/watchdog
echo pet >&99
while curl -sSfL https://nfs
do   echo pet >&99
     sleep 10
done
# try to do a clean reset
reboot &
# try to do a hard reset (sysrq 'b' reboots immediately; 'o' would power off)
echo b >/proc/sysrq-trigger
# wait and hope the watchdog will save us
sleep infinity

This is sufficient for my needs even when using the in-kernel watchdog "softdog".

Re "It could be a simple service that every now and then checks if disk IO still works, and if not reboots.",
I do not see how this could work, because nfs (-o hard) blocks in D state (i.e. kernel-side).
The process cannot detect react to a dead NFS server because it cannot run until NFS comes back.

@trentbuck

PS (getting off-topic): re "I believe that systemd should read a non-cached file from the filesystem" -- AFAIK the in-kernel Linux NFS client caches pretty aggressively and doesn't provide any way to explicitly flush the cache of a single file.

Currently I do "umount -a -t nfs && mount -a -t nfs" to work around cache-related database corruption in sqlite journal_mode=WAL. That MIGHT just be close-to-open caching, though -- I have not diagnosed it properly.
You could also write to /proc/sys/vm/drop_caches, which is system-wide -- even worse than filesystem-wide.
This random search result suggests an unnecessary opendir()+closedir() to flush the CTO cache for a directory: https://stackoverflow.com/questions/8311710/nfs-cache-cleaning-command
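
In shell terms, the two blunt options above:

# Remount all NFS filesystems to discard their caches:
umount -a -t nfs && mount -a -t nfs
# Drop page cache, dentries and inodes system-wide (every filesystem):
echo 3 > /proc/sys/vm/drop_caches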

@EternityForest

On the Raspberry Pi it appears that a bad power supply can cause some kind of unrecoverable state of I/O errors. There are so many unusual combinations of failures that can happen that I don't think it makes sense to depend on anything but hardware.

Any process you create to catch a given failure could itself fail, whereas the watchdog hardware keeps running.

I'd like to see some kind of watchdog callback mechanism that lets you stop feeding it if a user-defined script, with a reasonable default, fails.

@Malvineous (Contributor, Author)

I think that's probably the most flexible solution. Your health-check script could then check whatever is important to you (disk, network, external USB device, etc.) and respond OK if all is good, letting systemd feed the watchdog. If your script fails to report in for any reason, the watchdog will eventually expire. Potentially you could have a number of watchdog.d-style scripts, each of which must report OK (or "not applicable") for the watchdog to be fed; a rough sketch follows below.

The only trick then is what to do during the shutdown/restart phase, e.g. if shutdown takes longer than the maximum watchdog expiry, as it can on the Pi.
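
A rough shell sketch of that watchdog.d contract (entirely hypothetical; no such directory exists in systemd today, and Debian's run-parts is assumed for its --exit-on-error flag):

# Feed /dev/watchdog only while every check in /etc/watchdog.d passes.
exec 3>/dev/watchdog
while run-parts --exit-on-error /etc/watchdog.d
do   echo feed >&3
     sleep 10
done
# a check failed: stop feeding and let the hardware timer expire
sleep infinity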

@EternityForest

Yeah, the ability to have multiple scripts seems like a good idea.

I suppose you could have a new unit type for checkers that specifies when it is active, along the lines of "run this check as long as this other unit is meant to be running".

You could then add other nice-to-have features, like allowing N failures before actually taking action.

You could poll only the units that actually have a watchdog flag enabled, so it could double as a general-purpose alert system that other services poll, to act on less severe alerts, or on alerts that haven't been in the failed state long enough to trigger a reset.

@tomasz-torcz-airspace-intelligence

No need for a new unit type. We already have FailureAction=, which can reboot the system on unit failure.

@EternityForest

FailureAction would still require that the code to reboot actually be functional, right?

What about a new RequiredByWatchdog for timers? That way the hardware can reset even if things are so corrupt that "sudo reboot now" would just give an IOError.

@Malvineous (Contributor, Author)

Yeah, FailureAction= only works if the unit exits and cannot be restarted. If the program freezes (due to a logic error, or a blocked syscall) then the process will never exit, so FailureAction= will never trigger. But the watchdog will keep being fed, and in this scenario we want the watchdog to stop being fed when the process freezes.

@tomasz-torcz-airspace-intelligence

If the program freezes, it will stop pinging the per-unit watchdog (sd_notify(3) with WatchdogSec=), and systemd will trigger the failure action.
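
Concretely (WatchdogSec=, NotifyAccess= and FailureAction= are real options; the checker script itself is a hypothetical sketch, and NotifyAccess=all is needed because systemd-notify sends from a short-lived child process):

# /etc/systemd/system/health-check.service
[Unit]
Description=Health checks supervised by the per-unit watchdog
FailureAction=reboot-immediate

[Service]
Type=notify
NotifyAccess=all
WatchdogSec=30
Restart=no
ExecStart=/usr/local/bin/health-loop.sh

# /usr/local/bin/health-loop.sh (hypothetical)
#!/bin/sh
systemd-notify --ready
while dd if=/usr/bin/systemctl of=/dev/null bs=4096 count=1 iflag=direct
do   systemd-notify WATCHDOG=1
     sleep 10
done
# stop pinging: after WatchdogSec= systemd aborts and fails the unit,
# and FailureAction= then reboots
sleep infinity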

