watchdog: Still fed when rootfs is non-responsive #21083
It sounds weird to involve the watchdog logic in this. If NFS hangs then NFS should support a policy to detect that and reboot. i.e. it's a software problem, detectable by software, and there's no reason to involve any discrete hardware like a watchdog in that. NFS hangs should be detectable and actionable on systems lacking a hw watchdog too. Hence, I am not convinced this is a problem we should make ours, or involve the hw watchdog in. Or in other words: I'd really talk to the NFS people about this, i.e. it sounds much more appropriate to ask them for some policy logic so that they can execute some operation on hangs. Or maybe at least a way for them to talk to userspace and request some action to be done.
Thanks for your response! You make a good point, however as far as I'm aware this is by design from the NFS standpoint. When you mount the NFS filesystem, you have two choices: you can mount it `soft`, in which case operations eventually give up and return an error, or `hard`, in which case they block indefinitely until the server comes back.

I think the problem is not unique to NFS though. What happens if you boot from a normal storage device and then remove it? (Or the kernel driver crashes, or the device fails, etc.) Once the root filesystem has gone away, you still won't be able to run `reboot`.

I take your point that you don't want to involve the watchdog in too many system components, however I believe the root filesystem is a special case because without it, there's no way to reboot the system. I could load my own program to check the root filesystem is available, but if I detect it has gone away, what can I do? I can't tell systemd to reboot because the reboot process fails when the root filesystem is not present, so the only option is to somehow trigger a hard reset, which is precisely what the watchdog is designed for. So to me having the systemd watchdog process also check the root filesystem is important, because it's something that is very difficult if not impossible for any other userspace program to handle.

If you are not keen on adding special code just for the root filesystem, what about adding the ability for the watchdog to run user scripts? If I could write my own shell scripts that check whatever is important to me, and have the systemd watchdog process only feed the watchdog if all the scripts terminate successfully, then not only could I have a script that checks root filesystem availability but I could check anything else critical as well. I think this could prove even more useful for embedded systems, which could just have a bunch of scripts checking critical components; if any of them fail, the watchdog quickly resets the device, skipping a potentially slow shutdown process. What do you think?
I am not disagreeing that the issue is relevant. I just disagree that it has to be addressed in the watchdog subsystem of PID 1, or in any way touch /dev/watchdog0 and similar devices. It could be a simple service that every now and then checks if disk IO still works, and if not reboots. This can be encapsulated very nicely as an entirely isolated service, no need to involve systemd in that?
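Such a standalone checker could be sketched roughly like this (a hypothetical sh script; the file path, timeouts, `--daemon` flag, and the reboot command are illustrative assumptions, not an existing systemd facility):

```shell
#!/bin/sh
# rootfs_alive: succeed only if an uncached read from FILE completes within
# DEADLINE seconds. iflag=direct (GNU dd) bypasses the page cache, so a hung
# backing store makes the read block instead of succeeding from cache, and
# timeout(1) turns that hang into a nonzero exit status.
rootfs_alive() {
    file=${1:-/usr/bin/env}   # any file on the root filesystem
    deadline=${2:-10}         # seconds before we declare it hung
    timeout "$deadline" dd if="$file" of=/dev/null \
        bs=4096 count=1 iflag=direct 2>/dev/null
}

# When run as a service (e.g. ExecStart=/usr/local/bin/rootfs-check --daemon),
# poll forever. Note the reboot command itself must not depend on the dead
# filesystem, which is exactly the catch debated later in this thread.
if [ "${1:-}" = "--daemon" ]; then
    while sleep 30; do
        rootfs_alive || systemctl reboot --force
    done
fi
```

The O_DIRECT read matters: a plain `cat` of a recently-read file can succeed from the page cache even when the backing store is gone.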
How would such a service reliably trigger a reboot? I am happy to do some experiments if you think it can be done without involving the watchdog. Are you thinking of one of the signals listed in the manpage, e.g. `SIGRTMIN+15`?

How would this service handle lockups that happen during the startup and particularly the shutdown phase? At the moment, every time I try to reboot cleanly, the device locks up partway through the shutdown process.

From what I have read, during the shutdown phase PID 1 is replaced with a dedicated shutdown binary. The manpage for it is light on detail - will it also respond to `SIGRTMIN+15`?
Yes.
Yes, you can do that. See https://systemd.io/ROOT_STORAGE_DAEMONS That said, we don't actually need to make use of that: the shutdown logic enables the hw watchdog anyway, to ensure that we'll reboot sooner or later.
First of all, the daemon could just call the reboot() syscall on its own then, without calling sync() or so. But I think the hw watchdog stuff that is done during the final shutdown logic anyway is a fine safety net already.

Also note that in order to implement properly what you are looking for, you probably need a statically linked binary that is locked into memory and not backed by files on disk anymore. i.e. the main daemon binary needs to be moved to a 'ramfs' or so, so that it is never backed by storage, and then executed from there, and then do its thing: once it detects hangs on the file systems it cares about it should first issue SIGRTMIN+15, and then maybe fall back to issuing reboot() itself after a timeout.
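The escalation outlined here (SIGRTMIN+15 first, a harder reboot as a fallback) might be sketched as follows. This is a hedged illustration only: the real daemon would be statically linked, copied to a ramfs, and locked in memory as described above, and the commands are parameterized purely so the control flow can be exercised without rebooting anything.

```shell
#!/bin/sh
# escalate: ask PID 1 for an immediate reboot, and if the system is still up
# after a grace period, force a kernel-level reboot that needs no userspace.
escalate() {
    soft=${1:-'kill -s RTMIN+15 1'}            # systemd: immediate reboot
    hard=${2:-'echo b > /proc/sysrq-trigger'}  # SysRq: emergency reboot
    grace=${3:-30}                             # seconds between the two
    sh -c "$soft"
    sleep "$grace"
    sh -c "$hard"   # only reached if the soft reboot never happened
}
```

Per systemd's documentation, SIGRTMIN+15 sent to PID 1 requests an immediate reboot without shutting services down first; the SysRq `b` trigger reboots from inside the kernel without touching userspace at all.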
Another scenario where this would be rather useful is when hardware I/O errors lock up the filesystem. I can almost reliably reproduce said "I/O issue" by using a knockoff micro SD card as the rootfs. Issuing a trim command against one of these is almost certain to kill it in a way that, afaik, only a power cycle can fix. Despite this reproduction being a very specific corner case, "more legitimate" environments do face similar issues, granted not as often and/or not as easy to reproduce.
Say you have a service with …
I use /dev/watchdog to detect when the rootfs (NFS squashfs + tmpfs cow) goes away or becomes corrupt (e.g. filesystem.squashfs is updated on the NFS server). This is why /dev/watchdog is used, because it works even when the system is too damaged to reboot cleanly. As a proxy for "is NFS dead?" I ping an httpd on the NFS server.
This is sufficient for my needs even when using the in-kernel watchdog "softdog". Re "It could be a simple service that every now and then checks if disk IO still works, and if not reboots": …
PS (getting off-topic): re "I believe that systemd should read a non-cached file from the filesystem" -- AFAIK the in-kernel Linux NFS client caches pretty aggressively and doesn't provide any way to explicitly flush the cache of a single file. Currently I do "umount -a -t nfs && mount -a -t nfs" to work around cache-related database corruption in sqlite journal_mode=WAL. That MIGHT just be close-to-open caching, though -- I have not diagnosed it properly.
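The "ping an httpd as a proxy for NFS health" check described above could be as small as this (hypothetical URL and deadline; the function name is illustrative):

```shell
#!/bin/sh
# nfs_server_alive: succeed only if the NFS server's httpd answers within the
# deadline; used as a stand-in for "is the NFS server still reachable?".
# --fail makes HTTP error statuses count as failure, --max-time bounds hangs.
nfs_server_alive() {
    curl --silent --fail --max-time "${2:-5}" \
        "${1:-http://nfs-server.example/ping}" >/dev/null
}
```

Note this only detects server-side or network death; a wedged local NFS mount with a healthy server would still pass.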
On the Raspberry Pi it appears that a bad power supply can cause some kind of unrecoverable state of I/O errors. There are so many unusual combinations of failures that can happen that I don't think it makes sense to unnecessarily depend on anything but hardware. Any process you create that's supposed to catch a given failure could itself fail, while the hardware watchdog keeps running. I'd like to see some kind of watchdog callback mechanism that lets you stop feeding it if a user-defined script, with a reasonable default, fails.
I think that's probably the most flexible solution. Your health-check script could then check whatever is important to you (disk, network, external USB device, etc.) and respond OK if all is good, and systemd can then feed the watchdog. If your script fails to report in for any reason, then eventually the watchdog will expire. Potentially you could have a number of such scripts. The only trick then is what to do during the shutdown/restart phase, e.g. if shutdown takes longer than the maximum watchdog expiry, as it could on the Pi.
Yeah, the ability to have multiple scripts seems like a good idea. I suppose you could have a new unit type for checkers that could specify when it's active, in terms of "check this script as long as this other unit is intended to be running". You could then add other nice-to-have features, like allowing N failures before actually taking action. You could poll only the units that actually had a watchdog flag enabled, so it could also serve as a general-purpose alert system that other services could poll, to take action on less severe alerts, or on alerts that haven't been in the failed state long enough to trigger a reset.
No need for a new unit type. We already have FailureAction=, which can reboot the system on unit failure.
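For reference, the mechanism being pointed at could look like this unit fragment. `FailureAction=` and `WatchdogSec=` are real systemd options; the checker binary and values are hypothetical:

```ini
[Unit]
Description=Root filesystem health checker (hypothetical)
# If this unit fails -- e.g. its watchdog ping stops arriving -- reboot
# without waiting for a clean shutdown:
FailureAction=reboot-force

[Service]
Type=notify
ExecStart=/usr/local/bin/rootfs-check
# systemd expects sd_notify("WATCHDOG=1") from the service at least this often:
WatchdogSec=30s
Restart=no
```

The caveat, raised in the next comment, is that executing the failure action still depends on PID 1 being able to act, which may not hold when the root filesystem is gone.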
FailureAction would still require that the code to reboot actually be functional, right? What about a new RequiredByWatchdog for timers? That way the hardware can reset even if things are so corrupt that "sudo reboot now" would just give an IOError. |
Yeah |
If the program freezes, it will stop pinging the unit watchdog (sd_notify) and systemd will trigger the failure action.
I thought the unit watchdog was separate from the hardware watchdog?
In a corrupt state systemd might not be able to trigger anything; the failure actions can involve a lot of complexity that might not be available if the filesystem is gone or large parts of the kernel have been blasted by a cosmic-ray bit flip or similar. You can't do a normal failure action in a state where you can't execute any binaries or run any scripts at all.

This could all happen while the specific part of systemd that feeds the watchdog keeps on going like everything is fine, because that part of the code might just so happen to not be affected, and it currently doesn't check for such things.
The systemd watchdog code does not check the filesystem when it feeds the watchdog. This means that if there is a catastrophic failure that stops the root filesystem from responding, the whole machine becomes unusable, but systemd continues to feed the watchdog so the broken system never recovers. In this situation it is not possible to log in to the machine (locally or remotely) and if there is an active session, no commands can be run as anything that tries to access the root filesystem simply freezes.
In my case I am running some Raspberry Pi devices with their root filesystems on NFS, and when I try to reboot them, the network goes away in the middle of the shutdown process. This then causes the whole device to freeze waiting for the rootfs to return, which never happens. The watchdog timer is active and should take over to reboot the machine as it does in other situations (such as a kernel panic) however in the dead-rootfs scenario systemd continues to feed the watchdog so the broken system never recovers and requires a power cycle to return to service, defeating the purpose of the watchdog.
I believe that systemd should read a non-cached file from the filesystem each time it tries to feed the watchdog, so a frozen rootfs would cause this process to also freeze, thus preventing the watchdog from being fed and causing a reset.
Note that the file being read cannot be cached from a previous read, as any files still cached do not result in a freeze. And it must be on the root filesystem, as accesses to other filesystems like tmpfs still succeed. Ideally reading the systemctl binary or other systemd-related files would be the best choice, provided any disk cache can be discarded before the read operation.
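Two ways such an uncached read could be attempted from userspace, both hedged: O_DIRECT support varies by filesystem, and the fadvise route is only advisory, so neither is guaranteed to hit the backing store everywhere.

```shell
#!/bin/sh
# 1) O_DIRECT read: bypasses the page cache entirely (GNU dd's iflag=direct).
#    NFS supports O_DIRECT, but some filesystems reject the flag.
uncached_read_direct() {
    dd if="$1" of=/dev/null bs=4096 count=1 iflag=direct 2>/dev/null
}

# 2) Advise the kernel to drop the file's cached pages, then read normally.
#    "iflag=nocache count=0" is GNU dd's idiom for issuing
#    posix_fadvise(POSIX_FADV_DONTNEED) over the whole file; being advisory,
#    it may silently do nothing.
uncached_read_fadvise() {
    dd if="$1" iflag=nocache count=0 2>/dev/null
    dd if="$1" of=/dev/null bs=4096 count=1 2>/dev/null
}
```

Combined with a deadline (e.g. `timeout(1)`), either read turning into a hang rather than a cache hit is what would let the feeder notice a dead rootfs.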
systemd version the issue has been seen with
Used distribution
Linux kernel version used (`uname -a`)
CPU architecture issue was seen on
Expected behaviour you didn't see
Unexpected behaviour you saw
Steps to reproduce the problem