Cannot shutdown machine if NFS mount is not reachable #6115
Comments
I think 83897d5 fixes it.
neilbrown
Jun 13, 2017
Contributor
That patch is only one part of fixing this problem.
You also need util-linux newer than v2.30 (which is the latest release...), particularly commit ce6e269447ce86b0c303ec2febbf4264d318c713.
You also need a small change to nfs-utils which hasn't landed upstream yet... I should chase that.
evverx referenced this issue on Jun 13, 2017 (Closed): Reminder: bump `libmount` dependency to `2.30` #4871
poettering
Jun 13, 2017
Member
@neilbrown hmm, are you suggesting that what needs fixing on the systemd side is already fixed in systemd git? If so, we should close this bug, no?
(BTW, I still think we should add a user-space timeout around umount() to deal with such umount() requests that just hang, i.e. the stuff I suggested in #5688 (comment))
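For illustration only, a minimal user-space sketch of that idea (not systemd's actual implementation; the timeout and polling interval are arbitrary): run the umount in a forked child and stop waiting after a deadline.

```c
/* Sketch: run umount(2) in a child so the caller can give up after a
 * deadline instead of hanging on an unreachable NFS server. */
#include <signal.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

static int umount_with_timeout(const char *where, unsigned timeout_sec) {
        pid_t pid = fork();
        if (pid < 0)
                return -1;
        if (pid == 0)
                /* Child: this may block indefinitely if the server is gone. */
                _exit(umount2(where, 0) == 0 ? 0 : 1);

        for (unsigned i = 0; i < timeout_sec * 10; i++) {
                int status;
                if (waitpid(pid, &status, WNOHANG) == pid)
                        return WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
                usleep(100 * 1000); /* poll every 100 ms */
        }

        /* Timed out: stop waiting. SIGKILL cannot free a task stuck in an
         * uninterruptible sleep, but the caller can move on. */
        kill(pid, SIGKILL);
        (void) waitpid(pid, NULL, WNOHANG);
        return -1;
}
```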
poettering added the pid1 and RFE 🎁 labels on Jun 13, 2017
neilbrown
Jun 13, 2017
Contributor
You really need pull request #6086 as well, but that only affects automounts.
Yes, there is a fix already in systemd (which depends on new util-linux and nfs-utils). This should fix many problems of this kind, including the one described above.
It is possible to synthesize problems that take more work to fix.
If you have an NFS filesystem mounted, and another filesystem mounted on that, and the NFS filesystem cannot contact its server, then any attempt to unmount the filesystem mounted on top of it will block indefinitely (the path walk to its mountpoint hangs on the dead NFS server), while attempts to unmount the NFS filesystem itself will fail because it is busy.
The only certain way to unmount a filesystem is to call the sys_umount system call directly passing MNT_DETACH, and aim for the mountpoint closest to the root. That will detach all the filesystems, but won't wait for them to flush out.
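A minimal sketch of that call (the mountpoint /mnt is a made-up example standing in for the mount closest to the root):

```c
/* Sketch: lazily detach a whole mount tree. MNT_DETACH removes it from
 * the namespace immediately but does not wait for data to be flushed. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

int main(void) {
        const char *top = "/mnt";   /* hypothetical mountpoint closest to the root */

        if (umount2(top, MNT_DETACH) < 0) {
                fprintf(stderr, "umount2(%s, MNT_DETACH): %s\n", top, strerror(errno));
                return 1;
        }
        return 0;
}
```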
I think if you
1/ kill all processes that might be using a filesystem - waiting with timeout for them to disappear
2/ umount(MNT_DETACH) the top level mount point of any tree that you want to discard
3/ call sys_sync() in a child and wait until IO activity subsides (maybe watch changes to Dirty and Writeback in /proc/meminfo? see the sketch below)
then you maximize the chance of cleaning up as much as possible without hanging.
Maybe that is a bit excessive though.
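A rough sketch of step 3 above; the one-second poll, the 30-second give-up limit and the "no longer shrinking" test are arbitrary choices for illustration, not anything systemd actually does:

```c
/* Sketch: run sync() in a child (it can hang on a dead NFS server) and
 * watch Dirty/Writeback in /proc/meminfo until they stop shrinking. */
#include <stdio.h>
#include <unistd.h>

static unsigned long dirty_plus_writeback(void) {
        unsigned long total = 0, kb;
        char line[256];
        FILE *f = fopen("/proc/meminfo", "re");

        if (!f)
                return 0;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "Dirty: %lu kB", &kb) == 1 ||
                    sscanf(line, "Writeback: %lu kB", &kb) == 1)
                        total += kb;
        fclose(f);
        return total;
}

int main(void) {
        if (fork() == 0) {
                sync();                  /* may block forever, hence the child */
                _exit(0);
        }

        unsigned long prev = dirty_plus_writeback();
        for (int i = 0; i < 30; i++) {   /* give up after ~30 seconds */
                sleep(1);
                unsigned long cur = dirty_plus_writeback();
                if (cur >= prev)         /* writeback is no longer shrinking */
                        break;
                prev = cur;
        }
        return 0;
}
```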
I think that, in general, a timeout on unmount is a good idea to catch possible problems walking down the path to the mountpoint. A JobRunningTimeout would probably be sufficient (as long as it wasn't imposed at mount time as well).
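For what it's worth, systemd.unit(5) already documents a per-job timeout; a drop-in roughly like the following (the mount unit name and the value are made up) expresses the idea, though note that JobTimeoutSec= applies to start jobs too, which is exactly the caveat above:

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/mnt-data.mount.d/job-timeout.conf
[Unit]
# Give up on jobs for this mount after 30s instead of letting them hang.
JobTimeoutSec=30s
```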
I've been thinking recently that I'd like AT_CACHE_ONLY and MNT_CACHE_ONLY flags, affecting the various *at() system calls and umount() respectively, which tell path walking to trust the cache and return an error rather than talk to a filesystem for pathname lookup. That would allow you to unmount anything without risk...
CRCinAU
Jun 20, 2017
Am I missing something, or does systemd actually shut down the network before the NFS share is unmounted? While that is somewhat brain-dead, is there no way at all to ensure that certain processes complete before the network is yanked out from under them?
EDIT: I should note that I've seen this issue for months - but always just ignored it as 'broken as designed' and unmount my NFS shares manually before a shutdown (if I remember). The important part is that I still see this on Fedora 26 - even when the NFS server is always reachable...
zdzichu
Jun 20, 2017
Contributor
For ordering against the network you have to use the _netdev option in /etc/fstab.
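For example, a hypothetical /etc/fstab entry (server, export and mountpoint names are made up):

```
# NFS mount ordered after the network comes up and unmounted before it goes down
nfsserver:/export/data  /mnt/data  nfs  defaults,_netdev  0  0
```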
CRCinAU
commented
Jun 20, 2017
@zdzichu I'll give that a go over the next few weeks and see if it helps...
arvidjaar
Jun 20, 2017
Contributor
> For ordering against the network you have to use the _netdev option in /etc/fstab.
This issue is about manual NFS mounts, how is /etc/fstab relevant?
Commits in kyle-walker/systemd referenced this issue on Aug 10, 2017
kyle-walker referenced this issue on Aug 11, 2017 (Merged): core: Limit the time and attempts in shutdown remount/umount efforts #6598
sourcejedi
Sep 6, 2017
Contributor
> Am I missing something, or does systemd actually shut down the network before the NFS share is unmounted? While that is somewhat brain-dead, is there no way at all to ensure that certain processes complete before the network is yanked out from under them?
> For ordering against the network you have to use the _netdev option in /etc/fstab.
Not according to the docs, which were updated sometime since v233, I think. They say _netdev is only necessary if you have a filesystem type which is not recognized as a network filesystem. The example given is a mount of an iSCSI network block device.
> EDIT: I should note that I've seen this issue for months - but always just ignored it as 'broken as designed' and unmount my NFS shares manually before a shutdown (if I remember). The important part is that I still see this on Fedora 26 - even when the NFS server is always reachable...
So, with the dependencies, the expectation seems to be that NFS filesystems are stopped before the networking unit. But user processes are not explicitly killed before the network unit! So it looks like they can easily hold the NFS mounts open. I think stopping the mount units simply fails (no timeout), and then we rely on systemd-shutdown's kill+unmount loop... by which point we have already stopped networking.service? Though I must be missing something about mount units: I don't know why -.mount doesn't show a visible failure on every shutdown.
User processes which die when the getty or xdm are closed are OK. But I recently noticed gnome sessions surviving while gdm is stopped, so that's at least possible. (They were degraded in various ways, suggesting this might not be the best idea.) And SSH sessions survive the main SSH process, by design. You can also just have random stupid user processes that ignore stuff short of SIGKILL.
In e.g. Debian sysvinit, I think this works differently: there's sendsigs which kills user processes before NFS filesystems are unmounted. (Along with any system daemon that didn't realize it wanted a sendsigs.omit/ drop-in). Considering the issue, this seems like the big hole in the systemd shutdown design.
It seems like this issue would be avoided if we were able to stop all the user scopes for shutdown.target (as well as services like cron, at, and httpd (suexec) that might not be scoping user processes), and manage to order that procedure before the unmounting of filesystems. I dunno how practical it is. Feels like we'd at least need an explicit ordering point (target(s)) to represent stopping user processes.
neilbrown
Sep 7, 2017
Contributor
> It seems like this issue would be avoided if we were able to stop all the user scopes for shutdown.target
I think we do. user-1000.slice Conflicts with shutdown.target, so it will be shut down, and user-1000.slice is After systemd-user-sessions.service which is After remote-fs.target. So all processes in user-1000.slice should be killed before any remote filesystems are unmounted, which in turn is before the network is shutdown.
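A quick way to inspect those dependencies on a running system (UID 1000 is just an example; the values will vary per installation):

```
systemctl show -p Conflicts -p After user-1000.slice
systemctl show -p After systemd-user-sessions.service
```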
One problem that does remain is that some processes accessing an NFS filesystem mounted from an inaccessible server cannot be killed. They get stuck in filemap_fdatawait_range(), which waits in a non-killable mode. It is on my to-do list to resolve that, but it isn't straightforward.
siebenmann
Sep 7, 2017
As a note, user scopes are not the only way that processes can linger after systemd starts trying to unmount NFS filesystems and then shut down networking. Any .service that uses KillMode=process may leave children behind after it is officially shut down, and I believe that any System V init scripts will be run this way (plus any explicit .service that a distribution configures this way, such as cron and atd on Ubuntu 16.04).
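A made-up unit fragment of the kind being described (the daemon path is hypothetical):

```ini
[Service]
ExecStart=/usr/local/bin/some-daemon
# Only the main process is killed when the unit is stopped; forked children
# survive and can keep an NFS mount busy during shutdown.
KillMode=process
```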
sourcejedi
Sep 7, 2017
Contributor
Thanks for the correction.
(fwiw my unverified assumptions were drawn from what I've seen `systemctl isolate rescue` do. That is tricky to improve by making units conflict with rescue, because if you end up starting such a unit from the rescue shell, then your shell goes away and you're locked out...
Except maybe units that aren't gdm or gettys could conflict with `rescue.target` but not `rescue.service`. Hmm).
poettering
Sep 7, 2017
Member
yeah, so @neilbrown's analysis is right: the deps are all in place already, all services and scopes (including user sessions) should all be shut down properly by the time we umount NFS shares, and that's done before we shut down the network. However, this fails to work properly if:
- There are services which explicitly exclude themselves from killing, for example via KillMode=none or suchlike. If you have some of those, then it's really their fault and there's little we can do; please file a bug against these services asking them not to do this.
- If processes already hang on NFS in a non-interruptible sleep, then systemd can't kill them either. This is a limitation of the Linux kernel, and there's nothing systemd can do about them.
- Some distros don't get the deps on their networking stacks right, i.e. miss Before=network.target in their networking service, so that the networking stack is shut down after network.target goes away, and not before.
- If people split out /var or /var/log onto NFS they are in trouble, as journald will keep /var/log/journal busy until the very end, and will thus keep these mounts busy for good. This is a limitation of the journal that we should fix eventually (and we would have fixed it a long time ago if we had useful IPC for the journal, but we don't, as dbus-daemon is a client of the journal, and hence the journal can't use D-Bus IPC, since we'd otherwise have a cyclic dependency and deadlocks).
Note that systemd applies a time-out to service stopping, hence an unkillable process due to NFS is actually not a major problem beyond causing a delay at shutdown. Moreover, the umount commands invoked by systemd during the regular shutdown phase also have a time-out applied, and if they don't complete within 90s systemd won't wait for them and will continue with the shutdown. However, in the second shutdown phase (i.e. where all units are already stopped, and we have transitioned into the systemd-shutdown destruction loop) we will try to umount everything left over again, and these umount() syscalls do not have any userspace time-out applied currently, but this is being worked on in #6598. As soon as we have that we should be reasonably safe regarding hanging NFS (modulo some bugs). That is of course unless PID 1 itself hangs on NFS for some reason...
henryx commented Jun 12, 2017
Submission type
systemd version the issue has been seen with
Used distribution
In case of bug report: Expected behaviour you didn't see
In case of bug report: Unexpected behaviour you saw
In case of bug report: Steps to reproduce the problem