Cannot shutdown machine if NFS mount is not reachable #6115

Closed
henryx opened this Issue Jun 12, 2017 · 13 comments

Comments

8 participants

henryx commented Jun 12, 2017

Submission type

  • Bug report

systemd version the issue has been seen with

systemd 219 and above

Used distribution

I have tested the problem on CentOS 7 and Fedora > 20, but I think it is independent of the distribution.

In case of bug report: Expected behaviour you didn't see

After a reasonable timeout, the machine shuts down correctly.

In case of bug report: Unexpected behaviour you saw

The machine cannot shut down until the NFS share becomes reachable again.

In case of bug report: Steps to reproduce the problem

On machine A: echo "/home network/24(rw,no_root_squash)" >> /etc/exports
On machine A: activate the NFS share (e.g. systemctl restart rpcbind nfs-server)
On machine B: mount -t nfs -o soft machine-a:/home /mnt
On machine A: power off the machine
On machine B: power off the machine
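
The same steps as a copy-pasteable sketch; the export subnet (192.0.2.0/24) is a hypothetical stand-in for "network/24", everything else is taken from the steps above:

```sh
# On machine A (NFS server) -- 192.0.2.0/24 stands in for the client network
echo "/home 192.0.2.0/24(rw,no_root_squash)" >> /etc/exports
systemctl restart rpcbind nfs-server

# On machine B (NFS client) -- note the soft mount option from the report
mount -t nfs -o soft machine-a:/home /mnt

# Power off machine A first, then machine B; machine B's shutdown hangs
poweroff    # on machine A
poweroff    # on machine B
```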

Contributor

zdzichu commented Jun 12, 2017

I think 83897d5 fixes it.

Contributor

neilbrown commented Jun 13, 2017

That patch is only one part of fixing this problem.
You also need a util-linux newer than v2.30 (which is the latest release...), particularly commit ce6e269447ce86b0c303ec2febbf4264d318c713.
You also need a small change to nfs-utils which hasn't landed upstream yet ... I should chase that.

Member

poettering commented Jun 13, 2017

@neilbrown hmm, are you suggesting that what there is to fix on the systemd side is already fixed in systemd git? If so, we should close this bug, no?

(BTW, I still think we should add a user-space timeout around umount() to deal with such umount() requests that just hang, i.e. the stuff I suggested in #5688 (comment))

Contributor

neilbrown commented Jun 13, 2017

You really need pull request #6086 as well, but that only affects automounts.

Yes, there is a fix that is already in systemd (which depends on the new util-linux and nfs-utils). This should fix many problems of this kind, including the one described above.
It is possible to synthesize problems that take more work to fix.
If you have an NFS filesystem mounted, and another filesystem mounted on that, and the NFS filesystem cannot contact its server, then any attempt to unmount the filesystem mounted on top will block indefinitely (the path lookup has to go through the unresponsive NFS mount), while an attempt to unmount the NFS filesystem itself will fail because it is busy.
The only certain way to unmount a filesystem is to call the sys_umount system call directly, passing MNT_DETACH, and aim for the mountpoint closest to the root. That will detach all the filesystems, but won't wait for them to flush out.
I think if you
1/ kill all processes that might be using a filesystem - waiting with a timeout for them to disappear
2/ umount(MNT_DETACH) the top level mount point of any tree that you want to discard
3/ call sys_sync() in a child and wait until IO activity subsides (maybe watch changes to Dirty and Writeback in /proc/meminfo?)

then you maximize the chance of cleaning up as much as possible without hanging.
Maybe that is a bit excessive though.

I think that, in general, a timeout on unmount is a good idea to catch possible problems walking down the path to the mountpoint. A JobRunningTimeout would probably be sufficient (as long as it wasn't imposed at mount time as well).

I've been thinking recently that I'd like AT_CACHE_ONLY and MNT_CACHE_ONLY flags which affect the various *at() system calls and umount() respectively, and tell path walking to trust the cache and return an error rather than talk to a filesystem for pathname lookup. That would allow you to unmount anything without risk...
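
A rough shell rendition of the three steps above, as a sketch only: `umount --lazy` performs the same MNT_DETACH detach (systemd-shutdown calls umount2() directly instead), `--no-canonicalize` skips the userspace path lookup that could itself hang, and the mountpoint `/mnt` plus the fixed grace period are placeholders:

```sh
top=/mnt    # hypothetical top-level mountpoint of the tree to discard

# 1/ kill anything still using the filesystem, then give it a moment to exit
#    (fuser may itself block on a dead NFS mount; systemd kills by cgroup instead)
fuser -km "$top" 2>/dev/null
sleep 5

# 2/ detach the whole tree at its top-level mountpoint (MNT_DETACH under the hood),
#    without canonicalizing the path first
umount --lazy --no-canonicalize "$top"

# 3/ sync in the background and wait for Dirty/Writeback in /proc/meminfo to subside
#    (with an unreachable server this may never converge, so a real implementation
#    also needs an overall timeout)
sync &
while [ "$(awk '/^(Dirty|Writeback):/ {s+=$2} END {print s}' /proc/meminfo)" -gt 0 ]; do
    sleep 1
done
```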

CRCinAU commented Jun 20, 2017

Am I missing the point that systemd actually shuts down the network before the NFS share? While somewhat brain-dead in doing so, is there no way at all to ensure that certain processes complete before the network is yanked out from under them?

EDIT: I should note that I've seen this issue for months - but always just ignored it as 'broken as designed' and unmounted my NFS shares manually before a shutdown (if I remembered). The important part is that I still see this on Fedora 26 - even when the NFS server is always reachable...

Contributor

zdzichu commented Jun 20, 2017

For ordering against the network you have to use the _netdev option in /etc/fstab.
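
For illustration, a hypothetical /etc/fstab entry with that option, reusing the server path and mountpoint from the reproduction steps (whether _netdev is actually needed for nfs is discussed a few comments below):

```
# /etc/fstab -- _netdev orders the mount against the network being up/down
machine-a:/home  /mnt  nfs  soft,_netdev  0  0
```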

CRCinAU commented Jun 20, 2017

@zdzichu I'll give that a go over the next few weeks and see if it helps...

Contributor

arvidjaar commented Jun 20, 2017

> For ordering against the network you have to use the _netdev option in /etc/fstab.

This issue is about manual NFS mounts; how is /etc/fstab relevant?

kyle-walker added a commit to kyle-walker/systemd that referenced this issue Aug 10, 2017

core: Handle hung remount operations in umount via timeout and limit
The remount read-only operation is currently not attempt- or timeout-limited.
As a result, the shutdown operation can stall endlessly due to inaccessible
NFS mounts or a number of similar factors. This results in a manual system
reset being necessary.

With this change, the remount is now limited to a maximum of 6 attempts
(UMOUNT_MAX_RETRIES + 1). In addition to a maximum attempt count, each
remount operation is limited to 90 seconds (DEFAULT_TIMEOUT_USEC) before
the child process exits with a SIGALRM.

Resolves: systemd#6115
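
Conceptually the bound described in this commit amounts to something like the following (a sketch only; the actual change forks a child and arms SIGALRM rather than using timeout(1), and /some/mountpoint is a placeholder):

```sh
# At most 6 remount attempts, each capped at 90 seconds
for attempt in 1 2 3 4 5 6; do
    timeout 90 mount -o remount,ro /some/mountpoint && break
done
```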

kyle-walker added a commit to kyle-walker/systemd that referenced this issue Aug 10, 2017

core: Add a timeout-limited remount operation to umount
The remount read-only operation is currently not limited. As a result, the
shutdown operation can stall endlessly due to inaccessible NFS mounts or a
number of similar factors. This results in a manual system reset being
necessary.

With this change, the remount is now limited to a maximum of 6 attempts
(UMOUNT_MAX_RETRIES + 1). In addition to a maximum attempt count, each
remount operation is limited to 90 seconds (DEFAULT_TIMEOUT_USEC) before
the child process exits with a SIGALRM.

Resolves: systemd#6115

kyle-walker added a commit to kyle-walker/systemd that referenced this issue Aug 10, 2017

core: Limit umount operations in shutdown via limit and timeout
The remount read-only operation is currently not limited. As a result, the
shutdown operation can stall endlessly due to inaccessible NFS mounts or a
number of similar factors. This results in a manual system reset being
necessary.

With this change, the remount is now limited to a maximum of 6 attempts
(UMOUNT_MAX_RETRIES + 1). In addition to a maximum attempt count, each
remount operation is limited to 90 seconds (DEFAULT_TIMEOUT_USEC) before
the child process exits with a SIGALRM.

Resolves: systemd#6115

Contributor

sourcejedi commented Sep 6, 2017

> Am I missing the point that systemd actually shuts down the network before the NFS share? While somewhat brain-dead in doing so, is there no way at all to ensure that certain processes complete before the network is yanked out from under them?

> For ordering against the network you have to use the _netdev option in /etc/fstab.

Not according to the docs, which were updated sometime since v233, I think. They say _netdev is only necessary if you have a filesystem type which is not recognized as a network filesystem. The example given is a mount of an iSCSI network block device.

> EDIT: I should note that I've seen this issue for months - but always just ignored it as 'broken as designed' and unmounted my NFS shares manually before a shutdown (if I remembered). The important part is that I still see this on Fedora 26 - even when the NFS server is always reachable...

So, with the dependencies, the expectation seems to be that NFS filesystems are stopped before the networking unit. But user processes are not explicitly killed before the network unit! So it looks like they can easily hold the NFS mounts open. I think stopping the mount units simply fails (no timeout), and then we rely on systemd-shutdown's kill+unmount loop... by which point we have already stopped networking.service? Though I must be missing something about mount units: I don't know why -.mount doesn't show a visible failure on every shutdown.

User processes which die when the getty or xdm is closed are OK. But I recently noticed GNOME sessions surviving while gdm is stopped, so that's at least possible. (They were degraded in various ways, suggesting this might not be the best idea.) And SSH sessions survive the main SSH process, by design. You can also just have random stupid user processes that ignore anything short of SIGKILL.

In e.g. Debian sysvinit, I think this works differently: there's sendsigs, which kills user processes before NFS filesystems are unmounted (along with any system daemon that didn't realize it wanted a sendsigs.omit/ drop-in). Considering this issue, that seems like the big hole in the systemd shutdown design.

It seems like this issue would be avoided if we were able to stop all the user scopes for shutdown.target (as well as services like cron, at, and httpd (suexec) that might not be scoping user processes), and manage to order that procedure before the unmounting of filesystems. I dunno how practical it is. Feels like we'd at least need an explicit ordering point (target(s)) to represent stopping user processes.

Contributor

neilbrown commented Sep 7, 2017

> It seems like this issue would be avoided if we were able to stop all the user scopes for shutdown.target

I think we do. user-1000.slice Conflicts with shutdown.target, so it will be shut down, and user-1000.slice is After systemd-user-sessions.service, which is After remote-fs.target. So all processes in user-1000.slice should be killed before any remote filesystems are unmounted, which in turn happens before the network is shut down.

One problem that does remain is that some processes accessing an NFS filesystem mounted from an inaccessible server cannot be killed. They get stuck in filemap_fdatawait_range(), which waits in a non-killable mode. It is on my to-do list to resolve that, but it isn't straightforward.
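
One way to inspect that dependency chain on a given machine (a sketch; UID 1000 is assumed, and the slice name differs per logged-in user):

```sh
systemctl show -p Conflicts -p After user-1000.slice
systemctl show -p After systemd-user-sessions.service
systemctl show -p After remote-fs.target
```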

siebenmann commented Sep 7, 2017

As a note, user scopes are not the only way that processes can linger after systemd starts trying to unmount NFS filesystems and then shut down networking. Any .service that uses KillMode=process may leave children behind after it is officially shut down, and I believe that any System V init scripts will be run this way (plus any explicit .service that a distribution configures this way, such as cron and atd on Ubuntu 16.04).
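
A rough way to spot such services on a running system (a sketch only; it greps systemctl output and assumes the unit name is in the first column):

```sh
# Print services whose KillMode leaves child processes behind on stop
systemctl list-units --type=service --no-legend \
  | awk '{print $1}' | grep '\.service$' \
  | while read -r unit; do
        mode=$(systemctl show -p KillMode "$unit" | cut -d= -f2)
        case "$mode" in
            process|none) echo "$unit KillMode=$mode" ;;
        esac
    done
```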

Contributor

sourcejedi commented Sep 7, 2017

Thanks for the correction.

(FWIW my unverified assumptions were drawn from what I've seen systemctl isolate rescue do, which is tricky to improve by making units conflict with rescue, because if you end up starting that unit from the rescue shell, then your shell goes away and you're locked out...

Except maybe units that aren't gdm or gettys could conflict with rescue.target but not rescue.service. Hmm).

Member

poettering commented Sep 7, 2017

Yeah, so @neilbrown's analysis is right: the deps are all in place already, and all services and scopes (including user sessions) should be shut down properly by the time we umount NFS shares, which is done before we shut down the network. However, this fails to work properly if:

  1. There are services which explicitly exclude themselves from killing, for example via KillMode=none or suchlike. If you have some of those, then it's really their fault; there's little we can do. Please file a bug against these services asking them not to do this.
  2. Processes already hang on NFS in an uninterruptible sleep, in which case systemd can't kill them either. This is a limitation of the Linux kernel, and there's nothing systemd can do about it.
  3. Some distros don't get the deps on their networking stacks right, i.e. they miss Before=network.target in their networking service, so that the networking stack is shut down after network.target goes away, and not before.
  4. People split out /var or /var/log onto NFS, in which case they are in trouble, as journald will keep /var/log/journal busy until the very end and will thus keep these mounts busy for good. This is a limitation of the journal we should fix eventually (and we would have fixed it a long time ago if we had useful IPC for the journal, but we don't, as dbus-daemon is a client of the journal, and hence the journal can't use D-Bus IPC, since we'd otherwise have a cyclic dep and deadlocks).

Note that systemd applies a time-out to service stopping, hence an unkillable process due to NFS is actually not a major problem beyond causing a delay at shutdown. Moreover, the umount commands invoked by systemd during the regular shutdown phase have a time-out applied as well, and if they don't complete within 90s systemd won't wait for them and will continue with the shutdown. However, in the second shutdown phase (i.e. where all units are already stopped and we have transitioned into the systemd-shutdown destruction loop) we will try to umount everything left over again, and these umount() syscalls do not have any userspace time-out applied currently, but this is being worked on in #6598. As soon as we have that, we should be reasonably safe regarding hanging NFS (modulo some bugs). That is of course unless PID 1 itself hangs on NFS for some reason...
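
For point 3 above, the fix on the distro side is an ordering dependency in the networking service itself; a hypothetical drop-in (networking.service is just an example name, it differs per distribution) would look like:

```ini
# /etc/systemd/system/networking.service.d/order.conf  (hypothetical path and unit name)
[Unit]
Before=network.target
```

followed by a `systemctl daemon-reload` so the drop-in takes effect.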

keszybz added a commit to keszybz/systemd that referenced this issue Sep 11, 2017

build-sys: require libmount >= 2.30
Fixes systemd#4871.

The new libmount has two changes relevant for us:

- x-* options are propagated to /run/mount/utab and are visible through
  libmount (fixes systemd#4817).

- umount -c now really works (partially solves systemd#6115).

keszybz added a commit that referenced this issue Sep 15, 2017

build-sys: require libmount >= 2.30 (#6795)
Fixes #4871.

The new libmount has two changes relevant for us:

- x-* options are propagated to /run/mount/utab and are visible through
  libmount (fixes #4817).

- umount -c now really works (partially solves #6115).