
Strange behaviour of canmount=off #8833

Closed
SoongNoonien opened this issue May 29, 2019 · 11 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@SoongNoonien
Contributor

SoongNoonien commented May 29, 2019

System information

Type                  Version/Name
Distribution Name     Gentoo
Distribution Version  Gentoo Base System release 2.6
Linux Kernel          5.1.5-gentoo
Architecture          amd64
ZFS Version           0.8.0-r0-gentoo
SPL Version           0.8.0-r0-gentoo

Describe the problem you're observing

I'm using the canmount=off property to make children of different pools appear in the same directory, as described in the man page. Unfortunately, this seems to be broken since 0.8.0. I've isolated the problem to this list of commands:

/tmp # truncate -s 1G test0.img
/tmp # truncate -s 1G test1.img
/tmp # zpool create test0 /tmp/test0.img
/tmp # zpool create test1 /tmp/test1.img
/tmp # zfs create test0/a
/tmp # zfs create test1/b
/tmp # zfs set mountpoint=none test0
/tmp # zfs set mountpoint=/a test0/a
/tmp # zfs set canmount=off test1
/tmp # zfs set mountpoint=/a test1
/tmp # zfs list -o name,mounted,canmount,mountpoint
NAME     MOUNTED  CANMOUNT  MOUNTPOINT
test0         no        on  none
test0/a      yes        on  /a
test1         no       off  /a
test1/b      yes        on  /a/b
/tmp # zfs unmount -a
/tmp # zfs mount -a
/tmp # zfs unmount -a
/tmp # zfs mount -a
cannot mount '/a': directory is not empty
/tmp #

Apparently zfs mount -a isn't able to resolve the mountpoints correctly anymore. I found two different workarounds for this issue.
One can simply avoid using canmount to inherit mountpoints by setting the mountpoints manually; for this example, zfs set mountpoint=none test1 and zfs set mountpoint=/a/b test1/b, as shown below.
The stranger workaround is swapping the names of the pools, or simply choosing any new name for test0 (e.g. test2) so that it sorts after test1 in zfs list. So the name order of the pools seems to affect how the mountpoints are disentangled.
I was able to reproduce the issue on Gentoo and Arch but not on OpenIndiana.
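In full, the first workaround applied to the reproduction above (only the two set commands are new; the unmount/mount cycle is the one that previously failed):

/tmp # zfs set mountpoint=none test1
/tmp # zfs set mountpoint=/a/b test1/b
/tmp # zfs unmount -a
/tmp # zfs mount -a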

@behlendorf behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label May 29, 2019
@johnnyjacq16

I am not sure, but it seems that systemd.mount could solve this issue: set the ZFS mountpoint of the dataset to legacy, place it in the fstab file, then tell systemd which mountpoint you want mounted first by creating a dependency between the mountpoints.
Example fstab entry:
<pool/dataset> /var/log zfs x-systemd.after=var.mount
This tells systemd to mount the /var/log mountpoint after the /var mount has been mounted.
A service could also be used:
<pool/data> /home zfs x-systemd.before=zfs-mount.service
This mounts /home before the zfs-mount.service starts.
The status of a mountpoint can be checked; for example
systemctl status -- -.mount
shows the status of the root mountpoint.
For more information see https://jlk.fjfi.cvut.cz/arch/manpages/man/systemd.mount.5 and read the section on default dependencies and fstab, which explains x-systemd.after= and x-systemd.before=.
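A consolidated sketch of that approach for the reproduction above (untested; the dataset names are taken from the example and the a.mount unit name is only illustrative) would be to switch the datasets to legacy mountpoints and order the fstab entries explicitly:

# zfs set mountpoint=legacy test0/a
# zfs set mountpoint=legacy test1/b
# cat /etc/fstab
test0/a  /a    zfs  defaults                  0 0
test1/b  /a/b  zfs  x-systemd.after=a.mount   0 0

systemd then generates a.mount and a-b.mount units from these entries and mounts /a before /a/b.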

@aerusso
Contributor

aerusso commented May 30, 2019

Take a look at zfs-mount-generator: it produces the systemd.mount units (instead of requiring you to manually put things in your /etc/fstab), and systemd should be able to resolve these "tangled" mounts in a sane way and get them to mount correctly at boot.

@SoongNoonien Can you see if that resolves your use case?

@SoongNoonien
Contributor Author

SoongNoonien commented May 30, 2019

Thanks for your efforts, but I do not use systemd.
@aerusso Therefore I can't test whether it resolves the issue in my exact use case.
As I said in my first comment, I already found two workarounds. I implemented the first one to be able to boot properly again. But the actual problem is the resolution done by zfs mount -a. A manual disentanglement with zfs mount <pool/dataset> works just fine (see below), so I'm sure both of your suggestions would also work.
Perhaps the parallel mounting (#8092) could be the problem. It seems to be new in 0.8.0; I haven't found it in 0.7.13. But I'm really not sure about that, because this feature seems to be active in Illumos as well and I wasn't able to reproduce the issue on OpenIndiana.
What do you think?
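For completeness, the manual sequence that works for the example above is simply mounting the parent mountpoint before its child:

/tmp # zfs unmount -a
/tmp # zfs mount test0/a
/tmp # zfs mount test1/b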

@SoongNoonien
Contributor Author

SoongNoonien commented May 30, 2019

I took a closer look at this. Now I'm pretty sure it is the parallel mounting. I looked in the commit history and found the last commit before "OpenZFS 8115 - parallel zfs mount" (a10d50f), which is simply "Tag 0.8.0-rc2" (af2e841).
So I started my Arch Linux virtual machine and cloned the current master branch of ZFS. I rolled back to "Tag 0.8.0-rc2" (git reset --hard af2e8411dacbc694b1aaf9074e68a9d12270e74c), then built and installed it. Afterwards I wasn't able to reproduce the issue anymore.
Then I rolled back the whole VM and started again, but this time rolled the git clone back to "OpenZFS 8115 - parallel zfs mount" (git reset --hard a10d50f999511d304f910852c7825c70c9c9e303). Afterwards I was able to reproduce the issue again. I think this shows that "OpenZFS 8115 - parallel zfs mount" (a10d50f) is responsible for this issue.
Because of the SIMD problems with the newer kernels I had to install an old kernel to be able to compile these old versions of ZFS.
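Roughly, each round of the bisection looked like this (standard autotools build of the ZFS tree; exact configure options omitted, and the hashes are the two commits named above):

# git reset --hard af2e8411dacbc694b1aaf9074e68a9d12270e74c    # "Tag 0.8.0-rc2": not reproducible
# ./autogen.sh && ./configure && make && make install
(roll the whole VM back, then)
# git reset --hard a10d50f999511d304f910852c7825c70c9c9e303    # "OpenZFS 8115 - parallel zfs mount": reproducible
# ./autogen.sh && ./configure && make && make install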

Version overview:

System on which the issue is not reproducible:

Type               Version/Name
Distribution Name  Arch Linux
Linux Kernel       4.20.1-arch1-1-ARCH
Architecture       amd64
ZFS Version        0.8.0-rc2
SPL Version        0.8.0-rc2

System on which the issue is reproducible:

Type               Version/Name
Distribution Name  Arch Linux
Linux Kernel       4.20.1-arch1-1-ARCH
Architecture       amd64
ZFS Version        0.8.0-rc2_1_ga10d50f99
SPL Version        0.8.0-rc2_1_ga10d50f99

I hope this helps.

@kusumi
Member

kusumi commented May 31, 2019

The problem here seems to be that zfs umount -a tries to unmount in reverse-strcmp order, while the pthreads doing parallel mounts can race each other. This will likely happen with or without canmount whenever the same race condition occurs.

Below is basically the same as what the OP pasted above, except that it creates regular files to make the effect clearer.

# cat ./test0.sh
zpool create -f p1 <device>
zpool create -f p2 <device>

zfs create p1/f1
zfs create p2/f2

zfs set mountpoint=none p1
zfs set mountpoint=/mntpt p1/f1
touch /mntpt/file_for_p1

zfs set canmount=off p2
zfs set mountpoint=/mntpt p2
touch /mntpt/f2/file_for_p2

zfs list -o name,mounted,canmount,mountpoint
zfs unmount -a
zfs mount -a

tree /mntpt

zfs list -o name,mounted,canmount,mountpoint
zfs unmount -a
zfs mount -a

As shown below, the p2/f2 mount has been shadowed by the p1/f1 mount, hence zfs unmount -a, which tries to unmount /mntpt/f2 first, fails with "not mounted".

+ zpool create -f p1 sdb
+ zpool create -f p2 sdc
+ zfs create p1/f1
+ zfs create p2/f2
+ zfs set mountpoint=none p1
+ zfs set mountpoint=/mntpt p1/f1
+ touch /mntpt/file_for_p1
+ zfs set canmount=off p2
+ zfs set mountpoint=/mntpt p2
+ touch /mntpt/f2/file_for_p2
+ zfs list -o name,mounted,canmount,mountpoint
NAME   MOUNTED  CANMOUNT  MOUNTPOINT
p1          no        on  none
p1/f1      yes        on  /mntpt
p2          no       off  /mntpt
p2/f2      yes        on  /mntpt/f2
+ zfs unmount -a
+ zfs mount -a
+ tree /mntpt
/mntpt
|-- f2
`-- file_for_p1

1 directory, 1 file
+ zfs list -o name,mounted,canmount,mountpoint
NAME   MOUNTED  CANMOUNT  MOUNTPOINT
p1          no        on  none
p1/f1      yes        on  /mntpt
p2          no       off  /mntpt
p2/f2      yes        on  /mntpt/f2
+ zfs unmount -a
umount: /mntpt/f2: not mounted.
cannot unmount 'p2/f2': umount failed
+ zfs mount -a
cannot mount '/mntpt': directory is not empty

Adding a small delay in zfs_mount_task() makes zfs mount -a mount in the order zfs unmount -a expects. This heuristic defeats the point of parallel mounting for mountpoints that overlap (one contains another), but will likely catch most cases.
@behlendorf do you have any thoughts on this?

# git diff
diff --git a/lib/libzfs/libzfs_mount.c b/lib/libzfs/libzfs_mount.c
index 649c232aa..c74aa77cf 100644
--- a/lib/libzfs/libzfs_mount.c
+++ b/lib/libzfs/libzfs_mount.c
@@ -1450,6 +1450,7 @@ zfs_mount_task(void *arg)

                if (!libzfs_path_contains(mountpoint, child))
                        break; /* not a descendant, return */
+               usleep(10 * 1000);
                zfs_dispatch_mount(mp->mnt_hdl, handles, num_handles, i,
                    mp->mnt_func, mp->mnt_data, mp->mnt_tp);
        }

With the above change, both regular files are now visible. The last zfs mount -a failure is caused by /mntpt not being empty for the p1/f1 mount (i.e. at this point /mntpt/f2 exists); it has nothing to do with the race.

+ zpool create -f p1 sdb
+ zpool create -f p2 sdc
+ zfs create p1/f1
+ zfs create p2/f2
+ zfs set mountpoint=none p1
+ zfs set mountpoint=/mntpt p1/f1
+ touch /mntpt/file_for_p1
+ zfs set canmount=off p2
+ zfs set mountpoint=/mntpt p2
+ touch /mntpt/f2/file_for_p2
+ zfs list -o name,mounted,canmount,mountpoint
NAME   MOUNTED  CANMOUNT  MOUNTPOINT
p1          no        on  none
p1/f1      yes        on  /mntpt
p2          no       off  /mntpt
p2/f2      yes        on  /mntpt/f2
+ zfs unmount -a
+ zfs mount -a
+ tree /mntpt
/mntpt
|-- f2
|   `-- file_for_p2
`-- file_for_p1

1 directory, 2 files
+ zfs list -o name,mounted,canmount,mountpoint
NAME   MOUNTED  CANMOUNT  MOUNTPOINT
p1          no        on  none
p1/f1      yes        on  /mntpt
p2          no       off  /mntpt
p2/f2      yes        on  /mntpt/f2
+ zfs unmount -a
+ zfs mount -a
cannot mount '/mntpt': directory is not empty

@behlendorf
Contributor

@kusumi do you happen to know if the reason this wasn't reproducible on illumos is simply down to the timing? Adding a delay might make this less likely but it would be nice to have a properly robust solution.

@kusumi
Member

kusumi commented Jun 5, 2019

@behlendorf
Not sure about illumos (I don't use illumos), but I can/will check how this works on FreeBSD 12; FreeBSD also has this parallel mount feature.

And yes, a random delay is definitely not what we want as the fix. I added it to show that the problem lies in the timing of the pthreads which eventually call /bin/mount.

@kusumi
Member

kusumi commented Jun 7, 2019

@behlendorf
The same script worked fine on FreeBSD. With some printf debugging, it seems this really is a matter of timing. I also don't see any significant difference in the parallel mount logic on FreeBSD.

Basically, a dataset whose mountpoint is a descendant of another mountpoint gets mounted at the cost of an extra pthread_create(3) or pthread_cond_signal(3) (depending on the state of the pthread pool), so mount(2) usually gets invoked in that order.

(If that's the case, the (only?) consistent fix might be to serialize mounts for mountpoints that overlap. This shouldn't be difficult, as there is an example of it in zfs_foreach_mountpoint() when serial_mount is true.)

Edited:
Looks like we can fix this (i.e. make it behave similarly to FreeBSD) without forcing serial mounting: kusumi@3fa419b.

The main reason parallel mount behaves differently on ZoL is that
pthreads fork/execs another process for /bin/mount, whereas illumos
and FreeBSD calls mount(2) and nmount(2) respectively.

By taking a global lock around /bin/mount (and also /bin/umount)
call, ZoL will behave similarly to other implementations.

@kusumi
Member

kusumi commented Jun 8, 2019

There is an environment variable called ZFS_SERIAL_MOUNT that forces non-parallel mounting; define this variable to enable it.

This is undocumented (it only appears in the source code), so it may not be a stable interface in the future.
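For example, to force the serial path for a single invocation (any value works, since only the presence of the variable is checked):

# ZFS_SERIAL_MOUNT=1 zfs mount -a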

kusumi added a commit to kusumi/zfs that referenced this issue Jun 8, 2019
The main reason parallel mount behaves differently on ZoL is that
pthreads fork/execs another process for /bin/mount, whereas illumos
and FreeBSD calls mount(2) and nmount(2) respectively.

By taking a global lock around /bin/mount (and also /bin/umount)
call, ZoL will behave similarly to other implementations.

See openzfs#8833 for details.

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
kusumi added a commit to kusumi/zfs that referenced this issue Jun 10, 2019
Parallel mount behaves differently in ZoL by each pthread forking
another process to exec /bin/mount, whereas illumos and FreeBSD
directly use mount(2) and nmount(2) syscall respectively.

This can cause parallel mount to mount datasets in incorrect order
depending on the timing /bin/mount runs, and results in shadowing
dataset(s) whose mountpoint is descendant of another.

Having a global lock around /bin/mount (and also /bin/umount) call
adjusts the timing (eliminates the timing difference caused by
fork/exec) and makes its behavior similar to illumos and FreeBSD.

Note that this isn't to "fix" a race condition by proper locking.
Parallel mount by design doesn't guarantee ordering of threads when
multiple datasets have the same mount point. Adjusting is all it
can do. See openzfs#8833 for details.

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
This was referenced Jun 10, 2019
@cyphar
Contributor

cyphar commented Jun 15, 2019

I think I've seen the underlying issue on openSUSE a couple of times -- sometimes on-boot zfs mount -a ends up mounting a child dataset before the parent. I'll give ZFS_SERIAL_MOUNT a shot then.

kusumi added a commit to kusumi/zfs that referenced this issue Jun 16, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching selects sets of mount points that don't
 have dependencies on other sets, hence threads can/should go lock-less
 and shouldn't race with other threads for other sets. Each thread
 dispatched corresponds to top level directory which may or may not have
 datasets mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1) is
 to mount datasets for each set of mount points. The mount points within
 each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that initial thread dispatching can spawn >1 threads
for datasets with the same mount point, and this puts threads under race
condition. This appeared as mount issue on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8)
on mount.

Closes openzfs#8450
Closes openzfs#8833

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
kusumi added a commit to kusumi/zfs that referenced this issue Jun 16, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching selects sets of mount points that don't
 have dependencies on other sets, hence threads can/should go lock-less
 and shouldn't race with other threads for other sets. Each thread
 dispatched corresponds to top level directory which may or may not have
 datasets mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1) is
 to mount datasets for each set of mount points. The mount points within
 each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that initial thread dispatching can spawn >1 threads
for datasets with the same top level directory, and this puts threads
under race condition. This appeared as mount issue on ZoL for ZoL having
different timing regarding mount(2) execution due to fork(2)/exec(2) of
mount(8) on mount.

Closes openzfs#8450
Closes openzfs#8833

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
kusumi added a commit to kusumi/zfs that referenced this issue Jun 17, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching selects sets of mount points that don't
 have dependencies on other sets, hence threads can/should go lock-less
 and shouldn't race with other threads for other sets. Each thread
 dispatched corresponds to top level directory which may or may not have
 datasets mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1) is
 to mount datasets for each set of mount points. The mount points within
 each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that initial thread dispatching can spawn >1 threads
for datasets with the same top level directory, and this puts threads
under race condition. This appeared as mount issue on ZoL for ZoL having
different timing regarding mount(2) execution due to fork(2)/exec(2) of
mount(8) on mount. Fix it by libzfs_path_contains() returning true for
same paths.

Closes openzfs#8450
Closes openzfs#8833

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
kusumi added a commit to kusumi/zfs that referenced this issue Jul 4, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching selects sets of mount points that don't
 have dependencies on other sets, hence threads can/should go lock-less
 and shouldn't race with other threads for other sets. Each thread
 dispatched corresponds to top level directory which may or may not have
 datasets mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1) is
 to mount datasets for each set of mount points. The mount points within
 each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that initial thread dispatching in
zfs_foreach_mountpoint() can spawn >1 threads for datasets which should
run in a single thread, and this puts threads under race condition. This
appeared as mount issue on ZoL for ZoL having different timing regarding
mount(2) execution due to fork(2)/exec(2) of mount(8) on mount.
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of `zfs_foreach_mountpoint(...,
handles, ...)` input which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directory.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test0/a and "/a/b"
  * ThreadB for "/a" for test1
 and in case of openzfs#8833, ThreadA won the race. ThreadB was created because
 "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list which contains "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. ThreadB was created because
 "/var/data" wasn't considered as `"/" contains "/var/data"`. In other
 words, when there is at least one "/" in the input list, it must be
 single threaded, because every directory is a child of "/", meaning they
 all depend on "/" either directly or indirectly.

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result create another thread
when mounts should be done in a single thread. Fix a conditional in
libzfs_path_contains() to consider above two cases.

Closes openzfs#8450
Closes openzfs#8833

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
kusumi added a commit to kusumi/zfs that referenced this issue Jul 4, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching selects sets of mount points that don't
 have dependencies on other sets, hence threads can/should go lock-less
 and shouldn't race with other threads for other sets. Each thread
 dispatched corresponds to top level directory which may or may not have
 datasets mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1) is
 to mount datasets for each set of mount points. The mount points within
 each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that initial thread dispatching in
zfs_foreach_mountpoint() can spawn >1 threads for datasets which should
run in a single thread, and this puts threads under race condition.
This appeared as a mount issue on ZoL for ZoL having different timing
regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts are reordered by the race condition.

There are currently two known patterns of `zfs_foreach_mountpoint(...,
handles, ...)` input which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directory.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test0/a and "/a/b"
  * ThreadB for "/a" for test1
 and in case of openzfs#8833, ThreadA won the race. ThreadB was created because
 "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list which contains "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. ThreadB was created because
 "/var/data" wasn't considered as `"/" contains "/var/data"`. In other
 words, if there is at least one "/" in the input list, it must be
 single threaded since every directory is a child of "/", meaning they
 all directly or indirectly depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result create another thread
when mounts should be done in a single thread. Fix a conditional in
libzfs_path_contains() to consider above two cases.

Closes openzfs#8450
Closes openzfs#8833

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
kusumi added a commit to kusumi/zfs that referenced this issue Jul 4, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching selects sets of mount points that don't
 have dependencies on other sets, hence threads can/should go lock-less
 and shouldn't race with other threads for other sets. Each thread
 dispatched corresponds to top level directory which may or may not have
 datasets mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1) is
 to mount datasets for each set of mount points. The mount points within
 each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that initial thread dispatching in
zfs_foreach_mountpoint() can spawn >1 threads for datasets which should
run in a single thread, and this puts threads under race condition.
This appeared as a mount issue on ZoL for ZoL having different timing
regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts are reordered by the race condition.

There are currently two known patterns of `zfs_foreach_mountpoint(...,
handles, ...)` input which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directory.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. ThreadB was created because
 "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list which contains "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. ThreadB was created because
 "/var/data" wasn't considered as `"/" contains "/var/data"`. In other
 words, if there is at least one "/" in the input list, it must be
 single threaded since every directory is a child of "/", meaning they
 all directly or indirectly depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result create another thread
when mounts should be done in a single thread. Fix a conditional in
libzfs_path_contains() to consider above two cases.

Closes openzfs#8450
Closes openzfs#8833

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
kusumi added a commit to kusumi/zfs that referenced this issue Jul 6, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
TulsiJain pushed a commit to TulsiJain/zfs that referenced this issue Jul 20, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878
TulsiJain pushed a commit to TulsiJain/zfs that referenced this issue Jul 20, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Aug 13, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Aug 22, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Aug 23, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Sep 17, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Sep 18, 2019
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) openzfs#8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of openzfs#8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of openzfs#8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes openzfs#8450
Closes openzfs#8833
Closes openzfs#8878
tonyhutter pushed a commit that referenced this issue Sep 26, 2019
@Virtual-Java
Copy link

Virtual-Java commented Jun 24, 2022

This issue seems to still exist.
I am using Arch Linux, and the official zfs-dkms package was built for Linux kernel version 5.17.9.arch1-1:
sudo pacman -Si zfs-dkms ==> repository: archzfs; name: zfs-dkms; version: 2.1.4-1; arch: x86_64; url: https://zfsonlinux.org/

I formatted my LUKS-encrypted root partition as described in the official OpenZFS guide.
Unfortunately, mounting the ZFS datasets does not work.
Importing the pool results in 'usr' and 'var' being the only directories showing up at my mountpoint:
sudo zpool import -NR /mnt/ssd rpool_$INST_UUID; ls /mnt/ssd/ ==> usr/ var/
Although all datasets are listed with a mountpoint:
zfs list ==>
NAME USED AVAIL REFER MOUNTPOINT
rpool_$INST_UUID 3.70G 45.7G 96K /mnt/ssd
rpool_$INST_UUID/alarm 3.70G 45.7G 96K none
rpool_$INST_UUID/alarm/DATA 859M 45.7G 96K none
rpool_$INST_UUID/alarm/DATA/default 859M 45.7G 96K /mnt/ssd
rpool_$INST_UUID/alarm/DATA/default/etc 96K 45.7G 96K /mnt/ssd/etc
rpool_$INST_UUID/alarm/DATA/default/home 772M 45.7G 761M /mnt/ssd/home
rpool_$INST_UUID/alarm/DATA/default/home/User 152K 45.7G 96K /mnt/ssd/home/User
rpool_$INST_UUID/alarm/DATA/default/opt 152K 45.7G 96K /mnt/ssd/opt
rpool_$INST_UUID/alarm/DATA/default/root 83M 45.7G 82.7M /mnt/ssd/root
rpool_$INST_UUID/alarm/DATA/default/srv 160K 45.7G 96K /mnt/ssd/srv
rpool_$INST_UUID/alarm/DATA/default/usr 264K 45.7G 96K /mnt/ssd/usr
rpool_$INST_UUID/alarm/DATA/default/usr/local 168K 45.7G 104K /mnt/ssd/usr/local
rpool_$INST_UUID/alarm/DATA/default/var 3.57M 45.7G 96K /mnt/ssd/var
rpool_$INST_UUID/alarm/DATA/default/var/games 152K 45.7G 96K /mnt/ssd/var/games
rpool_$INST_UUID/alarm/DATA/default/var/lib 856K 45.7G 96K /mnt/ssd/var/lib
rpool_$INST_UUID/alarm/DATA/default/var/lib/AccountsService 152K 45.7G 96K /mnt/ssd/var/lib/AccountsService
rpool_$INST_UUID/alarm/DATA/default/var/lib/docker 152K 45.7G 96K /mnt/ssd/var/lib/docker
rpool_$INST_UUID/alarm/DATA/default/var/lib/libvirt 152K 45.7G 96K /mnt/ssd/var/lib/libvirt
rpool_$INST_UUID/alarm/DATA/default/var/lib/lxc 152K 45.7G 96K /mnt/ssd/var/lib/lxc
rpool_$INST_UUID/alarm/DATA/default/var/lib/nfs 152K 45.7G 96K /mnt/ssd/var/lib/nfs
rpool_$INST_UUID/alarm/DATA/default/var/log 2.09M 45.7G 2.01M /mnt/ssd/var/log
rpool_$INST_UUID/alarm/DATA/default/var/snap 152K 45.7G 96K /mnt/ssd/var/snap
rpool_$INST_UUID/alarm/DATA/default/var/spool 160K 45.7G 96K /mnt/ssd/var/spool
rpool_$INST_UUID/alarm/DATA/default/var/www 96K 45.7G 96K /mnt/ssd/var/www
rpool_$INST_UUID/alarm/ROOT 2.86G 45.7G 96K none
rpool_$INST_UUID/alarm/ROOT/default 2.86G 45.7G 2.01G /mnt/ssd

However, mounting all datasets does not have the desired effect:
sudo zfs mount -a; ls /mnt/ssd/ ==> home/ opt/ root/ srv/ usr/ var/

While mounting some datasets (those with the property canmount=on) works, those with the property canmount=off fail:
sudo zfs mount -f rpool_hq9ept/alarm/DATA/default/etc ==> cannot mount 'rpool_hq9ept/alarm/DATA/default/etc': 'canmount' property is set to 'off'

Mounting all datasets separately by looping over the listed datasets was a temporary workaround, but it does not work a second time when mounting and chrooting into my filesystem:
for i in $(zfs list | grep -i rpool | awk -F" " '{ print $1 }'); do sudo zfs mount $i; done
Then datasets like "home" are missing.
ls /mnt/ssd/ ==> backup/ boot/ etc/ lost+found/ proc/ run/ sys/ usr/ bin/ dev/ lib/ mnt/ root/ sbin/ tmp/ var/

After unmounting all datasets, they are still mounted:
sudo zfs umount -a ==>
cannot unmount '/mnt/ssd/var/www': no such pool or dataset
cannot unmount '/mnt/ssd/var/spool': no such pool or dataset
cannot unmount '/mnt/ssd/var/snap': no such pool or dataset
cannot unmount '/mnt/ssd/var/log': unmount failed
cannot unmount '/mnt/ssd/var/lib/nfs': no such pool or dataset
cannot unmount '/mnt/ssd/var/lib/lxc': no such pool or dataset
cannot unmount '/mnt/ssd/var/lib/libvirt': no such pool or dataset
cannot unmount '/mnt/ssd/var/lib/docker': no such pool or dataset
cannot unmount '/mnt/ssd/var/lib/AccountsService': no such pool or dataset
cannot unmount '/mnt/ssd/var/games': no such pool or dataset
cannot unmount '/mnt/ssd/usr/local': no such pool or dataset
cannot unmount '/mnt/ssd/srv': no such pool or dataset
cannot unmount '/mnt/ssd/root': unmount failed
cannot unmount '/mnt/ssd/opt': no such pool or dataset
cannot unmount '/mnt/ssd/home/User': no such pool or dataset
cannot unmount '/mnt/ssd/home': no such pool or dataset

ls /mnt/ssd/==> backup/ boot/ etc/ lost+found/ proc/ run/ sys/ usr/
bin/ dev/ lib/ mnt/ root/ sbin/ tmp/ var/

Unmounting all listed datasets in the same way as I mounted them above results in the hidden datasets (home, opt, …) showing up again!
for i in $(zfs list | grep -i rpool | awk -F" " '{ print $1 }'); do sudo zfs umount $i; done ==>
cannot unmount 'rpool_hq9ept': not currently mounted
cannot unmount 'rpool_hq9ept/alarm': not currently mounted
cannot unmount 'rpool_hq9ept/alarm/DATA': not currently mounted
cannot unmount 'rpool_hq9ept/alarm/DATA/default': not currently mounted
cannot unmount 'rpool_hq9ept/alarm/DATA/default/etc': not currently mounted
cannot unmount '/mnt/ssd/home/User': no such pool or dataset
cannot unmount '/mnt/ssd/home/User': no such pool or dataset
cannot unmount '/mnt/ssd/opt': no such pool or dataset
cannot unmount '/mnt/ssd/root': unmount failed
cannot unmount '/mnt/ssd/srv': no such pool or dataset
cannot unmount 'rpool_hq9ept/alarm/DATA/default/usr': not currently mounted
cannot unmount '/mnt/ssd/usr/local': no such pool or dataset
cannot unmount 'rpool_hq9ept/alarm/DATA/default/var': not currently mounted
cannot unmount '/mnt/ssd/var/games': no such pool or dataset
cannot unmount 'rpool_hq9ept/alarm/DATA/default/var/lib': not currently mounted
cannot unmount '/mnt/ssd/var/lib/AccountsService': no such pool or dataset
cannot unmount '/mnt/ssd/var/lib/docker': no such pool or dataset
cannot unmount '/mnt/ssd/var/lib/libvirt': no such pool or dataset
cannot unmount '/mnt/ssd/var/lib/lxc': no such pool or dataset
cannot unmount '/mnt/ssd/var/lib/nfs': no such pool or dataset
cannot unmount '/mnt/ssd/var/log': unmount failed
cannot unmount '/mnt/ssd/var/snap': no such pool or dataset
cannot unmount '/mnt/ssd/var/spool': no such pool or dataset
cannot unmount '/mnt/ssd/var/www': no such pool or dataset
cannot unmount 'rpool_hq9ept/alarm/ROOT': not currently mounted

ls /mnt/ssd/ ==> home/ opt/ root/ srv/ usr/ var/

Can you explain the reason for this strange behavior when mounting datasets, and please fix this bug? It is an issue that holds back newcomers like me from using OpenZFS.
I have invested hours trying to solve or work around this issue.
Thank you very much for your great work in developing OpenZFS.
