
TrueNAS ZFS rebased to OpenZFS 2.0 RELEASE for SCALE #2

Closed
wants to merge 18 commits into from

Conversation

@ghost ghost commented Dec 1, 2020

Motivation and Context

Description

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

behlendorf and others added 18 commits November 30, 2020 09:43
Under certain conditions commit a3a4b8d appears to result in a
hang, or poor performance, when importing a pool.  Until the root
cause can be identified it has been reverted from the release branch.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#11245
Although dracut itself has a hard dependency on bash, its modules don't;
dracut only requires bash for module-setup (on a fully usable machine).
Inside the initramfs, dracut allows users to choose from a handful of
other shells, e.g. bash, busybox, dash, mksh.

In fact, my local machine's initramfs has been built with dash, and it
has been working for a very long time.

Before 64025fa (Silence 'make checkbashisms', 2020-08-20), we also
allowed our users to have that choice.

Let's fix the problems 'make checkbashisms' reported and allow our users
to have that choice again.

For the 'plymouth' case, simply run the command inside the 'if' instead
of checking for the command's existence before running it, because the
exit status is also a failure if plymouth is unavailable.

While we're at it, remove an unnecessary fork for grep in
zfs-generator.sh.in and replace the complicated 'if elif fi' that
followed it with a simple 'case ... esac'.

To support this change, also exclude 90zfs from "make checkbashisms"
because the current CI infrastructure ships an old version of
"checkbashisms", which complains about "command -v", while the latest
"checkbashisms" accepts it. In the near future, we can revert that
change to "Makefile.am" when the CI infrastructure is updated.

Reviewed-by: Gabriel A. Devenyi <gdevenyi@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Đoàn Trần Công Danh <congdanhqx@gmail.com>
Closes openzfs#11244
Extend the change made in ae12b02, which verifies that the zfs kernel
modules are loaded, to the rest of the OpenZFS services.  If the
modules aren't loaded then neither the share, volume, nor zed services
can be started.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#11243
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The current dmu_zfetch code implicitly assumes that I/Os complete
within min_sec_reap seconds. With async dmu and a read-only workload
such as an L2ARC rebuild (and thus no exponential backoff in operations
from the "write throttle"), it is possible to saturate the drives with
I/O requests. These are then effectively compounded with prefetch
requests.

This change reference counts streams and prevents them from being
recycled after their min_sec_reap timeout if they still have
outstanding I/Os.
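
A minimal sketch of that guard (the struct and function names below are
illustrative only, not the actual dmu_zfetch identifiers):

```
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stream state; field names are hypothetical. */
typedef struct zstream_sketch {
	uint64_t refs;   /* outstanding prefetch I/Os referencing this stream */
	uint64_t atime;  /* time of last access, in seconds */
} zstream_sketch_t;

/*
 * A stream may be recycled only if it is older than min_sec_reap AND has
 * no outstanding I/Os; previously only the age was considered.
 */
static bool
zstream_can_recycle(const zstream_sketch_t *zs, uint64_t now,
    uint64_t min_sec_reap)
{
	return (now - zs->atime >= min_sec_reap && zs->refs == 0);
}
```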

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes openzfs#10900
Currently streams are only freed when:
  - They have no referencing zfetch and their I/O references
    go to zero.
  - They are more than 2s old and a new I/O request comes in on
    the same zfetch.

This means that we will leak unreferenced streams when their zfetch
structure is freed.

This change checks the reference count on a stream at zfetch free
time. If it is zero we free it immediately. If it has remaining
references we allow the prefetch callback to free it at I/O
completion time.
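
A hedged sketch of that free-time decision (names and the free path are
assumptions for illustration, not the actual code):

```
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stream with an I/O reference count. */
typedef struct stream_sketch {
	uint64_t refs;      /* in-flight prefetch I/Os */
	bool     orphaned;  /* set when the owning zfetch has been freed */
} stream_sketch_t;

/* Called while tearing down the owning zfetch structure. */
static void
stream_on_zfetch_free(stream_sketch_t *zs)
{
	if (zs->refs == 0)
		free(zs);            /* no outstanding I/Os: free immediately */
	else
		zs->orphaned = true; /* defer the free to I/O completion */
}

/* Prefetch I/O completion callback: drop a reference, free if orphaned. */
static void
stream_io_done(stream_sketch_t *zs)
{
	if (--zs->refs == 0 && zs->orphaned)
		free(zs);
}
```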

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes openzfs#11052
The assert does not account for the case where there is a single
buffer in the chain that is decompressed and has a valid
checksum.

Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Investigating the influence of scrub (especially sequential scrub) on
random read latency, I've noticed that on some HDDs a single 4KB read
may take up to 4 seconds!  Deeper investigation showed that many HDDs
heavily prioritize sequential reads even when those are submitted with
a queue depth of 1.

This patch addresses the latency from two sides:
 - by using the _min_active queue depths for non-interactive requests
   while interactive request(s) are active, and for a few requests after;
 - by throttling them further if no interactive requests have completed
   while a configured number of non-interactive ones have.

While there, I've also modified vdev_queue_class_to_issue() to give
more chances to schedule at least _min_active requests to the lowest
priorities.  It should reduce starvation when several non-interactive
processes run at the same time as interactive ones, and I think it
should make it possible to set zfs_vdev_max_active as low as 1.
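
A rough sketch of the throttling idea (names such as interactive_active
and nia_credit are assumptions for illustration, not the real vdev_queue
members):

```
#include <stdint.h>

/* Illustrative per-vdev queue state. */
typedef struct vq_sketch {
	uint32_t interactive_active;  /* interactive I/Os currently issued */
	uint32_t nia_credit;          /* non-interactive issues still allowed
	                                 before an interactive completion is
	                                 required */
} vq_sketch_t;

/* Pick the allowed queue depth for a non-interactive class. */
static uint32_t
vq_class_depth(const vq_sketch_t *vq, uint32_t min_active, uint32_t max_active)
{
	/*
	 * While interactive I/Os are in flight, or the credit has run out
	 * because only non-interactive I/Os keep completing, fall back to
	 * the class's _min_active depth; otherwise allow the full depth.
	 */
	if (vq->interactive_active > 0 || vq->nia_credit == 0)
		return (min_active);
	return (max_active);
}
```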

I've benchmarked this change with 4KB random reads from a ZVOL with a
16KB block size on a newly written, non-fragmented pool.  On a
fragmented pool I also saw improvements, but not as dramatic.  Below are
log2 histograms of the random read latency in milliseconds for different
devices:

4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0 before:
0, 0, 2,  1,  12,  21,  19,  18, 10, 15, 17, 21
after:
0, 0, 0, 24, 101, 195, 419, 250, 47,  4,  0,  0
i.e. the maximum latency dropped from 2s to 500ms.

4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0 before:
0, 0,  2,  31,  38,  28,  18,  12, 17, 20, 24, 10, 3
after:
0, 0, 55, 247, 455, 470, 412, 181, 36,  0,  0,  0, 0
i.e. from 4s to 250ms.

1 SAS HDD SEAGATE ST14000NM0048 before:
0,  0,  29,   70, 107,   45,  27, 1, 0, 0, 1, 4, 19
after:
1, 29, 681, 1261, 676, 1633,  67, 1, 0, 0, 0, 0,  0
i.e. from 4s to 125ms.

1 SAS SSD SEAGATE XS3840TE70014 before (microseconds):
0, 0, 0, 0, 0, 0, 0, 0,  70, 18343, 82548, 618
after:
0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844,  90

I've also measured scrub time during the test and on idle pools.  On an
idle fragmented pool I measured the scrub getting a few percent faster
due to the use of QD3 instead of QD2 before.  On an idle non-fragmented
pool I measured no difference.  On a busy non-fragmented pool I measured
a scrub time increase of about 1.5-1.7x, while the IOPS increase reached
5-9x.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes openzfs#11166
This is needed for zfsd to autoreplace vdevs.

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Will Andrews <wca@FreeBSD.org>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
This allows parsing of zfs send progress by checking the process
title.

Doing so requires some changes to the send code in libzfs_sendrecv.c;
primarily these changes move some of the accounting around, to allow
the code either to be verbose as normal or to set the process title.
Unlike on BSD, setproctitle() isn't standard on Linux; I found a
reference to it in libbsd, and included autoconf-related changes to
test for that.
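
A minimal sketch of how such a guard might look (HAVE_SETPROCTITLE and
the libbsd header used here are assumptions; the actual
libzfs_sendrecv.c changes are more involved):

```
#include <stdio.h>
#include <stdint.h>
#ifdef HAVE_SETPROCTITLE
#include <bsd/unistd.h>   /* libbsd's setproctitle() on Linux; link with -lbsd */
#endif

/* Hypothetical helper: report send progress verbosely or via the title. */
static void
report_send_progress(const char *snap, uint64_t sent, uint64_t total)
{
#ifdef HAVE_SETPROCTITLE
	setproctitle("sending %s (%llu/%llu bytes)", snap,
	    (unsigned long long)sent, (unsigned long long)total);
#else
	(void) fprintf(stderr, "sending %s (%llu/%llu bytes)\n", snap,
	    (unsigned long long)sent, (unsigned long long)total);
#endif
}
```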

Authored-by: Sean Eric Fagan <sef@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
ghost commented Dec 1, 2020

Looks good. I have to force push though.

@ghost ghost closed this Dec 1, 2020
@ghost ghost deleted the check branch December 1, 2020 17:03
ghost pushed a commit that referenced this pull request Jul 26, 2021
`zpool_do_import()` passes `argv[0]`, (optionally) `argv[1]`, and
`pool_specified` to `import_pools()`.  If `pool_specified==FALSE`, the
`argv[]` arguments are not used.  However, these values may be off the
end of the `argv[]` array, so loading them could dereference unmapped
memory.  This error is reported by the asan build:

```
=================================================================
==6003==ERROR: AddressSanitizer: heap-buffer-overflow
READ of size 8 at 0x6030000004a8 thread T0
    #0 0x562a078b50eb in zpool_do_import zpool_main.c:3796
    #1 0x562a078858c5 in main zpool_main.c:10709
    #2 0x7f5115231bf6 in __libc_start_main
    #3 0x562a07885eb9 in _start

0x6030000004a8 is located 0 bytes to the right of 24-byte region
allocated by thread T0 here:
    #0 0x7f5116ac6b40 in __interceptor_malloc
    #1 0x562a07885770 in main zpool_main.c:10699
    #2 0x7f5115231bf6 in __libc_start_main
```

This commit passes NULL for these arguments if they are off the end
of the `argv[]` array.
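
The shape of the fix, roughly (a sketch only; the real `import_pools()`
in zpool_main.c takes more parameters):

```
#include <stddef.h>

/* Assumed prototype for illustration. */
extern int import_pools(const char *pool, const char *new_name,
    int pool_specified);

static int
do_import_sketch(int argc, char **argv, int pool_specified)
{
	/* Never read past the end of argv[]; pass NULL instead. */
	const char *pool = (argc > 0) ? argv[0] : NULL;
	const char *new_name = (argc > 1) ? argv[1] : NULL;

	return (import_pools(pool, new_name, pool_specified));
}
```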

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#12339
ghost pushed a commit that referenced this pull request Sep 16, 2021
`zpool_do_import()` passes `argv[0]`, (optionally) `argv[1]`, and
`pool_specified` to `import_pools()`.  If `pool_specified==FALSE`, the
`argv[]` arguments are not used.  However, these values may be off the
end of the `argv[]` array, so loading them could dereference unmapped
memory.  This error is reported by the asan build:

```
=================================================================
==6003==ERROR: AddressSanitizer: heap-buffer-overflow
READ of size 8 at 0x6030000004a8 thread T0
    #0 0x562a078b50eb in zpool_do_import zpool_main.c:3796
    #1 0x562a078858c5 in main zpool_main.c:10709
    #2 0x7f5115231bf6 in __libc_start_main
    #3 0x562a07885eb9 in _start

0x6030000004a8 is located 0 bytes to the right of 24-byte region
allocated by thread T0 here:
    #0 0x7f5116ac6b40 in __interceptor_malloc
    #1 0x562a07885770 in main zpool_main.c:10699
    #2 0x7f5115231bf6 in __libc_start_main
```

This commit passes NULL for these arguments if they are off the end
of the `argv[]` array.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#12339
ixhamza pushed a commit that referenced this pull request Mar 20, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    #7 0xffffffff80669841 at vgone+0x31
    #8 0xffffffff8065806d at vfs_hash_insert+0x26d
    #9 0xffffffff81a39069 at sfs_vgetx+0x149
    #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #11 0xffffffff8065a28c at lookup+0x45c
    #12 0xffffffff806594b9 at namei+0x259
    #13 0xffffffff80676a33 at kern_statat+0xf3
    #14 0xffffffff8067712f at sys_fstatat+0x2f
    #15 0xffffffff808ae50c at amd64_syscall+0x10c
    #16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.
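
A hedged sketch of what that looks like in a FreeBSD vop table (handler
names here are placeholders; the real zfsctl_ops_snapshot has many more
entries):

```
#include <sys/param.h>
#include <sys/vnode.h>

/* Placeholder handlers: nothing to do on open/close of a snapshot vnode. */
static int
zfsctl_snap_open_sketch(struct vop_open_args *ap)
{
	return (0);
}

static int
zfsctl_snap_close_sketch(struct vop_close_args *ap)
{
	return (0);
}

/* Explicit entries let vgonel()'s VOP_CLOSE() dispatch safely. */
struct vop_vector zfsctl_ops_snapshot_sketch = {
	.vop_default	= &default_vnodeops,
	.vop_open	= zfsctl_snap_open_sketch,
	.vop_close	= zfsctl_snap_close_sketch,
	/* ... other entries (lookup, inactive, reclaim, ...) omitted ... */
};
```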

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    #8 0xffffffff81a39069 at sfs_vgetx+0x149
    #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #10 0xffffffff80661c2c at lookup+0x45c
    #11 0xffffffff80660e59 at namei+0x259
    #12 0xffffffff8067e3d3 at kern_statat+0xf3
    #13 0xffffffff8067eacf at sys_fstatat+0x2f
    #14 0xffffffff808b5ecc at amd64_syscall+0x10c
    #15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes openzfs#14501
ixhamza pushed a commit that referenced this pull request Jun 12, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    #7 0xffffffff80669841 at vgone+0x31
    #8 0xffffffff8065806d at vfs_hash_insert+0x26d
    #9 0xffffffff81a39069 at sfs_vgetx+0x149
    #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #11 0xffffffff8065a28c at lookup+0x45c
    #12 0xffffffff806594b9 at namei+0x259
    #13 0xffffffff80676a33 at kern_statat+0xf3
    #14 0xffffffff8067712f at sys_fstatat+0x2f
    #15 0xffffffff808ae50c at amd64_syscall+0x10c
    #16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    #8 0xffffffff81a39069 at sfs_vgetx+0x149
    #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #10 0xffffffff80661c2c at lookup+0x45c
    #11 0xffffffff80660e59 at namei+0x259
    #12 0xffffffff8067e3d3 at kern_statat+0xf3
    #13 0xffffffff8067eacf at sys_fstatat+0x2f
    #14 0xffffffff808b5ecc at amd64_syscall+0x10c
    #15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes openzfs#14501
This pull request was closed.