TRIM/Discard support from Nexenta #3656

Closed

Conversation

Member

dweeezil commented Aug 2, 2015

This patch stack includes Nexenta's support for TRIM/Discard on disk and file vdevs as well as an update to the dkio headers for appropriate Solaris compatibility. It requires the current https://github.com/dweeezil/spl/tree/ntrim patch in order to compile properly.

The usual disclaimers apply at this point: I've performed moderate testing with ext4-backed file vdevs and light testing with SSD-backed disk vdevs and it appears to work properly. Use at your own risk. It may DESTROY YOUR DATA! I'm posting the pull request because it seems to work during initial testing and I'd like the buildbots to get a chance at it (which I'm expecting to fail unless they use the corresponding SPL code).

The initial TRIM support (currently in commit 719301c) caused frequent deadlocks in ztest due to the SCL_ALL spa locking during the trim operations. The follow-on patch to support on-demand trim changed the locking scheme and I'm no longer seeing deadlocks with either ztest or normal operation.

The final commit (currently 9e5cfd7) adds ZIL logging for zvol trim operations. This code was mostly borrowed from an older Nexenta patch (referenced in the commit log) and has been merged into the existing zvol trim function.

In order to enable the feature, you must run zpool set autotrim=on on the pool and the zfs_trim module parameter must be set to 1 (which is its default value). The zfs_trim parameter controls the lower-level vdev trimming, whereas the pool property controls it at a higher level. By default, trims are batched and only applied every 32 transaction groups, as controlled by the new zfs_txgs_per_trim parameter. This allows zpool import -T to continue to be useful. Finally, by default, only regions of at least 1MiB are trimmed, as set by the zfs_trim_min_ext_sz module parameter.
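A minimal sketch of how those two tunables gate trimming of a freed extent; the helper name below is hypothetical, and only the zfs_trim and zfs_trim_min_ext_sz parameters come from the patch:

/*
 * Hypothetical helper, not code from this patch: illustrates the gate
 * formed by the zfs_trim and zfs_trim_min_ext_sz module parameters.
 */
extern int zfs_trim;		/* module parameter from the patch */
extern int zfs_trim_min_ext_sz;	/* module parameter from the patch */

static boolean_t
trim_extent_wanted(uint64_t ext_size)
{
	/* Low-level vdev trimming must be enabled via zfs_trim. */
	if (!zfs_trim)
		return (B_FALSE);

	/* Skip regions smaller than zfs_trim_min_ext_sz (1MiB by default). */
	return (ext_size >= (uint64_t)zfs_trim_min_ext_sz);
}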

@dweeezil dweeezil changed the title from WIP - TIM/Discard support from Nexenta to WIP - TRIM/Discard support from Nexenta Aug 2, 2015

Contributor

edillmann commented Aug 4, 2015

Hi @dweeezil

FYI: zpool trim rpool triggers the following warning

[ 61.844048] Large kmem_alloc(101976, 0x1000), please file an issue at:
[ 61.844048] https://github.com/zfsonlinux/zfs/issues/new
[ 61.844053] CPU: 3 PID: 392 Comm: spl_system_task Tainted: P OE 3.19.0 #4
[ 61.844054] Hardware name: Dell Inc. Dell Precision M3800/Dell Precision M3800, BIOS A07 10/14/2014
[ 61.844055] 000000000000c2d0 ffff88041557fc48 ffffffff81732ffb 0000000000000001
[ 61.844057] 0000000000000000 ffff88041557fc88 ffffffffa0209b73 ffff880400000000
[ 61.844059] 00000000000018e3 ffff880409ef9c00 ffff880409ef9c00 ffff8802c42d08c0
[ 61.844061] Call Trace:
[ 61.844068] [] dump_stack+0x45/0x57
[ 61.844091] [] spl_kmem_zalloc+0x113/0x180 [spl]
[ 61.844128] [] zio_trim+0x79/0x1b0 [zfs]
[ 61.844147] [] metaslab_exec_trim+0xa6/0xf0 [zfs]
[ 61.844166] [] metaslab_trim_all+0x10c/0x1a0 [zfs]
[ 61.844188] [] vdev_trim_all+0x13d/0x310 [zfs]
[ 61.844194] [] taskq_thread+0x205/0x450 [spl]
[ 61.844198] [] ? wake_up_state+0x20/0x20
[ 61.844203] [] ? taskq_cancel_id+0x120/0x120 [spl]
[ 61.844206] [] kthread+0xd2/0xf0
[ 61.844208] [] ? kthread_create_on_node+0x180/0x180
[ 61.844210] [] ret_from_fork+0x7c/0xb0
[ 61.844212] [] ? kthread_create_on_node+0x180/0x180

module/zfs/zio.c
+ return (sub_pio);
+
+ num_exts = avl_numnodes(&tree->rt_root);
+ dfl = kmem_zalloc(DFL_SZ(num_exts), KM_SLEEP);
@ryao

ryao Aug 7, 2015

Member

We probably need to change this to vmem_zalloc() to address the issue that @edillmann reported due to the way that we implemented kmem_zalloc(). It should be ifdefed to Linux because the O3X port that will likely merge this code is using the real kmem_zalloc().
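A rough sketch of what that might look like at the allocation site quoted above; the #ifdef split is an assumption drawn from this suggestion, not code from the patch:

#ifdef __linux__
	/* Large dfl arrays can exceed what the Linux SPL kmem_zalloc() handles comfortably. */
	dfl = vmem_zalloc(DFL_SZ(num_exts), KM_SLEEP);
#else
	/* O3X and other ports keep the real kmem_zalloc(). */
	dfl = kmem_zalloc(DFL_SZ(num_exts), KM_SLEEP);
#endif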

@dweeezil

dweeezil Aug 8, 2015

Member

I think I'm going to rework zio_trim() as well as the other consumers of the dkioc free lists to work with a linked list rather than an array. This would allow us to avoid the large allocations but would cause the code to diverge a bit from the upstream. Another benefit is that we'd avoid the double allocation and copy which typically occurs in zio_trim() when zfs_trim_min_ext_sz is set to a large value.
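A hypothetical sketch of the linked-list shape being considered here (the names are illustrative, not from this PR): per-extent nodes would replace the single large dfl array, so each allocation stays small and the zfs_trim_min_ext_sz filtering can happen without a second allocation and copy.

/* Illustrative only; not code from this PR. */
typedef struct trim_ext {
	uint64_t	te_start;	/* extent offset on the vdev */
	uint64_t	te_length;	/* extent length in bytes */
	list_node_t	te_node;	/* linkage in a list_t of extents */
} trim_ext_t;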

@dweeezil: sorry to be a pain, but I'm curious to know the status on this. We've got a few SSD-only pools to play with for a few days before we stuff 'em into prod (making sure our hardware doesn't screw us), so we can do a bit of testing on this without losing production data if you happen to have some test paths for us to run through. Thanks

Member

dweeezil commented Sep 9, 2015

@sempervictus I'm actively looking for feedback. The patch does need to be refreshed against a current master codebase which I'll try to do today. There's a bit of interference with the recent zvol improvements.

In my own testing, the patch does appear to work properly, although the behavior of the TRIM "batching" needs a bit better documentation and, possibly, a slightly different implementation (IIRC, one or both of the parameters only take effect at module load and/or pool import). I'd also like to add some kstats to help monitor its behavior.

I've used the on-demand TRIM quite a bit and it seems to work perfectly. You can TRIM a pool with zpool trim <pool> and monitor its progress with zpool status (although I'd expect it to be pretty instantaneous on most SSDs). Most of my testing, however, has used file-based vdevs but I have also used real SSDs. I just got a couple of new SSDs today which I plan on using for a bit more extensive testing.

There's also a backport to 0.6.4.2 in a branch named "ntrim-0.6.4.2" ("ntrim-0.6.4.1" for SPL).

As soon as this is updated to reflect changes in master we'll add it to our stack. One potential caveat is that we generally utilize dm-crypt with the discard option at mount time. Any thoughts on potential side effects from this? Has this sort of setup been tested in any way?

@dweeezil dweeezil changed the title from WIP - TRIM/Discard support from Nexenta to TRIM/Discard support from Nexenta Sep 14, 2015

Contributor

edillmann commented Sep 14, 2015

Hi @dweeezil ,

Just to let you know, I have been running this pull request since it was released, and besides the kmem_alloc warning, I did not see any problem or corruption on my test zpool (dual SSD mirror). The system has been crunching video camera recordings for 2 months :-)

Is there any hope of having it rebased on master?

Tried to throw this into our stack today and noticed it has some conflicts with ABD in the raidz code. Rumor has it that should be merged "soon after the 0.6.5 tag", so I'm hoping by the next rebase it'll be in there (nudge @behlendorf) :).

Contributor

edillmann commented Sep 14, 2015

@dweeezil I didn't see it was already rebased, thanks.

Contributor

Mic92 commented Oct 18, 2015

So is SATA TRIM currently not supported, according to the comments in the source, or is this handled properly by the SPL?

Member

dweeezil commented Oct 18, 2015

@Mic92 SATA TRIM works just fine and I've tested it plenty. The documentation is still the original from Illumos. TRIM will work on any block device vdev supporting BLKDISCARD or any file vdev on which the containing filesystem supports fallocate hole punching.

@dweeezil: I would love to test this patch but it conflicts with the ABD branch (pull 3441) in vdev_raidz.c

I have the .rej files from both sides (applied ABD first then this patch, and then this patch first then ABD) if that helps at all

Much appreciated

Contributor

skiselkov commented Oct 29, 2015

Hey guys, I wanted to get your take on the latest submission on this that we're trying to get upstreamed from Nexenta. I'd primarily like to make the bottom end of the ZFS portion more accommodating to Linux & FreeBSD.
If you could, please drop by and take a look at https://reviews.csiden.org/r/263/

Member

dweeezil commented Nov 4, 2015

@greg-hydrogen I tried transplanting the relevant commits onto ABD a while ago and, other than the bio argument issues, the other main conflict is the logging I added to discards on zvols. The vdev conflicts you likely ran into are pretty easy to fix. I'll try to get an ABD-based version of this working within the next few days.

@skiselkov I'll check it out. It looks to be a port of the same Nexenta code in this pull request, correct?

Contributor

skiselkov commented Nov 4, 2015

@dweeezil It is indeed, with some minor updates & fixes.

vaLski commented Nov 10, 2015

Can't compile it on CentOS 6.7. Attempted to install

spl-0.6.5.3 from github
zfs-0.6.5.3 from github + dweeezil:ntrim / #3656

Used the following part "If, instead you would like to use the GIT version, use the following commands instead:" from http://zfsonlinux.org/generic-rpm.html

On the last step, make rpm-utils rpm-dkms

It fails with:

Preparing... ########################################### [100%]
1:zfs-dkms ########################################### [100%]
Removing old zfs-0.6.5 DKMS files...


Deleting module version: 0.6.5

completely from the DKMS tree.

Done.
Loading new zfs-0.6.5 DKMS files...
Building for 2.6.32-573.7.1.el6.x86_64
Building initial module for 2.6.32-573.7.1.el6.x86_64
Error! Bad return status for module build on kernel: 2.6.32-573.7.1.el6.x86_64 (x86_64)
Consult /var/lib/dkms/zfs/0.6.5/build/make.log for more information.
warning: %post(zfs-dkms-0.6.5-36_gafb9fad.el6.noarch) scriptlet failed, exit status 10

Log says

CC [M] /var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.o
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:36:33: error: sys/dkioc_free_util.h: No such file or directory
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c: In function 'vdev_disk_io_start':
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:693: error: 'DKIOCFREE' undeclared (first use in this function)
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:693: error: (Each undeclared identifier is reported only once
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:693: error: for each function it appears in.)
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:696: error: 'dkioc_free_list_t' undeclared (first use in this function)
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:696: error: 'dfl' undeclared (first use in this function)
make[5]: *** [/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.o] Error 1
make[4]: *** [/var/lib/dkms/zfs/0.6.5/build/module/zfs] Error 2
make[3]: *** [module/var/lib/dkms/zfs/0.6.5/build/module] Error 2
make[3]: Leaving directory `/usr/src/kernels/2.6.32-573.7.1.el6.x86_64'
make[2]: *** [modules] Error 2
make[2]: Leaving directory `/var/lib/dkms/zfs/0.6.5/build/module'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/var/lib/dkms/zfs/0.6.5/build'
make: *** [all] Error 2

strace shows

[pid 20344] open("include/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)
[pid 20344] open("/usr/src/kernels/2.6.32-573.7.1.el6.x86_64/arch/x86/include/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)
[pid 20344] open("/var/lib/dkms/zfs/0.6.5/build/include/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)
[pid 20344] open("/usr/src/spl-0.6.5/include/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)
[pid 20344] open("/usr/src/spl-0.6.5/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)
[pid 20344] open("/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)
[pid 20344] write(2, "sys/dkioc_free_util.h: No such f"..., 48) = 48

Files are at

find / -name dkioc_free_util.h
/usr/include/libspl/sys/dkioc_free_util.h
/usr/src/zfs-0.6.5/lib/libspl/include/sys/dkioc_free_util.h
/usr/src/zfs/lib/libspl/include/sys/dkioc_free_util.h
/var/lib/dkms/zfs/0.6.5/build/lib/libspl/include/sys/dkioc_free_util.h

Symlinked one of the "search" locations to dkioc_free_util.h but it started throwing other errors:

In file included from /var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:36:
/usr/src/kernels/2.6.32-573.7.1.el6.x86_64/arch/x86/include/sys/dkioc_free_util.h:25: error: expected ')' before '*' token

Attempted to solve those but to no avail.

It seems that dkioc_free_util.h is added by the "ntrim" branch, as the dfl_free function is referenced in module/zfs/zio.c and module/zfs/vdev_raidz.c where trim is mentioned.

I really hope that this is the correct place to post this issue as it is directly related to this merge request.

Let me know if there is anything else I can assist with.

Member

dweeezil commented Nov 10, 2015

@vaLski I suspect you need the corresponding spl patch from https://github.com/dweeezil/spl/tree/ntrim which I just rebased to master as dweeezil/spl@305d417.

vaLski commented Nov 10, 2015

@dweeezil sorry for overlooking this; you are right. Now everything is building just as expected. I will report later how the patch works for me. Initial testing seems fine.

Contributor

RichardSharpe commented Nov 21, 2015

So, I have built this and we are testing it, but I encountered a problem.

If I run zpool trim on the same zpool twice, things seem to get stuck in the SPL and stop working.

What can I do to debug this?

Member

dweeezil commented Nov 21, 2015

@RichardSharpe A good start would be to get some stack traces from the blocked processes. Generally they'll show up in your syslog after several minutes. Otherwise, typically "echo b > /proc/sysrq-trigger" will cause them to be generated.

As to the trigger for your problem, was the first trim still running when you ran the second one? This type of hang sounds suspiciously like a taskq dispatch problem. Hopefully the stack traces from the blocked processes will shed some light on it.

@dweeezil

echo b will trigger a machine reboot :)

Contributor

Mic92 commented Nov 21, 2015

I could test this on an SSD RAID. Would this actually make a difference implementation-wise?

Member

dweeezil commented Nov 21, 2015

@tomposmiko Blush, I meant echo w > /proc/sysrq-trigger, of course (ugh, long day).

@Mic92 There is differing support code for the different types of vdevs. In particular, raidz has its own particular support. I have personally tested with raidz, mirrors, stripes and file vdevs and all seemed to be properly trimmed.

As a reminder to anyone working with this, there are some new module parameters involved (as well as a new pool property which is also in the relevant man page):

       zfs_trim (int)
                   Controls whether the underlying vdevs of the pool are noti‐
                   fied when space is  freed  using  the  device-type-specific
                   command  set  (TRIM  here  being a general placeholder term
                   rather than referring to just the SATA TRIM command).  This
                   is frequently used on backing storage devices which support
                   thin provisioning or pre-erasure of blocks on flash media.

                   Default value: 0.

and

       zfs_trim_min_ext_sz (int)
                   Minimum size region in bytes over which  a  device-specific
                   TRIM  command  will  be  sent  to the underlying vdevs when
                   zfs_trim is set.

                   Default value: 1048576.

and

       zfs_txgs_per_trim (int)
                   Number of transaction  groups  over  which  device-specific
                   TRIM commands are batched when zfs_trim is set.

                   Default value: 32.

It's been a while, but IIRC the latter two really ought to be set at module load time (with either kernel command-line parameters or modprobe options).

During my testing, after enabling the first one, of course, I generally set zfs_trim_min_ext_sz=0 so that even the smallest regions would be trimmed.

Also, IIRC, zfs_txgs_per_trim should not be set too low. Based on the existing logic, I think it makes no sense to set it lower than 2 (to "hurry along" the trim operations).

Member

dweeezil commented Nov 21, 2015

I'll push a refresh shortly to fix the compile problem with debug builds.

Member

dweeezil commented Nov 21, 2015

As of a2cd68b, this ought to build properly for debug builds. We'll see what the buildbots think of it. The corresponding spl branch has also been rebased to a current master.

Contributor

RichardSharpe commented Nov 21, 2015

OK, looking at messages I have the stack traces:
[ 480.450269] spl_system_task D ffff88033fc93680 0 3476 2 0x00000080
[ 480.450279] ffff880033407cb8 0000000000000046 ffff880033407fd8 0000000000013680
[ 480.450281] ffff880033407fd8 0000000000013680 ffff880328520000 ffff88033fc93f48
[ 480.450283] ffff8801d9382ee8 ffff8801d9382f28 ffff8801d9382f10 0000000000000001
[ 480.450286] Call Trace:
[ 480.450295] [] io_schedule+0x9d/0x140
[ 480.450312] [] cv_wait_common+0xae/0x150 [spl]
[ 480.450317] [] ? wake_up_bit+0x30/0x30
[ 480.450321] [] __cv_wait_io+0x18/0x20 [spl]
[ 480.450356] [] zio_wait+0x123/0x210 [zfs]
[ 480.450375] [] ? spa_config_exit+0x8d/0xc0 [zfs]
[ 480.450394] [] vdev_trim_all+0x189/0x370 [zfs]
[ 480.450398] [] taskq_thread+0x21e/0x420 [spl]
[ 480.450401] [] ? wake_up_state+0x20/0x20
[ 480.450405] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[ 480.450407] [] kthread+0xcf/0xe0
[] ? kthread_create_on_node+0x140/0x140
[ 480.450412] [] ret_from_fork+0x58/0x90
[ 480.450413] [] ? kthread_create_on_node+0x140/0x140
[ 480.450756] txg_sync D ffff88033fd13680 0 4375 2 0x00000080
[ 480.450759] ffff8802cf33bba8 0000000000000046 ffff8802cf33bfd8 0000000000013680
[ 480.450760] ffff8802cf33bfd8 0000000000013680 ffff8802ce45c440 ffff88033fd13f48
[ 480.450763] ffff8801bdd50788 ffff8801bdd507c8 ffff8801bdd507b0 0000000000000001
[ 480.450764] Call Trace:
[ 480.450767] [] io_schedule+0x9d/0x140
[ 480.450772] [] cv_wait_common+0xae/0x150 [spl]
[ 480.450774] [] ? wake_up_bit+0x30/0x30
[ 480.450778] [] __cv_wait_io+0x18/0x20 [spl]
[ 480.450796] [] zio_wait+0x123/0x210 [zfs]
[ 480.450815] [] vdev_config_sync+0xf4/0x140 [zfs]
[ 480.450832] [] spa_sync+0x91c/0xb70 [zfs]
[ 480.450834] [] ? autoremove_wake_function+0x2b/0x40
[ 480.450852] [] txg_sync_thread+0x3cc/0x640 [zfs]
[ 480.450869] [] ? txg_fini+0x2a0/0x2a0 [zfs]
[ 480.450873] [] thread_generic_wrapper+0x7a/0x90 [spl]
[ 480.450876] [] ? __thread_exit+0x20/0x20 [spl]
[ 480.450878] [] kthread+0xcf/0xe0
[ 480.450880] [] ? kthread_create_on_node+0x140/0x140
[ 480.450882] [] ret_from_fork+0x58/0x90
[ 480.450884] [] ? kthread_create_on_node+0x140/0x140
[ 600.450747] spl_system_task D ffff88033fc93680 0 3476 2 0x00000080
[ 600.450750] ffff880033407cb8 0000000000000046 ffff880033407fd8 0000000000013680
[ 600.450753] ffff880033407fd8 0000000000013680 ffff880328520000 ffff88033fc93f48
[ 600.450755] ffff8801d9382ee8 ffff8801d9382f28 ffff8801d9382f10 0000000000000001
[ 600.450757] Call Trace:
[ 600.450764] [] io_schedule+0x9d/0x140
[ 600.450779] [] cv_wait_common+0xae/0x150 [spl]
[ 600.450783] [] ? wake_up_bit+0x30/0x30
[ 600.450788] [] __cv_wait_io+0x18/0x20 [spl]
[ 600.450821] [] zio_wait+0x123/0x210 [zfs]
[ 600.450842] [] ? spa_config_exit+0x8d/0xc0 [zfs]
[ 600.450863] [] vdev_trim_all+0x189/0x370 [zfs]
[ 600.450868] [] taskq_thread+0x21e/0x420 [spl]
[ 600.450871] [] ? wake_up_state+0x20/0x20
[ 600.450875] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[ 600.450877] [] kthread+0xcf/0xe0
[ 600.450879] [] ? kthread_create_on_node+0x140/0x140
[ 600.450882] [] ret_from_fork+0x58/0x90
[ 600.450884] [] ? kthread_create_on_node+0x140/0x140

Looks like they are starting to repeat now.

Member

dweeezil commented Nov 21, 2015

@RichardSharpe At first glance, that looks like the condition which zfsonlinux/spl@a64e557 was designed to address. If you're using dynamic taskqs and have zfsonlinux/spl@f5f2b87, you might try setting spl_taskq_thread_sequential=0; otherwise, disable dynamic taskqs.

Contributor

RichardSharpe commented Nov 21, 2015

OK, I have a64e557 in the repos I used.

I will try setting spl_taskq_thread_sequential=0, or should I just set spl_taskq_thread_dynamic=0?

Member

dweeezil commented Nov 22, 2015

@RichardSharpe If you don't have zfsonlinux/spl@f5f2b87, use spl_taskq_thread_dynamic=0 on module load.

Contributor

RichardSharpe commented Nov 23, 2015

Heh,

Just tried 'zpool trim ...' and got this:

Message from syslogd@NTNX-10-5-40-237-A-NVM at Nov 22 20:07:59 ...
kernel:[177460.307994] VERIFY3(taskq_dispatch(system_taskq, (void (*)(void *))vdev_trim_all, vti, 0x00000000 | 0x01000000) != 0) failed (0 != 0)

Message from syslogd@NTNX-10-5-40-237-A-NVM at Nov 22 20:07:59 ...
kernel:[177460.308559] PANIC at spa.c:6810:spa_trim()

Contributor

RichardSharpe commented Nov 23, 2015

Seems to be here.

Does that mean corruption of ZFS?

I had just done a test where I copied on some 18GB, then deleted it, then did a zpool trim.

Member

dweeezil commented Nov 23, 2015

@RichardSharpe I wasn't really happy about the VERIFY in https://github.com/dweeezil/zfs/blob/ntrim/module/zfs/spa.c#L6808 from the moment I saw it. There's only one other place where we use TQ_NOQUEUE and it doesn't result in a panic. Given that trimming is an operation which is OK to not complete, I think I'm going to modify the code so it either retries or simply bails if there aren't any threads free.

As to the details, how many cores/threads does your system have? What's the pool layout (zpool status)? This is not some sort of fatal corruption; it is most likely happening because you've got more top-level vdevs in your pool than the system has available threads (which is how many threads the system taskq is given).

The more I think about it, the more I think it would be a good idea to make a new taskq specifically for this purpose. I'm also thinking it might be worthwhile to add a per-pool stat regarding the last on-demand trim which could be examined with zpool status.
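A rough sketch of that dedicated-taskq idea, assuming a hypothetical spa_trim_taskq field (the name is mine, not the patch's): sizing it to the number of top-level vdevs would keep an on-demand trim from competing for the system taskq's threads.

/* Hypothetical sketch, not part of the posted patch. */
spa->spa_trim_taskq = taskq_create("z_trim",
    (int)spa->spa_root_vdev->vdev_children, maxclsyspri,
    1, INT_MAX, TASKQ_PREPOPULATE);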

In the mean time, the following patch ought to work around the problem:

diff --git a/module/zfs/spa.c b/module/zfs/spa.c
index e64467a..262d6c4 100644
--- a/module/zfs/spa.c
+++ b/module/zfs/spa.c
@@ -6805,9 +6805,12 @@ spa_trim(spa_t *spa, uint64_t rate)
                spa_open_ref(spa, vti);

                vd->vdev_trim_prog = 0;
-               VERIFY3U(taskq_dispatch(system_taskq,
+               if (taskq_dispatch(system_taskq,
                    (void (*)(void *))vdev_trim_all, vti,
-                   TQ_SLEEP | TQ_NOQUEUE), !=, 0);
+                   TQ_SLEEP | TQ_NOQUEUE) == 0) {
+                       spa->spa_num_trimming--;
+                       break;
+               }
        }
        mutex_exit(&spa->spa_trim_lock);
        spa_config_exit(spa, SCL_TRIM_ALL, FTAG);

The ultimate effect is that it might take a few zpool trim operations to get the pool fully trimmed.

Contributor

RichardSharpe commented Nov 23, 2015

zpool status
pool: zpool-share-a3339f8f-5cdb-46d8-bc33-e3d1459b2703-5e791f69-dd6e-4170-a319-7586a996b82a
state: ONLINE
scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    zpool-share-a3339f8f-5cdb-46d8-bc33-e3d1459b2703-5e791f69-dd6e-4170-a319-7586a996b82a  ONLINE       0     0     0
      mpathi    ONLINE       0     0     0
      mpathh    ONLINE       0     0     0
      mpathj    ONLINE       0     0     0
      mpathf    ONLINE       0     0     0
    logs
      mpathg    ONLINE       0     0     0

errors: No known data errors

Four cores. 12GB of memory.

It's a VM.

Contributor

RichardSharpe commented Nov 23, 2015

I have seen this once before.

To be honest, we are building our UNMAP implementation as well, and that had a bug. At first it was responding to all UNMAP requests with ILLEGAL REQUEST, but we fixed that.

Could that have corrupted things?

Member

dweeezil commented Nov 23, 2015

@RichardSharpe 4 top-level vdev stripes plus 1 log device = 5 actual top-level vdevs. On-demand trim will try to dispatch all 5 to the system taskq, which is only configured to run 4 concurrent threads (based on your VM guest's core count). If you can bump the guest's core count, the error ought to go away.

Contributor

skiselkov commented Nov 23, 2015

@dweeezil You are absolutely correct about the VERIFY on taskq_dispatch with TQ_NOQUEUE; we had another bit of code that had that in and removed it. I guess I overlooked this bit. I will soon push an update to the illumos feature review which will fix this and also place manual trims into a separate taskq so as not to cannibalize the system_taskq.

Contributor

RichardSharpe commented Nov 23, 2015

So, I increased the number of vCPUs to 8 but it looks like I can't win.

My latest attempt (copy a bunch of data on, delete it, run zpool trim, then copy some more data) hit this:

[ 2640.503074] INFO: task spl_system_task:3602 blocked for more than 120 seconds.
[ 2640.503861] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2640.504601] spl_system_task D ffff88033fd53680 0 3602 2 0x00000080
[ 2640.504605] ffff88032d8bfcb8 0000000000000046 ffff88032d8bffd8 0000000000013680
[ 2640.504607] ffff88032d8bffd8 0000000000013680 ffff88032d8016c0 ffff88033fd53f48
[ 2640.504615] ffff88012ba674e8 ffff88012ba67528 ffff88012ba67510 0000000000000001
[ 2640.504619] Call Trace:
[ 2640.504626] [] io_schedule+0x9d/0x140
[ 2640.504637] [] cv_wait_common+0xae/0x150 [spl]
[ 2640.504641] [] ? wake_up_bit+0x30/0x30
[ 2640.504646] [] __cv_wait_io+0x18/0x20 [spl]
[ 2640.504683] [] zio_wait+0x123/0x210 [zfs]
[ 2640.504707] [] ? spa_config_exit+0x8d/0xc0 [zfs]
[ 2640.504729] [] vdev_trim_all+0x189/0x370 [zfs]
[ 2640.504734] [] taskq_thread+0x21e/0x420 [spl]
[ 2640.504737] [] ? wake_up_state+0x20/0x20
[ 2640.504741] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[ 2640.504743] [] kthread+0xcf/0xe0
[ 2640.504745] [] ? kthread_create_on_node+0x140/0x140
[ 2640.504748] [] ret_from_fork+0x58/0x90
[ 2640.504750] [] ? kthread_create_on_node+0x140/0x140
[ 2640.504770] INFO: task txg_sync:4493 blocked for more than 120 seconds.
[ 2640.505193] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2640.505925] txg_sync D ffff88033fc13680 0 4493 2 0x00000080
[ 2640.505928] ffff88009fc23ba8 0000000000000046 ffff88009fc23fd8 0000000000013680
[ 2640.505930] ffff88009fc23fd8 0000000000013680 ffff8800347fad80 ffff88033fc13f48
[ 2640.505932] ffff880042fa3c08 ffff880042fa3c48 ffff880042fa3c30 0000000000000001
[ 2640.505934] Call Trace:
[ 2640.505937] [] io_schedule+0x9d/0x140
[ 2640.505942] [] cv_wait_common+0xae/0x150 [spl]
[ 2640.505944] [] ? wake_up_bit+0x30/0x30
[ 2640.505947] [] __cv_wait_io+0x18/0x20 [spl]
[ 2640.505967] [] zio_wait+0x123/0x210 [zfs]
[ 2640.505986] [] vdev_config_sync+0xf4/0x140 [zfs]
[ 2640.506006] [] spa_sync+0x91c/0xb70 [zfs]
[ 2640.506019] [] ? autoremove_wake_function+0x2b/0x40
[ 2640.506051] [] txg_sync_thread+0x3cc/0x640 [zfs]
[ 2640.506075] [] ? txg_fini+0x2a0/0x2a0 [zfs]
[ 2640.506082] [] thread_generic_wrapper+0x7a/0x90 [spl]
[ 2640.506086] [] ? __thread_exit+0x20/0x20 [spl]
[ 2640.506087] [] kthread+0xcf/0xe0
[ 2640.506089] [] ? kthread_create_on_node+0x140/0x140
[ 2640.506092] [] ret_from_fork+0x58/0x90
[ 2640.506093] [] ? kthread_create_on_node+0x140/0x140

Maybe I need to build a new ZPool and start again.

Contributor

RichardSharpe commented Nov 23, 2015

A question. I notice that up above it says that you have to set the zfs_trim parameter to 1 at module load time, but when I checked /sys/module/zfs/parameters/zfs_trim it was 1. Was that because I set the zpool property autotrim?

Contributor

RichardSharpe commented Nov 23, 2015

So, I started again, this time with four vdevs only and 8 CPUs. Then I copied about 31GB onto the ZFS filesystem, deleted it, waited 10 or so seconds, and issued zpool trim, but hit that panic:

zpool trim special

Message from syslogd@NTNX-10-5-40-237-A-NVM at Nov 23 09:03:41 ...
kernel:[ 1153.175911] VERIFY3(taskq_dispatch(system_taskq, (void (*)(void *))vdev_trim_all, vti, 0x00000000 | 0x01000000) != 0) failed (0 != 0)

Message from syslogd@NTNX-10-5-40-237-A-NVM at Nov 23 09:03:41 ...
kernel:[ 1153.176468] PANIC at spa.c:6810:spa_trim()

My zpool looks like:

# zpool status -v
pool: special
state: ONLINE
scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    special     ONLINE       0     0     0
      mpathi    ONLINE       0     0     0
      mpathh    ONLINE       0     0     0
      mpathj    ONLINE       0     0     0
      mpathf    ONLINE       0     0     0

errors: No known data errors

I'm thinking of hacking that ASSERT out and trying again.

Contributor

RichardSharpe commented Nov 23, 2015

It seems like I could change TQ_NOQUEUE to TQ_QUEUE to prevent that panic.

Does that seem correct?

Contributor

skiselkov commented Nov 23, 2015

@RichardSharpe It will get rid of this panic, but I'm preparing a more complex fix that, in addition to what you describe, places on-demand trim into the same taskq as autotrim.

Contributor

RichardSharpe commented Nov 23, 2015

OK, remove TQ_NOQUEUE :-)

Contributor

RichardSharpe commented Nov 24, 2015

After removing TQ_NOQUEUE, I get the following 'hang' after creating a new zpool, creating a fs in it, copying some 34TB of data to it, deleting the data after 10 minutes, waiting 120 seconds and then doing a zpool trim:

[25800.519092] INFO: task spl_system_task:3512 blocked for more than 120 seconds.
[25800.519492] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[25800.519631] spl_system_task D ffff88033fc93680 0 3512 2 0x00000080
[25800.519634] ffff88032cb9fcb8 0000000000000046 ffff88032cb9ffd8 0000000000013680
[25800.519637] ffff88032cb9ffd8 0000000000013680 ffff880329a06660 ffff88033fc93f48
[25800.519639] ffff88003a12cd88 ffff88003a12cdc8 ffff88003a12cdb0 0000000000000001
[25800.519641] Call Trace:
[25800.519647] [] io_schedule+0x9d/0x140
[25800.519658] [] cv_wait_common+0xae/0x150 [spl]
[25800.519662] [] ? wake_up_bit+0x30/0x30
[25800.519666] [] __cv_wait_io+0x18/0x20 [spl]
[25800.519695] [] zio_wait+0x123/0x210 [zfs]
[25800.519716] [] ? spa_config_exit+0x8d/0xc0 [zfs]
[25800.519737] [] vdev_trim_all+0x189/0x370 [zfs]
[25800.519741] [] taskq_thread+0x21e/0x420 [spl]
[25800.519744] [] ? wake_up_state+0x20/0x20
[25800.519748] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[25800.519750] [] kthread+0xcf/0xe0
[25800.519752] [] ? kthread_create_on_node+0x140/0x140
[25800.519755] [] ret_from_fork+0x58/0x90
[25800.519757] [] ? kthread_create_on_node+0x140/0x140
[25800.519772] INFO: task txg_sync:18977 blocked for more than 120 seconds.
[25800.519882] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[25800.520015] txg_sync D ffff88033fc93680 0 18977 2 0x00000080
[25800.520030] ffff880286d9bba8 0000000000000046 ffff880286d9bfd8 0000000000013680
[25800.520032] ffff880286d9bfd8 0000000000013680 ffff8802d462ad80 ffff88033fc93f48
[25800.520034] ffff88003a128be8 ffff88003a128c28 ffff88003a128c10 0000000000000001
[25800.520036] Call Trace:
[25800.520039] [] io_schedule+0x9d/0x140
[25800.520044] [] cv_wait_common+0xae/0x150 [spl]
[25800.520046] [] ? wake_up_bit+0x30/0x30
[25800.520050] [] __cv_wait_io+0x18/0x20 [spl]
[25800.520076] [] zio_wait+0x123/0x210 [zfs]
[25800.520100] [] vdev_config_sync+0xf4/0x140 [zfs]
[25800.520123] [] spa_sync+0x91c/0xb70 [zfs]
[25800.520129] [] ? autoremove_wake_function+0x2b/0x40
[25800.520149] [] txg_sync_thread+0x3cc/0x640 [zfs]
[25800.520177] [] ? txg_fini+0x2a0/0x2a0 [zfs]
[25800.520182] [] thread_generic_wrapper+0x7a/0x90 [spl]
[25800.520186] [] ? __thread_exit+0x20/0x20 [spl]
[25800.520189] [] kthread+0xcf/0xe0
[25800.520191] [] ? kthread_create_on_node+0x140/0x140
[25800.520193] [] ret_from_fork+0x58/0x90
[25800.520195] [] ? kthread_create_on_node+0x140/0x140

Owner

behlendorf commented Dec 10, 2015

@RichardSharpe I suspect your recent hang was unrelated and caused by some recent taskq issues which were fixed in master. @dweeezil it would be great if you could rebase this on master and force update the branch.

@dweeezil: Will this also trim l2arc/slog or just the vdev devices?

I'm asking because the l2arc in particular gets a lot of writes, and some SSDs get themselves into a pretty bad steady-state mode if not given trim commands (e.g. the Samsung 840 Pro: I've seen a 98% write-speed slowdown in the steady-state (full) condition versus operation with trimming in use, both in ZFS and when used as DB storage on ext4 systems).

Member

dweeezil commented Dec 13, 2015

@Stoatwblr That's a very good question. This patch relies on the metaslabs which provide a map of the used and unused regions on each vdev. There is special handling to deal with the mapping on raidz vdevs. The slog devices are true top-level vdevs with metaslabs so they will be trimmed (at least I think they ought to be). The l2arc devices, however, don't have metaslabs and also aren't true top-level vdevs so they won't be trimmed.

It does seem to me there ought to be some way to trim the l2arc when the buffers are invalidated, for example when the blocks they represent are freed (file deletion, etc.).

There's also likely a tie-in to the persistent l2arc work, which I've not followed at all. I think Nexenta may also be involved with that effort so it wouldn't surprise me if they address this issue.

In the mean time, @Stoatwblr, if untrimmed l2arc is really slowing down your system, it seems the only recourse would be to periodically remove the l2arc device(s), manually trim them (blkdiscard or whatever) and then add them again. I suppose you could even set up a scheme with several l2arc devices and then rotate which ones are subject to the remove-trim-add regimen. It's a hack but it would work.

Contributor

dswartz commented Dec 13, 2015

Tim, back when I was using an 840 Pro for ZIL I had this exact same issue. Sync writes would plummet...


@dweeezil: I was afraid you'd say that. It's probably a better option to simply replace the devices.

In terms of quantifying how bad Samsung 840s can get: on an untrimmed MD-RAID1 system, writing 1.2 million entries into an 800-million-entry PostgreSQL database (Bacula file tables) would usually take over an hour. Moving the drives to a SATA controller on the same system which supported trim commands resulted in those same inserts dropping to less than 25 seconds.

Even datacentre-class drives perform much better if trimmed than not trimmed (tested on a BSD ZFS system using a 6-drive RAIDZ2 setup based on 800GB STEC 842 SSDs; inline garbage collection is good on these, but notifying the drives that blocks are unused still gives much better write results).

Persistent L2ARC would be nice but faces the same general issue after the l2arc has been written to multiples of the underlying disk capacity. Once an SSD cache becomes a bottleneck there's not much point in keeping it in service.

Contributor

RichardSharpe commented Jan 13, 2016

@skiselkov: Any progress on the TRIM changes?

Contributor

RichardSharpe commented Feb 3, 2016

So, I pulled the ZoL upstream into the ntrim branch and this stuff built with no problems, and trim-on-demand seems to work well.

I.e., zpool trim issues lots of nice UNMAP requests and so forth.

However, while I have the autotrim property set, it is not clear that it is happening.

How can I check?

Sachiru commented Feb 3, 2016

I was about to suggest creating a file, using filefrag to determine its address, then deleting it and trimming; however, I remembered that ZFS still does not have FIBMAP support.


Member

dweeezil commented Feb 3, 2016

@RichardSharpe Thanks for the update. I've taken the opportunity to refresh this to a current master. As to auto-trim mode, please see the last paragraph of my first comment in this request, and also please review my comment from November 21, 2015. The quick answer is that by default, TRIMs are effectively delayed for 32 transaction groups (at the default 5-second txg timeout), which in a low-write environment is normally about 2.5 minutes. I still plan on adding some kstats to help monitor its operation.

Member

dweeezil commented Jan 14, 2017

(Quoting @mailinglists35:) Is this still in the state described in the first comment? ("Use at your own risk. It may DESTROY YOUR DATA!")
I have a few GB worth of trimming and would like to try it out.

@mailinglists35 I've been personally using it on several systems for a long time now with autotrim=on and it works great. There are certainly others who have indicated they've been using it as well.

dweeezil added a commit to dweeezil/zfs that referenced this pull request Jan 15, 2017

Fix vdev_raidz_psize_floor()
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in #3656.

dweeezil added a commit to dweeezil/zfs that referenced this pull request Jan 15, 2017

Fix vdev_raidz_psize_floor()

dweeezil added a commit to dweeezil/zfs that referenced this pull request Jan 15, 2017

Fix vdev_raidz_psize_floor()

@dweeezil is there an easy way to apply this on master? Or in a particular order? I'm trying to apply it with git but I'm being shown errors on the files listed above as conflicts. Sorry, I'm not such a great git user...

Member

dweeezil commented Jan 16, 2017

@mailinglists35 The recent metaslab work introduced a few wrinkles. I'm working on a refresh and am hoping to get it pushed today. It's currently in dweeezil:ntrim-next. [EDIT] The patch stack ought to apply cleanly to anything earlier than 4e21fd0.

Member

kpande commented Jan 21, 2017

@dweeezil any progress on this? thank you for the work 👍

Member

dweeezil commented Jan 21, 2017

@kpande The ntrim-next branch does work properly and is based on a relatively current master; however, the new metaslab code required yet another order-of-operation change when a pool is exported, which caused one of the TRIM-related ASSERTs to be hit. I've been meaning to investigate whether that assert is still correct but haven't had a chance to look into it yet.

Earlier on this thread @behlendorf had posted:
"@dweeezil thanks for refreshing this, I'd like to get this in 0.7.0-rc3. To that end it would be helpful to get some real world feedback from anyone watching this issue and testing the latest version of the patch."
We are using an early version of this patch for trim. It has worked for us so far. @behlendorf, @dweeezil: is this patch going to be part of 0.7.0-rc4? If not, is the intent still to make it part of 0.7.0? What are the (presumably) additional things planned for the work in the ntrim-next branch? I would appreciate it if you could provide info on what you are planning. Thanks.

Owner

behlendorf commented Jan 24, 2017

@batchuravi the plan was to get this added as part of 0.7.0-rc3. That didn't happen because of upstream OpenZFS review comments regarding code overlap between this feature and the planned eager zero feature.

I felt we needed to pause briefly and consider how these two features were going to interact and overlap. We want to avoid having two very similar mechanisms. Rather than hold up the -rc3 release any longer, this PR was bumped to -rc4.

@dweeezil @skiselkov what are your thoughts on potential conflicts with the eager zero changes? The patches haven't yet been upstreamed to OpenZFS but they are available in the Delphix tree. I think we can live with resolving merge conflicts, but how would you like to proceed?

delphix/delphix-os@f08e5bc

Member

dweeezil commented Jan 30, 2017

@behlendorf It certainly appears the trim patch could leverage the new xlate vdev methods. That said, it'd likely not be too difficult to do that down the road once eager zero is upstreamed and merged to ZoL.

Member

dweeezil commented Jan 30, 2017

I suppose I could try porting the eager zero patch [EDIT] mainly for the purpose of adding the new xlate vdev methods and modifying the trim patch to use them, but that would bring the trim patch even further out of sync with the upstream PR.

dweeezil added a commit to dweeezil/zfs that referenced this pull request Jan 30, 2017

Fix vdev_raidz_psize_floor()
Member

kpande commented Jan 30, 2017

Eager zero and trim accomplish different goals.

Member

dweeezil commented Jan 30, 2017

I added some clarification to my comment regarding the possibility of porting eager zero; namely that the rationale in the context of this PR would be to try to use some of the new support it provides.

dweeezil added a commit to dweeezil/zfs that referenced this pull request Feb 1, 2017

Fix vdev_raidz_psize_floor()

UralZima commented Feb 3, 2017

Hi People.
Thank you for your awesome work! I am using ZFS on Gentoo, and I urgently need to enable the TRIM feature because my SSD is failing on some sectors.
I am new to Git, and I don't understand some issues.
I see issue #598, commit 56027a2, and the current one.
As I understand it, this is the most up-to-date thread on TRIM/DISCARD support for ZFS.

I am using zfs-0.6.5.8, but can switch to zfs-0.7.0-rc3 or zfs-9999

How do I enable these patches in Gentoo?

This thread mentions that I need a patch for spl from https://github.com/dweeezil/spl/tree/ntrim.
It is a whole tree, last updated in April, and I can't figure out how to get a patch from it and apply it to spl-9999 or spl-0.7.0-rc3.

How do I get the patch files and put them somewhere under /usr/portage to compile with TRIM support?

I need at least fstrim support, if autotrim=on is considered unstable. However, I read in the comments that many people are using it and don't have any problems.

Please help me enable TRIM support for ZFS on Gentoo.

Thank you very much!
ZFS is the only FS with a future; I use it everywhere now. You guys are doing big and awesome work.

dweeezil added a commit to dweeezil/zfs that referenced this pull request Feb 4, 2017

Fix vdev_raidz_psize_floor()

dweeezil added a commit to dweeezil/zfs that referenced this pull request Feb 5, 2017

Fix vdev_raidz_psize_floor()

I urgently need to enable the TRIM feature because my SSD is failing on some sectors

Enabling TRIM/DISCARD/UNMAP is unlikely to solve your SSD hardware issue.
You more urgently need to back up your data than to have trim in ZFS.

mailinglists35 commented Feb 6, 2017

Eager zero and trim accomplish different goals.

@kpande I don't have the illumos issue URL right now, but IIRC I read there that they want to incorporate this PR into illumos in the same patch/issue as eager zero.

dweeezil added a commit to dweeezil/zfs that referenced this pull request Feb 9, 2017

Fix vdev_raidz_psize_floor()

dweeezil added a commit to dweeezil/zfs that referenced this pull request Feb 9, 2017

Fix vdev_raidz_psize_floor()

Saso Kiselkov and others added some commits Apr 20, 2015

6363 Add UNMAP/TRIM functionality to ZFS
Ported by: Tim Chase <tim@chase2k.com>

Porting notes:
    The trim kstats are in zfs/<pool> along with the other per-pool stats.
    The kstats can be cleared by writing to the kstat file.

    Null format parameters to strftime() were replaced with "%c".

    Added vdev trace support.

    New dfl_alloc() function in the SPL is used to allocate arrays of
    dkioc_free_list_t objects since they may be large enough to require
    virtual memory.

Other changes:
    Suppressed kstat creation for pools with "$" names.

    The changes to vdev_raidz_map_alloc() have been minimized in order to allow
    more conflict-free merging with future changes (ABD).

Added the following module parameters:
    zfs_trim - Enable TRIM
    zfs_trim_min_ext_sz - Minimum size to trim
    zfs_txgs_per_trim - Transaction groups over which to batch trims
Refresh dkio.h and add dkioc_free_util.h
Update dkio.h from Nexenta's version to pick up DKIOCFREE and add their
dkioc_free_util.h header for TRIM support.
Stop autotrim tasks before freeing zios
If the autotrim tasks aren't stopped first, some will continue to run
during the "Wait for any outstanding async I/O to complete." wait code
and those which are still running will encounter bogus root zios once
they're freed.
Fix vdev_raidz_psize_floor()
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in #3656.
Don't lock rt_lock in range_tree_verify()
range_tree_verify() was the only range tree support function which
locked rt_lock whereas all the other functions required the lock to
be taken by the caller.  If the lock is taken in range_tree_verify(),
it's not possible to atomically verify a set of related range trees
(those which are likely protected by the same lock).

In the previous implementation, checking "related" trees would be done
as follows:

    range_tree_verify(tree1, offset, size);
    /* tree1's rt_lock is not taken here */
    range_tree_verify(tree2, offset, size);

The new implementation requires:

    mutex_enter(tree1->rt_lock);
    range_tree_verify(tree1, offset, size);
    range_tree_verify(tree2, offset, size);
    mutex_exit(tree1->rt_lock);

Currently, the only consumer of range_tree_verify() is
metaslab_check_free() which verifies a set of related range trees in
a metaslab.  The TRIM/DISCARD code adds an additional set of checks of
the current and previous trimsets, both of which are represented as
range trees.

metaslab_check_free() has been updated to lock ms_lock once for each
vdev's metaslab and also, for debugging builds, to verify that each
tree's rt_lock matches the metaslab's ms_lock to prove they're related.
Add "zpool trim -p" to initiate a partial trim
Some storage backends such as large thinly-provisioned SANs are very slow
for large trims.  With the "-p" option, a manual trim will skip metaslabs
for which no spacemap exists.
Add TRIM tests to the test suite
Test autotrim and manual trim for each vdev type.  Testing autotrim in
a reasonable amount of time requires lowering zfs_trim_min_ext_sz and
forcing transaction groups.
Wait for trimming to finish in metaslab_fini
The new spa_unload() code added as part of "OpenZFS 7303 - dynamic metaslab
selection" (4e21fd0) would cause in-flight trim zios to fail.  This patch
makes sure each metaslab is finished trimming before removing it during
metaslab shutdown.
Member

dweeezil commented Feb 20, 2017

The latest commit, c7654b5, is rebased on a current master. Also, I've re-worded a bunch of the documentation in the spirit of @ahrens' suggestions. This should also fix the panic upon export which started occurring due to 4e21fd0.

sempervictus commented Feb 28, 2017

Looks like we have a memory leak in the zpool trim command.
Reproducer showing different bytes are leaked when run without permissions:

#!/usr/bin/env ruby
# Repeatedly run `zpool trim` (here, without sufficient permissions) and
# hex-dump the bytes of stderr that precede the first ':' to show that
# different bytes show up on each run.
require 'open3'
poolname = ARGV[0] || "tank"
while true
  Open3.capture3("zpool trim #{poolname}").tap {|o,e,i| puts e.index(':').to_s + ' bytes: ' + e[0..(e.index(':') -1)].each_byte.map { |b| b.to_s(16) }.join }
end

This is off the current revision, on Arch Linux in a Grsec/PAX environment using --with-pic=yes.
Out of 4287 executions, 4133 produce unique data...

Member

dweeezil commented Feb 28, 2017

@sempervictus The patch in 6c9f7af should fix this.

@dweeezil: thanks, will add it into the next set. I've got this running on the current test stack and seeing some decent numbers for ZVOL performance atop an SSD with autotrim. If all goes well and it doesn't eat my data, I'll get this on some 10+-disk vdev hardware soon enough. Any specific rough edges I should be testing?

Member

kpande commented Mar 8, 2017

For me the main issue now is sending too many trim requests to the backend when it cannot reply in time; it will eventually cause I/O to the vdev to stop.

Contributor

skiselkov commented Mar 8, 2017

Hey guys, just a heads up that the upstream PR has been significantly updated.

  • Trim zios are now processed via vdev_queue.c and limited to at most 10 executing at the same time per vdev.
  • Individual trim commands are capped at 262144 sectors (or 128MB), a recommendation from FreeBSD (sketched after this comment).
  • The rate limiter in manual trimming is now much finer, working in at most 128MB increments rather than one metaslab at a time.
  • The zfs_trim tunable is now examined later, just before issuing trim commands from vdev_file/vdev_disk. This allows testing the entire pipeline up to that point.
  • I've built in changes suggested by Matt Ahrens regarding range_tree_clear and range_tree_contains.
  • The default minimum extent size to be trimmed has been reduced to 128k to catch smaller blocks in more fragmented metaslabs.

The most significant departures from what we have in-house at Nexenta are the zio queueing and the manual trim rate limiting. The remaining parts are largely conserved, and we've been running them in production for over a year now.
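To make the 128MB cap concrete, here is an illustrative splitter under stated assumptions (the constant and helper names are mine, not the PR's): 262144 sectors of 512 bytes each is 128MiB, and a larger extent would simply be issued as multiple trim commands.

/* Illustrative sketch only; names are not from the PR. */
#define	TRIM_MAX_SECTORS	262144ULL
#define	TRIM_SECTOR_SIZE	512ULL

static void
issue_trim_capped(uint64_t start, uint64_t len)
{
	uint64_t max_bytes = TRIM_MAX_SECTORS * TRIM_SECTOR_SIZE;	/* 128MiB */

	while (len > 0) {
		uint64_t chunk = MIN(len, max_bytes);
		issue_trim_cmd(start, chunk);	/* hypothetical per-command issuer */
		start += chunk;
		len -= chunk;
	}
}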

Contributor

skiselkov commented Mar 14, 2017

@dweeezil I'd really appreciate it if you could find time to drop by the OpenZFS PR for this and give it a look over: openzfs/openzfs#172
Sadly, I'm kinda short on reviewers. We want to share this with everybody, but if nobody steps up to the plate, then there's nothing we can do.

Member

dweeezil commented Mar 14, 2017

@skiselkov Thanks for the poke. I'm definitely planning on going over the OpenZFS PR and also getting this one refreshed to match.

Contributor

skiselkov commented Mar 14, 2017

@dweeezil Thanks, appreciate it a lot!

dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 19, 2017

Fix vdev_raidz_psize_floor()

dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 20, 2017

Fix vdev_raidz_psize_floor()
Member

dweeezil commented Mar 20, 2017

This PR has gotten way too long to comfortably deal with in GitHub. I've just done a complete refresh of the TRIM patch stack based on the upstream PR for OpenZFS and rebased it to a current ZoL master. Once I've done some testing of the new stack, this PR will be closed and replaced with a new one.

In the mean time, the soon-to-be-posted-PR is in dweeezil:ntrim-next-2.

@skiselkov Once I do some testing and post the new PR, I'll finally be able to start reviewing the OpenZFS PR. I've tried to keep notes on the issues I've had to deal with which might be applicable upstream.

Contributor

skiselkov commented Mar 20, 2017

@dweeezil Thank you, appreciate it.

dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 20, 2017

Fix vdev_raidz_psize_floor()

dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 22, 2017

Fix vdev_raidz_psize_floor()

dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 22, 2017

Fix vdev_raidz_psize_floor()

dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 23, 2017

Fix vdev_raidz_psize_floor()

dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 24, 2017

Fix vdev_raidz_psize_floor()

@dweeezil dweeezil referenced this pull request Mar 25, 2017

Open

OpenZFS - 6363 Add UNMAP/TRIM functionality #5925

Member

dweeezil commented Mar 25, 2017

Replaced with #5925.

@dweeezil dweeezil closed this Mar 25, 2017

tuomari added a commit to tuomari/zfs that referenced this pull request Mar 30, 2017

Fix vdev_raidz_psize_floor()

dweeezil added a commit to dweeezil/zfs that referenced this pull request Apr 2, 2017

Fix vdev_raidz_psize_floor()

dweeezil added a commit to dweeezil/zfs that referenced this pull request Apr 2, 2017

Fix vdev_raidz_psize_floor()

dweeezil added a commit to dweeezil/zfs that referenced this pull request Apr 7, 2017

Fix vdev_raidz_psize_floor()

@behlendorf behlendorf removed this from In Progress in 0.7.0-rc4 Apr 8, 2017

dweeezil added a commit to dweeezil/zfs that referenced this pull request Apr 8, 2017

Fix vdev_raidz_psize_floor()

dweeezil added a commit to dweeezil/zfs that referenced this pull request Apr 14, 2017

Fix vdev_raidz_psize_floor()