Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
TRIM/Discard support from Nexenta #3656
Conversation
dweeezil
changed the title from
WIP - TIM/Discard support from Nexenta
to
WIP - TRIM/Discard support from Nexenta
Aug 2, 2015
|
Hi @dweeezil FYI: zpool trim rpool triggers the following warning [ 61.844048] Large kmem_alloc(101976, 0x1000), please file an issue at: |
ryao
reviewed
Aug 7, 2015
| + return (sub_pio); | ||
| + | ||
| + num_exts = avl_numnodes(&tree->rt_root); | ||
| + dfl = kmem_zalloc(DFL_SZ(num_exts), KM_SLEEP); |
ryao
Aug 7, 2015
Member
We probably need to change this to vmem_zalloc() to address the issue that @edillmann reported due to the way that we implemented kmem_zalloc(). It should be ifdefed to Linux because the O3X port that will likely merge this code is using the real kmem_zalloc().
dweeezil
Aug 8, 2015
Member
I think I'm going to rework zio_trim() as well as the other consumers of the dkioc free lists to work with a linked list rather than an array. This would allow us to avoid the large allocations but would cause the code to diverge a bit from the upstream. Another benefit is that we'd avoid the double allocation and copy which typically occurs in zio_trim() when zfs_trim_min_ext_sz is set to a large value.
dweeezil
referenced this pull request
Aug 21, 2015
Closed
[buildbot, cherry-pick to master] Illumos #4950 files sometimes can't be removed from a full filesystem (#2784) #3629
sempervictus
commented
Sep 9, 2015
|
@dweeezil: sorry to be pain, but curious to know status on this - we've got a few SSD-only pools to play with for a few days before we stuff 'em into prod (making sure our hardware doesnt screw us), so we can do a bit of testing on this without losing production data if you happen to have some test paths for us to run through. Thanks |
|
@sempervictus I'm actively looking for feedback. The patch does need to be refreshed against a current master codebase which I'll try to do today. There's a bit of interference with the recent zvol improvements. In my own testing, the patch does appear to work properly, although the behavior of the TRIM "batching" needs a bit better documentation and, possibly, slightly different implementation (IIRC, one or both of the parameters only has certain effect upon module load and/or pool import). I'd also like to add some kstats to help monitor its behavior. I've used the on-demand TRIM quite a bit and it seems to work perfectly. You can TRIM a pool with There's also a backport to 0.6.4.2 in a branch named "ntrim-0.6.4.2" ("ntrim-0.6.4.1" for SPL). |
sempervictus
commented
Sep 10, 2015
|
Soon as this is updated to reflect changes in master we'll add it to our stack. One potential caveat is that we generally utilize dm-crypt with the discard option at mount time. Any thoughts on potential side effects from this? Has this sort of setup been tested in any way? |
dweeezil
changed the title from
WIP - TRIM/Discard support from Nexenta
to
TRIM/Discard support from Nexenta
Sep 14, 2015
dweeezil
referenced this pull request
Sep 14, 2015
Closed
TRIM/UNMAP/DISCARD support for vdevs (2) #1016
|
Hi @dweeezil , Just to let you know I have been running this pull rq since it was released, and beside the kmem_alloc warning, I did not see any problem or corruption on my test zpool (dual ssd mirror). The system has been crunching video camera recording for 2 months :-) Is there any hope of having it rebased on master ? |
sempervictus
commented
Sep 14, 2015
|
Tried to throw this into our stack today and noticed it has some conflicts with ABD in the raidz code. Rumor has it that should be merged "soon after the 0.6.5 tag" so i'm hoping by next rebase it'll be in there (nudge @behlendorf ) :). |
|
@dweeezil I didn't see it was already rebased, thanks. |
|
So SATA Trim is currently not supported, according to the comments in the source or is this handled by SPL properly? |
|
@Mic92 SATA TRIM works just fine and I've tested it plenty. The documentation is still the original from Illumos. TRIM will work on any block device vdev supporting BLKDISCARD or any file vdev on which the containing filesystem supports fallocate hole punching. |
FransUrbo
referenced this pull request
Oct 18, 2015
Closed
Build bots can't handle dependencies #3935
greg-hydrogen
commented
Oct 18, 2015
|
@dweeezil - I would love to test this patch but it conflicts with the ABD branch (pull 3441) in vdev_raidz.c I have the .rej file from both sides (patched abd first then this patch, and then patched this patch first then abd) if that helps at all much appreciated |
|
Hey guys, I wanted to get your take on the latest submission on this that we're trying to get upstreamed from Nexenta. I'd primarily like to make bottom end of the ZFS portion more accommodating to Linux & FreeBSD. |
ryao
referenced this pull request
Oct 29, 2015
Open
0.6.5.3 PANIC at zfs_vfsops.c:426:zfs_space_delta_cb() #3968
|
@greg-hydrogen I tried transplanting the relevant commits on to ABD awhile ago and other than the bio argument issues, the other main conflict is the logging I added to discards on zvols. The vdev conflicts you likely ran into are pretty easy to fix. I'll try to get an ABD-based version of this working within the next few days. @skiselkov I'll check it out. It looks to be a port of the same Nexenta code in this pull request, correct? |
|
@dweeezil It is indeed, with some minor updates & fixes. |
vaLski
commented
Nov 10, 2015
|
Can't compile it on CentOS 6.7. Attempted to install spl-0.6.5.3 from github Used the following part "If, instead you would like to use the GIT version, use the following commands instead:" from http://zfsonlinux.org/generic-rpm.html On the last step make rpm-utils rpm-dkms It fails with: Preparing... ########################################### [100%] Deleting module version: 0.6.5 completely from the DKMS tree.Done. Log says CC [M] /var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.o strace shows [pid 20344] open("include/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory) Files are at find / -name dkioc_free_util.h Symlinked one of the "search" locations to dkioc_free_util.h but it started throwing other errors: In file included from /var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:36: Attempted to solve those but to no avil. It seems that dkioc_free_util.h is added by the "ntrim" branch as the dfl_free function is referenced in module/zfs/zio.c, module/zfs/vdev_raidz.c where trim is mentioned. I really hope that this is the correct place to post this issue as it is directly related with this merge request. Let me know if there is anything else I can assist with. |
|
@vaLski I suspect you need the corresponding spl patch from https://github.com/dweeezil/spl/tree/ntrim which I just rebased to master as dweeezil/spl@305d417. |
vaLski
commented
Nov 10, 2015
|
@dweeezil sorry for overlooking this - you are right. Now everything is building just as expected. I will later report how the patch works for me. Initial testing seems fine. |
|
So, I have built this and we are testing it, but I encountered a problem. If I run zpool trim on the same zpool twice things seem to get stuck in SPL and things don't work any more. What can I do to debug this? |
|
@RichardSharpe A good start would be to get some stack traces from the blocked processes. Generally they'll show up in your syslog after several minutes. Otherwise, typically "echo b > /proc/sysrq-trigger" will cause them to be generated. As to the trigger for your problem, was the first trim still running when you ran the second one? This type of hang sounds suspiciously like a taskq dispatch problem. Hopefully the stack traces from the blocked processes will shed some light on it. |
tomposmiko
commented
Nov 21, 2015
|
|
|
I could tests this on a SSD Raid. Would this actually make a difference implementation wise? |
|
@tomposmiko Blush, I meant @Mic92 There is differing support code for the different types of vdevs. In particular, raidz has its own particular support. I have personally tested with raids, mirrors, stripes and file vdevs and all seemed to be properly trimmed. As a reminder to anyone working with this, there are some new module parameters involved (as well as a new pool property which is also in the relevant man page):
and
and
It's been awhile but IIRC, the latter 2 really ought to be set during module load time (with either kernel command-line parameters or with modprobe arguments after booting). During my testing, after enabling the first one, of course, I generally set Also, IIRC, |
|
I'll push a refresh shortly to fix the compile problem with debug builds. |
|
As of a2cd68b, this ought to build properly for debug builds. We'll see what the buildbots think of it. The corresponding spl branch has also been rebased to a current master. |
|
OK, looking at messages I have the stack traces: Looks like they are starting to repeat now. |
|
@RichardSharpe At first glance, that looks like the condition which zfsonlinux/spl@a64e557 was designed to address. If you're using dynamic taskqs and have zfsonlinux/spl@f5f2b87, you might try setting |
|
OK, I have a64e557 in the repos I used, I will try setting spl_taskq_thread_sequential=0 or should I just set spl_taskq_thread_dynamic=0? |
|
@RichardSharpe If you don't have zfsonlinux/spl@f5f2b87, use |
|
Heh, Just tried 'zpool trim ...' and got this: Message from syslogd@NTNX-10-5-40-237-A-NVM at Nov 22 20:07:59 ... Message from syslogd@NTNX-10-5-40-237-A-NVM at Nov 22 20:07:59 ... |
|
Seems to be here. Does that mean corruption of ZFS? I had just done a test where I copied on some 18GB, then deleted it then did a zpool trim. |
|
@RichardSharpe I wasn't real happy about the VERIFY in https://github.com/dweeezil/zfs/blob/ntrim/module/zfs/spa.c#L6808 from the moment I saw it. There's only one other place where we use As to the details, how many cores/threads does your system have? What's the pool layout (zpool status). This is not some sort of fatal corruption but is instead happening most likely because you've got more top-level vdevs in your pool than the system has available threads (which is how many are given to the system taskq). The more I think about, the more I think it would be a good idea to make a new taskq specifically for this purpose. I'm also thinking it might be worthwhile to add a per-pool stat regarding the last on-demand trim which could be examined with In the mean time, the following patch ought to work around the problem:
The ultimate effect is that it might take a few |
|
zpool status
errors: No known data errors Four cores. 12GB of memory. It's a VM. |
|
I have seen this once before. To be honest, we are building our UNMAP implementation as well, and that had a bug. At first it was responding to all UNMAP requests with ILLEGAL REQUEST/,,, but we fixed that. Could that have corrupted things? |
|
@RichardSharpe 4 top-level vdev stripes plus 1 log device = 5 actual top-level vdevs. On-demand trim will try to dispatch all 5 to the system taskq which is only configured to run 4 concurrent threads (based on your VM guest's core count). If you can bump the cores to the guest, the error ought to go away. |
|
@dweeezil You are absolutely correct on VERIFY on taskq_dispatch with TQ_NOQUEUE - we had another bit of code that had that in and removed it. I guess I overlooked this bit. I will soon push an update to the illumos feature review which will fix this and also place manual trims into a separate taskq so as not to cannibalize the system_taskq. |
|
So, I increased the number of vCPUs to 8 but it looks like I can't win. My latest try of copy a bunch of data on, delete it and then zpool trim then copy some more data hit this: [ 2640.503074] INFO: task spl_system_task:3602 blocked for more than 120 seconds. Maybe I need to build a new ZPool and start again. |
|
A question. I notice that up above it says that you have to set the zfs_trim parameter to 1 at module load time, but when I checked /sys/modules/zfs/parameters/zfs_trim it was 1. Was that because I set the ZPOOL parameter autotrim? |
|
So, I started again. This time with four vdevs only and 8 CPUs. Then I copied about 31GB onto the ZFS file system and then deleted it and then waited 10 or so seconds and issued zpool trim, but hit that panic: zpool trim specialMessage from syslogd@NTNX-10-5-40-237-A-NVM at Nov 23 09:03:41 ... Message from syslogd@NTNX-10-5-40-237-A-NVM at Nov 23 09:03:41 ... My zpol looks like: x# zpool status -v
errors: No known data errors I'm thinking of hacking that ASSERT out and trying again. |
|
It seems like I could change TQ_NOQUEUE to TQ_QUEUE to prevent that panic. Does that seem correct? |
|
@RichardSharpe It will get rid of this panic, but I'm preparing a more complex fix that in addition to what you describe places on-demand trim into the same taskq as autotrim. |
|
OK, remove TQ_NOQUEUE :-) |
|
After removing TQ_NOQUEUE, I get the following 'hang' after creating a new ZPool, creating a fs in it., copying some 34TB of data to it, deleting the data after 10 minutes, waiting 120 seconds and then doing a zpool trim: [25800.519092] INFO: task spl_system_task:3512 blocked for more than 120 seconds. |
|
@RichardSharpe I suspect your recent hang was unrelated and caused by some recent taskq issues which were fixed in master. @dweeezil it would be great if you could rebase this on master and force update the branch. |
Stoatwblr
commented
Dec 12, 2015
|
@dweezil: Will this also trim l2arc/slog or just the vdev devices? I'm asking because l2arc in particular get a lot of writes and some SSDs get themselves into a pretty bad steady-state mode if not given trim commands (eg: Samsung 840pro - I've seen 98% write speed slowdown in steady-state(full) over operation with trimming in use in ZFS and when used as DB storage on ext4 systems) |
|
@Stoatwblr That's a very good question. This patch relies on the metaslabs which provide a map of the used and unused regions on each vdev. There is special handling to deal with the mapping on raidz vdevs. The slog devices are true top-level vdevs with metaslabs so they will be trimmed (at least I think they ought to be). The l2arc devices, however, don't have metaslabs and also aren't true top-level vdevs so they won't be trimmed. It does seem to me there ought to be some way to trim the l2arc when the buffers are invalidated, for example when the blocks they represent are freed (file deletion, etc.). There's also likely a tie-in to the persistent l2arc work, which I've not followed at all. I think Nexenta may also be involved with that effort so it wouldn't surprise me if they address this issue. In the mean time, @Stoatwblr, if untrimmed l2arc is really slowing down your system, it seems the only recourse would be to periodically remove the l2arc device(s) and manually trim them (blkdiscard or whatever) and then add them again. I suppose you could even set up a scheme with several l2arc devices and the rotate which ones are subject to the remove-trim-add regimen. It's a hack but it would work. |
|
Tim back when I was using an 840 pro for ZIL I had this exact same issue. Sync writes would plummet... Tim Chase notifications@github.com wrote:
|
Stoatwblr
commented
Dec 14, 2015
|
@dweezil: I was afraid you'd say that. It's probably a better option to simply replace the devices. In terms of quantifying how bad Samsung 840s can get: on an untrimmed MD-raid1 system, writing 1.2 million entries into a 800 million entry Postgresql database (Bacula file tables) would usually take over an hour. Moving the drives to a SATA controller on the same system which supported trim commands resulted in those same inserts dropping to less than 25 seconds. Even datacentre-class drives perform much better if trimmed than not trimmed (tested on a BSD ZFS system using a 6-drive RAIDz2 setup based on 800GB STEC 842 SSDs - inline garbage collection is good on these but notifying drives that blocks are unused still gives much better write results) Persistent ARC would be nice but faces the same general issue after the l2arc has been written to multiples of the underlaying disk capacity. Once a SSD cache becomes a bottleneck there's not much point in keeping it in service. |
kernelOfTruth
referenced this pull request
Dec 23, 2015
Closed
Illumos 6367 spa_config_tryenter incorrectly handles the multiple-loc… #4138
|
@skiselkov: Any progress on the TRIM changes? |
dweeezil
referenced this pull request
Jan 25, 2016
Closed
Log zvol truncate/discard operations #4271
|
So, I pulled the ZoL upstream into the ntrim branch and this stuff built no problems and trim-on-demand seems to work well. Ie, zpool trim issues lots of nice UNMAP requests and so forth. However, while I have the auto-trim parameter set, it is not clear that that is happening. How can I check? |
Sachiru
commented
Feb 3, 2016
|
I was about to suggest creating a file, using filefrag to determine its There are 10 kinds of people in the world; those who can read binary and On Wed, Feb 3, 2016 at 11:38 AM, Richard Sharpe notifications@github.com
|
|
@RichardSharpe Thanks for the update. I've taken the opportunity to refresh this to a current master. As to auto-trim mode, please see the last paragraph of my first comment in this request. Also please review my comment from November 21, 2015. The quick answer is that by default, TRIMs are effectively delayed for 32 transaction groups which in a low-write environment is normally about 2.5 minutes. I still plan on adding some kstats to help monitor its operation. |
@mailinglists35 I've been personally using it on several systems for a long time now with autotrim=on and it works great. There are certainly others who have indicated they've been using it as well. |
added a commit
to dweeezil/zfs
that referenced
this pull request
Jan 15, 2017
added a commit
to dweeezil/zfs
that referenced
this pull request
Jan 15, 2017
added a commit
to dweeezil/zfs
that referenced
this pull request
Jan 15, 2017
mailinglists35
commented
Jan 16, 2017
|
@dweeezil is there an easy way to apply this on master? or a particular order? I'm trying to apply on git but I'm being shown errors on the files listed above as conflicts. Sorry I'm not such a great git user... |
|
@mailinglists35 The recent metaslab work introduced a few wrinkles. I'm working on a refresh and am hoping to get it pushed today. It's currently in dweeezil:ntrim-next. [EDIT] |
|
@dweeezil any progress on this? thank you for the work |
|
@kpande The ntrim-next branch does work properly and is based on a relatively current master, however, the new metaslab code required yet another order-of-operation change when a pool is exported which caused one of the TRIM-related ASSERTs to be hit. I've been meaning to investigate whether that assert is correct anymore but haven't had a change to look into it yet. |
batchuravi
commented
Jan 23, 2017
|
Earlier on this thread @behlendorf had posted: |
|
@batchuravi the plan was to get this added as part of 0.7.0-rc3. That didn't happen because of upstream OpenZFS review comments regarding code overlap between this feature and the planned eager zero feature. I felt we needed to pause briefly and consider how these two features we're going to interact and overlap. We want to avoid having two very similar mechanisms. Rather than hold up the -rc3 release any longer this PR was bumped to -rc4. @dweeezil @skiselkov what are your thoughts on potential conflicts with the eager zero changes. The patches haven't yet been upstreamed to OpenZFS but they are available in the Delphix tree. I think we can live with resolving merge conflicts, but how would you like to proceed. |
|
@behlendorf It certainly appears the trim patch could leverage the new xlate vdev methods. That said, it'd likely not be too difficult to do that down the road once eager zero is upstreamed and merged to ZoL. |
|
I suppose I could try porting the eager zero patch [EDIT] mainly for the purpose of adding the new xlate vdev methods and modifying the trim patch to use them, but that would bring the trim patch even that much more out-of-sync with the upstream PR. |
added a commit
to dweeezil/zfs
that referenced
this pull request
Jan 30, 2017
|
eager zero and trim accomplish different goals. |
|
I added some clarification to my comment regarding the possibility of porting eager zero; namely that the rationale in the context of this PR would be to try to use some of the new support it provides. |
added a commit
to dweeezil/zfs
that referenced
this pull request
Feb 1, 2017
UralZima
commented
Feb 3, 2017
|
Hi People. I am using zfs-0.6.5.8, but can switch to zfs-0.7.0-rc3 or zfs-9999 How to enable these patches in gentoo? In this thread mentioned, that I need a patch for spl from https://github.com/dweeezil/spl/tree/ntrim. How to get patch files and put it somewhere to /usr/portage to compile with trim support? I need at least fstrim support, if autotrim=on is considered unstable. However, I read in comments that many people using it and don't have any problems. Please help me enabling TRIM support for ZFS in Gentoo. Thank you very much! |
added a commit
to dweeezil/zfs
that referenced
this pull request
Feb 4, 2017
added a commit
to dweeezil/zfs
that referenced
this pull request
Feb 5, 2017
mailinglists35
commented
Feb 6, 2017
enabling TRIM/DISCARD/UNMAP will probably unlikely solve your ssd hardware issue. |
mailinglists35
commented
Feb 6, 2017
•
@kpande I don't have the illumos issue url right now but IIRC I have read there they want to incorporate this PR in illumos in the same patch/issue with eager zero |
added a commit
to dweeezil/zfs
that referenced
this pull request
Feb 9, 2017
added a commit
to dweeezil/zfs
that referenced
this pull request
Feb 9, 2017
and others
added some commits
Apr 20, 2015
sempervictus
commented
Feb 28, 2017
•
|
Looks like we have a memory leak in the zpool trim command.
This is off the current revision, on Arch Linux in a Grsec/PAX environment using --with-pic=yes. |
|
@sempervictus The patch in 6c9f7af should fix this. |
sempervictus
commented
Feb 28, 2017
|
@dweeezil: thanks, will add in to next set. I've got this running on the current test stack and seeing some decent numbers for ZVOL performance atop and SSD with autotrim. If all goes well and it doesnt eat my data, i'll get this on some 10+-disk VDEV hardware soon enough. Any specific rough edges i should be testing? |
|
for me the main issue is now sending too many trim requests to the backend when it can not reply in time. it will eventually cause I/O to the vdev to stop. |
|
Hey guys, just a heads up that the upstream PR has been significantly updated.
The most significant departure from what we have in-house at Nexenta is the zio queueing and the manual trim rate limiting. The remaining parts are largely conserved and we've been running them in production for over a year now. |
|
@dweeezil I'd really appreciate it if you could find time to drop by the OpenZFS PR for this and give it a look over: openzfs/openzfs#172 |
|
@skiselkov Thanks for the poke. I'm definitely planning on going over the OpenZFS PR and also getting this one refreshed to match. |
|
@dweeezil Thanks, appreciate it a lot! |
added a commit
to dweeezil/zfs
that referenced
this pull request
Mar 19, 2017
added a commit
to dweeezil/zfs
that referenced
this pull request
Mar 20, 2017
|
This PR has gotten way too long to comfortably deal with in Github. I've just done a complete refresh of the TRIM patch stack based on the upstream PR for OpenZFS and rebased to a current ZoL master. Once I've done some testing of the new stack, this PR will be closed and will be replaced with a new one. In the mean time, the soon-to-be-posted-PR is in dweeezil:ntrim-next-2. @skiselkov Once I do some testing and post the new PR, I'll finally be able to start reviewing the OpenZFS PR. I tried to keep as many notes as I can as to some of the issues I've had to deal with which might be applicable upstream. |
|
@dweeezil Thank you, appreciate it. |
added a commit
to dweeezil/zfs
that referenced
this pull request
Mar 20, 2017
added a commit
to dweeezil/zfs
that referenced
this pull request
Mar 22, 2017
added a commit
to dweeezil/zfs
that referenced
this pull request
Mar 22, 2017
added a commit
to dweeezil/zfs
that referenced
this pull request
Mar 23, 2017
added a commit
to dweeezil/zfs
that referenced
this pull request
Mar 24, 2017
dweeezil
referenced this pull request
Mar 25, 2017
Open
OpenZFS - 6363 Add UNMAP/TRIM functionality #5925
|
Replaced with #5925. |
dweeezil commentedAug 2, 2015
This patch stack includes Nextenta's support for TRIM/Discard on disk and file vdevs as well as an update to the dkio headers for appropriate Solaris compatibility. It requires the current https://github.com/dweeezil/spl/tree/ntrim patch in order to compile properly.
The usual disclaimers apply at this point: I've performed moderate testing with ext4-backed file vdevs and light testing with SSD-backed disk vdevs and it appears to work properly. Use at your own risk. It may DESTROY YOUR DATA! I'm posting the pull request because it seems to work during initial testing and I'd like the buildbots to get a chance at it (which I'm expecting to fail unless they use the corresponding SPL code).
The initial TRIM support (currently in commit 719301c) caused frequent deadlocks in ztest due to the SCL_ALL spa locking during the trim operations. The follow-on patch to support on-demand trim changed the locking scheme and I'm no longer seeing deadlocks with either ztest or normal operation.
The final last commit (currently 9e5cfd7) adds ZIL logging for zvol trim operations. This code was mostly borrowed from an older Nexenta patch (referenced in the commit log) and has been merged into the existing zvol trim function.
In order to enable the feature, you must use
zpool set autotrim=onon the pool and thezfs_trimmodule parameter must be set to 1 (which is its default value). Thezfs_trimparameter controls the lower-level vdev trimming whereas the pool property controls it at a higher level. By default, trims are batched and only applied every 32 transaction groups as controlled by the newzfs_txgs_per_trimparameter. This allows forzpool import -Tto continue to be useful. Finally, by default, only regions of at least 1MiB are trimmed as set by thezfs_trim_min_ext_szmodule parameter.