ABD: linear/scatter dual typed buffer for ARC (ver 2) #3441
Conversation
|
Will start using tonight. |
|
Hi, I gave it a test and it shows very high CPU usage in arc_adapt vs. master. Regards, |
|
@edillmann I don't quite follow your comment. Very good or very bad CPU usage? |
behlendorf
referenced this pull request
May 28, 2015
Closed
Move ARC data buffers out of vmalloc #2129
|
Hi, it shows high CPU usage in arc_adapt. The pool is built this way:
And some stats:
Any clues? Regards, |
|
Hi @edillmann |
|
I got this constantly when running ztest on a 32-bit Ubuntu 15.04 VM.
|
|
@tuxoko that's actually a long standing issue. Oddly enough it only happens on 32-bit systems and I've never spent the time to determine why. But since this patch is going to enable 32-bit systems we'll need to get to the bottom of it. |
|
I don't know if this is ABD-specific or something related to the master tree, but I ended up with my system stuck in a sync cycle. I don't know what caused it to start, but apps started hanging. Technically this is my tree, with my standard extra patches and a slight rebase to bring it closer to master HEAD: https://github.com/DeHackEd/zfs/commits/dehacked-bleedingedge if interested (commit 484f14b in case I update the tree). |
|
@edillmann |
|
@tuxoko Yes, it causes a very noticeable performance degradation: mean I/O wait climbs from 3% to 20%, and mean server load from 1 to 6. The degradation is not immediate but appears after 2 or 3 days (memory fragmentation?). For now I'm running 1aa5c0f and will wait a few days to see whether performance stays as good as it is now. |
sempervictus
commented
Jun 12, 2015
|
@tuxoko: could we bug you for a refresh against master? The lock contention patches landed, causing serious conflicts, and it would be useful to update our test environments to the final revision of that stack (we're using an old merge). Thanks as always. |
|
@tuxoko I'd like to focus on getting this patch stack merged next for master. It would be great if you could rebase it on master. I'll work my way through it this week to get you some additional review comments and hopefully we can get a few people to provide additional test coverage for the updated patch. @edillmann what, if anything, did you determine about running only up to 1aa5c0f? |
|
@behlendorf the system I'm running with 1aa5c0f is performing well (low load, low iowait); with the previous version I had problems with arc_adapt eating CPU and overall system performance getting bad (high load, high iowait). |
|
@behlendorf |
behlendorf
referenced this pull request
Jun 23, 2015
Closed
[ARM] Kernel NULL pointer dereference in arc_shrink #3517
|
Rebased to master. The last version can be accessed via my branch abd2_archive00. |
|
Excellent, just in time for updating the kernel modules. Thank you very much. |
sempervictus
commented
Jun 25, 2015
|
This appears to conflict with ef56b07. According to git blame, the conflict is at line 5848 of arc.c and looks like:
Git blame says:
So looks like the L2ARC sizing bit has made its way into master (yay), but conflicts with ABD now. |
|
@sempervictus FYI: kernelOfTruth/zfs@cc3e5e6 write_psize is superfluous, so it's not needed anyway; the only change should be
to
|
This was referenced Jun 26, 2015
|
I've had this intermittent (and fairly brief) system hang going on off and on. I don't know if it's related to master or ABD. I'm running a version based on a rebase to master (a7b10a9 with pretty easy-to-resolve conflicts) on kernel 4.1.1. The biggest issue seems to be the ARC jamming on I/O:
[<ffffffffc02a7055>] cv_wait_common+0xf5/0x130 [spl]
[<ffffffffc02a70e0>] __cv_wait+0x10/0x20 [spl]
[<ffffffffc02e3a37>] arc_get_data_buf+0x427/0x450 [zfs]
[<ffffffffc02e7040>] arc_read+0x510/0x9e0 [zfs]
[<ffffffffc02ee706>] dbuf_read+0x236/0x7b0 [zfs]
[<ffffffffc02f7804>] dmu_buf_hold_array_by_dnode+0x124/0x490 [zfs]
[<ffffffffc02f869e>] dmu_read_uio_dnode+0x3e/0xc0 [zfs]
[<ffffffffc02f87dc>] dmu_read_uio_dbuf+0x3c/0x60 [zfs]
[<ffffffffc0384a20>] zfs_read+0x140/0x410 [zfs]
[<ffffffffc039b068>] zpl_read+0xa8/0xf0 [zfs]
[<ffffffff8b13a2cf>] __vfs_read+0x2f/0xf0
[<ffffffff8b13a5f8>] vfs_read+0x98/0xe0
[<ffffffff8b13aea5>] SyS_read+0x55/0xc0
[<ffffffff8b4e90d7>] system_call_fastpath+0x12/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff
ARC stats show:
p 4 370216448
c 4 2166505216
c_min 4 33554432
c_max 4 4000000000
size 4 2524924248
hdr_size 4 120098688
data_size 4 1049088
meta_size 4 1436928000
other_size 4 912485592
...
arc_prune 4 40914
arc_meta_used 4 2523875160
arc_meta_limit 4 3000000000
arc_meta_max 4 3061794744
arc_meta_min 4 16777216
Since |
sempervictus
commented
Jul 18, 2015
|
I'm seeing something similar to @DeHackEd, though it manifests a bit differently. After a few days of runtime on a desktop system, the entire system comes to a vicious crawl with no iowait registering and system resources barely being touched. Shell commands take 30s-2m to execute, and there is nothing in dmesg to indicate an actual crash anywhere. The iSCSI hosts show no such problem, and so far the NFS systems don't either (both backing a CloudStack). All systems in the testing round do regular send/receive, and that seems to be what's killing the desktop system; I've found the workstation unresponsive a couple of times since we moved to the new ABD stack, and the last thing I did with it each time was initiate a send/recv. All hosts use L2ARC; the servers (SCST and NFS) both have mirrored SLOGs as well (in case it matters at all). The ARC meta_size looks a bit too large to me, but then again, it may be the workload.
The patch stack in testing (tsv20150706):
ZFS options in use on crashing host:
|
|
@DeHackEd |
sempervictus
commented
Jul 20, 2015
|
Thanks @kernelOfTruth, that last issue may explain some of the lag I've
|
olw2005
commented
Jul 23, 2015
|
Just as an additional data point for whom it may concern: for the past week I've been torture testing this patchset applied to HEAD@13 July, up to commit b39c22b "Translate sync zio to sync bio". Test env is ESXi iscsi --> scst --> drbd (2 nodes) --> ~8TB thin zvol (326GB used @ 1.87x lz4 compression) --> vanilla CentOS 6.6. (The zvol is a copy of one of our small production VM environments.) I don't have baseline performance for this hardware to compare against, but the above build has been rock solid (not so much as a blip in dmesg) for a week now under a 24x7 heavy I/O load, and the performance has been [subjectively] quite good on relatively old, low-spec hardware [single i7 CPU, 9GB RAM, 3x1TB SATA raidz + 64GB SSD partitioned for ZIL and L2ARC]. |
|
@tuxoko I had worked on my own set of patches that address this problem intermittently since 2013, but I had not followed your patches much after the first version this year. I just read through their latest iteration and I very much like what they are now. I have been thinking about this in the context of DirectIO. It seems to me that it would be easier to implement DirectIO on filesystems if we were to use iovecs for the scatter buffers. Doing DirectIO safely would require modifying the page tables to keep userland from modifying them, but it would allow us to translate arbitrary ranges in memory into something that could be passed through to the disk vdevs for reads and writes. We would still have a copy for compression, but we won't need to allocate additional memory in userland in the common case. Does changing this to use iovecs for scatter IO sound reasonable? |
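A minimal sketch of the iovec idea above, purely for illustration (the sgbuf_t wrapper and sgbuf_copy_in helper are hypothetical names, not part of this pull request's ABD API):

```c
/*
 * Hypothetical sketch only -- not the ABD API from this pull request.
 * It illustrates describing a scatter buffer as an array of iovecs, so
 * that pinned userland ranges (DirectIO) and kernel pages could be
 * handled through the same structure.
 */
#include <sys/uio.h>
#include <string.h>

typedef struct sgbuf {
	struct iovec	*sg_iov;	/* one entry per contiguous range */
	int		sg_cnt;		/* number of iovec entries */
	size_t		sg_size;	/* total payload size in bytes */
} sgbuf_t;

/* Copy a linear source into the scattered ranges (e.g. after compression). */
static void
sgbuf_copy_in(sgbuf_t *sg, const void *src, size_t len)
{
	const char *p = src;

	for (int i = 0; i < sg->sg_cnt && len > 0; i++) {
		size_t n = sg->sg_iov[i].iov_len < len ?
		    sg->sg_iov[i].iov_len : len;

		memcpy(sg->sg_iov[i].iov_base, p, n);
		p += n;
		len -= n;
	}
}
```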
|
@tuxoko Also, I pushed my own set of patches that were called sgbuf to a branch. The sgbuf code was in a state that compiled, but failed to run properly when I last touched it. While I think ABD v2 is a cleaner approach, there might be a few ideas in the sgbuf code that you could reuse in ABD: |
behlendorf
added this to the 0.7.0 milestone
Jul 24, 2015
behlendorf
added
Feature
Performance
labels
Jul 24, 2015
sempervictus
commented
Aug 6, 2015
|
@tuxoko: could we ask for another refresh? I'm a bit concerned that I'm missing something in the merge fixes which is causing the random crashes. Builds subsequent to 20150705 appear to hard-lock the systems in testing when they approach memory limits. I have one from 0718 which seems otherwise rock solid (the SCST hosts even seem to like it, but they always have ~30% free memory, even after iSCSI buffers). |
|
https://github.com/DeHackEd/zfs/commits/dehacked-bleedingedge2 This is my build of master+ABD which appears stable so far. Running it on the system I'm typing this into. You might be seeing the ARC size getting stuck at the minimum and slowing to a crawl. This was fixed in master and is in the above patch. |
angstymeat
commented
Aug 11, 2015
|
I just thought I would mention that I have been having lockup issues with |
sempervictus
commented
Aug 16, 2015
|
I second that, bleeding edge 2 seems stable in user and service
|
|
@tuxoko: For external DMU consumers, like the Lustre OSD, an interface is needed to obtain all the page addresses from an ABD scatter buffer. Similar to how ... Would ... thanks |
|
@don-brady Also, please note that small allocations will fall back to a linear ABD; you need to check the flags to make sure which type it is. |
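A minimal sketch of that consumer-side check (abd_is_linear and abd_to_buf follow the later upstream ABD naming and are assumptions here; consumer_map_linear and consumer_map_pages are hypothetical placeholders):

```c
/*
 * Illustrative only: a DMU consumer that wants page addresses must
 * first check whether the ABD came back linear (small allocations
 * fall back to a plain linear buffer) or scatter.
 * Accessor names are assumptions based on the later upstream ABD API.
 */
static int
consumer_map_abd(abd_t *abd, size_t size)
{
	if (abd_is_linear(abd)) {
		/* Single contiguous buffer: hand over its address. */
		return (consumer_map_linear(abd_to_buf(abd), size));
	}

	/* Scatter buffer: walk the underlying pages/chunks instead. */
	return (consumer_map_pages(abd, size));
}
```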
|
@kernelOfTruth I'll wait till #3651 is merged to master, then I'll update the main |
|
Hi @tuxoko, I will probably do that within the next few days. Thanks a lot. |
greg-hydrogen
commented
Aug 25, 2015
|
@greg-hydrogen take a look at @tuxoko 's branch abd_loop_fix |
|
Rebased to master, added support for #3651, and added one patch to use miter in abd_uiomove. |
olw2005
commented
Sep 2, 2015
|
Probably dated information by now, but we ran about 150TB of zvols through @DeHackEd's bleedingedge2 branch to good effect. Approximately two weeks of near-constant I/O load, no issues. |
|
@olw2005 could you post a few details about the workload pattern? Roughly what percentage reads vs. writes? L2ARC? ZIL? Any specific torturing? clamav / clamscan? |
olw2005
commented
Sep 2, 2015
|
@kernelOfTruth We were adding disk space to two SANs (a handful of multi-TB lz4-compressed zvols served via iscsi to VMware). I decided it would be better to recreate the pool (with a slightly modified layout) to prevent lopsided disk usage; the current pools were 70-ish% full. So we did a full backup of the pool (zfs send) out to [and then back from] a staging server. I used the bleedingedge2 branch on the staging server; h/w spec: 192GB RAM, no L2ARC or ZIL, 36 disks (then 24 disks on the second send/recv). The usage profile was roughly a week of continuous write I/O (zfs recv), followed by roughly a week of continuous read I/O (zfs send). Send/recv speeds were ~500MB/s across the wire. The load avg got a bit high (into the 20s) at times, but there were no issues sending/receiving about 155TB total. |
sempervictus
commented
Sep 12, 2015
|
Looks like we have a conflict with 4e0f33f.
The conflicting code in module/zfs/arc.c makes me think that this has already been addressed by the ABD stack, but it does conflict. |
greg-hydrogen
commented
Sep 12, 2015
|
With v0.6.5 tagged today, @behlendorf, is there any way we can get this pull request in for the next tag? I have been using this patch (and the previous iteration) on a number of machines without any issues; in fact, it has greatly added stability for my workloads. @tuxoko, any way you can refresh this yet again? I would try to fix the conflicts myself, but I am a bit scared of screwing something up. |
|
Absolutely. This is one of the first things I'd like to get finalized and merged. |
|
I rebased my bleedingedge3 tree last night. I'm testing it now. Seems okay so far, but a possible master issue has come up which I will look into. |
|
@DeHackEd could you pinpoint it somehow already? I was originally planning to update my patchset, but I might postpone it in that regard. |
|
If you mean this issue I'm tracking down: I was testing L2ARC (because that's the big difference in the latest master patches), but for some reason I'm not seeing the hit rates I was expecting. But 0.6.5 is doing the same thing, so I'm probably doing something wrong and not realizing it. I believe bleedingedge3 is fine otherwise. Edit: if I had to guess, I'd say that only metadata is being read even though all data is being written... further investigation is required, but for now this shouldn't stop you from using ABD. |
|
@DeHackEd thanks for the heads-up. Currently L2ARC is disabled here anyway, but I'm kind of missing the speedup it provides, so it's good to know that I can update to a safe(r) patch stack to re-enable it. |
|
I'm getting a failure with 'zfs send' involving large block sizes. Test case:
# zpool create testpool /dev/sdx -O recordsize=1M
# cp somebigfile /testpool/testfile
# zfs snapshot testpool@sendme
# zfs send testpool@sendme > /dev/null
And it crashes with:
VERIFY3(c < (1ULL << 24) >> 9) failed (36028797018963967 < 32768)
PANIC at zio.c:258:zio_buf_free()
Stack trace:
zio_buf_free
abd_return_buf
traverse_visitbp
traverse_visitbp
...
Sorry, transcribing the crash by hand here. The crash can be avoided by adding '-L' to the send command line. Without -L the send will break the large block into virtual 128k blocks so receivers can accept it. With -L the full 1M block size will be sent, but receivers must be able to accept it. |
|
Pushed out a new version; the last version is in https://github.com/tuxoko/zfs/tree/abd2_archive02. Changes from the last version:
|
greg-hydrogen
commented
Nov 7, 2015
|
Thanks again @tuxoko for refreshing the patch. I have installed it on a couple of machines and things seem snappier! |
This was referenced Dec 11, 2015
|
@tuxoko FYI: there are again recent upstream master changes that break ABD: 6fe5378 "Fix vdev_queue_aggregate() deadlock", specifically module/zfs/vdev_queue.c; commit tuxoko/zfs@4c97b0f "Handle abd_t in vdev*.c sans vdev_raidz.c" is affected. Edit: additional changes are introduced with 37f8a88 "Illumos 5746 - more checksumming in zfs send". A slightly more recent master rebase is at: Of particular concern is whether the changes to include "Fix vdev_queue_aggregate() deadlock" [https://github.com/zfsonlinux/zfs/commit/6fe53787f38f10956b8d375133ed4559f8ce847b] were correct and can work with ABD2. |
This was referenced Jan 7, 2016
added a commit
to kernelOfTruth/zfs
that referenced
this pull request
Jan 17, 2016
kernelOfTruth
referenced this pull request
Jan 17, 2016
Closed
[dedicated box testing] ABD2 + Illumos #4950 + Illumos #2605 (master January 16th, 2015) #4236
|
Rebased to (near) master. |
|
Update: add scatter support for raidz parity and zfs_fm |
|
Rebased to master. Added zio_{,de}compress_abd and dmu_write_abd to reduce borrow_buf usage. |
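For context, the borrow_buf pattern those helpers reduce looks roughly like the sketch below (a simplification; abd_borrow_buf_copy and abd_return_buf follow the naming seen elsewhere in this thread, and the exact signatures in this patch may differ):

```c
/*
 * Simplified illustration of the borrow_buf pattern: to run a
 * linear-only transform such as lz4 over a scatter ABD, the data is
 * copied into a borrowed contiguous buffer, transformed, and the
 * buffer is returned.  Helpers like zio_compress_abd keep this logic
 * in one place instead of at every call site.  Illustrative only.
 */
typedef size_t (*linear_compress_fn)(void *src, void *dst, size_t len);

static size_t
compress_scatter_abd(abd_t *src, void *dst, size_t size,
    linear_compress_fn compress)
{
	/* Borrow a linear copy of the scattered source data. */
	void *linear = abd_borrow_buf_copy(src, size);

	/* Run the linear-only compressor over it. */
	size_t clen = compress(linear, dst, size);

	/* Return the borrowed buffer; the source was not modified. */
	abd_return_buf(src, linear, size);

	return (clen);
}
```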
This was referenced Feb 10, 2016
ptx0
commented
Jun 8, 2016
|
@tuxoko does that mean it fully supports lz4 etc? |
|
@ptx0 |
sempervictus
commented
Jul 27, 2016
|
What do you mean by the scatter/linear copy problem? Do you mean the extra copy when compressing/decompressing? If so, then yes, since it would need a new lz4 implementation. |
sempervictus
commented
Jul 27, 2016
|
Yes, I meant the additional copy for the compression operation. Thanks for clarifying. |
This was referenced Aug 12, 2016
samuelxhu
commented
Sep 17, 2016
|
For help: does someone maintain an ABD patch for the older stable ZoL releases v0.6.4.2 and v0.6.3-1.3? I have tens of production servers that have run those versions stably for 1.5 years, and I do not want to run the risk of upgrading, but I would very much like to cherry-pick the ABD feature. How can I backport ABD to ZoL releases v0.6.4.2 and v0.6.3-1.3? Has anyone else done the same thing? |
|
Hi all, I just updated this branch. The old version can be found in an archive branch. The ABD branch from illumos #5135 is missing some stuff from my original branch, so I decided to rebase my branch to master. This update is still WIP; I'm just posting it early for testing. I'll start merging stuff like the SIMD raidz and some API changes from #5135. |
|
@samuelxhu |
|
Besides the style failures, I'm getting ztest failing to run because zdb can't open the test pools. |
tuxoko
added some commits
May 11, 2015
|
Yeah, I see that in the build bot. Strange that I didn't get it when I tested. It seems that the SIMD incremental fletcher patch does have some issue depending on the compiler or machine. I'll leave that patch out until I figure it out. |
tuxoko commented May 25, 2015
This is a refreshed version of #2129.
It is rebased after large block support and has a cleaner history.
Here's a list of possible further enhancements:
- spa_history, bpobj, zap. (It seems that zap, indirect blocks, and dnodes are limited to 16K blocks, which currently use the kernel slab. Whether using a scatter list for them is better or worse remains to be seen.)
- Scatter support for byteswap (see the sketch below).
- Scatter support for SHA256.
- Enable scatter for raidz parity.
- Scatter support for zfs_fm.
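A rough sketch of what scatter support for byteswap could look like, assuming a chunk-iterator callback in the style of the later upstream abd_iterate_func (names and signatures here are assumptions, not necessarily this patch's API):

```c
/*
 * Illustrative only: byteswap a scatter ABD chunk by chunk via a
 * callback iterator instead of borrowing a linear copy first.
 * Assumes each chunk is a multiple of 8 bytes (page-sized chunks
 * satisfy this); actual names/signatures in this patch may differ.
 */
static int
byteswap_uint64_chunk(void *buf, size_t size, void *private)
{
	uint64_t *p = buf;
	(void) private;

	for (size_t i = 0; i < size / sizeof (uint64_t); i++)
		p[i] = __builtin_bswap64(p[i]);

	return (0);
}

static void
abd_byteswap_uint64(abd_t *abd, size_t size)
{
	(void) abd_iterate_func(abd, 0, size, byteswap_uint64_chunk, NULL);
}
```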