ABD: linear/scatter dual typed buffer for ARC (ver 2) #3441

Closed
wants to merge 26 commits

Conversation

Member

tuxoko commented May 25, 2015

This is a refreshed version of #2129.
It is rebased on top of large block support and has a cleaner history.

Here's a list of possible further enhancements:

  1. Enable scatter for more metadata types: spa_history, bpobj, zap. (It seems that zap, indirect blocks, and dnodes are limited to 16K blocks, which currently use the kernel slab. Whether using a scatter list for them is better or worse remains to be seen.)
  2. Scatter support for byteswap.
  3. Scatter support for lz4 and other compression
  4. Scatter support for SHA256.
  5. Scatter support for vdev_file
  6. Scatter support for dmu_send
  7. Enable scatter for raidz parity
  8. Scatter support for raidz matrix reconstruction
  9. Scatter support for zfs_fm
Contributor

DeHackEd commented May 25, 2015

Will start using tonight.

Contributor

edillmann commented May 28, 2015

Hi,

I gave it a test and it shows very high CPU usage in arc_adapt vs. master.

Regards,
Eric

Owner

behlendorf commented May 28, 2015

@edillmann I don't quite follow your comment. Very good or very bad CPU usage?

Contributor

edillmann commented May 29, 2015

Hi,

It shows high CPU usage in arc_adapt. The pool is built this way:

zpool status
  pool: bfs
 state: ONLINE
  scan: scrub repaired 0 in 9h23m with 0 errors on Sun May  3 10:23:14 2015
config:
    NAME        STATE     READ WRITE CKSUM
    bfs         ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
FULL_DRIVE      sdc     ONLINE       0     0     0
FULL_DRIVE      sde     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
FULL_DRIVE      sdd     ONLINE       0     0     0
FULL_DRIVE      sdf     ONLINE       0     0     0
    logs
LVM/SSD   lva-zil0  ONLINE       0     0     0
    cache
LVM/SSD   zcache1   ONLINE       0     0     0

And some stats :

cat /proc/spl/kstat/zfs/arcstats
5 1 0x01 86 4128 11319918778 252428229027638
name                            type data
hits                            4    78061054
misses                          4    40955739
demand_data_hits                4    56497175
demand_data_misses              4    10841192
demand_metadata_hits            4    21498002
demand_metadata_misses          4    29548717
prefetch_data_hits              4    39808
prefetch_data_misses            4    505804
prefetch_metadata_hits          4    26069
prefetch_metadata_misses        4    60026
mru_hits                        4    29614684
mru_ghost_hits                  4    431054
mfu_hits                        4    48384144
mfu_ghost_hits                  4    681788
deleted                         4    55535257
recycle_miss                    4    25484240
mutex_miss                      4    292255
evict_skip                      4    162321851980
evict_l2_cached                 4    534329798656
evict_l2_eligible               4    771167293440
evict_l2_ineligible             4    18798989312
hash_elements                   4    1350811
hash_elements_max               4    1412083
hash_collisions                 4    15198772
hash_chains                     4    176317
hash_chain_max                  4    6
p                               4    8366558720
c                               4    8589934592
c_min                           4    4294967296
c_max                           4    8589934592
size                            4    6909182400
hdr_size                        4    56284320
data_size                       4    5069312
meta_size                       4    2260425216
other_size                      4    4160235936
anon_size                       4    17095680
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    2042542080
mru_evict_data                  4    0
mru_evict_metadata              4    1589248
mru_ghost_size                  4    0
mru_ghost_evict_data            4    0
mru_ghost_evict_metadata        4    0
mfu_size                        4    205856768
mfu_evict_data                  4    0
mfu_evict_metadata              4    16384
mfu_ghost_size                  4    0
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    0
l2_hits                         4    15391819
l2_misses                       4    20784025
l2_feeds                        4    162808
l2_rw_clash                     4    42923
l2_read_bytes                   4    226944701440
l2_write_bytes                  4    49032701440
l2_writes_sent                  4    128430
l2_writes_done                  4    128430
l2_writes_error                 4    0
l2_writes_hdr_miss              4    96
l2_evict_lock_retry             4    4
l2_evict_reading                4    0
l2_free_on_write                4    666234
l2_cdata_free_on_write          4    18037
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    12775517696
l2_asize                        4    3749617152
l2_hdr_size                     4    427167616
l2_compress_successes           4    14800470
l2_compress_zeros               4    0
l2_compress_failures            4    16272
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    2870064338
arc_meta_used                   4    6904113088
arc_meta_limit                  4    6442450944
arc_meta_max                    4    6933723168

Any clues?

Regards,
Eric

Member

tuxoko commented May 29, 2015

Hi @edillmann
Could you try running only up to 1aa5c0f?
Thanks.

Member

tuxoko commented May 29, 2015

I get this constantly when running ztest on a 32-bit Ubuntu 15.04 VM.
Does anyone know what this assertion indicates?

5 vdevs, 7 datasets, 23 threads, 300 seconds...
loading space map for vdev 0 of 1, metaslab 6 of 14 ...
ztest: ../../module/zfs/vdev.c:923: Assertion `vd->vdev_pending_fastwrite == 0 (0x80000c400 == 0x0)' failed.

Owner

behlendorf commented May 29, 2015

@tuxoko that's actually a long standing issue. Oddly enough it only happens on 32-bit systems and I've never spent the time to determine why. But since this patch is going to enable 32-bit systems we'll need to get to the bottom of it.

Contributor

DeHackEd commented May 31, 2015

I don't know if this is ABD-specific or something related to the master tree, but I ended up with my system stuck in a sync cycle:
http://pastebin.com/wunCLu1S

I don't know what caused it to start but apps started hanging. drop_caches appeared to run successfully but didn't help.

Technically this is my tree, with my standard extra patches and a slight rebase to bring it closer to master HEAD. https://github.com/DeHackEd/zfs/commits/dehacked-bleedingedge if interested (commit 484f14b in case I update the tree)

Owner

behlendorf commented Jun 1, 2015

@DeHackEd It's not you, it's very likely #3409. You can work around it by increasing zfs_arc_c_min to say 1/16 of memory. This is something we need to address.

Contributor

edillmann commented Jun 2, 2015

Hi @tuxoko

Running with ZFS up to 1aa5c0f resolves the problem with arc_adapt.
Server load dropped from 6.5 to 0.8 with the same workload.

Thanks,

Member

tuxoko commented Jun 2, 2015

@edillmann
Regarding the high CPU load of arc_adapt, does it cause stalls or performance degradation?
Could you try to use perf record -p <pid of arc_adapt> to see what's consuming the CPU?
Thanks.

Contributor

edillmann commented Jun 3, 2015

@tuxoko Yes, it causes a very noticeable performance degradation: mean I/O wait climbs from 3% to 20%, and mean server load from 1 to 6. The degradation is not immediate but appears after 2 or 3 days (memory fragmentation?). For now I'm running 1aa5c0f and will wait a few days to see if performance stays as good as it is now.

@tuxoko: could we bug you for a refresh against master? The lock contention patches hit master, causing serious conflicts, and it would be useful to update our test environments to the final revision of that stack (we're using an old merge). Thanks as always.

Owner

behlendorf commented Jun 15, 2015

@tuxoko I'd like to focus on getting this patch stack merged next for master. It would be great if you could rebase it on master. I'll work my way through it this week to get you some additional review comments and hopefully we can get a few people to provide additional test coverage for the updated patch.

@edillmann what, if anything, did you determine about running only up to 1aa5c0f?

Contributor

edillmann commented Jun 16, 2015

@behlendorf the system I'm running with 1aa5c0f is performing well (low load, low iowait); with the previous version I had problems with arc_adapt eating CPU, and overall system performance was getting bad (high load, high iowait).

Member

tuxoko commented Jun 23, 2015

@behlendorf
Sorry for the delay, a rebased version should come shortly.

Member

tuxoko commented Jun 24, 2015

Rebased to master. The last version can be accessed via my branch abd2_archive00.

Contributor

kernelOfTruth commented Jun 24, 2015

Excellent 👍

Just in time for updating the kernel modules

Thank you very much 😄

This appears to conflict with ef56b07. According to git blame, the conflict is at line 5848 of arc.c and looks like:

5848 <<<<<<< HEAD
5849         uint64_t write_asize, write_sz, headroom, buf_compress_minsz,
5850             stats_size;
5851         void *buf_data;
5852 =======
5853         uint64_t write_asize, write_psize, write_sz, headroom,
5854             buf_compress_minsz;
5855         abd_t *buf_data;
5856 >>>>>>> origin/pr/3441

Git blame says:

ef56b078 module/zfs/arc.c       (Andriy Gapon      2015-06-12 21:20:29 +0200 5800)  uint64_t write_asize, write_sz, headroom, buf_compress_minsz,
ef56b078 module/zfs/arc.c       (Andriy Gapon      2015-06-12 21:20:29 +0200 5801)      stats_size;

So it looks like the L2ARC sizing bit has made its way into master (yay), but it conflicts with ABD now.
@tuxoko: with the write_psize added into the call, how should I address this? Thanks as always.

Contributor

kernelOfTruth commented Jun 25, 2015

@sempervictus FYI: kernelOfTruth/zfs@cc3e5e6

write_psize is superfluous so it's not needed anyway

so the only change should be

void *buf_data;

to

abd_t *buf_data;
Contributor

DeHackEd commented Jul 16, 2015

I've had this intermittent (and fairly brief) system hang going on my system off and on. I don't know if it's related to master or ABD. I'm running a version based on a rebase to master (a7b10a9 with pretty easy-to-resolve conflicts) on kernel 4.1.1. The biggest issue seems to be the ARC jamming on I/O.

[<ffffffffc02a7055>] cv_wait_common+0xf5/0x130 [spl]
[<ffffffffc02a70e0>] __cv_wait+0x10/0x20 [spl]
[<ffffffffc02e3a37>] arc_get_data_buf+0x427/0x450 [zfs]
[<ffffffffc02e7040>] arc_read+0x510/0x9e0 [zfs]
[<ffffffffc02ee706>] dbuf_read+0x236/0x7b0 [zfs]
[<ffffffffc02f7804>] dmu_buf_hold_array_by_dnode+0x124/0x490 [zfs]
[<ffffffffc02f869e>] dmu_read_uio_dnode+0x3e/0xc0 [zfs]
[<ffffffffc02f87dc>] dmu_read_uio_dbuf+0x3c/0x60 [zfs]
[<ffffffffc0384a20>] zfs_read+0x140/0x410 [zfs]
[<ffffffffc039b068>] zpl_read+0xa8/0xf0 [zfs]
[<ffffffff8b13a2cf>] __vfs_read+0x2f/0xf0
[<ffffffff8b13a5f8>] vfs_read+0x98/0xe0
[<ffffffff8b13aea5>] SyS_read+0x55/0xc0
[<ffffffff8b4e90d7>] system_call_fastpath+0x12/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff

ARC stats show:

p                               4    370216448
c                               4    2166505216
c_min                           4    33554432
c_max                           4    4000000000
size                            4    2524924248
hdr_size                        4    120098688
data_size                       4    1049088
meta_size                       4    1436928000
other_size                      4    912485592
...
arc_prune                       4    40914
arc_meta_used                   4    2523875160
arc_meta_limit                  4    3000000000
arc_meta_max                    4    3061794744
arc_meta_min                    4    16777216

Since size > c and most of it is metadata, it could be jamming because it's doing an eviction pass and somehow hanging there. Operations usually unwedge in a second or so, but back-to-back I/O will grind the system to a crawl fairly quickly.

I'm seeing something similar to @DeHackEd, though manifesting a bit differently. After a few days of runtime on a desktop system, the entire system comes to a vicious crawl with no iowait registering and system resources barely being touched. Shell commands take 30s-2m to execute, and there's nothing in dmesg to indicate an actual crash anywhere.

iSCSI hosts are showing no such problem, and so far the NFS systems aren't either (both backing a CloudStack). All systems in the testing round do regular send/receive, though that seems to be what's killing the desktop system - I've found the workstation unresponsive a couple of times since we moved to the new ABD stack, and the last thing I did with it each time was to initiate a send/recv. All hosts use L2ARC, and the servers (SCST and NFS) both have mirrored SLOGs as well (in case it matters at all).

The ARC meta_size looks to me to be a bit too large, but then again, it may be the workload.
I managed to pull an arcstats dump before the whole thing stopped responding altogether this time:

5 1 0x01 89 4272 29154622658 93491015770290
name                            type data
hits                            4    12820953
misses                          4    3298382
demand_data_hits                4    5675931
demand_data_misses              4    422482
demand_metadata_hits            4    6408370
demand_metadata_misses          4    2152548
prefetch_data_hits              4    48759
prefetch_data_misses            4    504710
prefetch_metadata_hits          4    687894
prefetch_metadata_misses        4    218642
mru_hits                        4    4971539
mru_ghost_hits                  4    452869
mfu_hits                        4    7112766
mfu_ghost_hits                  4    183086
deleted                         4    880583
mutex_miss                      4    2411
evict_skip                      4    892990
evict_not_enough                4    131000
evict_l2_cached                 4    2469370880
evict_l2_eligible               4    46359740928
evict_l2_ineligible             4    17156611072
evict_l2_skip                   4    0
hash_elements                   4    590800
hash_elements_max               4    713425
hash_collisions                 4    566822
hash_chains                     4    29153
hash_chain_max                  4    4
p                               4    2403761164
c                               4    7545637400
c_min                           4    1073741824
c_max                           4    10737418240
size                            4    7635005488
hdr_size                        4    106172160
data_size                       4    1056768
meta_size                       4    2336360448
other_size                      4    5133050608
anon_size                       4    4887040
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    1722641408
mru_evict_data                  4    0
mru_evict_metadata              4    0
mru_ghost_size                  4    1429168640
mru_ghost_evict_data            4    1049606144
mru_ghost_evict_metadata        4    379562496
mfu_size                        4    609888768
mfu_evict_data                  4    0
mfu_evict_metadata              4    0
mfu_ghost_size                  4    2018373120
mfu_ghost_evict_data            4    1666403328
mfu_ghost_evict_metadata        4    351946240
l2_hits                         4    908
l2_misses                       4    2942626
l2_feeds                        4    90480
l2_rw_clash                     4    0
l2_read_bytes                   4    738304
l2_write_bytes                  4    2265349120
l2_writes_sent                  4    631
l2_writes_done                  4    631
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    349
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    6
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    2467729920
l2_asize                        4    2265340928
l2_hdr_size                     4    58365504
l2_compress_successes           4    41221
l2_compress_zeros               4    0
l2_compress_failures            4    256942
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    1
memory_indirect_count           4    47273
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    7633948720
arc_meta_limit                  4    8053063680
arc_meta_max                    4    7639335560
arc_meta_min                    4    536870912

The patch stack in testing (tsv20150706):

  * origin/pr/3526
  ** Change default cachefile property to 'none'.
  * origin/pr/3576
  ** Fix Xen Virtual Block Device detection
  * origin/pr/3575
  ** Failure of userland copy should return EFAULT
  * origin/pr/3574
  ** 5745 zfs set allows only one dataset property to be set at a time
  * origin/pr/3572
  ** 5813 zfs_setprop_error(): Handle errno value E2BIG.
  * origin/pr/3571
  ** 5661 ZFS: "compression = on" should use lz4 if feature is enabled
  * origin/pr/3569
  ** 5427 memory leak in libzfs when doing rollback
  * origin/pr/3567
  ** 5118 When verifying or creating a storage pool, error messages only show one device
  * origin/pr/3566
  ** 4966 zpool list iterator does not update output
  * origin/pr/3565
  ** 4745 fix AVL code misspellings
  * origin/pr/3563
  ** 4626 libzfs memleak in zpool_in_use()
  * origin/pr/3562
  ** 1765 assert triggered in libzfs_import.c trying to import pool name beginning with a number
  * origin/pr/2784
  ** Illumos #4950 files sometimes can't be removed from a full filesystem
  * origin/pr/3166
  ** Make linking with and finding libblkid required
  * origin/pr/3529
  ** Translate zio requests with ZIO_PRIORITY_SYNC_READ and
  * origin/pr/3557
  ** Prevent reclaim in metaslab preload threads
  * origin/pr/3555
  ** Illumos 5008 lock contention (rrw_exit) while running a read only load
  * origin/pr/3554
  ** 5911 ZFS "hangs" while deleting file
  * origin/pr/3553
  ** 5981 Deadlock in dmu_objset_find_dp
  * origin/pr/3552
  ** Illumos 5946, 5945
  * origin/pr/3551
  ** Illumos 5870 - dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch
  * origin/pr/3550
  ** Illumos 5909 - ensure that shared snap names don't become too long after promotion
  * origin/pr/3549
  ** Illumos 5912 - full stream can not be force-received into a dataset if it has a snapshot
  * origin/pr/2012
  ** Add option to zpool status to print guids
  * origin/pr/3169
  ** Add dfree_zfs for changing how Samba reports space
  * origin/pr/3441
  ** Add abd version byteswap functions
  * master @ a7b10a931911d3a98a90965795daad031c6d33a2

ZFS options in use on crashing host:

# ZFS ARC cache boundaries from 1-10G
options zfs zfs_arc_min=1073741824
options zfs zfs_arc_max=10737418240
#
# Write throttle
options zfs zfs_vdev_async_write_max_active=32
options zfs zfs_vdev_sync_write_max_active=32
#
# Read throttle
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_max_active=12

@sempervictus referenced this pull request Jul 18, 2015

Closed

ARC Sync Up #3533

Member

tuxoko commented Jul 20, 2015

@DeHackEd
I'm guessing that is not directly related to ABD.
Blocking when arc_size > arc_c seems too strong a thing to do.
If you remove the cv_wait in arc_get_data_buf, would the system behave normally?

Contributor

kernelOfTruth commented Jul 20, 2015

@tuxoko , @DeHackEd there's a related newly created issue: #3616

Thanks @kernelOfTruth, that last issue may explain some of the lag I've seen on contended systems recently.

olw2005 commented Jul 23, 2015

Just as an additional datapoint to whom it may concern.

For the past week I've been torture-testing this patchset applied to HEAD@13 July, up to commit b39c22b "Translate sync zio to sync bio". The test env is ESXi iSCSI --> SCST --> DRBD (2 nodes) --> ~8TB thin zvol (326GB used @ 1.87x lz4 compression) --> vanilla CentOS 6.6. (The zvol is a copy of one of our small production VM environments.)

I don't have baseline performance for this hardware to compare, but the above build has been rock solid (not so much as a blip in dmesg) for a week now under a 24x7 heavy I/O load, and the performance has been [subjectively] quite good on relatively old, low-spec hardware [single i7 CPU, 9GB RAM, 3x1TB SATA raidz + 64GB SSD partitioned for ZIL and L2ARC].

Member

ryao commented Jul 24, 2015

@tuxoko I had worked on my own set of patches that address this problem intermittently since 2013, but I had not followed your patches much after the first version this year. I just read through their latest iteration and I very much like what they are now.

I have been thinking about this in the context of DirectIO. It seems to me that it would be easier to implement DirectIO on filesystems if we were to use iovecs for the scatter buffers. Doing DirectIO safely would require modifying the page tables to keep userland from modifying them, but it would allow us to translate arbitrary ranges in memory into something that could be passed through to the disk vdevs for reads and writes. We would still have a copy for compression, but we won't need to allocate additional memory in userland in the common case. Does changing this to use iovecs for scatter IO sound reasonable?
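
(For illustration, an iovec-backed scatter descriptor along the lines being discussed might look like the sketch below. The names iov_abd_t and iab_* are hypothetical and not part of this patch set or any proposed API.)

#include <sys/uio.h>    /* struct iovec */
#include <stdint.h>

/* Hypothetical iovec-backed scatter buffer; names are illustrative only. */
typedef struct iov_abd {
        struct iovec    *iab_iov;       /* {base, len} segments, possibly user pages */
        int             iab_iovcnt;     /* number of segments */
        uint64_t        iab_size;       /* total usable bytes across segments */
} iov_abd_t;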

Member

ryao commented Jul 24, 2015

@tuxoko Also, I pushed my own set of patches that were called sgbuf to a branch. The sgbuf code was in a state that compiled, but failed to run properly when I last touched it. While I think ABD v2 is a cleaner approach, there might be a few ideas in the sgbuf code that you could reuse in ABD:

https://github.com/ryao/zfs/commits/sgbuf-2015

@behlendorf added this to the 0.7.0 milestone Jul 24, 2015

@tuxoko: could we ask for another refresh? I'm a bit concerned that I'm missing something in the merge fixes which is causing the random crashes.

Builds subsequent to 20150705 appear to hard-lock the systems in testing when they approach memory limits. I have one from 0718 which seems otherwise rock solid (the SCST hosts even seem to like it, but they always have ~30% free memory, even after iSCSI buffers).

Contributor

DeHackEd commented Aug 6, 2015

https://github.com/DeHackEd/zfs/commits/dehacked-bleedingedge2 This is my build of master+ABD which appears stable so far. Running it on the system I'm typing this into.

You might be seeing the ARC size getting stuck at the minimum and slowing to a crawl. This was fixed in master and is in the above patch.

I just thought I would mention that I have been having lockup issues with zfs send, #3668, and I decided to give DeHackEd's bleedingedge2 branch a try. I would lock up after 9 hours before, but now I'm going just fine over 24 hours. Everything looks good.

I second that, bleeding edge 2 seems stable in user and service environments so far. Thanks @DeHackEd

Member

don-brady commented Aug 19, 2015

@tuxoko: For external DMU consumers, like the Lustre OSD, an interface is needed to obtain all the page addresses from an ABD scatter buffer, similar to how abd_scatter_bio_map_off() needs to obtain the page addresses for the bio layer. The Lustre OSD will "borrow" a buffer using the dmu_request_arcbuf() interface, which I assume will yield a scatter buffer.

Would abd_buf_segment() be safe to use for this purpose? Note that an abd_iterate_func()-like interface is not appropriate for Lustre's use, since it only needs the page addresses and will perform its own kmaps.

thanks

Member

tuxoko commented Aug 20, 2015

@don-brady
I think the best way would be just direct access to abd->abd_sgl.
abd_sgl is a Linux-specific struct scatterlist, so it should be easy to handle in Lustre under the Linux kernel.

Also, please note that small allocations will fall back to linear ABD, so you need to check the flags to determine which type it is.
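
(As a rough illustration of that suggestion, not code from this branch, a consumer walking an ABD might look like the sketch below. The flag and field names (ABD_FLAG_LINEAR, abd_flags, abd_buf, abd_sgl, abd_nents) are assumptions and may not match the patch exactly.)

#include <linux/scatterlist.h>

/*
 * Hypothetical consumer-side walk of an ABD, per the suggestion above.
 * Field and flag names are assumed, not taken from the patch.
 */
static void
walk_abd_pages(abd_t *abd)
{
        struct scatterlist *sg;
        struct page *pg;
        void *buf;
        int i;

        if (abd->abd_flags & ABD_FLAG_LINEAR) {
                /* small allocations fall back to an ordinary linear buffer */
                buf = abd->abd_buf;
                (void) buf;
                return;
        }

        /* scatter case: hand each backing page to the caller's own kmap */
        for_each_sg(abd->abd_sgl, sg, abd->abd_nents, i) {
                pg = sg_page(sg);
                (void) pg;
        }
}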

Member

tuxoko commented Aug 20, 2015

@kernelOfTruth
Hi all,
The current abd2 branch is not compatible with #3651.
In case you want to test both of them, I pushed out a new branch, abd_loop_fix, which fixes the compatibility issue and is based on the current master.

I'll wait till #3651 is merged to master, then I'll update the main abd2 branch.

Contributor

kernelOfTruth commented Aug 20, 2015

Hi @tuxoko ,

will probably do so within the next few days, thanks a lot 👍

@tuxoko now that #3651 has been merged, can we bother you to refresh this patch again?

Contributor

kernelOfTruth commented Aug 26, 2015

@greg-hydrogen take a look at @tuxoko 's branch abd_loop_fix

Member

tuxoko commented Aug 29, 2015

Rebased to master, added support for #3651, and added one patch to use the miter in abd_uiomove.

olw2005 commented Sep 2, 2015

Probably dated information by now, but we ran about 150TB of zvols through @DeHackEd's bleedingedge2 branch to good effect. Approximately two weeks of near-constant I/O load, no issues.

Contributor

kernelOfTruth commented Sep 2, 2015

@olw2005 could you post a few details about the workload pattern?

Roughly what percentage read vs. write? L2ARC? ZIL? Any specific torturing? clamav / clamscan?
(There were a few reported issues in the past with clamav-based or rsync-based workloads that stressed ZFS and the ARC very strongly.)

olw2005 commented Sep 2, 2015

@kernelOfTruth We were adding disk space to two SANs (a handful of multi-TB lz4-compressed zvols served via iSCSI to VMware). I decided it would be better to recreate the pool (with a slightly modified layout) to prevent lopsided disk usage -- the current pools were 70ish% full. So we did a full backup of the pool (zfs send) out to [and then back from] a staging server. I used the bleedingedge2 branch on the staging server; h/w spec -- 192GB RAM, no L2ARC or ZIL, 36 disks (then 24 disks on the second send/recv). The usage profile was roughly a week of continuous write I/O (zfs recv), followed by roughly a week of continuous read I/O (zfs send). Send/recv speeds were ~500MB/s across the wire. The load average got a bit high (into the 20's) at times, but there were no issues sending / receiving about 155TB total.

Contributor

kernelOfTruth commented Sep 2, 2015

@olw2005 impressive !

thanks for the info 👍

@tuxoko great work with ABD 👍 I'm using it everywhere I can & use ZFS

Looks like we have a conflict with 4e0f33f.

<<<<<<< HEAD 
                abd_free(cdata, len);
                HDR_SET_COMPRESS(hdr, ZIO_COMPRESS_EMPTY);
=======
                zio_data_buf_free(cdata, len);
                l2hdr->b_compress = ZIO_COMPRESS_EMPTY;
>>>>>>> origin/master

in module/zfs/arc.c makes me think that this has already been addressed by the ABD stack, but it does conflict.

With v0.6.5 tagged today, @behlendorf, is there any way we can get this pull request in for the next tag? I have been using this patch (and the previous iteration) on a number of machines without any issues; in fact, it has greatly improved stability for my workloads.

@tuxoko - any way you can refresh this yet again? I would try and fix the conflicts myself, but I am a bit scared to screw something up

Owner

behlendorf commented Sep 12, 2015

Absolutely. This is one of the first things I'd like to get finalized and merged.

Contributor

DeHackEd commented Sep 12, 2015

I rebased my bleedingedge3 tree last night. I'm testing it now. Seems okay so far, but a possible master issue has come up which I will look into.

Contributor

kernelOfTruth commented Sep 12, 2015

@DeHackEd could you pinpoint it somehow already? I was originally planning to update my patchset but might postpone in that regard.

Contributor

DeHackEd commented Sep 12, 2015

If you mean the issue I'm tracking down: I was testing L2ARC (because that's the big difference in the latest master patches), but for some reason I'm not seeing the hit rates I was expecting.

But 0.6.5 is doing the same thing so I'm probably doing something wrong and not realizing it. I believe bleedingedge3 is fine otherwise.

Edit: if I had to guess, I'd say that only metadata is being read even though all data is being written... further investigation is required, but for now this shouldn't stop you from using ABD.

Contributor

kernelOfTruth commented Sep 12, 2015

@DeHackEd thanks for the heads-up.

Currently L2ARC is disabled here anyway, but I'm kind of missing the speedup it provides - so it's good to know that I can update to a safe(r) patch stack to re-enable it.

Contributor

DeHackEd commented Oct 13, 2015

I'm getting a failure with 'zfs send' involving large block sizes. Test case:

# zpool create testpool /dev/sdx -O recordsize=1M
# cp somebigfile /testpool/testfile
# zfs snapshot testpool@sendme
# zfs send testpool@sendme > /dev/null

And it crashes with:

VERIFY3(c < (1ULL << 24) >> 9) failed (36028797018963967 < 32768)
PANIC at zio.c:258:zio_buf_free()
Stack trace:
zio_buf_free
abd_return_buf
traverse_visitbp
traverse_visitbp
...

Sorry, transcribing the crash by hand here.

The crash can be avoided by adding '-L' to the send command line. Without -L, the send will break the large block into virtual 128k blocks so receivers can accept it. With -L, the full 1M block size will be sent, but receivers must be able to accept it.

Member

tuxoko commented Nov 5, 2015

Pushed out a new version; the last version is in https://github.com/tuxoko/zfs/tree/abd2_archive02

Changes from the last version:

  1. Rebase to current master.
  2. Fix a crash during large-block zfs send.
  3. Fix an assertion in abd_uiomove.
  4. Improve I/O performance via contiguous pages and sg merging.
  5. Reorder patches so the metadata scatter stuff comes later, starting from "Add non-highmem scatter ABD" bf7402d.

Thanks again @tuxoko for refreshing the patch.

I have installed it on a couple of machines and things seem snappier!

Contributor

kernelOfTruth commented Dec 22, 2015

@tuxoko FYI:

There are again recent upstream master changes that break ABD: 6fe5378 Fix vdev_queue_aggregate() deadlock

specifically module/zfs/vdev_queue.c; commit tuxoko/zfs@4c97b0f "Handle abd_t in vdev*.c sans vdev_raidz.c" is affected
(abd2 branch)


edit:

additional changes are introduced with

37f8a88 Illumos 5746 - more checksumming in zfs send
in include/sys/zio_checksum.h
and other files


a slightly more recent master rebase is at:

#4195

Especially of concern is whether the changes made to incorporate "Fix vdev_queue_aggregate() deadlock" [https://github.com/zfsonlinux/zfs/commit/6fe53787f38f10956b8d375133ed4559f8ce847b] were correct and can work with ABD2.

kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this pull request Jan 17, 2016

big blob patch with
Update arc_c under a mutex from #4197
ABD2 #3441 ABD: linear/scatter dual typed buffer for ARC (ver 2)
4950 #4207 Illumos #4950 files sometimes can't be removed from a full filesystem
2605 #4213 Illumos 2605 want to resume interrupted zfs send
Member

tuxoko commented Jan 19, 2016

Rebased to (near) master.
Last version in https://github.com/tuxoko/zfs/tree/abd2_archive03

Member

tuxoko commented Jan 25, 2016

Update: add scatter support for raidz parity and zfs_fm

Member

tuxoko commented Feb 10, 2016

Rebased to master. Added zio_{,de}compress_abd and dmu_write_abd to reduce buffer borrowing.
Last version in https://github.com/tuxoko/zfs/tree/abd2_archive04

ptx0 commented Jun 8, 2016

@tuxoko does that mean it fully supports lz4 etc?

Member

tuxoko commented Jun 8, 2016

@ptx0
It still needs to copy to a linear buffer to do compression, if that's what you're asking.
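
(To make that concrete, here is a minimal sketch of the borrow/copy step being described, assuming abd_borrow_buf_copy()/abd_return_buf()-style helpers; the exact helper names in this branch may differ, and compress_linear() is only a placeholder for a real linear-only compressor such as lz4.)

/*
 * Sketch of the extra linearization step needed for compression.
 * Helper names are approximate; compress_linear() is a placeholder.
 */
static size_t
compress_abd_sketch(abd_t *src, void *dst, size_t size)
{
        void *tmp;
        size_t c_len;

        /* copy the scatter data into a temporary linear buffer */
        tmp = abd_borrow_buf_copy(src, size);

        /* run the linear-only compressor over the copy */
        c_len = compress_linear(tmp, dst, size);

        /* give the borrowed linear buffer back */
        abd_return_buf(src, tmp, size);

        return (c_len);
}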

Contributor

kernelOfTruth commented Jul 11, 2016

@tuxoko could you please rebase to current master?

The last rebase I did was mostly stable, but under certain circumstances (updating recoll's desktop database) it reliably led to a hard lock; the buildbots didn't indicate any issue: #4781. Not certain if I have missed anything ...

@tuxoko: does the scatter->linear copy still pose a problem if the ARC stores data compressed a la #4768?
I recall reading somewhere that the "master plan" upstream was to implement compressed ARC and s/g on top of that, based on the work you've done here.

Member

tuxoko commented Jul 27, 2016

What do you mean by the scatter-to-linear copy problem? Do you mean the extra copy when compressing/decompressing? If so, then yes, since removing it would need a new lz4 implementation.
But other than that, it should be fine, though I haven't tried to merge the compressed ARC stuff with ABD yet.

Yes, I meant the additional copy for the compression operation. Thanks for clarifying.
Is there any chance you'd have time to review merge possibilities in the near future? :)

For help: does someone maintain an ABD patch for the older stable ZoL releases v0.6.4.2 and v0.6.3-1.3?

I have tens of production servers that have been running those versions stably for 1.5 years, and I do not want to run the risk of upgrading, but I would very much like to cherry-pick the ABD feature.

How can I backport ABD to ZoL releases v0.6.4.2 and v0.6.3-1.3? Has anyone else done the same thing?

Member

tuxoko commented Sep 24, 2016

Hi all, I just updated this branch. The old version can be found in branch abd2_archive05

The ABD branch from illumos #5135 is missing some stuff from my original branch, so I decided to rebase my branch to master. This update is still a WIP; I'm just posting it early for testing. I'll start merging stuff like the SIMD raidz and some API changes from #5135.

Member

tuxoko commented Sep 24, 2016

@samuelxhu
I actually have a branch abd_0.6.4 that's on top of 0.6.4.2. You can try that one, but be sure to test it before putting it into production.

Contributor

DeHackEd commented Sep 24, 2016

Besides the style failures, I'm getting ztest failing to run because zdb can't open the test pools.

tuxoko added some commits May 11, 2015

Introduce ABD: linear/scatter dual typed buffer for ARC
ZFS on Linux currently uses vmalloc-backed slabs for ARC buffers. There are some
major problems with this approach. One is that 32-bit systems have only a
handful of vmalloc space. Another is that fragmentation in the slab will easily
deplete memory on a busy system.

With ABD, we use a scatterlist to allocate data buffers. In this approach we can
allocate in HIGHMEM, which alleviates vmalloc space pressure on 32-bit. Also,
we don't have to rely on the slab, so there's no fragmentation issue.

But for metadata buffers, we still use linear buffers from the slab. The reason
for this is that there are a lot of *_phys pointers that point directly to
metadata buffers. Thus, it would be much more complicated to use scatter
buffers for metadata.

Currently, ABD is not enabled and its API will treat all buffers as normal
buffers. We will enable it once all relevant code is modified to use the API.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Modify/Add incremental checksum function for abd_iterate_rfunc
Modify/Add incremental fletcher function prototype to match abd_iterate_rfunc
callback type.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Use abd_t in arc_*.h, ddt.h, dmu.h and zio.h
1. Use abd_t in arc_buf_t->b_data, dmu_buf_t->db_data, zio_t->io_data and
zio_transform_t->zt_orig_data
2. zio_* functions take abd_t for data

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Convert zio_checksum to ABD version
1. Add checksum function for abd_t
2. Use abd_t version checksum function in zio_checksum_table
3. Make zio_checksum_compute and zio_checksum_error handle abd_t

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
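
(For readers new to the series: the "Introduce ABD" commit message above describes a buffer that is either slab-backed linear or scatterlist-backed. A conceptual sketch of that dual-typed shape is shown below; the actual abd_t in the patch differs in field names and detail, so treat this purely as an illustration.)

#include <linux/scatterlist.h>
#include <linux/types.h>

/* Conceptual shape of the dual-typed buffer; field names are illustrative. */
typedef struct abd {
        uint32_t                abd_flags;      /* linear vs. scatter, debug bits */
        uint64_t                abd_size;       /* usable bytes */
        /* linear form: ordinary slab buffer, used for metadata */
        void                    *abd_buf;
        /* scatter form: page-based buffer, may live in HIGHMEM, used for data */
        struct scatterlist      *abd_sgl;
        uint32_t                abd_nents;      /* scatterlist entries */
} abd_t;
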
Member

tuxoko commented Sep 24, 2016

Yeah, I see that in the buildbot. Strange that I didn't get that when I tested. It seems that the SIMD incremental fletcher patch does have some issue depending on the compiler or machine. I'll leave that patch out until I figure it out.

tuxoko and others added some commits May 11, 2015

Handle abd_t in arc.c, bpobj.c and bptree.c
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Handle abd_t in dbuf.c, ddt.c, dmu*.c
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Handle abd_t in dnode*.c, dsl_*.c, dsl_*.h
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Handle abd_t in sa_impl.h, sa.c, space_map.c, spa.c, spa_history.c and zil.c

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Handle abd_t in zap*.c, zap_*.h, zfs_fuid.c, zfs_sa.c and zfs_vnops.c
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Handle abd_t in zio.c
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Handle abd_t in vdev*.c sans vdev_raidz.c
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Handle abd_t in vdev_raidz.c
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Handle ABD in ztest and zdb
Use the ABD API on related pointers and functions (b_data, db_data, zio_*(), etc.).

Suggested-by: DHE <git@dehacked.net>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Enable ABD
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Disable simd raidz until we support it
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Add abd version byteswap functions
Add abd version byteswap function with the name "abd_<old bswap func name>".

Note that abd_byteswap_uint*_array and abd_dnode_buf_byteswap can handle
scatter buffers, so we no longer need an extra borrow/copy.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Split out miter part of copy funcs for abd_uiomove
Currently, abd_uiomove repeatedly calls abd_copy_*_off. The problem is that it
has to run abd_miter_advance repeatedly over the parts that were already
skipped before.

We split the miter part of abd_copy_*_off out into abd_miter_copy_*. These new
functions take the miter directly and automatically advance it when they
finish. We initialize a miter in uiomove and use the iterator copy functions
to solve the stated problem.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Only use abd_magic when we enable debug
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Optimize abd with contiguous pages and sg merging
When we aren't allocating in HIGHMEM, we can try to allocate contiguous pages,
and we can also use sg_alloc_table_from_pages to merge adjacent pages for us.
This allows more efficient cache prefetching and also reduces sg iterator
overhead, and it has been tested to greatly improve performance.

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Add access_ok check before __copy_{to,from}_user_inatomic
The check is needed to make sure the user buffer is indeed in user space. Also
change copy_{to,from}_user to __copy_{to,from}_user so that we don't
repeatedly call access_ok.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Enable scatter ABD support for SHA256
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Enable scatter ABD for raidz parity
Use scatter-type ABD for raidz parity and allow the parity generation and
reconstruction functions to operate directly on ABD without borrowing a
buffer. Note that matrix reconstruction still needs a borrowed buffer after
this patch.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Enable scatter ABD in zfs_fm
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Set rrd->payload to NULL after return buffer
In DRR_WRITE in receive_read_record, drr->payload would be set to a borrowed
buffer. Since we immediately return the buffer after reading from the recv
stream, it would become a dangling pointer. We set it to NULL to prevent
accidentally using it.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Add ABD version of zio_{,de}compress_data
Add zio_{,de}compress_abd so the callers don't need to borrow buffers
themselves.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Add dmu_write_abd
Add ABD version of dmu_write, which takes ABD as input buffer, to get rid of
some buffer borrowing.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
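
(The "Optimize abd with contiguous pages and sg merging" commit above relies on sg_alloc_table_from_pages() to coalesce physically adjacent pages into fewer scatterlist entries. Below is a minimal, self-contained sketch of that allocation pattern; it is not the patch's actual allocator, and per-page allocation failures are not handled.)

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

/*
 * Minimal sketch: collect the backing pages, then let
 * sg_alloc_table_from_pages() fold contiguous pages into fewer entries.
 */
static int
alloc_merged_sgl(struct sg_table *sgt, size_t size)
{
        unsigned int n = DIV_ROUND_UP(size, PAGE_SIZE);
        struct page **pages;
        unsigned int i;
        int err;

        pages = kmalloc_array(n, sizeof (*pages), GFP_KERNEL);
        if (pages == NULL)
                return (-ENOMEM);

        for (i = 0; i < n; i++)
                pages[i] = alloc_page(GFP_KERNEL);      /* non-HIGHMEM case */

        err = sg_alloc_table_from_pages(sgt, pages, n, 0, size, GFP_KERNEL);
        kfree(pages);   /* the sg_table references the pages directly now */
        return (err);
}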

@behlendorf removed this from the 0.7.0 milestone Oct 11, 2016
