
ABD: linear/scatter dual typed buffer for ARC (ver 2) #3441

Closed
wants to merge 26 commits

Conversation

tuxoko
Contributor

@tuxoko tuxoko commented May 25, 2015

This is a refreshed version of #2129.
It is rebased on top of large block support and has a cleaner history.

Here's a list of possible further enhancements (a rough sketch of the dual-typed buffer idea follows the list):

  1. Enable scatter for more metadata types: spa_history, bpobj, zap. (It seems that zap, indirect blocks, and dnodes are limited to 16K blocks, which currently use the kernel slab. Whether using scatter lists for them is better or worse remains to be seen.)
  2. Scatter support for byteswap.
  3. Scatter support for lz4 and other compression.
  4. Scatter support for SHA256.
  5. Scatter support for vdev_file.
  6. Scatter support for dmu_send.
  7. Enable scatter for raidz parity.
  8. Scatter support for raidz matrix reconstruction.
  9. Scatter support for zfs_fm.
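For readers unfamiliar with the concept, here is a minimal, hypothetical sketch of what a linear/scatter dual-typed buffer could look like; the type name, fields, and helper below are illustrative assumptions and are not taken from this patch's actual abd_t.

/*
 * Hypothetical sketch only: one handle that can describe either a single
 * virtually contiguous (linear) buffer or a set of individual pages
 * (scatter), so ARC consumers stop assuming kmem/vmalloc-backed memory.
 */
#include <linux/types.h>
#include <linux/string.h>
#include <linux/scatterlist.h>

typedef struct example_abd {
	bool		abd_is_linear;		/* which union member is valid */
	size_t		abd_size;		/* logical size in bytes */
	union {
		void	*abd_linear_buf;	/* kmem/vmalloc buffer */
		struct {
			struct scatterlist *abd_sgl;	/* page list */
			unsigned int	abd_nents;	/* entries in abd_sgl */
		} abd_scatter;
	} abd_u;
} example_abd_t;

/* Copy out of the buffer regardless of which backing type it uses. */
static void
example_abd_copy_to_buf(void *dst, const example_abd_t *abd, size_t size)
{
	if (abd->abd_is_linear)
		memcpy(dst, abd->abd_u.abd_linear_buf, size);
	else
		/* Walk the scatterlist page by page via the kernel helper. */
		sg_copy_to_buffer(abd->abd_u.abd_scatter.abd_sgl,
		    abd->abd_u.abd_scatter.abd_nents, dst, size);
}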

@DeHackEd
Contributor

Will start using tonight.

@edillmann
Contributor

Hi,

I gave it a test and it shows very high CPU usage in arc_adapt compared to master.

Regards,
Eric

@behlendorf
Contributor

@edillmann I don't quite follow your comment. Very good or very bad CPU usage?

@edillmann
Contributor

Hi,

It shows high CPU usage in arc_adapt. The pool is built this way:

zpool status
  pool: bfs
 state: ONLINE
  scan: scrub repaired 0 in 9h23m with 0 errors on Sun May  3 10:23:14 2015
config:
    NAME          STATE     READ WRITE CKSUM
    bfs           ONLINE       0     0     0
      mirror-0    ONLINE       0     0     0
        sdc       ONLINE       0     0     0   (full drive)
        sde       ONLINE       0     0     0   (full drive)
      mirror-1    ONLINE       0     0     0
        sdd       ONLINE       0     0     0   (full drive)
        sdf       ONLINE       0     0     0   (full drive)
    logs
      lva-zil0    ONLINE       0     0     0   (LVM volume on SSD)
    cache
      zcache1     ONLINE       0     0     0   (LVM volume on SSD)

And some stats :

cat /proc/spl/kstat/zfs/arcstats
5 1 0x01 86 4128 11319918778 252428229027638
name                            type data
hits                            4    78061054
misses                          4    40955739
demand_data_hits                4    56497175
demand_data_misses              4    10841192
demand_metadata_hits            4    21498002
demand_metadata_misses          4    29548717
prefetch_data_hits              4    39808
prefetch_data_misses            4    505804
prefetch_metadata_hits          4    26069
prefetch_metadata_misses        4    60026
mru_hits                        4    29614684
mru_ghost_hits                  4    431054
mfu_hits                        4    48384144
mfu_ghost_hits                  4    681788
deleted                         4    55535257
recycle_miss                    4    25484240
mutex_miss                      4    292255
evict_skip                      4    162321851980
evict_l2_cached                 4    534329798656
evict_l2_eligible               4    771167293440
evict_l2_ineligible             4    18798989312
hash_elements                   4    1350811
hash_elements_max               4    1412083
hash_collisions                 4    15198772
hash_chains                     4    176317
hash_chain_max                  4    6
p                               4    8366558720
c                               4    8589934592
c_min                           4    4294967296
c_max                           4    8589934592
size                            4    6909182400
hdr_size                        4    56284320
data_size                       4    5069312
meta_size                       4    2260425216
other_size                      4    4160235936
anon_size                       4    17095680
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    2042542080
mru_evict_data                  4    0
mru_evict_metadata              4    1589248
mru_ghost_size                  4    0
mru_ghost_evict_data            4    0
mru_ghost_evict_metadata        4    0
mfu_size                        4    205856768
mfu_evict_data                  4    0
mfu_evict_metadata              4    16384
mfu_ghost_size                  4    0
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    0
l2_hits                         4    15391819
l2_misses                       4    20784025
l2_feeds                        4    162808
l2_rw_clash                     4    42923
l2_read_bytes                   4    226944701440
l2_write_bytes                  4    49032701440
l2_writes_sent                  4    128430
l2_writes_done                  4    128430
l2_writes_error                 4    0
l2_writes_hdr_miss              4    96
l2_evict_lock_retry             4    4
l2_evict_reading                4    0
l2_free_on_write                4    666234
l2_cdata_free_on_write          4    18037
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    12775517696
l2_asize                        4    3749617152
l2_hdr_size                     4    427167616
l2_compress_successes           4    14800470
l2_compress_zeros               4    0
l2_compress_failures            4    16272
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    2870064338
arc_meta_used                   4    6904113088
arc_meta_limit                  4    6442450944
arc_meta_max                    4    6933723168

Any clues?

Regards,
Eric

@tuxoko
Contributor Author

tuxoko commented May 29, 2015

Hi @edillmann,
Could you try running with only the commits up to 1aa5c0f?
Thanks.

@tuxoko
Contributor Author

tuxoko commented May 29, 2015

I get this constantly when running ztest on a 32-bit Ubuntu 15.04 VM.
Does anyone know what this assertion indicates?

5 vdevs, 7 datasets, 23 threads, 300 seconds...
loading space map for vdev 0 of 1, metaslab 6 of 14 ...
ztest: ../../module/zfs/vdev.c:923: Assertion `vd->vdev_pending_fastwrite == 0 (0x80000c400 == 0x0)' failed.

@behlendorf
Contributor

@tuxoko that's actually a long-standing issue. Oddly enough it only happens on 32-bit systems, and I've never spent the time to determine why. But since this patch is going to enable 32-bit support, we'll need to get to the bottom of it.

@DeHackEd
Contributor

I don't know if this is ABD-specific or something related to the master tree, but I ended up with my system stuck in a sync cycle:
http://pastebin.com/wunCLu1S

I don't know what caused it to start but apps started hanging. drop_caches appeared to run successfully but didn't help.

Technically this is my own tree, with my standard extra patches and a slight rebase to bring it closer to master HEAD: https://github.com/DeHackEd/zfs/commits/dehacked-bleedingedge if interested (commit 484f14b in case I update the tree)

@behlendorf
Contributor

@DeHackEd It's not you, it's very likely #3409. You can work around it by increasing zfs_arc_c_min to, say, 1/16 of memory. This is something we need to address.
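For anyone wanting to try that workaround, here is a hedged example; the module parameter name matches the options shown later in this thread, the value shown is 1 GiB and should be scaled to roughly 1/16 of your RAM, and runtime writability of the parameter depends on the build:

# At runtime, if the parameter is writable on your build:
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_min

# Or persistently, in /etc/modprobe.d/zfs.conf (takes effect on module reload):
options zfs zfs_arc_min=1073741824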

@edillmann
Contributor

Hi @tuxoko

Running ZFS with only the commits up to 1aa5c0f resolves the problem with arc_adapt.
Server load dropped from 6.5 to 0.8 with the same workload.

Thanks,

@tuxoko
Contributor Author

tuxoko commented Jun 2, 2015

@edillmann
Regarding the high CPU load of arc_adapt, does it cause stalls or performance degradation?
Could you try using perf record -p <pid of arc_adapt> to see what's consuming the CPU?
Thanks.
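A hedged example of how such a profile might be captured (the PID below is a placeholder for whatever pgrep reports):

# Find the arc_adapt kernel thread, profile it for ~30 seconds, then inspect the report
pgrep arc_adapt
perf record -g -p <pid of arc_adapt> -- sleep 30
perf report --stdio | head -n 40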

@edillmann
Contributor

@tuxoko Yes, it causes a very notable performance degradation: mean I/O wait climbs from 3% to 20%, and mean server load from 1 to 6. The degradation is not immediate but appears after 2 or 3 days (memory fragmentation?). For now I'm running 1aa5c0f and will wait a few days to see if performance stays as good as it is now.

@sempervictus
Contributor

@tuxoko: could we bug you for a refresh against master? The lock contention patches landed and cause serious conflicts, and it would be useful to update our test environments to the final revision of that stack (we're using an old merge). Thanks as always.

@behlendorf
Contributor

@tuxoko I'd like to focus on getting this patch stack merged next for master. It would be great if you could rebase it on master. I'll work my way through it this week to get you some additional review comments and hopefully we can get a few people to provide additional test coverage for the updated patch.

@edillmann what, if anything, did you determine about running only up to 1aa5c0f?

@edillmann
Contributor

@behlendorf the system I'm running with 1aa5c0f is performing well (low load, low iowait); with the previous version I had problems with arc_adapt eating CPU and overall system performance was getting bad (high load, high iowait).

@tuxoko
Contributor Author

tuxoko commented Jun 23, 2015

@behlendorf
Sorry for the delay, a rebased version should come shortly.

@tuxoko
Contributor Author

tuxoko commented Jun 24, 2015

Rebased onto master. The last version can be accessed via my branch abd2_archive00.

@kernelOfTruth
Contributor

Excellent 👍

Just in time for updating the kernel modules

Thank you very much 😄

@sempervictus
Contributor

This appears to conflict with ef56b07. According to git blame, the conflict is at line 5848 of arc.c and looks like:

5848 <<<<<<< HEAD
5849         uint64_t write_asize, write_sz, headroom, buf_compress_minsz,
5850             stats_size;
5851         void *buf_data;
5852 =======
5853         uint64_t write_asize, write_psize, write_sz, headroom,
5854             buf_compress_minsz;
5855         abd_t *buf_data;
5856 >>>>>>> origin/pr/3441

Git blame says:

ef56b078 module/zfs/arc.c       (Andriy Gapon      2015-06-12 21:20:29 +0200 5800)  uint64_t write_asize, write_sz, headroom, buf_compress_minsz,
ef56b078 module/zfs/arc.c       (Andriy Gapon      2015-06-12 21:20:29 +0200 5801)      stats_size;

So it looks like the L2ARC sizing bit has made its way into master (yay), but it now conflicts with ABD.
@tuxoko: with write_psize added into the mix, how should I address this? Thanks as always.

@kernelOfTruth
Contributor

@sempervictus FYI: kernelOfTruth@cc3e5e6

write_psize is superfluous, so it's not needed anyway.

so the only change should be

void *buf_data;

to

abd_t *buf_data;
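Putting the two comments above together, the resolved declaration would presumably keep stats_size from ef56b07 and only change the buffer type, i.e. something along these lines:

        uint64_t write_asize, write_sz, headroom, buf_compress_minsz,
            stats_size;
        abd_t *buf_data;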

@DeHackEd
Contributor

I've had intermittent (and fairly brief) system hangs going on, off and on. I don't know whether it's related to master or to ABD. I'm running a version based on a rebase onto master (a7b10a9, with pretty easy-to-resolve conflicts) on kernel 4.1.1. The biggest issue seems to be the ARC jamming on I/O.

[<ffffffffc02a7055>] cv_wait_common+0xf5/0x130 [spl]
[<ffffffffc02a70e0>] __cv_wait+0x10/0x20 [spl]
[<ffffffffc02e3a37>] arc_get_data_buf+0x427/0x450 [zfs]
[<ffffffffc02e7040>] arc_read+0x510/0x9e0 [zfs]
[<ffffffffc02ee706>] dbuf_read+0x236/0x7b0 [zfs]
[<ffffffffc02f7804>] dmu_buf_hold_array_by_dnode+0x124/0x490 [zfs]
[<ffffffffc02f869e>] dmu_read_uio_dnode+0x3e/0xc0 [zfs]
[<ffffffffc02f87dc>] dmu_read_uio_dbuf+0x3c/0x60 [zfs]
[<ffffffffc0384a20>] zfs_read+0x140/0x410 [zfs]
[<ffffffffc039b068>] zpl_read+0xa8/0xf0 [zfs]
[<ffffffff8b13a2cf>] __vfs_read+0x2f/0xf0
[<ffffffff8b13a5f8>] vfs_read+0x98/0xe0
[<ffffffff8b13aea5>] SyS_read+0x55/0xc0
[<ffffffff8b4e90d7>] system_call_fastpath+0x12/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff

ARC stats shows:

p                               4    370216448
c                               4    2166505216
c_min                           4    33554432
c_max                           4    4000000000
size                            4    2524924248
hdr_size                        4    120098688
data_size                       4    1049088
meta_size                       4    1436928000
other_size                      4    912485592
...
arc_prune                       4    40914
arc_meta_used                   4    2523875160
arc_meta_limit                  4    3000000000
arc_meta_max                    4    3061794744
arc_meta_min                    4    16777216

Since size > c and most of it is metadata, it could be jamming because it's doing an eviction pass and somehow hanging there. Operations usually unwedge in a second or so, but back-to-back I/O will grind the system to a crawl fairly quickly.
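One quick, hedged way to confirm that picture while a stall is happening is to watch the relevant arcstats counters (plain shell, path as used earlier in this thread):

# Print ARC size vs. target and metadata usage vs. limit once per second
watch -n1 "grep -E '^(size|c|arc_meta_used|arc_meta_limit) ' /proc/spl/kstat/zfs/arcstats"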

@sempervictus
Contributor

I'm seeing something similar to @DeHackEd, though it manifests a bit differently. After a few days of runtime on a desktop system, the entire system comes to a vicious crawl with no iowait registering and system resources barely being touched. Shell commands take 30s-2m to execute, and there is nothing in dmesg to indicate an actual crash anywhere.

iSCSI hosts are showing no such problem, and so far the NFS systems aren't either (both backing a CloudStack). All systems in the testing round do regular send/receive, though that seems to be what's killing the desktop system: I've found the workstation unresponsive a couple of times since we moved to the new ABD stack, and the last thing I did with it each time was initiate a send/recv. All hosts use L2ARC, and the servers (SCST and NFS) both have mirrored SLOGs as well (in case it matters at all).

The ARC meta_size looks a bit too large to me, but then again, it may be the workload.
I managed to pull an arcstats dump before the whole thing stopped responding altogether this time:

5 1 0x01 89 4272 29154622658 93491015770290
name                            type data
hits                            4    12820953
misses                          4    3298382
demand_data_hits                4    5675931
demand_data_misses              4    422482
demand_metadata_hits            4    6408370
demand_metadata_misses          4    2152548
prefetch_data_hits              4    48759
prefetch_data_misses            4    504710
prefetch_metadata_hits          4    687894
prefetch_metadata_misses        4    218642
mru_hits                        4    4971539
mru_ghost_hits                  4    452869
mfu_hits                        4    7112766
mfu_ghost_hits                  4    183086
deleted                         4    880583
mutex_miss                      4    2411
evict_skip                      4    892990
evict_not_enough                4    131000
evict_l2_cached                 4    2469370880
evict_l2_eligible               4    46359740928
evict_l2_ineligible             4    17156611072
evict_l2_skip                   4    0
hash_elements                   4    590800
hash_elements_max               4    713425
hash_collisions                 4    566822
hash_chains                     4    29153
hash_chain_max                  4    4
p                               4    2403761164
c                               4    7545637400
c_min                           4    1073741824
c_max                           4    10737418240
size                            4    7635005488
hdr_size                        4    106172160
data_size                       4    1056768
meta_size                       4    2336360448
other_size                      4    5133050608
anon_size                       4    4887040
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    1722641408
mru_evict_data                  4    0
mru_evict_metadata              4    0
mru_ghost_size                  4    1429168640
mru_ghost_evict_data            4    1049606144
mru_ghost_evict_metadata        4    379562496
mfu_size                        4    609888768
mfu_evict_data                  4    0
mfu_evict_metadata              4    0
mfu_ghost_size                  4    2018373120
mfu_ghost_evict_data            4    1666403328
mfu_ghost_evict_metadata        4    351946240
l2_hits                         4    908
l2_misses                       4    2942626
l2_feeds                        4    90480
l2_rw_clash                     4    0
l2_read_bytes                   4    738304
l2_write_bytes                  4    2265349120
l2_writes_sent                  4    631
l2_writes_done                  4    631
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    349
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    6
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    2467729920
l2_asize                        4    2265340928
l2_hdr_size                     4    58365504
l2_compress_successes           4    41221
l2_compress_zeros               4    0
l2_compress_failures            4    256942
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    1
memory_indirect_count           4    47273
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    7633948720
arc_meta_limit                  4    8053063680
arc_meta_max                    4    7639335560
arc_meta_min                    4    536870912

The patch stack in testing (tsv20150706):

  * origin/pr/3526
  ** Change default cachefile property to 'none'.
  * origin/pr/3576
  ** Fix Xen Virtual Block Device detection
  * origin/pr/3575
  ** Failure of userland copy should return EFAULT
  * origin/pr/3574
  ** 5745 zfs set allows only one dataset property to be set at a time
  * origin/pr/3572
  ** 5813 zfs_setprop_error(): Handle errno value E2BIG.
  * origin/pr/3571
  ** 5661 ZFS: "compression = on" should use lz4 if feature is enabled
  * origin/pr/3569
  ** 5427 memory leak in libzfs when doing rollback
  * origin/pr/3567
  ** 5118 When verifying or creating a storage pool, error messages only show one device
  * origin/pr/3566
  ** 4966 zpool list iterator does not update output
  * origin/pr/3565
  ** 4745 fix AVL code misspellings
  * origin/pr/3563
  ** 4626 libzfs memleak in zpool_in_use()
  * origin/pr/3562
  ** 1765 assert triggered in libzfs_import.c trying to import pool name beginning with a number
  * origin/pr/2784
  ** Illumos #4950 files sometimes can't be removed from a full filesystem
  * origin/pr/3166
  ** Make linking with and finding libblkid required
  * origin/pr/3529
  ** Translate zio requests with ZIO_PRIORITY_SYNC_READ and
  * origin/pr/3557
  ** Prevent reclaim in metaslab preload threads
  * origin/pr/3555
  ** Illumos 5008 lock contention (rrw_exit) while running a read only load
  * origin/pr/3554
  ** 5911 ZFS "hangs" while deleting file
  * origin/pr/3553
  ** 5981 Deadlock in dmu_objset_find_dp
  * origin/pr/3552
  ** Illumos 5946, 5945
  * origin/pr/3551
  ** Illumos 5870 - dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch
  * origin/pr/3550
  ** Illumos 5909 - ensure that shared snap names don't become too long after promotion
  * origin/pr/3549
  ** Illumos 5912 - full stream can not be force-received into a dataset if it has a snapshot
  * origin/pr/2012
  ** Add option to zpool status to print guids
  * origin/pr/3169
  ** Add dfree_zfs for changing how Samba reports space
  * origin/pr/3441
  ** Add abd version byteswap functions
  * master @ a7b10a931911d3a98a90965795daad031c6d33a2

ZFS options in use on crashing host:

# ZFS ARC cache boundaries from 1-10G
options zfs zfs_arc_min=1073741824
options zfs zfs_arc_max=10737418240
#
# Write throttle
options zfs zfs_vdev_async_write_max_active=32
options zfs zfs_vdev_sync_write_max_active=32
#
# Read throttle
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_max_active=12

@sempervictus mentioned this pull request Jul 18, 2015
@tuxoko
Contributor Author

tuxoko commented Jul 20, 2015

@DeHackEd
I'm guessing that it is not directly related to ABD.
Blocking whenever arc_size > arc_c seems too heavy-handed.
If you remove the cv_wait in arc_get_data_buf, does the system behave normally?
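For context, a rough sketch of the throttle being discussed; the identifiers below are approximations from memory of that era's arc.c, not the exact code on this branch:

/*
 * Approximate sketch (not verbatim source): arc_get_data_buf() blocks new
 * allocations on a condition variable while the ARC is over its target,
 * waiting for the reclaim thread to evict. The experiment suggested above
 * is to skip this wait and see whether the stalls disappear.
 */
if (arc_is_overflowing()) {
	mutex_enter(&arc_reclaim_lock);
	if (arc_is_overflowing()) {
		cv_signal(&arc_reclaim_thread_cv);	/* kick the evict thread */
		cv_wait(&arc_reclaim_waiters_cv, &arc_reclaim_lock);	/* the wait shown in the stack above */
	}
	mutex_exit(&arc_reclaim_lock);
}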

@kernelOfTruth
Contributor

@tuxoko, @DeHackEd there's a related, newly created issue: #3616

@tuxoko
Contributor Author

tuxoko commented Sep 24, 2016

Yeah, I see that in the build bot. Strange that I didn't get that when I tested. It seems that the SIMD incremental fletcher patch does have some issue depending on compiler or machine. I'll leave that patch out until I figure it out.

tuxoko and others added 22 commits September 24, 2016 18:35
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
…d zil.c

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Use the ABD API on related pointers and functions (b_data, db_data, zio_*(), etc.).

Suggested-by: DHE <git@dehacked.net>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Currently, abd_uiomove repeatedly calls abd_copy_*_off. The problem is that it
has to run abd_miter_advance repeatedly over the parts that were already skipped.

We split the miter part of abd_copy_*_off out into abd_miter_copy_*. These new
functions take a miter directly and automatically advance it when they finish.
We initialize a miter in uiomove and use the iterator copy functions to solve
the stated problem.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
The check is needed to make sure the user buffer is indeed in user space. Also
change copy_{to,from}_user to __copy_{to,from}_user so that we don't
repeatedly call access_ok.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
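A small, hedged illustration of the pattern the commit above describes; the helper and loop below are invented for illustration, and the access_ok() call uses the pre-5.0 kernel form with a VERIFY_* flag:

/*
 * Validate the whole user range once with access_ok(), then use the
 * unchecked __copy_to_user() per chunk instead of copy_to_user(), which
 * would repeat the access_ok() check on every call. Illustration only.
 */
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/uaccess.h>

static int
copy_chunks_to_user(void __user *ubuf, const char *kbuf, size_t len, size_t chunk)
{
	size_t off;

	if (!access_ok(VERIFY_WRITE, ubuf, len))
		return (-EFAULT);

	for (off = 0; off < len; off += chunk) {
		size_t n = min(chunk, len - off);

		if (__copy_to_user((char __user *)ubuf + off, kbuf + off, n))
			return (-EFAULT);
	}
	return (0);
}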
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
When we aren't allocating in HIGHMEM, we can try to allocate contiguous pages,
and we can also use sg_alloc_table_from_pages to merge adjacent pages for us.
This allows more efficient cache prefetching and also reduces sg iterator
overhead, and it has been tested to greatly improve performance.

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
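A hedged sketch of the allocation idea described above (not the actual ABD allocator; the real patch also tries contiguous higher-order allocations first, and error handling is trimmed here):

/*
 * Sketch: allocate lowmem pages (no __GFP_HIGHMEM) and let
 * sg_alloc_table_from_pages() coalesce physically adjacent pages into
 * fewer scatterlist entries.
 */
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

static int
build_merged_sg_table(struct sg_table *sgt, size_t size)
{
	unsigned int npages = DIV_ROUND_UP(size, PAGE_SIZE);
	struct page **pages;
	unsigned int i;
	int err;

	pages = kmalloc_array(npages, sizeof (*pages), GFP_KERNEL);
	if (pages == NULL)
		return (-ENOMEM);

	for (i = 0; i < npages; i++)
		pages[i] = alloc_page(GFP_KERNEL);	/* lowmem, may be adjacent */

	/* Adjacent pages are merged into single sg entries where possible. */
	err = sg_alloc_table_from_pages(sgt, pages, npages, 0, size, GFP_KERNEL);
	kfree(pages);
	return (err);
}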
Add ABD-version byteswap functions with the name "abd_<old bswap func name>".

Note that abd_byteswap_uint*_array and abd_dnode_buf_byteswap can handle
scatter buffers, so we no longer need an extra borrow/copy.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Use scatter-type ABD for raidz parity and allow the parity generation and
reconstruction functions to operate directly on ABD without a borrowed buffer.
Note that matrix reconstruction still needs a borrowed buffer after this patch.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
In DRR_WRITE in receive_read_record, drr->payload would be set to a borrowed
buffer. Since we immediately return the buffer after reading from the recv
stream, it would become a dangling pointer. We set it to NULL to prevent
accidentally using it.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Add zio_{,de}compress_abd so the callers don't need to borrow buffers
themselves.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Add an ABD version of dmu_write, which takes an ABD as the input buffer, to get
rid of some buffer borrowing.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Sep 30, 2016
@behlendorf behlendorf removed this from the 0.7.0 milestone Oct 11, 2016
Labels
Status: Work in Progress (Not yet ready for general review) · Type: Feature · Type: Performance