
NAS-102535: Backport some OS/ZFS bugfixes #204

Merged
merged 13 commits on Jul 17, 2019

Commits on Jul 17, 2019

  1. MFC r347953:

    MFV/ZoL: `zfs userspace` ignored all unresolved UIDs after the first
    
    openzfs/zfs@88cfff1
    
    zfs_main: fix `zfs userspace` squashing unresolved entries
    
    The `zfs userspace` command squashes all entries with unresolved
    numeric IDs into a single output entry, because the comparison is
    always made by string name, which is empty for unresolved IDs.
    
    Fix this by falling back to a numeric comparison when either of the
    string values is not found. Numeric entries then sort after the
    entries with resolved names.
    
    Signed-off-by: Pavel Boldin <boldin.pavel@gmail.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    
    Reported by:	clusteradm
    Obtained from:	ZFS-on-Linux
    
    Approved by:	re (gjb)
    
    (cherry picked from commit cc93071)
    allanjude authored and amotin committed Jul 17, 2019
  2. Allocate buffers smaller than ABD chunk size as linear.

    This reduces memory waste by letting UMA pack multiple small buffers
    into a single memory-page slab.  The page sharing means that UMA may
    not be able to free a memory page while some of its buffers are still
    allocated, but the alternative is that the memory in those buffers
    would simply have been wasted from the beginning.
    
    This change follows a similar change in ZoL, but unlike Linux (as I
    understand it from the comments) FreeBSD never shares slabs bigger
    than one memory page, so this should be even less invasive here than
    there.
    
    MFC after:	2 weeks
    Sponsored by:	iXsystems, Inc.
    
    (cherry picked from commit d4527f36d52a6b83a203b54ce67bb9d441bd1c96)
    (cherry picked from commit 0c4e60b)
    amotin committed Jul 17, 2019
  3. MFV r348573: 9993 zil writes can get delayed in zio pipeline

    illumos/illumos-gate@2258ad0
    
    Reviewed by: Prakash Surya <prakash.surya@delphix.com>
    Reviewed by: Brad Lewis <brad.lewis@delphix.com>
    Reviewed by: Matt Ahrens <matt@delphix.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
    Author:     George Wilson <george.wilson@delphix.com>
    
    (cherry picked from commit fec436759eb17690404be56433f495e2da7650df)
    (cherry picked from commit 1502d57)
    amotin committed Jul 17, 2019
  4. Restore ARC MFU/MRU pressure

    Before r305323 (MFV r302991: 6950 ARC should cache compressed data)
    arc_read() code did this for access to a ghost buffer:
     arc_adapt() (from arc_get_data_buf())
     arc_access(hdr, hash_lock)
    I.e., we first registered the access to the MFU ghost/MRU ghost
    buffer and adapted the MFU/MRU sizes (in arc_adapt()), and only then
    moved the buffer from the ghost state to a regular one.
    
    After r305323 the sequence is different:
     arc_access(hdr, hash_lock);
     arc_hdr_alloc_pabd(hdr);
    I.e., we first move the buffer out of the ghost state in arc_access()
    and only then check the access to the (formerly) ghost buffer (in
    arc_hdr_alloc_pabd() -> arc_get_data_abd() -> arc_get_data_impl() ->
    arc_adapt()).  This is incorrect: arc_adapt() never sees the access
    to the ghost buffer, because arc_access() has already migrated the
    buffer from the ghost state to a regular one.
    
    So, the fix is to restore a call to arc_adapt() before arc_access() and
    to suppress the call to arc_adapt() after arc_access().
    
    Submitted by:	Slawa Olhovchenkov <slw@zxy.spb.ru>
    MFC after:	2 weeks
    Sponsored by:	Integros [integros.com]
    Differential Revision: https://reviews.freebsd.org/D19094
    
    (cherry picked from commit c3d9fbde345e2dd806fb239c5b06fcfa870255f9)
    (cherry picked from commit eca6490)
    avg-I authored and amotin committed Jul 17, 2019
  5. Explicitly start ARC adjustment on limits change.

    While not formally necessary, the sooner the adjustment starts, the
    sooner it finishes, and the less disturbing to the workload it should
    be.
    
    MFC after:	2 weeks
    
    (cherry picked from commit 83ee51bf028e778aa584f79403fc952ec15e2811)
    (cherry picked from commit 72b7b24)
    amotin committed Jul 17, 2019
  6. Fix comparison signedness in arc_is_overflowing().

    When the ARC size is very small, aggsum_lower_bound(&arc_size) may
    return negative values, which, due to the unsigned comparison, caused
    delays while waiting for arc_adjust() to "fix" it by calling
    aggsum_value(&arc_size).  Using a signed comparison there fixes the
    problem.
    
    MFC after:	2 weeks
    Sponsored by:	iXsystems, Inc.
    
    (cherry picked from commit 7bd3ba93f748836f7101df6a860361378f42f538)
    (cherry picked from commit dc3e1f5)
    amotin committed Jul 17, 2019
  7. Move write aggregation memory copy out of vq_lock.

    A memory copy is too heavy an operation to perform under the
    congested lock.  Moving it out reduces the congestion many times
    over, to almost invisible.  Since the original zio has been removed
    from the queue and the child zio is not executed yet, I don't see why
    the copy would need protection.  My guess is that it just remained
    this way from the time when the lock was not dropped here; the drop
    was added later to fix a lock-ordering issue.
    
    Multi-threaded sequential write tests with both HDD and SSD pools
    with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major
    reduction of lock congestion, saving from 15% to 35% of CPU time
    and increasing throughput from 10% to 40%.
    
    Reviewed by:	ahrens, behlendorf, ryao
    MFC after:	2 weeks
    Sponsored by:	iXsystems, Inc.
    
    (cherry picked from commit de44a22304f8ec0db52ffc36755b814097ba98d8)
    (cherry picked from commit 2c68077)
    amotin committed Jul 17, 2019
  8. Update td_runtime of running thread on each statclock().

    Normally td_runtime is updated on context switch, but there are some
    kernel threads that, due to their high absolute priority, may run for
    many seconds without a context switch (yes, that is bad, but it is
    true), which means their td_runtime was not updated all that time,
    making them invisible to top(1) other than as general CPU usage.
    
    MFC after:	1 week
    Sponsored by:	iXsystems, Inc.
    
    (cherry picked from commit a8eed19ff99997e7832dafc684bf999c5506dcf1)
    (cherry picked from commit 394f634)
    amotin committed Jul 17, 2019
  9. Properly align struct multilist_sublist to cache line.

    The manual Illumos alignment does not fit us due to a different
    kmutex_t size.
    
    MFC after:	1 week
    Sponsored by:	iXsystems, Inc.
    
    (cherry picked from commit 24febf634a9d50aaf6d8a832890ac5456d1df400)
    (cherry picked from commit 2b0e1e8)
    amotin committed Jul 17, 2019
  10. Like ZoL, disable the metaslab allocation tracing code.

    It is too generous to collect in production debug traces that can
    only be read with a kernel debugger.  Illumos includes special code
    in their mdb debugger to read them; we don't.
    
    MFC after:	1 week
    Sponsored by:	iXsystems, Inc.
    
    (cherry picked from commit 07ff926fdda392f28d11d76f0559c576cc36fb16)
    (cherry picked from commit fb7cf57)
    amotin committed Jul 17, 2019
  11. Minimize aggsum_compare(&arc_size, arc_c) calls.

    For a busy ARC, having arc_size close to arc_c is desired.  But then
    it is quite likely that aggsum_compare(&arc_size, arc_c) will need to
    flush the per-CPU buckets to get an exact comparison result.  Doing
    that often in a hot path defeats the whole idea of using aggsum
    there, since it replaces a few simple atomic additions with dozens of
    lock acquisitions.
    
    Replacing aggsum_compare() with aggsum_upper_bound() in the code that
    increases arc_p while ARC is growing (arc_size < arc_c) saves,
    according to PMC profiles, ~5% of CPU time in aggsum code during
    sequential writes to 12 ZVOLs with 16KB block size on a large
    dual-socket system.
    
    I suppose there is some minor arc_p behavior change due to the lower
    precision of the new code, but I don't think it is a big deal, since
    it should affect only a very small window in time (aggsum buckets are
    flushed every second) and in ARC size (buckets are limited to 10
    average ARC blocks per CPU).
    
    MFC after:	2 weeks
    Sponsored by:	iXsystems, Inc.
    
    (cherry picked from commit d36b8746db91d7921eb81941847838872800ce33)
    (cherry picked from commit 75869d2)
    amotin committed Jul 17, 2019
  12. MFC r344042:

    UMA: unsign some variables related to allocation in hash_alloc().
    
    As a follow-up to r343673, unsign some variables related to
    allocation, since the hashsize cannot be negative.  This gives a bit
    more room to handle bigger allocations and avoids some implicit
    casting.
    
    While here also unsign uh_hashmask, it makes little sense to keep it
    signed.
    
    Differential Revision:	https://reviews.freebsd.org/D19148
    
    (cherry picked from commit 2149862)
    pgiffuni authored and amotin committed Jul 17, 2019
  13. MFC r348764: Allow UMA hash tables to expand faster than 2x in 20 seconds.

    
    ZFS ABD allocates tons of 4KB chunks via UMA, requiring huge hash
    tables.  With an initial hash table size of only 32 elements it takes
    ~20 expansions, or ~400 seconds, to adapt to handling a 220GB ZFS
    ARC.  During that time not only is the hash table highly inefficient,
    but each of those expansions also takes significant time with the
    lock held, blocking operation.
    
    On my test system with 256GB of RAM and a ZFS pool of 28 HDDs, this
    change reduces the time needed to read 240GB for the first time from
    ~300-400s, during which the system is quite busy and unresponsive, to
    only ~150s with a light CPU load and just 5 sub-second CPU spikes to
    expand the hash table.
    
    (cherry picked from commit 2b89445e2d111900e8e46d1e5f4535768ad0b9c8)
    (cherry picked from commit df4156e)
    amotin committed Jul 17, 2019