NAS-102535: Backport some OS/ZFS bugfixes #204
Merged
Conversation
MFV/ZoL: `zfs userspace` ignored all unresolved UIDs after the first openzfs/zfs@88cfff1 zfs_main: fix `zfs userspace` squashing unresolved entries The `zfs userspace` squashes all entries with unresolved numeric values into a single output entry due to the comparsion always made by the string name which is empty in case of unresolved IDs. Fix this by falling to a numerical comparison when either one of string values is not found. This then compares any numerical values after all with a name resolved. Signed-off-by: Pavel Boldin <boldin.pavel@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported by: clusteradm Obtained from: ZFS-on-Linux Approved by: re (gjb) (cherry picked from commit cc93071)
This allows to reduce memory waste by letting UMA to put multiple small buffers into one memory page slab. The page sharing means that UMA may not be able to free memory page when some of buffers are freed, but alternatively memory used by that buffer would just be wasted from the beginning. This change follows alike change in ZoL, but unlike Linux (according to my understanding of it from comments) FreeBSD never shares slabs bigger then one memory page, so this should be even less invasive then there. MFC after: 2 weeks Sponsored by: iXsystems, Inc. (cherry picked from commit d4527f36d52a6b83a203b54ce67bb9d441bd1c96) (cherry picked from commit 0c4e60b)
illumos/illumos-gate@2258ad0 Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: George Wilson <george.wilson@delphix.com> (cherry picked from commit fec436759eb17690404be56433f495e2da7650df) (cherry picked from commit 1502d57)
Before r305323 (MFV r302991: 6950 ARC should cache compressed data) arc_read() code did this for access to a ghost buffer: arc_adapt() (from arc_get_data_buf()) arc_access(hdr, hash_lock) I.e., we first checked access to the MFU ghost/MRU ghost buffer and adapt MFU/MRU sizes (in arc_adapt()) and next move buffer from the ghost state to regular. After r305323 the sequence is different: arc_access(hdr, hash_lock); arc_hdr_alloc_pabd(hdr); I.e., we first move the buffer from the ghost state in arc_access() and then we check access to buffer in ghost state (in arc_hdr_alloc_pabd() -> arc_get_data_abd() -> arc_get_data_impl() -> arc_adapt()). This is incorrect: arc_adapt() never see access to the ghost buffer because arc_access() already migrated the buffer from the ghost state to regular. So, the fix is to restore a call to arc_adapt() before arc_access() and to suppress the call to arc_adapt() after arc_access(). Submitted by: Slawa Olhovchenkov <slw@zxy.spb.ru> MFC after: 2 weeks Sponsored by: Integros [integros.com] Differential Revision: https://reviews.freebsd.org/D19094 (cherry picked from commit c3d9fbde345e2dd806fb239c5b06fcfa870255f9) (cherry picked from commit eca6490)
While formally it is not necessary, but the sooner it start, the sooner it finish, and supposedly less disturbing for workload it will be. MFC after: 2 weeks (cherry picked from commit 83ee51bf028e778aa584f79403fc952ec15e2811) (cherry picked from commit 72b7b24)
When ARC size is very small, aggsum_lower_bound(&arc_size) may return negative values, that due to unsigned comparison caused delays, waiting for arc_adjust() to "fix" it by calling aggsum_value(&arc_size). Use of signed comparison there fixes the problem. MFC after: 2 weeks Sponsored by: iXsystems, Inc. (cherry picked from commit 7bd3ba93f748836f7101df6a860361378f42f538) (cherry picked from commit dc3e1f5)
Memory copy is too heavy operation to do under the congested lock. Moving it out reduces congestion by many times to almost invisible. Since the original zio removed from the queue, and the child zio is not executed yet, I don't see why would the copy need protection. My guess it just remained like this from the time when lock was not dropped here, which was added later to fix lock ordering issue. Multi-threaded sequential write tests with both HDD and SSD pools with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major reduction of lock congestion, saving from 15% to 35% of CPU time and increasing throughput from 10% to 40%. Reviewed by: ahrens, behlendorf, ryao MFC after: 2 weeks Sponsored by: iXsystems, Inc. (cherry picked from commit de44a22304f8ec0db52ffc36755b814097ba98d8) (cherry picked from commit 2c68077)
Normally td_runtime is updated on context switch, but there are some kernel threads that due to high absolute priority may run for many seconds without context switches (yes, that is bad, but that is true), which means their td_runtime was not updated all that time, that made them invisible for top other then as some general CPU usage. MFC after: 1 week Sponsored by: iXsystems, Inc. (cherry picked from commit a8eed19ff99997e7832dafc684bf999c5506dcf1) (cherry picked from commit 394f634)
Manual Illumos alignment does not fit us due to different kmutex_t size. MFC after: 1 week Sponsored by: iXsystems, Inc. (cherry picked from commit 24febf634a9d50aaf6d8a832890ac5456d1df400) (cherry picked from commit 2b0e1e8)
It is too generous to collect in production debug traces that can only be read with kernel debugger. Illumos includes special code in their mdb debugger to read it, we don't. MFC after: 1 week Sponsored by: iXsystems, Inc. (cherry picked from commit 07ff926fdda392f28d11d76f0559c576cc36fb16) (cherry picked from commit fb7cf57)
For busy ARC situation when arc_size close to arc_c is desired. But then it is quite likely that aggsum_compare(&arc_size, arc_c) will need to flush per-CPU buckets to find exact comparison result. Doing that often in a hot path penalizes whole idea of aggsum usage there, since it replaces few simple atomic additions with dozens of lock acquisitions. Replacing aggsum_compare() with aggsum_upper_bound() in code increasing arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles allows to save ~5% of CPU time in aggsum code during sequential write to 12 ZVOLs with 16KB block size on large dual-socket system. I suppose there some minor arc_p behavior change due to lower precision of the new code, but I don't think it is a big deal, since it should affect only very small window in time (aggsum buckets are flushed every second) and in ARC size (buckets are limited to 10 average ARC blocks per CPU). MFC after: 2 weeks Sponsored by: iXsystems, Inc. (cherry picked from commit d36b8746db91d7921eb81941847838872800ce33) (cherry picked from commit 75869d2)
UMA: unsign some variables related to allocation in hash_alloc(). As a followup to r343673, unsign some variables related to allocation since the hashsize cannot be negative. This gives a bit more space to handle bigger allocations and avoid some implicit casting. While here also unsign uh_hashmask, it makes little sense to keep it signed. Differential Revision: https://reviews.freebsd.org/D19148 (cherry picked from commit 2149862)
…onds. ZFS ABD allocates tons of 4KB chunks via UMA, requiring huge hash tables. With initial hash table size of only 32 elements it takes ~20 expansions or ~400 seconds to adapt to handling 220GB ZFS ARC. During that time not only the hash table is highly inefficient, but also each of those expan- sions takes significant time with the lock held, blocking operation. On my test system with 256GB of RAM and ZFS pool of 28 HDDs this change reduces time needed to first time read 240GB from ~300-400s, during which system is quite busy and unresponsive, to only ~150s with light CPU load and just 5 sub-second CPU spikes to expand the hash table. (cherry picked from commit 2b89445e2d111900e8e46d1e5f4535768ad0b9c8) (cherry picked from commit df4156e)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
No description provided.