Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NAS-102535: Backport some OS/ZFS bugfixes #204

Merged
merged 13 commits into from Jul 17, 2019
Merged

NAS-102535: Backport some OS/ZFS bugfixes #204

merged 13 commits into from Jul 17, 2019

Conversation

@amotin
Copy link
Collaborator

@amotin amotin commented Jul 17, 2019

No description provided.

allanjude and others added 13 commits Jun 6, 2019
MFV/ZoL: `zfs userspace` ignored all unresolved UIDs after the first

openzfs/zfs@88cfff1

zfs_main: fix `zfs userspace` squashing unresolved entries

The `zfs userspace` squashes all entries with unresolved numeric
values into a single output entry due to the comparsion always
made by the string name which is empty in case of unresolved IDs.

Fix this by falling to a numerical comparison when either one
of string values is not found. This then compares any numerical
values after all with a name resolved.

Signed-off-by: Pavel Boldin <boldin.pavel@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Reported by:	clusteradm
Obtained from:	ZFS-on-Linux

Approved by:	re (gjb)

(cherry picked from commit cc93071)
This allows to reduce memory waste by letting UMA to put multiple small
buffers into one memory page slab.  The page sharing means that UMA
may not be able to free memory page when some of buffers are freed, but
alternatively memory used by that buffer would just be wasted from the
beginning.

This change follows alike change in ZoL, but unlike Linux (according to
my understanding of it from comments) FreeBSD never shares slabs bigger
then one memory page, so this should be even less invasive then there.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.

(cherry picked from commit d4527f36d52a6b83a203b54ce67bb9d441bd1c96)
(cherry picked from commit 0c4e60b)
illumos/illumos-gate@2258ad0

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     George Wilson <george.wilson@delphix.com>

(cherry picked from commit fec436759eb17690404be56433f495e2da7650df)
(cherry picked from commit 1502d57)
Before r305323 (MFV r302991: 6950 ARC should cache compressed data)
arc_read() code did this for access to a ghost buffer:
 arc_adapt() (from arc_get_data_buf())
 arc_access(hdr, hash_lock)
I.e., we first checked access to the MFU ghost/MRU ghost buffer and
adapt MFU/MRU sizes (in arc_adapt()) and next move buffer from the ghost
state to regular.

After r305323 the sequence is different:
 arc_access(hdr, hash_lock);
 arc_hdr_alloc_pabd(hdr);
I.e., we first move the buffer from the ghost state in arc_access() and
then we check access to buffer in ghost state (in arc_hdr_alloc_pabd()
-> arc_get_data_abd() -> arc_get_data_impl() -> arc_adapt()).  This is
incorrect: arc_adapt() never see access to the ghost buffer because
arc_access() already migrated the buffer from the ghost state to
regular.

So, the fix is to restore a call to arc_adapt() before arc_access() and
to suppress the call to arc_adapt() after arc_access().

Submitted by:	Slawa Olhovchenkov <slw@zxy.spb.ru>
MFC after:	2 weeks
Sponsored by:	Integros [integros.com]
Differential Revision: https://reviews.freebsd.org/D19094

(cherry picked from commit c3d9fbde345e2dd806fb239c5b06fcfa870255f9)
(cherry picked from commit eca6490)
While formally it is not necessary, but the sooner it start, the sooner it
finish, and supposedly less disturbing for workload it will be.

MFC after:	2 weeks

(cherry picked from commit 83ee51bf028e778aa584f79403fc952ec15e2811)
(cherry picked from commit 72b7b24)
When ARC size is very small, aggsum_lower_bound(&arc_size) may return
negative values, that due to unsigned comparison caused delays, waiting
for arc_adjust() to "fix" it by calling aggsum_value(&arc_size).  Use
of signed comparison there fixes the problem.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.

(cherry picked from commit 7bd3ba93f748836f7101df6a860361378f42f538)
(cherry picked from commit dc3e1f5)
Memory copy is too heavy operation to do under the congested lock.
Moving it out reduces congestion by many times to almost invisible.
Since the original zio removed from the queue, and the child zio is
not executed yet, I don't see why would the copy need protection.
My guess it just remained like this from the time when lock was not
dropped here, which was added later to fix lock ordering issue.

Multi-threaded sequential write tests with both HDD and SSD pools
with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major
reduction of lock congestion, saving from 15% to 35% of CPU time
and increasing throughput from 10% to 40%.

Reviewed by:	ahrens, behlendorf, ryao
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.

(cherry picked from commit de44a22304f8ec0db52ffc36755b814097ba98d8)
(cherry picked from commit 2c68077)
Normally td_runtime is updated on context switch, but there are some kernel
threads that due to high absolute priority may run for many seconds without
context switches (yes, that is bad, but that is true), which means their
td_runtime was not updated all that time, that made them invisible for top
other then as some general CPU usage.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.

(cherry picked from commit a8eed19ff99997e7832dafc684bf999c5506dcf1)
(cherry picked from commit 394f634)
Manual Illumos alignment does not fit us due to different kmutex_t size.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.

(cherry picked from commit 24febf634a9d50aaf6d8a832890ac5456d1df400)
(cherry picked from commit 2b0e1e8)
It is too generous to collect in production debug traces that can only
be read with kernel debugger.  Illumos includes special code in their
mdb debugger to read it, we don't.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.

(cherry picked from commit 07ff926fdda392f28d11d76f0559c576cc36fb16)
(cherry picked from commit fb7cf57)
For busy ARC situation when arc_size close to arc_c is desired.  But
then it is quite likely that aggsum_compare(&arc_size, arc_c) will need
to flush per-CPU buckets to find exact comparison result.  Doing that
often in a hot path penalizes whole idea of aggsum usage there, since it
replaces few simple atomic additions with dozens of lock acquisitions.

Replacing aggsum_compare() with aggsum_upper_bound() in code increasing
arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles
allows to save ~5% of CPU time in aggsum code during sequential write
to 12 ZVOLs with 16KB block size on large dual-socket system.

I suppose there some minor arc_p behavior change due to lower precision
of the new code, but I don't think it is a big deal, since it should
affect only very small window in time (aggsum buckets are flushed every
second) and in ARC size (buckets are limited to 10 average ARC blocks
per CPU).

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.

(cherry picked from commit d36b8746db91d7921eb81941847838872800ce33)
(cherry picked from commit 75869d2)
UMA: unsign some variables related to allocation in hash_alloc().

As a followup to r343673, unsign some variables related to allocation
since the hashsize cannot be negative. This gives a bit more space to
handle bigger allocations and avoid some implicit casting.

While here also unsign uh_hashmask, it makes little sense to keep it
signed.

Differential Revision:	https://reviews.freebsd.org/D19148

(cherry picked from commit 2149862)
…onds.

ZFS ABD allocates tons of 4KB chunks via UMA, requiring huge hash tables.
With initial hash table size of only 32 elements it takes ~20 expansions
or ~400 seconds to adapt to handling 220GB ZFS ARC.  During that time not
only the hash table is highly inefficient, but also each of those expan-
sions takes significant time with the lock held, blocking operation.

On my test system with 256GB of RAM and ZFS pool of 28 HDDs this change
reduces time needed to first time read 240GB from ~300-400s, during which
system is quite busy and unresponsive, to only ~150s with light CPU load
and just 5 sub-second CPU spikes to expand the hash table.

(cherry picked from commit 2b89445e2d111900e8e46d1e5f4535768ad0b9c8)
(cherry picked from commit df4156e)
@amotin amotin merged commit 1c9e23a into freenas/11.2-stable Jul 17, 2019
@amotin amotin deleted the NAS-102535 branch Jul 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants