OpenZFS 9112 - Improve allocation performance on high-end systems #7682

pcd1193182 · 2018-07-05T18:58:43Z

Authored by: Paul Dagnelie pcd@delphix.com
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Serapheim Dimitropoulos serapheim.dimitro@delphix.com
Reviewed by: Alexander Motin mav@FreeBSD.org
Approved by: Gordon Ross gwr@nexenta.com
Ported-by: Paul Dagnelie pcd@delphix.com
OpenZFS-issue: https://www.illumos.org/issues/9112
OpenZFS-commit: openzfs/openzfs@3f3cc3c

Overview

We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time. Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group. There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch. The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations. If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.

In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk. As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.

Performance Results

Performance results on Linux are a small improvement, on the order
of a 5-10%. Unfortunately, it seems that there are different bottlenecks
on Linux than there are in the Illumos codebase, so the significant wins we
saw there do not translate. However, when these bottlenecks are addressed,
this change may cause a significant additional boost. On Illumos, for an fio async
sequential write workload on a 24 core NUMA system with 256 GB of RAM
and 8 128 GB SSDs, there is a roughly 25% performance improvement.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the ZFS on Linux code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
All new and existing tests passed.
All commit messages are properly formatted and contain Signed-off-by.
Change has been approved by a ZFS on Linux member.

dweeezil · 2018-07-10T16:41:24Z

@pcd1193182 FYI, I'm working on porting OpenZFS 9102 (device initialization) and realized this (9112) was sort-of a prerequisite for it. I'll abandon my port of 9112 and use yours and will look into some of the failures the bots encountered.

pcd1193182 · 2018-07-10T21:52:14Z

@dweeezil It's not exactly a prereq, but they are a bit intertwined. I'm working on debugging the issues this PR has run into; if something specific is blocking your progress on 9102, let me know and I'll try to focus that down first.

dweeezil · 2018-07-11T23:14:00Z

@pcd1193182 I've gotten 9102 merged cleanly now and am running local tests before submitting the PR for it. It's based on your "paq" branch (rebased to this afternoon's master). As I suspected 9102 merged pretty cleanly atop 9112.

behlendorf · 2018-07-11T23:26:10Z

@pcd1193182 you're going to need to rebase this PR on the current master to get the zimport tests to pass. The issue is that this PR is based off the commit prior to the v2 spacemap encoding being merged. The zimport tests verify this PR can import a pool created by the master branch and it can't due to the missing feature. The test case also verifies the opposite, which is working, that the current master branch can import a pool created by this PR.

behlendorf · 2018-07-27T21:40:16Z

@pcd1193182 regarding the ZTS failures:

reservation_017_pos - This is unrelated, you just got unlucky. I've opened ZTS: Fix reservation_017_pos #7750 with a fix.
reservation_008_pos - This test case has been occasionally failing in master for a while now unrelated to this change, Test case: reservation_008_pos #7741. It was grudgingly added to the exception list recently, e106a7b.
reservation_004_pos - This failure has never been observed on master, only in this PR. That said, my best guess is that this patch isn't really responsible either, and it's a combination of the recent metaslab changes plus this change which pushes the test slightly outside the expected tolerance. Rather than get bogged down by this, I'd be OK with slightly increasing the tolerate and adding a comment to the test. Particularly because more space was available than the test expected not less.

That one issue aside this LGTM.

Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Alexander Motin <mav@FreeBSD.org> Approved by: Gordon Ross <gwr@nexenta.com>

Signed-off-by: Paul Dagnelie <pcd@delphix.com>

drescherjm · 2018-07-30T21:37:47Z

tests/zfs-tests/tests/functional/reservation/reservation.cfg

@@ -29,7 +29,7 @@
 #

 export RESV_DELTA=5242880
-export RESV_TOLERANCE=5242880	# Acceptable limit (5MB) for diff in space stats
+export RESV_TOLERANCE=10485760	# Acceptable limit (5MB) for diff in space stats


Should the comment be changed here?

I'll fix this when integrating it, no need to refresh the PR.

codecov · 2018-07-31T05:55:40Z

Codecov Report

Merging #7682 into master will decrease coverage by 0.5%.
The diff coverage is 95.6%.

@@            Coverage Diff             @@
##           master    #7682      +/-   ##
==========================================
- Coverage   78.41%   77.91%   -0.51%     
==========================================
  Files         368      366       -2     
  Lines      112142   112191      +49     
==========================================
- Hits        87941    87412     -529     
- Misses      24201    24779     +578

Flag	Coverage Δ
#kernel	`78.43% <97.03%> (-0.47%)`	⬇️
#user	`66.86% <94.4%> (-0.78%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dae3e9e...43d8272. Read the comment docs.

Signed-off-by: Paul Dagnelie <pcd@delphix.com>

pcd1193182 force-pushed the paq branch from c43d8e4 to 00011e2 Compare July 11, 2018 23:41

behlendorf added the Reviewed label Jul 27, 2018

pcd1193182 added 3 commits July 30, 2018 13:31

macro?

5b75a8d

fix reservation test failures by increasing tolerance

5bd8b36

Signed-off-by: Paul Dagnelie <pcd@delphix.com>

pcd1193182 force-pushed the paq branch from 9e1e24e to 5bd8b36 Compare July 30, 2018 20:37

behlendorf approved these changes Jul 30, 2018

View reviewed changes

drescherjm reviewed Jul 30, 2018

View reviewed changes

behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Jul 31, 2018

update comment

43d8272

Signed-off-by: Paul Dagnelie <pcd@delphix.com>

behlendorf closed this in 492f64e Jul 31, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OpenZFS 9112 - Improve allocation performance on high-end systems #7682

OpenZFS 9112 - Improve allocation performance on high-end systems #7682

pcd1193182 commented Jul 5, 2018

dweeezil commented Jul 10, 2018

pcd1193182 commented Jul 10, 2018

dweeezil commented Jul 11, 2018

behlendorf commented Jul 11, 2018

behlendorf commented Jul 27, 2018

drescherjm Jul 30, 2018

behlendorf Jul 31, 2018

codecov bot commented Jul 31, 2018 •

edited

OpenZFS 9112 - Improve allocation performance on high-end systems #7682

OpenZFS 9112 - Improve allocation performance on high-end systems #7682

Conversation

pcd1193182 commented Jul 5, 2018

Overview

Performance Results

Types of changes

Checklist:

dweeezil commented Jul 10, 2018

pcd1193182 commented Jul 10, 2018

dweeezil commented Jul 11, 2018

behlendorf commented Jul 11, 2018

behlendorf commented Jul 27, 2018

drescherjm Jul 30, 2018

Choose a reason for hiding this comment

behlendorf Jul 31, 2018

Choose a reason for hiding this comment

codecov bot commented Jul 31, 2018 • edited

Codecov Report

codecov bot commented Jul 31, 2018 •

edited