OpenZFS 9112 - Improve allocation performance on high-end systems #7682
Conversation
@dweeezil: @pcd1193182 FYI, I'm working on porting OpenZFS 9102 (device initialization) and realized this (9112) was sort-of a prerequisite for it. I'll abandon my port of 9112 and use yours, and will look into some of the failures the bots encountered.
@pcd1193182: @dweeezil It's not exactly a prereq, but they are a bit intertwined. I'm working on debugging the issues this PR has run into; if something specific is blocking your progress on 9102, let me know and I'll try to focus that down first.
@dweeezil: @pcd1193182 I've gotten 9102 merged cleanly now and am running local tests before submitting the PR for it. It's based on your "paq" branch (rebased to this afternoon's master). As I suspected, 9102 merged pretty cleanly atop 9112.
@pcd1193182 you're going to need to rebase this PR on the current master to get the
@pcd1193182 regarding the ZTS failures:
That one issue aside, this LGTM.
@@ -29,7 +29,7 @@
 #

 export RESV_DELTA=5242880
-export RESV_TOLERANCE=5242880	# Acceptable limit (5MB) for diff in space stats
+export RESV_TOLERANCE=10485760	# Acceptable limit (5MB) for diff in space stats
Should the comment be changed here?
I'll fix this when integrating it, no need to refresh the PR.
Codecov Report
@@ Coverage Diff @@
## master #7682 +/- ##
==========================================
- Coverage 78.41% 77.91% -0.51%
==========================================
Files 368 366 -2
Lines 112142 112191 +49
==========================================
- Hits 87941 87412 -529
- Misses 24201 24779 +578
Continue to review full report at Codecov.
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Authored by: Paul Dagnelie pcd@delphix.com
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Serapheim Dimitropoulos serapheim.dimitro@delphix.com
Reviewed by: Alexander Motin mav@FreeBSD.org
Approved by: Gordon Ross gwr@nexenta.com
Ported-by: Paul Dagnelie pcd@delphix.com
OpenZFS-issue: https://www.illumos.org/issues/9112
OpenZFS-commit: openzfs/openzfs@3f3cc3c
Overview
We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time. Each
allocator for a given metaslab group has up to two active metaslabs: one
"primary" and one "secondary". The primary and secondary weights mean
the same thing they did in the pre-allocator world; primary metaslabs
are used for most allocations, and secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group. There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch. The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations. If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.
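The primary/steal logic described above can be sketched roughly as follows. This is an illustrative simplification, not the actual OpenZFS code; all identifiers (mg_pick, mga_primary, ms_free, and the allocator count) are invented for the example, and a single free-space integer stands in for the real metaslab weight.

```c
#include <assert.h>
#include <stddef.h>

#define MG_ALLOCATORS 2	/* assumed tunable: allocators per metaslab group */

typedef struct metaslab {
	int ms_free;	/* simplified stand-in for the metaslab weight */
} metaslab_t;

typedef struct mg_allocator {
	metaslab_t *mga_primary;	/* used for most allocations */
	metaslab_t *mga_secondary;	/* used for ditto blocks in this group */
} mg_allocator_t;

/*
 * Pick a metaslab for allocator `id`: prefer its own primary, and only
 * "steal" another allocator's active metaslab as a last resort, so that
 * allocations fail only when space is truly exhausted.
 */
static metaslab_t *
mg_pick(mg_allocator_t *allocs, int n, int id, int size)
{
	metaslab_t *ms = allocs[id].mga_primary;

	if (ms != NULL && ms->ms_free >= size)
		return (ms);
	for (int i = 0; i < n; i++) {	/* steal pass */
		ms = allocs[i].mga_primary;
		if (ms != NULL && ms->ms_free >= size)
			return (ms);
	}
	return (NULL);
}
```

The real code also searches the group's passive metaslabs before stealing; the sketch collapses that step to keep the fallback ordering visible.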
In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk. As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.
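The ramping queue-depth compromise can be illustrated with a small sketch. Again, these names and the starting depth are assumptions for the example, not the actual OpenZFS identifiers or default values.

```c
#include <assert.h>

/*
 * Illustrative sketch: each allocator's alloc queue starts every txg
 * with a low depth cap, and the cap grows as IOs complete.
 */
#define AQ_INITIAL_DEPTH 4	/* assumed starting depth per txg */

typedef struct alloc_queue {
	int aq_max_depth;	/* current cap on inflight IOs */
	int aq_inflight;	/* IOs issued but not yet completed */
} alloc_queue_t;

static void
aq_txg_reset(alloc_queue_t *aq)
{
	aq->aq_max_depth = AQ_INITIAL_DEPTH;	/* same low start each txg */
}

static int
aq_can_issue(const alloc_queue_t *aq)
{
	return (aq->aq_inflight < aq->aq_max_depth);
}

static void
aq_io_done(alloc_queue_t *aq)
{
	aq->aq_inflight--;
	aq->aq_max_depth++;	/* every completion raises the cap */
}
```

Starting low bounds the inflight IO count on low-end systems, while the per-completion increment lets fast devices earn deeper queues and better coalescing within the same txg.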
We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.
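The per-allocator split of the allocation queue can be pictured as below. This is a hedged sketch: the struct and function names are invented, and the simple modulo assignment merely stands in for whatever policy the real code uses to distribute IOs across allocators.

```c
#include <assert.h>

#define SPA_ALLOCATORS 4	/* assumed: same allocator count as above */

/*
 * Sketch: instead of one global spa_alloc_tree protected by one
 * spa_alloc_lock, keep one queue per allocator so concurrent writers
 * contend on different structures.
 */
typedef struct spa_alloc {
	int sa_queued;	/* stand-in for the per-allocator alloc tree */
} spa_alloc_t;

/* Assign an IO to an allocator (modulo is illustrative only). */
static int
spa_pick_allocator(unsigned int io_id)
{
	return ((int)(io_id % SPA_ALLOCATORS));
}

static void
spa_queue_io(spa_alloc_t *allocs, unsigned int io_id)
{
	allocs[spa_pick_allocator(io_id)].sa_queued++;
}
```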
Performance Results
Performance results on Linux show a small improvement, on the order
of 5-10%. Unfortunately, it seems that there are different bottlenecks
on Linux than there are in the Illumos codebase, so the significant wins we
saw there do not translate. However, when these bottlenecks are addressed,
this change may cause a significant additional boost. On Illumos, for an fio async
sequential write workload on a 24 core NUMA system with 256 GB of RAM
and 8 128 GB SSDs, there is a roughly 25% performance improvement.