Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenZFS 9112 - Improve allocation performance on high-end systems #7682

Closed
wants to merge 4 commits into from

Conversation

pcd1193182
Copy link
Contributor

Authored by: Paul Dagnelie pcd@delphix.com
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Serapheim Dimitropoulos serapheim.dimitro@delphix.com
Reviewed by: Alexander Motin mav@FreeBSD.org
Approved by: Gordon Ross gwr@nexenta.com
Ported-by: Paul Dagnelie pcd@delphix.com
OpenZFS-issue: https://www.illumos.org/issues/9112
OpenZFS-commit: openzfs/openzfs@3f3cc3c

Overview

We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time. Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group. There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch. The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations. If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.

In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk. As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.

Performance Results

Performance results on Linux are a small improvement, on the order
of a 5-10%. Unfortunately, it seems that there are different bottlenecks
on Linux than there are in the Illumos codebase, so the significant wins we
saw there do not translate. However, when these bottlenecks are addressed,
this change may cause a significant additional boost. On Illumos, for an fio async
sequential write workload on a 24 core NUMA system with 256 GB of RAM
and 8 128 GB SSDs, there is a roughly 25% performance improvement.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@dweeezil
Copy link
Contributor

@pcd1193182 FYI, I'm working on porting OpenZFS 9102 (device initialization) and realized this (9112) was sort-of a prerequisite for it. I'll abandon my port of 9112 and use yours and will look into some of the failures the bots encountered.

@pcd1193182
Copy link
Contributor Author

@dweeezil It's not exactly a prereq, but they are a bit intertwined. I'm working on debugging the issues this PR has run into; if something specific is blocking your progress on 9102, let me know and I'll try to focus that down first.

@dweeezil
Copy link
Contributor

@pcd1193182 I've gotten 9102 merged cleanly now and am running local tests before submitting the PR for it. It's based on your "paq" branch (rebased to this afternoon's master). As I suspected 9102 merged pretty cleanly atop 9112.

@behlendorf
Copy link
Contributor

@pcd1193182 you're going to need to rebase this PR on the current master to get the zimport tests to pass. The issue is that this PR is based off the commit prior to the v2 spacemap encoding being merged. The zimport tests verify this PR can import a pool created by the master branch and it can't due to the missing feature. The test case also verifies the opposite, which is working, that the current master branch can import a pool created by this PR.

@behlendorf
Copy link
Contributor

@pcd1193182 regarding the ZTS failures:

  • reservation_017_pos - This is unrelated, you just got unlucky. I've opened ZTS: Fix reservation_017_pos #7750 with a fix.

  • reservation_008_pos - This test case has been occasionally failing in master for a while now unrelated to this change, Test case: reservation_008_pos #7741. It was grudgingly added to the exception list recently, e106a7b.

  • reservation_004_pos - This failure has never been observed on master, only in this PR. That said, my best guess is that this patch isn't really responsible either, and it's a combination of the recent metaslab changes plus this change which pushes the test slightly outside the expected tolerance. Rather than get bogged down by this, I'd be OK with slightly increasing the tolerate and adding a comment to the test. Particularly because more space was available than the test expected not less.

That one issue aside this LGTM.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
@@ -29,7 +29,7 @@
#

export RESV_DELTA=5242880
export RESV_TOLERANCE=5242880 # Acceptable limit (5MB) for diff in space stats
export RESV_TOLERANCE=10485760 # Acceptable limit (5MB) for diff in space stats

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the comment be changed here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix this when integrating it, no need to refresh the PR.

@codecov
Copy link

codecov bot commented Jul 31, 2018

Codecov Report

Merging #7682 into master will decrease coverage by 0.5%.
The diff coverage is 95.6%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7682      +/-   ##
==========================================
- Coverage   78.41%   77.91%   -0.51%     
==========================================
  Files         368      366       -2     
  Lines      112142   112191      +49     
==========================================
- Hits        87941    87412     -529     
- Misses      24201    24779     +578
Flag Coverage Δ
#kernel 78.43% <97.03%> (-0.47%) ⬇️
#user 66.86% <94.4%> (-0.78%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dae3e9e...43d8272. Read the comment docs.

@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Jul 31, 2018
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants