
ZIL Pipelining #6133

Closed

wants to merge 1 commit into from

Conversation

ryao
Contributor

@ryao ryao commented May 14, 2017

Description

This modifies the ZIL to allow the lwb_t objects of each ZIL commit to be generated while the previous ZIL commit's lwb_t objects are being written.

This is the first version that both passes all of the tests I throw at it and does not appear to risk data integrity. I have a few ideas on how to slightly improve locking, but I feel good enough about this version that I am putting it up for additional scrutiny.

Motivation and Context

This improves synchronous IO performance. The Phoronix benchmarks will likely benefit because many of them are synchronous-IO heavy.

How Has This Been Tested?

At work, we have a couple of dual-socket Xeon E5 v4 systems with 20 cores each in total, a dozen SSDs, and NVRAM for SLOG. There is a semi-proprietary benchmark that measures IOPS on 4 zvols on each system simultaneously, which we use to quantify performance. Furthermore, sync=always is set on the zvols. This code passes that benchmark while boosting performance by at least 50%. Without NVRAM, performance goes from 19,000 IOPS to at least 29,000 IOPS. This is in a configuration with volblocksize=4k and all 12 SSDs in RAIDZ-2 with ashift=12 (and yes, I know this is not space efficient). Also, this passes ztest; earlier versions did not.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@ryao ryao added OpenZFS Review Type: Performance Performance improvement or performance problem labels May 14, 2017
@ryao ryao requested review from behlendorf and ahrens May 14, 2017 12:15
@mention-bot

@ryao, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @nedbass and @ahrens to be potential reviewers.

@ryao
Contributor Author

ryao commented May 14, 2017

The buildbot made it blatantly obvious that I missed a race between this and txg_sync. I am working on a fix now.

@ahrens
Member

ahrens commented May 15, 2017

@prakashsurya and I are also working on significant changes to improve ZIL performance. @ryao could you summarize the architectural changes you're proposing here? We can also share our design doc and hopefully soon code as well. We can figure out if our changes are going after the same problems with ZIL performance, or if they are complementary.

@prakashsurya
Member

I would be happy to share my code, with the caveat that it still isn't finished. What I have passes the normal regression testing (zfstest and zloop), so it's probably in a good enough state for others to check out the overall design.

@ryao
Contributor Author

ryao commented May 15, 2017

@ahrens Right now, zil_commit() has a few distinct phases:

  1. Move async itxs for the foid to the sync queues.
  2. Get the list of itxs to commit.
  3. Generate a linked list of lwbs representing the itxs to commit.
  4. Write out the lwbs.
  5. Block on the root zio of the lwbs.

Phase 1 is concurrent, while phases 2 through 5 are effectively a single-threaded critical section (ignoring ZIO). The idea here is to reduce the critical section to phases 2 and 3. This reduces latency because the next ZIL commit no longer has to wait for the previous one to clear phase 5 before it can enter phase 2; the previous one just needs to clear phase 3, which is much faster. It also improves IOPS by allowing multiple threads to operate in phases 4 and 5. Finally, eliminating the batching reduces the incidence of thundering herds.

At a lower level, what I am really doing is refactoring zil_commit_writer() and friends to operate on a per-commit object (zil_writer_t) that holds the state private to each commit once phase 3 ends. These objects live on a linked list representing all ZIL commits going through the pipeline, and they carry the information needed to let phases 4 and 5 run concurrently while phases 2 and 3 remain single threaded. A commit is not allowed to exit phase 5 before any commit earlier in the pipeline, to ensure we do not return before things are on stable storage.
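
To make the shape of this concrete, here is a rough sketch of the per-commit object and its invariants. Everything other than the zil_writer_t name is my shorthand, and it assumes the usual ZFS kernel headers, so treat it as illustration rather than the patch itself:

typedef struct zil_writer {
	list_node_t	zilw_node;	/* link in the zilog's pipeline list */
	list_t		zilw_lwbs;	/* lwb_t objects built in phase 3 */
	zio_t		*zilw_root_zio;	/* root zio blocked on in phase 5 */
	kcondvar_t	zilw_cv;	/* signalled once this commit is stable */
	boolean_t	zilw_done;	/* phase 5 finished for this commit */
} zil_writer_t;

/*
 * Invariants, per the description above:
 *  - phases 2 and 3 run single threaded, so the pipeline list stays
 *    ordered by commit;
 *  - phases 4 and 5 run concurrently across writers;
 *  - a writer may not signal zilw_done before every writer ahead of it
 *    in the list has done so, preserving the stable-storage guarantee.
 */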

@prakashsurya I'd appreciate it if you could share that. People at work want me to get a group of improvements together for a demo of future performance that takes place in about 36 hours. If our changes turn out to be complementary, it could be rather helpful to combine them.

@prakashsurya
Member

Quickly scanning your diff on my phone, I think we're trying to solve the same problem, but in slightly different ways. I can't look at it in detail right now, though.

@ryao
Contributor Author

ryao commented May 15, 2017

@prakashsurya I had a hunch that we were. There isn't much opportunity to improve the ZIL beyond reducing the size of the critical section without resorting to a disk format change. I have an idea for one where particularly active datasets could be given multiple ZIL pipelines, which would also improve replay, but that is an idea to explore after getting the basics right.

Edit: I suppose we could also allow an early exit when the itx list is empty and no other zil_commit_writer exists that is handling the same foid (with the exception of foid=0, which would block anyway). I have not given that much thought since I am focused on zvol performance for work.

prakashsurya pushed a commit to prakashsurya/openzfs that referenced this pull request May 15, 2017
This is still a work in progress, and isn't meant to be landed in its
current state. I'm only pushing this code publicly now, rather than when
it's "done", so the design of the changes implemented here can be
compared with the code in the ZFS on Linux PR here:
openzfs/zfs#6133

There's a "high level" comment about the new design of `zil_commit()` in
a code comment above that function; please see that for more details.
@ryao ryao force-pushed the zfs_pipelining branch 3 times, most recently from c3483c3 to 4604c8c Compare May 15, 2017 16:47
@ryao
Contributor Author

ryao commented May 15, 2017

It turns out that I had two races. I have my fingers crossed that this is the last of them. :)

@prakashsurya Thanks for sharing. I like the idea of using lwb to simplify the arguments to *_get_data callbacks. I think I will "borrow" it for my version. :)

Your code appears to have a per-lwb vdev tree. That should cause us to flush excessively when a userland application does many IOs and then calls fsync(). I am not sure I like that. You might want to copy my idea of having a per-commit vdev tree protected by reference counts; then whatever notifies the waiter of completion could do a flush before performing the notification. This ought to reduce the number of flushes to one per commit, which is the best we can do without implementing FUA. Roughly, the completion path would look like the sketch below.
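
Here is that sketch; the helper and field names (zil_merge_vdevs(), zilw_flush_vdevs(), zilw_vdev_lock, zilw_refcount) are hypothetical shorthand rather than the patch's actual identifiers:

/*
 * Each lwb's done callback merges the vdevs its zio touched into the
 * commit-wide tree and drops a reference; whoever drops the last
 * reference issues a single flush for the whole commit before waking
 * the waiters.
 */
static void
zilw_lwb_done(zil_writer_t *zilw, zio_t *zio)
{
	mutex_enter(&zilw->zilw_vdev_lock);
	zil_merge_vdevs(&zilw->zilw_vdev_tree, zio);
	mutex_exit(&zilw->zilw_vdev_lock);

	if (refcount_remove(&zilw->zilw_refcount, zio) == 0) {
		zilw_flush_vdevs(zilw);		/* one flush per commit */
		mutex_enter(&zilw->zilw_lock);
		zilw->zilw_done = B_TRUE;
		cv_broadcast(&zilw->zilw_cv);	/* wake the committers */
		mutex_exit(&zilw->zilw_lock);
	}
}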

I suggest we benchmark our approaches against each other once they are mature, enhance the stronger one with any good ideas from the weaker one and get that version merged.

@prakashsurya
Member

Your code appears to have a per-lwb vdev tree. That should cause us to flush excessively when a userland application does many IOs and then calls fsync().

Right, that's something we are concerned about, but we didn't want to overly complicate matters until we had some performance results to lean on.

@ryao
Contributor Author

ryao commented May 15, 2017

@prakashsurya I had the same thought about implementing a mechanism to short-circuit on a unique foid > 0 when there is no foid = 0 commit in the pipeline. It looks like your code handles that while mine does not; in mine it would be a check along the lines of the sketch below.
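
In code, the check would be something along these lines (illustrative only; zil_itxs_pending() and zilw_pipeline_has_foid() are hypothetical helpers standing in for the real bookkeeping):

static boolean_t
zil_commit_can_short_circuit(zilog_t *zilog, uint64_t foid)
{
	return (foid != 0 &&			/* not an all-objects commit */
	    !zil_itxs_pending(zilog, foid) &&	/* nothing queued for foid */
	    !zilw_pipeline_has_foid(zilog, 0));	/* no foid == 0 in flight */
}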

By the way, it looks like there is a third race that I missed in the testing at work, this time between zil_commit_writer() and zil_destroy(). I should fix it soon; I want the buildbot to be all green.

@ryao ryao force-pushed the zfs_pipelining branch 4 times, most recently from 26c114a to 05d33c6 Compare May 15, 2017 21:40
@sempervictus
Contributor

A bunch of ztest runs later, the following gem cropped up:

ztest: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
/usr/bin/ztest(+0xbd4f)[0x3e6bd28d4f]
/usr/lib/libpthread.so.0(+0x11fe0)[0x31135565fe0]
/usr/lib/libc.so.6(gsignal+0x110)[0x311351e3a10]
/usr/lib/libc.so.6(abort+0x17a)[0x311351e513a]
/usr/lib/libc.so.6(+0x2c607)[0x311351dc607]
/usr/lib/libc.so.6(+0x2c6b2)[0x311351dc6b2]
/usr/lib/libpthread.so.0(pthread_mutex_lock+0x210)[0x3113555dcc0]
/usr/lib/libzpool.so.2(mutex_enter+0x33)[0x31135abe653]
/usr/lib/libzpool.so.2(zil_add_block+0xe7)[0x31135bb94e7]
/usr/bin/ztest(+0xa3fb)[0x3e6bd273fb]
/usr/lib/libzpool.so.2(+0x59861)[0x31135ade861]
/usr/lib/libzpool.so.2(+0x48be6)[0x31135acdbe6]
/usr/lib/libzpool.so.2(+0x13d7b1)[0x31135bc27b1]
/usr/lib/libzpool.so.2(zio_execute+0x9b)[0x31135bbff5b]
/usr/lib/libzpool.so.2(+0x3c1af)[0x31135ac11af]
/usr/lib/libzpool.so.2(zk_thread_helper+0x1fc)[0x31135abd5cc]
/usr/lib/libpthread.so.0(+0x72e7)[0x3113555b2e7]
/usr/lib/libc.so.6(clone+0x3f)[0x3113529c54f]

with

/usr/lib/libzpool.so.2(mutex_enter+0x33)[0x31135abe653]
/usr/lib/libzpool.so.2(zil_add_block+0xe7)[0x31135bb94e7]

being the apparent culprit, and

ec693a2ebb module/zfs/zil.c       (Richard Yao      2017-05-14 08:00:37 -0400  854) zil_add_block(zil_writer_t *zilw, const blkptr_t *bp)

indicating it's from this PR.
The test system is a 4.9 kernel in a KVM VM; the kernel has the last grsec patch (though I don't think that's the problem here).

@ryao
Contributor Author

ryao commented May 18, 2017

@sempervictus Thanks for testing. The culprit turns out to be 2ba96b5, which I did after seeing how @prakashsurya did things in his patch. What happens is that the lwb_t was made by the previous pipeline writer. It caches a pointer to that writer's zilw_t, which is then freed. When the lwb_t is used by the next writer, this callback operates on the old zilw_t that had been freed, and boom. That was a really subtle mistake that was hard to find; my first thought was that it was an existing bug that I had somehow managed to make more common with my latest fixes. Anyway, I will refresh this shortly with the fix after confirming this theory locally.
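
In generic terms, the lifetime bug reduces to the following pattern (a standalone model of the mistake, not actual ZFS code):

#include <stdio.h>
#include <stdlib.h>

struct zilw { int seq; };		/* per-commit writer state */
struct lwb { struct zilw *lwb_zilw; };	/* caches its creating writer */

int
main(void)
{
	struct zilw *prev = calloc(1, sizeof (*prev));
	struct lwb lwb = { .lwb_zilw = prev };	/* lwb made by previous writer */

	free(prev);	/* previous commit finishes; its zilw_t is freed... */

	/* ...but the cached pointer survives into the next commit: */
	printf("%d\n", lwb.lwb_zilw->seq);	/* use-after-free */
	return (0);
}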

@behlendorf
Contributor

@sempervictus those traces were from the automated kmemleak buildbot test results. After one run of the ZFS Test Suite we verify nothing was leaked for any of the test cases.

@ryao
Contributor Author

ryao commented Jun 8, 2017

@sempervictus @behlendorf My apologies for the delays. I am using separate machines for pushing, developing, and testing, so I have been trying to batch process these things. I have also been stuck trying to diagnose a dedup issue at work, which might not have worked out too well in conjunction with the manual batch processing. I will get this updated, rebased, and pushed in the morning though. :)

@wangdbang

wangdbang commented Jun 9, 2017

Hello all, I ported this patch back to 0.6.5.8 and set up an all-SSD environment: five Intel 3510 SSDs in a raidz1, with no separate ZIL device. The test tool was fio with 4k random writes. I could get IOPS from 7k to 20k+; it is not stable, but it is a big improvement, since the 4k random write value was 5k+ without the patch. 4k random reads were stable, always delivering 33k+. Thank you for the great job. Has anybody else seen this unstable 4k random write performance?

@wangdbang

After several 4k random write tests, the low value dropped to 3k+. In the iostat -xm 1 view, the zvol was sometimes 100% busy while most of the time the disks were not, which suggests the zvol still has other bottlenecks.

@ryao
Contributor Author

ryao commented Jun 9, 2017

I had intended to push this in the morning, but better late than never. Anyway, there was a small cstyle issue from some debug code I had added. Rather than force push, I just added a small commit that we can squash later if the buildbot and reviewers are happy with it.

@ryao
Contributor Author

ryao commented Jun 9, 2017

@sempervictus The latest version has passed at least 100 hours of strenuous testing at work. However, there is one issue that the buildbot had caught that I have not fully digested. If I understood it correctly, it is a pre-existing race that dates back to OpenSolaris, but it is complex enough that I am not 100% certain. I haven't had much time to analyze it since spotting it in the buildbot. If the rebase @behlendorf suggested makes it go away, I would be happy saying that the current patch is probably okay. At the very least, I am increasingly confident that this is no worse than the existing code.

@ryao
Contributor Author

ryao commented Jun 9, 2017

@wangdbang There are other issues in the zvol code. #6207 is intended to fix some of them. I just pushed it. You might want to try it too.

@wangdbang

wangdbang commented Jun 9, 2017

@ryao, thank you for the pointer; I will try it.

@prakashsurya
Member

@ryao I'd like to try and get some performance numbers of this change, and compare it with the work that I'm doing. Any chance you could port this to OpenZFS and open a PR against that project?

@ryao
Contributor Author

ryao commented Jun 13, 2017

@prakashsurya Yes. I'll do that today.

The previous ZIL code implemented batching: all writers were blocked
while an existing ZIL commit was in progress, which was bad for
latency. This patch changes the ZIL code to allow ZIL commits and ZIL
zio operations to occur simultaneously through a pipeline, which both
reduces latency and increases IOPS in synchronous-IO-heavy workloads.

Signed-off-by: Richard Yao <richard.yao@prophetstor.com>
@ryao
Contributor Author

ryao commented Jun 13, 2017

@prakashsurya Here is a first stab at a port. I will lean on the buildbot to find any syntactic errors:

openzfs/openzfs#404

Also, as a bonus, I ported the other patch that just landed in ZoL:

openzfs/openzfs#403

@ryao ryao mentioned this pull request Jun 16, 2017
prakashsurya pushed a commit to prakashsurya/openzfs that referenced this pull request Jul 10, 2017
Notes
=====

This change addresses the same deficiencies that Richard Yao is
addressing in the following pull requests:

 - openzfs/zfs#6133
 - openzfs#404

Problem
=======

The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:

 1. When there are outstanding ZIL blocks being written (i.e. there's
    already a "writer thread" in progress), any new calls to
    zil_commit() will block waiting for the currently outstanding ZIL
    blocks to complete. The blocks written for each "writer thread" are
    coined a "batch", and there can only ever be a single "batch" being
    written at a time. When a batch is being written, any new ZIL
    transactions will have to wait for the next batch to be written,
    which won't occur until the current batch finishes.

    As a result, the underlying storage may not be used as efficiently
    as possible. While "new" threads enter zil_commit() and are blocked
    waiting for the next batch, it's possible that the underlying
    storage isn't fully utilized by the current batch of ZIL blocks. In
    that case, it'd be better to allow these new threads to generate
    (and issue) a new ZIL block, such that it could be serviced by the
    underlying storage concurrently with the other ZIL blocks that are
    being serviced.

 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
    to complete, prior to zil_commit() returning. The size of any given
    batch is proportional to the number of ZIL transactions in the queue
    at the time the batch starts processing the queue, which doesn't
    occur until the previous batch completes. Thus, if there's a lot of
    transactions in the queue, the batch could be composed of many ZIL
    blocks, and each call to zil_commit() will have to wait for all of
    these writes to complete (even if the thread calling zil_commit()
    only cared about one of the transactions in the batch).

To further complicate the situation, these two issues result in the
following side effect:

 3. If a given batch takes longer to complete than normal, this results
    in larger batch sizes, which then take longer to complete and
    further drive up the latency of zil_commit(). This can occur for a
    number of reasons, including (but not limited to): transient changes
    in the workload, and storage latency irregularities.
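
In rough pseudocode, the batching described above reduces to the
following pattern (a paraphrase for illustration, not the literal
upstream code; zil_write_whole_queue() is a hypothetical stand-in):

static void
zil_commit_batched(zilog_t *zilog, uint64_t foid)
{
	mutex_enter(&zilog->zl_lock);
	while (zilog->zl_writer)	/* problem 1: a batch is in flight */
		cv_wait(&zilog->zl_cv_writer, &zilog->zl_lock);
	zilog->zl_writer = B_TRUE;
	mutex_exit(&zilog->zl_lock);

	/* issue and wait on every block in the batch ... */
	zil_write_whole_queue(zilog, foid);	/* problem 2 */

	mutex_enter(&zilog->zl_lock);
	zilog->zl_writer = B_FALSE;
	cv_broadcast(&zilog->zl_cv_writer);
	mutex_exit(&zilog->zl_lock);
}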

Solution
========

The solution attempted by this change has the following goals:

 1. no on-disk changes; maintain current on-disk format.
 2. modify the "batch size" to be equal to the "ZIL block size".
 3. allow new batches to be generated and issued to disk, while there's
    already batches being serviced by the disk.
 4. allow zil_commit() to wait for as few ZIL blocks as possible.
 5. use as few ZIL blocks as possible, for the same number of ZIL
    transactions, without introducing significant latency to any
    individual ZIL transaction; i.e. use fewer, but larger, ZIL blocks.

In theory, with these goals met, the new algorithm will allow the
following improvements:

 1. new ZIL blocks can be generated and issued while there are already
    outstanding ZIL blocks being serviced by the storage.
 2. the latency of zil_commit() should be proportional to the underlying
    storage latency, rather than to the incoming synchronous workload.
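
As a concrete illustration of the waiter shape these goals imply
(a sketch only; the fields and helper below are illustrative rather
than the final code):

typedef struct zil_commit_waiter {
	kcondvar_t	zcw_cv;		/* broadcast from flush-done */
	kmutex_t	zcw_lock;	/* protects zcw_done */
	boolean_t	zcw_done;	/* lwb is on stable storage */
} zil_commit_waiter_t;

/*
 * Each zil_commit() call gets its own waiter, attached to the single
 * ZIL block (lwb) carrying its transaction, so it blocks only until
 * that block and the vdevs it wrote to are flushed -- never for a
 * whole batch.
 */
static void
zil_commit_waiter_wait(zil_commit_waiter_t *zcw)
{
	mutex_enter(&zcw->zcw_lock);
	while (!zcw->zcw_done)
		cv_wait(&zcw->zcw_cv, &zcw->zcw_lock);
	mutex_exit(&zcw->zcw_lock);
}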

Where to Start w.r.t. Reviews
=============================

For those attempting to review this change, I suggest the following order:

 1. to familiarize yourself with the high level design this change is
    shooting for, read the comment directly above zil_commit().
 2. if anything in that comment is unclear, not true, and/or it's
    missing critical information, please open a review comment so that
    it can be addressed.
 3. If you're already familiar with the data structures of the ZIL, now
    would be a good time to review the modifications I made, as a
    precursor to the algorithm changes. If you're not already familiar,
    it's probably best to do this after scanning the algorithm, at which
    point you'll hopefully have more context to fall back on.
 4. briefly read through zil_commit(), zil_commit_writer(), and
    zil_commit_waiter() to get familiar with the "flow" of the code
    w.r.t. the thread(s) calling zil_commit().
 5. briefly read through zil_lwb_write_done(), zil_lwb_flush_vdevs(),
    and zil_lwb_flush_vdevs_done() to get familiar with how the ZIL
    blocks complete, and how the "commit waiters" are signalled.

At this point, I would hope that you'll have a decent grasp of the
changes at a high level, such that you'll be able to navigate the code
on your own, and rely on the code comments for further inspection and
understanding. If that's not the case, I should probably address it
with more code comments, so please open review issues so I can do that.

Performance Testing
===================

I used two fio workloads to drive a synchronous write workload. In both
cases, the performance of this change was pitted against the baseline
"master" branch.

The first workload had each fio thread trying to perform synchronous
writes as fast as it could, issuing a single write at a time. The
results from this test can be seen here: https://goo.gl/jBUiH5

For the "tl;dr", use these two links that jump directly to the section
that describe the percentage difference in fio IOPs and fio write
latency when running on both HDDs and SSDs:

 - HDD drives, IOPs: https://goo.gl/iChbUh
 - SSD drives, IOPs: https://goo.gl/97u2LA

 - HDD drives, latency: https://goo.gl/RCGxr4
 - SSD drives, latency: https://goo.gl/n7DQ1D

The second workload also had fio issuing synchronous writes, but instead
of having each thread perform as many operations as fast as it could,
each fio thread attempted to achieve about 64 write operations per
second. The results of this test can be seen here: https://goo.gl/xEPGm4

Additionally, the following links will jump directly to the percentage
difference information between the baseline and the project branch:

 - HDD drives, IOPs: https://goo.gl/rgk7m4
 - SSD drives, IOPs: https://goo.gl/jCSVnH

 - HDD drives, latency: https://goo.gl/qK9E5q
 - SSD drives, latency: https://goo.gl/dGAXk1

Upstream bugs: DLPX-52375
@sempervictus
Contributor

So I think I found a bug here, unfortunately on my root pool of all things. It seems a condition can be created where the pool starts to cause consistent crashes in lwb_create. What's odd is that it only happens when the OS is booted from the affected pool; running from external media and chrooting into the root dataset on the affected rpool does not trigger it, with the same kernel and ZFS builds on the removable media and the system (from our own internal repo).
I haven't seen this on any of the dozen systems running with this patch; it seems somehow related to being a root volume in this case. I rebuilt without the patch and am testing presently.

@sempervictus
Contributor

Definitely this stack is causing the crash, though I'm wondering if it's related to #6477.

@sempervictus
Contributor

So, having pulled this out of the stack, I'm seeing human-noticeable performance drops. Whatever happened to the ZIL in my rpool is concerning, but with nothing lost except some time and nerves, I'm seriously considering getting this back into the prod stack for the clear benefits on our Cinder volume nodes.
@ryao: any chance this is still a priority at work, or could you share the solutions implemented there?

@ahrens
Member

ahrens commented Aug 19, 2017

@sempervictus We'd like to see @prakashsurya's version of this go in to illumos and Linux, which we believe has a superset of the improvements of this PR. Prakash, could you update the OpenZFS PR (openzfs/openzfs#421) with the final version that we integrated to Delphix?

@sempervictus
Contributor

@ahrens: I will throw the updated version into our test queue, thank you. Last time I tried that PR, I ran into some dtrace_probe codepath with undefined macros or something, and it never passed the initial builders. I'll be glad to put it through its paces.

@prakashsurya
Member

prakashsurya commented Aug 19, 2017

openzfs/openzfs#421 is up to date.

I've also ported that work to ZoL here: #6337 .. there appears to be some more work needed on the port, though, as it's not passing buildbot due to (what appears to be) an mmap issue. I haven't tried to root-cause the Linux test failure yet.

@sempervictus
Contributor

sempervictus commented Aug 19, 2017 via email

@behlendorf
Contributor

The alternate OpenZFS implementation ported to ZoL in #6566 is passing all the test cases and is ready for additional testing. @ryao it would be great if you could review those proposed changes and verify they meet your needs.

@behlendorf
Contributor

Closing, see #6566.

@behlendorf behlendorf closed this Oct 12, 2017