Lock contention on arcs mtx #3115

Closed · wants to merge 7 commits into base: master

10 participants
@dweeezil
Member

dweeezil commented Feb 17, 2015

I'm posting this as a pull request now so that it gets some wider review and the buildbots get a chance to chew on it.

I've given it some pretty intense testing, mainly with fio over the past few days and there don't seem to be any obvious regressions.

I still consider this a work-in-progress. The dbufs kstat still needs to be fixed but I don't expect that to be terribly difficult.

@sempervictus


Contributor

sempervictus commented Feb 18, 2015

I'll throw this on a testbed not running the updated ABD stack (that PR may limit the scope of testing here since people who run strange things in the night tend to use #2129).

Thank you very much for getting this ticking so quickly

@prakashsurya


Member

prakashsurya commented Feb 18, 2015

@dweeezil FYI, I just opened a review to fix arcstat in the upstream illumos code: https://reviews.csiden.org/r/164/

@dweeezil


Member

dweeezil commented Mar 2, 2015

Quick update: my initial benchmarking with fio (using the same test as @prakashsurya did under illumos) has not shown the expected performance increase as the number of cores increases.

I was concerned I was running into memory bandwidth issues so I did a round of pmbw tests to get a handle on the best-case random-read memory bandwidth. It looks like I'm likely not running into the absolute memory bandwidth limit yet, but it's difficult to tell because the memory bandwidth drops so dramatically once the various caches' sizes are exceeded (I'm looking at the pmbw graphs).

While trying to get this running on a lockstat-enabled kernel (to see where there might still be a bottleneck), I realized that all the tracepoint stuff for the ARC was broken by the illumos 5408 patch which means I've got to go back to that patch and overhaul the ARC tracepointing. I'll post more information as it becomes available.

@prakashsurya


Member

prakashsurya commented Mar 2, 2015

> While trying to get this running on a lockstat-enabled kernel (to see where there might still be a bottleneck), I realized that all the tracepoint stuff for the ARC was broken by the illumos 5408 patch which means I've got to go back to that patch and overhaul the ARC tracepointing.

Ugh, that's going to be a constant headache moving forward. I think we'll need to make a regression test to make sure the tracepoints remain intact as the code continues to evolve, but that's work for another issue.

@behlendorf


Member

behlendorf commented Mar 2, 2015

> Ugh, that's going to be a constant headache moving forward. I think we'll need to make a regression test to make sure the tracepoints remain intact as the code continues to evolve

In some sense we already have one. The CentOS 7 debug builder enables tracepoints, so it will fail if the tracepoint code isn't right. What we really need to do is eradicate the remaining known test failures, so that when a builder fails, people immediately look into why.

http://buildbot.zfsonlinux.org/builders/centos-7.0-x86_64-debug-builder/builds/802

@dweeezil


Member

dweeezil commented Mar 2, 2015

@behlendorf What are your thoughts on handling tracepoints for the newly-structured arc_buf_hdr_t insofar as the b_l2hdr and b_l1hdr members are concerned? In particular, do you think it's OK to always copy both of them in TP_fast_assign regardless of whether they're valid? Regarding the TP_printk, I was thinking of trying to add some conditional logic to only print the contents of one or the other depending on which one is valid.

@behlendorf


Member

behlendorf commented Mar 2, 2015

@dweeezil I think we should keep it simple and fast and assign the useful bits of both b_l2hdr and b_l1hdr in TP_fast_assign. A developer looking at the tracepoint log should be able to interpret which parts of that structure are meaningful. That said, I'm sure @nedbass and @prakashsurya have some thoughts about this as well.

@nedbass


Member

nedbass commented Mar 2, 2015

I agree with @behlendorf: TP_* should be kept dumb and simple. In fact, you might want to see what's in the supposedly invalid parts of a structure in some cases.

@prakashsurya


Member

prakashsurya commented Mar 3, 2015

I don't really have a preference, so I'm cool with either way; "keep it simple" is usually my guidance unless there's a compelling reason not to.

And glad to hear we at least have something to try and catch issues introduced with the tracepoints. 👍

kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this pull request Mar 13, 2015

[dweeezil] Lock contention on arcs mtx #3115
I'm posting this as a pull request now in order that it get some wider review and also that the buildbots get a chance to chew on it.

I've given it some pretty intense testing, mainly with fio over the past few days and there don't seem to be any obvious regressions.

I still consider this a work-in-progress. The dbufs kstat still needs to be fixed but I don't expect that to be terribly difficult.
@kernelOfTruth


Contributor

kernelOfTruth commented Mar 14, 2015

referencing/posting Illumos #5497 here for easier discovery (e.g. via Google):
https://www.illumos.org/issues/5497

@angstymeat


angstymeat commented Mar 14, 2015

I installed this on my system running Fedora 21 with kernel 3.17.8-200.fc20.x86_64, after trying the patch in 3181 to resolve an issue with the system becoming unresponsive or crashing after rsync'ing mail directories with lots of small files.

It hasn't frozen up like 3181 did, but as soon as my ARC got full, ZFS operations slowed to a crawl. It takes several seconds to list each file in a directory, and there's almost no CPU in use now.

@dweeezil


Member

dweeezil commented Mar 14, 2015

I've still not gotten a chance to give this a good workout and to test whether it's actually operating as intended. Unfortunately while I was working on setting up a test environment to evaluate this patch, I came across the issue I just posted in #3183 and that has been keeping me distracted for the past week or so.

During my initial testing of the loads for which this patch was designed to help, I did not see any better scalability in read bandwidth versus a stock system. While trying to arrange a test environment in which I could get better measurements, I came across the issue mentioned above.

And to be clear, this patch is not something that was going to help metadata-intensive loads much. The benchmark used by its author and by myself involves highly parallel 100% ARC-cached reads.

To be clear, however, when I say "the patch" above, I'm referring only to 175f413 which is the last patch in the stack (and the subject of this pull request). The other prerequisite patches could certainly help other issues.

@kernelOfTruth


Contributor

kernelOfTruth commented Mar 16, 2015

FYI:

module/zfs/arc.c:3458:47: error: ‘buf’ undeclared (first use in this function)
   cmn_err(CE_PANIC, "invalid arc state 0x%p", buf->b_state);

kernelOfTruth@c76ca73 & kernelOfTruth@129c618

module/zfs/multilist.c:30:1: warning: ‘multilist_d2l’ defined but not used [-Wunused-function]
 multilist_d2l(multilist_t *ml, void *obj)

not sure how to handle that one

@kernelOfTruth


Contributor

kernelOfTruth commented Mar 16, 2015

If I had known that I would succeed, I would have done it on a clean base (d'oh, sorry about that).

Anyway, it would be a waste of work to throw this away and start anew (perhaps I'll do so eventually, but for now the buildbots can chew on it). Here's a branch / tree with #2129 and this patchset (#3115) on top:

https://github.com/kernelOfTruth/zfs/commits/zfs_kOT_15.03.2015

Midway through I realized that I had to revert #3181, otherwise it would have been even more complicated.

It compiled cleanly, and aside from the above-mentioned warning

module/zfs/multilist.c:30:1: warning: ‘multilist_d2l’ defined but not used [-Wunused-function]
 multilist_d2l(multilist_t *ml, void *obj)

there were no error messages.

Hope it helps

@angstymeat


angstymeat commented Mar 16, 2015

When I compiled it I got the same message, and then changed buf->b_state to hdr->b_state since it was the only thing in scope that seemed correct.

@kernelOfTruth


Contributor

kernelOfTruth commented Mar 16, 2015

@angstymeat it's actually

hdr->b_l1hdr.b_state

if you also use "5408 managing ZFS cache devices requires lots of RAM"; the definition (or name) changed once again there 😉

(Provided, of course, that I understood it correctly; it wouldn't compile for me otherwise.)

@dweeezil


Member

dweeezil commented Mar 16, 2015

As an update, my currently-interrupted work on this stack requires going back to the "5408 managing ZFS cache devices requires lots of RAM" patch and updating our tracing infrastructure to properly deal with the newly-nested ARC structures. Once that update is complete, the rest of the patch stack will be rebased atop that work.

@kernelOfTruth


Contributor

kernelOfTruth commented Mar 20, 2015

@dweeezil

FYI:

the changes in these commits and #2129, once adapted, seem to lead to nice write-speed improvements, latency reductions, and a decrease in CPU load: #3190 (comment)

Any comments about the multilist-related warning from #3115 (comment)?

Thanks 👍

@dweeezil


Member

dweeezil commented Mar 23, 2015

I've re-pushed after rebasing onto the latest managing-zfs-cache-devices-requires-lots-of-ram branch (see #3038) and have added initial tracepoint support for the new multilist module.

This has not yet been rebased onto a current master; I suspect there will be some conflicts with the recent commits to the ARC code.

@kernelOfTruth


Contributor

kernelOfTruth commented Mar 23, 2015

@dweeezil thanks!

@behlendorf would commit 730d690 still be needed in light of the most recent changes (bc88866)?

Related to #3216: I'm currently at this commit, and considering the positive effect this change has on limiting ARC growth, I'm not sure if this should be applied.

Isn't it also preferable to leave it at the updated approach: "Therefore, this function has been updated to make alternating passes over the ARC releasing data buffers and then newly unheld meta data buffers."?

@yarikoptic, @snajpa and @DeHackEd: your issues seem to be addressed by this; I also seemingly can't trigger that unlimited growth and slowing down anymore 👍

@kernelOfTruth


Contributor

kernelOfTruth commented Mar 27, 2015

I'll take the liberty of replying to myself:

In the end (with Illumos 5497, lock contention on arcs_mtx) it doesn't really matter, since the affected parts are replaced, superseded, or modified anyway. Affected commits:

Revert "Allow arc_evict_ghost() to only evict meta data"
dweeezil@730d690

and

Fix arc_adjust_meta() behavior
behlendorf@f156031

which in a way includes
arc_adjust_meta: limit number of restarts to 4096
kernelOfTruth@942e973


Untouched are:
behlendorf@6116b99
Fix arc_meta_max accounting

behlendorf@fd90e76
Restructure per-filesystem reclaim

so older kernels will in all cases benefit from the recent changes.

The difference, then, is the difficulty of porting and/or patching things.

kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this pull request Mar 27, 2015

5497 lock contention on arcs_mtx
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

Porting notes:

This patch from illumos either duplicates or obviates a number of
ZoL-specific patches which have been added over time.  As part of this
porting effort, several functions were copied in toto from illumos,
effectively undoing some of the previous ZoL-specific patches:

	... trying to remember which ones ...
	arc_adapt() ?
	arc_adjust_impl()
	arc_adjust_meta()
	arc_adjust_type()

Other notes:

The illumos 5368 patch (ARC should cache more metadata), which
was never picked up by ZoL, is mostly undone by this patch.

Since ZoL relies on the kernel asynchronously calling the shrinker to
actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every
time it runs.  In order to operate in a manner similar to illumos, the
arc_adapt_thread does the same, analogously to illumos's
arc_reclaim_thread.

Notable ZoL commits which conflict with this patch
or whose effects are either duplicated or undone by this patch:

302f753 - Integrate ARC more tightly with Linux
39e055c - Adjust arc_p based on "bytes" in arc_shrink
f521ce1 - Allow "arc_p" to drop to zero or grow to "arc_c"
77765b5 - Remove "arc_meta_used" from arc_adjust calculation
94520ca - Prune metadata from ghost lists in arc_adjust_meta

arcstat.py needs to be updated to deal with new/missing stats.
I have removed "recycle_miss" from it so that it still works.

Added minimal multilist trace support.

---------------------------------------------------

@l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
‣ 	abd_t *buf_data;
‣ 	kmutex_t *list_lock = NULL;
• to
‣ 	abd_t *buf_data;
• #3115: void *buf_data;
• #2119: abd_t *buf_data;

for later, where applicable (#2129):
‣ zio_buf_free
• equals
‣ abd_free

‣ zio_data_buf_free
• equals
‣ abd_free

‣ zio_buf_alloc(size);
• buf->b_data = zio_buf_alloc(size);
• equals
‣ abd_alloc_linear(size);
• buf->b_data = abd_alloc_linear(size);

‣ zio_data_buf_alloc(size);
• buf->b_data = zio_data_buf_alloc(size);
• equals
‣ abd_alloc_scatter(size);
• buf->b_data = abd_alloc_scatter(size);

@ arc_buf_add_ref(arc_buf_t *buf, void* tag)
‣ re-introduction of
• arc_buf_free_on_write(void *data, size_t size,
• removed due to "Revert "fix l2arc compression buffers leak"", dweeezil@ba5ac5c

fix "warning: assignment from incompatible pointer type" error of
• @ arc_buf_add_ref(arc_buf_t *buf, void* tag)
• void (*free_func)(abd_t *, size_t))
(#2129)

• arc_buf_free_on_write(void *data, size_t size,
•     void (*free_func)(void *, size_t))
∘ to
• arc_buf_free_on_write(abd_t *data, size_t size,
•     void (*free_func)(abd_t *, size_t))

reverting the zfs_arc_meta_adjust_restarts changes introduced by
‣ zfsonlinux@bc88866
‣ Fix arc_adjust_meta() behavior
‣ thus removing the "zfs_arc_meta_adjust_restarts" parameter from code & documentation

conflicts in removal & replacement due to omitting "Revert "Allow arc_evict_ghost() to only evict meta data" in
‣ @ arc_adjust(void)
• if (adjustment > 0 && arc_mru_ghost->arcs_size > 0) {
‣ @ arc_adjust_meta(void)
• if (adjustmnt > 0 && arc_mfu_ghost->arcs_lsize[ARC_BUFC_METADATA] > 0) {
‣ @static arc_buf_hdr_t arc_eviction_hdr;
• static void arc_evict_ghost(arc_state_t *state, uint64_t spa, int64_t bytes,
•     arc_buf_contents_t type);
‣ @ arc_evict(arc_state_t *state, uint64_t spa, int64_t bytes, boolean_t recycle,
• arc_evict_ghost(arc_state_t *state, uint64_t spa, int64_t bytes,
‣ ...
‣ see: dweeezil@730d690
for the changes of the revert
@dweeezil


Member

dweeezil commented Apr 6, 2015

Now that pull request 3225 is behind us, I'm getting this patch stack up to date. It looks like others may have picked up this one for buildbot testing, but I've not had the time to pay attention to those efforts while dealing with 3225. In any case, I've rebased it onto current master but have, for the time being, reverted bc88866 from #3202 for some baseline testing. I will re-integrate it as necessary once I get a handle on the behavior without it, and will then refresh the commit for this pull request.

@dweeezil


Member

dweeezil commented Apr 6, 2015

It didn't take much testing after the rebase, once I got back to working on this, to realize I definitely went a bit overboard in doing things the "illumos way" (I dropped large swaths of ab26409, for example) when creating this patch stack. Unfortunately, some people seem to have picked up this patch as a Good Thing, but it's clearly not something that should be used on a real system as it stands now. This was never intended to be something that would land in 0.6.4; I'll point out that it is actually tagged as 0.7.0, which is more realistic. I'll try to drop a few pointers into the other pull requests where this is mentioned.

@behlendorf


Member

behlendorf commented Apr 30, 2015

@dweeezil how are you feeling about the ARC changes you've been working on? Does it look like they'll come together in the next couple of weeks and be something we'll want to merge?

@dweeezil


Member

dweeezil commented Apr 30, 2015

@behlendorf Yes, I'm quite certain I can get this in shape for merging within the next couple of weeks. Unfortunately it has slid to the back burner for a while, but I expect to get back to it very soon.

@behlendorf


behlendorf commented Jun 1, 2015

Member

@prakashsurya thanks for checking in with us!

As for arc_meta_limit, I'm not convinced there are any real technical advantages. It has always been a point of confusion; I just wanted to call it out as a significant change in behavior. Unless we uncover some serious issue with that approach, I'm inclined to change it to a soft limit and clearly document what arc_meta_limit actually means. It will be one less point of divergence.

It looks to me like the current behavior for illumos is to treat arc_meta_limit as a soft limit. The arc_adjust_meta() function will attempt to evict only metadata buffers when you're over the limit. You might recall we identified some issues with this behavior for certain workloads. Specifically, it's possible to be over the limit yet unable to free any metadata due to references held by data blocks or other VFS objects. For this reason, we need to add a slightly different version of arc_adjust_meta() to handle that situation. See arc_adjust_meta_balanced() in behlendorf/zfs@5c6c960. It would be great if we could settle on a single platform-agnostic algorithm to handle this case.

Speaking of getting our changes upstream to illumos: what I'm planning to work on after this work is merged is pulling in the other ARC-related changes we haven't ported yet. There appear to be about 6 changes we still need. Then, in the context of that, I'll try to eliminate any unnecessary differences. That should give us a much better idea of what can and should be pushed up to illumos. It'll also put us in a good place to get the ABD changes merged for both platforms. If that's something you could help with, I'd definitely appreciate it! :)

illumos/illumos-gate@a52fc31 5701 zpool list reports incorrect "alloc" value for cache devices
illumos/illumos-gate@2fd872a 5817 change type of arcs_size from uint64_t to refcount_t
illumos/illumos-gate@4076b1b 5445 Add more visibility via arcstats; specifically arc_state_t stats
illumos/illumos-gate@2ec99e3 5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
illumos/illumos-gate@fbefb14 5179 Remove unused ZFS ARC functions
illumos/illumos-gate@83803b5 5163 arc should reap range_seg_cache

@dweeezil I'll get a new version of the patch pushed today. But by all means if you want to pull it in to your branch and rework it as needed please do. It was just one idea on how to resolve the deadlock so we can finalize these changes and merge them. If you have a better approach, or a better implementation, I'm all for it.

Member

behlendorf commented Jun 1, 2015

@prakashsurya thanks for checking in with us!

As for making arc_meta_limit a hard limit, I'm not convinced there are any real technical advantages. It has always been a point of confusion; I just wanted to call it out as a significant change in behavior. Unless we uncover some serious issue with that approach, I'm inclined to change it to a soft limit and clearly document what arc_meta_limit actually means. It will be one less point of divergence.

It looks to me like the current behavior for illumos is to treat arc_meta_limit as a soft limit. The arc_adjust_meta() function will attempt to evict only metadata buffers when you're over the limit. You might recall we identified some issues with this behavior for certain workloads. Specifically, it's possible to be over the limit yet unable to free any metadata due to references held by data blocks or other VFS objects. For this reason, we needed to add a slightly different version of arc_adjust_meta() to handle that situation. See arc_adjust_meta_balanced() in behlendorf/zfs@5c6c960. It would be great if we could settle on a single platform-agnostic algorithm to handle this case.

Speaking of getting our changes upstream to illumos, what I'm planning to work on after this work is merged is pulling in the other ARC-related changes we haven't ported yet. There appear to be about 6 changes we still need. Then, in that context, I'll try to eliminate any unnecessary differences. That should give us a much better idea of what can and should be pushed up to illumos. It'll also put us in a good place to get the ABD changes merged for both platforms. If that's something you could help with I'd definitely appreciate it! :)

illumos/illumos-gate@a52fc31 5701 zpool list reports incorrect "alloc" value for cache devices
illumos/illumos-gate@2fd872a 5817 change type of arcs_size from uint64_t to refcount_t
illumos/illumos-gate@4076b1b 5445 Add more visibility via arcstats; specifically arc_state_t stats
illumos/illumos-gate@2ec99e3 5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
illumos/illumos-gate@fbefb14 5179 Remove unused ZFS ARC functions
illumos/illumos-gate@83803b5 5163 arc should reap range_seg_cache

@dweeezil I'll get a new version of the patch pushed today. But by all means if you want to pull it in to your branch and rework it as needed please do. It was just one idea on how to resolve the deadlock so we can finalize these changes and merge them. If you have a better approach, or a better implementation, I'm all for it.

@dweeezil

dweeezil commented Jun 1, 2015

@behlendorf I think your approach is OK from what I can tell, but I wanted to make sure I was looking at it the way you intended and, since it wouldn't compile, I didn't want to make any guesses.

@behlendorf

behlendorf commented Jun 1, 2015

@dweeezil yup, it's what I intended. Although, the system doesn't appear to have survived running dbench in a loop all weekend. Unfortunately, I wasn't able to extract any useful debugging info, so I'll need to reproduce it.

@dweeezil

dweeezil commented Jun 1, 2015

@behlendorf OK, I think dweeezil/zfs@aef1403 jibes with your intent. I'll be looking it over this evening and also running my own battery of tests on it.

@behlendorf

behlendorf commented Jun 2, 2015

@dweeezil see behlendorf/zfs@44b2d17 for a slightly updated version. I have fixed the build failure you noticed and added an async and sync version. The async version is used by the arc_adapt_thread() and the sync version in arc_kmem_reap_now() which is called by the shrinker. I've tried to explain a bit why in the commit comment. Finally, I updated arc_fini() to wait on the taskq to make sure we don't have any cleanup issues.

https://github.com/behlendorf/zfs/commits/lock-contention-on-arcs_mtx

I'm going to queue up some more stress testing with this version today and your updated patches.

@dweeezil

dweeezil commented Jun 2, 2015

@behlendorf OK, thanks for the update. I'll be looking at your commit instead of mine.

@prakashsurya

prakashsurya commented Jun 2, 2015

@behlendorf, see below..

You might recall we identified some issues with this behavior for certain workloads. Specifically
it's possible to be over the limit yet unable to free any meta data due to references held by data
blocks or other VFS objects.

Yup, I definitely remember that. Matt and I decided not to address that or some of the other peculiarities of the ARC in the context of this patch. It was already a large enough change, and we didn't want to make it any bigger than necessary (despite it being so tempting to do so).

I'm not opposed to a solution that works for both platforms, I just need to sit down and verify the issue on illumos like I previously did with Linux. With the differences between the VFS and caching layers of the two platforms, I'd like to better understand the illumos side of things first.

With that said, is the prune callback used often on a standard ZoL system? I remember we originally added that functionality thinking that Lustre was the culprit for the large amount of un-evictable space in the ARC, but it was ultimately determined to be due to us pruning metadata from the MRU and MFU, but not also pruning from the MRUG and MFUG lists (thus, it was actually the accumulation of ARC headers for ghost-list entries that was causing grief, not Lustre pinning dbufs in the cache). I'm curious if that callback's usefulness should be re-evaluated.

It'll also put us in a good place to get the ABD changes merged for both platforms. If that's
something you could help with I'd definitely appreciate it! :)

I'll sync up with @ahrens on this, last I heard he had a rough port of those patches working on DelphixOS. I think that work would help both platforms, albeit for different reasons.

@dweeezil

dweeezil commented Jun 2, 2015

@behlendorf I like the approach and it seems to work just fine, except that I think we need our own taskq (see the code comments).

Strangely enough, while running dbench -x 32, I was never able to reproduce the original deadlock (those were short runs, however). Today, with the patch, it did survive 4 hours on a system capped at 8GiB.

@prakashsurya FYI, during the dbench 32-process test on a system capped at 8GiB, the callback was invoked exactly zero times. I'm going to try running some of my own metadata-heavy stress tests.

@behlendorf

behlendorf commented Jun 2, 2015

@dweeezil quick update. It appears that the most significant issue I was running into wasn't introduced by any of these changes. I happened to be testing using a 2.6.32 kernel, so I had no per-filesystem shrinker support, and shrink_dcache_parent() for older kernels clearly doesn't work properly. So this isn't a new issue, which is good. After resolving that issue I'm seeing considerably improved behavior. I'll push this (unrelated) patch tomorrow once I'm happier with it.

@dweeezil

dweeezil commented Jun 3, 2015

I'm continuing to run tests and to exercise all the new code paths, adding additional arcstat instrumentation along the way to assist.

The reason I wasn't seeing any metadata-related pruning in the dbench test I mentioned above is because arc_c had collapsed and was never raised again (I'm looking into why at this point). In this condition, arc_meta_limit became impossible to reach and the async pruning in arc_adjust_meta_balanced() was never able to happen.

@dweeezil

dweeezil commented Jun 3, 2015

I've been able to get my test system into the following unfortunate state (output courtesy of a hacked arcstat.py):

dweeezil@zfs-dev:~$ ./arcstat.sh 
   p     c  meta  arcsz  metalimit  c_max  
 29M   31M  613M   613M  2.9G  3.9G  
 29M   31M  613M   613M  2.9G  3.9G  
 29M   31M  613M   613M  2.9G  3.9G  
 29M   31M  613M   613M  2.9G  3.9G  
 29M   31M  613M   613M  2.9G  3.9G  
 29M   31M  613M   613M  2.9G  3.9G  

The problem is that the large block support bumped the maximum block size to 16MiB and the logic in the adapt thread does this:

        /*
         * If we're within (2 * maxblocksize) bytes of the target
         * cache size, increment the target cache size
         */
        if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {

Oops: with SPA_MAXBLOCKSHIFT now 24, the subtrahend is 2 * 16MiB = 32MiB, so if arc_c drops below 32MiB the unsigned subtraction underflows and the right-hand side wraps to a huge value. The comparison then never succeeds, arc_c is never allowed to grow, and the system is left pretty well bound up (note c = 31M in the arcstat output above).

This may very well be unrelated to the issue at hand, but I do seem to be able to stress the system rather easily and cause a bad collapse of arc_c.

@dweeezil

dweeezil commented Jun 4, 2015

I'm testing now with dweeezil/zfs@4288ef8 to fix the non-expanding arc_c problem and will likely add that patch to the stack.

@dweeezil

dweeezil commented Jun 4, 2015

@behlendorf Back to the use of the system taskq again. One of the arcstats I added while testing is this:

@@ -2488,7 +2494,9 @@ static void
 arc_prune(int64_t adjust)
 {
        arc_prune_async(adjust);
+       ARCSTAT_INCR(arcstat_prune_wait, 1);
        taskq_wait(system_taskq);
+       ARCSTAT_INCR(arcstat_prune_wait, -1);
 }

and after causing a bunch of pruning, I see:

dweeezil@zfs-dev:~/src/zfs-mtx$ grep prune /proc/spl/kstat/zfs/arcstats 
arc_prune                       4    382
arc_balanced_meta_prune         4    273
arc_prune_wait                  4    3

and:

root@zfs-dev:~# ps axwww | grep D
  PID TTY      STAT   TIME COMMAND
 2265 ?        Ss     0:00 /usr/sbin/sshd -D
58664 ?        D      0:16 /bin/bash ./create.sh
58673 ?        D      0:17 /bin/bash ./create.sh
58691 ?        D      0:16 /bin/bash ./create.sh
74797 pts/2    S+     0:00 grep --color=auto D

so it looks like there are 3 perpetual waiters. Might there not be other things on the system taskq such as the deadman, prefetcher thread, etc.?

@behlendorf

behlendorf commented Jun 5, 2015

I'd like to better understand the illumos side of things first. With that said, is the prune callback used often in a standard ZoL system?

@prakashsurya it can be for a meta-data heavy workload. I think the right way to think about this is that the Linux prune functionality is directly analogous to the dnlc_reduce_cache() functionality on illumos. You would want to call it under exactly the same circumstances. We've just needed to make that hook generic under Linux because the Linux VFS doesn't provide us the needed interface.

I think one reasonable way to do this portably would be just to agree on a common ZFS-specific interface. That could be called arc_prune() or something entirely different, but all it would need to be on illumos is a macro or inline function. Right now our arc_prune() takes a byte count and dnlc_reduce_cache() a percentage, but that's the only meaningful way they differ.

#define arc_prune(x) dnlc_reduce_cache(x)

As for Lustre, I think that's a slightly different kettle of fish. You're right, they probably shouldn't be using this interface any more. But having something like this available is generally useful if you're considering building something which isn't a filesystem on top of the DMU. And these days there's increasing interest in doing just that.

@dweeezil I've refreshed my lock-contention-on-arcs_mtx branch, which is based on your lock-contention-on-arcs_mtx branch. It depends on the changes in behlendorf/spl#453 and includes all the changes we discussed. In the end I just caved and went down the rabbit hole of implementing the TASKQ_DYNAMIC flag from illumos. This way we can make all the multi-threaded taskqs dynamic, restore the illumos default values, and not have a crazy number of threads on an idle system.

zfsonlinux/spl#455
https://github.com/behlendorf/zfs/tree/lock-contention-on-arcs_mtx

I also added a patch to my stack which addresses the arc_prune() issues I saw for older kernels without per-filesystem shrinkers. With these changes in place, the only issue I'm seeing is at cleanup time, where it looks like an iput() is getting lost (or never issued). I still need to run that down.

As for your 3 perpetual waiters, I'm not sure what's going on there. But in my updated patches I did create a dedicated arc_prune() taskq, which may shed some light on things.

@dweeezil

dweeezil commented Jun 5, 2015

@behlendorf I'll be re-testing today with your version of the stack (https://github.com/behlendorf/zfs/tree/lock-contention-on-arcs_mtx) and modified SPL as well as an updated version of my arc_c_min fix from https://github.com/dweeezil/zfs/tree/arc-c-too-small. I'll be interested to observe the taskq threads on my current test rig (40 cores/80 threads).

@behlendorf

behlendorf commented Jun 5, 2015

@dweeezil I've refreshed my spl and zfs branches with the following small changes.

zfsonlinux/spl#455
https://github.com/behlendorf/zfs/tree/lock-contention-on-arcs_mtx

  • Decreased the default spl_taskq_thread_sequential value to 1. At 8 it clearly wasn't ramping up the I/O threads fast enough for certain workloads. This helped considerably.
  • Added your arc_c_min patch with 2 small additions I mentioned in the comment.
  • Added a patch to change several existing taskq_wait() callers to taskq_wait_outstanding() to preserve the existing behavior where safe.
  • Fixed a cstyle issue.

It appears the umount issue I referenced above isn't a new issue introduced by these changes. I was able to reproduce it on the master branch with a bit of work. So I don't consider it a blocker, although it needs to be explained.

My intention is to run these patch stacks over the weekend and see how they hold up. So far things are looking promising.

@dweeezil

dweeezil commented Jun 6, 2015

@behlendorf I'm using your behlendorf/spl@a0f5e41 and behlendorf/zfs@af184e8. If I run my create.sh stress test with 32-process concurrency, after a while the (now initially single) z_wr_iss thread blows up into 34 threads and all of them are blocked:

16636 ?        DN     0:00 [z_wr_iss]
16638 ?        DN     0:00 [z_wr_iss]
16639 ?        DN     0:00 [z_wr_iss]
16640 ?        DN     0:00 [z_wr_iss]
16643 ?        DN     0:00 [z_wr_iss]
16647 ?        DN     0:00 [z_wr_iss]
16650 ?        DN     0:00 [z_wr_iss]
16651 ?        DN     0:00 [z_wr_iss]
etc...

root@zfs-dev:~# cat /proc/16640/stack
[<ffffffff81090cf2>] kthread_create_on_node+0xd2/0x180
[<ffffffffc0308774>] spl_kthread_create+0x94/0xe0 [spl]
[<ffffffffc0309c9f>] taskq_thread_create+0x6f/0x100 [spl]
[<ffffffffc030a19c>] taskq_thread+0x46c/0x540 [spl]
[<ffffffff81090e72>] kthread+0xd2/0xf0
[<ffffffff8179be98>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

Access to the ZFS filesystem is, likewise, blocked as you might imagine. I'm going to reboot and see what's at kthread_create_on_node+0xd2 in my kernel.

@dweeezil

dweeezil commented Jun 7, 2015

The z_wr_iss threads (or likely any other thread on a dynamic taskq) stall in do_wait_common() from wait_for_completion_killable() when kthreadd enters reclaim as follows:

root@zfs-dev:/home/dweeezil# cat /proc/2/stack
[<ffffffffc036a2ed>] taskq_wait_outstanding+0x7d/0xe0 [spl]
[<ffffffffc076f386>] arc_kmem_reap_now+0xf6/0x100 [zfs] *** second line in arc_prune() ***
[<ffffffffc076f467>] __arc_shrinker_func.isra.23+0xd7/0x180 [zfs]
[<ffffffffc076f527>] arc_shrinker_func_scan_objects+0x17/0x30 [zfs]
[<ffffffff811812e6>] shrink_node_slabs+0x1d6/0x370
[<ffffffff81183f3a>] shrink_zone+0x20a/0x240
[<ffffffff811842d5>] do_try_to_free_pages+0x155/0x440
[<ffffffff8118467a>] try_to_free_pages+0xba/0x150
[<ffffffff81177b6b>] __alloc_pages_nodemask+0x61b/0xa60
[<ffffffff811bd1a1>] alloc_pages_current+0x91/0x100
[<ffffffff811c7ceb>] new_slab+0x34b/0x450
[<ffffffff81792ac9>] __slab_alloc+0x315/0x475
[<ffffffff811c854c>] kmem_cache_alloc_node+0x8c/0x250
[<ffffffff8106f853>] copy_process.part.26+0xf3/0x1c20
[<ffffffff81071535>] do_fork+0xd5/0x340
[<ffffffff810717c6>] kernel_thread+0x26/0x30
[<ffffffff8109185a>] kthreadd+0x15a/0x1c0
[<ffffffff8179be98>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

Unfortunately, arc_prune_taskq has the following stack:

root@zfs-dev:/home/dweeezil/src/zfs-mtx# cat /proc/3404/stack
[<ffffffff81090cf2>] kthread_create_on_node+0xd2/0x180
[<ffffffffc0368774>] spl_kthread_create+0x94/0xe0 [spl]
[<ffffffffc0369c9f>] taskq_thread_create+0x6f/0x100 [spl]
[<ffffffffc036a19c>] taskq_thread+0x46c/0x540 [spl]
[<ffffffff81090e72>] kthread+0xd2/0xf0
[<ffffffff8179be98>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

Deadlock.

Member

dweeezil commented Jun 7, 2015

Testing my theory, the following gross hack (and a similar one in the SPL shrinker) does fix the deadlock with kthreadd:

diff --git a/module/zfs/arc.c b/module/zfs/arc.c
index 84676ff..5d98c87 100644
--- a/module/zfs/arc.c
+++ b/module/zfs/arc.c
@@ -3247,7 +3247,7 @@ __arc_shrinker_func(struct shrinker *shrink, struct shrink_control *sc)
                return (pages);

        /* Not allowed to perform filesystem reclaim */
-       if (!(sc->gfp_mask & __GFP_FS))
+       if (!(sc->gfp_mask & __GFP_FS) || current->pid == 2)
                return (SHRINK_STOP);

        /* Reclaim in progress */

EDIT: Never mind, the deadlock just takes longer to appear.

EDIT[2]: Because it's a different deadlock, involving kswapd (see next comment).


Member

dweeezil commented Jun 7, 2015

I'm going to try locking down kswapd as well (at least there's a proper API for detecting it). Clearly dynamic taskqs add a layer of memory allocations which wouldn't have happened in ZoL in the past.


Member

dweeezil commented Jun 7, 2015

@behlendorf I'm going to continue testing with spl_taskq_thread_dynamic=0 for the time being, which seems to work around the deadlocks caused by reclaim from kthreadd and kswapd. This appears to be a latent problem which caused no issues in the past because ZoL never (rarely?) created new threads after module initialization.
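For anyone wanting to apply the same workaround persistently, the setting can go in a modprobe options file; the path below is the usual convention, not something dictated by this thread:

```
# /etc/modprobe.d/spl.conf
# Work around the reclaim deadlocks above by disabling dynamic taskq threads.
options spl spl_taskq_thread_dynamic=0
```

The current value can also be inspected (and, if the module exports the parameter writable, changed) at runtime under /sys/module/spl/parameters/spl_taskq_thread_dynamic.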


kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this pull request Jun 8, 2015

[RFC, discussion, feedback] account for ashift when choosing buffers to be written to l2arc device

If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual increment
to l2ad_hand was introduced in
commit e14bb3258d05c1b1077e2db7cf77088924e56919

Also, consistently use asize / a_sz for the allocated size, psize / p_sz
for the physical size.  Where the latter accounts for possible size
reduction because of compression, whereas the former accounts for possible
size expansion because of alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size.  This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical I/O
has a suitable size.

modified to account for the changes of

zfsonlinux#2129 (ABD) ,

zfsonlinux#3115 (Illumos - 5408 managing ZFS cache devices requires lots of RAM)

and

Illumos - 5701 zpool list reports incorrect "alloc" value for cache devices
Member

dweeezil commented Jun 8, 2015

With spl_taskq_thread_dynamic=0 my memory-constrained (about 8GiB usable) test system survived an overnight run of dbench -x 32 with only a handful of "120 second block" messages. This is very good news. I'll be re-running the test today with 512GiB memory available. I'll also likely try it with xattr=sa and/or relatime=on.


Member

behlendorf commented Jun 8, 2015

@dweeezil my testing survived all weekend as well. I was running high thread versions of xdd, dbench, and iozone concurrently in a loop for ~2 days. No stack traces or lockups. Not too bad!

Interestingly, I didn't observe the TASKQ_DYNAMIC lockups you hit. This was almost certainly because I wasn't pushing as hard on the meta limit. But this is clearly an issue introduced by the patch which needs to be addressed.

The creation of dynamic threads was moved from taskq_create() into the taskq thread itself, which is what caused the issue. Let me give some thought to how to resolve this. Worst case, we can drop the TASKQ_DYNAMIC patches for now; they're not critical, just an optimization. We might want to keep them separate anyway (or disabled by default) to quantify any performance benefit or penalty.


Member

dweeezil commented Jun 8, 2015

@behlendorf My 40-process 512GiB instance of dbench just passed 6 hours and ran fine the whole time (as expected). I also had xattr=sa and relatime=on set (dynamic taskqs were disabled but likely would have been just fine in this case). I'd have to say this patch stack looks pretty good now. [EDIT] Did you see my note about the comparison operator in arc_adapt()?

As to the dynamic taskqs, I do like their effect with lots of cores, especially on pools with a large number of top-level vdevs. I made a pool with 40 top-level vdevs once and with the non-dynamic scheme, wound up with over 1600 (!) metaslab_group_taskq threads. Low-memory systems, however, can be trivially deadlocked if dynamic taskqs are enabled. I'm thinking the dynamic taskq feature ought to be its own patch in its own pull request anyway.


Member

behlendorf commented Jun 9, 2015

@dweeezil I think we're in good shape as well. I've opened #3481 with a finalized version of this patch stack with the following modifications.

  • All TASKQ_DYNAMIC patches have been dropped. I'll open a new dedicated pull request for these improvements after resolving the issues you uncovered. I have new patches they just need some testing before I post them.
  • The zfs_sb_prune_compat() patch has been dropped. I was able to expose a deadlock with the existing patch which still needs to be resolved. However, this same problem currently exists in master so it doesn't need to hold up these changes.
  • The taskq_wait_outstanding() patch has been merged into the SPL master source. There have been no observed issues with those changes and it is required for the buildbot to test the patch stack.
  • I've made all the changes you suggested in previous comments.

I'll queue this patch stack up for another evening of testing. If everything looks good and you don't have any objections I'd like to merge these changes tomorrow. Please review it one final time.


Member

behlendorf commented Jun 10, 2015

@dweeezil after addressing a minor build issue in the proposed patch stack, they held up well to my overnight stress testing. At this point I think they're ready to be merged, so let me know when you're happy and I'll merge them.


Member

dweeezil commented Jun 10, 2015

@behlendorf I'll get the stack (your -final branch) reviewed and give it a bit more testing today.


Member

dweeezil commented Jun 11, 2015

@behlendorf So far, it's looking good to me. I'll continue running tests throughout the rest of the day, but at this point, I'd say it's ready to merge. More than anything, I'd like to move on to investigate some of the other scalability bottlenecks I've discovered while working on this.


Member

behlendorf commented Jun 11, 2015

@dweeezil sounds good. I'll leave my tests running overnight as well and if everything still looks good in the morning I'll get this merged so we can pursue the other improvements.


Member

dweeezil commented Jun 11, 2015

This is still looking good after overnight testing with fio and dbench. I ran some final numbers with @prakashsurya's original test. For reference, it's fio with the following configuration:

[global]
group_reporting=1
fallocate=none
ioengine=psync
numjobs=32
iodepth=1
bs=8k
filesize=512m
size=512m
randrepeat=0
use_os_rand=1

[test]
rw=randread
time_based
runtime=1m
directory=/tank/fish

on a filesystem with recordsize=8k.

With current master code, there are 873419 contentions on arcs_mtx with an average wait time of 1.85ms. With this patch stack, the ARC-related contentions are pushed down into the noise. There are only 7589 contentions on mls_lock, with an average wait time of 14.95us.

Unfortunately, as mentioned in previous issue comments, the overall performance for 8K block reads does not improve at all with this patch. There are clearly other ZoL-specific issues at play here. With this patch stack, the top source of lock contention becomes the wait_lock and rr_lock, which are part of the SPL's rwlock and rwsem implementations. These are separate problems and will get their own issues as soon as this stack is committed and I can start to seriously look into the other scalability issues specific to ZoL.

In summary, I think this patch stack is ready to go as-is.


Contributor

kernelOfTruth commented Jun 11, 2015

With current master code, there are 873419 contentions on arcs_mtx and an average wait time of 1.85ms. With this patch stack, the ARC-related contentions are pushed down into the noise. There are only 7589 contentions on mls_lock with an average wait time of 14.95us.

That sounds VERY interesting.

Guess I'll give this a spin against vanilla (if it's not merged before I have the opportunity to update my zfs drivers).

@dweeezil
Occasionally I get sound stutters (currently running 4.1-rc7 with threadirqs, low-latency) - hopefully this shows a positive effect in that direction.

Maximum latency during "high" I/O (one rsync backup job to a btrfs volume), measured with latencytop, was around 3000 ms (3 seconds!).

Using a different CPU scheduler (BFS instead of CFS) also helps, but so far there were some noticeable stalls, latency spikes, etc. when using ZFS - so looking forward to the merge.

Thanks


Member

behlendorf commented Jun 11, 2015

@dweeezil my testing held up as expected too. Since these changes are ready to go I've merged them to master. That's a nice big step forwards. Now to address some of the lingering performance issues you uncovered. Thanks for all your hard work on this!

06358ea Merge branch 'lock-contention-on-arcs_mtx-final'
121b3ca Increase arc_c_min to allow safe operation of arc_adapt()
f604673 Make arc_prune() asynchronous
c5528b9 Use taskq_wait_outstanding() function
4f34bd9 Add taskq_wait_outstanding() function
ca0bf58 Illumos 5497 - lock contention on arcs_mtx
b9541d6 Illumos 5408 - managing ZFS cache devices requires lots of RAM
2a43241 Illumos 5369 - arc flags should be an enum
ad4af89 Partially revert "Add ddt, ddt_entry, and l2arc_hdr caches"
97639d0 Revert "Allow arc_evict_ghost() to only evict meta data"
f6b3b1f Revert "fix l2arc compression buffers leak"
7807028 Revert "arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc"


Member

prakashsurya commented Jun 22, 2015

Woo! 👍 Nice work!! It's too bad this doesn't help the bottom-line performance numbers, but at least it moves the contention out of the ARC, which was the point. Reducing the average latency from milliseconds to tens of microseconds is good to hear! :)

EDIT: Oh, and let's not forget, this will also effectively give people who use an L2ARC device more RAM, since Illumos 5408 - managing ZFS cache devices requires lots of RAM was also merged in the context of this work. Good stuff.


Member

behlendorf commented Jun 22, 2015

@prakashsurya speaking of further slimming down the ARC header memory usage, it would be great to get your feedback on #1971. It pulls the kmutex_t and kcondvar_t out of the l1arc_buf_hdr structure, which results in significant additional savings with minimal additional contention.


@behlendorf behlendorf referenced this pull request Jun 29, 2015

Closed

ARC Sync Up #3533

0 of 19 tasks complete