[Proof of concept] A direct-reclaim exercising mechanism #3192

Closed
tuxoko opened this issue Mar 17, 2015 · 5 comments
Labels
Component: Memory Management kernel memory management Component: Test Suite Indicates an issue with the test framework or a test case

Comments

@tuxoko
Contributor

tuxoko commented Mar 17, 2015

Hi all,

In light of the recent deadlock issues related to direct reclaim after the kmem work, I wonder if we can come up with a mechanism for exercising direct reclaim.

My idea is to hook stub code into the kmem_alloc macro and friends. The stub would check the FSTRANS flag and, if it is not set, call into the direct reclaim paths with some probability.
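
A rough user-space sketch of that idea, in case it helps the discussion. It is not SPL or ZFS code: `in_fstrans`, `fake_direct_reclaim()`, `reclaim_percent` and `exercising_alloc()` are all hypothetical names invented for illustration, and `malloc()` stands in for the real allocator. It only shows the control flow: skip the hook when the flag is set, otherwise enter the reclaim stand-in with a configurable probability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the FSTRANS marker. In the kernel this
 * would be per-task state; here it is a plain flag for illustration. */
static bool in_fstrans = false;

/* Counts how often the fake reclaim path ran, so it can be observed. */
static unsigned long reclaim_calls = 0;

/* Percent chance (0..100) that an allocation exercises reclaim. */
static unsigned reclaim_percent = 10;

/* Hypothetical stand-in for entering a direct reclaim path. */
static void fake_direct_reclaim(void)
{
    reclaim_calls++;
}

/* Allocation wrapper: when the caller is not marked as being inside a
 * filesystem transaction, occasionally run the reclaim stand-in before
 * performing the allocation. */
static void *exercising_alloc(size_t size)
{
    assert(reclaim_percent <= 100);

    if (!in_fstrans && (unsigned)(rand() % 100) < reclaim_percent)
        fake_direct_reclaim();

    return malloc(size);
}
```

Setting `reclaim_percent` to 100 makes every unmarked allocation take the reclaim stand-in, which is the "much higher frequency" case; setting it to 0 disables the hook.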

It would also be great if this could be done in user space, since then we could run it in ztest. I'm not sure whether the inode reclaim path could be exercised in user space, though.

What do you guys think?

@behlendorf behlendorf added Component: Test Suite Indicates an issue with the test framework or a test case Component: Memory Management kernel memory management Difficulty - Hard labels Mar 19, 2015
@behlendorf
Contributor

@tuxoko I think that's a really interesting idea. Forcing direct reclaim to occur at a much higher frequency would certainly let us expose these kinds of deadlocks more easily. Unfortunately, I don't see any easy way to do this without patching the kernel. To expose the deadlocks, what we want is for try_to_free_pages() to be called.

Another way to possibly expose these issues early, and without causing a deadlock, is with the kernel's runtime lock checker. It should be able to analyze these call paths and determine whether a lock inversion or deadlock is possible. However, to make that happen we'll need to address the remaining areas of the code for which the checker currently generates false positives.

Kernel Lock Validator
http://lwn.net/Articles/185666/

Another way to attack this would be to run stress tests like the ones @dweeezil has put together under low memory conditions. This will almost certainly ensure the direct reclaim paths are run frequently.

@dweeezil
Contributor

@behlendorf I gave up on CONFIG_PROVE_LOCKING pretty early because ZFS has a bunch of cases in which nested locks are used. I started adding support for the various lock classes but decided to bail on that idea for the time being because it looked like it would be a pain to get it working properly and I was more interested in trying to track down the existing lock inversions. That said, I think it would be very useful to get ZFS working with PROVE_LOCKING.

My current testing regimen certainly seems to be quite good at catching lock inversions due to re-entering ZFS during reclaim and CONFIG_LOCK_STAT along with stack traces makes it pretty obvious where the problems are. I was rather surprised how frequently my little test scripts caused reclaim code to be exercised. Granted, it requires truncating the memory to 4GiB, but during the tests, there really isn't much memory pressure on the system. That said, however, part of my testing generally involves manually applying such pressure (which could easily be automated).
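
For what automating that pressure might look like, here is a minimal sketch. It is not the script used in the tests above: `apply_pressure()` is a hypothetical helper, and the 4 KiB page size is an assumption. It allocates a number of chunks and writes to each assumed page so the memory is actually backed, then releases everything.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed page size; a real tool would query it at run time. */
#define ASSUMED_PAGE_SIZE 4096

/* Allocate nchunks chunks of chunk_bytes each and write one byte per
 * assumed page so the kernel has to back the memory. Returns the number
 * of bytes successfully allocated; all memory is freed before returning. */
static size_t apply_pressure(size_t chunk_bytes, size_t nchunks)
{
    assert(chunk_bytes > 0);

    void **chunks = calloc(nchunks, sizeof(*chunks));
    size_t allocated = 0;

    if (chunks == NULL)
        return 0;

    for (size_t i = 0; i < nchunks; i++) {
        chunks[i] = malloc(chunk_bytes);
        if (chunks[i] == NULL)
            break; /* stop at the first failed allocation */

        /* volatile keeps the compiler from discarding the writes */
        volatile unsigned char *p = chunks[i];
        for (size_t off = 0; off < chunk_bytes; off += ASSUMED_PAGE_SIZE)
            p[off] = 0xA5;

        allocated += chunk_bytes;
    }

    /* Unused slots are NULL from calloc, and free(NULL) is a no-op. */
    for (size_t i = 0; i < nchunks; i++)
        free(chunks[i]);
    free(chunks);

    return allocated;
}
```

A real pressure tool would hold the memory while the workload runs rather than freeing it immediately, and would size the chunks relative to available memory.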

I'm hoping to nail down the one lock inversion I can reproduce within the next couple of days.

@chrisrd
Contributor

chrisrd commented Mar 20, 2015

Given the current flurry of deadlocks it seems CONFIG_PROVE_LOCKING might be the only sensible way forward. @dweeezil, do you have any of your initial work to share?

@behlendorf
Contributor

@dweeezil I sympathize, CONFIG_PROVE_LOCKING has been a sore point for too long now. It's something I've wanted to get working from day 1 because I think it would be very valuable, but I've never had the resources to invest in getting it working. As you said, it's always been easier just to run down and resolve the deadlock once it's uncovered. I think getting this working would be a great way to contribute to the project.

@tuxoko
Contributor Author

tuxoko commented Apr 1, 2015

closed by #3246

@tuxoko tuxoko closed this as completed Apr 1, 2015