Support preemptible kernels (CONFIG_PREEMPT) #83
Comments
I think PREEMPT is just one issue; I think it's related to kswapd, which shows unusually high CPU load even with no swap configured. For example, disable all swap on your system and start bonnie++ on a mounted zvol (I used ext4); as soon as it reaches "Reading with getc()...done", kswapd hits very high CPU load and the system becomes heavily loaded. I'm unsure how to track that down. I'm using a tickless kernel with 1000 Hz set (though this shouldn't matter on a tickless kernel), I use SLUB instead of SLAB, and currently I use Server (no preemption), as this seemed to help a bit. Apart from that there's nothing special in my kernel configuration. |
Alright, with some help from Brian I was able to solve this. Running: zfs set primarycache=metadata zfspool/wdp seems to fix the issue. That's probably due to double caching (I guess Brian can explain this better). |
Actually it sounds like disabling the cache entirely for ZVOLs is the right thing to do in the short term. I'll take a note to make this the default behavior for ZVOLs. zfs set primarycache=none tank/fish |
Is this issue still valid? In the current ZPL branch, zvols have primarycache=all by default.
|
It's still valid. It's waiting for someone to write a small patch which sets the default primarycache value for zvols to 'none' instead of 'all'. |
Do we still need to take PREEMPT out of the kernel config? |
Absolutely, until this gets fixed you need to disable PREEMPT in your kernel. In fact, we should just add a check to configure for now to detect this and produce a fatal error message if your kernel has PREEMPT enabled. |
primarycache=none kills the performance. primarycache=all hits a SPL PANIC in line 558 of zfs-znode.c. No PREEMPT is used in the kernel config. |
Commit: 30d8f8c993f83d481957d2600d89801645a41f27 Make CONFIG_PREEMPT Fatal Until support is added for preemptible kernels detect this at |
I have been looking at this, and I cannot find how
How do you suggest we identify these callers? Or just by trial and error with |
You've picked a particularly tricky issue to cut your teeth on. But let me try and point you in the right direction. I believe all the calls to smp_processor_id() we're concerned with occur in the spl's slab implementation. The problem is that this code wasn't written to be preemptible. That is, it assumes that unless it explicitly calls schedule() or a function which can block, it cannot be rescheduled to a different processor. This was done so unlocked per-cpu caches could be used to minimize lock contention and get good performance. Now when CONFIG_PREEMPT is defined that's no longer the case. For example, take the spl_magazine_age() function, which calls smp_processor_id() near the top and stores the cpu it's currently running on in the variable 'i'. Since preemption is enabled this process could be immediately rescheduled to a different core, resulting in this value being wrong. That would result in us accessing the wrong per-cpu cache and destroying the accounting which is being done. There are certain critical regions like this where preemption must be briefly disabled to ensure correctness. |
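To make the hazard and the proposed fix concrete, here is a minimal userspace sketch, not the actual spl code. The names preempt_disable(), preempt_enable() and smp_processor_id() mirror the kernel primitives but are stubbed out here; maybe_migrate(), struct magazine and magazine_age_sketch() are hypothetical and exist only for illustration.

```c
#include <assert.h>

#define NR_CPUS 4

/* Userspace stand-ins (assumptions) for the kernel primitives. */
static int preempt_count;   /* > 0 means preemption is disabled */
static int current_cpu;     /* CPU the simulated task is running on */

static void preempt_disable(void)  { preempt_count++; }
static void preempt_enable(void)   { preempt_count--; }
static int  smp_processor_id(void) { return current_cpu; }

/* Simulated scheduler: it may move the task only while preemption is enabled. */
static void maybe_migrate(int new_cpu)
{
    if (preempt_count == 0)
        current_cpu = new_cpu;
}

/* Hypothetical unlocked per-CPU cache; only its own CPU may touch it. */
struct magazine { int aged; };
static struct magazine magazines[NR_CPUS];

/* Ages the magazine of the current CPU and returns that CPU's id. */
static int magazine_age_sketch(int migrate_to)
{
    int cpu;

    preempt_disable();                  /* start of the critical section */
    cpu = smp_processor_id();
    maybe_migrate(migrate_to);          /* a preemption attempt; refused here */
    assert(cpu == smp_processor_id());  /* the cached id is still valid */
    magazines[cpu].aged++;
    preempt_enable();                   /* end of the critical section */

    return cpu;
}
```

Without the preempt_disable()/preempt_enable() pair, the simulated migration would succeed between reading the CPU id and updating the magazine, which is the wrong-cache access described above.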
Ok thanks for explaining this to me. Can you just confirm that I understand this correctly.
|
Yes. Although notice that spl_cache_grow(), which is called from spl_cache_refill(), will briefly re-enable interrupts so it can safely allocate a new slab. Upon return spl_cache_refill() will check if the process was rescheduled to a different cpu while interrupts/preemption were enabled. For spl_magazine_age it would probably suffice to simply wrap the critical section in preempt_disable()/preempt_enable(). The section is small enough that this is a reasonable approach. The smp_processor_id() entries in spl-debug.c are used strictly for debugging purposes. It should be safe to simply wrap the two call sites with preempt_disable()/preempt_enable(). This code will be rarely called unless someone is debugging or the system trips an ASSERT/VERIFY. In either case it's not performance critical. If you can make the spl_magazine_age fix and do some testing with preemption enabled to make sure it's working as expected, we could consider supporting this. The key will be getting enough testing to make sure nothing was missed. |
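The "check whether we were rescheduled" step can be sketched as follows. This is a userspace illustration under stated assumptions, not the real spl_cache_refill(): grow_cache_may_sleep(), pending_migration and cache_refill_sketch() are hypothetical names, and smp_processor_id() is a stub.

```c
#include <assert.h>

#define NR_CPUS 4

/* Userspace stand-in (assumption) for the kernel's smp_processor_id(). */
static int current_cpu;
static int smp_processor_id(void) { return current_cpu; }

/* Hypothetical per-CPU cache. */
struct magazine { int objects; };
static struct magazine magazines[NR_CPUS];

/*
 * Hypothetical stand-in for a slab allocation that may sleep. If a
 * migration is pending, the task wakes up on a different CPU.
 */
static int pending_migration = -1;
static void grow_cache_may_sleep(void)
{
    if (pending_migration >= 0) {
        current_cpu = pending_migration;
        pending_migration = -1;
    }
}

/* Refills the magazine of the CPU we end up on and returns that CPU's id. */
static int cache_refill_sketch(void)
{
    int cpu;

    for (;;) {
        cpu = smp_processor_id();
        grow_cache_may_sleep();         /* window in which we may be moved */
        if (cpu == smp_processor_id())
            break;                      /* same CPU: the cached id is valid */
        /* Moved to another CPU: loop and re-read the id. */
    }

    /* In real kernel code this update would run with preemption disabled. */
    magazines[cpu].objects++;
    return cpu;
}
```

The design choice is to tolerate rescheduling during the blocking allocation and revalidate afterwards, rather than holding preemption off across a call that may sleep.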
I think testing is going to be very hard to do. The kernel needs to preempt at exactly the right time to test. I recently noticed that when I recompiled my kernel with CONFIG_DEBUG_MUTEXES disabled that I had CONFIG_PREEMPT enabled. I have been running ZFS on this system for 28 days and it seems to be fine, this most likely means that my kernel has NEVER preempted zfs at the exact time where it matters. This is a dual-core system. It is weird that when I first installed I came across this issue instantly then an issue related to CONFIG_DEBUG_MUTEXES, once I disabled CONFIG_DEBUG_MUTEXES my system has become stable. |
Are you sure? With the latest code the configure step should fail if your kernel has CONFIG_PREEMPT defined. Regardless, you're right about the testing; we'll want to make sure it's well tested. |
I compiled it ~28 days ago so it was just before CONFIG_PREEMPT detection was fixed. |
Supporting preemptible kernels does not require identifying the exact code paths where smp_processor_id() is called in a preemptible region. Instead, we only need to identify the entry points to ZFS code, disable preemption at the entry point and enable preemption at the exit point. That will cover those code paths by definition. That is not ideal, but it will work until a better solution can be put in place. @behlendorf How would you feel about splitting this issue into two parts? One would be supporting preemptible kernels. The other would be supporting preemption in the ZFS code itself. |
@gentoofan I'd prefer to just support preemption in the spl/zfs code itself. This shouldn't be a huge amount of work to fix since the spl slab is really the only place this should be an issue. I've been meaning to do it for years now, but since frankly this is only a desktop issue I've never prioritized it. |
This needs more testing, but I have been running this for some time without any issues. https://raw.github.com/kylef/ark/master/spl/preempt.patch |
@kylef Thanks for posting that. Nice work. I am testing it as gentoofan/spl@b8ea7afc96b7674ee3e0601afece68114d23a261. If all goes well, I will file appropriate pull requests with zfsonlinux/spl and zfsonlinux/zfs. |
@kylef My system has not crashed yet, but I am seeing many issues being reported to dmesg: http://paste.pocoo.org/show/583429/ It looks like spl_debug_msg and txg_hold_open need preempt_disable/preempt_enable. |
I am working on the issues I mentioned earlier. Interestingly, fixing the txg_hold_open issue causes another issue: [ 52.043212] BUG: scheduling while atomic: Chrome_CacheThr/5943/0x00000002 |
Is there any work going on with this issue? It was reported a year ago, so... |
Pull request #674 has preemption support. It works on my desktop, although it needs a little more attention before it is merged. |
@ryao, Sounds great. @behlendorf, any plans of taking this a step further so that more people can test it? |
@Nowaker This pull request is a good start but more work is really needed before preempt can be fully supported. In particular the spl kmem cache layer needs a little attention. |
After surveying the code, the few places where smp_processor_id is used were deemed to be safe to use with a preempt enabled kernel. As such, no core logic had to be changed. These smp_processor_id call sites are simply wrapped in kpreempt_disable and kpreempt_enable to prevent the Linux kernel from emitting scary warnings. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Issue #83
Full preemption support has been merged in to the spl and zfs master branches (thanks everybody). The only thing that remains are the autoconf checks which prevent people from using it. I'd like to remove those as well next week but I'd feel better about it if we could get some additional testing on the code. If you have the time I'd appreciate it if you could test this, I've made the following tags to test with autoconf checks reverted. https://github.com/behlendorf/spl/tarball/spl-0.6.0-rc10-preempt |
… io-fd (openzfs#83) * [TA1652][DE17] [DE37] Atomic inc and decrement of zinfo refcount done. Closing data fd lazily so that operations on stale fd can be avoided Wait reference to be drained out before freeing zinfo. Signed-off-by: satbir <satbir.chhikara@gmail.com>
Jean-Michel Bruenn has reported that there are problems with preemptible kernels.
http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/browse_thread/thread/ff44d9f001eb8f57#
It appears there are several code paths where smp_processor_id() is called in a preemptible region. The kernel in turn logs a message to the console, in fact lots and lots and lots of messages to the console, which bogs down the system making it look hung.
To fix this we will need to identify all callers of smp_processor_id and make them preempt safe by calling preempt_disable/preempt_enable as appropriate. This may end up being a little tricky in the slab since we make heavy use of implicitly locked per-cpu data structures to improve performance.
Until this is fixed CONFIG_PREEMPT should be disabled in your kernel build.