Support preemptible kernels (CONFIG_PREEMPT) #83

Closed
behlendorf opened this Issue · 27 comments

7 participants

Brian Behlendorf Jean Bruenn Fajar Arief Nugraha devsk Kyle Fuller Richard Yao Damian Nowak
Brian Behlendorf
Owner

Jean-Michel Bruenn has reported that there are problems with preemptible kernels.

http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/browse_thread/thread/ff44d9f001eb8f57#

It appears there are several code paths where smp_processor_id() is called in a preemptible region. The kernel in turn logs a warning to the console for each such call, and it logs so many of them that the system bogs down and looks hung.

BUG: using smp_processor_id() in preemptible [00000000] code: zpool/30907
caller is txg_hold_open+0x22/0x5f [zfs]
Pid: 30907, comm: zpool Tainted: P            2.6.35-lunar #1
Call Trace:
[] debug_smp_processor_id+0xc4/0xe0
[] txg_hold_open+0x22/0x5f [zfs]
[] dmu_tx_assign+0x161/0x357 [zfs]
[] spa_history_log+0x3d/0xfa [zfs]
[] spa_create+0x6d7/0x72f [zfs]
[] zfs_ioc_pool_create+0x1cb/0x238 [zfs]
[] zfsdev_ioctl+0x100/0x15c [zfs]
[] vfs_ioctl+0x36/0xa7
[] do_vfs_ioctl+0x42a/0x45b
[] sys_ioctl+0x47/0x6d

To fix this we will need to identify all callers of smp_processor_id and make them preempt safe by calling preempt_disable/preempt_enable as appropriate. This may end up being a little tricky in the slab since we make heavy use of implicitly locked per-cpu data structures to improve performance.
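The preempt_disable/preempt_enable wrapping described above might look roughly like the sketch below. It is a userspace model, not code from spl or zfs: the kernel primitives are replaced by simple stand-ins so it compiles outside the kernel, and `percpu_counter` and `percpu_counter_inc` are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Userspace stand-ins (assumptions) for the kernel primitives named in
 * this issue, so the sketch compiles outside the kernel. */
static int preempt_depth;   /* models the kernel's preempt count */
static int current_cpu;     /* models the CPU the task is running on */

static void preempt_disable(void) { preempt_depth++; }
static void preempt_enable(void)
{
    assert(preempt_depth > 0);  /* every enable must pair with a disable */
    preempt_depth--;
}
static int smp_processor_id(void) { return current_cpu; }

#define NCPUS 4
static long percpu_counter[NCPUS];  /* hypothetical per-CPU data */

/* Bump this CPU's counter. Preemption stays off from the moment the CPU
 * id is read until the per-CPU slot has been used, so the index cannot
 * go stale in between. */
static void percpu_counter_inc(void)
{
    preempt_disable();
    int cpu = smp_processor_id();
    percpu_counter[cpu]++;
    preempt_enable();
}
```

In real kernel code, get_cpu() and put_cpu() bundle the same disable, read, and enable steps into one pair of calls.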

Until this is fixed CONFIG_PREEMPT should be disabled in your kernel build.

Jean Bruenn

I think PREEMPT is just one issue; I think it's related to kswapd, which shows unusually high CPU load even with no swap set. For example, disable all swap on your system, start bonnie++ on a mounted zvol (I used ext4), and as soon as it prints:

Reading with getc()...done

kswapd reaches very high CPU load and the system becomes heavily loaded. I'm unsure how to track that down. I'm using a tickless kernel with 1000 Hz set (though this shouldn't matter on a tickless kernel), SLUB instead of SLAB, and currently the "Server (no preemption)" setting, as this seemed to help a bit. Apart from that there's nothing special in my kernel configuration.

Jean Bruenn

Alright, with some help from Brian I was able to solve this. Doing:

zfs set primarycache=metadata zfspool/wdp

seems to fix this issue. That's probably due to double caching (I guess Brian can explain this better).

Brian Behlendorf
Owner

Actually it sounds like disabling the cache entirely for ZVOLs is the right thing to do in the short term. I'll take a note to make this the default behavior for ZVOLs.

zfs set primarycache=none tank/fish

Fajar Arief Nugraha

Is this issue still valid? In the current ZPL branch, zvols have primarycache=all by default:

# zfs create -V 1G vm/test
# zfs get primarycache vm/test
NAME     PROPERTY      VALUE         SOURCE
vm/test  primarycache  all           default

Brian Behlendorf
Owner

It's still valid. It's waiting for someone to write a small patch which sets the default primarycache value for zvols to 'none' instead of 'all'.

devsk

Do we still need to take PREEMPT out of the kernel config?

Brian Behlendorf
Owner

Absolutely, until this gets fixed you need to disable PREEMPT in your kernel. In fact, for now we should just add a check to configure that detects this and produces a fatal error message if your kernel has PREEMPT enabled.

devsk

primarycache=none kills the performance. primarycache=all hits a SPL PANIC in line 558 of zfs-znode.c. No PREEMPT is used in the kernel config.

Brian Behlendorf
Owner

Commit: 30d8f8c

Make CONFIG_PREEMPT Fatal

Until support is added for preemptible kernels detect this at
configure time and make it fatal. Otherwise, it is possible to
have a successful build and kernel modules with flakey behavior.

Kyle Fuller

I have been looking at this, and I cannot find how txg_hold_open (or anything it calls) would execute smp_processor_id in a preemptible region. The only two things txg_hold_open calls are pthread_self and the mutex_enter macro, which should call mutex_lock. These definitely should not call smp_processor_id in a preemptible region.

> To fix this we will need to identify all callers of smp_processor_id and make them preempt safe by calling preempt_disable/preempt_enable as appropriate.

How do you suggest we identify these callers? Or just by trial and error with CONFIG_DEBUG_PREEMPT until we catch them all?

Brian Behlendorf
Owner

You've picked a particularly tricky issue to cut your teeth on. But let me try and point you in the right direction.

I believe all the calls to smp_processor_id() we're concerned with occur in the spl's slab implementation. The problem is that this code wasn't written to be preemptible. That is, it assumes that unless it explicitly calls schedule() or a function which can block, it cannot be rescheduled to a different processor. This was done so unlocked per-cpu caches could be used to minimize lock contention and get good performance.

Now when CONFIG_PREEMPT is defined that's no longer the case. For example take the spl_magazine_age() function which calls smp_processor_id() near the top and stores the cpu it's currently running on in the variable 'i'. Since preemption is enabled this process could be immediately rescheduled to a different core resulting in this value being wrong. That would result in us accessing the wrong per-cpu cache and destroying the accounting which is being done. There are certain critical regions like this where preemption must be briefly disabled to ensure correctness.
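The stale-index hazard described above can be modeled in userspace. The sketch below is not the spl_magazine_age() code: the kernel primitives are stand-ins, and `maybe_migrate`, `age_unprotected`, and `age_protected` are hypothetical names. `maybe_migrate` plays the role of a preemption point where the scheduler could move the task to another CPU.

```c
/* Userspace stand-ins (assumptions) for the kernel primitives. */
static int preempt_depth;   /* models the kernel's preempt count */
static int current_cpu;     /* models the CPU the task is running on */

static void preempt_disable(void) { preempt_depth++; }
static void preempt_enable(void)  { preempt_depth--; }
static int  smp_processor_id(void) { return current_cpu; }

/* Models a preemption point: the task can only be moved to another CPU
 * while preemption is enabled (preempt_depth == 0). */
static void maybe_migrate(int new_cpu)
{
    if (preempt_depth == 0)
        current_cpu = new_cpu;
}

/* Reads the CPU id, hits a preemption point, then reports whether the
 * saved id still names the CPU we are running on (1) or not (0). */
static int age_unprotected(int migrate_to)
{
    int i = smp_processor_id();
    maybe_migrate(migrate_to);      /* may move us: i can go stale */
    return i == smp_processor_id();
}

/* Same sequence with preemption disabled around the critical region. */
static int age_protected(int migrate_to)
{
    int ok;

    preempt_disable();
    int i = smp_processor_id();
    maybe_migrate(migrate_to);      /* no effect while preemption is off */
    ok = (i == smp_processor_id());
    preempt_enable();
    return ok;
}
```

With the task starting on CPU 0 and a migration to CPU 1 requested, the unprotected variant ends up holding an index for a CPU it is no longer on, which is the case where the wrong per-cpu cache would be touched.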

Kyle Fuller

OK, thanks for explaining this to me. Can you just confirm that I understand this correctly?

spl_kmem_cache_free and spl_kmem_cache_alloc are preempt safe because they already disable preemption when they disable interrupts (with local_irq_disable). This makes spl_cache_refill safe because it is called from inside spl_kmem_cache_alloc. That just leaves spl_magazine_age, which runs in a workqueue, and a workqueue can run on any CPU. Alternatively, the workqueue could be replaced with a softirq. I have not looked into how to use softirqs yet. Am I understanding this correctly, and should I replace the workqueue with a softirq?

trace_set_debug_header sets header->ph_cpu_id = smp_processor_id() but nothing ever references ph_cpu_id. If it were ever used it could be incorrect, though, because the task may have moved to a different processor since the value was set.

Brian Behlendorf
Owner

Yes. Although notice that spl_cache_grow(), which is called from spl_cache_refill(), will briefly re-enable interrupts so it can safely allocate a new slab. Upon return spl_cache_refill() will check if the process was rescheduled to a different cpu while interrupts/preemption were enabled.
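The "check after returning" idea can be sketched as follows. This is a simplified userspace model, not the actual spl code: `struct magazine`, `cache_grow_model`, `cache_refill_model`, and the `migrate_during_grow` knob are hypothetical, and the refill amount of 8 is arbitrary.

```c
#define NCPUS 2

struct magazine { int avail; };         /* hypothetical per-CPU cache */

static struct magazine mags[NCPUS];
static int current_cpu;                 /* models the CPU we run on */
static int migrate_during_grow = -1;    /* CPU to move to, or -1 for none */

static int smp_processor_id(void) { return current_cpu; }

/* Models the grow step: interrupts and preemption are enabled while it
 * runs, so the task may resume on a different CPU afterwards. */
static void cache_grow_model(void)
{
    if (migrate_during_grow >= 0)
        current_cpu = migrate_during_grow;
}

/* Refill the magazine captured by the caller. Returns 1 on success, or
 * 0 if the task changed CPUs during the grow step, in which case the
 * captured magazine belongs to another CPU and must be left alone so
 * the caller can retry with the right one. */
static int cache_refill_model(struct magazine *skm)
{
    cache_grow_model();

    if (skm != &mags[smp_processor_id()])
        return 0;   /* rescheduled: skm is stale, do not touch it */

    skm->avail += 8;
    return 1;
}
```

Comparing the captured pointer against the current CPU's magazine is cheap, and it turns a silent corruption of another CPU's cache into a harmless retry.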

For spl_magazine_age it would probably suffice to simply wrap the critical section in preempt_disable()/preempt_enable(). The section is small enough that this is a reasonable approach.

The smp_processor_id() entries in spl-debug.c are used strictly for debugging purposes. It should be safe to simply wrap the two call sites with preempt_disable()/preempt_enable(). This code will rarely be called unless someone is debugging or the system trips an ASSERT/VERIFY. In either case it's not performance critical.

If you can make the spl_magazine_age fix and do some testing with preemption enabled to make sure it's working as expected we could consider supporting this. The key will be getting enough testing to make sure nothing was missed.

Kyle Fuller

> If you can make the spl_magazine_age fix and do some testing with preemption enabled to make sure it's working as expected we could consider supporting this. The key will be getting enough testing to make sure nothing was missed.

I think testing is going to be very hard to do. The kernel needs to preempt at exactly the right time to trigger the problem. I recently noticed that when I recompiled my kernel with CONFIG_DEBUG_MUTEXES disabled, I had CONFIG_PREEMPT enabled. I have been running ZFS on this system for 28 days and it seems to be fine, which most likely means that my kernel has never preempted zfs at the exact moment where it matters. This is a dual-core system. Oddly, when I first installed I hit this issue instantly, then an issue related to CONFIG_DEBUG_MUTEXES; once I disabled CONFIG_DEBUG_MUTEXES, the system became stable.

Brian Behlendorf
Owner

Are you sure? With the latest code the configure step should fail if your kernel has CONFIG_PREEMPT defined. Regardless, you're right about the testing; we'll want to make sure it's well tested.

Kyle Fuller

I compiled it ~28 days ago so it was just before CONFIG_PREEMPT detection was fixed.

Richard Yao
ryao commented

Supporting preemptible kernels does not require identifying the exact code paths where smp_processor_id() is called in a preemptible region. Instead, we only need to identify the entry points to ZFS code, disable preemption at each entry point, and enable it again at the exit point. That will cover those code paths by definition. It is not ideal, but it will work until a better solution can be put in place.

@behlendorf How would you feel about splitting this issue into two parts? One would be supporting preemptible kernels. The other would be supporting preemption in the ZFS code itself.

Brian Behlendorf
Owner

@gentoofan I'd prefer to just support preemption in the spl/zfs code itself. This shouldn't be a huge amount of work to fix since the spl slab is really the only place this should be an issue. I've been meaning to do it for years now, but since frankly this is only a desktop issue I've never prioritized it.

Kyle Fuller
kylef commented

This needs more testing, but I have been running this for some time without any issues. https://raw.github.com/kylef/ark/master/spl/preempt.patch

Richard Yao
ryao commented

@kylef Thanks for posting that. Nice work.

I am testing it as gentoofan/spl@b8ea7af. If all goes well, I will file appropriate pull requests with zfsonlinux/spl and zfsonlinux/zfs.

Richard Yao
ryao commented

@kylef My system has not crashed yet, but I am seeing many issues being reported to dmesg:

http://paste.pocoo.org/show/583429/

It looks like spl_debug_msg and txg_hold_open need preempt_disable/preempt_enable.

Richard Yao
ryao commented

I am working on the issues I mentioned earlier. Interestingly, fixing the txg_hold_open issue causes another issue:

[ 52.043212] BUG: scheduling while atomic: Chrome_CacheThr/5943/0x00000002
[ 52.043215] Modules linked in: bridge ipv6 stp llc snd_hda_codec_analog arc4 nvidia(PO) rtl8187 mac80211 cfg80211 eeprom_93cx6 firewire_ohci firewire_core rtc_cmos crc_itu_t r8169 asus_atk0110 floppy snd_hda_intel hwmon mii snd_hda_codec i2c_i801 snd_pcm iTCO_wdt evdev snd_timer sky2 snd agpgart processor soundcore i2c_core snd_page_alloc unix sha256_generic fuse zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate nfs nfs_acl auth_rpcgss lockd sunrpc ext4 jbd2 mbcache scsi_wait_scan hid_monterey hid_microsoft hid_logitech hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech usbhid uhci_hcd usb_storage ehci_hcd usbcore usb_common sr_mod cdrom sg pata_jmicron
[ 52.043282] Pid: 5943, comm: Chrome_CacheThr Tainted: P O 3.3.2 #14
[ 52.043284] Call Trace:
[ 52.043292] [] __schedule_bug+0x63/0x68
[ 52.043297] [] __schedule+0x54c/0x6a0
[ 52.043300] [] schedule+0x3a/0x50
[ 52.043303] [] __mutex_lock_slowpath+0xdf/0x180
[ 52.043306] [] mutex_lock+0x1d/0x40
[ 52.043326] [] txg_hold_open+0x50/0x90 [zfs]
[ 52.043364] [] dmu_tx_assign+0x1dc/0xc00 [zfs]
[ 52.043375] [] ? dsl_dataset_block_freeable+0x3b/0x50 [zfs]
[ 52.043385] [] ? dmu_tx_callback_register+0xe1/0x150 [zfs]
[ 52.043395] [] zfs_write+0x356/0xd10 [zfs]
[ 52.043405] [] zpl_write_common+0x4d/0x110 [zfs]
[ 52.043413] [] zpl_write_common+0xe3/0x110 [zfs]
[ 52.043417] [] vfs_write+0xae/0x180
[ 52.043419] [] sys_pwrite64+0xa2/0xb0
[ 52.043424] [] system_call_fastpath+0x16/0x1b
[ 54.656010] br0: port 1(eth0) entered forwarding state
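The trace above is consistent with a sleeping lock being taken while preemption is disabled: mutex_lock() may call schedule(), which is not allowed in atomic context. The userspace sketch below models that, with stand-ins for the kernel primitives. `mutex_lock_model`, `hold_open_wide`, and `hold_open_narrow` are hypothetical names, and the narrow variant is only one possible way to restructure the code, not necessarily what was merged.

```c
/* Userspace stand-ins (assumptions) for the kernel primitives. */
static int preempt_depth;       /* models the kernel's preempt count */
static int current_cpu;         /* models the CPU the task is running on */
static int atomic_sleep_bugs;   /* counts "scheduling while atomic" events */

static void preempt_disable(void) { preempt_depth++; }
static void preempt_enable(void)  { preempt_depth--; }
static int  smp_processor_id(void) { return current_cpu; }

/* Models a sleeping lock: taking it may schedule, which is a bug if
 * preemption is currently disabled. */
static void mutex_lock_model(void)
{
    if (preempt_depth > 0)
        atomic_sleep_bugs++;
}

/* Too wide: the sleeping lock is taken inside the atomic section. */
static int hold_open_wide(void)
{
    preempt_disable();
    int cpu = smp_processor_id();
    mutex_lock_model();             /* may sleep while atomic */
    preempt_enable();
    return cpu;
}

/* Narrow: only the CPU id read is atomic. The id may be stale by the
 * time the lock is taken, which is tolerable only if the per-CPU
 * structure it selects is protected by that lock (an assumption here). */
static int hold_open_narrow(void)
{
    preempt_disable();
    int cpu = smp_processor_id();
    preempt_enable();
    mutex_lock_model();             /* safe: preemption is enabled again */
    return cpu;
}
```

The same reasoning applies to disabling preemption at every entry point: any path that can block would then trip this check.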

Richard Yao ryao referenced this issue from a commit
Commit has since been removed from the repository and is no longer available.
Damian Nowak

Is there any work going on with this issue? It was reported a year ago, so...

Richard Yao
ryao commented

Pull request #674 has preemption support. It works on my desktop, although it needs a little more attention before it is merged.

Damian Nowak

@ryao, Sounds great. @behlendorf, any plans of taking this a step further so that more people can test it?

Brian Behlendorf
Owner

@Nowaker This pull request is a good start but more work is really needed before preempt can be fully supported. In particular the spl kmem cache layer needs a little attention.

Brian Behlendorf behlendorf referenced this issue from a commit
Prakash Surya Wrap smp_processor_id in kpreempt_[dis|en]able
After surveying the code, the few places where smp_processor_id is used
were deemed to be safe to use with a preempt enabled kernel. As such, no
core logic had to be changed. These smp_processor_id call sites are simply
wrapped in kpreempt_disable and kpreempt_enable to prevent the
Linux kernel from emitting scary warnings.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Issue #83
15a9e03
Brian Behlendorf
Owner

Full preemption support has been merged into the spl and zfs master branches (thanks everybody). All that remains are the autoconf checks which prevent people from using it. I'd like to remove those as well next week, but I'd feel better about it if we could get some additional testing on the code. If you have the time, I'd appreciate it if you could test this. I've made the following tags to test with; they have the autoconf checks reverted.

https://github.com/behlendorf/spl/tarball/spl-0.6.0-rc10-preempt
https://github.com/behlendorf/zfs/tarball/zfs-0.6.0-rc10-preempt

Brian Behlendorf behlendorf closed this issue from a commit
Prakash Surya Remove autoconf check for CONFIG_PREEMPT
The autoconf macro which failed if CONFIG_PREEMPT was set in the kernel
config was removed. With the inclusion of a few previous patches
targeting support for preempt enabled kernels, it is now safe to run
with this kernel config option enabled.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #83
f86373f
Brian Behlendorf behlendorf closed this in f86373f