Support preemptible kernels (CONFIG_PREEMPT) #83

behlendorf · 2011-01-05T17:45:00Z

Jean-Michel Bruenn has reported that there are problems with preemptible kernels.

http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/browse_thread/thread/ff44d9f001eb8f57#

It appears there are several code paths where smp_processor_id() is called in a preemptible region. The kernel in turn logs a message to the console, in fact lots and lots and lots of messages to the console, which bogs down the system making it look hung.

BUG: using smp_processor_id() in preemptible [00000000] code: zpool/30907
caller is txg_hold_open+0x22/0x5f [zfs]
Pid: 30907, comm: zpool Tainted: P            2.6.35-lunar #1
Call Trace:
[] debug_smp_processor_id+0xc4/0xe0
[] txg_hold_open+0x22/0x5f [zfs]
[] dmu_tx_assign+0x161/0x357 [zfs]
[] spa_history_log+0x3d/0xfa [zfs]
[] spa_create+0x6d7/0x72f [zfs]
[] zfs_ioc_pool_create+0x1cb/0x238 [zfs]
[] zfsdev_ioctl+0x100/0x15c [zfs]
[] vfs_ioctl+0x36/0xa7
[] do_vfs_ioctl+0x42a/0x45b
[] sys_ioctl+0x47/0x6d

To fix this we will need to identify all callers of smp_processor_id and make them preempt safe by calling preempt_disable/preempt_enable as appropriate. This may end up being a little tricky in the slab since we make heavy use of implicitly locked per-cpu data structures to improve performance.

Until this is fixed CONFIG_PREEMPT should be disabled in your kernel build.

chani · 2011-01-06T17:37:42Z

I think PREEMPT is just one issue; i think it's related to kswapd which gets unusual high cpu load even with no swap set. For example disable all swap in your system, start bonnie++ onto a mounted zvol (i used ext4) and as soon as it does:

Reading with getc()...done

kswapd reaches very high cpu loads and the system starts to become very loaded. I'm unsure how to track that down. I'm using a tickless kernel, with 1000hz set (though this shouldn't matter on a tickless kernel) i use slub instead of slab and currently i use Server (no preemption) as this seemed to help a bit. Apart from that there's nothing special in my kernel configuration.

chani · 2011-01-06T18:50:14Z

Alright, with some help of brian i was able to solve this. Doing:

zfs set primarycache=metadata zfspool/wdp

is fixing this issue it seems. Thats probably due to double-caching (i guess brian can explain this better)

behlendorf · 2011-01-20T18:00:01Z

Actually it sounds like disabling the cache entirely for ZVOLs is the right thing to do in the short term. I'll take a note to make this the default behavior for ZVOLs.

zfs set primarycache=none tank/fish

fajarnugraha · 2011-02-18T18:34:46Z

Is this issue still valid? In current ZPL branch, zvols have prrimarycache=all by default

# zfs create -V 1G vm/test
# zfs get primarycache vm/test
NAME     PROPERTY      VALUE         SOURCE
vm/test  primarycache  all           default

behlendorf · 2011-02-18T21:27:49Z

It's still valid. It's waiting for someone to write a small patch which sets the default primarycache value for zvols to 'none' instead of 'all'.

devsk · 2011-02-19T18:39:44Z

Do we still need to take PREEMPT out of the kernel config?

behlendorf · 2011-02-19T19:37:14Z

Absolutely, until this gets fixed you need to disable PREEMPT in your kernel. In fact, we should just add a check to configure for now to detect this for now and produce a fatal error message if you kernel has PREEMPT enabled.

devsk · 2011-02-20T00:01:04Z

primarycache=none kills the performance. primarycache=all hits a SPL PANIC in line 558 of zfs-znode.c. No PREEMPT is used in the kernel config.

behlendorf · 2011-03-07T19:00:32Z

Commit: 30d8f8c993f83d481957d2600d89801645a41f27

Make CONFIG_PREEMPT Fatal

Until support is added for preemptible kernels detect this at
configure time and make it fatal. Otherwise, it is possible to
have a successful build and kernel modules with flakey behavior.

kylef · 2011-08-09T03:12:19Z

I have been looking at this, and I cannot find how txg_hold_open (or anything it calls) will execute smp_processor_id in a preemptable region. The only two things tag_hold_open calls are pthread_self and the mutex_enter macro which should call the mutex_lock. These definitely should not call smp_processor_id in a preemptable region.

To fix this we will need to identify all callers of smp_processor_id and make them preempt safe by calling preempt_disable/preempt_enable as appropriate.

How do you suggest we identify these callers? Or just by trial and error with CONFIG_DEBUG_PREEMPT until we catch them all?

behlendorf · 2011-08-09T21:59:18Z

You've picked a particularly tricky issue to cut your teeth on. But left me try and point you in the right direction.

I believe all the calls to smp_processor_id() we're concerned with occur in the spl's slab implementation. The problem is that this code wasn't written to be preemptible. That is it assumes that unless it explicitly calls schedule() or a function which can block it cannot be rescheduled to a different processor. This was done so unlocked per-cpu caches could be used to minimize lock contention and get good performance.

Now when CONFIG_PREEMPT is defined that's no longer the case. For example take the spl_magazine_age() function which calls smp_processor_id() near the top and stores the cpu it's currently running on in the variable 'i'. Since preemption is enabled this process could be immediately rescheduled to a different core resulting in this value being wrong. That would result in us accessing the wrong per-cpu cache and destroying the accounting which is being done. There are certain critical regions like this where preemption must be briefly disabled to ensure correctness.

kylef · 2011-08-11T03:06:50Z

Ok thanks for explaining this to me. Can you just confirm that I understand this correctly.

spl_kmem_cache_free and spl_kmem_cache_alloc are preempt safe because they already disable preemption when they disable interrupts (with local_irq_disable). This makes spl_cache_refill safe because it is called from inside spl_kmem_cache_alloc. This just leaves spl_magazine_age which is run in a workqueue, a workqueue can be ran on any CPU. Alternatively, this could be replaced with a softirq's. I have not looked into how to use softirq's yet, am I understanding this correctly and should I replace the workqueue with a softirq?

trace_set_debug_header sets header->ph_cpu_id = smp_processor_id() but nothing ever references ph_cpu_id. Although it could be incorrect if it was used for whatever reason, because the processor may be different since it was set.

behlendorf · 2011-08-19T18:46:24Z

Yes. Although notice that spl_cache_grow() which is called from spl_cache_refill() will briefly re-enable interrupts so it can safely allocate a new slab. Upon return spl_cache_refill() will check if the process was rescheduled to a different cpu while interrupts/preemption where enabled.

For spl_magazine_age it would probably suffice to simply wrap the critical section in preempt_disable()/preempt_enable(). The section is small enough that this is a reasonable approach.

The smp_processor_id() entries in spl-debug.c are used strictly for debugging purposes. It should be safe to simply wrap the two call sites with preempt_disable()/preempt_enable(). This code will be rarely called unless someone is debugging or the system is trips an ASSERT/VERIFY. In either case it's not performance critical.

If you can make the spl_magazine_age fix and do some testing with preemption enabled to make sure it's working as expected we could consider supporting this. The key will be getting enough testing to make sure nothing was missed.

kylef · 2011-08-19T19:32:09Z

If you can make the spl_magazine_age fix and do some testing with preemption enabled to make sure it's working as expected we could consider supporting this. The key will be getting enough testing to make sure nothing was missed.

I think testing is going to be very hard to do. The kernel needs to preempt at exactly the right time to test. I recently noticed that when I recompiled my kernel with CONFIG_DEBUG_MUTEXES disabled that I had CONFIG_PREEMPT enabled. I have been running ZFS on this system for 28 days and it seems to be fine, this most likely means that my kernel has NEVER preempted zfs at the exact time where it matters. This is a dual-core system. It is weird that when I first installed I came across this issue instantly then an issue related to CONFIG_DEBUG_MUTEXES, once I disabled CONFIG_DEBUG_MUTEXES my system has become stable.

behlendorf · 2011-08-19T20:51:54Z

Are you sure? With the latest code the configure step should fail if your kernel has CONFIG_PREEMPT defined. Regardless, your right about the testing we'll want to make sure it's well tested.

kylef · 2011-08-19T22:19:06Z

I compiled it ~28 days ago so it was just before CONFIG_PREEMPT detection was fixed.

ryao · 2012-04-16T08:32:16Z

Supporting preemptible kernels does not require identifying the exact code paths where smp_processor_id() is called in a preemptible region. Instead, we only need to identify the entry points to ZFS code, disable preemption at the entry point and enable preemption at the exit point. That will cover those code paths by definition. That is unideal, but it will work until a better solution can be put in place.

@behlendorf How would you feel about splitting this issue into two parts. One would be supporting preemptible kernels. The other is supporting preemption in the ZFS code itself?

behlendorf · 2012-04-18T04:09:13Z

@gentoofan I'd prefer to just support preemption in the spl/zfs code itself. This shouldn't be a huge amount of work to fix since the spl slab is really the only place this should be an issue. I've been meaning to do it for years now, but since frankly this is only a desktop issue I've never prioritized it.

kylef · 2012-04-18T08:10:15Z

This needs more testing, but I have been running this for some time without any issues. https://raw.github.com/kylef/ark/master/spl/preempt.patch

ryao · 2012-04-18T08:49:22Z

@kylef Thanks for posting that. Nice work.

I am testing it as gentoofan/spl@b8ea7afc96b7674ee3e0601afece68114d23a261. If all goes well, I will file appropriate pull requests with zfsonlinux/spl and zfsonlinux/zfs.

ryao · 2012-04-18T09:09:41Z

@kylef My system has not crashed yet, but I am seeing many issues being reported to dmesg:

http://paste.pocoo.org/show/583429/

It looks like spl_debug_msg and txg_hold_open need preempt_disable/preempt_enable.

ryao · 2012-04-18T10:48:50Z

I am working on the issues I mentioned earlier. Interestingly, fixing the txg_hold_open issue causes another issue:

[ 52.043212] BUG: scheduling while atomic: Chrome_CacheThr/5943/0x00000002
[ 52.043215] Modules linked in: bridge ipv6 stp llc snd_hda_codec_analog arc4 nvidia(PO) rtl8187 mac80211 cfg80211 eeprom_93cx6 firewire_ohci firewire_core rtc_cmos crc_itu_t r8169 asus_atk0110 floppy snd_hda_intel hwmon mii snd_hda_codec i2c_i801 snd_pcm iTCO_wdt evdev snd_timer sky2 snd agpgart processor soundcore i2c_core snd_page_alloc unix sha256_generic fuse zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate nfs nfs_acl auth_rpcgss lockd sunrpc ext4 jbd2 mbcache scsi_wait_scan hid_monterey hid_microsoft hid_logitech hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech usbhid uhci_hcd usb_storage ehci_hcd usbcore usb_common sr_mod cdrom sg pata_jmicron
[ 52.043282] Pid: 5943, comm: Chrome_CacheThr Tainted: P O 3.3.2 #14
[ 52.043284] Call Trace:
[ 52.043292] [] __schedule_bug+0x63/0x68
[ 52.043297] [] __schedule+0x54c/0x6a0
[ 52.043300] [] schedule+0x3a/0x50
[ 52.043303] [] __mutex_lock_slowpath+0xdf/0x180
[ 52.043306] [] mutex_lock+0x1d/0x40
[ 52.043326] [] txg_hold_open+0x50/0x90 [zfs]
[ 52.043364] [] dmu_tx_assign+0x1dc/0xc00 [zfs]
[ 52.043375] [] ? dsl_dataset_block_freeable+0x3b/0x50 [zfs]
[ 52.043385] [] ? dmu_tx_callback_register+0xe1/0x150 [zfs]
[ 52.043395] [] zfs_write+0x356/0xd10 [zfs]
[ 52.043405] [] zpl_write_common+0x4d/0x110 [zfs]
[ 52.043413] [] zpl_write_common+0xe3/0x110 [zfs]
[ 52.043417] [] vfs_write+0xae/0x180
[ 52.043419] [] sys_pwrite64+0xa2/0xb0
[ 52.043424] [] system_call_fastpath+0x16/0x1b
[ 54.656010] br0: port 1(eth0) entered forwarding state

Nowaker · 2012-06-11T00:57:58Z

Is there any work going on with this issue? It was reported a year ago, so...

ryao · 2012-06-11T11:21:27Z

Pull request #674 has preemption support. It works on my desktop, although it needs a little more attention before it is merged.

Nowaker · 2012-06-11T20:49:19Z

@ryao, Sounds great. @behlendorf, any plans of taking this a step further so that more people can test it?

behlendorf · 2012-06-12T15:53:17Z

@Nowaker This pull request is a good start but more work is really needed before preempt can be fully supported. In particular the spl kmem cache layer needs a little attention.

After surveying the code, the few places where smp_processor_id is used were deemed to be safe to use with a preempt enabled kernel. As such, no core logic had to be changed. These smp_processor_id call sites are simply are wrapped in kpreempt_disable and kpreempt_enabled to prevent the Linux kernel from emitting scary warnings. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Issue #83

behlendorf · 2012-08-25T04:14:31Z

Full preemption support has been merged in to the spl and zfs master branches (thanks everybody). The only thing that remains are the autoconf checks which prevent people from using it. I'd like to remove those as well next week but I'd feel better about it if we could get some additional testing on the code. If you have the time I'd appreciate it if you could test this, I've made the following tags to test with autoconf checks reverted.

https://github.com/behlendorf/spl/tarball/spl-0.6.0-rc10-preempt
https://github.com/behlendorf/zfs/tarball/zfs-0.6.0-rc10-preempt

… io-fd (openzfs#83) * [TA1652][DE17] [DE37] Atomic inc and decrement of zinfo refcount done. Closing data fd lazily so that operations on stale fd can be avoided Wait reference to be drained out before freeing zinfo. Signed-off-by: satbir <satbir.chhikara@gmail.com>

rssalerno mentioned this issue Mar 12, 2012

txg_sync at 100% CPU #596

Closed

behlendorf mentioned this issue Apr 3, 2012

Integrate ARC more tightly with Linux #618

Closed

This was referenced Apr 16, 2012

Disable direct reclaim on zvols #669

Closed

page allocation failure #631

Closed

ryao mentioned this issue Apr 18, 2012

Preemption Support #674

Closed

ryao mentioned this issue Apr 19, 2012

z_wr_iss/0: page allocation failure #540

Closed

behlendorf closed this as completed in f86373f Aug 27, 2012

gg7 mentioned this issue Feb 7, 2017

Can't mount or unmount filesystem with legacy mountpoint #5755

Closed

sdimitro pushed a commit to sdimitro/zfs that referenced this issue Feb 14, 2022

DOSE-442 zpool export tests (openzfs#83)

dbfa19d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support preemptible kernels (CONFIG_PREEMPT) #83

Support preemptible kernels (CONFIG_PREEMPT) #83

behlendorf commented Jan 5, 2011

chani commented Jan 6, 2011

chani commented Jan 6, 2011

behlendorf commented Jan 20, 2011

fajarnugraha commented Feb 18, 2011

behlendorf commented Feb 18, 2011

devsk commented Feb 19, 2011

behlendorf commented Feb 19, 2011

devsk commented Feb 20, 2011

behlendorf commented Mar 7, 2011

kylef commented Aug 9, 2011

behlendorf commented Aug 9, 2011

kylef commented Aug 11, 2011

behlendorf commented Aug 19, 2011

kylef commented Aug 19, 2011

behlendorf commented Aug 19, 2011

kylef commented Aug 19, 2011

ryao commented Apr 16, 2012

behlendorf commented Apr 18, 2012

kylef commented Apr 18, 2012

ryao commented Apr 18, 2012

ryao commented Apr 18, 2012

ryao commented Apr 18, 2012

Nowaker commented Jun 11, 2012

ryao commented Jun 11, 2012

Nowaker commented Jun 11, 2012

behlendorf commented Jun 12, 2012

behlendorf commented Aug 25, 2012

Support preemptible kernels (CONFIG_PREEMPT) #83

Support preemptible kernels (CONFIG_PREEMPT) #83

Comments

behlendorf commented Jan 5, 2011

chani commented Jan 6, 2011

chani commented Jan 6, 2011

behlendorf commented Jan 20, 2011

fajarnugraha commented Feb 18, 2011

behlendorf commented Feb 18, 2011

devsk commented Feb 19, 2011

behlendorf commented Feb 19, 2011

devsk commented Feb 20, 2011

behlendorf commented Mar 7, 2011

kylef commented Aug 9, 2011

behlendorf commented Aug 9, 2011

kylef commented Aug 11, 2011

behlendorf commented Aug 19, 2011

kylef commented Aug 19, 2011

behlendorf commented Aug 19, 2011

kylef commented Aug 19, 2011

ryao commented Apr 16, 2012

behlendorf commented Apr 18, 2012

kylef commented Apr 18, 2012

ryao commented Apr 18, 2012

ryao commented Apr 18, 2012

ryao commented Apr 18, 2012

Nowaker commented Jun 11, 2012

ryao commented Jun 11, 2012

Nowaker commented Jun 11, 2012

behlendorf commented Jun 12, 2012

behlendorf commented Aug 25, 2012