Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support preemptible kernels (CONFIG_PREEMPT) #83

Closed
behlendorf opened this issue Jan 5, 2011 · 27 comments
Closed

Support preemptible kernels (CONFIG_PREEMPT) #83

behlendorf opened this issue Jan 5, 2011 · 27 comments
Labels
Type: Feature Feature request or new feature
Milestone

Comments

@behlendorf
Copy link
Contributor

Jean-Michel Bruenn has reported that there are problems with preemptible kernels.

http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/browse_thread/thread/ff44d9f001eb8f57#

It appears there are several code paths where smp_processor_id() is called in a preemptible region. The kernel in turn logs a message to the console, in fact lots and lots and lots of messages to the console, which bogs down the system making it look hung.

BUG: using smp_processor_id() in preemptible [00000000] code: zpool/30907
caller is txg_hold_open+0x22/0x5f [zfs]
Pid: 30907, comm: zpool Tainted: P            2.6.35-lunar #1
Call Trace:
[] debug_smp_processor_id+0xc4/0xe0
[] txg_hold_open+0x22/0x5f [zfs]
[] dmu_tx_assign+0x161/0x357 [zfs]
[] spa_history_log+0x3d/0xfa [zfs]
[] spa_create+0x6d7/0x72f [zfs]
[] zfs_ioc_pool_create+0x1cb/0x238 [zfs]
[] zfsdev_ioctl+0x100/0x15c [zfs]
[] vfs_ioctl+0x36/0xa7
[] do_vfs_ioctl+0x42a/0x45b
[] sys_ioctl+0x47/0x6d

To fix this we will need to identify all callers of smp_processor_id and make them preempt safe by calling preempt_disable/preempt_enable as appropriate. This may end up being a little tricky in the slab since we make heavy use of implicitly locked per-cpu data structures to improve performance.

Until this is fixed CONFIG_PREEMPT should be disabled in your kernel build.

@chani
Copy link

chani commented Jan 6, 2011

I think PREEMPT is just one issue; i think it's related to kswapd which gets unusual high cpu load even with no swap set. For example disable all swap in your system, start bonnie++ onto a mounted zvol (i used ext4) and as soon as it does:

Reading with getc()...done

kswapd reaches very high cpu loads and the system starts to become very loaded. I'm unsure how to track that down. I'm using a tickless kernel, with 1000hz set (though this shouldn't matter on a tickless kernel) i use slub instead of slab and currently i use Server (no preemption) as this seemed to help a bit. Apart from that there's nothing special in my kernel configuration.

@chani
Copy link

chani commented Jan 6, 2011

Alright, with some help of brian i was able to solve this. Doing:

zfs set primarycache=metadata zfspool/wdp

is fixing this issue it seems. Thats probably due to double-caching (i guess brian can explain this better)

@behlendorf
Copy link
Contributor Author

Actually it sounds like disabling the cache entirely for ZVOLs is the right thing to do in the short term. I'll take a note to make this the default behavior for ZVOLs.

zfs set primarycache=none tank/fish

@fajarnugraha
Copy link
Contributor

Is this issue still valid? In current ZPL branch, zvols have prrimarycache=all by default

# zfs create -V 1G vm/test
# zfs get primarycache vm/test
NAME     PROPERTY      VALUE         SOURCE
vm/test  primarycache  all           default

@behlendorf
Copy link
Contributor Author

It's still valid. It's waiting for someone to write a small patch which sets the default primarycache value for zvols to 'none' instead of 'all'.

@devsk
Copy link

devsk commented Feb 19, 2011

Do we still need to take PREEMPT out of the kernel config?

@behlendorf
Copy link
Contributor Author

Absolutely, until this gets fixed you need to disable PREEMPT in your kernel. In fact, we should just add a check to configure for now to detect this for now and produce a fatal error message if you kernel has PREEMPT enabled.

@devsk
Copy link

devsk commented Feb 20, 2011

primarycache=none kills the performance. primarycache=all hits a SPL PANIC in line 558 of zfs-znode.c. No PREEMPT is used in the kernel config.

@behlendorf
Copy link
Contributor Author

Commit: 30d8f8c993f83d481957d2600d89801645a41f27

Make CONFIG_PREEMPT Fatal

Until support is added for preemptible kernels detect this at
configure time and make it fatal. Otherwise, it is possible to
have a successful build and kernel modules with flakey behavior.

@kylef
Copy link
Contributor

kylef commented Aug 9, 2011

I have been looking at this, and I cannot find how txg_hold_open (or anything it calls) will execute smp_processor_id in a preemptable region. The only two things tag_hold_open calls are pthread_self and the mutex_enter macro which should call the mutex_lock. These definitely should not call smp_processor_id in a preemptable region.

To fix this we will need to identify all callers of smp_processor_id and make them preempt safe by calling preempt_disable/preempt_enable as appropriate.

How do you suggest we identify these callers? Or just by trial and error with CONFIG_DEBUG_PREEMPT until we catch them all?

@behlendorf
Copy link
Contributor Author

You've picked a particularly tricky issue to cut your teeth on. But left me try and point you in the right direction.

I believe all the calls to smp_processor_id() we're concerned with occur in the spl's slab implementation. The problem is that this code wasn't written to be preemptible. That is it assumes that unless it explicitly calls schedule() or a function which can block it cannot be rescheduled to a different processor. This was done so unlocked per-cpu caches could be used to minimize lock contention and get good performance.

Now when CONFIG_PREEMPT is defined that's no longer the case. For example take the spl_magazine_age() function which calls smp_processor_id() near the top and stores the cpu it's currently running on in the variable 'i'. Since preemption is enabled this process could be immediately rescheduled to a different core resulting in this value being wrong. That would result in us accessing the wrong per-cpu cache and destroying the accounting which is being done. There are certain critical regions like this where preemption must be briefly disabled to ensure correctness.

@kylef
Copy link
Contributor

kylef commented Aug 11, 2011

Ok thanks for explaining this to me. Can you just confirm that I understand this correctly.

spl_kmem_cache_free and spl_kmem_cache_alloc are preempt safe because they already disable preemption when they disable interrupts (with local_irq_disable). This makes spl_cache_refill safe because it is called from inside spl_kmem_cache_alloc. This just leaves spl_magazine_age which is run in a workqueue, a workqueue can be ran on any CPU. Alternatively, this could be replaced with a softirq's. I have not looked into how to use softirq's yet, am I understanding this correctly and should I replace the workqueue with a softirq?

trace_set_debug_header sets header->ph_cpu_id = smp_processor_id() but nothing ever references ph_cpu_id. Although it could be incorrect if it was used for whatever reason, because the processor may be different since it was set.

@behlendorf
Copy link
Contributor Author

Yes. Although notice that spl_cache_grow() which is called from spl_cache_refill() will briefly re-enable interrupts so it can safely allocate a new slab. Upon return spl_cache_refill() will check if the process was rescheduled to a different cpu while interrupts/preemption where enabled.

For spl_magazine_age it would probably suffice to simply wrap the critical section in preempt_disable()/preempt_enable(). The section is small enough that this is a reasonable approach.

The smp_processor_id() entries in spl-debug.c are used strictly for debugging purposes. It should be safe to simply wrap the two call sites with preempt_disable()/preempt_enable(). This code will be rarely called unless someone is debugging or the system is trips an ASSERT/VERIFY. In either case it's not performance critical.

If you can make the spl_magazine_age fix and do some testing with preemption enabled to make sure it's working as expected we could consider supporting this. The key will be getting enough testing to make sure nothing was missed.

@kylef
Copy link
Contributor

kylef commented Aug 19, 2011

If you can make the spl_magazine_age fix and do some testing with preemption enabled to make sure it's working as expected we could consider supporting this. The key will be getting enough testing to make sure nothing was missed.

I think testing is going to be very hard to do. The kernel needs to preempt at exactly the right time to test. I recently noticed that when I recompiled my kernel with CONFIG_DEBUG_MUTEXES disabled that I had CONFIG_PREEMPT enabled. I have been running ZFS on this system for 28 days and it seems to be fine, this most likely means that my kernel has NEVER preempted zfs at the exact time where it matters. This is a dual-core system. It is weird that when I first installed I came across this issue instantly then an issue related to CONFIG_DEBUG_MUTEXES, once I disabled CONFIG_DEBUG_MUTEXES my system has become stable.

@behlendorf
Copy link
Contributor Author

Are you sure? With the latest code the configure step should fail if your kernel has CONFIG_PREEMPT defined. Regardless, your right about the testing we'll want to make sure it's well tested.

@kylef
Copy link
Contributor

kylef commented Aug 19, 2011

I compiled it ~28 days ago so it was just before CONFIG_PREEMPT detection was fixed.

@ryao
Copy link
Contributor

ryao commented Apr 16, 2012

Supporting preemptible kernels does not require identifying the exact code paths where smp_processor_id() is called in a preemptible region. Instead, we only need to identify the entry points to ZFS code, disable preemption at the entry point and enable preemption at the exit point. That will cover those code paths by definition. That is unideal, but it will work until a better solution can be put in place.

@behlendorf How would you feel about splitting this issue into two parts. One would be supporting preemptible kernels. The other is supporting preemption in the ZFS code itself?

This was referenced Apr 16, 2012
@behlendorf
Copy link
Contributor Author

@gentoofan I'd prefer to just support preemption in the spl/zfs code itself. This shouldn't be a huge amount of work to fix since the spl slab is really the only place this should be an issue. I've been meaning to do it for years now, but since frankly this is only a desktop issue I've never prioritized it.

@kylef
Copy link
Contributor

kylef commented Apr 18, 2012

This needs more testing, but I have been running this for some time without any issues. https://raw.github.com/kylef/ark/master/spl/preempt.patch

@ryao
Copy link
Contributor

ryao commented Apr 18, 2012

@kylef Thanks for posting that. Nice work.

I am testing it as gentoofan/spl@b8ea7afc96b7674ee3e0601afece68114d23a261. If all goes well, I will file appropriate pull requests with zfsonlinux/spl and zfsonlinux/zfs.

@ryao
Copy link
Contributor

ryao commented Apr 18, 2012

@kylef My system has not crashed yet, but I am seeing many issues being reported to dmesg:

http://paste.pocoo.org/show/583429/

It looks like spl_debug_msg and txg_hold_open need preempt_disable/preempt_enable.

@ryao
Copy link
Contributor

ryao commented Apr 18, 2012

I am working on the issues I mentioned earlier. Interestingly, fixing the txg_hold_open issue causes another issue:

[ 52.043212] BUG: scheduling while atomic: Chrome_CacheThr/5943/0x00000002
[ 52.043215] Modules linked in: bridge ipv6 stp llc snd_hda_codec_analog arc4 nvidia(PO) rtl8187 mac80211 cfg80211 eeprom_93cx6 firewire_ohci firewire_core rtc_cmos crc_itu_t r8169 asus_atk0110 floppy snd_hda_intel hwmon mii snd_hda_codec i2c_i801 snd_pcm iTCO_wdt evdev snd_timer sky2 snd agpgart processor soundcore i2c_core snd_page_alloc unix sha256_generic fuse zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate nfs nfs_acl auth_rpcgss lockd sunrpc ext4 jbd2 mbcache scsi_wait_scan hid_monterey hid_microsoft hid_logitech hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech usbhid uhci_hcd usb_storage ehci_hcd usbcore usb_common sr_mod cdrom sg pata_jmicron
[ 52.043282] Pid: 5943, comm: Chrome_CacheThr Tainted: P O 3.3.2 #14
[ 52.043284] Call Trace:
[ 52.043292] [] __schedule_bug+0x63/0x68
[ 52.043297] [] __schedule+0x54c/0x6a0
[ 52.043300] [] schedule+0x3a/0x50
[ 52.043303] [] __mutex_lock_slowpath+0xdf/0x180
[ 52.043306] [] mutex_lock+0x1d/0x40
[ 52.043326] [] txg_hold_open+0x50/0x90 [zfs]
[ 52.043364] [] dmu_tx_assign+0x1dc/0xc00 [zfs]
[ 52.043375] [] ? dsl_dataset_block_freeable+0x3b/0x50 [zfs]
[ 52.043385] [] ? dmu_tx_callback_register+0xe1/0x150 [zfs]
[ 52.043395] [] zfs_write+0x356/0xd10 [zfs]
[ 52.043405] [] zpl_write_common+0x4d/0x110 [zfs]
[ 52.043413] [] zpl_write_common+0xe3/0x110 [zfs]
[ 52.043417] [] vfs_write+0xae/0x180
[ 52.043419] [] sys_pwrite64+0xa2/0xb0
[ 52.043424] [] system_call_fastpath+0x16/0x1b
[ 54.656010] br0: port 1(eth0) entered forwarding state

@Nowaker
Copy link

Nowaker commented Jun 11, 2012

Is there any work going on with this issue? It was reported a year ago, so...

@ryao
Copy link
Contributor

ryao commented Jun 11, 2012

Pull request #674 has preemption support. It works on my desktop, although it needs a little more attention before it is merged.

@Nowaker
Copy link

Nowaker commented Jun 11, 2012

@ryao, Sounds great. @behlendorf, any plans of taking this a step further so that more people can test it?

@behlendorf
Copy link
Contributor Author

@Nowaker This pull request is a good start but more work is really needed before preempt can be fully supported. In particular the spl kmem cache layer needs a little attention.

behlendorf pushed a commit that referenced this issue Aug 24, 2012
After surveying the code, the few places where smp_processor_id is used
were deemed to be safe to use with a preempt enabled kernel. As such, no
core logic had to be changed. These smp_processor_id call sites are simply
are wrapped in kpreempt_disable and kpreempt_enabled to prevent the
Linux kernel from emitting scary warnings.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Issue #83
@behlendorf
Copy link
Contributor Author

Full preemption support has been merged in to the spl and zfs master branches (thanks everybody). The only thing that remains are the autoconf checks which prevent people from using it. I'd like to remove those as well next week but I'd feel better about it if we could get some additional testing on the code. If you have the time I'd appreciate it if you could test this, I've made the following tags to test with autoconf checks reverted.

https://github.com/behlendorf/spl/tarball/spl-0.6.0-rc10-preempt
https://github.com/behlendorf/zfs/tarball/zfs-0.6.0-rc10-preempt

richardelling pushed a commit to richardelling/zfs that referenced this issue Oct 15, 2018
… io-fd (openzfs#83)


* [TA1652][DE17] [DE37] Atomic inc and decrement of zinfo refcount done.
Closing data fd lazily so that operations on stale fd can be avoided
Wait reference to be drained out before freeing zinfo.

Signed-off-by: satbir <satbir.chhikara@gmail.com>
sdimitro pushed a commit to sdimitro/zfs that referenced this issue Feb 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Feature Feature request or new feature
Projects
None yet
Development

No branches or pull requests

7 participants