
CPU stall when disk is removed #3652

Closed
ghost opened this issue Jul 30, 2015 · 25 comments

@ghost

ghost commented Jul 30, 2015

I'm getting a pool lockup when:

  1. Pool is doing IO (read or write)
  2. A disk is removed (or fails)

z_null_int process takes 100% CPU:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                        
1424 root       0 -20       0      0      0 R 100.0  0.0   5:00.81 z_null_int                                                                                                                                     

We get this at the console just after removal of the disk:

blk_update_request: I/O error, dev sdb, sector 2392232
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched self-detected stall on CPU

A hard reset of the machine is required.

I've experienced this on multiple machines with:

  1. Local disks connected with LSI HBA
  2. SRP disks over InfiniBand
  3. Multipath disks

Kernel version is 3.18.19.

This only appeared in ZFS master builds from 13th July onwards; I don't know the specific commit.
I confirmed today that the latest build from 29th July is still affected.

The latest SPL master is not affected and is OK to use.

@kaazoo

kaazoo commented Sep 18, 2015

I'm experiencing a similar situation. On a system with multiple I/O errors on different hard drives, txg_sync, spa_async and zpool (running watch zpool status while resilvering) lock up:

Sep 17 21:15:13 myserver kernel: [15249.223675] INFO: task txg_sync:4383 blocked for more than 120 seconds.
Sep 17 21:15:13 myserver kernel: [15249.224108]       Tainted: P          IOX 3.13.0-63-generic #103-Ubuntu
Sep 17 21:15:13 myserver kernel: [15249.224529] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 17 21:15:13 myserver kernel: [15249.224971] txg_sync        D ffff880333233180     0  4383      2 0x00000000
Sep 17 21:15:13 myserver kernel: [15249.224975]  ffff88031a36dce8 0000000000000046 ffff88031a366000 ffff88031a36dfd8
Sep 17 21:15:13 myserver kernel: [15249.224978]  0000000000013180 0000000000013180 ffff88031a366000 ffff880321dc7130
Sep 17 21:15:13 myserver kernel: [15249.224980]  ffff880321dc70e8 ffff880321dc7138 ffff880321dc7110 0000000000000000
Sep 17 21:15:13 myserver kernel: [15249.224983] Call Trace:
Sep 17 21:15:13 myserver kernel: [15249.224989]  [<ffffffff817280c9>] schedule+0x29/0x70
Sep 17 21:15:13 myserver kernel: [15249.225002]  [<ffffffffa0199c35>] cv_wait_common+0xe5/0x120 [spl]
Sep 17 21:15:13 myserver kernel: [15249.225006]  [<ffffffff810ab390>] ? prepare_to_wait_event+0x100/0x100
Sep 17 21:15:13 myserver kernel: [15249.225012]  [<ffffffffa0199c85>] __cv_wait+0x15/0x20 [spl]
Sep 17 21:15:13 myserver kernel: [15249.225047]  [<ffffffffa02c36f3>] spa_config_enter+0xf3/0x110 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.225074]  [<ffffffffa02ca633>] txg_sync_thread+0x2b3/0x620 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.225102]  [<ffffffffa02ca380>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.225107]  [<ffffffffa0194f61>] thread_generic_wrapper+0x71/0x80 [spl]
Sep 17 21:15:13 myserver kernel: [15249.225111]  [<ffffffffa0194ef0>] ? __thread_exit+0x20/0x20 [spl]
Sep 17 21:15:13 myserver kernel: [15249.225115]  [<ffffffff8108b7d2>] kthread+0xd2/0xf0
Sep 17 21:15:13 myserver kernel: [15249.225117]  [<ffffffff8108b700>] ? kthread_create_on_node+0x1c0/0x1c0
Sep 17 21:15:13 myserver kernel: [15249.225120]  [<ffffffff817347e8>] ret_from_fork+0x58/0x90
Sep 17 21:15:13 myserver kernel: [15249.225122]  [<ffffffff8108b700>] ? kthread_create_on_node+0x1c0/0x1c0

Sep 17 21:15:13 myserver kernel: [15249.225132] INFO: task spa_async:28624 blocked for more than 120 seconds.
Sep 17 21:15:13 myserver kernel: [15249.225593]       Tainted: P          IOX 3.13.0-63-generic #103-Ubuntu
Sep 17 21:15:13 myserver kernel: [15249.226069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 17 21:15:13 myserver kernel: [15249.226567] spa_async       D ffff8803332b3180     0 28624      2 0x00000000
Sep 17 21:15:13 myserver kernel: [15249.226570]  ffff88030da2fc00 0000000000000046 ffff88006a5a4800 ffff88030da2ffd8
Sep 17 21:15:13 myserver kernel: [15249.226572]  0000000000013180 0000000000013180 ffff88006a5a4800 ffff8803332b3a18
Sep 17 21:15:13 myserver kernel: [15249.226574]  ffff8802f8ecaa88 ffff8802f8ecaac8 ffff8802f8ecaab0 0000000000000001
Sep 17 21:15:13 myserver kernel: [15249.226577] Call Trace:
Sep 17 21:15:13 myserver kernel: [15249.226579]  [<ffffffff817283ed>] io_schedule+0x9d/0x140
Sep 17 21:15:13 myserver kernel: [15249.226587]  [<ffffffffa0199bef>] cv_wait_common+0x9f/0x120 [spl]
Sep 17 21:15:13 myserver kernel: [15249.226590]  [<ffffffff810ab390>] ? prepare_to_wait_event+0x100/0x100
Sep 17 21:15:13 myserver kernel: [15249.226595]  [<ffffffffa0199cc8>] __cv_wait_io+0x18/0x20 [spl]
Sep 17 21:15:13 myserver kernel: [15249.226627]  [<ffffffffa0313233>] zio_wait+0x123/0x210 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226655]  [<ffffffffa02d3c5e>] vdev_label_read_config+0xce/0x210 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226682]  [<ffffffffa02cfb95>] vdev_validate+0xe5/0x2c0 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226710]  [<ffffffffa02cf64d>] ? vdev_open+0x31d/0x500 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226737]  [<ffffffffa02d02a5>] vdev_reopen+0xa5/0xe0 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226762]  [<ffffffffa02b4e42>] spa_async_probe+0x72/0x80 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226786]  [<ffffffffa02b4e17>] spa_async_probe+0x47/0x80 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226811]  [<ffffffffa02b4e17>] spa_async_probe+0x47/0x80 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226836]  [<ffffffffa02bf629>] spa_async_thread+0x149/0x310 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226840]  [<ffffffff811a28cd>] ? kfree+0xfd/0x140
Sep 17 21:15:13 myserver kernel: [15249.226865]  [<ffffffffa02bf4e0>] ? spa_vdev_resilver_done+0x140/0x140 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226870]  [<ffffffffa0194f61>] thread_generic_wrapper+0x71/0x80 [spl]
Sep 17 21:15:13 myserver kernel: [15249.226874]  [<ffffffffa0194ef0>] ? __thread_exit+0x20/0x20 [spl]
Sep 17 21:15:13 myserver kernel: [15249.226877]  [<ffffffff8108b7d2>] kthread+0xd2/0xf0
Sep 17 21:15:13 myserver kernel: [15249.226879]  [<ffffffff8108b700>] ? kthread_create_on_node+0x1c0/0x1c0
Sep 17 21:15:13 myserver kernel: [15249.226881]  [<ffffffff817347e8>] ret_from_fork+0x58/0x90
Sep 17 21:15:13 myserver kernel: [15249.226884]  [<ffffffff8108b700>] ? kthread_create_on_node+0x1c0/0x1c0

Sep 17 21:15:13 myserver kernel: [15249.226886] INFO: task zpool:28627 blocked for more than 120 seconds.
Sep 17 21:15:13 myserver kernel: [15249.227400]       Tainted: P          IOX 3.13.0-63-generic #103-Ubuntu
Sep 17 21:15:13 myserver kernel: [15249.227939] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 17 21:15:13 myserver kernel: [15249.228527] zpool           D ffff8803332b3180     0 28627  28626 0x00000000
Sep 17 21:15:13 myserver kernel: [15249.228530]  ffff88028a399be8 0000000000000086 ffff8802e8dce000 ffff88028a399fd8
Sep 17 21:15:13 myserver kernel: [15249.228533]  0000000000013180 0000000000013180 ffff8802e8dce000 ffff880321dc7130
Sep 17 21:15:13 myserver kernel: [15249.228535]  ffff880321dc70e8 ffff880321dc7138 ffff880321dc7110 0000000000000000
Sep 17 21:15:13 myserver kernel: [15249.228537] Call Trace:
Sep 17 21:15:13 myserver kernel: [15249.228540]  [<ffffffff817280c9>] schedule+0x29/0x70
Sep 17 21:15:13 myserver kernel: [15249.228547]  [<ffffffffa0199c35>] cv_wait_common+0xe5/0x120 [spl]
Sep 17 21:15:13 myserver kernel: [15249.228550]  [<ffffffff810ab390>] ? prepare_to_wait_event+0x100/0x100
Sep 17 21:15:13 myserver kernel: [15249.228556]  [<ffffffffa0199c85>] __cv_wait+0x15/0x20 [spl]
Sep 17 21:15:13 myserver kernel: [15249.228583]  [<ffffffffa02c36f3>] spa_config_enter+0xf3/0x110 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.228609]  [<ffffffffa02c0a66>] spa_config_generate+0x636/0x900 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.228635]  [<ffffffffa02bbb80>] spa_open_common+0xb0/0x460 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.228660]  [<ffffffffa02bbfa9>] spa_get_stats+0x49/0x530 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.228663]  [<ffffffff811a33be>] ? __kmalloc_node+0x26e/0x2a0
Sep 17 21:15:13 myserver kernel: [15249.228667]  [<ffffffff81157e61>] ? __free_pages+0x41/0x60
Sep 17 21:15:13 myserver kernel: [15249.228696]  [<ffffffffa02ee9ba>] zfs_ioc_pool_stats+0x2a/0x70 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.228725]  [<ffffffffa02f19ed>] zfsdev_ioctl+0x49d/0x4e0 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.228728]  [<ffffffff811d1710>] do_vfs_ioctl+0x2e0/0x4c0
Sep 17 21:15:13 myserver kernel: [15249.228731]  [<ffffffff811ce322>] ? final_putname+0x22/0x50
Sep 17 21:15:13 myserver kernel: [15249.228736]  [<ffffffff811d1971>] SyS_ioctl+0x81/0xa0
Sep 17 21:15:13 myserver kernel: [15249.228739]  [<ffffffff8173489d>] system_call_fastpath+0x1a/0x1f

After that, z_null_int runs at 100% CPU. According to iostat there still seems to be I/O happening, so the resilver might still be running.

OS: Ubuntu Linux 14.04 LTS x64
ZFS: 0.6.5-1~trusty
HBA: LSI SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] (rev 02)
Disks: WD 2002FYPS, WD 2003FYYS and WD 2000FYYZ, 12 in total

Each drive is encrypted via LUKS. These 12 Device Mapper devices are configured as 2x raidz2 (6 disks each).
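For reference, a pool with that layout would be created along these lines. This is only a sketch; the pool name and the /dev/mapper names are hypothetical placeholders, not the ones used on this system.

# Two raidz2 vdevs of six LUKS-mapped (device-mapper) devices each.
zpool create tank \
    raidz2 /dev/mapper/crypt01 /dev/mapper/crypt02 /dev/mapper/crypt03 \
           /dev/mapper/crypt04 /dev/mapper/crypt05 /dev/mapper/crypt06 \
    raidz2 /dev/mapper/crypt07 /dev/mapper/crypt08 /dev/mapper/crypt09 \
           /dev/mapper/crypt10 /dev/mapper/crypt11 /dev/mapper/crypt12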

@adessemond

Got something similar on a pre-production machine running CentOS 7.1 (downgraded to 0.6.4.2 since I hit the bug). The issue is, however, perfectly reproducible in a test environment running under VirtualBox: when a disk is detached and a write operation occurs on the pool, the CPU stalls and soft lock-up notices show up on the console (CPU#0 stuck for 22s [z_null_int:717]).

A 'top' on the host shows a thread hung at 100% in kernel mode...

@ghost
Author

ghost commented Sep 22, 2015

Is anyone planning to tackle this one?
It is preventing us from using the latest code.
We don't have any ZFS developers but are happy to do any testing.

@kernelOfTruth
Contributor

@AB17, @kaazoo, @adessemond just to be clear: not all of you are using LUKS?

@ghost
Author

ghost commented Sep 22, 2015

@kernelOfTruth, I'm not using LUKS. That is for encryption, right?
I'm not doing any encryption, just removing disks.

@ghost
Author

ghost commented Sep 22, 2015

I built kernel 4.1.7 with the latest ZoL GitHub release to see if this was related to kernel 3.18.x.
The problem is still there with 4.1.7.

@xatian

xatian commented Sep 22, 2015

Hello,
Got the same problem running Gentoo with kernel 4.0.5 and ZFS 0.6.5.
I also tested with Fedora Server 22 (out of the box) and saw the same problem.

@adessemond

@kernelOfTruth no LUKS here also.

@ghost
Author

ghost commented Sep 23, 2015

I built the very latest release today.

Kernel 3.18.21.

Still the same problem.

You can use a virtual or physical machine to test this.

Remove a pool disk and then do some write IO to the pool.

You might need to remove/re-add/remove a few times to see it happen.

It does not always happen on the first disk removal.
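For anyone who wants to reproduce this, here is a minimal sketch of the procedure above on a throwaway VM. The pool name tank, mount point /tank, member disk sdb and SCSI host host0 are placeholders for whatever the test setup uses.

# Generate some write IO against the pool in the background.
dd if=/dev/zero of=/tank/stall-test bs=1M count=8192 &

# Simulate removal of a pool member by deleting it from the SCSI layer.
echo 1 > /sys/block/sdb/device/delete

# Watch for the symptoms: z_null_int pegged at 100% CPU and
# rcu_sched stall messages in the kernel log.
top -b -n 1 | grep z_null_int
dmesg | tail

# To re-add the disk for another attempt, rescan its SCSI host.
echo "- - -" > /sys/class/scsi_host/host0/scan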

@xatian

xatian commented Sep 23, 2015

I have not spent much time with ZFS, so take this with a grain of salt:
I think the problem is that the kernel module does not generate any events when a disk is ejected (or fails). When I remove a disk and run

zpool events

I see no additional events generated (I am not sure whether that is intended, but I think there should be more).
But when I run

zpool status

I see that the pool is still reported as healthy and that, as far as ZFS is concerned, nothing happened. That looks like a bug: I tested the same disk eject with FreeNAS, where the pool immediately goes to degraded.
@AB17 could you please check if you experience the same behaviour?

So I think that, since the system does not know the pool is down/degraded, it can't handle a subsequent reattach of the disk or the next I/O operation.
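A quick way to repeat that check, assuming a pool named tank (a placeholder) whose member disk has just been pulled:

# Look for newly generated events right after the disk is pulled.
zpool events | tail -n 5

# Check whether ZFS noticed the failure; on an affected system the pool
# is still reported as ONLINE instead of DEGRADED.
zpool status tank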

@kernelOfTruth
Contributor

@xatian do you run OpenRC or systemd?
I'm pretty sure that Ubuntu 14.04 LTS runs Upstart, so that rules out a potential systemd issue there.

With OpenRC 0.17, events are generated:

zpool events
TIME                           CLASS
Sep 23 2015 13:37:35.880998397 resource.fs.zfs.statechange
Sep 23 2015 13:37:35.880998397 resource.fs.zfs.statechange
Sep 23 2015 13:37:36.020998395 resource.fs.zfs.statechange
Sep 23 2015 13:37:40.586998328 resource.fs.zfs.statechange
Sep 23 2015 13:37:40.601998328 resource.fs.zfs.statechange
Sep 23 2015 13:37:40.612998328 resource.fs.zfs.statechange
Sep 23 2015 13:37:40.905998323 ereport.fs.zfs.config.sync
Sep 23 2015 13:37:41.179998319 ereport.fs.zfs.config.sync
Sep 23 2015 18:04:09.203765092 resource.fs.zfs.statechange
Sep 23 2015 18:04:09.600765086 ereport.fs.zfs.config.sync

@xatian

xatian commented Sep 23, 2015

Yes, I am using OpenRC 0.17.
After a reboot I got these events (I have 3 pools):

zpool events 
TIME                           CLASS
Sep 23 2015 22:11:02.291999909 resource.fs.zfs.statechange
Sep 23 2015 22:11:02.291999909 resource.fs.zfs.statechange
Sep 23 2015 22:11:02.291999909 resource.fs.zfs.statechange
Sep 23 2015 22:11:02.291999909 resource.fs.zfs.statechange
Sep 23 2015 22:11:02.291999909 resource.fs.zfs.statechange
Sep 23 2015 22:11:02.291999909 resource.fs.zfs.statechange
Sep 23 2015 22:11:03.087999901 ereport.fs.zfs.config.sync
Sep 23 2015 22:11:03.487999897 ereport.fs.zfs.config.sync
Sep 23 2015 22:11:05.003999880 resource.fs.zfs.statechange
Sep 23 2015 22:11:05.003999880 resource.fs.zfs.statechange
Sep 23 2015 22:11:05.711999872 ereport.fs.zfs.config.sync
Sep 23 2015 22:11:06.087999868 ereport.fs.zfs.config.sync
Sep 23 2015 22:11:07.439999854 resource.fs.zfs.statechange
Sep 23 2015 22:11:07.439999854 resource.fs.zfs.statechange
Sep 23 2015 22:11:07.439999854 resource.fs.zfs.statechange
Sep 23 2015 22:11:08.071999847 ereport.fs.zfs.config.sync
Sep 23 2015 22:11:08.455999843 ereport.fs.zfs.config.sync

When I pull a disk, no additional events are shown and, like I said, zpool status still reports everything as fine.
When I reinsert the disk and produce any I/O, z_null_int starts looping and dmesg shows:

[  292.755888] INFO: rcu_sched self-detected stall on CPU
[  292.755892]  4: (5249 ticks this GP) idle=3bd/140000000000001/0 softirq=2448/2448 fqs=5249 
[  292.755893]   (t=5250 jiffies g=2236 c=2235 q=21023)
[  292.755895] Task dump for CPU 4:
[  292.755896] z_null_int      R  running task        0  2173      2 0x00000008
[  292.755897]  ffffffff81c35c80 ffff88087fd03da8 ffffffff810753fe 0000000000000004
[  292.755899]  ffffffff81c35c80 ffff88087fd03dc8 ffffffff81077a38 ffff88087fd03e08
[  292.755900]  0000000000000005 ffff88087fd03df8 ffffffff81091fe0 ffff88087fd130c0
[  292.755901] Call Trace:
[  292.755902]  <IRQ>  [<ffffffff810753fe>] sched_show_task+0xae/0x120
[  292.755908]  [<ffffffff81077a38>] dump_cpu_task+0x38/0x40
[  292.755910]  [<ffffffff81091fe0>] rcu_dump_cpu_stacks+0x90/0xd0
[  292.755912]  [<ffffffff81094d23>] rcu_check_callbacks+0x423/0x700
[  292.755913]  [<ffffffff8109d744>] ? update_wall_time+0x234/0x650
[  292.755915]  [<ffffffff81097134>] update_process_times+0x34/0x60
[  292.755917]  [<ffffffff810a5231>] tick_sched_handle.isra.18+0x31/0x40
[  292.755918]  [<ffffffff810a527c>] tick_sched_timer+0x3c/0x70
[  292.755920]  [<ffffffff81097791>] __run_hrtimer.isra.34+0x41/0xf0
[  292.755921]  [<ffffffff81097f05>] hrtimer_interrupt+0xc5/0x1e0
[  292.755923]  [<ffffffff8102b454>] local_apic_timer_interrupt+0x34/0x60
[  292.755924]  [<ffffffff8102bacc>] smp_apic_timer_interrupt+0x3c/0x60
[  292.755926]  [<ffffffff816f9cea>] apic_timer_interrupt+0x6a/0x70
[  292.755927]  <EOI>  [<ffffffff816f8859>] ? _raw_spin_unlock_irqrestore+0x9/0x10
[  292.755934]  [<ffffffffa0003f6a>] taskq_cancel_id+0x31a/0x530 [spl]
[  292.755935]  [<ffffffff81075a60>] ? wake_up_state+0x10/0x10
[  292.755937]  [<ffffffffa0003d70>] ? taskq_cancel_id+0x120/0x530 [spl]
[  292.755940]  [<ffffffff8106d7e4>] kthread+0xc4/0xe0
[  292.755941]  [<ffffffff8106d720>] ? kthread_create_on_node+0x170/0x170
[  292.755943]  [<ffffffff816f8e48>] ret_from_fork+0x58/0x90
[  292.755944]  [<ffffffff8106d720>] ? kthread_create_on_node+0x170/0x170

BTW: Fedora Server 22, which I also tested with, runs systemd and showed exactly the same behaviour.
BTW2: if I reboot with the disk still ejected, the status is correctly shown as degraded. When I reinsert the disk, no event is created and the status stays at degraded.
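For completeness, the usual manual recovery after reinserting a disk would be something like the following (pool and device names are placeholders):

# Tell ZFS the reinserted device is available again.
zpool online tank sdb

# Or, if the disk came back under a different name, replace the old one.
zpool replace tank sdb sdc

# Then watch the resilver.
zpool status tank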

@behlendorf behlendorf added this to the 0.6.5.2 milestone Sep 23, 2015
@behlendorf
Contributor

@ab12 yes, we'll definitely be tackling this one in a point release. We now have several reports of this issue; #3821 and #817 are both variants of it.

@ghost
Author

ghost commented Sep 23, 2015

Thanks @behlendorf
We are happy to test as usual.

@ghost
Author

ghost commented Sep 24, 2015

Kernel 3.18.21
Latest ZFS release.

I've tried this:

echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic

That helps. A lot. I can remove and re-add the disk several times.

But...

Eventually, after several removals and re-adds, I get the CPU stall again:

INFO: rcu_sched self-detected stall on CPU { 1}  (t=15000 jiffies g=13616 c=13615 q=0)
Task dump for CPU 1:
z_null_int      R  running task        0  6557      2 0x00080008
 0000000000000002 ffffffff818b5200 ffffffff810a9487 0000000000003530
 ffff88080b42d580 ffffffff818b5100 ffffffff818b5200 ffffffff81912c40
 ffffffff810ad008 0000000000000001 0000000000012000 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff810a9487>] ? rcu_dump_cpu_stacks+0x77/0xb0
 [<ffffffff810ad008>] ? rcu_check_callbacks+0x3f8/0x650
 [<ffffffff810bec00>] ? tick_sched_handle.isra.20+0x40/0x40
 [<ffffffff810b072b>] ? update_process_times+0x3b/0x60
 [<ffffffff810bebf1>] ? tick_sched_handle.isra.20+0x31/0x40
 [<ffffffff810bec34>] ? tick_sched_timer+0x34/0x60
 [<ffffffff810b0dd5>] ? __run_hrtimer.isra.34+0x45/0xc0
 [<ffffffff810b13a7>] ? hrtimer_interrupt+0xd7/0x230
 [<ffffffff810152d5>] ? sched_clock+0x5/0x10
 [<ffffffff8103ddd6>] ? smp_apic_timer_interrupt+0x36/0x50
 [<ffffffff815eba7d>] ? apic_timer_interrupt+0x6d/0x80
 <EOI>  [<ffffffffa07fc9b6>] ? zio_done+0x9e6/0xe30 [zfs]
 [<ffffffff815ea3e2>] ? _raw_spin_unlock_irqrestore+0x12/0x70
 [<ffffffffa06a0ebe>] ? taskq_thread+0x1fe/0x410 [spl]
 [<ffffffff81080670>] ? wake_up_process+0x30/0x30
 [<ffffffffa06a0cc0>] ? taskq_cancel_id+0x110/0x110 [spl]
 [<ffffffff8107798c>] ? kthread+0xbc/0xe0
 [<ffffffff810152d5>] ? sched_clock+0x5/0x10
 [<ffffffff810778d0>] ? kthread_create_on_node+0x170/0x170
 [<ffffffff815eaa58>] ? ret_from_fork+0x58/0x90
 [<ffffffff810778d0>] ? kthread_create_on_node+0x170/0x170

This time, the system does not lock up.

There is no z_null_int process running at 100% either.

However, no writes can take place to the pool.

A reset of the system is required.

behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 24, 2015
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in.  This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.

This in turn resulted in vdev_disk_io_start() being able to
re-dispatch zio's which would result in RCU stalls when a disk
was removed from the system.  Additionally, this could negatively
impact performance and may explain the performance regressions
reported in both openzfs#3829 and openzfs#3780.

This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.

Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3780
Issue openzfs#3829
Issue openzfs#3652
@behlendorf
Contributor

@AB17 I was able to reproduce the issue. Can you please verify the fix proposed in #3833? This may also address several of the other regressions observed in 0.6.5.1.
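For anyone else who wants to test it, one way to build the proposed fix is to check out the pull request branch directly. This is only a sketch; it assumes you already build ZFS (and a matching SPL) from source, and the local branch name pr-3833 is arbitrary.

# Fetch and check out the branch behind pull request #3833.
git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git fetch origin pull/3833/head:pr-3833
git checkout pr-3833

# Rebuild and reinstall the modules (exact steps vary by distribution).
sh autogen.sh && ./configure && make -j"$(nproc)" && sudo make install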

@adessemond

@behlendorf @kernelOfTruth: echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic did not help in my case (even when forced at system startup before importing anything). Will test the patch and report.

@adessemond

@behlendorf your fix listed above for #3833 solves the issue on my side. It still needs to be tested on my real hardware, but so far there are no more CPU stalls when a virtual disk is removed.

@ghost
Author

ghost commented Sep 25, 2015

@behlendorf I tried the patch this morning.
Unfortunately it does not fix the issue.
I removed and re-added a disk successfully 3 times, but the 4th time was a problem.

The pool did not go into degraded mode when the disk was removed and write IO took place (going degraded is the expected behaviour).

The CPU stall issue is gone, but the pool is no longer writable.

status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.

"zpool clear" command hangs.

No write IO can take place to the pool.

A reset of the system is required.

@ghost
Author

ghost commented Sep 25, 2015

@behlendorf an update to previous test.

I assumed I was to test with the default spl_taskq_thread_dynamic = 1.

I have tested again with spl_taskq_thread_dynamic = 0.

I have completed a total of 20 tests where I remove and re-add the disk.

In each case the pool goes into degraded mode as expected. A re-add resilvers OK so I can remove again.

Works exactly like it should.

So to confirm: #3833 appears to fix the problem when spl_taskq_thread_dynamic = 0.

@behlendorf
Contributor

@adessemond @AB17 thanks for the quick testing turnaround. I think I've got a good idea what the lingering issue is but I'd like to address that in a separate patch. It's an issue which has always existed but was masked in 0.6.4.x because we didn't have dynamic taskqs. We'll get #3833 in right away, but @AB17 you may need to run with spl_taskq_thread_dynamic = 0 for a little while until the second issue is addressed.

@ghost
Author

ghost commented Sep 25, 2015

@behlendorf no problem. Thank you.

I'll run with spl_taskq_thread_dynamic = 0 for now.

behlendorf added a commit that referenced this issue Sep 25, 2015
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in.  This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.

This in turn resulted in vdev_disk_io_start() being able to
re-dispatch zio's which would result in RCU stalls when a disk
was removed from the system.  Additionally, this could negatively
impact performance and explains the performance regressions reported
in both #3829 and #3780.

This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.

Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3652
Issue #3780
Issue #3785
Issue #3817
Issue #3821
Issue #3829
Issue #3832
Issue #3870
@behlendorf
Contributor

Partially resolved by 5592404, which will be cherry-picked into the 0.6.5.2 release. I've opened #3840 to track the remaining work. For the moment, users can mitigate the remaining issue by setting spl_taskq_thread_dynamic = 0.
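For reference, a sketch of applying that mitigation both at runtime and persistently (the modprobe.d file name is arbitrary):

# Disable dynamic taskqs at runtime.
echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic

# Make the setting persistent across reboots and module reloads.
echo "options spl spl_taskq_thread_dynamic=0" > /etc/modprobe.d/spl.conf

# Verify the current value.
cat /sys/module/spl/parameters/spl_taskq_thread_dynamic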

@perfinion
Contributor

Just for the record, I think I was hitting this too. My server has one drive that is starting to go bad and once in a while hits an unreadable sector. Rsyncing data from an mdadm array to the new ZFS pool would lock up after a while, even though I never removed or added any drives. Pinging the machine worked fine, as did anything still running in RAM, but as soon as something hit the disk it was dead and the only fix was magic SysRq or a hard poweroff. Unfortunately I do not have any dmesg logs, since the machine is remote. Running echo 3 > drop_caches once in a while appeared to make it last longer before locking up, but I am not sure why, or whether it was just luck for a while.

I applied ef5b2e1 and this patch locally and the machine has been fine for many hours since. I have rsync'd more data over to the ZFS pool and it should probably have hit an error by now, so it appears fixed.

@behlendorf since spl_taskq_thread_dynamic = 0 is required to make things stable, should the default for that be set back to 0 until the other issue is fixed too so that the next point release will work correctly without user intervention?

@behlendorf
Contributor

@perfinion I was considering the same thing myself. However, from what I've seen reported, the ef5b2e1 fix addresses the vast majority of issues. Setting spl_taskq_thread_dynamic = 0 has only been needed in a few cases, and I believe in those cases it just does a better job of hiding a long-standing issue. I'd like to leave it enabled by default for now and get that underlying issue resolved, hopefully in 0.6.5.3.

behlendorf added a commit that referenced this issue Sep 30, 2015