
CPU stall when disk is removed #3652

Closed
ghost opened this issue Jul 30, 2015 · 25 comments

@ghost

ghost commented Jul 30, 2015

I'm getting a pool lockup when:

  1. Pool is doing IO (read or write)
  2. A disk is removed (or fails)

z_null_int process takes 100% CPU:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                        
1424 root       0 -20       0      0      0 R 100.0  0.0   5:00.81 z_null_int                                                                                                                                     

We get this at the console just after removal of the disk:

blk_update_request: I/O error, dev sdb, sector 2392232
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched self-detected stall on CPU

A hard reset of the machine is required.

I've experienced this on multiple machines with:

  1. Local disks connected with LSI HBA
  2. SRP disks over InfiniBand
  3. Multipath disks

Kernel version is 3.18.19.

This only appeared in ZFS master builds from 13th July onwards; I don't know the specific commit.
I confirmed today that the latest build from 29th July is still affected.

The latest SPL master is not affected and is OK to use.

@kaazoo

kaazoo commented Sep 18, 2015

I'm experiencing a similar situation. On a system with multiple I/O errors on different hard drives, txg_sync, spa_async and zpool (running watch zpool status while resilvering) lock up:

Sep 17 21:15:13 myserver kernel: [15249.223675] INFO: task txg_sync:4383 blocked for more than 120 seconds.
Sep 17 21:15:13 myserver kernel: [15249.224108]       Tainted: P          IOX 3.13.0-63-generic #103-Ubuntu
Sep 17 21:15:13 myserver kernel: [15249.224529] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 17 21:15:13 myserver kernel: [15249.224971] txg_sync        D ffff880333233180     0  4383      2 0x00000000
Sep 17 21:15:13 myserver kernel: [15249.224975]  ffff88031a36dce8 0000000000000046 ffff88031a366000 ffff88031a36dfd8
Sep 17 21:15:13 myserver kernel: [15249.224978]  0000000000013180 0000000000013180 ffff88031a366000 ffff880321dc7130
Sep 17 21:15:13 myserver kernel: [15249.224980]  ffff880321dc70e8 ffff880321dc7138 ffff880321dc7110 0000000000000000
Sep 17 21:15:13 myserver kernel: [15249.224983] Call Trace:
Sep 17 21:15:13 myserver kernel: [15249.224989]  [<ffffffff817280c9>] schedule+0x29/0x70
Sep 17 21:15:13 myserver kernel: [15249.225002]  [<ffffffffa0199c35>] cv_wait_common+0xe5/0x120 [spl]
Sep 17 21:15:13 myserver kernel: [15249.225006]  [<ffffffff810ab390>] ? prepare_to_wait_event+0x100/0x100
Sep 17 21:15:13 myserver kernel: [15249.225012]  [<ffffffffa0199c85>] __cv_wait+0x15/0x20 [spl]
Sep 17 21:15:13 myserver kernel: [15249.225047]  [<ffffffffa02c36f3>] spa_config_enter+0xf3/0x110 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.225074]  [<ffffffffa02ca633>] txg_sync_thread+0x2b3/0x620 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.225102]  [<ffffffffa02ca380>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.225107]  [<ffffffffa0194f61>] thread_generic_wrapper+0x71/0x80 [spl]
Sep 17 21:15:13 myserver kernel: [15249.225111]  [<ffffffffa0194ef0>] ? __thread_exit+0x20/0x20 [spl]
Sep 17 21:15:13 myserver kernel: [15249.225115]  [<ffffffff8108b7d2>] kthread+0xd2/0xf0
Sep 17 21:15:13 myserver kernel: [15249.225117]  [<ffffffff8108b700>] ? kthread_create_on_node+0x1c0/0x1c0
Sep 17 21:15:13 myserver kernel: [15249.225120]  [<ffffffff817347e8>] ret_from_fork+0x58/0x90
Sep 17 21:15:13 myserver kernel: [15249.225122]  [<ffffffff8108b700>] ? kthread_create_on_node+0x1c0/0x1c0

Sep 17 21:15:13 myserver kernel: [15249.225132] INFO: task spa_async:28624 blocked for more than 120 seconds.
Sep 17 21:15:13 myserver kernel: [15249.225593]       Tainted: P          IOX 3.13.0-63-generic #103-Ubuntu
Sep 17 21:15:13 myserver kernel: [15249.226069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 17 21:15:13 myserver kernel: [15249.226567] spa_async       D ffff8803332b3180     0 28624      2 0x00000000
Sep 17 21:15:13 myserver kernel: [15249.226570]  ffff88030da2fc00 0000000000000046 ffff88006a5a4800 ffff88030da2ffd8
Sep 17 21:15:13 myserver kernel: [15249.226572]  0000000000013180 0000000000013180 ffff88006a5a4800 ffff8803332b3a18
Sep 17 21:15:13 myserver kernel: [15249.226574]  ffff8802f8ecaa88 ffff8802f8ecaac8 ffff8802f8ecaab0 0000000000000001
Sep 17 21:15:13 myserver kernel: [15249.226577] Call Trace:
Sep 17 21:15:13 myserver kernel: [15249.226579]  [<ffffffff817283ed>] io_schedule+0x9d/0x140
Sep 17 21:15:13 myserver kernel: [15249.226587]  [<ffffffffa0199bef>] cv_wait_common+0x9f/0x120 [spl]
Sep 17 21:15:13 myserver kernel: [15249.226590]  [<ffffffff810ab390>] ? prepare_to_wait_event+0x100/0x100
Sep 17 21:15:13 myserver kernel: [15249.226595]  [<ffffffffa0199cc8>] __cv_wait_io+0x18/0x20 [spl]
Sep 17 21:15:13 myserver kernel: [15249.226627]  [<ffffffffa0313233>] zio_wait+0x123/0x210 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226655]  [<ffffffffa02d3c5e>] vdev_label_read_config+0xce/0x210 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226682]  [<ffffffffa02cfb95>] vdev_validate+0xe5/0x2c0 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226710]  [<ffffffffa02cf64d>] ? vdev_open+0x31d/0x500 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226737]  [<ffffffffa02d02a5>] vdev_reopen+0xa5/0xe0 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226762]  [<ffffffffa02b4e42>] spa_async_probe+0x72/0x80 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226786]  [<ffffffffa02b4e17>] spa_async_probe+0x47/0x80 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226811]  [<ffffffffa02b4e17>] spa_async_probe+0x47/0x80 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226836]  [<ffffffffa02bf629>] spa_async_thread+0x149/0x310 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226840]  [<ffffffff811a28cd>] ? kfree+0xfd/0x140
Sep 17 21:15:13 myserver kernel: [15249.226865]  [<ffffffffa02bf4e0>] ? spa_vdev_resilver_done+0x140/0x140 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.226870]  [<ffffffffa0194f61>] thread_generic_wrapper+0x71/0x80 [spl]
Sep 17 21:15:13 myserver kernel: [15249.226874]  [<ffffffffa0194ef0>] ? __thread_exit+0x20/0x20 [spl]
Sep 17 21:15:13 myserver kernel: [15249.226877]  [<ffffffff8108b7d2>] kthread+0xd2/0xf0
Sep 17 21:15:13 myserver kernel: [15249.226879]  [<ffffffff8108b700>] ? kthread_create_on_node+0x1c0/0x1c0
Sep 17 21:15:13 myserver kernel: [15249.226881]  [<ffffffff817347e8>] ret_from_fork+0x58/0x90
Sep 17 21:15:13 myserver kernel: [15249.226884]  [<ffffffff8108b700>] ? kthread_create_on_node+0x1c0/0x1c0

Sep 17 21:15:13 myserver kernel: [15249.226886] INFO: task zpool:28627 blocked for more than 120 seconds.
Sep 17 21:15:13 myserver kernel: [15249.227400]       Tainted: P          IOX 3.13.0-63-generic #103-Ubuntu
Sep 17 21:15:13 myserver kernel: [15249.227939] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 17 21:15:13 myserver kernel: [15249.228527] zpool           D ffff8803332b3180     0 28627  28626 0x00000000
Sep 17 21:15:13 myserver kernel: [15249.228530]  ffff88028a399be8 0000000000000086 ffff8802e8dce000 ffff88028a399fd8
Sep 17 21:15:13 myserver kernel: [15249.228533]  0000000000013180 0000000000013180 ffff8802e8dce000 ffff880321dc7130
Sep 17 21:15:13 myserver kernel: [15249.228535]  ffff880321dc70e8 ffff880321dc7138 ffff880321dc7110 0000000000000000
Sep 17 21:15:13 myserver kernel: [15249.228537] Call Trace:
Sep 17 21:15:13 myserver kernel: [15249.228540]  [<ffffffff817280c9>] schedule+0x29/0x70
Sep 17 21:15:13 myserver kernel: [15249.228547]  [<ffffffffa0199c35>] cv_wait_common+0xe5/0x120 [spl]
Sep 17 21:15:13 myserver kernel: [15249.228550]  [<ffffffff810ab390>] ? prepare_to_wait_event+0x100/0x100
Sep 17 21:15:13 myserver kernel: [15249.228556]  [<ffffffffa0199c85>] __cv_wait+0x15/0x20 [spl]
Sep 17 21:15:13 myserver kernel: [15249.228583]  [<ffffffffa02c36f3>] spa_config_enter+0xf3/0x110 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.228609]  [<ffffffffa02c0a66>] spa_config_generate+0x636/0x900 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.228635]  [<ffffffffa02bbb80>] spa_open_common+0xb0/0x460 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.228660]  [<ffffffffa02bbfa9>] spa_get_stats+0x49/0x530 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.228663]  [<ffffffff811a33be>] ? __kmalloc_node+0x26e/0x2a0
Sep 17 21:15:13 myserver kernel: [15249.228667]  [<ffffffff81157e61>] ? __free_pages+0x41/0x60
Sep 17 21:15:13 myserver kernel: [15249.228696]  [<ffffffffa02ee9ba>] zfs_ioc_pool_stats+0x2a/0x70 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.228725]  [<ffffffffa02f19ed>] zfsdev_ioctl+0x49d/0x4e0 [zfs]
Sep 17 21:15:13 myserver kernel: [15249.228728]  [<ffffffff811d1710>] do_vfs_ioctl+0x2e0/0x4c0
Sep 17 21:15:13 myserver kernel: [15249.228731]  [<ffffffff811ce322>] ? final_putname+0x22/0x50
Sep 17 21:15:13 myserver kernel: [15249.228736]  [<ffffffff811d1971>] SyS_ioctl+0x81/0xa0
Sep 17 21:15:13 myserver kernel: [15249.228739]  [<ffffffff8173489d>] system_call_fastpath+0x1a/0x1f

After that, z_null_int runs at 100% CPU. According to iostat there still seems to be I/O happening, so the resilver might still be running.

OS: Ubuntu Linux 14.04 LTS x64
ZFS: 0.6.5-1~trusty
HBA: LSI SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] (rev 02)
Disks: WD 2002FYPS, WD 2003FYYS and WD 2000FYYZ, 12 in total

Each drive is encrypted via LUKS. These 12 Device Mapper devices are configured as 2x raidz2 (6 disks each).
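For reference, a pool with that layout would be created along these lines. This is only a sketch; the pool name and the /dev/mapper names are hypothetical placeholders, not the ones used on this system.

# Two raidz2 vdevs of six LUKS-mapped (device-mapper) devices each.
zpool create tank \
    raidz2 /dev/mapper/crypt01 /dev/mapper/crypt02 /dev/mapper/crypt03 \
           /dev/mapper/crypt04 /dev/mapper/crypt05 /dev/mapper/crypt06 \
    raidz2 /dev/mapper/crypt07 /dev/mapper/crypt08 /dev/mapper/crypt09 \
           /dev/mapper/crypt10 /dev/mapper/crypt11 /dev/mapper/crypt12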

@adessemond

Got something similar on a pre-production machine running CentOS 7.1 (downgraded to 0.6.4.2 since I hit the bug). The issue is, however, perfectly reproducible in a test environment running under VirtualBox: when a disk is detached and a write operation occurs on the pool, the CPU stalls and soft lock-up notices show up on the console (CPU#0 stuck for 22s [z_null_int:717]).

A 'top' on the host shows a thread hung at 100% in kernel mode...

@ghost
Author

ghost commented Sep 22, 2015

Is anyone planning to tackle this one?
It is preventing us from using the latest code.
We don't have any ZFS developers but are happy to do any testing.

@kernelOfTruth
Contributor

@AB17, @kaazoo, @adessemond just to be clear: not all of you are using LUKS?

@ghost
Author

ghost commented Sep 22, 2015

@kernelOfTruth, I'm not using LUKS. That is for encryption, right?
I'm not doing any encryption, just removing disks.

@ghost
Author

ghost commented Sep 22, 2015

I built kernel 4.1.7 with the latest ZoL GitHub release to see if this was related to kernel 3.18.x.
The problem is still there with 4.1.7.

@xatian

xatian commented Sep 22, 2015

Hello,
Got the same problem running Gentoo with kernel 4.0.5 and ZFS 0.6.5.
I also tested with Fedora Server 22 (out of the box) and saw the same problem.

@adessemond

@kernelOfTruth no LUKS here also.

@ghost
Author

ghost commented Sep 23, 2015

I built the very latest release today.

Kernel 3.18.21.

Still the same problem.

You can use a virtual or physical machine to test this.

Remove a pool disk and then do some write IO to the pool.

You might need to remove/re-add/remove a few times to see it happen.

It does not always happen on the first disk removal.
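For anyone who wants to reproduce this, here is a minimal sketch of the procedure above on a throwaway VM. The pool name tank, mount point /tank, member disk sdb and SCSI host host0 are placeholders for whatever the test setup uses.

# Generate some write IO against the pool in the background.
dd if=/dev/zero of=/tank/stall-test bs=1M count=8192 &

# Simulate removal of a pool member by deleting it from the SCSI layer.
echo 1 > /sys/block/sdb/device/delete

# Watch for the symptoms: z_null_int pegged at 100% CPU and
# rcu_sched stall messages in the kernel log.
top -b -n 1 | grep z_null_int
dmesg | tail

# To re-add the disk for another attempt, rescan its SCSI host.
echo "- - -" > /sys/class/scsi_host/host0/scan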

@xatian

xatian commented Sep 23, 2015

I have not spent much time with ZFS, so take this with a grain of salt:
I think the problem is that the kernel module does not generate any events when a disk is ejected (or fails). When I remove a disk and run

zpool events

I see no additional events generated (I am not sure whether that is intended, but I think there should be more).
But when I run

zpool status

I see that the pool is still reported as healthy and that, as far as ZFS is concerned, nothing happened. That looks like a bug: I tested the same disk eject with FreeNAS, where the pool immediately goes to degraded.
@AB17 could you please check if you experience the same behaviour?

So I think that, since the system does not know the pool is down/degraded, it can't handle a subsequent reattach of the disk or the next I/O operation.
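A quick way to repeat that check, assuming a pool named tank (a placeholder) whose member disk has just been pulled:

# Look for newly generated events right after the disk is pulled.
zpool events | tail -n 5

# Check whether ZFS noticed the failure; on an affected system the pool
# is still reported as ONLINE instead of DEGRADED.
zpool status tank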

@kernelOfTruth
Contributor

@xatian do you run OpenRC or systemd?
I'm pretty sure that Ubuntu 14.04 LTS runs Upstart, so that rules out a potential systemd issue there.

With OpenRC 0.17, events are generated:

zpool events
TIME                           CLASS
Sep 23 2015 13:37:35.880998397 resource.fs.zfs.statechange
Sep 23 2015 13:37:35.880998397 resource.fs.zfs.statechange
Sep 23 2015 13:37:36.020998395 resource.fs.zfs.statechange
Sep 23 2015 13:37:40.586998328 resource.fs.zfs.statechange
Sep 23 2015 13:37:40.601998328 resource.fs.zfs.statechange
Sep 23 2015 13:37:40.612998328 resource.fs.zfs.statechange
Sep 23 2015 13:37:40.905998323 ereport.fs.zfs.config.sync
Sep 23 2015 13:37:41.179998319 ereport.fs.zfs.config.sync
Sep 23 2015 18:04:09.203765092 resource.fs.zfs.statechange
Sep 23 2015 18:04:09.600765086 ereport.fs.zfs.config.sync

@xatian

xatian commented Sep 23, 2015

Yes, I am using OpenRC 0.17.
After a reboot I got these events (I have 3 pools):

zpool events 
TIME                           CLASS
Sep 23 2015 22:11:02.291999909 resource.fs.zfs.statechange
Sep 23 2015 22:11:02.291999909 resource.fs.zfs.statechange
Sep 23 2015 22:11:02.291999909 resource.fs.zfs.statechange
Sep 23 2015 22:11:02.291999909 resource.fs.zfs.statechange
Sep 23 2015 22:11:02.291999909 resource.fs.zfs.statechange
Sep 23 2015 22:11:02.291999909 resource.fs.zfs.statechange
Sep 23 2015 22:11:03.087999901 ereport.fs.zfs.config.sync
Sep 23 2015 22:11:03.487999897 ereport.fs.zfs.config.sync
Sep 23 2015 22:11:05.003999880 resource.fs.zfs.statechange
Sep 23 2015 22:11:05.003999880 resource.fs.zfs.statechange
Sep 23 2015 22:11:05.711999872 ereport.fs.zfs.config.sync
Sep 23 2015 22:11:06.087999868 ereport.fs.zfs.config.sync
Sep 23 2015 22:11:07.439999854 resource.fs.zfs.statechange
Sep 23 2015 22:11:07.439999854 resource.fs.zfs.statechange
Sep 23 2015 22:11:07.439999854 resource.fs.zfs.statechange
Sep 23 2015 22:11:08.071999847 ereport.fs.zfs.config.sync
Sep 23 2015 22:11:08.455999843 ereport.fs.zfs.config.sync

When I pull a disk, no additional events are shown and, like I said, zpool status still reports everything as fine.
When I reinsert the disk and produce any I/O, z_null_int starts looping and dmesg shows:

[  292.755888] INFO: rcu_sched self-detected stall on CPU
[  292.755892]  4: (5249 ticks this GP) idle=3bd/140000000000001/0 softirq=2448/2448 fqs=5249 
[  292.755893]   (t=5250 jiffies g=2236 c=2235 q=21023)
[  292.755895] Task dump for CPU 4:
[  292.755896] z_null_int      R  running task        0  2173      2 0x00000008
[  292.755897]  ffffffff81c35c80 ffff88087fd03da8 ffffffff810753fe 0000000000000004
[  292.755899]  ffffffff81c35c80 ffff88087fd03dc8 ffffffff81077a38 ffff88087fd03e08
[  292.755900]  0000000000000005 ffff88087fd03df8 ffffffff81091fe0 ffff88087fd130c0
[  292.755901] Call Trace:
[  292.755902]  <IRQ>  [<ffffffff810753fe>] sched_show_task+0xae/0x120
[  292.755908]  [<ffffffff81077a38>] dump_cpu_task+0x38/0x40
[  292.755910]  [<ffffffff81091fe0>] rcu_dump_cpu_stacks+0x90/0xd0
[  292.755912]  [<ffffffff81094d23>] rcu_check_callbacks+0x423/0x700
[  292.755913]  [<ffffffff8109d744>] ? update_wall_time+0x234/0x650
[  292.755915]  [<ffffffff81097134>] update_process_times+0x34/0x60
[  292.755917]  [<ffffffff810a5231>] tick_sched_handle.isra.18+0x31/0x40
[  292.755918]  [<ffffffff810a527c>] tick_sched_timer+0x3c/0x70
[  292.755920]  [<ffffffff81097791>] __run_hrtimer.isra.34+0x41/0xf0
[  292.755921]  [<ffffffff81097f05>] hrtimer_interrupt+0xc5/0x1e0
[  292.755923]  [<ffffffff8102b454>] local_apic_timer_interrupt+0x34/0x60
[  292.755924]  [<ffffffff8102bacc>] smp_apic_timer_interrupt+0x3c/0x60
[  292.755926]  [<ffffffff816f9cea>] apic_timer_interrupt+0x6a/0x70
[  292.755927]  <EOI>  [<ffffffff816f8859>] ? _raw_spin_unlock_irqrestore+0x9/0x10
[  292.755934]  [<ffffffffa0003f6a>] taskq_cancel_id+0x31a/0x530 [spl]
[  292.755935]  [<ffffffff81075a60>] ? wake_up_state+0x10/0x10
[  292.755937]  [<ffffffffa0003d70>] ? taskq_cancel_id+0x120/0x530 [spl]
[  292.755940]  [<ffffffff8106d7e4>] kthread+0xc4/0xe0
[  292.755941]  [<ffffffff8106d720>] ? kthread_create_on_node+0x170/0x170
[  292.755943]  [<ffffffff816f8e48>] ret_from_fork+0x58/0x90
[  292.755944]  [<ffffffff8106d720>] ? kthread_create_on_node+0x170/0x170

BTW: Fedora Server 22, which I also tested with, runs systemd and showed exactly the same behaviour.
BTW2: if I reboot with the disk still ejected, the status is correctly shown as degraded. When I reinsert the disk, no event is created and the status stays at degraded.
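For completeness, the usual manual recovery after reinserting a disk would be something like the following (pool and device names are placeholders):

# Tell ZFS the reinserted device is available again.
zpool online tank sdb

# Or, if the disk came back under a different name, replace the old one.
zpool replace tank sdb sdc

# Then watch the resilver.
zpool status tank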

@behlendorf behlendorf added this to the 0.6.5.2 milestone Sep 23, 2015
@behlendorf
Contributor

@ab12 yes, we'll definitely be tackling this one in a point release. We now have several reports of this issue; #3821 and #817 are both variants of it.

@ghost
Author

ghost commented Sep 23, 2015

Thanks @behlendorf
We are happy to test as usual.

@ghost
Author

ghost commented Sep 24, 2015

Kernel 3.18.21
Latest ZFS release.

I've tried this:

echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic

That helps. A lot. I can remove and re-add the disk several times.

But...

Eventually, after several removals and re-adds, I get the CPU stall again:

INFO: rcu_sched self-detected stall on CPU { 1}  (t=15000 jiffies g=13616 c=13615 q=0)
Task dump for CPU 1:
z_null_int      R  running task        0  6557      2 0x00080008
 0000000000000002 ffffffff818b5200 ffffffff810a9487 0000000000003530
 ffff88080b42d580 ffffffff818b5100 ffffffff818b5200 ffffffff81912c40
 ffffffff810ad008 0000000000000001 0000000000012000 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff810a9487>] ? rcu_dump_cpu_stacks+0x77/0xb0
 [<ffffffff810ad008>] ? rcu_check_callbacks+0x3f8/0x650
 [<ffffffff810bec00>] ? tick_sched_handle.isra.20+0x40/0x40
 [<ffffffff810b072b>] ? update_process_times+0x3b/0x60
 [<ffffffff810bebf1>] ? tick_sched_handle.isra.20+0x31/0x40
 [<ffffffff810bec34>] ? tick_sched_timer+0x34/0x60
 [<ffffffff810b0dd5>] ? __run_hrtimer.isra.34+0x45/0xc0
 [<ffffffff810b13a7>] ? hrtimer_interrupt+0xd7/0x230
 [<ffffffff810152d5>] ? sched_clock+0x5/0x10
 [<ffffffff8103ddd6>] ? smp_apic_timer_interrupt+0x36/0x50
 [<ffffffff815eba7d>] ? apic_timer_interrupt+0x6d/0x80
 <EOI>  [<ffffffffa07fc9b6>] ? zio_done+0x9e6/0xe30 [zfs]
 [<ffffffff815ea3e2>] ? _raw_spin_unlock_irqrestore+0x12/0x70
 [<ffffffffa06a0ebe>] ? taskq_thread+0x1fe/0x410 [spl]
 [<ffffffff81080670>] ? wake_up_process+0x30/0x30
 [<ffffffffa06a0cc0>] ? taskq_cancel_id+0x110/0x110 [spl]
 [<ffffffff8107798c>] ? kthread+0xbc/0xe0
 [<ffffffff810152d5>] ? sched_clock+0x5/0x10
 [<ffffffff810778d0>] ? kthread_create_on_node+0x170/0x170
 [<ffffffff815eaa58>] ? ret_from_fork+0x58/0x90
 [<ffffffff810778d0>] ? kthread_create_on_node+0x170/0x170

This time, the system does not lock up.

There is no z_null_int process running at 100% either.

However, no writes can take place to the pool.

A reset of the system is required.

behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 24, 2015
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in.  This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.

This in turn resulted in vdev_disk_io_start() being able to
re-dispatch zio's which would result in RCU stalls when a disk
was removed from the system.  Additionally, this could negatively
impact performance and may explain the performance regressions
reported in both openzfs#3829 and openzfs#3780.

This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.

Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3780
Issue openzfs#3829
Issue openzfs#3652
@behlendorf
Contributor

@AB17 I was able to reproduce the issue. Can you please verify the fix proposed in #3833? This may also address several of the other regressions observed in 0.6.5.1.
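For anyone else who wants to test it, one way to build the proposed fix is to check out the pull request branch directly. This is only a sketch; it assumes you already build ZFS (and a matching SPL) from source, and the local branch name pr-3833 is arbitrary.

# Fetch and check out the branch behind pull request #3833.
git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git fetch origin pull/3833/head:pr-3833
git checkout pr-3833

# Rebuild and reinstall the modules (exact steps vary by distribution).
sh autogen.sh && ./configure && make -j"$(nproc)" && sudo make install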

@adessemond

@behlendorf @kernelOfTruth: echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic did not help in my case (even when forced at system startup before importing anything). Will test the patch and report.

@adessemond

@behlendorf your fix listed above for #3833 solves the issue on my side. It still needs to be tested on my real hardware, but so far there are no more CPU stalls when a virtual disk is removed.

@ghost
Author

ghost commented Sep 25, 2015

@behlendorf I tried the patch this morning.
Unfortunately it does not fix the issue.
I removed and re-added a disk successfully 3 times, but the 4th time was a problem.

The pool did not go into degraded mode when the disk was removed and write IO took place (going degraded is the expected behaviour).

The CPU stall issue is gone, but the pool is no longer writable.

status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.

"zpool clear" command hangs.

No write IO can take place to the pool.

A reset of the system is required.

@ghost
Author

ghost commented Sep 25, 2015

@behlendorf an update to previous test.

I assumed I was to test with the default spl_taskq_thread_dynamic = 1.

I have tested again with spl_taskq_thread_dynamic = 0.

I have completed a total of 20 tests where I remove and re-add the disk.

In each case the pool goes into degraded mode as expected. A re-add resilvers OK so I can remove again.

Works exactly like it should.

So to confirm: #3833 appears to fix the problem when spl_taskq_thread_dynamic = 0.

@behlendorf
Contributor

@adessemond @AB17 thanks for the quick testing turnaround. I think I've got a good idea what the lingering issue is but I'd like to address that in a separate patch. It's an issue which has always existed but was masked in 0.6.4.x because we didn't have dynamic taskqs. We'll get #3833 in right away, but @AB17 you may need to run with spl_taskq_thread_dynamic = 0 for a little while until the second issue is addressed.

@ghost
Author

ghost commented Sep 25, 2015

@behlendorf no problem. Thank you.

I'll run with spl_taskq_thread_dynamic = 0 for now.

behlendorf added a commit that referenced this issue Sep 25, 2015
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in.  This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.

This in turn resulted in vdev_disk_io_start() being able to
re-dispatch zio's which would result in RCU stalls when a disk
was removed from the system.  Additionally, this could negatively
impact performance and explains the performance regressions reported
in both #3829 and #3780.

This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.

Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3652
Issue #3780
Issue #3785
Issue #3817
Issue #3821
Issue #3829
Issue #3832
Issue #3870
@behlendorf
Contributor

Partially resolved by 5592404, which will be cherry-picked into the 0.6.5.2 release. I've opened #3840 to track the remaining work. For the moment, users can mitigate the remaining issue by setting spl_taskq_thread_dynamic = 0.
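For reference, a sketch of applying that mitigation both at runtime and persistently (the modprobe.d file name is arbitrary):

# Disable dynamic taskqs at runtime.
echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic

# Make the setting persistent across reboots and module reloads.
echo "options spl spl_taskq_thread_dynamic=0" > /etc/modprobe.d/spl.conf

# Verify the current value.
cat /sys/module/spl/parameters/spl_taskq_thread_dynamic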

@perfinion
Contributor

Just for the record, I think I was hitting this too. My server has one drive that is starting to go bad and once in a while hits an unreadable sector. Rsyncing data from an mdadm array to the new ZFS pool would lock up after a while, even though I never removed or added any drives. Pinging the machine worked fine, as did anything still running in RAM, but as soon as something hit the disk it was dead and the only fix was magic SysRq or a hard poweroff. Unfortunately I do not have any dmesg logs, since the machine is remote. Running echo 3 > drop_caches once in a while appeared to make it last longer before locking up, but I am not sure why, or whether it was just luck for a while.

I applied ef5b2e1 and this patch locally and the machine has been fine for many hours since. I have rsync'd more data over to the ZFS pool and it should probably have hit an error by now, so it appears fixed.

@behlendorf since spl_taskq_thread_dynamic = 0 is required to make things stable, should the default for that be set back to 0 until the other issue is fixed too so that the next point release will work correctly without user intervention?

@behlendorf
Contributor

@perfinion I was considering the same thing myself. However, from what I've seen reported, the ef5b2e1 fix addresses the vast majority of issues. Setting spl_taskq_thread_dynamic = 0 has only been needed in a few cases, and I believe in those cases it just does a better job of hiding a long-standing issue. I'd like to leave it enabled by default for now and get that underlying issue resolved, hopefully in 0.6.5.3.

behlendorf added a commit that referenced this issue Sep 30, 2015