
zfs send hangs #1638

Closed
snajpa opened this issue Aug 8, 2013 · 48 comments

@snajpa
Contributor

snajpa commented Aug 8, 2013

Hello,

it happens fairly often that zfs send just hangs.

Here's a bit of info:

ZoL version - last commit = cd72af9 ( Fix 'zpool list -H' error code )

[root@node6]
 /proc/840113 # ps axl | grep 840113
4     0  840113  840096  20   0 124072  1520 taskq_ Ds   ?          0:20 zfs send -I vz/private/1062@backup-2013-08-07T01:00:16 vz/private/1062@backup-2013-08-08T01:00:09
0     0  885009  229605  20   0 103252   856 pipe_w S+   pts/3      0:00 grep 840113
[root@]
 /proc/840113 # zfs list -t all -r vz/private/1062
NAME                                         USED  AVAIL  REFER  MOUNTPOINT
vz/private/1062                             35.4G  25.7G  34.3G  /vz/private/1062
vz/private/1062@backup-2013-08-07T01:00:16   525M      -  32.2G  -
vz/private/1062@backup-2013-08-08T01:00:09   573M      -  33.2G  -
[root@node6]
 /proc/840113 # cat cmdline; echo; cat wchan; echo;
zfssend-Ivz/private/1062backup-2013-08-07T01:00:16vz/private/1062backup-2013-08-08T01:00:09
taskq_wait_id
[root@node6]
 /proc/840113 # rpm -qa | grep zfs
zfs-dkms-0.6.1-89_gcd72af9.el6.noarch
zfs-devel-0.6.1-89_gcd72af9.el6.x86_64
zfs-release-1-2.el6.noarch
zfs-dracut-0.6.1-89_gcd72af9.el6.x86_64
zfs-test-0.6.1-89_gcd72af9.el6.x86_64
zfs-0.6.1-89_gcd72af9.el6.x86_64
[root@node6]
 /proc/840113 # uname -a
Linux node6.prg.vpsfree.cz 2.6.32-042stab079.5 #1 SMP Fri Aug 2 17:16:15 MSK 2013 x86_64 x86_64 x86_64 GNU/Linux

@ryao
Contributor

ryao commented Aug 9, 2013

Someone else had a similar problem. It was determined that memory corruption from his use of non-ECC RAM had corrupted his pool. In his situation, a scrub revealed that several metadata objects had been corrupted, although it is also possible for the checksum calculation to have been done against already-corrupt data if the memory was causing the corruption.

Have you done a scrub and are you using ECC memory?
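
For reference, a quick way to kick off a scrub and check its result - assuming the pool is named "tank" - is:

zpool scrub tank        # start a scrub in the background
zpool status -v tank    # shows scrub progress and lists any files with unrecoverable errors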

@snajpa
Contributor Author

snajpa commented Aug 9, 2013

I'm scrubbing regularly once a week and yes, it's all Xeon E5s w/ ECC reg. memory.

@ryao
Contributor

ryao commented Aug 9, 2013

@snajpa @behlendorf debugged the last report like this. I will leave this in his hands.

@snajpa
Contributor Author

snajpa commented Aug 10, 2013

And because I can't reboot the machine (it's in production), our backup system has naturally left one zfs destroy hanging on that snapshot (it takes snapshots, sends them to the backup storage, and then destroys them).

  80200 ?        D      0:00 zfs destroy vz/private/1062@backup-2013-08-07T01:00:16
 840113 ?        Ds     1:07 zfs send -I vz/private/1062@backup-2013-08-07T01:00:16 vz/private/1062@backup-2013-08-08T01:00:09
[root@node6.prg.vpsfree.cz]
 ~ # cd /proc/80200
[root@node6.prg.vpsfree.cz]
 /proc/80200 # cat wchan ; echo;
call_rwsem_down_write_failed

@snajpa
Contributor Author

snajpa commented Aug 18, 2013

bump :)

I really need a fix for this - we badly need ZFS to be production-ready, and this is a pretty serious blocker.

Can I send beers or anything to anyone who'd fix it? Please? :)

@casualfish
Contributor

The zfs send process is in the D state, which usually means the process is waiting for IO. The kernel will complain about blocked processes after a specific period (120 s by default) and print out their stack traces - are there any such logs in your /var/log/messages?
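
For reference, a minimal way to check whether that detector is armed and to pull any existing warnings out of the log (assuming a stock RHEL6-style setup) would be something like:

cat /proc/sys/kernel/hung_task_timeout_secs                              # 0 means the warnings are disabled
grep -B2 -A25 "blocked for more than 120 seconds" /var/log/messages     # pull out any traces already logged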

@snajpa
Contributor Author

snajpa commented Aug 18, 2013

Nope, there's nothing, plus gstack hangs and systemtap is useless; I'm out of options for getting the stack trace :(
So if it's needed for the debugging process, it'll have to wait until the next reboot. Sad story :D
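
For reference, one more thing that could be tried here, assuming the kernel exposes it (most RHEL6-era kernels should): reading the kernel-side stack of the hung process directly, no ptrace needed - e.g. for the pid from the first paste:

cat /proc/840113/stack          # kernel stack of the stuck zfs send, if CONFIG_STACKTRACE is enabled
cat /proc/840113/wchan; echo    # the symbol it is currently sleeping in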

@snajpa
Contributor Author

snajpa commented Aug 20, 2013

This is what I'm getting in the syslog (lots of these, and all of them seem to be the same), but I doubt it's closely related:

Aug 20 05:54:30 node6 kernel: [1113561.392928] SPL: Showing stack for process 2624
Aug 20 05:54:30 node6 kernel: [1113561.392935] Pid: 2624, comm: txg_sync veid: 0 Tainted: P           ---------------    2.6.32-042stab079.5 #1
Aug 20 05:54:30 node6 kernel: [1113561.392939] Call Trace:
Aug 20 05:54:30 node6 kernel: [1113561.392975]  [<ffffffffa021d5a7>] ? spl_debug_dumpstack+0x27/0x40 [spl]
Aug 20 05:54:30 node6 kernel: [1113561.392989]  [<ffffffffa0225327>] ? kmem_alloc_debug+0x437/0x4c0 [spl]
Aug 20 05:54:30 node6 kernel: [1113561.393004]  [<ffffffffa0227ac3>] ? task_alloc+0x1d3/0x380 [spl]
Aug 20 05:54:30 node6 kernel: [1113561.393016]  [<ffffffffa0224f7f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
Aug 20 05:54:30 node6 kernel: [1113561.393075]  [<ffffffffa0350300>] ? spa_deadman+0x0/0x120 [zfs]
Aug 20 05:54:30 node6 kernel: [1113561.393087]  [<ffffffffa02282db>] ? taskq_dispatch_delay+0x19b/0x2a0 [spl]
Aug 20 05:54:30 node6 kernel: [1113561.393100]  [<ffffffffa02285c2>] ? taskq_cancel_id+0x102/0x1e0 [spl]
Aug 20 05:54:30 node6 kernel: [1113561.393135]  [<ffffffffa03401aa>] ? spa_sync+0x1fa/0xa80 [zfs]
Aug 20 05:54:30 node6 kernel: [1113561.393145]  [<ffffffff81012c99>] ? read_tsc+0x9/0x20
Aug 20 05:54:30 node6 kernel: [1113561.393152]  [<ffffffff810a89a1>] ? ktime_get_ts+0xb1/0xf0
Aug 20 05:54:30 node6 kernel: [1113561.393188]  [<ffffffffa0353317>] ? txg_sync_thread+0x307/0x590 [zfs]
Aug 20 05:54:30 node6 kernel: [1113561.393197]  [<ffffffff81057a69>] ? set_user_nice+0xc9/0x130
Aug 20 05:54:30 node6 kernel: [1113561.393230]  [<ffffffffa0353010>] ? txg_sync_thread+0x0/0x590 [zfs]
Aug 20 05:54:30 node6 kernel: [1113561.393243]  [<ffffffffa0226428>] ? thread_generic_wrapper+0x68/0x80 [spl]
Aug 20 05:54:30 node6 kernel: [1113561.393254]  [<ffffffffa02263c0>] ? thread_generic_wrapper+0x0/0x80 [spl]
Aug 20 05:54:30 node6 kernel: [1113561.393261]  [<ffffffff8109cca6>] ? kthread+0x96/0xa0
Aug 20 05:54:30 node6 kernel: [1113561.393266]  [<ffffffff8100c1aa>] ? child_rip+0xa/0x20
Aug 20 05:54:30 node6 kernel: [1113561.393272]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 20 05:54:30 node6 kernel: [1113561.393276]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20

@casualfish
Contributor

@snajpa Yes, it's useful. According to the backtrace and commit cc92e9d, can you show the output of 'zpool events -v'? It can help identify the delayed zio request.

@snajpa
Contributor Author

snajpa commented Aug 20, 2013

Ok, I'll try to get more info next time, which I think will be a few days from now. I had to reboot the machine because it completely stopped doing any writes to the zpool. I'm only afraid that I'm seeing more than one issue at once - that ^ dmesg, the zfs send hangs, and these complete write lockups...

How can I go about debugging this and getting more info for you guys? I've already made sure that hung task traces go to syslog, and I'll try zpool events -v next time. What more can I do? :)

@casualfish
Contributor

Of course the ideal approach is to reduce this to a minimal workload or test environment that also triggers the bug, so we can reproduce it ourselves and find the root cause. But since your servers are in production, that may not be easy. Anyway, the output of zpool events -v may help show whether the problem is caused by a faulting device; if so, there will be many zios delayed by the same vdev (which I suspect is the root cause).
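
For reference, the sort of thing to capture while a hang is in progress - all standard commands, nothing exotic:

zpool events -v     # verbose event log; delayed zios should show up here if the deadman fires
zpool status -v     # per-vdev READ/WRITE/CKSUM counters, to spot a misbehaving disk
zpool events -c     # optionally clear the event log afterwards, so the next hang stands out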

@snajpa
Contributor Author

snajpa commented Aug 21, 2013

Ok, so there it is again - zfs send has hung again:

Last login: Wed Aug 21 09:33:48 2013 from 172.16.107.100
[root@node6.prg.vpsfree.cz]
 ~ # ps ax | grep -P "\s+D"
   2182 ?        D<     0:00 [spl_system_task]
 349131 ?        Ss     0:00 /bin/sh -c /usr/sbin/amdump DailySet1
 349133 ?        S      0:00 /usr/bin/perl /usr/sbin/amdump DailySet1
 349645 ?        S      0:00 /usr/lib/amanda/driver DailySet1
 349663 ?        Sl     7:46 /usr/bin/perl /usr/lib/amanda/taper DailySet1
 349665 ?        S      0:42 dumper0 DailySet1
 349666 ?        S      0:00 dumper1 DailySet1
 349667 ?        S      0:00 dumper2 DailySet1
 349669 ?        S      0:00 dumper3 DailySet1
 349670 ?        S      0:00 dumper4 DailySet1
 349671 ?        S      0:00 dumper5 DailySet1
 349672 ?        S      0:00 dumper6 DailySet1
 349673 ?        S      0:00 dumper7 DailySet1
 349674 ?        S      0:00 dumper8 DailySet1
 349675 ?        S      0:00 dumper9 DailySet1
1036449 ?        Ds     0:08 zfs send -I vz/private/2067@backup-2013-08-17T01:01:14 vz/private/2067@backup-2013-08-21T01:00:43
[root@node6.prg.vpsfree.cz]
 ~ # cat /proc/1036449/wchan; echo;
taskq_wait_id
[root@node6.prg.vpsfree.cz]
 ~ # tail -n330 /var/log/messages
Aug 21 04:59:13 node6 kernel: [77006.771088]  [<ffffffff815164d0>] ? thread_return+0xbe/0x88e
Aug 21 04:59:13 node6 kernel: [77006.771097]  [<ffffffff8105e510>] ? default_wake_function+0x0/0x20
Aug 21 04:59:13 node6 kernel: [77006.771111]  [<ffffffffa02243c0>] ? taskq_thread+0x0/0x4b0 [spl]
Aug 21 04:59:13 node6 kernel: [77006.771120]  [<ffffffff8109cca6>] kthread+0x96/0xa0
Aug 21 04:59:13 node6 kernel: [77006.771130]  [<ffffffff8100c1aa>] child_rip+0xa/0x20
Aug 21 04:59:13 node6 kernel: [77006.771137]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 21 04:59:13 node6 kernel: [77006.771143]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Aug 21 05:01:13 node6 kernel: [77126.721212] INFO: task spl_system_task:2182 blocked for more than 120 seconds.
Aug 21 05:01:13 node6 kernel: [77126.721236] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 21 05:01:13 node6 kernel: [77126.721256] spl_system_ta D ffff882077b2d3e0     0  2182      2    0 0x00000000
Aug 21 05:01:13 node6 kernel: [77126.721263]  ffff882078131500 0000000000000046 ffff8820781314c8 0000000000000003
Aug 21 05:01:13 node6 kernel: [77126.721272]  0000000000000001 00000000005ced69 ffff8810c0950650 ffff881000000000
Aug 21 05:01:13 node6 kernel: [77126.721279]  ffff8810c0950640 00000001048ece5c ffff882077b2d9a8 000000000001ea80
Aug 21 05:01:13 node6 kernel: [77126.721286] Call Trace:
Aug 21 05:01:13 node6 kernel: [77126.721320]  [<ffffffffa022b385>] cv_wait_common+0x105/0x1c0 [spl]
Aug 21 05:01:13 node6 kernel: [77126.721349]  [<ffffffffa02f3294>] ? add_reference+0xc4/0x100 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721359]  [<ffffffff8109d280>] ? autoremove_wake_function+0x0/0x40
Aug 21 05:01:13 node6 kernel: [77126.721382]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721413]  [<ffffffffa022b495>] __cv_wait+0x15/0x20 [spl]
Aug 21 05:01:13 node6 kernel: [77126.721446]  [<ffffffffa030e9c5>] traverse_prefetcher+0xa5/0x150 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721481]  [<ffffffffa030ed59>] traverse_visitbp+0x2e9/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721514]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721548]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721577]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721609]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721638]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721670]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721700]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721733]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721764]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721797]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721837]  [<ffffffffa0395913>] ? zio_buf_alloc+0x23/0x30 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721860]  [<ffffffffa02f73a7>] ? arc_get_data_buf+0x307/0x480 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721885]  [<ffffffffa030f6b4>] traverse_dnode+0x74/0x100 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721911]  [<ffffffffa030f141>] traverse_visitbp+0x6d1/0x830 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721928]  [<ffffffff81065b9f>] ? finish_task_switch+0x4f/0xe0
Aug 21 05:01:13 node6 kernel: [77126.721961]  [<ffffffffa030f7c3>] traverse_prefetch_thread+0x83/0xc0 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.721994]  [<ffffffffa030e920>] ? traverse_prefetcher+0x0/0x150 [zfs]
Aug 21 05:01:13 node6 kernel: [77126.722012]  [<ffffffffa02245d8>] taskq_thread+0x218/0x4b0 [spl]
Aug 21 05:01:13 node6 kernel: [77126.722028]  [<ffffffff815164d0>] ? thread_return+0xbe/0x88e
Aug 21 05:01:13 node6 kernel: [77126.722042]  [<ffffffff8105e510>] ? default_wake_function+0x0/0x20
Aug 21 05:01:13 node6 kernel: [77126.722060]  [<ffffffffa02243c0>] ? taskq_thread+0x0/0x4b0 [spl]
Aug 21 05:01:13 node6 kernel: [77126.722072]  [<ffffffff8109cca6>] kthread+0x96/0xa0
Aug 21 05:01:13 node6 kernel: [77126.722086]  [<ffffffff8100c1aa>] child_rip+0xa/0x20
Aug 21 05:01:13 node6 kernel: [77126.722097]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 21 05:01:13 node6 kernel: [77126.722108]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Aug 21 05:03:13 node6 kernel: [77246.664857] INFO: task spl_system_task:2182 blocked for more than 120 seconds.
Aug 21 05:03:13 node6 kernel: [77246.664887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 21 05:03:13 node6 kernel: [77246.664914] spl_system_ta D ffff882077b2d3e0     0  2182      2    0 0x00000000
Aug 21 05:03:13 node6 kernel: [77246.664925]  ffff882078131500 0000000000000046 ffff8820781314c8 0000000000000003
Aug 21 05:03:13 node6 kernel: [77246.664937]  0000000000000001 00000000005ced69 ffff8810c0950650 ffff881000000000
Aug 21 05:03:13 node6 kernel: [77246.664947]  ffff8810c0950640 00000001048ece5c ffff882077b2d9a8 000000000001ea80
Aug 21 05:03:13 node6 kernel: [77246.664957] Call Trace:
Aug 21 05:03:13 node6 kernel: [77246.665009]  [<ffffffffa022b385>] cv_wait_common+0x105/0x1c0 [spl]
Aug 21 05:03:13 node6 kernel: [77246.665049]  [<ffffffffa02f3294>] ? add_reference+0xc4/0x100 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665061]  [<ffffffff8109d280>] ? autoremove_wake_function+0x0/0x40
Aug 21 05:03:13 node6 kernel: [77246.665094]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665110]  [<ffffffffa022b495>] __cv_wait+0x15/0x20 [spl]
Aug 21 05:03:13 node6 kernel: [77246.665150]  [<ffffffffa030e9c5>] traverse_prefetcher+0xa5/0x150 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665209]  [<ffffffffa030ed59>] traverse_visitbp+0x2e9/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665236]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665282]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665314]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665350]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665382]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665417]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665449]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665484]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665532]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665567]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665631]  [<ffffffffa0395913>] ? zio_buf_alloc+0x23/0x30 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665667]  [<ffffffffa02f73a7>] ? arc_get_data_buf+0x307/0x480 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665705]  [<ffffffffa030f6b4>] traverse_dnode+0x74/0x100 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665741]  [<ffffffffa030f141>] traverse_visitbp+0x6d1/0x830 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665773]  [<ffffffff81065b9f>] ? finish_task_switch+0x4f/0xe0
Aug 21 05:03:13 node6 kernel: [77246.665823]  [<ffffffffa030f7c3>] traverse_prefetch_thread+0x83/0xc0 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665874]  [<ffffffffa030e920>] ? traverse_prefetcher+0x0/0x150 [zfs]
Aug 21 05:03:13 node6 kernel: [77246.665904]  [<ffffffffa02245d8>] taskq_thread+0x218/0x4b0 [spl]
Aug 21 05:03:13 node6 kernel: [77246.665930]  [<ffffffff815164d0>] ? thread_return+0xbe/0x88e
Aug 21 05:03:13 node6 kernel: [77246.665945]  [<ffffffff8105e510>] ? default_wake_function+0x0/0x20
Aug 21 05:03:13 node6 kernel: [77246.665987]  [<ffffffffa02243c0>] ? taskq_thread+0x0/0x4b0 [spl]
Aug 21 05:03:13 node6 kernel: [77246.666008]  [<ffffffff8109cca6>] kthread+0x96/0xa0
Aug 21 05:03:13 node6 kernel: [77246.666031]  [<ffffffff8100c1aa>] child_rip+0xa/0x20
Aug 21 05:03:13 node6 kernel: [77246.666049]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 21 05:03:13 node6 kernel: [77246.666066]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Aug 21 05:05:13 node6 kernel: [77366.611610] INFO: task spl_system_task:2182 blocked for more than 120 seconds.
Aug 21 05:05:13 node6 kernel: [77366.611632] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 21 05:05:13 node6 kernel: [77366.611650] spl_system_ta D ffff882077b2d3e0     0  2182      2    0 0x00000000
Aug 21 05:05:13 node6 kernel: [77366.611656]  ffff882078131500 0000000000000046 ffff8820781314c8 0000000000000003
Aug 21 05:05:13 node6 kernel: [77366.611662]  0000000000000001 00000000005ced69 ffff8810c0950650 ffff881000000000
Aug 21 05:05:13 node6 kernel: [77366.611666]  ffff8810c0950640 00000001048ece5c ffff882077b2d9a8 000000000001ea80
Aug 21 05:05:13 node6 kernel: [77366.611671] Call Trace:
Aug 21 05:05:13 node6 kernel: [77366.611700]  [<ffffffffa022b385>] cv_wait_common+0x105/0x1c0 [spl]
Aug 21 05:05:13 node6 kernel: [77366.611723]  [<ffffffffa02f3294>] ? add_reference+0xc4/0x100 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611731]  [<ffffffff8109d280>] ? autoremove_wake_function+0x0/0x40
Aug 21 05:05:13 node6 kernel: [77366.611748]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611757]  [<ffffffffa022b495>] __cv_wait+0x15/0x20 [spl]
Aug 21 05:05:13 node6 kernel: [77366.611778]  [<ffffffffa030e9c5>] traverse_prefetcher+0xa5/0x150 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611798]  [<ffffffffa030ed59>] traverse_visitbp+0x2e9/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611817]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611836]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611852]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611870]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611886]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611904]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611921]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611939]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611955]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.611973]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.612011]  [<ffffffffa0395913>] ? zio_buf_alloc+0x23/0x30 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.612035]  [<ffffffffa02f73a7>] ? arc_get_data_buf+0x307/0x480 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.612069]  [<ffffffffa030f6b4>] traverse_dnode+0x74/0x100 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.612095]  [<ffffffffa030f141>] traverse_visitbp+0x6d1/0x830 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.612108]  [<ffffffff81065b9f>] ? finish_task_switch+0x4f/0xe0
Aug 21 05:05:13 node6 kernel: [77366.612133]  [<ffffffffa030f7c3>] traverse_prefetch_thread+0x83/0xc0 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.612157]  [<ffffffffa030e920>] ? traverse_prefetcher+0x0/0x150 [zfs]
Aug 21 05:05:13 node6 kernel: [77366.612172]  [<ffffffffa02245d8>] taskq_thread+0x218/0x4b0 [spl]
Aug 21 05:05:13 node6 kernel: [77366.612184]  [<ffffffff815164d0>] ? thread_return+0xbe/0x88e
Aug 21 05:05:13 node6 kernel: [77366.612195]  [<ffffffff8105e510>] ? default_wake_function+0x0/0x20
Aug 21 05:05:13 node6 kernel: [77366.612208]  [<ffffffffa02243c0>] ? taskq_thread+0x0/0x4b0 [spl]
Aug 21 05:05:13 node6 kernel: [77366.612219]  [<ffffffff8109cca6>] kthread+0x96/0xa0
Aug 21 05:05:13 node6 kernel: [77366.612230]  [<ffffffff8100c1aa>] child_rip+0xa/0x20
Aug 21 05:05:13 node6 kernel: [77366.612239]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 21 05:05:13 node6 kernel: [77366.612247]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Aug 21 05:07:13 node6 kernel: [77486.558794] INFO: task spl_system_task:2182 blocked for more than 120 seconds.
Aug 21 05:07:13 node6 kernel: [77486.558813] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 21 05:07:13 node6 kernel: [77486.558831] spl_system_ta D ffff882077b2d3e0     0  2182      2    0 0x00000000
Aug 21 05:07:13 node6 kernel: [77486.558837]  ffff882078131500 0000000000000046 ffff8820781314c8 0000000000000003
Aug 21 05:07:13 node6 kernel: [77486.558843]  0000000000000001 00000000005ced69 ffff8810c0950650 ffff881000000000
Aug 21 05:07:13 node6 kernel: [77486.558848]  ffff8810c0950640 00000001048ece5c ffff882077b2d9a8 000000000001ea80
Aug 21 05:07:13 node6 kernel: [77486.558852] Call Trace:
Aug 21 05:07:13 node6 kernel: [77486.558880]  [<ffffffffa022b385>] cv_wait_common+0x105/0x1c0 [spl]
Aug 21 05:07:13 node6 kernel: [77486.558902]  [<ffffffffa02f3294>] ? add_reference+0xc4/0x100 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.558909]  [<ffffffff8109d280>] ? autoremove_wake_function+0x0/0x40
Aug 21 05:07:13 node6 kernel: [77486.558927]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.558935]  [<ffffffffa022b495>] __cv_wait+0x15/0x20 [spl]
Aug 21 05:07:13 node6 kernel: [77486.558956]  [<ffffffffa030e9c5>] traverse_prefetcher+0xa5/0x150 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.558977]  [<ffffffffa030ed59>] traverse_visitbp+0x2e9/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.558996]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559015]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559031]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559050]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559066]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559084]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559100]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559119]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559135]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559153]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559180]  [<ffffffffa0395913>] ? zio_buf_alloc+0x23/0x30 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559197]  [<ffffffffa02f73a7>] ? arc_get_data_buf+0x307/0x480 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559216]  [<ffffffffa030f6b4>] traverse_dnode+0x74/0x100 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559234]  [<ffffffffa030f141>] traverse_visitbp+0x6d1/0x830 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559240]  [<ffffffff81065b9f>] ? finish_task_switch+0x4f/0xe0
Aug 21 05:07:13 node6 kernel: [77486.559259]  [<ffffffffa030f7c3>] traverse_prefetch_thread+0x83/0xc0 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559276]  [<ffffffffa030e920>] ? traverse_prefetcher+0x0/0x150 [zfs]
Aug 21 05:07:13 node6 kernel: [77486.559285]  [<ffffffffa02245d8>] taskq_thread+0x218/0x4b0 [spl]
Aug 21 05:07:13 node6 kernel: [77486.559290]  [<ffffffff815164d0>] ? thread_return+0xbe/0x88e
Aug 21 05:07:13 node6 kernel: [77486.559295]  [<ffffffff8105e510>] ? default_wake_function+0x0/0x20
Aug 21 05:07:13 node6 kernel: [77486.559303]  [<ffffffffa02243c0>] ? taskq_thread+0x0/0x4b0 [spl]
Aug 21 05:07:13 node6 kernel: [77486.559307]  [<ffffffff8109cca6>] kthread+0x96/0xa0
Aug 21 05:07:13 node6 kernel: [77486.559312]  [<ffffffff8100c1aa>] child_rip+0xa/0x20
Aug 21 05:07:13 node6 kernel: [77486.559316]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 21 05:07:13 node6 kernel: [77486.559319]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Aug 21 05:09:13 node6 kernel: [77606.501113] INFO: task spl_system_task:2182 blocked for more than 120 seconds.
Aug 21 05:09:13 node6 kernel: [77606.501134] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 21 05:09:13 node6 kernel: [77606.501152] spl_system_ta D ffff882077b2d3e0     0  2182      2    0 0x00000000
Aug 21 05:09:13 node6 kernel: [77606.501158]  ffff882078131500 0000000000000046 ffff8820781314c8 0000000000000003
Aug 21 05:09:13 node6 kernel: [77606.501164]  0000000000000001 00000000005ced69 ffff8810c0950650 ffff881000000000
Aug 21 05:09:13 node6 kernel: [77606.501169]  ffff8810c0950640 00000001048ece5c ffff882077b2d9a8 000000000001ea80
Aug 21 05:09:13 node6 kernel: [77606.501173] Call Trace:
Aug 21 05:09:13 node6 kernel: [77606.501203]  [<ffffffffa022b385>] cv_wait_common+0x105/0x1c0 [spl]
Aug 21 05:09:13 node6 kernel: [77606.501226]  [<ffffffffa02f3294>] ? add_reference+0xc4/0x100 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501233]  [<ffffffff8109d280>] ? autoremove_wake_function+0x0/0x40
Aug 21 05:09:13 node6 kernel: [77606.501250]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501259]  [<ffffffffa022b495>] __cv_wait+0x15/0x20 [spl]
Aug 21 05:09:13 node6 kernel: [77606.501280]  [<ffffffffa030e9c5>] traverse_prefetcher+0xa5/0x150 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501300]  [<ffffffffa030ed59>] traverse_visitbp+0x2e9/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501319]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501337]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501353]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501371]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501388]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501406]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501422]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501440]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501456]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501474]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501502]  [<ffffffffa0395913>] ? zio_buf_alloc+0x23/0x30 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501519]  [<ffffffffa02f73a7>] ? arc_get_data_buf+0x307/0x480 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501538]  [<ffffffffa030f6b4>] traverse_dnode+0x74/0x100 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501556]  [<ffffffffa030f141>] traverse_visitbp+0x6d1/0x830 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501561]  [<ffffffff81065b9f>] ? finish_task_switch+0x4f/0xe0
Aug 21 05:09:13 node6 kernel: [77606.501580]  [<ffffffffa030f7c3>] traverse_prefetch_thread+0x83/0xc0 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501598]  [<ffffffffa030e920>] ? traverse_prefetcher+0x0/0x150 [zfs]
Aug 21 05:09:13 node6 kernel: [77606.501606]  [<ffffffffa02245d8>] taskq_thread+0x218/0x4b0 [spl]
Aug 21 05:09:13 node6 kernel: [77606.501612]  [<ffffffff815164d0>] ? thread_return+0xbe/0x88e
Aug 21 05:09:13 node6 kernel: [77606.501617]  [<ffffffff8105e510>] ? default_wake_function+0x0/0x20
Aug 21 05:09:13 node6 kernel: [77606.501625]  [<ffffffffa02243c0>] ? taskq_thread+0x0/0x4b0 [spl]
Aug 21 05:09:13 node6 kernel: [77606.501628]  [<ffffffff8109cca6>] kthread+0x96/0xa0
Aug 21 05:09:13 node6 kernel: [77606.501634]  [<ffffffff8100c1aa>] child_rip+0xa/0x20
Aug 21 05:09:13 node6 kernel: [77606.501638]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 21 05:09:13 node6 kernel: [77606.501640]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Aug 21 05:11:13 node6 kernel: [77726.451650] INFO: task spl_system_task:2182 blocked for more than 120 seconds.
Aug 21 05:11:13 node6 kernel: [77726.451674] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 21 05:11:13 node6 kernel: [77726.451696] spl_system_ta D ffff882077b2d3e0     0  2182      2    0 0x00000000
Aug 21 05:11:13 node6 kernel: [77726.451703]  ffff882078131500 0000000000000046 ffff8820781314c8 0000000000000003
Aug 21 05:11:13 node6 kernel: [77726.451712]  0000000000000001 00000000005ced69 ffff8810c0950650 ffff881000000000
Aug 21 05:11:13 node6 kernel: [77726.451719]  ffff8810c0950640 00000001048ece5c ffff882077b2d9a8 000000000001ea80
Aug 21 05:11:13 node6 kernel: [77726.451725] Call Trace:
Aug 21 05:11:13 node6 kernel: [77726.451757]  [<ffffffffa022b385>] cv_wait_common+0x105/0x1c0 [spl]
Aug 21 05:11:13 node6 kernel: [77726.451785]  [<ffffffffa02f3294>] ? add_reference+0xc4/0x100 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.451796]  [<ffffffff8109d280>] ? autoremove_wake_function+0x0/0x40
Aug 21 05:11:13 node6 kernel: [77726.451817]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.451828]  [<ffffffffa022b495>] __cv_wait+0x15/0x20 [spl]
Aug 21 05:11:13 node6 kernel: [77726.451854]  [<ffffffffa030e9c5>] traverse_prefetcher+0xa5/0x150 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.451879]  [<ffffffffa030ed59>] traverse_visitbp+0x2e9/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.451902]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.451925]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.451945]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.451967]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.451987]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.452009]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.452029]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.452051]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.452070]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.452091]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.452125]  [<ffffffffa0395913>] ? zio_buf_alloc+0x23/0x30 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.452145]  [<ffffffffa02f73a7>] ? arc_get_data_buf+0x307/0x480 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.452168]  [<ffffffffa030f6b4>] traverse_dnode+0x74/0x100 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.452190]  [<ffffffffa030f141>] traverse_visitbp+0x6d1/0x830 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.452198]  [<ffffffff81065b9f>] ? finish_task_switch+0x4f/0xe0
Aug 21 05:11:13 node6 kernel: [77726.452220]  [<ffffffffa030f7c3>] traverse_prefetch_thread+0x83/0xc0 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.452242]  [<ffffffffa030e920>] ? traverse_prefetcher+0x0/0x150 [zfs]
Aug 21 05:11:13 node6 kernel: [77726.452253]  [<ffffffffa02245d8>] taskq_thread+0x218/0x4b0 [spl]
Aug 21 05:11:13 node6 kernel: [77726.452262]  [<ffffffff815164d0>] ? thread_return+0xbe/0x88e
Aug 21 05:11:13 node6 kernel: [77726.452268]  [<ffffffff8105e510>] ? default_wake_function+0x0/0x20
Aug 21 05:11:13 node6 kernel: [77726.452278]  [<ffffffffa02243c0>] ? taskq_thread+0x0/0x4b0 [spl]
Aug 21 05:11:13 node6 kernel: [77726.452283]  [<ffffffff8109cca6>] kthread+0x96/0xa0
Aug 21 05:11:13 node6 kernel: [77726.452290]  [<ffffffff8100c1aa>] child_rip+0xa/0x20
Aug 21 05:11:13 node6 kernel: [77726.452295]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 21 05:11:13 node6 kernel: [77726.452299]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Aug 21 05:13:13 node6 kernel: [77846.394000] INFO: task spl_system_task:2182 blocked for more than 120 seconds.
Aug 21 05:13:13 node6 kernel: [77846.394024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 21 05:13:13 node6 kernel: [77846.394045] spl_system_ta D ffff882077b2d3e0     0  2182      2    0 0x00000000
Aug 21 05:13:13 node6 kernel: [77846.394053]  ffff882078131500 0000000000000046 ffff8820781314c8 0000000000000003
Aug 21 05:13:13 node6 kernel: [77846.394062]  0000000000000001 00000000005ced69 ffff8810c0950650 ffff881000000000
Aug 21 05:13:13 node6 kernel: [77846.394070]  ffff8810c0950640 00000001048ece5c ffff882077b2d9a8 000000000001ea80
Aug 21 05:13:13 node6 kernel: [77846.394078] Call Trace:
Aug 21 05:13:13 node6 kernel: [77846.394114]  [<ffffffffa022b385>] cv_wait_common+0x105/0x1c0 [spl]
Aug 21 05:13:13 node6 kernel: [77846.394146]  [<ffffffffa02f3294>] ? add_reference+0xc4/0x100 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394158]  [<ffffffff8109d280>] ? autoremove_wake_function+0x0/0x40
Aug 21 05:13:13 node6 kernel: [77846.394183]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394206]  [<ffffffffa022b495>] __cv_wait+0x15/0x20 [spl]
Aug 21 05:13:13 node6 kernel: [77846.394236]  [<ffffffffa030e9c5>] traverse_prefetcher+0xa5/0x150 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394266]  [<ffffffffa030ed59>] traverse_visitbp+0x2e9/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394294]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394321]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394346]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394373]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394397]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394424]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394448]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394475]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394499]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394525]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394569]  [<ffffffffa0395913>] ? zio_buf_alloc+0x23/0x30 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394593]  [<ffffffffa02f73a7>] ? arc_get_data_buf+0x307/0x480 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394621]  [<ffffffffa030f6b4>] traverse_dnode+0x74/0x100 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394648]  [<ffffffffa030f141>] traverse_visitbp+0x6d1/0x830 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394656]  [<ffffffff81065b9f>] ? finish_task_switch+0x4f/0xe0
Aug 21 05:13:13 node6 kernel: [77846.394684]  [<ffffffffa030f7c3>] traverse_prefetch_thread+0x83/0xc0 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394711]  [<ffffffffa030e920>] ? traverse_prefetcher+0x0/0x150 [zfs]
Aug 21 05:13:13 node6 kernel: [77846.394723]  [<ffffffffa02245d8>] taskq_thread+0x218/0x4b0 [spl]
Aug 21 05:13:13 node6 kernel: [77846.394731]  [<ffffffff815164d0>] ? thread_return+0xbe/0x88e
Aug 21 05:13:13 node6 kernel: [77846.394737]  [<ffffffff8105e510>] ? default_wake_function+0x0/0x20
Aug 21 05:13:13 node6 kernel: [77846.394749]  [<ffffffffa02243c0>] ? taskq_thread+0x0/0x4b0 [spl]
Aug 21 05:13:13 node6 kernel: [77846.394755]  [<ffffffff8109cca6>] kthread+0x96/0xa0
Aug 21 05:13:13 node6 kernel: [77846.394763]  [<ffffffff8100c1aa>] child_rip+0xa/0x20
Aug 21 05:13:13 node6 kernel: [77846.394768]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 21 05:13:13 node6 kernel: [77846.394773]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Aug 21 05:15:13 node6 kernel: [77966.339896] INFO: task spl_system_task:2182 blocked for more than 120 seconds.
Aug 21 05:15:13 node6 kernel: [77966.339916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 21 05:15:13 node6 kernel: [77966.339933] spl_system_ta D ffff882077b2d3e0     0  2182      2    0 0x00000000
Aug 21 05:15:13 node6 kernel: [77966.339939]  ffff882078131500 0000000000000046 ffff8820781314c8 0000000000000003
Aug 21 05:15:13 node6 kernel: [77966.339944]  0000000000000001 00000000005ced69 ffff8810c0950650 ffff881000000000
Aug 21 05:15:13 node6 kernel: [77966.339949]  ffff8810c0950640 00000001048ece5c ffff882077b2d9a8 000000000001ea80
Aug 21 05:15:13 node6 kernel: [77966.339953] Call Trace:
Aug 21 05:15:13 node6 kernel: [77966.339979]  [<ffffffffa022b385>] cv_wait_common+0x105/0x1c0 [spl]
Aug 21 05:15:13 node6 kernel: [77966.339999]  [<ffffffffa02f3294>] ? add_reference+0xc4/0x100 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340006]  [<ffffffff8109d280>] ? autoremove_wake_function+0x0/0x40
Aug 21 05:15:13 node6 kernel: [77966.340021]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340029]  [<ffffffffa022b495>] __cv_wait+0x15/0x20 [spl]
Aug 21 05:15:13 node6 kernel: [77966.340048]  [<ffffffffa030e9c5>] traverse_prefetcher+0xa5/0x150 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340067]  [<ffffffffa030ed59>] traverse_visitbp+0x2e9/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340084]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340101]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340115]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340132]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340146]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340162]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340177]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340193]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340208]  [<ffffffffa02f7d72>] ? arc_read+0x672/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340224]  [<ffffffffa030eefd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340249]  [<ffffffffa0395913>] ? zio_buf_alloc+0x23/0x30 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340263]  [<ffffffffa02f73a7>] ? arc_get_data_buf+0x307/0x480 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340280]  [<ffffffffa030f6b4>] traverse_dnode+0x74/0x100 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340296]  [<ffffffffa030f141>] traverse_visitbp+0x6d1/0x830 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340301]  [<ffffffff81065b9f>] ? finish_task_switch+0x4f/0xe0
Aug 21 05:15:13 node6 kernel: [77966.340318]  [<ffffffffa030f7c3>] traverse_prefetch_thread+0x83/0xc0 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340334]  [<ffffffffa030e920>] ? traverse_prefetcher+0x0/0x150 [zfs]
Aug 21 05:15:13 node6 kernel: [77966.340341]  [<ffffffffa02245d8>] taskq_thread+0x218/0x4b0 [spl]
Aug 21 05:15:13 node6 kernel: [77966.340346]  [<ffffffff815164d0>] ? thread_return+0xbe/0x88e
Aug 21 05:15:13 node6 kernel: [77966.340351]  [<ffffffff8105e510>] ? default_wake_function+0x0/0x20
Aug 21 05:15:13 node6 kernel: [77966.340358]  [<ffffffffa02243c0>] ? taskq_thread+0x0/0x4b0 [spl]
Aug 21 05:15:13 node6 kernel: [77966.340361]  [<ffffffff8109cca6>] kthread+0x96/0xa0
Aug 21 05:15:13 node6 kernel: [77966.340366]  [<ffffffff8100c1aa>] child_rip+0xa/0x20
Aug 21 05:15:13 node6 kernel: [77966.340369]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 21 05:15:13 node6 kernel: [77966.340372]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Aug 21 07:31:07 node6 kernel: [86116.165778] Fatal resource shortage: privvmpages, UB 830.
Aug 21 07:31:07 node6 kernel: [86116.165796] Fatal resource shortage: privvmpages, UB 830.
Aug 21 07:31:56 node6 kernel: [86165.168982] Fatal resource shortage: privvmpages, UB 830.
Aug 21 07:31:56 node6 kernel: [86165.168996] Fatal resource shortage: privvmpages, UB 830.
Aug 21 07:48:54 node6 kernel: [87182.590017] Fatal resource shortage: privvmpages, UB 830.
Aug 21 07:48:54 node6 kernel: [87182.590031] Fatal resource shortage: privvmpages, UB 830.
Aug 21 08:04:17 node6 kernel: [88104.989360] Fatal resource shortage: privvmpages, UB 830.
Aug 21 08:04:17 node6 kernel: [88104.989372] Fatal resource shortage: privvmpages, UB 830.
Aug 21 08:04:17 node6 kernel: [88104.989395] Fatal resource shortage: privvmpages, UB 830.
Aug 21 08:04:17 node6 kernel: [88104.989398] Fatal resource shortage: privvmpages, UB 830.
Aug 21 09:34:33 node6 kernel: [93518.471525] ICMPv6 checksum failed [2a01:8c00:ff00:001d:0000:0000:0000:0001 > 2a01:0430:0017:0001:0000:0000:ffff:0090]
[root@node6.prg.vpsfree.cz]
 ~ # zpool events -v
TIME                           CLASS
[root@node6.prg.vpsfree.cz]
 ~ # 

@casualfish
Contributor

So the oopses seem more or less random without revealing the root cause, and this time the newly added IO deadman facility was not triggered (zpool events -v showed no output). @ryao @behlendorf, is there any other way to catch the zio that never finished? I noticed the oops logs are all from server node6 - does this bug also happen on other servers? (If not, perhaps there are some faulting disks on node6.)

@snajpa
Contributor Author

snajpa commented Aug 21, 2013

We don't have ZFS on any other nodes yet; this first one is sort of a proof of concept for OpenVZ+ZFS. I don't think there's a hardware issue - the mdraid+ext4+linux combo was working pretty well there (for a few weeks, before the box got reinstalled with ZFS).

@casualfish
Contributor

For production use I'd recommend the current stable version, 0.6.1. Can you try that version and see if this bug still happens? If so, we're facing a long-standing issue; otherwise it's a regression and can be analyzed with git bisect.
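
For reference, a rough sketch of what that bisect would look like (assuming 0.6.1 really is good and that the zfs-0.6.1 release tag is used as the known-good point):

git bisect start
git bisect bad HEAD          # the current git build that hangs
git bisect good zfs-0.6.1    # assumed last known good release tag
# rebuild/reinstall the modules, re-run the zfs send workload, then mark the result:
git bisect good              # or 'git bisect bad'; repeat until the offending commit is found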

@snajpa
Contributor Author

snajpa commented Aug 21, 2013

Well, that's where I started - this problem was the reason why I decided to build and install the latest git version in the first place.

@behlendorf
Contributor

The system is certainly blocked waiting for something to complete. That could be a slow IO or a whole mess of smaller IOs; there's not quite enough in the stack to say for certain.

@snajpa Are there any non-standard zfs module options being used on this system? Have you enabled deduplication? Is there anything else different about it?

Assuming there isn't, we still need to know what it's blocked on. To do that we need the full output of sysrq-t from the hung system. That will log all of the active processes and their stacks to the console, which should allow us to see which threads are waiting on each other.
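
For reference, a minimal way to capture that, assuming magic SysRq is available in this kernel and syslog is collecting kernel messages:

echo 1 > /proc/sys/kernel/sysrq     # allow all SysRq functions
echo t > /proc/sysrq-trigger        # dump every task's stack into the kernel log
dmesg > /tmp/sysrq-t.txt            # or grab the traces from /var/log/messages

If the traces come out truncated, the kernel ring buffer may simply be too small to hold all of them; booting with a larger log_buf_len (e.g. log_buf_len=8M) should help.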

@snajpa
Contributor Author

snajpa commented Aug 23, 2013

@behlendorf I'm not passing zfs any parameters, no dedup, only compression=on. There's a dataset for each OpenVZ container, each of them has refquota set; otherwise nothing too fancy. It's backed up by snap -> send | recv to an illumos-powered backup server.
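
Roughly, each backup run boils down to something like this (a simplified sketch - the ssh transport and the receive-side dataset name "backup/1062" on "backup-host" are just illustrative, not the exact script):

zfs snapshot vz/private/1062@backup-2013-08-08T01:00:09
zfs send -I vz/private/1062@backup-2013-08-07T01:00:16 vz/private/1062@backup-2013-08-08T01:00:09 \
    | ssh backup-host zfs recv -F backup/1062       # receive side is a placeholder
zfs destroy vz/private/1062@backup-2013-08-07T01:00:16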

sysrq-t output doesn't include these processes in D state - btw I have 3 of them for the same dataset today on the machine:

[root@node6.prg.vpsfree.cz]
 ~ # ps ax | grep -P "\s+D"
   2182 ?        D<     0:00 [spl_system_task]
   2184 ?        D<     0:01 [spl_system_task]
   2193 ?        D<     0:03 [spl_system_task]
  95220 ?        Ds     0:06 zfs send -I vz/private/2067@backup-2013-08-17T01:01:14 vz/private/2067@backup-2013-08-23T01:00:55
 105194 ?        Ds     0:37 zfs send -I vz/private/2067@backup-2013-08-17T01:01:14 vz/private/2067@backup-2013-08-22T01:00:47
1036449 ?        Ds     1:13 zfs send -I vz/private/2067@backup-2013-08-17T01:01:14 vz/private/2067@backup-2013-08-21T01:00:43

@snajpa
Contributor Author

snajpa commented Aug 25, 2013

@behlendorf ZFS was stuck for about 15 mins before 7am today. There's nothing in the system log that says anything about it, except for that trace which is filling the syslog all the time (below).
[munin graph: cpu-day]

https://prasiatko.vpsfree.cz/munin/prg.vpsfree.cz/node6.prg.vpsfree.cz/

Example from one disk in the pool of how the IO dropped (and it's the same for all of the disks in the pool):
[munin graph: sdc-day]

Zpool config:

[root@node6.prg.vpsfree.cz]
 ~ # zpool status -v vz
  pool: vz
 state: ONLINE
  scan: scrub repaired 0 in 1h56m with 0 errors on Fri Aug  9 17:57:38 2013
config:

    NAME                                                     STATE     READ WRITE CKSUM
    vz                                                       ONLINE       0     0     0
      mirror-0                                               ONLINE       0     0     0
        pci-0000:03:00.0-sas-0x5000c50056666405-lun-0        ONLINE       0     0     0
        pci-0000:03:00.0-sas-0x5000c500566663c9-lun-0        ONLINE       0     0     0
      mirror-1                                               ONLINE       0     0     0
        pci-0000:03:00.0-sas-0x5000c50056666b61-lun-0        ONLINE       0     0     0
        pci-0000:03:00.0-sas-0x5000c500566669c9-lun-0        ONLINE       0     0     0
      mirror-2                                               ONLINE       0     0     0
        pci-0000:03:00.0-sas-0x5000c50056666729-lun-0        ONLINE       0     0     0
        pci-0000:03:00.0-sas-0x5000c500566667cd-lun-0        ONLINE       0     0     0
      mirror-3                                               ONLINE       0     0     0
        pci-0000:03:00.0-sas-0x5000c5005661b201-lun-0        ONLINE       0     0     0
        pci-0000:03:00.0-sas-0x5000c50056665cd9-lun-0        ONLINE       0     0     0
      mirror-4                                               ONLINE       0     0     0
        pci-0000:03:00.0-sas-0x5000c50056664edd-lun-0        ONLINE       0     0     0
        pci-0000:03:00.0-sas-0x5000c50056664811-lun-0        ONLINE       0     0     0
    logs
      mirror-5                                               ONLINE       0     0     0
        pci-0000:03:00.0-sas-0x5003048001e2156c-lun-0-part3  ONLINE       0     0     0
        pci-0000:03:00.0-sas-0x5003048001e2156d-lun-0-part3  ONLINE       0     0     0
    cache
      pci-0000:03:00.0-sas-0x5003048001e2156d-lun-0-part5    ONLINE       0     0     0
      pci-0000:03:00.0-sas-0x5003048001e2156c-lun-0-part5    ONLINE       0     0     0

errors: No known data errors

And these are the traces in the syslog I was talking about, the ones that keep filling it:

Aug 25 12:16:11 node6 kernel: [448652.884233] SPL: Showing stack for process 2601
Aug 25 12:16:11 node6 kernel: [448652.884238] Pid: 2601, comm: txg_sync veid: 0 Tainted: P           ---------------    2.6.32-042stab079.5 #1
Aug 25 12:16:11 node6 kernel: [448652.884241] Call Trace:
Aug 25 12:16:11 node6 kernel: [448652.884271]  [<ffffffffa021a5a7>] ? spl_debug_dumpstack+0x27/0x40 [spl]
Aug 25 12:16:11 node6 kernel: [448652.884281]  [<ffffffffa0222327>] ? kmem_alloc_debug+0x437/0x4c0 [spl]
Aug 25 12:16:11 node6 kernel: [448652.884291]  [<ffffffffa0224ac3>] ? task_alloc+0x1d3/0x380 [spl]
Aug 25 12:16:11 node6 kernel: [448652.884299]  [<ffffffffa0221f7f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
Aug 25 12:16:11 node6 kernel: [448652.884338]  [<ffffffffa034d300>] ? spa_deadman+0x0/0x120 [zfs]
Aug 25 12:16:11 node6 kernel: [448652.884347]  [<ffffffffa02252db>] ? taskq_dispatch_delay+0x19b/0x2a0 [spl]
Aug 25 12:16:11 node6 kernel: [448652.884355]  [<ffffffffa02255c2>] ? taskq_cancel_id+0x102/0x1e0 [spl]
Aug 25 12:16:11 node6 kernel: [448652.884381]  [<ffffffffa033d1aa>] ? spa_sync+0x1fa/0xa80 [zfs]
Aug 25 12:16:11 node6 kernel: [448652.884387]  [<ffffffff81012c99>] ? read_tsc+0x9/0x20
Aug 25 12:16:11 node6 kernel: [448652.884393]  [<ffffffff810a89a1>] ? ktime_get_ts+0xb1/0xf0
Aug 25 12:16:11 node6 kernel: [448652.884419]  [<ffffffffa0350317>] ? txg_sync_thread+0x307/0x590 [zfs]
Aug 25 12:16:11 node6 kernel: [448652.884437]  [<ffffffff81057a69>] ? set_user_nice+0xc9/0x130
Aug 25 12:16:11 node6 kernel: [448652.884470]  [<ffffffffa0350010>] ? txg_sync_thread+0x0/0x590 [zfs]
Aug 25 12:16:11 node6 kernel: [448652.884481]  [<ffffffffa0223428>] ? thread_generic_wrapper+0x68/0x80 [spl]
Aug 25 12:16:11 node6 kernel: [448652.884491]  [<ffffffffa02233c0>] ? thread_generic_wrapper+0x0/0x80 [spl]
Aug 25 12:16:11 node6 kernel: [448652.884498]  [<ffffffff8109cca6>] ? kthread+0x96/0xa0
Aug 25 12:16:11 node6 kernel: [448652.884503]  [<ffffffff8100c1aa>] ? child_rip+0xa/0x20
Aug 25 12:16:11 node6 kernel: [448652.884508]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 25 12:16:11 node6 kernel: [448652.884513]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Aug 25 12:16:16 node6 kernel: [448657.882038] SPL: Showing stack for process 2601
Aug 25 12:16:16 node6 kernel: [448657.882045] Pid: 2601, comm: txg_sync veid: 0 Tainted: P           ---------------    2.6.32-042stab079.5 #1
Aug 25 12:16:16 node6 kernel: [448657.882049] Call Trace:
Aug 25 12:16:16 node6 kernel: [448657.882084]  [<ffffffffa021a5a7>] ? spl_debug_dumpstack+0x27/0x40 [spl]
Aug 25 12:16:16 node6 kernel: [448657.882097]  [<ffffffffa0222327>] ? kmem_alloc_debug+0x437/0x4c0 [spl]
Aug 25 12:16:16 node6 kernel: [448657.882112]  [<ffffffffa0224ac3>] ? task_alloc+0x1d3/0x380 [spl]
Aug 25 12:16:16 node6 kernel: [448657.882123]  [<ffffffffa0221f7f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
Aug 25 12:16:16 node6 kernel: [448657.882178]  [<ffffffffa034d300>] ? spa_deadman+0x0/0x120 [zfs]
Aug 25 12:16:16 node6 kernel: [448657.882192]  [<ffffffffa02252db>] ? taskq_dispatch_delay+0x19b/0x2a0 [spl]
Aug 25 12:16:16 node6 kernel: [448657.882204]  [<ffffffffa02255c2>] ? taskq_cancel_id+0x102/0x1e0 [spl]
Aug 25 12:16:16 node6 kernel: [448657.882239]  [<ffffffffa033d1aa>] ? spa_sync+0x1fa/0xa80 [zfs]
Aug 25 12:16:16 node6 kernel: [448657.882248]  [<ffffffff81012c99>] ? read_tsc+0x9/0x20
Aug 25 12:16:16 node6 kernel: [448657.882255]  [<ffffffff810a89a1>] ? ktime_get_ts+0xb1/0xf0
Aug 25 12:16:16 node6 kernel: [448657.882292]  [<ffffffffa0350317>] ? txg_sync_thread+0x307/0x590 [zfs]
Aug 25 12:16:16 node6 kernel: [448657.882301]  [<ffffffff81057a69>] ? set_user_nice+0xc9/0x130
Aug 25 12:16:16 node6 kernel: [448657.882335]  [<ffffffffa0350010>] ? txg_sync_thread+0x0/0x590 [zfs]
Aug 25 12:16:16 node6 kernel: [448657.882348]  [<ffffffffa0223428>] ? thread_generic_wrapper+0x68/0x80 [spl]
Aug 25 12:16:16 node6 kernel: [448657.882359]  [<ffffffffa02233c0>] ? thread_generic_wrapper+0x0/0x80 [spl]
Aug 25 12:16:16 node6 kernel: [448657.882366]  [<ffffffff8109cca6>] ? kthread+0x96/0xa0
Aug 25 12:16:16 node6 kernel: [448657.882371]  [<ffffffff8100c1aa>] ? child_rip+0xa/0x20
Aug 25 12:16:16 node6 kernel: [448657.882377]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 25 12:16:16 node6 kernel: [448657.882381]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20

@snajpa
Contributor Author

snajpa commented Aug 25, 2013

The system was doing the same thing again just now (actually not doing anything - no IO); this time I sysrq-t'd all the stack traces to the syslog. Let me know if they might be of any help and I'll deliver them to you - it's not exactly a small dump :)

Example of a process suffering from not being able to do IO:

Aug 25 22:14:01 node6 kernel: [484506.364408] php           D ffff881f9b2c0540     0 805650 805608  556 0x00000000
Aug 25 22:14:01 node6 kernel: [484506.364413]  ffff881326193b48 0000000000000082 0000000000000001 0000000000000286
Aug 25 22:14:01 node6 kernel: [484506.364419]  0000000000000001 0000000000000286 ffff881326193af8 ffffffff81055bbe
Aug 25 22:14:01 node6 kernel: [484506.364425]  ffff88153e49ece0 ffff88162593f780 ffff881f9b2c0b08 000000000001ea80
Aug 25 22:14:01 node6 kernel: [484506.364431] Call Trace:
Aug 25 22:14:01 node6 kernel: [484506.364437]  [<ffffffff81055bbe>] ? __wake_up_sync_key+0x5e/0x80
Aug 25 22:14:01 node6 kernel: [484506.364442]  [<ffffffff8109d48e>] ? prepare_to_wait_exclusive+0x4e/0x80
Aug 25 22:14:01 node6 kernel: [484506.364455]  [<ffffffffa022b385>] cv_wait_common+0x105/0x1c0 [spl]
Aug 25 22:14:01 node6 kernel: [484506.364460]  [<ffffffff8109d280>] ? autoremove_wake_function+0x0/0x40
Aug 25 22:14:01 node6 kernel: [484506.364467]  [<ffffffffa0038190>] ? avl_find+0x60/0xb0 [zavl]
Aug 25 22:14:01 node6 kernel: [484506.364477]  [<ffffffffa022b495>] __cv_wait+0x15/0x20 [spl]
Aug 25 22:14:01 node6 kernel: [484506.364509]  [<ffffffffa037e1ec>] zfs_range_lock+0x2ac/0x5c0 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.364515]  [<ffffffff81517fce>] ? mutex_lock+0x1e/0x50
Aug 25 22:14:01 node6 kernel: [484506.364546]  [<ffffffffa0389272>] zfs_write+0x252/0xc90 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.364552]  [<ffffffff811bfc99>] ? dput_nocache+0xc9/0x250
Aug 25 22:14:01 node6 kernel: [484506.364556]  [<ffffffff81517fce>] ? mutex_lock+0x1e/0x50
Aug 25 22:14:01 node6 kernel: [484506.364588]  [<ffffffffa0382c55>] ? zfs_open+0xa5/0x140 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.364619]  [<ffffffffa039ca71>] ? zpl_open+0x71/0x90 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.364624]  [<ffffffff811b3ef1>] ? path_put+0x31/0x40
Aug 25 22:14:01 node6 kernel: [484506.364636]  [<ffffffffa022c82f>] ? tsd_exit+0x5f/0x2b0 [spl]
Aug 25 22:14:01 node6 kernel: [484506.364666]  [<ffffffffa039cd82>] zpl_write_common+0x52/0x70 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.364695]  [<ffffffffa039ce08>] zpl_write+0x68/0xa0 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.364701]  [<ffffffff811a50c8>] vfs_write+0xb8/0x1a0
Aug 25 22:14:01 node6 kernel: [484506.364705]  [<ffffffff811a59c1>] sys_write+0x51/0x90
Aug 25 22:14:01 node6 kernel: [484506.364711]  [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b

@casualfish
Contributor

The stack trace makes sense. It looks like the php process is trying to write to a file and is waiting for the range lock to become available, but why the range lock is unavailable remains unknown. Would you please CC me when you compress and send the dumped logs? They may help to locate the root cause.

@snajpa
Contributor Author

snajpa commented Aug 26, 2013

@casualfish @behlendorf

Ok, there shouldn't be anything sensitive in there, so here is the log:

https://vpsfree.cz/download/messages-snajpa-2013-08-25.gz

Btw, I upgraded to 0.6.2 yesterday, and today while having my morning tea I discovered two freshly hung zfs send commands; one of them is hanging on the same dataset as before the reboot. I've got no idea if that's at all relevant.

The good news is that the traces in the syslog are finally somewhat more telling - they're about the kernel thread now, not the userspace process anymore:

Aug 26 09:21:07 node6 kernel: [37904.820219] INFO: task spl_system_task:351594 blocked for more than 120 seconds.
Aug 26 09:21:07 node6 kernel: [37904.820243] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 26 09:21:07 node6 kernel: [37904.820326] spl_system_ta D ffff8810761e4900     0 351594      2    0 0x00000080
Aug 26 09:21:07 node6 kernel: [37904.820333]  ffff88104f49b500 0000000000000046 0000000000000000 ffff8811a6180040
Aug 26 09:21:07 node6 kernel: [37904.820343]  ffffffffa09df880 0000000000357d6a ffff8810c0a10110 ffff881000000000
Aug 26 09:21:07 node6 kernel: [37904.820350]  0000000000000000 0000000102344deb ffff8810761e4ec8 000000000001ea80
Aug 26 09:21:07 node6 kernel: [37904.820358] Call Trace:
Aug 26 09:21:07 node6 kernel: [37904.820382]  [<ffffffffa08603c5>] cv_wait_common+0x105/0x1c0 [spl]
Aug 26 09:21:07 node6 kernel: [37904.820414]  [<ffffffffa092d294>] ? add_reference+0xc4/0x100 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820425]  [<ffffffff8109d280>] ? autoremove_wake_function+0x0/0x40
Aug 26 09:21:07 node6 kernel: [37904.820447]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820456]  [<ffffffffa08604d5>] __cv_wait+0x15/0x20 [spl]
Aug 26 09:21:07 node6 kernel: [37904.820483]  [<ffffffffa0948da5>] traverse_prefetcher+0xa5/0x150 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820506]  [<ffffffffa0949139>] traverse_visitbp+0x2e9/0x830 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820528]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820551]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820570]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820593]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820618]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820641]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820660]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820682]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820702]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820724]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820763]  [<ffffffffa09cfd13>] ? zio_buf_alloc+0x23/0x30 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820783]  [<ffffffffa0931557>] ? arc_get_data_buf+0x307/0x480 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820806]  [<ffffffffa0949a94>] traverse_dnode+0x74/0x100 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820828]  [<ffffffffa0949521>] traverse_visitbp+0x6d1/0x830 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820837]  [<ffffffff81009770>] ? __switch_to+0xd0/0x320
Aug 26 09:21:07 node6 kernel: [37904.820859]  [<ffffffffa0949ba3>] traverse_prefetch_thread+0x83/0xc0 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820881]  [<ffffffffa0948d00>] ? traverse_prefetcher+0x0/0x150 [zfs]
Aug 26 09:21:07 node6 kernel: [37904.820892]  [<ffffffffa0859618>] taskq_thread+0x218/0x4b0 [spl]
Aug 26 09:21:07 node6 kernel: [37904.820901]  [<ffffffff815164d0>] ? thread_return+0xbe/0x88e
Aug 26 09:21:07 node6 kernel: [37904.820908]  [<ffffffff8105e510>] ? default_wake_function+0x0/0x20
Aug 26 09:21:07 node6 kernel: [37904.820917]  [<ffffffffa0859400>] ? taskq_thread+0x0/0x4b0 [spl]
Aug 26 09:21:07 node6 kernel: [37904.820923]  [<ffffffff8109cca6>] kthread+0x96/0xa0
Aug 26 09:21:07 node6 kernel: [37904.820929]  [<ffffffff8100c1aa>] child_rip+0xa/0x20
Aug 26 09:21:07 node6 kernel: [37904.820935]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 26 09:21:07 node6 kernel: [37904.820939]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Aug 26 09:23:07 node6 kernel: [38024.762559] INFO: task spl_system_task:351594 blocked for more than 120 seconds.
Aug 26 09:23:07 node6 kernel: [38024.762699] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 26 09:23:07 node6 kernel: [38024.762833] spl_system_ta D ffff8810761e4900     0 351594      2    0 0x00000080
Aug 26 09:23:07 node6 kernel: [38024.762839]  ffff88104f49b500 0000000000000046 0000000000000000 ffff8811a6180040
Aug 26 09:23:07 node6 kernel: [38024.762845]  ffffffffa09df880 0000000000357d6a ffff8810c0a10110 ffff881000000000
Aug 26 09:23:07 node6 kernel: [38024.762850]  0000000000000000 0000000102344deb ffff8810761e4ec8 000000000001ea80
Aug 26 09:23:07 node6 kernel: [38024.762855] Call Trace:
Aug 26 09:23:07 node6 kernel: [38024.762874]  [<ffffffffa08603c5>] cv_wait_common+0x105/0x1c0 [spl]
Aug 26 09:23:07 node6 kernel: [38024.762900]  [<ffffffffa092d294>] ? add_reference+0xc4/0x100 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.762907]  [<ffffffff8109d280>] ? autoremove_wake_function+0x0/0x40
Aug 26 09:23:07 node6 kernel: [38024.762923]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.762930]  [<ffffffffa08604d5>] __cv_wait+0x15/0x20 [spl]
Aug 26 09:23:07 node6 kernel: [38024.762950]  [<ffffffffa0948da5>] traverse_prefetcher+0xa5/0x150 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.762968]  [<ffffffffa0949139>] traverse_visitbp+0x2e9/0x830 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.762987]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763004]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763019]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763036]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763051]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763068]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763083]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763101]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763115]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763132]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763163]  [<ffffffffa09cfd13>] ? zio_buf_alloc+0x23/0x30 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763178]  [<ffffffffa0931557>] ? arc_get_data_buf+0x307/0x480 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763195]  [<ffffffffa0949a94>] traverse_dnode+0x74/0x100 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763212]  [<ffffffffa0949521>] traverse_visitbp+0x6d1/0x830 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763218]  [<ffffffff81009770>] ? __switch_to+0xd0/0x320
Aug 26 09:23:07 node6 kernel: [38024.763237]  [<ffffffffa0949ba3>] traverse_prefetch_thread+0x83/0xc0 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763258]  [<ffffffffa0948d00>] ? traverse_prefetcher+0x0/0x150 [zfs]
Aug 26 09:23:07 node6 kernel: [38024.763266]  [<ffffffffa0859618>] taskq_thread+0x218/0x4b0 [spl]
Aug 26 09:23:07 node6 kernel: [38024.763273]  [<ffffffff815164d0>] ? thread_return+0xbe/0x88e
Aug 26 09:23:07 node6 kernel: [38024.763278]  [<ffffffff8105e510>] ? default_wake_function+0x0/0x20
Aug 26 09:23:07 node6 kernel: [38024.763285]  [<ffffffffa0859400>] ? taskq_thread+0x0/0x4b0 [spl]
Aug 26 09:23:07 node6 kernel: [38024.763289]  [<ffffffff8109cca6>] kthread+0x96/0xa0
Aug 26 09:23:07 node6 kernel: [38024.763293]  [<ffffffff8100c1aa>] child_rip+0xa/0x20
Aug 26 09:23:07 node6 kernel: [38024.763296]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 26 09:23:07 node6 kernel: [38024.763299]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Aug 26 09:25:07 node6 kernel: [38144.705099] INFO: task spl_system_task:351594 blocked for more than 120 seconds.
Aug 26 09:25:07 node6 kernel: [38144.705241] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 26 09:25:07 node6 kernel: [38144.705376] spl_system_ta D ffff8810761e4900     0 351594      2    0 0x00000080
Aug 26 09:25:07 node6 kernel: [38144.705383]  ffff88104f49b500 0000000000000046 0000000000000000 ffff8811a6180040
Aug 26 09:25:07 node6 kernel: [38144.705390]  ffffffffa09df880 0000000000357d6a ffff8810c0a10110 ffff881000000000
Aug 26 09:25:07 node6 kernel: [38144.705395]  0000000000000000 0000000102344deb ffff8810761e4ec8 000000000001ea80
Aug 26 09:25:07 node6 kernel: [38144.705400] Call Trace:
Aug 26 09:25:07 node6 kernel: [38144.705420]  [<ffffffffa08603c5>] cv_wait_common+0x105/0x1c0 [spl]
Aug 26 09:25:07 node6 kernel: [38144.705447]  [<ffffffffa092d294>] ? add_reference+0xc4/0x100 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705455]  [<ffffffff8109d280>] ? autoremove_wake_function+0x0/0x40
Aug 26 09:25:07 node6 kernel: [38144.705472]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705479]  [<ffffffffa08604d5>] __cv_wait+0x15/0x20 [spl]
Aug 26 09:25:07 node6 kernel: [38144.705499]  [<ffffffffa0948da5>] traverse_prefetcher+0xa5/0x150 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705518]  [<ffffffffa0949139>] traverse_visitbp+0x2e9/0x830 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705536]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705553]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705568]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705586]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705601]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705619]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705635]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705651]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705666]  [<ffffffffa0932609>] ? arc_read+0x6b9/0x8b0 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705683]  [<ffffffffa09492dd>] traverse_visitbp+0x48d/0x830 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705711]  [<ffffffffa09cfd13>] ? zio_buf_alloc+0x23/0x30 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705725]  [<ffffffffa0931557>] ? arc_get_data_buf+0x307/0x480 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705743]  [<ffffffffa0949a94>] traverse_dnode+0x74/0x100 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705760]  [<ffffffffa0949521>] traverse_visitbp+0x6d1/0x830 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705765]  [<ffffffff81009770>] ? __switch_to+0xd0/0x320
Aug 26 09:25:07 node6 kernel: [38144.705783]  [<ffffffffa0949ba3>] traverse_prefetch_thread+0x83/0xc0 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705799]  [<ffffffffa0948d00>] ? traverse_prefetcher+0x0/0x150 [zfs]
Aug 26 09:25:07 node6 kernel: [38144.705806]  [<ffffffffa0859618>] taskq_thread+0x218/0x4b0 [spl]
Aug 26 09:25:07 node6 kernel: [38144.705813]  [<ffffffff815164d0>] ? thread_return+0xbe/0x88e
Aug 26 09:25:07 node6 kernel: [38144.705819]  [<ffffffff8105e510>] ? default_wake_function+0x0/0x20
Aug 26 09:25:07 node6 kernel: [38144.705826]  [<ffffffffa0859400>] ? taskq_thread+0x0/0x4b0 [spl]
Aug 26 09:25:07 node6 kernel: [38144.705830]  [<ffffffff8109cca6>] kthread+0x96/0xa0
Aug 26 09:25:07 node6 kernel: [38144.705834]  [<ffffffff8100c1aa>] child_rip+0xa/0x20
Aug 26 09:25:07 node6 kernel: [38144.705838]  [<ffffffff8109cc10>] ? kthread+0x0/0xa0
Aug 26 09:25:07 node6 kernel: [38144.705841]  [<ffffffff8100c1a0>] ? child_rip+0x0/0x20

@casualfish
Copy link
Contributor

@snajpa Unfortunately I have no permission to access that file :(

@snajpa
Copy link
Contributor Author

snajpa commented Aug 26, 2013

Sorry, my bad. Try now, it should work (wrong file permissions).

@casualfish
Copy link
Contributor

OK, I've got it. Does 'zpool events -v' show any output this time?

@snajpa
Copy link
Contributor Author

snajpa commented Aug 26, 2013

[root@node6.prg.vpsfree.cz]
 ~ # zpool events -v
TIME                           CLASS
Aug 25 2013 23:22:40.351065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x302046e7687fc74
        vdev_state = 0x7
        time = 0x521a75a0 0x14ecd730 

Aug 25 2013 23:22:40.351065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0xa59779fa37f0ccab
        vdev_state = 0x7
        time = 0x521a75a0 0x14ecd730 

Aug 25 2013 23:22:40.351065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x1b8abaf8540e900b
        vdev_state = 0x7
        time = 0x521a75a0 0x14ecd730 

Aug 25 2013 23:22:40.351065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x84f9f41b2f7f9aea
        vdev_state = 0x7
        time = 0x521a75a0 0x14ecd730 

Aug 25 2013 23:22:40.351065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x9ad8bd73f010213a
        vdev_state = 0x7
        time = 0x521a75a0 0x14ecd730 

Aug 25 2013 23:22:40.351065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0xe452f5314ecbefc2
        vdev_state = 0x7
        time = 0x521a75a0 0x14ecd730 

Aug 25 2013 23:22:40.351065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0xd8d8b52c4d00774d
        vdev_state = 0x7
        time = 0x521a75a0 0x14ecd730 

Aug 25 2013 23:22:40.351065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x74b213ea825e2e8c
        vdev_state = 0x7
        time = 0x521a75a0 0x14ecd730 

Aug 25 2013 23:22:40.351065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x17f173668c6109b0
        vdev_state = 0x7
        time = 0x521a75a0 0x14ecd730 

Aug 25 2013 23:22:40.352065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x32d0efece57ba228
        vdev_state = 0x7
        time = 0x521a75a0 0x14fc1970 

Aug 25 2013 23:22:40.352065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x3a612094a459f07c
        vdev_state = 0x7
        time = 0x521a75a0 0x14fc1970 

Aug 25 2013 23:22:40.352065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0xeb6ce56820d8ad65
        vdev_state = 0x7
        time = 0x521a75a0 0x14fc1970 

Aug 25 2013 23:22:41.171065905 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x32d0efece57ba228
        vdev_state = 0x7
        time = 0x521a75a1 0xa324231 

Aug 25 2013 23:22:41.171065905 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0xeb6ce56820d8ad65
        vdev_state = 0x7
        time = 0x521a75a1 0xa324231 

Aug 25 2013 23:22:41.171065905 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x3a612094a459f07c
        vdev_state = 0x7
        time = 0x521a75a1 0xa324231 

Aug 25 2013 23:22:41.171065905 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0xd8d8b52c4d00774d
        vdev_state = 0x7
        time = 0x521a75a1 0xa324231 

Aug 25 2013 23:22:41.171065905 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x302046e7687fc74
        vdev_state = 0x7
        time = 0x521a75a1 0xa324231 

Aug 25 2013 23:22:41.171065905 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x1b8abaf8540e900b
        vdev_state = 0x7
        time = 0x521a75a1 0xa324231 

Aug 25 2013 23:22:41.171065905 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x9ad8bd73f010213a
        vdev_state = 0x7
        time = 0x521a75a1 0xa324231 

Aug 25 2013 23:22:41.171065905 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x74b213ea825e2e8c
        vdev_state = 0x7
        time = 0x521a75a1 0xa324231 

Aug 25 2013 23:22:41.171065905 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0xa59779fa37f0ccab
        vdev_state = 0x7
        time = 0x521a75a1 0xa324231 

Aug 25 2013 23:22:41.171065905 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x84f9f41b2f7f9aea
        vdev_state = 0x7
        time = 0x521a75a1 0xa324231 

Aug 25 2013 23:22:41.171065905 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0xe452f5314ecbefc2
        vdev_state = 0x7
        time = 0x521a75a1 0xa324231 

Aug 25 2013 23:22:41.171065905 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x17f173668c6109b0
        vdev_state = 0x7
        time = 0x521a75a1 0xa324231 

Aug 25 2013 23:22:41.772065900 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x3a634b20fba18f72
        vdev_state = 0x7
        time = 0x521a75a1 0x2e04ca6c 

Aug 25 2013 23:22:41.777065904 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0xfcf6802aa91ab67
        vdev_guid = 0x57632b59596ff64a
        vdev_state = 0x7
        time = 0x521a75a1 0x2e5115b0 

@casualfish
Copy link
Contributor

This stack trace is a little weird: ARC is trying to allocate a zio data buf from the SPL slab caches but got stuck on the allocation (the process is in the D state). Normally this should not happen. Is your system low on free memory when the hang happens? A quick way to check the next time it hangs is sketched after the trace below.

Aug 25 22:14:01 node6 kernel: [484506.360860] php D ffff881f64eb4c60 0 799788 799702 556 0x00000000
Aug 25 22:14:01 node6 kernel: [484506.360865] ffff8811fd44b838 0000000000000086 ffff8811fd44b800 0000000000fa5af9
Aug 25 22:14:01 node6 kernel: [484506.360871] 0000000000000001 0000000000000282 ffff8811fd44b7d8 ffffffff8108486c
Aug 25 22:14:01 node6 kernel: [484506.360878] ffff8811fd44b858 00000000ffffffff ffff881f64eb5228 000000000001ea80
Aug 25 22:14:01 node6 kernel: [484506.360884] Call Trace:
Aug 25 22:14:01 node6 kernel: [484506.360888] [] ? lock_timer_base+0x3c/0x70
Aug 25 22:14:01 node6 kernel: [484506.360894] [] schedule_timeout+0x192/0x2e0
Aug 25 22:14:01 node6 kernel: [484506.360899] [] ? process_timeout+0x0/0x10
Aug 25 22:14:01 node6 kernel: [484506.360911] [] spl_kmem_cache_alloc+0x795/0xfe0 [spl]
Aug 25 22:14:01 node6 kernel: [484506.360916] [] ? autoremove_wake_function+0x0/0x40
Aug 25 22:14:01 node6 kernel: [484506.360949] [] zio_data_buf_alloc+0x23/0x30 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.360971] [] arc_get_data_buf+0xb8/0x480 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.360993] [] arc_buf_alloc+0xb9/0xf0 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.361014] [] dbuf_fix_old_data+0xe5/0x140 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.361036] [] dbuf_dirty+0x5d4/0x7a0 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.361057] [] dmu_buf_will_dirty+0xaf/0x100 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.361081] [] dmu_write_uio_dnode+0x70/0x140 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.361105] [] dmu_write_uio_dbuf+0x46/0x60 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.361135] [] zfs_write+0xc6a/0xc90 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.361140] [] ? dput_nocache+0xc9/0x250
Aug 25 22:14:01 node6 kernel: [484506.361145] [] ? mutex_lock+0x1e/0x50
Aug 25 22:14:01 node6 kernel: [484506.361150] [] ? path_put+0x31/0x40
Aug 25 22:14:01 node6 kernel: [484506.361161] [] ? tsd_exit+0x5f/0x2b0 [spl]
Aug 25 22:14:01 node6 kernel: [484506.361191] [] zpl_write_common+0x52/0x70 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.361222] [] zpl_write+0x68/0xa0 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.361227] [] vfs_write+0xb8/0x1a0
Aug 25 22:14:01 node6 kernel: [484506.361232] [] sys_write+0x51/0x90
Aug 25 22:14:01 node6 kernel: [484506.361237] [] system_call_fastpath+0x16/0x1b
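
As a quick check the next time it hangs, something like this should capture the memory and ARC state (standard /proc paths on a ZoL system):

# overall memory and swap usage at the moment of the hang
free -m
# current ARC size, target and limit from the ZoL kstats
grep -wE 'size|c|c_max' /proc/spl/kstat/zfs/arcstats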

@snajpa
Copy link
Contributor Author

snajpa commented Aug 26, 2013

You can see for yourself here - about 11pm yesterday:
https://prasiatko.vpsfree.cz/munin/prg.vpsfree.cz/node6.prg.vpsfree.cz/index.html#system
https://prasiatko.vpsfree.cz/munin/prg.vpsfree.cz/node6.prg.vpsfree.cz/memory.html

I didn't mention that I'm running with swappiness = 100; do you think that may play a role in this?
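
(For reference, that's the usual vm.swappiness sysctl, e.g.:)

# check the current value
sysctl vm.swappiness
# raise it at runtime; it can also be persisted via /etc/sysctl.conf
sysctl -w vm.swappiness=100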

@casualfish
Copy link
Contributor

Yes, the higher the value, the more swap in/out IO, but it's also triggered by the low-memory situation. According to the collected system stats, the system load suddenly went really high (the result of many processes stuck in the D state) and there was a lot of swap in/out around 0:00 this morning. If there are no other applications consuming that much memory, I think the ARC itself did. Can you limit the total memory the ARC occupies and see if the hang still happens?
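
A sketch of the usual way to cap it, assuming the stock module parameter name (the 32 GiB value is only an example):

# /etc/modprobe.d/zfs.conf -- cap the ARC at 32 GiB
options zfs zfs_arc_max=34359738368

Note that this normally takes effect when the zfs module is next loaded.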

@snajpa
Copy link
Contributor Author

snajpa commented Aug 26, 2013

Is it possible to set the ARC limits without reloading the ZFS module? This machine is still more or less in production, people are getting mad about all of the downtime, and I'd rather... well. I shouldn't have put this in production in the first place... :)

@casualfish
Copy link
Contributor

I suspect the memory thrashing is caused by memory fragmentation in the current SPL slab implementation, which cannot easily be compacted without limiting the total memory usage first. It's also better to limit the ARC's size when there are many kinds of applications on the same server, since the ARC is really memory hungry.

@snajpa
Copy link
Contributor Author

snajpa commented Aug 26, 2013

Ok, I'm trying to figure out what level I should limit the ARC size to. Currently, it looks like this:

 ~ # arcstat.py -f hit%,miss,read,arcsz,l2hit%,l2miss,l2read,l2size 1
hit%  miss  read  arcsz  l2hit%  l2miss  l2read  l2size  
 100     0  3.7K    62G       0       0       0     55G  
  99     1  214K    62G       0       1       1     55G  
  99    18  271K    62G      11      16      18     55G  
 100     0  274K    62G       0       0       0     55G  
  99     5  273K    62G      80       1       5     55G  
  99     2  288K    62G       0       2       2     55G  
  99    10  306K    62G      40       6      10     55G  
  99    18  294K    62G       5      17      18     55G  
  99    18  294K    62G      77       4      18     55G  

So there's 62G in the ARC. The use case for the machine is OpenVZ containers, so there's no way to predict RAM usage on that server. If I limit the ARC to, say, 45G, I'm not sure what will happen when the system actually needs that memory for applications. If conditions require the ARC to shrink to, e.g., 20G, isn't the problem still going to be there? Because the limit will be set to 45G, it still creates about the same situation as I'm in now, doesn't it?

@casualfish
Copy link
Contributor

If you limit the ARC to 45GB, then it will consume 45GB of memory at most (not counting the potential fragmentation). The rest will be reserved for other applications, so they can use it freely. If you limit the ARC's size to a reasonable value that does not cause swap in/out, then first, you are free from the memory-thrashing situation where the system is busy swapping pages in and out without doing any useful work, and second, you are free from the potential deadlock caused by the ARC itself, namely when memory pressure forces the ARC to write back dirty ARC buffers while the writeback itself needs more memory.
This can be verified in your log: there were several processes waiting for the completion of the current transaction group, but it never happened.
Aug 25 22:14:01 node6 kernel: [484506.372340] munin-graph D ffff88000a2710e0 0 810096 799600 956 0x00000000
Aug 25 22:14:01 node6 kernel: [484506.372345] ffff88150cf71a68 0000000000000082 ffff8810797c22b8 ffff88000a2710e0
Aug 25 22:14:01 node6 kernel: [484506.372352] ffff88150cf71a28 ffffffff8100990e ffff880069b1ea80 ffff8810797c22b8
Aug 25 22:14:01 node6 kernel: [484506.372358] 0000000001f71a38 ffff880069cdc600 ffff88000a2716a8 000000000001ea80
Aug 25 22:14:01 node6 kernel: [484506.372364] Call Trace:
Aug 25 22:14:01 node6 kernel: [484506.372368] [] ? __switch_to+0x26e/0x320
Aug 25 22:14:01 node6 kernel: [484506.372373] [] __mutex_lock_slowpath+0x13e/0x180
Aug 25 22:14:01 node6 kernel: [484506.372378] [] mutex_lock+0x2b/0x50
Aug 25 22:14:01 node6 kernel: [484506.372391] [] cv_wait_common+0xb4/0x1c0 [spl]
Aug 25 22:14:01 node6 kernel: [484506.372396] [] ? autoremove_wake_function+0x0/0x40
Aug 25 22:14:01 node6 kernel: [484506.372406] [] __cv_wait+0x15/0x20 [spl]
Aug 25 22:14:01 node6 kernel: [484506.372437] [] txg_wait_open+0x8b/0x110 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.372465] [] dmu_tx_wait+0xed/0xf0 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.372497] [] zfs_create+0x290/0x6f0 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.372529] [] zpl_create+0xa5/0x140 [zfs]
Aug 25 22:14:01 node6 kernel: [484506.372535] [] ? generic_permission+0x23/0xb0
Aug 25 22:14:01 node6 kernel: [484506.372539] [] vfs_create+0x9e/0xc0
Aug 25 22:14:01 node6 kernel: [484506.372543] [] do_filp_open+0xaea/0xcf0
Aug 25 22:14:01 node6 kernel: [484506.372555] [] ? tsd_exit+0x5f/0x2b0 [spl]
Aug 25 22:14:01 node6 kernel: [484506.372560] [] ? strncpy_from_user+0x4a/0x90
Aug 25 22:14:01 node6 kernel: [484506.372565] [] ? alloc_fd+0x92/0x160
Aug 25 22:14:01 node6 kernel: [484506.372569] [] do_sys_open+0x69/0x140
Aug 25 22:14:01 node6 kernel: [484506.372573] [] sys_open+0x20/0x30
Aug 25 22:14:01 node6 kernel: [484506.372578] [] system_call_fastpath+0x16/0x1b

@snajpa
Copy link
Contributor Author

snajpa commented Aug 29, 2013

Ok, thanks. I'll try adding ARC limit.

Here's one thing, though: we've been talking about at least two different issues here, and I'm not sure whether they're related or not.

One is this complete write freeze, which is/was happening from time to time and which should be "solved" by limiting the ARC size, as you say.

But the other one is those hanging zfs send processes, and that doesn't seem to me like it has anything to do with ARC size. And these keep piling up:

[root@node6.prg.vpsfree.cz]
 ~ # ps ax | grep -P "\s+D"
 102806 ?        Ds     0:42 zfs send -I vz/private/2067 backup-2013-08-17T01:01:14 vz/private/2067 backup-2013-08-28T01:00:50
 351594 ?        D<     0:00 [spl_system_task]
 351597 ?        D<     0:02 [spl_system_task]
 351598 ?        D<     0:02 [spl_system_task]
 351599 ?        D<     0:00 [spl_system_task]
 351932 ?        D<    87:45 [txg_sync]
 472655 ?        Ds     1:56 zfs send -I vz/private/2067 backup-2013-08-17T01:01:14 vz/private/2067 backup-2013-08-26T09:09:28
 983564 ?        Ds     0:06 zfs send -I vz/private/2067 backup-2013-08-17T01:01:14 vz/private/2067 backup-2013-08-29T01:00:48
 993478 ?        Ds     1:21 zfs send -I vz/private/2067 backup-2013-08-17T01:01:14 vz/private/2067 backup-2013-08-27T01:00:47

@behlendorf
Copy link
Contributor

@snajpa The two may be related; if memory allocations end up blocking due to low memory, there can be a range of symptoms. @casualfish looks to be on the right trail here; limiting your ARC may help as a medium-term workaround. And to answer your previous question, the ARC limits can now be tuned dynamically, since the module parameters have been made writable. Just echo in the new values and they'll take effect.

$ ls -l /sys/module/zfs/parameters/zfs_arc_max 
-rw-r--r-- 1 root root 4096 Sep  3 20:07 /sys/module/zfs/parameters/zfs_arc_max
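
For example, capping the ARC at 32 GiB at runtime (the value is only an example) looks roughly like this:

# set the new limit without reloading the module
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max
# confirm it took effect
cat /sys/module/zfs/parameters/zfs_arc_max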

@snajpa
Copy link
Contributor Author

snajpa commented Sep 5, 2013

So I've set zfs_arc_max to 32G, but that hasn't helped. There is one zfs send hanging again :(

[root@node6.prg.vpsfree.cz]
 ~ # cat /sys/module/zfs/parameters/zfs_arc_max
34359738368
[root@node6.prg.vpsfree.cz]
 ~ # ps ax | grep -P "\s+D"
   2182 ?        D<     0:00 [spl_system_task]
   2585 ?        D<    48:03 [txg_sync]
 360422 ?        Ds     0:35 zfs send -I vz/private/2067 backup-2013-08-17T01:01:14 vz/private/2067 backup-2013-09-05T01:00:44

@snajpa
Copy link
Contributor Author

snajpa commented Sep 14, 2013

Ok, so here's an update on the current situation:

I've updated to current HEAD, but not much has changed. It froze after one night of backups.

So I've switched backups from send/recv back to plain old rsync (= effectively stopped any snapshot manipulation on the zpool) and it seems to be running stably and OK.
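
(Roughly speaking, each container backup now amounts to something like this; the target host and path are placeholders:)

# plain rsync of the container data, no snapshots involved
rsync -aHAX --numeric-ids --delete /vz/private/<ctid>/ backuper:/backup/node6/<ctid>/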

@aivanise
Copy link

I got bitten by this using the latest rpm on CentOS 6 (zfs-0.6.3-1.el6.x86_64), also on OpenVZ.

I'm still in a phase where I'm experimenting manually with zfs send/recv from the command line, and if anything goes wrong on the receiver (i.e. it doesn't receive anything because I've messed up the command line params or left the destination in a state that can't receive snapshots), the sender just hangs.

It also hangs if I try to terminate the sender with SIGINT (i.e. hitting Ctrl-C on the keyboard), no matter what the destination is; even redirecting a stream to a file on the same machine hangs on Ctrl-C.

@aivanise
Copy link

It looks like it only happens when a deduplicated stream is being sent (zfs send -D). Tested with both the OpenVZ and a regular kernel, so it is not OpenVZ specific.

Now that I'm looking at the comments above, I see that send examples are without -D, so my case might be different.

Here is my trace:

Jul 11 09:12:06 free-3 kernel: [ 1079.290247] INFO: task spl_system_task:810 blocked for more than 120 seconds.
Jul 11 09:12:06 free-3 kernel: [ 1079.290293] Tainted: P --------------- 2.6.32-042stab092.2 #1
Jul 11 09:12:06 free-3 kernel: [ 1079.290334] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 11 09:12:06 free-3 kernel: [ 1079.290376] spl_system_ta D ffff88041a1cedb0 0 810 2 0 0x00000000
Jul 11 09:12:06 free-3 kernel: [ 1079.290422] ffff880419a97680 0000000000000046 0000000000000000 ffffc9001b01e300
Jul 11 09:12:06 free-3 kernel: [ 1079.290471] ffffffffa04b36e0 0000000000000000 ffff880419a97620 ffffffffa049d6b2
Jul 11 09:12:06 free-3 kernel: [ 1079.290519] 0000000000020000 ffffc9001b01e300 ffff88041a1cf378 000000000001ec80
Jul 11 09:12:06 free-3 kernel: [ 1079.290567] Call Trace:
Jul 11 09:12:06 free-3 kernel: [ 1079.290634] [] ? zio_vdev_io_done+0x42/0x190 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.291745] [] ? prepare_to_wait_exclusive+0x4e/0x80
Jul 11 09:12:06 free-3 kernel: [ 1079.292887] [] cv_wait_common+0x105/0x1c0 [spl]
Jul 11 09:12:06 free-3 kernel: [ 1079.293993] [] ? autoremove_wake_function+0x0/0x40
Jul 11 09:12:06 free-3 kernel: [ 1079.295109] [] __cv_wait+0x15/0x20 [spl]
Jul 11 09:12:06 free-3 kernel: [ 1079.296219] [] traverse_prefetcher+0xa5/0x150 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.297315] [] traverse_visitbp+0x2b6/0x6d0 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.298427] [] traverse_dnode+0x6c/0xf0 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.299521] [] traverse_visitbp+0x54d/0x6d0 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.300571] [] traverse_visitbp+0x438/0x6d0 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.301601] [] traverse_visitbp+0x438/0x6d0 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.302603] [] traverse_visitbp+0x438/0x6d0 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.303578] [] traverse_visitbp+0x438/0x6d0 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.304555] [] traverse_visitbp+0x438/0x6d0 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.305510] [] traverse_visitbp+0x438/0x6d0 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.306433] [] traverse_dnode+0x6c/0xf0 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.307348] [] traverse_visitbp+0x5f0/0x6d0 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.308258] [] traverse_prefetch_thread+0x83/0xc0 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.309163] [] ? traverse_prefetcher+0x0/0x150 [zfs]
Jul 11 09:12:06 free-3 kernel: [ 1079.310062] [] taskq_thread+0x218/0x4b0 [spl]
Jul 11 09:12:06 free-3 kernel: [ 1079.310962] [] ? thread_return+0xbc/0x87e
Jul 11 09:12:06 free-3 kernel: [ 1079.311863] [] ? default_wake_function+0x0/0x20
Jul 11 09:12:06 free-3 kernel: [ 1079.312784] [] ? taskq_thread+0x0/0x4b0 [spl]
Jul 11 09:12:06 free-3 kernel: [ 1079.313715] [] kthread+0x96/0xa0
Jul 11 09:12:06 free-3 kernel: [ 1079.314655] [] child_rip+0xa/0x20
Jul 11 09:12:06 free-3 kernel: [ 1079.315607] [] ? kthread+0x0/0xa0
Jul 11 09:12:06 free-3 kernel: [ 1079.316570] [] ? child_rip+0x0/0x20
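
(For what it's worth, reproducing it here is as simple as interrupting a deduplicated send; the dataset name below is just an example:)

zfs snapshot tank/test@repro
zfs send -D tank/test@repro > /tmp/stream
# hit Ctrl-C while the stream is being written; the zfs send process
# stays stuck in the D state instead of exiting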

@seletskiy
Copy link
Contributor

I've got the same issue as @snajpa, with the same workload, on production servers, i.e. while sending incremental backups using zfs snap/zfs send.

I'm not using OpenVZ.

@behlendorf
Copy link
Contributor

@aivanise Your case is likely #2210

@seletskiy
Copy link
Contributor

We tried to reproduce the issue on a test stand by launching stress -i 10 to produce 10 threads looping on fsync(), fio running a random read/write test on 13 separate zfs file systems, and a loop calling zfs list. It reproduced in 3 of about 20 attempts, without clear evidence of what is required to trigger the deadlock.
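
A rough sketch of that setup (the fio job and mount point below are illustrative, not the exact ones we ran):

# 10 workers generating continuous sync I/O
stress -i 10 &
# random read/write load on one of the zfs filesystems
fio --name=randrw --directory=/tank/fs01 --rw=randrw --bs=8k --size=1G \
    --numjobs=4 --time_based --runtime=600 &
# hammer the zfs ioctl path in a loop
while true; do zfs list > /dev/null; done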

@seletskiy
Copy link
Contributor

OK, the issue was reproduced on a production host with Linux 3.16.1 and zfs 0.6.3; stack traces:

Sep 02 09:04:47 kernel: INFO: task txg_sync:1009 blocked for more than 120 seconds.
Sep 02 09:04:47 kernel:       Tainted: P           O  3.16.1-1-apparmor #1
Sep 02 09:04:47 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 02 09:04:47 kernel: txg_sync        D 0000000000000004     0  1009      2 0x00000000
Sep 02 09:04:47 kernel:  ffff8817f07cbaf0 0000000000000046 ffff8817e9bef010 00000000000145c0
Sep 02 09:04:47 kernel:  ffff8817f07cbfd8 00000000000145c0 ffff8817e9bef010 0000000000000112
Sep 02 09:04:47 kernel:  ffff8817f07cbac0 ffffffffa0188a4d ffff8817f04b6900 ffff8817f07cba98
Sep 02 09:04:47 kernel: Call Trace:
Sep 02 09:04:47 kernel:  [<ffffffffa0188a4d>] ? kmem_alloc_debug+0x29d/0x540 [spl]
Sep 02 09:04:47 kernel:  [<ffffffffa04be031>] ? dbuf_dirty+0x471/0x8c0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffff811b22e3>] ? __kmalloc+0x163/0x1c0
Sep 02 09:04:47 kernel:  [<ffffffffa0188a4d>] ? kmem_alloc_debug+0x29d/0x540 [spl]
Sep 02 09:04:47 kernel:  [<ffffffff81549d79>] schedule+0x29/0x70
Sep 02 09:04:47 kernel:  [<ffffffff8154a1e6>] schedule_preempt_disabled+0x16/0x20
Sep 02 09:04:47 kernel:  [<ffffffff8154bb65>] __mutex_lock_slowpath+0xe5/0x230
Sep 02 09:04:47 kernel:  [<ffffffff8154bcc7>] mutex_lock+0x17/0x30
Sep 02 09:04:47 kernel:  [<ffffffffa0564c51>] zvol_rename_minors+0x61/0x170 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04dda16>] dsl_dataset_rename_snapshot_sync_impl+0x196/0x2e0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04ddc2f>] dsl_dataset_rename_snapshot_sync+0xcf/0x1c0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04eee4a>] dsl_sync_task_sync+0xda/0xe0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04e6ea3>] dsl_pool_sync+0x323/0x480 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04ff53d>] spa_sync+0x42d/0xb10 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffff810da948>] ? ktime_get_ts+0x48/0xf0
Sep 02 09:04:47 kernel:  [<ffffffffa0510252>] txg_sync_thread+0x382/0x5f0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa050fed0>] ? txg_delay+0xf0/0xf0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa018d2ca>] thread_generic_wrapper+0x7a/0x90 [spl]
Sep 02 09:04:47 kernel:  [<ffffffffa018d250>] ? __thread_exit+0xa0/0xa0 [spl]
Sep 02 09:04:47 kernel:  [<ffffffff81091eca>] kthread+0xea/0x100
Sep 02 09:04:47 kernel:  [<ffffffff811d0000>] ? __fput+0x110/0x200
Sep 02 09:04:47 kernel:  [<ffffffff81091de0>] ? kthread_create_on_node+0x1b0/0x1b0
Sep 02 09:04:47 kernel:  [<ffffffff8154db7c>] ret_from_fork+0x7c/0xb0
Sep 02 09:04:47 kernel:  [<ffffffff81091de0>] ? kthread_create_on_node+0x1b0/0x1b0
Sep 02 09:04:47 kernel: INFO: task zfs:16622 blocked for more than 120 seconds.
Sep 02 09:04:47 kernel:       Tainted: P           O  3.16.1-1-apparmor #1
Sep 02 09:04:47 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 02 09:04:47 kernel: zfs             D 000000000000000d     0 16622  14558 0x00000000
Sep 02 09:04:47 kernel:  ffff881d825f3a10 0000000000000082 ffff881f5ec7b2f0 00000000000145c0
Sep 02 09:04:47 kernel:  ffff881d825f3fd8 00000000000145c0 ffff881f5ec7b2f0 ffff881d825f3970
Sep 02 09:04:47 kernel:  0000000000000000 ffff881d825f3970 ffff8817c8ef8000 0000000000000240
Sep 02 09:04:47 kernel: Call Trace:
Sep 02 09:04:47 kernel:  [<ffffffffa01886ab>] ? kmem_free_debug+0x4b/0x150 [spl]
Sep 02 09:04:47 kernel:  [<ffffffffa00d4b1e>] ? zprop_name_to_prop_cb+0x6e/0x90 [zcommon]
Sep 02 09:04:47 kernel:  [<ffffffffa00d4ab0>] ? zprop_string_to_index+0xc0/0xc0 [zcommon]
Sep 02 09:04:47 kernel:  [<ffffffffa00d4ab0>] ? zprop_string_to_index+0xc0/0xc0 [zcommon]
Sep 02 09:04:47 kernel:  [<ffffffffa00d48c1>] ? zprop_iter_common+0xe1/0x140 [zcommon]
Sep 02 09:04:47 kernel:  [<ffffffff81549d79>] schedule+0x29/0x70
Sep 02 09:04:47 kernel:  [<ffffffffa01951d5>] cv_wait_common+0x125/0x1b0 [spl]
Sep 02 09:04:47 kernel:  [<ffffffff810b6d10>] ? __wake_up_sync+0x20/0x20
Sep 02 09:04:47 kernel:  [<ffffffffa0195275>] __cv_wait+0x15/0x20 [spl]
Sep 02 09:04:47 kernel:  [<ffffffffa04f5c9b>] rrw_enter_read+0x3b/0x130 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04f5e30>] rrw_enter+0x20/0x30 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04e7e4a>] dsl_pool_hold+0x5a/0x80 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04c8745>] dmu_objset_hold+0x35/0xd0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0188a4d>] ? kmem_alloc_debug+0x29d/0x540 [spl]
Sep 02 09:04:47 kernel:  [<ffffffffa04e8654>] dsl_prop_get+0x44/0xa0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04e86ce>] dsl_prop_get_integer+0x1e/0x20 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0562ae7>] __zvol_create_minor+0xc7/0x660 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0564933>] zvol_create_minor+0x23/0x40 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa056495e>] zvol_create_minors_cb+0xe/0x20 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04ca35d>] dmu_objset_find_impl+0x38d/0x420 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0564950>] ? zvol_create_minor+0x40/0x40 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04ca1b6>] dmu_objset_find_impl+0x1e6/0x420 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0564950>] ? zvol_create_minor+0x40/0x40 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa05029ed>] ? spa_open_common+0x2ad/0x430 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0564950>] ? zvol_create_minor+0x40/0x40 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04ca448>] dmu_objset_find+0x58/0x90 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0564a9f>] zvol_create_minors+0x2f/0x40 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0535dfa>] zfs_ioc_snapshot+0x14a/0x2d0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0536c2a>] zfsdev_ioctl+0x20a/0x500 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffff811e1480>] do_vfs_ioctl+0x2d0/0x4b0
Sep 02 09:04:47 kernel:  [<ffffffff811e16e1>] SyS_ioctl+0x81/0xa0
Sep 02 09:04:47 kernel:  [<ffffffff8154dc29>] system_call_fastpath+0x16/0x1b

After that, all zfs commands got stuck in uninterruptible sleep.

It reproduces on a variety of hosts, so it's not a HW problem.

@behlendorf: Can you please tell us what we can do to track down the issue? I'm glad to produce any debug information needed. It feels like a blocker, because, you know, hosts randomly hanging in production is not that great...

@seletskiy
Copy link
Contributor

@behlendorf: If I understand this correctly, txg_sync waits for zvol_state_lock to be released while it is held by the zfs snapshot command, which in turn waits for the sync to complete. Which results in a deadlock.

@seletskiy
Copy link
Contributor

Probably that's the issue: #2652.

@behlendorf
Copy link
Contributor

@seletskiy @snajpa has this been resolved for you guys since #2652 was merged?

@behlendorf behlendorf removed this from the 0.6.4 milestone Oct 16, 2014
@seletskiy
Copy link
Contributor

@behlendorf: Yes, we have been running 0.6.3 with that patch for about a month on the snap/rename workload and not a single error has happened.

@behlendorf behlendorf added this to the 0.6.4 milestone Oct 17, 2014
@behlendorf
Copy link
Contributor

Thanks for the confirmation. Then I'm going to close this as resolved.
