z_vol task hung on receiving side of zfs send #6330
Comments
I am observing the very same issue on Arch Linux, running the 4.9.37 kernel.
I appear to be having this issue as well. The backtrace in my case is slightly different but looks mostly the same. I have been experiencing this semi-regularly, particularly with large sends. Unlike the OP, I'm seeing it on the sending side, not the receiving side. I enabled the hung task detector in the hope of capturing logs when the freeze happened; without it, the freeze appears to be permanent, remaining for over an hour before I rebooted. I've also experienced shorter freezes, lasting perhaps 20 seconds or so, which do resolve themselves; they may be unrelated. The kernel hung task detector has also failed to capture logs for this a few times, but it's unclear whether this issue somehow freezes the detector as well, or whether it's just firmware weirdness w.r.t. pstore.
@klkblake if you observe this again, can you please try running the following command to try and unwedge things? It will cause a new thread to be spawned for any taskq which appears to be stalled. Obviously we really need all the backtraces to determine why it stalled, but this might let you work around the issue until the root cause is understood.
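For reference, the SPL taskq code exposes a "kick" module parameter that spawns extra worker threads for taskqs that look stalled; assuming that is the knob being referred to here, a sketch of using it from a root shell looks like this:

```sh
# Ask SPL to spawn additional threads for any taskq that appears stalled
# (assumes the spl_taskq_kick module parameter is the tunable meant above).
echo 1 > /sys/module/spl/parameters/spl_taskq_kick

# Then re-check taskq state for tasks that are still stuck.
cat /proc/spl/taskq
```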
Unfortunately I can't run commands when this happens -- I can't even move my mouse cursor, and attempts to ssh in give "no route to host". I haven't even had success using the magic SysRq key to reboot.
The following reliably reproduces the hardlock for me:
It usually hardlocks the system within 5-10 minutes after running this. I guess it doesn't like having a slow receiver? When doing a backup on my system the transfer rate is around 1 MB/s, so I'd guess the send is outrunning the receive there too.
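For illustration, a throttled local send/receive along these lines exercises the same slow-receiver condition; the pool and dataset names are placeholders, and pv is used here only as a convenient rate limiter:

```sh
# Send a snapshot into a second local pool through a ~1 MB/s rate limit,
# so the receive side is deliberately much slower than the send side.
# "tank/data@snap" and "backup/data" are hypothetical names.
zfs snapshot tank/data@snap
zfs send tank/data@snap | pv -L 1M | zfs receive -F backup/data
```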
The reproducer @klkblake provided gives the following hung task on my system:
Replacing
@gamanakis were you able to reproduce the hard lockup on the system? I ask because using [edit] I should add that we made a similar change a while back in commit 8e70975.
@behlendorf I could not reproduce a hard lockup thus far.
Possibly relevant: I'm running with
During a receive operation zvol_create_minors_impl() can wait needlessly for the prefetch thread because both share the same tasks queue. This results in hung tasks:

<3>INFO: task z_zvol:5541 blocked for more than 120 seconds.
<3> Tainted: P O 3.16.0-4-amd64
<3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The first z_zvol:5541 (zvol_task_cb) is waiting for the long running traverse_prefetch_thread:260

root@linux:~# cat /proc/spl/taskq
taskq                  act nthr spwn maxt pri mina
spl_system_taskq/0       1    2    0   64 100    1
  active: [260]traverse_prefetch_thread [zfs](0xffff88003347ae40)
  wait: 5541
spl_delay_taskq/0        0    1    0    4 100    1
  delay: spa_deadman [zfs](0xffff880039924000)
z_zvol/1                 1    1    0    1 120    1
  active: [5541]zvol_task_cb [zfs](0xffff88001fde6400)
  pend: zvol_task_cb [zfs](0xffff88001fde6800)

This change adds a dedicated, per-pool, prefetch taskq to prevent the traverse code from monopolizing the global (and limited) system_taskq by inappropriately scheduling long running tasks on it.

Reviewed-by: Albert Lee <trisk@forkgnu.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6330
Closes #6890
Closes #7343
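As a quick way to see whether this situation applies on a given system, the SPL taskq state can be sampled while a send/receive is in flight; on an affected build the traverse prefetch work shows up on the shared system taskq, as in the listing above. A minimal sketch:

```sh
# Sample taskq activity every few seconds during a send/receive.
# On builds carrying the fix, traverse_prefetch_thread should no longer
# be blocking zvol_task_cb via the shared spl_system_taskq.
watch -n 5 'grep -A 3 -E "spl_system_taskq|z_zvol" /proc/spl/taskq'
```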
I apparently get the same issue sending/receiving a filesystem between two different pools on the same machine. Setup: a fresh Ubuntu 18.04.1 install, fully updated.
Action: the send/receive described above. Dmesg shows the same message recurring every 120s.
Anything I can do/provide?
@github-duran the fix for this issue was included as of the zfs-0.7.9 tag; it has not yet been included in the Ubuntu release. https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1772412
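For anyone checking whether their host already carries the fix, the version of the loaded kernel module can be read directly; the package name below assumes Ubuntu/Debian packaging:

```sh
# Version of the ZFS kernel module actually loaded
cat /sys/module/zfs/version

# Version of the installed userland package on Ubuntu/Debian systems
dpkg -s zfsutils-linux | grep '^Version'
```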
Oh thank you. Can I safely ignore it until the fix is released?
You'll need to reboot to stop the warnings at some point, but aside from that there's no risk to your data.
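As the kernel message quoted in the commit above notes, the periodic warning itself (not the underlying stall) can also be silenced until that reboot:

```sh
# Disable the hung-task watchdog messages; this only hides the warnings,
# it does not unblock the stalled z_zvol task.
echo 0 > /proc/sys/kernel/hung_task_timeout_secs
```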
I'm facing this issue in production as I type this. I'm on Ubuntu 18.04.2, which has ZFS 0.7.5. The transfer went smoothly until about 95 GB, and now it has slowed to a crawl. Is there no easy fix for this?
If I let the transfer take its own time, will it pick up speed later?
The status of the Ubuntu bug has been changed to
System information
Describe the problem you're observing
I'm observing the following stack trace on the receiving side of zfs send.
Describe how to reproduce the problem
Set up a regular send to a remote system (1/hour).
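For illustration, an hourly incremental send of that sort would look something like the following; the dataset, snapshot, and host names are hypothetical:

```sh
# Hourly incremental replication to a remote box; names are placeholders.
zfs snapshot tank/data@hourly-new
zfs send -i tank/data@hourly-prev tank/data@hourly-new | \
    ssh backuphost zfs receive -F backup/data
```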
Include any warning/errors/backtraces from the system logs