Receive of unencrypted dataset crashes system 0.7.11 #8030

Closed
sempervictus opened this issue Oct 16, 2018 · 7 comments
Labels
Status: Stale (no recent activity for issue)

Comments

@sempervictus
Contributor

System information

Distribution Name | Arch
Distribution Version | Current
Linux Kernel | 4.14
Architecture | x64
ZFS Version | 0.7.11 & 0.7.10
SPL Version | 0.7.11 & 0.7.10

Describe the problem you're observing

Sending incremental or full streams causes crashes on the receiving system and prevents send completion.

Describe how to reproduce the problem

Znapzend seems to do it pretty consistently via znapzend --runonce --autoCreation --nodestroy; any send/recv for the rpool does this. A scrub of rpool on the sending machine reveals no damage, but this wouldn't be the first time we've found that scrub doesn't actually understand some classes of on-disk error.
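
For reference, a manual send/recv that exercises the same path can be sketched roughly like this (pool, snapshot, and host names here are placeholders, not taken from this report):

zfs snapshot rpool/ROOT@repro1
zfs send rpool/ROOT@repro1 | ssh recv-host zfs recv -u backup/rpool-root     # full stream
zfs snapshot rpool/ROOT@repro2
zfs send -i rpool/ROOT@repro1 rpool/ROOT@repro2 | ssh recv-host zfs recv -u backup/rpool-root     # incremental stream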

Include any warning/errors/backtraces from the system logs

On the recv side we see this:

[1843295.890900] INFO: task zfs:5493 blocked for more than 120 seconds.
[1843295.891083]       Tainted: P           OE   4.14.69 #1
[1843295.891231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1843295.891466] zfs             D    0  5493   5491 0x80080082
[1843295.891470]  ffff880c0418c380 8000000000000000 ffff8809f88c0880 ffff8805e57fad00
[1843295.891473]  ffff880c1fd1cfc0 ffffc90006537868 ffffffff817c2670 ffffffff810ac468
[1843295.891476]  ffff8804a5869b48 7a7832ec00ab3945 ffff8804a5869b38 ffff8804a5869b40
[1843295.891479] Call Trace:
[1843295.891489]  [<ffffffff817c2670>] ? __schedule+0x200/0x670
[1843295.891493]  [<ffffffff810ac468>] ? prepare_to_wait_exclusive+0xa8/0xe0
[1843295.891496]  [<ffffffff817c2b06>] schedule+0x26/0x80
[1843295.891506]  [<ffffffffa02784a2>] cv_wait_common+0x112/0x150 [spl]
[1843295.891509]  [<ffffffff810ac020>] ? finish_wait+0xd0/0xd0
[1843295.891559]  [<ffffffffa06de536>] dmu_recv_stream+0x396/0x980 [zfs]
[1843295.891584]  [<ffffffffa075d2c0>] zfs_ioc_recv_impl+0x640/0x10e0 [zfs]
[1843295.891590]  [<ffffffff810957d2>] ? account_entity_enqueue+0xf2/0x120
[1843295.891594]  [<ffffffff811c3bdf>] ? ___slab_alloc+0xef/0x4b0
[1843295.891596]  [<ffffffff8109a26e>] ? enqueue_entity+0x41e/0x7d0
[1843295.891620]  [<ffffffffa075defb>] zfs_ioc_recv+0x19b/0x270 [zfs]
[1843295.891646]  [<ffffffffa07595ba>] zfsdev_ioctl+0x56a/0x650 [zfs]
[1843295.891665]  [<ffffffff8118897d>] ? __handle_mm_fault+0xfad/0x12e0
[1843295.891669]  [<ffffffff812097fa>] do_vfs_ioctl+0xba/0x780
[1843295.891675]  [<ffffffff81209f35>] sys_ioctl+0x75/0x80
[1843295.891678]  [<ffffffff81001937>] do_syscall_64+0x77/0x1b0
[1843295.891681]  [<ffffffff818001ef>] entry_SYSCALL_64_after_hwframe+0x3d/0xa0
[1843295.891684] RIP: 0033:[<000003e581fc279b>] 0x3e581fc279b
[1843295.891685] RSP: 002b:000003efa7d8e968 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[1843295.891687] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000003e581fc279b
[1843295.891688] RDX: 000003efa7d8eb00 RSI: 0000000000005a1b RDI: 0000000000000005
[1843295.891689] RBP: 000003efa7d8eb00 R08: 00000000ffffffff R09: 0000000000000000
[1843295.891690] R10: 0000000000000022 R11: 0000000000000246 R12: 000003efa7d9cca8
[1843295.891692] R13: 000003efa7d98df0 R14: 0000000000000000 R15: 000003efa7d967f0
[1843418.769387] INFO: task zfs:5493 blocked for more than 120 seconds.
[1843418.769642]       Tainted: P           OE   4.14.69 #1
[1843418.769821] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1843418.770074] zfs             D    0  5493   5491 0x80080082
[1843418.770079]  ffff880c0418c380 8000000000000000 ffff8809f88c0880 ffff8805e57fad00
[1843418.770082]  ffff880c1fd1cfc0 ffffc90006537868 ffffffff817c2670 ffffffff810ac468
[1843418.770085]  ffff8804a5869b48 7a7832ec00ab3945 ffff8804a5869b38 ffff8804a5869b40
[1843418.770087] Call Trace:
[1843418.770098]  [<ffffffff817c2670>] ? __schedule+0x200/0x670
[1843418.770103]  [<ffffffff810ac468>] ? prepare_to_wait_exclusive+0xa8/0xe0
[1843418.770105]  [<ffffffff817c2b06>] schedule+0x26/0x80
[1843418.770115]  [<ffffffffa02784a2>] cv_wait_common+0x112/0x150 [spl]
[1843418.770118]  [<ffffffff810ac020>] ? finish_wait+0xd0/0xd0
[1843418.770203]  [<ffffffffa06de536>] dmu_recv_stream+0x396/0x980 [zfs]
[1843418.770230]  [<ffffffffa075d2c0>] zfs_ioc_recv_impl+0x640/0x10e0 [zfs]
[1843418.770235]  [<ffffffff810957d2>] ? account_entity_enqueue+0xf2/0x120
[1843418.770238]  [<ffffffff811c3bdf>] ? ___slab_alloc+0xef/0x4b0
[1843418.770240]  [<ffffffff8109a26e>] ? enqueue_entity+0x41e/0x7d0
[1843418.770264]  [<ffffffffa075defb>] zfs_ioc_recv+0x19b/0x270 [zfs]
[1843418.770292]  [<ffffffffa07595ba>] zfsdev_ioctl+0x56a/0x650 [zfs]
[1843418.770296]  [<ffffffff8118897d>] ? __handle_mm_fault+0xfad/0x12e0
[1843418.770299]  [<ffffffff812097fa>] do_vfs_ioctl+0xba/0x780
[1843418.770304]  [<ffffffff81209f35>] sys_ioctl+0x75/0x80
[1843418.770308]  [<ffffffff81001937>] do_syscall_64+0x77/0x1b0
[1843418.770311]  [<ffffffff818001ef>] entry_SYSCALL_64_after_hwframe+0x3d/0xa0

On the sending side, it still thinks something is receiving on the other end, but the data transfer rate is zero.
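
For completeness, the stuck receive shows up on the receiver as a zfs process in uninterruptible sleep; a generic way to confirm it (nothing here is specific to this host, and <pid> is just the PID from the ps output):

ps -o pid,stat,wchan:20,cmd -C zfs     # STAT should show D, wchan should point at cv_wait_common
cat /proc/<pid>/stack                  # kernel stack of the hung receive, if available on this kernel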

@rincebrain
Contributor

Are you saying that one side is 0.7.11 and the other is 0.7.10, or that you've tested it on both of those versions?

Are both the sender and receiver running the same kernel/OS/etc?

What's in /proc/spl/kstat/zfs/dbgmsg? (Please just pastebin the whole thing, rather than inlining a few lines.)

Are any of the kernel threads pegging the CPU?

Are you sure there isn't another message like VERIFY3(...) or PANIC[...] above those two? Otherwise those two messages don't necessarily mean anything harmful, just that things are slow.
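
A minimal way to collect all of that in one pass (paths are the standard ZoL ones; nothing here is specific to this system):

cat /proc/spl/kstat/zfs/dbgmsg > /tmp/dbgmsg.txt     # full debug log, suitable for pastebin
top -b -n 1 | head -n 40                             # check whether z_* or txg_sync kernel threads are pegging a CPU
dmesg | grep -E 'VERIFY|PANIC'                       # any assertion or panic lines above the hung-task messages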

@sempervictus
Contributor Author

The recv side has been up a while, and everything is on ZFS root, so it's 0.7.10. This has been happening in one way or another for ages, though; the cv_wait_common from the zfs_ioc_recv calls keeps coming up in stack traces.

@rincebrain
Contributor

I'd suggest not running one of the two blacklisted versions of the 0.7.X series on the receiver, but other than that, the information I requested from you is still needed to do anything useful.

@sempervictus
Contributor Author

Blacklisted?
I'll pull the dbgmsg when I jump back into the host, but there's no load on it other than this. There are no panics, no asserts.

@rincebrain
Contributor

Both 0.7.7 and 0.7.10 have rather critical flaws, are struck out on the homepage, and have big "don't use this" notices at the top of their respective releases.

I doubt, in this case, that the problem you're having is related to 0.7.10's nasty bugs, but it's something I'd eliminate before trying to look further.

@sempervictus
Contributor Author

Thanks for the heads up. Agreed, this is not likely to be it though; I'm pretty sure the problem is much older than both. Updating the host to a kernel with 0.7.11 built in.
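
Once the receiver is rebooted, a quick sanity check that the updated module is actually loaded (standard ZoL sysfs paths, assuming nothing unusual about the build):

cat /sys/module/zfs/version     # should report 0.7.11 (or a 0.7.11-based package version)
cat /sys/module/spl/version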

@stale

stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale label on Aug 25, 2020
@stale stale bot closed this as completed on Nov 23, 2020