Unable to recv stream from 0.7.0-rc3 on 0.6.5.8, stuck on 100% CPU #5999

Closed
aither64 opened this Issue Apr 10, 2017 · 7 comments


aither64 commented Apr 10, 2017

System information

Type Version/Name
Distribution Name Scientific Linux
Distribution Version 6
Linux Kernel 2.6.32-042stab120.20 (OpenVZ)
Architecture x86_64
ZFS Version 0.7.0-rc3_154_g7b0dc2a and 0.6.5.8-1
SPL Version 0.7.0-rc3_7_gbf8abea and 0.6.5.8-1

Describe the problem you're observing

It's not possible to receive a stream created with ZFS 0.7 on ZFS 0.6. zfs recv writes most (or all) of the data, as I can see the used property increase. When it should finish, the process instead spins at 100% CPU in the R state. It is stuck and cannot be killed, even with SIGKILL.

Describe how to reproduce the problem

First, with ZFS 0.7:

zfs send src-tank@snapshot > stream.dat

Then with ZFS 0.6:

zfs recv -F dst-tank < stream.dat

For example, the stream in 50.dat.gz was created with ZFS 0.7, and every time I try to receive it on ZFS 0.6 the problem occurs:

[root@node1-zfs.dev1.prg.vpsfree.cz]
 ~ # zcat 50.dat.gz | zfs recv -v -F vz/test/50
receiving full stream of vz/private/50@test into vz/test/50@test

This process never finishes.
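
One way to inspect which record types a stream contains is the zstreamdump utility shipped with ZFS; its summary output should indicate whether the stream carries DRR_FREEOBJECTS records, the record type implicated in the stack traces below:

zcat 50.dat.gz | zstreamdump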

Include any warning/errors/backtraces from the system logs

No error is shown, the zfs recv process is just stuck.

[root@node1-zfs.dev1.prg.vpsfree.cz]
 ~ # for i in {1..5} ; do cat /proc/8599/stack ; echo '---' ; sleep 2 ; done
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffffa030e899>] dbuf_find+0x59/0x1a0 [zfs]
[<ffffffffa0311d09>] __dbuf_hold_impl+0xb9/0x4f0 [zfs]
[<ffffffffa03121bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
[<ffffffffa032bc10>] dnode_next_offset_level+0xe0/0x3d0 [zfs]
[<ffffffffa032e66e>] dnode_next_offset+0x10e/0x230 [zfs]
[<ffffffffa031c1f5>] dmu_object_next+0x45/0x60 [zfs]
[<ffffffffa0322d27>] restore_freeobjects+0x57/0xb0 [zfs]
[<ffffffffa0323a94>] dmu_recv_stream+0x664/0xb50 [zfs]
[<ffffffffa0399e80>] zfs_ioc_recv+0x210/0xbf0 [zfs]
[<ffffffffa03994b5>] zfsdev_ioctl+0x4c5/0x500 [zfs]
[<ffffffff811d6892>] vfs_ioctl+0x22/0xa0
[<ffffffff811d6a34>] do_vfs_ioctl+0x84/0x5b0
[<ffffffff811d6faf>] sys_ioctl+0x4f/0x80
[<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffffa030e899>] dbuf_find+0x59/0x1a0 [zfs]
[<ffffffffa0311d09>] __dbuf_hold_impl+0xb9/0x4f0 [zfs]
[<ffffffffa03121bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
[<ffffffffa032bc10>] dnode_next_offset_level+0xe0/0x3d0 [zfs]
[<ffffffffa032e66e>] dnode_next_offset+0x10e/0x230 [zfs]
[<ffffffffa031c1f5>] dmu_object_next+0x45/0x60 [zfs]
[<ffffffffa0322d27>] restore_freeobjects+0x57/0xb0 [zfs]
[<ffffffffa0323a94>] dmu_recv_stream+0x664/0xb50 [zfs]
[<ffffffffa0399e80>] zfs_ioc_recv+0x210/0xbf0 [zfs]
[<ffffffffa03994b5>] zfsdev_ioctl+0x4c5/0x500 [zfs]
[<ffffffff811d6892>] vfs_ioctl+0x22/0xa0
[<ffffffff811d6a34>] do_vfs_ioctl+0x84/0x5b0
[<ffffffff811d6faf>] sys_ioctl+0x4f/0x80
[<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffff811aa8b7>] kmem_cache_alloc_node_trace+0x127/0x220
[<ffffffff811aaa85>] __kmalloc_node+0x65/0x70
[<ffffffffa02a8b8e>] spl_kmem_zalloc+0x8e/0x190 [spl]
[<ffffffffa031218f>] dbuf_hold_impl+0x4f/0xb0 [zfs]
[<ffffffffa032bc10>] dnode_next_offset_level+0xe0/0x3d0 [zfs]
[<ffffffffa032e66e>] dnode_next_offset+0x10e/0x230 [zfs]
[<ffffffffa031c1f5>] dmu_object_next+0x45/0x60 [zfs]
[<ffffffffa0322d27>] restore_freeobjects+0x57/0xb0 [zfs]
[<ffffffffa0323a94>] dmu_recv_stream+0x664/0xb50 [zfs]
[<ffffffffa0399e80>] zfs_ioc_recv+0x210/0xbf0 [zfs]
[<ffffffffa03994b5>] zfsdev_ioctl+0x4c5/0x500 [zfs]
[<ffffffff811d6892>] vfs_ioctl+0x22/0xa0
[<ffffffff811d6a34>] do_vfs_ioctl+0x84/0x5b0
[<ffffffff811d6faf>] sys_ioctl+0x4f/0x80
[<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffff811aa8b7>] kmem_cache_alloc_node_trace+0x127/0x220
[<ffffffff811aaa85>] __kmalloc_node+0x65/0x70
[<ffffffffa02a8b8e>] spl_kmem_zalloc+0x8e/0x190 [spl]
[<ffffffffa031218f>] dbuf_hold_impl+0x4f/0xb0 [zfs]
[<ffffffffa032bc10>] dnode_next_offset_level+0xe0/0x3d0 [zfs]
[<ffffffffa032e66e>] dnode_next_offset+0x10e/0x230 [zfs]
[<ffffffffa031c1f5>] dmu_object_next+0x45/0x60 [zfs]
[<ffffffffa0322d27>] restore_freeobjects+0x57/0xb0 [zfs]
[<ffffffffa0323a94>] dmu_recv_stream+0x664/0xb50 [zfs]
[<ffffffffa0399e80>] zfs_ioc_recv+0x210/0xbf0 [zfs]
[<ffffffffa03994b5>] zfsdev_ioctl+0x4c5/0x500 [zfs]
[<ffffffff811d6892>] vfs_ioctl+0x22/0xa0
[<ffffffff811d6a34>] do_vfs_ioctl+0x84/0x5b0
[<ffffffff811d6faf>] sys_ioctl+0x4f/0x80
[<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffffa030e899>] dbuf_find+0x59/0x1a0 [zfs]
[<ffffffffa0311d09>] __dbuf_hold_impl+0xb9/0x4f0 [zfs]
[<ffffffffa03121bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
[<ffffffffa032bc10>] dnode_next_offset_level+0xe0/0x3d0 [zfs]
[<ffffffffa032e66e>] dnode_next_offset+0x10e/0x230 [zfs]
[<ffffffffa031c1f5>] dmu_object_next+0x45/0x60 [zfs]
[<ffffffffa0322d27>] restore_freeobjects+0x57/0xb0 [zfs]
[<ffffffffa0323a94>] dmu_recv_stream+0x664/0xb50 [zfs]
[<ffffffffa0399e80>] zfs_ioc_recv+0x210/0xbf0 [zfs]
[<ffffffffa03994b5>] zfsdev_ioctl+0x4c5/0x500 [zfs]
[<ffffffff811d6892>] vfs_ioctl+0x22/0xa0
[<ffffffff811d6a34>] do_vfs_ioctl+0x84/0x5b0
[<ffffffff811d6faf>] sys_ioctl+0x4f/0x80
[<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---

On another system with a different kernel (2.6.32-042stab120.11), it looks like this:

[root@node12.prg.vpsfree.cz]
 ~ # for i in {1..5} ; do cat /proc/527723/stack ; echo '---' ; sleep 2 ; done    
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffffa02f7994>] dbuf_find+0x154/0x1a0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffffa02f7994>] dbuf_find+0x154/0x1a0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa02fb1d0>] dbuf_hold_impl+0x90/0xb0 [zfs]
[<ffffffffa0314c10>] dnode_next_offset_level+0xe0/0x3d0 [zfs]
[<ffffffffa031766e>] dnode_next_offset+0x10e/0x230 [zfs]
[<ffffffffa03051f5>] dmu_object_next+0x45/0x60 [zfs]
[<ffffffffa030bd27>] restore_freeobjects+0x57/0xb0 [zfs]
[<ffffffffa030ca94>] dmu_recv_stream+0x664/0xb50 [zfs]
[<ffffffffa0382e80>] zfs_ioc_recv+0x210/0xbf0 [zfs]
[<ffffffffa03824b5>] zfsdev_ioctl+0x4c5/0x500 [zfs]
[<ffffffff811d6892>] vfs_ioctl+0x22/0xa0
[<ffffffff811d6a34>] do_vfs_ioctl+0x84/0x5b0
[<ffffffff811d6faf>] sys_ioctl+0x4f/0x80
[<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffff81079a95>] __cond_resched+0x65/0x70
[<ffffffffa02f7994>] dbuf_find+0x154/0x1a0 [zfs]
[<ffffffff811aacdb>] kfree+0x5b/0x250
[<ffffffffffffffff>] 0xffffffffffffffff
---
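
Every trace ends up in the same place: restore_freeobjects() walking the object range of a DRR_FREEOBJECTS record via dmu_object_next(). Below is a minimal C sketch of why a loop of this shape can spin forever; it assumes the 0.6.x DMU interfaces (dmu_object_next(), dmu_object_info(), dmu_free_object()) and is simplified from, not copied out of, the 0.6.5 source:

/*
 * Sketch only.  The object cursor is advanced by dmu_object_next() in
 * the update clause, but its return value is discarded.  If the call
 * fails without moving 'obj' forward (e.g. there is no next allocated
 * object in the range), the loop re-tests the same object forever --
 * matching the 100% CPU spin in the R state seen above.
 */
static int
restore_freeobjects_sketch(objset_t *os, uint64_t firstobj, uint64_t numobjs)
{
	uint64_t obj;
	int err;

	for (obj = firstobj; obj < firstobj + numobjs;
	    (void) dmu_object_next(os, &obj, FALSE, 0)) {
		/* Skip holes: the object may already be unallocated. */
		if (dmu_object_info(os, obj, NULL) != 0)
			continue;

		err = dmu_free_object(os, obj);
		if (err != 0)
			return (err);
	}
	return (0);
}
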
Contributor

loli10K commented Apr 10, 2017

restore_freeobjects

possible duplicate of #5699?

aither64 commented Apr 12, 2017

Yes, it seems to be the same issue. I followed OpenZFS 6393 to commit e6d3a84, which is in 0.7.0. Could the fix to receive_freeobjects (as stated in illumos-discuss) be backported to ZoL 0.6.5.10? In 0.6.5, the function is indeed named restore_freeobjects.

I tried to extract the fix from 6393 and came up with this patch. It seems to work for me; recv finishes correctly. I don't know whether it has other consequences, though.
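
A hypothetical sketch of the shape such a fix takes (names are illustrative, and this mirrors what 0.7's receive_freeobjects() does rather than reproducing the linked patch): the loop keeps the return value of dmu_object_next() and exits once the cursor can no longer advance.

/*
 * Sketch only: capture dmu_object_next()'s return value instead of
 * discarding it, and leave the loop when it reports an error.  ESRCH
 * means "no next object", i.e. a normal end of the range, so it is
 * not treated as a failure.
 */
static int
restore_freeobjects_fixed_sketch(objset_t *os, uint64_t firstobj,
    uint64_t numobjs)
{
	uint64_t obj;
	int err, next_err = 0;

	for (obj = firstobj;
	    next_err == 0 && obj < firstobj + numobjs;
	    next_err = dmu_object_next(os, &obj, FALSE, 0)) {
		if (dmu_object_info(os, obj, NULL) != 0)
			continue;

		err = dmu_free_object(os, obj);
		if (err != 0)
			return (err);
	}
	if (next_err != 0 && next_err != ESRCH)
		return (next_err);
	return (0);
}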

Contributor

loli10K commented Apr 12, 2017

Could the fix to receive_freeobjects (as stated in illumos-discuss) be backported to ZoL 0.6.5.10?

That fix (which doesn't seem to work as intended, see #5699 (comment)) is needed on the sending side, not the receiving end: backporting it to 0.6.5.10 would not help in this case.

Jontem commented Jul 3, 2017

Hi!

I have the same problem, with the latest FreeNAS (FreeBSD 11-STABLE) as the sending side and Ubuntu 16.04 with the latest stable ZoL as the receiving side.

I've read this issue and #5699 a couple of times, but I'm not totally clear on whether there is a way to opt out of the free records. This commit seems to be related: freenas/os@d29ea66

According to this thread it seems unlikely, especially this comment: https://illumos.topicbox.com/groups/discuss/discussions/T03e37d6ecdac991f-Mbe84d215678afc7bd84d9a68

I've tested on 0.7.0-rc4 and it works. Is there a planned release date for stable 0.7? I have searched but couldn't find one.

Member

kpande commented Jul 3, 2017

Hm, there's not even a project created for 0.7.0-rc5 yet. @behlendorf?

Member

behlendorf commented Jul 3, 2017

Created a project for 0.7.0-rc5; the only major outstanding change is finalizing the safeimport (multi-modifier protection) code.

Member

gmelikov commented Aug 9, 2017

The 0.7 branch is stable now; closing as a duplicate of #5699.

@gmelikov gmelikov closed this Aug 9, 2017
