Fix 'zfs send/recv' hang with 16M blocks #7404

Merged
behlendorf merged 1 commit into openzfs:master on Apr 9, 2018

Conversation

behlendorf (Contributor)

Description

When using 16MB blocks, the send/recv queues aren't quite big
enough. This change keeps the default 16M queue size, which is a
good value for most pools, but additionally ensures that the
queue sizes are at least twice the allowed zfs_max_recordsize.
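
The sizing rule itself is small. A minimal sketch of the idea in C follows; the helper name and exact form are illustrative, not the literal patch:

#include <stdint.h>

/*
 * Illustrative sketch of the queue sizing described above; the helper
 * name is hypothetical.  The tunable default (16M) is preserved, but
 * the effective queue size never drops below two maximum-sized records,
 * so a zfs_max_recordsize block plus its record header still fits.
 */
static uint64_t
send_recv_queue_size(uint64_t tunable_default, uint64_t zfs_max_recordsize)
{
        uint64_t min_size = 2 * zfs_max_recordsize;

        return (tunable_default > min_size ? tunable_default : min_size);
}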

Motivation and Context

Issue #7365

How Has This Been Tested?

Locally with the provided reproducer, https://gist.github.com/fling-/c66bf1e4a082b5cf9cd4d1106fe6e2bc

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

When using 16MB blocks, the send/recv queues aren't quite big
enough.  This change keeps the default 16M queue size, which is a
good value for most pools, but additionally ensures that the
queue sizes are at least twice the allowed zfs_max_recordsize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#7365
loli10K (Contributor) commented Apr 7, 2018

The reproducer provided in #7365 triggers the following failure on the recv side, even with this change applied:

VERIFY3(item_size <= q->bq_maxsize) failed (16777592 <= 16777216)
PANIC at bqueue.c:72:bqueue_enqueue()
Showing stack for process 701
CPU: 0 PID: 701 Comm: zfs Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.51-3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000000 ffffffff8151ea00 ffffffffa07c7400 ffff88002fdef858
 ffffffffa040b229 ffff880039bff608 ffff880000000030 ffff88002fdef868
 ffff88002fdef808 2833594649524556 7a69735f6d657469 3e2d71203d3c2065
Call Trace:
 [<ffffffff8151ea00>] ? dump_stack+0x5d/0x78
 [<ffffffffa040b229>] ? spl_panic+0xc9/0x110 [spl]
 [<ffffffffa05b5f30>] ? receive_read_prefetch+0x37/0x69 [zfs]
 [<ffffffffa05b64e2>] ? receive_read_record+0x580/0xa00 [zfs]
 [<ffffffff810a6cc1>] ? pick_next_task_fair+0x6e1/0x820
 [<ffffffff810135dc>] ? __switch_to+0x15c/0x5a0
 [<ffffffff81520fa1>] ? __schedule+0x2b1/0x800
 [<ffffffffa058bfe3>] ? bqueue_enqueue+0xb8/0x21e [zfs]
 [<ffffffffa05b7f8b>] ? dmu_recv_stream+0xa6e/0xecb [zfs]
 [<ffffffffa05b00e2>] ? dmu_recv_begin+0x21e/0x220 [zfs]
 [<ffffffffa06c4e27>] ? zfs_ioc_recv_impl+0x52b/0xb34 [zfs]
 [<ffffffffa06c575c>] ? zfs_ioc_recv+0x32c/0x40e [zfs]
 [<ffffffffa062a80f>] ? __raw_spin_unlock+0x18/0x1a [zfs]
 [<ffffffffa062a855>] ? spin_unlock+0x18/0x1a [zfs]
 [<ffffffffa062b19e>] ? refcount_add_many+0x162/0x168 [zfs]
 [<ffffffffa06ca46e>] ? zfsdev_ioctl+0x66a/0x768 [zfs]
 [<ffffffff811c1b2f>] ? do_vfs_ioctl+0x2cf/0x4b0
 [<ffffffff810593c7>] ? __do_page_fault+0x177/0x410
 [<ffffffff811c1d91>] ? SyS_ioctl+0x81/0xa0
 [<ffffffff81526d58>] ? async_page_fault+0x28/0x30
 [<ffffffff81524d0d>] ? system_call_fast_compare_end+0x10/0x15

The change LGTM; I am going to do some more debugging to find out why we're hitting a VERIFY.

EDIT: I was probably running with the old kernel module loaded; I cannot trigger this anymore with this change applied:

root@linux:~# stap -d kernel -d zfs -e '
> probe
> module("zfs").function("bqueue_init").call
> {
>    printf(" %s -> %s %s\n", symname(caller_addr()), ppfunc(), $size$$);
> }
> probe module("zfs").function("bqueue_init").return
> {
>    printf(" %s <- %s %s\n", symname(caller_addr()), ppfunc(), $$return$$);
> }' -c "./issue-7365.sh"
+ issue=7365
+ date +%s
+ sdate=1523091708
+ tmpdir=/tmp/7365
+ poolfile=/tmp/7365/1523091708.ima
+ poolname=issue-7365-1523091708
+ mountpoint=/tmp/7365/source-1523091708
+ mkdir -p /tmp/7365
+ touch /tmp/7365/1523091708.ima
+ truncate --size=1G /tmp/7365/1523091708.ima
+ echo 16777216
+ zpool create -m none -d -o feature@large_blocks=enabled -O recordsize=16M issue-7365-1523091708 /tmp/7365/1523091708.ima
+ zfs create issue-7365-1523091708/source -o mountpoint=/tmp/7365/source-1523091708
+ echo
+ truncate --size=16M /tmp/7365/source-1523091708/large.block
+ zfs snap issue-7365-1523091708/source@testing
+ zfs recv -v issue-7365-1523091708/target
+ zfs send -L issue-7365-1523091708/source@testing
receiving full stream of issue-7365-1523091708/source@testing into issue-7365-1523091708/target@testing
 dmu_send_impl -> bqueue_init 33554432
 dmu_send_obj <- bqueue_init return=0
 dmu_recv_stream -> bqueue_init 33554432
 zfs_ioc_recv_impl <- bqueue_init return=0
received 16.0M stream in 1 seconds (16.0M/sec)
+ zpool destroy issue-7365-1523091708
+ rm /tmp/7365/1523091708.ima
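
For context on the numbers above: the VERIFY quoted earlier fires in the bounded-queue enqueue path when a single item is larger than the queue itself. An item of 16777592 bytes (seemingly a full 16M record plus per-record overhead) is slightly larger than the old 16M bq_maxsize; with the queue floored at 2 * zfs_max_recordsize, bqueue_init is called with 33554432, as the stap output shows, and the record fits. A minimal sketch of that bound check, with simplified types and hypothetical names:

#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the bounded queue; only the size fields matter here. */
typedef struct bqueue_sketch {
        uint64_t bq_maxsize;  /* maximum bytes the queue may hold */
        uint64_t bq_size;     /* bytes currently enqueued */
} bqueue_sketch_t;

/*
 * Sketch of the failing bound check (the real bqueue_enqueue() also
 * handles locking and blocking).  With bq_maxsize = 16777216, an item
 * of 16777592 bytes trips the assertion; with bq_maxsize = 33554432
 * (2 * zfs_max_recordsize) it fits.
 */
static void
bqueue_enqueue_sketch(bqueue_sketch_t *q, uint64_t item_size)
{
        assert(item_size <= q->bq_maxsize);
        q->bq_size += item_size;
}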

codecov bot commented Apr 7, 2018

Codecov Report

Merging #7404 into master will decrease coverage by 0.06%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #7404      +/-   ##
==========================================
- Coverage   76.44%   76.37%   -0.07%     
==========================================
  Files         330      330              
  Lines      104277   104279       +2     
==========================================
- Hits        79712    79641      -71     
- Misses      24565    24638      +73
Flag      Coverage Δ
#kernel   76.35% <100%> (-23.65%) ⬇️
#user     65.71% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

behlendorf merged commit 3b0d992 into openzfs:master on Apr 9, 2018
fling- (Contributor) commented Apr 11, 2018

@loli10K are you testing it on top of zfs-0.7.6 tag?

loli10K (Contributor) commented Apr 11, 2018

@fling- no, I am not. This should probably be tested on top of 0.7.8 before 0.7.9 gets released.

tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 16, 2018
When using 16MB blocks, the send/recv queues aren't quite big
enough.  This change keeps the default 16M queue size, which is a
good value for most pools, but additionally ensures that the
queue sizes are at least twice the allowed zfs_max_recordsize.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#7365 
Closes openzfs#7404
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request May 4, 2018
When using 16MB blocks, the send/recv queues aren't quite big
enough.  This change keeps the default 16M queue size, which is a
good value for most pools, but additionally ensures that the
queue sizes are at least twice the allowed zfs_max_recordsize.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#7365 
Closes openzfs#7404
tonyhutter pushed a commit that referenced this pull request May 10, 2018
When using 16MB blocks, the send/recv queues aren't quite big
enough.  This change keeps the default 16M queue size, which is a
good value for most pools, but additionally ensures that the
queue sizes are at least twice the allowed zfs_max_recordsize.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7365
Closes #7404
behlendorf deleted the issue-7365 branch on April 19, 2021