Thread overran stack, or stack corrupted #1354

Closed

behlendorf opened this issue Mar 14, 2013 · 10 comments
@behlendorf
Contributor

Observed when running filebench on Fedora 18 using file vdevs. Based on the trace it looks like ext4 was responsible for a good chunk of the stack usage. I suspect this wouldn't have been an issue on block vdevs.

BUG: scheduling while atomic: filebench/31204/0xffff8800
BUG: unable to handle kernel paging request at 00000005551ec8a0
IP: [<ffffffff8108e867>] resched_task+0x17/0x70
PGD b005e067 PUD 0
Thread overran stack, or stack corrupted
Process filebench (pid: 31204, threadinfo ffff8800ba6a2000, task ffff880009d2c560)
Call Trace:
 <IRQ>
 [<ffffffff81097e19>] check_preempt_wakeup+0x1c9/0x270
 [<ffffffff8108f1e4>] check_preempt_curr+0x84/0xa0
 [<ffffffff8108f23d>] ttwu_do_wakeup+0x3d/0x120
 [<ffffffff8108f596>] ttwu_do_activate.constprop.40+0x66/0x70
 [<ffffffff81092686>] try_to_wake_up+0x226/0x2b0
 [<ffffffff81092722>] default_wake_function+0x12/0x20
 [<ffffffff81080b5d>] autoremove_wake_function+0x1d/0x50
 [<ffffffff81089138>] __wake_up_common+0x58/0x90
 [<ffffffff8108a788>] __wake_up+0x48/0x70
 [<ffffffff8105fcf0>] printk_tick+0x60/0x90
 [<ffffffff8106de8d>] update_process_times+0x4d/0x90
 [<ffffffff810b3a94>] tick_sched_timer+0x64/0xd0
 [<ffffffff81084501>] __run_hrtimer+0x71/0x1e0
 [<ffffffff810b3a30>] ? tick_nohz_handler+0x100/0x100
 [<ffffffff810ad26c>] ? ktime_get_update_offsets+0x4c/0xd0
 [<ffffffff81084e87>] hrtimer_interrupt+0xf7/0x230
 [<ffffffff81623ef9>] smp_apic_timer_interrupt+0x69/0x99
 [<ffffffff81622eca>] apic_timer_interrupt+0x6a/0x70
 <EOI>
 [<ffffffff8105e1ac>] ? vprintk_emit+0x1ec/0x510
 [<ffffffff811bd950>] ? __wait_on_buffer+0x30/0x30
 [<ffffffff8160fbd0>] printk+0x61/0x63
 [<ffffffff811bd950>] ? __wait_on_buffer+0x30/0x30
 [<ffffffff81610051>] __schedule_bug+0x3c/0x59
 [<ffffffff81618ec0>] __schedule+0x720/0x7c0
 [<ffffffff811bd950>] ? __wait_on_buffer+0x30/0x30
 [<ffffffff81619269>] schedule+0x29/0x70
 [<ffffffff8161933f>] io_schedule+0x8f/0xd0
 [<ffffffff811bd95e>] sleep_on_buffer+0xe/0x20
 [<ffffffff81617bff>] __wait_on_bit+0x5f/0x90
 [<ffffffff811bd950>] ? __wait_on_buffer+0x30/0x30
 [<ffffffff81617cac>] out_of_line_wait_on_bit+0x7c/0x90
 [<ffffffff81080b90>] ? autoremove_wake_function+0x50/0x50
 [<ffffffff811bd94e>] __wait_on_buffer+0x2e/0x30
 [<ffffffff811be758>] bh_submit_read+0x68/0x80
 [<ffffffff8124acd3>] ext4_get_branch+0x73/0x140
 [<ffffffff8124bc4c>] ext4_ind_map_blocks+0x1ac/0x6a0
 [<ffffffff816221d6>] ? ftrace_call+0x5/0x2b
 [<ffffffff8120ed4d>] ext4_map_blocks+0x15d/0x2b0
 [<ffffffff816221d6>] ? ftrace_call+0x5/0x2b
 [<ffffffff81211156>] _ext4_get_block+0xa6/0x160
 [<ffffffff81211276>] ext4_get_block+0x16/0x20
 [<ffffffff811c9622>] do_mpage_readpage+0x3b2/0x5f0
 [<ffffffff81139830>] ? __lru_cache_add+0x10/0x90
 [<ffffffff811c999f>] mpage_readpages+0xcf/0x120
 [<ffffffff81211260>] ? noalloc_get_block_write+0x30/0x30
 [<ffffffff81211260>] ? noalloc_get_block_write+0x30/0x30
 [<ffffffff8116f616>] ? alloc_pages_current+0xb6/0x120
 [<ffffffff8120d260>] ? ext4_releasepage+0xe0/0xe0
 [<ffffffff8120d27d>] ext4_readpages+0x1d/0x20
 [<ffffffff81138077>] __do_page_cache_readahead+0x1b7/0x260
 [<ffffffff812da23b>] ? radix_tree_lookup+0xb/0x10
 [<ffffffff811385e9>] ondemand_readahead+0x189/0x230
 [<ffffffff81138763>] page_cache_sync_readahead+0x33/0x50
 [<ffffffff8112db70>] generic_file_aio_read+0x4b0/0x720
 [<ffffffff8118d7a0>] ? do_sync_write+0xe0/0xe0
 [<ffffffff8118d843>] do_sync_read+0xa3/0xe0
 [<ffffffff8118e180>] vfs_read+0xb0/0x180
 [<ffffffffa01bd6a5>] vn_rdwr+0x145/0x490 [spl]
 [<ffffffff816221d6>] ? ftrace_call+0x5/0x2b
 [<ffffffffa02b3534>] vdev_file_io_start+0x94/0x140 [zfs]
 [<ffffffffa02f11fb>] zio_vdev_io_start+0xab/0x2f0 [zfs]
 [<ffffffffa02f1689>] zio_nowait+0xb9/0x150 [zfs]
 [<ffffffffa02b6498>] vdev_mirror_io_start+0x188/0x3c0 [zfs]
 [<ffffffffa02b5ef0>] ? vdev_config_sync+0x180/0x180 [zfs]
 [<ffffffffa02a8643>] ? spa_config_enter+0xb3/0x100 [zfs]
 [<ffffffffa02f1387>] zio_vdev_io_start+0x237/0x2f0 [zfs]
 [<ffffffffa02f1689>] zio_nowait+0xb9/0x150 [zfs]
 [<ffffffffa02543f3>] arc_read_nolock+0x533/0x800 [zfs]
 [<ffffffffa0259960>] ? dbuf_do_evict+0x70/0x70 [zfs]
 [<ffffffffa0254742>] arc_read+0x82/0x170 [zfs]
 [<ffffffffa0259960>] ? dbuf_do_evict+0x70/0x70 [zfs]
 [<ffffffffa028c241>] dsl_read+0x31/0x40 [zfs]
 [<ffffffffa0259ef3>] dbuf_read+0x263/0x840 [zfs]
 [<ffffffffa026d2d3>] dmu_tx_check_ioerr+0xc3/0x140 [zfs]
 [<ffffffffa026e111>] dmu_tx_hold_free+0x131/0x8f0 [zfs]
 [<ffffffff816221d6>] ? ftrace_call+0x5/0x2b
 [<ffffffffa0275ef5>] ? dsl_dataset_prev_snap_txg+0x35/0x70 [zfs]
 [<ffffffffa0260807>] dmu_free_long_range_impl+0x177/0x270 [zfs]
 [<ffffffffa026272b>] dmu_free_long_range+0x4b/0x70 [zfs]
 [<ffffffff816221d6>] ? ftrace_call+0x5/0x2b
 [<ffffffffa02c88a0>] zfs_rmnode+0x60/0x330 [zfs]
 [<ffffffffa02e9257>] zfs_zinactive+0x87/0xd0 [zfs]
 [<ffffffffa02e3d9c>] zfs_inactive+0x7c/0x230 [zfs]
 [<ffffffffa02f7b08>] zpl_evict_inode+0x28/0x30 [zfs]
 [<ffffffff811a7db7>] evict+0xa7/0x1b0
 [<ffffffff811a86b5>] iput+0x105/0x190
 [<ffffffffa02e6371>] zfs_inode_destroy+0xd1/0x110 [zfs]
 [<ffffffffa02f7b2e>] zpl_inode_destroy+0xe/0x10 [zfs]
 [<ffffffff811a7cdc>] destroy_inode+0x3c/0x70
 [<ffffffff811a7e22>] evict+0x112/0x1b0
 [<ffffffff811a7efe>] dispose_list+0x3e/0x60
 [<ffffffff811a8e24>] prune_icache_sb+0x164/0x2f0
 [<ffffffff811912c8>] prune_super+0xf8/0x1b0
 [<ffffffff8113dc64>] shrink_slab+0x154/0x310
 [<ffffffff81184398>] ? mem_cgroup_iter+0xe8/0x200
 [<ffffffff816221d6>] ? ftrace_call+0x5/0x2b
 [<ffffffff81140820>] do_try_to_free_pages+0x440/0x570
 [<ffffffff8113cb0e>] ? pfmemalloc_watermark_ok+0xe/0xa0
 [<ffffffff81140c70>] try_to_free_pages+0xd0/0x180
 [<ffffffff811353c0>] __alloc_pages_nodemask+0x630/0x9d0
 [<ffffffff81171343>] alloc_pages_vma+0xb3/0x190
 [<ffffffff81154699>] handle_pte_fault+0x729/0xac0
 [<ffffffff816221d6>] ? ftrace_call+0x5/0x2b
 [<ffffffff8161a469>] ? _raw_spin_lock+0x9/0x30
 [<ffffffff81152007>] ? __pte_alloc+0x97/0x160
 [<ffffffff81181c58>] do_huge_pmd_anonymous_page+0xc8/0x370
 [<ffffffff816221d6>] ? ftrace_call+0x5/0x2b
 [<ffffffff81155ced>] handle_mm_fault+0x28d/0x340
 [<ffffffff8161ddf2>] do_page_fault+0x142/0x4f0
 [<ffffffff816221d6>] ? ftrace_call+0x5/0x2b
 [<ffffffff816221d6>] ? ftrace_call+0x5/0x2b
 [<ffffffff816221d6>] ? ftrace_call+0x5/0x2b
 [<ffffffff8161d905>] do_async_page_fault+0x35/0x90
 [<ffffffff8161a915>] async_page_fault+0x25/0x30
Code: 30 64 ff ff eb e4 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 e8 37 39 59 00 48 8b 57 08 48 c7 c0 40 3d 01 00 8b 4a 18 <48> 03 04 cd e0 c4 cd 81 8b 00 89 c1 c1 e9 10 66 39 c1 74 3f 48
RIP  [<ffffffff8108e867>] resched_task+0x17/0x70
 RSP <ffff88011fd03c38>
CR2: 00000005551ec8a0
---[ end trace c62dc16169ad09d8 ]---
Kernel panic - not syncing: Fatal exception in interrupt
@cwedgwood
Contributor

@behlendorf out of paranoia I'm still using 16KiB stacks

The deepest stacks I see:

[ 1041.608671] rsync (8027) used greatest stack depth: 10416 bytes left

(likely the ZPL, not a Linux fs over a zvol)

That leaves about 2K free on a default 8K stack; I can imagine that's not hard to exhaust.

For stack use where zvols are involved:

[   90.801533] blkid (4707) used greatest stack depth: 11128 bytes left
[   90.972974] blkid (4760) used greatest stack depth: 11064 bytes left

That is well under 3K free when you consider that not all of an 8K stack would be available.

@florianernst

FWIW, I experienced "Thread overran stack, or stack corrupted" as well while testing a "zfs send -D -Rv $SNAP | zfs receive -uF $DEST/$SOURCE-test".

This occurred with the source pool containing the $SNAP snapshot being backed by LVM on top of mdadm, and the destination pool $DEST being backed by a single, full disk attached to the same system. The system was running an up-to-date Debian Squeeze with a Proxmox kernel 2.6.32-17-pve and with spl 0.6.0.98-0ubuntu1lucid1 / zfs 0.6.0.98-0ubuntu2lucid1 (based on 0.6.0-rc14, as far as I gather). zfs_arc_max=4294967296 was set; the system has 8 GB of RAM in total.

So far, the above send | receive has never failed to crash the system eventually, but the trace varied. Please find below the messages from a hard lockup:

BUG: unable to handle kernel paging request at 00000001e7afd320
IP: [<ffffffff810554ae>] hrtick_start_fair+0x3e/0x190
PGD 105a2f067 PUD 0
Thread overran stack, or stack corrupted
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/net/bond0.113/speed
CPU 3
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat vzcpt nfs lockd fscache nfs_acl auth_rpcgss sunrpc vzdquota vzmon vzdev xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle xt_limit xt_dscp xt_multiport vzevent zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl ipt_REJECT iptable_filter ip_tables fuse bonding 8021q garp ext4 jbd2 xt_state nf_conntrack_ftp nf_conntrack_ipv6 nf_defrag_ipv6 ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack snd_pcsp zlib_deflate snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 tpm_tis tpm serio_raw tpm_bios i2c_core i3200_edac edac_core ext3 jbd mbcache raid1 sg e1000 ahci e1000e [last unloaded: scsi_wait_scan]

Pid: 67335, comm: zfs veid: 0 Tainted: P           ---------------    2.6.32-17-pve #1 042stab065_3 Intel Corporation S3210SH/S3210SH
RIP: 0010:[<ffffffff810554ae>]  [<ffffffff810554ae>] hrtick_start_fair+0x3e/0x190
RSP: 0018:ffff88014cbdff98  EFLAGS: 00010082
RAX: 000000000001e9c0 RBX: ffff88002839e9c0 RCX: 0000000000000001
RDX: 000000004cbe0038 RSI: ffff8801ead2af10 RDI: ffff88002839e9c0
RBP: ffff88014cbdffb8 R08: ffff88021ad9e9f8 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000000007d R12: ffff8801ead2af10
R13: 000000000001e9c0 R14: ffff88021ad50600 R15: 0000000000000001
FS:  00007f06877a9be0(0000) GS:ffff880028380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000001e7afd320 CR3: 000000013cb40000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process zfs (pid: 67335, veid: 0, threadinfo ffff88014cbe0000, task ffff8801ead2af10)
Stack:
 0000000000000000 ffff88002839e9c0 ffff88002839ea28 0000000000000001
<d> ffff88014cbdffe8 ffffffff810668af ffff880219454600 ffff88002839e9c0
<d> 0000000000000003 0000000000000003 ffff88014cbe0018 ffffffff810545a6
Call Trace:
Code: 6c 24 10 4c 89 74 24 18 0f 1f 44 00 00 48 8b 56 08 49 c7 c5 c0 e9 01 00 4c 8b b6 c0 01 00 00 4c 89 e8 48 89 fb 49 89 f4 8b 52 18 <48> 03 04 d5 60 d1 bf 81 48 39 c7 0f 85 2b 01 00 00 f6 05 3b 11
RIP  [<ffffffff810554ae>] hrtick_start_fair+0x3e/0x190
 RSP <ffff88014cbdff98>
CR2: 00000001e7afd320
---[ end trace ee8574bf2f4941bf ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 67335, comm: zfs veid: 0 Tainted: P      D    ---------------    2.6.32-17-pve #1
Call Trace:
------------[ cut here ]------------
kernel BUG at arch/x86/kernel/traps.c:544!
invalid opcode: 0000 [#2] SMP
last sysfs file: /sys/devices/virtual/net/bond0.113/speed
CPU 3
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat vzcpt nfs lockd fscache nfs_acl auth_rpcgss sunrpc vzdquota vzmon vzdev xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle xt_limit xt_dscp xt_multiport vzevent zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl ipt_REJECT iptable_filter ip_tables fuse bonding 8021q garp ext4 jbd2 xt_state nf_conntrack_ftp nf_conntrack_ipv6 nf_defrag_ipv6 ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack snd_pcsp zlib_deflate snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 tpm_tis tpm serio_raw tpm_bios i2c_core i3200_edac edac_core ext3 jbd mbcache raid1 sg e1000 ahci e1000e [last unloaded: scsi_wait_scan]

Pid: 67335, comm: zfs veid: 0 Tainted: P      D    ---------------    2.6.32-17-pve #1 042stab065_3 Intel Corporation S3210SH/S3210SH
RIP: 0010:[<ffffffff8152bab1>]  [<ffffffff8152bab1>] do_nmi+0x251/0x2e0
RSP: 0018:ffff880028387f28  EFLAGS: 00010002
RAX: ffff88014cbe1fd8 RBX: ffff880028387f58 RCX: 00000000c0000101
RDX: 00000000ffff8800 RSI: ffffffffffffffff RDI: ffff880028387f58
RBP: ffff880028387f48 R08: ffffffff817b9686 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000065e6db78
R13: 0000000000000003 R14: 0000000000000a64 R15: 0000000000000046
FS:  00007f06877a9be0(0000) GS:ffff880028380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000001e7afd320 CR3: 000000013cb40000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process zfs (pid: 67335, veid: 0, threadinfo ffff88014cbe0000, task ffff8801ead2af10)
Stack:
 0000000000000001 0000000065e6db78 0000000000000003 0000000000000a64
<d> ffff88014cbdfbd8 ffffffff8152b2d0 0000000000000046 0000000000000a64
<d> 0000000000000003 0000000065e6db78 ffff88014cbdfbd8 0000000000000003
Call Trace:
 <NMI>
 [<ffffffff8152b2d0>] nmi+0x20/0x30
 [<ffffffff8127d0ea>] ? delay_tsc+0x4a/0x80
 <<EOE>>
Code: 81 e8 1a c0 ff ff 83 3d cd 27 8c 00 00 75 24 83 3d e8 27 8c 00 00 75 1b 48 c7 c7 90 cd 78 81 31 c0 e8 fa bf ff ff e9 09 fe ff ff <0f> 0b eb fe 0f 0b eb fe 48 c7 c7 23 8c 78 81 31 c0 e8 77 be ff
RIP  [<ffffffff8152bab1>] do_nmi+0x251/0x2e0
 RSP <ffff880028387f28>
---[ end trace ee8574bf2f4941c0 ]---

Curiously enough, the very same system sometimes crashes differently during the same test. In the occurrence listed below the kernel was even still able to issue a reset, so no hardware reset was required:

Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
Pid: 59, comm: kswapd0 veid: 0 Tainted: P           ---------------    2.6.32-17-pve #1
Call Trace:
 <NMI>  [<ffffffff815279de>] ? panic+0xa0/0x168
 [<ffffffff810eb28d>] ? watchdog_overflow_callback+0xcd/0xd0
 [<ffffffff8111fead>] ? __perf_event_overflow+0x9d/0x230
 [<ffffffff81120464>] ? perf_event_overflow+0x14/0x20
 [<ffffffff8101ed76>] ? intel_pmu_handle_irq+0x336/0x550
 [<ffffffff8152d746>] ? kprobe_exceptions_notify+0x16/0x450
 [<ffffffff8152c219>] ? perf_event_nmi_handler+0x39/0xb0
 [<ffffffff8152dd96>] ? notifier_call_chain+0x56/0x80
 [<ffffffff8152ddfa>] ? atomic_notifier_call_chain+0x1a/0x20
 [<ffffffff8109cdce>] ? notify_die+0x2e/0x30
 [<ffffffff8152b9eb>] ? do_nmi+0x18b/0x2e0
 [<ffffffff8152b2d0>] ? nmi+0x20/0x30
 [<ffffffff8152aa2d>] ? _spin_lock_irqsave+0x2d/0x40
 <<EOE>>  [<ffffffff81053382>] ? __wake_up+0x32/0x70
 [<ffffffff81407359>] ? md_wakeup_thread+0x39/0x70
 [<ffffffffa00a3d55>] ? md_raid1_unplug_device+0x25/0x30 [raid1]
 [<ffffffffa00a71b1>] ? make_request+0x6f1/0xd00 [raid1]
 [<ffffffff8118044b>] ? cache_grow+0x1eb/0x300
 [<ffffffff8140ca05>] ? md_make_request+0xd5/0x210
 [<ffffffff812706e1>] ? cpumask_any_but+0x31/0x50
 [<ffffffff812530fe>] ? generic_make_request+0x25e/0x530
 [<ffffffff810f0f85>] ? call_rcu_sched+0x15/0x20
 [<ffffffff812534c3>] ? submit_bio+0xf3/0x1a0
 [<ffffffff8113de08>] ? test_set_page_writeback+0xe8/0x1f0
 [<ffffffff8116c7cf>] ? swap_writepage+0xbf/0x110
 [<ffffffff8114194d>] ? pageout+0x14d/0x350
 [<ffffffff81143965>] ? shrink_page_list+0x745/0xaf0
 [<ffffffff811440fc>] ? shrink_inactive_list+0x3ec/0xc10
 [<ffffffff81142845>] ? move_active_pages_to_lru+0x215/0x2b0
 [<ffffffff81012b79>] ? read_tsc+0x9/0x20
 [<ffffffffa041f3de>] ? spl_slab_reclaim+0x6e/0x3f0 [spl]
 [<ffffffff81144d03>] ? shrink_active_list+0x3e3/0x4a0
 [<ffffffff8100bcce>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff811453b3>] ? shrink_zone+0x5f3/0x900
 [<ffffffff81134a8d>] ? zone_watermark_ok_safe+0xad/0xc0
 [<ffffffff81146cf9>] ? balance_pgdat+0x739/0x820
 [<ffffffff811428e0>] ? isolate_pages_global+0x0/0x530
 [<ffffffff8114fa96>] ? set_pgdat_percpu_threshold+0xa6/0xd0
 [<ffffffff81146f11>] ? kswapd+0x131/0x3a0
 [<ffffffff81096b50>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81146de0>] ? kswapd+0x0/0x3a0
 [<ffffffff81096576>] ? kthread+0x96/0xa0
 [<ffffffff8100c20a>] ? child_rip+0xa/0x20
 [<ffffffff810964e0>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Rebooting in 2400 seconds.

@behlendorf
Contributor Author

@florianernst To determine the root cause of your stack overflow you're going to need to rebuild your kernel with the following patch to enable 16k stacks. This will prevent the overflow from occurring and allow you to grab a detailed stack trace like the one below.

diff -uNrp a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
--- a/arch/x86/include/asm/page_64_types.h       2013-03-16 12:06:47.675203835 -0700
+++ b/arch/x86/include/asm/page_64_types.h        2013-03-16 12:05:55.224891720 -0700
@@ -1,7 +1,7 @@
 #ifndef _ASM_X86_PAGE_64_DEFS_H
 #define _ASM_X86_PAGE_64_DEFS_H

-#define THREAD_SIZE_ORDER      1
+#define THREAD_SIZE_ORDER      2
 #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
 #define CURRENT_MASK (~(THREAD_SIZE - 1))

This detailed stack call graph is for the case I'm observing. Somewhat frustratingly, zfs is responsible for less than 2k of the total stack usage. However, since we probably only need to save a hundred bytes or so, we still may be able to do something about this.

        Depth    Size   Location    (70 entries)
        -----    ----   --------
  0)     7696     192   cfq_dispatch_insert+0x12b/0x260
  1)     7504     240   cfq_dispatch_requests+0x1d6/0xc10
  2)     7264      48   blk_peek_request+0x5e/0x220
  3)     7216     128   virtblk_request+0x3c/0x480 [virtio_blk]
  4)     7088      32   __blk_run_queue+0x37/0x50
  5)     7056     208   cfq_insert_request+0x374/0x540
  6)     6848      64   __elv_add_request+0x16a/0x290
  7)     6784     112   blk_flush_plug_list+0x199/0x210
  8)     6672      32   io_schedule+0x7a/0xd0
  9)     6640      16   sleep_on_buffer+0xe/0x20
 10)     6624      80   __wait_on_bit+0x60/0x90
 11)     6544     112   out_of_line_wait_on_bit+0x7c/0x90
 12)     6432      16   __wait_on_buffer+0x2a/0x30
 13)     6416      32   bh_submit_read+0x68/0x80
 14)     6384      80   ext4_get_branch+0xcf/0x140
 15)     6304     256   ext4_ind_map_blocks+0xc0/0x6c0
 16)     6048      96   ext4_map_blocks+0x125/0x2b0
 17)     5952      96   _ext4_get_block+0x87/0x190
 18)     5856      16   ext4_get_block+0x16/0x20
 19)     5840     208   do_mpage_readpage+0x456/0x6a0
 20)     5632     224   mpage_readpages+0xdb/0x140
 21)     5408      48   ext4_readpages+0x45/0x60
 22)     5360     192   __do_page_cache_readahead+0x1b7/0x240
 23)     5168      96   ondemand_readahead+0x234/0x240
 24)     5072      16   page_cache_sync_readahead+0x33/0x50
 25)     5056     224   generic_file_aio_read+0x540/0x6c0
 26)     4832     288   do_sync_read+0xa7/0xe0
 27)     4544      48   vfs_read+0xa9/0x180
 28)     4496     128   vn_rdwr+0x145/0x480 [spl]
 29)     4368      80   vdev_file_io_start+0x94/0x140 [zfs]
 30)     4288      64   zio_vdev_io_start+0xaa/0x2f0 [zfs]
 31)     4224      64   zio_nowait+0xba/0x170 [zfs]
 32)     4160     144   vdev_mirror_io_start+0x188/0x3c0 [zfs]
 33)     4016      64   zio_vdev_io_start+0x237/0x2f0 [zfs]
 34)     3952      64   zio_nowait+0xba/0x170 [zfs]
 35)     3888     176   arc_read_nolock+0x3cf/0x820 [zfs]
 36)     3712     112   arc_read+0x7f/0x170 [zfs]
 37)     3600      48   dsl_read+0x31/0x40 [zfs]
 38)     3552     208   dbuf_read+0x25d/0x810 [zfs]
 39)     3344      80   dmu_tx_check_ioerr+0xe2/0x140 [zfs]
 40)     3264     192   dmu_tx_hold_free+0x11c/0x940 [zfs]
 41)     3072     128   dmu_free_long_range_impl+0x15f/0x280 [zfs]
 42)     2944      80   dmu_free_long_range+0x4b/0x70 [zfs]
 43)     2864      80   zfs_rmnode+0x5d/0x340 [zfs]
 44)     2784      64   zfs_zinactive+0x8d/0xf0 [zfs]
 45)     2720      80   zfs_inactive+0x7c/0x230 [zfs]
 46)     2640      32   zpl_evict_inode+0x28/0x30 [zfs]
 47)     2608      48   evict+0xa2/0x1a0
 48)     2560      48   iput+0x103/0x190
 49)     2512      48   zfs_inode_destroy+0xd1/0x110 [zfs]
 50)     2464      16   zpl_inode_destroy+0xe/0x10 [zfs]
 51)     2448      32   destroy_inode+0x3c/0x70
 52)     2416      48   evict+0x109/0x1a0
 53)     2368      48   dispose_list+0x3e/0x60
 54)     2320     112   prune_icache_sb+0x161/0x300
 55)     2208      64   prune_super+0xe5/0x1b0
 56)     2144     176   shrink_slab+0x149/0x2d0
 57)     1968     160   do_try_to_free_pages+0x424/0x620
 58)     1808     176   try_to_free_pages+0xe0/0x160
 59)     1632     288   __alloc_pages_nodemask+0x663/0xa80
 60)     1344     112   alloc_pages_vma+0xa5/0x150
 61)     1232     208   handle_pte_fault+0x7d0/0xb20
 62)     1024     160   handle_mm_fault+0x291/0x650
 63)      864     272   __do_page_fault+0x171/0x4d0
 64)      592      16   do_page_fault+0xe/0x10
 65)      576      32   do_async_page_fault+0x35/0x90
 66)      544     288   async_page_fault+0x28/0x30
 67)      256      48   vfs_read+0xa9/0x180
 68)      208      80   sys_read+0x52/0xa0
 69)      128     128   system_call_fastpath+0x16/0x1b

@behlendorf
Contributor Author

After giving this some thought, I think the right thing to do here isn't to shave a few bytes off the stack. Instead we should rework the vdev_file_io_start() code to operate asynchronously rather than synchronously. This has a few advantages.

  1. Stack space: By moving the file I/O portion of this stack to its own context we allow for deeper layering (i.e. ext+lvm+md+dm+vd/lo/sd/hd).
  2. Performance: All file vdev I/O is currently done sequentially and synchronously. Since this is predominantly used for testing/debugging it's never been a huge concern. But making it asynchronous should improve performance, which I doubt anyone will complain about.
  3. We can take this opportunity to move away from vn_rdwr() and use the native Linux interfaces. This helps wean us off the spl's vnode code and should make the code more readable.

Two ways to go about the above changes occur to me, and both should be investigated. We could either a) use Linux's aio interfaces for this, if they support a callback on the kernel side. The downside here is that we'll be dependent on the underlying file system to correctly support aio, and not all filesystems do. Or b) we could use a taskq in the ZFS code to make the I/O asynchronous, with each individual work item being synchronous. This has the advantage of buying us the most possible stack and being insensitive to the underlying aio implementation; a rough sketch of the taskq approach follows below.
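
For illustration only, here is a minimal sketch of option b) using the SPL taskq interfaces. The taskq variable, worker function name, and elided error handling are assumptions made for the sketch, not the actual patch:

/*
 * Sketch: dispatch the synchronous file I/O to a dedicated taskq so it
 * runs on a fresh thread stack instead of the deep caller stack.
 */
static taskq_t *vdev_file_taskq;	/* assumed created at module init */

static void
vdev_file_io_worker(void *arg)		/* hypothetical worker name */
{
	zio_t *zio = (zio_t *)arg;
	vdev_t *vd = zio->io_vd;
	vdev_file_t *vf = vd->vdev_tsd;
	ssize_t resid = 0;

	/* The existing synchronous vn_rdwr() call moves here unchanged;
	 * each individual work item remains synchronous. */
	zio->io_error = vn_rdwr(zio->io_type == ZIO_TYPE_READ ?
	    UIO_READ : UIO_WRITE, vf->vf_vnode, zio->io_data,
	    zio->io_size, zio->io_offset, UIO_SYSSPACE,
	    0, RLIM64_INFINITY, kcred, &resid);

	if (resid != 0 && zio->io_error == 0)
		zio->io_error = ENOSPC;

	zio_interrupt(zio);		/* complete the zio asynchronously */
}

static int
vdev_file_io_start(zio_t *zio)
{
	/* ... ioctl/flush handling elided ... */
	taskq_dispatch(vdev_file_taskq, vdev_file_io_worker, zio, TQ_SLEEP);
	return (ZIO_PIPELINE_STOP);
}

Dispatching with TQ_SLEEP means the caller may block briefly waiting for a task structure, but the deep I/O path itself always runs on the taskq thread's own stack.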

Whichever we end up picking, I'm deferring this work until 0.6.2 since this isn't a new issue in the code.

@florianernst

@behlendorf Thanks for the hint. I recompiled my kernel (for anyone who might want to try this as well: for the Proxmox RHEL6-based kernel I had to adjust THREAD_ORDER instead of THREAD_SIZE_ORDER; see the sketch below), and ever since booting the recompiled kernel I have failed to reproduce the previously described crashes. Please advise whether there is anything I should try / provide while the system is running.
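
For reference, on those 2.6.32-era kernels the equivalent change is roughly the following (a sketch; the exact context may differ across distro patchsets):

--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
-#define THREAD_ORDER   1
+#define THREAD_ORDER   2
 #define THREAD_SIZE  (PAGE_SIZE << THREAD_ORDER)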

I had some hung tasks, though, when trying to "zfs destroy -r" a dataset:

kernel: INFO: task zfs:878425 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: zfs           D ffff8800b3394180     0 878425 300624    0 0x00000000
kernel: ffff8801dad4bc08 0000000000000082 0000000000000000 ffffffffa04e5289
kernel: ffff88011a5b3e40 ffff880181a5af88 ffff8801dad4bb98 ffffffffa04e56d0
kernel: ffff88011a5b3e40 ffff8800b3394730 ffff8801dad4bfd8 ffff8801dad4bfd8
kernel: Call Trace:
kernel: [<ffffffffa04e5289>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
kernel: [<ffffffffa04e56d0>] ? dmu_buf_rele+0x30/0x40 [zfs]
kernel: [<ffffffff8152a7b5>] rwsem_down_failed_common+0x95/0x1d0
kernel: [<ffffffffa04e56d0>] ? dmu_buf_rele+0x30/0x40 [zfs]
kernel: [<ffffffffa04e5289>] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
kernel: [<ffffffff8152a913>] rwsem_down_write_failed+0x23/0x30
kernel: [<ffffffff8127e1b3>] call_rwsem_down_write_failed+0x13/0x20
kernel: [<ffffffff81529e02>] ? down_write+0x32/0x40
kernel: [<ffffffffa0506677>] dsl_dataset_make_exclusive+0x87/0xe0 [zfs]
kernel: [<ffffffffa050bfdb>] dsl_dataset_destroy+0x33b/0x4c0 [zfs]
kernel: [<ffffffffa04f1b2b>] dmu_objset_destroy+0x3b/0x50 [zfs]
kernel: [<ffffffffa055fe80>] zfs_ioc_destroy+0x40/0x90 [zfs]
kernel: [<ffffffffa055fa0d>] zfsdev_ioctl+0xfd/0x1d0 [zfs]
kernel: [<ffffffff811a9f72>] vfs_ioctl+0x22/0xa0
kernel: [<ffffffff811aa0ff>] ? do_vfs_ioctl+0x6f/0x5d0
kernel: [<ffffffff811aa11a>] do_vfs_ioctl+0x8a/0x5d0
kernel: [<ffffffff81160d11>] ? do_munmap+0x2c1/0x360
kernel: [<ffffffff811aa6af>] sys_ioctl+0x4f/0x80
kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b

But this might be entirely unrelated, as the test system was under heavy I/O load at that time ...

@behlendorf
Contributor Author

@florianernst Now that you're not overrunning the stack we can get the debugging information which is needed. Here's what you'll need to do.

# Enable the Linux kernel's stack tracer
echo 1 > /proc/sys/kernel/stack_tracer_enabled

# Mount the debugfs filesystem (it may already be mounted)
mount -t debugfs debugfs /sys/kernel/debug/

# Periodically check the worst case stack use.
cat /sys/kernel/debug/tracing/stack_max_size
cat /sys/kernel/debug/tracing/stack_trace

If the value in stack_max_size exceeds roughly 7500 bytes you can reasonably expect that your system (when it only had 8k stacks) would have overrun the stack and crashed. But now that there are larger stacks on the system the overrun won't occur, and you can read stack_trace to get a detailed analysis. This is the information you'll need to post to the issue so we can determine exactly what used up all the stack space. Then we can better determine what can be done to live within the default 8k limit.
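
For example, a loop along these lines (a sketch; the log path and interval are arbitrary) will capture the worst case for posting later:

# Sketch: record the worst-case stack usage once a minute
while sleep 60; do
    date
    cat /sys/kernel/debug/tracing/stack_max_size
    cat /sys/kernel/debug/tracing/stack_trace
done >> /var/tmp/stack_trace.log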

@florianernst

@behlendorf Thanks a lot for the info, much appreciated. There you go:

        Depth    Size   Location    (66 entries)
        -----    ----   --------
  0)     9280     128   ftrace_call+0x5/0x2b
  1)     9152      16   mempool_alloc_slab+0x15/0x20
  2)     9136     144   mempool_alloc+0x67/0x170
  3)     8992      16   scsi_sg_alloc+0x4c/0x60
  4)     8976     112   __sg_alloc_table+0x66/0x130
  5)     8864      48   scsi_alloc_sgtable+0x28/0x60
  6)     8816      32   scsi_init_sgtable+0x21/0x70
  7)     8784      64   scsi_init_io+0x3a/0x180
  8)     8720      32   scsi_setup_fs_cmnd+0x79/0xe0
  9)     8688     144   sd_prep_fn+0x157/0xf10
 10)     8544      48   blk_peek_request+0xc7/0x210
 11)     8496     128   scsi_request_fn+0x63/0x790
 12)     8368      32   __generic_unplug_device+0x33/0x40
 13)     8336      32   generic_unplug_device+0x2e/0x50
 14)     8304      32   blk_unplug+0x34/0x70
 15)     8272      64   unplug_slaves+0x84/0xe0 [raid1]
 16)     8208      32   md_raid1_unplug_device+0x19/0x30 [raid1]
 17)     8176     304   make_request+0x6f1/0xd00 [raid1]
 18)     7872     128   md_make_request+0xd5/0x210
 19)     7744     208   generic_make_request+0x25e/0x530
 20)     7536     112   submit_bio+0xf3/0x1a0
 21)     7424      48   swap_writepage+0xbf/0x110
 22)     7376     208   pageout+0x14d/0x350
 23)     7168     336   shrink_page_list+0x745/0xaf0
 24)     6832     544   shrink_inactive_list+0x3ec/0xc10
 25)     6288     192   shrink_zone+0x5f3/0x900
 26)     6096     192   do_try_to_free_pages+0x2a0/0x7f0
 27)     5904     192   try_to_free_pages+0xa0/0x130
 28)     5712     272   __alloc_pages_nodemask+0x628/0xb80
 29)     5440      48   kmem_getpages+0x59/0x140
 30)     5392     128   fallback_alloc+0x1b7/0x260
 31)     5264      96   ____cache_alloc_node+0x99/0x150
 32)     5168      64   kmem_cache_alloc_node_notrace+0x6f/0x130
 33)     5104      80   __kmalloc_node+0x8e/0x110
 34)     5024      80   __alloc_skb+0x83/0x1c0
 35)     4944     176   sock_alloc_send_skb2+0x2e7/0x3b0
 36)     4768     176   unix_stream_sendmsg+0x2df/0x4d0
 37)     4592     208   sock_aio_write+0x142/0x150
 38)     4384     320   do_sync_write+0xfa/0x140
 39)     4064      64   vfs_write+0x184/0x1a0
 40)     4000     128   vn_rdwr+0x93/0x2a0 [spl]
 41)     3872      96   dump_bytes+0x91/0xc0 [zfs]
 42)     3776     176   backup_cb+0x375/0x6b0 [zfs]
 43)     3600     208   traverse_visitbp+0x30d/0x720 [zfs]
 44)     3392     208   traverse_visitbp+0x43d/0x720 [zfs]
 45)     3184     208   traverse_visitbp+0x43d/0x720 [zfs]
 46)     2976     208   traverse_visitbp+0x43d/0x720 [zfs]
 47)     2768     112   traverse_dnode+0x80/0x110 [zfs]
 48)     2656     208   traverse_visitbp+0x551/0x720 [zfs]
 49)     2448     208   traverse_visitbp+0x43d/0x720 [zfs]
 50)     2240     208   traverse_visitbp+0x43d/0x720 [zfs]
 51)     2032     208   traverse_visitbp+0x43d/0x720 [zfs]
 52)     1824     208   traverse_visitbp+0x43d/0x720 [zfs]
 53)     1616     208   traverse_visitbp+0x43d/0x720 [zfs]
 54)     1408     208   traverse_visitbp+0x43d/0x720 [zfs]
 55)     1200     112   traverse_dnode+0x80/0x110 [zfs]
 56)     1088     208   traverse_visitbp+0x5f5/0x720 [zfs]
 57)      880     112   traverse_impl+0x174/0x310 [zfs]
 58)      768      48   traverse_dataset+0x45/0x50 [zfs]
 59)      720     112   dmu_send+0x32e/0x4f0 [zfs]
 60)      608     144   zfs_ioc_send+0x302/0x390 [zfs]
 61)      464      48   zfsdev_ioctl+0xfd/0x1d0 [zfs]
 62)      416      64   vfs_ioctl+0x22/0xa0
 63)      352     144   do_vfs_ioctl+0x8a/0x5d0
 64)      208      80   sys_ioctl+0x4f/0x80
 65)      128     128   system_call_fastpath+0x16/0x1b

As a side note, on my test system stack_max_size lingered for quite a while at around 6800, but then, without any discernible change in the use pattern (it was running zfs send ... | zfs receive ... followed by zfs destroy -r ... all the time), it suddenly jumped to 9280.
If there is anything else I should try / provide, please don't hesitate to mention it; I'm at liberty to run quite a few tests.

BTW, colleagues mentioned they had another crash when cd'ing into .zfs/snapshot dirs. I'll see whether I can find a reproducible testcase for this and will open another issue if I can.

@behlendorf
Contributor Author

@florianernst Thanks, that's what we needed. We'll give a little thought to how to get the stack usage under control; in the meantime there's really no downside to running with the 16k-stack kernel until we sort this out.

@TruHuynh

Hi, I ran into this issue too (CentOS 6.4, kernel-2.6.32-358.2.1.el6.x86_64, spl/zfs 0.6.1) with just zfs send pool@snapshot > /nfs/file.zfs over a 10GbE network (MTU 9k).

I have rebuilt a 16k-stack kernel (2.6.32-358.2.1.el6.16k_stack.x86_64) and here is a stack trace:

Thu Apr 18 11:10:13 CEST 2013
==> /sys/kernel/debug/tracing/stack_max_size <==
8240

==> /sys/kernel/debug/tracing/stack_trace <==
        Depth    Size   Location    (56 entries)
        -----    ----   --------
  0)     8176      80   ftrace_call+0x5/0x2b
  1)     8096      64   cpu_quiet_msk+0x77/0x130
  2)     8032      96   rcu_process_dyntick+0x107/0x120
  3)     7936      64   force_quiescent_state+0x118/0x1a0
  4)     7872      48   __call_rcu+0x105/0x160
  5)     7824      16   call_rcu_sched+0x15/0x20
  6)     7808      16   call_rcu+0xe/0x10
  7)     7792     272   radix_tree_delete+0x150/0x2b0
  8)     7520      32   __remove_from_page_cache+0x21/0xe0
  9)     7488      64   __remove_mapping+0xa0/0x160
 10)     7424     336   shrink_page_list.clone.3+0x3f7/0x650
 11)     7088     432   shrink_inactive_list+0x343/0x830
 12)     6656     208   shrink_mem_cgroup_zone+0x3ae/0x610
 13)     6448     112   shrink_zone+0x63/0xb0
 14)     6336     208   zone_reclaim+0x349/0x400
 15)     6128     288   get_page_from_freelist+0x69c/0x830
 16)     5840     320   __alloc_pages_nodemask+0x113/0x8d0
 17)     5520      48   kmem_getpages+0x62/0x170
 18)     5472     112   cache_grow+0x2cf/0x320
 19)     5360     112   cache_alloc_refill+0x202/0x240
 20)     5248      64   kmem_cache_alloc+0x15f/0x190
 21)     5184      64   nfs_create_request+0x3e/0xe0 [nfs]
 22)     5120     112   nfs_updatepage+0x11c/0x4e0 [nfs]
 23)     5008      80   nfs_write_end+0x152/0x2b0 [nfs]
 24)     4928     208   generic_file_buffered_write+0x18a/0x2e0
 25)     4720     192   __generic_file_aio_write+0x260/0x490
 26)     4528      80   generic_file_aio_write+0x88/0x100
 27)     4448      80   nfs_file_write+0xde/0x1f0 [nfs]
 28)     4368     304   do_sync_write+0xfa/0x140
 29)     4064      64   vfs_write+0xb8/0x1a0
 30)     4000     128   vn_rdwr+0x93/0x2a0 [spl]
 31)     3872      96   dump_bytes+0x91/0xc0 [zfs]
 32)     3776     176   backup_cb+0x375/0x6b0 [zfs]
 33)     3600     208   traverse_visitbp+0x2fd/0x710 [zfs]
 34)     3392     208   traverse_visitbp+0x42d/0x710 [zfs]
 35)     3184     208   traverse_visitbp+0x42d/0x710 [zfs]
 36)     2976     208   traverse_visitbp+0x42d/0x710 [zfs]
 37)     2768     112   traverse_dnode+0x80/0x110 [zfs]
 38)     2656     208   traverse_visitbp+0x541/0x710 [zfs]
 39)     2448     208   traverse_visitbp+0x42d/0x710 [zfs]
 40)     2240     208   traverse_visitbp+0x42d/0x710 [zfs]
 41)     2032     208   traverse_visitbp+0x42d/0x710 [zfs]
 42)     1824     208   traverse_visitbp+0x42d/0x710 [zfs]
 43)     1616     208   traverse_visitbp+0x42d/0x710 [zfs]
 44)     1408     208   traverse_visitbp+0x42d/0x710 [zfs]
 45)     1200     112   traverse_dnode+0x80/0x110 [zfs]
 46)     1088     208   traverse_visitbp+0x5e5/0x710 [zfs]
 47)      880     112   traverse_impl+0x174/0x310 [zfs]
 48)      768      48   traverse_dataset+0x45/0x50 [zfs]
 49)      720     112   dmu_send+0x32e/0x4f0 [zfs]
 50)      608     144   zfs_ioc_send+0x302/0x390 [zfs]
 51)      464      48   zfsdev_ioctl+0xfd/0x1d0 [zfs]
 52)      416      64   vfs_ioctl+0x22/0xa0
 53)      352     144   do_vfs_ioctl+0x84/0x580
 54)      208      80   sys_ioctl+0x81/0xa0
 55)      128     128   system_call_fastpath+0x16/0x1b

@behlendorf
Contributor Author

Fixed in master.

044baf0 Use taskq for dump_bytes()
7ef5e54 Illumos #3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock con
55d85d5 Illumos #3329, #3330, #3331, #3335
5853fe7 Illumos #3306, #3321

unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013
3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man
     page and help

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@31d7e8f
  https://www.illumos.org/issues/3306
  https://www.illumos.org/issues/3321

The vdev_file.c implementation in this patch diverges significantly
from the upstream version.  For consistency with the vdev_disk.c
code the upstream version leverages the Illumos bio interfaces.
This makes sense for Illumos but not for ZoL for two reasons.

1) The vdev_disk.c code in ZoL has been rewritten to use the
   Linux block device interfaces which differ significantly
   from those in Illumos.  Therefore, updating the vdev_file.c
   to use the Illumos interfaces doesn't get you consistency
   with vdev_disk.c.

2) Using the upstream patch as is would require implementing
   compatibility code for those Solaris block device interfaces
   in user and kernel space.  That additional complexity could
   lead to confusion and doesn't buy us anything.

For these reasons I've opted to simply move the existing vn_rdwr()
as is in to the taskq function.  This has the advantage of being
low risk and easy to understand.  Moving the vn_rdwr() function
in to its own taskq thread also neatly avoids the possibility of
a stack overflow.

Finally, because of the additional work which is being handled by
the free taskq the number of threads has been increased.  The
thread count under Illumos defaults to 100 but was decreased to 2
in commit 08d08e due to contention.  We increase it to 8 until
   the contention can be addressed by porting Illumos openzfs#3581.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#1354