Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZFS v0.7.2 call trace - zfs_range_lock issue - zpool hungs #6781

Closed
marcinkuk opened this issue Oct 24, 2017 · 6 comments
Closed

ZFS v0.7.2 call trace - zfs_range_lock issue - zpool hungs #6781

marcinkuk opened this issue Oct 24, 2017 · 6 comments
Projects
Milestone

Comments

@marcinkuk
Copy link

System information

Type Version/Name
Distribution Name Proxmox (Debian)
Distribution Version Proxmox 5.0 - Debian 9.1
Linux Kernel 4.13.4-1-pve
Architecture x86_64
ZFS Version v0.7.2-1
SPL Version v0.7.2-1

Describe the problem you're observing

Zpool functionality hungs - upload > 100

Describe how to reproduce the problem

Don't know how to reproduce. Just use server.
I have 4 similar configured servers. On 2 of them I have this issue average 1 time per hour.
On 2 others servers never happened.

Include any warning/errors/backtraces from the system logs

[ 2538.605247] general protection fault: 0000 [#1] SMP
[ 2538.605259] Modules linked in: xt_tcpudp xt_multiport bluetooth ecdh_generic veth ip_set ip6table_filter ip6_tables iptable_filter softdog nfnetlink_log nfnetlink adt7475 hwmon_vid intel_powerclamp snd_hda_codec_analog snd_hda_codec_generic coretemp kvm_intel kvm dell_wmi dell_smbios irqbypass dcdbas gpio_ich ppdev sparse_keymap wmi_bmof dell_smm_hwmon nouveau intel_cstate mxm_wmi video snd_hda_intel ttm pcspkr serio_raw snd_hda_codec drm_kms_helper snd_hda_core lpc_ich snd_hwdep drm snd_pcm snd_timer i2c_algo_bit fb_sys_fops syscopyarea snd sysfillrect sysimgblt soundcore i7core_edac parport_pc parport wmi shpchp mac_hid vhost_net vhost tap sunrpc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zavl(PO) icp(PO)
[ 2538.605316]  zcommon(PO) znvpair(PO) spl(O) btrfs xor raid6_pq firewire_ohci psmouse ahci libahci i2c_i801 firewire_core mptsas crc_itu_t tg3 mptscsih ptp mptbase pps_core scsi_transport_sas
[ 2538.605333] CPU: 0 PID: 1679 Comm: z_wr_iss Tainted: P           O    4.13.4-1-pve #1
[ 2538.605337] Hardware name: Dell Inc. Precision WorkStation T7500  /06FW8P, BIOS A16 05/28/2013
[ 2538.605342] task: ffff8e074c330000 task.stack: ffffa78a13504000
[ 2538.605414] RIP: 0010:zfs_range_compare+0x5/0x30 [zfs]
[ 2538.605418] RSP: 0018:ffffa78a13507a80 EFLAGS: 00010286
[ 2538.605422] RAX: 0000000000000000 RBX: bb8b4cfffffc4ce1 RCX: 0000000011c30000
[ 2538.605426] RDX: 0000000000000000 RSI: bb8b4cfffffc4ce1 RDI: ffff8e0562ba4600
[ 2538.605430] RBP: ffffa78a13507ab8 R08: ffff8e075241f500 R09: ffff8e0752007180
[ 2538.605434] R10: ffff8e0562ba4600 R11: 0000000000b80000 R12: ffff8e074613a560
[ 2538.605438] R13: ffff8e0562ba4600 R14: bb8b4cfffffc4ce9 R15: bb8b4cfffffc4ce9
[ 2538.605443] FS:  0000000000000000(0000) GS:ffff8e0752400000(0000) knlGS:0000000000000000
[ 2538.605447] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2538.605451] CR2: 00007f7b9737b148 CR3: 000000021b89a000 CR4: 00000000000006f0
[ 2538.605456] Call Trace:
[ 2538.605461]  ? avl_add+0x59/0x90 [zavl]
[ 2538.605494]  zfs_range_lock+0x42b/0x5c0 [zfs]
[ 2538.605536]  zvol_request+0x1e5/0x300 [zfs]
[ 2538.605542]  generic_make_request+0x125/0x300
[ 2538.605546]  submit_bio+0x73/0x150
[ 2538.605549]  ? submit_bio+0x73/0x150
[ 2538.605582]  vdev_disk_io_start+0x616/0x700 [zfs]
[ 2538.605615]  zio_vdev_io_start+0xbf/0x340 [zfs]
[ 2538.605648]  ? zio_vdev_io_start+0xbf/0x340 [zfs]
[ 2538.605681]  ? vdev_config_sync+0x190/0x190 [zfs]
[ 2538.605714]  zio_nowait+0xb6/0x150 [zfs]
[ 2538.605747]  vdev_mirror_io_start+0xa1/0x180 [zfs]
[ 2538.605781]  zio_vdev_io_start+0x223/0x340 [zfs]
[ 2538.605931]  ? taskq_member+0x18/0x30 [spl]
[ 2538.606185]  zio_execute+0x8a/0xe0 [zfs]
[ 2538.606675]  taskq_thread+0x25e/0x460 [spl]
[ 2538.607178]  ? wake_up_q+0x80/0x80
[ 2538.607680]  kthread+0x109/0x140
[ 2538.608163]  ? taskq_thread_should_stop+0x70/0x70 [spl]
[ 2538.608651]  ? kthread_create_on_node+0x70/0x70
[ 2538.609155]  ret_from_fork+0x25/0x30
[ 2538.609653] Code: 48 c7 c7 88 91 53 c0 41 bf 5f 00 00 00 e8 34 cf fe ff e9 59 fd ff ff e8 ba 91 20 cd 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> 8b 56 20 48 8b 4f 20 31 c0 55 48 39 d1 48 89 e5 0f 92 c2 0f 
[ 2538.610745] RIP: zfs_range_compare+0x5/0x30 [zfs] RSP: ffffa78a13507a80
[ 2538.611281] ---[ end trace 36702b944b69d225 ]---
@e100
Copy link

e100 commented Dec 18, 2017

Having same issue.

I have DRBD running ontop of a ZVOL.
When making heavy sequential writes on the primary node, the secondary node throws this exception.

kernel 4.13.8-3-pve
ZFS 0.7.3-1

[ 3024.393239] general protection fault: 0000 [#1] SMP
[ 3024.393285] Modules linked in: ip_set ip6table_filter ip6_tables iptable_filter bonding ib_ipoib zram iTCO_wdt iTCO_vendor_support nfnetlink_log nfnetlink intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp ipmi_
ssif coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel mgag200 aes_x86_64 crypto_simd glue_helper cryptd ttm drm_kms_helper snd_pcm snd_timer drm snd intel_cstate soundco
re i2c_algo_bit fb_sys_fops pcspkr syscopyarea sysfillrect intel_rapl_perf joydev input_leds sysimgblt lpc_ich mei_me mei ib_mthca ioatdma dca shpchp ipmi_si ipmi_devintf ipmi_msghandler mac_hid vhost_net vhost tap ib_is
er rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drbd sunrpc lru_cache libcrc32c ip_tables x_tables autofs4 zfs(PO)
[ 3024.393734]  zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) hid_generic usbkbd usbmouse usbhid hid btrfs xor raid6_pq ahci e1000e(O) i2c_i801 libahci isci ptp pps_core libsas arcmsr scsi_transport_sas
[ 3024.393865] CPU: 26 PID: 24805 Comm: vgs Tainted: P           O    4.13.8-3-pve #1
[ 3024.393910] Hardware name: Supermicro X9DRL-3F/iF/X9DRL-3F/iF, BIOS 1.0b 08/21/2012
[ 3024.393958] task: ffff974a4e342e80 task.stack: ffffb0fcde6e4000
[ 3024.394061] RIP: 0010:zfs_range_compare+0x5/0x30 [zfs]
[ 3024.394094] RSP: 0018:ffffb0fcde6e7990 EFLAGS: 00010202
[ 3024.394130] RAX: 0000000000000000 RBX: 11e8c48949df8940 RCX: 0000000000001000
[ 3024.394173] RDX: 0000000000000000 RSI: 11e8c48949df8940 RDI: ffff973c0e58b600
[ 3024.394218] RBP: ffffb0fcde6e79c8 R08: ffff974e7f49f500 R09: ffff973e7f407180
[ 3024.394262] R10: ffff973c0e58b600 R11: ffff973e76a25b88 R12: ffff974e76e20d60
[ 3024.394306] R13: ffff973c0e58b600 R14: 11e8c48949df8948 R15: 11e8c48949df8948
[ 3024.394352] FS:  00007f9d723e1880(0000) GS:ffff974e7f480000(0000) knlGS:0000000000000000
[ 3024.394400] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3024.394436] CR2: 00000000024f4b78 CR3: 00000019e7484000 CR4: 00000000000426e0
[ 3024.394481] Call Trace:
[ 3024.394505]  ? avl_add+0x59/0x90 [zavl]
[ 3024.396105]  zfs_range_lock+0x3cc/0x5c0 [zfs]
[ 3024.397673]  zvol_request+0x16e/0x300 [zfs]
[ 3024.399197]  generic_make_request+0x125/0x300
[ 3024.400688]  ? iov_iter_get_pages+0xbd/0x2f0
[ 3024.402161]  submit_bio+0x73/0x150
[ 3024.403617]  ? submit_bio+0x73/0x150
[ 3024.405077]  ? bio_iov_iter_get_pages+0xd7/0x120
[ 3024.406544]  __blkdev_direct_IO_simple+0x193/0x320
[ 3024.408033]  ? bd_set_size+0xd5/0xe0
[ 3024.409088]  ? __blkdev_get+0x1ea/0x430
[ 3024.410531]  ? bdget+0x120/0x120
[ 3024.411989]  blkdev_direct_IO+0x3e1/0x3f0
[ 3024.413438]  ? blkdev_direct_IO+0x3e1/0x3f0
[ 3024.414897]  ? dput+0x34/0x1f0
[ 3024.416309]  ? mntput+0x24/0x40
[ 3024.417736]  ? terminate_walk+0xe0/0xf0
[ 3024.419078]  ? path_openat+0x6f4/0x1490
[ 3024.420461]  ? putname+0x54/0x60
[ 3024.421871]  generic_file_read_iter+0xcb/0x9e0
[ 3024.423286]  ? generic_file_read_iter+0xcb/0x9e0
[ 3024.424684]  ? do_filp_open+0xad/0x110
[ 3024.426014]  ? _copy_to_user+0x2a/0x40
[ 3024.426965]  ? cp_new_stat+0x152/0x180
[ 3024.428227]  blkdev_read_iter+0x35/0x40
[ 3024.429309]  new_sync_read+0xde/0x130
[ 3024.430077]  __vfs_read+0x26/0x40
[ 3024.430891]  vfs_read+0x96/0x130
[ 3024.431696]  SyS_read+0x55/0xc0
[ 3024.432410]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[ 3024.433108] RIP: 0033:0x7f9d71242700
[ 3024.433732] RSP: 002b:00007ffe66a9b6f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 3024.434363] RAX: ffffffffffffffda RBX: 0000561b170ed6e8 RCX: 00007f9d71242700
[ 3024.434966] RDX: 0000000000001000 RSI: 0000561b170d9000 RDI: 0000000000000003
[ 3024.435557] RBP: 0000000000000001 R08: 00007f9d715011f8 R09: 0000000000001000
[ 3024.436138] R10: 00007f9d7198bf20 R11: 0000000000000246 R12: 0000000000000000
[ 3024.436709] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffe66a9bba0
[ 3024.437277] Code: 48 c7 c7 88 c1 73 c0 41 bf 5f 00 00 00 e8 24 cf fe ff e9 59 fd ff ff e8 0a 60 40 c5 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> 8b 56 20 48 8b 4f 20 31 c0 55 48 39 d1 48 89 e5 0f 92 c2 0f
[ 3024.438549] RIP: zfs_range_compare+0x5/0x30 [zfs] RSP: ffffb0fcde6e7990
[ 3024.439190] ---[ end trace 0fdb8beaee634420 ]---

@behlendorf behlendorf added this to To Do in 0.7.6 via automation Dec 19, 2017
@behlendorf behlendorf added this to the 0.8.0 milestone Dec 19, 2017
@e100
Copy link

e100 commented Dec 20, 2017

Problem has not repeated since doing:
echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic

@behlendorf
Copy link
Contributor

@e100 thanks for the potential clue. Definitely let us know if you observe the problem again with the dynamic taskq threads disabled.

@tuxoko could you please take a look at this.

@tuxoko
Copy link
Contributor

tuxoko commented Dec 20, 2017

@e100
Do you see the problem if you set zvol_request_sync to 1, and revert the spl_taskq_thread_dynamic change? Can you also try using a debug build?

Both of the stack trace show garbage in RSI register, which probably means something in the range lock and/or its avl tree got corrupted?

@e100
Copy link

e100 commented Dec 21, 2017

@tuxoko
I start vacation tomorrow and we have some production loads where this issue occurred so I cannot do any testing at the moment. When I get back in January we will look into moving some things around so we can perform more tests to help narrow down where the issue is.

Different RIP but does involve zfs_range_lock:

[75319.346073] general protection fault: 0000 [#1] SMP
[75319.346106] Modules linked in: dm_crypt algif_skcipher af_alg ip_set ip6table_filter ip6_tables iptable_filter bonding ib_ipoib zram iTCO_wdt iTCO_vendor_support nfnetlink_log nfnetlink intel_rapl sb_edac ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel mgag200 aes_x86_64 ttm crypto_simd glue_helper cryptd drm_kms_helper intel_cstate snd_pcm drm snd_timer snd i2c_algo_bit soundcore fb_sys_fops mei_me syscopyarea sysfillrect intel_rapl_perf pcspkr input_leds joydev sysimgblt lpc_ich mei ib_mthca ioatdma dca shpchp ipmi_si ipmi_devintf ipmi_msghandler mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc drbd lru_cache libcrc32c ip_tables
[75319.346391]  x_tables autofs4 zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) hid_generic usbmouse usbkbd usbhid hid btrfs xor raid6_pq ahci e1000e(O) i2c_i801 libahci isci ptp pps_core libsas arcmsr scsi_transport_sas
[75319.346482] CPU: 0 PID: 5819 Comm: drbd_r_drbd1 Tainted: P           O    4.13.8-3-pve #1
[75319.346512] Hardware name: Supermicro X9DRL-3F/iF/X9DRL-3F/iF, BIOS 1.0b 08/21/2012
[75319.346542] task: ffff895b374c0000 task.stack: ffffb2c9cfd84000
[75319.346572] RIP: 0010:native_queued_spin_lock_slowpath+0x108/0x1a0
[75319.346596] RSP: 0018:ffffb2c9cfd87b88 EFLAGS: 00010006
[75319.346617] RAX: 74705f636f6e39e1 RBX: 0000000000000202 RCX: ffff895b3fa1cd80
[75319.346644] RDX: 00000000000033f5 RSI: 00000000cfd87d60 RDI: ffffb2c9cfd87d28
[75319.346671] RBP: ffffb2c9cfd87b88 R08: 0000000000040000 R09: 0000000000000000
[75319.346698] R10: ffffb2c9cfd87bc0 R11: 000000005c1cbf1e R12: 0000000000000002
[75319.346725] R13: ffffb2c9cfd87d28 R14: ffffb2c9cfd87d28 R15: 0000000000000000
[75319.346752] FS:  0000000000000000(0000) GS:ffff895b3fa00000(0000) knlGS:0000000000000000
[75319.346783] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75319.346805] CR2: 00007ffe3a6b4410 CR3: 0000000fd4da5000 CR4: 00000000000426f0
[75319.346832] Call Trace:
[75319.346849]  _raw_spin_lock_irqsave+0x37/0x3f
[75319.346869]  prepare_to_wait_exclusive+0x23/0x80
[75319.346895]  cv_wait_common+0x7c/0x140 [spl]
[75319.346914]  ? wait_woken+0x80/0x80
[75319.346933]  __cv_wait+0x15/0x20 [spl]
[75319.346996]  zfs_range_lock+0x298/0x5c0 [zfs]
[75319.347051]  zvol_request+0x1e5/0x300 [zfs]
[75319.347071]  generic_make_request+0x125/0x300
[75319.347097]  drbd_submit_peer_request+0x16d/0x370 [drbd]
[75319.347122]  ? drbd_submit_peer_request+0x16e/0x370 [drbd]
[75319.347147]  receive_Data+0x38c/0x9e0 [drbd]
[75319.347168]  ? receive_RSDataReply+0x2d0/0x2d0 [drbd]
[75319.347192]  drbd_receiver+0x198/0x300 [drbd]
[75319.348107]  drbd_thread_setup+0x52/0x120 [drbd]
[75319.349019]  kthread+0x109/0x140
[75319.349919]  ? drbd_destroy_connection+0x100/0x100 [drbd]
[75319.350816]  ? kthread_create_on_node+0x70/0x70
[75319.351700]  ret_from_fork+0x25/0x30
[75319.352573] Code: 02 89 c2 45 31 c9 c1 e2 10 85 d2 74 41 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 80 cd 01 00 48 03 04 d5 e0 b3 b6 b1 <48> 89 08 8b 41 08 85 c0 75 09 f3 90 8b 41 08 85 c0 74 f7 4c 8b
[75319.354443] RIP: native_queued_spin_lock_slowpath+0x108/0x1a0 RSP: ffffb2c9cfd87b88
[75319.355390] ---[ end trace d30370e8361de6ad ]---```

@behlendorf behlendorf moved this from To Do to In progress in 0.7.6 Jan 19, 2018
@tonyhutter tonyhutter moved this from In progress to Done in 0.7.6 Feb 6, 2018
@behlendorf
Copy link
Contributor

Closing as a duplicate of #6890

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
0.7.6
  
Done
Development

No branches or pull requests

4 participants