PANIC in zfs_mknode leads to cv_wait_common deadlock #5535

Open
mfischer-zd opened this Issue Dec 28, 2016 · 16 comments

@mfischer-zd

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7.3
Linux Kernel 3.10.0-514.2.2.el7.x86_64
Architecture x86_64
ZFS Version 0.6.5.8-1
SPL Version 0.6.5.8-1

Problem description

The first indication of the problem is in the kernel log:

Dec 26 20:12:50 xxxxx kernel: VERIFY(*zpp != NULL) failed
Dec 26 20:12:50 xxxxx kernel: PANIC at zfs_znode.c:899:zfs_mknode()

After this, both the zfs snapshot process and the txg_quiesce thread are stuck in D state. The only way to recover at this point is to hard-reboot the instance.

Reproduction steps

So far, we've seen this problem in the context of operating a Kubernetes node running Docker (currently 1.12.5, but we've seen this with other Docker versions as well). The condition seems to be triggered only when a process is OOM-killed for exceeding its memory limit. At that point, zfs deadlocks somewhere in the kernel.

Log trace

Dec 26 20:06:00 xxxxx kernel: Memory cgroup out of memory: Kill process 15648 (syncer.rb:275) score 1501 or sacrifice child
Dec 26 20:06:00 xxxxx kernel: Killed process 15454 (ruby) total-vm:169404kB, anon-rss:79636kB, file-rss:0kB, shmem-rss:0kB
Dec 26 20:12:48 xxxxx kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x8050)
Dec 26 20:12:48 xxxxx kernel: syncer.rb:275 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=991
Dec 26 20:12:48 xxxxx kernel: syncer.rb:275 cpuset=952bd45bec82fe13e3e14ba053c2ab2f95e0116420e035736e07eeb24b871578 mems_allowed=0-1
Dec 26 20:12:48 xxxxx kernel: CPU: 2 PID: 2900 Comm: syncer.rb:275 Tainted: P           OE  ------------   3.10.0-514.2.2.el7.x86_64 #1
Dec 26 20:12:48 xxxxx kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
Dec 26 20:12:48 xxxxx kernel:  ffff880216d0af10 000000009a481344 ffff8802b21cfcc0 ffffffff816861cc
Dec 26 20:12:48 xxxxx kernel:  ffff8802b21cfd50 ffffffff81681177 ffff8802759fbb80 0000000000000001
Dec 26 20:12:48 xxxxx kernel:  0000000000000000 0000000000000000 0000000000000046 ffffffff811842b6
Dec 26 20:12:48 xxxxx kernel: Call Trace:
Dec 26 20:12:48 xxxxx kernel:  [<ffffffff816861cc>] dump_stack+0x19/0x1b
Dec 26 20:12:48 xxxxx kernel:  [<ffffffff81681177>] dump_header+0x8e/0x225
Dec 26 20:12:48 xxxxx kernel:  [<ffffffff811842b6>] ? find_lock_task_mm+0x56/0xc0
Dec 26 20:12:48 xxxxx kernel:  [<ffffffff8118476e>] oom_kill_process+0x24e/0x3c0
Dec 26 20:12:48 xxxxx kernel:  [<ffffffff810937ee>] ? has_capability_noaudit+0x1e/0x30
Dec 26 20:12:48 xxxxx kernel:  [<ffffffff811f3131>] mem_cgroup_oom_synchronize+0x551/0x580
Dec 26 20:12:48 xxxxx kernel:  [<ffffffff811f2580>] ? mem_cgroup_charge_common+0xc0/0xc0
Dec 26 20:12:48 xxxxx kernel:  [<ffffffff81184ff4>] pagefault_out_of_memory+0x14/0x90
Dec 26 20:12:48 xxxxx kernel:  [<ffffffff8167ef67>] mm_fault_error+0x68/0x12b
Dec 26 20:12:48 xxxxx kernel:  [<ffffffff81691ed5>] __do_page_fault+0x395/0x450
Dec 26 20:12:48 xxxxx kernel:  [<ffffffff81691fc5>] do_page_fault+0x35/0x90
Dec 26 20:12:48 xxxxx kernel:  [<ffffffff8168e288>] page_fault+0x28/0x30
Dec 26 20:12:48 xxxxx kernel: Task in /docker/952bd45bec82fe13e3e14ba053c2ab2f95e0116420e035736e07eeb24b871578 killed as a result of limit of /docker/952bd45bec82fe13e3e
Dec 26 20:12:48 xxxxx kernel: memory: usage 153600kB, limit 153600kB, failcnt 4109903
Dec 26 20:12:48 xxxxx kernel: memory+swap: usage 153600kB, limit 9007199254740988kB, failcnt 0
Dec 26 20:12:48 xxxxx kernel: kmem: usage 51132kB, limit 9007199254740988kB, failcnt 0
Dec 26 20:12:48 xxxxx kernel: Memory cgroup stats for /docker/952bd45bec82fe13e3e14ba053c2ab2f95e0116420e035736e07eeb24b871578: cache:212KB rss:102256KB rss_huge:0KB map
Dec 26 20:12:48 xxxxx kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Dec 26 20:12:48 xxxxx kernel: [11354]     0 11354      377       13       6        0           991 sh
Dec 26 20:12:48 xxxxx kernel: [11375]     0 11375    16439     3640      35        0           991 ruby
Dec 26 20:12:48 xxxxx kernel: [ 2623]     0  2623    16889     3598      37        0           991 ruby
Dec 26 20:12:48 xxxxx kernel: [ 2749]     0  2749    40415    18288      83        0           991 ruby
Dec 26 20:12:48 xxxxx kernel: Memory cgroup out of memory: Kill process 2906 (syncer.rb:275) score 1459 or sacrifice child
Dec 26 20:12:48 xxxxx kernel: Killed process 2749 (ruby) total-vm:161660kB, anon-rss:73152kB, file-rss:0kB, shmem-rss:0kB
Dec 26 20:12:48 xxxxx kernel:   cache: selinux_inode_security(303:952bd45bec82fe13e3e14ba053c2ab2f95e0116420e035736e07eeb24b871578), object size: 80, buffer size: 80, de
Dec 26 20:12:48 xxxxx kernel:   node 0: slabs: 110, objs: 5610, free: 0
Dec 26 20:12:48 xxxxx kernel:   node 1: slabs: 50, objs: 2550, free: 0
Dec 26 20:12:50 xxxxx kernel: VERIFY(*zpp != NULL) failed
Dec 26 20:12:50 xxxxx kernel: PANIC at zfs_znode.c:899:zfs_mknode()
Dec 26 20:12:50 xxxxx kernel: Showing stack for process 2905
Dec 26 20:12:50 xxxxx kernel: CPU: 3 PID: 2905 Comm: syncer.rb:275 Tainted: P           OE  ------------   3.10.0-514.2.2.el7.x86_64 #1
Dec 26 20:12:50 xxxxx kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
Dec 26 20:12:50 xxxxx kernel:  ffffffffa05960fd 000000002ec8124c ffff88021de2f8a8 ffffffff816861cc
Dec 26 20:12:50 xxxxx kernel:  ffff88021de2f8b8 ffffffffa04096f4 ffff88021de2fa40 ffffffffa04097bf
Dec 26 20:12:50 xxxxx kernel:  ffff8801228b5600 ffff880200000028 ffff88021de2fa50 ffff88021de2f9f0
Dec 26 20:12:50 xxxxx kernel: Call Trace:
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff816861cc>] dump_stack+0x19/0x1b
Dec 26 20:12:50 xxxxx kernel:  [<ffffffffa04096f4>] spl_dumpstack+0x44/0x50 [spl]
Dec 26 20:12:50 xxxxx kernel:  [<ffffffffa04097bf>] spl_panic+0xbf/0xf0 [spl]
Dec 26 20:12:50 xxxxx kernel:  [<ffffffffa0504333>] ? sa_replace_all_by_template_locked+0x53/0x60 [zfs]
Dec 26 20:12:50 xxxxx kernel:  [<ffffffffa0405e8d>] ? spl_kmem_cache_free+0x14d/0x1d0 [spl]
Dec 26 20:12:50 xxxxx kernel:  [<ffffffffa055a649>] zfs_mknode+0xe39/0xe50 [zfs]
Dec 26 20:12:50 xxxxx kernel:  [<ffffffffa05504d1>] zfs_create+0x5e1/0x740 [zfs]
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff81680006>] ? double_lock_balance.part.29+0x1/0xb
Dec 26 20:12:50 xxxxx kernel:  [<ffffffffa056cca3>] zpl_create+0xc3/0x180 [zfs]
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff8120ac8d>] vfs_create+0xcd/0x130
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff8120d65f>] do_last+0xbef/0x12a0
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff811de456>] ? kmem_cache_alloc_trace+0x1d6/0x200
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff812afa3c>] ? selinux_file_alloc_security+0x3c/0x60
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff8120ddd2>] path_openat+0xc2/0x490
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff811806bb>] ? unlock_page+0x2b/0x30
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff8120ff4b>] do_filp_open+0x4b/0xb0
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff8121cbc7>] ? __alloc_fd+0xa7/0x130
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff811fd453>] do_sys_open+0xf3/0x1f0
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff81691fc5>] ? do_page_fault+0x35/0x90
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff811fd56e>] SyS_open+0x1e/0x20
Dec 26 20:12:50 xxxxx kernel:  [<ffffffff816967c9>] system_call_fastpath+0x16/0x1b
Dec 26 20:16:03 xxxxx kernel: INFO: task txg_quiesce:7529 blocked for more than 120 seconds.
Dec 26 20:16:03 xxxxx kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 20:16:03 xxxxx kernel: txg_quiesce     D 0000000000000000     0  7529      2 0x00000080
Dec 26 20:16:03 xxxxx kernel:  ffff8800cc967d60 0000000000000046 ffff8800cc958fb0 ffff8800cc967fd8
Dec 26 20:16:03 xxxxx kernel:  ffff8800cc967fd8 ffff8800cc967fd8 ffff8800cc958fb0 ffff880399e08768
Dec 26 20:16:03 xxxxx kernel:  ffff880399e086a8 ffff880399e08770 ffff880399e086d0 0000000000000000
Dec 26 20:16:03 xxxxx kernel: Call Trace:
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff8168b899>] schedule+0x29/0x70
Dec 26 20:16:03 xxxxx kernel:  [<ffffffffa040b67d>] cv_wait_common+0x10d/0x130 [spl]
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
Dec 26 20:16:03 xxxxx kernel:  [<ffffffffa040b6b5>] __cv_wait+0x15/0x20 [spl]
Dec 26 20:16:03 xxxxx kernel:  [<ffffffffa051ada3>] txg_quiesce_thread+0x2a3/0x3a0 [zfs]
Dec 26 20:16:03 xxxxx kernel:  [<ffffffffa051ab00>] ? txg_fini+0x290/0x290 [zfs]
Dec 26 20:16:03 xxxxx kernel:  [<ffffffffa0406851>] thread_generic_wrapper+0x71/0x80 [spl]
Dec 26 20:16:03 xxxxx kernel:  [<ffffffffa04067e0>] ? __thread_exit+0x20/0x20 [spl]
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff810b064f>] kthread+0xcf/0xe0
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff81696718>] ret_from_fork+0x58/0x90
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
Dec 26 20:16:03 xxxxx kernel: INFO: task syncer.rb:275:2905 blocked for more than 120 seconds.
Dec 26 20:16:03 xxxxx kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 26 20:16:03 xxxxx kernel: syncer.rb:275   D ffff88021de2fbe8     0  2905   2623 0x00000084
Dec 26 20:16:03 xxxxx kernel:  ffff88021de2f8a8 0000000000000082 ffff880268033ec0 ffff88021de2ffd8
Dec 26 20:16:03 xxxxx kernel:  ffff88021de2ffd8 ffff88021de2ffd8 ffff880268033ec0 ffffffffa05960fd
Dec 26 20:16:03 xxxxx kernel:  0000000000000383 ffffffffa0580c50 ffffffffa059b36a ffff88021de2fbe8
Dec 26 20:16:03 xxxxx kernel: Call Trace:
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff8168b899>] schedule+0x29/0x70
Dec 26 20:16:03 xxxxx kernel:  [<ffffffffa04097e5>] spl_panic+0xe5/0xf0 [spl]
Dec 26 20:16:03 xxxxx kernel:  [<ffffffffa0504333>] ? sa_replace_all_by_template_locked+0x53/0x60 [zfs]
Dec 26 20:16:03 xxxxx kernel:  [<ffffffffa0405e8d>] ? spl_kmem_cache_free+0x14d/0x1d0 [spl]
Dec 26 20:16:03 xxxxx kernel:  [<ffffffffa055a649>] zfs_mknode+0xe39/0xe50 [zfs]
Dec 26 20:16:03 xxxxx kernel:  [<ffffffffa05504d1>] zfs_create+0x5e1/0x740 [zfs]
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff81680006>] ? double_lock_balance.part.29+0x1/0xb
Dec 26 20:16:03 xxxxx kernel:  [<ffffffffa056cca3>] zpl_create+0xc3/0x180 [zfs]
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff8120ac8d>] vfs_create+0xcd/0x130
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff8120d65f>] do_last+0xbef/0x12a0
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff811de456>] ? kmem_cache_alloc_trace+0x1d6/0x200
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff812afa3c>] ? selinux_file_alloc_security+0x3c/0x60
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff8120ddd2>] path_openat+0xc2/0x490
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff811806bb>] ? unlock_page+0x2b/0x30
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff8120ff4b>] do_filp_open+0x4b/0xb0
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff8121cbc7>] ? __alloc_fd+0xa7/0x130
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff811fd453>] do_sys_open+0xf3/0x1f0
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff81691fc5>] ? do_page_fault+0x35/0x90
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff811fd56e>] SyS_open+0x1e/0x20
Dec 26 20:16:03 xxxxx kernel:  [<ffffffff816967c9>] system_call_fastpath+0x16/0x1b

Kernel stack traces

zfs snapshot

Dec 28 18:55:38 xxxxx kernel: zfs             D 0000000000000000     0 28839  17500 0x00000084
Dec 28 18:55:38 xxxxx kernel:  ffff88012685fae0 0000000000000082 ffff880042711f60 ffff88012685ffd8
Dec 28 18:55:38 xxxxx kernel:  ffff88012685ffd8 ffff88012685ffd8 ffff880042711f60 ffff880034c24ac0
Dec 28 18:55:38 xxxxx kernel:  ffff880034c24a10 ffff880034c24ac8 ffff880034c24a38 0000000000000000
Dec 28 18:55:38 xxxxx kernel: Call Trace:
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff8168b899>] schedule+0x29/0x70
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa040b67d>] cv_wait_common+0x10d/0x130 [spl]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa040b6b5>] __cv_wait+0x15/0x20 [spl]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa051a4cf>] txg_wait_synced+0xdf/0x120 [zfs]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa04e3280>] ? dsl_dataset_snapshot_check_impl+0x180/0x180 [zfs]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa04f6b77>] dsl_sync_task+0x177/0x270 [zfs]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa04e5400>] ? dsl_dataset_snapshot_sync_impl+0x740/0x740 [zfs]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa04e3280>] ? dsl_dataset_snapshot_check_impl+0x180/0x180 [zfs]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa04e5400>] ? dsl_dataset_snapshot_sync_impl+0x740/0x740 [zfs]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa04e3883>] dsl_dataset_snapshot+0x133/0x2c0 [zfs]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa042c147>] ? nvlist_remove_all+0x77/0xd0 [znvpair]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa042c645>] ? nvlist_add_common.part.51+0x325/0x430 [znvpair]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff811dd471>] ? __kmalloc_node+0x171/0x2b0
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa0404ccd>] ? spl_kmem_alloc_impl+0xcd/0x170 [spl]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa0404ccd>] ? spl_kmem_alloc_impl+0xcd/0x170 [spl]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa0404ccd>] ? spl_kmem_alloc_impl+0xcd/0x170 [spl]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa042cfb2>] ? nvlist_lookup_common.part.71+0xa2/0xb0 [znvpair]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa05422e1>] zfs_ioc_snapshot+0x251/0x270 [zfs]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa0543653>] zfsdev_ioctl+0x233/0x520 [zfs]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff81212035>] do_vfs_ioctl+0x2d5/0x4b0
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff812aebbe>] ? file_has_perm+0xae/0xc0
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff812122b1>] SyS_ioctl+0xa1/0xc0
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff816967c9>] system_call_fastpath+0x16/0x1b

txg_sync

Dec 28 18:55:38 xxxxx kernel: txg_sync        S 0000000000000000     0  7530      2 0x00000080
Dec 28 18:55:38 xxxxx kernel:  ffff8800cc96bd38 0000000000000046 ffff8800cc959f60 ffff8800cc96bfd8
Dec 28 18:55:38 xxxxx kernel:  ffff8800cc96bfd8 ffff8800cc96bfd8 ffff8800cc959f60 ffff880034c24b50
Dec 28 18:55:38 xxxxx kernel:  ffff880034c24a10 ffff880034c24b58 ffff880034c24a38 0000000000000000
Dec 28 18:55:38 xxxxx kernel: Call Trace:
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff8168b899>] schedule+0x29/0x70
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa040b67d>] cv_wait_common+0x10d/0x130 [spl]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa040b6d5>] __cv_wait_sig+0x15/0x20 [spl]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa051b35c>] txg_sync_thread+0x23c/0x620 [zfs]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa051b120>] ? txg_init+0x280/0x280 [zfs]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa0406851>] thread_generic_wrapper+0x71/0x80 [spl]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffffa04067e0>] ? __thread_exit+0x20/0x20 [spl]
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff810b064f>] kthread+0xcf/0xe0
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff81696718>] ret_from_fork+0x58/0x90
Dec 28 18:55:38 xxxxx kernel:  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
@loli10K
Contributor
loli10K commented Dec 28, 2016 edited

@mfischer-zd can you please provide zfs get all output for the affected dataset? That would help in trying to replicate the problem ... I'm particularly interested in the xattr value. Thanks.

@mfischer-zd

@loli10K The particular server instance has been destroyed, but I'm looking at another server that's currently healthy, and here are the attributes of one of the filesystems:

NAME                                                                                  PROPERTY              VALUE                                                                                                SOURCE
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  type                  filesystem                                                                                           -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  creation              Fri Dec 23 18:14 2016                                                                                -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  used                  15K                                                                                                  -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  available             577G                                                                                                 -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  referenced            800K                                                                                                 -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  compressratio         1.00x                                                                                                -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  mounted               yes                                                                                                  -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  origin                zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306-init@281656892  -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  quota                 none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  reservation           none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  recordsize            128K                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  mountpoint            legacy                                                                                               local
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  sharenfs              off                                                                                                  default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  checksum              on                                                                                                   default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  compression           off                                                                                                  default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  atime                 on                                                                                                   default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  devices               on                                                                                                   default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  exec                  on                                                                                                   default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  setuid                on                                                                                                   default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  readonly              off                                                                                                  default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  zoned                 off                                                                                                  default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  snapdir               hidden                                                                                               default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  aclinherit            restricted                                                                                           default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  canmount              on                                                                                                   default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  xattr                 on                                                                                                   default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  copies                1                                                                                                    default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  version               5                                                                                                    -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  utf8only              off                                                                                                  -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  normalization         none                                                                                                 -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  casesensitivity       sensitive                                                                                            -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  vscan                 off                                                                                                  default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  nbmand                off                                                                                                  default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  sharesmb              off                                                                                                  default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  refquota              none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  refreservation        none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  primarycache          all                                                                                                  default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  secondarycache        all                                                                                                  default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  usedbysnapshots       0                                                                                                    -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  usedbydataset         15K                                                                                                  -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  usedbychildren        0                                                                                                    -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  usedbyrefreservation  0                                                                                                    -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  logbias               latency                                                                                              default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  dedup                 off                                                                                                  default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  mlslabel              none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  sync                  standard                                                                                             default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  refcompressratio      1.00x                                                                                                -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  written               15K                                                                                                  -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  logicalused           7.50K                                                                                                -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  logicalreferenced     788K                                                                                                 -
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  filesystem_limit      none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  snapshot_limit        none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  filesystem_count      none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  snapshot_count        none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  snapdev               hidden                                                                                               default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  acltype               off                                                                                                  default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  context               none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  fscontext             none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  defcontext            none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  rootcontext           none                                                                                                 default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  relatime              on                                                                                                   temporary
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  redundant_metadata    all                                                                                                  default
zpool-docker/docker/70f52c0298395be350b7c8fb7803776fdf444c3a3e132f9b238f7b1c2f534306  overlay               off                                                                                                  default

Every filesystem or snapshot I've examined has xattr set to on.
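A quick way to check this recursively across the pool is something like the following (the dataset name is taken from the output above, and the exact invocation is just an example):

zfs get -r -t filesystem,snapshot -o name,value xattr zpool-docker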

@hugues-talbot

Hello, I have a very similar lockup when I try to use a ZFS volume as swap.
Same system (CentOS 7.3), same versions of ZFS and SPL.

To reproduce:

zfs create -V 4G rpool/swap
mkswap /dev/zvol/rpool/swap
swapon -av

Then use a lot of memory; pretty shortly the machine locks up solid and a hard reboot is necessary.

Jan  3 01:08:16 blade11 kernel: INFO: task kswapd0:225 blocked for more than 120 seconds.
Jan  3 01:09:05 blade11 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  3 01:09:21 blade11 kernel: kswapd0         D 0000000000000000     0   225      2 0x00000000
Jan  3 01:09:21 blade11 kernel: ffff881ffce9f5f8 0000000000000046 ffff881ffcde2f10 ffff881ffce9ffd8
Jan  3 01:09:21 blade11 kernel: ffff881ffce9ffd8 ffff881ffce9ffd8 ffff881ffcde2f10 ffff883ffdb92350
Jan  3 01:09:21 blade11 kernel: ffff883ffdb92210 ffff883ffdb92358 ffff883ffdb92238 0000000000000000
Jan  3 01:09:21 blade11 kernel: Call Trace:
Jan  3 01:09:21 blade11 kernel: [<ffffffff8168b899>] schedule+0x29/0x70
Jan  3 01:09:22 blade11 kernel: [<ffffffffa049767d>] cv_wait_common+0x10d/0x130 [spl]
Jan  3 01:09:22 blade11 kernel: [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
Jan  3 01:09:22 blade11 kernel: [<ffffffffa04976b5>] __cv_wait+0x15/0x20 [spl]
Jan  3 01:09:22 blade11 kernel: [<ffffffffa09aa6a3>] txg_wait_open+0xb3/0xf0 [zfs]
Jan  3 01:09:22 blade11 kernel: [<ffffffffa0965b0d>] dmu_tx_wait+0x34d/0x360 [zfs]
Jan  3 01:09:22 blade11 kernel: [<ffffffffa0965bb1>] dmu_tx_assign+0x91/0x4b0 [zfs]
Jan  3 01:09:22 blade11 kernel: [<ffffffffa0a00991>] zvol_request+0x271/0x5e0 [zfs]
Jan  3 01:09:22 blade11 kernel: [<ffffffff812eb04f>] ? generic_make_request_checks+0x24f/0x380
Jan  3 01:09:22 blade11 kernel: [<ffffffff812ee659>] generic_make_request+0x109/0x1e0
Jan  3 01:09:22 blade11 kernel: [<ffffffff812ee7a1>] submit_bio+0x71/0x150
Jan  3 01:09:22 blade11 kernel: [<ffffffff8118b86a>] ? __test_set_page_writeback+0xca/0x160
Jan  3 01:09:22 blade11 kernel: [<ffffffff811c27ca>] __swap_writepage+0x23a/0x280
Jan  3 01:09:22 blade11 kernel: [<ffffffff811ba360>] ? SyS_msync+0x250/0x250
Jan  3 01:09:22 blade11 kernel: [<ffffffff811ba360>] ? SyS_msync+0x250/0x250
Jan  3 01:09:22 blade11 kernel: [<ffffffff811ba360>] ? SyS_msync+0x250/0x250
Jan  3 01:09:22 blade11 kernel: [<ffffffff810f94bf>] ? smp_call_function_single+0x5f/0xa0
Jan  3 01:09:22 blade11 kernel: [<ffffffff81317b25>] ? cpumask_next_and+0x35/0x50
Jan  3 01:09:22 blade11 kernel: [<ffffffff810f9a63>] ? smp_call_function_many+0x223/0x260
Jan  3 01:09:22 blade11 kernel: [<ffffffff811c32cd>] ? swap_info_get+0x5d/0xe0
Jan  3 01:09:22 blade11 kernel: [<ffffffff811c444c>] ? page_swapcount+0x4c/0x60
Jan  3 01:09:22 blade11 kernel: [<ffffffff811c2849>] swap_writepage+0x39/0x80
Jan  3 01:09:22 blade11 kernel: [<ffffffff811953f0>] shrink_page_list+0x840/0xb00
Jan  3 01:09:22 blade11 kernel: [<ffffffff81195d3a>] shrink_inactive_list+0x1fa/0x630
Jan  3 01:09:22 blade11 kernel: [<ffffffff811968d5>] shrink_lruvec+0x385/0x770
Jan  3 01:09:23 blade11 kernel: [<ffffffff81196d36>] shrink_zone+0x76/0x1a0
Jan  3 01:09:23 blade11 kernel: [<ffffffff81197fdc>] balance_pgdat+0x48c/0x5e0
Jan  3 01:09:23 blade11 kernel: [<ffffffff811982a3>] kswapd+0x173/0x450
Jan  3 01:09:24 blade11 kernel: [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
Jan  3 01:09:24 blade11 kernel: [<ffffffff81198130>] ? balance_pgdat+0x5e0/0x5e0
Jan  3 01:09:24 blade11 kernel: [<ffffffff810b064f>] kthread+0xcf/0xe0
Jan  3 01:09:24 blade11 kernel: [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
Jan  3 01:09:24 blade11 kernel: [<ffffffff81696718>] ret_from_fork+0x58/0x90
Jan  3 01:09:24 blade11 kernel: [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
@loli10K
Contributor
loli10K commented Jan 3, 2017

@hugues-talbot this actually looks very different to me, unless you're hitting the same VERIFY in zfs_mknode().

Your issue is more likely related to #4985. Can you only reproduce this lockup with Swap-on-ZVOL? Did you try to run the same workload with a swap device on different storage (e.g. Swap-on-LVM)?

@hugues-talbot

@loli10K thanks. You are right, the trace is very similar to #4985, well spotted. I've reloaded the server that was locking up, and so far (24h) it has not locked up again with swap on a raw device. Let's give it some time, because in my case it was freezing once every 2-3 days or so.

@loli10K
Contributor
loli10K commented Jan 4, 2017

So, I have not been able to reproduce this locally yet, but I think I've made some progress. I'm not entirely sure the OOM condition from "syncer.rb" is the actual killer here; it looks more like another victim caught in the crossfire. The part of the pasted output that bothers me is this:

Dec 26 20:12:48 xxxxx kernel:   cache: selinux_inode_security(303:952bd45bec82fe13e3e14ba053c2ab2f95e0116420e035736e07eeb24b871578), object size: 80, buffer size: 80, de
Dec 26 20:12:48 xxxxx kernel:   node 0: slabs: 110, objs: 5610, free: 0
Dec 26 20:12:48 xxxxx kernel:   node 1: slabs: 50, objs: 2550, free: 0

It looks like the selinux_inode_security cache is full. AIUI, with xattr=on (the default value) on the dataset, SELinux labels consume additional inodes on disk. For instance, this is from a Docker image/dataset on the CentOS box I'm trying to reproduce the problem on:

inode 26743:

[root@centos7 ~]# zdb -ddddd zpool-docker/docker/2763e50fa4bf9c9009908ef57c79edff6b84b6c80b044cb901997495f214cf06 26743
Dataset zpool-docker/docker/2763e50fa4bf9c9009908ef57c79edff6b84b6c80b044cb901997495f214cf06 [ZPL], ID 349, cr_txg 35452, 488M, 28326 objects, rootbp DVA[0]=<0:e0000200:200> DVA[1]=<0:1c8000200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=78147L/78147P fill=28326 cksum=12a821d332:615f3994d67:116e634dd6c15:23bff8884cf29f

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     26743    1    16K  72.0K  72.0K  72.0K  100.00  ZFS plain file
                                        176   bonus  System attributes
	dnode flags: USED_BYTES USERUSED_ACCOUNTED 
	dnode maxblkid: 0
	path	/var/log/dpkg.log
	uid     0
	gid     0
[...]

and its xattr/SELinux inodes:

[root@centos7 ~]# zdb -ddddd zpool-docker/docker/2763e50fa4bf9c9009908ef57c79edff6b84b6c80b044cb901997495f214cf06 26744
Dataset zpool-docker/docker/2763e50fa4bf9c9009908ef57c79edff6b84b6c80b044cb901997495f214cf06 [ZPL], ID 349, cr_txg 35452, 488M, 28326 objects, rootbp DVA[0]=<0:e0000200:200> DVA[1]=<0:1c8000200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=78147L/78147P fill=28326 cksum=12a821d332:615f3994d67:116e634dd6c15:23bff8884cf29f

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     26744    1    16K    512      0    512  100.00  ZFS directory
                                        168   bonus  System attributes
	dnode flags: USED_BYTES USERUSED_ACCOUNTED 
	dnode maxblkid: 0
	path	/var/log/dpkg.log/<xattrdir>
	uid     0
	gid     0
[...]
[root@centos7 ~]# zdb -ddddd zpool-docker/docker/2763e50fa4bf9c9009908ef57c79edff6b84b6c80b044cb901997495f214cf06 26745
Dataset zpool-docker/docker/2763e50fa4bf9c9009908ef57c79edff6b84b6c80b044cb901997495f214cf06 [ZPL], ID 349, cr_txg 35452, 488M, 28326 objects, rootbp DVA[0]=<0:e0000200:200> DVA[1]=<0:1c8000200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=78147L/78147P fill=28326 cksum=12a821d332:615f3994d67:116e634dd6c15:23bff8884cf29f

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     26745    1    16K    512    512    512  100.00  ZFS plain file
                                        168   bonus  System attributes
	dnode flags: USED_BYTES USERUSED_ACCOUNTED 
	dnode maxblkid: 0
	path	/var/log/dpkg.log/<xattrdir>/security.selinux
	uid     0
	gid     0
[...]

Now, if the allocation for our new xattr/SELinux inode were to fail in alloc_inode(), we would get a NULL back through the zfs_mknode->zfs_znode_alloc->new_inode call chain, which would trigger the VERIFY(*zpp != NULL).

202 static struct inode *alloc_inode(struct super_block *sb)
203 {
204         struct inode *inode;
205 
206         if (sb->s_op->alloc_inode)
207                 inode = sb->s_op->alloc_inode(sb);
208         else
209                 inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL);
210 
211         if (!inode)
212                 return NULL;
213 
214         if (unlikely(inode_init_always(sb, inode))) {
215                 if (inode->i_sb->s_op->destroy_inode)
216                         inode->i_sb->s_op->destroy_inode(inode);
217                 else
218                         kmem_cache_free(inode_cachep, inode);
219                 return NULL;
220         }
221 
222         return inode;
223 }

Knowing that inode_init_always calls security_inode_alloc -> inode_alloc_security -> kmem_cache_zalloc(sel_inode_cache, GFP_NOFS), this also seems to be related:

Dec 26 20:12:48 xxxxx kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x8050)

0x8050, I think, is just GFP_NOFS | __GFP_ZERO, which is the same mask used by our kmem_cache_zalloc(sel_inode_cache, GFP_NOFS).
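
For reference, here is a rough decomposition of those masks against the 3.10-era gfp flag values (flag values differ between kernel versions, so treat this as a sketch rather than an exact quote of the headers):

/* include/linux/gfp.h, approximate 3.10-era values */
#define ___GFP_WAIT   0x10u
#define ___GFP_IO     0x40u
#define ___GFP_FS     0x80u
#define ___GFP_ZERO   0x8000u

/* GFP_NOFS    = __GFP_WAIT | __GFP_IO            = 0x0050                                        */
/* GFP_NOFS | __GFP_ZERO                          = 0x8050  <- the failing SLUB allocation above  */
/* GFP_KERNEL  = __GFP_WAIT | __GFP_IO | __GFP_FS = 0x00d0  <- gfp_mask in the oom-killer line    */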

That said, @mfischer-zd, can you share more info about this box (RAM, swap, number of containers, pool layout) and your workload? Is it possible that the entire box (not just the container) is suffering from a memory shortage?

@mfischer-zd

@loli10K Unfortunately the instance has been terminated, but it was an EC2 m4.xlarge instance (4 vCPUs, 16 GB RAM). No swap.

I was able to dredge up some metrics that Datadog recorded while it was up. Have a look:

(screenshot: Datadog memory metrics for the instance, 2017-01-04 1:31 PM)

Let me know if there are any other historical metrics I can get you. Unfortunately it doesn't look like the Datadog agent captures any ZFS metrics yet. (Perhaps I can contribute an agent plugin.)

@loli10K
Contributor
loli10K commented Jan 4, 2017

@mfischer-zd I've never used Datadog, but it seems it just reads the "Slab:" row from /proc/meminfo ... we would probably need /proc/slabinfo.
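
On an affected box, something like the following would show whether the SELinux inode cache is the one being exhausted (the cache name is taken from the log above; the interval is arbitrary, and reading /proc/slabinfo needs root):

watch -n 5 "grep selinux_inode_security /proc/slabinfo"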

No swap.

A little swap could have saved you there: AIUI, GFP_NOFS allocations cannot start FS operations but can still perform disk IO (e.g. swapping) when you're low on memory.

@mfischer-zd
mfischer-zd commented Jan 4, 2017 edited

Unfortunately I don't think that the Datadog agent currently records any slab allocations. I can write a plugin to record specific slab information, though.

As for swap, perhaps it could have helped for a while. EC2 instances don't come with swap, though; you have to allocate and attach a block device to get it. And we haven't found that we need it in most cases, since we don't typically overprovision our instances with running programs.

At any rate, I think that it's reasonable to expect that even without swap, the kernel would have OOM-killed a userspace process without causing ZFS to deadlock. Agreed?

@loli10K
Contributor
loli10K commented Jan 4, 2017

it's reasonable to expect that even without swap, the kernel would have OOM-killed a userspace process without causing ZFS to deadlock. Agreed?

Yes, I would have expected that too. Still, I don't know if we can consider this a ZFS problem (unless ZFS is actually consuming all the memory), because the issue seems to be caused by a kernel function failing an allocation and returning a NULL inode ...

@mfischer-zd
mfischer-zd commented Jan 4, 2017 edited

the issue seems to be caused by a kernel function failing an allocation and returning a NULL inode

Let's suppose that is the case. Could ZFS behave "better" under these circumstances? I've never seen xfs or ext4 cope so poorly with a low-memory condition (in recent history, anyway) that it caused an unrecoverable deadlock.

@loli10K
Contributor
loli10K commented Jan 4, 2017

Could ZFS behave "better" under these circumstances?

Well ... probably? I know very little about kernel development and EXT4/XFS (or ZFS for that matter) in general, but a quick look at fs/ext4/super.c shows:

 909 /*
 910  * Called inside transaction, so use GFP_NOFS
 911  */
 912 static struct inode *ext4_alloc_inode(struct super_block *sb)
 913 {
 914         struct ext4_inode_info *ei;
 915 
 916         ei = kmem_cache_alloc(ext4_inode_cachep, GFP_NOFS);
 917         if (!ei)
 918                 return NULL;

compared to zfs_znode_alloc:

 492 static znode_t *
 493 zfs_znode_alloc(zfs_sb_t *zsb, dmu_buf_t *db, int blksz,
 494     dmu_object_type_t obj_type, uint64_t obj, sa_handle_t *hdl)
 495 {
 496         znode_t *zp;
 497         struct inode *ip;
 498         uint64_t mode;
 499         uint64_t parent;
 500         sa_bulk_attr_t bulk[9];
 501         int count = 0;
 502 
 503         ASSERT(zsb != NULL);
 504 
 505         ip = new_inode(zsb->z_sb);
 506         if (ip == NULL)
 507                 return (NULL);
@mfischer-zd

Yes, they each return NULL in the event that they can't allocate an inode from the cache.

The question, then, is what happens next. With ext4, for instance, the fs-specific function (ext4_alloc_inode()) returns NULL to the caller:

/* fs/inode.c */
202 static struct inode *alloc_inode(struct super_block *sb)
203 {
204         struct inode *inode;
205 
206         if (sb->s_op->alloc_inode)
207                 inode = sb->s_op->alloc_inode(sb);  /* for ext4 this points to ext4_alloc_inode() */
208         else
209                 inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL);
210 
211         if (!inode)
212                 return NULL;

The only callers of the alloc_inode() function (which is non-exported) that I could find were also in fs/inode.c, and they simply propagate the NULL pointer up the stack to return ENOMEM to the caller. See new_inode_pseudo(), iget_locked(), and iget5_locked(), then a caller of those, e.g.:

/* fs/ext4/inode.c */
4429 struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
4430 {
...
4442         inode = iget_locked(sb, ino);
4443         if (!inode)
4444                 return ERR_PTR(-ENOMEM);

There seems to be a lot less going on here than in the ZFS case. Or at least there's less recovery taking place in the latter, depending on how you look at it.

@behlendorf
Member

Could ZFS behave "better" under these circumstances?

Absolutely. The thing to be aware of is that this bit of code was originally designed for Solaris, where it was impossible for this allocation to fail. Hence the VERIFY, and why zfs_mknode() has a void return type. It was subsequently updated for the Linux VFS, where a failure was technically possible but highly unlikely, particularly if the system had any kind of swap device; it would require the OOM killer to get invoked, just like in this issue.

We have two options to handle this failure more gracefully.

  • Update zfs_mknode() such that it's possible for it to fail and return an error. Then update all of its callers to handle the error and cleanly unwind everything. There's a lot going on in the create case, so this would be a pretty disruptive change, but it's certainly possible. The end result would be that the error is cleanly handled and the system call fails. Or,

  • Retry the zfs_znode_alloc() call in the case of a memory allocation failure until it succeeds. Linux will eventually find the memory elsewhere in the system, even if it means killing other processes. This is the more straightforward fix and probably the more desirable option, because the application will just see a delay in the create while Linux somehow finds the memory. Any system this low on memory is already going to be having issues; a slight delay here won't be a big deal and is far better than the panic. (A rough sketch of this approach follows below.)
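
As a loose illustration of the second option only (not the actual patch; the call site is simplified from the zfs_znode_alloc() prototype quoted earlier, and the backoff is just an example), the retry inside zfs_mknode() could look roughly like this:

	/*
	 * Hypothetical sketch: keep retrying the znode allocation instead
	 * of VERIFY()ing, so a transient memory shortage causes a delay
	 * in the create rather than a panic.
	 */
	while ((*zpp = zfs_znode_alloc(zsb, db, 0, obj_type,
	    obj, sa_hdl)) == NULL) {
		/* back off briefly and let the kernel/OOM killer free memory */
		schedule_timeout_interruptible(msecs_to_jiffies(10));
	}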

@mfischer-zd
mfischer-zd commented Jan 10, 2017 edited

Thanks for the insight @behlendorf. I'm personally indifferent between the two proposed solutions, but would be interested in other stakeholders' feedback.

@griznog
griznog commented Jan 11, 2017 edited

I think I may be running into this issue or something similar. I'm posting my steps to reproduce, since it's pretty straightforward and may help with testing.

# As root
cgcreate -g cpu,memory:/zfsswaptest
cgset -r memory.limit_in_bytes=2G zfsswaptest
cgset -r memory.memsw.limit_in_bytes=2.125G zfsswaptest
cgclassify -g cpu,memory:/zfsswaptest PIDOFMYUSERBASH

# From user bash shell
memhog 3G

The expected behavior is that memhog is killed at 2.125 GB used; the observed behavior is that the process hangs in such a way that anything that subsequently tries to read from its /proc/PID entry also hangs.

If I kill the memhog, things return to normal (all processes hung reading from /proc continue), and if I swapoff the zvol swap device, the process is immediately OOM-killed.

I am not seeing this problem on systems using a partition or MD device for swap, only where we use a zvol.

@behlendorf behlendorf added this to the 0.7.0 milestone Jan 13, 2017