
zvol write performance issue 0.7.0-rc4 #6127

Closed
Bronek opened this issue May 12, 2017 · 35 comments
Labels
Status: Stale No recent activity for issue

Comments

@Bronek

Bronek commented May 12, 2017

Type                  Version/Name
Distribution Name     Arch
Distribution Version
Linux Kernel          4.9.26 (vanilla)
Architecture          x64
ZFS Version           0.7.0-rc4
SPL Version           0.7.0-rc4

I just encountered a severe zvol write performance issue, where the zvol is used as the backing store for KVM virtual machines running under qemu 2.9.0 and kernel 4.9.26 (vanilla, i.e. no patches) with ZFS/SPL version 0.7.0-rc4. Due to very high CPU utilization in kernel mode, the host became virtually unresponsive and had to be restarted. The host machine has 128GB RAM and 2 sockets, each with a Xeon E5-2667v2 and HT enabled, which means 16 cores (32 logical cores, i.e. hardware threads), while the guest was only assigned 2 logical cores and 2GB RAM. The system was luckily still responsive to sysrq commands, which allowed me to dump blocked tasks (and several other statistics). Here are interesting excerpts from the kernel log:

root@czersk ~ # grep -F "[35224.445587] sysrq: SysRq : Show Blocked State" total.log -A2860 | grep "Call Trace" | wc -l
119
root@czersk ~ # grep -F "[35224.445587] sysrq: SysRq : Show Blocked State" total.log -A2860 | grep -E '^[^ ]+ zvol' | wc -l
8

root@czersk ~ # grep -F "[35270.367799] sysrq: SysRq : Show Blocked State" total.log -A3220 | grep "Call Trace" | wc -l
135
root@czersk ~ # grep -F "[35270.367799] sysrq: SysRq : Show Blocked State" total.log -A3220 | grep -E '^[^ ]+ zvol' | wc -l
24

root@czersk ~ # grep -F "[35343.395782] sysrq: SysRq : Show Blocked State" total.log -A2860 | grep "Call Trace" | wc -l
119
root@czersk ~ # grep -F "[35343.395782] sysrq: SysRq : Show Blocked State" total.log -A2860 | grep -E '^[^ ]+ zvol' | wc -l
5

root@czersk ~ # grep -F "[35393.369542] sysrq: SysRq : Show Blocked State" total.log -A4170 | grep "Call Trace" | wc -l
163
root@czersk ~ # grep -F "[35393.369542] sysrq: SysRq : Show Blocked State" total.log -A4170 | grep -E '^[^ ]+ zvol' | wc -l
31

root@czersk ~ # grep -F "[35468.219917] sysrq: SysRq : Show Blocked State" total.log -A3660 | grep "Call Trace" | wc -l
147
root@czersk ~ # grep -F "[35468.219917] sysrq: SysRq : Show Blocked State" total.log -A3655 | grep -E '^[^ ]+ zvol' | wc -l
22

[35287.590594] zvol            D    0 29199      2 0x00000000
[35287.596136]  ffff881034603840 0000000000000000 ffff8809736f8e00 ffff88103fc97fc0
[35287.603673]  ffff8810341c2a00 ffffc9003a48fba8 ffffffff815e03ec ffffffff815e0b3e
[35287.611257]  00ffc9003a48fbc0 ffff88103fc97fc0 000000005ad0fc63 ffff8809736f8e00
[35287.618831] Call Trace:
[35287.621305]  [<ffffffff815e03ec>] ? __schedule+0x22c/0x6b0
[35287.626831]  [<ffffffff815e0b3e>] ? schedule_preempt_disabled+0xe/0x10
[35287.633391]  [<ffffffff815e08a6>] schedule+0x36/0x80
[35287.638376]  [<ffffffffa01819d0>] cv_wait_common+0x110/0x130 [spl]
[35287.644602]  [<ffffffff810bccd0>] ? wake_atomic_t_function+0x60/0x60
[35287.651010]  [<ffffffffa0181a05>] __cv_wait+0x15/0x20 [spl]
[35287.656625]  [<ffffffffa07764f8>] txg_wait_open+0xa8/0xf0 [zfs]
[35287.662618]  [<ffffffffa0732691>] dmu_tx_wait+0x331/0x340 [zfs]
[35287.668610]  [<ffffffffa0746717>] ? dsl_dir_tempreserve_space+0x97/0x140 [zfs]
[35287.675843]  [<ffffffffa0732727>] dmu_tx_assign+0x87/0x3e0 [zfs]
[35287.681973]  [<ffffffffa07d2ab7>] zvol_write+0x177/0x570 [zfs]
[35287.687851]  [<ffffffffa0179f7b>] ? spl_kmem_alloc+0x9b/0x170 [spl]
[35287.694177]  [<ffffffffa0182526>] ? tsd_hash_search.isra.0+0x46/0xa0 [spl]
[35287.701083]  [<ffffffffa017da0e>] taskq_thread+0x25e/0x460 [spl]
[35287.707151]  [<ffffffff815e03f4>] ? __schedule+0x234/0x6b0
[35287.712694]  [<ffffffff810a3620>] ? wake_up_q+0x80/0x80
[35287.717969]  [<ffffffffa017d7b0>] ? taskq_cancel_id+0x130/0x130 [spl]
[35287.724430]  [<ffffffff81097dd6>] kthread+0xe6/0x100
[35287.729402]  [<ffffffff81097cf0>] ? kthread_park+0x60/0x60
[35287.734897]  [<ffffffff815e4d55>] ret_from_fork+0x25/0x30
[35287.740330] zvol            D    0 29200      2 0x00000000
[35287.745887]  ffff881035847080 0000000000000000 ffff8805440f2a00 ffff88103fc97fc0
[35287.753406]  ffff88045ffab800 ffffc9003a497ba8 ffffffff815e03ec ffffffff815e0b3e
[35287.760979]  00ffc9003a497bc0 ffff88103fc97fc0 000000002d4ba5b0 ffff8805440f2a00
[35287.768504] Call Trace:
[35287.770959]  [<ffffffff815e03ec>] ? __schedule+0x22c/0x6b0
[35287.776469]  [<ffffffff815e0b3e>] ? schedule_preempt_disabled+0xe/0x10
[35287.783036]  [<ffffffff815e08a6>] schedule+0x36/0x80
[35287.788024]  [<ffffffffa01819d0>] cv_wait_common+0x110/0x130 [spl]
[35287.794253]  [<ffffffff810bccd0>] ? wake_atomic_t_function+0x60/0x60
[35287.800665]  [<ffffffffa0181a05>] __cv_wait+0x15/0x20 [spl]
[35287.806303]  [<ffffffffa07764f8>] txg_wait_open+0xa8/0xf0 [zfs]
[35287.812235]  [<ffffffffa0732691>] dmu_tx_wait+0x331/0x340 [zfs]
[35287.818182]  [<ffffffffa0746717>] ? dsl_dir_tempreserve_space+0x97/0x140 [zfs]
[35287.825451]  [<ffffffffa0732727>] dmu_tx_assign+0x87/0x3e0 [zfs]
[35287.831484]  [<ffffffffa07d2ab7>] zvol_write+0x177/0x570 [zfs]
[35287.837332]  [<ffffffffa0179f7b>] ? spl_kmem_alloc+0x9b/0x170 [spl]
[35287.843614]  [<ffffffffa0182526>] ? tsd_hash_search.isra.0+0x46/0xa0 [spl]
[35287.850538]  [<ffffffffa017da0e>] taskq_thread+0x25e/0x460 [spl]
[35287.856597]  [<ffffffff815e03f4>] ? __schedule+0x234/0x6b0
[35287.862081]  [<ffffffff810a3620>] ? wake_up_q+0x80/0x80
[35287.867309]  [<ffffffffa017d7b0>] ? taskq_cancel_id+0x130/0x130 [spl]
[35287.873757]  [<ffffffff81097dd6>] kthread+0xe6/0x100
[35287.878720]  [<ffffffff81097cf0>] ? kthread_park+0x60/0x60
[35287.884231]  [<ffffffff815e4d55>] ret_from_fork+0x25/0x30
[35287.889668] zvol            D    0 29201      2 0x00000000
[35287.895212]  ffff881035847080 0000000000000000 ffff8805440f3800 ffff88103fc97fc0
[35287.902714]  ffff8805916e0e00 ffffc9003a49fba8 ffffffff815e03ec ffffffff815e0b3e
[35287.910176]  00ffc9003a49fbc0 ffff88103fc97fc0 000000008f11df80 ffff8805440f3800
[35287.917680] Call Trace:
[35287.920139]  [<ffffffff815e03ec>] ? __schedule+0x22c/0x6b0
[35287.925659]  [<ffffffff815e0b3e>] ? schedule_preempt_disabled+0xe/0x10
[35287.932216]  [<ffffffff815e08a6>] schedule+0x36/0x80
[35287.937219]  [<ffffffffa01819d0>] cv_wait_common+0x110/0x130 [spl]
[35287.943517]  [<ffffffff810bccd0>] ? wake_atomic_t_function+0x60/0x60
[35287.949973]  [<ffffffffa0181a05>] __cv_wait+0x15/0x20 [spl]
[35287.955639]  [<ffffffffa07764f8>] txg_wait_open+0xa8/0xf0 [zfs]
[35287.961612]  [<ffffffffa0732691>] dmu_tx_wait+0x331/0x340 [zfs]
[35287.967618]  [<ffffffffa0746717>] ? dsl_dir_tempreserve_space+0x97/0x140 [zfs]
[35287.974914]  [<ffffffffa0732727>] dmu_tx_assign+0x87/0x3e0 [zfs]
[35287.981007]  [<ffffffffa07d2ab7>] zvol_write+0x177/0x570 [zfs]
[35287.986925]  [<ffffffffa0179f7b>] ? spl_kmem_alloc+0x9b/0x170 [spl]
[35287.993226]  [<ffffffffa0182526>] ? tsd_hash_search.isra.0+0x46/0xa0 [spl]
[35288.000132]  [<ffffffffa017da0e>] taskq_thread+0x25e/0x460 [spl]
[35288.006222]  [<ffffffff815e03f4>] ? __schedule+0x234/0x6b0
[35288.011770]  [<ffffffff810a3620>] ? wake_up_q+0x80/0x80
[35288.017057]  [<ffffffffa017d7b0>] ? taskq_cancel_id+0x130/0x130 [spl]
[35288.023538]  [<ffffffff81097dd6>] kthread+0xe6/0x100
[35288.028577]  [<ffffffff81097cf0>] ? kthread_park+0x60/0x60
[35288.034093]  [<ffffffff815e4d55>] ret_from_fork+0x25/0x30
[35288.039531] zvol            D    0 29202      2 0x00000000
[35288.045082]  ffff881035fbf800 0000000000000000 ffff8805440f6200 ffff88207fc57fc0
[35288.052641]  ffff8809736fd400 ffffc9003a4a7ba8 ffffffff815e03ec 00000000ffffffff
[35288.060178]  0000000000000020 ffff88207fc57fc0 ffffc9003a4a7c48 ffff8805440f6200
[35288.067778] Call Trace:
[35288.070240]  [<ffffffff815e03ec>] ? __schedule+0x22c/0x6b0
[35288.075734]  [<ffffffff815e08a6>] schedule+0x36/0x80
[35288.080769]  [<ffffffffa01819d0>] cv_wait_common+0x110/0x130 [spl]
[35288.086991]  [<ffffffff810bccd0>] ? wake_atomic_t_function+0x60/0x60
[35288.093388]  [<ffffffffa0181a05>] __cv_wait+0x15/0x20 [spl]
[35288.099017]  [<ffffffffa07764f8>] txg_wait_open+0xa8/0xf0 [zfs]
[35288.105010]  [<ffffffffa0732691>] dmu_tx_wait+0x331/0x340 [zfs]
[35288.111008]  [<ffffffffa0746717>] ? dsl_dir_tempreserve_space+0x97/0x140 [zfs]
[35288.118320]  [<ffffffffa0732727>] dmu_tx_assign+0x87/0x3e0 [zfs]
[35288.124405]  [<ffffffffa07d2ab7>] zvol_write+0x177/0x570 [zfs]
[35288.130305]  [<ffffffffa0179f7b>] ? spl_kmem_alloc+0x9b/0x170 [spl]
[35288.136656]  [<ffffffffa0182526>] ? tsd_hash_search.isra.0+0x46/0xa0 [spl]
[35288.143564]  [<ffffffffa017da0e>] taskq_thread+0x25e/0x460 [spl]
[35288.149631]  [<ffffffff815e03f4>] ? __schedule+0x234/0x6b0
[35288.155201]  [<ffffffff810a3620>] ? wake_up_q+0x80/0x80
[35288.160507]  [<ffffffffa017d7b0>] ? taskq_cancel_id+0x130/0x130 [spl]
[35288.167049]  [<ffffffff81097dd6>] kthread+0xe6/0x100
[35288.172083]  [<ffffffff81097cf0>] ? kthread_park+0x60/0x60
[35288.177570]  [<ffffffff81097cf0>] ? kthread_park+0x60/0x60
[35288.183047]  [<ffffffff815e4d55>] ret_from_fork+0x25/0x30

The guest virtual machine where this happened is a very basic, freshly installed FreeBSD 10.3. The issue was triggered when I started this command in the guest:

# pkg install vim-8.0.0507

... which brings in 116 dependencies, taking 2GB of disk space (as I've just learned, one would have to install "vim-lite-8.0.0507" on FreeBSD to get vim alone). The problem happened somewhere in the middle of the guest installing these 2GB of dependencies. The guest does have sufficient free disk space, as seen here:

# df -h
Filesystem      Size    Used   Avail Capacity  Mounted on
/dev/vtbd0p2     12G    1.9G    8.8G    18%    /
devfs           1.0K    1.0K      0B   100%    /dev
/dev/vtbd0p3     12G     32M     11G     0%    /home

This is all hosted on a 24GB ZVOL, defined as below:


root@gdansk ~ # virsh dumpxml torun-spice | grep -A6 -B3 'vdis/torun'
    <emulator>/usr/sbin/qemu-system-x86_64</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='threads'/>
      <source dev='/dev/zvol/zdata/vdis/torun'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <boot order='1'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>

root@gdansk ~ # zfs list zdata/vdis/torun
NAME               USED  AVAIL  REFER  MOUNTPOINT
zdata/vdis/torun  26.8G  2.61T  2.05G  -

root@gdansk ~ # zfs get all zdata/vdis/torun
NAME              PROPERTY              VALUE                  SOURCE
zdata/vdis/torun  type                  volume                 -
zdata/vdis/torun  creation              Fri May 12  0:36 2017  -
zdata/vdis/torun  used                  26.8G                  -
zdata/vdis/torun  available             2.61T                  -
zdata/vdis/torun  referenced            2.05G                  -
zdata/vdis/torun  compressratio         1.00x                  -
zdata/vdis/torun  reservation           none                   default
zdata/vdis/torun  volsize               24G                    local
zdata/vdis/torun  volblocksize          8K                     default
zdata/vdis/torun  checksum              on                     default
zdata/vdis/torun  compression           off                    inherited from zdata/vdis
zdata/vdis/torun  readonly              off                    default
zdata/vdis/torun  copies                1                      default
zdata/vdis/torun  refreservation        24.8G                  local
zdata/vdis/torun  primarycache          all                    default
zdata/vdis/torun  secondarycache        all                    default
zdata/vdis/torun  usedbysnapshots       8.06M                  -
zdata/vdis/torun  usedbydataset         2.05G                  -
zdata/vdis/torun  usedbychildren        0B                     -
zdata/vdis/torun  usedbyrefreservation  24.7G                  -
zdata/vdis/torun  logbias               latency                default
zdata/vdis/torun  dedup                 off                    default
zdata/vdis/torun  mlslabel              none                   default
zdata/vdis/torun  sync                  standard               default
zdata/vdis/torun  refcompressratio      1.00x                  -
zdata/vdis/torun  written               43.8M                  -
zdata/vdis/torun  logicalused           2.04G                  -
zdata/vdis/torun  logicalreferenced     2.03G                  -
zdata/vdis/torun  snapshot_limit        none                   default
zdata/vdis/torun  snapshot_count        none                   default
zdata/vdis/torun  snapdev               hidden                 default
zdata/vdis/torun  context               none                   default
zdata/vdis/torun  fscontext             none                   default
zdata/vdis/torun  defcontext            none                   default
zdata/vdis/torun  rootcontext           none                   default
zdata/vdis/torun  redundant_metadata    all                    default


root@gdansk ~ # zpool status zdata
  pool: zdata
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 57h30m with 0 errors on Fri May 13 16:50:16 2016
config:

        NAME                                          STATE     READ WRITE CKSUM
        zdata                                         ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0178587  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0181554  ONLINE       0     0     0
          mirror-1                                    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0196162  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0182472  ONLINE       0     0     0
        logs
          nvme0n1p3                                   ONLINE       0     0     0
        cache
          nvme0n1p11                                  ONLINE       0     0     0


I think this should be easily reproducible, and I am happy to share more details (e.g. the VM image, the VM libvirt xml definition, my full kernel config, etc.). This is in fact why I raised this as a separate issue, so there is space for all these details (as the issue itself is not unique).

@Bronek
Author

Bronek commented May 12, 2017

Here is the full libvirt xml definition of the guest machine:

<domain type='kvm'>
  <name>torun-spice</name>
  <uuid>5dce2e26-79b3-4edc-9927-a2dd145c6694</uuid>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='8'/>
    <vcpupin vcpu='1' cpuset='9'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.9'>hvm</type>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>IvyBridge</model>
    <topology sockets='1' cores='2' threads='1'/>
  </cpu>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/sbin/qemu-system-x86_64</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='threads'/>
      <source dev='/dev/zvol/zdata/vdis/torun'/>
      <target dev='vda' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/data/vdis/isos/FreeBSD-10.3-RELEASE-amd64-disc1.iso'/>
      <target dev='hda' bus='ide'/>
      <readonly/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='ide' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:f5:19:1b'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='spice' port='5911' autoport='no' listen='0.0.0.0'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <sound model='ich6'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </sound>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <redirdev bus='usb' type='spicevmc'>
      <address type='usb' bus='0' port='1'/>
    </redirdev>
    <redirdev bus='usb' type='spicevmc'>
      <address type='usb' bus='0' port='2'/>
    </redirdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </memballoon>
  </devices>
</domain>

@behlendorf
Contributor

Can you try setting /sys/module/zfs/parameters/zvol_request_sync=1? This should revert to the default behavior from -rc3.

@dweeezil
Contributor

Another data point: I've been doing a lot of work with FreeBSD 10.3 guests under libvirt/qemu-kvm lately, so this was pretty easy for me to try to duplicate on my daily driver system, which is much lower specced than @Bronek's. Running "pkg install vim" in a pretty-much-dead-stock FreeBSD 10.3 guest, configured identically, the host system never hiccups at all and I was only able to catch txg_sync in D state a couple of times. It's running the equivalent of 0.7.0-rc4 plus trim and a couple of uninteresting patches on a 4.1.13 (Fedora 25) system. One potential difference is that my Fedora 25 qemu only supports virtual hardware version pc-i440fx-2.7, whereas @Bronek's is running 2.9.

@Bronek, in addition to enabling zvol sync request mode, it might be interesting to cat /proc/spl/taskq and cat /proc/spl/kstat/zfs/dmu_tx while this problem is happening. Also, what version of ZoL were you running before 0.7.0-rc4, when this problem was presumably not happening (you never specifically mentioned 0.7.0-rc3)?

@Bronek
Author

Bronek commented May 13, 2017

I was running -rc3 previously, and this did not happen before. However, this was also my first try of a FreeBSD VM guest on this host.

Also, it turns out it is not as easily reproducible as I thought - I just tried this morning (no changes in the system configuration, not even zvol_request_sync, which I checked is indeed still 0) and the vim installation in the FreeBSD guest went smoothly. This was right after the host machine was started (I usually do not keep it on overnight), so my guess is that this might be related to ARC load. When the problem happened yesterday, the computer had been in use for some hours and its ARC would have been loaded with other ZVOLs. I will try again later today, after the kids have played their "obligatory share" of Steam games. These are played on Windows VMs hosted on ZVOLs on the same machine - I have to put these CPUs to some use ;)

I do not know if this is very useful, but I was running date && cat /proc/spl/taskq && sleep 2s in a loop while installing vim in my FreeBSD guest, and noticed that the taskq occasionally contains a huge number of pending tasks (the list ends with "truncated"), for example:

Sat 13 May 09:30:14 BST 2017
taskq                       act  nthr  spwn  maxt   pri  mina         maxa  cura      flags
spl_delay_taskq/0             0     1     0    32   100    32   2147483647    32   80000005
        delay: spa_deadman [zfs](0xffff88061ea7c000)
zvol/0                       32    32     0    32   100    64   2147483647 42773   80000005
        active: [30017]zvol_write [zfs](0xffff88154b439dc0) [5544]zvol_write [zfs](0xffff88154b439ce0)
                [5542]zvol_write [zfs](0xffff88154b4396c0) [5545]zvol_write [zfs](0xffff88154b439ee0)
                [30015]zvol_write [zfs](0xffff88154b439080) [30003]zvol_write [zfs](0xffff88154b439700)
                [30027]zvol_write [zfs](0xffff88154b439180) [30007]zvol_write [zfs](0xffff88154b439900)
                [30020]zvol_write [zfs](0xffff88154b439e40) [30013]zvol_write [zfs](0xffff88154b439f40)
                [30025]zvol_write [zfs](0xffff88154b439c00) [5546]zvol_write [zfs](0xffff88154b439b40)
                [5543]zvol_write [zfs](0xffff88154b439e60) [5540]zvol_write [zfs](0xffff88154b439140)
                [30021]zvol_write [zfs](0xffff88154b439d40) [30028]zvol_write [zfs](0xffff88154b4392c0)
                [29617]zvol_write [zfs](0xffff88154b439600) [30009]zvol_write [zfs](0xffff88154b439860)
                [30001]zvol_write [zfs](0xffff88154b439cc0) [30026]zvol_write [zfs](0xffff88154b439260)
                [30002]zvol_write [zfs](0xffff88154b4395e0) [5547]zvol_write [zfs](0xffff88154b439ae0)
                [30011]zvol_write [zfs](0xffff88154b439d80) [30014]zvol_write [zfs](0xffff88154b439c20)
                [5539]zvol_write [zfs](0xffff88154b4392e0) [30030]zvol_write [zfs](0xffff88154b4393a0)
                [368]zvol_write [zfs](0xffff88154b439720) [5541]zvol_write [zfs](0xffff88154b439940)
                [30016]zvol_write [zfs](0xffff88154b439c60) [30006]zvol_write [zfs](0xffff88154b439f60)
                [30024]zvol_write [zfs](0xffff88154b439a20) [30008]zvol_write [zfs](0xffff88154b439480)
        pend: zvol_write [zfs](0xffff88154b4399c0) zvol_write [zfs](0xffff88154b439bc0)
              zvol_write [zfs](0xffff88154b439280) zvol_write [zfs](0xffff88154b439b80)
              zvol_write [zfs](0xffff88154b439880) zvol_write [zfs](0xffff88154b439de0)
              zvol_write [zfs](0xffff88154b439a60) zvol_write [zfs](0xffff88154b439ea0)
              zvol_write [zfs](0xffff88154b4394c0) zvol_write [zfs](0xffff88154b439120)
              zvol_write [zfs](0xffff88154b439da0) zvol_write [zfs](0xffff88154b439fc0)
              zvol_write [zfs](0xffff88154b439ca0) zvol_write [zfs](0xffff88154b439840)
              zvol_write [zfs](0xffff88154b439e00) zvol_write [zfs](0xffff88154b4394e0)
              zvol_write [zfs](0xffff88154b4399e0) zvol_write [zfs](0xffff88154b439360)
              zvol_write [zfs](0xffff88154b4390a0) zvol_write [zfs](0xffff88154b4393c0)
              zvol_write [zfs](0xffff88154b439a40) zvol_write [zfs](0xffff88154b4398a0)
              zvol_write [zfs](0xffff88154b439f80) zvol_write [zfs](0xffff88154b439320)
              zvol_write [zfs](0xffff88154b439ec0) zvol_write [zfs](0xffff88154b439500)
              zvol_write [zfs](0xffff88154b439980) zvol_write [zfs](0xffff88154b4396a0)
              zvol_write [zfs](0xffff88154b439300) zvol_write [zfs](0xffff88154b439c80)
              zvol_write [zfs](0xffff88154b439d20) zvol_write [zfs](0xffff88154b4391a0)
              zvol_write [zfs](0xffff88154b439240) zvol_write [zfs](0xffff88154b4396e0)
              zvol_write [zfs](0xffff88154b439100) zvol_write [zfs](0xffff88154b439920)
              zvol_write [zfs](0xffff88154b4397e0) zvol_write [zfs](0xffff88154b439560)
              zvol_write [zfs](0xffff88161606db40) zvol_write [zfs](0xffff88161606de80)
              zvol_write [zfs](0xffff88161606d8e0) zvol_write [zfs](0xffff88161606da60)
              zvol_write [zfs](0xffff88161606de20) zvol_write [zfs](0xffff88161606d480)
              zvol_write [zfs](0xffff88161606de00) zvol_write [zfs](0xffff88161606d260)
              zvol_write [zfs](0xffff88161606dd60) zvol_write [zfs](0xffff88161606d800)
              zvol_write [zfs](0xffff88161606dee0) zvol_write [zfs](0xffff88161606d720)
.
.
.
              zvol_write [zfs](0xffff8816036001c0) zvol_write [zfs](0xffff881603600c20)
              zvol_write [zfs](0xffff8816036008c0) zvol_write [zfs](0xffff881603600060)
              zvol_write [zfs](0xffff881603600280) zvol_write [zfs](0xffff881603600220)
              zvol_write [zfs](0xffff8816036009c0) zvol_write [zfs](0xffff881603600640)
              zvol_write [zfs](0xffff881603600ca0) zvol_write [zfs](0xffff881603600020)
        (truncated)

@Bronek
Author

Bronek commented May 13, 2017

OK, so I am now able to reproduce this issue; it simply requires dumping a lot of IO onto virtualized (i.e. mapped to ZVOLs) disks in my virtual machines. One way I've found is to run different disk benchmarks (I am using "AS SSD" and DiskMark64, running under virtual Windows instances) on different disks at the same time. So far I've not switched zvol_request_sync to 1; next I will switch to this setting and then try to reproduce again. I also noticed that, when the machine enters this bad state, both cat /proc/spl/kstat/zfs/dmu_tx and cat /proc/spl/taskq get blocked and produce no output, while all IO worker threads in all virtual machines enter "D" state (these are actually named "worker" in the log below).

In the attached kernel log you can see the kernel's own reporting of blocked tasks, and then from line 292 (kernel time 2815.498766) me using sysrq to capture the host state before powering it down (accidentally; I was aiming for sysrq+p).

minicom135.txt

@Bronek
Author

Bronek commented May 13, 2017

@behlendorf I can confirm that after setting zvol_request_sync to 1 this problem does not happen. I've thrown a lot of random-write disk IO at various ZVOLs from multiple VMs at the same time, and my host remained responsive.

@tuxoko
Contributor

tuxoko commented May 15, 2017

@Bronek
What was your zfs_dirty_data_max on rc4?
If it was 4G, can you try 2G or 1G?

@behlendorf
Contributor

@Bronek another thing I'd be interested in would be decreasing zvol_threads from the default 32 to nproc with async enabled. If possible we want to make sure this is tuned so that the default values work and yield good performance for the majority of workloads.

@tuxoko
Contributor

tuxoko commented May 15, 2017

@behlendorf
Perhaps we need queue depth logic in the zvol layer so it won't keep growing.

Edit: Or we can go back to using the request queue, which seems to have that.

@ryao
Contributor

ryao commented May 15, 2017

@tuxoko I don't think changing zfs_dirty_data_max is a solution here. We are heavily using zvols at work with a patch (that predates my joining) that reimplements zvol threads on top of 0.6.5.9. My experience is that lowering zfs_dirty_data_max makes performance worse, rather than better. It does make it more consistent, but it is also consistently worse. Performance fluctuations in our current branch off of 0.6.5.9 + #6133 at work are an issue that I am actively investigating, although they aren't directly related to this.

@behlendorf I am still a fan of modifying zvols to rely on DMU dispatch. I think that will solve this by combining the best of both worlds. That looks like the next major bottleneck at work after #6133 is done. Hopefully, the patch for making zvols rely on DMU dispatch will take less time than ZIL pipelining.

@tuxoko
Contributor

tuxoko commented May 15, 2017

@ryao
It depends on the value and your disk config, but if zfs_dirty_data_max is too large you'll get a lot of multi-second or even multi-minute hangs, which make the overall performance much worse. That was before it was limited to 4G, though. Still, I've tested on some systems that performed noticeably better with 2G.

But either way, we should throttle the incoming threads so we won't overflow the queue.

@ryao
Contributor

ryao commented May 15, 2017

@tuxoko We do throttle incoming IOs in the elevator. If it is doing a good job, we should not need to adjust zfs_dirty_data_max. We also already initiate txg_sync whenever dirty data exceeds zfs_dirty_data_sync. Lowering zfs_dirty_data_max will limit the opportunity for bad write out behavior to cause a wall of wait, but it does not fix the underlying problem.

@behlendorf
Contributor

reimplements zvol threads on top of 0.6.5.9

If you have a patch for this that you could share, it would be useful to facilitate getting benchmark numbers for the threading change alone.

rely on DMU dispatch

Could you describe what you have in mind?

But either way, we should throttle incoming thread so we won't overflow the queue.

Agreed. We should verify whether the existing nr_requests from the block layer is working properly. We've clearly got a ton of threads (well, 32) correctly waiting in txg_wait_open(), and because the taskq queue is effectively bottomless there appears to be no back pressure on the application, so we can't drain the queues fast enough.

A nice way to potentially handle this might be to wire up /sys/block/<dev>/queue/nr_requests to the maxalloc size in taskq_create(). Although if we do this we'd want to improve the waiting code in task_alloc().
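
To make that concrete, here is a minimal sketch (not an actual patch) of what bounding the taskq could look like, assuming the SPL taskq_create(name, nthreads, pri, minalloc, maxalloc, flags) signature; zvol_max_pending and zvol_taskq_init() are hypothetical names standing in for the value mirrored from nr_requests and for wherever the taskq is created today:

/*
 * Sketch only: bound the zvol taskq's pending entries instead of leaving
 * the queue effectively bottomless.  zvol_max_pending is a hypothetical
 * tunable standing in for whatever gets mirrored from
 * /sys/block/<dev>/queue/nr_requests.
 */
static unsigned int zvol_max_pending = 128;

static taskq_t *zvol_taskq;

static int
zvol_taskq_init(void)
{
	zvol_taskq = taskq_create(ZVOL_DRIVER, zvol_threads, maxclsyspri,
	    zvol_threads, zvol_max_pending,	/* maxalloc bounds pending tasks */
	    TASKQ_PREPOPULATE | TASKQ_DYNAMIC);

	return (zvol_taskq == NULL ? -ENOMEM : 0);
}

With maxalloc bounded, dispatching callers would end up waiting in task_alloc() once the queue fills, which is exactly where the improved waiting code mentioned above would come in.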

@ryao
Contributor

ryao commented May 15, 2017

If you have a patch for this you could share that would be useful to facilitate getting benchmark numbers from only the threading change.

Since I just started work less than a month ago, I think it would be best if I ask my manager for his okay on releasing old patches (although there isn't much old stuff to release) as I am still fairly new at work. However, that should be a formality. The person I need to ask is currently in GMT+8. I am not sure if I will hear back from him within the next few hours. I will push a branch with it as soon as I get an okay.

Could you describe what you have in mind.

Instead of having zvol threads, we would just execute an asynchronous DMU operation with a callback and return whenever the operation is not synchronous. Then, when the callback is invoked, we would invoke BIO_END_IO(bio, -error). This allows userland software that expects asynchronous behavior to be asynchronous while allowing userland software that expects synchronous behavior to be synchronous. This was my original vision for how zvols would work in the future. It got sidetracked as my last job didn't care about zvols and wore me out, so I did not really have spare time to work on it. zvol performance is a high priority at my new job though.
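
A rough sketch of the idea, purely for illustration: dmu_write_async() and its callback are hypothetical placeholders (no such DMU API exists today), while BIO_END_IO(), BIO_BI_SECTOR() and BIO_BI_SIZE() are the existing ZoL bio compatibility macros:

/*
 * Conceptual sketch only.  dmu_write_async() is a hypothetical asynchronous
 * DMU entry point; the submitting thread returns as soon as the write has
 * been handed to the DMU, and the bio is completed from the DMU callback
 * instead of from a dedicated zvol worker thread.
 */
static void
zvol_write_done(struct bio *bio, int error)
{
	/* Runs in DMU completion context; ends the bio for the submitter. */
	BIO_END_IO(bio, -error);
}

static void
zvol_write_async(zvol_state_t *zv, struct bio *bio)
{
	uint64_t offset = BIO_BI_SECTOR(bio) << 9;
	uint64_t size = BIO_BI_SIZE(bio);

	/* Hypothetical API: queue the write, get called back on completion. */
	dmu_write_async(zv->zv_objset, ZVOL_OBJ, offset, size, bio,
	    zvol_write_done);

	/*
	 * A synchronous request (FUA, or sync=always) would still need to
	 * wait for zil_commit() before completing.
	 */
}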

@Bronek
Author

Bronek commented May 15, 2017

Re. zvol_threads, see below:

root@gdansk ~ # cat /sys/module/zfs/parameters/zvol_threads
32
root@gdansk ~ # nproc
16
root@gdansk ~ # nproc --all
32

I use isolcpus to dedicate half of all available cores to VCPUs only. Does echo 8 > /sys/module/zfs/parameters/zvol_threads make sense for testing?

Re. zfs_dirty_data_max, perhaps it matters that I use a 3GB dedicated SLOG device on NVMe? Anyway, the current configuration is:

root@gdansk ~ # cat /sys/module/zfs/parameters/zfs_dirty_data_max
4294967296

The current synchronous performance is not brilliant, but it is not awful either; I do not have a robust enough benchmark. I imagine a good benchmark would involve small random writes from multiple threads writing to multiple ZVOLs. I do not think a VM actually needs to be involved; direct block access from user mode should be sufficient, but different write modes should be exercised.
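
For illustration only, a benchmark along those lines could look like the sketch below: several threads issuing small aligned random writes directly to a zvol device node. The device path, thread count and sizes are placeholders, and it should only ever be pointed at a scratch zvol:

/*
 * Benchmark sketch only: several threads issuing small aligned random
 * writes directly to a zvol block device.  DEV, NTHREADS, BLOCK, SPAN and
 * WRITES are placeholders; point this at a scratch zvol, never a real one.
 *
 * Build with:  cc -O2 -pthread -o zvol-randwrite zvol-randwrite.c
 */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEV		"/dev/zvol/zdata/vdis/scratch"	/* placeholder device */
#define NTHREADS	8
#define BLOCK		4096			/* small random writes */
#define SPAN		(1ULL << 30)		/* first 1 GiB of the volume */
#define WRITES		100000			/* per thread */

static void *
writer(void *arg)
{
	unsigned int seed = (unsigned int)(uintptr_t)arg;
	void *buf = NULL;
	int fd = open(DEV, O_WRONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, BLOCK, BLOCK) != 0) {
		perror("setup");
		return (NULL);
	}
	memset(buf, 0xab, BLOCK);

	for (long i = 0; i < WRITES; i++) {
		/* block-aligned random offset within SPAN */
		off_t off = (off_t)(rand_r(&seed) % (SPAN / BLOCK)) * BLOCK;

		if (pwrite(fd, buf, BLOCK, off) != BLOCK) {
			perror("pwrite");
			break;
		}
	}
	free(buf);
	close(fd);
	return (NULL);
}

int
main(void)
{
	pthread_t tid[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, writer, (void *)(uintptr_t)(i + 1));
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return (0);
}

Varying the open() flags (O_DIRECT, O_DSYNC) and the block size would exercise the different write modes mentioned above.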

@ryao
Contributor

ryao commented May 15, 2017

@Bronek #6133 is intended to fix poor concurrent synchronous performance on zvols. Filesystems benefit too.

@Bronek
Author

Bronek commented May 15, 2017

@ryao yes, I read this much from your pull request :) The issue here is obviously asynchronous load, which is virtually killing the machine under heavy random writes. Since I'm not keen on having this machine blocked when it is used by someone else, I switched it to synchronous (unless I'm doing testing myself).

@ryao
Contributor

ryao commented May 15, 2017

@Bronek In that case, the IOs aren't forced sync and it wouldn't benefit from it. Anyway, keep your eyes peeled for my next zvol patch.

P.S. I realized that there was a mistake in my previous remark, so I deleted it while I rewrote it. You saw it and replied before I had the chance to finish typing.

@tuxoko
Contributor

tuxoko commented May 15, 2017

@behlendorf
The problem with nr_requests is that we currently bypass the request_queue. If we go back to using it, it should throttle the incoming threads for us.

@ryao
Contributor

ryao commented May 15, 2017

@behlendorf Why would the solution here be any different than the solution on a regular file? Internally, it is basically the same thing, except we can't fall back on nr_requests with a regular file.

@tuxoko
Contributor

tuxoko commented May 15, 2017

@ryao
The difference is that for a block device submit_bio is an asynchronous interface, while write(2) is a synchronous interface. So an incoming thread from write(2) will get throttled when trying to open a txg. But for an incoming thread from submit_bio there's no such throttling; the throttling only happens after the IO is dispatched to a zvol thread, by which point the incoming thread has already returned.

@ryao
Contributor

ryao commented May 15, 2017

@tuxoko I see what you are saying. The number of things waiting in the taskq for execution is essentially unbounded, so we can end up with a terrible wall of wait. The way I see us fixing this would be killing the zvol threads and implementing an asynchronous DMU dispatch. We are adding unnecessary complexity by using zvol threads. This is the sort of thing that I had wanted to get away from doing in the past. It is hard to make zvol worker threads work well, and even if we get them to work as well as they can, they are still inferior to a solution that doesn't use them. Double dispatch is expensive.

@ryao
Contributor

ryao commented May 15, 2017

@behlendorf It occurs to me that this problem affects swap on zvols as well. Paging out to a "swap" device is supposed to reduce machine memory usage, but the effect of the zvol taskq is that we get increasing memory usage at a time when we are supposed to be decreasing it. It will eventually go down, but this is like pouring gasoline on a fire just before it runs out of flammable material. Some sort of throttling should help, although we had that before 692e55b broke it. As a workaround until we have a proper solution, I suggest we turn this functionality off by default.

Edit: Also, I suspect that this is one of multiple problems we are having scaling up zvol performance at work, so effective immediately I am starting work on a patch. This is work being done full time as part of my day job rather than a hobby project, so hopefully I should have a working patch soon. #6133 isn't quite finished yet, but I expect to wait on the buildbot for a few hours anyway. Failing to improve zvol performance is going to cause a large company to use a proprietary storage solution, so I have an extra incentive to make an improvement here ASAP.

@ryao
Contributor

ryao commented May 20, 2017

@behlendorf My original idea turned out to be less than ideal. While it was possible, it would have bypassed txg_sync by writing out early, which could cause IO amplification. I am pursuing a variation of the current ideas where the range locking is done and the transactions are opened synchronously, while the actual DMU operation and zil_commit are done in a worker thread. My expectation is that this will give us the best of both worlds.

@behlendorf
Contributor

@ryao what you're describing is very similar to the current behavior, with zvol_request() synchronously taking the range lock and offloading the tx handling to the taskq. The key bit missing is a per-zvol max queue depth, which needs to be implemented.

@behlendorf behlendorf added this to the 0.7.0 milestone May 26, 2017
@ryao
Contributor

ryao commented Jun 9, 2017

@Bronek @kpande I suggest trying out #6207. It implements the fix that I described.

@behlendorf It is very similar. Per-zvol max queue depths could likely be implemented with a condition variable, a mutex and a counter. I'll experiment with it and get back to you on it. :)
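
For what it's worth, a minimal sketch of that primitive might look like the following; zv_pending, zv_pending_max, zv_pending_lock and zv_pending_cv are hypothetical additions to zvol_state_t (none of these fields exist today), and the mutex/cv calls are the usual SPL primitives:

/*
 * Sketch only: a per-zvol in-flight counter guarded by a mutex and a
 * condition variable.  The submitter stalls once the hypothetical
 * zv_pending_max limit (e.g. mirrored from nr_requests) is reached.
 */
static void
zvol_request_enter(zvol_state_t *zv)
{
	mutex_enter(&zv->zv_pending_lock);
	while (zv->zv_pending >= zv->zv_pending_max) {
		/* Stall the submitting thread until the queue depth drops. */
		cv_wait(&zv->zv_pending_cv, &zv->zv_pending_lock);
	}
	zv->zv_pending++;
	mutex_exit(&zv->zv_pending_lock);
}

static void
zvol_request_exit(zvol_state_t *zv)
{
	mutex_enter(&zv->zv_pending_lock);
	zv->zv_pending--;
	cv_signal(&zv->zv_pending_cv);	/* wake one stalled submitter */
	mutex_exit(&zv->zv_pending_lock);
}

zvol_request() would call the enter side before dispatching to the taskq, and the completion path would call the exit side; a counter bumped inside the wait loop would also give the stall kstat suggested below.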

@ryao ryao mentioned this issue Jun 9, 2017
@richardelling
Contributor

Yes, you need to stall the request until the depth decreases. While you're there... please add a kstat to count how often we stall.

@behlendorf
Contributor

@ryao if possible it would be ideal to use the standard /sys/block/<dev>/queue/nr_requests interface to control the maximum depth.

@mailinglists35

@behlendorf is this still an issue or was it fixed eventually?

@Bronek
Author

Bronek commented May 17, 2018

@mailinglists35 I disabled asynchronous zvol requests some time ago - I will get back to you after I have enabled them again and set zvol_threads to 8. FWIW I am currently running version 0.7.9 with kernel 4.14.41.

@Bronek
Author

Bronek commented May 18, 2018

Having tested under some load, my impression so far is that the current version is much better in this respect. It is hard to say how much of that is due to zvol_threads=8, and I would not say the situation is optimal. The host remained stable, functional and usable throughout testing, even though the load was pretty high. The guest virtual machine hung on the third test and had to be terminated. The termination did not take effect until a few minutes after the request, despite vfio's own IO threads being killed, but the system remained usable and stable throughout. Also, the performance figures obtained from the tests were sensible.

@behlendorf
Contributor

@Bronek thanks for testing this again. It sounds like we should open a PR which sets the default zvol_threads to the number of CPUs, which would be a reasonable default value for most systems.

@Bronek
Author

Bronek commented May 18, 2018

@behlendorf FWIW, my system is:

root@gdansk ~ # nproc
16
root@gdansk ~ # nproc --all
32
root@gdansk ~ # cat /proc/cmdline
BOOT_IMAGE=../vmlinuz-linux zfs=zroot rw zfs_import_dir=/dev/disk/by-id zfs_wait=60 spl.spl_hostid=0xa8c03302 console=ttyS0,115200N8R nomodest video=vesa:off video=efifb:off vga=normal udev.children-max=32 edac_core.edac_mc_panic_on_ue=1 intel_iommu=on iommu=pt intel_pstate=passive rcu_nocbs=4-7,12-15,20-23,28-31 nohz_full=4-7,12-15,20-23,28-31 isolcpus=4-7,12-15,20-23,28-31 initrd=../intel-ucode.img,../initramfs-linux.img

I have now changed zvol_threads to 16, but will hold off on more testing until I have details of the fix mentioned by @kpande (which PR?).

@janetcampbell

With kernel-induced ZVOL write RMW out of the way (see #8590), I find that zvol_request_sync=1 is usually the best way to go. Latency and throughput are both extremely good and the throttle works properly.

@stale

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 24, 2020
@stale stale bot closed this as completed Nov 25, 2020