
txg_sync and z_rd_int/* become top I/O consumers, the host renders unusable for a period of time #1538

Closed
alexclear opened this issue Jun 20, 2013 · 23 comments
Labels
Type: Performance (Performance improvement or performance problem)

Comments

@alexclear

I have a dual Xeon E5645 system with 8 SATA disks; /boot is on a USB drive and / is on ZFS:

root@debhost1 ~ zfs list
NAME                                 USED  AVAIL  REFER  MOUNTPOINT
vol0                                2.50T  1.96T   350M  /vol0
vol0/kvm-Common-Host                 775G  1.96T   186G  /var/lib/libvirt/images/Common-Host
vol0/kvm-Dev6-Host                  9.28G  1.96T  9.28G  /var/lib/libvirt/images/Dev6-Host
vol0/kvm-Dev7-Host                  9.25G  1.96T  9.25G  /var/lib/libvirt/images/Dev7-Host
vol0/kvm-HBase-Slave-1              99.1G  1.96T  23.1G  /var/lib/libvirt/images/HBase-Slave-1
vol0/kvm-HBase-Slave-2               107G  1.96T  23.8G  /var/lib/libvirt/images/HBase-Slave-2
vol0/kvm-W2K8                        577G  1.96T   184G  /var/lib/libvirt/images/W2K8
vol0/libvirt-images                  130G  1.96T    31K  /vol0/libvirt-images
vol0/libvirt-images/Alex-Client     23.2G  1.96T  9.74G  /var/lib/libvirt/images/Alex-Client
vol0/libvirt-images/Dev3-Host       27.1G  1.96T  27.1G  /var/lib/libvirt/images/Dev3-Host
vol0/libvirt-images/Dev4-Host       23.6G  1.96T  23.6G  /var/lib/libvirt/images/Dev4-Host
vol0/libvirt-images/Dev5-Host       20.2G  1.96T  20.2G  /var/lib/libvirt/images/Dev5-Host
vol0/libvirt-images/Dev7-Host       91.2M  1.96T  91.2M  /vol0/libvirt-images/Dev7-Host
vol0/libvirt-images/Dev8-Host       2.73G  1.96T  2.73G  /var/lib/libvirt/images/Dev8-Host
vol0/libvirt-images/HBase-Master    9.65G  1.96T  5.23G  /var/lib/libvirt/images/HBase-Master
vol0/libvirt-images/W7-01d          9.97G  1.96T  9.97G  /var/lib/libvirt/images/W7-01d
vol0/libvirt-images/W7-Scanner-New  12.9G  1.96T  12.9G  /var/lib/libvirt/images/W7-Scanner-New
vol0/rootfs                          260G  1.96T   189G  /
vol0/var                             579G  1.96T   579G  /var2
root@debhost1 ~ 

The zpool is:

root@debhost1 ~ zpool status
  pool: vol0
 state: ONLINE
  scan: scrub repaired 0 in 44h43m with 0 errors on Mon May 13 00:06:19 2013
config:

    NAME                                           STATE     READ WRITE CKSUM
    vol0                                           ONLINE       0     0     0
      mirror-0                                     ONLINE       0     0     0
        scsi-SATA_WDC_WD1003FBYX-_WD-WMAW30418757  ONLINE       0     0     0
        scsi-SATA_WDC_WD1003FBYX-_WD-WMAW31056873  ONLINE       0     0     0
      mirror-1                                     ONLINE       0     0     0
        scsi-SATA_WDC_WD1003FBYX-_WD-WMAW31066713  ONLINE       0     0     0
        scsi-SATA_WDC_WD1003FBYX-_WD-WCAW33977049  ONLINE       0     0     0
      mirror-2                                     ONLINE       0     0     0
        scsi-SATA_WDC_WD1003FBYX-_WD-WMAW31077083  ONLINE       0     0     0
        scsi-SATA_WDC_WD1003FBYX-_WD-WCAW35326792  ONLINE       0     0     0
      mirror-3                                     ONLINE       0     0     0
        scsi-SATA_WDC_WD2000FYYZ-_WD-WCC1P0020205  ONLINE       0     0     0
        scsi-SATA_WDC_WD2000FYYZ-_WD-WCC1P0021665  ONLINE       0     0     0
    cache
      ata-OCZ-AGILITY4_7M7U6Y6X7E6Q3K2D3E3U-part1  ONLINE       0     0     0

errors: No known data errors
root@debhost1 ~ 

This is a Debian 7.0-based system w/ZoL 0.6.1-1:

root@debhost1 ~ uname -a
Linux debhost1 3.2.0-4-amd64 #1 SMP Debian 3.2.41-2 x86_64 GNU/Linux

It is used mainly for virtualization; most datasets have sync=disabled and recordsize=4K.
The ARC is capped at 20G; the ARC utilization graph follows:

[graph: zfs_stats_utilization-day]
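
For reference, a minimal sketch of how settings like these are typically applied (the dataset name is one from the listing above and the 20G value matches the stated cap; the exact commands are assumed, not copied from this host):

zfs set recordsize=4K vol0/kvm-Common-Host       # repeated for each VM dataset
zfs set sync=disabled vol0/kvm-Common-Host
echo 21474836480 > /sys/module/zfs/parameters/zfs_arc_max    # 20 GiB ARC cap at runtime
# persistent form: "options zfs zfs_arc_max=21474836480" in /etc/modprobe.d/zfs.conf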

L2ARC size and efficiency graphs follow:

[graphs: zfs_stats_l2utilization-day, zfs_stats_l2efficiency-day]

Disk load is almost constant during working hours, but we suffer from sporadic performance drops which do not seem to correlate with the actual load generated by the virtual machines, or at least we have not been able to find the correlation yet.
Output of iotop -d 5 -P looks like this during these performance drop periods:

[screenshot: iotop during a performance drop (screenshot-txg_sync)]

We also collected iostat -x -k 2 20 output here: https://gist.github.com/alexclear/5821955 and zpool iostat -v 2 20 (at the same time) output here: https://gist.github.com/alexclear/5821960

It seems the system experiences a lot of internal I/O (mostly random reads) during these periods that is not consumed directly by userspace processes. It is also very odd that L2ARC efficiency remains quite low.

@porbas

porbas commented Jun 21, 2013

Hi, where did you get the ZoL Munin plugins?

@alexclear
Author

@porbas I ported the plugin from FreeBSD, you can grab it here: https://raw.github.com/alexclear/ZoL-munin-plugin/master/zfs_stats_

@porbas

porbas commented Jun 21, 2013

@alexclear Thank you. You are great! :-)

@behlendorf
Contributor

@alexclear You might try setting the module option metaslab_debug=1. It sounds as if the txg_sync thread might be loading and unloading numerous metaslabs from disk looking for free space. Setting the option changes the behavior so that the metaslabs stay cached, so your memory footprint may go up; however, it should prevent all that internal I/O.
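
A minimal sketch of the two usual ways to apply such a module option (the mechanism is generic module-parameter handling; whether the knob is writable at runtime depends on the build):

echo 1 > /sys/module/zfs/parameters/metaslab_debug                 # runtime, lost on module reload
echo "options zfs metaslab_debug=1" >> /etc/modprobe.d/zfs.conf    # applied at module load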

@FransUrbo
Contributor

@alexclear is this still a problem with latest HEAD?

@behlendorf added the Type: Performance label and removed the Bug label Oct 16, 2014
@behlendorf removed this from the 0.6.4 milestone Oct 16, 2014
@pdf

pdf commented Nov 13, 2015

I appear to be seeing this on 0.6.5.3 - load average is currently 40+, pool is doing 1GB/s read throughput, all of it generated by z_rd_*.

If someone sees this in the very near future and there are details I can capture while this is occurring, please let me know. Otherwise, I'm going to have to bounce this host rather soon. iotop -P:

[screenshot: iotop]

@pdf

pdf commented Nov 14, 2015

Whoops, just realized it's scrubbing. Still, it seems like the scrub is starving other processes (load avg 40, 35-40% iowait), which I don't recall being this dramatic a problem in the past.

@behlendorf
Contributor

@pdf this may be a result of the change in scheduling priority for the ZFS threads (1229323) and an increase in their number (aa9af22).

@pdf

pdf commented Nov 20, 2015

Those seem to be likely suspects. This does make scrubs impractical in production, which is problematic.

@behlendorf
Contributor

Some tuning may be required, but real user IO will still be given priority.
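
For example (the choice of knobs and values here is an assumption, not a recommendation from this thread), the scrub I/O class can be reined in through module parameters on 0.6.5:

echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active   # at most one in-flight scrub I/O per vdev
echo 8 > /sys/module/zfs/parameters/zfs_scrub_delay             # longer back-off between scrub I/Os when user I/O is pending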

@aktau

aktau commented Dec 5, 2015

Oddly, I've also noticed something similar, but I verified that this was not happening during a scrub:

$ sudo pidstat -d 5
Linux 4.2.0-1-amd64 (kdi)   12/06/2015  _x86_64_    (2 CPU)

12:16:34 AM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
12:16:39 AM     0        31      0.00      0.00      0.00       1  kswapd0
12:16:39 AM     0       150      0.00     16.00      0.00       1  jbd2/sda1-8
12:16:39 AM     0       669   1838.40      0.00      0.00       0  z_rd_int_0
12:16:39 AM     0       670   2048.00      0.00      0.00       0  z_rd_int_1
12:16:39 AM     0       671   1971.20      0.00      0.00       0  z_rd_int_2
12:16:39 AM     0       672   2037.60      0.00      0.00       0  z_rd_int_3
12:16:39 AM     0       673   1715.20      0.00      0.00       0  z_rd_int_4
12:16:39 AM     0       674   2068.80      0.00      0.00       0  z_rd_int_5
12:16:39 AM     0       675   1664.00      0.00      0.00       0  z_rd_int_6
12:16:39 AM     0       676   1995.20      0.00      0.00       0  z_rd_int_7
12:16:39 AM     0       714      0.00      0.00      0.00     245  txg_sync
12:16:39 AM   111       894   2236.80    931.20      0.00       0  transmission-da
12:16:39 AM  1000       931     36.00      0.00      0.00     158  kodi.bin

I'm on 0.6.5.2 with SPL 0.6.5.1 (Debian 9), in case that is important.

I actually suspected that the "reads" from transmission-da were somehow being multiplexed to the z_rd_int_* kernel threads. But the numbers don't add up.

Note that my ZFS pool consists of a single USB hard drive. No raidz, no mirroring, nothing fancy. So parity calculation or other such things (should the z_rd_int threads do that) can't be the cause.

EDIT: I don't know whether it's related, but I just saw this appear in my terminal for the very first time:

Message from syslogd@kdi at Dec  6 00:47:43 ...
 kernel:[ 2955.670778] VERIFY3((arc_stats.arcstat_c.value.ui64) >= 2ULL << 24) failed (33550891 >= 33554432)

Message from syslogd@kdi at Dec  6 00:47:43 ...
 kernel:[ 2955.670787] PANIC at arc.c:3760:arc_adapt()

Message from syslogd@kdi at Dec  6 00:47:43 ...
 kernel:[ 2955.809683] VERIFY3((arc_stats.arcstat_c.value.ui64) >= 2ULL << 24) failed (33550891 >= 33554432)

Message from syslogd@kdi at Dec  6 00:47:43 ...
 kernel:[ 2955.809692] PANIC at arc.c:3760:arc_adapt()

One data point that might help: this unit has only 2GB of RAM.
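
For context (my reading of the failed assertion, not something stated in the thread): 2ULL << 24 is 33554432 bytes (32 MiB), the floor the ARC target size arc_c is expected to stay above, and the logged value of 33550891 is only a few KiB below that floor, which seems plausible on a 2GB machine under heavy read pressure. The current ARC target and limits can be watched with something like:

grep -E '^(c_min|c_max|c|size) ' /proc/spl/kstat/zfs/arcstats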

EDIT 2: Apparently these messages were related to stack traces, which I rescued from dmesg: https://gist.github.com/aktau/6e79ecca968a641eb60f

@aktau

aktau commented Dec 6, 2015

After reading #2952, I feel obliged to say that I'm using lz4 compression on my zpool. Here's the configuration: https://gist.github.com/aktau/fa6b5ef2fd40201a489b

@behlendorf
Contributor

@aktau you've hit #3904 and you seem to have built with --enable-debug. You can safely avoid your issue for the moment by not enabling debug, see commit 935434e.

@aktau

aktau commented Dec 10, 2015

@aktau you've hit #3904 and you seem to have built with --enable-debug. You can safely avoid your issue for the moment by not enabling debug, see commit 935434e.

Thanks for the update! I'm not running a self-built kernel module, though; I just grabbed it from the Debian Jessie repo the zfsonlinux project provides. I assume those are built with debugging support enabled, then?

That means the high z_rd_int I/O load is a separate thing, sorry for the confusion. Though, from reading #3904, the two do seem related: a high read workload with a few random writes causes it to happen more often, says a poster. It might be because my unit has little memory and the ARC fills up the available memory quite fast because of all the reading.

That said, I find it odd that my particular workload is causing this. Back-of-the-envelope, the only process that is doing significant reading is transmission-da. Apart from checksum verification and other assorted things it does, the main source of reads is sharing chunks. I've capped that at 100KB/sec (so give or take a few tens of KB/sec). The HDD, slow and old though it is, should definitely be able to sustain that without reaching 50-90% utilization, I think.

@behlendorf
Contributor

Is anybody still observing this behavior with a recent version of ZFS?

@hooliowobbits

hooliowobbits commented Oct 18, 2016

Hi, yes. I installed ZoL today on Debian Jessie, along with Proxmox VE 4.3-3/557191d3.

I am currently doing a Debian install within a VM; the copy kept slowing down, so I started wondering about host performance. I then ran $ dd if=/dev/zero of=/tank/perftest/dd bs=1024M count=1 oflag=dsync. Now I/O seems to have dried up altogether :(

syslog showing lots and lots of:

Oct 18 22:03:30 riverbed kernel: [ 3480.297060] INFO: task txg_sync:6052 blocked for more than 120 seconds.
Oct 18 22:03:30 riverbed kernel: [ 3480.297410] txg_sync D ffff8801fbe1faa8 0 6052 2 0x00000000
Oct 18 22:03:30 riverbed kernel: [ 3480.297749] [<ffffffffc049aa74>] txg_sync_thread+0x3e4/0x6a0 [zfs]
Oct 18 22:03:30 riverbed kernel: [ 3480.297811] [<ffffffffc049a690>] ? txg_sync_stop+0xf0/0xf0 [zfs]

EDIT: oh, the dd returned:

$ dd if=/dev/zero of=/tank/perftest/dd bs=1024M count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 963.115 s, 1.1 MB/s
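
For comparison (a suggestion, not something run in this thread), an asynchronous variant of the same test separates raw pool throughput from sync-write latency; note that zeroes compress away and inflate the number if compression is enabled on the dataset:

dd if=/dev/zero of=/tank/perftest/dd2 bs=1M count=1024 conv=fdatasync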

This is on an HP MicroServer, 8GB RAM, 4 x 2TB disks in raidz1:

$ sudo zpool status tank
pool: tank
state: ONLINE
scan: none requested
config:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  raidz1-0  ONLINE       0     0     0
    sda     ONLINE       0     0     0
    sdb     ONLINE       0     0     0
    sdc     ONLINE       0     0     0
    sdd     ONLINE       0     0     0

errors: No known data errors

@behlendorf
Contributor

@hooliowobbits can you post your ZFS version? You can find it in /sys/module/zfs/version.
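
That is, something like:

cat /sys/module/zfs/version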

@hooliowobbits

0.6.5.7-10_g27f3ec9

@hooliowobbits

hooliowobbits commented Oct 18, 2016

$ sudo zpool get all tank | grep ashift   
tank  ashift                      0                           default

Admittedly, I had seen other users altering ashift, but not knowing much about it, I left it at the default.

EDIT:

hoolio@riverbed:~:$ sudo fdisk -l /dev/sda |grep -i sector\ size
Sector size (logical/physical): 512 bytes / 512 bytes
hoolio@riverbed:~:$ sudo fdisk -l /dev/sdb |grep -i sector\ size
Sector size (logical/physical): 512 bytes / 512 bytes
hoolio@riverbed:~:$ sudo fdisk -l /dev/sdc |grep -i sector\ size
Sector size (logical/physical): 512 bytes / 512 bytes
hoolio@riverbed:~:$ sudo fdisk -l /dev/sdd |grep -i sector\ size
Sector size (logical/physical): 512 bytes / 512 bytes
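
For what it's worth (an illustration, not something done in this thread): ashift=0 means ZFS auto-detects the sector size, and with drives reporting 512/512 like these it ends up at ashift=9. Since ashift is fixed per vdev at creation time, forcing 4K alignment has to happen when the pool is built, along these lines:

sudo zpool create -o ashift=12 tank raidz1 sda sdb sdc sdd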

@kettenbach-it

I think I have the same problem.

The system is a hexa-core Xeon E5-1650 with 128GB RAM and 2x 2TB SATA disks, running kernel version 4.4.40-1-pve and Proxmox 4.4-12 with ZFS module version 0.6.5.9-1.
Compression is on, dedup is off, and ashift=12. The SATA disks are in a mirror holding the system and some volumes. The system also has an NVMe which is supposed to serve as log + cache, but it is not enabled right now and it is not what causes the problem.

From time to time, tens of z_wr_int_* processes produce I/O on the disks, slowing down the host and leaving almost no room for organic I/O. If the NVMe is enabled, VMs tend to crash because no I/O time is left for them; the NVMe amplifies the problem a lot.
Sometimes the behavior stops after a few hours (up to 24); quite often it does not stop until a reboot.

Interestingly, the processes shown in iotop are called "wr" (write), yet iotop reports their activity as "disk read", while atop shows writes.

[screenshot: iotop]

atop reports the disks to be very busy:

[screenshot: atop]

I already posted in the Proxmox forum, but this seems to be a problem very specific to ZFS:

https://forum.proxmox.com/threads/unerkl%C3%A4rliches-phantom-write-i-o-mit-zfs.33284/

@gkkovacs

gkkovacs commented Apr 27, 2017

We have been experiencing a similar issue on Proxmox 4.x since last year. We even opened Proxmox support threads about it; this is the latest: https://forum.proxmox.com/threads/kvm-guests-on-zfs-freeze-during-backup-restore.34362/

We now think it's an IO scheduling bug in ZFS.
It looks like the following issues are about the same problem: #5867 #5857

Reproduction
Basically, if you have a few KVM guests' RAW disks on ZFS zvols and you start to restore another KVM guest to the same host (so heavy write IO is happening for extended periods), the running guests get starved of IO for so long that their virtual memory subsystem trips up and all kinds of problems start to appear: CPU soft locks, hung tasks, rcu_scheduler stalls, and all-around slowed-down disk and network IO (Windows guests regularly lose their RDP connections for several seconds). Extended heavy reads also cause some IO starvation, but it's the writes that really cause problems.

Mitigation
We have tried mitigating this by limiting host and guest swapping, increasing guest vm buffers, and tuning power management, but nothing solved it so far.

We have also recognized that QCOW2 disks with cache=writeback are much less sensitive to the IO starvation from the KVM guests' point of view than ZVOL RAW disks with cache=none, but it's not yet clear why:

  • either ZFS schedules parallel file IO better between files compared to parallel IO between ZVOLs,
  • or simply the writeback cache solves the guests' problems by caching their swap efficiently, masking the starvation.
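
One ZFS-side experiment sometimes tried for write starvation like this (the choice of tunables is an assumption for illustration, not something tested here) is shrinking the dirty data limit and the async write queue depth so that a restore cannot queue up many seconds of writes ahead of the running guests:

echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_max         # cap outstanding dirty data at 1 GiB
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active     # fewer concurrent async writes per vdev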
