
ZVol with master branch source 2015-09-06 and kernel 4.1.6 comes to a grinding halt. #3754

Closed
dracwyrm opened this issue Sep 8, 2015 · 52 comments
Labels: Component: ZVOL (ZFS Volumes), Status: Inactive (Not being actively updated)

@dracwyrm

dracwyrm commented Sep 8, 2015

Hi,

In my setup, I have KVM using ZVols on a Gentoo Linux system with Gentoo Sources 4.1.6. I have been using the 0.6.4.2 versions of SPL and ZFS. The tank is built with RAIDZ1 on three spinning HDs, with an external log and cache on two SSDs.

The commit logs show that there have been speed improvements for ZVols, so I thought I would give it a try. I downloaded the source for SPL and ZFS via the download source button that GitHub has and renamed it to something like zfs-20150906.zip, then used the Gentoo ebuilds as a base to install the new versions. I figured source downloads like that would let me choose the date for an update, rather than using a live ebuild. Naturally, I restarted the machine to make sure the new modules were fully loaded and the old ones were out of memory.

The VMs would work for a minute and then they would come to a grinding halt. The Windows circle of dots would not spin nor could I really move the mouse that is passed through via USB.

I wanted to do a full reinstall of Windows anyway, and all my data was backed up, so I completely destroyed the tank and did a secure erase to wipe the drives. I then created a new tank using the updated ZFS binaries and modules, hoping this would help. I tried reinstalling Windows, but the installation would not get very far before things ground to a halt again.

Here's the strange bit. I also have regular datasets on the same tank, and those worked faster than ever. I even transferred 750 Gigs of data to one dataset with no slowdowns at all. It's only the ZVols that gave bad performance.

I have since reinstalled the 0.6.4.2 versions of SPL and ZFS (without recreating the tank), started the virtual machine, and it runs as fast as ever. No slowdowns at all.

Cheers.

@behlendorf
Contributor

@dracwyrm thanks for testing out the updated zvol code in the master branch. I'm sorry to hear you ran into problems. Do you happen to know if anything was logged to the console of the ZFS system when you encountered these hangs? That would go a long way towards helping us identify the root cause.

@behlendorf behlendorf added the Component: ZVOL ZFS Volumes label Sep 8, 2015
@ryao
Contributor

ryao commented Sep 8, 2015

@dracwyrm I assume that the host itself has not deadlocked. Could you provide backtraces from your kernel threads when the guests have deadlocked?

http://zfsonlinux.org/faq.html#HowDoIReportProblems

Ideally, your kernel would be built with CONFIG_DEBUG_INFO=y and CONFIG_FRAME_POINTER=y.
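
If the host itself stays responsive, one way to capture stacks for every kernel thread is via SysRq (a sketch; it assumes CONFIG_MAGIC_SYSRQ=y, and the traces land in the kernel log rather than on stdout):

    echo t > /proc/sysrq-trigger    # dump the state and stack of every task
    dmesg | less                    # the backtraces appear in the kernel log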

@ryao
Contributor

ryao commented Sep 8, 2015

@dracwyrm Also, what kind of block device is your KVM guest using (e.g. IDE, SATA, SCSI, virtio) and which version of Windows is this?

@behlendorf behlendorf added this to the 0.6.5 milestone Sep 8, 2015
@ryao
Contributor

ryao commented Sep 8, 2015

I failed to reproduce this with qemu-system-x86_64 --enable-kvm -smp 8,cores=4,threads=2,sockets=1 -cpu host -m 1024 -hda /dev/zvol/rpool/windows -cdrom backup/isos/win7_64/en_windows_7_professional_x64_dvd_X15-65805.iso. However, my system has openzfs/spl#474 on it, which might be protecting me from some sort of issue. I will retest without it.

@ryao
Contributor

ryao commented Sep 9, 2015

That wasn't necessary. I was able to reproduce this with qemu-system-x86_64 --enable-kvm -smp 8,cores=4,threads=2,sockets=1 -cpu host -m 1024 -device ahci,id=ide -device ide-drive,bus=ide.1,drive=HDD -drive file=/dev/zvol/rpool/windows,id=HDD,if=none -cdrom backup/isos/win7_64/en_windows_7_professional_x64_dvd_X15-65805.iso.

Doing (for i in /proc/[0-9]*; do cat $i/{comm,stack}; done;) | less, I see nothing out of the ordinary. Nothing out of the ordinary is logged to dmesg either. iostat -x -d 1 with #3746 is not showing any activity. Given that the virtual IDE hard drive is unaffected, this seems like a bug in QEMU to me. It is as if we finished an IO so fast that we triggered a bug in QEMU.

@ryao
Contributor

ryao commented Sep 9, 2015

@dracwyrm No need for those backtraces, although information about your guest Windows version and guest storage device would still be useful.

@ryao
Contributor

ryao commented Sep 9, 2015

Some more debugging last night revealed that the install actually finishes. It just takes about half an hour to get past 0%. I had assumed we had somehow become too fast and were triggering a race, but testing with a patch to put msleep(1) on each IO did not improve things.

My test configuration is Linux 4.1.3, QEMU 2.4.0 and Windows 7 Professional x64.

@behlendorf
Contributor

Unfortunately I was unable to reproduce this issue at all in my test environment; everything worked smoothly. @dracwyrm could you post the environment you were testing with?

My test configuration is Linux 3.19, QEMU 2.2.0 and Windows 10 x64.

@dracwyrm
Author

dracwyrm commented Sep 9, 2015

My environment is:
Gentoo Sources 4.1.6
qemu 2.4.0
libvirt 1.2.18-r1

Guest is Windows 10 Pro

A dedicated NVIDIA graphics card is passed through via PCI passthrough. The USB keyboard and mouse are also passed through until I get Synergy up and running.

It was the guest that was deadlocking. It would just freeze in place, not doing anything, and would not respond to commands. Sometimes I was able to get the install done, but then trying to do anything in the guest was impossible; even sending Ctrl+Shift+Esc to open Task Manager would do nothing for ages.

When I switched back to the latest release, there were no issues at all; it works normally.

I just remembered this important bit! I did try the master branch with all commits up to and including the 31st of August, and it too had freezing problems, but not as bad as with the 6th of September checkout. Hope that helps.

Is there anything more you need?

libvirt settings (should be all you need)
http://pastebin.com/WE85aSek (to preserve formatting)

My ZFS Configuration:
zpool status
pool: tank
state: ONLINE
scan: none requested
config:

    NAME                                                     STATE     READ WRITE CKSUM
    tank                                                     ONLINE       0     0     0
      raidz1-0                                               ONLINE       0     0     0
        sdb                                                  ONLINE       0     0     0
        sdc                                                  ONLINE       0     0     0
        sdd                                                  ONLINE       0     0     0
    logs
      mirror-1                                               ONLINE       0     0     0
        ata-Corsair_Force_LS_SSD_1444816800010167031E-part1  ONLINE       0     0     0
        ata-Corsair_Force_LS_SSD_150581680001016752B2-part1  ONLINE       0     0     0
    cache
      ata-Corsair_Force_LS_SSD_1444816800010167031E-part2    ONLINE       0     0     0
      ata-Corsair_Force_LS_SSD_150581680001016752B2-part2    ONLINE       0     0     0

@dracwyrm
Author

dracwyrm commented Sep 9, 2015

Oh. My ZVol settings:
zfs get all tank/JonPC
NAME        PROPERTY              VALUE                 SOURCE
tank/JonPC  type                  volume                -
tank/JonPC  creation              Mon Sep 7 15:37 2015  -
tank/JonPC  used                  792G                  -
tank/JonPC  available             2.01T                 -
tank/JonPC  referenced            18.1G                 -
tank/JonPC  compressratio         1.00x                 -
tank/JonPC  reservation           none                  default
tank/JonPC  volsize               750G                  local
tank/JonPC  volblocksize          8K                    -
tank/JonPC  checksum              on                    default
tank/JonPC  compression           off                   default
tank/JonPC  readonly              off                   default
tank/JonPC  copies                1                     default
tank/JonPC  refreservation        774G                  local
tank/JonPC  primarycache          all                   default
tank/JonPC  secondarycache        all                   default
tank/JonPC  usedbysnapshots       10.7K                 -
tank/JonPC  usedbydataset         18.1G                 -
tank/JonPC  usedbychildren        0                     -
tank/JonPC  usedbyrefreservation  774G                  -
tank/JonPC  logbias               latency               default
tank/JonPC  dedup                 off                   default
tank/JonPC  mlslabel              none                  default
tank/JonPC  sync                  standard              default
tank/JonPC  refcompressratio      1.00x                 -
tank/JonPC  written               10.7K                 -
tank/JonPC  logicalused           13.4G                 -
tank/JonPC  logicalreferenced     13.4G                 -
tank/JonPC  snapdev               hidden                default
tank/JonPC  context               none                  default
tank/JonPC  fscontext             none                  default
tank/JonPC  defcontext            none                  default
tank/JonPC  rootcontext           none                  default
tank/JonPC  redundant_metadata    all                   default

@dracwyrm
Author

I tried several things. I switched to ck-sources, mainly because Kernel of Truth on the Gentoo forums says he has good luck with it and is active here as well. I increased the folder watch limit. I just now tried the very latest code in the spl/zfs master branches. It resulted in very poor performance of the VM; in fact, it was so poor that it just went to a black screen and then the monitor said there was no video input.

I reinstalled 0.6.4.2 and rebooted. Everything ran perfectly. I even ran a benchmark just minutes after switching back to 0.6.4.2 from the master branch: http://imgur.com/l8kRCWp It's over 9000 (that's actually a Steam unlock achievement). I am typing this from the VM as well.

I admit that I don't know much about how file systems are done, so forgive me if this is not a smart question...

Before, I always did in-kernel builds of spl/zfs, but with the changes in kernel 4.1 the 0.6.4.2 release no longer works compiled in; you can, however, compile it separately as a module. The commit log shows 4.1 and 4.2 compatibility fixes, so I tried embedding spl and zfs into the kernel the same way I always do. It compiled fine with no errors. However, on rebooting the host system, the screen would be blank right after the BIOS logos. No text. No errors. Just nothing. The exact same kernel compiled with the same config, minus spl/zfs being built in, boots straight away. My question is: is there still some incompatibility left that causes this slowdown, one that compiling ZFS into the kernel hits straight away but that takes a specific condition to trigger when it is compiled as a module? Or is this a completely separate bug altogether that I need to file a separate report for?

Thanks.

@Bronek

Bronek commented Sep 10, 2015

@dracwyrm could you try with a virtio HDD in Windows? I.e. attach two CDs as ahci or ide (one the Windows installation and the other virtio-win.iso from the Fedora project), and define your virtual HDDs as virtio rather than ahci (or ide), for example:

-drive file=/dev/zvol/zdata/vdis/gdynia,if=none,id=drive-virtio-disk0,format=raw
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x2,drive=drive-virtio-disk0,id=virtio-disk0
-drive file=/data/vdis/isos/windows7.iso,if=none,id=drive-ide0-0-1,readonly=on,format=raw
-device ide-cd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1,bootindex=1
-drive file=/data/vdis/isos/virtio-win.iso,if=none,id=drive-ide0-0-2,readonly=on,format=raw
-device ide-cd,bus=ide.0,unit=2,drive=drive-ide0-0-2,id=ide0-0-2

This would be generated from a libvirt definition similar to:

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/zvol/zdata/vdis/gdynia'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/data/vdis/isos/windows7.iso'/>
      <backingStore/>
      <target dev='hdd' bus='ide'/>
      <boot order='1'/>
      <readonly/>
      <alias name='ide0-0-1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/data/vdis/isos/virtio-win.iso'/>
      <backingStore/>
      <target dev='hdd' bus='ide'/>
      <readonly/>
      <alias name='ide0-0-2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>

This is because the emulation of both ahci and ide is known to be slow in qemu. Perhaps this poor emulation is hitting something in ZFS, making it even slower. It is generally recommended to avoid ahci (or ide) emulation except for installation purposes.

During installation you will need to load drivers from the virtio-win CD in order to allow Windows to "see" the virtio HDD.

FWIW, the above configuration is very similar to my own (virtual) Windows 7 computers, which I use daily with no problems. My host is Arch Linux with kernel 4.0.9 and ZoL 0.6.4.2 + #3718, qemu 2.4.0 and libvirt 1.2.18.

@Bronek

Bronek commented Sep 10, 2015

@behlendorf this might be a compatibility issue between ZoL and kernel 4.1; can you try reproducing with newer kernel versions?

@dracwyrm
Author

I changed the HDD from Virtio-SCSI to plain Virtio. It's a bit better: writes are fine, but reading seems really slow. Also, it seems like only one program can access the disk at a time, because if two start to read from it, it slows down a lot. The mouse also jumps around when there is disk activity. It hasn't slowed to a grinding halt yet, but it does seem to come close.

It's also strange that Virtio-SCSI on v0.6.4.2 was perfectly fine.

It might be some issue that my particular setup happens to trigger; as I mentioned before your comment, I can't compile SPL/ZFS into the kernel anymore, and this is with the latest master source on a 4.1.x kernel. I have yet to try a 4.2 kernel. Whatever it is really makes itself known when I compile it in, since the computer doesn't boot at all then.

Thanks for all your help in this.

To put this last: do these numbers seem right? Write speeds in the megabytes and reads in the kilobytes? I don't know how to get it to display formatted text properly, sorry. It's the output of zpool iostat -v.

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.26T  4.18T    110  1.22K   861K  15.6M
  raidz1    1.26T  4.18T    110  1.15K   861K  8.88M
    sdb         -      -     34     56   452K  6.03M
    sdc         -      -     35     57   453K  6.03M
    sdd         -      -     34     56   452K  6.03M
logs            -      -      -      -      -      -                                                          
  mirror    90.0M  2.90G      0     73      0  6.72M                                                          
    sde1        -      -      0     73    648  6.72M                                                          
    sdf1        -      -      0     73    648  6.72M                                                          
cache           -      -      -      -      -      -                                                          
  sde2      2.99G  49.9G     32     17   258K  2.08M
  sdf2      2.96G  49.9G     33     17   266K  2.06M
----------  -----  -----  -----  -----  -----  -----

@Bronek

Bronek commented Sep 10, 2015

@dracwyrm These numbers seem slow to me, although zpool iostat shows current activity rather than maximum throughput, so it is not really a measure of speed. You may want to run an actual speed test on the ZVOLs, perhaps.

@dracwyrm
Author

@Bronek I just managed to kill ZFS completely. I recompiled the kernel with the deadline scheduler, since I read that it's best for KVM. Apparently not for ZFS, as it came to a grinding halt like before. I had to go back to the 0.6.4.2 versions to have a working VM. I was using BFQ before, as it's the default in my kernel. Lesson learned on that one.

@Bronek

Bronek commented Sep 10, 2015

I am using the deadline scheduler with 0.6.4.2. I hope the root cause of this is identified; it would be a pity if this behaviour landed in 0.6.5 because we were not able to find the commit which caused it.

@dracwyrm
Author

@Bronek Yeah, it would be. I had very high hopes for the new ZVol code, which was supposed to make VMs a lot faster.
To clarify, when I switched back to 0.6.4.2 I didn't recompile the kernel. I am still on deadline, so whatever caused this really hates deadline. That might be a starting point, along with why the kernel doesn't boot when SPL and ZFS are compiled into the kernel directly. Hopefully all this information helps.

@fearedbliss

@dracwyrm You should stick to 'noop' for ZFS. Originally I thought BFQ would help but actually it was slower when I ran tests. You can see the results below:

https://bpaste.net/show/7b39c45ce871

@Bronek

Bronek commented Sep 10, 2015

@fearedbliss good point. I just learned that if a whole disk is set up for ZFS, it will switch that disk's scheduler to noop even if the default is different. However, if only a partition is set up for ZFS, it will leave the default scheduler alone.

@dracwyrm are you using whole disks for ZFS? If so, what scheduler is set for these disks?

For example on my system:

        NAME                                    STATE     READ WRITE CKSUM
        zdata                                   ONLINE       0     0     0
          mirror-0                              ONLINE       0     0     0
            sdd                                 ONLINE       0     0     0
            sdc                                 ONLINE       0     0     0
          mirror-1                              ONLINE       0     0     0
            sdb                                 ONLINE       0     0     0
            sda                                 ONLINE       0     0     0

        NAME        STATE     READ WRITE CKSUM
        zpkgs       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sde4    ONLINE       0     0     0
            sdf4    ONLINE       0     0     0

root@gdansk ~ # cat /sys/block/sda/queue/scheduler
[noop] deadline cfq
root@gdansk ~ # cat /sys/block/sdb/queue/scheduler
[noop] deadline cfq
root@gdansk ~ # cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq
root@gdansk ~ # cat /sys/block/sdd/queue/scheduler
[noop] deadline cfq
root@gdansk ~ # cat /sys/block/sde/queue/scheduler
noop [deadline] cfq
root@gdansk ~ # cat /sys/block/sdf/queue/scheduler
noop [deadline] cfq

What happens if you change the scheduler to noop at runtime, before starting your VMs? Here is how to do it:

root@gdansk ~ # echo noop > /sys/block/sdf/queue/scheduler
root@gdansk ~ # cat /sys/block/sdf/queue/scheduler
[noop] deadline cfq
root@gdansk ~ # echo noop > /sys/block/sde/queue/scheduler
root@gdansk ~ # cat /sys/block/sde/queue/scheduler
[noop] deadline cfq
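
To make that setting persist across reboots, a udev rule along these lines works (a sketch only; in practice you would match on stable attributes such as the device serial rather than sd* names, which can change between boots):

    # /etc/udev/rules.d/60-ssd-scheduler.rules (example)
    ACTION=="add|change", KERNEL=="sde", ATTR{queue/scheduler}="noop"
    ACTION=="add|change", KERNEL=="sdf", ATTR{queue/scheduler}="noop"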

@fearedbliss

@Bronek If by whole-disk ZFS you mean "zpool create tank /dev/sda", then no. I don't want to rely on (nor do I like) GRUB 2. I basically use the whole-disk concept, but on a partition. Two things: I don't want /boot inside ZFS, and having swap inside ZFS (on a zvol) caused my server and laptop to crash. Once I removed swap from ZFS, my systems magically became completely stable. The layout is as follows:

/dev/sda1 /boot ext2 250 MB (Extlinux as bootloader)
/dev/sda2 swap 4 GB
/dev/sda3 ZFS (/ , etc)

And yup, you are right about the scheduler being changed. Normally my scheduler is CFQ since my kernel's scheduler is set to CFQ by default (and I didn't let ZFS partition my drive). I use elevator=noop at boot time.

@dracwyrm
Author

@Bronek @fearedbliss This is what the system does automatically, as I have never set individual schedulers myself:
cat /sys/block/sd{a,b,c,d,e,f}/queue/scheduler
noop [deadline] cfq bfq
[noop] deadline cfq bfq
[noop] deadline cfq bfq
[noop] deadline cfq bfq
noop [deadline] cfq bfq
noop [deadline] cfq bfq
sde and sdf are the log and cache drives for ZFS, so I don't know why those aren't noop. Does that matter? If so, should the same mechanism that tells the kernel to use noop for sdb, sdc, and sdd also set noop for the log and cache drives? I will try tomorrow after work, as it's late in this timezone.

Mental note: Since it switched to noop automatically, then can I use BFQ for the main drive for performance of the host system...

@fearedbliss

@dracwyrm It's weird that it doesn't switch them; I'm not sure whether that is intended, but you should file another bug for that.

From the results that I posted above, I wouldn't recommend using BFQ (or any other scheduler other than no-op) if you are using ZFS.

@dracwyrm
Author

@fearedbliss Setting all drives to noop did not help at all. I switched back to v0.6.4.2 and set sde/sdf to noop, and v0.6.4.2 does perform a bit better now. So: exact same kernel, same scheduler, just two different versions of ZFS/SPL. The latest makes the VM so slow that it stalls out and virt-manager thinks the VM is paused, but it can't be unpaused.

I will file a separate bug report for the log/cache drives not being noop, because I notice a difference, even on v0.6.4.2.

@behlendorf behlendorf modified the milestones: 0.7.0, 0.6.5 Sep 11, 2015
@dracwyrm
Author

Just for grins, I tried 0.6.5 with the same results. Then I tried a 4.2 kernel and the results were even more disastrous: the video wouldn't even come on, and virt-manager reported high CPU usage. I'll double-check my config to see if anything was missed. I did switch the 4.2 config to a noop-only configuration.

@ryao
Contributor

ryao commented Sep 18, 2015

@dracwyrm I regret that we did not identify the cause of this issue before 0.6.5 was tagged. This appears to be a critical data loss issue, the first in the project's history to make it into a tagged production release. What happens is that there is a bug in the handling of non-aligned discard requests in the zvol rework: ZoL tries to truncate them to block boundaries, but the size is not updated, so it effectively rounds the request up to a block boundary. If the difference is N bytes, the bytes after the request's end point are discarded, causing data loss while still incurring the read-modify-write overhead that this optimization was intended to avoid.
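
For illustration, a discard of that shape can be issued from the host with blkdiscard against a scratch zvol (a sketch only; the hypothetical offsets are deliberately not multiples of an 8K volblocksize, and this destroys data in the affected range, so never point it at a volume you care about):

    # DESTRUCTIVE: only run against a throwaway test zvol
    blkdiscard --offset 4096 --length 10240 /dev/zvol/tank/testvol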

#3798 should fix this. It has been backported to Gentoo via sys-fs/zfs-kmod-0.6.5-r1.

My apologies for the issues that this caused. This was an extremely subtle bug that passed review and our current regression tests. The regression tests should be updated to catch this kind of regression before the next release.

@behlendorf
Contributor

@dracwyrm I think it's almost certain your issue was caused by #3833. Sorry we didn't get to the bottom of this sooner; it will be addressed in 0.6.5.2, and it would be great if you could confirm for us that it fixed the issue. The required patches are all currently in the master branch if you'd like to test sooner.

@behlendorf behlendorf modified the milestones: 0.6.5.2, 0.7.0 Sep 28, 2015
@behlendorf
Contributor

Closing, this issue is believed to be resolved. @dracwyrm please let us know if that's not the case with 0.6.5.2.

@dracwyrm
Author

Hi,

I tested 0.6.5.4 with both a 4.1.x and a 4.4 kernel, and this issue still persists. Sorry, I haven't had much time to play with new versions of ZFS. This issue is not closed for me.

I also removed block devices from being controlled by cgroups in the libvirtd config file to see if that would help, and it didn't.

My next step in debugging is to remove patches until I find the culprit. I will go back to 0.6.5.0 and work my way backwards. This will take a long time.

Can you please reopen?

-Jon

@behlendorf behlendorf reopened this Jan 26, 2016
@dracwyrm
Author

I have a question about something that I don't understand.

I unload all of these modules: zfs, zunicode, zavl, zcommon, znvpair, spl.
Stop zfs.service.
Upgrade spl/zfs-kmod/zfs from 0.6.4.2 to 0.6.5.4.
Restart the service, which reloads all the kernel modules.
Check the version of the loaded modules to confirm that the new ones are loaded -- they are.
Start the virtual machine.
The performance of Windows is good, the disk benchmark shows close to the same numbers as before, and everything seems all right.
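
In command form, that cycle looks roughly like this (a sketch; service and package names vary by distribution and init system):

    systemctl stop zfs.target                     # or: /etc/init.d/zfs stop
    modprobe -r zfs zunicode zavl zcommon znvpair spl
    emerge --oneshot spl zfs-kmod zfs             # upgrade 0.6.4.2 -> 0.6.5.4
    modprobe zfs && systemctl start zfs.target    # reload modules, restart service
    cat /sys/module/zfs/version                   # confirm which version is loaded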

However, if I restart the host computer, the performance of the guest is back to a grinding halt. The disk benchmarks drop to a crawl (from roughly 900 MB/s throughput to about 300 MB/s).

So I tried again. I reinstalled v0.6.4.2, restarted the host, and was back to normal speed in the guest.
This time, I reinstalled 0.6.5.4 and then restarted the host. The speed was a grinding halt from the start.

My question is: how can restarting the machine versus reloading the modules make such a difference? The new modules are fully loaded, so the guest is running on a ZVol under the new drivers and not the old ones, and I would expect the same performance issue.

The only thing I can think of is that the guest was running on the old drivers, so everything was loaded in memory or the ARC(?); then the new drivers were loaded, but everything was still in memory or the ARC, and on reboot whatever was in memory got wiped out. If so, is this some kind of error in the transfer of data to memory/ARC/whatever?

Thanks.

@behlendorf
Contributor

That's a good question. If you're able to remove the kernel modules and load new ones, then you've definitely wiped any existing ZFS state from the system; we have to tear everything down before they can be unloaded. It sounds like perhaps something else is going on, but I'm not sure what.

@dracwyrm
Author

Hi all,

I have good news and bad news. The good: I found the patch that is affecting VM Guest performance. The bad: You may not like which one it is.

First, my testing methodology.
I went to the SPL and ZFS v0.6.5 branches and worked down day by day, saving the tree at the end of each day's commits. I made ebuilds to handle YYYYMMDD-formatted version numbers. I unloaded all modules before emerging each new day's worth of commits. In the ZFS tree, with the commits through 03 Sep 2015, the VM guest was as fast as ever. So I went up patch by patch through all the commits listed under 04 Sep 2015. Then the patch that made the VM guest grind to a halt was found.

It was this one:
37f9dac

I read the description, and it said people experienced a 50 to 80% increase in IOPS. Was this under a VM guest? Or just a ZVol mounted under Linux running a test suite?

Also, there were other bug reports that were opened after mine when 0.6.5 was officially released saying that VM Guest performance was poor after the upgrade.

So, since I wasn't the only one having issues, can this patch be reverted until the cause of the KVM/VM guest performance problem is found? Sorry, but it breaks VMs, and I wasn't the only one affected.

Thanks,
Jon

@dweeezil
Contributor

@dracwyrm Could you please describe the tests you're running in a bit more detail? Is this simply a matter of noticing that a Windows guest boots very slowly? I'm wondering how hard this might be to reproduce in a controlled testing environment. Also, are you continuing to use the same 3-vdev + mirrored logs and 2 cache devices shown above? Have you configured qemu (via libvirt) to access the zvols with direct IO (cache='none')? Have you run "perf top" or any other sort of diagnostics on the host while the problem is occurring? FWIW as another data point, I run Windows 7 guests fairly regularly with their storage on Virtio zvols and cache='none' and haven't noticed any performance regressions. I'll admit, however, it's mainly my "seat of the pants" feeling. Finally, did you run any benchmarks on the zvols from the host?
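
As a concrete example of that last point, a direct-IO sequential read against the zvol from the host would take the guest out of the picture entirely (a sketch using fio and the zvol path from your pool; rw=read is non-destructive, so it is safe on the existing volume):

    fio --name=zvol-seqread --filename=/dev/zvol/tank/JonPC \
        --rw=read --bs=1M --size=4G --direct=1 --ioengine=libaio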

@dracwyrm
Author

@dweeezil I'll try. It's late and recompiling the whole day is not fun, so my grammar isn't exactly the greatest.

The same RAID/log/cache config as before.
Windows 10 on a 750G ZVol (it was Windows 8 before and had the same issue).
Libvirt cache mode set to none.
Libvirt IO mode set to native (the other options are threads and hypervisor default).
It doesn't matter what the disk bus is; whether it's SATA, SCSI, or VirtIO, disk performance is always degraded when using v0.6.5.x.

I use CrystalDiskMark to benchmark performance in the guest, so it's not entirely seat of the pants. With SPL/ZFS v0.6.4.2 I get an average of 900 MB/s sequential read speed, and with v0.6.5.x I get about 400 MB/s. The problem is the constant slow performance of the disk: it becomes so slow that even the mouse is affected, jumping from point to point as I try to move it. I haven't tried perf top while this is going on; I can try tomorrow and see.

I don't have any ZVols mounted on the host. I do have a dataset mounted on the host system with files I want under RAID protection. I don't access it very often, but I do know that transfers to and from it are very fast. The initial transfer of about 1 terabyte of data took a very short time, basically maxing out the speed the SATA bus was capable of.

@dweeezil
Contributor

@dracwyrm I ran my first very small series of tests: a 4.1.6 kernel running a 64-bit Win7 guest, with spl and zfs compiled with --enable-debug and zfs_flags=0. My pool is a single partition on an SSD with ashift=9 (non-optimal) for this test, and the Windows 7 system has a 30GiB zvol. With zfs at 782b2c3 (the commit prior to 37f9dac), CrystalDiskMark gave 2421/735 for sequential read/write. At current master (commit 4b9ed69), it gave 2469/786.

Obviously my test rig (on my "toy" test system) is quite different than your production system but I wanted to post a baseline set of numbers. They seem to indicate that at least for this particularly contrived benchmark, there is virtually no performance difference between the commit prior to 37f9dac and current master.

My next steps are going to be to pinch the ARC size way down and to increase the test size in CDM from the default of 1MB to something larger in order to force a lot more disk IO to happen.

@dracwyrm
Author

@dweeezil I forgot to mention that I have since moved to kernel 4.1.15, but it doesn't matter, as it's been the same with all kernels. I even tried kernel 4.4 with ZFS 0.6.5.4.

Maybe you could create a couple of loopback devices to replicate a RAIDZ1 with a separate log/cache? Reads and writes to a RAID setup behave a lot differently.

output of perf top while running CrystalDiskMark v5.1.1:

    39.47%  [kernel]            [k] read_hpet                     
     5.04%  [vdso]              [.] __vdso_clock_gettime          
     1.86%  [vdso]              [.] __vdso_gettimeofday           
     1.55%  [kernel]            [k] kvm_on_user_return            
     1.54%  [kernel]            [k] kvm_set_shared_msr            
     1.48%  [kernel]            [k] __vmx_load_host_state.part.89 
     1.45%  [kernel]            [k] kvm_arch_vcpu_ioctl_run       
     1.21%  [kernel]            [k] vmx_save_host_state           
     1.10%  [kernel]            [k] vmx_vcpu_run                  
     0.94%  [kernel]            [k] check_preemption_disabled     
     0.71%  [kernel]            [k] paging64_walk_addr_generic    
     0.68%  [kernel]            [k] __srcu_read_lock              
     0.66%  [kernel]            [k] gfn_to_hva_prot               
     0.65%  [kernel]            [k] preempt_count_add             
     0.56%  [kernel]            [k] menu_select                   
     0.54%  [kernel]            [k] __srcu_read_unlock            
     0.53%  [kernel]            [k] _raw_spin_lock                
     0.52%  [kernel]            [k] __fget                        
     0.50%  [kernel]            [k] update_cfs_shares             
     0.48%  [kernel]            [k] __schedule                    
     0.47%  [kernel]            [k] _raw_spin_lock_irqsave        
     0.47%  [kernel]            [k] kvm_arch_vcpu_load            
     0.47%  [kernel]            [k] x86_decode_insn               
     0.45%  [kernel]            [k] _raw_spin_lock_irq            
     0.45%  [kernel]            [k] preempt_count_sub             
     0.40%  [kernel]            [k] apic_timer_interrupt          
     0.40%  [kernel]            [k] int_sqrt                      
     0.39%  libpthread-2.22.so  [.] pthread_mutex_lock            
     0.36%  [kernel]            [k] fput                          
     0.36%  [kernel]            [k] system_call                   
     0.35%  [kernel]            [k] vmcs_writel                   
     0.34%  [kernel]            [k] queue_delayed_work_on         
     0.29%  [kernel]            [k] __switch_to                   
     0.29%  [kernel]            [k] enqueue_entity                
     0.29%  libpthread-2.22.so  [.] __pthread_mutex_unlock_usercnt

Cheers.

@dracwyrm
Author

Also, just to show how tired I am: I forgot to mention that the patch in question introduced the bug about the size not being updated. Before, the size was calculated as part of the function argument, but the patch passes the size variable as the function argument instead, and that variable is not recomputed after the truncation. This is why I applied the following patch, to give this version of the source code a fair trial.

diff -purN a/module/zfs/zvol.c b/module/zfs/zvol.c
--- a/module/zfs/zvol.c 2015-09-04 20:30:24.000000000 +0100
+++ b/module/zfs/zvol.c 2016-01-30 09:33:38.755117592 +0000
@@ -658,6 +658,7 @@ zvol_discard(struct bio *bio)
     */
    start = P2ROUNDUP(start, zv->zv_volblocksize);
    end = P2ALIGN(end, zv->zv_volblocksize);
+        size = end - start;

    if (start >= end)
        return (0);

The performance is as described above -- very poor. I hope this shows that I did try to give this version of the source code a completely fair trial. Something in this patch is really hurting my performance.

I then went further. I downloaded the whole source tree at this point in time and compared it to the latest source at the head of the 0.6.5 branch as of a few hours ago. I went through the differences between the files that the patch in question touched and the same files at the head of this branch (not the master branch). I wanted to see if there was anything else major that would really affect performance. There is a later commit about speed-ups, but the comment for the patch in question says there is already a speed-up of 50 to 80%, so in theory I should be seeing those same speed-ups instead of slowdowns.

Cheers.

@Bronek

Bronek commented Jan 30, 2016

@dracwyrm I am also using ZVOLs as underlying storage for my VMs but it works fine for me, perhaps there are differences in our setup which are beneficial for my VMs. Here is what I use:

  • kernel 4.1.16 (no problems with older versions either, but didn't try 4.4 yet) , no patches
  • ZFS 0.6.5.4 (no problems with 0.6.5.3 either)
  • I have NVMe based ZIL device and L2ARC
root@gdansk ~ # zpool status zdata
  pool: zdata
 state: ONLINE
. . .
config:
        NAME                                               STATE     READ WRITE CKSUM
        zdata                                              ONLINE       0     0     0
          mirror-0                                         ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0178587       ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0181554       ONLINE       0     0     0
          mirror-1                                         ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0196162       ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0182472       ONLINE       0     0     0
        logs
          nvme-INTEL_SSDPEDMD80_CVFT4415000K800CGN-part3   ONLINE       0     0     0
        cache
          nvme-INTEL_SSDPEDMD80_CVFT4415000K800CGN-part11  ONLINE       0     0     0

root@gdansk ~ # lsblk /dev/nvme0n1 | grep -E "n1p3|n1p11"
├─nvme0n1p3  259:3    0     3G  0 part
└─nvme0n1p11 259:11   0 257.2G  0 part
  • I am using writeback cache for my VMs configuration
root@gdansk ~ # virsh dumpxml gdynia | grep -B2 -A6 /dev/zvol/zdata/vdis
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source dev='/dev/zvol/zdata/vdis/gdynia'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <boot order='1'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </disk>
  • I use noop IO scheduler exclusively
root@gdansk ~ # zcat /proc/config.gz | grep -i noop
CONFIG_IOSCHED_NOOP=y
CONFIG_DEFAULT_NOOP=y
CONFIG_DEFAULT_IOSCHED="noop"

root@gdansk ~ # cd /sys/block ; for i in sd* nvme0n1 ; do echo $i && cat $i/queue/scheduler; done
sda
[noop] deadline cfq
sdb
[noop] deadline cfq
sdc
[noop] deadline cfq
sdd
[noop] deadline cfq
sde
[noop] deadline cfq
sdf
[noop] deadline cfq
nvme0n1
none

@dweeezil
Contributor

@dracwyrm I'm a bit concerned about all the CPU time being spent in read_hpet(). How many vcpus are you allocating to the guest? When running the slow version of ZFS, what does the Windows task manager show for CPU usage when the guest is otherwise idle? Presumably you're using PCI passthrough with vfio, correct?

The reason I ask is because I recently had a chance to try PCI passthrough of a graphics card on a newer machine with a CPU supporting VT-D (with the guest storage on a zvol with ZoL 0.6.5.? which means I had the 37f9dac zvol code). The guest's vcpus were always at 100% utilization and perf on the host showed lots of time being spent in read_hpet() (caused by an ioctl() of some sort in qemu). I tried the same with a Linux guest but the performance was so bad I simply gave up as this was simply an experiment on my part to try vfio. I can't imagine how at the moment, but I'm wondering if there may be a connection between the use of vfio and the newer zvol code. If your guest's vcpus seem to be running away, could you try it with the normal qxl/spice video stack and see what happens?
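
(One quick, related thing worth checking, as an aside rather than something established in this thread: which clocksource the host is actually using, since heavy read_hpet() time in perf generally just means the HPET is the active clocksource.)

    cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    cat /sys/devices/system/clocksource/clocksource0/available_clocksource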

@dweeezil
Contributor

After constraining the ARC to 512MiB, I ran another set of tests with the single-vdev pool and used the threaded sequential tests in CrystalDiskMark (Q32T2 settings) and a 4GiB test file. With 782b2c3 (before the new zvol code) it showed 937.5/761.3, and with current master code (4b9ed69) it was 1038/719. Those numbers seem close enough to be considered identical for this micro-benchmark. I'll set up a raidz1 now and try the same thing.

@dweeezil
Contributor

Same test with a 3-vdev raidz1: With 782b2c3 (before the new zvol code), CDM Q32T2/4GiB gave 760.4/507.7 and with current master (4b9ed69) it gave 950.3/633.6. I'm going to hold off on any more testing until @dracwyrm can determine whether this is an interaction with vfio.

@Bronek

Bronek commented Jan 30, 2016

I am also using vfio (for GPU passthrough) and have no such problems.

@dracwyrm
Author

The reason I ask is because I recently had a chance to try PCI passthrough of a graphics card on a newer machine with a CPU supporting VT-D (with the guest storage on a zvol with ZoL 0.6.5.? which means I had the 37f9dac zvol code). The guest's vcpus were always at 100% utilization and perf on the host showed lots of time being spent in read_hpet() (caused by an ioctl() of some sort in qemu). I tried the same with a Linux guest but the performance was so bad I simply gave up as this was simply an experiment on my part to try vfio. I can't imagine how at the moment, but I'm wondering if there may be a connection between the use of vfio and the newer zvol code. If your guest's vcpus seem to be running away, could you try it with the normal qxl/spice video stack and see what happens?

    (quoting @dweeezil above)

I am using VFIO with graphics card passthrough, so this is the same situation I am in. This is the performance issue that I am talking about. You experienced what I have been trying to say all along!!! @Bronek has no issues, though, but you and I have. I have checked Task Manager with ZFS v0.6.4.2 and it does not show 100% CPU usage. I forgot to look when under the new ZFS, because it takes ages to do anything and I get frustrated.

The difference is that now the patch that caused this issue has been found. The question is, what is happening in this patch that causes things to go haywire for these two setups -- the one you tried and mine? Is there a conflict between the new BIO handling and qemu/kvm? I have used both qemu 2.4 and 2.5. Is there a kernel option that I haven't configured that needs to be set, or even unset? If so, maybe there's a way to detect it at ZFS compile time, like some other settings?

The patch seems sound, apart from that size issue that was fixed later. If it really is this patch plus certain configurations, then the only candidates I can see are A) the removal of the 35-thread thing, or B) the new BIO handling. That leads me to believe there is a kernel config option involved. Is there a way to limit BIO threads?

I have tried using storage block CGroups set by libvirt and then disabling that, but both ways yield the same results.

I seem to have hit the one configuration that this patch doesn't like, and it is this patch, because anything before it is fine. So, is there a way for me to revert some of these changes out of the new versions and still have it work with the new code? That way I can test later code to see if the issue still hits.

If it helps, I am using Gentoo Linux with Gentoo Sources, and I do use NOOP since earlier in this bug thread I was told to use it. Also, I did searches on the Internet and they say it's good for SSDs, which is my main drive. Then I have two SSDs for the log/cache partitions.

Jon

@dweeezil
Contributor

@dracwyrm I'd like to clarify that I've only tried vfio with GPU passthrough a single time and it performed as you described. I've not tried it with an older version of ZoL. Can you confirm that if you run your guest with the standard spice/qxl video stack that the performance is OK?

@dracwyrm
Author

dracwyrm commented Feb 2, 2016

Well, I tried it without passthrough and had the same degraded performance. I'm well and truly stumped on this. What precisely is going on in this patch that would cause this type of incompatibility with my system? Is it because the max threads setting was removed, so it's left to the defaults defined in bio.h (I think it was 256)? Is there a kernel setting I'm missing?

@dracwyrm
Author

dracwyrm commented Feb 2, 2016

I messed around with kernel settings and libvirt settings (I switched to directsync), and now my perf top looks like this while very intensive disk reads/writes are going on:

    50.69%  [kernel]       [k] _raw_spin_lock_irq                     
     7.87%  [kernel]       [k] read_hpet                              
     3.67%  [kernel]       [k] __isolate_lru_page                     
     2.58%  [kernel]       [k] osq_lock                               
     2.48%  [kernel]       [k] check_preemption_disabled              
     2.32%  [kernel]       [k] putback_inactive_pages                 
     2.24%  [kernel]       [k] __page_check_address                   
     1.95%  [kernel]       [k] shrink_page_list                       
     1.77%  [kernel]       [k] __anon_vma_interval_tree_subtree_search
     1.69%  [kernel]       [k] mm_find_pmd                            
     1.58%  [kernel]       [k] mutex_spin_on_owner.isra.6             
     1.39%  [kernel]       [k] down_read_trylock                      
     1.35%  [kernel]       [k] _raw_spin_lock                         
     1.33%  [kernel]       [k] page_lock_anon_vma_read                
     1.28%  [kernel]       [k] isolate_lru_pages.isra.63              
     1.23%  [kernel]       [k] unlock_page                            
     0.70%  [kernel]       [k] __mod_zone_page_state                  
     0.69%  [kernel]       [k] rmap_walk                              
     0.61%  [kernel]       [k] page_mapping                           
     0.49%  [kernel]       [k] up_read                                
     0.47%  [kernel]       [k] __wake_up_bit                          
     0.47%  [kernel]       [k] anon_vma_interval_tree_iter_first      
     0.45%  [kernel]       [k] page_referenced_one                    
     0.37%  [kernel]       [k] preempt_count_add                      
     0.37%  [kernel]       [k] page_referenced                        
     0.32%  [kernel]       [k] _raw_spin_lock_irqsave                 
     0.32%  [kernel]       [k] page_evictable                         
     0.26%  [kernel]       [k] __this_cpu_preempt_check               
     0.22%  [kernel]       [k] preempt_count_sub                      
     0.19%  [zcommon]      [k] fletcher_4_native                      
     0.18%  [kernel]       [k] mutex_lock                             
     0.17%  [kernel]       [k] kvm_handle_hva_range                   
     0.16%  [vdso]         [.] __vdso_clock_gettime                   
     0.14%  [kernel]       [k] _raw_spin_unlock                       
     0.13%  [kernel]       [k] apic_timer_interrupt 

The raw spin lock seems to be heavy.
