zvol write performance issue 0.7.0-rc4 #6127
Here is the full libvirt XML definition of the guest machine:
Can you try setting
Another data point: I've been doing a lot of work with FreeBSD 10.3 guests under libvirt/qemu-kvm lately, so this was pretty easy for me to try to duplicate on my daily driver system, which is much lower specced than @Bronek's. Running "pkg install vim" in a pretty-much-dead-stock FreeBSD 10.3 guest, configured identically, the host system never hiccups at all and I was only able to catch

@Bronek, in addition to enabling zvol sync request mode, it might be interesting to
I was running -rc3 previously and this did not happen before. However, it was also my first try of a FreeBSD guest VM on this host. It also turns out it is not as easily reproducible as I thought - I just tried this morning (no changes in the system configuration, not even

I do not know if this is very useful, but I was running
OK, so I am now able to reproduce this issue; it simply requires dumping a lot of IO onto virtualized (i.e. mapped to ZVOLs) disks in my virtual machines. One way I've found is to run different disk benchmarks (I am using "AS SSD" and DiskMark64, running under virtual Windows instances) on different disks at the same time. So far I've not switched

In the attached kernel log you can see the kernel's own reporting of blocked tasks, and then from line 292 (kernel time 2815.498766) me using sysrq to capture host state before powering it down (accidentally; I was aiming for sysrq+p).
@behlendorf I can confirm that after setting
@Bronek
@Bronek another thing I'd be interested in would be decreasing
@behlendorf Edit: Or we can go back to using the request queue, which seems to have that.
@tuxoko I don't think changing zfs_dirty_data_max is a solution here. We are heavily using zvols at work with a patch (that predates me joining) that reimplements zvol threads on top of 0.6.5.9. My experience is that lowering zfs_dirty_data_max makes performance worse, rather than better. It does make it more consistent, but it is also consistently worse. Performance fluctuations in our current branch off of 0.6.5.9 + #6133 at work are an issue that I am actively investigating, although they aren't directly related to this.

@behlendorf I am still a fan of modifying zvols to rely on DMU dispatch. I think that will solve this by combining the best of both worlds. That looks like the next major bottleneck at work after #6133 is done. Hopefully, the patch for making zvols rely on DMU dispatch will take less time than ZIL pipelining.
@ryao But either way, we should throttle the incoming threads so we don't overflow the queue.
@tuxoko We do throttle incoming IOs in the elevator. If it is doing a good job, we should not need to adjust zfs_dirty_data_max. We also already initiate txg_sync whenever dirty data exceeds zfs_dirty_data_sync. Lowering zfs_dirty_data_max will limit the opportunity for bad write out behavior to cause a wall of wait, but it does not fix the underlying problem.
If you have a patch for this that you could share, that would be useful for getting benchmark numbers from the threading change alone.
Could you describe what you have in mind?
Agreed. We should verify if the existing

A nice way to potentially handle might be to wire up
Since I just started work less than a month ago, I think it would be best if I ask my manager for his okay on releasing old patches (although there isn't much old stuff to release) as I am still fairly new at work. However, that should be a formality. The person I need to ask is currently in GMT+8. I am not sure if I will hear back from him within the next few hours. I will push a branch with it as soon as I get an okay.
Instead of having zvol threads, we would just execute an asynchronous DMU operation with a callback and return whenever the operation is not synchronous. Then when the callback is invoked, we will invoke
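(This is not ZFS code, just a minimal user-space sketch of the submission/completion split being described. `struct request`, `zvol_write_async()` and `complete_request()` are hypothetical stand-ins; in the actual proposal the DMU itself would deliver the completion callback, so no dedicated zvol worker threads would be needed.)

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct request {
	bool		sync;	/* caller needs to wait for completion */
	bool		done;
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
};

/* Completion callback: analogous to ending the bio when the DMU op finishes. */
static void
complete_request(struct request *rq)
{
	pthread_mutex_lock(&rq->lock);
	rq->done = true;
	pthread_cond_signal(&rq->cv);
	pthread_mutex_unlock(&rq->lock);
}

/* Stand-in for the asynchronous DMU operation; it just fires the callback. */
static void *
zvol_write_async(void *arg)
{
	/* ... the actual write would happen here ... */
	complete_request(arg);
	return (NULL);
}

/* Submission path: dispatch, then block only if the request is synchronous. */
static void
submit(struct request *rq)
{
	pthread_t worker;

	pthread_create(&worker, NULL, zvol_write_async, rq);
	pthread_detach(worker);

	if (rq->sync) {
		pthread_mutex_lock(&rq->lock);
		while (!rq->done)
			pthread_cond_wait(&rq->cv, &rq->lock);
		pthread_mutex_unlock(&rq->lock);
	}
}

int
main(void)
{
	struct request rq = {
		.sync = true, .done = false,
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.cv = PTHREAD_COND_INITIALIZER,
	};

	submit(&rq);
	printf("synchronous request completed\n");
	return (0);
}
```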
Re. zvol_threads, see below:
I use

Re. zfs_dirty_data_max, perhaps it matters that I use a 3GB dedicated SLOG device on NVMe? Anyway, the current configuration is:
The current synchronous performance is not brilliant, but it is not awful either; I do not have a robust enough benchmark. I imagine a good benchmark would involve random small writes from multiple threads writing to multiple ZVOLs. I do not think a VM actually needs to be involved; direct block access from user mode should be sufficient, but different write modes should be exercised.
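For what it's worth, a minimal sketch of that kind of load: several threads issuing small random O_DIRECT writes straight at a block device from user mode. The device path, thread count and sizes below are placeholders, and a fuller benchmark would also exercise O_SYNC/fsync variants and multiple zvols.

```c
/*
 * Toy load generator: NTHREADS threads issue small random O_DIRECT writes
 * against a block device. Point it at a test zvol's device node (the path
 * below is a placeholder) - it will overwrite whatever is on the device.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 8
#define IOSIZE   4096
#define NWRITES  100000L
#define DEVSIZE  (1ULL << 30)	/* assumes the device is at least 1 GiB */

static const char *dev = "/dev/zvol/tank/testvol";	/* placeholder path */

static void *
writer(void *arg)
{
	unsigned seed = (unsigned)(uintptr_t)arg;
	void *buf = NULL;
	int fd = open(dev, O_WRONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, 4096, IOSIZE) != 0) {
		perror("setup");
		return (NULL);
	}
	memset(buf, 0xab, IOSIZE);

	for (long i = 0; i < NWRITES; i++) {
		/* random, block-aligned offset within the first DEVSIZE bytes */
		off_t off = (off_t)(rand_r(&seed) % (DEVSIZE / IOSIZE)) * IOSIZE;
		if (pwrite(fd, buf, IOSIZE, off) != IOSIZE) {
			perror("pwrite");
			break;
		}
	}
	free(buf);
	close(fd);
	return (NULL);
}

int
main(void)
{
	pthread_t t[NTHREADS];

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, writer, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return (0);
}
```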
@ryao yes, I read this much from your pull request :) The issue here is obviously asynchronous load, which is virtually killing the machine when under heavy random write load. Since I'm not keen on having this machine blocked when it is used by someone else, I switched it to synchronous (unless I'm doing testing myself).
@Bronek In that case, the IOs aren't forced sync and it wouldn't benefit from it. Anyway, keep your eyes peeled for my next zvol patch. P.S. I realized that there was a mistake in my previous remark, so I deleted it while I rewrote it. You saw it and replied before I had the chance to finish typing.
@behlendorf
@behlendorf Why would the solution here be any different from the solution on a regular file? Internally, it is basically the same thing, except we can't fall back on nr_requests with a regular file.
@ryao
@tuxoko I see what you are saying. The number of things waiting in the taskq for execution is essentially unbounded, so we can end up with a terrible wall of wait. The way I see us fixing this would be killing the zvol threads and implementing an asynchronous DMU dispatch. We are adding unnecessary complexity by using zvol threads. This is the sort of thing that I had wanted to get away from doing in the past. It is hard to make zvol worker threads work well, and even if we get them to work as well as they can, they are still inferior to a solution that doesn't use them. Double dispatch is expensive.
@behlendorf It occurs to me that this problem affects swap on zvols as well. Paging out to a "swap" device is supposed to reduce machine memory usage, but the effect of the zvol taskq is that we end up with increasing memory usage at a time when we are supposed to be decreasing it. It will eventually go down, but this is like pouring gasoline on a fire just before it runs out of flammable material. Some sort of throttling should help, although we had that before 692e55b broke it. As a workaround until we have a proper solution, I suggest we turn this functionality off by default.

Edit: Also, I suspect that this is one of multiple problems we are having scaling up performance on zvols at work, so effective now, I am starting work on a patch. This is work being done full time as part of my day job rather than a hobby project, so hopefully I should get a working patch soon. #6133 isn't quite finished yet, but I expect to wait on the buildbot for a few hours anyway. Failing to improve zvol performance is going to cause a large company to use a proprietary storage solution, so I have an extra incentive to make an improvement here ASAP.
@behlendorf My original idea turned out to be less than ideal. While it was possible, it would have bypassed txg_sync by writing out early, which could cause IO amplification. I am pursuing a variation of the current ideas where the range locking and transactions are opened synchronously while the actual DMU operation and zil_commit are done in a worker thread. My expectation is that this will give us the best of both worlds.
@ryao what you're describing is very similar to the current behavior with
@Bronek @kpande I suggest trying out #6207. It implements the fix that I described.

@behlendorf It is very similar. Per-zvol max queue depths could likely be implemented with a condition variable, a mutex and a counter. I'll experiment with it and get back to you on it. :)
Yes, you need to stall the request until the depth decreases. While you're there... please add a kstat to count when we stall.
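(A minimal user-space sketch of that mutex/condvar/counter gate, with a `stall_count` field standing in for the suggested kstat; all names are hypothetical. Submitters block while the per-zvol in-flight count is at the limit, and completions free a slot.)

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Hypothetical per-zvol admission gate. Call zvol_gate_enter() before
 * dispatching a request and zvol_gate_exit() from the completion path.
 */
typedef struct zvol_gate {
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
	unsigned	inflight;
	unsigned	limit;
	uint64_t	stall_count;	/* how often a submitter had to stall */
} zvol_gate_t;

void
zvol_gate_init(zvol_gate_t *g, unsigned limit)
{
	pthread_mutex_init(&g->lock, NULL);
	pthread_cond_init(&g->cv, NULL);
	g->inflight = 0;
	g->limit = limit;
	g->stall_count = 0;
}

void
zvol_gate_enter(zvol_gate_t *g)
{
	pthread_mutex_lock(&g->lock);
	if (g->inflight >= g->limit)
		g->stall_count++;	/* record that we had to stall */
	while (g->inflight >= g->limit)
		pthread_cond_wait(&g->cv, &g->lock);
	g->inflight++;
	pthread_mutex_unlock(&g->lock);
}

void
zvol_gate_exit(zvol_gate_t *g)
{
	pthread_mutex_lock(&g->lock);
	g->inflight--;
	pthread_cond_signal(&g->cv);	/* wake one stalled submitter */
	pthread_mutex_unlock(&g->lock);
}
```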
@ryao if possible, it would be ideal to use the standard
@behlendorf is this still an issue or was it fixed eventually? |
@mailinglists35 I disabled asynchronous zvol some time ago - I will get back to you after I have enabled it again and set zvol_threads to 8. FWIW I am currently running version 0.7.9 with kernel 4.14.41.
Having tested under some load, the impression so far is that the current version is much better in this respect. It is hard to say how much of the help was zvol_threads=8, and I would not say the situation is optimal. The host remained stable, functional and usable throughout testing, even though the load was pretty high. The guest virtual machine hung on the third test and had to be terminated; the termination did not take effect until a few minutes after the request, despite vfio's own IO threads being killed, but the host system was usable and stable throughout. Also, the performance figures obtained from the tests were sensible.
@Bronek thanks for testing this again. It sounds like we should open a PR which sets the default number of
@behlendorf FWIW, my system is:
I have now changed zvol_threads to 16, but I will hold off on more testing until I know the details of the fix mentioned by @kpande (which PR?).
With kernel-induced ZVOL write RMW out of the way (see #8590), I find that zvol_request_sync=1 is usually the best way to go. Latency and throughput are both extremely good and the throttle works properly.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
I just encountered a severe zvol write performance issue, where the zvol is used as a backing store for a kvm virtual machine, running under qemu 2.9.0 and kernel 4.9.26 (vanilla, i.e. no patches) with ZFS/SPL version 0.7.0-rc4. Due to very high CPU utilization in kernel mode, the host became virtually unresponsive and had to be restarted. The host machine has 128GB RAM and 2 sockets, each with a Xeon E5-2667v2 and HT enabled, which means 16 cores (32 logical cores, i.e. hardware threads), while the guest was only assigned 2 logical cores and 2GB RAM. The system was luckily still responsive to sysrq commands, which allowed me to dump blocked tasks (and several other statistics). Here are interesting excerpts from the kernel log:
The guest virtual machine where this was triggered is a very basic, freshly installed FreeBSD 10.3. The issue was triggered when I started this command in the guest: pkg install vim
... which brings in 116 dependencies, taking 2GB of disk space (as I've just learned, one would have to install "vim-lite-8.0.0507" on FreeBSD to get vim alone). The problem happened somewhere in the middle of the guest installing these 2GB of dependencies. The guest does have sufficient free disk space, as seen here:
This is all hosted on a 24GB ZVOL, defined as below:
I think this should be easily reproducible, and I am happy to share more details (e.g. the VM image, the VM libvirt XML definition, my full kernel config, etc.). This is in fact why I raised this as a separate issue, so there is space for all these details (as the issue itself is not unique).