Large synchronous writes are slow when a slog is present #1012
@dechamps I was investigating this issue yesterday, which was easy to reproduce given your excellent summary of the problem. Unfortunately, I don't think it's going to be quite as trivial to fix as we'd hoped.
Initially, I tried your suggestion of tweaking zfs_log_write() and zvol_log_write() to switch to indirect mode when a commit size threshold is exceeded. In practice that proved problematic: on my test system the log never grew large enough for me to set a reasonable default threshold.
Upon further reflection it was also clear to me that the log size isn't really what we want to be using here. What would be far better is to remove the existing code's assumption that the slog is always going to be the fastest storage. As you point out above, this is almost certainly true for small I/Os, but large streaming I/O would be far better handled by the primary pool.
Ideally we want a way to determine which set of vdevs is going to stream the fastest on your system, to minimize the latency. For unrelated reasons I've already been looking at tracking additional per-vdev performance data such as IOPS and throughput. Once those enhancements get merged, it would be relatively straightforward for zfs_log_write() and zvol_log_write() to take device performance into account and do the right thing.
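To make that concrete, here is a rough sketch of the kind of comparison such a change could make; vdev_perf_t, commit_time(), and prefer_slog() are made-up names for illustration, not existing ZFS interfaces:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vdev performance summary, not an existing ZFS structure. */
typedef struct vdev_perf {
	double vp_latency;      /* seconds until the device starts writing */
	double vp_throughput;   /* sustained bytes per second */
} vdev_perf_t;

/* Expected time to complete a commit of 'size' bytes on one set of vdevs. */
static double
commit_time(const vdev_perf_t *vp, uint64_t size)
{
	return (vp->vp_latency + (double)size / vp->vp_throughput);
}

/* True if the slog is expected to finish the commit sooner than the main pool. */
static bool
prefer_slog(const vdev_perf_t *slog, const vdev_perf_t *pool, uint64_t size)
{
	return (commit_time(slog, size) < commit_time(pool, size));
}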
What would be far better is to remove the existing code's assumption that the slog is always going to be the fastest storage.
The code doesn't always assume the slog is faster than everything: when the log size exceeds zil_slog_limit, it switches to the main pool. The issue is that by the time it makes this decision, it has already decided that the write will be in immediate mode, so it ends up writing in immediate mode to the main pool, which is not what we want.
The core issue is that the two decisions (indirect/immediate and slog/main pool) are made by different modules at different times, so we end up with an absurd result.
Sure, and I think it is pretty easy to fix the worst-case behavior you described. The initial patch I put together basically added a call to USE_SLOG() when setting the slogging variable in zfs_log_write() and zvol_log_write(). That allowed it to change to indirect mode at roughly the right time. Perhaps that's still worth doing in the short term.
/* Proposed tweak: treat the pool as slog-backed only while USE_SLOG()
 * still says this commit will actually go to the slog. */
slogging = spa_has_slogs(zilog->zl_spa) && USE_SLOG(zilog) &&
    (zilog->zl_logbias == ZFS_LOGBIAS_LATENCY);
For testing it just happened that my pool's primary storage was about 3x faster at streaming than my slog. So even when it correctly used the slog in immediate mode, there was still a 3x performance penalty compared to using indirect mode to the primary pool. This second issue got me thinking about how to just do the right thing.
The code lines in your last comment are basically what I had in mind for fixing this issue.
For testing it just happened that my pool's primary storage was about 3x faster at streaming than my slog. So even when it correctly used the slog in immediate mode, there was still a 3x performance penalty compared to using indirect mode to the primary pool. This second issue got me thinking about how to just do the right thing.
Well, that's a trade-off. Keep in mind that for the ZIL, latency is the main performance metric, not throughput. What counts is the time it takes for zil_commit() to complete and nothing else. In your case, with primary storage 3x faster (streaming) than the slog, small commits should still go to the slog, because they will complete much faster (~0.1 ms versus ~3 ms, assuming it's an SSD). Large commits, however, should go to the primary pool, because the actual write time (determined by disk throughput) dominates the initial seek latency.
Basically, the primary pool should be used if (initial latency + commit size / vdev throughput) is lower for the primary pool than for the slog. For example, for a pool with one SSD slog (0.1 ms latency, 100 MB/s) and 3 spindles (3 ms, 300 MB/s total), any ZIL commit larger than 0.435 MB will take less time to complete on the main pool, which means zil_slog_limit should be set to roughly 512 KB.
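Plugging those example figures into a quick standalone check of the arithmetic (nothing here beyond the numbers given above):

#include <stdio.h>

int
main(void)
{
	double slog_lat = 0.0001;   /* 0.1 ms SSD latency */
	double slog_bw  = 100e6;    /* 100 MB/s SSD streaming throughput */
	double pool_lat = 0.003;    /* 3 ms spindle latency */
	double pool_bw  = 300e6;    /* 300 MB/s aggregate spindle throughput */

	/* Break-even size s: slog_lat + s/slog_bw == pool_lat + s/pool_bw */
	double s = (pool_lat - slog_lat) / (1.0 / slog_bw - 1.0 / pool_bw);
	printf("break-even commit size: %.3f MB\n", s / 1e6);  /* ~0.435 MB */
	return (0);
}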
Note that my previous demonstration is only valid when there is no congestion, i.e. the disks are idle when the commit occurs. If the disks aren't idle, then other factors come into play, and the SSD will often win because it is likely to be less loaded than the main disks, unless all the load is on the ZIL. In addition, if the disks are busy, then writing in immediate mode (i.e. twice) to the main pool halves performance, which brings us back to the issue described in my original report.
Ok, so currently it makes no sense to add a ZIL device to a ZOL pool if there are large synchronous writes, like iSCSI or NFS to zvols?
Ok, so currently it makes no sense to add a ZIL device to a ZOL pool if there are large synchronous writes, like iSCSI or NFS to zvols?
Basically, yes. If you care about large synchronous writes, then adding a slog might be counter-productive. Note that this is true for all ZFS implementations, including FreeBSD and Illumos.
I assume you sidestep the issue if you set logbias=throughput?
Yes, but if you set logbias=throughput then the slog is never used for synchronous writes, even small ones, so that makes the slog useless (unless you have some other dataset that uses it).
Thank you for this clarification. Given how much discussion there is around using SSDs for the ZIL/cache, this is quite frustrating. I hope you find some time to work on this issue.
Note that this issue seems to impact all ZFS implementations, not just ZFS On Linux.
ZFS uses a complicated process when it comes to deciding whether a write should be logged in indirect mode (written once by the DMU, the log records store a pointer) or in immediate mode (written in the log record, rewritten later by the DMU). Basically, it goes like this:
- The write is logged in indirect mode if logbias=throughput, or if there is no slog and the write is larger than zfs_immediate_write_sz.
- The write is logged in immediate mode if logbias=latency and:
  - the write is smaller than zfs_immediate_write_sz, or
  - there is a slog in the pool.
- The log records are written to the slog if logbias=latency, there is a slog, and the total commit size is smaller than zil_slog_limit; otherwise they are written to the normal vdevs.

The decision to use indirect or immediate mode is implemented in zfs_log_write() and zvol_log_write(). The decision to use the slog or the normal vdevs is implemented in the USE_SLOG() macro used by zil_lwb_write_start().

The issue is, this decision process makes sense except for one particularly painful edge case, when these conditions are all true:
- logbias=latency, and
- there is a slog (so the writes are logged in immediate mode), and
- the total commit size is larger than zil_slog_limit (so the slog is not used).

In this situation, the optimal choice would be to write to the normal pool in indirect mode, which should give us the minimum latency considering this is a large sequential write. Indeed, for very large writes, you don't want to use immediate mode, because it means writing the data twice. Even if you write the log records to the slog, this will be slower with most pool configurations (e.g. lots of spindles and one SSD slog), because the aggregate sequential write throughput of all the spindles is usually greater than the SSD's.
Instead, the algorithm makes the worst decision possible: it writes the data in immediate mode to the main data disks. All of the (large) data is committed as ZIL log records on the data disks first, then immediately afterwards it is written again by the DMU. As a result, the overall throughput is halved, and under sustained load the ZIL commit latency is doubled compared to indirect mode.
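To make the interaction between the two decisions concrete, here is a condensed sketch; it is not the actual ZFS code (which uses WR_COPIED/WR_NEED_COPY rather than a single immediate state), just the shape of the logic described above:

#include <stdbool.h>
#include <stdint.h>

#define ZFS_IMMEDIATE_WRITE_SZ  (32ULL * 1024)      /* zfs_immediate_write_sz default: 32 KB */
#define ZIL_SLOG_LIMIT          (1024ULL * 1024)    /* zil_slog_limit default: 1 MB */

typedef enum { WR_INDIRECT, WR_IMMEDIATE } write_mode_t;

/* Decision 1: taken per write, in zfs_log_write()/zvol_log_write(). */
static write_mode_t
log_write_mode(bool latency_bias, bool has_slog, uint64_t write_size)
{
	if (latency_bias && (has_slog || write_size < ZFS_IMMEDIATE_WRITE_SZ))
		return (WR_IMMEDIATE);
	return (WR_INDIRECT);
}

/* Decision 2: taken later, per log block, by USE_SLOG() in zil_lwb_write_start(). */
static bool
use_slog(bool latency_bias, bool has_slog, uint64_t commit_size)
{
	return (latency_bias && has_slog && commit_size < ZIL_SLOG_LIMIT);
}

/*
 * The edge case: latency_bias && has_slog && commit_size > ZIL_SLOG_LIMIT
 * yields WR_IMMEDIATE from decision 1 but use_slog() == false from decision 2,
 * so the large data is written twice to the main disks and the slog sits idle.
 */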
It is shockingly easy to reproduce this issue. In pseudo-code:
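Something along these lines (a minimal C sketch; the 2 GB size is what matters, the path /tank/fs/testfile is just an example, and error handling is omitted):

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK   (1ULL << 20)    /* write in 1 MB chunks */
#define TOTAL   (2ULL << 30)    /* 2 GB in total */

int
main(void)
{
	char *buf = malloc(CHUNK);
	memset(buf, 0xab, CHUNK);

	int fd = open("/tank/fs/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	for (uint64_t off = 0; off < TOTAL; off += CHUNK)
		write(fd, buf, CHUNK);

	fsync(fd);    /* the ZIL commit whose latency we are measuring */
	close(fd);
	free(buf);
	return (0);
}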
Watch the zil_stats kstat page when that runs.

If you don't have a slog in your pool, then the fsync() call will complete in roughly the time it takes to write 2 GB sequentially to your main disks. This is optimal.

If you have a slog in your pool, then the fsync() call will generate twice as much write activity, and will write up to 4 GB to your main disks. Ironically, the slog won't be used at all when that happens.

The solution would be to modify the algorithm in zfs_log_write() and zvol_log_write() so that, in the conditions mentioned above, it switches to indirect writes when the commit size reaches a certain threshold, e.g. 32 MB (see the sketch at the bottom).

I would gladly write a patch, but I won't have the time to do it, so I'm just leaving the result of my research here in case anyone's interested. If anyone wants to write the patch, it should be very simple to implement.
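For completeness, a minimal sketch of that check (the 32 MB constant and all of the names here are illustrative only, not existing tunables or ZFS code):

#include <stdbool.h>
#include <stdint.h>

#define COMMIT_INDIRECT_THRESHOLD  (32ULL * 1024 * 1024)  /* e.g. 32 MB */
#define IMMEDIATE_WRITE_SZ         (32ULL * 1024)

typedef enum { WR_INDIRECT, WR_IMMEDIATE } write_mode_t;

static write_mode_t
log_write_mode_patched(bool latency_bias, bool has_slog,
    uint64_t write_size, uint64_t commit_bytes_so_far)
{
	/*
	 * New check: a commit this large won't go to the slog anyway, so
	 * avoid writing the data twice and switch to indirect mode.
	 */
	if (commit_bytes_so_far >= COMMIT_INDIRECT_THRESHOLD)
		return (WR_INDIRECT);
	if (latency_bias && (has_slog || write_size < IMMEDIATE_WRITE_SZ))
		return (WR_IMMEDIATE);
	return (WR_INDIRECT);
}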