Large synchronous writes are slow when a slog is present #1012
Comments
This was referenced Oct 4, 2012
|
@dechamps I was investigating this issue yesterday, which was easy to reproduce given your excellent summary of the problem. Unfortunately, I don't think it's going to be quite as trivial to fix as we'd hoped. Initially, I tried your suggestion of tweaking Upon further reflection it was also clear to me that the log size isn't really what we want to be using here. What would be far better is to remove the assumption from the existing code that the slog is always going to be the fastest storage. As you point out above this is almost certainly true for small I/Os, but large streaming I/O would be far better handled by the primary pool. Ideally we want a way to determine which set of vdevs is going to stream the fastest on your system to minimize the latency. For unrelated reasons I've already been looking at tracking additional per-vdev performance data such as IOPS and throughput. Once those enhancements get merged it would be relatively straightforward for the |
The code doesn't always assume the slog is faster than everything: when the log size exceeds The core issue is that both decisions (indirect/immediate and slog/main) are taken by different modules at different times, so we end up with an absurd end result. |
|
Sure, and I think it is pretty easy to fix the worst case behavior you described. The initial patch I put together basically added a call to USE_SLOG() when setting the slogging = spa_has_slogs(zilog->zl_spa) && USE_SLOG(zilog) &&
(zilog->zl_logbias == ZFS_LOGBIAS_LATENCY);
For testing it just happened that my pool's primary storage was about 3x faster at streaming than my slog. So even when it correctly used the slog in immediate mode there was still a 3x performance penalty compared to using indirect mode to the primary pool. This second issue got me thinking about how to just do the right thing. |
|
The code lines in your last comment are basically what I had in mind for fixing this issue.
Well, that's a trade-off. Keep in mind that for the ZIL, latency is the main performance metric, not throughput. What counts is the time it takes for Basically, the primary pool should be used if (initial latency + commit size / vdev throughput) is lower for the primary pool than for the slog. For example, for a pool with 1 SSD slog (0.1 ms latency, 100 MB/s) and 3 spindles (3 ms, 300 MB/s total), any ZIL commit larger than 0.435 MB will take less time to complete on the main pool. Which means Note that my previous demonstration is only valid when there is no congestion, i.e. the disks are idle when the commit occurs. If the disks aren't idle, then other factors come into play, and then the SSD will often win because it is likely to be less loaded than the main disks, unless all the load is on the ZIL. In addition, if disks are busy, then writing in immediate mode (i.e. twice) on the main pool halves performance, which brings us to the issue described in my original description. |
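The break-even rule in this comment can be checked with a small sketch. This is a minimal C model, assuming idle disks and treating each set of vdevs as a fixed initial latency plus a streaming rate. The struct and function names are hypothetical and are not part of ZFS.

```c
#include <assert.h>

/* Hypothetical model of a set of vdevs (not a ZFS structure). */
struct vdev_model {
	double latency_ms;      /* initial latency per commit, in ms */
	double throughput_mbps; /* aggregate sequential throughput, in MB/s */
};

/* Estimated commit time in ms: initial latency + commit size / throughput. */
double commit_time_ms(struct vdev_model v, double size_mb)
{
	assert(v.throughput_mbps > 0.0);
	return v.latency_ms + (size_mb / v.throughput_mbps) * 1000.0;
}

/*
 * Commit size (in MB) above which the main pool completes sooner than the
 * slog. Assumes the pool has the higher initial latency. Returns -1.0 when
 * the pool does not stream faster than the slog, so it never catches up.
 */
double breakeven_mb(struct vdev_model slog, struct vdev_model pool)
{
	assert(slog.throughput_mbps > 0.0 && pool.throughput_mbps > 0.0);
	assert(pool.latency_ms >= slog.latency_ms);

	/* Seconds saved per MB by streaming to the pool instead of the slog. */
	double gain_s_per_mb = 1.0 / slog.throughput_mbps - 1.0 / pool.throughput_mbps;
	if (gain_s_per_mb <= 0.0)
		return -1.0;

	return ((pool.latency_ms - slog.latency_ms) / 1000.0) / gain_s_per_mb;
}
```

With the figures from the example (slog: 0.1 ms and 100 MB/s, pool: 3 ms and 300 MB/s), `breakeven_mb` evaluates to about 0.435 MB. The model ignores congestion, which the comment itself notes can tip the balance back toward the slog.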
|
Is this issue still open? I guess it is solved by #1013? |
|
As I said in the comments of #1013, it is not. |
|
Ok, so currently it makes no sense to add a zil to a ZOL pool if there are large |
|
I assume you sidestep the issue if you set logbias=throughput?
|
Basically, yes. If you care about large synchronous writes, then adding a slog might be counter-productive. Note that this is true for all ZFS implementations, including FreeBSD and Illumos.
Yes, but if you set |
|
Thank you for this clarification. As there are so many discussions around SSDs for zil/cache, this is basically frustrating. I hope you find some time to work on this issue. |
This was referenced Aug 21, 2013
behlendorf removed this from the 0.6.5 milestone Oct 6, 2014
behlendorf added the Difficulty - Hard label Oct 6, 2014
This was referenced Apr 13, 2015
behlendorf added this to the 1.0.0 milestone Mar 26, 2016
behlendorf removed the Difficulty - Hard label Oct 5, 2016
shodanshok commented Aug 11, 2017
|
Hi all, I just read about this old, still open, ticket and wondered if the problem can be somewhat sidestepped by using a quite large |
evujumenuk commented Oct 20, 2017
|
@evujumenuk I wouldn't know. It definitely looks interesting, but the last time I looked into this was literally 5 years ago and I don't have any context around this anymore. |
evujumenuk commented Oct 23, 2017
|
Maybe @dinatale2 can shed some light on this. The question (as I understand it): is it still possible for large sync writes to be written in immediate mode to data vdevs if |
|
@evujumenuk PR #6191 does not directly address this issue. When a slog device is part of the pool the assumption is still that it offers the absolute lowest latency and is preferred when With #6191 and OpenZFS 8585 (#6566) it might be easier to implement the original suggestion. Since OpenZFS 8585 does away with the batching of blocks we may have a better idea about when the log device is being overwhelmed and should transition to indirect writes. We still want the slog to soak up bursty synchronous writes. It's also worth mentioning that if your target workload is large synchronous writes you can set |
After re-reading my original description, I don't think that's what this issue is about. The issue is that, when faced with large synchronous writes, a slog is present, and What makes sense is either writing the data in immediate mode to the slog, or in indirect mode to the main disks. Writing large blocks in immediate mode to the main disks just results in the data getting written twice (both times to the main disks) for no reason. Or at least that's what I can piece back together after re-reading my report from 5 years ago. |
Then we should close this issue because that is the current behavior. Here's the relevant block of code which decides how the log record should be written. Large blocks will always be written indirectly when a pool lacks a slog device.
long zfs_immediate_write_sz = 32768;
if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
	write_state = WR_INDIRECT;
>>> else if (!spa_has_slogs(zilog->zl_spa) &&
    resid >= zfs_immediate_write_sz)
	write_state = WR_INDIRECT;
else if (ioflag & (FSYNC | FDSYNC))
	write_state = WR_COPIED;
else
	write_state = WR_NEED_COPY;
behlendorf removed this from the 1.0.0 milestone Oct 26, 2017
Yes. If there is no slog, then there is no problem. I agree. Again, that's not what this issue is about. Everything I said in this thread assumes there is a slog attached to the pool. Again:
Examples of behaviors that would make sense, but that I did not observe when I filed this issue, include:
In any case, according to my original description the issue is quite straightforward to reproduce, so it should just be a matter of trying to reproduce it with the current code to confirm that the issue is still there. |
Hypocritus commented Dec 8, 2017
|
I think that much of the disagreement is related to the ambiguity -- and sometimes counterintuitive appearance -- of the parameter naming AND setting terminology with their at-times lack of clearly-differentiated definitions, coupled with the overriding behavior of other less-documented, less-accessible parameters' settings and/or decision logic. I believe that this is the wall that continues to be hit by many of us looking to adopt this beautiful beast with its capable promises of being the last word. For example, we have logbias=latency. "Bias" means "a predisposition towards" ... what?? We should find out by the setting: "latency". "Latency" is also an ambiguous word meaning "the time delay", having no clear positive or negative value towards which part of the overall file commit process, unfortunately for us (figuratively) less RAM-equipped thinkers. I know that I am not about to change ZFS, but when I learned the two opposite settings for logbias, being =latency or =throughput, it appeared (and continues to try to assert itself with me) that, according to ZFS' image of high standards of performance (ZFS: the LAST word in file systems), logbias=latency meant the log's predisposition is toward latency, a generally slower commit; as opposed to throughput appearing to mean (in opposition) get it done, now. "Fast". This of course is not what these parameters mean, which I believe many in the community are continually finding out. We are continually learning that this particular setting is referring to "part" of a "part" of "part" of the entire file commit decision process. In other words, for the logbias parameter's scope (which can unfortunately be overridden by other, less-accessible but equally powerful variables' settings, as well as even more powerful, all-but-undocumented logic), latency means a target of "low perceived latency" and probably worse, is supposed to mean "longer delay in having the file commit fully committed to permanent storage". 
Whereas throughput means "bypass the SLOG and go straight to disk", which, although it is the shortest by-wire path, appears to slow down the overall throughput of the sum of the file commit processes under many workloads. Pretzel-Minding... And often seemingly unfair with the other overriding parameters being less accessible, or less documented. Another example of the knotted verbiage is found in the likes of "writes being logged in indirect mode or in immediate mode." To the uninitiated, these types of statements, by way of "common" reasoning, seem to imply that "indirect mode" means that the writes are "not" directly written/logged to disk, and conversely, "direct mode" would imply that the writes "are" written/logged directly. But of course, the "opposite" is true, and we are dealing with a several-step write process that involves logging on one or more levels, at potentially 3 locations, RAM, ZIL or SLOG, and Final Resting Place. The OP does a fair job of defining the true behavior of "indirect" and "direct mode" in this context. I mean, come on. A "write being logged in indirect mode" really means that the write IS directly written to its final location, bypassing a log??? Why is the verb "logged" even used??? Why did the creators take such pains to write such convoluted, counterintuitive phrasing??? "Why God, Why???" This type of verbiage is much like saying, "It is certainly not bright outside" in describing the outside lighting at noonday, but with reference to the full-moon's illumination! A full moon is always on the opposite side of the world at noon in anyone's timezone, and therefore it is really "moon-dark", but "sun-bright" outside. It completely defies the childlike desire to want to understand "now". They just wanted to know if they could go out to play!
If us kids miss just one apparently benign but critical word, with its lack of clearly differentiated meaning and context, the concept is either misunderstood, or perhaps understood as the world exists on "Opposite Day". I fear that for many "would-be" ZFS-ists, the "last word" is unfortunately the one they never get to because it wasn't a part of an 1) official (or not), clearly-differentiated documentation, 2) in an all-encompassing, easy-to-reference manner, 3) from a single standards-oriented, open source (or not!) organization or web location. |
richardelling commented Dec 9, 2017
|
Do not assume that low-latency devices can deliver high throughput. Thus the logbias property attempts to allow some control over those conditions. |
|
I'm not quite sure if your rant is meant to be taken seriously (loved your second-to-last paragraph though). In any case, the following should clarify the terminology for those unfamiliar with the internals of ZIL implementation:
Every time ZFS commits the ZIL, it has to make a decision about two things:
Assuming nothing has changed since my original post, ZFS makes these decisions based on:
Following the decision tree that I described in my original post, there are four possible outcomes:
The reason why I filed this bug is because the above is inefficient, since it can lead to large amounts of data being written in immediate mode to the main disks even though a SLOG is present, which is a very dumb thing to do (it is strictly worse behavior than if you do not have any SLOG at all!). I believe a more efficient decision logic would be:
Such a change would simultaneously improve latency, throughput, and efficiency in the case where large synchronous writes are happening in a SLOG-enabled pool. |
dechamps commented Oct 4, 2012
Note that this issue seems to impact all ZFS implementations, not just ZFS On Linux.
ZFS uses a complicated process when it comes to deciding whether a write should be logged in indirect mode (written once by the DMU, the log records store a pointer) or in immediate mode (written in the log record, rewritten later by the DMU). Basically, it goes like this:
logbias=throughput, or
zfs_immediate_write_sz.
logbias=latency and:
zfs_immediate_write_sz, or
zil_slog_limit.
logbias=latency, there is a slog, and the total commit size is smaller than zil_slog_limit.
The decision to use indirect or immediate mode is implemented in zfs_log_write() and zvol_log_write(). The decision to use the slog or the normal vdevs is implemented in the USE_SLOG() macro used by zil_lwb_write_start.
The issue is, this decision process makes sense except for one particularly painful edge case, when these conditions are all true:
logbias=latency, and
In this situation, the optimal choice would be to write to the normal pool in indirect mode, which should give us the minimum latency considering this is a large sequential write. Indeed, for very large writes, you don't want to use immediate mode because it means writing the data twice. Even if you write the log records to the slog, this will be slower with most pool configurations with e.g. lots of spindles and one SSD slog because the aggregate sequential write throughput of all the spindles is usually greater than the SSD's.
Instead, the algorithm makes the worst decision possible: it writes the data in immediate mode to the main data disks. This means that all the (large) data will be committed as ZIL log records on the data disks first, then immediately after, it will get written again by the DMU. This means the overall throughput is halved, and if this is a sustained load, the ZIL commit latency will be doubled compared to indirect mode.
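The edge case can be illustrated with a simplified model of the two decisions as this thread describes them. This is only a sketch in C: the enum, struct and function names are hypothetical, and the real logic lives in zfs_log_write(), zvol_log_write() and the USE_SLOG() macro. The point it shows is that the mode and the target are decided independently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration only; these are not ZFS definitions. */
enum log_mode   { LOG_INDIRECT, LOG_IMMEDIATE };
enum log_target { TARGET_MAIN, TARGET_SLOG };
enum bias       { BIAS_LATENCY, BIAS_THROUGHPUT };

struct zil_choice {
	enum log_mode mode;
	enum log_target target;
};

struct zil_choice model_current_choice(bool has_slog, enum bias logbias,
    size_t write_size, size_t commit_size,
    size_t immediate_write_sz, size_t slog_limit)
{
	struct zil_choice c;

	assert(immediate_write_sz > 0 && slog_limit > 0);

	/* Mode: indirect only for logbias=throughput, or large writes with no slog. */
	if (logbias == BIAS_THROUGHPUT)
		c.mode = LOG_INDIRECT;
	else if (!has_slog && write_size >= immediate_write_sz)
		c.mode = LOG_INDIRECT;
	else
		c.mode = LOG_IMMEDIATE;

	/* Target: decided separately, from the total commit size. */
	if (has_slog && logbias == BIAS_LATENCY && commit_size < slog_limit)
		c.target = TARGET_SLOG;
	else
		c.target = TARGET_MAIN;

	return c;
}
```

Calling it with a slog present, `BIAS_LATENCY`, a 128 KB write, a 2 GB commit, 32768 for `immediate_write_sz` and an assumed 1 MB `slog_limit` yields immediate mode on the main disks, which is the case described here.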
It is shockingly easy to reproduce this issue. In pseudo-code:
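A minimal C sketch of such a reproduction, assuming the test simply writes a large amount of data to a file on the pool and then calls fsync(). The function name, the fill pattern and the chunk size are illustrative choices, not taken from the original report.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Write total_bytes to path in chunk_bytes pieces, then fsync().
 * Returns the seconds spent inside fsync(), or -1.0 on any error.
 */
double write_then_fsync(const char *path, size_t total_bytes, size_t chunk_bytes)
{
	assert(chunk_bytes > 0);

	char *buf = malloc(chunk_bytes);
	if (buf == NULL)
		return -1.0;
	memset(buf, 0xA5, chunk_bytes);

	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		free(buf);
		return -1.0;
	}

	size_t left = total_bytes;
	while (left > 0) {
		size_t n = left < chunk_bytes ? left : chunk_bytes;
		ssize_t w = write(fd, buf, n);
		if (w <= 0) {
			close(fd);
			free(buf);
			return -1.0;
		}
		left -= (size_t)w;
	}

	/* Time only the fsync(), which is what forces the ZIL commit. */
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	int rc = fsync(fd);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	close(fd);
	free(buf);
	if (rc != 0)
		return -1.0;

	return (double)(t1.tv_sec - t0.tv_sec) +
	    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

On a 64-bit system, a call such as `write_then_fsync("/tank/testfile", (size_t)2 << 30, (size_t)1 << 20)` writes 2 GB before the fsync(); `/tank/testfile` is a hypothetical path on the pool under test. If compression is enabled on the dataset, the constant fill pattern may shrink the amount actually written.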
Watch the zil_stats kstat page when that runs.
If you don't have a slog in your pool, then the fsync() call will complete in roughly the time it takes to write 2 GB sequentially to your main disks. This is optimal.
If you have a slog in your pool, then the fsync() call will generate twice as much write activity, and will write up to 4 GB to your main disks. Ironically, the slog won't be used at all when that happens.
The solution would be to modify the algorithm in zfs_log_write() and zvol_log_write() so that, in the conditions mentioned above, it switches to indirect writes when the commit size reaches a certain threshold (e.g. 32 MB).
I would gladly write a patch, but I won't have the time to do it, so I'm just leaving the result of my research here in case anyone's interested. If anyone wants to write the patch, it should be very simple to implement it.
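A sketch of what the suggested change could look like, in C. This is not an actual patch: the function name, its parameters and the threshold macro are hypothetical, and it assumes the total commit size is known when the mode is chosen, which may not hold in the real code because the two decisions are made by different modules at different times.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum log_mode { LOG_INDIRECT, LOG_IMMEDIATE };

/* Hypothetical threshold, using the 32 MB figure given as an example. */
#define INDIRECT_COMMIT_THRESHOLD ((size_t)32 << 20)

enum log_mode proposed_mode(bool has_slog, bool bias_latency,
    size_t write_size, size_t commit_size, size_t immediate_write_sz)
{
	assert(immediate_write_sz > 0);

	/* Existing rules as described in this thread. */
	if (!bias_latency)
		return LOG_INDIRECT;
	if (!has_slog && write_size >= immediate_write_sz)
		return LOG_INDIRECT;

	/*
	 * Suggested addition: with a slog present, stop logging large
	 * writes in immediate mode once the commit has grown past the
	 * threshold, so the data is written once by the DMU instead of
	 * twice to the main disks.
	 */
	if (has_slog && write_size >= immediate_write_sz &&
	    commit_size >= INDIRECT_COMMIT_THRESHOLD)
		return LOG_INDIRECT;

	return LOG_IMMEDIATE;
}
```

Small or bursty synchronous writes still return `LOG_IMMEDIATE` and can be absorbed by the slog; only large writes in a commit that has passed the threshold switch to indirect mode.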