Large synchronous writes are slow when a slog is present #1012

Open
dechamps opened this Issue Oct 4, 2012 · 21 comments

Contributor

dechamps commented Oct 4, 2012

Note that this issue seems to impact all ZFS implementations, not just ZFS On Linux.

ZFS uses a complicated process when it comes to deciding whether a write should be logged in indirect mode (written once by the DMU, the log records store a pointer) or in immediate mode (written in the log record, rewritten later by the DMU). Basically, it goes like this:

  • Write in indirect mode to the data vdevs if:
    • logbias=throughput, or
    • There is no slog and the write is larger than zfs_immediate_write_sz.
  • Write in immediate mode to the data vdevs if logbias=latency and:
    • There is no slog and the write is smaller than zfs_immediate_write_sz, or
    • There is a slog and the total commit size is larger than zil_slog_limit.
  • Write in immediate mode to the slog vdevs if logbias=latency, there is a slog, and the total commit size is smaller than zil_slog_limit.

The decision to use indirect or immediate mode is implemented in zfs_log_write() and zvol_log_write(). The decision to use the slog or the normal vdevs is implemented in the USE_SLOG() macro used by zil_lwb_write_start.
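As a rough sketch (not the actual ZFS code), the list above can be modeled as a pure function. Every type, function, and parameter name below is made up for illustration. Treating "larger than" as `>=` and "smaller than" as `<` is an assumption about the boundary cases, and the real USE_SLOG() condition may be more involved than a single comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; none of these exist in ZFS. */
enum log_mode   { LOG_INDIRECT, LOG_IMMEDIATE };
enum log_target { LOG_TO_MAIN, LOG_TO_SLOG };

struct log_decision {
	enum log_mode   mode;
	enum log_target target;
};

/*
 * Simplified model of the decision list above. The two steps run
 * independently, mirroring the split between zfs_log_write() /
 * zvol_log_write() and USE_SLOG().
 */
struct log_decision
current_decision(bool logbias_throughput, bool has_slog,
    uint64_t write_size, uint64_t commit_size,
    uint64_t immediate_write_sz, uint64_t slog_limit)
{
	struct log_decision d = { LOG_INDIRECT, LOG_TO_MAIN };

	/* Step 1: indirect or immediate? */
	if (logbias_throughput)
		return (d);
	if (!has_slog && write_size >= immediate_write_sz)
		return (d);
	d.mode = LOG_IMMEDIATE;

	/* Step 2: slog or main disks? Step 1 is never revisited. */
	if (has_slog && commit_size < slog_limit)
		d.target = LOG_TO_SLOG;
	return (d);
}
```

With a slog present, logbias=latency, and a commit above the limit, this model returns immediate mode on the main disks, which is the edge case discussed next.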

The issue is, this decision process makes sense except for one particularly painful edge case, when these conditions are all true:

  • logbias=latency, and
  • There is a slog, and
  • There are large writes in the ZIL to be committed (e.g. > 100 MB).

In this situation, the optimal choice would be to write to the normal pool in indirect mode, which should give us the minimum latency considering this is a large sequential write. Indeed, for very large writes, you don't want to use immediate mode because it means writing the data twice. Even if you write the log records to the slog, this will be slower on most pool configurations (e.g. lots of spindles and one SSD slog), because the aggregate sequential write throughput of all the spindles is usually greater than the SSD's.

Instead, the algorithm makes the worst decision possible: it writes the data in immediate mode to the main data disks. This means that all the (large) data will be committed as ZIL log records on the data disks first, and then immediately afterwards it will be written again by the DMU. The overall throughput is therefore halved, and if this is a sustained load, the ZIL commit latency will be doubled compared to indirect mode.

It is shockingly easy to reproduce this issue. In pseudo-code:

open(file)
write(file, lots of data) // e.g. 2 GB
fsync(file)

Watch the zil_stats kstat page when that runs.
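For anyone who wants something runnable, here is a minimal sketch of the pseudo-code above using plain POSIX calls. The function name `timed_fsync_write`, the 1 MiB chunk size, and the fill byte are my own choices, not anything taken from ZFS. It only times the fsync() call, and a repeating fill byte compresses well, so this assumes compression is disabled on the dataset under test.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Write total_bytes to path in 1 MiB chunks, then call fsync() once.
 * Returns the seconds spent inside fsync(), or -1.0 on any error.
 */
double
timed_fsync_write(const char *path, uint64_t total_bytes)
{
	enum { CHUNK = 1 << 20 };
	struct timespec t0, t1;
	double elapsed = -1.0;
	char *buf = malloc(CHUNK);
	int fd;

	if (buf == NULL)
		return (-1.0);
	memset(buf, 0xA5, CHUNK);	/* constant fill, highly compressible */

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		free(buf);
		return (-1.0);
	}

	while (total_bytes > 0) {
		size_t want = total_bytes < (uint64_t)CHUNK ?
		    (size_t)total_bytes : (size_t)CHUNK;
		ssize_t n = write(fd, buf, want);

		if (n <= 0)
			goto out;
		total_bytes -= (uint64_t)n;
	}

	/* Only the fsync() itself is timed. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (fsync(fd) == 0) {
		clock_gettime(CLOCK_MONOTONIC, &t1);
		elapsed = (double)(t1.tv_sec - t0.tv_sec) +
		    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	}
out:
	close(fd);
	free(buf);
	return (elapsed);
}
```

Calling it with a path on the pool under test and a size of `2ULL << 30` (2 GiB) corresponds to the example above; comparing the returned time with and without a slog attached should show the difference.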

If you don't have a slog in your pool, then the fsync() call will complete in roughly the time it takes to write 2 GB sequentially to your main disks. This is optimal.

If you have a slog in your pool, then the fsync() call will generate twice as much write activity, and will write up to 4 GB to your main disks. Ironically, the slog won't be used at all when that happens.
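The arithmetic behind the "up to 4 GB" figure can be sketched as follows. This is a simplified accounting with hypothetical names: it ignores log record headers, metadata, and block pointers, and assumes the DMU writes the data to the main disks exactly once in every case.

```c
#include <stdint.h>

/* Hypothetical names for illustration; none of these exist in ZFS. */
enum log_mode   { LOG_INDIRECT, LOG_IMMEDIATE };
enum log_target { LOG_TO_MAIN, LOG_TO_SLOG };

struct bytes_written {
	uint64_t main_disks;
	uint64_t slog;
};

/* Approximate payload bytes written per device class for one sync write. */
struct bytes_written
sync_write_cost(enum log_mode mode, enum log_target target,
    uint64_t data_bytes)
{
	/* The data reaches its final location on the main disks once. */
	struct bytes_written w = { data_bytes, 0 };

	if (mode == LOG_IMMEDIATE) {
		/* Immediate mode also copies the data into the log records. */
		if (target == LOG_TO_SLOG)
			w.slog += data_bytes;
		else
			w.main_disks += data_bytes;
	}
	return (w);
}
```

For a 2 GiB write, immediate mode on the main disks gives 4 GiB on the main disks and nothing on the slog, while indirect mode gives 2 GiB on the main disks.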

The solution would be to modify the algorithm in zfs_log_write() and zvol_log_write() so that, in the conditions mentioned above, it switches to indirect writes when the commit size reaches a certain threshold (e.g. 32 MB).

I would gladly write a patch, but I won't have the time to do it, so I'm just leaving the result of my research here in case anyone's interested. If anyone wants to write the patch, it should be very simple to implement it.

Owner

behlendorf commented Oct 17, 2012

@dechamps I was investigating this issue yesterday, which was easy to reproduce given your excellent summary of the problem. Unfortunately, I don't think it's going to be quite as trivial to fix as we'd hoped.

Initially, I tried your suggestion of tweaking zfs_log_write() and zvol_log_write() to switch to indirect mode when exceeding a commit size threshold. In practice that proved problematic since on my test system the log never grew to a large enough size where I could set a reasonable default threshold.

Upon further reflection it was also clear to me that the log size isn't really what we want to be using here. What would be far better is to remove the assumption from the existing code that the slog is always going to be the fastest storage. As you point out above, this is almost certainly true for small I/Os, but large streaming I/O would be far better handled by the primary pool.

Ideally we want a way to determine which set of vdevs is going to stream the fastest on your system to minimize the latency. For unrelated reasons I've already been looking at tracking additional per-vdev performance data such as IOPS and throughput. Once those enhancements get merged it would be relatively straightforward for zfs_log_write() and zvol_log_write() to take device performance into account and do the right thing.

Contributor

dechamps commented Oct 17, 2012

> What would be far better is to remove the assumption from the existing code that the slog is always going to be the fastest storage.

The code doesn't always assume the slog is faster than everything: when the log size exceeds zil_slog_limit, it switches to the main pool. The issue is that by the time it makes this decision, it has already decided that the write will be in immediate mode, so it ends up writing in immediate mode to the main pool, which is not what we want.

The core issue is that both decisions (indirect/immediate and slog/main) are taken by different modules at different times, so we end up with an absurd end result.

Owner

behlendorf commented Oct 17, 2012

Sure, and I think it is pretty easy to fix the worst case behavior you described. The initial patch I put together basically added a call to USE_SLOG() when setting the slogging variable in zfs_log_write() and zvol_log_write(). That allowed it to change to indirect mode at roughly the right time. Perhaps that's still worth doing in the short term.

        slogging = spa_has_slogs(zilog->zl_spa) && USE_SLOG(zilog) &&
            (zilog->zl_logbias == ZFS_LOGBIAS_LATENCY);

For testing it just happened that my pool's primary storage was about 3x faster at streaming than my slog. So even when it correctly used the slog in immediate mode there was still a 3x performance penalty compared to using indirect mode to the primary pool. This second issue got me thinking about how to just do the right thing.

Contributor

dechamps commented Oct 17, 2012

The code lines in your last comment are basically what I had in mind for fixing this issue.

> For testing it just happened that my pool's primary storage was about 3x faster at streaming than my slog. So even when it correctly used the slog in immediate mode there was still a 3x performance penalty compared to using indirect mode to the primary pool. This second issue got me thinking about how to just do the right thing.

Well, that's a trade-off. Keep in mind that for the ZIL, latency is the main performance metric, not throughput. What counts is the time it takes for zil_commit() to complete and nothing else. In your case with primary storage 3x faster (streaming) than slog, small commits should still go to the slog, because they will complete much faster (~ 0.1 ms versus ~ 3 ms, assuming it's an SSD). Large commits, however, should go to the primary pool because the actual write time (determined by disk throughput) dominates the initial seek latency.

Basically, the primary pool should be used if (initial latency + commit size / vdev throughput) is smaller for the primary pool than for the slog. For example, for a pool with 1 SSD slog (0.1 ms latency, 100 MB/s) and 3 spindles (3 ms, 300 MB/s total), any ZIL commit larger than 0.435 MB will take less time to complete on the main pool, which means zil_slog_limit should be set to roughly 512 KB.
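The break-even point in that example can be checked with a few lines of C. This is only a sketch of the simple model in the paragraph above (commit time = initial latency + commit size / throughput); the function and parameter names are made up, and it says nothing about how ZFS itself measures or uses these quantities.

```c
/*
 * Commit size (in MB) at which the slog and the main pool take equal
 * time under the model: time = latency + size / throughput.
 * Larger commits finish sooner on the main pool. Returns -1.0 when the
 * pool does not stream faster than the slog; any negative result means
 * there is no positive crossover in this model.
 */
double
break_even_mb(double slog_latency_s, double slog_mb_per_s,
    double pool_latency_s, double pool_mb_per_s)
{
	/* Solve  Ls + S / Ts = Lp + S / Tp  for S. */
	double per_mb_gap = 1.0 / slog_mb_per_s - 1.0 / pool_mb_per_s;

	if (per_mb_gap <= 0.0)
		return (-1.0);
	return ((pool_latency_s - slog_latency_s) / per_mb_gap);
}
```

With the numbers above, `break_even_mb(0.0001, 100.0, 0.003, 300.0)` evaluates 0.0029 / (1/100 - 1/300), which is 0.435 MB.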

Note that my previous demonstration is only valid when there is no congestion, i.e. the disks are idle when the commit occurs. If the disks aren't idle, then other factors come into play, and then the SSD will often win because it is likely to be less loaded than the main disks, unless all the load is on the ZIL. In addition, if disks are busy, then writing in immediate mode (i.e. twice) on the main pool halves performance, which brings us to the issue described in my original description.

Contributor

pyavdr commented Mar 26, 2013

Is this issue still open? I guess it is solved with #1013?

Contributor

dechamps commented Mar 26, 2013

As I said in the comments of #1013, it is not.

Contributor

pyavdr commented Mar 26, 2013

Ok, so currently it makes no sense to add a zil to a ZOL pool, if there are large synchronous writes like iscsi or nfs to zvols?

Contributor

ColdCanuck commented Mar 26, 2013

I assume you sidestep the issue if you set the logbias=throughput ???

On Mar 26,2013, at 10:35 , P.SCH wrote:

> Ok, so currently it makes no sense to add a zil to a ZOL pool, if there are large synchronous writes like iscsi or nfs to zvols?


Contributor

dechamps commented Mar 26, 2013

> Ok, so currently it makes no sense to add a zil to a ZOL pool, if there are large synchronous writes like iscsi or nfs to zvols?

Basically, yes. If you care about large synchronous writes, then adding a slog might be counter-productive. Note that this is true for all ZFS implementations, including FreeBSD and Illumos.

> I assume you sidestep the issue if you set the logbias=throughput ???

Yes, but if you set logbias=throughput then the slog is never used for synchronous writes, even small ones, so that makes the slog useless (unless you have some other dataset that uses it).

Contributor

pyavdr commented Mar 26, 2013

Thank you for this clarification. As there is so much discussion around SSDs for zil/cache, this is rather frustrating. I hope you find some time to work on this issue.

@behlendorf behlendorf removed this from the 0.6.5 milestone Oct 6, 2014

This was referenced Apr 13, 2015

@behlendorf behlendorf added this to the 1.0.0 milestone Mar 26, 2016

Hi all, I just read about this old, still open ticket and wondered if the problem can be somewhat sidestepped by using a quite large zil_slog_limit. Sure, with fast main pools this still impairs performance, as large synchronous writes will be logged to the ZIL and to the main pool, but it should avoid the problem of 2X writes to the main pool. Am I right, or am I missing something?

PR #6191 looks like it inadvertently fixes this problem. @dechamps, can you confirm (or refute)?

Contributor

dechamps commented Oct 21, 2017

@evujumenuk I wouldn't know. It definitely looks interesting, but the last time I looked into this was literally 5 years ago and I don't have any context around this anymore.

Maybe @dinatale2 can shed some light on this. The question (as I understand it): is it still possible for large sync writes to be written in immediate mode to data vdevs if logbias=latency and a SLOG exists?

Owner

behlendorf commented Oct 23, 2017

@evujumenuk PR #6191 does not directly address this issue. When a slog device is part of the pool the assumption is still that it offers the absolute lowest latency and is preferred when logbias=latency.

With #6191 and OpenZFS 8585 (#6566) it might be easier to implement the original suggestion. Since OpenZFS 8585 does away with the batching of blocks we may have a better idea about when the log device is being overwhelmed and should transition to indirect writes. We still want the slog to soak up bursty synchronous writes.

It's also worth mentioning that if your target workload is large synchronous writes you can set logbias=throughput on the dataset today and prevent this double writing.

Contributor

dechamps commented Oct 24, 2017

@behlendorf

> When a slog device is part of the pool the assumption is still that it offers the absolute lowest latency and is preferred when logbias=latency.

After re-reading my original description, I don't think that's what this issue is about. The issue is that, when faced with large synchronous writes while a slog is present and logbias=latency, ZFS will decide to write the data in immediate mode to the main disks (not the slog!), which makes absolutely no sense under any scenario, even if you "assume that slog devices offer the absolute lowest latency".

What makes sense is either writing the data in immediate mode to the slog, or in indirect mode to the main disks. Writing large blocks in immediate mode to the main disks just results in the data getting written twice (both times to the main disks) for no reason.

Or at least that's what I can piece back together after re-reading my report from 5 years ago.

Owner

behlendorf commented Oct 26, 2017

> What makes sense is either writing the data in immediate mode to the slog, or in indirect mode to the main disks.

Then we should close this issue because that is the current behavior. Here's the relevant block of code which decides how the log record should be written. Large blocks will always be written indirectly when a pool lacks a slog device.

long zfs_immediate_write_sz = 32768;
        if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
                write_state = WR_INDIRECT;
>>>     else if (!spa_has_slogs(zilog->zl_spa) &&
            resid >= zfs_immediate_write_sz)
                write_state = WR_INDIRECT;
        else if (ioflag & (FSYNC | FDSYNC))
                write_state = WR_COPIED;
        else
                write_state = WR_NEED_COPY;

@behlendorf behlendorf removed this from the 1.0.0 milestone Oct 26, 2017

Contributor

dechamps commented Oct 26, 2017

> Large blocks will always be written indirectly when a pool lacks a slog device.

Yes. If there is no slog, then there is no problem. I agree. Again, that's not what this issue is about. Everything I said in this thread assumes there is a slog attached to the pool.

Again:

  • If there is no slog, large synchronous writes are going to be written in indirect mode to the main disks → optimal behavior
  • If there is a slog, large synchronous writes are going to be written in immediate mode to the main disks, and therefore written there twice (and the slog stays idle!) → makes no sense

Examples of behaviors that would make sense, but that I did not observe when I filed this issue, include:

  • If there is a slog, large synchronous writes are going to be written to the slog in immediate mode and then permanently to the main disks
  • If there is a slog, large synchronous writes are going to be written in indirect mode to the main disks

In any case, according to my original description the issue is quite straightforward to reproduce, so it should just be a matter of trying to reproduce it with the current code to confirm that the issue is still there.

Hypocritus commented Dec 8, 2017

I think that much of the disagreement is related to the ambiguity -- and sometimes counterintuitive appearance -- of the parameter naming AND setting terminology with their at-times lack of clearly-differentiated definitions, coupled with the overriding behavior of other less-documented, less-accessible parameters' settings and/or decision logic. I believe that this is the wall that continues to be hit by many of us looking to adopt this beautiful beast with its capable promises of being the last word.

For example, we have logbias=latency. "Bias" means "a predisposition towards" ... what?? We should find out by the setting: "latency". "Latency" is also an ambiguous word meaning "the time delay", having no clear positive or negative value towards which part of the overall file commit process, unfortunately for us (figuratively) less RAM-equipped thinkers.

I know that I am not about to change ZFS, but when I learned the two opposite settings for logbias, being =latency or =throughput, it appeared (and continues to try to assert itself with me) that, according to ZFS' image of high standards of performance (ZFS: the LAST word in file systems), logbias=latency meant the log's predisposition is toward latency, a generally slower commit; as opposed to throughput appearing to mean (in opposition) get it done, now. "Fast".

This of course is not what these parameters mean, which I believe many in the community are continually finding out. We are continually learning that this particular setting is referring to "part" of a "part" of "part" of the entire file commit decision process.

In other words, for the logbias parameter's scope (which can unfortunately be overridden by other, less-accessible but equally powerful variables' settings, as well as even more powerful, all-but-undocumented logic), latency means a target of "low perceived latency" and probably worse, is supposed to mean "longer delay in having the file commit fully committed to permanent storage". Whereas throughput means "bypass the SLOG and go straight to disk", which, although is the shortest by-wire path, appears to slow down the overall throughput of the sum of the file commit processes under many workloads.

Pretzel-Minding... And often seemingly unfair with the other overriding parameters being less accessible, or less documented.

Another example of the knotted verbiage is found in the likes of "writes being logged in indirect mode or in immediate mode." To the uninitiated, these types of statements, by way of "common" reasoning, seem to imply that "indirect mode" means that the writes are "not" directly written/logged to disk, and conversely, "direct mode" would imply that the writes "are" written/logged directly. But of course, the "opposite" is true, and we are dealing with a several-step write process that involves logging on one or more levels, at potentially 3 locations: RAM, ZIL or SLOG, and Final Resting Place. The OP does a fair job of defining the true behavior of "indirect" and "direct mode" in this context.

I mean, come on. A "write being logged in indirect mode" really means that the write IS directly written to its final location, bypassing a log??? Why is the verb "logged" even used??? Why did the creators take such pains to write such convoluted, counterintuitive phrasing??? "Why God, Why???" This type of verbiage is much like saying, "It is certainly not bright outside" in describing the outside lighting at noonday, but with reference to the full-moon's illumination! A full moon is always on the opposite side of the world at noon in anyone's timezone, and therefore it is really "moon-dark", but "sun-bright" outside. It completely defies the childlike desire to want to understand "now". They just wanted to know if they could go out to play! If us kids miss just one apparently benign but critical word, with its lack of clearly differentiated meaning and context, the concept is either misunderstood, or perhaps understood as the world exists on "Opposite Day".

I fear that for many "would-be" ZFS-ists, the "last word" is unfortunately the one they never get to because it wasn't a part of an 1) official (or not), clearly-differentiated documentation, 2) in an all-encompassing, easy-to-reference manner, 3) from a single standards-oriented, open source (or not!) organization or web location.

Do not assume that low-latency devices can deliver high throughput. Thus the logbias property attempts to allow some control over those conditions.

Contributor

dechamps commented Dec 12, 2017

@Hypocritus

I'm not quite sure if your rant is meant to be taken seriously (loved your second-to-last paragraph though). In any case, the following should clarify the terminology for those unfamiliar with the internals of the ZIL implementation:

  • logbias=latency means "optimize ZIL behavior for minimum latency". The goal is to make individual sync operations complete as quickly as possible, sacrificing overall efficiency and throughput in the process. This mode is meant to be used in applications that don't write a lot of data, but want this data to be committed to disk as quickly as possible. Which is typically what one wants for sync operations, hence it is the default.

  • logbias=throughput means "optimize ZIL behavior for maximum throughput". The goal is to make it possible to efficiently write large amounts of data in a synchronous manner, where the time it takes for a sync() call to complete doesn't matter much but the total write throughput (and overall I/O load) does. This mode sacrifices sync operation latency for a more efficient use of resources.

  • "Immediate mode" means that the data itself is written directly inside ZIL blocks when the ZIL is committed to disk. This approach provides the lowest latency and allows the ZIL commit (and thus the sync operation) to complete as quickly as possible, but it is inefficient because the data is rewritten again at TXG commit time (ZIL blocks are ephemeral).

  • "Indirect mode" means the data is written to its "final resting place" at ZIL commit time (by going directly to the DMU), and the ZIL only contains pointers to the data. It is called "indirect" precisely because the ZIL blocks contain pointers: it is literally a layer of indirection. This doesn't require rewriting the data because it's already in the right place, so it's a more efficient use of resources. However it might also take longer to commit the ZIL (i.e. sync operation latency is increased), because there are now two blocks (the ZIL block and the "final" block) that need to be written in two potentially separate places, possibly incurring the cost of a seek.

Every time ZFS commits the ZIL, it has to make a decision about two things:

  • Where the ZIL itself should be stored (SLOG - if there is one - or main disks)
  • Whether individual blocks should be written in immediate mode or indirect mode

Assuming nothing has changed since my original post, ZFS makes these decisions based on:

  • The value of the logbias property
  • The size of the block to be written (related to the zfs_immediate_write_sz tunable)
  • The total size of the pending writes to be committed (related to the zil_slog_limit tunable)

Following the decision tree that I described in my original post, there are four possible outcomes:

  1. Main disks, indirect: if logbias=throughput (which overrides everything else), or there is no slog and the write is larger than zfs_immediate_write_sz
  2. Main disks, immediate: if there is no SLOG and the write is smaller than zfs_immediate_write_sz, or there is a SLOG and the total size of the writes to be committed is larger than zil_slog_limit.
  3. SLOG, indirect: never happens
  4. SLOG, immediate: if there is a SLOG (duh) and the total size of the writes to be committed is smaller than zil_slog_limit.

I filed this bug because the above is inefficient: it can lead to large amounts of data being written in immediate mode to the main disks even though a SLOG is present, which is a very dumb thing to do (it is strictly worse behavior than if you do not have any SLOG at all!). I believe a more efficient decision logic would be:

  1. Main disks, indirect: (same as above)
  2. Main disks, immediate: if there is no SLOG and the write is smaller than zfs_immediate_write_sz.
  3. SLOG, indirect: if there is a SLOG and the total size of the writes to be committed is larger than zil_slog_limit.
  4. SLOG, immediate: if there is a SLOG and the total size of the writes to be committed is smaller than zil_slog_limit.

Such a change would simultaneously improve latency, throughput, and efficiency in the case where large synchronous writes are happening in a SLOG-enabled pool.
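To make the proposed logic concrete, here is a rough sketch of it as a pure function. This is not a patch and not ZFS code: all names are hypothetical, the boundary comparisons (`<` for "smaller than") are an assumption, and the target field only describes where the ZIL blocks go. In indirect mode the data itself still goes to its final location on the main disks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; none of these exist in ZFS. */
enum log_mode   { LOG_INDIRECT, LOG_IMMEDIATE };
enum log_target { LOG_TO_MAIN, LOG_TO_SLOG };

struct log_decision {
	enum log_mode   mode;
	enum log_target target;	/* where the ZIL blocks are stored */
};

/* Sketch of the four proposed outcomes listed above. */
struct log_decision
proposed_decision(bool logbias_throughput, bool has_slog,
    uint64_t write_size, uint64_t commit_size,
    uint64_t immediate_write_sz, uint64_t slog_limit)
{
	struct log_decision d = { LOG_INDIRECT, LOG_TO_MAIN };

	/* Outcome 1: logbias=throughput overrides everything else. */
	if (logbias_throughput)
		return (d);

	if (!has_slog) {
		/* Outcomes 1 and 2: without a SLOG only the write size matters. */
		if (write_size < immediate_write_sz)
			d.mode = LOG_IMMEDIATE;
		return (d);
	}

	/* Outcomes 3 and 4: with a SLOG the ZIL blocks always go to it. */
	d.target = LOG_TO_SLOG;
	if (commit_size < slog_limit)
		d.mode = LOG_IMMEDIATE;
	return (d);
}
```

The combination "main disks, immediate" can then only be reached when there is no SLOG, so a large commit on a SLOG-enabled pool is never written twice to the main disks.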
