Sequential file rewrite outside of block boundaries is dead slow #361
Thanks for the detailed diagnosis of the problem! Unfortunately, I think a reasonable fix for this issue is tricky. That said, I may be able to suggest a decent workaround for now.
So as you rightly point out, ZFS currently has no logic to detect the unaligned-sequential-rewrite case. As a result, it must read in the full block, update it, checksum it, and write it out as part of the next txg. This read can completely ruin performance, and it isn't needed at all if the next write dirties the rest of the block.
Now, deferring these reads to txg sync as you suggest should be possible, and it would allow some of them to be optimized out. However, doing so would have a devastating performance impact on the unaligned-random-overwrite case: the txg sync has to block on any required deferred reads, which would now only be started at txg sync time.
My feeling is that the right fix for this lies somewhere between these two extremes. Perhaps something as simple as initially issuing this as a low priority read and adding some code to either:
A) Cancel it if it's not needed. For the sequential case we'd need to be able to determine this before it's sent to the disk. Or,
B) Increase its priority once the txg begins syncing to ensure good forward progress is maintained.
Now until this sort of optimization gets made there are some things you could try to make the situation better. You could try adding a large L2ARC device which could be used to service these reads. If it's big enough to cache a large fraction of the VM image I'd expect it to help. However, you may need to set l2arc_noprefetch=0 to ensure sequential access patterns get cached by the L2ARC.
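The workaround above could be applied along these lines. This is a sketch, not from the original thread: the pool name (`tank`) and cache device path are assumptions, and the sysfs path applies to ZFS on Linux.

```shell
# Attach a cache (L2ARC) device to the pool -- pool name and device
# path are hypothetical; substitute your own.
zpool add tank cache /dev/nvme0n1

# Allow sequential (prefetched) reads to be cached by the L2ARC too.
# On ZFS on Linux this is exposed as a module parameter.
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
```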
suggested patch/fix:
Hi,
This problem occurs on FreeBSD. Here is a patch that attempts to fix the issue.
The cause is that, under high memory pressure, ZFS wrongly evicts cached metadata before data during a rewrite.
People who experience the same issue might be interested.
James Pan
http://lists.open-zfs.org/pipermail/developer/2015-January/001222.html
(This was originally posted on the zfs-discuss mailing list.)
Let me demonstrate:
Now let's create a 20 GB file in the pool.
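The actual commands did not survive in this copy of the post; a minimal sketch of the kind of sequential write involved, with a hypothetical path and a size scaled down from the original 20 GB:

```shell
# Sequential write of a test file with a 1 MB buffer.
# The path is hypothetical; the original test wrote a 20 GB file
# (roughly: bs=1M count=20480 onto the ZFS pool).
FILE=${FILE:-/tmp/zfs-seqwrite-test}
dd if=/dev/zero of="$FILE" bs=1M count=64 status=none   # 64 MB, scaled down
ls -l "$FILE"
```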
High throughput, as expected. Let's rewrite this file using a 128 KB buffer size:
Still high throughput, no issue here. Now, the same thing using a 64 KB buffer size:
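The rewrite commands themselves were also elided; a sketch of how both tests could be reproduced, assuming the default 128 KB ZFS recordsize (file path and size are hypothetical and scaled down):

```shell
FILE=${FILE:-/tmp/zfs-rewrite-test}
SIZE_MB=64
dd if=/dev/zero of="$FILE" bs=1M count=$SIZE_MB status=none   # initial write

# Rewrite with a 128 KB buffer: each request covers exactly one record,
# so no read-modify-write is needed -> fast.
dd if=/dev/zero of="$FILE" bs=128k count=$((SIZE_MB * 8)) conv=notrunc status=none

# Rewrite with a 64 KB buffer: each request covers only half a record,
# forcing ZFS to read each block back before rewriting it -> slow on a
# real pool (harmless on a plain filesystem like /tmp).
dd if=/dev/zero of="$FILE" bs=64k count=$((SIZE_MB * 16)) conv=notrunc status=none
```

`conv=notrunc` is what makes these rewrites rather than truncating re-creates, matching the in-place rewrite pattern described in the report.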
That's 13 times slower. Argh.
Here's an `iostat` during the 128 KB rewrite test:

And during the 64 KB rewrite test:
Note that Solaris 10 Update 10 exhibits the same problem, so it is present in the original ZFS. I did some other tests which led me to the following conclusion:
When rewriting a file, and the beginning and end of each individual write request is not aligned with a ZFS block boundary, then performance will drop dramatically.
Synchronous write speed is worse, even with a fast SSD as a separate log device.
Now, this issue wouldn't surprise me for random writes, but I don't expect such behavior for a purely sequential workload. What's interesting is that `iostat` shows high read activity when this happens. That's probably what's causing the performance drop, especially if the reads are random.

I can see why ZFS needs to read a block if it is only partially rewritten (checksumming, etc.), but I would expect it to do at least some kind of write coalescing for sequential writes. This way ZFS could "know" that a block is, in fact, going to be entirely rewritten and that there is no need to read it. Right now it seems ZFS is not smart enough to do this.
This is a big issue when storing VM disk images on ZFS, because VMs write on sector (512 byte) boundaries, not ZFS block boundaries. This basically means that once the disk file has expanded to its final size, sequential write performance inside the VM becomes absolutely terrible.
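One common mitigation for the VM-image case (a standard ZFS tuning step, not something proposed in this thread) is to make the dataset's record size match the guest's typical write size, so that more guest writes land on ZFS block boundaries. Dataset names and sizes below are assumptions:

```shell
# Match recordsize to the guest's write size. This must be set before the
# image file is written; blocks of existing files keep their old size.
zfs set recordsize=64k tank/vm-images

# For zvol-backed VMs the equivalent knob is volblocksize, which can only
# be chosen at creation time.
zfs create -V 100G -o volblocksize=64k tank/vm-disk1
```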
I did some digging using DTrace on the Solaris machine. It seems the best candidates for triggering reads are `dmu_tx_check_ioerr()` and `dbuf_will_dirty()`. The culprit is that these functions are called as part of `zfs_write()`, which means they get called for each write request.

Merging writes in `zfs_write()` doesn't seem like an easy thing to do, so here's another idea: instead of calling these functions for each write request, call them when syncing the TXG. Given that all pending writes go into the TXG, this seems like the perfect place to go over the pending writes, merge them, read what needs to be read, then sync the TXG. Not only would this eliminate most of the read requests, but it would also make the remaining reads more efficient: indeed, by sorting read requests we can optimize disk head movement.

Bug #290 seems to be caused by this issue.