Sequential file rewrite outside of block boundaries is dead slow #361

Open
dechamps opened this Issue Aug 12, 2011 · 3 comments

4 participants

@dechamps

(This was originally posted on the zfs-discuss mailing list.)

Let me demonstrate:

# zpool create -f homez sdab sdac sdad sdae

Now let's create a 20 GB file in the pool.

# dd if=/dev/zero of=/homez/zero bs=1048576 count=20480
20480+0 records in
20480+0 records out
21474836480 bytes (21 GB) copied, 52.6104 s, 408 MB/s

High throughput, as expected. Let's rewrite this file using a 128 KB buffer size:

# dd if=/dev/zero of=/homez/zero bs=131072 count=163840 conv=notrunc
163840+0 records in
163840+0 records out
21474836480 bytes (21 GB) copied, 60.5479 s, 355 MB/s

Still high throughput, no issue here. Now, the same thing using a 64 KB buffer size:

# dd if=/dev/zero of=/homez/zero bs=65536 count=327680 conv=notrunc
327680+0 records in
327680+0 records out
21474836480 bytes (21 GB) copied, 812.825 s, 26.4 MB/s

That's 13 times slower. Argh.

Here's an iostat during the 128 KB rewrite test:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
homez       20.8G  1.07T      2  3.16K   142K   391M
homez       20.8G  1.07T      3  3.15K   206K   389M
homez       20.9G  1.07T     13  2.87K   833K   351M
homez       20.9G  1.07T     15  2.80K   999K   347M
homez       20.0G  1.07T     14  2.25K   896K   262M
homez       20.0G  1.07T     12  1.79K   781K   193M
homez       20.0G  1.07T      6  2.28K   410K   267M
homez       20.8G  1.07T     16  2.47K  1.00M   301M
homez       20.9G  1.07T     19  2.79K  1.19M   345M
homez       20.9G  1.07T     15  2.86K   948K   349M

And during the 64 KB rewrite test:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
homez       20.3G  1.07T    202    237  25.3M  25.7M
homez       20.3G  1.07T    201    236  25.2M  25.6M
homez       20.3G  1.07T    203    218  25.4M  25.3M
homez       20.3G  1.07T    199    217  24.9M  25.3M
homez       20.2G  1.07T    206    216  25.9M  25.0M
homez       20.3G  1.07T    196    224  24.6M  26.0M
homez       20.3G  1.07T    203    212  25.4M  24.6M
homez       20.2G  1.07T    201    218  25.2M  25.3M
homez       20.3G  1.07T    200    218  25.1M  25.3M

Note that Solaris 10 Update 10 exhibits the same problem, so it is present in the original ZFS. I did some other tests which led me to the following conclusion:

When rewriting a file, if the beginning and end of each individual write request are not aligned with ZFS block boundaries, performance drops dramatically.
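
(For context, these tests use the default dataset recordsize. A quick way to confirm it; 128K is the ZFS default, and the exact output formatting may vary:)

# zfs get recordsize homez
NAME   PROPERTY    VALUE  SOURCE
homez  recordsize  128K   default

With 128 KB records, the bs=131072 rewrite dirties whole blocks, while the bs=65536 rewrite only dirties half of each block, so every 128 KB block has to be read in before its first partial rewrite.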

Synchronous write speed is worse, even with a fast SSD as a separate log device.

Now, this issue wouldn't surprise me for random writes, but I don't expect such behavior for a purely sequential workload. What's interesting is that iostat shows high read activity when this happens. That's probably what's causing the performance drop, especially if the reads are random.

I can see why ZFS needs to read a block if it is only partially rewritten (checksumming, etc.), but I would expect it to do at least some kind of write coalescing for sequential writes. That way ZFS could "know" that a block is, in fact, going to be entirely rewritten and that there is no need to read it. Right now it seems ZFS is not smart enough to do this.

This is a big issue when storing VM disk images on ZFS, because VMs write on sector (512 byte) boundaries, not ZFS block boundaries. This basically means that once the disk file has expanded to its final size, sequential write performance inside the VM becomes absolutely terrible.

I did some digging using DTrace on the Solaris machine. It seems the best candidates for triggering reads are dmu_tx_check_ioerr() and dbuf_will_dirty(). The culprit is that these functions are called as part of zfs_write(), which means they will get called for each write request.
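
(As an illustration of the kind of probing involved, rather than the exact script I ran, an fbt one-liner like the following counts how often those functions are entered and from where:)

# dtrace -n 'fbt::dmu_tx_check_ioerr:entry, fbt::dbuf_will_dirty:entry { @[probefunc, stack(5)] = count(); }'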

Merging writes in zfs_write() doesn't seem like an easy thing to do, so here's another idea: instead of calling these functions for each write request, call them when syncing the TXG. Given that all pending writes go into the TXG, this seems like the perfect place to go over the pending writes, merge them, read what needs to be read, then sync the TXG. Not only would this eliminate most of the read requests, but it would also make the remaining reads more efficient: indeed, by sorting read requests we can optimize disk head movement.

Bug #290 seems to be caused by this issue.

@behlendorf
ZFS on Linux member

Thanks for the detailed diagnosis of the problem! Unfortunately, I think a reasonable fix for this issue is tricky. However, I may be able to suggest a decent workaround for now.

So as you rightly point out, ZFS currently has no logic to detect the unaligned-sequential-rewrite case. As a result, it must read in the full block, update it, checksum it, and write it out as part of the next txg. This read can completely ruin performance and isn't needed if the next write dirties the rest of the block.

Now, deferring these reads to txg sync as you suggest should be possible, and it would allow you to optimize some of them out. However, if this were done it would have a devastating performance impact on the unaligned-random-overwrite case. The txg sync has to block for any required deferred reads, which would now only be started at txg sync time.

My feeling is that the right fix lies somewhere between these two extremes. Perhaps something as simple as initially issuing it as a low-priority read and adding some code to either:

A) Cancel it if it's not needed. For the sequential case we'd need to be able to determine this before it's sent to the disk. Or,

B) Increase its priority once the txg begins syncing, to ensure good forward progress is maintained.

Until this sort of optimization gets made, there are some things you could try to make the situation better. You could try adding a large L2ARC device, which could be used to service these reads. If it's big enough to cache a large fraction of the VM image, I'd expect it to help. However, you may need to set l2arc_noprefetch=0 to ensure sequential access patterns get cached by the L2ARC.
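
(Roughly, and with "sdaf" standing in for whatever SSD you have available, the workaround looks like this on ZFS on Linux:)

# zpool add homez cache sdaf
# echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch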

@dechamps dechamps added a commit to dechamps/zfs that referenced this issue Sep 5, 2011
@dechamps dechamps Improve ZVOL queue behavior.
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.

Detailed rationale:

 - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
   handle writes of any size, so there's no reason to impose a limit. Let the
   upper layer decide.

 - max_segments and max_segment_size are set to unlimited. zvol_write() will
   copy the requests' contents into a dbuf anyway, so the number and size of
   the segments are irrelevant. Let the upper layer decide.

 - physical_block_size and io_opt are set to the ZVOL's block size. This
   has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
   the upper layers that writes smaller than the volume's block size will be
   slow.

 - The NONROT flag is set to indicate this isn't a rotational device.
   Although the backing zpool might be composed of rotational devices, the
   resulting ZVOL often doesn't exhibit the same behavior due to the COW
   mechanisms used by ZFS. Setting this flag will prevent upper layers from
   making useless decisions (such as reordering writes) based on incorrect
   assumptions about the behavior of the ZVOL.
3f56d78
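
(For reference, once a ZVOL exists, the queue settings this commit configures can be checked through sysfs; "zd0" is just the usual name of the first ZVOL device node:)

# cat /sys/block/zd0/queue/physical_block_size
# cat /sys/block/zd0/queue/optimal_io_size
# cat /sys/block/zd0/queue/rotational
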
@dechamps dechamps added a commit to dechamps/zfs that referenced this issue Feb 3, 2012
@dechamps dechamps Improve ZVOL queue behavior.
b5491eb
@behlendorf behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 8, 2012
@dechamps dechamps Improve ZVOL queue behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
34037af
@ryao
ZFS on Linux member

Upstream issue filed:

https://www.illumos.org/issues/3794

@behlendorf behlendorf removed this from the 0.8.0 milestone Oct 3, 2014
@kernelOfTruth

suggested patch/fix:

Hi,
This problem occurs on FreeBSD. Here is a patch trying to fix the issue.
The cause is that, under high memory pressure, ZFS will wrongly evict cached metadata before data during a rewrite.

People who experience the same issue might be interested.

James Pan

http://lists.open-zfs.org/pipermail/developer/2015-January/001222.html

?
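
(If the metadata-eviction theory applies to your setup, the ARC metadata counters can be watched while the rewrite runs; on ZFS on Linux they are exposed in arcstats:)

# grep arc_meta /proc/spl/kstat/zfs/arcstats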

@behlendorf behlendorf added this to the 1.0.0 milestone Mar 26, 2016