Don't ashift-align vdev read requests #1022

Closed
wants to merge 1 commit

Conversation

@dechamps (Contributor) commented Oct 7, 2012

Currently, the size of read and write requests on vdevs is aligned according to the vdev's ashift, allocating a new ZIO buffer and padding if need be.

This makes sense for write requests to prevent read/modify/write if the write happens to be smaller than the device's internal block size.

For reads, however, the rationale is less clear. It seems that the original code aligns reads because, on Solaris, device drivers will outright refuse unaligned requests.

We don't have that issue on Linux. Indeed, Linux block devices are able to accept requests of any size, and take care of alignment issues themselves.

As a result, there's no point in enforcing alignment for read requests on Linux. This is a nice optimization opportunity for two reasons:

  • We remove a memory allocation in a heavily-used code path;
  • The request gets aligned in the lowest layer possible, which shrinks the path that the additional, useless padding data has to travel. For example, when using 4k-sector drives that lie about their sector
    size, using 512b read requests instead of 4k means that there will be less data traveling down the ATA/SCSI interface, even though the drive actually reads 4k from the platter.

The only exception is raidz, because raidz needs to read the whole allocated block for parity.

This patch removes alignment enforcement for read requests, except on raidz. Note that we also remove an assertion that checks that we're aligning a top-level vdev I/O, because that's no longer the case for repair writes that result from failed reads.
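For context, here is a rough sketch of the alignment logic in question, paraphrased from the zio_vdev_io_start() code of that era. It is only an illustration of the behaviour described above, not a verbatim copy of the patch; treat the exact condition and names as approximate.

```c
/*
 * Rough sketch of the vdev alignment logic (paraphrased from
 * zio_vdev_io_start(); not the actual diff).  The read/raidz
 * condition reflects the behaviour this patch introduces.
 */
uint64_t align = 1ULL << vd->vdev_top->vdev_ashift;

if (P2PHASE(zio->io_size, align) != 0 &&
    (zio->io_type != ZIO_TYPE_READ ||		/* keep padding writes...     */
    vd->vdev_ops == &vdev_raidz_ops)) {		/* ...and pad reads only on raidz */
	uint64_t asize = P2ROUNDUP(zio->io_size, align);
	char *abuf = zio_buf_alloc(asize);	/* temporary padded buffer */

	/* The old ASSERT(vd == vd->vdev_top) is dropped, as noted above. */

	if (zio->io_type == ZIO_TYPE_WRITE) {
		/* Copy the payload and zero-fill the padding. */
		bcopy(zio->io_data, abuf, zio->io_size);
		bzero(abuf + zio->io_size, asize - zio->io_size);
	}

	/*
	 * Swap the padded buffer in for the duration of the physical
	 * I/O; zio_pop_transforms() undoes this when the I/O is done.
	 */
	zio_push_transform(zio, abuf, asize, asize, zio_subblock);
}
```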

@ryao (Contributor) commented Oct 7, 2012

What happens if another read occurs within the area that we would have had thanks to the alignment? In particular, how does the ARC handle it? It already allocates power-of-2-sized storage space meant to hold these things. If it has data from an unaligned read, and then it suddenly needs to cache data that it would have had, is it able to handle that?

@dechamps (Contributor, Author) commented Oct 7, 2012

> What happens if another read occurs within the area that we would have had thanks to the alignment?

Well, I'm not sure that's actually possible since I don't think there can be more than one ZFS block per "ashift block". Even if it were possible, the usual I/O aggregation algorithms would still apply.

> In particular, how does the ARC handle it? It already allocates power-of-2-sized storage space meant to hold these things. If it has data from an unaligned read, and then it suddenly needs to cache data that it would have had, is it able to handle that?

The ARC has nothing to do with this. This alignment-related code is a low-level vdev zio detail that is completely abstracted away from the upper layers, which still see the original I/O size. That's how zio_push_transform() works: the ZIO is transformed after the upper layer has issued it, and when it's done, it's transformed back to its original form in zio_pop_transforms() before the upper layer (the ARC in your example) is called back. The whole process is completely transparent and the upper layer is blissfully unaware of it. It still thinks it's issuing 512-byte reads even though the vdev zio code expands it to 4k.
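To make that concrete, here is roughly what the push side of a transform looks like, paraphrased from the zio.c of that era; treat field and function names as approximate rather than authoritative.

```c
/*
 * Paraphrased sketch of zio_push_transform(): record the caller's
 * original buffer and size on a per-zio stack, then point the zio at
 * the substitute (e.g. padded) buffer for the physical I/O.
 */
void
zio_push_transform(zio_t *zio, void *data, uint64_t size, uint64_t bufsize,
    zio_transform_func_t *transform)
{
	zio_transform_t *zt = kmem_alloc(sizeof (zio_transform_t), KM_SLEEP);

	zt->zt_orig_data = zio->io_data;	/* what the upper layer handed us */
	zt->zt_orig_size = zio->io_size;
	zt->zt_bufsize = bufsize;		/* how much to free at pop time (0 = nothing) */
	zt->zt_transform = transform;		/* callback run at pop time */
	zt->zt_next = zio->io_transform_stack;
	zio->io_transform_stack = zt;

	zio->io_data = data;			/* physical I/O now targets this buffer */
	zio->io_size = size;
}
```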

@dechamps (Contributor, Author) commented Oct 7, 2012

It would probably be best to verify that this doesn't cause any issues with devices that really expose 4k as their sector size. My guess is that the kernel will take care of aligning the requests, but this needs verification. I'm not even sure how to test such a case.

@dechamps (Contributor, Author) commented Oct 7, 2012

> My concern for the ARC is that this change would introduce internal fragmentation.

Again, this doesn't change anything for the ARC. The ARC will still do the same allocations with the same sizes. The only thing that changes is that we skip allocating one buffer in the ZIO pipeline. That buffer has nothing to do with the ARC, it's just a temporary buffer which only lives for the duration of the physical I/O. It is used as the buffer for the read request, gets filled with data from the vdev, gets copied back to the original buffer (e.g. an ARC buffer), and then is thrown away. In other words, the ARC buffer doesn't change size, a temporary buffer with the desired size is used instead. So, as far as memory management is concerned, this change has only upsides. We're skipping an allocation and a copy operation for free.
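The fill, copy-back, and throw-away sequence described above corresponds to the pop side of the transform stack. Again a paraphrased sketch of the zio.c code of that era, not a verbatim quote:

```c
/*
 * Paraphrased sketch of the pop side.  zio_subblock() copies only the
 * bytes the caller originally asked for out of the padded buffer;
 * zio_pop_transforms() then frees the temporary buffer and restores the
 * zio's original data pointer and size before the upper layer sees it.
 */
static void
zio_subblock(zio_t *zio, void *data, uint64_t size)
{
	ASSERT(zio->io_size > size);

	if (zio->io_type == ZIO_TYPE_READ)
		bcopy(zio->io_data, data, size);	/* padded buffer -> original buffer */
}

static void
zio_pop_transforms(zio_t *zio)
{
	zio_transform_t *zt;

	while ((zt = zio->io_transform_stack) != NULL) {
		if (zt->zt_transform != NULL)
			zt->zt_transform(zio, zt->zt_orig_data, zt->zt_orig_size);

		if (zt->zt_bufsize != 0)
			zio_buf_free(zio->io_data, zt->zt_bufsize);	/* the temporary buffer */

		zio->io_data = zt->zt_orig_data;	/* upper layer sees its own buffer again */
		zio->io_size = zt->zt_orig_size;
		zio->io_transform_stack = zt->zt_next;

		kmem_free(zt, sizeof (zio_transform_t));
	}
}
```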

> Memory allocations can be expensive, but fragmentation and additional IOs are even more expensive.

I still don't understand which additional I/Os you are talking about. There can be only one ZFS buffer per allocation unit. There's no way the original 4k read could cover multiple ZFS blocks. In the case of a 512b ZFS block, it's just 512b of data and 3.5k of garbage.

> Which allocation does this avoid and have you seen measurable improvements by avoiding it?

I didn't measure anything, as I don't have my test hardware anymore. Maybe someone can do some benchmarks, but to be honest, I don't expect to see any significant difference for normal workloads under standard conditions. It's just a small optimization, but I don't see why we should pass on skipping an allocation and reducing I/O sizes for free.

@ryao (Contributor) commented Oct 8, 2012

I was half asleep when I made that comment, which is why I deleted it shortly afterward. I will be more careful to avoid commenting when I am groggy in the future.

@behlendorf (Contributor)

@dechamps That's a nice optimization. Eliminating that extra allocation and copy is particularly important for Linux where we're already pushing the VM subsystem harder than I'd like. I doubt this will make much of a difference for small pools, but for larger configurations with more concurrent I/O this should help.

One minor issue here is that your change conflicts with the other trim patches. Could you push an updated version against your trim-clean branch? I expect (and hope) the trim changes will get merged first.

@dechamps (Contributor, Author) commented Oct 8, 2012

> One minor issue here is that your change conflicts with the other trim patches. Could you push an updated version against your trim-clean branch? I expect (and hope) the trim changes will get merged first.

Well, I think I'll just wait for one of the branches to be merged and then rebase the other one.

@behlendorf (Contributor)

@dechamps Works for me.

@pyavdr (Contributor) commented Oct 12, 2012

@dechamps
Congratulations, that is really a very good increase in performance. I ran a bonnie++ benchmark with the zfs patch 98e37d2 applied to a single-device pool, using a zvol (no dedup, no compression) read and written over iSCSI with the block size set to 64k. Reads jumped from 82 MB/s to 130 MB/s, random seeks from 860/sec to 1000/sec, and latency dropped from 830 ms to 250 ms. Tested with ESXi 5, between openSUSE 12.2 and Ubuntu 12.04 guests, connected via iSCSI in blockio mode. On the other hand, in the same environment, ext4 in place of the zvol reads at about 190 MB/s.

@behlendorf (Contributor)

@dechamps Good call on waiting. TRIM is going to take some more testing on my part before I'm comfortable with it, but this change is quite straightforward and I didn't want to hold it up any longer. Merged as a5c20e2

unya pushed a commit to unya/zfs that referenced this pull request Dec 13, 2013

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#1022