Don't ashift-align vdev read requests #1022

Closed
wants to merge 1 commit

Conversation

@dechamps (Contributor) commented Oct 7, 2012

Currently, the size of read and write requests on vdevs is aligned according to the vdev's ashift, allocating a new ZIO buffer and padding if need be.

This makes sense for write requests to prevent read/modify/write if the write happens to be smaller than the device's internal block size.

For reads, however, the rationale is less clear. It seems that the original code aligns reads because, on Solaris, device drivers will outright refuse unaligned requests.

We don't have that issue on Linux. Indeed, Linux block devices are able to accept requests of any size, and take care of alignment issues themselves.

As a result, there's no point in enforcing alignment for read requests on Linux. This is a nice optimization opportunity for two reasons:

  • We remove a memory allocation in a heavily-used code path;
  • The request gets aligned in the lowest layer possible, which shrinks the path that the additional, useless padding data has to travel. For example, when using 4k-sector drives that lie about their sector
    size, using 512b read requests instead of 4k means that there will be less data traveling down the ATA/SCSI interface, even though the drive actually reads 4k from the platter.

The only exception is raidz, because raidz needs to read the whole allocated block for parity.

This patch removes alignment enforcement for read requests, except on raidz. Note that we also remove an assertion that checks that we're aligning a top-level vdev I/O, because that's no longer the case for repair writes that result from failed reads.
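For context, here is a rough sketch of the alignment logic in question, paraphrased from the zio_vdev_io_start() code of that era. It is only an illustration of the behaviour described above, not a verbatim copy of the patch; treat the exact condition and names as approximate.

```c
/*
 * Rough sketch of the vdev alignment logic (paraphrased from
 * zio_vdev_io_start(); not the actual diff).  The read/raidz
 * condition reflects the behaviour this patch introduces.
 */
uint64_t align = 1ULL << vd->vdev_top->vdev_ashift;

if (P2PHASE(zio->io_size, align) != 0 &&
    (zio->io_type != ZIO_TYPE_READ ||		/* keep padding writes...     */
    vd->vdev_ops == &vdev_raidz_ops)) {		/* ...and pad reads only on raidz */
	uint64_t asize = P2ROUNDUP(zio->io_size, align);
	char *abuf = zio_buf_alloc(asize);	/* temporary padded buffer */

	/* The old ASSERT(vd == vd->vdev_top) is dropped, as noted above. */

	if (zio->io_type == ZIO_TYPE_WRITE) {
		/* Copy the payload and zero-fill the padding. */
		bcopy(zio->io_data, abuf, zio->io_size);
		bzero(abuf + zio->io_size, asize - zio->io_size);
	}

	/*
	 * Swap the padded buffer in for the duration of the physical
	 * I/O; zio_pop_transforms() undoes this when the I/O is done.
	 */
	zio_push_transform(zio, abuf, asize, asize, zio_subblock);
}
```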

@ryao (Contributor) commented Oct 7, 2012

What happens if another read occurs within the area that we would have had thanks to the alignment? In particular, how does the ARC handle it? It already allocates power-of-2-sized storage space meant to hold these things. If it has data from an unaligned read, and then it suddenly needs to cache data that it would have had, is it able to handle that?

@dechamps (Contributor, Author) commented Oct 7, 2012

> What happens if another read occurs within the area that we would have had thanks to the alignment?

Well, I'm not sure that's actually possible since I don't think there can be more than one ZFS block per "ashift block". Even if it were possible, the usual I/O aggregation algorithms would still apply.

> In particular, how does the ARC handle it? It already allocates power-of-2-sized storage space meant to hold these things. If it has data from an unaligned read, and then it suddenly needs to cache data that it would have had, is it able to handle that?

The ARC has nothing to do with this. This alignment-related code is a low-level vdev zio detail that is completely abstracted away from the upper layers, which still see the original I/O size. That's how zio_push_transform() works: the ZIO is transformed after the upper layer has issued it, and when it's done, it's transformed back to its original form in zio_pop_transforms() before the upper layer (the ARC in your example) is called back. The whole process is completely transparent and the upper layer is blissfully unaware of it. It still thinks it's issuing 512-byte reads even though the vdev zio code expands it to 4k.
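To make that concrete, here is roughly what the push side of a transform looks like, paraphrased from the zio.c of that era; treat field and function names as approximate rather than authoritative.

```c
/*
 * Paraphrased sketch of zio_push_transform(): record the caller's
 * original buffer and size on a per-zio stack, then point the zio at
 * the substitute (e.g. padded) buffer for the physical I/O.
 */
void
zio_push_transform(zio_t *zio, void *data, uint64_t size, uint64_t bufsize,
    zio_transform_func_t *transform)
{
	zio_transform_t *zt = kmem_alloc(sizeof (zio_transform_t), KM_SLEEP);

	zt->zt_orig_data = zio->io_data;	/* what the upper layer handed us */
	zt->zt_orig_size = zio->io_size;
	zt->zt_bufsize = bufsize;		/* how much to free at pop time (0 = nothing) */
	zt->zt_transform = transform;		/* callback run at pop time */
	zt->zt_next = zio->io_transform_stack;
	zio->io_transform_stack = zt;

	zio->io_data = data;			/* physical I/O now targets this buffer */
	zio->io_size = size;
}
```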

@dechamps (Contributor, Author) commented Oct 7, 2012

It would probably be best to verify that this doesn't cause any issues with devices that really expose 4k as their sector size. My guess is that the kernel will take care of aligning the requests, but this needs verification. I'm not even sure how to test such a case.

@dechamps (Contributor, Author) commented Oct 7, 2012

> My concern for the ARC is that this change would introduce internal fragmentation.

Again, this doesn't change anything for the ARC. The ARC will still do the same allocations with the same sizes. The only thing that changes is that we skip allocating one buffer in the ZIO pipeline. That buffer has nothing to do with the ARC, it's just a temporary buffer which only lives for the duration of the physical I/O. It is used as the buffer for the read request, gets filled with data from the vdev, gets copied back to the original buffer (e.g. an ARC buffer), and then is thrown away. In other words, the ARC buffer doesn't change size, a temporary buffer with the desired size is used instead. So, as far as memory management is concerned, this change has only upsides. We're skipping an allocation and a copy operation for free.
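The fill, copy-back, and throw-away sequence described above corresponds to the pop side of the transform stack. Again a paraphrased sketch of the zio.c code of that era, not a verbatim quote:

```c
/*
 * Paraphrased sketch of the pop side.  zio_subblock() copies only the
 * bytes the caller originally asked for out of the padded buffer;
 * zio_pop_transforms() then frees the temporary buffer and restores the
 * zio's original data pointer and size before the upper layer sees it.
 */
static void
zio_subblock(zio_t *zio, void *data, uint64_t size)
{
	ASSERT(zio->io_size > size);

	if (zio->io_type == ZIO_TYPE_READ)
		bcopy(zio->io_data, data, size);	/* padded buffer -> original buffer */
}

static void
zio_pop_transforms(zio_t *zio)
{
	zio_transform_t *zt;

	while ((zt = zio->io_transform_stack) != NULL) {
		if (zt->zt_transform != NULL)
			zt->zt_transform(zio, zt->zt_orig_data, zt->zt_orig_size);

		if (zt->zt_bufsize != 0)
			zio_buf_free(zio->io_data, zt->zt_bufsize);	/* the temporary buffer */

		zio->io_data = zt->zt_orig_data;	/* upper layer sees its own buffer again */
		zio->io_size = zt->zt_orig_size;
		zio->io_transform_stack = zt->zt_next;

		kmem_free(zt, sizeof (zio_transform_t));
	}
}
```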

> Memory allocations can be expensive, but fragmentation and additional IOs are even more expensive.

I still don't understand which additional I/Os you are talking about. There can be only one ZFS buffer per allocation unit. There's no way the original 4k read could cover multiple ZFS blocks. In the case of a 512b ZFS block, it's just 512b of data and 3.5k of garbage.

> Which allocation does this avoid and have you seen measurable improvements by avoiding it?

I didn't measure anything, as I don't have my test hardware anymore. Maybe someone can do some benchmarks, but to be honest, I don't expect to see any significant difference for normal workloads under standard conditions. It's just a small optimization, but I don't see why we should pass on skipping an allocation and reducing I/O sizes for free.

@ryao (Contributor) commented Oct 8, 2012

I was half asleep when I made that comment, which is why I deleted it shortly afterward. I will be more careful to avoid commenting when I am groggy in the future.

@behlendorf (Contributor)

@dechamps That's a nice optimization. Eliminating that extra allocation and copy is particularly important for Linux where we're already pushing the VM subsystem harder than I'd like. I doubt this will make much of a difference for small pools, but for larger configurations with more concurrent I/O this should help.

One minor issue here is that your change conflicts with the other trim patches. Could you push an updated version against your trim-clean branch? I expect (and hope) the trim changes will get merged first.

@dechamps (Contributor, Author) commented Oct 8, 2012

> One minor issue here is that your change conflicts with the other trim patches. Could you push an updated version against your trim-clean branch? I expect (and hope) the trim changes will get merged first.

Well, I think I'll just wait for one of the branches to be merged and then rebase the other one.

@behlendorf (Contributor)

@dechamps Works for me.

@pyavdr (Contributor) commented Oct 12, 2012

@dechamps
Congratulations, that is really a very good increase in performance. I ran a bonnie++ benchmark with the zfs patch 98e37d2 applied to a single-device pool, using a zvol (no dedup, no compression) read and written over iSCSI with the block size set to 64k. Reads jumped from 82 MB/s to 130 MB/s, random seeks from 860/sec to 1000/sec, and latency dropped from 830 ms to 250 ms. Tested with ESXi 5, between openSUSE 12.2 and Ubuntu 12.04 guests, connected via iSCSI in blockio mode. On the other hand, in the same environment, ext4 in place of the zvol reads at about 190 MB/s.

@behlendorf (Contributor)

@dechamps Good call on waiting. TRIM is going to take some more testing on my part before I'm comfortable with it, but this change is quite straightforward and I didn't want to hold it up any longer. Merged as a5c20e2

unya pushed a commit to unya/zfs that referenced this pull request Dec 13, 2013

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#1022