Don't ashift-align vdev read requests #1022
Conversation
Currently, the size of read and write requests on vdevs is aligned according to the vdev's ashift, allocating a new ZIO buffer and padding if need be.

This makes sense for write requests, to prevent read/modify/write if the write happens to be smaller than the device's internal block size.

For reads, however, the rationale is less clear. It seems that the original code aligns reads because, on Solaris, device drivers will outright refuse unaligned requests. We don't have that issue on Linux: Linux block devices are able to accept requests of any size and take care of alignment issues themselves. As a result, there's no point in enforcing alignment for read requests on Linux. This is a nice optimization opportunity for two reasons:

- We remove a memory allocation in a heavily-used code path;
- The request gets aligned in the lowest layer possible, which shrinks the path that the additional, useless padding data has to travel. For example, when using 4k-sector drives that lie about their sector size, issuing 512b read requests instead of 4k means that less data travels down the ATA/SCSI interface, even though the drive actually reads 4k from the platter.

The only exception is raidz, because raidz needs to read the whole allocated block for parity.

This patch removes alignment enforcement for read requests, except on raidz. Note that we also remove an assertion that checks that we're aligning a top-level vdev I/O, because that's no longer the case for repair writes that result from failed reads.
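For context, here is a simplified sketch of the padding logic this patch touches in `zio_vdev_io_start()`, showing the behaviour the patch aims for. It is a hand-written approximation rather than the exact upstream code: the `P2PHASE`/`P2ROUNDUP` macros and the `zio_buf_alloc`/`zio_push_transform` calls follow the usual ZFS naming, but the surrounding checks are condensed, and the raidz test shown here merely stands in for "pad writes and raidz, skip other reads".

```c
/*
 * Simplified sketch (not the exact upstream code): how an undersized
 * vdev I/O gets padded to the vdev's ashift, limited here to the cases
 * that still need it after this patch.
 */
uint64_t align = 1ULL << vd->vdev_top->vdev_ashift;

if (P2PHASE(zio->io_size, align) != 0 &&
    (zio->io_type == ZIO_TYPE_WRITE ||       /* writes: avoid device RMW     */
    vd->vdev_ops == &vdev_raidz_ops)) {      /* raidz: needs the whole block */
	/* Round the I/O up to the ashift and use a temporary padded buffer. */
	uint64_t asize = P2ROUNDUP(zio->io_size, align);
	char *abuf = zio_buf_alloc(asize);

	if (zio->io_type == ZIO_TYPE_WRITE) {
		bcopy(zio->io_data, abuf, zio->io_size);
		bzero(abuf + zio->io_size, asize - zio->io_size);
	}
	zio_push_transform(zio, abuf, asize, asize, zio_subblock);
}
/*
 * With this patch, reads on plain disk/file vdevs skip the branch above
 * and go to the device at their original size; the Linux block layer
 * handles any sub-sector alignment itself.
 */
```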
What happens if another read occurs within the area that we would have covered thanks to the alignment? In particular, how does the ARC handle it? It already allocates power-of-2-sized storage meant to hold these things. If it has data from an unaligned read, and then suddenly needs to cache data that it would otherwise have had, is it able to handle that?
Well, I'm not sure that's actually possible since I don't think there can be more than one ZFS block per "ashift block". Even if it were possible, the usual I/O aggregation algorithms would still apply.
The ARC has nothing to do with this. This alignment-related code is a low-level vdev zio detail that is completely abstracted away from the upper layers, which still see the original I/O size. That's how
It would probably be best to check that this doesn't cause any issues with devices that really expose 4k as their sector size. My guess is the kernel will take care of aligning the requests, but this needs verification. I'm not even sure how to test such a case.
Again, this doesn't change anything for the ARC. The ARC will still do the same allocations with the same sizes. The only thing that changes is that we skip allocating one buffer in the ZIO pipeline. That buffer has nothing to do with the ARC, it's just a temporary buffer which only lives for the duration of the physical I/O. It is used as the buffer for the read request, gets filled with data from the vdev, gets copied back to the original buffer (e.g. an ARC buffer), and then is thrown away. In other words, the ARC buffer doesn't change size, a temporary buffer with the desired size is used instead. So, as far as memory management is concerned, this change has only upsides. We're skipping an allocation and a copy operation for free.
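For reference, the copy-back step described above is performed by the transform callback that gets popped when the physical I/O completes. Pre-ABD ZFS uses a small `zio_subblock`-style callback roughly along these lines (a sketch from memory, not a verified copy of the source):

```c
/*
 * Sketch of the copy-back transform: when the padded physical read
 * completes, only the original io_size bytes are copied back into the
 * caller's buffer (e.g. an ARC buffer); the temporary padded buffer is
 * then freed by the transform-pop machinery.
 */
static void
zio_subblock(zio_t *zio, void *data, uint64_t size)
{
	ASSERT(zio->io_size > size);

	if (zio->io_type == ZIO_TYPE_READ)
		bcopy(zio->io_data, data, size);
}
```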
I still don't understand which additional I/Os you are talking about. There can be only one ZFS buffer per allocation unit. There's no way the original 4k read could cover multiple ZFS blocks. In the case of a 512b ZFS block, it's just 512b of data and 3.5k of garbage.
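To make the numbers concrete, here is a small standalone illustration (ordinary user-space C, not ZFS code) of the padding arithmetic, using the same `P2PHASE`/`P2ROUNDUP` idiom:

```c
#include <stdio.h>
#include <stdint.h>

/* Same idiom as the ZFS P2* macros: align must be a power of two. */
#define P2PHASE(x, align)   ((x) & ((align) - 1))
#define P2ROUNDUP(x, align) (-(-(x) & -(align)))

int main(void)
{
	uint64_t ashift  = 12;              /* 4k-sector vdev */
	uint64_t align   = 1ULL << ashift;  /* 4096 bytes */
	uint64_t io_size = 512;             /* 512b ZFS block being read */

	if (P2PHASE(io_size, align) != 0) {
		uint64_t asize = P2ROUNDUP(io_size, align);
		/* Prints: read padded from 512 to 4096 bytes: 3584 bytes of garbage */
		printf("read padded from %llu to %llu bytes: "
		    "%llu bytes of garbage\n",
		    (unsigned long long)io_size,
		    (unsigned long long)asize,
		    (unsigned long long)(asize - io_size));
	}
	return 0;
}
```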
I didn't measure anything, as I don't have my test hardware anymore. Maybe someone can do some benchmarks, but to be honest, I don't expect to see any significant difference for normal workloads under standard conditions. It's just a small optimization, but I don't see why we should pass on skipping an allocation and reducing I/O sizes for free.
I was half asleep when I made that comment, which is why I deleted it shortly afterward. I will be more careful to avoid commenting when I am groggy in the future.
@dechamps That's a nice optimization. Eliminating that extra allocation and copy is particularly important for Linux, where we're already pushing the VM subsystem harder than I'd like. I doubt this will make much of a difference for small pools, but for larger configurations with more concurrent I/O this should help. One minor issue is that your change conflicts with the other trim patches. Could you push an updated version against your
Well, I think I'll just wait for one of the branches to be merged and then rebase the other one.
@dechamps Works for me. |
@dechamps |
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#1022