**Bandwidth & IOPs as a function of blocksize**

Here's one clear result: the FireCuda 530 is capable of a little over 3 GB/s even with a tiny blocksize of 4 kB:

*[plot: bandwidth as a function of blocksize]*

In terms of IOPs, I'm still not quite getting to the 1 million IOPs the spec sheet claims should be possible... but I'll keep trying!

I find this plot super-exciting:

*[plot]*

It says that, even when reading tiny chunks (4 kB or 8 kB) at random locations from a single file, we're still able to hit speeds of 3 GB/s (for 4 kB) and 5.8 GB/s (for 8 kB). This bodes very well for sharded Zarr! (And possibly for speeding up kerchunk, too.)
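(For context, bandwidth and IOPs are two views of the same measurement, linked by the blocksize: IOPs = bandwidth ÷ blocksize. At a 4 kB blocksize, 3 GB/s works out to roughly 3 × 10⁹ ÷ 4096 ≈ 730,000 IOPs, which is consistent with landing a little short of the 1 million IOPs headline figure.)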
**Interaction between IO queue depth & number of parallel threads**

To my surprise, […]

(I find this surprising because, if the kernel is doing all the hard work of the IO, then I'm not sure why it matters how many userspace threads there are...)

I don't know for sure, but I assume […]
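For anyone wanting to reproduce this: the two knobs in `fio` are `iodepth` (the IO queue depth per job) and `numjobs` (the number of parallel workers; add `--thread` to use threads rather than processes). A minimal sketch, with a placeholder filename rather than the exact config used here:

```bash
# Sketch: 4 parallel jobs, each keeping 32 random 4 kB reads in flight
# via io_uring, bypassing the page cache with O_DIRECT.
fio --name=qd-vs-threads \
    --filename=/mnt/nvme/testfile \
    --rw=randread --bs=4k --direct=1 \
    --ioengine=io_uring --iodepth=32 \
    --numjobs=4 --group_reporting \
    --runtime=30 --time_based --size=10G
```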
**Interaction between blocksize & number of parallel threads**

Having 4 parallel worker threads also helps across a range of blocksizes. This first plot is with an iodepth of 32:

*[plot: bandwidth vs. blocksize and thread count, iodepth=32]*

The benefit of having multiple threads is less pronounced with an iodepth of 128 (although, strangely, the max performance for a blocksize of 8 kB is reduced with an iodepth of 128!):

*[plot: bandwidth vs. blocksize and thread count, iodepth=128]*

Each bar in these two plots shows the mean across 6 runs. The order was randomised on each iteration.
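A sweep like the one behind these bars can be scripted along these lines (a simplified sketch: the path and parameter grid are placeholders, and it omits the randomised 6-fold repetition described above):

```bash
#!/usr/bin/env bash
# Sweep blocksize x number-of-jobs at a fixed iodepth,
# writing one JSON result file per run.
for bs in 4k 8k 16k 32k 64k 128k; do
  for jobs in 1 2 4; do
    fio --name="bs${bs}-j${jobs}" \
        --filename=/mnt/nvme/testfile \
        --rw=randread --bs="$bs" --direct=1 \
        --ioengine=io_uring --iodepth=32 \
        --numjobs="$jobs" --group_reporting \
        --runtime=10 --time_based --size=10G \
        --output-format=json --output="bs${bs}_j${jobs}.json"
  done
done
```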
**Effect of […]**
For IOPs at particularly small `bs`, `hipri` is key. What's often missed is that you need to configure nvme to have poll queues; if you don't, then you don't really get the benefit of polled IO. Verify by checking dmesg:

```
nvme nvme0: 8/0/4 default/read/poll queues
```

which for this box tells me I have 8 default queues and 4 poll queues. This can be set with `nvme.poll_queues=4` as a boot parameter if nvme is built in, otherwise `poll_queues=N` as a module option for nvme.
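Concretely, something like this (the modprobe.d path and initramfs command are common conventions, not from the comment above; adjust for your distro):

```bash
# Check how many poll queues the nvme driver created:
dmesg | grep -i 'poll queues'
# e.g.  nvme nvme0: 8/0/4 default/read/poll queues

# If nvme is a loadable module, persist the option:
echo 'options nvme poll_queues=4' | sudo tee /etc/modprobe.d/nvme-poll.conf
sudo update-initramfs -u   # Debian/Ubuntu; rebuild initramfs, then reboot

# If nvme is built into the kernel, add to the kernel command line instead:
#   nvme.poll_queues=4
```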
With sharded Zarrs, we want to read many small chunks from a few large files. We might even want to read on the order of a million chunks per second.
The spec sheets for modern SSDs claim they are capable of over a million IOPs (input/output operations per second). This is pretty exciting for folks (like me) who want to read random crops from sharded Zarr arrays whilst training ML models.
Yet, most online reviews of SSDs show pretty miserable random read performance. For example, this CrystalDiskMark x64 result for the fastest consumer PCIe Gen5 SSD currently available, the Crucial T700, shows that the T700 is capable of 12.3 GB/s for sequential reads, but only 0.9 GB/s for random reads.
Why this discrepancy?
It turns out that, internally, SSDs are parallel devices. The controller might have 4 or 8 channels to the storage chips, and the controller can use those channels in parallel. To achieve anything close to a million IOPs, you need a few essential ingredients: 1) a long queue length (so the SSD's controller has a chance to fully exploit its internal parallelism), 2) an IO stack which is as efficient as possible, and 3) (possibly) multiple threads reading in parallel.
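To put a rough number on ingredient 1: by Little's law, requests in flight ≈ IOPs × per-request latency. If we assume a random-read latency on the order of 100 µs (an assumed ballpark, not a measurement from this thread), then 1,000,000 IOPs requires about 1,000,000 × 0.0001 s = 100 requests in flight at all times, far beyond the queue depth of 1 that a naive synchronous read() loop achieves.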
On this thread, I'll try to write up some of my findings from benchmarking the fastest SSD I have access to, using access patterns vaguely similar to those we might see when reading sharded Zarrs.
## Hardware

## Software

`fio` (and a custom IPython notebook)

## Baseline `fio` configuration

Unless otherwise specified, the `fio` config is:
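(The exact baseline job file isn't reproduced in this write-up; the following is an illustrative sketch of an io_uring random-read job consistent with the parameters discussed above, with placeholder paths and values.)

```ini
; Illustrative only -- not the exact baseline config.
[global]
ioengine=io_uring
direct=1
rw=randread
bs=4k
iodepth=32
time_based
runtime=30

[randread-test]
filename=/mnt/nvme/testfile
size=10G
```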
## Benchmark results

A CSV of (almost) all the `fio` runs described below is here.

## Further reading
- `fio`'s manual