**Bandwidth & IOPs as a function of blocksize**

Here's one clear result: the FireCuda 530 is capable of a little over 3 GB/s even with a tiny blocksize of 4 kB:

*[plot: bandwidth as a function of blocksize]*

In terms of IOPs, I'm still not quite getting to the 1 million IOPs the spec sheet claims should be possible... but I'll keep trying!

I find this plot super-exciting:

*[plot]*

It says that, even when reading tiny chunks (4 kB or 8 kB) at random locations from a single file, we're still able to hit speeds of 3 GB/s (for 4 kB) and 5.8 GB/s (for 8 kB). This bodes very well for sharded Zarr! (And possibly for speeding up kerchunk, too.)
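(For context, bandwidth and IOPs are two views of the same measurement, linked by the blocksize: IOPs = bandwidth ÷ blocksize. At a 4 kB blocksize, 3 GB/s works out to roughly 3 × 10⁹ ÷ 4096 ≈ 730,000 IOPs, which is consistent with landing a little short of the 1 million IOPs headline figure.)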
**Interaction between IO queue depth & number of parallel threads**

To my surprise, […]

(I find this surprising because, if the kernel is doing all the hard work of the IO, then I'm not sure why it matters how many userspace threads there are...)

I don't know for sure, but I assume […]
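For anyone wanting to reproduce this: the two knobs in `fio` are `iodepth` (the IO queue depth per job) and `numjobs` (the number of parallel workers; add `--thread` to use threads rather than processes). A minimal sketch, with a placeholder filename rather than the exact config used here:

```bash
# Sketch: 4 parallel jobs, each keeping 32 random 4 kB reads in flight
# via io_uring, bypassing the page cache with O_DIRECT.
fio --name=qd-vs-threads \
    --filename=/mnt/nvme/testfile \
    --rw=randread --bs=4k --direct=1 \
    --ioengine=io_uring --iodepth=32 \
    --numjobs=4 --group_reporting \
    --runtime=30 --time_based --size=10G
```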
**Interaction between blocksize & number of parallel threads**

Having 4 parallel worker threads also helps across a range of blocksizes. This first plot is with an iodepth of 32:

*[plot: bandwidth vs. blocksize and thread count, iodepth=32]*

The benefit of having multiple threads is less pronounced with an iodepth of 128 (although, strangely, the max performance for a blocksize of 8 kB is reduced with an iodepth of 128!):

*[plot: bandwidth vs. blocksize and thread count, iodepth=128]*

Each bar in these two plots shows the mean across 6 runs. The order was randomised on each iteration.
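A sweep like the one behind these bars can be scripted along these lines (a simplified sketch: the path and parameter grid are placeholders, and it omits the randomised 6-fold repetition described above):

```bash
#!/usr/bin/env bash
# Sweep blocksize x number-of-jobs at a fixed iodepth,
# writing one JSON result file per run.
for bs in 4k 8k 16k 32k 64k 128k; do
  for jobs in 1 2 4; do
    fio --name="bs${bs}-j${jobs}" \
        --filename=/mnt/nvme/testfile \
        --rw=randread --bs="$bs" --direct=1 \
        --ioengine=io_uring --iodepth=32 \
        --numjobs="$jobs" --group_reporting \
        --runtime=10 --time_based --size=10G \
        --output-format=json --output="bs${bs}_j${jobs}.json"
  done
done
```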
**Effect of […]**
For IOPs at particularly small `bs`, `hipri` is key. What's often missed is that you need to configure nvme to have poll queues; if you don't, then you don't really get the benefit of polled IO. Verify by checking dmesg:

```
nvme nvme0: 8/0/4 default/read/poll queues
```

which for this box tells me I have 8 default queues and 4 poll queues. This can be set with `nvme.poll_queues=4` as a boot parameter if nvme is built in, otherwise `poll_queues=N` as a module option for nvme.
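Concretely, something like this (the modprobe.d path and initramfs command are common conventions, not from the comment above; adjust for your distro):

```bash
# Check how many poll queues the nvme driver created:
dmesg | grep -i 'poll queues'
# e.g.  nvme nvme0: 8/0/4 default/read/poll queues

# If nvme is a loadable module, persist the option:
echo 'options nvme poll_queues=4' | sudo tee /etc/modprobe.d/nvme-poll.conf
sudo update-initramfs -u   # Debian/Ubuntu; rebuild initramfs, then reboot

# If nvme is built into the kernel, add to the kernel command line instead:
#   nvme.poll_queues=4
```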
With sharded Zarrs, we want to read many small chunks from a few large files. We might even want to read on the order of a million chunks per second.
The spec sheets for modern SSDs claim they are capable of over a million IOPs (input/output operations per second). This is pretty exciting for folks (like me) who want to read random crops from sharded Zarr arrays whilst training ML models.
Yet, most online reviews of SSDs show pretty miserable random read performance. For example, this CrystalDiskMark x64 result for the fastest consumer PCIe Gen5 SSD currently available, the Crucial T700, shows that the T700 is capable of 12.3 GB/s for sequential reads, but only 0.9 GB/s for random reads.
Why this discrepancy?
It turns out that, internally, SSDs are parallel devices. The controller might have 4 or 8 channels to the storage chips, and the controller can use those channels in parallel. To achieve anything close to a million IOPs, you need a few essential ingredients: 1) a long queue length (so the SSD's controller has a chance to fully exploit its internal parallelism), 2) an IO stack which is as efficient as possible, and 3) (possibly) multiple threads reading in parallel.
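To put a rough number on ingredient 1: by Little's law, requests in flight ≈ IOPs × per-request latency. If we assume a random-read latency on the order of 100 µs (an assumed ballpark, not a measurement from this thread), then 1,000,000 IOPs requires about 1,000,000 × 0.0001 s = 100 requests in flight at all times, far beyond the queue depth of 1 that a naive synchronous read() loop achieves.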
On this thread, I'll try to write up some of my findings from benchmarking the fastest SSD I have access to, using access patterns vaguely similar to those we might see when reading sharded Zarrs.
## Hardware

## Software

`fio` (and a custom IPython notebook)

## Baseline `fio` configuration

Unless otherwise specified, the `fio` config is:
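(The exact baseline job file isn't reproduced in this write-up; the following is an illustrative sketch of an io_uring random-read job consistent with the parameters discussed above, with placeholder paths and values.)

```ini
; Illustrative only -- not the exact baseline config.
[global]
ioengine=io_uring
direct=1
rw=randread
bs=4k
iodepth=32
time_based
runtime=30

[randread-test]
filename=/mnt/nvme/testfile
size=10G
```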
## Benchmark results

A CSV of (almost) all the `fio` runs described below is here.

## Further reading
- `fio`'s manual