Writing to volblocksize=8k zvol is slow #824

Ukko-Ylijumala · 2012-07-12T15:26:45Z

Default zvol volblocksize is 8k. However, it also is dead slow on ZoL. I've been testing this for some time, and I've done the same tests over on OpenIndiana side, which doesn't exhibit the same slowness - for native ZFS the best volblocksize clearly is 8k.

Test setup: 8x mirror pairs of 600GB SAS 10k disks, LSI2008 HBA, 6GB SAS expander, 2x Xeon E5640, 48GB RAM (arc_max=28GB). Created 5x 200GB zvols, with volblocksize=128k/64k/32k/16k/8k. compression/dedup=off. Random write speeds to prefilled zvols (measured with fio 2.0.6) are like so:

128k: 602 MB/s
64k: 505 MB/s
32k: 484 MB/s
16k: 249 MB/s
8k: 47 MB/s

On OpenIndiana the other blocksizes give 150-200 MB/s while 8k blocksize zvol attains 250-400 MB/s. So the behaviour is essentially reversed. The same effect can also be seen with SATA disks on another machine.

Running zpool iostat -v alongside the write perf test (see below) at some point looks interesting; take note how there's a lot of reads going on at the same time, which doesn't happen with other volblocksize zvols. My guess would be these reads are what kills the performance. Noteworthy is that arcstat.pl shows there's maybe 5k reads/s also which are satisfied from ARC (metadata probably) all the time while the write operation goes on too, however those reads don't affect writes since they're from ARC.

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        2.66T  1.68T  3.21K  4.81K  9.20M  35.4M
  mirror     340G   216G    401    525  1.14M  3.87M
    sdab        -      -    160     43  1.03M  3.88M
    sdai        -      -    142     35   679K  3.88M
  mirror     340G   216G    417    574  1.17M  4.21M
    sdah        -      -    141     37  1.16M  4.21M
    sdag        -      -    153     37   720K  4.21M
  mirror     341G   215G    414    689  1.16M  4.88M
    sdaf        -      -    137     65   771K  4.88M
    sdae        -      -    156     63  1.02M  4.88M
  mirror     340G   216G    410    714  1.15M  4.98M
    sdad        -      -    162    509   937K  4.98M
    sdac        -      -    151    508   819K  4.98M
  mirror     340G   216G    406    675  1.14M  4.73M
    sdam        -      -    141     81   818K  4.73M
    sdal        -      -    156     80   947K  4.73M
  mirror     341G   215G    387    678  1.07M  4.91M
    sdak        -      -    145    507   989K  4.91M
    sdaj        -      -    143    507   668K  4.91M
  mirror     340G   216G    420    607  1.16M  4.44M
    sdaq        -      -    159     43   803K  4.44M
    sdap        -      -    155     41   925K  4.44M
  mirror     341G   215G    426    463  1.20M  3.39M
    sdao        -      -    168     29   907K  3.39M
    sdan        -      -    151     34   849K  3.39M
logs            -      -      -      -      -      -
  sdb        656K  7.37G      0      0      0      0
cache           -      -      -      -      -      -
  sdar       112G  14.1M     39     81   640K  9.01M
----------  -----  -----  -----  -----  -----  -----

This particular test gave these numbers after fio had ran all the way through the zvol writing 200GB data:

/dev/zvol/tank/test-8k: (g=0): rw=randwrite, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=32
fio 2.0.6
Starting 1 process

/dev/zvol/tank/test-8k: (groupid=0, jobs=1): err= 0: pid=5957
  write: io=204800MB, bw=19229KB/s, iops=2403 , runt=10906424msec
    slat (usec): min=2 , max=35835 , avg= 9.50, stdev=24.83
    clat (usec): min=77 , max=3049.1K, avg=13287.25, stdev=113700.04
     lat (usec): min=81 , max=3049.1K, avg=13297.09, stdev=113700.19
    clat percentiles (usec):
     |  1.00th=[  159],  5.00th=[  187], 10.00th=[  203], 20.00th=[  229],
     | 30.00th=[  258], 40.00th=[  294], 50.00th=[  338], 60.00th=[  406],
     | 70.00th=[  532], 80.00th=[ 2224], 90.00th=[19840], 95.00th=[20096],
     | 99.00th=[62208], 99.50th=[987136], 99.90th=[1761280], 99.95th=[1941504],
     | 99.99th=[2244608]
    bw (KB/s)  : min=   12, max=233142, per=100.00%, avg=29291.35, stdev=33165.31
    lat (usec) : 100=0.01%, 250=27.37%, 500=40.90%, 750=7.07%, 1000=1.91%
    lat (msec) : 2=2.45%, 4=1.92%, 10=2.13%, 20=8.04%, 50=7.12%
    lat (msec) : 100=0.21%, 250=0.09%, 500=0.05%, 750=0.08%, 1000=0.17%
    lat (msec) : 2000=0.45%, >=2000=0.04%
  cpu          : usr=1.50%, sys=2.94%, ctx=4557272, majf=0, minf=324
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=26214400/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=204800MB, aggrb=19228KB/s, minb=19690KB/s, maxb=19690KB/s, mint=10906424msec, maxt=10906424msec

Another point worth noticing is that while arc_max=28GB, the real memory usage near the end of the test is 41GB. Looks like slab fragmentation affects this test in particular, though I can't say for sure. I find it pretty interesting that 8k is such a pathological case for ZoL while it's the best performing for native ZFS. Also, take note that there's a 120GB SSD as cache so metadata should all be served from RAM or flash; however that might not be the case here.

Another point is that these tests are being ran with zvol_threads=16 (numcpus), for in my testing using 32 zvol threads leads to much more context switching and around 20% slower performance in both sequential and random write throughput.

Comments and suggestions welcome; I'd love to see more performance tuning in the future. The basic code stability has been fine after the VM tweaks. This test rig has been chugging along without issues for 50 days now.

dechamps · 2012-07-12T19:17:55Z

I'm experiencing a similar issue. At first I thought I was an isolated case, but seeing that you too have poor performance with volblocksize=8k, I guess I'm not.

As a related note, try playing with /sys/block/zdXX/queue/nr_requests: I have much better performance with small numbers (like 8). I have no idea why, but I plan to investigate eventually. Note, however, that making the queue so small will undoubtedly destroy synchronous write performance, so the best value depends on your work load. Let me know your results.

dechamps · 2012-07-12T19:22:13Z

Oh, wait.

Random write speeds to prefilled zvols

Have you tried benchmarking empty zvols? If you're experiencing much better performance with empty zvols versus filled zvols, you're definitely hitting #361. That completely explains the reading activity you're describing.

Ukko-Ylijumala · 2012-07-12T21:27:17Z

Have you tried benchmarking empty zvols? If you're experiencing much better performance with empty zvols versus filled zvols, you're definitely hitting #361. That completely explains the reading activity you're describing.

I can exhibit that behaviour, yes, if I write/rewrite/random rewrite with io_blocksize < volblocksize. However, my examples here have been ran with io_blocksize = volblocksize for simplicity's sake. Only volblocksize=8k zvol exhibits poor performance across all IO blocksizes (and only on ZoL); the rest only degrade when io_blocksize < volblocksize - which definitely is because of #361.

Here's a more comprehensive look at the random rewrite case; same system, same setup: http://pastebin.com/ksHqed4e

As a related note, try playing with /sys/block/zdXX/queue/nr_requests: I have much better performance with small numbers (like 8).

Thanks, I'll incorporate that into my next test pass. I suspect however that the answer to this particular riddle will mostly be found within code.

Ukko-Ylijumala · 2012-07-12T21:50:23Z

Another point is that these tests are being ran with zvol_threads=16 (numcpus), for in my testing using 32 zvol threads leads to much more context switching and around 20% slower performance in both sequential and random write throughput.

This part of my testing is noteworthy; I suspect that with systems where 12 < core count < 32 the optimal number of zvol threads is numcpus(), not the hardcoded 32 from issue #567.

dechamps · 2012-07-12T22:03:09Z

This part of my testing is noteworthy; I suspect that with systems where 12 < core count < 32 the optimal number of zvol threads is numcpus(), not the hardcoded 32 from issue #567.

It's not so simple. With normal writes and a high throughput pool, a high zvol_threads might indeed be detrimental due to context switches, and even then it depends on other factors (for example I don't have this problem when pushing 750 MB/s to a single zvol - but I have a faster CPU).

With synchronous writes however, a low zvol_threads value will without a single doubt destroy performance, and the reason why is clearly explained in #567. That's because in the synchronous write case, zvol threads are actually blocking on each write (I/O bound), which is not the case with normal writes (CPU bound). In fact, with some heavy synchronous write workloads on large pools, zvol_threads=32 might not even be enough.

So until we can come up with a better solution, we have to choose some middle ground, and it can't be numcpus() because on small systems (e.g. 2 CPUs) synchronous writes would be way too low. In the end, that's why zvol_threads is a tunable parameter: it is meant to be tuned like you did, until one can figure out how to make it auto-tune itself.

Ukko-Ylijumala · 2012-07-12T22:05:23Z

Oh, another little tidbit. When I observe the behaviour of zvol threads with htop during these write tests, I see that with higher IO blocksizes and volblocksizes (128k, 64k) the threads are almost all the time in "R" state, and every CPU is almost completely in use (90%+). When blocksizes gets smaller, the zvol threads spend a progressively larger time in "D" state and system load is a lot lower (around 40%).

Ukko-Ylijumala · 2012-07-13T12:19:01Z

After more testing, the following tweaks have had no noticeable impact:

/sys/block/zdXX/queue/nr_requests changed to 8 and 16
/dev/zdXX elevator changed from deadline to noop
/sys/block/zdXX/queue/read_ahead_kb changed from 128 to 0 to 1024

Changing the zvol primarycache from all to metadata worsened the situation. Adding another 120GB SSD as L2ARC didn't help.

Anyone familiar with the code who could chime in why 8k is such a bad case? And mostly, why there's a need to read from disk when we're (re)writing with optimal blocksize (ie. aligned)...

ryao · 2012-10-17T21:07:58Z

Is this still an issue in HEAD? multiple performance fixes have been merged.

Ukko-Ylijumala · 2012-10-18T04:47:47Z

I haven't looked at this issue for a while. I'll need to fire up a test during the weekend -- I'll post back with results.

behlendorf · 2012-11-10T03:58:21Z

@shapemaker Can you rerun your tests using the latest ZoL master source from git. I just merged several significant memory management improvements which may help for small block sized. @cwedgwood has also recently done some testing with zvols and may have some insights in to the expected performance.

Ukko-Ylijumala · 2012-11-10T04:13:38Z

Been sick for nearly 2 weeks in the meantime. Will run another test pass during next week and see how it goes.

ProTip · 2013-05-05T11:27:06Z

Hi guys, I have been testing ZFS on 1TB velociraptors all weekend. Unfortunately I'm seeing it under-utilize the drives(only 62% of what MD can do) when testing with iozone.

Anyway, I have been benchmarking the zvol's at different block sizes as well. I have just done some testing with a module I compiled from todays code. Take a look at this uneven disk usage with 8k blocks during writes:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 2.20 0.00 2646.60 0.00 80041.00 60.49 4.10 1.54 0.00 1.54 0.24 63.28
sdb 0.00 3.60 2.00 2562.80 8.00 79567.30 62.05 4.20 1.62 4.80 1.62 0.26 65.84
sdc 0.00 3.40 0.20 2537.20 0.80 133828.90 105.49 6.79 2.66 24.00 2.66 0.33 84.72
sdd 0.00 2.20 0.00 2650.80 0.00 79217.70 59.77 4.09 1.53 0.00 1.53 0.23 61.92
sde 0.00 5.40 1.40 2648.80 5.60 83592.80 63.09 3.95 1.49 4.00 1.49 0.24 63.68
sdf 0.00 1.20 0.20 2604.00 0.80 139591.20 107.21 6.47 2.48 4.00 2.48 0.32 83.68

And during reads:
evice: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 17.20 0.00 1632.40 0.00 22673.60 0.00 27.78 2.11 1.30 1.30 0.00 0.32 52.16
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 20.40 0.00 1584.00 0.00 22372.00 0.00 28.25 2.30 1.46 1.46 0.00 0.37 58.40
sdd 19.60 0.00 1687.60 0.00 22271.20 0.00 26.39 2.23 1.32 1.32 0.00 0.32 54.16
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 17.20 0.00 1585.80 0.00 22683.20 0.00 28.61 2.19 1.38 1.38 0.00 0.35 55.12

I don't really start seeing better performance till 32k-64k zvol block sizes. Reads are considerably worse than writes. With 64K block sizes the disk utilization is very even and performance way up but still a ways off from MD. This is with a RAIDZ-2 though performance was pretty bad at lower block levels with a RAID-10 as well. I can't get the disk usage right now because I just lost connection to the office, but I was seeing bad performance with iozone on ZFS when reducing the record size down from 128K as well.

ProTip · 2013-05-05T15:17:56Z

It just struck me that those iostat samples may be from different iozone runs. In each though there are two disks that are wildly different. During writes two disks are getting heavily hit and in that read samples(5 seconds) two disks are not read from at all. I would expect to see this if the two parities are being written to the same two disks for a large number of blocks? I'll see if they are related in the same run tomorrow.

ProTip · 2013-05-06T01:49:09Z

Hmmm, here are the results from the same iozone run. Sample during writes:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 1.20 0.40 1838.40 1.60 116603.20 126.83 6.82 3.70 6.00 3.70 0.43 79.52
sdb 0.00 0.00 1.60 1919.20 6.40 79756.00 83.05 4.90 2.55 0.50 2.56 0.33 64.24
sdc 0.00 2.60 25.60 1723.40 102.40 79598.40 91.14 5.25 2.95 8.25 2.87 0.41 72.32
sdd 0.00 0.60 0.60 1994.40 2.40 116651.20 116.95 6.59 3.30 4.00 3.30 0.41 82.40
sde 0.00 0.00 0.20 1973.40 0.80 80723.20 81.80 4.92 2.50 0.00 2.50 0.34 66.64
sdf 0.00 6.20 26.00 1773.80 104.00 80741.60 89.84 5.14 2.85 6.68 2.79 0.39 70.80

You can see sda and sdd are getting smoked. Here is a sample when it switches over to reads:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 16.00 4.60 1195.00 142.40 14251.20 14566.40 43.09 2.66 1.99 1.33 7.48 0.44 58.48
sdb 0.00 1.20 0.60 362.40 2.40 14566.40 80.27 1.00 2.76 10.67 2.75 0.32 11.52
sdc 14.80 6.20 1229.20 247.60 13840.00 14564.00 38.47 2.58 1.73 1.27 4.01 0.42 61.44
sdd 13.60 3.40 1224.80 190.20 13756.00 14559.20 40.02 2.51 1.77 1.21 5.37 0.39 55.84
sde 0.00 0.00 1.20 38.20 12.00 164.80 8.97 0.04 1.12 4.00 1.03 0.49 1.92
sdf 13.80 3.60 1237.60 34.40 14242.40 169.60 22.66 1.97 1.55 1.40 6.93 0.45 56.80

That's right when it kicks over so it looks like the writes are still getting sent to disk. But you can see sdb and sde are not really getting reads. Now with 64K blocks it is smoothed out and quite fast:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 1.20 1.60 1295.20 25.60 112031.20 172.82 4.96 3.82 9.00 3.82 0.53 68.80
sdb 0.00 1.20 1.60 1298.80 25.60 111244.00 171.13 5.06 3.88 7.50 3.87 0.54 70.48
sdc 0.00 9.80 7.60 1403.60 116.80 111947.20 158.82 5.29 3.75 3.58 3.75 0.54 76.48
sdd 0.00 6.00 5.60 1269.80 89.60 111766.40 175.41 5.29 4.15 2.43 4.16 0.58 73.44
sde 0.00 5.80 5.60 1306.00 89.60 112013.60 170.94 5.27 4.02 2.57 4.03 0.57 74.48
sdf 0.00 5.80 7.40 1288.20 116.00 111768.80 172.72 4.97 3.83 2.16 3.84 0.54 70.16

That's actually the peak, these disks hit about 210MB/s with MD:/

Ukko-Ylijumala · 2014-03-28T09:02:54Z

I had a chance to replicate the test setup I had quite closely. It seems that this 8k volblocksize issue is gone in the latest master so this can be closed.

ARC space utilisation is still a problem with 8k zvols, but that is a separate matter.

"Thank you" to everyone who has contributed in the meantime for fixing this issue :)

In `zcache iostat -l`, the buckets are labeled starting at 1024ns (and doubling from there), when in reality the first bucket is 1000ns (1us exactly).

Ukko-Ylijumala mentioned this issue Jul 13, 2012

Unloading zfs stack causes a BUG #828

Closed

Ukko-Ylijumala mentioned this issue Sep 20, 2013

Merge Spectra Logic's optimizations to the DMU #1738

Closed

Ukko-Ylijumala closed this as completed Mar 28, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Writing to volblocksize=8k zvol is slow #824

Writing to volblocksize=8k zvol is slow #824

Ukko-Ylijumala commented Jul 12, 2012

dechamps commented Jul 12, 2012

dechamps commented Jul 12, 2012

Ukko-Ylijumala commented Jul 12, 2012

Ukko-Ylijumala commented Jul 12, 2012

dechamps commented Jul 12, 2012

Ukko-Ylijumala commented Jul 12, 2012

Ukko-Ylijumala commented Jul 13, 2012

ryao commented Oct 17, 2012

Ukko-Ylijumala commented Oct 18, 2012

behlendorf commented Nov 10, 2012

Ukko-Ylijumala commented Nov 10, 2012

ProTip commented May 5, 2013

ProTip commented May 5, 2013

ProTip commented May 6, 2013

Ukko-Ylijumala commented Mar 28, 2014