Subpar single stream synchronous write performance with SSD based SLOG #3528
Since the WRITE_SYNC and READ_SYNC flags are among several factors considered when processing bio requests, it seems prudent to mark all zio requests of synchronous priority with the READ_SYNC/WRITE_SYNC flags, so that the Linux block I/O layer can treat them as synchronous. However, this also needs a bit more testing under various workloads.
I have a very simple fix for this that I am still testing under various workloads. I have not seen any adverse effects from the change, while the improvements for synchronous writes are pronounced.
Forgot to mention: this is observed on CentOS 6.6 (2.6.32-504.23.4.el6.x86_64). The SLOG device parameter /sys/block/[devname]/queue/rotational is properly set to zero, and the scheduler in /sys/block/[devname]/queue/scheduler is set to [none] in this case.
On 2015-06-25 17:02, Boris Protopopov wrote:
Hmmm, I wonder if this is what I am seeing. I have an Intel S3700 as
@bprotopopov very interesting! The fix you're proposing should be entirely safe. It would be great if you could open a pull request with the patch. I'd love to see your performance numbers too.
Sure. So, 180 MB/s is pretty good for a single stream (unless it is not a single stream, of course). I don't think you can get much more unless you write to different datasets (which use different ZILs), but then it is no longer a single stream. Anyway :) you can run a block trace on your SLOG device as follows (taken straight from the blkparse man page):

blktrace -d /dev/'dev name' -o - | blkparse -i - > /tmp/slog.trace

where 'dev name' is your SLOG device name, which can be obtained from zpool status -v 'pool name'. If it is a 'persistent name' (not sdX), e.g. an id-, wwn-, or path-based name, use /dev/disk/{by-id,by-path}/'dev name' instead of /dev/sdX.

Then examine /tmp/slog.trace and check whether the operations are marked W (writes) rather than WS (synchronous writes), and whether the trace contains UT (unplug by timer) events. If I/O completion times (the time between I, insert, and C, completion) are dominated by the UT delay, this is the same issue I have described.
Sure @behlendorf, I have created a pull request off the branch. The sync write test was a simple fio job running sequential synchronous I/O. I am going to run through more micro-benchmarks, hopefully within the next few days.
I would be happy to test as well...
I'm not sure I understand why 180MB/sec is a realistic maximum? This is a very high quality SSD and the IOPS going into it is well under the rated specs. I guess I will find out :) |
That's because of contention over the ZIL (a different issue, but it could be what you are dealing with). It is an implementation artifact (software), nothing to do with the hardware. If you'd like to experiment, you can provision a SLOG over a RAM disk and observe the "theoretical maximum" on your system :) Then try spreading the I/O across multiple datasets and see if the aggregate throughput grows.
|
Looks like you are right. A single stream gets the same basic write throughput. I never understood why multiple SLOG devices are used serially, which makes raid0 pointless. It would be nice to be able to throw multiple SLOG devices at the problem and get an improvement (I'm sure there is a reason, it just sucks...)
The fix was merged as b39c22b. |
When setting the flags that eventually wind up in bio->bi_rw, vdev_disk.c:vdev_disk_io_start() does not distinguish between synchronous and asynchronous operations. While this does not lead to incorrect behavior, it does result in subtle but important differences in the way the Linux block device layer schedules the resulting I/O requests.
Specifically, when writing to SSD-based SLOGs, the block I/O layer delays completion of write I/O requests until the queue is unplugged by timer (3 msec by default), which appears to result in 2-3 millisecond latencies for writes to the SSD. This, in turn, impacts single stream synchronous write performance.
Here is a block trace for a SLOG device that illustrates the problem:
...
130,96 0 1 0.000000000 0 C W 829696 + 136 [0]
130,96 0 2 0.002961512 0 C W 829952 + 136 [0]
130,96 0 3 0.008968390 0 C W 830464 + 136 [0]
130,96 0 4 0.011999161 0 C W 830720 + 136 [0]
130,96 0 5 0.023955549 0 C W 831744 + 136 [0]
130,96 0 6 0.024337663 19775 A W 832000 + 136 <- (130,97) 829952
130,96 0 7 0.024338823 19775 Q W 832000 + 136 [z_wr_iss/6]
130,96 0 8 0.024340523 19775 G W 832000 + 136 [z_wr_iss/6]
130,96 0 9 0.024343187 19775 P N [z_wr_iss/6]
130,96 0 10 0.024344120 19775 I W 832000 + 136 [z_wr_iss/6]
130,96 0 11 0.026784405 0 UT N [swapper] 1
130,96 0 12 0.026805339 202 U N [kblockd/0] 1
130,96 0 13 0.026807199 202 D W 832000 + 136 [kblockd/0]
130,96 0 14 0.026966948 0 C W 832000 + 136 [0]
130,96 3 1 0.000449358 19788 A W 829952 + 136 <- (130,97) 827904
130,96 3 2 0.000450951 19788 Q W 829952 + 136 [z_wr_iss/19]
130,96 3 3 0.000453212 19788 G W 829952 + 136 [z_wr_iss/19]
130,96 3 4 0.000455956 19788 P N [z_wr_iss/19]
130,96 3 5 0.000457076 19788 I W 829952 + 136 [z_wr_iss/19]
130,96 3 6 0.002786349 0 UT N [swapper] 1
130,96 3 7 0.002813434 205 U N [kblockd/3] 1
...
Here, 130,97 is the partition created when adding the SLOG device to the pool, while 130,96 is the base device. One can see that the writes to the SLOG are not synchronous (the S is missing next to the W), and the UT (unplug by timer) events slow down the I/O response times.
When one sets the WRITE_SYNC flag in vdev_disk_io_start() based on zio->io_priority, the block trace for the SLOG looks as follows:
...
130,96 4 1 0.000000000 70714 A WS 4280576 + 136 <- (130,97) 4278528
130,96 4 2 0.000000832 70714 Q WS 4280576 + 136 [(null)]
130,96 4 3 0.000002109 70714 G WS 4280576 + 136 [(null)]
130,96 4 4 0.000003394 70714 P N [(null)]
130,96 4 5 0.000003846 70714 I WS 4280576 + 136 [(null)]
130,96 4 6 0.000004854 70714 D WS 4280576 + 136 [(null)]
130,96 5 1 0.000354487 70713 A WS 4280832 + 136 <- (130,97) 4278784
130,96 5 2 0.000355072 70713 Q WS 4280832 + 136 [(null)]
130,96 5 3 0.000356383 70713 G WS 4280832 + 136 [(null)]
130,96 5 4 0.000357635 70713 P N [(null)]
130,96 5 5 0.000358088 70713 I WS 4280832 + 136 [(null)]
130,96 5 6 0.000359191 70713 D WS 4280832 + 136 [(null)]
130,96 0 76 0.000159539 0 C WS 4280576 + 136 [0]
130,96 16 85 0.000742108 70718 A WS 4281088 + 136 <- (130,97) 4279040
130,96 16 86 0.000743197 70718 Q WS 4281088 + 136 [z_wr_iss/15]
130,96 16 87 0.000744450 70718 G WS 4281088 + 136 [z_wr_iss/15]
130,96 16 88 0.000745817 70718 P N [z_wr_iss/15]
130,96 16 89 0.000746705 70718 I WS 4281088 + 136 [z_wr_iss/15]
130,96 16 90 0.000747848 70718 D WS 4281088 + 136 [z_wr_iss/15]
130,96 0 77 0.000604063 0 C WS 4280832 + 136 [0]
130,96 0 78 0.000899858 0 C WS 4281088 + 136 [0]
...
One can see that the writes are synchronous, the completions are quicker, and no timers are involved.