Subpar single stream synchronous write performance with SSD based SLOG #3528
Since the WRITE_SYNC and READ_SYNC flags are among several factors considered when processing bio requests, it seems prudent to mark all zio requests of synchronous priority with the READ_SYNC/WRITE_SYNC flags, so that the Linux block I/O layer can treat them as synchronous. However, this also needs a bit more testing under various workloads.
I have a very simple fix for this that I am still testing under various workloads. I have not seen any adverse effects from the change, while the improvements for synchronous writes are pronounced.
Forgot to mention: this is observed on CentOS 6.6 (2.6.32-504.23.4.el6.x86_64). The SLOG device parameter /sys/block/[devname]/queue/rotational is properly set to zero, and the scheduler in /sys/block/[devname]/queue/scheduler is set to [none] in this case.
On 2015-06-25 17:02, Boris Protopopov wrote:
Hmmm, I wonder if this is what I am seeing. I have an Intel S3700 as
@bprotopopov very interesting! The fix you're proposing should be entirely safe. It would be great if you could open a pull request with the patch. I'd love to see your performance numbers too.
Sure. So, 180 MB/s is pretty good for a single stream (unless it is not a single stream, of course). I don't think you can get much more unless you write to different datasets (which use different ZILs), but then it is no longer a single stream. Anyway :) you can run a block trace on your SLOG device as follows (taken straight from the blkparse man page):

blktrace -d /dev/'dev name' -o - | blkparse -i - > /tmp/slog.trace

where 'dev name' is your SLOG device name, which can be obtained from zpool status -v 'pool name'. If it is a 'persistent name' (not sdX), e.g. an id-, wwn-, or path-based name, use /dev/disk/{by-id,by-path}/'dev name' instead of /dev/sdX.

Then examine /tmp/slog.trace and check whether the operations are marked W (writes) rather than WS (synchronous writes), and whether the trace contains UT (unplug by timer) events. If I/O completion times (the time between I, insert, and C, completion) are dominated by the UT delay, this is the same issue I have described.
Sure @behlendorf, I have created a pull request off the branch. The sync write test was a simple fio job running sequential synchronous I/O. I am going to run through more micro-benchmarks, hopefully within the next few days.
I would be happy to test as well...
I'm not sure I understand why 180MB/sec is a realistic maximum? This is a very high quality SSD and the IOPS going into it is well under the rated specs. I guess I will find out :) |
That's because of contention over the ZIL (a different issue, but it could be what you are dealing with). It is an implementation artifact (software), nothing to do with the hardware. If you'd like to experiment, you can provision a SLOG over a RAM disk and observe the "theoretical maximum" on your system :) Then try spreading the I/O across multiple datasets and see if the aggregate throughput grows.
|
Looks like you are right. A single stream gets the same basic write throughput. I never understood why multiple SLOG devices are used serially, which makes raid0 pointless. It would be nice to be able to throw multiple SLOG devices at the problem and get an improvement (I'm sure there is a reason, it just sucks...)
The fix was merged as b39c22b. |
When setting the flags that eventually wind up in bio->bi_rw, vdev_disk.c:vdev_disk_io_start() does not distinguish between synchronous and asynchronous operations. While this does not lead to incorrect behavior, it does result in subtle but important differences in the way the Linux block device layer schedules the resulting I/O requests.
Specifically, when writing to SSD-based SLOGs, the block I/O layer delays completion of write I/O requests until the queue is unplugged by timer (3 msec by default), which appears to result in 2-3 millisecond latencies for writes to the SSD. This, in turn, impacts single stream synchronous write performance.
Here is a block trace for a SLOG device that illustrates the problem:
...
130,96 0 1 0.000000000 0 C W 829696 + 136 [0]
130,96 0 2 0.002961512 0 C W 829952 + 136 [0]
130,96 0 3 0.008968390 0 C W 830464 + 136 [0]
130,96 0 4 0.011999161 0 C W 830720 + 136 [0]
130,96 0 5 0.023955549 0 C W 831744 + 136 [0]
130,96 0 6 0.024337663 19775 A W 832000 + 136 <- (130,97) 829952
130,96 0 7 0.024338823 19775 Q W 832000 + 136 [z_wr_iss/6]
130,96 0 8 0.024340523 19775 G W 832000 + 136 [z_wr_iss/6]
130,96 0 9 0.024343187 19775 P N [z_wr_iss/6]
130,96 0 10 0.024344120 19775 I W 832000 + 136 [z_wr_iss/6]
130,96 0 11 0.026784405 0 UT N [swapper] 1
130,96 0 12 0.026805339 202 U N [kblockd/0] 1
130,96 0 13 0.026807199 202 D W 832000 + 136 [kblockd/0]
130,96 0 14 0.026966948 0 C W 832000 + 136 [0]
130,96 3 1 0.000449358 19788 A W 829952 + 136 <- (130,97) 827904
130,96 3 2 0.000450951 19788 Q W 829952 + 136 [z_wr_iss/19]
130,96 3 3 0.000453212 19788 G W 829952 + 136 [z_wr_iss/19]
130,96 3 4 0.000455956 19788 P N [z_wr_iss/19]
130,96 3 5 0.000457076 19788 I W 829952 + 136 [z_wr_iss/19]
130,96 3 6 0.002786349 0 UT N [swapper] 1
130,96 3 7 0.002813434 205 U N [kblockd/3] 1
...
Here, 130,97 is the partition created when adding the SLOG device to the pool, while 130,96 is the base device. One can see that the writes to the SLOG are not synchronous (the S is missing next to the W), and the UT (unplug by timer) events slow down the I/O response times.
When one sets the WRITE_SYNC flag in vdev_disk_io_start() based on zio->io_priority, the block trace for the SLOG looks as follows:
...
130,96 4 1 0.000000000 70714 A WS 4280576 + 136 <- (130,97) 4278528
130,96 4 2 0.000000832 70714 Q WS 4280576 + 136 [(null)]
130,96 4 3 0.000002109 70714 G WS 4280576 + 136 [(null)]
130,96 4 4 0.000003394 70714 P N [(null)]
130,96 4 5 0.000003846 70714 I WS 4280576 + 136 [(null)]
130,96 4 6 0.000004854 70714 D WS 4280576 + 136 [(null)]
130,96 5 1 0.000354487 70713 A WS 4280832 + 136 <- (130,97) 4278784
130,96 5 2 0.000355072 70713 Q WS 4280832 + 136 [(null)]
130,96 5 3 0.000356383 70713 G WS 4280832 + 136 [(null)]
130,96 5 4 0.000357635 70713 P N [(null)]
130,96 5 5 0.000358088 70713 I WS 4280832 + 136 [(null)]
130,96 5 6 0.000359191 70713 D WS 4280832 + 136 [(null)]
130,96 0 76 0.000159539 0 C WS 4280576 + 136 [0]
130,96 16 85 0.000742108 70718 A WS 4281088 + 136 <- (130,97) 4279040
130,96 16 86 0.000743197 70718 Q WS 4281088 + 136 [z_wr_iss/15]
130,96 16 87 0.000744450 70718 G WS 4281088 + 136 [z_wr_iss/15]
130,96 16 88 0.000745817 70718 P N [z_wr_iss/15]
130,96 16 89 0.000746705 70718 I WS 4281088 + 136 [z_wr_iss/15]
130,96 16 90 0.000747848 70718 D WS 4281088 + 136 [z_wr_iss/15]
130,96 0 77 0.000604063 0 C WS 4280832 + 136 [0]
130,96 0 78 0.000899858 0 C WS 4281088 + 136 [0]
...
One can see that the writes are synchronous, the completions are quicker, and no timers are involved.