[WIP] Fix #2147: mostly zero-copy writes and reads in IOStream #2166
Conversation
1aa6093 to 76e5e5d (compare)
What are you using to benchmark the small-messages case? Is the 10-20% performance hit after optimization? I'm curious if we can reduce this to make a change like this more palatable to Tornado devs.
I'm using those two scripts and running them with
Optimizing this in pure Python looks difficult, as there are so many different ways of reading from an IOStream. Perhaps implementing the StreamBuffer class in C could gain enough speed to make up for the performance hit.
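For readers following along, the segmented-buffer idea under discussion can be sketched in pure Python. Everything below (the class name, method names, and internal layout) is a hypothetical illustration of the technique, not Tornado's actual StreamBuffer:

```python
from collections import deque


class StreamBufferSketch:
    """Hypothetical sketch of a zero-copy read buffer: incoming chunks
    are kept whole in a deque, and advancing the read position re-slices
    a memoryview instead of copying bytes."""

    def __init__(self):
        self._chunks = deque()  # each entry is a memoryview
        self._size = 0

    def __len__(self):
        return self._size

    def append(self, data):
        # Wrapping bytes in a memoryview does not copy them.
        self._chunks.append(memoryview(data))
        self._size += len(data)

    def peek(self, n):
        # Return up to n bytes from the front without consuming them.
        # Only the final bytes() conversion copies, and only as much
        # as the caller asked for.
        out = []
        remaining = n
        for chunk in self._chunks:
            if remaining <= 0:
                break
            part = chunk[:remaining]  # memoryview slice, no copy
            out.append(part)
            remaining -= len(part)
        return b"".join(bytes(mv) for mv in out)

    def advance(self, n):
        # Drop n bytes from the front; a partially consumed chunk is
        # re-sliced (again a memoryview slice, not a copy).
        assert n <= self._size
        self._size -= n
        while n > 0:
            head = self._chunks[0]
            if len(head) <= n:
                n -= len(head)
                self._chunks.popleft()
            else:
                self._chunks[0] = head[n:]
                n = 0
```

The trade-off this thread is circling: appends and advances become O(1) and copy-free, but operations that must scan across chunk boundaries (like `read_until`) lose the convenience of one contiguous buffer.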
Also cc @ajdavis
@bdarnell what is your tolerance for slowdown in the small-message case, and what is your tolerance for C? Also, do you have benchmarks we can work against that you consider representative of Tornado's common-case workload?
It occurs to me that I could post a smaller PR with only the write() changes, which would probably minimize the slowdown on small messages. |
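For context, the write-side half of a change like this can be sketched briefly. The function name and buffer layout below are assumptions for illustration, not the PR's actual code: the point is that a partial send re-slices a memoryview rather than copying the unsent remainder into a fresh buffer.

```python
import socket
from collections import deque


def write_pending(sock: socket.socket, pending: deque) -> None:
    """Hypothetical sketch of a zero-copy write path: pending chunks
    stay in a deque, and a partial send keeps the unsent tail as a
    memoryview slice instead of copying it."""
    while pending:
        head = pending[0]
        try:
            sent = sock.send(head)
        except BlockingIOError:
            return  # socket buffer full; retry on the next write event
        if sent < len(head):
            # Re-slice rather than copy the unsent tail.
            pending[0] = memoryview(head)[sent:]
            return
        pending.popleft()
```

This is why splitting the write changes into their own PR is low-risk for small messages: small writes usually complete in one `send()`, so the slicing path is rarely taken.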
My current policy for Tornado is that the use of C must be optional, because Mac and Windows users often don't have C tools installed (we do publish wheels now for Python 3.5+ on Windows; maybe it's time to build out a more comprehensive wheel-building pipeline). As a consequence, any use of C must be kept small so there's not much duplication (and room for bugs) between C and the pure-Python fallback. Cython is an interesting possibility for generating a fast implementation from (annotated) Python; it should also help avoid the wild-pointer bugs that can plague C code.

I care more about HTTP performance than raw TCP. The only benchmark I really have (which isn't great) is

If we're going to have to keep the existing implementation for Python 2, then maybe we could also punt on the problem and expose an option to optimize for either large or small messages.
Thanks. I get 1890 req/s on master, and 1820 req/s on this PR. That's looking like a 4% slowdown, which is better than I thought.
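A quick check of that arithmetic from the quoted req/s numbers:

```python
# Sanity-checking the "4% slowdown" from the raw throughput figures.
baseline = 1890  # req/s on master
patched = 1820   # req/s on this PR
slowdown = (baseline - patched) / baseline
print(f"{slowdown:.1%}")  # prints 3.7%, i.e. roughly the quoted 4%
```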
I'm not sure what you mean by that. Do you only care about performance on Python 2, or are you implying that Python 3-only code could be made significantly more efficient?
I was referring to the fact that USE_STREAM_BUFFER_FOR_READS is (currently) set only for py3. If we're going to have two implementations and the ability to switch between them, we could expose that setting to the application instead of deciding based on the python version. (I think I'd rather find a workaround to make StreamBuffer work on py2 so we can only have one, but if we're going to keep both and they have different performance characteristics, we might as well take advantage of it). I have historically focused on py2 for performance measurements, but it's time now to make py3 the priority.
I've created #2169 with only the write-specific changes. This will make the code easier to review and minimizes the potential slowdown. |
Well, I think the read-specific changes are the most contentious part of this PR, as they make

I'm also starting to wonder whether, instead of changing the buffering logic for all reads, we should instead expose a
Improves performance by 45% on the following benchmark: #2147 (comment)
Caveats:

- `read_until` and `read_until_regex` can be 10 to 20% slower

If we implement the core of StreamBuffer in C, we can probably make performance even better.
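To illustrate the `read_until` caveat: with a segmented buffer, the delimiter can straddle a chunk boundary, so besides scanning each chunk the search must also stitch a small window across every boundary. The helper below is a hypothetical sketch of that extra work, not Tornado's implementation:

```python
def find_delimiter(chunks, delim):
    """Hypothetical illustration of why read_until slows down over a
    segmented buffer: in addition to searching inside each chunk, we
    must build and search a small stitched window (tail of the previous
    chunk + head of the current one) at every chunk boundary."""
    assert delim
    offset = 0
    prev_tail = b""
    for chunk in chunks:
        data = bytes(chunk)
        # Check the boundary window for a straddling match.
        if prev_tail:
            window = prev_tail + data[:len(delim) - 1]
            pos = window.find(delim)
            if pos != -1:
                return offset - len(prev_tail) + pos
        # Check inside this chunk.
        pos = data.find(delim)
        if pos != -1:
            return offset + pos
        # Carry the last len(delim)-1 bytes to the next boundary check.
        prev_tail = data[-(len(delim) - 1):] if len(delim) > 1 else b""
        offset += len(data)
    return -1
```

A contiguous buffer needs none of this bookkeeping, which is one plausible source of the 10-20% regression; moving this inner loop to C is the mitigation the description suggests.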