[WIP] Fix #2147: mostly zero-copy writes and reads in IOStream #2166
Conversation
1aa6093 to 76e5e5d (compare)
What are you using to benchmark the small-messages case? Is the 10-20% performance hit after optimization? I'm curious if we can reduce this to make a change like this more palatable to Tornado devs.
I'm using those two scripts and running them with
Optimizing this in pure Python looks difficult, as there are so many different ways of reading from an IOStream. Perhaps implementing the StreamBuffer class in C could gain enough speed to make up for the performance hit.
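For readers following along, the segmented-buffer idea under discussion can be sketched in pure Python. Everything below (the class name, method names, and internal layout) is a hypothetical illustration of the technique, not Tornado's actual StreamBuffer:

```python
from collections import deque


class StreamBufferSketch:
    """Hypothetical sketch of a zero-copy read buffer: incoming chunks
    are kept whole in a deque, and advancing the read position re-slices
    a memoryview instead of copying bytes."""

    def __init__(self):
        self._chunks = deque()  # each entry is a memoryview
        self._size = 0

    def __len__(self):
        return self._size

    def append(self, data):
        # Wrapping bytes in a memoryview does not copy them.
        self._chunks.append(memoryview(data))
        self._size += len(data)

    def peek(self, n):
        # Return up to n bytes from the front without consuming them.
        # Only the final bytes() conversion copies, and only as much
        # as the caller asked for.
        out = []
        remaining = n
        for chunk in self._chunks:
            if remaining <= 0:
                break
            part = chunk[:remaining]  # memoryview slice, no copy
            out.append(part)
            remaining -= len(part)
        return b"".join(bytes(mv) for mv in out)

    def advance(self, n):
        # Drop n bytes from the front; a partially consumed chunk is
        # re-sliced (again a memoryview slice, not a copy).
        assert n <= self._size
        self._size -= n
        while n > 0:
            head = self._chunks[0]
            if len(head) <= n:
                n -= len(head)
                self._chunks.popleft()
            else:
                self._chunks[0] = head[n:]
                n = 0
```

The trade-off this thread is circling: appends and advances become O(1) and copy-free, but operations that must scan across chunk boundaries (like `read_until`) lose the convenience of one contiguous buffer.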
Also cc @ajdavis
@bdarnell what is your tolerance for slowdown in the small-message case, and what is your tolerance for C? Also, do you have benchmarks we can work against that you consider representative of Tornado's common-case workload?
It occurs to me that I could post a smaller PR with only the write() changes, which would probably minimize the slowdown on small messages. |
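For context, the write-side half of a change like this can be sketched briefly. The function name and buffer layout below are assumptions for illustration, not the PR's actual code: the point is that a partial send re-slices a memoryview rather than copying the unsent remainder into a fresh buffer.

```python
import socket
from collections import deque


def write_pending(sock: socket.socket, pending: deque) -> None:
    """Hypothetical sketch of a zero-copy write path: pending chunks
    stay in a deque, and a partial send keeps the unsent tail as a
    memoryview slice instead of copying it."""
    while pending:
        head = pending[0]
        try:
            sent = sock.send(head)
        except BlockingIOError:
            return  # socket buffer full; retry on the next write event
        if sent < len(head):
            # Re-slice rather than copy the unsent tail.
            pending[0] = memoryview(head)[sent:]
            return
        pending.popleft()
```

This is why splitting the write changes into their own PR is low-risk for small messages: small writes usually complete in one `send()`, so the slicing path is rarely taken.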
My current policy for Tornado is that the use of C must be optional, because Mac and Windows users often don't have C tools installed (we do publish wheels now for Python 3.5+ on Windows; maybe it's time to build out a more comprehensive wheel-building pipeline). As a consequence, any use of C must be kept small so there's not much duplication (and room for bugs) between C and the pure-Python fallback. Cython is an interesting possibility for generating a fast implementation from (annotated) Python; it should also help avoid the wild-pointer bugs that can plague C code.

I care more about HTTP performance than raw TCP. The only benchmark I really have (which isn't great) is

If we're going to have to keep the existing implementation for Python 2, then maybe we could also punt on the problem and expose an option to optimize for either large or small messages.
Thanks. I get 1890 req/s on master, and 1820 req/s on this PR. That's looking like a 4% slowdown, which is better than I thought.
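A quick check of that arithmetic from the quoted req/s numbers:

```python
# Sanity-checking the "4% slowdown" from the raw throughput figures.
baseline = 1890  # req/s on master
patched = 1820   # req/s on this PR
slowdown = (baseline - patched) / baseline
print(f"{slowdown:.1%}")  # prints 3.7%, i.e. roughly the quoted 4%
```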
I'm not sure what you mean by that. Do you only care about performance on Python 2, or are you implying that Python 3-only code could be made significantly more efficient?
I was referring to the fact that USE_STREAM_BUFFER_FOR_READS is (currently) set only for py3. If we're going to have two implementations and the ability to switch between them, we could expose that setting to the application instead of deciding based on the python version. (I think I'd rather find a workaround to make StreamBuffer work on py2 so we can only have one, but if we're going to keep both and they have different performance characteristics, we might as well take advantage of it). I have historically focused on py2 for performance measurements, but it's time now to make py3 the priority.
I've created #2169 with only the write-specific changes. This will make the code easier to review and minimizes the potential slowdown. |
Well, I think the read-specific changes are the most contentious part of this PR, as they make

I'm also starting to wonder whether, instead of changing the buffering logic for all reads, we should instead expose a
Improves performance by 45% on the following benchmark: #2147 (comment)
Caveats:

- `read_until` and `read_until_regex` can be 10 to 20% slower

If we implement the core of StreamBuffer in C, we can probably make performance even better.
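To illustrate the `read_until` caveat: with a segmented buffer, the delimiter can straddle a chunk boundary, so besides scanning each chunk the search must also stitch a small window across every boundary. The helper below is a hypothetical sketch of that extra work, not Tornado's implementation:

```python
def find_delimiter(chunks, delim):
    """Hypothetical illustration of why read_until slows down over a
    segmented buffer: in addition to searching inside each chunk, we
    must build and search a small stitched window (tail of the previous
    chunk + head of the current one) at every chunk boundary."""
    assert delim
    offset = 0
    prev_tail = b""
    for chunk in chunks:
        data = bytes(chunk)
        # Check the boundary window for a straddling match.
        if prev_tail:
            window = prev_tail + data[:len(delim) - 1]
            pos = window.find(delim)
            if pos != -1:
                return offset - len(prev_tail) + pos
        # Check inside this chunk.
        pos = data.find(delim)
        if pos != -1:
            return offset + pos
        # Carry the last len(delim)-1 bytes to the next boundary check.
        prev_tail = data[-(len(delim) - 1):] if len(delim) > 1 else b""
        offset += len(data)
    return -1
```

A contiguous buffer needs none of this bookkeeping, which is one plausible source of the 10-20% regression; moving this inner loop to C is the mitigation the description suggests.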