
Streams are iterable + receive_some doesn't require an explicit size #1123

Merged
njsmith merged 6 commits from no-more-mandatory-buffer-size into python-trio:master on Jul 30, 2019

Conversation

njsmith (Member) commented Jun 25, 2019

This came out of discussion in gh-959
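Taken together, the two changes in the title let consumers drop both the explicit size argument and the manual receive loop. A minimal sketch of the resulting pattern, using a hypothetical in-memory stream rather than Trio's real classes (and asyncio just to drive the coroutine):

```python
# Sketch of the API shape this PR introduces: receive_some() grows an optional
# size argument, and streams become async-iterable, yielding chunks until EOF.
# MemoryReceiveStream is a made-up stand-in, not one of Trio's actual classes.
import asyncio

DEFAULT_RECEIVE_SIZE = 65536  # 64 KiB, the default the PR settles on

class MemoryReceiveStream:
    def __init__(self, data, chunk_size=5):
        self._data = data
        self._chunk_size = chunk_size

    async def receive_some(self, max_bytes=None):
        # max_bytes is now optional: None means "whatever is convenient"
        if max_bytes is None:
            max_bytes = DEFAULT_RECEIVE_SIZE
        n = min(self._chunk_size, max_bytes)
        chunk, self._data = self._data[:n], self._data[n:]
        return chunk  # b"" signals end-of-stream

    def __aiter__(self):
        return self

    async def __anext__(self):
        chunk = await self.receive_some()
        if not chunk:
            raise StopAsyncIteration
        return chunk

async def main():
    received = b""
    # The new iteration protocol: no size argument, no explicit EOF check
    async for chunk in MemoryReceiveStream(b"hello world"):
        received += chunk
    return received

print(asyncio.run(main()))  # b'hello world'
```

In real Trio code the consumer side is just `async for chunk in stream:` for any `ReceiveStream`.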

codecov bot commented Jun 25, 2019

Codecov Report

Merging #1123 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1123      +/-   ##
==========================================
- Coverage   99.55%   99.51%   -0.04%     
==========================================
  Files         105      104       -1     
  Lines       12716    12666      -50     
  Branches      970      977       +7     
==========================================
- Hits        12659    12605      -54     
- Misses         36       40       +4     
  Partials       21       21
Impacted Files | Coverage Δ
trio/_windows_pipes.py | 100% <100%> (ø) ⬆️
trio/_subprocess.py | 100% <100%> (ø) ⬆️
trio/_unix_pipes.py | 100% <100%> (ø) ⬆️
trio/_ssl.py | 100% <100%> (ø) ⬆️
trio/testing/_check_streams.py | 99.33% <100%> (+0.01%) ⬆️
trio/_highlevel_generic.py | 100% <100%> (ø) ⬆️
trio/tests/test_testing.py | 100% <100%> (ø) ⬆️
trio/_abc.py | 100% <100%> (ø) ⬆️
trio/_highlevel_socket.py | 100% <100%> (ø) ⬆️
trio/tests/test_ssl.py | 99.71% <100%> (ø) ⬆️
... and 15 more

njsmith force-pushed the no-more-mandatory-buffer-size branch from 25bcd80 to aab5fe3 on June 25, 2019 23:42
python-trio deleted a comment from codecov bot on Jun 25, 2019
njsmith (Member, Author) commented Jun 26, 2019

Codecov doesn't seem to be updating its comment, but if I click through now, it says that this isn't adding any new uncovered lines.

oremanj (Member) left a review:

Thanks for this -- I think it's a huge usability improvement in working with Streams. Feel free to merge if you feel you've adequately addressed my comments, as I won't be online to respond till next Monday.

docs/source/tutorial/echo-server.py (outdated; resolved)
newsfragments/959.feature.rst (outdated; resolved)
newsfragments/959.feature.rst (outdated; resolved)
trio/_abc.py (outdated; resolved)
@@ -10,6 +10,12 @@

__all__ = ["SocketStream", "SocketListener"]

# XX TODO: this number was picked arbitrarily. We should do experiments to
oremanj (Member) commented:

One wrinkle: AFAIK, each call to socket.recv() allocates a new bytes object that is large enough for the entire given chunksize. If large allocations are more expensive, passing a too-large buffer is probably bad for performance. (The allocators I know of use 128KB as their threshold for "this is big, mmap it instead of finding a free chunk" but if one used 64KB instead and we got a mmap/munmap pair on each receive, that feels maybe bad?)

My intuition favors a much lower buffer size, like 4KB or 8KB, but I also do most of my work on systems that are rarely backlogged, so my intuition might well be off when it comes to a high-throughput Trio application.

Another option we could consider: the socket owns a receive buffer (bytearray) which it reuses, calls recv_into(), and extracts just the amount actually received into a bytes for returning. Downside: spends 64KB (or whatever) per socket in steady state. Counterpoint: the OS-level socket buffers are probably much larger than that (but I don't know how much memory they occupy when the socket isn't backlogged).
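The reuse idea above can be sketched with plain blocking sockets; the class name and buffer size here are hypothetical, for illustration only, not Trio's actual implementation:

```python
# Sketch of the buffer-reuse alternative: the wrapper owns one bytearray,
# fills it with recv_into(), and copies out only the bytes actually received,
# so no fresh 64 KiB allocation happens per call. Blocking sockets are used
# purely for illustration.
import socket

RECEIVE_BUFFER_SIZE = 65536  # 64 KiB steady-state cost per socket

class BufferReusingReceiver:
    def __init__(self, sock):
        self._sock = sock
        self._buffer = bytearray(RECEIVE_BUFFER_SIZE)  # allocated once, reused

    def receive_some(self):
        nbytes = self._sock.recv_into(self._buffer)
        # Hand out just the received prefix; the big buffer never escapes.
        return bytes(self._buffer[:nbytes])

a, b = socket.socketpair()
a.sendall(b"ping")
print(BufferReusingReceiver(b).receive_some())  # b'ping'
```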

njsmith (Member, Author) replied:

This is an interesting discussion but I don't want it to hold up merging the basic functionality, so I split it off into #1139

(Twisted has apparently used 64 KiB receive buffers for its entire existence and I can't find any evidence that anyone has ever thought twice about it. So we're probably not risking any disaster by starting with 64 KiB for now :-).)

trio/_ssl.py (outdated):
# Heuristic: normally we use DEFAULT_RECEIVE_SIZE, but if
# the transport gave us a bunch of data last time then we'll
# try to decrypt and pass it all back at once.
max_bytes = max(DEFAULT_RECEIVE_SIZE, self._incoming.pending)
oremanj (Member) commented:

I'm a little confused at what the benefit is of having a DEFAULT_RECEIVE_SIZE for SSLStream at all. It seems like we could instead have a nice magic-number-free policy of "ask the transport stream to receive_some() with no size specified, then return all the data we decrypted from whatever we got in that chunk, or loop and receive_some() again if we didn't get any decrypted data".

njsmith (Member, Author) replied:

Yeah, this is complicated... what you say makes logical sense, but openssl's API is super awkward. There isn't any way to say "please decrypt all the data in your receive buffer". You have to pick a value to pass to SSLObject.read. And even more annoying: you don't find out until after you've picked a value whether you have to go back to the underlying transport for more data. So you have to pick the value before you know how much data the underlying transport wants to give you. And once you've picked a value, you have to keep using that value until some data is returned.

So my logic was: well, if we already have a bunch of data in the receive buffer because the underlying transport was generous, then likely we can just decrypt and return that, and the size of the encrypted data is a plausible upper bound on the size of the decrypted data, so self._incoming.pending is a good value to pass to SSLObject.read.

But, sometimes there won't be a lot of data in the receive buffer – for example, because our heuristic worked well the previous time, and cleared everything out, or almost everything. Like, imagine there's 1 byte left in the receive buffer. The way TLS works, you generally can't decrypt just 1 byte – everything's transmitted in frames, and you need to get the whole frame with its header and MAC and everything before you can decrypt any of it. So if we call ssl_object.read(1), then openssl will end up requesting another large chunk of data from the underlying transport, then our read(1) call will decrypt the first byte and return it, leaving the rest of the data sitting in the buffer for next time. And that would be unfortunate.

So my first attempt at a heuristic is: use the receive buffer size, but never anything smaller than DEFAULT_RECEIVE_SIZE.

I guess this has a weird effect if the underlying transport likes to return more than DEFAULT_RECEIVE_SIZE. Say it gives us 65 KiB, while DEFAULT_RECEIVE_SIZE is 64 KiB. On our first call to SSLStream.receive_some, the buffer size is zero, so we do read(64 KiB). This drains 65 KiB from the underlying transport, then decrypts and returns the first 64 KiB. The next time we call SSLStream.receive_some, we do read(64 KiB) again, but there's already 1 KiB of data in the buffer, so we just return that immediately without refilling the buffer. Then this repeats indefinitely, so we alternate between doing a big receive and a small receive every time. Seems wasteful – it'd be better to return 65 KiB each time.
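The big/small alternation described above can be checked with a toy model. The simulation below is illustrative only, and makes the simplifying assumption that openssl refills from the transport only when it has no decryptable data buffered:

```python
# Toy model of the max(DEFAULT_RECEIVE_SIZE, pending) heuristic against a
# transport that always delivers 65 KiB chunks. Simplifying assumption: the
# receive buffer is refilled only when it holds no decryptable data at all.
DEFAULT_RECEIVE_SIZE = 64 * 1024
TRANSPORT_CHUNK = 65 * 1024

def simulate(calls):
    buffered = 0   # encrypted bytes sitting in the receive buffer
    returned = []
    for _ in range(calls):
        request = max(DEFAULT_RECEIVE_SIZE, buffered)  # the heuristic
        if buffered == 0:
            buffered += TRANSPORT_CHUNK  # pull a chunk from the transport
        n = min(request, buffered)       # decrypt and return up to `request`
        buffered -= n
        returned.append(n)
    return returned

print(simulate(4))  # [65536, 1024, 65536, 1024]
```

The model reproduces the indefinite alternation between one 64 KiB receive and one 1 KiB receive.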

So maybe a better strategy would be to start with some smallish default receive size, and then increase it over time if we observe the underlying transport giving us more data.
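One way to sketch that grow-over-time strategy; the class name, starting size, and cap are all invented for illustration, not what the PR implements:

```python
# Sketch of an adaptive receive size: start smallish, and double the request
# whenever the transport fills it completely (a hint that more was available).
INITIAL_RECEIVE_SIZE = 4096          # hypothetical starting point
MAX_RECEIVE_SIZE = 4 * 1024 * 1024   # hypothetical cap

class AdaptiveReceiveSizer:
    def __init__(self):
        self.size = INITIAL_RECEIVE_SIZE

    def next_size(self, last_received):
        # A full request suggests the transport had more to give: grow.
        # A partial fill means the current size is already big enough.
        if last_received >= self.size:
            self.size = min(self.size * 2, MAX_RECEIVE_SIZE)
        return self.size

sizer = AdaptiveReceiveSizer()
print(sizer.next_size(4096))  # filled completely -> grows to 8192
print(sizer.next_size(1000))  # partial fill -> stays at 8192
```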

njsmith (Member, Author) added:

I rewrote the SSLStream stuff to hopefully address the above issues...

trio/tests/test_ssl.py (outdated; resolved)
oremanj (Member) left a comment:

Sorry for the delay here - looks good!

njsmith merged commit d2f364e into python-trio:master on Jul 30, 2019
njsmith deleted the no-more-mandatory-buffer-size branch on July 30, 2019 01:52