Context
#2712 is about to close #2128 and #709. However, one piece is missing. In #709 (comment), @\Lukasa mentioned 7 years ago that if we were to implement #2712, we should also make sure we don't take away any important capability from urllib3. Here's the full comment:
If we want to satisfy all use cases, we're in a really tricky place. The space of options and whether they're currently supported is:
|
Decompress |
Don't Decompress |
| Bytes read from the wire |
✓ |
✓ |
| Bytes received in the application |
✗ |
✓ |
Put another way, if you are about the number of bytes you read from the wire, you can control that perfectly with urllib3 today, regardless of decode_content. If you care about the number of bytes your application receives, you can only control that if you set decode_content to False and handle the decoding yourself.
I have no objection to supporting that final box, but I think we should consider really carefully whether we want to do it by adding a cross mark in an already supported box. The proposal in this issue (to change the semantic of read) would do that, unless it was accompanied by either another parameter or a new function that preserves the current logic.
I am +1 on supporting all of the above. I am -1 on supporting a different configuration of the above (i.e. moving the cross to another box). I am +0.1 on doing nothing (because the status quo has some value). I am open to relegating the current 'default' logic to a backup method (raw_read or some such thing), and moving @gilesbrown's suggestion to the read method, though it will break a surprising amount of code.
Originally posted by @\Lukasa in #709 (comment)
This was unclear for me for a while, so I'll put it yet another way.
- Most users care about the bytes they receive in their application. And in the face of decompression, then having
read() returning a random number of bytes, including sometimes 0, is a bug.
- Some users are instead interested in controlling precisely the bytes they read from the wire: each
read() call issues a single system call and that's it. And maybe they want decompression as well, and don't mind if read() returns 0 bytes or too many bytes because the important part is that read(n) reads n bytes.
What #2712 does when fixing that bug is breaking the second set of users that want 1/ read() to issue single sytem calls and 2/ decompression.
Going back to the question from @sethmlarson in #2712 (comment):
What's the use-case for the raw_read function on HTTPResponse? My thinking was the previous behavior of returning data while not following io.BufferedIOBase.read semantics was a bug.
It is indeed a bug, but the above shows that this breaks a potentially valid use case: HTTPResponse.read() doing single system calls + decompression. However I realize now that my initial raw_read does not help with that use case because it does not handle decompression. I'll remove it from the pull request.
Thinking on extensibility to HTTP/2: there's a chance that data is available in a buffer somewhere w/o needing a system call due to stream multiplexing, our read (and readX) functions should be documented to use either 0-1 (for read1, etc) or 0-N system calls instead of "1 or more".
I think this is true even in HTTP/1.1. If you have data that compresses very well, it's perfectly possible for low values of n that one read(n) reads so much data to cover multiple read(n) in the future. This is the if amt is not None and self._decoded_bytes_in_buffer() >= amt condition.
Are we OK with breaking users that want single sytem calls + decompression?
I'm +1 because I'm doubtful that anybody actually wants that. And a workaround is to set decode_content to False and handle decompression manually, possibly with our MultiDecoder.
Context
#2712 is about to close #2128 and #709. However, one piece is missing. In #709 (comment), @\Lukasa mentioned 7 years ago that if we were to implement #2712, we should also make sure we don't take away any important capability from urllib3. Here's the full comment:
Originally posted by @\Lukasa in #709 (comment)
This was unclear for me for a while, so I'll put it yet another way.
read()returning a random number of bytes, including sometimes 0, is a bug.read()call issues a single system call and that's it. And maybe they want decompression as well, and don't mind ifread()returns 0 bytes or too many bytes because the important part is thatread(n)reads n bytes.What #2712 does when fixing that bug is breaking the second set of users that want 1/ read() to issue single sytem calls and 2/ decompression.
Going back to the question from @sethmlarson in #2712 (comment):
It is indeed a bug, but the above shows that this breaks a potentially valid use case:
HTTPResponse.read()doing single system calls + decompression. However I realize now that my initialraw_readdoes not help with that use case because it does not handle decompression. I'll remove it from the pull request.I think this is true even in HTTP/1.1. If you have data that compresses very well, it's perfectly possible for low values of n that one read(n) reads so much data to cover multiple read(n) in the future. This is the
if amt is not None and self._decoded_bytes_in_buffer() >= amtcondition.Are we OK with breaking users that want single sytem calls + decompression?
I'm +1 because I'm doubtful that anybody actually wants that. And a workaround is to set
decode_contenttoFalseand handle decompression manually, possibly with our MultiDecoder.