Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should HTTPResponse support decompression with raw reads? #2769

Closed
pquentin opened this issue Nov 8, 2022 · 1 comment
Closed

Should HTTPResponse support decompression with raw reads? #2769

pquentin opened this issue Nov 8, 2022 · 1 comment

Comments

@pquentin
Copy link
Member

pquentin commented Nov 8, 2022

Context

#2712 is about to close #2128 and #709. However, one piece is missing. In #709 (comment), @\Lukasa mentioned 7 years ago that if we were to implement #2712, we should also make sure we don't take away any important capability from urllib3. Here's the full comment:

If we want to satisfy all use cases, we're in a really tricky place. The space of options and whether they're currently supported is:

Decompress Don't Decompress
Bytes read from the wire
Bytes received in the application

Put another way, if you are about the number of bytes you read from the wire, you can control that perfectly with urllib3 today, regardless of decode_content. If you care about the number of bytes your application receives, you can only control that if you set decode_content to False and handle the decoding yourself.

I have no objection to supporting that final box, but I think we should consider really carefully whether we want to do it by adding a cross mark in an already supported box. The proposal in this issue (to change the semantic of read) would do that, unless it was accompanied by either another parameter or a new function that preserves the current logic.

I am +1 on supporting all of the above. I am -1 on supporting a different configuration of the above (i.e. moving the cross to another box). I am +0.1 on doing nothing (because the status quo has some value). I am open to relegating the current 'default' logic to a backup method (raw_read or some such thing), and moving @gilesbrown's suggestion to the read method, though it will break a surprising amount of code.

Originally posted by @\Lukasa in #709 (comment)

This was unclear for me for a while, so I'll put it yet another way.

  1. Most users care about the bytes they receive in their application. And in the face of decompression, then having read() returning a random number of bytes, including sometimes 0, is a bug.
  2. Some users are instead interested in controlling precisely the bytes they read from the wire: each read() call issues a single system call and that's it. And maybe they want decompression as well, and don't mind if read() returns 0 bytes or too many bytes because the important part is that read(n) reads n bytes.

What #2712 does when fixing that bug is breaking the second set of users that want 1/ read() to issue single sytem calls and 2/ decompression.

Going back to the question from @sethmlarson in #2712 (comment):

What's the use-case for the raw_read function on HTTPResponse? My thinking was the previous behavior of returning data while not following io.BufferedIOBase.read semantics was a bug.

It is indeed a bug, but the above shows that this breaks a potentially valid use case: HTTPResponse.read() doing single system calls + decompression. However I realize now that my initial raw_read does not help with that use case because it does not handle decompression. I'll remove it from the pull request.

Thinking on extensibility to HTTP/2: there's a chance that data is available in a buffer somewhere w/o needing a system call due to stream multiplexing, our read (and readX) functions should be documented to use either 0-1 (for read1, etc) or 0-N system calls instead of "1 or more".

I think this is true even in HTTP/1.1. If you have data that compresses very well, it's perfectly possible for low values of n that one read(n) reads so much data to cover multiple read(n) in the future. This is the if amt is not None and self._decoded_bytes_in_buffer() >= amt condition.

Are we OK with breaking users that want single sytem calls + decompression?

I'm +1 because I'm doubtful that anybody actually wants that. And a workaround is to set decode_content to False and handle decompression manually, possibly with our MultiDecoder.

@sethmlarson
Copy link
Member

sethmlarson commented Nov 8, 2022

Chatted with @pquentin and we came to the conclusion that the use-case of being able to exactly control socket read occurrences and amounts combined with decompression isn't going to be a supported use-case.

We consider strictly following the io module's contract more important a feature and in the future if HTTP/2 is added calls to HTTPResponse.read will not map to socket reads directly.

Closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants