Skip to content

Standardize HTTPResponse.read(X) behavior regardless of compression #2712

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

franekmagiera
Copy link
Member

Closes #2128

Using a bytearray as a buffer for decoded data. Tested manually with httpbin so far, need to add unit tests.

@sethmlarson
Copy link
Member

Wanted to drop in and say thank you for picking up this issue @franekmagiera! Happy to see your name again :) Would you like me to assign you to #2128 so you can "claim" the issue despite the draft state of this PR?

@franekmagiera
Copy link
Member Author

Thanks @sethmlarson! I self-assigned, will try to create a first "reviewable" version soon.

@franekmagiera franekmagiera marked this pull request as ready for review August 19, 2022 21:53
@franekmagiera franekmagiera changed the title Draft: Standardize HTTPResponse.read(X) behavior regardless of compression Standardize HTTPResponse.read(X) behavior regardless of compression Aug 19, 2022
@franekmagiera
Copy link
Member Author

franekmagiera commented Aug 19, 2022

I think this is ready for a first look. I see that Windows pipelines are failing, will check it out.

@pquentin
Copy link
Member

How big can the buffer get? Is it linear in n? It's not obvious to me.

@franekmagiera
Copy link
Member Author

How large the buffer gets depends on the message in the body. If the message is very compressible, the buffer can get quite large even though we're asking for a small amount of bytes. For example, running the following script:

from urllib3.response import DeflateDecoder
import zlib

message = b"abcdefghij" + b"foo" * 20 + b"klmnop"

print(f"Original message length: {len(message)}")

compressed = zlib.compress(message)

print(f"Compressed message length: {len(compressed)}")

decoder = DeflateDecoder()

pos = 3
while pos <= len(compressed):
    data = decoder.decompress(compressed[pos-3:pos])
    print(f"pos = {pos}; data length = {len(data)}; data: {data}")
    pos += 3

gives a following result:

Original message length: 76
Compressed message length: 29
pos = 3; data length = 0; data: b''
pos = 6; data length = 3; data: b'abc'
pos = 9; data length = 3; data: b'def'
pos = 12; data length = 3; data: b'ghi'
pos = 15; data length = 3; data: b'jfo'
pos = 18; data length = 58; data: b'ofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo'
pos = 21; data length = 3; data: b'klm'
pos = 24; data length = 3; data: b'nop'
pos = 27; data length = 0; data: b''

It's a pretty tough question, it seems to me it's impossible to give a more precise answer without making any assumptions about the payload message itself and even then it feels like it'd be quite tricky to come up with a reasonable model.

@franekmagiera
Copy link
Member Author

Got some time today to take a look at failing Windows pipelines - there were MemoryErrors caused by introducing the buffer. I put a bandaid on that by setting decode_content=False in failing tests (this circumvents writing to the buffer), couldn't see a better solution at the moment, will spend some time later to think about this. Also coverage is failing, will write more tests to cover missing 2 lines.

@franekmagiera franekmagiera force-pushed the 2128-standarize-response-read-behaviour branch from b57ac44 to 65f6fc9 Compare August 30, 2022 05:55
@franekmagiera
Copy link
Member Author

franekmagiera commented Sep 19, 2022

Regarding my last comment - I didn't have much time to think about anything smarter than setting decode_content=False to fix the tests that were failing on Windows due to MemoryError. The errors were caused by copying large amounts of data to the buffer in the default case. I don't think it's a significant regression though, let me know if you disagree.

Also added the tests to bring back 100% coverage.

I think it is ready for a first look again, let me know what you think about it.

BTW, I don't mean to hurry you up by this comment - I understand it's not always easy to find time and I don't mind waiting for the review. I just realized that judging by my previous comment it is not clear whether I intended to make any more changes or not and wanted to clarify that.

Copy link
Member

@pquentin pquentin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The base functionality works, but I think the reliance on content-length makes the pull request more complicated than needed. What am I missing? Do we really need to use it?

Regarding my last comment - I didn't have much time to think about anything smarter than setting decode_content=False to fix the tests that were failing on Windows due to MemoryError. The errors were caused by copying large amounts of data to the buffer in the default case. I don't think it's a significant regression though, let me know if you disagree.

I don't know, it seems significant. It means this pull request interacts badly with #2657, at least on Windows.

Copy link
Member

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking superb, nice work all. Was just thinking this would be an interesting place to add hypothesis based tests for different combinations of compression and reads and asserting tells/eof, etc.

Also a general ask for a few more comments since these functions are quite complex, otherwise besides the questions below LGTM!

@pquentin pquentin requested a review from shazow as a code owner November 8, 2022 10:36
sethmlarson
sethmlarson previously approved these changes Nov 8, 2022
Copy link
Member

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Now that stream(100) really reads 100 bytes, a single next() call is
enough.
@pquentin
Copy link
Member

pquentin commented Nov 9, 2022

Regarding the MemoryError issue, Windows is only the messenger here. There's apparently a known issue with Windows in GitHub Actions where memory available is lower but you can increase the pagefile (swap) to get jobs to run. However, it's not because Linux and macOS don't complain that this is fine. On my Linux laptop, I ran test_requesting_large_resources_via_ssl with decode_content enabled and disabled under mprof run:

decode_content=False

decode_content=False

decode_content=True

decode_content=True

This patch triples the memory usage! I need to look into using a single buffer here, but it's not trivial.

@pquentin
Copy link
Member

pquentin commented Nov 9, 2022

So this highlighted something useless we were doing when amt was None (the default): put the bytes in a buffer to get them back immediately. (This was much easier to see thanks to Seth's suggestion above to have separate blocks.)

diff --git a/src/urllib3/response.py b/src/urllib3/response.py
index 956153ad..1c9b74b3 100644
--- a/src/urllib3/response.py
+++ b/src/urllib3/response.py
@@ -827,9 +827,7 @@ class HTTPResponse(BaseHTTPResponse):
         self._init_decoder()
 
         if amt is None:
-            decoded_data = self._decode(data, decode_content, flush_decoder=True)
-            self._decoded_bytes.extend(decoded_data)
-            data = self._popall_decoded_bytes()
+            data = self._decode(data, decode_content, flush_decoder=True)
 
             if cache_content:
                 self._body = data

amt_none

In the last two remaining spikes, amt is 2GiB.

@pquentin
Copy link
Member

Using memray I understand clearly now why we're seeing four copies in the data:

image

  • The data read from the wire (self._raw_read)
  • The copy added to self._decoded_bytes
  • And two more copies made in popleft:
    • self._decoded_bytes[:n] (a new bytearray)
    • bytes(self._decoded_bytes[:n]) (a bytes object)

@franekmagiera
Copy link
Member Author

Closing this one in favor of #2798
Impressive work @pquentin! Sorry for wasting some time with this one.

@pquentin
Copy link
Member

You did not waste any time, it was great work! Thankfully Windows told us something was amiss memory wise. You can also submit the 500$ bounty on Open Collective btw.

@franekmagiera
Copy link
Member Author

franekmagiera commented Nov 13, 2022

Yup, just feel bad that my instinct was to leave it alone. Anyway, thanks, learnt a lot with this one!

@pquentin
Copy link
Member

Your work was great, you deserve the bounty, no worries.

@franekmagiera franekmagiera deleted the 2128-standarize-response-read-behaviour branch December 3, 2022 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Standardize HTTPResponse.read(X) behavior regardless of compression
3 participants