Standardize HTTPResponse.read(X) behavior regardless of compression #2712
Conversation
…ests for the HTTPResponse
Wanted to drop in and say thank you for picking up this issue @franekmagiera! Happy to see your name again :) Would you like me to assign you to #2128 so you can "claim" the issue despite the draft state of this PR?
Thanks @sethmlarson! I self-assigned, will try to create a first "reviewable" version soon.
I think this is ready for a first look. I see that the Windows pipelines are failing, will check it out.
How big can the buffer get? Is it linear in n? It's not obvious to me.
How large the buffer gets depends on the message in the body. If the message is very compressible, the buffer can get quite large even though we're only asking for a small number of bytes. For example, running the following script:

```python
import zlib

from urllib3.response import DeflateDecoder

message = b"abcdefghij" + b"foo" * 20 + b"klmnop"
print(f"Original message length: {len(message)}")

compressed = zlib.compress(message)
print(f"Compressed message length: {len(compressed)}")

decoder = DeflateDecoder()
pos = 3
while pos <= len(compressed):
    data = decoder.decompress(compressed[pos - 3:pos])
    print(f"pos = {pos}; data length = {len(data)}; data: {data}")
    pos += 3
```

gives the following result:
It's a pretty tough question. It seems to me it's impossible to give a more precise answer without making assumptions about the payload itself, and even then it feels like it would be quite tricky to come up with a reasonable model.
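To make the point concrete with plain zlib (a minimal sketch of the same effect, not part of the script above): a handful of compressed bytes can expand into megabytes of decoded data, so the buffer is bounded by the payload rather than by the amount the caller asked for.

```python
import zlib

# A highly compressible 10 MB payload compresses to roughly 10 KB, and
# feeding the decoder only the first kilobyte of compressed input already
# yields on the order of a megabyte of decoded data.
message = b"a" * 10_000_000
compressed = zlib.compress(message)
print(f"compressed size: {len(compressed)} bytes")

decoder = zlib.decompressobj()
chunk = decoder.decompress(compressed[:1024])
print(f"first 1024 compressed bytes decoded to {len(chunk)} bytes")
```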
Got some time today to take a look at the failing Windows pipelines - there were MemoryErrors.
Regarding my last comment - I didn't have much time to think about anything smarter than setting decode_content=False. Also added the tests to bring back 100% coverage. I think it is ready for a first look again, let me know what you think about it. BTW, I don't mean to hurry you with this comment - I understand it's not always easy to find time and I don't mind waiting for the review. I just realized that, judging by my previous comment, it was not clear whether I intended to make any more changes, and I wanted to clarify that.
Thanks! The base functionality works, but I think the reliance on content-length makes the pull request more complicated than needed. What am I missing? Do we really need to use it?
> Regarding my last comment - I didn't have much time to think about anything smarter than setting decode_content=False to fix the tests that were failing on Windows due to MemoryError. The errors were caused by copying large amounts of data to the buffer in the default case. I don't think it's a significant regression though, let me know if you disagree.
I don't know, it seems significant. It means this pull request interacts badly with #2657, at least on Windows.
This is looking superb, nice work all. Was just thinking this would be an interesting place to add Hypothesis-based tests for different combinations of compression and read sizes, asserting tell()/EOF behavior, etc.
Also, a general ask for a few more comments since these functions are quite complex; otherwise, besides the questions below, LGTM!
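A rough sketch of what such a property-based test could look like (hypothetical test name, and assuming HTTPResponse can be built from a BytesIO body the way urllib3's own tests do; this is not code from the PR):

```python
import io
import zlib

from hypothesis import given, strategies as st

from urllib3.response import HTTPResponse


@given(payload=st.binary(min_size=1, max_size=4096), amt=st.integers(1, 512))
def test_read_amt_with_deflate(payload: bytes, amt: int) -> None:
    # Compress the payload and serve it as a deflate-encoded response body.
    compressed = zlib.compress(payload)
    resp = HTTPResponse(
        body=io.BytesIO(compressed),
        headers={"content-encoding": "deflate"},
        preload_content=False,
    )

    chunks = []
    while chunk := resp.read(amt):
        chunks.append(chunk)

    # With read(amt) standardized, every chunk except possibly the last
    # should be exactly `amt` decoded bytes, and the chunks reassemble
    # the original payload.
    assert all(len(c) == amt for c in chunks[:-1])
    assert b"".join(chunks) == payload
```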
Now that stream(100) really reads 100 bytes, a single next() call is enough.
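For illustration (not the actual test from the PR), the simplification relies on the first chunk from stream(100) now being exactly 100 decoded bytes:

```python
import io
import zlib

from urllib3.response import HTTPResponse

# Hypothetical example: a 1000-byte body served as deflate-encoded content.
resp = HTTPResponse(
    body=io.BytesIO(zlib.compress(b"x" * 1000)),
    headers={"content-encoding": "deflate"},
    preload_content=False,
)

# With stream(100) standardized, the first chunk is exactly 100 bytes,
# so a single next() call replaces the old read-until-enough loop.
first_chunk = next(resp.stream(100))
assert len(first_chunk) == 100
```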
Regarding the MemoryError issue, Windows is only the messenger here. There's apparently a known issue with Windows in GitHub Actions where less memory is available, but you can increase the pagefile (swap) to get jobs to run. However, the fact that Linux and macOS don't complain doesn't mean this is fine. On my Linux laptop, I ran test_requesting_large_resources_via_ssl with both decode_content=False and decode_content=True: this patch triples the memory usage! I need to look into using a single buffer here, but it's not trivial.
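One possible shape for such a single-buffer approach (a hedged sketch with hypothetical names, not necessarily what the PR ended up doing): queue the decoded chunks and only materialize exactly the requested number of bytes per read, instead of concatenating everything into one ever-growing buffer.

```python
from __future__ import annotations

from collections import deque


class DecodedByteQueue:
    # Hypothetical sketch: decoded chunks are queued as-is, and get(n)
    # copies out exactly n bytes, pushing any leftover slice back to the
    # front of the queue. This avoids re-copying all buffered data on
    # every read.
    def __init__(self) -> None:
        self.chunks: deque[bytes] = deque()
        self.size = 0

    def put(self, data: bytes) -> None:
        self.chunks.append(data)
        self.size += len(data)

    def get(self, n: int) -> bytes:
        out = bytearray()
        while self.chunks and len(out) < n:
            chunk = self.chunks.popleft()
            needed = n - len(out)
            if len(chunk) > needed:
                out += chunk[:needed]
                self.chunks.appendleft(chunk[needed:])
            else:
                out += chunk
        self.size -= len(out)
        return bytes(out)
```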
So this highlighted something useless we were doing when amt was None (the default): putting the bytes in a buffer only to get them back immediately. (This was much easier to see thanks to Seth's suggestion above to have separate blocks.)
In the last two remaining spikes, amt is 2 GiB.
You did not waste any time, it was great work! Thankfully Windows told us something was amiss memory-wise. You can also submit the $500 bounty on Open Collective, btw.
Yup, just feel bad that my instinct was to leave it alone. Anyway, thanks, learnt a lot with this one!
Your work was great, you deserve the bounty, no worries. |
Closes #2128
Using a bytearray as a buffer for decoded data. Tested manually with httpbin so far, need to add unit tests.
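A minimal sketch of the buffering idea described above (illustrative names only, not the exact PR code): decoded data accumulates in a bytearray, each read returns exactly the requested amount, and any surplus stays buffered for the next call.

```python
def read_decoded(buffer: bytearray, amt: int, read_raw, decode) -> bytes:
    """Return exactly `amt` decoded bytes (fewer only at end of stream).

    `buffer` holds surplus decoded data between calls; `read_raw` and
    `decode` stand in for the raw socket read and the content decoder.
    """
    while len(buffer) < amt:
        raw = read_raw(amt)
        if not raw:  # end of stream
            break
        buffer += decode(raw)

    data = bytes(buffer[:amt])
    del buffer[:amt]  # keep the remainder buffered for the next call
    return data
```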