Do we flush the decoder when reaching EOF in partial reads? #2799

Open
Tracked by #2798
pquentin opened this issue Nov 11, 2022 · 6 comments
Labels
💰 Bounty $300 If you complete this issue we'll pay you $300 on OpenCollective!

Comments

@pquentin
Member

I started writing a very detailed bug report and then convinced myself that the bug was not real. Sorry, this will be the short version now. Look at our flush_decoder logic:

flush_decoder = False
fp_closed = getattr(self._fp, "closed", False)

with self._error_catcher():
    data = self._fp_read(amt) if not fp_closed else b""
    if amt is None:
        flush_decoder = True
    else:
        cache_content = False
        if (
            amt != 0 and not data
        ):  # Platform-specific: Buggy versions of Python.
            # Close the connection when no data is returned
            #
            # This is redundant to what httplib/http.client _should_
            # already do. However, versions of python released before
            # December 15, 2012 (http://bugs.python.org/issue16298) do
            # not properly close the connection in all cases. There is
            # no harm in redundantly calling close.
            self._fp.close()
            flush_decoder = True
            if (
                self.enforce_content_length
                and self.length_remaining is not None
                and self.length_remaining != 0
            ):
                # This is an edge case that httplib failed to cover due
                # to concerns of backward compatibility. We're
                # addressing it here to make sure IncompleteRead is
                # raised during streaming, so all calls with incorrect
                # Content-Length are caught.
                raise IncompleteRead(self._fp_bytes_read, self.length_remaining)

if data:
    self._fp_bytes_read += len(data)
    if self.length_remaining is not None:
        self.length_remaining -= len(data)

    data = self._decode(data, decode_content, flush_decoder)

We should only flush when there is no data left. In the streaming case, that happens when we reach EOF and data is b''. Setting flush_decoder there is done correctly on line 776. But then we never call self._decode, because the "if data:" check prevents us from reaching it. So we don't flush().

This bug was introduced in urllib3 1.11, released in July 2015: 29e144f. Surely in 7 years someone would have noticed if data had gone missing.

So surely the flush() call is not actually doing anything, whatever the reason is.
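
To make that concrete, here is a minimal sketch using plain zlib (not urllib3 code; the chunk size of 7 is arbitrary): decompress a gzip stream in small pieces and check whether flush() ever returns bytes that decompress() withheld.

import zlib

# Build a gzip stream (wbits=16+MAX_WBITS selects the gzip container).
payload = b"some payload " * 1000
compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
stream = compressor.compress(payload) + compressor.flush()

# Feed it to a decompressor in tiny chunks, mimicking partial reads.
decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
out = b"".join(
    decoder.decompress(stream[i : i + 7]) for i in range(0, len(stream), 7)
)
tail = decoder.flush()

assert out == payload  # decompress() already returned everything
assert tail == b""     # flush() has nothing left to add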

@sethmlarson added the 💰 Bounty $300 label on Nov 3, 2023
@sethmlarson
Member

@abebeos Thanks for digging deep into the issue, can't wait to hear your conclusions! Looks like you're on a good path so far.

@illia-v
Member

illia-v commented Jan 19, 2024

Do I understand correctly that the code in the main branch as of today flushes the decoder when EOF is reached in partial reads? If so, do you know whether we have a test ensuring that a decoder is not left unflushed?
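
For reference, such a test could look roughly like the sketch below. It is hypothetical (not an existing test): it constructs an HTTPResponse directly over an in-memory gzip body and reads it in an awkward chunk size.

import io
import zlib

from urllib3.response import HTTPResponse


def test_partial_reads_flush_decoder() -> None:
    # Hypothetical sketch: read a gzipped body in small chunks and make
    # sure no decoded bytes are lost when EOF is hit mid-read.
    payload = b"x" * 10_000
    compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    body = compressor.compress(payload) + compressor.flush()

    resp = HTTPResponse(
        io.BytesIO(body),
        headers={"content-encoding": "gzip"},
        preload_content=False,
    )

    chunks = []
    while True:
        chunk = resp.read(7)
        if not chunk:
            break
        chunks.append(chunk)

    assert b"".join(chunks) == payload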

@illia-v
Member

illia-v commented Jan 19, 2024

decoded_data = self._decode(data, decode_content, flush_decoder)
self._decoded_buffer.put(decoded_data)

while len(self._decoded_buffer) < amt and data:
    # TODO make sure to initially read enough data to get past the headers
    # For example, the GZ file header takes 10 bytes, we don't want to read
    # it one byte at a time
    data = self._raw_read(amt)
    decoded_data = self._decode(data, decode_content, flush_decoder)
    self._decoded_buffer.put(decoded_data)
data = self._decoded_buffer.get(amt)

It looks like there can be a case where EOF is reached inside the while loop but flush_decoder is never set to True.

#3262 rearranges the code somewhat. Maybe it even fixes the issue; if so, though, the fix is lost among the other rearrangements and is not covered by any new test.
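
One possible shape of a targeted fix (purely a sketch, not the actual patch in #3262): set flush_decoder as soon as _raw_read() returns no more data inside the loop, so that the final _decode() call performs the flush.

while len(self._decoded_buffer) < amt and data:
    data = self._raw_read(amt)
    if not data:
        # EOF reached inside the loop: make the last _decode() flush.
        flush_decoder = True
    decoded_data = self._decode(data, decode_content, flush_decoder)
    self._decoded_buffer.put(decoded_data)
data = self._decoded_buffer.get(amt)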

@sethmlarson
Member

@illia-v I don't like the approach in that PR, would prefer a simple fix instead of a refactor unless strictly necessary.

@zawan-ila
Contributor

So surely the flush() call is not actually doing anything, whatever the reason is

Looks like we don't really need to call flush on a decompress object (at least for the decompressors we have in urllib3). Check here for why that is the case for zlib and here for zstd. For brotli, I can't find any docs. However, looking at the source, it does not look like the Decompressor object even has a flush method.

@sethmlarson
Member

Great sleuthing @zawan-ila! So this means that because we don't support Jython anymore we should be able to remove all logic related to flushing 🚀
