-
-
Notifications
You must be signed in to change notification settings - Fork 31.7k
http.client.IncompleteRead from HTTPResponse read after part reading file #70686
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This is a regression in Python 3.5 tested under Linux and Mac OS X, spotted from a failing test in Biopython biopython/biopython#773 where we would parse a file from the internet. The trigger is partially reading the network handle line by line (e.g. until an end record marker is found), and then calling handle.read() to fetch any remaining data. Self contained examples below. Note that partially reading a file like this still works: $ python3.5
Python 3.5.0 (default, Sep 14 2015, 12:13:24)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from urllib.request import urlopen
>>> handle = urlopen("http://www.python.org")
>>> chunk = handle.read(50)
>>> rest = handle.read()
>>> handle.close() However, the following variants read a few lines and then attempt to call handle.read() and fail. The URL is not important (as long as it has over four lines in these examples). Using readline, >>> from urllib.request import urlopen
>>> handle = urlopen("http://www.python.org")
>>> for i in range(4):
... line = handle.readline()
...
>>> rest = handle.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/xxx/lib/python3.5/http/client.py", line 446, in read
s = self._safe_read(self.length)
File "/Users/xxx/lib/python3.5/http/client.py", line 594, in _safe_read
raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(46698 bytes read, 259 more expected) Using line iteration via next, >>> from urllib.request import urlopen
>>> handle = urlopen("http://www.python.org")
>>> for i in range(4):
... line = next(handle)
...
>>> rest = handle.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/xxx/lib/python3.5/http/client.py", line 446, in read
s = self._safe_read(self.length)
File "/Users/xxx/lib/python3.5/http/client.py", line 594, in _safe_read
raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(46698 bytes read, 259 more expected) Using line iteration directly, >>> from urllib.request import urlopen
>>> count = 0
>>> handle = urlopen("http://www.python.org")
>>> for line in handle:
... count += 1
... if count == 4:
... break
...
>>> rest = handle.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/xxx/lib/python3.5/http/client.py", line 446, in read
s = self._safe_read(self.length)
File "/Users/xxx/lib/python3.5/http/client.py", line 594, in _safe_read
raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(46698 bytes read, 259 more expected) These examples all worked on Python 3.3 and 3.4 so this is a regression. |
As far as I'm able to track it, it was a refactoring in bpo-19009 that is responsible for this regression (rev 49017c391564). I'm adding Kristján, so that he'd have a look at the attached fix and test. |
Thanks for the report and the patch. It looks okay as far as it goes, but I think there are other related bugs: ## read1() doesn’t update length either ## >>> handle = urlopen("https://www.python.org/")
>>> len(handle.read1())
7374
>>> handle.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/proj/python/cpython/Lib/http/client.py", line 462, in read
s = self._safe_read(self.length)
File "/home/proj/python/cpython/Lib/http/client.py", line 614, in _safe_read
raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(39583 bytes read, 7374 more expected) ## readline() and read1() blindly read beyond Content-Length ##
I wonder if anyone has considered implementing HTTPResponse by wrapping or subclassing BufferedReader; that way you get well-tested readline() etc functionality for free. The HTTP protocol (Connection: close, Content-Length, chunked decoding) could be handled by an internal RawIOBase.readinto() method. |
Here is the updated patch. I only included the additional fix for read1 since readlines is not overwritten in the HTTPConnection. Not sure how to write test for it, does it need a much longer body (compared to the one in tests) to produce this behaviour? The larger refactoring might be appropriate for 3.6, but I believe these smaller fixed could go into the next micro release. |
Yes I agree it would be best to fix the bug(s) without major refactoring if practical. To fix the other bug, I think the parameters in read1(n) and readline(limit) need to be capped at self.length. The inherited BufferedIOBase.readlines() implementation should just call readline() multiple times. The point is that if you call readline() when self.length is reduced to zero, it should behave like EOF, rather than waiting for more data from the socket. To test this, you might be able to use the FakeSocket class. You might have to add a second HTTP response to the data, and test that readline() etc doesn’t read any of the second response. |
All the highlighted issue are now fixed. The limit on n in read1 wasn't tested. Your suggestion regarding testing went a bit over my head, Martin. So, just trying to make sure we're on the same page. ExtendedReadTest, where I thought placing these new tests, is already employing FakeSocket, but I'm not sure how one would add a second response and to what. In any case, some of the code in that class seems rather specific, so it's not clear if it could or should be reused. |
To add a second response you would just concatenate it to the existing response. (Your existing test already uses FakeSocket.) The FakeSocket parameter just represents data that would be sent from the server. So: body = (
b"HTTP/1.1 200 OK\r\n"
b"Content-Length: 3\r\n"
b"\r\n"
b"abc"
b"HTTP/1.1 408 Next response should not be read yet\r\n"
) In fact, see the BasicTest.test_content_length_sync() case, which literally has "extradata" as that last line. I think we just need to adapt or duplicate this test to cover readline() and read1(), not just read(). Maybe also with read1(100), readline(100) cases as well. |
OK, here is the patch including the tests that seem to exercise the behaviour. |
Thanks Silent Ghost, this patch looks pretty good. I did leave a couple more comments. Also I just realized that HTTPResponse.readline() and read1() are not really documented. The BufferedIOBase support was meant to be added in 3.5, although readline() was supposed to be supported earlier for urlopen() compatibility. |
Updated patch addresses the rietveld comments. |
Here is a modified version of the tests that I propose to commit soon. I trimmed the size of the tests down and simplified some of them. I also mentioned that BufferedIOBase is supported in the documentation. |
New changeset e95e9701b472 by Martin Panter in branch '3.5': New changeset 74b3523d6256 by Martin Panter in branch 'default': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: