[http.client] HTTPConnection.request does not support "chunked" Transfer-Encoding to send data #56528
Comments
Hi, HTTPConnection.putrequest does not support "chunked" Transfer-Encoding for sending data. For example, I can't do a PUT request with chunked transfer. Regards,
What's the use case? Do you have an iterable that yields data whose size is unknown? AFAIK, most web servers don't even support chunked uploads. (Removing Python 2.7 from versions as this is clearly a feature request.)
I use http.client in a WebDAV client. The Mac OS X Finder WebDAV client performs all of its requests in "chunked" mode: PUT and GET. Here, I use http.client to simulate the Mac OS X Finder WebDAV client. Regards,
harobed wrote:
Now I'm confused. Per the HTTP specification, GET requests don't have a body. Are you sure you aren't confusing this with the response that the server sends?
Sorry, yes: GET requests have "Transfer-Encoding: chunked" in the server's response.
We recently added support for chunked transfer encoding for the POST method, which is exposed via a urllib2 wrapper function. PUT is not exposed via urllib2, so users should use httplib. Chunked transfer could be added to PUT by taking the body of the message as an iterable.
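For context, before such support existed a caller had to hand-roll the chunked framing on top of the low-level API, roughly like this (a sketch; the host, path, and data are placeholders):

    import http.client

    conn = http.client.HTTPConnection("example.com")
    conn.putrequest("PUT", "/upload")
    conn.putheader("Transfer-Encoding", "chunked")
    conn.endheaders()
    for chunk in (b"first chunk", b"second chunk"):
        # Each chunk: its size in hex, CRLF, the data, CRLF.
        conn.send("{:X}\r\n".format(len(chunk)).encode("ascii"))
        conn.send(chunk)
        conn.send(b"\r\n")
    conn.send(b"0\r\n\r\n")  # zero-length chunk terminates the body
    response = conn.getresponse()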
I couldn't find what you're talking about.
I'd like to support the request. I have a use case where I definitely need this feature: I maintain a Python client for a scientific metadata catalogue, see 1 for details. The client also features the upload of the data files. The files may come in as a data stream from another process, so my client takes a file-like object as input. The files may be large (several GB), so buffering them is not an option; they must get streamed to the server as they come in. Therefore, there is no way to know the Content-Length of the upload beforehand. I implemented chunked transfer encoding in a custom module that monkey-patches the library, see the attached file. This works fine, but of course it's an awkward hack, as I must meddle deeply in the internals of urllib and http.client to get this working. This module is tested to work with Python 2.7, 3.1, 3.2, 3.3, and 3.4 (for Python 3 you need to pass it through 2to3 first). I really would like to see this feature in the standard library in order to get rid of this hack in my package. I would be happy to transform my module into a patch to the library if such a patch had a chance of being accepted.
The truth is http in stdlib is dead.
Thanks for the notice! As far as I read from the link you cite, getting rid of the current httplib in urllib3 is planned but far from being done. Furthermore, I don't like packages with too many 3rd party dependencies. Since my package is working fine with the current standard lib, even though using an ugly hack in one place, I'd consider switching to urllib3 as soon as the latter makes it into the standard lib. I still believe that adding chunked transfer encoding to http.client and urllib in the current standard lib would only require a rather small change that can easily be done such that the lib remains fully compatible with existing code. Still waiting for feedback on whether such a patch would be welcome.
One interesting question is how to convey data to the chunked encoder. There are two sets of options in my mind, a pull interface:
and a push interface:
The advantage of the push interface is that you could fairly easily feed data from an iterable or file reader into it simply by just doing shutil.copyfileobj() or equivalent. But to adapt the pull interface to a push interface would require “asyncio” support or a separate thread or something to invert the flow of control. So I think building the encoder with a push interface would be best. Rolf’s ChunkedHTTPConnectionMixin class appears to only support the pull interface (iterables and stream readers). I would welcome support for chunked uploading in Python’s “http.client” module, especially with push or stream writer support. I don’t think overwriting _send_request should be necessary; just call putrequest(), putheader() etc manually, and then call send() for each chunk. Perhaps there is scope for sharing the code with the “http.server” module (for encoding chunked responses).
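To make the two options concrete, here is a rough sketch (an editor's illustration; start_request() is hypothetical, and only the pull style resembles what eventually landed):

    import http.client

    conn = http.client.HTTPConnection("example.com")  # placeholder host

    # Pull interface: the caller hands the library a data source and the
    # library drives the transfer by iterating over it (or reading from it).
    def chunks():
        yield b"first"
        yield b"second"
    conn.request("PUT", "/upload", body=chunks())

    # Push interface (hypothetical API): the caller drives the transfer,
    # writing each piece of data as it becomes available.
    # writer = conn.start_request("PUT", "/upload")
    # shutil.copyfileobj(source, writer)   # or repeated writer.write(chunk)
    # writer.close()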
The design goal for my implementation was compatibility. My version can be used as a drop-in replacement for the existing urllib HTTPHandler. The only thing that needs to be changed in the calling code is that it must call build_opener() to select the chunked handler in place of the default HTTPHandler. After this, the calling code can use the returned opener in the very same way as usual. I guess switching to a push interface would require major changes in the calling code. In principle, you could certainly support both schemes at the same time: you could change the internal design to a push interface and then wrap it in a pull interface for compatibility with existing code. But I'm not sure whether this would be worth the effort. If, as Piotr suggests, the current urllib is going to be replaced by urllib3, then I guess it's questionable whether it makes sense to add major design changes that are incompatible with existing code to the current standard lib.
I've attached a patch that implements full Transfer-Encoding support for requests as specified in RFC 7230.
I hit "submit" a little too soon. The intention of the patch is to adhere to all aspects of Transfer-Encoding as specified in the RFC and to make best guesses as to encoding that should be used based on the data type of the given body. This will break backwards compatibility for cases where users are manually chunking the request bodies prior to passing them in and explicitly setting the Transfer-Encoding header. Additionally, if Transfer-Encoding was previously specified, but not chunked, the patch will automatically chunk the body. Otherwise, the patch should only be additive. |
Also note that this patch includes the changes in bpo-23350 as it's directly relevant.
FWIW, so far I've tested this change against: cherrypy 3.6.0. The chunked body works as expected. Unfortunately, all implementations seem to be ignorant of the trailer part. So it seems that although RFC-compliant (and I can definitely see the use case for it), the trailer implementation may not be overly practical. I still think that it's worthwhile keeping it, but perhaps adding a note that it may not be supported at this point. Relevant gists: https://gist.github.com/demianbrecht/3fd60994eceeb3da8f13
After sleeping on this, I think that the best route to go would be to drop the trailer implementation (if it's not practical it doesn't belong in the standard library). Also, to better preserve backwards compatibility it may be better to circumvent the automatic chunking if transfer-encoding headers are present in the request call. That way, no changes would need to be made to existing code that already supports it at a higher level.
Updated patch changes the following:
+ Removes support for trailers in requests as they're not supported
I left a few comments on Rietveld, mainly about the documentation and API design. However, I understand Rolf specifically wanted chunked encoding to work with the existing urlopen() framework, at least after constructing a separate opener object. I think that should be practical with the existing HTTPConnection implementation. Here is some pseudocode of how I might write a urlopen() handler class, and an encoder class that should be usable for both clients and servers:

class ChunkedHandler(...):
    def http_request(...):
        # Like AbstractHTTPHandler, but don’t force Content-Length

    def default_open(...):
        # Like AbstractHTTPHandler, but instead of calling h.request():
        encoder = ChunkedEncoder(h.send)
        h.putrequest(req.get_method(), req.selector)
        for item in headers:
            h.putheader(*item)
        h.putheader("Transfer-Encoding", encoder.encoding)
        h.endheaders()
        shutil.copyfileobj(req.data, encoder)
        encoder.close()

class ChunkedEncoder(io.BufferedIOBase):
    # Hook output() up to either http.client.HTTPConnection.send()
    # or http.server.BaseHTTPRequestHandler.wfile.write()
    encoding = "chunked"

    def __init__(self, output):
        self.output = output

    def write(self, b):
        self.output("{:X}\r\n".format(len(b)).encode("ascii"))
        self.output(b)
        self.output(b"\r\n")

    def close(self):
        self.output(b"0\r\n\r\n")
Thanks for the review, Martin.
The original issue was that http.client doesn't support chunked encoding. With this patch, chunked encoding should more or less come for free with urlopen. There's absolutely no reason why HTTPConnection should not support transfer encoding out of the box, given it's part of the HTTP 1.1 spec. I do understand that some modifications are needed in urllib.request in order to support the changes here, but I didn't include those in the patch so as not to conflate it. I think it's also reasonable to open a new issue to address the feature in urllib.request rather than piggy-backing on this one.
Perhaps you should make a table of some potential body object types, and figure out what the behaviour should be for request() with and without relevant headers, and endheaders() and send() with and without encode_chunked=True:
Potential body types:
This list could go on and on. I would rather have a couple of simple rules, or explicit APIs for the various modes, so you don’t have to guess which mode your particular body type will trigger.
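For reference, the simple rules that eventually shipped with this feature in Python 3.6 can be summarised roughly as follows (an editor's reconstruction of the final behaviour, not part of the original discussion; the automatic framing only applies when the caller supplies neither a Content-Length nor a Transfer-Encoding header):

    body argument            framing applied by the library
    -------------            ------------------------------
    None                     no body; Content-Length: 0 for methods that
                             expect a body (PUT, POST, PATCH)
    str                      encoded as ISO-8859-1; Content-Length set
    bytes-like object        sent as-is; Content-Length set
    file object or iterable  Transfer-Encoding: chunked
    explicit framing header  body sent as-is; the caller is responsible
    supplied by the caller   for correct framing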
Agreed. However, I'm wondering if that should belong in a new issue geared towards further clarifying the behaviour of request body types. The patch introduces the behaviour this specific issue was looking for, with the largest change being that iterators may now result in chunked transfer encoding for data already handled by the library. I'd rather move forward with incremental improvements than keep building on each issue before the work's merged (there's still a /good/ amount of work to be done in this module).
The incremental improvement thing sounds good. Here are some things which I think are orthogonal to sensible chunked encoding:
What's the status on this one? It looks like some review comments need addressing.
That's about it. Unfortunately I've been pretty tied up over the last month and a bit. I'll try to hopefully close this out over the next few days.
Latest patch should address all outstanding comments.
BTW, thanks for the reviews Martin and the nudge Antoine!
Yes, when I say “text” I mean str objects, as opposed to byte strings, etc. I wonder if it would be safer to test for TextIOBase. With the read(0) hack, the zero size may be unexpected. A file object may need special handling to generate an empty result, or to not interpret it as an EOF indicator. See e.g. bpo-23804, about zero-sized SSL reads, where both of these problems happened. As long as we document that the text file logic has changed, I think it would be okay. IMO it is better to keep things simple and robust for byte files than add more hacks for text files.

A common base class like HTTPException is useful when a caller may need to handle a set of exceptions in a common way. But a problem with Content-Length or Transfer-Encoding seems like a programming error, so ValueError seems fine IMO. However, I still don’t see why the checks are worthwhile. Sure, having both is a technical violation of the current HTTP protocols. But if a caller is likely to end up in one of these situations, it may indicate a problem with the API that we should fix instead. If the checks are unlikely to ever fail, then drop them.

Otherwise, I don’t think any of the bugs I pointed out have been addressed.

I don’t like converting bytes-like objects to bytes. I understand one of the main purposes of accepting bytes-like objects is to avoid copying them, but converting to bytes defeats that. Also, the copying is inconsistent with the lack of copying in _read_iterable().

I am wondering if it would be best to keep new logic out of send(); just put it into _send_output() instead. Once upon a time, send() was a simple wrapper around socket.sendall() without much bloat. But with the current patch, in order to send a single buffer, we have to jump through a bunch of checks, create a dummy lambda and singleton tuple, loop over the singleton, call the lambda, realize there is no chunked encoding, etc. IMO the existing support for iterables should have been put in _send_output() too (when that was an option). There is no need to use send() to send the whole body when endheaders() already supports that.
Martin,
The problem is that there is no accepted definition of what a file-like object is. As a consequence, each test method will fail in some cases. Not all file-like objects are derived from the base classes in the io module. It is common practice to simply define an object with a read() method and to consider this file-like. This pattern can even be found in the official Python test suite. Testing for TextIOBase will fail in these cases. There seems to be at least a minimal consensus that each file-like object (that supports reading) must have a read() method implementing the interface as defined by the io base classes. The read(0) test (I wouldn't call it a hack, as it only uses this documented API) has the advantage of making no other assumptions than this. And I would not even accept bpo-23804 as a counter-example, as that was clearly a bug in the ssl module. The current test using the mode attribute is certainly the worst one, as it even fails for standard classes from the io module. Anyway, even though I would prefer the read(0) test, I am happy to accept any other method. Please tell me what I should use.
The question is whether we should apply sanity checks on the caller's input or whether we should rather send bullshit out and wait for the request to fail. I'd rather give the caller a somewhat helpful error message than a less helpful "400 Bad Request" from the server. But if you really insist, I can also remove these checks. I don't understand your remark on the API and what the likelihood of rubbish header values set by the caller has to do with it, sorry.
Could you elaborate on what bugs you are talking about? I really don't know, sorry.
I added this conversion because you mentioned bpo-27340 and the problem of sending bytes-like objects with the ssl module. I agree to leave it to bpo-27340 to find a better solution and will remove the conversion.
No problem, I can move all this to _send_output(). The only problem I see then is that the current library version of send() already accepts file-like objects (both text and byte streams), bytes, and iterables. So we will end up with two different methods accepting this variety of data types and behaving differently in the respective cases. Maybe, as a consequence, send() should be restricted to accept only bytes in order to tighten consistency. But this is not in the scope of this issue and will not be covered by my patch. Please note that I will be on holiday starting next Monday. We should try to get this done during this week.
Ok, here comes yet another version of the patch, addressing the review comments. As indicated in the last post and requested by Martin, I moved the logic dealing with the different body types out of send() into _send_output(). I also removed the conversion from bytes-like to bytes. I had some problems with some comments, mainly on the documentation of urllib.request, asking for stuff unrelated to chunked transfer encoding and the present issue. I have the impression this is starting to get out of hand. Time is running out and I suggest we keep the focus on the present issue. There were also requests to drop support for things that work in the current library, although not documented, and that are explicitly covered by the test suite. I'm fine with dropping that and have also adapted the tests accordingly. But please be sure that what you are asking for is really what you want. Please note that this patch is not yet complete. I still need your decision on the two issues raised in the last post: the test method to use for text streams, and whether I should drop the sanity checks for the Transfer-Encoding header.
I just wrote:
Sorry, please disregard this comment. I just realized that these tests were added by previous versions of the patch by Demian.
(Assigning this to Martin.) I really like this patch, and I support landing this before the feature freeze of 3.6. As long as the API presented to users is simple, the changes internal to _send_output should not be a concern. Yeah, I agree that multiple things need to be taken care of, but that is all right. Also, +1 to validating Transfer-Encoding usage as the patch does. This is along the lines of the RFC recommendation, and it is better to fail early than to let the server implementation be the decider. I would like to know the better suggestion for "testing text" too. (Adding haypo to steer us here.) Although, I would point out that the static method has the generic name _is_textIO while relying on a .read attribute that is checked outside the static method.
When I read the last patch, I noticed a mutable default value on the HTTPConnection.request method. I have opened an issue with an attached patch: http://bugs.python.org/issue27728
_is_textIO(): I’m sorry, but I still prefer the TextIOBase check over read(0). Each option has a disadvantage, but with TextIOBase, the disadvantage only affects text files, not byte files. Maybe another option is to stick with the current checking of the “mode” attribute, or are its failings significant for chunked encoding? It seems to me that the most important use cases would be reading bytes from a pipe file, a custom bytes file object reader, a generator, etc., but not a text file without a “mode” attribute.

Checking Content-Length and Transfer-Encoding: I would prefer to remove this, unless there is a “good” reason to keep them. I want to understand why the three specific checks were added, and what made them more special than other potential checks (e.g. the format of the Content-Length value, or that the resulting body matches it). If the reason is something like “it is too easy for the caller to accidentally trigger the problem”, then there may be another way to fix it. Or maybe it is a mitigation for a security problem? But at the moment, it just seems to me like code being too smart for its own good.

I mentioned a few possible bugs with parsing Transfer-Encoding at <https://bugs.python.org/review/12319/diff/14900/Lib/http/client.py#newcode1210>. The format of the Transfer-Encoding value is specified at <https://tools.ietf.org/html/rfc7230#section-3.3.1> and <https://tools.ietf.org/html/rfc7230#section-4>. In bpo-23498, I identified some parts of Python that parse header values like this, although it looks like http.cookiejar.split_header_words() may be the only one usable for Transfer-Encoding.
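As a rough illustration of that parsing (an editor's sketch using the function mentioned above; it ignores transfer-parameters and error handling):

    import http.cookiejar

    value = "gzip, chunked"
    # split_header_words() splits the value at commas into elements, each a
    # list of (token, value) pairs; the first token of each element is the
    # transfer-coding name.
    codings = [element[0][0].lower()
               for element in http.cookiejar.split_header_words([value])]
    assert codings == ["gzip", "chunked"]
    # RFC 7230 requires "chunked" to be the final coding applied to a request.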
Some of the header field validation was included in Demian’s first patch, raising HTTPEncodingError. But I don’t know why he included it. Maybe it was more appropriate then; it looks like that patch took a different approach than the current encode_chunked=True mechanism.
Martin,
Ok, this is a valid argument.
No, this is not significant for chunked encoding. As I said, I am happy to accept any method. But I would say it's a fairly safe bet that any file-like object that has the mode attribute is in fact derived from the base classes in the io module. If this assumption holds, then the TextIOBase check will always work in all cases where the mode attribute check would. The inverse is not true. So I'll use the TextIOBase check; I hope you agree.
Ok, I'll drop them.
I admit I didn't read all the comments on Demian's versions of the patch. Since I'll drop the checks, I guess these problems are gone now.
I don't think this is in the current versions of the patch any more.
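The check being converged on here is essentially a one-line isinstance() test; a minimal sketch matching the approach described (not copied from the patch):

    import io

    def _is_textIO(stream):
        # Text streams built on the io module derive from io.TextIOBase;
        # everything else is treated as a byte stream.
        return isinstance(stream, io.TextIOBase)

For example, a file opened in text mode yields True, while one opened in binary mode ("rb") or an io.BytesIO() yields False.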
Ok, here comes the next version of the patch. I made the changes discussed in the last post and addressed the review comments. Looks like we are converging towards a final version.
I am pretty happy with the latest patch. I left one comment. I will try to give it a more thorough review and actually test it out at some point, but I don’t anticipate any major problems.
Patch 12 has the following changes:
I think there are a few of my comments on Demian’s patches that are still relevant, but they are just opportunities for improving the code (e.g. making the chunking iteration more efficient and readable). This version is hopefully “good enough”.
New changeset 0effb34f3276 by Martin Panter in branch 'default':
Thank you and great work everyone involved with this change. Regards.
Rolf, just a note that I had to remove some trailing spaces on various continued lines in the Python code before it would let me push this. Other tweaks I made:
The Windows buildbots fail the test sending a pipe with urlopen(): http://buildbot.python.org/all/builders/AMD64%20Windows8.1%20Non-Debug%203.x/builds/1166/steps/test/logs/stdio

Traceback (most recent call last):
  File "D:\buildarea\3.x.ware-win81-release\build\lib\test\test_urllib2.py", line 970, in test_http_body_pipe
    self.assertEqual(newreq.get_header('Content-length'), None)
AssertionError: '0' != None

I cannot figure out why exactly. My experiments with Wine suggest that tell() on a BufferedReader wrapping a pipe raises OSError, so I would expect _get_content_length() to return None, yet the test failure shows that Content-Length is set to zero. Is anyone able to investigate this on Windows? Will tell() and seek() reliably fail on a pipe or other unseekable file?
New changeset b004da19b869 by Martin Panter in branch 'default':
First of all, thanks to everyone for the great work, in particular to you, Martin, for your patience! Martin, on your changes: as mentioned earlier, I'm on holiday right now, so I haven't had a close look yet. But I tested my application and it works, so I guess I'm happy with it. I don't use Windows, so I can't comment on the test_http_body_pipe issue, sorry.
No, they will not reliably fail. The file position is stored in the FILE_OBJECT CurrentByteOffset. This value can be queried and set using the WinAPI function SetFilePointerEx, which is implemented via the system calls NtQueryInformationFile and NtSetInformationFile with the information class FilePositionInformation.

For files opened in synchronous mode, which is the only mode the CRT supports, there's a fast path in the I/O manager to bypass the I/O request packet (IRP) path that would normally call the device driver. This fast-path code trivially updates the CurrentByteOffset field, regardless of the file type. In synchronous mode, a seek will only fail if the file is opened without buffering and the new position isn't sector-aligned, as is documented for SetFilePointerEx and NtSetInformationFile: FilePositionInformation.

Prior to Windows 8, seek() and tell() will also fail for a console handle because it's not actually a kernel File handle that can be used with I/O system calls. OTOH, Windows 8+ uses the ConDrv device driver, so console handles trivially allow seek() and tell() to succeed, though the console itself ignores the CurrentByteOffset in the File object. Console screen-buffer files have a positionable cursor, which is similar but not quite the same.

A fun fact is that for files opened in asynchronous mode, the I/O manager handles FilePositionInformation with a regular device driver call using an IRP_MJ_QUERY_INFORMATION or IRP_MJ_SET_INFORMATION packet. It probably does this instead of using its fast path because there's no inherent meaning to the file position in asynchronous mode. I played around with this a bit with asynchronous named pipes. For setting FilePositionInformation, the NamedPipe filesystem driver always returns STATUS_INVALID_PARAMETER. However, querying the file 'position' returns the number of unread bytes in the queue for the handle's end of the pipe. This can only be non-zero if the handle allows reading, which depends on whether the pipe was created as duplex, inbound, or outbound. Querying this requires the NT API because SetFilePointerEx always calls NtSetInformationFile and fails.
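To make the failure mode concrete, here is a sketch of the kind of seek()/tell() length probe under discussion (an editor's illustration, not the library's actual _get_content_length()). As explained above, on Windows these calls can succeed on a pipe with meaningless values, so the probe may return a bogus length instead of None:

    import os

    def probe_content_length(body):
        # Measure a file-like body by seeking to the end and restoring
        # the position; return None if the object is detectably unseekable.
        try:
            start = body.tell()
            end = body.seek(0, os.SEEK_END)
            body.seek(start)
            return end - start
        except (AttributeError, OSError):
            return None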
Thank you, Eryksun, for the detailed explanation. Unfortunately, that means that uploading an unseekable file via urllib.request, or via http.client, isn’t going to work by default on Windows. I see a couple of workarounds with the current code:
But I never liked how so much behaviour depends on specific details of the upload body object anyway. Perhaps this is an opportunity to change the behaviour so that a file object always triggers chunked encoding. We could add a note to <https://docs.python.org/3.6/whatsnew/3.6.html#changes-in-the-python-api> to warn that uploading files to HTTP 1.0 servers will now require Content-Length to be manually specified.
Here is my attempt to drop automatic Content-Length with files. I also dropped the special handling of zero-length iterables, added some tests for that, and removed some unused and misleading bits from other tests.
New changeset 216f5451c35a by Martin Panter in branch 'default':
I committed my patch, so now file bodies are chunk-encoded by default. The Windows buildbots pass the relevant tests again. This does mean that old code that uploaded a file to an HTTP-1.0-only server, and relied on Python automatically setting Content-Length, will no longer work in 3.6. Let me know if you think this is a serious problem, and maybe we can find another solution.
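To illustrate the committed behaviour (a hedged example; the URL and command are placeholders), an unseekable stream can now be uploaded through urlopen() and is sent with Transfer-Encoding: chunked automatically:

    import subprocess
    import urllib.request

    # Any unseekable binary stream will do; a pipe from a subprocess is typical.
    proc = subprocess.Popen(["cat", "data.txt"], stdout=subprocess.PIPE)
    req = urllib.request.Request("http://example.com/upload",
                                 data=proc.stdout, method="PUT")
    with urllib.request.urlopen(req) as resp:
        print(resp.status)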