Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accept header parameters in RFC 2231 format. #869

Closed
wants to merge 3 commits into from

Conversation

Projects
None yet
2 participants
@gagern
Copy link
Contributor

commented Aug 7, 2013

This should fix #868. Someone should test whether this works as intended with Python 2, though. In gagern/urllib3@e11e036 I had to encode the input from unicode to bytes for this to work. Not sure why, though, and also not sure whether this might be handled in some other location. Python 3 accepts this modification without issues.

gagern added some commits Aug 7, 2013

for p in parts:
i = p.find('=')
if i >= 0:
name = p[:i].strip().lower()
value = p[i + 1:].strip()
if len(value) >= 2 and value[0] == value[-1] == '"':
if len(value) >= 2 and value[0] == value[-1] == '"' and False:

This comment has been minimized.

Copy link
@bdarnell

bdarnell Aug 24, 2013

Member

What's up with this "and False"?

This comment has been minimized.

Copy link
@gagern

gagern Aug 28, 2013

Author Contributor

Whoops! Looks like something I added as a temporary hack during debugging, and then accidentially left in place. This explains why I had two places stripping quotation marks: this one and my own line 452. I guess I'd better remove lines 442 through 444. My code doesn't provide a replacement for line 444, though. That came from f249ee3, which copied it from cpython, where it was added in http://hg.python.org/cpython/rev/846ba2972d7c. But that line can be safely removed, since email.utils will remove quotes and later on add quotes without escaping existing ones. To play it safe, I'll include backslashes and quotation marks in my doctest.

@bdarnell

This comment has been minimized.

Copy link
Member

commented Sep 1, 2013

This change still doesn't work on python 2 - it works in isolation, but when I added a test case that went through parse_multipart_form_data it failed because email.utils.collapse_rfc2231_value apparently only works if the input is a byte string. parse_multipart_form_data decodes the input as utf8.

While researching to see if that decode call could be removed, I found that HTML5 explicitly forbids RFC 2231 encoding:
http://www.w3.org/html/wg/drafts/html/master/forms.html#multipart-form-data
So it looks like we need to decode both raw utf8 and rfc2231 (do you know of any clients actually doing this?). This could be done by encoding the unicode header back to ascii bytes (there shouldn't be any non-ascii data if the client is using rfc2231) before decoding the 2231 data, but I'm uncomfortable with this much back-and-forth between our code and the email library. If the email module isn't a turnkey solution it may be better to just ship our own implementation instead of risking subtle differences between python versions.

@gagern

This comment has been minimized.

Copy link
Contributor Author

commented Sep 2, 2013

This HTML5 aspect is interesting; I hadn't known about it. It's kind of unfortunate that there is one RFC advocating one approach and another draft stating something different. I guess following Postel's law, a server implementation should try to handle either. I found two more RFCs, namely RFC 5987 and RFC 6266, which discuss this encoding in the context of HTTP and specifically the Content-Disposition header. Using these as the basis of an implementation might be easier since they only allow a subset of RFC 2231 features.

I must confess that I know no details about what various clients do, one way or the other. I should probably set up some test site and give things a try, but don't have the time for that just now. In the meantime, http://greenbytes.de/tech/tc2231/ has some overview over what major browsers do in accepting such headers. It points at http://trac.tools.ietf.org/wg/httpbis/trac/wiki/ContentDispositionProducers for creators of these, but it is hard to tell the current state of affairs from that. One thing which is useful is a link to https://github.com/g2p/rfc6266, a LGPL-licensed python implementation for this kind of task. Perhaps tornado can use this, or build on it? Haven't looked at in detail, particularly not with the HTML5 requirements in mind.

@bdarnell

This comment has been minimized.

Copy link
Member

commented Sep 2, 2013

There are two distinct uses of Content-disposition, as an HTTP response header and within a multipart/form-data body. Tornado here is concerned with the latter, so I don't think RFCs 5987 and 6266 necessarily apply. The difference is important because HTTP headers are not allowed to contain arbitrary unicode (because the original standard specified ISO 8859-1), similar to email's restriction to 7-bit ascii. There does not appear to be such a formal restriction for multipart/form-data (as defined in RFC 2388) - Non-ascii data "might" be encoded according to RFC 2231, but this does not appear to be mandatory (and the HTML5 spec suggests that it's most common to just use UTF8 directly).

@bdarnell

This comment has been minimized.

Copy link
Member

commented Jun 3, 2018

This is doing the same thing as #1898 (but using different RFC numbers so I didn't catch the duplication before). I'm going to add a test to the newer one and merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.