Accept header parameters in RFC 2231 format. #869

wants to merge 3 commits into


None yet

2 participants

gagern commented Aug 7, 2013

This should fix tornadoweb/tornado#868. Someone should test whether this works as intended with Python 2, though. In gagern/urllib3@e11e036 I had to encode the input from unicode to bytes for this to work. Not sure why, though, and also not sure whether this might be handled in some other location. Python 3 accepts this modification without issues.

@bdarnell bdarnell and 1 other commented on an outdated diff Aug 24, 2013
for p in parts:
i = p.find('=')
if i >= 0:
name = p[:i].strip().lower()
value = p[i + 1:].strip()
- if len(value) >= 2 and value[0] == value[-1] == '"':
+ if len(value) >= 2 and value[0] == value[-1] == '"' and False:
bdarnell Aug 24, 2013 tornadoweb member

What's up with this "and False"?

gagern Aug 28, 2013

Whoops! Looks like something I added as a temporary hack during debugging, and then accidentially left in place. This explains why I had two places stripping quotation marks: this one and my own line 452. I guess I'd better remove lines 442 through 444. My code doesn't provide a replacement for line 444, though. That came from f249ee3, which copied it from cpython, where it was added in But that line can be safely removed, since email.utils will remove quotes and later on add quotes without escaping existing ones. To play it safe, I'll include backslashes and quotation marks in my doctest.

bdarnell commented Sep 1, 2013

This change still doesn't work on python 2 - it works in isolation, but when I added a test case that went through parse_multipart_form_data it failed because email.utils.collapse_rfc2231_value apparently only works if the input is a byte string. parse_multipart_form_data decodes the input as utf8.

While researching to see if that decode call could be removed, I found that HTML5 explicitly forbids RFC 2231 encoding:
So it looks like we need to decode both raw utf8 and rfc2231 (do you know of any clients actually doing this?). This could be done by encoding the unicode header back to ascii bytes (there shouldn't be any non-ascii data if the client is using rfc2231) before decoding the 2231 data, but I'm uncomfortable with this much back-and-forth between our code and the email library. If the email module isn't a turnkey solution it may be better to just ship our own implementation instead of risking subtle differences between python versions.

gagern commented Sep 2, 2013

This HTML5 aspect is interesting; I hadn't known about it. It's kind of unfortunate that there is one RFC advocating one approach and another draft stating something different. I guess following Postel's law, a server implementation should try to handle either. I found two more RFCs, namely RFC 5987 and RFC 6266, which discuss this encoding in the context of HTTP and specifically the Content-Disposition header. Using these as the basis of an implementation might be easier since they only allow a subset of RFC 2231 features.

I must confess that I know no details about what various clients do, one way or the other. I should probably set up some test site and give things a try, but don't have the time for that just now. In the meantime, has some overview over what major browsers do in accepting such headers. It points at for creators of these, but it is hard to tell the current state of affairs from that. One thing which is useful is a link to, a LGPL-licensed python implementation for this kind of task. Perhaps tornado can use this, or build on it? Haven't looked at in detail, particularly not with the HTML5 requirements in mind.

bdarnell commented Sep 2, 2013

There are two distinct uses of Content-disposition, as an HTTP response header and within a multipart/form-data body. Tornado here is concerned with the latter, so I don't think RFCs 5987 and 6266 necessarily apply. The difference is important because HTTP headers are not allowed to contain arbitrary unicode (because the original standard specified ISO 8859-1), similar to email's restriction to 7-bit ascii. There does not appear to be such a formal restriction for multipart/form-data (as defined in RFC 2388) - Non-ascii data "might" be encoded according to RFC 2231, but this does not appear to be mandatory (and the HTML5 spec suggests that it's most common to just use UTF8 directly).

@gagern gagern referenced this pull request in shazow/urllib3 Jan 4, 2014

Consider HTML 5 draft for multipart/form-data #303

@bdarnell bdarnell added the httpserver label Jul 16, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment