Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Accept header parameters in RFC 2231 format. #869

Open
wants to merge 3 commits into from

2 participants

@gagern

This should fix tornadoweb/tornado#868. Someone should test whether this works as intended with Python 2, though. In gagern/urllib3@e11e036 I had to encode the input from unicode to bytes for this to work. Not sure why, though, and also not sure whether this might be handled in some other location. Python 3 accepts this modification without issues.

tornado/httputil.py
((20 lines not shown))
for p in parts:
i = p.find('=')
if i >= 0:
name = p[:i].strip().lower()
value = p[i + 1:].strip()
- if len(value) >= 2 and value[0] == value[-1] == '"':
+ if len(value) >= 2 and value[0] == value[-1] == '"' and False:
@bdarnell Owner

What's up with this "and False"?

@gagern
gagern added a note

Whoops! Looks like something I added as a temporary hack during debugging, and then accidentially left in place. This explains why I had two places stripping quotation marks: this one and my own line 452. I guess I'd better remove lines 442 through 444. My code doesn't provide a replacement for line 444, though. That came from f249ee3, which copied it from cpython, where it was added in http://hg.python.org/cpython/rev/846ba2972d7c. But that line can be safely removed, since email.utils will remove quotes and later on add quotes without escaping existing ones. To play it safe, I'll include backslashes and quotation marks in my doctest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@bdarnell
Owner

This change still doesn't work on python 2 - it works in isolation, but when I added a test case that went through parse_multipart_form_data it failed because email.utils.collapse_rfc2231_value apparently only works if the input is a byte string. parse_multipart_form_data decodes the input as utf8.

While researching to see if that decode call could be removed, I found that HTML5 explicitly forbids RFC 2231 encoding:
http://www.w3.org/html/wg/drafts/html/master/forms.html#multipart-form-data
So it looks like we need to decode both raw utf8 and rfc2231 (do you know of any clients actually doing this?). This could be done by encoding the unicode header back to ascii bytes (there shouldn't be any non-ascii data if the client is using rfc2231) before decoding the 2231 data, but I'm uncomfortable with this much back-and-forth between our code and the email library. If the email module isn't a turnkey solution it may be better to just ship our own implementation instead of risking subtle differences between python versions.

@gagern

This HTML5 aspect is interesting; I hadn't known about it. It's kind of unfortunate that there is one RFC advocating one approach and another draft stating something different. I guess following Postel's law, a server implementation should try to handle either. I found two more RFCs, namely RFC 5987 and RFC 6266, which discuss this encoding in the context of HTTP and specifically the Content-Disposition header. Using these as the basis of an implementation might be easier since they only allow a subset of RFC 2231 features.

I must confess that I know no details about what various clients do, one way or the other. I should probably set up some test site and give things a try, but don't have the time for that just now. In the meantime, http://greenbytes.de/tech/tc2231/ has some overview over what major browsers do in accepting such headers. It points at http://trac.tools.ietf.org/wg/httpbis/trac/wiki/ContentDispositionProducers for creators of these, but it is hard to tell the current state of affairs from that. One thing which is useful is a link to https://github.com/g2p/rfc6266, a LGPL-licensed python implementation for this kind of task. Perhaps tornado can use this, or build on it? Haven't looked at in detail, particularly not with the HTML5 requirements in mind.

@bdarnell
Owner

There are two distinct uses of Content-disposition, as an HTTP response header and within a multipart/form-data body. Tornado here is concerned with the latter, so I don't think RFCs 5987 and 6266 necessarily apply. The difference is important because HTTP headers are not allowed to contain arbitrary unicode (because the original standard specified ISO 8859-1), similar to email's restriction to 7-bit ascii. There does not appear to be such a formal restriction for multipart/form-data (as defined in RFC 2388) - Non-ascii data "might" be encoded according to RFC 2231, but this does not appear to be mandatory (and the HTML5 spec suggests that it's most common to just use UTF8 directly).

@gagern gagern referenced this pull request in shazow/urllib3
Open

Consider HTML 5 draft for multipart/form-data #303

@bdarnell bdarnell added the httpserver label
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
This page is out of date. Refresh to see the latest.
Showing with 18 additions and 6 deletions.
  1. +18 −6 tornado/httputil.py
View
24 tornado/httputil.py
@@ -420,23 +420,35 @@ def _parseparam(s):
def _parse_header(line):
- """Parse a Content-type like header.
+ r"""Parse a Content-type like header.
Return the main content-type and a dictionary of options.
+ >>> d = "CD: fd; foo=\"b\\\\a\\\"r\"; file*=utf-8''T%C3%A4st"
+ >>> d = _parse_header(d)[1]
+ >>> d['file'] == r'T\u00e4st'.encode('ascii').decode('unicode_escape')
+ True
+ >>> d['foo']
+ 'b\\a"r'
"""
parts = _parseparam(';' + line)
key = next(parts)
- pdict = {}
+ # decode_params treats first argument special, but we already stripped key
+ params = [('Dummy', 'value')]
for p in parts:
i = p.find('=')
if i >= 0:
name = p[:i].strip().lower()
value = p[i + 1:].strip()
- if len(value) >= 2 and value[0] == value[-1] == '"':
- value = value[1:-1]
- value = value.replace('\\\\', '\\').replace('\\"', '"')
- pdict[name] = value
+ params.append((name, value))
+ params = email.utils.decode_params(params)
+ params.pop(0) # get rid of the dummy again
+ pdict = {}
+ for name, value in params:
+ value = email.utils.collapse_rfc2231_value(value)
+ if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
+ value = value[1:-1]
+ pdict[name] = value
return key, pdict
Something went wrong with that request. Please try again.