handleFileUpload with non acsii file name not working properly #150

Closed
cvb opened this Issue Jul 9, 2012 · 10 comments

Comments

Projects
None yet
5 participants
@cvb

cvb commented Jul 9, 2012

I'm trying to upload file, which name is utf8 encoded string, and getting string quoted with ".

I found that it can't be properly parsed by Snap.Internal.Parsing.pQuotedString, so as a workaround
I made this patch

cvb/snap-core@e03cb4e

@gregorycollins

This comment has been minimized.

Show comment Hide comment
@gregorycollins

gregorycollins Jul 9, 2012

Owner

The RFC states that all MIME-headers are to be encoded as quoted-printable. If your client library is sending 8-bit headers then it is violating spec. What program are you using to upload?

Owner

gregorycollins commented Jul 9, 2012

The RFC states that all MIME-headers are to be encoded as quoted-printable. If your client library is sending 8-bit headers then it is violating spec. What program are you using to upload?

@cvb

This comment has been minimized.

Show comment Hide comment
@cvb

cvb Jul 9, 2012

I used google chrome 20.0.1132.34 beta and firefox 10.0.5, result is same

Here is what I get using wireshark

POST /upload/case/56/files HTTP/1.1
Host: localhost:8000
Connection: keep-alive
Content-Length: 213
Origin: http://localhost:8000
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.34 Safari/536.11
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryNnXEMncBxsCgUvip
Accept: /
Referer: http://localhost:8000/
Accept-Encoding: gzip,deflate,sdch
Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4
Accept-Charset: UTF-8,*;q=0.5
Cookie: _remember=0nIfbdirkFlYqBG/W5bzPtRBNUueKLS7bcTU2wDQFXlQAobo/C5qK3JBaBC8OknmjmhE5R3T+r3yYYbQOwhcRTTtoI2Y9AcvPeVTBUvg+ZZcWpNulagr1nJMG8O//lQDon1pDZN1LwIAPTo0AxjWBFVeqWZafbRxF0tzv4k=; _session=boGFdLo2+zzcO1DwYETDeEuR4vtN0plPX/EWgC43ChCjhAjxMmY09fjV146mH8Gi9tzTOPvS68JKkHhpQurkHM+naR++DPhgMshV11fu5G0iDmSe9Yt+Vo0O1//rZBL1FeLaxp/4oE7GfcFC81/tzDaxwit4F3oO/vQwkvoKFg0kvVEm8zti83Wo0kJAzzs=

------WebKitFormBoundaryNnXEMncBxsCgUvip
Content-Disposition: form-data; name="files"; filename="........"
Content-Type: application/octet-stream

111............

And boundary content in hex

00000381 2d 2d 2d 2d 2d 2d 57 65 62 4b 69 74 46 6f 72 6d ------We bKitForm
00000391 42 6f 75 6e 64 61 72 79 4e 6e 58 45 4d 6e 63 42 Boundary NnXEMncB
000003A1 78 73 43 67 55 76 69 70 0d 0a 43 6f 6e 74 65 6e xsCgUvip ..Conten
000003B1 74 2d 44 69 73 70 6f 73 69 74 69 6f 6e 3a 20 66 t-Dispos ition: f
000003C1 6f 72 6d 2d 64 61 74 61 3b 20 6e 61 6d 65 3d 22 orm-data ; name="
000003D1 66 69 6c 65 73 22 3b 20 66 69 6c 65 6e 61 6d 65 files"; filename
000003E1 3d 22 d1 82 d0 b5 d1 81 d1 82 22 0d 0a 43 6f 6e ="...... .."..Con
000003F1 74 65 6e 74 2d 54 79 70 65 3a 20 61 70 70 6c 69 tent-Typ e: appli
00000401 63 61 74 69 6f 6e 2f 6f 63 74 65 74 2d 73 74 72 cation/o ctet-str
00000411 65 61 6d 0d 0a 0d 0a 31 31 31 d0 bb d0 be d1 80 eam....1 11......
00000421 d0 bb d0 be d1 80 0a 0d 0a 2d 2d 2d 2d 2d 2d 57 ........ .------W
00000431 65 62 4b 69 74 46 6f 72 6d 42 6f 75 6e 64 61 72 ebKitFor mBoundar
00000441 79 4e 6e 58 45 4d 6e 63 42 78 73 43 67 55 76 69 yNnXEMnc BxsCgUvi
00000451 70 2d 2d 0d 0a p--..

So filename= 22 d1 82 d0 b5 d1 81 d1 82 22 which is utf8 encoded "тест" in russian.

cvb commented Jul 9, 2012

I used google chrome 20.0.1132.34 beta and firefox 10.0.5, result is same

Here is what I get using wireshark

POST /upload/case/56/files HTTP/1.1
Host: localhost:8000
Connection: keep-alive
Content-Length: 213
Origin: http://localhost:8000
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.34 Safari/536.11
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryNnXEMncBxsCgUvip
Accept: /
Referer: http://localhost:8000/
Accept-Encoding: gzip,deflate,sdch
Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4
Accept-Charset: UTF-8,*;q=0.5
Cookie: _remember=0nIfbdirkFlYqBG/W5bzPtRBNUueKLS7bcTU2wDQFXlQAobo/C5qK3JBaBC8OknmjmhE5R3T+r3yYYbQOwhcRTTtoI2Y9AcvPeVTBUvg+ZZcWpNulagr1nJMG8O//lQDon1pDZN1LwIAPTo0AxjWBFVeqWZafbRxF0tzv4k=; _session=boGFdLo2+zzcO1DwYETDeEuR4vtN0plPX/EWgC43ChCjhAjxMmY09fjV146mH8Gi9tzTOPvS68JKkHhpQurkHM+naR++DPhgMshV11fu5G0iDmSe9Yt+Vo0O1//rZBL1FeLaxp/4oE7GfcFC81/tzDaxwit4F3oO/vQwkvoKFg0kvVEm8zti83Wo0kJAzzs=

------WebKitFormBoundaryNnXEMncBxsCgUvip
Content-Disposition: form-data; name="files"; filename="........"
Content-Type: application/octet-stream

111............

And boundary content in hex

00000381 2d 2d 2d 2d 2d 2d 57 65 62 4b 69 74 46 6f 72 6d ------We bKitForm
00000391 42 6f 75 6e 64 61 72 79 4e 6e 58 45 4d 6e 63 42 Boundary NnXEMncB
000003A1 78 73 43 67 55 76 69 70 0d 0a 43 6f 6e 74 65 6e xsCgUvip ..Conten
000003B1 74 2d 44 69 73 70 6f 73 69 74 69 6f 6e 3a 20 66 t-Dispos ition: f
000003C1 6f 72 6d 2d 64 61 74 61 3b 20 6e 61 6d 65 3d 22 orm-data ; name="
000003D1 66 69 6c 65 73 22 3b 20 66 69 6c 65 6e 61 6d 65 files"; filename
000003E1 3d 22 d1 82 d0 b5 d1 81 d1 82 22 0d 0a 43 6f 6e ="...... .."..Con
000003F1 74 65 6e 74 2d 54 79 70 65 3a 20 61 70 70 6c 69 tent-Typ e: appli
00000401 63 61 74 69 6f 6e 2f 6f 63 74 65 74 2d 73 74 72 cation/o ctet-str
00000411 65 61 6d 0d 0a 0d 0a 31 31 31 d0 bb d0 be d1 80 eam....1 11......
00000421 d0 bb d0 be d1 80 0a 0d 0a 2d 2d 2d 2d 2d 2d 57 ........ .------W
00000431 65 62 4b 69 74 46 6f 72 6d 42 6f 75 6e 64 61 72 ebKitFor mBoundar
00000441 79 4e 6e 58 45 4d 6e 63 42 78 73 43 67 55 76 69 yNnXEMnc BxsCgUvi
00000451 70 2d 2d 0d 0a p--..

So filename= 22 d1 82 d0 b5 d1 81 d1 82 22 which is utf8 encoded "тест" in russian.

@gregorycollins

This comment has been minimized.

Show comment Hide comment
@gregorycollins

gregorycollins Jul 9, 2012

Owner

Just wonderful. http://tools.ietf.org/html/rfc2388 says:

5.4 Non-ASCII field names

Note that MIME headers are generally required to consist only of 7-
bit data in the US-ASCII character set. Hence field names should be
encoded according to the method in RFC 2047 if they contain
characters outside of that set.

....but if browsers are actually breaking spec we need to support that. Awesome.

Owner

gregorycollins commented Jul 9, 2012

Just wonderful. http://tools.ietf.org/html/rfc2388 says:

5.4 Non-ASCII field names

Note that MIME headers are generally required to consist only of 7-
bit data in the US-ASCII character set. Hence field names should be
encoded according to the method in RFC 2047 if they contain
characters outside of that set.

....but if browsers are actually breaking spec we need to support that. Awesome.

@dzhus

This comment has been minimized.

Show comment Hide comment
@dzhus

dzhus Jul 13, 2012

This should be reported to browser devs as well.

dzhus commented Jul 13, 2012

This should be reported to browser devs as well.

@mulderr

This comment has been minimized.

Show comment Hide comment
@mulderr

mulderr Mar 17, 2017

Contributor

@gregorycollins @dzhus

There is a new RFC. RFC2388 (August 1998) has been obsoleted by RFC7578 (July 2015).

Specifically this seems relevant:

5.1.2. Interpreting Forms and Creating multipart/form-data Data

Some applications of this specification will supply a character encoding to be used for interpretation of the multipart/form-data body. In particular, HTML 5 [W3C.REC-html5-20141028] uses

o the content of a "charset" field, if there is one;
o the value of an accept-charset attribute of the form element, if there is one;
o the character encoding of the document containing the form, if it is US-ASCII compatible;
o otherwise, UTF-8.

And later in 5.1.3:

For this reason, interpreting multipart/form-data (even from conforming generators) may require knowing the charset used in form encoding in cases where the charset field value or a charset parameter of a "text/plain" Content-Type header field is not supplied.

Apparently the rules have changed. The browsers are definitely not breaking current spec by sending UTF-8 and I think Snap should be updated as well. In fact otherwise, UTF-8 seems to suggest that UTF-8 is the new default.

Contributor

mulderr commented Mar 17, 2017

@gregorycollins @dzhus

There is a new RFC. RFC2388 (August 1998) has been obsoleted by RFC7578 (July 2015).

Specifically this seems relevant:

5.1.2. Interpreting Forms and Creating multipart/form-data Data

Some applications of this specification will supply a character encoding to be used for interpretation of the multipart/form-data body. In particular, HTML 5 [W3C.REC-html5-20141028] uses

o the content of a "charset" field, if there is one;
o the value of an accept-charset attribute of the form element, if there is one;
o the character encoding of the document containing the form, if it is US-ASCII compatible;
o otherwise, UTF-8.

And later in 5.1.3:

For this reason, interpreting multipart/form-data (even from conforming generators) may require knowing the charset used in form encoding in cases where the charset field value or a charset parameter of a "text/plain" Content-Type header field is not supplied.

Apparently the rules have changed. The browsers are definitely not breaking current spec by sending UTF-8 and I think Snap should be updated as well. In fact otherwise, UTF-8 seems to suggest that UTF-8 is the new default.

@sopvop

This comment has been minimized.

Show comment Hide comment
@sopvop

sopvop Mar 20, 2017

Contributor

IIRC Python requests client does it mime way and snap can't handle it either.

Contributor

sopvop commented Mar 20, 2017

IIRC Python requests client does it mime way and snap can't handle it either.

@mulderr

This comment has been minimized.

Show comment Hide comment
@mulderr

mulderr Mar 21, 2017

Contributor

@sopvop Do you by any chance have a link to requests issue for that? I'm trying to see what the current consensus is. Maybe they found a reasonable solution there?

As a summary, relevant RFCs I could find:

In practice:

  • Chromium 56
  • Firefox 51

will both happily send UTF-8 directly in filename without any warning using the simplest form:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>title</title>
  </head>
  <body>
    <form method="post" enctype="multipart/form-data" action="http://localhost:8000/upload">
      <input type="file" name="the-file">
      <button type="submit">Upload</button>
    </form>
  </body>
</html>

Used dev-tools in both browsers and tcpdump to confirm. In short: what @cvb wrote a couple years ago still holds.

Conclusion: filename* is dead. Any other special forms of encoding are not being currently used. Browsers will not apply any transformations given UTF-8 filenames and will send them unmodified directly in the filename param.

Would you accept a PR that does not assume US-ASCII for Content-Disposition filename value while processing multipart/form-data and keeps everything else as is? Probably by using a different parser here.

Contributor

mulderr commented Mar 21, 2017

@sopvop Do you by any chance have a link to requests issue for that? I'm trying to see what the current consensus is. Maybe they found a reasonable solution there?

As a summary, relevant RFCs I could find:

In practice:

  • Chromium 56
  • Firefox 51

will both happily send UTF-8 directly in filename without any warning using the simplest form:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>title</title>
  </head>
  <body>
    <form method="post" enctype="multipart/form-data" action="http://localhost:8000/upload">
      <input type="file" name="the-file">
      <button type="submit">Upload</button>
    </form>
  </body>
</html>

Used dev-tools in both browsers and tcpdump to confirm. In short: what @cvb wrote a couple years ago still holds.

Conclusion: filename* is dead. Any other special forms of encoding are not being currently used. Browsers will not apply any transformations given UTF-8 filenames and will send them unmodified directly in the filename param.

Would you accept a PR that does not assume US-ASCII for Content-Disposition filename value while processing multipart/form-data and keeps everything else as is? Probably by using a different parser here.

@sopvop

This comment has been minimized.

Show comment Hide comment
@sopvop

sopvop Mar 21, 2017

Contributor

@mulderr here is a closed issue for requests, and relevant urllib3 issue.
As I understand it, they haven't implemented any fixes yet.
With requests I just pass a separate header with base64 encoded utf-8 name. You can do something like that with JS i guess, but not with plain form.

Contributor

sopvop commented Mar 21, 2017

@mulderr here is a closed issue for requests, and relevant urllib3 issue.
As I understand it, they haven't implemented any fixes yet.
With requests I just pass a separate header with base64 encoded utf-8 name. You can do something like that with JS i guess, but not with plain form.

@mulderr

This comment has been minimized.

Show comment Hide comment
@mulderr

mulderr Mar 21, 2017

Contributor

Thanks! I do have a workaround in place, so it's not something I need ASAP. The current behavior was very confusing to debug though.

At first glance, in the urllib3 issue, they seem to be discussing what the default should be. But we already have filename returned as ByteString - which was I guess a perfect call in anticipation for exactly this kind of problem. We just need to actually allow more then US-ASCII when parsing (without double quoting for no good reason).

Edit: hmm... to be precise, UTF-8 is already being allowed, just not properly parsed.

Contributor

mulderr commented Mar 21, 2017

Thanks! I do have a workaround in place, so it's not something I need ASAP. The current behavior was very confusing to debug though.

At first glance, in the urllib3 issue, they seem to be discussing what the default should be. But we already have filename returned as ByteString - which was I guess a perfect call in anticipation for exactly this kind of problem. We just need to actually allow more then US-ASCII when parsing (without double quoting for no good reason).

Edit: hmm... to be precise, UTF-8 is already being allowed, just not properly parsed.

@gregorycollins

This comment has been minimized.

Show comment Hide comment
@gregorycollins

gregorycollins Sep 4, 2017

Owner

This should be fixed now? Please reopen if not.

Owner

gregorycollins commented Sep 4, 2017

This should be fixed now? Please reopen if not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment