Can multipart parts have UTF-8 headers? #3876

Closed
swankjesse opened this issue Feb 21, 2018 · 3 comments
Labels
enhancement Feature not a bug
Comments

@swankjesse
Collaborator

From a conversation on issue 2802, @megantracy93 has found that iOS & web clients permit UTF-8 characters in content-disposition headers.

We might be able to do likewise, which will allow users to retain UTF-8 filenames when doing file uploads.

@yschimke yschimke added the enhancement Feature not a bug label Mar 12, 2018
@bric3

bric3 commented Mar 12, 2018

I stumbled on this issue by chance. I believe the specification for encoding non-ASCII text in header fields is RFC 8187. I'm not sure whether iOS respects this specific RFC. But for me the answer to the question _Can multipart parts have UTF-8 headers?_ is yes.

I had to think about this server-side a couple of years ago, so here's what I remember, updated for the latest RFCs:


Historically, RFC 2616 restricted header text to ISO/IEC 8859-1 characters (§2.2), but I believe RFC 7230, which obsoletes it, changes that for headers (in §3.2). However, the new specification is still unclear about how to represent Unicode characters.
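
To make the ambiguity concrete, here is a minimal sketch of what happens when a sender writes a header value as UTF-8 bytes and the receiver decodes those bytes as ISO-8859-1. The class and method names are hypothetical and are not taken from OkHttp or from any RFC.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OkHttp.
class HeaderCharsetDemo {
    /** Simulates a sender emitting UTF-8 bytes that the receiver reads as ISO-8859-1. */
    static String readAsLatin1(String sentValue) {
        byte[] wire = sentValue.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    /** Undoes the misreading; this works because ISO-8859-1 maps every byte to exactly one char. */
    static String recoverUtf8(String misread) {
        byte[] wire = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }
}
```

With this sketch, `readAsLatin1("âtre.png")` yields `Ã¢tre.png`, and `recoverUtf8` turns it back into the original. Recovery only works if both sides agree out of band that the bytes were UTF-8, and the header itself does not say so.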

Interestingly, there is a specification that addresses this problem: RFC 5987, now obsoleted by RFC 8187, which has its roots in email MIME headers (RFC 2231):

Use of characters outside the US-ASCII coded character set
([RFC0020]) in HTTP header fields ([RFC7230]) is non-trivial:

  • The HTTP specification discourages use of non-US-ASCII characters
    in field values, placing them into the "obs-text" Augmented
    Backus-Naur Form (ABNF) production ([RFC7230], Section 3.2).

  • Furthermore, it stays silent about default character encoding
    schemes for field values, so any use of non-US-ASCII characters
    would need to be specific to the field definition or would require
    some other kind of out-of-band information.

  • Finally, some APIs assume a default character encoding scheme in
    order to map from the octet sequences (obtained from the HTTP
    message) to character sequences: for instance, the XMLHttpRequest
    API ([XMLHttpRequest]) uses the Interface Definition Language type
    "ByteString", effectively resulting in the ISO-8859-1 character
    encoding scheme ([ISO-8859-1]) being used.

On the other hand, RFC 2231 defines an encoding mechanism for
parameters inside MIME header fields ([RFC2231]), which, as opposed
to HTTP messages, do need to be sent over non-binary transports.
This document specifies an encoding suitable for use in HTTP header
fields that is compatible with a simplified profile of the encoding
defined in RFC 2231. It can be applied to any HTTP header field that
uses the common "parameter" ("name=value") syntax.

This document obsoletes [RFC5987] and moves it to "Historic" status;
the changes are summarized in Appendix A.

Note: In the remainder of this document, RFC 2231 is only
referenced for the purpose of explaining the choice of features
that were adopted; therefore, they are purely informative.

Note: This encoding does not apply to message payloads transmitted
over HTTP, such as when using the media type "multipart/form-data"
([RFC7578]).

It describes how to encode header field parameters, e.g.:

 foo: bar; title*=UTF-8''%c2%a3%20and%20%e2%82%ac%20rates
 foo: bar; title="EURO exchange rates";
          title*=utf-8''%e2%82%ac%20exchange%20rates

In the second example, the plain title parameter (the one without the encoding extension) is ignored by recipients that understand title*.
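
As a rough sketch of the encoding side, the following produces an RFC 8187 style ext-value with an empty language tag. It assumes the attr-char set listed in RFC 8187 §3.2.1 and always uses UTF-8; `ExtValueEncoder` is a hypothetical name, not an OkHttp API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of OkHttp.
class ExtValueEncoder {
    // Punctuation allowed unescaped by attr-char (ALPHA and DIGIT are checked separately).
    private static final String ATTR_CHAR_PUNCT = "!#$&+-.^_`|~";

    /** Returns UTF-8''<percent-encoded value>, escaping every byte that is not an attr-char. */
    static String encode(String value) {
        StringBuilder out = new StringBuilder("UTF-8''");
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
            if (alnum || ATTR_CHAR_PUNCT.indexOf(c) >= 0) {
                out.append((char) c);
            } else {
                out.append('%').append(String.format("%02X", c));
            }
        }
        return out.toString();
    }
}
```

For example, `"title*=" + ExtValueEncoder.encode("£ and € rates")` gives the first header parameter above, with upper-case hex digits (the hex digits are case-insensitive).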


For the Content-Disposition header there is also the dedicated RFC 6266, which follows what the RFC above describes, e.g.:

 Content-Disposition: attachment;
                     filename="EURO rates";
                     filename*=utf-8''%e2%82%ac%20rates
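
On the receiving side, a recipient that supports the extended syntax would prefer filename* and decode it. The following is a minimal sketch under that assumption; it takes already-parsed parameters as a map, does no validation of truncated percent escapes, and `ExtValueDecoder` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Map;

// Hypothetical helper, not part of OkHttp.
class ExtValueDecoder {
    /** Decodes an ext-value of the form charset'language'percent-encoded-bytes. */
    static String decode(String extValue) {
        int first = extValue.indexOf('\'');
        int second = extValue.indexOf('\'', first + 1);
        if (first < 0 || second < 0) {
            throw new IllegalArgumentException("not an ext-value: " + extValue);
        }
        Charset charset = Charset.forName(extValue.substring(0, first));
        String encoded = extValue.substring(second + 1);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%') {
                // Two hex digits follow the percent sign.
                bytes.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c);
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    /** Prefers filename* when present, otherwise falls back to the plain filename. */
    static String pickFilename(Map<String, String> params) {
        String extended = params.get("filename*");
        return extended != null ? decode(extended) : params.get("filename");
    }
}
```

Given the header above, `pickFilename` returns `€ rates`; a recipient that only knows the plain parameter would see `EURO rates`.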

But RFC 6266 does not apply when this header appears within a multipart/form-data body; for that you need RFC 7578. And §4.2 of that RFC is unclear about how to represent non-US-ASCII filenames:

In most multipart types, the MIME header fields in each part are
restricted to US-ASCII; for compatibility with those systems, file
names normally visible to users MAY be encoded using the percent-
encoding method in Section 2, following how a "file:" URI
[URI-SCHEME] might be encoded.

NOTE: The encoding method described in [RFC5987], which would add a
"filename*" parameter to the Content-Disposition header field, MUST
NOT be used.

Some commonly deployed systems use multipart/form-data with file
names directly encoded including octets outside the US-ASCII range.
The encoding used for the file names is typically UTF-8, although
HTML forms will use the charset associated with the form.

For reference, §2 describes the same percent-encoding we saw above:

Within this specification, "percent-encoding" (as defined in
[RFC3986]) is offered as a possible way of encoding characters in
file names that are otherwise disallowed, including non-ASCII
characters, spaces, control characters, and so forth. The encoding
is created replacing each non-ASCII or disallowed character with a
sequence, where each byte of the UTF-8 encoding of the character is
represented by a percent-sign (%) followed by the (case-insensitive)
hexadecimal of that byte.

From that, my understanding is that the filename parameter of a multipart part header can take values like:

  • US-ASCII : Content-Disposition: form-data; name="field2"; filename="atre.png"
  • raw UTF-8 : Content-Disposition: form-data; name="field2"; filename="âtre.png"
  • percent encoded : Content-Disposition: form-data; name="field2"; filename="%C3%A2tre.png"
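
The three variants above could be produced roughly as follows. This is only a sketch: `FormDataFilename` is a hypothetical name, the ASCII fallback strategy (strip accents, replace what is left with `_`) is my own assumption, and `URLEncoder` escapes more characters than RFC 7578 strictly requires. Quotes and backslashes in the name are not escaped here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.text.Normalizer;

// Hypothetical helper, not part of OkHttp.
class FormDataFilename {
    /** Strips combining marks after NFD decomposition, then replaces any remaining non-ASCII char. */
    static String asciiFallback(String name) {
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").replaceAll("[^\\x20-\\x7E]", "_");
    }

    /** Percent-encodes the UTF-8 bytes; URLEncoder emits '+' for spaces, so rewrite those as %20. */
    static String percentEncoded(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError(e); // UTF-8 is always supported by the JVM
        }
    }

    /** Builds the part header; for the raw UTF-8 variant, pass the name through unchanged. */
    static String header(String fieldName, String filename) {
        return "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + filename + "\"";
    }
}
```

For the raw UTF-8 variant, the header string still has to be written to the wire as UTF-8 bytes, which is exactly the step that an ASCII-only header check rejects.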

The RFC says these fields are part of a form, so this information is usually hidden from the user. That is mostly true for web pages, though; application clients may rely on these fields somewhat more for the user experience.


As a side note, this may relate indirectly to #930, since OkHttp is updating to RFC 7230.

@swankjesse swankjesse added this to the 3.12 milestone Jul 5, 2018
@swankjesse swankjesse modified the milestones: 3.13, 3.12 Nov 3, 2018
@swankjesse
Collaborator Author

Fixed with #4296

@sxci

sxci commented Dec 11, 2018

https://github.com/square/okhttp/blob/parent-3.12.0/okhttp/src/main/java/okhttp3/MultipartBody.java#L259
That line should use Headers.ofNonAscii, but it uses Headers.of, which does not allow non-ASCII characters, so this issue is not fixed.
