Content-Type parsing (MIME type parsing) #30

Closed
annevk opened this issue Aug 18, 2017 · 47 comments

Comments

@annevk
Member

@annevk annevk commented Aug 18, 2017

I looked into MIME type parsing to figure out how to make progress with whatwg/fetch#579 and httpwg/http-core#33. However, it doesn't seem like there's much interoperability or good places to start.

For instance the following decodes as UTF-8 in Chrome and Firefox, but windows-1252 in Edge and Safari (inspired by http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/netwerk/base/nsURLHelper.cpp#957):

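# wptserve handler: sends two comma-separated MIME types in one Content-Type header.
def main(request, response):
    response.headers.set("content-type", "text/html;charset=windows-1252,text/html;charset=utf-8")
    # Raw bytes, so the rendered result reveals which charset the browser picked.
    response.content = b"\xC2\xB1"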
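
To make the difference visible, here is a minimal sketch of how those two body bytes decode under each of the candidate charsets. The variable names are only illustrative.

```python
# The two bytes from the test body, decoded under each candidate charset.
body = b"\xC2\xB1"

as_utf8 = body.decode("utf-8")            # one character: U+00B1
as_cp1252 = body.decode("windows-1252")   # two characters: U+00C2, U+00B1

print(repr(as_utf8), repr(as_cp1252))
```

A page showing a single plus-minus sign therefore used the last charset, and a page showing two characters used the first.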

Only Chrome and Firefox have a modicum of MIME type validation happening for data: URLs, but even that's rather limited and broken (e.g., unknown parameters get dropped, but image/gif;charset=x is fine).

It seems that anything here would have to be quite forgiving to maintain the status quo of not bailing if a MIME type is invalid (i.e., treat text/html; as text/html and not as an error), but there's also quite some flexibility. And then there's the question of whether strings need to be simply passed through to Blob and such or if there should be some validation step to normalize input (Chrome and Firefox appear to lowercase all input there).

@foolip
Member

@foolip foolip commented Sep 7, 2017

What are all of the contexts in which MIME type-like things are parsed? Ones that I know of:

  • the Content-Type header
  • various content attributes in HTML
  • video.canPlayType(), MediaSource.isTypeSupported() and MediaRecorder.isTypeSupported()

Are there others? It's probably not the same parser used in all contexts.

@annevk
Member Author

@annevk annevk commented Sep 7, 2017

  • Blob's type
  • Response and Request parse Content-Type for later use by Blob's type
  • data: URLs
  • Drag and drop API (unclear, it mostly seems to do string matching as far as I can tell)
  • registerContentHandler() (not useful, only implemented by Firefox)

It's all rather messy indeed.

Content-Type is hard to test as you can only really tell what happens by navigating to the resource. And that doesn't really tell you which parameters got preserved and such, but it does tell you if it was recognized as text/html and charset was found.

@annevk
Member Author

@annevk annevk commented Sep 8, 2017

html/semantics/embedded-content/the-img-element/update-the-source-set.html has evidence (and tests!) that "broken" MIME types such as image/gif;, image/gif;encodings, and image/gif;encodings= do not get rejected by browsers, meaning the RFC for MIME types is wrong about how parsing is supposed to happen.
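
A minimal sketch of what such forgiving parsing could look like, assuming that malformed parameters are skipped rather than treated as errors. The function name `parse_mime_type` and its return shape are hypothetical, and splitting on ";" ignores quoted strings for brevity, so this is not any browser's actual algorithm.

```python
def parse_mime_type(value):
    """Return (essence, params) or None. Hypothetical, forgiving sketch."""
    type_part, _, rest = value.strip().partition(";")
    type_, slash, subtype = type_part.strip().partition("/")
    if not slash or not type_ or not subtype.strip():
        return None  # no usable type/subtype pair
    essence = f"{type_}/{subtype.strip()}".lower()
    params = {}
    for chunk in rest.split(";"):
        name, eq, val = chunk.partition("=")
        name = name.strip().lower()
        val = val.strip().strip('"')
        # Skip empty, nameless, or valueless parameters instead of failing.
        if eq and name and val and name not in params:
            params[name] = val
    return essence, params
```

Under these assumptions all three "broken" values from the test parse to the same result as plain image/gif.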

annevk added a commit to web-platform-tests/wpt that referenced this issue Sep 8, 2017
@GPHemsley
Member

@GPHemsley GPHemsley commented Sep 8, 2017

If it helps, I put this wiki page together back in the day:

https://wiki.whatwg.org/wiki/Contexts

@GPHemsley
Member

@GPHemsley GPHemsley commented Sep 8, 2017

I also tried to assemble a list of source code locations related to MIME sniffing:

https://wiki.whatwg.org/wiki/MIME_Sniffing

foolip added a commit to web-platform-tests/wpt that referenced this issue Sep 9, 2017
@bzbarsky

@bzbarsky bzbarsky commented Sep 15, 2017

Note that Content-Type header parsing also depends on context. See https://bugzilla.mozilla.org/show_bug.cgi?id=1210302 where we ended up needing to use different parsers for the request and response Content-Type headers.

@annevk
Member Author

@annevk annevk commented Sep 19, 2017

Firefox:

We do have a MIME type parser. It has two entrypoints: http://searchfox.org/mozilla-central/rev/05c4c3bc0cfb9b0fc66bdfc8c47cac674e45f151/netwerk/base/nsINetUtil.idl#18-32 (request header) and http://searchfox.org/mozilla-central/rev/05c4c3bc0cfb9b0fc66bdfc8c47cac674e45f151/netwerk/base/nsINetUtil.idl#35-52 (response header).

It would be good to verify those algorithms against other browsers somehow. @mikewest do you have insight as to what Chrome does?

@mikewest
Member

@mikewest mikewest commented Sep 19, 2017

@sleevi is a better resource for the network stack's charset sniffing mechanisms. If he's too busy at BlinkOn this week, I'll dig around for a link. :)

@annevk
Member Author

@annevk annevk commented Sep 19, 2017

Firefox appears to use http://searchfox.org/mozilla-central/source/netwerk/base/nsURLHelper.cpp#1033 for request Content-Type which would reject a valid MIME type such as text/plain;hi=",". There's not much love for commas in MIME types.

@bzbarsky

@bzbarsky bzbarsky commented Sep 19, 2017

which would reject a valid MIME type such as text/plain;hi=","

No, it would not. @annevk, why do you think it would?

@annevk
Member Author

@annevk annevk commented Sep 19, 2017

My bad, it does indeed skip quoted strings.

@sleevi

@sleevi sleevi commented Sep 20, 2017

@annevk It sounds like Chrome is equally in a weird place :)

We have net::ParseMimeTypeWithoutParameter which is part of a fallback handling path for data: URLs in DataURL::Parse. It would appear our parser is... uh... not the best, for data: URLs

We have a separate parser for response Content-Type in net::HttpUtil::ParseContentType and request Content-Type (like @bzbarsky mentioned in #30 (comment) ), in blink::ParsedContentType. This latter implementation is used fairly extensively throughout the Blink side - that is, most Web-exposed pieces will at least be expected to transit that content type parser, but may also transit the HttpUtil::ParseContentType path.

For the Response object, it doesn't look like we parse the Content-Type during construction (see FetchHeaderList::ExtractMIMEType )

Separately, for things like canPlayType, we have multiple places that parsing can happen. This is expressed via the blink::ContentType, which is handled by Blink, but codec parsing is handled by media::MimeUtil. However, other aspects of media loading, such as detecting Media Capabilities, will use the above ParsedContentType

I'm sure we could add another dozen or so MIME type parsers before we start refactoring ;)

For MIME sniffing in particular, the determination of whether or not to sniff uses the net::HttpUtil::ParseContentType logic, by virtue of HttpResponseHeaders::GetMimeTypeAndCharset setting the Response's mime_type for HTTP network loads feeding into content::ShouldSniffContent. However, it gets 'messy' for other forms of loads, because it's implementation-dependent. A quick scan through implementations suggests they're using the same implementation, but my above remarks about the Blink-vs-network layer mean that it may have been 'double parsed' (by both Blink and then network).

@mikewest Is that what you were hoping for? :)

@annevk
Member Author

@annevk annevk commented Oct 3, 2017

HTTP/1.1 200
Content-Length: 11
Content-Type: text/html
Content-Type: text/plain

<b>TEST</b>

Rendered as text/plain: Chrome, Firefox, Safari
Rendered as text/html: Edge

HTTP/1.1 200
Content-Length: 11
Content-Type: text/html, text/plain

<b>TEST</b>

Rendered as text/plain: Chrome, Firefox
Download: Edge
Rendered as text/html: Safari

That means there's already inconsistency between multiple headers and combined headers. I was led to believe that was only the case for cookies. Hopefully we can still make it so somehow.
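
For anyone wanting to reproduce these cases outside a test harness, here is a minimal sketch that builds the raw response bytes in both shapes. The helper name `build_response` and its `combine` flag are hypothetical, and serving the bytes over a socket is left out.

```python
def build_response(content_types, body, combine=False):
    """Build raw HTTP/1.1 response bytes with one or more Content-Type values."""
    if combine:
        # Single header field with comma-separated values.
        content_types = [", ".join(content_types)]
    lines = ["HTTP/1.1 200", f"Content-Length: {len(body)}"]
    # One header field per remaining value.
    lines += [f"Content-Type: {ct}" for ct in content_types]
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


separate = build_response(["text/html", "text/plain"], b"<b>TEST</b>")
combined = build_response(["text/html", "text/plain"], b"<b>TEST</b>", combine=True)
```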

@annevk
Member Author

@annevk annevk commented Oct 3, 2017

HTTP/1.1 200
Content-Length: 11
Content-Type: unknown/unknown
Content-Type: text/html

<b>TEST</b>

HTTP/1.1 200
Content-Length: 11
Content-Type: unknown/unknown, text/html

<b>TEST</b>

Both rendered as text/html: Chrome, Firefox, Safari
Both download: Edge

@annevk
Member Author

@annevk annevk commented Oct 3, 2017

If I reverse the order and list text/html first and then unknown/unknown, then:

  • Chrome, Safari: render all as text/html
  • Edge: renders the separate header version, downloads comma version (as expected)
  • Firefox: downloads both

If I list */* last then only Firefox changes and will render both as text/html.

(It's not entirely clear to me why Chrome special cases */* in its code still.)

@annevk
Member Author

@annevk annevk commented Oct 3, 2017

Given the lack of interoperability it's not clear to me that the complicated Content-Type parsers in Chrome, Firefox, and Safari are totally warranted.

@annevk
Member Author

@annevk annevk commented Oct 3, 2017

Unfortunately Edge is not as consistent in its download story as I hoped.

HTTP/1.1 200
Content-Length: 55
Content-Type: text/html;charset=utf-8
Content-Type: */*;charset=gbk

<script>document.write(document.characterSet)</script>

Chrome, Safari: GBK
Edge: utf-8 (effectively the same as Firefox, but with a "minor" API bug)
Firefox: UTF-8

HTTP/1.1 200
Content-Length: 55
Content-Type: text/html;charset=utf-8, */*;charset=gbk

<script>document.write(document.characterSet)</script>

Chrome: GBK
Edge: windows-1252 (no download despite the comma)
Firefox, Safari: UTF-8

@bzbarsky

@bzbarsky bzbarsky commented Oct 3, 2017

So far it sounds to me like Chrome and Firefox consistently treat:

Content-Type: foo
Content-Type: bar

identically to:

Content-Type: foo, bar

though they may not always agree on how the latter is treated. For Firefox this is expected, because the HTTP response header type in Firefox has no representation for "repeated headers" last I checked; they get turned into a comma-separated list. So the second form is all that the rest of the system sees. I can't speak for how Chrome handles this.

Safari's behavior for the unknown/unknown, text/html case is pretty weird, given its earlier behavior for text/html, text/plain. It's like the parser includes a validator of some sort, or whitelisting or something.

It's possible that Firefox treats */* as a "sniff this" signal; I'd have to step through the code.

@MattMenke2

@MattMenke2 MattMenke2 commented Apr 19, 2018

So, with those rules:

Content-Type: text/html; charset=foo
Content-Type: text/html
would be equivalent to "text/html; charset=foo"

and

Content-Type: text/html; charset=foo
Content-Type: text/plain
Content-Type: text/html
would be equivalent to "text/html"

I can live with that. If we could get away with it, I'd love if we could unconditionally just take the first one. I thought an earlier comment suggested that's what Edge and Safari do?
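
The two examples above are consistent with a "last value wins, charset carries over while the essence stays the same" rule. A minimal sketch of that reading follows. The helper names `parse` and `extract_mime_type` are hypothetical, the parser is deliberately tiny, and the rule itself is inferred from the examples rather than quoted from a spec.

```python
def parse(value):
    """Tiny hypothetical parser: returns (essence, charset or None), or None."""
    essence, _, rest = value.partition(";")
    essence = essence.strip().lower()
    if essence.count("/") != 1 or essence.startswith("/") or essence.endswith("/"):
        return None
    charset = None
    for chunk in rest.split(";"):
        name, _, val = chunk.partition("=")
        if name.strip().lower() == "charset" and val.strip():
            charset = val.strip().strip('"')
            break
    return essence, charset


def extract_mime_type(values):
    """Pick the last parsable value, carrying charset over matching essences."""
    essence = charset = None
    for value in values:
        parsed = parse(value)
        if parsed is None:
            continue  # unparsable values are skipped
        new_essence, new_charset = parsed
        if new_essence != essence:
            charset = None  # different type: drop the carried-over charset
        if new_charset is not None:
            charset = new_charset
        essence = new_essence
    if essence is None:
        return None
    return essence if charset is None else f"{essence};charset={charset}"
```

Under this sketch the intervening text/plain in the second example is what resets the charset.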

@annevk
Member Author

@annevk annevk commented Apr 20, 2018

On the topic of header parsing, I think the way to go would be to always combine (except for Set-Cookie) and then define all the parsers for the combined value. Then how you deal with quotes and such becomes a question on a per-header-parser basis. I.e., the foo header parser only ever sees bar,baz and is only invoked once. That seems like the only way to guarantee consistency.

As for Content-Type, I'll look a bit closer at Edge/Safari since you're willing to flip Chrome and report back.
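
A minimal sketch of the "always combine, except Set-Cookie" step, assuming headers arrive as a list of (name, value) pairs. The function name `combine_headers` and the returned shape are hypothetical.

```python
def combine_headers(header_list):
    """Combine repeated header fields into one comma-separated value each."""
    combined = {}
    set_cookies = []
    for name, value in header_list:
        key = name.lower()  # header names are case-insensitive
        if key == "set-cookie":
            set_cookies.append(value)  # never combined
        elif key in combined:
            combined[key] += ", " + value
        else:
            combined[key] = value
    return combined, set_cookies
```

Each per-header parser would then be invoked once, on the combined value only.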

@annevk
Member Author

@annevk annevk commented Apr 20, 2018

Edge: from the tests I wrote, when the headers are separate it picks the first one. When the headers are combined, it tries to parse their entire value as a MIME type, not giving the comma any special consideration.

Safari: when the headers are separate it picks the last one (matching Chrome/Firefox), but does not carry over charset when the essence matches (not matching Chrome/Firefox). Downloads */* (even without parameters).

(I think the earlier comment got misled by using unknown/unknown, which will trigger sniffing rules as per https://mimesniff.spec.whatwg.org/#determining-the-computed-mime-type-of-a-resource.)

@MattMenke2

@MattMenke2 MattMenke2 commented Apr 20, 2018

Merging and handling commas on a per-header basis works, though I'd be worried about more breakages. And, at least in Chrome, it would require rewriting basically every handler, possibly significantly, just for the comma case - something that I'm worried will not end up happening.

@reschke

@reschke reschke commented Apr 21, 2018

@MattMenke2 - header folding can be done the same way for all fields, as required by the HTTP spec. The only exception is Set-Cookie, as described in https://greenbytes.de/tech/webdav/rfc7230.html#rfc.section.3.2.2.

That said, parsing of these values needs to be specific to the header field, as it requires knowledge of the syntax of individual list elements. You can't split on "," due to commas appearing in the literal value.

@annevk
Member Author

@annevk annevk commented Sep 24, 2018

For stylesheet loading (initial tests in web-platform-tests/wpt#13144) it seems that Firefox treats a */* value equivalent to Content-Type being missing or MIME type parsing failing. Unfortunately due to how other browsers treat a normal value it's unclear if they pay special attention to that value. So therefore I haven't tested that value for now as it's unclear to me what should be done for it.

@MattMenke2

@MattMenke2 MattMenke2 commented Sep 24, 2018

Chrome treats "*/*" as an unknown MIME type, like "", "application/unknown", "unknown/unknown", or any type that lacks a slash, with a comment about matching Firefox's behavior for "*/*".
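
As a rough illustration of that check (not Chrome's actual code), the list above can be expressed as follows. The names `UNKNOWN_TYPES` and `is_unknown_mime_type` are hypothetical, and the trimming and lowercasing are assumptions.

```python
# Values listed above as being treated like an unknown MIME type.
UNKNOWN_TYPES = {"", "*/*", "application/unknown", "unknown/unknown"}


def is_unknown_mime_type(essence):
    """True if the type should be handled as unknown (sketch only)."""
    essence = essence.strip().lower()  # assumption: compare case-insensitively
    return essence in UNKNOWN_TYPES or "/" not in essence
```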

@annevk
Member Author

@annevk annevk commented Nov 7, 2018

My preferred way of dealing with " and , would be to combine first and then split on , that are not enclosed by "s. So while you scan the combined string " sets a do not split flag and a subsequent " unsets it. As long as the do not split flag is set , will not create a new value but instead be appended to the current value.

If you wanted a more efficient approach to this you would not even have to combine first, but you do have to parse all of them at once. So as you reach the end of the first header value you'd simply continue with the second header value while retaining all the current state of the parser (e.g., the do not split flag).

The "controversial" case here is that this would mean that text/html;", text/plain (combined or separate header fields) ends up as text/html, whereas " isn't a valid parameter name to begin with. Given that this only affects erroneous cases (using quoted strings in the wrong place) that seems okay to me though and keeps splitting fairly straightforward.
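
A minimal sketch of the splitting described above, with the "do not split" flag toggled by each double quote. The function name `split_header_value` is hypothetical, trimming whitespace around each value is an assumption, and backslash escapes get no special handling, as in the description.

```python
def split_header_value(combined):
    """Split on commas that are not enclosed by double quotes."""
    values = []
    current = []
    do_not_split = False
    for ch in combined:
        if ch == '"':
            do_not_split = not do_not_split  # a quote sets or unsets the flag
        if ch == "," and not do_not_split:
            values.append("".join(current).strip())
            current = []
        else:
            current.append(ch)  # commas inside quotes stay in the value
    values.append("".join(current).strip())
    return values
```

With this sketch, text/html;", text/plain stays a single value because the lone quote never gets closed, which is the "controversial" case.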

Aside from web-platform-tests/wpt#10525 I'll write some more tests for Content-Type in the context of the script element, as it will be easier to test invalid values there (those that would result in a download for iframe). If someone has an even easier way to test all kinds of response Content-Type processing I'm all ears.

(Note that the request Content-Type parser @bzbarsky was concerned with is covered by whatwg/fetch#829 and tests linked from there and I consider solved at this point. It's only response Content-Type parsing that's problematic as it needs to support multiple values somehow.)

cc @asankah

@annevk
Member Author

@annevk annevk commented Nov 7, 2018

I added tests for the script element (pushed to the same PR) which indicate that at least some browsers have different response Content-Type parsers based on context. E.g., for the script element Chrome uses the first value. That seems rather bad.

@MattMenke2

@MattMenke2 MattMenke2 commented Nov 7, 2018

Sadly, not too surprising - Chrome's network stack has a lot of behaviors modeled after Firefox, while it uses a forked WebKit as its renderer, resulting in inheriting different behavior from different browsers, depending on where the code is running.

annevk added a commit to whatwg/fetch that referenced this issue Nov 9, 2018
Also known as "extract a MIME type" done right.

Tests: web-platform-tests/wpt#10525.

Helps with #814.

Fixes #529. Closes whatwg/mimesniff#30.
@annevk
Member Author

@annevk annevk commented Nov 9, 2018

I put up an initial patch for this at whatwg/fetch#831. Relative to #30 (comment) it defines how splitting works (in a way that can be reused across different headers) and it deals with parsing a MIME type being able to return failure (I also added tests for this).

I still need to add tests and possibly adjust the prose for these values: the empty string, "application/unknown", and "unknown/unknown". I probably also need to read the algorithms one more time and perhaps convince @domenic to implement them in jsdom to ensure they are correct.

There is some potential for simplification here I suppose given that browsers are different, but unless we can simply use the first value always in all contexts it's unclear how much that'll buy us (the weird charset copying might be worth correcting though).

annevk added a commit to whatwg/fetch that referenced this issue Nov 12, 2018
Also known as "extract a MIME type" done right.

Tests: web-platform-tests/wpt#10525.

Helps with #814.

Fixes #529. Closes whatwg/mimesniff#30.
@annevk
Member Author

@annevk annevk commented Nov 13, 2018

Empty string is treated the same as failure as far as I can tell. (I added tests.)

I'm not sure about application/unknown and unknown/unknown. Unless there's a compelling reason to add/keep them to the list I'd prefer removing them as special cases since they don't appear in Firefox's code today (only some usage that doesn't affect any processing models).

annevk added a commit that referenced this issue Nov 13, 2018
Firefox has had no need for these. It seems less weird behavior is better.

See #30 for context.
@asankah

@asankah asankah commented Nov 13, 2018

Oops. Sorry I missed this thread.

One problem with existing grammars is that splitting on , modulo quoted strings cannot be done by examining the prefix alone unless headers that use , in their grammar for purposes other than separating distinct header values are excluded. This was the problem with the family of authentication headers (WWW-Authenticate, Authorization, Proxy-Authenticate, Proxy-Authorization). Prior to RFC 7235 it was not possible to unambiguously parse auth headers in general regardless of header coalescing. After RFC 7235 they are parsable, but parsing requires a two-token lookahead, which is unfortunate.

E.g.: From RFC 7235 § 4.1:

   WWW-Authenticate: Newauth realm="apps", type=1,
                     title="Login to \"apps\""

None of the commas in this header delineate values. Instead they all delineate parameters.

A contrived but valid example is below where the first and last ,s are delineating values. Note that a lexer can't split at the last , until it has seen the next realm token. Hence the two-token lookahead.

   WWW-Authenticate: , Newauth realm="apps", type=1,
                     title="Login to \"apps\"", Basic realm="simple"

*Edited for accuracy.

@annevk
Member Author

@annevk annevk commented Nov 13, 2018

@asankah thanks for chiming in. It seems to me you could still split on , modulo " eagerly in line with other headers, but the parser operating on the resulting split values has to be aware of the semantics and context and not process these values independently of the others.

@asankah

@asankah asankah commented Nov 13, 2018

@annevk Meaning the second example would be treated as being equivalent to

WWW-Authenticate:
WWW-Authenticate: Newauth realm="apps"
WWW-Authenticate: type=1
WWW-Authenticate: title="Login to \"apps\""
WWW-Authenticate: Basic realm="simple"

?

Thus the parser for authentication challenges would need to associate all immediately following authentication headers that match auth-param with the header that had an auth-scheme.

It is a possibility :-) , as is adding lookaheads. Though it has the property that the meaning of a header depends on headers that follow.
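
A minimal sketch of that association step, assuming the values have already been split on commas outside quoted strings. The name `group_challenges` and the regular expression are hypothetical: a value that starts with a token directly followed by "=" is taken to be an auth-param, and anything else starts a new challenge.

```python
import re

# Hypothetical heuristic: token characters, optional whitespace, then "=".
_AUTH_PARAM = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+\s*=")


def group_challenges(values):
    """Attach auth-param values to the preceding auth-scheme value."""
    challenges = []
    for value in values:
        value = value.strip()
        if not value:
            continue  # empty list elements carry no information
        if _AUTH_PARAM.match(value) and challenges:
            challenges[-1].append(value)  # parameter of the current challenge
        else:
            challenges.append([value])  # a new auth-scheme starts here
    return challenges
```

Applied to the five values above, this yields one Newauth challenge with three parameters and one Basic challenge.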

(Apologies if I'm misinterpreting your suggestion).

@annevk
Member Author

@annevk annevk commented Nov 13, 2018

Yes, that's exactly what I mean, and that would also not make eagerly combining intermediaries yield different results in the end.

annevk added a commit to whatwg/fetch that referenced this issue Nov 27, 2018
Also known as "extract a MIME type" done right.

Tests: web-platform-tests/wpt#10525.

Helps with #814.

Fixes #529. Closes whatwg/mimesniff#30.