New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Content-Type parsing (MIME type parsing) #30

Open
annevk opened this Issue Aug 18, 2017 · 36 comments

Comments

9 participants
@annevk
Member

annevk commented Aug 18, 2017

I looked into MIME type parsing to figure out how to make progress with whatwg/fetch#579 and httpwg/http-core#33. However, it doesn't seem like there's much interoperability or good places to start.

For instance the following decodes as UTF-8 in Chrome and Firefox, but windows-1252 in Edge and Safari (inspired by http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/netwerk/base/nsURLHelper.cpp#957):

def main(request, response):
   response.headers.set("content-type", "text/html;charset=windows-1252,text/html;charset=utf-8")
   response.content = "\xC2\xB1"

Only Chrome and Firefox have a modicum of MIME type validation happening for data: URLs, but even that's rather limited and broken (e.g., unknown parameters get dropped, but image/gif;charset=x is fine).

It seems that anything here would have to be quite forgiving to maintain the status quo of not bailing if a MIME type is invalid (i.e.., treat text/html; as text/html and not as an error), but there's also quite some flexibility. And then there's the question of whether strings need to be simply passed through to Blob and such or if there should be some validation step to normalize input (Chrome and Firefox appear to lowercase all input there).

@foolip

This comment has been minimized.

Show comment
Hide comment
@foolip

foolip Sep 7, 2017

Member

What are all of the contexts that MIME type-like things are parsed? Ones that I know of:

  • the Content-Type header
  • various content attributes in HTML
  • video.canPlayType(), MediaSource.isTypeSupported() and MediaRecorder.isTypeSupported()

Are there others? It's probably not the same parser used in all contexts.

Member

foolip commented Sep 7, 2017

What are all of the contexts that MIME type-like things are parsed? Ones that I know of:

  • the Content-Type header
  • various content attributes in HTML
  • video.canPlayType(), MediaSource.isTypeSupported() and MediaRecorder.isTypeSupported()

Are there others? It's probably not the same parser used in all contexts.

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Sep 7, 2017

Member
  • Blob's type
  • Response and Request parse Content-Type for later use by Blob's type
  • data: URLs
  • Drag and drop API (unclear, it mostly seems to do string matching as far as I can tell)
  • registerContentHandler() (not useful, only implemented by Firefox)

It's all rather messy indeed.

Content-Type is hard to test as you can only really tell what happens by navigating to the resource. And that doesn't really tell you which parameters got preserved and such, but it does tell you if it was recognized as text/html and charset was found.

Member

annevk commented Sep 7, 2017

  • Blob's type
  • Response and Request parse Content-Type for later use by Blob's type
  • data: URLs
  • Drag and drop API (unclear, it mostly seems to do string matching as far as I can tell)
  • registerContentHandler() (not useful, only implemented by Firefox)

It's all rather messy indeed.

Content-Type is hard to test as you can only really tell what happens by navigating to the resource. And that doesn't really tell you which parameters got preserved and such, but it does tell you if it was recognized as text/html and charset was found.

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Sep 8, 2017

Member

html/semantics/embedded-content/the-img-element/update-the-source-set.html has evidence (and tests!) that "broken" MIME types such as image/gif;, image/gif;encodings, and image/gif;encodings= do not get rejected by browsers meaning the RFC for MIME types is wrong in how parsing is supposed to happen.

Member

annevk commented Sep 8, 2017

html/semantics/embedded-content/the-img-element/update-the-source-set.html has evidence (and tests!) that "broken" MIME types such as image/gif;, image/gif;encodings, and image/gif;encodings= do not get rejected by browsers meaning the RFC for MIME types is wrong in how parsing is supposed to happen.

annevk added a commit to web-platform-tests/wpt that referenced this issue Sep 8, 2017

@GPHemsley

This comment has been minimized.

Show comment
Hide comment
@GPHemsley

GPHemsley Sep 8, 2017

Member

If it helps, I put this wiki page together back in the day:

https://wiki.whatwg.org/wiki/Contexts

Member

GPHemsley commented Sep 8, 2017

If it helps, I put this wiki page together back in the day:

https://wiki.whatwg.org/wiki/Contexts

@GPHemsley

This comment has been minimized.

Show comment
Hide comment
@GPHemsley

GPHemsley Sep 8, 2017

Member

I also tried to assemble a list of source code locations related to MIME sniffing:

https://wiki.whatwg.org/wiki/MIME_Sniffing

Member

GPHemsley commented Sep 8, 2017

I also tried to assemble a list of source code locations related to MIME sniffing:

https://wiki.whatwg.org/wiki/MIME_Sniffing

foolip added a commit to web-platform-tests/wpt that referenced this issue Sep 9, 2017

@bzbarsky

This comment has been minimized.

Show comment
Hide comment
@bzbarsky

bzbarsky Sep 15, 2017

Note that Content-Type header parsing also depends. See https://bugzilla.mozilla.org/show_bug.cgi?id=1210302 where we ended up needing to use different parsers for the request and response Content-Type headers.

bzbarsky commented Sep 15, 2017

Note that Content-Type header parsing also depends. See https://bugzilla.mozilla.org/show_bug.cgi?id=1210302 where we ended up needing to use different parsers for the request and response Content-Type headers.

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Sep 19, 2017

Member

Firefox:

We do have a MIME type parser. It has two entrypoints: http://searchfox.org/mozilla-central/rev/05c4c3bc0cfb9b0fc66bdfc8c47cac674e45f151/netwerk/base/nsINetUtil.idl#18-32 (request header) and http://searchfox.org/mozilla-central/rev/05c4c3bc0cfb9b0fc66bdfc8c47cac674e45f151/netwerk/base/nsINetUtil.idl#35-52 (response header).

It would be good to verify those algorithms against other browsers somehow. @mikewest do you have insight as to what Chrome does?

Member

annevk commented Sep 19, 2017

Firefox:

We do have a MIME type parser. It has two entrypoints: http://searchfox.org/mozilla-central/rev/05c4c3bc0cfb9b0fc66bdfc8c47cac674e45f151/netwerk/base/nsINetUtil.idl#18-32 (request header) and http://searchfox.org/mozilla-central/rev/05c4c3bc0cfb9b0fc66bdfc8c47cac674e45f151/netwerk/base/nsINetUtil.idl#35-52 (response header).

It would be good to verify those algorithms against other browsers somehow. @mikewest do you have insight as to what Chrome does?

@mikewest

This comment has been minimized.

Show comment
Hide comment
@mikewest

mikewest Sep 19, 2017

Member

@sleevi is a better resource for the network stack's charset sniffing mechanisms. If he's too busy at BlinkOn this week, I'll dig around for a link. :)

Member

mikewest commented Sep 19, 2017

@sleevi is a better resource for the network stack's charset sniffing mechanisms. If he's too busy at BlinkOn this week, I'll dig around for a link. :)

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Sep 19, 2017

Member

Firefox appears to use http://searchfox.org/mozilla-central/source/netwerk/base/nsURLHelper.cpp#1033 for request Content-Type which would reject a valid MIME type such as text/plain;hi=",". There's not much love for commas in MIME types.

Member

annevk commented Sep 19, 2017

Firefox appears to use http://searchfox.org/mozilla-central/source/netwerk/base/nsURLHelper.cpp#1033 for request Content-Type which would reject a valid MIME type such as text/plain;hi=",". There's not much love for commas in MIME types.

@bzbarsky

This comment has been minimized.

Show comment
Hide comment
@bzbarsky

bzbarsky Sep 19, 2017

which would reject a valid MIME type such as text/plain;hi=","

No, it would not. @annevk, why do you think it would?

bzbarsky commented Sep 19, 2017

which would reject a valid MIME type such as text/plain;hi=","

No, it would not. @annevk, why do you think it would?

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Sep 19, 2017

Member

My bad, it does indeed skip quoted strings.

Member

annevk commented Sep 19, 2017

My bad, it does indeed skip quoted strings.

@sleevi

This comment has been minimized.

Show comment
Hide comment
@sleevi

sleevi Sep 20, 2017

@annevk It sounds like Chrome is equally in a weird place :)

We have net::ParseMimeTypeWithoutParameter which is part of a fallback handling path for data: URLs in DataURL::Parse. It would appear our parser is... uh... not the best, for data: URLs

We have a separate parser for response Content-Type in net::HttpUtil::ParseContentType and request Content-Type (like @bzbarsky mentioned in #30 (comment) ), in blink::ParsedContentType. This latter implementation is used fairly extensively throughout the Blink side - that is, most Web-exposed pieces will at least be expected to transit that content type parser, but may also transit the HttpUtil::ParseContentType path.

For the Response object, it doesn't look like we parse the Content-Type during construction (see FetchHeaderList::ExtractMIMEType )

Separately, for things like canPlayType, we have multiple places that parsing can happen. This is expressed via the blink::ContentType, which is handled by Blink, but codec parsing is handled by media::MimeUtil. However, other aspects of media loading, such as detecting Media Capabilities, will use the above ParsedContentType

I'm sure we could add another dozen or so MIME type parsers before we start refactoring ;)

For MIME sniffing in particular, the determination of whether or not to sniff is using the net::HttpUtil::ParseContentType logic, by virtue of HttpResponseHeaders::GetMimeTypeAndCharset setting the Response's mime_type for HTTP-network-loads feeding into content::ShouldSniffContent. However, it gets 'messy' for other form of loads, because it's implementation-dependent. A quick scan through implementations suggest they're using the same implementation, but my above remarks about the Blink-vs-network layer means that it may have been 'double parsed' (by both Blink and then network).

@mikewest Is that what you were hoping for? :)

sleevi commented Sep 20, 2017

@annevk It sounds like Chrome is equally in a weird place :)

We have net::ParseMimeTypeWithoutParameter which is part of a fallback handling path for data: URLs in DataURL::Parse. It would appear our parser is... uh... not the best, for data: URLs

We have a separate parser for response Content-Type in net::HttpUtil::ParseContentType and request Content-Type (like @bzbarsky mentioned in #30 (comment) ), in blink::ParsedContentType. This latter implementation is used fairly extensively throughout the Blink side - that is, most Web-exposed pieces will at least be expected to transit that content type parser, but may also transit the HttpUtil::ParseContentType path.

For the Response object, it doesn't look like we parse the Content-Type during construction (see FetchHeaderList::ExtractMIMEType )

Separately, for things like canPlayType, we have multiple places that parsing can happen. This is expressed via the blink::ContentType, which is handled by Blink, but codec parsing is handled by media::MimeUtil. However, other aspects of media loading, such as detecting Media Capabilities, will use the above ParsedContentType

I'm sure we could add another dozen or so MIME type parsers before we start refactoring ;)

For MIME sniffing in particular, the determination of whether or not to sniff is using the net::HttpUtil::ParseContentType logic, by virtue of HttpResponseHeaders::GetMimeTypeAndCharset setting the Response's mime_type for HTTP-network-loads feeding into content::ShouldSniffContent. However, it gets 'messy' for other form of loads, because it's implementation-dependent. A quick scan through implementations suggest they're using the same implementation, but my above remarks about the Blink-vs-network layer means that it may have been 'double parsed' (by both Blink and then network).

@mikewest Is that what you were hoping for? :)

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Oct 3, 2017

Member
HTTP/1.1 200
Content-Length: 11
Content-Type: text/html
Content-Type: text/plain

<b>TEST</b>

Rendered as text/plain: Chrome, Firefox, Safari
Rendered as text/html: Edge

HTTP/1.1 200
Content-Length: 11
Content-Type: text/html, text/plain

<b>TEST</b>

Rendered as text/plain: Chrome, Firefox
Download: Edge
Rendered as text/html: Safari

That means there's already inconsistency between multiple headers and combined headers. I was led to believe that was only the case for cookies. Hopefully we can still make it so somehow.

Member

annevk commented Oct 3, 2017

HTTP/1.1 200
Content-Length: 11
Content-Type: text/html
Content-Type: text/plain

<b>TEST</b>

Rendered as text/plain: Chrome, Firefox, Safari
Rendered as text/html: Edge

HTTP/1.1 200
Content-Length: 11
Content-Type: text/html, text/plain

<b>TEST</b>

Rendered as text/plain: Chrome, Firefox
Download: Edge
Rendered as text/html: Safari

That means there's already inconsistency between multiple headers and combined headers. I was led to believe that was only the case for cookies. Hopefully we can still make it so somehow.

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Oct 3, 2017

Member
HTTP/1.1 200
Content-Length: 11
Content-Type: unknown/unknown
Content-Type: text/html

<b>TEST</b>
HTTP/1.1 200
Content-Length: 11
Content-Type: unknown/unknown, text/html

<b>TEST</b>

Both rendered as text/html: Chrome, Firefox, Safari
Both download: Edge

Member

annevk commented Oct 3, 2017

HTTP/1.1 200
Content-Length: 11
Content-Type: unknown/unknown
Content-Type: text/html

<b>TEST</b>
HTTP/1.1 200
Content-Length: 11
Content-Type: unknown/unknown, text/html

<b>TEST</b>

Both rendered as text/html: Chrome, Firefox, Safari
Both download: Edge

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Oct 3, 2017

Member

If I reverse the order and list text/html first and then unknown/unknown, then:

  • Chrome, Safari: render all as text/html
  • Edge: renders the separate header version, downloads comma version (as expected)
  • Firefox: downloads both

If I list */* last then only Firefox changes and will render both as text/html.

(It's not entirely clear to me why Chrome special cases */* in its code still.)

Member

annevk commented Oct 3, 2017

If I reverse the order and list text/html first and then unknown/unknown, then:

  • Chrome, Safari: render all as text/html
  • Edge: renders the separate header version, downloads comma version (as expected)
  • Firefox: downloads both

If I list */* last then only Firefox changes and will render both as text/html.

(It's not entirely clear to me why Chrome special cases */* in its code still.)

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Oct 3, 2017

Member

Given the lack of interoperability it's not clear to me that the complicated Content-Type parsers in Chrome, Firefox, and Safari are totally warranted.

Member

annevk commented Oct 3, 2017

Given the lack of interoperability it's not clear to me that the complicated Content-Type parsers in Chrome, Firefox, and Safari are totally warranted.

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Oct 3, 2017

Member

Unfortunately Edge is not as consistent in its download story as I hoped.

HTTP/1.1 200
Content-Length: 55
Content-Type: text/html;charset=utf-8
Content-Type: */*;charset=gbk

<script>document.write(document.characterSet)</script>

Chrome, Safari: GBK
Edge: utf-8 (effectively the same as Firefox, but with a "minor" API bug)
Firefox: UTF-8

HTTP/1.1 200
Content-Length: 55
Content-Type: text/html;charset=utf-8, */*;charset=gbk

<script>document.write(document.characterSet)</script>

Chrome: GBK
Edge: windows-1252 (no download despite the comma)
Firefox, Safari: UTF-8

Member

annevk commented Oct 3, 2017

Unfortunately Edge is not as consistent in its download story as I hoped.

HTTP/1.1 200
Content-Length: 55
Content-Type: text/html;charset=utf-8
Content-Type: */*;charset=gbk

<script>document.write(document.characterSet)</script>

Chrome, Safari: GBK
Edge: utf-8 (effectively the same as Firefox, but with a "minor" API bug)
Firefox: UTF-8

HTTP/1.1 200
Content-Length: 55
Content-Type: text/html;charset=utf-8, */*;charset=gbk

<script>document.write(document.characterSet)</script>

Chrome: GBK
Edge: windows-1252 (no download despite the comma)
Firefox, Safari: UTF-8

@bzbarsky

This comment has been minimized.

Show comment
Hide comment
@bzbarsky

bzbarsky Oct 3, 2017

So far it sounds to me like Chrome and Firefox consistently treat:

Content-Type: foo
Content-Type: bar

identically to:

Content-Type: foo, bar

though they may not always agree on how the latter is treated. For Firefox this is expected, because the HTTP response header type in Firefox has no representation for "repeated headers" last I checked; they get turned into a comma-separated list. So the second form is all that the rest of the system sees. I can't speak for how Chrome handles this.

Safari's behavior for the unknown/unknown, text/html case is pretty weird, given its earlier behavior for text/html, text/plain. It's like the parser includes a validator of some sort, or whitelisting or something.

It's possible that Firefox treats */* as a "sniff this" signal; I'd have to step through the code.

bzbarsky commented Oct 3, 2017

So far it sounds to me like Chrome and Firefox consistently treat:

Content-Type: foo
Content-Type: bar

identically to:

Content-Type: foo, bar

though they may not always agree on how the latter is treated. For Firefox this is expected, because the HTTP response header type in Firefox has no representation for "repeated headers" last I checked; they get turned into a comma-separated list. So the second form is all that the rest of the system sees. I can't speak for how Chrome handles this.

Safari's behavior for the unknown/unknown, text/html case is pretty weird, given its earlier behavior for text/html, text/plain. It's like the parser includes a validator of some sort, or whitelisting or something.

It's possible that Firefox treats */* as a "sniff this" signal; I'd have to step through the code.

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Oct 4, 2017

Member

I found a case for Firefox where that is not true:

HTTP/1.1 200
Content-Length: 55
Content-Type: text/plain;charset=gbk
Content-Type: text/html

<script>document.write(document.characterSet)</script>

Results in GBK

HTTP/1.1 200
Content-Length: 55
Content-Type: text/plain;charset=gbk, text/html

<script>document.write(document.characterSet)</script>

Results in windows-1252.

For the first resource the developer console doesn't list the second Content-Type header either, which suggests there's something preserving the difference through the system somehow.

(Edge renders those as text/plain, Chrome always says windows-1252, and Safari renders the comma version as text/plain and says windows-1252 for the other.)

(We should probably treat this as a bug in Firefox. I don't really think we want to go down the rathole of having to care about this difference.)

Member

annevk commented Oct 4, 2017

I found a case for Firefox where that is not true:

HTTP/1.1 200
Content-Length: 55
Content-Type: text/plain;charset=gbk
Content-Type: text/html

<script>document.write(document.characterSet)</script>

Results in GBK

HTTP/1.1 200
Content-Length: 55
Content-Type: text/plain;charset=gbk, text/html

<script>document.write(document.characterSet)</script>

Results in windows-1252.

For the first resource the developer console doesn't list the second Content-Type header either, which suggests there's something preserving the difference through the system somehow.

(Edge renders those as text/plain, Chrome always says windows-1252, and Safari renders the comma version as text/plain and says windows-1252 for the other.)

(We should probably treat this as a bug in Firefox. I don't really think we want to go down the rathole of having to care about this difference.)

@bzbarsky

This comment has been minimized.

Show comment
Hide comment
@bzbarsky

bzbarsky Oct 4, 2017

Oh, hmm. Looks like I was wrong and Gecko actually stopped merging Content-Type headers in https://bugzilla.mozilla.org/show_bug.cgi?id=655389 (only keeping the last one) but then I don't understand the results with charset=gbk in your two-header case; I would have thought we would replave the first value with the second one, based on code inspection.

I'd appreciate a pointer to a testcase that has those headers so I can figure out what's actually going on.

Given the security implications in https://bugzilla.mozilla.org/show_bug.cgi?id=655389 I also think we should be a little careful about assuming which behaviors are OK here without talking to domain experts. :(

bzbarsky commented Oct 4, 2017

Oh, hmm. Looks like I was wrong and Gecko actually stopped merging Content-Type headers in https://bugzilla.mozilla.org/show_bug.cgi?id=655389 (only keeping the last one) but then I don't understand the results with charset=gbk in your two-header case; I would have thought we would replave the first value with the second one, based on code inspection.

I'd appreciate a pointer to a testcase that has those headers so I can figure out what's actually going on.

Given the security implications in https://bugzilla.mozilla.org/show_bug.cgi?id=655389 I also think we should be a little careful about assuming which behaviors are OK here without talking to domain experts. :(

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Oct 4, 2017

Member

@bzbarsky you can copy-and-paste my tests, save them as [something].asis within web-platform-tests, and then run them. Deploying these is a little harder as I don't have an instance of the web-platform-tests server running.

Member

annevk commented Oct 4, 2017

@bzbarsky you can copy-and-paste my tests, save them as [something].asis within web-platform-tests, and then run them. Deploying these is a little harder as I don't have an instance of the web-platform-tests server running.

@bzbarsky

This comment has been minimized.

Show comment
Hide comment
@bzbarsky

bzbarsky Oct 4, 2017

Thanks, that works beautifully.

So two corrections to Gecko behavior:

  1. It can in fact store multiple Content-Type headers. What that does in practice I didn't look into.
  2. Every time it sees a Content-Type header it calls http://searchfox.org/mozilla-central/rev/e62604a97286f49993eb29c493672c17c7398da9/netwerk/base/nsURLHelper.cpp#978 on the value. This does not depend on whether it's seen other headers or not. This function will parse out the actual MIME type and the charset and save them to the overall "response head" struct. Importantly, it will NOT clobber existing values in that struct if they're not present in the header. So in the example above, the first header gets parsed, sets type to text/plain and charset to gbk. The second header gets parsed, sets type to text/html and doesn't clobber the existing charset value.

Given that, I would also consider this a bug in Gecko. It should clear out the existing type/charset every time it goes to parse a Content-Type header....

bzbarsky commented Oct 4, 2017

Thanks, that works beautifully.

So two corrections to Gecko behavior:

  1. It can in fact store multiple Content-Type headers. What that does in practice I didn't look into.
  2. Every time it sees a Content-Type header it calls http://searchfox.org/mozilla-central/rev/e62604a97286f49993eb29c493672c17c7398da9/netwerk/base/nsURLHelper.cpp#978 on the value. This does not depend on whether it's seen other headers or not. This function will parse out the actual MIME type and the charset and save them to the overall "response head" struct. Importantly, it will NOT clobber existing values in that struct if they're not present in the header. So in the example above, the first header gets parsed, sets type to text/plain and charset to gbk. The second header gets parsed, sets type to text/html and doesn't clobber the existing charset value.

Given that, I would also consider this a bug in Gecko. It should clear out the existing type/charset every time it goes to parse a Content-Type header....

@annevk annevk referenced this issue Oct 13, 2017

Closed

Sort out MIME type tests #42

4 of 4 tasks complete

rachelandrew added a commit to rachelandrew/web-platform-tests that referenced this issue Nov 8, 2017

jakearchibald added a commit to jakearchibald/web-platform-tests that referenced this issue Nov 16, 2017

@annevk annevk changed the title from MIME type parsing to Content-Type parsing (MIME type parsing) Nov 27, 2017

@SimonSapin

This comment has been minimized.

Show comment
Hide comment
@SimonSapin

SimonSapin Feb 1, 2018

Is this fixed by #36?

SimonSapin commented Feb 1, 2018

Is this fixed by #36?

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Feb 1, 2018

Member

@SimonSapin unfortunately, no. Both request and response Content-Type header parsing end up delegating to the MIME type parser at some point, but both have their own quirks on top. I haven't yet written enough tests to define those aspects of Content-Type (another question is how much of the logic is shared across browsers and how much is needed for compatibility).

Member

annevk commented Feb 1, 2018

@SimonSapin unfortunately, no. Both request and response Content-Type header parsing end up delegating to the MIME type parser at some point, but both have their own quirks on top. I haven't yet written enough tests to define those aspects of Content-Type (another question is how much of the logic is shared across browsers and how much is needed for compatibility).

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Apr 18, 2018

Member

I started working on tests for this (basically the above, but generalized): web-platform-tests/wpt#10525. It's not complete (e.g., only tests a single response header and only <iframe> as a context), but feedback on the general setup would be much appreciated at this point. Thanks!

Member

annevk commented Apr 18, 2018

I started working on tests for this (basically the above, but generalized): web-platform-tests/wpt#10525. It's not complete (e.g., only tests a single response header and only <iframe> as a context), but feedback on the general setup would be much appreciated at this point. Thanks!

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Apr 19, 2018

Member

Having written more tests, here are my observations on parsing of Content-Type as a response header:

  • Chrome and Firefox almost never differ between multiple headers with one value and a single header with comma-separated values. They are close and I hope they are willing to address the differences. The main difficulty here is if the first header has text/html;x=" as value and a subsequent header has text/plain;x=" as value. Is that text/html;x=",text/plain;x=" or parsed twice and therefore text/plain;x=""? I hope we can agree the former would be better. (Ideally for all headers apart from Set-Cookie we make these two scenarios behave identically.)
  • Chrome and Firefox pay special attention to / by discarding it; Chrome only does so when there’s no parameters. Safari seems to download. Edge seems to fail randomly.
  • From code inspection Chrome and Firefox care about extracting the MIME type's essence and the charset parameter (and sometimes the boundary parameter). If there's multiple Content-Type headers (or multiple values) the charset parameter might be reused if the MIME type's essence of a subsequent header/value is the same. That’s somewhat incompatible with extracting a full MIME type, although we could make do by just setting the charset parameter on the last MIME type I suppose.

This would lead to me to suggest the following parsing algorithm:

  1. Let charset be null.
  2. Let essence be null.
  3. Let mimeType be null.
  4. Let contentTypeHeader be the combined value of all Content-Type headers.
  5. For each value of contentTypeHeader (assume we deal with , inside " to find these values correctly):
    1. Set mimeType to the result of parsing value as a MIME type.
    2. If essence is not mimeType's essence and mimeType's essence is not "/", then:
      1. Set charset to null.
      2. If mimeType's parameters["charset"] exists, then set charset to it.
      3. Set essence to mimeType's essence.
    3. Otherwise, if essence is mimeType's essence:
      1. If mimeType's parameters["charset"] does not exist and charset is non-null, then set
        mimeType's parameters["charset"] to charset.
  6. Return mimeType.

(Probably to be defined in Fetch.)

Then the stricter parser for request Content-Type would be equivalent, but error on multiple values.

Then there's a question of which parser we use where. Do Request and Response use the stricter parser always or should they follow what navigation needs to do? Given the current interoperability there it doesn't seem to matter much. Perhaps using the stricter parser where we can get away with it would be good.

Input on all this much appreciated.

cc @bencebeky @MattMenke2 @youennf

Member

annevk commented Apr 19, 2018

Having written more tests, here are my observations on parsing of Content-Type as a response header:

  • Chrome and Firefox almost never differ between multiple headers with one value and a single header with comma-separated values. They are close and I hope they are willing to address the differences. The main difficulty here is if the first header has text/html;x=" as value and a subsequent header has text/plain;x=" as value. Is that text/html;x=",text/plain;x=" or parsed twice and therefore text/plain;x=""? I hope we can agree the former would be better. (Ideally for all headers apart from Set-Cookie we make these two scenarios behave identically.)
  • Chrome and Firefox pay special attention to / by discarding it; Chrome only does so when there’s no parameters. Safari seems to download. Edge seems to fail randomly.
  • From code inspection Chrome and Firefox care about extracting the MIME type's essence and the charset parameter (and sometimes the boundary parameter). If there's multiple Content-Type headers (or multiple values) the charset parameter might be reused if the MIME type's essence of a subsequent header/value is the same. That’s somewhat incompatible with extracting a full MIME type, although we could make do by just setting the charset parameter on the last MIME type I suppose.

This would lead to me to suggest the following parsing algorithm:

  1. Let charset be null.
  2. Let essence be null.
  3. Let mimeType be null.
  4. Let contentTypeHeader be the combined value of all Content-Type headers.
  5. For each value of contentTypeHeader (assume we deal with , inside " to find these values correctly):
    1. Set mimeType to the result of parsing value as a MIME type.
    2. If essence is not mimeType's essence and mimeType's essence is not "/", then:
      1. Set charset to null.
      2. If mimeType's parameters["charset"] exists, then set charset to it.
      3. Set essence to mimeType's essence.
    3. Otherwise, if essence is mimeType's essence:
      1. If mimeType's parameters["charset"] does not exist and charset is non-null, then set
        mimeType's parameters["charset"] to charset.
  6. Return mimeType.

(Probably to be defined in Fetch.)

Then the stricter parser for request Content-Type would be equivalent, but error on multiple values.

Then there's a question of which parser we use where. Do Request and Response use the stricter parser always or should they follow what navigation needs to do? Given the current interoperability there it doesn't seem to matter much. Perhaps using the stricter parser where we can get away with it would be good.

Input on all this much appreciated.

cc @bencebeky @MattMenke2 @youennf

@MattMenke2

This comment has been minimized.

Show comment
Hide comment
@MattMenke2

MattMenke2 Apr 19, 2018

I'm not sure what it means to correctly deal with a comma inside quotes.

Consider: foo: bar "," baz

Presumably shouldn't be split into foo: bar " and foo: "baz. But then replace foo with "Content-Type" and "bar" with "text/html;x=", and you'd get: Content-Type: text/html;x= "," baz

The mime type parser would not consider the "," an quoted string, but the generic header parser would, which seems rather unexpected to me. Remove spaces, and the situation is even more ambiguous - should quoted strings be set off by anything?

I know quoted comma handling is a little bit of a tangent here.

MattMenke2 commented Apr 19, 2018

I'm not sure what it means to correctly deal with a comma inside quotes.

Consider: foo: bar "," baz

Presumably shouldn't be split into foo: bar " and foo: "baz. But then replace foo with "Content-Type" and "bar" with "text/html;x=", and you'd get: Content-Type: text/html;x= "," baz

The mime type parser would not consider the "," an quoted string, but the generic header parser would, which seems rather unexpected to me. Remove spaces, and the situation is even more ambiguous - should quoted strings be set off by anything?

I know quoted comma handling is a little bit of a tangent here.

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Apr 19, 2018

Member

@MattMenke2 that's an excellent point. I guess that needs more research across headers to see what actually happens, but I'm assuming you do agree that

foo: bar
foo: baz

and

foo: bar, baz

for any foo other than Set-Cookie and any bar and any baz should be parsed identically (as middleware is allowed to combine, etc.)?

Member

annevk commented Apr 19, 2018

@MattMenke2 that's an excellent point. I guess that needs more research across headers to see what actually happens, but I'm assuming you do agree that

foo: bar
foo: baz

and

foo: bar, baz

for any foo other than Set-Cookie and any bar and any baz should be parsed identically (as middleware is allowed to combine, etc.)?

@MattMenke2

This comment has been minimized.

Show comment
Hide comment
@MattMenke2

MattMenke2 Apr 19, 2018

I completely agree, it just becomes an issue once you have unmatched quotes, or quoted commas.

MattMenke2 commented Apr 19, 2018

I completely agree, it just becomes an issue once you have unmatched quotes, or quoted commas.

@MattMenke2

This comment has been minimized.

Show comment
Hide comment
@MattMenke2

MattMenke2 Apr 19, 2018

It seems like the only consistent handling may be to make commas take precedence over quotes, to the extent that commas can never appear in any field, quoted or not, that supports general merging like that.

MattMenke2 commented Apr 19, 2018

It seems like the only consistent handling may be to make commas take precedence over quotes, to the extent that commas can never appear in any field, quoted or not, that supports general merging like that.

@MattMenke2

This comment has been minimized.

Show comment
Hide comment
@MattMenke2

MattMenke2 Apr 19, 2018

Sorry, by "field" I mean value. i.e.,
foo: ","

would always be the same as
foo: "
foo: "

And
Content-Encoding: text/html; charset="foo,bar"
would be the same as:
Content-Encoding: text/html; charset="foo
Context-Encoding: bar"

MattMenke2 commented Apr 19, 2018

Sorry, by "field" I mean value. i.e.,
foo: ","

would always be the same as
foo: "
foo: "

And
Content-Encoding: text/html; charset="foo,bar"
would be the same as:
Content-Encoding: text/html; charset="foo
Context-Encoding: bar"

@MattMenke2

This comment has been minimized.

Show comment
Hide comment
@MattMenke2

MattMenke2 Apr 19, 2018

So, with those rules:

Content-Type: text/html; charset=foo
Content-Type: text/html
would be equivalent to "text/html; charset=foo"

and

Content-Type: text/html; charset=foo
Content-Type: text/plain
Content-Type: text/html
would be equivalent to "text/html"

I can live with that. If we could get away with it, I'd love if we could unconditionally just take the first one. I thought an earlier comment suggested that's what Edge and Safari do?

MattMenke2 commented Apr 19, 2018

So, with those rules:

Content-Type: text/html; charset=foo
Content-Type: text/html
would be equivalent to "text/html; charset=foo"

and

Content-Type: text/html; charset=foo
Content-Type: text/plain
Content-Type: text/html
would be equivalent to "text/html"

I can live with that. If we could get away with it, I'd love if we could unconditionally just take the first one. I thought an earlier comment suggested that's what Edge and Safari do?

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Apr 20, 2018

Member

On the topic of header parsing, I think the way to go would be to always combine (except for Set-Cookie) and then define all the parsers for the combined value. Then how you deal with quotes and such becomes a question on a per-header-parser basis. I.e., the foo header parser only ever sees bar,baz and is only invoked once. That seems like the only way to guarantee consistency.

As for Content-Type, I'll look a bit closer at Edge/Safari since you're willing to flip Chrome and report back.

Member

annevk commented Apr 20, 2018

On the topic of header parsing, I think the way to go would be to always combine (except for Set-Cookie) and then define all the parsers for the combined value. Then how you deal with quotes and such becomes a question on a per-header-parser basis. I.e., the foo header parser only ever sees bar,baz and is only invoked once. That seems like the only way to guarantee consistency.

As for Content-Type, I'll look a bit closer at Edge/Safari since you're willing to flip Chrome and report back.

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Apr 20, 2018

Member

Edge: from the tests I wrote, when the headers are separate it picks the first one. When the headers are combined, it tries to parse their entire value as a MIME type, not giving the comma any special consideration.

Safari: when the headers are separate it picks the last one (matching Chrome/Firefox), but does not carry over charset when the essence matches (not matching Chrome/Firefox). Downloads / (even without parameters).

(I think the earlier comment got misled by using unknown/unknown, which will trigger sniffing rules as per https://mimesniff.spec.whatwg.org/#determining-the-computed-mime-type-of-a-resource.)

Member

annevk commented Apr 20, 2018

Edge: from the tests I wrote, when the headers are separate it picks the first one. When the headers are combined, it tries to parse their entire value as a MIME type, not giving the comma any special consideration.

Safari: when the headers are separate it picks the last one (matching Chrome/Firefox), but does not carry over charset when the essence matches (not matching Chrome/Firefox). Downloads / (even without parameters).

(I think the earlier comment got misled by using unknown/unknown, which will trigger sniffing rules as per https://mimesniff.spec.whatwg.org/#determining-the-computed-mime-type-of-a-resource.)

@MattMenke2

This comment has been minimized.

Show comment
Hide comment
@MattMenke2

MattMenke2 Apr 20, 2018

Merging and handling commas on a per-header basis works, though I'd be worried about more breakages. And, at least in Chrome, it would require rewriting basically every handler, possibly significantly, just for the comma case - something that I'm worried will not end up happening.

MattMenke2 commented Apr 20, 2018

Merging and handling commas on a per-header basis works, though I'd be worried about more breakages. And, at least in Chrome, it would require rewriting basically every handler, possibly significantly, just for the comma case - something that I'm worried will not end up happening.

@reschke

This comment has been minimized.

Show comment
Hide comment
@reschke

reschke Apr 21, 2018

@MattMenke2 - header folding can be done the same way for all fields, as required by the HTTP spec. The only exception is Set-Cookie, as described in https://greenbytes.de/tech/webdav/rfc7230.html#rfc.section.3.2.2.

That said, parsing of these values needs to be specific to the header field, as it requires knowledge of the syntax of individual list elements. You can't split on "," dues to commas appearing in the literal value.

reschke commented Apr 21, 2018

@MattMenke2 - header folding can be done the same way for all fields, as required by the HTTP spec. The only exception is Set-Cookie, as described in https://greenbytes.de/tech/webdav/rfc7230.html#rfc.section.3.2.2.

That said, parsing of these values needs to be specific to the header field, as it requires knowledge of the syntax of individual list elements. You can't split on "," dues to commas appearing in the literal value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment