Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do separators need to be preserved when parsing? #39

Closed
domenic opened this issue Oct 10, 2017 · 24 comments
Closed

Do separators need to be preserved when parsing? #39

domenic opened this issue Oct 10, 2017 · 24 comments

Comments

@domenic
Copy link
Member

@domenic domenic commented Oct 10, 2017

Spinning off from #36 to discuss this specific issue.

In https://github.com/jsdom/content-type-parser we, for some reason, preserved the separator between MIME type parameters. (So, e.g., it maybe ; or ; with a space or something else.) This means that when you parse-then-serialize, the separators are preserved.

I believe this might have been necessary to pass some of the XMLHttpRequest or File API web platform tests? I'm not sure; perhaps @nicolashenry, the original author of that code, remembers.

Browsing through usages in jsdom it appears we make use of this ability in FileReader, which (at least in our implementation) parses-then-serializes the Blob's type when creating the data: URL. We also use it when creating a Blob xhr.response, to set the blob's type value from the parse-then-serialize of the Content-Type header.

Maybe there are web platform tests that assume this, but should not, because it's pretty silly?

I suppose we could try changing this in jsdom and see what tests start failing.

@annevk
Copy link
Member

@annevk annevk commented Oct 10, 2017

I was hoping we could simplify this at least somewhat, but it's indeed different from what implementations do. They mostly pass through the input (or remove everything but "charset"), which seems rather broken.

@nicolashenry
Copy link

@nicolashenry nicolashenry commented Oct 10, 2017

If I remember well, this was necessary to pass this XMLHttpRequest web platform test :
https://github.com/w3c/web-platform-tests/blob/master/XMLHttpRequest/send-content-type-charset.htm

@annevk
Copy link
Member

@annevk annevk commented Oct 11, 2017

Well, when you set the Content-Type header no MIME type parsing should take place (at least not at that level). So if that's the test it would indicate a bug in jsdom of sorts.

@domenic
Copy link
Member Author

@domenic domenic commented Oct 11, 2017

You need to parse it and replace charset per https://xhr.spec.whatwg.org/#dom-xmlhttprequest-send step 4 "otherwise".

@annevk
Copy link
Member

@annevk annevk commented Oct 11, 2017

Ah, my bad. That's a very good point. Thanks!

@annevk
Copy link
Member

@annevk annevk commented Oct 11, 2017

This is actually a rather big problem as this seemingly preserves all kinds of garbage. Parameters without values are not really a thing per the official MIME type definition for instance.

@bzbarsky do you know to what extent we need to preserve the exact behavior of https://www.w3c-test.org/XMLHttpRequest/send-content-type-charset.htm? Do we need to take some other approach to MIME types than we do for other formats? Not have a clear parse / model / serialize separation?

@annevk
Copy link
Member

@annevk annevk commented Oct 11, 2017

We could of course have a one-off parser just for XMLHttpRequest's Content-Type request header's charset parameter processing, but that does not seem great.

@annevk annevk mentioned this issue Oct 12, 2017
7 of 7 tasks complete
@domenic
Copy link
Member Author

@domenic domenic commented Oct 12, 2017

Maybe @tyoshino / @ricea have a perspective here.

@annevk
Copy link
Member

@annevk annevk commented Oct 12, 2017

Good point, and @yutakahirano.

@yutakahirano
Copy link
Member

@yutakahirano yutakahirano commented Oct 13, 2017

...then set all the `charset` parameters whose value is not a byte-case-insensitive match for encoding of that header’s value to encoding.

This doesn't sound like a text replacement. I'm under an impression that equivalent representations such as extra spaces or quotation are also allowed.

@annevk
Copy link
Member

@annevk annevk commented Oct 13, 2017

Yeah, but it so happens that's not what implementations do (or what we test for). If that is web compatible there might not be an issue here though.

@annevk
Copy link
Member

@annevk annevk commented Oct 13, 2017

It's also not clear to me that an internal representation that supports duplicate parameter names is a good one.

@annevk annevk mentioned this issue Oct 13, 2017
4 of 4 tasks complete
@foolip
Copy link
Member

@foolip foolip commented Oct 18, 2017

So, things that we'll need to go test:

  • XMLHttpRequest
  • File API
  • Anything for Fetch?
@foolip
Copy link
Member

@foolip foolip commented Oct 18, 2017

I suppose we're all hoping that all APIs could either pass through strings verbatim, or parse+serialize. But it looks like what's actually implemented in some cases is something funkier?

foolip added a commit to web-platform-tests/wpt that referenced this issue Oct 18, 2017
Do not merge, testing only.
@foolip
Copy link
Member

@foolip foolip commented Oct 18, 2017

About https://wpt.fyi/XMLHttpRequest/send-content-type-charset.htm, there's impressively little agreement about some of the cases, which I guess is good news in a way if we want to change things.

I added some cases in web-platform-tests/wpt#7882 to see what happens to whitespace after semicolon.

@domenic
Copy link
Member Author

@domenic domenic commented Oct 18, 2017

I suppose we're all hoping that all APIs could either pass through strings verbatim, or parse+serialize. But it looks like what's actually implemented in some cases is something funkier?

Well, verbatim isn't quite an option for the XHR stuff at least. And in general it'd be much nicer to do parse+serialize. (E.g., it would match how URLs are handled.)

In general it seems likely that changing this stuff is web-compatible, yes :). I am hopeful that @annevk's work will create a model everyone can work toward.

@foolip
Copy link
Member

@foolip foolip commented Oct 25, 2017

OK, so from web-platform-tests/wpt#7882 we can conclude:

  • Everyone preserves space (or lack of space) before/after semicolon
  • Chrome+Safari turns both charset=utf-8 and charset=bogus into charset=UTF-8
  • Edge doesn't turn bogus charsets into UTF-8 or uppercase charset=utf-8 into UTF-8
  • Firefox turns charset=bogus into charset=UTF-8, but doesn't uppercase charset=utf-8

So, yes, separators do need to be preserved when parsing!

@foolip foolip mentioned this issue Oct 25, 2017
3 of 3 tasks complete
@annevk
Copy link
Member

@annevk annevk commented Oct 25, 2017

Well, if it's web-compatible I'd much rather not preserve them. Keeping syntax around in the object representation is rather ugly. And if you cared about preserving existing behavior to that extent we'd end up with several different MIME type parsers/scanners rather than a single unified model. Existing code bases aren't exactly great.

@foolip
Copy link
Member

@foolip foolip commented Oct 25, 2017

Are there other contexts where whitespace isn't preserved? It's very hard to guess what the compat implications of changing that are. Presumably implementations don't have a clear parse / model / serialize separation here, but rather something like "fix up MIME string" and "extract information from MIME string"?

@yutakahirano
Copy link
Member

@yutakahirano yutakahirano commented Oct 25, 2017

The current chromium implementation is "search-and-replace" which I don't love. I would like to replace the implementation with a parse-then-serialize impl when I have time, but adding whitespace preservation as a requirement will make the parse-then-serialize implementation much harder.

@foolip
Copy link
Member

@foolip foolip commented Oct 26, 2017

@yutakahirano, if you'd like to implement and ship such a model, that makes me more optimistic about it. I have no idea how to estimate the compat risk of this, but I suppose a start would be to see what kind of normalization (space after semicolon or not) would result in the fewest changes in the wild.

@yutakahirano
Copy link
Member

@yutakahirano yutakahirano commented Nov 15, 2017

I added a use counter to Chromium. It will tell us how risky it is to change the current behavior.

chromium-wpt-export-bot added a commit to web-platform-tests/wpt that referenced this issue Nov 16, 2017
This CL introduces mime type parser and stringifier to
wpt/XMLHttpRequest/send-content-type-charset in order to accept
implementations that is actually conforming to the spec but was rejected
by the test due to some text representation errors.

Bug: whatwg/mimesniff#39
Change-Id: I99466e2e596bb9c1b7f11267ad4ff0a886913086
chromium-wpt-export-bot added a commit to web-platform-tests/wpt that referenced this issue Nov 16, 2017
This CL introduces a mime type parser and stringifier to
wpt/XMLHttpRequest/send-content-type-charset in order to accept
implementations that are actually conforming to the spec but were rejected
by the test due to some text representation errors.

Bug: whatwg/mimesniff#39
Change-Id: I99466e2e596bb9c1b7f11267ad4ff0a886913086
chromium-wpt-export-bot added a commit to web-platform-tests/wpt that referenced this issue Nov 21, 2017
This CL introduces a mime type parser and stringifier to
wpt/XMLHttpRequest/send-content-type-charset in order to accept
implementations that are actually conforming to the spec but were rejected
by the test due to some text representation errors.

Bug: whatwg/mimesniff#39
Change-Id: I99466e2e596bb9c1b7f11267ad4ff0a886913086
chromium-wpt-export-bot added a commit to web-platform-tests/wpt that referenced this issue Nov 21, 2017
This CL introduces a mime type parser and stringifier to
wpt/XMLHttpRequest/send-content-type-charset in order to accept
implementations that are actually conforming to the spec but were rejected
by the test due to some text representation errors.

Bug: whatwg/mimesniff#39
Change-Id: I99466e2e596bb9c1b7f11267ad4ff0a886913086
chromium-wpt-export-bot added a commit to web-platform-tests/wpt that referenced this issue Nov 21, 2017
This CL introduces a mime type parser and stringifier to
wpt/XMLHttpRequest/send-content-type-charset in order to accept
implementations that are actually conforming to the spec but were rejected
by the test due to some text representation errors.

Bug: whatwg/mimesniff#39
Change-Id: I99466e2e596bb9c1b7f11267ad4ff0a886913086
@annevk
Copy link
Member

@annevk annevk commented Nov 24, 2017

To be clear, my hope is that we can resolve this issue by changing XMLHttpRequest and the test in web-platform-tests: web-platform-tests/wpt#8422.

I'd rather not preserve separator syntax, duplicate parameters, and invalid parameters as required by those tests currently. That seems way too much complexity for a such a niche use case.

@annevk
Copy link
Member

@annevk annevk commented Nov 29, 2017

I think we've reached agreement here. Tests have been written as well. What's still missing is an update to the XMLHttpRequest Standard, but I'll write that after we land #36 I suspect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.