New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New MIME type parsing tests #10279

Merged
merged 4 commits into from Apr 4, 2018

Conversation

Projects
None yet
5 participants
@annevk
Member

annevk commented Apr 3, 2018

Tests suggested at #1851 (comment).

Co-Authored-By: Matt Menke mmenke@google.com

I didn't add these for now:

text/html;charset=" gbk"

Already tested.

text/html;charset= ";charset=GBK

I didn't see why this would end up as GBK. We'd have " \"" as charset parameter value and that would simply not result in an encoding. Why would we consult the second parameter?

text/html;charset=";charset=GBK";charset=GBK

Semi-colons are allowed, so this would not parse as GBK.

annevk and others added some commits Apr 3, 2018

New MIME type parsing tests
Tests suggested at #1851 (comment).

Co-Authored-By: Matt Menke <mmenke@google.com>
missed one
Co-Authored-By: Matt Menke <mmenke@google.com>

@wpt-pr-bot wpt-pr-bot added the mimesniff label Apr 3, 2018

@wpt-pr-bot

This comment has been minimized.

Collaborator

wpt-pr-bot commented Apr 3, 2018

There are no owners for this pull request besides its author. Please reach out on W3C's irc server (irc.w3.org, port 6665) on channel #testing (web client) to get help with this. Thank you!

@annevk annevk requested a review from domenic Apr 3, 2018

@annevk

This comment has been minimized.

Member

annevk commented Apr 3, 2018

@MattMenke2

This comment has been minimized.

Contributor

MattMenke2 commented Apr 3, 2018

Thanks for this!

For text/html;charset= ";charset=GBK - according to the algorithm, only quotes immediately after the space matters. So this would be interpreted as:

text/html;charset=" "";charset=GBK

Quotes aren't allowed in chraset values (See the "parameterName solely contains HTTP token code points" rule, which looks to exclude quotes), so the first charset value isn't set, so we fall back to the second.

Semicolons also don't look to be allowed: "An HTTP token code point is U+0021 (!), U+0023 (#), U+0024 ($), U+0025 (%), U+0026 (&), U+0027 ('), U+002A (*), U+002B (+), U+002D (-), U+002E (.), U+005E (^), U+005F (_), U+0060 (`), U+007C (|), U+007E (~), or an ASCII alphanumeric."

Am I missing some reason why semicolons and quotes would be allowed?

@MattMenke2

This comment has been minimized.

Contributor

MattMenke2 commented Apr 3, 2018

Oops - looks like I was flipping name and value rules. Sorry about that.

@w3c-bots

This comment has been minimized.

w3c-bots commented Apr 3, 2018

Build PASSED

Started: 2018-04-04 08:16:02
Finished: 2018-04-04 08:24:31

View more information about this build on:

@MattMenke2

This comment has been minimized.

Contributor

MattMenke2 commented Apr 3, 2018

So instead, I'd suggest:

text/html;charset= ";\xFF;charset=GBK (comes out as GBK, since the \xFF is considered part of the first chrset.)

text/html;charset="\xFF;charset=foo";charset=GBK (Comes out as GBK)

The first is to catch incorrect handling of a quote after a space. The second is intended to catch incorrect tokenization around semicolons.

@MattMenke2

This comment has been minimized.

Contributor

MattMenke2 commented Apr 3, 2018

Sorry, that first one should be:

text/html;charset= "\xFF;charset=GBK

@annevk

This comment has been minimized.

Member

annevk commented Apr 3, 2018

Did you mean 0x7F? 0xFF is in the allowed range of HTTP quoted-string token code point so the parameter would not be dropped.

@MattMenke2

This comment has been minimized.

Contributor

MattMenke2 commented Apr 3, 2018

Sorry again - yes, I meant \x7F (Or any disallowed character).

fix earlier test; add two more per feedback
Co-Authored-By: Matt Menke <mmenke@google.com>
@annevk

This comment has been minimized.

Member

annevk commented Apr 3, 2018

Cool, added those too.

@MattMenke2

This comment has been minimized.

Contributor

MattMenke2 commented Apr 3, 2018

Looks good, thanks! Looking at the others I suggested, I think if those two pass, all the others will as well (Modulo my confusion over which characters aren't allowed in values).

@@ -113,6 +125,24 @@
"navigable": true,
"encoding": "GBK"
},
{
"input": "text/html;charset=();charset=GBK",
"output": "text/html;charset=GBK",

This comment has been minimized.

@domenic

domenic Apr 3, 2018

Member

I get text/html;charset="()" for this. https://mimesniff.spec.whatwg.org/#http-quoted-string-token-code-point includes ( and ) so () is a fine charset.

This comment has been minimized.

@MattMenke2

MattMenke2 Apr 3, 2018

Contributor

Yea, this we my mistake - was confused about the allowed characters.

This comment has been minimized.

@MattMenke2

MattMenke2 Apr 3, 2018

Contributor

*was my mistake

This comment has been minimized.

@annevk

annevk Apr 4, 2018

Member

Fixed!

address feedback by @domenic
Co-Authored-By: Matt Menke <mmenke@google.com>

@annevk annevk merged commit ef7fe9a into master Apr 4, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@annevk annevk deleted the annevk/mime-type-parsing branch Apr 4, 2018

@annevk annevk restored the annevk/mime-type-parsing branch Apr 5, 2018

@MattMenke2

This comment has been minimized.

Contributor

MattMenke2 commented Apr 5, 2018

I don't think this one is correct: https://github.com/w3c/web-platform-tests/pull/10279/files#diff-4d2126605002e90f20a8cb91502ede9cR135

Per the parsing rules, the quoted string should run up to the end of GBK. The encoding should be null, I believe.

Sorry for the late comment - was just looking over the tests where Chrome fails as I work on fixing most of the remianing difference. One or two I'll probably leave failing, unless other browsers show the same behavior - I'm a bit paranoid about making the charset= "GBK" ones pass, in particular.

@domenic

This comment has been minimized.

Member

domenic commented Apr 5, 2018

That one still works, because of the space before the quote. The spec only interprets a parameter as quoted if it sees the exact sequence =".

So it sees two versions of charset, one with value "\u007F and one with value GBK. But the first one is outside the range of HTTP quoted-string token code points, so it doesn't get set, and the GBK one does.

@MattMenke2

This comment has been minimized.

Contributor

MattMenke2 commented Apr 5, 2018

You're right, thanks...and I think it was even exactly as I suggested, the space before quotes rule is just very counter-intuitive.

@annevk annevk deleted the annevk/mime-type-parsing branch Apr 6, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment