Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamp MIME type section #36

Merged
merged 18 commits into from Dec 7, 2017
Merged

Revamp MIME type section #36

merged 18 commits into from Dec 7, 2017

Conversation

@annevk
Copy link
Member

@annevk annevk commented Oct 6, 2017

TODO:

  • Update the remainder of the document to use the new terminology
  • Add byte sequence parser and serializer
  • Define parameters as ASCII strings too

Preview | Diff

@annevk
Copy link
Member Author

@annevk annevk commented Oct 6, 2017

This contains just parsing MIME types. I haven't actually changed everything yet since I figured I'd ask for some feedback first.

My idea was to align this with URL. So we have MIME types, which are also known as MIME type records. And MIME type strings, which serve as input (maybe I'll define that part later, I'd like to focus on implementation requirements initially). We should probably also have some byte sequence entry points for most of these, but since going from bytes to strings is easy with isomorphic decode I thought making the main parser string-based is best.

Thoughts?

@domenic
Copy link
Member

@domenic domenic commented Oct 6, 2017

I like the general plan of aligning with URL. Although the spec is already like that, just a bit dated, right? (E.g. using multiple return values instead of a struct.)

I'd like to hear more justification for using strings as input, instead of bytes. What call sites use which? I'm not aware of how many call sites would use this parsing algorithm, so it's hard to judge. Bytes seems more correct though at first impression. If you do go with strings, it should be stated to be a JS string, I think.

The current spec has XXX boxes that are, as far as I can tell, designed to limit certain things to 127 bytes. Those have disappeared, but it seems important to do some testing there, or at least preserve some kind of XXX box.

@annevk
Copy link
Member Author

@annevk annevk commented Oct 6, 2017

XMLHttpRequest's overrideMimeType() in particular would be hard to define if it was based on bytes. You could roundtrip with UTF-8 and probably be fine though. But also, we have much more utilities for processing strings.

@GPHemsley
Copy link
Member

@GPHemsley GPHemsley commented Oct 7, 2017

Indeed, I believe some spec somewhere (an RFC, maybe?) had a limit of 127 bytes, but I questioned whether it was actually enforced by any implementation.

@annevk
Copy link
Member Author

@annevk annevk commented Oct 9, 2017

Content-Type: text/html;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;charset=gbk results in GBK in all implementations. We also generally don't do limits, so it seems good to get rid of that.

@annevk
Copy link
Member Author

@annevk annevk commented Oct 9, 2017

I also don't think we should state a specific string type, this works with all of them after all. Why require casts?

@domenic
Copy link
Member

@domenic domenic commented Oct 9, 2017

You're right, I forgot that our only two string types were JS string and SV string.

I'm still not convinced this should be strings instead of bytes though. The lack of utilities for parsing/manipulating byte sequences is fixable. I'd like to get an accounting of the call sites before we decide one way or another.

The XHR overrideMimeType() example is interesting precisely because I'm unsure how it should behave given that it operates on a DOMString (instead of a ByteString). What bytes do we actually transmit over HTTP? It seems like the spec right now sometimes stores a string in the "override MIME type" field, e.g. overrideMimeType() step 3, but sometimes stores a byte sequence, e.g. overrideMimeType() step 2.

Your version not only accepts strings as inputs, but also creates them as outputs in the MIME type struct, which seems quite bad if we're eventually sending these over the wire?

@annevk
Copy link
Member Author

@annevk annevk commented Oct 10, 2017

What bytes do we actually transmit over HTTP?

None, overrideMimeType() is an override for the response. We could use USVString and UTF-8 encode I think, without observable effects, but I'm not sure that's useful as there are other callers that want to operate on strings, such as several attributes defined in HTML. data: URLs and Content-Type need byte-based processing, but it's easy enough to make them use a wrapper.

@annevk
Copy link
Member Author

@annevk annevk commented Oct 10, 2017

And to be clear, I do intend to provide a "parse a MIME type from bytes" and "serialize a MIME type to bytes" for those cases, with the necessary asserts on the input for the latter of the two.

@GPHemsley
Copy link
Member

@GPHemsley GPHemsley commented Oct 14, 2017

Content-Type: text/html;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;charset=gbk results in GBK in all implementations. We also generally don't do limits, so it seems good to get rid of that.

The (supposed) 127-byte limit was on the individual portions (type, subtype, parameter name, parameter value), not the overall MIME type.

@annevk
Copy link
Member Author

@annevk annevk commented Oct 16, 2017

text/html;0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789=x;charset=gbk yields GBK. I also haven't found any evidence of it in implementations for which I could inspect the source code.

@foolip
Copy link
Member

@foolip foolip commented Oct 25, 2017

From some testing in #39 I think that we do need to preserve whitespace around separators. For charset normalization, bogus charsets should turn into UTF-8.

For charset=utf-8, it's debatable. It's plausible that stuff could depend on both the encoding being untouched and it being normalized. I think I have a slight preference to normalize it, since that is what document.charset is supposed to do.

@domenic
Copy link
Member

@domenic domenic commented Nov 14, 2017

We should test what happens with non-ASCII characters in the various segments of the MIME type; the spec seems to just ASCII lowercase the strings and pass them through. I wonder if that's how browsers treat them. And I wonder if browsers treat them differently when given HTTP header bytes or other byte-accepting entry points vs. XHR overrideMimeType or other string-accepting APIs.

@domenic
Copy link
Member

@domenic domenic commented Nov 14, 2017

As far as I can tell the proposed spec skips whitespace after the = but collects it before the = sign. http://httpwg.org/specs/rfc7231.html says no whitespace is allowed. We should test what browsers do in such scenarios if you haven't already. If you have, adding a note about this potentially-confusing mismatch would be good. Both that it mismatches the RFC, and that it treats before different than after.

@annevk
Copy link
Member Author

@annevk annevk commented Nov 14, 2017

It doesn't mismatch the RFC anymore than anything else that the RFC would say is invalid and is simply consumed as part of the token here, no? (There are tests for this already and browser bugs have been filed, see #34.)

@domenic
Copy link
Member

@domenic domenic commented Nov 14, 2017

Well, I don't see anything about consuming as part of a token in the RFC. But as far as I can tell the RFC has token both before and after the = sign, whereas you ignore whitespace before the = sign, but preserve it afterward.

I see you tested spaces before the = sign, but did you test spaces after?

@annevk
Copy link
Member Author

@annevk annevk commented Nov 14, 2017

How is it ignored before the = sign? I strip it after currently, but that should be dropped as the Encoding Standard handles that already for encodings.

@domenic
Copy link
Member

@domenic domenic commented Nov 14, 2017

Sorry, I got confused between my two posts. You ignore it after the = sign, but don't ignore it before the = sign. #34 has tests for before the = sign, which I guess is what led you to the behavior of not skipping whitespace there. I was wondering if there are any tests for after the = sign, which would help decide on spaces or no there.

It'd be ideal if there were a way of testing this independent of encoding handling, but I guess there is not in browsers today.

@annevk
Copy link
Member Author

@annevk annevk commented Nov 14, 2017

If browsers agree with all the proposed changes we could test it through data URLs and XMLHttpRequest, but that would first require browsers to actually start storing unknown parameters and such. So yeah, not in today's browsers.

@domenic
Copy link
Member

@domenic domenic commented Nov 14, 2017

Well, something to keep in mind at least; it'd be nice to write such tests and ask browsers to follow that model. But you're more in touch with whether that's realistic than I am.

@annevk
Copy link
Member Author

@annevk annevk commented Nov 14, 2017

I already wrote such a test: web-platform-tests/wpt#6890 (review).

I thought we needed to distinguish for overrideMimeType but not so
@annevk annevk mentioned this pull request Nov 29, 2017
7 of 7 tasks complete
@annevk annevk requested a review from domenic Nov 29, 2017
Copy link
Member

@domenic domenic left a comment

Found some things while writing a JS implementation. Will report back shortly on whether that implementation matches the tests...

<li>
While <var>sequence</var>[<var>s</var>] is <a>ASCII whitespace</a>, continuously execute the
following steps:
<li><p>If the current <a>code point</a> in <var>input</var> is U+003B (;), then

This comment has been minimized.

@domenic

domenic Nov 30, 2017
Member

"current code point" is a bit imprecise; best to mention position.

following steps:
<p>If <var>position</var> is not past the end of <var>input</var> and the current
<a>code point</a> in <var>input</var> is U+005C (\), then advance <var>position</var> to
the next <a>code point</a> in <var>input</var> and:

This comment has been minimized.

@domenic

domenic Nov 30, 2017
Member

Here and below, put the "advance position to..." inside the list of statements covered by the if, instead of putting it on the same line as the if.

SIGN character ("<code>=</code>"), exit loop
<var>M</var>.
<li><a>Collect a sequence of code points</a> that are not U+003B (;) from <var>input</var>,
given <var>position</var>.

This comment has been minimized.

@domenic

domenic Nov 30, 2017
Member

An example inline here of the kind of garbage this throws away would be nice. In general whenever you collect but don't do something with the results, informatively stating what that does is a good idea.

<li>
Append <var>sequence</var>[<var>s</var>] to <var>value</var>.
<ul class=brief>
<li>neither <var>parameterName</var> nor <var>parameterValue</var> are the empty string

This comment has been minimized.

@domenic

domenic Nov 30, 2017
Member

State this as two conditions: parameterName is not the empty string, and parameterValue is not the empty string

Initialize <var>u</var> to 0.
<li><p>Let <var>mimeType</var> be a new <a>MIME type record</a> whose <a for="MIME type">type</a>
is <var>type</var>, in <a>ASCII lowercase</a>, and <a for="MIME type">subtype</a> is
<var>subtype</var>, in <a>ASCII lowercase</a>.

This comment has been minimized.

@domenic

domenic Nov 30, 2017
Member

You need to set its parameters to an empty ordered map here.

This comment has been minimized.

@annevk

annevk Dec 4, 2017
Author Member

I already have "It is initially empty." by the definition of MIME type record. I think that's better as folks will likely forget to initialize it.

This comment has been minimized.

@domenic

domenic Dec 4, 2017
Member

Hmm, stating half the initial values here and half the initial values elsewhere makes the translation into code a decent bit harder :(

This comment has been minimized.

@annevk

annevk Dec 4, 2017
Author Member

@domenic presumably when you'd write code for creating MIME type records you'd look at the definition of MIME type records and not at the creation of a single instance?

This comment has been minimized.

@domenic

domenic Dec 4, 2017
Member

In the past I've always been able to treat records as JavaScript objects, e.g.

const record1 = { type: specifiedType, subtype: specifiedSubtype };

Now you are essentially saying I have to think of them as JavaScript classes, e.g.

class MIMETypeRecord {
  constructor(type, subtype) {
   this.parameters = new Map();
   this.type = type;
   this.subtype = subtype;
  }
}

const record2 = new MIMETypeRecord(specifiedType, specifiedSubtype);

That's a bit unfortunate, and a lot of ceremony for records, which I consider to be more lightweight abstractions. Instead I'll probably give up on 1:1 spec to implementation mapping and, as you say, just do

const record3 = {
  parameters: new Map() // this is specified at https://mimesniff.spec.whatwg.org/..., outside the current algorithm we're implementing
  type: specifiedType,
  subtype: specifiedSubtype
};

It's not fatal, but it's just annoying to have to jump all over the spec to find out the values that are actually being set. Or worse, not notice until your tests start failing, as I did.

Even if you add "and with its parameters set to the default value specified in the definition of MIME type record", that would help explicitly signpost that the reader has to jump back and forth.

This comment has been minimized.

@annevk

annevk Dec 4, 2017
Author Member

Hmm, URL records also default all over the place. What specification construct are you thinking of?

This comment has been minimized.

@domenic

domenic Dec 4, 2017
Member

I guess that's true; I rescind this then. Here the contrast is more vivid, since you can easily be misled into thinking that you've got all the fields, because it's easy to forget that you have to refer back to the definition for that one field out of three. But what can you do.

For each character <var>char</var> in
<var>parameters</var>[<var>name</var>], execute the following steps:
<li>
<p>If <var>value</var> does not solely contain <a>HTTP token code points</a>:

This comment has been minimized.

@domenic

domenic Nov 30, 2017
Member

Shouldn't this be quoted-string?

This comment has been minimized.

@domenic

domenic Nov 30, 2017
Member

Nevermind, we quote things only if they are not bare tokens...

<li>
Enter loop <var>M</var>:
<p>If the current <a>code point</a> in <var>input</var> is U+0022 ("), then advance

This comment has been minimized.

@yutakahirano

yutakahirano Dec 1, 2017
Member

This part accepts (and ignores) unterminated quoted strings, such as 'text/plain; charset="utf-8' or 'text/plain; charset="utf-8; param2=value2'. I think it's too lax, what do you think? Is it OK to fail parsing in such cases?

This comment has been minimized.

@yutakahirano

yutakahirano Dec 1, 2017
Member

Hmm, it seems your design principle is to let the parser parse type and subtype whenever they are sane (i.e., regardless of the "parameter" section). Is that right? Then this behavior may be OK...

This comment has been minimized.

@annevk

annevk Dec 1, 2017
Author Member

When testing I found that only Safari would treat text/html;charset="gbk not as GBK. Therefore I went with the majority. But yes, nobody returned failure and started downloading such a resource, so I don't think we can start with that now.

@domenic
Copy link
Member

@domenic domenic commented Dec 1, 2017

I think "exclude parameters" on the serializer is useless because we already have "essence". Instead of saying e.g. "Let result be mimeType serialized without parameters", you'd just say "Let result be mimeType's essence".

domenic added a commit to jsdom/whatwg-mimetype that referenced this pull request Dec 2, 2017
@annevk
Copy link
Member Author

@annevk annevk commented Dec 4, 2017

Commit message:

Define a new MIME type model, parser, and serializer

This addresses all open inline issues with respect to the parser and serializer, aligns both closer with implementations, except where those stood in the way of an improved model.

This also updates all of it to make extensive use of the Infra Standard.

See #42 for the testing story (included all linked issues) and https://github.com/w3c/web-platform-tests/pull/7764 for the majority of tests.
@annevk annevk merged commit cc81ec4 into master Dec 7, 2017
2 checks passed
2 checks passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
@annevk annevk deleted the annevk/mime-type branch Dec 7, 2017
annevk added a commit to web-platform-tests/wpt that referenced this pull request Dec 7, 2017
domenic added a commit to whatwg/html that referenced this pull request Feb 5, 2018
This follows whatwg/mimesniff#58 by referencing
the definitions for JavaScript and JSON MIME type that now live in MIME
Sniffing. It also follows whatwg/mimesniff#36 by
using the terms "valid MIME type string" and "valid MIME type string
without parameters" instead of their non-string counterparts that
previously appeared.
domenic added a commit to whatwg/html that referenced this pull request Feb 16, 2018
This follows whatwg/mimesniff#58 by referencing
the definitions for JavaScript and JSON MIME type that now live in MIME
Sniffing. It also follows whatwg/mimesniff#36 by
using the terms "valid MIME type string" and "valid MIME type string
without parameters" instead of their non-string counterparts that
previously appeared.
domenic added a commit to whatwg/html that referenced this pull request Feb 16, 2018
This follows whatwg/mimesniff#58 by referencing
the definitions for JavaScript and JSON MIME type that now live in MIME
Sniffing. It also follows whatwg/mimesniff#36 by
using the terms "valid MIME type string" and "valid MIME type string
without parameters" instead of their non-string counterparts that
previously appeared. Finally, it updates the terms "explicitly supported
XML/JSON type" to include the word "MIME", like other MIME type group
definitions now do.
annevk added a commit to web-platform-tests/wpt that referenced this pull request Apr 10, 2018
annevk added a commit to web-platform-tests/wpt that referenced this pull request Apr 16, 2018
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Apr 26, 2018
…ng, a=testonly

Automatic update from web-platform-testsAdjust XMLHttpRequest Content-Type handling

See whatwg/xhr#176 and whatwg/mimesniff#36.
--

wpt-commits: 84e7972a0518fb57f39740143d4b63e79b14e9f4
wpt-pr: 8422
alice added a commit to alice/html that referenced this pull request Jan 8, 2019
This follows whatwg/mimesniff#58 by referencing
the definitions for JavaScript and JSON MIME type that now live in MIME
Sniffing. It also follows whatwg/mimesniff#36 by
using the terms "valid MIME type string" and "valid MIME type string
without parameters" instead of their non-string counterparts that
previously appeared. Finally, it updates the terms "explicitly supported
XML/JSON type" to include the word "MIME", like other MIME type group
definitions now do.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.