Add XML declaration encoding sniffing #1752

domenic · 2016-09-07T01:56:01Z

Closes #1438, where we found out that this is required for web
compatibility. The algorithm given here is an exact copy of that used by
WebKit and Blink, with the exception that it does not detect UTF-32 byte
sequences since in web-standards-world, UTF-32 must not be supported.

@bzbarsky FYI. Would appreciate some careful editor review on this (@zcorpan seems especially good at byte-parsing algorithms) as while I tried to match these algorithms to the surrounding style, I'm not sure I succeeded. It would also be good to have someone else double-check the correspondence between this spec text and Blink/WebKit's algorithms; the fact that they use the ASCII XML declaration as a hint but then also check meta charset was a bit strange, but I think I captured that.

(PR template added by @hsivonen )

At least two implementers are interested (and none opposed):
- @hsivonen (Gecko)
- Matches existing Chromium and WebKit behavior
Tests are written and can be reviewed and commented upon at:
- Tests https://hsivonen.com/test/moz/xml-decl/
- WPT wrapper https://phabricator.services.mozilla.com/differential/changeset/?ref=3688096
Implementation bugs are filed:
- Chrome: https://bugs.chromium.org/p/chromium/issues/detail?id=1188006
- Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=673087
- Safari: https://bugs.webkit.org/show_bug.cgi?id=223174

/parsing.html ( diff )

bzbarsky · 2016-09-07T02:40:18Z

See #1438 (comment) for my take on the WebKit/Blink behavior around parsing the XML declaration... Note that Edge does not do that, which is why I kept trying to get some information on what it does do.

zcorpan · 2016-09-07T08:35:04Z

source

+   of bytes 0x65, 0x6E, 0x63, 0x6F, 0x64, 0x69, 0x6E, 0x67 (ASCII 'encoding') at or after the
+   current <var>encodingPosition</var>. If there is no such sequence, abort the <span
+   data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding algorithm</span> algorithm
+   and return failure.</p></li>


We could return failure here if the byte before the 'e' is not less than or equal to 0x20, to fix the fooencoding="utf-8" problem. (I haven't tested what Edge does though.)

Alternatively we should tokenize attributes as is done for meta scanner (consider <?xml foo=" encoding='utf-8'"?>)

Kept the WebKit-compatible behavior.

domenic · 2016-09-07T16:41:18Z

I'm going to tag this as "do not merge yet" until we either:

Decide to just go with what is 2/4 implemented (this PR, currently)
Decide to come up with an improved algorithm, do so, and get buy in from 2/4 browsers on the improved algorithm (so presumably Firefox + 1 more).

hsivonen · 2020-04-08T11:06:10Z

I think we should add WebKit/Blink-compatible XML declaration sniffing.

hsivonen · 2020-04-08T11:08:12Z

(I disapprove of WebKit improvising on this back when it did, but now that IE and Spartan are going away, the behavior is part of the compatibility legacy.)

annevk · 2020-04-08T11:42:57Z

I rebased this, addressed the obvious feedback, but two comments from Simon are still remaining. If someone has a pointer to where this is implemented in Chromium/WebKit I'd be happy to read through that and do a comparison.

I also wonder if there's a better way of writing this down as the jumping around doesn't seem great.

zcorpan · 2020-04-08T11:57:28Z

I'm OK with going with the implemented behavior. The parts that are bogus can be called out with notes.

TimothyGu · 2020-04-08T13:29:44Z

Here’s where this is in Chrome, I believe: https://source.chromium.org/chromium/chromium/src/+/master:third_party/blink/renderer/core/html/parser/text_resource_decoder.cc;l=301;drc=e141b41381b928e84ce377a39484e46e0cff7c02

hsivonen

There are some clear bugs and one issue that matches Blink, but I think matching Blink is harmful due to the network buffer boundary dependency.

hsivonen · 2020-04-08T14:48:48Z

source

@@ -102094,6 +102128,55 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
   step.</p></li>
  </ol>

+  <p>When the <span>prescan a byte stream to determine its encoding</span> algorithm says to <dfn
+  data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</dfn>, it means doing
+  this. If at any point during these steps the <var>encodingPosition</var> pointer created in the


This seems to match Blink, but I think it's bad to make this dependent on network byte buffer boundaries. If I was writing this from scratch, I'd make a state machine that'd fail fast if it doesn't match encoding= next, but let's try to stay closer to what Blink does. I'd prefer to require scanning exactly the first 1024 for >. However, the spec presently allows for stopping earlier, which is bad.

I suggest saying that the browser must scan for > for as far as it would scan for meta charset.

Oops. I didn't fix this one, yet.

source

hsivonen · 2020-04-08T15:09:05Z

Hmm. In fact, since meta charset takes precedence, I suggest placing this step after the meta prescan: If the meta prescan fails, perform this sniffing over the same bytes that the meta prescan was performed over.

hsivonen · 2020-04-08T15:10:16Z

Hmm. Or at least meta prescan takes precedence in the ASCII superset case. Need to test the UTF-16LE/BE cases, still.

hsivonen · 2020-04-08T16:42:46Z

It looks like the UTF-16 cases take precedence over meta, but meta takes precedence over ASCII-based XML decl.

Proof:
https://hsivonen.com/test/moz/xml-decl.htm
https://hsivonen.com/test/moz/xml-decl-16.htm
https://hsivonen.com/test/moz/xml-decl-no-space-after-xml.htm
https://hsivonen.com/test/moz/xml-decl-garbage-after-xml.htm

So, the UTF-16 cases should be checked before meta prescan, and waiting for 6 bytes should be required.

If meta prescan fails, the ASCII XML declaration scan should run exactly on the same bytes that the meta prescan ran over.

domenic · 2020-04-08T16:44:50Z

So cool to see this revived; thanks @hsivonen for the prod and the review, and thanks @annevk for rebasing.

Since this was originally proposed the WHATWG has adopted a tests-required policy, so I hope we can also take the opportunity to add web platform tests for these cases. @hsivonen, it seems like you're working on this area in Gecko; will you be able to do that as part of your implementation?

hsivonen · 2020-04-08T16:45:37Z

I can write tests, yes.

domenic · 2020-06-03T19:22:36Z

Ping @hsivonen; did you end up writing tests for this?

hsivonen · 2021-03-01T12:05:43Z

Manual demos, not in WPT form yet: https://hsivonen.com/test/moz/xml-decl/

hsivonen · 2021-03-02T14:35:24Z

I'm unhappy that it looks like WebKit managed to give UTF-16 <?xml precedence that doesn't match the BOM's precedence:
https://hsivonen.com/test/moz/xml-decl/utf-16-vs-http.html
https://hsivonen.com/test/moz/xml-decl/utf-16-and-meta.htm

The precedence appears to be:

BOM
HTTP
UTF-16 <?xml
meta
ASCII <?xml

It's sad that WebKit added a check for UTF-16 <?xml at all considering the UTF-16 XML must have a BOM.

hsivonen · 2021-03-02T14:44:26Z

Looks like WebKit and Blink don't use get an encoding: https://hsivonen.com/test/moz/xml-decl/space-around-label.htm

hsivonen · 2021-03-03T08:53:02Z

WebKit and Blink don't trim ASCII whitespace when resolving the encoding label from an XML declaration in text/html but do perform alias resolution:
https://hsivonen.com/test/moz/xml-decl/cp1251.htm
https://hsivonen.com/test/moz/xml-decl/WINDOWS.htm
https://hsivonen.com/test/moz/xml-decl/space-around-label.htm

This makes sense in the light of the XML spec not allowing ASCII whitespace around the label:
https://www.w3.org/TR/REC-xml/#NT-EncodingDecl

If we don't want to introduce get an encoding without trimming in the Encoding Standard, we can make the spec text here bail out if there's ASCII whitespace inside the quotes.

domenic · 2021-03-15T20:01:29Z

I'm sorry, I'm hearing a lot of history, some "this is strange", and some of what seems like the blame game. But I still haven't heard a clear articulation of what the benefits would be, either to web users, or to implementations, or to interop... Could you phrase this in terms of, e.g., pages that are broken with WebKit and Blink's current implementation, but would be fixed by the spec change you're proposing here? It sounds like a small change, but if there's literally no benefits, then it's hard to argue for it.

Perhaps another way of phrasing it is, what part of the priority of constituencies is this change targeting?

hsivonen · 2021-03-16T06:03:47Z

The benefit is cross-engine consistency (i.e. interop) on how far to scan. Concretely, whether it's legitimate to land WPTs that test exactly how far to scan.

At this point, it's not known that it fixes or breaks any non-test page in particular; the limit is substantially higher than the length of a plausible XML declaration.

For other features, we typically specify exact behavior and write tests for the boundary conditions, even if there's no evidence of concrete Web compat concerns at hand. The expected benefit for testing an exact boundary condition here is low, but the same argument could be made about many other edge case tests that we write as a matter of testing culture that treats edge case differences in implementation and "implementation-defined" in spec text as presumptively suspect.

Would you like me to remove the tests that seek to show that the browser does not scan past 1024 bytes (Blink and WebKit don't pass)? Would you like me to remove the tests that seek show that the browser scans all the way to byte 1024 (Blink and WebKit pass given the current buffering provided by the WPT server; I haven't investigated if they'd pass with different buffer boundaries or timing, but I haven't done that investigation for any of the tests)?

hsivonen · 2021-03-16T14:36:36Z

Further, should this go into WPT? https://phabricator.services.mozilla.com/differential/changeset/?ref=3707434

Firefox with my patch parses the XML declaration as declaring windows-1251. Chrome does not.

It's unclear to me what exactly makes Chrome give up. Making an XML declaration megabytes long doesn't make it give up. Causing the stream to stall for 20 seconds in the middle of an XML declaration doesn't make it give up. Making a normal-sized XML declaration trickle in byte by byte makes it give up.

So far, my hypothesis is that Chrome gives up after some number of network chunks that it considers contiguous, but I'd appreciate a real explanation.

annevk · 2021-03-17T13:05:12Z

@mfreed7 is this something you could help with? From Mozilla's side we'd really like the bytes to document operation to be fully deterministic, i.e., not have it use implementation-defined.

hsivonen · 2021-03-17T15:03:00Z

To elaborate on why I want to limit this to 1024 bytes:

meta takes precedence over the bogo-XML declaration. The two syntaxes are mutually exclusive such that it's not possible to hide a successful meta within the bogo-XML declaration syntax. Therefore, given the precedence, it makes sense to scan for meta first and only scan for the bogo-XML declaration if the meta prescan fails.

Regarding meta, we already have spec text that says: "User agents are encouraged to only prescan the first 1024 bytes." Gecko implements this exactly without timeouts. That is, if there is no meta, regardless of buffer boundaries and wall clock, Gecko always decides that the meta prescan fails once it has seen the first 1024 bytes without meta (or EOF). Therefore, when when meta prescan fails, the implementation options are either to scan the first 1024 bytes that we already have for the bogo-XML declaration syntax or to, on principle of accommodating the XML declaration syntax, read more input until a > arrives.

Since real bogo-XML declarations are generally not longer than 45 bytes, at the point we have 1024 bytes available, it's unreasonable to add complexity to handle cases where there is excessive whitespace that, in principle, fits the syntax: we're already accommodating a 20-fold excess length. In CSS, we have precedent for drawing the line to 1024 bytes for such theoretical infinite whitespace opportunity.

Under the Well-defined Behavior Design Principle, we should write the limit in the spec as in done in CSS even though in both cases it's so large relative to the real-world usage of the syntax that the exact limit is unlikely to have much Web compat impact.

hsivonen · 2021-03-24T08:41:56Z

I added a bit of text to handle UTF-16BE/LE declared using the ASCII-based syntax.

The WPTs landed, but it looks like I will need to split the test to avoid timeouts.

domenic · 2021-03-24T14:48:40Z

This just got re-assigned to me. My personal next step would be to remove the Gecko-only 1024 limitation so we can match 2/3 engines, and then remove or flip the pass conditions on any web platform tests which assume that if possible. (Although I'm not sure if it is.)

hsivonen · 2021-03-24T15:00:19Z

My personal next step would be to remove the Gecko-only 1024 limitation so we can match 2/3 engines, and then remove or flip the pass conditions on any web platform tests which assume that if possible.

What's your opinion on the analogous 1024-byte limit in CSS? What's your opinion on the trickle.py test?

hsivonen · 2021-04-15T18:03:02Z

It looks like for Web compat, Gecko will have to extend reloadless meta scanning past 1024 bytes. That changes what makes sense for the bogo XML declaration.

Let's drop the 1024-byte limit for XML declaration.

I'd still like to know how developers of the other engines view the trickle.py test.

annevk

This looks good to me. So I guess the main thing remaining here is for @mfreed7 to look at the tests. Is there a follow-up issue for changing meta element scanning accordingly?

annevk · 2021-04-20T10:08:30Z

source

-  the first step below goes beyond the end of the byte stream obtained so far) or reaches its
-  <var>end condition</var>, then abort the <span>prescan a byte stream to determine its
-  encoding</span> algorithm unsuccessfully.</p>
+  encoding</dfn>, given some defined <dfn data-x="prescan-end-condition"><var>end


If this is now a named argument we should add:

export for="prescan a byte stream to determine its encoding"

I wonder why this is exported in the first place...

It's for https://xhr.spec.whatwg.org/#document-response

annevk · 2021-04-20T10:11:17Z

source

+
+   <li><p>Advance <var>encodingPosition</var> past the 0x67 (g) byte.</p></li>
+
+   <li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e. it is


Nit: comma after i.e.

annevk · 2021-04-20T10:11:22Z

source

+
+   <li><p>If the byte at <var>encodingPosition</var> is not 0x3D (=), then return failure.</p></li>
+
+   <li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e. it is


mfreed7 · 2021-04-23T19:49:43Z

This looks good to me. So I guess the main thing remaining here is for @mfreed7 to look at the tests. Is there a follow-up issue for changing meta element scanning accordingly?

This LGTM. Test-wise, it looks like there are four test cases that need to change to re-remove the 1024-byte limit, right? I.e. the four that fail here?

And then there's the trickle.py test. I do not understand this one. When I run it locally, and when it runs on the Chromium automation bots, it passes. Not even flaky. So I'm not exactly sure why it is failing on WPT.fyi, seemingly consistently. I also inspected the code for some time, and I don't see any reason why this should fail. Suggestions appreciated here! The failure is that it assumes windows-1252 for some reason.

mfreed7 · 2021-04-23T20:32:21Z

And then there's the trickle.py test. I do not understand this one. When I run it locally, and when it runs on the Chromium automation bots, it passes. Not even flaky. So I'm not exactly sure why it is failing on WPT.fyi, seemingly consistently. I also inspected the code for some time, and I don't see any reason why this should fail. Suggestions appreciated here! The failure is that it assumes windows-1252 for some reason.

Ok, I might have found at least a clue. It is possible that this is due to a timeout of sorts in the test. Each of the <iframes> in xmldecl-1.html first inherit the windows-1252 from the parent frame. Then they each start loading. The trickle.py test trickles in one character every 50ms, finishing in just over 2 seconds. At that point, the charset is correctly detected and set to windows-1251. I think somehow the test might query before that, and see the inherited windows-1252. I don't yet see how, since it waits for onload. Is there anything funny in the WPT.fyi test runner perhaps? I ran it locally with the per-character time increased to 0.2 seconds, and it still passes.

Speculatively, we could split just the trickle.py test into a new xmldecl-3.html file, perhaps?

mfreed7 · 2021-05-06T19:36:19Z

Ok, I might have found at least a clue. It is possible that this is due to a timeout of sorts in the test. Each of the <iframes> in xmldecl-1.html first inherit the windows-1252 from the parent frame. Then they each start loading. The trickle.py test trickles in one character every 50ms, finishing in just over 2 seconds. At that point, the charset is correctly detected and set to windows-1251. I think somehow the test might query before that, and see the inherited windows-1252. I don't yet see how, since it waits for onload. Is there anything funny in the WPT.fyi test runner perhaps? I ran it locally with the per-character time increased to 0.2 seconds, and it still passes.

Speculatively, we could split just the trickle.py test into a new xmldecl-3.html file, perhaps?

I'm working on landing a split of the trickle.py test, to see if that helps: web-platform-tests/wpt#28874

mfreed7 · 2021-05-11T17:15:12Z

So I landed the changes to trickle.py (splitting it out and marking it as a slow test) and that solved the problems on Chromium bots. However, it still fails on wpt.fyi:

https://wpt.fyi/results/html/syntax/xmldecl/xmldecl-3.html?label=master&label=experimental&aligned&q=xmldecl

I don't understand the difference between WPT.fyi and our bots. Suggestions/thoughts appreciated.

domenic · 2021-05-11T17:40:52Z

Is there any chance it's a content_shell vs. full chrome browser issue? Seems unlikely...

mfreed7 · 2021-05-11T18:13:39Z

Hmm, well that's interesting. Yes. I hadn't tried the full browser but it appears to fail there, at least locally. I'll give that a look, thanks!

So WPT.fyi uses a full browser? I thought it was using content_shell like the bots. TIL.

mfreed7 · 2021-05-11T18:18:47Z

Actually no, maybe false alarm. Locally, it was just downloading the trickle.py file itself, rather than executing it...

mfreed7 · 2021-05-11T18:28:18Z

Yep, definitely false alarm. When actually running run_blink_wptserve.py, it works perfectly from a full browser also. Still, thanks for the suggestion.

domenic · 2021-05-11T18:30:36Z

For me, http://wpt.live/html/syntax/xmldecl/xmldecl-3.html passes on Canary and fails on Stable. This is on Windows. Not really sure what that indicates...

hsivonen · 2021-05-12T07:23:08Z

If I run ./mach wpt --product=chrome testing/web-platform/tests/html/syntax/xmldecl/xmldecl-3.html from today's mozilla-central with Chrome Version 90.0.4430.212 (Official Build) (64-bit) on Ubuntu 20.04, it fails.

hsivonen · 2021-05-12T08:04:06Z

I'm in the process of updating the WPT expectation around the kilobyte boundary.

Thanks for getting this in the spec!

domenic added normative change compat Standard is not web compatible or proprietary feature needs standardizing labels Sep 7, 2016

zcorpan reviewed Sep 7, 2016
View reviewed changes

domenic added the do not merge yet Pull request must not be merged per rationale in comment label Sep 7, 2016

domenic mentioned this pull request Sep 8, 2016

Use XML declaration <?xml ...?> as input for the text/html encoding sniffer #1438

Closed

annevk force-pushed the xml-charset branch from daf61a2 to 4459c22 Compare April 8, 2020 11:36

r12a added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Apr 8, 2020

hsivonen suggested changes Apr 8, 2020

View reviewed changes

plehegar mentioned this pull request Apr 9, 2020

Add XML declaration encoding sniffing w3c/i18n-activity#880

Closed

hsivonen mentioned this pull request Jan 11, 2021

www7a.biglobe.ne.jp - The page has a different character encoding webcompat/web-bugs#64525

Closed

Base automatically changed from master to main January 15, 2021 07:56

Handle UTF-16BE/LE in ASCII-based bogo-XML declaration

0066d35

annevk assigned domenic Mar 24, 2021

domenic added the agenda+ To be discussed at a triage meeting label Mar 31, 2021

past mentioned this pull request Apr 1, 2021

Upcoming HTML standard issue triage meeting on 4/1/2021 #6481

Closed

Remove length constraint for bogo-XML declaration

70ceba3

annevk approved these changes Apr 20, 2021

View reviewed changes

Fixes per Anne's review

f97b9a5

domenic merged commit 800a2dc into main May 6, 2021

domenic deleted the xml-charset branch May 6, 2021 17:46

past mentioned this pull request May 7, 2021

Upcoming HTML standard issue triage meeting on 5/6/2021 #6551

Closed


		<li><p>Advance <var>encodingPosition</var> past the 0x67 (g) byte.</p></li>

		<li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e. it is


		<li><p>If the byte at <var>encodingPosition</var> is not 0x3D (=), then return failure.</p></li>

		<li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e. it is

Add XML declaration encoding sniffing #1752

Add XML declaration encoding sniffing #1752

Conversation

domenic commented Sep 7, 2016 • edited by pr-preview bot Loading

bzbarsky commented Sep 7, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

domenic commented Sep 7, 2016

hsivonen commented Apr 8, 2020

hsivonen commented Apr 8, 2020

annevk commented Apr 8, 2020

zcorpan commented Apr 8, 2020

TimothyGu commented Apr 8, 2020

hsivonen left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hsivonen commented Apr 8, 2020

hsivonen commented Apr 8, 2020

hsivonen commented Apr 8, 2020

domenic commented Apr 8, 2020

hsivonen commented Apr 8, 2020

domenic commented Jun 3, 2020

hsivonen commented Mar 1, 2021 • edited Loading

hsivonen commented Mar 2, 2021

hsivonen commented Mar 2, 2021

hsivonen commented Mar 3, 2021

domenic commented Mar 15, 2021 • edited Loading

hsivonen commented Mar 16, 2021 • edited Loading

hsivonen commented Mar 16, 2021

annevk commented Mar 17, 2021

hsivonen commented Mar 17, 2021

hsivonen commented Mar 24, 2021

domenic commented Mar 24, 2021

hsivonen commented Mar 24, 2021

hsivonen commented Apr 15, 2021

annevk left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mfreed7 commented Apr 23, 2021

mfreed7 commented Apr 23, 2021

mfreed7 commented May 6, 2021

mfreed7 commented May 11, 2021

domenic commented May 11, 2021

mfreed7 commented May 11, 2021

mfreed7 commented May 11, 2021

mfreed7 commented May 11, 2021

domenic commented May 11, 2021

hsivonen commented May 12, 2021

hsivonen commented May 12, 2021

domenic commented Sep 7, 2016 •

edited by pr-preview bot

Loading

hsivonen commented Mar 1, 2021 •

edited

Loading

domenic commented Mar 15, 2021 •

edited

Loading

hsivonen commented Mar 16, 2021 •

edited

Loading