Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add XML declaration encoding sniffing #1752

Merged
merged 12 commits into from
May 6, 2021
Merged

Add XML declaration encoding sniffing #1752

merged 12 commits into from
May 6, 2021

Conversation

domenic
Copy link
Member

@domenic domenic commented Sep 7, 2016

Closes #1438, where we found out that this is required for web
compatibility. The algorithm given here is an exact copy of that used by
WebKit and Blink, with the exception that it does not detect UTF-32 byte
sequences since in web-standards-world, UTF-32 must not be supported.


@bzbarsky FYI. Would appreciate some careful editor review on this (@zcorpan seems especially good at byte-parsing algorithms) as while I tried to match these algorithms to the surrounding style, I'm not sure I succeeded. It would also be good to have someone else double-check the correspondence between this spec text and Blink/WebKit's algorithms; the fact that they use the ASCII XML declaration as a hint but then also check meta charset was a bit strange, but I think I captured that.


(PR template added by @hsivonen )



/parsing.html ( diff )

@domenic domenic added normative change compat Standard is not web compatible or proprietary feature needs standardizing labels Sep 7, 2016
@bzbarsky
Copy link
Contributor

bzbarsky commented Sep 7, 2016

See #1438 (comment) for my take on the WebKit/Blink behavior around parsing the XML declaration... Note that Edge does not do that, which is why I kept trying to get some information on what it does do.

source Outdated
of bytes 0x65, 0x6E, 0x63, 0x6F, 0x64, 0x69, 0x6E, 0x67 (ASCII 'encoding') at or after the
current <var>encodingPosition</var>. If there is no such sequence, abort the <span
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding algorithm</span> algorithm
and return failure.</p></li>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could return failure here if the byte before the 'e' is not less than or equal to 0x20, to fix the fooencoding="utf-8" problem. (I haven't tested what Edge does though.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we should tokenize attributes as is done for meta scanner (consider <?xml foo=" encoding='utf-8'"?>)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept the WebKit-compatible behavior.

@domenic
Copy link
Member Author

domenic commented Sep 7, 2016

I'm going to tag this as "do not merge yet" until we either:

  1. Decide to just go with what is 2/4 implemented (this PR, currently)
  2. Decide to come up with an improved algorithm, do so, and get buy in from 2/4 browsers on the improved algorithm (so presumably Firefox + 1 more).

@domenic domenic added the do not merge yet Pull request must not be merged per rationale in comment label Sep 7, 2016
@hsivonen
Copy link
Member

hsivonen commented Apr 8, 2020

I think we should add WebKit/Blink-compatible XML declaration sniffing.

@hsivonen
Copy link
Member

hsivonen commented Apr 8, 2020

(I disapprove of WebKit improvising on this back when it did, but now that IE and Spartan are going away, the behavior is part of the compatibility legacy.)

@annevk
Copy link
Member

annevk commented Apr 8, 2020

I rebased this, addressed the obvious feedback, but two comments from Simon are still remaining. If someone has a pointer to where this is implemented in Chromium/WebKit I'd be happy to read through that and do a comparison.

I also wonder if there's a better way of writing this down as the jumping around doesn't seem great.

@zcorpan
Copy link
Member

zcorpan commented Apr 8, 2020

I'm OK with going with the implemented behavior. The parts that are bogus can be called out with notes.

@r12a r12a added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Apr 8, 2020
Copy link
Member

@hsivonen hsivonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some clear bugs and one issue that matches Blink, but I think matching Blink is harmful due to the network buffer boundary dependency.

source Outdated
@@ -102094,6 +102128,55 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
step.</p></li>
</ol>

<p>When the <span>prescan a byte stream to determine its encoding</span> algorithm says to <dfn
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</dfn>, it means doing
this. If at any point during these steps the <var>encodingPosition</var> pointer created in the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to match Blink, but I think it's bad to make this dependent on network byte buffer boundaries. If I was writing this from scratch, I'd make a state machine that'd fail fast if it doesn't match encoding= next, but let's try to stay closer to what Blink does. I'd prefer to require scanning exactly the first 1024 for >. However, the spec presently allows for stopping earlier, which is bad.

I suggest saying that the browser must scan for > for as far as it would scan for meta charset.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. I didn't fix this one, yet.

source Outdated Show resolved Hide resolved
source Outdated Show resolved Hide resolved
source Outdated Show resolved Hide resolved
source Outdated Show resolved Hide resolved
@hsivonen
Copy link
Member

hsivonen commented Apr 8, 2020

Hmm. In fact, since meta charset takes precedence, I suggest placing this step after the meta prescan: If the meta prescan fails, perform this sniffing over the same bytes that the meta prescan was performed over.

@hsivonen
Copy link
Member

hsivonen commented Apr 8, 2020

Hmm. Or at least meta prescan takes precedence in the ASCII superset case. Need to test the UTF-16LE/BE cases, still.

@hsivonen
Copy link
Member

hsivonen commented Apr 8, 2020

It looks like the UTF-16 cases take precedence over meta, but meta takes precedence over ASCII-based XML decl.

Proof:
https://hsivonen.com/test/moz/xml-decl.htm
https://hsivonen.com/test/moz/xml-decl-16.htm
https://hsivonen.com/test/moz/xml-decl-no-space-after-xml.htm
https://hsivonen.com/test/moz/xml-decl-garbage-after-xml.htm

So, the UTF-16 cases should be checked before meta prescan, and waiting for 6 bytes should be required.

If meta prescan fails, the ASCII XML declaration scan should run exactly on the same bytes that the meta prescan ran over.

@domenic
Copy link
Member Author

domenic commented Apr 8, 2020

So cool to see this revived; thanks @hsivonen for the prod and the review, and thanks @annevk for rebasing.

Since this was originally proposed the WHATWG has adopted a tests-required policy, so I hope we can also take the opportunity to add web platform tests for these cases. @hsivonen, it seems like you're working on this area in Gecko; will you be able to do that as part of your implementation?

@hsivonen
Copy link
Member

hsivonen commented Apr 8, 2020

I can write tests, yes.

@domenic
Copy link
Member Author

domenic commented Jun 3, 2020

Ping @hsivonen; did you end up writing tests for this?

@hsivonen
Copy link
Member

hsivonen commented Mar 1, 2021

Manual demos, not in WPT form yet: https://hsivonen.com/test/moz/xml-decl/

@hsivonen
Copy link
Member

hsivonen commented Mar 2, 2021

I'm unhappy that it looks like WebKit managed to give UTF-16 <?xml precedence that doesn't match the BOM's precedence:
https://hsivonen.com/test/moz/xml-decl/utf-16-vs-http.html
https://hsivonen.com/test/moz/xml-decl/utf-16-and-meta.htm

The precedence appears to be:

  1. BOM
  2. HTTP
  3. UTF-16 <?xml
  4. meta
  5. ASCII <?xml

It's sad that WebKit added a check for UTF-16 <?xml at all considering the UTF-16 XML must have a BOM.

@hsivonen
Copy link
Member

hsivonen commented Mar 2, 2021

Looks like WebKit and Blink don't use get an encoding: https://hsivonen.com/test/moz/xml-decl/space-around-label.htm

@hsivonen
Copy link
Member

hsivonen commented Mar 3, 2021

WebKit and Blink don't trim ASCII whitespace when resolving the encoding label from an XML declaration in text/html but do perform alias resolution:
https://hsivonen.com/test/moz/xml-decl/cp1251.htm
https://hsivonen.com/test/moz/xml-decl/WINDOWS.htm
https://hsivonen.com/test/moz/xml-decl/space-around-label.htm

This makes sense in the light of the XML spec not allowing ASCII whitespace around the label:
https://www.w3.org/TR/REC-xml/#NT-EncodingDecl

If we don't want to introduce get an encoding without trimming in the Encoding Standard, we can make the spec text here bail out if there's ASCII whitespace inside the quotes.

@domenic
Copy link
Member Author

domenic commented Mar 15, 2021

I'm sorry, I'm hearing a lot of history, some "this is strange", and some of what seems like the blame game. But I still haven't heard a clear articulation of what the benefits would be, either to web users, or to implementations, or to interop... Could you phrase this in terms of, e.g., pages that are broken with WebKit and Blink's current implementation, but would be fixed by the spec change you're proposing here? It sounds like a small change, but if there's literally no benefits, then it's hard to argue for it.

Perhaps another way of phrasing it is, what part of the priority of constituencies is this change targeting?

@hsivonen
Copy link
Member

hsivonen commented Mar 16, 2021

The benefit is cross-engine consistency (i.e. interop) on how far to scan. Concretely, whether it's legitimate to land WPTs that test exactly how far to scan.

At this point, it's not known that it fixes or breaks any non-test page in particular; the limit is substantially higher than the length of a plausible XML declaration.

For other features, we typically specify exact behavior and write tests for the boundary conditions, even if there's no evidence of concrete Web compat concerns at hand. The expected benefit for testing an exact boundary condition here is low, but the same argument could be made about many other edge case tests that we write as a matter of testing culture that treats edge case differences in implementation and "implementation-defined" in spec text as presumptively suspect.

Would you like me to remove the tests that seek to show that the browser does not scan past 1024 bytes (Blink and WebKit don't pass)? Would you like me to remove the tests that seek show that the browser scans all the way to byte 1024 (Blink and WebKit pass given the current buffering provided by the WPT server; I haven't investigated if they'd pass with different buffer boundaries or timing, but I haven't done that investigation for any of the tests)?

@hsivonen
Copy link
Member

Further, should this go into WPT? https://phabricator.services.mozilla.com/differential/changeset/?ref=3707434

Firefox with my patch parses the XML declaration as declaring windows-1251. Chrome does not.

It's unclear to me what exactly makes Chrome give up. Making an XML declaration megabytes long doesn't make it give up. Causing the stream to stall for 20 seconds in the middle of an XML declaration doesn't make it give up. Making a normal-sized XML declaration trickle in byte by byte makes it give up.

So far, my hypothesis is that Chrome gives up after some number of network chunks that it considers contiguous, but I'd appreciate a real explanation.

@annevk
Copy link
Member

annevk commented Mar 17, 2021

@mfreed7 is this something you could help with? From Mozilla's side we'd really like the bytes to document operation to be fully deterministic, i.e., not have it use implementation-defined.

@hsivonen
Copy link
Member

To elaborate on why I want to limit this to 1024 bytes:

meta takes precedence over the bogo-XML declaration. The two syntaxes are mutually exclusive such that it's not possible to hide a successful meta within the bogo-XML declaration syntax. Therefore, given the precedence, it makes sense to scan for meta first and only scan for the bogo-XML declaration if the meta prescan fails.

Regarding meta, we already have spec text that says: "User agents are encouraged to only prescan the first 1024 bytes." Gecko implements this exactly without timeouts. That is, if there is no meta, regardless of buffer boundaries and wall clock, Gecko always decides that the meta prescan fails once it has seen the first 1024 bytes without meta (or EOF). Therefore, when when meta prescan fails, the implementation options are either to scan the first 1024 bytes that we already have for the bogo-XML declaration syntax or to, on principle of accommodating the XML declaration syntax, read more input until a > arrives.

Since real bogo-XML declarations are generally not longer than 45 bytes, at the point we have 1024 bytes available, it's unreasonable to add complexity to handle cases where there is excessive whitespace that, in principle, fits the syntax: we're already accommodating a 20-fold excess length. In CSS, we have precedent for drawing the line to 1024 bytes for such theoretical infinite whitespace opportunity.

Under the Well-defined Behavior Design Principle, we should write the limit in the spec as in done in CSS even though in both cases it's so large relative to the real-world usage of the syntax that the exact limit is unlikely to have much Web compat impact.

@hsivonen
Copy link
Member

I added a bit of text to handle UTF-16BE/LE declared using the ASCII-based syntax.

The WPTs landed, but it looks like I will need to split the test to avoid timeouts.

@domenic
Copy link
Member Author

domenic commented Mar 24, 2021

This just got re-assigned to me. My personal next step would be to remove the Gecko-only 1024 limitation so we can match 2/3 engines, and then remove or flip the pass conditions on any web platform tests which assume that if possible. (Although I'm not sure if it is.)

@hsivonen
Copy link
Member

My personal next step would be to remove the Gecko-only 1024 limitation so we can match 2/3 engines, and then remove or flip the pass conditions on any web platform tests which assume that if possible.

What's your opinion on the analogous 1024-byte limit in CSS? What's your opinion on the trickle.py test?

@domenic domenic added the agenda+ To be discussed at a triage meeting label Mar 31, 2021
@hsivonen
Copy link
Member

It looks like for Web compat, Gecko will have to extend reloadless meta scanning past 1024 bytes. That changes what makes sense for the bogo XML declaration.

Let's drop the 1024-byte limit for XML declaration.

I'd still like to know how developers of the other engines view the trickle.py test.

Copy link
Member

@annevk annevk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. So I guess the main thing remaining here is for @mfreed7 to look at the tests. Is there a follow-up issue for changing meta element scanning accordingly?

the first step below goes beyond the end of the byte stream obtained so far) or reaches its
<var>end condition</var>, then abort the <span>prescan a byte stream to determine its
encoding</span> algorithm unsuccessfully.</p>
encoding</dfn>, given some defined <dfn data-x="prescan-end-condition"><var>end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is now a named argument we should add:

export for="prescan a byte stream to determine its encoding"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why this is exported in the first place...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


<li><p>Advance <var>encodingPosition</var> past the 0x67 (g) byte.</p></li>

<li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e. it is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: comma after i.e.


<li><p>If the byte at <var>encodingPosition</var> is not 0x3D (=), then return failure.</p></li>

<li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e. it is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

@mfreed7
Copy link
Contributor

mfreed7 commented Apr 23, 2021

This looks good to me. So I guess the main thing remaining here is for @mfreed7 to look at the tests. Is there a follow-up issue for changing meta element scanning accordingly?

This LGTM. Test-wise, it looks like there are four test cases that need to change to re-remove the 1024-byte limit, right? I.e. the four that fail here?

And then there's the trickle.py test. I do not understand this one. When I run it locally, and when it runs on the Chromium automation bots, it passes. Not even flaky. So I'm not exactly sure why it is failing on WPT.fyi, seemingly consistently. I also inspected the code for some time, and I don't see any reason why this should fail. Suggestions appreciated here! The failure is that it assumes windows-1252 for some reason.

@mfreed7
Copy link
Contributor

mfreed7 commented Apr 23, 2021

And then there's the trickle.py test. I do not understand this one. When I run it locally, and when it runs on the Chromium automation bots, it passes. Not even flaky. So I'm not exactly sure why it is failing on WPT.fyi, seemingly consistently. I also inspected the code for some time, and I don't see any reason why this should fail. Suggestions appreciated here! The failure is that it assumes windows-1252 for some reason.

Ok, I might have found at least a clue. It is possible that this is due to a timeout of sorts in the test. Each of the <iframes> in xmldecl-1.html first inherit the windows-1252 from the parent frame. Then they each start loading. The trickle.py test trickles in one character every 50ms, finishing in just over 2 seconds. At that point, the charset is correctly detected and set to windows-1251. I think somehow the test might query before that, and see the inherited windows-1252. I don't yet see how, since it waits for onload. Is there anything funny in the WPT.fyi test runner perhaps? I ran it locally with the per-character time increased to 0.2 seconds, and it still passes.

Speculatively, we could split just the trickle.py test into a new xmldecl-3.html file, perhaps?

@domenic domenic merged commit 800a2dc into main May 6, 2021
@domenic domenic deleted the xml-charset branch May 6, 2021 17:46
@mfreed7
Copy link
Contributor

mfreed7 commented May 6, 2021

Ok, I might have found at least a clue. It is possible that this is due to a timeout of sorts in the test. Each of the <iframes> in xmldecl-1.html first inherit the windows-1252 from the parent frame. Then they each start loading. The trickle.py test trickles in one character every 50ms, finishing in just over 2 seconds. At that point, the charset is correctly detected and set to windows-1251. I think somehow the test might query before that, and see the inherited windows-1252. I don't yet see how, since it waits for onload. Is there anything funny in the WPT.fyi test runner perhaps? I ran it locally with the per-character time increased to 0.2 seconds, and it still passes.

Speculatively, we could split just the trickle.py test into a new xmldecl-3.html file, perhaps?

I'm working on landing a split of the trickle.py test, to see if that helps: web-platform-tests/wpt#28874

@mfreed7
Copy link
Contributor

mfreed7 commented May 11, 2021

So I landed the changes to trickle.py (splitting it out and marking it as a slow test) and that solved the problems on Chromium bots. However, it still fails on wpt.fyi:

https://wpt.fyi/results/html/syntax/xmldecl/xmldecl-3.html?label=master&label=experimental&aligned&q=xmldecl

I don't understand the difference between WPT.fyi and our bots. Suggestions/thoughts appreciated.

@domenic
Copy link
Member Author

domenic commented May 11, 2021

Is there any chance it's a content_shell vs. full chrome browser issue? Seems unlikely...

@mfreed7
Copy link
Contributor

mfreed7 commented May 11, 2021

Hmm, well that's interesting. Yes. I hadn't tried the full browser but it appears to fail there, at least locally. I'll give that a look, thanks!

So WPT.fyi uses a full browser? I thought it was using content_shell like the bots. TIL.

@mfreed7
Copy link
Contributor

mfreed7 commented May 11, 2021

Actually no, maybe false alarm. Locally, it was just downloading the trickle.py file itself, rather than executing it...

@mfreed7
Copy link
Contributor

mfreed7 commented May 11, 2021

Yep, definitely false alarm. When actually running run_blink_wptserve.py, it works perfectly from a full browser also. Still, thanks for the suggestion.

@domenic
Copy link
Member Author

domenic commented May 11, 2021

For me, http://wpt.live/html/syntax/xmldecl/xmldecl-3.html passes on Canary and fails on Stable. This is on Windows. Not really sure what that indicates...

@hsivonen
Copy link
Member

If I run ./mach wpt --product=chrome testing/web-platform/tests/html/syntax/xmldecl/xmldecl-3.html from today's mozilla-central with Chrome Version 90.0.4430.212 (Official Build) (64-bit) on Ubuntu 20.04, it fails.

@hsivonen
Copy link
Member

I'm in the process of updating the WPT expectation around the kilobyte boundary.

Thanks for getting this in the spec!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
agenda+ To be discussed at a triage meeting compat Standard is not web compatible or proprietary feature needs standardizing do not merge yet Pull request must not be merged per rationale in comment i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. normative change
Development

Successfully merging this pull request may close these issues.

Use XML declaration <?xml ...?> as input for the text/html encoding sniffer
9 participants