Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify speculative HTML parsing (preload scanner) #5624

Closed
zcorpan opened this issue Jun 9, 2020 · 25 comments · Fixed by #5959
Closed

Specify speculative HTML parsing (preload scanner) #5624

zcorpan opened this issue Jun 9, 2020 · 25 comments · Fixed by #5959

Comments

@zcorpan
Copy link
Member

zcorpan commented Jun 9, 2020

From #1349 (comment) , @chrishtr proposes spec edits to explicitly specify a preload scanner (as one part of a larger proposal on behavior for stylesheet loading). But the proposal doesn't say which exact approach to standardize. I believe this is implemented in slightly different ways:

Gecko: https://web.archive.org/web/20201021003137/https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/HTML_parser_threading
Chromium: #5624 (comment)
WebKit: ?

Are there documents that explain how this is implemented in Chromium and WebKit?

Is there interest in aligning on the observable behavior here?

cc @whatwg/html-parser @othermaciej @hsivonen @richard-townsend-arm @lilles @mfreed7

@zcorpan zcorpan added needs implementer interest Moving the issue forward requires implementers to express interest topic: parser labels Jun 9, 2020
@zcorpan zcorpan self-assigned this Jun 9, 2020
@annevk
Copy link
Member

annevk commented Jun 9, 2020

This is also relevant for whatwg/fetch#590 and rel=preload. I'm definitely in favor of getting this a bit more detailed, especially with how the fetches end up being constructed and how that affects future fetches, etc.

@mfreed7
Copy link
Contributor

mfreed7 commented Jun 20, 2020

Here is a very high-level description of the Chromium preload scanner, from @nyaxt:

At the high-level, there are two classes involved: PreloadScanner and HTMLResourcePreloader:

  • PreloadScanner takes a decoded string stream from the HTMLDocumentParser, drives a HTMLTokenizer internally to generate the HTML token stream (tokenizer output) and emit PreloadRequests.
  • HTMLResourcePreloader consumes the PreloadRequests and emits actual preload requests to the core/loader stack.

PreloadScanner consumes the token stream without actually constructing the DOM tree. It mostly only cares about the start tag tokens (% inline stylesheets), and sees if there's an interesting URL to start preloading. Since "false-positive" preload requests (== the urls PreloadScanner noticed but turns out not actually used) are very costly, PreloadScanner logic got more and more complicated over time. For example, it simulates the media query match so that it won't request images of the wrong resolution.

PreloadScanner is invoked for:

  • {main resource bytes received, document.write inputs} while parser is paused waiting for blocking script to run
  • scanning possible subresources for NoStatePrefetch feature

+@richard-townsend-arm to comment on whether any of the above has changed with the new synchronous parser

@zcorpan
Copy link
Member Author

zcorpan commented Jun 20, 2020

Thanks! What is NoStatePrefetch?

@othermaciej
Copy link

I'm not sure it's a good idea to fully specify it. The preload scanner is a heuristic for faster page loading, and it needs to be possible to evolve it as implementations learn new things about performance, and possibly to evolve it in implementation-specific ways.

If observable behavior differences are causing real-world interop problems, then it may make sense to specify constraints on what can be done. Ideally, there would be no preloading that is mandatory, and as few cases as possible that are forbidden. It should be like the http cache, where there are constraints but also enough flexibility to do innovative things like racing disk and network.

That said, I'm not the top expert on this area of WebKit. Tagging @cdumez as a person who may know more and @hober to make sure we follow up internally with relevant folks who are not on GitHub.

@mfreed7
Copy link
Contributor

mfreed7 commented Jun 20, 2020

Thanks! What is NoStatePrefetch?

https://developers.google.com/web/updates/2018/07/nostate-prefetch

@hsivonen
Copy link
Member

Is there interest in aligning on the observable behavior here?

My understanding is that Gecko's speculative parser is capable of looking further ahead than WebKit's that was inherited into Blink. I'm not interested in making Gecko's speculative parser less capable in order to match WebKit and Blink.

IIRC, the reported interop-sensitive problem has been rather niche and isolated: Trying to do server-side responsive images by setting resolution information in a cookie via JS and being unhappy that the browser already fetched the images whose URLs were in the HTML source. However, with a multiprocess architecture, it's increasingly problematic to give guarantees about when script-set cookies are reflected in the network requests that are made, so I think we shouldn't cater to tricks like this.

zcorpan added a commit to web-platform-tests/wpt that referenced this issue Jul 8, 2020
zcorpan added a commit to web-platform-tests/wpt that referenced this issue Jul 8, 2020
zcorpan added a commit to web-platform-tests/wpt that referenced this issue Jul 8, 2020
zcorpan added a commit to web-platform-tests/wpt that referenced this issue Jul 8, 2020
@zcorpan
Copy link
Member Author

zcorpan commented Jul 8, 2020

@othermaciej thanks for your thoughts. I agree that a full specification might not be desirable. I think the spec can still acknowledge that this optimization exists and specify some rules around how it should behave. This could serve web platform predictability and interoperability with regard to which URLs are speculatively prefetched.

@hsivonen thanks. Not wanting to regress on speculative parser capability is understandable. On the script-set cookie case, that's indeed something that doesn't work well together with the preload scanner optimization in particular.

For observable behavior, apart from the case @hsivonen mentioned, I'm aware of a few other cases that have been a source of complaints from web developers:

There are likely more issues, this was from a quick search.

To facilitate the discussion, I've written a few tentative tests today and a test generation script (so it's easy to add more tests): web-platform-tests/wpt#24521

Test results: https://wpt.fyi/results/html/syntax/preload-scanner.tentative/page-load?sha=ed26ec1897&label=pr_head

@othermaciej
Copy link

Acknowledging that the optimization exists, and perhaps giving constraints that would resolve some specific differences in observable behavior, seems like probably a good idea.

zcorpan added a commit to web-platform-tests/wpt that referenced this issue Sep 2, 2020
@zcorpan
Copy link
Member Author

zcorpan commented Sep 3, 2020

I've written some more tests. Here are some things I learned:

  • Safari 112 preview:
    • img-src-referrerpolicy-no-referrer fails because it includes the Referer header in the speculative fetch.
    • picture-source-no-img fails, it speculatively fetches even though there's no img element.
    • script-src-unsupported-type fails, it speculatively fetches the script with type=text/plain.
  • Chrome 86 dev:
    • picture-source-br-img fails; it doesn't speculatively fetch when there's a br tag before the img in picture.
  • Safari and Chrome:
    • Don't bother at all with svg or math tree builder rules for foreign content in the speculative parser. They tokenize with knowledge of some tags like xmp being RAWTEXT, but don't differentiate between HTML script or SVG script
    • In document.write(): the meta-charset-script-src test fails; a meta charset declaration after a blocking script doesn't affect the encoding of the document. (The test for this had a mistake.)
  • Firefox 82 nightly:
    • SVG script with xlink:href is speculatively fetched, but SVG script with href is not, even though href is normally supported.
    • img-srcset and picture-source-br-img are flaky in Firefox for some reason. Other tests are not flaky as far as I can tell.

I've made the pass conditions be assuming the correct behavior would be to fetch (for a selective set of resource kinds) what would be fetched had the real HTML parser continued without the current script doing any "destructive" document.write. This appears to be very close to what Firefox does.

Latest test results: https://wpt.fyi/results/html/syntax/speculative-parsing/generated?label=pr_head&max-count=1&pr=24521

Now, I realize that not all of the cases I've tested have a basis of complaints from web developers. I wanted to find and highlight fundamental differences and possible bugs. What to specify is open for discussion.

@yoavweiss
Copy link
Contributor

I think it makes sense to specify the preloader, in order to make it easier to specify things that rely on it:

  • Link preload headers that also include a media attribute currently wait (In both Chromium and WebKit, IIRC) for the preloader to figure out the right viewport (from <meta viewport> tokens) before preloading the resource in question. Currently this is entirely unspecified.
  • As Simon saw, error handling can be strictly different between different implementations, which results in potentially different web observed fetches.
  • As @annevk pointed out, speculative preloads rely on the unspecified "preload cache" (or memory cache in WebKit, which is subtly different). It would be good to define what the behavior is when e.g. a non-cacheable script is speculatively fetched multiple times. IIRC, that would result in a different number of fetches between Chromium and Gecko.
  • The parser behavior in the face of blocking in-body styles, which started this whole discussion.
  • We want to be sure (and test) that all browsers don't preload the wrong things when Media Queries are involved (for responsive images, as well as with the media attribute on styles).

While the preload scanner may not be a major point of interop pain, but the differences in behavior are highly observable, through Resource Timing, Service Workers and the servers fielding the actual requests.
Different browsers preload different things potentially (e.g. import styles), in ways that may surprise developers. Defining a baseline may help to reduce this divergence in bahvior and reduce surprises.

I agree with @othermaciej and @hsivonen that we wouldn't want such a specification to limit future optimizations, but at this point, I feel this optimization is mature enough that we could specify its basics, and let user agents innovate on top of that.

@zcorpan
Copy link
Member Author

zcorpan commented Sep 9, 2020

Thanks @yoavweiss!

New info for me to consider and test:

  • Link preload headers waiting for a meta viewport, potentially found by speculative parser
  • meta CSP can prevent speculative fetches that violate the CSP. (@yoavweiss mentioned this to me in IRC.)

@yoavweiss
Copy link
Contributor

  • meta CSP can prevent speculative fetches that violate the CSP. (@yoavweiss mentioned this to me in IRC.)

I believe that to simplify the implementation, it prevents any speculative fetches period (at least in Chromium). Might be good to see what other implementations are doing, and maybe do something smarter.

@zcorpan
Copy link
Member Author

zcorpan commented Sep 9, 2020

Here are two demos with meta CSP:

From what I can tell with the Network tab devtools, in Safari and Firefox, both cases speculatively fetch the image, ignoring the CSP meta directives. Chrome doesn't fetch the image for either case.

@hsivonen
Copy link
Member

hsivonen commented Sep 9, 2020

  • SVG script with xlink:href is speculatively fetched, but SVG script with href is not, even though href is normally supported.

Filed. Thanks! Also: The name of the image test suggests that it's testing src rather than href. The spec says href. Apparently the tests haven't landed yet, so I failed to locate the test source to check.

@zcorpan
Copy link
Member Author

zcorpan commented Sep 9, 2020

@hsivonen thank you!

The tests are here web-platform-tests/wpt#24521

There are multiple image tests, both svg-image-href and svg-image-src (the latter expects no speculative fetch to happen).

@zcorpan zcorpan removed the needs implementer interest Moving the issue forward requires implementers to express interest label Sep 14, 2020
@zcorpan
Copy link
Member Author

zcorpan commented Sep 29, 2020

I looked into trying to specify the integration points of speculative parsing in the spec in #5959 , and I have some questions.

Specifically, I'm a bit confused about how to specify speculative parsing for the document.write string. From irc:

I don't follow how what the spec does for <script src=X></script> where X is: document.write("<script src=Y></script><img src=Z>");
first there's https://html.spec.whatwg.org/multipage/parsing.html#scriptEndTag
"Prepare" will mark the script as the pending parser-blocking script, for the first script
"script nesting level" is zero, so the Otherwise steps are run
so far so good. This is where I want to kick off the (first) speculative parser
but then the script is loaded and executed (step 10), and the document.write will invoke the tree builder reentrantly, reaching this part again but with "script nesting level" being 1
where the spec says "Set the parser pause flag to true, and abort the processing of any nested invocations of the tokenizer, yielding control back to the caller. (Tokenization will resume when the caller returns to the "outer" tree construction stage.)"
and that's what I don't get. how is the Y script loaded and executed?
Where should I kick off a second speculative parser to find the img in the document.write string

@chrishtr
Copy link
Contributor

chrishtr commented Sep 29, 2020

Isn't it that once we're back in the "outer" tree construction stage:

  • Tokenization resumes, getting to the <script src=Y> block, which then blocks the parser on loading and and then executing Y.
  • In the meantime, the preload scanner either resumes if it hasn't yet gotten to the place affected by document.write, or starts again at the document-write-affected area.

@zcorpan
Copy link
Member Author

zcorpan commented Sep 30, 2020

I thought there was special handling for document.write with the speculative parser.

In particular, the document write steps insert input into the input stream, but if the speculative parser is already parsing, it will be beyond the insertion point and thus not parse input without special handling.

@chrishtr
Copy link
Contributor

chrishtr commented Oct 1, 2020

I thought there was special handling for document.write with the speculative parser.

Are you speaking about the Chromium implementation?

In particular, the document write steps insert input into the input stream, but if the speculative parser is already parsing, it will be beyond the insertion point and thus not parse input without special handling.

Agree that if the speculative parser has already gotten to the end of the input and then document.write happens, special handling will be needed. Is it any more complicated than something equivalent to "the speculative parser begins again at the new areas added to the input stream if it has completed already"?

@mfreed7
Copy link
Contributor

mfreed7 commented Oct 2, 2020

Agree that if the speculative parser has already gotten to the end of the input and then document.write happens, special handling will be needed. Is it any more complicated than something equivalent to "the speculative parser begins again at the new areas added to the input stream if it has completed already"?

I think it’s more that the speculative parser might have to “rewind” back to the insertion point, since it might be past it.

@zcorpan
Copy link
Member Author

zcorpan commented Oct 2, 2020

@mfreed7 right. It will be past it as soon as it has speculatively parsed a single character.

It's possible that Gecko has a different strategy with document.write compared to Chromium/WebKit. From https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/HTML_parser_threading

nsHtml5Parser has one tokenizer/tree builder pair for parsing data from document.write(). It also has another lazily initialized tokenizer/tree builder for speculatively scanning the tail of document.write() data when the parser blocks without parsing the document.write() data to completion.

@zcorpan
Copy link
Member Author

zcorpan commented Oct 29, 2020

Draft spec for review

There is now a draft specification for review in #5959. (Preview of the generated spec)

At a high level, the optimization itself is optional, but if it's implemented, it needs to speculatively parse HTML as if it was following the normal HTML parser rules, and only fetch resources that would be fetched from normal processing (if the blocking script does nothing). Which kinds of resources to fetch is implementation-defined.

TPAC discussion

We discussed this topic a few days ago in the WHATWG TPAC breakout session. Minutes here: https://www.w3.org/2020/10/26-whatwg-minutes.html#preload

Commentary inline:

zcorpan: I've been working on this. Tests demonstrate interop issues.

zcorpan: document.write() external scripts cause performance problems. So the speculative parser tries to help. Gecko vs. Chromium/WebKit: Gecko more faithfully uses the real HTML parser to build a speculative tree. Chromium/WebKit only use the tokenizer, with some tree-building knowledge (e.g. know about script elements and style elements; don't know about SVGNS). This allows generating test cases to show interop differences. Don't have data on the web developer pain these interop causes, but it'd be useful to reach interop. Question is: how much detail do we want in the spec, and which behavior do we want to specify? Also we need to speculatively parse the document.write() string if you're writing an external script.

hsivonen: have you found any cases where "the right thing" differs from "what Firefox does"?

zcorpan: document.write("<meta charset>") plus a later document.write("<meta charset>") which invalidates that does not work.

This about meta charset was both minuted inaccurately and I made a mistake in my testing which I've now fixed. There are a couple of interesting bits about meta charset in Gecko's speculative parser, though:

  • it is ignored by the speculative parser, i.e. it does not change the encoding of the document or the URL character encoding of speculative fetches. (This only makes a difference when the page is doing something "wrong", either using a late encoding declaration, or document.write()ing an encoding declaration.)
  • in the "page-load" test for this (/html/syntax/speculative-parsing/generated/page-load/meta-charset-script-src.tentative.html), Firefox does not speculatively fetch the script src after the encoding declaration. I asked @hsivonen about this in IRC, but he didn't have an answer to why this happens. (The "document-write" test does speculatively fetch the script.)

smfr: reiterate othermaciej's comments in the issue that this is just an optimization done by UAs, and maybe it's not necessary to fully specify. If there are interop issues then maybe the spec should say enough to address those, but we're not convinced that it needs to be fully specified.

emilio: if we know of authors relying on particular bits, it'd be useful to document them; it's not clear whether it needs to be normative or formal.

zcorpan: so like a bullet point set of requirements or expectations

hsivonen: my main concern is not speccing things so that the less-accurate version is correct and Firefox's more-accurate version is incorrect. So either spec the correct thing or leave it hand-wavey enough that the correct version is allowed.

zcorpan: Another Firefox is not totally correct issue: it doesn't respect CSP. In Chromium the presence of CSP meta entirely disables the speculative parser. (WebKit aligns with Firefox and ignores CSP.)

emilio: that seems like a bug we could probably fix. Is there a bug on file?

Henri filed https://bugzilla.mozilla.org/show_bug.cgi?id=1673407

zcorpan: I haven't filed one. I do envision listing this as a requirement in the spec and filing a bug.

zcorpan: subtopic: the speculative cache. Ideally the spec would say something about that.

emilio: Gecko reuses the existing caches. Per-document stylesheet cache, image cache.

zcorpan: annevk brought this up in #5624.

(some discussion about stylesheet cache, subtleties, how it doesn't exist in the spec)

hsivonen: we have a hash table of URLs that have been speculatively fetched. So e.g. <img> and <script> pointing to the same URL won't speculatively fetch the same URL twice.

I haven't specified this hash table.

zcorpan added a commit to web-platform-tests/wpt that referenced this issue Oct 29, 2020
zcorpan added a commit that referenced this issue May 20, 2021
zcorpan added a commit that referenced this issue Aug 11, 2021
zcorpan added a commit to web-platform-tests/wpt that referenced this issue Aug 31, 2021
zcorpan added a commit to web-platform-tests/wpt that referenced this issue Sep 8, 2021
zcorpan added a commit to web-platform-tests/wpt that referenced this issue Sep 14, 2021
zcorpan added a commit that referenced this issue Sep 14, 2021
@ke1vin4real
Copy link

@mfreed7 Is there any article introduced PreloadScanner?

@yoavweiss
Copy link
Contributor

I wrote this article many years ago, but I believe it's still mostly relevant

@ke1vin4real
Copy link

@yoavweiss Thanks

moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Oct 3, 2021
…e HTML parsing, a=testonly

Automatic update from web-platform-tests
HTML: Add tentative tests for speculative HTML parsing

See whatwg/html#5624
--

wpt-commits: 9afacb73a04c1f33837eb0a7ffcd9ec16c9d477f
wpt-pr: 24521
jamienicol pushed a commit to jamienicol/gecko that referenced this issue Oct 4, 2021
…e HTML parsing, a=testonly

Automatic update from web-platform-tests
HTML: Add tentative tests for speculative HTML parsing

See whatwg/html#5624
--

wpt-commits: 9afacb73a04c1f33837eb0a7ffcd9ec16c9d477f
wpt-pr: 24521
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Oct 4, 2021
…e HTML parsing, a=testonly

Automatic update from web-platform-tests
HTML: Add tentative tests for speculative HTML parsing

See whatwg/html#5624
--

wpt-commits: 9afacb73a04c1f33837eb0a7ffcd9ec16c9d477f
wpt-pr: 24521
jamienicol pushed a commit to jamienicol/gecko that referenced this issue Oct 6, 2021
…e HTML parsing, a=testonly

Automatic update from web-platform-tests
HTML: Add tentative tests for speculative HTML parsing

See whatwg/html#5624
--

wpt-commits: 9afacb73a04c1f33837eb0a7ffcd9ec16c9d477f
wpt-pr: 24521
Gabisampaio pushed a commit to Gabisampaio/wpt that referenced this issue Nov 18, 2021
mfreed7 pushed a commit to mfreed7/html that referenced this issue Jun 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

8 participants