Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Editorial: align with Encoding's changes to "streams" / "I/O queues" #5874

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

andreubotella
Copy link
Member

@andreubotella andreubotella commented Aug 28, 2020

See whatwg/encoding#215.


I'm using "stream" to refer to the generic concept of streaming bytes, and "I/O queue" for the particular data structure. This prevents having to rename "input stream" and so on.


馃挜 Error: Wattsi server error 馃挜

PR Preview failed to build. (Last tried on Jan 15, 2021, 8:00 AM UTC).

More

PR Preview relies on a number of web services to run. There seems to be an issue with the following one:

馃毃 Wattsi Server - Wattsi Server is the web service used to build the WHATWG HTML spec.

馃敆 Related URL

Parsing MDN data...
Parsing...



If you don't have enough information above to solve the error by yourself (or to understand to which web service the error is related to, if any), please file an issue.

@andreubotella
Copy link
Member Author

andreubotella commented Sep 11, 2020

@whatwg/html-parser should review this.

For some context, in whatwg/encoding#215 we realized that streams (now called I/O queues) were treated here as a data structure in memory, there as a representation of data yet to come from the network, and that obscured things like the fact that reading from them should block. @annevk and I ultimately decided to formalize them as in-memory structures that can get populated with data from the network (see the changes on the page load processing model, for example), and which block on reading, with the presence of an end-of-queue item at the end indicating that no more data is to come.

Since the event loop shouldn't block, this PR explicitly moves the input byte stream decoding to a new thread, and the tokenizer and tree construction to another, since they have to read from the input byte stream and input stream, respectively. But it doesn't attempt to be formal, especially around the tree constructor's handling of the DOM. That's best left for a future PR.

I'd appreciate some review especially on the parts around document.write() and reentrancy, which are black magic to me.

source Outdated Show resolved Hide resolved
source Outdated Show resolved Hide resolved
source Outdated
parallel</span> with the <span>event loop</span> and with each other.</p>

<p id="nestedParsing">The tokenizer stage and the tree construction stage must run on the same
thread and share one single set of states, but the tree construction stage is reentrant, meaning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's appropriate to require this. Implementations should be free to implement an HTML parser however they wish, as long as the end result is equivalent. (The statement this replaces was a statement of fact and didn't contain requirements so this also doesn't seem like an editorial change.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most algorithms in the web platform, whether they use the word "must" or not, have a "must" conformance level and are still understood to only apply to anything observable. But sure, this text doesn't make it clear that the difference is not observable (the XXX note below does, but I guess it's best not to depend on it for the correct interpretation).

The algorithms below describe the decoding of the input byte stream into scalar values and the tokenization stage as happening in parallel with the event loop and with each other, with the tokenization and tree construction stages running on the same thread. This is, however, not observable to user code, so implementers may run the HTML parser entirely on the event loop.

source Outdated Show resolved Hide resolved
@annevk
Copy link
Member

annevk commented Oct 9, 2020

I should add, I really appreciate you trying to get this better formalized. I hope others from @whatwg/html-parser can also take a look.

source Outdated
the source to find the encoding improves performance, as it reduces the need to throw away the
data structures used when parsing upon finding the encoding information. However, if the user
agent delays too long to obtain data to determine the encoding, then the cost of the delay could
outweigh any performance improvements from the preparse.</p>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @domenic pointed out to me in #5913, the way the spec is written right now, it's perfectly valid for implementations to use 0 bytes for the prescan and then do anything, up to implementing their own spec-incompatible prescan, in the "autodetect" stage. So this change, which prevents that kind of interpretation, is clearly not editorial, though it might be still worth including separately.

Base automatically changed from master to main January 15, 2021 07:58
@domenic
Copy link
Member

domenic commented Aug 17, 2021

@annevk @andreubotella should we try to get this over the finish line?

script-inserted data (or alternatively, perform speculative parsing).</p>

<p>In this model, most of the processing for the <a href="#scriptEndTag"><code>script</code>
end tag</a> would take place on the event loop and, and so calling <code
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
end tag</a> would take place on the event loop and, and so calling <code
end tag</a> would take place on the event loop, so calling <code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants