Skip to content

Conversation

@RReverser
Copy link
Member

This refactors state machines describing reading of tag names, attributes and DOCTYPE tokens in order to simplify understanding of states described in the specification and corresponding state machine implementation by reducing number of duplicative actions.

Each commit contains changes to a particular subject, and contains link to relevant commit in fork of parse5 HTML parser that contains suggested changes in tokenizer code in order to prove that changes work as intended and to show off code simplifications on a real-world well-tested implementation.

@RReverser RReverser changed the title Refactor tags in tokenizer Refactor tag states and actions Apr 5, 2016
@domenic
Copy link
Member

domenic commented Apr 5, 2016

Thanks so much for this, and for #993! You may be more familiar with the parser at this point than most of the editors, so we'll likely be reviewing for editorial conventions and not much else. Maybe @inikulin could help review as well?

Also, the fact that these are working in parse5 means that all the html5lib tests still pass, right?

be filled in before it is emitted.)</dd>
<dt><span data-x="ASCII letters">ASCII letter</span></dt>
<dd>Create a new start tag token, set its tag name to the empty string. Switch to the <span>tag
name state</span>. Reconsume the <span>current input character</span>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces additional reconsumption. While it makes spec cleaner, this may introduce performance penalty (not significant, but still).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inikulin Depends on the implementation. As far as I understand from parse5 code, it does, but in our in-house implementations "reconsume" means just "don't do anything and go to specified state" (in opposite to regular character consumption where you actually move pointer), so it cleans up code without any performance penalty at all. I believe parse5 wouldn't be hard to change this way as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, but it will introduce additional check for the ASCII character: first one will trigger reconsumption then we will need additionally check for the ASCII upper or lower in new state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inikulin Checking whether a pure number is >= / <= than constants is extremely cheap compared to all the other operations we're doing here :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

https://html.spec.whatwg.org/multipage/infrastructure.html#conformance-requirements

@inikulin
Copy link
Member

inikulin commented Apr 5, 2016

I'll try to take precise look tomorrow. But my main concern currently is that most modifications introduce additional reconsumption and thus increasing computational complexity (within constant factors).

@RReverser
Copy link
Member Author

Also, the fact that these are working in parse5 means that all the html5lib tests still pass, right?

I'd say I assume though, but would be nice if @inikulin could confirm.

@RReverser
Copy link
Member Author

But my main concern currently is that most modifications introduce additional reconsumption and thus increasing computational complexity (within constant factors).

@inikulin Here are the parse5 results on my branch:

[16:30:44] Starting 'benchmark'...
[16:30:44] Running 'parse5 regression benchmark - HUGE' from ~/Documents/Web/parse5/test/benchmark/bench-huge.js ...
[16:30:51] Working copy x 7.34 ops/sec ±7.99% (23 runs sampled)
[16:30:57] Upstream x 7.27 ops/sec ±9.05% (23 runs sampled)
[16:30:57] 'parse5 regression benchmark - HUGE' from ~/Documents/Web/parse5/test/benchmark/bench-huge.js (passed: 2, failed: 0)
[16:30:57] Passed:
[16:30:57] 'Working copy' at 1.01x faster
[16:30:57] 'Upstream' is etalon
[16:30:57] Running 'parse5 regression benchmark - MICRO' from ~/Documents/Web/parse5/test/benchmark/bench-micro.js ...
[16:31:03] Working copy x 53.82 ops/sec ±6.81% (60 runs sampled)
[16:31:09] Upstream x 46.92 ops/sec ±15.61% (56 runs sampled)
[16:31:09] 'parse5 regression benchmark - MICRO' from ~/Documents/Web/parse5/test/benchmark/bench-micro.js (passed: 2, failed: 0)
[16:31:09] Passed:
[16:31:09] 'Working copy' at 1.15x faster
[16:31:09] 'Upstream' is etalon
[16:31:09] Running 'parse5 regression benchmark - PAGES' from ~/Documents/Web/parse5/test/benchmark/bench-pages.js ...
[16:31:14] Working copy x 147 ops/sec ±6.09% (53 runs sampled)
[16:31:20] Upstream x 144 ops/sec ±6.96% (52 runs sampled)
[16:31:20] 'parse5 regression benchmark - PAGES' from ~/Documents/Web/parse5/test/benchmark/bench-pages.js (passed: 2, failed: 0)
[16:31:20] Passed:
[16:31:20] 'Working copy' at 1.02x faster
[16:31:20] 'Upstream' is etalon
[16:31:20] Running 'parse5 regression benchmark - STREAM' from ~/Documents/Web/parse5/test/benchmark/bench-stream.js ...
[16:31:26] Working copy x 108 ops/sec ±3.81% (73 runs sampled)
[16:31:32] Upstream x 115 ops/sec ±3.52% (65 runs sampled)
[16:31:32] 'parse5 regression benchmark - STREAM' from ~/Documents/Web/parse5/test/benchmark/bench-stream.js (passed: 2, failed: 0)
[16:31:32] Passed:
[16:31:32] 'Upstream' is etalon
[16:31:32] 'Working copy' at 1.06x slower

If taking ± into account, I'd say that speed didn't change (sometimes a little bit faster, sometimes a little bit slower).

@inikulin
Copy link
Member

inikulin commented Apr 5, 2016

I'd say I assume though, but would be nice if @inikulin could confirm.

Well, parse5 uses all available html5lib tests. Taking in the account that tokenization state machine is quite sensitive to changes and we have nearly 100% test coverage I assume that it's nearly impossible to introduce wrong behavior that will not be reflected by tests.

@inikulin
Copy link
Member

inikulin commented Apr 5, 2016

@inikulin Here are the parse5 results on my branch:

Looks good. I guess it's because switch to end tag/start tag states occurs not as often as e.g. data state invocation, so it's performance is insignificant.

<dt>U+0022 QUOTATION MARK (&quot;)</dt>
<dt>U+0027 APOSTROPHE (')</dt>
<dt>U+003C LESS-THAN SIGN (&lt;)</dt>
<dt>U+003D EQUALS SIGN (=)</dt>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this change remove the parse error for = sign? The attribute name state consumes it with no error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good catch, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@domenic Fixed and rebased.

@domenic
Copy link
Member

domenic commented Apr 5, 2016

Is "reconsume" defined anywhere?

@inikulin
Copy link
Member

inikulin commented Apr 5, 2016

Is "reconsume" defined anywhere?

Not explicitly:

Most states consume a single character,
which may have various side-effects, and either switches the state machine to a new state to
reconsume the same character, or switches it to a new state to consume the next character,
or stays in the same state to consume the next character.

(at the beginning of the Tokenization section).

P.S. how do you guys work with this huge file? Even raw view lags for me a little bit

@domenic domenic added the clarification Standard could be clearer label Apr 5, 2016
@annevk
Copy link
Member

annevk commented Apr 6, 2016

P.S. how do you guys work with this huge file?

Sublime Text 2 on my Macbook Air from 2012 handles it pretty close to any other text file (little lag with the initial rendering). I mostly use GitHub Desktop for diffs.

@RReverser
Copy link
Member Author

@domenic Is there anything else I should fix in this one?

@sideshowbarker
Copy link
Member

P.S. how do you guys work with this huge file? Even raw view lags for me a little bit

vim

@annevk
Copy link
Member

annevk commented Apr 7, 2016

@zcorpan @gsnedders maybe you can take a look during the day until @domenic wakes up?

@RReverser
Copy link
Member Author

@annevk Fixed whitespaces here as well. Sorry again and please let me know if there is anything else I should change.

@annevk
Copy link
Member

annevk commented Apr 7, 2016

It looks good to me. I'll let @domenic do a final check.

@zcorpan
Copy link
Member

zcorpan commented Apr 7, 2016

I'm not confident yet about the doctype changes. Is it equivalent for parse errors and force quirks? Are those aspects well tested?

@RReverser
Copy link
Member Author

@zcorpan As for force-quirks - yes, you can see the related parse5 commit which reflects those changes precisely and doesn't break any of the html5lib nor own tests. As for parse errors - happy to address any specific concerns.

@zcorpan zcorpan self-assigned this Apr 7, 2016
@zcorpan
Copy link
Member

zcorpan commented Apr 8, 2016

  • DOCTYPE state: <!doctypehtml> should be a parse error (but not <!doctype html>)
  • After DOCTYPE name state no longer reconsumes EOF? (I suppose it should have "Reconsume the current input character" at the end of Anything else to fix.)

The rest LGTM.

character.</dd>

<dt>Anything else</dt>
<dd>Append the <span>current input character</span> to the current attribute's name.</dd>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EOF gets appended to the attribute's name here (Attribute name state). This seems wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not - it's handled above together with everything else what's invalid in doctype name (solidus, spaces, greater-than sign) https://github.com/whatwg/html/pull/994/files#diff-36cd38f49b9afa08222c0dc9ebfe35ebR100042

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK. 👍

@RReverser
Copy link
Member Author

DOCTYPE state: <!doctypehtml> should be a parse error (but not <!doctype html>)

You're right, this is an issue. Will fix.

After DOCTYPE name state no longer reconsumes EOF? (I suppose it should have "Reconsume the current input character" at the end of Anything else to fix.)

@zcorpan It's addressed (see line comment).

@RReverser RReverser closed this Apr 8, 2016
@RReverser RReverser reopened this Apr 8, 2016
@RReverser
Copy link
Member Author

Rebased to contain only tag & attribute changes, will send fixed DOCTYPE in a separate PR.
Hopefully this will simplify reasoning & review.

@zcorpan
Copy link
Member

zcorpan commented Apr 8, 2016

LGTM! Thank you!

@zcorpan zcorpan merged commit b66ec32 into whatwg:master Apr 8, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clarification Standard could be clearer

Development

Successfully merging this pull request may close these issues.

6 participants