Refactor tag states and actions #994

RReverser · 2016-04-05T14:59:51Z

This refactors the state machines that describe reading of tag names, attributes and DOCTYPE tokens. The goal is to make the states in the specification, and the corresponding state machine implementations, easier to understand by reducing the number of duplicated actions.

Each commit covers one subject and links to the relevant commit in a fork of the parse5 HTML parser. Those commits apply the suggested changes to the tokenizer code, both to prove that the changes work as intended and to show the code simplifications on a real-world, well-tested implementation.

domenic · 2016-04-05T15:08:44Z

Thanks so much for this, and for #993! You may be more familiar with the parser at this point than most of the editors, so we'll likely be reviewing for editorial conventions and not much else. Maybe @inikulin could help review as well?

Also, the fact that these are working in parse5 means that all the html5lib tests still pass, right?

inikulin · 2016-04-05T15:15:25Z

source

-   be filled in before it is emitted.)</dd>
+   <dt><span data-x="ASCII letters">ASCII letter</span></dt>
+   <dd>Create a new start tag token, set its tag name to the empty string. Switch to the <span>tag
+   name state</span>. Reconsume the <span>current input character</span>.


This introduces an additional reconsumption. While it makes the spec cleaner, it may introduce a performance penalty (not significant, but still).

@inikulin Depends on the implementation. As far as I understand from the parse5 code, it does there, but in our in-house implementations "reconsume" just means "don't do anything and go to the specified state" (as opposed to regular character consumption, where you actually move the pointer), so it cleans up the code without any performance penalty at all. I believe parse5 wouldn't be hard to change this way as well.

Yep, but it will introduce an additional check for the ASCII character: the first one will trigger reconsumption, then we will need to additionally check for ASCII upper or lower case in the new state.

@inikulin Checking whether a plain number is >= or <= a constant is extremely cheap compared to all the other operations we're doing here :)

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

https://html.spec.whatwg.org/multipage/infrastructure.html#conformance-requirements
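To make the "reconsume" point in this thread concrete, here is a minimal sketch of a toy tokenizer in which reconsuming is nothing more than switching state without advancing the input position. It is an illustration under stated assumptions, not parse5's code or the spec's algorithm: the state names, `is_ascii_letter`, and `tokenize_start_tags` are hypothetical, and only a tiny subset of the tag states is modeled.

```python
# Hypothetical state names for a toy subset of the tokenizer.
DATA, TAG_OPEN, TAG_NAME, SKIP = "data", "tag open", "tag name", "skip"


def is_ascii_letter(ch):
    # Two range comparisons on the code point, as discussed above.
    cp = ord(ch)
    return 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A


def tokenize_start_tags(html):
    """Return lowercased start tag names found in `html` (toy model)."""
    names = []
    state, pos, name = DATA, 0, ""
    while pos < len(html):
        ch = html[pos]
        if state == DATA:
            if ch == "<":
                state = TAG_OPEN
            pos += 1  # consume
        elif state == TAG_OPEN:
            if is_ascii_letter(ch):
                name = ""         # new start tag token with an empty name
                state = TAG_NAME  # reconsume: switch state, leave pos alone
            else:
                state = DATA      # reconsume in the data state
        elif state == TAG_NAME:
            if ch == ">":
                names.append(name)
                state = DATA
            elif ch in "\t\n\f /":
                state = SKIP      # ignore attributes in this toy model
            else:
                name += ch.lower()
            pos += 1  # consume
        else:  # SKIP: drop everything up to the closing ">"
            if ch == ">":
                names.append(name)
                state = DATA
            pos += 1  # consume
    return names
```

In this model the only extra work caused by reconsumption is that the first letter of a tag name passes through one more state dispatch; the pointer is never moved backwards.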

inikulin · 2016-04-05T15:23:13Z

I'll try to take a closer look tomorrow. But my main concern currently is that most modifications introduce additional reconsumption and thus increase computational complexity (within constant factors).

RReverser · 2016-04-05T15:23:38Z

Also, the fact that these are working in parse5 means that all the html5lib tests still pass, right?

I'd assume so, but it would be nice if @inikulin could confirm.

RReverser · 2016-04-05T15:33:48Z

But my main concern currently is that most modifications introduce additional reconsumption and thus increase computational complexity (within constant factors).

@inikulin Here are the parse5 results on my branch:

[16:30:44] Starting 'benchmark'...
[16:30:44] Running 'parse5 regression benchmark - HUGE' from ~/Documents/Web/parse5/test/benchmark/bench-huge.js ...
[16:30:51] Working copy x 7.34 ops/sec ±7.99% (23 runs sampled)
[16:30:57] Upstream x 7.27 ops/sec ±9.05% (23 runs sampled)
[16:30:57] 'parse5 regression benchmark - HUGE' from ~/Documents/Web/parse5/test/benchmark/bench-huge.js (passed: 2, failed: 0)
[16:30:57] Passed:
[16:30:57] 'Working copy' at 1.01x faster
[16:30:57] 'Upstream' is etalon
[16:30:57] Running 'parse5 regression benchmark - MICRO' from ~/Documents/Web/parse5/test/benchmark/bench-micro.js ...
[16:31:03] Working copy x 53.82 ops/sec ±6.81% (60 runs sampled)
[16:31:09] Upstream x 46.92 ops/sec ±15.61% (56 runs sampled)
[16:31:09] 'parse5 regression benchmark - MICRO' from ~/Documents/Web/parse5/test/benchmark/bench-micro.js (passed: 2, failed: 0)
[16:31:09] Passed:
[16:31:09] 'Working copy' at 1.15x faster
[16:31:09] 'Upstream' is etalon
[16:31:09] Running 'parse5 regression benchmark - PAGES' from ~/Documents/Web/parse5/test/benchmark/bench-pages.js ...
[16:31:14] Working copy x 147 ops/sec ±6.09% (53 runs sampled)
[16:31:20] Upstream x 144 ops/sec ±6.96% (52 runs sampled)
[16:31:20] 'parse5 regression benchmark - PAGES' from ~/Documents/Web/parse5/test/benchmark/bench-pages.js (passed: 2, failed: 0)
[16:31:20] Passed:
[16:31:20] 'Working copy' at 1.02x faster
[16:31:20] 'Upstream' is etalon
[16:31:20] Running 'parse5 regression benchmark - STREAM' from ~/Documents/Web/parse5/test/benchmark/bench-stream.js ...
[16:31:26] Working copy x 108 ops/sec ±3.81% (73 runs sampled)
[16:31:32] Upstream x 115 ops/sec ±3.52% (65 runs sampled)
[16:31:32] 'parse5 regression benchmark - STREAM' from ~/Documents/Web/parse5/test/benchmark/bench-stream.js (passed: 2, failed: 0)
[16:31:32] Passed:
[16:31:32] 'Upstream' is etalon
[16:31:32] 'Working copy' at 1.06x slower

Taking the ± into account, I'd say that speed didn't change (sometimes a little bit faster, sometimes a little bit slower).
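The "taking ± into account" reasoning can be checked roughly with the figures from the log above. The sketch below assumes that the ± percentage is a relative margin of error around the reported ops/sec, which is an assumption about how the benchmark tool reports it, and it uses interval overlap as a coarse sanity check rather than a proper significance test. The helper names are hypothetical.

```python
def interval(ops_per_sec, rme_percent):
    # Assumption: the ± figure is a relative margin around the mean.
    margin = ops_per_sec * rme_percent / 100.0
    return (ops_per_sec - margin, ops_per_sec + margin)


def overlaps(a, b):
    # Two closed intervals overlap when each starts before the other ends.
    return a[0] <= b[1] and b[0] <= a[1]


# (ops/sec, ± percent) for "Working copy" and "Upstream", copied from the log.
RESULTS = {
    "HUGE": ((7.34, 7.99), (7.27, 9.05)),
    "MICRO": ((53.82, 6.81), (46.92, 15.61)),
    "PAGES": ((147, 6.09), (144, 6.96)),
    "STREAM": ((108, 3.81), (115, 3.52)),
}


def indistinguishable(results):
    """Map each benchmark name to True when the two intervals overlap."""
    return {
        name: overlaps(interval(*working), interval(*upstream))
        for name, (working, upstream) in results.items()
    }
```

Under that assumption the intervals overlap for all four benchmarks, including STREAM, where the working copy is reported as 1.06x slower.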

inikulin · 2016-04-05T15:48:03Z

I'd assume so, but it would be nice if @inikulin could confirm.

Well, parse5 uses all available html5lib tests. Taking into account that the tokenization state machine is quite sensitive to changes and that we have nearly 100% test coverage, I assume it's nearly impossible to introduce wrong behavior that is not caught by the tests.

inikulin · 2016-04-05T16:00:04Z

@inikulin Here are the parse5 results on my branch:

Looks good. I guess it's because switching to the end tag/start tag states doesn't occur as often as, e.g., data state invocation, so its performance impact is insignificant.

domenic · 2016-04-05T20:18:30Z

source

-   <dt>U+0022 QUOTATION MARK (&quot;)</dt>
-   <dt>U+0027 APOSTROPHE (')</dt>
-   <dt>U+003C LESS-THAN SIGN (&lt;)</dt>
-   <dt>U+003D EQUALS SIGN (=)</dt>


Doesn't this change remove the parse error for the = sign? The attribute name state consumes it with no error.

Oh, good catch, thanks!

@domenic Fixed and rebased.
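The concern in this thread can be sketched as a single transition function: if a leading "=" were simply reconsumed in the attribute name state, no parse error would be raised, so "=" needs its own branch. The function below is a hypothetical model of a "before attribute name" step written for illustration; the tuple layout and the name `before_attribute_name` are assumptions, and it is not the text that was merged.

```python
def before_attribute_name(ch):
    """Model one step of a 'before attribute name' state.

    Returns (next_state, reconsume, parse_error, initial_name), where
    initial_name is None when no new attribute is started.
    """
    if ch in "\t\n\f ":
        # Whitespace is ignored; stay in the same state.
        return ("before attribute name", False, False, None)
    if ch in "/>":
        # Let the 'after attribute name' state handle these characters.
        return ("after attribute name", True, False, None)
    if ch == "=":
        # Keep the parse error: "=" becomes the first character of the name
        # and is consumed here, so the next state never sees it as a separator.
        return ("attribute name", False, True, "=")
    # Anything else: start an empty-named attribute and reconsume.
    return ("attribute name", True, False, "")
```

For example, `before_attribute_name("=")` reports a parse error and does not reconsume, while `before_attribute_name("a")` starts an empty name and leaves the character for the attribute name state.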

domenic · 2016-04-05T20:19:37Z

Is "reconsume" defined anywhere?

inikulin · 2016-04-05T21:32:57Z

Is "reconsume" defined anywhere?

Not explicitly:

Most states consume a single character,
which may have various side-effects, and either switches the state machine to a new state to
reconsume the same character, or switches it to a new state to consume the next character,
or stays in the same state to consume the next character.

(at the beginning of the Tokenization section).

P.S. how do you guys work with this huge file? Even raw view lags for me a little bit

annevk · 2016-04-06T07:01:08Z

P.S. how do you guys work with this huge file?

Sublime Text 2 on my MacBook Air from 2012 handles it pretty much like any other text file (a little lag with the initial rendering). I mostly use GitHub Desktop for diffs.

RReverser · 2016-04-06T20:25:43Z

@domenic Is there anything else I should fix in this one?

sideshowbarker · 2016-04-07T08:42:08Z

P.S. how do you guys work with this huge file? Even raw view lags for me a little bit

vim

annevk · 2016-04-07T08:43:39Z

@zcorpan @gsnedders maybe you can take a look during the day until @domenic wakes up?

RReverser · 2016-04-07T11:35:54Z

@annevk Fixed the whitespace here as well. Sorry again, and please let me know if there is anything else I should change.

annevk · 2016-04-07T11:47:01Z

It looks good to me. I'll let @domenic do a final check.

zcorpan · 2016-04-07T12:05:12Z

I'm not confident yet about the doctype changes. Is it equivalent for parse errors and force quirks? Are those aspects well tested?

RReverser · 2016-04-07T12:26:23Z

@zcorpan As for force-quirks: yes, you can see the related parse5 commit, which reflects those changes precisely and doesn't break any of the html5lib tests or parse5's own tests. As for parse errors, I'm happy to address any specific concerns.

zcorpan · 2016-04-08T12:12:33Z

DOCTYPE state: <!doctypehtml> should be a parse error (but not <!doctype html>)
After DOCTYPE name state no longer reconsumes EOF? (I suppose it should have "Reconsume the current input character" at the end of Anything else to fix.)

The rest LGTM.
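The first point above, that `<!doctypehtml>` must be a parse error while `<!doctype html>` must not, can be sketched as follows. This is a simplified model of the DOCTYPE state for illustration only: `doctype_state` and `parse_error_after_doctype_keyword` are hypothetical names, and the EOF case is deliberately left out.

```python
def doctype_state(ch):
    """Model one step of the DOCTYPE state (EOF not modeled).

    Returns (next_state, reconsume, parse_error).
    """
    if ch in "\t\n\f ":
        # Whitespace after the keyword is the well-formed case.
        return ("before DOCTYPE name", False, False)
    # Anything else: flag the missing whitespace, then let the next
    # state look at the same character again.
    return ("before DOCTYPE name", True, True)


def parse_error_after_doctype_keyword(markup):
    """Return True if the character right after '<!doctype' is an error."""
    keyword = "<!doctype"
    if not markup.lower().startswith(keyword) or len(markup) <= len(keyword):
        raise ValueError("expected '<!doctype' followed by at least one character")
    _, _, parse_error = doctype_state(markup[len(keyword)])
    return parse_error
```

A refactoring that merges this state into "before DOCTYPE name" has to preserve the error branch, which is the regression being pointed out.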

zcorpan · 2016-04-08T12:23:51Z

source

-   character.</dd>
-
   <dt>Anything else</dt>
   <dd>Append the <span>current input character</span> to the current attribute's name.</dd>


EOF gets appended to the attribute's name here (Attribute name state). This seems wrong.

It's not - it's handled above together with everything else that's invalid in doctype name (solidus, spaces, greater-than sign) https://github.com/whatwg/html/pull/994/files#diff-36cd38f49b9afa08222c0dc9ebfe35ebR100042

Ah OK. 👍

RReverser · 2016-04-08T13:16:53Z

DOCTYPE state: <!doctypehtml> should be a parse error (but not <!doctype html>)

You're right, this is an issue. Will fix.

After DOCTYPE name state no longer reconsumes EOF? (I suppose it should have "Reconsume the current input character" at the end of Anything else to fix.)

@zcorpan It's addressed (see line comment).

Proof of work: RReverser/parse5@2ece567

Proof of work: RReverser/parse5@b159bb9

RReverser · 2016-04-08T13:24:06Z

Rebased to contain only tag & attribute changes, will send fixed DOCTYPE in a separate PR.
Hopefully this will simplify reasoning & review.

zcorpan · 2016-04-08T15:36:24Z

LGTM! Thank you!

RReverser changed the title from "Refactor tags in tokenizer" to "Refactor tag states and actions" Apr 5, 2016

RReverser force-pushed the refactor-tags branch from 7cbc333 to a4935f1 April 5, 2016 15:04

inikulin reviewed Apr 5, 2016

domenic reviewed Apr 5, 2016

domenic added the "clarification" (Standard could be clearer) label Apr 5, 2016

RReverser force-pushed the refactor-tags branch from a4935f1 to eeb38f0 April 6, 2016 10:16

RReverser force-pushed the refactor-tags branch from eeb38f0 to 27b9d8c April 7, 2016 11:35

zcorpan self-assigned this Apr 7, 2016

zcorpan reviewed Apr 8, 2016

RReverser closed this Apr 8, 2016

RReverser reopened this Apr 8, 2016

Avoid duplication of actions for reading tag names

8a1a046

Proof of work: RReverser/parse5@2ece567

RReverser added 2 commits April 8, 2016 14:23

Avoid duplication of actions for reading attributes

2336118

Proof of work: RReverser/parse5@b159bb9

Define "ASCII letters"

818a143

RReverser force-pushed the refactor-tags branch from 27b9d8c to 818a143 April 8, 2016 13:23

zcorpan merged commit b66ec32 into whatwg:master Apr 8, 2016
