Refactor tag states and actions #994
Thanks so much for this, and for #993! You may be more familiar with the parser at this point than most of the editors, so we'll likely be reviewing for editorial conventions and not much else. Maybe @inikulin could help review as well? Also, the fact that these are working in parse5 means that all the html5lib tests still pass, right? |
| be filled in before it is emitted.)</dd> | ||
| <dt><span data-x="ASCII letters">ASCII letter</span></dt> | ||
| <dd>Create a new start tag token, set its tag name to the empty string. Switch to the <span>tag | ||
| name state</span>. Reconsume the <span>current input character</span>. |
This introduces additional reconsumption. While it makes the spec cleaner, it may introduce a performance penalty (not significant, but still).
@inikulin Depends on the implementation. As far as I understand from the parse5 code, it does there, but in our in-house implementations "reconsume" just means "don't do anything and go to the specified state" (as opposed to regular character consumption, where you actually move the pointer), so it cleans up the code without any performance penalty at all. I believe parse5 wouldn't be hard to change this way as well.
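For illustration, the "reconsume is just a state switch" idea can be sketched as below. This is a minimal, hypothetical tokenizer that only models simplified data, tag open and tag name states; it is not parse5's code and not the in-house implementation mentioned above. The names `tokenizeStartTags`, `State` and `StartTag` are made up for this sketch.

```typescript
// Hypothetical mini-tokenizer: only start tags, no attributes, no end tags.
type State = "data" | "tagOpen" | "tagName";

interface StartTag {
  tagName: string;
}

function tokenizeStartTags(input: string): StartTag[] {
  const tokens: StartTag[] = [];
  let state: State = "data";
  let pos = 0; // index of the current input character
  let current: StartTag | null = null;

  while (pos < input.length) {
    const ch = input.charAt(pos);
    switch (state) {
      case "data":
        if (ch === "<") {
          state = "tagOpen";
        }
        pos++; // regular consumption: move the pointer
        break;

      case "tagOpen":
        if (/[A-Za-z]/.test(ch)) {
          // Create the token with an empty name, then "reconsume" in the
          // tag name state: only the state changes, pos is NOT advanced.
          current = { tagName: "" };
          state = "tagName";
        } else {
          // Simplification for this sketch: reconsume in the data state.
          state = "data";
        }
        break;

      case "tagName":
        if (ch === ">") {
          if (current !== null) {
            tokens.push(current);
          }
          current = null;
          state = "data";
        } else if (current !== null) {
          // The tag name state lowercases and appends every name character,
          // including the first one that was reconsumed.
          current.tagName += ch.toLowerCase();
        }
        pos++;
        break;
    }
  }
  return tokens;
}
```

With this shape, the tag open state no longer duplicates the "append the lowercased character" action; the price is one extra pass through the `switch` for the first letter of each tag name.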
Yep, but it will introduce an additional check for the ASCII character: the first check triggers the reconsumption, and then the new state has to check again whether the character is ASCII upper or lower.
@inikulin Checking whether a plain number is >= or <= a constant is extremely cheap compared to all the other operations we're doing here :)
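As a rough illustration of the kind of check being discussed, ASCII letter tests on a code point are plain range comparisons, and lowercasing an uppercase ASCII letter is an addition of 0x20. The helper names below are hypothetical and not taken from parse5.

```typescript
// Range checks on a numeric code point; no string or regex work involved.
function isAsciiUpperAlpha(cp: number): boolean {
  return cp >= 0x41 && cp <= 0x5a; // 'A'..'Z'
}

function isAsciiLowerAlpha(cp: number): boolean {
  return cp >= 0x61 && cp <= 0x7a; // 'a'..'z'
}

function isAsciiAlpha(cp: number): boolean {
  return isAsciiUpperAlpha(cp) || isAsciiLowerAlpha(cp);
}

// Lowercase an ASCII uppercase letter by adding 0x20; leave anything else alone.
function toAsciiLower(cp: number): number {
  return isAsciiUpperAlpha(cp) ? cp + 0x20 : cp;
}
```

In a refactored tag open state, `isAsciiAlpha` would decide whether to reconsume, and the tag name state would then apply `toAsciiLower` to the same code point; that repeated comparison is the extra cost under discussion.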
Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)
https://html.spec.whatwg.org/multipage/infrastructure.html#conformance-requirements
|
I'll try to take a closer look tomorrow. My main concern at the moment is that most of the modifications introduce additional reconsumption and thus increase the computational cost (by a constant factor).
I'd assume so, though it would be nice if @inikulin could confirm.
@inikulin Here are the parse5 results on my branch:
Taking the ± margins into account, I'd say the speed didn't change (sometimes a little faster, sometimes a little slower).
Well, parse5 uses all available html5lib tests. Taking into account that the tokenization state machine is quite sensitive to changes and that we have nearly 100% test coverage, I assume it's nearly impossible to introduce wrong behavior that isn't caught by the tests.
Looks good. I guess it's because switching to the end tag/start tag states doesn't occur as often as, e.g., data state invocation, so its performance impact is insignificant.
| <dt>U+0022 QUOTATION MARK (")</dt> | ||
| <dt>U+0027 APOSTROPHE (')</dt> | ||
| <dt>U+003C LESS-THAN SIGN (<)</dt> | ||
| <dt>U+003D EQUALS SIGN (=)</dt> |
Doesn't this change remove the parse error for the = sign? The attribute name state consumes it with no error.
Oh, good catch, thanks!
@domenic Fixed and rebased.
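The concern raised in this thread can be sketched roughly as follows. The functions model only whether a given character produces a parse error, under the assumption that the attribute name state treats `"`, `'` and `<` as parse errors but treats `=` as the ordinary end of the name. All names are hypothetical, and this is not the actual spec text or the parse5 change.

```typescript
// Characters assumed to be a parse error inside an attribute name.
const ERROR_IN_ATTRIBUTE_NAME: string[] = ['"', "'", "<"];

// Attribute name state: "=" ends the name and is NOT a parse error here.
function attributeNameReportsError(ch: string): boolean {
  return ERROR_IN_ATTRIBUTE_NAME.indexOf(ch) !== -1;
}

// Before attribute name state with an explicit entry for "=" kept:
// "=" is reported before the character is treated as part of a name.
function beforeAttributeNameExplicit(ch: string): boolean {
  return ch === "=" || attributeNameReportsError(ch);
}

// Hypothetical over-simplification: reconsume everything in the
// attribute name state and rely on its error handling alone.
function beforeAttributeNameReconsumeAll(ch: string): boolean {
  return attributeNameReportsError(ch);
}
```

For `"`, `'` and `<` both variants report an error, but for `=` only `beforeAttributeNameExplicit` does, which is why a blanket reconsume would silently drop that parse error.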
|
Is "reconsume" defined anywhere? |
Not explicitly:
(at the beginning of the Tokenization section). P.S. how do you guys work with this huge file? Even raw view lags for me a little bit |
Sublime Text 2 on my Macbook Air from 2012 handles it pretty close to any other text file (little lag with the initial rendering). I mostly use GitHub Desktop for diffs. |
|
@domenic Is there anything else I should fix in this one? |
vim |
|
@zcorpan @gsnedders maybe you can take a look during the day until @domenic wakes up? |
|
@annevk Fixed whitespaces here as well. Sorry again and please let me know if there is anything else I should change. |
|
It looks good to me. I'll let @domenic do a final check. |
|
I'm not confident yet about the doctype changes. Is it equivalent for parse errors and force quirks? Are those aspects well tested? |
|
@zcorpan As for force-quirks - yes, you can see the related |
The rest LGTM. |
| character.</dd> | ||
|
|
||
| <dt>Anything else</dt> | ||
| <dd>Append the <span>current input character</span> to the current attribute's name.</dd> |
EOF gets appended to the attribute's name here (Attribute name state). This seems wrong.
It's not - it's handled above, together with everything else that's invalid in doctype name (solidus, spaces, greater-than sign): https://github.com/whatwg/html/pull/994/files#diff-36cd38f49b9afa08222c0dc9ebfe35ebR100042
Ah OK. 👍
You're right, this is an issue. Will fix.
@zcorpan It's addressed (see line comment). |
|
Rebased to contain only tag & attribute changes, will send fixed DOCTYPE in a separate PR. |
|
LGTM! Thank you! |
This refactors the state machines that describe reading of tag names, attributes and DOCTYPE tokens. The goal is to make the states described in the specification, and the corresponding state machine implementations, easier to understand by reducing the number of duplicated actions.
Each commit covers a particular subject and links to the relevant commit in a fork of the parse5 HTML parser. Those commits apply the suggested changes to the tokenizer code, to prove that the changes work as intended and to show the code simplifications on a real-world, well-tested implementation.