-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Formalize bogus comment state #993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@inikulin Yeah, I saw how parse5 implemented it, but this extra state is a bit redundant IMO. In linked commit it works without this extra state, although implementation will be surely still free to perform own optimizations specific to its needs (they don't necessarily even need to implement actual state machine after all :) ), and as for the spec itself I'd prefer it to be as minimal and clean as possible. |
|
Makes sense |
| state machine to switch into the bogus comment state, up to and including the character | ||
| immediately before the last consumed character (i.e. up to the character just before the U+003E or | ||
| EOF character), but with any U+0000 NULL characters replaced by U+FFFD REPLACEMENT CHARACTER | ||
| characters. (If the comment was started by the end of the file (EOF), the token is empty. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a bit sad to lose this parenthetical note. Maybe move it to a <p class="note">?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I guess the EOF one is obvious. I'm not sure where the <!> one comes from; it seems like that's not a bogus comment at all? Do you understand it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It happens when you come to bogus comment state from markup declaration open state. If ! in that state is not followed by doctype, comment hypnes or CDATA (if it's enabled by parser) then any character after it should be reconsumed in bogus comment state. It's not quite obvious from the spec, since markup declaration open state is not formalized as well:
Switch to the bogus comment state. The next character that is consumed, if any, is the first character that will be in the comment.
State description doesn't have "Consume the next input character" prefix. Therefore no character is consumed since the last state. Speaking clearly we just use lookahead here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking of it a little bit I suppose we need to change markup declaration open state as well.
The next character that is consumed, if any, is the first character that will be in the comment.
This is not true. If next character is > it will trigger switch to data state in bogus comment state and thus wouldn't be in the comment (it will remain empty)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we need to change markup declaration open state as well
Yeah, I just decided that it will be easier to modularize PRs (2nd one already got bit enough). There are many states like that that are described via human language and not formalized - markup declaration open , CData section etc. One state at a time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not true. If next character is > it will trigger switch to data state in bogus comment state ant thus wouldn't be in the comment (it will remain empty)
Yeah, that's why I removed it in the PR.
For now, also added explicit (don't consume anything in the current state) in markup declaration open state to avoid misunderstanding (we can formalize it better in future in the separate PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks!
Proof of work: RReverser/parse5@647a075
|
LGTM. @domenic OK without the non-normative text about |
|
I guess so; seems kind of sad to lose but it's just a non-normative note anyway. LGTM. |
|
Thanks! |
|
It appears this had lines with trailing (well, only) whitespace. For future PRs, can you please set your editor to trim trailing whitespace? (We should probably have automated checks for such things, but don't yet...) Thx! |
|
@zcorpan Ah, sorry for that - my VSCode was misconfigured. |
This formalizes bogus comment state in tokenizer in form of a state machine instead of description.
Proof of work on example of parse5 tokenizer: RReverser/parse5@647a075