@RReverser
Member

This formalizes the bogus comment state in the tokenizer as a state machine instead of a prose description.

Proof that it works, using the parse5 tokenizer as an example: RReverser/parse5@647a075

@inikulin
Member

inikulin commented Apr 5, 2016

Great work! I wonder whether we should split the bogus comment state into two states to reduce code duplication. We need to remove comment token creation from the bogus comment state to make it reentrant (this is required for streaming parsing).

E.g. Blink and parse5 implement such states.
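
To illustrate the reentrancy concern, here is a minimal Python sketch of a two-state split. It is not the Blink or parse5 code; the class `StreamingTokenizer`, its `feed` method, and the state name "bogus comment continuation" are hypothetical names chosen for this example. The entry state creates the token once and consumes nothing, so the consuming state can be re-entered safely when the input runs out in the middle of a comment.

```python
from dataclasses import dataclass, field
from typing import List, Optional

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


@dataclass
class StreamingTokenizer:
    """Hypothetical chunk-fed tokenizer that only models bogus comments."""

    state: str = "bogus comment"
    comment: Optional[List[str]] = None
    emitted: List[str] = field(default_factory=list)

    def feed(self, chunk: str) -> None:
        pos = 0
        while True:
            if self.state == "bogus comment":
                # Entry state: create the token exactly once, consume nothing.
                self.comment = []
                self.state = "bogus comment continuation"
            elif self.state == "bogus comment continuation":
                if pos == len(chunk):
                    # Out of input. Resuming in this state on the next chunk
                    # keeps the partially built comment intact.
                    return
                ch = chunk[pos]
                pos += 1
                if ch == ">":
                    self.emitted.append("".join(self.comment))
                    self.comment = None
                    self.state = "data"
                else:
                    self.comment.append(REPLACEMENT if ch == "\0" else ch)
            else:
                # The data state is out of scope for this sketch.
                return


tokenizer = StreamingTokenizer()
tokenizer.feed("hel")   # chunk ends mid-comment
tokenizer.feed("lo>")   # comment completes in the second chunk
print(tokenizer.emitted)
```

If token creation lived in the same state that consumes characters, resuming after the first chunk would reset the token and drop the data collected so far.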

@RReverser
Member Author

@inikulin Yeah, I saw how parse5 implements it, but the extra state seems a bit redundant IMO. The linked commit works without it, and implementations remain free to make their own optimizations specific to their needs (they don't even need to implement an actual state machine, after all :) ). As for the spec itself, I'd prefer it to be as minimal and clean as possible.

@inikulin
Member

inikulin commented Apr 5, 2016

Makes sense

state machine to switch into the bogus comment state, up to and including the character
immediately before the last consumed character (i.e. up to the character just before the U+003E or
EOF character), but with any U+0000 NULL characters replaced by U+FFFD REPLACEMENT CHARACTER
characters. (If the comment was started by the end of the file (EOF), the token is empty.
Member

It seems a bit sad to lose this parenthetical note. Maybe move it to a <p class="note">?

Member

Well, I guess the EOF one is obvious. I'm not sure where the <!> one comes from; it seems like that's not a bogus comment at all? Do you understand it?

Member

It happens when you enter the bogus comment state from the markup declaration open state. If the ! in that state is not followed by DOCTYPE, comment hyphens, or CDATA (if the parser enables it), then any character after it should be reconsumed in the bogus comment state. It's not quite obvious from the spec, since the markup declaration open state is not formalized either:

Switch to the bogus comment state. The next character that is consumed, if any, is the first character that will be in the comment.

The state description doesn't have a "Consume the next input character" prefix. Therefore no character has been consumed since the last state. Put plainly, we just use lookahead here.
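
The lookahead behavior can be sketched roughly as follows. This is an illustrative Python function, not spec text or parse5 code; the function name `markup_declaration_open` and the `allow_cdata` flag are assumptions made for the example. The input position is taken to be just after the "<!".

```python
def markup_declaration_open(text: str, pos: int, allow_cdata: bool = False):
    """Return (next_state, new_pos) for input positioned just after '<!'."""
    if text.startswith("--", pos):
        # Comment hyphens: consume both and start a real comment.
        return "comment start", pos + 2
    if text[pos:pos + 7].upper() == "DOCTYPE":
        # Case-insensitive match for the word DOCTYPE.
        return "DOCTYPE", pos + 7
    if allow_cdata and text.startswith("[CDATA[", pos):
        return "CDATA section", pos + 7
    # Nothing matched. Only lookahead was used, so pos is unchanged and the
    # bogus comment state sees the very same character next.
    return "bogus comment", pos


print(markup_declaration_open("<!>", 2))
print(markup_declaration_open("<!-- x -->", 2))
```

Returning the unchanged position in the fallback branch is the point of the sketch: the character after "<!" is examined but left for the bogus comment state to consume.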

Member

After thinking about it a little, I suppose we need to change the markup declaration open state as well.

The next character that is consumed, if any, is the first character that will be in the comment.

This is not true. If the next character is >, it will trigger a switch to the data state in the bogus comment state and thus won't be in the comment (which will remain empty).
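
A small per-character sketch makes the empty-comment case concrete. It is written in Python under stated assumptions: the comment token is taken to be created (empty) before this state is entered, the helper name `run_bogus_comment` is hypothetical, and the EOF handling follows the quoted prose rather than any exact spec wording.

```python
EOF = None
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def run_bogus_comment(text: str, pos: int):
    """Run from the bogus comment state.

    Returns (comment_data, next_state, new_pos).
    """
    data = []  # assumed: the empty comment token already exists on entry
    while True:
        ch = text[pos] if pos < len(text) else EOF
        if ch == ">":
            # '>' ends the comment and is consumed, but is not part of it.
            return "".join(data), "data", pos + 1
        if ch is EOF:
            # EOF is left unconsumed so the data state can handle it.
            return "".join(data), "data", pos
        data.append(REPLACEMENT if ch == "\0" else ch)
        pos += 1


# For "<!>" the first character seen is '>', so the comment stays empty.
print(run_bogus_comment("<!>", 2))
```

With the input "<!>" and the position just after "<!", the first character the state sees is the ">", so it returns immediately with empty comment data.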

Member Author

I suppose we need to change the markup declaration open state as well

Yeah, I just decided it would be easier to keep the PRs modular (the 2nd one already got big enough). There are many states like this that are described in prose rather than formalized: markup declaration open, CDATA section, etc. One state at a time.

Member Author

@inikulin

This is not true. If the next character is >, it will trigger a switch to the data state in the bogus comment state and thus won't be in the comment (which will remain empty).

Yeah, that's why I removed it in the PR.

For now, I also added an explicit "(don't consume anything in the current state)" to the markup declaration open state to avoid misunderstanding (we can formalize it better in the future in a separate PR).

Member

Nice, thanks!

@domenic domenic added the clarification Standard could be clearer label Apr 5, 2016
@zcorpan
Member

zcorpan commented Apr 6, 2016

LGTM. @domenic OK without the non-normative text about <!(EOF) and <!>?

@domenic
Member

domenic commented Apr 6, 2016

I guess so; seems kind of sad to lose but it's just a non-normative note anyway. LGTM.

@domenic domenic merged commit 28a40d1 into whatwg:master Apr 6, 2016
@RReverser
Member Author

Thanks!

@RReverser RReverser deleted the bogus-comment branch April 6, 2016 21:17
@zcorpan
Member

zcorpan commented Apr 7, 2016

It appears this had lines with trailing (well, only) whitespace. For future PRs, can you please set your editor to trim trailing whitespace? (We should probably have automated checks for such things, but don't yet...) Thx!

@RReverser
Member Author

@zcorpan Ah, sorry for that - my VSCode was misconfigured.
