Formalize bogus comment state #993

RReverser · 2016-04-05T14:54:03Z

This formalizes the bogus comment state in the tokenizer as a state machine instead of a prose description.

Proof of work, using the parse5 tokenizer as an example: RReverser/parse5@647a075

inikulin · 2016-04-05T15:36:28Z

Great work! I wonder whether we should split the bogus comment state into two states to reduce code duplication. We need to move comment token creation out of the bogus comment state to make it reentrant (this is required for streaming parsing).

For example, Blink and parse5 both implement it as two states.
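To illustrate the reentrancy point, here is a minimal Python sketch. It is not parse5's or Blink's actual code, and the class and method names are hypothetical. It assumes the comment token is created by whichever state switches into the bogus comment state, so the state's handler holds no setup logic and can simply be resumed when input arrives in chunks.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


class BogusCommentSketch:
    """Hypothetical streaming tokenizer fragment (illustration only)."""

    def __init__(self):
        self.comment = None  # data of the comment being built, or None
        self.tokens = []     # emitted tokens

    def start_bogus_comment(self):
        # Assumption: done by the state that switches into bogus comment,
        # not by the bogus comment state itself.
        self.comment = []

    def feed(self, chunk):
        """Consume `chunk` in the bogus comment state.

        Must be called after start_bogus_comment(). Returns the part of
        the chunk left over once '>' has closed the comment.
        """
        for i, ch in enumerate(chunk):
            if ch == ">":
                self._emit()
                return chunk[i + 1:]
            self.comment.append(REPLACEMENT if ch == "\x00" else ch)
        return ""  # chunk exhausted; the state can be re-entered later

    def end(self):
        # EOF while still inside the comment: emit what we have.
        if self.comment is not None:
            self._emit()

    def _emit(self):
        self.tokens.append(("comment", "".join(self.comment)))
        self.comment = None
```

Because `feed` never creates the token itself, calling it a second time with the next chunk continues the same comment rather than starting a new one.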

RReverser · 2016-04-05T15:39:40Z

@inikulin Yeah, I saw how parse5 implemented it, but this extra state is a bit redundant IMO. The linked commit works without the extra state. Implementations are of course still free to make their own optimizations specific to their needs (they don't even need to implement an actual state machine, after all :) ), but for the spec itself I'd prefer to keep it as minimal and clean as possible.

inikulin · 2016-04-05T15:43:31Z

Makes sense

domenic · 2016-04-05T20:12:26Z

source

-  state machine to switch into the bogus comment state, up to and including the character
-  immediately before the last consumed character (i.e. up to the character just before the U+003E or
-  EOF character), but with any U+0000 NULL characters replaced by U+FFFD REPLACEMENT CHARACTER
-  characters. (If the comment was started by the end of the file (EOF), the token is empty.


It seems a bit sad to lose this parenthetical note. Maybe move it to a <p class="note">?

Well, I guess the EOF one is obvious. I'm not sure where the <!> one comes from; it seems like that's not a bogus comment at all? Do you understand it?

It happens when you enter the bogus comment state from the markup declaration open state. If the ! in that state is not followed by a doctype, comment hyphens, or CDATA (if the parser has it enabled), then any character after it should be reconsumed in the bogus comment state. This is not obvious from the spec, since the markup declaration open state is not formalized either:

Switch to the bogus comment state. The next character that is consumed, if any, is the first character that will be in the comment.

The state description doesn't have the "Consume the next input character" prefix, so no character has been consumed since the previous state. Put plainly, we just use lookahead here.
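The lookahead described here can be sketched in Python as follows. This is an illustration under assumptions, not the spec text or parse5's code: the function name, the `allow_cdata` flag (standing in for "if it's enabled by parser"), and the returned state-name strings are all hypothetical.

```python
def markup_declaration_open(text, pos, allow_cdata=False):
    """Pick the next state after '<!' using lookahead only.

    `pos` is the index just after '<!'. Returns (state_name, new_pos).
    Illustrative sketch; names are not taken from any real tokenizer.
    """
    rest = text[pos:]
    if rest.startswith("--"):
        return "comment start", pos + 2
    if rest[:7].lower() == "doctype":  # matched case-insensitively
        return "DOCTYPE", pos + 7
    if allow_cdata and rest.startswith("[CDATA["):
        return "CDATA section", pos + 7
    # Nothing matched: switch to bogus comment WITHOUT consuming anything,
    # so the character at `pos` is handled by the bogus comment state.
    return "bogus comment", pos
```

For the input `<!>`, the function returns `("bogus comment", 2)`: the position has not moved, so the `>` is still waiting to be consumed by the bogus comment state.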

After thinking about it a bit, I suppose we need to change the markup declaration open state as well.

The next character that is consumed, if any, is the first character that will be in the comment.

This is not true. If the next character is >, it will trigger a switch to the data state in the bogus comment state and thus won't be in the comment (which will remain empty).
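A small Python sketch of the bogus comment state as a per-character state machine makes the empty-comment case concrete. It follows the behaviour described in this thread (stop at `>` or EOF, replace U+0000 with U+FFFD); the function name and return shape are assumptions made for illustration, not the spec's wording.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER
EOF = None              # end-of-input sentinel (assumption of this sketch)


def bogus_comment(chars):
    """Run the bogus comment state over `chars`.

    `chars` is the input starting at the first character the state will
    consume. Returns (comment_data, remaining_input).
    """
    data = []
    pos = 0
    while True:
        # Consume the next input character.
        ch = chars[pos] if pos < len(chars) else EOF
        pos += 1
        if ch == ">":
            # Emit the comment and switch to the data state.
            return "".join(data), chars[pos:]
        if ch is EOF:
            # Emit the comment; there is no input left.
            return "".join(data), ""
        data.append(REPLACEMENT if ch == "\x00" else ch)
```

With `<!>`, the state starts at the `>`, so `bogus_comment(">")` yields an empty comment and the `>` never becomes part of the comment data.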

I suppose we need to change the markup declaration open state as well

Yeah, I just decided it would be easier to keep the PRs modular (the second one already got big enough). There are many states like that which are described in prose and not formalized: markup declaration open, CDATA section, etc. One state at a time.

@inikulin

This is not true. If the next character is >, it will trigger a switch to the data state in the bogus comment state and thus won't be in the comment (which will remain empty)

Yeah, that's why I removed it in the PR.

For now, I also added an explicit "(don't consume anything in the current state)" to the markup declaration open state to avoid misunderstanding (we can formalize it better in a separate PR in the future).

Nice, thanks!

Proof of work: RReverser/parse5@647a075

zcorpan · 2016-04-06T13:24:59Z

LGTM. @domenic OK without the non-normative text about <!(EOF) and <!>?

domenic · 2016-04-06T13:35:43Z

I guess so; seems kind of sad to lose but it's just a non-normative note anyway. LGTM.

RReverser · 2016-04-06T21:16:48Z

Thanks!

zcorpan · 2016-04-07T11:25:51Z

It appears this had lines with trailing (well, only) whitespace. For future PRs, can you please set your editor to trim trailing whitespace? (We should probably have automated checks for such things, but don't yet...) Thx!

RReverser · 2016-04-07T11:28:10Z

@zcorpan Ah, sorry for that - my VSCode was misconfigured.

RReverser force-pushed the bogus-comment branch from 322a9f5 to 9551270 Compare April 5, 2016 15:04

domenic mentioned this pull request Apr 5, 2016

Refactor tag states and actions #994

Merged

domenic reviewed Apr 5, 2016

domenic added the clarification Standard could be clearer label Apr 5, 2016

Formalize bogus comment state

80a98e2

Proof of work: RReverser/parse5@647a075

RReverser force-pushed the bogus-comment branch from 9551270 to 80a98e2 Compare April 6, 2016 10:09

domenic merged commit 28a40d1 into whatwg:master Apr 6, 2016

RReverser deleted the bogus-comment branch April 6, 2016 21:17
