Expose tokenizer state (dealing with script data correctly) #11

not-my-profile · 2021-11-30T12:04:37Z

I think most people who use an HTML5 tokenizer will want <script><b>test</b></script> to be tokenized as

StartTag(StartTag { self_closing: false, name: "script", attributes: {} })
String("<b>test</b>")
EndTag(EndTag { name: "script" })

instead of

StartTag(StartTag { self_closing: false, name: "script", attributes: {} })
StartTag(StartTag { self_closing: false, name: "b", attributes: {} })
String("test")
EndTag(EndTag { name: "b" })
EndTag(EndTag { name: "script" })

Unless I am missing something that doesn't seem to be possible with the current API?

Sidenote: It would also be nice to have some convenience utility that automatically dealt with the state implications of script, style, title, textarea, iframe etc. For example the html tokenizer of the Python standard library automatically takes care of script and style.

The text was updated successfully, but these errors were encountered:

untitaker · 2021-12-01T10:29:00Z

I think most people who use an HTML5 tokenizer will want <script>test</script> to be tokenized as

I don't fully understand how the referenced section about HTML fragments applies here by the way, but I haven't read it in detail.

For as long as html5gum doesn't give you a real treebuilder (i'm considering adding one), I think just exposing the state enum makes sense.

not-my-profile · 2021-12-01T10:49:17Z

The get_state method of my PR is not spec compliant for two reasons:

it doesn't handle noscript since that depends on a flag ... this could be fixed by adding a scripting: bool parameter
following the spec by switching to the script data state for <script> results in the following

% printf '<script><b>test</b></script>'  |  cargo run --example=switch-state    
StartTag(StartTag { self_closing: false, name: "script", attributes: {} })
String("<b>test")
EndTag(EndTag { name: "b" })
EndTag(EndTag { name: "script" })

which obviously is strictly undesirable ... I am actually not quite sure why that happens (I assume it's because in an actual browser the JavaScript parser takes over?). If we could figure out why that happens and fix it / provide a way to work around it we could make get_state spec compliant.

I'm happy to expose the full enum, seems like that's generally required to build meaningful extensions on top of the tokenizer.

I don't think exposing the whole state enum is necessary the states mentioned in the Parsing HTML fragments section should suffice for most meaningful extensions. I think I would prefer exposing only these for now, marking the public enum #[non_exhaustive] and waiting with exposing it all till somebody actually has a use case for that.

untitaker · 2021-12-02T22:07:08Z

which obviously is strictly undesirable

do you know how other tokenizers behave and what states they pass through? the state machine in html5gum is hand-written and i can't find any similar tests in html5lib-tests so it's easily possible that I messed up

not-my-profile · 2021-12-03T06:44:16Z

Oh yeah you messed up one state transition. I added a commit to fix it to my PR #13.

I also renamed StartTag::get_state to the more descriptive next_state and updated it to be fully spec compliant :)

In #11 it was discovered that script data state is parsed wrongly. We should ideally contribute a test to html5lib-tests.

Fix #11

untitaker · 2022-06-18T13:03:15Z

I'm gonna fix this in #41

thanks @not-my-profile... I should've acted on this sooner. I thought it would be possible to implement a tree builder in html5gum in a reasonable timeframe and kept punting on this issue. I see that lol-html has a similar mapping in its own sourcecode where I suspect yours was also from.

I'm still a bit afraid that people are going to make uninformed choices here. There are some extra checks in lol-html's source code that I don't fully understand that appear to have security implications if such a parser were to be used in browser-grade applications. See this: https://github.com/cloudflare/lol-html/blob/f40a9f767c41caf07851548d7470649a6019548c/src/parser/tree_builder_simulator/ambiguity_guard.rs#L1

not-my-profile · 2023-08-11T08:22:09Z

Thank you as well! I really like the name of the naive_next_state function you introduced, it nicely highlights that this prolly isn't completely sound.

Perhaps the switch_states method of the DefaultEmitter should be called naive_switch_states as well?

untitaker · 2023-08-11T09:42:19Z

yes makes sense. I would be fine with a pr that just makes the breaking change

…

On Fri, Aug 11, 2023, at 10:22, Martin Fischer wrote: Thank you as well! I really like the name of the naive_next_state <https://docs.rs/html5gum/0.5.7/html5gum/fn.naive_next_state.html> function you introduced, it nicely highlights that this prolly isn't completely sound. Perhaps the `switch_states` method of the `DefaultEmitter` should be called `naive_switch_states` as well? — Reply to this email directly, view it on GitHub <#11 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGMPRMHWE2L2N43CAJCSULXUXTTXANCNFSM5JBT6I5A>. You are receiving this because you modified the open/close state.Message ID: ***@***.***>

not-my-profile changed the title ~~Facilitate parsing HTML (switching tokenizer state based on tag name)~~ Switching tokenizer state based on tag name Nov 30, 2021

not-my-profile mentioned this issue Nov 30, 2021

Allow setting certain parser states & add StartTag::next_state #13

Closed

untitaker added a commit that referenced this issue Dec 5, 2021

fix wrong script data-related transition

0b402c5

In #11 it was discovered that script data state is parsed wrongly. We should ideally contribute a test to html5lib-tests.

untitaker mentioned this issue Dec 5, 2021

fix wrong script data-related transition #15

Merged

untitaker changed the title ~~Switching tokenizer state based on tag name~~ Ability to switch tokenizer state Dec 5, 2021

untitaker added the enhancement New feature or request label Dec 5, 2021

untitaker changed the title ~~Ability to switch tokenizer state~~ Ability to switch tokenizer state (dealing with script data correctly) Dec 5, 2021

untitaker changed the title ~~Ability to switch tokenizer state (dealing with script data correctly)~~ Expose tokenizer state (dealing with script data correctly) Dec 5, 2021

untitaker mentioned this issue Dec 12, 2021

Add tree builder #21

Open

untitaker mentioned this issue Jun 18, 2022

Ignore links inside <script> tags untitaker/hyperlink#148

Closed

untitaker added a commit that referenced this issue Jun 18, 2022

Allow setting certain parser states, take 2

d12584b

Fix #11

untitaker mentioned this issue Jun 18, 2022

[breaking change] Allow setting certain parser states, take 2 #41

Merged

untitaker closed this as completed in #41 Jun 18, 2022

not-my-profile mentioned this issue Aug 11, 2023

[breaking]: Rename DefaultEmitter::switch_states #83

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose tokenizer state (dealing with script data correctly) #11

Expose tokenizer state (dealing with script data correctly) #11

not-my-profile commented Nov 30, 2021 •

edited

untitaker commented Dec 1, 2021

not-my-profile commented Dec 1, 2021 •

edited

untitaker commented Dec 2, 2021

not-my-profile commented Dec 3, 2021

untitaker commented Jun 18, 2022

not-my-profile commented Aug 11, 2023

untitaker commented Aug 11, 2023 via email

Expose tokenizer state (dealing with script data correctly) #11

Expose tokenizer state (dealing with script data correctly) #11

Comments

not-my-profile commented Nov 30, 2021 • edited

untitaker commented Dec 1, 2021

not-my-profile commented Dec 1, 2021 • edited

untitaker commented Dec 2, 2021

not-my-profile commented Dec 3, 2021

untitaker commented Jun 18, 2022

not-my-profile commented Aug 11, 2023

untitaker commented Aug 11, 2023 via email

not-my-profile commented Nov 30, 2021 •

edited

not-my-profile commented Dec 1, 2021 •

edited