Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose tokenizer state (dealing with script data correctly) #11

Closed
not-my-profile opened this issue Nov 30, 2021 · 7 comments · Fixed by #41
Closed

Expose tokenizer state (dealing with script data correctly) #11

not-my-profile opened this issue Nov 30, 2021 · 7 comments · Fixed by #41
Labels
enhancement New feature or request

Comments

@not-my-profile
Copy link
Contributor

not-my-profile commented Nov 30, 2021

I think most people who use an HTML5 tokenizer will want <script><b>test</b></script> to be tokenized as

StartTag(StartTag { self_closing: false, name: "script", attributes: {} })
String("<b>test</b>")
EndTag(EndTag { name: "script" })

instead of

StartTag(StartTag { self_closing: false, name: "script", attributes: {} })
StartTag(StartTag { self_closing: false, name: "b", attributes: {} })
String("test")
EndTag(EndTag { name: "b" })
EndTag(EndTag { name: "script" })

Unless I am missing something that doesn't seem to be possible with the current API?

Sidenote: It would also be nice to have some convenience utility that automatically dealt with the state implications of script, style, title, textarea, iframe etc. For example the html tokenizer of the Python standard library automatically takes care of script and style.

@not-my-profile not-my-profile changed the title Facilitate parsing HTML (switching tokenizer state based on tag name) Switching tokenizer state based on tag name Nov 30, 2021
@untitaker
Copy link
Owner

I think most people who use an HTML5 tokenizer will want <script>test</script> to be tokenized as

I don't fully understand how the referenced section about HTML fragments applies here by the way, but I haven't read it in detail.

For as long as html5gum doesn't give you a real treebuilder (i'm considering adding one), I think just exposing the state enum makes sense.

@not-my-profile
Copy link
Contributor Author

not-my-profile commented Dec 1, 2021

The get_state method of my PR is not spec compliant for two reasons:

  • it doesn't handle noscript since that depends on a flag ... this could be fixed by adding a scripting: bool parameter
  • following the spec by switching to the script data state for <script> results in the following
% printf '<script><b>test</b></script>'  |  cargo run --example=switch-state    
StartTag(StartTag { self_closing: false, name: "script", attributes: {} })
String("<b>test")
EndTag(EndTag { name: "b" })
EndTag(EndTag { name: "script" })

which obviously is strictly undesirable ... I am actually not quite sure why that happens (I assume it's because in an actual browser the JavaScript parser takes over?). If we could figure out why that happens and fix it / provide a way to work around it we could make get_state spec compliant.

I'm happy to expose the full enum, seems like that's generally required to build meaningful extensions on top of the tokenizer.

I don't think exposing the whole state enum is necessary the states mentioned in the Parsing HTML fragments section should suffice for most meaningful extensions. I think I would prefer exposing only these for now, marking the public enum #[non_exhaustive] and waiting with exposing it all till somebody actually has a use case for that.

@untitaker
Copy link
Owner

which obviously is strictly undesirable

do you know how other tokenizers behave and what states they pass through? the state machine in html5gum is hand-written and i can't find any similar tests in html5lib-tests so it's easily possible that I messed up

@not-my-profile
Copy link
Contributor Author

Oh yeah you messed up one state transition. I added a commit to fix it to my PR #13.

I also renamed StartTag::get_state to the more descriptive next_state and updated it to be fully spec compliant :)

untitaker added a commit that referenced this issue Dec 5, 2021
In #11 it was discovered that script data state is parsed wrongly.

We should ideally contribute a test to html5lib-tests.
@untitaker untitaker changed the title Switching tokenizer state based on tag name Ability to switch tokenizer state Dec 5, 2021
@untitaker untitaker added the enhancement New feature or request label Dec 5, 2021
@untitaker untitaker changed the title Ability to switch tokenizer state Ability to switch tokenizer state (dealing with script data correctly) Dec 5, 2021
@untitaker untitaker changed the title Ability to switch tokenizer state (dealing with script data correctly) Expose tokenizer state (dealing with script data correctly) Dec 5, 2021
untitaker added a commit that referenced this issue Jun 18, 2022
@untitaker
Copy link
Owner

I'm gonna fix this in #41

thanks @not-my-profile... I should've acted on this sooner. I thought it would be possible to implement a tree builder in html5gum in a reasonable timeframe and kept punting on this issue. I see that lol-html has a similar mapping in its own sourcecode where I suspect yours was also from.

I'm still a bit afraid that people are going to make uninformed choices here. There are some extra checks in lol-html's source code that I don't fully understand that appear to have security implications if such a parser were to be used in browser-grade applications. See this: https://github.com/cloudflare/lol-html/blob/f40a9f767c41caf07851548d7470649a6019548c/src/parser/tree_builder_simulator/ambiguity_guard.rs#L1

@not-my-profile
Copy link
Contributor Author

Thank you as well! I really like the name of the naive_next_state function you introduced, it nicely highlights that this prolly isn't completely sound.

Perhaps the switch_states method of the DefaultEmitter should be called naive_switch_states as well?

@untitaker
Copy link
Owner

untitaker commented Aug 11, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
2 participants