Expose tokenizer state (dealing with script data correctly) #11
Comments
I don't fully understand how the referenced section about HTML fragments applies here, by the way, but I haven't read it in detail. For as long as html5gum doesn't give you a real tree builder (I'm considering adding one), I think just exposing the state enum makes sense.
The
which obviously is strictly undesirable ... I am actually not quite sure why that happens (I assume it's because in an actual browser the JavaScript parser takes over?). If we could figure out why that happens and fix it / provide a way to work around it we could make
I don't think exposing the whole state enum is necessary; the states mentioned in the Parsing HTML fragments section should suffice for most meaningful extensions. I think I would prefer exposing only these for now, marking the public enum |
Do you know how other tokenizers behave and what states they pass through? The state machine in html5gum is hand-written and I can't find any similar tests in html5lib-tests, so it's easily possible that I messed up.
Oh yeah you messed up one state transition. I added a commit to fix it to my PR #13. I also renamed |
In #11 it was discovered that script data state is parsed wrongly. We should ideally contribute a test to html5lib-tests.
I'm gonna fix this in #41, thanks @not-my-profile... I should've acted on this sooner. I thought it would be possible to implement a tree builder in html5gum in a reasonable timeframe and kept punting on this issue. I see that lol-html has a similar mapping in its own source code, which I suspect is also where yours came from. I'm still a bit afraid that people are going to make uninformed choices here. There are some extra checks in lol-html's source code that I don't fully understand and that appear to have security implications if such a parser were to be used in browser-grade applications. See this: https://github.com/cloudflare/lol-html/blob/f40a9f767c41caf07851548d7470649a6019548c/src/parser/tree_builder_simulator/ambiguity_guard.rs#L1
Thank you as well! I really like the name of the naive_next_state function (https://docs.rs/html5gum/0.5.7/html5gum/fn.naive_next_state.html) you introduced; it nicely highlights that this prolly isn't completely sound. Perhaps the `switch_states` method of the `DefaultEmitter` should be called `naive_switch_states` as well? (Martin Fischer, Fri, Aug 11, 2023)
Yes, makes sense. I would be fine with a PR that just makes the breaking change.
I think most people who use an HTML5 tokenizer will want
<script><b>test</b></script>
to be tokenized as instead of
Unless I am missing something, that doesn't seem to be possible with the current API?
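To make the difference concrete, here is a minimal toy sketch in Rust. It is not html5gum's API: the `Token` enum, the `tokenize` function and the `script_aware` flag are all hypothetical names invented for illustration, and the tokenizer only understands attribute-less tags. It also simplifies the real script data state, which matches the end tag case-insensitively and has extra escape handling. The point is only to show the two token streams one would get for the input above.

```rust
/// Hypothetical token type for this sketch (not html5gum's `Token`).
#[derive(Debug, PartialEq)]
enum Token {
    StartTag(String),
    EndTag(String),
    Text(String),
}

/// Toy tokenizer for attribute-less tags only.
/// With `script_aware` set, everything after `<script>` is emitted as a
/// single text token up to the literal `</script>` (a simplification of
/// the script data state). Without it, the content is tokenized as markup.
fn tokenize(input: &str, script_aware: bool) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            // An unterminated `<` is treated as plain text in this sketch.
            let Some(end) = after_lt.find('>') else {
                tokens.push(Token::Text(rest.to_string()));
                break;
            };
            let name = &after_lt[..end];
            rest = &after_lt[end + 1..];
            if let Some(name) = name.strip_prefix('/') {
                tokens.push(Token::EndTag(name.to_string()));
            } else {
                tokens.push(Token::StartTag(name.to_string()));
                if script_aware && name == "script" {
                    // Switch to "raw" mode: consume up to the closing tag.
                    let stop = rest.find("</script>").unwrap_or(rest.len());
                    if stop > 0 {
                        tokens.push(Token::Text(rest[..stop].to_string()));
                    }
                    rest = &rest[stop..];
                }
            }
        } else {
            // Plain text runs until the next `<` or the end of input.
            let stop = rest.find('<').unwrap_or(rest.len());
            tokens.push(Token::Text(rest[..stop].to_string()));
            rest = &rest[stop..];
        }
    }
    tokens
}
```

With `script_aware` enabled the sketch yields a start tag, one text token containing `<b>test</b>`, and an end tag. With it disabled, the `<b>` inside the script element comes out as its own start and end tag, which is the behavior the issue argues most users do not want.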
Sidenote: It would also be nice to have some convenience utility that automatically dealt with the state implications of `script`, `style`, `title`, `textarea`, `iframe`, etc. For example, the HTML tokenizer of the Python standard library automatically takes care of `script` and `style`.
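A convenience utility of this kind could be sketched as a mapping from start tag name to the tokenizer state to switch to, following the element list in the HTML standard's "Parsing HTML fragments" section. This is a hedged sketch only: the `State` enum and `state_after_start_tag` function are hypothetical names, not html5gum's actual items, and the mapping is naive in the sense discussed in this thread. It ignores tree builder context (for example, elements inside foreign content such as SVG), so it should not be relied on for browser-grade or security-sensitive parsing.

```rust
/// Hypothetical subset of tokenizer states a caller may need to switch to
/// after seeing a start tag (names are assumptions for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Data,
    Rcdata,
    Rawtext,
    ScriptData,
    Plaintext,
}

/// Naive mapping from a start tag name to the next tokenizer state,
/// based on the element list in the HTML standard's fragment parsing
/// algorithm. `scripting_enabled` mirrors the standard's scripting flag.
fn state_after_start_tag(tag_name: &str, scripting_enabled: bool) -> State {
    match tag_name.to_ascii_lowercase().as_str() {
        "title" | "textarea" => State::Rcdata,
        "style" | "xmp" | "iframe" | "noembed" | "noframes" => State::Rawtext,
        "script" => State::ScriptData,
        // `noscript` content is only raw text when scripting is enabled.
        "noscript" if scripting_enabled => State::Rawtext,
        "plaintext" => State::Plaintext,
        _ => State::Data,
    }
}
```

A caller would invoke something like this after each start tag token and put the tokenizer into the returned state before continuing. Keeping "naive" in the name of any public version, as suggested in the comments, would signal that the result is a heuristic rather than what a full tree builder would decide.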