html5ever pull tokenizer? #208

Closed
TyOverby opened this issue Apr 21, 2016 · 13 comments

Comments

@TyOverby

I'm working on an application where the push-tokenizer that is built into html5ever is not very ergonomic.

Instead of making a sink and having process_token called on it, I would rather enqueue a Tendril and get an Iterator of Tokens that are parsed and returned on demand.

My question is: would this be an acceptable option for html5ever? I wouldn't mind doing the implementation.

@jdm
Member

jdm commented Apr 21, 2016

@nox and @SimonSapin, any opinions here?

@TyOverby
Author

TyOverby commented Apr 21, 2016

A pull-based model would also allow for abandoning a parse midway through.
I also think you could implement push parsing on top of pull parsing.

for token in tokenize(input) {
    sink.process_token(token);
}
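
To make the idea concrete, here is a minimal sketch of push parsing layered on a pull iterator. Everything in it is a hypothetical stand-in (`Token`, `Sink`, `Tokens`, `tokenize`, `push_all`); it is a toy that splits on `<` and `>`, not html5ever's tokenizer or its real sink trait.

```rust
/// Hypothetical token type; a real HTML token type is much richer.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Tag(String),
    Text(String),
}

/// Hypothetical push-style consumer, loosely modelled on a token sink.
trait Sink {
    fn process_token(&mut self, token: Token);
}

/// Toy pull tokenizer: yields tokens lazily from a complete input string.
struct Tokens<'a> {
    rest: &'a str,
}

fn tokenize(input: &str) -> Tokens<'_> {
    Tokens { rest: input }
}

impl<'a> Iterator for Tokens<'a> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        if self.rest.is_empty() {
            return None;
        }
        if self.rest.starts_with('<') {
            // A tag runs up to and including the next '>' (or to the end of input).
            let end = self.rest.find('>').map_or(self.rest.len(), |i| i + 1);
            let (tag, rest) = self.rest.split_at(end);
            self.rest = rest;
            let name = tag.trim_matches(|c: char| c == '<' || c == '>');
            Some(Token::Tag(name.to_string()))
        } else {
            // Text runs up to the next '<' (or to the end of input).
            let end = self.rest.find('<').unwrap_or(self.rest.len());
            let (text, rest) = self.rest.split_at(end);
            self.rest = rest;
            Some(Token::Text(text.to_string()))
        }
    }
}

/// Push parsing layered on pull parsing: drain the iterator into the sink.
fn push_all<S: Sink>(input: &str, sink: &mut S) {
    for token in tokenize(input) {
        sink.process_token(token);
    }
}
```

Abandoning a parse midway then amounts to dropping the iterator early, for example `tokenize(input).take(1)`; no work is done for tokens that are never requested.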

@SimonSapin
Member

push parsing on top of pull parsing.

What about the tokenizer’s own input, when it’s not available all at once?

@TyOverby
Author

My example was highly simplified. I would expect that you could have a PullParser struct that kept parse-state and allowed more source to be added over time. This PullParser would either be an iterator, or have an iter() method.
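
A rough sketch of what such a struct could look like, using whitespace-separated words in place of HTML tokens. The `feed` and `finish` methods and the field names are assumptions made for illustration, not an existing html5ever API. The key point is that `next` returning `None` means "no complete token yet", and the iterator can yield again after more input is fed.

```rust
/// Hypothetical resumable pull tokenizer. Tokens here are whitespace-separated
/// words, standing in for real HTML tokens.
struct PullParser {
    /// Input received so far that has not been turned into tokens yet.
    buf: String,
    /// Set once the caller promises that no more input will arrive.
    finished: bool,
}

impl PullParser {
    fn new() -> Self {
        PullParser { buf: String::new(), finished: false }
    }

    /// Add more source text; parse state is kept between calls.
    fn feed(&mut self, chunk: &str) {
        self.buf.push_str(chunk);
    }

    /// Signal end of input so that a trailing partial word can be flushed.
    fn finish(&mut self) {
        self.finished = true;
    }
}

impl Iterator for PullParser {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // Drop leading whitespace so the buffer starts at a word.
        let start = self.buf.len() - self.buf.trim_start().len();
        self.buf.drain(..start);

        match self.buf.find(char::is_whitespace) {
            // A delimiter follows the word, so the word is complete.
            Some(end) => Some(self.buf.drain(..end).collect()),
            // End of input was signalled: flush whatever is left.
            None if self.finished && !self.buf.is_empty() => {
                Some(std::mem::take(&mut self.buf))
            }
            // The word may continue in the next chunk; ask for more input.
            None => None,
        }
    }
}
```

Note that this iterator is deliberately not fused: a caller alternates between `feed` and draining `next` until it returns `None`, then calls `finish` once the source is exhausted.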

@Ygg01
Contributor

Ygg01 commented Apr 22, 2016

@TyOverby So, would this require a major rewrite of the tokenizer's and tree builder's rule macros? Do the current macros make sense under a PullParser approach?

@TyOverby
Author

@Ygg01: I doubt it, but since I don't see anyone vehemently opposed, I'll prototype this and see how it turns out.

@nox
Contributor

nox commented Apr 25, 2016

@TyOverby Don't forget that tree building alters tokenisation, though.

@Ygg01
Contributor

Ygg01 commented Apr 25, 2016

@TyOverby I don't have a vote, but one really great thing about html5ever is that its macros follow the spec format closely, so when comparing code and spec it's easy to see where they diverge.

@TyOverby
Author

@nox

In what way can the tree builder change tokenisation? set_plaintext_state?
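
To illustrate the kind of feedback being asked about, here is a toy sketch in which the consumer switches the tokenizer into a plaintext mode between two `next` calls. The `Tokenizer` type below is hypothetical and only borrows the name `set_plaintext_state` from the question; it does not reproduce html5ever's behaviour.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Tag(String),
    Text(String),
}

/// Toy pull tokenizer whose mode can be changed by the consumer.
struct Tokenizer<'a> {
    rest: &'a str,
    plaintext: bool,
}

impl<'a> Tokenizer<'a> {
    fn new(input: &'a str) -> Self {
        Tokenizer { rest: input, plaintext: false }
    }

    /// From now on, treat all remaining input as a single text token.
    fn set_plaintext_state(&mut self) {
        self.plaintext = true;
    }
}

impl<'a> Iterator for Tokenizer<'a> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        if self.rest.is_empty() {
            return None;
        }
        let is_tag = !self.plaintext && self.rest.starts_with('<');
        let end = if self.plaintext {
            self.rest.len()
        } else if is_tag {
            self.rest.find('>').map_or(self.rest.len(), |i| i + 1)
        } else {
            self.rest.find('<').unwrap_or(self.rest.len())
        };
        let (chunk, rest) = self.rest.split_at(end);
        self.rest = rest;
        if is_tag {
            let name = chunk.trim_matches(|c: char| c == '<' || c == '>');
            Some(Token::Tag(name.to_string()))
        } else {
            Some(Token::Text(chunk.to_string()))
        }
    }
}

/// Toy "tree builder": reacts to a token by changing how later input is tokenized.
fn build(input: &str) -> Vec<Token> {
    let mut tokenizer = Tokenizer::new(input);
    let mut seen = Vec::new();
    // `while let` rather than `for`, so the tokenizer stays usable inside the body.
    while let Some(token) = tokenizer.next() {
        if token == Token::Tag("plaintext".to_string()) {
            tokenizer.set_plaintext_state();
        }
        seen.push(token);
    }
    seen
}
```

With a pull API the consumer has to apply such a state change before requesting the next token, which only works if the tokenizer never runs ahead of what has been consumed.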

@SimonSapin
Member

My opinion is that if you manage to add a new pull API without affecting the current push API, go for it. (It’d also be nice to avoid significantly rewriting the tokenizer or tree builder code, as @Ygg01 mentioned.) But I’m skeptical that it’s possible without coroutines (which Rust doesn’t have) or threads (which would add synchronization overhead).
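
For reference, the thread-based variant could look roughly like the sketch below, which wraps an unchanged push-style function behind a channel. `push_tokenize` and `pull_tokens` are hypothetical stand-ins (the "tokenizer" just splits on whitespace); only the `std::sync::mpsc` and `std::thread` calls are real APIs.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for a push tokenizer: it calls `emit` once per token
/// and offers the caller no way to pause or resume it.
fn push_tokenize(input: &str, mut emit: impl FnMut(String)) {
    for word in input.split_whitespace() {
        emit(word.to_string());
    }
}

/// Expose the push tokenizer as a pull iterator by running it on its own thread.
fn pull_tokens(input: String) -> impl Iterator<Item = String> {
    // Capacity 0 makes each `send` block until the consumer calls `recv`,
    // so the producer stays at most one token ahead of the consumer.
    let (tx, rx) = sync_channel(0);
    thread::spawn(move || {
        push_tokenize(&input, |token| {
            // If the receiver was dropped, `send` fails and the token is discarded;
            // the push tokenizer still runs to the end of its input.
            let _ = tx.send(token);
        });
    });
    rx.into_iter()
}
```

Every token crosses a thread boundary through a blocking rendezvous, which is the synchronization overhead mentioned above, and an abandoned parse still runs the producer to completion.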

@TyOverby
Author

The tokeniser is already broken up into discrete steps, with the run method calling step repeatedly. My plan (and I haven't looked into this too deeply) is to write an iterator's next method that calls into step. step will have to be slightly rewritten (for instance, there are quite a few cases that don't return any tokens), but I don't think the change will be huge.
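
A minimal sketch of that plan, assuming a `step` that reports whether it produced a token. The `Step` enum, the two-state machine, and all names are invented for illustration; the real tokenizer has many more states and a different `step` signature.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Text(String),
    Tag(String),
}

#[derive(Clone, Copy)]
enum State {
    Data,
    TagName,
}

/// Outcome of one state-machine step (hypothetical type).
enum Step {
    /// Input was consumed but no token is complete yet.
    Continue,
    /// The step completed a token.
    Emit(Token),
    /// No input is left.
    Done,
}

struct StepTokenizer<'a> {
    chars: std::str::Chars<'a>,
    state: State,
    current: String,
}

impl<'a> StepTokenizer<'a> {
    fn new(input: &'a str) -> Self {
        StepTokenizer { chars: input.chars(), state: State::Data, current: String::new() }
    }

    /// Consume at most one character and report what happened.
    fn step(&mut self) -> Step {
        let c = match self.chars.next() {
            Some(c) => c,
            None => {
                // Flush trailing text; an unterminated tag is silently dropped.
                if matches!(self.state, State::Data) && !self.current.is_empty() {
                    return Step::Emit(Token::Text(std::mem::take(&mut self.current)));
                }
                return Step::Done;
            }
        };
        match (self.state, c) {
            (State::Data, '<') => {
                self.state = State::TagName;
                if self.current.is_empty() {
                    Step::Continue
                } else {
                    Step::Emit(Token::Text(std::mem::take(&mut self.current)))
                }
            }
            (State::TagName, '>') => {
                self.state = State::Data;
                Step::Emit(Token::Tag(std::mem::take(&mut self.current)))
            }
            _ => {
                self.current.push(c);
                Step::Continue
            }
        }
    }
}

impl<'a> Iterator for StepTokenizer<'a> {
    type Item = Token;

    /// Keep stepping until a step emits a token or the input runs out.
    fn next(&mut self) -> Option<Token> {
        loop {
            match self.step() {
                Step::Continue => continue,
                Step::Emit(token) => return Some(token),
                Step::Done => return None,
            }
        }
    }
}
```

The loop in `next` is what absorbs the steps that return no token. A step that emits more than one token at once would additionally need a small queue, which this sketch leaves out.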

@nox
Contributor

nox commented Nov 6, 2016

@TyOverby @SimonSapin We may want to discuss that again now that h5e doesn't own its input stream anymore.

@nox
Contributor

nox commented Mar 4, 2017

Closing.

@nox nox closed this as completed Mar 4, 2017