html5ever pull tokenizer? #208

TyOverby · 2016-04-21T17:40:12Z

I'm working on an application where the push-tokenizer that is built into html5ever is not very ergonomic.

Instead of making a sink and having process_token get called on it, I would rather enqueue a Tendril, and get an Iterator of Tokens that are parsed and returned on-demand.

My question is: would this be an acceptable option for html5ever? I wouldn't mind doing the implementation.


jdm · 2016-04-21T18:02:10Z

@nox and @SimonSapin, any opinions here?

TyOverby · 2016-04-21T18:29:27Z

A pull-based model would also allow for abandoning a parse midway through.
I also think you could implement push parsing on top of pull parsing.

for token in tokenize(input) {
    sink.process_token(token);
}
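
A slightly fuller sketch of that idea, using toy stand-ins rather than html5ever's real types. `Token`, `Sink`, `tokenize`, `push_all`, `Collect`, and `contains_word` are all hypothetical names made up for illustration. It assumes the whole input is available up front, and shows both the push-on-pull adapter and abandoning a parse early:

```rust
// Hypothetical stand-ins for illustration; these are NOT html5ever's types.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Word(String),
    Eof,
}

/// Push-style consumer, loosely modelled on a sink with a `process_token` callback.
pub trait Sink {
    fn process_token(&mut self, token: Token);
}

/// Toy pull tokenizer: lazily yields one `Word` per whitespace-separated
/// chunk, followed by a single `Eof`.
pub fn tokenize(input: &str) -> impl Iterator<Item = Token> + '_ {
    input
        .split_whitespace()
        .map(|w| Token::Word(w.to_string()))
        .chain(std::iter::once(Token::Eof))
}

/// Push parsing layered on pull parsing: drain the iterator into the sink.
pub fn push_all<S: Sink>(input: &str, sink: &mut S) {
    for token in tokenize(input) {
        sink.process_token(token);
    }
}

/// A sink that simply records every token it receives.
#[derive(Default)]
pub struct Collect(pub Vec<Token>);

impl Sink for Collect {
    fn process_token(&mut self, token: Token) {
        self.0.push(token);
    }
}

/// Abandoning a parse midway: `any` stops pulling as soon as it finds a match,
/// so the rest of the input is never tokenized.
pub fn contains_word(input: &str, needle: &str) -> bool {
    tokenize(input).any(|t| t == Token::Word(needle.to_string()))
}
```

The push direction falls out as a plain loop; the early exit in `contains_word` is the part a sink-based API makes awkward.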

SimonSapin · 2016-04-21T21:52:01Z

> push parsing on top of pull parsing.

What about the tokenizer’s own input, when it’s not available all at once?

TyOverby · 2016-04-21T22:19:09Z

My example was highly simplified. I would expect that you could have a PullParser struct that kept parse-state and allowed more source to be added over time. This PullParser would either be an iterator, or have an iter() method.
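
Roughly what I have in mind, as a toy sketch. This is not html5ever code: `PullParser`, `feed`, and the two-variant `Token` are hypothetical, and the "grammar" is just `<...>` tags and text. The point is only that the struct keeps its state between calls and that `next()` returning `None` means "no complete token buffered yet", not "finished forever":

```rust
use std::collections::VecDeque;

// Toy token type for illustration only.
#[derive(Debug, PartialEq)]
pub enum Token {
    Tag(String),
    Text(String),
}

/// Hypothetical pull parser that keeps its state between calls and accepts
/// more source over time.
#[derive(Default)]
pub struct PullParser {
    pending: VecDeque<char>, // input that has been fed but not yet consumed
}

impl PullParser {
    pub fn new() -> Self {
        Self::default()
    }

    /// Enqueue another chunk of source text.
    pub fn feed(&mut self, chunk: &str) {
        self.pending.extend(chunk.chars());
    }
}

impl Iterator for PullParser {
    type Item = Token;

    /// Returns `None` when the buffered input cannot complete a token.
    /// The iterator is deliberately not fused: after another `feed`,
    /// `next` may return `Some` again.
    fn next(&mut self) -> Option<Token> {
        let first = *self.pending.front()?;
        if first == '<' {
            // A tag is only complete once its closing '>' has arrived.
            let end = self.pending.iter().position(|&c| c == '>')?;
            let raw: String = self.pending.drain(..=end).collect();
            Some(Token::Tag(raw[1..raw.len() - 1].to_string()))
        } else {
            // Text runs up to the next '<', or to the end of what is buffered.
            let end = self
                .pending
                .iter()
                .position(|&c| c == '<')
                .unwrap_or(self.pending.len());
            Some(Token::Text(self.pending.drain(..end).collect()))
        }
    }
}
```

A caller would alternate `parser.feed(chunk)` with `for token in parser.by_ref() { ... }`, so the same parser can be resumed after each chunk.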

Ygg01 · 2016-04-22T17:04:28Z

@TyOverby So, would this require a major rewrite of the tokenizer's and tree builder's rules macros? Do the current macros make sense under a PullParser approach?

TyOverby · 2016-04-25T04:14:49Z

@Ygg01: I doubt it, but since I don't see anyone vehemently opposed, I'll prototype this and see how it turns out.

nox · 2016-04-25T08:49:53Z

@TyOverby Don't forget that tree building alters tokenisation, though.

Ygg01 · 2016-04-25T12:34:25Z

@TyOverby I don't have a vote, but one really great thing about html5ever is that the macros follow the spec format closely, so when comparing the code against the spec it's easy to see where they diverge.

TyOverby · 2016-04-25T17:05:15Z

@nox

In what way can the tree builder change tokenisation? set_plaintext_state?

SimonSapin · 2016-04-25T20:54:08Z

My opinion is that if you manage to add a new pull API without affecting the current push API, go for it. (It’d also be nice to avoid significantly rewriting the tokenizer or tree builder code, as @Ygg01 mentioned.) But I’m skeptical that it’s possible without coroutines (which Rust doesn’t have) or threads (which would add synchronization overhead).

TyOverby · 2016-04-25T21:02:02Z

The tokeniser is already broken up into discrete steps, with the run method calling step repeatedly. My plan (and I haven't looked into this too deeply) is to write an iterator's next method that calls into step. step will have to be slightly rewritten (for instance, there are quite a few cases that don't return any tokens), but I don't think the change will be huge.
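
To make the shape concrete, here is a toy state machine, not the real tokeniser. `Step`, `State`, `Tokenizer`, and `feed` are names invented for this sketch, and the real step signature may well differ. Each call to `step` consumes one character and reports whether it produced a token, made progress without one, or ran out of input. `next` just loops over `step`:

```rust
use std::collections::VecDeque;
use std::mem;

// Toy token type for illustration only.
#[derive(Debug, PartialEq)]
pub enum Token {
    Tag(String),
    Text(String),
}

/// Outcome of a single step (hypothetical; invented for this sketch).
enum Step {
    Emit(Token), // the step produced a token
    Continue,    // progress was made, but no token yet
    Suspend,     // out of input; resume after more is fed
}

enum State {
    Data,
    TagName,
}

pub struct Tokenizer {
    input: VecDeque<char>,
    state: State,
    buf: String, // characters accumulated for the token in progress
}

impl Tokenizer {
    pub fn new() -> Self {
        Tokenizer { input: VecDeque::new(), state: State::Data, buf: String::new() }
    }

    pub fn feed(&mut self, chunk: &str) {
        self.input.extend(chunk.chars());
    }

    /// One discrete step: consume a single character and update the state.
    fn step(&mut self) -> Step {
        let c = match self.input.pop_front() {
            Some(c) => c,
            None => return Step::Suspend,
        };
        match self.state {
            State::Data if c == '<' => {
                self.state = State::TagName;
                if self.buf.is_empty() {
                    Step::Continue
                } else {
                    Step::Emit(Token::Text(mem::take(&mut self.buf)))
                }
            }
            State::TagName if c == '>' => {
                self.state = State::Data;
                Step::Emit(Token::Tag(mem::take(&mut self.buf)))
            }
            // Any other character just accumulates; most steps emit nothing.
            _ => {
                self.buf.push(c);
                Step::Continue
            }
        }
    }
}

impl Iterator for Tokenizer {
    type Item = Token;

    /// Keep stepping until a token appears or the input runs dry.
    fn next(&mut self) -> Option<Token> {
        loop {
            match self.step() {
                Step::Emit(token) => return Some(token),
                Step::Continue => {}
                Step::Suspend => return None,
            }
        }
    }
}
```

In this sketch, trailing text stays buffered until a `<` arrives, so a real version would also need some end-of-input call to flush it.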

nox · 2016-11-06T14:32:38Z

@TyOverby @SimonSapin We may want to discuss that again now that h5e doesn't own its input stream anymore.

nox · 2017-03-04T11:48:57Z

Closing.

nox closed this as completed Mar 4, 2017
