
Lazy parsing mode #201

Closed
RReverser opened this issue Mar 15, 2016 · 18 comments

@RReverser

We'd like to use html5ever's tokenizer for streaming transformation of HTML (that is, replacing specific attributes on specific tags only, e.g. src on img, src on script, or href on link). For that, it would be useful to have a lazy mode that would not force the parser to parse attributes and content we don't care about and then force us to serialize them back properly. This would 1) preserve formatting as close to the original as possible and 2) improve performance by avoiding a lot of unnecessary work.

Would that be possible to implement? I'm pretty new to Rust, but I'd be happy to take a stab at this myself if given some basic pointers and directions on where to look / what might need to be changed.

Thanks.

P.S. Theoretically, Servo could benefit from lazy parsing too, if decoding attributes happened only on first access: it could safely ignore many attributes during rendering and invoke the lazy decoder only when they are actually accessed, e.g. via the DOM API (whether they are data-* attributes or any others that the renderer doesn't care about and only JS code might read).

@SimonSapin
Member

Sorry, I don’t really understand. If you’re already only using the tokenizer and not the full parser, what does it mean to "not parse" or "not decode" attributes?

Regarding formatting, whitespace between tags is already preserved as text tokens/nodes. But whitespace between attributes is not preserved, and doing so sounds like more data to carry around and more work, not less.

@RReverser
Author

Sorry for the vague description. I mean a process in which tokenization would still happen (that is, the types and boundaries of each token would be identified and the corresponding callbacks invoked), but HTML entities, quotes etc. in attribute values and text nodes wouldn't be decoded automatically; instead they would be preserved as-is, as views into the original chunks, until the API user invokes the decoding manually.

Hopefully that makes a bit more sense now, but let me know if not and I can try to describe it more precisely or provide examples :)

@SimonSapin
Member

I’m somewhat skeptical that the perf impact would be significant. The tokenizer has to go through every byte of the value anyway in order to find where it ends, and I think that not having any character reference in a given value is the common case (though admittedly I have zero data on this).

@RReverser
Author

I thought that this way it wouldn't even need to allocate memory chunks for the decoded data, but could reuse the same chunks from the input? (Although I have zero evidence either; that's just how we did it with our in-house Ragel-based HTML4 parser.)

@asajeffrey
Member

@SimonSapin and I had a chat on IRC about this: http://logs.glob.uno/?c=mozilla%23servo#c385665

We realized we had different readings of what you were suggesting, @RReverser. My reading is that you are proposing making it possible to replace the Vec<Attribute> in a Tag with a Tendril, the idea being that in cases where you only care about one attribute (e.g. the src attribute of an img) you could do an O(n) search through the tendril for it, rather than an O(k) search through the vector (where n is the length of the tendril and k is the number of key/value pairs).

The tradeoff here is that creating the Vec<Attribute> takes O(k) space, while skipping it means searching for a single attribute takes O(n) rather than O(k) time.

@RReverser: did I accurately capture your suggestion?

@RReverser
Author

@asajeffrey Yes, pretty much. The only difference is in

you could do an O(n) search through the tendril for it, rather than an O(k) search through the vector (where n is the length of the tendril, and k is the number of key/value pairs)

Ideally I wanted to avoid that tradeoff by not searching through the tendril manually at all. As I said, I want to do streaming transformation, and as @SimonSapin pointed out, the tokenizer needs to go through those bytes anyway. So ideally I'd just have an extra hook where attribute names plus raw undecoded tendrils would be passed (even though the official spec doesn't recognize them as real "tokens"), and the consumer could decide what it wants to do with them. Same for text nodes, except those are already passed in decoded form; I'm just asking to pass a tendril into the raw buffer instead of decoding automatically (as far as I understand, that's the case where tendril really shines, since it doesn't need to create any copies at all).

In that case:

  • consumers that absolutely need all attributes and text nodes pre-decoded (e.g. for debugging purposes) can reproduce the old behavior by creating the Vec<Attribute> and decoding and collecting attributes in the handler
  • consumers such as mine (a streaming transformer) can return immediately from the handler if they're not interested in the current attribute name on the current tag, so handling such attributes becomes essentially a no-op, and we do care about such optimizations at scale
  • just for the sake of more examples: other consumers could collect names + raw tendrils into something like Vec<RawAttribute> and provide a getAttribute API that allocates extra memory and decodes actual values on demand (although, as I said, this one is more of a thought experiment and I don't have numbers to prove it would be more efficient; I'm mainly interested in the previous one)

TL;DR: I'm interested in adding a hook for attributes before (instead of) decoding them and adding them to the tag object, so that any values that need to be modified can be modified right away and the rest passed through as-is. This avoids both the extra operations and allocations, and the tradeoff you mentioned (searching through the entire string), at the same time.

@asajeffrey
Member

Oh, I see. The idea is that you could have a type Tag<A>, where A is the type of the attribute store, and it would be created by having A implement FromIterator<Attribute>. For example, the current implementation would be Tag<Vec<Attribute>>, but you could also have Tag<HashMap<QualName, StrTendril>>, or Tag<MyCustomAttributes> which could do filtering etc.

This sounds to me like the trade-off of a more flexible, polymorphic API vs a simpler monomorphic API.
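To make the polymorphic-API idea concrete, here is a minimal sketch. The types are simplified stand-ins, not html5ever's actual API: the real Attribute carries a QualName and StrTendril, and make_tag / AttrMap are hypothetical names invented for this example.

```rust
use std::collections::HashMap;

// Simplified stand-in for html5ever's Attribute (the real one uses
// QualName and StrTendril; plain Strings keep the sketch self-contained).
#[derive(Clone, Debug)]
pub struct Attribute {
    pub name: String,
    pub value: String,
}

// A tag parameterized over its attribute store, as suggested above.
pub struct Tag<A> {
    pub name: String,
    pub attrs: A,
}

// The tokenizer would collect finished attributes into whatever store
// the consumer chose, via FromIterator.
pub fn make_tag<A: FromIterator<Attribute>>(
    name: &str,
    attrs: impl IntoIterator<Item = Attribute>,
) -> Tag<A> {
    Tag { name: name.to_string(), attrs: attrs.into_iter().collect() }
}

// A map-backed store with O(1) lookup by attribute name.
pub struct AttrMap(pub HashMap<String, String>);

impl FromIterator<Attribute> for AttrMap {
    fn from_iter<I: IntoIterator<Item = Attribute>>(iter: I) -> Self {
        AttrMap(iter.into_iter().map(|a| (a.name, a.value)).collect())
    }
}

fn main() {
    let raw = vec![
        Attribute { name: "src".into(), value: "a.png".into() },
        Attribute { name: "alt".into(), value: "logo".into() },
    ];

    // Current behavior: collect into Tag<Vec<Attribute>>.
    let vec_tag: Tag<Vec<Attribute>> = make_tag("img", raw.clone());
    assert_eq!(vec_tag.attrs.len(), 2);

    // Alternative store: Tag<AttrMap> with lookup by name.
    let map_tag: Tag<AttrMap> = make_tag("img", raw);
    assert_eq!(map_tag.attrs.0.get("src").map(String::as_str), Some("a.png"));
}
```

The same make_tag call site serves both stores; only the type annotation changes, which is the flexibility being weighed against the simpler monomorphic API.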

@RReverser
Author

@asajeffrey As I said, I totally understand that this isn't on your roadmap and you likely have more important things to do. I'm just trying to replace the custom in-house Ragel-based lexer we use with an HTML5-compliant, browser-grade parser while preserving performance as closely as possible, so I really need to be as lazy about parsing content as possible, since we modify very few things. Initially I looked into libhubbub, but then decided that html5ever might be a more interesting option since it's actively and publicly developed and, well, has all the benefits Rust brings. Btw, I saw you initially used libhubbub too, but then switched to writing your own parser, and wondered whether there were any specific issues you encountered; but that's no longer important anyway.

So what I'm asking here is at least

some basic pointers and directions on where to look / what might need to be changed

so that I could try and play with it myself (I'm digging into the codebase & Rust, just thought it might be faster to consult here first).

@asajeffrey
Member

Sorry, the tone of my last sentence was a bit off; it read as more discouraging than I intended!

The way I'd go about the change you're looking for is:

  • Replace Tag by something like TagWith<A>, where A is the type of the attribute store.
  • Define a synonym for backward compatibility type Tag = TagWith<Vec<Attribute>>.
  • Generalize the tokenization algorithm so it works for any A: FromIterator<Attribute>.
  • Implement FromIterator<Attribute> for your own custom attribute store, e.g. one which discards everything but src for an img.

This will deal with the case where you are discarding attributes you're not interested in, which would be a start. Keeping other attributes in their raw form would be trickier, but should be doable, e.g. by having A implement FromIterator<(Tendril, Attribute, Tendril)>, giving you access to the raw text before and after each attribute.

@SimonSapin may have different ideas.
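The last step in the list above (a custom store that discards everything but src) could look roughly like this. This is a sketch with simplified stand-in types; SrcOnly is a hypothetical name and the real Attribute uses QualName and StrTendril.

```rust
// Simplified stand-in for html5ever's Attribute.
#[derive(Clone, Debug)]
pub struct Attribute {
    pub name: String,
    pub value: String,
}

// A custom store that keeps only `src`, per the example in the steps
// above. Once the tokenizer is generic over A: FromIterator<Attribute>,
// it could collect straight into this, never building a Vec.
#[derive(Debug, Default, PartialEq)]
pub struct SrcOnly(pub Option<String>);

impl FromIterator<Attribute> for SrcOnly {
    fn from_iter<I: IntoIterator<Item = Attribute>>(iter: I) -> Self {
        SrcOnly(iter.into_iter().find(|a| a.name == "src").map(|a| a.value))
    }
}

fn main() {
    let attrs = vec![
        Attribute { name: "alt".into(), value: "logo".into() },
        Attribute { name: "src".into(), value: "a.png".into() },
        Attribute { name: "class".into(), value: "hero".into() },
    ];
    // Only `src` survives; `alt` and `class` are never stored.
    let store: SrcOnly = attrs.into_iter().collect();
    assert_eq!(store, SrcOnly(Some("a.png".into())));
}
```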

@RReverser
Author

@asajeffrey Thanks! I'm wondering what the easiest way is to extract the actual entity decoding into its own method; currently it seems to happen as tokenization goes, distributed across the code (or maybe I'm missing something obvious), so I have no control over when it should be enabled and when skipped entirely for discarded attributes. Any ideas on that part?

As for TagWith<A>: I'm curious whether that's really necessary, since, as you said, such an API might require somewhat more complicated changes. Instead, I could probably just split Tag into separate tokens (something like TagStart { kind, name } + N * RawAttribute { name, raw_tendril } + TagEnd { self_closing }) that I can handle separately or collect into a proper Tag depending on the options the consumer passed. Wouldn't this be slightly easier?

Thank you for helping out!
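The split-token proposal above could be sketched like this. The names (TagToken, rewrite_src) and the String raw value are hypothetical stand-ins; in html5ever the raw value would be an undecoded StrTendril view into the input.

```rust
// Sketch of the split-token idea: instead of one Tag token carrying a
// Vec<Attribute>, the tokenizer emits smaller tokens the consumer can
// handle or ignore individually.
#[derive(Debug, Clone, PartialEq)]
pub enum TagToken {
    TagStart { name: String },
    // `raw_value` is the undecoded input slice, entities and all.
    RawAttribute { name: String, raw_value: String },
    TagEnd { self_closing: bool },
}

// A streaming transformer rewrites only the attributes it cares about
// and passes everything else through untouched and undecoded.
pub fn rewrite_src(tokens: Vec<TagToken>, new_src: &str) -> Vec<TagToken> {
    tokens
        .into_iter()
        .map(|tok| match tok {
            TagToken::RawAttribute { name, .. } if name == "src" => {
                TagToken::RawAttribute { name, raw_value: new_src.to_string() }
            }
            other => other,
        })
        .collect()
}

fn main() {
    let tokens = vec![
        TagToken::TagStart { name: "img".into() },
        TagToken::RawAttribute { name: "src".into(), raw_value: "a.png".into() },
        TagToken::RawAttribute { name: "alt".into(), raw_value: "a&amp;b".into() },
        TagToken::TagEnd { self_closing: true },
    ];
    let out = rewrite_src(tokens, "cdn/a.png");
    // Only `src` changed; the undecoded `alt` value passes through as-is.
    assert_eq!(
        out[1],
        TagToken::RawAttribute { name: "src".into(), raw_value: "cdn/a.png".into() }
    );
    assert_eq!(
        out[2],
        TagToken::RawAttribute { name: "alt".into(), raw_value: "a&amp;b".into() }
    );
}
```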

@asajeffrey
Member

@RReverser I think the bits of code that would need to be changed are all in tokenizer/mod.rs, in particular emit_current_tag and finish_attribute. There are various places where Vec<Attribute> and vec!() would need to be replaced by something like A and A::new().

As for whether you need to expose TagWith<A>, you could try an approach of implementing TagWith<A> and then providing some typedefs type Tag = TagWith<Vec<Attribute>> and type MyCleverTagWithCallbacks = TagWith<MyCleverAttributeContainer>, and put the callback logic into MyCleverAttributeContainer.

Rust APIs often don't use dynamically dispatched callbacks (since they come with a runtime overhead), preferring parameterized types and trait implementations (since those can be statically dispatched). YMMV.
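To illustrate the static-dispatch point: instead of a boxed callback, the per-attribute logic can live behind a trait the consumer implements. All names here (AttributeSink, CollectSrc, feed) are hypothetical illustrations, not html5ever's actual API.

```rust
// The per-attribute "callback" as a trait, resolved at compile time.
pub trait AttributeSink {
    fn attribute(&mut self, name: &str, raw_value: &str);
}

// A sink that only records `src`; everything else is ignored at
// essentially zero cost.
#[derive(Default)]
pub struct CollectSrc {
    pub src: Option<String>,
}

impl AttributeSink for CollectSrc {
    fn attribute(&mut self, name: &str, raw_value: &str) {
        if name == "src" {
            self.src = Some(raw_value.to_string());
        }
    }
}

// Generic over the sink: each concrete sink type gets its own
// monomorphized copy, so calls can be inlined (no vtable lookup).
pub fn feed<S: AttributeSink>(sink: &mut S, attrs: &[(&str, &str)]) {
    for &(name, value) in attrs {
        sink.attribute(name, value);
    }
}

fn main() {
    let mut sink = CollectSrc::default();
    feed(&mut sink, &[("alt", "logo"), ("src", "a.png")]);
    assert_eq!(sink.src.as_deref(), Some("a.png"));
}
```

The equivalent dynamic version would take `&mut dyn AttributeSink`; the generic form trades a little compile time and code size for dispatch-free calls.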

@RReverser
Author

I see, thanks. I'll try to play with it; I'm closing this issue for now, but hopefully you won't mind if I have other specific questions.

@asajeffrey
Member

Cool, good luck!

As a technical point, I realized that where I said FromIterator<A> above, I probably meant Extend<A>, since that allows you to add new elements to a collection without gathering them all up into an iterator beforehand.
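The distinction matters for a tokenizer because attributes are finished one at a time (e.g. as finish_attribute fires), so the store should grow incrementally rather than be built from a complete iterator up front. A minimal sketch, with illustrative names and simplified types:

```rust
// A store that implements Extend, so the tokenizer can push attributes
// into it one at a time instead of collecting them all first.
#[derive(Debug, Default)]
pub struct AttrStore {
    pub attrs: Vec<(String, String)>,
}

impl Extend<(String, String)> for AttrStore {
    fn extend<I: IntoIterator<Item = (String, String)>>(&mut self, iter: I) {
        self.attrs.extend(iter);
    }
}

fn main() {
    let mut store = AttrStore::default();
    // The tokenizer pushes each attribute as soon as it is finished:
    store.extend(std::iter::once(("src".to_string(), "a.png".to_string())));
    store.extend(std::iter::once(("alt".to_string(), "logo".to_string())));
    assert_eq!(store.attrs.len(), 2);
    assert_eq!(store.attrs[0].0, "src");
}
```

With FromIterator the whole attribute sequence would have to exist as an iterator before the store is created; Extend lets an existing store absorb items as they arrive, which matches the streaming shape of emit_current_tag / finish_attribute.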

@RReverser
Author

I see there is a scripts/bench-branches.py that is supposedly used to run benchmarks, but I can't figure out how to run it: it needs to be executed from a build folder, but neither of the only two folders with that name I found (target/{debug,release}/build) seems to be the required one.

@SimonSapin
Member

Based on its use of ../configure and make, it looks like that script was written before html5ever used Cargo. I think this script can safely be removed since it would need to be rewritten completely, and comparing branches is not that useful since no branch other than master is being actively developed.

The equivalent of make bench would now be cargo bench… but that’s broken at the moment since it uses unstable features and we don’t have CI for it.

@RReverser
Author

I see, thanks. I tried running the benchmarks via Cargo with unstable Rust installed, but that doesn't work either, so I gave up on it.

@SimonSapin
Member

#202 makes cargo bench build and run again, but I haven’t looked at interpreting results.

@Ygg01
Contributor

Ygg01 commented Mar 22, 2016

since no branch other than master is being actively developed.

That wasn't always the case; there was an SSE instruction branch.
