Lazy parsing mode #201
Sorry, I don’t really understand. If you’re already only using the tokenizer and not the full parser, what does it mean to "not parse" or "not decode" attributes? Regarding formatting, whitespace between tags is already preserved as text tokens/nodes. But whitespace between attributes is not preserved, and doing so sounds like more data to carry around and more work, not less.
Sorry for the vague description. I mean a process in which tokenization would happen (that is, the types and boundaries of each token would be identified and the corresponding callbacks invoked), but HTML entities, quotes, etc. in attribute values and text nodes wouldn't be decoded automatically; rather, they would be preserved as-is as views into the original chunks until the API user invokes the decoding of values manually. Hopefully it makes a bit more sense now, but let me know if not and I can try to describe it more precisely or provide examples :)
I’m somewhat skeptical that the perf impact would be significant. The tokenizer has to go through every byte of the value anyway in order to find where it ends, and I think that not having any character reference in a given value is the common case (though admittedly I have zero data on this).
I thought this way it wouldn't even need to allocate memory chunks for the decoded data, but could reuse the same chunks from the input? (Although I have zero evidence data as well; it's just how we did it with our in-house Ragel-based HTML4 parser.)
@SimonSapin and I had a chat on IRC about this: http://logs.glob.uno/?c=mozilla%23servo#c385665 We realized we had different reads of what you were suggesting, @RReverser; my read is that you are proposing making it possible to replace the … The tradeoff here is that creating the … @RReverser: did I accurately capture your suggestion?
@asajeffrey Yes, pretty much. The only difference is in
Ideally I wanted to avoid the tradeoff by not searching through the tendril manually at all. As I said, I want to do streaming transformation, and as @SimonSapin pointed out, the tokenizer needs to go through those bytes anyway, so ideally I just wanted to have an extra hook where attribute names + raw undecoded tendrils would be passed (even though the official spec doesn't recognize them as real "tokens"), and the consumer could decide what it wants to do. Same for text nodes, except those are already passed in a decoded form; I'm just asking to pass only a tendril into the raw buffer instead of decoding automatically (as far as I understand, that's the case where tendril really shines, as it doesn't need to create any copies at all). In that case:
TL;DR: I'm interested in adding a hook for attributes before (instead of) decoding and adding them to the tag object, so that all the needed values could be modified right away, and the others would be passed as-is. This should avoid both the extra operations & allocations and the trade-off you mentioned (searching through the entire string) at the same time.
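To make the proposal concrete, here's a rough sketch of what such a pre-decoding attribute hook might look like. All names here are hypothetical illustrations, not part of html5ever's actual API:

```rust
// Hypothetical sketch of the proposed hook; none of these names exist
// in html5ever. The tokenizer would hand each attribute to the sink as
// a raw, undecoded slice of the input, and the sink would decide
// whether decoding is worth doing at all.

/// Borrowed view into the original input chunk; no copy, no allocation.
pub struct RawAttr<'a> {
    pub name: &'a str,
    /// May still contain character references such as "&amp;".
    pub raw_value: &'a str,
}

pub trait LazySink {
    /// Called once per attribute, before any entity decoding.
    /// Return `true` to have the value decoded eagerly,
    /// `false` to keep the raw slice and skip the work entirely.
    fn raw_attribute(&mut self, tag_name: &str, attr: &RawAttr) -> bool;
}

/// Example sink for the use case in this issue: only URL-carrying
/// attributes on a few tags need decoded values.
pub struct UrlRewriter;

impl LazySink for UrlRewriter {
    fn raw_attribute(&mut self, tag_name: &str, attr: &RawAttr) -> bool {
        matches!(
            (tag_name, attr.name),
            ("img", "src") | ("script", "src") | ("link", "href")
        )
    }
}
```

A sink like this would let the tokenizer skip both the decoding work and the allocation for every attribute the consumer doesn't care about.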
Oh, I see: the idea is you could have a type … This sounds to me like the trade-off of a more flexible, polymorphic API vs a simpler, monomorphic API.
@asajeffrey As I said, I totally understand that this isn't on your roadmap and you likely have more important things to do; it's just that I'm trying to replace the custom in-house Ragel-based lexer we use with an HTML5-compliant, browser-grade parser while preserving performance as closely as possible, so I really need to be as lazy about parsing contents as possible, as we modify very few things. Initially I looked into libhubbub, but then decided that html5ever might be a more interesting option, since it's actively and publicly developed and, well, has all the benefits Rust gives. Btw, I saw you initially used libhubbub too, but then switched to writing your own parser, and I wondered if there were any specific issues you encountered, but that's no longer important anyway. So what I'm asking here is at least …
so that I could try and play with it myself (I'm digging into the codebase & Rust, just thought it might be faster to consult here first).
Sorry, the tone of my last sentence was a bit off; it read more discouragingly than I intended! The way I'd go about the change you're looking for is:
This will deal with the case where you are discarding attributes you're not interested in, which would be a start. Keeping other attributes in their raw form would be trickier, but should be doable, e.g. by having … @SimonSapin may have different ideas.
@asajeffrey Thanks! I'm wondering what the easiest way is to extract the actual entity decoding into its own method; currently it seems to happen as tokenization goes and is distributed across the code (or maybe I'm missing something obvious), so I have no control over when it should be enabled and when it should be skipped entirely for discarded attributes. Any ideas on that part? As for … Thank you for helping out!
@RReverser I think the bits of code that would need to be changed are all in tokenizer/mod.rs, in particular … As for whether you need to expose … Rust APIs often don't use dynamically dispatched callbacks (since they come with a runtime overhead), instead preferring parameterized types and trait implementations (since those can be statically dispatched). YMMV.
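For readers unfamiliar with that last point, here's a minimal illustration of the two dispatch styles using stand-in names (these are not html5ever's types):

```rust
// Illustration of the dispatch trade-off; stand-in types only.

trait Sink {
    fn on_text(&mut self, text: &str);
}

// Static dispatch: the tokenizer is monomorphized for each concrete
// sink type, so calls to `on_text` can be inlined away entirely.
struct Tokenizer<S: Sink> {
    sink: S,
}

impl<S: Sink> Tokenizer<S> {
    fn feed(&mut self, input: &str) {
        self.sink.on_text(input); // direct call, no vtable
    }
}

// Dynamic dispatch: a single tokenizer type works with any sink,
// but every callback goes through a vtable pointer.
struct DynTokenizer {
    sink: Box<dyn Sink>,
}

impl DynTokenizer {
    fn feed(&mut self, input: &str) {
        self.sink.on_text(input); // indirect call through the vtable
    }
}

// A trivial sink that just collects the text it is given.
struct Collector(String);

impl Sink for Collector {
    fn on_text(&mut self, text: &str) {
        self.0.push_str(text);
    }
}
```

The static form costs one tokenizer instantiation per sink type in compile time and code size; the dynamic form keeps one instantiation but pays an indirect call per callback.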
I see, thanks. I'll try to play with it and closing this issue for now, but hopefully you won't mind if I'll have any other specific questions. |
Cool, good luck! As a technical point, I realized that where I said …
I see there is |
Based on its use of … The equivalent of …
I see, thanks. I tried running the bench via Cargo by installing unstable Rust, but that doesn't work either, so I gave up on it.
#202 makes |
That wasn't always the case; there was an SSE instruction branch.
We're willing to use html5ever's tokenizer for streaming transformation of HTML (that is, replacing specific attributes on specific tags only, e.g. `src` in `img`, `src` in `script`, or `href` in `link`). For that, it would be useful to have a lazy mode that would not force the parser to parse attributes and content that we don't care about and then force us to serialize them back properly, thus 1) preserving formatting as closely to the original as possible and 2) improving performance by avoiding lots of unnecessary work.

Would that be possible to implement? I'm pretty new to Rust, but I would be happy to take an attempt at doing this myself if given some basic pointers and directions on where to look / what might need to be changed.
Thanks.
P.S. Theoretically, Servo could benefit from lazy parsing too, if decoding attributes happened only on first access: it could safely ignore many attributes for rendering and call such a lazy decoder only when actually accessing them, e.g. via the DOM API (whether they are `data-*` attributes or any others that the renderer doesn't care about and only JS code might access).
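That decode-on-first-access idea could look roughly like the following sketch. This is an illustration only; `decode_entities` here is a minimal stand-in handling a single entity, not a real html5ever function:

```rust
// Sketch of lazy, cached attribute decoding. `decode_entities` is a
// stand-in that only handles "&amp;" to keep the example small.
use std::borrow::Cow;

fn decode_entities(raw: &str) -> Cow<'_, str> {
    if raw.contains("&amp;") {
        Cow::Owned(raw.replace("&amp;", "&"))
    } else {
        // Common case: no character references, so no allocation at all.
        Cow::Borrowed(raw)
    }
}

enum AttrValue<'a> {
    /// Borrowed view into the original input; not yet decoded.
    Raw(&'a str),
    /// Decoded and cached after the first access.
    Decoded(String),
}

impl<'a> AttrValue<'a> {
    /// Decode on first access, then serve the cached result.
    fn get(&mut self) -> &str {
        if let AttrValue::Raw(raw) = *self {
            *self = AttrValue::Decoded(decode_entities(raw).into_owned());
        }
        match self {
            AttrValue::Decoded(s) => s,
            AttrValue::Raw(_) => unreachable!(),
        }
    }
}
```

Attributes that are never touched stay as borrowed slices of the input and cost nothing beyond the tokenizer's existing scan.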