Implements zero-copy HTML parsing #60

Closed

@cgaebel
Contributor
cgaebel commented Nov 25, 2014

Instead of using strings, html5ever now uses Iobufs and Spans over
Iobufs to represent the raw html data. This allows us to do all of
the parsing without copying the HTML into a tree of strings: it can
be Spans and Iobufs all the way down.

There were several performance hacks done to get this faster. Most
of them were to work around rustc failures as chronicled in:

http://discuss.rust-lang.org/t/the-sad-state-of-zero-on-drop/944

Here's a little list of transformations done for performance reasons:

  • State machine states are now functions. Rustc isn't smart enough
    to properly handle large-ish things on the stack in different
    match arms. This does mean that it's sane to inline the jump
    table into feed, which had a nice impact on performance. Jump
    tables inside loops are especially efficient because it's just
    a bigger jump table!
  • Things which get atomized anyhow (except for doctypes, which
    weren't hot enough for me to bother changing) use the old String
    parsing method, since it ends up being a lot faster for small
    strings and doesn't cause O(tags) allocations, thanks to truncation.
  • A custom Option type called FastOption was added; it doesn't
    zero on take and can't be matched on, but still maintains safety.
  • UTF-8 decoding of the chars is avoided unless absolutely
    necessary. For most parsing, we just need the UTF-8 length,
    which is much easier to calculate (a branch, and a LUT on the
    first byte in the "slow" path).
  • Chars and Runs are parsed into a "shared" location every time,
    because rustc is really bad at generating code for types which
    Drop a lot in a loop. See the discuss post at the top.
  • A new temp_buf has been introduced, because it is no longer
    performant to just append random characters to spans. Consider
    a partially-consumed comment start: <-. If the next character
    is an a, the <- needs to be emitted. The second temporary
    buffer is used to handle cases like that.
  • Similar to above, but when parsing char refs: the '&' and '#'
    are saved in case of backout.
  • Dashes at the end of a comment -----> need to be saved and
    shuffled as we keep reading more dashes, so that we always
    emit the "right" ones to keep the span continuous. This
    required a little 2-element "queue": first_comment_end_dash
    and second_comment_end_dash.
  • Some of the tokenizer fields were reordered for cache
    efficiency.
  • Some inliner guidance was done in get_char and
    get_preprocessed_char, to keep fast paths fast.
  • clone_from to get data out of the input buffers is used
    where it makes sense, preventing a bunch of bad rustc codegen.
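The UTF-8 length trick mentioned above can be sketched as follows. This is my reading of the approach (an ASCII branch plus a small lookup table keyed on the first byte's high nibble), not the patch's actual code:

```rust
// Byte lengths of UTF-8 sequences, indexed by the first byte's high nibble.
// 0 marks continuation bytes, which are invalid as the start of a character.
const UTF8_LEN_LUT: [u8; 16] = [
    1, 1, 1, 1, 1, 1, 1, 1, // 0xxx: ASCII (also handled by the fast branch)
    0, 0, 0, 0,             // 10xx: continuation bytes
    2, 2,                   // 110x: 2-byte sequence
    3,                      // 1110: 3-byte sequence
    4,                      // 1111: 4-byte sequence
];

/// Length in bytes of the UTF-8 character starting at `first_byte`,
/// without decoding it to a scalar value.
fn utf8_char_len(first_byte: u8) -> usize {
    if first_byte < 0x80 {
        1 // the branch: ASCII fast path
    } else {
        UTF8_LEN_LUT[(first_byte >> 4) as usize] as usize // the "slow" path LUT
    }
}
```

This gives the span advancement distance without paying for full decoding on the hot path.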
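For illustration, a FastOption along the lines described in the list might look like the following in modern Rust. This is a hypothetical sketch built on MaybeUninit, not the patch's implementation; the point is that `take` only flips a flag instead of writing a "none" pattern over the payload:

```rust
use std::mem::MaybeUninit;

/// Hypothetical FastOption: presence is tracked by a flag, the payload is
/// left uninitialized when absent, and `take` never zeroes the storage.
pub struct FastOption<T> {
    full: bool,
    val: MaybeUninit<T>,
}

impl<T> FastOption<T> {
    pub fn new() -> Self {
        FastOption { full: false, val: MaybeUninit::uninit() }
    }

    /// Store a value, dropping any previous occupant.
    pub fn put(&mut self, t: T) {
        if self.full {
            unsafe { self.val.assume_init_drop(); }
        }
        self.val.write(t);
        self.full = true;
    }

    /// Move the value out, leaving the storage bytes untouched.
    pub fn take(&mut self) -> Option<T> {
        if self.full {
            self.full = false;
            Some(unsafe { self.val.assume_init_read() })
        } else {
            None
        }
    }
}

impl<T> Drop for FastOption<T> {
    fn drop(&mut self) {
        if self.full {
            unsafe { self.val.assume_init_drop(); }
        }
    }
}
```

The field can't be matched on from outside, so safety rests entirely on the `full` flag being kept accurate by `put`/`take`/`drop`.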

As a result of these optimizations (and zero-copy parsing in general):

zero-copy

test tokenize uncommitted/html5.html       ... bench: 124076195 ns/iter (+/- 9519897)
test tokenize uncommitted/lipsum-1M.html   ... bench:   1989708 ns/iter (+/- 405327)
test tokenize uncommitted/sina.com.cn.html ... bench:   7210262 ns/iter (+/- 1391972)
test tokenize uncommitted/strong.html      ... bench:  30002001 ns/iter (+/- 3375152)
test tokenize uncommitted/webapps.html     ... bench:  99264377 ns/iter (+/- 8138989)
test tokenize uncommitted/wikipedia.html   ... bench:   3841740 ns/iter (+/- 612645)

original

test tokenize uncommitted/html5.html       ... bench: 153991836 ns/iter (+/- 7196531)
test tokenize uncommitted/lipsum-1M.html   ... bench:   2393385 ns/iter (+/- 450953)
test tokenize uncommitted/sina.com.cn.html ... bench:   8837605 ns/iter (+/- 1238217)
test tokenize uncommitted/strong.html      ... bench:  44153393 ns/iter (+/- 5076161)
test tokenize uncommitted/webapps.html     ... bench: 136860951 ns/iter (+/- 8137049)
test tokenize uncommitted/wikipedia.html   ... bench:   4868854 ns/iter (+/- 797178)

SUMMARY

html5.html:       19%
sina.com.cn.html: 14%
strong.html:      47%
webapps.html:     27%
wikipedia.html:   21%
lipsum-1M.html:   17%

r? @kmcallister

cgaebel added some commits Nov 13, 2014
@cgaebel cgaebel Implements zero-copy html parsing.
Instead of using strings, html5ever now uses Iobufs and Spans over
Iobufs to represent the raw html data. This allows us to do all of
the parsing without copying the HTML into a tree of strings: it can
be Spans and Iobufs all the way down.

There were several performance hacks done to get this faster. Most
of them were to work around rustc failures as described in:

http://discuss.rust-lang.org/t/the-sad-state-of-zero-on-drop/944

Here's a little list of transformations done for performance reasons:

  - State machine states are now functions. Rustc isn't smart enough
    to properly handle large-ish things on the stack in different
    match arms. This does mean that it's sane to inline the jump
    table into `feed`, which had a nice impact on performance. Jump
    tables inside loops are especially efficient because it's just
    a bigger jump table!
  - Things which get atomized anyhow (except for doctypes, which
    weren't hot enough for me to bother changing) use the old String
    parsing method, since it ends up being a lot faster for small
    strings and doesn't cause O(tags) allocations, thanks to truncation.
  - A custom Option type called `FastOption` was added; it doesn't
    zero on `take` and can't be matched on, but still maintains safety.
  - Iobufs in the input_buffers Ringbuf are padded to 32 bytes, to
    allow indexing without a multiply. That was actually a hotspot
    that showed up in perf, which is a little scary.
  - UTF-8 decoding of the chars is avoided unless absolutely
    necessary. For most parsing, we just need the UTF-8 length,
    which is much easier to calculate (a branch, and a LUT on the
    first byte in the "slow" path).
  - Chars and Runs are parsed into a "shared" location every time,
    because rustc is really bad at generating code for types which
    Drop a lot in a loop. See the discuss post at the top.
  - A new temp_buf has been introduced, because it is no longer
    performant to just append random characters to spans. Consider
    a partially-consumed comment start: `<-`. If the next character
    is an `a`, the <- needs to be emitted. The second temporary
    buffer is used to handle cases like that.
  - Similar to above, but when parsing char refs: the '&' and '#'
    are saved in case of backout.
  - Dashes at the end of a comment `----->` need to be saved and
    shuffled as we keep reading more dashes, so that we always
    emit the "right" ones to keep the span continuous. This
    required a little 2-element "queue": `first_comment_end_dash`
    and `second_comment_end_dash`.
  - Some of the tokenizer fields were reordered for cache
    efficiency.
  - Some inliner guidance was done in `get_char` and
    `get_preprocessed_char`, to keep fast paths fast.
  - `clone_from` to get data out of the input buffers is used
    where it makes sense, preventing a bunch of bad rustc codegen.

As a result of these optimizations (and zero-copy parsing in general):

===

zero-copy

test tokenize uncommitted/html5.html       ... bench: 124076195 ns/iter (+/- 9519897)
test tokenize uncommitted/lipsum-1M.html   ... bench:   1989708 ns/iter (+/- 405327)
test tokenize uncommitted/sina.com.cn.html ... bench:   7210262 ns/iter (+/- 1391972)
test tokenize uncommitted/strong.html      ... bench:  30002001 ns/iter (+/- 3375152)
test tokenize uncommitted/webapps.html     ... bench:  99264377 ns/iter (+/- 8138989)
test tokenize uncommitted/wikipedia.html   ... bench:   3841740 ns/iter (+/- 612645)

original

test tokenize uncommitted/html5.html       ... bench: 153991836 ns/iter (+/- 7196531)
test tokenize uncommitted/lipsum-1M.html   ... bench:   2393385 ns/iter (+/- 450953)
test tokenize uncommitted/sina.com.cn.html ... bench:   8837605 ns/iter (+/- 1238217)
test tokenize uncommitted/strong.html      ... bench:  44153393 ns/iter (+/- 5076161)
test tokenize uncommitted/webapps.html     ... bench: 136860951 ns/iter (+/- 8137049)
test tokenize uncommitted/wikipedia.html   ... bench:   4868854 ns/iter (+/- 797178)

SUMMARY

html5.html:       19%
sina.com.cn.html: 14%
strong.html:      47%
webapps.html:     27%
wikipedia.html:   21%
lipsum-1M.html:   17%

====

r? @kmcallister
5bcc896
@cgaebel cgaebel fixed test on 32-bit platforms f4cb709
@kmcallister

Needs a license header.

Owner

done.

@kmcallister

Can you add some comments about the relationship between Span, BufSpan, and ROIobuf? Also I see that ROIobuf<'static> still appears in a number of places; can we have a synonym for that as well, and make its name fit with Span (possibly by renaming Span)?

Owner

Sure. I'll just add a type alias calling it Buf, and throw in a block comment explaining their relationships.

Owner

Done.

@kmcallister

Why allow dead code here?

Owner

pad is never accessed, so it counts as "dead code".

@kmcallister

Can you add a comment about what this padding accomplishes and why these particular sizes?

Owner

I just removed it. After more profiling it seems misguided.

@cgaebel
Contributor
cgaebel commented Mar 8, 2015

Optimization issues that I ran into that led to a bunch of the "performance tweaks" in this patch:

http://internals.rust-lang.org/t/the-sad-state-of-zero-on-drop/944
rust-lang/rust#20219
The giant state machine function had stuff on the stack (with a destructor) in each match arm, and I believe this is the root cause of the stack usage exploding. I think this is zero-on-drop confusing LLVM. To fix it, I put each state in its own function. This actually isn't totally a bad thing, because it makes stack traces a lot nicer to read when there are problems.
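As a toy illustration of that restructuring (hypothetical code, not from the patch): each state becomes its own method returning the next state, and the driver is just a dispatch loop, so each state's locals and destructors are confined to their own frame instead of one giant function's:

```rust
#[derive(Copy, Clone, PartialEq, Debug)]
enum State { Data, TagOpen, Done }

struct Tokenizer<'a> {
    input: std::str::Chars<'a>,
    tags: Vec<String>,
}

impl<'a> Tokenizer<'a> {
    /// One state, one function: its locals die when the function returns.
    fn step_data(&mut self) -> State {
        match self.input.next() {
            Some('<') => State::TagOpen,
            Some(_) => State::Data,
            None => State::Done,
        }
    }

    fn step_tag_open(&mut self) -> State {
        // Larger temporaries (with destructors) live only in this frame,
        // not in every arm of a single giant match.
        let mut name = String::new();
        loop {
            match self.input.next() {
                Some('>') => { self.tags.push(name); return State::Data; }
                Some(c) => name.push(c),
                None => return State::Done,
            }
        }
    }

    /// The dispatch loop: one jump table inlined into the driver.
    fn feed(&mut self) {
        let mut state = State::Data;
        while state != State::Done {
            state = match state {
                State::Data => self.step_data(),
                State::TagOpen => self.step_tag_open(),
                State::Done => State::Done,
            };
        }
    }
}
```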

@kmcallister
Contributor

@cgaebel: Did you investigate using Span just for the fast paths like pop_except_from? Keeping track of every SingleChar may be more trouble than it's worth.

@cgaebel
Contributor
cgaebel commented Mar 21, 2015

I did. If you break up runs of text on "non-hot" states, spans move out of their "empty or one" state and into the "many" state, which is much slower. It definitely made a huge difference, and this design was only found after I tried what you just said, because you're right -- keeping track of every SingleChar is hard.

@kmcallister
Contributor

How many buffers did those spans have on average? I'm thinking a small vector optimization could save us, or maybe finger trees.

@cgaebel
Contributor
cgaebel commented Mar 21, 2015

The vast majority are spans over 0 or 1 buffer. That optimization is already implemented. Making spans handle more than that inline would greatly increase the size of each span, and increase the amount of memory traffic on the stack even for simple and common cases.
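The "empty or one" representation under discussion can be sketched like this, with `Buf` as a stand-in for the real buffer type: the hot Empty/One cases stay inline and only multi-buffer spans pay for a heap-allocated vector.

```rust
/// Sketch of a span with the small cases stored inline.
enum Span<Buf> {
    Empty,
    One(Buf),
    Many(Vec<Buf>),
}

impl<Buf> Span<Buf> {
    fn new() -> Self { Span::Empty }

    /// Append a buffer, promoting Empty -> One -> Many as needed.
    fn push(&mut self, buf: Buf) {
        match std::mem::replace(self, Span::Empty) {
            Span::Empty => *self = Span::One(buf),
            Span::One(first) => *self = Span::Many(vec![first, buf]),
            Span::Many(mut v) => { v.push(buf); *self = Span::Many(v); }
        }
    }

    /// Number of buffers the span covers.
    fn len(&self) -> usize {
        match self {
            Span::Empty => 0,
            Span::One(_) => 1,
            Span::Many(v) => v.len(),
        }
    }
}
```

Growing the inline capacity past one buffer would enlarge every `Span` moved around the tokenizer, which is the stack-traffic concern raised above.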

@kmcallister kmcallister added a commit to kmcallister/html5ever that referenced this pull request Mar 23, 2015
@kmcallister kmcallister Implement zero-copy parsing
Based on #60 by cgaebel.
e2f1d18
@kmcallister kmcallister added a commit to kmcallister/html5ever that referenced this pull request Mar 23, 2015
@kmcallister kmcallister Implement zero-copy parsing
Based on #60 by cgaebel.
3942a80
@kmcallister kmcallister added a commit to kmcallister/html5ever that referenced this pull request Mar 23, 2015
@kmcallister kmcallister Implement zero-copy parsing
Based on #60 by cgaebel.
9ca1de0
@kmcallister kmcallister added a commit to kmcallister/html5ever that referenced this pull request Jun 10, 2015
@kmcallister kmcallister Implement zero-copy parsing
Based on #60 and #114.

Fixes #20.
Fixes #115.
7be620c
@kmcallister
Contributor

Now #141.

@kmcallister kmcallister added a commit that referenced this pull request Jun 16, 2015
@kmcallister @SimonSapin kmcallister + SimonSapin Implement zero-copy parsing
Based on #60 and #114.

Fixes #20.
Fixes #115.
238c03b
@kmcallister kmcallister added a commit that referenced this pull request Jun 25, 2015
@kmcallister @SimonSapin kmcallister + SimonSapin Implement zero-copy parsing
Based on #60 and #114.

Fixes #20.
Fixes #115.
221bd2c