Avoid taking more than O(n) time even for malicious input #289
HTML parsers are not just used in client-side applications: they are also used on servers, such as in HTML sanitizers. html5ever (and xml5ever) should guarantee that they cannot be coerced into taking more than O(n), or at worst O(n log n), time. This may be difficult, especially if one does not want to use massively connected data structures.

Right now, it seems that the worst offenders are likely to be calls to `Vec::remove` in various places, such as in the adoption agency algorithm.
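`Vec::remove` shifts every later element left by one, so removing near the front of a k-element vector costs O(k), and a crafted input that forces many such removals pushes the whole parse toward O(n²). As a hedged illustration (invented names, not html5ever's actual code), one way to keep vector storage while making each removal O(1) is tombstoning with amortized compaction; note the caveat that compaction invalidates outstanding indices, which a real implementation would have to remap:

```rust
/// Illustrative sketch only: a list supporting O(1) logical removal by
/// leaving a tombstone (`None`) in place, compacting lazily so total
/// work stays O(n) amortized over n operations.
struct TombstoneList<T> {
    slots: Vec<Option<T>>,
    live: usize,
}

impl<T> TombstoneList<T> {
    fn new() -> Self {
        TombstoneList { slots: Vec::new(), live: 0 }
    }

    /// O(1) push; returns the slot index as a handle.
    fn push(&mut self, value: T) -> usize {
        self.slots.push(Some(value));
        self.live += 1;
        self.slots.len() - 1
    }

    /// O(1) logical removal, unlike Vec::remove, which is O(len - idx).
    fn remove(&mut self, idx: usize) {
        if self.slots[idx].take().is_some() {
            self.live -= 1;
            // Sweep only once at least half the slots are dead, so each
            // O(n) compaction is paid for by the removals before it.
            // Caveat: this invalidates previously returned indices.
            if self.live * 2 < self.slots.len() {
                self.slots.retain(Option::is_some);
            }
        }
    }

    /// Iterate the live entries in insertion order.
    fn iter(&self) -> impl Iterator<Item = &T> {
        self.slots.iter().filter_map(Option::as_ref)
    }
}
```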
Excellent point. I think attribute duplication is probably the worst offender, especially in xml5ever, since it does attribute removal twice (first in the tokenizer and again after namespace resolution). Any ideas how other parsers limit the execution time to O(n) (or O(n log n))?
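For the duplicate-attribute case specifically, a single pass with a hash set keeps the work linear in the number of attributes. A minimal sketch, assuming a simplified attribute type (the real html5ever/xml5ever types differ):

```rust
use std::collections::HashSet;

/// Simplified stand-in for the parser's real attribute type.
struct Attribute {
    name: String,
    value: String,
}

/// Keep only the first occurrence of each attribute name, in O(k)
/// expected time for k attributes, instead of rescanning the tail of
/// the list for every attribute (O(k^2) on adversarial input).
fn dedup_attributes(attrs: &mut Vec<Attribute>) {
    let mut seen = HashSet::new();
    attrs.retain(|attr| seen.insert(attr.name.clone()));
}
```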
I know Gumbo does worse than html5ever; really deep nesting is enough to make it go quadratic.

The main problem seems to be the active formatting elements and unclosed elements vectors. I am not sure how to pull off guaranteed linear time. Maybe by caching, or by storing flags instead of searching the lists? Some steps seem to require that a linked list be used instead of a vector, so as to allow O(1) removal from the middle of the list. Others seem to require a hash set of every node, keyed by name, attributes, and values, that allows up to 3 duplicates! I think that using vectors is a losing battle here, which strongly favors using an arena for all allocations inside the parser.
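To make the linked-list and arena ideas concrete, here is a minimal sketch (invented names, not html5ever's design) of an index-based doubly linked list whose nodes all live in a single Vec, giving arena-style allocation plus O(1) unlinking from the middle via stable index handles:

```rust
/// One node of the list; `prev`/`next` are indices into the arena Vec.
struct Node<T> {
    value: T,
    prev: Option<usize>,
    next: Option<usize>,
}

struct ArenaList<T> {
    nodes: Vec<Node<T>>, // a real version would recycle freed slots
    head: Option<usize>,
    tail: Option<usize>,
}

impl<T> ArenaList<T> {
    fn new() -> Self {
        ArenaList { nodes: Vec::new(), head: None, tail: None }
    }

    /// O(1) push to the back; the returned index is a stable handle.
    fn push_back(&mut self, value: T) -> usize {
        let idx = self.nodes.len();
        self.nodes.push(Node { value, prev: self.tail, next: None });
        match self.tail {
            Some(t) => self.nodes[t].next = Some(idx),
            None => self.head = Some(idx),
        }
        self.tail = Some(idx);
        idx
    }

    /// O(1) unlink by handle; assumes `idx` is currently linked.
    fn unlink(&mut self, idx: usize) {
        let (prev, next) = (self.nodes[idx].prev, self.nodes[idx].next);
        match prev {
            Some(p) => self.nodes[p].next = next,
            None => self.head = next,
        }
        match next {
            Some(n) => self.nodes[n].prev = prev,
            None => self.tail = prev,
        }
        self.nodes[idx].prev = None;
        self.nodes[idx].next = None;
    }
}
```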
Another option is to deliberately impose an artificial cap on the length of those lists. Ugly, and in violation of the spec, but the only simple option I see.
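A sketch of what such a cap could look like; the constant, the function, and the drop-on-overflow recovery are all invented for illustration, and as noted above this knowingly violates the spec:

```rust
/// Invented constant; any real cap would need tuning and discussion.
const MAX_OPEN_ELEMENTS: usize = 512;

/// Push onto the stack of open elements unless the cap is hit, in which
/// case the new element is silently not tracked — a deliberate, lossy
/// spec violation that bounds the cost of later scans of this list.
fn push_open_element<T>(open_elements: &mut Vec<T>, node: T) -> bool {
    if open_elements.len() >= MAX_OPEN_ELEMENTS {
        return false; // caller could record a parse error here
    }
    open_elements.push(node);
    true
}
```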
I’ve changed the title because "guarantee" is not gonna happen unless we have static analysis that can tell if we’re doing something wrong, and that sounds like a hard research problem. Do you know specific inputs where html5ever currently behaves badly?
@SimonSapin A long string of opening tags does it:

```sh
perl -e 'my $x = 20000; print("<i>"x$x . "<a>"x$x . "</a>"x$x)' | target/release/examples/html2html
```
So does:

```sh
perl -We 'use strict; my $q = 40000; print "<i>"x$q, "<div>"x$q; ' | target/release/examples/arena
```

Time is spent in […] according to […].