Avoid taking more than O(n) time even for malicious input #289

Open
DemiMarie opened this issue Jul 30, 2017 · 6 comments

@DemiMarie

HTML parsers are not just used in client-side applications: they are also used on servers, such as in HTML sanitizers. html5ever (and xml5ever) should guarantee that they cannot be coerced into taking more than O(n), or at worst O(n log n), time. This may be difficult, especially if one does not want to use massively connected data structures.

Right now, it seems that the worst offenders are likely to be calls to Vec::remove in various places, such as in the adoption agency algorithm.
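
For illustration, a minimal Rust sketch (not html5ever's actual code) of why that pattern is quadratic: removing from the front of a Vec shifts every remaining element, so n removals cost O(n^2) in total, whereas popping from the back stays O(n).

    use std::time::Instant;

    // Not html5ever's code: a model of the Vec::remove pattern. Removing from
    // the front shifts every remaining element, so n removals are O(n^2) total.
    fn remove_front(n: usize) -> std::time::Duration {
        let mut v: Vec<usize> = (0..n).collect();
        let start = Instant::now();
        while !v.is_empty() {
            v.remove(0); // O(len) per call
        }
        start.elapsed()
    }

    // Popping from the back is O(1) per call, O(n) total.
    fn pop_back(n: usize) -> std::time::Duration {
        let mut v: Vec<usize> = (0..n).collect();
        let start = Instant::now();
        while v.pop().is_some() {}
        start.elapsed()
    }

    fn main() {
        for &n in &[10_000, 20_000, 40_000] {
            println!("n = {}: remove(0) {:?}, pop() {:?}", n, remove_front(n), pop_back(n));
        }
    }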

@Ygg01
Contributor

Ygg01 commented Jul 31, 2017

Excellent point. I think attribute duplication is probably the worst offender, especially in xml5ever, since it removes duplicate attributes twice (first in the tokenizer and again after namespace resolution).

Any ideas how other parsers limit the execution time to O(n) (or O(n log n))?
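
For the duplicate-attribute case above, a minimal sketch of what a single-pass O(n) removal could look like, assuming a simplified Attribute type rather than xml5ever's actual structures; a HashSet of seen names replaces the O(n^2) scan of each attribute against all previously kept ones.

    use std::collections::HashSet;

    // Simplified stand-in for an attribute; xml5ever's real type differs.
    #[derive(Debug)]
    struct Attribute {
        name: String,
        value: String,
    }

    // Keep the first occurrence of each attribute name in expected O(n) time.
    fn dedup_attributes(attrs: Vec<Attribute>) -> Vec<Attribute> {
        let mut seen = HashSet::new();
        attrs
            .into_iter()
            .filter(|a| seen.insert(a.name.clone()))
            .collect()
    }

    fn main() {
        let attrs = vec![
            Attribute { name: "id".into(), value: "a".into() },
            Attribute { name: "id".into(), value: "b".into() }, // duplicate, dropped
        ];
        println!("{:?}", dedup_attributes(attrs));
    }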

@DemiMarie
Author

DemiMarie commented Jul 31, 2017 via email

@DemiMarie
Author

DemiMarie commented Jul 31, 2017 via email

@SimonSapin SimonSapin changed the title Guarantee O(n) time even for malicious input Avoid taking more than O(n) time even for malicious input Jul 31, 2017
@SimonSapin
Member

I’ve changed the title because “guarantee” is not gonna happen unless we have static analysis that can tell if we’re doing something wrong, and that sounds like a hard research problem.

Do you know specific inputs where html5ever currently behaves badly?

@DemiMarie
Author

DemiMarie commented Jul 31, 2017

@SimonSapin A long string of opening <i> tags, followed by a bunch of nested links, causes quadratic time consumption in the adoption agency algorithm. You can reproduce this with

perl -e 'my $x = 20000; print("<i>"x$x . "<a>"x$x . "</a>"x$x)' |
target/release/examples/html2html

@DemiMarie
Author

So does

perl -We 'use strict; my $q = 40000; print "<i>"x$q, "<div>"x$q; ' |
 target/release/examples/arena

Time is spent in

<html5ever::tree_builder::TreeBuilder<Handle, Sink> as html5ever::tree_builder::actions::TreeBuilderActions<Handle>>::close_p_element_in_button_scope

according to perf.
