Avoid taking more than O(n) time even for malicious input #289
HTML parsers are not just used in client-side applications: they are also used on servers, such as in HTML sanitizers. html5ever (and xml5ever) should guarantee that they cannot be coerced into taking more than O(n), or at worst O(n log n), time. This may be difficult, especially if one does not want to use massively connected data structures.

Right now, it seems that the worst offenders are likely to be calls to `Vec::remove` in various places, such as in the adoption agency algorithm.
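`Vec::remove` shifts every later element left by one, so removing near the front of a k-element vector costs O(k), and a crafted input that forces many such removals pushes the whole parse toward O(n²). As a hedged illustration (invented names, not html5ever's actual code), one way to keep vector storage while making each removal O(1) is tombstoning with amortized compaction; note the caveat that compaction invalidates outstanding indices, which a real implementation would have to remap:

```rust
/// Illustrative sketch only: a list supporting O(1) logical removal by
/// leaving a tombstone (`None`) in place, compacting lazily so total
/// work stays O(n) amortized over n operations.
struct TombstoneList<T> {
    slots: Vec<Option<T>>,
    live: usize,
}

impl<T> TombstoneList<T> {
    fn new() -> Self {
        TombstoneList { slots: Vec::new(), live: 0 }
    }

    /// O(1) push; returns the slot index as a handle.
    fn push(&mut self, value: T) -> usize {
        self.slots.push(Some(value));
        self.live += 1;
        self.slots.len() - 1
    }

    /// O(1) logical removal, unlike Vec::remove, which is O(len - idx).
    fn remove(&mut self, idx: usize) {
        if self.slots[idx].take().is_some() {
            self.live -= 1;
            // Sweep only once at least half the slots are dead, so each
            // O(n) compaction is paid for by the removals before it.
            // Caveat: this invalidates previously returned indices.
            if self.live * 2 < self.slots.len() {
                self.slots.retain(Option::is_some);
            }
        }
    }

    /// Iterate the live entries in insertion order.
    fn iter(&self) -> impl Iterator<Item = &T> {
        self.slots.iter().filter_map(Option::as_ref)
    }
}
```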
Excellent point. I think attribute duplication is probably the worst offender, especially in xml5ever, since it does attribute removal twice (first in the tokenizer and again after namespace resolution). Any ideas how other parsers limit the execution time to O(n) (or O(n log n))?
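For the duplicate-attribute case specifically, a single pass with a hash set keeps the work linear in the number of attributes. A minimal sketch, assuming a simplified attribute type (the real html5ever/xml5ever types differ):

```rust
use std::collections::HashSet;

/// Simplified stand-in for the parser's real attribute type.
struct Attribute {
    name: String,
    value: String,
}

/// Keep only the first occurrence of each attribute name, in O(k)
/// expected time for k attributes, instead of rescanning the tail of
/// the list for every attribute (O(k^2) on adversarial input).
fn dedup_attributes(attrs: &mut Vec<Attribute>) {
    let mut seen = HashSet::new();
    attrs.retain(|attr| seen.insert(attr.name.clone()));
}
```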
I know Gumbo does worse than html5ever; really deep nesting is enough to make it go quadratic.

The main problem seems to be the active formatting elements and unclosed elements vectors. I am not sure how to pull off guaranteed linear time. Maybe by caching, or by storing flags instead of searching the lists? Some steps seem to require that a linked list be used instead of a vector, so as to allow O(1) removal from the middle of the list. Others seem to require a hash set of every node, keyed by name, attributes, and values, that allows up to 3 duplicates! I think that using vectors is a losing battle here, which strongly favors using an arena for all allocations inside the parser.
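To make the linked-list and arena ideas concrete, here is a minimal sketch (invented names, not html5ever's design) of an index-based doubly linked list whose nodes all live in a single Vec, giving arena-style allocation plus O(1) unlinking from the middle via stable index handles:

```rust
/// One node of the list; `prev`/`next` are indices into the arena Vec.
struct Node<T> {
    value: T,
    prev: Option<usize>,
    next: Option<usize>,
}

struct ArenaList<T> {
    nodes: Vec<Node<T>>, // a real version would recycle freed slots
    head: Option<usize>,
    tail: Option<usize>,
}

impl<T> ArenaList<T> {
    fn new() -> Self {
        ArenaList { nodes: Vec::new(), head: None, tail: None }
    }

    /// O(1) push to the back; the returned index is a stable handle.
    fn push_back(&mut self, value: T) -> usize {
        let idx = self.nodes.len();
        self.nodes.push(Node { value, prev: self.tail, next: None });
        match self.tail {
            Some(t) => self.nodes[t].next = Some(idx),
            None => self.head = Some(idx),
        }
        self.tail = Some(idx);
        idx
    }

    /// O(1) unlink by handle; assumes `idx` is currently linked.
    fn unlink(&mut self, idx: usize) {
        let (prev, next) = (self.nodes[idx].prev, self.nodes[idx].next);
        match prev {
            Some(p) => self.nodes[p].next = next,
            None => self.head = next,
        }
        match next {
            Some(n) => self.nodes[n].prev = prev,
            None => self.tail = prev,
        }
        self.nodes[idx].prev = None;
        self.nodes[idx].next = None;
    }
}
```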
Another option is to deliberately impose an artificial cap on the length of those lists. Ugly, and in violation of the spec, but the only simple option I see.
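A sketch of what such a cap could look like; the constant, the function, and the drop-on-overflow recovery are all invented for illustration, and as noted above this knowingly violates the spec:

```rust
/// Invented constant; any real cap would need tuning and discussion.
const MAX_OPEN_ELEMENTS: usize = 512;

/// Push onto the stack of open elements unless the cap is hit, in which
/// case the new element is silently not tracked — a deliberate, lossy
/// spec violation that bounds the cost of later scans of this list.
fn push_open_element<T>(open_elements: &mut Vec<T>, node: T) -> bool {
    if open_elements.len() >= MAX_OPEN_ELEMENTS {
        return false; // caller could record a parse error here
    }
    open_elements.push(node);
    true
}
```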
I’ve changed the title because "guarantee" is not gonna happen unless we have static analysis that can tell if we’re doing something wrong, and that sounds like a hard research problem. Do you know specific inputs where html5ever currently behaves badly?
@SimonSapin A long string of opening tags does it:

```sh
perl -e 'my $x = 20000; print("<i>"x$x . "<a>"x$x . "</a>"x$x)' | target/release/examples/html2html
```
So does:

```sh
perl -We 'use strict; my $q = 40000; print "<i>"x$q, "<div>"x$q; ' | target/release/examples/arena
```

Time is spent in […] according to […].