Error with greater and lesser than tags #83

eviltrout · 2013-09-16T16:45:50Z

This bug was initially discovered here: http://meta.discourse.org/t/after-lt-all-the-text-is-hidden/9807

A call to Sanitize.clean("1 > 2 and 2 < 1") produces 1 > 2 and 2.

Everything after the < symbol is chopped off. Sorry I'm not familiar with your code base but I'm wondering if there's anything I can do to help this get fixed?

The text was updated successfully, but these errors were encountered:

rgrove · 2013-09-16T17:08:24Z

This is probably the most-reported issue in Sanitize, but I don't think people realize what a complex problem it is. So far, I haven't been able to come up with a solution that wouldn't make Sanitize potentially unsafe.

The root of the issue is that "1 > 2 and 2 < 1" simply isn't syntactically valid HTML. In attempting to parse it, Nokogiri (which Sanitize uses under the hood) turns it into "1 > 2 and 2", as you can see here:

irb(main):008:0> require 'nokogiri'
=> false
irb(main):009:0> Nokogiri::HTML::DocumentFragment.parse('1 > 2 and 2 < 1')
=> #<Nokogiri::HTML::DocumentFragment:0x3fced9c20258 name="#document-fragment" children=[#<Nokogiri::XML::Text:0x3fced9c1fe70 "1 > 2 and 2 ">]>

This is technically correct behavior. As far as Nokogiri is concerned, "< 1" indicates an unclosed tag, which is invalid, so Nokogiri removes the unclosed tag and its contents (" 1").

So, could Sanitize pre-parse the input and identify "<" chars that are meant literally rather than as the beginning of an unclosed tag, and then escape them before passing the input on to Nokogiri?

Maybe. But so far I haven't been able to come up with a safe and effective way of doing this, and nobody else has either. If you have any clever ideas, I'm all ears!

kasperpeulen · 2013-09-16T19:08:43Z

"Maybe. But so far I haven't been able to come up with a safe and effective way of doing this, and nobody else has either. If you have any clever ideas, I'm all ears!"

I thought about this:
If "<" is followed by a letter, then it indicates an unclosed tag.
If "<" is followed by anything else (spacebar, number, symbol, etc.), then the "<" is surely not meant as an opening html tag, and the "<" should be interpreted literally I think.

Of course, this is not completely right. Sometimes a mathematician would like to write "x<a", but any issues could then be prevented by typing an extra spacebar, like "x < a".

I'm not a programmer, so I can't help you with scripting this if you think it is a good solution.

rgrove · 2013-09-16T19:25:02Z

Unfortunately, those rules are too ambiguous. The mathematical less-than operator isn't the only use case for "<". For example, it's often used in plain text to demarcate a URL:

Check out <http://example.com/>!

Some people use it in emoticons:

OMG HAPPY BIRTHDAY! *<:-D

And some people even use it when discussing literal HTML tags:

Use the <em> tag to indicate emphasis!

These are just a few examples. In my experience, when Sanitize is used to sanitize comments or forum posts in which users aren't explicitly conscious of writing HTML (such as when they're actually writing Markdown or some other format that gets converted to HTML), users tend to expect "<" to just work magically in any arbitrary situation.

kasperpeulen · 2013-09-16T21:12:31Z

Well, I undestand your point I think. To write a script where "<" works magically in arbitrary situation will be probably impossible. Especially your example of discussing literal HTML tags.

What about converting < to < when it is absolutely clear that < can't be intended to be the opening of an html tag. This may not be perfect, but I don't see how this would make Sanitize potentially unsafe. And I think it would be better then just always considering < as an unclosed tag that should be removed together with its content.

rgrove · 2013-09-16T21:37:05Z

What about converting < to < when it is absolutely clear that < can't be intended to be the opening of an html tag. This may not be perfect, but I don't see how this would make Sanitize potentially unsafe.

There are two basic problems with this:

What criteria will be used to determine when it's "absolutely clear" that < isn't the opening of an HTML tag?
Assuming we can come up with criteria in which we're fully confident, how do we apply that criteria without essentially implementing another HTML parser on top of Nokogiri?

Right now, Sanitize consists of a battle-tested parser (Nokogiri) and a set of simple transformations that get applied to the DOM that Nokogiri generates. Nokogiri can't be configured to apply custom invalid tag escaping criteria, so we would have to implement this before passing HTML to Nokogiri.

The last thing I want to do is implement another HTML parser on top of Nokogiri, because that means adding complexity, which means adding bugs. But what's the alternative? Regular expressions? What if the criteria (which have yet to be determined) can't be expressed using regular expressions? Or what if the regexes required are complex, and thus likely to introduce bugs? These are questions I can't answer right now because we lack the criteria needed to answer them, but they worry me.

One of the things that makes Sanitize safe is its simplicity. Right now there is one regex in Sanitize, and it's relatively easy to understand, and is used for a well-defined task (validating URI schemes). The rest of Sanitize's logic is implemented in the form of easy to follow DOM transformations. Sanitize doesn't try to do too much, and what it does do, it tries to do with as little complexity as possible, because complexity leads to bugs. I'd like to stick to this principle as much as possible.

So the ultimate question here is: is there a set of simple, unambiguous criteria that can be used to identify < characters that don't open an HTML tag?

If you can answer that, I want to hear from you. If not, I appreciate your input, but there's no way to move forward until we first have this criteria.

pda · 2013-09-16T22:58:55Z

Perhaps fail harder, rather than being more accommodating?

If I pass malformed HTML like <p>a < 1</p> I'd rather get an exception than <p>a </p>.

@kenichi made reference to using Nokogiri::DocumentFragment#errors. Maybe that could be used to detect this case, but raise an error rather than attempting to resolve the syntax error…?

rgrove · 2013-09-16T23:21:16Z

I don't like the idea of failing hard, given that Sanitize is used for parsing blog comments, form posts, this actual GitHub comment, etc. While it would prevent unexpected output, I fear application developers would route around the failures by doing something less safe and more forgiving.

Ideally, I'd like Sanitize to be safe and forgiving, erring on the side of safety when forgiveness isn't possible.

Nokogiri::DocumentFragment#errors can definitely be used to detect this case, but not much more than that. So it'd be useful for raising an error, as you say, but I don't think there's a reliable way to use it to actually solve the problem.

At this point I'm pretty sure we can completely rule out a regex-based solution, simply because there's no sane way to construct a regex that can reliably identify a non-tag < without falling into context traps like <script>if (a < b) { ... }</script>.

I've considered writing a SAX-based pre-parser, but Nokogiri::HTML::SAX::Parser is no more capable of handling this case than the Document parser Sanitize currently uses, so that's probably out of the question as well.

I plan to investigate https://github.com/rubys/nokogumbo, since HTML5's parsing rules are at least a little more lenient (they would allow "1 > 2 and 2 < 1", for example), but Gumbo and its Ruby wrappers are still very young and Google doesn't yet recommend using it with untrusted input.

Blaisorblade · 2013-09-17T02:17:31Z

In my experience, when Sanitize is used to sanitize comments or forum posts in which users aren't explicitly conscious of writing HTML (such as when they're actually writing Markdown or some other format that gets converted to HTML), users tend to expect "<" to just work magically in any arbitrary situation.

Well, isn't the HTML converter supposed to escape <? After all, it does work in Github comments even though Github uses sanitize. So in all those cases, it seems sanitize is being fed garbage and is doing the sane thing, that is, damage control on bugs of the HTML converter. (And I guess that's the case with Discourse, as I've written on the original thread).

At most, you want to add a FAQ entry explaining this to users - and given this:

This is probably the most-reported issue in Sanitize, but I don't think people realize what a complex problem it is.

I guess you should. To me, if you can't answer "RTFM (read the fine manual)", you have a bug.

Maybe you have other use cases than converters, such as users manually writing HTML — but is < supposed to just work there?

rgrove · 2013-09-17T02:39:54Z

I think I'll be happy if I can get Sanitize to the point of recovering from an invalid < at least as well as an HTML5-compliant parser. Currently, HTML5 parsers are more forgiving about many cases that Nokogiri finds hard to handle, including Discourse's case.

In the meantime, @Blaisorblade is probably right that there needs to be a FAQ.

mathiasbynens · 2013-09-17T07:49:06Z

If Nokogiri parses the document fragment 1 > 2 and 2 < 1 as 1 > 2 and 2 that’s a bug in Nokogiri. Browsers don’t match this behavior: data:text/html,1%20>%202%20and%202%20<%201

The only bulletproof solution is for Nokogiri to implement the HTML5 parsing algorithm, just like browsers do.

pda · 2013-09-17T18:48:58Z

If Nokogiri parses the document fragment 1 > 2 and 2 < 1 as 1 > 2 and 2 that’s a bug in Nokogiri.

Not necessarily. I don't think Nokogiri claims to support all of HTML5. I think accepting that fragment as “valid” HTML depends heavily on HTML5's error handling specification.

The only bulletproof solution is for Nokogiri to implement the HTML5 parsing algorithm, just like browsers do.

That's what @rgrove was getting at with his reference to https://github.com/rubys/nokogumbo#readme which backs onto https://github.com/google/gumbo-parser#readme for full HTML5 support.

But Gumbo lists under its “Non-goals”:

Security. Gumbo was initially designed for a product that worked with trusted input files only. We're working to harden this and make sure that it behaves as expected even on malicious input, but for now, Gumbo should only be run on trusted input or within a sandbox.

rgrove · 2013-09-19T00:23:27Z

I asked Gumbo's maintainer for clarification on Gumbo's security status. You can see his response at google/gumbo-parser#53.

This gist of it is that Gumbo is currently known to be susceptible to at least one buffer overflow bug, and may have other issues, but they hope to fix these.

Given this info, I'm definitely not comfortable using Gumbo in Sanitize yet, but I did create an experimental branch of Sanitize that uses Gumbo's parser via nokogumbo. It works very well, and I like its output a lot more than Nokogiri's. I won't switch until and unless Gumbo can safely handle untrusted input, but so far I like what I see.

rgrove · 2014-05-18T03:16:08Z

Good news everyone!

Google now considers Gumbo robust enough for use with untrusted input, and I'm steaming ahead on Sanitize 3.0.0, which uses Gumbo to parse HTML and thus resolves this issue. There's a fair amount of work and testing left to do before 3.0.0 will be ready for release, but you can follow its progress in the dev-3.0.0 branch if you're interested.

pda · 2014-05-18T04:55:33Z

Excellent :)

rgrove · 2014-06-21T23:01:10Z

Fixed in Sanitize 3.0.0.

sbounmy mentioned this issue Oct 2, 2013

error with ampersand #85

Closed

rgrove added a commit that referenced this issue May 19, 2014

Add test and history entry for #83.

88cc40d

rgrove closed this as completed Jun 21, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error with greater and lesser than tags #83

Error with greater and lesser than tags #83

eviltrout commented Sep 16, 2013

rgrove commented Sep 16, 2013

kasperpeulen commented Sep 16, 2013

rgrove commented Sep 16, 2013

kasperpeulen commented Sep 16, 2013

rgrove commented Sep 16, 2013

pda commented Sep 16, 2013

rgrove commented Sep 16, 2013

Blaisorblade commented Sep 17, 2013

rgrove commented Sep 17, 2013

mathiasbynens commented Sep 17, 2013

pda commented Sep 17, 2013

rgrove commented Sep 19, 2013

rgrove commented May 18, 2014

pda commented May 18, 2014

rgrove commented Jun 21, 2014

Error with greater and lesser than tags #83

Error with greater and lesser than tags #83

Comments

eviltrout commented Sep 16, 2013

rgrove commented Sep 16, 2013

kasperpeulen commented Sep 16, 2013

rgrove commented Sep 16, 2013

kasperpeulen commented Sep 16, 2013

rgrove commented Sep 16, 2013

pda commented Sep 16, 2013

rgrove commented Sep 16, 2013

Blaisorblade commented Sep 17, 2013

rgrove commented Sep 17, 2013

mathiasbynens commented Sep 17, 2013

pda commented Sep 17, 2013

rgrove commented Sep 19, 2013

rgrove commented May 18, 2014

pda commented May 18, 2014

rgrove commented Jun 21, 2014