Added failing tests for '<<' and '<<<some html'. #963

kaspth · 2013-08-27T07:43:54Z

I called:

doc = Nokogiri::XML::Document.parse('<<<some html')
doc.to_s

I expected to get the string back. However, Nokogiri returns an empty string.

In the case of '<<', this is what the test returns:
+"<?xml version=\"<<<some html>\"?> +"

// @rafaelfranca

knu · 2013-11-01T10:57:06Z

I can't reproduce the +"<?xml version=\"<<<some html>\"?> +" stuff. What's your environment? (nokogiri --version)

The libxml2 parser does not even parse "<", so I suppose you couldn't expect much with regard to invalid XML fragments anyway.

kaspth · 2013-11-01T12:58:26Z

My environment:

# Nokogiri (1.6.0)
    ---
    warnings: []
    nokogiri: 1.6.0
    ruby:
      version: 2.0.0
      platform: x86_64-darwin12.4.0
      description: ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.4.0]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: /Users/kasperhansen/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.0/ports/i686-apple-darwin11/libxml2/2.8.0
      libxslt_path: /Users/kasperhansen/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.0/ports/i686-apple-darwin11/libxslt/1.1.26
      compiled: 2.8.0
      loaded: 2.8.0

If libxml2 doesn't parse "<", what does it do with them? Skip them, escape them, or nothing?

knu · 2013-11-01T15:45:34Z

It simply ignores bare angle brackets because they are invalid tokens in XML and HTML4.

HTML5 happens to explicitly define how to parse a non-well-formed markup document, so a decent HTML5 parser can parse such a fragment nearly as you would expect.

For example, nokogumbo parses << as the literal text <<, and <<<<some html as the literal text <<< (<some html is taken as an incomplete open tag). Try something like Nokogumbo.parse('<<<some').at('body').children.

kaspth · 2013-11-05T18:32:42Z

We're trying to get Rails to use Loofah (which uses Nokogiri) for sanitization. Here's the original issue: https://github.com/rafaelfranca/rails-html-sanitizer/blob/master/test/sanitizer_test.rb#L77

If I hear you right, bare < characters are ignored, so a user inputting <-: gets an empty string? So there's going to be honest input which will just disappear for the user?

//@rafaelfranca

knu · 2013-11-06T03:01:45Z

As I said in the previous comment, if what is specified in HTML5 is enough for you, consider using an HTML5 parser like Nokogumbo. If that is not an option, you'll have to implement your own parser or preprocessor. I'm not sure, but it might be enough to do gsub!(/<(?![[:word:]:?!\/])/, "&lt;") or something like that before feeding a fragment to Nokogiri.

After all, Nokogiri is a wrapper for libxml2 (CRuby) and NekoHTML (JRuby), each of which is a standards-conforming but relatively naive parser for XML and HTML4. They may not parse broken HTML fragments like popular browsers do (if that's what you expect), but that's mainly because XML and HTML4 do not specify much about how to deal with invalid fragments, and it is at least currently out of scope for the underlying libraries and Nokogiri itself to act like web browsers.
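
The preprocessing idea above can be sketched in plain Ruby, without loading Nokogiri at all. This is only an illustration of the suggested regex, assuming the intended replacement is the entity &lt;. The names BARE_LT and escape_bare_lt are hypothetical and not part of Nokogiri or Loofah, and whether this is sufficient for sanitization has not been verified here.

```ruby
# Hypothetical helper, not part of Nokogiri or Loofah.
# Matches a "<" that is NOT followed by a word character, ":", "?", "!" or "/",
# i.e. a "<" that cannot begin a tag, closing tag, comment, doctype or
# processing instruction.
BARE_LT = /<(?![[:word:]:?!\/])/

# Returns a copy of the fragment with every bare "<" escaped as "&lt;".
# The caller's string is left untouched (gsub, not gsub!).
def escape_bare_lt(fragment)
  fragment.gsub(BARE_LT, "&lt;")
end

puts escape_bare_lt("<<")            # prints: &lt;&lt;
puts escape_bare_lt("<<<some html")  # prints: &lt;&lt;<some html
puts escape_bare_lt("<-:")           # prints: &lt;-:
puts escape_bare_lt("<p>ok</p>")     # prints: <p>ok</p>
```

Note that the third "<" in "<<<some html" is left alone because it is followed by a word character, so the parser would still see <some html as an incomplete open tag.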

leejarvis · 2014-01-23T15:14:06Z

Closing this one too; please re-open to discuss further if necessary.

rafaelfranca mentioned this pull request Nov 5, 2013

Fix the skipped tests rails/rails-html-sanitizer#4

Closed

Added failing tests for '<<' and '<<<some html'.

f9a5e7c

leejarvis closed this Jan 23, 2014
