Added failing tests for '<<' and '<<<some html'. #963

Closed
wants to merge 1 commit into from

kaspth

@kaspth kaspth commented Aug 27, 2013

I called:

doc = Nokogiri::XML::Document.parse('<<<some html')
doc.to_s

I expected to get the string back. However, Nokogiri returns an empty string.

In the case of '<<', this is what the test returns:
+"<?xml version=\"<<<some html>\"?> +"

// @rafaelfranca

@knu
Member

knu commented Nov 1, 2013

I can't reproduce the +"<?xml version=\"<<<some html>\"?> +" stuff. What's your environment? (nokogiri --version)

The libxml2 parser does not even parse "<", so I suppose you couldn't expect much with regard to invalid XML fragments anyway.

@kaspth
Author

kaspth commented Nov 1, 2013

My environment:

# Nokogiri (1.6.0)
    ---
    warnings: []
    nokogiri: 1.6.0
    ruby:
      version: 2.0.0
      platform: x86_64-darwin12.4.0
      description: ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.4.0]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: /Users/kasperhansen/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.0/ports/i686-apple-darwin11/libxml2/2.8.0
      libxslt_path: /Users/kasperhansen/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.0/ports/i686-apple-darwin11/libxslt/1.1.26
      compiled: 2.8.0
      loaded: 2.8.0

If libxml2 doesn't parse "<", what does it do with it? Skip it, escape it, or nothing?

@knu
Member

knu commented Nov 1, 2013

It simply ignores bare angle brackets because they are invalid tokens in XML and HTML4.

HTML5 happens to explicitly define how to parse a non-well-formed markup document, so a decent HTML5 parser could parse such a fragment nearly as you would expect.

For example, nokogumbo parses << as &lt;&lt;, and <<<<some html as &lt;&lt;&lt; (<some html is taken as an incomplete open tag). Try something like Nokogumbo.parse('<<<some').at('body').children.

@kaspth
Author

kaspth commented Nov 5, 2013

We're trying to get Rails to use Loofah (which uses Nokogiri) for sanitization. Here's the original issue: https://github.com/rafaelfranca/rails-html-sanitizer/blob/master/test/sanitizer_test.rb#L77

If I understand you correctly, bare < characters are ignored, so a user inputting <-: gets an empty string? So there's going to be honest input that just disappears for the user?

//@rafaelfranca

@knu
Member

knu commented Nov 6, 2013

As I said in the previous comment, if what is specified in HTML5 is enough for you, consider using an HTML5 parser like Nokogumbo. If that is not an option, you'll have to implement your own parser or preprocessor. I'm not sure, but it might be enough to do gsub!(/<(?![[:word:]:?!\/])/, "&lt;") or something like that before feeding a fragment to Nokogiri.
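
To make the preprocessing idea concrete, here is a minimal sketch using only core Ruby. The helper name `escape_bare_lt` is hypothetical (it is not part of Nokogiri or Loofah), and the regex is the one suggested above, untested against real-world input. It uses the non-destructive `gsub` rather than `gsub!`, since `gsub!` returns `nil` when nothing matches.

```ruby
# Hypothetical helper: the name is an assumption, not a Nokogiri/Loofah API.
# Escapes every "<" that is NOT followed by a word character, ":", "?",
# "!", or "/", i.e. every "<" that cannot begin a tag, a comment, a
# doctype, or a processing instruction.
def escape_bare_lt(fragment)
  # gsub (not gsub!) always returns a string and leaves the input untouched.
  fragment.gsub(/<(?![[:word:]:?!\/])/, "&lt;")
end

puts escape_bare_lt("<<")            # &lt;&lt;
puts escape_bare_lt("<<<some html")  # &lt;&lt;<some html
puts escape_bare_lt("<-:")           # &lt;-:
puts escape_bare_lt("<b>bold</b>")   # <b>bold</b>
```

The result would then be passed to the parser in place of the raw input. Note that a `<` followed by a word character (for example `<3`, or the unterminated `<some html` above) is left alone by this regex, so it only covers part of the problem.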

After all, Nokogiri is a wrapper for libxml2 (CRuby) and NekoHTML (JRuby), both of which are standards-conforming but relatively naive parsers for XML and HTML4. They may not parse broken HTML fragments the way popular browsers do (if that's what you expect), mainly because XML and HTML4 do not specify much about how to deal with invalid fragments. Acting like a web browser is, at least currently, out of scope for the underlying libraries and for Nokogiri itself.

@leejarvis
Member

Closing this one too; please re-open to discuss further if necessary.

@leejarvis leejarvis closed this Jan 23, 2014