Skip to content


Parsing without NOENT still substitutes entities #551

Phrogz opened this Issue · 7 comments

5 participants

require 'nokogiri'
p Nokogiri::VERSION
#=> "1.5.0"
p [
].map{ |o| Nokogiri::HTML("<p>Foo&reg;</p>",nil,'utf-8',o).at('//text()') }
#=> [#<Nokogiri::XML::Text:0x810bb34c "Foo\u00AE">,
#=>  #<Nokogiri::XML::Text:0x810bafdc "Foo\u00AE">]

I expected a Nokogiri::XML::EntityReference to be created for those characters that are specified by an entity, e.g.

require 'nokogiri'
doc = Nokogiri::HTML("<p>Foo</p>")'p') << doc, 'reg' )
#=> [#<Nokogiri::XML::Text:0x8092b67c "Foo">,
#=>  #<Nokogiri::XML::EntityReference:0x8092ba3c name="reg">]

Originated by trying to answer this Stack Overflow question.


This is still an issue. NOENT is not respected, which is problematic when using Nokogiri to sanitize text:

Nokogiri::HTML("&lt;script>alert('Ouch!');&lt;/script>") {|c| c.options |= Nokogiri::XML::ParseOptions::NOENT }.text => "<script>alert('Ouch!');</script>"

This is still a real issue. It's making it impossible to sanitize text, as @cheald mentioned. (Bumped into this when working on nricciar/wikicloth#78), so it's really restricting the use cases for nokogiri.


I noticed there is a difference between using Node#text and Node#to_s: the latter seems to preserve html entities. Is this difference in behavior intended?

irb(main):077:0> str = "<pre>&amp;lt&#59;test</pre>&lt;script type=&quot;text/javascript&quot; src=&quot;bla.js&quot;&gt;&lt;/script&gt;"
irb(main):081:0> doc = Nokogiri::HTML::DocumentFragment.parse(str)
irb(main):084:0> doc.xpath(".//text()").last.text
=> "<script type=\"text/javascript\" src=\"bla.js\"></script>"
irb(main):086:0> doc.xpath(".//text()").last.to_s
=> "&lt;script type=\"text/javascript\" src=\"bla.js\"&gt;&lt;/script&gt;"

I'm having a similar issue, specifically with the trademark character \u0099:

2.1.5 :043 > Nokogiri::HTML("&#153;") {|c| c.options |= Nokogiri::XML::ParseOptions::NOENT }.text
=> "\u0099"
2.1.5 :044 > Nokogiri::HTML("&#153;") {|c| c.options |= Nokogiri::XML::ParseOptions::NOENT }.to_s
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"\">\n<html><body><p>\u0099</p></body></html>\n"

I'm using nokogiri with ruby 2.1.5

am I missing something, or is this happening due to this bug?

Sparkle Motion member

The difference between #text and #to_s is intentional; we should definitely do a better job of documenting this. #text is intended for plaintext contexts, which is why the entities are expanded. #to_s emits XML or HTML using either #to_xml or #to_html and is what should be used for emitting markup for consumption by browser/parser/etc.:

      def to_s
        document.xml? ? to_xml : to_html

Also possibly confusing the issue is that libxml2's behavior is odd in that NOENT will cause entities to be not-expanded at parse time, but will expand them when xmlNodeGetContent() is called, which is pretty much any time you want to print the node and its contents. Which means it's not as useful an option as you might think.

@cheald, @MoustafaEid, @dometto -- does my explanation of #to_s (and #to_html and #to_xml) versus #text make sense and resolve your questions?

I'll work on improving documentation, and will check back in a couple of days to see if anyone has objections to closing this.

@flavorjones flavorjones added a commit that referenced this issue
@flavorjones flavorjones Improving documentation around serializers.
This is in response to questions on #551.

@flavorjones That helps a lot! Thank you.

Specifically WRT sanitization, it looks like something like this should work:

# => "bold&lt;script&gt;alert('Ouch!');&lt;/script&gt;"

@flavorjones thanks for the explanation, would definitely be good to update the docs in this respect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.