Parsing without NOENT still substitutes entities #551

Open
Phrogz opened this Issue Oct 13, 2011 · 7 comments

Comments

Projects
None yet
5 participants
@Phrogz

Phrogz commented Oct 13, 2011

require 'nokogiri'
p Nokogiri::VERSION
#=> "1.5.0"
p [
  Nokogiri::XML::ParseOptions::NOENT,
  Nokogiri::XML::ParseOptions::DEFAULT_HTML
].map{ |o| Nokogiri::HTML("<p>Foo&reg;</p>",nil,'utf-8',o).at('//text()') }
#=> [#<Nokogiri::XML::Text:0x810bb34c "Foo\u00AE">,
#=>  #<Nokogiri::XML::Text:0x810bafdc "Foo\u00AE">]

I expected a Nokogiri::XML::EntityReference to be created for those characters that are specified by an entity, e.g.

require 'nokogiri'
doc = Nokogiri::HTML("<p>Foo</p>")
doc.at('p') << Nokogiri::XML::EntityReference.new( doc, 'reg' )
p doc.at('p').children
#=> [#<Nokogiri::XML::Text:0x8092b67c "Foo">,
#=>  #<Nokogiri::XML::EntityReference:0x8092ba3c name="reg">]

Originated by trying to answer this Stack Overflow question.

@cheald

This comment has been minimized.

Show comment
Hide comment
@cheald

cheald Feb 17, 2014

This is still an issue. NOENT is not respected, which is problematic when using Nokogiri to sanitize text:

Nokogiri::HTML("&lt;script>alert('Ouch!');&lt;/script>") {|c| c.options |= Nokogiri::XML::ParseOptions::NOENT }.text => "<script>alert('Ouch!');</script>"

cheald commented Feb 17, 2014

This is still an issue. NOENT is not respected, which is problematic when using Nokogiri to sanitize text:

Nokogiri::HTML("&lt;script>alert('Ouch!');&lt;/script>") {|c| c.options |= Nokogiri::XML::ParseOptions::NOENT }.text => "<script>alert('Ouch!');</script>"
@dometto

This comment has been minimized.

Show comment
Hide comment
@dometto

dometto Jan 29, 2015

This is still a real issue. It's making it impossible to sanitize text, as @cheald mentioned. (Bumped into this when working on nricciar/wikicloth#78), so it's really restricting the use cases for nokogiri.

dometto commented Jan 29, 2015

This is still a real issue. It's making it impossible to sanitize text, as @cheald mentioned. (Bumped into this when working on nricciar/wikicloth#78), so it's really restricting the use cases for nokogiri.

@dometto

This comment has been minimized.

Show comment
Hide comment
@dometto

dometto Jan 30, 2015

I noticed there is a difference between using Node#text and Node#to_s: the latter seems to preserve html entities. Is this difference in behavior intended?

irb(main):077:0> str = "<pre>&amp;lt&#59;test</pre>&lt;script type=&quot;text/javascript&quot; src=&quot;bla.js&quot;&gt;&lt;/script&gt;"
irb(main):081:0> doc = Nokogiri::HTML::DocumentFragment.parse(str)
irb(main):084:0> doc.xpath(".//text()").last.text
=> "<script type=\"text/javascript\" src=\"bla.js\"></script>"
irb(main):086:0> doc.xpath(".//text()").last.to_s
=> "&lt;script type=\"text/javascript\" src=\"bla.js\"&gt;&lt;/script&gt;"

dometto commented Jan 30, 2015

I noticed there is a difference between using Node#text and Node#to_s: the latter seems to preserve html entities. Is this difference in behavior intended?

irb(main):077:0> str = "<pre>&amp;lt&#59;test</pre>&lt;script type=&quot;text/javascript&quot; src=&quot;bla.js&quot;&gt;&lt;/script&gt;"
irb(main):081:0> doc = Nokogiri::HTML::DocumentFragment.parse(str)
irb(main):084:0> doc.xpath(".//text()").last.text
=> "<script type=\"text/javascript\" src=\"bla.js\"></script>"
irb(main):086:0> doc.xpath(".//text()").last.to_s
=> "&lt;script type=\"text/javascript\" src=\"bla.js\"&gt;&lt;/script&gt;"
@MoustafaEid

This comment has been minimized.

Show comment
Hide comment
@MoustafaEid

MoustafaEid Mar 9, 2015

I'm having a similar issue, specifically with the trademark character \u0099:

2.1.5 :043 > Nokogiri::HTML("&#153;") {|c| c.options |= Nokogiri::XML::ParseOptions::NOENT }.text
=> "\u0099"
2.1.5 :044 > Nokogiri::HTML("&#153;") {|c| c.options |= Nokogiri::XML::ParseOptions::NOENT }.to_s
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>\u0099</p></body></html>\n"

I'm using nokogiri 1.6.6.2 with ruby 2.1.5

am I missing something, or is this happening due to this bug?

I'm having a similar issue, specifically with the trademark character \u0099:

2.1.5 :043 > Nokogiri::HTML("&#153;") {|c| c.options |= Nokogiri::XML::ParseOptions::NOENT }.text
=> "\u0099"
2.1.5 :044 > Nokogiri::HTML("&#153;") {|c| c.options |= Nokogiri::XML::ParseOptions::NOENT }.to_s
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>\u0099</p></body></html>\n"

I'm using nokogiri 1.6.6.2 with ruby 2.1.5

am I missing something, or is this happening due to this bug?

@flavorjones

This comment has been minimized.

Show comment
Hide comment
@flavorjones

flavorjones Mar 16, 2015

Member

The difference between #text and #to_s is intentional; we should definitely do a better job of documenting this. #text is intended for plaintext contexts, which is why the entities are expanded. #to_s emits XML or HTML using either #to_xml or #to_html and is what should be used for emitting markup for consumption by browser/parser/etc.:

      def to_s
        document.xml? ? to_xml : to_html
      end

Also possibly confusing the issue is that libxml2's behavior is odd in that NOENT will cause entities to be not-expanded at parse time, but will expand them when xmlNodeGetContent() is called, which is pretty much any time you want to print the node and its contents. Which means it's not as useful an option as you might think.

@cheald, @MoustafaEid, @dometto -- does my explanation of #to_s (and #to_html and #to_xml) versus #text make sense and resolve your questions?

I'll work on improving documentation, and will check back in a couple of days to see if anyone has objections to closing this.

Member

flavorjones commented Mar 16, 2015

The difference between #text and #to_s is intentional; we should definitely do a better job of documenting this. #text is intended for plaintext contexts, which is why the entities are expanded. #to_s emits XML or HTML using either #to_xml or #to_html and is what should be used for emitting markup for consumption by browser/parser/etc.:

      def to_s
        document.xml? ? to_xml : to_html
      end

Also possibly confusing the issue is that libxml2's behavior is odd in that NOENT will cause entities to be not-expanded at parse time, but will expand them when xmlNodeGetContent() is called, which is pretty much any time you want to print the node and its contents. Which means it's not as useful an option as you might think.

@cheald, @MoustafaEid, @dometto -- does my explanation of #to_s (and #to_html and #to_xml) versus #text make sense and resolve your questions?

I'll work on improving documentation, and will check back in a couple of days to see if anyone has objections to closing this.

flavorjones added a commit that referenced this issue Mar 16, 2015

Improving documentation around serializers.
This is in response to questions on #551.
@cheald

This comment has been minimized.

Show comment
Hide comment
@cheald

cheald Mar 16, 2015

@flavorjones That helps a lot! Thank you.

Specifically WRT sanitization, it looks like something like this should work:

Nokogiri::HTML("<div><b>bold</b>&lt;script>alert('Ouch!');&lt;/script></div>").xpath("//text()").to_html
# => "bold&lt;script&gt;alert('Ouch!');&lt;/script&gt;"

cheald commented Mar 16, 2015

@flavorjones That helps a lot! Thank you.

Specifically WRT sanitization, it looks like something like this should work:

Nokogiri::HTML("<div><b>bold</b>&lt;script>alert('Ouch!');&lt;/script></div>").xpath("//text()").to_html
# => "bold&lt;script&gt;alert('Ouch!');&lt;/script&gt;"
@dometto

This comment has been minimized.

Show comment
Hide comment
@dometto

dometto Mar 16, 2015

@flavorjones thanks for the explanation, would definitely be good to update the docs in this respect.

dometto commented Mar 16, 2015

@flavorjones thanks for the explanation, would definitely be good to update the docs in this respect.

@opoudjis opoudjis referenced this issue in riboseinc/asciidoctor-rfc Oct 25, 2017

Closed

Entities in Asciidoc #56

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment