New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to disable '&' to '&' escaping #1127

Closed
ryansch opened this Issue Jul 15, 2014 · 12 comments

Comments

Projects
None yet
8 participants
@ryansch

ryansch commented Jul 15, 2014

I have an html document that contains a link of the form:
http://domain.com/path?foo=bar&dead=beef

If I parse it and emit it back out via to_html I get the following:
http://domain.com/path?foo=bar&dead=beef

This breaks the link's functionality and appears to only have the foo get param. How do I get #to_html or #serialize to behave?

Example

doc = %q{<a href="http://domain.com/path?foo=bar&dead=beef">Clicky</a>}
message = Nokogiri::HTML.parse(doc)
message.to_html

Result

"<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><a href=\"http://domain.com/path?foo=bar&amp;dead=beef\">Clicky</a></body></html>\n"
@ryansch

This comment has been minimized.

ryansch commented Jul 16, 2014

Here's a failing test case. I'm trying to see if I can fix this.

def test_amp_href_round_trip
  html = <<-eohtml
<html>
  <body>
    <a href="http://tenderlovemaking.com/get_params?foo=bar&dead=beef">Clicky</a>
  </body>
</html>
  eohtml
  html.chomp!

  doc = Nokogiri::HTML(html)
  assert_equal html, doc.root.to_html
end

Failure:

  1) Failure:
test_amp_href_round_trip(Nokogiri::HTML::TestDocument) [/foo/nokogiri/test/html/test_document.rb:556]:
--- expected
+++ actual
@@ -1,5 +1,5 @@
 "<html>
   <body>
-    <a href=\"http://tenderlovemaking.com/get_params?foo=bar&dead=beef\">Clicky</a>
+    <a href=\"http://tenderlovemaking.com/get_params?foo=bar&amp;dead=beef\">Clicky</a>
   </body>
 </html>"

@flavorjones flavorjones added the OSL label Aug 22, 2014

@nolman

This comment has been minimized.

nolman commented Feb 5, 2015

So I believe the root cause of this bug is an incompatibility between xml and html and that nokogiri uses libxml2 to build html documents.

XML does not permit lone &'s instead they must be encoded as &amp;. The code that is converting & to &amp; is in libxml2:

void
xmlBufAttrSerializeTxtContent(xmlBufPtr buf, xmlDocPtr doc,
                              xmlAttrPtr attr, const xmlChar * string)
{   xmlChar *base, *cur;

    if (string == NULL)
        return;
    base = cur = (xmlChar *) string;
    while (*cur != 0) {
 ....
        } else if (*cur == '&') {
            if (base != cur)
                xmlBufAdd(buf, base, cur - base);
            xmlBufAdd(buf, BAD_CAST "&amp;", 5);
            cur++;
            base = cur;
@ryansch

This comment has been minimized.

ryansch commented Feb 5, 2015

Nice find. I didn't end up having enough time to dig through libxml2. I was able to work around this bug in our app in a different way.

@nolman

This comment has been minimized.

nolman commented Feb 5, 2015

What did you end up doing?

@ryansch

This comment has been minimized.

ryansch commented Feb 5, 2015

We were able to limit our url generation so that each url only had one get param. It's not a widely applicable solution. 😦

@khoan

This comment has been minimized.

khoan commented May 29, 2015

big_bad_node.to_html.gsub(%r{https?://[^"]+}) {|s| CGI.unescape_html(s) }

@nolman

This comment has been minimized.

nolman commented May 29, 2015

Parsing/substituting HTML via regex is generally a bad idea. If one controls the URL and can avoiding multiple query params that is probably a better work around for now $0.02

@fabn

This comment has been minimized.

fabn commented Sep 19, 2015

Is there any other workaround not involving regex for this issue?

@gingerlime

This comment has been minimized.

gingerlime commented Jul 8, 2016

Sorry for posting a "me too", but I also bumped into this and find it odd. The regex workaround doesn't always work either. e.g.

> frag = "<h2>Introduction</h2><p><img src=\"https://ab.c.com/?ab=c&mark=%2Fimages%2Fwatermark"
> Nokogiri::HTML.fragment(frag).to_html.gsub(%r{https?://[^"]+}) {|s| CGI.unescape_html(s) } == frag
=> false
@flavorjones

This comment has been minimized.

Member

flavorjones commented Feb 10, 2017

This behavior is going to depend on what version of libxml2 you're running. With the libxml 2.9.4 that's in Nokogiri v1.7.0.1, this is properly not-escaped.

If you see this issue, please do comment here with the output from nokogiri -v so we can see what version of Nokogiri and libxml2 you're running.

@flavorjones flavorjones added libxml2-upstream and removed OSL labels Feb 10, 2017

@redjohn

This comment has been minimized.

redjohn commented Aug 4, 2017

I am still having this problem with nokogiri 1.8.0:

$ nokogiri -v
# Nokogiri (1.8.0)
    ---
    warnings: []
    nokogiri: 1.8.0
    ruby:
      version: 2.3.3
      platform: x86_64-darwin16
      description: ruby 2.3.3p222 (2016-11-21 revision 56859) [x86_64-darwin16]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "<snipped>/gems/nokogiri-1.8.0/ports/x86_64-apple-darwin16.5.0/libxml2/2.9.4"
      libxslt_path: "<snipped>/gems/nokogiri-1.8.0/ports/x86_64-apple-darwin16.5.0/libxslt/1.1.29"
      libxml2_patches:
      - 0001-Fix-comparison-with-root-node-in-xmlXPathCmpNodes.patch
      - 0002-Fix-XPointer-paths-beginning-with-range-to.patch
      - 0003-Disallow-namespace-nodes-in-XPointer-ranges.patch
      libxslt_patches:
      - 0001-Fix-heap-overread-in-xsltFormatNumberConversion.patch
      - 0002-Check-for-integer-overflow-in-xsltAddTextString.patch
      compiled: 2.9.4
      loaded: 2.9.4

Output from @ryansch 's test:

Minitest::Assertion: --- expected
+++ actual
@@ -1,5 +1,5 @@
 "<html>
   <body>
-    <a href=\"http://tenderlovemaking.com/get_params?foo=bar&dead=beef\">Clicky</a>
+    <a href=\"http://tenderlovemaking.com/get_params?foo=bar&amp;dead=beef\">Clicky</a>
   </body>
 </html>"
@newdark

This comment has been minimized.

newdark commented Oct 3, 2018

Has anyone ran into this issue before

fragment = Nokogiri::HTML.fragment('')
iframe_url = 'https://github.com/sparklemotion/nokogiri/issues/1127#issuecomment-320301344&amp;test=true'
Nokogiri::HTML::Builder.with(fragment) do |doc|
  doc.iframe(src: iframe_url)
end
fragment.to_xhtml

Results

"<iframe src=\"https://github.com/sparklemotion/nokogiri/issues/1127#issuecomment-320301344&amp;amp;test=true\"></iframe>"

It seems to want to escape the value twice &\amp;amp

Nokogiri (1.8.4)

Details
---
warnings: []
nokogiri: 1.8.4
ruby:
  version: 2.3.3
  platform: x86_64-darwin18
  description: ruby 2.3.3p222 (2016-11-21 revision 56859) [x86_64-darwin18]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/Users/newdark/.rbenv/versions/2.3.3/gemsets/tb/gems/nokogiri-1.8.4/ports/x86_64-apple-darwin18.0.0/libxml2/2.9.8"
  libxslt_path: "/Users/newdark/.rbenv/versions/2.3.3/gemsets/tb/gems/nokogiri-1.8.4/ports/x86_64-apple-darwin18.0.0/libxslt/1.1.32"
  libxml2_patches:
  - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
  libxslt_patches: []
  compiled: 2.9.8
  loaded: 2.9.8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment