Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] & removed if element has duplicated attributes #2267

Closed
gioele opened this issue Jun 14, 2021 · 2 comments
Closed

[bug] & removed if element has duplicated attributes #2267

gioele opened this issue Jun 14, 2021 · 2 comments

Comments

@gioele
Copy link

gioele commented Jun 14, 2021

Please describe the bug

If an element has a duplicate attribute, for example because of issue #2265, all occurrencence of & in all its children will be removed.

For instance, this document

<html foo='x'><body><p>A &amp; B</p></body></html>

is turned into

<html foo='x'><body><p>A  B</p></body></html>

Help us reproduce what you're seeing

Replication script:

#!/usr/bin/env ruby

require "nokogiri"

h1 = "<html foo=\"x\">\n<body>A &amp; B</body></html>"
doc1 = Nokogiri::XML::Document.parse(h1)
p h1
p doc1.to_xml

raise "&amp; missing" unless doc1.text.include? "&"

# NOTE: Attribute `foo` appears twice.
h2 = "<html foo=\"x\" foo=\"x\">\n<body>A &amp; B</body></html>"
doc2 = Nokogiri::XML::Document.parse(h2)
p h2
p doc2.to_xml

raise "&amp; missing" unless doc2.text.include? "&"

This is the output of the script:

"<html foo=\"x\">\n<body>A &amp; B</body></html>"
"<?xml version=\"1.0\"?>\n<html foo=\"x\">\n<body>A &amp; B</body></html>\n"
"<html foo=\"x\" foo=\"x\">\n<body>A &amp; B</body></html>"
"<?xml version=\"1.0\"?>\n<html foo=\"x\" foo=\"x\">\n<body>A  B</body></html>\n"
RuntimeError: &amp; missing
  ./test.rb:19:in `<top (required)>'

Expected behavior / Actual behavior

As long as a file is parseable, the &amp; entity should be preserved.

Environment

$ nokogiri -v
# Nokogiri (1.11.7)
    ---
    warnings: []
    nokogiri:
      version: 1.11.7
      cppflags:
      - "-I/[RBENV]/versions/2.6.2/lib/ruby/gems/2.6.0/gems/nokogiri-1.11.7-x86_64-linux/ext/nokogiri"
      - "-I/[RBENV]/versions/2.6.2/lib/ruby/gems/2.6.0/gems/nokogiri-1.11.7-x86_64-linux/ext/nokogiri/include"
      - "-I/[RBENV]/versions/2.6.2/lib/ruby/gems/2.6.0/gems/nokogiri-1.11.7-x86_64-linux/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 2.6.2
      platform: x86_64-linux
      gem_platform: x86_64-linux
      description: ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-linux]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - 0004-use-glibc-strlen.patch
      - 0005-avoid-isnan-isinf.patch
      - 0006-update-automake-files-for-arm64.patch
      - 0007-Fix-XPath-recursion-limit.patch
      libxml2_path: "/[RBENV]/versions/2.6.2/lib/ruby/gems/2.6.0/gems/nokogiri-1.11.7-x86_64-linux/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.9.12
      loaded: 2.9.12
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      - 0002-Fix-xml2-config-check-in-configure-script.patch
      compiled: 1.1.34
      loaded: 1.1.34
    other_libraries:
      zlib: 1.2.11
@gioele gioele added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Jun 14, 2021
@flavorjones
Copy link
Member

Hi, thanks once again for your well-written and clear report.

In this case, it's illustrative to also emit the document's errors. Updating the script to contain these lines:

--- bar.rb	2021-06-14 07:30:52.774562815 -0400
+++ baz.rb	2021-06-14 07:30:49.809562715 -0400
@@ -9,6 +9,7 @@
 doc1 = Nokogiri::XML::Document.parse(h1)
 p h1
 p doc1.to_xml
+p doc1.errors
 
 raise "&amp; missing" unless doc1.text.include? "&"
 
@@ -17,5 +18,6 @@
 doc2 = Nokogiri::XML::Document.parse(h2)
 p h2
 p doc2.to_xml
+p doc2.errors
 
 raise "&amp; missing" unless doc2.text.include? "&"

you'll see [] for the first document (i.e., no errors), but this for the second document:

[#<Nokogiri::XML::SyntaxError: 1:22: FATAL: Attribute foo redefined>]

We can see that the libxml2 does not consider the second document to be well-formed, but will try to recover because the default parse options used for XML include RECOVER. libxml2's recovery logic is different (and simpler) than the happy path logic, and as a result this element doesn't capture all of the original content.

Stated more simply: the error early in parsing this element leads to "recovery" logic failing to properly parse entities later in parsing.

If you'd like to submit a report upstream to libxml2, I'd be happy to provide some guidance. Please let me know!

@flavorjones flavorjones added topic/error-handling upstream/libxml2 and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Jun 14, 2021
@gioele
Copy link
Author

gioele commented Jun 14, 2021

Hi, thanks for the super quick triage. :)

I reported this issue upstream: https://gitlab.gnome.org/GNOME/libxml2/-/issues/270

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants