If an element has a duplicate attribute, for example because of issue #2265, all occurrences of &amp; in all its children will be removed.
For instance, this document
<html foo='x' foo='x'><body><p>A &amp; B</p></body></html>
is turned into
<html foo='x'><body><p>A B</p></body></html>
Help us reproduce what you're seeing
Replication script:
#!/usr/bin/env ruby

require "nokogiri"

h1 = "<html foo=\"x\">\n<body>A &amp; B</body></html>"
doc1 = Nokogiri::XML::Document.parse(h1)
p h1
p doc1.to_xml
raise "& missing" unless doc1.text.include? "&"

# NOTE: Attribute `foo` appears twice.
h2 = "<html foo=\"x\" foo=\"x\">\n<body>A &amp; B</body></html>"
doc2 = Nokogiri::XML::Document.parse(h2)
p h2
p doc2.to_xml
raise "& missing" unless doc2.text.include? "&"
This is the output of the script:
"<html foo=\"x\">\n<body>A &amp; B</body></html>"
"<?xml version=\"1.0\"?>\n<html foo=\"x\">\n<body>A &amp; B</body></html>\n"
"<html foo=\"x\" foo=\"x\">\n<body>A &amp; B</body></html>"
"<?xml version=\"1.0\"?>\n<html foo=\"x\" foo=\"x\">\n<body>A B</body></html>\n"
RuntimeError: & missing
./test.rb:19:in `<top (required)>'
Expected behavior / Actual behavior
As long as a file is parseable, the &amp; entity should be preserved.
We can see that libxml2 does not consider the second document to be well-formed, but it tries to recover because the default parse options used for XML include RECOVER. libxml2's recovery logic is different from (and simpler than) the happy-path logic, and as a result this element doesn't capture all of the original content.
Stated more simply: an error early in parsing this element causes the "recovery" logic to mishandle entities later in the same element.
If you'd like to submit a report upstream to libxml2, I'd be happy to provide some guidance. Please let me know!