Unexpected result with the method #to_html (#969)
Comments
This seems to be how libxml2 deals with the "name" attribute of the "a" element. The HTML4 DTD says that the name attribute is CDATA, so strictly speaking the behavior violates the specification, although it looks like a practical convention for dealing with HTML anchors. However, the HTML4 specification also says that the "a" element's "name" attribute and the "id" attribute share the same name space, so one could argue that name values should be restricted to the same character set as id values. This probably means that how a name value containing non-id characters is treated depends on the context and the implementation.
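As a rough illustration of the character-set point above, here is a minimal sketch that checks a value against the HTML 4.01 definition of ID and NAME tokens (a letter, followed by letters, digits, hyphens, underscores, colons, and periods). The helper name `html4_id_token?` is hypothetical and exists only for this example; it is not part of Nokogiri or libxml2.

```ruby
# HTML 4.01 defines ID and NAME tokens as: a letter, followed by any number
# of letters, digits, hyphens, underscores, colons, and periods.
HTML4_ID_TOKEN = /\A[A-Za-z][A-Za-z0-9\-_:.]*\z/

# Hypothetical helper, named only for this illustration.
def html4_id_token?(value)
  value.match?(HTML4_ID_TOKEN)
end

puts html4_id_token?("section-1")  # prints true
puts html4_id_token?("[09]")       # prints false
```

A value such as `[09]` is therefore valid CDATA but not a valid id token, which is the gap that leaves room for implementation-specific handling.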
Apologies for not responding on this for so many years. @knu is correct in his diagnosis. The behavior you're describing is inherited from libxml2, the library underlying the HTML4 parser. Here's the C code that controls URI-escaping of certain HTML attributes at serialization time (when the document is printed): https://gitlab.gnome.org/GNOME/libxml2/blob/v2.9.2/HTMLtree.c#L714-718
Fortunately, since 2013 things have improved somewhat, and you can now use Nokogiri's HTML5 parser (which uses libgumbo) to handle this better:
#! /usr/bin/env ruby
require "nokogiri"
html = <<~EOF
<html>
<body>
<a name='[09]'>hello</a>
EOF
html4_doc = Nokogiri::HTML4::Document.parse(html)
puts html4_doc.to_html
# => <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html>
# <body>
# <a name="%5B09%5D">hello</a>
# </body>
html5_doc = Nokogiri::HTML5::Document.parse(html)
puts html5_doc.to_html
# => <html><head></head><body>
# <a name="[09]">hello</a>
# </body></html>
Hopefully you're able to use the HTML5 parser (available in Nokogiri v1.12+; on earlier versions you can install Nokogumbo separately). Apologies again for my slow response.
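To make the escaping seen in the HTML4 output concrete, here is a small stand-alone sketch of percent-encoding using only Ruby's standard library. It is not libxml2's actual routine: the `percent_escape` helper and the `SAFE_BYTES` set are assumptions made for illustration, and the real serializer may treat a different set of characters as safe.

```ruby
require "uri"

# Assumed set of bytes left untouched; chosen for illustration only and not
# taken from libxml2's source.
SAFE_BYTES = /[A-Za-z0-9\-_.!~*'()@\/:=?;\#%&,+]/

# Hypothetical helper: percent-encode every byte outside SAFE_BYTES.
def percent_escape(value)
  value.each_byte.map { |b|
    ch = b.chr
    ch.match?(SAFE_BYTES) ? ch : format("%%%02X", b)
  }.join
end

escaped = percent_escape("[09]")
puts escaped                                  # prints %5B09%5D
puts URI.decode_www_form_component(escaped)   # prints [09]
```

The square brackets are the only bytes in `[09]` outside the assumed safe set, which is why the serialized attribute reads `%5B09%5D`, and decoding recovers the original value.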
We found that using Rails' HTML sanitizer does more than we want the Richtext sanitization to do: it does not just remove nodes that are not in the safelist, it also escapes some markup (especially in links). This introduces a custom Loofah "scrubber" that only cares about the element safelist. The `sanitized_body` attribute is not for escaping at the view layer, where all these safety precautions are necessary, but just for making sure admins don't use iframes when we don't want them to. See the following related issues and commits: rails/rails-html-sanitizer@f3ba1a8 sparklemotion/nokogiri#3104 sparklemotion/nokogiri#969 (comment) flavorjones/loofah#14 (comment)
Hi,
I am getting an unexpected result with the method #to_html. The expected output is:
<a NAME="[09]"></a>