un-expected result with the method `#to_html`. #969

aruprakshit · 2013-09-09T12:25:00Z

Hi,

I am getting an unexpected result from the method `#to_html`.

require 'nokogiri'
require 'open-uri'

doc = Nokogiri.HTML(open('http://www.s-techent.com/ATA100.htm'))
puts doc.at_xpath("(//table)[2]/tr/td//a[@name='[09]']").to_html(:encoding => 'utf-8')
# => <a name="%5B09%5D"></a>

Expected output: <a NAME="[09]"></a>


knu · 2013-11-06T13:46:48Z

Seems this is how libxml2 deals with the "name" attribute of the "a" element.

The HTML4 DTD declares the "name" attribute as CDATA, so, read strictly, URI-escaping its value violates the specification, although it appears to be a practical convention for dealing with HTML anchors.

However, the HTML4 specification also says that the "a" element's "name" attribute and the "id" attribute share the same name space, so one could argue that name values should be restricted to the same character set as id values.

This probably means that the treatment of a name value containing non-id characters depends on the context and the implementation.
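To make "non-id characters" concrete, here is a minimal sketch in plain Ruby. It assumes the HTML 4.01 rule for ID and NAME tokens (a letter, followed by letters, digits, hyphens, underscores, colons, or periods). The constant and the helper method are hypothetical names used only for illustration; neither is part of Nokogiri or libxml2.

```ruby
# HTML 4.01 ID/NAME token: a letter, then any number of
# letters, digits, "-", "_", ":" or ".".
# (Hypothetical constant name, for illustration only.)
HTML4_ID_TOKEN = /\A[A-Za-z][A-Za-z0-9\-_:.]*\z/

# Hypothetical helper: true if the value would be a legal HTML4 id.
def html4_id_token?(value)
  HTML4_ID_TOKEN.match?(value)
end

["section-9", "[09]"].each do |name|
  verdict = html4_id_token?(name) ? "valid ID token" : "not a valid ID token"
  puts "#{name.inspect}: #{verdict}"
end
```

The value from the original report, `[09]`, fails this check because it starts with a bracket, which is why its handling is left to the implementation.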

flavorjones · 2021-11-18T17:10:21Z

Apologies for not responding on this for so many years.

@knu is correct in his diagnosis. The behavior you're describing is inherited from the underlying HTML4 library, libxml2. Here's the C code that controls URI-escaping of certain HTML attributes at serialization-time (when the document is printed):

https://gitlab.gnome.org/GNOME/libxml2/blob/v2.9.2/HTMLtree.c#L714-718

Specifically, href, action, src, and name (but only within an anchor) are always escaped when generating HTML -- basically, anything that could be a URI reference.
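For illustration, here is a rough pure-Ruby approximation of that escaping step. It is a sketch, not the libxml2 implementation: the method name is hypothetical, and the set of characters left untouched (ASCII letters and digits, the marks `-_.!~*'()`, plus `@/:=?;#%&,+`) is an assumption based on my reading of the linked C code. Every other byte is percent-encoded.

```ruby
# Bytes assumed to pass through unescaped (an assumption, see above).
SAFE_BYTES = (
  ("A".."Z").to_a + ("a".."z").to_a + ("0".."9").to_a +
  "-_.!~*'()@/:=?;#%&,+".chars
).map(&:ord).freeze

# Hypothetical helper that mimics serialization-time URI escaping:
# walk the value byte by byte and percent-encode anything not in SAFE_BYTES.
def approx_libxml2_uri_escape(value)
  value.each_byte.map { |byte|
    SAFE_BYTES.include?(byte) ? byte.chr : format("%%%02X", byte)
  }.join
end

puts approx_libxml2_uri_escape("[09]")
# => %5B09%5D
puts approx_libxml2_uri_escape("http://example.com/?q=1")
```

Under these assumptions an ordinary URL passes through unchanged, while the brackets in `[09]` are escaped, which matches the output reported at the top of this issue.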

Fortunately, since 2013 things have improved somewhat, and you can now use Nokogiri's HTML5 parser (which uses libgumbo) to handle this better:

#! /usr/bin/env ruby

require "nokogiri"

html = <<~EOF
  <html>
    <body>
      <a name='[09]'>hello</a>
EOF

html4_doc = Nokogiri::HTML4::Document.parse(html)
puts html4_doc.to_html
# => <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html>
#   <body>
#     <a name="%5B09%5D">hello</a>
# </body>


html5_doc = Nokogiri::HTML5::Document.parse(html)
puts html5_doc.to_html
# => <html><head></head><body>
#     <a name="[09]">hello</a>
# </body></html>

Hopefully you're able to use the HTML5 parser (available in Nokogiri v1.12+; on earlier versions you can install Nokogumbo separately). Apologies again for my slow response.

We found that Rails' HTML sanitizer does more than we want the Richtext sanitization to do: it does not just remove nodes that are not in the safelist, it also escapes some markup (especially in links). This introduces a custom Loofah "scrubber" that only cares about the element safelist. The `sanitized_body` attribute is not for escaping at the view layer, where all these safety precautions are necessary, but just for making sure admins don't use iframes when we don't want them to. See the following related issues and commits: rails/rails-html-sanitizer@f3ba1a8 sparklemotion/nokogiri#3104 sparklemotion/nokogiri#969 (comment) flavorjones/loofah#14 (comment)

flavorjones added libxml2-upstream and removed vendored/libxml2 labels Jan 11, 2015

flavorjones added vendored/libxml2 and removed upstream/libxml2 labels Jan 5, 2019

flavorjones closed this as completed Nov 18, 2021

flavorjones mentioned this issue Jan 19, 2024

[bug] Attribute contents HTML-encoded after serialization #3104

Closed

mamhoff mentioned this issue Jan 19, 2024

Implement custom scrubber for Alchemy::Ingredients::Richtext AlchemyCMS/alchemy_cms#2700

Closed

3 tasks
