un-expected result with the method #to_html. #969

Closed
aruprakshit opened this issue Sep 9, 2013 · 2 comments

@aruprakshit

Hi,

I am getting an unexpected result from the #to_html method.

require 'nokogiri'
require 'open-uri'

doc = Nokogiri.HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
puts doc.at_xpath("(//table)[2]/tr/td//a[@name='[09]']").to_html(:encoding => 'utf-8')
# => <a name="%5B09%5D"></a>

The expected output is: <a NAME="[09]"></a>

@knu
Member

knu commented Nov 6, 2013

This seems to be how libxml2 handles the "name" attribute of the "a" element.

The HTML4 DTD declares the name attribute as CDATA, so strictly speaking this behavior violates the specification, although it appears to be a practical convention for dealing with HTML anchors.

However, the HTML4 specification also says that the "a" element's "name" attribute and the "id" attribute share the same name space, so one could argue that name values should be limited to the same character set as id values.

This probably means that the treatment of a name value containing non-id characters can depend on the context and the implementation.
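
To make "non-id characters" concrete: HTML 4.01 defines ID and NAME tokens as a letter followed by any number of letters, digits, hyphens, underscores, colons, and periods. Here is a minimal sketch that checks a value against that rule. The constant `HTML4_ID_TOKEN` and the helper `html4_id_token?` are hypothetical names made up for this illustration; they are not part of Nokogiri or libxml2.

```ruby
# HTML 4.01 ID/NAME token rule: a letter, then letters, digits, "-", "_", ":" or ".".
HTML4_ID_TOKEN = /\A[A-Za-z][A-Za-z0-9\-_:.]*\z/

# Hypothetical helper (not a Nokogiri API): true if value is a valid HTML4 ID token.
def html4_id_token?(value)
  HTML4_ID_TOKEN.match?(value)
end

puts html4_id_token?("section-09")  # true
puts html4_id_token?("[09]")        # false
```

The value `[09]` from the report fails this check, which is the kind of input whose handling may vary between implementations.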

@flavorjones
Member

Apologies for not responding on this for so many years.

@knu is correct in his diagnosis. The behavior you're describing is inherited from the underlying HTML4 library, libxml2. Here's the C code that controls URI-escaping of certain HTML attributes at serialization-time (when the document is printed):

https://gitlab.gnome.org/GNOME/libxml2/blob/v2.9.2/HTMLtree.c#L714-718

Specifically, href, action, src, and name (but only within an anchor) are always escaped when generating HTML -- basically, anything that could be a URI reference.
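
As a rough illustration of what that escaping does to a value like `[09]`, here is a stdlib-only sketch. It approximates the behavior and does not reproduce libxml2's code: the `SAFE_CHARS` set and the `escape_uri_like` helper are assumptions made for this example, and the real list of characters libxml2 leaves alone may differ.

```ruby
require "uri"

# ASSUMPTION: characters this sketch leaves unescaped; libxml2's real list may differ.
SAFE_CHARS = (("A".."Z").to_a + ("a".."z").to_a + ("0".."9").to_a +
              "-_.!~*'()@/:=?;#%&,+".chars).freeze

# Hypothetical helper: percent-escape every byte that is not in SAFE_CHARS.
def escape_uri_like(value)
  value.each_byte.map do |byte|
    char = byte.chr
    (byte < 128 && SAFE_CHARS.include?(char)) ? char : format("%%%02X", byte)
  end.join
end

escaped = escape_uri_like("[09]")
puts escaped                                 # %5B09%5D
puts URI.decode_www_form_component(escaped)  # [09]
```

Decoding the escaped value recovers the original string, so a consumer that needs the literal name can percent-decode the attribute after serialization.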

Fortunately, since 2013 things have improved somewhat, and you can now use Nokogiri's HTML5 parser (which uses libgumbo) to handle this better:

#! /usr/bin/env ruby

require "nokogiri"

html = <<~EOF
  <html>
    <body>
      <a name='[09]'>hello</a>
EOF

html4_doc = Nokogiri::HTML4::Document.parse(html)
puts html4_doc.to_html
# => <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html>
#   <body>
#     <a name="%5B09%5D">hello</a>
# </body>


html5_doc = Nokogiri::HTML5::Document.parse(html)
puts html5_doc.to_html
# => <html><head></head><body>
#     <a name="[09]">hello</a>
# </body></html>

Hopefully you're able to use the HTML5 parser (built into Nokogiri v1.12+; on earlier versions you can install Nokogumbo separately). Apologies again for my slow response.

mamhoff added a commit to mamhoff/alchemy_cms that referenced this issue Jan 19, 2024
We found that using Rails' HTML sanitizer does more than we want the
Richtext sanitization to do: It does not just remove nodes that are not
in the safelist, it also escapes some markup (especially in links).

This introduces a custom Loofah "scrubber" that only cares about the
element safelist.

The `sanitized_body` attribute is not for escaping at the view layer,
where all these safety precautions are necessary, but just for making
sure admins don't use iframes when we don't want to.

See the following related issues and commits:
rails/rails-html-sanitizer@f3ba1a8
sparklemotion/nokogiri#3104
sparklemotion/nokogiri#969 (comment)
flavorjones/loofah#14 (comment)