-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update to Nokogumbo 2.0 #189
Conversation
Thanks! This is super helpful. I should have some time this weekend to dive in and see what's up with those remaining test failures. |
For the content type failure, something like attr = doc.at_xpath('/html/head/meta[@http-equiv="content-type"]/@content')
attr.content = 'text/html; charset=utf-8' unless attr.nil? should fix it. I'm open to adding an option to the serialize methods to output the DOCTYPE public and system identifiers, if present. HTML treats a present but empty identifier as different from a missing identifier but Gumbo does not. As a result, I believe Nokogumbo treats an empty identifier as missing. I think this would be easy to fix. I'm also open to adding a slightly more smart attribute serialization. We probably don't want to include something as aggressive as this which will turn |
I don't have any real opinion on this, so whatever you think is best is fine with me.
I personally prefer always using double quoted attributes. I think the principle of least surprise is valuable here; if Sanitize's output sometimes uses double quotes and sometimes uses single quotes depending on attribute values, that could surprise people. I've been looking into the YouTube transformer failures. When parsing a non-empty
This seems correct from a parsing perspective. But when the iframe is serialized, this text content is serialized verbatim, and the When these tests were originally written, my understanding was that The HTML Living Standard says
It also says that when serializing a
So it seems like Nokogumbo is behaving correctly here, but it also seems like the result could be unsafe in old browsers, though I'm not sure exactly which browsers are old enough to both not support iframes and to execute scripts within iframe content. I guess a safe route might be for Sanitize to manually escape iframe contents? 🤔 |
I have no need of such functionality as I don't need to support transitional documents. So if you're okay with having the identifiers removed, then the test case could be updated. That might violate the principle of least surprise for your users though. So up to you.
I agree. So then I guess the test case could change to using double quotes with
Escaping the text should be relatively straight forward. Something like doc = Nokogiri::HTML5.fragment('<iframe><script>alert("hi")</script></iframe>')
doc.xpath('.//iframe/text()').each do |t|
t.content = t.content.gsub(/[&\u00a0<>]/, '&' => '&', "\u00a0" => ' ', '<' => '<', '>' => '>')
end (I don't know why the period in that XPath expression is necessary, but it doesn't work without it.) Alternatively, you could just drop all children of an |
Cool, sounds like we’re on the same page. I’ve got most of these changes finished, but had to stop for tonight. Should be able to finish them up tomorrow and merge this. |
This can't be merged as is, but I wanted to share what I'd done, in case you want to update Nokogumbo to 2.0.
Nokogumbo 2.0 has had a bunch of fixes (many around errors which I don't think
sanitize
cares about), but it also does spec-compliant parsing of HTML fragments and serialization.This changes a number of things as mentioned here.
There are still 7 test failures with this PR, some of which I can explain, others I'm not sure why they're happening and I haven't investigated closely.
Failures 2–5 seem like pretty concerning failures. I'm not sure why that happens.
The HTML spec specifies that
"
in attribute values be escaped as"
and attribute values themselves are always surrounded by"
. That explains Failure 1.Libxml2 has the horrible behavior of inserting
meta
elements which you had code to remove. Nokogumbo 2.0 no longer uses libxml2's serialization (which is broken for HTML5 anyway, e.g., rubys/nokogumbo#14). Nokogumbo no longer does anything with meta elements, hence Failure 6.Finally, the HTML spec strips out public and system identifiers in DOCTYPEs when serializing. That explains Failure 7.
I have to run right this minute, but I'm happy to help make this work for you.