Investigate the possibility of preserving unnecessary entities instead of converting to UTF-8 #116

rgrove · 2014-09-22T19:17:25Z

The Gumbo parser (and Nokogiri in Sanitize < 3.0.0) automatically converts entities to UTF-8 characters when possible. While perfectly valid and in many cases desirable, this causes problems for users who don't have a completely UTF-8 compliant stack.

We should investigate ways to preserve HTML entities in the output to mitigate these kinds of problems.

Related: #115.

mwlang · 2015-05-01T05:04:43Z

This would be good to resolve. I'm actually getting invalid byte sequences from what is otherwise perfectly fine HTML content.

Example:

<p>&#8220;HOT&#8221;</p>

converts to

\u00E2\u0080\u009CHOT\u00E2\u0080\u009D

I'm not sure what the mechanics were exactly, but when I wrote the sanitized HTML to a file and then read it back, I ended up with a string encoded as US-ASCII instead of UTF-8, and that led to "invalid byte sequence in US-ASCII" errors whenever I attempted to do anything with the output.

To get around this, I basically escaped the "&" with "%AMP%" before feeding it to the sanitizer and then flipped it back. Here's the gist of what I'm doing...

https://gist.github.com/mwlang/5ac24295275242844511

Not sure if this is helpful, but if I'm reading this issue's intent correctly, then hopefully this sheds some light and ideas on the subject.
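
The masking workaround described above could look roughly like the sketch below. This is not the code from the linked gist: the helper name `with_entities_masked` and its `placeholder:` parameter are hypothetical, and the sanitizer is passed in as a block so the sketch runs without the Sanitize gem.

```ruby
# Hypothetical helper (not from the linked gist): hide every "&" behind a
# placeholder so the sanitizer cannot decode entities, then restore it.
def with_entities_masked(html, placeholder: "%AMP%")
  masked = html.gsub("&", placeholder)

  # The block is expected to return the sanitized string.
  cleaned = yield(masked)

  cleaned.gsub(placeholder, "&")
end

# Usage with Sanitize, assuming the gem is loaded:
#   with_entities_masked(html) { |h| Sanitize.fragment(h, Sanitize::Config::BASIC) }
```

Two caveats with this approach: input that already contains the literal text `%AMP%` will come back with an `&` in its place, and the sanitizer never sees the masked entities, so any check that depends on decoded values (for example, in attribute values) may not behave as intended.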

rgrove · 2015-05-01T17:21:16Z

@mwlang Sounds like whatever you used to either write the file or read it back expected US-ASCII and not UTF-8, and the byte sequences you got were mangled as a result.

When you give Sanitize this input:

<p>&#8220;HOT&#8221;</p>

...it produces this valid UTF-8 output:

“HOT”

...but if you then treat those valid UTF-8 bytes as if they were US-ASCII bytes, you'll get the invalid byte sequence you mentioned above.
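
The effect can be reproduced in plain Ruby without Sanitize. The sketch below uses two hypothetical helpers, `curly_hot` and `retag`, to show that the same bytes are valid as UTF-8 but invalid as US-ASCII. Decoding each byte as a Latin-1 character is one way (an assumption, not a confirmed diagnosis) to arrive at the `\u00E2\u0080\u009C` pattern reported earlier.

```ruby
# U+201C and U+201D are the curly quotes that &#8220; and &#8221; refer to.
def curly_hot
  "\u201CHOT\u201D"
end

# Re-tag the same bytes with a different encoding without changing them.
def retag(str, encoding)
  str.dup.force_encoding(encoding)
end

utf8 = curly_hot
puts utf8.valid_encoding?                     # true: the bytes are valid UTF-8
puts retag(utf8, "US-ASCII").valid_encoding?  # false: bytes above 0x7F are not US-ASCII

# Treat each byte as one ISO-8859-1 character, then transcode to UTF-8.
mangled = retag(utf8, "ISO-8859-1").encode("UTF-8")
p mangled.codepoints.map { |cp| format("U+%04X", cp) }
```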

This is a nuanced issue because, to be honest, there's no good reason for any program to read or write an HTML file as US-ASCII rather than UTF-8. Sanitize can only parse UTF-8 input and only generates UTF-8 output, and even if you're not using Sanitize you should still be using a fully UTF-8 stack for reading, writing, and serving HTML or you're asking for trouble.

That said, there are many programs that still don't do this right (at least, not by default). Even Ruby gets this wrong a lot of the time. And lots of developers don't fully understand character encodings and don't realize when they're doing something wrong (and I don't blame them; it's a real pain).

So, while it technically shouldn't be necessary for Sanitize to preserve HTML entities in output, it could fix a lot of headaches.

mwlang · 2015-05-12T14:57:08Z

@rgrove It is a nuanced issue and not strictly the fault of Sanitize. I was just using Ruby to write the output of sanitizing HTML fragments to test/spec fixture files. When I read those files back in with Ruby, the encoding became US-ASCII for every file that had the extended Unicode character sequences in it. As I dug deeper, I found that Ruby's default encoding (Encoding::default_external) is "US-ASCII", and something about the strings was triggering Ruby to either return the sanitized string as "US-ASCII" or at least read the file as US-ASCII. I suspect this is how my issue came to manifest itself.
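
One way to rule out the default external encoding is to pass the encoding explicitly when writing and reading fixture files. The sketch below is an illustration, not a reconstruction of the setup described above; the helper name `round_trip` and its `read_as:` parameter are hypothetical.

```ruby
require "tempfile"

# Write text to a temporary file as UTF-8, then read it back tagged with
# the requested encoding. Returns the string that was read.
def round_trip(text, read_as:)
  Tempfile.create("fixture") do |file|
    File.write(file.path, text, encoding: "UTF-8")
    File.read(file.path, encoding: read_as)
  end
end

text = "\u201CHOT\u201D"

good = round_trip(text, read_as: "UTF-8")
bad  = round_trip(text, read_as: "US-ASCII")

puts good.valid_encoding?  # true
puts bad.valid_encoding?   # false: same bytes, wrong encoding label
```

A string read with the wrong label can be repaired with `force_encoding("UTF-8")`, since the bytes on disk were never changed.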

Anyway, I think the ability to tell the sanitizer "preserve HTML entities" would be a good feature to add for those who need it. In my case, I'd rather keep the named entities because, at the end of the day, I'm going to display the sanitized HTML in a browser, and the browser can display those named entities just fine.

jhubert · 2015-09-10T21:53:25Z

FWIW, I ran into this issue when sanitizing content for email delivery. After sanitizing, I run the content through an encoding conversion process to force ASCII so that older email clients show it properly. This causes curly quotes and apostrophes to disappear. I've worked around it, but thought it might be relevant.
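
For cases like this, one possible post-processing step is to replace each non-ASCII character with a numeric character reference before forcing ASCII, so nothing is dropped. This is a sketch of that idea, not a Sanitize feature; the helper name `to_ascii_with_entities` is hypothetical.

```ruby
# Replace every non-ASCII character with a decimal character reference,
# then tag the result as US-ASCII. Existing markup and entities are left
# untouched because they are already ASCII.
def to_ascii_with_entities(html)
  html
    .gsub(/[^\x00-\x7F]/) { |char| format("&#%d;", char.ord) }
    .encode("US-ASCII")
end

puts to_ascii_with_entities("<p>\u201CHOT\u201D</p>")
# prints <p>&#8220;HOT&#8221;</p>
```

Running this on the sanitized output, rather than on the input, keeps the sanitizer working on real characters while still producing ASCII-only HTML.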

rgrove added the enhancement label Sep 22, 2014

rgrove closed this as completed Jun 4, 2017
