Investigate the possibility of preserving unnecessary entities instead of converting to UTF-8 #116

Closed
rgrove opened this issue Sep 22, 2014 · 4 comments

Comments

@rgrove
Owner

rgrove commented Sep 22, 2014

The Gumbo parser (and Nokogiri in Sanitize < 3.0.0) automatically converts entities to UTF-8 characters when possible. While perfectly valid and in many cases desirable, this causes problems for users who don't have a completely UTF-8 compliant stack.

We should investigate ways to preserve HTML entities in the output to mitigate these kinds of problems.

Related: #115.

@mwlang

mwlang commented May 1, 2015

This would be good to resolve. I'm actually getting invalid byte sequences from what is otherwise perfectly fine HTML content.

Example:

<p>&#8220;HOT&#8221;</p>

converts to

\u00E2\u0080\u009CHOT\u00E2\u0080\u009D

I'm not sure exactly what the mechanics were, but when I wrote the sanitized HTML to a file and then read it back, I ended up with a string encoded as US-ASCII instead of UTF-8, which led to "invalid byte sequence in US-ASCII" errors whenever I attempted to do anything with the output.

To get around this, I basically escaped the "&" with "%AMP%" before feeding it to the sanitizer and then flipped it back. Here's the gist of what I'm doing...

https://gist.github.com/mwlang/5ac24295275242844511
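A minimal sketch of the masking workaround described above. This is not the code from the linked gist: the helper name `sanitize_preserving_entities` and the constant `AMP_PLACEHOLDER` are hypothetical, and the sanitizer is passed in as a block so the sketch runs without the gem installed.

```ruby
# Hypothetical helper: hide every "&" from the sanitizer, then restore it.
AMP_PLACEHOLDER = "%AMP%" # placeholder text taken from the comment above

def sanitize_preserving_entities(html)
  # Mask ampersands so the parser never sees (and never decodes) an entity.
  masked = html.gsub("&", AMP_PLACEHOLDER)

  # The block is expected to run the real sanitizer,
  # e.g. { |h| Sanitize.fragment(h, Sanitize::Config::BASIC) }
  cleaned = yield(masked)

  # Put the ampersands back so the entities reappear unchanged.
  cleaned.gsub(AMP_PLACEHOLDER, "&")
end
```

Two caveats under these assumptions: input that already contains the literal text `%AMP%` comes back with an ampersand in its place, and masked entities are invisible to the sanitizer, so any check it performs on decoded text (for example, on attribute values) never sees them.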

Not sure if this is helpful, but if I'm reading this issue's intent correctly, then hopefully this sheds some light and ideas on the subject.

@rgrove
Owner Author

rgrove commented May 1, 2015

@mwlang Sounds like whatever you used to either write the file or read it back expected US-ASCII and not UTF-8, and the byte sequences you got were mangled as a result.

When you give Sanitize this input:

<p>&#8220;HOT&#8221;</p>

...it produces this valid UTF-8 output:

“HOT”

...but if you then treat those valid UTF-8 bytes as if they were US-ASCII bytes, you'll get the invalid byte sequence you mentioned above.
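The mislabeling described above can be reproduced in plain Ruby without Sanitize. The sketch below is illustrative only: the method names are hypothetical, and the ISO-8859-1 step is just one way to arrive at the `\u00E2\u0080\u009C` sequence reported earlier, not necessarily what happened in that case.

```ruby
# Render a string's bytes as hex so the encoding-independent data is visible.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

# Relabel the same bytes as US-ASCII without changing them.
def mislabel_as_ascii(str)
  str.dup.force_encoding(Encoding::US_ASCII)
end

# Treat each byte as an ISO-8859-1 character, then transcode to UTF-8.
def misread_as_latin1(str)
  str.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

utf8 = "\u201CHOT\u201D"   # curly-quoted HOT, as valid UTF-8

hex_bytes(utf8)                          # => "E2 80 9C 48 4F 54 E2 80 9D"
utf8.valid_encoding?                     # => true
mislabel_as_ascii(utf8).valid_encoding?  # => false (bytes above 0x7F are not ASCII)
misread_as_latin1(utf8) == "\u00E2\u0080\u009CHOT\u00E2\u0080\u009D"  # => true
```

The bytes never change in either case; only the label Ruby attaches to them does.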

This is a nuanced issue because, to be honest, there's no good reason for any program to read or write an HTML file as US-ASCII rather than UTF-8. Sanitize can only parse UTF-8 input and only generates UTF-8 output, and even if you're not using Sanitize you should still be using a fully UTF-8 stack for reading, writing, and serving HTML or you're asking for trouble.

That said, there are many programs that still don't do this right (at least, not by default). Even Ruby gets this wrong a lot of the time. And lots of developers don't fully understand character encodings and don't realize when they're doing something wrong (and I don't blame them; it's a real pain).

So, while it technically shouldn't be necessary for Sanitize to preserve HTML entities in output, it could fix a lot of headaches.

@mwlang

mwlang commented May 12, 2015

@rgrove It is a nuanced issue, and not strictly the fault of Sanitize. I was just using Ruby to write the output of sanitizing HTML fragments to test/spec fixture files. When I read those files back in with Ruby, the encoding became US-ASCII for every file that contained extended Unicode characters. As I dug deeper, I found that Ruby's default external encoding (Encoding::default_external) was "US-ASCII" in my environment, and something about the strings was triggering Ruby to either return the sanitized string as US-ASCII or at least read the file as US-ASCII. I suspect this is how my issue came to manifest itself.
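A sketch of the fixture round trip described above, assuming the files are written and read with `File.write` and `File.read`. Ruby derives `Encoding.default_external` from the process locale, so passing the encoding explicitly removes the dependence on the environment. The `round_trip` helper and the `fixture.html` file name are hypothetical.

```ruby
require "tmpdir"

# Write text as UTF-8, then read it back tagged with the given encoding.
def round_trip(text, read_encoding)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "fixture.html")
    File.write(path, text, encoding: "UTF-8")
    File.read(path, encoding: read_encoding)
  end
end

sanitized = "\u201CHOT\u201D"

good = round_trip(sanitized, "UTF-8")
good == sanitized      # => true
good.encoding          # => #<Encoding:UTF-8>

bad = round_trip(sanitized, "US-ASCII")  # simulates a US-ASCII default
bad.encoding           # => #<Encoding:US-ASCII>
bad.valid_encoding?    # => false
```

Setting `Encoding.default_external = Encoding::UTF_8` early in a test helper is another option, at the cost of changing behavior process-wide.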

Anyway, I think the ability to tell the sanitizer to "preserve HTML entities" would be a good feature to add for those who need it. In my case, I'd rather keep the named entities because, at the end of the day, I'm going to display the sanitized HTML in a browser, and the browser can display those named entities just fine.

@jhubert

jhubert commented Sep 10, 2015

fwiw, I ran into this issue when sanitizing content for email delivery. After sanitizing, I run the content through an encoding conversion that forces ASCII so that older email clients show it properly. This causes curly quotes and apostrophes to disappear. I've worked around it, but thought it might be relevant.
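The conversion process and the workaround mentioned above are not shown, so the following is only a sketch of one possible approach using Ruby's `String#encode`. With `undef: :replace` and an empty replacement, characters outside ASCII are dropped, which matches the disappearing quotes. With the `fallback:` option, each such character can be rewritten as a numeric character reference instead. Both method names are hypothetical.

```ruby
# Force ASCII by discarding anything that has no ASCII equivalent.
def ascii_dropping(str)
  str.encode(Encoding::US_ASCII, undef: :replace, replace: "")
end

# Force ASCII but rewrite non-ASCII characters as decimal HTML entities.
def ascii_with_entities(str)
  str.encode(Encoding::US_ASCII, fallback: ->(char) { "&##{char.ord};" })
end

sanitized = "\u201CHOT\u201D"

ascii_dropping(sanitized)       # => "HOT"
ascii_with_entities(sanitized)  # => "&#8220;HOT&#8221;"
```

The second form keeps the output pure ASCII while leaving the decoding of the entities to the mail client or browser.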

@rgrove rgrove closed this as completed Jun 4, 2017

3 participants