Should strip non-space controls and noncharacters from input prior to parsing #179
Comments
I'm not sure what character you mean by "\b". Can you please specify the actual hex value of the Unicode code point you're referring to? In any event, if you're seeing a "�" somewhere, that's a good sign that either the input was not valid UTF-8 or the output is not being properly displayed or processed as UTF-8.
Um, by "\b" I mean backspace, \u0008.
Thanks for clarifying. I understand the problem now. It looks like Sanitize should probably strip all non-space control characters and non-characters from the input prior to parsing, as described in the HTML standard. We currently strip some, but not all, of these characters, since we were following older guidelines that have since been withdrawn.
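For readers following along, here is a minimal sketch of what stripping non-space controls and noncharacters before parsing could look like. It is not Sanitize's actual implementation: `disallowed_code_point?` and `strip_disallowed` are hypothetical names, the code point ranges are my reading of the control and noncharacter definitions, and the input is assumed to already be valid UTF-8.

```ruby
# Hypothetical helper, not part of Sanitize's API.
# Returns true for non-space control characters and Unicode noncharacters.
def disallowed_code_point?(cp)
  # Controls other than NUL and ASCII whitespace (tab, LF, FF, CR).
  non_space_control =
    (0x01..0x08).cover?(cp) || cp == 0x0B ||
    (0x0E..0x1F).cover?(cp) || (0x7F..0x9F).cover?(cp)

  # Noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
  noncharacter = (0xFDD0..0xFDEF).cover?(cp) || (cp & 0xFFFE) == 0xFFFE

  non_space_control || noncharacter
end

# Hypothetical preprocessing step: drops disallowed code points.
# Assumes `html` is valid UTF-8 (String#ord raises on invalid bytes).
def strip_disallowed(html)
  html.each_char.reject { |ch| disallowed_code_point?(ch.ord) }.join
end

puts strip_disallowed("a\bb\tc").inspect
```

With this sketch, the backspace (U+0008) from the original report is dropped before the parser ever sees it, while tabs and newlines pass through untouched.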
And now Sanitize does not strip out U+2028. Is that intentional?
Yes. I'm not aware of any requirement in the HTML standard for parsers to strip U+2028. Have I missed something? Previously Sanitize was stripping some characters that shouldn't have been stripped and was not stripping some characters that should have been stripped. Now it only strips characters that actually aren't allowed in HTML.
I'm not saying you've missed something. I was just double-checking that it was intentional (before changing expectations in my app). In any case, your answer clears things up. Thank you.
Hello, I think Sanitize#preprocess does not filter "\b", and then Nokogiri turns it into a �. Is this a mistake or normal behavior?