Should strip non-space controls and noncharacters from input prior to parsing #179

Closed
cloudy9101 opened this issue Apr 3, 2018 · 6 comments
@cloudy9101

Hello, I think Sanitize#preprocess does not filter "\b", so Nokogiri turns it into a �. Is this a mistake or normal behavior?

@rgrove
Owner

rgrove commented Apr 3, 2018

I'm not sure what character you mean by "\b". Can you please specify the actual hex value of the Unicode code point you're referring to?

In any event, if you're seeing a "�" somewhere, that's a good sign that either the input was not valid UTF-8 or the output is not being properly displayed or processed as UTF-8.
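One way to answer both points is to print the code points in hex and check the string's encoding validity. This is a minimal Ruby sketch; the helper name `describe_code_points` is hypothetical and not part of Sanitize, and it assumes the input is a valid UTF-8 string.

```ruby
# Hypothetical helper for illustration; not part of Sanitize.
# Returns each character's code point in U+XXXX notation.
def describe_code_points(str)
  str.each_char.map { |ch| format("U+%04X", ch.ord) }
end

input = "a\bb" # in a double-quoted Ruby string, "\b" is backspace

p input.valid_encoding?        # => true
p describe_code_points(input)  # => ["U+0061", "U+0008", "U+0062"]

# A lone 0xFF byte is never valid UTF-8, so this string is flagged as invalid.
p "\xFF".valid_encoding?       # => false
```

If `valid_encoding?` returns true and a "�" still shows up, the character was most likely introduced later, during parsing or display.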

@rgrove rgrove added the question label Apr 3, 2018
@cloudy9101
Author

By "\b" I mean backspace, \u0008.

@rgrove
Owner

rgrove commented Apr 8, 2018

Thanks for clarifying. I understand the problem now.

It looks like Sanitize should probably strip all non-space control characters and non-characters from the input prior to parsing, as described in the HTML standard. We currently strip some, but not all, of these characters, since we were following older guidelines that have since been withdrawn.
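The stripping described above can be sketched in Ruby as follows. This is an illustration only, not Sanitize's actual implementation: the module `InputCleanup` and its methods are hypothetical names. It follows the code point definitions as I understand them from the WHATWG Infra standard: controls are U+0000 to U+001F and U+007F to U+009F, and ASCII whitespace is tab, LF, FF, CR, and space. It also assumes valid UTF-8 input and, as a simplification, removes U+0000 along with the other controls.

```ruby
# Hypothetical module for illustration; not Sanitize's actual implementation.
module InputCleanup
  # ASCII whitespace: TAB, LF, FF, CR, SPACE.
  ASCII_WHITESPACE = [0x09, 0x0A, 0x0C, 0x0D, 0x20].freeze

  # C0 controls (U+0000..U+001F) plus U+007F..U+009F.
  def self.control?(cp)
    cp <= 0x1F || (0x7F..0x9F).cover?(cp)
  end

  # U+FDD0..U+FDEF, plus the last two code points of every plane
  # (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
  def self.noncharacter?(cp)
    (0xFDD0..0xFDEF).cover?(cp) || (cp & 0xFFFE) == 0xFFFE
  end

  # Removes non-space controls and noncharacters; keeps everything else.
  def self.strip_unsuitable(str)
    str.each_char.reject { |ch|
      cp = ch.ord
      (control?(cp) && !ASCII_WHITESPACE.include?(cp)) || noncharacter?(cp)
    }.join
  end
end

p InputCleanup.strip_unsuitable("a\bb")    # => "ab"
p InputCleanup.strip_unsuitable("a\tb\n")  # => "a\tb\n"
```

A per-character filter is slower than a single regular expression, but it keeps the two definitions readable and easy to compare against the standard.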

@rgrove rgrove added bug and removed question labels Apr 8, 2018
@rgrove rgrove changed the title from 'When "\b" was sanitized, it appears a � .' to 'Should strip non-space controls and noncharacters from input prior to parsing' Apr 8, 2018
@rgrove rgrove closed this as completed in 0d4158f Sep 8, 2019
@ixti

ixti commented Oct 6, 2019

And now Sanitize does not strip out U+2028. Is that intentional?

@rgrove
Owner

rgrove commented Oct 7, 2019

Yes. I'm not aware of any requirement in the HTML standard for parsers to strip U+2028. Have I missed something?

Previously Sanitize was stripping some characters that shouldn't have been stripped and was not stripping some characters that should have been stripped. Now it only strips characters that actually aren't allowed in HTML.
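For context, U+2028 LINE SEPARATOR has the Unicode general category Zl, so it is neither a control (Cc) nor a noncharacter. The short Ruby sketch below checks this with Unicode property classes; the helper name `general_category` is hypothetical and only distinguishes the two categories relevant here.

```ruby
# Hypothetical helper for illustration; not part of Sanitize.
# Reports only the two general categories relevant to this thread.
def general_category(ch)
  return "Cc (control)" if ch.match?(/\p{Cc}/)
  return "Zl (line separator)" if ch.match?(/\p{Zl}/)
  "other"
end

p general_category("\b")      # => "Cc (control)"
p general_category("\u2028")  # => "Zl (line separator)"
p general_category("a")       # => "other"
```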

@ixti

ixti commented Oct 7, 2019

I'm not saying you've missed something. Was just double checking that it was intentional (before changing expectations in my app).

In any case your answer clears things up. Thank you.

3 participants