Should strip non-space controls and noncharacters from input prior to parsing #179

cloudy9101 · 2018-04-03T03:15:49Z

Hello. I think Sanitize's preprocessing step does not filter out "\b", and Nokogiri then turns it into a �. Is this a bug or expected behavior?

rgrove · 2018-04-03T04:26:39Z

I'm not sure what character you mean by "\b". Can you please specify the actual hex value of the Unicode code point you're referring to?

In any event, if you're seeing a "�" somewhere, that's a good sign that either the input was not valid UTF-8 or the output is not being properly displayed or processed as UTF-8.
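One way to rule out the first possibility is to check whether the input bytes form valid UTF-8 before handing them to the sanitizer. The following is a minimal sketch; `valid_utf8?` is a hypothetical helper for illustration, not part of Sanitize or Nokogiri.

```ruby
# Hypothetical helper (not part of Sanitize): reports whether the bytes of
# +str+ form valid UTF-8, regardless of the encoding the string is tagged with.
def valid_utf8?(str)
  # dup so the caller's string is not re-tagged (and frozen strings work).
  str.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

valid_utf8?("caf\u00E9") # => true
valid_utf8?("\xFF".b)    # => false (0xFF never appears in UTF-8)
```

If this check passes and a "�" still shows up, the cause is more likely a specific character in the input than a broken encoding.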

cloudy9101 · 2018-04-03T08:29:26Z

Um, by "\b" I mean backspace, \u0008.
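For anyone unsure which code points a string contains, a small sketch like the one below lists them in `U+XXXX` form. `codepoint_labels` is a hypothetical name used only for illustration, and the input is assumed to be validly encoded.

```ruby
# Hypothetical helper: lists each character of +str+ as a U+XXXX label.
# Assumes +str+ is validly encoded; String#ord raises on invalid bytes.
def codepoint_labels(str)
  str.each_char.map { |ch| format("U+%04X", ch.ord) }
end

# In a double-quoted Ruby string, "\b" is the backspace character.
codepoint_labels("a\bb") # => ["U+0061", "U+0008", "U+0062"]
```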

rgrove · 2018-04-08T21:30:31Z

Thanks for clarifying. I understand the problem now.

It looks like Sanitize should probably strip all non-space control characters and non-characters from the input prior to parsing, as described in the HTML standard. We currently strip some, but not all, of these characters, since we were following older guidelines that have since been withdrawn.
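As a rough illustration of that idea, the sketch below removes controls other than ASCII whitespace, plus Unicode noncharacters, from a string. It is not Sanitize's actual implementation: the method names are hypothetical, the input is assumed to be valid UTF-8, and treating U+0000 as strippable is an assumption of this sketch.

```ruby
# Sketch only; names are hypothetical and this is not Sanitize's own code.

# ASCII whitespace as defined by the HTML standard: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = [0x09, 0x0A, 0x0C, 0x0D, 0x20].freeze

# Controls: C0 controls (U+0000..U+001F) and U+007F..U+009F.
def control?(cp)
  cp <= 0x1F || (0x7F..0x9F).cover?(cp)
end

# Noncharacters: U+FDD0..U+FDEF and the last two code points of every plane
# (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFF).
def noncharacter?(cp)
  (0xFDD0..0xFDEF).cover?(cp) || (cp & 0xFFFE) == 0xFFFE
end

def unsuitable?(cp)
  noncharacter?(cp) || (control?(cp) && !ASCII_WHITESPACE.include?(cp))
end

# Assumes +str+ is valid UTF-8; String#ord raises on invalid bytes.
def strip_unsuitable(str)
  str.each_char.reject { |ch| unsuitable?(ch.ord) }.join
end

strip_unsuitable("a\bb")     # => "ab"
strip_unsuitable("a\tb\nc")  # => "a\tb\nc" (whitespace is kept)
```

A per-character predicate is used here for readability; a single precompiled regular expression would likely be faster on large inputs.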

ixti · 2019-10-06T23:45:04Z

And now Sanitize does not strip out U+2028. Is that intentional?

rgrove · 2019-10-07T00:11:24Z

Yes. I'm not aware of any requirement in the HTML standard for parsers to strip U+2028. Have I missed something?

Previously Sanitize was stripping some characters that shouldn't have been stripped and was not stripping some characters that should have been stripped. Now it only strips characters that actually aren't allowed in HTML.
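To see why U+2028 falls outside that rule, note that it is a line separator (general category Zl), not a control character (category Cc). The sketch below checks this with Ruby's Unicode property classes and shows one way an application could remove U+2028 and U+2029 itself if it still wants them gone. The helper names are hypothetical and are not part of Sanitize.

```ruby
# Hypothetical helpers for illustration; not part of Sanitize.

# True if +ch+ is a control character (Unicode general category Cc).
def control_char?(ch)
  ch.match?(/\p{Cc}/)
end

# True if +ch+ is a line separator (Zl) or paragraph separator (Zp).
def line_or_paragraph_separator?(ch)
  ch.match?(/[\p{Zl}\p{Zp}]/)
end

# App-level cleanup for callers that still want these characters removed.
def strip_unicode_line_breaks(str)
  str.delete("\u2028\u2029")
end

control_char?("\b")                    # => true
control_char?("\u2028")                # => false
line_or_paragraph_separator?("\u2028") # => true
strip_unicode_line_breaks("a\u2028b")  # => "ab"
```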

ixti · 2019-10-07T00:18:55Z

I'm not saying you've missed something. I was just double-checking that it was intentional (before changing expectations in my app).

In any case your answer clears things up. Thank you.

rgrove added the question label Apr 3, 2018

rgrove added bug and removed question labels Apr 8, 2018

rgrove changed the title ~~When "\b" was sanitized, it appears a � .~~ Should strip non-space controls and noncharacters from input prior to parsing Apr 8, 2018

rgrove closed this as completed in 0d4158f Sep 8, 2019

dependabot bot mentioned this issue Mar 15, 2021

Bump sanitize from 4.0.1 to 5.2.1 rapid7/github-connector#18

Closed
