The escape_html_attr() helper currently escapes invalid UTF-16 code units (i.e. unpaired surrogates). Its sibling escape_json_in_html() doesn't.
Do we actually need to escape them? They are technically invalid according to the HTML spec, but they are treated as a soft error, and as far as I could find, browsers just treat them as unknown characters without any further issues.
I guess this could also depend on the platform we're running on, since the UTF-16 to UTF-8 conversion usually happens there (from what I could gather). Do we know of any issues caused by passing invalid UTF-16 to any platform?
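To make the conversion question concrete, here is a minimal sketch of what one standard encoder does with an unpaired surrogate. It only uses the WHATWG `TextEncoder`/`TextDecoder` globals (available in browsers and recent Node.js); the variable names are illustrative, and other platforms or serializers may behave differently.

```javascript
// A lone high surrogate: a legal JavaScript string, but not well-formed UTF-16.
const lone = "a\uD800b";

// TextEncoder always produces valid UTF-8, so the unpaired surrogate
// is replaced with U+FFFD (bytes EF BF BD) during encoding.
const bytes = new TextEncoder().encode(lone);
const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");

// Decoding the bytes again yields the replacement character, not the surrogate.
const roundTripped = new TextDecoder().decode(bytes);

console.log(hex); // 61 ef bf bd 62
console.log(roundTripped === "a\uFFFDb"); // true
```

If the platform encodes the response this way, the surrogate never reaches the browser as such, only U+FFFD does.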
If there are no known issues, we could drop the escaping from attributes. If there are, they would most likely also affect text nodes (i.e. the contents of <script>), in which case we'd also need the escaping in escape_json_in_html. This is actually already the case.
If it turns out they're needed, I have a regex-based version ready that improves performance quite a bit over the character-by-character loop.
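For illustration, a regex-based approach could look roughly like the sketch below, shown next to a character-by-character loop for comparison. This is not the version mentioned above: the function names are hypothetical, and escaping to a decimal numeric character reference (`&#NNNNN;`) is an assumption about the output format.

```javascript
// Hypothetical sketch, not the project's actual helper.
// Alternation order matters: a valid surrogate pair is matched first and
// passed through, so only unpaired surrogates reach the second branch.
const SURROGATES = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g;

function escapeUnpairedSurrogates(str) {
  return str.replace(SURROGATES, (match) =>
    // Assumption: unpaired surrogates become decimal character references.
    match.length === 2 ? match : `&#${match.charCodeAt(0)};`
  );
}

// Character-by-character reference implementation with the same behavior.
function escapeUnpairedSurrogatesLoop(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    const isHigh = code >= 0xd800 && code <= 0xdbff;
    const isLow = code >= 0xdc00 && code <= 0xdfff;
    if (isHigh && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Valid pair: copy both code units untouched and skip the low half.
        out += str[i] + str[i + 1];
        i++;
        continue;
      }
    }
    out += isHigh || isLow ? `&#${code};` : str[i];
  }
  return out;
}

console.log(escapeUnpairedSurrogates("a\uD800b")); // a&#55296;b
```

The regex omits the `u` flag on purpose, so that it matches individual UTF-16 code units rather than whole code points.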
Huh, you're absolutely right: the reproduction contains replacement characters, not the invalid surrogates. I must have mixed up which bytes I was looking at when examining the response (I also had some as props on the page for testing, and those are left alone). Sorry about the confusion.
In that case, both helpers already replace them. I don't think we should use JSON.stringify, though, because that would mean we'd also have to parse the values when reading them from the DOM. Let me know if you think otherwise.
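A small sketch of the JSON.stringify trade-off, assuming an engine with well-formed JSON.stringify (ES2019 and later); the variable names are illustrative:

```javascript
// A value containing double quotes and an unpaired surrogate.
const value = 'say "hi" \uD800';

// JSON.stringify escapes the lone surrogate as a \ud800 sequence,
// but it also wraps the value in quotes and backslash-escapes inner quotes.
const json = JSON.stringify(value);

// Whoever reads the value back has to undo that encoding.
const restored = JSON.parse(json);

console.log(json); // "say \"hi\" \ud800"
console.log(restored === value); // true
```

The output also still contains literal `"` characters, so it would need HTML attribute escaping on top before it could be placed inside a double-quoted attribute.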
In the meantime I'll have a PR ready with the regex-based version for escape_html_attr. A quick benchmark showed it's about the same speed as JSON.stringify for small strings (the usual case), and about twice as fast on a big string (the single-page HTML spec; likely because stringify also escapes \ and there are a few instances of it there). It takes 1/10 to 1/100 of the time of the character-by-character version, depending on input size.
Describe the bug
#4014 follow up.
Reproduction
n/a
Logs
No response
System Info
Severity
annoyance
Additional Information
No response