Utf16 handling in html attributes and text #4016

mrkishi · 2022-02-20T01:22:08Z

Describe the bug

#4014 follow up.

The escape_html_attr() helper currently escapes invalid utf16 code units (ie. unpaired surrogates). ~~Its sibling escape_json_in_html() doesn't.~~

Do we actually need to escape them? They are technically invalid according to the HTML spec, but treated as a soft-error, and apparently browsers will just treat them as unknown characters without any extra issues that I could find.

I guess this could also be affected by whatever platform we're running on, since the utf16 to utf8 conversion usually happens there (from what I could gather?). Do we know of any issues from passing invalid utf16 to any platform?

If there are no known issues, we could drop the escaping from attributes.
~~If there are, it's unlikely that they wouldn't also affect text nodes (ie. <script>'s contents), in that case we'd also need that on escape_json_in_html.~~ This is actually already the case.

If it turns out they're needed, I have a regex-based version ready that improves performance quite a bit over the character-by-character loop.

Reproduction

n/a

Logs

No response

System Info

System:
    OS: Windows 10 10.0.19044
    CPU: (4) x64 Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
    Memory: 1.04 GB / 7.88 GB
  Binaries:
    Node: 17.4.0 - ~\scoop\apps\nodejs\current\node.EXE
    npm: 8.4.0 - ~\scoop\apps\nodejs\current\bin\npm.CMD
  Browsers:
    Chrome: 98.0.4758.102
    Edge: Spartan (44.19041.1266.0), Chromium (98.0.1108.50)
  npmPackages:
    @sveltejs/eslint-config: github:sveltejs/eslint-config#v5.8.0 => 5.8.0

Severity

annoyance

Additional Information

No response

The text was updated successfully, but these errors were encountered:

dominikg · 2022-02-20T08:32:10Z

escape_json_in_html uses JSON.stringify which shouldn't produce invalid tc39/ecma262#1396

so maybe that could be used in the attr escape too?

mrkishi · 2022-02-20T12:37:47Z

Huh, you're absolutely right — the reproduction contains replacement characters, not the invalid surrogates. I must've messed up which bytes I looked into when examining the response (I had some as props on the page as well for testing, and these are left alone). Sorry about the confusion.

In that case, both already replace them. I don't think we should use JSON.stringify, though, because that would mean we'd also have to parse them from the DOM. Let me know if otherwise.

In the meantime I'll have a PR ready with the regex-based version for escape_html_attr. A quick benchmark revealed it's about the same speed as JSON.stringify for small strings (the usual case), and about twice as fast on a big string (the single-page html spec; likely because stringify escapes \ and there are a few instances there). It takes 1/10 to 1/100 the time of the character-by-character version depending on input size.

mrkishi mentioned this issue Feb 20, 2022

Improve escape_html_attr performance #4024

Merged

5 tasks

Rich-Harris closed this as completed in #4024 Feb 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Utf16 handling in html attributes and text #4016

Utf16 handling in html attributes and text #4016

mrkishi commented Feb 20, 2022 •

edited

dominikg commented Feb 20, 2022

mrkishi commented Feb 20, 2022

Utf16 handling in html attributes and text #4016

Utf16 handling in html attributes and text #4016

Comments

mrkishi commented Feb 20, 2022 • edited

Describe the bug

Reproduction

Logs

System Info

Severity

Additional Information

dominikg commented Feb 20, 2022

mrkishi commented Feb 20, 2022

mrkishi commented Feb 20, 2022 •

edited