Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Utf16 handling in html attributes and text #4016

Closed
mrkishi opened this issue Feb 20, 2022 · 2 comments · Fixed by #4024
Closed

Utf16 handling in html attributes and text #4016

mrkishi opened this issue Feb 20, 2022 · 2 comments · Fixed by #4024

Comments

@mrkishi
Copy link
Member

mrkishi commented Feb 20, 2022

Describe the bug

#4014 follow up.

The escape_html_attr() helper currently escapes invalid utf16 code units (ie. unpaired surrogates). Its sibling escape_json_in_html() doesn't.

Do we actually need to escape them? They are technically invalid according to the HTML spec, but treated as a soft-error, and apparently browsers will just treat them as unknown characters without any extra issues that I could find.

I guess this could also be affected by whatever platform we're running on, since the utf16 to utf8 conversion usually happens there (from what I could gather?). Do we know of any issues from passing invalid utf16 to any platform?

If there are no known issues, we could drop the escaping from attributes.
If there are, it's unlikely that they wouldn't also affect text nodes (ie. <script>'s contents), in that case we'd also need that on escape_json_in_html. This is actually already the case.

If it turns out they're needed, I have a regex-based version ready that improves performance quite a bit over the character-by-character loop.

Reproduction

n/a

Logs

No response

System Info

System:
    OS: Windows 10 10.0.19044
    CPU: (4) x64 Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
    Memory: 1.04 GB / 7.88 GB
  Binaries:
    Node: 17.4.0 - ~\scoop\apps\nodejs\current\node.EXE
    npm: 8.4.0 - ~\scoop\apps\nodejs\current\bin\npm.CMD
  Browsers:
    Chrome: 98.0.4758.102
    Edge: Spartan (44.19041.1266.0), Chromium (98.0.1108.50)
  npmPackages:
    @sveltejs/eslint-config: github:sveltejs/eslint-config#v5.8.0 => 5.8.0

Severity

annoyance

Additional Information

No response

@dominikg
Copy link
Member

escape_json_in_html uses JSON.stringify which shouldn't produce invalid tc39/ecma262#1396

so maybe that could be used in the attr escape too?

@mrkishi
Copy link
Member Author

mrkishi commented Feb 20, 2022

Huh, you're absolutely right — the reproduction contains replacement characters, not the invalid surrogates. I must've messed up which bytes I looked into when examining the response (I had some as props on the page as well for testing, and these are left alone). Sorry about the confusion.

In that case, both already replace them. I don't think we should use JSON.stringify, though, because that would mean we'd also have to parse them from the DOM. Let me know if otherwise.

In the meantime I'll have a PR ready with the regex-based version for escape_html_attr. A quick benchmark revealed it's about the same speed as JSON.stringify for small strings (the usual case), and about twice as fast on a big string (the single-page html spec; likely because stringify escapes \ and there are a few instances there). It takes 1/10 to 1/100 the time of the character-by-character version depending on input size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants