Allow \u escape in string literal to encode surrogate code point #23102
andersen wants to merge 6 commits into ziglang:master
Mark unreachable code as unreachable (code point already checked to be in range). Co-authored-by: Veikka Tuominen <git@vexu.eu>
I added a test which appears to fail during bootstrapping, as older code is used to parse the strings:
    try expect(eql(u8, "\u{d800}", try parseAlloc(alloc, "\"\u{d800}\"")));
Changing it to the following solves the problem:
    try expect(eql(u8, "\xed\xa0\x80", try parseAlloc(alloc, "\"\\u{d800}\"")));
I suspect some of the preceding tests would also benefit from escaped (i.e., doubled) backslashes on the right-hand side (as arguments to parseAlloc), and maybe
\u{d800} spelt out as \xed\xa0\x80 or escaped as argument to parseAlloc
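For reference, the bytes \xed\xa0\x80 are what the ordinary three-byte UTF-8 bit layout produces for U+D800 when surrogates are not rejected (the encoding often called WTF-8). A minimal sketch of that arithmetic follows. Note that encode3 is a hypothetical helper written only for illustration; it is not a function from the Zig standard library or from this patch.

```zig
const std = @import("std");

/// Hypothetical helper: encode a code point in U+0800..U+FFFF as three bytes
/// using the UTF-8 bit layout, without rejecting surrogates.
fn encode3(cp: u16) [3]u8 {
    std.debug.assert(cp >= 0x800);
    return .{
        // 1110xxxx: top four bits of the code point
        @as(u8, 0xE0) | @as(u8, @intCast(cp >> 12)),
        // 10xxxxxx: middle six bits
        @as(u8, 0x80) | @as(u8, @intCast((cp >> 6) & 0x3F)),
        // 10xxxxxx: low six bits
        @as(u8, 0x80) | @as(u8, @intCast(cp & 0x3F)),
    };
}

test "U+D800 encodes to ED A0 80" {
    const bytes = encode3(0xD800);
    try std.testing.expectEqualSlices(u8, "\xed\xa0\x80", &bytes);
}
```

Spelling the expected value as \xed\xa0\x80 keeps the left-hand side of the test independent of whether the compiler building the test already accepts \u{d800}.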
(Just updated my own test for now.)
Removed spaces to conform to zig fmt
Nice! Didn't realize this proposal got accepted. The cc @mnemnion
@Vexu: These are the lines that look like they should have
    try expect(eql(u8, "foo", try parseAlloc(alloc, "\"f\x6f\x6f\"")));
    try expect(eql(u8, "f💯", try parseAlloc(alloc, "\"f\u{1f4af}\"")));
However, the tests still work, and this has nothing to do with the surrogate escape issue, so I am not sure whether you want these tests to be changed and, if so, whether a separate pull request would be more appropriate.
@squeek502: Yes, simplifying those Unicode test cases would be nice. Is there a policy for updating zig1.wasm?
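The distinction at issue is which layer decodes the escape. With a single backslash, the compiler resolves the escape before parseAlloc ever sees the string; with a doubled backslash, the escape text is passed through and parseAlloc has to parse it itself. A small sketch of the difference, using only string literal lengths so that it does not depend on parseAlloc:

```zig
const std = @import("std");

test "single versus doubled backslash" {
    // The compiler resolves "\x6f" to a single byte, 'o'.
    try std.testing.expectEqual(@as(usize, 1), "\x6f".len);
    try std.testing.expectEqualStrings("o", "\x6f");

    // "\\x6f" stays as four bytes: '\', 'x', '6', 'f'.
    // This is the form that actually exercises an escape parser.
    try std.testing.expectEqual(@as(usize, 4), "\\x6f".len);
    try std.testing.expectEqualStrings(&[_]u8{ '\\', 'x', '6', 'f' }, "\\x6f");
}
```

This is why the single-backslash tests still pass: the parser under test receives already-decoded bytes and simply copies them through.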
It would be good to fix the tests; I'm fine either way.
It is updated when needed, while avoiding too-frequent updates that would bloat the repo. This PR doesn't require updating it. If you want to use the escape for surrogates, you can open a new PR that doesn't pass the CI, wait for zig1.wasm to be updated by some other change, and then rebase that new PR.
Thank you for your clear reply! I fixed the tests here as it is a tiny change.
Patch to enable the use of \u escapes in string literals to encode surrogate code points as per issue #20270. Please review and let me know of any changes needed.
(The last comment suggests outlawing raw C1 controls U+80..U+9F from literals. I can work on that separately if that change is also considered accepted.)