Allow \u escape in string literal to encode surrogate code point by andersen · Pull Request #23102 · ziglang/zig

andersen · 2025-03-05T12:38:55Z

Patch to enable the use of \u escapes in string literals to encode surrogate code points as per issue #20270. Please review and let me know of any changes needed.

(The last comment on that issue suggests outlawing raw C1 control characters, U+0080..U+009F, in literals. I can work on that separately if that change is also considered accepted.)

andersen · 2025-03-05T16:58:36Z

I added a test that appears to fail during bootstrapping, because older code is used to parse the string literals:

     try expect(eql(u8, "\u{d800}", try parseAlloc(alloc, "\"\u{d800}\"")));

Changing it to the following solves the problem:

     try expect(eql(u8, "\xed\xa0\x80", try parseAlloc(alloc, "\"\\u{d800}\"")));

I suspect some of the preceding tests would also benefit from escaped (i.e., doubled) backslashes on the right-hand side (as arguments to parseAlloc), and maybe \u{1234} on the left-hand side is better spelt out as \xE1\x88\xB4 as well.
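For reference, the byte sequences quoted above follow from the standard three-byte UTF-8 layout applied directly to the code point. A minimal sketch, assuming a hypothetical helper `encodeThreeByte` (it is not part of std and is not the code in this patch):

```zig
const std = @import("std");

/// Hypothetical helper for illustration only: applies the three-byte
/// UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) to a code point in
/// U+0800..U+FFFF without rejecting surrogates.
fn encodeThreeByte(cp: u21) [3]u8 {
    std.debug.assert(cp >= 0x800 and cp <= 0xFFFF);
    return .{
        @as(u8, 0xE0) | @as(u8, @intCast(cp >> 12)),
        @as(u8, 0x80) | @as(u8, @intCast((cp >> 6) & 0x3F)),
        @as(u8, 0x80) | @as(u8, @intCast(cp & 0x3F)),
    };
}

test "U+D800 and U+1234 spelt out as bytes" {
    const surrogate = encodeThreeByte(0xD800);
    try std.testing.expectEqualSlices(u8, "\xed\xa0\x80", &surrogate);

    const ordinary = encodeThreeByte(0x1234);
    try std.testing.expectEqualSlices(u8, "\xE1\x88\xB4", &ordinary);
}
```

Writing the left-hand side as explicit `\x` bytes keeps the expected value independent of how the compiler building the test handles `\u` escapes.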

andersen · 2025-03-05T17:03:14Z

(Just updated my own test for now.)

squeek502 · 2025-03-05T22:36:08Z

Nice! Didn't realize this proposal got accepted. The std.unicode test cases could be cleaned up using this but I believe that'll need a zig1.wasm update, so I'll save that for a follow up.

cc @mnemnion

andersen · 2025-03-10T19:11:41Z

> I suspect some of the preceding tests would also benefit from escaped (i.e., doubled) backslashes on the right-hand side

@Vexu: These are the lines that look like they should have \\x (twice) and \\u to test parseAlloc directly:

     try expect(eql(u8, "foo", try parseAlloc(alloc, "\"f\x6f\x6f\"")));
     try expect(eql(u8, "f💯", try parseAlloc(alloc, "\"f\u{1f4af}\"")));

However, the tests still work, and this has nothing to do with the surrogate escape issue, so I am not sure whether you want these tests to be changed and, if so, whether a separate pull request would be more appropriate.
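The difference between the two spellings can be seen without calling parseAlloc at all. A small sketch (the constant names are illustrative) comparing the bytes the parser would actually receive in each case:

```zig
const std = @import("std");

test "doubled backslashes reach the parser intact" {
    // Single backslashes: the compiler resolves \x6f itself, so the
    // parser input is already the five bytes "foo" (quotes included).
    const resolved_by_compiler = "\"f\x6f\x6f\"";
    // Doubled backslashes: the escapes survive as literal text, so the
    // parser under test is the one that has to decode them.
    const left_for_parser = "\"f\\x6f\\x6f\"";

    try std.testing.expectEqualStrings("\"foo\"", resolved_by_compiler);
    try std.testing.expectEqual(@as(usize, 5), resolved_by_compiler.len);
    try std.testing.expectEqual(@as(usize, 11), left_for_parser.len);
    try std.testing.expect(left_for_parser[2] == '\\');
}
```

With single backslashes the tests still pass, but they only check that parseAlloc copies already-decoded bytes through unchanged.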

@squeek502: Yes, simplifying those Unicode test cases would be nice. Is there a policy for updating zig1.wasm?

Vexu · 2025-03-10T23:36:56Z

> so I am not sure whether you want these tests to be changed and, if so, whether a separate pull request would be more appropriate.

It would be good to fix the tests; I'm fine either way.

> Is there a policy for updating zig1.wasm?

It is updated when needed, while avoiding updates so frequent that they would bloat the repo. This PR doesn't require updating it.

If you want to use the escape for surrogates, you can open a new PR (which will not pass CI at first), wait for zig1.wasm to be updated by some other change, and then rebase the new PR.

andersen · 2025-03-11T12:31:10Z

Thank you for your clear reply! I fixed the tests here as it is a tiny change.
I think a separate issue should be opened for the question of raw C1 characters mentioned in #20270 (q.v. for details) and that the changes in this pull request otherwise resolve the issue.
As for the test cases in std.unicode, I shall let @squeek502 open a pull request for that unless I hear otherwise as he has already done the work.

Allow \u escape in string literal to encode surrogate code point.

a022962

Vexu reviewed Mar 5, 2025

Comment thread lib/std/zig/string_literal.zig Outdated

andersen and others added 2 commits March 5, 2025 15:03

Unreachable catch

d5d4fc9

Mark unreachable code as unreachable (code point already checked to be in range). Co-authored-by: Veikka Tuominen <git@vexu.eu>

Do not shorten utf8EncodeAllowSurrogates

87e5d41

Updated test to work during bootstrapping

cc7f1b7

\u{d800} spelt out as \xed\xa0\x80 or escaped as argument to parseAlloc

andersen changed the title ~~Allow \u escape in string literal to encode surrogate code point #20270~~ Allow \u escape in string literal to encode surrogate code point Mar 5, 2025

zig fmt compliance

b087632

Removed spaces to conform to zig fmt

Fix missing backslash escaping (doubling) in existing tests

bfcec17

Vexu approved these changes Mar 11, 2025

andersen wants to merge 6 commits into ziglang:master from andersen:surrogates

3 participants
