Skip to content

Allow \u escape in string literal to encode surrogate code point#23102

Open
andersen wants to merge 6 commits intoziglang:masterfrom
andersen:surrogates
Open

Allow \u escape in string literal to encode surrogate code point#23102
andersen wants to merge 6 commits intoziglang:masterfrom
andersen:surrogates

Conversation

@andersen
Copy link
Copy Markdown

@andersen andersen commented Mar 5, 2025

Patch to enable the use of \u escapes in string literals to encode surrogate code points as per issue #20270. Please review and let me know of any changes needed.

(The last comment suggests outlawing raw C1 controls U+80..U+9F from literals. I can work on that separately if that change is also considered accepted.)

Comment thread lib/std/zig/string_literal.zig Outdated
andersen and others added 2 commits March 5, 2025 15:03
Mark unreachable code as unreachable (code point already checked to be in range).

Co-authored-by: Veikka Tuominen <git@vexu.eu>
@andersen
Copy link
Copy Markdown
Author

andersen commented Mar 5, 2025

I added a test which appears to fail during bootstrapping as older code is used to parse the strings:

try expect(eql(u8, "\u{d800}", try parseAlloc(alloc, "\"\u{d800}\"")));

Changing it to the following solves the problem:

try expect(eql(u8, "\xed\xa0\x80", try parseAlloc(alloc, "\"\\u{d800}\"")));

I suspect some of the preceding tests would also benefit from escaped (i.e., doubled) backslashes on the right-hand side (as arguments to parseAlloc), and maybe \u{1234} on the left-hand side is better spelt out as \xE1\x88\xB4 as well.

\u{d800} spelt out as \xed\xa0\x80 or escaped as argument to parseAlloc
@andersen
Copy link
Copy Markdown
Author

andersen commented Mar 5, 2025

(Just updated my own test for now.)

@andersen andersen changed the title Allow \u escape in string literal to encode surrogate code point #20270 Allow \u escape in string literal to encode surrogate code point Mar 5, 2025
Removed spaces to conform to zig fmt
@squeek502
Copy link
Copy Markdown
Member

squeek502 commented Mar 5, 2025

Nice! Didn't realize this proposal got accepted. The std.unicode test cases could be cleaned up using this but I believe that'll need a zig1.wasm update, so I'll save that for a follow up.

cc @mnemnion

@andersen
Copy link
Copy Markdown
Author

andersen commented Mar 10, 2025

I suspect some of the preceding tests would also benefit from escaped (i.e., doubled) backslashes on the right-hand side

@Vexu: These are the lines that look like they should have \\x (twice) and \\u to test parseAlloc directly:

     try expect(eql(u8, "foo", try parseAlloc(alloc, "\"f\x6f\x6f\"")));
     try expect(eql(u8, "f💯", try parseAlloc(alloc, "\"f\u{1f4af}\"")));

However, the tests still work, and this has nothing to do with the surrogate escape issue, so I am not sure whether you want these tests to be changed and, if so, whether a separate pull request would be more appropriate.

@squeek502: Yes, simplifying those Unicode test cases would be nice. Is there a policy for updating zig1.wasm?

@Vexu
Copy link
Copy Markdown
Member

Vexu commented Mar 10, 2025

so I am not sure whether you want these tests to be changed and, if so, whether a separate pull request would be more appropriate.

It would be good to fix the tests, I'm fine with either way.

Is there a policy for updating zig1.wasm?

It is updated when needed while avoiding too frequent updates that would bloat the repo. This PR doesn't require updating it.

If you want to use the escape for surrogates you can open a new PR that doesn't pass the CI and wait for zig1.wasm to be updated by some other change and then rebase that new PR.

@andersen
Copy link
Copy Markdown
Author

Thank you for your clear reply! I fixed the tests here as it is a tiny change.
I think a separate issue should be opened for the question of raw C1 characters mentioned in #20270 (q.v. for details) and that the changes in this pull request otherwise resolves the issue.
As for the test cases in std.unicode, I shall let @squeek502 open a pull request for that unless I hear otherwise as he has already done the work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants