Unicode escapes: support u{N...} #2823

hryx · 2019-07-05T06:00:31Z

TODO

stage2 tokenizer
stage2 parser test
stage1 tokenizer
behavior tests
update documentation examples and grammar
update grammar in zig-spec (Update rule for unicode escape, zig-spec#8)

Notes

Any number of digits (one or more) is allowed in the braces. The stage1 tokenizer retains the upper limit of 0x10ffff on the character value.
The old \uNNNN and \UNNNNNN syntaxes were removed.
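To make the rule in the notes concrete, here is a minimal C++ sketch of parsing a complete `\u{N...}` escape: one or more hex digits inside the braces, with values above 0x10ffff rejected. It is not the code from this PR; `hex_value` and `parse_unicode_escape` are hypothetical names, and the real tokenizers are state machines that consume one character at a time rather than a whole string.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: value of a hex digit, or -1 if c is not a hex digit.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: parses text of the form \u{N...} and returns the
// character value, or std::nullopt if the text is malformed, has no digits,
// or encodes a value above 0x10ffff.
std::optional<uint32_t> parse_unicode_escape(std::string_view s) {
    // Shortest valid input is \u{0}, which is 5 characters.
    if (s.size() < 5 || s.substr(0, 3) != "\\u{" || s.back() != '}') {
        return std::nullopt;
    }
    uint32_t code = 0;
    for (char c : s.substr(3, s.size() - 4)) {
        int v = hex_value(c);
        if (v < 0) return std::nullopt;
        code = code * 16 + static_cast<uint32_t>(v);
        // Checking after every digit keeps the accumulator from overflowing,
        // no matter how many digits appear in the braces.
        if (code > 0x10ffff) return std::nullopt;
    }
    return code;
}
```

Under these assumptions, `\u{41}` and `\u{000041}` both yield 0x41, while `\u{}`, `\u{110000}`, and the removed `\u0041` form are rejected.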

hryx · 2019-07-05T06:01:42Z

doc/langref.html.in

-        <tr>
-            <td><code>\UNNNNNN</code></td>
-          <td>hexadecimal 24-bit Unicode character code UTF-8 encoded (6 digits)</td>
+            <td><code>\u{NNNNNN}</code></td>


Not sure what the clearest way to write this is. Could also be something like:

\u{N...}

I think the "1 or more digits" you have below is sufficient

hryx · 2019-07-05T06:02:31Z

std/zig/tokenizer.zig

                    },
                    else => {
                        state = State.CharLiteralEnd;
                    },
                },

                State.CharLiteralHexEscape => switch (c) {
-                    '0'...'9', 'a'...'z', 'A'...'F' => {


I assume this was a bug (found when new tests were added)

yep. thanks!
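For readers skimming the hunk, the problem with the removed line is that `'a'...'z'` accepts letters past `f` as hex digits. A small C++ sketch of the difference follows; the function names are hypothetical, and the corrected range is an assumption, since the replacement line is not shown in the hunk above.

```cpp
#include <cassert>

// Mirrors the removed pattern: '0'...'9', 'a'...'z', 'A'...'F'.
constexpr bool is_hex_digit_buggy(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'F');
}

// Assumed fix: the lowercase range stops at 'f', matching the uppercase one.
constexpr bool is_hex_digit(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}

static_assert(is_hex_digit_buggy('g'), "the old range wrongly accepts 'g'");
static_assert(!is_hex_digit('g'), "the narrowed range rejects 'g'");
```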

hryx · 2019-07-05T06:07:01Z

std/zig/tokenizer.zig

+                    },
+                },
+
+                State.CharLiteralUnicodeInvalid => switch (c) {


I got a little creative here because I thought this behavior might prevent some confusing error output. If it doesn't actually help, I'd be totally fine removing this special state.

Let's run with this and see what happens.

daurnimator · 2019-07-05T06:24:36Z

src/tokenizer.cpp

+                            break;
+                        }
+                        if (t.char_code > 0x10ffff) {
+                            tokenize_error(&t, "unicode value out of range: %x", t.char_code);


Move this down to the else below?

andrewrk

Looks great, easy merge

andrewrk · 2019-07-06T17:10:48Z

doc/langref.html.in

-        <tr>
-            <td><code>\UNNNNNN</code></td>
-          <td>hexadecimal 24-bit Unicode character code UTF-8 encoded (6 digits)</td>
+            <td><code>\u{NNNNNN}</code></td>


I think the "1 or more digits" you have below is sufficient

andrewrk · 2019-07-06T17:12:08Z

std/zig/tokenizer.zig

                    },
                    else => {
                        state = State.CharLiteralEnd;
                    },
                },

                State.CharLiteralHexEscape => switch (c) {
-                    '0'...'9', 'a'...'z', 'A'...'F' => {


yep. thanks!

andrewrk · 2019-07-06T17:13:25Z

std/zig/tokenizer.zig

+                    },
+                },
+
+                State.CharLiteralUnicodeInvalid => switch (c) {


Let's run with this and see what happens.

shawnl · 2019-07-20T13:55:52Z

On neither stage1 nor stage2 did you reject UTF-16 surrogates, 0xd800 - 0xdfff.

hryx · 2019-07-20T20:13:24Z

@shawnl The purpose of this PR was to change the grammar, not introduce new validation logic.
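For context, the check being discussed was not part of this PR. If it were added later, it could look roughly like the C++ sketch below; `is_surrogate` and `is_unicode_scalar_value` are hypothetical names, not functions from either tokenizer.

```cpp
#include <cassert>
#include <cstdint>

// True for code points in the UTF-16 surrogate range 0xd800 - 0xdfff.
constexpr bool is_surrogate(uint32_t cp) {
    return cp >= 0xd800 && cp <= 0xdfff;
}

// Combines the existing 0x10ffff upper limit with surrogate rejection.
constexpr bool is_unicode_scalar_value(uint32_t cp) {
    return cp <= 0x10ffff && !is_surrogate(cp);
}
```

Surrogates are excluded here because they are not Unicode scalar values and cannot appear in well-formed UTF-8.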

hryx added 3 commits July 4, 2019 14:48

Unicode escapes: stage2 tokenizer and parser test

8365a7a

Unicode escapes: stage1 tokenizer and behavior tests

6bfa854

Unicode escapes: documentation and grammar

e35d49c

hryx commented Jul 5, 2019

daurnimator requested changes Jul 5, 2019

andrewrk approved these changes Jul 6, 2019

andrewrk merged commit 21c6092 into ziglang:master Jul 6, 2019

hryx deleted the unicode-escape branch July 20, 2019 19:59
