Unicode escapes: support u{N...} #2823

Merged: 3 commits, Jul 6, 2019

hryx commented Jul 5, 2019

Closes #2129

TODO

Notes

  • Any number of digits (one or more) is allowed in the braces. The stage1 tokenizer retains its upper limit of 0x10ffff on the character value.
  • The old \uNNNN and \UNNNNNN syntaxes were removed.
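To make the two notes above concrete, here is a minimal sketch of how a `\u{N...}` escape could be parsed under these rules: one or more hex digits inside braces, with the value capped at 0x10ffff. The function name `parse_unicode_escape` and its signature are hypothetical and chosen for illustration; this is not the stage1 or stage2 tokenizer code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper for illustration; not taken from the Zig sources.
// Parses an escape of the form \u{N...} at the start of `s`.
// On success, stores the code point in `*out` and returns true.
// Returns false if the escape is malformed or the value exceeds 0x10ffff.
bool parse_unicode_escape(const std::string &s, uint32_t *out) {
    // Require the fixed prefix: backslash, 'u', opening brace.
    if (s.size() < 3 || s[0] != '\\' || s[1] != 'u' || s[2] != '{') {
        return false;
    }
    uint32_t value = 0;
    std::size_t digits = 0;
    for (std::size_t i = 3; i < s.size(); ++i) {
        const char c = s[i];
        uint32_t d = 0;
        if (c >= '0' && c <= '9') {
            d = static_cast<uint32_t>(c - '0');
        } else if (c >= 'a' && c <= 'f') {
            d = static_cast<uint32_t>(c - 'a') + 10;
        } else if (c >= 'A' && c <= 'F') {
            d = static_cast<uint32_t>(c - 'A') + 10;
        } else if (c == '}') {
            // At least one digit is required between the braces.
            if (digits == 0) return false;
            *out = value;
            return true;
        } else {
            return false; // not a hex digit and not the closing brace
        }
        value = value * 16 + d;
        ++digits;
        // Checking after every digit keeps `value` from overflowing,
        // no matter how many digits the input contains.
        if (value > 0x10ffff) return false;
    }
    return false; // ran out of input before the closing brace
}
```

Under this sketch, `\u{41}` and `\u{0000000041}` both yield 0x41, while `\u{}`, `\u{110000}`, and the old `\u0041` form are rejected.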

<tr>
<td><code>\UNNNNNN</code></td>
<td>hexadecimal 24-bit Unicode character code UTF-8 encoded (6 digits)</td>
<td><code>\u{NNNNNN}</code></td>
Sponsor Contributor Author

Not sure what the clearest way to write this is. Could also be something like:

\u{N...}

Member

I think the "1 or more digits" you have below is sufficient

},
else => {
state = State.CharLiteralEnd;
},
},

State.CharLiteralHexEscape => switch (c) {
'0'...'9', 'a'...'z', 'A'...'F' => {
Sponsor Contributor Author

I assume this was a bug (found when new tests were added)

Member

yep. thanks!
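For readers skimming the thread: assuming the bug in question is the lowercase range running to `z` instead of `f`, the difference can be sketched as below. Both function names are hypothetical and exist only to contrast the two character ranges; neither appears in the tokenizer.

```cpp
// Hypothetical helpers for illustration only.

// Overly wide: the lowercase range runs to 'z', so letters such as 'g'
// are accepted as if they were hexadecimal digits.
bool is_hex_digit_too_wide(char c) {
    return (c >= '0' && c <= '9') ||
           (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'F');
}

// Intended: both letter ranges stop at 'f' / 'F'.
bool is_hex_digit(char c) {
    return (c >= '0' && c <= '9') ||
           (c >= 'a' && c <= 'f') ||
           (c >= 'A' && c <= 'F');
}
```

With the wide range, a character such as `g` is treated as a digit, which is the kind of case a test with a non-hex lowercase letter would expose.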

},
},

State.CharLiteralUnicodeInvalid => switch (c) {
Sponsor Contributor Author

I got a little creative here because I thought this behavior might prevent some confusing error output. If it doesn't actually help, I'd be totally fine removing this special state.

Member

Let's run with this and see what happens.

break;
}
if (t.char_code > 0x10ffff) {
tokenize_error(&t, "unicode value out of range: %x", t.char_code);
Collaborator

Move this down to the else below?

Member

@andrewrk andrewrk left a comment

Looks great, easy merge

andrewrk merged commit 21c6092 into ziglang:master on Jul 6, 2019

Contributor

shawnl commented Jul 20, 2019

Neither stage1 nor stage2 rejects UTF-16 surrogates, 0xd800 - 0xdfff.

@hryx hryx deleted the unicode-escape branch July 20, 2019 19:59
Sponsor Contributor Author

hryx commented Jul 20, 2019

@shawnl The purpose of this PR was to change the grammar, not introduce new validation logic.
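For reference, the validation shawnl describes was not part of this PR, but a check of that kind could look roughly like the sketch below. It combines the 0x10ffff upper bound the stage1 tokenizer already enforces with a rejection of the surrogate range 0xd800 - 0xdfff. The function name `is_unicode_scalar_value` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical validation helper; this PR did not add such a check.
// Returns true only for Unicode scalar values: code points up to 0x10ffff
// that are not in the UTF-16 surrogate range 0xd800 - 0xdfff.
bool is_unicode_scalar_value(uint32_t cp) {
    if (cp > 0x10ffff) return false;                // beyond the Unicode range
    if (cp >= 0xd800 && cp <= 0xdfff) return false; // surrogate range
    return true;
}
```

A tokenizer could call a check like this once the closing brace of `\u{N...}` has been seen, keeping the grammar change and the validation step separate.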

Development

Successfully merging this pull request may close these issues.

change \uXXXX \UXXXXXX string literal escape syntax to \u{}
4 participants