allow unicode characters in character literals #2097

Open
andrewrk opened this Issue Mar 23, 2019 · 9 comments

@andrewrk
Member

commented Mar 23, 2019

While solving #2088 I am about to push a change that makes this test pass:

test "unicode escape in character literal" {
    var a: u24 = '\U01f4a9';
    expect(a == 128169);
}

This makes sense since character literals are just comptime_int. There's no footgun here because you can't accidentally misuse it:

test "aoeu" {
    var str = "hello";
    str[1] = '\U01f4a9';
}
/home/andy/dev/zig/build/test.zig:5:14: error: integer value 128169 cannot be implicitly casted to type 'u8'
    str[1] = '\U01f4a9';
             ^

With that in mind, I think it makes sense to allow UTF-8 characters in character literals, since we have UTF-8 source encoding. I propose this test should pass:

const std = @import("std");

test "utf8 character literal" {
    const x = '💩';
    std.testing.expect(x == 128169);
}

@andrewrk andrewrk added the proposal label Mar 23, 2019

@andrewrk andrewrk added this to the 0.5.0 milestone Mar 23, 2019

andrewrk added a commit that referenced this issue Mar 23, 2019

character literals: allow unicode escapes
also make the documentation for character literals more clear.
closes #2089

see #2097
@emekoi

Contributor

commented Mar 23, 2019

Wait, since we have UTF-8 source encoding, can 💩 or あい be a valid identifier? Or was this already discussed in a previous issue?

@shawnl

Contributor

commented Mar 23, 2019

@emekoi that is explicitly rejected, unless you are using the C ABI, in which case you can use @"any utf-8 string, with arbitrary bytes using \x00 syntax"

shawnl added a commit to shawnl/zig that referenced this issue Mar 24, 2019

stage1: unicode characters in character literals and utf-8 validation
adding utf-8 validation revealed some non-utf-8 stuff in std

Closes: ziglang#2097

@andrewrk andrewrk added the accepted label Mar 24, 2019

@daurnimator

Contributor

commented Mar 24, 2019

\U01f4a9

Why did we pick an uppercase U for this rather than lowercase? Most languages use \u for unicode characters.

While on this topic; was there any consideration of \u{01f4a9} with braces?

@shawnl

Contributor

commented Mar 29, 2019

While on this topic; was there any consideration of \u{01f4a9} with braces?

It isn't necessary. There is a single quote ' at the end, so it is unambiguous. However, there is a big difference: \u covers the Basic Multilingual Plane (BMP), where the most useful stuff is, while \U covers the supplementary planes (such as Egyptian hieroglyphs and emoji). See this chart: https://www.unicode.org/roadmaps/bmp/
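
As a rough illustration of the split described above (a sketch, not code from this thread): code points up to U+FFFF sit in the BMP, and anything above that is in a supplementary plane. `inBmp` is a hypothetical helper, and the example assumes the UTF-8 character literals proposed in this issue plus a recent Zig release in which `std.testing.expect` is called with `try`.

```zig
const std = @import("std");

// Hypothetical helper: true when `cp` lies in the Basic Multilingual Plane
// (U+0000 through U+FFFF).
fn inBmp(cp: u21) bool {
    return cp <= 0xFFFF;
}

test "BMP versus supplementary planes" {
    try std.testing.expect(inBmp('あ')); // U+3042
    try std.testing.expect(!inBmp('💩')); // U+1F4A9
}
```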

@daurnimator

Contributor

commented Mar 29, 2019

It isn't necessary. There is a single quote ' at the end, so it is unambiguous. However, there is a big difference: \u covers the Basic Multilingual Plane (BMP), where the most useful stuff is, while \U covers the supplementary planes (such as Egyptian hieroglyphs and emoji). See this chart: https://www.unicode.org/roadmaps/bmp/

I think having two different escapes for this is really confusing. Instead there should just be one form of unicode escape.

The \u{} syntax is used by JavaScript (since ES6), Lua (since 5.3), and Swift (which switched away from our current syntax!), and it seems to be an accepted recent improvement across languages.

@shawnl

Contributor

commented Mar 29, 2019

Oh, yeah, I wasn't seeing that it needs to be consistent with string literal escape syntax. Yeah, I agree \u{} is clearer, but \x still has to be supported as is (where it is the same as in C), because \x80 is different from \u{80}. \u{} should accept 1, 2, or 3 bytes.
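
To make the \x80 versus \u{80} distinction concrete, here is a minimal sketch. It assumes the \u{} escape discussed here and a recent Zig release in which the `std.testing` helpers return errors and are called with `try`. In a string, \x80 puts the single raw byte 0x80, while \u{80} stands for code point U+0080, which UTF-8 encodes as the two bytes 0xC2 0x80.

```zig
const std = @import("std");

test "raw byte escape versus unicode escape" {
    // \x80 is copied into the string as one raw byte.
    try std.testing.expectEqualSlices(u8, &[_]u8{0x80}, "\x80");
    // \u{80} names code point U+0080, stored as its UTF-8 encoding.
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0xc2, 0x80 }, "\u{80}");
}
```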

@daurnimator

Contributor

commented Mar 29, 2019

@shawnl agreed, \x should be kept, \uXXXX and \UXXXXXX should be replaced with \u{X}.

@shawnl

Contributor

commented Mar 29, 2019

@andrewrk can we get approval for the new character literal syntax, which will also apply to escape sequences inside strings? I'd like to code it up.

@daurnimator

Contributor

commented Mar 29, 2019

@shawnl you've got your approval, see #2129 :)

shawnl added a commit to shawnl/zig that referenced this issue Mar 29, 2019

src-self-hosted: teach the self-hosted compiler about char literals
also fix std.unicode to the 21-bit range of Unicode.

uses the ziglang#2097 syntax

shawnl added a commit to shawnl/zig that referenced this issue Mar 30, 2019

breaking: big unicode overhaul
Allow utf-8 in character literals

Char Unicode escape syntax.

Validate zig as UTF-8.

overhaul std.unicode

Fully implemented in stage2.

Closes: ziglang#2097
Closes: ziglang#2129
---

About the UTF-8 validation in stage1: This implementation is quite slow,
but the state automaton it claims to represent is correct,
and it has two features that faster validators lack, whose absence
would make the code in stage1 more complicated:

* Faster validators don't provide the code point
* They don't provide the index of the error (although this could be
  hacked in, but at more cost)

I don't want to put that much optimization effort into stage1 and C
code.
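
The two features named in the commit message (recovering the code point, and knowing where validation failed) can be sketched as follows. This is not the stage1 implementation, which is C code. `firstInvalidUtf8Index` is a hypothetical helper, and the sketch assumes `std.unicode.utf8ByteSequenceLength` and `std.unicode.utf8Decode` are available in your Zig version.

```zig
const std = @import("std");

// Hypothetical helper, not the stage1 code: returns the byte index of the
// first invalid UTF-8 sequence in `bytes`, or null when the input is valid.
fn firstInvalidUtf8Index(bytes: []const u8) ?usize {
    var i: usize = 0;
    while (i < bytes.len) {
        // Sequence length implied by the lead byte (1 to 4).
        const len = std.unicode.utf8ByteSequenceLength(bytes[i]) catch return i;
        if (i + len > bytes.len) return i; // truncated sequence
        // Decoding yields the code point (discarded here) and fails on
        // malformed sequences.
        _ = std.unicode.utf8Decode(bytes[i .. i + len]) catch return i;
        i += len;
    }
    return null;
}

test "reports where validation fails" {
    try std.testing.expect(firstInvalidUtf8Index("ok \u{1f4a9}") == null);
    try std.testing.expect(firstInvalidUtf8Index("ab\x80cd").? == 2);
}
```

Walking the input one sequence at a time is what makes the error index fall out for free; a validator that processes bytes in bulk would have to track it separately.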
