
Allow non-ascii identifiers #4151

Closed
thejoshwolfe opened this issue Jan 11, 2020 · 22 comments
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Comments

@thejoshwolfe
Sponsor Contributor

thejoshwolfe commented Jan 11, 2020

Here is a concrete proposal for #3947 (comment).

Background

All Zig code is always encoded in UTF-8, and this proposal does not change that.

This proposal does not change the interpretation of ASCII codepoints anywhere in Zig code.

The only non-ascii codepoints with special handling in Zig before this proposal are: U+0085 (NEL), U+2028 (LS), U+2029 (PS). This proposal does not change the interpretation of these codepoints; they are not allowed in identifiers.

Proposal

Zig's current lexical rule for identifiers is:

IDENTIFIER
    <- !keyword ("c" !["\\] / [A-Zabd-z_]) [A-Za-z0-9_]* skip
     / "@\"" string_char* "\""                            skip

This proposal adds the codepoints listed in the table below to both the ranges [A-Zabd-z_] and [A-Za-z0-9_] in the above rule.

00A0
00A8
00AA
00AD
00AF
00B2..00B5
00B7..00BA
00BC..00BE
00C0..00D6
00D8..00F6
00F8..200D
202A..202F
203F..2040
2054
205F..218F
2460..24FF
2776..2793
2C00..2DFF
2E80..3000
3004..3007
3021..302F
3031..D7FF
F900..FD3D
FD40..FDCF
FDF0..FE44
FE47..FFFD
10000..1FFFD
20000..2FFFD
30000..3FFFD
40000..4FFFD
50000..5FFFD
60000..6FFFD
70000..7FFFD
80000..8FFFD
90000..9FFFD
A0000..AFFFD
B0000..BFFFD
C0000..CFFFD
D0000..DFFFD
E0000..EFFFD
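For illustration only (this is not part of the proposal), the table above can be queried with a sorted range list and a binary search. The names below are mine, and the range data is transcribed directly from the table:

```python
import bisect

# Inclusive codepoint ranges the proposal adds to both identifier
# character classes, transcribed from the table above.  The last 14
# ranges (one per supplementary plane 1..14) are generated.
EXTRA_ID_RANGES = [
    (0x00A0, 0x00A0), (0x00A8, 0x00A8), (0x00AA, 0x00AA), (0x00AD, 0x00AD),
    (0x00AF, 0x00AF), (0x00B2, 0x00B5), (0x00B7, 0x00BA), (0x00BC, 0x00BE),
    (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x200D), (0x202A, 0x202F),
    (0x203F, 0x2040), (0x2054, 0x2054), (0x205F, 0x218F), (0x2460, 0x24FF),
    (0x2776, 0x2793), (0x2C00, 0x2DFF), (0x2E80, 0x3000), (0x3004, 0x3007),
    (0x3021, 0x302F), (0x3031, 0xD7FF), (0xF900, 0xFD3D), (0xFD40, 0xFDCF),
    (0xFDF0, 0xFE44), (0xFE47, 0xFFFD),
] + [(0x10000 * plane, 0x10000 * plane + 0xFFFD) for plane in range(1, 15)]

_STARTS = [lo for lo, _ in EXTRA_ID_RANGES]

def is_extra_identifier_char(cp):
    """True iff the proposal adds codepoint cp to the identifier classes.

    ASCII characters are deliberately absent here; they stay governed by
    the existing [A-Za-z0-9_] rules."""
    i = bisect.bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= EXTRA_ID_RANGES[i][1]
```

Note that U+0085, U+2028, and U+2029 fall in the gaps, matching the background section above.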

Explanation

This set of codepoints was determined by following the recommendation here: https://unicode.org/reports/tr31/#Immutable_Identifier_Syntax . Specifically, this is the set of all characters except characters meeting any of these criteria:

  • Pattern_White_Space=True
  • Pattern_Syntax=True
  • General_Category=Private_Use, Surrogate, or Control
  • Noncharacter_Code_Point=True
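The four criteria can be sketched as a filter. This Python sketch is not the actual generator script: it takes General_Category from the stdlib unicodedata module, hard-codes Pattern_White_Space and the noncharacters (both small and permanently stable), and takes Pattern_Syntax as a parameter, since that set is too large to inline and must come from the UCD's PropList.txt:

```python
import unicodedata

# Pattern_White_Space is a small set, fixed forever by Unicode's
# stability policy; all 11 members are listed here.
PATTERN_WHITE_SPACE = set(range(0x0009, 0x000E)) | {
    0x0020, 0x0085, 0x200E, 0x200F, 0x2028, 0x2029,
}

# Noncharacters: U+FDD0..U+FDEF plus the last two codepoints of each plane.
NONCHARACTERS = set(range(0xFDD0, 0xFDF0)) | {
    plane * 0x10000 + cp for plane in range(17) for cp in (0xFFFE, 0xFFFF)
}

def is_immutable_identifier_char(cp, pattern_syntax):
    """True iff cp survives the four exclusion criteria above.

    `pattern_syntax` must be the full Pattern_Syntax set parsed from
    PropList.txt; Python's stdlib does not ship that property."""
    if cp in PATTERN_WHITE_SPACE or cp in pattern_syntax:
        return False
    # Cc = Control, Co = Private_Use, Cs = Surrogate
    if unicodedata.category(chr(cp)) in ("Cc", "Co", "Cs"):
        return False
    if cp in NONCHARACTERS:
        return False
    return True
```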

Unicode Character Data version 5.2.0 was used to generate this list, but the list can remain stable forever despite future versions of the Unicode Character Data, as per the recommendation and discussion in tr31 linked above. (EDIT: @daurnimator pointed out that this is many major versions behind, but even using the latest version, 12.1.0, the list of codepoints in this proposal is identical.)

The code I used to generate the above set of codepoints can be found here: https://github.com/ziglang/zig/blob/6f8e2fad94fde6c9a8c4ca52d964d0616690ee4c/tools/gen_id_char_table.py

@thejoshwolfe thejoshwolfe added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Jan 11, 2020
@daurnimator
Contributor

Why was this generated with 5.2.0 rather than the latest? Could you check to see if it's different for the purposes of this proposal?

@thejoshwolfe
Sponsor Contributor Author

Thanks for pointing that out. I misunderstood the versioning scheme. I updated my script to use data version 12.1.0, and the resulting table is identical. I guess that says something about the stability of this proposal.

@daurnimator
Contributor

I do think this is the best way to get Unicode identifiers into Zig, but I'm still afraid of bugs caused by differing normalisation. Perhaps we could have zig fmt emit a warning when an identifier is not in NFKC form?
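The suggested NFKC check is mechanical. A minimal sketch using Python's unicodedata module (the function name and message are illustrative, not an actual zig fmt API):

```python
import unicodedata

def warn_if_not_nfkc(identifier):
    """Return a warning string if the identifier is not already in NFKC
    form (i.e. normalizing it would change it), else None."""
    normalized = unicodedata.normalize("NFKC", identifier)
    if normalized != identifier:
        return "identifier {!r} is not NFKC; did you mean {!r}?".format(
            identifier, normalized
        )
    return None
```

For example, an identifier containing U+00B2 (SUPERSCRIPT TWO) would trigger the warning, because NFKC folds it to the ASCII digit 2.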

@Serentty

@daurnimator That seems like a decent solution to me. It wouldn't be futureproof due to the addition of new characters over time, but since this is just a warning and not an error there are no compatibility issues introduced from that.

@mikdusan
Member

mikdusan commented Jan 12, 2020

U+00A0 is NO-BREAK SPACE (the apparent space inside the identifier below is U+00A0, so `hello var` lexes as a single identifier):

const hello var = "hello";
std.debug.warn("{}\n", .{hello var});

U+00B2 is SUPERSCRIPT TWO:

const ²name = "joe";
std.debug.warn("{}\n", .{²name});

If anyone is interested in a comparison, Swift language specs out their identifiers codepoints pretty concisely.

@Serentty

I'm quite happy with this proposal. I think it's the simplest possible proposal that's still well-rounded. There are more complex solutions, and they're good too, but this is a very good compromise between features and utility.

@andrewrk
Member

The script looks for "Pattern_White_Space" but it should probably also exclude "White_Space"

@thejoshwolfe
Sponsor Contributor Author

I don't claim to be an expert on Unicode, but unicode.org does, and they (indirectly) suggest allowing U+00A0 (NO-BREAK SPACE) in identifiers. To quote the experts:

The drawback of this method is that it allows “nonsense” to be part of identifiers because the concerns of lexical classification and of human intelligibility are separated. Human intelligibility can, however, be addressed by other means, such as usage guidelines that encourage a restriction to meaningful terms for identifiers.

If we wanted the compiler to try to prevent abuse in identifier naming, we would need something far more sophisticated than this proposal. See the discussion in #3947.

This proposal assumes that programmers are trying to be responsible with identifier naming. We would be opening the door to nasty code obfuscation if programmers set out to write obfuscated code. Are we worried about that?

I don't think we need to worry about U+00A0 (NO-BREAK SPACE). Maybe some language out there would really like to use it for their identifiers. It has no place in an English codebase, but this proposal is not for English speakers.

@jakwings

This proposal deserves a more accurate title, like: Allow (only) UTF-8/UTF-16/...-encoded identifiers?

--

If one doesn't care about U+00A0 then why not ignore U+0085 (NEL), U+2028 (LS), U+2029 (PS)? Still complicated anyway. When would you type those invisible characters in source code besides in raw strings?

@Serentty

I can't imagine the NBSP character being useful for identifiers in any language. It would be terribly confusing given that it's visually identical to a normal space except for its line breaking rules. Even in languages where it is used, it's mostly just to keep punctuation on the same line and stuff like that. An underscore should work just fine.

We would be opening the door to nasty code obfuscation if programmers set out to write obfuscated code. Are we worried about that?

I think Zig already allows for all sorts of horribly obfuscated code, but this can't be considered a bug unless it makes it easy to do so accidentally. You could use comptime to generate all sorts of opaque stuff, but other than for the fun of it, why would you? So I would apply the same perspective here.

@thejoshwolfe
Sponsor Contributor Author

@iology see #663 regarding UTF-16 and the characters you listed.

@jakwings

jakwings commented Jan 13, 2020

see #663 regarding UTF-16 and the characters you listed.

Thanks. Now I see

  1. UTF-8 (endianness unrelated) is enforced

  2. The line endings NEL, LS and PS are forbidden in comments purely for their visual effect.

So if whitespace like U+00A0 (NBSP) and U+200B (ZERO WIDTH SPACE) is allowed in identifier names, then I question restriction (2). Going further, I think all valid code points outside 00-7F could be allowed, as long as they never affect the functionality of compilers and editors. Of course, this opinion only holds if the goal is extremely minimal Unicode handling.

Regarding the attributes like White_Space, will there be any newly assigned code point attribute in the future? (ah, got it: "small set" + "compromise")

Another observation: U+1680 (OGHAM SPACE MARK) appears as a dash symbol in the font Noto Sans Ogham (is it important?).

@kavika13
Contributor

kavika13 commented Jan 13, 2020

@thejoshwolfe
Not sure if it's entirely clear in the first post why this proposal exists. I think you summed up the solution just fine, but the problem statement is a bit fuzzy unless you read the whole thread from the original proposal.

The problem statement - From a comment Andrew made on the Unicode identifier thread:

One benefit of status quo that I am reluctant to give up is that the definition of zig tokenization is finished. Aside from language changes before 1.0.0, tokenization is immortal and unchanging; already in its final form. 100% stable.

A dependency on Unicode tables means more than just the chore of implementing and maintaining support. It means the zig language itself depends on a third party standard that changes independently of Zig, without warning, and without any decision-making power from the Zig community.

I would be interested to explore what it might look like to support non-ascii identifiers without any knowledge of Unicode. For example, the naive approach of allowing any sequence of (non ASCII) bytes as an identifier. Some downsides I can think of:

The proposed solution - From Unicode 12.0.0 standard, annex 31: Immutable Identifier Syntax:

... Instead of defining the set of code points that are allowed, define a small, fixed set of code points that are reserved for syntactic use and allow everything else (including unassigned code points) as part of an identifier. All parsers written to this specification would behave the same way for all versions of the Unicode Standard, because the classification of code points is fixed forever ...

You summed this up pretty well, but maybe quoting this part of the unicode thing you linked directly could be valuable too?

Not sure how much more discussion needs to happen though, so this could be just make-work.

@jakwings

@andrewrk

The script looks for "Pattern_White_Space" but it should probably also exclude "White_Space"

I read more carefully... (highlighted by me) https://unicode.org/reports/tr31/#Immutable_Identifier_Syntax

Instead of defining the set of code points that are allowed, define a small, fixed set of code points that are reserved for syntactic use and allow everything else (including unassigned code points) as part of an identifier. All parsers written to this specification would behave the same way for all versions of the Unicode Standard, because the classification of code points is fixed forever.

and (as you can notice, the generated list includes almost all non-ascii code points)

Immutable identifiers are intended for those cases (like XML) that cannot update across versions of Unicode, and do not require information about normalization form, or properties such as General_Category and Script. Immutable identifiers that allow unassigned characters cannot provide for normalization forms or these properties, which means that they:

For best practice, a profile disallowing unassigned characters should be provided where possible.

This means the future version might introduce more whitespace or useful letters through previously unassigned code points.

@andrewrk andrewrk added this to the 0.7.0 milestone Jan 26, 2020
@ghost

ghost commented Mar 6, 2020

I suggest a variant of the "naive" approach mentioned earlier, that involves a whitelist roughly of this form:

// *** unicode_whitelist.zig ***
// *** unicode version xxxxxx ***

// This struct definition would be a builtin type.
const CodePointEntry = struct {
    codePointStr: []const u8, // could also be a raw number, or the UTF-8 encoding directly
    symbol: []const u8,
    asciiName: []const u8,
};

const greekSymbols = [_]CodePointEntry{
    .{ .codePointStr = "U+0370", .symbol = "Ͱ", .asciiName = "heta" },
    .{ .codePointStr = "U+03A9", .symbol = "Ω", .asciiName = "omega" },
    // ...and so on
};

// This table is consulted by the tokenizer when accepting or rejecting symbols used in identifiers.
// All entries within the table must be unique, with no duplication on any of the individual fields.
const whiteListCodepoints = greekSymbols ++ cyrillicSymbols ++ ... ;

This file would be imported by zig build, and the compiler would consult the whitelist table for any non-ascii UTF-8 symbol encountered in an identifier. If the symbol is included in the table, keep compiling; otherwise, give a compile error.

This way, the tokenization of .zig files would be independent of Unicode, and it would be the responsibility of the user to use a sensible subset of Unicode for the best trade-off between usability, readability, and the pitfalls of Unicode identifiers.

It would also be easy for the community to provide curated white lists, or for larger projects to have project specific white lists catered to their use case.

@BarabasGitHub
Contributor

I don't get why this suggestion proposes a whitelist when, in the other discussion, most people seemed to agree that a simple blacklist is the way to go. Just as suggested here...

I read more carefully... (highlighted by me) https://unicode.org/reports/tr31/#Immutable_Identifier_Syntax

Instead of defining the set of code points that are allowed, define a small, fixed set of code points that are reserved for syntactic use and allow everything else (including unassigned code points) as part of an identifier. All parsers written to this specification would behave the same way for all versions of the Unicode Standard, because the classification of code points is fixed forever.

Seems simple enough and it's the responsibility of the programmer not to do anything weird like using a name of all non-breaking spaces or whatever.

Maintaining a whitelist seems like an unnecessary burden. Just disallow what you need for other uses and allow the rest. Simple enough, right?

@kavika13
Contributor

kavika13 commented Mar 6, 2020

I suggest ... a whitelist roughly of this form ...

This thread is talking about the core language specification and what sorts of constraints it imposes. The allowed character list sounds more like a feature request for the specific implementation of the language.

Also, it sounds more like the job of a linter than something that really should be in the core language. I am not sure what problems it solves, or sure that it solves them well.

@ghost

ghost commented Mar 6, 2020

This thread is talking about the core language specification and what sorts of constraints it imposes. The allowed character list sounds more like a feature request for the specific implementation of the language.

Maybe it wasn't clear; my suggestion is essentially:

  • Compiler accepts any valid utf8 encoding in identifiers (raw bytes approach for core language specification).
    • Then something else has to be introduced to deal with unicode equivalence and other unicode related readability issues.
  • Solving the above issues at the project/build level was what came to mind first, hence the white list approach.

Defining the valid subset of utf8 encoded unicode symbols in the core language specification is of course also a perfectly fine approach.

@thejoshwolfe
Sponsor Contributor Author

@BarabasGitHub Regarding whitelist vs blacklist: those are equivalent. The range of all possible unicode codepoints is 0x0000..0x10FFFF, and that range will not change with future versions of unicode. The set of codepoints that are allowed or not allowed according to this proposal also will never change. So this proposal can be represented as either a whitelist or a blacklist. I presented the characters here as a whitelist because that's what lexer definitions usually use.

@user00e00 I don't like the idea of a project defining something so fundamental to the language as lexer rules. You'll inevitably get into weird situations where a source file builds in one context but not in another. Then part of the interface for dealing with a code file is a dependency on allowing certain characters. Do we want to make that API formal with some kind of comment syntax? This is a whole can of worms that waves big red flags that this is a bad idea.

I'm perfectly happy with a linter producing linter errors when identifiers fit or deviate from project-defined patterns. That's within the domain of a linter. Typically, a project does not run its linters on 3rd-party dependency code, because the code was written in a different project context. However, the compiler and its lexer must process projects and all their transitive dependencies. It is wrong for a project to impose subjective restrictions on 3rd-party code. (And I know what you're thinking now: have the lexer rules be project-specific and let each dependency project define its own lexer rules; sure, but then you're effectively just describing a linter, which I'm arguing is the proper solution to this situation.)
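Such a project-scoped policy fits naturally in a linter. A minimal sketch, where the whitelist, names, and rough identifier regex are all hypothetical (a real tool would reuse the compiler's tokenizer rather than a regex):

```python
import re

# Hypothetical project-specific policy: the non-ASCII codepoints this
# project chooses to allow in its own identifiers.  This is linter
# configuration, not a lexer rule, so dependency code is unaffected.
ALLOWED_NON_ASCII = {0x03A9}  # e.g. GREEK CAPITAL LETTER OMEGA

# Rough identifier shape: a non-digit word character, then word
# characters or any non-ASCII codepoint.
IDENT = re.compile(r"[^\W\d][\w\u0080-\U0010FFFF]*")

def lint_identifiers(source):
    """Yield (identifier, offending_char) for every identifier that
    contains a non-ASCII codepoint outside the project whitelist."""
    for match in IDENT.finditer(source):
        name = match.group(0)
        for ch in name:
            if ord(ch) > 0x7F and ord(ch) not in ALLOWED_NON_ASCII:
                yield name, ch
```

For example, with the whitelist above, an identifier `café` would be flagged for the `é` while an identifier `Ω` would pass.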

@BarabasGitHub
Contributor

Regarding whitelist vs blacklist: those are equivalent.

The difference is listing the few characters that you need for Zig syntax versus listing the rest of the Unicode space. I see it as either: "These characters are part of Zig syntax and keywords, so you can't use them as/in identifiers, but everything else is fine." or: "We've audited all characters of all languages and we've decided that these characters are not suitable in identifiers." Even if you're not doing the latter, it's like listing the whole dictionary minus five words, instead of just those five words.

Anyway I agree with you that a linter is a better solution than having this be part of the compilation.

@CantrellD

CantrellD commented Apr 20, 2020

I don't think whitelists or blacklists are a good idea. Unicode might be a mess, but it's the least-bad option that pretty much everybody has been able to accept as a standard. The subset of Unicode included in this issue doesn't have the same status. Maybe it will in the future, but that seems unlikely.

I also don't love the idea of Unicode identifiers that aren't clearly marked as such, e.g. using the @"..." syntax. Is it really a good idea to allow identifiers that look exactly like other tokens? If so, then why have blacklists? And if not, then why allow code-points that haven't been assigned yet?

@andrewrk
Member

Thank you for the discussion all. I am closing this in favor of status quo. The @"" syntax allows any string literal to be used as an identifier, and Zig remains blissfully unaware of Unicode.

9 participants