Description
Support non-latin and emoji identifiers
Allow user-defined identifiers to contain Unicode letters and emoji symbols.
Formal specification
Definitions
An identifier is a sequence of one or more Unicode code points matching:
identifier = identifier_start identifier_continue*
identifier_start
A code point is a valid identifier start if it satisfies ANY of:
- Unicode General Category is one of:
  - Lu (Letter, uppercase): A, Б, Ω
  - Ll (Letter, lowercase): a, б, ω
  - Lt (Letter, titlecase): ǅ, ǈ
  - Lm (Letter, modifier): ʰ, ˠ
  - Lo (Letter, other): 中, あ, א
  - Nl (Number, letter): Ⅳ, 〇
  - So (Symbol, other): 🎉, 🚀, ★, ♠
- Code point is U+005F (underscore, _)
identifier_continue
A code point is a valid identifier continuation if it satisfies ANY of:
- It is a valid identifier_start
- Unicode General Category is one of:
  - Mn (Mark, nonspacing): combining accents such as U+0308 (diaeresis) and U+0301 (acute)
  - Mc (Mark, spacing combining): ः (U+0903), ा (U+093E)
  - Nd (Number, decimal digit): 0-9, ٣, ৫
  - Pc (Punctuation, connector): _, ‿
Excluded
The following are explicitly not valid in identifiers:
- Zs (Space, separator): spaces, non-breaking spaces
- Zl, Zp (Line/Paragraph separator)
- Cc (Control): \0, \n, \t
- Cf (Format): zero-width joiner U+200D, zero-width non-joiner U+200C, BOM U+FEFF
- Sk (Symbol, modifier): ^, `
- Sm (Symbol, math): +, =, <, >, |, ~
- Sc (Symbol, currency): $, €, £
- Pd, Ps, Pe, Pi, Pf, Po (Punctuation): ., ,, (, ), [, ]
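The start, continue, and exclusion rules above can be sketched in a few lines of Python. This is only an illustration, not NFun's actual implementation: the function and constant names are made up for this example, and the result depends on the Unicode version bundled with the Python interpreter's unicodedata module.

```python
import unicodedata

# General Categories allowed for the first code point of an identifier.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "So"}
# Extra categories allowed only after the first code point.
CONTINUE_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def is_identifier_start(ch: str) -> bool:
    """True if the single code point `ch` may begin an identifier."""
    return ch == "_" or unicodedata.category(ch) in START_CATEGORIES


def is_identifier_continue(ch: str) -> bool:
    """True if the single code point `ch` may follow the first code point."""
    return is_identifier_start(ch) or unicodedata.category(ch) in CONTINUE_CATEGORIES


def is_identifier(text: str) -> bool:
    """True if `text` matches identifier_start identifier_continue*."""
    if not text:
        return False
    return is_identifier_start(text[0]) and all(
        is_identifier_continue(ch) for ch in text[1:]
    )
```

Everything not listed in the two sets is rejected by default, so the excluded categories (Zs, Cc, Cf, Sm, Sc, and so on) need no separate check in this sketch.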
Surrogate pairs
Many emoji and some symbols have code points above U+FFFF, outside the Basic Multilingual Plane (BMP). Languages whose strings are UTF-16 (C#, Java, JavaScript) represent them as surrogate pairs: two 16-bit code units. Implementations MUST decode each surrogate pair to a single code point before checking its General Category.
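To make the decoding step concrete, here is a minimal Python sketch that combines UTF-16 code units into code points. The function name decode_utf16_units is hypothetical, and passing unpaired surrogates through unchanged is an assumption of this example; the proposal does not say how malformed input should be handled.

```python
def decode_utf16_units(units):
    """Combine a sequence of 16-bit UTF-16 code units into code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # High surrogate followed by low surrogate: one supplementary code point.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate passed through unchanged
            # (assumption: error handling is left to the caller).
            code_points.append(unit)
            i += 1
    return code_points
```

For example, the code units 0xD83C 0xDF89 decode to the single code point U+1F389 (🎉), whose General Category is So. Checking either half on its own would report Cs (Surrogate) and wrongly reject the identifier.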
Reserved keywords
All NFun keywords (if, else, then, rule, true, false, none, not, and, or, in) are ASCII-only. Non-Latin look-alikes of a keyword (e.g. іf written with a Cyrillic і) are valid identifiers, not keywords.
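One way to get this behavior is to compare each scanned word against the keyword set by exact code points, with no confusable or homoglyph folding. The Python sketch below assumes that approach; KEYWORDS and classify_word are illustrative names, not part of NFun.

```python
# Keyword list taken from the proposal; all entries are plain ASCII.
KEYWORDS = frozenset(
    {"if", "else", "then", "rule", "true", "false", "none", "not", "and", "or", "in"}
)


def classify_word(word: str) -> str:
    """Classify an already-scanned word by exact code point comparison."""
    return "keyword" if word in KEYWORDS else "identifier"
```

With this rule, classify_word("if") yields "keyword", while "\u0456f" (Cyrillic і followed by Latin f) yields "identifier" even though the two look the same on screen.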
Normalization
No Unicode normalization is performed. Code points are compared as-is. Two identifiers are equal if and only if their code point sequences are identical.
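Consequence: café ending in U+0065 U+0301 (e followed by a combining acute accent) and café ending in U+00E9 (precomposed é) are different identifiers, even though they render identically.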
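The difference is easy to see in Python, where string equality already compares code point sequences. The helper name same_identifier is hypothetical. The call to unicodedata.normalize appears only to show what the proposal deliberately does not do.

```python
import unicodedata

decomposed = "cafe\u0301"   # e + combining acute accent: 5 code points
precomposed = "caf\u00e9"   # precomposed é: 4 code points


def same_identifier(a: str, b: str) -> bool:
    """Identifiers are equal only if their code point sequences are identical."""
    return a == b


# For contrast only: NFC normalization would make the two spellings equal,
# which is exactly the step this proposal omits.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
```

Here same_identifier(decomposed, precomposed) is False while nfc_equal is True, so a program can define both spellings as separate variables.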
Case sensitivity
Identifiers are case-sensitive for all scripts. Foo, foo, FOO are three different identifiers. Σ (U+03A3) and σ (U+03C3) are different identifiers.
Examples
# Latin
name = "Alice"
# Cyrillic
имя = "Алиса"
сумма = a + b
# CJK
数量 = 100
名前 = "太郎"
# Arabic
قيمة = 42
# German
größe = 10
# Emoji
🎉 = "party"
результат_🚀 = calculate()
player_⭐ = score > 100
# Mixed (valid but not recommended)
data_данные = [1, 2, 3]
# Accented letters (precomposed, no combining marks)
café = "coffee" # é as U+00E9 (single code point, Ll)
naïve = true # ï as U+00EF (single code point, Ll)
# Invalid — these are operators/punctuation, not identifiers
# $price — Sc (currency)
# +plus — Sm (math)
# .dot — Po (punctuation)
Implementation notes
C# (.NET)
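using System.Globalization;

static bool IsIdentStart(int codePoint) {
    // The int overload accepts code points above U+FFFF; a (char) cast would truncate them.
    var cat = CharUnicodeInfo.GetUnicodeCategory(codePoint);
    return codePoint == '_'
        || cat is UnicodeCategory.UppercaseLetter or UnicodeCategory.LowercaseLetter
            or UnicodeCategory.TitlecaseLetter or UnicodeCategory.ModifierLetter
            or UnicodeCategory.OtherLetter or UnicodeCategory.LetterNumber
            or UnicodeCategory.OtherSymbol;
}
static bool IsIdentContinue(int codePoint) {
    var cat = CharUnicodeInfo.GetUnicodeCategory(codePoint);
    return IsIdentStart(codePoint)
        || cat is UnicodeCategory.NonSpacingMark or UnicodeCategory.SpacingCombiningMark
            or UnicodeCategory.DecimalDigitNumber or UnicodeCategory.ConnectorPunctuation;
}
For surrogate pairs in C#, use Rune (.NET Core 3.0+) or char.ConvertToUtf32(high, low) to obtain the code point before calling these functions.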
Rust
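// Assumes the unicode-general-category crate; std has no General Category lookup.
use unicode_general_category::{get_general_category, GeneralCategory::*};

fn is_ident_start(c: char) -> bool {
    c == '_' || matches!(get_general_category(c),
        UppercaseLetter | LowercaseLetter | TitlecaseLetter | ModifierLetter
        | OtherLetter | LetterNumber | OtherSymbol)
}
Rust char is a Unicode scalar value (32-bit), so there are no surrogate pair issues.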
Other languages
Decode UTF-8/UTF-16 to code points first, then check General Category.
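As an illustration of that two-step recipe in another language, the Python sketch below decodes UTF-8 bytes to code points and then scans the longest identifier prefix. scan_identifier is a hypothetical helper, not an NFun API. It lets invalid UTF-8 raise an error, which is an assumption the proposal does not address.

```python
import unicodedata

START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "So"}
# "_" has category Pc, so it is covered by CONTINUE without a special case.
CONTINUE = START | {"Mn", "Mc", "Nd", "Pc"}


def scan_identifier(source: bytes) -> str:
    """Return the identifier at the start of UTF-8 `source`, or "" if none."""
    # Step 1: bytes -> code points (raises UnicodeDecodeError on invalid UTF-8).
    text = source.decode("utf-8")
    if not text:
        return ""
    # Step 2: check the General Category of each code point.
    first = text[0]
    if first != "_" and unicodedata.category(first) not in START:
        return ""
    end = 1
    while end < len(text) and unicodedata.category(text[end]) in CONTINUE:
        end += 1
    return text[:end]
```

A lexer built this way would read `результат_🚀` from the line `результат_🚀 = calculate()` and stop at the space, while `$price` yields an empty result because `$` is Sc.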