Support non-latin and emoji identifiers #96

@tmteam

Allow user-defined identifiers to contain Unicode letters and emoji symbols.

Formal specification

Definitions

An identifier is a sequence of one or more Unicode code points matching:

identifier = identifier_start identifier_continue*

identifier_start

A code point is a valid identifier start if it satisfies ANY of:

  1. Unicode General Category is one of:
    • Lu (Letter, uppercase) — A, Б, Ω
    • Ll (Letter, lowercase) — a, б, ω
    • Lt (Letter, titlecase) — Dž, Lj
    • Lm (Letter, modifier) — ʰ, ˠ
    • Lo (Letter, other) — א
    • Nl (Number, letter)
    • So (Symbol, other) — 🎉, 🚀
  2. Code point is U+005F (underscore _)

identifier_continue

A code point is a valid identifier continuation if it satisfies ANY of:

  1. It is a valid identifier_start
  2. Unicode General Category is one of:
    • Mn (Mark, nonspacing) — combining accents: ̈, ́
    • Mc (Mark, spacing combining)
    • Nd (Number, decimal digit) — 0-9, ٣
    • Pc (Punctuation, connector) — _
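
The two rules above can be sketched in a few lines. This is a minimal illustration, not the NFun implementation: it uses Python's standard `unicodedata` module for the category lookup, and the function names (`is_identifier_start`, `is_identifier_continue`, `is_identifier`) are hypothetical. Results for recently added characters depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Category sets transcribed from the two lists above.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "So"}
CONTINUE_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def is_identifier_start(ch: str) -> bool:
    """True if the single code point `ch` may begin an identifier."""
    return ch == "_" or unicodedata.category(ch) in START_CATEGORIES


def is_identifier_continue(ch: str) -> bool:
    """True if the single code point `ch` may follow the first code point."""
    return is_identifier_start(ch) or unicodedata.category(ch) in CONTINUE_CATEGORIES


def is_identifier(text: str) -> bool:
    """True if `text` matches identifier_start identifier_continue*."""
    if not text:
        return False
    return is_identifier_start(text[0]) and all(
        is_identifier_continue(ch) for ch in text[1:]
    )
```

Under these rules a digit may continue an identifier but not start it, so `is_identifier("x1")` holds while `is_identifier("1x")` does not.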

Excluded

The following are explicitly not valid in identifiers:

  • Zs (Space, separator) — spaces, non-breaking spaces
  • Zl, Zp (Line/Paragraph separator)
  • Cc (Control) — \0, \n, \t
  • Cf (Format) — zero-width joiner U+200D, zero-width non-joiner U+200C, BOM U+FEFF
  • Sk (Symbol, modifier) — ^, `
  • Sm (Symbol, math) — +, =, <, >, |, ~
  • Sc (Symbol, currency) — $, £
  • Pd, Ps, Pe, Pi, Pf, Po (Punctuation) — ., ,, (, ), [, ]
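
As a sanity check, the categories claimed in the exclusion list can be compared against the Unicode database. The sketch below does that with Python's standard `unicodedata` module; the table `EXCLUDED_EXAMPLES` and the helper `mismatches` are hypothetical names introduced only for this illustration.

```python
import unicodedata

# Example characters from the exclusion list, paired with the category
# the list assigns to them.
EXCLUDED_EXAMPLES = {
    " ": "Zs", "\u00a0": "Zs", "\u2028": "Zl", "\u2029": "Zp",
    "\n": "Cc", "\t": "Cc", "\u200d": "Cf", "\u200c": "Cf", "\ufeff": "Cf",
    "^": "Sk", "`": "Sk", "+": "Sm", "=": "Sm", "$": "Sc", "£": "Sc",
    ".": "Po", ",": "Po", "(": "Ps", ")": "Pe",
}


def mismatches() -> list:
    """Return the characters whose actual category differs from the listed one."""
    return [
        ch for ch, expected in EXCLUDED_EXAMPLES.items()
        if unicodedata.category(ch) != expected
    ]
```

An empty result from `mismatches()` means every example sits in the category the list gives for it.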

Surrogate pairs

Many emoji and some other symbols have code points above U+FFFF (outside the BMP). In languages whose strings are UTF-16 (C#, Java, JavaScript), each such code point is represented as a surrogate pair (two 16-bit code units). Implementations MUST decode each surrogate pair to a single code point before checking its category.
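
The decoding step can be sketched as follows. Python strings are already sequences of code points, so the example works on a plain list of 16-bit code units to mimic what a UTF-16 host sees. The function name `decode_utf16_units` is hypothetical, and passing unpaired surrogates through unchanged is an assumption of this sketch; the issue does not say how they should be reported.

```python
def decode_utf16_units(units: list) -> list:
    """Combine UTF-16 code units into code points, pairing surrogates."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            # Standard UTF-16 formula: 10 bits from each half, offset by 0x10000.
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate passed through unchanged.
            code_points.append(unit)
            i += 1
    return code_points
```

For example, the code units `0xD83C 0xDF89` combine into the single code point U+1F389 (🎉), which is then the value to look up in the category tables.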

Reserved keywords

All NFun keywords (if, else, then, rule, true, false, none, not, and, or, in) are ASCII-only. Non-latin sequences that spell the same word (e.g. Cyrillic іf) are valid identifiers, not keywords.
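
A keyword check consistent with this rule is an exact comparison against the ASCII keyword set, with no folding of look-alike letters. The sketch below assumes the keyword list quoted above is complete; `KEYWORDS` and `is_keyword` are hypothetical names.

```python
# Keyword list copied from the paragraph above.
KEYWORDS = frozenset(
    ["if", "else", "then", "rule", "true", "false", "none", "not", "and", "or", "in"]
)


def is_keyword(token: str) -> bool:
    """Exact code point comparison; look-alike letters are not folded."""
    return token in KEYWORDS


latin_if = "if"
cyrillic_if = "\u0456f"  # Cyrillic і (U+0456) followed by Latin f
```

Here `is_keyword(latin_if)` holds, while `is_keyword(cyrillic_if)` does not, so the second spelling stays available as an identifier.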

Normalization

No Unicode normalization is performed. Code points are compared as-is. Two identifiers are equal if and only if their code point sequences are identical.

Consequence: café written as e plus a combining acute accent (U+0065 U+0301) and café written with the precomposed é (U+00E9) are different identifiers.
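
The consequence is easy to reproduce. In the sketch below, `same_identifier` is a hypothetical helper that compares code point sequences as-is, as specified; the call to `unicodedata.normalize` is there only to show what the specification deliberately does not do.

```python
import unicodedata

decomposed = "cafe\u0301"   # e (U+0065) + combining acute accent (U+0301)
precomposed = "caf\u00e9"   # é as the single code point U+00E9


def same_identifier(a: str, b: str) -> bool:
    """Identifiers are equal only if their code point sequences are identical."""
    return a == b


# Without normalization the two spellings differ.
differ_as_is = not same_identifier(decomposed, precomposed)

# NFC normalization would merge them, which is exactly what is NOT done.
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

Both `differ_as_is` and `equal_after_nfc` are true: the spellings render identically and would compare equal under NFC, yet name two distinct identifiers here.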

Case sensitivity

Identifiers are case-sensitive for all scripts. Foo, foo, FOO are three different identifiers. Σ (U+03A3) and σ (U+03C3) are different identifiers.

Examples

# Latin
name = "Alice"

# Cyrillic
имя = "Алиса"
сумма = a + b

# CJK
数量 = 100
名前 = "太郎"

# Arabic
قيمة = 42

# German
größe = 10

# Emoji
🎉 = "party"
результат_🚀 = calculate()
player_⭐ = score > 100

# Mixed (valid but not recommended)
data_данные = [1, 2, 3]

# Accented letters (precomposed, no combining marks)
café = "coffee"    # é as U+00E9 (single code point, Ll)
naïve = true       # ï as U+00EF (single code point, Ll)

# Invalid — these are operators/punctuation, not identifiers
# $price    — Sc (currency)
# +plus     — Sm (math)
# .dot      — Po (punctuation)

Implementation notes

C# (.NET)

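using System.Globalization;
using static System.Globalization.UnicodeCategory;

static bool IsIdentStart(int codePoint) {
    // The int overload (.NET Core 3.0+) also handles code points above U+FFFF;
    // casting to char would truncate them.
    var cat = CharUnicodeInfo.GetUnicodeCategory(codePoint);
    return cat is UppercaseLetter or LowercaseLetter or TitlecaseLetter
               or ModifierLetter or OtherLetter or LetterNumber or OtherSymbol
           || codePoint == '_';
}

static bool IsIdentContinue(int codePoint) {
    var cat = CharUnicodeInfo.GetUnicodeCategory(codePoint);
    return IsIdentStart(codePoint)
           || cat is NonSpacingMark or SpacingCombiningMark
                  or DecimalDigitNumber or ConnectorPunctuation;
}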

To decode surrogate pairs in C#, use Rune (.NET Core 3.0+) or char.ConvertToUtf32(high, low).

Rust

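// std has no General Category lookup; assumes the `unicode-general-category` crate.
use unicode_general_category::{get_general_category, GeneralCategory::*};

fn is_ident_start(c: char) -> bool {
    c == '_' || matches!(get_general_category(c),
        UppercaseLetter | LowercaseLetter | TitlecaseLetter | ModifierLetter
        | OtherLetter | LetterNumber | OtherSymbol)
}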

Rust char is a Unicode scalar value (32-bit) — no surrogate pair issues.

Other languages

Decode UTF-8/UTF-16 to code points first, then check General Category.
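
As an illustration of the decode-first order, the sketch below takes raw bytes, decodes them to code points, and only then scans for an identifier at the start of the input. It is a minimal example under stated assumptions: `scan_identifier` is a hypothetical name, the category sets are copied from the definitions above, and invalid byte sequences simply raise Python's usual `UnicodeDecodeError`.

```python
import unicodedata

START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "So"}
CONTINUE = START | {"Mn", "Mc", "Nd", "Pc"}


def scan_identifier(source: bytes, encoding: str = "utf-8") -> str:
    """Return the identifier at the start of `source`, or "" if there is none."""
    # Step 1: decode. Indexing a Python str yields whole code points,
    # so no surrogate handling is needed after this line.
    text = source.decode(encoding)
    # Step 2: check General Category per code point.
    if not text or not (text[0] == "_" or unicodedata.category(text[0]) in START):
        return ""
    end = 1
    while end < len(text) and (
        text[end] == "_" or unicodedata.category(text[end]) in CONTINUE
    ):
        end += 1
    return text[:end]
```

The same function works for UTF-16 input by passing a different `encoding`, for example `scan_identifier(data, "utf-16-le")`, because the category check never sees code units.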
