Support non-latin and emoji identifiers

# Support non-latin and emoji identifiers

Allow user-defined identifiers to contain Unicode letters and emoji symbols.

## Formal specification

### Definitions

An **identifier** is a sequence of one or more Unicode code points matching:

```
identifier = identifier_start identifier_continue*
```

### identifier_start

A code point is a valid identifier start if it satisfies ANY of:

1. Unicode General Category is one of:
   - `Lu` (Letter, uppercase) — `A`, `Б`, `Ω`
   - `Ll` (Letter, lowercase) — `a`, `б`, `ω`
   - `Lt` (Letter, titlecase) — `ǅ`, `ǈ`
   - `Lm` (Letter, modifier) — `ʰ`, `ˠ`
   - `Lo` (Letter, other) — `中`, `あ`, `א`
   - `Nl` (Number, letter) — `Ⅳ`, `〇`
   - `So` (Symbol, other) — `🎉`, `🚀`, `★`, `♠`
2. Code point is `U+005F` (underscore `_`)

### identifier_continue

A code point is a valid identifier continuation if it satisfies ANY of:

1. It is a valid `identifier_start`
2. Unicode General Category is one of:
   - `Mn` (Mark, nonspacing) — combining accents: `̈`, `́`
   - `Mc` (Mark, spacing combining) — `ः`, `ा`
   - `Nd` (Number, decimal digit) — `0`-`9`, `٣`, `৫`
   - `Pc` (Punctuation, connector) — `_`, `‿`
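
The grammar `identifier_start identifier_continue*` can be sketched as a function that walks the code points of a string. This is only an illustration: `is_identifier`, `ascii_start` and `ascii_continue` are hypothetical names, and the two ASCII-only predicates are stand-ins for the real General Category checks listed above.

```rust
/// Returns true if `s` matches `identifier_start identifier_continue*`.
/// `is_start` and `is_continue` are caller-supplied predicates; real ones
/// would consult the Unicode General Category of each code point.
fn is_identifier(
    s: &str,
    is_start: impl Fn(char) -> bool,
    is_continue: impl Fn(char) -> bool,
) -> bool {
    let mut chars = s.chars(); // iterates code points, not bytes
    match chars.next() {
        Some(first) if is_start(first) => chars.all(|c| is_continue(c)),
        _ => false, // empty input or invalid first code point
    }
}

/// ASCII-only stand-in for `identifier_start` (hypothetical, for illustration).
fn ascii_start(c: char) -> bool {
    c.is_ascii_alphabetic() || c == '_'
}

/// ASCII-only stand-in for `identifier_continue` (hypothetical, for illustration).
fn ascii_continue(c: char) -> bool {
    ascii_start(c) || c.is_ascii_digit()
}
```

With these stand-ins, `is_identifier("snake_case1", ascii_start, ascii_continue)` accepts the name, while `"1abc"` and the empty string are rejected because no valid start code point is present.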

### Excluded

The following are explicitly **not** valid in identifiers:
- `Zs` (Space, separator) — spaces, non-breaking spaces
- `Zl`, `Zp` (Line/Paragraph separator)
- `Cc` (Control) — `\0`, `\n`, `\t`
- `Cf` (Format) — zero-width joiner `U+200D`, zero-width non-joiner `U+200C`, BOM `U+FEFF`
- `Sk` (Symbol, modifier) — `^`, `` ` ``
- `Sm` (Symbol, math) — `+`, `=`, `<`, `>`, `|`, `~`
- `Sc` (Symbol, currency) — `$`, `€`, `£`
- `Pd`, `Ps`, `Pe`, `Pi`, `Pf`, `Po` (Punctuation) — `.`, `,`, `(`, `)`, `[`, `]`

### Surrogate pairs

Emoji and some symbols have code points above U+FFFF (outside BMP). In UTF-16 encodings (C#, Java, JavaScript) they are represented as surrogate pairs (two 16-bit code units). Implementations MUST handle surrogate pairs correctly — decode to a single code point before checking categories.
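
As a minimal sketch of the decoding step, the standard library's `std::char::decode_utf16` combines surrogate pairs into single code points. The helper name `decode_code_points` is hypothetical.

```rust
/// Decodes UTF-16 code units into code points, combining surrogate pairs.
/// Returns None if the input contains an unpaired surrogate.
fn decode_code_points(units: &[u16]) -> Option<Vec<char>> {
    std::char::decode_utf16(units.iter().copied())
        .collect::<Result<Vec<char>, _>>()
        .ok()
}
```

For example, the two code units `0xD83C 0xDF89` decode to the single code point U+1F389 (`🎉`), which is what the category check should receive. A lone `0xD83C` is rejected.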

### Reserved keywords

All NFun keywords (`if`, `else`, `then`, `rule`, `true`, `false`, `none`, `not`, `and`, `or`, `in`) are ASCII-only. Look-alike sequences that contain non-ASCII letters (e.g. `іf`, which starts with Cyrillic `і` U+0456) are valid identifiers, not keywords.
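
A keyword check under this rule is a plain code-point comparison against the ASCII list. The sketch below assumes nothing beyond the keyword list in this section; `KEYWORDS` and `is_keyword` are hypothetical names.

```rust
/// Keyword list copied from the paragraph above.
const KEYWORDS: [&str; 11] = [
    "if", "else", "then", "rule", "true", "false", "none", "not", "and", "or", "in",
];

/// Exact comparison: only the ASCII spellings count as keywords.
fn is_keyword(word: &str) -> bool {
    KEYWORDS.iter().any(|&k| k == word)
}
```

Here `is_keyword("if")` is true, while `is_keyword("\u{0456}f")` (Cyrillic `і` followed by Latin `f`) is false, so the look-alike falls through to the identifier rules.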

### Normalization

No Unicode normalization is performed. Code points are compared as-is. Two identifiers are equal if and only if their code point sequences are identical.

Consequence: `café` (U+0065 U+0301) and `café` (U+00E9) are **different** identifiers.
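
A sketch of this comparison rule, plus a small helper that could make the difference visible in diagnostics. Both function names are hypothetical.

```rust
/// Identifier equality as specified: identical code point sequences,
/// with no normalization applied to either side.
fn same_identifier(a: &str, b: &str) -> bool {
    a.chars().eq(b.chars())
}

/// Formats the code points of `s` as "U+XXXX U+XXXX ...".
fn code_points(s: &str) -> String {
    s.chars()
        .map(|c| format!("U+{:04X}", c as u32))
        .collect::<Vec<_>>()
        .join(" ")
}
```

`same_identifier("cafe\u{0301}", "caf\u{00E9}")` is false even though both render the same, and `code_points` shows why: the first ends in `U+0065 U+0301`, the second in `U+00E9`.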

### Case sensitivity

Identifiers are case-sensitive for all scripts. `Foo`, `foo`, `FOO` are three different identifiers. `Σ` (U+03A3) and `σ` (U+03C3) are different identifiers.
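
One way to honor this in an implementation is to key the symbol table on the exact identifier text, with no case folding. The sketch below assumes a hypothetical `bind_all` helper and integer values purely for illustration.

```rust
use std::collections::HashMap;

/// Builds a symbol table keyed by the exact identifier text (no case folding).
fn bind_all<'a>(bindings: &[(&'a str, i64)]) -> HashMap<&'a str, i64> {
    bindings.iter().copied().collect()
}
```

Binding `Foo`, `foo`, `FOO`, `Σ` and `σ` this way yields five separate entries.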

## Examples

```
# Latin
name = "Alice"

# Cyrillic
имя = "Алиса"
сумма = a + b

# CJK
数量 = 100
名前 = "太郎"

# Arabic
قيمة = 42

# German
größe = 10

# Emoji
🎉 = "party"
результат_🚀 = calculate()
player_⭐ = score > 100

# Mixed (valid but not recommended)
data_данные = [1, 2, 3]

# Accented letters (precomposed, no combining marks)
café = "coffee"    # é as U+00E9 (single code point, Ll)
naïve = true       # ï as U+00EF (single code point, Ll)

# Invalid — these are operators/punctuation, not identifiers
# $price    — Sc (currency)
# +plus     — Sm (math)
# .dot      — Po (punctuation)
```

## Implementation notes

### C# (.NET)
```csharp
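using System.Globalization;
using static System.Globalization.UnicodeCategory;

static bool IsIdentStart(int codePoint) {
    // The int overload (.NET Core 3.0+) accepts code points above the BMP.
    var cat = CharUnicodeInfo.GetUnicodeCategory(codePoint);
    return cat is UppercaseLetter or LowercaseLetter or TitlecaseLetter   // Lu, Ll, Lt
               or ModifierLetter or OtherLetter                           // Lm, Lo
               or LetterNumber or OtherSymbol                             // Nl, So
           || codePoint == '_';
}

static bool IsIdentContinue(int codePoint) {
    var cat = CharUnicodeInfo.GetUnicodeCategory(codePoint);
    return IsIdentStart(codePoint)
           || cat is NonSpacingMark or SpacingCombiningMark               // Mn, Mc
                  or DecimalDigitNumber or ConnectorPunctuation;          // Nd, Pc
}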
```

For surrogate pairs in C#, use `Rune` (.NET Core 3.0+) or `char.ConvertToUtf32(high, low)` to obtain the code point before checking its category.

### Rust
```rust
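// Assumes the `unicode-general-category` crate for category lookup.
use unicode_general_category::{get_general_category, GeneralCategory::*};

fn is_ident_start(c: char) -> bool {
    c == '_' || matches!(get_general_category(c),
        UppercaseLetter | LowercaseLetter | TitlecaseLetter | ModifierLetter
        | OtherLetter | LetterNumber | OtherSymbol) // Lu Ll Lt Lm Lo Nl So
}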
```

Rust `char` is a Unicode scalar value (32-bit) — no surrogate pair issues.

### Other languages
Decode UTF-8/UTF-16 to code points first, then check General Category.
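
For UTF-8 input the decoding step can be sketched as follows. Rust is used here only to match the snippet above, and `code_points_from_utf8` is a hypothetical name; other languages would use their own decoder.

```rust
/// Decodes UTF-8 bytes into code points. Returns None on malformed input.
fn code_points_from_utf8(bytes: &[u8]) -> Option<Vec<u32>> {
    std::str::from_utf8(bytes)
        .ok()
        .map(|s| s.chars().map(|c| c as u32).collect())
}
```

The four bytes `F0 9F 9A 80` decode to the single code point U+1F680 (`🚀`), and it is this value, not the individual bytes, that the General Category check applies to.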

Support non-latin and emoji identifiers #96