
Update manual to define identifiers using UAX 31 XID_Start / XID_Continue.
graydon committed Feb 25, 2011
1 parent 69464aa commit dabccadd3202513ab0bcb424e2c62c90ab23062d
Showing with 19 additions and 13 deletions.
  1. +19 −13 doc/rust.texi
@@ -592,10 +592,12 @@ or interrupted by ignored characters.

Most tokens in Rust follow rules similar to the C family.

-Most tokens (including identifiers, whitespace, keywords, operators and
-structural symbols) are drawn from the ASCII-compatible range of
-Unicode. String and character literals, however, may include the full range of
-Unicode characters.
+Most tokens (including whitespace, keywords, operators and structural symbols)
+are drawn from the ASCII-compatible range of Unicode. Identifiers are drawn
+from Unicode characters specified by the @code{XID_start} and
+@code{XID_continue} rules given by UAX #31@footnote{Unicode Standard Annex
+#31: Unicode Identifier and Pattern Syntax}. String and character literals may
+include the full range of Unicode characters.

@emph{TODO: formalize this section much more}.

@@ -638,18 +640,22 @@ token or a syntactic extension token. Multi-line comments may be nested.
@c * Ref.Lex.Ident:: Identifier tokens.
@cindex Identifier token

-Identifiers follow the pattern of C identifiers: they begin with a
-@emph{letter} or @emph{underscore}, and continue with any combination of
-@emph{letters}, @emph{decimal digits} and underscores, and must not be equal
-to any keyword or reserved token. @xref{Ref.Lex.Key}. @xref{Ref.Lex.Res}.
+Identifiers follow the rules given by Unicode Standard Annex #31, in the form
+closed under NFKC normalization, @emph{excluding} those tokens that are
+otherwise defined as keywords or reserved
+tokens. @xref{Ref.Lex.Key}. @xref{Ref.Lex.Res}.

-A @emph{letter} is a Unicode character in the ranges U+0061-U+007A and
-U+0041-U+005A (@code{'a'}-@code{'z'} and @code{'A'}-@code{'Z'}).
+That is: an identifier starts with any character having derived property
+@code{XID_Start} and continues with zero or more characters having derived
+property @code{XID_Continue}; and such an identifier is NFKC-normalized during
+lexing, such that all subsequent comparison of identifiers is performed on the
+NFKC-normalized forms.

-An @dfn{underscore} is the character U+005F ('_').
+@emph{TODO: define relationship between Unicode and Rust versions}.

-A @dfn{decimal digit} is a character in the range U+0030-U+0039
-(@code{'0'}-@code{'9'}).
+@footnote{This identifier syntax is a superset of the identifier syntaxes of C
+and Java, and is modeled on Python PEP #3131, which formed the definition of
+identifiers in Python 3.0 and later.}

@node Ref.Lex.Key
@subsection Ref.Lex.Key
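
The identifier rule introduced by this commit (an `XID_Start` character, then zero or more `XID_Continue` characters, with comparison done on NFKC-normalized forms) can be sketched in Python, whose own identifier definition comes from the PEP 3131 that the new footnote cites. This is only an illustration under stated assumptions, not the Rust lexer: the helper names `normalize_identifier` and `same_identifier` are hypothetical, Python's `str.isidentifier()` also accepts a leading underscore (which the new manual text does not mention), and the exclusion of keywords and reserved tokens is not modeled.

```python
import unicodedata


def normalize_identifier(text):
    """Return the NFKC-normalized form of `text`, or None if it is not an identifier.

    Assumption: str.isidentifier() is used as a stand-in for the
    XID_Start / XID_Continue check. It additionally allows '_' as a
    first character, and it does not reject keywords.
    """
    if not text.isidentifier():
        return None
    return unicodedata.normalize("NFKC", text)


def same_identifier(a, b):
    """Compare two spellings on their NFKC-normalized forms."""
    na = normalize_identifier(a)
    nb = normalize_identifier(b)
    return na is not None and na == nb


if __name__ == "__main__":
    # U+FB01 is the single-character "fi" ligature; NFKC expands it to "fi".
    print(same_identifier("\ufb01le", "file"))
    # "e" + combining acute accent versus the precomposed U+00E9.
    print(same_identifier("e\u0301", "\u00e9"))
    # A leading decimal digit is XID_Continue but not XID_Start.
    print(normalize_identifier("2x"))
```

Normalizing once at lexing time, as the new text specifies, means later stages can compare identifiers with plain string equality.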

2 comments on commit dabccad

@wmealing

Contributor

replied Sep 24, 2011

I understand the why; I just imagine that I (and other casual programmers) will find it difficult to call functions whose names use letters that can't easily be typed on a standard US 101-key keyboard. Let's hope LLVM can optimize out simple wrapper functions.

@pcwalton

Contributor

replied Sep 29, 2011

It can. :)
