TODO items to get us into conformance with UTS18: (draft for 6.1)
## 0 Introduction
The supported Unicode features must be documented; that is the role of
this document.
## 0.1 Notation
## 0.2 Conformance
## 1.1 Hex Notation
To match Unicode character U+10450, use \x10450, or the bracketed form
\x[10450] if needed to disambiguate from following hex digits. This works
even in character classes.
## 1.1.1 Hex notation and normalization
Currently works, insofar as the compiler is using codepoint strings. If
the compiler switches to NFG mode, things will break.
## 1.2 Properties
String functions are NOT supported because I haven't found a normative list
of them.
Property matching is done by <:Foo> or <:!Foo> for Boolean properties or
<:Foo<bar>> for string properties. The latter allows use of wildcards
through the smart-matching mechanism. Property names and values support
aliases; property names are also loose-matched. Script and General_Category
support use of their values as pseudo-Boolean properties. Script_Extensions
is supported with alias scx.
Compatibility properties are handled following Annex C.
Property values are NOT loose-matched, but this does not seem to be mandated.
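As an illustration of what loose matching of property names involves, here is
a minimal Python sketch along the lines of UAX #44 rule LM3 (ignore case,
whitespace, underscores, hyphens, and an initial "is"). It is not this
implementation's code; `loose_key`, `resolve_property`, and the small `ALIASES`
table are hypothetical names invented for the example.

```python
import re

def loose_key(name: str) -> str:
    """Reduce a property name to a comparison key (hypothetical helper)."""
    # Ignore case, whitespace, underscores, and hyphens.
    key = re.sub(r"[\s_\-]", "", name).lower()
    # Ignore an initial "is" prefix, as in "IsAlphabetic".
    if key.startswith("is") and len(key) > 2:
        key = key[2:]
    return key

# Hypothetical alias table; a real one would be built from PropertyAliases.txt.
ALIASES = {
    loose_key(alias): canonical
    for alias, canonical in {
        "Script_Extensions": "Script_Extensions",
        "scx": "Script_Extensions",
        "General_Category": "General_Category",
        "gc": "General_Category",
    }.items()
}

def resolve_property(name: str) -> str:
    """Map a loosely written name or alias to its canonical property name."""
    return ALIASES[loose_key(name)]
```

With this table, `resolve_property("script-extensions")`, `resolve_property("SCX")`,
and `resolve_property("scx")` all resolve to the same property.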
*** not in any documentation I can find: isCased, isCasefolded, isLowercase,
isUppercase, isTitlecase, isNFC, isNFD, isNFKC, isNFKD, toLowercase,
toUppercase, toTitlecase, toCasefolded, toNFC, toNFD, toNFKC, toNFKD
(isXXX except isCased can be defined in terms of toXXX)
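The remark that the isXXX functions can be defined in terms of toXXX can be
sketched as follows. This is a Python illustration using the standard
`unicodedata` module, not a proposal for the actual API; the function names
simply mirror the list above, and `is_lowercase` follows the Unicode
Standard's approach of comparing against the NFD form.

```python
import unicodedata

def to_nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def to_nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

def to_lowercase(s: str) -> str:
    return s.lower()

# Each isXXX is just "toXXX leaves the string unchanged".
def is_nfc(s: str) -> bool:
    return to_nfc(s) == s

def is_nfd(s: str) -> bool:
    return to_nfd(s) == s

def is_lowercase(s: str) -> bool:
    # Case detection is applied to the NFD form of the string.
    y = to_nfd(s)
    return to_lowercase(y) == y
```

isCased does not fit this pattern because it asks whether a string contains
any cased character, which no single toXXX round trip answers.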
*** Age is NOT handled inclusively
Blocks are supported, both as <:Block<Thai>> and as <:InThai>.
## 1.3 Subtraction and Intersection
Supported syntactically as <:A & :B>, <:A | :B>, <:A ^ :B>, <:A - :B>,
<-:B>, <:A - (:B & :C)>. Precedence follows normal Perl rules, except +
and - bind very tightly.
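The set semantics behind these operators can be modelled with ordinary sets of
characters. The Python sketch below covers only the semantics, not the regex
syntax or its precedence rules; the universe is restricted to U+0000..U+00FF to
keep it small, and the names `prop`, `alpha`, `upper`, and `ascii_` are
assumptions made for the example.

```python
import unicodedata

# Small universe so that complements stay cheap to compute.
UNIVERSE = frozenset(chr(cp) for cp in range(0x100))

def prop(pred):
    """Build the set of characters in UNIVERSE satisfying a predicate."""
    return frozenset(c for c in UNIVERSE if pred(c))

alpha = prop(str.isalpha)
upper = prop(lambda c: unicodedata.category(c) == "Lu")
ascii_ = prop(lambda c: ord(c) < 0x80)

intersection = alpha & upper          # like <:A & :B>
union = upper | ascii_                # like <:A | :B>
symmetric = alpha ^ ascii_            # like <:A ^ :B>
difference = alpha - upper            # like <:A - :B>
complement = UNIVERSE - upper         # like <-:B>
nested = alpha - (upper & ascii_)     # like <:A - (:B & :C)>
```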
## 1.4 Simple Word Boundaries
*** NOT IMPLEMENTED. « » just uses word character.
## 1.5 Simple Loose Matches
*** NOT IMPLEMENTED. Currently :i is ignored for character classes, and
elsewhere it compares using lowercasing instead of casefolding.
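Why lowercasing is not a substitute for casefolding can be shown with a short
Python sketch. Note the assumption: Python's `str.casefold` applies full case
folding, which goes beyond the simple folding this section asks for, but the
final-sigma case below differs from lowercasing under either kind of folding.
The helper names are invented for the example.

```python
def eq_lower(a: str, b: str) -> bool:
    """Caseless comparison by lowercasing (what is done currently)."""
    return a.lower() == b.lower()

def eq_casefold(a: str, b: str) -> bool:
    """Caseless comparison by case folding."""
    return a.casefold() == b.casefold()

# Final sigma lowercases to itself, but folds to ordinary small sigma.
final_sigma, capital_sigma = "\u03c2", "\u03a3"
lower_matches = eq_lower(final_sigma, capital_sigma)
fold_matches = eq_casefold(final_sigma, capital_sigma)
```

Here `lower_matches` is false while `fold_matches` is true, so a matcher built
on lowercasing misses a pair that casefolding treats as equal.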
## 1.6 Line Boundaries
\n matches any newline, including CRLF; ^^ and $$ anchor likewise. . matches
any character, but does not match CRLF as a unit in codepoint mode, as this
is considered to be a grapheme issue. To match CRLF as a unit, use grapheme
mode.
*** Grapheme mode is not implemented
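For reference, the "any newline, including CRLF" behaviour of \n can be
approximated in Python's `re` module as below. This is a sketch of the
concept, not of this implementation; the newline set is the one UTS18 lists for
line boundaries (LF, VT, FF, CR, NEL, LS, PS, and CRLF), and `NEWLINE` and
`split_lines` are names made up for the example.

```python
import re

# CRLF comes first in the alternation so it is consumed as a single newline
# rather than as CR followed by LF.
NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def split_lines(text: str) -> list:
    """Split text at every Unicode line boundary."""
    return NEWLINE.split(text)
```

A string such as "a\r\nb" splits into two pieces, not three, because the CRLF
pair counts as one boundary.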
## 1.7 Code Points
All matching is done by code points. Individual surrogates are matched only
when they are isolated, never when they form part of a surrogate pair.
## 2.1 Canonical equivalence
Will be handled using grapheme mode.
## 2.2 Grapheme Clusters
<.> matches any extended grapheme cluster, <|g> matches boundaries, and
literal clusters in regexes may be matched by using a string as a component,
such as <:Alpha + "ch">.
*** Only the last works
Grapheme cluster mode will be the default, disabled by :codes.
## 2.3 Default word boundaries
## 2.4 Default loose matches
## 2.5 Name properties
\c[FOO] can match a named character. Aliases supported are NL, CR,
FF, and NEL. Control character names are supported. The three
loose match exceptions are handled.
## 2.6 Wildcards in property values
<:name(/foo/)> allows use of wildcards.
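What a wildcard over the Name property selects can be sketched in Python with
`unicodedata.name`. This only illustrates the idea of filtering code points by
a regex over their names; `chars_with_name` and its `limit` parameter are
hypothetical, and the scan is capped at ASCII by default to keep it fast.

```python
import re
import unicodedata

def chars_with_name(pattern: str, limit: int = 0x80) -> list:
    """Return characters below `limit` whose Unicode name matches `pattern`."""
    rx = re.compile(pattern)
    return [
        chr(cp)
        for cp in range(limit)
        # Unnamed code points (such as controls) get an empty name here.
        if rx.search(unicodedata.name(chr(cp), ""))
    ]

digits = chars_with_name(r"^DIGIT ")
```

Here `digits` holds the ten ASCII digits, since their names all begin with
"DIGIT ".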
## 2.7 Full Properties (draft)
All non-Unihan UCD properties are supported.
## 3
*** No tailorings are supported.
## 3.8 Unicode Set Sharing
Unicode sets can be shared by wrapping them up in a named regex.
## 3.11 Submatchers
You can write things that can be called as regexes by adding methods to
your grammar.