Skip to content
Fetching contributors…
Cannot retrieve contributors at this time
128 lines (77 sloc) 3.49 KB
TODO items to get us into conformance with UTS18: (draft for 6.1)
## 0 Introduction
Document what Unicode features are supported: That is the role of
this document.
## 0.1 Notation
## 0.2 Conformance
## 1.1 Hex Notation
To match Unicode character U+10450, use \x10450, or \x[] if needed to
disambiguate. This works even in char classes.
## 1.1.1 Hex notation and normalization
Currently works, insofar as the compiler is using codepoint strings. If
the compiler switches to NFG mode, things will break.
## 1.2 Properties
String functions are NOT supported because I haven't found a normative list
of them.
Property matching is done by <:Foo> or <:!Foo> for Boolean properties or
<:Foo<bar>> for string properties. The latter allows use of wildcards
through the smart-matching mechanism. Property names and values support
aliases; property names are also loose-matched. Script and General_Category
support use of their values as pseudo-Boolean properties. Script_Extensions
is supported with alias scx.
Compatibility properties are handled following Annex C.
Property values are NOT loose-matched, but this does not seem to be mandated.
*** not in any documentation I can find: isCased, isCasefolded, isLowercase,
isUppercase, isTitlecase, isNFC, isNFD, isNFKC, isNFKD, toLowercase,
toUppercase, toTitlecase, toCasefolded, toNFC, toNFD, toNFKC, toNFKD
(isXXX except isCased can be defined in terms of toXXX)
*** Age is NOT handled inclusively
Blocks are supported, both as <:Block<Thai>> and as <:InThai>.
## 1.3 Subtraction and Intersection
Supported syntactically as <:A & :B>, <:A | :B>, <:A ^ :B>, <:A - :B>,
<-:B>, <:A - (:B & :C)>. Precedence follows normal Perl rules, except +
and - bind very tightly.
## 1.4 Simple Word Boundaries
*** NOT IMPLEMENTED. « » just uses word character.
## 1.5 Simple Loose Matches
*** NOT IMPLEMENTED. Currently :i is ignored for character classes, and uses
lowercasing instead of casefolding.
## 1.6 Line Boundaries
\n matches any newline, including CRLF; ^^ and $$ anchor likewise. . matches
any character, but does not match CRLF as a unit in codepoint mode, as this
is considered to be a grapheme issue. To match CRLF as a unit, use grapheme
*** Grapheme mode is not implemented
## 1.7 Code Points
All matching is done by code points. Single surrogates are not matched,
except if they are isolated.
## 2.1 Canonical equivalence
Will be handled using grapheme mode.
## 2.2 Grapheme Clusers
<.> matches any extended grapheme cluster, <|g> matches boundaries, and
literal clusters in regexes may be matched by using a string as a component,
such as <:Alpha + "ch">.
*** Only the last works
Grapheme cluster mode will be the default, disabled by :codes.
## 2.3 Default word boundaries
## 2.4 Default loose matches
## 2.5 Name properties
\c[FOO] can match a named character. Aliases supported are NL, CR,
FF, and NEL. Control character names are supported. The three
loose match exceptions are handled.
## 2.6 Wildcards in property values
<:name(/foo/)> allows use of wildcards.
## 2.7 Full Properties (draft)
All non-Unihan UCD properties are supported.
## 3
*** All tailorings are not supported.
## 3.8 Unicode Set Sharing
Unicode sets can be shared by wrapping them up in a named regex.
## 3.11 Submatchers
You can write things that can be called as regexes by adding methods to
your grammar.
Jump to Line
Something went wrong with that request. Please try again.