docs/TODO.UTS18

TODO items to get us into conformance with UTS18:
    http://www.unicode.org/reports/tr18/
    http://www.unicode.org/reports/tr18/proposed.html (draft for 6.1)

## 0 Introduction

Requirement: document which Unicode features are supported.  This document
fills that role.

## 0.1 Notation

## 0.2 Conformance

## 1.1 Hex Notation

To match Unicode character U+10450, use \x10450, or the bracketed form
\x[10450] if needed to disambiguate from following hex digits.  This works
even in character classes.
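
Why the bracketed form matters can be shown with a small Python sketch.
This is not the compiler's parser; expand_hex_escapes is a hypothetical
helper that assumes an unbracketed escape greedily consumes every hex digit
that follows it, which is the ambiguity the brackets resolve.

```python
import re

# Hypothetical helper: expands \xHHHH and \x[HHHH] escapes into characters.
# Assumption: the unbracketed form greedily takes all following hex digits.
_HEX_ESCAPE = re.compile(r"\\x(?:\[([0-9A-Fa-f]+)\]|([0-9A-Fa-f]+))")

def expand_hex_escapes(text: str) -> str:
    def replace(match):
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))
    return _HEX_ESCAPE.sub(replace, text)

# One character: U+10450.
shavian = expand_hex_escapes(r"\x10450")
# Two characters: U+1045 followed by a literal "0".
bracketed = expand_hex_escapes(r"\x[1045]0")
```

Without the brackets there would be no way to write U+1045 immediately
followed by a digit.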

## 1.1.1 Hex notation and normalization

Currently works, insofar as the compiler is using codepoint strings.  If
the compiler switches to NFG mode, things will break.

## 1.2 Properties

String functions are NOT supported because I haven't found a normative list
of them.

Property matching is done by <:Foo> or <:!Foo> for Boolean properties or
<:Foo<bar>> for string properties.  The latter allows use of wildcards
through the smart-matching mechanism.  Property names and values support
aliases; property names are also loose-matched.  Script and General_Category
support use of their values as pseudo-Boolean properties.  Script_Extensions
is supported with alias scx.
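
Loose matching of property names can be sketched in Python as below.  The
sketch assumes the rule in use is UAX44-LM3 (ignore case, whitespace,
underscores, hyphens, and an initial "is"); this document does not say
which rule the compiler follows, and loose_key and same_property are
hypothetical names.

```python
import re

def loose_key(name: str) -> str:
    # Drop whitespace, underscores and hyphens, then ignore case.
    key = re.sub(r"[\s_\-]+", "", name).lower()
    # UAX44-LM3 also ignores an initial "is" prefix (assumed rule).
    if key.startswith("is") and len(key) > 2:
        key = key[2:]
    return key

def same_property(a: str, b: str) -> bool:
    # Two spellings name the same property if their loose keys agree.
    return loose_key(a) == loose_key(b)
```

Under this rule "General_Category", "general category" and
"generalcategory" all compare equal, while "Script" and "Script_Extensions"
do not.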

Compatibility properties are handled following Annex C.

Property values are NOT loose-matched, but this does not seem to be mandated.

*** not in any documentation I can find: isCased, isCasefolded, isLowercase,
    isUppercase, isTitlecase, isNFC, isNFD, isNFKC, isNFKD, toLowercase,
    toUppercase, toTitlecase, toCasefolded, toNFC, toNFD, toNFKC, toNFKD
    (isXXX except isCased can be defined in terms of toXXX)
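
The remark that the isXXX functions can be defined in terms of toXXX can be
illustrated with a Python sketch.  It assumes the definitions from the
Unicode Standard's default case detection (test the NFD form of the string
against its own mapping) and uses Python's str.lower, str.upper and
str.casefold as stand-ins for toLowercase, toUppercase and toCasefolded.
Nothing here is the compiler's implementation.

```python
import unicodedata

def to_nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

# isNFC / isNFD: a string is in a form if normalizing leaves it unchanged.
def is_nfc(s: str) -> bool:
    return unicodedata.normalize("NFC", s) == s

def is_nfd(s: str) -> bool:
    return to_nfd(s) == s

# Case predicates: compare the NFD form with its own case mapping.
def is_lowercase(s: str) -> bool:
    y = to_nfd(s)
    return y.lower() == y

def is_uppercase(s: str) -> bool:
    y = to_nfd(s)
    return y.upper() == y

def is_casefolded(s: str) -> bool:
    y = to_nfd(s)
    return y.casefold() == y
```

isCased is the exception noted above: it asks whether the string contains
any cased character, which no toXXX comparison expresses directly.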

*** Age is NOT handled inclusively
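
For reference, UTS18 treats Age inclusively: a test for age 3.0 should match
every character assigned in version 3.0 or earlier.  A minimal Python sketch
of that comparison follows.  The AGE table is a tiny illustrative subset
written by hand (a real implementation would read DerivedAge.txt), and
matches_age is a hypothetical name.

```python
# Illustrative subset only: code point -> version in which it was assigned.
AGE = {
    0x0041: (1, 1),   # LATIN CAPITAL LETTER A
    0x20AC: (2, 1),   # EURO SIGN
    0x10450: (4, 0),  # first Shavian letter
    0x1F600: (6, 1),  # GRINNING FACE
}

def parse_version(text: str) -> tuple:
    # "3.0" -> (3, 0), so versions compare numerically, not as strings.
    return tuple(int(part) for part in text.split("."))

def matches_age(codepoint: int, version: str) -> bool:
    assigned = AGE.get(codepoint)
    # Inclusive test: assigned in this version or any earlier one.
    return assigned is not None and assigned <= parse_version(version)
```

An exact-match test would instead compare with ==, which is the
non-inclusive behaviour flagged above.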

Blocks are supported, both as <:Block<Thai>> and as <:InThai>.

## 1.3 Subtraction and Intersection

Supported syntactically as <:A & :B>, <:A | :B>, <:A ^ :B>, <:A - :B>,
<-:B>, <:A - (:B & :C)>.  Precedence follows normal Perl rules, except +
and - bind very tightly.
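
The set semantics behind these operators can be sketched with Python sets
over ASCII.  This only illustrates what each operator computes; Python's own
operator precedence applies in the code and should not be read as the
precedence described above.  The helper with_category is hypothetical.

```python
import unicodedata

# ASCII only, to keep the sketch small.
UNIVERSE = {chr(cp) for cp in range(0x80)}

def with_category(prefix: str) -> set:
    # Characters whose General_Category starts with the given prefix.
    return {c for c in UNIVERSE if unicodedata.category(c).startswith(prefix)}

letters = with_category("L")
upper = with_category("Lu")
digits = with_category("Nd")

alnum = letters | digits               # union, like <:A | :B>
nothing = letters & digits             # intersection, like <:A & :B>
either = upper ^ letters               # symmetric difference, like <:A ^ :B>
lower_only = letters - upper           # difference, like <:A - :B>
non_digits = UNIVERSE - digits         # complement, like <-:B>
nested = letters - (upper & letters)   # grouping, like <:A - (:B & :C)>
```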

## 1.4 Simple Word Boundaries

*** NOT IMPLEMENTED.  « and » just test for word characters.

## 1.5 Simple Loose Matches

*** NOT IMPLEMENTED.  Currently :i is ignored for character classes, and
    elsewhere it uses lowercasing instead of casefolding.
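
The difference between lowercasing and casefolding shows up with characters
that are already lowercase but fold to another letter.  The Python sketch
below uses str.casefold, which performs full case folding, whereas this
section of UTS18 asks for simple folding; the two examples were picked
because their folding is one-to-one, so the distinction does not affect
them.  match_lower and match_fold are hypothetical names.

```python
def match_lower(a: str, b: str) -> bool:
    # Caseless comparison by lowercasing both sides.
    return a.lower() == b.lower()

def match_fold(a: str, b: str) -> bool:
    # Caseless comparison by casefolding both sides.
    return a.casefold() == b.casefold()

FINAL_SIGMA = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"         # GREEK SMALL LETTER SIGMA
LONG_S = "\u017f"        # LATIN SMALL LETTER LONG S

# Both pairs fail under lowercasing but succeed under casefolding.
sigma_by_lower = match_lower(FINAL_SIGMA, SIGMA)
sigma_by_fold = match_fold(FINAL_SIGMA, SIGMA)
long_s_by_lower = match_lower(LONG_S, "s")
long_s_by_fold = match_fold(LONG_S, "s")
```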

## 1.6 Line Boundaries

\n matches any newline, including CRLF; ^^ and $$ anchor likewise.  . matches
any character, but does not match CRLF as a unit in codepoint mode, as this
is considered to be a grapheme issue.  To match CRLF as a unit, use grapheme
mode.

*** Grapheme mode is not implemented
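
Treating CRLF as a single newline can be sketched with Python's re module.
The set of line terminators is the one listed in UTS18 (LF, VT, FF, CR, NEL,
LS, PS, plus the CRLF sequence); the pattern and the split_lines helper are
illustrative, not the compiler's definition of \n.

```python
import re

# CRLF comes first in the alternation so it is consumed as one unit
# rather than as a CR followed by a separate LF.
NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def split_lines(text: str) -> list:
    return NEWLINE.split(text)

parts = split_lines("a\r\nb\nc\u2028d\x85e")
```

If the alternatives were ordered the other way round, "a\r\nb" would split
into three pieces with an empty string in the middle.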

## 1.7 Code Points

All matching is done by code points.  Surrogate code points are matched only
when they are isolated, that is, not part of a surrogate pair.

## 2.1 Canonical equivalence

Will be handled using grapheme mode.

*** NOT IMPLEMENTED
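
What canonical equivalence requires can be shown in Python by comparing
normalized forms.  This only illustrates the equivalence relation; it says
nothing about how grapheme mode will implement it, and canonically_equal is
a hypothetical name.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    # Two strings are canonically equivalent if their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e9"     # e with acute as one code point
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT
```

A codepoint-by-codepoint comparison sees these two strings as different,
while canonically_equal treats them as the same text.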

## 2.2 Grapheme Clusters

<.> matches any extended grapheme cluster, <|g> matches boundaries, and
literal clusters in regexes may be matched by using a string as a component,
such as <:Alpha + "ch">.

*** Only the last of these (strings as class components) works

Grapheme cluster mode will be the default, disabled by :codes.

## 2.3 Default word boundaries

*** NOT IMPLEMENTED

## 2.4 Default loose matches

*** NOT IMPLEMENTED

## 2.5 Name properties

\c[FOO] can match a named character.  Aliases supported are NL, CR,
FF, and NEL.  Control character names are supported.  The three
loose match exceptions are handled.

## 2.6 Wildcards in property values

<:name(/foo/)> allows use of wildcards.
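
The idea of matching a property value against a pattern can be sketched in
Python with unicodedata.name.  The search is limited to ASCII to keep it
cheap, and chars_with_name is a hypothetical helper, not part of any
library or of the compiler.

```python
import re
import unicodedata

def chars_with_name(pattern: str, limit: int = 0x80) -> list:
    # Collect characters below `limit` whose Name property matches `pattern`.
    rx = re.compile(pattern)
    return [
        chr(cp)
        for cp in range(limit)
        # Unnamed code points (such as controls) get "" and never match.
        if rx.search(unicodedata.name(chr(cp), ""))
    ]

abc = chars_with_name(r"^LATIN SMALL LETTER [A-C]$")
one_two = chars_with_name(r"DIGIT (ONE|TWO)$")
```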

## 2.7 Full Properties (draft)

All non-Unihan UCD properties are supported.

## 3

*** No tailorings are supported.

## 3.8 Unicode Set Sharing

Unicode sets can be shared by wrapping them up in a named regex.

## 3.11 Submatchers

You can write things that can be called as regexes by adding methods to
your grammar.