This repository has been archived by the owner on Mar 20, 2023. It is now read-only.

Initial commit
mathiasbynens committed Jun 9, 2016
0 parents commit c636d50
Showing 8 changed files with 251 additions and 0 deletions.
12 changes: 12 additions & 0 deletions .editorconfig
@@ -0,0 +1,12 @@
root = true

[*]
charset = utf-8
indent_style = tab
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true

[{README.md,package.json,spec.html,.travis.yml}]
indent_style = space
indent_size = 2
16 changes: 16 additions & 0 deletions .gitignore
@@ -0,0 +1,16 @@
dist

# Installed npm modules
node_modules

# Folder view configuration files
.DS_Store
Desktop.ini

# Thumbnail cache files
._*
Thumbs.db

# Files that might appear on external disks
.Spotlight-V100
.Trashes
1 change: 1 addition & 0 deletions .nvmrc
@@ -0,0 +1 @@
6
Binary file added .travis-github-deploy-key.enc
Binary file not shown.
18 changes: 18 additions & 0 deletions .travis.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
language: node_js
after_success:
- $(npm bin)/set-up-ssh --key "${encrypted_d7d41120dcf8_key}"
--iv "${encrypted_d7d41120dcf8_iv}"
--path-encrypted-key .travis-github-deploy-key.enc
- $(npm bin)/update-branch --commands 'npm run build'
--commit-message "Update gh-pages @ ${TRAVIS_COMMIT}"
--directory 'dist'
--distribution-branch 'gh-pages'
--source-branch 'master'
git:
depth: 1
branches:
only:
- master
env:
global:
secure: "npte8Ch8CnVYBkM0wTzQz5FF7FmrIDrnR4n/4YJ/zUo3CgaQbYeQ55eK44ge4Etu08em52BBlaBXq5eQ8kdZyLNf4pENGh/IzmSE6uvLKgsT2LV2IKXvWhpgc+D+nhDvUm0XnhuJ2HxfHCspYXDJxBzzhZUFsjvvApKwmi1L4crMum0anjmf+d1ob2E5ZFrcV5BdSv1fds7a5aOU232FDoJDp9HbTEnz5TteS1P+CyJu0R0hD7w2PZj7vU8ZaP7h+Pa7tJc1y92pcTostMw+z6FFhxpunsPWdvT4vkn5Tx7fVBQgoWSLP250/soXIaRY7fKq2qvnq7E9dRI5lqgOGzcLiuTiMHdSia+1zRxqEdPqIBEyLLKfZBAq3s77TiQOAiuwIr+dvKsTAAlbKqGrLc6kZfrvUlekHtP5C8nNhExbmBOSAs0vFK1EeavNONLVxqftMhbcxjc7+fDWe1/KtpDhSK6X/hlB/LGnYSDF5CTak01mNPDO578Be+YhC2q3Au+Ns/z0JLR6XWyd/8qRYunvHeP8eZHscJ2OyA/Aa7LWjXngXEQDsZDM80KQSlDe1/NoZAV1QEcnQ/WMWAmHhP9cx2kbQv8qh9m8yWT4QesQTm08y4MMsGtgm389VwLE5QhIb4OaGL3KSZ8IWJzeNjfGNmMFLjtltICd8vQ2b0k="
51 changes: 51 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# ECMAScript proposal: Unicode property escapes in regular expressions

## Status

This proposal is in stage 0 of [the TC39 process](https://tc39.github.io/process-document/).

## Motivation

The Unicode Standard assigns various properties and property values to every symbol. For example, to get the set of symbols that are used exclusively in the Greek script, search the Unicode database for symbols whose `Script` property is set to `Greek`.

There currently is no way to access these Unicode character properties natively in ECMAScript regular expressions. This makes it painful for developers to support full Unicode in their regular expressions. They currently have two options, neither of which is ideal:

1. Use a library such as [XRegExp](https://github.com/slevithan/xregexp) to create the regular expressions at run-time:

```js
const regexGreekSymbol = XRegExp('\\p{Greek}', 'A');
regexGreekSymbol.test('π');
// → true
```

The downside of this approach is that the XRegExp library is a run-time dependency, which may not be ideal for performance-sensitive applications. For usage on the web, there is an additional performance penalty: `xregexp-all-min.js.gz` takes up over 35 KB, even after minification and gzip compression.

2. Use a library such as [Regenerate](https://github.com/mathiasbynens/regenerate) to generate the regular expression at build time:

```js
const regenerate = require('regenerate');
const codePoints = require('unicode-9.0.0/Script/Greek/code-points');
const set = regenerate(codePoints);
set.toString();
// → '[\u0370-\u0373\u0375-\u0377\u037A-\u037D\u037F\u0384\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03E1\u03F0-\u03FF\u1D26-\u1D2A\u1D5D-\u1D61\u1D66-\u1D6A\u1DBF\u1F00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FC4\u1FC6-\u1FD3\u1FD6-\u1FDB\u1FDD-\u1FEF\u1FF2-\u1FF4\u1FF6-\u1FFE\u2126\uAB65]|\uD800[\uDD40-\uDD8E\uDDA0]|\uD834[\uDE00-\uDE45]'
// Imagine there’s more code here to save this pattern to a file.
```

This approach results in optimal run-time performance, although the generated regular expressions tend to be fairly large in size (which could lead to performance problems on the web). The biggest downside is that it requires a build script, which gets painful as the developer needs more Unicode-aware regular expressions.

## Proposed solution

We propose the addition of _Unicode property escapes_ of the form `\p{…}` and `\P{…}`. Unicode property escapes are a new type of escape sequence available in regular expressions that have the `u` flag set. With this feature, the above regular expression could be written as:

```js
const regexGreekSymbol = /\p{Script=Greek}/u;
regexGreekSymbol.test('π');
// → true
```

This proposal solves all of the above-mentioned problems:

* It is no longer painful to create Unicode-aware regular expressions.
* There is no dependency on run-time libraries.
* The regular expression patterns are compact and readable, with no more file size bloat.
* Creating a script that generates the regular expression at build time is no longer necessary.
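
As a rough illustration of the proposed syntax, the sketch below uses both `\p{…}` and its negated form `\P{…}`. It assumes a JavaScript engine that already supports Unicode property escapes with the `u` flag (for example, a recent Node.js release); the variable names are made up for this example.

```js
// `\p{…}` matches symbols that have the given property value;
// `\P{…}` matches symbols that do not. Both require the `u` flag.
const regexGreekRun = /\p{Script=Greek}+/u;
const regexNonGreekSymbol = /\P{Script=Greek}/u;

// Extract the first run of Greek-script symbols from mixed text.
const firstGreekRun = 'abc αβγ def'.match(regexGreekRun)[0];
console.log(firstGreekRun);
// → 'αβγ'

// Check whether a string contains anything outside the Greek script.
const pureGreekHasOther = regexNonGreekSymbol.test('αβγ');
const mixedHasOther = regexNonGreekSymbol.test('αβγ!');
console.log(pureGreekHasOther, mixedHasOther);
// → false true

// Without the `u` flag the escape has no special meaning, so the
// pattern does not match a Greek symbol.
const legacyResult = /\p{Script=Greek}/.test('π');
console.log(legacyResult);
// → false
```

The last check shows why the feature is tied to the `u` flag: existing patterns without that flag keep their current behavior.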
11 changes: 11 additions & 0 deletions package.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
{
"private": true,
"scripts": {
"test": ":",
"build": "ecmarkup --verbose spec.html dist/index.html --css dist/ecmarkup.css --js dist/ecmarkup.js"
},
"devDependencies": {
"@alrra/travis-scripts": "^3.0.1",
"ecmarkup": "^3.2.6"
}
}
142 changes: 142 additions & 0 deletions spec.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
<!DOCTYPE html>
<meta charset="utf-8">
<pre class="metadata">
title: Unicode property escapes in regular expressions
status: proposal
stage: 0
location: https://mathiasbynens.github.io/es-regex-unicode-property-escapes/
copyright: false
contributors: Mathias Bynens
</pre>
<script src="ecmarkup.js" defer></script>
<link rel="stylesheet" href="ecmarkup.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css">

<p><ins class="block">The syntax listed in <a href="https://tc39.github.io/ecma262/#sec-patterns">21.2.1 Patterns</a> is modified as follows:</ins></p>

<emu-grammar>
LeadSurrogate ::
Hex4Digits [> but only if the SV of |Hex4Digits| is in the inclusive range 0xD800 to 0xDBFF]

TrailSurrogate ::
Hex4Digits [> but only if the SV of |Hex4Digits| is in the inclusive range 0xDC00 to 0xDFFF]

NonSurrogate ::
Hex4Digits [> but only if the SV of |Hex4Digits| is not in the inclusive range 0xD800 to 0xDFFF]

IdentityEscape[U] ::
[+U] SyntaxCharacter
[+U] `/`
[~U] SourceCharacter but not UnicodeIDContinue

DecimalEscape ::
NonZeroDigit DecimalDigits? [lookahead &lt;! DecimalDigit]

CharacterClassEscape[U] ::
`d`
`D`
`s`
`S`
`w`
`W`
<ins class="block">[+U] `p{` UnicodePropertyValueExpression `}`</ins>
<ins class="block">[+U] `P{` UnicodePropertyValueExpression `}`</ins>

<ins class="block">UnicodePropertyValueExpression ::
UnicodePropertyName `=` UnicodePropertyValue
LoneUnicodePropertyNameOrValue</ins>

CharacterClass[U] ::
`[` [lookahead &lt;! {`^`}] ClassRanges[?U] `]`
`[` `^` ClassRanges[?U] `]`

ClassRanges[U] ::
[empty]
NonemptyClassRanges[?U]

NonemptyClassRanges[U] ::
ClassAtom[?U]
ClassAtom[?U] NonemptyClassRangesNoDash[?U]
ClassAtom[?U] `-` ClassAtom[?U] ClassRanges[?U]

NonemptyClassRangesNoDash[U] ::
ClassAtom[?U]
ClassAtomNoDash[?U] NonemptyClassRangesNoDash[?U]
ClassAtomNoDash[?U] `-` ClassAtom[?U] ClassRanges[?U]

ClassAtom[U] ::
`-`
ClassAtomNoDash[?U]

ClassAtomNoDash[U] ::
SourceCharacter but not one of `\` or `]` or `-`
`\` ClassEscape[?U]

ClassEscape[U] ::
`b`
[+U] `-`
CharacterClassEscape[?U]
CharacterEscape[?U]
</emu-grammar>

<hr>

<p><ins class="block">The following two abstract operations are appended to <a href="https://tc39.github.io/ecma262/#sec-atom">21.2.2.8 Atom</a>.</ins></p>

<emu-clause id="sec-runtime-semantics-unicodematchproperty-p" aoid="UnicodeMatchProperty">
<h1>Runtime Semantics: UnicodeMatchProperty ( _p_ )</h1>
<p>The abstract operation UnicodeMatchProperty takes a string parameter _p_ and performs the following steps:</p>
<emu-alg>
1. If _p_ strictly matches a known Unicode property name or property alias, then
  1. Let _property_ be that unaliased property name.
  1. Return _property_.
1. Else, throw a *SyntaxError* exception.
</emu-alg>
<emu-note>
<p>Implementations must support the following Unicode properties and their property aliases as required by <a href="http://unicode.org/reports/tr18/#RL1.2">UTS18 RL1.2</a>: `General_Category`, `Script`, `Script_Extensions`, `Alphabetic`, `Uppercase`, `Lowercase`, `White_Space`, `Noncharacter_Code_Point`, `Default_Ignorable_Code_Point`, `Any`, `ASCII`, and `Assigned`. Implementations may extend Unicode property support to the remaining enumeration or binary properties.</p>
</emu-note>
<emu-note>
<p>Only the canonical property names and property aliases listed in `PropertyAliases.txt` as well as the property names `Any`, `ASCII`, and `Assigned` must be recognized. For example, `Block` and `blk` are valid, but `block` and `Blk` aren’t.</p>
</emu-note>
</emu-clause>

<emu-clause id="sec-runtime-semantics-unicodematchpropertyvalue-p-v" aoid="UnicodeMatchPropertyValue">
<h1>Runtime Semantics: UnicodeMatchPropertyValue ( _p_, _v_ )</h1>
<p>The abstract operation UnicodeMatchPropertyValue takes two string parameters _p_ and _v_ and performs the following steps:</p>
<emu-alg>
1. Assert: _p_ is a canonical, unaliased Unicode property name.
1. If _v_ strictly matches a known property value or property value alias for Unicode property _p_, then
  1. Let _value_ be that unaliased property value.
  1. Return _value_.
1. Else, throw a *SyntaxError* exception.
</emu-alg>
<emu-note>
<p>This algorithm differs from <a href="http://unicode.org/reports/tr44/#Matching_Symbolic">the matching rules for symbolic values listed in UAX44</a>: case, <emu-xref href="#sec-white-space">white space</emu-xref>, U+002D (HYPHEN-MINUS), and U+005F (LOW LINE) are not ignored, and the `Is` prefix is not supported.</p>
</emu-note>
<emu-note>
<p>Only the canonical property values and property value aliases listed in `PropertyValueAliases.txt` must be recognized. For example, `Super_And_Sub` and `Superscripts_And_Subscripts` are valid `Block` values, but `super_and_sub` and `Superscripts and Subscripts` aren’t.</p>
</emu-note>
<emu-note>
<p>Implementations must support any existing property values and their aliases for the following Unicode properties as required by <a href="http://unicode.org/reports/tr18/#RL1.2">UTS18 RL1.2</a>: `General_Category`, `Script`, and `Script_Extensions`. Implementations that extend Unicode property support to remaining enumeration properties must support any existing values (including aliases) for those properties.</p>
</emu-note>
</emu-clause>

<hr>

<p><ins class="block">The following is appended to the list of productions in <a href="https://tc39.github.io/ecma262/#sec-characterclassescape">21.2.2.12 CharacterClassEscape</a>.</ins></p>

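<p>The production <emu-grammar>CharacterClassEscape :: `p{` UnicodePropertyValueExpression `}`</emu-grammar> evaluates by returning the set of all characters included in the set returned by <emu-grammar>UnicodePropertyValueExpression</emu-grammar>.</p>
<p>The production <emu-grammar>CharacterClassEscape :: `P{` UnicodePropertyValueExpression `}`</emu-grammar> evaluates by returning the set of all characters not included in the set returned by <emu-grammar>UnicodePropertyValueExpression</emu-grammar>.</p>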
<p>The production <emu-grammar>UnicodePropertyValueExpression :: UnicodePropertyName `=` UnicodePropertyValue</emu-grammar> evaluates as follows:</p>
<emu-alg>
1. Let _p_ be ? UnicodeMatchProperty(_UnicodePropertyName_).
1. Let _v_ be ? UnicodeMatchPropertyValue(_p_, _UnicodePropertyValue_).
1. Return the set of all characters with the value _v_ for Unicode property _p_.
</emu-alg>
<p>The production <emu-grammar>UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue</emu-grammar> evaluates as follows:</p>
<emu-alg>
1. If ? UnicodeMatchPropertyValue(`"General_Category"`, _LoneUnicodePropertyNameOrValue_) is the name of a general category in Unicode, then
  1. Return the set of all characters in Unicode general category _LoneUnicodePropertyNameOrValue_.
1. If ? UnicodeMatchProperty(_LoneUnicodePropertyNameOrValue_) is the name of a binary property in Unicode, then
  1. Return the set of all characters with the value _True_ for Unicode property _LoneUnicodePropertyNameOrValue_.
1. Else, throw a *SyntaxError* exception.
</emu-alg>
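
The strict matching performed by UnicodeMatchProperty and UnicodeMatchPropertyValue, and the way a UnicodePropertyValueExpression is resolved, can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not the normative algorithm: the alias tables are hypothetical and heavily truncated (real data would come from `PropertyAliases.txt` and `PropertyValueAliases.txt`), the helper `resolvePropertyExpression` is invented for this example, and the lone-name case is read as "try a `General_Category` value first, then a binary property, otherwise fail".

```js
// Hypothetical, heavily truncated alias tables (assumption for this sketch).
const PROPERTY_ALIASES = new Map([
  ['General_Category', 'General_Category'],
  ['gc', 'General_Category'],
  ['Script', 'Script'],
  ['sc', 'Script'],
  ['Alphabetic', 'Alphabetic'],
  ['Alpha', 'Alphabetic'],
]);
const VALUE_ALIASES = new Map([
  ['General_Category', new Map([
    ['Uppercase_Letter', 'Uppercase_Letter'],
    ['Lu', 'Uppercase_Letter'],
  ])],
  ['Script', new Map([
    ['Greek', 'Greek'],
    ['Grek', 'Greek'],
  ])],
]);
const BINARY_PROPERTIES = new Set(['Alphabetic']);

// Strict lookup: no case folding, no stripping of spaces, `-`, or `_`.
function unicodeMatchProperty(p) {
  const property = PROPERTY_ALIASES.get(p);
  if (property === undefined) throw new SyntaxError(`Unknown property: ${p}`);
  return property;
}

// Strict lookup of a value (or value alias) for a canonical property name.
function unicodeMatchPropertyValue(p, v) {
  const values = VALUE_ALIASES.get(p);
  const value = values === undefined ? undefined : values.get(v);
  if (value === undefined) throw new SyntaxError(`Unknown value for ${p}: ${v}`);
  return value;
}

// Resolves the text between `\p{` and `}` to a canonical property/value pair.
function resolvePropertyExpression(source) {
  const separator = source.indexOf('=');
  if (separator !== -1) {
    const property = unicodeMatchProperty(source.slice(0, separator));
    const value = unicodeMatchPropertyValue(property, source.slice(separator + 1));
    return { property, value };
  }
  // Lone name: try a General_Category value first, then a binary property.
  const categories = VALUE_ALIASES.get('General_Category');
  if (categories.has(source)) {
    return { property: 'General_Category', value: categories.get(source) };
  }
  const property = unicodeMatchProperty(source);
  if (BINARY_PROPERTIES.has(property)) return { property, value: 'True' };
  throw new SyntaxError(`Not a binary property: ${source}`);
}
```

With these tables, `resolvePropertyExpression('sc=Grek')` yields the canonical pair `Script`/`Greek`, `resolvePropertyExpression('Lu')` resolves to `General_Category`/`Uppercase_Letter`, and `resolvePropertyExpression('script=Greek')` throws a `SyntaxError` because matching is case-sensitive.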
