This repository has been archived by the owner on Mar 20, 2023. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
commit c636d50, 0 parents
Showing 8 changed files with 251 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
root = true | ||
|
||
[*] | ||
charset = utf-8 | ||
indent_style = tab | ||
end_of_line = lf | ||
insert_final_newline = true | ||
trim_trailing_whitespace = true | ||
|
||
[{README.md,package.json,spec.html,.travis.yml}] | ||
indent_style = space | ||
indent_size = 2 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
dist | ||
|
||
# Installed npm modules | ||
node_modules | ||
|
||
# Folder view configuration files | ||
.DS_Store | ||
Desktop.ini | ||
|
||
# Thumbnail cache files | ||
._* | ||
Thumbs.db | ||
|
||
# Files that might appear on external disks | ||
.Spotlight-V100 | ||
.Trashes |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
6 |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
language: node_js | ||
after_success: | ||
- $(npm bin)/set-up-ssh --key "${encrypted_d7d41120dcf8_key}" | ||
--iv "${encrypted_d7d41120dcf8_iv}" | ||
--path-encrypted-key .travis-github-deploy-key.enc | ||
- $(npm bin)/update-branch --commands 'npm run build' | ||
--commit-message "Update gh-pages @ ${TRAVIS_COMMIT}" | ||
--directory 'dist' | ||
--distribution-branch 'gh-pages' | ||
--source-branch 'master' | ||
git: | ||
depth: 1 | ||
branches: | ||
only: | ||
- master | ||
env: | ||
global: | ||
secure: "npte8Ch8CnVYBkM0wTzQz5FF7FmrIDrnR4n/4YJ/zUo3CgaQbYeQ55eK44ge4Etu08em52BBlaBXq5eQ8kdZyLNf4pENGh/IzmSE6uvLKgsT2LV2IKXvWhpgc+D+nhDvUm0XnhuJ2HxfHCspYXDJxBzzhZUFsjvvApKwmi1L4crMum0anjmf+d1ob2E5ZFrcV5BdSv1fds7a5aOU232FDoJDp9HbTEnz5TteS1P+CyJu0R0hD7w2PZj7vU8ZaP7h+Pa7tJc1y92pcTostMw+z6FFhxpunsPWdvT4vkn5Tx7fVBQgoWSLP250/soXIaRY7fKq2qvnq7E9dRI5lqgOGzcLiuTiMHdSia+1zRxqEdPqIBEyLLKfZBAq3s77TiQOAiuwIr+dvKsTAAlbKqGrLc6kZfrvUlekHtP5C8nNhExbmBOSAs0vFK1EeavNONLVxqftMhbcxjc7+fDWe1/KtpDhSK6X/hlB/LGnYSDF5CTak01mNPDO578Be+YhC2q3Au+Ns/z0JLR6XWyd/8qRYunvHeP8eZHscJ2OyA/Aa7LWjXngXEQDsZDM80KQSlDe1/NoZAV1QEcnQ/WMWAmHhP9cx2kbQv8qh9m8yWT4QesQTm08y4MMsGtgm389VwLE5QhIb4OaGL3KSZ8IWJzeNjfGNmMFLjtltICd8vQ2b0k=" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
# ECMAScript proposal: Unicode property escapes in regular expressions | ||
|
||
## Status | ||
|
||
This proposal is in stage 0 of [the TC39 process](https://tc39.github.io/process-document/). | ||
|
||
## Motivation | ||
|
||
The Unicode Standard assigns various properties and property values to every symbol. For example, to get the set of symbols that are used exclusively in the Greek script, search the Unicode database for symbols whose `Script` property is set to `Greek`. | ||
|
||
There is currently no way to access these Unicode character properties natively in ECMAScript regular expressions. This makes it painful for developers to support full Unicode in their regular expressions. They have two options, neither of which is ideal: | ||
|
||
1. Use a library such as [XRegExp](https://github.com/slevithan/xregexp) to create the regular expressions at run-time: | ||
|
||
```js | ||
const regexGreekSymbol = XRegExp('\\p{Greek}', 'A'); | ||
regexGreekSymbol.test('π'); | ||
// → true | ||
``` | ||
|
||
The downside of this approach is that the XRegExp library is a run-time dependency, which may not be ideal for performance-sensitive applications. For usage on the web, there is an additional load-time penalty: `xregexp-all-min.js.gz` still takes up over 35 KB after minification and gzip compression. | ||
|
||
2. Use a library such as [Regenerate](https://github.com/mathiasbynens/regenerate) to generate the regular expression at build time: | ||
|
||
```js | ||
const regenerate = require('regenerate'); | ||
const codePoints = require('unicode-9.0.0/Script/Greek/code-points'); | ||
const set = regenerate(codePoints); | ||
set.toString(); | ||
// → '[\u0370-\u0373\u0375-\u0377\u037A-\u037D\u037F\u0384\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03E1\u03F0-\u03FF\u1D26-\u1D2A\u1D5D-\u1D61\u1D66-\u1D6A\u1DBF\u1F00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FC4\u1FC6-\u1FD3\u1FD6-\u1FDB\u1FDD-\u1FEF\u1FF2-\u1FF4\u1FF6-\u1FFE\u2126\uAB65]|\uD800[\uDD40-\uDD8E\uDDA0]|\uD834[\uDE00-\uDE45]' | ||
// Imagine there’s more code here to save this pattern to a file. | ||
``` | ||
|
||
This approach results in optimal run-time performance, although the generated regular expressions tend to be fairly large (which could lead to load-time performance problems on the web). The biggest downside is that it requires a build script, which becomes painful to maintain as the developer needs more Unicode-aware regular expressions. | ||
|
||
## Proposed solution | ||
|
||
We propose the addition of _Unicode property escapes_ of the form `\p{…}` and `\P{…}`. Unicode property escapes are a new type of escape sequence available in regular expressions that have the `u` flag set. With this feature, the above regular expression could be written as: | ||
|
||
```js | ||
const regexGreekSymbol = /\p{Script=Greek}/u; | ||
regexGreekSymbol.test('π'); | ||
// → true | ||
``` | ||
|
||
This proposal solves all the above-mentioned problems: | ||
|
||
* It is no longer painful to create Unicode-aware regular expressions. | ||
* There is no dependency on run-time libraries. | ||
* The regular expression patterns are compact and readable, with no more file size bloat. | ||
* Creating a script that generates the regular expression at build time is no longer necessary. |
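
As a rough illustration of the proposed syntax, the sketch below exercises `\p{…}` and `\P{…}` in a few patterns. It assumes an engine that already implements Unicode property escapes as described in this proposal; the variable names and sample strings are illustrative and not part of the proposal itself.

```js
// Assumes a JavaScript engine with support for Unicode property escapes.
// All identifiers and sample strings below are illustrative only.

// \p{…} matches symbols that have the given property value.
const regexGreekSymbol = /\p{Script=Greek}/u;
const matchesPi = regexGreekSymbol.test('π');
// → true

// \P{…} is the negated form: it matches symbols that lack the property value.
const regexNonGreekSymbol = /\P{Script=Greek}/u;
const piIsNonGreek = regexNonGreekSymbol.test('π');
// → false
const latinIsNonGreek = regexNonGreekSymbol.test('a');
// → true

// Property escapes can also appear inside character classes.
const regexGreekOrDigit = /^[\p{Script=Greek}\d]+$/u;
const mixedMatches = regexGreekOrDigit.test('αβγ123');
// → true

// Without the `u` flag, `\p` keeps its legacy meaning (an escaped `p`),
// so the pattern looks for literal text rather than Greek symbols.
const legacyMatchesPi = /\p{Script=Greek}/.test('π');
// → false

console.log({ matchesPi, piIsNonGreek, latinIsNonGreek, mixedMatches, legacyMatchesPi });
```

Restricting the new escapes to patterns with the `u` flag means existing patterns that happen to contain `\p` keep their current behavior.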
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
{ | ||
"private": true, | ||
"scripts": { | ||
"test": ":", | ||
"build": "ecmarkup --verbose spec.html dist/index.html --css dist/ecmarkup.css --js dist/ecmarkup.js" | ||
}, | ||
"devDependencies": { | ||
"@alrra/travis-scripts": "^3.0.1", | ||
"ecmarkup": "^3.2.6" | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,142 @@ | ||
<!DOCTYPE html> | ||
<meta charset="utf-8"> | ||
<pre class="metadata"> | ||
title: Unicode property escapes in regular expressions | ||
status: proposal | ||
stage: 0 | ||
location: https://mathiasbynens.github.io/es-regex-unicode-property-escapes/ | ||
copyright: false | ||
contributors: Mathias Bynens | ||
</pre> | ||
<script src="ecmarkup.js" defer></script> | ||
<link rel="stylesheet" href="ecmarkup.css"> | ||
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css"> | ||
|
||
<p><ins class="block">The syntax listed in <a href="https://tc39.github.io/ecma262/#sec-patterns">21.2.1 Patterns</a> is modified as follows:</ins></p> | ||
|
||
<emu-grammar> | ||
LeadSurrogate :: | ||
Hex4Digits [> but only if the SV of |Hex4Digits| is in the inclusive range 0xD800 to 0xDBFF] | ||
|
||
TrailSurrogate :: | ||
Hex4Digits [> but only if the SV of |Hex4Digits| is in the inclusive range 0xDC00 to 0xDFFF] | ||
|
||
NonSurrogate :: | ||
Hex4Digits [> but only if the SV of |Hex4Digits| is not in the inclusive range 0xD800 to 0xDFFF] | ||
|
||
IdentityEscape[U] :: | ||
[+U] SyntaxCharacter | ||
[+U] `/` | ||
[~U] SourceCharacter but not UnicodeIDContinue | ||
|
||
DecimalEscape :: | ||
NonZeroDigit DecimalDigits? [lookahead <! DecimalDigit] | ||
|
||
CharacterClassEscape[U] :: | ||
`d` | ||
`D` | ||
`s` | ||
`S` | ||
`w` | ||
`W` | ||
<ins class="block">[+U] `p{` UnicodePropertyValueExpression `}`</ins> | ||
<ins class="block">[+U] `P{` UnicodePropertyValueExpression `}`</ins> | ||
|
||
<ins class="block">UnicodePropertyValueExpression :: | ||
UnicodePropertyName `=` UnicodePropertyValue | ||
LoneUnicodePropertyNameOrValue</ins> | ||
|
||
CharacterClass[U] :: | ||
`[` [lookahead <! {`^`}] ClassRanges[?U] `]` | ||
`[` `^` ClassRanges[?U] `]` | ||
|
||
ClassRanges[U] :: | ||
[empty] | ||
NonemptyClassRanges[?U] | ||
|
||
NonemptyClassRanges[U] :: | ||
ClassAtom[?U] | ||
ClassAtom[?U] NonemptyClassRangesNoDash[?U] | ||
ClassAtom[?U] `-` ClassAtom[?U] ClassRanges[?U] | ||
|
||
NonemptyClassRangesNoDash[U] :: | ||
ClassAtom[?U] | ||
ClassAtomNoDash[?U] NonemptyClassRangesNoDash[?U] | ||
ClassAtomNoDash[?U] `-` ClassAtom[?U] ClassRanges[?U] | ||
|
||
ClassAtom[U] :: | ||
`-` | ||
ClassAtomNoDash[?U] | ||
|
||
ClassAtomNoDash[U] :: | ||
SourceCharacter but not one of `\` or `]` or `-` | ||
`\` ClassEscape[?U] | ||
|
||
ClassEscape[U] :: | ||
`b` | ||
[+U] `-` | ||
CharacterClassEscape | ||
CharacterEscape[?U] | ||
</emu-grammar> | ||
|
||
<hr> | ||
|
||
<p><ins class="block">The following two abstract operations are appended to <a href="https://tc39.github.io/ecma262/#sec-atom">21.2.2.8 Atom</a>.</ins></p> | ||
|
||
<emu-clause id="sec-runtime-semantics-unicodematchproperty-p" aoid="UnicodeMatchProperty"> | ||
<h1>Runtime Semantics: UnicodeMatchProperty ( _p_ )</h1> | ||
<p>The abstract operation UnicodeMatchProperty takes a string parameter _p_ and performs the following steps:</p> | ||
<emu-alg> | ||
1. If _p_ strictly matches a known Unicode property name or property alias, then | ||
1. Let _property_ be that unaliased property name. | ||
1. Return _property_. | ||
1. Else, throw a *SyntaxError* exception. | ||
</emu-alg> | ||
<emu-note> | ||
<p>Implementations must support the following Unicode properties and their property aliases as required by <a href="http://unicode.org/reports/tr18/#RL1.2">UTS18 RL1.2</a>: `General_Category`, `Script`, `Script_Extensions`, `Alphabetic`, `Uppercase`, `Lowercase`, `White_Space`, `Noncharacter_Code_Point`, `Default_Ignorable_Code_Point`, `Any`, `ASCII`, and `Assigned`. Implementations may extend Unicode property support to the remaining enumeration or binary properties.</p> | ||
</emu-note> | ||
<emu-note> | ||
<p>Only the canonical property names and property aliases listed in `PropertyAliases.txt` as well as the property names `Any`, `ASCII`, and `Assigned` must be recognized. For example, `Block` and `blk` are valid, but `block` and `Blk` aren’t.</p> | ||
</emu-note> | ||
</emu-clause> | ||
|
||
<emu-clause id="sec-runtime-semantics-unicodematchpropertyvalue-p-v" aoid="UnicodeMatchPropertyValue"> | ||
<h1>Runtime Semantics: UnicodeMatchPropertyValue ( _p_, _v_ )</h1> | ||
<p>The abstract operation UnicodeMatchPropertyValue takes two string parameters _p_ and _v_ and performs the following steps:</p> | ||
<emu-alg> | ||
1. Assert: _p_ is a canonical, unaliased Unicode property name. | ||
1. If _v_ strictly matches a known property value or property value alias for Unicode property _p_, then | ||
1. Let _value_ be that unaliased property value. | ||
1. Return _value_. | ||
1. Else, throw a *SyntaxError* exception. | ||
</emu-alg> | ||
<emu-note> | ||
<p>This algorithm differs from <a href="http://unicode.org/reports/tr44/#Matching_Symbolic">the matching rules for symbolic values listed in UAX44</a>: case, <emu-xref href="#sec-white-space">white space</emu-xref>, U+002D (HYPHEN-MINUS), and U+005F (LOW LINE) are not ignored, and the `Is` prefix is not supported.</p> | ||
</emu-note> | ||
<emu-note> | ||
<p>Only the canonical property values and property value aliases listed in `PropertyValueAliases.txt` must be recognized. For example, `Super_And_Sub` and `Superscripts_And_Subscripts` are valid `Block` values, but `super_and_sub` and `Superscripts and Subscripts` aren’t.</p> | ||
</emu-note> | ||
<emu-note> | ||
<p>Implementations must support any existing property values and their aliases for the following Unicode properties as required by <a href="http://unicode.org/reports/tr18/#RL1.2">UTS18 RL1.2</a>: `General_Category`, `Script`, and `Script_Extensions`. Implementations that extend Unicode property support to remaining enumeration properties must support any existing values (including aliases) for those properties.</p> | ||
</emu-note> | ||
</emu-clause> | ||
|
||
<hr> | ||
|
||
<p><ins class="block">The following is appended to the list of productions in <a href="https://tc39.github.io/ecma262/#sec-characterclassescape">21.2.2.12 CharacterClassEscape</a>.</ins></p> | ||
|
||
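<p>The production <emu-grammar>CharacterClassEscape :: `p{` UnicodePropertyValueExpression `}`</emu-grammar> evaluates by returning the set of all characters included in the set returned by <emu-grammar>UnicodePropertyValueExpression</emu-grammar>.</p> | ||
<p>The production <emu-grammar>CharacterClassEscape :: `P{` UnicodePropertyValueExpression `}`</emu-grammar> evaluates by returning the set of all characters not included in the set returned by <emu-grammar>UnicodePropertyValueExpression</emu-grammar>.</p> | ||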
<p>The production <emu-grammar>UnicodePropertyValueExpression :: UnicodePropertyName `=` UnicodePropertyValue</emu-grammar> evaluates as follows:</p> | ||
<emu-alg> | ||
1. Let _p_ be ? UnicodeMatchProperty(_UnicodePropertyName_). | ||
1. Let _v_ be ? UnicodeMatchPropertyValue(_p_, _UnicodePropertyValue_). | ||
1. Return the set of all characters with the value _v_ for Unicode property _p_. | ||
</emu-alg> | ||
<p>The production <emu-grammar>UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue</emu-grammar> evaluates as follows:</p> | ||
<emu-alg> | ||
1. If ? UnicodeMatchPropertyValue(`"General_Category"`, _LoneUnicodePropertyNameOrValue_) is the name of a general category in Unicode, then | ||
1. Return the set of all characters in Unicode general category _LoneUnicodePropertyNameOrValue_. | ||
1. If ? UnicodeMatchProperty(_LoneUnicodePropertyNameOrValue_) is the name of a binary property in Unicode, then | ||
1. Return the set of all characters with the value _True_ for Unicode property _LoneUnicodePropertyNameOrValue_. | ||
1. Else, throw a *SyntaxError* exception. | ||
</emu-alg> |
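
The strict matching performed by UnicodeMatchProperty and UnicodeMatchPropertyValue can be sketched in JavaScript as follows. This is a minimal illustration, not the normative algorithm: the alias tables are hypothetical, heavily truncated stand-ins for `PropertyAliases.txt` and `PropertyValueAliases.txt`, and the function and constant names are assumptions made for this example.

```js
// Hypothetical, heavily truncated alias tables. A real implementation would
// derive them from PropertyAliases.txt and PropertyValueAliases.txt.
const PROPERTY_ALIASES = new Map([
  ['General_Category', 'General_Category'],
  ['gc', 'General_Category'],
  ['Script', 'Script'],
  ['sc', 'Script'],
  ['Block', 'Block'],
  ['blk', 'Block'],
]);

const PROPERTY_VALUE_ALIASES = new Map([
  ['Script', new Map([
    ['Greek', 'Greek'],
    ['Grek', 'Greek'],
  ])],
  ['Block', new Map([
    ['Superscripts_And_Subscripts', 'Superscripts_And_Subscripts'],
    ['Super_And_Sub', 'Superscripts_And_Subscripts'],
  ])],
]);

// Sketch of UnicodeMatchProperty ( p ).
function unicodeMatchProperty(p) {
  // Map#get compares keys exactly, so case, white space, `-`, and `_` all
  // matter, mirroring the "strictly matches" wording in the algorithm.
  const property = PROPERTY_ALIASES.get(p);
  if (property === undefined) {
    throw new SyntaxError(`Unknown Unicode property: ${p}`);
  }
  return property;
}

// Sketch of UnicodeMatchPropertyValue ( p, v ). The caller is expected to
// pass a canonical property name, as the algorithm's assertion requires.
function unicodeMatchPropertyValue(p, v) {
  const values = PROPERTY_VALUE_ALIASES.get(p);
  const value = values === undefined ? undefined : values.get(v);
  if (value === undefined) {
    throw new SyntaxError(`Unknown value for Unicode property ${p}: ${v}`);
  }
  return value;
}

// Resolve the two halves of an expression such as `sc=Grek`.
const property = unicodeMatchProperty('sc');
const value = unicodeMatchPropertyValue(property, 'Grek');
console.log(`${property}=${value}`);
// → Script=Greek
```

With these tables, `unicodeMatchProperty('blk')` returns `'Block'`, while `unicodeMatchProperty('block')` throws a `SyntaxError`, matching the examples in the notes above.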