Named capture groups for JavaScript RegExps
HTML
Clone or download
Pull request Compare This branch is 61 commits ahead of goyakin:master.
Permalink
Failed to load latest commit information.
README.md Update README.md Jan 24, 2018
spec.html Editorial: Fix two minor issues from @anba Dec 12, 2017

README.md

RegExp Named Capture Groups

Stage 4

Champions: Daniel Ehrenberg (Igalia) & Mathias Bynens (Google)

Introduction

Numbered capture groups allow one to refer to certain portions of a string that a regular expression matches. Each capture group is assigned a unique number and can be referenced using that number, but this can make a regular expression hard to grasp and refactor.

For example, given /(\d{4})-(\d{2})-(\d{2})/ that matches a date, one cannot be sure which group corresponds to the month and which one is the day without examining the surrounding code. Also, if one wants to swap the order of the month and the day, the group references should also be updated.

Named capture groups provide a nice solution for these issues.

High Level API

A capture group can be given a name using the (?<name>...) syntax, for any identifier name. The regular expression for a date then can be written as /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u. Each name should be unique and follow the grammar for ECMAScript IdentifierName.

Named groups can be accessed from properties of a groups property of the regular expression result. Numbered references to the groups are also created, just as for non-named groups. For example:

let re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;
let result = re.exec('2015-01-02');
// result.groups.year === '2015';
// result.groups.month === '01';
// result.groups.day === '02';

// result[0] === '2015-01-02';
// result[1] === '2015';
// result[2] === '01';
// result[3] === '02';

The interface interacts nicely with destructuring, as in the following example:

let {groups: {one, two}} = /^(?<one>.*):(?<two>.*)$/u.exec('foo:bar');
console.log(`one: ${one}, two: ${two}`);  // prints one: foo, two: bar

Backreferences

A named group can be accessed within a regular expression via the \k<name> construct. For example,

let duplicate = /^(?<half>.*).\k<half>$/u;
duplicate.test('a*b'); // false
duplicate.test('a*a'); // true

Named references can also be used simultaneously with numbered references.

let triplicate = /^(?<part>.*).\k<part>.\1$/u;
triplicate.test('a*a*a'); // true
triplicate.test('a*a*b'); // false

Replacement targets

Named groups can be referenced from the replacement value passed to String.prototype.replace too. If the value is a string, named groups can be accessed using the $<name> syntax. For example:

let re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;
let result = '2015-01-02'.replace(re, '$<day>/$<month>/$<year>');
// result === '02/01/2015'

Note that an ordinary string literal, not a template literal, is passed into replace, as that method will resolve the values of day etc rather than having them as local variables. An alternative would be to use ${day} syntax (while remaining not a template string); this proposal uses $<day> to draw a parallel to the definition of the group and a distinction from template literals.

If the second argument to String.prototype.replace is a function, then the named groups can be accessed via a new parameter called groups. The new signature would be function (matched, capture1, ..., captureN, position, S, groups). Named captures would still participate in numbering, as usual. For example:

let re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;
let result = '2015-01-02'.replace(re, (...args) => {
 let {day, month, year} = args[args.length - 1];
 return `${day}/${month}/${year}`;
});
// result === '02/01/2015'

Details

Overlapping group names

RegExp result objects have some non-numerical properties already, which named capture groups may overlap with, namely length, index and input. In this proposal, to avoid ambiguity and edge cases around overlapping names, named group properties are placed on a separate groups object which is a property of the match object. This solution will permit additional properties to be placed on the result of exec in future ECMAScript versions without creating any web compatibility hazards.

The groups object is only created for RegExps with named groups. It does not include numbered group properties, only the named ones. Properties are created on the groups object for all groups which are mentioned in the RegExp; if they are not encountered in the match, the value is undefined.

Backwards compatibility of new syntax

The syntax for creating a new named group, /(?<name>)/, is currently a syntax error in ECMAScript RegExps, so it can be added to all RegExps without ambiguity. However, the named backreference syntax, /\k<foo>/, is currently permitted in non-Unicode RegExps and matches the literal string "k<foo>". In Unicode RegExps, such escapes are banned.

In this proposal, \k<foo> in non-Unicode RegExps will continue to match the literal string "k<foo>" unless the RegExp contains a named group, in which case it will match that group or be a syntax error, depending on whether or not the RegExp has a named group named foo. This does not affect existing code, since no currently valid RegExp can have a named group. It would be a refactoring hazard, although only for code which contained \k in a RegExp.

Precedent in other programming languages

This proposal is analogous to what many other programming languages have done for named capture groups. It seems to be what the consensus syntax is moving towards, though the Python syntax is an interesting and compelling outlier which would address the non-Unicode backreference issue.

Perl ref

Perl uses the same syntax as this proposal for named capture groups /(?<name>)/ and backreferences /\k<name>/.

Python ref

Named captures have the syntax "(?P<name>)" and have backrereferences with (?P=name).

Java ref

JDK7+ supports named capture groups with syntax like Perl and this proposal.

.NET ref

C# and VB.NET support named capture groups with the syntax "(?<name>)" as well as "(?'name')" and backreferences with "\k<name>".

PHP

According to a Stack Overflow post and a comment on php.net docs, PHP has long supported named groups with the syntax "(?P<foo>)", which is available as a property in the resulting match object.

Ruby ref

Ruby's syntax is identical to .NET, with named capture groups with the syntax "(?<name>)" as well as "(?'name')" and backreferences with "\k<name>".

Draft specification

Draft spec

Implementations