Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B #525

Merged
merged 2 commits into from Jun 22, 2016

Projects

None yet

7 participants

@msaboff
Contributor
msaboff commented Apr 6, 2016

The current specification of CharacterClassEscape's in Regular Expressions introduces surprising behavior of \W when both the unicode and ignoreCase flags are provided.

There are 6 CharacterClassEscapes defined:
\d is digit character and \D is not digit character
\s is space character and \S is not space character
\w is word character and \W is not word character
Furthermore, !\d matches the same as \D, !\s matches the same as \S, and !\w matches the same as \W with the exception of /\W/.ui and the characters 'k', 'K', 's' and 'S'.

The proposal here is increase the characters in the set returned by \w CharacterClassEscape when unicode is specified. Given the current CaseFolding.txt contents, this will add \u017f (lower case long s) and \u212a (Kelvin symbol) to the returned set. It has the correspond effect that those two characters will not be in the set returned by the \W CharacterClassEscape rule.

This change will make \u017f (lower case long s) and \u212a (Kelvin symbol) word characters for unicode regular expressions. It also means that \w == !\W and !\w == \W regardless of the flags provided to the regular expression.

This is an improvement to pull request 516. It resolves the inconsistency without creating an issue with explicit characters classes ([] syntax) that include CharacterClassEscapes that exist in that pull request.

Note that much of the spec diff is due to indentation changes of the character table.

@claudepache claudepache and 1 other commented on an outdated diff Apr 7, 2016
+ </td>
+ <td>
+ </td>
+ <td>
+ </td>
+ <td>
+ </td>
+ <td>
+ </td>
+ <td>
+ </td>
+ </tr>
+ </tbody>
+ </table>
+ </figure>
+ 1. If _Unicode_ is *true*, then
@claudepache
claudepache Apr 7, 2016 Contributor

We don't need to restrict to "Unicode is true", since the following steps will add no character when Unicode is false.

@msaboff
msaboff Apr 8, 2016 Contributor

I thought about eliminating the Unicode check, however it shows that the change is there for Unicode Word character matching.

@claudepache
claudepache Apr 8, 2016 Contributor

I think that this should be expressed by the "Assert" I'm proposing below, instead of by adding an explicit check. The relevant distinction between "Unicode" and "non-Unicode" is already done in the Canonicalize() abstract operation; following the DRY principle, it is better not to repeat that distinction.

@claudepache claudepache and 1 other commented on an outdated diff Apr 7, 2016
+ </td>
+ <td>
+ </td>
+ <td>
+ </td>
+ <td>
+ </td>
+ <td>
+ </td>
+ </tr>
+ </tbody>
+ </table>
+ </figure>
+ 1. If _Unicode_ is *true*, then
+ 1. Create an empty set _U_.
+ 1. For every character _c_ not in set _A_ where Canonicalize(_c_) is in _A_, add _c_ to _U_.
@claudepache
claudepache Apr 7, 2016 Contributor

I suggest to add something like: Assert: Unless Unicode and IgnoreCase are both true, U is empty.

@msaboff
msaboff Apr 8, 2016 Contributor

The change I'm proposing specifically doesn't take into account ignoreCase as we want the set to be the same with and without ignoreCase.

@claudepache
claudepache Apr 8, 2016 Contributor

You implicitly take ignoreCase into account by using Canonicalize(). The resulting set will be different under /u or under /ui.

@claudepache
Contributor

(You should have said: "Fixes #512" in your comment, in order to autolink that Pull Request with Issue 512 (the original Issue).)

@msaboff
Contributor
msaboff commented Apr 14, 2016

I'm fine with eliminating the If Unicode check and adding the Assert.

@goyakin
Member
goyakin commented Apr 19, 2016 edited

This is going to cause a difference between what \w and \W match versus \b and \B. Semantically \b matches a word boundary, i.e. between \w and \W. However, this isn't going to hold anymore with this change.

Rather than expanding \w only with the additional characters that canonicalize to the 63 characters specifically called out, why don't we generalize this and have \w match the alphanumeric characters from the Unicode Character Database plus _ when the Unicode flag is present? There is already precedent in other languages, e.g. Python, .NET, and Perl.

We can then redefine \b as the boundary between \w and \W in the spec so that they are in accord.

@mathiasbynens
Contributor

@goyakin There was a proposal to make \w more Unicode-aware, but it was decided against in favor of \p{…}.

@goyakin
Member
goyakin commented Apr 20, 2016

@mathiasbynens I wasn't aware of that. Seems like the reason was backward compatibility. Since this PR is a breaking change in that regard, if we're going down this route, I think we should go ahead and accept all alphanumeric characters instead.

If we want to keep backward compatibility though, it might be better to disallow canonicalizing characters > 127 to <= 127, as mentioned in the email thread you linked to.

@hashseed

While I'm not entirely happy with this proposal, I don't see a backward compatibility issue if this is gated by /u, given that /u has not been around for very long yet anyways.

@claudepache
Contributor
claudepache commented Apr 27, 2016 edited

This is going to cause a difference between what \w and \W match versus \b and \B. Semantically \b matches a word boundary, i.e. between \w and \W. However, this isn't going to hold anymore with this change.

To be clear, there was already an issue before this change, i.e.:

  • without the change, /^\W\w$/ui will match "ss", but /^.\b.$/ui won’t;
  • with the change, /^\W\w$/ui won’t match "ſs", but /^.\b.$/ui will.

Which gotcha is the worst one, I don’t know. It is possible to also modify \b so that the equivalence holds again. (Although, at this point, I'd be more inclined to let \w just match all Unicode letters.)

@bterlson
Member

I am feeling like modifying \b would be the right choice here - it's a vastly smaller change than allowing all Unicode characters right?

@msaboff
Contributor
msaboff commented May 18, 2016

I'm fine with also modifying \b so that the equivalence holds once again. I'll try to get the spec changes ready for the May meeting.

@bterlson
Member

We have consensus on this PR with the addition aligning \b and \B. @msaboff can you update this PR with the appropriate changes to \b?

@bterlson
Member

@mathiasbynens thanks for raising this issue!

@hashseed
hashseed commented May 24, 2016 edited

I gave the matching of \b and \B a read and it seems to me that they are currently not actually affected by case-insensitive canonicalization at all:

The production Assertion::\b evaluates by returning an internal AssertionTester closure that takes a State argument x and performs the following steps when evaluated:
Let e be x's endIndex.
Call IsWordChar(e-1) and let a be the Boolean result.
Call IsWordChar(e) and let b be the Boolean result.
If a is true and b is false, return true.
If a is false and b is true, return true.
Return false.

And

The abstract operation IsWordChar takes an integer parameter e and performs the following steps:

If e is -1 or e is InputLength, return false.
Let c be the character Input[e].
If c is one of the sixty-three characters below, return true.
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9 _

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui (if we actually had lookbehind assertions). The Kelvin sign and sharp-S are not considered word characters wrt \b and \B.

At the least the existing definition of \b and \B would have to be rewritten to be coupled to \w and \W in the first place. Or do you simply want to add sharp-S and Kelvin sign to the list for IsWordChar for /ui regexps?

@ljharb
Member
ljharb commented May 24, 2016

@hashseed imo the goal is to make the following true, in all modes:

  • /W match all the characters not in /w (currently false in /ui)
  • \B match all the characters not in \b (already the case)
  • (if I understand correctly) /B to match all the characters in \w (currently false in /ui)

Rephrased, if Kelvin and sharp-S are matched by \w, then they are word characters, and \B should match them.

(someone please correct me if I've misstated the relationship between \B and \w)

@claudepache
Contributor

@ljharb \b and \B are zero-length assertions, they don’t match characters.

@claudepache
Contributor

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui

I’m confused, because I thought the the issue was precisely that /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui. If not, what the issue was supposed to be?

@ljharb
Member
ljharb commented May 24, 2016

@claudepache lol thank you, i was sure i was going to get that wrong. hopefully someone can come in and state it better than i.

Suffice to say, within the same mode, \b, \B, \w, and \W should certainly be logically self-consistent.

@msaboff
Contributor
msaboff commented May 24, 2016

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui (if we actually had lookbehind assertions). The Kelvin sign and sharp-S are not considered word characters wrt \b and \B.

From our discussion yesterday, for /ui patterns, we agreed that small sharp-S and kelvin ARE ward characters. For all other flags, they are not word characters.

@hashseed

@claudepache that's what I meant. \w and \b are not consistent currently for /ui.
@msaboff thanks for clearing up.

@msaboff
Contributor
msaboff commented Jun 8, 2016

I updated the spec to reflect what we agreed on at the May meeting.

I created a common abstract operation, WordCharacters and changed so that both the word asserts and \w and \W CharacterClassEscapes use the new abstract operation.

@msaboff msaboff changed the title from Unify handling of RegExp CharacterClassEscapes \w and \W to Unify handling of RegExp CharacterClassEscapes \w, \W, \b and \B Jun 8, 2016
@msaboff msaboff changed the title from Unify handling of RegExp CharacterClassEscapes \w, \W, \b and \B to Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B Jun 8, 2016
@bterlson
Member
bterlson commented Jun 8, 2016

ping @littledan, @efaust. I'll take a look tomorrow!

@bterlson
Member
bterlson commented Jun 8, 2016

@msaboff can you also rebase?

Michael Saboff and others added some commits Apr 6, 2016
@msaboff Michael Saboff Proposed RegExp CharacterClassEscape changes for \w 0d3d71b
@msaboff msaboff Updated per what was agreed at the May 2016 meeting.
Created a new abstract operation "WordCharacters()" that is used by
both IsWordChar() for word assertions and \w/\W CharacterClassEscapes.
ab3380d
@msaboff
Contributor
msaboff commented Jun 8, 2016

Rebased.

@msaboff
Contributor
msaboff commented Jun 20, 2016

@bterlson @littledan @efaust Can I get closure on this change?

@bterlson
Member

I'll try to review this today (still out on a wedding break but festivities are now over :))

@bterlson
Member

Sorry for the delay. Looks good to me (other than some minor ecmarkup/spec convention things which I'll just fix up in a subsequent commit).

@bterlson bterlson merged commit 996af87 into tc39:master Jun 22, 2016
@mathiasbynens mathiasbynens added a commit to mathiasbynens/regexpu-fixtures that referenced this pull request Jun 23, 2016
@mathiasbynens mathiasbynens Match updated spec for `/\W/iu`
Per ES6, `/\W/iu` matched U+017F, U+212A, and, surprisingly, `K` and `S`.

This is no longer the case now that tc39/ecma262#525 is merged.

Ref. mathiasbynens/regexpu-core#8.
81eeb14
@mathiasbynens mathiasbynens added a commit to mathiasbynens/regexpu-core that referenced this pull request Jun 23, 2016
@mathiasbynens mathiasbynens Match updated spec for `/\W/iu`
Per ES6, `/\W/iu` matched U+017F, U+212A, and, surprisingly, `K` and `S`.

This is no longer the case now that tc39/ecma262#525 is merged.

Ref. #8.
Ref. mathiasbynens/regexpu-fixtures@81eeb14.
74308d5
@mathiasbynens mathiasbynens referenced this pull request in Microsoft/ChakraCore Jun 23, 2016
Closed

[RegExp] case-insensitive matching misses characters #517

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment