New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B #525

Merged
merged 2 commits into from Jun 22, 2016

Conversation

Projects
None yet
7 participants
@msaboff
Contributor

msaboff commented Apr 6, 2016

The current specification of CharacterClassEscape's in Regular Expressions introduces surprising behavior of \W when both the unicode and ignoreCase flags are provided.

There are 6 CharacterClassEscapes defined:
\d is digit character and \D is not digit character
\s is space character and \S is not space character
\w is word character and \W is not word character
Furthermore, !\d matches the same as \D, !\s matches the same as \S, and !\w matches the same as \W with the exception of /\W/.ui and the characters 'k', 'K', 's' and 'S'.

The proposal here is increase the characters in the set returned by \w CharacterClassEscape when unicode is specified. Given the current CaseFolding.txt contents, this will add \u017f (lower case long s) and \u212a (Kelvin symbol) to the returned set. It has the correspond effect that those two characters will not be in the set returned by the \W CharacterClassEscape rule.

This change will make \u017f (lower case long s) and \u212a (Kelvin symbol) word characters for unicode regular expressions. It also means that \w == !\W and !\w == \W regardless of the flags provided to the regular expression.

This is an improvement to pull request 516. It resolves the inconsistency without creating an issue with explicit characters classes ([] syntax) that include CharacterClassEscapes that exist in that pull request.

Note that much of the spec diff is due to indentation changes of the character table.

Show outdated Hide outdated spec.html
</tbody>
</table>
</figure>
1. If _Unicode_ is *true*, then

This comment has been minimized.

@claudepache

claudepache Apr 7, 2016

Contributor

We don't need to restrict to "Unicode is true", since the following steps will add no character when Unicode is false.

@claudepache

claudepache Apr 7, 2016

Contributor

We don't need to restrict to "Unicode is true", since the following steps will add no character when Unicode is false.

This comment has been minimized.

@msaboff

msaboff Apr 8, 2016

Contributor

I thought about eliminating the Unicode check, however it shows that the change is there for Unicode Word character matching.

@msaboff

msaboff Apr 8, 2016

Contributor

I thought about eliminating the Unicode check, however it shows that the change is there for Unicode Word character matching.

This comment has been minimized.

@claudepache

claudepache Apr 8, 2016

Contributor

I think that this should be expressed by the "Assert" I'm proposing below, instead of by adding an explicit check. The relevant distinction between "Unicode" and "non-Unicode" is already done in the Canonicalize() abstract operation; following the DRY principle, it is better not to repeat that distinction.

@claudepache

claudepache Apr 8, 2016

Contributor

I think that this should be expressed by the "Assert" I'm proposing below, instead of by adding an explicit check. The relevant distinction between "Unicode" and "non-Unicode" is already done in the Canonicalize() abstract operation; following the DRY principle, it is better not to repeat that distinction.

Show outdated Hide outdated spec.html
</figure>
1. If _Unicode_ is *true*, then
1. Create an empty set _U_.
1. For every character _c_ not in set _A_ where Canonicalize(_c_) is in _A_, add _c_ to _U_.

This comment has been minimized.

@claudepache

claudepache Apr 7, 2016

Contributor

I suggest to add something like: Assert: Unless Unicode and IgnoreCase are both true, U is empty.

@claudepache

claudepache Apr 7, 2016

Contributor

I suggest to add something like: Assert: Unless Unicode and IgnoreCase are both true, U is empty.

This comment has been minimized.

@msaboff

msaboff Apr 8, 2016

Contributor

The change I'm proposing specifically doesn't take into account ignoreCase as we want the set to be the same with and without ignoreCase.

@msaboff

msaboff Apr 8, 2016

Contributor

The change I'm proposing specifically doesn't take into account ignoreCase as we want the set to be the same with and without ignoreCase.

This comment has been minimized.

@claudepache

claudepache Apr 8, 2016

Contributor

You implicitly take ignoreCase into account by using Canonicalize(). The resulting set will be different under /u or under /ui.

@claudepache

claudepache Apr 8, 2016

Contributor

You implicitly take ignoreCase into account by using Canonicalize(). The resulting set will be different under /u or under /ui.

@claudepache

This comment has been minimized.

Show comment
Hide comment
@claudepache

claudepache Apr 7, 2016

Contributor

(You should have said: "Fixes #512" in your comment, in order to autolink that Pull Request with Issue 512 (the original Issue).)

Contributor

claudepache commented Apr 7, 2016

(You should have said: "Fixes #512" in your comment, in order to autolink that Pull Request with Issue 512 (the original Issue).)

@msaboff

This comment has been minimized.

Show comment
Hide comment
@msaboff

msaboff Apr 14, 2016

Contributor

I'm fine with eliminating the If Unicode check and adding the Assert.

Contributor

msaboff commented Apr 14, 2016

I'm fine with eliminating the If Unicode check and adding the Assert.

@goyakin

This comment has been minimized.

Show comment
Hide comment
@goyakin

goyakin Apr 19, 2016

Member

This is going to cause a difference between what \w and \W match versus \b and \B. Semantically \b matches a word boundary, i.e. between \w and \W. However, this isn't going to hold anymore with this change.

Rather than expanding \w only with the additional characters that canonicalize to the 63 characters specifically called out, why don't we generalize this and have \w match the alphanumeric characters from the Unicode Character Database plus _ when the Unicode flag is present? There is already precedent in other languages, e.g. Python, .NET, and Perl.

We can then redefine \b as the boundary between \w and \W in the spec so that they are in accord.

Member

goyakin commented Apr 19, 2016

This is going to cause a difference between what \w and \W match versus \b and \B. Semantically \b matches a word boundary, i.e. between \w and \W. However, this isn't going to hold anymore with this change.

Rather than expanding \w only with the additional characters that canonicalize to the 63 characters specifically called out, why don't we generalize this and have \w match the alphanumeric characters from the Unicode Character Database plus _ when the Unicode flag is present? There is already precedent in other languages, e.g. Python, .NET, and Perl.

We can then redefine \b as the boundary between \w and \W in the spec so that they are in accord.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
Member

mathiasbynens commented Apr 19, 2016

@goyakin There was a proposal to make \w more Unicode-aware, but it was decided against in favor of \p{…}.

@goyakin

This comment has been minimized.

Show comment
Hide comment
@goyakin

goyakin Apr 20, 2016

Member

@mathiasbynens I wasn't aware of that. Seems like the reason was backward compatibility. Since this PR is a breaking change in that regard, if we're going down this route, I think we should go ahead and accept all alphanumeric characters instead.

If we want to keep backward compatibility though, it might be better to disallow canonicalizing characters > 127 to <= 127, as mentioned in the email thread you linked to.

Member

goyakin commented Apr 20, 2016

@mathiasbynens I wasn't aware of that. Seems like the reason was backward compatibility. Since this PR is a breaking change in that regard, if we're going down this route, I think we should go ahead and accept all alphanumeric characters instead.

If we want to keep backward compatibility though, it might be better to disallow canonicalizing characters > 127 to <= 127, as mentioned in the email thread you linked to.

@hashseed

This comment has been minimized.

Show comment
Hide comment
@hashseed

hashseed Apr 26, 2016

While I'm not entirely happy with this proposal, I don't see a backward compatibility issue if this is gated by /u, given that /u has not been around for very long yet anyways.

While I'm not entirely happy with this proposal, I don't see a backward compatibility issue if this is gated by /u, given that /u has not been around for very long yet anyways.

@claudepache

This comment has been minimized.

Show comment
Hide comment
@claudepache

claudepache Apr 27, 2016

Contributor

This is going to cause a difference between what \w and \W match versus \b and \B. Semantically \b matches a word boundary, i.e. between \w and \W. However, this isn't going to hold anymore with this change.

To be clear, there was already an issue before this change, i.e.:

  • without the change, /^\W\w$/ui will match "ss", but /^.\b.$/ui won’t;
  • with the change, /^\W\w$/ui won’t match "ſs", but /^.\b.$/ui will.

Which gotcha is the worst one, I don’t know. It is possible to also modify \b so that the equivalence holds again. (Although, at this point, I'd be more inclined to let \w just match all Unicode letters.)

Contributor

claudepache commented Apr 27, 2016

This is going to cause a difference between what \w and \W match versus \b and \B. Semantically \b matches a word boundary, i.e. between \w and \W. However, this isn't going to hold anymore with this change.

To be clear, there was already an issue before this change, i.e.:

  • without the change, /^\W\w$/ui will match "ss", but /^.\b.$/ui won’t;
  • with the change, /^\W\w$/ui won’t match "ſs", but /^.\b.$/ui will.

Which gotcha is the worst one, I don’t know. It is possible to also modify \b so that the equivalence holds again. (Although, at this point, I'd be more inclined to let \w just match all Unicode letters.)

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson Apr 29, 2016

Member

I am feeling like modifying \b would be the right choice here - it's a vastly smaller change than allowing all Unicode characters right?

Member

bterlson commented Apr 29, 2016

I am feeling like modifying \b would be the right choice here - it's a vastly smaller change than allowing all Unicode characters right?

@msaboff

This comment has been minimized.

Show comment
Hide comment
@msaboff

msaboff May 18, 2016

Contributor

I'm fine with also modifying \b so that the equivalence holds once again. I'll try to get the spec changes ready for the May meeting.

Contributor

msaboff commented May 18, 2016

I'm fine with also modifying \b so that the equivalence holds once again. I'll try to get the spec changes ready for the May meeting.

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson May 23, 2016

Member

We have consensus on this PR with the addition aligning \b and \B. @msaboff can you update this PR with the appropriate changes to \b?

Member

bterlson commented May 23, 2016

We have consensus on this PR with the addition aligning \b and \B. @msaboff can you update this PR with the appropriate changes to \b?

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson May 23, 2016

Member

@mathiasbynens thanks for raising this issue!

Member

bterlson commented May 23, 2016

@mathiasbynens thanks for raising this issue!

@hashseed

This comment has been minimized.

Show comment
Hide comment
@hashseed

hashseed May 24, 2016

I gave the matching of \b and \B a read and it seems to me that they are currently not actually affected by case-insensitive canonicalization at all:

The production Assertion::\b evaluates by returning an internal AssertionTester closure that takes a State argument x and performs the following steps when evaluated:
Let e be x's endIndex.
Call IsWordChar(e-1) and let a be the Boolean result.
Call IsWordChar(e) and let b be the Boolean result.
If a is true and b is false, return true.
If a is false and b is true, return true.
Return false.

And

The abstract operation IsWordChar takes an integer parameter e and performs the following steps:

If e is -1 or e is InputLength, return false.
Let c be the character Input[e].
If c is one of the sixty-three characters below, return true.
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9 _

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui (if we actually had lookbehind assertions). The Kelvin sign and sharp-S are not considered word characters wrt \b and \B.

At the least the existing definition of \b and \B would have to be rewritten to be coupled to \w and \W in the first place. Or do you simply want to add sharp-S and Kelvin sign to the list for IsWordChar for /ui regexps?

hashseed commented May 24, 2016

I gave the matching of \b and \B a read and it seems to me that they are currently not actually affected by case-insensitive canonicalization at all:

The production Assertion::\b evaluates by returning an internal AssertionTester closure that takes a State argument x and performs the following steps when evaluated:
Let e be x's endIndex.
Call IsWordChar(e-1) and let a be the Boolean result.
Call IsWordChar(e) and let b be the Boolean result.
If a is true and b is false, return true.
If a is false and b is true, return true.
Return false.

And

The abstract operation IsWordChar takes an integer parameter e and performs the following steps:

If e is -1 or e is InputLength, return false.
Let c be the character Input[e].
If c is one of the sixty-three characters below, return true.
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9 _

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui (if we actually had lookbehind assertions). The Kelvin sign and sharp-S are not considered word characters wrt \b and \B.

At the least the existing definition of \b and \B would have to be rewritten to be coupled to \w and \W in the first place. Or do you simply want to add sharp-S and Kelvin sign to the list for IsWordChar for /ui regexps?

@ljharb

This comment has been minimized.

Show comment
Hide comment
@ljharb

ljharb May 24, 2016

Member

@hashseed imo the goal is to make the following true, in all modes:

  • /W match all the characters not in /w (currently false in /ui)
  • \B match all the characters not in \b (already the case)
  • (if I understand correctly) /B to match all the characters in \w (currently false in /ui)

Rephrased, if Kelvin and sharp-S are matched by \w, then they are word characters, and \B should match them.

(someone please correct me if I've misstated the relationship between \B and \w)

Member

ljharb commented May 24, 2016

@hashseed imo the goal is to make the following true, in all modes:

  • /W match all the characters not in /w (currently false in /ui)
  • \B match all the characters not in \b (already the case)
  • (if I understand correctly) /B to match all the characters in \w (currently false in /ui)

Rephrased, if Kelvin and sharp-S are matched by \w, then they are word characters, and \B should match them.

(someone please correct me if I've misstated the relationship between \B and \w)

@claudepache

This comment has been minimized.

Show comment
Hide comment
@claudepache

claudepache May 24, 2016

Contributor

@ljharb \b and \B are zero-length assertions, they don’t match characters.

Contributor

claudepache commented May 24, 2016

@ljharb \b and \B are zero-length assertions, they don’t match characters.

@claudepache

This comment has been minimized.

Show comment
Hide comment
@claudepache

claudepache May 24, 2016

Contributor

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui

I’m confused, because I thought the the issue was precisely that /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui. If not, what the issue was supposed to be?

Contributor

claudepache commented May 24, 2016

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui

I’m confused, because I thought the the issue was precisely that /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui. If not, what the issue was supposed to be?

@ljharb

This comment has been minimized.

Show comment
Hide comment
@ljharb

ljharb May 24, 2016

Member

@claudepache lol thank you, i was sure i was going to get that wrong. hopefully someone can come in and state it better than i.

Suffice to say, within the same mode, \b, \B, \w, and \W should certainly be logically self-consistent.

Member

ljharb commented May 24, 2016

@claudepache lol thank you, i was sure i was going to get that wrong. hopefully someone can come in and state it better than i.

Suffice to say, within the same mode, \b, \B, \w, and \W should certainly be logically self-consistent.

@msaboff

This comment has been minimized.

Show comment
Hide comment
@msaboff

msaboff May 24, 2016

Contributor

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui (if we actually had lookbehind assertions). The Kelvin sign and sharp-S are not considered word characters wrt \b and \B.

From our discussion yesterday, for /ui patterns, we agreed that small sharp-S and kelvin ARE ward characters. For all other flags, they are not word characters.

Contributor

msaboff commented May 24, 2016

So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\w)(?=\W))/ui (if we actually had lookbehind assertions). The Kelvin sign and sharp-S are not considered word characters wrt \b and \B.

From our discussion yesterday, for /ui patterns, we agreed that small sharp-S and kelvin ARE ward characters. For all other flags, they are not word characters.

@hashseed

This comment has been minimized.

Show comment
Hide comment
@hashseed

hashseed May 24, 2016

@claudepache that's what I meant. \w and \b are not consistent currently for /ui.
@msaboff thanks for clearing up.

@claudepache that's what I meant. \w and \b are not consistent currently for /ui.
@msaboff thanks for clearing up.

@msaboff

This comment has been minimized.

Show comment
Hide comment
@msaboff

msaboff Jun 8, 2016

Contributor

I updated the spec to reflect what we agreed on at the May meeting.

I created a common abstract operation, WordCharacters and changed so that both the word asserts and \w and \W CharacterClassEscapes use the new abstract operation.

Contributor

msaboff commented Jun 8, 2016

I updated the spec to reflect what we agreed on at the May meeting.

I created a common abstract operation, WordCharacters and changed so that both the word asserts and \w and \W CharacterClassEscapes use the new abstract operation.

@msaboff msaboff changed the title from Unify handling of RegExp CharacterClassEscapes \w and \W to Unify handling of RegExp CharacterClassEscapes \w, \W, \b and \B Jun 8, 2016

@msaboff msaboff changed the title from Unify handling of RegExp CharacterClassEscapes \w, \W, \b and \B to Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B Jun 8, 2016

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson Jun 8, 2016

Member

ping @littledan, @efaust. I'll take a look tomorrow!

Member

bterlson commented Jun 8, 2016

ping @littledan, @efaust. I'll take a look tomorrow!

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson Jun 8, 2016

Member

@msaboff can you also rebase?

Member

bterlson commented Jun 8, 2016

@msaboff can you also rebase?

Michael Saboff and others added some commits Apr 6, 2016

Updated per what was agreed at the May 2016 meeting.
Created a new abstract operation "WordCharacters()" that is used by
both IsWordChar() for word assertions and \w/\W CharacterClassEscapes.
@msaboff

This comment has been minimized.

Show comment
Hide comment
@msaboff

msaboff Jun 8, 2016

Contributor

Rebased.

Contributor

msaboff commented Jun 8, 2016

Rebased.

@msaboff

This comment has been minimized.

Show comment
Hide comment
@msaboff

msaboff Jun 20, 2016

Contributor

@bterlson @littledan @efaust Can I get closure on this change?

Contributor

msaboff commented Jun 20, 2016

@bterlson @littledan @efaust Can I get closure on this change?

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson Jun 20, 2016

Member

I'll try to review this today (still out on a wedding break but festivities are now over :))

Member

bterlson commented Jun 20, 2016

I'll try to review this today (still out on a wedding break but festivities are now over :))

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson Jun 22, 2016

Member

Sorry for the delay. Looks good to me (other than some minor ecmarkup/spec convention things which I'll just fix up in a subsequent commit).

Member

bterlson commented Jun 22, 2016

Sorry for the delay. Looks good to me (other than some minor ecmarkup/spec convention things which I'll just fix up in a subsequent commit).

@bterlson bterlson merged commit 996af87 into tc39:master Jun 22, 2016

mathiasbynens added a commit to mathiasbynens/regexpu-fixtures that referenced this pull request Jun 23, 2016

Match updated spec for `/\W/iu`
Per ES6, `/\W/iu` matched U+017F, U+212A, and, surprisingly, `K` and `S`.

This is no longer the case now that tc39/ecma262#525 is merged.

Ref. mathiasbynens/regexpu-core#8.

mathiasbynens added a commit to mathiasbynens/regexpu-core that referenced this pull request Jun 23, 2016

Match updated spec for `/\W/iu`
Per ES6, `/\W/iu` matched U+017F, U+212A, and, surprisingly, `K` and `S`.

This is no longer the case now that tc39/ecma262#525 is merged.

Ref. #8.
Ref. mathiasbynens/regexpu-fixtures@81eeb14.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment