New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RegExp processing unicode+ignoreCase of \W is not the same as !\w when matching "S" or "K" #512

Closed
msaboff opened this Issue Apr 1, 2016 · 25 comments

Comments

Projects
None yet
6 participants
@msaboff
Contributor

msaboff commented Apr 1, 2016

A bug has been reported against both V8 bug and JSC bug concerning whether or not the strings "K" and "S" match against the regular expression /\W/ui, i.e. not a word character with the unicode and ignoreCase flags. The submitter of at least one of those bugs describes the behavior here.

This raises a possible issue with what was intended for case insensitive Unicode regular expressions. A direct reading the relevant sections of the ECMAScript® 2016 standard is that both /\W/ui.test("K") and /\W/ui.test("S") should return true, but that may not be what the standard intends.

The \W and related \w, are described in section 21.2.2.12 CharacterClassEscape. It states that word characters are matched with the \w CharacterClassEscape which consist of the set of {a .. z A .. Z 0 .. 9 _}. The \W CharacterClassEscape is the inverse of \w.

The creation of the character class from a character class escape is described in section 21.2.2.9 AtomEscape. In all cases, we create a character class with the inverse flag as false.

The case folding rules are defined in section 21.2.2.8.2 Runtime Semantics: Canonicalize. It states that for unicode case folding, the table in the file CaseFolding.txt from the Unicode Character Database is consulted and if there are common or simple case folding mappings for a character, that mapping is returned, otherwise the original character is returned.

The matching rules for character set matching is defined in section 21.2.2.8.1 Runtime Semantics: CharacterSetMatcher Abstract Operation. It states in step 1.c. that a character is retrieved from a valid index in the subject string. In step 1.d, that character is canonicalized. In steps 1.e.i. or 1.f.i., depending on the invert flag, the result of the Canonicalize() from step 1.d. is compared against the the members of the set after each of the members is processed by Canonicalize(). In step 1.e.i., if there isn't a character in the set whose Canonicalize() result is the same as the result from step 1.d. return failure. In step 1.f.i., if there is a character in the set whose Canonicalize() result is the same as the result from step 1.d. return failure.

Putting these sections together, let's walk through these sections for the construct /\W/ui.test("S")..

The regular expression consists of the character class escape '\W', which includes all the characters not in the \w word character class. It also has the unicode and ignoreCase flags set. Matching will happen according to CharacterSetMatcher Abstract Operation, with the set A being all characters not in the \w class and the invert flag being false. At step 1.d., the 'S' character will be read and passed to Canonicalize(). The result according to CaseFolding.txt, will be 's'. Since invert is false, the matching will proceed to step 1.e.i.. There exists one character in set A, \u017f (lower case long s) that will also case fold to 's'. Therefore we will not return failure, but will instead match. The same analysis of "K" holds as there exists one character \u212a (Kelvin symbol) that will canonicalize to 'k'.

The issue is whether or not this behavior was intended. Without the unicode flag, the processing of \W is the same as \w. For the unicode behavior to match the non-unicode behavior, the description for producing a character class from a character class escape needs to change for all of the upper case character class escape to produce a character class identical to the lower case equivalent with the inverse flag as true.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Apr 1, 2016

Member

The issue is whether or not this behavior was intended.

It was: https://bugs.ecmascript.org/show_bug.cgi?id=3145#c1

Member

mathiasbynens commented Apr 1, 2016

The issue is whether or not this behavior was intended.

It was: https://bugs.ecmascript.org/show_bug.cgi?id=3145#c1

@msaboff

This comment has been minimized.

Show comment
Hide comment
@msaboff

msaboff Apr 1, 2016

Contributor

Applying the principle of least surprise, it seems to me that/\W/ui.test("S") should be false given that /\w/ui.test("S") is true. The character class escapes are arranges as lower case escapes being is X and the upper case escapes being is NOT X.

\d is digit character and \D is not digit character
\s is space character and \S is not space character
\w is word character and \W is not word character
Furthermore, !\d matches the same as \D, !\s matches the same as \S, and !\w matches the same as \W.

This holds true for all subject characters, all character class escapes and all combinations of the unicode and ignoreCase flags with the exceptions of /\W/.ui and the characters 'k', 'K', 's' and 'S'.

Contributor

msaboff commented Apr 1, 2016

Applying the principle of least surprise, it seems to me that/\W/ui.test("S") should be false given that /\w/ui.test("S") is true. The character class escapes are arranges as lower case escapes being is X and the upper case escapes being is NOT X.

\d is digit character and \D is not digit character
\s is space character and \S is not space character
\w is word character and \W is not word character
Furthermore, !\d matches the same as \D, !\s matches the same as \S, and !\w matches the same as \W.

This holds true for all subject characters, all character class escapes and all combinations of the unicode and ignoreCase flags with the exceptions of /\W/.ui and the characters 'k', 'K', 's' and 'S'.

@littledan

This comment has been minimized.

Show comment
Hide comment
@littledan
Member

littledan commented Apr 2, 2016

@hashseed

This comment has been minimized.

Show comment
Hide comment
@hashseed

hashseed Apr 2, 2016

I don't think we should set the precedent of deviating from the unicode standard by adding exceptions to avoid surprises. These exceptions would themselves be surprising.
\w and \W should not be used in combination with /u. We should rather speed up the specification on \p and \P so that we can use \p{Letter} instead. \w is not a word character in the unicode context. It's simply a shorthand for a-zA-Z0-9_.

hashseed commented Apr 2, 2016

I don't think we should set the precedent of deviating from the unicode standard by adding exceptions to avoid surprises. These exceptions would themselves be surprising.
\w and \W should not be used in combination with /u. We should rather speed up the specification on \p and \P so that we can use \p{Letter} instead. \w is not a word character in the unicode context. It's simply a shorthand for a-zA-Z0-9_.

@hashseed

This comment has been minimized.

Show comment
Hide comment
@hashseed

hashseed Apr 2, 2016

Alternatively, you could bring this up with the unicode standard body to have it fixed there, at the source. I don't think TC39 is the right place to debate how case folding should work.

hashseed commented Apr 2, 2016

Alternatively, you could bring this up with the unicode standard body to have it fixed there, at the source. I don't think TC39 is the right place to debate how case folding should work.

@msaboff

This comment has been minimized.

Show comment
Hide comment
@msaboff

msaboff Apr 3, 2016

Contributor

I'm not suggesting that we deviate from the unicode standard. What I am suggesting is that we change the ECMAScript spec so that \W has the same set of characters as \w, but with the invert flag set to true. Although it won't change any other results, I would also suggest the same for \D and \S.

Contributor

msaboff commented Apr 3, 2016

I'm not suggesting that we deviate from the unicode standard. What I am suggesting is that we change the ECMAScript spec so that \W has the same set of characters as \w, but with the invert flag set to true. Although it won't change any other results, I would also suggest the same for \D and \S.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Apr 3, 2016

Member

Having iu enable case-folding in most cases but not in others would be equally confusing/surprising IMHO.

Member

mathiasbynens commented Apr 3, 2016

Having iu enable case-folding in most cases but not in others would be equally confusing/surprising IMHO.

@msaboff

This comment has been minimized.

Show comment
Hide comment
@msaboff

msaboff Apr 3, 2016

Contributor

Having iu enable case-folding in most cases but not in others would be equally confusing/surprising IMHO.

Agreed. I am not suggesting ignoring case-folding rules. I am suggesting that \W be the same thing as !\w and not something different.

Contributor

msaboff commented Apr 3, 2016

Having iu enable case-folding in most cases but not in others would be equally confusing/surprising IMHO.

Agreed. I am not suggesting ignoring case-folding rules. I am suggesting that \W be the same thing as !\w and not something different.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Apr 3, 2016

Member

Agreed. I am not suggesting ignoring case-folding rules.

Isn’t that exactly what you’re suggesting here:

I am suggesting that \W be the same thing as !\w and not something different.

One implies the other. The fact that \W is not !\w in iu mode is due to case folding / canonicalization.

Member

mathiasbynens commented Apr 3, 2016

Agreed. I am not suggesting ignoring case-folding rules.

Isn’t that exactly what you’re suggesting here:

I am suggesting that \W be the same thing as !\w and not something different.

One implies the other. The fact that \W is not !\w in iu mode is due to case folding / canonicalization.

@claudepache

This comment has been minimized.

Show comment
Hide comment
@claudepache

claudepache Apr 3, 2016

Contributor

@mathiasbynens, I think you're misunderstanding @msaboff's suggestion.

Currently (21.2.2.9, last algorithm), /\W/ evaluates to CharacterSetMatcher(A, false), where A is the set of characters different from A-Za-z0-9_.

What @msaboff is suggesting, is that /\W/ evaluate to CharacterSetMatcher(B, true), where B is the set of characters consisting of A-Za-z0-9_.

With this change, \W will match the characters not matched by \w (and vice versa), because the condition in step e.i of CharacterSetMatcher is the exact negation of condition in step f.i.

In particular, with this change, both "ſ" (long s) and "S" will be matched by /\w/ui, and none of them will be matched by /\W/ui.

Contributor

claudepache commented Apr 3, 2016

@mathiasbynens, I think you're misunderstanding @msaboff's suggestion.

Currently (21.2.2.9, last algorithm), /\W/ evaluates to CharacterSetMatcher(A, false), where A is the set of characters different from A-Za-z0-9_.

What @msaboff is suggesting, is that /\W/ evaluate to CharacterSetMatcher(B, true), where B is the set of characters consisting of A-Za-z0-9_.

With this change, \W will match the characters not matched by \w (and vice versa), because the condition in step e.i of CharacterSetMatcher is the exact negation of condition in step f.i.

In particular, with this change, both "ſ" (long s) and "S" will be matched by /\w/ui, and none of them will be matched by /\W/ui.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Apr 3, 2016

Member

@claudepache You’re right, that’s not how I understood it. Thanks for making it clear to me.

This suggestion seems sensible (assuming there is no existing code in the wild that relies on the current behavior of /\W/iu). Thanks for bringing this up again, @msaboff!

Member

mathiasbynens commented Apr 3, 2016

@claudepache You’re right, that’s not how I understood it. Thanks for making it clear to me.

This suggestion seems sensible (assuming there is no existing code in the wild that relies on the current behavior of /\W/iu). Thanks for bringing this up again, @msaboff!

@msaboff

This comment has been minimized.

Show comment
Hide comment
@msaboff

msaboff Apr 3, 2016

Contributor

@claudepache, thanks for explaining what I am proposing another way.

This change seems much more logical and intuitive for programmers.

Contributor

msaboff commented Apr 3, 2016

@claudepache, thanks for explaining what I am proposing another way.

This change seems much more logical and intuitive for programmers.

@hashseed

This comment has been minimized.

Show comment
Hide comment
@hashseed

hashseed Apr 4, 2016

Same should apply to \D and \S for symmetry, right?

hashseed commented Apr 4, 2016

Same should apply to \D and \S for symmetry, right?

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Apr 4, 2016

Member

@hashseed Agreed (although it won’t make a difference there).

Member

mathiasbynens commented Apr 4, 2016

@hashseed Agreed (although it won’t make a difference there).

@hashseed

This comment has been minimized.

Show comment
Hide comment
@hashseed

hashseed Apr 4, 2016

Sounds good to me as well.

hashseed commented Apr 4, 2016

Sounds good to me as well.

@hashseed

This comment has been minimized.

Show comment
Hide comment
@hashseed

hashseed Apr 4, 2016

How do you interpret a class ranges that includes such a class escape?

Currently, the following are equivalent:
/\W/
/[\W]/
/[^\w]/
Consequently, these are also equivalent:
/\W/ui
/[\W]/ui
/[^\w]/ui
With the proposed change, I don't know how to interpret /[\W]/. And would these two below still be equivalent?
/[^\W]/ui
/[\w]/ui
What would /[^\W123]/ even mean? Would I need to desugar that into /\w|[123]/?
According to [0], CharSets inside the ClassRange are merged. But that assumes that CharSets are not negated.

[0] https://tc39.github.io/ecma262/#sec-nonemptyclassranges

hashseed commented Apr 4, 2016

How do you interpret a class ranges that includes such a class escape?

Currently, the following are equivalent:
/\W/
/[\W]/
/[^\w]/
Consequently, these are also equivalent:
/\W/ui
/[\W]/ui
/[^\w]/ui
With the proposed change, I don't know how to interpret /[\W]/. And would these two below still be equivalent?
/[^\W]/ui
/[\w]/ui
What would /[^\W123]/ even mean? Would I need to desugar that into /\w|[123]/?
According to [0], CharSets inside the ClassRange are merged. But that assumes that CharSets are not negated.

[0] https://tc39.github.io/ecma262/#sec-nonemptyclassranges

@claudepache

This comment has been minimized.

Show comment
Hide comment
@claudepache

claudepache Apr 4, 2016

Contributor

How do you interpret a class ranges that includes such a class escape?

Yes, it is the difficulty I'm encountering when trying to write a spec for the change. A more difficult case is: /[^\d\W]/.

Currently, the following are equivalent:
/\W/
/[\W]/
/[^\w]/
Consequently, these are also equivalent:
/\W/ui
/[\W]/ui
/[^\w]/ui

Precisely not: /[\W]/ui is not equivalent to /[^\w]/ui (under the current semantics): the first one matches "S", but not the second one. It is because

CharacterSetMatcher(A, false)

is not equivalent to

CharacterSetMatcher(B, true), where B is the set of characters not contained in A

when the CharSet A is not stable under the Canonicalize() abstract operation.

Contributor

claudepache commented Apr 4, 2016

How do you interpret a class ranges that includes such a class escape?

Yes, it is the difficulty I'm encountering when trying to write a spec for the change. A more difficult case is: /[^\d\W]/.

Currently, the following are equivalent:
/\W/
/[\W]/
/[^\w]/
Consequently, these are also equivalent:
/\W/ui
/[\W]/ui
/[^\w]/ui

Precisely not: /[\W]/ui is not equivalent to /[^\w]/ui (under the current semantics): the first one matches "S", but not the second one. It is because

CharacterSetMatcher(A, false)

is not equivalent to

CharacterSetMatcher(B, true), where B is the set of characters not contained in A

when the CharSet A is not stable under the Canonicalize() abstract operation.

@hashseed

This comment has been minimized.

Show comment
Hide comment
@hashseed

hashseed Apr 4, 2016

Thanks for the correction. The point regarding class ranges stands though.

hashseed commented Apr 4, 2016

Thanks for the correction. The point regarding class ranges stands though.

@claudepache

This comment has been minimized.

Show comment
Hide comment
@claudepache

claudepache Apr 4, 2016

Contributor

A possibility is to specify \w to evaluate to the set of characters ch such that Canonicalize(ch) is in A-Za-z0-9_. With that semantics, \w will evaluate to a larger set than A-Za-z0-9_ when both the IgnoreCase and the Unicode flags are set to true.

Contributor

claudepache commented Apr 4, 2016

A possibility is to specify \w to evaluate to the set of characters ch such that Canonicalize(ch) is in A-Za-z0-9_. With that semantics, \w will evaluate to a larger set than A-Za-z0-9_ when both the IgnoreCase and the Unicode flags are set to true.

@hashseed

This comment has been minimized.

Show comment
Hide comment
@hashseed

hashseed Apr 4, 2016

Or simply have \w be a synonym for \p{Letter} for unicode regexps, as in, characters that have the general category "Letter". Both are breaking changes.

hashseed commented Apr 4, 2016

Or simply have \w be a synonym for \p{Letter} for unicode regexps, as in, characters that have the general category "Letter". Both are breaking changes.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens
Member

mathiasbynens commented Apr 4, 2016

@msaboff

This comment has been minimized.

Show comment
Hide comment
@msaboff

msaboff Apr 4, 2016

Contributor

Created pull request #516 with suggested changes.

Contributor

msaboff commented Apr 4, 2016

Created pull request #516 with suggested changes.

@claudepache

This comment has been minimized.

Show comment
Hide comment
@claudepache

claudepache Apr 7, 2016

Contributor

Another solution to consider is to throw a SyntaxError when \w or \W is used together with /ui flags, alerting the coder that those character classes are fundamentally incompatible with these flags, without remedy.

Contributor

claudepache commented Apr 7, 2016

Another solution to consider is to throw a SyntaxError when \w or \W is used together with /ui flags, alerting the coder that those character classes are fundamentally incompatible with these flags, without remedy.

@littledan

This comment has been minimized.

Show comment
Hide comment
@littledan

littledan Apr 7, 2016

Member

@claudepache I think it's a little late for that, given the shipped browser support.

Member

littledan commented Apr 7, 2016

@claudepache I think it's a little late for that, given the shipped browser support.

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson Jul 25, 2016

Member

This is resolved now I believe!

Member

bterlson commented Jul 25, 2016

This is resolved now I believe!

@bterlson bterlson closed this Jul 25, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment