Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign upRegExp processing unicode+ignoreCase of \W is not the same as !\w when matching "S" or "K" #512
Comments
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
msaboff
Apr 1, 2016
Contributor
Applying the principle of least surprise, it seems to me that/\W/ui.test("S") should be false given that /\w/ui.test("S") is true. The character class escapes are arranges as lower case escapes being is X and the upper case escapes being is NOT X.
\d is digit character and \D is not digit character
\s is space character and \S is not space character
\w is word character and \W is not word character
Furthermore, !\d matches the same as \D, !\s matches the same as \S, and !\w matches the same as \W.
This holds true for all subject characters, all character class escapes and all combinations of the unicode and ignoreCase flags with the exceptions of /\W/.ui and the characters 'k', 'K', 's' and 'S'.
|
Applying the principle of least surprise, it seems to me that \d is digit character and \D is not digit character This holds true for all subject characters, all character class escapes and all combinations of the unicode and ignoreCase flags with the exceptions of |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
cc @hashseed |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
hashseed
Apr 2, 2016
I don't think we should set the precedent of deviating from the unicode standard by adding exceptions to avoid surprises. These exceptions would themselves be surprising.
\w and \W should not be used in combination with /u. We should rather speed up the specification on \p and \P so that we can use \p{Letter} instead. \w is not a word character in the unicode context. It's simply a shorthand for a-zA-Z0-9_.
hashseed
commented
Apr 2, 2016
|
I don't think we should set the precedent of deviating from the unicode standard by adding exceptions to avoid surprises. These exceptions would themselves be surprising. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
hashseed
Apr 2, 2016
Alternatively, you could bring this up with the unicode standard body to have it fixed there, at the source. I don't think TC39 is the right place to debate how case folding should work.
hashseed
commented
Apr 2, 2016
|
Alternatively, you could bring this up with the unicode standard body to have it fixed there, at the source. I don't think TC39 is the right place to debate how case folding should work. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
msaboff
Apr 3, 2016
Contributor
I'm not suggesting that we deviate from the unicode standard. What I am suggesting is that we change the ECMAScript spec so that \W has the same set of characters as \w, but with the invert flag set to true. Although it won't change any other results, I would also suggest the same for \D and \S.
|
I'm not suggesting that we deviate from the unicode standard. What I am suggesting is that we change the ECMAScript spec so that \W has the same set of characters as \w, but with the invert flag set to |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
mathiasbynens
Apr 3, 2016
Member
Having iu enable case-folding in most cases but not in others would be equally confusing/surprising IMHO.
|
Having |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
msaboff
Apr 3, 2016
Contributor
Having
iuenable case-folding in most cases but not in others would be equally confusing/surprising IMHO.
Agreed. I am not suggesting ignoring case-folding rules. I am suggesting that \W be the same thing as !\w and not something different.
Agreed. I am not suggesting ignoring case-folding rules. I am suggesting that \W be the same thing as !\w and not something different. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
mathiasbynens
Apr 3, 2016
Member
Agreed. I am not suggesting ignoring case-folding rules.
Isn’t that exactly what you’re suggesting here:
I am suggesting that
\Wbe the same thing as!\wand not something different.
One implies the other. The fact that \W is not !\w in iu mode is due to case folding / canonicalization.
Isn’t that exactly what you’re suggesting here:
One implies the other. The fact that |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
claudepache
Apr 3, 2016
Contributor
@mathiasbynens, I think you're misunderstanding @msaboff's suggestion.
Currently (21.2.2.9, last algorithm), /\W/ evaluates to CharacterSetMatcher(A, false), where A is the set of characters different from A-Za-z0-9_.
What @msaboff is suggesting, is that /\W/ evaluate to CharacterSetMatcher(B, true), where B is the set of characters consisting of A-Za-z0-9_.
With this change, \W will match the characters not matched by \w (and vice versa), because the condition in step e.i of CharacterSetMatcher is the exact negation of condition in step f.i.
In particular, with this change, both "ſ" (long s) and "S" will be matched by /\w/ui, and none of them will be matched by /\W/ui.
|
@mathiasbynens, I think you're misunderstanding @msaboff's suggestion. Currently (21.2.2.9, last algorithm), What @msaboff is suggesting, is that With this change, In particular, with this change, both |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
mathiasbynens
Apr 3, 2016
Member
@claudepache You’re right, that’s not how I understood it. Thanks for making it clear to me.
This suggestion seems sensible (assuming there is no existing code in the wild that relies on the current behavior of /\W/iu). Thanks for bringing this up again, @msaboff!
|
@claudepache You’re right, that’s not how I understood it. Thanks for making it clear to me. This suggestion seems sensible (assuming there is no existing code in the wild that relies on the current behavior of |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
msaboff
Apr 3, 2016
Contributor
@claudepache, thanks for explaining what I am proposing another way.
This change seems much more logical and intuitive for programmers.
|
@claudepache, thanks for explaining what I am proposing another way. This change seems much more logical and intuitive for programmers. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
hashseed
commented
Apr 4, 2016
|
Same should apply to \D and \S for symmetry, right? |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
@hashseed Agreed (although it won’t make a difference there). |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
hashseed
commented
Apr 4, 2016
|
Sounds good to me as well. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
hashseed
Apr 4, 2016
How do you interpret a class ranges that includes such a class escape?
Currently, the following are equivalent:
/\W/
/[\W]/
/[^\w]/
Consequently, these are also equivalent:
/\W/ui
/[\W]/ui
/[^\w]/ui
With the proposed change, I don't know how to interpret /[\W]/. And would these two below still be equivalent?
/[^\W]/ui
/[\w]/ui
What would /[^\W123]/ even mean? Would I need to desugar that into /\w|[123]/?
According to [0], CharSets inside the ClassRange are merged. But that assumes that CharSets are not negated.
hashseed
commented
Apr 4, 2016
|
How do you interpret a class ranges that includes such a class escape? Currently, the following are equivalent: |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
claudepache
Apr 4, 2016
Contributor
How do you interpret a class ranges that includes such a class escape?
Yes, it is the difficulty I'm encountering when trying to write a spec for the change. A more difficult case is: /[^\d\W]/.
Currently, the following are equivalent:
/\W/
/[\W]/
/[^\w]/
Consequently, these are also equivalent:
/\W/ui
/[\W]/ui
/[^\w]/ui
Precisely not: /[\W]/ui is not equivalent to /[^\w]/ui (under the current semantics): the first one matches "S", but not the second one. It is because
CharacterSetMatcher(A, false)
is not equivalent to
CharacterSetMatcher(B, true), where B is the set of characters not contained in A
when the CharSet A is not stable under the Canonicalize() abstract operation.
Yes, it is the difficulty I'm encountering when trying to write a spec for the change. A more difficult case is: /[^\d\W]/.
Precisely not:
is not equivalent to
when the CharSet A is not stable under the Canonicalize() abstract operation. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
hashseed
commented
Apr 4, 2016
|
Thanks for the correction. The point regarding class ranges stands though. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
claudepache
Apr 4, 2016
Contributor
A possibility is to specify \w to evaluate to the set of characters ch such that Canonicalize(ch) is in A-Za-z0-9_. With that semantics, \w will evaluate to a larger set than A-Za-z0-9_ when both the IgnoreCase and the Unicode flags are set to true.
|
A possibility is to specify |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
hashseed
Apr 4, 2016
Or simply have \w be a synonym for \p{Letter} for unicode regexps, as in, characters that have the general category "Letter". Both are breaking changes.
hashseed
commented
Apr 4, 2016
|
Or simply have \w be a synonym for \p{Letter} for unicode regexps, as in, characters that have the general category "Letter". Both are breaking changes. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
mathiasbynens
Apr 4, 2016
Member
@hashseed There was a proposal to make \d and \w (and \b) more Unicode-aware, but it was rejected.
|
@hashseed There was a proposal to make |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
Created pull request #516 with suggested changes. |
claudepache
referenced this issue
Apr 7, 2016
Merged
Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B #525
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
claudepache
Apr 7, 2016
Contributor
Another solution to consider is to throw a SyntaxError when \w or \W is used together with /ui flags, alerting the coder that those character classes are fundamentally incompatible with these flags, without remedy.
|
Another solution to consider is to throw a SyntaxError when \w or \W is used together with /ui flags, alerting the coder that those character classes are fundamentally incompatible with these flags, without remedy. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
littledan
Apr 7, 2016
Member
@claudepache I think it's a little late for that, given the shipped browser support.
|
@claudepache I think it's a little late for that, given the shipped browser support. |
goyakin
referenced this issue
Apr 7, 2016
Closed
[RegExp] case-insensitive matching misses characters #517
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
This is resolved now I believe! |
msaboff commentedApr 1, 2016
A bug has been reported against both V8 bug and JSC bug concerning whether or not the strings "K" and "S" match against the regular expression /\W/ui, i.e. not a word character with the unicode and ignoreCase flags. The submitter of at least one of those bugs describes the behavior here.
This raises a possible issue with what was intended for case insensitive Unicode regular expressions. A direct reading the relevant sections of the ECMAScript® 2016 standard is that both /\W/ui.test("K") and /\W/ui.test("S") should return true, but that may not be what the standard intends.
The \W and related \w, are described in section 21.2.2.12 CharacterClassEscape. It states that word characters are matched with the \w CharacterClassEscape which consist of the set of {a .. z A .. Z 0 .. 9 _}. The \W CharacterClassEscape is the inverse of \w.
The creation of the character class from a character class escape is described in section 21.2.2.9 AtomEscape. In all cases, we create a character class with the inverse flag as false.
The case folding rules are defined in section 21.2.2.8.2 Runtime Semantics: Canonicalize. It states that for unicode case folding, the table in the file CaseFolding.txt from the Unicode Character Database is consulted and if there are common or simple case folding mappings for a character, that mapping is returned, otherwise the original character is returned.
The matching rules for character set matching is defined in section 21.2.2.8.1 Runtime Semantics: CharacterSetMatcher Abstract Operation. It states in step 1.c. that a character is retrieved from a valid index in the subject string. In step 1.d, that character is canonicalized. In steps 1.e.i. or 1.f.i., depending on the invert flag, the result of the Canonicalize() from step 1.d. is compared against the the members of the set after each of the members is processed by Canonicalize(). In step 1.e.i., if there isn't a character in the set whose Canonicalize() result is the same as the result from step 1.d. return failure. In step 1.f.i., if there is a character in the set whose Canonicalize() result is the same as the result from step 1.d. return failure.
Putting these sections together, let's walk through these sections for the construct /\W/ui.test("S")..
The regular expression consists of the character class escape '\W', which includes all the characters not in the \w word character class. It also has the unicode and ignoreCase flags set. Matching will happen according to CharacterSetMatcher Abstract Operation, with the set A being all characters not in the \w class and the invert flag being false. At step 1.d., the 'S' character will be read and passed to Canonicalize(). The result according to CaseFolding.txt, will be 's'. Since invert is
false, the matching will proceed to step 1.e.i.. There exists one character in set A, \u017f (lower case long s) that will also case fold to 's'. Therefore we will not return failure, but will instead match. The same analysis of "K" holds as there exists one character \u212a (Kelvin symbol) that will canonicalize to 'k'.The issue is whether or not this behavior was intended. Without the unicode flag, the processing of \W is the same as \w. For the unicode behavior to match the non-unicode behavior, the description for producing a character class from a character class escape needs to change for all of the upper case character class escape to produce a character class identical to the lower case equivalent with the inverse flag as true.