Regexp with word boundaries do not match cyrillic string #1020
Comments
jmdyck
Oct 16, 2017
Collaborator
Are you claiming that this is due to a bug in the spec? Because if not (i.e., if it's just bugs in implementations), it doesn't belong here.
roman4e
commented
Oct 16, 2017
Maybe. I got advice on the Chromium forum to try giving feedback here.
Can you link to that advice?
roman4e
commented
Oct 16, 2017
Please look there
msaboff
Oct 16, 2017
Contributor
Both the \b assertion and the \w character class escape are defined in the standard in terms of WordCharacters(), which is ASCII based and therefore doesn't handle Cyrillic.
In the second example, the ECMAScript standard does not talk about UTF-8 source code*. I believe that is why the expression "ілн".match(/\b\ілн\b/ug) fails. Some engines may have special handling for the string and/or the expression being UTF-8, but it is not required.
In the JavaScriptCore shell, jsc, I got the following results
>>> "ілн".length
6
>>> "ілн".match(/\u{456}\u{43b}\u{43d}/ug)
null
>>> "\u0456\u043b\u043d".match(/\u{456}\u{43b}\u{43d}/ug)
ілн
>>> "\u0456\u043b\u043d".match(/\b\u{456}\u{43b}\u{43d}\b/ug)
null
>>> "ілн".match(/ілн/ug)
'�лн
The first shows that UTF-8 strings are not handled. The second expression confirms that UTF-8 strings don't match the escaped version of the string. The third expression shows that the \b assertion doesn't match with the UTF-8 string. Finally, the fourth expression shows that a UTF-8 expression matches a UTF-8 string, but the result doesn't reflect the input in a readable way.
This is just one engine. Since the standard is silent on UTF-8 source handling*, other engines could produce different results, but that still wouldn't support the proposed expected results that started this issue.
* The only mention of UTF-8 handling in the ECMAScript standard is in the URI handling as specified in 18.2.6 URI Handling Functions.
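The ASCII-only definition described above can be checked independently of source-file encoding by writing the string with \u escapes. The following is a minimal sketch, not part of the original comment; the variable names are illustrative, and it assumes an engine that supports the u flag.

```javascript
// Cyrillic "ілн" written with escapes, so source-file encoding cannot matter.
const word = "\u0456\u043b\u043d";

// Without the i flag, \w is [A-Za-z0-9_], so no Cyrillic letter matches it.
const wordCharMatch = word.match(/\w/u);

// \b requires a transition between a \w character and a non-\w character.
// Every character here is non-\w, so no boundary exists anywhere in the string.
const boundaryMatch = word.match(/\b\u{456}\u{43b}\u{43d}\b/u);

// The same pattern without \b matches the whole string.
const plainMatch = word.match(/\u{456}\u{43b}\u{43d}/u);

console.log(wordCharMatch, boundaryMatch, plainMatch && plainMatch[0]);
```

Here wordCharMatch and boundaryMatch are both null, while plainMatch holds the full three-letter string.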
msaboff
Oct 16, 2017
Contributor
Let me add that with the proposed Unicode Property Escapes changes to ECMAScript, the following match works:
>>> "\u0456\u043b\u043d".match(/\p{Letter}+/ug)
ілн
Using a UTF-8 source string "matches" for the wrong reasons, because the Latin characters match some of the UTF-8 bytes:
>>> "ілн".match(/\p{Letter}+/ug)
Ñ,Ð,Ð
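Where boundaries are required, one possible workaround is to emulate \b with lookarounds built on property escapes. This is a sketch under assumptions, not something proposed in the thread: it needs an engine with both property escapes and lookbehind, the name unicodeWord is illustrative, and treating letters, numbers, and underscore as word characters is a simplification that ignores combining marks.

```javascript
// Sample text: Cyrillic "ілн" (escaped) followed by an ASCII word.
const text = "\u0456\u043b\u043d abc";

// A "word" is a run of letters, numbers, or underscores that is not
// preceded or followed by another such character.
const unicodeWord = /(?<![\p{L}\p{N}_])[\p{L}\p{N}_]+(?![\p{L}\p{N}_])/gu;

const words = text.match(unicodeWord);

console.log(words);
```

With this pattern, words contains both the Cyrillic word and "abc", so the lookarounds behave like \b for non-ASCII letters as well.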
roman4e
Oct 16, 2017
Sure, it works without boundaries, but the boundaries are required. Since ECMAScript uses UTF-16 internally (correct?), it is a little confusing to me why the RegExp parser interprets the test strings as ASCII strings. And even the \w set does not include letters from the UTF-16 set.
littledan
Oct 16, 2017
Member
This was a deliberate decision in the ES6 cycle, not to change the definition for \w and \b and leave them based on ASCII, as noted here. From here, I'd recommend re-proposing the other half of this proposal if you want to move further on this topic. However, Unicode RegExps have shipped to the Web, and it is likely to be harder to change them without breaking the web. cc @allenwb @mathiasbynens .
bterlson
Oct 16, 2017
Member
@msaboff fwiw I get a length of 3 in all engines. I'm not seeing UTF8 bytes anywhere...
I also think the 2nd example should be an error due to the invalid escape sequence (with /u, identity escapes are severely limited to only SyntaxCharacters, although I could very well be misunderstanding things).
Anyway echoing @littledan - changes to \w and \b need to go through the normal proposal process.
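The point about identity escapes can be probed by compiling the second example's pattern at run time, so that a syntax error does not abort the whole script. A small sketch; the helper name compiles is hypothetical and not from the thread.

```javascript
// Returns true if the pattern compiles with the given flags,
// false if the engine rejects it with a SyntaxError.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// Pattern text: \b\ілн\b (a backslash directly before the Cyrillic letter і).
const pattern = "\\b\\\u0456\u043b\u043d\\b";

// With the u flag, only syntax characters and "/" may be identity-escaped,
// so the escaped Cyrillic letter is rejected.
const withUnicodeFlag = compiles(pattern, "u");

// Without the u flag, the outcome depends on whether the engine implements
// the more permissive legacy (Annex B) grammar, so no result is assumed here.
const withoutUnicodeFlag = compiles(pattern, "");

console.log(withUnicodeFlag, withoutUnicodeFlag);
```

In an engine that follows the specification, withUnicodeFlag is false, which is consistent with reading the second example as a syntax error rather than a failed match.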
roman4e commented Oct 16, 2017
Example: 1) "ілн".match(/\b\w+\b/ug)
2) "ілн".match(/\b\ілн\b/ug)
Actual Result: null
Expected Result: ["ілн"]
Note that the expected result is what appears in languages that use the libpcre3.x library, like perl, php or python.
The actual result is reproducible in Firefox and Chrome, and possibly in other browsers that I did not test.