Regexp with word boundaries do not match cyrillic string #1020

Closed
roman4e opened this Issue Oct 16, 2017 · 9 comments

roman4e commented Oct 16, 2017

Example: 1) "ілн".match(/\b\w+\b/ug)
2) "ілн".match(/\b\ілн\b/ug)
Actual Result: null
Expected Result: ["ілн"]

Note that the expected result is what you get in languages with PCRE-style regular expressions, such as Perl, PHP, or Python.
The actual result is reproducible in Firefox and Chrome, and possibly in other browsers that I did not test.

jmdyck Oct 16, 2017

Collaborator

Are you claiming that this is due to a bug in the spec? Because if not (i.e., if it's just bugs in implementations), it doesn't belong here.

roman4e Oct 16, 2017

Maybe. On the Chromium forum I was advised to try giving feedback here.

ljharb Oct 16, 2017

Member

Can you link to that advice?

roman4e commented Oct 16, 2017

Please look there

msaboff Oct 16, 2017

Contributor

Both the \b assertion and the \w character class escape are defined in the standard in terms of WordCharacters(), which is ASCII based and therefore doesn't handle Cyrillic.
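To make the ASCII-only definition concrete, here is a minimal sketch that can be pasted into an engine's console. The variable names are illustrative, and the test string is written with escapes so that source-file encoding cannot affect the result.

```javascript
// "ілн" written with escapes so the source encoding cannot interfere.
const word = "\u0456\u043b\u043d";

// Without the i flag, \w covers only [A-Za-z0-9_], so no Cyrillic
// letter matches it, even with the u flag.
const wordChars = word.match(/\w+/ug);

// \b needs a \w character on exactly one side of the position. A string
// with no \w characters therefore contains no word boundary at all.
const bounded = word.match(/\b\u0456\u043b\u043d\b/u);

// The same kind of pattern against ASCII input behaves as expected.
const ascii = "abc".match(/\b\w+\b/ug);

console.log(wordChars); // null
console.log(bounded);   // null
console.log(ascii);     // [ 'abc' ]
```

This is why both examples from the original report yield no match: neither end of the Cyrillic word counts as a boundary under the current definition.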

In the second example, the ECMAScript standard does not talk about UTF-8 source code*. I believe that is why the expression "ілн".match(/\b\ілн\b/ug) fails. Some engines may have special handling for the string and/or the expression being UTF-8, but it is not required.

In the JavaScriptCore shell, jsc, I got the following results

>>> "ілн".length
6
>>> "ілн".match(/\u{456}\u{43b}\u{43d}/ug)
null
>>> "\u0456\u043b\u043d".match(/\u{456}\u{43b}\u{43d}/ug)
ілн
>>> "\u0456\u043b\u043d".match(/\b\u{456}\u{43b}\u{43d}\b/ug)
null
>>> "ілн".match(/ілн/ug)
'�лн

The first result shows that the UTF-8 string is not decoded (its length is 6 rather than 3). The second confirms that the UTF-8 string doesn't match the escaped version of the string. The third shows that the escaped string does match the escaped pattern. The fourth shows that the \b assertion still doesn't match, even with the escaped string. Finally, the fifth shows that a UTF-8 expression matches a UTF-8 string, but the result doesn't reflect the input in a readable way.
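A length of 6 is what you would see if the three two-byte UTF-8 sequences were read as one code unit per byte. The sketch below simulates that by hand; it is only a guess at what the shell did with its input, not a description of jsc itself, and it assumes `TextEncoder` is available (as in current browsers and Node.js).

```javascript
const word = "\u0456\u043b\u043d"; // "ілн", three UTF-16 code units

// Encode to UTF-8, then treat every byte as its own code unit.
const bytes = new TextEncoder().encode(word); // d1 96 d0 bb d0 bd
const misdecoded = String.fromCharCode(...bytes);

// The escaped pattern matches the properly decoded string only.
const pattern = /\u{456}\u{43b}\u{43d}/ug;
const matchDecoded = word.match(pattern);
const matchMisdecoded = misdecoded.match(pattern);

console.log(word.length);       // 3
console.log(misdecoded.length); // 6
console.log(matchDecoded);      // [ 'ілн' ]
console.log(matchMisdecoded);   // null
```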

This is just one engine. Since the standard is quiet with respect to UTF-8 source handling*, other engines could have different results, but that still wouldn't support the proposed expected results that started this issue.

* The only mention of UTF-8 handling in the ECMAScript standard is in the URI handling as specified in 18.2.6 URI Handling Functions.

msaboff Oct 16, 2017

Contributor

Let me add that with the proposed Unicode Property Escapes changes to ECMAScript, the following match works:

>>> "\u0456\u043b\u043d".match(/\p{Letter}+/ug)
ілн

Using a UTF-8 source string "matches" for the wrong reasons, because some of its UTF-8 bytes are matched as Latin letters:

>>> "ілн".match(/\p{Letter}+/ug)
Ñ,Ð,Ð
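The `Ñ,Ð,Ð` output can be reproduced in an engine that decodes its input correctly by building the byte-per-character string explicitly. This is a sketch under the assumption that the shell read each UTF-8 byte as a separate character; it needs an engine with property escape support, and `TextEncoder` is assumed to be available.

```javascript
const word = "\u0456\u043b\u043d"; // "ілн"

// UTF-8 bytes d1 96 d0 bb d0 bd, each turned into its own character.
const bytes = new TextEncoder().encode(word);
const misdecoded = String.fromCharCode(...bytes);

// U+00D1 (Ñ) and U+00D0 (Ð) are letters. U+0096, U+00BB and U+00BD are
// not, so they split the input into three one-letter matches.
const letters = misdecoded.match(/\p{Letter}+/ug);

// The properly decoded string is matched as a single word.
const wholeWord = word.match(/\p{Letter}+/ug);

console.log(letters.join(","));   // Ñ,Ð,Ð
console.log(wholeWord.join(",")); // ілн
```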

roman4e Oct 16, 2017

Sure, it works without boundaries, but the boundaries are required. Since ECMAScript internally uses UTF-16 (correct?), it's a little confusing to me why the regex parser interprets test strings as ASCII strings. And even the \w set does not include letters from the UTF-16 set.
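For cases where boundaries are required, one possible workaround is to emulate them with lookaround assertions instead of \b. This is a minimal sketch, not something the spec or this thread prescribes: it assumes an engine with property escapes and lookbehind (both part of ES2018), the helper names `findWords` and `containsWord` are hypothetical, and "word character" is taken to mean letter, number, or underscore.

```javascript
// Assumption: a "word character" is any letter, number, or underscore.
// Adjust the class (for example, add \p{M}) to suit the text at hand.
const WORD = "[\\p{L}\\p{N}_]";

// Hypothetical helper: find whole words without relying on \b or \w.
function findWords(text) {
  const re = new RegExp(`(?<!${WORD})${WORD}+(?!${WORD})`, "gu");
  return text.match(re) || [];
}

// Hypothetical helper: test for one specific word, bounded on both sides.
function containsWord(text, word) {
  // Escape regex syntax characters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`(?<!${WORD})${escaped}(?!${WORD})`, "u").test(text);
}

const word = "\u0456\u043b\u043d"; // "ілн"
console.log(findWords(word));                   // [ 'ілн' ]
console.log(containsWord(`a ${word} b`, word)); // true
console.log(containsWord(`x${word}y`, word));   // false
```

The negative lookbehind and lookahead play the role of the two \b assertions, but with a Unicode-aware notion of what counts as part of a word.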

littledan Oct 16, 2017

Member

This was a deliberate decision in the ES6 cycle, not to change the definition for \w and \b and leave them based on ASCII, as noted here. From here, I'd recommend re-proposing the other half of this proposal if you want to move further on this topic. However, Unicode RegExps have shipped to the Web, and it is likely to be harder to change them without breaking the web. cc @allenwb @mathiasbynens .

bterlson Oct 16, 2017

Member

@msaboff fwiw I get a length of 3 in all engines. I'm not seeing UTF8 bytes anywhere...

I also think the 2nd example should be an error due to the invalid escape sequence (with /u, identity escapes are severely limited to only SyntaxCharacters, although I could very well be misunderstanding things).

Anyway echoing @littledan - changes to \w and \b need to go through the normal proposal process.
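The point about the invalid escape in the second example can be checked with a short sketch. The `compiles` helper is hypothetical, and the pattern is built from a string with escapes so that source encoding plays no part.

```javascript
// Hypothetical helper: does this pattern compile with these flags?
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// The pattern from the second example: \b \і л н \b
const withEscape = "\\b\\\u0456\u043b\u043d\\b";
// The same pattern without the backslash in front of the first letter.
const withoutEscape = "\\b\u0456\u043b\u043d\\b";

// With the u flag, an identity escape may only be a SyntaxCharacter or "/",
// so a backslash before a Cyrillic letter is rejected at compile time.
const escapedWithU = compiles(withEscape, "ug");
const plainWithU = compiles(withoutEscape, "ug");

console.log(escapedWithU); // false
console.log(plainWithU);   // true
```

Without the u flag, web-compatible engines accept the escaped form under the Annex B grammar, which may be why the problem is easy to overlook.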

@bterlson bterlson closed this Oct 16, 2017
