Unicode RegExp with index points trail surrogate in surrogate pair is not covered in the spec #128
Comments
bterlson added the normative change label Oct 26, 2015
allenwb
Oct 26, 2015
Member
This actually is covered by the spec language in 21.2.2.2.
In step 2.1 str is "\uD83D\uDC38" and Input becomes the single element List whose sole element is U+1F438.
In step 2.2 index is 1. Element 1 of str contributed to the creation of element 0 of Input (in other words, Input[0] was obtained from str[0] and str[1]). So listIndex will be 0, since Input[0] was obtained from str[1].
When matching is actually performed, no match is found because the single element of Input (U+1F438) does not match the pattern character U+DC38.
The spec language probably needs to be clearer about what it means by "obtained".
There is an actual bug in 21.2.2.2. Step 2.2 does not say what should be done if index >= the length of str. In that case, listIndex should be set to the same value as InputLength. To fix that, swap steps 2.2 and 2.3 and set listIndex to the value of InputLength when index >= str.length.
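The mapping from index to listIndex described above can be sketched in JavaScript. This is only an illustration, not spec text: `listIndexFor` is a hypothetical helper name, and it assumes that `Array.from` on a string (which iterates by code point) is an acceptable stand-in for the UTF-16 decoding of step 2.1. It also includes the proposed handling for index >= str.length.

```javascript
// Hypothetical helper (not spec text): models how a code-unit index into
// str could map to listIndex when the Unicode flag is set.
function listIndexFor(str, index) {
  // Array.from iterates a string by code point, so a well-formed
  // surrogate pair becomes a single element, as in step 2.1.
  const input = Array.from(str);
  if (index >= str.length) {
    // The case step 2.2 leaves open: use InputLength.
    return input.length;
  }
  let codeUnitsSeen = 0;
  for (let listIndex = 0; listIndex < input.length; listIndex++) {
    codeUnitsSeen += input[listIndex].length; // 1 or 2 code units
    if (index < codeUnitsSeen) {
      return listIndex;
    }
  }
  return input.length;
}

const frog = "\uD83D\uDC38"; // U+1F438 as a surrogate pair
console.log(listIndexFor(frog, 0)); // 0 (lead surrogate)
console.log(listIndexFor(frog, 1)); // 0 (trail surrogate, same element)
console.log(listIndexFor(frog, 2)); // 1 (index >= str.length, so InputLength)
```

Both code units of the pair land on element 0 of Input, which is the reading of "obtained" given in the comment above.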
arai-a
Oct 26, 2015
thanks, that makes sense :)
it would be nice to add some note about surrogate pair to step 2.2.
mathiasbynens referenced this issue Oct 26, 2015
Merged: Normative: Define 21.2.2.2 behavior when `index >= str.length` #130
bterlson
Oct 26, 2015
Member
Ooof, that 2.2 wording is subtle. I will add a note about surrogate pairs unless someone has a better idea...
bterlson closed this in #130 Oct 27, 2015
bterlson
Oct 27, 2015
Member
@arai-a how does this do for you?
When the Unicode flag is present and str contains surrogate pairs, listIndex and index may differ. In particular, given a surrogate pair in str, both high and low surrogates will map to the same element of Input and have the same listIndex.
arai-a
Oct 27, 2015
Sounds great, thank you :)
anba
Oct 27, 2015
Contributor
When the Unicode flag is present and str contains surrogate pairs, listIndex and index may differ. In particular, given a surrogate pair in str, both high and low surrogates will map to the same element of Input and have the same listIndex.
I think this case should be handled in RegExpBuiltinExec, otherwise the index property of the exec result array can differ from the start position of the result string. Or am I missing something?
var r = /\uD83D\uDC38/ug;
r.lastIndex = 1;
var str = "\uD83D\uDC38";
var result = r.exec(str);
print(result.index); // Prints 0 or 1 ?
@anba Isn't that answered by step 17.a of RegExpBuiltinExec?
anba
Oct 27, 2015
Contributor
Step 17.a of RegExpBuiltinExec only computes the new "lastIndex" property. My question is related to the "index" property (step 24 of RegExpBuiltinExec). According to the current algorithm, "index" is set to matchIndex where matchIndex is initialized with lastIndex (step 22). That means in my example "index" is set to 1 even though the actual match starts at string position 0.
allenwb
Oct 27, 2015
Member
Ah, yes I agree. Index should be set to the index of the actual match in such cases.
But this makes me think that we may need to reconsider the interpretation of the lastIndex property for Unicode patterns. As currently spec'ed, the entire string S is UTF-16 decoded independently of the value of lastIndex, and if lastIndex points at the 2nd element of a surrogate pair, lastIndex-1 is the actual starting point within S of the pattern match.
An alternative would be to modify 21.2.2.2 so that it only UTF-16 decodes the string starting at position index (which corresponds to lastIndex). If we did that, a case where lastIndex points at the 2nd element of a surrogate pair would cause the trail surrogate to be decoded as a distinct code point rather than part of a pair, and an S-relative match point could never be less than lastIndex.
I don't think this makes much difference for lastIndex values generated by the RegExp algorithms, but it could be a significant difference for manually set lastIndex values in cases where surrogate pairs aren't expected or fully understood.
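The difference between the two decoding strategies can be illustrated with a small sketch. It assumes that `Array.from` on a string is a fair stand-in for UTF-16 decoding (it iterates by code point and yields a lone surrogate as its own element); `hex` is a hypothetical helper used only for display.

```javascript
const s = "\uD83D\uDC38"; // U+1F438 as a surrogate pair
const lastIndex = 1;      // points at the trail surrogate

// Current wording: decode all of S, regardless of lastIndex.
const wholeString = Array.from(s);

// Alternative: decode only from lastIndex onward, so the trail
// surrogate is seen on its own and becomes a distinct code point.
const fromIndex = Array.from(s.slice(lastIndex));

// Hypothetical display helper: code points as lowercase hex strings.
const hex = (chars) => chars.map((c) => c.codePointAt(0).toString(16));

console.log(hex(wholeString)); // ["1f438"]
console.log(hex(fromIndex));   // ["dc38"]
```

Under the alternative, a pattern such as /\uDC38/u would have something to match at lastIndex 1, and a match could not begin before lastIndex.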
arai-a
Nov 23, 2015
how should I implement the
@bterlson Can you re-open this issue? Thanks!
bterlson reopened this Jan 13, 2016
pushed a commit to paul99/v8mips that referenced this issue Jan 25, 2016
wycats closed this in tc39/proposal-decorators-previous@9c1a3bd May 25, 2016
bterlson reopened this May 25, 2016
ljharb
Mar 21, 2018
Member
@allenwb @arai-a @mathiasbynens @wycats is the fix in tc39/proposal-decorators-previous@9c1a3bd something that should be pulled into the main spec?
ljharb
Mar 21, 2018
Member
@mathiasbynens ah, thanks. does #128 (comment) mean there's still something to do here, or should this be closed?
mathiasbynens
Mar 22, 2018
Member
The remainder of @allenwb’s comments still needs addressing. #128 (comment) AFAICT, everything except the last paragraph still applies.
ericrannaud
Jun 1, 2018
There is a related subtle behavior with global unicode regexps:
const s = "💩"; // "\uD83D\uDCA9"
const re = /|/gu; // produces zero-length match
re.lastIndex = 1; // point to the trailing surrogate
const m = re.exec(s);
console.assert(m[0] === "" && m.index === 0 && re.lastIndex === 0);
// Note: recent Mobile Safari has m.index===1 and re.lastIndex===1
This behavior is defined in the spec at https://tc39.github.io/ecma262/#sec-regexpbuiltinexec, 21.2.5.2.2(14):
14. If fullUnicode is true, then
a. e is an index into the Input character list, derived from S, matched
by matcher. Let eUTF be the smallest index into S that corresponds
to the character at element e of Input. If e is greater than or equal to
the number of elements in Input, then eUTF is the number of code
units in S.
b. Set e to eUTF.
15. If global is true or sticky is true, then
a. Perform ? Set(R, "lastIndex", e, true).
It would be helpful to add a note that a regexp can produce a match at an index prior to lastIndex. I would think many developers have traditionally assumed that a match could only be found at, or after, lastIndex.
Code that works with a caller-provided RegExp object will suddenly break if they are given certain unicode regexps.
For instance, this behavior is challenging to handle in the following typical case: a loop iterating with RegExp.exec() over a string to find all matches, with a regexp that may produce zero-length matches. With non-unicode regexps, to avoid an infinite loop, one can write:
while ((m = re.exec(text)) !== null) {
  if (m[0].length === 0) {
    // Make progress even with zero-length match.
    re.lastIndex += 1;
    continue;
  }
  // ...
}
If re.unicode === true, then one must use something like:
while ((m = re.exec(text)) !== null) {
  if (m[0].length === 0) {
    const lastIndex = re.lastIndex;
    let incr = 1;
    if (re.unicode) {
      const code1 = text.charCodeAt(lastIndex);
      if (code1 >= 0xD800 && code1 <= 0xDBFF) {
        const code2 = text.charCodeAt(lastIndex + 1);
        if (code2 >= 0xDC00 && code2 <= 0xDFFF) {
          // Move past the second surrogate in the pair.
          incr = 2;
        }
      }
    }
    // Make progress even with zero-length match.
    re.lastIndex += incr;
    continue;
  }
  // ...
}
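The surrogate check in the loop above could probably be condensed with String.prototype.codePointAt, in the spirit of the spec's AdvanceStringIndex operation. The following is a sketch under that assumption, not a definitive implementation: `advanceIndex` and `allMatchIndices` are hypothetical names, and the regexp passed in is assumed to have the g flag (a non-global regexp would loop forever here).

```javascript
// Hypothetical helper: the next index to try after a zero-length match.
function advanceIndex(text, index, unicode) {
  if (!unicode || index >= text.length) {
    return index + 1;
  }
  // codePointAt returns a value above 0xFFFF only when index sits on a
  // lead surrogate that is followed by a trail surrogate.
  return index + (text.codePointAt(index) > 0xFFFF ? 2 : 1);
}

// Hypothetical helper: collects the index of every match.
// Assumes re is global; resets re.lastIndex before scanning.
function allMatchIndices(re, text) {
  const indices = [];
  let m;
  re.lastIndex = 0;
  while ((m = re.exec(text)) !== null) {
    indices.push(m.index);
    if (m[0].length === 0) {
      // Make progress even with a zero-length match.
      re.lastIndex = advanceIndex(text, re.lastIndex, re.unicode);
    }
  }
  return indices;
}

console.log(allMatchIndices(/|/gu, "a\uD83D\uDCA9b")); // [0, 1, 3, 4]
console.log(allMatchIndices(/|/g, "a\uD83D\uDCA9b"));  // [0, 1, 2, 3, 4]
```

With the u flag the scan never stops on the trail surrogate at index 2, so the case where a match is reported before lastIndex does not arise in this loop.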
ljharb
Jun 1, 2018
Member
@ericrannaud it's worth noting that with the stage 3 String.prototype.matchAll proposal, you'd basically never need to loop with exec anymore. See https://npmjs.com/string.prototype.matchall
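For comparison, a minimal sketch of the same scan using matchAll, assuming an environment where String.prototype.matchAll is available natively or through the polyfill linked above. The variable names are illustrative only.

```javascript
const text = "a\uD83D\uDCA9b"; // "a", U+1F4A9, "b"

// matchAll requires a global regexp and advances past zero-length
// matches on its own, stepping over a surrogate pair as one unit when
// the u flag is set.
const indices = Array.from(text.matchAll(/|/gu), (m) => m.index);

console.log(indices); // [0, 1, 3, 4]
```

No manual lastIndex bookkeeping is needed, which sidesteps the increment logic discussed above.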
arai-a commented Oct 26, 2015
Derived from https://bugzilla.mozilla.org/show_bug.cgi?id=1135377
ES6 21.2.2.2 steps 2.1-2 don't cover the case when index points to the trail surrogate of a surrogate pair in str.
Here's testcase:
r.lastIndex points to the trail surrogate \uDC38 in str. In step 2.1, Unicode is true and Input becomes a single-element list with U+1F438. So listIndex cannot point to the trail surrogate character in step 2.2.