New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode RegExp with index points trail surrogate in surrogate pair is not covered in the spec #128

Open
arai-a opened this Issue Oct 26, 2015 · 17 comments

Comments

Projects
None yet
7 participants
@arai-a

arai-a commented Oct 26, 2015

Derived from https://bugzilla.mozilla.org/show_bug.cgi?id=1135377

ES6 21.2.2.2 steps 2.1-2 don't cover the case when index points a trails surrogate of the surrogate pair in str.

Here's testcase:

var r = /\uDC38/ug;
r.lastIndex = 1;
var str = "\uD83D\uDC38";
r.exec(str);

r.lastIndex points trail surrogate \uDC38 in str. In step 2.1, Unicode is true and Input becomes a single element list with U+1f438. So listIndex cannot point the trail surrogate character in step 2.2.

@allenwb

This comment has been minimized.

Show comment
Hide comment
@allenwb

allenwb Oct 26, 2015

Member

this actually is covered by the spec. language in 21.2.2.2.

In step 2.1 str is "\uD83D\uDC38" and Input becomes the single element List whose sole element is U+1f438.

In step 2.2 index is 1. Element 1 of str contributed to the creation of element 0 of Input. (in other words Input[0] was obtained from str[0] and str[1]). So, listIndex will be 0 since Input[0] was obtained from string[1].

When matching is actually performed, no match is found because the single element of Input (U+1f438) does not match the pattern character U+0dc38.

The spec. language probably needs to be clearer about what it means by "obtained".

There is an actual bug in 21.2.2.2. Step 2.2 does not say what should be done if index >= the length of str. In that case, listIndex should be set to the same value as InputLength. To fix that, swap steps 2.2 and 2.3 and set listIndex to the value of InputLength when index >= str.length.

Member

allenwb commented Oct 26, 2015

this actually is covered by the spec. language in 21.2.2.2.

In step 2.1 str is "\uD83D\uDC38" and Input becomes the single element List whose sole element is U+1f438.

In step 2.2 index is 1. Element 1 of str contributed to the creation of element 0 of Input. (in other words Input[0] was obtained from str[0] and str[1]). So, listIndex will be 0 since Input[0] was obtained from string[1].

When matching is actually performed, no match is found because the single element of Input (U+1f438) does not match the pattern character U+0dc38.

The spec. language probably needs to be clearer about what it means by "obtained".

There is an actual bug in 21.2.2.2. Step 2.2 does not say what should be done if index >= the length of str. In that case, listIndex should be set to the same value as InputLength. To fix that, swap steps 2.2 and 2.3 and set listIndex to the value of InputLength when index >= str.length.

@arai-a

This comment has been minimized.

Show comment
Hide comment
@arai-a

arai-a Oct 26, 2015

thanks, that makes sense :)
it would be nice to add some note about surrogate pair to step 2.2.

arai-a commented Oct 26, 2015

thanks, that makes sense :)
it would be nice to add some note about surrogate pair to step 2.2.

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson Oct 26, 2015

Member

Ooof, that 2.2 wording is subtle. I will add a note about surrogate pairs unless someone has a better idea...

Member

bterlson commented Oct 26, 2015

Ooof, that 2.2 wording is subtle. I will add a note about surrogate pairs unless someone has a better idea...

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson Oct 27, 2015

Member

@arai-a how does this do for you?

When the Unicode flag is present and str contains surrogate pairs, listIndex and index may differ. In particular, given a surrogate pair in str, both high and low surrogates will map to the same element of Input and have the same listIndex.

Member

bterlson commented Oct 27, 2015

@arai-a how does this do for you?

When the Unicode flag is present and str contains surrogate pairs, listIndex and index may differ. In particular, given a surrogate pair in str, both high and low surrogates will map to the same element of Input and have the same listIndex.

@arai-a

This comment has been minimized.

Show comment
Hide comment
@arai-a

arai-a Oct 27, 2015

Sounds great, thank you :)

arai-a commented Oct 27, 2015

Sounds great, thank you :)

@anba

This comment has been minimized.

Show comment
Hide comment
@anba

anba Oct 27, 2015

Contributor

When the Unicode flag is present and str contains surrogate pairs, listIndex and index may differ. In particular, given a surrogate pair in str, both high and low surrogates will map to the same element of Input and have the same listIndex.

I think this case should be handled in RegExpBuiltinExec, otherwise the index property of the exec result array can differ from the start position of the result string. Or am I missing something?

var r = /\uD83D\uDC38/ug;
r.lastIndex = 1;
var str = "\uD83D\uDC38";
var result = r.exec(str);
print(result.index); // Prints 0 or 1 ?
Contributor

anba commented Oct 27, 2015

When the Unicode flag is present and str contains surrogate pairs, listIndex and index may differ. In particular, given a surrogate pair in str, both high and low surrogates will map to the same element of Input and have the same listIndex.

I think this case should be handled in RegExpBuiltinExec, otherwise the index property of the exec result array can differ from the start position of the result string. Or am I missing something?

var r = /\uD83D\uDC38/ug;
r.lastIndex = 1;
var str = "\uD83D\uDC38";
var result = r.exec(str);
print(result.index); // Prints 0 or 1 ?
@allenwb

This comment has been minimized.

Show comment
Hide comment
@allenwb

allenwb Oct 27, 2015

Member

@anba Isn't that answered by step 17.a of RegExpExExec?

Member

allenwb commented Oct 27, 2015

@anba Isn't that answered by step 17.a of RegExpExExec?

@anba

This comment has been minimized.

Show comment
Hide comment
@anba

anba Oct 27, 2015

Contributor

Step 17.a of RegExpBuiltinExec only computes the new "lastIndex" property. My question is related to the "index" property (step 24 of RegExpBuiltinExec). According to the current algorithm, "index" is set to matchIndex where matchIndex is initialized with lastIndex (step 22). That means in my example "index" is set to 1 even though the actual match starts at string position 0.

Contributor

anba commented Oct 27, 2015

Step 17.a of RegExpBuiltinExec only computes the new "lastIndex" property. My question is related to the "index" property (step 24 of RegExpBuiltinExec). According to the current algorithm, "index" is set to matchIndex where matchIndex is initialized with lastIndex (step 22). That means in my example "index" is set to 1 even though the actual match starts at string position 0.

@allenwb

This comment has been minimized.

Show comment
Hide comment
@allenwb

allenwb Oct 27, 2015

Member

Ah, yes I agree. Index should be set to the index of the actual match is such cases.

But this makes me think that we may need to reconsider the interpatation of thelastIndex property for Unicode patterns. As currently spec'ed, the entire string S is UTF-16 decoded independently of the value of lastIndex and if lastIndex points at the 2nd element of a surrogate pair, lastIndex-1 is the actual starting point within S of the pattern match.

An alternative, would to modify 21.2.2.2 so that it only UTF-16 decoded the string starting at position index (which corresponds to lastIndex). If we did that, a case where lastIndex points at the 2nd element of a surrogate pair would cause the trail surrogate to be decoded as a distinct code point rather than part of a pair and a S-relative match point could never be less than lastIndex.

I don't think this makes much difference for lastIndex values generated by the RegExp algorithms, but it could be a significant difference for manually set lastIndex values in cases where surrogate pairs aren't expected or fully understood.

Member

allenwb commented Oct 27, 2015

Ah, yes I agree. Index should be set to the index of the actual match is such cases.

But this makes me think that we may need to reconsider the interpatation of thelastIndex property for Unicode patterns. As currently spec'ed, the entire string S is UTF-16 decoded independently of the value of lastIndex and if lastIndex points at the 2nd element of a surrogate pair, lastIndex-1 is the actual starting point within S of the pattern match.

An alternative, would to modify 21.2.2.2 so that it only UTF-16 decoded the string starting at position index (which corresponds to lastIndex). If we did that, a case where lastIndex points at the 2nd element of a surrogate pair would cause the trail surrogate to be decoded as a distinct code point rather than part of a pair and a S-relative match point could never be less than lastIndex.

I don't think this makes much difference for lastIndex values generated by the RegExp algorithms, but it could be a significant difference for manually set lastIndex values in cases where surrogate pairs aren't expected or fully understood.

@arai-a

This comment has been minimized.

Show comment
Hide comment
@arai-a

arai-a Nov 23, 2015

how should I implement the index part for now?

arai-a commented Nov 23, 2015

how should I implement the index part for now?

@anba

This comment has been minimized.

Show comment
Hide comment
@anba

anba Nov 27, 2015

Contributor

@bterlson Can you re-open this issue? Thanks!

Contributor

anba commented Nov 27, 2015

@bterlson Can you re-open this issue? Thanks!

@bterlson bterlson reopened this Jan 13, 2016

kisg pushed a commit to paul99/v8mips that referenced this issue Jan 25, 2016

[regexp] step back if starting unicode regexp within surrogate pair.
See tc39/ecma262#128

R=erik.corry@gmail.com, littledan@chromium.org
BUG=v8:2952
LOG=N

Review URL: https://codereview.chromium.org/1608693003

Cr-Commit-Position: refs/heads/master@{#33488}

@bterlson bterlson reopened this May 25, 2016

@ljharb

This comment has been minimized.

Show comment
Hide comment
@ljharb

ljharb Mar 21, 2018

Member

@allenwb @arai-a @mathiasbynens @wycats is the fix in tc39/proposal-decorators-previous@9c1a3bd something that should be pulled into the main spec?

Member

ljharb commented Mar 21, 2018

@allenwb @arai-a @mathiasbynens @wycats is the fix in tc39/proposal-decorators-previous@9c1a3bd something that should be pulled into the main spec?

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Mar 21, 2018

Member

It was; see #130 and 9c1a3bd.

Member

mathiasbynens commented Mar 21, 2018

It was; see #130 and 9c1a3bd.

@ljharb

This comment has been minimized.

Show comment
Hide comment
@ljharb

ljharb Mar 21, 2018

Member

@mathiasbynens ah, thanks. does #128 (comment) mean there's still something to do here, or should this be closed?

Member

ljharb commented Mar 21, 2018

@mathiasbynens ah, thanks. does #128 (comment) mean there's still something to do here, or should this be closed?

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Mar 22, 2018

Member

The remainder of @allenwb’s comments still needs addressing. #128 (comment) AFAICT, everything except the last paragraph still applies.

Member

mathiasbynens commented Mar 22, 2018

The remainder of @allenwb’s comments still needs addressing. #128 (comment) AFAICT, everything except the last paragraph still applies.

@ericrannaud

This comment has been minimized.

Show comment
Hide comment
@ericrannaud

ericrannaud Jun 1, 2018

There is a related subtle behavior with global unicode regexps:

  const s = "💩"; // "\uD83D\uDCA9"
  const re = /|/gu; // produces zero-length match
  re.lastIndex = 1; // point to the trailing surrogate
  const m = re.exec(s);
  console.assert(m[0] === "" && m.index === 0 && re.lastIndex === 0);
  // Note: recent Mobile Safari has m.index===1 and re.lastIndex===1

This behavior is defined in the spec at https://tc39.github.io/ecma262/#sec-regexpbuiltinexec, 21.2.5.2.2(14):

14. If fullUnicode is true, then
  a. e is an index into the Input character list, derived from S, matched
by matcher. Let eUTF be the smallest index into S that corresponds
to the character at element e of Input. If e is greater than or equal to
the number of elements in Input, then eUTF is the number of code
units in S.
  b. Set e to eUTF.
15. If global is true or sticky is true, then
  a. Perform ? Set(R, "lastIndex", e, true).

It would be helpful to add a note that a regexp can produce a match at an index prior to lastIndex. I would think many developers have traditionally assumed that a match could only be found at, or after, lastIndex.

Code that works with a caller-provided RegExp object will suddenly break if they are given certain unicode regexps.

For instance, this behavior is challenging to handle in the following typical case: a loop iterating with RegExp.exec() over a string to find all matches, with a regexp that may produce zero-length matches. With non-unicode regexps, to avoid an infinite loop, one can write:

  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) {
      // Make progress even with zero-length match.
      re.lastIndex += 1;
      continue;
    }
    // ...
  }

If re.unicode === true, then one must use something like:

  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) {
      const lastIndex = re.lastIndex;
      let incr = 1;
      if (re.unicode) {
        const code1 = text.charCodeAt(lastIndex);
        if (code1 >= 0xD800 && code1 <= 0xDBFF) {
          const code2 = text.charCodeAt(lastIndex + 1);
          if (code2 >= 0xDC00 && code2 <= 0xDFFF) {
            // Move past the second surrogate in the pair.
            incr = 2;
          }
        }
      }
      // Make progress even with zero-length match.
      re.lastIndex += incr;
      continue;
    }
    // ...
  }

ericrannaud commented Jun 1, 2018

There is a related subtle behavior with global unicode regexps:

  const s = "💩"; // "\uD83D\uDCA9"
  const re = /|/gu; // produces zero-length match
  re.lastIndex = 1; // point to the trailing surrogate
  const m = re.exec(s);
  console.assert(m[0] === "" && m.index === 0 && re.lastIndex === 0);
  // Note: recent Mobile Safari has m.index===1 and re.lastIndex===1

This behavior is defined in the spec at https://tc39.github.io/ecma262/#sec-regexpbuiltinexec, 21.2.5.2.2(14):

14. If fullUnicode is true, then
  a. e is an index into the Input character list, derived from S, matched
by matcher. Let eUTF be the smallest index into S that corresponds
to the character at element e of Input. If e is greater than or equal to
the number of elements in Input, then eUTF is the number of code
units in S.
  b. Set e to eUTF.
15. If global is true or sticky is true, then
  a. Perform ? Set(R, "lastIndex", e, true).

It would be helpful to add a note that a regexp can produce a match at an index prior to lastIndex. I would think many developers have traditionally assumed that a match could only be found at, or after, lastIndex.

Code that works with a caller-provided RegExp object will suddenly break if they are given certain unicode regexps.

For instance, this behavior is challenging to handle in the following typical case: a loop iterating with RegExp.exec() over a string to find all matches, with a regexp that may produce zero-length matches. With non-unicode regexps, to avoid an infinite loop, one can write:

  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) {
      // Make progress even with zero-length match.
      re.lastIndex += 1;
      continue;
    }
    // ...
  }

If re.unicode === true, then one must use something like:

  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) {
      const lastIndex = re.lastIndex;
      let incr = 1;
      if (re.unicode) {
        const code1 = text.charCodeAt(lastIndex);
        if (code1 >= 0xD800 && code1 <= 0xDBFF) {
          const code2 = text.charCodeAt(lastIndex + 1);
          if (code2 >= 0xDC00 && code2 <= 0xDFFF) {
            // Move past the second surrogate in the pair.
            incr = 2;
          }
        }
      }
      // Make progress even with zero-length match.
      re.lastIndex += incr;
      continue;
    }
    // ...
  }
@ljharb

This comment has been minimized.

Show comment
Hide comment
@ljharb

ljharb Jun 1, 2018

Member

@ericrannaud it's worth noting that with the stage 3 String.prototype.matchAll proposal, you'd basically never need to loop with exec anymore. See https://npmjs.com/string.prototype.matchall

Member

ljharb commented Jun 1, 2018

@ericrannaud it's worth noting that with the stage 3 String.prototype.matchAll proposal, you'd basically never need to loop with exec anymore. See https://npmjs.com/string.prototype.matchall

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment