align \s with White_Space #37

markusicu · 2021-07-15T18:33:47Z

ES regex \s is almost, but not quite, the same as \p{White_Space}.

\s (CharacterClassEscape :: s) is defined as “Return the CharSet containing all characters corresponding to a code point on the right-hand side of the WhiteSpace or LineTerminator productions.”

On the other hand, \p{White_Space} is the Unicode White_Space property. See the list of code point ranges at the top of https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

These are deceptively similar; each set contains 25 code points. However:

- \s contains U+FEFF ZERO WIDTH NO-BREAK SPACE, which is a format control (gc=Cf), not a space character.
- \p{White_Space} contains U+0085 NEXT LINE (NEL), which is missing from the ES LineTerminator list.
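
The two differences can be checked empirically. The following is a minimal sketch, assuming an engine that supports Unicode property escapes (the u flag with \p{…}); the variable names are illustrative only. It walks every code point and records where \s and \p{White_Space} disagree.

```javascript
// Compare the code points matched by \s with those matched by \p{White_Space}.
// Property escapes such as \p{White_Space} require the u flag.
const esWhitespace = /^\s$/u;
const unicodeWhiteSpace = /^\p{White_Space}$/u;

const onlyInS = [];          // matched by \s but not by \p{White_Space}
const onlyInWhiteSpace = []; // matched by \p{White_Space} but not by \s
let countS = 0;
let countWhiteSpace = 0;

for (let cp = 0; cp <= 0x10FFFF; cp++) {
  const ch = String.fromCodePoint(cp);
  const inS = esWhitespace.test(ch);
  const inWs = unicodeWhiteSpace.test(ch);
  if (inS) countS++;
  if (inWs) countWhiteSpace++;
  if (inS && !inWs) onlyInS.push(cp);
  if (!inS && inWs) onlyInWhiteSpace.push(cp);
}

// Format a code point as U+XXXX.
const fmt = (cp) => 'U+' + cp.toString(16).toUpperCase().padStart(4, '0');

console.log(countS, countWhiteSpace);   // size of each set
console.log(onlyInS.map(fmt));          // code points only in \s
console.log(onlyInWhiteSpace.map(fmt)); // code points only in \p{White_Space}
```

With the Unicode data shipped in current engines, this should report 25 code points for each set, with U+FEFF only in \s and U+0085 only in \p{White_Space}.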

This is confusing and non-standard.

Under the new flag, we should change \s to be the same as \p{White_Space}.

I am not proposing that we change the ES lexer's WhiteSpace or LineTerminator definitions.


mathiasbynens · 2021-07-16T08:12:30Z

\s was originally meant to be consistent with observable definitions of "whitespace" elsewhere in the language, e.g. String.prototype.trim. This would be changing that invariant when the v flag is set.

I would personally be okay with this since the v flag is an explicit opt-in signal. But given the above, I’d be curious to hear what @waldemarhorwat thinks.
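
The consistency between \s and String.prototype.trim mentioned above can be observed directly. This is a small sketch with illustrative variable names; it only exercises the two code points where \s and \p{White_Space} differ.

```javascript
const bom = '\uFEFF'; // ZERO WIDTH NO-BREAK SPACE, in ES WhiteSpace
const nel = '\u0085'; // NEXT LINE (NEL), not in ES WhiteSpace or LineTerminator

// trim() strips the same characters that \s matches.
const trimmedBom = (bom + 'abc' + bom).trim();
const trimmedNel = (nel + 'abc' + nel).trim();

console.log(trimmedBom === 'abc'); // true: U+FEFF is stripped
console.log(trimmedNel.length);    // 5: U+0085 is kept on both sides
console.log(/\s/.test(bom), /\s/.test(nel)); // true false
```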

markusicu · 2021-07-16T19:01:03Z

> \s was originally meant to be consistent with observable definitions of "whitespace" elsewhere in the language, e.g. String.prototype.trim.

I think trim() should trim White_Space, and I think that developers would expect that.

I understand that U+FEFF is in the ES WhiteSpace production, probably as a BOM shortcut for the lexer, but I don't understand why, outside of the lexer, we should have a 4% discrepancy between trim()/\s and White_Space.

markusicu · 2021-08-19T16:29:01Z

Concern from @bmeck: There are special uses of U+FEFF that some people may handle with \s. This should be OK if we clearly document that \s under the new flag does not match U+FEFF.

macchiati · 2021-08-19T16:40:43Z

Note: the preferred character — since Unicode 3.2 (2002) — to join words is U+2060 WORD JOINER. It behaves the same as U+FEFF in line break, word break, and grapheme cluster break, but does not have the "BOM" semantics.

See

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3ALine_Break%3DWord_Joiner%3A%5D&g=wb+gcb+age&i=
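
To illustrate how the two joiner characters are treated by ES regular expressions today, here is a brief sketch, assuming an engine with Unicode property escape support. Both characters are format controls, but only U+FEFF is matched by \s.

```javascript
const ZWNBSP = '\uFEFF';      // ZERO WIDTH NO-BREAK SPACE (also used as BOM)
const WORD_JOINER = '\u2060'; // WORD JOINER

// Both are format controls (General_Category = Cf).
const bothAreFormat = /^\p{gc=Cf}$/u.test(ZWNBSP) && /^\p{gc=Cf}$/u.test(WORD_JOINER);

// Only U+FEFF is matched by \s; neither has the White_Space property.
const sMatchesZwnbsp = /\s/u.test(ZWNBSP);
const sMatchesWordJoiner = /\s/u.test(WORD_JOINER);
const eitherIsWhiteSpace =
  /\p{White_Space}/u.test(ZWNBSP) || /\p{White_Space}/u.test(WORD_JOINER);

console.log(bothAreFormat, sMatchesZwnbsp, sMatchesWordJoiner, eitherIsWhiteSpace);
// true true false false
```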

markusicu · 2021-08-19T16:45:07Z

Discussion: The regex flag only affects regular expressions, so we cannot and will not change the behavior of String.prototype.trim().
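
Code that wants White_Space semantics for trimming can get them in user land without any change to trim(). The helper below is a hypothetical sketch (trimWhiteSpace is not a built-in and is not part of any proposal); it assumes Unicode property escape support.

```javascript
// Hypothetical helper: trim leading and trailing characters that have the
// Unicode White_Space property, instead of the ES WhiteSpace/LineTerminator set.
function trimWhiteSpace(str) {
  return str.replace(/^\p{White_Space}+|\p{White_Space}+$/gu, '');
}

const a = trimWhiteSpace('\u0085 abc \u0085'); // U+0085 is White_Space, so it is removed
const b = trimWhiteSpace('\uFEFFabc');         // U+FEFF is not White_Space, so it is kept

console.log(a === 'abc'); // true
console.log(b.length);    // 4
```

The built-in trim() behaves the opposite way on both inputs: it keeps U+0085 and removes U+FEFF.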

markusicu · 2021-09-30T22:45:26Z

This is no longer a goal for this proposal.

gibson042 · 2022-11-17T15:36:10Z

FWIW, jmespath-community/jmespath.test#11 (comment) uncovered real-world divergence between independent implementations of a specification; in the case of JavaScript, it was caused by the difference between regular expression \s and the Unicode White_Space property.

markusicu mentioned this issue Jul 15, 2021

Unicode-aware \w, \d, and \b ? #16

Closed

RunDevelopment mentioned this issue Aug 19, 2021

Consideration for Perl-like (?[]) extended character classes instead of a flag #39

Closed

mathiasbynens mentioned this issue Aug 26, 2021

Expanded scope: further alignment with UTS#18 #43

Closed

markusicu closed this as completed Sep 30, 2021

gibson042 mentioned this issue Mar 27, 2024

Editorial: Mention the special status of U+0085 NEXT LINE tc39/ecma262#3303

Closed

gibson042 mentioned this issue Apr 22, 2024

Editorial: refer to code points directly by name/number instead of using aliases tc39/ecma262#3310

Open
