This repository has been archived by the owner on Feb 16, 2024. It is now read-only.

align \s with White_Space #37

Closed
markusicu opened this issue Jul 15, 2021 · 7 comments

Comments

@markusicu
Collaborator

ES regex \s is almost, but not quite, the same as \p{White_Space}.

\s (CharacterClassEscape :: s) is defined as “Return the CharSet containing all characters corresponding to a code point on the right-hand side of the WhiteSpace or LineTerminator productions.”

On the other hand, \p{White_Space} is the Unicode White_Space property. See the list of code point ranges at the top of https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

These are deceptively similar; each set contains 25 code points. However,

  • \s contains U+FEFF ZERO WIDTH NO-BREAK SPACE, which is a format control (gc=Cf), not a space character
  • \p{White_Space} contains U+0085 NEXT LINE (NEL) which is missing from the ES LineTerminator list.

This is confusing and non-standard.
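The two differences listed above can be checked empirically. The following is a minimal sketch, assuming a JavaScript engine with Unicode property escapes (ES2018 or later) and reasonably current Unicode data; the helper names `matchingCodePoints` and `toHex` are hypothetical and not part of any proposal.

```javascript
// Hypothetical helper: list every code point whose one-character string
// is matched in full by the given regular expression.
function matchingCodePoints(re) {
  const result = [];
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    if (cp >= 0xd800 && cp <= 0xdfff) continue; // skip lone surrogates
    if (re.test(String.fromCodePoint(cp))) result.push(cp);
  }
  return result;
}

// Hypothetical helper: format a code point as U+XXXX.
const toHex = (cp) => "U+" + cp.toString(16).toUpperCase().padStart(4, "0");

const esWhitespace = matchingCodePoints(/^\s$/u);
const unicodeWhiteSpace = matchingCodePoints(/^\p{White_Space}$/u);

// Code points matched by one set but not the other.
const onlyInS = esWhitespace.filter((cp) => !unicodeWhiteSpace.includes(cp));
const onlyInWhiteSpace = unicodeWhiteSpace.filter((cp) => !esWhitespace.includes(cp));

console.log(esWhitespace.length, unicodeWhiteSpace.length);
console.log(onlyInS.map(toHex), onlyInWhiteSpace.map(toHex));
```

If the engine's Unicode data agrees with the counts given above, both sets should have 25 members, with U+FEFF appearing only in the `\s` set and U+0085 only in the `\p{White_Space}` set.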

Under the new flag, we should change \s to be the same as \p{White_Space}.

I am not proposing that we change the ES lexer's WhiteSpace or LineTerminator definitions.

@mathiasbynens
Member

\s was originally meant to be consistent with observable definitions of "whitespace" elsewhere in the language, e.g. String.prototype.trim. This would be changing that invariant when the v flag is set.
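The consistency between `\s` and `String.prototype.trim` mentioned here can be illustrated with a short sketch. It assumes only standard built-ins; the variable names are illustrative.

```javascript
const bom = "\uFEFF"; // in ES WhiteSpace, but not in Unicode White_Space
const nel = "\u0085"; // in Unicode White_Space, but not in ES WhiteSpace or LineTerminator

// trim() strips what the ES WhiteSpace and LineTerminator productions cover.
const trimmedBom = (bom + "abc" + bom).trim();
const trimmedNel = (nel + "abc" + nel).trim();

console.log(trimmedBom.length); // U+FEFF is stripped on both sides
console.log(trimmedNel.length); // U+0085 is left in place on both sides

// \s agrees with trim() on both characters.
console.log(/^\s$/.test(bom), /^\s$/.test(nel));
```

Changing `\s` alone under a new flag would make a regex and `trim()` disagree on exactly these two characters.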

I would personally be okay with this since the v flag is an explicit opt-in signal. But given the above, I’d be curious to hear what @waldemarhorwat thinks.

@markusicu
Collaborator Author

> \s was originally meant to be consistent with observable definitions of "whitespace" elsewhere in the language, e.g. String.prototype.trim.

I think trim() should trim White_Space, and I think that developers would expect that.

I understand that U+FEFF is in the ES WhiteSpace production, probably as a BOM shortcut for the lexer, but I don't understand why, outside of the lexer, we should have a 4% discrepancy between trim()/\s and White_Space.

@markusicu
Collaborator Author

Concern from @bmeck : There are special uses of U+FEFF that some people may handle with \s -- should be ok if we document clearly that \s under the new flag does not match U+FEFF.
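Code that relies on `\s` to catch U+FEFF can name the character explicitly instead, which behaves the same regardless of how `\s` is defined. A minimal sketch; `stripBom` and `spaceOrBom` are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: remove a single leading byte order mark, if present.
function stripBom(text) {
  return text.replace(/^\uFEFF/, "");
}

// Explicit class for code that wants "whitespace or BOM" without
// depending on \s including U+FEFF.
const spaceOrBom = /[\s\uFEFF]/;

const withBom = "\uFEFF{\"a\": 1}";
const cleaned = stripBom(withBom);

console.log(JSON.parse(cleaned).a);
console.log(spaceOrBom.test("\uFEFF"));
```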

@macchiati
Collaborator

macchiati commented Aug 19, 2021

Note: the preferred character — since Unicode 3.2 (2002) — to join words is U+2060 WORD JOINER. It behaves the same as U+FEFF in line break, word break, and grapheme cluster break, but does not have the "BOM" semantics.

See

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3ALine_Break%3DWord_Joiner%3A%5D&g=wb+gcb+age&i=
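As a rough illustration of how the two characters look from a JavaScript regex, the sketch below checks general category and whitespace matching. JavaScript property escapes do not expose Line_Break, Word_Break, or Grapheme_Cluster_Break, so those parts of the comparison are not covered here; the constant names are illustrative.

```javascript
const ZWNBSP = "\uFEFF";      // ZERO WIDTH NO-BREAK SPACE, also used as the BOM
const WORD_JOINER = "\u2060"; // WORD JOINER

// Both characters are format controls (gc=Cf).
const isFormat = /^\p{General_Category=Format}$/u;
console.log(isFormat.test(ZWNBSP), isFormat.test(WORD_JOINER));

// Only U+FEFF is matched by \s; U+2060 is not.
console.log(/^\s$/.test(ZWNBSP), /^\s$/.test(WORD_JOINER));

// Neither character has the Unicode White_Space property.
const isWhiteSpace = /^\p{White_Space}$/u;
console.log(isWhiteSpace.test(ZWNBSP), isWhiteSpace.test(WORD_JOINER));
```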

@markusicu
Collaborator Author

Discussion: The regex flag only affects regular expressions; therefore, we cannot and will not change the behavior of String.prototype.trim().

@markusicu
Collaborator Author

This is no longer a goal for this proposal.
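With this goal dropped, code that needs Unicode White_Space semantics can ask for the property by name rather than relying on `\s`. A minimal sketch, assuming an engine with Unicode property escapes (ES2018 or later); the sample string is made up for illustration.

```javascript
// U+0085 (NEL) between the first two words, U+FEFF between the last two.
const text = "alpha\u0085beta\uFEFFgamma";

// \s splits at U+FEFF but not at U+0085.
const byEsWhitespace = text.split(/\s+/);

// \p{White_Space} splits at U+0085 but not at U+FEFF.
const byUnicodeWhiteSpace = text.split(/\p{White_Space}+/u);

console.log(byEsWhitespace.length, byUnicodeWhiteSpace.length);
console.log(byEsWhitespace[1], byUnicodeWhiteSpace[0]);
```

Both calls produce two pieces, but the split point differs, which is the kind of divergence that can go unnoticed until two implementations are compared.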

@gibson042

FWIW, jmespath-community/jmespath.test#11 (comment) uncovered real-world divergence between independent implementations of a specification. In the JavaScript case, it was caused by the difference between regular expression \s and the Unicode White_Space property.
