This doesn't actually implement any regex flavor. Worse, for each well-known flavor there exists a diverging regex which is valid both in tree-sitter and in that flavor, but which is interpreted differently.
It obviously diverges from Rust-flavored regex, since we're changing the regex before interpreting it as Rust-flavored regex.
It doesn't even maintain that all valid Rust regexes are interpreted correctly, since Rust supports extended word boundary assertions which get mangled (\b{start}, \b{end}, \b{start-half}, \b{end-half}).
It still doesn't match ECMAScript flavored regex:
\0 is not recognized as meaning \u{0}.
It diverges in []] being interpreted as \] (ECMAScript flavor) versus [\]] (Rust flavor).
It diverges from Unicode-unaware mode, which doesn't have escapes \a, \A, \z, \<, \>, \x{}, \u{}, \U{}, \p, or \P, thus treats them as literal escapes.
Tree-sitter also still treats literal {} and unclosed { as an error.
It diverges from u mode, which doesn't have character class set operations, thus making [a-y&&xyz] mean [&a-z] (instead of [xy]).
It diverges from v mode, which doesn't have POSIX-style character classes, thus making [[:alpha:]] mean [:alph] (instead of [A-Za-z]).
It diverges from PCRE, which doesn't have \<, \>, or character class set operations.
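For anyone who wants to check the ECMAScript divergences listed above, here is a minimal sketch that can be run in any JavaScript engine. The helper name `compiles` is made up for illustration, and the patterns are built with `new RegExp` so that a rejected pattern shows up as a catchable SyntaxError instead of failing the whole script.

```javascript
// Returns true if `source` is accepted as a pattern under the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}

// Unicode-unaware (legacy) mode: unknown escapes and stray braces are literals.
const legacy = {
  escapeA: new RegExp("^\\a$").test("a"),       // \a is an identity escape
  escapeLt: new RegExp("^\\<$").test("<"),      // \< is an identity escape
  emptyBraces: new RegExp("^x{}$").test("x{}"), // {} is literal text
  openBrace: new RegExp("^x{$").test("x{"),     // unclosed { is literal text
};

// Unicode-aware (u) mode: check whether the same four patterns still compile.
const unicode = {
  escapeA: compiles("\\a", "u"),
  escapeLt: compiles("\\<", "u"),
  emptyBraces: compiles("x{}", "u"),
  openBrace: compiles("x{", "u"),
};

// u mode has no set operations: && is just two literal ampersands in the class.
const ampersandInClass = new RegExp("^[a-y&&xyz]$", "u").test("&");
const zInClass = new RegExp("^[a-y&&xyz]$", "u").test("z");

console.log({ legacy, unicode, ampersandInClass, zInClass });
```

In legacy mode all four patterns match their literal text, whereas with the u flag all four are rejected at construction time.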
Suggested resolution
Just stop preprocessing the regex. There are two possible justifications available for removing the curly brace preprocessing:
State that tree-sitter uses Rust-flavored regular expressions. It's even a flavor option on regex101 nowadays.
State that tree-sitter always applies the u or v flag to the ECMAScript flavored regex.
Reiterating: the regex preprocessing
makes tree-sitter use an ad-hoc regex flavor
in order to emulate an explicitly deprecated behavior of ECMAScript regex syntax
and it doesn't even match ECMAScript behavior.
Given that Tree-sitter deliberately supports \u{hhhh} escapes (footnote 3) despite being a .js syntax, where such escapes require the u or v flag, it imho just makes sense to stop preprocessing the regex to support syntax which is an error given the use of the u or v flag. And especially now that Rust is a known flavor on regex101, it's much more of an option to just say that's what Tree-sitter is using.
Footnotes
1. This escapes the curly braces of any {} for which the contents do not look like a character class, repetition quantifier, or \p.
2. This preprocessing is redundant now, as the escape sequences in Rust's regex-syntax now always permit escaping any ASCII character except [0-9A-Za-z<>] (the escape then matches the literal ASCII character).
3. If it didn't, and wanted to match ECMAScript behavior more closely, the curly brace preprocessing would probably match [^[:digit:],}] instead of [^[:alnum:],}], to at least match the behavior of \u{100F}, although \u{100} would still be interpreted as \u0100 (Rust flavor interpretation) instead of u{100} (deprecated Unicode-unaware ECMAScript flavor interpretation).
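The two interpretations of a braced \u escape described in the last footnote can be observed in JavaScript itself. A minimal sketch, with variable names that are purely illustrative:

```javascript
// Legacy mode: \u not followed by four hex digits is an identity escape for
// "u", so {3} is parsed as a repetition quantifier on that literal "u".
const legacyRepeat = new RegExp("^\\u{3}$").test("uuu");

// u mode: \u{...} is a code point escape, so \u{41} matches "A" (U+0041).
const unicodeCodePoint = new RegExp("^\\u{41}$", "u").test("A");

// The same legacy pattern does not match the single code point U+0003.
const legacyCodePoint = new RegExp("^\\u{3}$").test("\u0003");

console.log({ legacyRepeat, unicodeCodePoint, legacyCodePoint });
```

This is the deprecated Unicode-unaware behavior that the brace preprocessing only partially emulates.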
CAD97 added a commit to CAD97/tree-sitter that referenced this issue on Dec 25, 2023:
Fixes tree-sitter#2837; see that issue for details.
In short: the emulated syntax behavior of JS regexp is deprecated
and only kept for web compatibility, and you should not rely on it.
Removing the preprocessing makes behavior more predictable, and even
more consistent with Unicode-aware JS regexp, which are required to
use syntax which Tree-sitter has always supported (braced \u and \p).
Tree-sitter currently does two preprocessing steps to regex: replacing (^|[^\\pP])\{([^}]*[^0-9A-Fa-f,}][^}]*)\} with $1\{$2\} (footnote 1), and a second step involving the characters !'"/ (footnote 2).
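To make the first step concrete, here is a minimal sketch of the brace-escaping substitution, transcribed into JavaScript purely for illustration. Tree-sitter performs this step in its own implementation; the function name `escapeBraces` is hypothetical, and applying the replacement to every match (the global flag) is an assumption.

```javascript
// Hypothetical re-creation of the brace-escaping step quoted above.
// Assumption: the replacement is applied to every match (global flag).
const BRACE_STEP = /(^|[^\\pP])\{([^}]*[^0-9A-Fa-f,}][^}]*)\}/g;

function escapeBraces(pattern) {
  // "$1\\{$2\\}" re-emits the preceding character, then the braces escaped.
  return pattern.replace(BRACE_STEP, "$1\\{$2\\}");
}

const samples = [
  "a{2,3}",      // repetition quantifier
  "\\p{L}",      // Unicode property escape
  "\\u{100F}",   // braced code point escape
  "\\b{start}",  // Rust word-boundary assertion
];
for (const s of samples) {
  console.log(s, "->", escapeBraces(s));
}
```

Under these assumptions the first three samples pass through unchanged, while \b{start} becomes \b\{start\}, that is, a plain word boundary followed by the literal text {start}.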
Using the u or v flag turns the deprecated literal {} syntax which is being emulated into a syntax error.