Regex preprocessing doesn't match any regex flavor #2837

CAD97 · 2023-12-25T20:45:19Z

Tree-sitter currently applies two preprocessing steps to every regex:

- replace all (^|[^\\pP])\{([^}]*[^0-9A-Fa-f,}][^}]*)\} with $1\{$2\}¹, and
- remove redundant escapes for !'"/².
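
To make the effect of these two steps concrete, here is a rough Python emulation. It is only a sketch: the curly brace pattern is copied from the text above, but the pattern for the redundant escapes, the `preprocess` name, and the order of the two steps are my assumptions, not Tree-sitter's actual code.

```python
import re

# Step 1 (pattern copied from the issue text): escape the braces of any {...}
# whose contents include a character other than a hex digit or a comma,
# unless the brace directly follows a backslash, `p`, or `P`.
CURLY_BRACE = re.compile(r"(^|[^\\pP])\{([^}]*[^0-9A-Fa-f,}][^}]*)\}")

# Step 2 (assumed pattern): drop the backslash in front of ! ' " /
REDUNDANT_ESCAPE = re.compile(r"\\([!'\"/])")


def preprocess(pattern: str) -> str:
    """Hypothetical emulation of the two preprocessing steps."""
    pattern = CURLY_BRACE.sub(r"\1\\{\2\\}", pattern)
    return REDUNDANT_ESCAPE.sub(r"\1", pattern)


if __name__ == "__main__":
    samples = [r"a{2,3}", r"\p{Greek}", r"\u{100F}", r"{foo}", r"\b{start}", r"\/"]
    for src in samples:
        print(f"{src:>10}  ->  {preprocess(src)}")
```

Quantifiers, `\p{...}`, and hex-only `\u{...}` pass through unchanged, while `{foo}` and `\b{start}` both come out with escaped braces. The latter is the mangling of Rust's word boundary assertions described below.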

This doesn't actually implement any regex flavor. Worse, for each well-known flavor there exists a regex which is valid both in tree-sitter and in that flavor, but which the two interpret differently.

- It obviously diverges from Rust flavored regex, since we're changing the regex before interpreting it as Rust flavored regex.
  - It doesn't even maintain that all valid Rust regexes are interpreted correctly, since Rust supports extended word boundary assertions which get mangled (\b{start}, \b{end}, \b{start-half}, \b{end-half}).
- It still doesn't match ECMAScript flavored regex:
  - \0 is not recognized as meaning \u{0}.
  - It diverges in []] being interpreted as \] (ECMAScript flavor) versus [\]] (Rust flavor).
  - It diverges from Unicode-unaware mode, which doesn't have escapes \a, \A, \z, \<, \>, \x{}, \u{}, \U{}, \p, or \P, thus treats them as literal escapes.
    - Tree-sitter also still treats literal {} and unclosed { as an error.
    - Unicode-aware mode (the u or v flag) turns the deprecated literal {} syntax which is being emulated into a syntax error.
  - It diverges from u mode, which doesn't have character class set operations, thus making [a-y&&xyz] mean [a-z] (instead of [xy]).
  - It diverges from v mode, which doesn't have POSIX style character classes, thus making [[:alpha:]] mean [:alph] (instead of [A-Za-z]).
- It diverges from PCRE, which doesn't have \<, \>, or character class set operations.
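
The [a-y&&xyz] divergence can be spelled out with plain set arithmetic. This is an illustration only: it models the two readings as Python sets and does not run any regex engine, and the variable names are mine.

```python
import string

a_to_y = set(string.ascii_lowercase) - {"z"}  # the range a-y
xyz = set("xyz")

# Reading with set operations: `&&` intersects its two operands.
intersection_reading = a_to_y & xyz

# Reading without set operations: each `&` is just a literal member of the
# class, so the class is the union of everything written inside it.
literal_reading = a_to_y | {"&"} | xyz

print(sorted(intersection_reading))
print("z" in literal_reading, "&" in literal_reading)
```

Under the literal reading the class covers all of a-z (and, as far as I can tell, `&` itself), whereas the intersection reading leaves only x and y.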

Suggested resolution

Just stop preprocessing the regex. There are two possible justifications for removing the curly brace preprocessing:

- State that tree-sitter uses Rust flavored regex. It's even a flavor option on regex101 nowadays.
- State that tree-sitter always applies the u or v flag to the ECMAScript flavored regex.

Reiterating: the regex preprocessing

- makes tree-sitter use an ad-hoc regex flavor
- in order to emulate an explicitly deprecated behavior of ECMAScript regex syntax
- and it doesn't even match ECMAScript behavior.

Given that Tree-sitter deliberately supports \u{hhhh} escapes³ despite being a .js syntax, which requires the use of the u or v flag, it imho just makes sense to stop ~~mangling~~ preprocessing the regex to support syntax which is an error given the use of the u or v flag. And especially now that Rust is a known flavor on regex101, it's much more viable to just say that's what Tree-sitter is using.
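
For a feel of why the braced \u escape and the deprecated literal {} syntax cannot coexist, here are the two readings of \u{100} side by side. Python's `re` module does not accept the braced form, so each reading is written out in the spelling Python understands. That substitution is mine and only approximates the ECMAScript and Rust behavior.

```python
import re

# Unicode-aware reading (Rust flavor, ECMAScript u/v flag):
# \u{100} names the single code point U+0100.
unicode_aware = re.compile(r"\u0100")

# Deprecated Unicode-unaware ECMAScript reading:
# \u is a literal "u" and {100} is a repetition quantifier.
unicode_unaware = re.compile(r"u{100}")

print(bool(unicode_aware.fullmatch("\u0100")))
print(bool(unicode_unaware.fullmatch("u" * 100)))
print(bool(unicode_aware.fullmatch("u" * 100)))
```

The same source text matches either one character or a run of one hundred, depending on the mode.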

¹ This escapes the curly braces of any {} for which the contents do not look like a character class, repetition quantifier, or \p.
² This preprocessing is redundant now, as the Rust regex-syntax crate now always permits escaping any ASCII character except [0-9A-Za-z<>] (the escape then matches the literal character).
³ If it didn't, and wanted to match ECMAScript behavior more closely, the curly brace preprocessing would probably match [^[:digit:],}] instead of [^[:alnum:],}], to at least match the behavior of \u{100F}, although \u{100} would still be interpreted as \u0100 (Rust flavor interpretation) instead of u{100} (deprecated Unicode-unaware ECMAScript flavor interpretation).


Fixes tree-sitter#2837; see that issue for details. In short: the emulated syntax behavior of JS regexp is deprecated and only kept for web compatibility, and you should not rely on it. Removing the preprocessing makes behavior more predictable, and even more consistent with Unicode-aware JS regexps, which are required to use syntax that Tree-sitter has always supported (braced \u and \p).

CAD97 mentioned this issue Dec 25, 2023

Redundant redundant escapes #2838

Merged

amaanq mentioned this issue Feb 2, 2024

Tree-Sitter Roadmap #2895

Closed

33 tasks

dundargoc added this to the 0.21.0 milestone Feb 6, 2024

amaanq closed this as completed in #2838 Mar 10, 2024
