Regex preprocessing doesn't match any regex flavor #2837

Closed
Tracked by #2895
CAD97 opened this issue Dec 25, 2023 · 0 comments · Fixed by #2838

CAD97 commented Dec 25, 2023

Tree-sitter currently does two preprocessing steps to regex:

  • replace all (^|[^\\pP])\{([^}]*[^0-9A-Fa-f,}][^}]*)\} with $1\{$2\} [1], and
  • remove redundant escapes for !'"/ [2].
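
To make the two steps concrete, here is a minimal sketch that re-creates them with Python's `re` module. It is not tree-sitter's actual code: the names `CURLY_BRACE`, `REDUNDANT_ESCAPES`, and `preprocess` are hypothetical, and Python is used only to apply the textual substitution (Python's own regex flavor is yet another dialect). The pattern itself is the one quoted above.

```python
import re

# The curly-brace pattern quoted in the issue, written as a Python raw string.
CURLY_BRACE = re.compile(r"(^|[^\\pP])\{([^}]*[^0-9A-Fa-f,}][^}]*)\}")

# Characters whose escapes the second step strips, per the issue text.
REDUNDANT_ESCAPES = set("!'\"/")


def preprocess(pattern: str) -> str:
    """Hypothetical re-creation of the two preprocessing steps."""
    # Step 1: escape braces whose contents include a character that is
    # neither a hex digit nor a comma.
    escaped = CURLY_BRACE.sub(
        lambda m: m.group(1) + r"\{" + m.group(2) + r"\}", pattern
    )

    # Step 2: drop the backslash in front of ! ' " / and keep every
    # other escape sequence untouched.
    out = []
    chars = iter(escaped)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, None)
        if nxt is None:
            out.append("\\")  # trailing lone backslash, kept as-is
        elif nxt in REDUNDANT_ESCAPES:
            out.append(nxt)
        else:
            out.append("\\" + nxt)
    return "".join(out)


print(preprocess(r"a{2,3}"))     # quantifier: left alone
print(preprocess(r"\p{Letter}"))  # \p class: left alone
print(preprocess(r"\b{start}"))   # word-boundary assertion: braces get escaped
```

Under this sketch, `\b{start}` comes out as `\b\{start\}`, which is the mangling of Rust's extended word boundary assertions described below.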

The result doesn't actually implement any existing regex flavor. Worse, for each well-known flavor there exists a regex that is valid both in tree-sitter and in that flavor but is interpreted differently by the two.

  • It obviously diverges from Rust flavored regex, since we're changing the regex before interpreting it as Rust flavored regex.
    • It doesn't even maintain that all valid Rust regexes are interpreted correctly, since Rust supports extended word boundary assertions which get mangled (\b{start}, \b{end}, \b{start-half}, \b{end-half}).
  • It still doesn't match ECMAScript flavored regex:
    • \0 is not recognized as meaning \u{0}.
    • It diverges in []] being interpreted as \] (ECMAScript flavor) versus [\]] (Rust flavor).
    • It diverges from Unicode-unaware mode, which doesn't have escapes \a, \A, \z, \<, \>, \x{}, \u{}, \U{}, \p, or \P, thus treats them as literal escapes.
    • It diverges from u mode, which doesn't have character class set operations, thus making [a-y&&xyz] mean [a-z&] (instead of [xy]).
    • It diverges from v mode, which doesn't have posix style character classes, thus making [[:alpha:]] mean [:alph] (instead of [A-Za-z]).
  • It diverges from PCRE, which doesn't have \<, \>, or character class set operations.
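
The two character-class divergences in the list can be worked out by hand. The sketch below does not invoke any regex engine; it only models each reading as a plain Python set, and all variable names are made up for illustration. Note that under the literal reading of `[a-y&&xyz]`, the `&` character itself is also a member of the class.

```python
import string

LOWER = set(string.ascii_lowercase)
LETTERS = set(string.ascii_letters)

# [a-y&&xyz]
a_to_y = LOWER - {"z"}
# Reading with set operations (Rust flavor): intersect a-y with xyz.
intersection_reading = a_to_y & set("xyz")
# Reading without set operations (ECMAScript u mode): every character
# after the range is just another literal member of the class.
literal_reading = a_to_y | set("&&xyz")

# [[:alpha:]]
# Reading with POSIX-style classes (Rust flavor): the ASCII letters.
posix_reading = LETTERS
# Reading with nested classes only (ECMAScript v mode): the inner
# class [:alpha:] is a plain list of characters.
nested_reading = set(":alpha:")

print(sorted(intersection_reading))
print("".join(sorted(literal_reading)))
print("".join(sorted(nested_reading)))
```

The same pattern text therefore denotes a two-element set under one flavor and nearly the whole lowercase alphabet under another, without either flavor reporting an error.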

Suggested resolution

Just stop preprocessing the regex. There are two possible justifications available for removing the curly brace preprocessing:

  • State that tree-sitter uses Rust flavored regular expressions. It's even a flavor option on regex101 nowadays.
  • State that tree-sitter always applies the u or v flag to the ECMAScript flavored regex.

Reiterating: the regex preprocessing

  • makes tree-sitter use an ad-hoc regex flavor
  • in order to emulate an explicitly deprecated behavior of ECMAScript regex syntax
  • and it doesn't even match ECMAScript behavior.

Given that Tree-sitter deliberately supports \u{hhhh} escapes [3] despite being a .js syntax, which requires the use of the u or v flag, it imho just makes sense to stop preprocessing the regex to support syntax which is an error given the use of the u or v flag. And especially now that Rust is a known flavor on regex101, it's much more of an option to just say that's what Tree-sitter is using.

Footnotes

  1. This escapes the curly braces of any {} for which the contents do not look like a character class, repetition quantifier, or \p.

  2. This preprocessing is now redundant, as the escape sequences in Rust's regex-syntax always permit escaping all ASCII except [0-9A-Za-z<>] (meaning to match the literal ASCII character).

  3. If it didn't, and wanted to match ECMAScript behavior more closely, the curly brace preprocessing would probably match [^[:digit:],}] instead of [^[:xdigit:],}], to at least match the behavior of \u{100F}, although \u{100} would still be interpreted as \u0100 (Rust flavor interpretation) instead of u{100} (deprecated Unicode-unaware ECMAScript flavor interpretation).
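
The difference described in footnote 3 can be sketched by running both variants of the curly-brace substitution side by side. This is only an illustration using Python's `re` for the textual rewrite: `HEX_VARIANT` is the pattern quoted in the issue, while `DIGIT_VARIANT` and `escape_braces` are hypothetical names for the digits-only alternative the footnote speculates about.

```python
import re

# Pattern from the issue: braces are left alone when their contents
# consist only of hex digits and commas.
HEX_VARIANT = re.compile(r"(^|[^\\pP])\{([^}]*[^0-9A-Fa-f,}][^}]*)\}")

# Hypothetical alternative from the footnote: only decimal digits and
# commas count as "looks like a quantifier".
DIGIT_VARIANT = re.compile(r"(^|[^\\pP])\{([^}]*[^0-9,}][^}]*)\}")


def escape_braces(regex: re.Pattern, pattern: str) -> str:
    """Apply one variant of the brace-escaping substitution."""
    return regex.sub(
        lambda m: m.group(1) + r"\{" + m.group(2) + r"\}", pattern
    )


for sample in (r"\u{100F}", r"\u{100}"):
    print(sample,
          escape_braces(HEX_VARIANT, sample),
          escape_braces(DIGIT_VARIANT, sample))
```

With the digits-only variant, `\u{100F}` has its braces escaped, while `\u{100}` is left untouched by both variants, so that case would still be read as a code point escape by the Rust flavor.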

CAD97 added a commit to CAD97/tree-sitter that referenced this issue Dec 25, 2023
Fixes tree-sitter#2837; see that issue for details.
In short: the emulated syntax behavior of JS regexp is deprecated
and only kept for web compatibility, and you should not rely on it.
Removing the preprocessing makes behavior more predictable, and even
more consistent with Unicode-aware JS regexp, which are required to
use syntax which Tree-sitter has always supported (braced \u and \p).
amaanq mentioned this issue Feb 2, 2024
dundargoc added this to the 0.21.0 milestone Feb 6, 2024