Redundant redundant escapes #2838

Merged: amaanq merged 1 commit into tree-sitter:master from the redundant-redundant-escapes branch on Mar 10, 2024

CAD97 (Contributor) commented Dec 25, 2023

Fixes #2837; see that issue for further discussion.

The regex-syntax crate natively supports \!, \', \", and \/ identity escapes now, so we don't need to preprocess the regex in order to support those.

Unescaped {} are only permitted in Unicode-unaware JS regexps. That syntax is deprecated, kept only for web compatibility, and should not be relied upon. In Unicode-aware JS regexps (via the u or v flag), unescaped {} are syntax errors. Unicode-aware mode is required for the \u{hhhh} and \p{UnicodeProperty} syntaxes, which Tree-sitter always supports.
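The difference between the two JS modes can be checked directly in any JavaScript engine. The snippet below is only an illustrative sketch of ECMAScript behavior, not of Tree-sitter's own regex handling; `compiles` is a hypothetical helper name introduced here.

```javascript
// Hypothetical helper: reports whether `source` is accepted by the
// JavaScript RegExp constructor under the given flags.
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

// Legacy (Unicode-unaware) mode tolerates a lone brace as a literal.
console.log(compiles("{"));        // true
// Unicode-aware mode rejects it and requires an escape.
console.log(compiles("{", "u"));   // false
console.log(compiles("\\{", "u")); // true

// \u{hhhh} only means "code point" in Unicode-aware mode.
console.log(new RegExp("^\\u{61}$", "u").test("a"));       // true
// Without the u flag it is the letter "u" repeated 61 times.
console.log(new RegExp("^\\u{61}$").test("u".repeat(61))); // true
```

The last two lines show why a flavor that accepts \u{hhhh} cannot also treat bare braces the way legacy JS does without ambiguity.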

The current behavior is a "worst of both worlds" in which Tree-sitter's regex flavor is compatible with neither Rust nor ECMAScript. Removing the special curly brace preprocessing means Tree-sitter's regex flavor matches Rust's again and moves closer to that of ECMAScript in u or v mode. (The remaining divergences, unlike unescaped {}, are much less likely to be hit unintentionally.) Although any previously preprocessed regex changes meaning, the change will always[1] surface as a clear error pointing out the { which now needs to be escaped manually.

Changelog

  • Regexes are now always interpreted by Tree-sitter using Rust-flavored regex syntax. The main impact is that a literal { now needs to be escaped outside of []. If an unescaped { occurs in your grammar, Tree-sitter will report an error indicating the regex that needs to be edited.
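For grammar authors who want to find affected regexes before running the generator, a rough scan like the one below may help. It is a hypothetical sketch and not part of Tree-sitter: the function name `findLiteralBraces` is invented here, and it assumes simple patterns (no nested character classes, no verbose mode). It skips braces that are escaped, inside [], part of a counted repetition such as {2,3}, or attached to an escape such as \u{...}, \p{...}, \x{...}, or \b{...}.

```javascript
// Hypothetical migration helper, not a Tree-sitter API.
// Returns the indices of `{` characters in a regex source string that
// would need a manual backslash under the rules described above.
function findLiteralBraces(source) {
  const hits = [];
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") {
      const next = source[i + 1];
      i++; // skip the escaped character itself
      // Escapes that take a braced argument: skip through the closing brace.
      if ("upPxb".includes(next) && source[i + 1] === "{") {
        const close = source.indexOf("}", i + 1);
        if (close !== -1) i = close;
      }
      continue;
    }
    if (inClass) {
      if (ch === "]") inClass = false;
      continue;
    }
    if (ch === "[") {
      inClass = true;
      continue;
    }
    if (ch === "{") {
      // Counted repetition: {n}, {n,} or {n,m}.
      const counted = /^\{\d+(,\d*)?\}/.exec(source.slice(i));
      if (counted) {
        i += counted[0].length - 1;
      } else {
        hits.push(i);
      }
    }
  }
  return hits;
}

// A block-like pattern written with bare braces, then with escapes.
console.log(findLiteralBraces("{[^}]*}"));      // [ 0 ]
console.log(findLiteralBraces("\\{[^}]*\\}")); // []
```

In a grammar, the fix for a flagged pattern is to write the brace as `\{`, for example `/\{[^}]*\}/` instead of `/{[^}]*}/`.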

Footnotes

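  1. Well, almost always. In the specific case of \b{start}, \b{end}, \b{start-half}, or \b{end-half}, the regex as written (before preprocessing) is already valid Rust syntax, where these are word-boundary assertions, so it takes on a different meaning instead of producing an error.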

amaanq (Member) commented Mar 10, 2024

Yeah, this makes a lot of sense - thanks a lot for the detailed issue and PR description, as well as for fixing this! I'll rebase this for you & merge.

We could maybe point this out in the docs as well.

amaanq force-pushed the redundant-redundant-escapes branch 2 times, most recently from 14bbf68 to c1fcf5e (March 10, 2024 16:59)
…ssing

The regex-syntax crate now natively supports literal escapes for all
ASCII characters except those in [0-9A-Za-z<>].
amaanq force-pushed the redundant-redundant-escapes branch from c1fcf5e to 37da0b5 (March 10, 2024 17:13)
amaanq merged commit 78cc77e into tree-sitter:master on Mar 10, 2024
13 checks passed
CAD97 deleted the redundant-redundant-escapes branch (March 11, 2024 17:49)

CAD97 (Contributor, Author) commented Mar 11, 2024

And thus, my pet peeve from five years ago is resolved 🎉

I still have borderline impossible dreams for a modern parser generator[1], but with this rough edge sanded down I have no qualms about suggesting/using tree-sitter anymore. Maybe it's finally time to revisit my toy language...

Footnotes

  1. (Not my post.) The main thing I want that tree-sitter doesn't have is a separately phased lexer instead of a lexerless design. However, I do know lexerless is in vogue and handles token splitting/gluing (e.g. >> vs > >) more directly, so I'll concede that. Other than that, it'd mostly be resolvable with tooling on top of the existing corpus testing support.

godrja added a commit to godrja/tree-sitter-nu that referenced this pull request Apr 9, 2024
The problem was caused by the change in curly brace parsing in
tree-sitter 0.22.0

See tree-sitter/tree-sitter#2838
Development

Successfully merging this pull request may close these issues.

Regex preprocessing doesn't match any regex flavor