Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

\p\ interacts poorly with error reporting #594

Closed
CAD97 opened this issue Jul 8, 2019 · 1 comment
Closed

\p\ interacts poorly with error reporting #594

CAD97 opened this issue Jul 8, 2019 · 1 comment
Labels

Comments

@CAD97
Copy link
Contributor

CAD97 commented Jul 8, 2019

This is best explained with examples:

// this one is good
regex parse error:
    \p\
    ^^^
error: Unicode property not found

// but putting things after the "escape" behaves strangely
regex parse error:
    \p\{
       ^
error: unclosed counted repetition

regex parse error:
    \p\{\}
        ^
error: decimal literal empty

// yet this one works
regex parse error:
    \p\+
    ^^^
error: Unicode property not found

It seems that \p\ is treated as "whole" for the bracket-based followers, and that error preempts the prior error about the missing Unicode property.

Why I noticed this

Trying out tree-sitter, which uses regex-syntax for parsing regex terminals but for some reason preprocesses curly brackets in regex. I wrote the regex \p{XID_Start} which it translated into \p\{XID_Start\} which then gave me the very confusing error of

>tree-sitter generate
Error processing rule identifier
Details:
  regex parse error:
    \p\{XID_Start\}
        ^
error: repetition quantifier expects a valid decimal
@BurntSushi
Copy link
Member

Nice find. Yeah, this is indeed quite odd. It looks like the problem is here:

let start = self.pos();
let c = self.char();
self.bump_and_bump_space();
let kind = ast::ClassUnicodeKind::OneLetter(c);
(start, kind)

Which is triggered when a \p (or \P) is followed by anything other than a {. Which means it gets triggered for \p\{ since \ is not a {. This particular code path is for handling single letter Unicode classes, e.g., \pL is \p{Letter}. The reason this gets fouled up is because the parser isn't responsible for checking the validity of the Unicode class given. e.g., The parser will happily transform \p{fubar} into a valid AST. Checking the validity of the class name is the responsibility of the translator (AST -> HIR).

I think it is true that a \p followed by a \ is never correct, so I think the simple solution here is to just return an error in the parser for that specific case. (This might require a new error variant, which is fine.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants