Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repeated lets keywords recognize as idents instead (priorization of or breaks?) #196

Closed
Philipp-M opened this issue Aug 22, 2022 · 10 comments

Comments

@Philipp-M
Copy link

Ok I'm not sure what is going wrong here, and/or if this is a bug, I'm trying to port https://github.com/kritzcreek/fby19 into Rust to learn type inference.

This is where the issue appears: https://github.com/Philipp-M/simple-hindley-milner-typeinference/blob/0c49d2c289efb498d23eacd81acae543a7aa4a97/src/parser.rs#L40

Apparently the repeated rule (Expr::App) lets the keywords let parse into a Expr:Var instead of applying the let rule (it fails?).
Output of the relevant test:

---- parser::parses_let_in stdout ----
thread 'parser::parses_let_in' panicked at 'assertion failed: `(left == right)`
  left: `Ok(App(Var("let"), Var("hello")))`,
 right: `Ok(Let("hello", Lit(Bool(true)), Lit(Int(123))))`', src/parser.rs:70:5

If I return atom directly (uncomment line 37), let_in parses correctly.

As far as I understood, the ors for the atom rule are tried one after the other, so let_in should be prioritized vs Expr::Var?

(Can be tested with cargo test)

@zesterer
Copy link
Owner

Chumsky, as with all PEG parsers, deals with ambiguity by simply using the first correctly parsing pattern.

If you want to ensure that let does not parse as an Expr::Var, you will need to attempt the pattern you would prefer it to parse as first (i.e: put it higher up in the or/choice chain).

This is discussed in a little more depth towards the end of the 'parsing lets' section in chumsky's tutorial, you might find it a useful reference.

Alternatively, a solution that circumvents this problem is entirely is to separate lexing and parsing into distinct phases (for an example of this, see the nano_rust example in the repository).

@Philipp-M
Copy link
Author

Thanks for the answer.

If you want to ensure that let does not parse as an Expr::Var, you will need to attempt the pattern you would prefer it to parse as first (i.e: put it higher up in the or/choice chain).

Yeah that's what I have thought of, and how I've done as well (at least as far as I understood it), to sum it up quickly (also minimal reproducable form):

let ident = text::ident().padded();
recursive(|expr| {
    // let <ident> = <expr> in <expr>
    let let_in = text::keyword("let")/* ... shortened for readability */;
    let var = ident.map(Expr::Var);
    // some other rules...
    let atom = let_in.or(var);
    // <expr> <expr>
    let app = atom.repeated()
        .at_least(1)
        .map(|v| v.into_iter().reduce(|e1, e2| Expr::App(e1.into(), e2.into())).unwrap());
    app
})

This should first try the let_in rule as far as I understood, and then if it fails try the var rule.
As said before the atom rule itself works, but as soon as the repeated method is used, it for some reason recognizes the expression as app rule with let as var. But AFAIK before the next "repeat" in app occurs, let_in should be tried first?
Does repeated not first go in order into each of the or'ed atom rules, and use the shortest (with the least parser rules?) parsed rule instead (i.e. var)?
Unfortunately the "parsing let's" section didn't help me, because let ... in can occur in any sub expression (and thus needs to be in the first recursive?).

I'd like to avoid an extra lexer, to keep this simple, but I think I'm trying this now anyway.

@zesterer
Copy link
Owner

zesterer commented Aug 22, 2022

This sounds to me like that let_in rule itself isn't quite behaving as you'd expect, possibly producing an error (leading chumsky to fall back on the var rule). You might want to try temporarily removing the var rule to see what the error is.

As an aside, rather than doing

let app = atom.repeated()
    .at_least(1)
    .map(|v| v.into_iter().reduce(|e1, e2| Expr::App(e1.into(), e2.into())).unwrap());

I'd recommend instead doing

let app = atom.then(atom.repeated())
    .foldl(|e1, e2| Expr::App(e1.into(), e2.into()));

It does the same thing but is quite a bit easier to read and avoids the .unwrap().

@Philipp-M
Copy link
Author

Thanks for the tip, that certainly is more readable.

As I said the let_in rule is working in itself (as long as repeated is not introduced)
It's defined as following:

// let <ident> = <expr> in <expr>
let let_in = text::keyword("let")
    .padded()
    .ignore_then(ident)
    .then_ignore(just('='))
    .then(expr.clone())
    .then_ignore(text::keyword("in").padded())
    .then(expr.clone())
    .map(|((name, let_body), in_body)| {
        Expr::Let(name, Box::new(let_body), Box::new(in_body))
    });

@zesterer
Copy link
Owner

That's definitely strange. Do you have a minimal example you can send that reproduces this?

or patterns will always try the first pattern, only attempting the second if the first generated errors.

@Philipp-M
Copy link
Author

I've just tested it with the int rule (without var), then it works as expected.
I've reduced the code in the initially linked repo to a minimal example:
https://github.com/Philipp-M/chumsky_issue_repro

Philipp-M added a commit to Philipp-M/simple-hindley-milner-typeinference that referenced this issue Aug 22, 2022
@zesterer
Copy link
Owner

Thanks, I'll try to find the time to look at this tomorrow.

@zesterer
Copy link
Owner

zesterer commented Aug 26, 2022

Sorry is took me so long to get to this. The parses_let_in in your example code looks like it's having problems because the in gets interpreted as just another var. This is a good demonstration of why writing a larser (i.e: a lexer + parser in one) is a bad idea and can lead to subtle ambiguity (a handle-written recursive descent larser would also have failed to parse this correctly too). I'd recommend taking the approach of the nano_rust example, parsing the input into tokens before attempting to parse structure.

@Philipp-M
Copy link
Author

No problem, thanks for investigating, now that you say it, it definitely makes sense.
As you might have seen, I've already used a lexer in the original repo (which is working great btw.), and now that I've run into it I definitely agree, it's easier to run into subtle ambiguities.
I guess if this should work without an extra lexer the ident rule has to abort if the ident is a keyword with something like you suggested here: #181 (comment)

Like this:

    let let_kw = text::keyword("let").padded();
    let in_kw = text::keyword("in").padded();
    let ident = text::ident().padded().none_of([let_kw, in_kw]);

@zesterer
Copy link
Owner

Yep, larsers end up requiring a lot of annoying hacks like that to work reliably. Glad to hear that the problem is fixed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants