`repeated` lets keywords recognize as idents instead (priorization of `or` breaks?) #196

Philipp-M · 2022-08-22T18:38:34Z

Ok I'm not sure what is going wrong here, and/or if this is a bug, I'm trying to port https://github.com/kritzcreek/fby19 into Rust to learn type inference.

This is where the issue appears: https://github.com/Philipp-M/simple-hindley-milner-typeinference/blob/0c49d2c289efb498d23eacd81acae543a7aa4a97/src/parser.rs#L40

Apparently the repeated rule (Expr::App) lets the keywords let parse into a Expr:Var instead of applying the let rule (it fails?).
Output of the relevant test:

---- parser::parses_let_in stdout ----
thread 'parser::parses_let_in' panicked at 'assertion failed: `(left == right)`
  left: `Ok(App(Var("let"), Var("hello")))`,
 right: `Ok(Let("hello", Lit(Bool(true)), Lit(Int(123))))`', src/parser.rs:70:5

If I return atom directly (uncomment line 37), let_in parses correctly.

As far as I understood, the ors for the atom rule are tried one after the other, so let_in should be prioritized vs Expr::Var?

(Can be tested with cargo test)

The text was updated successfully, but these errors were encountered:

zesterer · 2022-08-22T19:14:08Z

Chumsky, as with all PEG parsers, deals with ambiguity by simply using the first correctly parsing pattern.

If you want to ensure that let does not parse as an Expr::Var, you will need to attempt the pattern you would prefer it to parse as first (i.e: put it higher up in the or/choice chain).

This is discussed in a little more depth towards the end of the 'parsing lets' section in chumsky's tutorial, you might find it a useful reference.

Alternatively, a solution that circumvents this problem is entirely is to separate lexing and parsing into distinct phases (for an example of this, see the nano_rust example in the repository).

Philipp-M · 2022-08-22T20:17:53Z

Thanks for the answer.

If you want to ensure that let does not parse as an Expr::Var, you will need to attempt the pattern you would prefer it to parse as first (i.e: put it higher up in the or/choice chain).

Yeah that's what I have thought of, and how I've done as well (at least as far as I understood it), to sum it up quickly (also minimal reproducable form):

let ident = text::ident().padded();
recursive(|expr| {
    // let <ident> = <expr> in <expr>
    let let_in = text::keyword("let")/* ... shortened for readability */;
    let var = ident.map(Expr::Var);
    // some other rules...
    let atom = let_in.or(var);
    // <expr> <expr>
    let app = atom.repeated()
        .at_least(1)
        .map(|v| v.into_iter().reduce(|e1, e2| Expr::App(e1.into(), e2.into())).unwrap());
    app
})

This should first try the let_in rule as far as I understood, and then if it fails try the var rule.
As said before the atom rule itself works, but as soon as the repeated method is used, it for some reason recognizes the expression as app rule with let as var. But AFAIK before the next "repeat" in app occurs, let_in should be tried first?
Does repeated not first go in order into each of the or'ed atom rules, and use the shortest (with the least parser rules?) parsed rule instead (i.e. var)?
Unfortunately the "parsing let's" section didn't help me, because let ... in can occur in any sub expression (and thus needs to be in the first recursive?).

I'd like to avoid an extra lexer, to keep this simple, but I think I'm trying this now anyway.

zesterer · 2022-08-22T20:24:37Z

This sounds to me like that let_in rule itself isn't quite behaving as you'd expect, possibly producing an error (leading chumsky to fall back on the var rule). You might want to try temporarily removing the var rule to see what the error is.

As an aside, rather than doing

let app = atom.repeated()
    .at_least(1)
    .map(|v| v.into_iter().reduce(|e1, e2| Expr::App(e1.into(), e2.into())).unwrap());

I'd recommend instead doing

let app = atom.then(atom.repeated())
    .foldl(|e1, e2| Expr::App(e1.into(), e2.into()));

It does the same thing but is quite a bit easier to read and avoids the .unwrap().

Philipp-M · 2022-08-22T20:31:04Z

Thanks for the tip, that certainly is more readable.

As I said the let_in rule is working in itself (as long as repeated is not introduced)
It's defined as following:

// let <ident> = <expr> in <expr>
let let_in = text::keyword("let")
    .padded()
    .ignore_then(ident)
    .then_ignore(just('='))
    .then(expr.clone())
    .then_ignore(text::keyword("in").padded())
    .then(expr.clone())
    .map(|((name, let_body), in_body)| {
        Expr::Let(name, Box::new(let_body), Box::new(in_body))
    });

zesterer · 2022-08-22T20:38:26Z

That's definitely strange. Do you have a minimal example you can send that reproduces this?

or patterns will always try the first pattern, only attempting the second if the first generated errors.

Philipp-M · 2022-08-22T20:58:52Z

I've just tested it with the int rule (without var), then it works as expected.
I've reduced the code in the initially linked repo to a minimal example:
https://github.com/Philipp-M/chumsky_issue_repro

zesterer · 2022-08-22T22:09:27Z

Thanks, I'll try to find the time to look at this tomorrow.

zesterer · 2022-08-26T18:19:26Z

Sorry is took me so long to get to this. The parses_let_in in your example code looks like it's having problems because the in gets interpreted as just another var. This is a good demonstration of why writing a larser (i.e: a lexer + parser in one) is a bad idea and can lead to subtle ambiguity (a handle-written recursive descent larser would also have failed to parse this correctly too). I'd recommend taking the approach of the nano_rust example, parsing the input into tokens before attempting to parse structure.

Philipp-M · 2022-08-26T20:03:03Z

No problem, thanks for investigating, now that you say it, it definitely makes sense.
As you might have seen, I've already used a lexer in the original repo (which is working great btw.), and now that I've run into it I definitely agree, it's easier to run into subtle ambiguities.
I guess if this should work without an extra lexer the ident rule has to abort if the ident is a keyword with something like you suggested here: #181 (comment)

Like this:

    let let_kw = text::keyword("let").padded();
    let in_kw = text::keyword("in").padded();
    let ident = text::ident().padded().none_of([let_kw, in_kw]);

zesterer · 2022-08-26T22:56:20Z

Yep, larsers end up requiring a lot of annoying hacks like that to work reliably. Glad to hear that the problem is fixed!

Philipp-M added a commit to Philipp-M/simple-hindley-milner-typeinference that referenced this issue Aug 22, 2022

Finished parser by using a lexer, to workaround zesterer/chumsky#196

62158cb

Philipp-M closed this as completed Aug 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`repeated` lets keywords recognize as idents instead (priorization of `or` breaks?) #196

`repeated` lets keywords recognize as idents instead (priorization of `or` breaks?) #196

Philipp-M commented Aug 22, 2022

zesterer commented Aug 22, 2022

Philipp-M commented Aug 22, 2022

zesterer commented Aug 22, 2022 •

edited

Philipp-M commented Aug 22, 2022

zesterer commented Aug 22, 2022

Philipp-M commented Aug 22, 2022

zesterer commented Aug 22, 2022

zesterer commented Aug 26, 2022 •

edited

Philipp-M commented Aug 26, 2022

zesterer commented Aug 26, 2022

repeated lets keywords recognize as idents instead (priorization of or breaks?) #196

repeated lets keywords recognize as idents instead (priorization of or breaks?) #196

Comments

Philipp-M commented Aug 22, 2022

zesterer commented Aug 22, 2022

Philipp-M commented Aug 22, 2022

zesterer commented Aug 22, 2022 • edited

Philipp-M commented Aug 22, 2022

zesterer commented Aug 22, 2022

Philipp-M commented Aug 22, 2022

zesterer commented Aug 22, 2022

zesterer commented Aug 26, 2022 • edited

Philipp-M commented Aug 26, 2022

zesterer commented Aug 26, 2022

`repeated` lets keywords recognize as idents instead (priorization of `or` breaks?) #196

`repeated` lets keywords recognize as idents instead (priorization of `or` breaks?) #196

zesterer commented Aug 22, 2022 •

edited

zesterer commented Aug 26, 2022 •

edited