Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty String Rules (Still) Fail to Match #112

Closed
nskobelevs opened this issue Nov 14, 2022 · 7 comments · Fixed by #113
Closed

Empty String Rules (Still) Fail to Match #112

nskobelevs opened this issue Nov 14, 2022 · 7 comments · Fixed by #113

Comments

@nskobelevs
Copy link

Describe the bug
Empty string rules still fail to match in some situations. This is the same as #108 although after it's fix it doesn't happen in all situations but still happens.

To Reproduce

  1. Load grammar from tests/fixtures/bnf.bnf

  2. Try to parse the first line of the text:
    "<syntax> ::= <rule> | <rule> <syntax>\n"
    This match will fail.

  3. Try to parse the same line with whitespace before the newline
    "<syntax> ::= <rule> | <rule> <syntax> \n"
    This will parse properly

The first example should pass given <line-end> may begin with <opt-whitespace> which has an empty string rule

<line-end>       ::= <opt-whitespace> <EOL>
<opt-whitespace> ::= "" | " " <opt-whitespace>
<EOL>            ::= "
"

Unfortunately I haven't been able to find a smaller example. Weirdly it does parse "<syntax> ::= <rule>" correctly but not the more complicated example above.

If there support for special characters like \n or a built in definition for <EOL> it could be good to have a test that a grammer made from tests/fixtures/bnf.bnf can parse the file itself.

Desktop (please complete the following information):

  • MacOS Ventura 13.0.1
@shnewto
Copy link
Owner

shnewto commented Nov 14, 2022

ah thanks for raising this issue! I really appreciate the details for reproducing!

I'll take a look 😄

@CrockAgile
Copy link
Collaborator

just to share some progress, I started looking into this today, and was able to reduce the bug test case to:

#[test]
fn parse_nested_empty() {
    let grammar: Grammar = "
    <start> ::= <a> <empty>
    <a> ::= 'a' <empty>
    <empty> ::= ''"
        .parse()
        .unwrap();

    let input = "a";

    let parses = parse(&grammar, input);
    assert_eq!(parses.count(), 1);
}

The bug seems to be caused by incorrectly de-duplicating Earley states. Because "empty" rules do not advance the input cursor, different predictions get treated as "duplicates".

I have a fix working locally, but it basically "solves" the problem by duplicating all parse attempts 😅

Next chance I get, I will brainstorm for a more realistic solution

@amascolo
Copy link
Contributor

Any progress on a fix?

I may be hitting the same issue – here's another bare-bones example you can check against:

#[test]
fn test() {
    let grammar: Grammar = "
    <balanced> ::= <left> <balanced> <right>
                 | ''
    
    <left> ::= <opt-ws> '(' <opt-ws>
    <right> ::= <opt-ws> ')' <opt-ws>
    
    <opt-ws> ::= '' | <ws>
    <ws> ::= ' ' | ' ' <ws>
    "
    .parse()
    .unwrap();

    let input = "()";

    assert!(
        grammar.parse_input(input).next().is_some(),
        "can't parse: {input}"
    );
}

Greatly appreciate everybody's work so far!

Would love to use bnf as I don't see an established Rust library that supports Earley parsing.

However, this issue is a blocker for my use case 🙁

@amascolo
Copy link
Contributor

@CrockAgile in #92 you reference this tutorial, which has a section on empty rules: https://loup-vaillant.fr/tutorials/earley-parsing/empty-rules – maybe there's something helpful there, where it discusses Aycock and Horspool's (2002) paper?

If you get any insights from fixing this issue, it might be worth including the findings in your previous write-up (thanks again for that): https://gist.github.com/CrockAgile/d8cb79607b6c6c49c06a97bd652dc219

Either way, we are in good company – as mentioned by MARPA's author:

Earley's original 1970 algorithm actually had a bug in its handling of nullables. There was an easy fix, but it made the algorithm slower. Since efficiency was already seen as the reason to prefer other parsers, this was a big deal.

@CrockAgile
Copy link
Collaborator

Thank you for another example! I do have a working fix, based on the exact solution you linked. So glad we are on the same page.

But while implementing it, a surprising amount of refactoring was needed (allowing for Earley matches on "nullable productions", not just complete "States", if you are curious 😅 )

That refactor introduced a significant performance regression 😱

But I fixed that last night! So

TLDR: I should have it up for review tomorrow 🤞

@CrockAgile
Copy link
Collaborator

this fix should now be available under v0.4.3

but this was my first time publishing this crate, so let me know if something is broken 😅

@nskobelevs
Copy link
Author

Looks good here - thank you for addressing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants