Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Traversal Trees #120

Merged
merged 31 commits into from
Feb 11, 2023
Merged

Traversal Trees #120

merged 31 commits into from
Feb 11, 2023

Conversation

CrockAgile
Copy link
Collaborator

Closes #117
Closes #118 (hopefully! 🤞)
Closes #119
Closes #115

Try Again!

This PR is the result of iterating on #119. #119 attempted to resolve the same issue, but @amascolo generously raised examples that still failed.

This PR (maybe!) resolves these additional cases, which are now also included as tests.

I will include the identical "root cause" section here as #119 so that this PR may be readable on its own.

Root Cause

Consider the grammar:

<start> ::= <shortfail> | <longsuccess>
<shortfail> ::= <char> 'never'
<char> ::= 'a'
<longsuccess> ::= <long2>
<long2> ::= <long3>
<long3> ::= <long4>
<long4> ::= <char>

When parsing input "a", there are two routes: "shortfail" and "longsuccess". As the names suggest, "shortfail" requires fewer traversals, but always fails. "longsuccess" requires more traversal steps, and should eventually succeed.

The issue is caused because both paths predict the non-terminal "char". Practical Earley parsing requires de-duplicating predictions or else recursive grammars fall into infinite loops. The existing Earley implementation in BNF does roughly the following:

(work roughly alternates between the short and long routes because new traversals are appended to the end of the work queue)

* <start> ::= • <shortfail>
* <start> ::= • <longsuccess>
* <shortfail> ::= • <char> 'never'
* <longsuccess> ::= • <long2>
* <char> ::= • 'a'
* <long2> ::= • <long3>
* <char> ::= 'a' •
* <long3> ::= • <long4>
* <shortfail> ::= <char> • 'never' // <--- notice this does NOT succeed
* <long4> ::= • <char> // !!! this <char> prediction is IGNORED because an identical prediction was already made

All the <longN> productions are necessary because otherwise the <longsuccess> route is able to predict before its completion.

I am sorry if I have not explained this super clearly. It is a tricky problem! Funny aside, I actually thought of this bug while developing v0.4.0 . But I (wrongly) assumed there was something about the Earley algorithm which resolved it. I attempted to write a test, but I did not realize it would require so many intermediate non-terminals to expose. Woops!

Existing Issues

Where this PR differs from #119 is its approach to "duplicate detection". Previously, duplicate Earley states/traversals were identified by which Production and how many Terms had been matched. This turns out to be insufficient, because partially "matched" Productions (i.e. <shortfail> ::= <char> • 'never') could have matched non-terminals via different paths.

New Duplicate Detection

Traversals now match/complete terms by building trees 🌳

A new prediction is the root trunk of a tree, and each matched/completed term adds a new branch. Assuming there are two different traversals which can complete base, a traversal tree segment may look like:

flowchart TD
0["dna ::= • base dna"] --> 1["dna ::= base=1 • dna"]
0["dna ::= • base dna"] --> 2["dna ::= base=2 • dna"]

Performance

On my machine, there was seemingly no performance cost to these changes. I believe there was a cost to adding the new logic for "prior completed" traversals. But that cost was offset by the improvement of traversal trees, instead of reference counted term matching vectors.

Extra

Basically every time I have worked on an Earley bug, I have ended up adding the same manual logging to help with debugging. I decided to commit that logging this time!

There is a new tracing::event! which by default is a noop, but with the "tracing" feature enabled adds logging events.

For Earley, traversal state events are logged when created and during predict/scan/complete. It helps quite a bit with debugging!

@CrockAgile CrockAgile self-assigned this Feb 1, 2023
@CrockAgile
Copy link
Collaborator Author

I again plan to leave this PR open to give some time for review. Also I still have to read it all over myself with fresh eyes, and add a lot of comments 😅

@coveralls
Copy link

coveralls commented Feb 1, 2023

Coverage Status

Coverage: 91.063% (-1.2%) from 92.231% when pulling baf5a09 on right-recursive-failure-new into 5a51585 on main.

src/grammar.rs Outdated
bnf.parse::<Grammar>().unwrap().parse_input(input).count(),
5
);
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amascolo are these parse trees what you expect?

<and> ::= <and> " AND " <terminal>
├── <and> ::= <and> " AND " <terminal>
│   ├── <and> ::= <terminal>
│   │   └── <terminal> ::= "AND"
│   │       └── "AND"
│   ├── " AND "
│   └── <terminal> ::= "AND"
│       └── "AND"
├── " AND "
└── <terminal> ::= "AND"
    └── "AND"

<and> ::= <and> " AND " <terminal>
├── <and> ::= <and> " " <terminal>
│   ├── <and> ::= <and> " " <terminal>
│   │   ├── <and> ::= <terminal>
│   │   │   └── <terminal> ::= "AND"
│   │   │       └── "AND"
│   │   ├── " "
│   │   └── <terminal> ::= "AND"
│   │       └── "AND"
│   ├── " "
│   └── <terminal> ::= "AND"
│       └── "AND"
├── " AND "
└── <terminal> ::= "AND"
    └── "AND"

<and> ::= <and> " " <terminal>
├── <and> ::= <and> " AND " <terminal>
│   ├── <and> ::= <and> " " <terminal>
│   │   ├── <and> ::= <terminal>
│   │   │   └── <terminal> ::= "AND"
│   │   │       └── "AND"
│   │   ├── " "
│   │   └── <terminal> ::= "AND"
│   │       └── "AND"
│   ├── " AND "
│   └── <terminal> ::= "AND"
│       └── "AND"
├── " "
└── <terminal> ::= "AND"
    └── "AND"

<and> ::= <and> " " <terminal>
├── <and> ::= <and> " " <terminal>
│   ├── <and> ::= <and> " AND " <terminal>
│   │   ├── <and> ::= <terminal>
│   │   │   └── <terminal> ::= "AND"
│   │   │       └── "AND"
│   │   ├── " AND "
│   │   └── <terminal> ::= "AND"
│   │       └── "AND"
│   ├── " "
│   └── <terminal> ::= "AND"
│       └── "AND"
├── " "
└── <terminal> ::= "AND"
    └── "AND"

<and> ::= <and> " " <terminal>
├── <and> ::= <and> " " <terminal>
│   ├── <and> ::= <and> " " <terminal>
│   │   ├── <and> ::= <and> " " <terminal>
│   │   │   ├── <and> ::= <terminal>
│   │   │   │   └── <terminal> ::= "AND"
│   │   │   │       └── "AND"
│   │   │   ├── " "
│   │   │   └── <terminal> ::= "AND"
│   │   │       └── "AND"
│   │   ├── " "
│   │   └── <terminal> ::= "AND"
│   │       └── "AND"
│   ├── " "
│   └── <terminal> ::= "AND"
│       └── "AND"
├── " "
└── <terminal> ::= "AND"
    └── "AND"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, these look like the parse trees I was expecting!

Thanks even for writing them out in the same order as they appeared in #117 (comment)

@CrockAgile CrockAgile marked this pull request as ready for review February 4, 2023 21:27
Copy link
Owner

@shnewto shnewto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great!

@CrockAgile
Copy link
Collaborator Author

CrockAgile commented Feb 11, 2023

snapshot testing has exposed some nondeterminism in the Earley parsing. the parse trees are valid, but ambiguous grammars may parse inputs in inconsistent orders between test executions.

I will investigate this (likely a Hashmap or something 🙏). Otherwise, this PR is holding up to my local scrutiny. Once I resolve the nondeterminism issue, I plan to merge 🎊

tests/parse_input.rs Outdated Show resolved Hide resolved
tests/parse_input.rs Outdated Show resolved Hide resolved
@CrockAgile CrockAgile merged commit 07eb3bf into main Feb 11, 2023
@CrockAgile CrockAgile deleted the right-recursive-failure-new branch February 11, 2023 15:53
@amascolo
Copy link
Contributor

@CrockAgile thanks again, amazing to see these fixes merged – any chance of releasing them as 0.4.4?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants