Traversal Trees #120

CrockAgile · 2023-02-01T15:23:55Z

Closes #117
Closes #118 (hopefully! 🤞)
Closes #119
Closes #115

Try Again!

This PR is the result of iterating on #119. #119 attempted to resolve the same issue, but @amascolo generously raised examples that still failed.

This PR (maybe!) resolves these additional cases, which are now also included as tests.

I am including the same "Root Cause" section as #119 here so that this PR is readable on its own.

Root Cause

Consider the grammar:

<start> ::= <shortfail> | <longsuccess>
<shortfail> ::= <char> 'never'
<char> ::= 'a'
<longsuccess> ::= <long2>
<long2> ::= <long3>
<long3> ::= <long4>
<long4> ::= <char>

When parsing input "a", there are two routes: "shortfail" and "longsuccess". As the names suggest, "shortfail" requires fewer traversals, but always fails. "longsuccess" requires more traversal steps, and should eventually succeed.

The issue arises because both routes predict the non-terminal <char>. Practical Earley parsing requires de-duplicating predictions, or else recursive grammars fall into infinite loops. The existing Earley implementation in BNF does roughly the following:

(work roughly alternates between the short and long routes because new traversals are appended to the end of the work queue)

* <start> ::= • <shortfail>
* <start> ::= • <longsuccess>
* <shortfail> ::= • <char> 'never'
* <longsuccess> ::= • <long2>
* <char> ::= • 'a'
* <long2> ::= • <long3>
* <char> ::= 'a' •
* <long3> ::= • <long4>
* <shortfail> ::= <char> • 'never' // <--- notice this does NOT succeed
* <long4> ::= • <char> // !!! this <char> prediction is IGNORED because an identical prediction was already made

All the <longN> productions are necessary: without them, the <longsuccess> route would reach its <char> prediction before <char> had completed, and the bug would not be exposed.
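The failure mode above can be sketched with a toy work list. This is only an illustration under simplifying assumptions, not the crate's implementation: the `Work` enum, the `run` function, and the `remember_completions` switch are all hypothetical names. With the switch off, a duplicate prediction is dropped, so a requester that arrives after the completion is never told about it. With the switch on, already completed non-terminals are remembered and handed to late requesters.

```rust
use std::collections::{HashMap, HashSet};

/// Work items for a toy predict/complete list (hypothetical, not the crate's types).
enum Work {
    /// `requester` wants to be notified when `nonterminal` completes.
    Predict { nonterminal: &'static str, requester: &'static str },
    /// `nonterminal` has finished matching.
    Complete { nonterminal: &'static str },
}

/// Processes the work in order and returns the requesters that were notified.
/// When `remember_completions` is false, duplicate predictions are ignored outright.
fn run(work: Vec<Work>, remember_completions: bool) -> Vec<&'static str> {
    let mut waiting: HashMap<&'static str, Vec<&'static str>> = HashMap::new();
    let mut completed: HashSet<&'static str> = HashSet::new();
    let mut notified = Vec::new();

    for item in work {
        match item {
            Work::Predict { nonterminal, requester } => {
                if !waiting.contains_key(nonterminal) {
                    // First prediction of this non-terminal: register the requester.
                    waiting.insert(nonterminal, vec![requester]);
                } else if remember_completions {
                    if completed.contains(nonterminal) {
                        // Already completed earlier: notify the late requester now.
                        notified.push(requester);
                    } else {
                        // Still pending: wait alongside the first requester.
                        waiting.get_mut(nonterminal).unwrap().push(requester);
                    }
                }
                // Otherwise the duplicate prediction is silently dropped (the bug).
            }
            Work::Complete { nonterminal } => {
                completed.insert(nonterminal);
                if let Some(requesters) = waiting.get(nonterminal) {
                    notified.extend(requesters.iter().copied());
                }
            }
        }
    }
    notified
}

/// The relevant slice of the trace above: <shortfail> predicts <char>,
/// <char> completes, and only then does <long4> predict <char>.
fn trace() -> Vec<Work> {
    vec![
        Work::Predict { nonterminal: "char", requester: "shortfail" },
        Work::Complete { nonterminal: "char" },
        Work::Predict { nonterminal: "char", requester: "long4" },
    ]
}
```

Calling `run(trace(), false)` notifies only "shortfail", which is the route that cannot succeed. Calling `run(trace(), true)` also notifies "long4".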

I am sorry if I have not explained this super clearly. It is a tricky problem! Funny aside: I actually thought of this bug while developing v0.4.0, but I (wrongly) assumed there was something about the Earley algorithm which resolved it. I attempted to write a test, but I did not realize it would require so many intermediate non-terminals to expose. Whoops!

Existing Issues

Where this PR differs from #119 is its approach to "duplicate detection". Previously, duplicate Earley states/traversals were identified by which Production and how many Terms had been matched. This turns out to be insufficient, because partially "matched" Productions (i.e. <shortfail> ::= <char> • 'never') could have matched non-terminals via different paths.
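To make the insufficiency concrete, here is a rough sketch comparing two duplicate-detection keys. The struct names (`ShallowKey`, `PathKey`) and the tuple representation of a traversal are assumptions made for illustration only and are not the crate's types. The shallow key mirrors "which Production and how many Terms had been matched", while the path key also records what each term was matched by.

```rust
use std::collections::HashSet;

/// Old-style key: which production, and how many terms have been matched.
#[derive(PartialEq, Eq, Hash)]
struct ShallowKey {
    production: &'static str,
    matched_terms: usize,
}

/// Hypothetical richer key: also records what each matched term was matched by.
#[derive(PartialEq, Eq, Hash)]
struct PathKey {
    production: &'static str,
    matched_by: Vec<&'static str>,
}

/// A toy traversal: the production plus a label for each matched term.
type Traversal = (&'static str, Vec<&'static str>);

/// Number of traversals that survive de-duplication by the shallow key.
fn distinct_by_shallow_key(traversals: &[Traversal]) -> usize {
    traversals
        .iter()
        .map(|(production, matched_by)| ShallowKey {
            production: *production,
            matched_terms: matched_by.len(),
        })
        .collect::<HashSet<_>>()
        .len()
}

/// Number of traversals that survive de-duplication by the path key.
fn distinct_by_path_key(traversals: &[Traversal]) -> usize {
    traversals
        .iter()
        .map(|(production, matched_by)| PathKey {
            production: *production,
            matched_by: matched_by.clone(),
        })
        .collect::<HashSet<_>>()
        .len()
}

/// Two traversals of the same production that matched their first term differently.
fn example_traversals() -> Vec<Traversal> {
    vec![
        ("dna ::= base dna", vec!["base=1"]),
        ("dna ::= base dna", vec!["base=2"]),
    ]
}
```

With the shallow key the two example traversals collapse into one entry, so one of the match paths is lost. With the path key both are kept.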

New Duplicate Detection

Traversals now match/complete terms by building trees 🌳

A new prediction is the root trunk of a tree, and each matched/completed term adds a new branch. Assuming there are two different traversals which can complete base, a traversal tree segment may look like:

flowchart TD
0["dna ::= • base dna"] --> 1["dna ::= base=1 • dna"]
0["dna ::= • base dna"] --> 2["dna ::= base=2 • dna"]
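A minimal sketch of the tree idea, assuming an index-based arena. The `TraversalTree` and `Node` types and their methods are hypothetical and only illustrate the shape of the data: the prediction is the root, each matched term adds a child, and the matches for any traversal are recovered by walking from its node back to the root.

```rust
/// One node of a toy traversal tree, stored in an arena (`Vec`).
struct Node {
    /// Index of the parent node; `None` for the root (a fresh prediction).
    parent: Option<usize>,
    /// What was matched to reach this node from its parent; `None` at the root.
    matched: Option<&'static str>,
}

/// A tree rooted at a single prediction (hypothetical type, not the crate's).
struct TraversalTree {
    nodes: Vec<Node>,
}

impl TraversalTree {
    /// Creates a tree containing only the root prediction.
    fn new() -> Self {
        Self { nodes: vec![Node { parent: None, matched: None }] }
    }

    /// Index of the root prediction.
    fn root(&self) -> usize {
        0
    }

    /// Adds a branch under `parent` for one more matched term and returns its index.
    fn advance(&mut self, parent: usize, matched: &'static str) -> usize {
        self.nodes.push(Node { parent: Some(parent), matched: Some(matched) });
        self.nodes.len() - 1
    }

    /// Walks from `leaf` back to the root, then reverses the result,
    /// giving the matched terms in left-to-right order.
    fn matches(&self, leaf: usize) -> Vec<&'static str> {
        let mut out = Vec::new();
        let mut current = Some(leaf);
        while let Some(index) = current {
            let node = &self.nodes[index];
            if let Some(matched) = node.matched {
                out.push(matched);
            }
            current = node.parent;
        }
        out.reverse();
        out
    }
}
```

In this layout, sibling branches such as base=1 and base=2 share the same root node instead of each carrying its own copy of the matches so far.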

Performance

On my machine, there was seemingly no performance cost to these changes. I believe there was a cost to adding the new logic for "prior completed" traversals. But that cost was offset by the improvement of traversal trees, instead of reference counted term matching vectors.

Extra

Basically every time I have worked on an Earley bug, I have ended up adding the same manual logging to help with debugging. I decided to commit that logging this time!

There is a new tracing::event! macro that is a no-op by default, but emits logging events when the "tracing" feature is enabled.

For Earley, traversal state events are logged when created and during predict/scan/complete. It helps quite a bit with debugging!
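For readers unfamiliar with feature-gated logging, the general pattern looks roughly like the sketch below. It is a std-only stand-in and not the crate's actual macro: `trace_event!` and `scan` are hypothetical names, and `eprintln!` takes the place of a real logging backend. When the "tracing" feature is off, the macro body is compiled out entirely.

```rust
/// Stand-in for a feature-gated logging macro (hypothetical name).
/// Expands to nothing unless the "tracing" feature is enabled at compile time.
macro_rules! trace_event {
    ($($arg:tt)*) => {
        #[cfg(feature = "tracing")]
        {
            eprintln!($($arg)*);
        }
    };
}

/// Toy "scan" step: returns the character at `position`, logging the attempt
/// only when the feature is enabled.
fn scan(input: &str, position: usize) -> Option<char> {
    trace_event!("scan {:?} at position {}", input, position);
    input.chars().nth(position)
}
```

The behavior of `scan` is the same either way. Only the side-channel output differs, so a default build pays nothing for the logging.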

CrockAgile · 2023-02-01T15:27:48Z

I again plan to leave this PR open to give some time for review. Also I still have to read it all over myself with fresh eyes, and add a lot of comments 😅

coveralls · 2023-02-01T15:29:03Z

Coverage: 91.063% (-1.2%) from 92.231% when pulling baf5a09 on right-recursive-failure-new into 5a51585 on main.

CrockAgile · 2023-02-01T15:29:23Z

src/grammar.rs

+            bnf.parse::<Grammar>().unwrap().parse_input(input).count(),
+            5
+        );
+    }


@amascolo are these parse trees what you expect?

<and> ::= <and> " AND " <terminal>
├── <and> ::= <and> " AND " <terminal>
│   ├── <and> ::= <terminal>
│   │   └── <terminal> ::= "AND"
│   │       └── "AND"
│   ├── " AND "
│   └── <terminal> ::= "AND"
│       └── "AND"
├── " AND "
└── <terminal> ::= "AND"
    └── "AND"

<and> ::= <and> " AND " <terminal>
├── <and> ::= <and> " " <terminal>
│   ├── <and> ::= <and> " " <terminal>
│   │   ├── <and> ::= <terminal>
│   │   │   └── <terminal> ::= "AND"
│   │   │       └── "AND"
│   │   ├── " "
│   │   └── <terminal> ::= "AND"
│   │       └── "AND"
│   ├── " "
│   └── <terminal> ::= "AND"
│       └── "AND"
├── " AND "
└── <terminal> ::= "AND"
    └── "AND"

<and> ::= <and> " " <terminal>
├── <and> ::= <and> " AND " <terminal>
│   ├── <and> ::= <and> " " <terminal>
│   │   ├── <and> ::= <terminal>
│   │   │   └── <terminal> ::= "AND"
│   │   │       └── "AND"
│   │   ├── " "
│   │   └── <terminal> ::= "AND"
│   │       └── "AND"
│   ├── " AND "
│   └── <terminal> ::= "AND"
│       └── "AND"
├── " "
└── <terminal> ::= "AND"
    └── "AND"

<and> ::= <and> " " <terminal>
├── <and> ::= <and> " " <terminal>
│   ├── <and> ::= <and> " AND " <terminal>
│   │   ├── <and> ::= <terminal>
│   │   │   └── <terminal> ::= "AND"
│   │   │       └── "AND"
│   │   ├── " AND "
│   │   └── <terminal> ::= "AND"
│   │       └── "AND"
│   ├── " "
│   └── <terminal> ::= "AND"
│       └── "AND"
├── " "
└── <terminal> ::= "AND"
    └── "AND"

<and> ::= <and> " " <terminal>
├── <and> ::= <and> " " <terminal>
│   ├── <and> ::= <and> " " <terminal>
│   │   ├── <and> ::= <and> " " <terminal>
│   │   │   ├── <and> ::= <terminal>
│   │   │   │   └── <terminal> ::= "AND"
│   │   │   │       └── "AND"
│   │   │   ├── " "
│   │   │   └── <terminal> ::= "AND"
│   │   │       └── "AND"
│   │   ├── " "
│   │   └── <terminal> ::= "AND"
│   │       └── "AND"
│   ├── " "
│   └── <terminal> ::= "AND"
│       └── "AND"
├── " "
└── <terminal> ::= "AND"
    └── "AND"

amascolo · 2023-02-01

Yep, these look like the parse trees I was expecting!

Thanks even for writing them out in the same order as they appeared in #117 (comment)

shnewto

looks great!

CrockAgile · 2023-02-11T14:51:19Z

Snapshot testing has exposed some nondeterminism in the Earley parsing. The parse trees are valid, but ambiguous grammars may yield their parse trees in inconsistent orders between test executions.

I will investigate this (likely a HashMap or something 🙏). Otherwise, this PR is holding up to my local scrutiny. Once I resolve the nondeterminism issue, I plan to merge 🎊
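As background on why a hash map is a plausible suspect: Rust's std HashMap uses a randomly seeded hasher by default, so its iteration order can differ between runs, while BTreeMap always iterates in sorted key order. The sketch below shows the ordered variant. The `ordered_completions` function and its input shape are assumptions for illustration, not code from this PR.

```rust
use std::collections::BTreeMap;

/// Groups traversal ids by the term they completed and returns the groups
/// in sorted key order, which is stable from run to run.
fn ordered_completions(
    completions: &[(&'static str, usize)],
) -> Vec<(&'static str, Vec<usize>)> {
    let mut by_term: BTreeMap<&'static str, Vec<usize>> = BTreeMap::new();
    for &(term, traversal) in completions {
        // Within a term, traversal ids keep their insertion order.
        by_term.entry(term).or_default().push(traversal);
    }
    by_term.into_iter().collect()
}
```

Swapping BTreeMap for HashMap here would still produce the same groups, but the order of the groups would no longer be guaranteed, which is the kind of difference a snapshot test picks up.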

amascolo · 2023-02-13T16:02:05Z

@CrockAgile thanks again, amazing to see these fixes merged – any chance of releasing them as 0.4.4?

CrockAgile self-assigned this Feb 1, 2023

CrockAgile requested a review from shnewto February 1, 2023 15:25

CrockAgile added 25 commits February 4, 2023 16:18

test example from issue #118

cf2a121

tracing events earley traversals

b9d6c51

stash

bc640e1

minimal reproduction

933c12b

working?

50085e5

test nullable productions issue #117

cebffbf

remove test logging

3ad8993

repro wip test

c811a86

traversal tree wip

4beef46

wip, completion ownership hard

df28ade

borrow checked

926547c

infinite recursion

7d57827

ughhhh

7a98438

reverse matching iter walk

f7fc5ba

cleanup unused

f5f5795

hmmm

20f3abe

hmmmmmm

b904b98

rename

35add76

maybe working?

0fc9c89

tomorrow

99eb081

limit nullable hack

bb34b9a

all passing

9b0d22b

remove nullability detection

38a69e3

tracing instead of prints

d2c76d5

log parse trees

b193163

CrockAgile added 3 commits February 4, 2023 16:18

infinite parse benchmark

b674cd3

polish

c23801e

clippy

a734d6c

CrockAgile force-pushed the right-recursive-failure-new branch from b868b3c to a734d6c Compare February 4, 2023 21:19

CrockAgile marked this pull request as ready for review February 4, 2023 21:27

shnewto approved these changes Feb 10, 2023

snapshot testing

e4c21b1

btree for ordered term completions

692ac13

snapshot test bugs

baf5a09

CrockAgile force-pushed the right-recursive-failure-new branch from ea1c1d4 to baf5a09 Compare February 11, 2023 15:47

CrockAgile merged commit 07eb3bf into main Feb 11, 2023

CrockAgile deleted the right-recursive-failure-new branch February 11, 2023 15:53
