# Parser

The *Earley* parsing algorithm was invented by Jay Earley in 1970. It
can be used to parse strings that conform to a context-free grammar. The
algorithm uses a chart for parsing -- that is, it is implemented as a dynamic
program relying on solving simpler sub-problems.

Earley parsers are very appealing for a practitioner because they can use any
context-free grammar for parsing a string, and from the parse forest generated,
one can recover all parse trees (even infinitely many) that correspond to
the given grammar.

**Note.** This notebook does not implement the Leo optimization.
A more detailed notebook that explains and implements the Leo optimization is available [here](https://rahul.gopinath.org/post/2021/02/06/earley-parsing/); this notebook has been adapted from it.

## Synopsis

```python
import earleyparser as P
my_grammar = {'<start>': [['1', '<A>'],
                          ['2']
                         ],
              '<A>'    : [['a']]}
my_parser = P.EarleyParser(my_grammar)
for tree in my_parser.parse_on(text='1a', start_symbol='<start>'):
    print(P.format_parsetree(tree))
```



In traditional implementations,
there can only be one expansion rule for the `<start>` symbol. We work around
this restriction by seeding the chart with every expansion rule of the start
symbol, and returning all parse trees.

In [None]:
grammar = {
    '<start>': [['<expr>']],
    '<expr>': [
        ['<term>', '+', '<expr>'],
        ['<term>', '-', '<expr>'],
        ['<term>']],
    '<term>': [
        ['<fact>', '*', '<term>'],
        ['<fact>', '/', '<term>'],
        ['<fact>']],
    '<fact>': [
        ['<digits>'],
        ['(','<expr>',')']],
    '<digits>': [
        ['<digit>','<digits>'],
        ['<digit>']],
    '<digit>': [["%s" % str(i)] for i in range(10)],
}
START = '<start>'

Here is another grammar that targets the same language. Unlike the first
grammar, this grammar produces ambiguous parse results.

In [None]:
a_grammar = {
    '<start>': [['<expr>']],
    '<expr>': [
        ['<expr>', '+', '<expr>'],
        ['<expr>', '-', '<expr>'],
        ['<expr>', '*', '<expr>'],
        ['<expr>', '/', '<expr>'],
        ['(', '<expr>', ')'],
        ['<integer>']],
    '<integer>': [
        ['<digits>']],
    '<digits>': [
        ['<digit>','<digits>'],
        ['<digit>']],
    '<digit>': [["%s" % str(i)] for i in range(10)],
}

## Summary

An Earley parser executes the following steps for parsing:

Use `<start>` as the entry into parsing. At this point, we want to parse the
given string by the nonterminal `<start>`. The _definition_ of `<start>`
contains the possible expansion rules that can match the given string. Each
expansion rule can be thought of as a *parsing path*, with contiguous
substrings of the given input string matched by the particular terms in the
rule.

* When given a nonterminal to match the string, the essential idea is to
  get the rules in the definition, and add them to the current set of
  parsing paths to try with the given string. Within the parsing path, we have
  a parsed index which denotes the progress of parsing that particular path
  (i.e., the point up to which the string has been recognized by that
  path, and any parents of this path). When a rule is newly added, this parsed
  index is set to zero.

* We next look at our set of possible parsing paths, and check if any of these
  paths start with a nonterminal. If one is found, then for that parsing path to
  be completed with the given string, that nonterminal has to be recognized
  first. So, we add the expansion rules corresponding to that nonterminal to the
  list of possible parsing paths. We do this recursively.

* Now, examine the current letter in the input. Then select all parsing paths
  that have that particular letter at the parsed index. These expressions can
  now advance one step to the next index. We add such parsing paths to the
  set of parsing paths to try for the next character.

* While doing this, if any parsing path has finished parsing, fetch its
  corresponding nonterminal and advance all parsing paths that have that
  nonterminal at the parsed index.

* Continue recursively until the parsing path corresponding to `<start>` has
  finished.


The chart parser depends on a chart (a table) for parsing. The columns
correspond to the characters in the input string. Each column represents a set
of *states*, and corresponds to the legal rules to follow from that point on.

Say we start with the following grammar:

In [None]:
sample_grammar = {
    '<start>': [['<A>','<B>']],
    '<A>': [['a', '<B>', 'c'], ['a', '<A>']],
    '<B>': [['b', '<C>'], ['<D>']],
    '<C>': [['c']],
    '<D>': [['d']]
}

The Earley parser produces a table of possible parse paths at each letter index of
the input. Given an input `adcd`, we seed column `0` with:

```
   <start>: | <A> <B>
```

where the `|` represents the parsing index (also called the dot). This indicates
that we are at the start, and the next step is to identify `<A>`. After this
rule is processed, the column would contain two more states

```
   <A>: | a <B> c
   <A>: | a <A>
```
which represent two parsing paths to complete `<A>`.

After processing column `0` (which corresponds to the point before any input is read), we
would find the following in column `1` (which corresponds to the input character `a`)

```
   <A>: a | <B> c
   <A>: a | <A>
   <B>: | b <C>
   <B>: | <D>
   <A>: | a <B> c
   <A>: | a <A>
   <D>: | d
```

Similarly, the next column (column `2` corresponding to `d`) would contain the following.

```
   <D>: d |
   <B>: <D> |
   <A>: a <B> | c
```

Next, column `3` corresponding to `c` would contain:
```
   <A>: a <B> c |
   <start>: <A> | <B>
   <B>: | b <C>
   <B>: | <D>
   <D>: | d
```

Finally, column `4` (`d`) would contain this at the end of processing.
```
   <D>: d |
   <B>: <D> |
   <start>: <A> <B> |
```

This is how the table or the chart -- from where the parsing gets its name: chart parsing -- gets filled.

## The Column Data Structure

The column contains a set of states. Each column corresponds
to a character (or a token if tokens are used).
Note that the states in a column correspond to the parsing expressions that will
occur once that character has been read. That is, the first column will
correspond to the parsing expressions when no characters have been read.

The column allows for adding states, and checks to prevent duplication of
states. Why do we need to prevent duplication? The problem is left recursion.
We need to detect and curtail left recursion, which is indicated by non-unique
states.

In [None]:
class Column:
    def __init__(self, index, letter):
        self.index, self.letter = index, letter
        self.states, self._unique = [], {}

    def __str__(self):
        return "%s chart[%d]\n%s" % (self.letter, self.index, "\n".join(
            str(state) for state in self.states if state.finished()))

    def to_repr(self):
        return "%s chart[%d]\n%s" % (self.letter, self.index, "\n".join(
            str(state) for state in self.states))

    def add(self, state):
        if state in self._unique:
            return self._unique[state]
        self._unique[state] = state
        self.states.append(state)
        state.e_col = self
        return self._unique[state]
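
To make the role of the uniqueness check concrete, here is a minimal, self-contained sketch. It does not use the `Column` class above; `Item` and `predict_all` are hypothetical stand-ins introduced only for this illustration. With a left-recursive rule, predicting `<E>` asks for `<E>` again, and only the duplicate check stops the loop.

```python
from collections import namedtuple

# Hypothetical stand-in for a state: (nonterminal, rule, dot, start column).
Item = namedtuple('Item', 'name expr dot start')

def predict_all(grammar, seed):
    """Repeatedly predict nonterminals at the dot, skipping duplicates."""
    states, unique = [seed], {seed}
    for item in states:  # the list grows while we iterate over it
        if item.dot < len(item.expr) and item.expr[item.dot] in grammar:
            sym = item.expr[item.dot]
            for alt in grammar[sym]:
                new = Item(sym, tuple(alt), 0, 0)
                if new not in unique:  # without this check, <E> is re-added forever
                    unique.add(new)
                    states.append(new)
    return states

# <E> is left recursive: its first rule starts with <E> itself.
left_recursive = {'<E>': [['<E>', '+', '1'], ['1']]}
states = predict_all(left_recursive, Item('<start>', ('<E>',), 0, 0))
for s in states:
    print(s)
```

The loop ends with three items: the seed and the two rules of `<E>`. The second request to predict `<E>` adds nothing new.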

## The State Data Structure

A state represents a parsing path (which corresponds to the nonterminal, and the
expansion rule that is being followed) with the current parsed index. 
Each state contains the following:

* name: The nonterminal that this rule represents.
* expr: The rule that is being followed
* dot:  The point till which parsing has happened in the rule.
* s_col: The starting point for this rule.
* e_col: The ending point for this rule.

In [None]:
class State:
    def __init__(self, name, expr, dot, s_col, e_col=None):
        self.name, self.expr, self.dot = name, expr, dot
        self.s_col, self.e_col = s_col, e_col

    def finished(self):
        return self.dot >= len(self.expr)

    def at_dot(self):
        return self.expr[self.dot] if self.dot < len(self.expr) else None

    def __str__(self):
        def idx(var):
            return var.index if var else -1

        return self.name + ':= ' + ' '.join([
            str(p)
            for p in [*self.expr[:self.dot], '|', *self.expr[self.dot:]]
        ]) + "(%d,%d)" % (idx(self.s_col), idx(self.e_col))

    def copy(self):
        return State(self.name, self.expr, self.dot, self.s_col, self.e_col)

    def _t(self):
        return (self.name, self.expr, self.dot, self.s_col.index)

    def __hash__(self):
        return hash(self._t())

    def __eq__(self, other):
        return self._t() == other._t()

    def advance(self):
        return State(self.name, self.expr, self.dot + 1, self.s_col)

The convenience methods `finished()`, `advance()` and `at_dot()` should be
self-explanatory. For example,

In [None]:
if __name__ == '__main__':
    nt_name = '<B>'
    nt_expr = tuple(sample_grammar[nt_name][1])
    col_0 = Column(0, None)
    a_state = State(nt_name, tuple(nt_expr), 0, col_0)
    print(a_state.at_dot())

That is, the next symbol to be parsed is `<D>`, and if we advance it,

In [None]:
if __name__ == '__main__':
    b_state = a_state.advance()
    print(b_state)
    print(b_state.finished())

## The Basic Parser Interface

We start with a bare minimum interface for a parser. It should allow one
to parse a given text using a given nonterminal (which should be present in
the grammar).

In [None]:
class Parser:
    def recognize_on(self, text, start_symbol):
        raise NotImplementedError()

    def parse_on(self, text, start_symbol):
        raise NotImplementedError()

We now initialize the Earley parser, which is a parser.

In [None]:
class EarleyParser(Parser):
    def __init__(self, grammar, log = False, parse_exceptions = True, **kwargs):
        self._grammar = grammar
        self.epsilon = nullable(grammar)
        self.log = log
        self.parse_exceptions = parse_exceptions

### Nonterminals Deriving Empty Strings

An Earley parser handles *nullable* nonterminals separately. A nullable
nonterminal is a nonterminal that can derive an empty string. That is,
at least one of its expansion rules must derive an empty string. An
expansion rule derives an empty string if *all* of its symbols can
derive the empty string. This means the rule has no terminal symbols (assuming we
do not have zero width terminal symbols), and all of its nonterminal symbols
can derive the empty string.

In this implementation, we first initialize the list of first level
nullable nonterminals that contain an empty expansion. That is, they
directly derive the empty string.
Next, we remove any expansion rule that contains a terminal symbol, as these
expansion rules will not result in empty strings. Next, we start with
our current list of nullable nonterminals, take one at a time, and
remove it from the current expansion rules. If any expansion rule
becomes empty, the corresponding nonterminal is added to the nullable
nonterminal list. This continues until all nullable nonterminals
are processed.

In [None]:
def is_nt(k):
    return (k[0], k[-1]) == ('<', '>')

def rem_terminals(g):
    g_cur = {}
    for k in g:
        alts = []
        for alt in g[k]:
            ts = [t for t in alt if not is_nt(t)]
            if not ts:
                alts.append(alt)
        if alts:
            g_cur[k] = alts
    return g_cur

def nullable(g):
    nullable_keys = {k for k in g if [] in g[k]}

    unprocessed  = list(nullable_keys)

    g_cur = rem_terminals(g)
    while unprocessed:
        nxt, *unprocessed = unprocessed
        g_nxt = {}
        for k in g_cur:
            g_alts = []
            for alt in g_cur[k]:
                alt_ = [t for t in alt if t != nxt]
                if not alt_:
                    nullable_keys.add(k)
                    unprocessed.append(k)
                    break
                else:
                    g_alts.append(alt_)
            if g_alts:
                g_nxt[k] = g_alts
        g_cur = g_nxt

    return nullable_keys

An example

In [None]:
if __name__ == '__main__':
    nullable_grammar = {
        '<start>': [['<A>', '<B>']],
        '<A>': [['a'], [], ['<C>']],
        '<B>': [['b']],
        '<C>': [['<A>'], ['<B>']]
    }

Checking

In [None]:
if __name__ == '__main__':
    print(nullable(nullable_grammar))
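
As a cross-check, the same set can be computed with a plain fixpoint iteration. The sketch below is self-contained and is not the implementation used in this notebook. The name `nullable_fixpoint` is introduced here only for illustration, and the grammar is copied from the example above.

```python
def nullable_fixpoint(grammar):
    """Grow the nullable set until a full pass adds nothing new."""
    found = set()
    changed = True
    while changed:
        changed = False
        for key, alts in grammar.items():
            if key in found:
                continue
            # all() is True for an empty rule, so direct empty expansions
            # are covered; terminals are never in `found`, so they block.
            if any(all(sym in found for sym in alt) for alt in alts):
                found.add(key)
                changed = True
    return found

example_grammar = {
    '<start>': [['<A>', '<B>']],
    '<A>': [['a'], [], ['<C>']],
    '<B>': [['b']],
    '<C>': [['<A>'], ['<B>']],
}
print(sorted(nullable_fixpoint(example_grammar)))  # ['<A>', '<C>']
```

`<start>` is not nullable because `<B>` can only derive `b`.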

## The Chart Parser

Earley parser is a chart parser. That is, it relies on a table of solutions
to smaller problems. This table is called a chart (hence the name of such parsers -- chart parsers).

### The Chart Construction

Here, we begin the chart construction by 
seeding the chart with columns representing the tokens or characters.
Consider our example grammar again. The starting point is,
```
   <start>: | <A> <B>
```
We add this state to the `chart[0]` to start the parse. Note that the term
after dot is `<A>`, which will need to be recursively inserted to the column.
We will see how to do that later.

*Note:* In traditional Earley parsing, the starting nonterminal always has
a single expansion rule. However, in many cases, you want to parse a fragment
and this rule makes it cumbersome to use Earley parsing. Hence, we have
opted to allow any nonterminal to be used as the starting nonterminal
irrespective of whether it has a single rule or not.
Interestingly, this does not have an impact on the parsing itself, but on
the extraction of results.
In essence, we seed *all* expansion rules of the current start symbol
into the chart at `column 0`. We will take care of that difference while
building parse trees.

In [None]:
class EarleyParser(EarleyParser):
    def chart_parse(self, tokens, start, alts):
        chart = [self.create_column(i, tok) for i, tok in enumerate([None, *tokens])]
        for alt in alts:
            chart[0].add(self.create_state(start, tuple(alt), 0, chart[0]))
        return self.fill_chart(chart)

    def create_column(self, i, tok): return Column(i, tok)

    def create_state(self, sym, alt, num, col): return State(sym, alt, num, col)

We seed our initial state in the example

In [None]:
if __name__ == '__main__':
    ep = EarleyParser(sample_grammar)
    ep.fill_chart = lambda s: s

    v = ep.chart_parse(list('a'), START, sample_grammar[START])
    print(v[0].states[0])

Then, we complete the chart. The idea here is to process one character or one
element at a time. At each character, we examine the current parse paths
(states) and continue forward any parse path that successfully parses the
letter. We process any state that is present in the current column in the
following fashion.

There are three main methods we use: `predict()`, `scan()`, and `complete()`


#### Predict

If in the current state, the term after the dot is a nonterminal, `predict()` is called. It
adds the expansion of the nonterminal to the current column.

If the term is nullable, then we simply advance the current state, and
add that to the current column. This fix to the original Earley parsing
was suggested by Aycock et al.[^aycock2002practical].

In [None]:
class EarleyParser(EarleyParser):
    def predict(self, col, sym, state):
        for alt in self._grammar[sym]:
            col.add(self.create_state(sym, tuple(alt), 0, col))
        if sym in self.epsilon:
            col.add(state.advance())
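
To see why the extra advance for nullable nonterminals matters, consider the following self-contained sketch. It is a compact recognizer written only for this illustration, not the `EarleyParser` class of this notebook. The function `recognize`, its `nullable` parameter, and the tuple layout `(name, expr, dot, origin)` are all assumptions made for the example. With the rule `<S> ::= <A> <A> b` and an empty `<A>`, the second `<A>` is predicted after the empty `<A>` has already been completed, so without the advance the parent never moves past it.

```python
def recognize(grammar, start, text, nullable=frozenset()):
    """Return True if `text` is derivable from `start` (illustrative sketch)."""
    chart = [[] for _ in range(len(text) + 1)]

    def add(col, item):
        if item not in chart[col]:  # duplicate check, as in Column.add
            chart[col].append(item)

    for alt in grammar[start]:
        add(0, (start, tuple(alt), 0, 0))

    for i, column in enumerate(chart):
        for name, expr, dot, origin in column:  # column may grow while iterating
            if dot == len(expr):  # complete: advance the waiting parents
                for p_name, p_expr, p_dot, p_origin in list(chart[origin]):
                    if p_dot < len(p_expr) and p_expr[p_dot] == name:
                        add(i, (p_name, p_expr, p_dot + 1, p_origin))
            elif expr[dot] in grammar:  # predict: add the rules of the nonterminal
                for alt in grammar[expr[dot]]:
                    add(i, (expr[dot], tuple(alt), 0, i))
                if expr[dot] in nullable:  # the extra advance for nullable symbols
                    add(i, (name, expr, dot + 1, origin))
            elif i < len(text) and text[i] == expr[dot]:  # scan
                add(i + 1, (name, expr, dot + 1, origin))

    return any(name == start and dot == len(expr) and origin == 0
               for name, expr, dot, origin in chart[-1])

empty_grammar = {'<S>': [['<A>', '<A>', 'b']], '<A>': [[]]}
without_fix = recognize(empty_grammar, '<S>', 'b')
with_fix = recognize(empty_grammar, '<S>', 'b', nullable={'<A>'})
print(without_fix, with_fix)  # False True
```

In the run without the nullable set, the second prediction of `<A>` is rejected as a duplicate, so nothing completes it again and `b` is never scanned.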

If we look at our example, we have seeded the first column with `| <A> <B>`. Now,
`fill_chart()` will find that the next term is `<A>` and call `predict()`
which will then add the expansions of `<A>`.

In [None]:
if __name__ == '__main__':
    ep = EarleyParser(sample_grammar)
    ep.fill_chart = lambda s: s

    chart = ep.chart_parse(list('a'), START, sample_grammar[START])

    for s in chart[0].states:
        print(s)

Next, we apply predict.

In [None]:
if __name__ == '__main__':
    ep.predict(chart[0], '<A>', chart[0].states[0])
    for s in chart[0].states:
        print(s)

As you can see, the two rules of `<A>` have been added to
the current column.

#### Scan

The `scan()` method is called if the next symbol in the current state is a terminal symbol. If
that symbol matches the letter of the next column, `scan()` moves the dot one position, and adds the new
state to that column.

For example, consider this state.
```
   <B>: | b c
```
If we scan the next column's letter, and that letter is `b`, then it matches the
next symbol. So, we can advance the state by one symbol, and add it to the next
column.
```
   <B>: b | c
```
 

In [None]:
class EarleyParser(EarleyParser):
    def scan(self, col, state, letter):
        if letter == col.letter:
            col.add(state.advance())

Here is our continuing example.

In [None]:
if __name__ == '__main__':
    ep = EarleyParser(sample_grammar)
    ep.fill_chart = lambda s: s

    chart = ep.chart_parse(list('a'), START, sample_grammar[START])
    ep.predict(chart[0], '<A>', chart[0].states[0])

    new_state = chart[0].states[1]
    print(new_state)

    ep.scan(chart[1], new_state, 'a')
    for s in chart[1].states:
        print(s)

As you can see, the `state[1]` in `chart[0]` that was waiting for `a` has
advanced one letter after consuming `a`, and has been added to `chart[1]`.

#### Complete

The `complete()` method is called if a particular state has finished the rule
during execution. It first extracts the start column of the finished state, then
finds the unfinished states in the start column that were waiting on the
nonterminal of the finished state (that is, we can go back to continue to parse
those rules now). Next, it shifts them by one position, and adds them to the current
column.

For example, say the state we have is:
```
   <A>: a | <B> c
   <B>: b c |
```
The state `<B>: b c |` is complete, and we need to advance any state that
has `<B>` at the dot to one index forward, which is `<A>: a <B> | c`.

How do we determine the parent states? During predict, we added the predicted
child states to the same column as that of the inspected state. So, the states
will be found in the starting column of the current state, with the same symbol
at `at_dot()` as the name of the completed state.

We advance all such parents (producing new states) and add the new states to the
current column.

In [None]:
class EarleyParser(EarleyParser):
    def complete(self, col, state):
        parent_states = [st for st in state.s_col.states
                 if st.at_dot() == state.name]
        for st in parent_states:
            col.add(st.advance())

Here is our example. We start parsing `ad`. So, we have three columns.

In [None]:
if __name__ == '__main__':
    ep = EarleyParser(sample_grammar)
    ep.fill_chart = lambda s: s

    chart = ep.chart_parse(list('ad'), START, sample_grammar[START])
    ep.predict(chart[0], '<A>', chart[0].states[0])
    for s in chart[0].states:
        print(s)

Next, we populate column 1 which corresponds to letter `a`.

In [None]:
if __name__ == '__main__':
    print(chart[1].letter)
    for state in chart[0].states:
        if state.at_dot() not in sample_grammar:
            ep.scan(chart[1], state, 'a')
    for s in chart[1].states:
        print(s)

You can see that the two states are waiting on `<B>` and `<A>`
respectively at `at_dot()`.
Hence, we run predict again to add the corresponding rules of `<B>` and `<A>`
to the current column.

In [None]:
if __name__ == '__main__':
    for state in chart[1].states:
        if state.at_dot() in sample_grammar:
            ep.predict(chart[1], state.at_dot(), state)
    for s in chart[1].states:
        print(s)

As you can see, we have a list of states that are waiting
for `b`, `a` and `d`.

Our next letter is:

In [None]:
if __name__ == '__main__':
    print(chart[2])

We scan to populate `column 2`.

In [None]:
if __name__ == '__main__':
    for state in chart[1].states:
        if state.at_dot() not in sample_grammar:
            ep.scan(chart[2], state, state.at_dot())

    for s in chart[2].states:
        print(s)

As we expected, only `<D>` could advance to the next column (`chart[2]`)
after reading `d`.

Finally, we use complete, so that we can advance the parents of the `<D>` state above.

In [None]:
if __name__ == '__main__':
    for state in chart[2].states:
        if state.finished():
            ep.complete(chart[2], state)

    for s in chart[2].states:
        print(s)

As you can see, that led to `<B>` being complete, and since `<B>` is
complete, its parent `<A>: a | <B> c` advances to `<A>: a <B> | c`, which now
waits for `c`.

### Filling The Chart

In the algorithm below, whenever `at_dot()` is a nonterminal
symbol, the expansion rules of that nonterminal are added to the current
column (`predict()`) since each rule represents one valid parsing path. If, on the
other hand, `finished()` indicates that processing is complete for that state, we
look up the parent states and advance their parsing state (`complete()`). If we
find that we are at a terminal symbol, we simply check if the current state can
advance to parsing the next character (`scan()`).

In [None]:
class EarleyParser(EarleyParser):
    def fill_chart(self, chart):
        for i, col in enumerate(chart):
            for state in col.states:
                if state.finished():
                    self.complete(col, state)
                else:
                    sym = state.at_dot()
                    if sym in self._grammar:
                        self.predict(col, sym, state)
                    else:
                        if i + 1 >= len(chart):
                            continue
                        self.scan(chart[i + 1], state, sym)
            if self.log: print(col.to_repr(), '\n')
        return chart

We can now recognize the given string as part of the language represented by the grammar.

In [None]:
if __name__ == '__main__':
    ep = EarleyParser(sample_grammar, log=True)
    columns = ep.chart_parse('adcd', START, sample_grammar[START])
    for c in columns: print(c)

The chart above only shows completed entries. The parenthesized expression
indicates the column just before the first character was recognized, and the
ending column.

Notice how the `<start>` nonterminal shows the dot at the end. That is, fully parsed.

In [None]:
if __name__ == '__main__':
    last_col = columns[-1]
    for s in last_col.states:
        if s.name == '<start>':
            print(s)

## Derivation trees

We use the following procedures to translate the parse forest to individual
trees.

### parse_prefix

In [None]:
class EarleyParser(EarleyParser):
    def parse_prefix(self, text, start_symbol):
        alts = [tuple(alt) for alt in self._grammar[start_symbol]]
        self.table = self.chart_parse(text, start_symbol, alts)
        for col in reversed(self.table):
            states = [st for st in col.states
                if st.name == start_symbol and st.expr in alts and st.s_col.index == 0
            ]
            if states:
                return col.index, states
        return -1, []

Here is an example of using it.

In [None]:
if __name__ == '__main__':
    ep = EarleyParser(sample_grammar)
    cursor, last_states = ep.parse_prefix('adcd', START)
    print(cursor, [str(s) for s in last_states])

### parse_on

Our `parse_on()` method is slightly different from usual Earley implementations
in that we accept any nonterminal symbol, not just nonterminal symbols with a
single expansion rule. We accomplish this by seeding column `0` of the chart
with every expansion rule of the start symbol.

In [None]:
class EarleyParser(EarleyParser):
    def parse_on(self, text, start_symbol):
        starts = self.recognize_on(text, start_symbol)
        forest = self.parse_forest(self.table, starts)
        for tree in self.extract_trees(forest):
            yield tree

    def recognize_on(self, text, start_symbol):
        cursor, states = self.parse_prefix(text, start_symbol)
        starts = [s for s in states if s.finished()]

        if self.parse_exceptions:
            if cursor < len(text) or not starts:
                raise SyntaxError("at " + repr(text[cursor:]))
        return starts

### parse_paths


The `parse_paths()` method tries to unify the given expression in `named_expr` with
the parsed string. For that, it extracts the last symbol in `named_expr` and
checks if it is a terminal symbol. If it is, then it checks the chart at `til` to
see if the letter corresponding to the position matches the terminal symbol.
If it does, the start index for that symbol is `til` minus the length of the symbol.

If the symbol was a nonterminal symbol, then we retrieve the parsed states
at the current end column index (`til`) that correspond to the nonterminal
symbol, and collect the start index. These are the end column indexes for
the remaining expression.

Given our list of start indexes, we obtain the parse paths from the remaining
expression. If we can obtain any, then we return the parse paths. If not, we
return an empty list.

In [None]:
class EarleyParser(EarleyParser):
    def parse_paths(self, named_expr, chart, frm, til):
        def paths(state, start, k, e):
            if not e:
                return [[(state, k)]] if start == frm else []
            else:
                return [[(state, k)] + r
                        for r in self.parse_paths(e, chart, frm, start)]

        *expr, var = named_expr
        starts = None
        if var not in self._grammar:
            starts = ([(var, til - len(var),
                        't')] if til > 0 and chart[til].letter == var else [])
        else:
            starts = [(s, s.s_col.index, 'n') for s in chart[til].states
                      if s.finished() and s.name == var]

        return [p for s, start, k in starts for p in paths(s, start, k, expr)]

Example

In [None]:
if __name__ == '__main__':
    print(sample_grammar[START])
    ep = EarleyParser(sample_grammar)
    completed_start = last_states[0]
    paths = ep.parse_paths(completed_start.expr, columns, 0, 4)
    for path in paths:
        print([list(str(s_) for s_ in s) for s in path])

That is, the parse path for `<start>` given the input `adcd` included
recognizing the expression `<A><B>`. This was recognized by the two states:
`<A>` from input(0) to input(2) which further involved recognizing the rule
`a<B>c`, and the next state `<B>` from input(3) which involved recognizing the
rule `<D>`.

### parse_forest

The `parse_forest()` method takes the states which represent completed
parses, and determines the possible ways that their expressions corresponded to
the parsed expression. As we noted, it is here that we take care of multiple
expansion rules for the start symbol. (The `_parse_forest()` method accepts a single
state, and is the main driver that corresponds to the traditional implementation.)
For example, say we are parsing `1+2+3`, and the
state has `[<expr>,+,<expr>]` in `expr`. It could have been parsed as either
`[{<expr>:1+2},+,{<expr>:3}]` or `[{<expr>:1},+,{<expr>:2+3}]`.

In [None]:
class EarleyParser(EarleyParser):
    def forest(self, s, kind, chart):
        return self.parse_forest(chart, [s]) if kind == 'n' else (s, [])

    def _parse_forest(self, chart, state):
        pathexprs = self.parse_paths(state.expr, chart, state.s_col.index,
                                     state.e_col.index) if state.expr else []
        return (state.name, [[(v, k, chart) for v, k in reversed(pathexpr)]
                            for pathexpr in pathexprs])

    def parse_forest(self, chart, states):
        names = list({s.name for s in states})
        assert len(names) == 1
        forest = [self._parse_forest(chart, state) for state in states]
        return (names[0], [e for name, expr in forest for e in expr])

Example

In [None]:
if __name__ == '__main__':
    ep = EarleyParser(sample_grammar)
    result = ep.parse_forest(columns, last_states)
    print(result)

### extract_trees

We show how to extract a single tree first, and then generalize it to
all trees.

In [None]:
class EarleyParser(EarleyParser):
    def extract_a_tree(self, forest_node):
        name, paths = forest_node
        if not paths:
            return (name, [])
        return (name, [self.extract_a_tree(self.forest(*p)) for p in paths[0]])

    def extract_trees(self, forest):
        yield self.extract_a_tree(forest)

Example

In [None]:
if __name__ == '__main__':
    import src.utils as utils
    mystring = '1+2+4'
    parser = EarleyParser(a_grammar)
    for tree in parser.parse_on(mystring, START):
        utils.display_tree(tree)

## Ambiguous Parsing

Ambiguous grammars can produce multiple derivation trees for some given string.
In the above example, the `a_grammar` can parse `1+2+4` as either `[1+2]+4` or `1+[2+4]`.
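
To get a feel for how quickly such alternatives multiply, the following self-contained sketch enumerates every way of grouping a chain of `+` operands under a rule of the shape `<expr> ::= <expr> + <expr>`. It is independent of the parser. The function `bracketings` is a hypothetical helper written for this illustration, and it brackets the outermost expression as well.

```python
def bracketings(operands):
    """List every binary grouping of `operands` joined by '+'."""
    if len(operands) == 1:
        return [operands[0]]
    results = []
    # Choose where the top-level '+' splits the operands.
    for i in range(1, len(operands)):
        for left in bracketings(operands[:i]):
            for right in bracketings(operands[i:]):
                results.append('[%s+%s]' % (left, right))
    return results

for b in bracketings(['1', '2', '4']):
    print(b)
print(len(bracketings(['1', '2', '3', '4'])))  # 5
```

Three operands give the two groupings mentioned above, and four operands already give five.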

That is, we need to extract all derivation trees.
We enhance our `extract_trees()` as below.


In [None]:
import itertools as I

class EarleyParser(EarleyParser):
    def extract_trees(self, forest_node):
        name, paths = forest_node
        if not paths:
            yield (name, [])
        for path in paths:
            ptrees = [self.extract_trees(self.forest(*p)) for p in path]
            for p in I.product(*ptrees):
                yield (name, p)
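
The call to `itertools.product` is what combines the alternatives of each child into complete trees. The sketch below isolates that step with made-up data. The subtree labels are placeholders rather than output of the parser, and the variable names are assumptions for this illustration.

```python
import itertools

# Placeholder alternatives for the three children of one path:
# two possible subtrees on the left, one operator, one subtree on the right.
left_alternatives = [('<expr>', 'subtree A'), ('<expr>', 'subtree B')]
operator = [('+', [])]
right_alternatives = [('<expr>', 'subtree C')]

# Each element of the product picks exactly one alternative per child.
trees = [('<expr>', list(children))
         for children in itertools.product(left_alternatives, operator,
                                           right_alternatives)]
for tree in trees:
    print(tree)
print(len(trees))  # 2
```

The number of trees for one path is the product of the number of alternatives of its children, which is why ambiguous grammars can yield many trees from a compact forest.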
 

### Example

Using the same example,

In [None]:
if __name__ == '__main__':
    mystring = '1+2+4'
    parser = EarleyParser(a_grammar)
    for tree in parser.parse_on(mystring, START):
        utils.display_tree(tree)

In [None]:
import ipynb.fs.full.x0_1_Grammars as grammars

In [None]:
if __name__ == '__main__':
    v = '+16.18*-+4.4/5'
    ep = EarleyParser(grammars.EXPR_GRAMMAR)
    for t in ep.parse_on(v, grammars.EXPR_START):
        utils.display_tree(t)
