# Parsing and Recombining Inputs

In the chapter on [Grammars](Grammars.ipynb), we discussed how grammars can be
used to represent various languages. We also saw how grammars can be used to
generate strings of the corresponding language. Grammars can also be used in the
reverse direction: given a string, one can decompose it into the
constituent parts that correspond to the parts of the grammar used to generate it,
that is, the derivation tree of that string. These parts (and parts from other similar
strings) can later be recombined using the same grammar to produce new strings.

In this chapter, we use grammars to parse and decompose inputs to
their corresponding derivation trees, allowing us to recombine them
arbitrarily.

**Prerequisites**

* You should have read the [chapter on grammars](Grammars.ipynb).
* An understanding of derivation trees from the [chapter on grammar fuzzer](GrammarFuzzer.ipynb)
  is also required.

In order to parse a string, one needs to identify the language and the
corresponding grammar. For example, here is a string that we would like to parse:

In [None]:
mystring = '1+2'

This string is an arithmetic expression for addition, which may be specified using a grammar.

In [None]:
A1_GRAMMAR = {
   "<start>":
       ["<expr>"],
   "<expr>":
       ["<expr>+<expr>", "<expr>-<expr>", "<integer>"],
   "<integer>":
       ["<digit><integer>", "<digit>"],
   "<digit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}

The parse tree for our expression from this grammar is given by:

In [None]:
display_tree(('<start>',[('<expr>',[('<expr>',[('<integer>',[('<digit>',[('1',[])])])]),('+',[]),('<expr>',[('<integer>',[('<digit>',[('2',[])])])])])]))

While a grammar can be used to specify a given language, there could be multiple
grammars that correspond to the same language. For example, here is another 
grammar to describe the same addition expression.

In [None]:
A2_GRAMMAR = {
   "<start>":
      ["<expr>"],
   "<expr>":
      ["<integer><expr_>"],
   "<expr_>":
      ["+<expr>", "-<expr>", ""],
   "<integer>":
      ["<digit><integer_>"],
   "<integer_>":
      ["<integer>", ""],
   "<digit>":
      ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}

The corresponding derivation is given by:

In [None]:
display_tree(('<start>', [('<expr>', [('<integer>', [('<digit>', [('1', [])]), ('<integer_>', [])]), ('<expr_>', [('+', []), ('<expr>', [('<integer>', [('<digit>', [('2', [])]), ('<integer_>', [])]), ('<expr_>', [])])])])]))

Indeed, there could be different classes of grammars that
describe the same language. For example, the first grammar `A1_GRAMMAR`
is a grammar that sports both _right_ and _left_ recursion, while the
second grammar `A2_GRAMMAR` does not have left recursion in the
non-terminals in any of its productions, but contains _epsilon_ productions.
(An epsilon production is a production that has the empty string as its
right-hand side.)

A grammar is left recursive if any of its non-terminals are left recursive,
and a non-terminal is directly left-recursive if the left-most symbol of
any of its productions is itself. It is indirectly left-recursive if any
of the left-most symbols can be expanded using their definitions to
produce the non-terminal as the left-most symbol of the expansion.
Right recursive grammars are defined similarly.

For example, in `A1_GRAMMAR`, the definition of `<expr>` is directly both
left-recursive and right-recursive. In `A2_GRAMMAR`, however,
`<expr>` is indirectly right-recursive through the expansion of `<expr_>`.
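
The definition of left recursion can be checked mechanically. The following is a minimal sketch, not part of the chapter's infrastructure: the helper names `leftmost_symbol` and `left_recursive` are made up for this illustration, the nonterminal pattern is an assumption about how nonterminals are written, and nullable symbols in front of a nonterminal are ignored for simplicity.

```python
import re

# Assumed syntax of nonterminals: anything of the form <...>
RE_NONTERMINAL = re.compile(r'(<[^<> ]*>)')

A1_GRAMMAR = {
    "<start>": ["<expr>"],
    "<expr>": ["<expr>+<expr>", "<expr>-<expr>", "<integer>"],
    "<integer>": ["<digit><integer>", "<digit>"],
    "<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
}

A2_GRAMMAR = {
    "<start>": ["<expr>"],
    "<expr>": ["<integer><expr_>"],
    "<expr_>": ["+<expr>", "-<expr>", ""],
    "<integer>": ["<digit><integer_>"],
    "<integer_>": ["<integer>", ""],
    "<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
}


def leftmost_symbol(expansion):
    """Return the first token of an expansion string, or None if it is empty."""
    tokens = [t for t in re.split(RE_NONTERMINAL, expansion) if t]
    return tokens[0] if tokens else None


def left_recursive(grammar):
    """Return the nonterminals that can reach themselves in leftmost position.

    Simplification: nullable symbols preceding a nonterminal are not considered.
    """
    # key -> nonterminals that appear as the leftmost symbol of an expansion
    edges = {key: {leftmost_symbol(e) for e in expansions} & set(grammar)
             for key, expansions in grammar.items()}
    result = set()
    for key in grammar:
        seen, todo = set(), list(edges[key])
        while todo:  # follow leftmost symbols transitively
            sym = todo.pop()
            if sym not in seen:
                seen.add(sym)
                todo.extend(edges[sym])
        if key in seen:
            result.add(key)
    return result


print(left_recursive(A1_GRAMMAR))
print(left_recursive(A2_GRAMMAR))
```

Under these assumptions, only `<expr>` of `A1_GRAMMAR` is reported, and nothing is reported for `A2_GRAMMAR`.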

To complicate matters further, there could be
multiple derivation trees -- also called _parses_ -- corresponding to the
same string from the same grammar. For example, a string `1+2+3` can be
parsed either as `{expr: {expr: 1+2}+3}` or  as `{expr: 1+{expr: 2+3}}`.
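
To make the two parses concrete, here is a small sketch that writes both derivation trees for `1+2+3` as nested `(symbol, children)` pairs, the same shape that is passed to `display_tree` above. The helpers `num` and `tree_to_string` are hypothetical names introduced only for this illustration.

```python
def num(digit):
    """Build the subtree <expr> -> <integer> -> <digit> -> digit."""
    return ('<expr>', [('<integer>', [('<digit>', [(digit, [])])])])


def tree_to_string(tree):
    """Concatenate the leaves of a derivation tree from left to right."""
    symbol, children = tree
    if not children:
        # A childless nonterminal stands for an empty expansion
        return '' if symbol.startswith('<') else symbol
    return ''.join(tree_to_string(child) for child in children)


# (1+2)+3: the left operand is itself an addition
left_tree = ('<expr>', [
    ('<expr>', [num('1'), ('+', []), num('2')]),
    ('+', []),
    num('3'),
])

# 1+(2+3): the right operand is itself an addition
right_tree = ('<expr>', [
    num('1'),
    ('+', []),
    ('<expr>', [num('2'), ('+', []), num('3')]),
])

print(tree_to_string(left_tree), tree_to_string(right_tree))
print(left_tree == right_tree)
```

Both trees spell out the same string, yet they differ in structure, which is exactly what makes the grammar ambiguous for this input.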

Numerous parsing techniques exist that can take a grammar, and the given
string, and produce the corresponding derivation tree or trees. However,
some of them work only on specific classes of grammars. These classes of
grammars are named after the specific kind of parser that can accept
grammars of that category.

Different classes of grammars differ in the features that are available to
the user for writing a grammar of that class. That is, the corresponding
kind of parser will be unable to handle a grammar that makes use of more
features than are allowed. For example, `A2_GRAMMAR` is an _LL(1)_
grammar: it has no left recursion, and a single token of lookahead is enough
to choose the next production. `A1_GRAMMAR`, being left-recursive, is not
_LL(k)_ for any *k*. The name comes from the parser: an _LL(k)_ parser reads
its input from left to right, and constructs a leftmost derivation of its
input using *k* lookahead tokens.

We will examine a few classes of parsers next.

First, we initialize a few things required by our parsing infrastructure.

In [None]:
import fuzzingbook_utils
from Grammars import EXPR_GRAMMAR, START_SYMBOL, RE_NONTERMINAL
from GrammarFuzzer import display_tree
import functools
import re

The `EXPR_GRAMMAR` we import from the [chapter on grammars](Grammars.ipynb) is oriented towards generation. In particular, the production rules are stored as strings. We need to massage this representation a little to conform to a canonical representation where each token in a rule is represented separately.

In [None]:
def split(rule):
    return [s for s in re.split(RE_NONTERMINAL, rule) if s]

def canonical(grammar):
    return  {k: [split(l) for l in rules] for k, rules in grammar.items()}
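
As an illustration of the canonical representation, the sketch below restates `split` and `canonical` so that it runs on its own. The regular expression for `RE_NONTERMINAL` is an assumption about what the grammars chapter defines, and `SMALL_GRAMMAR` is a made-up example.

```python
import re

# Assumed definition; the capture group makes re.split keep the nonterminals
RE_NONTERMINAL = re.compile(r'(<[^<> ]*>)')


def split(rule):
    """Split a rule string into nonterminals and the terminal text between them."""
    return [s for s in re.split(RE_NONTERMINAL, rule) if s]


def canonical(grammar):
    """Convert every rule string of the grammar to a list of tokens."""
    return {k: [split(rule) for rule in rules] for k, rules in grammar.items()}


SMALL_GRAMMAR = {
    "<start>": ["<digit>+<digit>", ""],
    "<digit>": ["0", "1"],
}

print(canonical(SMALL_GRAMMAR))
# {'<start>': [['<digit>', '+', '<digit>'], []], '<digit>': [['0'], ['1']]}
```

Note that the empty rule becomes an empty token list, which is how an epsilon production appears in the canonical form.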

We define a minimal interface for parsing, that is obeyed by all parsers.

In [None]:
class Parser(object):
    def __init__(self, grammar, start_symbol=START_SYMBOL):
        self.start_symbol = start_symbol
        self.grammar = grammar
        
    def parse_prefix(self, text):
        """Return pair (cursor, forest) for longest prefix of text"""
        raise NotImplementedError()
        
    def parse(self, text):
        cursor, forest = self.parse_prefix(text)
        if cursor < len(text):
            raise SyntaxError("at " + repr(text[cursor:]))
        return forest
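
To show how the two methods cooperate, here is a hedged sketch of a trivial subclass. The `Parser` class is restated as a stand-in so that the example is self-contained, the default start symbol `'<start>'` is an assumption, and `DigitsParser` is a hypothetical parser that only accepts a run of leading digits.

```python
class Parser:
    """Stand-in for the interface above, repeated to keep the sketch runnable."""

    def __init__(self, grammar, start_symbol='<start>'):
        self.start_symbol = start_symbol
        self.grammar = grammar

    def parse_prefix(self, text):
        """Return pair (cursor, forest) for longest prefix of text"""
        raise NotImplementedError()

    def parse(self, text):
        cursor, forest = self.parse_prefix(text)
        if cursor < len(text):
            raise SyntaxError("at " + repr(text[cursor:]))
        return forest


class DigitsParser(Parser):
    """Hypothetical parser: consumes the leading digits and nothing else."""

    def parse_prefix(self, text):
        cursor = 0
        while cursor < len(text) and text[cursor].isdigit():
            cursor += 1
        tree = (self.start_symbol, [(text[:cursor], [])])
        return cursor, [tree]


parser = DigitsParser({})
print(parser.parse("123"))         # the whole input is consumed
print(parser.parse_prefix("12a"))  # only the prefix "12" is consumed
```

A subclass only has to supply `parse_prefix`; the inherited `parse` turns any unconsumed remainder into a `SyntaxError`.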

## Packrat Parsing for _Parsing Expression Grammars_

Short of hand-rolling a parser, _Packrat_ parsing is one of the simplest parsing techniques. It is one of the techniques for parsing _Parsing Expression Grammars_ (PEGs). PEGs model the typical practice in handwritten recursive descent parsers, and hence may be considered more intuitive to understand. Further, packrat parsing comes with attractive properties such as linear-time parsing. One should be aware that while a PEG looks like a context-free grammar, the language it describes may be different (only LL(1) grammars are guaranteed to represent the same language for both PEGs and other parsers), and some of its behaviors can be surprising~\cite{redziejowski2008}. We look at the implementation of a simple PEG parser next.

For simplicity, we do not implement PEG predicates. However, the resulting parser is robust enough for our purposes.

We derive from the `Parser` base class first, and we accept the text to be parsed in the `parse` method, which in turn calls `unify_key` with the `start_symbol` which is the starting point of `PEG` parsing.

The algorithm itself is simple. It tries to unify the production rules corresponding to the start symbol to the given text. For that, it first verifies that the start symbol is present in the grammar. Next, it retrieves the production rules corresponding to the start symbol, and tries to unify each rule in order using `unify_rule`. If *any* of the rules succeed in being unified with the given text, the parse is considered a success.

The `unify_rule` is similar. It retrieves the tokens corresponding to the rule that it needs to unify with the text, and calls `unify_key` on them in sequence. If *all* tokens are successfully unified with the text, the parse is a success.

In [None]:
class PEGParser(Parser):
    def __init__(self, grammar, start_symbol):
        super().__init__(canonical(grammar), start_symbol)

    def parse(self, text):
        # Returns a pair (cursor, tree); tree is None if nothing matched
        return self.unify_key(self.start_symbol, text, 0)

    @functools.lru_cache(maxsize=None)  # memoize results per (key, text, at)
    def unify_key(self, key, text, at=0):
        if key not in self.grammar:
            # Terminal symbol: it has to match the text literally
            if text[at:].startswith(key):
                return at + len(key), (key, [])
            return at, None
        # Nonterminal: ordered choice, the first rule that unifies wins
        for rule in self.grammar[key]:
            l, res = self.unify_rule(rule, text, at)
            # An empty rule yields [], which is a success; hence compare to None
            if res is not None:
                return l, (key, res)
        return at, None

    def unify_rule(self, rule, text, at):
        results = []
        for token in rule:
            at, res = self.unify_key(token, text, at)
            if res is None:
                return at, None
            results.append(res)
        return at, results

We wrap the initialization and calling of `PEGParser` in a function `parse` that accepts the text to be parsed along with the grammar.

In [None]:
def parse(text, grammar):
    peg = PEGParser(grammar, START_SYMBOL)
    return peg.parse(text)  

One of the consequences of using parsing expression grammars is that one needs to be aware of the restrictions they impose on the grammar. In particular, since the alternatives of a rule are tried in order and the first match wins, the production rules need to be written in such a way that the longest match is always tried first. Consider the definition of our original `EXPR_GRAMMAR`:

In [None]:
EXPR_GRAMMAR

We note that for the key `<factor>`, the definition contains the two rules `<integer>` and `<integer>.<integer>`, in this order. A PEG parser commits to the first alternative that matches, so `<integer>` would always succeed first, and `<integer>.<integer>` would never be tried. Hence, we modify our grammar such that the longer alternative comes first.

In [None]:
NEW_EXPR_GRAMMAR = {'<start>': ['<expr>'],
 '<expr>': ['<term> + <expr>', '<term> - <expr>', '<term>'],
 '<term>': ['<factor> * <term>', '<factor> / <term>', '<factor>'],
 '<factor>': ['+<factor>',
  '-<factor>',
  '(<expr>)',
  '<integer>.<integer>',
  '<integer>'],
 '<integer>': ['<digit><integer>', '<digit>'],
 '<digit>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']}
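
The effect of the ordering can be observed in isolation. The following is a minimal sketch of ordered choice, separate from the `PEGParser` class: `peg_match`, `BAD_ORDER` and `GOOD_ORDER` are hypothetical names, and the grammars are already written in the canonical token-list form.

```python
def peg_match(grammar, key, text, at=0):
    """Return the position after matching key at text[at:], or None on failure."""
    if key not in grammar:  # terminal symbol
        return at + len(key) if text.startswith(key, at) else None
    for rule in grammar[key]:  # ordered choice: the first matching rule wins
        pos = at
        for token in rule:
            pos = peg_match(grammar, token, text, pos)
            if pos is None:
                break
        else:
            return pos  # every token of this rule matched
    return None


BAD_ORDER = {
    '<factor>': [['<integer>'], ['<integer>', '.', '<integer>']],
    '<integer>': [['<digit>', '<integer>'], ['<digit>']],
    '<digit>': [[d] for d in '0123456789'],
}

# Same grammar, but with the longer alternative of <factor> first
GOOD_ORDER = {**BAD_ORDER,
              '<factor>': [['<integer>', '.', '<integer>'], ['<integer>']]}

print(peg_match(BAD_ORDER, '<factor>', '1.5'))   # 1: stops before ".5"
print(peg_match(GOOD_ORDER, '<factor>', '1.5'))  # 3: consumes all of "1.5"
```

With the short alternative first, the match ends after the first integer and the fractional part is left unconsumed; with the long alternative first, plain integers such as `42` still match through the second alternative.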

In [None]:
cursor, tree = parse("1 + (2 * 3)", NEW_EXPR_GRAMMAR)
display_tree(tree)

In [None]:
cursor, tree = parse("1 * (2 + 3.35)", NEW_EXPR_GRAMMAR)
display_tree(tree)

While _Parsing Expression Grammars_ are simple at first sight, their behavior in some cases might be a bit unintuitive. Here is an example from Redziejowski~\cite{redziejowski}:

In [None]:
PEG_SURPRISE = {
    "<A>": ["a<A>a", "aa"]
}

When interpreted as a context-free grammar and used as a string generator, it produces strings of the form `aa, aaaa, aaaaaa, ...`, that is, strings where the number of `a` is $2n$ for $n \geq 1$. The PEG, however, only recognizes strings of the form `aa, aaaa, aaaaaaaa, ...`, that is, strings where the number of `a` is $2^n$.
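
This behavior can be reproduced with a few lines. The sketch below hand-codes the PEG rule `A <- 'a' A 'a' / 'aa'` as a recursive function; `match_A` is a hypothetical name, and the function is a direct transcription of ordered choice rather than a use of the `PEGParser` class.

```python
def match_A(text, at=0):
    """Match A <- 'a' A 'a' / 'aa' at text[at:]; return the end position or None."""
    # First alternative: 'a' A 'a'
    if text.startswith('a', at):
        end = match_A(text, at + 1)
        if end is not None and text.startswith('a', end):
            return end + 1
    # Second alternative, tried only if the first one failed: 'aa'
    if text.startswith('aa', at):
        return at + 2
    return None


# Lengths n for which the whole string of n a's is recognized
recognized = [n for n in range(1, 17) if match_A('a' * n) == n]
print(recognized)  # [2, 4, 8, 16]
```

The inner `A` greedily consumes as much as it can and is never asked to give some of it back, which is why, for example, six `a`s are not recognized as a whole even though the context-free reading derives them.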

## Table driven parsers

Parsing Expression Grammars are specifically oriented towards writing recognizers. Unfortunately, they are not suitable for grammar-based fuzzing. \todo{Verify, and explain how precedence in parsing is not translatable to generation}.
\todo{Explain LL(k), LR(k), and general Context-Free parsers such as Earley and CYK parsers}


### LL(1) parser

LL(k) parsers are top-down parsers that rely on a lookahead of k tokens. We provide an implementation of an LL(1) parser.

We first need to define a few tokens that will come in handy.

In [None]:
EOF = '\0'
EPSILON = ''

LL(1) grammars are rather restrictive. Specifically, the grammar should not contain left recursion. Hence, we have to
update our original grammar to remove left-recursion.

In [None]:
grammar = {'<start>': ['<expr>'],
           '<expr>': ['<term><expr_>'],
           '<expr_>': ['+<expr>',
                       '-<expr>',
                       ''],
           '<term>': ['<factor><term_>'],
           '<term_>': ['*<term>',
                       '/<term>',
                       ''],
           '<factor>': ['+<factor>',
                        '-<factor>',
                        '(<expr>)',
                        '<int>'],
           '<int>': ['<integer><integer_>'],
           '<integer_>': ['',
                          '.<integer>'],
           '<integer>': ['<digit><I>'],
           '<I>': ['<integer>',
                   ''],
           '<digit>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']}

Next, we need to change the grammar so that each production becomes a list of tokens.

In [None]:
new_grammar = {k: [split(e) for e in grammar[k]] for k in grammar}
new_grammar

We also need to get the listing of production rules, and the set of terminals in the grammar.

In [None]:
def rules(g): return [(k, e) for k, a in g.items() for e in a]

In [None]:
def terminals(g):
    return set(t for k, expr in rules(g) for t in expr if t not in g)

### First and Follow sets

\todo{Define first and follow sets}
We first define the fixpoint.
\todo{Define what is a fixpoint of a function}

In [None]:
def fixpoint(f):
    def helper(*args):
        while True:
            sargs = repr(args)
            args_ = f(*args)
            if repr(args_) == sargs:
                return args
            args = args_
    return helper
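
A fixpoint of a function `f` is an argument `x` for which `f(x) == x`; the decorator keeps feeding the result of `f` back into `f` until nothing changes any more. As an illustration, the sketch below restates the decorator and applies it to a made-up reachability problem. `reachable_` and `EDGES` are hypothetical, and the set of nodes is kept as a sorted tuple so that its `repr` is deterministic.

```python
def fixpoint(f):
    """Call f repeatedly on its own result until the result stops changing."""
    def helper(*args):
        while True:
            sargs = repr(args)
            args_ = f(*args)
            if repr(args_) == sargs:
                return args
            args = args_
    return helper


@fixpoint
def reachable_(edges, seen):
    """One round: add every node that is one edge away from a known node."""
    found = set(seen)
    for src, dst in edges:
        if src in found:
            found.add(dst)
    return (edges, tuple(sorted(found)))


EDGES = (('a', 'b'), ('b', 'c'), ('d', 'e'))

_, nodes = reachable_(EDGES, ('a',))
print(nodes)  # ('a', 'b', 'c')
```

The decorated function has to return its full argument tuple, since that tuple is what gets compared and passed to the next round.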

In [None]:
@fixpoint
def nullable_(rules, e):
    for A, expression in rules:
        if all((token in e)  for token in expression): e |= {A}
    return (rules, e)

def nullable(grammar):
    return nullable_(rules(grammar), set())[1]


@fixpoint
def firstset_(rules, first, epsilon):
    for A, expression in rules:
        for token in expression:
            first[A] |= first[token]

            # update until the first token that is not nullable
            if token not in epsilon:
                break
    return (rules, first, epsilon)

def firstset(grammar, epsilon):
    # https://www.cs.umd.edu/class/spring2014/cmsc430/lectures/lec05.pdf p6
    # (1) If X is a terminal, then First(X) is just X
    first = {i:{i} for i in terminals(grammar)}

    # (2) if X ::= epsilon, then epsilon \in First(X)
    for k in grammar:
        first[k] = {EPSILON} if k in epsilon else set()
    return firstset_(rules(grammar), first, epsilon)[1]

@fixpoint
def followset_(grammar, epsilon, first, follow):
    for A, expression in rules(grammar):
        # https://www.cs.umd.edu/class/spring2014/cmsc430/lectures/lec05.pdf
        # https://www.cs.uaf.edu/~cs331/notes/FirstFollow.pdf
        # essentially, we start from the end of the expression. Then:
        # (3) if there is a production A -> aB, then every thing in
        # FOLLOW(A) is in FOLLOW(B)
        # note: f_B serves as both follow and first.
        f_B = follow[A]
        for t in reversed(expression):
            # update the follow for the current token. If this is the
            # first iteration, then here is the assignment
            if t in grammar:
                follow[t] |= f_B  # only bother with nt

            # computing the last follow symbols for each token t. This
            # will be used in the next iteration. If current token is
            # nullable, then previous follows can be a legal follow for
            # next. Else, only the first of current token is legal follow
            # essentially

            # (2) if there is a production A -> aBb then everything in FIRST(B)
            # except for epsilon is added to FOLLOW(B)
            f_B = (f_B | first[t]) if t in epsilon else (first[t] - {EPSILON})

    return (grammar, epsilon, first, follow)

In [None]:
def followset(grammar, start):
    # Initialize first and follow sets for non-terminals
    follow = {i: set() for i in grammar}
    follow[start] = {EOF}

    epsilon = nullable(grammar)
    first = firstset(grammar, epsilon)
    return followset_(grammar, epsilon, first, follow)
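
For reference, FIRST(X) is the set of terminals that can begin a string derived from X, and FOLLOW(X) is the set of terminals that can appear immediately after X in some derivation. The following is a self-contained sketch that computes both, together with the nullable set, for a tiny made-up grammar. It uses a plain loop instead of the `fixpoint` decorator, tracks nullability separately rather than storing an epsilon marker in FIRST, and `compute_sets` and `TINY` are hypothetical names.

```python
EOF = '\0'

# <S> -> <A> b        <A> -> a <A> | (empty)
TINY = {
    '<S>': [['<A>', 'b']],
    '<A>': [['a', '<A>'], []],
}


def compute_sets(grammar, start):
    nullable = set()
    first = {k: set() for k in grammar}
    follow = {k: set() for k in grammar}
    follow[start].add(EOF)

    def first_of(symbol):
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:  # iterate until no set grows any more
        changed = False
        for key, alternatives in grammar.items():
            for rule in alternatives:
                # key is nullable if all symbols of some rule are nullable
                if key not in nullable and all(s in nullable for s in rule):
                    nullable.add(key)
                    changed = True
                # FIRST: scan up to the first symbol that is not nullable
                for sym in rule:
                    new = first_of(sym) - first[key]
                    if new:
                        first[key] |= new
                        changed = True
                    if sym not in nullable:
                        break
                # FOLLOW: walk right to left, tracking what can come next
                trailer = set(follow[key])
                for sym in reversed(rule):
                    if sym in grammar:
                        new = trailer - follow[sym]
                        if new:
                            follow[sym] |= new
                            changed = True
                        if sym in nullable:
                            trailer = trailer | first[sym]
                        else:
                            trailer = set(first[sym])
                    else:
                        trailer = {sym}
    return nullable, first, follow


nullable_set, first_sets, follow_sets = compute_sets(TINY, '<S>')
print(nullable_set, first_sets, follow_sets)
```

For this grammar, `<A>` is nullable, a string derived from `<S>` starts with `a` or `b`, and `<A>` can only be followed by `b`.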

In [None]:
def rnullable(rule, epsilon):
    return all(token in epsilon for token in rule)

In [None]:
def rfirst(rule, first, epsilon):
    tokens = set()
    for token in rule:
        tokens |= first[token]
        if token not in epsilon: break
    return tokens

In [None]:
def predict(rulepair, first, follow, epsilon):
    A, rule = rulepair
    rf = rfirst(rule, first, epsilon)
    if rnullable(rule, epsilon):
        rf |= follow[A]
    return rf

In [None]:
def parse_table(grammar, start, my_rules):
    _, epsilon, first, follow = followset(grammar, start)

    ptable = [(rule, predict(rule, first, follow, epsilon))
              for rule in my_rules]

    parse_tbl = {k: {} for k in grammar}

    for (k, expr), pvals in ptable:
        parse_tbl[k].update({v: (k, expr) for v in pvals})
    return parse_tbl
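
A parse table maps a nonterminal and the current lookahead token to the production to apply. To show how such a table drives a parser, here is a minimal sketch of a table-driven recognizer for balanced parentheses. The table is written by hand rather than computed by `parse_table`, and `TABLE` and `ll1_accepts` are hypothetical names.

```python
EOF = '\0'

# Grammar: <S> -> ( <S> ) <S> | (empty)
# TABLE[nonterminal][lookahead] is the right-hand side to expand
TABLE = {
    '<S>': {
        '(': ['(', '<S>', ')', '<S>'],
        ')': [],   # <S> -> (empty), since ')' can follow <S>
        EOF: [],   # <S> -> (empty), since the end of input can follow <S>
    }
}


def ll1_accepts(text, table=TABLE, start='<S>'):
    tokens = list(text) + [EOF]
    pos = 0
    stack = [EOF, start]  # the top of the stack is the end of the list
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top in table:  # nonterminal: consult the table
            rule = table[top].get(lookahead)
            if rule is None:
                return False
            stack.extend(reversed(rule))
        elif top == lookahead:  # terminal: must match the input
            pos += 1
        else:
            return False
    return pos == len(tokens)


print(ll1_accepts('(())()'))  # True
print(ll1_accepts('(()'))     # False
```

The recognizer never backtracks: one token of lookahead is enough to select the production, which is the defining property of an LL(1) grammar.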

In [None]:
def parse_helper(grammar, tbl, stack, inplst):
    inp, *inplst = inplst
    exprs = []
    while stack:
        val, *stack = stack
        if isinstance(val, tuple):
            exprs.append(val)
        elif val not in grammar:  # terminal
            assert val == inp
            exprs.append(val)
            inp, *inplst = inplst or [None]
        else:
            _, rhs = tbl[val][inp] if inp else (None, [])
            stack = rhs + [(val, len(rhs))] + stack
    return exprs

In [None]:
def parse(grammar, start, inp):
    my_rules = rules(grammar)
    parse_tbl = parse_table(grammar, start, my_rules)
    # Start the derivation from the given start symbol
    stack = [start]
    return parse_helper(grammar, parse_tbl, stack, list(inp))

In [None]:
def linear_to_tree(arr):
    stack = []
    while arr:
        elt = arr.pop(0)
        if not isinstance(elt, tuple):
            stack.append((elt, []))
        else:
            # get the last n
            sym, n = elt
            elts = stack[-n:] if n > 0 else []
            stack = stack[0:len(stack) - n]
            stack.append((sym, elts))
    assert len(stack) == 1
    return stack[0]

In [None]:
tree = linear_to_tree(parse(new_grammar, START_SYMBOL, '(1+2)*3'))
display_tree(tree)

### Earley parser

In [None]:
def shrink(rule): return [i.strip() for i in rule]

In [None]:
new_grammar = {k: [shrink(split(e)) for e in EXPR_GRAMMAR[k]] for k in EXPR_GRAMMAR}
new_grammar

In [None]:
class State(object):
    def __init__(self, name, expr, dot, origin, children=[]):
        self.name, self.expr, self.dot, self.origin = name, expr, dot, origin
        self.children = children[:]
    def finished(self): return self.dot >= len(self.expr)
    def shift(self):
        return State(self.name, self.expr, self.dot+1, self.origin, self.children)
    def symbol(self): return self.expr[self.dot]

    def _t(self): return (self.name, self.expr, self.dot, self.origin.i, tuple(self.children))
    def __hash__(self): return hash(self._t())
    def __eq__(self, other): return  self._t() == other._t()

class Column(object):
    def __init__(self, i, token):
        self.token, self.states, self._unique, self.i = token, [], {}, i

    def add(self, state):
        if state in self._unique: return self._unique[state]
        self._unique[state] = state
        self.states.append(state)
        return self._unique[state]

def predict(col, sym, grammar):
    for alt in grammar[sym]:
        col.add(State(sym, tuple(alt), 0, col))

def scan(col, state, token):
    if token == col.token:
        col.add(state.shift())

def complete(col, state, grammar):
    for st in state.origin.states:
        if st.finished(): continue
        if state.name != st.symbol(): continue
        col.add(st.shift()).children.append(state)
        
class EarleyParser(Parser):
    def __init__(self, grammar, start_symbol):
        super().__init__(canonical(grammar), start_symbol)

    # http://courses.washington.edu/ling571/ling571_fall_2010/slides/parsing_earley.pdf
    # https://github.com/tomerfiliba/tau/blob/master/earley3.py
    @staticmethod
    def _parse(words, grammar, start):
        # Aycock 2002 Practical Earley Parsing -- treatment of epsilon
        epsilon = nullable(grammar)
        # Assumes that the start symbol has exactly one alternative
        alt = tuple(*grammar[start])
        chart = [Column(i, tok) for i,tok in enumerate([None, *words])]
        chart[0].add(State(start, alt, 0, chart[0], []))

        for i, col in enumerate(chart):
            for state in col.states:
                if state.finished():
                    complete(col, state, grammar)
                else:
                    sym = state.symbol()
                    if sym in grammar:
                        predict(col, sym, grammar)
                        if sym in epsilon:
                            # note that precomputation of epsilon derivation can result in infinite
                            # loops for certain grammars. Hence, we mark a nullable non-terminal
                            # but do not expand it.
                            col.add(state.shift()).children.append(State(sym + '*', tuple(), 0, col))
                    else:
                        if i + 1 >= len(chart): continue
                        scan(chart[i+1], state, sym)
        return chart

    def parse(self, text):
        # Returns the Earley chart, with one column per input position.
        # Each input character is one token, so terminals must be single characters.
        return self._parse(list(text), self.grammar, self.start_symbol)

In [None]:
def process_expr(expr, children, grammar):
    terms = iter([(i,[]) for i in expr if i not in grammar])
    nts = iter([node_translator(i, grammar) for i in  children])
    return [next(terms if i not in grammar else nts) for i in expr]

def node_translator(state, grammar):
    return (state.name, process_expr(state.expr, state.children, grammar))

In [None]:
new_grammar = {k: [shrink(split(e)) for e in EXPR_GRAMMAR[k]] for k in EXPR_GRAMMAR}
table = EarleyParser._parse(list('1+2+3'), new_grammar, '<start>')
states = [st for st in table[-1].states if st.name == '<start>' and st.finished()]
for state in states:
    display_tree(node_translator(state, new_grammar))

#### Ambiguous grammars generate parse forests

In [None]:
grammar = {
    '<start>': ['<A>'],
    '<A>': ['<A>+<A>', 'a'],
}

In [None]:
new_grammar = {k: [shrink(split(e)) for e in grammar[k]] for k in grammar}
table = EarleyParser._parse(list('a+a+a'), new_grammar, '<start>')
states = [st for st in table[-1].states if st.name == '<start>' and st.finished()]
for state in states:
    display_tree(node_translator(state, new_grammar))
