# Parser

Parsing is one of the core techniques in fuzzing. You need a parser to take a structured input apart and reuse its parts in other inputs without breaking the validity of those inputs.

## Synopsis

```python
import parser as P
my_grammar = {'<start>': [['1', '<A>'],
                          ['2']
                         ],
              '<A>'    : [['a']]}
my_parser = P.LL1Parser(my_grammar)
for tree in my_parser.parse_on(text='1a', start_symbol='<start>'):
    print(P.format_parsetree(tree))
```



As in traditional implementations,
there can be only one expansion rule for the `<start>` symbol. We work around
this restriction by constructing as many charts as there are expansion
rules, and returning all parse trees.

In [None]:
grammar = {
    '<start>': [['<expr>']],
    '<expr>': [
        ['<term>', '+', '<expr>'],
        ['<term>', '-', '<expr>'],
        ['<term>']],
    '<term>': [
        ['<fact>', '*', '<term>'],
        ['<fact>', '/', '<term>'],
        ['<fact>']],
    '<fact>': [
        ['<digits>'],
        ['(','<expr>',')']],
    '<digits>': [
        ['<digit>','<digits>'],
        ['<digit>']],
    '<digit>': [["%s" % str(i)] for i in range(10)],
}
START = '<start>'

In [None]:
import src.utils as utils
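
The helpers in `src.utils` are not shown in this notebook. The following is a minimal sketch of what `is_nt` and `tree_to_str` are assumed to do, based only on how they are used here; the real module may differ. It also shows the tree shape used throughout: a node is a pair `(key, children)`, and a literal has an empty list of children.

```python
# Hypothetical stand-ins for two helpers from src.utils (assumed behaviour).

def is_nt(key):
    """Treat '<name>' as a nonterminal; anything else is a literal."""
    return len(key) > 2 and key[0] == '<' and key[-1] == '>'

def tree_to_str(tree):
    """Concatenate the literal leaves of a (key, children) tree."""
    key, children = tree
    if not children:
        # A leaf: a literal contributes its text, an empty nonterminal nothing.
        return '' if is_nt(key) else key
    return ''.join(tree_to_str(child) for child in children)

# A hand-built parse tree for the input '1a'.
sample = ('<start>', [('1', []), ('<A>', [('a', [])])])
print(tree_to_str(sample))  # prints 1a
```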

## Summary

The idea behind a simple $LL(1)$ recognizer is to unify the string you want to match with a key in the grammar. If the key is not present in the grammar, it is a literal, which is matched against the string by equality at the current position. If the key is present in the grammar, get the corresponding productions (rules) for that key, and try to unify each rule in turn with the string, returning the first one that succeeds.

In [None]:
import sys
import functools

class LL1Parser:
    def __init__(self, grammar):
        self.grammar = grammar

    @functools.lru_cache(maxsize=None)
    def unify_key(self, key, text, at=0):
        # A literal must match the text verbatim at position `at`.
        if not utils.is_nt(key):
            if text.startswith(key, at): return (at + len(key), (key, []))
            else: return (at, None)
        # A nonterminal: try its rules in order; the first match wins.
        rules = self.grammar[key]
        for rule in rules:
            l, res = self.unify_rule(rule, text, at)
            if res is not None: return l, (key, res)
        return (at, None)

Unifying a rule works similarly. We take each token in the rule and try to unify it with the string at the current position, relying on `unify_key` for each token. If the unification of any token fails, we return empty-handed.

In [None]:
class LL1Parser(LL1Parser):
    def unify_rule(self, parts, text, tfrom):
        results = []
        for part in parts:
            tfrom, res = self.unify_key(part, text, tfrom)
            if res is None: return tfrom, None
            results.append(res)
        return tfrom, results

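    def parse_on(self, text, start_symbol):
        till, result = self.unify_key(start_symbol, text, 0)
        # Reject a failed parse, or one that leaves input unconsumed.
        if result is None or till != len(text):
            raise SyntaxError('at ' + repr(text[till:]))
        yield result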

In [None]:
small_grammar = {'<start>': [['1', '<A>'],
                             ['2']
                            ],
                 '<A>'    : [['a']]}
my_parser = LL1Parser(small_grammar)
for tree in my_parser.parse_on(text='1a', start_symbol='<start>'):
    utils.display_tree(tree)

In [None]:
my_parser = LL1Parser(grammar)
tree = list(my_parser.parse_on(text='(8/3)*49', start_symbol='<start>'))[0]
utils.display_tree(tree)

In [None]:
target = tree[1][0][1][0][1][0][1]
utils.display_tree(target[1])

In [None]:
subtree = list(my_parser.parse_on(text='2+1', start_symbol='<expr>'))[0]
utils.display_tree(subtree)

In [None]:
target[1] = subtree

In [None]:
utils.display_tree(tree)

In [None]:
utils.tree_to_str(tree)
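
One caveat with the in-place assignment `target[1] = subtree`: because `unify_key` is wrapped in `functools.lru_cache`, the tree returned by the parser is the same object that sits in the cache, so parsing the same text again with the same parser would return the edited tree. A minimal sketch of a non-mutating alternative follows. The helpers `replace_at` and `to_text` are hypothetical names, not part of `src.utils`, and the trees are built by hand so the block runs on its own.

```python
def replace_at(tree, path, subtree):
    """Return a copy of `tree` with the node at `path` replaced by `subtree`.

    `path` is a list of child indexes; an empty path replaces the root.
    Only the nodes along the path are copied; the original is untouched.
    """
    if not path:
        return subtree
    key, children = tree
    head, rest = path[0], path[1:]
    new_children = list(children)
    new_children[head] = replace_at(children[head], rest, subtree)
    return (key, new_children)

def to_text(tree):
    """Join the leaves of a tree; assumes every leaf is a literal."""
    key, children = tree
    if not children:
        return key
    return ''.join(to_text(child) for child in children)

# Hand-built trees standing in for parser output.
original = ('<fact>', [('(', []), ('<expr>', [('8', [])]), (')', [])])
replacement = ('<expr>', [('2', []), ('+', []), ('1', [])])

edited = replace_at(original, [1], replacement)
print(to_text(edited))    # prints (2+1)
print(to_text(original))  # prints (8)
```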

What if you want to parse a wider variety of grammars? For example, the following grammar describes the same language, but `LL1Parser` cannot parse it, because `<expr>` is left-recursive.
```
grammar = {
    '<start>': [['<expr>']],
    '<expr>': [
        ['<expr>', '+', '<expr>'],
        ['<expr>', '-', '<expr>'],
        ['<expr>', '*', '<expr>'],
        ['<expr>', '/', '<expr>'],
        ['(','<expr>',')'],
        ['<digits>']],
    '<digits>': [
        ['<digit>','<digits>'],
        ['<digit>']],
    '<digit>': [["%s" % str(i)] for i in range(10)],
}
START = '<start>'
```
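
To see why such a grammar is a problem, here is a minimal sketch of the same unification idea as plain functions, without memoization, run on a small left-recursive grammar. The names `left_recursive`, `unify_key`, and `unify_rule` are local to this example, and literals are detected by absence from the grammar rather than through `utils.is_nt`. Unifying `<expr>` at position 0 immediately asks to unify `<expr>` at position 0 again, so the recursion never consumes input.

```python
left_recursive = {
    '<expr>': [['<expr>', '+', '<digit>'],
               ['<digit>']],
    '<digit>': [[str(i)] for i in range(10)],
}

def unify_key(grammar, key, text, at=0):
    # A key that is not in the grammar is a literal.
    if key not in grammar:
        if text.startswith(key, at):
            return at + len(key), (key, [])
        return at, None
    for rule in grammar[key]:
        end, children = unify_rule(grammar, rule, text, at)
        if children is not None:
            return end, (key, children)
    return at, None

def unify_rule(grammar, rule, text, at):
    children = []
    for part in rule:
        at, node = unify_key(grammar, part, text, at)
        if node is None:
            return at, None
        children.append(node)
    return at, children

try:
    unify_key(left_recursive, '<expr>', '1+2')
    outcome = 'parsed'
except RecursionError:
    # The first rule re-enters <expr> at the same position forever.
    outcome = 'RecursionError'
print(outcome)  # prints RecursionError
```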

In the case of such grammars, we can use one of the general context-free parsers. These include, among others:
* Earley parser (in this repository)
* GLL parser
* GLR parser
* CYK parser
* Valiant parser

The tradeoff is that each of these parsers is costly compared to the simple `LL1Parser` ($O(N^3)$ or beyond, compared to $O(N)$ for `LL1Parser`).
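
To give a feel for where the cubic cost comes from, here is a minimal sketch of a CYK recognizer. It is an illustration only, not the implementation used in this repository. It assumes the grammar is already in Chomsky normal form, and the function name `cyk_recognize` and the rule encoding are made up for this example. The three nested loops over span length, start position, and split point are what give the $O(N^3)$ behaviour.

```python
def cyk_recognize(text, terminal_rules, binary_rules, start):
    """Return True if `start` derives `text` under a grammar in CNF.

    terminal_rules: list of (lhs, character) pairs, for rules A -> 'c'
    binary_rules:   list of (lhs, (left, right)) pairs, for rules A -> B C
    """
    n = len(text)
    if n == 0:
        return False
    # table[i][length] holds the nonterminals that derive text[i:i+length].
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(text):
        table[i][1] = {lhs for lhs, t in terminal_rules if t == ch}
    for length in range(2, n + 1):              # span length
        for i in range(n - length + 1):         # span start
            for split in range(1, length):      # split point inside the span
                left = table[i][split]
                right = table[i + split][length - split]
                for lhs, (b, c) in binary_rules:
                    if b in left and c in right:
                        table[i][length].add(lhs)
    return start in table[0][n]

# A CNF grammar for strings of the form a^n b^n with n >= 1.
terminals = [('A', 'a'), ('B', 'b')]
binaries = [('S', ('A', 'B')), ('S', ('A', 'C')), ('C', ('S', 'B'))]

print(cyk_recognize('aabb', terminals, binaries, 'S'))  # prints True
print(cyk_recognize('aab', terminals, binaries, 'S'))   # prints False
```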
