# Mining Input Grammars

So far, the grammars we have seen have been mostly specified manually – that is, you (or the person knowing the input format) had to design and write a grammar in the first place.  While the grammars we have seen so far have been rather simple, creating a grammar for complex inputs can involve quite some effort.  In this chapter, we therefore introduce techniques that automatically _mine_ grammars from programs – by executing the programs and observing how they process which parts of the input.  In conjunction with a grammar fuzzer, this allows us to (1) take a program, (2) extract its input grammar, and (3) fuzz it with high efficiency and effectiveness.

**Prerequisites**

* You should have read the [chapter on grammars](Grammars.ipynb).
* The [chapter on configuration fuzzing](ConfigurationFuzzer.ipynb) introduces grammar mining for configuration options, as well as observing variables and values during execution.
* The concept of parsing from the [chapter on parsers](Parser.ipynb) is also useful.

In [None]:
import fuzzingbook_utils

In [None]:
import sys

Consider the `process_vehicle()` and `process_inventory()` functions from the [chapter on parsers](Parser.ipynb):

In [None]:
def process_van(year, company, model):
    desc = "We have a %s %s van from %s vintage." % (company, model, year)
    iyear = int(year)
    if iyear > 2010:
        return "%s\nIt is a recent model!" % desc
    else:
        return "%s\nIt is an old but reliable model!" % desc

def process_car(year, company, model):
    desc = "We have a %s %s car from %s vintage." % (company, model, year)
    iyear = int(year)
    if iyear > 2016:
        return "%s\nIt is a recent model!" % desc
    else:
        return "%s\nIt is an old but reliable model!" % desc

def process_vehicle(vehicle):
    year, kind, company, model, *_ = vehicle.split(',')
    if kind == 'van':
        return process_van(year, company, model)
    elif kind == 'car':
        return process_car(year, company, model)
    else:
        raise Exception('Invalid entry')


def process_inventory(inventory):
    result = []
    for vehicle in inventory.split('\n'):
        r = process_vehicle(vehicle)
        result.append(r)
    return "\n".join(result)

A few sample inputs to `process_inventory()` are given below:

In [None]:
inventory = """\
1997,van,Ford,E350
2000,car,Mercury,Cougar
1999,car,Chevy,Venture\
"""
print(process_inventory(inventory))

We found from the [chapter on parsers](Parser.ipynb) that coarse grammars do not work well for fuzzing when the input format includes details expressed only in code. That is, even though we have the formal specification of CSV files ([RFC 4180](https://tools.ietf.org/html/rfc4180)), the inventory system includes further rules as to what is expected at each index of the CSV file. The solution of simply recombining existing inputs, while practical, is incomplete. In particular, it relies on a formal input specification being available in the first place. However, we have no assurance that the program obeys the input specification given.

One of the ways out of this predicament is to interrogate the program under test as to what its input specification is. That is, if the program under test is written in a recursive descent style, with specific methods responsible for handling specific parts of the input, one can recover the parse tree, by observing the process of parsing. Further, one can recover a reasonable approximation of the grammar by abstraction from multiple input trees.

The idea is as follows:
* We assume that the program is written such that specific methods are responsible for parsing specific fragments of the input. This includes almost all ad hoc parsers.
* We hook into the Python execution and observe the values of variables as they get generated in different methods.
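
As a minimal sketch of this idea (not the miner developed in this chapter), the following code hooks into Python execution with `sys.settrace()` and collects every local string variable whose value occurs in the input. The names `parse_pair()` and `observe()` are hypothetical and used only for illustration.

```python
import sys

def parse_pair(text):
    # A tiny ad hoc parser: one statement handles one input fragment
    key, value = text.split('=')
    return key, value

def observe(function, inputstr):
    """Run `function(inputstr)`; return string locals that occur in the input."""
    seen = {}

    def tracer(frame, event, arg):
        for name, value in frame.f_locals.items():
            if isinstance(value, str) and value in inputstr:
                seen[name] = value
        return tracer  # keep tracing inside this frame

    old_trace = sys.gettrace()
    sys.settrace(tracer)
    try:
        function(inputstr)
    finally:
        sys.settrace(old_trace)  # always restore the previous tracer
    return seen

observed = observe(parse_pair, "name=fuzzer")
print(observed)
```

Each observed variable names a fragment of the input; this is the raw material from which a parse tree can be assembled.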

## A Simple Grammar Miner

Say we want to obtain the grammar for the function `process_inventory()` which is our function under test.

### Function Under Test

In [None]:
VEHICLES = inventory.split('\n')

Simple miners can be defeated by lookaheads, which use part of an input fragment to decide whether the fragment should be processed. It is in our interest to avoid tracking those, so we ignore all strings that are not longer than `LOOKAHEAD` characters.

In [None]:
LOOKAHEAD=2
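
To illustrate the effect of this threshold, here is a small sketch, assuming a hypothetical helper `worth_tracking()` that applies the same test the tracker uses later (longer than `LOOKAHEAD` and contained in the input).

```python
LOOKAHEAD = 2  # same threshold as in the cell above

def worth_tracking(value, inputstr, lookahead=LOOKAHEAD):
    # Short strings such as ',' or '19' match the input in many places
    # and are typically lookahead values rather than parsed fragments.
    return (isinstance(value, str)
            and len(value) > lookahead
            and value in inputstr)

inputstr = '1997,van,Ford,E350'
candidates = ['1997', 'van', ',', '19', 'Ford', 'Audi']
kept = [c for c in candidates if worth_tracking(c, inputstr)]
print(kept)
```

Only `'1997'`, `'van'`, and `'Ford'` pass: `','` and `'19'` are too short, and `'Audi'` does not occur in the input.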

### Recording Occurrence of Input Values

We have a few inputs that can be used, as listed below:

#### Get qualified name of a variable

In [None]:
class Context:
    def __init__(self, frame, track_caller=True):
        self.method = self._method(frame)
        self.parameter_names = self._get_parameters(frame)
        self.file_name = self._file_name(frame)
        self.parent = Context(frame.f_back,
                              False) if track_caller and frame.f_back else None

    def _get_parameters(self, frame):
        return [
            frame.f_code.co_varnames[i]
            for i in range(frame.f_code.co_argcount)
        ]

    def _file_name(self, frame):
        return frame.f_code.co_filename
    
    def _method(self, frame):
        return frame.f_code.co_name

    def all_vars(self, frame):
        return frame.f_locals
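
The attributes used by `Context` can be inspected on any live frame. The following sketch, with the hypothetical helpers `describe_frame()` and `sample()`, shows what `co_name`, `co_varnames`, and `co_argcount` yield for a function with two parameters and one additional local variable.

```python
import inspect

def describe_frame(frame):
    code = frame.f_code
    return {
        'method': code.co_name,
        # The first co_argcount entries of co_varnames are the parameters
        'parameters': list(code.co_varnames[:code.co_argcount]),
        'file': code.co_filename,
    }

def sample(year, kind):
    local_only = year + kind  # a local variable, not a parameter
    return describe_frame(inspect.currentframe())

info = sample('1997', 'van')
print(info['method'], info['parameters'])
```

Note that `local_only` does not show up among the parameters, which is what allows the tracker to tell parameters from other locals.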

The `Tracer` class below records all string variables and their values occurring during execution. The `Tracker` class that follows keeps only the *non trivial* ones, that is, those longer than `LOOKAHEAD` characters that also occur in the input.

### Tracer

In [None]:
class Tracer:
    def __init__(self, inputstr, files=[]):
        self.inputstr, self.files, self.trace = inputstr, files, []

    def __enter__(self):
        self.oldtrace = sys.gettrace()
        sys.settrace(self.trace_event)
        return self

    def __exit__(self, *args):
        sys.settrace(self.oldtrace)

    def tracing_var(self, k, v):
        return isinstance(v, str)

    def tracing_context(self, cxt, event, arg):
        if not self.files:
            return True
        return any(cxt.file_name.endswith(f) for f in self.files)

    def trace_event(self, frame, event, arg):
        cxt = Context(frame)
        if not self.tracing_context(cxt, event, arg):
            return self.trace_event

        my_vars = [(k, v) for k, v in cxt.all_vars(frame).items()
                   if self.tracing_var(k, v)]
        self.trace.append((event, arg, cxt, my_vars))
        return self.trace_event

    def __call__(self):
        return self.inputstr

### Tracker

In [None]:
class Tracker:
    def __init__(self, inputstr, trace, **kwargs):
        self.the_vars = {}
        self.trace = trace
        self.inputstr = inputstr
        self.options(kwargs)
        self.process()
        
    def options(self, kwargs):
        pass

    def include(self, var, value):
        return len(value) > LOOKAHEAD and value in self.inputstr

    def track_event(self, event, arg, cxt, my_vars):
        self.the_vars.update({k: v for k, v in my_vars if self.include(k, v)})

    def process(self):
        for event, arg, cxt, my_vars in self.trace:
            self.track_event(event, arg, cxt, my_vars)

### Trace

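The `Tracer` context manager hooks into the Python trace functionality (`sys.settrace()`); the `Tracker` then extracts the relevant variables from the recorded trace.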

In [None]:
with Tracer(VEHICLES[0]) as tracer:
    process_vehicle(tracer())

tracker = Tracker(tracer.inputstr, tracer.trace)
for k,v in tracker.the_vars.items():
    print(k, '=', repr(v))

### Extracting a Derivation Tree

In [None]:
from Grammars import START_SYMBOL, syntax_diagram

In [None]:
from GrammarFuzzer import GrammarFuzzer, FasterGrammarFuzzer, display_tree, tree_to_string

To build the tree, we convert each variable name into a grammar nonterminal. For each pair _VAR_, _VALUE_ found:

1. We search for occurrences of _VALUE_ in the grammar
2. We replace them by <_VAR_>
3. We add a new rule <_VAR_> $\rightarrow$ _VALUE_ to the grammar
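
The replacement in steps 1 and 2 can be sketched in isolation. The helper `substitute()` below is a hypothetical, simplified stand-in for the splitting logic of the miner: it splits every token at each occurrence of the value and places the nonterminal in between.

```python
def substitute(tokens, value, nonterminal):
    """Replace every occurrence of `value` in `tokens` by `nonterminal`."""
    new_tokens = []
    for token in tokens:
        parts = token.split(value)
        for i, part in enumerate(parts):
            if i > 0:
                new_tokens.append(nonterminal)  # one per occurrence
            if part:
                new_tokens.append(part)         # drop empty fragments
    return tuple(new_tokens)

rule = ('1997,van,Ford,E350',)
rule = substitute(rule, 'Ford', '<company>')
rule = substitute(rule, 'E350', '<model>')
print(rule)
```

After both substitutions, the start rule consists of the remaining literal text interleaved with `<company>` and `<model>`; step 3 would add the rules `<company>` $\rightarrow$ `Ford` and `<model>` $\rightarrow$ `E350`.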

In [None]:
class Miner:
    def __init__(self, my_input, my_assignments, **kwargs):
        self.my_input = my_input
        self.my_assignments = my_assignments
        self.log = kwargs.get('log') or False
        self.tree = self.get_derivation_tree()

    def logger(self, indent, var):
        if self.log:
            print('\t' * indent, var)

    def nonterminal(self, var):
        return "<" + var.lower() + ">"

    def to_tree(self, key=START_SYMBOL):
        if key not in self.tree:
            return (key, [])
        children = [self.to_tree(c) for c in self.tree[key]]
        return (key, children)

    def get_derivation_tree(self):
        tree = {START_SYMBOL: (self.my_input, )}
        my_assignments = self.my_assignments.copy()

        while True:
            new_rules = []
            for var, value in my_assignments.items():
                self.logger(0, "%s = %s" % (var, value))
                for key, repl in tree.items():
                    self.logger(1, "%s : %s" % (key, repl))
                    if not any(value in t for t in repl):
                        continue
                    alt_key = self.nonterminal(var)
                    new_arr = []
                    for k, token in enumerate(repl):
                        if not value in token:
                            new_arr.append(token)
                        else:
                            arr = token.split(value)
                            new_arr.extend(
                                list(sum(zip(arr,
                                             len(arr) * [alt_key]), ()))[:-1])
                    tree[key] = tuple(i for i in new_arr if i)
                    new_rules.append((var, alt_key, value))

            if not new_rules:
                break  # Nothing to expand anymore

            for (var, alt_key, value) in new_rules:
                tree[alt_key] = (value, )
                self.logger(0, "+%s = %s" % (alt_key, value))

                # Do not expand this again
                del my_assignments[var]

        return {key: values for key, values in tree.items()}

First, trace the execution:

In [None]:
trees = []
for VEHICLE in VEHICLES:
    print(VEHICLE)
    with Tracer(VEHICLE) as tracer:
        process_inventory(tracer())
    assignments = Tracker(tracer.inputstr, tracer.trace).the_vars
    trees.append((tracer.inputstr, assignments))
    for var, val in assignments.items():
        print(var + " = " + repr(val))
    print()

In [None]:
csv_dt = []
for inputstr, assignments in trees:
    print(inputstr)
    dt = Miner(inputstr, assignments)
    csv_dt.append(dt)
    display_tree(dt.to_tree())

In [None]:
URLS = [
    'http://user:pass@www.google.com:80/?q=path#ref',
    'https://www.cispa.saarland:80/',
    'http://www.fuzzingbook.org/#News',
    'ftp://freebsd.org/releases/5.8'
]

In [None]:
from urllib.parse import urlparse, clear_cache

In [None]:
url_dt = []
for URL in URLS:
    clear_cache()
    print(URL)
    with Tracer(URL, ['urllib/parse.py']) as tracer:
        urlparse(tracer())
    dt = Miner(tracer.inputstr, Tracker(tracer.inputstr, tracer.trace).the_vars)
    url_dt.append(dt)
    display_tree(dt.to_tree())

### Recovering Grammar from Derivation Trees

In [None]:
class Infer:
    def __init__(self):
        self.grammar = {}

In [None]:
class Infer(Infer):
    def add_tree(self, t):
        merged_grammar = {}
        for key in list(self.grammar.keys()) + list(t.tree.keys()):
            alternates = set(self.grammar.get(key, []))
            if key in t.tree:
                alternates.add(''.join(t.tree[key]))
            merged_grammar[key] = list(alternates)
        self.grammar = merged_grammar

In [None]:
i = Infer()
for dt in csv_dt:
    i.add_tree(dt)

In [None]:
syntax_diagram(i.grammar)
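
What the merge does can be seen on two hand-written trees. This is a sketch under simplifying assumptions: `merge_rules()` is a hypothetical helper that works on plain dictionaries and keeps alternatives in an ordered list rather than a set, so that the result is deterministic.

```python
def merge_rules(grammar, tree):
    """Add the expansions found in `tree` as alternatives to `grammar`."""
    merged = {}
    for key in list(grammar) + list(tree):
        alternatives = list(grammar.get(key, []))
        if key in tree:
            expansion = ''.join(tree[key])
            if expansion not in alternatives:
                alternatives.append(expansion)
        merged[key] = alternatives
    return merged

tree1 = {'<start>': ('<year>', ',van,', '<company>'),
         '<year>': ('1997',), '<company>': ('Ford',)}
tree2 = {'<start>': ('<year>', ',car,', '<company>'),
         '<year>': ('2000',), '<company>': ('Mercury',)}

merged = merge_rules(merge_rules({}, tree1), tree2)
for key, alternatives in merged.items():
    print(key, '::=', ' | '.join(alternatives))
```

Each nonterminal collects one alternative per distinct expansion seen across the trees, which is how several concrete inputs generalize into a grammar.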

In [None]:
def recover_grammar(traces):
    m = Infer()
    for inputstr, trace in traces:
        dt = Miner(inputstr, Tracker(inputstr, trace).the_vars)
        m.add_tree(dt)
    return m.grammar

In [None]:
traces = []
for inputstr in URLS:
    clear_cache()
    with Tracer(inputstr, ['urllib/parse.py']) as tracer:
        urlparse(tracer())
    traces.append((tracer.inputstr, tracer.trace))
grammar = recover_grammar(traces)

In [None]:
syntax_diagram(grammar)

### Fuzzing

In [None]:
f = GrammarFuzzer(grammar)
for i in range(10):
    print(f.fuzz())

## Grammar Miner with Stack

### Input Stack

In [None]:
class InputStack(object):
    def __init__(self, i):
        self.original = i
        self.inputs = []
        
    def height(self):
        return len(self.inputs)

    def has(self, val):
        return any(val in var for var in self.inputs[-1].values())

    def ignored(self, val):
        return not (isinstance(val, str) and len(val) > LOOKAHEAD)

    def include(self, k, val):
        if self.ignored(val):
            return False
        return self.has(val) if self.inputs else val in self.original

    def push(self, inputs):
        my_inputs = {k: v for k, v in inputs.items() if self.include(k, v)}
        self.inputs.append(my_inputs)

    def pop(self):
        return self.inputs.pop()

### Vars

We proxy the dictionary so that it will only update if it does not already contain a value.

In [None]:
class Vars(object):
    def __init__(self, stack):
        self.defs = {START_SYMBOL: stack.original}
        self.istack = stack
        
    def set_kv(self, k, v):
        if k not in self.defs:
            self.defs[k] = v

    def update(self, values):
        for k, v in values.items():
            self.set_kv(k, v)
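
The effect of this "first definition wins" policy can be sketched without the stack. The class `FirstDefinition` below is a hypothetical, stripped-down illustration, not the `Vars` class itself.

```python
class FirstDefinition:
    def __init__(self):
        self.defs = {}

    def set_kv(self, k, v):
        # Later values for an already known key are ignored
        if k not in self.defs:
            self.defs[k] = v

    def update(self, values):
        for k, v in values.items():
            self.set_kv(k, v)

first = FirstDefinition()
first.update({'urlparse:url': 'http://www.fuzzingbook.org/#News'})
first.update({'urlparse:url': 'www.fuzzingbook.org', 'urlparse:scheme': 'http'})
print(first.defs)
```

The variable `urlparse:url` keeps the value it had when first seen, even though it is later reassigned to a shorter string.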

### Stack Tracker

In [None]:
class StackTracker(Tracker):
    def __init__(self, inputstr, trace, **kwargs):
        self.istack = InputStack(inputstr)
        self.the_vars = Vars(self.istack)
        self.trace = trace
        self.options(kwargs)
        self.process()

    def options(self, kwargs):
        # Use defaults only if the option is absent (`or True` would
        # silently override an explicit False)
        self.track_params = kwargs.get('track_params', True)
        self.track_vars = kwargs.get('track_vars', True)
        self.track_return = kwargs.get('track_return', False)

    def include(self, var, value):
        return self.istack.include(var, value)

    def get_params(self, cxt, all_vars):
        return {
            "%s:%s" % (cxt.method, k): v
            for k, v in all_vars if k in cxt.parameter_names
        }

    def on_call(self, arg, cxt, my_vars):
        my_parameters = {
            k: v
            for k, v in self.get_params(cxt, my_vars).items()
            if not self.istack.ignored(v)
        }
        self.istack.push(my_parameters)
        if self.track_params:
            self.the_vars.update(my_parameters)

    def on_line(self, arg, cxt, my_vars):
        if self.track_vars:
            qvars = {"%s:%s" % (cxt.method, k): v for k, v in my_vars}
            my_vars = {
                var: value
                for var, value in qvars.items() if self.include(var, value)
            }
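            if not self.track_params:
                # Drop parameters; names are qualified as "method:name"
                param_names = {
                    "%s:%s" % (cxt.method, p) for p in cxt.parameter_names
                }
                my_vars = {
                    var: value
                    for var, value in my_vars.items() if var not in param_names
                }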
            self.the_vars.update(my_vars)

    def on_return(self, arg, cxt, my_vars):
        self.istack.pop()
        self.on_line(arg, cxt, my_vars)
        if self.track_return:
            var = '(<-%s)' % cxt.method
            self.the_vars.update({var: arg})

    def track_event(self, event, arg, cxt, my_vars):
        if event == 'call':
            return self.on_call(arg, cxt, my_vars)

        if event == 'return':
            return self.on_return(arg, cxt, my_vars)

        if event == 'exception':
            return

        self.on_line(arg, cxt, my_vars)
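
The events that `track_event()` dispatches on are those reported by `sys.settrace()`. As a small sketch, the following code records the sequence of events for a two-line function; the names `add_suffix()` and `record()` are hypothetical.

```python
import sys

def add_suffix(text):
    result = text + '!'
    return result

events = []

def record(frame, event, arg):
    # Only record events that belong to the function of interest
    if frame.f_code.co_name == 'add_suffix':
        events.append(event)
    return record

old_trace = sys.gettrace()
sys.settrace(record)
try:
    add_suffix('hi')
finally:
    sys.settrace(old_trace)

print(events)
```

A call produces one `call` event, one `line` event per executed line, and a final `return` event, which is where the input stack is pushed and popped.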

The `StackTracker` is aware of trace events (`call`, `line`, and `return`). Let us apply it to the traces of our URLs:

In [None]:
url_traces = []
for inputstr in URLS:
    clear_cache()
    with Tracer(inputstr, ['urllib/parse.py']) as tracer:
        urlparse(tracer())
    sm = StackTracker(tracer.inputstr, tracer.trace)
    url_traces.append((tracer.inputstr, sm))
    for k,v in sm.the_vars.defs.items():
        print(k, v)
    print()

Note that in the following we do not account for parameters getting reassigned values.

### Miner

For each (VAR, VALUE) found:
* We search for occurrences of VALUE in the grammar
* We replace them by VAR
* We add a new rule VAR -> VALUE to the grammar

In [None]:
class Miner(Miner):
    def get_derivation_tree(self):
        my_assignments = self.my_assignments.copy()
        tree = {}
        for var, value in my_assignments.items():
            nt_var = var if var == START_SYMBOL else self.nonterminal(var)
            self.logger(0, "%s = %s" % (nt_var, value))
            if tree:
                append = False
                for key, repl in tree.items():
                    self.logger(1, "%s : %s" % (key, repl))
                    if not any(value in t for t in repl):
                        continue
                    new_arr = []
                    for k, token in enumerate(repl):
                        if not value in token:
                            new_arr.append(token)
                        else:
                            append = True
                            arr = token.split(value)
                            new_arr.extend(
                                list(sum(zip(arr,
                                             len(arr) * [nt_var]), ()))[:-1])
                    tree[key] = tuple(i for i in new_arr if i)
                if append:
                    self.logger(0, "+%s = %s" % (nt_var, value))
                    tree[nt_var] = (value, )
            else:
                tree[nt_var] = (value, )
        return {key: values for key, values in tree.items()}

In [None]:
clear_cache()
with Tracer(URLS[2], ['urllib/parse.py']) as tracer:
    urlparse(tracer())
sm = StackTracker(tracer.inputstr, tracer.trace)
dt = Miner(tracer.inputstr, sm.the_vars.defs)
display_tree(dt.to_tree())

In [None]:
def recover_grammar(traces):
    m = Infer()
    for inputstr, trace in traces:
        st = StackTracker(inputstr, trace)
        dt = Miner(inputstr, st.the_vars.defs)
        m.add_tree(dt)
    return m.grammar

In [None]:
traces = []
for inputstr in URLS:
    clear_cache()
    with Tracer(inputstr, ['urllib/parse.py']) as tracer:
        urlparse(tracer())
    traces.append((tracer.inputstr, tracer.trace))
grammar = recover_grammar(traces)
syntax_diagram(grammar)

## Tainted Miner

In [None]:
from InformationFlow import tstr
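
The tainted miner relies on strings that remember which characters of the input they came from. As a rough illustration only, the following sketch shows the principle with a hypothetical class `OriginStr`; it is not the `tstr` class from the information flow chapter, and the names `origin` and `origin_in()` are assumptions made for this example.

```python
class OriginStr(str):
    """A string that records the input index of each of its characters."""

    def __new__(cls, value, origin=None):
        obj = super().__new__(cls, value)
        obj.origin = (list(range(len(value)))
                      if origin is None else list(origin))
        return obj

    def __getitem__(self, key):
        text = str.__getitem__(self, key)
        if isinstance(key, slice):
            return OriginStr(text, self.origin[key])
        return OriginStr(text, [self.origin[key]])

    def origin_in(self, other):
        # True if all our characters stem from characters of `other`
        return set(self.origin) <= set(other.origin)

url = OriginStr('http://host')
host = url[7:11]
print(host, host.origin, host.origin_in(url))
```

Unlike the string inclusion test of the simple miner, such a check compares input positions, so two equal substrings from different places in the input are not confused.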

First, we expand any object to a list of variables.

In [None]:
def flatten(key, val):
    # Should we limit flattened objects to repr ~ tstr here or during call?
    tv = type(val)
    if tv in {int, float, complex, str, bytes, bytearray}:
        return [(key, val)]
    elif tv in {set, frozenset, list, tuple, range}:
        values = [e for i, elt in enumerate(val) for e in flatten(i, elt)]
        # Nested keys such as "0.1" are strings, hence %s rather than %d
        return [("%s.%s" % (key, i), v) for i, v in values]
    elif tv is dict:
        values = [e for k, elt in val.items() for e in flatten(k, elt)]
        return [("%s.%s" % (key, k), v) for k, v in values]
    elif tv is tstr:
        return [(key, val)]
    elif hasattr(val,'__dict__'):
        values = [e for k, elt in val.__dict__.items() for e in flatten(k, elt)]
        return [("%s.%s" % (key, k), v) for k, v in values]
    else:
        # Fall back to the representation of the object itself
        return [(key, repr(val))]
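
To see what such an expansion produces, consider the following simplified sketch. The function `flatten_simple()` is a hypothetical reduction of the idea that handles only scalars, lists, tuples, and dictionaries, and does not know about tainted strings.

```python
def flatten_simple(key, val):
    if isinstance(val, (str, int, float)):
        return [(key, val)]
    if isinstance(val, (list, tuple)):
        return [("%s.%s" % (key, k), v)
                for i, elt in enumerate(val)
                for k, v in flatten_simple(i, elt)]
    if isinstance(val, dict):
        return [("%s.%s" % (key, k), v)
                for name, elt in val.items()
                for k, v in flatten_simple(name, elt)]
    return [(key, repr(val))]

flat = flatten_simple('url', {'scheme': 'http', 'ports': ['80', '443']})
for name, value in flat:
    print(name, '=', repr(value))
```

Every leaf value receives a dotted name that encodes its path inside the container, so that it can be tracked like an ordinary variable.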

### Tainted Stack

For a simple miner, we do not need the input stack. All that we need is the ability to identify reassignments in variables. However, it makes life a little simpler if we can annotate variables with the stack depth.

In [None]:
class TaintedStack(InputStack):
    # height has ignored include push pop
    def __init__(self, i):  # same as original
        self.original = i
        self.inputs = []

    # has is used only with include
    def has(self, val):
        assert False

    def ignored(self, val):
        return not (isinstance(repr(val), tstr))

    # used from push (and Tracker)
    def include(self, k, val):
        if self.ignored(val):
            return False
        return val.taint_in(self.original)

One way to account for custom data structures other than containers is to rely on their `repr()`. That is, both `str()` and `repr()` rely on string methods that we have overridden in the tainted string. Hence, if any of the string fragments are tainted, the returned string will also be tainted.

### Tainted Vars

In [None]:
class TaintedVars(Vars):
    def __init__(self, stack):
        self.accessed_scop_var = {}
        self.taint_register = {}
        super().__init__(stack)

In [None]:
class TaintedVars(TaintedVars):
    def update(self, values):
        vals = [(k1, v1) for k, v in values.items() for k1, v1 in flatten(k, v)]
        for k, v in vals:
            self.set_kv(k, v)

In [None]:
class TaintedVars(TaintedVars):
    def var_init(self, var):
        if var not in self.accessed_scop_var:
            self.accessed_scop_var[var] = 0

    def var_assign(self, var):
        self.accessed_scop_var[var] += 1

    def var_name(self, var):
        t = self.accessed_scop_var[var]
        return "%s[%d]" % (var, t) # TODO: figure out how to deal with stack/vars stack height

    def set_kv(self, var, val):
        self.var_init(var)
        sa_var = self.var_name(var)
        if sa_var not in self.defs:
            self.defs[sa_var] = val
            self.taint_register[str(val.taint)] = sa_var
        else:  # possible reassignment
            if self.taint_register.get(str(val.taint)) is None:  # a change in taint
                self.var_assign(var)
                sa_var = self.var_name(var)
                self.defs[sa_var] = val
                self.taint_register[str(val.taint)] = sa_var

### Tainted Tracer

In [None]:
class TaintedTracer(Tracer):
    def __init__(self, inputstr, files=[]):
        self.inputstr = tstr(inputstr, parent=None)
        self.trace = []
        self.files = files
  
    def tracing_var(self, k, v):
        return isinstance(repr(v), tstr)

### Tainted Tracker

In [None]:
class TaintedTracker(StackTracker):
    def __init__(self, inputstr, trace, **kwargs):
        self.istack = TaintedStack(inputstr)
        self.the_vars = TaintedVars(self.istack)
        self.trace = trace
        self.options(kwargs)
        self.process()

In [None]:
def recover_grammar(traces):
    m = Infer()
    for inputstr, trace in traces:
        st = TaintedTracker(inputstr, trace)
        dt = Miner(inputstr, st.the_vars.defs)
        m.add_tree(dt)
    return m.grammar

In [None]:
clear_cache()
with TaintedTracer(URLS[0], ['urllib/parse.py']) as tracer:
    urlparse(tracer())
sm = TaintedTracker(tracer.inputstr, tracer.trace)
#grammar = recover_grammar(traces)
for k, v in sm.the_vars.defs.items():
    print("%s = <%s> \t %s" % (k,v, len(v.taint)))

### Tainted Miner

In [None]:
class TaintedMiner(Miner):
    def get_derivation_tree(self):
        my_assignments = self.my_assignments.copy()
        root = (START_SYMBOL[1:-1], my_assignments[START_SYMBOL], [])
        del my_assignments[START_SYMBOL]
        for k,v in my_assignments.items():
            self.insert_into_tree(root, (k,v, []))
        return self.once_over(root)

In [None]:
class TaintedMiner(TaintedMiner):
    def insert_into_tree(self, root, elt):
        # For each child of root, see if our taints are a subset of
        # the child's taints. If none matches, add ourselves as a
        # first-level child.
        
        ek, ev, eitems = elt
        first_level = True
        for child in root[2]:
            if ev.taint_in(child[1]):
                first_level = False
                # do not break. There may be overlaps
                self.insert_into_tree(child, elt)
        if first_level:
            root[2].append(elt)
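
The containment-based insertion can be sketched with plain index ranges in place of taints. This is an illustration under that assumption only: `contains()` and `insert()` are hypothetical helpers, and each element is a triple of name, `(start, end)` span, and list of children.

```python
def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def insert(root, elt):
    name, span, children = root
    placed = False
    for child in children:
        if contains(child[1], elt[1]):
            placed = True
            insert(child, elt)  # no break: several children may match
    if not placed:
        children.append(elt)

url = 'http://www.fuzzingbook.org/#News'
root = ('start', (0, len(url)), [])
insert(root, ('scheme', (0, 4), []))
insert(root, ('netloc', (7, 26), []))
insert(root, ('domain', (11, 22), []))  # lies inside the netloc span

for name, span, children in root[2]:
    print(name, url[span[0]:span[1]], [c[0] for c in children])
```

The element `domain` ends up below `netloc` because its span is contained in the span of `netloc`, while `scheme` and `netloc` remain first-level children.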

Ensure that all children are in the right positions. In particular, assert that there are no overlaps. If there are overlaps, we might have to choose one of the overlapping elements based on the height of the tree of that element. Remember that each element is inserted as a child of *all* matching elements, and not just the first matching one. Hence, we can afford to choose between the trees and need not worry about transferring elements across.

In [None]:
class TaintedMiner(TaintedMiner):
    def once_over(self, elt):
        k, v, children = elt
        new_children = []
        old_children = children
        while old_children:
            ochild, *old_children = old_children
            # look for possible overlap also here TODO
            possible_parents = [
                child for child in children
                if ochild[1].taint_in(child[1]) and ochild[0:1] != child[0:1]
            ]
            assert len(possible_parents) <= 1
            if possible_parents:
                for child in possible_parents:
                    self.insert_into_tree(child, ochild)
            else:
                new_children.append(ochild)
        return (k, v, [self.once_over(c) for c in new_children])

In [None]:
def to_tree(tree):
    children = [to_tree(c) for c in tree[2]]
    if not children:
        return ("<%s>" % tree[0], [(tree[1], [])])
    return ("<%s>" % tree[0], children)

In [None]:
dt = TaintedMiner(tracer.inputstr, sm.the_vars.defs)
display_tree(to_tree(dt.tree))

## Lessons Learned

* Given a set of inputs, we can learn an input grammar by examining variable values during execution.
* The resulting grammars can be used directly for fuzzing.
* TODO: make the point that our initial implementation is about learning regular grammar not CFG because we do not know how to handle mutually recursive and looping procedures
* TODO: Use process_vehicle as a pervading example.
* TODO: Mention that control flow dependencies are not tracked with dynamic taints, but they are tracked by the simple miner using string inclusion.

## Next Steps

* [use _mutations_ on existing inputs to get more valid inputs](MutationFuzzer.ipynb)
* [use _grammars_ (i.e., a specification of the input format) to get even more valid inputs](Grammars.ipynb)
* [reduce _failing inputs_ for efficient debugging](Reducer.ipynb)


## Background

\cite{Lin2008}
