# Mining Input Grammars

So far, the grammars we have seen have been mostly specified manually – that is, you (or the person knowing the input format) had to design and write a grammar in the first place. While the grammars we have seen so far have been rather simple, creating a grammar for complex inputs can involve quite some effort. In this chapter, we therefore introduce techniques that automatically _mine_ grammars from programs – by executing the programs and observing how they process which parts of the input. In conjunction with a grammar fuzzer, this allows us to (1) take a program, (2) extract its input grammar, and (3) fuzz it with high efficiency and effectiveness.

**Prerequisites**

* You should have read the [chapter on grammars](Grammars.ipynb).
* The [chapter on configuration fuzzing](ConfigurationFuzzer.ipynb) introduces grammar mining for configuration options, as well as observing variables and values during execution.

In [None]:
import fuzzingbook_utils

In [None]:
import sys

## A Simple Grammar Miner

Say we want to obtain the grammar for the function `urlparse` from the *Python* distribution.

### Function Under Test

In [None]:
from urllib.parse import urlparse, clear_cache
FUNCTION = urlparse

### Recording Occurrence of Input Values

We use two objects to record what happens during execution: a `Tracer`, which keeps track of the current input string and the raw trace, and a `Tracker`, whose `the_vars` keeps track of variable assignments.

We have a few inputs that can be used, as listed below:

In [None]:
INPUTS = [
    'http://user:pass@www.google.com:80/?q=path#ref',
    'https://www.cispa.saarland:80/',
    'http://www.fuzzingbook.org/#News',
]

#### Get qualified name of a variable

In [None]:
class Context:
    def __init__(self, frame, track_caller=True):
        self.method = self._method(frame)
        self.parameter_names = self._get_parameters(frame)
        self.file_name = self._file_name(frame)
        self.parent = Context(frame.f_back,
                              False) if track_caller and frame.f_back else None

    def _get_parameters(self, frame):
        return [
            frame.f_code.co_varnames[i]
            for i in range(frame.f_code.co_argcount)
        ]

    def _file_name(self, frame):
        return frame.f_code.co_filename
    
    def _method(self, frame):
        return frame.f_code.co_name

    def all_vars(self, frame):
        return frame.f_locals
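
To see what `Context` reads from a frame, here is a small self-contained sketch. The functions `describe_frame` and `parse_pair` are hypothetical names used only for illustration; the code-object attributes `co_name`, `co_varnames`, and `co_argcount` are standard CPython.

```python
import sys

def describe_frame(frame):
    # In co_varnames, the positional parameters come first,
    # followed by the remaining local variables
    code = frame.f_code
    return code.co_name, list(code.co_varnames[:code.co_argcount])

def parse_pair(text, sep):
    local_only = text.split(sep)  # a local variable, not a parameter
    return describe_frame(sys._getframe())

name, params = parse_pair("key=value", "=")
print(name, params)
```

Note that `co_argcount` counts only positional parameters; keyword-only parameters and `*args` are not included, which is a limitation `Context` shares.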

The `traceit()` method of `Tracer` records all string variables and their values at every trace event. The `Tracker` then keeps only the *non-trivial* ones: strings longer than 2 characters that occur in the input.

In [None]:
class Tracer:
    def __init__(self, inputstr):
        self.inputstr, self.trace = inputstr, []

    def __enter__(self):
        self.oldtrace = sys.gettrace()
        sys.settrace(self.traceit)
        return self

    def __exit__(self, *args):
        sys.settrace(self.oldtrace)

    def include(self, k, v):
        return isinstance(v, str)

    def traceit(self, frame, event, arg):
        cxt = Context(frame)
        my_vars = [(k, v) for k, v in cxt.all_vars(frame).items()
                   if self.include(k, v)]
        self.trace.append((event, arg, cxt, my_vars))
        return self.traceit

    def __call__(self):
        return self.inputstr

In [None]:
class Tracker:
    def __init__(self, inputstr, trace, **kwargs):
        self.the_vars = {}
        self.trace = trace
        self.inputstr = inputstr
        self.options(kwargs)
        self.process()
        
    def options(self, kwargs):
        pass

    def include(self, var, value):
        return len(value) > 2 and value in self.inputstr

    def trace_event(self, event, arg, ctx, my_vars):
        self.the_vars.update({k: v for k, v in my_vars if self.include(k, v)})

    def process(self):
        for event, arg, cxt, my_vars in self.trace:
            self.trace_event(event, arg, cxt, my_vars)
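
The interplay of recording and filtering can be condensed into a single function. The following is a minimal sketch, not the chapter's implementation: `record_string_locals` and the toy parser `split_host` are hypothetical, and the filter (longer than 2 characters, contained in the input) mirrors `Tracker.include()`.

```python
import sys

def record_string_locals(func, inputstr):
    """Run func(inputstr); collect string locals that occur in the input."""
    seen = {}

    def traceit(frame, event, arg):
        for name, value in frame.f_locals.items():
            if isinstance(value, str) and len(value) > 2 and value in inputstr:
                seen[name] = value  # later values overwrite earlier ones
        return traceit

    old = sys.gettrace()
    sys.settrace(traceit)
    try:
        func(inputstr)
    finally:
        sys.settrace(old)  # always restore the previous trace function
    return seen

def split_host(url):  # hypothetical toy parser
    scheme, rest = url.split("://", 1)
    host = rest.split("/", 1)[0]
    return scheme, host

found = record_string_locals(split_host, "http://www.example.com/index")
for name, value in found.items():
    print(name, "=", repr(value))
```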

### Trace

The `Tracer` context manager hooks into the Python trace functionality via `sys.settrace()`.

In [None]:
clear_cache()
with Tracer(INPUTS[0]) as tracer:
    FUNCTION(tracer())

tracker = Tracker(tracer.inputstr, tracer.trace)
for k,v in tracker.the_vars.items():
    print(k, '=', repr(v))

### Extracting a Derivation Tree

In [None]:
from Grammars import START_SYMBOL, syntax_diagram

In [None]:
from GrammarFuzzer import GrammarFuzzer, FasterGrammarFuzzer, display_tree, tree_to_string

The `nonterminal()` method of `Miner` converts a variable name into a grammar nonterminal.

Now, for each pair _VAR_, _VALUE_ found:

1. We search for occurrences of _VALUE_ in the grammar
2. We replace them by <_VAR_>
3. We add a new rule <_VAR_> $\rightarrow$ _VALUE_ to the grammar
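
The replacement in step 2 can be sketched in isolation. The helper `replace_value` is a hypothetical name; it splits each token at the value and puts the nonterminal between the pieces, which is what the `zip`/`sum` expression in `Miner` achieves.

```python
def replace_value(tokens, value, nonterminal):
    """Replace each occurrence of value inside the tokens by nonterminal."""
    result = []
    for token in tokens:
        if value not in token:
            result.append(token)
            continue
        parts = token.split(value)
        for i, part in enumerate(parts):
            if i > 0:
                result.append(nonterminal)  # one nonterminal per split point
            if part:
                result.append(part)         # drop empty fragments
    return tuple(result)

rule = ("http://www.fuzzingbook.org/#News",)
rule = replace_value(rule, "www.fuzzingbook.org", "<netloc>")
rule = replace_value(rule, "http", "<scheme>")
print(rule)
```

Like the miner, this sketch matches values purely by substring, so a short value that happens to occur in an unrelated place is replaced as well.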

In [None]:
class Miner:
    def __init__(self, my_input, my_assignments, **kwargs):
        self.my_input = my_input
        self.my_assignments = my_assignments
        self.log = kwargs.get('log') or False
        self.tree = self.get_derivation_tree()

    def logger(self, indent, var):
        if self.log:
            print('\t' * indent, var)

    def nonterminal(self, var):
        return "<" + var.lower() + ">"

    def to_tree(self, key=START_SYMBOL):
        if key not in self.tree:
            return (key, [])
        children = [self.to_tree(c) for c in self.tree[key]]
        return (key, children)

    def get_derivation_tree(self):
        tree = {START_SYMBOL: (self.my_input, )}
        my_assignments = self.my_assignments.copy()

        while True:
            new_rules = []
            for var, value in my_assignments.items():
                self.logger(0, "%s = %s" % (var, value))
                for key, repl in tree.items():
                    self.logger(1, "%s : %s" % (key, repl))
                    if not any(value in t for t in repl):
                        continue
                    alt_key = self.nonterminal(var)
                    new_arr = []
                    for k, token in enumerate(repl):
                        if not value in token:
                            new_arr.append(token)
                        else:
                            arr = token.split(value)
                            new_arr.extend(
                                list(sum(zip(arr,
                                             len(arr) * [alt_key]), ()))[:-1])
                    tree[key] = tuple(i for i in new_arr if i)
                    new_rules.append((var, alt_key, value))

            if not new_rules:
                break  # Nothing to expand anymore

            for (var, alt_key, value) in new_rules:
                tree[alt_key] = (value, )
                self.logger(0, "+%s = %s" % (alt_key, value))

                # Do not expand this again. A variable may occur in several
                # new rules, so tolerate repeated removal.
                my_assignments.pop(var, None)

        return {key: values for key, values in tree.items()}

First, trace the execution:

In [None]:
clear_cache()
with Tracer(INPUTS[0]) as tracer:
    FUNCTION(tracer())

In [None]:
assignments = Tracker(tracer.inputstr, tracer.trace).the_vars
for var, val in assignments.items():
    print(var + " = " + repr(val))

In [None]:
dt0 = Miner(tracer.inputstr, assignments)
display_tree(dt0.to_tree())

In [None]:
clear_cache()
with Tracer(INPUTS[1]) as tracer:
    FUNCTION(tracer())
dt1 = Miner(tracer.inputstr,
            Tracker(tracer.inputstr, tracer.trace).the_vars)
display_tree(dt1.to_tree())

In [None]:
clear_cache()
with Tracer(INPUTS[2]) as tracer:
    FUNCTION(tracer())
dt2 = Miner(tracer.inputstr,
            Tracker(tracer.inputstr, tracer.trace).the_vars)
display_tree(dt2.to_tree())

### Recovering Grammar from Derivation Trees

In [None]:
class Infer:
    def __init__(self):
        self.grammar = {}

In [None]:
class Infer(Infer):
    def add_tree(self, t):
        merged_grammar = {}
        for key in list(self.grammar.keys()) + list(t.tree.keys()):
            alternates = set(self.grammar.get(key, []))
            if key in t.tree:
                alternates.add(''.join(t.tree[key]))
            merged_grammar[key] = list(alternates)
        self.grammar = merged_grammar
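
What `add_tree()` does can be shown on two hand-written trees. This is a sketch under assumptions: `merge_rules` is a hypothetical helper, the trees are made up, and lists are used instead of sets so that the order of alternatives is predictable.

```python
def merge_rules(grammar, tree):
    """Add the expansion of every key in tree as an alternative in grammar."""
    merged = {key: list(alts) for key, alts in grammar.items()}
    for key, tokens in tree.items():
        expansion = "".join(tokens)
        if expansion not in merged.setdefault(key, []):
            merged[key].append(expansion)
    return merged

tree_a = {"<start>": ("<scheme>", "://", "<netloc>"),
          "<scheme>": ("http",),
          "<netloc>": ("a.org",)}
tree_b = {"<start>": ("<scheme>", "://", "<netloc>"),
          "<scheme>": ("https",),
          "<netloc>": ("b.org",)}

merged = merge_rules(merge_rules({}, tree_a), tree_b)
print(merged)
```

Both trees share the same `<start>` expansion, so it appears once, while `<scheme>` and `<netloc>` each gain a second alternative.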

In [None]:
i = Infer()
i.add_tree(dt0)
i.add_tree(dt1)
i.add_tree(dt2)

In [None]:
syntax_diagram(i.grammar)

In [None]:
def recover_grammar(traces):
    m = Infer()
    for inputstr, trace in traces:
        dt = Miner(inputstr, Tracker(inputstr, trace).the_vars)
        m.add_tree(dt)
    return m.grammar

In [None]:
traces = []
for inputstr in INPUTS:
    clear_cache()
    with Tracer(inputstr) as tracer:
        FUNCTION(tracer())
    traces.append((tracer.inputstr, tracer.trace))
grammar = recover_grammar(traces)
syntax_diagram(grammar)

### Fuzzing

In [None]:
f = GrammarFuzzer(grammar)
for i in range(10):
    print(f.fuzz())
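
To illustrate why a merged grammar yields inputs that were never observed, here is a tiny random expander. It is a sketch only and not the algorithm used by `GrammarFuzzer`; the function `expand` and the grammar `toy` are hypothetical, and the expander does not terminate on grammars with unbounded recursion.

```python
import random
import re

NONTERMINAL = re.compile(r"<[^<> ]+>")

def expand(grammar, symbol="<start>", rng=random):
    """Expand symbol by picking a random alternative for each nonterminal."""
    if symbol not in grammar:
        return symbol  # leave undefined nonterminals untouched
    expansion = rng.choice(grammar[symbol])
    return NONTERMINAL.sub(lambda m: expand(grammar, m.group(0), rng),
                           expansion)

toy = {
    "<start>": ["<scheme>://<netloc>/"],
    "<scheme>": ["http", "https"],
    "<netloc>": ["www.fuzzingbook.org", "www.cispa.saarland:80"],
}

rng = random.Random(0)  # fixed seed for repeatable runs
samples = [expand(toy, rng=rng) for _ in range(5)]
for s in samples:
    print(s)
```

Because `<scheme>` and `<netloc>` are chosen independently, combinations of scheme and host that appear in no single sample input can be produced.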

## Grammar Miner with Stack

### Keep Track of The Stack

In [None]:
class InputStack(object):
    def __init__(self, i):
        self.original = i
        self.inputs = []
        
    def height(self):
        return len(self.inputs)

    def has(self, val):
        return any(val in var for var in self.inputs[-1].values())

    def ignored(self, val):
        return not (isinstance(val, str) and len(val) > 2)

    def include(self, k, val):
        if self.ignored(val):
            return False
        return self.has(val) if self.inputs else val in self.original

    def push(self, inputs):
        my_inputs = {k: v for k, v in inputs.items() if self.include(k, v)}
        self.inputs.append(my_inputs)

    def pop(self):
        return self.inputs.pop()
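
The effect of the stack on what counts as an input fragment can be sketched with a plain list of dictionaries. The function `visible` and the call `parse_fragment` are hypothetical; the rule itself follows `InputStack.include()`: with no active call, a value must occur in the original input, otherwise it must occur in a parameter of the innermost call.

```python
def visible(value, original, stack):
    """Decide whether value counts as an input fragment at this point."""
    if not (isinstance(value, str) and len(value) > 2):
        return False  # ignore non-strings and trivial strings
    if not stack:
        return value in original
    return any(value in param for param in stack[-1].values())

original = "http://www.fuzzingbook.org/#News"
stack = []
at_top = visible("www.fuzzingbook.org", original, stack)

# Simulate entering a (hypothetical) call that only receives the fragment
stack.append({"parse_fragment:text": "#News"})
inner_hit = visible("News", original, stack)
inner_miss = visible("www.fuzzingbook.org", original, stack)
print(at_top, inner_hit, inner_miss)
```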

### Restrict The Input Window

We proxy the dictionary so that it will only update if it does not already contain a value.

In [None]:
class Vars(object):
    def __init__(self, stack):
        self.defs = {START_SYMBOL: stack.original}
        self.istack = stack
        
    def set_kv(self, k, v):
        if k not in self.defs:
            self.defs[k] = v

    def update(self, v):
        for k,v in v.items():
            self.set_kv(k,v)
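
The first-write-wins behavior of `Vars` corresponds to `dict.setdefault()`. A minimal sketch, with a hypothetical helper `record` and made-up variable names:

```python
def record(defs, updates):
    """Store each key only the first time it is seen."""
    for key, value in updates.items():
        defs.setdefault(key, value)

defs = {"<start>": "http://www.fuzzingbook.org/#News"}
record(defs, {"urlparse:url": "http://www.fuzzingbook.org/#News"})
# A later reassignment of the same variable is ignored
record(defs, {"urlparse:url": "www.fuzzingbook.org/#News"})
print(defs["urlparse:url"])
```

Keeping the first value means that a variable which is progressively consumed by the parser is recorded with its largest fragment.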

In [None]:
class StackTracker(Tracker):
    def __init__(self, inputstr, trace, **kwargs):
        self.istack = InputStack(inputstr)
        self.the_vars = Vars(self.istack)
        self.trace = trace
        self.options(kwargs)
        self.process()

    def options(self, kwargs):
        self.files = kwargs.get('files') or []
        # Use defaults rather than `or True`, which would ignore an explicit False
        self.track_params = kwargs.get('track_params', True)
        self.track_vars = kwargs.get('track_vars', True)
        self.track_return = kwargs.get('track_return', False)

    def include(self, var, value):
        if self.istack.ignored(value):
            return False
        return self.istack.include(var, value)

    def get_params(self, cxt, all_vars):
        return {
            "%s:%s" % (cxt.method, k): v
            for k, v in all_vars if k in cxt.parameter_names
        }

    def on_call(self, arg, cxt, my_vars):
        my_parameters = {
            k: v
            for k, v in self.get_params(cxt, my_vars).items()
            if not self.istack.ignored(v)
        }
        self.istack.push(my_parameters)
        if self.track_params:
            self.the_vars.update(my_parameters)

    def on_line(self, arg, cxt, my_vars):
        if self.track_vars:
            qvars = {"%s:%s" % (cxt.method, k): v for k, v in my_vars}
            my_vars = {
                var: value
                for var, value in qvars.items() if self.include(var, value)
            }
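            if not self.track_params:
                # Qualified names of the parameters of the current method
                param_names = {
                    "%s:%s" % (cxt.method, p) for p in cxt.parameter_names
                }
                my_vars = {
                    var: value
                    for var, value in my_vars.items() if var not in param_names
                }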
            self.the_vars.update(my_vars)

    def on_return(self, arg, cxt, my_vars):
        self.istack.pop()
        self.on_line(arg, cxt, my_vars)
        if self.track_return:
            var = '(<-%s)' % cxt.method
            self.the_vars.update({var: arg})

    def trace_event(self, event, arg, cxt, my_vars):
        if not any(cxt.file_name.endswith(f) for f in self.files):
            return
        if event == 'call':
            return self.on_call(arg, cxt, my_vars)

        if event == 'return':
            return self.on_return(arg, cxt, my_vars)

        if event == 'exception':
            return

        self.on_line(arg, cxt, my_vars)

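`StackTracker.trace_event()` is now aware of events: it dispatches `call`, `return`, and `line` events to separate handlers. We apply it to the trace of each input: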

In [None]:
traces = []
for inputstr in INPUTS:
    clear_cache()
    with Tracer(inputstr) as tracer:
        FUNCTION(tracer())
    sm = StackTracker(tracer.inputstr, tracer.trace, files=['urllib/parse.py'])
    traces.append((tracer.inputstr, sm))

Note that in the following we do not account for parameters getting reassigned values.

For each (VAR, VALUE) found:
* We search for occurrences of VALUE in the grammar
* We replace them by VAR
* We add a new rule VAR -> VALUE to the grammar

In [None]:
class Miner(Miner):
    def get_derivation_tree(self):
        my_assignments = self.my_assignments.copy()
        tree = {}
        for var, value in my_assignments.items():
            nt_var = var if var == START_SYMBOL else self.nonterminal(var)
            self.logger(0, "%s = %s" % (nt_var, value))
            if tree:
                append = False
                for key, repl in tree.items():
                    self.logger(1, "%s : %s" % (key, repl))
                    if not any(value in t for t in repl):
                        continue
                    new_arr = []
                    for k, token in enumerate(repl):
                        if not value in token:
                            new_arr.append(token)
                        else:
                            append = True
                            arr = token.split(value)
                            new_arr.extend(
                                list(sum(zip(arr,
                                             len(arr) * [nt_var]), ()))[:-1])
                    tree[key] = tuple(i for i in new_arr if i)
                if append:
                    self.logger(0, "+%s = %s" % (nt_var, value))
                    tree[nt_var] = set([value])
            else:
                tree[nt_var] = (value, )
        return  {key: values for key, values in tree.items()}

In [None]:
clear_cache()
with Tracer(INPUTS[2]) as tracer:
    FUNCTION(tracer())
sm = StackTracker(tracer.inputstr, tracer.trace, files=['urllib/parse.py'])
dt = Miner(tracer.inputstr, sm.the_vars.defs)
display_tree(dt.to_tree())

In [None]:
def recover_grammar(traces):
    m = Infer()
    for inputstr, trace in traces:
        st = StackTracker(inputstr, trace, files=['urllib/parse.py'])
        dt = Miner(inputstr, st.the_vars.defs)
        m.add_tree(dt)
    return m.grammar

In [None]:
traces = []
for inputstr in INPUTS:
    clear_cache()
    with Tracer(inputstr) as tracer:
        FUNCTION(tracer())
    traces.append((tracer.inputstr, tracer.trace))
grammar = recover_grammar(traces)
syntax_diagram(grammar)

## Tainted Grammar Miner

In [None]:
from InformationFlow import tstr

In [None]:
class TaintedTracer(Tracer):
    def __init__(self, inputstr):
        self.inputstr = tstr(inputstr, parent=None)
        self.trace = []
        self.istack = TaintedInputStack(inputstr)
        self.vars = TaintedVars(self.istack)
  
    def include(self, k, v):
        return isinstance(repr(v), tstr)

In [None]:
class TaintedInputStack(InputStack):
    def has(self, val):
        return any(val.taint_in(var) for var in self.inputs[-1].values())
    
    def ignored(self, val):
        return not (isinstance(repr(val), tstr) and len(repr(val).taint) > 2)
    
    def include(self, k, val):
        if self.ignored(val):
            return False
        return self.has(val) if self.inputs else val.taint_in(self.original)

In [None]:
class TaintedVars(Vars):
    def set_kv(self, k, v):
        def trep(v):
            return v if isinstance(v, tstr) else repr(v)

        self.defs[k] = trep(v)

In [None]:
class TaintedTracker(StackTracker):
    def __init__(self, inputstr, trace, **kwargs):
        self.istack = TaintedInputStack(inputstr)
        self.the_vars = TaintedVars(self.istack)
        self.trace = trace
        self.options(kwargs)
        self.process()

We can only replace a value if the taints match.

In [None]:
class Miner(Miner):
    def get_derivation_tree(self):
        my_assignments = self.my_assignments.copy()
        tree = {}
        for var, value in my_assignments.items():
            nt_var = var if var == START_SYMBOL else self.nonterminal(var)
            self.logger(0, "%s = %s" % (nt_var, value))
            if tree:
                append = False
                for key, repl in tree.items():
                    self.logger(1, "%s : %s" % (key, repl))
                    if not any(value.taint_in(t) for t in repl if isinstance(t, tstr)):
                        continue
                    new_arr = []
                    for k, token in enumerate(repl):
                        if not isinstance(token, tstr) or not value.taint_in(token):
                            new_arr.append(token)
                        else:
                            append = True
                            arr = token.split(value)
                            new_arr.extend(
                                list(sum(zip(arr,
                                             len(arr) * [nt_var]), ()))[:-1])
                    tree[key] = tuple(i for i in new_arr if i)
                if append:
                    self.logger(0, "+%s = %s" % (nt_var, value))
                    tree[nt_var] = set([value])
            else:
                tree[nt_var] = (value, )
        return  {key: values for key, values in tree.items()}

In [None]:
def recover_grammar(traces):
    m = Infer()
    for inputstr, trace in traces:
        st = TaintedTracker(inputstr, trace, files=['urllib/parse.py'])
        dt = Miner(inputstr, st.the_vars.defs)
        m.add_tree(dt)
    return m.grammar

In [None]:
traces = []
for inputstr in INPUTS:
    clear_cache()
    with TaintedTracer(inputstr) as tracer:
        FUNCTION(tracer())
    traces.append((tracer.inputstr, tracer.trace))
grammar = recover_grammar(traces)
syntax_diagram(grammar)

## Tainted Objects

While the functions we have seen so far use string parameters to pass fragments of input around, real-world parsers often pass around data structures that represent the input fragments. For the standard data containers in Python, one can rely on simple recursive filtering.

In [None]:
def flatten(key, val):
    tv = type(val)
    if tv in {int, float, complex, str, bytes, bytearray}:
        return [(key, val)]
    elif tv in {set, frozenset, list, tuple, range}:
        values = [e for i, elt in enumerate(val) for e in flatten(i, elt)]
        return [("%s.%s" % (key, i), v) for i, v in values]
    elif tv is dict:
        values = [e for k, elt in val.items() for e in flatten(k, elt)]
        return [("%s.%s" % (key, k), v) for k, v in values]
    elif tv is tstr:
        return [(key, val)]
    elif hasattr(val,'__dict__'):
        values = [e for k, elt in val.__dict__.items() for e in flatten(k, elt)]
        return [("%s.%s" % (key, k), v) for k, v in values]
    else:
        return [(key, repr(val))]
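
The naming scheme that results from flattening is easiest to see on plain containers. The following sketch, `flatten_plain`, is a reduced and hypothetical variant that handles only lists, tuples, and dictionaries, and knows nothing about `tstr`:

```python
def flatten_plain(key, val):
    """Flatten nested lists, tuples, and dicts into (qualified_name, leaf)."""
    if isinstance(val, (list, tuple)):
        return [("%s.%s" % (key, k), v)
                for i, elt in enumerate(val)
                for k, v in flatten_plain(i, elt)]
    if isinstance(val, dict):
        return [("%s.%s" % (key, k), v)
                for name, elt in val.items()
                for k, v in flatten_plain(name, elt)]
    return [(key, val)]  # a leaf value

parsed = {"scheme": "http", "hosts": ["a.org", ("b.org", 80)]}
pairs = flatten_plain("result", parsed)
for name, leaf in pairs:
    print(name, "=", repr(leaf))
```

Each leaf ends up with a dotted path such as `result.hosts.1.0`, which can then serve as a variable name in the mined grammar.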

One way to account for custom data structures other than containers is to rely on their `repr()`. Both `str()` and `repr()` rely on string methods that we have overridden in the tainted string. Hence, if any of the string fragments are tainted, the result will also be tainted.

In [None]:
class TaintedVars(TaintedVars):
    def update(self, values):
        vals = [(k1, v1) for k, v in values.items() for k1, v1 in flatten(k, v)]
        for k, v in vals:
            self.set_kv(k, v)

One of the choices here is whether to track the input parameters as variables (not just as input parameters) or only the local variable values.

### Accounting for reassignments in loops

In [None]:
class TaintedVars(TaintedVars):
    def __init__(self, stack):
        self.accessed_scop_var = {}
        self.taint_register = {}
        super().__init__(stack)

    def var_init(self, var):
        if var not in self.accessed_scop_var:
            self.accessed_scop_var[var] = 0

    def var_assign(self, var):
        self.accessed_scop_var[var] += 1

    def var_name(self, var):
        t = self.accessed_scop_var[var]
        return "%s[%d:%d]" % (var, self.istack.height(), t)

    def set_kv(self, var, val):
        self.var_init(var)
        sa_var = self.var_name(var)
        if sa_var not in self.defs:
            self.defs[sa_var] = val
            self.taint_register[str(val.taint)] = sa_var
        else:  # possible reassignment
            if self.taint_register.get(str(val.taint)) is None:  # a change in taint
                self.var_assign(var)
                sa_var = self.var_name(var)
                self.defs[sa_var] = val
                self.taint_register[str(val.taint)] = sa_var
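
The versioned names can be illustrated without tainted strings. This sketch is a simplification under a stated assumption: the hypothetical class `VersionedNames` detects a reassignment by comparing values, whereas `TaintedVars` compares taints. The name format `var[height:count]` is the same.

```python
class VersionedNames:
    """Give each distinct assignment of a variable its own name."""

    def __init__(self):
        self.counts = {}  # variable -> number of reassignments seen
        self.defs = {}    # versioned name -> value

    def assign(self, var, value, depth):
        count = self.counts.setdefault(var, 0)
        name = "%s[%d:%d]" % (var, depth, count)
        if name in self.defs and self.defs[name] != value:
            # The value changed: start a new version of the variable
            self.counts[var] = count + 1
            name = "%s[%d:%d]" % (var, depth, count + 1)
        self.defs.setdefault(name, value)
        return name

names = VersionedNames()
first = names.assign("parse:part", "www", depth=1)
same = names.assign("parse:part", "www", depth=1)   # unchanged value
second = names.assign("parse:part", "org", depth=1)  # loop reassignment
print(first, same, second)
```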

In [None]:
traces = []
for inputstr in INPUTS:
    clear_cache()
    with TaintedTracer(inputstr) as tracer:
        FUNCTION(tracer())
    traces.append((tracer.inputstr, tracer.trace))
grammar = recover_grammar(traces)
syntax_diagram(grammar)

The problem is essentially that, with greater detail, we can no longer match the keys across different inputs. That is, the variable at each particular loop iteration has a different name, and it is no longer clear how to join them.

## Lessons Learned

* Given a set of inputs, we can learn an input grammar by examining variable values during execution.
* The resulting grammars can be used directly for fuzzing.

## Next Steps

* [use _mutations_ on existing inputs to get more valid inputs](MutationFuzzer.ipynb)
* [use _grammars_ (i.e., a specification of the input format) to get even more valid inputs](Grammars.ipynb)
* [reduce _failing inputs_ for efficient debugging](Reducer.ipynb)


## Background

\cite{Lin2008}
