# Mining Input Grammars

So far, the grammars we have seen have been mostly specified manually – that is, you (or the person knowing the input format) had to design and write a grammar in the first place.  While the grammars we have seen so far have been rather simple, creating a grammar for complex inputs can involve quite some effort.  In this chapter, we therefore introduce techniques that automatically _mine_ grammars from programs – by executing the programs and observing how they process which parts of the input.  In conjunction with a grammar fuzzer, this allows us to (1) take a program, (2) extract its input grammar, and (3) fuzz it with high efficiency and effectiveness.

**Prerequisites**

* You should have read the [chapter on grammars](Grammars.ipynb).
* The [chapter on configuration fuzzing](ConfigurationFuzzer.ipynb) introduces grammar mining for configuration options, as well as observing variables and values during execution.
* The concept of parsing from the [chapter on parsers](Parser.ipynb) is also useful.

Consider the `process_inventory()` method from the [chapter on parsers](Parser.ipynb):

In [None]:
import fuzzingbook_utils

In [None]:
from Parser import process_inventory, process_vehicle, process_car, process_van, lr_graph

It takes inputs of the following form.

In [None]:
INVENTORY = """\
1997,van,Ford,E350
2000,car,Mercury,Cougar
1999,car,Chevy,Venture\
"""

In [None]:
print(process_inventory(INVENTORY))

We found from the [chapter on parsers](Parser.ipynb) that coarse grammars do not work well for fuzzing when the input format includes details expressed only in code. That is, even though we have the formal specification of CSV files ([RFC 4180](https://tools.ietf.org/html/rfc4180)), the inventory system includes further rules as to what is expected at each index of the CSV file. The solution of simply recombining existing inputs, while practical, is incomplete. In particular, it relies on a formal input specification being available in the first place. However, we have no assurance that the program obeys the input specification given.

One of the ways out of this predicament is to interrogate the program under test as to what its input specification is. That is, if the program under test is written in a recursive descent style, with specific methods responsible for handling specific parts of the input, one can recover the parse tree by observing the process of parsing. Further, one can recover a reasonable approximation of the grammar by abstraction from multiple input trees.

_We start with the assumption (1) that the program is written in such a fashion that specific methods are responsible for parsing specific fragments of the input. This includes almost all ad hoc parsers._

The idea is as follows:

* Hook into the Python execution and observe the fragments of input string as they are produced and named in different methods.
* Stitch the input fragments together in a tree structure to retrieve the **Parse Tree**.
* Abstract common elements from multiple parse trees to produce the **Context Free Grammar** of the input.

## A Simple Grammar Miner

Say we want to obtain the input grammar for the function `process_vehicle()`. We first collect the sample inputs for this function.

In [None]:
VEHICLES = INVENTORY.split('\n')

We have seen from the chapter on [configuration fuzzing](ConfigurationFuzzer.ipynb) that one can hook into the Python runtime to observe the arguments to a function and any local variables created. We have also seen that one can obtain the context of execution by inspecting the `frame` argument. Here is a simple tracer that can return the local variables and other contextual information in a traced function.

In [None]:
INVENTORY_METHODS = {
    'process_inventory',
    'process_vehicle',
    'process_van',
    'process_car'}

In [None]:
import inspect

In [None]:
def traceit(frame, event, arg):
    method_name = inspect.getframeinfo(frame).function
    if method_name not in INVENTORY_METHODS:
        return
    file_name = inspect.getframeinfo(frame).filename
    
    param_names = inspect.getargvalues(frame).args
    lineno = inspect.getframeinfo(frame).lineno
    print(event, file_name, lineno, method_name, param_names, frame.f_locals)
    return traceit

We first obtain and save the current trace function.

In [None]:
import sys

In [None]:
oldtrace = sys.gettrace()

Next, set our trace function as the current one.

In [None]:
sys.settrace(traceit)

Then, run the code under this trace.

In [None]:
process_vehicle(VEHICLES[0])

Finally, we reset the trace.

In [None]:
sys.settrace(oldtrace)
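
If the traced function raises an exception, the `sys.settrace(oldtrace)` call above is never reached. As a minimal sketch (the helper `record_calls()` and the two toy functions are hypothetical names, not part of this chapter's code), one can wrap the call in `try`/`finally` so that the previous trace function is always restored:

```python
import sys


def record_calls(func, *args):
    """Run func(*args) and record the name and arguments of each Python call."""
    events = []

    def tracer(frame, event, arg):
        if event == 'call':
            # At a 'call' event, the locals are exactly the arguments.
            events.append((frame.f_code.co_name, dict(frame.f_locals)))
        return None  # we do not need line-level events here

    oldtrace = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        # Restore the previous hook even if func() raised an exception.
        sys.settrace(oldtrace)
    return events


def split_pair(text):
    return tuple(text.split('=', 1))


def parse_pair(text):
    key, value = split_pair(text)
    return key, value


calls = record_calls(parse_pair, 'a=b')
for name, arguments in calls:
    print(name, arguments)
```

The `Tracer` context manager introduced next achieves the same effect through its `__exit__()` method.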

### Tracer

In the interests of modularity, we expand the `traceit()` function to a full-fledged class `Tracer` that acts as a *context manager*. A context manager in Python requires two methods: `__enter__()` to enter the context and `__exit__()` to leave the context.

In [None]:
class Tracer:
    def __enter__(self):
        self.oldtrace = sys.gettrace()
        sys.settrace(self.trace_event)
        return self

    def __exit__(self, *args):
        sys.settrace(self.oldtrace)

The logic in the `traceit()` function is now moved to a method `trace_event()` which is set as the trace function by the `Tracer` context manager.

In [None]:
class Tracer(Tracer):
    def trace_event(self, frame, event, arg):
        method_name = inspect.getframeinfo(frame).function
        if method_name not in INVENTORY_METHODS:
            return
        param_names = inspect.getargvalues(frame).args
        lineno = inspect.getframeinfo(frame).lineno
        local_vars = inspect.getargvalues(frame).locals
        print(event, method_name, lineno, param_names, local_vars)
        return self.trace_event

Any function executed within the `Tracer` context gets a tracing hook installed, and after the execution, the hook is uninstalled automatically.

In [None]:
with Tracer() as tracer:
    process_vehicle(VEHICLES[0])

The `trace_event()` method relies on information from the `frame` variable, which exposes Python internals. We define a `Context` class that encapsulates the information that we need from the `frame`.

#### Context

The `Context` class provides easy access to the information such as the current module, and parameter names.

In [None]:
class Context:
    def __init__(self, frame, track_caller=True):
        self.method = self._method(frame)
        self.parameter_names = self._get_parameter_names(frame)
        self.file_name = self._file_name(frame)
        self.line_no = self._line(frame)

    def _get_parameter_names(self, frame):
        return inspect.getargvalues(frame).args

    def _line(self, frame):
        return inspect.getframeinfo(frame).lineno

    def _file_name(self, frame):
        return inspect.getframeinfo(frame).filename

    def _method(self, frame):
        return inspect.getframeinfo(frame).function

    def extract_vars(self, frame):
        return inspect.getargvalues(frame).locals

    def parameters(self, all_vars):
        return {k: v for k, v in all_vars.items() if k in self.parameter_names}

    def qualified(self, all_vars):
        return {"%s:%s" % (self.method, k): v for k, v in all_vars.items()}

    def _t(self):
        return (self.file_name, self.line_no, self.method, ','.join(self.parameter_names))

    def __repr__(self):
        return "%s:%d:%s(%s)" % self._t()
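
To see what the `inspect` calls used by `Context` return, here is a small self-contained sketch. The helper `describe_frame()` and the function `sample()` are hypothetical names used only for illustration; `inspect.currentframe()` is used here in place of a trace hook to obtain a frame.

```python
import inspect


def describe_frame(frame):
    """Return the function name, parameter names, and locals of a frame."""
    info = inspect.getframeinfo(frame)
    arg_info = inspect.getargvalues(frame)
    return info.function, list(arg_info.args), dict(arg_info.locals)


def sample(a, b):
    c = a + b
    # Inspect the frame of the currently executing function.
    return describe_frame(inspect.currentframe())


name, params, local_vars = sample('x', 'y')
print(name, params, local_vars)
```

Note that `locals` contains both the parameters and any local variables assigned so far, which is why `Context.parameters()` filters by `parameter_names`.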

We hook printing the context to our `trace_event()` to see it in action.

In [None]:
class Tracer(Tracer):
    def trace_event(self, frame, event, arg):
        print(Context(frame))
        return self.trace_event

Running `process_vehicle()` under trace prints the contexts encountered.

In [None]:
with Tracer() as tracer:
    process_vehicle(VEHICLES[0])

Notice that `<string>` is the placeholder name for those functions executed within our functions `process_vehicle()` and `process_van()`. The Jupyter-specific functions have a special `<ipython-input...>` suffix in their filename. We will show how to remove the Jupyter-specific functions from the trace next.

The trace produced by executing any function can get overwhelmingly large. Hence, we need to restrict our attention to specific modules. Further, we also restrict our attention exclusively to `str` variables since these variables are more likely to contain input fragments. (We will show how to deal with complex objects later.)

We use the `Context` to decide which modules to monitor, and which variables to trace.

We store the current *input string* so that it can be used to determine if any particular string fragments came from the current input string. We also accept `**kwargs` for optional arguments.

In [None]:
class Tracer(Tracer):
    def __init__(self, my_input, **kwargs):
        self.my_input, self.trace = my_input, []
        self.options(kwargs)

We use an optional argument `files` to indicate the specific source files we are interested in. Further, we also use `log` to specify whether verbose logging should be enabled during trace.

In [None]:
class Tracer(Tracer):
    def options(self, kwargs):
        self.files = kwargs.get('files') or []
        self.log = kwargs.get('log') or False

    def logger(self, event, var):
        if not self.log:
            return
        char = '  '
        if event == 'call':
            char = '->'
        elif event == 'return':
            char = '<-'
        print(char, var)

The `files` option is checked to determine whether a particular event should be traced.

In [None]:
class Tracer(Tracer):
    def tracing_context(self, ctx, event, arg):
        if not self.files:
            return True
        return any(ctx.file_name.endswith(f) for f in self.files)

Similar to the context of events, we also want to restrict our attention to specific variables. For now, we want to focus only on strings. (See the exercises on how to extend it to other kinds of objects).

In [None]:
class Tracer(Tracer):
    def tracing_var(self, k, v):
        return isinstance(v, str)

We modify the `trace_event()` to call an `on_event()` function with the context information only on the specific events we are interested in.

In [None]:
class Tracer(Tracer):
    def on_event(self, event, arg, cxt, my_vars):
        self.trace.append((event, arg, cxt, my_vars))

    def trace_event(self, frame, event, arg):
        cxt = Context(frame)
        self.logger(event, cxt)
        if not self.tracing_context(cxt, event, arg):
            return self.trace_event

        my_vars = {
            k: v
            for k, v in cxt.extract_vars(frame).items()
            if self.tracing_var(k, v)
        }
        self.on_event(event, arg, cxt, my_vars)
        return self.trace_event

The `Tracer` class can now focus on specific kinds of events on specific files. Further, it provides a first level filter for variables that we find interesting. For example, we want to focus specifically on `str` variables that contain input fragments. Here is how our updated `Tracer` can be used:

In [None]:
with Tracer(VEHICLES[0], files=['<string>'], log=True) as tracer:
    process_vehicle(VEHICLES[0])

The execution produced the following trace.

In [None]:
for t in tracer.trace:
    print(t[0], t[2].method, dict(t[3]))

Since we are saving the input already in Tracer, it is redundant to specify it separately again as an argument.

In [None]:
with Tracer(VEHICLES[0], log=True) as tracer:
    process_vehicle(tracer.my_input)

The `settrace()` function hooks into the Python debugging facility. When it is in operation, no debugger can hook into the program. Hence, we limit the tracer to the simplest implementation possible as given above, and implement the core of grammar mining in later stages.

### Tracker

We define a `Tracker` class that processes the trace from the `Tracer`.

The tracker identifies string fragments that are part of the input string, and stores them in a dictionary `my_assignments`. It saves the trace, and the corresponding input for processing. Finally it calls `process()` to process the `trace` it was given. We additionally define a logging facility for debugging.

One of the problems of using substring search is that short string sequences tend to be included in other string sequences even though they may not have come from the original string. That is, say the input fragment is `v`. It could have equally come from either `van` or `chevy`. We rely on being able to predict the exact place in the input where a given fragment occurred. Hence, we define a constant `FRAGMENT_LEN` such that we ignore strings up to that length.

In [None]:
FRAGMENT_LEN = 2

In [None]:
class Tracker:
    def __init__(self, my_input, trace, **kwargs):
        self.my_input = my_input
        self.trace = trace
        self.my_assignments = {}
        self.options(kwargs)
        self.process()

    def options(self, kwargs):
        self.log = kwargs.get('log') or False
        self.fragment_len = kwargs.get('fragment_len') or FRAGMENT_LEN

    def logger(self, event, var):
        if not self.log:
            return
        char = '  '
        if event == 'call':
            char = '->'
        elif event == 'return':
            char = '<-'
        print(char, var)

Our tracer simply records the variable values as they occur. We next need to check if the variables contain values from the **input string**. A common way to do this is to rely on symbolic execution or at least dynamic tainting, which are powerful, but also complex. However, one can obtain a reasonable approximation by simply relying on substring search. That is, we consider any value produced that is a substring of the original input string to have come from the original input.

We define an `include()` method that relies on string inclusion to detect if the string came from the input.

In [None]:
class Tracker(Tracker):
    def include(self, var, value):
        return len(value) > self.fragment_len and value in self.my_input

We can use `include()` to select only a subset of keys in a dictionary, as implemented below in `selected()`.

In [None]:
class Tracker(Tracker):
    def selected(self, variables):
        return {k: v for k, v in variables.items() if self.include(k, v)}
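
The combined effect of the length threshold and the substring check can be sketched in isolation. The function `fragments_from_input()` and the `observed` dictionary below are hypothetical and only mirror the logic of `include()` and `selected()`:

```python
def fragments_from_input(my_input, variables, min_len=2):
    """Keep only values longer than min_len that occur verbatim in my_input."""
    return {
        name: value
        for name, value in variables.items()
        if len(value) > min_len and value in my_input
    }


observed = {
    'vehicle': '1997,van,Ford,E350',  # the whole input
    'kind': 'van',                    # a proper fragment of the input
    'sep': ',',                       # occurs in the input, but too short
    'msg': 'We have a van',           # long enough, but not a substring
}

kept = fragments_from_input('1997,van,Ford,E350', observed)
print(kept)
```

Only `vehicle` and `kind` survive: the separator is filtered out by the length threshold, and the message is filtered out because it does not occur in the input.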

The tracker processes each event, and at each event, it updates the dictionary `my_assignments` with the current local variables that contain strings that are part of the input.

In [None]:
class Tracker(Tracker):
    def track_event(self, event, arg, cxt, my_vars):
        self.logger(event, (cxt.method, my_vars))
        self.my_assignments.update(self.selected(my_vars))

    def process(self):
        for event, arg, cxt, my_vars in self.trace:
            self.track_event(event, arg, cxt, my_vars)

Using the tracker, we can obtain the input fragments. For example, say we are only interested in strings that are more than `5` characters long.

In [None]:
tracker = Tracker(tracer.my_input, tracer.trace, fragment_len=5)
for k, v in tracker.my_assignments.items():
    print(k, '=', repr(v))

Or strings that are more than `2` characters long (the default).

In [None]:
tracker = Tracker(tracer.my_input, tracer.trace)
for k, v in tracker.my_assignments.items():
    print(k, '=', repr(v))

### Mining a Derivation Tree

The input fragments from the `Tracker` only tell half the story. The fragments may be created at different stages of parsing. Hence, we need to assemble the fragments into a derivation tree of the input. We start with a few imports.

In [None]:
from Grammars import START_SYMBOL, syntax_diagram, is_nonterminal

The derivation tree `Miner` is initialized with the input string, and the variable assignments, and it converts the assignments to the corresponding derivation tree.

In [None]:
class Miner:
    def __init__(self, my_input, my_assignments, **kwargs):
        self.my_input = my_input
        self.my_assignments = my_assignments
        self.options(kwargs)
        self.tree = self.get_derivation_tree()

    def options(self, kwargs):
        self.log = kwargs.get('log') or False

    def logger(self, indent, var):
        self.log and print('\t' * indent, var)

    def get_derivation_tree(self):
        return {}

The basic idea is as follows:
* We represent the derivation tree as a [straight line grammar](https://en.wikipedia.org/wiki/Straight-line_grammar) with each node represented by a key-value pair. The key corresponds to the variable name, and the value corresponds to the representation of the value of the variable. **For now, we assume that the value assigned to a variable is stable; that is, it is never reassigned. In particular, there are no recursive calls, or multiple calls to the same function from different parts.** (We will show how to overcome this limitation later). The value representation may contain references to other nodes.
* We start with a derivation tree with a single node -- the start symbol and the input string as its leaf.
* For each pair _var_, _value_ found in `my_assignments`:
    * (1) We search for occurrences of _value_ in the grammar
    * (2) If found, we replace them by <_var_>
    * (3) If at least one replacement occurred, we add a new rule <_var_> $\rightarrow$ _value_ to the grammar

First, we define a wrapper to generate a nonterminal from a variable name.

In [None]:
def to_nonterminal(var):
    return "<" + var.lower() + ">"

We need to display the derivation tree being constructed. Hence, we define a procedure to display our tree using `display_tree()` from `GrammarFuzzer`.

In [None]:
from GrammarFuzzer import GrammarFuzzer, FasterGrammarFuzzer, display_tree, tree_to_string

In [None]:
def stgrammar_to_tree(tree, key=START_SYMBOL):
    if key not in tree:
        return (key, [])
    children = [stgrammar_to_tree(tree, c) for c in tree[key]]
    return (key, children)

In [None]:
def display_derivation_tree(tree, key=START_SYMBOL, **kwargs):
    display_tree(stgrammar_to_tree(tree, key), **kwargs)

Considering our example previously, we started with the following input `1997,van,Ford,E350`. We initialize our derivation tree with this value. A definition may contain multiple tokens. Hence, we use a tuple to represent a definition.

In [None]:
derivation_tree = {START_SYMBOL: ('1997,van,Ford,E350',)}
display_derivation_tree(derivation_tree)

Next, we found that we had a method call `process_vehicle` with parameters `{'vehicle': '1997,van,Ford,E350'}` which is present in `my_assignments`. This is the same string as what is present in `START_SYMBOL` -- see (1). As we described above, we replace the matching part for `START_SYMBOL` with the new key (2), and add the new definition (3).

In [None]:
alt_key_0 = to_nonterminal('vehicle')
value_0 = '1997,van,Ford,E350'

We split the single string corresponding to `START_SYMBOL` using `value_0`.

In [None]:
arr = derivation_tree[START_SYMBOL][0].split(value_0)

We want to rejoin the `arr` after incorporating the key for `value_0`.

In [None]:
def rejoin(arr, sep):
    return list(sum(zip(arr, len(arr) * [sep]), ()))[:-1]

In [None]:
v = rejoin(arr, alt_key_0)

Since `value_0` matches the entire input, the input is completely replaced by `alt_key_0`.

In [None]:
v
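
The effect of splitting and rejoining is easier to see when the value occurs in the middle of a string. The following sketch uses a hypothetical helper `interleave()`, written as a plain loop, which is intended to behave like `rejoin()` above:

```python
def interleave(parts, sep):
    """Place sep between consecutive elements of parts (like rejoin())."""
    result = []
    for i, part in enumerate(parts):
        if i > 0:
            result.append(sep)
        result.append(part)
    return result


parts = '1997,van,Ford,E350'.split('van')
tokens = interleave(parts, '<kind>')
print(tokens)

# When the value matches the whole string, empty strings remain at both ends.
whole = interleave('1997'.split('1997'), '<year>')
print(whole)
```

The empty strings in the second result are why the chapter's code filters the tokens with `if i` before storing them in the tree.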

All that remains is to update the definition of `START_SYMBOL` with the new rule -- (2).

In [None]:
derivation_tree[START_SYMBOL] = [i for i in v if i]

Since at least one replacement took place, we update our definitions with the new rule -- (3).

In [None]:
derivation_tree[alt_key_0] = (value_0,)

Here is how our tree looks after this update.

In [None]:
display_derivation_tree(derivation_tree)

Our next input was as follows.

In [None]:
alt_key_1 = to_nonterminal('year')
value_1 = '1997'

Our rule corresponding to `START_SYMBOL` no longer contains a reference to the fragment `"1997"`. However, the newly added rule corresponding to `alt_key_0` does -- (1). Hence, we update the rule corresponding to `alt_key_0`.

In [None]:
arr = derivation_tree[alt_key_0][0].split(value_1)

In [None]:
v = rejoin(arr, alt_key_1)

This, as expected, replaces the string fragment `"1997"` with a token `<year>`.

In [None]:
v

We update the rule corresponding to `alt_key_0` as before -- (2).

In [None]:
derivation_tree[alt_key_0] = [i for i in v if i]

We add the new rule to our derivation tree -- (3).

In [None]:
derivation_tree[alt_key_1] = (value_1,)

The new tree is as below.

In [None]:
display_derivation_tree(derivation_tree)

Continuing with the next assignment.

In [None]:
alt_key_2 = to_nonterminal('kind')
value_2 = 'van'

Only the rule corresponding to `alt_key_0` contains a reference to `"van"`.

In [None]:
for k in derivation_tree:
    print(k, repr(derivation_tree[k]), 'has value_2:',
          any((value_2 in v) for v in derivation_tree[k]))

The rule corresponding to `alt_key_0` has a reference to `"van"` only in the second term of the tuple.
Hence, we apply the replacement only to the second term of that rule.

In [None]:
arr = derivation_tree[alt_key_0][1].split(value_2)

In [None]:
v = rejoin(arr, alt_key_2)

In [None]:
v

Finally, we update the rule using both the unchanged first term, and the updated second term of the tuple.

In [None]:
derivation_tree[alt_key_0] = derivation_tree[alt_key_0][0:1] + \
    [i for i in v if i]
derivation_tree[alt_key_2] = (value_2,)

Our new derivation tree is as below.

In [None]:
display_derivation_tree(derivation_tree)

Let us try to incorporate this in code.

In [None]:
class Miner(Miner):
    def has_value(self, value, token):
        return False if is_nonterminal(token) else (value in token)

In [None]:
class Miner(Miner):
    def replace_in_rule(self, nt_var, value, rule):
        fragments = [
            rejoin(token.split(value), nt_var)
            if self.has_value(value, token) else [token] for token in rule
        ]
        return tuple(token for f in fragments for token in f if token)

In [None]:
alt_key_3 = to_nonterminal('model')
value_3 = 'E350'

In [None]:
print(derivation_tree[alt_key_0])
m = Miner(None, None)
m.replace_in_rule(alt_key_3, value_3, derivation_tree[alt_key_0])

It should not affect rules that do not contain the given value.

In [None]:
print(derivation_tree[alt_key_1])
m.replace_in_rule(alt_key_3, value_3, derivation_tree[alt_key_1])

In [None]:
v = m.replace_in_rule(alt_key_3, value_3, derivation_tree[alt_key_0])
derivation_tree[alt_key_0] = v
derivation_tree[alt_key_3] = (value_3,)

With this, our derivation tree changes as below.

In [None]:
display_derivation_tree(derivation_tree)

Now, we need to apply a new definition to an entire grammar.

In [None]:
class Miner(Miner):
    def apply_new_definition(self, tree, nt_var, value):
        self.logger(0, "%s = %s" % (nt_var, repr(value)))
        applied = False
        for key, rule in tree.items():
            self.logger(1, "%s : %s" % (key, repr(rule)))
            if not any(self.has_value(value, token) for token in rule):
                continue
            tree[key] = self.replace_in_rule(nt_var, value, rule)
            self.logger(1, "%s -> %s" % (key, repr(tree[key])))
            applied = True
        return applied

To make life simple, we define a wrapper function `nt_var()` that will convert a token to its corresponding nonterminal symbol.

In [None]:
class Miner(Miner):
    def nt_var(self, var):
        return var if is_nonterminal(var) else to_nonterminal(var)

We try out `apply_new_definition()`.

In [None]:
var_4 = 'company'
m = Miner(None, None)
alt_key_4 = m.nt_var(var_4)
value_4 = 'Ford'
m.apply_new_definition(derivation_tree, alt_key_4, value_4)

We apply the new rules as below to our derivation tree.

In [None]:
derivation_tree[alt_key_4] = (value_4, )

Our derivation tree now looks as below.

In [None]:
display_derivation_tree(derivation_tree)

This algorithm is implemented as `get_derivation_tree()`. The important aspect of this implementation is that we are not relying on the order in which variables are assigned. A smaller fragment could in principle occur before a larger fragment that contains it. Hence, we loop until all the rule assignments have been used up or no new rules are introduced.

In [None]:
class Miner(Miner):
    def get_derivation_tree(self):
        tree = {START_SYMBOL: (self.my_input, )}
        my_vars = self.my_assignments.keys()

        while my_vars:
            self.logger(0, "assignments: %d" % len(my_vars))
            remaining = set()
            for var in my_vars:
                nt_var, value = self.nt_var(var), self.my_assignments[var]
                v = self.apply_new_definition(tree, nt_var, value)
                if v:
                    tree[nt_var] = (value, )
                    self.logger(0, "+%s = %s" % (nt_var, value))
                else:
                    remaining.add(var)

            if remaining == my_vars:
                break
            my_vars = remaining

        return tree

The `Miner` is used as follows:

In [None]:
with Tracer(VEHICLES[0]) as tracer:
    process_vehicle(tracer.my_input)
assignments = Tracker(tracer.my_input, tracer.trace).my_assignments
dt = Miner(tracer.my_input, assignments, log=True)
dt.tree

The obtained derivation tree is as below.

In [None]:
display_derivation_tree(Miner(tracer.my_input, assignments).tree)

Combining all the pieces:

In [None]:
trees = []
for vehicle in VEHICLES:
    print(vehicle)
    with Tracer(vehicle) as tracer:
        process_vehicle(tracer.my_input)
    assignments = Tracker(tracer.my_input, tracer.trace).my_assignments
    trees.append((tracer.my_input, assignments))
    for var, val in assignments.items():
        print(var + " = " + repr(val))
    print()

The corresponding derivation trees are below.

In [None]:
csv_dt = []
for inputstr, assignments in trees:
    print(inputstr)
    dt = Miner(inputstr, assignments)
    csv_dt.append(dt)
    display_derivation_tree(dt.tree)

### Recovering Grammar from Derivation Trees

We define a class `Infer` that can combine multiple derivation trees to produce the grammar. The initial grammar is empty.

In [None]:
class Infer:
    def __init__(self):
        self.grammar = {}

The `add_tree()` method gets a combined list of nonterminals from the current grammar and from the tree to be added, and updates the definitions of each nonterminal.

In [None]:
class Infer(Infer):
    def add_tree(self, t):
        merged_grammar = {}
        for key in list(self.grammar.keys()) + list(t.tree.keys()):
            alternates = set(self.grammar.get(key, []))
            if key in t.tree:
                alternates.add(''.join(t.tree[key]))
            merged_grammar[key] = list(alternates)
        self.grammar = merged_grammar
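
To see the merging step in isolation, here is a self-contained sketch. The function `merge_trees()` and the two hand-written trees are hypothetical; unlike `add_tree()`, the sketch uses lists rather than sets, so that the order of alternatives is predictable.

```python
def merge_trees(trees):
    """Merge straight-line grammars into one grammar with alternatives."""
    grammar = {}
    for tree in trees:
        for key, rule in tree.items():
            alternatives = grammar.setdefault(key, [])
            expansion = ''.join(rule)  # one alternative per distinct expansion
            if expansion not in alternatives:
                alternatives.append(expansion)
    return grammar


tree_1 = {
    '<start>': ('<vehicle>',),
    '<vehicle>': ('<year>', ',', '<kind>'),
    '<year>': ('1997',),
    '<kind>': ('van',),
}
tree_2 = {
    '<start>': ('<vehicle>',),
    '<vehicle>': ('<year>', ',', '<kind>'),
    '<year>': ('2000',),
    '<kind>': ('car',),
}

merged = merge_trees([tree_1, tree_2])
for key, alternatives in merged.items():
    print(key, '::=', ' | '.join(alternatives))
```

Rules with identical expansions, such as `<vehicle>`, collapse into a single alternative, while differing leaves, such as `<year>`, become separate alternatives.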

The `add_tree()` is used as follows:

In [None]:
inventory_grammar = Infer()
for dt in csv_dt:
    inventory_grammar.add_tree(dt)

In [None]:
syntax_diagram(inventory_grammar.grammar)

Given execution traces from various inputs, one can define `recover_grammar()` to obtain the complete grammar from the traces.

In [None]:
def recover_grammar(traces):
    m = Infer()
    for inputstr, trace in traces:
        dt = Miner(inputstr, Tracker(inputstr, trace).my_assignments)
        m.add_tree(dt)
    return m.grammar

#### Example 1. Recovering the Inventory Grammar

In [None]:
traces = []
for inputstr in VEHICLES:
    with Tracer(inputstr) as tracer:
        process_vehicle(tracer.my_input)
    traces.append((tracer.my_input, tracer.trace))
inventory_grammar = recover_grammar(traces)

In [None]:
syntax_diagram(inventory_grammar)

#### Example 2. Recovering URL Grammar

Our algorithm is robust enough to recover grammars from real-world programs. For example, the `urlparse()` function in the Python `urllib.parse` module accepts the following sample URLs.

In [None]:
URLS = [
    'http://user:pass@www.google.com:80/?q=path#ref',
    'https://www.cispa.saarland:80/',
    'http://www.fuzzingbook.org/#News',
]

The `urllib` module caches its intermediate results for faster access. Hence, we need to clear this cache using `clear_cache()` before every invocation, so that each input is actually parsed again.

In [None]:
from urllib.parse import urlparse, clear_cache
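
As a point of reference for the mined grammar, here is a short sketch of what `urlparse()` returns for the first sample URL. The variable names `sample_url` and `parts` are chosen only for illustration.

```python
from urllib.parse import urlparse, clear_cache

sample_url = 'http://user:pass@www.google.com:80/?q=path#ref'

clear_cache()  # make sure the URL is parsed afresh
parts = urlparse(sample_url)

# The named fields of the result correspond to the major parts of a URL.
for field in ('scheme', 'netloc', 'path', 'query', 'fragment'):
    print(field, '=', repr(getattr(parts, field)))
```

A grammar mined from the parser's variables can be expected to contain nonterminals resembling these components, although the exact names depend on the local variables used inside `urllib/parse.py`.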

We use the sample URLs to recover the grammar as follows:

In [None]:
traces = []
for inputstr in URLS:
    clear_cache()
    with Tracer(inputstr, files=['urllib/parse.py']) as tracer:
        urlparse(tracer.my_input)
    traces.append((tracer.my_input, tracer.trace))
url_grammar = recover_grammar(traces)

The recovered grammar describes the URL format reasonably well.

In [None]:
syntax_diagram(url_grammar)

### Fuzzing

We can now use our recovered inventory grammar for fuzzing as follows:

In [None]:
f = GrammarFuzzer(inventory_grammar)
for i in range(10):
    print(f.fuzz())

The recovered URL grammar can be used for fuzzing in the same way:

In [None]:
f = GrammarFuzzer(url_grammar)
for i in range(10):
    print(f.fuzz())

### Problems with the Simple Miner

One of the problems with our simple grammar miner is the assumption that the values assigned to variables are stable. Unfortunately, that may not hold true in all cases. For example, here is a URL with a slightly different format.

In [None]:
URLS_X = URLS + ['ftp://freebsd.org/releases/5.8']

The grammar generated from this set of samples is not as nice as what we got earlier.

In [None]:
traces = []
for inputstr in URLS_X:
    clear_cache()
    with Tracer(inputstr, files=['urllib/parse.py']) as tracer:
        urlparse(tracer.my_input)
    traces.append((tracer.my_input, tracer.trace))
grammar = recover_grammar(traces)
syntax_diagram(grammar)

Clearly, something has gone wrong.

To investigate why the `url` definition has gone wrong, let us inspect the trace for the URL.

In [None]:
clear_cache()
with Tracer(URLS_X[0]) as tracer:
    urlparse(tracer.my_input)
for i, t in enumerate(tracer.trace):
    if t[0] in {'call', 'line'} and 'parse.py' in str(t[2]) and t[3]:
        print(i, t[2]._t()[1], t[3:])

Notice how the value of `url` changes as the parsing progresses? This violates our assumption that the value assigned to a variable is stable. We next look at how this limitation can be removed.

## Grammar Miner with Reassignment

One way to uniquely identify different variables is to annotate them with *line numbers* both when they are defined and also when their value changes. Consider the code fragment below.

### Tracking variable assignment locations

In [None]:
def C(cp_1):
    c_2 = cp_1 + '@2'
    c_3 = c_2 + '@3'
    return c_3


def B(bp_7):
    b_8 = bp_7 + '@8'
    return C(b_8)


def A(ap_12):
    a_13 = ap_12 + '@13'
    a_14 = B(a_13) + '@14'
    a_14 = a_14 + '@15'
    a_13 = a_14 + '@16'
    a_14 = B(a_13) + '@17'
    a_14 = B(a_13) + '@18'

Notice how each variable is named after the line where it is defined, and how each value is annotated with the line on which it was last changed.

Let us run this under the trace.

In [None]:
with Tracer('____') as tracer:
    A(tracer.my_input)
    
for t in tracer.trace:
    print(t[0], "%d:%s" % (t[2].line_no, t[2].method), t[3])

Each variable was first referenced as follows:

* `cp_1` -- *call* `1:C`
* `c_2` -- *line* `3:C` (but the previous event was *line* `2:C`)
* `c_3` -- *line* `4:C` (but the previous event was *line* `3:C`)
* `bp_7` -- *call* `7:B`
* `b_8` -- *line* `9:B` (but the previous event was *line* `8:B`)
* `ap_12` -- *call* `12:A`
* `a_13` -- *line* `14:A` (but the previous event was *line* `13:A`)
* `a_14` -- *line* `15:A` (the previous event was *return* `9:B`. However, the previous event in A was *line* `14:A`)
* reassign `a_14` at *15* -- *line* `16:A` (the previous event was *line* `15:A`)
* reassign `a_13` at *16* -- *line* `17:A` (the previous event was *line* `16:A`)
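* reassign `a_14` at *17* -- *line* `18:A` (the previous event was *return* `9:B`. However, the previous event in A was *line* `17:A`)
* reassign `a_14` at *18* -- *return* `18:A` (the previous event in A was *line* `18:A`)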

So, our observations are that, if it is a call, the current location is the right one for any new variables being defined. On the other hand, if the variable is being referenced for the first time (or reassigned a new value), then the right location to consider is the previous location *in the same method invocation*. Next, let us see how we can incorporate this information into variable naming.
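
The reason for this offset is that a *line* event fires before the line is executed, so a newly assigned variable only becomes visible at the following event. The sketch below checks this on a toy function; `first_seen()` and `sample()` are hypothetical names, and line positions are reported relative to the `def` line of the traced function.

```python
import sys


def first_seen(func, *args):
    """Map each local variable to the (event, relative line) where it first shows up."""
    seen = {}
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore all other functions
        offset = frame.f_lineno - code.co_firstlineno
        for name in frame.f_locals:
            seen.setdefault(name, (event, offset))
        return tracer

    oldtrace = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(oldtrace)
    return seen


def sample(p):          # relative line 0
    x = p + '@1'        # relative line 1: x is assigned here ...
    y = x + '@2'        # relative line 2: ... but first observed here
    return y            # relative line 3: y is first observed here


result = first_seen(sample, 'in')
for name, where in result.items():
    print(name, where)
```

The parameter `p` is already visible at the *call* event, whereas `x` and `y` are each observed one event after the line that assigns them.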

In order to account for variable reassignments, we need to have a more intelligent data structure than a dictionary for storing variables. We first define a simple interface `Vars`. It acts as a container for variables, and is instantiated as `my_assignments`.

### Vars

The `Vars` stores references to variables as they occur during parsing in its internal dictionary `defs`. We initialize the dictionary with the original string.

In [None]:
class Vars:
    def __init__(self, original):
        self.defs = {START_SYMBOL: original}

The dictionary needs two methods: `update()` that takes a set of key-value pairs to update itself, and `_set_kv()` that updates a particular key-value pair.

In [None]:
class Vars(Vars):
    def _set_kv(self, k, v):
        self.defs[k] = v

    def __setitem__(self, k, v):
        self._set_kv(k, v)

    def update(self, v):
        for k, v in v.items():
            self._set_kv(k, v)

`Vars` is a proxy for the internal dictionary. For example, here is how one can use it.

In [None]:
v = Vars('test')
v.defs

In [None]:
v['x'] = 'X'
v.defs

In [None]:
v.update({'x': 'x', 'y': 'y'})
v.defs

### SingleAssignmentVars

We now extend the simple `Vars` to account for variable reassignments. For this, we define `SingleAssignmentVars`.

The idea for detecting reassignments and renaming variables is as follows. First, we keep track of the previous reassignments to particular variables using `accessed_seq_var`, which maps each variable to the sequence number of its latest rename. Second, we maintain `new_vars`, which contains the set of all new variables that were added during the latest update.

In [None]:
class SingleAssignmentVars(Vars):
    def __init__(self, original):
        self.accessed_seq_var = {}
        self.new_vars = set()
        super().__init__(original)

In [None]:
class SingleAssignmentVars(SingleAssignmentVars):
    def update(self, v):
        self.new_vars = set()
        for k, v in v.items():
            self._set_kv(k, v)
        return self.new_vars

The variable name now incorporates an index of how many reassignments it has gone through, effectively making each reassignment a unique variable.

In [None]:
class SingleAssignmentVars(SingleAssignmentVars):
    def var_name(self, var):
        return "%s[%d]" % (var, self.accessed_seq_var[var])

While storing a variable, we first need to check whether it was previously known. If it was not, we need to initialize its rename count. This is accomplished by `var_access`.

In [None]:
class SingleAssignmentVars(SingleAssignmentVars):
    def var_access(self, var):
        if var not in self.accessed_seq_var:
            self.accessed_seq_var[var] = 0
        return self.var_name(var)

During a variable reassignment, we update the `accessed_seq_var` to reflect the new count.

In [None]:
class SingleAssignmentVars(SingleAssignmentVars):
    def var_assign(self, var):
        self.accessed_seq_var[var] += 1
        self.new_vars.add(self.var_name(var))
        return self.var_name(var)

This trio of methods can be used as follows:

In [None]:
sav = SingleAssignmentVars('')
sav.defs

In [None]:
sav.var_access('v1')

In [None]:
sav.var_assign('v1')

Assigning to it again increments the counter.

In [None]:
sav.var_assign('v1')

The core of the logic is in `_set_kv()`. When a variable is being assigned, we get the sequenced variable name `s_var`. If the sequenced variable name was previously unknown in `defs`, then we have no further concerns. We add the sequenced variable to `defs`.

If the variable is previously known, then it is an indication of a possible reassignment. In this case, we look at the value the variable is holding and check whether it has changed. If it has not, then it is not a reassignment.

If the value has changed, it is a reassignment. We first increment the variable usage sequence using `var_assign`, retrieve the new name, and store the value under the new name in `defs`.

In [None]:
class SingleAssignmentVars(SingleAssignmentVars):
    def _set_kv(self, var, val):
        s_var = self.var_access(var)
        if s_var in self.defs and self.defs[s_var] == val:
            return
        self.defs[self.var_assign(var)] = val

Here is how it can be used. Assigning a variable the first time initializes its counter.

In [None]:
sav = SingleAssignmentVars('')
sav['x'] = 'X'
sav.defs

If the variable is assigned again with the same value, it is probably not a reassignment.

In [None]:
sav['x'] = 'X'
sav.defs

However, if the value changed, it is a reassignment.

In [None]:
sav['x'] = 'Y'
sav.defs

There is a subtlety here. It is possible for a child method to be called from the middle of a parent method, and for both to use the same variable name with different values. In this case, when the child returns, the parent will have the old variable with the old value in context. With our implementation, we consider this a reassignment. However, this is OK because adding a new reassignment is harmless, but missing one is not. Further, we will discuss later how this can be avoided.
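
The following self-contained sketch illustrates this subtlety. It condenses the renaming logic into a single hypothetical function, `rename_assignments()`, which takes a flat sequence of observed `(name, value)` pairs; it is a simplified stand-in for illustration, not the `SingleAssignmentVars` class itself.

```python
def rename_assignments(observations):
    """Give each observed change of a variable a new sequence number."""
    counters = {}  # variable name -> latest sequence number
    defs = {}      # sequenced name -> value
    for name, value in observations:
        seq = counters.setdefault(name, 0)
        current = "%s[%d]" % (name, seq)
        # Same value as the latest rename: not a reassignment.
        if current in defs and defs[current] == value:
            continue
        counters[name] = seq + 1
        defs["%s[%d]" % (name, seq + 1)] = value
    return defs


# The parent sees x = 'ab', the child sees its own x = 'a',
# and after the child returns the parent sees x = 'ab' again.
observed = [('x', 'ab'), ('x', 'ab'), ('x', 'a'), ('x', 'ab')]
renamed = rename_assignments(observed)
```

Here the parent's `x` is recorded twice, once before and once after the child call, even though the parent never reassigned it. The extra entry is redundant but does not lose information.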

### AssignmentTracker

The `AssignmentTracker` keeps the assignment definitions using the `SingleAssignmentVars` we defined previously.

It contains a number of variables. The `current_event` contains the event name that is being processed. The `var_def_lines` stores the line number where a particular variable was defined.

In [None]:
class AssignmentTracker(Tracker):
    def __init__(self, my_input, trace, **kwargs):
        self.my_input = my_input

        self.current_event = None
        self.var_def_lines = {}
        
        self.method_init()
        self.my_assignments = SingleAssignmentVars(my_input)
        
        self.trace = trace
        self.options(kwargs)
        self.process()

The `method_init()` method takes care of keeping track of method invocations. The `method_register` stores the number of method invocations we have seen so far; each method invocation gets a new number. The `method_id` keeps the method identifier (made from the method name and the method invocation number in `method_register`). The `event_locations` dictionary keeps track of the locations accessed *within each method invocation*, which is used for tracking the line numbers of variable definitions. Finally, `mstack` contains the stack of method calls we have seen so far.

In [None]:
class AssignmentTracker(AssignmentTracker):
    def method_init(self):
        self.method_register = 0
        self.method_id = START_SYMBOL
        self.mstack = [self.method_id]
        self.event_locations = {self.method_id:[]}

The stack looks like below when it is initialized.

In [None]:
a = AssignmentTracker('hello', [])
a.method_init()
a.mstack

To fine-tune the process, we define an optional parameter called `track_return`. When tracing a method return, Python passes the returned value to the trace function. If `track_return` is set, we capture this value as a virtual variable.

* `track_return` -- if true, add a *virtual variable* to the Vars representing the return value

In [None]:
class AssignmentTracker(AssignmentTracker):
    def options(self, kwargs):
        self.track_return = kwargs.get('track_return') or False
        super().options(kwargs)

There can be different kinds of events during a trace: `call` when a function is entered, `return` when the function returns, `exception` when an exception is thrown, and `line` when a statement is executed.

The previous `Tracker` was too simplistic in that it did not distinguish between the different events. We rectify that and define `on_call()`, `on_return()`, and `on_line()`, which get called on their corresponding events.

Note that `on_return()` also calls `on_line()`. The reason is that Python invokes the trace function *before* the corresponding line is executed. Hence, effectively, `on_return()` is called with the bindings produced by the execution of the previous statement in the environment. Our processing is in effect done on values that were bound by the previous statement. Hence, calling `on_line()` here is appropriate, as it gives the event handler a chance to work on the previous bindings.

In [None]:
class AssignmentTracker(AssignmentTracker):
    def update_vars(self, my_vars):
        added_vars = self.my_assignments.update(self.selected(my_vars))
        self.var_location_register(added_vars)
        
    def on_call(self, arg, cxt, my_vars):
        self.method_enter(cxt)
        self.update_vars(cxt.parameters(my_vars))

    def on_line(self, arg, cxt, my_vars):
        self.method_statement(cxt)
        self.update_vars(my_vars)

    def on_return(self, arg, cxt, my_vars):
        self.on_line(arg, cxt, my_vars)
        self.method_exit(cxt)

        if not self.track_return:
            return
        var = '(<-%s)' % cxt.method
        self.update_vars({var:arg})

    def on_exception(self, arg, cxt, my_vars):
        return

    def track_event(self, event, arg, cxt, my_vars):
        self.current_event = event
        dispatch = {
            'call': self.on_call,
            'return': self.on_return,
            'line': self.on_line,
            'exception': self.on_exception
        }
        dispatch[event](arg, cxt, my_vars)

We also define bookkeeping code in `method_enter()`, `method_exit()`, and `method_statement()`, which are the methods responsible for keeping track of the method stack. The basic idea is that each `method_enter()` represents a new method invocation. Hence, it merits a new method id, which is generated from `method_register` and saved in `method_id`. Since this is a new method, the method stack is extended by one element with this id. In the case of `method_exit()`, we pop the method stack and reset the current `method_id` to what was below the current one.

In [None]:
class AssignmentTracker(AssignmentTracker):
    def indent(self):
        return len(self.mstack) * "\t"

    def method_enter(self, cxt):
        self.method_register += 1
        self.method_id = "%s:%d" % (cxt.method, self.method_register)
        self.logger('call', "%s%s" % (self.indent(), self.method_id))
        self.mstack.append(self.method_id)
        self.register_event(cxt)

    def method_exit(self, cxt):
        self.register_event(cxt)
        self.mstack.pop()
        self.logger('return', "%s%s" % (self.indent(), self.method_id))
        self.method_id = self.mstack[-1]

    def method_statement(self, cxt):
        self.register_event(cxt)

For each of the method events, we also register the event using `register_event()` which keeps track of the line numbers that were referenced in *this* method.

In [None]:
class AssignmentTracker(AssignmentTracker):
    def register_event(self, cxt):
        if self.method_id not in self.event_locations:
            self.event_locations[self.method_id] = []
        self.event_locations[self.method_id].append(cxt.line_no)

The `var_location_register()` keeps the locations of newly added variables.

In [None]:
class AssignmentTracker(AssignmentTracker):
    def var_location_register(self, my_vars):
        def loc(mid):
            # First reference. Check the current event. If it is call, we can use
            # the current location info as is.
            if self.current_event == 'call':
                return self.event_locations[mid][-1]
            # if it is line, then use the previous event in the current method invocation.
            elif self.current_event == 'line':
                return self.event_locations[mid][-2]
            # return is similar to the line.
            elif self.current_event == 'return':
                return self.event_locations[mid][-2]
            else:
                assert False
                
        my_loc = loc(self.method_id)
        for var in my_vars:
            self.var_def_lines[var] = my_loc

We can now use `AssignmentTracker` to track the different variables. To verify that our variable line number inference works, we recover definitions from the functions `A()`, `B()`, and `C()` (with data annotations removed so that the input fragments are correctly identified).

In [None]:
def C(cp_1):
    c_2 = cp_1
    c_3 = c_2
    return c_3


def B(bp_7):
    b_8 = bp_7
    return C(b_8)


def A(ap_12):
    a_13 = ap_12
    a_14 = B(a_13)
    a_14 = a_14
    a_13 = a_14
    a_14 = B(a_13)
    a_14 = B(a_14)[3:]

Running `A()` with sufficient input.

In [None]:
with Tracer('---xxx') as tracer:
    A(tracer.my_input)
tracker = AssignmentTracker(tracer.my_input, tracer.trace, log=True)
for k, v in tracker.my_assignments.defs.items():
    print(k, tracker.var_def_lines.get(k), '=', repr(v))
print()

As can be seen, the line numbers are now correctly identified for each variable.

Let us add a final method `defined_vars()` to retrieve the variable names correctly.

In [None]:
import re

In [None]:
class AssignmentTracker(AssignmentTracker):
    def defined_vars(self):
        def to_lno(v):
            if v == START_SYMBOL:
                return v
            else:
                grp = re.match(r'(.+)\[(.+)\]', v).groups()
                return "%s:%d[%s]" % (grp[0], self.var_def_lines[v], grp[1])

        return {to_lno(k): v for k, v in self.my_assignments.defs.items()}
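
The renaming step can be tried in isolation. The sketch below applies the same regular expression to a single sequenced name; `with_line_number()` is a hypothetical helper, and the line number `15` is made up for the example.

```python
import re


def with_line_number(seq_var, def_lines):
    """Rewrite 'name[seq]' as 'name:line[seq]' using the recorded definition line."""
    name, seq = re.match(r'(.+)\[(.+)\]', seq_var).groups()
    return "%s:%d[%s]" % (name, def_lines[seq_var], seq)


print(with_line_number('a_14[2]', {'a_14[2]': 15}))
```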

Let us try retrieving the assignments for a real-world example.

In [None]:
traces = []
for inputstr in URLS_X:
    clear_cache()
    with Tracer(inputstr, files=['urllib/parse.py']) as tracer:
        urlparse(tracer.my_input)
    traces.append((tracer.my_input, tracer.trace))

    tracker = AssignmentTracker(tracer.my_input, tracer.trace, log=True)
    for k, v in tracker.defined_vars().items():
        print(k, '=', repr(v))
    print()

The line numbers of variables can be verified from the source code of [urllib/parse.py](https://github.com/python/cpython/blob/3.6/Lib/urllib/parse.py).

### Obtaining a Derivation Tree

The previous `get_derivation_tree` was simplistic in that it tried to check for string inclusions without regard to the order in which the variable assignments were made. However, when one considers parsing, strings are fragmented in order. That is, a larger string that includes a smaller string will be assigned to a variable *before* the smaller string is assigned to a variable.

Hence, while mining the derivation tree, we only look at variable assignments that happened *before* the current variable assignment took place. The algorithm is as follows.

For each (*var*, *value*) found:
* We search for occurrences of *value* in the rules present in the grammar
* We replace them by <*var*>
* We add a new rule <*var*> $\rightarrow$ *value* to the grammar
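
The steps above can be sketched in a self-contained way as follows. This is a simplified stand-in rather than the `Miner` class: the function `mine_tree()` and its helper `is_nt()` are hypothetical, the assignments are passed as an ordered list of pairs, and a variable is only added if its value replaced something in an existing rule.

```python
import re


def is_nt(token):
    """Rough nonterminal check: a token of the form <...> without spaces."""
    return re.fullmatch(r'<[^<> ]+>', token) is not None


def mine_tree(assignments):
    """assignments: ordered (var, value) pairs, earliest assignment first."""
    tree = {}
    for var, value in assignments:
        if not value:
            continue  # an empty fragment carries no information
        nt = "<%s>" % var
        applied = False
        for key, rule in tree.items():
            new_rule = []
            for token in rule:
                if is_nt(token) or value not in token:
                    new_rule.append(token)
                    continue
                # Split the terminal around each occurrence of the value.
                for i, part in enumerate(token.split(value)):
                    if i > 0:
                        new_rule.append(nt)
                    if part:
                        new_rule.append(part)
                applied = True
            tree[key] = tuple(new_rule)
        # The very first variable seeds the tree; later ones must have matched.
        if applied or not tree:
            tree[nt] = (value,)
    return tree


tree = mine_tree([('start', '1997,van'), ('year', '1997'),
                  ('kind', 'van'), ('other', 'zzz')])
```

Because `year` and `kind` are processed after `start`, they carve their fragments out of the start rule, while `other` never occurs in the input and is dropped.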

In [None]:
class Miner(Miner):
    def get_derivation_tree(self):
        tree = {}
        for var in self.my_assignments:
            nt_var, value = self.nt_var(var), self.my_assignments[var]
            if tree:
                v = self.apply_new_definition(tree, nt_var, value)
                if not v:
                    continue
            self.logger(0, "+%s = %s" % (nt_var, value))
            tree[nt_var] = (value, )
        return tree

Does handling variable reassignments help with our URL examples? We look at these next.

#### Example 1: Recovering URL Grammar

First, we obtain the derivation tree of URL 1.

##### URL 1 derivation tree

In [None]:
clear_cache()
with Tracer(URLS_X[0], files=['urllib/parse.py']) as tracer:
    urlparse(tracer.my_input)
sm = AssignmentTracker(tracer.my_input, tracer.trace)
dt = Miner(tracer.my_input, sm.defined_vars())
display_derivation_tree(dt.tree)

Next, we obtain the derivation tree of URL 4.

##### URL 4 derivation tree

In [None]:
clear_cache()
with Tracer(URLS_X[-1], files=['urllib/parse.py']) as tracer:
    urlparse(tracer.my_input)
sm = AssignmentTracker(tracer.my_input, tracer.trace)
dt = Miner(tracer.my_input, sm.defined_vars())
display_derivation_tree(dt.tree)

The derivation trees seem to belong to the same grammar. Hence, we obtain the grammar for the complete set. First, we update `recover_grammar()` to use `AssignmentTracker`.

### Recover Grammar

In [None]:
def recover_grammar(traces):
    m = Infer()
    for inputstr, trace in traces:
        st = AssignmentTracker(inputstr, trace)
        dt = Miner(inputstr, st.defined_vars())
        m.add_tree(dt)
    return m.grammar

Next, we use the modified `recover_grammar()` on derivation trees obtained from URLs.

In [None]:
traces = []
for inputstr in URLS_X:
    clear_cache()
    with Tracer(inputstr, files=['urllib/parse.py']) as tracer:
        urlparse(tracer.my_input)
    traces.append((tracer.my_input, tracer.trace))
grammar = recover_grammar(traces)

The recovered grammar is below.

In [None]:
syntax_diagram(grammar)

Let us fuzz a little to see if the produced values are sane.

In [None]:
f = GrammarFuzzer(grammar)
for i in range(10):
    print(f.fuzz())

Our modifications do seem to help. Next, we check whether we can still retrieve the grammar for the inventory.

#### Example 2: Recovering Inventory Grammar

In [None]:
traces = []
for inputstr in VEHICLES:
    with Tracer(inputstr) as tracer:
        process_vehicle(tracer.my_input)
    traces.append((tracer.my_input, tracer.trace))
grammar = recover_grammar(traces)

In [None]:
syntax_diagram(grammar)

Using fuzzing to produce values from the grammar.

In [None]:
f = GrammarFuzzer(grammar)
for i in range(10):
    print(f.fuzz())

### Problems with the Grammar Miner with Reassignment

One of the problems with our grammar miner is that it doesn't yet account for the current context. That is, when replacing, a variable can replace tokens that it does not have access to (and hence, that it is not a fragment of). Consider this example.

In [None]:
with Tracer(INVENTORY) as tracer:
    process_inventory(tracer.my_input)
sm = AssignmentTracker(tracer.my_input, tracer.trace)
dt = Miner(tracer.my_input, sm.defined_vars())
display_tree(stgrammar_to_tree(dt.tree), graph_attr=lr_graph)

As can be seen, the derivation tree obtained is not quite what we expected. The issue is easily seen if we enable logging in the `Miner`.

In [None]:
dt = Miner(tracer.my_input, sm.my_assignments.defs, log=True)

Look for the point where `car` gets replaced, that is, the string `+<kind[2]> = car` in the above log. From the next loop onwards, one can see that the definition of `vehicle[2]` has changed as follows:

* `<vehicle[2]> : ('2000,car,', '<company[2]>', ',', '<model[2]>')`
* `<vehicle[2]> : ('2000,', '<kind[2]>', ',', '<company[2]>', ',', '<model[2]>')`

This is as expected. However, we note that `inventory[1]` has also changed, from the first form below to the second.

* `<inventory[1]> : ('<vehicle[1]>', '\n', '<vehicle[2]>', '\n1999,car,Chevy,Venture')`
* `<inventory[1]> : ('<vehicle[1]>', '\n', '<vehicle[2]>', '\n1999,', '<kind[2]>', ',Chevy,Venture')`

That is, the variable `kind[2]` replaced the value `car` in the last token of `inventory[1]`. However, `kind[2]` is from `process_vehicle()`, which should have access only to `vehicle[2]`. The problem here is that, because of this replacement, later replacements such as `vehicle[2]` cannot occur any more. One way to overcome this is to restrict the variable replacements to only those variables that are in scope.
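
The effect can be reproduced with a small self-contained sketch. The function `replace_value()` is a hypothetical, simplified replacement routine, and the nonterminal `<vehicle[3]>` for the third record is an assumed name used only for illustration.

```python
def replace_value(rule, nt, value):
    """Replace each occurrence of value inside the terminal tokens of rule by nt."""
    out = []
    for token in rule:
        is_nonterminal = token.startswith('<') and token.endswith('>')
        if is_nonterminal or value not in token:
            out.append(token)
            continue
        for i, part in enumerate(token.split(value)):
            if i > 0:
                out.append(nt)
            if part:
                out.append(part)
    return tuple(out)


inventory = ('<vehicle[1]>', '\n', '<vehicle[2]>', '\n1999,car,Chevy,Venture')
third = '1999,car,Chevy,Venture'

# Out-of-scope replacement first: 'car' is carved out of the inventory rule ...
broken = replace_value(inventory, '<kind[2]>', 'car')
# ... so the complete third record no longer occurs in any single token.
stuck = replace_value(broken, '<vehicle[3]>', third)

# Without the out-of-scope replacement, the third record is found as a whole.
intact = replace_value(inventory, '<vehicle[3]>', third)
```

After the premature replacement, `stuck` is identical to `broken`, whereas `intact` ends with the expected nonterminal for the third record.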

## Grammar Miner with Scope

We need to incorporate inspection of the variables in the current context. We already have a stack of method calls so that we can obtain the current method at any point. We need to do the same for variables.

For that, we define a class `InputStack` which holds the unmodified record of activation of the method. Essentially, we start with the original input at the base of the stack, and for each new method call we push the parameters of that call into the stack as a new record.

### Input Stack

In [None]:
class InputStack:
    def __init__(self, i, fragment_len=FRAGMENT_LEN):
        self.inputs = [('*', {START_SYMBOL: i})]
        self.fragment_len = fragment_len

In order to check whether a particular variable should be saved, we define `in_current_record()`, which checks only the last activation record for inclusion (rather than the original input string).

In [None]:
class InputStack(InputStack):
    def in_current_record(self, val):
        return any(val in var for var in self.inputs[-1][1].values())

In [None]:
def display_stack(istack):
    def stack_to_tree(stack):
        current, *rest = stack
        if not rest:
            return (repr(current), [])
        return (repr(current), [stack_to_tree(rest)])
    display_tree(stack_to_tree(istack.inputs))

In [None]:
my_istack = InputStack('hello my world')

In [None]:
display_stack(my_istack)

In [None]:
my_istack.in_current_record('hello')

In [None]:
my_istack.in_current_record('bye')

In [None]:
my_istack.inputs.append(('say', {'greeting': 'hello', 'location': 'world'}))

In [None]:
display_stack(my_istack)

In [None]:
my_istack.in_current_record('hello')

In [None]:
my_istack.in_current_record('my')

We define the method `ignored()` that returns true if either the value is not a string, or the length of the value is not greater than the defined `fragment_len`.

In [None]:
class InputStack(InputStack):
    def ignored(self, val):
        return not (isinstance(val, str) and len(val) > self.fragment_len)

In [None]:
my_istack = InputStack('hello world')
my_istack.ignored(1)

In [None]:
my_istack.ignored('a')

In [None]:
my_istack.ignored('help')

We can now define the `include()` method that checks whether the variable needs to be ignored, and if it is not to be ignored, whether the variable value is present in the activation record.

In [None]:
class InputStack(InputStack):
    def include(self, k, val):
        if self.ignored(val):
            return False
        return self.in_current_record(val)

Finally, we define `push()` that pushes relevant variables in the current context to the activation record.

In [None]:
class InputStack(InputStack):
    def push(self, method, inputs):
        my_inputs = {k: v for k, v in inputs.items() if self.include(k, v)}
        self.inputs.append((method, my_inputs))

When a method returns, we also need a corresponding `pop()` to unwind the activation record.

In [None]:
class InputStack(InputStack):
    def pop(self):
        self.inputs.pop()

We also define a convenience method `height()` that returns the height of the stack, that is, the number of activation records above the original input.

In [None]:
class InputStack(InputStack):
    def height(self):
        return len(self.inputs) - 1

In [None]:
my_istack = InputStack('hello world')
display_stack(my_istack)

In [None]:
my_istack.push('say', {'greeting': 'hello', 'location': 'world'})
display_stack(my_istack)
my_istack.height()

In [None]:
my_istack.pop()
display_stack(my_istack)
my_istack.height()

### ScopedVars

We need to update our `SingleAssignmentVars` to include information about where the variable was defined.

In [None]:
class ScopedVars(SingleAssignmentVars):
    def __init__(self, original):
        self.accessed_seq_var = {}
        self.defs = {START_SYMBOL: (original, ':0')}
        self.new_vars = set()
        self.method_id = None

We also need to save the current method invocation so as to determine which variables are in scope.

In [None]:
class ScopedVars(ScopedVars):
    def set_current_method(self, method):
        self.method_id = method
        if method not in self.accessed_seq_var:
            self.accessed_seq_var[method] = {}

This information is now incorporated in the variable name.

In [None]:
class ScopedVars(ScopedVars):
    def var_name(self, var):
        return "%s[%s:vseq:%d]" % (var, self.method_id,
                                 self.accessed_seq_var[self.method_id][var])

It is useful to define a method `split_var()` that can recover the information from the string constructed by `var_name()`.

In [None]:
def split_var(var):
    r = r'([^:<>\[\]]+):([^:<>\[\]]+):([^:<>\[\]]+)\[([^:<>\[\]]+):([^:<>\[\]]+):vseq:([^:<>\[\]]+)\]'
    v = re.match(r, var)
    if v is None:
        return {}
    vals = v.groups()
    return {
        'method': vals[0],
        'var': vals[1],
        'lno': vals[2],
        'mscope': vals[3],
        'mseq': vals[4],
        'vseq': vals[5]
    }
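
For illustration, the same decomposition can be written with named groups. The sketch below is a self-contained variant, not a replacement for `split_var()`; the function name `parse_qualified()`, the constants `FIELD` and `PATTERN`, and the sample name with line number `29` are assumptions made for the example.

```python
import re

# One field: any run of characters other than ':', '<', '>', '[' and ']'.
FIELD = r'[^:<>\[\]]+'
PATTERN = re.compile(
    r'(?P<method>%s):(?P<var>%s):(?P<lno>%s)'
    r'\[(?P<mscope>%s):(?P<mseq>%s):vseq:(?P<vseq>%s)\]' % ((FIELD,) * 6))


def parse_qualified(name):
    """Return the six components of a qualified variable name, or {} if it does not match."""
    m = PATTERN.match(name)
    return m.groupdict() if m else {}


parts = parse_qualified('process_vehicle:kind:29[process_vehicle:3:vseq:1]')
```

All components come back as strings, so a caller that needs the method sequence as a number has to convert `parts['mseq']` with `int()`.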

As before, `var_access` simply initializes the corresponding counter.

In [None]:
class ScopedVars(ScopedVars):
    def var_access(self, var):
        if var not in self.accessed_seq_var[self.method_id]:
            self.accessed_seq_var[self.method_id][var] = 0
        return self.var_name(var)

During a variable reassignment, we update the `accessed_seq_var` to reflect the new count.

In [None]:
class ScopedVars(ScopedVars):
    def var_assign(self, var):
        self.accessed_seq_var[self.method_id][var] += 1
        self.new_vars.add(self.var_name(var))
        return self.var_name(var)

### Scope Tracker

With the `InputStack` and `ScopedVars` defined, we can now define the `ScopeTracker`. The `ScopeTracker` only saves variables if the value is present in the current activation record.

In [None]:
class ScopeTracker(AssignmentTracker):
    def __init__(self, my_input, trace, **kwargs):
        self.current_event = None
        self.var_def_lines = {}

        self.method_init()
        self.my_assignments = ScopedVars(my_input)

        self.trace = trace
        self.options(kwargs)
        self.istack = InputStack(my_input, fragment_len=self.fragment_len)
        self.process()

We define a wrapper for checking whether a variable is present in the activation record.

In [None]:
class ScopeTracker(ScopeTracker):
    def include(self, var, value):
        return self.istack.include(var, value)

The method `scope()` retrieves the method identifier stored at the given index of the method stack.

In [None]:
class ScopeTracker(ScopeTracker):
    def scope(self, record):
        return self.mstack[record]


In [None]:
class ScopeTracker(ScopeTracker):
    def update_vars(self, my_vars, r):
        added_vars = self.my_assignments.update({
            var: (val, self.scope(r))
            for var, val in self.selected(my_vars).items()
        })
        self.var_location_register(added_vars)

We now define the methods `on_call`, `on_line` and `on_return`, which are responsible for processing the corresponding events. The `on_call()` method is similar to the `on_call()` method of the parent. The main changes are that

* It pushes the current (interesting) parameters on the stack, and hence updates the activation record.
* It updates `my_assignments` with `var:value` pairs where the `var` is annotated with the scope where the variable is applicable. In the case of `call`, the parameters are fragments of the parent scope. Hence, we pass the record number `-2`, which is the previous activation record (the current activation record is at `-1`).
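
The choice of record number can be illustrated with a plain list standing in for the method stack. This is only a sketch: `scope_index()` is a hypothetical helper, and the method ids in `mstack` are made up.

```python
def scope_index(event):
    """Parameters seen at a call belong to the caller; everything else to the current method."""
    return -2 if event == 'call' else -1


# Stack while process_vehicle() has just been entered from process_inventory().
mstack = ['<start>', 'process_inventory:1', 'process_vehicle:2']

caller_scope = mstack[scope_index('call')]
current_scope = mstack[scope_index('line')]
```

At the *call* event the callee is already on top of the stack, which is why its parameters are attributed to the entry one below.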

In [None]:
class ScopeTracker(ScopeTracker):
    def on_call(self, arg, cxt, my_vars):
        self.method_enter(cxt)
        self.my_assignments.set_current_method(self.method_id)
        my_parameters = {
            k: v
            for k, v in cxt.parameters(my_vars).items()
            if not self.istack.ignored(v)
        }
        self.istack.push(cxt.method, my_parameters)
        self.update_vars(cxt.qualified(my_parameters), -2)

The `on_return()` is the counterpart to `on_call()`. It pops the stack, and updates the `my_assignments` variable with a virtual parameter corresponding to the return value if the return value is being tracked.

In [None]:
class ScopeTracker(ScopeTracker):
    def on_return(self, arg, cxt, my_vars):
        self.on_line(arg, cxt, my_vars)
        self.istack.pop()
        self.method_exit(cxt)
        self.my_assignments.set_current_method(self.method_id)
        if not self.track_return:
            return
        var = '(<-%s)' % cxt.method
        self.update_vars({var: arg}, -1)

Finally, the `on_line()` method is very similar to the parent, except that the variables for `my_assignments` are annotated with the scope. In the case of `on_line`, the variables should contain fragments of the *current* activation record. Hence, we pass the record number `-1` to retrieve the scope.

In [None]:
class ScopeTracker(ScopeTracker):
    def on_line(self, arg, cxt, my_vars):
        self.method_statement(cxt)
        my_vars = cxt.qualified(my_vars)
        self.update_vars(my_vars, -1)

A few convenience methods:

In [None]:
def split_token(token):
    return split_var(token[1:-1]) if is_nonterminal(token) else {}

Note that we can uniquely identify a variable using just its name, its method sequence number, and its assignment sequence number, all of which are part of the qualified variable name.

In [None]:
def abbrev_var(var):
    v = split_var(var)
    return var if len(v) < 5 else "%s:%s[%s:%s]" % (
        v['var'], v['lno'], v['mseq'], v['vseq'])

In [None]:
def abbrev_token(var):
    return "<%s>" % abbrev_var(var[1:-1]) if is_nonterminal(var) else var

We can use the `ScopeTracker` as follows:

In [None]:
vehicle_traces = []
with Tracer(INVENTORY) as tracer:
    process_inventory(tracer.my_input)
sm = ScopeTracker(tracer.my_input, tracer.trace)
vehicle_traces.append((tracer.my_input, sm))
for k, v in sm.defined_vars().items():
    print(abbrev_var(k), '=', repr(v))

### Mining a Derivation Tree

First, we define `mseq()` to retrieve the current method context.

In [None]:
class ScopeMiner(Miner):
    def mseq(self, key, dval=0):
        return dval if key == START_SYMBOL else int(split_token(key)['mseq'])

The main difference in `apply_new_definition()` is that we add a second condition that checks for scope. In particular, variables are only allowed to replace portions of string fragments that were in scope. The scope is indicated by the `scope_of_var` variable. An exception is made for cases where an internal child method call may have generated a large fragment. In that case, the `mseq` of the internal child method call would be larger than the current `mseq`. If so, we allow the replacement to proceed.

In [None]:
class ScopeMiner(ScopeMiner):
    def apply_new_definition(self, tree, nt_var, value_):
        value, scope_of_var = value_
        mseq_of_var = self.mseq(nt_var)
        self.logger(
            0, "%s = %s\t%s" % (nt_var, repr(value), "[%s]" % scope_of_var))
        applied = False
        for key, rule in tree.items():
            self.logger(1, "%s : %s" % (key, repr(rule)))
            if scope_of_var not in key:
                mseq_of_key = self.mseq(key)
                if mseq_of_var > mseq_of_key:
                    continue
            if not any(self.has_value(value, token) for token in rule):
                continue
            tree[key] = self.replace_in_rule(nt_var, value, rule)
            self.logger(1, "%s -> %s" % (key, repr(tree[key])))
            applied = True
        return applied
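
The scope condition on its own can be written as a small predicate. The sketch below is a self-contained restatement of the check, with `may_replace()` as a hypothetical name and made-up keys and sequence numbers.

```python
def may_replace(scope_of_var, mseq_of_var, key, mseq_of_key):
    """Decide whether a variable may replace fragments in the rule stored under key."""
    if scope_of_var in key:
        return True  # the rule belongs to the scope the variable was taken from
    # Otherwise, only rules from the same or a later method invocation qualify.
    return mseq_of_var <= mseq_of_key


inventory_key = '<process_inventory:vehicle:5[process_inventory:1:vseq:2]>'
vehicle_key = '<process_vehicle:kind:29[process_vehicle:3:vseq:1]>'

# A variable scoped to process_vehicle:3 must not touch the inventory rule ...
blocked = may_replace('process_vehicle:3', 3, inventory_key, 1)
# ... but it may refine rules in its own scope,
allowed = may_replace('process_vehicle:3', 3, vehicle_key, 3)
# and a fragment produced by a later child invocation is still open to it.
child = may_replace('process_inventory:1', 1, vehicle_key, 3)
```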

The `get_derivation_tree()` method is almost exactly the same as that of the parent, except that we now have to handle the context annotations in variable assignments.

In [None]:
class ScopeMiner(ScopeMiner):
    def get_derivation_tree(self):
        tree = {}
        for var in self.my_assignments:
            nt_var, value = self.nt_var(var), self.my_assignments[var]
            if tree:
                v = self.apply_new_definition(tree, nt_var, value)
                if not v:
                    continue
            self.logger(0, "+%s = %s" % (nt_var, repr(value[0])))  # <-- change
            tree[nt_var] = (value[0], )  # <-- change
        return tree

A few examples of our miner in action.

#### Example 1: Recovering Inventory Parse Tree

In [None]:
def stgrammar_to_tree(tree, key=START_SYMBOL, abbrev=abbrev_token):
    if key not in tree:
        return (repr(key), [])
    children = [stgrammar_to_tree(tree, c) for c in tree[key]]
    return (abbrev(key), children)

In [None]:
with Tracer(INVENTORY) as tracer:
    process_inventory(tracer.my_input)
sm = ScopeTracker(tracer.my_input, tracer.trace)
for k, v in sm.defined_vars().items():
    print(abbrev_var(k), '=', repr(v))
vehicle_dt = ScopeMiner(tracer.my_input, sm.defined_vars())
display_derivation_tree(vehicle_dt.tree, graph_attr=lr_graph)

One thing to notice in Example 1 is that the three subtrees `vehicle[3:1]`, `vehicle[5:1]` and `vehicle[7:1]` are quite alike.

#### Example 2: Recovering URL Parse Tree

In [None]:
url_dts = []
for inputstr in URLS_X:
    clear_cache()
    with Tracer(inputstr, files=['urllib/parse.py']) as tracer:
        urlparse(tracer.my_input)
    sm = ScopeTracker(tracer.my_input, tracer.trace)
    for k, v in sm.my_assignments.defs.items():
        print(abbrev_var(k), '=', repr(v))
    dt = ScopeMiner(tracer.my_input, sm.defined_vars())
    display_derivation_tree(dt.tree, graph_attr=lr_graph)
    url_dts.append(dt)

### Grammar Inference

We noticed that some of the subtrees were quite alike. Such subtrees can be abstracted directly to produce a context-free grammar, even from a single derivation tree.

In [None]:
class ScopedInfer(Infer):
    def abbrev_token(self, var):
        return "<%s>" % self.abbrev_var(var[1:-1]) if is_nonterminal(
            var) else var

    def abbrev_var(self, var):
        v = split_var(var)
        return var if len(v) < 5 else "%s[%s:%s]" % (v['var'], v['method'], v['lno'])

    def add_tree(self, t):
        merged_grammar = {}
        my_tree = t.tree
        abbrev_keys = {self.abbrev_token(k) for k in t.tree}
        keylst = list(self.grammar.keys()) + list(abbrev_keys)

        for key in keylst:
            alternates = set(self.grammar.get(key, []))
            if key in abbrev_keys:
                rules = [
                    my_tree[k] for k in my_tree if self.abbrev_token(k) == key
                ]
                for r in rules:
                    if len(r) == 1:
                        if key == self.abbrev_token(r[0]):
                            continue
                    alternates.add(tuple([self.abbrev_token(t) for t in r]))
            merged_grammar[key] = list(alternates)
        self.grammar = {k: v for k, v in merged_grammar.items()}
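
As a rough illustration of the merging step, the following standalone sketch (not the `ScopedInfer` implementation) collapses instance-specific nonterminals into shared keys and collects their expansions as alternatives. The helper names `strip_scope()` and `merge_tree()`, the toy tree, and the assumption that instances differ only by a trailing `[...]` suffix are all made up for this example.

```python
import re
from collections import defaultdict


def strip_scope(token):
    """Drop a (hypothetical) trailing '[...]' instance suffix from a nonterminal."""
    if not token.startswith('<'):
        return token  # terminals are left untouched
    return re.sub(r'\[[^\]]*\]>$', '>', token)


def merge_tree(tree):
    """Merge all instances of a nonterminal into one key with alternatives."""
    grammar = defaultdict(set)
    for key, rule in tree.items():
        grammar[strip_scope(key)].add(tuple(strip_scope(t) for t in rule))
    return {k: sorted(alts) for k, alts in grammar.items()}


# Toy derivation tree in dictionary form; keys and values are illustrative only.
toy_tree = {
    '<start>': ('<vehicle[3:1]>', '\n', '<vehicle[5:1]>'),
    '<vehicle[3:1]>': ('<year[3:1]>', ',', '<kind[3:1]>'),
    '<vehicle[5:1]>': ('<year[5:1]>', ',', '<kind[5:1]>'),
    '<year[3:1]>': ('1997',),
    '<year[5:1]>': ('2000',),
    '<kind[3:1]>': ('van',),
    '<kind[5:1]>': ('car',),
}

toy_grammar = merge_tree(toy_tree)
for key, alternatives in toy_grammar.items():
    print(key, '::=', alternatives)
```

Here, the two `<vehicle[...]>` subtrees have the same shape, so they collapse into a single rule, while the differing leaves become alternatives of `<year>` and `<kind>`.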

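The last piece of the puzzle is the cleanup method `clean_grammar()`. It inlines every nonterminal (other than the start symbol) whose only expansion is a single other nonterminal, and then joins the tokens of each rule into a string.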

In [None]:
class ScopedInfer(ScopedInfer):
    def clean_grammar(self):
        replacements = {}
        for k in self.grammar:
            if k == START_SYMBOL:
                continue
            alts = self.grammar[k]
            if len(alts) != 1:
                continue
            rule = alts[0]
            if len(rule) != 1:
                continue
            tok = rule[0]
            if not is_nonterminal(tok):
                continue
            replacements[k] = tok

        while True:
            changed = set()
            for k in self.grammar:
                if k in replacements:
                    continue
                new_alts = []
                for alt in self.grammar[k]:
                    new_alt = []
                    for t in alt:
                        if t in replacements:
                            new_alt.append(replacements[t])
                            changed.add(t)
                        else:
                            new_alt.append(t)
                    new_alts.append(new_alt)
                self.grammar[k] = new_alts
            if not changed:
                break
            for k in changed:
                del self.grammar[k]
        new_grammar = {}
        for k in self.grammar:
            new_grammar[k] = list(set([''.join(a) for a in self.grammar[k]]))
        return new_grammar
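
To see the effect of such a cleanup in isolation, here is a minimal standalone sketch. It is a simplified variant rather than the method above: the function name `inline_unit_rules()` and the toy grammar are hypothetical, it drops every aliasing key (not only those actually referenced), and it assumes that alias chains are acyclic.

```python
def inline_unit_rules(grammar, start='<start>'):
    """Inline nonterminals whose only expansion is one other nonterminal."""
    def is_nt(token):
        return token.startswith('<') and token.endswith('>')

    # Map each "alias" key to the single nonterminal it expands to.
    aliases = {
        key: alts[0][0]
        for key, alts in grammar.items()
        if key != start and len(alts) == 1 and len(alts[0]) == 1
        and is_nt(alts[0][0])
    }

    def resolve(token):
        # Follow alias chains; the 'seen' set guards against cycles.
        seen = set()
        while token in aliases and token not in seen:
            seen.add(token)
            token = aliases[token]
        return token

    return {
        key: [tuple(resolve(t) for t in alt) for alt in alts]
        for key, alts in grammar.items()
        if key not in aliases
    }


# Toy grammar with two indirections (<url> and <netloc>); illustrative only.
toy = {
    '<start>': [('<url>',)],
    '<url>': [('<urlsplit>',)],
    '<urlsplit>': [('<scheme>', '://', '<netloc>')],
    '<scheme>': [('http',), ('https',)],
    '<netloc>': [('<host>',)],
    '<host>': [('example.com',)],
}

cleaned = inline_unit_rules(toy)
for key, alternatives in cleaned.items():
    print(key, '::=', alternatives)
```

The intermediate keys `<url>` and `<netloc>` disappear, and their uses point directly at `<urlsplit>` and `<host>`.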

The `add_tree()` method is used as follows:

In [None]:
i = ScopedInfer()
i.add_tree(vehicle_dt)

In [None]:
syntax_diagram(i.clean_grammar())

In [None]:
si = ScopedInfer()
for dt in url_dts:
    si.add_tree(dt)

In [None]:
syntax_diagram(si.clean_grammar())

In [None]:
f = GrammarFuzzer(si.clean_grammar())
for _ in range(10):
    print(f.fuzz())

## Lessons Learned

* Given a set of sample inputs for a program, we can learn an input grammar by examining variable values during execution if the program relies on handwritten parsers.
* Simple string inclusion checks are sufficient to obtain reasonably accurate grammars from real-world programs.
* The resulting grammars can be directly used for fuzzing, and can have a multiplier effect on any samples you have.
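
As a reminder of how little machinery the string inclusion check needs, here is a minimal sketch. The function `input_fragments()`, its `min_len` threshold, and the observed variables are hypothetical and merely stand in for the tracking done by the miners in this chapter.

```python
def input_fragments(my_input, variables, min_len=2):
    """Keep only those variables whose string value occurs in the input."""
    return {
        name: value
        for name, value in variables.items()
        if isinstance(value, str)       # only strings can be input fragments
        and len(value) >= min_len       # very short strings match by accident
        and value in my_input           # the actual inclusion check
    }


# Hypothetical variable values observed while processing one inventory line.
observed = {
    'year': '1997',
    'kind': 'van',
    'count': 3,
    'sep': ',',
    'label': 'We have a Ford',
}

fragments = input_fragments('1997,van,Ford,E350', observed)
print(fragments)
```

Only `year` and `kind` survive: `count` is not a string, `sep` is too short, and `label` does not occur in the input as a whole.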

## Next Steps

* [Use _mutations_ on existing inputs to get more valid inputs](MutationFuzzer.ipynb)
* [Use _grammars_ (i.e., a specification of the input format) to get even more valid inputs](Grammars.ipynb)
* [Reduce _failing inputs_ for efficient debugging](Reducer.ipynb)


## Background

Recovering the input specification of an arbitrary program is a well-researched topic. Most of this research has followed the black-box approach, in which nothing more is known about the program in question. Our approach, in contrast, relies on detailed knowledge of the program in question, and also on the ability to execute the program under observation. The pioneering work in this area was done by Lin et al. \cite{Lin2008}, who invented a way to retrieve parse trees from top-down and bottom-up parsers. The current approach is based directly on the work of Höschele et al. \cite{Hoschele2017}.

## Exercises

### Exercise 1: Flattening complex objects

Our grammar miners only check for string fragments. However, programs often pass around containers or custom objects that contain input fragments. Can you modify our grammar miner to correctly account for such complex objects too?

Here is a possible solution.

**Solution.**

In [None]:
def flatten(key, val):
    # Should we limit flattened objects to repr ~ tstr here or during call?
    if isinstance(val, (int, float, complex, str, bytes, bytearray)):
        # Primitive values are kept as they are
        return [(key, val)]
    elif isinstance(val, (set, frozenset, list, tuple, range)):
        # Containers: qualify each element by its position
        values = [e for i, elt in enumerate(val) for e in flatten(i, elt)]
        # Use %s: keys of nested elements are already strings such as "0.1"
        return [("%s.%s" % (key, i), v) for i, v in values]
    elif isinstance(val, dict):
        # Dictionaries: qualify each element by its key
        values = [e for k, elt in val.items() for e in flatten(k, elt)]
        return [("%s.%s" % (key, k), v) for k, v in values]
    elif isinstance(val, tstr):
        return [(key, val)]
    elif hasattr(val, '__dict__'):
        # Custom objects: qualify each attribute by its name
        values = [e for k, elt in val.__dict__.items()
                  for e in flatten(k, elt)]
        return [("%s.%s" % (key, k), v) for k, v in values]
    else:
        # Anything else is represented by its repr()
        return [(key, repr(val))]


### Exercise 2: Incorporating Taints from InformationFlow

We have been using *string inclusion* to check whether a particular fragment came from the input string. This is unsatisfactory, as it requires us to compromise on the size of the strings tracked, which is limited to those longer than `FRAGMENT_LEN`. Further, it is possible that a single method processes a string in which a fragment repeats, but is part of different tokens. For example, an embedded comma in the CSV file would cause our parser to fail. One way to avoid this is to rely on *dynamic taints*, and check for taint inclusion rather than string inclusion.

The chapter on [information flow](InformationFlow.ipynb) details how to incorporate dynamic taints. Can you update our scope-based grammar miner to use *dynamic taints* instead?
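
To see why string inclusion alone can be ambiguous, consider the small sketch below. It is not the taint tracking from the information flow chapter; the helpers `occurrences()` and `split_with_offsets()` and the sample line are hypothetical, and only illustrate that knowing where a fragment came from removes the ambiguity.

```python
def occurrences(fragment, text):
    """Return every start offset at which fragment occurs in text."""
    offsets = []
    start = text.find(fragment)
    while start != -1:
        offsets.append(start)
        start = text.find(fragment, start + 1)
    return offsets


def split_with_offsets(text, sep=','):
    """Split text, pairing each field with its (start, end) input offsets."""
    fields = []
    start = 0
    for field in text.split(sep):
        fields.append((field, start, start + len(field)))
        start += len(field) + len(sep)
    return fields


# Hypothetical inventory line in which '2000' is both a year and part of a model.
line = '2000,car,Mercury,2000GT'

# Inclusion alone cannot tell which occurrence a variable holding '2000' refers to.
print(occurrences('2000', line))

# With origin offsets attached, each field maps to exactly one input span.
for field in split_with_offsets(line):
    print(field)
```

A taint-based miner would carry such origin information along with every string operation, rather than reconstructing it after the fact.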

<!-- **Advanced.** The *dynamic taint* approach is limited in that it can not observe implicit flows. For example, consider the fragment below.

```python
if my_fragment == 'begin':
    return 'begin'
```

In this case, we lose track of the string `begin` that is returned even though it depends on the value of `my_fragment`. For such cases, a better (but costly) alternative is to rely on concolic execution and capture, for each variable, the constraints as they relate to input characters.

The chapter on [symbolic execution](SymbolicExecution.ipynb) details how to incorporate concolic symbolic execution into program execution. Can you update our grammar miner to use *concolic execution* to track taints instead?
-->