# Mining Input Grammars

So far, the grammars we have seen have mostly been specified manually: you (or whoever knows the input format) had to design and write the grammar in the first place. While those grammars have been rather simple, creating a grammar for complex inputs can take considerable effort. In this chapter, we therefore introduce techniques that automatically _mine_ grammars from programs by executing the programs and observing how they process which parts of the input. In conjunction with a grammar fuzzer, this allows us to (1) take a program, (2) extract its input grammar, and (3) fuzz it with high efficiency and effectiveness.

**Prerequisites**

* You should have read the [chapter on grammars](Grammars.ipynb).
* The [chapter on configuration fuzzing](ConfigurationFuzzer.ipynb) introduces grammar mining for configuration options, as well as observing variables and values during execution.

## A Simple Grammar Miner

Say we want to obtain the input grammar for the function `urlparse` from the Python standard library.

### Function Under Test

In [None]:
from urllib.parse import urlparse
FUNCTION = urlparse

## Tracing Variable Values

We start with a few sample inputs, listed in `INPUTS` below.

We also use two *global* variables: `the_values` keeps track of variable assignments, and `the_input` holds the current input string. We will show later how to avoid these globals.

In [None]:
INPUTS = [
    'http://user:pass@www.google.com:80/?q=path#ref',
    'https://www.cispa.saarland:80/',
    'http://www.fuzzingbook.org/#News',
]
the_values = {}
the_input = None

### Recording Occurrences of Input Values

The function `traceit()` records all *nontrivial* string variables and their values seen during execution, that is, all local variables whose value is a string of more than 2 characters that occurs in the input.

In [None]:
def traceit(frame, event, arg):
    my_vars = {
        var: value
        for var, value in frame.f_locals.items()
        if isinstance(value, str) and len(value) > 2
        and value in the_input
    }
    the_values.update(my_vars)

    return traceit

### Trace

The `trace_function()` hooks into the Python trace functionality.
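
As a minimal sketch of that hook in isolation, the following example installs a trace function around a toy function and collects the string locals that occur in the input. The names `make_string_tracer` and `split_pair` are made up for this illustration and are not part of the chapter's code. As a variation, the sketch keeps its state in a closure rather than in global variables.

```python
import sys


def make_string_tracer(inputstr, min_length=3):
    """Return (tracer, seen); `seen` collects string locals found in inputstr."""
    seen = {}

    def tracer(frame, event, arg):
        # frame.f_locals maps local variable names to their current values
        for var, value in frame.f_locals.items():
            if (isinstance(value, str) and len(value) >= min_length
                    and value in inputstr):
                seen[var] = value
        return tracer  # keep tracing inside this frame

    return tracer, seen


def split_pair(text):
    # Toy function under test: splits "key=value" into its two parts
    key, value = text.split("=")
    return key, value


tracer, seen = make_string_tracer("color=green")
old_trace = sys.gettrace()
sys.settrace(tracer)
try:
    split_pair("color=green")
finally:
    sys.settrace(old_trace)  # always restore the previous trace function

print(seen)
```

The trace function is called for every `call`, `line`, and `return` event in the traced frame, so `seen` ends up holding the last matching value observed for each variable name.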

In [None]:
def trace_function(function, inputstr):
    import sys
    global the_input
    the_input = inputstr

    global the_values
    the_values = {}

    oldtrace = sys.gettrace()
    sys.settrace(traceit)
    try:
        function(the_input)
    finally:
        # Restore the previous trace function even if `function` raises
        sys.settrace(oldtrace)

    return the_values

In [None]:
values = trace_function(FUNCTION, INPUTS[0])
for var in values.keys():
    print(var + " = " + repr(values[var]))
print('')

### Extracting a Grammar

In [None]:
import fuzzingbook_utils

In [None]:
from Grammars import START_SYMBOL

We convert a variable name into a grammar nonterminal:

In [None]:
def nonterminal(var):
    return "<" + var.lower() + ">"

Now, for each pair _VAR_, _VALUE_ found:

1. We search for occurrences of _VALUE_ in the expansions of the grammar.
2. We replace them by <_VAR_>.
3. We add a new rule <_VAR_> $\rightarrow$ _VALUE_ to the grammar.
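
To see these three steps on a small scale, here is a simplified sketch that applies them once per pair to a hand-written trace. The trace values are invented for illustration and are not actual `urlparse` output; `nonterminal_for` and `replace_once` are hypothetical helpers, and the sketch makes only a single pass over the pairs.

```python
def nonterminal_for(var):
    # Hypothetical helper: wrap a variable name in angle brackets
    return "<" + var.lower() + ">"


def replace_once(grammar, var, value):
    """Apply steps 1-3 for a single (var, value) pair; return True if used."""
    symbol = nonterminal_for(var)
    found = False
    for key, alternatives in list(grammar.items()):
        if key == symbol:
            continue  # guard in this sketch: never rewrite a rule by itself
        for i, alternative in enumerate(alternatives):
            if value in alternative:                                  # step 1
                alternatives[i] = alternative.replace(value, symbol)  # step 2
                found = True
    if found:
        grammar[symbol] = [value]                                     # step 3
    return found


# Hand-written trace for illustration (not actual urlparse output)
toy_input = "http://example.com/index"
toy_trace = {"scheme": "http", "netloc": "example.com", "path": "/index"}

toy_grammar = {"<start>": [toy_input]}
for var, value in toy_trace.items():
    replace_once(toy_grammar, var, value)

for symbol, alternatives in toy_grammar.items():
    print(symbol, "->", alternatives)
```

After the loop, the start rule reads `<scheme>://<netloc><path>`, and each of the three nonterminals expands to the value that was observed for it.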

In [None]:
def get_grammar(traces, inputstr):
    # Here's our initial grammar
    grammar = {START_SYMBOL: [inputstr]}

    # Replace as listed above
    while True:
        new_rules = []
        for var, value in traces.items():
            for key, repl_alternatives in grammar.items():
                for j, repl in enumerate(repl_alternatives):
                    if value not in repl:
                        continue
                    # Replace value by nonterminal name
                    alt_key = nonterminal(var)
                    repl_alternatives[j] = repl.replace(value, alt_key)
                    new_rules.append((var, alt_key, value))

        if not new_rules:
            break  # Nothing to expand anymore

        for (var, alt_key, value) in new_rules:
            # Add new rule to grammar
            grammar[alt_key] = [value]

            # Do not expand this again; `var` may occur more than once
            # in new_rules, so avoid a KeyError on repeated removal
            traces.pop(var, None)

    return grammar

First, trace the execution:

In [None]:
traces = trace_function(FUNCTION, INPUTS[0])

In [None]:
grammar = get_grammar(traces, INPUTS[0])
grammar

In [None]:
grammar = get_grammar(trace_function(FUNCTION, INPUTS[1]), INPUTS[1])
grammar

In [None]:
grammar = get_grammar(trace_function(FUNCTION, INPUTS[2]), INPUTS[2])
grammar

### Merging Grammars

In [None]:
def merge_grammars(g1, g2):
    merged_grammar = {}
    for key in list(g1.keys()) + list(g2.keys()):
        merged_grammar[key] = g1.get(key, set()) | g2.get(key, set())
    return merged_grammar

In [None]:
def get_merged_grammar(function, inputs):
    def alt_to_set(grammar):
        return {key: set(values) for key, values in grammar.items()}

    merged_grammar = {}
    for inputstr in inputs:
        traces = trace_function(function, inputstr)
        grammar = alt_to_set(get_grammar(traces, inputstr))
        merged_grammar = merge_grammars(merged_grammar, grammar)

    return merged_grammar

In [None]:
grammar = get_merged_grammar(FUNCTION, INPUTS)
grammar
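
To illustrate what merging does, the following sketch unions the alternatives of two small hand-written grammars, as if each had been mined from a single input. The helper `union_alternatives` is a hypothetical stand-in for the merge step, and `to_list_grammar` is an assumed extra step for consumers that expect a list of alternatives per nonterminal, as in the grammars seen so far.

```python
def union_alternatives(g1, g2):
    # Hypothetical stand-in for the merge step: per-key union of alternatives
    return {key: g1.get(key, set()) | g2.get(key, set())
            for key in g1.keys() | g2.keys()}


def to_list_grammar(set_grammar):
    # Sort the alternatives so that the result is reproducible across runs
    return {key: sorted(alternatives)
            for key, alternatives in set_grammar.items()}


# Two hand-written grammars, as if mined from one input each
g_http = {"<start>": {"<scheme>://<netloc>/"}, "<scheme>": {"http"},
          "<netloc>": {"www.example.com"}}
g_https = {"<start>": {"<scheme>://<netloc>/"}, "<scheme>": {"https"},
           "<netloc>": {"www.example.org:80"}}

merged = to_list_grammar(union_alternatives(g_http, g_https))
print(merged["<scheme>"])
print(merged["<start>"])
```

Rules that are identical in both grammars, such as the start rule here, collapse into a single alternative, whereas differing rules accumulate alternatives.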

### Fuzzing

In [None]:
from GrammarFuzzer import GrammarFuzzer

In [None]:
f = GrammarFuzzer(grammar)
for i in range(10):
    print(f.fuzz())
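
As a rough picture of how a mined grammar produces new inputs, here is a minimal sketch that repeatedly replaces the leftmost nonterminal by a randomly chosen alternative. It is not the `GrammarFuzzer` implementation; the function `expand`, its parameters, and the toy grammar are assumptions for illustration, with `<start>` assumed as the start symbol.

```python
import random
import re

NONTERMINAL = re.compile(r"<[^<> ]+>")


def expand(grammar, start="<start>", seed=None, max_steps=100):
    """Repeatedly replace the leftmost nonterminal by a random alternative."""
    rng = random.Random(seed)
    term = start
    for _ in range(max_steps):
        match = NONTERMINAL.search(term)
        if match is None:
            return term  # only terminal symbols are left
        # sorted() accepts both lists and sets of alternatives
        alternatives = sorted(grammar[match.group(0)])
        chosen = rng.choice(alternatives)
        term = term[:match.start()] + chosen + term[match.end():]
    raise ValueError("expansion did not terminate within max_steps")


# Hand-written grammar in the shape of a merged, mined grammar
toy = {
    "<start>": ["<scheme>://<netloc>/"],
    "<scheme>": ["http", "https"],
    "<netloc>": ["www.example.com", "www.example.org:80"],
}

for seed in range(3):
    print(expand(toy, seed=seed))
```

Because the mined rules recombine fragments from different inputs, the generated strings can pair a scheme from one sample with a host from another.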

## Keeping Track of The Stack

In [None]:
# import InformationFlow

## Lessons Learned

* Given a set of inputs, we can learn an input grammar by examining variable values during execution.
* The resulting grammars can be used directly for fuzzing.

## Next Steps

* [use _mutations_ on existing inputs to get more valid inputs](MutationFuzzer.ipynb)
* [use _grammars_ (i.e., a specification of the input format) to get even more valid inputs](Grammars.ipynb)
* [reduce _failing inputs_ for efficient debugging](Reducer.ipynb)


## Background

\cite{Lin2008}
