# Fuzzing with Grammars

In the chapter on ["Mutation-Based Fuzzing"](Mutation_Fuzzing.ipynb), we have seen how to use extra hints – such as sample input files – to speed up test generation.  In this chapter, we take this idea one step further, by providing a _specification_ of the legal inputs to a program.  These _grammars_ allow for very effective and efficient testing, as we will see in this chapter.

**Prerequisites**

* You should know how basic fuzzing works, e.g. from the [Chapter introducing fuzzing](Basic_Fuzzing.ipynb).
* Knowledge of [mutation-based fuzzing](Mutation_Fuzzing.ipynb) and [coverage](Coverage.ipynb) is _not_ required yet, but still recommended.

## Input Languages

All possible behaviors of a program can be triggered by its input.  "Input" here can be a wide range of possible sources: We are talking about data read from files or over the network, data input by the user, or data acquired from interaction with other resources.  The set of all these inputs determines how the program will behave – including its failures.  When testing, it is thus very helpful to think about possible input sources, how to get them under control, and _how to systematically test them_.

For the sake of simplicity, we will assume for now that the program has only one source of inputs; this is the same assumption we have been using in the previous chapters, too.  The set of valid inputs to a program is called a _language_.  Languages range from the simple to the complex: the CSV language denotes the set of valid comma-separated inputs, whereas the Python language denotes the set of valid Python programs.  We commonly separate data languages and programming languages, although any program can also be treated as input data (say, to a compiler).  The [Wikipedia page on file formats](https://en.wikipedia.org/wiki/List_of_file_formats) lists more than 1,000 different file formats, each of which is its own language.

## Grammars

### Rules and Expansions

To formally specify input languages, _grammars_ are among the most popular (and best understood) formalisms.  A grammar consists of a _start symbol_ and a set of _rules_ which indicate how the start symbol (and other symbols) can be expanded.  As an example, consider the following grammar, denoting a sequence of two digits:

```grammar
<start> ::= <digit><digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
```

To read such a grammar, start with the start symbol (`<start>`).  A rule `<A> ::= <B>` means that the symbol on the left side (`<A>`) can be replaced by the string on the right side (`<B>`).  In the above grammar, `<start>` would be replaced by `<digit><digit>`.

In this string again, `<digit>` would be replaced by the string on the right side of the `<digit>` rule.  The special operator `|` denotes _alternatives_, meaning that any of the digits can be chosen for an expansion.  Each `<digit>` thus would be expanded into one of the given digits, eventually yielding a string between `00` and `99`.  There are no further expansions for `0` to `9`, so we are all set.
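To make this expansion process concrete, here is a minimal sketch that enumerates every string the two-digit grammar can produce.  The names `TWO_DIGIT_GRAMMAR` and `expand_all()` are illustrative only and not part of this chapter's code; the sketch also assumes a non-recursive grammar, since otherwise the enumeration would not terminate.

```python
import re

# The two-digit grammar from above as a plain Python dict
# (an assumed encoding: each symbol maps to its list of alternatives).
TWO_DIGIT_GRAMMAR = {
    "<start>": ["<digit><digit>"],
    "<digit>": list("0123456789"),
}

def expand_all(grammar, term="<start>"):
    """Return all strings derivable from `term` (non-recursive grammars only)."""
    match = re.search(r"<[^<> ]*>", term)
    if match is None:
        return [term]  # no symbols left: `term` is a finished string
    results = []
    for alternative in grammar[match.group(0)]:
        # Replace the leftmost symbol by one alternative and keep expanding.
        expanded = term[:match.start()] + alternative + term[match.end():]
        results.extend(expand_all(grammar, expanded))
    return results

two_digit_strings = expand_all(TWO_DIGIT_GRAMMAR)
print(len(two_digit_strings), two_digit_strings[0], two_digit_strings[-1])
# prints: 100 00 99
```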

The interesting thing about grammars is that they can be _recursive_. That is, expansions can make use of symbols expanded earlier – which would then be expanded again.  As an example, consider a grammar that describes integers:

```grammar
<start>   ::= <integer>
<integer> ::= <digit> | <digit><integer>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
```

Here, an `<integer>` is either a single digit, or a digit followed by another integer.  The number `1234` thus would be represented as a single digit `1`, followed by the integer `234`, which in turn is a digit `2`, followed by the integer `34`.
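The recursive reading of the `<integer>` rule can be sketched directly as a recursive function.  The helpers `is_integer()` and `decompose()` below are hypothetical names introduced for illustration; they simply follow the two alternatives of the rule.

```python
DIGITS = "0123456789"

def is_integer(s):
    """Check `s` against <integer> ::= <digit> | <digit><integer>."""
    if len(s) == 0 or s[0] not in DIGITS:
        return False
    if len(s) == 1:
        return True            # first alternative: a single <digit>
    return is_integer(s[1:])   # second alternative: <digit> followed by <integer>

def decompose(s):
    """List the (digit, remaining integer) splits made while recognizing `s`."""
    assert is_integer(s)
    return [(s[i], s[i + 1:]) for i in range(len(s) - 1)]

print(decompose("1234"))
# prints: [('1', '234'), ('2', '34'), ('3', '4')]
```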

If we wanted to express that an integer can be preceded by a sign (`+` or `-`), we would write the grammar as

```grammar
<start>   ::= <number>
<number>  ::= <integer> | +<integer> | -<integer>
<integer> ::= <digit> | <digit><integer>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
```

These rules formally define the language: Anything that can be derived from the start symbol is part of the language; anything that cannot is not.
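For this particular grammar, membership in the language is easy to check, because the `<number>` language happens to be regular.  The following sketch assumes that the regular expression `[+-]?[0-9]+` describes the same language as the rules above; the function name `in_number_language()` is made up for this example.  Such a shortcut does not carry over to grammars in general.

```python
import re

def in_number_language(s):
    """Return True if `s` can be derived from <number> in the grammar above."""
    # Optional sign, followed by one or more digits.
    return re.fullmatch(r"[+-]?[0-9]+", s) is not None

for candidate in ["42", "+42", "-0", "--1", "4.2", ""]:
    print(repr(candidate), in_number_language(candidate))
```

The first three candidates are part of the language; the last three are not.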

### Arithmetic Expressions

Let us expand our grammar to cover full _arithmetic expressions_ – a poster child example for a grammar.  We see that an expression (`<expr>`) is either a sum, or a difference, or a term; a term is either a product or a division, or a factor; and a factor is either a number or a parenthesized expression.  Almost all rules can have recursion, and thus allow arbitrarily complex expressions such as `(1 + 2) * (3.4 / 5.6 - 789)`.

```grammar
<start>   ::= <expr>
<expr>    ::= <expr> + <term> | <expr> - <term> | <term>
<term>    ::= <term> * <factor> | <term> / <factor> | <factor>
<factor>  ::= +<factor> | -<factor> | (<expr>) | <integer> | <integer>.<integer>
<integer> ::= <digit> | <digit><integer>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
```

In such a grammar, if we start with `<start>` and then expand one symbol after another, randomly choosing alternatives, we can quickly produce one valid arithmetic expression after another.  Such _grammar fuzzing_ is highly effective when it comes to producing complex inputs, and this is what we will implement in this chapter.

## Representing Grammars in Python

Our first step in building a grammar fuzzer is to find an appropriate format for grammars.  To make the writing of grammars as simple as possible, we use a format that is mostly based on strings.  Our grammars in Python take the format of a _mapping_ between symbol names and expansions, where expansions are _lists_ of alternatives.  A one-rule grammar for digits thus takes the form

In [26]:
import gstbook

In [27]:
DIGIT_GRAMMAR = {
    "<digit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}

whereas the full grammar for arithmetic expressions looks like this:

In [28]:
EXPR_GRAMMAR = {
    "<start>":
        ["<expr>"],

    "<expr>":
        ["<expr> + <term>", "<expr> - <term>", "<term>"],

    "<term>":
        ["<term> * <factor>", "<term> / <factor>", "<factor>"],

    "<factor>":
        ["+<factor>", "-<factor>", "(<expr>)", "<integer>", "<integer>.<integer>"],

    "<integer>":
        ["<integer><digit>", "<digit>"],

    "<digit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}

In the grammar, we can access any rule by its symbol...

In [29]:
EXPR_GRAMMAR["<digit>"]

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

...and we can check whether a symbol is in the grammar:

In [30]:
"<identifier>" in EXPR_GRAMMAR

False
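The textual notation used at the beginning of this chapter maps almost directly onto this Python format.  As a rough sketch, a hypothetical helper `grammar_from_text()` (not part of this chapter's code) could do the conversion, assuming one rule per line, that `|` never occurs as a terminal, and that whitespace around alternatives is insignificant.

```python
def grammar_from_text(text):
    """Convert lines of the form '<symbol> ::= alt1 | alt2' into a grammar dict."""
    grammar = {}
    for line in text.strip().splitlines():
        # Split each rule into its left-hand symbol and right-hand alternatives.
        symbol, _, alternatives = line.partition("::=")
        grammar[symbol.strip()] = [alt.strip() for alt in alternatives.split("|")]
    return grammar

INTEGER_TEXT = """
<start>   ::= <integer>
<integer> ::= <digit> | <digit><integer>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
"""

integer_grammar = grammar_from_text(INTEGER_TEXT)
print(integer_grammar["<integer>"])
# prints: ['<digit>', '<digit><integer>']
```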

## Hatching Grammars

Since grammars are represented as strings, it is fairly easy to introduce errors.  So let us introduce a helper function that checks a grammar for consistency.

First, this handy `symbols()` function gets us the list of symbols in an expansion.

In [31]:
import re

# As a symbol, we can have anything between <...> except spaces.
RE_SYMBOL = re.compile(r'(<[^<> ]*>)')

In [32]:
def symbols(expansion):
    return re.findall(RE_SYMBOL, expansion)

In [33]:
assert symbols("<term> * <factor>") == ["<term>", "<factor>"]
assert symbols("<digit><integer>") == ["<digit>", "<integer>"]
assert symbols("1 < 3 > 2") == []
assert symbols("1 <3> 2") == ["<3>"]
assert symbols("1 + 2") == []

The helper function `is_valid_grammar()` iterates over a grammar to check whether all used symbols are defined, and vice versa, which is very useful for debugging.  You don't have to delve into details here, but as always, it is important to get the input data straight before we make use of it.

In [34]:
import sys

In [35]:
def is_valid_grammar(grammar, start_symbol="<start>"):
    used_symbols = set([start_symbol])
    defined_symbols = set()

    for defined_symbol in grammar:
        defined_symbols.add(defined_symbol)
        expansions = grammar[defined_symbol]
        if not isinstance(expansions, list):
            print(repr(defined_symbol) + ": expansion is not a list", file=sys.stderr)
            return False
        if len(expansions) == 0:
            print(repr(defined_symbol) + ": expansions list empty", file=sys.stderr)
            return False

        for expansion in expansions:
            if not isinstance(expansion, str):
                print(repr(defined_symbol) + ": " + repr(expansion) + ": not a string", file=sys.stderr)
                return False

            for used_symbol in symbols(expansion):
                used_symbols.add(used_symbol)

    for unused_symbol in defined_symbols - used_symbols:
        print(repr(unused_symbol) + ": defined, but not used", file=sys.stderr)
    for undefined_symbol in used_symbols - defined_symbols:
        print(repr(undefined_symbol) + ": used, but not defined", file=sys.stderr)

    return used_symbols == defined_symbols

Our expression grammar passes the test:

In [36]:
assert is_valid_grammar(EXPR_GRAMMAR)

But these ones don't:

In [37]:
assert not is_valid_grammar({"<start>": ["<x>"], "<y>": ["1"]})

'<y>': defined, but not used
'<x>': used, but not defined


In [38]:
assert not is_valid_grammar({"<start>": "123"})

'<start>': expansion is not a list


In [39]:
assert not is_valid_grammar({"<start>": []})

'<start>': expansions list empty


In [40]:
assert not is_valid_grammar({"<start>": [1, 2, 3]})

'<start>': 1: not a string
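
Note that `is_valid_grammar()` compares the sets of used and defined symbols, so a symbol that is used only by rules which can never be reached from the start symbol still counts as used.  If this matters, one could additionally compute which symbols are reachable.  The following is a sketch only; `reachable_symbols()` and `ISLAND_GRAMMAR` are hypothetical names that do not appear elsewhere in this chapter.

```python
import re

RE_SYMBOL = re.compile(r'(<[^<> ]*>)')  # same pattern as above

def reachable_symbols(grammar, start_symbol="<start>"):
    """Return the set of symbols that can be reached from `start_symbol`."""
    reachable = set()
    worklist = [start_symbol]
    while worklist:
        symbol = worklist.pop()
        if symbol in reachable:
            continue
        reachable.add(symbol)
        # Follow every symbol occurring in any expansion of `symbol`.
        for expansion in grammar.get(symbol, []):
            worklist.extend(re.findall(RE_SYMBOL, expansion))
    return reachable

# <a> and <b> use each other, but neither is reachable from <start>.
ISLAND_GRAMMAR = {"<start>": ["1"], "<a>": ["<b>"], "<b>": ["<a>"]}
print(sorted(set(ISLAND_GRAMMAR) - reachable_symbols(ISLAND_GRAMMAR)))
# prints: ['<a>', '<b>']
```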


## A Simple Grammar Fuzzer

Let us now put the above grammars to use.  We will build a very simple grammar fuzzer that starts with a start symbol (`"<start>"`) and then keeps on expanding it.  To avoid expansion to infinite inputs, we place a limit (`max_symbols`) on the number of symbols.  Furthermore, to avoid being stuck in a situation where we cannot reduce the number of symbols any further, we also limit the total number of expansion steps (`max_expansion_trials`).

In [41]:
import random

In [42]:
class ExpansionError(Exception):
    pass

In [43]:
def simple_grammar_fuzzer(grammar, start_symbol="<start>", max_symbols=10, max_expansion_trials=100, log=False):
    term = start_symbol
    expansion_trials = 0

    while len(symbols(term)) > 0:
        symbol_to_expand = random.choice(symbols(term))
        expansion = random.choice(grammar[symbol_to_expand])
        new_term = term.replace(symbol_to_expand, expansion, 1)

        if len(symbols(new_term)) < max_symbols:
            term = new_term
            if log:
                print("%-40s" % (symbol_to_expand + " -> " + expansion), term)
            expansion_trials = 0
        else:
            expansion_trials += 1
            if expansion_trials >= max_expansion_trials:
                raise ExpansionError("Cannot expand " + repr(term))

    return term

Let us see how this simple grammar fuzzer obtains an arithmetic expression from the start symbol:

In [44]:
simple_grammar_fuzzer(grammar=EXPR_GRAMMAR, max_symbols=3, log=True)

<start> -> <expr>                        <expr>
<expr> -> <expr> - <term>                <expr> - <term>
<term> -> <factor>                       <expr> - <factor>
<expr> -> <term>                         <term> - <factor>
<factor> -> -<factor>                    <term> - -<factor>
<factor> -> <integer>                    <term> - -<integer>
<integer> -> <digit>                     <term> - -<digit>
<digit> -> 0                             <term> - -0
<term> -> <term> / <factor>              <term> / <factor> - -0
<factor> -> +<factor>                    <term> / +<factor> - -0
<term> -> <factor>                       <factor> / +<factor> - -0
<factor> -> +<factor>                    +<factor> / +<factor> - -0
<factor> -> +<factor>                    ++<factor> / +<factor> - -0
<factor> -> <integer>                    ++<integer> / +<factor> - -0
<factor> -> -<factor>                    ++<integer> / +-<factor> - -0
<integer> -> <digit>                     ++<digit> / +-<factor> - -0
<

'++2 / +-(--(-(+(9))) / -(+-7926)) - -0'

In [45]:
for i in range(10):
    print(simple_grammar_fuzzer(grammar=EXPR_GRAMMAR, max_symbols=5))

+(9 - +12 - -73) / -4
-(+(5)) / +((-+6)) * +(-+4) + (-+0 * -64.9 / +41.6)
+6 * 46 / +(-1 + (((++-(++(0.54))) - (4) + (-7) + --6 - --(+(05)))))
+++6 + -8 - -111 - -29
(-2200 * 8)
+85 * -7 / +-((++0 + (++(5) * (9 * +8 / --1 * -1)) - +((6 / 3 / 8 * -++-2889.6)))) * 3 - 4 / 6 / 9
(1 - +++-(8 + +3) * ((-4.07 * +-6)) + +((2 * (-+((--((5))) + +(((9.7)) - 9)) * ++8) * 8 - 8 - --+-+++-(-+(+++0 / -((9 + -42 + (+672))) + --(++-(4)) - -3 / -(+-7)))))) + 1 - 5 - +646
4 / +-2 + 0 - -71.7 / 8 / (--52981.874812908 * 9.6 / -39426) / 1.3
-0
-7 * ++5 + 6 - --((1) + 0) / +((-(9.1 + 9)) - -(3))


\todo{Discuss.}

## Some Grammars

With grammars, we can easily specify the format for several of the examples we discussed earlier.  The above arithmetic expressions, for instance, can be directly sent into `bc` (or any other program that takes arithmetic expressions).
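
As a small illustration of feeding generated expressions into a program, the following sketch uses Python's own parser as a stand-in for `bc`.  The helper name `accepted_by_python()` is made up for this example, and Python 3 is assumed.  The first two samples are taken from the fuzzer output above; the third is a fragment of the third output line.

```python
import ast

def accepted_by_python(expression):
    """Return True if Python's parser accepts `expression` as an expression."""
    try:
        # Parse only; the expression is never evaluated.
        ast.parse(expression, mode="eval")
        return True
    except SyntaxError:
        return False

samples = ["+(9 - +12 - -73) / -4", "(-2200 * 8)", "--(+(05))"]
for sample in samples:
    print(repr(sample), accepted_by_python(sample))
```

The last sample is rejected: our grammar allows integers with leading zeros such as `05`, whereas Python 3 does not.  The language described by a grammar and the language a program actually accepts need not coincide, and such differences are exactly what testing can uncover.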

Let us create some more grammars.  Here's one for `cgi_decode()`:

In [46]:
CGI_GRAMMAR = {
    "<start>":
        ["<string>"],

    "<string>":
        ["<letter>", "<letter><string>"],

    "<letter>":
        ["<plus>", "<percent>", "<other>"],

    "<plus>":
        ["+"],

    "<percent>":
        ["%<hexdigit><hexdigit>"],

    "<hexdigit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "b", "c", "d", "e", "f"],

    "<other>":  # Actually, could be _all_ letters
        ["0", "1", "2", "3", "4", "5", "a", "b", "c", "d", "e", "-", "_"],
}

assert is_valid_grammar(CGI_GRAMMAR)

In [47]:
for i in range(10):
    print(simple_grammar_fuzzer(grammar=CGI_GRAMMAR, max_symbols=10))

%31
++
+
+
0
+
+
%72
3+4+%39%81+
+


Or a URL grammar:

In [48]:
URL_GRAMMAR = {
    "<start>":
        ["<call>"],

    "<call>":
        ["<url>"],

    "<url>":
        ["<scheme>://<authority><path><query>"],

    "<scheme>":
        ["http", "https", "ftp", "ftps"],

    "<authority>":
        ["<host>", "<host>:<port>", "<userinfo>@<host>", "<userinfo>@<host>:<port>"],

    "<host>":  # Just a few
        ["cispa.saarland", "www.google.com", "fuzzingbook.com"],

    "<port>":
        ["80", "8080", "<nat>"],

    "<nat>":
        ["<digit>", "<digit><digit>"],

    "<digit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],

    "<userinfo>":  # Just one
        ["user:password"],

    "<path>":  # Just a few
        ["", "/", "/<id>"],

    "<id>":  # Just a few
        ["abc", "def", "x<digit><digit>"],

    "<query>":
        ["", "?<params>"],

    "<params>":
        ["<param>", "<param>&<params>"],

    "<param>":  # Just a few
        ["<id>=<id>", "<id>=<nat>"],
}

assert is_valid_grammar(URL_GRAMMAR)

In [49]:
for i in range(10):
    print(simple_grammar_fuzzer(grammar=URL_GRAMMAR, max_symbols=10))

ftp://user:password@www.google.com/
http://cispa.saarland?def=def&x43=9
ftp://fuzzingbook.com?x58=4&x41=x91
ftps://user:password@fuzzingbook.com:6/?abc=77
https://user:password@www.google.com
ftps://cispa.saarland/abc
http://user:password@fuzzingbook.com/abc?x55=6&abc=def
ftps://fuzzingbook.com
ftps://user:password@fuzzingbook.com/x93
ftps://www.google.com:80/


## Grammar Coverage

\todo{Show how systematically covering grammar features makes things even better.  Maintain a set of '(symbol, expansion)' pairs that would be consulted first.}
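
One possible reading of this idea, given here as a sketch only and under our own assumptions, is to track which `(symbol, expansion)` pairs have already been used and to prefer those that have not.  The names `all_expansion_pairs()` and `choose_expansion()` are hypothetical and not part of this chapter's fuzzer.

```python
import random

def all_expansion_pairs(grammar):
    """Return every (symbol, expansion) pair in `grammar`."""
    return {(symbol, expansion)
            for symbol, expansions in grammar.items()
            for expansion in expansions}

def choose_expansion(grammar, symbol, covered):
    """Pick an expansion for `symbol`, preferring pairs not yet in `covered`."""
    uncovered = [e for e in grammar[symbol] if (symbol, e) not in covered]
    # Fall back to all alternatives once everything has been covered.
    expansion = random.choice(uncovered or grammar[symbol])
    covered.add((symbol, expansion))
    return expansion

DIGIT_GRAMMAR = {"<digit>": list("0123456789")}
covered = set()
picks = [choose_expansion(DIGIT_GRAMMAR, "<digit>", covered) for _ in range(10)]
print(sorted(picks))
# prints: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
```

With purely random choice, ten picks would rarely hit all ten digits; with the coverage set, every alternative is used exactly once before any is repeated.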

## Evolving Derivation Trees

\todo{Go for an implementation that evolves trees rather than strings}
\todo{How about doing this when mutating with grammars?}

### Derivation Trees

\todo{Add}

## Alternatives to Grammars

To formally describe languages, the field of _formal languages_ has devised a number of _language specifications_.  _Regular expressions_, for instance, denote sets of strings: the regular expression `[a-z]*` denotes a (possibly empty) sequence of lowercase letters.  _Automata theory_ connects these languages to automata that accept these inputs; _finite state machines_, for example, can be used to specify the same languages as regular expressions.
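
The correspondence between regular expressions and finite state machines can be sketched for `[a-z]*`.  The hand-written machine below, with one accepting state and one trap state, is our own illustration; the function names are made up for this example.

```python
import re

def regex_accepts(s):
    """Check `s` against the regular expression [a-z]*."""
    return re.fullmatch(r"[a-z]*", s) is not None

def fsm_accepts(s):
    """Check `s` with a two-state machine: an accepting state and a trap state."""
    state = "accept"
    for char in s:
        if state == "accept" and "a" <= char <= "z":
            state = "accept"   # stay in the accepting state on a lowercase letter
        else:
            state = "reject"   # trap state: no way back
    return state == "accept"

for candidate in ["", "abc", "aBc", "a1"]:
    print(repr(candidate), regex_accepts(candidate), fsm_accepts(candidate))
```

Both functions agree on every candidate: the empty string and `abc` are accepted, whereas `aBc` and `a1` are rejected.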

Regular expressions are great for not-too-complex input formats, and the associated finite state machines have many properties that make them great for reasoning.  To specify more complex inputs, though, they quickly encounter limitations.  At the other end of the language spectrum, we have _universal grammars_ that denote the languages accepted by _Turing machines_.  A Turing machine can compute anything that can be computed; and with Python being a Turing-complete language, this means that we can also use a Python program $p$ to specify or even enumerate legal inputs.  But then, computer science theory also tells us that each such testing program has to be written specifically for the program to be tested, which is not the level of automation we want.
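
To illustrate what "a Python program that enumerates legal inputs" could look like, here is a minimal sketch for the `<integer>` language from earlier in this chapter.  The generator `enumerate_integers()` is a hypothetical example; note that it is written by hand for this one language, which is precisely the lack of automation mentioned above.

```python
import itertools

def enumerate_integers():
    """Yield all strings of the <integer> language, shortest first."""
    for length in itertools.count(1):
        # All digit sequences of the given length, in lexicographic order.
        for digits in itertools.product("0123456789", repeat=length):
            yield "".join(digits)

first_integers = list(itertools.islice(enumerate_integers(), 12))
print(first_integers)
# prints: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '00', '01']
```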



## Lessons Learned

* _Lesson one_
* _Lesson two_
* _Lesson three_

## Next Steps

_Link to subsequent chapters (notebooks) here, as in:_

* [use _mutations_ on existing inputs to get more valid inputs](Mutation_Fuzzing.ipynb)
* [use _grammars_ (i.e., a specification of the input format) to get even more valid inputs](Grammars.ipynb)
* [reduce _failing inputs_ for efficient debugging](Reducing.ipynb)


## Exercises

_Close the chapter with a few exercises such that people have things to do.  Use the Jupyter `Exercise2` nbextension to add solutions that can be interactively viewed or hidden.  (Alternatively, just copy the exercise and solution cells below with their metadata.)  We will set up things such that solutions do not appear in the PDF and HTML formats._

### Exercise 1


_Solution for the exercise_