# Fuzzing: Breaking Things with Random Inputs

In this unit, we'll start with one of the simplest test generation techniques.  The key idea of random text generation, also known as "fuzzing", is to feed a _string of random characters_ into a program in the hope of uncovering failures.


## A Testing Assignment

Fuzzing was conceived by Bart Miller in 1989 as a programming exercise for his students at the University of Wisconsin-Madison.  The [assignment](http://pages.cs.wisc.edu/~bart/fuzz/CS736-Projects-f1988.pdf) read:

> The goal of this project is to evaluate the robustness of various UNIX utility programs, given an unpredictable input stream. [...] First, you will build a _fuzz generator_. This is a program that will output a random character stream. Second, you will take the fuzz generator and use it to attack as many UNIX utilities as possible, with the goal of trying to break them.

This assignment captures the essence of fuzzing: Create random inputs, and see if they break things.  Just let it run long enough and you'll see.

## A Simple Fuzzer

Let us try to fulfill this assignment and build a fuzz generator.  The idea is to produce random characters, adding them to a buffer string variable (`out`), and finally returning the string.

This implementation uses the following Python features and functions:

* `random.randrange(start, end)` - return a random integer in the half-open range [`start`, `end`); `end` itself is never returned
* `range(start, end)` - create a sequence of integers from `start` to `end` - 1.  Typically used in iterations.
* `for elem in list: body` executes `body` in a loop with `elem` taking each value from `list`.
* `for i in range(start, end): body` executes `body` in a loop with `i` from `start` to `end` - 1.
* `chr(n)` - return a character with ASCII code `n`
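
The following short sketch illustrates these features, in particular the half-open ranges of `randrange()` and `range()`.  The seed and the sample values are arbitrary choices made for this illustration only.

```python
import random

random.seed(0)  # arbitrary seed, used only to make this sketch repeatable

# randrange(start, end) returns values from start to end - 1
samples = [random.randrange(32, 64) for _ in range(1000)]
print(min(samples) >= 32, max(samples) <= 63)  # True True

# range(start, end) yields start, ..., end - 1
print(list(range(3, 7)))  # [3, 4, 5, 6]

# chr() and ord() are inverses of each other
print(chr(65), ord('A'))  # A 65
```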

In [94]:
import gstbook

In [95]:
import random

In [96]:
# We set a specific seed to get the same inputs each time
random.seed(53727895348829)

In [97]:
def fuzzer(max_length=100, char_start=32, char_range=32):
    """A string of fewer than `max_length` characters
       in the range [`char_start`, `char_start` + `char_range`)"""
    string_length = random.randrange(0, max_length)
    out = ""
    for i in range(0, string_length):
        out += chr(random.randrange(char_start, char_start + char_range))
    return out

Called with a maximum length of 500 and a character range starting at 32 and spanning 96 codes, the `fuzzer()` function returns a string of random characters with codes from 32 to 127:

In [98]:
fuzzer(500, 32, 96)

'@0"\'H|YFas,H2AqF4+LsHAXB&[[~1Q7![uomgLqJz~U}rN[Hl~JSd6EfdM.r[?\\4b#OvH2:%_P=*{gE}\'B"iVJnKKOTN{sy}.;?FRzyV>{3\'>Rj)u>^;u5>WfXEA^sG`/{<X<L{v1#%PP90&}f;j7]<3{=)[<m;l;Rv~<dnIb+chULlfu"=,WG\\]]fC\'Co4S~cV^t2FGF9dg>7+4`l)(berJ/R|r:WrPY0;?N8N.E[pepB8gn/tV=Yl>\\^/y#[u",,\\z=Yr\'+ruyoXo&OKfIy\'NBQR%sjCv+>Llwh#C}er\'8=#'

Now imagine that this string were the input to a program expecting a specific input format – say, a comma-separated list of values, or an e-mail address.  Would the program be able to process such an input without any problems?
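
To make this question concrete, here is a minimal sketch of a test harness.  It repeats the `fuzzer()` logic from above so that it is self-contained; `parse_int_list()` and `run_trials()` are hypothetical names introduced only for this illustration, standing in for "a program expecting a comma-separated list of values".

```python
import random

def fuzzer(max_length=100, char_start=32, char_range=32):
    # Same logic as the fuzzer() above, repeated to keep the sketch self-contained
    string_length = random.randrange(0, max_length)
    return "".join(chr(random.randrange(char_start, char_start + char_range))
                   for _ in range(string_length))

def parse_int_list(text):
    """Hypothetical program under test: parse '1,2,3' into [1, 2, 3]."""
    return [int(field) for field in text.split(",")]

def run_trials(function, trials=100):
    """Feed `trials` fuzz inputs to `function`; collect (input, exception) pairs."""
    failures = []
    for _ in range(trials):
        data = fuzzer()
        try:
            function(data)
        except Exception as exc:
            failures.append((data, exc))
    return failures

random.seed(1)  # arbitrary seed for repeatability
failures = run_trials(parse_int_list)
print(f"{len(failures)} of 100 fuzz inputs raised an exception")
```

In Python, a malformed input typically surfaces as an exception, which is a comparatively graceful way to fail.  Programs that do not check their inputs at all may instead crash or hang.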

## Fuzzing Alphabets

If the fuzzing input above is already intriguing, consider that fuzzing can easily be set up to produce other kinds of input.  For instance, we can also have `fuzzer()` produce a series of lowercase letters.  We use `ord(c)` to return the ASCII code of the character `c`.

In [99]:
fuzzer(1000, ord('a'), 26)

'jrzidwaafeqkagfltxdfnhayhjhaktzignrkcfhcsiugijidslokcmzyxtgjvdkozqhaeyhcqyxamviblvyomcemsasmimxrxrfeagvqyrh'

Assume a program expects an identifier as its input.  Would it expect such a long identifier?

We can also have `fuzzer()` produce a series of digits:

In [100]:
fuzzer(100, ord('0'), 10)

'60596698418928873010609759525593759127300999447325621674878013'

... and feed this into some function or program.  This can yield very high numbers:

In [101]:
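if __name__ == "__main__":
    # fuzzer() may return an empty string, for which int() raises ValueError;
    # fall back to "0" in that case
    x = int(fuzzer(100, ord('0'), 10) or "0")
    print(x)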

26084717320899


Python integers have arbitrary precision, so a Python program can actually deal with such numbers.  But how about programs in other languages?
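
As a rough illustration of what could happen in a language with fixed-size integers, the following sketch stores the number printed above in a 32-bit signed integer.  It uses `ctypes.c_int32`, which performs no overflow checking, as a stand-in for a C `int`; the assumption that the target program uses 32-bit integers is ours.

```python
import ctypes

x = 26084717320899  # the fuzzed number printed above

INT32_MAX = 2**31 - 1
print(x > INT32_MAX)  # True: the value does not fit into 32 bits

# ctypes.c_int32 silently keeps only the low 32 bits of the value
wrapped = ctypes.c_int32(x).value
print(wrapped == x)   # False: the stored value differs from the input
```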

## Bugs Fuzzers Find

When Miller and his students ran their first fuzzers in 1989, they found an alarming result: About **a third of the UNIX utilities** they fuzzed had issues – they crashed, hung, or otherwise failed when confronted with fuzzing input \cite{miller1989}.  This was all the more worrying as many of these UNIX utilities were used in scripts that would also process network input.  Programmers quickly built and ran their own fuzzers, rushed to fix the reported errors, and learned not to trust external inputs anymore.

What kind of problems did Miller's fuzzing experiment find?  It turns out that the mistakes programmers made in 1989 are still the same mistakes being made today.

### Buffer Overflows

Many programs have built-in maximum lengths for inputs and input elements.  In languages like C, it is easy to exceed these lengths without the program (or the programmer) even noticing, triggering so-called **buffer overflows**.  The following code, for instance, happily copies the `input` string into a `weekday` string even if `input` has more than eight characters:
```c
char weekday[9]; // 8 characters + trailing '\0' terminator
strcpy (weekday, input);
```
Ironically, this already fails if `input` is `"Wednesday"` (9 characters): the nine letters fill `weekday` completely, and the trailing `'\0'` string terminator is written to whatever resides in memory after `weekday`, triggering arbitrary behavior.  With an even longer input, the excess characters could overwrite, say, an adjacent character variable used as a boolean flag, silently changing it from `'n'` to `'y'`.  With fuzzing, it is very easy to produce arbitrarily long inputs and input elements.
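
Python itself checks all buffer accesses, but the effect can be simulated.  The following sketch models memory as a `bytearray` and mimics an unchecked `strcpy()`; the memory layout (a one-byte flag directly after `weekday`) and the helper `unchecked_copy()` are assumptions made for this illustration, as real compilers are free to lay out variables differently.

```python
def unchecked_copy(memory, offset, data):
    """Mimic strcpy(): copy `data` plus a 0 terminator, ignoring buffer bounds."""
    for i, byte in enumerate(data + b"\0"):
        memory[offset + i] = byte

# Hypothetical layout: 9 bytes for `weekday`, directly followed by a 1-byte flag
memory = bytearray(16)
WEEKDAY, FLAG = 0, 9
memory[FLAG] = ord("n")

unchecked_copy(memory, WEEKDAY, b"Wednesday")
print(memory[FLAG])       # 0: the string terminator overwrote the flag

memory[FLAG] = ord("n")
unchecked_copy(memory, WEEKDAY, b"Wednesdayy")  # 10 characters, as a fuzzer might produce
print(chr(memory[FLAG]))  # y: the flag now holds the tenth character
```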

### Missing Error Checks

Many programming languages do not have exceptions, but instead have functions return special **error codes** in exceptional circumstances.  The C function `getchar()`, for instance, normally returns a character from the standard input; if no input is available anymore, it returns the special value `EOF` (end of file).  Now assume a programmer is scanning the input for the next space character, skipping all characters before it:
```c
char read_until_space() {
    char lastc;

    // Note: no check for EOF here
    do {
        lastc = getchar();
    } while (lastc != ' ');

    return (lastc);
}
```
What happens if the input ends prematurely, as is perfectly feasible with fuzzing?  Well, `getchar()` returns `EOF`, and keeps on returning `EOF` when called again; since `EOF` never equals a space, the code above simply enters an infinite loop.
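
The same behavior can be sketched in Python.  Here, `io.StringIO.read(1)` plays the role of `getchar()` and returns the empty string at the end of the input.  The function name `read_until_space()` is hypothetical, and the `max_reads` bound is an addition of this sketch so that the example terminates; the C loop above has no such bound.

```python
import io

def read_until_space(stream, max_reads=1000):
    """Read until a space is seen, without checking for end of input.
    `max_reads` is a safety bound added for this sketch only."""
    reads = 0
    while reads < max_reads:
        c = stream.read(1)  # returns '' once the input is exhausted
        reads += 1
        if c == " ":
            return c, reads
    return None, reads      # gave up: an unbounded loop would spin forever

print(read_until_space(io.StringIO("abc def")))  # (' ', 4)
print(read_until_space(io.StringIO("abcdef")))   # (None, 1000)
```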


### Rogue Numbers

With fuzzing, it is easy to generate **uncommon values** in the input, causing all kinds of interesting behavior.  Consider the following code, again in the C language, which first reads a buffer size from the input, and then allocates a buffer of the given size:
```c
char *read_input() {
    int size = read_buffer_size();
    char *buffer = (char *)malloc(size);
    // fill buffer
    return (buffer);
}
```
What happens if `size` is very large, exceeding program memory?  What happens if `size` is less than the number of characters following?  What happens if `size` is negative?  By providing a random number here, fuzzing can cause all kinds of damage.
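
The three questions above can be turned into explicit checks.  The following is a minimal sketch, not a complete defense: `check_buffer_size()`, its `payload_length` parameter, and the `limit` of one million bytes are all hypothetical choices made for this illustration.

```python
import random

def check_buffer_size(size, payload_length, limit=1_000_000):
    """Return a list of problems with a size field read from untrusted input.
    `limit` is an arbitrary cap chosen for this sketch."""
    problems = []
    if size < 0:
        problems.append("negative size")
    if size > limit:
        problems.append("size exceeds limit")
    if 0 <= size < payload_length:
        problems.append("size smaller than payload")
    return problems

random.seed(2)  # arbitrary seed for repeatability
for _ in range(5):
    size = random.randrange(-2**31, 2**31)  # a fuzzed 32-bit size field
    print(size, check_buffer_size(size, payload_length=100))
```

A uniformly random 32-bit value is almost always either negative or far beyond any sensible limit, which is why random size fields are so effective at exposing missing checks.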


## Some Famous Bugs

found via fuzzing...

One might argue that these are all problems of bad programming, or of bad programming languages.  But then, there are thousands of people starting to program every day, and all of them can make the same mistakes again and again.  The somewhat better news is that fuzzing can easily detect such mistakes.

## Background

Here's a citation, just for fun: \cite{purdom1972}