# Mutation-Based Fuzzing

Most [randomly generated inputs](Basic_Fuzzing.ipynb) are syntactically _invalid_ and thus are quickly rejected by the processing program.  To exercise functionality beyond input processing, we must increase our chances of obtaining valid inputs.  One way to do so is by _mutating_ existing valid inputs, that is, introducing small changes that may still keep the input valid, yet exercise new behavior.

**Prerequisites**

* You should know how basic fuzzing works; for instance, from the ["Fuzzing"](Basic_Fuzzing.ipynb) chapter.

## Fuzzing a URL Parser

Many programs expect their inputs to come in a very specific format before they actually process them.  As an example, think of a program that accepts a URL (a Web address).  The URL has to follow the URL format so that the program can deal with it.  When fuzzing with random inputs, what are our chances of actually producing a valid URL?

To get deeper into the problem, let us explore what URLs are made of.  A URL consists of a number of elements:

    scheme://netloc/path?query#fragment
    
where
* `scheme` is the protocol to be used, including `http`, `https`, `ftp`, `file`...
* `netloc` is the name of the host to connect to, such as `www.google.com`
* `path` is the path on that very host, such as `search`
* `query` is a list of key/value pairs, such as `q=fuzzing`
* `fragment` is a marker for a location in the retrieved document, such as `#result`

In Python, we can use the `urlparse()` function to parse and decompose a URL into its parts.

In [162]:
import gstbook

In [48]:
try:
    from urlparse import urlparse      # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3

urlparse("http://www.google.com/search?q=fuzzing")

ParseResult(scheme='http', netloc='www.google.com', path='/search', params='', query='q=fuzzing', fragment='')

We see how the result encodes the individual parts of the URL in different attributes.
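As a small illustration of these attributes, the sketch below reads each component by name. The example URL (with an added `#result` fragment) is chosen here for illustration only.

```python
try:
    from urlparse import urlparse      # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3

# Illustrative URL; the "#result" fragment is added for this example
parts = urlparse("http://www.google.com/search?q=fuzzing#result")

# Each component is available as a named attribute of the result
print(parts.scheme)    # http
print(parts.netloc)    # www.google.com
print(parts.path)      # /search
print(parts.query)     # q=fuzzing
print(parts.fragment)  # result
```

Note that the path keeps its leading `/`, whereas the query and the fragment are returned without their `?` and `#` separators.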

Let us now assume we have a program that takes a URL as input.  To simplify things, we won't let it do very much; we simply have it check the passed URL for validity.  If the URL is valid, it returns `True`; otherwise, it raises an exception.

In [89]:
def http_program(url):
    supported_schemes = ["http", "https"]
    result = urlparse(url)
    if result.scheme not in supported_schemes:
        raise ValueError("Scheme must be one of " + repr(supported_schemes))
    if result.netloc == '':
        raise ValueError("Host must be non-empty")
    
    # Do something with the URL
    return True

Let us now go and fuzz `http_program()`.  To fuzz, we use the full range of printable ASCII characters, such that `:`, `/`, and lowercase letters are included.

In [50]:
%%capture
from Basic_Fuzzing import fuzzer

In [90]:
fuzzer(char_start=32, char_range=96)

'0F.f{68{/IbAdED\x7fvPL'

Let's try to fuzz with 1000 random inputs and see whether we have some success.

In [91]:
for i in range(1000):
    try:
        url = fuzzer(char_start=32, char_range=96)
        result = http_program(url)
        print("Success!")
    except ValueError:
        pass

What are the chances of actually getting a valid URL?  We need our string to start with `"http://"` or `"https://"`.  Let's take the `"http://"` case first.  That's seven very specific characters we need to start with.  The chance of producing these seven characters randomly (with a character range of 96 different characters) is $1 : 96^7$, or

In [63]:
96 ** 7

75144747810816

The odds of producing a `"https://"` prefix are even worse, at $1 : 96^8$:

In [64]:
96 ** 8

7213895789838336

which gives us a total chance of

In [67]:
likelihood = 1 / (96 ** 7) + 1 / (96 ** 8)
likelihood

1.344627131107667e-14

And this is the number of runs (on average) we'd need to produce a valid URL:

In [68]:
1 / likelihood

74370059689055.02
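The same arithmetic can be packaged as a small helper. This is only a sketch: `prefix_likelihood()` is a hypothetical name introduced here, and it assumes every character is drawn uniformly and independently from 96 candidates. Since no string can start with both prefixes, the two likelihoods can simply be added.

```python
def prefix_likelihood(prefix, alphabet_size=96):
    """Chance that a uniformly random string starts with `prefix`.

    Hypothetical helper; assumes independent, uniformly drawn characters
    and a string at least as long as the prefix.
    """
    return 1.0 / (alphabet_size ** len(prefix))

# The two prefixes are mutually exclusive, so their likelihoods add up
likelihood = sum(prefix_likelihood(p) for p in ["http://", "https://"])
expected_runs = 1 / likelihood

print(likelihood)
print(expected_runs)
```

With other prefixes (say, `"ftp://"`), the same helper shows how quickly the odds change with every additional required character.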

Let's measure how long one run of `http_program()` takes:

In [93]:
import time
 
def clock():
    """Return a high-resolution timestamp in seconds"""
    try:
        return time.perf_counter()  # Python 3
    except AttributeError:
        return time.clock()         # Python 2 (no perf_counter)
    

TRIALS = 1000
start_time = clock()
for i in range(TRIALS):
    try:
        url = fuzzer(char_start=32, char_range=96)
        result = http_program(url)
        print("Success!")
    except ValueError:
        pass
end_time = clock()

duration_per_run_in_seconds = (end_time - start_time) / TRIALS
duration_per_run_in_seconds

0.00010330440901452676

That's pretty fast, isn't it?  Unfortunately, we have a lot of runs to cover.

In [94]:
seconds_until_success = duration_per_run_in_seconds * (1 / likelihood)
seconds_until_success

7682755064.552908

which translates into

In [95]:
hours_until_success = seconds_until_success / 3600
days_until_success = hours_until_success / 24
years_until_success = days_until_success / 365.25
years_until_success

243.45181713922824

Even if we parallelize things a lot, we're still in for months to years of waiting.  And that's for getting _one_ successful run that will get deeper into `http_program()`.
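To put numbers on this, here is a rough sketch that divides the waiting time by the number of parallel workers. It assumes perfect linear speedup, which is optimistic, and it reuses the `seconds_until_success` value measured above; the variable `years_by_workers` is introduced only for this illustration.

```python
seconds_until_success = 7682755064.552908  # value measured above
SECONDS_PER_YEAR = 3600 * 24 * 365.25

# Assumption: perfect linear speedup, i.e. N workers are N times as fast
years_by_workers = {}
for workers in [1, 10, 100, 1000]:
    years = seconds_until_success / workers / SECONDS_PER_YEAR
    years_by_workers[workers] = years
    print(workers, "workers:", round(years, 2), "years")
```

Even with 1000 workers, the expected wait stays at roughly a quarter of a year.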

What basic fuzzing will do well is to test `urlparse()`, and if there is an error in this parsing function, it has good chances of uncovering it.  But as long as we cannot produce a valid input, we are out of luck in reaching any deeper functionality.
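As a sketch of what such a test of `urlparse()` could look like, the code below feeds random strings directly to the parser and counts how often it rejects them. The `random_string()` function is a hypothetical stand-in for `fuzzer()` so that the example runs on its own; it is not the function from the "Fuzzing" chapter.

```python
import random

try:
    from urlparse import urlparse      # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3

def random_string(max_length=100, char_start=32, char_range=96):
    """Hypothetical stand-in for fuzzer(): a random string of ASCII characters"""
    length = random.randrange(0, max_length + 1)
    return "".join(chr(random.randrange(char_start, char_start + char_range))
                   for _ in range(length))

random.seed(0)  # fixed seed so that the run is repeatable
outcomes = {"parsed": 0, "rejected": 0}
for _ in range(1000):
    try:
        urlparse(random_string())
        outcomes["parsed"] += 1
    except ValueError:
        # urlparse() reports malformed input (e.g. a bad IPv6 host) this way
        outcomes["rejected"] += 1

print(outcomes)
```

Any exception other than `ValueError` would escape the loop and end the run, which is exactly the kind of unexpected behavior a fuzzer is meant to expose.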

## Mutating Inputs

The alternative to generating random strings from scratch is to start with a given _valid_ input, and then to subsequently _mutate_ it.  A _mutation_ in this context is a simple string manipulation, such as inserting a (random) character, deleting a character, or flipping a bit in a character representation.  Here are some mutations to get you started:

In [34]:
import random

In [35]:
def delete_random_character(s):
    """Returns s with a random character deleted"""
    if s == "":
        return s

    pos = random.randint(0, len(s) - 1)
    # print("Deleting", repr(s[pos]), "at", pos)
    return s[:pos] + s[pos + 1:]

In [36]:
for i in range(10):
    x = delete_random_character("A quick brown fox")
    print(x)

A quick brow fox
A quick brwn fox
Aquick brown fox
A quick bown fox
A quic brown fox
A quick brow fox
A quick brwn fox
A uick brown fox
A quik brown fox
A quick brwn fox


In [124]:
def insert_random_character(s):
    """Returns s with a random character inserted"""
    pos = random.randint(0, len(s))
    random_character = chr(random.randrange(32, 128))
    # print("Inserting", repr(random_character), "at", pos)
    return s[:pos] + random_character + s[pos:]

In [125]:
for i in range(10):
    print(insert_random_character("A quick brown fox"))

A quick br`own fox
A quick brown fox^
-A quick brown fox
A quick brown flox
A q}uick brown fox
A quzick brown fox
A quic$k brown fox
A q3uick brown fox
A quick b.rown fox
A quick sbrown fox


In [126]:
def flip_random_character(s):
    """Returns s with a random bit flipped in a random position"""
    if s == "":
        return s

    pos = random.randint(0, len(s) - 1)
    c = s[pos]
    bit = 1 << random.randint(0, 6)
    new_c = chr(ord(c) ^ bit)
    # print("Flipping", bit, "in", repr(c) + ", giving", repr(new_c))
    return s[:pos] + new_c + s[pos + 1:]


In [127]:
for i in range(10):
    print(flip_random_character("A quick brown fox"))

A quick browj fox
A quick brmwn fox
A quick brown foy
A quick br/wn fox
C quick brown fox
A`quick brown fox
A quick"brown fox
A quick browj fox
A quick browf fox
A quick br/wn fox
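To see where results such as `br/wn` or `brmwn` come from, the following sketch lists all seven single-bit flips of the character `'o'`, using the same XOR operation as `flip_random_character()`. The list `flips` is introduced only for this illustration.

```python
c = "o"
flips = []
for bit_position in range(7):
    bit = 1 << bit_position          # mask with exactly one bit set
    flipped = chr(ord(c) ^ bit)      # XOR toggles that bit
    flips.append(flipped)
    print(bit_position,
          format(ord(c), "07b"), "->", format(ord(flipped), "07b"),
          repr(flipped))
```

Flipping bit 1 turns `'o'` into `'m'`, and flipping bit 6 turns it into `'/'`; flipping bit 4 yields the non-printable character `'\x7f'`.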


Let us now create a random mutator that randomly chooses which mutation to apply:

In [128]:
mutators = [delete_random_character, insert_random_character, flip_random_character]

In [129]:
def mutate(s):
    """Return s with a random mutation applied"""
    mutator = random.choice(mutators)
    # print(mutator)
    return mutator(s)

In [130]:
for i in range(10):
    print(mutate("A quick brown fox"))

A quick brow~ fox
{A quick brown fox
A quick brow~ fox
A quick brown foxZ
A quick brown fo
A quick brown foxS
A quick brownfox
A quick brown f&ox
A quick bBrown fox
A qUick brown fox


The idea is now that _if_ we have some valid input(s) to begin with, we may create more input candidates by applying one of the above mutations.  To see how this works, let's get back to URLs.

## Mutating URLs

We start by creating a function `is_valid_url()` that checks whether `http_program()` accepts the input.

In [131]:
def is_valid_url(url):
    try:
        result = http_program(url)
        return True
    except ValueError:
        return False

assert is_valid_url("http://www.google.com/search?q=fuzzing")
assert not is_valid_url("xyzzy")

Let us now apply the `mutate()` function on a given URL and see how many valid inputs we obtain.

In [132]:
seed_input = "http://www.google.com/search?q=fuzzing"
valid_inputs = set()
TRIALS = 20

for i in range(TRIALS):
    inp = mutate(seed_input)
    if is_valid_url(inp):
        valid_inputs.add(inp)

We can now observe that by _mutating_ the original input, we get a high proportion of valid inputs:

In [133]:
len(valid_inputs) / TRIALS

0.65

What are the odds of also producing an `https:` prefix by mutating an `http:` sample seed input?  We have to choose the insertion mutator ($1 : 3$), insert the right character `'s'` ($1 : 96$), and do so at the correct position (approximately $1 : l$), where $l$ is the length of our seed input.  This means that on average, we need this many runs:

In [134]:
trials = 3 * 96 * len(seed_input)
trials

10944
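A slightly more careful count can be obtained by enumerating every possible insertion. This sketch assumes the behavior of `insert_random_character()` shown earlier, which picks one of `len(s) + 1` positions and one of 96 characters; the names `hits` and `expected_trials` are introduced here.

```python
seed_input = "http://www.google.com/search?q=fuzzing"

# insert_random_character() picks one of len(s) + 1 positions
# and one of 96 characters (code points 32..127)
positions = range(len(seed_input) + 1)
characters = [chr(code) for code in range(32, 128)]

# Count the insertions that yield an "https://" prefix
hits = sum(
    1
    for pos in positions
    for ch in characters
    if (seed_input[:pos] + ch + seed_input[pos:]).startswith("https://")
)
total_insertions = len(positions) * len(characters)

# Only one of the three mutators inserts, hence the factor of 3
expected_trials = 3 * total_insertions / float(hits)
print(hits, expected_trials)
```

Exactly one insertion succeeds, which gives 11232 expected trials. This is the same order of magnitude as the estimate above.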

We can actually afford this.  Let's try:

In [151]:
%%capture
from ExpectError import ExpectTimeout

In [161]:
start_time = clock()
trials = 0
while True:
    trials += 1
    inp = mutate(seed_input)
    if inp.startswith("https://"):
        print("Success after", trials, "trials in", clock() - start_time, "seconds")
        break

Success after 6917 trials in 0.07489188300678506 seconds


Of course, if we wanted to get, say, an `"ftp://"` prefix, we would need more mutations and more runs.  Most importantly, though, we would need to apply _multiple_ mutations.  This raises the question of whether we can _guide_ these mutations in some way; this is what we will explore in the next chapter on ["Guided Mutations"](Guided_Mutations.ipynb).
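Applying several mutations in a row could be sketched as follows. This is an unguided illustration only, not the technique of the next chapter; `multi_mutate()` is a hypothetical helper, and the three mutators are repeated in compact form so that the example runs on its own.

```python
import random

def delete_random_character(s):
    if s == "":
        return s
    pos = random.randint(0, len(s) - 1)
    return s[:pos] + s[pos + 1:]

def insert_random_character(s):
    pos = random.randint(0, len(s))
    return s[:pos] + chr(random.randrange(32, 128)) + s[pos:]

def flip_random_character(s):
    if s == "":
        return s
    pos = random.randint(0, len(s) - 1)
    bit = 1 << random.randint(0, 6)
    return s[:pos] + chr(ord(s[pos]) ^ bit) + s[pos + 1:]

def multi_mutate(s, mutations):
    """Hypothetical helper: apply `mutations` random mutations in sequence"""
    mutators = [delete_random_character,
                insert_random_character,
                flip_random_character]
    for _ in range(mutations):
        s = random.choice(mutators)(s)
    return s

random.seed(1)  # fixed seed so that the run is repeatable
seed_input = "http://www.google.com/search?q=fuzzing"
for _ in range(5):
    print(repr(multi_mutate(seed_input, mutations=5)))
```

The more mutations we stack, the further the result drifts from the seed, and the less likely it is to remain valid. This trade-off is one reason why guidance becomes useful.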

## Lessons Learned

* Randomly generated inputs are frequently invalid – and thus exercise mostly input processing functionality.
* Mutations from existing valid inputs have much higher chances to be valid, and thus to exercise functionality beyond input processing.


## Next Steps

Our aim is still to sufficiently cover functionality.  From here, we can continue with:

1. Try to cover as much _implemented_ functionality as possible.  To this end, we need to access the program implementation, measure which parts would actually be reached with our inputs, and use this _coverage_ to guide our search.  We will explore this in the next chapter, which discusses [guided mutations](Guided_Mutations.ipynb).

2. Try to cover as much _specified_ functionality as possible.  Here, we would need a _specification of the input format,_ distinguishing between individual input elements such as (in our case) the scheme, host, path, and query of a URL, and attempting to cover as many of these as possible.  We will explore this when we come to [grammar-based testing](Grammar_Testing.ipynb), and especially [grammar-based mutations](Grammar_Mutations.ipynb).

Finally, the concept of a "population" that is systematically "evolved" through "mutations" will be explored in depth when discussing [search-based testing](Search_Based_Testing.ipynb).  Enjoy!


## Exercises


### Exercise 1

Apply the above mutation-based fuzzing technique to `bc`, using files, as in our [Introduction to Fuzzing](Basic_Fuzzing.ipynb).

### Exercise 2

In this [blog post](https://lcamtuf.blogspot.com/2014/08/binary-fuzzing-strategies-what-works.html), the author of _American Fuzzy Lop_ (AFL), a very popular mutation-based fuzzer, discusses the efficiency of various mutation operators.  Implement four of them and evaluate their efficiency as in the examples above.

### Exercise 3

Design and implement a system that will gather a population of URLs from the Web.  How can you ensure maximum diversity between the individual inputs?