# Mutation-Based Fuzzing

Most [randomly generated inputs](Basic_Fuzzing.ipynb) are syntactically _invalid_ and thus are quickly rejected by the processing program.  To exercise functionality beyond input processing, we must increase chances to obtain valid inputs.  One such way is by _mutating_ existing valid inputs - that is, introducing small changes that may still keep the input valid, yet exercise new behavior.

**Prerequisites**

* You should know how basic fuzzing works; for instance, from the ["Fuzzing"](Basic_Fuzzing.ipynb) chapter.

## The Problem

Most modern programs do a good job in _validating_ their inputs before they actually process them.  As an example, think of a _compiler_ translating program code into a lower-level language.  The processing steps of a compiler are typically depicted as a pipeline of components, each processing the input of its predecessors and producing output for its successors.  In the beginning, we typically have a lexical analysis that puts letters together into words, a syntactic analysis that puts sequences of words and items into structures, and then the actual compilation steps that translate these structures into code:

![Compiler Pipeline](PICS/Compiler.pdf)

The problem is that with random inputs, you will be able to exercise a lot of functionality in the leftmost stages (i.e., the lexical and possibly syntactical analyses), but the chances of actually producing a _valid_ input that will make it to the later stages are slim.


### Fuzzing Python Expressions

To illustrate just how low the chances are, let us _illustrate the problem on the Python interpreter._  Let us get the fuzzing function from  the ["Fuzzing"](Basic_Fuzzing.ipynb) chapter.

In [606]:
import gstbook
from Basic_Fuzzing import fuzzer

We can use the `fuzzer()` function to generate random inputs.  Would this be a valid Python expression?

In [607]:
fuzzer()

"=+8!1=)'9.)"

To test which inputs are actually valid, we use the Python `parser` module.  `parser.suite(_source_)` returns an internal object (actually, a parse tree) if _source_ is a valid command:

In [608]:
import parser

parser.suite("print(2 + 2)")

<parser.st at 0x10da68730>

Note that the command is _not_ executed.  That is because evaluating or otherwise executing randomly generated strings would be quite a risk: What happens if by chance, we create a command that deletes your files?

If _source_ is invalid, `parser.suite()` raises an exception:

In [609]:
from gstbook.expect_error import ExpectError

with ExpectError():
    parser.suite("print(2 +<>^& 37)")

Traceback (most recent call last):
  File "<ipython-input-609-a6d15798aa29>", line 4, in <module>
    parser.suite("print(2 +<>^& 37)")
  File "<string>", line 1
    print(2 +<>^& 37)
              ^
SyntaxError: invalid syntax


We can thus write a function `is_valid_expr()` that checks whether an expression is valid in Python:

In [610]:
def is_valid_expr(source):
    """Returns true iff source is a valid Python expression"""
    if '#' in source:
        return False
    try:
        parser.suite("print(" + source + ")")
        return True
    except SyntaxError:
        return False
    except ValueError:
        return False

assert is_valid_expr("4 + 4")
assert not is_valid_expr("37 !@#$ 564")

### Invalid Inputs

Let us see how many of the `fuzzer()` outputs are actually valid:

In [611]:
valid_inputs = set()
TRIALS = 1000

for i in range(TRIALS):
    input = fuzzer()
    if is_valid_expr(input):
        valid_inputs.add(input)
        
len(valid_inputs) / TRIALS

0.01

About 99% of all generated inputs are invalid - that's not very many.  What are the valid ones we get?

In [612]:
valid_inputs

{'',
 ' 24',
 "'/5-1>(>0382?7+?/8 '",
 '\'9-):;%)& ?"+\'',
 '.0',
 '44&+"48.?*+*8<$:&:)(3)!"',
 '5',
 '6',
 '7',
 '9'}

While we do have a chance to create numbers and very simple arithmetic expressions, we are going to miss plenty of Python data types and functionality - and of course, we are not going to cover the code that handles these.  

What are the odds of producing a Python set, for instance? It would have to start with `set(`.  In its default configuration, for instance, `get_fuzzer()` does not even produce letters.  If we give it a range of say, 64 characters, we have a chance of $1 : 64^4$ to have an input that starts with `"set("`.  How much is that, again?

In [613]:
64 ** 4

16777216

Indeed, less than one in a million.  Plus, we'd also need the closing `')'` character...

## Mutating Inputs

The alternative to generating random strings from scratch is to start with a guiven _valid_ input, and then to subsequently _mutate_ it.  A _mutation_ in this context is a simple string manipulation - say, inserting a (random) character, deleting a character, or flipping a bit in a character representation.  Here are some mutations to get you started:

In [614]:
import random

In [615]:
def delete_random_character(s):
    """Returns s with a random character deleted"""
    if s == "":
        return s

    pos = random.randint(0, len(s) - 1)
    # print("Deleting", repr(s[pos]), "at", pos)
    return s[:pos] + s[pos + 1:]

for i in range(10):
    x = delete_random_character("A quick brown fox")
    print(x)

 quick brown fox
 quick brown fox
A quck brown fox
A quick bown fox
A quick brow fox
A quickbrown fox
A qick brown fox
A quick brown ox
A quick brow fox
A quick brownfox


In [616]:
def insert_random_character(s):
    """Returns s with a random character inserted"""
    pos = random.randint(0, len(s))
    random_character = chr(random.randrange(32, 64))
    # print("Inserting", repr(random_character), "at", pos)
    return s[:pos] + random_character + s[pos:]

for i in range(10):
    print(insert_random_character("A quick brown fox"))

A (quick brown fox
A quic/k brown fox
A quick 'brown fox
A quick: brown fox
A quick b%rown fox
A quick br&own fox
A quick brown$ fox
A quick bro)wn fox
5A quick brown fox
A quick br,own fox


In [617]:
def flip_random_character(s):
    """Returns s with a random bit flipped in a random position"""
    if s == "":
        return s

    pos = random.randint(0, len(s) - 1)
    c = s[pos]
    bit = 1 << random.randint(0, 6)
    new_c = chr(ord(c) ^ bit)
    # print("Flipping", bit, "in", repr(c) + ", giving", repr(new_c))
    return s[:pos] + new_c + s[pos + 1:]

for i in range(10):
    print(flip_random_character("A quick brown fox"))

A quick bvown fox
A quick brown fkx
A qUick brown fox
A quisk brown fox
A quick brown fox
A quick brgwn fox
A quick brown vox
@ quick brown fox
A quick brmwn fox
A quick brown fkx


Let us now create a random mutator that randomly chooses which mutation to apply:

In [618]:
mutators = [delete_random_character, insert_random_character, flip_random_character]

def mutate(s):
    """Return s with a random mutation applied"""
    mutator = random.choice(mutators)
    # print(mutator)
    return mutator(s)


for i in range(10):
    print(mutate("A quick brown fox"))

A quick brown fx
A quick browo fox
A quik brown fox
Q quick brown fox
A quck brown fox
A quick 3brown fox
A$ quick brown fox
A quik brown fox
A quick brown fo,x
A quick brownfox


Let us now apply the `mutate()` function on a Python expression and see how many valid inputs we obtain.

In [619]:
seed_input = "1 + 2 * 3 / 4"
valid_inputs = set()
TRIALS = 1000

for i in range(TRIALS):
    input = mutate(seed_input)
    if is_valid_expr(input):
        valid_inputs.add(input)

The first thing we observe is that the number of valid inputs is now much higher.  We are still far away from 100% valid inputs, but this is an improvement:

In [620]:
len(valid_inputs) / TRIALS

0.097

Most important, though, is that the valid inputs now cover many more Python expression features – additional operands, identifiers, and more:

In [621]:
print(valid_inputs)

{'15 + 2 * 3 / 4', '1 + 2 *  3 / 4', '1 + 2 * 3 / 0', '9 + 2 * 3 / 4', '1 + r * 3 / 4', '1+ + 2 * 3 / 4', '0 + 2 * 3 / 4', '1 + 2 * 3 / 6', '1 + 2 * 83 / 4', '19 + 2 * 3 / 4', '1 + 3 * 3 / 4', '1 + 2 * 3 / 24', '1 + 2 * 3 / 5', '-1 + 2 * 3 / 4', ' 1 + 2 * 3 / 4', '1 + 2 * 3 / 45', '1 + 2 * 3 / 49', '1 + 2 * 30 / 4', '1 + 2 * 3/ 4', '1 + 2 * 38 / 4', '1 + 2 * 3 /  4', '1 / 2 * 3 / 4', '1 + 2 * 2 / 4', '1 =+ 2 * 3 / 4', '1 + 2 * 7 / 4', '10+ 2 * 3 / 4', '1 + 2 * 36 / 4', '1 *+ 2 * 3 / 4', '1 + 0 * 3 / 4', '1 + 2 * s / 4', '1 + 2 * 35 / 4', '1 + 2 * 3 / 94', '1 + 2 * 3 /4', '1 + 24 * 3 / 4', '1 + 2 *3 / 4', '1 + 2 * +3 / 4', '1 + 2 * 1 / 4', '1 + 2 * 3  / 4', '1 + 2 + 3 / 4', '1 * 2 * 3 / 4', '1 + 2 * 33 / 4', '14 + 2 * 3 / 4', '3 + 2 * 3 / 4', '1 + 20 * 3 / 4', '1 + 2 * 3 + 4', '91 + 2 * 3 / 4', '1 ,+ 2 * 3 / 4', '1 + 2 * 93 / 4', ' + 2 * 3 / 4', '1. + 2 * 3 / 4', '1 + 2  * 3 / 4', '1 + 2 * 3 - 4', '1 + 2 * 3 // 4', '16 + 2 * 3 / 4', 'q + 2 * 3 / 4', '1 + 2 * 3 / 4,', '1 + 72 * 3 / 4', '

## Guided Mutations

So far, we have only applied one single mutation on a sample string.  However, we can also apply _multiple_ mutations, further changing it.  What happens, for instance, if we apply, say, 20 mutations on our sample string?

In [622]:
seed_input = "1 + 2 * 3 / 4"
MUTATIONS = 20

input = seed_input
for i in range(MUTATIONS):
    input = mutate(input)
    print(repr(input))

'1 +"2 * 3 / 4'
'1 +"2 * 3 :/ 4'
'1 +"2 * 3:/ 4'
'1 +"2 * 3:-/ 4'
'1 +"2 * 2:-/ 4'
'1 +"2 * 2:-/4'
'1 +"2 * 2-/4'
'1 +"2 * \x12-/4'
'1 +"2 * \x12-?/4'
'1 +"2 * \x12-7/4'
'1 +"2 * \x12\r7/4'
'1 +"+2 * \x12\r7/4'
'1 )"+2 * \x12\r7/4'
'1 )"+2 * 5\x12\r7/4'
'1 )"+2 \'* 5\x12\r7/4'
'1 )"+2 \'* %\x12\r7/4'
'1 )"+2 \'*%\x12\r7/4'
'1 )"+2 \'*%\x12\r71/4'
'1 )"+2 \'*%\x12\r\'1/4'
'1 )"+2!\'*%\x12\r\'1/4'


As you see, the original seed input is hardly recognizable anymore.  Mutating the input again and again has the advantage of getting a higher variety in the input, but on the other hand further increases the risk of having an invalid input.

The key to success lies in the idea of _guiding_ these mutations – that is, keeping those that are especially valuable.  For instance, we may want to keep those samples that happen to be _valid_, and keep on mutating these.  To this end, we proceed as follows:

1. We keep a population of seen valid inputs `population`, initialized with the seed input.
2. We pick a random (valid) candidate from this set, which we now mutate up to `MAX_MUTATIONS` times.
3. If the candidate is valid and new, we add it to `population`.
4. In the next iterations, the previous candidates can be chosen again as seeds.

Here's the full code:

In [623]:
def create_candidate(population):
    MIN_MUTATIONS = 2
    MAX_MUTATIONS = 10

    candidate = random.choice(population)
    trials = random.randint(MIN_MUTATIONS, MAX_MUTATIONS)
    for i in range(trials):
        candidate = mutate(candidate)
    return candidate

def create_population(seed, size):
    population = seed
    while len(population) < size:
        candidate = create_candidate(population)
        if is_valid_expr(candidate) and candidate not in valid_inputs:
            population.append(candidate)
    return population

seed_input = "1 + 2 * 3 / 4"
POPULATION_SIZE = 20

create_population(seed = [seed_input], size = POPULATION_SIZE)

['1 + 2 * 3 / 4',
 '14+ 22* 3 / 94',
 '+ 22.3 / 94',
 '1422* 3 / 94',
 '+ 292.3/ 9=4',
 '1+2 *  3 / 4',
 '1+ 2* 3 / 94',
 '1+ 2* 3 / 94',
 'k923/ 9,4',
 '1+ 2  / 94',
 '+2+   94',
 '61+3  / 4',
 '1402* 3 /94',
 'k9%3/92,4',
 ' + 22. / 94',
 ' 22. / 94',
 '923/ 9,6',
 'k93/9,4',
 '3+9,',
 '+9.']

As you can see, we now get an even larger variety of inputs.  If you are using the interactive notebook version of this chapter, you can toy with different settings (such as `MAX_MUTATION`) to see how they influence the outputs.  (Also see the [Exercises](#Exercises) below.)

## Maximizing Diversity

A good set of tests not only consists of valid inputs, but the inputs should also be _as diverse as possible_: If you already have tested `1 + 1`, then testing for `1 + 2` or `1 + 1 + 1` is not going to cover much additional functionality, compared to, say, `15.0 / 7`.  To easily ensure a greater diversity, we can check for two features: _length_ and _input elements_.


### Diversity through Length

A simple way to increase diversity in our setting is ro favor _longer inputs_.  The argument is simple: The longer a randomly generated input is, the higher its chances to cover several program features.  Let us define a function `evolve_population()` that 

1. Sorts the population in ascending order according to its _fitness_ (the length)
2. Replaces the first element in the population by a new, fitter (longer) one

This is what this looks like:

In [624]:
def evolve_population(population, fitness = len, steps = 100):
    evolutions = 0
    while evolutions < steps:
        population.sort(key=fitness)
        
        candidate = create_candidate(population)
        if is_valid_expr(candidate) and fitness(candidate) > fitness(population[0]):
            population[0] = candidate
            evolutions += 1

    return population

Let us try out `evolve_population()` with a sample population in which one element `25` is the shortest and thus would be replaced by a new, longer element:

In [625]:
evolve_population(population = ['1 + 1', '1 + 1 + 1', '25'], steps = 1)

[' +9* 1', '1 + 1', '1 + 1 + 1']

We can now use this to evolve a population over several iterations.  We can see that the strings get longer and longer:

In [626]:
population = create_population(seed = [seed_input], size = POPULATION_SIZE)
evolve_population(population)

['q&6<<642> + 251=73,+21/ 32=6',
 'q&6+62+ 217720/  7925',
 'q&6+62+ 2177,+20/ 725',
 'p%4>6=9+55 +8* 863, 1',
 '9%4>6=9+5 +8* 86>3, -1',
 'q&6<62>+ 2173,+2/ 72=6',
 'q&8+692+ 3157>0/ 792.5',
 'a&+6.92+ 21-5760/ 7925',
 'p4144>6==+57 +8* 83- 1',
 'q&8+692+ 215760/  7925',
 'q&262+ 2173,+20/ 7/256',
 'q&8+6.9+ 21-55760/ 7925',
 'q&262+ 2173,k30/ 17/256',
 'q&8+6.92+ 21-5760/ 7925',
 'p%t>6 ==0+55 +8*+ 83- 3',
 'p%4>6 ==+55 +8*+ 83- 31',
 'q&6590=-36+ 27.220/ 725',
 'q&6<<62>+ 25173,+2/ 72=6',
 'q&22762+ 2173,k30/ 17/2.6',
 'q&6<<642> + 25173,+21/ 72=6']

### Diversity in Input Elements

Not only do we want long inputs, but we also want diversity in input elements.  For instance, we'd like our inputs to cover as many arithmetic operations as possible: The input `1 + 2 * 3` is shorter than the input `1234567890`, yet covers more operations.  We can achieve this by defining a special fitness function that favors inputs with a large diversity in input characters.

In [627]:
def diversity_fitness(s):
    return len(set(s))

Here, `len(set(s))` returns the size of the set of characters in `s` – that is, the number of different characters:

In [628]:
diversity_fitness("1 + 1")

3

We can see that this gives us even more diverse inputs:

In [629]:
evolve_population(population, fitness = diversity_fitness)

['q&96<.864>k885!=57. -2031/36,',
 'q&96<.40>85!=1-77 +237/356',
 'q&9<.40>585!=1>4.*72 +27-316',
 'q&96<64>+85=0&7 +231./s2.v',
 'q&y6<.40>857!=15,74 *27/355',
 'q&96<.647>+8285!=5*7. -+231/56',
 'q&96<.40>857!=15,743 *23/356',
 'u&96<.40>85!=157 +3/3,52',
 'q&96<.64>+88!=53. -+67,10/36',
 'q&92<.64>9+825!=57 -+2*31/56',
 'q&96<.40>857!=15,74 *237/356',
 'q&96<.64>+88!=53. -+27,18/36',
 'q&96<.40>857!=15,7 +237/356',
 'q&96<644>+85=0&7 +231./s2v',
 'q&9<.40>585!=14*7 +27-316',
 'q&96<.64>j85!=0571. -2031+53516,',
 'q&96<.64>k88!=3. -+27,*18/36',
 'q&96<.64>j85!=0571.  -2031/3516,',
 'q&96<.64>j885!=0571. -2031/3516,',
 'q&96<.64>k885!=57. -2031/356,']

However, there's a catch: As soon as our input contains a comment (a substring starting with `#`), any characters are allowed after the `#`.  The same holds if our input contains a quoted string (`'...'` or `"..."`); then arbitrary characters can be added to the string.  We may thus achieve diversity in comments or strings, but not necessarily in functionality.  

This, of course, is one of the central problems of software testing: How can I cover all functionality of the program under test – including all its bugs?  Fortunately, researchers and practitioners have devised a great number of solutions to address this problem; and the best of these will be covered in the subsequent chapters.

## Next Steps


How can we sufficiently cover functionality?  Essentially, we have two options:

1. Try to cover as much _implemented_ functionality as possible.  To this end, we need to access the program implementation, measure which parts would actually be reached with our inputs, and use this _coverage_ to guide our search.  We will explore this in the next chapter, which discusses [coverage](Coverage.ipynb).

2. Try to cover as much _specified_ functionality as possible.  Here, we would need a specification of the input format, distinguishing between individual input elements such as (in our case) numbers, operators, comments, and strings – and attempting to cover as many of these as possible.  We will explore this as it comes to [grammar-based testing](Grammar_Testing.ipynb).

Finally, the concept of a "population" that is systematically "evolved" trhough "mutations" will be explored in depth when discussing [search-based testing](Search_Based_Testing.ipynb).  Enjoy!


## Exercises


### Exercise 1

Apply the above mutation-based fuzzing technique on `bc`, using files, as in our [Introduction to Fuzzing](Basic_Fuzzing.ipynb).

### Exercise 2

To achieve diversity in our set of inputs, one may also try to maximize the _difference_ between the individual inputs in the population.  Rather than sorting the population and replace the least fit individual, we could also compare all individuals against each other and pick the one that is most similar to all the others.  For instance, one could define similarity as the ratio of which characters are common in both strings:

In [630]:
def similar(s1, s2):
    return len(set(s1) & set(s2)) / len(set(s1) | set(s2))

assert similar("A", "A") == 1.0
assert similar("A", "B") == 0

Here, `a & b` and `a | b` denote the intersection and union of two sets, respectively.

The more the set of characters in two strings is the same, the higer the value of `similar()`:

In [631]:
similar("Apple", "Microsoft")

0.0

In [632]:
similar("Python", "Boa constrictor")

0.23076923076923078

In [633]:
similar("Python 2", "Python 3")

0.7777777777777778

When it comes to remove one element from the population, use the above `similar()` implementation to determine the one element that is most similar to all the others.  Report a typical output sample.   How expensive is your approach?

### Exercise 3

Python brings an [impressive library for computing the difference between strings](https://docs.python.org/3/library/difflib.html), which we can use to measure similarity between two strings including the ordering of characters:

In [634]:
from difflib import SequenceMatcher

def similar(s1, s2):
    return SequenceMatcher(None, s1, s2).ratio()

Repeat [Exercise 2](#Exercise-2) with the above.