# Guided Mutations

So far, we have essentially created inputs _randomly_, in the hope that these may eventually trigger a bug.  If we can feed in knowledge about the program behavior, we can be much more efficient in creating useful inputs.  The key to success lies in the idea of _guiding_ these mutations – that is, _keeping those that are especially valuable._  "Valuable" might mean that they are syntactically valid, but also that they help cover more implemented behavior.

**Prerequisites**

* This chapter extends our previous implementation of mutation-based fuzzing from the ["Mutation-Based Fuzzing"](Mutation_Based_Fuzzing.ipynb) chapter.
* You should understand how to measure code coverage; for instance from the ["Coverage"](Coverage.ipynb) chapter.

## Multiple Mutations

So far, we have only applied one single mutation on a sample string.  However, we can also apply _multiple_ mutations, further changing it.  What happens, for instance, if we apply, say, 20 mutations on our sample string?

In [14]:
import gstbook

In [18]:
%%capture
from Mutation_Fuzzing import mutate, is_valid_expr, valid_inputs

In [19]:
seed_input = "1 + 2 * 3 / 4"
MUTATIONS = 20

mutated = seed_input  # avoid shadowing the built-in input()
for i in range(MUTATIONS):
    mutated = mutate(mutated)
    print(repr(mutated))

'\x11 + 2 * 3 / 4'
'\x11 + 2 " 3 / 4'
'\x11 + 2 " 3 . 4'
'\x11 + 2 " 3 . '
'\x11 + 2 " 3? . '
'\x11 + 2 " 3 . '
'\x11 + 2 " 3`. '
'\x11 \x0b 2 " 3`. '
'\x11 \x0b 2 #" 3`. '
'\x11 \x0b 2 "" 3`. '
'\x11 \x0b 2 "" 3`.'
'\x11 \x0b 2 "" 3a.'
'\x11 \x0b 2 " 3a.'
'\x11# \x0b 2 " 3a.'
'\x11# \x0b 6 " 3a.'
'\x11# \x0b 68 " 3a.'
'\x11# \x0b \'68 " 3a.'
'\x11 \x0b \'68 " 3a.'
'\x11 \x0b \'68 " 3.'
'\x11 \x0b \'68 "! 3.'


As you can see, the original seed input is hardly recognizable anymore.  Mutating the input again and again yields a higher variety of inputs, but each additional mutation also increases the risk of ending up with an invalid input.
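To get a feeling for how quickly validity degrades, the following sketch measures the fraction of syntactically valid strings after a given number of mutations.  It does not use the chapter's `mutate()` and `is_valid_expr()`; `toy_mutate()` and `parses_as_expression()` are simplified stand-ins assumed here so that the example runs on its own, and the numbers will differ from what the real functions produce.

```python
import ast
import random
import string

# Characters the stand-in mutation may insert (an assumption of this sketch)
CHARS = string.digits + string.ascii_letters + string.punctuation + " "

def toy_mutate(s):
    """Stand-in for mutate(): delete, insert, or replace one random character."""
    if not s:
        return random.choice(CHARS)
    pos = random.randrange(len(s))
    op = random.choice(["delete", "insert", "replace"])
    if op == "delete":
        return s[:pos] + s[pos + 1:]
    if op == "insert":
        return s[:pos] + random.choice(CHARS) + s[pos:]
    return s[:pos] + random.choice(CHARS) + s[pos + 1:]

def parses_as_expression(s):
    """Stand-in for is_valid_expr(): checks syntax only, without evaluating."""
    try:
        ast.parse(s, mode="eval")
        return True
    except (SyntaxError, ValueError):
        return False

def valid_fraction(seed, mutations, trials=200):
    """Fraction of trials in which `seed` still parses after `mutations` mutations."""
    valid = 0
    for _ in range(trials):
        s = seed
        for _ in range(mutations):
            s = toy_mutate(s)
        if parses_as_expression(s):
            valid += 1
    return valid / trials

for n in (0, 1, 5, 20):
    print(n, valid_fraction("1 + 2 * 3 / 4", n))
```

With zero mutations, the fraction is 1.0 by construction; the other values vary from run to run.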

This is where guidance comes in: rather than mutating blindly, we keep those mutants that are especially valuable and continue from them.  For instance, we may want to keep those samples that happen to be _valid_, and keep on mutating these.  To this end, we proceed as follows:

1. We keep a population of seen valid inputs `population`, initialized with the seed input.
2. We pick a random (valid) candidate from this set, which we now mutate between `MIN_MUTATIONS` and `MAX_MUTATIONS` times.
3. If the candidate is valid and new, we add it to `population`.
4. In the next iterations, the previous candidates can be chosen again as seeds.

Here's the full code:

In [20]:
import random

def create_candidate(population):
    MIN_MUTATIONS = 2
    MAX_MUTATIONS = 10

    candidate = random.choice(population)
    trials = random.randint(MIN_MUTATIONS, MAX_MUTATIONS)
    for i in range(trials):
        candidate = mutate(candidate)
    return candidate

def create_population(seed, size):
    population = list(seed)  # copy, so the caller's seed list is not modified
    while len(population) < size:
        candidate = create_candidate(population)
        if is_valid_expr(candidate) and candidate not in valid_inputs:
            population.append(candidate)
    return population

seed_input = "1 + 2 * 3 / 4"
POPULATION_SIZE = 20

create_population(seed = [seed_input], size = POPULATION_SIZE)

['1 + 2 * 3 / 4',
 '1 + 2 * 30< 4',
 '1 +-  3/ 41',
 '1 + 2 * 30',
 '1 + 2* 3',
 '9874+ 2* 8',
 '9878+ 2* 18',
 ' + 24* 3 / 4',
 '9+2* 3',
 '1',
 'g',
 '',
 '997<+2* 8',
 '974+ 5*48',
 '',
 '',
 '3',
 'gg',
 '39',
 '']

As you can see, we now get an even larger variety of inputs.  If you are using the interactive notebook version of this chapter, you can toy with different settings (such as `MAX_MUTATIONS`) to see how they influence the outputs.  (Also see the [Exercises](#Exercises) below.)
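To make such experiments easier, one option is to turn the mutation bounds into keyword parameters.  The following is a minimal sketch, not part of the chapter's code: the name `create_candidate_with()` is hypothetical, and the mutation function is passed in explicitly so that the block runs without the notebook's imports.

```python
import random

def create_candidate_with(population, mutate, min_mutations=2, max_mutations=10):
    """Like create_candidate(), but with the mutation bounds as parameters.

    `mutate` is passed in explicitly; in the notebook, this would be the
    mutate() function imported above.
    """
    candidate = random.choice(population)
    trials = random.randint(min_mutations, max_mutations)
    for _ in range(trials):
        candidate = mutate(candidate)
    return candidate

# Illustration with a trivial stand-in mutation that appends one character,
# so the number of applied mutations is visible in the length of the result
def append_x(s):
    return s + "x"

print(create_candidate_with(["1 + 2"], append_x, min_mutations=1, max_mutations=3))
```

Passing the actual `mutate()` function instead of `append_x()` should give candidates comparable to those of `create_candidate()`, but with adjustable bounds.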

## Maximizing Diversity

A good set of tests not only consists of valid inputs, but the inputs should also be _as diverse as possible_: If you have already tested `1 + 1`, then testing `1 + 2` or `1 + 1 + 1` is not going to cover much additional functionality, compared to, say, `15.0 / 7`.  To easily ensure a greater diversity, we can check for two features: _length_ and _input elements_.


### Diversity through Length

A simple way to increase diversity in our setting is to favor _longer inputs_.  The argument is simple: The longer a randomly generated input is, the higher its chances to cover several program features.  Let us define a function `evolve_population()` that:

1. Sorts the population in ascending order according to its _fitness_ (the length)
2. Replaces the first (that is, least fit) element in the population by a new, fitter (longer) one

This is what this looks like:

In [21]:
def evolve_population(population, fitness = len, steps = 100):
    evolutions = 0
    while evolutions < steps:
        population.sort(key=fitness)
        
        candidate = create_candidate(population)
        if is_valid_expr(candidate) and fitness(candidate) > fitness(population[0]):
            population[0] = candidate
            evolutions += 1

    return population

Let us try out `evolve_population()` with a sample population in which one element `25` is the shortest and thus would be replaced by a new, longer element:

In [22]:
evolve_population(population = ['1 + 1', '1 + 1 + 1', '25'], steps = 1)

['375', '1 + 1', '1 + 1 + 1']

We can now use this to evolve a population over several iterations.  We can see that the strings get longer and longer:

In [23]:
population = create_population(seed = [seed_input], size = POPULATION_SIZE)
evolve_population(population)

['10%+ 2  >  9*2938=1',
 '1+0 * 2  > 9*938=1',
 '10. *6 *  -93\n=9<3',
 '1/90* 2 *  88-9=43',
 '11/9* 2 *  8-90943',
 '92*6+35<=6&586-+147',
 '92*6+35<=6&4586-+47',
 '11/9* 2 * 5 <-10943',
 '1/y0* 2 *  88-95+43',
 '92 *+6 +8-6&1958+47',
 '92 *+6 +88-61958-47',
 '92 *6+38-6&19586-+47',
 '10. *6 *  -3/\n2=9<23',
 '10. *6 * 8 -963J=9<3',
 '11/9* 2 *  8-90+-943',
 '92*6+358-6&19456-+475',
 '92 *+6 +38-6&19586+47',
 '92 *6+35<=6&194586-+47',
 '92 *6+358-6&194586-+47',
 '11/+98* 2 *  (8-909&43)']

### Diversity in Input Elements

Not only do we want long inputs, but we also want diversity in input elements.  For instance, we'd like our inputs to cover as many arithmetic operations as possible: The input `1 + 2 * 3` is shorter than the input `1234567890`, yet covers more operations.  We can achieve this by defining a special fitness function that favors inputs with a large diversity in input characters.

In [24]:
def diversity_fitness(s):
    return len(set(s))

Here, `len(set(s))` returns the size of the set of characters in `s` – that is, the number of different characters:

In [25]:
diversity_fitness("1 + 1")

3

We can see that this gives us even more diverse inputs:

In [26]:
evolve_population(population, fitness = diversity_fitness)

['y>p*32+w318=21%75|5 //4-+3.78',
 '9%26 *656<=6&1943.836-+74/7',
 'y>0*36+3158=2179<58/4-+.7',
 '9%2&+6 *65<=6&1943.863-k74/7',
 'y>0*36+33,8=217>8<5 /54-+.7',
 'y>p*32+w318=21%75|58 /4-+3.7',
 'y>0*36+7121%9< 58 /4-+3.',
 'y>p*36+78=21%79< 8 /4-+3.7 ',
 'y>0*36+31,8=2178<5 /54-+.7',
 'y>0*36+78=21%79< 8 /4-+3.7',
 'y>0*3*6+731=21%7|58 /4-+3.7',
 'y>0 *36+31,8=2179<58 /5<-+.7',
 '96%29 *656<=6.&194+3>836-+7/74',
 'y>0*36+7318=21%75|58 /4-+3.7',
 'y0*36+318=21%79<58 /4-+3.57',
 'y>0*36+318=2179<58 /4-+.7',
 'y>0*36+31,8=2179<8 /54-.7',
 'y>0*36+31,8=2179<58 /54-+.7',
 'y>0*36+718=21%79< 58 /4-+3.7',
 'y>0*36+318=21%79<58 /4-+3.7']

However, there's a catch: As soon as our input contains a comment (a substring starting with `#`), any characters are allowed after the `#`.  The same holds if our input contains a quoted string (`'...'` or `"..."`); then arbitrary characters can be added to the string.  We may thus achieve diversity in comments or strings, but not necessarily in functionality.  

This, of course, is one of the central problems of software testing: How can I cover all functionality of the program under test – including all its bugs?  Fortunately, researchers and practitioners have devised a great number of solutions to address this problem.
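The catch described above can be made concrete with a small experiment.  The sketch below repeats `diversity_fitness()` and contrasts it with `operator_fitness()`, a hypothetical alternative (not part of this chapter's code) that counts the distinct kinds of operators in the parsed expression using Python's `ast` module.  It assumes that inputs are Python expressions and assigns a fitness of 0 to inputs that do not parse.

```python
import ast

def diversity_fitness(s):
    # Same definition as above: number of distinct characters
    return len(set(s))

def operator_fitness(s):
    """Hypothetical alternative: number of distinct operator kinds in s."""
    try:
        tree = ast.parse(s, mode="eval")
    except (SyntaxError, ValueError):
        return 0  # inputs that do not parse get the lowest fitness
    operator_kinds = (ast.operator, ast.unaryop, ast.cmpop, ast.boolop)
    return len({type(node).__name__
                for node in ast.walk(tree)
                if isinstance(node, operator_kinds)})

plain = "1 + 1"
commented = "1 + 1 # *-/%"

# The comment inflates character diversity...
print(diversity_fitness(plain), diversity_fitness(commented))   # 3 8
# ...but adds no operators to the parsed expression
print(operator_fitness(plain), operator_fitness(commented))     # 1 1
print(operator_fitness("1 + 2 * 3"), operator_fitness("1234567890"))  # 2 0
```

Such a fitness function could be passed to `evolve_population()` via its `fitness` parameter.  Note that it only distinguishes operator kinds; it says nothing about operands or about which parts of the program under test are actually exercised.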

## Next Steps


How can we sufficiently cover functionality?  Essentially, we have two options:

1. Try to cover as much _implemented_ functionality as possible.  To this end, we need to access the program implementation, measure which parts would actually be reached with our inputs, and use this _coverage_ to guide our search.  We will explore this in the next chapter, which discusses [coverage](Coverage.ipynb).

2. Try to cover as much _specified_ functionality as possible.  Here, we would need a specification of the input format, distinguishing between individual input elements such as (in our case) numbers, operators, comments, and strings – and attempting to cover as many of these as possible.  We will explore this as it comes to [grammar-based testing](Grammar_Testing.ipynb).
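To give a first impression of the first option, here is a minimal sketch of keeping only those inputs that reach code not covered before.  It is not the implementation from the coverage chapter: `covered_lines()` is a hypothetical helper built on Python's `sys.settrace()`, and `classify()` is a toy program under test invented for this illustration.

```python
import sys

def covered_lines(function, argument):
    """Return the set of (function name, line number) pairs executed
    while calling function(argument)."""
    lines = set()

    def tracer(frame, event, arg):
        if event == "line":
            lines.add((frame.f_code.co_name, frame.f_lineno))
        return tracer

    old_tracer = sys.gettrace()
    sys.settrace(tracer)
    try:
        function(argument)
    except Exception:
        pass  # a failing input still contributes the lines it reached
    finally:
        sys.settrace(old_tracer)
    return lines

def classify(s):
    # Toy program under test (an assumption of this sketch)
    if s.startswith("-"):
        return "negative"
    if "." in s:
        return "float"
    return "integer"

seen = set()   # all lines covered so far
kept = []      # inputs that contributed new coverage
for candidate in ["12", "34", "-1", "1.5", "-7"]:
    new_lines = covered_lines(classify, candidate) - seen
    if new_lines:
        kept.append(candidate)
        seen |= new_lines

print(kept)  # ['12', '-1', '1.5']
```

Here, `34` and `-7` are dropped because they exercise the same lines as `12` and `-1`, respectively.  In a guided fuzzer, the same test would decide whether a mutated candidate joins the population.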

Finally, the concept of a "population" that is systematically "evolved" through "mutations" will be explored in depth when discussing [search-based testing](Search_Based_Testing.ipynb).  Enjoy!


## Exercises


### Exercise 1

Apply the above mutation-based fuzzing technique on `bc`, using files, as in our [Introduction to Fuzzing](Basic_Fuzzing.ipynb).

### Exercise 2

To achieve diversity in our set of inputs, one may also try to maximize the _difference_ between the individual inputs in the population.  Rather than sorting the population and replacing the least fit individual, we could also compare all individuals against each other and pick the one that is most similar to all the others.  For instance, one could define similarity as the number of characters common to both strings, divided by the number of characters occurring in either string:

In [None]:
def similar(s1, s2):
    return len(set(s1) & set(s2)) / len(set(s1) | set(s2))

assert similar("A", "A") == 1.0
assert similar("A", "B") == 0

Here, `a & b` and `a | b` denote the intersection and union of two sets, respectively.

The more characters two strings have in common, the higher the value of `similar()`:

In [None]:
similar("Apple", "Microsoft")

In [None]:
similar("Python", "Boa constrictor")

In [None]:
similar("Python 2", "Python 3")

When it comes to removing one element from the population, use the above `similar()` implementation to determine the one element that is most similar to all the others.  Report a typical output sample.  How expensive is your approach?

### Exercise 3

Python comes with an [impressive library for computing the difference between strings](https://docs.python.org/3/library/difflib.html), which we can use to measure the similarity between two strings, taking the ordering of characters into account:

In [None]:
from difflib import SequenceMatcher

def similar(s1, s2):
    return SequenceMatcher(None, s1, s2).ratio()

Repeat [Exercise 2](#Exercise-2) with the above.
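
To see what taking the ordering into account means, the following sketch compares both similarity measures on two strings that consist of the same characters in a different order.  The names `char_set_similarity()` and `sequence_similarity()` are introduced here only to keep the two definitions of `similar()` apart.

```python
from difflib import SequenceMatcher

def char_set_similarity(s1, s2):
    # The set-based similar() from Exercise 2, under a different name
    return len(set(s1) & set(s2)) / len(set(s1) | set(s2))

def sequence_similarity(s1, s2):
    # The SequenceMatcher-based similar() from this exercise
    return SequenceMatcher(None, s1, s2).ratio()

print(char_set_similarity("abc", "cba"))   # 1.0: same set of characters
print(sequence_similarity("abc", "cba"))   # below 1.0: the order differs
```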