# Module 2 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name, unlike your student ID, which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

In [1]:
from pprint import pprint
from typing import List, Tuple, Dict, Callable
import random

## Local Search - Genetic Algorithm

There are some key ideas in the Genetic Algorithm.

First, there is a problem of some kind that either *is* an optimization problem or the solution can be expressed in terms of an optimization problem.
For example, if we wanted to minimize the function

$$f(x) = \sum_{i=1}^{n} (x_i - 0.5)^2$$

where $n = 10$.
This *is* an optimization problem. Normally, optimization problems are much, much harder.

![Eggholder](http://www.sfu.ca/~ssurjano/egg.png)

The function we wish to optimize is often called the **objective function**.
The objective function is closely related to the **fitness** function in the GA.
If we have a **maximization** problem, then we can use the objective function directly as a fitness function.
If we have a **minimization** problem, then we need to convert the objective function into a suitable fitness function, since fitness functions must always mean "more is better".
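
As a minimal sketch of one such conversion, assuming the example objective above and a reciprocal transform (one common choice among several), the names `objective` and `fitness` below are illustrative and not part of the assignment:

```python
from typing import List


def objective(xs: List[float]) -> float:
    # f(x) = sum of (x_i - 0.5)^2; smaller is better, with a minimum of 0.
    return sum((x - 0.5) ** 2 for x in xs)


def fitness(xs: List[float]) -> float:
    # One possible conversion (an assumption, not the only choice):
    # 1 / (1 + f(x)) maps a loss in [0, inf) to a fitness in (0, 1],
    # so "more is better" and the optimum scores exactly 1.
    return 1.0 / (1.0 + objective(xs))


best = [0.5] * 10   # the minimizer of the objective
worse = [0.0] * 10  # every coordinate is 0.5 away from the optimum
print(fitness(best), fitness(worse))
```

Any strictly decreasing transform of the objective would preserve the ranking of candidates; the reciprocal form simply keeps the fitness positive, which matters for roulette wheel selection.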

Second, we need to *encode* candidate solutions using an "alphabet" analogous to G, A, T, C in DNA.
This encoding can be quite abstract.
You saw this in the Self Check.
There, a floating point number was encoded as bits, just as in a computer, and a sophisticated decoding scheme was then required.

Sometimes, the encoding need not be very complicated at all.
For example, in the real-valued GA, discussed in the Lectures, we could represent 2.73 as....2.73.
This is similarly true for a string matching problem.
We *could* encode "a" as "a", 97, or '01100001'.
And then "hello" would be:

```
["h", "e", "l", "l", "o"]
```

or

```
[104, 101, 108, 108, 111]
```

or

```
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1
```

In Genetics terminology, this is the **chromosome** of the individual. And if this individual had the **phenotype** "h" for the first character then they would have the **genotype** for "h" (either as "h", 104, or 01101000).

To keep it straight, think **geno**type is **genes** and **pheno**type is **phenomenon**, the actual thing that the genes express.
So while we might encode a number as 10110110 (genotype), the number itself, 182, is what goes into the fitness function.
The environment operates on zebras, not the genes for stripes.
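
The genotype-to-phenotype step for the bit string above can be sketched as follows. The helper name `decode` is hypothetical, and an unsigned base-2 reading is assumed:

```python
def decode(genotype: str) -> int:
    # Interpret a string of "0"/"1" genes as an unsigned base-2 integer.
    return int(genotype, 2)


genotype = "10110110"
phenotype = decode(genotype)  # the value a fitness function would actually see
print(genotype, "->", phenotype)
```

Crossover and mutation would operate on `genotype`, while the fitness function would only ever be handed `phenotype`.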

## String Matching

You are going to write a Genetic Algorithm that will solve the problem of matching a target string (at least at the start).
Now, this is kind of silly because in order for this to work, you need to know the target string and if you know the target string, why are you trying to do it?
Well, the problem is *pedagogical*.
It's a fun way of visualizing the GA at work, because as the GA finds better and better candidates, they make more and more sense.

Now, string matching is not *directly* an optimization problem so this falls under the general category of "if we convert the problem into an optimization problem we can solve it with an optimization algorithm" approach to problem solving.
This happens all the time.
We have a problem.
We can't solve it.
We convert it to a problem we *can* solve.
In this case, we're using the GA to solve the optimization part.

And all we need is some sort of measure of the difference between two strings.
We can use that measure as a **loss function**.
A loss function gives us a score that tells us how different two strings are.
The loss function becomes our objective function and we use the GA to minimize it by converting the objective function to a fitness function.
So that's the first step, come up with the loss/objective function.
The only stipulation is that it must calculate the score based on element to element (character to character) comparisons with no global transformations of the candidate or target strings.

And since this is a GA, we need a **genotype**.
The genotype for this problem is a list of "characters" (individual letters aren't special in Python like they are in some other languages):

```
["h", "e", "l", "l", "o"]
```

and the **phenotype** is the resulting string:

```
"hello"
```

In addition to the generic code and problem specific loss function, you'll need to pick parameters for the run.
These parameters include:

1. population size
2. number of generations
3. probability of crossover
4. probability of mutation

You will also need to pick a selection algorithm, either roulette wheel or tournament selection.
In the latter case, you will need a tournament size.
This is all part of the problem.
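
This notebook uses tournament selection. For comparison, a roulette wheel draw might look like the sketch below; `roulette_select` is a hypothetical helper, and it assumes non-negative fitness values:

```python
import random
from typing import List


def roulette_select(population: List[str], fitnesses: List[float], rng: random.Random) -> str:
    # Each member is drawn with probability proportional to its fitness.
    # If every fitness is zero there is no wheel to spin, so fall back to a uniform draw.
    if sum(fitnesses) == 0:
        return rng.choice(population)
    return rng.choices(population, weights=fitnesses, k=1)[0]


rng = random.Random(0)  # seeded only so the illustration is repeatable
picks = [roulette_select(["aaa", "bbb", "ccc"], [0.0, 3.0, 1.0], rng) for _ in range(1000)]
print({member: picks.count(member) for member in ["aaa", "bbb", "ccc"]})
```

A member with zero fitness is never drawn here, whereas a tournament can still select it when every contestant scores zero.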

Every **ten** (10) generations, you should print out the fitness, genotype, and phenotype of the best individual in the population for the specific generation.
The function should return the best individual *of the entire run*, using the same format.
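
Tracking the best individual of the entire run, as opposed to the best of the final generation, could be sketched like this. `update_best` and the toy fitness `count_a` are hypothetical names used only for illustration:

```python
from typing import Callable, List, Optional, Tuple


def update_best(best: Optional[Tuple[str, int]], population: List[str],
                fitness: Callable[[str], int]) -> Tuple[str, int]:
    # Compare this generation's champion against the best seen so far.
    champion = max(population, key=fitness)
    candidate = (champion, fitness(champion))
    if best is None or candidate[1] > best[1]:
        return candidate
    return best


def count_a(member: str) -> int:
    # Toy fitness (an assumption for illustration): the number of "a" characters.
    return member.count("a")


best = None
for generation in [["abc", "bbb"], ["aab", "abb"], ["bbb", "abc"]]:
    best = update_best(best, generation, count_a)
print(best)
```

Because crossover and mutation can destroy a good individual, the best of the run may come from an earlier generation than the last one, as it does in this toy sequence.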

In [2]:
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

<a id="to_binary"></a>
## to_binary

Converts a string of alphabet to a string of binary.

Variables

* **string** str: the string to be converted to binary

**returns** str: the input string converted to a binary string - will have leading 0 values for every character

In [3]:
def to_binary(string: str):
    # Format each character's code point as a zero-padded 8-bit binary string.
    return "".join(format(ord(character), '08b') for character in string)

In [4]:
test1 = "love"
test2 = "hate"
test3 = "idk"
assert to_binary(test1) == '01101100011011110111011001100101'
assert to_binary(test2) == '01101000011000010111010001100101'
assert to_binary(test3) == '011010010110010001101011'

<a id="mutate_string"></a>
## mutate_string

Mutates part of a specified member from a population.

Variables

* **to_mutate** str: the string to mutate
* **random_phenotype** str: the random phenotype to mutate to
* **threshold** float: the float threshold for if we should actually mutate
* **random_value** float: the random value to check against for if there should be a mutation
* **random_mutation_index** int: if we mutate, we mutate this index

**returns** str: the original or mutated version of the input string

In [5]:
def mutate_string(to_mutate: str, random_phenotype: str, threshold: float, random_value: float, random_mutation_index: int):
    if random_mutation_index < len(to_mutate) and random_value < threshold:
        to_mutate = to_mutate[:random_mutation_index] + random_phenotype + to_mutate[random_mutation_index + 1:]
    return to_mutate

In [6]:
t1 = "something"
t2 = "whatever"
t3 = "elephant"
assert mutate_string(t1, "x", 0.5, 0.255, 4) == 'somexhing'
assert mutate_string(t1, "x", 0.5, 0.255, 9) == 'something'
assert mutate_string(t2, "x", 0.5, 0.255, 5) == 'whatexer'
assert mutate_string(t3, "x", 0.5, 0.255, 1) == 'exephant'

<a id="generate_children"></a>
## generate_children

Generates two children from two parents in a population.

Variables

* **parent1** str: the string value of parent1
* **parent2** str: the string value of parent2
* **threshold** float: the float threshold for if we should actually have crossover event
* **random_value** float: the random value to check against for if there should be a crossover event
* **random_splice_index** int: if we have a crossover event, we use this index

**returns** List[str]: children (or parents) based on the results of crossover events.

In [7]:
def generate_children(parent1: str, parent2: str, threshold: float, random_value: float, random_splice_index: int):
    if random_value >= threshold:
        return [parent1, parent2]
    else:
        son = parent1[:random_splice_index] + parent2[random_splice_index:]
        daughter = parent2[:random_splice_index] + parent1[random_splice_index:]
    return [son, daughter]

In [8]:
t1p1 = "frogs"
t1p2 = "mouse"
t2p1 = "dump"
t2p2 = "ppp "
assert generate_children(t1p1, t1p2, 0.5, 0.1, 2) == ['fruse', 'moogs']
assert generate_children(t2p1, t2p2, 0.5, 0.1, 3) == ['dum ', 'pppp']
assert generate_children(t2p1, t2p2, 0.01, 0.1, 2) == ['dump', 'ppp ']

<a id="calculate_fitness"></a>
## calculate_fitness

This takes the target string and candidate string to calculate how close the candidate is to the target. 'Fitness' is based on an index to index comparison with exact index matches, so 0 is compared to 0, 1 to 1, etc.

Variables

* **target** str: the target value
* **candidate** str: the candidate value we will be comparing to the target
* **phenotypes** List[str]: unused here (defaults to None); it is present only so that this function shares a signature with the ROT13 fitness function and can be passed dynamically later

**returns** int: integer fitness value of the provided member compared to the target

In [9]:
def calculate_fitness(target: str, candidate: str, phenotypes = None):
    score = 0
    for i in range(0, len(target)):
        if target[i] == candidate[i]:
            score += 1
    return score

In [10]:
assert calculate_fitness('something', 'something') == 9
assert calculate_fitness('somethinx', 'something') == 8
assert calculate_fitness('something', 'gsomethin') == 0

<a id="calculate_fitness_reverse"></a>
## calculate_fitness_reverse

This takes the target string and candidate string to calculate how close the candidate is to the target. 'Fitness' is based on an index to index comparison with reverse index matches, so 0 is compared to len-1, 1 to len-2, etc.

Variables

* **target** str: the target value
* **candidate** str: the candidate value we will be comparing to the target
* **phenotypes** List[str]: unused here (defaults to None); it is present only so that this function shares a signature with the ROT13 fitness function and can be passed dynamically later

**returns** int: integer fitness value of the provided member compared to the target

In [11]:
def calculate_fitness_reverse(target: str, candidate: str, phenotypes = None):
    score = 0
    for i in range(0, len(target)):
        if target[len(target) - 1 - i] == candidate[i]:
            score += 1
    return score

In [12]:
assert calculate_fitness_reverse('gnihtemos', 'something') == 9
assert calculate_fitness_reverse('gnihtemos', 'somethinn') == 8
assert calculate_fitness_reverse('gnihtemos', 'gsomethin') == 0

<a id="calculate_fitness_rot13"></a>
## calculate_fitness_rot13

This takes the target string and candidate string to calculate how close the candidate is to the decoded target. 'Fitness' is based on an index to index comparison: the target character at each index is shifted 13 places through the phenotype alphabet (wrapping back to the beginning if it passes the end) and compared to the candidate character at the same index.

Variables

* **target** str: the target value
* **candidate** str: the candidate value we will be comparing to the target
* **phenotypes** List[str]: this is needed to do the index shifts that decode the target

**returns** int: integer fitness value of the provided member compared to the target

In [13]:
def calculate_fitness_rot13(target: str, candidate: str, phenotypes: List[str]):
    score = 0
    for i in range(0, len(target)):
        target_index = phenotypes.index(target[i])
        if phenotypes[(target_index + 13) % len(phenotypes)] == candidate[i]:
            score += 1
    return score

In [14]:
ALPHABET3_TEST = "abcdefghijklmnopqrstuvwxyz"
assert calculate_fitness_rot13('guvfvffbzhpusha', 'thisissomuchfun', ALPHABET3_TEST) == 15
assert calculate_fitness_rot13('guvfvffbzhpusha', 'thisissooooofun', ALPHABET3_TEST) == 11
assert calculate_fitness_rot13('guvfvffbzhpusha', 'nthisissomuchfu', ALPHABET3_TEST) == 1

<a id="pick_parents"></a>
## pick_parents

This picks the two 'best' parents from the population based on their fitness scores. We are using a tournament style selection. Tournament size is always 7.

Variables

* **population** List[str]: the population
* **target** str: the target value - needed for fitness
* **fitness_calc** callable func: this is the fitness function
* **phenotypes** List[str]: this is needed to do the index shifts that decode the target - needed for fitness

**returns** List[str]: string values of the parents that were found to be the 'best'

In [15]:
def pick_parents(population: List[str], target: str, fitness_calc: Callable, phenotypes = None):
    # Draw up to 7 contestants by position. Sampling positions rather than
    # unique values avoids an endless loop once the population holds fewer
    # than 7 distinct members.
    tournament_size = min(7, len(population))
    contestants = random.sample(population, tournament_size)
    # Score each contestant once, then keep the two fittest (best first).
    scored = [(member, fitness_calc(target, member, phenotypes)) for member in contestants]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [member for member, _ in scored[:2]]

In [16]:
population1 = ['qwcf', 'qwer', 'zxcv', 'sdfg', 'werv', 'erty', 'dfgh', 'cvcn', 'sdfg', 'axcv']
population2 = ['qwcf', 'qwer', 'zxcv', 'sdfg', 'werv', 'erty']
population3 = ['qwcf', 'qwer', 'erty', 'dfgh', 'cvcn', 'sdfg', 'axcv']
target = 'zxcv'
non_zero = ['zxcv', 'qwcf', 'werv', 'axcv', 'cvcn']
# should always pick non-zero values
assert all(p in non_zero for p in pick_parents(population1, target, calculate_fitness))
assert all(p in non_zero for p in pick_parents(population2, target, calculate_fitness))
assert all(p in non_zero for p in pick_parents(population3, target, calculate_fitness))

<a id="generate_random_population"></a>
## generate_random_population

This generates a list of members for a population with a supplied length and valid phenotypes.

Variables

* **num_of_members** int: the number of members to generate
* **length** int: the length of each member
* **phenotypes** List[str]: the valid phenotypes each character is drawn from

**returns** List[str]: the randomly generated population

In [17]:
def generate_random_population(num_of_members: int, length: int, phenotypes: List[str]):
    population = []
    while len(population) < num_of_members:
        member = ''
        for i in range(length):
            member += phenotypes[random.randint(0,len(phenotypes) - 1)]
        population.append(member)
    return population

In [18]:
assert generate_random_population(1, 3, ['a']) == ['aaa']
assert generate_random_population(5, 3, ['b']) == ['bbb', 'bbb', 'bbb', 'bbb', 'bbb']
assert generate_random_population(4, 7, ['g']) == ['ggggggg', 'ggggggg', 'ggggggg', 'ggggggg']

<a id="get_generational_information"></a>
## get_generational_information

Builds a report on the best member in the current population: the generation number information, then the best member's genotype, phenotype, and fitness. The report is returned as a string for the caller to print.

Variables

* **generation_string** str: string information about the generation number that we are on
* **population** List[str]: the population
* **target** str: the target value - needed for fitness
* **fitness_calc** callable func: this is the fitness function
* **phenotypes** List[str]: this is needed to do the index shifts that decode the target - needed for fitness

**returns** str: the string version of the information described above

In [19]:
def get_generational_information(generation_string: str, population: List[str], target: str, fitness_calc: callable, phenotypes = None):
    best = ('', -1)
    for member in population:
        fitness = fitness_calc(target, member, phenotypes)
        if fitness > best[1]:
            best = (member, fitness)
    genotype = to_binary(best[0])
    phenotype = best[0]
    result = ''
    result += '\r\n' + generation_string
    result += '\r\nGenotype: ' + str(genotype)
    result += '\r\nPhenotype: ' + str(phenotype)
    result += '\r\nFitness: ' + str(best[1])
    return result

In [20]:
print(get_generational_information('Current Generation: 10', ['aaa'], 'aab', calculate_fitness))


Current Generation: 10
Genotype: 011000010110000101100001
Phenotype: aaa
Fitness: 2


<a id="genetic_algorithm"></a>
## genetic_algorithm

Steps follow the pseudocode from the lectures:

Start with some initial population, valid phenotypes, mutation and crossover thresholds, and generation limit.

Until we hit our generation limit, we do this:
* go through half of the population and pick parents that are the best fitness using a fair tournament style selection
* create children from those parents
* all children make up the next population
* repeat until generational limit is hit
* every 10 generations and the final generation we print out information about the best member (by fitness) of that population

Variables

* **target** str: the target string to base our fitness off of
* **population_size** int: the population size to generate
* **generation_limit** int: the maximum number of generations to create
* **phenotypes** List[str]: the valid phenotypes we can use in our process
* **mutation_threshold** float: the float threshold for if we should mutate children
* **crossover_threshold** float: the float threshold for if we should have a crossover event
* **fitness_calc** callable func: this is the callable fitness function

**returns** str: the results as a string

In [21]:
def genetic_algorithm(target: str, population_size: int, generation_limit: int, phenotypes: List[str], mutation_threshold: float, crossover_threshold: float, fitness_calc: callable):
    population = generate_random_population(population_size, len(target), phenotypes)
    current_gen = 0
    result = ''
    while current_gen < generation_limit:
        next_population = []
        for n in range(0, int(population_size/2)):
            parents = pick_parents(population, target, fitness_calc, phenotypes)
            children = generate_children(parents[0], parents[1], crossover_threshold, random.uniform(0,1), random.randint(0,len(target) - 1))
            for child in children:
                next_population.append(mutate_string(child, phenotypes[random.randint(0,len(phenotypes) - 1)], mutation_threshold, random.uniform(0,1), random.randint(0,len(target) - 1)))
        population = next_population
        current_gen += 1
        if current_gen % 10 == 0:
            result += get_generational_information('\r\nCurrent Generation: ' + str(current_gen), population, target, fitness_calc, phenotypes)
    result += get_generational_information('\r\nFinal Generation', population, target, fitness_calc, phenotypes)
    return result

## Problem 1

The target is the string "this is so much fun".
The challenge, aside from implementing the basic algorithm, is deriving a fitness function based on "b" - "p" (for example).
The fitness function should come up with a fitness score based on element to element comparisons between target v. phenotype.
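
As a hedged sketch of what a score based on "b" - "p" might look like, the following compares character codes position by position. `char_distance_loss` and `distance_fitness` are hypothetical names, and this is a different scoring rule from the exact-match `calculate_fitness` used in this notebook:

```python
def char_distance_loss(target: str, candidate: str) -> int:
    # Sum of absolute code-point differences, one gene against one gene.
    return sum(abs(ord(t) - ord(c)) for t, c in zip(target, candidate))


def distance_fitness(target: str, candidate: str) -> int:
    # Negate the loss so that "more is better"; a perfect match scores 0.
    return -char_distance_loss(target, candidate)


print(char_distance_loss("b", "p"))
print(distance_fitness("abc", "abd"), distance_fitness("abc", "abz"))
```

A negated loss works with tournament selection, which only compares scores. Roulette wheel selection would need a non-negative transform instead.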

In [22]:
target1 = "this is so much fun"

In [23]:
# in testing, this takes anywhere from 50 to 110 generations to reach the target
result1 = genetic_algorithm(target1, 50, 120, ALPHABET, 0.7, 0.7, calculate_fitness)

In [24]:
print(result1)



Current Generation: 10
Genotype: 01110100011010000110100101110011011110000111000001110100011110000111000101111001011100100110110001110101011011000110100001100011011001100111010101101110
Phenotype: thisxptxqyrlulhcfun
Fitness: 9

Current Generation: 20
Genotype: 01110100011010000110100101110011011110000111000001110011011000100111001101101100011110010110110001110001011011000110100001110100011001100111010101101110
Phenotype: thisxpsbslylqlhtfun
Fitness: 10

Current Generation: 30
Genotype: 01110100011010000110100101110011001000000111010101110011001000000111001101101001011110010110110001110101011000110110100001110100011001100111010101101110
Phenotype: this us siyluchtfun
Fitness: 14

Current Generation: 40
Genotype: 01110100011010000110100101110011011110000111001001110011001000000111001101101111001000000110011101110101011000110110100001100101011001100111010101101110
Phenotype: thisxrs so guchefun
Fitness: 15

Current Generation: 50
Genotype: 0111010001101000011010010111001101111000011010

## Problem 2

You should have working code now.
The goal here is to think a bit more about fitness functions.
The target string is now, 'nuf hcum os si siht'.
This is obviously target #1 but reversed.
If we just wanted to match the string, this would be trivial.
Instead, in this problem, we want to "decode" the string so that the best individual displays the target forwards.
In order to do this, you'll need to come up with a fitness function that measures how successful candidates are towards this goal.
The constraint is that you may not perform any global operations on the target or individuals.
Your fitness function must still compare a single gene against a single gene.
Your solution will likely not be Pythonic but use indexing.
That's ok.
<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You may not reverse an entire string (either target or candidate) at any time.
        Everything must be a computation of one gene against one gene (one letter against one letter).
        Failure to follow these directions will result in 0 points for the problem.
    </p>
</div>

The best individual in the population is the one who expresses this string *forwards*.

In [25]:
target2 = "nuf hcum os si siht"

In [26]:
result2 = genetic_algorithm(target2, 50, 120, ALPHABET, 0.7, 0.7, calculate_fitness_reverse)

In [27]:
print(result2)



Current Generation: 10
Genotype: 01101101011001010110100101101000011101100110100101110011011011010111001101110011011101000110101001110101011010110110100000100000011001100111100001110000
Phenotype: meihvismsstjukh fxp
Fitness: 8

Current Generation: 20
Genotype: 01110100011010000110100101110011001000000110100101110011011011010111001101110011011011000110101001110101011001110110100000100000011001100110001001101110
Phenotype: this ismssljugh fbn
Fitness: 13

Current Generation: 30
Genotype: 01110100011010000110100101110011001000000110100101110011011001010111001101101101011010110110101001110101011000110110100000100000011001100111010101101110
Phenotype: this isesmkjuch fun
Fitness: 15

Current Generation: 40
Genotype: 01110100011010000110100101110011011000100110100101110011001000000111001101110110001000000111011101110101011000110110100000100000011001100111010101101110
Phenotype: thisbis sv wuch fun
Fitness: 16

Current Generation: 50
Genotype: 0111010001101000011010010111001100100000011010

## Problem 3

This is a variation on the theme of Problem 2.
The Caesar cipher replaces each letter of a string with the letter a fixed number of characters down the alphabet (rotating from "z" back to "a" as needed).
With a shift of 13, this is also known as ROT13 (for "rotate 13").
Latin did not have spaces (and the space is not contiguous with the letters a-z) so we'll remove them from our alphabet.
Again, the goal is to derive a fitness function that compares a single gene against a single gene, without global transformations.
This fitness function assigns higher scores to individuals that correctly decode the target.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You may not apply ROT13 to an entire string (either target or candidate) at any time.
        Everything must be a computation of one gene against one gene.
        Failure to follow these directions will result in 0 points for the problem.
    </p>
</div>

The best individual will express the target *decoded*.

In [28]:
ALPHABET3 = "abcdefghijklmnopqrstuvwxyz"

In [29]:
target3 = "guvfvffbzhpusha"

In [30]:
result3 = genetic_algorithm(target3, 50, 120, ALPHABET3, 0.7, 0.7, calculate_fitness_rot13)

In [31]:
print(result3)



Current Generation: 10
Genotype: 011100000110100001101110011001100110100101110011011100110110111101110100011011000110000101100001011001100111010101101110
Phenotype: phnfissotlaafun
Fitness: 8

Current Generation: 20
Genotype: 011101000110100001100010011100110110100101110011011100110110111101111001011001010110001101101100011001100111010101101110
Phenotype: thbsissoyeclfun
Fitness: 11

Current Generation: 30
Genotype: 011101000110100001101001011100110110100101110011011100110110111101101101011011000110001101110000011001100111010101101110
Phenotype: thisissomlcpfun
Fitness: 13

Current Generation: 40
Genotype: 011101000110100001101001011100110110100101110011011100110110111101101101011001010110001101101000011001100111010101101110
Phenotype: thisissomechfun
Fitness: 14

Current Generation: 50
Genotype: 011101000110100001101001011100110110100101110011011100110110111101101101011101010110001101101000011001100111010101101110
Phenotype: thisissomuchfun
Fitness: 15

Current Generation: 60
Genoty

## Problem 4

There is no code for this problem.

In Problem 3, we assumed we knew what the shift was in ROT-13.
What if we didn't?
Describe how you might solve that problem including a description of the solution encoding (chromosome and interpretation) and fitness function. Assume we can add spaces into the message.

One way we could solve for an unknown shift is to do something similar to children with mutations and carrying the best ones over to the next generation. We could start the process with a list of random shift values and calculate our fitness with all of the values. Every child would remember the shift that gave them their best fitness. We could then pick the best fitnesses, make our new shift pool out of those shift values, and also generate a few new random ones to try.

## Challenge

**You do not need to do this problem and it won't be graded if you do. It's just here if you want to push your understanding.**

The original GA used binary encodings for everything.
We're basically using a Base 27 encoding.
You could, however, write a version of the algorithm that uses an 8 bit encoding for each letter (ignore spaces as they're a bit of a bother).
That is, a 5 letter candidate looks like this:

```
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1
```

If you wrote your `genetic_algorithm` code generally enough, with higher order functions, you should be able to implement it using bit strings instead of Latin strings.
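
For a bit-string version, the phenotype would have to be recovered from the genotype before it is displayed. A minimal sketch of that decoding step, assuming 8 bits per letter as above, is shown here; `from_binary` is a hypothetical helper and is not defined elsewhere in this notebook:

```python
def from_binary(bits: str) -> str:
    # Read the bit string 8 genes at a time and map each byte back to a character.
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))


# The 40-bit candidate shown above, written one letter per chunk.
hello_bits = "".join(["01101000", "01100101", "01101100", "01101100", "01101111"])
print(from_binary(hello_bits))
```

Note that a single bit flip can produce a byte outside "a" to "z", so a bit-level GA would need a fitness function that tolerates, or penalizes, such characters.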

## Comments

1. I liked this assignment a lot and I think I did all of the programming correctly unlike assignment1

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.