# AI - CA2 - Genetics - Mohamad Taha Fakharian

## Goal
In this assignment, we decode an encoded text using a genetic algorithm. Genetic algorithms are useful for search problems whose state space is so large that exhaustive or other conventional methods become impractical.

## Overall approach
In this problem, we need to find the decoding key and use it to decode the text. To implement a genetic algorithm, we need to define the chromosome, the fitness function, mutation, and crossover. Let's start!

## Phase 0: Preprocessing and making dictionary
First of all, we build a dictionary of meaningful words from a reference text named 'global_text'. This dictionary is helpful when we calculate the fitness of a chromosome, where we count how many words the chromosome decodes into meaningful ones. This phase is implemented in the 'make_dict' method of the 'Decoder' class. We also build a list of the encoded words and store it in the class. It is important to lowercase all words, so that the same word written in different letter cases is not treated as two different words. We keep a copy of the original encoded text so that the final decoded text keeps its case and punctuation.
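
To make the preprocessing concrete, here is a minimal standalone sketch of the tokenizing and grouping idea on a toy sentence. The names `tokenize` and `build_dictionary` are illustrative only and are not part of the `Decoder` class; the sketch assumes the same non-letter splitting rule that the class uses.

```python
from re import split

def tokenize(text):
    """Split on runs of non-letters and lowercase every token."""
    return [word.lower() for word in split('[^a-zA-Z]+', text) if word]

def build_dictionary(text):
    """Group the distinct words of `text` by their first letter."""
    dictionary = {}
    for word in tokenize(text):
        dictionary.setdefault(word[0], set()).add(word)
    return dictionary

sample = "Hello, World! It's the 2nd hello."
tokens = tokenize(sample)
print(tokens)
# ['hello', 'world', 'it', 's', 'the', 'nd', 'hello']
print(build_dictionary(sample))
```

Note that digits and apostrophes act as separators, so "It's" becomes `it` and `s`, and "2nd" becomes `nd`. Repeated words stay in the token list but appear only once in the dictionary.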

## Phase 1: Definitions
Let's define!
## Chromosome
Each chromosome in this algorithm is a candidate key for decoding the text: a string of lowercase letters whose length is passed to the constructor of the Decoder class (14 by default).
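
As a rough illustration of what a candidate key does, the sketch below decodes a short string with a repeating key by shifting each letter back by the matching key letter. The function `decode_with_key`, the key `"abc"`, and the sample ciphertext are assumptions made for this example; the shift rule mirrors the arithmetic in the `convert` method.

```python
import string

def decode_with_key(ciphertext, key):
    """Shift every letter back by the matching key letter.

    Non-letters are copied unchanged and do not consume a key letter.
    """
    decoded = []
    key_index = 0
    for ch in ciphertext:
        if ch in string.ascii_letters:
            shift = ord(key[key_index % len(key)]) - ord('a')
            base = ord('A') if ch.isupper() else ord('a')
            decoded.append(chr((ord(ch) - base - shift) % 26 + base))
            key_index += 1
        else:
            decoded.append(ch)
    return ''.join(decoded)

# "Hello, World!" encoded with the hypothetical key "abc"
print(decode_with_key("Hfnlp, Yosnd!", "abc"))
# Hello, World!

# Number of distinct 14-letter keys the search has to cover
key_space = 26 ** 14
print(key_space)
```

With 26 choices for each of the 14 genes, trying every key is out of reach, which is why a guided search is used.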

## Phase 2: Generate initial population
We fill the initial generation with randomly generated chromosomes, as many as the population size. In each step, we try to improve on the previous generation by applying crossover and mutation to it.

## Phase 3: Fitness function
Now we introduce our fitness function. The fitness of a chromosome is the number of encoded words that it decodes into a meaningful word. We use the dictionary built in phase 0 to check whether a decoded word is meaningful.
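
The toy example below sketches how this count behaves for a correct and a partly wrong key. The helper names, the three-word vocabulary, and the key `"abc"` are made up for illustration; the real class uses `calc_fitness` with the dictionary from phase 0.

```python
def decode_words(words, key):
    """Decode lowercase words with a repeating key that runs across word boundaries."""
    decoded, key_index = [], 0
    for word in words:
        letters = []
        for ch in word:
            shift = ord(key[key_index % len(key)]) - ord('a')
            letters.append(chr((ord(ch) - ord('a') - shift) % 26 + ord('a')))
            key_index += 1
        decoded.append(''.join(letters))
    return decoded

def fitness(words, key, vocabulary):
    """Number of words that decode to an entry of the vocabulary."""
    return sum(1 for word in decode_words(words, key) if word in vocabulary)

vocabulary = {"a", "cat", "sat"}
encoded_words = ["a", "dct", "tct"]  # "a cat sat" under the hypothetical key "abc"

print(fitness(encoded_words, "abc", vocabulary))  # 3: every word is meaningful
print(fitness(encoded_words, "abz", vocabulary))  # 1: only "a" survives the wrong last gene
```

A key with some correct genes can still score above zero, and that partial credit is what lets selection favor better keys. The maximum fitness equals the number of encoded words, which is the stopping condition used in the code.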

## Phase 4: Crossover and mutation and generating new generation
Let's define crossover and mutation for this problem. There are many ways to define these two operators. For mutation, each gene in the chromosome (in this problem, each gene is a letter) can be replaced by another gene with some probability, known as the mutation probability.
To improve performance, we check whether the fitness of the mutated chromosome is better than that of the original one, and if it isn't, we keep the original chromosome. The mutation probability is best chosen between $\frac{1}{\text{population size}}$ and $\frac{1}{\text{chromosome size}}$. Since the population size here is 100 and the chromosome size is 14, this range is roughly 0.01 to 0.071, and we choose 0.07 for the mutation probability.
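
The bounds are easy to check numerically. The sketch below uses the population size and key length from this notebook and the value 0.07 that `Decoder.__init__` assigns to `mutation_prob`; the derived quantities are added here only for illustration.

```python
population_size = 100   # pop_size in Decoder.__init__
chromosome_size = 14    # CONSTANT, the default key length
mutation_prob = 0.07    # mutation_prob in Decoder.__init__

lower_bound = 1 / population_size
upper_bound = 1 / chromosome_size
print(f"suggested range: {lower_bound:.3f} to {upper_bound:.3f}")

# Expected number of genes resampled in one chromosome per mutation call
expected_resampled = mutation_prob * chromosome_size
# Probability that a mutation call leaves every gene untouched
prob_untouched = (1 - mutation_prob) ** chromosome_size
print(f"expected resampled genes: {expected_resampled:.2f}")
print(f"probability of no resampling: {prob_untouched:.2f}")
```

At this rate, roughly one gene per chromosome is resampled on average, so most mutations are small local moves rather than large jumps.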

For crossover, we split the two parent chromosomes in half and make two new children, each taking its first half from one parent and its second half from the other. To generate a new generation, we proceed as follows:

1. First, we sort the chromosomes of the current generation by fitness and transfer the top 30% directly to the new generation.
2. To fill the remaining 70% of the new population, we take the fittest 70% of the last generation (which includes the elites, as in the code) and shuffle them. Then we cross over pairs of chromosomes from this set with some probability, known as the crossover probability.
3. Mutation is applied to each of the new chromosomes in an attempt to improve them.
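
The half-split crossover and the 30/70 split described above can be sketched on toy data as follows. `midpoint_crossover` is an illustrative standalone function, not the class method, and it always performs the swap instead of applying the crossover probability.

```python
def midpoint_crossover(first_parent, second_parent):
    """Swap the second halves of two equal-length parents."""
    point = len(first_parent) // 2
    first_child = first_parent[:point] + second_parent[point:]
    second_child = second_parent[:point] + first_parent[point:]
    return first_child, second_child

children = midpoint_crossover("aaaaaaaa", "bbbbbbbb")
print(children)
# ('aaaabbbb', 'bbbbaaaa')

# Bookkeeping for one generation with the notebook's settings
pop_size, elites_perc = 100, 30
elites_num = (elites_perc * pop_size) // 100   # chromosomes copied over
pairs_needed = (pop_size - elites_num) // 2    # each pair yields two children
print(elites_num, pairs_needed)
# 30 35
```

Because each pair produces two children, the number of non-elite slots should be even, which holds for a population of 100 with 30 elites.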

## Phase 5: Finalize algorithm
Now we put all these parts together! We keep generating new populations until we find the key or exceed the maximum number of generations allowed (which doesn't happen under normal conditions). Finally, we use this key to decode the whole text.

Let's go!

In [52]:
# import some libraries
import numpy as np
import pandas as pd
from re import split

import string
import random
from math import floor

In [53]:
CONSTANT = 14  # default key length (number of genes in a chromosome)

In [54]:
class Decoder:
    
    def __init__(self, global_text, encoded_text, key_length = CONSTANT):
        self.words_dict = self.make_dict(global_text)
        self.raw_text = encoded_text
        self.key_length = key_length
        self.encoded_words = self.make_words(encoded_text)
        
        self.key = None
        self.mutation_prob = 0.07
        self.crossover_prob = 0.9
        self.pop_size = 100
        self.total_generations = 10000
        self.chromosomes = self.generate_initial_population()
        self.elites_perc = 30
        
        self.cur_index = 0
        
    def make_dict(self, global_text):
        dictionary = {}
        for word in split('[^a-zA-Z]+', global_text):
            if not word:
                continue
            word = word.lower()
            if word[0] not in dictionary:
                dictionary[word[0]] = set()
            dictionary[word[0]].add(word)
        return dictionary
    
    def make_words(self, encoded_text):
        return [word.lower() for word in split('[^a-zA-Z]+', encoded_text) if word]
    
    
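    def convert(self, encoded, chromosome):
        # Shift the (lowercase) encoded letter back by the current key letter, modulo 26,
        # then advance the key position, wrapping around at the key length.
        alphabet = (ord(encoded) - ord(chromosome[self.cur_index])) % 26 + 97
        converted = chr(alphabet)
        self.cur_index = (self.cur_index + 1) % self.key_length
        return converted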
    
    def calc_fitness(self, chromosome):
        fitness = 0
        self.cur_index = 0
        for word in self.encoded_words:
            decoded = ''.join(self.convert(encoded, chromosome) for encoded in word)
            if (decoded[0] in self.words_dict) and (decoded in self.words_dict[decoded[0]]):
                fitness += 1
        return fitness
    
    def crossover(self, first_chromosome, second_chromosome):
        point = floor(self.key_length / 2)
        first_child, second_child = first_chromosome, second_chromosome
        if (random.random() < self.crossover_prob):
            first_child, second_child = first_chromosome[:point] + second_chromosome[point:], second_chromosome[:point] + first_chromosome[point:]
        return first_child, second_child
    
    def mutate(self, chromosome):
        # Resample each gene independently with probability mutation_prob
        candidate = chromosome
        for i in range(self.key_length):
            if (random.random() < self.mutation_prob):
                candidate = candidate[:i] + random.choice(string.ascii_lowercase) + candidate[i+1:]
        # Keep the mutated chromosome only if it is strictly fitter than the original
        if self.calc_fitness(candidate) > self.calc_fitness(chromosome):
            return candidate
        return chromosome
    
    def generate_initial_population(self):
        return [''.join(random.choice(string.ascii_lowercase) 
                for i in range(self.key_length)) 
                for j in range(self.pop_size)]
    
    def generate_new_population(self):
        sorted_chromosomes = [(self.calc_fitness(chromosome), chromosome) for chromosome in self.chromosomes]
        sorted_chromosomes.sort(reverse=True)
        print("Best: key = {}, fitness = {}".format(sorted_chromosomes[0][1], sorted_chromosomes[0][0]))
        
        if sorted_chromosomes[0][0] == len(self.encoded_words):
            self.key = sorted_chromosomes[0][1]
            return self.chromosomes
        
        elites_num = floor((self.elites_perc / 100) * (self.pop_size))
        elites = sorted_chromosomes[:elites_num]
        new_pop = [self.mutate(chromosome[1]) for chromosome in elites]
        crossovering = [chromosome[1] for chromosome in sorted_chromosomes[:(self.pop_size - elites_num)]]
        random.shuffle(crossovering)
        
        while(len(new_pop) != self.pop_size):
            first_parent = crossovering.pop(0)
            second_parent = crossovering.pop(0)
            
            first_child, second_child = self.crossover(first_parent, second_parent)
            first_child = self.mutate(first_child)
            second_child = self.mutate(second_child)
            
            new_pop.append(first_child)
            new_pop.append(second_child)
        return new_pop
    
    def decode(self):
        for _ in range(self.total_generations):
            self.chromosomes = self.generate_new_population()
            # Check right after evaluation, so a key found in the last allowed generation is not lost
            if self.key:
                return self.show_text()
        return "Couldn't decode!"
    
    def show_text(self):
        decoded_text = ''
        self.cur_index = 0
        for character in self.raw_text:
            if character >= 'A' and character <= 'Z':
                character = character.lower()
                decoded_text += self.convert(character, self.key).upper()
            elif character >= 'a' and character <= 'z':
                decoded_text += self.convert(character, self.key)
            else:
                decoded_text += character
        return decoded_text

In [55]:
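import time

# Read the encoded text and the reference text, closing both files afterwards
with open('encoded_text.txt') as encoded_file:
    encoded_text = encoded_file.read()
with open('global_text.txt') as global_file:
    global_text = global_file.read()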

decoder = Decoder(global_text, encoded_text)
t0 = time.time()
decoded_text = decoder.decode()
t1 = time.time()

with open('decoded_text.txt', 'w') as decoded:
    decoded.write(decoded_text)

print("Time spent for decoding = {}s".format(t1 - t0))

Best: key = zyjvfveiimmgcr, fitness = 41
Best: key = zyjvfveiimmgcr, fitness = 41
Best: key = zyjvfveiimmgcr, fitness = 41
Best: key = zyjvfveiblgtuj, fitness = 47
Best: key = zyjvfveiblgtuj, fitness = 47
Best: key = umpzsmkiblgtuj, fitness = 49
Best: key = utpbsmkiblgtuj, fitness = 50
Best: key = ahofaevbdzyfin, fitness = 52
Best: key = aoygrolbazyfin, fitness = 52
Best: key = aoygrolbazyfin, fitness = 52
Best: key = utpbsmlbdzzein, fitness = 60
Best: key = ahofaevbdzzein, fitness = 76
Best: key = azofaevbdzzein, fitness = 80
Best: key = azofaevbdzzein, fitness = 80
Best: key = azofaevbdzzein, fitness = 80
Best: key = acyerolbdzzein, fitness = 82
Best: key = acyerolbdzzein, fitness = 82
Best: key = acyeroxbdzzein, fitness = 84
Best: key = acyeroxbdzzein, fitness = 84
Best: key = acyerolsdzzein, fitness = 85
Best: key = acberolqdzzein, fitness = 96
Best: key = alyeroladbzein, fitness = 102
Best: key = alyeroludzzein, fitness = 104
Best: key = alyeroludzzein, fitness = 104
Best: key = a

## Phase 6: Questions
1. A small population size in a genetic algorithm may leave it unable to find the solution, since the search space is vast and the algorithm may not converge to the optimal solution. This doesn't mean that a larger population always makes the algorithm better: beyond some threshold, increasing the population size only increases the running time, not the accuracy.
2. It results in higher accuracy but also a longer running time. Note that the gain in accuracy can be very small.
3. Crossover is a strategy that builds better chromosomes from good ones by combining their good qualities. This operation tries to move solutions toward the optima of the search space. In contrast, mutation makes small changes to chromosomes in order to keep the chromosomes in a generation different from each other. These small changes help us explore the search space better and avoid getting stuck in local optima. If we don't use mutation, the chromosomes may become biased, and there is no way back from this bias.
4. Using mutation usually results in higher accuracy, since mutation lets the search reach more of the search space. So, given unlimited time, mutation can generate almost every state and we can find the solution. In contrast, using crossover may result in lower accuracy, but it reaches a result faster. This is because the changes made by crossover are large (unlike those made by mutation), which speeds up the search. This is why we use both operations and not only one of them.
5. There are many solutions to this problem. One can increase the mutation rate, hoping that the chromosomes change and that crossover can then combine them into new chromosomes. Another solution is to remove some percentage of the population at random and insert new random chromosomes, so that different chromosomes appear in the next generations.
6. I think crossover is more efficient than mutation. Consider using only mutation in the algorithm: the solution is found in almost all runs, but it takes a long time (since we are essentially trying every state until we find the solution). In contrast, using only crossover is much faster, but may lead to bias and failure to find the solution. By this reasoning, crossover is more efficient than mutation.
7. In this problem, we look for a key that decodes all words properly. Instead of generating random keys and trying to improve them, we can guess some parts of the key from the text. This can be done using special words in the text such as 'A.M', which is encoded as something like X.Y, or '2nd', which is encoded as something like 2xy. After guessing some parts of the key, we can run the genetic algorithm without changing the guessed letters in the chromosomes. This way, the search space shrinks and we can find the key faster.
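
The idea in the last answer can be sketched as follows, under the assumption that we have guessed the plaintext of one encoded letter and know its position in the key stream (the number of letters before it, modulo the key length). Both `key_letter` and `mutate_unfixed` are hypothetical helpers written for this illustration; they are not part of the `Decoder` class.

```python
import random
import string

def key_letter(cipher_char, plain_char):
    """Key letter that turns plain_char into cipher_char under the shift scheme."""
    shift = (ord(cipher_char.lower()) - ord(plain_char.lower())) % 26
    return chr(shift + ord('a'))

def mutate_unfixed(chromosome, fixed, mutation_prob, rng=random):
    """Resample genes with probability mutation_prob, but pin the guessed positions."""
    genes = list(chromosome)
    for i in range(len(genes)):
        if i in fixed:
            genes[i] = fixed[i]            # guessed gene: never changed
        elif rng.random() < mutation_prob:
            genes[i] = rng.choice(string.ascii_lowercase)
    return ''.join(genes)

# Hypothetical guess: the cipher letter 'Y' at key position 5 stands for plaintext 'W'
fixed = {5: key_letter('Y', 'W')}
print(fixed)
# {5: 'c'}

child = mutate_unfixed('a' * 14, fixed, mutation_prob=1.0)
print(child[5])
# c
```

Each pinned gene removes a factor of 26 from the search space, so even a few confident guesses shrink the search noticeably.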

# Conclusion
Using a genetic algorithm is a good choice when the search space has a large number of states. We can improve the algorithm with a better crossover function and an adaptive mutation rate. It's important to choose good mutation and crossover strategies in order to find the solution in a reasonable amount of time.
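
One possible form of an adaptive mutation rate is sketched below. The rule, the function name `adapt_mutation_prob`, and all of its parameters are assumptions chosen for illustration, not something implemented in this notebook: the rate is raised when the best fitness stagnates and reset once it improves. The fitness history is taken from the first lines of the run log above.

```python
def adapt_mutation_prob(current_prob, stagnant_generations,
                        base_prob=0.07, max_prob=0.3, patience=5, factor=1.5):
    """Raise the rate after every `patience` stagnant generations, reset on progress."""
    if stagnant_generations == 0:
        return base_prob
    if stagnant_generations % patience == 0:
        return min(max_prob, current_prob * factor)
    return current_prob

# Best fitness per generation, copied from the start of the log above
history = [41, 41, 41, 47, 47, 49, 50, 52, 52, 52]

prob, stagnant, best = 0.07, 0, history[0]
for fitness in history[1:]:
    if fitness > best:
        best, stagnant = fitness, 0   # progress: reset the stagnation counter
    else:
        stagnant += 1                 # no improvement in this generation
    prob = adapt_mutation_prob(prob, stagnant, patience=2)
    print(fitness, stagnant, round(prob, 3))
```

With `patience=2`, the rate rises from 0.07 to about 0.105 after two generations without improvement and falls back to 0.07 as soon as the best fitness increases.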