# Genetic Algorithms
Genetic Algorithms are inspired by the way nature "improves" populations. Among organisms, each parent passes down **genes** to its children, and these genes affect the children's ability to survive. The heuristic optimization method called the "Genetic Algorithm" (GA) is built on an analogy with this principle of biological evolution. In this algorithm, the "genes" are decision variables.

#### Definitions
* Gene: basic genetic element
* Chromosome: collection of genes
* Allele: The values a gene can take (e.g. 0 or 1)

  For example, if $s$ is a binary string of length 4, examples of chromosomes are `1010` and `0110`. Genes are the bits in the string, and alleles are 0 or 1.
* Fitness: The objective function $F(s)$ that we want to *maximize*
* Generation: A single iteration of the GA

#### Genetic Algorithms vs. Simulated Annealing
The primary difference is that simulated annealing only carries the single $s_{curr}$ from one iteration to the next. Genetic algorithms, on the other hand, carry many solutions from one iteration to the next. These solutions are collectively referred to as the "population".

## I. Characteristics of Genetic Algorithms
Binary genetic algorithms are ones where the decision variable is a binary string. The primary operations in genetic algorithms are: 
* **Crossover**: Two binary strings exchange substrings with each other.
* **Mutation**: The value of a gene in the binary string is changed with some probability.

In each iteration (generation) there are multiple members of the population, $s^{curr}_i \;\forall i \in \{1, \ldots, M\}$. GA is a stochastic algorithm with no natural stopping point, so it usually requires a `maxIter`. Like other heuristic algorithms, GA does not require derivatives, continuity, etc. However, the problem structure can sometimes be incorporated into the encoding of the string or into the reproductive process to improve performance.

## II. Outline of Genetic Algorithm
1. Randomly generate an initial "population" $\{s^{curr}_j\} \;\forall j= 1,\ldots,M$.
2. Compute fitness $F(s^{curr}_j) \;\forall j= 1,\ldots,M$
3. Generate "children" which are the members of the population in the next generation by applying:
  * Selection of pairs of parents (by roulette or tournament) randomly influenced by $F(s^{curr}_j)$
  * Crossover
  * Mutation
4. The population of children becomes the population of parents for the next iteration. Go to step 2 unless a stopping criterion has been satisfied.
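
The four steps above can be sketched as a short loop. This is a minimal illustration, not a reference implementation: the function name `run_ga`, the default parameter values, the use of roulette selection, and the "best so far" bookkeeping are all assumptions made for the example. Fitness is assumed to be non-negative.

```python
import random

def run_ga(fitness, n_bits, pop_size=20, max_iter=50, p_c=0.9, p_m=0.02, seed=0):
    """Minimal binary GA sketch: roulette selection, one-point crossover, bit flips."""
    rng = random.Random(seed)
    # Step 1: random initial population (each member is a list of 0/1 genes)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)  # assumption: remember the best string seen
    for _ in range(max_iter):
        # Step 2: fitness of every member (small offset keeps the weights valid
        # even if every fitness value is zero)
        weights = [fitness(s) + 1e-12 for s in pop]
        children = []
        while len(children) < pop_size:
            # Step 3a: pick two parents with probability proportional to fitness
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            c1, c2 = p1[:], p2[:]
            # Step 3b: one-point crossover with probability p_c
            if rng.random() < p_c:
                cut = rng.randint(1, n_bits - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            # Step 3c: flip each gene independently with probability p_m
            for child in (c1, c2):
                for i in range(n_bits):
                    if rng.random() < p_m:
                        child[i] = 1 - child[i]
                children.append(child)
        # Step 4: the children become the parents of the next generation
        pop = children[:pop_size]
        best = max(pop + [best], key=fitness)
    return best
```

For example, `run_ga(sum, 10)` searches for the 10-bit string with the most ones; the stopping criterion here is simply `max_iter`.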

### Crossover (using Roulette)
In order to generate $N_c$ children, we perform the steps:
* Randomly select one parent with a probability that is proportional to its fitness $F(s_i)$:
$$p_i = \frac{F(s_i)}{\sum_{k=1}^{N_p}F(s_k)}$$
  where $N_p$ is the number of parents in the population.
* Then randomly select a second parent, also with a probability that is proportional to its fitness.
* For one point crossover, pick a crossover point. This can be done randomly, although there may be some restrictions.
    * Multipoint crossover randomly selects multiple crossover points, effectively swapping multiple groups of genes between the parents.
* Then switch the portions of the binary strings that occur before the crossover point between the parents.
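
A small sketch of roulette selection and one point crossover as described above. The function names are hypothetical, and fitness values are assumed to be non-negative with a positive total.

```python
import random

def roulette_probabilities(fitnesses):
    """p_i = F(s_i) / sum_k F(s_k)."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def roulette_select(population, fitnesses, rng):
    """Draw one parent with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]

def one_point_crossover(parent_a, parent_b, cut):
    """Swap the portions of the two strings that occur before position `cut`."""
    child_a = parent_b[:cut] + parent_a[cut:]
    child_b = parent_a[:cut] + parent_b[cut:]
    return child_a, child_b

rng = random.Random(0)
population = ["1111", "0000", "1010"]
fitnesses = [4.0, 1.0, 2.0]          # made-up values for illustration
mother = roulette_select(population, fitnesses, rng)
father = roulette_select(population, fitnesses, rng)
cut = rng.randint(1, len(mother) - 1)  # random crossover point
children = one_point_crossover(mother, father, cut)
```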

### Crossover (using Tournament)
* Let all parents have an equal probability of being selected (e.g. $p_i = 1/M$, where $M$ is number of individuals in the population).
* Pick two parents $s_i$ and $s_k$ at random.
* Select the parent with the higher fitness to be the first parent.
* Repeat the process from the first step to select the second parent.
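
The tournament steps can be sketched as follows. The names are hypothetical, and drawing the two contestants without replacement is an assumption; the text only requires that every member is equally likely to be picked.

```python
import random

def tournament_select(population, fitness, rng):
    """Binary tournament: pick two members uniformly at random, keep the fitter one."""
    s_i, s_k = rng.sample(population, 2)
    return s_i if fitness(s_i) >= fitness(s_k) else s_k

def select_parents(population, fitness, rng):
    """Run two independent tournaments to obtain a pair of parents."""
    first = tournament_select(population, fitness, rng)
    second = tournament_select(population, fitness, rng)
    return first, second
```

Unlike roulette selection, only the ordering of fitness values matters here, so negative or badly scaled fitness values cause no difficulty.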

### Mutation
* Mutation is applied after crossover and produces new characteristics. Without mutation, there would be some values we would never reach.
* It produces incremental random changes in the offspring by randomly changing allele values. Typically, a mutation flips a bit when $S$ is a binary string.
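
A minimal sketch of bit flip mutation, assuming each gene is flipped independently with the same probability `p_m` (a hypothetical parameter name):

```python
import random

def mutate(bits, p_m, rng=random):
    """Return a copy of `bits` with each 0/1 gene flipped with probability p_m."""
    return [1 - b if rng.random() < p_m else b for b in bits]
```

With `p_m = 0` the string is returned unchanged, and with `p_m = 1` every bit is flipped; practical values are usually small so that most offspring change only slightly.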

**Note**: Comparing the computational effort between Genetic Algorithms and Simulated Annealing, we see that each iteration of SA requires only 1 evaluation of the cost function, whereas each iteration of GA requires $N_p$ evaluations of the cost function.

### Selecting length of sequence, $M$
If you have a problem with $k$ integer variables, each taking one of $l$ possible values, then the number of decision values is $l^k$. To represent these decision values, you need at least $M$ bits, where $M$ is the smallest integer such that $2^M \geq l^k$, i.e. $M = \lceil k\log_2 l\rceil$.
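
As a quick check of the bit count, the smallest $M$ with $2^M \geq l^k$ can be computed with exact integer arithmetic. The helper name `bits_needed` is hypothetical.

```python
def bits_needed(k, l):
    """Smallest M such that 2**M >= l**k (k variables, l values each)."""
    return (l ** k - 1).bit_length()
```

For instance, 3 variables with 10 values each give $10^3 = 1000$ combinations, which need 10 bits because $2^9 = 512 < 1000 \leq 1024 = 2^{10}$.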

### Multidimensional $S$
* If the decision variable consists of multiple dimensions, so that $S = \{S_1, \ldots, S_n\}$, it can still be represented as a single binary string by concatenating the binary values: $b(S) = \{b(S_1) | \ldots |b(S_n)\}$.
* If each $S_i$ can be represented using $k$ bits, then $S$ is represented using $k\cdot n$ bits.
* In such a case, the crossover operation is performed separately on each part of the binary string, i.e. on the bits belonging to each decision variable $S_i$.
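
A sketch of the concatenated encoding and the per-variable crossover, assuming each variable is a non-negative integer that fits in `k` bits. All function names are hypothetical.

```python
def encode(values, k):
    """Concatenate the k-bit binary codes of the variables into one string."""
    return "".join(format(v, f"0{k}b") for v in values)

def decode(bits, k):
    """Split the string into k-bit segments and convert each back to an integer."""
    return [int(bits[i:i + k], 2) for i in range(0, len(bits), k)]

def crossover_per_variable(a, b, k, cuts):
    """One point crossover applied separately inside each k-bit segment.

    cuts[i] is the crossover point (0..k) used within segment i.
    """
    child_a, child_b = "", ""
    for i, cut in enumerate(cuts):
        seg_a, seg_b = a[i * k:(i + 1) * k], b[i * k:(i + 1) * k]
        child_a += seg_b[:cut] + seg_a[cut:]
        child_b += seg_a[:cut] + seg_b[cut:]
    return child_a, child_b
```

Here `encode([5, 3], 3)` gives `"101011"`, and `decode("101011", 3)` recovers `[5, 3]`.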

## III. Constraint Handling in GAs
Broadly classified into 5 classes:
1. Based on preserving feasibility of solutions
2. Based on penalty functions
3. Based on distinguishing feasible and infeasible solutions
4. Based on decoders
5. Based on hybrid approach

We have the following notation:
* $f(x)\rightarrow$ objective (cost) function
* $g_k(x)\rightarrow$ the $k$th inequality constraint
* $h_j(x)\rightarrow$ the $j$th equality constraint
* $\mathbf{x} = [x_1 x_2 \ldots x_n]\rightarrow$ vector of decision variables
* $F(x)\rightarrow$ the GA fitness function (after modifying $f(x)$ to deal with constraints)

### Penalty Function Method
In this method, constraint violations decrease the fitness, usually through the form
$$F(x) = f(x)-\sum_{j=1}^J R_j \langle g_j (\mathbf{x})\rangle^2$$
where $R_j$ is the penalty coefficient for inequality constraint $j$. $R_j$ serves to make the penalty function the same order of magnitude as the objective function. Often, constraints are normalized and a single $R$ is used. The bracket operator is defined as
$$\langle g_j(\mathbf{x})\rangle = \begin{cases}
|g_j(\mathbf{x})|,\quad \text{for }g_j(\mathbf{x})<0,\\
0,\quad \text{for }g_j(\mathbf{x})\geq 0
\end{cases}$$
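
The penalty form can be sketched directly from the two formulas above. The function names and the example objective and constraint are made up for illustration; constraints are assumed to be written as $g_j(\mathbf{x}) \geq 0$.

```python
def bracket(g_value):
    """<g> = |g| if the constraint is violated (g < 0), otherwise 0."""
    return abs(g_value) if g_value < 0 else 0.0

def penalized_fitness(f, constraints, R, x):
    """F(x) = f(x) - sum_j R_j * <g_j(x)>**2."""
    penalty = sum(R_j * bracket(g(x)) ** 2 for R_j, g in zip(R, constraints))
    return f(x) - penalty

# Hypothetical problem: maximize x0 + x1 subject to x0 <= 1, i.e. g(x) = 1 - x0 >= 0
objective = lambda x: x[0] + x[1]
g1 = lambda x: 1 - x[0]
feasible_value = penalized_fitness(objective, [g1], [10.0], (0.5, 0.0))
infeasible_value = penalized_fitness(objective, [g1], [10.0], (3.0, 0.0))
```

The feasible point keeps its objective value of 0.5, while the infeasible point is penalized by $10 \cdot 2^2 = 40$ and ends up at $-37$.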

#### Problems with penalty methods
* Introduces at least one, and sometimes more, algorithm parameters ($R_j$) that must be adjusted for $F(x)$
* Experiments may be needed to determine penalty coefficient values $R_j$ that guide the algorithm towards the feasible region
* Algorithm specific methods for constraint handling can be more efficient.

### Efficient Constraint Handling using Deb (2000) Method
This is a parameter free penalty function method that considers feasibility of solution. It takes advantage of 1:1 comparison of solutions in the Tournament selection. In tournament selection, when comparing two solutions together, the following criteria are always enforced:
1. Any feasible solution is preferred to any infeasible solution
2. For two feasible solutions, the one with the higher fitness (better objective) is preferred
3. For two infeasible solutions, the one having the smaller constraint violation is preferred

A new fitness function $F(x)$ is therefore defined:
$$F(x) = \begin{cases}
f(x), \quad \text{if }g_j(x)\geq 0 \;\forall j=1,2,\ldots,J\text{ (no violation)}\\
f_{min} - \sum_{j=1}^J \langle g_j(x)\rangle,\quad\text{ otherwise (with some constraint violations)}
\end{cases}$$

$f_{min}$ is the fitness value of the worst feasible solution found so far. All the constraints are normalized. One idea is to check the constraints first and DO NOT evaluate objective function if they are violated. This is particularly beneficial for computationally demanding cost functions.
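
A minimal sketch of this fitness function, including the idea of skipping the objective evaluation for infeasible points. The name `deb_fitness` is hypothetical, constraints are assumed to be normalized and written as $g_j(x) \geq 0$, and `f_min` is assumed to be tracked by the caller.

```python
def deb_fitness(f, constraints, x, f_min):
    """Parameter-free penalty fitness for a maximization problem.

    Constraints are checked first; f(x) is evaluated only if x is feasible.
    """
    # Total violation: sum of |g_j(x)| over the violated constraints
    violation = sum(-g_val for g_val in (g(x) for g in constraints) if g_val < 0)
    if violation > 0:
        # Infeasible: always worse than the worst feasible solution
        return f_min - violation
    return f(x)
```

Because every infeasible solution scores below `f_min`, a tournament that compares these values automatically enforces the three criteria listed above.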

For equality constraints:
1. Direct substitution is possible: The best way is to get rid of equality constraints by substitution. For example, if the constraint is $x_1 = x_2+5$, then simply replace $x_1$ everywhere.
2. Direct substitution is not possible: With Deb's method, you can also use the evaluation of the equality directly to determine if a solution is feasible or infeasible.

## IV. Partitioning: Maintaining diversity in the GA
In partitioning, we create $K$ initial populations on different "islands", each with $P$ members. Within each population, we do reproduction, crossover and mutation. We further copy a small number of elite individuals from one population to another every $J$ generations. The idea is to allow selection to occur simultaneously in multiple subpopulations and then periodically transfer some of the best individuals among the populations in some way.

* **Advantages**:
  * Helps prevent premature convergence (to the wrong solution): the wrong solution is unlikely to take over every island, so diversity is maintained
  * Probably most feasible if you are doing a very large number of evaluations (e.g. 20k or 100k)
* **Disadvantages**:
  * takes a lot of objective function evaluations, especially if you want population size in each island to be substantial.
  * not clear how often you should transfer from one island to another - that becomes another algorithm parameter.
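
One possible migration step is sketched below. The text does not fix how individuals are transferred, so the ring topology (island $i$ sends to island $i+1$), the replacement of the worst members, and the name `migrate` are all assumptions.

```python
def migrate(islands, fitness, n_elite=1):
    """Copy each island's n_elite best members to the next island in a ring.

    The incoming elites replace the receiving island's worst members, so every
    island keeps its size. Individuals are shared by reference, not deep-copied.
    """
    elites = [sorted(pop, key=fitness, reverse=True)[:n_elite] for pop in islands]
    new_islands = []
    for i, pop in enumerate(islands):
        incoming = elites[(i - 1) % len(islands)]
        survivors = sorted(pop, key=fitness, reverse=True)[:len(pop) - n_elite]
        new_islands.append(survivors + list(incoming))
    return new_islands
```

In a full run, each island would evolve independently and a call such as `islands = migrate(islands, fitness)` would be made once every $J$ generations.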
  
## V. Real Coded Genetic Algorithms
This section deals with working with cases when $x$ is not binary, but real valued. We define:
* $M\rightarrow$ number of parents, where parent $k: x_k = (x_{1k}, x_{2k}, \ldots, x_{nk})$
* $x_{jk}\rightarrow$ the $j$th element of the $k$th parent, where all $x_{jk}$ are real valued.
* $n\rightarrow$ the dimension of the decision variables.

### Real Coded Crossover
We have several possible ways of doing real valued crossover:
1. **One Point Crossover**: Pick two parents $x_1$ and $x_2$. Randomly select a crossover point $c$ and exchange the elements occurring after point $c$ between the parents.
2. **Two Point Crossover**: Pick two parents and two crossover points $c_1$ and $c_2$. Exchange the elements of the two parents in the segment between $c_1$ and $c_2$. This can be extended to multiple crossover points by selecting alternating segments for exchange.
3. **Arithmetic Operator**: Let $x'$ be the offspring of two parents $x_1$ and $x_2$. Then, $x' = wx_1 + (1-w)x_2,\quad w: 0<w<1$. The parameter $w$ is chosen by the user. If $w=0.5$, then a simple average is taken. This can be extended to a case with more than two parents used to determine the offspring, and this general case is called the arithmetic crossover. The value of $w$ can be picked in multiple ways and can be varied systematically or randomly.
4. **Geometrical Crossover**: Let $x'$ be the offspring of two parents $x_1$ and $x_2$. Then, $x' = [(x_{11}\cdot x_{12})^{1/2}, (x_{21}\cdot x_{22})^{1/2}, \ldots, (x_{n1}\cdot x_{n2})^{1/2}]$. As with arithmetic crossover, this can be extended to more than two parents.
5. **Fitness Based Scan**:
  * Select $M$ parents and compute $Cost(x_k) \quad\forall k\in\{1,\ldots,M\}$.
  * Then for each crossover segment of the offspring, select a segment from one of the parents based on a probability. The probability of picking a segment from the $k$th parent is $Cost(x_k)$ divided by $\sum_{i=1}^M Cost(x_i)$.
  
Benefits of fitness scan over other methods:
* The other methods are doing weighted (arithmetic or geometric) based averages of the decision variables, which encourages the selection of decision variables in the middle of the domain without regard for fitness.
* The fitness based scan scheme retains the schema and combines them with schema from other fit individuals, hence it does not push answers toward the middle of the domain. Fitness based scan is similar to the treatment of schema in the binary GA case. However, in order to get NEW real values, there does need to be a significant amount of mutation.
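
The arithmetic, geometric and fitness based scan operators can be sketched as below. Function names are hypothetical. In the scan, each "segment" is taken to be a single element, which is a simplifying assumption, and the geometric operator assumes the product of each pair of elements is non-negative.

```python
import math
import random

def arithmetic_crossover(x1, x2, w=0.5):
    """x' = w*x1 + (1-w)*x2, element by element."""
    return [w * a + (1 - w) * b for a, b in zip(x1, x2)]

def geometric_crossover(x1, x2):
    """x'_j = sqrt(x_j1 * x_j2), element by element."""
    return [math.sqrt(a * b) for a, b in zip(x1, x2)]

def fitness_based_scan(parents, costs, rng):
    """Copy each element from parent k with probability Cost(x_k) / sum_i Cost(x_i)."""
    n = len(parents[0])
    donors = rng.choices(range(len(parents)), weights=costs, k=n)
    return [parents[k][j] for j, k in enumerate(donors)]
```

Note that the first two operators can only produce values between (or at) the parents' values, while the scan copies parent values unchanged, which is why it relies on mutation to introduce new real values.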

### Real Coded Mutation
Mutation can be done in many ways. For example, if $x$ is the individual after crossover, and $x'$ is the individual after mutation, we can do $x' = x+M(x)$ where $M(x)$ is some function or random variable drawn from a distribution that might depend on $x$. Examples of $M(x)$:
* uniform on $[-a,a]$
* uniform on $[-ax, ax]$ (this corresponds to a random percentage change)
* normally distributed with mean 0 and std $ax$
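
The three example perturbations can be sketched as follows, applied independently to each element of $x$ (an assumption; the function names are also hypothetical):

```python
import random

def mutate_uniform(x, a, rng=random):
    """Add noise drawn uniformly from [-a, a] to each element."""
    return [xi + rng.uniform(-a, a) for xi in x]

def mutate_relative(x, a, rng=random):
    """Add noise drawn uniformly from [-a*xi, a*xi]: a random percentage change."""
    return [xi + rng.uniform(-a * xi, a * xi) for xi in x]

def mutate_gaussian(x, a, rng=random):
    """Add normally distributed noise with mean 0 and standard deviation |a*xi|."""
    return [xi + rng.gauss(0.0, abs(a * xi)) for xi in x]
```

The two relative variants leave an element that is exactly zero unchanged, which may or may not be desirable for a given problem.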

### Discretization: Real valued to binary coded GA
You can use a binary coded GA with each real variable value converted (by "discretization") to the closest value represented by a binary string of length $N$. The difficulty is that you are replacing continuous decision variables with discrete points, which introduces some inaccuracy.
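
A sketch of one common discretization, assuming the variable lies in a known interval `[lo, hi]` that is covered by an evenly spaced grid of $2^N$ points. The interval, the uniform grid and the function names are assumptions.

```python
def to_bits(x, lo, hi, n_bits):
    """Map x in [lo, hi] to the binary code of the nearest grid point."""
    levels = 2 ** n_bits - 1
    idx = round((x - lo) / (hi - lo) * levels)
    idx = min(max(idx, 0), levels)  # clamp values outside [lo, hi]
    return format(idx, f"0{n_bits}b")

def from_bits(bits, lo, hi):
    """Map a binary code back to the real value of its grid point."""
    levels = 2 ** len(bits) - 1
    return lo + int(bits, 2) / levels * (hi - lo)
```

Under these assumptions the round-trip error is at most half a grid step, $(hi - lo) / (2(2^N - 1))$, so each extra bit roughly halves the inaccuracy.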

## VI. Theory for Genetic Algorithms
### Schema
A schema is a set of genes that make up a partial solution to the optimization problem. Schemata (plural of schema) define subsets of similar chromosomes. A **building block** is denoted as, for example, $H_1 = \{1*0***\}$, where each $*$ can be either 0 or 1.

We then have some definitions:
* Average fitness: $f(H)$
* Order of schema: The order of schema $H$ is denoted by $o(H)$ and it is the number of non-$*$ symbols it contains.
* Length of schema: The length, denoted by $\delta(H)$, is the distance from the first non-$*$ to the last non-$*$ position.
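
These definitions are easy to check in code. The sketch below represents a schema as a string over `0`, `1` and `*`; the function names are hypothetical.

```python
def order(schema):
    """o(H): number of fixed (non-*) positions."""
    return sum(c != "*" for c in schema)

def defining_length(schema):
    """delta(H): distance from the first to the last fixed position."""
    fixed = [i for i, c in enumerate(schema) if c != "*"]
    return fixed[-1] - fixed[0] if fixed else 0

def matches(schema, s):
    """True if string s belongs to the schema."""
    return all(c == "*" or c == b for c, b in zip(schema, s))
```

For $H_1 = \{1*0***\}$ this gives $o(H_1) = 2$ and $\delta(H_1) = 2$, and the string `110101` belongs to $H_1$.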

### Effect of reproduction without crossover on schema
Definitions:
* $M = $ number of individual strings in GA population
* $S[j] = $ an individual string in the GA population, $j\in \{1,\ldots, M\}$
* $H = $ a schema
* $f = \text{average } f_j = \text{average } f(S[j])$, where $S[j]$ is part of the GA population $P(t)$
* $f(H)=$ average fitness of all $S[j]$ in schema $H$
* $P(t) =$ GA population at iteration $t$. $P(t) = \{S[j], \forall j\in\{1, \ldots, M\}\}$
* $m(H,t) = $ number of $S[j]$ contained in schema $H$ in $P(t)$ in iteration $t$

Therefore, the probability that a string $S[i]$ from $P(t)$ is selected to be a parent (by roulette) is $P_i = \frac{f_i}{\sum_{j=1}^M f_j}$. The expected value of $m(H,t+1)$ is:
$$m(H,t+1) = m(H,t)\cdot M\cdot \frac{f(H)}{\sum_{j=1}^M f_j} = m(H,t)\cdot\frac{f(H)}{\text{Avg }f}$$
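
A small numerical check of this expectation, using a made-up population of four 3-bit strings whose fitness is taken to be the integer each string encodes. The population, the fitness function and the function names are assumptions for illustration.

```python
def matches(schema, s):
    """True if string s belongs to the schema (* matches either bit)."""
    return all(c == "*" or c == b for c, b in zip(schema, s))

def expected_schema_count(population, fitness, schema):
    """m(H,t) * M * f(H) / sum_j f_j under roulette selection only."""
    M = len(population)
    total = sum(fitness(s) for s in population)
    members = [s for s in population if matches(schema, s)]
    m = len(members)
    if m == 0:
        return 0.0
    f_H = sum(fitness(s) for s in members) / m  # average fitness of the schema
    return m * M * f_H / total

population = ["110", "100", "011", "001"]
value = lambda s: int(s, 2)  # hypothetical fitness: 6, 4, 3, 1
expected = expected_schema_count(population, value, "1**")
```

Here $m(H,t) = 2$, $f(H) = 5$ and the average fitness is $3.5$, so the expected count is $2 \cdot 5 / 3.5 \approx 2.86$: the above-average schema is expected to gain members.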

### Surviving Crossover
We want to find the probability that the schema structure $H$ is preserved after crossover. This probability depends on the length of the schema, $\delta(H)$. For example, $H_1 = \{11****\}$ (with $\delta(H_1)=1$) is much more likely to survive crossover than $H_2 = \{1****1\}$ (with $\delta(H_2)=5$). The probability of $H$ surviving crossover, $p_s(H)$, satisfies:
$$p_s(H) \geq 1-p_c\frac{\delta(H)}{n-1}$$
where $n$ is the length of the string and $p_c$ is the probability of crossover happening, assuming crossover and reproduction are independent. This is a lower bound because a crossover point that falls inside the schema does not always destroy it (the other parent may carry the same bits).

Therefore, the expected number of strings belonging to $H$ that survive both reproduction + crossover is:
$$m(H,t+1) \geq m(H,t)\cdot\frac{f(H)}{\text{Avg }f}\cdot [1-p_c\frac{\delta(H)}{n-1}]$$

### Surviving Mutation
We want to determine if schema $H$ survives mutation. It does not matter if a bit in a $*$ position is mutated. Let $p_m$ be the probability that one bit is mutated. Then the probability that $H$ survives mutation is:
$$(1-p_m)^{o(H)}$$
Expected number of strings in $H$ surviving reproduction + crossover + mutation:
$$m(H,t+1) \geq m(H,t)\cdot\frac{f(H)}{\text{Avg }f}\cdot [1-p_c\frac{\delta(H)}{n-1}]\cdot [1-p_m]^{o(H)}$$
Let $A_t = \frac{f(H)}{\text{Avg }f}\cdot [1-p_c\frac{\delta(H)}{n-1}]\cdot [1-p_m]^{o(H)}$. Assume that there exists a $U>1$ for all iterations $t$ up to $K$ such that $A_t \geq U$. Then,
$$m(H,t+1)\geq m(H,t)\cdot U\geq m(H, t-1)\cdot U^2$$
* Since $U>1$, this means that up to $K$ iterations the expected number of population members belonging to $H$ will increase.
* If no such $U$ exists, then the schema $H$ will gradually die out.
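
The growth factor $A_t$ can be evaluated numerically to see how schema length affects survival. The parameter values below are made up for illustration, and the function names are hypothetical.

```python
def crossover_survival(delta, n, p_c):
    """Survival bound under crossover: 1 - p_c * delta(H) / (n - 1)."""
    return 1 - p_c * delta / (n - 1)

def mutation_survival(order, p_m):
    """Survival probability under mutation: (1 - p_m) ** o(H)."""
    return (1 - p_m) ** order

def growth_factor(f_H, f_avg, delta, order, n, p_c, p_m):
    """A_t = (f(H) / Avg f) * crossover survival * mutation survival."""
    return (f_H / f_avg) * crossover_survival(delta, n, p_c) * mutation_survival(order, p_m)

# Two schemata of order 2 in strings of length n = 6, both 30% fitter than average
short = growth_factor(1.3, 1.0, delta=1, order=2, n=6, p_c=0.8, p_m=0.01)  # 11****
long_ = growth_factor(1.3, 1.0, delta=5, order=2, n=6, p_c=0.8, p_m=0.01)  # 1****1
```

With these values the short schema has $A_t > 1$ and is expected to spread, while the long schema has $A_t < 1$ and is expected to die out despite having the same fitness advantage.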

### Schema Theorem (by Goldberg)
Schema Theorem: Highly fit schemata with short defining length are the most likely to remain undisturbed and are propagated from generation to generation. These schemata receive an exponentially increasing number of trials in subsequent generations.

**Convergence**: Convergence means that for any $\delta > 0$ there is an iteration $K$ such that, in the iterations after the $K$th, the solution is within a distance $\delta$ of the minimizing solution $x$, where $f(x)$ assumes its minimizing value for a compact domain.

The Schema Theorem does NOT lead to convergence. In fact, there are examples of problems where the GA does not reach the optimum solution. It is not possible for a $U>1$ to exist for all $t$, because the count $m(H,t)$ is bounded by the population size $M$ and cannot grow exponentially forever. The "theorem" is more of an explanation of why things work for a finite number of evaluations. The Schema Theorem is taken to imply that fit, low order, short schemas get amplified. The **building block hypothesis** is that crossover builds fitter long schemas from short, low order, fit schemas.

The schema theorem also shows that the closeness of decision variables in a binary string can be important. For example, if there is reason to believe decision variable $A$ and decision variable $B$ are correlated in the benefit, should we put them close together or far apart?