Genotype and phenotype

Martin Bagic edited this page Aug 18, 2022 · 1 revision

Genome structure

A genome is, under the hood, a bit string structured as a 3D array. The three dimensions are named haplotype, locus, and bit.

# A genotype from a diploid individual
[
  [
    [0 1 0 0 1 1 0 1] # locus with 8 bits
    [1 0 0 1 1 0 0 1]
    [1 0 1 0 1 0 1 1]
    ...
  ] # haplotype
  [
    [0 1 0 0 1 1 0 1]
    [1 0 0 1 1 0 0 1]
    [1 0 1 0 1 0 1 1]
    ...
  ] # haplotype
] # individual

The smallest meaningful unit in the genome is a locus: an array of bits, where a bit (a single 0 or 1) is the smallest numerical unit in the genome. Each locus contains BITS_PER_LOCUS bits (usually between 4 and 10).

Every locus is meaningful in that it encodes the probability of a specific biological event (e.g. the probability of reproducing at a specific age). Different loci encode probabilities for different traits (e.g. reproduction, survival) experienced at specific ages. The number of loci depends on the number of traits modeled as well as on the maximum attainable lifespan set by the parameter MAX_LIFESPAN.

An array of loci constitutes one haplotype, and a genome can have one or two of those, depending on the REPRODUCTION_MODE (two for sexual, one for asexual).
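The haplotype / locus / bit layout described above can be sketched with plain nested lists. This is only an illustration: the function name make_genome and the argument p_one (probability that a bit starts as 1) are hypothetical and are not taken from the simulation's code.

```python
import random

def make_genome(n_haplotypes, n_loci, bits_per_locus, p_one=0.5, rng=None):
    """Return a random genome as nested lists indexed [haplotype][locus][bit].

    Hypothetical helper for illustration only; p_one is the assumed
    probability that any given bit is initialized to 1.
    """
    rng = rng or random.Random()
    return [
        [
            [1 if rng.random() < p_one else 0 for _ in range(bits_per_locus)]
            for _ in range(n_loci)
        ]
        for _ in range(n_haplotypes)
    ]

# Two haplotypes (as in sexual reproduction), 3 loci, 8 bits per locus.
genome = make_genome(n_haplotypes=2, n_loci=3, bits_per_locus=8, rng=random.Random(0))

# The three dimensions: haplotype, locus, bit.
shape = (len(genome), len(genome[0]), len(genome[0][0]))
print(shape)
```

An asexual genome would use n_haplotypes=1 with the same locus and bit dimensions.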

Phenotype structure

A phenotype is an array of real numbers representing probabilities of individual-specific biological events.

# An example of a phenotype

[
  # 50 values for survival rates
  1.000, 1.000, 1.000, 1.000, 1.000, 0.999, 1.000, 1.000, 1.000, 1.000, 0.973, 0.986, 0.993, 0.925, 0.784, ...,

  # 50 values for reproduction rates
  0.059, 0.058, 0.052, 0.075, 0.034, 0.072, 0.053, 0.043, 0.032, 0.025, 0.464, 0.461, 0.438, 0.446, 0.368, ...,

  # Neutral loci
  0.048, 0.059, 0.069, 0.060, 0.022
]

# Parameter configuration for this example:
#   MAX_LIFESPAN: 50
#   G_neut_agespecific: 5
#   MATURATION_AGE: 10

The length of the phenotype array depends on multiple factors:

  • Which traits are modeled? It is possible to model reproduction, survival and mutation rates.
  • Are traits age-dependent?
  • What is the maximum attainable lifespan set by MAX_LIFESPAN?
  • Are neutral loci modeled?

Age-dependent traits require MAX_LIFESPAN probabilities each, while age-independent traits require only one. The total length of the phenotypic array is the sum of the required probabilities over all modeled traits. For example, if MAX_LIFESPAN is 100 and only reproduction and survival are modeled (both age-dependent), the phenotype will be 200 numbers long.
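The length rule can be written as a short calculation. The function phenotype_length and the dictionary layout are assumptions made for this sketch; only the rule itself (MAX_LIFESPAN values for an age-dependent trait, one value otherwise) comes from the text.

```python
def phenotype_length(max_lifespan, traits):
    """Sum the number of probabilities needed for all modeled traits.

    traits maps a trait name to True (age-dependent) or False
    (age-independent). Hypothetical helper for illustration only.
    """
    return sum(max_lifespan if age_dependent else 1 for age_dependent in traits.values())

# The example from the text: reproduction and survival, both age-dependent.
two_traits = phenotype_length(100, {"surv": True, "repr": True})

# Adding an age-independent mutation rate contributes a single extra value.
three_traits = phenotype_length(100, {"surv": True, "repr": True, "muta": False})

print(two_traits, three_traits)
```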

Conversion of genotype to phenotype

Genomes are bit strings that have to be transformed into an array of real numbers encoding the probabilities of biological events. The conversion of a genome to a phenotype proceeds in up to four steps:

  1. application of environment map (optional)
  2. transformation of loci into real numbers
  3. pleiotropy (optional)
  4. scaling and shifting to upper and lower bounds

1. Application of environment map

This step is optional and serves to model a shifting fitness landscape or to apply periodic selection pressure. Mathematically, it consists of applying the exclusive or (XOR) operation on every genome and the environment map.

The environment map is a bit string that has the same structure/shape/dimensionality as a genome and contains only 0s at the beginning of the simulation. Applying XOR to a bit string and a string of 0s returns the original string (since X XOR 0 = X).

However, periodically (with the period length set by the parameter ENVIRONMENT_CHANGE_RATE), one bit at a random position of the environment map flips. Over time, this means that the environment map will contain some 1s. Applying XOR to a bit string and the environment map then no longer returns the original bit string but a bit string with flipped values at the positions where the environment map has 1s. The genomes do not change in composition, but the meaning of some bits is inverted. This produces the effect of a changed environment, i.e. a shifted fitness landscape, since genotypes that previously had high fitness might experience a fitness reduction, and vice versa. If those bits encode traits that are important for individuals' fitness, this operation can produce the effect of a selective pressure.
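The XOR step can be sketched as follows, using nested lists indexed [haplotype][locus][bit]. The helper names (make_empty_envmap, flip_random_bit, apply_environment) are hypothetical, and the scheduling of flips by ENVIRONMENT_CHANGE_RATE is left out; the sketch only shows that an all-zero map is a no-op and that one flipped map bit inverts the reading of exactly one genome bit.

```python
import random

def make_empty_envmap(genome):
    """Return an all-zero map with the same shape as the genome."""
    return [[[0 for _ in locus] for locus in haplotype] for haplotype in genome]

def flip_random_bit(envmap, rng):
    """Flip one bit at a random position in place; return its (haplotype, locus, bit) index."""
    h = rng.randrange(len(envmap))
    l = rng.randrange(len(envmap[h]))
    b = rng.randrange(len(envmap[h][l]))
    envmap[h][l][b] ^= 1
    return (h, l, b)

def apply_environment(genome, envmap):
    """Return genome XOR envmap without modifying the genome itself."""
    return [
        [[g ^ e for g, e in zip(g_locus, e_locus)] for g_locus, e_locus in zip(g_hap, e_hap)]
        for g_hap, e_hap in zip(genome, envmap)
    ]

genome = [[[0, 1, 0, 0], [1, 1, 0, 1]], [[1, 0, 0, 1], [0, 0, 1, 1]]]
envmap = make_empty_envmap(genome)

# With an all-zero map, the genome is read unchanged (X XOR 0 = X).
unchanged = apply_environment(genome, envmap)

# After one flip, exactly one bit is read with inverted meaning.
flipped_at = flip_random_bit(envmap, random.Random(1))
seen = apply_environment(genome, envmap)
```

Note that the stored genome is never modified; only the interpreted copy differs.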

2. Transformation of loci into real numbers

Loci, each containing an array of bits, must be mapped to real numbers, each corresponding to a specific biological trait. There are various ways of performing that mapping, each with different properties. The most common mappings are the uniform mapping, which calculates the proportion of 1s in the locus, and the binary mapping, which interprets the locus as a binary number and divides it by the maximum possible binary number for the given number of bits per locus (BITS_PER_LOCUS).
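The two mappings can be sketched for a single locus as shown here. The function names are hypothetical, and the bit order of the binary mapping (most significant bit first) is an assumption, since the text does not specify it.

```python
def interpret_uniform(locus):
    """Proportion of 1s in the locus."""
    return sum(locus) / len(locus)

def interpret_binary(locus):
    """Locus read as a binary number, divided by the largest value the locus can hold.

    Assumes the first bit is the most significant one.
    """
    value = int("".join(str(bit) for bit in locus), 2)
    return value / (2 ** len(locus) - 1)

locus = [0, 1, 0, 0, 1, 1, 0, 1]  # first locus from the genome example above

uniform_value = interpret_uniform(locus)  # 4 ones out of 8 bits
binary_value = interpret_binary(locus)    # 0b01001101 = 77, divided by 255

print(uniform_value, round(binary_value, 3))
```

Under the uniform mapping every bit contributes equally, whereas under the binary mapping a flip in a high-order bit changes the value far more than a flip in a low-order bit.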

3. Pleiotropy

This step is optional and serves to make loci affect multiple probabilities of biological events. Without this step, every real number calculated in the previous step describes the rate of a single biological trait of the individual carrying that genome. When pleiotropy is applied, a calculated value is not only assigned to one biological trait but is also weighted and added to or subtracted from other values from the previous step.

That way, a certain locus can encode the probability to reproduce at age 20 but could also, for example, reduce the probability to reproduce at age 50 by 30% of the locus value. If the real value of that locus is, for example, 0.6, the probability to reproduce at age 20 would be 60% and the reproduction rate at age 50 would be reduced by 0.6 × 0.3 = 0.18, i.e. by 18 percentage points.

With pleiotropy, the final values might exceed 1 (100%) when effects are positive or drop below 0 when effects are negative. In those cases, the values are clipped to 1 and 0, respectively.
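A minimal sketch of the weighted addition and clipping, assuming pleiotropic effects are given as (source index, target index, weight) triples. That representation and the function name apply_pleiotropy are illustrative choices, not the simulation's actual data structure.

```python
def apply_pleiotropy(values, effects):
    """Add weighted contributions of source values to target values, then clip to [0, 1].

    effects is a list of (source_index, target_index, weight) triples
    (a hypothetical representation chosen for this sketch).
    """
    result = list(values)
    for source, target, weight in effects:
        # Contributions are computed from the values before pleiotropy.
        result[target] += weight * values[source]
    return [min(1.0, max(0.0, v)) for v in result]

# Index 0: reproduction at age 20; index 1: reproduction at age 50
# (the starting value 0.5 for age 50 is made up for the example).
values = [0.6, 0.5]

# The age-20 locus reduces the age-50 rate with weight 0.3: 0.5 - 0.6 * 0.3.
reduced = apply_pleiotropy(values, [(0, 1, -0.3)])

# Large effects are clipped to the valid probability range.
clipped_high = apply_pleiotropy([0.9, 0.95], [(0, 1, 0.5)])
clipped_low = apply_pleiotropy([0.9, 0.1], [(0, 1, -0.5)])
```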

4. Scaling and shifting to upper and lower bounds

Probabilities usually take on values from the interval [0,1]. However, bounds to those probabilities can be set. The motivation to do that can be to ensure that a certain trait never drops below a certain value (e.g. to guarantee survival of at least 10%) or to limit the maximum attainable rate of a certain trait (e.g. to limit the chance of reproducing to 80%).

When bounds are set, say to the interval 10%-80%, the real values from the previous step are shifted and scaled so that what would otherwise be 0% and 100% becomes 10% and 80%, and the rest of the 0-1 interval is shifted and scaled accordingly.
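Assuming the rescaling is linear (which the description suggests but does not state outright), the step amounts to lo + value × (hi - lo). The function name scale_to_bounds is hypothetical.

```python
def scale_to_bounds(value, lo, hi):
    """Linearly map a value from [0, 1] onto [lo, hi]."""
    return lo + value * (hi - lo)

# With bounds of 10% and 80%:
lowest = scale_to_bounds(0.0, 0.1, 0.8)   # what was 0% becomes 10%
highest = scale_to_bounds(1.0, 0.1, 0.8)  # what was 100% becomes 80%
middle = scale_to_bounds(0.5, 0.1, 0.8)   # the midpoint lands halfway between the bounds

print(round(lowest, 3), round(middle, 3), round(highest, 3))
```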

Configuration of genome structure

There are four traits / types of loci that can be specified in the configuration file: surv, repr, muta and neut. The first three encode survival, reproduction and mutation rates, while the last does not influence the individuals at all.

Trait parameters start with the prefix G_, followed by the string specifying the trait (e.g. surv) and a suffix specifying some property of the modeled trait. The suffixes are the following:

  • _evolvable - specifies whether the trait will be modeled in the genomes or not (if not, it is absent from them and no loci encode it)
  • _agespecific - specifies whether there will be one locus encoding the trait or MAX_LIFESPAN number of loci so that the trait has a specific value for every age
  • _interpreter - specifies the method by which the locus is transformed into a real number
  • _lo - specifies the lowest attainable value of the phenotypic trait
  • _hi - specifies the highest attainable value of the phenotypic trait
  • _initial - specifies the proportion of loci that are initialized as 1 (and not 0) in a fresh population
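The naming scheme (prefix G_, trait name, suffix) can be enumerated mechanically. The snippet below only generates parameter names from the traits and suffixes listed above; it does not read or validate a real configuration file, and the function name parameter_names is hypothetical.

```python
TRAITS = ("surv", "repr", "muta", "neut")
SUFFIXES = ("evolvable", "agespecific", "interpreter", "lo", "hi", "initial")

def parameter_names(traits=TRAITS, suffixes=SUFFIXES):
    """Return every G_<trait>_<suffix> combination, grouped by trait."""
    return [f"G_{trait}_{suffix}" for trait in traits for suffix in suffixes]

names = parameter_names()
print(names[:6])  # the six parameters of the survival trait
```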