In [None]:
1. https://www.eecs.yorku.ca/course_archive/2018-19/F/4425/LEC/Chapter3.pdf
2. https://www.bio-recipes.ethz.ch/Dayhoff/code.html
3. http://www.icb.usp.br/~biocomp/aulas/Dayhoff.pdf
4. https://cs.wellesley.edu/~cs313/lectures/2_Mutations.pdf

In [None]:
# Dayhoff Model - Purpose and Steps

The Dayhoff Model is widely used in bioinformatics to model evolutionary changes in protein sequences. It provides a framework for understanding the substitution rates of amino acids over time, allowing researchers to infer evolutionary relationships and estimate the evolutionary distance between protein sequences.

## Purpose of Dayhoff Model

The Dayhoff Model serves several purposes:

1. **Substitution Rate Estimation:** It estimates the rate at which amino acids are substituted over evolutionary time, helping researchers quantify the evolutionary divergence between proteins.

2. **Construction of PAM Matrices:** The model forms the basis for creating Point Accepted Mutation (PAM) matrices, which are essential in scoring sequence alignments and identifying conserved regions.

3. **Homology Inference:** By understanding the mutability of amino acids and substitution probabilities, the model aids in inferring homologous relationships between proteins.

## Seven Steps of Dayhoff Model

### Step 1: Point Mutation Occurrence

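\[ P_{ii}(t) \approx e^{-\alpha t}, \qquad \sum_{j \neq i} P_{ij}(t) \approx 1 - e^{-\alpha t} \]

- \( P_{ij}(t) \) is the probability that amino acid \( i \) at a particular site has been replaced by amino acid \( j \) after time \( t \). At \( t = 0 \) nothing has changed, so \( P_{ii}(0) = 1 \) and \( P_{ij}(0) = 0 \) for \( j \neq i \).
- \( \alpha \) is a constant representing the rate of amino acid substitutions, so \( e^{-\alpha t} \) is the probability that no substitution has occurred at the site by time \( t \). The approximation ignores sites that change and later revert.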

### Step 2: Transition Probabilities Over Time (PAM Rate)

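\[ Q_{ij} = \left. \frac{d P_{ij}(t)}{dt} \right|_{t=0} \]

- \( Q_{ij} \) is the rate matrix: for \( i \neq j \) it gives the instantaneous rate of change from amino acid \( i \) to \( j \). Each diagonal entry is minus the total rate of leaving \( i \), \( Q_{ii} = -\sum_{j \neq i} Q_{ij} \), so every row sums to zero. Because the substitution rate is assumed constant, \( Q \) does not depend on \( t \).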

### Step 3: Generating Substitution Probability Matrix

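\[ P(1) = e^{Q} = \sum_{n=0}^{\infty} \frac{Q^{n}}{n!} \]

- The substitution probability matrix \( P(1) \), with entries \( P_{ij}(1) \), gives the probability of changing from amino acid \( i \) to \( j \) over one unit of evolutionary time. The exponential is a matrix exponential of the rate matrix \( Q \), not an entry-by-entry exponential.
- Dayhoff's original matrix for one PAM was estimated directly from counted accepted point mutations and scaled so that 1% of residues are expected to change.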
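Reading the exponential in Step 3 as a matrix exponential, a minimal sketch in plain Python is shown below. The 3-letter alphabet, the rate values in `Q`, and the helper names `mat_mul` and `mat_exp` are all hypothetical, chosen only for illustration. The truncated Taylor series is a simple stand-in for a proper numerical routine.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, terms=30):
    """Approximate the matrix exponential e^Q with a truncated Taylor series."""
    n = len(q)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]                               # Q^0 / 0!
    for m in range(1, terms):
        term = mat_mul(term, q)
        term = [[x / m for x in row] for row in term]               # Q^m / m!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


# Hypothetical rate matrix for a toy 3-letter alphabet: off-diagonal rates
# are non-negative and every row sums to zero.
Q = [
    [-0.010, 0.007, 0.003],
    [0.004, -0.006, 0.002],
    [0.001, 0.002, -0.003],
]

P1 = mat_exp(Q)
for row in P1:
    print([round(x, 5) for x in row], "row sum =", round(sum(row), 10))
```

Each row of the result sums to 1, as expected for a matrix of transition probabilities.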

### Step 4: Extending to \( k \) PAMs

\[ P(k) = \left[ P(1) \right]^k \]

- Extending the substitution probability matrix to \( k \) PAMs means raising the matrix from Step 3 to the \( k \)-th power by repeated matrix multiplication, not raising each entry to the \( k \)-th power.
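The difference between a matrix power and an entry-by-entry power is easy to see on a small case. The sketch below uses a hypothetical one-step matrix `P1` for a 3-letter alphabet (the values are invented for illustration, not taken from Dayhoff's data) and a helper `mat_pow` that uses repeated squaring.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, k):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(p)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    base = [row[:] for row in p]
    while k > 0:
        if k % 2 == 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k //= 2
    return result


# Hypothetical one-step substitution matrix (each row sums to 1).
P1 = [
    [0.990, 0.007, 0.003],
    [0.004, 0.994, 0.002],
    [0.001, 0.002, 0.997],
]

for k in (1, 10, 250):
    Pk = mat_pow(P1, k)
    print(k, "diagonal:", [round(Pk[i][i], 3) for i in range(3)])

# The matrix power accumulates indirect paths such as 0 -> 2 -> 1,
# which an entry-by-entry power would miss.
print(mat_pow(P1, 2)[0][1], "vs element-wise", P1[0][1] ** 2)
```

The diagonal entries shrink as \( k \) grows, while every row still sums to 1.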

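### Step 5: Creating PAM1 to PAM250 Matrices

\[ R_{ij}(k) = \frac{P_{ij}(k)}{f_j} \]

- \( P_{ij}(k) \) is an entry of the \( k \)-PAM substitution probability matrix (the \( k \)-th matrix power of the 1-PAM matrix), and \( f_j \) is the background frequency of amino acid \( j \).
- Dividing by \( f_j \) turns each probability into an odds ratio: how much more often \( j \) replaces \( i \) at evolutionary distance \( k \) than \( j \) would appear by chance.

### Step 6: Scoring Sequence Alignments

\[ S_{ij}(k) = 10 \log_{10} \left( \frac{P_{ij}(k)}{f_j} \right) \]

- Alignment scores are the logarithm of the ratio of observed to expected substitutions, scaled by 10 and rounded to the nearest integer in Dayhoff's matrices. Positive scores mark substitutions seen more often than chance, negative scores those seen less often.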
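As a minimal sketch of log-odds scoring, the code below assumes Dayhoff's convention of 10 times the base-10 logarithm of a substitution probability divided by the background frequency of the replacement amino acid. The matrix `Pk`, the frequencies `freqs`, and the function name `log_odds_scores` are hypothetical and use a toy 3-letter alphabet.

```python
import math


def log_odds_scores(prob, freqs, scale=10.0):
    """Return rounded scores scale * log10(prob[i][j] / freqs[j])."""
    n = len(freqs)
    return [[round(scale * math.log10(prob[i][j] / freqs[j])) for j in range(n)]
            for i in range(n)]


# Hypothetical substitution probabilities at some distance k
# (row = original residue, column = replacement; each row sums to 1).
Pk = [
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
]
# Hypothetical background frequencies of the three residues.
freqs = [0.5, 0.25, 0.25]

scores = log_odds_scores(Pk, freqs)
for row in scores:
    print(row)
```

The toy values were picked so that `freqs[i] * Pk[i][j]` equals `freqs[j] * Pk[j][i]`, which makes the resulting score matrix symmetric, as PAM scoring matrices are.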

### Step 7: Deriving PAM250 from PAM1

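\[ P(250) = \left[ P(1) \right]^{250} \]

- The PAM250 mutation probability matrix is the PAM1 matrix \( P(1) \) raised to the power of 250 by matrix multiplication, which is the \( k = 250 \) case of Step 4. The PAM250 scoring matrix then follows from Steps 5 and 6.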

## Logic and Explanation

- **Rate of Amino Acid Substitutions:** The model assumes that the rate of amino acid substitutions is constant over time (\( \alpha \)), allowing for the estimation of evolutionary distances.

- **Substitution Probability Matrix:** The substitution probability matrix is based on the assumption of a Markovian process, where the probability of transitioning from one state to another depends only on the current state, not on the sequence's history.

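- **Normalization and Scoring:** Dividing by the background frequency (Step 5) expresses each substitution probability relative to chance, so that scores are comparable across amino acids and evolutionary distances. Taking the logarithm (Step 6) lets the scores of aligned positions be added instead of multiplied.

- **Background Frequency in Scoring:** The denominator of the scoring ratio is the background frequency of the replacement amino acid, that is, the probability of meeting it by chance. It differs between amino acids (in Dayhoff's data from about 1% for tryptophan to about 9% for glycine), so a single constant such as 0.01 would be only a rough stand-in.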

- **PAM250 Application:** PAM250, derived from PAM1, is used when proteins share about 20% amino acid identity, providing a scoring system suitable for distantly related proteins.

The Dayhoff Model, through its stepwise approach, provides a robust method for studying protein evolution and establishing scoring matrices essential in bioinformatics analyses.


# Dayhoff’s Algorithm - Foundation

In bioinformatics, Dayhoff’s Algorithm serves as a foundational concept for understanding the evolution of protein sequences. 
Dayhoff (1978) created a base dataset comprising ***34 protein superfamilies*** grouped into ***71 phylogenetic trees***.
This dataset allows researchers to explore the range of conservation among different protein families, from
histones and glutamate dehydrogenase to immunoglobulin (Ig) chains and kappa casein.

## Accepted Point Mutations

### Dayhoff Model (Step 1)

Dayhoff proposed a model built on accepted point mutations. An amino acid change counts as accepted when a DNA mutation in a gene leads to a different amino acid being encoded and the entire species then adopts this change as the predominant form of the protein.

### PAM Rate of Proteins Used by Dayhoff et al.
Researchers often use the Dayhoff Model to compute ***Percentage of Accepted Mutations (PAM)*** rates for proteins. This involves assessing the frequency of amino acid changes in protein sequences.

### Dayhoff Model (Steps 2-5)

Dayhoff's model includes steps such as analyzing the frequency of amino acid changes, assessing mutability, determining 
mutation probabilities over 1 PAM, and constructing matrices like PAM1, PAM10, and PAM250. These matrices quantify the likelihood of amino acid substitutions over specific evolutionary distances.

## PAM Matrices

PAM matrices are crucial in bioinformatics for scoring amino acid substitutions during sequence alignments.
The choice of which PAM matrix to use depends on the evolutionary distance between sequences. For example, PAM250 is suitable when proteins share approximately 20% amino acid identity, providing a scoring system that reflects the evolutionary divergence.

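## Twilight Zone
The twilight zone is the range of low sequence similarity (taken here as 0-20% identity, elsewhere often cited as roughly 20-35%) in which an alignment may look plausible but is not statistically significant.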

Dayhoff’s Algorithm, with its emphasis on PAM matrices and accepted point mutations, lays the groundwork for bioinformaticians 
to analyze and compare protein sequences, unraveling insights into evolutionary relationships and functional implications.


In [None]:
# Accepted Point Mutations

An accepted point mutation in a protein refers to the replacement of one amino acid by another, a process accepted by natural selection. This phenomenon involves two distinct processes: firstly, a mutation occurring in the gene template, leading to the production of a different amino acid, and secondly, the acceptance of this mutation by the species as the new predominant form. For acceptance, the new amino acid typically needs to function similarly to the old one, with observed interchangeability reflecting chemical and physical similarities.

*Margaret Dayhoff*

## Dayhoff Model (Step 1)

In the Dayhoff Model, an amino acid change accepted by natural selection involves two criteria:

1. A gene undergoes a DNA mutation, resulting in the encoding of a different amino acid.
2. The entire species adopts this change as the predominant form of the protein.

## PAM Rate of Proteins Used by Dayhoff et al.

### Dayhoff Model (Step 2): Frequency of AA

### Dayhoff Model (Step 3): Mutability

### Dayhoff Model (Step 4): Mutation Prob over 1 PAM

- One PAM is defined as the unit of evolutionary divergence in which 1% of amino acids have changed between two protein sequences.
- PAM1 Mutation Probability Matrix example: 98.7% of Ala in the sequence stays the same over 1 PAM.
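Steps 2 to 4 can be sketched on a toy 3-letter alphabet. The counts and frequencies below are hypothetical, and the mutability here is simplified to the number of observed changes divided by the residue frequency. The function name `pam1_from_counts` is also an assumption. The scale factor is chosen so that, on average, 1% of residues change.

```python
def pam1_from_counts(counts, freqs):
    """Build a 1-PAM mutation probability matrix from symmetric change counts.

    counts[i][j] is the number of accepted point mutations observed between
    residues i and j; freqs[i] is the frequency of residue i.
    Row = original residue, column = replacement.
    """
    n = len(counts)
    # Total number of changes each residue takes part in.
    changes = [sum(counts[i][j] for j in range(n) if j != i) for i in range(n)]
    # Simplified relative mutability: changes per unit of occurrence.
    mutability = [changes[i] / freqs[i] for i in range(n)]
    # Scale so that the frequency-weighted probability of change is 0.01.
    scale = 0.01 / sum(freqs[i] * mutability[i] for i in range(n))
    pam1 = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                pam1[i][j] = 1.0 - scale * mutability[i]
            else:
                pam1[i][j] = scale * mutability[i] * counts[i][j] / changes[i]
    return pam1


# Hypothetical accepted point mutation counts (symmetric) and frequencies.
counts = [
    [0, 30, 10],
    [30, 0, 20],
    [10, 20, 0],
]
freqs = [0.5, 0.3, 0.2]

pam1 = pam1_from_counts(counts, freqs)
for row in pam1:
    print([round(x, 5) for x in row])

unchanged = sum(freqs[i] * pam1[i][i] for i in range(3))
print("expected fraction unchanged:", round(unchanged, 6))
```

The diagonal entries play the same role as the 98.7% figure for Ala: the probability that a residue is still itself after 1 PAM.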

### PAM 10

- Observing switches and higher penalties, e.g., D to R in PAM 10; E to N switches.

### Dayhoff Model (Step 5): PAM 250

- The PAM250 mutation probability matrix is the PAM1 matrix raised to the 250th power by matrix multiplication (a matrix power, not an entry-by-entry power).
- This matrix applies to an evolutionary distance where proteins share about 20% amino acid identity: only one in five amino acid residues remains unchanged from the original sequence.
- Like PAM1, it is not symmetric, because amino acids differ in mutability and frequency.
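The loss of identity with increasing PAM distance can be illustrated with a toy calculation. The 3-letter matrix `P1` and the frequencies `freqs` below are hypothetical, scaled so that 99% of residues are unchanged after one step. With only three letters the numbers do not reproduce the 20% figure, which depends on the real 20 x 20 PAM1 matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def fraction_unchanged(p, f):
    """Frequency-weighted probability that a residue is unchanged."""
    return sum(f[i] * p[i][i] for i in range(len(f)))


# Hypothetical 1-PAM matrix (row = original residue, column = replacement).
# It is not symmetric: P1[0][1] differs from P1[1][0].
P1 = [
    [1 - 1 / 150, 1 / 200, 1 / 600],
    [1 / 120, 1 - 1 / 72, 1 / 180],
    [1 / 240, 1 / 120, 1 - 1 / 80],
]
freqs = [0.5, 0.3, 0.2]  # hypothetical residue frequencies

Pk = [row[:] for row in P1]
identity_by_k = {1: fraction_unchanged(Pk, freqs)}
for k in range(2, 251):
    Pk = mat_mul(Pk, P1)          # Pk now holds P1 to the k-th power
    if k in (10, 100, 250):
        identity_by_k[k] = fraction_unchanged(Pk, freqs)

for k, value in identity_by_k.items():
    print(f"PAM{k}: expected fraction unchanged = {value:.3f}")
```

The expected fraction of unchanged residues starts at 0.99 for one step and falls steadily as the matrix is multiplied by itself.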

## What do the PAM matrices mean?

## Which PAM Matrix to Use?
