# Exercise 1 
------------

Based on the standard genetic code codon table, which of the three codon positions (first, second or third) is most sensitive to a mutation? That is, at which position in a codon would a nucleotide substitution be most likely to change the amino acid? How much more likely than at the least sensitive position?

Please explain your answer and show any relevant evidence you have from your work to support your answer.

>Hint: The correct answer will require computation, not intuition.

Approach: Calculate the percentage of mutations at each position that lead to a change in the encoded amino acid.
- Look at the entire codon table.
- Count the number of synonymous and non-synonymous mutations for each position.

In [1]:
import pandas as pd

# map the codons to amino acids
codon_table = {
    'UUU': 'Phe', 'UUC': 'Phe', 'UUA': 'Leu', 'UUG': 'Leu',
    'UCU': 'Ser', 'UCC': 'Ser', 'UCA': 'Ser', 'UCG': 'Ser',
    'UAU': 'Tyr', 'UAC': 'Tyr', 'UAA': 'Stop', 'UAG': 'Stop',
    'UGU': 'Cys', 'UGC': 'Cys', 'UGA': 'Stop', 'UGG': 'Trp',
    'CUU': 'Leu', 'CUC': 'Leu', 'CUA': 'Leu', 'CUG': 'Leu',
    'CCU': 'Pro', 'CCC': 'Pro', 'CCA': 'Pro', 'CCG': 'Pro',
    'CAU': 'His', 'CAC': 'His', 'CAA': 'Gln', 'CAG': 'Gln',
    'CGU': 'Arg', 'CGC': 'Arg', 'CGA': 'Arg', 'CGG': 'Arg',
    'AUU': 'Ile', 'AUC': 'Ile', 'AUA': 'Ile', 'AUG': 'Met',
    'ACU': 'Thr', 'ACC': 'Thr', 'ACA': 'Thr', 'ACG': 'Thr',
    'AAU': 'Asn', 'AAC': 'Asn', 'AAA': 'Lys', 'AAG': 'Lys',
    'AGU': 'Ser', 'AGC': 'Ser', 'AGA': 'Arg', 'AGG': 'Arg',
    'GUU': 'Val', 'GUC': 'Val', 'GUA': 'Val', 'GUG': 'Val',
    'GCU': 'Ala', 'GCC': 'Ala', 'GCA': 'Ala', 'GCG': 'Ala',
    'GAU': 'Asp', 'GAC': 'Asp', 'GAA': 'Glu', 'GAG': 'Glu',
    'GGU': 'Gly', 'GGC': 'Gly', 'GGA': 'Gly', 'GGG': 'Gly'
}

def count_mutations(codon_table):
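    '''
    Count synonymous and non-synonymous single-nucleotide substitutions at each codon position.

    A change to or from a stop codon counts as non-synonymous.

    Inputs:
      codon_table (dict): maps each RNA codon to its amino acid (or 'Stop')

    Returns:
      first_pos, second_pos, third_pos (float): fraction of substitutions at each
      position that change the encoded amino acid
    '''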
    nucleotides = ['U', 'C', 'A', 'G']
    first_changes = 0
    second_changes = 0
    third_changes = 0
    first_mut = 0
    sec_mut = 0
    third_mut = 0

    for codon, amino_acid in codon_table.items():
        # Calculate the mutations at the first position
        for n in nucleotides:
            if n != codon[0]:
                mutated_codon = n + codon[1] + codon[2]
                if codon_table.get(mutated_codon) and codon_table[mutated_codon] != amino_acid:
                    first_changes += 1
                first_mut += 1

        # second position
        for n in nucleotides:
            if n != codon[1]:
                mutated_codon = codon[0] + n + codon[2]
                if codon_table.get(mutated_codon) and codon_table[mutated_codon] != amino_acid:
                    second_changes += 1
                sec_mut += 1

        # third position
        for n in nucleotides:
            if n != codon[2]:
                mutated_codon = codon[0] + codon[1] + n
                if codon_table.get(mutated_codon) and codon_table[mutated_codon] != amino_acid:
                    third_changes += 1
                third_mut += 1

    # calculate percentage of mutations changing amino acid
    first_pos = first_changes / first_mut
    second_pos = second_changes / sec_mut
    third_pos = third_changes / third_mut

    return first_pos, second_pos, third_pos

first_pos, second_pos, third_pos = count_mutations(codon_table)

print(f"first position sensitivity: {first_pos}\nsecond position sensitivity: {second_pos}\nthird position sensitivity: {third_pos}")


first position sensitivity: 0.9583333333333334
second position sensitivity: 0.9895833333333334
third position sensitivity: 0.3333333333333333


### Results:

- First position: 95.83% of mutations lead to a change in the amino acid.
- Second position: 98.96% of mutations lead to a change in the amino acid.
- Third position: 33.33% of mutations lead to a change in the amino acid.

These results show that the second position is the most sensitive to mutation: 98.96% of substitutions there change the encoded amino acid (counting a change to or from a stop codon as a change), compared with 95.83% at the first position and 33.33% at the third. The third position is therefore the least sensitive: two out of three substitutions at that position are synonymous. This is consistent with the article linked below.

Answer: a mutation at the second position is most likely to change the encoded amino acid. It is about three times as likely to do so as a mutation at the third position (0.9896 / 0.3333 ≈ 2.97).
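
As a cross-check of the percentages above, the same tally can be done with exact fractions. This is only a sketch: it rebuilds the standard table from a one-letter residue string in UCAG order (an assumed encoding, not the dictionary used above), and the helper name `sensitivity` is made up for illustration.

```python
from fractions import Fraction

BASES = "UCAG"
# Standard genetic code in UCAG order (first base varies slowest, third fastest);
# '*' marks a stop codon. One-letter codes are an assumption of this sketch.
RESIDUES = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

code = {a + b + c: RESIDUES[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def sensitivity(position):
    """Fraction of single-base substitutions at `position` (0, 1 or 2) that change the residue."""
    changed = total = 0
    for codon, residue in code.items():
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            changed += code[mutant] != residue
    return Fraction(changed, total)

first, second, third = (sensitivity(p) for p in range(3))
print(first, second, third)   # 23/24 95/96 1/3
print(float(second / third))  # 2.96875
```

The ratio 95/32 ≈ 2.97 is the "how much more likely" figure: a second-position substitution changes the residue roughly three times as often as a third-position one.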

#### References:
Mutation effects on amino acids: https://www.thetech.org/ask-a-geneticist/articles/2022/why-three-base-codon/#:~:text=The%20third%20position%20that%20is,a%20lot%20of%20amino%20acids.


# Exercise 2
-------------

In molecular biology and genetics, GC-content (or guanine-cytosine content) is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine (G) or cytosine (C). 


![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/AT-GC.jpg/320px-AT-GC.jpg)

_Nucleotide bonds showing AT and GC pairs. Arrows point to the hydrogen bonds._

This measure indicates the proportion of G and C bases out of an implied four total bases, also including adenine and thymine in DNA and adenine and uracil in RNA. GC-content may be given for a certain fragment of DNA or RNA or for an entire genome and is typically expressed as a percentage value using the formula:

$$ \frac{G+C}{A+T+G+C} \times 100\% $$
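
The formula above can be sketched in a few lines of Python. The function name `gc_content` is hypothetical, and two choices here are assumptions rather than part of the definition: ambiguity codes such as `N` are left out of the denominator, and `U` is counted alongside `T` so the same function works for RNA.

```python
def gc_content(sequence):
    """Return GC-content as a percentage of the unambiguous bases (A, C, G, T/U)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T") + seq.count("U")
    if total == 0:
        return 0.0  # empty or fully ambiguous sequence
    return 100.0 * gc / total

print(f"{gc_content('ATGCGC'):.2f}")  # 4 of 6 bases are G or C -> 66.67
```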

GC-rich regions of a genome typically include many protein-coding genes, so determining the GC-ratios of these regions contributes to mapping gene-rich regions of the genome. Also, since G-C base pairs (three hydrogen bonds) are stronger than A-T base pairs (two hydrogen bonds), differences in GC-content can indicate differences in the properties of the DNA.

Calculate the GC-content for each sequence in the multi-sequence DNA file (`multiple_DNA.fasta`). Print out the description headers and GC-content in the following format:
```
- >description | GC-content 
- >description | GC-content 
...
```
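
One possible way to produce this output is to stream the file record by record. The sketch below is not a reference solution: `read_fasta` is a hypothetical helper, the two demo records are made up so the block runs without `multiple_DNA.fasta`, and printing the value as a percentage with two decimals is an assumed formatting choice.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Made-up records standing in for multiple_DNA.fasta
demo = io.StringIO(">seq1 demo record\nATGC\nGGCC\n>seq2 demo record\nATAT\n")

for header, seq in read_fasta(demo):
    if not seq:
        continue  # skip records with no sequence lines
    gc = 100.0 * sum(seq.upper().count(base) for base in "GC") / len(seq)
    print(f"- {header} | {gc:.2f}%")
```

To run it on the real data, replace `demo` with `open('multiple_DNA.fasta')` (adjusting the path to wherever the file lives).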

In [2]:
# Your solution here....

What are some properties and/or behaviors of genes that can be inferred from GC-content?

In [3]:
# Your answer here...

# Exercise 3
------------

A recent report has suggested that Human SARS-CoV-2 has evolved to reduce the CG dinucleotide content of its open reading frames [1]. The authors suggest that this allows the SARS-CoV-2 virus to reproduce more efficiently in host cells because its genomic RNA forms less stable structures.

It is also suggested that SARS-CoV-2 is more efficient in reproduction than other coronaviruses, because less energy is consumed in disrupting the stem-loops formed by its genomic RNA. The stability of a stem-loop structure depends on the number of hydrogen bonds formed between bases in the stem. Because C-G and T-A (U-A in RNA) base pairs are formed through three and two hydrogen bonds respectively, a viral RNA strand with a high number of C and G bases will form more stable stem-loops than one with a high number of T and A bases.
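
The hydrogen-bond argument above can be made concrete with a toy count. This is a deliberately simplified sketch: the function name `stem_hydrogen_bonds` is hypothetical, it assumes a perfectly paired stem with Watson-Crick pairs only, and it treats the bond count as a rough proxy for stability rather than a thermodynamic model.

```python
def stem_hydrogen_bonds(stem):
    """Count hydrogen bonds in a perfectly paired stem, given one of its two strands."""
    bonds = {"G": 3, "C": 3, "A": 2, "T": 2, "U": 2}
    return sum(bonds[base] for base in stem.upper())

print(stem_hydrogen_bonds("GCGCGC"))  # 6 pairs x 3 bonds = 18
print(stem_hydrogen_bonds("ATATAT"))  # 6 pairs x 2 bonds = 12
```

For two stems of the same length, the GC-rich one has half again as many hydrogen bonds, which is the intuition behind using GC-content as a stand-in for stem-loop stability.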

Using your solutions from the previous exercises, identify all the ORFs and compute the average CG-content for each genome. Assess whether this claim is consistent with the fatality ratios from Human MERS-CoV and Human SARS-CoV [2].

>The approach we are using here is simplified compared to the full experiment conducted by Wang et al.; however, it is still useful for gaining some insight.

The genome sequences can be found in the following files: `data/human_sars-cov.fasta`, `data/human_sars_cov-2.fasta`, and `data/human_mers_cov.fasta`.

| Coronavirus | ID | Fatality Ratio | No. ORFs | Avg. GC-content |
|-------------|----|----------------|------|-----------|
| Human MERS-CoV | JX869059.2 | 39% | ? | ? |
| Human SARS-CoV | KY352407.1 | 9% | ? | ? |
| Human SARS-CoV-2 | NC_045512.2 | 2.4% | ? | ? |


In [4]:
# Complete the above table with your ORF and CG-content values and briefly explain your results and conclusion

In [5]:
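def load_fasta(file_path):
    """Return the sequence from a single-record FASTA file as one string.

    Header lines (starting with '>') are skipped; all other lines are concatenated.
    """
    parts = []
    with open(file_path, 'r') as file:
        for line in file:
            if not line.startswith('>'):
                parts.append(line.strip())
    return ''.join(parts)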


In [6]:
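def find_orfs(sequence):
    """Return ORFs (ATG through the next in-frame stop codon) in reading frame 0.

    Only the first forward reading frame is scanned. Every in-frame ATG starts
    its own ORF, so ORFs that share a stop codon are counted separately.
    """
    start_codon = 'ATG'
    stop_codons = ['TAA', 'TAG', 'TGA']
    orfs = []
    # step through the sequence one codon at a time
    for i in range(0, len(sequence) - 2, 3):
        codon = sequence[i:i+3]
        if codon == start_codon:
            # extend codon by codon until the first in-frame stop codon
            for j in range(i + 3, len(sequence) - 2, 3):
                stop_codon = sequence[j:j+3]
                if stop_codon in stop_codons:
                    orfs.append(sequence[i:j+3])
                    break
    return orfs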
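
`find_orfs` above reads only reading frame 0 and reports an ORF for every in-frame `ATG`. For comparison, here is a sketch of one common alternative, not the method used for the results below: it scans all three forward frames and keeps only the first `ATG` after each stop codon. The name `find_orfs_all_frames` and the `min_codons` parameter are hypothetical, and the reverse strand is not scanned.

```python
def find_orfs_all_frames(sequence, min_codons=1):
    """Return (frame, start, end) for ORFs in the three forward reading frames.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons is the minimum number of codons before the stop codon.
    """
    stops = {"TAA", "TAG", "TGA"}
    found = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in stops:
                if (i - start) // 3 >= min_codons:
                    found.append((frame, start, i + 3))
                start = None
    return found

demo = "ATGAAAATGCCCTAAGATGGGGTGA"
print(find_orfs_all_frames(demo))  # [(0, 0, 15), (1, 16, 25)]
```

On this made-up sequence, `find_orfs` would report two overlapping frame-0 ORFs (starting at positions 0 and 6) and miss the frame-1 ORF, so ORF counts from the two approaches are not directly comparable.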


In [7]:
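def compute_cg_content(orf):
    """Return the fraction (0 to 1) of bases in `orf` that are C or G."""
    cg_count = orf.count('C') + orf.count('G')
    return cg_count / len(orf)

def average_cg_content(orfs):
    """Return the unweighted mean CG fraction over all ORFs (each ORF counts once, whatever its length)."""
    total_cg_content = sum(compute_cg_content(orf) for orf in orfs)
    return total_cg_content / len(orfs)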


In [8]:
# loading sequences
sars_cov2_seq = load_fasta('data/human_sars_cov-2.fasta')
sars_cov_seq = load_fasta('data/human_sars-cov.fasta')
mers_cov_seq = load_fasta('data/human_mers_cov.fasta')

# finding open reading frames
sars_cov2_orfs = find_orfs(sars_cov2_seq)
sars_cov_orfs = find_orfs(sars_cov_seq)
mers_cov_orfs = find_orfs(mers_cov_seq)

# calculating the average CG-content for each genome
sars_cov2_avg_cg = average_cg_content(sars_cov2_orfs)
sars_cov_avg_cg = average_cg_content(sars_cov_orfs)
mers_cov_avg_cg = average_cg_content(mers_cov_orfs)

# Print results
print(f"Human SARS-CoV-2: {len(sars_cov2_orfs)} ORFs, Avg. CG-content: {sars_cov2_avg_cg:.4f}")
print(f"Human SARS-CoV: {len(sars_cov_orfs)} ORFs, Avg. CG-content: {sars_cov_avg_cg:.4f}")
print(f"Human MERS-CoV: {len(mers_cov_orfs)} ORFs, Avg. CG-content: {mers_cov_avg_cg:.4f}")


Human SARS-CoV-2: 116 ORFs, Avg. CG-content: 0.3596
Human SARS-CoV: 150 ORFs, Avg. CG-content: 0.3773
Human MERS-CoV: 299 ORFs, Avg. CG-content: 0.4013


### Observations:

#### Average CG-Content Comparison:
- Human SARS-CoV-2: 116 ORFs, Avg. CG-content: 0.3596
- Human SARS-CoV: 150 ORFs, Avg. CG-content: 0.3773
- Human MERS-CoV: 299 ORFs, Avg. CG-content: 0.4013

From these results, we observe that Human SARS-CoV-2 has the lowest average CG-content (35.96%) among the three viruses, followed by Human SARS-CoV (37.73%), and Human MERS-CoV has the highest CG-content (40.13%).

#### Correlation with Fatality Ratios:
The fatality ratios for the three viruses are as follows:

- Human MERS-CoV: 39% fatality ratio.
- Human SARS-CoV: 9% fatality ratio.
- Human SARS-CoV-2: 2.4% fatality ratio.
  
Trend: across these three viruses, higher average CG-content coincides with a higher fatality ratio.

- MERS-CoV, which has the highest CG-content (40.13%), also has the highest fatality ratio (39%).
- SARS-CoV, with a CG-content of 37.73%, has a moderate fatality ratio (9%).
- SARS-CoV-2, with the lowest CG-content (35.96%), has the lowest fatality ratio (2.4%).
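
The ordering listed above can be checked mechanically. The sketch below uses only the three values already reported; the `pearson` helper is written out by hand for illustration. With three data points a correlation coefficient is descriptive at best and says nothing about significance or causation.

```python
from math import sqrt

# (virus, average CG-content, fatality ratio in %) as reported above
data = [
    ("MERS-CoV", 0.4013, 39.0),
    ("SARS-CoV", 0.3773, 9.0),
    ("SARS-CoV-2", 0.3596, 2.4),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

# Rank the viruses by each measure and compare the two orderings
by_cg = [name for name, cg, _ in sorted(data, key=lambda row: row[1])]
by_fatality = [name for name, _, fat in sorted(data, key=lambda row: row[2])]
same_order = by_cg == by_fatality

r = pearson([row[1] for row in data], [row[2] for row in data])
print(f"same rank order: {same_order}, Pearson r = {r:.2f}")
```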
  
#### Conclusion:
These findings are consistent with a trade-off between RNA stability and reproduction efficiency across the three coronaviruses: SARS-CoV-2, with the lowest CG-content, would form the least stable stem-loops and could therefore reproduce more efficiently within host cells, while MERS-CoV, with the highest CG-content, may reproduce less efficiently but cause more severe disease when infection occurs. With only three genomes, however, this ordering cannot establish a causal link between CG-content and fatality, and fatality ratios depend on many factors beyond genome composition.

These results align with the hypothesis proposed by Wang et al. (2020) [1]. The analysis is simplified: `find_orfs` scans a single forward reading frame and counts every in-frame ATG as a separate ORF, and it measures overall C+G content rather than CG dinucleotide frequency. It still gives a first look at the relationship between viral RNA structure, reproduction efficiency, and fatality.


References:
1. Wang, Y., Mao, JM., Wang, GD. et al. Human SARS-CoV-2 has evolved to reduce CG dinucleotide in its open reading frames. Sci Rep 10, 12331 (2020). https://doi.org/10.1038/s41598-020-69342-y
2. Peeri, N. C. et al. The SARS, MERS and novel coronavirus (COVID-19) epidemics, the newest and biggest global health threats: what lessons have we learned?. Int. J. Epidemiol. https://doi.org/10.1093/ije/dyaa033 (2020).


# Exercise 4. CpG Islands
-------------------------
CpG islands are often found in the 5' regions of vertebrate genes, so a CpG island finder can be used to highlight potential genes in a genomic sequence. They were first described by [Gardiner-Garden and Frommer (1987)](fromer-1987.pdf) [1].

The calculation is performed using a 200 bp window moving across the sequence at 1 bp intervals. CpG islands are defined as sequence ranges where the `Observed`/`Expected` value is greater than 0.6 and the `GC` content is greater than 50%. The expected number of CpG dimers in a window is calculated as the number of `C`s in the window multiplied by the number of `G`s in the window, divided by the window length.
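
The windowed calculation described above might be sketched as follows. The names `window_stats` and `find_cpg_islands` are hypothetical, coordinates are reported 1-based and inclusive (an assumption), and `N` characters are simply counted in the window length rather than handled specially. Recounting every window from scratch is fine for short sequences but far too slow for a whole chromosome, where running counts updated as the window slides would be needed.

```python
def window_stats(window):
    """Return (observed/expected CpG ratio, %GC) for one window."""
    c, g = window.count("C"), window.count("G")
    observed = window.count("CG")      # number of CpG dimers in the window
    expected = c * g / len(window)     # (#C * #G) / window length
    ratio = observed / expected if expected else 0.0
    return ratio, 100.0 * (c + g) / len(window)

def find_cpg_islands(sequence, window=200, min_ratio=0.6, min_gc=50.0):
    """Yield (start, end, ratio, gc) for each qualifying window, 1-based inclusive."""
    seq = sequence.upper()
    for i in range(len(seq) - window + 1):
        ratio, gc = window_stats(seq[i:i + window])
        if ratio > min_ratio and gc > min_gc:
            yield i + 1, i + window, ratio, gc

# Made-up sequence with a CG-rich stretch in the middle; a short window keeps the demo small
demo = "AT" * 10 + "CG" * 10 + "AT" * 10
for start, end, ratio, gc in find_cpg_islands(demo, window=20):
    print(f"CpG island detected in region {start} to {end} "
          f"(Obs/Exp = {ratio:.2f} and %GC = {gc:.2f})")
```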

In 2002, Takai and Jones published an improved algorithm for identifying CpG islands in genomes, [Comprehensive analysis of CpG islands in human chromosomes 21 and 22 - PubMed](takai-jones-2002-comprehensive-analysis-of-cpg-islands-in-human-chromosomes-21-and-22.pdf), that tweaked some of the parameters and identified some new thresholds to consider. Identification of CpG islands is still a very active research topic and today makes use of advanced mathematical and computational approaches.

Implement the Gardiner-Garden and Frommer algorithm and use it to identify all the CpG islands in Human [Chromosome 21](https://www.ncbi.nlm.nih.gov/nuccore/NC_000021?report=fasta). Due to the size of this file, you will need to download it locally and put it in the `data/` directory.

> Note: You can ignore the `N` character in the sequence because it stands for an unknown base, so you cannot tell whether it is a C or a G.


References 
1. [Gardiner-Garden and Frommer (1987) - PubMed](https://pubmed.ncbi.nlm.nih.gov/3656447/)
2. [Comprehensive analysis of CpG islands in human chromosomes 21 and 22 - PubMed](https://pubmed.ncbi.nlm.nih.gov/11891299/)

## Example of finding CpG Islands
Given the following sequence:
```
taacatacttattgtttttaactactcgttttccattcgactcatcacgctccccccccc
cccccccccccttatccgttccgttcgacgtatttcgttgtctaatttctgacgtaactt
gttccctgttaagtaccgtttatggcctatactccggtatttaaaacgacgacgattcca
ccgtaaagccgtcaaccagatgaacgacctcgctcgttatatttttccggcaaaatccct
atttccgattcgcttagtgctaccgacgctatatcgttccgcaattcctcgagatcatcg
atttcttctccggcgacgtctcaagtttttccgttacaacgcgatctatcctgtaaattc
gaccgcgctcattctcacgttttatacattgcgcagttgattacgctaaataatccgctg
actgttaccttccctgttagattcgcgcattataaactacttactttaacaaacgatttt
cacagtttaatttctgcgatgacgtctaactcttcagttttaaccgataacaaccttctc
gacacttcgtttcttataccatcctcgttatccatacccattcttaaatttctcactact
attctctttacaaccacattagctctaatcttacatctaatttctatacataaaatgctc
cttctgctgtatgtttctctttctcataattacatttttaattactaaatccctcatccc
tcccacccatctattccaccatcaaggttatacaccatgtattactgtaaaacccactaa
tattaattgtcaccgatattaaacgaaattcattcacacaaatttcattaattacctttt
cttattaattgcatatgtactctacatatactcaaccaactaaaaatcgatattttacat
ttgatttctaatgtaccccacaactttcttgctttatgattgaacttagctttataataa
tagttatttaccctaacgcatatactcttatccttatatgaaccttgcttatttgttaga
tttatccaatctaaaccacagataatatcccttctcttacttcattttattatcaccatt
ttcacttcttcctagatatatacaattatataactctattaccacattttcccttaactt
tctgttctgcactattatatttactctttttctaaaaccttcttaactttttcagatgca
```
The following CpG islands can be detected:

```
CpG island detected in region 32 to 231 (Obs/Exp = 1.75 and %GC = 50.50) 
CpG island detected in region 33 to 232 (Obs/Exp = 1.75 and %GC = 50.50)
```


## Examples of CpG Island Finding Programs
The following are online CpG island calculators to use as references.
* [CpG Islands](https://www.genscript.com/sms2/cpg_islands.html)
* [EMBOSS: cpgplot](http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/cpgplot.html)

