# Using Sequence Alignment and Statistical Methods to Detect Gene Mutations and Codon Usage Bias

Author: Sindhuja Gundeti 
Co-Author Coder: Thaddeus Gonzalez-Serna

## About me
Ever since the pandemic, I have followed the discourse on misinformation, which led me to research the impacts of data misuse and data ethics. As a student in the 21st century, I believe these concepts are integral to understanding how technology affects society. So far I have learned a lot of CSS techniques and data modeling tips, and I have a newfound interest in probability in Excel. Computational science is a field I would like to enter because of how quickly some questions can be answered with proper statistical methods. Although coding is a struggle for me, I am happy to learn through the resources the Internet provides, and I hope that with practice I can become a better coder. Since I am most interested in data integrity, it would be beneficial to dip my toes into different coding languages.

## Background
I worked on this project with my classmate Thaddeus.
Our project started from the question of which mutations occurred when the coronavirus behind COVID-19 (SARS-CoV-2) transferred from bats to humans. Our initial understanding was that each gene has a start and a stop codon, and that three nucleotides (one codon) translate into a specific amino acid. We gathered the two strains we wanted to compare: bat and human. Both sequences come from the NCBI database, a reputable source; the bat strain originated in Russia and the human strain in Wuhan, China, where the pandemic first started. While researching the impact of the COVID mutations and how the virus spread like wildfire around the world, we came upon a concept that was new to us: codon usage bias. Codon bias refers to the phenomenon where specific codons are used more often than other synonymous codons during translation of genes, the extent of which varies within and among species. Molecular evolutionary investigations suggest that codon bias results from a balance between mutational bias and translational selection on such genes, and that the phenomenon is widespread across species and may contribute significantly to genome evolution.

## Planned Methods

Since our data are in string form, we need a function that parses each sequence into codons. Before that, note that our project relies heavily on a hard-coded dictionary that maps each codon to its amino acid. We first add the imports we need.



In [3]:
from collections import Counter
from typing import List, Any
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


Now we hard-code our codon dictionary.

In [4]:
# Codon dictionary: codons that produce the same amino acid are listed on the same line


codon_dict = {'uuu': 'phe', 'uuc': 'phe',  # phe
              'uua': 'leu', 'uug': 'leu', 'cuu': 'leu', 'cuc': 'leu', 'cua': 'leu', 'cug': 'leu',  # leu
              'ucu': 'ser', 'ucc': 'ser', 'uca': 'ser', 'ucg': 'ser', 'agu': 'ser', 'agc': 'ser',  # ser
              'uau': 'tyr', 'uac': 'tyr',  # tyr
              'ugu': 'cys', 'ugc': 'cys',  # cys
              'ugg': 'trp',  # trp
              'ccu': 'pro', 'ccc': 'pro', 'cca': 'pro', 'ccg': 'pro',  # pro
              'cau': 'his', 'cac': 'his',  # his
              'caa': 'gln', 'cag': 'gln',  # gln
              'cgu': 'arg', 'cgc': 'arg', 'cga': 'arg', 'cgg': 'arg', 'aga': 'arg', 'agg': 'arg',  # arg
              'auu': 'ile', 'auc': 'ile', 'aua': 'ile',  # ile
              'aug': 'met',  # met
              'acu': 'thr', 'acc': 'thr', 'aca': 'thr', 'acg': 'thr',  # thr
              'aau': 'asn', 'aac': 'asn',  # asn
              'aaa': 'lys', 'aag': 'lys',  # lys
              'guu': 'val', 'guc': 'val', 'gua': 'val', 'gug': 'val',  # val
              'gcu': 'ala', 'gcc': 'ala', 'gca': 'ala', 'gcg': 'ala',  # ala
              'gau': 'asp', 'gac': 'asp',  # asp
              'gaa': 'glu', 'gag': 'glu',  # glu
              'ggu': 'gly', 'ggc': 'gly', 'gga': 'gly', 'ggg': 'gly',  # gly
              'uaa': 'stop', 'uag': 'stop', 'uga': 'stop'  # stop
              }
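
Because the table is typed by hand, a quick completeness check can catch a missing or misspelled codon. The following is a minimal sketch, not part of the original notebook: `missing_codons` is a hypothetical helper, and `demo_table` is a deliberately tiny stand-in for `codon_dict`.

```python
from itertools import product


def missing_codons(table):
    """Return the RNA codons (lowercase) that are not keys of `table`."""
    all_codons = {"".join(triplet) for triplet in product("ucag", repeat=3)}
    return sorted(all_codons - set(table))


# Tiny stand-in table for illustration only; pass the full codon_dict instead.
demo_table = {"uuu": "phe", "uuc": "phe", "aug": "met"}
missing = missing_codons(demo_table)
print(len(missing))  # 64 possible codons minus the 3 listed above
```

Called on the full dictionary, an empty result means all 64 codons are present.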

## Human and Bat strains
The data we use come from the NCBI database:

Bat: https://www.ncbi.nlm.nih.gov/nuccore/2042764321

H: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta

Keep in mind that the two strains do not have equal lengths, which affects our statistical analysis.



In [8]:
# For this vignette we use only a short excerpt of each sequence because the full genomes are very long.
bat_sequence = "TTTAAAATCTGTGTATCTGTCACTAGGCTGTATGCCCAGTGCATTTACGCAGTATAGCTTTTAAACCTTTACTGTCGTTGGCAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGTCGATCATCAGCATACCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGG"
human_sequence = "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG"
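
To work with a full genome instead of a pasted excerpt, the FASTA text downloaded from NCBI can be read into a string. This is a minimal sketch under assumptions: `parse_fasta` is a hypothetical helper, the demo record is made up for illustration, and it assumes the usual FASTA layout of a `>` header line followed by sequence lines.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header to its joined sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Made-up record for illustration; real input would be the downloaded file's text.
demo_text = """>demo_record hypothetical example
ATTAAAGGTT
TATACCTTCC
"""
records = parse_fasta(demo_text)
print(records)
```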


One problem we ran into was that the two strains have different lengths, which affects any statistical comparison of raw counts. For that comparison we prefer to fix a significance level in advance (e.g., 0.01).
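
One common way to compare sequences of different lengths is to convert raw counts into relative frequencies, so each count is expressed as a share of the total. This is a minimal sketch, assuming the counts are held in a `Counter`; `relative_frequencies` and the demo counts are hypothetical and not part of the original notebook.

```python
from collections import Counter


def relative_frequencies(counts):
    """Convert raw counts to shares of the total (values sum to 1)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: value / total for key, value in counts.items()}


# Made-up counts for illustration only.
demo_counts = Counter({"aac": 3, "acg": 1})
freqs = relative_frequencies(demo_counts)
print(freqs)
```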
Next we write our counter functions. The first one steps through the sequence three nucleotides at a time, since a codon is 3 nucleotides long, and counts each codon it recognizes.


In [9]:
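def codon_counter(sequence):
    """
    Count the codons in a sequence, read as non-overlapping triplets from position 0.

    Only triplets that are keys of codon_dict are counted. The keys use the RNA
    letter 'u', so triplets containing 't' are not matched and are skipped.

    :param sequence: nucleotide sequence as a string (any case)
    :return: Counter mapping each recognized codon (lowercase) to its count
    """
    codon_list = []
    for i in range(len(sequence) // 3):
        index: int = i * 3
        curr_sequence = sequence[index: index + 3].lower()

        # keep only triplets that appear in the codon dictionary
        if curr_sequence in codon_dict:
            codon_list.append(curr_sequence)
    codon_count = Counter(codon_list)
    return codon_count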
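
The excerpts above are written as DNA (with `T`), while the dictionary keys use the RNA letter `u`, so any triplet containing `T` fails the lookup and is left out of the counts. One possible adjustment, shown here as a hedged sketch rather than the notebook's method, is to replace `t` with `u` before the lookup. `count_codons_rna` is a hypothetical name, and the set of valid codons is generated with `itertools.product` only to keep the example self-contained.

```python
from collections import Counter
from itertools import product

# All 64 RNA codons in lowercase; stands in for the keys of codon_dict.
VALID_CODONS = {"".join(t) for t in product("ucag", repeat=3)}


def count_codons_rna(sequence):
    """Count non-overlapping triplets after converting DNA letters to RNA."""
    rna = sequence.lower().replace("t", "u")
    codons = (rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    # Triplets with other characters (for example 'n') are still skipped.
    return Counter(c for c in codons if c in VALID_CODONS)


# Made-up sequence for illustration: ATG, TTT, TTT, TAA.
demo_counts = count_codons_rna("ATGTTTTTTTAA")
print(demo_counts)
```

Counts produced this way would differ from the tables printed in this vignette, which come from the original lookup.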

Next we make our amino acid counter, which reads the sequence the same way and uses the hard-coded dictionary to translate each recognized codon into its amino acid before counting.

In [10]:
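def amino_acid_counter(sequence):
    """
    Count the amino acids encoded by a sequence, read as non-overlapping triplets.

    Each triplet found in codon_dict is translated to its amino acid (or 'stop');
    triplets that are not in the dictionary are skipped.

    :param sequence: nucleotide sequence as a string (any case)
    :return: Counter mapping each amino acid abbreviation to its count
    """
    amino_acid_list = []
    for i in range(len(sequence) // 3):
        index: int = i * 3
        curr_sequence = sequence[index: index + 3].lower()

        # translate the codon through the dictionary before counting
        if curr_sequence in codon_dict:
            amino_acid_list.append(codon_dict[curr_sequence])
    amino_acid_count = Counter(amino_acid_list)
    return amino_acid_count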

## Using our methods




In [11]:
# Get data 
codon_counts_human = codon_counter(human_sequence).most_common()
amino_acid_counts_human = amino_acid_counter(human_sequence).most_common()

codon_counts_bat = codon_counter(bat_sequence).most_common()
amino_acid_counts_bat = amino_acid_counter(bat_sequence).most_common()
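
Codon usage bias is often summarized with relative synonymous codon usage (RSCU): a codon's observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below is illustrative only; `rscu`, `demo_table`, and `demo_counts` are hypothetical names and values, not results from the bat or human sequences.

```python
from collections import Counter, defaultdict


def rscu(codon_counts, codon_table):
    """Relative synonymous codon usage for every codon of each observed amino acid."""
    # Group synonymous codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, amino_acid in codon_table.items():
        families[amino_acid].append(codon)

    values = {}
    for codons in families.values():
        total = sum(codon_counts.get(codon, 0) for codon in codons)
        if total == 0:
            continue  # amino acid not observed, RSCU undefined
        expected = total / len(codons)  # count per codon under equal usage
        for codon in codons:
            values[codon] = codon_counts.get(codon, 0) / expected
    return values


# Made-up table and counts for illustration only.
demo_table = {"aaa": "lys", "aag": "lys", "aau": "asn", "aac": "asn"}
demo_counts = Counter({"aaa": 3, "aag": 1, "aac": 2})
rscu_values = rscu(demo_counts, demo_table)
print(rscu_values)
```

An RSCU above 1 means the codon is used more often than expected under equal usage; a value below 1 means it is used less often.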


To view the counts in compiled form, we use pandas to build a DataFrame table for each set of counts.

In [12]:
# human data: each list of (name, count) pairs becomes a two-column table
codon_counts_human_df = pd.DataFrame(codon_counts_human, columns=['Codon', 'Occurrence'])
amino_acid_counts_human_df = pd.DataFrame(amino_acid_counts_human, columns=['Amino_Acid', 'Occurrence'])

print("Human Sequence Data")
print(codon_counts_human_df)
print(amino_acid_counts_human_df)

# bat data
codon_counts_bat_df = pd.DataFrame(codon_counts_bat, columns=['Codon', 'Occurrence'])
amino_acid_counts_bat_df = pd.DataFrame(amino_acid_counts_bat, columns=['Amino_Acid', 'Occurrence'])

print("Bat Sequence Data")
print(codon_counts_bat_df)
print(amino_acid_counts_bat_df)

Human Sequence Data
   Codon  Occurrence
0    aac           3
1    acg           3
2    aaa           2
3    agg           2
4    caa           2
5    ccc           1
6    acc           1
7    aga           1
8    cgg           1
9    cac           1
10   gca           1
11   cag           1
12   gac           1
  Amino_Acid  Occurrence
0        arg           4
1        thr           4
2        gln           3
3        asn           3
4        lys           2
5        pro           1
6        his           1
7        ala           1
8        asp           1
Bat Sequence Data
   Codon  Occurrence
0    ggc           2
1    cac           1
2    cca           1
3    cgc           1
4    acc           1
5    agg           1
6    aca           1
7    cga           1
8    gca           1
9    acg           1
10   cag           1
11   agc           1
12   ggg           1
13   gac           1
  Amino_Acid  Occurrence
0        arg           3
1        thr           3
2        gly           3
3  

## Statistical Analysis
Now that we have a way to quickly read and sort codons, we can make inferences about the differences in amino acid occurrences between the two strains.
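
One standard way to test whether two sets of counts differ is a chi-square test of homogeneity on the table of counts per amino acid. The sketch below computes only the test statistic and degrees of freedom with the standard library; `chi_square_statistic` and the demo counts are hypothetical, and turning the statistic into a p-value (for example with `scipy.stats.chi2.sf`) is left out. With counts as small as those from short excerpts, expected values may be too low for the chi-square approximation to be reliable.

```python
from collections import Counter


def chi_square_statistic(counts_a, counts_b):
    """Chi-square statistic and degrees of freedom for a 2 x k table of counts."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples need at least one observation")
    grand_total = total_a + total_b

    # Keep only categories observed at least once in either sample.
    categories = [c for c in sorted(set(counts_a) | set(counts_b))
                  if counts_a.get(c, 0) + counts_b.get(c, 0) > 0]

    statistic = 0.0
    for category in categories:
        observed_a = counts_a.get(category, 0)
        observed_b = counts_b.get(category, 0)
        column_total = observed_a + observed_b
        for observed, row_total in ((observed_a, total_a), (observed_b, total_b)):
            # Expected count if both samples shared the same proportions.
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(categories) - 1


# Made-up counts for illustration only.
demo_human = Counter({"arg": 10, "thr": 10})
demo_bat = Counter({"arg": 20})
stat, dof = chi_square_statistic(demo_human, demo_bat)
print(stat, dof)
```

Because expected counts scale with each sample's total, the statistic already accounts for the two sequences having different lengths.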


## References

Behura SK, Severson DW. Codon usage bias: causative factors, quantification methods and genome-wide patterns: with emphasis on insect genomes. Biol Rev Camb Philos Soc. 2013 Feb;88(1):49-61. doi: 10.1111/j.1469-185X.2012.00242.x. Epub 2012 Aug 14. PMID: 22889422.

Ferré, V. M., Peiffer-Smadja, N., Visseaux, B., Descamps, D., Ghosn, J., & Charpentier, C. (2021). Omicron SARS-CoV-2 variant: What we know and what we don't. Anaesthesia, critical care & pain medicine, 41(1), 100998. Advance online publication. https://doi.org/10.1016/j.accpm.2021.100998

https://www.bioinformatics.org/sms2/codon_usage.html
This website was used as a reference point; it also contains an amino-acid counter.

I would like to thank my partner Thaddeus Gonzalez for helping me code and implement our project.