# Gene Finder

Author: **Simrun Mutha**

In [16]:
%load_ext autoreload
%autoreload 2

In [21]:
import gene_finder
import data

## Gene Analysis of SARS-CoV-2 isolate Wuhan-Hu-1

The goal of this project is to identify candidate genes in the SARS-CoV-2 genome by building a gene finder. More specifically, the gene finder identifies potential protein-coding genes in the SARS-CoV-2 virus, and the protein BLAST search engine is then used to determine which of these genes may encode key proteins of the virus. Finding the protein-coding genes is important because, once they are identified, steps can be taken towards finding potential treatments for the virus.

The input for this gene finder is a file that contains the nucleotide sequence of the SARS-CoV-2 virus. To find the protein-coding genes, I first had to identify all the open reading frames (ORFs) in the DNA strand, because an ORF is a stretch of DNA that can be translated into a protein. An ORF begins with the start codon, ATG, and ends at a stop codon, which can be TGA, TAA or TAG. Codons are read three nucleotides at a time, so the stop codon has to be a multiple of three nucleotides away from the start codon. Because of this, I had to find ORFs in each of the three possible reading frames of a given DNA sequence. Additionally, ORFs can be found in a DNA strand or in its reverse complement, so that had to be accounted for as well. Here is an example showing all the ORFs found in the DNA strand "TCAATGAATGTGACTTGACAT".

In [20]:
gene_finder.find_all_orfs_both_strands("TCAATGAATGTGACTTGACAT")

['ATGAATGTGACT', 'ATG', 'ATGTCAAGTCACATTCAT']
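The search described above (three reading frames on each of the two strands, without nested ORFs) can be sketched as follows. This is an independent, minimal sketch and not the code in `gene_finder`; the helper names `reverse_complement_sketch`, `orfs_in_frame_sketch` and `all_orfs_both_strands_sketch` are hypothetical, and the sketch assumes the input contains only the letters A, C, G and T.

```python
# Independent sketch; these helpers are illustrative and are not taken from gene_finder.
STOP_CODONS = {"TGA", "TAA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement_sketch(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame_sketch(dna, offset):
    """Return non-nested ORFs whose codons start at offset, offset + 3, ..."""
    orfs = []
    start = None  # index of the start codon of the ORF currently open, if any
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            orfs.append(dna[start:i])  # the stop codon itself is excluded
            start = None
    if start is not None:
        # No in-frame stop codon was found: keep the rest of the strand.
        orfs.append(dna[start:])
    return orfs


def all_orfs_both_strands_sketch(dna):
    """Collect ORFs from all three frames of the strand and its reverse complement."""
    orfs = []
    for strand in (dna, reverse_complement_sketch(dna)):
        for offset in range(3):
            orfs.extend(orfs_in_frame_sketch(strand, offset))
    return orfs


print(all_orfs_both_strands_sketch("TCAATGAATGTGACTTGACAT"))
```

On the example strand this prints the same three ORFs as the cell above. The third ORF comes from the reverse complement, "ATGTCAAGTCACATTCATTGA", read in its first frame.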

After all the ORFs are found, the next step is to filter out the ORFs that are too short to be useful for producing proteins. To do this, I wrote a function that decides a cutoff length for the ORFs based on 1500 trials. Once this cutoff length is found, the gene finder will find ORFs that are longer than this cutoff length and convert those ORFs into amino acid sequences. These amino acid sequences represent potential key proteins in the virus. The following code shows the gene finder running on the SARS-CoV-2 genetic data and returning potential protein-coding genes.

In [22]:
#path containing the datafile for SARS-CoV-2
path = "/home/smutha/gene-finder/data/NC_045512.2.fa"
#running the gene finder
gene_finder.find_genes(path)

['MVPHISRQRLTKYTMADLVYALRHFDEGNCDTLKEILVTYNCCDDDYFNKKDWYDFVENPDILRVYANLGERVRQALLKTVQFCDAMRNAGIVGVLTLDNQDLNGNWYDFGDFIQTTPGSGVPVVDSYYSLLMPILTLTRALTAESHVDTDLTKPYIKWDLLKYDFTEERLKLFDRYFKYWDQTYHPNCVNCLDDRCILHCANFNVLFSTVFPPTSFGPLVRKIFVDGVPFVVSTGYHFRELGVVHNQDVNLHSSRLSFKELLVYAADPAMHAASGNLLLDKRTTCFSVAALTNNVAFQTVKPGNFNKDFYDFAVSKGFFKEGSSVELKHFFFAQDGNAAISDYDYYRYNLPTMCDIRQLLFVVEVVDKYFDCYDGGCINANQVIVNNLDKSAGFPFNKWGKARLYYDSMSYEDQDALFAYTKRNVIPTITQMNLKYAISAKNRARTVAGVSICSTMTNRQFHQKLLKSIAATRGATVVIGTSKFYGGWHNMLKTVYSDVENPHLMGWDYPKCDRAMPNMLRIMASLVLARKHTTCCSLSHRFYRLANECAQVLSEMVMCGGSLYVKPGGTSSGDATTAYANSVFNICQAVTANVNALLSTDGNKIADKYVRNLQHRLYECLYRNRDVDTDFVNEFYAYLRKHFSMMILSDDAVVCFNSTYASQGLVASIKNFKSVLYYQNNVFMSEAKCWTETDLTKGPHEFCSQHTMLVKQGDDYVYLPYPDPSRILGAGCFVDDIVKTDGTLMIERFVSLAIDAYPLTKHPNQEYADVFHLYLQYIRKLHDELTGHMLDMYSVMLTNDNTSRYWEPEFYEAMYTPHTVLQAVGACVLCNSQTSLRCGACIRRPFLCCKCCYDHVISTSHKLVLSVNPYVCNAPGCDVTDVTQLYLGGMSYYCKSHKPPISFPLCANGQVFGLYKNTCVGSDNVTDFNAIATCDWTNAGDYILANTCTERLKLFAAETLKATEETFKLSYGIATVREVLSDRELHLSWEVGKPRPP
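The filtering and translation steps described before this run can be sketched as follows. The notebook only states that the cutoff is based on 1500 trials; the sketch assumes each trial shuffles the nucleotides and records the longest ORF found in the shuffled sequence, which may differ from what `gene_finder` actually does. The names `translate`, `longest_orf_in_shuffles` and `proteins_above_cutoff` are hypothetical, and the ORF finder is passed in as a parameter (for example, `gene_finder.find_all_orfs_both_strands` from the earlier cell).

```python
import random

# Standard genetic code, built from the conventional TCAG ordering of bases.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(orf):
    """Translate an ORF codon by codon, ignoring a trailing partial codon.

    Codons with unexpected letters (such as N) are shown as "X".
    """
    usable = len(orf) - len(orf) % 3
    return "".join(CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, usable, 3))


def longest_orf_in_shuffles(dna, num_trials, find_orfs, rng):
    """Assumed cutoff rule: longest ORF length seen across shuffled copies of dna."""
    bases = list(dna)
    longest = 0
    for _ in range(num_trials):
        rng.shuffle(bases)  # shuffling keeps the base composition but destroys gene structure
        for orf in find_orfs("".join(bases)):
            longest = max(longest, len(orf))
    return longest


def proteins_above_cutoff(orfs, cutoff):
    """Keep ORFs strictly longer than the cutoff and translate them."""
    return [translate(orf) for orf in orfs if len(orf) > cutoff]


print(proteins_above_cutoff(["ATGAATGTGACT", "ATG", "ATGTCAAGTCACATTCAT"], 3))
```

With a cutoff of 3 nucleotides, the lone "ATG" from the earlier example is dropped and the other two ORFs are translated to "MNVT" and "MSSHIH". A seeded generator such as `random.Random(0)` can be passed as `rng` to make the cutoff reproducible between runs.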

A total of 18 potential protein-coding genes were found with this gene finder. I looked up each of them with a protein BLAST search and was able to find 5 of the major proteins that are significant for COVID. I found the polyprotein ORF1a, which is a precursor protein from which the virus's proteins are formed. The query length for this sequence was 4405 and the accession number was QNN96215.1. I also found the ORF1ab protein, which had a query length of 2595 and an accession number of QKV41589.1. Another protein the gene finder found was the nucleocapsid protein, which carries the genetic material. Its amino acid sequence had a query length of 419 and an accession number of QOC66901.1. Another gene corresponded to the membrane glycoprotein, which had a query length of 222 and an accession number of YP_009724393.1. The membrane glycoprotein fuses with the host's cell membrane when the virus enters. The envelope protein, which forms the outer layer of the virus, was also found; it had a query length of 75 and an accession number of YP_009724392.1. Finally, I found the surface (spike) glycoprotein, which had a query length of 1282 and an accession number of YP_009724390.1. The spike glycoprotein binds with specific receptors on the host cell.

Besides these 5 proteins, the gene finder also found several other proteins. One of them is ORF7a. This protein inhibits BST-2, an antigen that inhibits the release of virions. Hence, ORF7a prevents the body from carrying out its antiviral response. This protein has a query length of 121 and an accession number of YP_009724395.1. Another protein identified by the gene finder was ORF8, which had a query length of 121 and an accession number of YP_009724396.1.

Being able to identify and understand these proteins is important because, once they are understood, drugs can be developed that target specific actions of the proteins. For example, a drug could be developed that prevents the envelope protein from creating an outer layer around the virus. However, if a gene finder is not thoroughly checked, its results can unintentionally be harmful to others. Some of the limitations of this gene finder will be explored in the next section.

## Limitations and Future Extensions

One limitation of my code is that it will not identify some of the shorter proteins in SARS-CoV-2. Currently, the gene finder applies a cutoff length to filter out shorter ORFs, because most of them are too small to produce any useful proteins. However, along with getting rid of many useless sequences, this cutoff length may also get rid of some of the smaller proteins that are actually present in the virus. For example, the Mat-Peptide -AA protein with an accession number of YP_009725312.1 is present in the COVID genome. However, it only has a query length of 13, which is below the cutoff length of this gene finder, and so it was not found. It also looks like there are several smaller proteins that are nested under larger proteins. As the function skips over nested ORFs, it will not be able to find these proteins. Some of these proteins include helicase and endoRNAse. One way to address this limitation would be to include nested ORFs instead of skipping over them. It is important to thoroughly test and improve a tool like gene finder to maximise its benefits. Some of the ethical implications of gene finder will be explored in the next section.
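The idea of including nested ORFs could be sketched as follows: every in-frame ATG opens its own ORF, and all open ORFs end at the next in-frame stop codon. This is a minimal illustration on a single strand with hypothetical function names (`nested_orfs_in_frame`, `nested_orfs_one_strand`); it is not part of `gene_finder`, and it does not establish whether the specific proteins named above would be recovered this way.

```python
STOP_CODONS = {"TGA", "TAA", "TAG"}


def nested_orfs_in_frame(dna, offset):
    """Return every ORF in one reading frame, including ORFs nested in longer ones."""
    orfs = []
    open_starts = []  # start indices of all ORFs still waiting for a stop codon
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon == "ATG":
            open_starts.append(i)
        elif codon in STOP_CODONS:
            # One stop codon closes every ORF that is open in this frame.
            orfs.extend(dna[start:i] for start in open_starts)
            open_starts = []
    # ORFs that never reach a stop codon run to the end of the strand.
    orfs.extend(dna[start:] for start in open_starts)
    return orfs


def nested_orfs_one_strand(dna):
    """Collect nested ORFs from all three reading frames of one strand."""
    return [orf for offset in range(3) for orf in nested_orfs_in_frame(dna, offset)]


print(nested_orfs_one_strand("ATGAAAATGCCCTAG"))
```

For "ATGAAAATGCCCTAG" this returns both "ATGAAAATGCCC" and the nested "ATGCCC", whereas a non-nested search would report only the first. Nested ORFs are shorter than the ORFs that contain them, so the cutoff length would still need to be reconsidered alongside this change.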

## Contextual and Ethical Implications

Considering how drastically COVID-19 has impacted so many people all over the world, it is easy to see how a gene finder can be helpful in the context of developing drugs and treatments. As the virus impacts more and more people every single day, it becomes crucial to develop an effective vaccine or drug as soon as possible. A gene finder is a tool that can help with that process. Additionally, a gene finder can be a valuable tool in many other situations. One field where gene finders are used is transcriptomics, the study of the full set of an organism's RNA transcripts. This information is used to understand cellular differentiation and tumor formation. Bioinformatics is also used to improve the nutritional value of certain foods. For example, it has been used to add vitamin A to rice. However, gene finders could also be harmful to people. For example, companies like 23andMe collect genetic information and run it through gene finders. This raises several potential privacy concerns, because people have to consent to allowing 23andMe to use their genetic information for research purposes. As 23andMe is a for-profit company, there is a possibility that it might give away private information to other companies, which is concerning.

In conclusion, tools like gene finders are really important in the context of the current pandemic. It is important to make sure that these tools are well developed and that they are used to benefit the world.