# Protein Sequence Study: Functional Analysis & Prediction

## Project Requirements:
- Jupyter Notebook
- Python3

## How to Run:
`jupyter notebook sequence_analysis.ipynb`

## Purpose:

The purpose of this project is to analyze protein sequences through function prediction. With the use of a large data set of DNA sequences, our project will be able to translate those sequences into a sequence of amino acids. From there, the amino acids will provide three-dimensional structure of the protein, which we will use to predict its functionality against known proteins.

## DNA to Protein Translation

When a DNA sequence is discovered, we can translate it into the corresponding protein sequence using the genetic code. The DNA codon table shows how a triplet sequence of four nucleotides (A,T,G,C) can be uniquely related to a set of 20 amino acids. This allows us to show how the information is encoded within genetic material and how it is translated into proteins by living cells. Below is an image of the codon mapping.

![title](codon_table.png)

From the DNA codon table, we can easily see nucleotide triplet mappings such as the STOP codon triplet, TAA. 

By creating our own codon mapping using python, we can then read in a DNA sequence, divide it out into a three-segment piece, and compare it to its protein letter to generate our protein sequence.

In [23]:
import re

#declare variable strings
tempStr = ""
finalSeq = ""
dna = ""

#open file, clean data (remove all lines that start with ">")
with open("sample_dna.txt","r") as f:
    for line in f:
           if re.match(">",line):
              pass
           else:
              line = line.rstrip("\n")
              dna = dna + line
                
# create the codon table mapping
table = { 
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                  
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W', 
} 
    
    
# Generate protein sequence
for i in range(0, len(string)-(3+len(string)%3), 3):
    if table[dna[i:i+3]] == '_' :
        break
    finalSeq += table[dna[i:i+3]]

# Print the protein sequence
print ("Protein Sequence: ", finalSeq)


Protein Sequence:  MPTTKKTLMFLSSFFTSLGSFIVICSIL


## Protein Function Prediction

After we have collected a series of protein sequences from our translation, we may begin to analyze their functionality. Our chosen method of prediction is homology-based function prediction. Proteins of similar sequence tend to be homologous, thus sharing similar functionality. By comparing our newly generated protein sequences to a database of known proteins, we are able to adequately predict their functionality.

In [22]:
#This specific set is reviewed proteins found within humans, 
#and contains 20,379 proteins.

#If you want to switch data sets, download your search results 
#from the above website as FASTA, and then just change the 
#extension to a .txt


fileName = "sequenceData.txt"
sequenceData = []
sequenceInfo = []

def getData():
    with open(fileName,'r') as fp:
        i = -1
        for line in fp:
            if(line[0] == '>'):
                i = i+1
                sequenceInfo.append(line)
            else:
                if(i == len(sequenceData)):
                    sequenceData.append(line)
                else:
                    sequenceData.append(line)
                    #sequenceData[i]+=line
    return

#the checkSequence() function takes in the amino acid sequence 
#and checks the file for any matching sequences, and alerts the 
#user if there are any possible matches that the passed sequence 
#contain and tells the user which sequence match is closest 
#in size.

def checkSequence(AAsequence):
    getData()
    count = 0
    bestMatch = -1
    bestMatchID = -1
    i = 0
    for entry in sequenceData:
        if(AAsequence in entry):
            print("Amino acid sequence contained within:")
            print(sequenceInfo[i])
            if(len(AAsequence)/len(entry) > bestMatch):
                bestMatch = len(AAsequence)/len(entry)
                bestMatchID = i
            count+=1
        i += 1
    print("Total number of potential matches: " + str(count))
    if(bestMatchID > -1):
        print("\nClosest potential match: \n" 
              + sequenceInfo[bestMatchID] + "\n" 
              + sequenceData[bestMatchID])
    return


checkSequence("MPTTKKTLMFLSSFFTSLGSFIVICSIL")

#checkSequence("MEVFKAPPIGI")


Amino acid sequence contained within:
>sp|Q8N1N5|CRPAK_HUMAN Cysteine-rich PAK1 inhibitor OS=Homo sapiens OX=9606 GN=CRIPAK PE=1 SV=1

Total number of potential matches: 1

Closest potential match: 
>sp|Q8N1N5|CRPAK_HUMAN Cysteine-rich PAK1 inhibitor OS=Homo sapiens OX=9606 GN=CRIPAK PE=1 SV=1

MPTTKKTLMFLSSFFTSLGSFIVICSILGTQAWITSTIAVRDSASNGSIFITYGLFRGES



## Conclusion

The three-dimensional structure of a protein has a direct relationship with its functionality. It is widely accepted that the function of an unknown protein can be determined if its structure is similar to another protein whose function is known. However, it has also been observed that some proteins that share similar global structures do not necessarily correspond to the same function. This proposes that function similarity may originate from the local structural information of a protein rather than their global shapes. In our project, we have shown a method of prediction based on sequence similarity, which uses a protein's sequence as its primary local structure.


Our program exist currently on a small scale, but can be expanded to predict functionality of unknown protein sequences in a relatively fast amount of time. By providing a means for protein function prediction, we have gained a better understanding of bioinformatics and its benefits towards human biology. Bioinformatics helps us make predictions that better industries, such as healthcare, and further scientific research as well as clinical treatments. 

## Sources:

Protein data for the function prediction program is pulled from:

https://www.uniprot.org/uniprot/?query=*&fil=organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22+AND+reviewed%3Ayes


## Contributors:

- Sercan Akcay
- Adel Alazemi
- Wyatt Pickett 
- Kaylyn Schnaus
- Skylar Trendley