### Challenge 05.
### Translate DNA Sequences to Proteins.
### *Taken from [Geeks for Geeks.](https://www.geeksforgeeks.org/dna-protein-python-3/)*

**Description**

*Translation Theory:* DNA ⇒ RNA ⇒ Protein

Life depends on the ability of cells to store, retrieve, and translate genetic instructions. These instructions are needed to make and maintain living organisms. For a long time, it was not clear which molecules were able to copy and transmit genetic information. We now know that this information is carried by deoxyribonucleic acid, or DNA, in all living things.

DNA: DNA is a discrete code physically present in almost every cell of an organism. We can think of DNA as a one-dimensional string built from four characters: A, C, G, and T. These are the first letters of the four nucleotides used to construct DNA, whose full names are Adenine, Cytosine, Guanine, and Thymine. Each three-character sequence of nucleotides, called a nucleotide triplet or codon, corresponds to one amino acid or to a stop signal. The sequence of amino acids is unique for each type of protein, and all proteins in all living things are built from the same set of just 20 amino acids.

Instructions in the DNA are first transcribed into RNA and the RNA is then translated into proteins. We can think of DNA, when read as sequences of three letters, as a dictionary of life. 
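
As a minimal sketch of these two steps, assuming a tiny hand-picked subset of the codon table, a short DNA fragment can be transcribed and then translated as follows. The names `TOY_CODONS`, `transcribe`, and `translate_toy` are illustrative only and are not part of either solution in this challenge.

```python
# Partial codon table (RNA codon -> one-letter amino acid), for illustration only.
# "_" marks a stop codon, following the convention used later in this challenge.
TOY_CODONS = {"AUG": "M", "UCU": "S", "ACU": "T", "CAC": "H", "UAA": "_"}


def transcribe(dna):
    """Transcription: replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def translate_toy(rna):
    """Translation: read the RNA three letters at a time and look up each codon."""
    usable_length = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable_length, 3)]
    return "".join(TOY_CODONS[codon] for codon in codons)


toy_dna = "ATGTCTACTCACTAA"
toy_rna = transcribe(toy_dna)
toy_protein = translate_toy(toy_rna)

print(toy_rna)      # AUGUCUACUCACUAA
print(toy_protein)  # MSTH_
```

A full translation needs all 64 codons in the table; this sketch would raise a `KeyError` for any codon outside its five entries.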

*Aim:* Convert a given sequence of DNA into its Protein equivalent. 

*Source:* Download a DNA strand as a text file from NCBI, a public web-based repository of DNA sequences. The nucleotide sample is NM_207618.2, which can be found [here](https://www.ncbi.nlm.nih.gov/nuccore/NM_207618.2).
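
Sequence records are often saved in FASTA format, where a header line starting with `>` precedes the sequence lines. The source only says "text file", so the FASTA layout is an assumption here, and the header text in the sample below is made up. Under that assumption, a rough sketch for turning such text into a clean DNA string (the name `parse_fasta_text` is hypothetical):

```python
def parse_fasta_text(text):
    """Return the sequence from FASTA-like text, skipping '>' header lines."""
    lines = [line.strip() for line in text.splitlines()]
    sequence_lines = [line for line in lines if line and not line.startswith(">")]
    return "".join(sequence_lines).upper()


# Hypothetical sample: an invented header followed by a few short sequence lines
sample = ">hypothetical_record toy example\nGGTCAGAAAA\nagccctctcc\nATG\n"
sequence = parse_fasta_text(sample)

print(sequence)  # GGTCAGAAAAAGCCCTCTCCATG
```

For a real file, the same function could be applied to the result of reading the file into a string.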

### Proposed Solution
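
The functions `load_aminoacids` and `load_codons` in this solution parse two lookup files whose contents are not shown here. Judging only from the parsing code, each line of `input_aminoacids.dat` appears to hold three fields separated by `-` (full name, three-letter abbreviation, one-letter code), and `input_codon_to_aminoacids.json` appears to map RNA codons to three-letter abbreviations or to a stop marker. The sample contents below are assumptions made for illustration, not the actual data files.

```python
import json

# Assumed file contents, inferred from the parsing code (not the real data)
aminoacids_text = "Methionine - Met - M\nSerine - Ser - S\n"
codons_json = '{"AUG": "Met", "UCU": "Ser", "UAA": "Stop"}'

# Same cleaning as load_aminoacids: drop spaces, uppercase, split on "-"
ls_aminoacids = [
    line.replace(" ", "").upper().split("-")
    for line in aminoacids_text.splitlines()
]

# Map three-letter abbreviations to one-letter codes
three_to_one = {row[1]: row[2] for row in ls_aminoacids}

# Build the codon -> one-letter dictionary, with "_" for stop codons
dc_codons = {
    codon: "_" if name.upper() == "STOP" else three_to_one[name.upper()]
    for codon, name in json.loads(codons_json).items()
}

print(dc_codons)  # {'AUG': 'M', 'UCU': 'S', 'UAA': '_'}
```

This sketch uses a dictionary lookup where the solution searches a list with `index`; both give the same mapping.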

In [31]:
### Challenge 05
### Translate DNA Sequences to Proteins
### Proposed Solution


### Import libraries
import json



### Load utilities to translate

# Load list of aminoacids
def load_aminoacids():

    # Initialize list of aminoacids
    ls_aminoacids = []

    # Fill list of aminoacids
    with open('input_aminoacids.dat') as file_aminoacids:
        for line in file_aminoacids:
            # Whitespaces are deleted, characters to uppercase, and splitting information into columns
            aminoacid_info = line.replace(" ", "").replace("\n", "").upper().split("-")
            # Append extracted aminoacid information
            ls_aminoacids.append(aminoacid_info)

    return ls_aminoacids


# Load dictionary of codons (three nucleotides in a row) to aminoacids
def load_codons(ls_aminoacids):
    
    # Open the database of codons and convert the JSON object to a dictionary
    # (the "with" block closes the file automatically)
    with open('input_codon_to_aminoacids.json') as file_codons:
        dc_codons = json.load(file_codons)
    # List of aminoacids - names of 3 characters
    ls_aminoacids_abbrv_3char = [aminoacid[1] for aminoacid in ls_aminoacids]+['STOP']

    # Assign new values to dictionary, 1 character per aminoacid instead of 3 characters
    for key, value in dc_codons.items():
        # Convert the value to uppercase
        value_aux = value.upper()
        # Special case for stop codons
        if value_aux == "STOP":
            dc_codons[key] = "_"
        # Normal cases
        else:
            # Index of the aminoacid
            index_aux = ls_aminoacids_abbrv_3char.index(value_aux)
            # Equivalent name of 1 character
            aminoacid_1char = ls_aminoacids[index_aux][2]
            # New value in the dictionary for the codon
            dc_codons[key] = aminoacid_1char
        
    return dc_codons



### Generate translation 

# Load input (DNA sequence)
def load_input():
    with open('input_dna_sequence.dat') as file_dna_sequence:
        # The entire text processing is wrapped in one line
        dna_chain = "".join([line.replace("\n", "").replace(" ", "").upper() for line in file_dna_sequence])
    return dna_chain


# RNA transcription
def rna_chain(dna_chain):
    # Just replacing nucleotide "T" by "U"
    dna_chain = dna_chain.upper().replace("T", "U")
    return dna_chain


# Find start and end of the RNA sequence
def crop_rna_sequence(rna):

    # Look for the 3-character sequence "AUG", which marks the beginning of a protein
    start_index = rna.find("AUG")
    if start_index == -1:
        raise ValueError("No start codon (AUG) found in the RNA sequence")
    rna = rna[start_index:]

    # Look for the 3-character sequences that mark the end of a protein,
    # reading codon by codon from the start codon "AUG"
    final_index = 3
    while final_index + 3 <= len(rna):
        if rna[final_index:final_index+3] in ("UAA", "UAG", "UGA"):
            # Keep the chain up to, but not including, the stop codon
            return rna[0:final_index]
        final_index += 3

    # Without this check the search would never end on a chain with no stop codon
    raise ValueError("No in-frame stop codon found after the start codon")


# Convert RNA into aminoacids
# Complete translation for a single protein
def rna2protein(rna, dc_codons):
    
    # Length of the RNA chain
    lrna = len(rna)

    # Empty string where sequence of aminoacids will be saved
    sequence_aminoacids = ""
    
    # Translation of RNA into aminoacids
    for ii in range(lrna // 3):
        # Extract codon
        codon = rna[3*ii:3*(ii+1)]
        # Add aminoacid to the sequence
        aminoacid_1char = dc_codons[codon]
        sequence_aminoacids = sequence_aminoacids + aminoacid_1char

    return sequence_aminoacids


# Verify result
def validation(real_res):
    # Load expected result
    with open('output_protein_sequence.dat') as file_protein_sequence:
        exp_res = "".join([line.replace("\n", "").replace(" ", "").upper() for line in file_protein_sequence])
    # Verification
    check_val = real_res == exp_res
    return check_val



### Full sequence of aminoacids for a single protein (main function)
def main():

    # Load list of 20 fundamental aminoacids
    ls_aminoacids = load_aminoacids()

    # Load dictionary of codons
    # Each three-letter combination of the nucleotides "U", "C", "A", and "G" represents an aminoacid or a stop signal
    dc_codons = load_codons(ls_aminoacids)

    # Load DNA sequence
    dna_chain = load_input()

    # Transcription of DNA sequence into RNA
    rna = rna_chain(dna_chain)

    # Crop RNA from the start codon to the stop codon
    rna = crop_rna_sequence(rna)

    # RNA translated into sequence of aminoacids
    sequence_aminoacids = rna2protein(rna, dc_codons)

    # Perform result validation
    check_val = validation(sequence_aminoacids)

    return dna_chain, sequence_aminoacids, check_val

In [37]:
### Testing Proposed Solution
### Translate a DNA sequence into an aminoacid sequence for a given protein (execute the main function)

# Call the main function
if __name__ == '__main__':
    dna_chain, sequence_aminoacids, check_val = main()
    print("--------------------- Is it the expected result? ---------------------")
    print(check_val)
    print("--------------------- DNA sequence ---------------------")
    print(dna_chain)
    print("--------------------- Translated sequence of aminoacids ---------------------")
    print(sequence_aminoacids)

--------------------- Is it the expected result? ---------------------
True
--------------------- DNA sequence ---------------------
GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCAGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCTCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCTTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCTCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTGAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAAACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAAGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGATTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCAGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGACCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTTTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATTGCAATGACATTTTGGTTTCGGGTTTCC

### [External Solution](https://www.geeksforgeeks.org/dna-protein-python-3/)

In [50]:
### Challenge 05
### DNA to Protein Translation
### External Solution



### Definition of the solution

# Function to read DNA chain
def read_seq(inputfile):
    with open(inputfile, "r") as f:
        seq = f.read()
    seq = seq.replace("\n", "")
    seq = seq.replace("\r", "")
    return seq

# Function to translate DNA to protein
def translate(seq):
    
    table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
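    protein = ""
    # Translate only when the sequence splits evenly into codons;
    # otherwise the empty string is returned
    if len(seq) % 3 == 0:
        for i in range(0, len(seq), 3):
            codon = seq[i:i + 3]
            protein += table[codon]
    return protein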


In [52]:
### Testing External Solution

# Initialize inputs
dna = read_seq("DNA_sequence_original.txt")
expected_protein = read_seq("amino_acid_sequence_original.txt")

# Execution
protein = translate(dna[20:935])
if protein == expected_protein:
    print('Protein:', protein)
else:
    print('Bad DNA translation')

Protein: MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC


### Final Remarks

After challenge 04, I was fascinated by DNA. It is impressive how a single long chain of characters holds the entire genetic information of a living being. I found this challenge about DNA on [Geeks for Geeks](https://www.geeksforgeeks.org/dna-protein-python-3/), and I did not hesitate to take it on.

The external solution is the one given on the aforementioned website. The proposed solution is my own approach, for which I even looked for the data sources on my own. This is why the two solutions use different inputs. The proposed solution requires the files `input_aminoacids.dat`, `input_codon_to_aminoacids.json`, `input_dna_sequence.dat`, and `output_protein_sequence.dat`, while the external solution uses `DNA_sequence_original.txt` and `amino_acid_sequence_original.txt`. Thus, the proposed solution handles more data formats, and consequently its data loading process is a little longer. In addition, the proposed solution was devised to automatically detect the start and the end of a protein. The external solution does not do this: its test cell passes a hard-coded slice, `dna[20:935]`, to `translate`.
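
To illustrate the difference, here is a rough sketch of how the slice bounds could be computed instead of hard-coded, working directly on the DNA string. The name `find_coding_region` is hypothetical, and the sketch assumes that the first `ATG` in the chain is the true start codon, which holds for the toy fragment below but is not guaranteed for every sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_coding_region(dna):
    """Return (start, end) so that dna[start:end] runs from the first ATG
    up to, but not including, the first in-frame stop codon.
    Return None if no start codon or no in-frame stop codon is found."""
    start = dna.find("ATG")
    if start == -1:
        return None
    # Step codon by codon in the reading frame set by the start codon
    for i in range(start + 3, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            return start, i
    return None


# Toy fragment: 20 leading nucleotides, a short coding region, and a trailing tail
toy = "GGTCAGAAAAAGCCCTCTCC" + "ATGTCTACTCACTAA" + "GGG"
region = find_coding_region(toy)

print(region)                     # (20, 32)
print(toy[region[0]:region[1]])   # ATGTCTACTCAC
```

The returned pair plays the same role as the bounds in `dna[20:935]`, so the resulting slice could be passed to the external `translate` function.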

A more detailed description of the DNA translation and the proposed solution can be found [here](https://github.com/wgfajardom/portfolio/blob/develop/Challenges/05_dna_to_protein_translation/explanation_dna_to_protein_translation.md).