# Background

First, not all DNA will be transcribed into RNA: so-called junk DNA appears to have no practical purpose for cellular function. Second, we can begin translation at any position along a strand of RNA, meaning that any substring of a DNA string can serve as a template for translation, as long as it begins with a start codon, ends with a stop codon, and has no other stop codons in the middle. See Figure 1. As a result, the same RNA string can actually be translated in three different ways, depending on how we group triplets of symbols into codons. For example, ...AUGCUGAC... can be translated as ...AUGCUG..., ...UGCUGA..., and ...GCUGAC..., which will typically produce wildly different protein strings.
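The three groupings can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `codons_in_frame` is hypothetical and is not part of the solution that follows.

```python
def codons_in_frame(rna, offset):
    """Split rna into complete codons, starting at offset 0, 1, or 2.

    Hypothetical helper for illustration; trailing symbols that do not
    form a complete triplet are dropped.
    """
    return [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]


# Group the example fragment from the text in each of the three frames
fragment = "AUGCUGAC"
for offset in range(3):
    print(offset, codons_in_frame(fragment, offset))
```

Each offset shifts the codon boundaries by one symbol, which is why the three frames generally translate to unrelated proteins.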

# Problem

Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.
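As a rough sketch of how the six reading frames arise, the snippet below builds the reverse complement in plain Python and then drops zero, one, or two leading nucleotides from each strand. The names `reverse_complement` and `six_reading_frames` are hypothetical helpers for illustration, and the input is assumed to contain only the symbols A, C, G, and T.

```python
# Translation table that maps each nucleotide to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every nucleotide, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_reading_frames(dna):
    """Return the six frames: three offsets on each of the two strands."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(strand[offset:])
    return frames


for frame in six_reading_frames("ATGCA"):
    print(frame)
```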

An open reading frame (ORF) is one that starts with a start codon and ends with a stop codon, with no other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.
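The definition can be checked directly on a DNA substring. The following is a minimal sketch, assuming the standard genetic code (start codon ATG; stop codons TAA, TAG, and TGA); the function name `is_open_reading_frame` is hypothetical.

```python
# Stop codons of the standard genetic code, written as DNA
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_open_reading_frame(dna):
    """Check whether dna is exactly one ORF: ATG ... stop, no inner stops."""
    # An ORF needs at least a start and a stop codon, and whole codons only
    if len(dna) < 6 or len(dna) % 3 != 0:
        return False
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return (
        codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not any(codon in STOP_CODONS for codon in codons[1:-1])
    )


print(is_open_reading_frame("ATGAAATAA"))     # start, one codon, stop
print(is_open_reading_frame("ATGTAAAAATAG"))  # contains an inner stop codon
```

Note that the shortest possible ORF, a start codon followed immediately by a stop codon, yields the one-residue protein string `M`.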

Given: A DNA string s of length at most 1 kbp in FASTA format.

Return: Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.

# Solution

We start by searching both the DNA sequence and its reverse complement for start codons. Translation then begins at each start codon index found. This is accomplished with two functions: one that finds the indices of start codons in a given sequence, and one that translates a DNA sequence into its protein string. If translation from a start codon reaches a stop codon before the sequence ends, the result is a valid candidate protein.

[BioPython](https://biopython.org/) is used here to parse the FASTA file, import a DNA codon table, and compute the reverse complement of the sequence. Please visit the BioPython documentation for more information.

In [1]:
# Import necessary packages
from Bio import SeqIO
from Bio.Data.CodonTable import standard_dna_table
from Bio.Seq import Seq

# Codon-to-amino-acid mapping and stop codons of the standard genetic code
codon_table = standard_dna_table.forward_table
stop_codons = standard_dna_table.stop_codons

In [2]:
def translate(dna):
    """Translate dna codon by codon; return None if no stop codon is reached."""
    protein_string = ''
    # Loop through the complete codons of the sequence
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i+3]
        # Stop translation of the DNA when a stop codon is met
        if codon in stop_codons:
            return protein_string
        # Translate the codon and add it to the end of the protein string
        protein_string += codon_table[codon]

    # The end of the sequence was reached and no stop codon was met, so this
    # is not an open reading frame
    return None

In [3]:
def find_start_codons(dna):
    indices = []
    # Loop through every position at which a complete codon can start
    for i in range(len(dna) - 2):
        codon = dna[i:i+3]
        if codon == 'ATG':
            indices.append(i)
    return indices

In [4]:
filename = 'input_files/rosalind_orf.txt'

with open(filename, 'r') as f:
    iterator = SeqIO.parse(f, 'fasta')
    # Read the sequence from the FASTA file and find its reverse complement 
    strand1 = str(next(iterator).seq)
    strand2 = str(Seq(strand1).reverse_complement())
    double_helix = [strand1, strand2]
    
    # This list will hold all proteins that result from translating the open reading
    # frames of the sequences; duplicates are removed afterwards
    all_proteins = []
    
    # Loop through each DNA strand and find the location of every start codon in the sequence
    for strand in double_helix:   
        start_codons = find_start_codons(strand)
        # Translate the sequence starting from the found start codons
        for index in start_codons:
            temp_str = strand[index:]
            protein = translate(temp_str)
            # Add the result to the end of the protein list if it is a valid protein
            if protein:
                all_proteins.append(protein)
        
all_proteins = set(all_proteins)

# Write all the distinct proteins to a new file
with open('output_files/output.txt', 'w') as f:
    f.write('All distinct proteins from the translated ORFs: \n')
    for protein in all_proteins:
        f.write(f'{protein}\n')

In [5]:
with open('output_files/output.txt', 'r') as f:
    print(f.read())

All distinct proteins from the translated ORFs: 
MLTFVSI
MIERRGR
MGCLTRDANRSCLPLRNKGDSA
MTIFSLLGLLRMLPYLKSKRTSTCRYSNCQKISDSQG
MCRSGFRFNLVRGVFPEKVTNALHAGQTGLANPV
MQ
MYSSKIKGTMHYTPVRLG
MEQCLVESI
MRSGNVTNGLSDA
MRVTTGLLGVVFA
MIMQ
MQVRTWGRPVSGSIFWSVSRGMRNMCPMTEWNSV
MVRASRGTDIILGGVAFVP
MRNMCPMTEWNSV
MLSGICLHW
MNHKRVNYK
MD
MCRNFSPFLYDHAVDLRLYV
MSVPREARTIRKLLVIHSLVIHASPHVGATSIRFDLLVRFSGNAEYVSYDGMEQCLVESI
M
MII
MSTRIPEAQRQPDASDYRAARCRVRLGSRSEGGRSPLGESYPWLSEIFWQLLYLHVDVRFDLRYGSIRRSPSSENIVILLLYD
MTEWNSV
MPLNMCRNFSPFLYDHAVDLRLYV
MGAFVGAQVAKI
MLPYLKSKRTSTCRYSNCQKISDSQG
MCPMTEWNSV
MHYTPVRLG
MQIEVAYL
MFTLVESVSSLALFYLLIGRGVRSCQQGFRKPSANLMRVTTGLLGVVFA

