# Inferring mRNA from Protein

## Problem

For positive integers a and n, a modulo n (written _a mod n_ in shorthand) is the remainder when a is divided by n. For example, _29 mod 11=7_ because _29=11×2+7_.

Modular arithmetic is the study of addition, subtraction, multiplication, and division with respect to the modulo operation. We say that a and b are congruent modulo n if _a mod n=b mod n_; in this case, we use the notation _a≡b mod n_.

Two useful facts in modular arithmetic are that if _a≡b mod n_ and _c≡d mod n_, then _a+c≡b+d mod n_ and _a×c≡b×d mod n_. To check your understanding of these rules, you may wish to verify these relationships for _a=29_, _b=73_, _c=10_, _d=32_, and _n=11_.
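
The suggested check can be done in a few lines of Python. This is only a quick sketch using the values given above; the variable names `sum_ok` and `prod_ok` are introduced here for illustration.

```python
a, b, c, d, n = 29, 73, 10, 32, 11

# Premises: a ≡ b (mod n) and c ≡ d (mod n)
assert a % n == b % n
assert c % n == d % n

# Conclusions: sums and products stay congruent modulo n
sum_ok = (a + c) % n == (b + d) % n
prod_ok = (a * c) % n == (b * d) % n

print(sum_ok, prod_ok)  # → True True
```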

As you will see in this exercise, some Rosalind problems will ask for a (very large) integer solution modulo a smaller number to avoid the computational pitfalls that arise with storing such large numbers.
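
To get a feel for the sizes involved, the sketch below looks at a hypothetical worst case for this problem: 1000 residues that each have 6 possible codons, followed by one of 3 stop codons. The names `MOD`, `unreduced`, `digits`, and `reduced` are illustrative. Python integers have arbitrary precision, so the exact count can still be computed here, but it would not fit in a fixed-width 64-bit integer.

```python
MOD = 1_000_000

# Hypothetical worst case: 1000 residues with 6 codons each, then 3 stop codons.
unreduced = 3 * 6 ** 1000

digits = len(str(unreduced))  # number of decimal digits in the exact count
reduced = unreduced % MOD     # the value the problem actually asks for

print(digits)
print(reduced)
```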

_Given_: A protein string of length at most 1000 aa.

_Return_: The total number of different RNA strings from which the protein could have been translated, modulo 1,000,000. (Don't neglect the importance of the stop codon in protein translation.)

**Sample Dataset**

    MA
    
**Sample Output**

    12

____________
## Solution

The solution is straightforward once the codon table is stored in a convenient form. We start by building a reverse lookup table: instead of looking up a codon to get its amino acid, we look up an amino acid to get the list of all codons that encode it.

In [1]:
codon = '''UUU F      CUU L      AUU I      GUU V
UUC F      CUC L      AUC I      GUC V
UUA L      CUA L      AUA I      GUA V
UUG L      CUG L      AUG M      GUG V
UCU S      CCU P      ACU T      GCU A
UCC S      CCC P      ACC T      GCC A
UCA S      CCA P      ACA T      GCA A
UCG S      CCG P      ACG T      GCG A
UAU Y      CAU H      AAU N      GAU D
UAC Y      CAC H      AAC N      GAC D
UAA Stop   CAA Q      AAA K      GAA E
UAG Stop   CAG Q      AAG K      GAG E
UGU C      CGU R      AGU S      GGU G
UGC C      CGC R      AGC S      GGC G
UGA Stop   CGA R      AGA R      GGA G
UGG W      CGG R      AGG R      GGG G '''

# Build a reverse lookup: amino acid (or 'Stop') -> list of codons encoding it.
codon_dict = {}
tokens = codon.split()  # alternating codon, amino acid, codon, amino acid, ...
for rna, amino in zip(tokens[0::2], tokens[1::2]):
    codon_dict.setdefault(amino, []).append(rna)

print('Codon dictionary: \n', codon_dict)

Codon dictionary: 
 {'I': ['AUU', 'AUC', 'AUA'], 'Q': ['CAA', 'CAG'], 'F': ['UUU', 'UUC'], 'T': ['ACU', 'ACC', 'ACA', 'ACG'], 'S': ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'], 'Y': ['UAU', 'UAC'], 'Stop': ['UAA', 'UAG', 'UGA'], 'C': ['UGU', 'UGC'], 'W': ['UGG'], 'L': ['CUU', 'CUC', 'UUA', 'CUA', 'UUG', 'CUG'], 'R': ['CGU', 'CGC', 'CGA', 'AGA', 'CGG', 'AGG'], 'P': ['CCU', 'CCC', 'CCA', 'CCG'], 'D': ['GAU', 'GAC'], 'G': ['GGU', 'GGC', 'GGA', 'GGG'], 'H': ['CAU', 'CAC'], 'M': ['AUG'], 'N': ['AAU', 'AAC'], 'V': ['GUU', 'GUC', 'GUA', 'GUG'], 'K': ['AAA', 'AAG'], 'A': ['GCU', 'GCC', 'GCA', 'GCG'], 'E': ['GAA', 'GAG']}


With the dictionary built this way, we can walk along the amino acid chain: each amino acid we encounter multiplies the number of possible mRNA strings by _n_, where _n_ is the length of the codon list for that amino acid. Finally, we multiply by the number of stop codons, since the mRNA must end with one.
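
As a worked example, take the sample dataset `MA`. According to the codon table, M is encoded by a single codon (AUG), A by four codons, and there are three stop codons, so the count is:

$$
\underbrace{1}_{\text{M}} \times \underbrace{4}_{\text{A}} \times \underbrace{3}_{\text{Stop}} = 12,
\qquad 12 \bmod 1{,}000{,}000 = 12 .
$$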

In [2]:
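def infer(protein):
    """Count the RNA strings that translate to `protein`, modulo 1,000,000."""
    num = 1
    for am in protein:
        # Each amino acid multiplies the count by its number of codons.
        num *= len(codon_dict[am])
    # The RNA must end with one of the stop codons.
    num *= len(codon_dict['Stop'])
    return num % 1000000

print(infer('MA'))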

12
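
The function above lets the product grow to its full size and reduces it only at the end, which works in Python because its integers are unbounded. Using the multiplication rule from the problem statement, the product can instead be reduced after every step, so the running value never exceeds the modulus. The following is a minimal, self-contained sketch of that variant; `infer_mod`, `CODON_COUNTS`, `STOP_CODONS`, and `MOD` are names introduced here for illustration, and the counts are assumed to match the lengths of the lists in `codon_dict`.

```python
MOD = 1_000_000

# Number of codons per amino acid in the standard RNA codon table
# (assumed to equal the list lengths in the codon dictionary built earlier).
CODON_COUNTS = {
    'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2,
    'I': 3, 'K': 2, 'L': 6, 'M': 1, 'N': 2, 'P': 4, 'Q': 2,
    'R': 6, 'S': 6, 'T': 4, 'V': 4, 'W': 1, 'Y': 2,
}
STOP_CODONS = 3


def infer_mod(protein, mod=MOD):
    """Count candidate RNA strings for `protein`, reducing modulo `mod` at each step."""
    total = STOP_CODONS % mod  # account for the terminating stop codon
    for aa in protein:
        # (x * y) mod n == ((x mod n) * (y mod n)) mod n, so reducing here is safe
        total = (total * CODON_COUNTS[aa]) % mod
    return total


print(infer_mod('MA'))  # → 12
```

In a language with fixed-width integers, this step-by-step reduction would be required rather than optional.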
