## 'Inferring mRNA from Protein'

**Connections**: `PROT`

---


**Given**: A protein string of length at most 1000 aa.

**Return**: The total number of different RNA strings from which the protein could have been translated, modulo 1,000,000. (*Don't neglect the importance of the stop codon in protein translation*)


**Notes**:
- *Why did attempts 01-06 fail?* You initially multiplied each residue's count in the sequence by the number of codons that encode it. Instead, the number of codons must be raised to the power of the residue's count, because every occurrence is an independent codon choice. For example, 'D' has two codons, so a sequence in which 'D' appears three times contributes 2^3=8 mRNA combinations for those positions, not 2*3=6.
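
The exponent rule in the note above can be checked by brute force. The sketch below enumerates every codon choice for the peptide 'DDD', assuming the two standard aspartate codons GAU and GAC; the variable names are illustrative and not part of the notebook.

```python
from itertools import product

# Assumption: 'D' (aspartate) is encoded by exactly these two RNA codons.
d_codons = ["GAU", "GAC"]
peptide = "DDD"

# One independent codon choice per residue, so take the Cartesian product.
mrna_strings = {"".join(choice) for choice in product(d_codons, repeat=len(peptide))}

print(len(mrna_strings))               # 8 distinct mRNA strings
print(len(d_codons) ** len(peptide))   # 8, codon count raised to the residue count
print(len(d_codons) * len(peptide))    # 6, the mistaken product
```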


-----
#### Look at: *modular arithmetic*

In [1]:
a, b, c, d, n = 29, 73, 10, 32, 11

In [2]:
for x in [a,b,c,d]:
    print( str(x), "mod", str(n), "=", str( int(x%n) ) )
    print( "divmod:", str(x), str(n), "=", str( divmod(x,n) ) )

29 mod 11 = 7
divmod: 29 11 = (2, 7)
73 mod 11 = 7
divmod: 73 11 = (6, 7)
10 mod 11 = 10
divmod: 10 11 = (0, 10)
32 mod 11 = 10
divmod: 32 11 = (2, 10)


In [3]:
print(a+c)
print((a+c) % n)
print(divmod(a+c, n))

39
6
(3, 6)


In [4]:
print(a-c)
print((a-c) % n)
print(divmod(a-c, n))

19
8
(1, 8)


In [5]:
print(a*c)
print((a*c) % n)
print(divmod(a*c, n))

290
4
(26, 4)


In [6]:
((a+c) % n) == ((b+d) % n)

True

In [7]:
((a-c) % n) == ((b-d) % n)

True

In [8]:
((a*c) % n) == ((b*d) % n)

True

In [9]:
del a,b,c,d,n

In [10]:
def power_mod(base, exponent, modulus):
    return pow(base, exponent, modulus)

print(2**10)
print()

print(2**10 // 100, "r", 2**10 % 100)
print()

print(power_mod(2, 10, 100))


1024

10 r 24

24


In [11]:
print(2**10)
print()
print(power_mod(2, 10, 100))

1024

24
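
Three-argument `pow` reduces after every multiplication, so the intermediate value stays below the modulus. A quick check, with a base and exponent chosen only for illustration (six codons, as for 'L', raised to a count of 1000):

```python
base, exponent, modulus = 6, 1000, 1_000_000

# Naive route: build the full integer first, then reduce it.
full_value = base ** exponent
naive = full_value % modulus

# Built-in modular exponentiation: never materialises the full integer.
fast = pow(base, exponent, modulus)

print(len(str(full_value)), "digits in the unreduced value")
print(naive == fast)  # True
```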


----

#### Modular Arithmetic with mRNA-protein sequence

In [12]:
# Libraries to load:
import os, time, random

datapath = os.getcwd()
std_path = datapath.replace('stronghold_level02','standards')

sample_dataset_path = datapath + '/datasets/rosalind_sample_dataset_mrna.txt'
submission_dataset  = datapath + '/datasets/rosalind_mrna.txt'


In [33]:
def sequence_from_file(path_to_file) -> str:
    '''
    Load file with only one sequence
    return sequence
    '''
    with open(path_to_file, "r") as infile:
        return infile.read().rstrip('\n')


def mrna_codon_table(path_to_standards):
    '''
    Load the standard file that lists every unique 3-char mRNA codon with its AA-residue
    Return dictionary- key:value = RNA:AA pairings, including Stop codons
    '''
    with open(path_to_standards + "/rosalind_RNA_codon_table.txt", "r") as infile:
        rna_table      = infile.read().replace('\n', ' ').split()
        rna_codon_dict = {rna_table[i]:rna_table[i+1] for i in range(len(rna_table)) if i%2==0}
        del rna_table
    return rna_codon_dict


def mrna_codon_frequency_table(dict_mrna_codon_table):
    '''
    Load dictionary that has each unique (3-char mRNA):(1-char AA-residue) pairing 
    Return frequency dictionary- key:value = AA-residue:freq, including Stop codons
    '''
    rna_codon_counter_dict = {i:0 for i in sorted(['A','R','N','D','C','Q','E','G','H','I','L','K','M','F','P','S','T','W','Y','V','Stop'])}
    for codon, counter in rna_codon_counter_dict.items():
        ct = 0
        for key, val in dict_mrna_codon_table.items():
            if val == codon:
                ct += 1
        rna_codon_counter_dict[codon] = ct
        del ct
    return rna_codon_counter_dict
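
The same per-residue codon counts can also be obtained with `collections.Counter`. The sketch below builds the standard RNA codon table inline instead of reading `rosalind_RNA_codon_table.txt`, so the compact amino-acid string and the `*` marker for Stop are assumptions of this example, not the notebook's file format.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered U, C, A, G at each position; '*' marks Stop.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

# Pair each of the 64 codons with its residue.
codon_table = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Count how many codons map to each residue (or to Stop).
codon_counts = Counter(codon_table.values())

print(codon_counts["L"], codon_counts["M"], codon_counts["*"])  # 6 1 3
```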



In [61]:
starttime = time.time()

rna_codon_counter_dict = mrna_codon_frequency_table( mrna_codon_table(std_path) )

aa_residues      = sorted(['A','R','N','D','C','Q','E','G','H','I','L','K','M','F','P','S','T','W','Y','V'])
protein_sequence = 'M'+''.join( [random.choice(aa_residues) for _ in range(40)] )
aa_len = len(protein_sequence)

print(protein_sequence,": "+str(aa_len)+' residues in length')
print()
print( ''.join(sorted(protein_sequence) ) )
print()
aa_res_counter = {i:0 for i in aa_residues}
for i in range(aa_len):
    aa_res_counter[protein_sequence[i]] += 1 

if aa_len != sum(aa_res_counter.values()):
    print('Error')
aa_res_counter = {key:aa_res_counter[key] for key in aa_res_counter.keys() 
                  if aa_res_counter[key] != 0}

mrna_seqs_counter = {i:0 for i in sorted(list(aa_res_counter.keys()))}
mrna_seqs_counter['Stop'] = 3
for key, val in aa_res_counter.items():
    mrna_seqs_counter[key] = rna_codon_counter_dict[key] ** aa_res_counter[key]

print("Frequency of each residue in sequence:"+'\t', aa_res_counter)
print("Count of mRNA codons/aa-residue:"+'\t', rna_codon_counter_dict)
print("Frequency of potential mRNA codons in sequence:"+'\t', mrna_seqs_counter)
print()


print("The total number of possible mRNA combinations, incrementing by AA-residue:")
rna_cts     = 1
rna_mod_cts = 1

for key, val in mrna_seqs_counter.items():
    rna_mod_cts  = (rna_mod_cts*int(val))%1000000
    rna_cts     *= int(val)
    print(str(rna_cts), str(rna_mod_cts), str( divmod( rna_cts,1000000 )) )

print("Program execution time:", str( round(time.time() - starttime, 5)))

del rna_cts, rna_mod_cts, mrna_seqs_counter, aa_res_counter, protein_sequence, aa_len, aa_residues, rna_codon_counter_dict, starttime

METMSFNRALDFPIKKGCSYEINMDWRVMYTADCVCHIRST : 41 residues in length

AACCCDDDEEFFGHIIIKKLMMMMNNPRRRSSSTTTVVWYY

Frequency of each residue in sequence:	 {'A': 2, 'C': 3, 'D': 3, 'E': 2, 'F': 2, 'G': 1, 'H': 1, 'I': 3, 'K': 2, 'L': 1, 'M': 4, 'N': 2, 'P': 1, 'R': 3, 'S': 3, 'T': 3, 'V': 2, 'W': 1, 'Y': 2}
Count of mRNA codons/aa-residue:	 {'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2, 'I': 3, 'K': 2, 'L': 6, 'M': 1, 'N': 2, 'P': 4, 'Q': 2, 'R': 6, 'S': 6, 'Stop': 3, 'T': 4, 'V': 4, 'W': 1, 'Y': 2}
Frequency of potential mRNA codons in sequence:	 {'A': 16, 'C': 8, 'D': 8, 'E': 4, 'F': 4, 'G': 4, 'H': 2, 'I': 27, 'K': 4, 'L': 6, 'M': 1, 'N': 4, 'P': 4, 'R': 216, 'S': 216, 'T': 64, 'V': 16, 'W': 1, 'Y': 4, 'Stop': 3}

The total number of possible mRNA combinations, incrementing by AA-residue:
16 16 (0, 16)
128 128 (0, 128)
1024 1024 (0, 1024)
4096 4096 (0, 4096)
16384 16384 (0, 16384)
65536 65536 (0, 65536)
131072 131072 (0, 131072)
3538944 538944 (3, 538944)
14155776 155776 (14, 15

In [62]:
# --------------------------------------------------------------------------------------------------------------
# TRY: (a x b) mod m = ( (a mod m) x (b mod m) ) mod m
# --------------------------------------------------------------------------------------------------------------

starttime = time.time()

# -!- Nothing changes in storing data
rna_codon_counter_dict = mrna_codon_frequency_table( mrna_codon_table(std_path) )
aa_residues      = sorted(['A','R','N','D','C','Q','E','G','H','I','L','K','M','F','P','S','T','W','Y','V'])
protein_sequence = 'M'+''.join( [random.choice(aa_residues) for _ in range(40)] )
aa_len = len(protein_sequence)
print(protein_sequence,": "+str(aa_len)+' residues in length')
print()
print( ''.join(sorted(protein_sequence) ) )
print()
aa_res_counter = {i:0 for i in aa_residues}
for i in range(aa_len):
    aa_res_counter[protein_sequence[i]] += 1 
if aa_len != sum(aa_res_counter.values()):
    print('Error')
aa_res_counter = {key:aa_res_counter[key] for key in aa_res_counter.keys() 
                  if aa_res_counter[key] != 0}
mrna_seqs_counter = {i:0 for i in sorted(list(aa_res_counter.keys()))}
mrna_seqs_counter['Stop'] = 3
for key, val in aa_res_counter.items():
    mrna_seqs_counter[key] = rna_codon_counter_dict[key] ** aa_res_counter[key]
print("Frequency of each residue in sequence:"+'\t', aa_res_counter)
print("Count of mRNA codons/aa-residue:"+'\t', rna_codon_counter_dict)
print("Frequency of potential mRNA codons in sequence:"+'\t', mrna_seqs_counter)
print()
# ----------------------------------------------------------------------------------------------------------------------

print("The total number of possible mRNA combinations, incrementing by AA-residue:")
rna_cts = 1

for key, val in mrna_seqs_counter.items():
    a = rna_cts
    b = int(val)

    print( str( divmod(a*b, 1000000) ),'\n',
    "a mod m", str((a%1000000)),'\n',
    "b mod m", str((b%1000000)),'\n',
    "(axb) mod m", str( (a*b)%1000000 ),'\n',
    "((a mod m)x(b mod m)) mod m", str( ( (a%1000000)*(b%1000000) )%1000000 )
    )
    
    rna_cts *= int(val)
    del a,b

print("Program execution time:", str( round(time.time() - starttime, 5)))

#del rna_cts, pi_seq, mrna_seqs_counter, aa_res_counter, protein_sequence, aa_len, aa_residues, rna_codon_counter_dict, starttime

MFKECVDGWTKVRYHIPNTHKASTLHGQPMTMTAPESFPPD : 41 residues in length

AACDDEEFFGGHHHIKKKLMMMNPPPPPQRSSTTTTTVVWY

Frequency of each residue in sequence:	 {'A': 2, 'C': 1, 'D': 2, 'E': 2, 'F': 2, 'G': 2, 'H': 3, 'I': 1, 'K': 3, 'L': 1, 'M': 3, 'N': 1, 'P': 5, 'Q': 1, 'R': 1, 'S': 2, 'T': 5, 'V': 2, 'W': 1, 'Y': 1}
Count of mRNA codons/aa-residue:	 {'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2, 'I': 3, 'K': 2, 'L': 6, 'M': 1, 'N': 2, 'P': 4, 'Q': 2, 'R': 6, 'S': 6, 'Stop': 3, 'T': 4, 'V': 4, 'W': 1, 'Y': 2}
Frequency of potential mRNA codons in sequence:	 {'A': 16, 'C': 2, 'D': 4, 'E': 4, 'F': 4, 'G': 16, 'H': 8, 'I': 3, 'K': 8, 'L': 6, 'M': 1, 'N': 2, 'P': 1024, 'Q': 2, 'R': 6, 'S': 36, 'T': 1024, 'V': 16, 'W': 1, 'Y': 2, 'Stop': 3}

The total number of possible mRNA combinations, incrementing by AA-residue:
(0, 16) 
 a mod m 1 
 b mod m 16 
 (axb) mod m 16 
 ((a mod m)x(b mod m)) mod m 16
(0, 32) 
 a mod m 16 
 b mod m 2 
 (axb) mod m 32 
 ((a mod m)x(b mod m)) mod m 32
(0, 128) 
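
The identity tried in this cell is what allows the running product to be reduced at every step. A small check, using the per-residue factors printed for this run as sample input (the variable names are illustrative):

```python
from math import prod

MODULUS = 1_000_000
# Factors copied from the "Frequency of potential mRNA codons" line of this run.
factors = [16, 2, 4, 4, 4, 16, 8, 3, 8, 6, 1, 2, 1024, 2, 6, 36, 1024, 16, 1, 2, 3]

# Reduce once at the end ...
reduce_at_end = prod(factors) % MODULUS

# ... or reduce after every multiplication.
running = 1
for factor in factors:
    running = ((running % MODULUS) * (factor % MODULUS)) % MODULUS

print(reduce_at_end == running)  # True
```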

In [76]:
def protein_mrna_combinations(protein_sequence: str,
                              modulo: int,
                              dict_mrna_codon_frequency,
                              verbose=False) -> int:
    '''
    Load in the protein sequence string, the modulo value for modular arithmetic, and the standard mRNA codon frequency dictionary
    Return is __verbose__-dependent:
        - default: return the protein_sequence_combination mod __modulo__ integer value
        - True: print outs of dictionaries, and the modular arithmetic value through each protein_sequence_combination iteration
    '''
    # Count frequency of each residue in sequence and compile into dictionary ('Stop':3 is added to mrna_seqs_counter below)
    aa_res_counter = {i:0 for i in sorted(['A','R','N','D','C','Q','E','G','H','I','L','K','M','F','P','S','T','W','Y','V'])}

    for i in range(len(protein_sequence)):
        aa_res_counter[protein_sequence[i]] += 1
    
    # -!- to save space, remove anything where the value is 0
    aa_res_counter = {key:aa_res_counter[key]
                      for key in aa_res_counter.keys()
                      if aa_res_counter[key] != 0}
    
    # Compile dictionary of (number of mRNA codons per residue) ** (residue frequency), plus the 3 Stop codons
    mrna_seqs_counter = {i:0 for i in sorted(list(aa_res_counter.keys()))}
    mrna_seqs_counter['Stop'] = 3
    for key, val in aa_res_counter.items():
        mrna_seqs_counter[key] =  dict_mrna_codon_frequency[key] ** aa_res_counter[key]
    
    if verbose == True:
        print(str(len(protein_sequence))+' residues in length')
        print(protein_sequence)
        print()
        print(''.join(sorted(protein_sequence)))
        print()
        print("Frequency of each residue in sequence:"+'\t', aa_res_counter)
        print("Count of mRNA codons/aa-residue:"+'\t', {key:dict_mrna_codon_frequency[key]
                      for key in aa_res_counter.keys()
                      if aa_res_counter[key] != 0})
        print("Frequency of potential mRNA codons in sequence:"+'\t', mrna_seqs_counter)
        print()

    # Iteratively multiply the values in the previous dictionary to get total number of mRNA string combinations
    #rna_cts = 1
    #for key, val in mrna_seqs_counter.items():
    #    rna_cts *= val
    #    if verbose == True:
    #        print(str(rna_cts)+" mod "+str(modulo)+" =", str( divmod(rna_cts, modulo) ))

    # -!- reduce modulo at each step so the running product never grows large
    rna_cts = 1
    for key, val in mrna_seqs_counter.items():
        if verbose == True:
            print("("+str(rna_cts)+" x "+str(val)+") mod "+str(modulo)+" =", str( (rna_cts*val)%modulo ))
        rna_cts = (rna_cts*val)%modulo
            
    del mrna_seqs_counter, aa_res_counter
    
    # Return the rna_cts mod modulo value
    return (rna_cts % modulo)
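
For comparison, the count can also be accumulated in a single pass over the sequence, multiplying by each residue's codon count and reducing as it goes. This is a sketch rather than the notebook's function: `count_mrna_strings` and the hard-coded `CODON_COUNTS` table (standard genetic code) are names introduced here.

```python
# Number of RNA codons per residue in the standard genetic code.
CODON_COUNTS = {
    'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2, 'I': 3,
    'K': 2, 'L': 6, 'M': 1, 'N': 2, 'P': 4, 'Q': 2, 'R': 6, 'S': 6,
    'T': 4, 'V': 4, 'W': 1, 'Y': 2,
}
STOP_CODONS = 3

def count_mrna_strings(protein: str, modulus: int = 1_000_000) -> int:
    """Count the RNA strings that translate to `protein`, modulo `modulus`."""
    total = STOP_CODONS % modulus  # every mRNA ends in one of three Stop codons
    for residue in protein:
        # Each residue is an independent codon choice; reduce after every step.
        total = (total * CODON_COUNTS[residue]) % modulus
    return total

print(count_mrna_strings("MA"))  # 12
```

Because the product is reduced after each residue, no intermediate value exceeds the modulus times six, whatever the sequence length.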

In [77]:
starttime = time.time()

protein_seq   = 'MA'

print(protein_mrna_combinations(protein_seq, 
                                1000000, 
                                mrna_codon_frequency_table(mrna_codon_table(std_path)),
                                verbose=True))

print("Program execution time:", str( round(time.time() - starttime, 5)))

del starttime, protein_seq


2 residues in length
MA

AM

Frequency of each residue in sequence:	 {'A': 1, 'M': 1}
Count of mRNA codons/aa-residue:	 {'A': 4, 'M': 1}
Frequency of potential mRNA codons in sequence:	 {'A': 4, 'M': 1, 'Stop': 3}

(1 x 4) mod 1000000 = 4
(4 x 1) mod 1000000 = 4
(4 x 3) mod 1000000 = 12
12
Program execution time: 0.00557


In [78]:
starttime = time.time()

print(protein_mrna_combinations('M'+''.join( [random.choice(['A','R','N','D','C','Q','E','G','H','I','L','K','M','F','P','S','T','W','Y','V']) for _ in range(35)] ), 
                                1000000, 
                                mrna_codon_frequency_table(mrna_codon_table(std_path)),
                                verbose=True))

print("Program execution time:", str( round(time.time() - starttime, 5)))

del starttime


36 residues in length
MKRKKEVIANEDMQEWLFFQWYMYRMWSCTKSLIPG

ACDEEEFFGIIKKKKLLMMMMNPQQRRSSTVWWWYY

Frequency of each residue in sequence:	 {'A': 1, 'C': 1, 'D': 1, 'E': 3, 'F': 2, 'G': 1, 'I': 2, 'K': 4, 'L': 2, 'M': 4, 'N': 1, 'P': 1, 'Q': 2, 'R': 2, 'S': 2, 'T': 1, 'V': 1, 'W': 3, 'Y': 2}
Count of mRNA codons/aa-residue:	 {'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'I': 3, 'K': 2, 'L': 6, 'M': 1, 'N': 2, 'P': 4, 'Q': 2, 'R': 6, 'S': 6, 'T': 4, 'V': 4, 'W': 1, 'Y': 2}
Frequency of potential mRNA codons in sequence:	 {'A': 4, 'C': 2, 'D': 2, 'E': 8, 'F': 4, 'G': 4, 'I': 9, 'K': 16, 'L': 36, 'M': 1, 'N': 2, 'P': 4, 'Q': 4, 'R': 36, 'S': 36, 'T': 4, 'V': 4, 'W': 1, 'Y': 4, 'Stop': 3}

(1 x 4) mod 1000000 = 4
(4 x 2) mod 1000000 = 8
(8 x 2) mod 1000000 = 16
(16 x 8) mod 1000000 = 128
(128 x 4) mod 1000000 = 512
(512 x 4) mod 1000000 = 2048
(2048 x 9) mod 1000000 = 18432
(18432 x 16) mod 1000000 = 294912
(294912 x 36) mod 1000000 = 616832
(616832 x 1) mod 1000000 = 616832
(616832 x 2) 

---

### Problem Attempt:

In [82]:
starttime = time.time()

print(protein_mrna_combinations(sequence_from_file(submission_dataset), 
                                1000000,
                                mrna_codon_frequency_table(mrna_codon_table(std_path)),
                                verbose=False))

print("Program execution time:", str( round(time.time() - starttime, 5)))

del starttime


733504
Program execution time: 0.01209


---

### Scratch from incorrect attempts

In [19]:
starttime = time.time()

#print(protein_mrna_combinations(submission_dataset, 1000000))
print(protein_mrna_combinations(os.getcwd() + "/datasets/rosalind_mrna_attempt05.txt", 1000000))

print("Program execution time:", str( round(time.time() - starttime, 5)))

del starttime

TypeError: protein_mrna_combinations() missing 1 required positional argument: 'dict_mrna_codon_frequency'

In [None]:
starttime = time.time()

#print(protein_mrna_combinations(submission_dataset, 1000000, verbose=True))
print(protein_mrna_combinations(os.getcwd() + "/datasets/rosalind_mrna_attempt05.txt", 1000000, verbose=True))

print("Program execution time:", str( round(time.time() - starttime, 5)))

del starttime

#### Cycle through previous attempts

In [80]:
for XX in ['01','02','03','04','05','06']:
    starttime = time.time()
    print(protein_mrna_combinations(sequence_from_file(datapath + "/datasets/rosalind_mrna_attempt"+XX+".txt"), 
                                    1000000,
                                    mrna_codon_frequency_table(mrna_codon_table(std_path)),
                                    verbose=False))
    
    print("Program execution time:", str( round(time.time() - starttime, 5)))
    print('-'*50)
    del starttime


994624
Program execution time: 0.01132
--------------------------------------------------
791488
Program execution time: 0.00743
--------------------------------------------------
34624
Program execution time: 0.00606
--------------------------------------------------
678464
Program execution time: 0.00777
--------------------------------------------------
691072
Program execution time: 0.00669
--------------------------------------------------
590592
Program execution time: 0.0072
--------------------------------------------------


In [81]:
for XX in ['01','02','03','04','05','06']:
    starttime = time.time()
    print(protein_mrna_combinations(sequence_from_file(datapath + "/datasets/rosalind_mrna_attempt"+XX+".txt"), 
                                    1000000,
                                    mrna_codon_frequency_table(mrna_codon_table(std_path)),
                                    verbose=True))
    
    print("Program execution time:", str( round(time.time() - starttime, 5)))
    print('-'*50)
    del starttime

999 residues in length
MFEDGCLIKDYDFAPVWTWHMRGSVFRFHWLTPTYAFEPPNLHQEKQTNLDDKLMKALPQLLNRVAYWNNWLDPQGKPCGVFVWDWTTCMGGADSNICSEPWQMYEHVMNKVSQREEDHRVMGQSCITWQRVNGYYYEISNPPINWQECRCNIECAHWKSQSPDCNIKISSDLTATRYRLMPVLRWMHCHTRPNPDCWPTQRMNFTAIVHDWVCDAHKNWYNCLGEAQLLIILEACMNHFEASYNWLWEKNCMSCAETGWRFLGYQWFLFWEGARMQYHEDECFVESYTMDLRLMWRTKNQQEGKAEFGWANWTTEFWGWRNEEGARYGWPIKEDYDVFEHRTYFKMWFPAWQWLPQYYEPNPPMECDLNIIWMCFYKVRNEWAPYQYKMMVAGAQWSCAQFFACFFCHCFYLWPYLPWLKDMIMLGDYSWIGREEIFEHCSIMFFGGAWAVVIERTGESCCLADNAKYIWMEQFAIRQQYATHPQWQAYEMIGRNNHFFKNDWGKGQNQGRAAAVRSNIFIKETLTIRGKNPITIAAQHVSYFTFIKDDDLPLPRPEWTVWKDRQTRNWHYDYIEHPHFRITAPRNCCWPYWFDHMLMSHYMIALESWMVRPPTFNETNGLLKTKVHVDIDAQNGYWTVRRCPEATDIHYGCTCQEVKCRFFDDLRGCQCGFNADPAALIGIIVGMLDYYWPCRRSMHTMQQARRFIVHYESMDLGEWHVQVEATVCWSGNHAYLTRAYTGWVMQVEMGQIKASWREVFKGHRMHDGWEWNKARDMMHEHCTGTCIKGKISRFSSDYAYPFIFTPTVFSQCETAQPADPQFYHHKWLWCHMFCTVMICFRETAETQQCKQILNWKMMNPAPVTMSLVCSSQQMLFPLCLFVRLFIHWWERWHSGHWDGQAGHLYYPAFELIMQPLQEPKVWFGFVNEDKDTKPAMVVVWCLFMYESSKTCGDNYTAICKKMASSLRHPYILEEQWG