# 1. Introduction

In the nucleus, mRNA transcription is done by an enzyme called RNA polymerase (RNAP).  RNAP breaks the hydrogen bonds joining the base pairs of each strand of DNA.  RNAP then creates a pre-mRNA with the DNA template strand.  After several RNA nucleotides are conjugated together, the first separated complementary DNA bases bond back together.

It is unclear whether an entire substring of DNA is transcribed into RNA then translated into peptides one at a time.  We know that a pre-mRNA is chopped into introns and exons; segments of mRNA that are thrown out, or kept.  The introns are tossed while exons are glued together sequentially to make the final mRNA.  The process of cutting and pasting is called splicing and is facilitated by RNA and proteins called a spliceosome.  

A pre-mRNA sequence can undergo a process called “alternative splicing”, where a single pre-mRNA strand can be cut in different ways, with different fragments identified as introns and exons, in order to generate many different protein sequences from a single strand of mRNA.  This means a single pre-mRNA strand can yield multiple different protein fragments!

Using python, you will be identifying the pitfalls of reverse translation, mRNA alternative splicing, alignment DNA reads, and finding protein motifs.

**Note, while answers are provided, rarely when you’re working with biology will the answers be given to you.  Please work through the examples, and think about other examples that could be run through your code.  Check if those examples work too (and thereby checking your own understanding of the topic), if not, you’re still not done answering the question!  Do your best to generalize your code.  Avoid “hard-coding” your code to get the right answer. 


# 2. Pitfalls of Reverse Translation

When researchers find a new protein, they identify the protein sequences using mass spectrometry.  Once a protein sequence is identified, they want to infer the mRNA strand that makes the protein.  This way, they can find the gene that is associated with the protein’s expression.  However, while a protein sequence can be reverse translated into a unique mRNA sequence, reverse translation yields thousands upon thousands of possible mRNA coding sequences as most amino acid sequences correspond to more than one set of codons.

Given this 33 amino acid protein, return the potential number of mRNA sequences from the protein, then infer all possible iterations of  99 base pair mRNA strands that can result in the amino acid protein.    Ignore the potential of alternative splicing.


`TDAPSRWICTLLDTHWFQSTSECMLQTEHDELN`

Example:
`GTECE`
G - Glycine, T - Threonine, E - Glutamate, C - Cysteine <br>
G has 4, T has 4, E has 2 and C has 2 possible outputs <br>
Total possible combinations: 4 * 4 * 2 * 2 = 64 <br>
Example mRNA strand: GGU ACU GAA UGU GAG <br>


In [1]:
protein_str = 'TDAPSRWICTLLDTHWFQSTSECMLQTEHDELN'

In [2]:
# translations
translation = {
    'UUU':'PHE','UUC':'PHE',
    'UUA':'LEU','UUG':'LEU','CUU':'LEU','CUC':'LEU','CUA':'LEU','CUG':'LEU',
    'AUU':'ILE','AUC':'ILE','AUA':'ILE',
    'AUG':'MET',
    'GUU':'VAL','GUC':'VAL','GUA':'VAL','GUG':'VAL',
    'UCU':'SER','UCC':'SER','UCA':'SER','UCG':'SER',
    'CCU':'PRO','CCC':'PRO','CCA':'PRO','CCG':'PRO',
    'ACU':'THR','ACC':'THR','ACA':'THR','ACG':'THR',
    'GCU':'ALA','GCC':'ALA','GCA':'ALA','GCG':'ALA',
    'UAU':'TYR','UAC':'TYR',
    'UAA':'STOP','UAG':'STOP','UGA':'STOP',
    'CAU':'HIS','CAC':'HIS',
    'CAA':'GLN','CAG':'GLN',
    'AUA':'ASN','AAC':'ASN',
    'AAA':'LYS','AAG':'LYS',
    'GAU':'ASP','GAC':'ASP','AAU':'ASP',
    'GAA':'GLU','GAG':'GLU',
    'UGU':'CYS','UGC':'CYS',
    'UGG':'TRP',
    'CGU':'ARG','CGC':'ARG','CGA':'ARG','CGG':'ARG',
    'AGU':'SER','AGC':'SER',
    'AGA':'ARG','AGG':'ARG',
    'GGU':'GLY','GGC':'GLY','GGA':'GLY','GGG':'GLY'
}

In [3]:
amino_acid_vals = dict() # dictionary of amino acid occurence.
for i in translation.values():
    try:
        val = amino_acid_vals[i]+1
        amino_acid_vals.update({i:val})
    except:
        amino_acid_vals.update({i:1})

In [4]:
AAshort = {
    'ALA':'A','ARG':'R','ASN':'N','ASP':'D','CYS':'C',
    'GLU':'E','GLN':'Q','GLY':'G','HIS':'H','ILE':'I',
    'LEU':'L','LYS':'K','MET':'M','PHE':'F','PRO':'P',
    'SER':'S','THR':'T','TRP':'W','TYR':'Y','VAL':'V',
    'STOP':'-'
    }

In [5]:
# create a dictionary of single to three letter code
reverseAAshort = zip(AAshort.values(), AAshort.keys())
reverseAAshort = dict(reverseAAshort)

In [6]:
def possibleIter (protein:str):
    '''
    Takes a protein string, returns the total possible combinations
    '''
    AAshort = {
    'ALA':'A','ARG':'R','ASN':'N','ASP':'D','CYS':'C',
    'GLU':'E','GLN':'Q','GLY':'G','HIS':'H','ILE':'I',
    'LEU':'L','LYS':'K','MET':'M','PHE':'F','PRO':'P',
    'SER':'S','THR':'T','TRP':'W','TYR':'Y','VAL':'V',
    'STOP':'-'
    }
    
    # create a dictionary of single to three letter code
    reverseAAshort = zip(AAshort.values(), AAshort.keys())
    reverseAAshort = dict(reverseAAshort)
    
    # get counts of amino acids per triplet letter code
    amino_acid_vals = dict() # dictionary of amino acid occurence.
    for i in translation.values():
        try:
            val = amino_acid_vals[i]+1
            amino_acid_vals.update({i:val})
        except:
            amino_acid_vals.update({i:1})
    
    # core function
    combos = [amino_acid_vals[g] for g in [reverseAAshort[i] for i in protein]]    
    
    k = 1
    for i in combos:
        k = i*k
    return(k)

In [7]:
possibleIter(protein_str)

3043362286338048

#### How many possible first four combo's?

In [8]:
protein_str[0:4] # first four letters

'TDAP'

In [9]:
possibleIter(protein_str[0:4]) # possibilities

192

In [10]:
# return the names from the code
for i in 'TDAP':
    print(reverseAAshort[i])

THR
ASP
ALA
PRO


In [11]:
# create a reverse dictionary where code is the dict and AA is the key.
reverse_trans = dict()
for key, value in translation.items():
    try:
        new_val = reverse_trans[value]
        new_val.append(key)
        reverse_trans.update({value:new_val})

    except:
        reverse_trans.update({value:[key]})
    

In [12]:
reverse_trans

{'PHE': ['UUU', 'UUC'],
 'LEU': ['UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'],
 'ILE': ['AUU', 'AUC'],
 'ASN': ['AUA', 'AAC'],
 'MET': ['AUG'],
 'VAL': ['GUU', 'GUC', 'GUA', 'GUG'],
 'SER': ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'],
 'PRO': ['CCU', 'CCC', 'CCA', 'CCG'],
 'THR': ['ACU', 'ACC', 'ACA', 'ACG'],
 'ALA': ['GCU', 'GCC', 'GCA', 'GCG'],
 'TYR': ['UAU', 'UAC'],
 'STOP': ['UAA', 'UAG', 'UGA'],
 'HIS': ['CAU', 'CAC'],
 'GLN': ['CAA', 'CAG'],
 'LYS': ['AAA', 'AAG'],
 'ASP': ['GAU', 'GAC', 'AAU'],
 'GLU': ['GAA', 'GAG'],
 'CYS': ['UGU', 'UGC'],
 'TRP': ['UGG'],
 'ARG': ['CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'],
 'GLY': ['GGU', 'GGC', 'GGA', 'GGG']}

#### Two different ways to do it.
1. Through list comprehension
2. Through itertools product.

In [13]:
res = [(i, j, k, l) 
       for i in reverse_trans['THR']
       for j in reverse_trans['ASP'] 
       for k in reverse_trans['ALA']
       for l in reverse_trans['PRO']
      ]

In [14]:
def aa_permutator(aa_list:list):
    '''
    given list of x length, return all possible combinations of amino acids based on their 3 letter code.
    '''
    from itertools import product as pro
    return(list(pro(*aa_list)))

In [15]:
aa_permutator([reverse_trans['THR'],reverse_trans['ASP'],reverse_trans['ALA'],reverse_trans['PRO']])

[('ACU', 'GAU', 'GCU', 'CCU'),
 ('ACU', 'GAU', 'GCU', 'CCC'),
 ('ACU', 'GAU', 'GCU', 'CCA'),
 ('ACU', 'GAU', 'GCU', 'CCG'),
 ('ACU', 'GAU', 'GCC', 'CCU'),
 ('ACU', 'GAU', 'GCC', 'CCC'),
 ('ACU', 'GAU', 'GCC', 'CCA'),
 ('ACU', 'GAU', 'GCC', 'CCG'),
 ('ACU', 'GAU', 'GCA', 'CCU'),
 ('ACU', 'GAU', 'GCA', 'CCC'),
 ('ACU', 'GAU', 'GCA', 'CCA'),
 ('ACU', 'GAU', 'GCA', 'CCG'),
 ('ACU', 'GAU', 'GCG', 'CCU'),
 ('ACU', 'GAU', 'GCG', 'CCC'),
 ('ACU', 'GAU', 'GCG', 'CCA'),
 ('ACU', 'GAU', 'GCG', 'CCG'),
 ('ACU', 'GAC', 'GCU', 'CCU'),
 ('ACU', 'GAC', 'GCU', 'CCC'),
 ('ACU', 'GAC', 'GCU', 'CCA'),
 ('ACU', 'GAC', 'GCU', 'CCG'),
 ('ACU', 'GAC', 'GCC', 'CCU'),
 ('ACU', 'GAC', 'GCC', 'CCC'),
 ('ACU', 'GAC', 'GCC', 'CCA'),
 ('ACU', 'GAC', 'GCC', 'CCG'),
 ('ACU', 'GAC', 'GCA', 'CCU'),
 ('ACU', 'GAC', 'GCA', 'CCC'),
 ('ACU', 'GAC', 'GCA', 'CCA'),
 ('ACU', 'GAC', 'GCA', 'CCG'),
 ('ACU', 'GAC', 'GCG', 'CCU'),
 ('ACU', 'GAC', 'GCG', 'CCC'),
 ('ACU', 'GAC', 'GCG', 'CCA'),
 ('ACU', 'GAC', 'GCG', 'CCG'),
 ('ACU',

In [16]:
[f'{i[0]}{i[1]}{i[2]}{i[3]}'for i in aa_permutator([reverse_trans['THR'],reverse_trans['ASP'],reverse_trans['ALA'],reverse_trans['PRO']])]

['ACUGAUGCUCCU',
 'ACUGAUGCUCCC',
 'ACUGAUGCUCCA',
 'ACUGAUGCUCCG',
 'ACUGAUGCCCCU',
 'ACUGAUGCCCCC',
 'ACUGAUGCCCCA',
 'ACUGAUGCCCCG',
 'ACUGAUGCACCU',
 'ACUGAUGCACCC',
 'ACUGAUGCACCA',
 'ACUGAUGCACCG',
 'ACUGAUGCGCCU',
 'ACUGAUGCGCCC',
 'ACUGAUGCGCCA',
 'ACUGAUGCGCCG',
 'ACUGACGCUCCU',
 'ACUGACGCUCCC',
 'ACUGACGCUCCA',
 'ACUGACGCUCCG',
 'ACUGACGCCCCU',
 'ACUGACGCCCCC',
 'ACUGACGCCCCA',
 'ACUGACGCCCCG',
 'ACUGACGCACCU',
 'ACUGACGCACCC',
 'ACUGACGCACCA',
 'ACUGACGCACCG',
 'ACUGACGCGCCU',
 'ACUGACGCGCCC',
 'ACUGACGCGCCA',
 'ACUGACGCGCCG',
 'ACUAAUGCUCCU',
 'ACUAAUGCUCCC',
 'ACUAAUGCUCCA',
 'ACUAAUGCUCCG',
 'ACUAAUGCCCCU',
 'ACUAAUGCCCCC',
 'ACUAAUGCCCCA',
 'ACUAAUGCCCCG',
 'ACUAAUGCACCU',
 'ACUAAUGCACCC',
 'ACUAAUGCACCA',
 'ACUAAUGCACCG',
 'ACUAAUGCGCCU',
 'ACUAAUGCGCCC',
 'ACUAAUGCGCCA',
 'ACUAAUGCGCCG',
 'ACCGAUGCUCCU',
 'ACCGAUGCUCCC',
 'ACCGAUGCUCCA',
 'ACCGAUGCUCCG',
 'ACCGAUGCCCCU',
 'ACCGAUGCCCCC',
 'ACCGAUGCCCCA',
 'ACCGAUGCCCCG',
 'ACCGAUGCACCU',
 'ACCGAUGCACCC',
 'ACCGAUGCACCA

# 3. Alternative Splicing

Alternative splicing is the process in which introns in the pre-mRNA are removed and the remaining exon sequences are pasted together.

Given the following  DNA coding sequence, return the pre-mRNA sequence.  Next remove the introns for each set and return the corresponding mRNA and its amino acid sequences. Use '-' for stop.

`'TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATAACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATATCTTTAATGTGGTAATTGGAAGGATTCTTGGCCCTCCACCCTTAGAC'`

INTRON SET 1: <br>
`‘CGTTCTTGC’` <br>
`‘TGTCCTTGAGAAGAGGAG’` <br>
`‘TATAACGAACTTCGACATGGCAAT’` <br>

INTRON SET 2: <br>
`‘CTTTCAGAATCATGGTGTGCATGGTAGAATGACTC’` <br>
`‘CCCCGATTAATGGCACTT’` <br>
`‘CCCTCCACCCTTA’` <br>


In [17]:
dna = 'TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATAACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATATCTTTAATGTGGTAATTGGAAGGATTCTTGGCCCTCCACCCTTAGAC'
intron01 = ['CGTTCTTGC','TGTCCTTGAGAAGAGGAG','TATAACGAACTTCGACATGGCAAT']
intron02 = ['CTTTCAGAATCATGGTGTGCATGGTAGAATGACTC','CCCCGATTAATGGCACTT','CCCTCCACCCTTA']

In [18]:
def alt_splicer(host:str,introns:list):
    '''
    given a host strand, remove introns from host.
    '''
    x = host
    for i in introns:
        x = x.replace(i,'')
    return(x)

In [19]:
def dna_to_rna(dna:str):
    '''
    takes a dna coding strand input and returns it as mRNA
    replaces all T's with U's
    '''
    return(dna.replace('T','U'))
        

In [20]:
dna_to_rna(dna=dna)

'UACUCUCGUUCUUGCAGCUUGUCAGUACUUUCAGAAUCAUGGUGUGCAUGGUAGAAUGACUCUUAUAACGAACUUCGACAUGGCAAUAACCCCCCGAUUAAUGGCACUUUUUUACUUUUGUCCUUGAGAAGAGGAGACGUCAGUGCAGAUAUCUUUAAUGUGGUAAUUGGAAGGAUUCUUGGCCCUCCACCCUUAGAC'

In [21]:
def rna_to_pro(dna:str):
    '''
    takes a dna coding strand, turns it into mRNA, and returns the protein as simplified protein code.
    '''
    mRNA = dna_to_rna(dna)                              # returns mRNA strand
    mRNA2 = [mRNA[i:i+3] for i in range(0,len(mRNA),3)] # splits into 3 letter amino acid sequence
    mRNA3 = [translation[i] for i in mRNA2]             # translate into 3 letter code
    mRNA4 = [AAshort[i] for i in mRNA3]                 # single letter code
    aa = ''
    for i in mRNA4:
        aa = aa+i
    return (aa)
    

In [22]:
rna_to_pro(dna)

'YSRSCSLSVLSESWCAW-DDSYNELRHGDNPPIDGTFLLLSLRRGDVSADIFDVVIGRILGPPPLD'

In [23]:
#intron1
rna_to_pro(alt_splicer(host = dna, introns = intron01))

'YSSLSVLSESWCAW-DDSNPPIDGTFLLYVSADIFDVVIGRILGPPPLD'

In [24]:
#intron2
rna_to_pro(alt_splicer(host = dna, introns = intron02))

'YSRSCSLSVL-RTSTWQ-PFTFVLEKRRRQCRYL-CGDWKDSWD'

# 4. DNA Alignment Round 2

Determining an organism's complete genome is a central theme of bioinformatics.  It is impossible to sequence the entire genome in one go, therefore genomes are sequenced piecewise from much smaller snippets of DNA called reads.  Once multiple collection of reads are generated from an organism's DNA, the desired genome from these small pieces of DNA are reconstructed. This process is called fragment assembly.

Given the following collection of strings, a larger super string will contain every one of the smaller strings.  The shortest possible superstring over a collection of reads serves as the most likely DNA construct.  The data should have only one unique way to reconstruct the DNA by piecing together pairs of reads that overlap by half their length.

Sequences:

`TACTCTCGTTCTTGCAGCTT` <br>
`GTCAGTACTTTCAGAATCAT` <br>
`GGTGTGCATGGTAGAATGAC` <br>
`TCTTATAACGAACTTCGACA` <br>
`TGGCAATACCCCCCGATTAA` <br>
`TGGCACTTTTTTACTTTTGT` <br>
`CCTTGAGAAGAGGAGACGTC` <br>
`AGTGCAGATATCTTTAATGC` <br>
`CTTGCAGCTTGTCAGTACTT` <br>
`TCAGAATCATGGTGTGCATG` <br>
`GTAGAATGACTCTTATAACG` <br>
`AACTTCGACATGGCAATACC` <br>
`CCCCGATTAATGGCACTTTT` <br>
`TTACTTTTGTCCTTGAGAAG` <br>
`AGGAGACGTCAGTGCAGATA` <br>


In [25]:
a = 'TACTCTCGTTCTTGCAGCTT'
b = 'GTCAGTACTTTCAGAATCAT'
c = 'GGTGTGCATGGTAGAATGAC'
d = 'TCTTATAACGAACTTCGACA'
e = 'TGGCAATACCCCCCGATTAA'
f = 'TGGCACTTTTTTACTTTTGT'
g = 'CCTTGAGAAGAGGAGACGTC'
h = 'AGTGCAGATATCTTTAATGC'
i = 'CTTGCAGCTTGTCAGTACTT'
j = 'TCAGAATCATGGTGTGCATG'
k = 'GTAGAATGACTCTTATAACG'
l = 'AACTTCGACATGGCAATACC'
m = 'CCCCGATTAATGGCACTTTT'
n = 'TTACTTTTGTCCTTGAGAAG'
o = 'AGGAGACGTCAGTGCAGATA'

In [26]:
import itertools as it
combos = list(it.combinations([a,b,c,d,e,f,g,h,i,j,k,l,m,n,o],2)) # yields combinations

In [27]:
# get combinations, will be a list of tuples
def get_combo(seq_list:list):
    '''get combinations of all pairs'''
    from itertools import combinations as comb
    return(comb(seq_list,2))

In [28]:
def fragmenter(seq:str, n:int):
    '''create sets of strings given size'''
    fragment_set = set()
    for ind, val in enumerate(seq):
        try:
            fragment_set.add(seq[ind:ind+n])
        except:
            pass
    fragment_set02 = {v for v in fragment_set if len(v) == n}
    return(fragment_set02)

In [29]:
# create function that checks overlap, Overlap must be at ends and at least 50%of og string len
# return tuple in correct order of overlap

def overlap_check(atup:tuple):
    '''
    Checks for overlap, then checks if overlaps are at the end, return the "longest overlap" (some may be
    unexpectedly short)'''
    in_0 = atup[0] # tuple at index 0
    in_1 = atup[1] # tuple at index 1

    ## finds overlap
    overlap = 1
    fragment_length = 2
    longest_string = set()
    
    while overlap > 0:
        longest2 = longest_string # creates memory of previous
        frag_0 = fragmenter(in_0,fragment_length)
        frag_1 = fragmenter(in_1,fragment_length) 
        longest_string = set(frag_0) & set(frag_1)
        overlap = len(longest_string)
        fragment_length+=1
        

    ## finds if overlap exists in beginning or end of sequence keep it.
    longest3=[]
    for i in list(longest2):
        dist = len(i)
        if (in_0[0:dist+1]==i or in_0[-dist:]==i) and (in_1[0:dist+1]==i or in_1[-dist:]):
            longest3.append(i)
            
    longest3 = sorted(longest3,key= len)
    try:
        res = longest3[-1]
    except:
        res = None
    return (res)

def order(atup:tuple):
    '''returns tuple in proper head tail order'''
    in_0 = atup[0]
    in_1 = atup[1]
    try:
        over = overlap_check(atup) #sequence that is overlapping
        dist = len(over)           #length of overlapping seq
        if in_0[-dist:] == over:
            res = (in_0,in_1)
        else:
            res = (in_1,in_2)
        return(res)
    except:
        pass

In [30]:
order(('tttaaacccggg','gggaaatttccc'))

('tttaaacccggg', 'gggaaatttccc')

In [31]:
# combine to create new string
def combine (atup:tuple):
    '''combines the tuples at a correct order and aligned at the end'''
    value = order(atup)        # make sure sequence is in proper order
    over = overlap_check(atup) # get the overlap
    try:
        value[0]==''
        value[1]==''
        try:
            #print(f'Part1: {value[0]}\nOverlap: {over}\nPart2: {value[1]}')
            return(f'{value[0].replace(over,"")}{over}{value[1].replace(over,"")}')
        except:
            pass
    except:
        pass

In [32]:
# Running the function you get different lengths, from 30 to 37. Longer length means shorter overlap, so restrict
# to the shortest length!!
sorted([combine(i) for i in get_combo([a,b,c,d,e,f,g,h,i,j,k,l,m,n,o]) if combine(i)!=None],key = len)

['TACTCTCGTTCTTGCAGCTTGTCAGTACTT',
 'GTCAGTACTTTCAGAATCATGGTGTGCATG',
 'GGTGTGCATGGTAGAATGACTCTTATAACG',
 'TCTTATAACGAACTTCGACATGGCAATACC',
 'TGGCAATACCCCCCGATTAATGGCACTTTT',
 'TGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'CCTTGAGAAGAGGAGACGTCAGTGCAGATA',
 'GCAGGTCAGTACTTGTAGAATGACTATAACG',
 'TACTCTCGTTGCAGCTTAACGACATGGCAATACC',
 'TACTCTCGTTGCAGCTTCCCCGATTAATGGCATT',
 'CCCCGATTAATGGCACTTTTTTGTCCTTGAGAAG',
 'CTTGCAGCTTGTCAGTACTTTTTGTCCTTGAGAAG',
 'TGGCTACCCCCCGATTAACCTTGAGGAGGAGACGTC',
 'TGGCAATACCCCCCGATTAAAGTGCAGATATCTTGC',
 'CTTGCAGCTTGTCAGTACTTACGACATGGCAATACC',
 'CTTGCAGCTTGTCAGTACTTCCCCGATTAATGGCTT',
 'GGTGTGCATGGTAGAATGACTCTTATAACGAACTTCA',
 'TGGCAATACCCCCCGATTAAGTAGAATGACTCTTACG',
 'TGGCACTTTTTTACTTTTGTTCAGAATCATGGGCATG',
 'GTAGAATGACTCTTATAACGAGGAGTCAGTGCAGATA']

In [33]:
def minimizer(alist:list):
    '''takes a list and returns results that are the minimum'''
    minimum = min([len(i) for i in sorted(alist,key = len)])
    new_res = [val for val in sorted(alist,key=len) if len(val)==minimum]
    return(new_res)

In [34]:
def an_iterator(alist:list):
    '''takes a list and keeps reducing until you get a list of 1'''
    x = 2
    z = alist
    while x != 1:
        y = minimizer([combine(i) for i in get_combo(z) if combine(i)!=None])
        z = y
        x = len(y)
    return(y)

In [35]:
an_iterator([a,b,c,d,e,f,g,h,i,j,k,l,m,n,o])

['TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA']

#### If you want to see it work one step at a time....

In [36]:
minimizer([combine(i) for i in get_combo([a,b,c,d,e,f,g,h,i,j,k,l,m,n,o]) if combine(i)!=None])

['TACTCTCGTTCTTGCAGCTTGTCAGTACTT',
 'GTCAGTACTTTCAGAATCATGGTGTGCATG',
 'GGTGTGCATGGTAGAATGACTCTTATAACG',
 'TCTTATAACGAACTTCGACATGGCAATACC',
 'TGGCAATACCCCCCGATTAATGGCACTTTT',
 'TGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'CCTTGAGAAGAGGAGACGTCAGTGCAGATA']

In [37]:
minimizer([combine(i) for i in get_combo(['TACTCTCGTTCTTGCAGCTTGTCAGTACTT',
 'GTCAGTACTTTCAGAATCATGGTGTGCATG',
 'GGTGTGCATGGTAGAATGACTCTTATAACG',
 'TCTTATAACGAACTTCGACATGGCAATACC',
 'TGGCAATACCCCCCGATTAATGGCACTTTT',
 'TGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'CCTTGAGAAGAGGAGACGTCAGTGCAGATA']) if combine(i)!=None])

['TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATG',
 'GTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACG',
 'GGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACC',
 'TCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTT',
 'TGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'TGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA']

In [38]:
minimizer([combine(i) for i in get_combo(['TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATG',
 'GTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACG',
 'GGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACC',
 'TCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTT',
 'TGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'TGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA']) if combine(i)!=None])

['TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACG',
 'GTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACC',
 'GGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTT',
 'TCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'TGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA']

In [39]:
minimizer([combine(i) for i in get_combo(['TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACG',
 'GTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACC',
 'GGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTT',
 'TCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'TGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA']) if combine(i)!=None])

['TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACC',
 'GTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTT',
 'GGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'TCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA']

In [40]:
minimizer([combine(i) for i in get_combo(
    ['TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACC',
 'GTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTT',
 'GGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'TCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA']
    ) if combine(i)!=None])

['TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTT',
 'GTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'GGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA']

In [41]:
minimizer([combine(i) for i in get_combo(
    ['TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTT',
 'GTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'GGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA']
    ) if combine(i)!=None])

['TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'GTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA']

In [42]:
minimizer([combine(i) for i in get_combo(
   ['TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAG',
 'GTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA']
    ) if combine(i)!=None])

['TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA']

In [43]:
len('TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATA')

150

# 5. Finding Glycosylated Protein Motifs (future)

http://rosalind.info/problems/mprt/
Just like DNA have motif’s, proteins also have motifs (also known as blocks, signatures, fingerprints etc).  Protein motifs are evolutionarily conserved, which means that they don’t change very much between species.

Much is known about proteins and data about them are stored in many different freely accessible databases. One of these databases is UniPro, which provides protein annotations on function description, domain structure and post-translational modifications. 

### A. Name that protein
Given the following protein names, 1. look up the protein on Uniprot and find proteins possessing the N-glycosylation motif, 2. Output those access ID’s followed by the location in the protein string where the motif can be found. 

### B. Protein motifs
Given the protein names, return a protein string that is represented by its shorthand notation.  For example [TY] means ‘either Threonine or Tyrosine’ and {W} means ‘all amino acids except for Tryptophan’.
