# 1. Introduction

In the nucleus, transcription is carried out by an enzyme called RNA polymerase (RNAP). RNAP breaks the hydrogen bonds joining the base pairs of the two DNA strands, then builds a pre-mRNA that is complementary to the DNA template strand. After several RNA nucleotides have been joined together, the DNA bases that were separated first pair back up.

It is unclear whether an entire substring of DNA is transcribed into RNA and then translated into peptides one at a time. We do know that a pre-mRNA is cut into introns and exons: segments that are discarded and segments that are kept, respectively. The introns are removed, while the exons are joined together in order to make the final mRNA. This cutting and pasting is called splicing, and it is carried out by a complex of RNA and proteins called the spliceosome.

A pre-mRNA sequence can undergo a process called “alternative splicing”, where a single pre-mRNA strand is cut in different ways, with different fragments treated as introns and exons, in order to generate many different protein sequences from a single pre-mRNA. This means a single pre-mRNA strand can yield several different proteins!

Using Python, you will explore the pitfalls of reverse translation, alternative splicing of mRNA, alignment of DNA reads, and finding protein motifs.

**Note:** While answers are provided, answers are rarely given to you when you work with real biology. Please work through the examples, and think about other examples that could be run through your code. Check whether those examples work too (thereby checking your own understanding of the topic); if they do not, you are not done answering the question! Do your best to generalize your code, and avoid “hard-coding” it to get the right answer.


# 2. Pitfalls of Reverse Translation

When researchers find a new protein, they identify its amino acid sequence using mass spectrometry. Once the protein sequence is identified, they want to infer the mRNA strand that encodes it, so that they can find the gene associated with the protein’s expression. However, while an mRNA sequence translates into a unique protein sequence, the reverse is not true: reverse translation yields a very large number of possible mRNA coding sequences, because most amino acids are encoded by more than one codon.

Given this 33 amino acid protein, return the number of potential mRNA sequences that encode it, then infer the possible 99-nucleotide mRNA strands that translate into the protein. Ignore the potential of alternative splicing.


`TDAPSRWICTLLDTHWFQSTSECMLQTEHDELN`

Example:
`GTECE`
G - Glycine, T - Threonine, E - Glutamate, C - Cysteine <br>
G has 4, T has 4, E has 2 and C has 2 possible codons <br>
Total possible combinations (E occurs twice): 4 * 4 * 2 * 2 * 2 = 128 <br>
Example mRNA strand: GGU ACU GAA UGU GAG <br>


In [None]:
protein_str = 'TDAPSRWICTLLDTHWFQSTSECMLQTEHDELN'

In [None]:
# translations of amino acids
translation = {
    'UUU':'PHE','UUC':'PHE',
    'UUA':'LEU','UUG':'LEU','CUU':'LEU','CUC':'LEU','CUA':'LEU','CUG':'LEU',
    'AUU':'ILE','AUC':'ILE','AUA':'ILE',
    'AUG':'MET',
    'GUU':'VAL','GUC':'VAL','GUA':'VAL','GUG':'VAL',
    'UCU':'SER','UCC':'SER','UCA':'SER','UCG':'SER',
    'CCU':'PRO','CCC':'PRO','CCA':'PRO','CCG':'PRO',
    'ACU':'THR','ACC':'THR','ACA':'THR','ACG':'THR',
    'GCU':'ALA','GCC':'ALA','GCA':'ALA','GCG':'ALA',
    'UAU':'TYR','UAC':'TYR',
    'UAA':'STOP','UAG':'STOP','UGA':'STOP',
    'CAU':'HIS','CAC':'HIS',
    'CAA':'GLN','CAG':'GLN',
    'AAU':'ASN','AAC':'ASN',
    'AAA':'LYS','AAG':'LYS',
    'GAU':'ASP','GAC':'ASP',
    'GAA':'GLU','GAG':'GLU',
    'UGU':'CYS','UGC':'CYS',
    'UGG':'TRP',
    'CGU':'ARG','CGC':'ARG','CGA':'ARG','CGG':'ARG',
    'AGU':'SER','AGC':'SER',
    'AGA':'ARG','AGG':'ARG',
    'GGU':'GLY','GGC':'GLY','GGA':'GLY','GGG':'GLY'
}

In [None]:
AAshort = {
    'ALA':'A','ARG':'R','ASN':'N','ASP':'D','CYS':'C',
    'GLU':'E','GLN':'Q','GLY':'G','HIS':'H','ILE':'I',
    'LEU':'L','LYS':'K','MET':'M','PHE':'F','PRO':'P',
    'SER':'S','THR':'T','TRP':'W','TYR':'Y','VAL':'V',
    'STOP':'-'
    }

In [None]:
# you should make a dictionary of how many codons encode each amino acid, then you can multiply them to get your answer.
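
One possible way to follow the hint above is sketched here. It is not the reference answer: it rebuilds the standard codon table from a compact string (codons enumerated in U, C, A, G order) so that it runs on its own, and the names `CODON_TABLE`, `CODONS_PER_AA`, and `count_mrna_sequences` are illustrative assumptions rather than part of the assignment.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code, rebuilt compactly so this sketch is self-contained.
# Codons are enumerated in UCAG order; '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons encode each one-letter amino acid, e.g. L -> 6, W -> 1.
CODONS_PER_AA = Counter(CODON_TABLE.values())


def count_mrna_sequences(protein):
    """Multiply the codon counts of every residue in the protein."""
    return prod(CODONS_PER_AA[aa] for aa in protein)


print(count_mrna_sequences("GTECE"))  # 4 * 4 * 2 * 2 * 2 = 128
```

The same function can be called on `protein_str`. Counting this way never builds the strands themselves, which matters because the count grows multiplicatively with every residue.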

In [None]:
# think about creating a dictionary of single to three letter code. (reverse the above)

#### How many possible combinations are there for the first four amino acids?

In [None]:
# create a reverse dictionary where the amino acid is the key and the list of its codons is the value.
# this would look like {'PHE':['UUU','UUC']}
# then you can make all combinations of it.

#### Two different ways to do it.
1. Through list comprehension
2. Through itertools product.
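
Both approaches can be sketched as follows. This is a minimal illustration under stated assumptions: the reverse dictionary `codons_for` is keyed by one-letter amino acid codes and is rebuilt from the standard genetic code so the block runs on its own, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code in UCAG codon order ('*' marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Reverse dictionary: amino acid -> list of codons, e.g. {'F': ['UUU', 'UUC'], ...}
codons_for = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    codons_for.setdefault(aa, []).append("".join(codon))


def mrnas_with_product(protein):
    """All mRNA strings for `protein`, built with itertools.product."""
    choices = (codons_for[aa] for aa in protein)
    return ["".join(codons) for codons in product(*choices)]


def mrnas_with_comprehension(protein):
    """Same result, growing the list one residue at a time."""
    strands = [""]
    for aa in protein:
        strands = [prefix + codon for prefix in strands for codon in codons_for[aa]]
    return strands


first_four = "TDAP"
print(len(mrnas_with_product(first_four)))  # 4 * 2 * 4 * 4 = 128
```

Both functions return the strands in the same order. They hold every strand in memory, so they are only practical for short peptides such as the first four residues; for the full 33-residue protein, iterating lazily over `product(...)` without building a list is the safer choice.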

# 3. Alternative Splicing

Splicing removes the introns from the pre-mRNA and pastes the remaining exon sequences together. In alternative splicing, different sets of segments are treated as introns, so the same pre-mRNA can produce different mRNAs.

Given the following DNA coding sequence, return the pre-mRNA sequence. Next, remove the introns of each set and return the corresponding mRNA and its amino acid sequence. Use '-' for stop.

`'TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATAACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATATCTTTAATGTGGTAATTGGAAGGATTCTTGGCCCTCCACCCTTAGAC'`

INTRON SET 1: <br>
`CGTTCTTGC` <br>
`TGTCCTTGAGAAGAGGAG` <br>
`TATAACGAACTTCGACATGGCAAT` <br>

INTRON SET 2: <br>
`CTTTCAGAATCATGGTGTGCATGGTAGAATGACTC` <br>
`CCCCGATTAATGGCACTT` <br>
`CCCTCCACCCTTA` <br>


In [None]:
dna = 'TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATAACCCCCCGATTAATGGCACTTTTTTACTTTTGTCCTTGAGAAGAGGAGACGTCAGTGCAGATATCTTTAATGTGGTAATTGGAAGGATTCTTGGCCCTCCACCCTTAGAC'
intron01 = ['CGTTCTTGC','TGTCCTTGAGAAGAGGAG','TATAACGAACTTCGACATGGCAAT']
intron02 = ['CTTTCAGAATCATGGTGTGCATGGTAGAATGACTC','CCCCGATTAATGGCACTT','CCCTCCACCCTTA']

In [None]:
# Write a function that removes introns from the DNA sequence
# Then write a function that turns DNA into RNA by replacing T's with U's
# Then write a function to translate from RNA to protein by reading the sequence in triplets (codons)
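
The three steps above could be sketched as below. This is one possible shape, not the reference solution: the function names are hypothetical, the codon table is rebuilt from the standard genetic code with '-' for stop, and the demonstration uses a made-up toy sequence rather than the assignment data. It also assumes each intron occurs in the sequence and that introns do not overlap one another.

```python
from itertools import product

# Standard genetic code in UCAG codon order, with '-' marking a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def remove_introns(dna, introns):
    """Delete the first occurrence of each intron from the DNA string."""
    for intron in introns:
        if intron not in dna:
            raise ValueError(f"intron not found: {intron}")
        dna = dna.replace(intron, "", 1)
    return dna


def transcribe(dna):
    """Turn a DNA coding sequence into RNA by replacing T with U."""
    return dna.replace("T", "U")


def translate(rna):
    """Translate codon by codon; a trailing partial codon is ignored."""
    usable = len(rna) - len(rna) % 3
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3))


# Toy example (hypothetical data): the intron 'TTT' is spliced out.
toy_dna = "ATGGCCTTTAAATGA"
mrna = transcribe(remove_introns(toy_dna, ["TTT"]))
print(mrna, translate(mrna))  # AUGGCCAAAUGA MAK-
```

Raising an error when an intron is missing makes a typo in an intron list visible instead of silently returning the unspliced sequence.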

# 4. DNA Alignment Round 2

Determining an organism's complete genome is a central theme of bioinformatics. It is impossible to sequence an entire genome in one go, so genomes are sequenced piecewise from much smaller snippets of DNA called reads. Once a collection of reads has been generated from an organism's DNA, the genome is reconstructed from these small pieces. This process is called fragment assembly.

Given the following collection of reads, a superstring is a larger string that contains every one of the reads as a substring. The shortest possible superstring over a collection of reads serves as the most likely DNA construct. The data are constructed so that there is only one way to reconstruct the DNA by piecing together pairs of reads that overlap by half their length.

Sequences:

`TACTCTCGTTCTTGCAGCTT` <br>
`GTCAGTACTTTCAGAATCAT` <br>
`GGTGTGCATGGTAGAATGAC` <br>
`TCTTATAACGAACTTCGACA` <br>
`TGGCAATACCCCCCGATTAA` <br>
`TGGCACTTTTTTACTTTTGT` <br>
`CCTTGAGAAGAGGAGACGTC` <br>
`AGTGCAGATATCTTTAATGC` <br>
`CTTGCAGCTTGTCAGTACTT` <br>
`TCAGAATCATGGTGTGCATG` <br>
`GTAGAATGACTCTTATAACG` <br>
`AACTTCGACATGGCAATACC` <br>
`CCCCGATTAATGGCACTTTT` <br>
`TTACTTTTGTCCTTGAGAAG` <br>
`AGGAGACGTCAGTGCAGATA` <br>


In [None]:
a = 'TACTCTCGTTCTTGCAGCTT'
b = 'GTCAGTACTTTCAGAATCAT'
c = 'GGTGTGCATGGTAGAATGAC'
d = 'TCTTATAACGAACTTCGACA'
e = 'TGGCAATACCCCCCGATTAA'
f = 'TGGCACTTTTTTACTTTTGT'
g = 'CCTTGAGAAGAGGAGACGTC'
h = 'AGTGCAGATATCTTTAATGC'
i = 'CTTGCAGCTTGTCAGTACTT'
j = 'TCAGAATCATGGTGTGCATG'
k = 'GTAGAATGACTCTTATAACG'
l = 'AACTTCGACATGGCAATACC'
m = 'CCCCGATTAATGGCACTTTT'
n = 'TTACTTTTGTCCTTGAGAAG'
o = 'AGGAGACGTCAGTGCAGATA'

In [None]:
import itertools as it
combos = list(it.combinations([a,b,c,d,e,f,g,h,i,j,k,l,m,n,o],2)) # yields combinations

In [None]:
# write a function to get combinations, return will be a list of tuples

In [None]:
# write a function that breaks down a string by how long you want it to be (like previous assignment)

In [None]:
# create function that checks overlap between two strings, Overlap must be at ends OF BOTH strings
# return tuple in correct order of overlap
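
A minimal sketch of such an overlap check follows. The names `overlap_length` and `merge_pair` and the `min_len` parameter are assumptions made for illustration; the two reads in the demonstration are `a` and `i` from the list above.

```python
def overlap_length(left, right, min_len=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_pair(x, y, min_len=1):
    """Return (left, right, merged) in overlap order, or None if no overlap."""
    forward = overlap_length(x, y, min_len)   # end of x overlaps start of y
    backward = overlap_length(y, x, min_len)  # end of y overlaps start of x
    if max(forward, backward) == 0:
        return None
    if forward >= backward:
        return x, y, x + y[forward:]
    return y, x, y + x[backward:]


read_a = "TACTCTCGTTCTTGCAGCTT"
read_i = "CTTGCAGCTTGTCAGTACTT"
# The argument order does not matter; the result is returned in overlap order.
left, right, merged = merge_pair(read_i, read_a, min_len=10)
print(merged)  # TACTCTCGTTCTTGCAGCTTGTCAGTACTT
```

Setting `min_len` to half the read length keeps only the overlaps the assignment asks for; with a smaller `min_len`, short chance overlaps produce longer merged strings.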

In [None]:
# write a function that sets up the strings above in the right order

In [None]:
# write a function to combine the strings to create new string

In [None]:
# combine all the functions together to combine a string
# use list comprehension or for loop to do it for all sequences.
# HINT: Running the function you get different lengths, from 30 to 37. Longer length means shorter overlap, so restrict
# to the shortest length!!
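
Putting the pieces together, one possible assembly routine is sketched below. It swaps the pairwise search for a prefix lookup and relies on the assumptions stated in the problem: all reads have the same length, consecutive reads overlap by exactly half their length, and the reconstruction is unique. The name `assemble` is hypothetical, and the demonstration uses only four of the reads (`j`, `a`, `b`, `i`) to stay short.

```python
def assemble(reads):
    """Chain reads whose last half equals the first half of the next read."""
    half = len(reads[0]) // 2
    by_prefix = {read[:half]: read for read in reads}
    suffixes = {read[-half:] for read in reads}

    # The first read is the only one whose prefix is not another read's suffix.
    starts = [read for read in reads if read[:half] not in suffixes]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting read")

    current = starts[0]
    genome = current
    for _ in range(len(reads) - 1):
        current = by_prefix[current[-half:]]  # KeyError if the chain breaks
        genome += current[half:]              # append only the new half
    return genome


reads = [
    "TCAGAATCATGGTGTGCATG",
    "TACTCTCGTTCTTGCAGCTT",
    "GTCAGTACTTTCAGAATCAT",
    "CTTGCAGCTTGTCAGTACTT",
]
print(assemble(reads))
# TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATG
```

Each read after the first contributes only its second half, so four 20-character reads produce a 50-character string.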

# 5. Finding Glycosylated Protein Motifs (future)

http://rosalind.info/problems/mprt/

Just as DNA has motifs, proteins also have motifs (also known as blocks, signatures, fingerprints, etc.). Protein motifs are evolutionarily conserved, which means that they don’t change very much between species.

Much is known about proteins, and data about them are stored in many different freely accessible databases. One of these databases is UniProt, which provides protein annotations on function description, domain structure and post-translational modifications.

### A. Name that protein
Given the following protein names, 1. look up each protein on UniProt and find the proteins possessing the N-glycosylation motif, 2. output their accession IDs, each followed by the locations in the protein string where the motif can be found.

### B. Protein motifs
Given the protein names, return the protein string that matches a motif written in shorthand notation. For example, [TY] means ‘either Threonine or Tyrosine’ and {W} means ‘any amino acid except Tryptophan’.
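
The shorthand maps almost directly onto regular expressions, as this sketch shows. It assumes the N-glycosylation motif is written `N{P}[ST]{P}`, as in the Rosalind problem linked above, and it searches a made-up toy string instead of a sequence downloaded from UniProt. The function names are hypothetical.

```python
import re


def motif_to_regex(motif):
    """Turn shorthand such as N{P}[ST]{P} into a regular expression."""
    # {X} means "any amino acid except X", which is [^X] in regex syntax.
    # [XY] already has the same meaning in both notations.
    return re.sub(r"\{([A-Z]+)\}", r"[^\1]", motif)


def find_motif(protein, motif):
    """Return 1-based start positions, including overlapping matches."""
    # A zero-width lookahead lets matches that overlap each other be reported.
    pattern = re.compile(f"(?={motif_to_regex(motif)})")
    return [match.start() + 1 for match in pattern.finditer(protein)]


N_GLYCOSYLATION = "N{P}[ST]{P}"
toy_protein = "MNKSANPTNGTQ"  # hypothetical sequence for illustration
print(motif_to_regex(N_GLYCOSYLATION))            # N[^P][ST][^P]
print(find_motif(toy_protein, N_GLYCOSYLATION))   # [2, 9]
```

The `N` at position 6 is skipped because it is followed by a proline, which the `{P}` element forbids.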
