DNA String Exercise
===================

This exercise is a bit more advanced because it uses some string methods and tools that weren't discussed in lectures, specifically translation tables.  Still, I think you're up to it.  If you're a geneticist, the subject of DNA sequences will be "old hat".  If, on the other hand, you've never seen DNA, never fear: you don't need to be an expert to solve the problem.

Sequences of DNA and RNA are frequently represented by strings of letters
corresponding to the bases:

    "A" is adenine
    "C" is cytosine
    "G" is guanine
    "T" is thymine
    "U" is uracil which replaces thymine in RNA
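
Because a sequence is just a string, ordinary string and set operations apply to it. As a minimal sketch (the names `DNA_BASES`, `RNA_BASES` and `is_valid_sequence` are illustrative and not part of the exercise), one could check that a string uses only the letters listed above:

```python
# Illustrative helper names; not part of the exercise itself.
DNA_BASES = set("ACGT")
RNA_BASES = set("ACGU")

def is_valid_sequence(sequence, alphabet):
    """Return True if every character of `sequence` is in `alphabet`."""
    return set(sequence) <= alphabet

print(is_valid_sequence("GATTACA", DNA_BASES))  # True
print(is_valid_sequence("GAUUACA", DNA_BASES))  # False: "U" only occurs in RNA
print(is_valid_sequence("GAUUACA", RNA_BASES))  # True
```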

In this example, we will look at a genetic sequence from the human genome that encodes the [histone cluster 1, H1b](http://www.ncbi.nlm.nih.gov/nuccore/NM_005322).

      A ACC TGC TCT TTA GAT TTC GAG CTT ATT CTC TTC TAG CAG TTT CTT GCC
    ACC ATG TCG GAA ACC GCT CCT GCC GAG ACA GCC ACC CCA GCG CCG GTG GAG
    AAA TCC CCG GCT AAG AAG AAG GCA ACT AAG AAG GCT GCC GGC GCC GGC GCT
    GCT AAG CGC AAA GCG ACG GGG CCC CCA GTC TCA GAG CTG ATC ACC AAG GCT
    GTG GCT GCT TCT AAG GAG CGC AAT GGC CTT TCT TTG GCA GCC CTT AAG AAG
    GCC TTA GCG GCC GGT GGC TAC GAC GTG GAG AAG AAT AAC AGC CGC ATT AAG
    CTG GGC CTC AAG AGC TTG GTG AGC AAG GGC ACC CTG GTG CAG ACC AAG GGC
    ACT GGT GCT TCT GGC TCC TTT AAA CTC AAC AAG AAG GCG GCC TCC GGG GAA
    GCC AAG CCC AAA GCC AAG AAG GCA GGC GCC GCT AAA GCT AAG AAG CCC GCG
    GGG GCC ACG CCT AAG AAG GCC AAG AAG GCT GCA GGG GCG AAA AAG GCA GTG
    AAG AAG ACT CCG AAG AAG GCG AAG AAG CCC GCG GCG GCT GGC GTC AAA AAG
    GTG GCG AAG AGC CCT AAG AAG GCC AAG GCC GCT GCC AAA CCG AAA AAG GCA
    ACC AAG AGT CCT GCC AAG CCC AAG GCA GTT AAG CCG AAG GCG GCA AAG CCC
    AAA GCC GCT AAG CCC AAA GCA GCA AAA CCT AAA GCT GCA AAG GCC AAG AAG
    GCG GCT GCC AAA AAG AAG TAG GAA GCT GGC GTG TGA AAA CCG CAA CAA AGC
    CCC AAA GGC TCT TTT CAG AGC CAC CCA


A DNA molecule is made of two such sequences, called strands. There is a "coding" strand and a "template" strand. The above sequence is the "coding" strand, which is usually the only one stored in bioinformatics, since the other strand can be derived from the fact that an adenine (A) base always pairs with a thymine (T) and a cytosine (C) always pairs with a guanine (G). (For more background information, you can visit [some lectures on nucleic acids](https://www2.chemistry.msu.edu/faculty/reusch/virttxtjml/nucacids.htm).)

Question 1
----------

When a gene is expressed, the first thing that happens is that the two strands are separated and a messenger RNA (mRNA) is assembled as the complement of the "template" strand.  Because the "template" strand is itself the complement of the "coding" strand, this mRNA sequence is almost the same as the "coding" sequence, except that the thymine bases (T) are replaced by uracil bases (U).

Compute the mRNA sequence equivalent to the coding sequence in this example.


In [None]:
coding_sequence = (
    "AACCTGCTCTTTAGATTTCGAGCTTATTCTCTTCTAGCAGTTTCTTGCCACCATGTCGGAAACCGCTCCT" +
    "GCCGAGACAGCCACCCCAGCGCCGGTGGAGAAATCCCCGGCTAAGAAGAAGGCAACTAAGAAGGCTGCCG" +
    "GCGCCGGCGCTGCTAAGCGCAAAGCGACGGGGCCCCCAGTCTCAGAGCTGATCACCAAGGCTGTGGCTGC" +
    "TTCTAAGGAGCGCAATGGCCTTTCTTTGGCAGCCCTTAAGAAGGCCTTAGCGGCCGGTGGCTACGACGTG" +
    "GAGAAGAATAACAGCCGCATTAAGCTGGGCCTCAAGAGCTTGGTGAGCAAGGGCACCCTGGTGCAGACCA" +
    "AGGGCACTGGTGCTTCTGGCTCCTTTAAACTCAACAAGAAGGCGGCCTCCGGGGAAGCCAAGCCCAAAGC" +
    "CAAGAAGGCAGGCGCCGCTAAAGCTAAGAAGCCCGCGGGGGCCACGCCTAAGAAGGCCAAGAAGGCTGCA" +
    "GGGGCGAAAAAGGCAGTGAAGAAGACTCCGAAGAAGGCGAAGAAGCCCGCGGCGGCTGGCGTCAAAAAGG" +
    "TGGCGAAGAGCCCTAAGAAGGCCAAGGCCGCTGCCAAACCGAAAAAGGCAACCAAGAGTCCTGCCAAGCC" +
    "CAAGGCAGTTAAGCCGAAGGCGGCAAAGCCCAAAGCCGCTAAGCCCAAAGCAGCAAAACCTAAAGCTGCA" +
    "AAGGCCAAGAAGGCGGCTGCCAAAAAGAAGTAGGAAGCTGGCGTGTGAAAACCGCAACAAAGCCCCAAAG" +
    "GCTCTTTTCAGAGCCACCCA"
)

<div class="btn-group"><button class="btn" onclick="IPython.canopy_exercise.toggle_hint('2')">Hint</button></div>

See if there is a string method that lets you _replace_ substrings.

In [None]:
# your code goes here

<div class="btn-group"><button class="btn" onclick="IPython.canopy_exercise.toggle_solution('3')">Solution</button></div>

In [None]:
# str.replace returns a new string with every 'T' swapped for 'U'
mrna_sequence = coding_sequence.replace('T', 'U')
print(mrna_sequence)
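
A quick way to sanity-check this kind of conversion is to try it on a short fragment first. The sketch below uses a made-up fragment (`fragment` is hypothetical and not taken from the gene) and a few assertions that should hold for any DNA-to-mRNA conversion done with `replace`:

```python
# A short, made-up fragment used only for illustration.
fragment = "ACCATGTCG"

# replace() returns a new string; the original is left unchanged.
mrna_fragment = fragment.replace("T", "U")
print(mrna_fragment)  # ACCAUGUCG

# Sanity checks: same length, no thymine left, one uracil per original thymine.
assert len(mrna_fragment) == len(fragment)
assert "T" not in mrna_fragment
assert mrna_fragment.count("U") == fragment.count("T")
```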

Question 2
----------

From what we described above, it is clear that the "template" strand of the DNA is built from the coding sequence with the following replacements:

* adenine "A" is replaced by thymine "T"
* thymine "T" is replaced by adenine "A"
* cytosine "C" is replaced by guanine "G"
* guanine "G" is replaced by cytosine "C"

In addition, convention holds that the template sequence is written in the reverse order. Compute that sequence for the above coding strand. 

Note: Python strings don't have a reverse method. But slicing could help...

<div class="btn-group"><button class="btn" onclick="IPython.canopy_exercise.toggle_hint('4')">Hint</button></div>

Question to think about: why can't you use the same method you used in the first question?

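Look at the documentation for the `translate` method on strings and
the `maketrans` function from the `string` module (which we import
for you in the cell below). That is where the function lives in Python 2;
in Python 3 the equivalent is `str.maketrans`.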

In [None]:
from string import maketrans

In [None]:
# your code goes here

<div class="btn-group"><button class="btn" onclick="IPython.canopy_exercise.toggle_solution('6')">Solution</button></div>

In [None]:
# Map each base to its pairing partner in one pass, then reverse with a slice
coding_to_template = maketrans('ATCG', 'TAGC')
template_sequence = coding_sequence.translate(coding_to_template)[::-1]
print(template_sequence)
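
To see why chained `replace` calls do not work here, it helps to compare both approaches on a short fragment. The sketch below is written for Python 3, where the table is built with `str.maketrans`, and the fragment is made up for illustration:

```python
# Made-up fragment for illustration only.
fragment = "AATTCG"

# Chained replace: the second call also converts the T's produced by the first.
chained = fragment.replace("A", "T").replace("T", "A")
print(chained)  # AAAACG

# A translation table maps every character in a single pass.
# Python 3 spelling; in Python 2 use `from string import maketrans` instead.
table = str.maketrans("ATCG", "TAGC")
complement = fragment.translate(table)
print(complement)  # TTAAGC

# Reversing with a slice gives the template strand in the conventional order.
template = complement[::-1]
print(template)  # CGAATT

# Taking the reverse complement twice gets back the original fragment.
assert template.translate(table)[::-1] == fragment
```

The chained version loses information because, after the first call, the original T's can no longer be told apart from the T's that used to be A's.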

Question 3
----------

A gene encodes a protein by specifying the amino acids that compose it via
groups of 3 bases (called "codons").  In the usual genetic code the
sequence "ATG" indicates the start of the encoding of the protein.

Find the index of the start of the coding section in the coding sequence above.

<div class="btn-group"><button class="btn" onclick="IPython.canopy_exercise.toggle_hint('7')">Hint</button></div>

Look for a string method that helps you _find_ a substring.

In [None]:
# your code goes here

<div class="btn-group"><button class="btn" onclick="IPython.canopy_exercise.toggle_solution('8')">Solution</button></div>

In [None]:
# find() returns the index of the first occurrence, or -1 if there is none
start_index = coding_sequence.find('ATG')
print("Start index: %d" % start_index)
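
The behaviour of `find` is easiest to see on a small example. The sketch below uses a made-up fragment containing "ATG" twice. Note that `find` knows nothing about reading frames; it simply reports the first place the three letters occur.

```python
# Made-up fragment for illustration only.
fragment = "GCCACCATGTCGATGGAA"

# find() returns the lowest index at which the substring occurs.
first_start = fragment.find("ATG")
print(first_start)  # 6

# A missing substring yields -1 rather than raising an exception.
print(fragment.find("TTT"))  # -1

# The index can be used directly to slice out the matching codon.
print(fragment[first_start:first_start + 3])  # ATG
```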

Question 4
----------

The end of the encoding of a protein is indicated by one of three "stop"
codons: "TAA", "TAG" or "TGA".  In this case the stop is "TAG".  Find the
index of the "TAG" codon closest to the end of the coding sequence above.

<div class="btn-group"><button class="btn" onclick="IPython.canopy_exercise.toggle_hint('9')">Hint</button></div>

Look for a string method that helps you find a substring starting from the _right_ of a string.

In [None]:
# your code goes here

<div class="btn-group"><button class="btn" onclick="IPython.canopy_exercise.toggle_solution('10')">Solution</button></div>

In [None]:
# rfind() searches from the right and returns the index where the match begins
end_index = coding_sequence.rfind('TAG')
print("End index: %d" % end_index)
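
With both indices in hand, the protein-coding region can be sliced out and split into codons. The following is a sketch on a made-up fragment, not the gene above; it assumes the start and stop codons found by `find` and `rfind` lie in the same reading frame, which is true for this toy input but is not guaranteed in general.

```python
# Made-up fragment: 3 leading bases, a start codon, 2 codons, a stop codon, 4 trailing bases.
fragment = "GCCATGTCGGAATAGGAAG"

start = fragment.find("ATG")
stop = fragment.rfind("TAG")
print(start)  # 3
print(stop)  # 12

# rfind() returns the index where the match begins, so add 3 to include the stop codon.
coding_region = fragment[start:stop + 3]
print(coding_region)  # ATGTCGGAATAG

# If start and stop are in the same reading frame, the region splits evenly into codons.
assert len(coding_region) % 3 == 0
codons = [coding_region[i:i + 3] for i in range(0, len(coding_region), 3)]
print(codons)  # ['ATG', 'TCG', 'GAA', 'TAG']
```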

Copyright 2008-2016, Enthought, Inc.  
Use only permitted under license.  Copying, sharing, redistributing or other unauthorized use strictly prohibited.  
http://www.enthought.com