## `lab5`—RNA Sequencing

❖ Objectives

-   Build a multi-part workflow to process data.

![](https://oerpub.github.io/epubjs-demo-book/resources/0324_DNA_Translation_and_Codons.jpg)

A DNA sequence is composed of adenine (`'A'`), guanine (`'G'`), cytosine (`'C'`), and thymine (`'T'`) nucleobases.  During transcription, the first step of gene expression, each base of the DNA template strand is read off as its complementary base in a new RNA strand.  Thus an RNA sequence is a string containing uracil (`'U'`), cytosine (`'C'`), guanine (`'G'`), and adenine (`'A'`) bases<sup>[[Wikipedia](https://en.wikipedia.org/wiki/RNA#Types_of_RNA)]</sup>.  (Note that an A in DNA is transcribed to U rather than T, since RNA does not contain thymine.)

| Symbol | Name     | Complementary Base |
|--------|----------|--------------------|
| A      | adenine  | T (DNA); U (RNA)   |
| C      | cytosine | G                  |
| G      | guanine  | C                  |
| T      | thymine  | A                  |
| U      | uracil   | A                  |

This multi-part problem leads you through transcribing DNA sequence data into RNA and then translating the resulting sequences into amino acids.

#### Complementing DNA

-   Write a function `dna2rna` which accepts a string `seq_dna` representing a template strand of DNA.  `dna2rna` should `return` a string `seq_rna` containing the DNA strand transcribed to its RNA complement.  That is, the input `'ACGT'` should return `'UGCA'`.  The function should accept input in either case, but should always return an upper-case transcription.
    
    You may use any means to accomplish this, but you may find the string [`replace` method](http://www.tutorialspoint.com/python/string_replace.htm) useful.
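
One thing to watch for with `replace`: each call rewrites the whole string, so chained calls can overwrite each other's results.  The sketch below illustrates this on a made-up string (swapping `'x'` and `'y'`, not DNA); the placeholder character `'#'` is an assumption and only works if it never appears in the input.

```python
# Illustrative only: swap the letters 'x' and 'y' in a made-up string.
text = 'xxyy'

# Naive chaining: the second replace also rewrites the letters that the
# first replace just produced, so every letter ends up as 'x'.
naive = text.replace('x', 'y').replace('y', 'x')

# Using a temporary placeholder ('#', assumed absent from the input)
# keeps the two substitutions from colliding.
swapped = text.replace('x', '#').replace('y', 'x').replace('#', 'y')

print(naive)    # xxxx
print(swapped)  # yyxx
```

The same collision can occur whenever two symbols map onto each other, so it is worth checking your function against the table above.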

In [None]:
# define your function here
def dna2rna('''(delete this string and replace it with the incoming variables)'''):
    pass # you can always delete a `pass` statement, since it does nothing

In [None]:
# it should pass this test---do NOT edit this cell
# test for simple case
assert dna2rna('CGAT') == 'GCUA'
print('Success!')

In [None]:
# it should pass this test---do NOT edit this cell
# test for case insensitivity
assert dna2rna('CgATaaTTgcGGAttCAGatcGAaacGcg') == 'GCUAUUAACGCCUAAGUCUAGCUUUGCGC'
print('Success!')

In the directory `./data/` there is a file called `dna_seq.dat` containing many lines of DNA sequences.

-   Write a function `read_and_complement_dna` which accepts a filename as a string `dna_file`.  `read_and_complement_dna` loads the data in the file, converts each line into its RNA complement using `dna2rna`, and `return`s the results as a single string.
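
As a reminder of how line-by-line file reading works, here is a minimal sketch.  It writes its own throwaway file so that it runs anywhere; the file name `example_lines.txt` and its contents are made up for illustration, and upper-casing stands in for whatever per-line conversion you need.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# (The name 'example_lines.txt' and its contents are made up.)
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, 'example_lines.txt')
with open(example_path, 'w') as handle:
    handle.write('first\nsecond\nthird\n')

# Read the file back one line at a time.
collected = ''
with open(example_path) as handle:  # the file is closed automatically
    for line in handle:
        # Each line still ends in '\n'; strip() removes it.
        collected += line.strip().upper()

print(collected)  # FIRSTSECONDTHIRD
```

Whether you strip the trailing newline from each line matters, because any character left in the line is passed on to the next processing step.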

In [None]:
# define your function here
def read_and_complement_dna('''(delete this string and replace it with the incoming variables)'''):
    # load the file, get the data out, close the file
    result_string = ""
    # loop over each line in the file
        # convert the string to its RNA complement
        converted_string = # your code here
        # append the string to the overall result string
        result_string += converted_string
    # return the result string
    pass # you can always delete a `pass` statement, since it does nothing

In [None]:
# it should pass this test---do NOT edit this cell
# test for simple case
assert read_and_complement_dna('./data/dna_seq.dat') == 'AUGCCGCAAUCUGUUCACGCACUCAUGUGU'
print('Success!')

#### Mapping RNA to Amino Acids (Codons)

One of the major functions of RNA in the body is as “messenger RNA” (mRNA), which is read in three-letter groups called *codons*, each of which maps to an amino acid that the cell adds to a growing protein.  Thus if we find `CUU CAG` in mRNA, we anticipate that the cell will join leucine and glutamine, written `LQ`.  The full table of codons follows.

<table class="wikitable">
<caption>Standard genetic code<sup>[<a href="https://en.wikipedia.org/wiki/Genetic_code#RNA_codon_table">Wikipedia</a>]</sup></caption>
<tr>
<th rowspan="2">1st<br />
base</th>
<th colspan="8">2nd base</th>
<th rowspan="2">3rd<br />
base</th>
</tr>
<tr>
<th colspan="2">U</th>
<th colspan="2">C</th>
<th colspan="2">A</th>
<th colspan="2">G</th>
</tr>
<tr>
<th rowspan="4">U</th>
<td>UUU</td>
<td rowspan="2" style="background-color:#ffe75f">(Phe/F) <a href="/wiki/Phenylalanine" title="Phenylalanine">Phenylalanine</a></td>
<td>UCU</td>
<td rowspan="4" style="background-color:#b3dec0">(Ser/S) <a href="/wiki/Serine" title="Serine">Serine</a></td>
<td>UAU</td>
<td rowspan="2" style="background-color:#b3dec0">(Tyr/Y) <a href="/wiki/Tyrosine" title="Tyrosine">Tyrosine</a></td>
<td>UGU</td>
<td rowspan="2" style="background-color:#b3dec0">(Cys/C) <a href="/wiki/Cysteine" title="Cysteine">Cysteine</a></td>
<th>U</th>
</tr>
<tr>
<td>UUC</td>
<td>UCC</td>
<td>UAC</td>
<td>UGC</td>
<th>C</th>
</tr>
<tr>
<td>UUA</td>
<td rowspan="6" style="background-color:#ffe75f">(Leu/L) <a href="/wiki/Leucine" title="Leucine">Leucine</a></td>
<td>UCA</td>
<td>UAA</td>
<td style="background-color:#B0B0B0;"><a href="/wiki/Stop_codon" title="Stop codon">Stop</a> (<i>Ochre</i>)</td>
<td>UGA</td>
<td style="background-color:#B0B0B0;"><a href="/wiki/Stop_codon" title="Stop codon">Stop</a> (<i>Opal</i>)</td>
<th>A</th>
</tr>
<tr>
<td>UUG</td>
<td>UCG</td>
<td>UAG</td>
<td style="background-color:#B0B0B0;"><a href="/wiki/Stop_codon" title="Stop codon">Stop</a> (<i>Amber</i>)</td>
<td>UGG</td>
<td style="background-color:#ffe75f;">(Trp/W) <a href="/wiki/Tryptophan" title="Tryptophan">Tryptophan</a>&#160;&#160;&#160;&#160;</td>
<th>G</th>
</tr>
<tr>
<th rowspan="4">C</th>
<td>CUU</td>
<td>CCU</td>
<td rowspan="4" style="background-color:#ffe75f">(Pro/P) <a href="/wiki/Proline" title="Proline">Proline</a></td>
<td>CAU</td>
<td rowspan="2" style="background-color:#bbbfe0">(His/H) <a href="/wiki/Histidine" title="Histidine">Histidine</a></td>
<td>CGU</td>
<td rowspan="4" style="background-color:#bbbfe0">(Arg/R) <a href="/wiki/Arginine" title="Arginine">Arginine</a></td>
<th>U</th>
</tr>
<tr>
<td>CUC</td>
<td>CCC</td>
<td>CAC</td>
<td>CGC</td>
<th>C</th>
</tr>
<tr>
<td>CUA</td>
<td>CCA</td>
<td>CAA</td>
<td rowspan="2" style="background-color:#b3dec0">(Gln/Q) <a href="/wiki/Glutamine" title="Glutamine">Glutamine</a></td>
<td>CGA</td>
<th>A</th>
</tr>
<tr>
<td>CUG</td>
<td>CCG</td>
<td>CAG</td>
<td>CGG</td>
<th>G</th>
</tr>
<tr>
<th rowspan="4">A</th>
<td>AUU</td>
<td rowspan="3" style="background-color:#ffe75f">(Ile/I) <a href="/wiki/Isoleucine" title="Isoleucine">Isoleucine</a></td>
<td>ACU</td>
<td rowspan="4" style="background-color:#b3dec0">(Thr/T) <a href="/wiki/Threonine" title="Threonine">Threonine</a>&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;</td>
<td>AAU</td>
<td rowspan="2" style="background-color:#b3dec0">(Asn/N) <a href="/wiki/Asparagine" title="Asparagine">Asparagine</a></td>
<td>AGU</td>
<td rowspan="2" style="background-color:#b3dec0">(Ser/S) <a href="/wiki/Serine" title="Serine">Serine</a></td>
<th>U</th>
</tr>
<tr>
<td>AUC</td>
<td>ACC</td>
<td>AAC</td>
<td>AGC</td>
<th>C</th>
</tr>
<tr>
<td>AUA</td>
<td>ACA</td>
<td>AAA</td>
<td rowspan="2" style="background-color:#bbbfe0">(Lys/K) <a href="/wiki/Lysine" title="Lysine">Lysine</a></td>
<td>AGA</td>
<td rowspan="2" style="background-color:#bbbfe0">(Arg/R) <a href="/wiki/Arginine" title="Arginine">Arginine</a></td>
<th>A</th>
</tr>
<tr>
<td>AUG</td>
<td style="background-color:#ffe75f;">(Met/M) <a href="/wiki/Methionine" title="Methionine">Methionine</a></td>
<td>ACG</td>
<td>AAG</td>
<td>AGG</td>
<th>G</th>
</tr>
<tr>
<th rowspan="4">G</th>
<td>GUU</td>
<td rowspan="4" style="background-color:#ffe75f">(Val/V) <a href="/wiki/Valine" title="Valine">Valine</a></td>
<td>GCU</td>
<td rowspan="4" style="background-color:#ffe75f">(Ala/A) <a href="/wiki/Alanine" title="Alanine">Alanine</a></td>
<td>GAU</td>
<td rowspan="2" style="background-color:#f8b7d3">(Asp/D) <a href="/wiki/Aspartic_acid" title="Aspartic acid">Aspartic acid</a></td>
<td>GGU</td>
<td rowspan="4" style="background-color:#ffe75f">(Gly/G) <a href="/wiki/Glycine" title="Glycine">Glycine</a></td>
<th>U</th>
</tr>
<tr>
<td>GUC</td>
<td>GCC</td>
<td>GAC</td>
<td>GGC</td>
<th>C</th>
</tr>
<tr>
<td>GUA</td>
<td>GCA</td>
<td>GAA</td>
<td rowspan="2" style="background-color:#f8b7d3">(Glu/E) <a href="/wiki/Glutamic_acid" title="Glutamic acid">Glutamic acid</a></td>
<td>GGA</td>
<th>A</th>
</tr>
<tr>
<td>GUG</td>
<td>GCG</td>
<td>GAG</td>
<td>GGG</td>
<th>G</th>
</tr>
</table>

We provide the function `rna2amino`, which accepts a three-letter codon and returns the one-letter code of the corresponding amino acid (or `'*'` for a stop codon).  This uses a `dict`, a data type we haven't encountered yet but which is easy to use.
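
A `dict` maps keys to values and is indexed with square brackets, much like a list indexed by name instead of by position.  The short sketch below uses a made-up dictionary `base_names` (built from the base table earlier in this lab) to show the operations that matter here.

```python
# A small dict built from the base table earlier in this lab.
base_names = {
    'A': 'adenine',
    'C': 'cytosine',
    'G': 'guanine',
    'U': 'uracil',
}

print(base_names['G'])      # guanine
print('T' in base_names)    # False
print(base_names.get('T'))  # None

# Indexing with a key that is not present raises a KeyError.
try:
    base_names['T']
except KeyError as err:
    print('no such key:', err)  # no such key: 'T'
```

Note the difference between indexing with `[]`, which raises a `KeyError` for a missing key, and `.get()`, which returns `None` instead.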

In [None]:
genetic_code = {
    'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L',        'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L',
    'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'AUG': 'M',        'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V',
    
    'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S',        'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
    'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',        'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
    
    'UAU': 'Y', 'UAC': 'Y', 'UAA': '*', 'UAG': '*',        'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
    'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',        'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
    
    'UGU': 'C', 'UGC': 'C', 'UGA': '*', 'UGG': 'W',        'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
    'AGU': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',        'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',
}
allowed_codons = set('ACGU')

def rna2amino(codon):
    '''
    Convert a three-letter RNA codon to an amino acid.
    '''
    # Check for the correct length of the codon.
    if len(codon) != 3:
        return None
    codon = codon.upper()
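    # Intended as a validity check.  Note that a three-letter codon can never
    # be a proper superset of the four allowed bases, so this branch never
    # runs; an invalid codon instead raises a KeyError in the lookup below.
    if (set(codon) > allowed_codons):
        return None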
    return genetic_code[codon]

Converting an RNA codon to an amino acid now takes a single call to `rna2amino`.  For example, the following call returns `'R'` (arginine):

In [None]:
rna2amino('CGU')

-   We next need a function `sequence_string` which accepts a string `rna_seq` containing RNA sequence data and maps it to amino acids.  This requires that you:
    
    1.  Break the string into three-letter chunks.
    2.  Map each chunk (a codon) to its amino acid according to the table above.  (We provide code for this step.)
    3.  Return the resulting string of amino acids.

The tricky part is figuring out how to chop a string into three-letter chunks.  (This is harder than it seems at first.)  There are many ways to do this.  Here is one possibility; note that it silently drops any leftover characters at the end of the string (here, `yz`):

In [None]:
example_string = 'abcdefghijklmnopqrstuvwxyz'
for i in range(0, len(example_string) // 3):
    print(example_string[3*i:3*i+3])
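
Another way to get the same kind of chunks, shown here as a sketch rather than the required approach, is to let `range` step through the starting index of each chunk.  Unlike the version above, this one keeps a short leftover chunk at the end.

```python
example_string = 'abcdefgh'

# Step through the start index of every chunk: 0, 3, 6, ...
chunks = []
for start in range(0, len(example_string), 3):
    # Slicing past the end of a string is safe; it just returns what is left.
    chunks.append(example_string[start:start + 3])

print(chunks)  # ['abc', 'def', 'gh']
```

If you use this form for codons, decide what to do with a trailing chunk shorter than three letters, since `rna2amino` returns `None` for any input whose length is not 3.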

In [None]:
# define your function here
def sequence_string('''(delete this string and replace it with the incoming variables)'''):
    # divide the string into three-letter chunks
    # map each three-letter codon to an amino acid
    # append the amino acid to the result string
    # return the result string
    pass # you can always delete a `pass` statement, since it does nothing

In [None]:
# it should pass this test---do NOT edit this cell
# test for simple case
assert sequence_string('ACUGAU') == 'TD'
print('Success!')

In [None]:
# it should pass this test---do NOT edit this cell
# test for a more complicated case
assert sequence_string('AUCACUGUAGUAGUAGCUGGAAAGAGAAAUCUGUGACUCCAAUUAGCC') == 'ITVVVAGKRNL*LQLA'
print('Success!')

In [None]:
# it should pass this test---do NOT edit this cell
# test for failure case
try:
    sequence_string('ASDF')
except KeyError:
    print('Success!')
else:
    raise AssertionError('expected a KeyError for the invalid codon')

Finally, we are interested in loading a file of DNA sequence data, complementing it, and mapping the resulting RNA to amino acids.  This requires that you:

1.  Load a file of DNA sequences and complement it into RNA (your function `read_and_complement_dna` does both).
2.  Use your function `sequence_string` to convert the resulting RNA to its amino acid string.
3.  Return the resulting string.

-   Write a function `sequence_file` which accepts a string `dna_seq_file`.  This function will `return` (NOT write to disk) a string containing the amino acids described in the file `dna_seq_file`, including line breaks.

In [None]:
# define your function here
def sequence_file('''(delete this string and replace it with the incoming variables)'''):
    # use read_and_complement_dna to get the dna complement as rna
    # use sequence_string to convert rna to amino acid sequence
    pass # you can always delete a `pass` statement, since it does nothing

In [None]:
# it should pass this test---do NOT edit this cell
# test for simple case
test_data_file = './data/dna_test.dat'
assert sequence_file(test_data_file) == '''
YKRPPPPGDRGPPFSRRSRRRPKKRGRPGAPRPAPGGGTNSVLKMLGMERGQPCVVFGWPGWVAVCMISLTLLKAK*RGAHDLLLRGGLFARPHVLPVEFARGFHMTPPLLALQEAQPSPPKTGAASADTPPHLVRGRPKLGWQWQKVHML*DLLHLVRM
'''.strip()
print('Success!')

This has quickly become a sophisticated workflow, and it is easy to lose track of both what you're doing and what you've already written!  The diagram below shows how the pipeline processes data from disk:

![](./img/flowchart.png)