# CS 425 notebook 1:  the central dogma of molecular biology

The objective of this notebook is to understand DNA strand complementarity and the concepts of transcription of DNA to RNA and translation of RNA to protein.

### Problem 0 - a discussion of Python dictionaries

Python dictionaries are a highly effective and efficient way to map keys to values.  Which underlying data structure is used to implement Python dictionaries?  Given that information, what can you say about the computational time required to insert or retrieve a value from a dictionary?


*your answer here*

### Problem 1 - reverse complement

In this problem, we will write a function that is given a string representing one strand of a DNA molecule and returns a string representing its reverse complement, that is, the complementary strand read in the 5' to 3' direction.


In [7]:
def reverse_complement(sequence):
    """Returns the reverse complement of the input sequence"""
    reversed_sequence = sequence[::-1].upper()
    map_of_complements = {"A": "T", "T": "A", "C": "G", "G": "C"}
    complement = ""
    for val in reversed_sequence:
        try:
            complement += map_of_complements[val]
        except KeyError:
            raise ValueError("Invalid string of nucleotides")
    return complement

    

In [10]:
# tests for reverse_complement
assert reverse_complement("A") == "T"
assert reverse_complement("ATCG") == "CGAT"
assert reverse_complement("") == ""
assert reverse_complement("GAATTC") == "GAATTC", "Failed on palindromic EcoRI recognition sequence"
print("SUCCESS: all tests for reverse_complement passed!")

SUCCESS: all tests for reverse_complement passed!
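The same result can also be obtained with Python's built-in `str.maketrans` and `str.translate`, which avoid building the output one character at a time. The sketch below is an alternative to the loop-based function, not a replacement for it; the name `reverse_complement_translate` is made up for this example.

```python
# Translation table mapping each DNA base to its complement.
_COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def reverse_complement_translate(sequence):
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    sequence = sequence.upper()
    # Reject anything other than A, C, G, T, mirroring the ValueError raised above.
    if set(sequence) - set("ACGT"):
        raise ValueError("Invalid string of nucleotides")
    # Complement every base, then reverse the string.
    return sequence.translate(_COMPLEMENT_TABLE)[::-1]

print(reverse_complement_translate("ATCG"))  # CGAT
```

The explicit validation step is needed because `str.translate` silently leaves unmapped characters unchanged.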


### Problem 2 - transcription

Write a function that takes as input a string representing a DNA sequence and outputs the string representing the RNA that would result from transcribing this DNA sequence.

As a hint, there is a [string method](https://docs.python.org/3/library/stdtypes.html#string-methods) that will allow you to do this in one line.

In [11]:
def transcribe(sequence):
    """
    Returns the RNA sequence that would result from transcription
    of a DNA sense strand sequence
    """
    # Upper-case first so lowercase input is handled like in the other functions
    return sequence.upper().replace("T", "U")
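The function above assumes the input is the sense (coding) strand. If only the template strand is available, the RNA is the reverse complement of that strand with U in place of T. A minimal sketch of that case follows; `transcribe_template` is a hypothetical name, not part of the assignment.

```python
def transcribe_template(template_strand):
    """Return the RNA transcribed from a DNA template strand written 5' to 3'."""
    # Pair each template base with the RNA base it specifies.
    rna_complement = {"A": "U", "T": "A", "C": "G", "G": "C"}
    # Read the template in reverse so the RNA comes out 5' to 3'.
    return "".join(rna_complement[base] for base in reversed(template_strand.upper()))

# The template strand paired with the sense sequence ATGGCC is GGCCAT.
print(transcribe_template("GGCCAT"))  # AUGGCC
```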

### Problem 3 - translation

In this problem you will write a function that translates an RNA sequence into the protein sequence for which it codes.  Recall that each codon is translated into an amino acid, where codons are nonoverlapping substrings of length three.
To assist you in this task, here is a Python dictionary that represents the standard genetic code:

In [12]:
genetic_code = {
 'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAU': 'N',
 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACU': 'T',
 'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGU': 'S',
 'AUA': 'I', 'AUC': 'I', 'AUG': 'M', 'AUU': 'I',
 'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAU': 'H',
 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCU': 'P',
 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGU': 'R',
 'CUA': 'L', 'CUC': 'L', 'CUG': 'L', 'CUU': 'L',
 'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAU': 'D',
 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCU': 'A',
 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGU': 'G',
 'GUA': 'V', 'GUC': 'V', 'GUG': 'V', 'GUU': 'V',
 'UAA': '*', 'UAC': 'Y', 'UAG': '*', 'UAU': 'Y',
 'UCA': 'S', 'UCC': 'S', 'UCG': 'S', 'UCU': 'S',
 'UGA': '*', 'UGC': 'C', 'UGG': 'W', 'UGU': 'C',
 'UUA': 'L', 'UUC': 'F', 'UUG': 'L', 'UUU': 'F'}

In [13]:
def translate_rna_fragment(rna_sequence):
    """
    Returns the protein sequence resulting from translation of 
    a given RNA sequence
    """
    rna_sequence_upper = rna_sequence.upper()
    protein_sequence = ""
    # Step through the sequence one nonoverlapping codon (3 bases) at a time
    for i in range(0, len(rna_sequence_upper), 3):
        codon = rna_sequence_upper[i:i+3]
        try:
            protein_sequence += genetic_code[codon]
        except KeyError:
            # Raised for unknown bases and for a trailing partial codon
            raise ValueError("Invalid string of nucleotides: " + codon)
    return protein_sequence

In [14]:
assert translate_rna_fragment("UUUGCGACUUAU") == "FATY", "Failed on input 'UUUGCGACUUAU'"
assert translate_rna_fragment("ACG") == "T", "Failed on input 'ACG'"
assert translate_rna_fragment("") == "", "Failed on the empty string"
print("SUCCESS: translate_rna_fragment passed all tests!")

SUCCESS: translate_rna_fragment passed all tests!
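As a cross-check on a hand-typed codon table, the standard genetic code can be rebuilt from the compact 64-letter string used in NCBI's translation table 1, where codons are enumerated with bases in the order T, C, A, G. The sketch below substitutes U for T so the keys are RNA codons; the names `AMINO_ACIDS` and `standard_code` are introduced here for illustration.

```python
from itertools import product

# Amino acids for the 64 codons, enumerated with bases in the order U, C, A, G
# (NCBI translation table 1 lists them as T, C, A, G; here U stands in for T).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

codons = ["".join(bases) for bases in product("UCAG", repeat=3)]
standard_code = dict(zip(codons, AMINO_ACIDS))

print(len(standard_code))    # 64
print(standard_code["AUG"])  # M (methionine)
print(standard_code["ACG"])  # T (threonine)
```

Comparing `standard_code == genetic_code` in the notebook is a quick way to spot a typo in the dictionary.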


### Problem 4 - Consequences of the deltaF508 mutation in CFTR 

One of the most famous disease-causing mutations in humans is the deltaF508 mutation in the *CFTR* gene.  This is the most common mutation among people with cystic fibrosis.  The mutation occurs in the gene fragment shown below and corresponds to the deletion of 3 consecutive bases, starting at base 129 (using 1-based indexing).  The code below shows how to "slice" the string representing this gene fragment to determine the identity of the 3 bases that are deleted by this mutation.

We will now examine how the deltaF508 mutation impacts the resulting amino acid sequence of the encoded protein.  Here is the sequence of the CFTR gene fragment (sense strand).

In [38]:
cftr_gene_fragment = ("ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAA"
                      "AATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTA"
                      "TGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAA"
                      "TATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG")
print(cftr_gene_fragment[128:])
print(len(cftr_gene_fragment)/3)
64-3

CTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG
64.0


61

In [29]:
print(cftr_gene_fragment[128:131])
print(cftr_gene_fragment[:128] + "============" + cftr_gene_fragment[128:131] + "========" + cftr_gene_fragment[131:])


CTT


And here is the sequence of the mutated CFTR gene fragment, which has bases 129, 130, and 131 (1-based coordinates) removed:

In [45]:
deltaf508_fragment = cftr_gene_fragment[:128] + cftr_gene_fragment[131:]
print(deltaf508_fragment)

ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG
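The slicing pattern above generalizes to any deletion given in 1-based coordinates. A minimal sketch, with `delete_bases` as a hypothetical helper name:

```python
def delete_bases(sequence, start, length):
    """Remove `length` bases beginning at 1-based position `start`."""
    first = start - 1  # convert the 1-based coordinate to a 0-based index
    return sequence[:first] + sequence[first + length:]

# Deleting 3 bases starting at base 4 of a toy sequence removes "CTT".
print(delete_bases("ATCCTTGGT", 4, 3))  # ATCGGT
```

With the fragment from this notebook, `delete_bases(cftr_gene_fragment, 129, 3)` should give the same string as the slice expression.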


What are the consequences of this mutation on the protein sequence of this gene?
Keep in mind that the first three bases of the fragment are a codon.  Note that it's possible to answer this simply by considering the coordinates of the mutation.  You can then verify your answer using your translation function.

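The deletion starts at base 129, which is the third position of codon 43 (ATC, bases 127-129), and also removes the first two bases of codon 44 (TTT, bases 130-132).  Because exactly three bases are deleted, the reading frame downstream is unchanged.  The leftover AT from codon 43 joins the leftover T from codon 44 to form ATT, which still codes for isoleucine (I), so the only change to the protein should be the loss of a single amino acid, phenylalanine (F).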
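The coordinate reasoning can be checked with integer arithmetic. Since the first base of the fragment starts a codon, 1-based base `b` lies in codon `(b - 1) // 3 + 1`, at position `(b - 1) % 3 + 1` within that codon. A small sketch (the function name is an assumption for this example):

```python
def codon_coordinates(base):
    """Map a 1-based base position to (1-based codon number, 1-based position in codon)."""
    return (base - 1) // 3 + 1, (base - 1) % 3 + 1

for base in (129, 130, 131):
    print(base, codon_coordinates(base))
# 129 (43, 3)
# 130 (44, 1)
# 131 (44, 2)
```

The deleted bases therefore span the last position of codon 43 and the first two positions of codon 44.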

In [63]:
cftr_gene_fragment_rna = transcribe(cftr_gene_fragment)
deltaf508_fragment_rna = transcribe(deltaf508_fragment)

cftr_protein = translate_rna_fragment(cftr_gene_fragment_rna)
deltaf508_protein = translate_rna_fragment(deltaf508_fragment_rna)
print(cftr_protein)
print(deltaf508_protein)

TSLLMVIMGELEPSEGKIKHSGRISFCSQFSWIMPGTIKENIIFGVSYDEYRYRSVIKACQLEE
TSLLMVIMGELEPSEGKIKHSGRISFCSQFSWIMPGTIKENIIGVSYDEYRYRSVIKACQLEE


So it looks like the mutant protein has lost a single amino acid, phenylalanine (F), and is otherwise identical to the original sequence.

In [68]:

import difflib
d = difflib.Differ()

diff = d.compare(cftr_protein, deltaf508_protein)
print(" \n".join(diff))

  T 
  S 
  L 
  L 
  M 
  V 
  I 
  M 
  G 
  E 
  L 
  E 
  P 
  S 
  E 
  G 
  K 
  I 
  K 
  H 
  S 
  G 
  R 
  I 
  S 
  F 
  C 
  S 
  Q 
  F 
  S 
  W 
  I 
  M 
  P 
  G 
  T 
  I 
  K 
  E 
  N 
  I 
  I 
- F 
  G 
  V 
  S 
  Y 
  D 
  E 
  Y 
  R 
  Y 
  R 
  S 
  V 
  I 
  K 
  A 
  C 
  Q 
  L 
  E 
  E


I compared the 2 strings above to confirm that the phenylalanine (F) residue is gone and to see whether the base deletion caused any other changes in the protein sequence.  It did not: the single `- F` line is the only difference.
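For a single deletion like this one, the position of the missing residue can also be found without `difflib` by scanning for the first mismatch. The sketch below uses a short window around the deletion rather than the full protein strings; `first_difference` is a hypothetical helper name.

```python
def first_difference(a, b):
    """Return the 0-based index of the first position where a and b differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the shared prefix; the strings differ only in length (or not at all).
    return min(len(a), len(b))

wild_type = "KENIIFGVSY"  # window around the deletion in the unmutated protein
mutant = "KENIIGVSY"      # the same window in the deltaF508 protein

i = first_difference(wild_type, mutant)
print(i, wild_type[i])  # 5 F
# Removing that one residue from the wild type reproduces the mutant window.
print(wild_type[:i] + wild_type[i + 1:] == mutant)  # True
```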