<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Case-Study-1---DNA-Translation" data-toc-modified-id="Case-Study-1---DNA-Translation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Case Study 1 - DNA Translation</a></span><ul class="toc-item"><li><span><a href="#Introduction-to-DNA-Translation" data-toc-modified-id="Introduction-to-DNA-Translation-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction to DNA Translation</a></span></li><li><span><a href="#Downloading-DNA-Data" data-toc-modified-id="Downloading-DNA-Data-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Downloading DNA Data</a></span></li><li><span><a href="#Importing-DNA-Data-Into-Python" data-toc-modified-id="Importing-DNA-Data-Into-Python-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Importing DNA Data Into Python</a></span></li><li><span><a href="#Translating-the-DNA-Sequence" data-toc-modified-id="Translating-the-DNA-Sequence-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Translating the DNA Sequence</a></span></li><li><span><a href="#Comparing-Your-Translation" data-toc-modified-id="Comparing-Your-Translation-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Comparing Your Translation</a></span></li></ul></li></ul></div>

## Case Study 1 - DNA Translation

### Introduction to DNA Translation

DNA is a discrete code physically present in almost every cell of an organism.

We can think of DNA as a one-dimensional string of characters drawn from a four-letter alphabet.

These characters are A, C, G, and T.

They are the first letters of the four nucleotides used to construct DNA.

The full names of these nucleotides are adenine, cytosine, guanine, and thymine.

Each three-character sequence of nucleotides, sometimes called a nucleotide triplet or codon, corresponds to one amino acid.

The sequence of amino acids is unique for each type of protein, and in all living things proteins are built from the same set of just 20 amino acids.

The so-called central dogma of molecular biology describes the flow of genetic information in a biological system.

Instructions in the DNA are first transcribed into RNA and the RNA is then translated into proteins.

We can think of DNA, when read as sequences of three letters, as a dictionary of life.

In this case study, we will first download a DNA strand as a text file from a public web-based repository of DNA sequences.

We will then write code to translate the DNA sequence to a sequence of amino acids where each amino acid is represented by a unique letter.

We will also download the amino acid sequence to check our solution.

Think about it conceptually.

The input to our program is going to be a DNA sequence written in a four-letter alphabet.

We then read this sequence three letters at a time, translate each triplet to a single letter that stands for a specific amino acid, and then proceed to the next set of three letters.

We do this until we have reached the end of the input sequence.

Part of the table we will use:

1. ATA -> I
1. ATG -> M 
1. CAA -> Q
1. TCT -> S
1. TGG -> W

Hence the translation of the following DNA sequence is:

ATACAATGGCAA -> IQWQ
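The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch: `partial_table` and `translate_partial` are illustrative names, and the dictionary holds just the five codons listed above, so any other codon raises a `KeyError`.

```python
# Partial codon table from the list above (only five of the 64 codons).
partial_table = {"ATA": "I", "ATG": "M", "CAA": "Q", "TCT": "S", "TGG": "W"}


def translate_partial(seq, table=partial_table):
    """Translate seq three letters at a time using the given table."""
    protein = ""
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        protein += table[codon]  # raises KeyError for codons not in the table
    return protein


print(translate_partial("ATACAATGGCAA"))  # IQWQ
```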

### Downloading DNA Data

The NCBI is the National Center for Biotechnology Information, and it is the United States' main public repository of DNA and related information.

Download two files:

1. Strand of DNA
1. Corresponding protein sequence

Select Nucleotide from the list of databases and search for the following accession code: NM_207618.2

Click FASTA for the DNA sequence and CDS for the translated amino acid sequence.
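Text saved in FASTA format usually begins with a description line that starts with `>`. The files used in this case study appear to contain only the sequence itself, but if your saved file still has such a header, a small helper along these lines could drop it. The function name `strip_fasta_header` and the sample text are made up for illustration.

```python
def strip_fasta_header(text):
    """Return the sequence from FASTA-style text, dropping '>' header lines and line breaks."""
    lines = text.splitlines()  # handles both Unix and Windows line endings
    return "".join(line.strip() for line in lines if not line.startswith(">"))


# Made-up sample: one header line followed by two sequence lines.
sample = ">example_id made-up description line\nGGTCAG\nAAAAAG\n"
print(strip_fasta_header(sample))  # GGTCAGAAAAAG
```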

### Importing DNA Data Into Python

In [759]:
cd "C:/Users/dvije/Google Drive/Jupyter Notebooks/Python Research Libraries/case studies/translation"

C:\Users\dvije\Google Drive\Jupyter Notebooks\Python Research Libraries\case studies\translation


In [760]:
inputfile = "dna.txt"

with open(inputfile, "r") as f:
    seq = f.read()
    # seq contains \n characters; they show up as line breaks when printed
    print(f"DNA seq: {seq}")

    # remove the newline characters with the replace method
    seq = seq.replace("\n", "")

    # remove hidden carriage-return characters as well
    seq = seq.replace("\r", "")

DNA seq: GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA
GATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT
CCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT
TAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT
CAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG
AGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA
ACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA
GGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT
TTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA
GTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA
CCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT
TATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT
GCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG
TCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCT
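The two `replace` calls above remove `\n` and `\r` one at a time. As an alternative sketch, `str.split()` with no arguments splits on any run of whitespace, so joining the pieces removes newlines, carriage returns, tabs, and spaces in one step. The name `clean_sequence` and the sample string are illustrative only.

```python
def clean_sequence(raw):
    """Remove all whitespace (spaces, tabs, newlines, carriage returns) from raw."""
    return "".join(raw.split())


raw = "GGTCAG\r\nAAAAAG\nCCC\n"
print(repr(raw))            # repr makes the hidden characters visible
print(clean_sequence(raw))  # GGTCAGAAAAAGCCC
```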

### Translating the DNA Sequence

In [761]:
def translate(seq):
    """
    Translate a string containing a nucleotide sequence into a string
    containing the corresponding sequence of amino acids. Nucleotides are
    translated in triplets using the table dictionary; each amino acid is
    encoded with a string of length 1.
    """
    table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
    protein = ""
    # translate only if the sequence length is divisible by 3;
    # otherwise the empty string is returned
    if len(seq) % 3 == 0:
        for i in range(0, len(seq), 3):
            codon = seq[i : i+3]
            protein += table[codon]

    return protein

In [762]:
translate("GCC")

'A'
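Note that `translate` returns an empty string when the input length is not a multiple of 3, and raises a `KeyError` on any triplet missing from the table. The sketch below shows one possible stricter variant that reports both problems explicitly. The names `translate_strict`, `BASES`, `AMINO_ACIDS`, and `CODON_TABLE` are illustrative, and the table is built compactly from the standard genetic code listed in T, C, A, G order, using `_` for stop codons as in the table above.

```python
from itertools import product

# Standard genetic code with codons enumerated in T, C, A, G order.
# "_" marks a stop codon, matching the convention of the table above.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY__CC_W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_strict(seq):
    """Translate seq, raising ValueError on a bad length or an unknown codon."""
    if len(seq) % 3 != 0:
        raise ValueError(f"sequence length {len(seq)} is not divisible by 3")
    protein = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        if codon not in CODON_TABLE:
            raise ValueError(f"unknown codon {codon!r} at position {i}")
        protein.append(CODON_TABLE[codon])
    return "".join(protein)


print(translate_strict("GCC"))           # A
print(translate_strict("ATACAATGGCAA"))  # IQWQ
```

Raising an error makes a wrong slice or a stray character visible immediately, whereas an empty result can be mistaken for a valid translation.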

### Comparing Your Translation

If you look at the website where it says CDS, you will see two numbers next to it, 21 and 938.

These are the locations of the gene where the coding sequence starts and ends.

So instead of taking the entire DNA sequence, we would really like to be doing the translation starting at position 21 and ending at 938.

We can use string slicing to obtain the part of the sequence that we want.

However, we need to be careful with the indices.

If you investigate the NCBI website, you will see that the sequence positions are numbered from 1 to 1157.

In Python, indexing starts at 0, so genome positions 21 and 938 correspond to Python string positions 20 and 937.

So the starting point of the string slice will be 20, but the stopping point of the slice is 938.

This is because when we specify the stopping location as 938, the last character to be included is at position 937, which is exactly what we want.
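The index conversion described above can be captured in a small helper. This is a sketch with a hypothetical name, `cds_to_slice`, that turns 1-based inclusive positions into a Python slice; only the start needs to be shifted, because Python slices already exclude the stop index.

```python
def cds_to_slice(start, end):
    """Convert 1-based inclusive positions into a 0-based, end-exclusive slice."""
    return slice(start - 1, end)


s = cds_to_slice(21, 938)
print(s.start, s.stop)          # 20 938
print((s.stop - s.start) % 3)   # 0, so the region splits evenly into codons

toy = "ABCDEFGHIJ"
print(toy[cds_to_slice(3, 5)])  # CDE (positions 3, 4 and 5)
```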

In [763]:
def read_seq(inputfile):
    """
    Reads and returns input sequence with special characters removed. 
    """
    with open(inputfile, "r") as f:
        seq = f.read()
        # remove the newline characters with the replace method
        seq = seq.replace("\n", "")

        # remove hidden carriage-return characters as well
        seq = seq.replace("\r", "")

    return seq

prt = read_seq("protein.txt")
dna = read_seq("dna.txt")

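# the full sequence length is not divisible by 3, so translate returns an empty string
ans = translate(dna)
print(f"DNA translation: {ans}\n")

# translate only the coding sequence (positions 21 to 938)
ans = translate(dna[20:938])
print(f"DNA translation: {ans}\n")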

print(f"Protein Sequence: {prt}\n")

DNA translation: 

DNA translation: MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC_

Protein Sequence: MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC



If you compare these two sequences, you will see that they look almost identical.

The only difference between these two sequences is the underscore character that appears at the end of our translated sequence.
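With sequences this long, comparing by eye is error-prone. As a sketch, a helper such as the hypothetical `first_difference` below reports the first index at which two strings differ, including the case where one string is simply longer than the other. The short strings in the example are made up.

```python
from itertools import zip_longest


def first_difference(a, b):
    """Return the first index where a and b differ, or None if they are identical."""
    # zip_longest pads the shorter string with None, so a length mismatch counts as a difference
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i
    return None


print(first_difference("MSTHC_", "MSTHC"))  # 5
print(first_difference("MSTHC", "MSTHC"))   # None
```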

At the very end of a protein coding sequence, nature places what's called a stop codon.

There are three stop codons, and their function is to signal to whatever is reading the sequence that this is where reading should stop.

The stop codon is not included in the downloaded protein, because it's usually not of interest.

But when we download the DNA sequence and translate it ourselves, the stop codon is included in the translation.

Therefore, we should skip the last codon in our translation, which we can do by moving the stopping point of the slice from 938 to 935.

In [764]:
ans = translate(dna[20:935])
print(f"DNA translation: {ans}\n")

print(f"Protein Sequence: {prt}\n")

print(f"Is downloaded and translated protein the same: {ans == prt}")

DNA translation: MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC

Protein Sequence: MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC

Is downloaded and translated protein the same: True


An alternative approach is the following:

The function translate returns a string of amino acids.

So instead of omitting the last three characters from the DNA sequence, we can just omit the very last character of the translated amino acid sequence.

In [765]:
prt = read_seq("protein.txt")
dna = read_seq("dna.txt")

ans = translate(dna[20:938])[:-1]
print(f"DNA translation: {ans}\n")

print(f"Protein Sequence: {prt}\n")

print(f"Is downloaded and translated protein the same: {ans == prt}")

DNA translation: MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC

Protein Sequence: MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC

Is downloaded and translated protein the same: True
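Slicing with `[:-1]` assumes that the last character of the translation really is a stop symbol. A slightly more defensive sketch checks for the stop symbol first. The helper name `drop_trailing_stop` is hypothetical, and `_` is the stop symbol used by the table in `translate`.

```python
def drop_trailing_stop(protein, stop="_"):
    """Remove one trailing stop symbol if present; otherwise return protein unchanged."""
    return protein[:-1] if protein.endswith(stop) else protein


print(drop_trailing_stop("MSTHC_"))  # MSTHC
print(drop_trailing_stop("MSTHC"))   # MSTHC
```

With this guard, a sequence that was already sliced without its stop codon does not lose its final amino acid.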


In [766]:
cd "../.."

C:\Users\dvije\Google Drive\Jupyter Notebooks\Python Research Libraries
