
## Introduction to DNA Translation

DNA is a discrete code physically present in almost every cell of an organism. We can think of DNA as a **one-dimensional string** of characters drawn from a four-letter alphabet.

These characters are A, C, G, and T. They are the first letters of the four nucleotides used to construct DNA. The full names of these nucleotides are `adenine`, `cytosine`, `guanine`, and `thymine`.

Each three-character sequence of nucleotides, sometimes called a `nucleotide triplet`, corresponds to one **amino acid** (or, for a few triplets, to a stop signal). The sequence of amino acids is unique for each type of protein, and all proteins in all living things are built from the same set of just 20 amino acids.

Protein molecules dominate the behavior of the cell, serving as structural supports, chemical catalysts, molecular motors, and so on. The so-called central dogma of molecular biology describes the flow of genetic information in a biological system.

Instructions in the DNA are first transcribed into RNA and the RNA is then translated into proteins. **We can think of DNA, when read as sequences of three letters, as a dictionary of life.**

In this case study, we will:

1. Download a DNA strand as a text file from a [public web-based repository](https://www.ncbi.nlm.nih.gov) of DNA sequences.
2. Write code to translate the DNA sequence to a sequence of amino acids, where each amino acid is represented by a unique letter.
3. Download the amino acid sequence to check our solution.

To make the problem a bit more concrete, let's first think about it conceptually. The input to our program is going to be a DNA sequence written in a four-letter alphabet. We read this sequence three letters at a time, translate each triplet to a single letter that stands for a specific amino acid, and then proceed to the next set of three letters. We do this until we have reached the end of the input sequence.
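
As a rough illustration of the "three letters at a time" step, the sketch below splits a short made-up sequence into consecutive, non-overlapping triplets. The helper name `split_into_codons` and the sample string are assumptions for illustration only; they are not part of the case study's code.

```python
def split_into_codons(seq):
    """Return the consecutive, non-overlapping triplets of seq as a list."""
    # Step through the string in strides of 3 and slice out each triplet
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


codons = split_into_codons("ATGTCTACT")
print(codons)  # ['ATG', 'TCT', 'ACT']
```

If the length of the input is not a multiple of three, the last element of this list is shorter than three letters, which is why the length is checked before translating.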

### **In this case study, we have four tasks.**
1. Manually download DNA and protein sequence data to your computer.
2. Import the DNA data into Python.
3. Create an algorithm that translates the DNA using the translation table we will provide.
4. Check if the DNA translation matches the protein string we have downloaded.

We are going to use the nucleotide [NM_207618.2](https://www.ncbi.nlm.nih.gov/nuccore/NM_207618.2). We will get the DNA sequence data from the [FASTA](https://www.ncbi.nlm.nih.gov/nuccore/NM_207618.2?report=fasta) report and save it as a text file. Afterwards, we go back to [NM_207618.2](https://www.ncbi.nlm.nih.gov/nuccore/NM_207618.2), navigate to CDS under the features of the nucleotide, and save the translation as a text file.

### **Importing DNA Data Into Python and Removing Special Characters**

In [None]:

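dnaInputFile = "/content/dna.txt"
with open(dnaInputFile, "r") as f_dna:
    dna_seq = f_dna.read()

# Show the raw file contents, line breaks included
print(dna_seq)

# str.replace returns a new string, so the result must be assigned back;
# the newline and carriage-return escapes use a backslash, not a forward slash
dna_seq = dna_seq.replace("\n", "")
dna_seq = dna_seq.replace("\r", "")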


GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA
GATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT
CCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT
TAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT
CAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG
AGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA
ACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA
GGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT
TTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA
GTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA
CCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT
TATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT
GCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG
TCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTT
GCTAAT

In [None]:
dna_seq[40:50]

'CCTGAAAACC'

In [None]:
# Adding the Protein File

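ProteinInputFile = "/content/protein.txt"
with open(ProteinInputFile, "r") as f_protein:
    protein_seq = f_protein.read()

# Show the raw file contents, line breaks included
print(protein_seq)

# Assign the results back: str.replace does not modify the string in place
protein_seq = protein_seq.replace("\n", "")
protein_seq = protein_seq.replace("\r", "")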

MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPIST
GSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARST
NLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTG
PQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRM
QYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCND
ILVSGFPTISPLLLTFRDPKGPCSVFFNC


### **Translating the DNA Sequence**

The translation process is essentially a table lookup operation, and Python provides a very natural object for this kind of task: the dictionary. In this case, the keys are strings, each consisting of three letters drawn from the four-letter alphabet. Each value is also a string, but one consisting of just a single character.



In [None]:
table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
}

### How would we look up the value that corresponds to the key CAA, CCT, or GTA?

1. First, check that the length of the sequence is actually divisible by three.
2. Next, look up each three-letter string in our table and store the result somewhere.
3. Finally, keep doing this in a loop until we get to the end of the sequence.
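
Before wrapping the lookup in a loop, a single lookup can be sketched as follows. The name `mini_table` is an assumption for illustration: it is a three-entry excerpt of the translation table, not a replacement for it.

```python
# A three-entry excerpt of the translation table (illustrative only)
mini_table = {'CAA': 'Q', 'CCT': 'P', 'GTA': 'V'}

# Square brackets return the value stored under a key
print(mini_table['CAA'])      # Q
print(mini_table['CCT'])      # P

# The in operator tests whether a key is present
print('GTA' in mini_table)    # True

# .get returns None instead of raising KeyError for a missing key
print(mini_table.get('NNN'))  # None
```

Indexing with square brackets raises a `KeyError` for an unknown triplet, so a sequence containing characters outside A, C, G, and T would stop the translation with an error rather than produce a wrong letter.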

In [None]:
def translate_dna(dna_seq):
    """Translate a string containing a nucleotide sequence into a string containing
    the corresponding sequence of amino acids. Nucleotides are translated in triplets
    using the table dictionary; each amino acid is encoded with a string of length 1."""

    table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',}

    protein = ""

    # Only translate when the sequence length is divisible by 3;
    # otherwise the empty string is returned
    if len(dna_seq) % 3 == 0:

        # Loop over the sequence three nucleotides at a time
        for i in range(0, len(dna_seq), 3):

            # Extract a single codon
            codon = dna_seq[i : i+3]

            # Look up the codon and store the result
            protein += table[codon]

    return protein


A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition, and it becomes part of that object.

The docstring should summarize the behavior of the function and document its arguments, return values, possible side effects, and anything else that would be important for a user to know about the function.
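
As a small sketch of how a docstring "becomes part of that object", the example below defines a hypothetical helper, `codon_count` (an assumption for illustration, not part of the case study), and reads its docstring back through the `__doc__` attribute.

```python
def codon_count(seq):
    """Return the number of complete triplets in seq."""
    return len(seq) // 3


# The docstring is stored on the function object itself
print(codon_count.__doc__)       # Return the number of complete triplets in seq.
print(codon_count("ATGTCTACT"))  # 3
```

The built-in `help(codon_count)` displays the same text together with the function signature.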

In [None]:
translate_dna("ATA")

'I'

In [None]:
translate_dna("CTC")

'L'

In [None]:
translate_dna("GCC")

'A'

### **Comparing The DNA Translation**

In [None]:
def read_seq(InputFile):
    """Read and return the input sequence with special characters removed."""
    with open(InputFile, "r") as f:
        seq = f.read()
    seq = seq.replace("\n", "")
    seq = seq.replace("\r", "")
    return seq


**What does the with statement do?**

It opens a file, makes it available to the indented block of code that follows, and then closes the file automatically when that block ends, even if an error occurs inside the block.
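
The closing behavior can be observed directly through the file object's `closed` attribute. The sketch below is self-contained: it writes a throwaway file to a temporary directory first, and the file name `example.txt` and its contents are assumptions for illustration.

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on dna.txt
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("ATG\nTCT\n")

with open(path, "r") as f:
    text = f.read()
    print(f.closed)  # False: still inside the with block

print(f.closed)      # True: the file was closed on leaving the block
print(repr(text))    # 'ATG\nTCT\n'
```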


In [None]:
dna = read_seq("/content/dna.txt")

In [None]:
prt = read_seq("/content/protein.txt")

In [None]:
def translate(seq):
    """Translate a string containing a nucleotide sequence into a string containing
    the corresponding sequence of amino acids. Nucleotides are translated in triplets
    using the table dictionary; each amino acid is encoded with a string of length 1."""

    table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',}

    protein = ""

    # Only translate when the sequence length is divisible by 3;
    # otherwise the empty string is returned
    if len(seq) % 3 == 0:

        # Loop over the sequence three nucleotides at a time
        for i in range(0, len(seq), 3):

            # Extract a single codon
            codon = seq[i : i+3]

            # Look up the codon and store the result
            protein += table[codon]

    return protein


In [None]:
translate(dna)

''

In [None]:
len(dna) % 3

2

The remainder is 2 rather than 0 because the downloaded file holds the full sequence, not only the part that codes for the protein. On the NCBI page, the CDS entry is followed by two numbers, 21 and 938.

#### **These are the locations of the gene where the coding sequence starts and ends.**

So instead of translating the entire DNA sequence, we want to translate only the stretch that starts at position 21 and ends at position 938.

However, we need to be careful with the indices, because the sequence on the website is numbered from 1 to 1157.

In Python, indexing starts at 0, so genome positions 21 and 938 correspond to Python string positions 20 and 937. The slice therefore starts at 20 but stops at 938: a slice excludes its stopping index, so with a stop of 938 the last character included is the one at position 937.
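
The index conversion can be sketched as a small helper. The function name `slice_from_positions` and the toy string are assumptions for illustration; only the positions 21 and 938 come from the text above.

```python
def slice_from_positions(seq, start, end):
    """Return seq from 1-based position start through end, both inclusive."""
    # Shift the start down by one; the end needs no shift because
    # a Python slice already excludes its stopping index
    return seq[start - 1:end]


toy = "ABCDEFGHIJ"
print(slice_from_positions(toy, 3, 5))  # CDE

# Length of the coding region quoted above: positions 21 through 938
cds_length = 938 - 21 + 1
print(cds_length, cds_length % 3)       # 918 0
```

Under this convention, `slice_from_positions(dna, 21, 938)` would select the same characters as `dna[20:938]`, and the region's length of 918 is divisible by three.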




In [None]:
translate(dna[20:938])

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC_'

In [None]:
prt

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'

The extra underscore (`_`) represents a stop codon. Its function is to tell whoever reads the sequence that this is where reading should stop. The stop codon is not included in the downloaded protein, because it's usually not of interest. But when we download the DNA sequence and translate it ourselves, the stop codon is included in the translation.

Therefore, we should really skip the last codon from our translation, and we can modify the stopping point of the slice.


In [None]:
translate(dna[20:935])

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'

In [None]:
prt == translate(dna[20:935])

True

In [None]:
# An alternative approach
translate(dna[20:938])[:-1]

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'

In [None]:
prt == translate(dna[20:938])[:-1]

True

In [None]:
translate(dna[20:938])[:-1] == translate(dna[20:935])

True
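
The `[:-1]` approach removes the last character whether or not it is a stop symbol. A slightly more defensive variant, sketched below under the assumption that stops are marked with `_` as in the translation table, only trims when the symbol is actually there. The helper name `strip_stop` is hypothetical.

```python
def strip_stop(protein, stop_symbol="_"):
    """Drop one trailing stop symbol if present; otherwise return protein unchanged."""
    if protein.endswith(stop_symbol):
        return protein[:-len(stop_symbol)]
    return protein


print(strip_stop("MST_"))  # MST
print(strip_stop("MST"))   # MST
```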