# DNA Translation
---
The preservation, retrieval, and interpretation of genetic instructions are fundamental for the existence and functionality of living cells. These instructions play a crucial role in the formation and upkeep of living organisms. It is well-established that genetic data is housed within the deoxyribonucleic acid (DNA) present in all living organisms.


DNA is a discrete, linear code present in most of the cells of an organism. It can be represented as a string of characters drawn from a set of four options.

These characters are:

    A - Adenine
    C - Cytosine
    G - Guanine
    T - Thymine
    
They are the first letters of the names of the four nucleotides used to construct DNA.

Each three-character sequence of nucleotides, sometimes called a nucleotide triplet, corresponds to one amino acid, and several different triplets can correspond to the same amino acid. The sequence of amino acids is unique for each type of protein, and in all living things proteins are built from the same set of just 20 amino acids.
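As a quick check on the arithmetic behind the triplet code, the sketch below enumerates every three-letter string over the four-letter alphabet. The names `NUCLEOTIDES` and `all_triplets` are illustrative and do not appear elsewhere in this notebook.

```python
from itertools import product

# The four-letter DNA alphabet listed above.
NUCLEOTIDES = "ACGT"

# Every possible three-letter combination (4 * 4 * 4 = 64 in total).
all_triplets = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]

print(len(all_triplets))   # 64
print(all_triplets[:4])    # ['AAA', 'AAC', 'AAG', 'AAT']
```

With 64 possible triplets and only 20 amino acids, more than one triplet has to map to the same amino acid.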

Protein molecules dominate the behaviour of the cell, serving as structural supports, chemical catalysts, and molecular motors.

#### Central Dogma of Molecular Biology

DNA serves as a master template for the production of all cellular proteins and other molecules, and it is responsible for controlling the cell's growth and reproduction.


The genetic code embedded in the DNA sequence is read in groups of three nucleotides, called codons, each of which codes for a specific amino acid. These amino acids are then linked in a particular order to form a protein molecule.
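Reading a sequence in groups of three can be sketched with a list comprehension. The sequence below is a short illustrative string, chosen so that its length is a multiple of three; `sequence` and `codons` are names used only for this example.

```python
# A short illustrative sequence whose length is a multiple of three.
sequence = "ATGTCTACTCAC"

# Step through the string three characters at a time.
codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

print(codons)  # ['ATG', 'TCT', 'ACT', 'CAC']
```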


Errors or mutations in the DNA sequence can significantly affect the organism, leading to genetic diseases, developmental disorders, or even cancer. However, some modifications can also provide a selective advantage and contribute to evolution by providing new traits that help an organism adapt to changing environmental conditions.
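The effect of a single-letter change depends on where it falls in a codon. The sketch below uses a three-entry excerpt of the genetic code (the name `mini_table` is illustrative) to contrast a change that leaves the amino acid the same with one that alters it.

```python
# A three-entry excerpt of the genetic code, enough for this example.
mini_table = {"GAA": "E", "GAG": "E", "GTG": "V"}

original = "GAG"
same_amino = "GAA"       # third letter changed: G -> A
different_amino = "GTG"  # second letter changed: A -> T

print(mini_table[original])         # E
print(mini_table[same_amino])       # E
print(mini_table[different_amino])  # V
```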

1. **Task: Downloading DNA Strand as a Text File**

- **Description**: Download the DNA strand from a publicly available web-based repository of DNA sequences.

- **Subtasks**:

    - Identify the appropriate repository for downloading the DNA strand.
    - Locate the DNA sequence of interest.
    - Download the DNA sequence in the form of a text file.



2. **Task: DNA Translation**

- **Description**: Write code to translate the DNA sequence to a sequence of amino acids using the genetic code.
- **Subtasks**:

    - Understand the genetic code and the process of DNA translation.
    - Write code that converts the DNA sequence to its corresponding amino acid sequence.
    - Ensure the code works correctly for various DNA sequences.



3. **Task: Checking the Solution**

- **Description**: Download the amino acid sequence and check whether it matches with the expected sequence.


- **Subtasks**:

    - Identify a publicly available web-based repository for downloading amino acid sequences.
    - Locate the amino acid sequence corresponding to the downloaded DNA sequence.
    - Download the amino acid sequence.
    - Compare the translated amino acid sequence with the downloaded one to confirm the correctness of the code.
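A plain `==` comparison only says whether two sequences match. When they do not, it helps to know where they first diverge. The helper below is a minimal sketch of that idea; `first_mismatch` is a hypothetical name and is not used elsewhere in this notebook.

```python
def first_mismatch(translated, expected):
    """Return the index of the first differing position, or None if the strings are identical."""
    for index, (a, b) in enumerate(zip(translated, expected)):
        if a != b:
            return index
    if len(translated) != len(expected):
        # One string is a prefix of the other; they diverge where the shorter one ends.
        return min(len(translated), len(expected))
    return None

print(first_mismatch("MSTHD", "MSTHD"))  # None
print(first_mismatch("MSTHD", "MSAHD"))  # 2
print(first_mismatch("MSTH", "MSTHD"))   # 4
```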

Two files will be downloaded from the National Center for Biotechnology Information (NCBI): a DNA strand and the corresponding protein sequence of amino acids translated from that DNA.

In [6]:
file_path = "/Users/rajdeep_ch/Documents/projects/project_notebooks/python_research/dna_seq1.txt"

protein_path = "/Users/rajdeep_ch/Documents/projects/project_notebooks/python_research/protein_seq1.txt"


Next, the program needs to read the files.





In [7]:
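# Open the DNA file and read its entire contents into a single string.
# The with statement closes the file automatically once the block ends.
with open(file_path, "r") as dna_file:
    seq = dna_file.read()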

In [8]:
seq

'GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA\nGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT\nCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT\nTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT\nCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG\nAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA\nACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA\nGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT\nTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA\nGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA\nCCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT\nTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT\nGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG\nTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGA

The file contains newline characters (`\n`), a result of copying and pasting the text from a web browser into a text editor. They are visible as `\n` in the raw string above, and they cause the sequence to be printed across multiple lines.
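The difference between the raw string and its printed form can be seen on a two-line illustrative snippet (`snippet` is a name used only here). Note that each newline also counts as a character, so it would shift codon positions if left in place.

```python
snippet = "GGTCA\nGATCA\n"

print(repr(snippet))  # 'GGTCA\nGATCA\n'
print(snippet)        # prints GGTCA and GATCA on separate lines
print(len(snippet))   # 12
```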

In [9]:
print(seq)

GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA
GATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT
CCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT
TAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT
CAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG
AGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA
ACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA
GGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT
TTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA
GTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA
CCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT
TATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT
GCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG
TCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTT
GCTAATACCATTAAATACTT

To remove the newline characters, the `replace` method can be used. It takes two arguments: the substring to be replaced and the substring to put in its place.

The method replaces every occurrence of the first argument with the second argument. Since strings are immutable, `replace` returns a new string rather than modifying the original, so the result must be assigned to a variable. The carriage return character `\r` is removed in the same way.
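The behaviour can be checked on a small illustrative string (`raw` and `cleaned` are names used only for this example). The original string is left untouched, and the cleaned copy has both kinds of line-ending characters removed.

```python
raw = "GGT\r\nCAG\n"

# Each call returns a new string, so the calls can be chained.
cleaned = raw.replace("\n", "").replace("\r", "")

print(repr(raw))      # 'GGT\r\nCAG\n'
print(repr(cleaned))  # 'GGTCAG'
```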


In [10]:
dna_seq = seq.replace("\n","")
dna_seq = dna_seq.replace("\r","")

In [11]:
print(dna_seq)

GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCAGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCTCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCTTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCTCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTGAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAAACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAAGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGATTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCAGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGACCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTTTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATTGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGGTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTTGCTAATACCATTAAATACTT

Translating the DNA sequence into its protein sequence is a table lookup operation.

In this project, a dictionary object is used for the translation. The keys of the dictionary are strings, each consisting of three letters selected from the four-letter alphabet. The value of each key is a single-character string representing the corresponding amino acid, or `_` for a stop codon.
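A three-entry excerpt is enough to show how such a lookup behaves. In this sketch `mini_table` is an illustrative name, and the `"?"` fallback for unknown keys is an assumption made for the example, not something the notebook itself does.

```python
mini_table = {"ATG": "M", "TCT": "S", "TAA": "_"}

print(mini_table["ATG"])           # M
print("TAA" in mini_table)         # True
print(mini_table.get("NNN", "?"))  # ?
```

Indexing with square brackets raises a `KeyError` for a key that is not in the dictionary, whereas `get` returns the supplied default instead.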


In [12]:
table = {
    
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'
    
}

To check if the length of the DNA sequence is divisible by 3, the code needs to use the modulus operator with the value 3.

If the length is not divisible by 3, the code should discard the last one or two nucleotides, since they will not correspond to a complete codon.
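Discarding the incomplete tail can be sketched as below. `trim_to_codons` is a hypothetical helper name. The slice end is computed explicitly because `sequence[:-remainder]` would return an empty string when the remainder is 0.

```python
def trim_to_codons(sequence):
    """Drop the last one or two characters if they do not form a complete codon."""
    remainder = len(sequence) % 3
    return sequence[:len(sequence) - remainder]

print(trim_to_codons("ATGTCTAC"))  # ATGTCT
print(trim_to_codons("ATGTCT"))    # ATGTCT
```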

In [13]:
(len(dna_seq) % 3) == 0

False

In [14]:
# Number of trailing nucleotides (0, 1, or 2) that do not form a complete codon.
remain = len(dna_seq) % 3

In the code below, an empty string variable `amino_string` is defined. Then, a `for` loop iterates over the indices from 20 up to (but not including) 938 with a step of 3. These bounds mark the coding region of this sequence: index 20 is the 0-based position of the first nucleotide of the start codon `ATG`.

During each iteration, a substring of the DNA sequence `dna_seq` of length 3 is extracted and stored in the variable `codon`. The corresponding amino acid for the codon is looked up using the `table[codon]` expression and stored in the variable `amino`.

If the looked-up value is not the stop symbol `_`, it is added to the end of `amino_string` using the `amino_string + amino` expression.

At the end of the loop, `amino_string` will contain the translated protein sequence.
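The same loop can be traced by hand on a short illustrative sequence. Here `toy_seq` and `mini_table` are names introduced only for this sketch, and the table holds just the four codons that the toy sequence contains.

```python
mini_table = {"ATG": "M", "TCT": "S", "ACT": "T", "TAA": "_"}
toy_seq = "ATGTCTACTTAA"

amino_string = ""
for ind in range(0, len(toy_seq), 3):
    codon = toy_seq[ind:ind + 3]   # ATG, TCT, ACT, TAA in turn
    amino = mini_table[codon]
    if amino != "_":               # the stop symbol is not added to the result
        amino_string = amino_string + amino

print(amino_string)  # MST
```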

In [16]:
amino_string = ""

for ind in range(20,938,3):
    
    codon = dna_seq[ind:ind+3]
    amino = table[codon]
    if amino != "_":
        amino_string = amino_string + amino

In [17]:
amino_string

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'

In [18]:
# Read the reference protein sequence; the with statement closes the file afterwards.
with open(protein_path, 'r') as protein_file:
    protein_seq = protein_file.read()

In [19]:
protein_seq = protein_seq.replace("\n","")

print(protein_seq)

MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC


In [20]:
amino_string == protein_seq


True

The algorithm will be formalized by creating three functions.

The first function, named `read_sequence`, reads a text file and removes the newline and carriage return characters from the resulting string. It is used to preprocess the input files for the subsequent analysis.

The second function, named `lookup_table`, takes a codon string as its input and returns the corresponding amino acid.

The third function, named `translate`, takes the name of a DNA file as input, prompts for the start and stop positions of the coding region, and returns the corresponding amino acid sequence. The translation divides the selected region into non-overlapping groups of three nucleotides called codons, and each codon is translated into an amino acid through `lookup_table`.

In [21]:
def read_sequence(filename):
    """
    Parameters
        filename (str) - the name of the file
    
    Returns
        out_string (str) - output string with necessary modifications 
    """
    
    with open(filename,'r') as file:
        string_seq = file.read()
        
        string_seq = string_seq.replace("\n","")
        string_seq = string_seq.replace("\r","")
        
        return string_seq

In [22]:
def lookup_table(input_codon):
    """
    Parameters
        input_codon (str) - dna codon
    
    Returns
        amino (str) - translated amino-acid
    """
    
    table = {
    
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'
    
    }
        
    amino = table[input_codon]
        
    return amino
    

The NCBI record lists the start and stop positions of the coding region of a DNA sequence under the name **CDS**. These positions are 1-based, meaning that the first nucleotide has index 1.

Therefore, the start position needs to be reduced by 1 to match Python indexing, which begins at 0. The stop position can be used unchanged as the end of a slice, because the CDS stop position is inclusive while the end of a Python slice is exclusive.
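The index conversion can be checked on a short illustrative string. `cds_slice` and `toy` are hypothetical names used only in this sketch; the coordinates 3 and 11 play the role that the CDS positions play for the real sequence.

```python
def cds_slice(sequence, start, end):
    """Return the region given by 1-based, inclusive start and end positions."""
    return sequence[start - 1:end]

toy = "GGATGTCTTAAGG"

region = cds_slice(toy, 3, 11)
print(region)       # ATGTCTTAA
print(len(region))  # 9
```

The length of the region is `end - start + 1`, which for a complete coding region should be a multiple of three.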

![Screenshot%202023-01-14%20at%2018.33.31.png](attachment:Screenshot%202023-01-14%20at%2018.33.31.png)

In [28]:
def translate(filename):
    """
    Parameters
        filename (str) - name of file to be translated
    
    Returns
        amino_sequence (str) - translated protein sequence for the given file
    """
    
    dna_seq = read_sequence(filename)
    
    start = int(input("Enter starting position of codon: "))
    end = int(input("Enter stopping position of codon: "))
    print()
    
    translate_codon = dna_seq[start-1:end]
    
    amino_sequence = ""
    
    for ind in range(0,len(translate_codon),3):
        
        codon = translate_codon[ind:ind+3]
        amino = lookup_table(codon)
        
        if amino != "_":
            amino_sequence += amino
        
    return amino_sequence

To check whether the `translate` function is working as intended, the value returned by it is compared with the value of the protein sequence.

The protein file is read through `read_sequence` as well, so that its line breaks are removed before the comparison.
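Because `translate` reads the positions with `input`, it cannot be called from a script or an automated check without someone typing the values. One possible alternative, sketched here as an assumption rather than as part of the original notebook, is to pass the positions and the codon table as arguments. `translate_region` and `mini_table` are hypothetical names.

```python
def translate_region(dna_seq, start, end, table):
    """Translate the region given by 1-based, inclusive start and end positions."""
    region = dna_seq[start - 1:end]
    usable = len(region) - len(region) % 3   # ignore an incomplete trailing codon
    aminos = []
    for ind in range(0, usable, 3):
        amino = table[region[ind:ind + 3]]
        if amino != "_":                     # skip the stop symbol
            aminos.append(amino)
    return "".join(aminos)

# Tiny table covering only the codons in the toy sequence below.
mini_table = {"ATG": "M", "TCT": "S", "TAA": "_"}

print(translate_region("GGATGTCTTAAGG", 3, 11, mini_table))  # MS
```

Collecting the amino acids in a list and joining them once at the end avoids building a new string on every iteration, although for sequences of this size either approach works.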

In [25]:
translate(file_path) == read_sequence(protein_path)

Enter starting position of codon: 21
Enter stopping position of codon: 938


True

In [26]:
print(read_sequence(protein_path))

MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC


In [29]:
print(translate(file_path),end = "\n\n")

Enter starting position of codon: 21
Enter stopping position of codon: 938

MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC

