### Week 3, class 2
# ***DNA sequence analysis: Example***

<sub><sub>Acknowledgement: This notebook follows the contents of "Illustrating Python via Bioinformatics Examples", 2015, by Hans Petter Langtangen, Geir Kjetil Sandve.</sub></sub>

---


### **From genes to mRNAs to proteins**

Central dogma in genetics:

1. A gene is a region of DNA consisting of several coding parts, called exons, interspersed with non-coding parts, called introns. The coding parts are concatenated to form a string called mRNA, in which every T (thymine) is replaced by U (uracil).

2. Triplets of mRNA letters encode amino acids, the building blocks of proteins.
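
The two steps above can be sketched on a toy example. The sequence and exon positions below are invented for illustration and are not real LCT data; the variable names are also just placeholders.

```python
# Toy illustration only: the sequence and the exon positions are invented.
toy_dna = 'ATGGCCTTTAAA'
toy_exons = [(0, 6), (9, 12)]   # (begin, end) pairs; 'end' is not included

# Step 1: concatenate the exons and replace T with U
toy_mRNA = ''.join(toy_dna[begin:end] for begin, end in toy_exons).replace('T', 'U')

# Step 2: split the mRNA into triplets
toy_codons = [toy_mRNA[i:i+3] for i in range(0, len(toy_mRNA), 3)]

print(toy_mRNA)     # AUGGCCAAA
print(toy_codons)   # ['AUG', 'GCC', 'AAA']
```

The intron here is the stretch 'TTT' between positions 6 and 9, which is simply skipped.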


Here, we will examine the lactase gene, LCT, which is essential for digesting milk but is often no longer expressed in adulthood. We will extract the exons from LCT, generate an mRNA sequence, and finally translate it into the sequence of the lactase protein (LPH).

---

#### **Reading a gene file**

To perform any analysis, we need a gene sequence, which is typically stored in a text file or on the internet. Let's begin with loading a gene file.

lactase_gene.txt is in the 'data' directory. Examine it. Each line has 70 characters. At the end of each line there is a special character called 'new line', which is not shown as a visible symbol. When an editor encounters this character, it moves to the next line before displaying the characters that follow.

In our program, we do not need the 'new line' characters, so we will remove them using a string method called `strip()`.
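
As a quick check of what `strip()` does, here is a small sketch using a made-up line rather than the real gene file. Note that `strip()` removes all leading and trailing whitespace (spaces and tabs too), not only the 'new line' character.

```python
line = 'GATTACA\n'           # a made-up line, as it would be read from a text file
print(repr(line))            # repr() makes the hidden '\n' visible
print(repr(line.strip()))    # the 'new line' character is gone
```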

In [None]:
def read_seq_file(filename):
    seq = ''
    with open(filename, 'r') as f:   # 'with' closes the file automatically
        for line in f:
            seq += line.strip()      # drop the trailing 'new line' character
    return seq

There are multiple ways of reading files. Some examples can be found in 'string_err_fileIO_debugging.ipynb'. You need to experiment with different methods to understand the logic behind them. Please ask questions in the forum or in TA sections.

Below is another implementation performing the same task.

In [None]:
# This may be slightly faster because the whole file is read in a single call (disk access is slow).
def read_seq_file(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()   # reads in all lines of the sequence at once
    return ''.join(line.strip() for line in lines)   # combines all lines without the 'new line' characters

The way a file location is written differs between operating systems. On Windows, the name would be 'data\lactase_gene.txt'; on Mac or Linux, it would be 'data/lactase_gene.txt'.

To make your code portable between systems, you need a way to choose between '\' and '/' automatically. Python provides an easy way to do this with the `os` module.

In [None]:
# Let's test it
import os
filename = os.path.join('data', 'lactase_gene.txt')
dna_seq = read_seq_file(filename)    # Try %timeit for two different implementations
print(dna_seq)
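
Outside a notebook, where the `%timeit` magic is not available, the standard `timeit` module can be used for the same comparison. The sketch below is only an illustration: it times the two approaches on a throwaway file with made-up content, and the function names `read_by_line` and `read_all_lines` are placeholders, not part of the course code. The timings depend on your machine and file size, so the difference may be small or even reversed.

```python
import os
import tempfile
import timeit

def read_by_line(filename):
    seq = ''
    with open(filename, 'r') as f:
        for line in f:
            seq += line.strip()
    return seq

def read_all_lines(filename):
    with open(filename, 'r') as f:
        return ''.join(line.strip() for line in f.readlines())

# Build a throwaway test file: 1000 lines of 70 letters each (made-up content)
test_filename = os.path.join(tempfile.mkdtemp(), 'toy_seq.txt')
with open(test_filename, 'w') as f:
    for _ in range(1000):
        f.write('ACGT' * 17 + 'AC' + '\n')

# Both implementations must return the same sequence
same_result = read_by_line(test_filename) == read_all_lines(test_filename)
print('same result:', same_result)

# Time 20 repetitions of each implementation (total seconds)
t_loop = timeit.timeit(lambda: read_by_line(test_filename), number=20)
t_all = timeit.timeit(lambda: read_all_lines(test_filename), number=20)
print('line by line:', t_loop)
print('readlines() :', t_all)
```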

---

The next step is to convert the DNA sequence to an mRNA sequence. To do this, we
1. extract the exons, then
2. convert Ts to Us.

The exon positions of the lactase gene (in Python's indexing convention) are stored in the 'lactase_exon.tsv' file in the 'data' directory. Examine it by double-clicking the file.

This file is formatted differently from lactase_gene.txt: it has two columns of numbers separated by a tab. We need a different function to read it.

After reading this file, we want each exon position stored as a tuple containing the beginning and end positions, (begin, end).
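
To see how one line of such a file becomes a tuple, here is a minimal sketch on a made-up line (the numbers are not taken from the real file):

```python
line = '120\t245\n'                  # made-up line: two numbers separated by a tab
begin, end = line.split()            # split() with no argument splits on any whitespace
position = (int(begin), int(end))    # convert the strings to integers
print(position)                      # (120, 245)
```

Without the `int()` conversion, the tuple would hold the strings '120' and '245', which cannot be used as slice indices.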

In [None]:
def read_exon_positions(filename):
    positions = []     # list of tuples
    for line in open(filename, 'r'):
        # split() removes the tab and the new line and returns the individual components.
        # See string_err_fileIO_debugging.ipynb for more details.
        begin, end = line.split()
        positions.append(  (int(begin),  int(end) ) )  # Add a tuple
        
    return positions

Again, we can simplify it.

In [None]:
def read_exon_positions(filename):
    return [ tuple(int(num_str) for num_str in line.split() )  for line in open(filename, 'r') ]
    # A generator expression passed to tuple(), nested inside a list comprehension

In [None]:
# Let's read in the exon position information
import os
filename = os.path.join('data','lactase_exon.tsv')
exon_pos = read_exon_positions(filename)
print(exon_pos)

Now, it is time to convert the DNA sequence into an mRNA sequence. This process is straightforward: we simply combine the exon regions and replace Ts with Us.

In [None]:
def create_mRNA(dna_seq, exon_pos):
    mRNA = ''
    for begin, end in exon_pos:
        mRNA += dna_seq[begin:end]
    return mRNA.replace('T','U')   # replacing a letter is a very common task

In [None]:
# Can you simplify it even further?



In [None]:
# Let's create an mRNA sequence

mRNA_seq = create_mRNA(dna_seq, exon_pos)
print(mRNA_seq)

If the input and output files are huge, repeating these steps every time you want to run the later steps would be time-consuming. Therefore, it is a good idea to save intermediate results.

Let's create a new directory and save our lactase mRNA sequence. We first want to make a function to save our sequence file.

In [None]:
def save_seq(seq, filename, letters_per_line=70):
    with open(filename, 'w') as savefile:   # 'with' closes the file automatically
        for begin in range(0, len(seq), letters_per_line):
            end = begin + letters_per_line
            savefile.write(seq[begin:end] + '\n')

We will make a directory and save the sequence.

In [None]:
output_path = 'output'

import os
if not os.path.isdir(output_path):
    os.mkdir(output_path)

filename = os.path.join(output_path, 'lactase_mRNA.txt')
save_seq(mRNA_seq, filename)
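
A simple way to gain confidence in the saving step is a round trip: save a sequence, read it back, and compare. The sketch below does this with a made-up sequence in a temporary directory. It redefines compact versions of `save_seq` and `read_seq_file` only so that the example runs on its own.

```python
import os
import tempfile

def save_seq(seq, filename, letters_per_line=70):
    with open(filename, 'w') as savefile:
        for begin in range(0, len(seq), letters_per_line):
            savefile.write(seq[begin:begin + letters_per_line] + '\n')

def read_seq_file(filename):
    with open(filename, 'r') as f:
        return ''.join(line.strip() for line in f)

toy_seq = 'AUGC' * 50    # made-up 200-letter sequence
toy_filename = os.path.join(tempfile.mkdtemp(), 'toy_mRNA.txt')

save_seq(toy_seq, toy_filename)

# Check how the sequence was split into lines
with open(toy_filename) as f:
    line_lengths = [len(line.strip()) for line in f]

print(line_lengths)                              # [70, 70, 60]
print(read_seq_file(toy_filename) == toy_seq)    # True
```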

---

Finally, we will convert the mRNA sequence to a protein. We will first read the saved mRNA sequence.

In [None]:
import os
filename = os.path.join('output', 'lactase_mRNA.txt')
mRNA_seq = read_seq_file(filename)
print(mRNA_seq)

To get a protein sequence, we need to know the conversion rule from mRNA triplets to amino acids. This rule could be hardcoded because it is fixed. However, to keep the code more flexible, it is provided here as a file: 'conversion_map.tsv' in the 'data' directory. Examine it. This file has 4 columns separated by tabs. The reading procedure is almost the same as for the exon positions.
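
To make the file format concrete, here is a small sketch that parses three lines of a hypothetical 4-column table held in memory. The column layout (triplet, one-letter code, three-letter code, full name) and the 'Stp'/'Stop' labels are assumptions made for illustration; check the real conversion_map.tsv, which may be laid out differently.

```python
import io

# Hypothetical content: the real column layout of conversion_map.tsv may differ.
toy_file = io.StringIO(
    'AUG\tM\tMet\tMethionine\n'
    'GCC\tA\tAla\tAlanine\n'
    'UAA\tX\tStp\tStop\n'
)

toy_map = {}
for line in toy_file:
    fields = line.split()
    toy_map[fields[0]] = fields[1:]   # triplet -> list of the remaining columns

print(toy_map['AUG'])       # ['M', 'Met', 'Methionine']
print(toy_map['UAA'][0])    # X
```

Storing the triplet as the dictionary key makes the later lookup of each triplet a single step.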



In [None]:
def read_conversion_map(filename):
    conversion_map = {}     # We will save it into a dictionary
    for line in open(filename, 'r'):
        m = line.split()
        conversion_map[m[0]] = m[1:]
    return conversion_map

In [None]:
# For those who want shorter code:
def read_conversion_map(filename):
    return { line.split()[0]:line.split()[1:]  for line in open(filename, 'r') }
    # Uses a dictionary comprehension; note that line.split() is called twice per line

In [None]:
# Read the conversion map
import os
filename = os.path.join('data','conversion_map.tsv')
conversion_map = read_conversion_map(filename)
print(conversion_map)

We can now create a protein from the mRNA. One last piece of information we need to remember is this:

1. Translation begins from the code for Methionine, i.e., AUG.
2. Translation stops with the stop codons (there are multiple).
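
These two rules can be tried out on a toy sequence before applying them to the lactase mRNA. The sketch below uses a made-up mRNA and a tiny hand-written codon table that contains only the entries it needs; 'X' marks a stop codon, following the convention used in this notebook.

```python
# Toy codon table with only the entries needed here; 'X' marks a stop codon.
toy_map = {'AUG': 'M', 'GCC': 'A', 'AAA': 'K', 'UGG': 'W', 'UAA': 'X'}

toy_mRNA = 'GGAUGGCCAAAUGGUAAGCC'    # made-up sequence

start = toy_mRNA.find('AUG')         # rule 1: translation begins at the first AUG
codons = [toy_mRNA[i:i+3] for i in range(start, len(toy_mRNA) - 2, 3)]

protein = ''
for codon in codons:
    amino_acid = toy_map[codon]
    if amino_acid == 'X':            # rule 2: a stop codon ends translation
        break
    protein += amino_acid

print(start)      # 2
print(codons)     # ['AUG', 'GCC', 'AAA', 'UGG', 'UAA', 'GCC']
print(protein)    # MAKW
```

The two letters before the first AUG and the GCC after the stop codon do not contribute to the protein.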

In [None]:
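def create_protein_seq(mRNA_seq, conversion_map):
    protein_seq = ''
    begin_pos = mRNA_seq.find('AUG')   # finding a substring is also a very common operation
    if begin_pos == -1:                # find() returns -1 when there is no start codon
        return protein_seq

    # Stop at len - 2 so that an incomplete triplet at the end is never looked up
    for i in range(begin_pos, len(mRNA_seq) - 2, 3):
        triplet = mRNA_seq[i:i+3]
        amino_acid = conversion_map[triplet][0]
        if amino_acid == 'X':          # stop codon: translation ends here
            break
        protein_seq += amino_acid
    return protein_seq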

In [None]:
protein_seq = create_protein_seq(mRNA_seq, conversion_map)
print('Length:', len(protein_seq))
print(protein_seq)

---

### **Combine all of these functions**

We have developed many useful functions. Leaving them scattered around would make future re-use of the code quite frustrating. Here we will write a class that combines all of them.



In [None]:
class Gene():
    def __init__(self, gene_name, dna_seq_filename='', exon_position_filename='',  mRNA_seq_filename='', translation_conversion_map_filename=''):

        # Define member variables (Class member variables are also called "attributes")
        self.gene_name = gene_name
        self.dna_seq_filename = dna_seq_filename
        self.exon_position_filename = exon_position_filename
        self.mRNA_seq_filename = mRNA_seq_filename
        self.translation_conversion_map_filename = translation_conversion_map_filename
        
        # Store all information including intermediate results
        self.dna_seq = ''
        self.exon_positions = []
        self.mRNA_seq = ''
        self.translation_conversion_map = {}
        self.protein_seq = ''
        
        
    def read_dna_seq_file(self, filename=''):
        if len(filename) == 0:
            filename = self.dna_seq_filename
        else:
            self.dna_seq_filename = filename
            
        lines = open(filename,'r').readlines()
        self.dna_seq = ''.join([  line.strip() for line in lines  ])
        
    def read_mRNA_seq_file(self, filename=''):
        if len(filename) == 0:
            filename = self.mRNA_seq_filename
        else:
            self.mRNA_seq_filename = filename
            
        lines = open(filename,'r').readlines()
        self.mRNA_seq = ''.join([  line.strip() for line in lines  ])
        
    def read_exon_positions(self, filename=''):
        if len(filename) == 0:
            filename = self.exon_position_filename
        else:
            self.exon_position_filename = filename
            
        self.exon_positions = [ tuple(int(num_str) for num_str in line.split() )  for line in open(filename, 'r') ]
    
    def create_mRNA(self):
        self.mRNA_seq = ''
        for begin, end in self.exon_positions:
            self.mRNA_seq += self.dna_seq[begin:end].replace('T','U')

    def save_mRNA_seq(self, filename='', letters_per_line = 70):
        if len(filename) == 0:
            filename = self.mRNA_seq_filename
        else:
            self.mRNA_seq_filename = filename
            
        savefile = open(filename, 'w')
        for i in range(0, len(self.mRNA_seq), letters_per_line):
            begin = i
            end = begin + letters_per_line
            savefile.write(self.mRNA_seq[begin:end] + '\n')
        savefile.close()
        
    def read_translation_conversion_map(self, filename=''):
        if len(filename) == 0:
            filename = self.translation_conversion_map_filename 
        else:
            self.translation_conversion_map_filename = filename
            
        self.translation_conversion_map = { line.split()[0]:line.split()[1:]  for line in open(filename, 'r') }
    
    def create_protein_seq(self):
        self.protein_seq = ''
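        begin_pos = self.mRNA_seq.find('AUG')
        if begin_pos == -1:    # find() returns -1 when there is no start codon
            return

        # Stop at len - 2 so that an incomplete triplet at the end is never looked up
        for i in range(begin_pos, len(self.mRNA_seq) - 2, 3):
            triplet = self.mRNA_seq[i:i+3]
            amino_acid = self.translation_conversion_map[triplet][0]
            if amino_acid == 'X':    # stop codon: translation ends here
                break
            self.protein_seq += amino_acid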



In [None]:
# Example 1: using all initialization values
import os
dna_seq_filename = os.path.join('data','lactase_gene.txt')
exon_position_filename = os.path.join('data', 'lactase_exon.tsv')
mRNA_save_filename = os.path.join('output','lactase_mRNA.txt')
translation_conversion_map_filename = os.path.join('data', 'conversion_map.tsv')

lactase_gene = Gene('lactase', dna_seq_filename, exon_position_filename, mRNA_save_filename, translation_conversion_map_filename)
lactase_gene.read_dna_seq_file()
lactase_gene.read_exon_positions()
lactase_gene.create_mRNA()
lactase_gene.save_mRNA_seq()
lactase_gene.read_mRNA_seq_file()
lactase_gene.read_translation_conversion_map()
lactase_gene.create_protein_seq()

#print(lactase_gene.dna_seq)
seq = lactase_gene.protein_seq
len(seq)

## Let's make a module

Make a Gene.py file as shown in the lecture (a finished file is provided for your examination, but I recommend making one yourself). Run the same code as above in a separate notebook to avoid a name conflict with the class already defined in this notebook. See W03_2_Example_using_class.ipynb