# Project: DNATools
This program implements Bioinformatics tools in a tutorial-style Jupyter Notebook.

### This project is available on GitHub.
GitHub: https://github.com/xkevinramos/CentralDogma-Project

## Description
In molecular biology, the Central Dogma describes how genetic information flows within a biological system. In essence, DNA is used to produce RNA, which is then used to create proteins. The step of transcribing RNA from DNA is crucial, as the process cannot go from DNA to protein without first creating RNA.

The process of using DNA to create RNA is called transcription. The resulting RNA carries the same sequence as the DNA coding strand, except that every instance of thymine ('T') is replaced with uracil ('U'). The next step is called translation, which is the process of reading an RNA sequence in order to produce a protein, which is made up of amino acids. In an RNA sequence, a codon, which is made up of 3 nucleotides, translates into an amino acid. For this reason, we read a sequence of RNA in triplets.

These proteins are essential to every cell in the human body, which is why the Central Dogma is so important. In bioinformatics, this biological process is modeled in computer code. Below, we will reproduce DNA replication, transcription, and translation using Python as our programming language.

First, let's start with some basic functions: finding the length of a DNA strand and counting the number of occurrences of each nucleotide in any given DNA strand.


## Find the length of a DNA strand
Given a sample DNA strand, calculate the number of nucleotides in the strand.

In [1]:
def dna_length(dna_strand):
    """
    Calculate the number of nucleotides in the given DNA strand. 
    
    Parameters
    ----------
    dna_strand : string
        The DNA strand we are evaluating. 

    Returns
    -------
    nucleotide_count : int
        The result of counting all the nucleotides in the DNA strand. 
    """
    
    # Initialize a counter used to keep track of the number of nucleotides in the strand
    nucleotide_count = 0
    
    # Use a loop to iterate through the strand
    for char in dna_strand:
        # for every nucleotide in the input strand, increment the counter
        nucleotide_count += 1
    
    return nucleotide_count

In [2]:
# Initialize a string that will serve as a DNA strand
sample_dataset = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'

# Call the function we created 
number_of_nucleotides = dna_length(sample_dataset)
print('This DNA strand consists of ' + str(number_of_nucleotides) + ' nucleotides.')

This DNA strand consists of 70 nucleotides.
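
The explicit loop above shows how counting works step by step. As a side note, Python's built-in `len` returns the same count directly. The short sketch below compares the two; it restates the loop version only so that the snippet runs on its own.

```python
def dna_length(dna_strand):
    """Count nucleotides one at a time, as in the tutorial function above."""
    nucleotide_count = 0
    for _ in dna_strand:
        nucleotide_count += 1
    return nucleotide_count

sample_dataset = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'

# The built-in len gives the same answer as the manual loop
print(dna_length(sample_dataset) == len(sample_dataset))
```

In everyday code `len` is the usual choice; the loop is kept in this tutorial because it makes the counting explicit.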


## A Step Further: Calculate the number of 'A' 'T' 'G' 'C' occurrences in the DNA strand
A DNA strand is made up of four nucleotides: A, T, G, C. We are going to write a function that counts the number of occurrences of each of these four symbols in a given DNA strand.

In [3]:
def individual_nucleotides(dna_strand):
    """
    Calculate the number of occurrences for each nucleotide in a given DNA strand. 
    
    Parameters
    ----------
    dna_strand : string
        The DNA strand we are evaluating. 

    Returns
    -------
    nucleotide_dict : dictionary
        A dictionary containing each nucleotide as a key and the number of occurrences as the value. 
    """
    
    # Initialize a dictionary to keep track of the counts for each symbol
    nucleotide_dict = {'A': 0, 'T': 0, 'G': 0, 'C': 0}
    
    # Iterate through each symbol in the DNA strand
    for char in dna_strand:
        # Count only the four expected nucleotides; any other symbol is ignored
        if char in nucleotide_dict:
            nucleotide_dict[char] += 1
            
    return nucleotide_dict

In [4]:
## TEST for individual_nucleotides

# Initialize a string consisting of the acceptable nucleotides 'A', 'T', 'G', 'C'
sample = 'ATTGGGCCCC'
individual_count = individual_nucleotides(sample)

# Format the output
print('The DNA strand contains: \n' + 
            'A: ' + str(individual_count['A']) +
            '\nT: ' + str(individual_count['T']) +
            '\nG: ' + str(individual_count['G']) +
            '\nC: ' + str(individual_count['C'] ))

The DNA strand contains: 
A: 1
T: 2
G: 3
C: 4
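
The same tally can be produced with the standard library. The sketch below uses `collections.Counter`; the function name `count_nucleotides` is hypothetical and is not part of the notebook. Like the function above, it reports only the four DNA nucleotides and ignores any other symbol.

```python
from collections import Counter

def count_nucleotides(dna_strand):
    """Hypothetical alternative to individual_nucleotides using collections.Counter."""
    counts = Counter(dna_strand)
    # Report only the four DNA nucleotides; Counter returns 0 for missing keys
    return {base: counts[base] for base in 'ATGC'}

print(count_nucleotides('ATTGGGCCCC'))  # {'A': 1, 'T': 2, 'G': 3, 'C': 4}
```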


# DNA Replication: complementing a strand of DNA
During DNA replication, each existing strand serves as a template for a new, complementary strand. In DNA, the nucleotides 'A' and 'T' are complements of each other, as are 'C' and 'G'. The new strand is produced by pairing each nucleotide in the template strand with its complement.

In [5]:
def complement(template_strand):
    """
    Computes the complement strand of the given template strand. 
    
    Parameters
    ----------
    template_strand : string
        The template strand we are evaluating. 

    Returns
    -------
    complement_strand : string
        The resulting string after finding the complement for each nucleotide.
    """
    
    # Empty string to store the complement of the template strand
    complement_strand = ''
    
    complement_dict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    
    # For each nucleotide in the sequence, add its complement to a new string
    for char in template_strand:
        
        # Append the complement of the current nucleotide using our dictionary complement_dict
        complement_strand += complement_dict[char]
                    
    return complement_strand

In [6]:
## TEST for complement

template_strand = 'ATCTGACC'
complement_strand = complement(template_strand)

print("Template strand: 5' " + template_strand + " 3'" + "\nComplement strand: 3' "
          + complement_strand + " 5'")

Template strand: 5' ATCTGACC 3'
Complement strand: 3' TAGACTGG 5'
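
The dictionary lookup above can also be expressed with Python's built-in `str.maketrans` and `str.translate`. The sketch below is one possible variant; the names `COMPLEMENT_TABLE` and `complement_fast` are hypothetical. One behavioral difference to keep in mind: symbols outside 'A', 'T', 'G', 'C' pass through unchanged here, whereas the dictionary version raises a `KeyError`.

```python
# Translation table mapping each nucleotide to its complement
COMPLEMENT_TABLE = str.maketrans('ATGC', 'TACG')

def complement_fast(template_strand):
    """Hypothetical variant of complement built on str.translate."""
    return template_strand.translate(COMPLEMENT_TABLE)

print(complement_fast('ATCTGACC'))  # TAGACTGG
```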


### Extra: Reverse Complement of a strand of DNA
The complement strand computed above runs 3' to 5'. Because sequences are conventionally written 5' to 3', we reverse the complement strand to obtain the reverse complement.

In [7]:
def reverse_complement(complement_strand):
    """
    Compute the reverse of a given complement strand. 
    
    Parameters
    ----------
    complement_strand : string
        The complement strand we are evaluating. 

    Returns
    -------
    reverse_strand : string
        The result of reversing a complement strand. 
    """
    # Empty string to store the reverse of the complement strand
    reverse_strand = ''
    
    # Calculate the length of the complement strand using our dna_length function
    length = dna_length(complement_strand)
    
    # Walk backwards through the strand, appending one nucleotide at a time
    while length > 0:
        reverse_strand += complement_strand[length - 1]
        length -= 1
        
    return reverse_strand       

In [8]:
## TEST for reverse_complement

sample_dataset = 'AAAACCCGGT'
comp = complement(sample_dataset)
reverse_comp = reverse_complement(comp) 

# Reverse complement should be 'ACCGGGTTTT'
assert reverse_comp == 'ACCGGGTTTT'
print('Template Strand: ' + sample_dataset + '\nReverse Complement Strand: ' + reverse_comp)

Template Strand: AAAACCCGGT
Reverse Complement Strand: ACCGGGTTTT
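
The backwards walk in `reverse_complement` can also be written with slicing. The sketch below is a minimal illustration; the helper name `reverse_strand` is hypothetical, and the complement string is typed in by hand so the snippet runs on its own.

```python
def reverse_strand(strand):
    """Hypothetical helper: return the strand read in the opposite direction."""
    # A slice with step -1 walks the string from the last character to the first
    return strand[::-1]

complement_of_sample = 'TTTTGGGCCA'  # complement of 'AAAACCCGGT'
print(reverse_strand(complement_of_sample))  # ACCGGGTTTT
```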


# Transcription: transcribing a complement strand into RNA
Transcription is the process of creating an RNA strand by using a DNA strand as a template. The resulting RNA carries the same sequence as the DNA coding strand, except that every instance of thymine ('T') is replaced with uracil ('U'). 

In [9]:
def transcribe(dna_sequence):
    """
    Computes the RNA strand produced from a given DNA sequence. 
    
    Parameters
    ----------
    dna_sequence : string
        The DNA sequence (coding strand) we are evaluating. 

    Returns
    -------
    rna_strand : string
        The result of replacing all instances of 'T' with a 'U' in the given DNA sequence. 
    """
    # Use the python function replace to replace instances of 'T' with 'U'
    rna_strand = dna_sequence.replace('T', 'U')
    
    return rna_strand

In [10]:
## TEST for transcribe
sample_dna = 'TTCCATA'
rna_strand = transcribe(sample_dna)

assert rna_strand == 'UUCCAUA'
print('DNA strand: ' + sample_dna + '\nRNA strand: ' + rna_strand)

DNA strand: TTCCATA
RNA strand: UUCCAUA
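
In the cell, RNA polymerase reads one DNA strand (the template strand) and builds an RNA strand that is complementary to it, pairing adenine in the template with uracil in the RNA. The sketch below models that view. The function name `transcribe_from_template` and the assumption that the input is a template strand written 3' to 5' are illustrative choices, not part of the notebook.

```python
# DNA template base -> complementary RNA base
RNA_PAIRING = {'A': 'U', 'T': 'A', 'G': 'C', 'C': 'G'}

def transcribe_from_template(template_strand):
    """Build the RNA complementary to a DNA template strand (hypothetical helper)."""
    return ''.join(RNA_PAIRING[base] for base in template_strand)

print(transcribe_from_template('TACGGT'))  # AUGCCA
```

The result matches the coding strand for this template ('ATGCCA') with each 'T' written as 'U'.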


# Translation: translating RNA into Protein
Translation is the process of reading an RNA sequence in order to produce a protein, which is made up of amino acids. In an RNA sequence, a codon, which is made up of 3 nucleotides, translates into an amino acid. For this reason, we read a sequence of RNA in triplets. 

In [11]:
def translate(rna_sequence):
    """
    Compute the protein that will be produced by the given RNA sequence. 
    
    Parameters
    ----------
    rna_sequence : string
        The RNA strand we are evaluating. 

    Returns
    -------
    protein : string
        The resulting protein from evaluating the codons in the RNA strand. 
    """
    
    rna_codon = { "UUU" : "F", "CUU" : "L", "AUU" : "I", "GUU" : "V",
                  "UUC" : "F", "CUC" : "L", "AUC" : "I", "GUC" : "V",
                  "UUA" : "L", "CUA" : "L", "AUA" : "I", "GUA" : "V",
                  "UUG" : "L", "CUG" : "L", "AUG" : "M", "GUG" : "V",
                  "UCU" : "S", "CCU" : "P", "ACU" : "T", "GCU" : "A",
                  "UCC" : "S", "CCC" : "P", "ACC" : "T", "GCC" : "A",
                  "UCA" : "S", "CCA" : "P", "ACA" : "T", "GCA" : "A",
                  "UCG" : "S", "CCG" : "P", "ACG" : "T", "GCG" : "A",
                  "UAU" : "Y", "CAU" : "H", "AAU" : "N", "GAU" : "D",
                  "UAC" : "Y", "CAC" : "H", "AAC" : "N", "GAC" : "D",
                  "UAA" : "STOP", "CAA" : "Q", "AAA" : "K", "GAA" : "E",
                  "UAG" : "STOP", "CAG" : "Q", "AAG" : "K", "GAG" : "E",
                  "UGU" : "C", "CGU" : "R", "AGU" : "S", "GGU" : "G",
                  "UGC" : "C", "CGC" : "R", "AGC" : "S", "GGC" : "G",
                  "UGA" : "STOP", "CGA" : "R", "AGA" : "R", "GGA" : "G",
                  "UGG" : "W", "CGG" : "R", "AGG" : "R", "GGG" : "G"}
    
    # Contains the three stop codons that terminate the translation process
    stop_codon = ['UAA', 'UAG', 'UGA']
    
    # String that stores the protein being formed
    protein = ''
    
    # Iterate through the RNA sequence in steps of 3 to analyze one codon at a time.
    # Stopping at len - 2 skips a trailing fragment shorter than a full codon.
    for i in range(0, len(rna_sequence) - 2, 3):
        # Look at the next 3 nucleotides from our starting point
        codon = rna_sequence[i:i + 3]
        
        # This condition terminates translation if the codon is one of the three termination codons
        if codon in stop_codon:
            break
        
        # Add the amino acid for the current codon to our protein string
        protein += rna_codon[codon]
        
    return protein

In [12]:
## TEST for translate
sample_rna = 'AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA'

protein = translate(sample_rna)
assert protein == 'MAMAPRTEINSTRING'
print("RNA strand: " + sample_rna + "\nProtein: " + protein)

RNA strand: AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
Protein: MAMAPRTEINSTRING
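
Reading a sequence in triplets can also be expressed as a separate chunking step. The sketch below is a hypothetical helper named `split_codons`. It assumes the reading frame starts at the first nucleotide, and it drops any trailing fragment shorter than a full codon, since such a fragment cannot be looked up in a codon table.

```python
def split_codons(rna_sequence):
    """Split an RNA sequence into complete 3-nucleotide codons (hypothetical helper)."""
    # Largest multiple of 3 that fits in the sequence
    usable_length = len(rna_sequence) - len(rna_sequence) % 3
    return [rna_sequence[i:i + 3] for i in range(0, usable_length, 3)]

print(split_codons('AUGGCCUGAAU'))  # ['AUG', 'GCC', 'UGA']
```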


The end.