# BRITE REU Python Workshop
### Instructor: Dakota Hawkins


## Overview

Protein synthesis generally follows what has been termed "The Central
Dogma of Molecular Biology": DNA is transcribed into RNA, and RNA is
then translated into protein. Here is a useful source if you need a
quick refresher
(https://www.nature.com/scitable/topicpage/translation-dna-to-mrna-to-protein-393).
In today's workshop we will write a small Python script to simulate
this process by reading a DNA sequence from a FASTA file, transcribing
the sequence to mRNA, translating the computed mRNA strand to amino
acids, and finally writing the protein sequence to another FASTA file.
This workshop is intended to synthesise what we have learned about
Python so far.

For this workshop you will be working with a partner in small teams. The
groups will be used as a means to facilitate discussion (e.g. "How can
we structure this function?"), while you and your partner will help each
other implement the code. Partners should choose a single computer to
write the code with. While a single person will be "driving" at a time,
both partners are expected to converse and contribute. Likewise, no one
person should be driving for the entire workshop: make sure to switch
semi-regularly to ensure each person is getting the same out of the
workshop. Please ensure each partner has a working copy of the completed
Jupyter Notebook after the workshop is complete.

This notebook includes skeleton methods for all of the different Python
functions we'll need: **``read_fasta()``**, **``write_fasta()``**,
**``read_codon_table()``**, **``transcribe()``**, **``translate()``**,
and **``main()``**. While these functions *should* encompass all of the
functions we'll need, feel free to write your own helper functions if
you deem it necessary. Similarly, if you'd rather eschew the structure I
provided -- whether combining previously separated functions, changing
passed arguments, etc. -- feel free to do so. The only requirement is
that both partners are on board with the change and the final product
produces the same output. The skeleton code is mainly there to provide a
starting structure so the code is easier to jump into.



### Read FASTA Files:

In [6]:
def read_fasta(fasta_file):
    """
    Retrieve DNA or protein sequence data from a FASTA file.

    Arguments:
        fasta_file (string): path to FASTA file.
    Returns:
        (string): DNA or protein sequence found in `fasta_file`.
    """
    sequence = ''
    with open(fasta_file, 'r') as file:
        for line in file:
            if not line.startswith('>'):  # Skip header lines starting with '>'
                sequence += line.strip()
    return sequence

# Path to the FASTA file
fasta_path = r"C:\Users\Dell\OneDrive\Desktop\PB\tuts\tut4\human_notch.fasta"

# Read the sequence from the FASTA file
sequence = read_fasta(fasta_path)

# Print the sequence (for testing)
print(sequence)


ATGCCGCCGCTCCTGGCGCCCCTGCTCTGCCTGGCGCTGCTGCCCGCGCTCGCCGCACGAGGTAGGCGCCCACCCACCCGCGAGCCCCCACTTTCCGCGCCCTTTGGAAACTTTGGCGGCGCCCGGCGCGCGCGCCCCACGGCTGGGAGCGGGCGGCGGGGAGGCCAGCATGGAGAGGGAAAAGCGGGCGGCCCGGGGCGTGGGGTTCTGGAGTCCCGGGATCAGGGAGGACCGACCTTCCCCCTCGATCCCCCCGTGGAGGCGGACTCGCGCCGCCCGTGCCTGGAGCCGAGTTAGGAGGCCGGTGTGGGGTGCTGGGGCCCCGGAGGCCCTACTCCGGGCCCGCCCTTCACCCGCCGCGCGTGGGGCTTGCCGCCGGTCGGCCGGGCGGGCGGGCTGCCTACTATTTTTCGATTTGAATAGAGTCGGTTTTGGTTTCCTGTTGCTTCTCCGGGCCATTTATCTTCTTTCTTCTTCGCCTCTGGCCCACGCCGGGGCGGATGTTGGGGCGCGGAGTGTGGGCTCTGCGGCGCCGCGTTCGCCTTCACTGACCCGCGCGGCCGGGCTGGGTCCCCGGGCTCCCGGTCGCCCCGCCCGCCGGTGCCCCCCAGCCCGGCTCTCAGTTTGGGGGAGGGGTTGCGTAAGAAGCCGCCGCGCCCGGGGGGACTGAACTTTCCTTTTGCTTTGCGGAGTTGAAGTTTGGAAAGCTTGGGGGCGGAGAGCGGGACGCGGGTGGGGGGCTCTTACATTTCTCCCCGCCGCACAGCGAGCGGGGTCTCTGGGGAATCGAGTGATTAATCCACTCTTTCTCCGAGAGTTGGAGGCGAGAATTATCTGTCCTCTTCCAGAAAGTGCGGCTCTGTGTCACACCCCCCTCCCCCGTTTCTCAGCCCCGGTAAGATGGGGAGGGAGGGGCTTGAGTAATTGATCCCTTCTCGAGATGGGGTCGAATTCCTTCCGAATGGGGGACCTTCATCCCCCTCCTGTGGGTGTATGGGGG
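
Note that `read_fasta()` above skips every header line and joins all remaining lines, so a FASTA file with several records would come back as one long string. The following is a minimal sketch of a variant that keeps records separate. The name `read_fasta_records` and the demo file contents are hypothetical and are not part of the workshop skeleton.

```python
import os
import tempfile


def read_fasta_records(fasta_file):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    with open(fasta_file, 'r') as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith('>'):
                header = line[1:]  # start a new record
                records[header] = ''
            elif header is not None:
                records[header] += line  # extend the current record
    return records


# Self-contained demonstration with a temporary two-record file (made-up data).
with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
    tmp.write('>seq1 example\nATGC\nGGTA\n>seq2 example\nTTAA\n')
    demo_path = tmp.name

demo_records = read_fasta_records(demo_path)
os.remove(demo_path)
print(demo_records)
```

For a single-record file such as `human_notch.fasta`, the one value in the returned dictionary should match what `read_fasta()` returns.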

### Read `codon_table.csv`:

In [15]:
import csv

def read_codon_table(codon_table):
    """
    Create a dictionary that maps RNA codons to amino acids.

    Constructs dictionary by reading a .csv file containing codon to amino
    acid mappings.

    Arguments:
        codon_table (string): path to the .csv file containing codon to
            amino acid mappings. The file must contain a 'Codon' column and
            an 'AA.Code' column holding the amino acid codes.
    Returns:
        (dictionary, string:string): dictionary with codons as keys and amino acid codes
            as values.
    """
    codon_dict = {}
    with open(codon_table, 'r') as file:
        csv_reader = csv.DictReader(file)
        for row in csv_reader:
            # Assuming the column containing the amino acid codes is named 'AA.Code'
            # Change this to the actual column name in your CSV file
            codon_dict[row['Codon']] = row['AA.Code']
    return codon_dict

# Path to the codon table CSV file
codon_table_path = r"C:\Users\Dell\OneDrive\Desktop\PB\tuts\tut4\codon_table.csv"

# Read the codon table
codon_table = read_codon_table(codon_table_path)

# Print the codon table (for testing)
print(codon_table)


{'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L', 'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L', 'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'AUG': 'M', 'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V', 'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S', 'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P', 'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T', 'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A', 'UAU': 'Y', 'UAC': 'Y', 'UAA': 'Stop', 'UAG': 'Stop', 'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q', 'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K', 'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E', 'UGU': 'C', 'UGC': 'C', 'UGA': 'Stop', 'UGG': 'W', 'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'AGU': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R', 'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G'}


In [8]:
actual_codon_count = len(codon_table)
actual_codon_count

64
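
A count of 64 is what we expect, since there are 4 x 4 x 4 possible three-base codons. A count alone does not tell us *which* codons are present, though. As a rough sketch, the check below lists any codons missing from a table. The helper name `missing_codons` and the tiny `partial_table` are made up for illustration.

```python
from itertools import product


def missing_codons(codon_dict):
    """Return the sorted RNA codons that are absent from `codon_dict`."""
    # Build every possible triplet over the four RNA bases (4**3 = 64).
    all_codons = {''.join(triplet) for triplet in product('ACGU', repeat=3)}
    return sorted(all_codons - set(codon_dict))


# Hypothetical, deliberately incomplete table used only for illustration.
partial_table = {'AUG': 'M', 'UAA': 'Stop'}
print(len(missing_codons(partial_table)))
```

Calling `missing_codons(codon_table)` on the full table read above should return an empty list if every codon is covered.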

### Transcribe DNA to RNA:

In [9]:
# read_fasta() is reused from the "Read FASTA Files" cell above.

def transcribe(dna_seq):
    """
    Transcribe a DNA sequence to an RNA sequence.

    Arguments:
        dna_seq (string): DNA sequence to transcribe to RNA.
    Returns:
        (string): transcribed RNA sequence from `dna_seq`.
    """
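    # Map each DNA base to its complementary RNA base. This pairs bases as if
    # `dna_seq` were the template strand; for a coding-strand sequence the
    # mRNA is instead obtained by replacing 'T' with 'U'.
    dna_to_rna = {'A': 'U', 'T': 'A', 'G': 'C', 'C': 'G'}

    # Transcribe the DNA sequence to RNA, one base at a time
    rna_seq = ''.join(dna_to_rna[base] for base in dna_seq)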

    return rna_seq

# Path to the FASTA file containing DNA sequence
fasta_path = r"C:\Users\Dell\OneDrive\Desktop\PB\tuts\tut4\human_notch.fasta"

# Read the DNA sequence from the FASTA file
dna_sequence = read_fasta(fasta_path)
print(dna_sequence)

# Transcribe the DNA sequence to RNA
rna_sequence = transcribe(dna_sequence)

# Print the transcribed RNA sequence
print(rna_sequence)

ATGCCGCCGCTCCTGGCGCCCCTGCTCTGCCTGGCGCTGCTGCCCGCGCTCGCCGCACGAGGTAGGCGCCCACCCACCCGCGAGCCCCCACTTTCCGCGCCCTTTGGAAACTTTGGCGGCGCCCGGCGCGCGCGCCCCACGGCTGGGAGCGGGCGGCGGGGAGGCCAGCATGGAGAGGGAAAAGCGGGCGGCCCGGGGCGTGGGGTTCTGGAGTCCCGGGATCAGGGAGGACCGACCTTCCCCCTCGATCCCCCCGTGGAGGCGGACTCGCGCCGCCCGTGCCTGGAGCCGAGTTAGGAGGCCGGTGTGGGGTGCTGGGGCCCCGGAGGCCCTACTCCGGGCCCGCCCTTCACCCGCCGCGCGTGGGGCTTGCCGCCGGTCGGCCGGGCGGGCGGGCTGCCTACTATTTTTCGATTTGAATAGAGTCGGTTTTGGTTTCCTGTTGCTTCTCCGGGCCATTTATCTTCTTTCTTCTTCGCCTCTGGCCCACGCCGGGGCGGATGTTGGGGCGCGGAGTGTGGGCTCTGCGGCGCCGCGTTCGCCTTCACTGACCCGCGCGGCCGGGCTGGGTCCCCGGGCTCCCGGTCGCCCCGCCCGCCGGTGCCCCCCAGCCCGGCTCTCAGTTTGGGGGAGGGGTTGCGTAAGAAGCCGCCGCGCCCGGGGGGACTGAACTTTCCTTTTGCTTTGCGGAGTTGAAGTTTGGAAAGCTTGGGGGCGGAGAGCGGGACGCGGGTGGGGGGCTCTTACATTTCTCCCCGCCGCACAGCGAGCGGGGTCTCTGGGGAATCGAGTGATTAATCCACTCTTTCTCCGAGAGTTGGAGGCGAGAATTATCTGTCCTCTTCCAGAAAGTGCGGCTCTGTGTCACACCCCCCTCCCCCGTTTCTCAGCCCCGGTAAGATGGGGAGGGAGGGGCTTGAGTAATTGATCCCTTCTCGAGATGGGGTCGAATTCCTTCCGAATGGGGGACCTTCATCCCCCTCCTGTGGGTGTATGGGGG
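
The `transcribe()` function above swaps every base for its complement. The sequence printed here begins with `ATG`, the DNA form of the start codon, which suggests (though this is an assumption about the file) that it holds the coding strand. In that case the mRNA has the same sequence with `U` in place of `T`. A minimal sketch of that alternative, using the hypothetical name `transcribe_coding_strand`:

```python
def transcribe_coding_strand(dna_seq):
    """Transcribe a coding-strand DNA sequence by replacing T with U."""
    # Upper-casing first means lowercase FASTA input is handled too.
    return dna_seq.upper().replace('T', 'U')


demo_rna = transcribe_coding_strand('ATGCCGCCG')
print(demo_rna)  # AUGCCGCCG
```

Which version is appropriate depends on which strand the FASTA file stores, so it is worth discussing with your partner before choosing.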

### Translate RNA to Protein:

In [10]:
import csv

def translate(rna_sequence, codon_to_amino):
    """
    Translate an RNA sequence to an amino acid sequence.

    Arguments:
        rna_sequence (string): RNA sequence to translate to amino acid sequence.
        codon_to_amino (dict string:string): Mapping of three-nucleotide-long codons to
            amino acid codes.
    Returns:
        (string): Amino acid sequence of translated `rna_sequence` codons.
    """
    # Initialize an empty amino acid sequence
    amino_sequence = ""

    # Iterate over the RNA sequence in steps of 3 to extract codons
    for i in range(0, len(rna_sequence), 3):
        codon = rna_sequence[i:i+3]  # Extract the current codon
        # Check if the codon is in the codon-to-amino acid mapping dictionary
        if codon in codon_to_amino:
            amino_sequence += codon_to_amino[codon]  # Append the corresponding amino acid
        else:
            amino_sequence += '-'  # If codon is not found, use a placeholder '-'

    return amino_sequence

# Path to the codon table CSV file
codon_table_path = r"C:\Users\Dell\OneDrive\Desktop\PB\tuts\tut4\codon_table.csv"

# Read the codon table with read_codon_table(), defined in an earlier cell
codon_to_amino = read_codon_table(codon_table_path)


# Translate the RNA sequence to an amino acid sequence
amino_sequence = translate(rna_sequence, codon_to_amino)

# Print the translated amino acid sequence
print(amino_sequence)
print(rna_sequence)
print(len(amino_sequence))
print(len(rna_sequence))

YGGEDRGDETDRDDGRERRAPSAGGWALGGERRGKPLKPPRAARAGCRPSPAAPPVVPLPFRPPGPAPQDLRALVPPGWKGELGGHLRLSAAGTDLGSILRPHPTTPGPPGStopGPGGKWAARTPNGGQPARPPDGStopStopKAKLISAKTKGQRRGPVNRRKKKRRPGAAPPTTPRLTPETPRRKRKStopLGAPARPRGPRASGAGGHGGSGRESNPLPNAFFGGAGPPDLKGKRNASTSNLSNPRLSPCAHPPRMStopRGAACRSPQRPLSSLIRStopERGSQPPLLIDRRRSFTPRHSVGGGGKESGPFYPSLPELINStopGRALPQLKEGLPPGSRGRTPTYPRRDLYARAPPSITHSRLRPGAGVPRQGHGDLRRGStopGVTRRDStopDSASGRETTARPIARSSTAStopKGARPRLTPVGGRGAGRDGGTSCPNLSPLEPPTYPPGRDGLRSHSEVPPPVAKPLIGStopNRDEKGVTQVRRVVRWGGTKFVLHKGRRGEGGStopRLMSStopDPSYNCGETKVPPRRTPGHGVVGYYDSSHAGPIPRLLRGTDPRPGAGVPLVPEATTTTPTLTGEKRTGTDPGATRVGPLWTDLPPFTLRRLPCLRTHTHGGDALDNETAPVSGTEDVGASPVLStopSGNRRPLPGLRPVPVHDStopKStopPVSDRStopStopISGRStopYDAHLDQTDDSRTREEAPGGVRGPSDSPSGSWGStopEEIQGFFVNStopTVPFETWStopMTLRVGGGGTVEPRGGDPPTTEDPCPLRRRAPTTEDPPTTTENPTITTEETVSTTVEDPPTTEETHALRRTRPPLRRRPVSTTVEDPPTTEETHALRRRPVPTTVEDPPTTEETHHYGGDPPTTEETHHYGGDPPTTEEIRVHYEGPVHYGGGPPTTEKFPLTEKFStopPGRKPERQTGSREKGRSPTPTPSLPPStopVQNRTPHPINNSDQRQDRRRLRLEFStopGTGKTR
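
The `translate()` function above reads every codon from the first base to the last and writes `Stop` into the output wherever a stop codon occurs. A ribosome, by contrast, begins at a start codon (`AUG`) and releases the protein at the first in-frame stop codon. The sketch below mimics that behaviour under simplifying assumptions: it uses the first `AUG` it finds and ignores everything after the first stop. The name `translate_orf` and the small `demo_table` are hypothetical.

```python
def translate_orf(rna_sequence, codon_to_amino, stop_label='Stop'):
    """Translate from the first AUG up to, but excluding, the first stop codon."""
    start = rna_sequence.find('AUG')
    if start == -1:
        return ''  # no start codon, so nothing to translate
    residues = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(start, len(rna_sequence) - 2, 3):
        amino = codon_to_amino.get(rna_sequence[i:i + 3], '-')
        if amino == stop_label:
            break
        residues.append(amino)
    return ''.join(residues)


# Tiny made-up codon table and sequence, for illustration only.
demo_table = {'AUG': 'M', 'CCG': 'P', 'UAA': 'Stop', 'GGG': 'G'}
demo_protein = translate_orf('GGAUGCCGCCGUAAGGG', demo_table)
print(demo_protein)  # MPP
```

Here the leading `GG` is skipped, `AUG CCG CCG` is translated to `MPP`, and translation ends at `UAA`.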

**Calculating the length of the amino acid sequence**

In [11]:
# Calculate the length of the amino acid sequence without considering stop codons
length_without_stops = len(amino_sequence.replace('Stop', ''))

# Print the length of the amino acid sequence without stop codons
print("Length of amino acid sequence without stop codons:", length_without_stops)


Length of amino acid sequence without stop codons: 16720


**Number of stop codons**

In [12]:
# Count the number of stop codons in the amino acid sequence
num_stop_codons = amino_sequence.count('Stop')

# Print the number of stop codons
print("Number of stop codons:", num_stop_codons)

Number of stop codons: 398
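
The overview lists `write_fasta()` as the final step, writing the protein sequence to another FASTA file. One possible sketch is shown below. The signature (`sequence`, `output_file`, `desc`, `width`) is an assumption rather than the skeleton's definition, and the demo sequence and file name are made up.

```python
import os
import tempfile


def write_fasta(sequence, output_file, desc='', width=70):
    """Write `sequence` to `output_file` in FASTA format with a '>' header line."""
    with open(output_file, 'w') as handle:
        handle.write('>' + desc + '\n')
        # Break the sequence into fixed-width lines, as FASTA files usually do.
        for i in range(0, len(sequence), width):
            handle.write(sequence[i:i + width] + '\n')


# Self-contained demonstration: write a short made-up sequence and read it back.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_protein.fasta')
write_fasta('MPPLLAPLLC', demo_path, desc='demo protein', width=4)
with open(demo_path) as handle:
    written = handle.read()
print(written)
```

A `main()` function could then chain the steps in order: `read_fasta()`, `transcribe()`, `translate()`, and finally `write_fasta()`.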
