# Introduction to Biology with Python 

> Adapted from Fatih Enes Kemal Ergin (https://notebook.community/eneskemalergin/OldBlog/_oldnotebooks/Biology_with_Python)

## Overview

In this tutorial, we will give a brief introduction to biological concepts with theory and implementation in Python.

This tutorial is designed for programmers with some basic knowledge of biology. If you're new to biology, we recommend checking out the [Best Resources To Learn Molecular Biology For A Computer Scientist](https://www.biostars.org/p/3066/) before proceeding. Those with a strong background in biology can skip directly to the implementation sections.

## Learning Objectives

- Understand basic biological concepts such as DNA, RNA, and proteins
- Learn about transcription and translation processes
- Implement DNA to protein translation in Python
- Estimate molecular weight of biological sequences

## Prerequisites

- Basic understanding of Python programming
- Familiarity with Jupyter notebooks

## Get Started

We will use Python to perform the following tasks:

- Translate DNA sequences into protein sequences
- Estimate molecular weight of DNA, RNA, and protein sequences


## DNA, RNA, Protein

### DNA 

DNA, or deoxyribonucleic acid, is the fundamental building block of life. It contains the information cells need to synthesize proteins and replicate themselves. In short, DNA serves as the storage repository for the information required for any cell to function.

DNA is composed of four nucleotide bases: Adenine (A), Guanine (G), Thymine (T), and Cytosine (C). A DNA sequence typically looks like this: `ATTGCTGAAGGTGCGG`.

The length of DNA is measured by the number of base pairs it contains, usually expressed in kilobase pairs (kbp) or megabase pairs (Mbp). In the double-helical structure of DNA, each base pairs with its complement on the opposite strand: Adenine (A) pairs with Thymine (T), and Guanine (G) pairs with Cytosine (C).
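
The pairing rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original tutorial: the helper names `complement` and `reverse_complement` are hypothetical, and the second helper assumes the usual convention that the opposite strand is read in the reverse direction.

```python
# Watson-Crick pairing rules: A <-> T and G <-> C.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Return the base-by-base complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Return the complement read in the opposite direction."""
    return complement(dna)[::-1]


example = "ATTGCTGAAGGTGCGG"
print(complement(example))          # TAACGACTTCCACGCC
print(reverse_complement(example))  # CCGCACCTTCAGCAAT
```

Applying `reverse_complement` twice returns the original sequence, which is a quick sanity check for this kind of helper.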

DNA is organized into tightly wound structures called **chromosomes**; humans have 23 pairs. Each chromosome contains many **genes**, which are smaller segments of DNA that carry the information for making specific molecules such as proteins.

Determining the arrangement of the four nucleotide bases in DNA is called **DNA sequencing**. There are various methods for sequencing DNA, often performed using specialized machines or through a process called gel electrophoresis.

### RNA

RNA is somewhat similar to DNA; both are nucleic acids composed of nitrogen-containing bases linked by a sugar-phosphate backbone. However, structural and functional differences distinguish RNA from DNA.

Structurally, RNA is single-stranded, whereas DNA is double-stranded. DNA contains the base Thymine (T), while RNA contains Uracil (U) instead. Additionally, RNA nucleotides include the sugar ribose, unlike DNA, which contains deoxyribose.

Functionally, DNA serves as the repository for protein-encoding information, while RNA uses this information to facilitate the synthesis of specific proteins within the cell.


### Proteins

Proteins are essential to all life forms: they catalyze chemical reactions, provide structure, and carry signals within and between cells.

Proteins are sequences composed of 20 different amino acids, linked together by peptide bonds. The structure of a protein directly influences its chemical activity within an organism. Most proteins adopt a three-dimensional (3D) structure, which enables them to perform a wide variety of functions in living organisms. The complexity of protein structure makes it a challenging yet fascinating subject for research.

In computer science, proteins are represented as sequences of 20 distinct letters, with each letter corresponding to a specific amino acid. Amino acids linked together in a protein chain are often referred to as residues. The term "residue" is commonly used when referring to a specific amino acid at a particular position within a protein chain.
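
As a small sketch of this representation, the snippet below stores a peptide as a string of one-letter codes and looks up the residue at a given position. The helper name `residue_at` and the sample peptide are illustrative assumptions; positions are counted from 1, as is conventional when numbering residues.

```python
# The 20 standard amino acids, written with their one-letter codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def residue_at(protein, position):
    """Return the residue at a 1-based position in a protein string."""
    if not 1 <= position <= len(protein):
        raise IndexError(f"position {position} is outside 1..{len(protein)}")
    return protein[position - 1]


peptide = "MVHLTPEEK"  # short sample peptide, for illustration only
print(residue_at(peptide, 1))  # M
print(all(letter in AMINO_ACIDS for letter in peptide))  # True
```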

### Transcription

The process of reading DNA and creating RNA from it is called *transcription*. On the computational side, we will use the same DNA sequence and replace the DNA nucleotide "T" with "U". The Python code `dna.replace("T", "U")` accomplishes this.

- The DNA strand that has the same sequence as the RNA is called the **coding strand**.
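
A minimal sketch of transcription on strings follows. `transcribe` applies the replacement described above to a coding strand. `transcribe_template` is an added illustration that goes slightly beyond the text: it assumes you are given the opposite (template) strand instead, so it pairs each base with its RNA complement and reverses the result. Both function names are hypothetical.

```python
def transcribe(coding_strand):
    """Transcribe a coding-strand DNA string into RNA (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def transcribe_template(template_strand):
    """Build the RNA from the template strand instead.

    Each DNA base is paired with its RNA complement (A->U, C->G, G->C, T->A),
    and the result is reversed so it reads in the same direction as the
    coding strand.
    """
    pairs = str.maketrans("ACGT", "UGCA")
    return template_strand.upper().translate(pairs)[::-1]


print(transcribe("ATGGTG"))           # AUGGUG
print(transcribe_template("CACCAT"))  # AUGGUG
```

Both calls produce the same RNA because `CACCAT` is the reverse complement of `ATGGTG`.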

### Translation

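RNA molecules that go on to specify the amino acid sequence of a protein are known as messenger RNAs (mRNA); the process of reading an mRNA to build a protein is called **translation**.

- Translation reads the mRNA in consecutive, non-overlapping groups of *three* bases. Each group is called a **codon** and specifies one amino acid or a stop signal.
- Before translation, some regions of an RNA chain are removed (spliced out). The removed regions are called **introns**, while those that remain are called **exons**.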
- The presence of introns makes it significantly more challenging to determine which parts of a gene are actually used to produce protein sequences.
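
The two ideas above, splicing and reading codons, can be sketched as follows. This assumes the exon coordinates are already known, which is the hard part in practice. The function names, the sample sequence, and the coordinates are made up for illustration.

```python
def splice(pre_mrna, exons):
    """Join exon regions given as (start, end) pairs (0-based, end exclusive)."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def codons(rna):
    """Split an RNA string into complete codons, dropping any trailing bases."""
    return [rna[i : i + 3] for i in range(0, len(rna) - 2, 3)]


# Hypothetical transcript: exon (6 bases) + intron (6 bases) + exon (6 bases)
pre_mrna = "AUGGUG" + "GUAAGU" + "CAUCUG"
mrna = splice(pre_mrna, [(0, 6), (12, 18)])
print(mrna)          # AUGGUGCAUCUG
print(codons(mrna))  # ['AUG', 'GUG', 'CAU', 'CUG']
```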


## Using biological sequences in computing

### Important tips

- Fit the sequences into appropriate data structures, ensuring they are reusable.
- The simplest and most common approach is to store sequences as text:
    - DNA and RNA can be represented as strings composed of 4 distinct characters:
        - DNA: A, C, G, T
        - RNA: A, C, G, U
    - Proteins can be represented as strings composed of 20 distinct characters, each corresponding to a specific amino acid.
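
One way to make string-based sequences safer to reuse is to check them against the expected alphabet before processing. The sketch below is an assumption about how such a check could look; `ALPHABETS` and `invalid_letters` are hypothetical names, not part of the original tutorial.

```python
# Allowed letters for each molecule type, as listed above.
ALPHABETS = {
    "DNA": set("ACGT"),
    "RNA": set("ACGU"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def invalid_letters(seq, mol_type):
    """Return the sorted letters in seq that are not valid for mol_type."""
    return sorted(set(seq.upper()) - ALPHABETS[mol_type])


print(invalid_letters("ATGGTGCAT", "DNA"))  # []
print(invalid_letters("AUGGUGCAU", "DNA"))  # ['U']
```

An empty list means the sequence uses only letters from the chosen alphabet.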


### 1. Translating a DNA sequence into protein

In this example, we will translate a given DNA sequence into a protein using a predefined dictionary that maps each codon to an amino acid (aa). For now, we focus solely on the core concept of translation: reading starts at the first base and ends at the first stop codon, without searching for a start codon or handling other special cases.

In [None]:
# The standard genetic code: a dictionary mapping each RNA codon to the
# three-letter name of its amino acid (None marks a stop codon)
STANDARD_GENETIC_CODE = {
    "UUU": "Phe",  # Phenylalanine
    "UUC": "Phe",  # Phenylalanine
    "UCU": "Ser",  # Serine
    "UCC": "Ser",  # Serine
    "UAU": "Tyr",  # Tyrosine
    "UAC": "Tyr",  # Tyrosine
    "UGU": "Cys",  # Cysteine
    "UGC": "Cys",  # Cysteine
    "UUA": "Leu",  # Leucine
    "UCA": "Ser",  # Serine
    "UAA": None,   # Stop codon
    "UGA": None,   # Stop codon
    "UUG": "Leu",  # Leucine
    "UCG": "Ser",  # Serine
    "UAG": None,   # Stop codon
    "UGG": "Trp",  # Tryptophan
    "CUU": "Leu",  # Leucine
    "CUC": "Leu",  # Leucine
    "CCU": "Pro",  # Proline
    "CCC": "Pro",  # Proline
    "CAU": "His",  # Histidine
    "CAC": "His",  # Histidine
    "CGU": "Arg",  # Arginine
    "CGC": "Arg",  # Arginine
    "CUA": "Leu",  # Leucine
    "CUG": "Leu",  # Leucine
    "CCA": "Pro",  # Proline
    "CCG": "Pro",  # Proline
    "CAA": "Gln",  # Glutamine
    "CAG": "Gln",  # Glutamine
    "CGA": "Arg",  # Arginine
    "CGG": "Arg",  # Arginine
    "AUU": "Ile",  # Isoleucine
    "AUC": "Ile",  # Isoleucine
    "ACU": "Thr",  # Threonine
    "ACC": "Thr",  # Threonine
    "AAU": "Asn",  # Asparagine
    "AAC": "Asn",  # Asparagine
    "AGU": "Ser",  # Serine
    "AGC": "Ser",  # Serine
    "AUA": "Ile",  # Isoleucine
    "ACA": "Thr",  # Threonine
    "AAA": "Lys",  # Lysine
    "AGA": "Arg",  # Arginine
    "AUG": "Met",  # Methionine (Start codon)
    "ACG": "Thr",  # Threonine
    "AAG": "Lys",  # Lysine
    "AGG": "Arg",  # Arginine
    "GUU": "Val",  # Valine
    "GUC": "Val",  # Valine
    "GCU": "Ala",  # Alanine
    "GCC": "Ala",  # Alanine
    "GAU": "Asp",  # Aspartic acid
    "GAC": "Asp",  # Aspartic acid
    "GGU": "Gly",  # Glycine
    "GGC": "Gly",  # Glycine
    "GUA": "Val",  # Valine
    "GUG": "Val",  # Valine
    "GCA": "Ala",  # Alanine
    "GCG": "Ala",  # Alanine
    "GAA": "Glu",  # Glutamic acid
    "GAG": "Glu",  # Glutamic acid
    "GGA": "Gly",  # Glycine
    "GGG": "Gly",  # Glycine
}

# Predefined DNA sequence; we will use it throughout the tutorial.
dna_seq = "ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTG"
print("Input DNA: ")
print(dna_seq)

print("Output RNA:")
# Convert the DNA sequence to RNA by replacing 'T' with 'U'
# This is because RNA uses uracil (U) instead of thymine (T)
rna_seq = dna_seq.replace("T", "U")
print(rna_seq)


def protein_translation(seq, genetic_code):
    """Translate a nucleic acid sequence into a protein sequence.

    Translation proceeds until the end of the sequence or until a stop codon is reached.

    Args:
        seq (str): The nucleic acid sequence to translate.
        genetic_code (dict): The genetic code dictionary.

    Returns:
        list: The translated protein sequence as a list of amino acids.

    """
    # Convert any DNA input to RNA so that codons match the genetic code keys
    seq = seq.replace("T", "U")
    protein_seq = []  # Amino acids collected so far

    i = 0
    while i + 2 < len(seq):
        # Take the next codon (three letters)
        codon = seq[i : i + 3]
        # Look up the amino acid. .get returns None for stop codons and also
        # for codons missing from the table, instead of raising KeyError
        amino_acid = genetic_code.get(codon)
        # Stop translating at a stop codon (UAA, UAG, UGA) or an unrecognized codon
        if amino_acid is None:
            break
        # Otherwise, add the amino acid to protein_seq list
        protein_seq.append(amino_acid)
        i += 3  # Move to the next codon

    return protein_seq


print("Output Protein:")
# Translate the DNA sequence; the function converts it to RNA internally
print(protein_translation(dna_seq, STANDARD_GENETIC_CODE))
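
The translation above returns three-letter amino acid names, while protein sequences are usually stored as one-letter strings. A possible bridge between the two is sketched below. `THREE_TO_ONE` and `to_one_letter` are hypothetical names introduced for illustration; the mapping itself uses the standard one-letter codes.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Convert a list of three-letter names into a one-letter protein string."""
    return "".join(THREE_TO_ONE[name] for name in residues)


print(to_one_letter(["Met", "Val", "His", "Leu", "Thr"]))  # MVHLT
```

A list in the format returned by `protein_translation` can be passed directly to `to_one_letter`.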


### 2. Estimating the Molecular Weight

This example estimates the molecular weight of a DNA, RNA, or protein molecule in units of Daltons. Note that this is only an approximation, as residues can reversibly bind hydrogen ions under varying conditions (e.g., pH affects the binding of H⁺ ions to acidic and basic sites). Additionally, the calculation assumes standard isotopic proportions.

#### Steps

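- Define a function with two arguments: the sequence (`seq`) and the molecule type (`mol_type`).
- Inside the function, create a dictionary to store the average molecular weights of the different residues.
- Initialize a variable to accumulate the total molecular mass, starting from the mass of the terminal H and OH groups.
- Loop over the sequence, add the mass of each residue to the total, and return the result.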
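
Before looking at the full function, here is the same arithmetic worked on a three-residue peptide. It is a sketch under simple assumptions: only glycine and alanine are included, with residue masses taken from the protein table used in this section, and `RESIDUE_MASS` and `END_MASS` are illustrative names.

```python
# Average residue masses in daltons for glycine (G) and alanine (A).
RESIDUE_MASS = {"G": 57.05, "A": 71.07}
END_MASS = 18.02  # one H and one OH at the ends of the chain

peptide = "GAG"
# 18.02 + 57.05 + 71.07 + 57.05
total = END_MASS + sum(RESIDUE_MASS[residue] for residue in peptide)
print(round(total, 2))  # 203.19
```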

In [None]:
# Estimate the molecular mass of a sequence; mol_type defaults to "protein"
def estimate_mol_mass(seq, mol_type="protein"):
    """Calculate the molecular weight of a biological sequence.

    Assumes normal isotopic ratios and protonation/modification states.

    Args:
        seq (str): The biological sequence to calculate the molecular weight for.
        mol_type (str, optional): The type of molecule, either "DNA", "RNA", or
          "protein". Defaults to "protein".

    Returns:
        float: The molecular weight of the biological sequence in daltons.

    """
    # Define a dictionary with the molecular masses for DNA, RNA, and protein residues
    residue_masses = {
        "DNA": {"G": 329.21, "C": 289.18, "A": 313.21, "T": 304.19},  # Masses of DNA nucleotides
        "RNA": {"G": 345.21, "C": 305.18, "A": 329.21, "U": 306.17},  # Masses of RNA nucleotides
        "protein": {  # Masses of amino acids (protein residues)
            "A": 71.07,
            "R": 156.18,
            "N": 114.08,
            "D": 115.08,
            "C": 103.10,
            "Q": 128.13,
            "E": 129.11,
            "G": 57.05,
            "H": 137.14,
            "I": 113.15,
            "L": 113.15,
            "K": 128.17,
            "M": 131.19,
            "F": 147.17,
            "P": 97.11,
            "S": 87.07,
            "T": 101.10,
            "W": 186.20,
            "Y": 163.17,
            "V": 99.13,
        },
    }

    # Get the corresponding mass dictionary based on the specified molecule type (DNA, RNA, or protein)
    mass_dict = residue_masses[mol_type]
    # Start with the mass of the extra end atoms (H + OH)
    mol_mass = 18.02

    # Loop through each letter in the biological sequence (DNA, RNA, or protein)
    for letter in seq:
        # Add the mass of the current residue to the total molecular mass
        # If the residue is not found in the dictionary, add 0.0 (for unknown or invalid residues)
        mol_mass += mass_dict.get(letter, 0.0)

    # Return the calculated molecular mass
    return mol_mass


# Test Case 1: Calculate the molecular weight of a DNA sequence
print("DNA: " + dna_seq)
# Call the function with the dna_seq variable and "DNA" as the molecule type
print("Weight: ", estimate_mol_mass(dna_seq, mol_type="DNA"), " daltons")

# Test Case 2: Calculate the molecular weight of an RNA sequence
print("RNA: " + rna_seq)
# Call the function with the rna_seq variable and "RNA" as the molecule type
print("Weight: ", estimate_mol_mass(rna_seq, mol_type="RNA"), " daltons")

# Test Case 3: Calculate the molecular weight of a protein sequence
# Define a sample protein sequence
protein_seq = "IRTNGTHMQPLLKLMKFQKFLLELFTLQKRKPEKGYNLPIISLNQ"
print("Protein: " + protein_seq)
# Call the function with the protein_seq variable and the default value "protein"
print("Weight: ", estimate_mol_mass(protein_seq), " daltons")
print()
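
The function above silently adds 0.0 for any letter it does not recognize, so a typo in the sequence lowers the estimate without warning. If you prefer to fail loudly, a stricter variant could look like the sketch below. `total_mass_strict` and `DEMO_MASSES` are hypothetical names, and the small mass table is only for demonstration.

```python
def total_mass_strict(seq, mass_dict, end_mass=18.02):
    """Sum residue masses, raising ValueError if any letter is unknown."""
    unknown = sorted(set(seq) - set(mass_dict))
    if unknown:
        raise ValueError(f"Unknown residues: {', '.join(unknown)}")
    return end_mass + sum(mass_dict[letter] for letter in seq)


# Demonstration table with two amino acid residues only.
DEMO_MASSES = {"G": 57.05, "A": 71.07}

print(round(total_mass_strict("GAG", DEMO_MASSES), 2))  # 203.19

try:
    total_mass_strict("GXG", DEMO_MASSES)
except ValueError as err:
    print(err)  # Unknown residues: X
```

Which behavior is better depends on the data: ignoring unknown letters tolerates ambiguity codes, while raising an error catches mistakes early.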

## Conclusion
In this module, we covered basic biological concepts such as DNA, RNA, and proteins, learned about the transcription and translation processes, implemented DNA-to-protein translation in Python, and estimated the molecular weight of biological sequences.

## Clean up
Remember to shut down your Jupyter Notebook instance when you're done to avoid unnecessary charges. You can do this by stopping the notebook instance from the Amazon SageMaker console.