# Omics Tutorial

## Installation

Please run the cell block below:

In [2]:
!pip --quiet install biopython==1.75



# Omics

The branches of science known informally as omics are various disciplines in biology whose names end in the suffix -omics, such as genomics, proteomics, metabolomics, metagenomics, phenomics and transcriptomics. Omics aims at the collective characterization and quantification of pools of biological molecules that translate into the structure, function, and dynamics of an organism or organisms. [source](https://www.mdpi.com/2673-592X/2/1/9)

![image.png](attachment:b68c5639-fa71-4e3b-9fcd-971d083bb9c5.png)


The suffix -ome as used in molecular biology refers to a totality of some sort.

For more information: https://en.wikipedia.org/wiki/Omics


# There are many types of omics

1. **Genomics**: Study of the genome, the complete set of genes in an organism.

2. **Proteomics**: Study of the proteome, the entire collection of proteins in an organism's cells.

3. **Metabolomics**: Study of metabolism and the function and interactions of metabolic breakdown products, or metabolites.

4. **Transcriptomics**: Study of the full complement of RNA in an organism's cells.

5. **Lipidomics**: Study of lipids and pathways involved in lipid signaling.

6. **Epigenomics**: Study of the chemical modifications to DNA and histone proteins that regulate gene expression without changing the DNA sequence.

### ... there are actually 194 types of omics listed on the [Wikipedia](https://en.wikipedia.org/wiki/Omics) page.

This tutorial will primarily focus on **genomics** and **proteomics**.

# Why should you be aware of omics?

- Omics is a key enabler of personalized medicine.
- For the big picture, read: https://www.nature.com/articles/s41467-019-11461-w
- Key points from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9317225/pdf/ijerph-19-08758.pdf:
  - Many big omics projects kicked off with the human genome.
  - The One Health perspective and UN sustainable development.
  - The discovery of the human gut microbiome and its interactions with vital axes functioning in the human body [11].
  - Environmental exposure and the related interaction with host genetic factors may have an essential role in common chronic diseases [32]. This environmental impact on human health was introduced in 2005 and dubbed the exposome, encompassing all environmental exposures (non-genetic factors, including lifestyle factors) in the course of life from the prenatal period onwards [33,34].


# Machine learning and omics

big data

# 1. Biology fundamentals 

## What is DNA?

**DNA (Deoxyribonucleic Acid)** is the molecule that carries the genetic instructions for life. It is composed of two long strands that coil around each other to form a double helix. Each strand is made up of a sequence of nucleotides, which are the basic units of DNA. There are four types of nucleotides in DNA, distinguished by the nitrogenous bases they contain: adenine (A), thymine (T), cytosine (C), and guanine (G).

![image.png](attachment:0344b6bb-dc8f-4e5a-a8f6-a3d4cd9736a6.png)

Source: https://www.genome.gov/genetics-glossary/Deoxyribonucleic-Acid

### Key Points:
- **Double Helix Structure**: DNA's double helix is formed by two complementary strands that run in opposite directions.
- **Nucleotide Composition**: Each nucleotide consists of a sugar (deoxyribose), a phosphate group, and a nitrogenous base.
- **Base Pairing**: Adenine pairs with thymine (A-T) and cytosine pairs with guanine (C-G) through hydrogen bonds.
- **Genetic Information**: DNA stores genetic information that determines the development, functioning, growth, and reproduction of all living organisms and many viruses.
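
The base-pairing rule above means that one strand fully determines the other. The snippet below is a minimal sketch of that idea in plain Python; the function names `complement` and `reverse_complement` are illustrative choices, not part of any library used in this tutorial.

```python
# Watson-Crick base pairing: A pairs with T, and C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-paired partner of each nucleotide, in the same order."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand read in its own 5'-to-3' direction.

    The two strands run in opposite directions, so the complement is reversed.
    """
    return complement(strand)[::-1]


forward = "ATGCGT"
print(complement(forward))          # TACGCA
print(reverse_complement(forward))  # ACGCAT
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check when working with sequences from either strand.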

### Functions:
- **Genetic Blueprint**: DNA contains the instructions needed to build and maintain an organism.
- **Protein Synthesis**: DNA sequences (genes) are transcribed into RNA, which then directs the synthesis of proteins.

## What is RNA?

**RNA (Ribonucleic Acid)** is a single-stranded molecule involved in various roles within the cell, mainly related to the synthesis of proteins. RNA is similar to DNA but has some key differences: it contains the sugar ribose instead of deoxyribose, and the base uracil (U) replaces thymine (T).

![image.png](attachment:48f0783e-a628-45e2-8ccf-8485b793bde3.png)

Source: https://www.genome.gov/genetics-glossary/RNA-Ribonucleic-Acid

### Key Points:
- **Single-Stranded**: Unlike DNA, RNA is typically single-stranded.
- **Nucleotide Composition**: Each nucleotide in RNA consists of a sugar (ribose), a phosphate group, and a nitrogenous base (adenine (A), uracil (U), cytosine (C), and guanine (G)).
- **Types of RNA**:
  - **mRNA (Messenger RNA)**: Carries the genetic information from DNA to the ribosome, where proteins are synthesized.
  - **tRNA (Transfer RNA)**: Brings the appropriate amino acids to the ribosome during protein synthesis.
  - **rRNA (Ribosomal RNA)**: A component of ribosomes, which are the sites of protein synthesis.
  - And more ...

### Functions:
- **Transcription**: The process of copying a segment of DNA into RNA.
- **Translation**: The process where mRNA is decoded by ribosomes to produce a specific protein.
- **Gene Regulation**: Certain types of RNA (like miRNA and siRNA) are involved in regulating gene expression.
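
As a rough illustration of transcription at the sequence level, the mRNA has the same sequence as the DNA coding strand with uracil (U) in place of thymine (T). The sketch below assumes the input is the coding strand; the function names are hypothetical helpers written for this example.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def back_transcribe(mrna: str) -> str:
    """Return the DNA coding strand for an mRNA sequence (U becomes T)."""
    return mrna.upper().replace("U", "T")


dna = "ATGCGT"
mrna = transcribe(dna)
print(mrna)                   # AUGCGU
print(back_transcribe(mrna))  # ATGCGT
```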

## What is a Protein?

**Proteins** are large, complex molecules that play many critical roles in the body. They are made up of one or more chains of amino acids, which are linked together in a specific order determined by the sequence of nucleotides in the gene encoding the protein. Proteins are essential for the structure, function, and regulation of the body's tissues and organs.

![image.png](attachment:5bf0bbf4-f7ed-4fc4-9d34-31f57838233d.png)

Source: https://www.genome.gov/genetics-glossary/Protein

### Key Points:
- **Amino Acids**: The building blocks of proteins. There are 20 standard amino acids that combine to form proteins.
- **Structure Levels**:
  - **Primary Structure**: The sequence of amino acids in a polypeptide chain.
  - **Secondary Structure**: Local folding into structures like alpha-helices and beta-sheets.
  - **Tertiary Structure**: The overall three-dimensional shape of a single polypeptide chain.
  - **Quaternary Structure**: The structure formed by multiple polypeptide chains (subunits).

### Functions:
- **Enzymatic Activity**: Enzymes are proteins that catalyze biochemical reactions.
- **Structural Support**: Proteins like collagen provide structural support to cells and tissues.
- **Transport**: Proteins like hemoglobin transport molecules (e.g., oxygen) throughout the body.
- **Signaling**: Many hormones (e.g., insulin) and their receptors are proteins that facilitate communication between cells.
- **Immune Response**: Antibodies are proteins that help protect the body from pathogens.

## Putting it all together with ribosomes

![image.png](attachment:6c663865-ed02-474f-ac55-ac8c81ff0794.png)

Source: https://www.genome.gov/genetics-glossary/Ribosome

Proteins are synthesized through the processes of **transcription** (DNA to RNA) and **translation** (RNA to protein), highlighting the central role of DNA and RNA in protein production.


![image.png](attachment:b6cbc093-63f9-48c9-bfe0-0e1f6125b146.png)

Source: https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology

In general, we assume that coding sequences of DNA will propagate to proteins through ribosomes.

## The genetic code

The genetic code refers to the instructions contained in a gene that tell a cell how to make a specific protein. Each gene's code uses the four nucleotide bases of DNA, adenine (A), cytosine (C), guanine (G) and thymine (T), in various ways to spell out three-letter "codons" that specify which amino acid is needed at each position within a protein.

![image.png](attachment:9cb48981-10b0-4457-95b3-7940da3bf99d.png)

Source: https://www.genome.gov/genetics-glossary/Genetic-Code
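
To make the codon idea concrete, here is a small sketch that builds the standard genetic code as a Python dictionary and translates a DNA coding sequence three letters at a time. It uses only the standard library; the names `CODON_TABLE` and `translate` are illustrative, and the sketch assumes the sequence starts in the correct reading frame and contains only A, C, G, and T.

```python
from itertools import product

# Standard genetic code, listed with the bases ordered T, C, A, G at each
# codon position. "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon.

    Trailing bases that do not fill a complete codon are ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGCGACTACGATCGAGGGCCATGTGA"))  # MRLRSRAM*
```

Note that several codons map to the same amino acid (for example, all four `GC_` codons encode alanine), so the 64 codons cover only 20 amino acids plus the stop signal.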

# 2. Genomics file formats 

## FASTA File Format

The FASTA file format is a widely used format for representing nucleotide sequences (DNA or RNA) or peptide sequences (proteins).

For more information: https://en.wikipedia.org/wiki/FASTA_format

### Structure of a FASTA File

A FASTA file consists of multiple sequence entries. Each entry has two main components:

1. **Header Line**: Begins with a `>` character, followed by a sequence identifier and an optional description.
2. **Sequence Lines**: One or more lines that contain the actual sequence data (nucleotides or amino acids).

### Example of a FASTA File containing nucleotide sequences

```plaintext
>sequence1 description of sequence 1
ATGCGTACGTAGCTAGCTAGCTA
GCTAGCTAGCTAGCTAGCTAGCT
>sequence2 description of sequence 2
GATTACAAGGTTAGCTAGCTAGT
AGCTAGCTAGCTAGCTAGCTAGC
```

### Tips for dealing with FASTA files
- **Line Length**: While not mandatory, it's common practice to limit sequence lines to a maximum of 80 characters for better readability.
- **File Extension**: FASTA files typically have extensions like .fasta, .fa, .fna (nucleic acid), or .faa (amino acid).
- **Applications**: FASTA files are used in bioinformatics for sequence alignment, database searching, and other computational analyses.
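
Because the format is so simple, it can help to see it parsed by hand before relying on a library. The following is a minimal sketch using only the Python standard library; `parse_fasta` is a hypothetical helper written for this example, and it does not handle every variation found in real files (comments, compressed files, and so on).

```python
import io


def _make_record(header, chunks):
    # The identifier is the first whitespace-delimited token of the header.
    identifier = header.split(maxsplit=1)[0] if header else ""
    return identifier, header, "".join(chunks)


def parse_fasta(handle):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without newlines
    if header is not None:
        yield _make_record(header, chunks)


fasta_text = """\
>sequence1 description of sequence 1
ATGCGTACGTAGCTAGCTAGCTA
GCTAGCTAGCTAGCTAGCTAGCT
>sequence2 description of sequence 2
GATTACAAGGTTAGCTAGCTAGT
AGCTAGCTAGCTAGCTAGCTAGC
"""

records = list(parse_fasta(io.StringIO(fasta_text)))
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```

In practice, a maintained parser such as the one in Biopython is usually the safer choice.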

In [35]:
# Reading a fasta file in python
from Bio import SeqIO

# Replace 'example.fasta' with the path to your FASTA file
fasta_file = "example.fasta"

# Read the FASTA file
sequences = list(SeqIO.parse(fasta_file, "fasta"))

# Print out each sequence
for seq_record in sequences:
    # Get the nucleotide sequence
    nucleotide_seq = seq_record.seq
    
    # Get the amino acid sequence
    amino_acid_seq = nucleotide_seq.translate()

    # Print the sequence info
    print(f"ID: {seq_record.id}")
    print(f"Description: {seq_record.description}")
    print(f"Nucleotide sequence: {nucleotide_seq}")
    print(f"Amino acid sequence: {amino_acid_seq}\n")

ID: Sequence_1
Description: Sequence_1
Nucleotide sequence: ATGCGACTACGATCGAGGGCCATGTGA
Amino acid sequence: MRLRSRAM*

ID: Sequence_2
Description: Sequence_2
Nucleotide sequence: ATGCGTAGCTGGCTAGCATCGATGCTAGCTGATTAA
Amino acid sequence: MRSWLASMLAD*

ID: Sequence_3
Description: Sequence_3
Nucleotide sequence: ATGTTAGCTAGCTCGCTCGATCGATCGCAGCTGATCGATCGTAGCTTGCTATGA
Amino acid sequence: MLASSLDRSQLIDRSLL*



# 3. Toward building genomic foundation models

If we want to train foundation models for genomics, it is important to understand the underlying data distribution and the process by which it was generated (i.e., **[evolution](https://en.wikipedia.org/wiki/Evolution)**). 

## Evolutionary conservation

**[Sequence homology](https://en.wikipedia.org/wiki/Sequence_homology)** is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena:
1. Speciation events (orthologs)
2. Duplication events (paralogs)
3. Horizontal (or lateral) gene transfer events (xenologs)

Homology among DNA, RNA, or proteins is typically inferred from their nucleotide or amino acid **[sequence similarity](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/)**. Significant similarity is strong evidence that two sequences are related by evolutionary changes from a common ancestral sequence. **[Alignments](https://en.wikipedia.org/wiki/Multiple_sequence_alignment)** of multiple sequences are used to indicate which regions of each sequence are homologous. Computationally, this resembles the [Longest Common Subsequence](https://www.geeksforgeeks.org/longest-common-subsequence-dp-4/) (LCS) problem.
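
To illustrate the connection to LCS, the sketch below computes the length of the longest common subsequence of two strings with dynamic programming. It is a simplified stand-in for alignment, not how alignment tools are implemented: it rewards matches only and ignores mismatch and gap costs. The function name `lcs_length` is an illustrative choice.

```python
def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common subsequence of a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip a character in one sequence or the other.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_length("ACCGT", "ACG"))  # 3
```

The table has one cell per pair of positions, so the running time grows with the product of the two sequence lengths.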

**Conservation indicates that a sequence or important subsequence (i.e., [motif](https://en.wikipedia.org/wiki/Sequence_motif)) has been maintained by natural selection.**

**These patterns, along with the biological functions they encode, are what we hope for a genomic foundation model to learn.**

### Example: discovering a new domain of life through evolutionary conservation

One important class of highly conserved sequences is the [RNA components](https://en.wikipedia.org/wiki/Ribosomal_RNA) of [ribosomes](https://en.wikipedia.org/wiki/Ribosome), which are present in all domains of life. By comparing these across a diverse set of organisms, Woese and Fox were able to discover a new domain of life, the archaea.

![image.png](attachment:e83a67e9-8ce3-4a17-aa04-a4106a5d2e9f.png)


![image.png](attachment:4e315575-1b27-481b-aae5-90fd12568106.png)


**LUCA**: [last universal common ancestor](https://en.wikipedia.org/wiki/Last_universal_common_ancestor)

## Sequence alignment

Sequence alignment is a computational approach for understanding the similarities between sequences.

We will focus on [pairwise sequence alignment](https://en.wikipedia.org/wiki/Sequence_alignment), which uses a [dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming) algorithm.

There are two primary types of alignment:
- **Local alignment**: finds just the subsequences that align the best.
- **Global alignment**: finds the best agreement between all characters in two sequences.

When aligning sequences, there are two important settings you can specify, the **match score** and **gap penalty**.

### Understanding Match Score and Gap Penalty
#### Match Score
- Match Score: Points given when characters in two sequences match. Higher points indicate higher similarity.
- Exact Match (x): Full points for identical characters (e.g., A vs. A might score +1).

#### Gap Penalty
- Gap Penalty: Points subtracted for inserting gaps in the alignment (these could represent **insertion** or **deletion** mutations)
- Gap **Open Penalty**: Cost of starting a new gap.
- Gap **Extension Penalty**: Cost of extending an existing gap.
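
To see how these settings combine, the sketch below scores an alignment that has already been written out with `-` for gaps. It assumes one common convention: the first position of a gap costs the open penalty and each additional position costs the extension penalty. The function `score_alignment` and its defaults are illustrative, not a library API.

```python
def score_alignment(row_a, row_b, match=2.0, mismatch=-1.0,
                    gap_open=-0.5, gap_extend=-0.1):
    """Score two equal-length aligned rows, where '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0.0
    previous_gap_row = None  # which row held a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            # Continuing a gap in the same row is an extension;
            # anything else starts a new gap.
            score += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if x == y else mismatch
            previous_gap_row = None
    return score


# Three matches (3 * 2) and two separately opened gaps (2 * -0.5).
print(score_alignment("ACCGT", "A-CG-"))  # 5.0
```

Making the extension penalty smaller than the open penalty favors a few long gaps over many short ones, which reflects the idea that a single insertion or deletion event can span several positions.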

### Global alignment

In [11]:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

In [12]:
# Align the two sequences with:
# Match score: 1 for identical characters and otherwise 0
# Gap penalty: None
alignments = pairwise2.align.globalxx("ACCGT", "ACG")

# Print the alignments
for alignment in alignments:
    print(format_alignment(*alignment))

ACCGT
| || 
A-CG-
  Score=3

ACCGT
|| | 
AC-G-
  Score=3



In [10]:
# Align the two sequences with:
# Match score: 2 for identical characters and otherwise -1
# Gap penalty: None
alignments = pairwise2.align.globalmx("ACCGT", "ACG", 2, -1)

# Print the alignments
for alignment in alignments:
    print(format_alignment(*alignment))

ACCGT
| || 
A-CG-
  Score=6

ACCGT
|| | 
AC-G-
  Score=6



In [13]:
# Align the two sequences with:
# Match score: 2 for identical characters and otherwise -1.
# Gap penalty: 0.5 points are deducted when opening a gap, and 0.1 points are deducted when extending it
alignments = pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1)

# Print the alignments
for alignment in alignments:
    print(format_alignment(*alignment))

ACCGT
| || 
A-CG-
  Score=5

ACCGT
|| | 
AC-G-
  Score=5
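
For intuition about what happens under the hood, here is a minimal sketch of the dynamic programming recurrence for a global alignment score (the Needleman-Wunsch approach). It is a simplification under stated assumptions: it uses a single linear `gap` penalty per gap position rather than separate open and extension penalties, and it returns only the best score, not the alignments. The function name is illustrative.

```python
def global_alignment_score(a, b, match=1.0, mismatch=0.0, gap=0.0):
    """Return the best global alignment score with a linear gap penalty."""
    # prev[j] is the best score for aligning the processed prefix of a
    # with b[:j]. Aligning an empty prefix against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            gap_in_b = prev[j] + gap      # ch_a is aligned to a gap
            gap_in_a = curr[j - 1] + gap  # ch_b is aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACCGT", "ACG"))                  # 3.0
print(global_alignment_score("ACCGT", "ACG", 2.0, -1.0))       # 6.0
print(global_alignment_score("ACCGT", "ACG", 2.0, -1.0, -0.5))  # 5.0
```

For these short sequences every gap in the best alignment has length one, so the linear penalty gives the same scores as the examples above.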



### Local alignment

In [20]:
# Align the two sequences with:
# Match score: 2 for identical characters and otherwise -1
# Gap penalty: None
alignments = pairwise2.align.localmx("ACCCGT", "ATCCCG", 2, -1)

# Print the alignments
for alignment in alignments:
    print(format_alignment(*alignment))

1 A-CCCG
  | ||||
1 ATCCCG
  Score=10



In [24]:
# Align the two sequences with:
# Match score: 2 for identical characters and otherwise -1
# Gap penalty: None
alignments = pairwise2.align.localmx("AGCCCGT", "ATCCCG", 2, -1)

# Print the alignments
for alignment in alignments:
    print(format_alignment(*alignment))

1 AG-CCCG
  |  ||||
1 A-TCCCG
  Score=10



In [28]:
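# Align the two sequences with:
# Match score: 2 for identical characters and otherwise -1
# Gap penalty: 0.5 points are deducted when opening a gap, and 0.1 points are deducted when extending it
alignments = pairwise2.align.localms("AGCCCGT", "ATCCCG", 2, -1, -.5, -.1)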

# Print the alignments
for alignment in alignments:
    print(format_alignment(*alignment))

1 AG-CCCG
  |  ||||
1 A-TCCCG
  Score=9

1 AGCCCG
  |.||||
1 ATCCCG
  Score=9



In [29]:
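# Align the two sequences with:
# Match score: 2 for identical characters and otherwise -1
# Gap penalty: 1 point is deducted when opening a gap, and 0.1 points are deducted when extending it
alignments = pairwise2.align.localms("AGCCCGT", "ATCCCG", 2, -1, -1, -.1)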

# Print the alignments
for alignment in alignments:
    print(format_alignment(*alignment))

1 AGCCCG
  |.||||
1 ATCCCG
  Score=9
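
The local variant of the dynamic programming recurrence (the Smith-Waterman approach) differs from the global one in two ways: a cell's score is never allowed to drop below zero, and the answer is the best cell anywhere in the table. The sketch below assumes a single linear `gap` penalty per gap position and returns only the score; the function name is illustrative.

```python
def local_alignment_score(a, b, match=2.0, mismatch=-1.0, gap=-1.0):
    """Return the best local alignment score with a linear gap penalty."""
    best = 0.0
    prev = [0.0] * (len(b) + 1)
    for ch_a in a:
        curr = [0.0]
        for j, ch_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # Flooring at zero lets a new local alignment start at any cell.
            cell = max(0.0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(local_alignment_score("ACCCGT", "ATCCCG", gap=0.0))    # 10.0
print(local_alignment_score("AGCCCGT", "ATCCCG", gap=-1.0))  # 9.0
```

With a gap penalty of 1, accepting the single G/T mismatch (score 9) is better than opening two gaps to avoid it (score 8), which is why only the ungapped alignment is reported in the last example above.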



## Multiple sequence alignment

**[Multiple sequence alignment (MSA)](https://en.wikipedia.org/wiki/Multiple_sequence_alignment)** is the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. 

![image.png](attachment:c2125e43-4a3a-48f5-a584-86abfc6c51c2.png)
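
Once sequences are aligned, each column can be inspected for conservation. The sketch below computes a simple per-column score: the fraction of sequences that share the most common residue in that column. This is one of several possible conservation measures, and both the `column_conservation` helper and the toy alignment are made up for illustration.

```python
from collections import Counter


def column_conservation(msa):
    """Return, per column, the fraction of rows sharing the top residue.

    Gap characters ('-') never count as the top residue, but they still
    count toward the column total, so gappy columns score low.
    """
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    scores = []
    for column in zip(*msa):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# A toy alignment of four made-up sequences.
toy_msa = [
    "ATG-CCGT",
    "ATGACCGT",
    "ATG-CAGT",
    "TTG-CCGA",
]
print(column_conservation(toy_msa))
```

Columns that score 1.0 are identical across all sequences, which is the kind of signal that points to a conserved motif.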

# 4. How to deduplicate gene/protein data for training a foundation model?

At a high level, we would like to train protein/genomic foundation models across diverse sets of data.
However, the sequence databases are biased in ways that reflect which sequences humans have spent
time characterizing and sequencing. For instance, there are many more SARS-CoV-2 genomes sequenced
than for other viruses.

**Q: How can we ensure that our model training is not biased towards particular regions of sequence space?**

**A:** By first clustering the sequences based on their similarity (i.e., alignments), and then sampling
training examples from each cluster.

There are many ways of doing this. In this tutorial, we demonstrate the use of [MMseqs2](https://github.com/soedinglab/MMseqs2).

![image.png](attachment:a5e130dd-67af-4e8a-8d3c-827cdbc9e38a.png)

Note: Each step of MMseqs2 clustering is either equivalent to pairwise alignment or an approximation of it. Any approach that works at massive scale needs some heuristics.
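
To make the cluster-then-sample idea concrete, here is a toy sketch in plain Python. It is not the MMseqs2 algorithm: it assumes a cheap heuristic (Jaccard similarity of 3-mer sets) in place of alignment, and greedily assigns each sequence to the first cluster whose representative is similar enough. All function names, the threshold, and the example sequences are made up for illustration.

```python
import random


def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Return the Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(sequences, threshold=0.5, k=3):
    """Assign each sequence to the first cluster with a similar representative.

    The representative of a cluster is its first member.
    """
    representatives, clusters = [], []
    for seq in sequences:
        profile = kmers(seq, k)
        for index, rep_profile in enumerate(representatives):
            if jaccard(profile, rep_profile) >= threshold:
                clusters[index].append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            representatives.append(profile)
            clusters.append([seq])
    return clusters


def sample_one_per_cluster(clusters, seed=0):
    """Draw one sequence from each cluster so every cluster counts equally."""
    rng = random.Random(seed)
    return [rng.choice(cluster) for cluster in clusters]


toy_sequences = [
    "ATGCGTACGTTAGC",
    "ATGCGTACGTTAGG",
    "ATGCGTACCTTAGC",
    "TTTTGGGGCCCCAA",
    "TTTTGGGGCCCCAT",
]
clusters = greedy_cluster(toy_sequences)
training_examples = sample_one_per_cluster(clusters)
print([len(cluster) for cluster in clusters])  # [3, 2]
print(training_examples)
```

Even though the first family has more members, sampling one sequence per cluster gives both families equal weight in the training set.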

# Conclusion

For more bioinformatics tutorials: https://rosalind.info/problems/list-view/

# Citations

Many graphics courtesy of the National Human Genome Research Institute: https://www.genome.gov/