# 🧬 Genomic Embeddings and Representation Learning with OmniGenBench

Welcome! In this tutorial we'll generate **high-quality genomic embeddings** from DNA and RNA sequences using **OmniGenBench**. You'll learn how to extract vector representations from genomic sequences and use them for downstream analysis and machine learning applications.

### 1. The Computational Challenge: What are Genomic Embeddings?

**Genomic embeddings** are dense vector representations that capture the semantic and functional information encoded in DNA and RNA sequences. These embeddings transform discrete nucleotide sequences into continuous vector spaces where similar sequences are positioned closer together.

The power of genomic embeddings lies in their ability to:
- **Capture Sequence Semantics**: Encode biological meaning and functional relationships
- **Enable Similarity Analysis**: Find functionally related sequences through vector similarity
- **Support Downstream ML**: Serve as input features for various machine learning tasks
- **Compress Information**: Reduce high-dimensional sequence data to manageable representations

Applications span across computational biology:
- **Drug Discovery**: Finding target sequences and analyzing molecular interactions
- **Evolutionary Analysis**: Studying sequence relationships and phylogenetic patterns  
- **Functional Annotation**: Predicting sequence function from embedding similarity
- **Biomarker Discovery**: Identifying disease-related sequence patterns

### 2. The Data: From Sequences to Vectors

Unlike traditional one-hot encoding, genomic foundation models learn rich representations that capture:

- **Local Patterns**: k-mer frequencies, motifs, and short-range dependencies
- **Global Context**: Long-range interactions and structural relationships  
- **Functional Similarities**: Sequences with similar biological roles cluster together
- **Evolutionary Relationships**: Homologous sequences tend to have similar embeddings
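To make the "local patterns" item concrete, here is a minimal sketch of the hand-crafted k-mer frequency profile that learned embeddings are often compared against. The helper name `kmer_frequencies` is hypothetical and not part of OmniGenBench; it only illustrates what a k-mer feature looks like.

```python
from collections import Counter


def kmer_frequencies(sequence: str, k: int = 3) -> dict[str, float]:
    """Return the relative frequency of every overlapping k-mer in `sequence`."""
    if k <= 0 or k > len(sequence):
        raise ValueError("k must be between 1 and len(sequence)")
    # Slide a window of width k across the sequence, one position at a time.
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return {kmer: count / total for kmer, count in counts.items()}


freqs = kmer_frequencies("ATGCGATCG", k=3)
for kmer, freq in sorted(freqs.items()):
    print(f"{kmer}: {freq:.3f}")
```

A profile like this only sees windows of length `k`, which is why it cannot capture the long-range context listed above.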

**Transformation Process:**

| Raw Sequence | Traditional Encoding | Embedding Vector |
|-------------|---------------------|------------------|
| `ATGCGATCG` | `[1,0,0,0,0,1,0,0,...]` | `[0.23, -0.45, 0.12, ...]` |
| `ATGCGTTCG` | `[1,0,0,0,0,1,0,0,...]` | `[0.21, -0.43, 0.15, ...]` |
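The "Traditional Encoding" column can be reproduced with a few lines of plain Python. This is a sketch, not OmniGenBench code: the helper names are hypothetical, and the channel order `A, T, G, C` is an assumption chosen so that the first table row is reproduced.

```python
# Channel order is an assumption chosen to reproduce the first table row.
ALPHABET = "ATGC"


def one_hot_encode(sequence: str, alphabet: str = ALPHABET) -> list[list[int]]:
    """Encode each nucleotide as a vector with a single 1 at its alphabet index."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(alphabet)
        row[index[base]] = 1  # raises KeyError for characters outside the alphabet
        encoded.append(row)
    return encoded


def flatten(rows: list[list[int]]) -> list[int]:
    """Concatenate per-nucleotide vectors into one flat list."""
    return [value for row in rows for value in row]


a = one_hot_encode("ATGCGATCG")
b = one_hot_encode("ATGCGTTCG")

print(flatten(a)[:8])  # encoding of the first two nucleotides, A and T
# Positions (0-based) where the two sequences are encoded differently
differing = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
print(differing)
```

The two sequences differ at a single nucleotide, so their one-hot encodings differ in exactly one row. One-hot vectors carry no notion of how biologically significant that change is, which is the gap learned embeddings aim to fill.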

### 3. The Tool: Genomic Foundation Models for Representation Learning

#### Pre-trained Understanding
**OmniGenome** models are pre-trained on massive genomic datasets, learning to represent sequences in biologically meaningful vector spaces. This pre-training captures:

1. **Sequence Patterns**: Common motifs, regulatory elements, and structural features
2. **Functional Relationships**: Similar functions lead to similar representations
3. **Evolutionary Context**: Related sequences cluster in embedding space
4. **Multi-scale Information**: From local k-mers to global sequence properties

### 4. The Workflow: A 4-Step Guide to Genomic Embeddings

```mermaid
flowchart TD
    subgraph "4-Step Workflow for Genomic Embeddings"
        A["📥 Step 1: Setup and Configuration<br/>Initialize models and prepare sequences"] --> B["🔧 Step 2: Model Loading<br/>Load pre-trained genomic foundation models"]
        B --> C["🎓 Step 3: Embedding Generation<br/>Extract vector representations from sequences"]
        C --> D["🔮 Step 4: Analysis and Applications<br/>Analyze embeddings and explore applications"]
    end

    style A fill:#e1f5fe,stroke:#333,stroke-width:2px
    style B fill:#f3e5f5,stroke:#333,stroke-width:2px
    style C fill:#e8f5e8,stroke:#333,stroke-width:2px
    style D fill:#fff3e0,stroke:#333,stroke-width:2px
```

Let's start generating powerful genomic embeddings!

## 🚀 Step 1: Setup and Configuration

This first step focuses on setting up our environment for genomic embedding generation and analysis.

### 1.1: Environment Setup

First, let's install the required packages for genomic embedding generation and analysis.

In [None]:
!pip install omnigenbench torch transformers scikit-learn matplotlib seaborn -U

### 1.2: Import Required Libraries

Next, we import the essential libraries for genomic embedding generation, analysis, and visualization.

In [None]:
import numpy as np
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

# Import various model types - ALL support embedding and attention extraction!
from omnigenbench import (
    OmniModelForEmbedding,
    OmniModelForSequenceClassification,
    OmniModelForSequenceRegression,
    ModelHub,
)

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

### 1.3: Global Configuration

Let's define our configuration parameters for embedding generation and analysis.

#### Key Parameters
- **Model Selection**: Choose the appropriate genomic foundation model for embedding generation
- **Analysis Settings**: Configure parameters for dimensionality reduction and clustering
- **Visualization Options**: Set up parameters for embedding visualization and exploration

In [None]:
# Configuration for embedding generation and analysis
embedding_config = {
    "model_name": "yangheng/OmniGenome-52M",
    "aggregation_method": "mean",  # Options used in this tutorial: head, mean, tail
    "max_length": 512,
    "batch_size": 16,
}

# Analysis configuration
analysis_config = {
    "n_components_pca": 50,  # Upper bound; PCA needs n_components <= min(n_samples, n_features)
    "n_components_tsne": 2,
    "n_clusters": 5,
    "random_state": 42,
}

print("🎯 Genomic Embedding Configuration:")
print(f"  Model: {embedding_config['model_name']}")
print(f"  Aggregation method: {embedding_config['aggregation_method']}")
print(f"  Max sequence length: {embedding_config['max_length']}")
print(f"\n📊 Analysis Configuration:")
print(f"  PCA components: {analysis_config['n_components_pca']}")
print(f"  t-SNE components: {analysis_config['n_components_tsne']}")
print(f"  Number of clusters: {analysis_config['n_clusters']}")

## 🚀 Step 2: Model Loading

Now let's load the pre-trained genomic foundation model for embedding generation. 

### 🎯 Important: All OmniModel Classes Support Embeddings!

**All OmniGenBench models** now inherit from `EmbeddingMixin`, which means:
- ✅ `OmniModelForEmbedding` - Dedicated embedding extraction
- ✅ `OmniModelForSequenceClassification` - Classification + Embeddings
- ✅ `OmniModelForSequenceRegression` - Regression + Embeddings  
- ✅ `OmniModelForTokenClassification` - Token classification + Embeddings
- ✅ **All other OmniModel variants** - Task-specific + Embeddings

You can use **any** of these model types to extract embeddings and attention scores!

### Model Features
- **Pre-trained Understanding**: Leverages genomic foundation model knowledge
- **Flexible Aggregation**: Multiple methods for sequence-to-vector conversion
- **Batch Processing**: Efficient handling of multiple sequences
- **GPU Acceleration**: Runs on CUDA when a GPU is available, with a CPU fallback
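To illustrate what "aggregation" means, the sketch below pools per-token hidden states into one vector per sequence with NumPy. It is not OmniGenBench's implementation. The function `pool_tokens` is hypothetical, and the definitions are assumptions: `head` takes the first token, `tail` takes the last non-padding token (right padding assumed), and `mean` averages over non-padding tokens.

```python
import numpy as np


def pool_tokens(hidden: np.ndarray, mask: np.ndarray, agg: str = "mean") -> np.ndarray:
    """Pool token embeddings into sequence embeddings.

    hidden: (batch, seq_len, dim) token embeddings.
    mask:   (batch, seq_len), 1 for real tokens and 0 for padding.
    """
    mask = mask.astype(bool)
    if agg == "head":
        # Assumed definition: embedding of the first token.
        return hidden[:, 0, :]
    if agg == "tail":
        # Assumed definition: embedding of the last real token (right padding).
        last = mask.sum(axis=1) - 1
        return hidden[np.arange(hidden.shape[0]), last, :]
    if agg == "mean":
        # Average over real tokens only, so padding does not dilute the result.
        weights = mask[:, :, None].astype(hidden.dtype)
        return (hidden * weights).sum(axis=1) / weights.sum(axis=1)
    raise ValueError(f"Unknown aggregation method: {agg}")


# Two toy sequences of up to 3 tokens with 2-dimensional token embeddings.
hidden = np.arange(2 * 3 * 2, dtype=np.float32).reshape(2, 3, 2)
mask = np.array([[1, 1, 1], [1, 1, 0]])  # second sequence has one padding token

for method in ("head", "mean", "tail"):
    print(method, pool_tokens(hidden, mask, method).tolist())
```

Masking matters for `mean` and `tail`: without it, padding positions would leak into the sequence embedding whenever a batch mixes sequences of different lengths.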

In [None]:
# Initialize the embedding model
print("🔧 Loading genomic foundation model for embeddings...")

# Option 1: Use dedicated embedding model
embedding_model = OmniModelForEmbedding(
    embedding_config["model_name"],
    trust_remote_code=True
)

# Option 2: You can also use any other OmniModel type!
# All of these models support the same embedding extraction methods:
# classification_model = OmniModelForSequenceClassification.from_pretrained("path/to/model")
# regression_model = OmniModelForSequenceRegression.from_pretrained("path/to/model")
# They all have: .encode(), .batch_encode(), .extract_attention_scores(), etc.

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_model = embedding_model.to(device)

# Use half precision (float16) on GPU to reduce memory use and speed up inference
if device.type == "cuda":
    embedding_model = embedding_model.to(torch.float16)
    
print(f"✅ Model loaded successfully!")
print(f"📊 Model configuration:")
print(f"  Device: {device}")
print(f"  Precision: {'float16' if device.type == 'cuda' else 'float32'}")
print(f"  Model parameters: ~52M")
print(f"\n💡 Note: All OmniModel types (Classification, Regression, etc.) support embedding extraction!")

## 🚀 Step 3: Embedding Generation

Now let's generate embeddings for various types of genomic sequences. We'll use diverse sequences to demonstrate how the model captures different biological patterns and relationships.

### Our Sequence Collection

We'll analyze sequences with different characteristics:
- **Functional RNAs**: tRNAs, rRNAs, and miRNAs
- **Coding Sequences**: mRNAs encoding different proteins
- **Regulatory Elements**: Promoters, enhancers, and UTRs
- **Structural Variants**: Sequences with different folding properties

In [None]:
# Diverse collection of genomic sequences for embedding analysis
genomic_sequences = {
    "Functional RNAs": {
        "tRNA-Ala": "GGGGGUAUAGCUCAGUGGUAGAGCGCGUGCCUUUGCAAGCACAAGAGUCUCGGGAGUCGUUGGUUCGAAUCACCGUACCCCCA",
        "rRNA-18S": "CGGCUACCACAUCCAAGGAAGGCAGCAGGCGCGCAAAUUACCCACUCCCGACCCGGGGAGGGUAGUGGCGGUUCGCCAGGA",
        "miRNA-21": "UAGCUUAUCAGACUGAUGUUGACUGUUGAAUCUCAUGGCAACACCAGUCGAUGGGCUGU",
    },
    "mRNA Sequences": {
        "Insulin mRNA": "AUGCCGCGCAACGAGGCCUACACUGUGCGAACUGCUGCCUGCUGCUGCCCGCUGCUGCUGCUGGGCUCCGCCCGCCGAG",
        "Hemoglobin mRNA": "AUGGUGGACGACGUGCUCGGCAAGAACGUCAACCACGUGAAGCUGGUGGUGGACGACGACGGCUGCGUGGGCAACUGC",
        "p53 mRNA": "AUGGAGGAGCCGCAGUCAGAUCCUAGCGUCCGGGACGACACGCCAACCUGCUCUCCUGCCGUCCCCGCCAAGACCAGC",
    },
    "Regulatory Elements": {
        "TATA promoter": "CGGCGCGCCAUAUAAAGCAUCGAGCGCGCACGUGCGCUGCGCGCGCGCUACGCGCGCAUGUGCGCGCACGUACGCGCG",
        "Enhancer seq": "GCGCGCGCACGUGCGCACGUGCGCGCACGUGCGCGCGCACGUGCGCGCACGUGCGCGCACGUGCGCGCACGUGCGCGC",
        "5'UTR": "GCGCGCCACCAAUGCGCGCGCCACCAUGUGCGCGCCACCAUGUGCGCGCCACCAUGUGCGCGCCACCAUGUGCGCGCC",
    },
    "Structural Variants": {
        "Hairpin RNA": "CGGAAACCCUUUGGGAAACCCGGGAAACCCUUUGGGAAACCCGGGAAACCCUUUGGGAAACCCG",
        "Repeat seq": "CACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACA",
        "Random seq": "AUGCGAUCUCGAGCUACGUCGAUGCUAGCUCGAUGGCAUCCGAUUCGAGCUACGUCGAUGCUAG",
    }
}

# Flatten sequences for batch processing
all_sequences = []
sequence_labels = []
sequence_categories = []

for category, sequences in genomic_sequences.items():
    for label, sequence in sequences.items():
        all_sequences.append(sequence)
        sequence_labels.append(label)
        sequence_categories.append(category)

print(f"🧬 Prepared {len(all_sequences)} genomic sequences for embedding:")
for category, sequences in genomic_sequences.items():
    print(f"  {category}: {len(sequences)} sequences")

# Generate embeddings in batches
print(f"\n🎓 Generating embeddings...")
print(f"⚡ Using batch processing with aggregation method: {embedding_config['aggregation_method']}")

# Process sequences in batches
batch_size = embedding_config["batch_size"]
all_embeddings = []

for i in range(0, len(all_sequences), batch_size):
    batch_sequences = all_sequences[i:i + batch_size]
    batch_embeddings = embedding_model.batch_encode(
        batch_sequences, 
        agg=embedding_config["aggregation_method"]
    )
    all_embeddings.append(batch_embeddings)

# Concatenate all embeddings
genomic_embeddings = torch.cat(all_embeddings, dim=0)

print(f"✅ Embedding generation completed!")
print(f"📊 Embedding matrix shape: {genomic_embeddings.shape}")
print(f"  Number of sequences: {genomic_embeddings.shape[0]}")
print(f"  Embedding dimension: {genomic_embeddings.shape[1]}")

## 🔮 Step 4: Analysis and Applications

Now let's analyze our genomic embeddings to understand the relationships between sequences and explore various applications. This demonstrates the power of genomic foundation models in capturing biological meaning.

### Analysis Pipeline

Our comprehensive analysis includes:
1. **Similarity Analysis**: Calculate pairwise similarities between sequences
2. **Dimensionality Reduction**: Visualize embeddings in 2D space using PCA and t-SNE
3. **Clustering**: Discover sequence groups with similar properties
4. **Functional Analysis**: Interpret biological relationships captured by embeddings
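The first two pipeline steps can be sketched with NumPy alone. The example uses a random stand-in matrix with the same number of rows as our sequence collection (12), so it runs without a model. With real embeddings you would first convert the tensor to a float32 array, for example `genomic_embeddings.float().cpu().numpy()`, assuming `batch_encode` returns a PyTorch tensor as the `torch.cat` call in Step 3 implies. The helpers `cosine_similarity_matrix` and `pca_project` are hypothetical; scikit-learn's `cosine_similarity` and `PCA`, imported in Step 1, are the usual alternatives.

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for the real embedding matrix: 12 sequences x 64 dimensions.
embeddings = rng.normal(size=(12, 64))


def cosine_similarity_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


def pca_project(x: np.ndarray, n_components: int) -> np.ndarray:
    """Project rows of x onto their top principal components via SVD."""
    # PCA cannot return more components than min(n_samples, n_features).
    n_components = min(n_components, x.shape[0], x.shape[1])
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


similarity_matrix = cosine_similarity_matrix(embeddings)
coords_2d = pca_project(embeddings, n_components=2)

print("Similarity matrix shape:", similarity_matrix.shape)
print("2D PCA coordinates shape:", coords_2d.shape)
```

The clamp in `pca_project` matters for small collections: the configured `n_components_pca` of 50 is larger than the 12 sequences in this tutorial, so it has to be reduced before fitting.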

### Computing Similarity Between Sequences
Let's compute the cosine similarity between two sequence embeddings.

In [None]:
# Compute the similarity between the first two sequence embeddings from Step 3
similarity = embedding_model.compute_similarity(genomic_embeddings[0], genomic_embeddings[1])

# Display the similarity score
print(f"Similarity between {sequence_labels[0]} and {sequence_labels[1]}: {similarity:.4f}")
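For reference, the standard definition of cosine similarity between two embedding vectors $\mathbf{u}$ and $\mathbf{v}$ of dimension $d$ is given below. The symbols are introduced here for illustration only.

$$
\operatorname{cos\_sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} = \frac{\sum_{i=1}^{d} u_i v_i}{\sqrt{\sum_{i=1}^{d} u_i^2} \, \sqrt{\sum_{i=1}^{d} v_i^2}}
$$

The score lies in $[-1, 1]$. Values near 1 mean the two embeddings point in almost the same direction, and the measure ignores vector length.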

### Encoding a Single RNA Sequence
You can also encode a single RNA sequence, choosing how its token embeddings are aggregated into one vector.

In [None]:
# Example single RNA sequence
single_rna_sequence = "AUGGCUACG"

# Encode the same sequence with each aggregation method
head_rna_embedding = embedding_model.encode(single_rna_sequence, agg='head', keep_dim=True)
mean_rna_embedding = embedding_model.encode(single_rna_sequence, agg='mean')
tail_rna_embedding = embedding_model.encode(single_rna_sequence, agg='tail')

# Display the head-aggregated embedding for the single RNA sequence
print("Single RNA Sequence Embedding (head aggregation):")
print(head_rna_embedding)

## Full Example
Here's a compact end-to-end example: loading a model, encoding sequences, saving and loading embeddings, computing similarity, and extracting attention scores.

In [None]:
import torch
from omnigenbench import (
    OmniModelForEmbedding,
    OmniModelForSequenceClassification,
    OmniModelForSequenceRegression
)

# Step 1: Initialize the model
# 🎯 ALL OmniModel types support embedding extraction!
model_name = "yangheng/OmniGenome-52M"

# Option A: Use dedicated embedding model (float16 on GPU, float32 on CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_model = OmniModelForEmbedding(model_name, trust_remote_code=True).to(device)
if device.type == "cuda":
    embedding_model = embedding_model.to(torch.float16)

# Option B: Use classification model (also supports embeddings!)
# embedding_model = OmniModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True)

# Option C: Use regression model (also supports embeddings!)
# embedding_model = OmniModelForSequenceRegression.from_pretrained(model_name, trust_remote_code=True)

# Step 2: Encode RNA sequences (works with ANY OmniModel type!)
rna_sequences = ["AUGGCUACG", "CGGAUACGGC"]
rna_embeddings = embedding_model.batch_encode(rna_sequences)
print("RNA Embeddings:", rna_embeddings)

# Step 3: Save embeddings to a file
embedding_model.save_embeddings(rna_embeddings, "rna_embeddings.pt")

# Step 4: Load embeddings from the file
loaded_embeddings = embedding_model.load_embeddings("rna_embeddings.pt")

# Step 5: Compute similarity between the first two RNA sequence embeddings
similarity = embedding_model.compute_similarity(loaded_embeddings[0], loaded_embeddings[1])
print(f"Similarity between RNA sequences: {similarity:.4f}")

# Step 6: Encode a single RNA sequence
single_rna_sequence = "AUGGCUACG"
single_rna_embedding = embedding_model.encode(single_rna_sequence)
print("Single RNA Sequence Embedding:", single_rna_embedding)

# Step 7: Extract attention scores (also available on ALL OmniModel types!)
attention_result = embedding_model.extract_attention_scores(
    sequence=rna_sequences[0],
    max_length=128,
    layer_indices=[0, -1],  # First and last layer
)
print(f"\nAttention scores shape: {attention_result['attentions'].shape}")
print(f"💡 Embedding and attention extraction work with ALL OmniModel types!")