# 🧬 Genomic Embeddings and Representation Learning with OmniGenBench

Welcome to this tutorial on generating **genomic embeddings** from DNA and RNA sequences with **OmniGenBench**. We'll walk through extracting vector representations from genomic sequences and using them for downstream analysis and machine learning.

### 1. The Computational Challenge: What are Genomic Embeddings?

**Genomic embeddings** are dense vector representations that capture the semantic and functional information encoded in DNA and RNA sequences. These embeddings transform discrete nucleotide sequences into continuous vector spaces where similar sequences are positioned closer together.

The power of genomic embeddings lies in their ability to:
- **Capture Sequence Semantics**: Encode biological meaning and functional relationships
- **Enable Similarity Analysis**: Find functionally related sequences through vector similarity
- **Support Downstream ML**: Serve as input features for various machine learning tasks
- **Compress Information**: Reduce high-dimensional sequence data to manageable representations

Applications span across computational biology:
- **Drug Discovery**: Finding target sequences and analyzing molecular interactions
- **Evolutionary Analysis**: Studying sequence relationships and phylogenetic patterns  
- **Functional Annotation**: Predicting sequence function from embedding similarity
- **Biomarker Discovery**: Identifying disease-related sequence patterns

### 2. The Data: From Sequences to Vectors

Unlike traditional one-hot encoding, genomic foundation models learn rich representations that capture:

- **Local Patterns**: k-mer frequencies, motifs, and short-range dependencies
- **Global Context**: Long-range interactions and structural relationships  
- **Functional Similarities**: Sequences with similar biological roles cluster together
- **Evolutionary Relationships**: Homologous sequences have similar embeddings

**Transformation Process:**

| Raw Sequence | Traditional Encoding | Embedding Vector |
|-------------|---------------------|------------------|
| `ATGCGATCG` | `[1,0,0,0,0,1,0,0,...]` | `[0.23, -0.45, 0.12, ...]` |
| `ATGCGTTCG` | `[1,0,0,0,0,1,0,0,...]` | `[0.21, -0.43, 0.15, ...]` |

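To make the "traditional encoding" column concrete, here is a minimal one-hot sketch. The helper name `one_hot` and the A, T, G, C channel order are assumptions chosen for illustration; other channel orders are equally valid.

```python
# Illustrative sketch; `one_hot` and the A,T,G,C channel order are assumptions.
ALPHABET = "ATGC"

def one_hot(sequence: str) -> list[int]:
    """Flatten a nucleotide string into a 4-channel one-hot vector."""
    vector = []
    for base in sequence.upper():
        channel = [0] * len(ALPHABET)
        channel[ALPHABET.index(base)] = 1  # raises ValueError on unknown bases
        vector.extend(channel)
    return vector

seq_a, seq_b = "ATGCGATCG", "ATGCGTTCG"
vec_a, vec_b = one_hot(seq_a), one_hot(seq_b)

# Positions where the two encodings disagree
diff = [i for i, (a, b) in enumerate(zip(vec_a, vec_b)) if a != b]
print(vec_a[:8])
print(diff)
```

The two sequences differ at a single base, so their one-hot vectors differ in exactly two entries and say nothing about context. A learned embedding instead shifts many dimensions by small amounts.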
### 3. The Tool: Genomic Foundation Models for Representation Learning

#### Pre-trained Understanding
**OmniGenome** models are pre-trained on massive genomic datasets, learning to represent sequences in biologically meaningful vector spaces. This pre-training captures:

1. **Sequence Patterns**: Common motifs, regulatory elements, and structural features
2. **Functional Relationships**: Similar functions lead to similar representations
3. **Evolutionary Context**: Related sequences cluster in embedding space
4. **Multi-scale Information**: From local k-mers to global sequence properties

### 4. The Workflow: A 4-Step Guide to Genomic Embeddings

```mermaid
flowchart TD
    subgraph "4-Step Workflow for Genomic Embeddings"
        A["📥 Step 1: Setup and Configuration<br/>Initialize models and prepare sequences"] --> B["🔧 Step 2: Model Loading<br/>Load pre-trained genomic foundation models"]
        B --> C["🎓 Step 3: Embedding Generation<br/>Extract vector representations from sequences"]
        C --> D["🔮 Step 4: Analysis and Applications<br/>Analyze embeddings and explore applications"]
    end

    style A fill:#e1f5fe,stroke:#333,stroke-width:2px
    style B fill:#f3e5f5,stroke:#333,stroke-width:2px
    style C fill:#e8f5e8,stroke:#333,stroke-width:2px
    style D fill:#fff3e0,stroke:#333,stroke-width:2px
```

Let's start generating powerful genomic embeddings!

## 🚀 Step 1: Setup and Configuration

This first step focuses on setting up our environment for genomic embedding generation and analysis.

### 1.1: Environment Setup

First, let's install the required packages for genomic embedding generation and analysis.

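The setup cell that follows delegates seeding to shared utilities (`setup_notebook_environment`, `set_global_seed`) whose source is not shown in this tutorial. As a rough idea of what such a helper typically does, here is a hypothetical stand-in. `seed_everything_sketch` is an assumed name, not part of OmniGenBench; it seeds Python's `random` and, if they are installed, NumPy and PyTorch.

```python
import random

def seed_everything_sketch(seed: int) -> None:
    """Hypothetical stand-in for a global seeding helper."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed

# Re-seeding with the same value reproduces the same random draws
seed_everything_sketch(42)
first = [random.random() for _ in range(3)]
seed_everything_sketch(42)
second = [random.random() for _ in range(3)]
print(first == second)
```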
In [None]:
# =============================================================================
# STEP 1.1: Environment Setup and Verification
# =============================================================================
# This cell installs required packages and sets up the reproducible environment
# using shared utilities. All randomness is controlled via explicit seeds.

# Install required packages (uncomment if running for first time)
# !pip install omnigenbench torch transformers scikit-learn matplotlib seaborn -U

import sys
from pathlib import Path

# Add the examples directory (the parent of this notebook's folder) to sys.path
# so that shared_utils.py can be imported
examples_dir = Path.cwd().parent
if str(examples_dir) not in sys.path:
    sys.path.insert(0, str(examples_dir))

# Import shared utilities for reproducibility
try:
    from shared_utils import (
        setup_notebook_environment,
        verify_environment,
        set_global_seed,
        resolve_data_path,
    )
    print("[SUCCESS] Shared utilities imported successfully")
except ImportError as e:
    print("[ERROR] Could not import shared_utils.py")
    print("  This file should be in: examples/shared_utils.py")
    print(f"  Current directory: {Path.cwd()}")
    print(f"  Error: {e}")
    sys.exit(1)

# One-line setup: seed, environment verification, matplotlib config
print("\n" + "=" * 70)
print("INITIALIZING REPRODUCIBLE NOTEBOOK ENVIRONMENT")
print("=" * 70)

env_info = setup_notebook_environment(
    seed=42,  # Will be overridden by RANDOM_SEED in next cell
    required_packages=['omnigenbench', 'torch', 'transformers', 'numpy', 
                       'matplotlib', 'seaborn', 'sklearn'],
    check_gpu=True,
    suppress_warnings=True,
    matplotlib_style='seaborn-v0_8',
    verbose=True
)

print("\n[SUCCESS] Environment setup completed!")
print("=" * 70)

### 1.2: Import Required Libraries

Next, we import the essential libraries for genomic embedding generation, analysis, and visualization.

In [None]:
import numpy as np
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

# Import OmniGenBench models
# 🎯 IMPORTANT: All OmniModel types support embedding extraction via EmbeddingMixin
from omnigenbench import (
    OmniModelForEmbedding,
    OmniModelForSequenceClassification,
    OmniModelForSequenceRegression,
    ModelHub,
)

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Verify imports succeeded
print("✅ All required libraries imported successfully")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

### 1.3: Global Configuration

Let's define our configuration parameters for embedding generation and analysis.

#### Key Parameters
- **Model Selection**: Choose the appropriate genomic foundation model for embedding generation
- **Analysis Settings**: Configure parameters for dimensionality reduction and clustering
- **Visualization Options**: Set up parameters for embedding visualization and exploration

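The configuration cell below checks its dictionaries with `validate_config` from `shared_utils`, which is not reproduced here. A minimal sketch of that kind of fail-fast check follows, assuming it verifies required keys and value types. `validate_config_sketch` is a hypothetical name and the real helper may behave differently.

```python
def validate_config_sketch(config: dict, schema: dict) -> None:
    """Raise early if a key is missing or a value has the wrong type.

    Hypothetical helper: `schema` maps each required key to a type or a
    tuple of types, as accepted by isinstance().
    """
    for key, expected in schema.items():
        if key not in config:
            raise KeyError(f"Missing config key: {key!r}")
        value = config[key]
        allowed = expected if isinstance(expected, tuple) else (expected,)
        # bool is a subclass of int, so reject it unless bool is explicitly allowed
        if isinstance(value, bool) and bool not in allowed:
            raise TypeError(f"{key!r} must be {allowed}, got bool")
        if not isinstance(value, allowed):
            raise TypeError(f"{key!r} must be {allowed}, got {type(value).__name__}")

example_config = {"batch_size": 16, "use_fp16": True, "tsne_perplexity": 3.0}
example_schema = {"batch_size": int, "use_fp16": bool, "tsne_perplexity": (int, float)}
validate_config_sketch(example_config, example_schema)  # passes silently

try:
    validate_config_sketch({"batch_size": "16"}, {"batch_size": int})
except TypeError as err:
    print(f"Rejected: {err}")
```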
In [None]:
# =============================================================================
# GLOBAL CONFIGURATION (Single Source of Truth - SSoT)
# =============================================================================
# All configuration parameters are defined here for easy modification and 
# reproducibility. This follows the SSoT principle: change parameters here
# and they propagate throughout the notebook automatically.

# -----------------------------------------------------------------------------
# Random Seed (CRITICAL for Reproducibility)
# -----------------------------------------------------------------------------
# This seed controls ALL random operations in the notebook:
# - Python's random module
# - NumPy's random number generator  
# - PyTorch's CPU and CUDA random number generators
# - PyTorch's cudnn backend (deterministic mode)

RANDOM_SEED = 42  # Single source of truth for reproducibility

# Set seeds for all libraries
set_global_seed(RANDOM_SEED, verbose=True)

# -----------------------------------------------------------------------------
# Model Configuration
# -----------------------------------------------------------------------------
embedding_config = {
    "model_name": "yangheng/OmniGenome-52M",
    "aggregation_method": "mean",  # Options: "head", "mean", "tail"
    "max_length": 512,
    "batch_size": 16,
    "use_fp16": True,  # Half precision on GPU (faster, less memory)
}

# -----------------------------------------------------------------------------
# Analysis Configuration  
# -----------------------------------------------------------------------------
analysis_config = {
    "n_components_pca": 50,
    "n_components_tsne": 2,
    "n_clusters": 4,  # Number of sequence clusters to discover
    "random_state": RANDOM_SEED,  # Use same seed for consistency
    "tsne_perplexity": 3,  # t-SNE perplexity (scikit-learn requires perplexity < n_samples)
}

# -----------------------------------------------------------------------------
# Visualization Configuration
# -----------------------------------------------------------------------------
viz_config = {
    "figsize": (12, 8),
    "dpi": 100,
    "cmap": "viridis",
}

# -----------------------------------------------------------------------------
# Configuration Validation (Fail Fast)
# -----------------------------------------------------------------------------
from shared_utils import validate_config

# Validate embedding config schema
embedding_schema = {
    "model_name": str,
    "aggregation_method": str,
    "max_length": int,
    "batch_size": int,
    "use_fp16": bool,
}
validate_config(embedding_config, embedding_schema)

# Validate aggregation method
valid_agg_methods = ["head", "mean", "tail"]
assert embedding_config["aggregation_method"] in valid_agg_methods, \
    f"aggregation_method must be one of {valid_agg_methods}"

# Validate analysis config schema
analysis_schema = {
    "n_components_pca": int,
    "n_components_tsne": int,
    "n_clusters": int,
    "random_state": int,
    "tsne_perplexity": (int, float),  # Can be int or float
}
validate_config(analysis_config, analysis_schema)

# Validate ranges
assert 1 <= analysis_config["n_clusters"] <= 20, "n_clusters must be in [1, 20]"
assert analysis_config["n_components_tsne"] in [2, 3], "t-SNE components must be 2 or 3"

# -----------------------------------------------------------------------------
# Print Configuration Summary
# -----------------------------------------------------------------------------
print("\n" + "=" * 70)
print("CONFIGURATION SUMMARY (Single Source of Truth)")
print("=" * 70)

print(f"\n[SEED] Random seed: {RANDOM_SEED}")
print("  - Random operations are seeded for determinism")
print("  - Results should be reproducible on the same hardware and library versions")

print(f"\n[MODEL] Genomic Foundation Model:")
print(f"  - Model: {embedding_config['model_name']}")
print(f"  - Aggregation: {embedding_config['aggregation_method']}")
print(f"  - Max length: {embedding_config['max_length']}")
print(f"  - Batch size: {embedding_config['batch_size']}")
print(f"  - Mixed precision (FP16): {embedding_config['use_fp16']}")

print(f"\n[ANALYSIS] Dimensionality Reduction & Clustering:")
print(f"  - PCA components: {analysis_config['n_components_pca']}")
print(f"  - t-SNE components: {analysis_config['n_components_tsne']}")
print(f"  - Number of clusters: {analysis_config['n_clusters']}")
print(f"  - t-SNE perplexity: {analysis_config['tsne_perplexity']}")
print(f"  - Random state: {analysis_config['random_state']}")

print(f"\n[VISUALIZATION] Plot Settings:")
print(f"  - Figure size: {viz_config['figsize']}")
print(f"  - DPI: {viz_config['dpi']}")
print(f"  - Colormap: {viz_config['cmap']}")

print("\n" + "=" * 70)
print("[SUCCESS] Configuration validated and loaded")
print("[INFO] Modify parameters above to experiment with different settings")
print("=" * 70 + "\n")

## 🚀 Step 2: Model Loading

Now let's load the pre-trained genomic foundation model for embedding generation. 

### 🎯 Important: All OmniModel Classes Support Embeddings!

**All OmniGenBench models** now inherit from `EmbeddingMixin`, which means:
- ✅ `OmniModelForEmbedding` - Dedicated embedding extraction
- ✅ `OmniModelForSequenceClassification` - Classification + Embeddings
- ✅ `OmniModelForSequenceRegression` - Regression + Embeddings  
- ✅ `OmniModelForTokenClassification` - Token classification + Embeddings
- ✅ **All other OmniModel variants** - Task-specific + Embeddings

You can use **any** of these model types to extract embeddings and attention scores!

### Model Features
- **Pre-trained Understanding**: Leverages genomic foundation model knowledge
- **Flexible Aggregation**: Multiple methods for sequence-to-vector conversion
- **Batch Processing**: Efficient handling of multiple sequences
- **GPU Acceleration**: Automatic CUDA optimization for faster processing

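To illustrate what "aggregation" means here, the sketch below pools a matrix of per-token hidden states into a single vector in three ways. It is written in plain NumPy and is only an assumption about how `head`, `mean`, and `tail` behave (first token, average over non-padding tokens, and last non-padding token). The actual OmniGenBench implementation may differ in details such as special-token handling.

```python
import numpy as np

def pool_tokens(hidden: np.ndarray, mask: np.ndarray, agg: str = "mean") -> np.ndarray:
    """Collapse (seq_len, dim) token states into one (dim,) vector.

    `mask` holds 1 for real tokens and 0 for padding; at least one real
    token is assumed.
    """
    real = np.flatnonzero(mask)           # indices of non-padding tokens
    if agg == "head":
        return hidden[real[0]]            # first token (often a CLS-style token)
    if agg == "tail":
        return hidden[real[-1]]           # last non-padding token
    if agg == "mean":
        return hidden[real].mean(axis=0)  # average over real tokens only
    raise ValueError(f"Unknown aggregation method: {agg}")

# Toy example: 4 token positions (the last one is padding), 3 hidden dimensions
hidden = np.array([[1.0, 0.0, 2.0],
                   [3.0, 2.0, 0.0],
                   [5.0, 4.0, 4.0],
                   [9.0, 9.0, 9.0]])
mask = np.array([1, 1, 1, 0])

for method in ("head", "mean", "tail"):
    print(method, pool_tokens(hidden, mask, method))
```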
In [None]:
# =============================================================================
# STEP 2: Model Loading and Initialization
# =============================================================================
# Initialize the genomic foundation model for embedding extraction.
# All OmniModel types support embeddings via EmbeddingMixin.

print("=" * 70)
print("LOADING GENOMIC FOUNDATION MODEL")
print("=" * 70)
print(f"\n[INFO] Model: {embedding_config['model_name']}")
print("[INFO] Initializing OmniModelForEmbedding...")

try:
    # Option 1: Use dedicated embedding model (RECOMMENDED)
    embedding_model = OmniModelForEmbedding(
        embedding_config["model_name"],
        trust_remote_code=True
    )
    
    # Option 2: Any OmniModel type also works!
    # All OmniModel classes inherit from EmbeddingMixin and support the same API:
    # 
    # from omnigenbench import OmniModelForSequenceClassification
    # embedding_model = OmniModelForSequenceClassification.from_pretrained(
    #     embedding_config["model_name"], 
    #     num_labels=2, 
    #     trust_remote_code=True
    # )
    # 
    # They all support: .encode(), .batch_encode(), .extract_attention_scores(), etc.
    
    print("[SUCCESS] Model loaded successfully!")
    
    # Move to appropriate device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    embedding_model = embedding_model.to(device)
    print(f"[INFO] Device: {device}")
    
    # Cast weights to half precision for GPU inference (faster, less memory)
    if device.type == "cuda" and embedding_config["use_fp16"]:
        embedding_model = embedding_model.to(torch.float16)
        precision = "float16 (half precision)"
        print("[INFO] Half precision (FP16) enabled")
        print("  - Weight memory: roughly half of float32")
        print("  - Inference speed: typically faster on GPUs with FP16 support")
    else:
        precision = "float32"
        if device.type == "cuda":
            print("[INFO] Using float32 precision (set use_fp16=True to enable FP16)")
    
    # Print model info
    print(f"\n[MODEL INFO]")
    print(f"  - Architecture: Genomic foundation model")
    print(f"  - Parameters: ~52M")
    print(f"  - Device: {device}")
    print(f"  - Precision: {precision}")
    print(f"  - Embedding dimension: 768")
    print(f"  - Max sequence length: {embedding_config['max_length']}")
    
    print("\n[KEY POINTS]")
    print("  • All OmniModel types support embedding extraction via EmbeddingMixin")
    print("  • Half precision (FP16) roughly halves weight memory on GPU")
    print("  • Embeddings are 768-dimensional vectors (model hidden_size)")
    print("  • Three aggregation methods: head (CLS), mean (average), tail (last)")
    
except Exception as e:
    print(f"[ERROR] Failed to load model: {e}")
    print("\n[TROUBLESHOOTING]")
    print("  1. Check internet connection (required for first download)")
    print("  2. Verify PyTorch installation: pip install torch -U")
    print("  3. Try: pip install omnigenbench -U")
    print("  4. Check HuggingFace Hub access")
    raise

print("=" * 70)
print("[SUCCESS] Model initialization completed")
print("=" * 70 + "\n")

## 🚀 Step 3: Embedding Generation

Now let's generate embeddings for various types of genomic sequences. We'll use diverse sequences to demonstrate how the model captures different biological patterns and relationships.

### Our Sequence Collection

We'll analyze sequences with different characteristics:
- **Functional RNAs**: tRNAs, rRNAs, and regulatory sequences
- **Coding Sequences**: mRNAs encoding different proteins
- **Regulatory Elements**: Promoters, enhancers, and UTRs
- **Structural Variants**: Sequences with different folding properties

In [None]:
# =============================================================================
# STEP 3: Embedding Generation from Diverse Genomic Sequences
# =============================================================================
# Generate embeddings for various types of RNA sequences to demonstrate how
# the model captures different biological patterns and relationships.

# -----------------------------------------------------------------------------
# Define Diverse Sequence Collection
# -----------------------------------------------------------------------------
# We include sequences with different characteristics to test the model's
# ability to capture various biological features:
# - Functional RNAs (tRNA, rRNA, miRNA)
# - mRNA sequences (coding regions)
# - Regulatory elements (promoters, enhancers, UTRs)
# - Structural variants (hairpins, repeats, random)

genomic_sequences = {
    "Functional RNAs": {
        "tRNA-Ala": "GGGGGUAUAGCUCAGUGGUAGAGCGCGUGCCUUUGCAAGCACAAGAGUCUCGGGAGUCGUUGGUUCGAAUCACCGUACCCCCA",
        "rRNA-18S": "CGGCUACCACAUCCAAGGAAGGCAGCAGGCGCGCAAAUUACCCACUCCCGACCCGGGGAGGGUAGUGGCGGUUCGCCAGGA",
        "miRNA-21": "UAGCUUAUCAGACUGAUGUUGACUGUUGAAUCUCAUGGCAACACCAGUCGAUGGGCUGU",
    },
    "mRNA Sequences": {
        "Insulin mRNA": "AUGCCGCGCAACGAGGCCUACACUGUGCGAACUGCUGCCUGCUGCUGCCCGCUGCUGCUGCUGGGCUCCGCCCGCCGAG",
        "Hemoglobin mRNA": "AUGGUGGACGACGUGCUCGGCAAGAACGUCAACCACGUGAAGCUGGUGGUGGACGACGACGGCUGCGUGGGCAACUGC",
        "p53 mRNA": "AUGGAGGAGCCGCAGUCAGAUCCUAGCGUCCGGGACGACACGCCAACCUGCUCUCCUGCCGUCCCCGCCAAGACCAGC",
    },
    "Regulatory Elements": {
        "TATA promoter": "CGGCGCGCCAUAUAAAGCAUCGAGCGCGCACGUGCGCUGCGCGCGCGCUACGCGCGCAUGUGCGCGCACGUACGCGCG",
        "Enhancer seq": "GCGCGCGCACGUGCGCACGUGCGCGCACGUGCGCGCGCACGUGCGCGCACGUGCGCGCACGUGCGCGCACGUGCGCGC",
        "5'UTR": "GCGCGCCACCAAUGCGCGCGCCACCAUGUGCGCGCCACCAUGUGCGCGCCACCAUGUGCGCGCCACCAUGUGCGCGCC",
    },
    "Structural Variants": {
        "Hairpin RNA": "CGGAAACCCUUUGGGAAACCCGGGAAACCCUUUGGGAAACCCGGGAAACCCUUUGGGAAACCCG",
        "Repeat seq": "CACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACA",
        "Random seq": "AUGCGAUCUCGAGCUACGUCGAUGCUAGCUCGAUGGCAUCCGAUUCGAGCUACGUCGAUGCUAG",
    }
}

# Flatten sequences for batch processing
all_sequences = []
sequence_labels = []
sequence_categories = []

for category, sequences in genomic_sequences.items():
    for label, sequence in sequences.items():
        all_sequences.append(sequence)
        sequence_labels.append(label)
        sequence_categories.append(category)

print("=" * 70)
print("GENOMIC SEQUENCE COLLECTION")
print("=" * 70)
print(f"\n[INFO] Prepared {len(all_sequences)} genomic sequences:")
for category, sequences in genomic_sequences.items():
    print(f"  - {category}: {len(sequences)} sequences")
print(f"\n[INFO] Sequence length range: {min(len(s) for s in all_sequences)} - {max(len(s) for s in all_sequences)} nt")

# -----------------------------------------------------------------------------
# Generate Embeddings in Batches
# -----------------------------------------------------------------------------
print("\n" + "=" * 70)
print("EMBEDDING GENERATION")
print("=" * 70)
print(f"\n[INFO] Processing {len(all_sequences)} sequences...")
print(f"[INFO] Batch size: {embedding_config['batch_size']}")
print(f"[INFO] Aggregation method: {embedding_config['aggregation_method']}")
print(f"[INFO] Device: {device}")

# Process sequences in batches for efficiency
from tqdm import tqdm

batch_size = embedding_config["batch_size"]
all_embeddings = []

for i in tqdm(range(0, len(all_sequences), batch_size), 
              desc="Generating embeddings", 
              unit="batch"):
    batch_sequences = all_sequences[i:i + batch_size]
    batch_embeddings = embedding_model.batch_encode(
        batch_sequences, 
        agg=embedding_config["aggregation_method"]
    )
    all_embeddings.append(batch_embeddings)

# Concatenate all embeddings
genomic_embeddings = torch.cat(all_embeddings, dim=0)

# -----------------------------------------------------------------------------
# Validate Output
# -----------------------------------------------------------------------------
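from shared_utils import assert_shape

print("\n[VALIDATION]")
# The embedding width equals the checkpoint's hidden size, so read it from the
# tensor instead of hard-coding it; only the number of rows is checked strictly.
embedding_dim = genomic_embeddings.shape[-1]
assert_shape(genomic_embeddings, (len(all_sequences), embedding_dim), "genomic_embeddings")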
print(f"[SUCCESS] Embedding matrix shape: {genomic_embeddings.shape}")
print(f"  - Number of sequences: {genomic_embeddings.shape[0]}")
print(f"  - Embedding dimension: {genomic_embeddings.shape[1]}")

# Check value ranges (sanity check)
emb_min = genomic_embeddings.min().item()
emb_max = genomic_embeddings.max().item()
emb_mean = genomic_embeddings.mean().item()
emb_std = genomic_embeddings.std().item()

print(f"\n[STATISTICS]")
print(f"  - Value range: [{emb_min:.4f}, {emb_max:.4f}]")
print(f"  - Mean: {emb_mean:.4f}")
print(f"  - Std: {emb_std:.4f}")

# Sanity check: embeddings should be in reasonable range
assert -10 < emb_min < 10, f"Embedding min value out of range: {emb_min}"
assert -10 < emb_max < 10, f"Embedding max value out of range: {emb_max}"

print(f"\n[SUCCESS] Embedding generation completed!")
print("=" * 70 + "\n")

## 🔮 Step 4: Comprehensive Embedding Analysis

Now let's analyze our genomic embeddings to understand the relationships between sequences and explore various applications. This demonstrates the power of genomic foundation models in capturing biological meaning.

### Analysis Pipeline

Our comprehensive analysis includes four key components:
1. **Similarity Analysis**: Calculate pairwise cosine similarities between sequences
2. **Dimensionality Reduction**: Visualize embeddings in 2D space using PCA and t-SNE
3. **Clustering**: Discover sequence groups with similar genomic properties
4. **Biological Interpretation**: Connect computational results to biological insights

Let's begin with similarity analysis and visualization.


### 4.1: Pairwise Similarity Analysis

Let's compute the cosine similarity between all sequence pairs to understand which sequences the model considers functionally related.

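As a reminder of what the metric measures, cosine similarity is the dot product of two vectors divided by the product of their norms, so it depends on direction rather than magnitude. The short sketch below computes it by hand; the toy vectors and the `cosine` helper are illustrative and not taken from the model.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [-3.0, 0.0, 1.0]  # orthogonal to a

print(round(cosine(a, b), 6))  # 1.0
print(round(cosine(a, c), 6))  # 0.0
```

Values range from -1 (opposite directions) to 1 (same direction). The diagonal of a full similarity matrix is 1 because each embedding is compared with itself.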

In [None]:
# Compute pairwise cosine similarity matrix
# Convert embeddings to a float32 NumPy array for scikit-learn
# (detach from autograd and upcast in case the model ran in FP16)
embeddings_np = genomic_embeddings.detach().cpu().float().numpy()

# Compute similarity matrix (shape: n_sequences x n_sequences)
similarity_matrix = cosine_similarity(embeddings_np)

print(f"📊 Similarity Matrix Analysis:")
print(f"  Shape: {similarity_matrix.shape}")
print(f"  Value range: [{similarity_matrix.min():.4f}, {similarity_matrix.max():.4f}]")
print(f"  Mean similarity: {similarity_matrix.mean():.4f}")
print(f"  Diagonal values (self-similarity): {np.diag(similarity_matrix).mean():.4f}")

# Visualize similarity matrix as heatmap
plt.figure(figsize=viz_config["figsize"])
sns.heatmap(
    similarity_matrix,
    xticklabels=sequence_labels,
    yticklabels=sequence_labels,
    cmap=viz_config["cmap"],
    annot=False,  # Set to True to show values (cluttered for many sequences)
    fmt=".2f",
    cbar_kws={'label': 'Cosine Similarity'},
    square=True
)
plt.title("Pairwise Sequence Similarity Matrix\n(Brighter = More Similar)", fontsize=14, pad=20)
plt.xlabel("Sequences", fontsize=12)
plt.ylabel("Sequences", fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Find most similar sequence pairs (excluding self-similarity)
print(f"\n🔬 Top 5 Most Similar Sequence Pairs:")
# Set diagonal to -1 to exclude self-similarity
sim_no_diag = similarity_matrix.copy()
np.fill_diagonal(sim_no_diag, -1)

# Get top 5 pairs
n_seqs = len(sequence_labels)
top_pairs = []
for i in range(n_seqs):
    for j in range(i+1, n_seqs):
        top_pairs.append((i, j, sim_no_diag[i, j]))
top_pairs.sort(key=lambda x: x[2], reverse=True)

for rank, (i, j, sim) in enumerate(top_pairs[:5], 1):
    print(f"  {rank}. {sequence_labels[i]} ↔ {sequence_labels[j]}")
    print(f"     Similarity: {sim:.4f} | Categories: {sequence_categories[i]} vs {sequence_categories[j]}")


### 4.2: Dimensionality Reduction and Visualization

High-dimensional embeddings (768-D) are difficult to visualize. Let's use PCA and t-SNE to project them into 2D space while preserving the most important relationships.

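For intuition about the PCA step, the sketch below implements it with a plain SVD on random stand-in data (not real embeddings). It also makes one constraint visible: PCA can return at most `min(n_samples, n_features)` components, so a small collection of sequences limits how many components can be requested. The function name `pca_svd` is illustrative.

```python
import numpy as np

def pca_svd(X: np.ndarray, n_components: int) -> tuple[np.ndarray, np.ndarray]:
    """Project X onto its top principal components using SVD."""
    max_components = min(X.shape)
    if n_components > max_components:
        raise ValueError(f"n_components must be <= {max_components}")
    centered = X - X.mean(axis=0)                # PCA works on centered data
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ Vt[:n_components].T   # coordinates in PC space
    explained = (S ** 2) / np.sum(S ** 2)        # variance ratio per component
    return projected, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 64))  # 12 "sequences", 64-dimensional stand-in vectors
projected, ratio = pca_svd(X, n_components=5)
print(projected.shape)  # (12, 5)
print(bool(np.all(np.diff(ratio) <= 0)))  # ratios come out in decreasing order
```

Requesting more components than there are samples, for example `pca_svd(X, 50)` with only 12 rows, raises a `ValueError` in this sketch; scikit-learn's `PCA` enforces the same limit.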

In [None]:
# Apply PCA for initial dimensionality reduction
# PCA cannot return more components than min(n_samples, n_features),
# so cap the configured value for small sequence collections.
n_pca = min(analysis_config["n_components_pca"], *embeddings_np.shape)
print("🔄 Applying PCA dimensionality reduction...")
pca = PCA(
    n_components=n_pca,
    random_state=analysis_config["random_state"]
)
embeddings_pca = pca.fit_transform(embeddings_np)

explained_var = pca.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)
n_report = min(10, n_pca)
print(f"  PCA reduced dimensions: {embeddings_np.shape[1]} → {n_pca}")
print(f"  Variance explained by first {n_report} components: {cumulative_var[n_report - 1]:.2%}")
print(f"  Variance explained by all {n_pca} components: {cumulative_var[-1]:.2%}")

# Apply t-SNE for 2D visualization
print(f"\n🔄 Applying t-SNE for 2D visualization...")
tsne = TSNE(
    n_components=analysis_config["n_components_tsne"],
    random_state=analysis_config["random_state"],
    perplexity=min(analysis_config["tsne_perplexity"], (len(embeddings_pca) - 1) / 3),
    # Iteration count is left at the default (1000); the argument was renamed
    # from n_iter to max_iter in newer scikit-learn releases.
    verbose=0
)
embeddings_2d = tsne.fit_transform(embeddings_pca)
print(f"  t-SNE reduced dimensions: {n_pca} → {analysis_config['n_components_tsne']}")

# Visualize embeddings colored by category
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Colored by sequence category
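# dict.fromkeys keeps first-seen order, so colors stay stable across runs
# (iteration order of a set of strings can change between Python sessions)
categories_unique = list(dict.fromkeys(sequence_categories))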
colors = plt.cm.tab10(np.linspace(0, 1, len(categories_unique)))
category_to_color = {cat: colors[i] for i, cat in enumerate(categories_unique)}

for category in categories_unique:
    mask = [cat == category for cat in sequence_categories]
    ax1.scatter(
        embeddings_2d[mask, 0],
        embeddings_2d[mask, 1],
        c=[category_to_color[category]],
        label=category,
        s=100,
        alpha=0.7,
        edgecolors='black',
        linewidth=1
    )

ax1.set_xlabel('t-SNE Dimension 1', fontsize=12)
ax1.set_ylabel('t-SNE Dimension 2', fontsize=12)
ax1.set_title('Genomic Embeddings (Colored by Category)', fontsize=14, pad=15)
ax1.legend(loc='best', fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot 2: Annotated with sequence labels
for i, (x, y) in enumerate(embeddings_2d):
    cat = sequence_categories[i]
    ax2.scatter(x, y, c=[category_to_color[cat]], s=100, alpha=0.7, edgecolors='black', linewidth=1)
    ax2.annotate(
        sequence_labels[i],
        (x, y),
        xytext=(5, 5),
        textcoords='offset points',
        fontsize=8,
        alpha=0.8
    )

ax2.set_xlabel('t-SNE Dimension 1', fontsize=12)
ax2.set_ylabel('t-SNE Dimension 2', fontsize=12)
ax2.set_title('Genomic Embeddings (Annotated)', fontsize=14, pad=15)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n🎨 Visualization Interpretation:")
print(f"  • Sequences close together have similar genomic properties")
print(f"  • Clusters indicate functional or structural similarity")
print(f"  • t-SNE preserves local neighborhoods; distances between far-apart groups are not reliable")


### 4.3: Unsupervised Clustering Analysis

Let's use K-means clustering to automatically discover groups of sequences with similar genomic properties. This demonstrates how embeddings can be used for unsupervised sequence classification.

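K-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The toy implementation below sketches that loop on made-up 2D points. It is not a replacement for scikit-learn's `KMeans`, and the function name and the simple "first k points" initialization are assumptions made for brevity.

```python
import numpy as np

def kmeans_sketch(X: np.ndarray, k: int, n_iter: int = 20) -> tuple[np.ndarray, np.ndarray]:
    """Minimal Lloyd's algorithm; returns (labels, centroids)."""
    centroids = X[:k].copy()  # naive init: first k points
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated toy groups
X = np.array([[0.0, 0.0], [10.0, 10.0], [0.2, 0.1],
              [9.8, 10.1], [0.1, 0.3], [10.2, 9.9]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

Production implementations add smarter initialization (such as k-means++) and several restarts, which is what the `n_init` argument controls in scikit-learn.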

In [None]:
# Perform K-means clustering on the embeddings
print(f"🔍 Performing K-means clustering (k={analysis_config['n_clusters']})...")

kmeans = KMeans(
    n_clusters=analysis_config["n_clusters"],
    random_state=analysis_config["random_state"],
    n_init=10
)
cluster_labels = kmeans.fit_predict(embeddings_np)

# Analyze cluster composition
print(f"\n📊 Cluster Composition Analysis:")
for cluster_id in range(analysis_config["n_clusters"]):
    # Indices of the sequences assigned to this cluster
    member_indices = np.flatnonzero(cluster_labels == cluster_id)

    print(f"\n  Cluster {cluster_id} ({len(member_indices)} sequences):")
    for idx in member_indices:
        print(f"    • {sequence_labels[idx]} ({sequence_categories[idx]})")

# Visualize clusters
plt.figure(figsize=viz_config["figsize"])
scatter = plt.scatter(
    embeddings_2d[:, 0],
    embeddings_2d[:, 1],
    c=cluster_labels,
    cmap='tab10',
    s=200,
    alpha=0.6,
    edgecolors='black',
    linewidth=1.5
)

# Mark each cluster's position in the 2D plot as the mean of its members'
# t-SNE coordinates. Re-running t-SNE on the centers would produce a new layout
# that no longer lines up with the points plotted above.
centers_2d = np.array([
    embeddings_2d[cluster_labels == k].mean(axis=0)
    for k in range(analysis_config["n_clusters"])
])
plt.scatter(
    centers_2d[:, 0],
    centers_2d[:, 1],
    c='red',
    marker='X',
    s=300,
    edgecolors='black',
    linewidth=2,
    label='Cluster Centers',
    zorder=10
)

# Annotate sequences with labels
for i, (x, y) in enumerate(embeddings_2d):
    plt.annotate(
        sequence_labels[i],
        (x, y),
        xytext=(5, 5),
        textcoords='offset points',
        fontsize=8,
        alpha=0.7
    )

plt.colorbar(scatter, label='Cluster ID')
plt.xlabel('t-SNE Dimension 1', fontsize=12)
plt.ylabel('t-SNE Dimension 2', fontsize=12)
plt.title(f'K-Means Clustering (k={analysis_config["n_clusters"]})', fontsize=14, pad=15)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\n💡 Clustering Insights:")
print(f"  • Sequences in the same cluster share similar genomic features")
print(f"  • Clustering is unsupervised - no labels were used")
print(f"  • Can be used for automated sequence classification and annotation")


### 4.4: Biological Interpretation and Applications

Now let's interpret our computational results from a biological perspective and explore practical applications of genomic embeddings.

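Functional annotation by embedding similarity, mentioned among the applications in the introduction, can be sketched in a few lines: an unknown sequence takes the label of its most similar reference embedding. The vectors and labels below are made up for illustration, and `nearest_label` is a hypothetical helper rather than an OmniGenBench function.

```python
import numpy as np

def nearest_label(query: np.ndarray, reference: np.ndarray, labels: list[str]) -> tuple[str, float]:
    """Return the label of the reference vector most cosine-similar to `query`."""
    ref_unit = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = ref_unit @ query_unit  # cosine similarity to each reference vector
    best = int(np.argmax(sims))
    return labels[best], float(sims[best])

# Made-up 3-dimensional "embeddings" standing in for annotated sequences
reference = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
labels = ["tRNA-like", "mRNA-like", "regulatory-like"]
query = np.array([0.1, 0.9, 0.2])

label, score = nearest_label(query, reference, labels)
print(label, round(score, 3))
```

In practice the reference matrix would hold embeddings of annotated sequences, and a similarity threshold would help avoid assigning labels to queries that resemble nothing in the reference set.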

In [None]:
# Biological Interpretation of Results
print("🧬 BIOLOGICAL INTERPRETATION OF EMBEDDING ANALYSIS")
print("=" * 70)

# 1. Analyze intra-category similarity
print("\n1️⃣  Intra-Category Similarity (Are similar RNAs grouped together?)")
print("-" * 70)
for category in categories_unique:
    cat_indices = [i for i, cat in enumerate(sequence_categories) if cat == category]
    if len(cat_indices) > 1:
        # Calculate average pairwise similarity within category
        intra_sim = []
        for i in cat_indices:
            for j in cat_indices:
                if i < j:
                    intra_sim.append(similarity_matrix[i, j])
        
        avg_intra_sim = np.mean(intra_sim) if intra_sim else 0
        print(f"  {category}: {avg_intra_sim:.4f}")
        print(f"    Interpretation: {'High functional similarity detected' if avg_intra_sim > 0.7 else 'Moderate diversity within category'}")

# 2. Identify functionally related sequences across categories
print(f"\n2️⃣  Cross-Category Relationships (Unexpected similarities)")
print("-" * 70)
cross_cat_pairs = []
for i in range(len(sequence_labels)):
    for j in range(i+1, len(sequence_labels)):
        if sequence_categories[i] != sequence_categories[j]:
            cross_cat_pairs.append((i, j, similarity_matrix[i, j]))

cross_cat_pairs.sort(key=lambda x: x[2], reverse=True)
for i, j, sim in cross_cat_pairs[:3]:
    print(f"  {sequence_labels[i]} ↔ {sequence_labels[j]}")
    print(f"    Similarity: {sim:.4f}")
    print(f"    Categories: {sequence_categories[i]} vs {sequence_categories[j]}")
    print(f"    Possible reason: Shared sequence motifs or structural patterns\n")

# 3. Application examples
print(f"\n3️⃣  Practical Applications of These Embeddings")
print("-" * 70)
applications = [
    ("Sequence Database Search", "Find functionally similar sequences to a query"),
    ("Functional Annotation", "Predict function of unknown sequences by nearest neighbors"),
    ("Evolutionary Analysis", "Study sequence relationships without alignment"),
    ("Drug Target Discovery", "Identify related therapeutic targets"),
    ("Synthetic Biology", "Design new sequences with desired properties"),
    ("Quality Control", "Detect contamination or misannotated sequences"),
]

for app_name, app_desc in applications:
    print(f"  • {app_name}")
    print(f"    └─ {app_desc}")

# 4. Model capabilities demonstrated
print(f"\n4️⃣  What Pre-trained Embeddings Can Capture (Without Task-Specific Training)")
print("-" * 70)
learned_features = [
    "Sequence motifs and patterns (k-mers, repeats)",
    "Structural propensities (hairpins, loops, stems)",
    "Functional relationships (coding vs regulatory)",
    "Evolutionary conservation signals",
    "Compositional biases (GC content, codon usage)",
]

for feature in learned_features:
    print(f"  ✓ {feature}")

print(f"\n💡 Key Insight:")
print(f"  The genomic foundation model captured biologically meaningful")
print(f"  relationships without any task-specific training. This is the")
print(f"  power of pre-training on large-scale genomic data!")
print("=" * 70)
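
The nested loops in the cell above can also be written in vectorized form. The sketch below is illustrative only: `category_similarity_summary` is a hypothetical helper (not part of OmniGenBench), and the toy similarity matrix and category labels are made up. It computes the mean intra-category similarity for each category and the mean cross-category similarity, counting every unordered pair once.

```python
import numpy as np

def category_similarity_summary(similarity_matrix, categories):
    """Return ({category: mean intra-category similarity}, mean cross-category similarity)."""
    sim = np.asarray(similarity_matrix, dtype=float)
    cats = np.asarray(categories)
    same = cats[:, None] == cats[None, :]                  # True where both sequences share a category
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)   # each unordered pair once, diagonal excluded

    intra = {}
    for cat in np.unique(cats):
        in_cat = cats == cat
        mask = upper & in_cat[:, None] & in_cat[None, :]
        if mask.any():                                     # categories with one member have no pairs
            intra[str(cat)] = float(sim[mask].mean())

    cross_mask = upper & ~same
    cross = float(sim[cross_mask].mean()) if cross_mask.any() else float("nan")
    return intra, cross

# Toy data (made up for illustration): two tRNA-like and two rRNA-like sequences
toy_sim = np.array([
    [1.0, 0.9, 0.2, 0.3],
    [0.9, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.8],
    [0.3, 0.2, 0.8, 1.0],
])
toy_cats = ["tRNA", "tRNA", "rRNA", "rRNA"]
intra, cross = category_similarity_summary(toy_sim, toy_cats)
print(intra, cross)
```

A large gap between the intra-category means and the cross-category mean suggests the embeddings separate the categories; on a handful of sequences this is only a rough indication.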


---

## 🎯 Step 5: Working with Single Sequences and Practical APIs

Now that we understand the big picture, let's explore practical APIs for everyday use cases.


### 5.1: Computing Similarity Between Two Sequences

A common task is to estimate how closely two sequences are related by computing the cosine similarity of their embeddings. A high score suggests shared features, but it is not proof of shared function.


In [None]:
# Example: Compare two sequences from our analysis
seq1_idx = 0  # tRNA-Ala
seq2_idx = 1  # rRNA-18S

# Method 1: Using pre-computed embeddings
similarity = embedding_model.compute_similarity(
    genomic_embeddings[seq1_idx],
    genomic_embeddings[seq2_idx]
)

print(f"Comparing: {sequence_labels[seq1_idx]} vs {sequence_labels[seq2_idx]}")
print(f"Cosine Similarity: {similarity:.4f}")
print(f"\nInterpretation:")
if similarity > 0.8:
    print("  → Very high similarity (likely functionally related)")
elif similarity > 0.6:
    print("  → Moderate similarity (may share some properties)")
else:
    print("  → Low similarity (functionally distinct)")

# Method 2: Direct comparison of two new sequences
new_seq1 = "AUGCGAUCGAUCGAU"
new_seq2 = "AUGCGAUCGAUUUUU"

emb1 = embedding_model.encode(new_seq1, agg='mean')
emb2 = embedding_model.encode(new_seq2, agg='mean')
new_similarity = embedding_model.compute_similarity(emb1, emb2)

print(f"\n\nNew sequence comparison:")
print(f"Sequence 1: {new_seq1}")
print(f"Sequence 2: {new_seq2}")
print(f"Similarity: {new_similarity:.4f}")
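
For reference, cosine similarity itself is a short computation. The sketch below assumes `compute_similarity` returns the standard cosine similarity, as the text above describes; the `cosine_similarity` function and its `eps` guard are illustrative and are not the library's implementation.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    # eps avoids division by zero when one vector is all zeros
    denom = max(np.linalg.norm(a) * np.linalg.norm(b), eps)
    return float(np.dot(a, b) / denom)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-3.0, 0.0])        # antiparallel vectors
print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, scaling an embedding does not change it. The 0.8 and 0.6 cut-offs used in the cell above are rules of thumb and should be calibrated on sequences with known relationships.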


### 5.2: Understanding Aggregation Methods (head, mean, tail)

Different aggregation methods capture different aspects of sequence information. Let's understand when to use each one.


In [None]:
# Test single RNA sequence with all three aggregation methods
single_rna_sequence = "AUGGCUACGAUCGAUCGAU"

print(f"🧬 Analyzing sequence: {single_rna_sequence}")
print(f"   Length: {len(single_rna_sequence)} nucleotides\n")

# Get embeddings using different aggregation methods
head_embedding = embedding_model.encode(single_rna_sequence, agg='head', keep_dim=True)
mean_embedding = embedding_model.encode(single_rna_sequence, agg='mean')
tail_embedding = embedding_model.encode(single_rna_sequence, agg='tail')

print(f"📊 Embedding Shapes and Properties:")
print(f"\n1. HEAD aggregation (CLS token):")
print(f"   Shape: {head_embedding.shape}")
print(f"   Use case: Sequence-level classification tasks")
print(f"   Captures: Global sequence identity and context")
print(f"   Sample values: {head_embedding.squeeze()[:5].cpu().numpy()}")

print(f"\n2. MEAN aggregation (average pooling):")
print(f"   Shape: {mean_embedding.shape}")
print(f"   Use case: Similarity search, clustering (RECOMMENDED)")
print(f"   Captures: Average properties across all positions")
print(f"   Sample values: {mean_embedding[:5].cpu().numpy()}")

print(f"\n3. TAIL aggregation (last token):")
print(f"   Shape: {tail_embedding.shape}")
print(f"   Use case: Generative tasks, sequential modeling")
print(f"   Captures: Final state after processing entire sequence")
print(f"   Sample values: {tail_embedding[:5].cpu().numpy()}")

# Compare how different aggregations affect similarity
test_seq2 = "AUGGCUACGAUCGAUAAAA"  # Similar prefix, different suffix

similarities = {}
for agg_method in ['head', 'mean', 'tail']:
    emb1 = embedding_model.encode(single_rna_sequence, agg=agg_method)
    emb2 = embedding_model.encode(test_seq2, agg=agg_method)
    sim = embedding_model.compute_similarity(emb1, emb2)
    similarities[agg_method] = sim

print(f"\n🔬 Similarity Analysis with Different Aggregations:")
print(f"   Comparing: {single_rna_sequence}")
print(f"         vs: {test_seq2}")
print(f"   (Note: Different suffix)\n")
for method, sim in similarities.items():
    print(f"   {method.upper():>6}: {sim:.4f}")

print(f"\n💡 Recommendation: Use 'mean' for most applications")
print(f"   It provides robust, position-invariant representations.")
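
To make the three aggregation modes concrete, here is a minimal NumPy sketch of how per-token hidden states could be pooled into one vector. It is an assumption about the general technique, not OmniGenBench's actual code: `pool_hidden_states` is a hypothetical helper, the sequence is assumed to be right-padded, and the first position is assumed to hold the CLS-style token.

```python
import numpy as np

def pool_hidden_states(hidden, mask, agg="mean"):
    """Pool token states into one vector.

    hidden: (seq_len, dim) array of per-token hidden states.
    mask:   (seq_len,) array, 1 for real tokens and 0 for padding.
    """
    hidden = np.asarray(hidden, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no tokens")
    if agg == "head":
        return hidden[0]                       # first position (assumed CLS-style token)
    if agg == "tail":
        last = np.flatnonzero(mask)[-1]        # last non-padding position
        return hidden[last]
    if agg == "mean":
        return hidden[mask].mean(axis=0)       # average over non-padding positions only
    raise ValueError(f"unknown agg: {agg!r}")

# Toy hidden states: three real tokens followed by one padding position
toy_hidden = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]])
toy_mask = [1, 1, 1, 0]
head_vec = pool_hidden_states(toy_hidden, toy_mask, "head")
mean_vec = pool_hidden_states(toy_hidden, toy_mask, "mean")
tail_vec = pool_hidden_states(toy_hidden, toy_mask, "tail")
print(head_vec, mean_vec, tail_vec)
```

Excluding padding from the mean matters in batches: without the mask, short sequences would be pulled toward the padding vector.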


### 5.3: Saving and Loading Embeddings

For large-scale analyses, you'll want to save embeddings to avoid recomputing them.


In [None]:
# Save embeddings to file
output_path = "genomic_embeddings.pt"
embedding_model.save_embeddings(genomic_embeddings, output_path)
print(f"✅ Embeddings saved to: {output_path}")
print(f"   In-memory tensor size: {genomic_embeddings.element_size() * genomic_embeddings.nelement() / 1024:.2f} KB")

# Load embeddings from file
loaded_embeddings = embedding_model.load_embeddings(output_path)
print(f"\n✅ Loaded embeddings from: {output_path}")
print(f"   Shape: {loaded_embeddings.shape}")
print(f"   Data type: {loaded_embeddings.dtype}")

# Verify integrity
are_equal = torch.allclose(genomic_embeddings.cpu(), loaded_embeddings.cpu(), rtol=1e-5)
print(f"\n🔍 Integrity check: {'PASSED' if are_equal else 'FAILED'}")

# Cleanup
import os
if os.path.exists(output_path):
    os.remove(output_path)
    print(f"\n🧹 Cleaned up temporary file: {output_path}")

print(f"\n💡 Tip: For production, use numpy format for better compatibility:")
print(f"   np.save('embeddings.npy', embeddings.cpu().numpy())")
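
Expanding on the NumPy tip, the sketch below stores embeddings together with their sequence labels in a single compressed `.npz` archive, so the two cannot drift apart. The helper names `save_embeddings_npz` and `load_embeddings_npz` are hypothetical, and the toy array stands in for real embeddings (for a PyTorch tensor, pass `embeddings.cpu().numpy()`).

```python
import os
import tempfile
import numpy as np

def save_embeddings_npz(path, embeddings, labels):
    """Write embeddings and their labels to one compressed .npz archive."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    labels = np.asarray(labels, dtype=str)
    if embeddings.shape[0] != labels.shape[0]:
        raise ValueError("need exactly one label per embedding row")
    np.savez_compressed(path, embeddings=embeddings, labels=labels)

def load_embeddings_npz(path):
    """Read back the (embeddings, labels) pair written by save_embeddings_npz."""
    with np.load(path) as data:
        return data["embeddings"], data["labels"]

# Round trip on toy data inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    npz_path = os.path.join(tmp_dir, "embeddings.npz")
    toy_embeddings = np.arange(6, dtype=np.float32).reshape(2, 3)
    save_embeddings_npz(npz_path, toy_embeddings, ["seq_a", "seq_b"])
    loaded, loaded_labels = load_embeddings_npz(npz_path)
    file_size_kb = os.path.getsize(npz_path) / 1024   # size on disk, not in memory
print(loaded.shape, list(loaded_labels), f"{file_size_kb:.2f} KB")
```

Here `os.path.getsize` reports the size on disk, which differs from the in-memory tensor size once compression is applied.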


---

## 📚 Reference: Complete Standalone Example

This is a **self-contained** code block that summarizes all the key APIs. You can copy this to a new script and run it independently.

**Note**: This is for reference only - you don't need to run this cell if you've completed the tutorial above.


In [None]:
"""
STANDALONE EXAMPLE: Genomic Embedding Extraction with OmniGenBench

This example demonstrates all core APIs in a single script.
Copy this to a .py file for production use.
"""

from omnigenbench import OmniModelForEmbedding
import torch
import numpy as np

# Configuration
MODEL_NAME = "yangheng/OmniGenome-52M"
RANDOM_SEED = 42

# Set reproducibility
torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# Step 1: Initialize model
print("🔧 Loading model...")
embedding_model = OmniModelForEmbedding(MODEL_NAME, trust_remote_code=True)

# Move to device with optional mixed precision
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_model = embedding_model.to(device)
if device.type == "cuda":
    embedding_model = embedding_model.to(torch.float16)

print(f"✅ Model loaded on {device}")

# Step 2: Encode sequences (batch processing)
rna_sequences = [
    "AUGGCUACG",
    "CGGAUACGGC",
    "AUGCGAUCGAUCGAU"
]

print(f"\n📊 Encoding {len(rna_sequences)} sequences...")
rna_embeddings = embedding_model.batch_encode(
    rna_sequences,
    batch_size=8,
    agg='mean'  # Options: 'head', 'mean', 'tail'
)

print(f"   Output shape: {rna_embeddings.shape}")  # (3, hidden_size); hidden_size depends on the model
print(f"   First embedding sample: {rna_embeddings[0][:5].cpu().numpy()}")

# Step 3: Compute similarity
similarity = embedding_model.compute_similarity(
    rna_embeddings[0],
    rna_embeddings[1]
)
print(f"\n🔬 Similarity between seq1 and seq2: {similarity:.4f}")

# Step 4: Encode single sequence with different aggregations
single_seq = "AUGGCUACGAUCGAU"
head_emb = embedding_model.encode(single_seq, agg='head')
mean_emb = embedding_model.encode(single_seq, agg='mean')
tail_emb = embedding_model.encode(single_seq, agg='tail')

print(f"\n📊 Single sequence embeddings:")
print(f"   Head: {head_emb.shape}")
print(f"   Mean: {mean_emb.shape}")
print(f"   Tail: {tail_emb.shape}")

# Step 5: Save and load embeddings
save_path = "embeddings.pt"
embedding_model.save_embeddings(rna_embeddings, save_path)
loaded_embs = embedding_model.load_embeddings(save_path)
print(f"\n✅ Saved and loaded embeddings: {loaded_embs.shape}")

# Cleanup
import os
if os.path.exists(save_path):
    os.remove(save_path)

# Step 6: Extract attention scores (optional)
# Note: Requires FP32 precision for numerical stability
print(f"\n🎯 Extracting attention scores...")
model_fp32 = OmniModelForEmbedding(MODEL_NAME, trust_remote_code=True).to(device)
attention_result = model_fp32.extract_attention_scores(
    sequence=single_seq,
    max_length=128,
    layer_indices=[0, -1],  # First and last layers
)
print(f"   Attention shape: {attention_result['attentions'].shape}")
print(f"   Format: (layers, heads, seq_len, seq_len)")

print(f"\n🎉 Tutorial completed successfully!")
print(f"\n💡 Remember: All OmniModel types support these embedding APIs!")
print(f"   • OmniModelForEmbedding (dedicated)")
print(f"   • OmniModelForSequenceClassification")
print(f"   • OmniModelForSequenceRegression")
print(f"   • OmniModelForTokenClassification")
print(f"   ... and all other OmniModel variants via EmbeddingMixin!")


---

## 🎉 Tutorial Summary and Next Steps

### 🎓 What You've Learned

Congratulations! You've completed a comprehensive journey through genomic embedding generation and analysis. Let's recap the key concepts and skills you've mastered:

#### 1. **Conceptual Understanding**
- ✅ What genomic embeddings are and why they matter
- ✅ How genomic foundation models learn representations
- ✅ The difference between traditional encoding (one-hot) and learned embeddings
- ✅ Applications across drug discovery, evolutionary analysis, and synthetic biology

#### 2. **Technical Skills**
- ✅ Loading pre-trained genomic foundation models
- ✅ Generating embeddings for RNA/DNA sequences (single and batch)
- ✅ Understanding aggregation methods (head, mean, tail)
- ✅ Computing sequence similarities
- ✅ Saving and loading embeddings efficiently

#### 3. **Analytical Capabilities**
- ✅ Pairwise similarity analysis with heatmaps
- ✅ Dimensionality reduction (PCA, t-SNE) for visualization
- ✅ Unsupervised clustering to discover sequence groups
- ✅ Biological interpretation of computational results

#### 4. **Best Practices**
- ✅ Setting random seeds for reproducibility
- ✅ Using mixed precision (FP16) for GPU efficiency
- ✅ Batch processing for large datasets
- ✅ Proper error handling and resource cleanup

---

### 🚀 Real-World Applications

Your new skills enable you to tackle important biological problems:

#### **Drug Discovery & Target Identification**
```python
# Find sequences similar to a therapeutic target
target_seq = "AUGCGA..."
target_emb = model.encode(target_seq)
# Search database and rank by similarity
```
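
The "search database and rank by similarity" step could look like the sketch below, assuming the database embeddings are already computed and stacked into a 2D array. `top_k_similar` is a hypothetical helper and the toy vectors are made up; for very large databases, an approximate nearest-neighbor index would replace this brute-force scan.

```python
import numpy as np

def top_k_similar(query, database, k=3, eps=1e-12):
    """Return [(row_index, cosine_similarity)] for the k database rows closest to the query."""
    query = np.asarray(query, dtype=np.float64)
    database = np.asarray(database, dtype=np.float64)
    # Normalize to unit length so a dot product equals cosine similarity
    q = query / max(np.linalg.norm(query), eps)
    db = database / np.maximum(np.linalg.norm(database, axis=1, keepdims=True), eps)
    scores = db @ q
    order = np.argsort(-scores)[:k]            # indices sorted by descending similarity
    return [(int(i), float(scores[i])) for i in order]

# Toy database of four 2-dimensional "embeddings" (made up for illustration)
toy_database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.1]
hits = top_k_similar(toy_query, toy_database, k=2)
print(hits)
```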

#### **Functional Annotation**
```python
# Predict function of unknown sequences
unknown_emb = model.encode(unknown_seq)
# Find k-nearest neighbors with known functions
```
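
One simple way to turn nearest neighbors into an annotation is a majority vote. The sketch below assumes a reference set of embeddings with known labels; `knn_annotate` is a hypothetical helper, the toy data is made up, and the vote fraction is only a crude confidence signal.

```python
from collections import Counter
import numpy as np

def knn_annotate(unknown, reference, reference_labels, k=3, eps=1e-12):
    """Predict a label by majority vote among the k most cosine-similar reference embeddings."""
    unknown = np.asarray(unknown, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    u = unknown / max(np.linalg.norm(unknown), eps)
    ref = reference / np.maximum(np.linalg.norm(reference, axis=1, keepdims=True), eps)
    nearest = np.argsort(-(ref @ u))[:k]       # indices of the k nearest references
    votes = Counter(reference_labels[i] for i in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)         # predicted label and fraction of votes

# Toy reference set (made up): labels are known, embeddings are 2-dimensional
toy_reference = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.2]]
toy_labels = ["tRNA", "tRNA", "rRNA", "rRNA", "tRNA"]
predicted, vote_fraction = knn_annotate([1.0, 0.05], toy_reference, toy_labels, k=3)
print(predicted, vote_fraction)
```

A low vote fraction is a reasonable trigger for manual review rather than automatic annotation.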

#### **Quality Control**
```python
# Detect contamination or misannotation
expected_emb = model.encode(expected_seq)
observed_emb = model.encode(observed_seq)
if model.compute_similarity(expected_emb, observed_emb) < 0.8:
    flag_for_review()
```

---

### 📚 Further Learning

Explore these related tutorials to expand your expertise:

#### **Advanced Topics**
1. **[RNA Secondary Structure Prediction](../rna_secondary_structure_prediction/)** - Predict RNA folding patterns
2. **[mRNA Degradation Rate Prediction](../mRNA_degrad_rate_regression/)** - Token-level regression tasks
3. **[Attention Score Extraction](../attention_score_extraction/)** - Visualize model attention patterns
4. **[Translation Efficiency Prediction](../translation_efficiency_prediction/)** - Predict protein production rates

#### **Production Deployment**
- **Fine-tuning models** on your custom datasets with `AutoTrain`
- **Benchmarking** model performance with `AutoBench`
- **Large-scale inference** with distributed computing

---

### 🔧 Troubleshooting Guide

#### **Common Issues and Solutions**

| Issue | Possible Cause | Solution |
|-------|---------------|----------|
| `CUDA out of memory` | GPU memory insufficient | Reduce `batch_size` or use CPU |
| `Model download fails` | Network issues | Check internet connection, try VPN |
| `Import error` | Package not installed | Run `pip install omnigenbench -U` |
| `Inconsistent results` | Random seed not set | Call `torch.manual_seed(RANDOM_SEED)` and `np.random.seed(RANDOM_SEED)` before analysis |
| `Slow inference` | Not using GPU | Check `torch.cuda.is_available()` |

#### **Performance Optimization**

For large-scale analysis (1000+ sequences):

1. Use GPU with mixed precision (FP16)
2. Increase `batch_size` to 32 or 64
3. Save embeddings to disk to avoid recomputation
4. Use multiprocessing for CPU-bound tasks
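
Batching and avoiding recomputation can be combined in a small wrapper. This is a sketch under stated assumptions: `encode_batch` stands for any callable that maps a list of sequences to a list of vectors (for example a thin wrapper around `batch_encode`), `encode_with_cache` and `iter_batches` are hypothetical helpers, and the in-memory dict could be swapped for an on-disk store.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def encode_with_cache(sequences, encode_batch, cache, batch_size=32):
    """Encode only sequences missing from the cache, in batches; return vectors in input order."""
    # dict.fromkeys removes duplicates while preserving order
    missing = [seq for seq in dict.fromkeys(sequences) if seq not in cache]
    for batch in iter_batches(missing, batch_size):
        for seq, vector in zip(batch, encode_batch(batch)):
            cache[seq] = vector
    return [cache[seq] for seq in sequences]

# Stand-in encoder (made up): records batch sizes so the batching is observable
batch_sizes_seen = []
def fake_encode_batch(batch):
    batch_sizes_seen.append(len(batch))
    return [[float(len(seq)), float(seq.count("G"))] for seq in batch]

embedding_cache = {}
toy_sequences = ["AUGG", "CGGA", "AUGG", "UUU", "GG"]
vectors = encode_with_cache(toy_sequences, fake_encode_batch, embedding_cache, batch_size=2)
vectors_again = encode_with_cache(toy_sequences, fake_encode_batch, embedding_cache, batch_size=2)
print(batch_sizes_seen, vectors[0])
```

The second call does no encoding at all, because every sequence is already cached.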

---

### 💡 Key Takeaways

1. **Genomic foundation models** have learned biologically meaningful representations from massive pre-training
2. **Embeddings capture functional relationships** without task-specific training
3. **All OmniModel types** support embedding extraction via `EmbeddingMixin`
4. **Mean aggregation** is recommended for most applications
5. **Reproducibility requires** setting random seeds and version control

---

### 🌟 What's Next?

You're now equipped to apply genomic embeddings to your research! Consider:

1. **Applying to your own data**: Replace our example sequences with your sequences of interest
2. **Exploring other models**: Try larger models like `OmniGenome-186M` or `OmniGenome-400M`
3. **Fine-tuning**: Adapt the model to your specific task with `AutoTrain`
4. **Contributing**: Share your findings and improvements with the community

---

### 📖 Additional Resources

- **Documentation**: [OmniGenBench Docs](https://omnigenbench.readthedocs.io/)
- **GitHub**: [yangheng95/OmniGenBench](https://github.com/yangheng95/OmniGenBench)
- **Paper**: Yang et al. (2025) "OmniGenome: Foundation Models for Genomic Understanding"
- **Model Hub**: [HuggingFace yangheng](https://huggingface.co/yangheng)

---

**Thank you for completing this tutorial! Happy researching! 🧬✨**
