# 🧬 Automated Genomic Foundation Model Benchmarking with LoRA Fine-Tuning

**Tutorial Version:** 1.0 | **Last Updated:** November 2025 | **Estimated Time:** 20-40 minutes

## 📚 Learning Objectives

By completing this tutorial, you will learn how to:

1. **Configure and run automated benchmarking** for genomic foundation models (GFMs) using OmniGenBench
2. **Apply parameter-efficient fine-tuning** with LoRA (Low-Rank Adaptation) to reduce computational costs
3. **Evaluate model performance** across standardized benchmark suites (RGB, GUE, GB, PGB, BEACON)
4. **Interpret evaluation results** and compare multiple models systematically

## 🎯 Prerequisites

**Required Knowledge:**
- Basic Python programming
- Familiarity with deep learning concepts (fine-tuning, evaluation metrics)
- Understanding of genomic sequences (DNA/RNA)

**Required Environment:**
- Python 3.8+
- CUDA-capable GPU (recommended: 8GB+ VRAM)
- ~10GB free disk space for models and datasets
- Internet connection for downloading models and benchmarks

**Installed Packages** (will be installed in Step 1):
- `omnigenbench` (core framework)
- `peft` (Parameter-Efficient Fine-Tuning)
- `bitsandbytes` (quantization support)
- `transformers`, `torch`, `accelerate` (deep learning infrastructure)

---

## 🔬 Background: Why Benchmark with LoRA?

### The Challenge

**Genomic Foundation Models (GFMs)** are pre-trained on massive genomic corpora and need task-specific adaptation. Traditional fine-tuning:
- **Makes 100% of model parameters trainable** (millions to billions), so gradients and optimizer state must be stored for all of them
- **Demands high memory** (16GB+ VRAM for 186M-parameter models)
- **Risks catastrophic forgetting** of pre-trained knowledge

### The Solution: LoRA (Low-Rank Adaptation)

LoRA enables efficient fine-tuning by:
1. **Freezing pre-trained weights** (no updates to base model)
2. **Injecting trainable low-rank matrices** into selected layers (typically attention projections)
3. **Reducing trainable parameters** to <1% of the original model (e.g., 52M → 0.5M parameters)
4. **Maintaining model quality** while using 3-4x less memory

**Mathematical Foundation:**
```
W = W₀ + ΔW = W₀ + BA
```
Where:
- `W₀`: Frozen pre-trained weights (d × k)
- `B`: Trainable low-rank matrix (d × r)
- `A`: Trainable low-rank matrix (r × k)
- `r`: Rank (typically 8-32, much smaller than `d` and `k`)
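
To make the savings concrete, the sketch below counts the trainable parameters LoRA adds to a single weight matrix and checks that the product `BA` has the same shape as the frozen weights. The 768 × 768 layer size is an assumed, illustrative value, not the dimension of any particular model. Initializing `B` to zero follows the original LoRA paper, so the update starts at exactly zero.

```python
import random

def lora_added_params(d, k, r):
    """Parameters in B (d x r) plus A (r x k) for one adapted weight matrix."""
    return r * (d + k)

def matmul(X, Y):
    """Plain-Python matrix product, adequate for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# Assumed, illustrative layer size (not taken from a specific model).
d, k, r = 768, 768, 8
full = d * k
added = lora_added_params(d, k, r)
print(f"Full matrix: {full:,} params; LoRA adapter: {added:,} params "
      f"({100 * added / full:.2f}% of the layer)")

# Toy check of the update itself: B starts at zero and A is random,
# so delta_W = B @ A is all zeros before any training step.
random.seed(0)
d_small, k_small, r_small = 4, 3, 2
B = [[0.0] * r_small for _ in range(d_small)]
A = [[random.gauss(0, 1) for _ in range(k_small)] for _ in range(r_small)]
delta_W = matmul(B, A)
print(f"delta_W shape: {len(delta_W)} x {len(delta_W[0])}")
```

The adapter size grows linearly with `r`, so doubling the rank doubles the trainable parameters for that layer.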

---

## 📊 Benchmark Suites Overview

OmniGenBench provides 5 comprehensive benchmark suites:

| Suite | Full Name | Focus | Tasks | Genome Type | Example Tasks |
|-------|-----------|-------|-------|-------------|---------------|
| **RGB** | RNA Genome Benchmark | RNA biology | 12 | RNA | Secondary structure, m6A modification |
| **BEACON** | Benchmark for Comprehensive RNA Tasks and Language Models | Multi-domain RNA | 13 | RNA | Translation efficiency, mRNA degradation |
| **GUE** | Genome Understanding Evaluation | DNA understanding | 36 | DNA | Promoter recognition, enhancer prediction |
| **GB** | Genomics Benchmarks | Classic DNA tasks | 9 | DNA | Splice site detection, TF binding |
| **PGB** | Plant Genome Benchmark | Plant genomics | 7+ | DNA (Plant) | Plant regulatory elements |

**Task Types Covered:**
- **Sequence Classification**: Binary/multi-class labels (e.g., "Is this a promoter?")
- **Token Classification**: Per-nucleotide predictions (e.g., splice site positions)
- **Regression**: Continuous values (e.g., expression levels)
- **Multi-label**: Multiple simultaneous labels (e.g., binding sites for 919 TFs)

---

## 🛠️ Tutorial Workflow

```mermaid
graph TD
    A[1. Environment Setup<br/>Install dependencies & verify GPU] --> B[2. Configuration<br/>Set benchmark, model, LoRA parameters]
    B --> C[3. Model Loading<br/>Handle model-specific requirements]
    C --> D[4. AutoBench Execution<br/>Automated training & evaluation]
    D --> E[5. Results Analysis<br/>Interpret metrics & visualizations]
    E --> F[6. Optional: Multi-Model Comparison<br/>Batch evaluation of multiple GFMs]
    
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#fff3e0
    style D fill:#e8f5e9
    style E fill:#fce4ec
    style F fill:#f1f8e9
```

**Execution Strategy:**
- **Quick Test**: ~5-10 minutes (1 epoch, up to 1000 examples per task)
- **Full Evaluation**: 30-60 minutes per model (50 epochs, full datasets, all tasks)
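
The two modes differ only in a few settings, so they can be kept as named presets. The sketch below is one way to do that. The `PRESETS` dictionary and the `select_preset` helper are hypothetical names introduced here, and the values simply restate the epoch and example limits of the two modes described above.

```python
# Hypothetical presets; the values mirror the two execution modes above.
PRESETS = {
    "quick": {"epochs": 1, "max_examples": 1000},
    "full": {"epochs": 50, "max_examples": None},  # None = use the full dataset
}

def select_preset(name):
    """Return a copy of the named preset, failing loudly on typos."""
    if name not in PRESETS:
        raise ValueError(f"Unknown preset {name!r}; choose from {sorted(PRESETS)}")
    return dict(PRESETS[name])

settings = select_preset("quick")
EPOCHS = settings["epochs"]
MAX_EXAMPLES = settings["max_examples"]
print(f"EPOCHS={EPOCHS}, MAX_EXAMPLES={MAX_EXAMPLES}")
```

Switching to the full evaluation is then a one-word change, which makes it harder to mix quick-test and full-run settings by accident.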

Let's begin!

---

## Step 1: Environment Setup and Verification

**Purpose:** Install required packages and verify that your environment meets prerequisites.

**What This Step Does:**
- Installs `omnigenbench` and LoRA dependencies (`peft`, `bitsandbytes`)
- Verifies Python version, CUDA availability, and GPU memory
- Confirms successful installation

**Expected Duration:** 2-3 minutes (depending on internet speed)

In [None]:
### 1.1: Install Required Packages
# This cell installs omnigenbench and LoRA dependencies
# Skip if already installed (check by running: pip show omnigenbench peft bitsandbytes)

import sys
print(f"[INFO] Python Version: {sys.version}")
print("[INFO] Installing packages...")

# -U upgrades to the latest versions; -q keeps pip's output quiet.
# %pip installs into the environment of the running notebook kernel.
%pip install -U -q omnigenbench peft bitsandbytes transformers accelerate

print("[SUCCESS] Installation complete!")

### 1.2: Environment Verification

**Critical Check:** Ensure GPU is available and has sufficient memory for LoRA fine-tuning.

In [None]:
import torch
import omnigenbench

print("[INFO] Environment Verification")
print("="*60)

# Check PyTorch and CUDA
print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    
    # Get GPU memory
    total_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)  # Convert to GB
    print(f"Total GPU Memory: {total_memory:.2f} GB")
    
    # Memory recommendation
    if total_memory < 6:
        print("[WARNING] GPU memory < 6GB. Consider using smaller models or reducing batch size.")
    elif total_memory < 12:
        print("[INFO] GPU memory sufficient for models up to ~186M parameters with LoRA.")
    else:
        print("[INFO] GPU memory excellent for large-scale benchmarking.")
else:
    print("[WARNING] No GPU detected. Training will be slow on CPU.")
    print("          For LoRA fine-tuning, GPU is strongly recommended.")

# Check OmniGenBench installation
print(f"\nOmniGenBench Version: {omnigenbench.__version__}")

# Check PEFT installation
try:
    import peft
    print(f"PEFT Version: {peft.__version__}")
except ImportError:
    print("[ERROR] PEFT not installed. Run the installation cell above.")

print("="*60)
print("[SUCCESS] Environment verification complete!")

---

## ⚙️ Step 2: Configuration - The Single Source of Truth

**Purpose:** Define all experimental parameters in one centralized location following the **Single Source of Truth (SSoT)** principle.

**What This Step Does:**
- Configures benchmark selection (RGB, GUE, GB, PGB, or BEACON)
- Selects the genomic foundation model to evaluate
- Sets training hyperparameters (epochs, batch size, early-stopping patience, random seed)
- Defines model-specific LoRA configurations

**Design Principle:** All configuration is declared upfront to enable reproducibility and easy experimentation.

In [None]:
### 2.1: Import Core Libraries

import random
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
from omnigenbench import AutoBench

print("[SUCCESS] Core libraries imported successfully.")

### 2.2: Experimental Configuration (Single Source of Truth)

**Modify the parameters below to customize your benchmarking experiment.**

Key parameters explained:
- **`BENCHMARK`**: Which benchmark suite to evaluate on (RGB, GUE, GB, PGB, BEACON)
- **`GFM_TO_TUNE`**: HuggingFace model identifier or local path
- **`MAX_EXAMPLES`**: Limit training examples per task (for quick testing; `None` = full dataset)
- **`EPOCHS`**: Number of training epochs (RGB/BEACON typically use 50 for full evaluation)
- **`SEED`**: Random seed for reproducibility (or randomize for variance estimation)

In [None]:
# ============================================================================
# EXPERIMENTAL CONFIGURATION - Modify these parameters for your experiment
# ============================================================================

# --- General Training Settings ---
BENCHMARK = "RGB"                    # Options: "RGB", "GUE", "GB", "PGB", "BEACON"
BATCH_SIZE = 8                       # Adjust based on GPU memory (8-16 typical)
PATIENCE = 3                         # Early stopping patience (epochs without improvement)
EPOCHS = 1                           # Quick test: 1, Full evaluation: 50
MAX_EXAMPLES = 1000                  # Quick test: 1000, Full dataset: None
SEED = 42                            # Fixed seed: 42, Random: random.randint(0, 1000)

# --- Model Selection ---
GFM_TO_TUNE = 'yangheng/OmniGenome-52M'   # Model to evaluate

# Available models for testing (uncomment to try others):
AVAILABLE_GFMS = [
    'yangheng/OmniGenome-52M',              # 52M parameters, balanced performance
    # 'yangheng/OmniGenome-186M',           # 186M parameters, higher capacity
    # 'yangheng/OmniGenome-v1.5',           # Latest OmniGenome version
    # 'zhihan1996/DNABERT-2-117M',          # DNA-specific BERT variant
    # 'LongSafari/hyenadna-large-1m-seqlen-hf',  # Long-range DNA model
    # 'InstaDeepAI/nucleotide-transformer-v2-100m-multi-species',  # Multi-species DNA
    # 'kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16',  # Caduceus architecture
    # 'multimolecule/rnafm',                # RNA Foundation Model
    
    # Special models requiring additional setup:
    # 'arcinstitute/evo-1-131k-base',       # Evo models (install `evo` package first)
    # 'multimolecule/SpliceBERT-510nt',     # Splice-specific model
]

# Extract model name for configuration lookup
GFM = GFM_TO_TUNE.split('/')[-1]

# --- LoRA Hyperparameter Configurations ---
# Each model has optimized LoRA settings based on its architecture
# Key parameters:
#   - r: Rank of low-rank matrices (higher = more capacity, more parameters)
#   - lora_alpha: Scaling factor; the update is scaled by lora_alpha / r (alpha is typically 2-4x the rank)
#   - lora_dropout: Regularization to prevent overfitting
#   - target_modules: Which layers to apply LoRA (attention/projection layers)

LORA_CONFIGS = {
    # --- Transformer-based models (BERT/RoBERTa architecture) ---
    "OmniGenome-52M": {
        "r": 8, 
        "lora_alpha": 32, 
        "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"],  # Key/value projections plus every module named "dense"
        "bias": "none"
    },
    "OmniGenome-186M": {
        "r": 8, 
        "lora_alpha": 32, 
        "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"],
        "bias": "none"
    },
    "DNABERT-2-117M": {
        "r": 8, 
        "lora_alpha": 32, 
        "lora_dropout": 0.1,
        "target_modules": ["Wqkv", "dense"],  # DNABERT uses fused QKV projection
        "bias": "none"
    },
    
    # --- State Space Models (Mamba/Caduceus architecture) ---
    "caduceus-ph_seqlen-131k_d_model-256_n_layer-16": {
        "r": 8, 
        "lora_alpha": 32, 
        "lora_dropout": 0.1,
        "target_modules": ["in_proj", "x_proj", "out_proj"],  # SSM-specific projections
        "bias": "none"
    },
    
    # --- RNA-specific models ---
    "rnamsm": {
        "r": 8, 
        "lora_alpha": 32, 
        "lora_dropout": 0.1,
        "target_modules": ["q_proj", "v_proj", "out_proj"],
        "bias": "none"
    },
    "rnafm": {
        "r": 8, 
        "lora_alpha": 32, 
        "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"],
        "bias": "none"
    },
    "rnabert": {
        "r": 8, 
        "lora_alpha": 32, 
        "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"],
        "bias": "none"
    },
    "SpliceBERT-510nt": {
        "r": 8, 
        "lora_alpha": 32, 
        "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"],
        "bias": "none"
    },
    
    # --- Hyena models (long-range convolution) ---
    "hyenadna-large-1m-seqlen-hf": {
        "r": 8, 
        "lora_alpha": 32, 
        "lora_dropout": 0.1,
        "target_modules": ["in_proj", "out_proj"],
        "bias": "none"
    },
    
    # --- Nucleotide Transformer ---
    "nucleotide-transformer-v2-100m-multi-species": {
        "r": 8, 
        "lora_alpha": 32, 
        "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"],
        "bias": "none"
    },
    
    # --- Evo models (StripedHyena architecture) ---
    "evo-1-131k-base": {
        "r": 8, 
        "lora_alpha": 32, 
        "lora_dropout": 0.1,
        "target_modules": [
            "Wqkv", "out_proj",
            "mlp",
            "projections",
            "out_filter_dense"
        ],
        "bias": "none"
    },
    "evo-1.5-8k-base": {
        "r": 8, 
        "lora_alpha": 32, 
        "lora_dropout": 0.1,
        "target_modules": [
            "Wqkv", "out_proj",
            "l1", "l2", "l3",
            "projections",
            "out_filter_dense"
        ],
        "bias": "none"
    },
}

# ============================================================================
# Configuration Validation and Summary
# ============================================================================

print("[INFO] Experimental Configuration Loaded")
print("="*60)
print(f"Benchmark Suite:    {BENCHMARK}")
print(f"Model to Evaluate:  {GFM_TO_TUNE}")
print(f"Training Epochs:    {EPOCHS}")
print(f"Batch Size:         {BATCH_SIZE}")
print(f"Max Examples:       {MAX_EXAMPLES if MAX_EXAMPLES else 'Full Dataset'}")
print(f"Random Seed:        {SEED}")
print("="*60)

# Verify LoRA config exists for selected model
if GFM in LORA_CONFIGS:
    lora_cfg = LORA_CONFIGS[GFM]
    print(f"\nLoRA Configuration for {GFM}:")
    print(f"  Rank (r):          {lora_cfg['r']}")
    print(f"  Alpha:             {lora_cfg['lora_alpha']}")
    print(f"  Dropout:           {lora_cfg['lora_dropout']}")
    print(f"  Target Modules:    {', '.join(lora_cfg['target_modules'])}")
    
    # Rough size of the adapters: each adapted linear layer with weight shape
    # (d_out, d_in) adds r * (d_in + d_out) trainable parameters, so the total
    # depends on how many modules match target_modules in this architecture.
    print("\n[INFO] Estimated trainable parameters: <1% of base model")
else:
    print(f"\n[WARNING] No LoRA config found for '{GFM}'.")
    print(f"          Available configs: {list(LORA_CONFIGS.keys())}")
    print(f"          Using default OmniGenome-52M config as fallback.")
    lora_cfg = LORA_CONFIGS["OmniGenome-52M"]

print("\n[SUCCESS] Configuration validated!")
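
The lookup-with-fallback logic above can also be wrapped in a small function that validates an entry before it is used. This is a sketch rather than part of OmniGenBench: `resolve_lora_config`, `REQUIRED_KEYS`, and the two-entry `configs` dictionary are hypothetical and stand in for the full `LORA_CONFIGS` table.

```python
REQUIRED_KEYS = {"r", "lora_alpha", "lora_dropout", "target_modules", "bias"}

def resolve_lora_config(model_id, configs, fallback="OmniGenome-52M"):
    """Look up a LoRA config by the last path component of a model identifier.

    Returns (config, used_fallback). Raises ValueError if the chosen entry
    is missing a key or has an invalid rank.
    """
    name = model_id.rstrip("/").split("/")[-1]
    used_fallback = name not in configs
    cfg = configs[fallback] if used_fallback else configs[name]
    missing = REQUIRED_KEYS - set(cfg)
    if missing:
        raise ValueError(f"LoRA config for {name!r} is missing keys: {sorted(missing)}")
    if not isinstance(cfg["r"], int) or cfg["r"] <= 0:
        raise ValueError("LoRA rank 'r' must be a positive integer")
    return cfg, used_fallback

# Stand-in for the full LORA_CONFIGS table defined above.
configs = {
    "OmniGenome-52M": {"r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
                       "target_modules": ["key", "value", "dense"], "bias": "none"},
    "DNABERT-2-117M": {"r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
                       "target_modules": ["Wqkv", "dense"], "bias": "none"},
}

cfg, used_fallback = resolve_lora_config("zhihan1996/DNABERT-2-117M", configs)
print(cfg["target_modules"], used_fallback)
```

Returning the `used_fallback` flag lets the caller decide whether a fallback is acceptable. The fallback's `target_modules` may not exist in a model with a different architecture, so a silent fallback can leave nothing adapted.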

---

## 🔧 Step 3: Model Loading and Preparation

**Purpose:** Load the selected genomic foundation model and its tokenizer, handling architecture-specific requirements.

**What This Step Does:**
- Defines a flexible loading function for different model architectures
- Handles special cases (multimolecule RNA models, Evo models)
- Validates model loading before benchmarking

**Architecture-Specific Handling:**
- **Standard HuggingFace models**: Direct loading via model name
- **Multimolecule models**: Custom tokenizer and base model extraction
- **Evo models**: Special handling for StripedHyena architecture and pad token configuration

In [None]:
### 3.1: Define Model Loading Function

def load_gfm_and_tokenizer(gfm_name):
    """
    Loads a genomic foundation model and its tokenizer with architecture-specific handling.
    
    This function abstracts away the complexity of loading different model architectures,
    providing a unified interface for AutoBench.
    
    Args:
        gfm_name (str): HuggingFace model identifier (e.g., 'yangheng/OmniGenome-52M')
    
    Returns:
        tuple: (model, tokenizer)
            - model: Model instance or model name (for standard HF models)
            - tokenizer: Tokenizer instance or None (AutoBench will auto-load)
    
    Supported Architectures:
        - Standard Transformers (BERT, RoBERTa): Pass model name, AutoBench handles loading
        - Multimolecule models: Load with custom RnaTokenizer and extract base_model
        - Evo models: Load with custom config, patch unembed layer, set pad tokens
    """
    print(f"\n[INFO] Loading model and tokenizer: {gfm_name}")
    print("-" * 60)
    
    # --- Special Case 1: Multimolecule RNA Models ---
    if 'multimolecule' in gfm_name:
        try:
            from multimolecule import RnaTokenizer, AutoModelForTokenPrediction
            tokenizer = RnaTokenizer.from_pretrained(gfm_name)
            model = AutoModelForTokenPrediction.from_pretrained(
                gfm_name, 
                trust_remote_code=True
            ).base_model
            print(f"[SUCCESS] Loaded multimolecule model with custom RnaTokenizer")
            print(f"          Tokenizer vocab size: {len(tokenizer)}")
            return model, tokenizer
        except ImportError:
            print("[ERROR] 'multimolecule' package not installed.")
            print("        Install with: pip install multimolecule")
            raise
    
    # --- Special Case 2: Evo Models (StripedHyena Architecture) ---
    elif 'evo-1' in gfm_name or 'evo2' in gfm_name:
        try:
            # Load config and model with trust_remote_code
            config = AutoConfig.from_pretrained(gfm_name, trust_remote_code=True)
            model = AutoModelForCausalLM.from_pretrained(
                gfm_name, 
                config=config, 
                trust_remote_code=True
            ).backbone  # Extract backbone for fine-tuning
            
            tokenizer = AutoTokenizer.from_pretrained(gfm_name, trust_remote_code=True)
            
            # Fix pad token configuration (Evo-specific requirement)
            tokenizer.pad_token_id = tokenizer.pad_token_type_id
            model.config = config
            model.config.pad_token_id = tokenizer.pad_token_id
            
            # Patch unembed layer (prevent output projection errors)
            model.unembed.unembed = lambda x: x
            
            print(f"[SUCCESS] Loaded Evo model with custom patching")
            print(f"          Model layers: {config.num_hidden_layers}")
            print(f"          Pad token ID: {tokenizer.pad_token_id}")
            return model, tokenizer
        except Exception as e:
            print(f"[ERROR] Failed to load Evo model: {e}")
            print("        Ensure 'evo' package is installed (if required)")
            print("        Refer to: https://github.com/evo-design/evo")
            raise
    
    # --- Default Case: Standard HuggingFace Models ---
    else:
        # Return model name; AutoBench will handle loading with proper task head
        print(f"[INFO] Using standard HuggingFace loading")
        print(f"       AutoBench will auto-load model and tokenizer")
        return gfm_name, None

# Test loading function with selected model
try:
    model, tokenizer = load_gfm_and_tokenizer(GFM_TO_TUNE)
    print("\n" + "="*60)
    print("[SUCCESS] Model loading function validated!")
    print("="*60)
except Exception as e:
    print(f"\n[ERROR] Model loading failed: {e}")
    print("        Check model name and network connection.")

---

## 🎓 Step 4: Running Automated Benchmarking with LoRA

**Purpose:** Execute the complete evaluation pipeline: data loading, LoRA fine-tuning, evaluation, and result saving.

**What This Step Does:**
1. **Initialize AutoBench**: Configure benchmark suite, model, and trainer
2. **Apply LoRA Configuration**: Inject trainable low-rank adapters into the model
3. **Execute Benchmark**: Run automated training and evaluation across all tasks
4. **Save Results**: Store metrics, checkpoints, and visualizations

**Expected Outputs:**
- `autobench_logs/`: Training logs (loss curves, learning rates)
- `autobench_evaluations/`: Evaluation results (.mv files for visualization)
- Console output: Per-task metrics (accuracy, MCC, F1, etc.)

**Duration:** 
- Quick test (MAX_EXAMPLES=1000, EPOCHS=1): ~5-10 minutes
- Full evaluation (MAX_EXAMPLES=None, EPOCHS=50): ~30-60 minutes per model

In [None]:
### 4.1: Initialize AutoBench and Run Evaluation

import time

print("[INFO] Starting Automated Benchmarking with LoRA")
print("="*60)
start_time = time.time()

# --- Step 4.1: Load Model and Tokenizer ---
print(f"\n[Step 1/3] Loading model: {GFM_TO_TUNE}")
model, tokenizer = load_gfm_and_tokenizer(GFM_TO_TUNE)

# --- Step 4.2: Initialize AutoBench ---
print(f"\n[Step 2/3] Initializing AutoBench for benchmark: {BENCHMARK}")
bench = AutoBench(
    benchmark=BENCHMARK,              # Benchmark suite (RGB, GUE, GB, PGB, BEACON)
    config_or_model=model,            # Model instance or HF model name
    tokenizer=tokenizer,              # Tokenizer (None for auto-loading)
    overwrite=True,                   # Overwrite existing results
    trainer='native',                 # Training backend: 'native', 'accelerate', 'hf_trainer'
    autocast='fp16',                  # Mixed precision: 'fp16', 'bf16', 'fp32'
    device='cuda',                    # Device: 'cuda' or 'cpu'
)

print(f"[INFO] AutoBench initialized successfully")
print(f"       Benchmark: {BENCHMARK}")
print(f"       Trainer: native (single GPU)")
print(f"       Precision: fp16 (mixed precision)")

# --- Step 4.3: Get LoRA Configuration ---
lora_config = LORA_CONFIGS.get(GFM, LORA_CONFIGS["OmniGenome-52M"])
print(f"\n[INFO] Applying LoRA configuration:")
print(f"       Rank: {lora_config['r']}, Alpha: {lora_config['lora_alpha']}")
print(f"       Target modules: {lora_config['target_modules']}")

# --- Step 4.4: Run Benchmark with LoRA Fine-Tuning ---
print(f"\n[Step 3/3] Running benchmark evaluation...")
print(f"           This may take several minutes...")
print("-"*60)

try:
    bench.run(
        batch_size=BATCH_SIZE,
        gradient_accumulation_steps=1,
        patience=PATIENCE,
        max_examples=MAX_EXAMPLES,
        seeds=SEED,                   # Single seed or list: [0, 1, 2] for multi-seed
        epochs=EPOCHS,
        lora_config=lora_config,      # Enable LoRA fine-tuning
    )
    
    elapsed_time = time.time() - start_time
    print("\n" + "="*60)
    print(f"[SUCCESS] Benchmarking complete!")
    print(f"          Total time: {elapsed_time/60:.2f} minutes")
    print("="*60)
    
    print("\n[INFO] Results saved to:")
    print(f"       Evaluations: ./autobench_evaluations/")
    print(f"       Logs: ./autobench_logs/")
    print(f"\nNext steps:")
    print(f"  1. Check evaluation metrics in the output above")
    print(f"  2. Visualize results using MetricVisualizer")
    print(f"  3. Compare with other models by changing GFM_TO_TUNE and re-running")

except Exception as e:
    print(f"\n[ERROR] Benchmarking failed: {e}")
    print(f"        Check error messages above for details")
    raise

---

## 📊 Step 5: Results Analysis and Interpretation

**Purpose:** Understand the evaluation results and interpret model performance.

**What This Step Does:**
- Explains key evaluation metrics used in genomic benchmarks
- Shows how to access and visualize results
- Provides guidance on interpreting performance

**Key Metrics by Task Type:**

| Metric | Task Type | Range | Interpretation |
|--------|-----------|-------|----------------|
| **MCC** (Matthews Correlation) | Classification | [-1, 1] | 1 = perfect, 0 = random, -1 = inverse |
| **Accuracy** | Classification | [0, 1] | Proportion of correct predictions |
| **F1 Score** | Classification | [0, 1] | Harmonic mean of precision & recall |
| **AUPRC** | Classification | [0, 1] | Area under precision-recall curve |
| **MSE** | Regression | [0, ∞) | Lower is better (squared error) |
| **Spearman ρ** | Regression | [-1, 1] | Rank correlation (1 = perfect) |
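
To see why MCC is preferred over accuracy on imbalanced data, the sketch below computes accuracy, F1, and MCC directly from binary confusion-matrix counts. The counts are made-up illustrative values (10 positives, 90 negatives), not benchmark results, and `binary_metrics` is a hypothetical helper rather than the function OmniGenBench uses internally.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, F1, and MCC from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}

# Illustrative counts: 10 positives and 90 negatives (made-up, imbalanced data).
always_negative = binary_metrics(tp=0, fp=0, fn=10, tn=90)
reasonable = binary_metrics(tp=8, fp=5, fn=2, tn=85)

for name, m in [("always negative", always_negative), ("reasonable", reasonable)]:
    print(f"{name:>16}: acc={m['accuracy']:.3f}  f1={m['f1']:.3f}  mcc={m['mcc']:.3f}")
```

A classifier that always predicts the majority class reaches 90% accuracy here while its MCC and F1 are both 0, which is the failure mode MCC is meant to expose.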

**Accessing Results:**

```python
from metric_visualizer import MetricVisualizer

# Load saved results
mv = MetricVisualizer.load("./autobench_evaluations/<your_results>.mv")

# View summary
mv.summary(round=4)

# Get specific metrics
metrics = mv.get_metrics()
```

In [None]:
### 5.1: View Evaluation Summary

# Access the AutoBench MetricVisualizer for detailed results
print("[INFO] Evaluation Results Summary")
print("="*60)

# Display summary with 4 decimal places
bench.mv.summary(round=4)

print("\n[INFO] Interpreting Results:")
print("  - MCC (Matthews Correlation): Balanced metric for imbalanced datasets")
print("    Range: [-1, 1], where 1 = perfect, 0 = random, -1 = inverse")
print("  - Accuracy: Simple proportion of correct predictions")
print("  - F1 Score: Balance between precision and recall")
print("  - For multi-seed runs, results show mean ± std deviation")
print("\n[INFO] Results are automatically saved to:")
print(f"       {bench.mv_path}")
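
The optional multi-model comparison in the workflow diagram amounts to repeating Steps 3-4 for each model under identical settings. The sketch below shows only the control flow. `benchmark_models` and `run_one` are hypothetical names, and `fake_run` is a stand-in so the example runs without a GPU. In practice, `run_one` would wrap the `load_gfm_and_tokenizer`, `AutoBench(...)`, and `bench.run(...)` calls from Step 4.

```python
def benchmark_models(model_ids, run_one):
    """Run `run_one(model_id)` for every model and collect outcomes.

    A failure in one model is recorded and does not stop the remaining runs.
    """
    outcomes = {}
    for model_id in model_ids:
        try:
            outcomes[model_id] = {"status": "ok", "result": run_one(model_id)}
        except Exception as exc:  # keep going; report the failure afterwards
            outcomes[model_id] = {"status": "failed", "error": str(exc)}
    return outcomes

def fake_run(model_id):
    """Stand-in for the real Step 4 pipeline; returns no real metrics."""
    if "missing" in model_id:
        raise OSError(f"Can't load config for {model_id}")
    return f"finished {model_id}"

outcomes = benchmark_models(
    ["yangheng/OmniGenome-52M", "example/missing-model"],  # second ID is made up
    fake_run,
)
for model_id, outcome in outcomes.items():
    print(model_id, "->", outcome["status"])
```

Recording failures instead of raising keeps a long batch from being lost to one bad model name. The recorded errors should still be reviewed before comparing results.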

---

## 🎯 Summary and Next Steps

### What You've Learned

Congratulations! You've completed the OmniGenBench LoRA benchmarking tutorial. You now know how to:

✅ **Configure automated benchmarking experiments** with reproducible parameters  
✅ **Apply LoRA for parameter-efficient fine-tuning** (reducing trainable parameters by 99%)  
✅ **Evaluate genomic foundation models** across standardized benchmark suites  
✅ **Interpret evaluation metrics** (MCC, F1, accuracy, Spearman correlation)  
✅ **Compare multiple models** systematically using consistent protocols  

### Key Takeaways

1. **LoRA enables efficient fine-tuning**: Train large GFMs with 3-4x less memory
2. **AutoBench automates evaluation**: Consistent protocols across benchmarks eliminate manual errors
3. **Multiple seeds improve reliability**: Running with seeds=[0, 1, 2] quantifies variance
4. **Architecture matters**: Different models (Transformer, Mamba, Hyena) require specific LoRA configs

---

### Next Steps and Advanced Topics

#### 📚 Explore Other Tutorials

- **[Attention Score Extraction](../attention_score_extraction/)**: Visualize model attention patterns
- **[Genomic Embeddings](../genomic_embeddings/)**: Extract and analyze sequence representations
- **[RNA Sequence Design](../rna_sequence_design/)**: Generate RNA sequences with desired structures
- **[Variant Effect Prediction](../variant_effect_prediction/)**: Predict mutation impacts

#### 🔬 Extend This Tutorial

1. **Run Full Evaluations**: Set `MAX_EXAMPLES=None`, `EPOCHS=50`, `seeds=[0,1,2]`
2. **Try Different Benchmarks**: Change `BENCHMARK` to "GUE", "GB", "PGB", or "BEACON"
3. **Experiment with LoRA Parameters**: 
   - Increase `r` (rank) to 16 or 32 for more capacity
   - Adjust `lora_alpha` to control scaling (typically 2-4x rank)
   - Try different `target_modules` (e.g., add "query" projections)
4. **Multi-Seed Evaluation**: Use `seeds=[0, 1, 2, 3, 4]` for robust statistics
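
For multi-seed runs, the per-seed scores are usually reported as mean and standard deviation. The sketch below shows that aggregation with the standard library. The scores are placeholders, not real benchmark results, and using the sample standard deviation is an assumption, since the results summary may compute it differently.

```python
import statistics

def summarize(scores):
    """Mean and sample standard deviation of per-seed scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# Placeholder MCC values for seeds 0-4; these are NOT real benchmark results.
per_seed_mcc = [0.60, 0.62, 0.58, 0.61, 0.59]
mean, std = summarize(per_seed_mcc)
print(f"MCC over {len(per_seed_mcc)} seeds: {mean:.3f} +/- {std:.3f}")
```

With only three to five seeds the standard deviation is itself a noisy estimate, so small differences between models should be read with caution.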

#### 📊 Analyze Results Programmatically

```python
from metric_visualizer import MetricVisualizer
import pandas as pd

# Load results
mv = MetricVisualizer.load("./autobench_evaluations/<your_file>.mv")

# Get all metrics as DataFrame
metrics_df = mv.to_dataframe()

# Compare specific tasks
task_metrics = metrics_df[metrics_df['task'] == 'task_name']
print(task_metrics[['accuracy', 'mcc', 'f1']])

# Export to CSV for external analysis
metrics_df.to_csv("benchmark_results.csv", index=False)
```

#### 🤝 Contribute to OmniGenBench

- **Add new benchmarks**: Create task configs following RGB/GUE structure
- **Integrate new models**: Add LoRA configs for emerging GFM architectures
- **Report issues**: [GitHub Issues](https://github.com/yangheng95/OmniGenBench/issues)
- **Share results**: Submit PRs with benchmark results for new models

---

### Reproducibility Checklist

When sharing results or writing papers, document:

- [ ] OmniGenBench version (`omnigenbench.__version__`)
- [ ] Model name and version (e.g., `yangheng/OmniGenome-52M`)
- [ ] Benchmark suite and task list
- [ ] LoRA configuration (r, alpha, target_modules)
- [ ] Training hyperparameters (epochs, batch_size, learning_rate)
- [ ] Random seeds used (for variance estimation)
- [ ] Hardware specifications (GPU model, VRAM)
- [ ] Evaluation metrics with standard deviations
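
Several checklist items can be captured automatically. The sketch below gathers the Python version, platform, and installed package versions using only the standard library. `environment_report` and `package_version` are hypothetical helper names, and packages that are not installed are reported as `None`.

```python
import json
import platform
import sys
from importlib import metadata

def package_version(name):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def environment_report(packages):
    """Collect interpreter, platform, and package versions in one dictionary."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
    }

report = environment_report(
    ["omnigenbench", "peft", "bitsandbytes", "transformers", "torch", "accelerate"]
)
print(json.dumps(report, indent=2))
```

GPU model and VRAM are not included here. They can be added from the `torch.cuda` calls used in Step 1, and the JSON output can be saved next to the evaluation results.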

---

### References and Resources

**OmniGenBench Documentation:**
- 📖 [Getting Started Guide](../../docs/GETTING_STARTED.md)
- 🔧 [CLI Reference](../../docs/cli.rst)
- 🏗️ [Architecture Overview](../../framework_architecture_v2.md)
- 🐛 [Troubleshooting Guide](../../docs/troubleshooting.rst)

**Key Papers:**
- **LoRA**: Hu et al. (2021) "LoRA: Low-Rank Adaptation of Large Language Models"
- **OmniGenBench**: [Cite the OmniGenBench paper when published]
- **RGB Benchmark**: [Cite the RGB benchmark paper]
- **GUE Benchmark**: [Cite the GUE benchmark paper]

**Community:**
- 💬 GitHub Discussions: https://github.com/yangheng95/OmniGenBench/discussions
- 🐛 Bug Reports: https://github.com/yangheng95/OmniGenBench/issues
- 📧 Contact: YANG, HENG <hy345@exeter.ac.uk>

---

## 🔧 Troubleshooting

### Common Issues and Solutions

#### Issue 1: CUDA Out of Memory

**Symptoms:** `RuntimeError: CUDA out of memory`

**Solutions:**
```python
# Reduce batch size
BATCH_SIZE = 4  # or even 2

# Enable gradient accumulation
bench.run(
    batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4*4 = 16
    patience=PATIENCE,
    max_examples=MAX_EXAMPLES,
    seeds=SEED,
    epochs=EPOCHS,
    lora_config=lora_config,
)

# Use smaller model
GFM_TO_TUNE = 'yangheng/OmniGenome-52M'  # Instead of 186M
```
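
Effective batch size is the per-step batch size multiplied by the number of accumulation steps, so a smaller `batch_size` can be offset by a larger `gradient_accumulation_steps`. The helper below is a hypothetical convenience, not part of OmniGenBench, that picks the step count for a target effective batch size.

```python
def accumulation_steps(target_batch, per_step_batch):
    """Smallest number of accumulation steps reaching at least `target_batch`."""
    if target_batch <= 0 or per_step_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return -(-target_batch // per_step_batch)  # ceiling division

for per_step in (16, 8, 4, 2):
    steps = accumulation_steps(16, per_step)
    print(f"batch_size={per_step:>2}, gradient_accumulation_steps={steps} "
          f"-> effective batch {per_step * steps}")
```

Accumulation lowers peak memory at the cost of more forward and backward passes per optimizer update, so runs take longer.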

#### Issue 2: Model Loading Fails

**Symptoms:** `OSError: Can't load config for...` or `ImportError`

**Solutions:**
```python
# Check network connection (-c 4 sends four packets; on Windows use -n 4)
!ping -c 4 huggingface.co

# Verify model exists on HuggingFace Hub
# Visit: https://huggingface.co/[model_name]

# For multimolecule models, install package
!pip install multimolecule -U

# For Evo models, check requirements
# https://github.com/evo-design/evo
```

#### Issue 3: LoRA Config Not Found

**Symptoms:** `[WARNING] No LoRA config found for '<model_name>'`

**Solution:** Add custom config to LORA_CONFIGS:
```python
LORA_CONFIGS["your-model-name"] = {
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "target_modules": ["key", "value", "dense"],  # Inspect model architecture
    "bias": "none"
}
```

To find target modules, inspect model:
```python
from transformers import AutoModel
model = AutoModel.from_pretrained("model-name")
print(model)  # Look for attention/projection layers
```
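
When `target_modules` is a list of strings, PEFT selects modules whose name equals an entry or ends with `.` followed by that entry. The pure-Python sketch below imitates that suffix rule as we understand it, without importing PEFT. The module names are made up in the style of a BERT-like encoder, so real names must be read from the `print(model)` output, and `matches_target` is a hypothetical helper.

```python
def matches_target(module_name, target_modules):
    """True if the name equals a target or ends with '.<target>'."""
    return any(
        module_name == t or module_name.endswith("." + t) for t in target_modules
    )

# Made-up module names in the style of a BERT-like encoder layer.
module_names = [
    "encoder.layer.0.attention.self.query",
    "encoder.layer.0.attention.self.key",
    "encoder.layer.0.attention.self.value",
    "encoder.layer.0.attention.output.dense",
    "encoder.layer.0.intermediate.dense",
    "encoder.layer.0.output.dense",
    "encoder.layer.0.output.LayerNorm",
]

targets = ["key", "value", "dense"]
selected = [n for n in module_names if matches_target(n, targets)]
for name in selected:
    print(name)
```

In this toy listing, `"dense"` selects the feed-forward layers as well as the attention output projection. A broad name therefore adapts more layers, and more parameters, than its label suggests.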

#### Issue 4: Slow Training on CPU

**Symptoms:** Training takes hours, no GPU utilization

**Solutions:**
```python
# Verify GPU availability
import torch
print(torch.cuda.is_available())  # Should be True

# If False, check CUDA installation
!nvidia-smi

# Force CUDA device (other arguments as in Step 4)
bench = AutoBench(
    benchmark=BENCHMARK,
    config_or_model=model,
    tokenizer=tokenizer,
    device='cuda:0',  # Explicitly specify GPU
)
```

#### Issue 5: Benchmark Data Not Found

**Symptoms:** `FileNotFoundError: Benchmark data not available`

**Solution:**
```python
# AutoBench auto-downloads from HuggingFace Hub
# Ensure internet connection and HuggingFace credentials

# Manual download (if needed)
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="OmniGenBench/RGB",  # Example
    repo_type="dataset",
    local_dir="./benchmarks/RGB"
)
```

#### Issue 6: Windows Path Errors

**Symptoms:** `OSError: [Errno 22] Invalid argument` on Windows

**Solution:**
```python
# Use forward slashes or Path objects
from pathlib import Path
output_dir = Path("./autobench_evaluations")

# Avoid Windows-specific paths like:
# output_dir = ".\\autobench_evaluations"  # DON'T USE
```

---

### Getting Help

If you encounter issues not covered here:

1. **Check error messages carefully**: Most errors include helpful diagnostics
2. **Search GitHub Issues**: https://github.com/yangheng95/OmniGenBench/issues
3. **Review documentation**: Especially [troubleshooting.rst](../../docs/troubleshooting.rst)
4. **Ask in Discussions**: https://github.com/yangheng95/OmniGenBench/discussions
5. **Report bugs**: Include full error traceback and environment details

**When reporting issues, include:**
- OmniGenBench version
- Python version
- PyTorch/CUDA versions
- Full error traceback
- Minimal code to reproduce

---

**End of Tutorial** | **Thank you for using OmniGenBench!** 🧬