# 🧬 VEP Tutorial 1/4: From Genetic Variants to Predictive Models

Welcome to the first tutorial in our four-part Variant Effect Prediction (VEP) series. This guide focuses on understanding genetic variants and preparing data for computational analysis.

> 📚 **Prerequisites**: We recommend working through the **[Fundamental Concepts Tutorial](../../00_fundamental_concepts.ipynb)** first. It covers machine learning task classification and foundation model principles, including how models such as PlantRNA-FM work.

Before diving into code, let's understand what genetic variants are and why predicting their effects is crucial for genomic medicine.

## 1. Understanding Genetic Variants: The Language of Diversity 🧬

### 1.1 What Are Genetic Variants?

**Genetic variants** are differences in DNA sequences between individuals. They range from single-base changes to large rearrangements:
- **Single Nucleotide Variants (SNVs)**: A single base change (e.g., A→G)
- **Insertions/Deletions (InDels)**: Added or removed DNA segments
- **Structural Variants**: Large-scale chromosomal rearrangements

```mermaid
graph LR
    A[Reference Genome<br/>...ATCG...] --> B[SNV<br/>...AGCG...]
    A --> C[Insertion<br/>...ATCCG...]
    A --> D[Deletion<br/>...A_CG...]
    style A fill:#e1f5ff
    style B fill:#ffe1e1
    style C fill:#e1ffe1
    style D fill:#fff5e1
```
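
The three small-variant cases in the diagram can be told apart from the reference and alternative alleles alone. The sketch below is a minimal illustration, assuming VCF-style allele strings (both non-empty); `classify_variant` is a hypothetical helper, not part of OmniGenBench, and it classifies purely by allele length, so complex indels and structural variants are not handled.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a small variant from VCF-style REF/ALT allele strings."""
    if not ref or not alt:
        raise ValueError("REF and ALT must be non-empty (VCF-style alleles)")
    if ref == alt:
        return "no_change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    # Same length, longer than one base: multi-nucleotide substitution
    return "MNV"


# Alleles matching the diagram: T->G, C->CC, AT->A, plus one MNV
examples = [("T", "G"), ("C", "CC"), ("AT", "A"), ("AT", "GC")]
labels = [classify_variant(ref, alt) for ref, alt in examples]
print(labels)
```

A length-based rule like this is only a first-pass filter; real pipelines usually normalize alleles (left-align and trim) before classifying.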

### 1.2 Why Variant Effect Prediction Matters

Understanding variant effects is critical for:
- 🏥 **Clinical Diagnostics**: Identifying disease-causing mutations
- 💊 **Personalized Medicine**: Tailoring treatments based on genetic profiles
- 🧪 **Drug Discovery**: Understanding protein-drug interactions
- 🌾 **Agriculture**: Breeding crops with improved traits
- 🔬 **Evolutionary Biology**: Understanding species adaptation

However, experimentally validating millions of variants is **impractical and expensive**. This is where computational prediction becomes essential.

### 1.3 The VEP Challenge: Classification vs. Interpretation

VEP differs from most other genomic prediction tasks in three ways:
1. **Zero-shot learning**: Labeled training data is often unavailable for the specific variants of interest
2. **Embedding-based approach**: We compare sequence embeddings rather than training a classifier
3. **Reference vs. alternative**: We analyze the *difference* between the wild-type and mutant sequences, not a single sequence in isolation


## 2. Framing VEP as a Machine Learning Task 🎯

### 2.1 VEP in the ML Task Taxonomy

| Aspect | Description |
| :--- | :--- |
| **Primary Task** | Sequence Comparison & Scoring |
| **Input Format** | Paired sequences (reference + alternative) |
| **Output Format** | Effect score (continuous) or classification (binary/multi-class) |
| **Learning Paradigm** | Zero-shot or few-shot learning |
| **Model Usage** | Feature extraction (embeddings) |

### 2.2 Two Approaches to VEP

**Approach 1: Supervised Classification (Traditional)**
```
Sequence → Model → Pathogenic/Benign Label
```
*Requires*: Large labeled dataset of known variant effects

**Approach 2: Embedding-Based Scoring (This Tutorial)**
```
Reference Sequence → Model → Embedding₁
Alternative Sequence → Model → Embedding₂
Compare(Embedding₁, Embedding₂) → Effect Score
```
*Requires*: Only reference genome and variant coordinates
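
The `Compare` step of Approach 2 can be sketched with plain NumPy. This is a minimal illustration, assuming cosine distance (1 minus cosine similarity) as the comparison, in line with the scoring methods listed for Part 3; `cosine_effect_score` is a hypothetical helper, and the toy vectors stand in for real model embeddings.

```python
import numpy as np


def cosine_effect_score(ref_emb, alt_emb) -> float:
    """Return 1 - cosine similarity between two embedding vectors.

    0.0 means the embeddings point in the same direction (no detectable
    change); larger values mean the variant moved the representation more.
    """
    ref_emb = np.asarray(ref_emb, dtype=float)
    alt_emb = np.asarray(alt_emb, dtype=float)
    denom = np.linalg.norm(ref_emb) * np.linalg.norm(alt_emb)
    if denom == 0.0:
        raise ValueError("Embeddings must be non-zero vectors")
    return float(1.0 - np.dot(ref_emb, alt_emb) / denom)


# Toy vectors standing in for model embeddings (hypothetical values)
ref_embedding = np.array([1.0, 0.0, 0.0])
same = cosine_effect_score(ref_embedding, ref_embedding)
orthogonal = cosine_effect_score(ref_embedding, np.array([0.0, 1.0, 0.0]))
print(same, orthogonal)
```

Cosine distance ignores vector magnitude, which is one reason it is a common default for comparing embeddings; other distances (for example Euclidean) are equally valid choices.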

### 2.3 Why Embedding-Based VEP?

✅ **Advantages:**
- No labeled training data required
- Works for novel variants
- Captures semantic differences
- Transferable across organisms

⚠️ **Considerations:**
- Requires high-quality pre-trained models
- May need calibration for clinical use
- Interpretation requires biological context


## 3. The OmniGenBench VEP Workflow 🔄

Our four-part tutorial follows the standard OmniGenBench methodology:

```mermaid
graph LR
    A[📊 Data<br/>Preparation] --> B[🤖 Model<br/>Setup]
    B --> C[🧬 Embedding &<br/>Scoring]
    C --> D[📈 Visualization &<br/>Export]
    style A fill:#e1f5ff
    style B fill:#ffe1f5
    style C fill:#f5ffe1
    style D fill:#ffe1e1
```

| Tutorial | Focus | Key Concepts |
| :--- | :--- | :--- |
| **Part 1** (This) | Data Preparation | Variant formats, data loading, quality control |
| **Part 2** | Model Setup | PlantRNA-FM (35M, *Nature MI*) initialization, embedding extraction |
| **Part 3** | Scoring Methods | Cosine similarity, effect score calculation |
| **Part 4** | Analysis | Visualization, interpretation, export |


---

## 🛠️ Step-by-Step Guide: Preparing Variant Data

Now let's move from theory to practice. This section guides you through loading and processing variant data.

### 4.1: Environment Setup

First, install the required packages:


In [None]:
# Install OmniGenBench and dependencies
%pip install omnigenbench -U


In [None]:
# Import required libraries (standard library first, then third-party)
import warnings
from dataclasses import dataclass

from omnigenbench import (
    OmniTokenizer,
    OmniDatasetForSequenceClassification
)

warnings.filterwarnings('ignore')
print("✅ Libraries imported successfully!")
print("🧬 Ready to load variant data!")


## 📋 Understanding VEP Data Formats

Before loading data, let's understand how variant data is structured in OmniGenBench.

### 4.2: VEP Data Templates

VEP data follows the standard OmniGenBench directory structure:

```
variant_effect_prediction/
├── train.jsonl              # Training variants (if available)
├── valid.jsonl              # Validation variants (optional)
├── test.jsonl               # Test variants (required)
├── config.py                # Dataset metadata
└── README.md                # Documentation
```
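
Before loading a local dataset, it can help to check which split files are actually present. The sketch below is one possible check, assuming the file names from the layout above; `find_splits` is a hypothetical helper, and the temporary directory only exists to make the example self-contained.

```python
import tempfile
from pathlib import Path

EXPECTED_SPLITS = ("train", "valid", "test")


def find_splits(dataset_dir) -> dict:
    """Report which of the expected <split>.jsonl files exist."""
    root = Path(dataset_dir)
    return {name: (root / f"{name}.jsonl").is_file() for name in EXPECTED_SPLITS}


# Build a throwaway directory containing only the required test split
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "test.jsonl").write_text('{"sequence": "ACGT", "label": 0}\n')
    found = find_splits(tmp)

print(found)
```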

### 4.3: Variant Data Format

**Option 1: JSONL Format (Recommended)**
```json
{"sequence": "ATCG...GCTA", "ref_allele": "A", "alt_allele": "G", "label": 1}
{"sequence": "GCTA...ATCG", "ref_allele": "C", "alt_allele": "T", "label": 0}
```

**Option 2: CSV Format**
```csv
chr,pos,ref,alt,sequence,label
chr1,12345,A,G,ATCG...GCTA,1
chr2,67890,C,T,GCTA...ATCG,0
```
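
If your variants arrive as CSV, they can be converted to the JSONL layout of Option 1 with the standard library alone. The sketch below assumes the column names shown above and the JSONL keys `sequence`, `ref_allele`, `alt_allele`, and `label`; `csv_to_jsonl_records` is a hypothetical helper, and the short sequences are toy placeholders rather than real data.

```python
import csv
import io
import json

# Toy CSV with the column layout from Option 2 (sequences are placeholders)
csv_text = """chr,pos,ref,alt,sequence,label
chr1,12345,A,G,ACGTACGT,1
chr2,67890,C,T,TTGCAAGC,0
"""


def csv_to_jsonl_records(handle) -> list:
    """Convert CSV rows into dicts using the JSONL field names."""
    records = []
    for row in csv.DictReader(handle):
        records.append({
            "sequence": row["sequence"],
            "ref_allele": row["ref"],
            "alt_allele": row["alt"],
            "label": int(row["label"]),  # CSV values are strings
        })
    return records


records = csv_to_jsonl_records(io.StringIO(csv_text))
jsonl_lines = [json.dumps(record) for record in records]
print("\n".join(jsonl_lines))
```

Note that this conversion drops `chr` and `pos`; keep them as extra fields if you need to trace scores back to genomic coordinates.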

### 4.4: Label Conventions

For VEP, labels typically represent:
- **Binary**: 0 = Benign, 1 = Pathogenic
- **Multi-class**: 0 = Benign, 1 = Likely benign, 2 = VUS, 3 = Likely pathogenic, 4 = Pathogenic
- **Regression**: Continuous effect scores
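
When a dataset uses the five-class scheme but the analysis needs binary labels, the classes have to be collapsed. The sketch below shows one possible policy, which is an assumption rather than an OmniGenBench convention: benign-leaning classes map to 0, pathogenic-leaning classes map to 1, and VUS is returned as `None` so it can be excluded.

```python
# Five-class scheme from the list above
FIVE_TIER = {
    0: "Benign",
    1: "Likely benign",
    2: "VUS",
    3: "Likely pathogenic",
    4: "Pathogenic",
}


def to_binary(label: int):
    """Collapse a five-class label to 0/1; return None for VUS."""
    if label not in FIVE_TIER:
        raise ValueError(f"Unknown label: {label}")
    if label == 2:
        return None  # uncertain significance: no binary ground truth
    return 1 if label >= 3 else 0


binary = [to_binary(label) for label in sorted(FIVE_TIER)]
print(binary)
```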

### 4.5: Sequence Context

Variants are embedded in genomic context:
```
[Upstream Context] [Reference Allele] [Downstream Context]
                          ↓
[Upstream Context] [Alternative Allele] [Downstream Context]
```

The **context window size** (e.g., 200 bp upstream and downstream) determines how much surrounding sequence, including nearby regulatory elements, the model sees around each variant.
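
The reference and alternative sequences share the same flanking context and differ only at the variant site. The sketch below is a minimal illustration, assuming `pos` is the 0-based index of the first reference base (VCF positions are 1-based, so subtract 1 first); `build_ref_alt` is a hypothetical helper and the ten-base reference is a toy string.

```python
def build_ref_alt(reference: str, pos: int, ref: str, alt: str, context: int = 200):
    """Return (ref_seq, alt_seq) windows centered on a variant.

    `pos` is the 0-based index of the first REF base in `reference`.
    Windows are clipped at the ends of the reference.
    """
    if reference[pos:pos + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference at pos")
    start = max(0, pos - context)
    end = min(len(reference), pos + len(ref) + context)
    upstream = reference[start:pos]
    downstream = reference[pos + len(ref):end]
    return upstream + ref + downstream, upstream + alt + downstream


reference = "ACGTACGTAC"  # toy reference
ref_seq, alt_seq = build_ref_alt(reference, pos=4, ref="A", alt="G", context=3)
print(ref_seq)
print(alt_seq)
```

Checking the REF allele against the reference before building the window catches off-by-one coordinate errors and mismatched genome builds early.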


## 5. Configuration

Define analysis parameters with clear documentation:


In [None]:
# Configuration parameters
@dataclass
class VEPConfig:
    """Configuration for Variant Effect Prediction pipeline"""
    # Dataset settings
    dataset_name: str = "yangheng/variant_effect_prediction"
    cache_dir: str = "vep_data"
    
    # Model settings
    model_name: str = "yangheng/OmniGenome-52M"
    max_length: int = 512
    
    # Analysis settings
    context_length: int = 200  # Base pairs upstream/downstream of variant
    batch_size: int = 16
    
config = VEPConfig()
print("📋 Configuration loaded:")
print(f"   Dataset: {config.dataset_name}")
print(f"   Model: {config.model_name}")
print(f"   Context window: ±{config.context_length}bp")


## 6. Data Loading

Load variant dataset with automatic caching and preprocessing:

**What happens during loading:**
1. 🔽 Download dataset from Hugging Face Hub (if not cached)
2. 🔤 Initialize tokenizer for sequence encoding
3. 📊 Create dataset objects for train/valid/test splits
4. ✅ Validate data format and integrity


In [None]:
# Initialize tokenizer
print("🔤 Loading tokenizer...")
tokenizer = OmniTokenizer.from_pretrained(config.model_name, trust_remote_code=True)
print(f"   Vocabulary size: {len(tokenizer)}")
print(f"   Special tokens: {tokenizer.all_special_tokens}")

# Load datasets with progress indication
print("\n📥 Loading variant datasets...")
datasets = OmniDatasetForSequenceClassification.from_hub(
    dataset_name=config.dataset_name,
    tokenizer=tokenizer,
    max_length=config.max_length,
    cache_dir=config.cache_dir
)
print("✅ Datasets loaded successfully!")


## 7. Data Exploration & Quality Control

Understanding your data is crucial before analysis. Let's examine:
- Dataset sizes and splits
- Sequence characteristics
- Label distributions
- Data quality indicators
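
Some quality checks are easiest to run on the raw records before tokenization. The sketch below is one possible check, assuming records use the JSONL field names from Section 4.3 and an alphabet of `A`, `C`, `G`, `T`, `U`, and `N`; `qc_record` is a hypothetical helper and the two records are toy examples.

```python
VALID_BASES = set("ACGTUN")  # assumed alphabet; extend for IUPAC codes if needed


def qc_record(record: dict) -> list:
    """Return a list of problems found in one raw variant record."""
    problems = []
    sequence = record.get("sequence", "")
    if not sequence:
        problems.append("empty sequence")
    elif set(sequence.upper()) - VALID_BASES:
        problems.append("unexpected characters in sequence")

    ref, alt = record.get("ref_allele"), record.get("alt_allele")
    if ref is None or alt is None:
        problems.append("missing allele field")
    elif ref == alt:
        problems.append("ref and alt alleles are identical")
    return problems


records = [
    {"sequence": "ACGTN", "ref_allele": "A", "alt_allele": "G", "label": 1},
    {"sequence": "ACGX", "ref_allele": "C", "alt_allele": "C", "label": 0},
]
report = {index: qc_record(record) for index, record in enumerate(records)}
print(report)
```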


In [None]:
# Check dataset sizes
print("📊 Dataset Statistics:")
print("=" * 50)
for split, dataset in datasets.items():
    print(f"{split.upper():10s}: {len(dataset):6,d} variants")

# Examine sample structure
print("\n🔍 Sample Data Structure:")
print("=" * 50)
sample = datasets['test'][0]
print(f"Available keys: {list(sample.keys())}")
print(f"\nInput shape: {sample['input_ids'].shape}")
print(f"Attention mask shape: {sample['attention_mask'].shape}")

# Show first few tokens (if available)
if hasattr(tokenizer, 'convert_ids_to_tokens'):
    tokens = tokenizer.convert_ids_to_tokens(sample['input_ids'][:20])
    print(f"\nFirst 20 tokens: {tokens}")


### 7.1: Label Distribution Analysis


In [None]:
# Analyze label distribution
import numpy as np

print("\n📈 Label Distribution:")
print("=" * 50)
for split, dataset in datasets.items():
    # Skip empty splits and splits without labels
    if len(dataset) == 0 or 'labels' not in dataset[0]:
        continue
    labels = []
    for i in range(len(dataset)):
        label = dataset[i]['labels']  # Index each sample only once
        # Tensors expose .item(); plain Python numbers are used as-is
        labels.append(label.item() if hasattr(label, 'item') else label)
    unique, counts = np.unique(labels, return_counts=True)
    print(f"\n{split.upper()}:")
    for label, count in zip(unique, counts):
        percentage = count / len(labels) * 100
        print(f"  Label {label}: {count:6,d} ({percentage:5.2f}%)")


### 7.2: Sequence Length Statistics


In [None]:
# Analyze sequence lengths
print("\n📏 Sequence Length Distribution:")
print("=" * 50)
test_dataset = datasets['test']
lengths = []
for i in range(min(1000, len(test_dataset))):  # Inspect at most the first 1000 samples
    # Count the non-padding tokens of sample i
    length = (test_dataset[i]['attention_mask'] == 1).sum().item()
    lengths.append(length)

lengths = np.array(lengths)
print(f"Mean length: {lengths.mean():.1f} tokens")
print(f"Std dev: {lengths.std():.1f} tokens")
print(f"Min length: {lengths.min()} tokens")
print(f"Max length: {lengths.max()} tokens")
print(f"Median length: {np.median(lengths):.1f} tokens")


## 8. Create DataLoader


In [None]:
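# Create DataLoader for batch processing (batch size comes from VEPConfig)
test_loader = datasets['test'].get_dataloader(
    batch_size=config.batch_size,
    shuffle=False  # Keep variant order stable so scores map back to inputs
)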

# Verify batch structure
batch = next(iter(test_loader))
print(f"Batch keys: {list(batch.keys())}")
print(f"Input shape: {batch['input_ids'].shape}")
