In [None]:
from IPython.display import Image, display
display(Image(filename="4-step workflow.png"))


**Variant Effect Prediction** (VEP) estimates the functional impact of a sequence variant by comparing model embeddings of the reference and alternative sequences, with no task-specific training. This notebook uses `OmniModelForSequenceClassification` with **PlantRNA-FM** to extract plant-optimized embeddings for both versions of each sequence.
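
To make the idea of "embedding comparison" concrete, here is a minimal sketch that scores a variant as the cosine distance between a reference and an alternative embedding. Cosine distance is one common choice and is an assumption here, not necessarily the metric used in the scoring notebook; the function name `cosine_distance` and the toy vectors are hypothetical.

```python
import math


def cosine_distance(ref_emb, alt_emb):
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(ref_emb) != len(alt_emb):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(r * a for r, a in zip(ref_emb, alt_emb))
    ref_norm = math.sqrt(sum(r * r for r in ref_emb))
    alt_norm = math.sqrt(sum(a * a for a in alt_emb))
    if ref_norm == 0.0 or alt_norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (ref_norm * alt_norm)


# Toy 3-dimensional "embeddings" (made-up values, for illustration only)
ref = [1.0, 0.0, 0.0]
alt_same_direction = [2.0, 0.0, 0.0]  # same direction as ref: distance near 0
alt_orthogonal = [0.0, 1.0, 0.0]      # orthogonal to ref: distance near 1

score_small = cosine_distance(ref, alt_same_direction)
score_large = cosine_distance(ref, alt_orthogonal)
```

Under this convention, a larger distance means the variant moved the embedding further from the reference, which is read as a larger predicted effect. In the notebook itself the same computation would normally be applied to tensors rather than Python lists.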

### 4. Tutorial Structure

1. **[Data Preparation](01_vep_data_preparation.ipynb)**: Load variant datasets and reference genome
2. **[Model Setup](02_vep_model_setup.ipynb)**: Initialize PlantRNA-FM for plant genomic analysis
3. **[Embedding Extraction](03_embedding_and_scoring.ipynb)**: Compare reference and alternative sequences using PlantRNA-FM
4. **[Visualization](04_visualization_and_export.ipynb)**: Analyze and export results

Let's begin!


## 🚀 Step 1: Environment Setup and Configuration


In [None]:
%pip install omnigenbench -U


In [None]:
from omnigenbench import (
    OmniTokenizer,
    OmniModelForSequenceClassification,
    OmniDatasetForSequenceClassification
)


### Configuration

Define analysis parameters with sensible defaults:


In [None]:
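# Configuration parameters
dataset_name = "yangheng/variant_effect_prediction"  # Hub dataset identifier
model_name = "yangheng/PlantRNA-FM"  # Using PlantRNA-FM for plant variant analysis
max_length = 512       # Maximum tokenized sequence length
batch_size = 16        # Sequences per forward pass
context_length = 200   # Bases of sequence context around each variant
max_variants = 100     # Use subset for quick testing
cache_dir = "vep_data"       # Where downloaded data is cached
output_dir = "vep_results"   # Where results are written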
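
The role of `context_length` can be illustrated with a small sketch that cuts a window around a variant and builds the reference and alternative sequences to be compared. This is not the OmniGenBench implementation: the helper `build_ref_alt_windows` is hypothetical, and it assumes a 0-based position and that `context_length` counts flanking bases on each side of the variant.

```python
def build_ref_alt_windows(chrom_seq, pos, ref, alt, context_length=200):
    """Return (ref_window, alt_window) around a variant.

    pos is the 0-based index of the first reference base in chrom_seq.
    context_length is the number of flanking bases kept on EACH side
    (an assumption made for this sketch).
    """
    observed = chrom_seq[pos:pos + len(ref)]
    if observed.upper() != ref.upper():
        raise ValueError("reference allele does not match the sequence at pos")
    # Clamp the window to the sequence boundaries
    start = max(0, pos - context_length)
    end = min(len(chrom_seq), pos + len(ref) + context_length)
    left = chrom_seq[start:pos]
    right = chrom_seq[pos + len(ref):end]
    return left + ref + right, left + alt + right


# Toy example: A>G substitution at index 4 with 3 flanking bases per side
seq = "ACGTACGTAC"
ref_win, alt_win = build_ref_alt_windows(seq, pos=4, ref="A", alt="G", context_length=3)
```

The two windows share identical flanks and differ only at the variant, so any difference between their embeddings can be attributed to the allele change.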


## 📊 Step 2: Data Loading

Load the tokenizer, then load the variant dataset with `OmniDatasetForSequenceClassification.from_hub`, which caches the data in `cache_dir`:


In [None]:
# Load tokenizer
tokenizer = OmniTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Load dataset with automatic caching
datasets = OmniDatasetForSequenceClassification.from_hub(
    dataset_name=dataset_name,
    tokenizer=tokenizer,
    max_length=max_length,
    cache_dir=cache_dir
)


## 🔧 Step 3: Model Initialization


In [None]:
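# num_labels sizes the classification head; the scoring step below
# reads hidden states only and does not use the head's predictions.
model = OmniModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    trust_remote_code=True
)
model.eval()  # Inference mode: disables dropout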


## 🧬 Step 4: Variant Effect Scoring

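Extract a CLS-token embedding for every sequence in the test split. Effect scores are then derived by comparing the reference and alternative embeddings, as covered in [03_embedding_and_scoring.ipynb](03_embedding_and_scoring.ipynb):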


In [None]:
import torch
from tqdm import tqdm

test_dataset = datasets['test']
dataloader = test_dataset.get_dataloader(batch_size=batch_size, shuffle=False)

results = []
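with torch.no_grad():  # Inference only, so no gradients are tracked
    for batch in tqdm(dataloader, desc="Processing variants"):
        outputs = model(**batch)
        # Last-layer hidden state at position 0 (CLS token) as the sequence embedding.
        # Note: this assumes the model output exposes `hidden_states`.
        embeddings = outputs.hidden_states[-1][:, 0, :]
        results.append(embeddings)

# Concatenate per-batch embeddings into one (num_sequences, hidden_dim) tensor
all_embeddings = torch.cat(results, dim=0)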


## 📈 Results and Visualization

For detailed analysis and visualization, see the complete tutorial notebooks:
- **[01_vep_data_preparation.ipynb](01_vep_data_preparation.ipynb)**
- **[02_vep_model_setup.ipynb](02_vep_model_setup.ipynb)**
- **[03_embedding_and_scoring.ipynb](03_embedding_and_scoring.ipynb)**
- **[04_visualization_and_export.ipynb](04_visualization_and_export.ipynb)**
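
As a rough illustration of the export step, the sketch below ranks variants by score and writes them to a CSV file under `output_dir`. The function `export_scores`, the file name `variant_scores.csv`, and the column names are hypothetical choices for this sketch; the export notebook may use a different format.

```python
import csv
from pathlib import Path


def export_scores(variant_ids, scores, output_dir="vep_results",
                  filename="variant_scores.csv"):
    """Write (variant_id, score) rows, highest score first; return the file path."""
    if len(variant_ids) != len(scores):
        raise ValueError("variant_ids and scores must have the same length")
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # Create the folder if needed
    out_path = out_dir / filename
    # Sort so that the variants with the largest predicted effect come first
    ranked = sorted(zip(variant_ids, scores), key=lambda row: row[1], reverse=True)
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["variant_id", "score"])
        writer.writerows(ranked)
    return out_path
```

A call such as `export_scores(["v1", "v2"], [0.12, 0.87])` would write `vep_results/variant_scores.csv` with `v2` listed before `v1`.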
