# 🧬 RNA Secondary Structure Prediction with OmniGenBench

Welcome to this comprehensive tutorial where we'll explore how to predict **RNA secondary structures** from primary sequences using **OmniGenBench**. This guide will walk you through a complete genomic deep learning project, from understanding the fundamental biological concepts to deploying a trained model for real-world applications.

> 📚 **Prerequisites**: If you're new to OmniGenBench, we strongly recommend starting with the **[Fundamental Concepts Tutorial](https://github.com/yangheng95/OmniGenBench/blob/master/examples/00_fundamental_concepts.ipynb)**. It covers essential topics including language model concepts, machine learning task classification, foundation model principles, and other foundational knowledge.

### 1. The Biological Challenge: What is RNA Secondary Structure?

**RNA secondary structure** refers to the pattern of base pairing within a single RNA molecule, forming loops, stems, and other structural elements that are crucial for RNA function. Understanding these structures is fundamental because:

- **Functional Importance**: RNA structure directly determines function in processes like catalysis (ribozymes), regulation (miRNAs), and protein synthesis (rRNA, tRNA)
- **Disease Relevance**: Structural mutations can disrupt RNA function, leading to genetic disorders
- **Drug Design**: RNA structures serve as targets for therapeutic interventions
- **Synthetic Biology**: Designing functional RNA molecules requires precise structural control

The challenge lies in predicting these base-pairing patterns from linear sequence information alone, a long-standing problem in computational biology.

### 2. The Data: RNA Secondary Structure Dataset

To train our predictive model, we use the **RNA-SSP-Archive2** dataset, a collection of RNA sequences paired with annotated secondary structures.

- **What it contains**: RNA sequences with annotated base-pairing patterns
- **What it labels**: Each nucleotide position is classified based on its structural role:
  - `.` (unpaired/loop regions)
  - `(` and `)` (base-paired positions forming stems)
- **Our Goal**: Train a model that can accurately predict the structural state of each nucleotide position

**Dataset Structure:**

| sequence | structure |
|----------|-----------|
| AUGCCGUGC... | .(((...))). |
| GCCAUGCUA... | (((....))). |
| ... | ... |
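
To make the dot-bracket notation concrete, here is a minimal sketch that turns a structure string into explicit base pairs using a stack. The helper name `dot_bracket_to_pairs` and the toy hairpin are illustrative assumptions and are not part of OmniGenBench or the dataset.

```python
def dot_bracket_to_pairs(structure: str) -> list[tuple[int, int]]:
    """Return 0-based (i, j) index pairs for every matched '(' and ')'."""
    stack = []   # positions of '(' that are still waiting for a partner
    pairs = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Illustrative hairpin (not taken from the dataset): a 3-bp stem closing a 4-nt loop.
sequence = "GGGAAAACCC"
structure = "(((....)))"
for i, j in dot_bracket_to_pairs(structure):
    print(f"{sequence[i]}{i} pairs with {sequence[j]}{j}")
```

Each `(` is matched with the nearest unmatched `)` to its right, which is why a simple stack is enough for structures without pseudoknots.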

### 3. Quick Start: RNA Structure Prediction Workflow

This tutorial demonstrates the practical application of the **[Fundamental Concepts Tutorial](https://github.com/yangheng95/OmniGenBench/blob/master/examples/00_fundamental_concepts.ipynb)** to a specific biological problem. We'll use the standard 4-step OmniGenBench workflow:

In [None]:
from IPython.display import Image, display
display(Image(filename="4-step workflow.png"))


**RNA Secondary Structure Prediction** is a **token classification** task where we predict per-nucleotide structural labels (unpaired `.`, left bracket `(`, right bracket `)`) for RNA sequences. We'll use `OmniModelForTokenClassification` with **OmniGenome** to leverage sequence-structure patterns learned from large-scale RNA data.

### 4. Tutorial Structure

For a detailed understanding of each step, explore our comprehensive tutorial series:

1. **[Data Preparation](01_data_preparation.ipynb)**: Learn about RNA structure data formats, tokenization strategies, and dataset preprocessing
2. **[Model Initialization](02_model_initialization.ipynb)**: Understand foundation models, model architecture, and configuration options
3. **[Model Training](03_model_training.ipynb)**: Explore training strategies, hyperparameter tuning, and evaluation metrics
4. **[Model Inference](04_model_inference.ipynb)**: Master inference techniques, structure validation, and results interpretation

Each tutorial builds upon the concepts introduced here with comprehensive explanations, best practices, and advanced techniques. Let's get started with the quick workflow!

## 🚀 Step 1: Data Preparation

This first step is all about getting our data ready for in-silico analysis. It involves four key parts:
1. **Environment Setup**: Installing and importing the necessary libraries.
2. **Configuration**: Defining all our important parameters in one place.
3. **Data Acquisition**: Downloading and preparing the raw dataset.
4. **Data Loading**: Creating a pipeline to efficiently feed data to the model.

### 1.1: Environment Setup

First, let's install the required Python packages. `omnigenbench` is our core library that provides the framework for genomic foundation model development and benchmarking.

In [None]:
!pip install omnigenbench -U

Next, we import the libraries we just installed. This gives us the tools for data processing, deep learning, and working with RNA structure data.

A key part of this setup is choosing the hardware for training. A **CUDA-enabled GPU**, if one is available, typically trains far faster than a CPU; the exact speedup depends on the model size and the hardware involved.

In [None]:
from omnigenbench import (
    ClassificationMetric,
    AccelerateTrainer,
    ModelHub,
    OmniTokenizer,
    OmniDatasetForTokenClassification,
    OmniModelForTokenClassification,
)
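
The import cell does not itself select a device. If you want to check which hardware is available before training, a minimal sketch looks like the following. It assumes PyTorch is installed alongside `omnigenbench`; the helper name `pick_device` is hypothetical, and the trainer used later may already handle device placement on its own.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled GPU is visible to PyTorch, else "cpu"."""
    try:
        import torch  # assumption: PyTorch is available in this environment
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```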

### 1.2: Global Configuration

To make our tutorial easy to modify and understand, we'll centralize all important parameters in this section. This is a best practice in software development that makes experiments more reproducible.

#### Key Parameters
- **Dataset**: We use the RNA-SSP-Archive2 dataset containing RNA sequences with experimentally determined secondary structures.
- **Model**: We select `OmniGenome-52M` as our foundation model. It is a comparatively small checkpoint, which keeps fine-tuning fast and makes it well suited to learning and prototyping.
- **Label Mapping**: We define the three structural labels: `(` (left bracket), `)` (right bracket), and `.` (unpaired).

This centralized approach allows you to easily experiment with different settings (e.g., a larger model, different learning rate) without hunting through the code.

#### Note
Almost all the parameters here are standard in machine learning workflows and have default values that work well. Don't worry if some terms are unfamiliar - we'll explain each one as we go along.

In [None]:
model_name_or_path = "yangheng/OmniGenome-52M"  # Genomic Foundation Model
dataset_name = "RNA-SSP-Archive2"
label2id = {"(": 0, ")": 1, ".": 2}  # Structure labels
id2label = {v: k for k, v in label2id.items()}
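
To see what the label mapping does in practice, the sketch below encodes a dot-bracket string into integer class ids and decodes it back. The helper names `encode_structure` and `decode_labels` are hypothetical and only illustrate the round trip; the dataset class performs its own conversion internally.

```python
label2id = {"(": 0, ")": 1, ".": 2}
id2label = {v: k for k, v in label2id.items()}


def encode_structure(structure: str) -> list[int]:
    """Map each dot-bracket symbol to its integer class id."""
    return [label2id[symbol] for symbol in structure]


def decode_labels(ids: list[int]) -> str:
    """Map integer class ids back to a dot-bracket string."""
    return "".join(id2label[i] for i in ids)


structure = "((..))."
ids = encode_structure(structure)
print(ids)                 # [0, 0, 2, 2, 1, 1, 2]
print(decode_labels(ids))  # ((..)).
```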

### 1.3: Data Acquisition and Loading

With our environment configured, it's time to load the RNA secondary structure dataset. The function below automates this process by:
1. Downloading the dataset from the OmniGenBench Hub if needed.
2. Processing RNA sequences and structure labels.
3. Creating train/validation/test splits ready for training.

This ensures we have properly formatted data ready for the next stage. For token classification tasks like RNA structure prediction, each nucleotide position needs its own label.

In [None]:
# Initialize tokenizer
tokenizer = OmniTokenizer.from_pretrained(model_name_or_path)

# Load datasets with automatic processing
datasets = OmniDatasetForTokenClassification.from_hub(
    dataset_name_or_path=dataset_name,
    tokenizer=tokenizer,
    max_length=512,
    label2id=label2id,
    max_examples=1000,  # For quick testing; set to None for full dataset
)

### 1.4: Dataset Loading with OmniGenBench

With OmniGenBench, data loading is significantly simplified. The framework handles the following steps automatically.

#### Automatic Data Processing
The `OmniDatasetForTokenClassification` class automatically:
1. **Downloads and processes** the dataset from our curated collection
2. **Handles sequence preprocessing** including truncation, padding, and tokenization
3. **Manages token-level label formatting** for structure prediction
4. **Creates train/validation/test splits** ready for training

This streamlined approach eliminates the need for custom dataset classes while maintaining full flexibility and performance.
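
To give a feel for what token-level label formatting involves, here is a minimal sketch of aligning structure labels with a tokenized sequence. It is not the OmniGenBench implementation. It assumes one token per nucleotide, one special token at each end of the sequence, and `-100` as the value for positions that should not be scored; the helper name `align_labels` is hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by loss and metrics
label2id = {"(": 0, ")": 1, ".": 2}


def align_labels(structure: str, max_length: int) -> list[int]:
    """Build per-token label ids for a sequence tokenized one nucleotide per token.

    Assumed layout: [start token] nucleotides... [end token] [padding...]
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    kept = structure[: max_length - 2]            # truncate to fit
    labels = [IGNORE_INDEX]                       # start token
    labels += [label2id[symbol] for symbol in kept]
    labels.append(IGNORE_INDEX)                   # end token
    labels += [IGNORE_INDEX] * (max_length - len(labels))  # padding
    return labels


print(align_labels("((..))", max_length=10))
# [-100, 0, 0, 2, 2, 1, 1, -100, -100, -100]
```

The important property is that the label list has exactly one entry per token, so every nucleotide keeps its own label after truncation and padding.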

In [None]:
print(f"📊 Loaded datasets: {list(datasets.keys())}")
for split, dataset in datasets.items():
    print(f"  - {split}: {len(dataset)} samples")

## 🚀 Step 2: Model Initialization with OmniGenome

With our data pipeline in place, it's time to set up **OmniGenome** for RNA secondary structure prediction. Instead of building a model from scratch, we'll load the pre-trained OmniGenome and adapt it for our structure prediction task.

This process involves three key components:
1. **The Tokenizer**: Converts RNA sequences into a numerical format OmniGenome can process. It's crucial that we use the tokenizer specifically designed for OmniGenome.
2. **OmniGenome Base Model**: This genomic foundation model has learned fundamental patterns of nucleotide sequences from extensive genomic data.
3. **The Classification Head**: We add a token classification head on top of OmniGenome that maps per-nucleotide representations to our structural labels (unpaired, left bracket, right bracket).

The `OmniModelForTokenClassification` class seamlessly combines OmniGenome with the task-specific classification head for RNA structure prediction.
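
Conceptually, the classification head produces one score per label for every nucleotide, and the predicted label is the one with the highest score. The sketch below shows only that decoding step, in plain Python with made-up scores. The function name `decode_logits` and the toy values are assumptions for illustration, not output from OmniGenome.

```python
id2label = {0: "(", 1: ")", 2: "."}


def decode_logits(logits: list[list[float]]) -> str:
    """Pick the highest-scoring label for each nucleotide position."""
    symbols = []
    for row in logits:  # one row of three scores per nucleotide
        best = max(range(len(row)), key=row.__getitem__)
        symbols.append(id2label[best])
    return "".join(symbols)


# Made-up scores for a 6-nucleotide sequence: columns are "(", ")", "."
toy_logits = [
    [2.1, -0.3, 0.4],
    [1.7, 0.1, 0.2],
    [-0.5, 0.0, 1.9],
    [-0.2, 0.3, 2.2],
    [0.1, 1.8, 0.0],
    [-0.4, 2.5, 0.3],
]
print(decode_logits(toy_logits))  # ((..))
```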

In [None]:
# === Model Initialization with OmniGenome ===
# Using OmniGenome for RNA secondary structure prediction

model = OmniModelForTokenClassification(
    model_name_or_path,  # OmniGenome-52M
    tokenizer,
    label2id=label2id,  # Token classification: (, ), .
    id2label=id2label,
)

print("✅ Loaded OmniGenome for RNA secondary structure prediction")
print(f"📊 Number of labels: {len(label2id)}")

## 🚀 Step 3: Fine-tuning OmniGenome

This is the most exciting part! With our data and OmniGenome ready, we can now begin the **fine-tuning** process. During training, OmniGenome will adapt its knowledge to learn the relationship between RNA sequence features and secondary structure patterns.

The `AccelerateTrainer` from `omnigenbench` handles the training process efficiently, allowing us to fine-tune OmniGenome with just a few lines of code.

### Training Metrics
For token classification tasks, we use:
- **F1-score**: Harmonic mean of precision and recall, robust for imbalanced classes
- **Accuracy**: Percentage of correctly predicted nucleotide positions

The `ignore_y=-100` parameter ensures special tokens (like padding) are excluded from evaluation.
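
To illustrate what masking with `-100` does to these metrics, here is a minimal pure-Python sketch of accuracy and F1 computed only over labeled positions. It is not the `ClassificationMetric` implementation; the function names are hypothetical, and macro averaging is an assumption made for this example.

```python
IGNORE_INDEX = -100


def _labeled_pairs(y_true, y_pred, ignore_index=IGNORE_INDEX):
    """Keep only positions whose true label is not the ignore value."""
    return [(t, p) for t, p in zip(y_true, y_pred) if t != ignore_index]


def masked_accuracy(y_true, y_pred, ignore_index=IGNORE_INDEX):
    kept = _labeled_pairs(y_true, y_pred, ignore_index)
    if not kept:
        return 0.0
    return sum(t == p for t, p in kept) / len(kept)


def masked_macro_f1(y_true, y_pred, ignore_index=IGNORE_INDEX):
    kept = _labeled_pairs(y_true, y_pred, ignore_index)
    classes = sorted({t for t, _ in kept} | {p for _, p in kept})
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in kept)
        fp = sum(t != c and p == c for t, p in kept)
        fn = sum(t == c and p != c for t, p in kept)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# The first and last positions stand for special tokens and are ignored.
y_true = [-100, 0, 0, 2, 2, 1, 1, -100]
y_pred = [2, 0, 2, 2, 2, 1, 1, 2]
print(f"accuracy: {masked_accuracy(y_true, y_pred):.3f}")
print(f"macro F1: {masked_macro_f1(y_true, y_pred):.3f}")
```

Without the mask, the predictions made at special-token and padding positions would count as errors and distort both scores.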

In [None]:
metric_functions = [
    ClassificationMetric(ignore_y=-100).f1_score,
    ClassificationMetric(ignore_y=-100).accuracy_score,
]

trainer = AccelerateTrainer(
    model=model,
    train_dataset=datasets["train"],
    eval_dataset=datasets["valid"],
    test_dataset=datasets["test"],
    compute_metrics=metric_functions,
)

print("🎓 Starting training...")
metrics = trainer.train()
trainer.save_model("ogb_rna_structure_finetuned")

print("✅ Training completed!")
print("📊 Metrics:", metrics)

## 🔮 Step 4: Model Inference and Interpretation

Now that we have a trained model, let's use it for its intended purpose: predicting RNA secondary structures on new sequences. This process is called **inference**.

### The Inference Pipeline

Our inference pipeline consists of a few key steps:
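1. **Load the Model**: We load a fine-tuned model through `ModelHub`. The cell below loads `yangheng/ogb_rna_structure_finetuned`; to try the model you trained in Step 3, substitute the `ogb_rna_structure_finetuned` directory written by `trainer.save_model`, assuming `ModelHub.load` accepts a local path in your version.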
2. **Process the Input**: We take new RNA sequences and apply the same preprocessing steps we used for our training data.
3. **Run Prediction**: We feed the processed sequence to the model and get its predictions. We use `torch.no_grad()` to disable gradient calculations, which makes inference faster and uses less memory.
4. **Interpret the Results**: The model's raw output contains per-nucleotide predictions. We'll convert these to structure notation and validate the results.

To demonstrate, we'll test our model on sample sequences with different structural characteristics.

In [None]:
import torch

# Load the fine-tuned model
inference_model = ModelHub.load("yangheng/ogb_rna_structure_finetuned")

# Sample RNA sequences with different characteristics
sample_sequences = {
    "Hairpin structure": "GCUGGGAUGUUGGCUUAGAAGCAGCCAUCAUUUAAAGAGUGCGUAACAGCUCACCAGC",
    "Complex structure": "AUCUGUACUAGUUAGCUAACUAGAUCUGUAUCUGGCGGUUCCGUGGAAGAACUGACGU",
    "Short sequence": "AUGCCGUGCAUUAA",
}

print("🔮 Running structure prediction...\n")
with torch.no_grad():
    for name, sequence in sample_sequences.items():
        outputs = inference_model.inference(sequence)
        predictions = outputs.get('predictions', None)
        
        if predictions is not None:
            # Convert predictions to structure notation
            predicted_structure = "".join([id2label[pred] for pred in predictions])
            
            # Calculate statistics
            num_paired = predicted_structure.count('(') + predicted_structure.count(')')
            num_unpaired = predicted_structure.count('.')
            is_balanced = predicted_structure.count('(') == predicted_structure.count(')')
            
            print(f"📊 {name}:")
            print(f"  Sequence:  {sequence}")
            print(f"  Structure: {predicted_structure}")
            print(f"  Length: {len(sequence)} nt | Paired: {num_paired} | Unpaired: {num_unpaired}")
            print(f"  Balanced: {'✅ Yes' if is_balanced else '❌ No'}\n")
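
The balance check above only compares the number of `(` and `)` symbols. Because a token classifier labels each position independently, a prediction can pass that check and still not be a valid dot-bracket string. A stricter check, sketched below with a hypothetical helper `is_well_formed`, also requires that every `)` closes an earlier `(`.

```python
def is_well_formed(structure: str) -> bool:
    """True if every ')' closes an earlier '(' and no '(' is left open."""
    depth = 0
    for symbol in structure:
        if symbol == "(":
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its matching '('
                return False
    return depth == 0


for candidate in ["((..))", "))..((", "(()"]:
    counts_match = candidate.count("(") == candidate.count(")")
    print(f"{candidate!r}: counts match={counts_match}, "
          f"well-formed={is_well_formed(candidate)}")
```

The second candidate has matching counts but is not well formed, which is exactly the case the count-based check misses.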

## 🎉 Tutorial Summary and Next Steps

Congratulations! You have successfully completed this comprehensive tutorial on RNA secondary structure prediction with OmniGenBench.

### What You've Learned

You've walked through a complete, end-to-end application of genomic deep learning, demonstrating how the concepts from the **[Fundamental Concepts Tutorial](../00_fundamental_concepts.ipynb)** apply to a real biological problem. Specifically, you have:

1. **Applied Task Formulation**: Successfully framed RNA structure prediction as a token classification problem
2. **Mastered the 4-Step Workflow**:
   - **Step 1: Data Preparation**: Acquired, processed, and loaded the RNA structure dataset
   - **Step 2: Model Initialization**: Set up OmniModelForTokenClassification for per-nucleotide prediction
   - **Step 3: Model Training**: Fine-tuned the model using best practices and appropriate evaluation metrics
   - **Step 4: Model Inference**: Generated predictions on new RNA sequences and validated results
3. **Understood Practical Application**: Gained hands-on experience with a biologically relevant prediction task

### 🚀 Next Steps

Now that you've mastered RNA structure prediction, you can:

#### 🧬 **Explore Other Token Classification Tasks**
- **Splice Site Prediction**: Identify exon-intron boundaries
- **Binding Site Identification**: Predict protein-RNA interaction sites
- **RNA Modification Sites**: Locate m6A, pseudouridine, etc.

#### 📊 **Try Different Task Types**
- **Sequence Classification**: Functional annotation, localization prediction
- **Sequence Regression**: Expression level prediction, stability scores
- **RNA Design**: Generate sequences with desired structures

#### 🔬 **Advanced Techniques**
- **Model Comparison**: Benchmark different foundation models (OmniGenome-186M, OmniGenome-418M)
- **Hyperparameter Optimization**: Fine-tune model performance
- **Ensemble Methods**: Combine multiple models for improved accuracy
- **Structure Validation**: Compare with ViennaRNA predictions

### 📚 Detailed Tutorials

For deeper understanding of each step, explore our detailed tutorial series:

1. **[Data Preparation](01_data_preparation.ipynb)**: Learn about RNA structure data formats, tokenization strategies, and dataset preprocessing
2. **[Model Initialization](02_model_initialization.ipynb)**: Understand foundation models, model architecture, and configuration options
3. **[Model Training](03_model_training.ipynb)**: Explore training strategies, hyperparameter tuning, and evaluation metrics
4. **[Model Inference](04_model_inference.ipynb)**: Master inference techniques, structure validation, and results interpretation

### 📚 Resources

- **[Fundamental Concepts Tutorial](../00_fundamental_concepts.ipynb)**: Review core concepts anytime
- **[OmniGenBench Documentation](https://omnigenbench.readthedocs.io/)**: Complete API reference
- **[GitHub Repository](https://github.com/yangheng95/OmniGenBench)**: Source code and community discussions

Thank you for following along. We hope this tutorial has provided you with the knowledge and confidence to apply deep learning to your own genomics research. Happy coding! 🧬🚀