# 🧬 RNA SSP Tutorial 1/4: From Biological Questions to Data Pipelines

Welcome to the first tutorial in our four-part series. This guide focuses on the foundational and most critical step of any computational biology project: **understanding and preparing your data**.

> 📚 **Prerequisites**: We recommend working through the **[Fundamental Concepts Tutorial](../00_fundamental_concepts.ipynb)** first; it covers the general machine learning task types and framework concepts.

Before we write any code, we must first understand the landscape of biological data and how to frame biological questions as machine learning tasks.

### 1. The Language of Life: RNA Structure

RNA molecules are not simply linear strings of nucleotides. They fold into complex three-dimensional structures that are crucial for their biological function. RNA structure is usually described at three levels, of which the **secondary structure** (the base-pairing pattern within a single RNA strand) is the one we predict in this series:

- **Primary Structure**: The linear sequence of nucleotides (A, U, G, C)
- **Secondary Structure**: Base-pairing patterns forming stems, loops, and bulges
- **Tertiary Structure**: Three-dimensional folding in space

Our goal is to teach a machine to predict secondary structure from primary sequence, in effect learning the "grammar" of RNA folding.

### 2. Framing Biological Questions as ML Tasks

RNA secondary structure prediction must be translated into a well-defined machine learning task. Here's how we conceptualize it:

| Aspect | RNA SSP Specification |
| :--- | :--- |
| **Biological Question** | What is the structural role of each nucleotide in this RNA sequence? |
| **Input Data** | RNA sequence (e.g., "AUGCCGUGC") |
| **Output Data** | Structural annotation for each position in dot-bracket notation (e.g., ".((...)).") |
| **ML Task Type** | **Token Classification** - Assigning a label to each position |
| **Label Space** | Three classes: `(` (left pair), `)` (right pair), `.` (unpaired) |

This is fundamentally different from sequence classification (one label for the entire sequence) because we need a prediction for every nucleotide.
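To make the per-nucleotide framing concrete, the sketch below pairs each nucleotide of the example sequence with a dot-bracket symbol and recovers the base pairs with a stack. The helper name `base_pairs` is hypothetical (it is not part of OmniGenBench), and the structure string is an illustrative balanced annotation rather than a measured fold.

```python
def base_pairs(structure: str) -> list[tuple[int, int]]:
    """Return sorted (i, j) index pairs for matching brackets in a dot-bracket string."""
    stack: list[int] = []
    pairs: list[tuple[int, int]] = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)  # remember where the pair opens
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))  # close the most recent open bracket
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


sequence = "AUGCCGUGC"
structure = ".((...))."  # illustrative annotation, one symbol per nucleotide

# One label per nucleotide: this is what makes the task token-level.
per_position = list(zip(sequence, structure))
pairs = base_pairs(structure)

print(per_position[:3])  # [('A', '.'), ('U', '('), ('G', '(')]
print(pairs)             # [(1, 7), (2, 6)]
```

The sequence and the structure string always have the same length, which is the property a token classifier relies on.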

```mermaid
graph TD
    subgraph "Mapping Problems to Tasks"
        A["Biological Question<br/>(What is each nucleotide's role?)"] --> B["ML Task<br/>(Token Classification)"];
        B --> C["Model Choice<br/>(OmniModelForTokenClassification)"];
        C --> D["Data Representation<br/>(Sequence + per-token labels)"];
    end
```

### 3. The OmniGenBench Toolbox: Available Models

`OmniGenBench` provides a suite of pre-configured models for different tasks:

| Task | OmniGenBench Model | When to Use |
| :--- | :--- | :--- |
| Sequence Classification | `OmniModelForSequenceClassification` | One label for entire sequence |
| Multi-Label Classification | `OmniModelForMultiLabelSequenceClassification` | Multiple labels for entire sequence |
| **Token Classification** | `OmniModelForTokenClassification` | **(Our focus)** Label for each position |
| Token Regression | `OmniModelForTokenRegression` | Continuous value for each position |
| Sequence Regression | `OmniModelForSequenceRegression` | Continuous value for sequence |
| Sequence-to-Sequence | `OmniModelForSeq2Seq` | Generate output sequence |

### 4. Our Task: Why RNA SSP is Token Classification

Let's apply this framework to RNA secondary structure prediction:

- **It's Token-level**: We need a prediction for *each nucleotide position*
- **It's Classification**: Each position belongs to one of three discrete classes
- **It's Structured**: The labels have dependencies (brackets must balance)

Therefore, we use `OmniModelForTokenClassification` with special post-processing to ensure structural validity.
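Because each position is classified independently, raw predictions can contain brackets without a partner. As a minimal sketch of what post-processing could look like, the hypothetical helper below replaces every unmatched bracket with `.`. This is one simple strategy chosen for illustration; it is not necessarily the post-processing OmniGenBench applies.

```python
def repair_dot_bracket(predicted: str) -> str:
    """Replace unmatched brackets with '.' so the returned string is balanced."""
    symbols = list(predicted)
    open_positions: list[int] = []
    for i, symbol in enumerate(symbols):
        if symbol == "(":
            open_positions.append(i)
        elif symbol == ")":
            if open_positions:
                open_positions.pop()  # matched with the most recent '('
            else:
                symbols[i] = "."  # closing bracket with no partner
    for i in open_positions:
        symbols[i] = "."  # opening brackets that were never closed
    return "".join(symbols)


print(repair_dot_bracket(".(((...))"))  # ..((...))
print(repair_dot_bracket("(((...)))"))  # (((...)))  (already valid, unchanged)
```

Demoting unmatched brackets to unpaired positions keeps the output length unchanged, so predictions stay aligned with the input sequence.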

---

## 🛠️ Step-by-Step Guide: Preparing the bpRNA Dataset

### 1.1: Environment Setup

First, install the required packages. `omnigenbench` provides our core framework, with built-in support for token classification tasks.

```python
!pip install omnigenbench -U
```

### 1.2: Import Required Libraries

```python
import os
import torch
from omnigenbench import (
    OmniTokenizer,
    OmniDatasetForTokenClassification,
)

print("✅ Libraries imported successfully!")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"🎯 CUDA available: {torch.cuda.is_available()}")
```

### 1.3: Global Configuration

Let's centralize all important parameters for easy experimentation and reproducibility.

#### Key Parameters
- **Model**: We select `OmniGenome-52M` for efficient training and prototyping
- **Dataset**: bpRNA dataset via the `RNA-SSP-Archive2` hub identifier
- **Labels**: Three structural classes with specific meanings in RNA folding

```python
# Model configuration
model_name_or_path = "yangheng/OmniGenome-52M"

# Dataset configuration
dataset_name = "RNA-SSP-Archive2"

# Label mapping for RNA secondary structure
label2id = {
    "(": 0,  # Left bracket - opening base pair
    ")": 1,  # Right bracket - closing base pair
    ".": 2,  # Dot - unpaired nucleotide
}
id2label = {v: k for k, v in label2id.items()}

# Training configuration
max_length = 512  # Maximum sequence length
max_examples = 1000  # For quick testing; set to None for full dataset

print("✅ Configuration complete!")
print(f"📊 Model: {model_name_or_path}")
print(f"📊 Dataset: {dataset_name}")
print(f"📊 Label mapping: {label2id}")
```
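As a quick sanity check on the label mapping, the following sketch encodes a short dot-bracket string into class ids and decodes it back. The structure string is an illustrative example, and the two dictionaries are repeated so the snippet runs on its own.

```python
label2id = {"(": 0, ")": 1, ".": 2}
id2label = {v: k for k, v in label2id.items()}

structure = ".((...))."

# Encode: one integer class id per structural symbol.
encoded = [label2id[symbol] for symbol in structure]

# Decode: map ids back to symbols to confirm the round trip is lossless.
decoded = "".join(id2label[i] for i in encoded)

print(encoded)  # [2, 0, 0, 2, 2, 2, 1, 1, 2]
print(decoded == structure)  # True
```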

### 1.4: Data Acquisition and Loading

With OmniGenBench's enhanced dataset framework, data loading is remarkably simple. The `OmniDatasetForTokenClassification` class automatically handles:

1. **Dataset Download**: Fetching from the OmniGenBench hub
2. **Tokenization**: Converting sequences to model inputs
3. **Label Alignment**: Ensuring labels match tokenized sequences
4. **Data Splits**: Creating train/validation/test sets

This eliminates the need for complex custom dataset classes.

```python
# Initialize tokenizer
tokenizer = OmniTokenizer.from_pretrained(model_name_or_path)
print(f"✅ Tokenizer loaded: {model_name_or_path}")

# Load datasets with automatic preprocessing
print("🏗️ Loading datasets with automatic download...")
datasets = OmniDatasetForTokenClassification.from_hub(
    dataset_name_or_path=dataset_name,
    tokenizer=tokenizer,
    max_length=max_length,
    label2id=label2id,
    max_examples=max_examples,
)

print(f"📊 Datasets loaded: {list(datasets.keys())}")
for split, dataset in datasets.items():
    print(f"  - {split}: {len(dataset)} samples")
```

### 1.5: Understanding the Data Structure

Let's inspect a sample to understand what the model will see during training.

```python
if len(datasets["train"]) > 0:
    sample = datasets["train"][0]
    print("\n🔍 Sample data structure:")
    for key, value in sample.items():
        if isinstance(value, torch.Tensor):
            print(f"  {key}: shape {value.shape}, dtype {value.dtype}")
        else:
            print(f"  {key}: {type(value)}")

    # Show first few tokens and labels
    print(f"\n📝 First 20 tokens: {sample['input_ids'][:20]}")
    print(f"📝 First 20 labels: {sample['labels'][:20]}")
    print("\n💡 Label -100 indicates positions to ignore during training (special tokens)")
```
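Since positions labeled `-100` are ignored, any statistic computed over labels should skip them as well. The sketch below counts the structural classes in a label vector while excluding ignored positions. The helper `label_distribution` and the toy label list are hypothetical; a 1-D label tensor such as `sample['labels']` should work the same way, assuming its elements convert cleanly with `int()`.

```python
from collections import Counter

IGNORE_INDEX = -100  # positions with this label are skipped
id2label = {0: "(", 1: ")", 2: "."}


def label_distribution(labels) -> dict[str, int]:
    """Count each structural class, skipping ignored positions."""
    counts = Counter(int(label) for label in labels if int(label) != IGNORE_INDEX)
    return {id2label[i]: counts.get(i, 0) for i in sorted(id2label)}


# Toy label vector: special token, 9 nucleotides, special token, one pad position.
toy_labels = [-100, 2, 0, 0, 2, 2, 2, 1, 1, 2, -100, -100]
print(label_distribution(toy_labels))  # {'(': 2, ')': 2, '.': 5}
```

A distribution like this is a quick way to spot class imbalance between paired and unpaired positions before training.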

### 1.6: Key Dataset Features

The `OmniDatasetForTokenClassification` provides several important features:

#### A. Automatic Token-Label Alignment
- Tokenization may split characters differently than labels
- The framework ensures labels align with tokenized sequence
- Special tokens ([CLS], [SEP]) are masked with label=-100
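The alignment idea can be sketched in plain Python. The sketch assumes one token per nucleotide, one special token before and one after the sequence, and padding to `max_length`; the actual OmniGenBench implementation may differ, and the helper `align_labels` is hypothetical.

```python
IGNORE_INDEX = -100
label2id = {"(": 0, ")": 1, ".": 2}


def align_labels(structure: str, max_length: int, n_prefix: int = 1, n_suffix: int = 1) -> list[int]:
    """Build labels for [special] + nucleotides + [special] + padding."""
    room = max_length - n_prefix - n_suffix
    body = [label2id[s] for s in structure[:room]]  # truncate over-long structures
    labels = [IGNORE_INDEX] * n_prefix + body + [IGNORE_INDEX] * n_suffix
    labels += [IGNORE_INDEX] * (max_length - len(labels))  # mask padding positions
    return labels


print(align_labels(".((...)).", max_length=12))
# [-100, 2, 0, 0, 2, 2, 2, 1, 1, 2, -100, -100]
```

Every position that does not correspond to a nucleotide receives `-100`, so the label vector always has the same length as the tokenized input.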

#### B. Built-in Optimizations
- Efficient batching for variable-length sequences
- Automatic padding and attention masking
- Memory-efficient data loading
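Padding and attention masking for variable-length sequences can be illustrated as follows. This is a plain-Python sketch rather than the framework's collation code: `pad_batch` is a hypothetical helper, and `pad_token_id=0` is an assumed placeholder for whatever padding id the tokenizer actually uses.

```python
IGNORE_INDEX = -100


def pad_batch(examples: list[dict], pad_token_id: int = 0) -> dict[str, list[list[int]]]:
    """Pad a list of {'input_ids', 'labels'} dicts to the longest example."""
    longest = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n = len(ex["input_ids"])
        gap = longest - n
        batch["input_ids"].append(list(ex["input_ids"]) + [pad_token_id] * gap)
        batch["attention_mask"].append([1] * n + [0] * gap)  # 0 marks padding
        batch["labels"].append(list(ex["labels"]) + [IGNORE_INDEX] * gap)
    return batch


# Toy token ids; real ids depend on the tokenizer vocabulary.
examples = [
    {"input_ids": [5, 6, 7], "labels": [-100, 2, -100]},
    {"input_ids": [5, 6, 8, 9, 7], "labels": [-100, 0, 2, 1, -100]},
]
batch = pad_batch(examples)
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padded label positions are set to `-100` so they are ignored in the same way as special tokens.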

#### C. Validation and Quality Checks
- Ensures sequence and label lengths match
- Validates label vocabulary
- Checks for data quality issues
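Checks of this kind can be approximated with a few lines of Python. The sketch below is illustrative only: `check_record` is a hypothetical helper, and the accepted nucleotide alphabet (`ACGUTN`) is an assumption rather than the set the framework validates against.

```python
VALID_NUCLEOTIDES = set("ACGUTN")  # assumed alphabet for this sketch
VALID_LABELS = set("().")


def check_record(sequence: str, structure: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    if len(sequence) != len(structure):
        problems.append(f"length mismatch: {len(sequence)} vs {len(structure)}")
    bad_bases = sorted(set(sequence.upper()) - VALID_NUCLEOTIDES)
    if bad_bases:
        problems.append(f"unexpected nucleotides: {bad_bases}")
    bad_labels = sorted(set(structure) - VALID_LABELS)
    if bad_labels:
        problems.append(f"unexpected labels: {bad_labels}")
    return problems


print(check_record("AUGCCGUGC", ".((...))."))  # []
print(check_record("AUGC", "(.."))             # ['length mismatch: 4 vs 3']
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole dataset in a single pass.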

This streamlined approach allows us to focus on model design and training rather than data engineering.

## 📚 Summary and Next Steps

In this tutorial, we:
1. ✅ Understood RNA secondary structure as a biological problem
2. ✅ Framed it as a token classification ML task
3. ✅ Selected the appropriate OmniGenBench model class
4. ✅ Loaded and preprocessed the bpRNA dataset in just a few lines

### What We've Accomplished
```python
# Complete data preparation in two calls
tokenizer = OmniTokenizer.from_pretrained(model_name_or_path)
datasets = OmniDatasetForTokenClassification.from_hub(
    dataset_name_or_path=dataset_name,
    tokenizer=tokenizer,
    max_length=max_length,
    label2id=label2id,
)
```

### Next: Model Initialization
Now that our data is ready, proceed to **[02_model_initialization.ipynb](02_model_initialization.ipynb)** to:
- Load the pre-trained OmniGenome foundation model
- Configure it for token classification
- Understand the model architecture for RNA SSP

The journey continues! 🚀