## Introduction
OmniGenome is a comprehensive package for developing and benchmarking pretrained genomic foundation models (gFMs).
OmniGenome has the following key features:
- Automated genomic FM benchmarking on public genomic datasets
- Scalable genomic FM training and fine-tuning on genomic tasks
- Implementations of a diverse set of genomic FMs
- Easy-to-use pipeline for genomic FM development with no coding expertise required
- Accessible OmniGenome Hub for sharing FMs, datasets, and pipelines
- Extensive documentation and tutorials for genomic FM development

This notebook provides a demonstration of OmniGenome's capabilities using the mRNA degradation regression task.


## Requirements
OmniGenome recommends the following dependencies:
- Python 3.9+
- PyTorch 2.0.0+
- Transformers 4.37.0+
- Pandas 1.3.3+
- Additional packages for specific tasks

!pip install -U omnigenbench  # Install OmniGenome package
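
Before running the rest of the notebook, it can help to confirm that the installed packages meet the recommended minimums listed above. The following is a minimal sketch that uses only the standard library; the helper names `parse_version` and `check_requirements` are hypothetical and not part of OmniGenome, and the version parsing is deliberately simplified (it keeps only the leading digits of the first three components).

```python
import re
from importlib import metadata

# Minimum versions copied from the list above; keys are PyPI distribution names.
MINIMUMS = {
    "torch": (2, 0, 0),
    "transformers": (4, 37, 0),
    "pandas": (1, 3, 3),
}


def parse_version(text):
    """Turn '2.1.0+cu118' into (2, 1, 0); local tags and suffixes are ignored."""
    parts = []
    for piece in text.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check_requirements(minimums=MINIMUMS):
    """Return {package: status} without importing the packages themselves."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = parse_version(installed) >= minimum
        report[name] = f"{installed} ({'ok' if ok else 'too old'})"
    return report


for package, status in check_requirements().items():
    print(f"{package}: {status}")
```

A package reported as `missing` or `too old` can then be installed or upgraded with `pip` before continuing.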

## Fine-tuning Genomic FMs on mRNA Degradation Regression Task

mRNA degradation regression predicts how quickly mRNA transcripts degrade, based on their sequences. The dataset comes from the RGB benchmark and pairs mRNA sequences with their measured degradation rates. In this notebook the task is framed as token regression: the model predicts three values per nucleotide (`reactivity`, `deg_Mg_pH10`, and `deg_Mg_50C`).

### Step 1: Import Libraries

In [None]:
import autocuda
import torch
from metric_visualizer import MetricVisualizer

# Token regression: the model predicts a continuous value for each token
# (e.g., each nucleotide base) in the sequence.
from omnigenbench import OmniDatasetForTokenRegression
from omnigenbench import RegressionMetric
from omnigenbench import OmniSingleNucleotideTokenizer, OmniKmersTokenizer
from omnigenbench import OmniModelForTokenRegression
from omnigenbench import Trainer

### Step 2: Define and Initialize the Tokenizer

In [None]:
# This FM is exclusively powered by the OmniGenome package
model_name_or_path = "anonymous8/OmniGenome-52M"

# Generally, we use the tokenizers from the transformers library, such as AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# However, OmniGenome provides specialized tokenizers for genomic data, such as
# the single-nucleotide tokenizer and the k-mers tokenizer.
# Here we explicitly choose the tokenizer that the model will use.
tokenizer = OmniSingleNucleotideTokenizer.from_pretrained(model_name_or_path)
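
To make the difference between the two tokenizer families concrete, here is a minimal pure-Python sketch. It is not the OmniGenome implementation: the functions `single_nucleotide_tokens` and `kmer_tokens` are hypothetical, and real tokenizers typically also add special tokens and map tokens to integer ids.

```python
def single_nucleotide_tokens(sequence):
    """One token per base, so token i lines up with nucleotide i."""
    return list(sequence.upper())


def kmer_tokens(sequence, k=3, stride=1):
    """Overlapping k-mers; with stride=1 a sequence of length n yields n - k + 1 tokens."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


seq = "CAGUGCCGAG"
print(single_nucleotide_tokens(seq))  # 10 tokens, one per base
print(kmer_tokens(seq, k=3))          # 8 overlapping 3-mers
```

Single-nucleotide tokenization keeps a one-to-one alignment between tokens and bases, which is convenient when every base carries its own regression target, as in this task.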

### Step 3: Define and Initialize the Model

In [None]:
# OmniGenome implements a diverse set of genomic models; please refer to the documentation for details
reg_model = OmniModelForTokenRegression(
    model_name_or_path,
    tokenizer=tokenizer,
    num_labels=3,
)

### Step 4: Define and Load the Dataset

In [None]:
import numpy as np

# necessary hyperparameters
epochs = 10
learning_rate = 2e-5
weight_decay = 1e-5
batch_size = 8
max_length = 128
seeds = [45]  # Each seed will be used for one run


# The Dataset class is a subclass of OmniDatasetForTokenRegression, which is designed for token regression tasks.

class Dataset(OmniDatasetForTokenRegression):
    def __init__(self, data_source, tokenizer, max_length, **kwargs):
        super().__init__(data_source, tokenizer, max_length, **kwargs)

    def prepare_input(self, instance, **kwargs):
        target_cols = ["reactivity", "deg_Mg_pH10", "deg_Mg_50C"]
        instance["sequence"] = f'{instance["sequence"]}'
        tokenized_inputs = self.tokenizer(
            instance["sequence"],
            padding=kwargs.get("padding", "do_not_pad"),
            truncation=kwargs.get("truncation", True),
            max_length=self.max_length,
            return_tensors="pt",
        )
        labels = [instance[target_col] for target_col in target_cols]
        # Pad each label column with -100 up to the tokenized length so that
        # labels align with input_ids; -100 marks positions to ignore.
        pad_len = len(tokenized_inputs["input_ids"].squeeze()) - len(labels[0])
        padding = np.full((len(labels), max(pad_len, 0)), -100)
        labels = np.concatenate([np.array(labels), padding], axis=1).T
        tokenized_inputs["labels"] = torch.tensor(labels, dtype=torch.float32)
        for col in tokenized_inputs:
            tokenized_inputs[col] = tokenized_inputs[col].squeeze()
        return tokenized_inputs

# Paths to the training and test splits
train_file = "toy_datasets/RNA-mRNA/train.json"
test_file = "toy_datasets/RNA-mRNA/test.json"



train_set = Dataset(
    data_source=train_file,
    tokenizer=tokenizer,
    max_length=max_length,
)
test_set = Dataset(
    data_source=test_file,
    tokenizer=tokenizer,
    max_length=max_length,
)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=batch_size, shuffle=True
)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size)
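
The label handling in the `Dataset` class above can be hard to follow, so here is a dependency-free sketch of the same idea: each per-base label column is padded with `-100` up to the number of tokens, then the columns are transposed so that every token gets one row of three targets. The helper `pad_token_labels` and the sample values are hypothetical and only illustrate the shapes involved. Like the class above, the sketch appends padding at the end and does not truncate label columns that are longer than the token sequence.

```python
IGNORE_VALUE = -100  # same sentinel the Dataset class above uses for unlabeled positions


def pad_token_labels(label_columns, num_tokens, ignore_value=IGNORE_VALUE):
    """Pad each label column to num_tokens, then transpose to (num_tokens, num_targets)."""
    padded = []
    for column in label_columns:
        pad_len = max(num_tokens - len(column), 0)
        padded.append(list(column) + [ignore_value] * pad_len)
    # zip(*padded) turns the target columns into one row of targets per token
    return [list(row) for row in zip(*padded)]


# Hypothetical record: 3 labeled bases, 5 tokens after tokenization
columns = [
    [0.1, 0.2, 0.3],  # reactivity
    [0.4, 0.5, 0.6],  # deg_Mg_pH10
    [0.7, 0.8, 0.9],  # deg_Mg_50C
]
labels = pad_token_labels(columns, num_tokens=5)
print(len(labels), len(labels[0]))  # 5 3
print(labels[0])  # [0.1, 0.4, 0.7]
print(labels[3])  # [-100, -100, -100]
```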

### Step 5: Define the Metrics
OmniGenome implements a diverse set of genomic metrics; please refer to the documentation for details.
Users can also define their own metrics by inheriting from the `OmniGenomeMetric` class.
`compute_metrics` can be a list of metric functions, each of which should return a dictionary of metrics.

In [None]:
compute_metrics = [
    RegressionMetric(ignore_y=-100).root_mean_squared_error,
    RegressionMetric(ignore_y=-100).r2_score,
]
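
The `ignore_y=-100` argument above matches the padding value used for unlabeled positions in Step 4. As a rough illustration of what such masking amounts to, the sketch below drops every position whose target equals the ignore value before computing RMSE and R². It assumes this is the intended semantics; the functions are hypothetical stand-ins written for this example, not the library's code.

```python
import math


def masked_pairs(y_true, y_pred, ignore_y=-100):
    """Keep only (target, prediction) pairs whose target is not the ignore value."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != ignore_y]
    if not pairs:
        raise ValueError("no labeled positions left after masking")
    return pairs


def masked_rmse(y_true, y_pred, ignore_y=-100):
    pairs = masked_pairs(y_true, y_pred, ignore_y)
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


def masked_r2(y_true, y_pred, ignore_y=-100):
    # Note: R^2 is undefined when all remaining targets are identical (ss_tot == 0).
    pairs = masked_pairs(y_true, y_pred, ignore_y)
    mean_t = sum(t for t, _ in pairs) / len(pairs)
    ss_res = sum((t - p) ** 2 for t, p in pairs)
    ss_tot = sum((t - mean_t) ** 2 for t, _ in pairs)
    return 1.0 - ss_res / ss_tot


# Hypothetical values: the last two positions are padding and must not affect the scores
y_true = [1.0, 2.0, 3.0, -100, -100]
y_pred = [1.0, 2.0, 2.0, 0.5, 0.7]
print(masked_rmse(y_true, y_pred))
print(masked_r2(y_true, y_pred))
```

Without the mask, the padded positions would be scored against the literal value `-100` and dominate both metrics.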


### Step 6: Define and Initialize the Trainer

In [None]:
# Train and evaluate the model once per seed

for seed in seeds:
    optimizer = torch.optim.AdamW(
        reg_model.parameters(), lr=learning_rate, weight_decay=weight_decay
    )
    trainer = Trainer(
        model=reg_model,
        train_loader=train_loader,
        test_loader=test_loader,
        batch_size=batch_size,
        epochs=epochs,
        optimizer=optimizer,
        compute_metrics=compute_metrics,
        seeds=seed,
        device=autocuda.auto_cuda(),
    )

    metrics = trainer.train()
    test_metrics = metrics["test"]
    print(metrics)


### Step 7: Experimental Results Visualization
The experimental results are visualized in the following plots. The plots show the root mean squared error (RMSE) and R² score of the model on the test set for each run. The average RMSE and R² score are also shown.
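
When several seeds are used, the per-run test metrics can be reduced to a mean and standard deviation before plotting. The sketch below shows one way to do this with the standard library. It assumes each run has already been reduced to a flat dictionary of metric names and values; the helper `summarize_runs`, the metric key names, and the numbers are placeholders for illustration, not outputs of this notebook or of the `Trainer`.

```python
from statistics import mean, stdev


def summarize_runs(runs):
    """runs: list of {metric_name: value} dicts, one per seed."""
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        # A single run has no spread, so report 0.0 instead of raising
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[name] = (mean(values), spread)
    return summary


# Placeholder numbers for illustration only; they are not results from this notebook.
runs = [
    {"root_mean_squared_error": 0.50, "r2_score": 0.40},
    {"root_mean_squared_error": 0.60, "r2_score": 0.30},
]
summary = summarize_runs(runs)
for name, (avg, spread) in summary.items():
    print(f"{name}: {avg:.3f} +/- {spread:.3f}")
```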

### Step 8: Model Checkpoint for Sharing
The model checkpoint can be saved and shared with others for further use. The following code saves the checkpoint, reloads it, and runs inference on a sample sequence:

In [None]:
path_to_save = "OmniGenome-52M-mRNA"
reg_model.save(path_to_save, overwrite=True)

# Load the model checkpoint
reg_model = reg_model.load(path_to_save)
results = reg_model.inference("CAGUGCCGAGGCCACGCGGAGAACGAUCGAGGGUACAGCACUA")
print(results["predictions"])
print("logits:", results["logits"])

### Step 9: Ready-to-use Models from Fine-tuning
All the models trained in this tutorial are available on the OmniGenome Hub, a Hugging Face Space for sharing models, datasets, and pipelines. Users can easily access and use these models for their own tasks.

In [None]:
# We can load the model checkpoint using the ModelHub
from omnigenbench import ModelHub

mrna_model = ModelHub.load("OmniGenome-52M-mRNA")
results = mrna_model.inference("CAGUGCCGAGGCCACGCGGAGAACGAUCGAGGGUACAGCACUA")
print(results["predictions"])
print("logits:", results["logits"])