# OmniGenome - A Demonstration based on mRNA Degradation Regression
GitHub: https://github.com/yangheng95/OmniGenome
OmniGenome Hub: Huggingface Spaces


## Introduction
OmniGenome is a comprehensive package for developing and benchmarking pretrained genomic foundation models (gFMs).
OmniGenome has the following key features:
- Automated genomic FM benchmarking on public genomic datasets
- Scalable genomic FM training and fine-tuning on genomic tasks
- Diversified genomic FMs implementation
- Easy-to-use pipeline for genomic FM development with no coding expertise required
- Accessible OmniGenome Hub for sharing FMs, datasets, and pipelines
- Extensive documentation and tutorials for genomic FM development

This notebook provides a demonstration of OmniGenome's capabilities using the mRNA degradation regression task.


## Requirements
OmniGenome recommends the following dependencies:
- Python 3.9+
- PyTorch 2.0.0+
- Transformers 4.37.0+
- Pandas 1.3.3+
- Additional packages as required by specific tasks

!pip install -U omnigenome  # Install OmniGenome package

## Fine-tuning Genomic FMs on mRNA Degradation Regression Task

mRNA degradation regression predicts the degradation rate of mRNA transcripts from their sequences. The dataset comes from the RGB benchmark and contains mRNA sequences with their corresponding degradation rates. The goal is to train a model that accurately predicts these rates from the sequence alone.
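
To make the task concrete, the sketch below shows how one record could be laid out, using the three target columns that the `Dataset` class in Step 4 reads (`reactivity`, `deg_Mg_pH10`, `deg_Mg_50C`). The sequence and all numeric values are invented for illustration, and `per_nucleotide_targets` is a hypothetical helper, not part of OmniGenome.

```python
# Hypothetical record: the field names follow the target columns used in Step 4,
# but the sequence and the values are invented for illustration only.
record = {
    "sequence": "GGAAAC",
    "reactivity": [0.33, 1.57, 1.12, 0.45, 0.20, 0.10],
    "deg_Mg_pH10": [0.76, 2.10, 0.96, 0.34, 0.25, 0.15],
    "deg_Mg_50C": [0.36, 1.20, 0.64, 0.28, 0.22, 0.12],
}
target_cols = ["reactivity", "deg_Mg_pH10", "deg_Mg_50C"]


def per_nucleotide_targets(record, target_cols):
    """Return one (base, [target values]) pair per nucleotide."""
    columns = [record[col] for col in target_cols]
    # zip(*columns) transposes per-target lists into per-position rows.
    rows = [list(values) for values in zip(*columns)]
    return list(zip(record["sequence"], rows))


pairs = per_nucleotide_targets(record, target_cols)
for base, targets in pairs:
    print(base, targets)
```

Each nucleotide ends up with three regression targets, which is why the model in Step 3 is created with `num_labels=3`.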

### Step 1: Import Libraries

In [1]:
import autocuda
import torch
from metric_visualizer import MetricVisualizer

from omnigenome import OmniGenomeDatasetForTokenRegression  # Token regression means that the model predicts a continuous value for each token (e.g., nucleotide base) in the sequence.
from omnigenome import RegressionMetric
from omnigenome import OmniSingleNucleotideTokenizer, OmniKmersTokenizer
from omnigenome import OmniGenomeModelForTokenRegression
from omnigenome import Trainer


### Step 2: Define and Initialize the Tokenizer

In [2]:
# This FM is exclusively powered by the OmniGenome package
model_name_or_path = "anonymous8/OmniGenome-52M"

# Generally, we use the tokenizers from transformers library, such as AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# However, OmniGenome provides specialized tokenizers for genomic data, such as the single-nucleotide tokenizer and the k-mers tokenizer.
# Here we explicitly choose the tokenizer that the model will use.
tokenizer = OmniSingleNucleotideTokenizer.from_pretrained(model_name_or_path)
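
As a rough illustration of the difference between the two tokenization schemes mentioned above, the following sketch splits a sequence into single nucleotides and into overlapping k-mers. It is a plain-Python approximation under the assumption of a stride-1 sliding window; it is not the implementation of `OmniSingleNucleotideTokenizer` or `OmniKmersTokenizer`, and both helper names are hypothetical.

```python
def single_nucleotide_tokens(sequence):
    """Split a sequence into one token per base."""
    return list(sequence)


def kmer_tokens(sequence, k=3, stride=1):
    """Split a sequence into k-mers using a sliding window (hypothetical helper)."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


seq = "AUGGCU"
print(single_nucleotide_tokens(seq))  # ['A', 'U', 'G', 'G', 'C', 'U']
print(kmer_tokens(seq, k=3))          # ['AUG', 'UGG', 'GGC', 'GCU']
```

With one token per base, per-nucleotide labels line up directly with tokens, which is convenient for token regression. With k-mers, the number of tokens no longer equals the number of bases.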

### Step 3: Define and Initialize the Model

In [3]:
# OmniGenome implements a diverse set of genomic models; please refer to the documentation for more details
reg_model = OmniGenomeModelForTokenRegression(
    model_name_or_path,
    tokenizer=tokenizer,
    num_labels=3,
)

Some weights of OmniGenomeModel were not initialized from the model checkpoint at anonymous8/OmniGenome-52M and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[2025-06-28 19:52:59] [OmniGenome 0.2.6alpha2]  Model Name: OmniGenomeModelForTokenRegression
Model Metadata: {'library_name': 'OmniGenome', 'omnigenome_version': '0.2.6alpha2', 'torch_version': '2.7.0+cu128+cu12.8+git134179474539648ba7dee1317959529fbd0e7f89', 'transformers_version': '4.52.4', 'model_cls': 'OmniGenomeModelForTokenRegression', 'tokenizer_cls': 'OmniSingleNucleotideTokenizer', 'model_name': 'OmniGenomeModelForTokenRegression'}
Base Model Name: anonymous8/OmniGenome-52M
Model Type: omnigenome
Model Architecture: None
Model Parameters: 52.453345 M
Model Config: OmniGenomeConfig {
  "OmniGenomefold_config": null,
  "attention_probs_dropout_prob": 0.0,
  "auto_map": {
    "AutoConfig": "anonymous8/OmniGenome-52M--configuration_omnigenome.OmniGenomeConfig",
    "AutoModel": "anonymous8/OmniGenome-52M--modeling_omnigenome.OmniGenomeModel",
    "AutoModelForMaskedLM": "anonymous8/OmniGenome-52M--modeling_omnigenome.OmniGenomeForMaskedLM",
    "AutoModelForSeq2SeqLM": "anonymous

### Step 4: Define and Load the Dataset

In [4]:
import numpy as np

# necessary hyperparameters
epochs = 10
learning_rate = 2e-5
weight_decay = 1e-5
batch_size = 8
max_length = 128
seeds = [45]  # Each seed will be used for one run


# The Dataset class is a subclass of OmniGenomeDatasetForTokenRegression, which is designed for token regression tasks.

class Dataset(OmniGenomeDatasetForTokenRegression):
    def __init__(self, data_source, tokenizer, max_length, **kwargs):
        super().__init__(data_source, tokenizer, max_length, **kwargs)

    def prepare_input(self, instance, **kwargs):
        target_cols = ["reactivity", "deg_Mg_pH10", "deg_Mg_50C"]
        instance["sequence"] = f'{instance["sequence"]}'
        tokenized_inputs = self.tokenizer(
            instance["sequence"],
            padding=kwargs.get("padding", "do_not_pad"),
            truncation=kwargs.get("truncation", True),
            max_length=self.max_length,
            return_tensors="pt",
        )
        labels = [instance[target_col] for target_col in target_cols]
        # Pad every target column with -100 up to the tokenized length;
        # -100 marks positions that carry no label.
        pad_len = max(len(tokenized_inputs["input_ids"].squeeze()) - len(labels[0]), 0)
        labels = np.concatenate(
            [np.array(labels), np.full((len(target_cols), pad_len), -100)],
            axis=1,
        ).T
        tokenized_inputs["labels"] = torch.tensor(labels, dtype=torch.float32)
        for col in tokenized_inputs:
            tokenized_inputs[col] = tokenized_inputs[col].squeeze()
        return tokenized_inputs

# Load the dataset according to the path
train_file = "toy_datasets/RNA-mRNA/train.json"
test_file = "toy_datasets/RNA-mRNA/test.json"



train_set = Dataset(
    data_source=train_file,
    tokenizer=tokenizer,
    max_length=max_length,
)
test_set = Dataset(
    data_source=test_file,
    tokenizer=tokenizer,
    max_length=max_length,
)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=batch_size, shuffle=True
)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size)

[2025-06-28 19:52:59] [OmniGenome 0.2.6alpha2]  Detected max_length=128 in the dataset, using it as the max_length.
[2025-06-28 19:52:59] [OmniGenome 0.2.6alpha2]  Loading data from toy_datasets/RNA-mRNA/train.json...
[2025-06-28 19:52:59] [OmniGenome 0.2.6alpha2]  Loaded 1728 examples from toy_datasets/RNA-mRNA/train.json
[2025-06-28 19:52:59] [OmniGenome 0.2.6alpha2]  Detected shuffle=True, shuffling the examples...


100%|██████████| 1728/1728 [00:00<00:00, 2359.50it/s]


[2025-06-28 19:53:00] [OmniGenome 0.2.6alpha2]  Max sequence length updated -> Reset max_length=112, label_padding_length=112
[2025-06-28 19:53:00] [OmniGenome 0.2.6alpha2]  {'avg_seq_len': np.float64(109.0), 'max_seq_len': np.int64(109), 'min_seq_len': np.int64(109), 'avg_label_len': np.float64(112.0), 'max_label_len': np.int64(112), 'min_label_len': np.int64(112)}
[2025-06-28 19:53:00] [OmniGenome 0.2.6alpha2]  Preview of the first two samples in the dataset:
[2025-06-28 19:53:00] [OmniGenome 0.2.6alpha2]  {'input_ids': tensor([0, 6, 6, 4, 4, 4, 9, 4, 6, 5, 6, 4, 5, 6, 4, 9, 6, 5, 9, 4, 4, 9, 6, 4,
        9, 5, 5, 4, 9, 5, 5, 4, 4, 6, 6, 4, 9, 5, 6, 6, 6, 9, 6, 6, 6, 4, 6, 4,
        6, 9, 6, 6, 5, 9, 6, 5, 6, 5, 9, 9, 4, 4, 9, 6, 5, 6, 4, 5, 6, 4, 9, 6,
        9, 6, 4, 6, 9, 9, 5, 6, 5, 9, 5, 4, 9, 4, 9, 4, 4, 4, 4, 6, 4, 4, 4, 5,
        4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 2, 1, 1, 1]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

100%|██████████| 192/192 [00:00<00:00, 2395.93it/s]


[2025-06-28 19:53:00] [OmniGenome 0.2.6alpha2]  Max sequence length updated -> Reset max_length=112, label_padding_length=112
[2025-06-28 19:53:00] [OmniGenome 0.2.6alpha2]  {'avg_seq_len': np.float64(109.0), 'max_seq_len': np.int64(109), 'min_seq_len': np.int64(109), 'avg_label_len': np.float64(112.0), 'max_label_len': np.int64(112), 'min_label_len': np.int64(112)}
[2025-06-28 19:53:00] [OmniGenome 0.2.6alpha2]  Preview of the first two samples in the dataset:
[2025-06-28 19:53:00] [OmniGenome 0.2.6alpha2]  {'input_ids': tensor([0, 6, 6, 4, 4, 4, 6, 4, 5, 6, 4, 5, 4, 6, 6, 6, 9, 5, 4, 9, 9, 6, 4, 5,
        6, 6, 9, 5, 6, 5, 6, 9, 6, 4, 5, 4, 6, 9, 5, 6, 9, 5, 4, 9, 6, 4, 5, 6,
        6, 5, 9, 5, 6, 5, 6, 9, 9, 4, 5, 6, 6, 6, 9, 4, 6, 5, 5, 5, 6, 6, 5, 5,
        5, 4, 5, 6, 9, 9, 5, 6, 5, 6, 9, 6, 6, 6, 5, 4, 4, 4, 4, 6, 4, 4, 4, 5,
        4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 2, 1, 1, 1]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
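
The label handling in `prepare_input` can be hard to read at first glance. The sketch below reproduces the idea in plain Python: per-target label lists are transposed into per-token rows and then padded with `-100` up to the tokenized length. `pad_label_rows` is a hypothetical helper written for illustration, and the label values are invented.

```python
IGNORE_VALUE = -100  # same sentinel value the tutorial passes as ignore_y in Step 5


def pad_label_rows(label_columns, num_tokens, ignore_value=IGNORE_VALUE):
    """Transpose per-target label lists into per-token rows, padded to num_tokens."""
    rows = [list(values) for values in zip(*label_columns)]
    pad_rows = max(num_tokens - len(rows), 0)
    rows.extend([[ignore_value] * len(label_columns) for _ in range(pad_rows)])
    return rows


# Three targets with two labeled positions each (invented values).
label_columns = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
padded = pad_label_rows(label_columns, num_tokens=4)
print(padded)
```

The result has one row per token and one column per target, matching the `(sequence_length, 3)` label layout that the `.T` transpose produces in the cell above.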

### Step 5: Define the Metrics
OmniGenome implements a diverse set of genomic metrics; please refer to the documentation for more details.
Users can also define their own metrics by inheriting from the `OmniGenomeMetric` class.
`compute_metrics` can be a list of metric functions, each of which should return a dictionary of metrics.

In [5]:
compute_metrics = [
    RegressionMetric(ignore_y=-100).root_mean_squared_error,
    RegressionMetric(ignore_y=-100).r2_score,
]
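
To show what ignoring a label value means for these two metrics, here is a minimal pure-Python sketch of a masked RMSE and a masked R² score. It illustrates the standard formulas under the assumption that positions whose true value equals `ignore_y` are dropped before scoring; it is not the source code of `RegressionMetric`, and the function names and sample values are hypothetical.

```python
import math


def masked_pairs(y_true, y_pred, ignore_y=-100):
    """Keep only (true, predicted) pairs whose true value is not the ignore value."""
    return [(t, p) for t, p in zip(y_true, y_pred) if t != ignore_y]


def masked_rmse(y_true, y_pred, ignore_y=-100):
    pairs = masked_pairs(y_true, y_pred, ignore_y)
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


def masked_r2(y_true, y_pred, ignore_y=-100):
    pairs = masked_pairs(y_true, y_pred, ignore_y)
    mean_true = sum(t for t, _ in pairs) / len(pairs)
    ss_res = sum((t - p) ** 2 for t, p in pairs)       # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t, _ in pairs)  # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, -100, -100]  # the last two positions are padding
y_pred = [1.0, 2.0, 4.0, 0.7, 0.2]
print(masked_rmse(y_true, y_pred))
print(masked_r2(y_true, y_pred))
```

An R² below zero means the predictions fit worse than always predicting the mean of the labels.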


### Step 6: Define and Initialize the Trainer

In [6]:
# Train and evaluate the model once for each seed

for seed in seeds:
    optimizer = torch.optim.AdamW(
        reg_model.parameters(), lr=learning_rate, weight_decay=weight_decay
    )
    trainer = Trainer(
        model=reg_model,
        train_loader=train_loader,
        test_loader=test_loader,
        batch_size=batch_size,
        epochs=epochs,
        optimizer=optimizer,
        compute_metrics=compute_metrics,
        seeds=seed,
        device=autocuda.auto_cuda(),
    )

    metrics = trainer.train()
    test_metrics = metrics["test"]
    print(metrics)


Testing: 100%|██████████| 24/24 [00:00<00:00, 24.55it/s]


[2025-06-28 19:53:02] [OmniGenome 0.2.6alpha2]  {'root_mean_squared_error': 0.9285820722579956, 'r2_score': -0.17025935649871826}


Epoch 1/10 Loss: 0.4778: 100%|██████████| 216/216 [00:19<00:00, 11.05it/s]
Testing: 100%|██████████| 24/24 [00:00<00:00, 30.32it/s]


[2025-06-28 19:53:22] [OmniGenome 0.2.6alpha2]  {'root_mean_squared_error': 0.7482582330703735, 'r2_score': 0.24012082815170288}


Epoch 2/10 Loss: 0.4234: 100%|██████████| 216/216 [00:19<00:00, 11.17it/s]
Testing: 100%|██████████| 24/24 [00:00<00:00, 31.49it/s]


[2025-06-28 19:53:43] [OmniGenome 0.2.6alpha2]  {'root_mean_squared_error': 0.7402945756912231, 'r2_score': 0.2562095522880554}


Epoch 3/10 Loss: 0.3996: 100%|██████████| 216/216 [00:19<00:00, 10.91it/s]
Testing: 100%|██████████| 24/24 [00:00<00:00, 30.87it/s]


[2025-06-28 19:54:03] [OmniGenome 0.2.6alpha2]  {'root_mean_squared_error': 0.7400482892990112, 'r2_score': 0.25670433044433594}


Epoch 4/10 Loss: 0.3813: 100%|██████████| 216/216 [00:19<00:00, 11.15it/s]
Testing: 100%|██████████| 24/24 [00:00<00:00, 31.41it/s]


[2025-06-28 19:54:24] [OmniGenome 0.2.6alpha2]  {'root_mean_squared_error': 0.727670431137085, 'r2_score': 0.2813607454299927}


Epoch 5/10 Loss: 0.3613: 100%|██████████| 216/216 [00:19<00:00, 11.04it/s]
Testing: 100%|██████████| 24/24 [00:00<00:00, 31.18it/s]


[2025-06-28 19:54:44] [OmniGenome 0.2.6alpha2]  {'root_mean_squared_error': 0.7274799942970276, 'r2_score': 0.2817367911338806}


Epoch 6/10 Loss: 0.3424: 100%|██████████| 216/216 [00:19<00:00, 11.09it/s]
Testing: 100%|██████████| 24/24 [00:00<00:00, 30.08it/s]


[2025-06-28 19:55:05] [OmniGenome 0.2.6alpha2]  {'root_mean_squared_error': 0.7362809777259827, 'r2_score': 0.2642527222633362}


Epoch 7/10 Loss: 0.3205: 100%|██████████| 216/216 [00:19<00:00, 11.12it/s]
Testing: 100%|██████████| 24/24 [00:00<00:00, 31.62it/s]


[2025-06-28 19:55:25] [OmniGenome 0.2.6alpha2]  {'root_mean_squared_error': 0.7278608679771423, 'r2_score': 0.2809845209121704}


Epoch 8/10 Loss: 0.3040: 100%|██████████| 216/216 [00:19<00:00, 11.03it/s]
Testing: 100%|██████████| 24/24 [00:00<00:00, 30.57it/s]


[2025-06-28 19:55:45] [OmniGenome 0.2.6alpha2]  {'root_mean_squared_error': 0.7265374064445496, 'r2_score': 0.283596932888031}


Epoch 9/10 Loss: 0.2872: 100%|██████████| 216/216 [00:19<00:00, 11.08it/s]
Testing: 100%|██████████| 24/24 [00:00<00:00, 31.41it/s]


[2025-06-28 19:56:06] [OmniGenome 0.2.6alpha2]  {'root_mean_squared_error': 0.7455348968505859, 'r2_score': 0.24564212560653687}


Epoch 10/10 Loss: 0.2715: 100%|██████████| 216/216 [00:19<00:00, 11.18it/s]
Testing: 100%|██████████| 24/24 [00:00<00:00, 30.80it/s]


[2025-06-28 19:56:26] [OmniGenome 0.2.6alpha2]  {'root_mean_squared_error': 0.740329384803772, 'r2_score': 0.2561395764350891}


Testing: 100%|██████████| 24/24 [00:00<00:00, 30.32it/s]


[2025-06-28 19:56:27] [OmniGenome 0.2.6alpha2]  {'root_mean_squared_error': 0.7265374064445496, 'r2_score': 0.283596932888031}
{'valid': [{'root_mean_squared_error': 0.9285820722579956, 'r2_score': -0.17025935649871826}, {'root_mean_squared_error': 0.7482582330703735, 'r2_score': 0.24012082815170288}, {'root_mean_squared_error': 0.7402945756912231, 'r2_score': 0.2562095522880554}, {'root_mean_squared_error': 0.7400482892990112, 'r2_score': 0.25670433044433594}, {'root_mean_squared_error': 0.727670431137085, 'r2_score': 0.2813607454299927}, {'root_mean_squared_error': 0.7274799942970276, 'r2_score': 0.2817367911338806}, {'root_mean_squared_error': 0.7362809777259827, 'r2_score': 0.2642527222633362}, {'root_mean_squared_error': 0.7278608679771423, 'r2_score': 0.2809845209121704}, {'root_mean_squared_error': 0.7265374064445496, 'r2_score': 0.283596932888031}, {'root_mean_squared_error': 0.7455348968505859, 'r2_score': 0.24564212560653687}, {'root_mean_squared_error': 0.740329384803772, 'r

### Step 7: Experimental Results Visualization
The experimental results are visualized in the following plots. The plots show the root mean squared error (RMSE) and R² score of the model on the test set for each run. The average RMSE and R² score are also shown.
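
The per-epoch metrics logged in Step 6 can also be summarized without a plotting library. The sketch below copies the logged RMSE and R² values into a list and picks the evaluation with the lowest RMSE. It assumes index 0 is the evaluation that ran before training and index n the evaluation after epoch n, as the order of the log suggests.

```python
# (rmse, r2) per evaluation, copied from the Step 6 training log.
# Assumption: index 0 is the evaluation before training, index n follows epoch n.
history = [
    (0.9285820722579956, -0.17025935649871826),
    (0.7482582330703735, 0.24012082815170288),
    (0.7402945756912231, 0.2562095522880554),
    (0.7400482892990112, 0.25670433044433594),
    (0.727670431137085, 0.2813607454299927),
    (0.7274799942970276, 0.2817367911338806),
    (0.7362809777259827, 0.2642527222633362),
    (0.7278608679771423, 0.2809845209121704),
    (0.7265374064445496, 0.283596932888031),
    (0.7455348968505859, 0.24564212560653687),
    (0.740329384803772, 0.2561395764350891),
]

# Select the evaluation with the lowest RMSE.
best_epoch = min(range(len(history)), key=lambda i: history[i][0])
best_rmse, best_r2 = history[best_epoch]
print(f"best epoch: {best_epoch}, RMSE={best_rmse:.4f}, R2={best_r2:.4f}")
```

The lowest RMSE occurs after epoch 8, and its values match the final test-set line in the log. That is consistent with the trainer evaluating the best checkpoint at the end, although the log alone does not confirm it.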

### Step 8: Model Checkpoint for Sharing
The model checkpoint can be saved and shared with others for further use. The following code saves the checkpoint, reloads it, and runs inference on an example sequence:

In [10]:
path_to_save = "OmniGenome-52M-mRNA"
reg_model.save(path_to_save, overwrite=True)

# Load the model checkpoint
reg_model = reg_model.load(path_to_save)
results = reg_model.inference("CAGUGCCGAGGCCACGCGGAGAACGAUCGAGGGUACAGCACUA")
print(results["predictions"])
print("logits:", results["logits"])

[2025-06-28 19:56:44] [OmniGenome 0.2.6alpha2]  The model is saved to OmniGenome-52M-mRNA.
[2025-06-28 19:56:44] [OmniGenome 0.2.6alpha2]  Restored loss function: MSELoss from torch.nn.modules.loss
tensor([[ 0.3204,  0.6764,  0.5234],
        [ 0.0991,  0.3945,  0.3255],
        [ 0.1239,  0.2418,  0.1799],
        [ 0.1493,  0.3023,  0.1130],
        [ 0.1295,  0.2456,  0.0956],
        [ 0.1754,  0.2625,  0.1353],
        [ 0.1816,  0.5658,  0.3898],
        [ 0.2379,  0.3225,  0.4089],
        [ 0.2858,  0.1182,  0.1320],
        [ 0.2132,  0.3564,  0.3354],
        [ 0.0741,  0.3589,  0.2660],
        [ 0.2004,  0.4058,  0.2843],
        [ 0.1228,  0.2852,  0.2496],
        [ 0.1590,  0.2354,  0.2072],
        [ 0.1509,  0.2023,  0.2320],
        [ 0.0360,  0.2027,  0.2244],
        [ 0.1118,  0.2864,  0.2170],
        [ 0.4847,  0.6201,  0.4272],
        [ 0.5510,  0.5563,  0.5126],
        [ 1.6754,  1.0782,  0.8244],
        [ 1.4020,  0.6246,  0.6822],
        [ 0.5579,  0.4807
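
Each row of the printed predictions holds three values. The sketch below attaches target names to the first three rows shown above, assuming the output columns follow the order of `target_cols` used when building the labels in Step 4. That ordering is an assumption, as is any correspondence between a given row and a specific nucleotide, since the tokenizer may add special tokens.

```python
# Assumption: output columns follow the label order defined in Step 4.
target_cols = ["reactivity", "deg_Mg_pH10", "deg_Mg_50C"]

# First three rows copied from the printed predictions above.
rows = [
    [0.3204, 0.6764, 0.5234],
    [0.0991, 0.3945, 0.3255],
    [0.1239, 0.2418, 0.1799],
]

# Pair each value with its (assumed) target name.
named = [dict(zip(target_cols, row)) for row in rows]
for position, values in enumerate(named):
    print(position, values)
```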

### Step 9: Ready-to-use Models from Fine-tuning
All the models trained in this tutorial are available on the OmniGenome Hub, a Hugging Face Space for sharing models, datasets, and pipelines. Users can easily access and use these models for their own tasks.

In [9]:
# We can load the model checkpoint using the ModelHub
from omnigenome import ModelHub

mrna_model = ModelHub.load("OmniGenome-52M-mRNA")
results = mrna_model.inference("CAGUGCCGAGGCCACGCGGAGAACGAUCGAGGGUACAGCACUA")
print(results["predictions"])
print("logits:", results["logits"])

[2025-06-28 19:56:40] [OmniGenome 0.2.6alpha2]  Model Name: OmniGenomeModelForTokenRegression
Model Metadata: {'library_name': 'OmniGenome', 'omnigenome_version': '0.2.6alpha2', 'torch_version': '2.7.0+cu128+cu12.8+git134179474539648ba7dee1317959529fbd0e7f89', 'transformers_version': '4.52.4', 'model_cls': 'OmniGenomeModelForTokenRegression', 'tokenizer_cls': 'OmniSingleNucleotideTokenizer', 'model_name': 'OmniGenomeModelForTokenRegression'}
Base Model Name: OmniGenome-52M-mRNA
Model Type: omnigenome
Model Architecture: ['OmniGenomeModel']
Model Parameters: 52.453345 M
Model Config: OmniGenomeConfig {
  "OmniGenomefold_config": null,
  "architectures": [
    "OmniGenomeModel"
  ],
  "attention_probs_dropout_prob": 0.0,
  "auto_map": {
    "AutoConfig": "anonymous8/OmniGenome-52M--configuration_omnigenome.OmniGenomeConfig",
    "AutoModel": "anonymous8/OmniGenome-52M--modeling_omnigenome.OmniGenomeModel",
    "AutoModelForMaskedLM": "anonymous8/OmniGenome-52M--modeling_omnigenome.OmniGe