# Genomic Foundation Model Auto-Benchmarking
This script automatically benchmarks a genomic foundation model on a diverse set of downstream tasks.
The benchmark pipeline is automated and built on the OmniGenome package.
Once your foundation model is trained, you can use this script to evaluate it.
The script loads the datasets, preprocesses the data, and evaluates the model on each task.
It then reports the model's performance on each task.

## [Optional] Prepare your own benchmark datasets
We provide a set of benchmark datasets in the tutorials, which you can use to evaluate the model.
To evaluate the model on your own datasets, prepare them as follows:
1. Prepare the datasets in the following format:
    - The datasets should be in the `json` format.
    - The datasets should contain two columns: `sequence` and `label`.
    - The `sequence` column should contain the DNA sequences.
    - The `label` column should contain the labels of the sequences.
2. Save the datasets in a folder like the existing benchmark datasets. This folder is referred to as the `root` in the script.
3. Place the model and tokenizer in an accessible folder.
4. If the tokenizer does not work well with the datasets, you can write a custom tokenizer and model wrapper in the `omnigenome_wrapper.py` file.
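
As a rough illustration of step 1, the sketch below writes a tiny dataset with `sequence` and `label` fields and reads it back. It assumes one JSON object per line and a file named `train.json` inside a hypothetical `my_benchmark/my_task` folder. Both are assumptions, so check an existing benchmark dataset for the exact layout that OmniGenome expects. The sequences and labels are placeholders.

```python
import json
import tempfile
from pathlib import Path


def write_records(path, records):
    """Write one JSON object per line (assumed layout, not confirmed by the docs)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_records(path):
    """Read the file back and check that every record has both required columns."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            missing = {"sequence", "label"} - set(record)
            if missing:
                raise ValueError(f"record is missing columns: {sorted(missing)}")
            records.append(record)
    return records


# Placeholder examples; real datasets would hold your own sequences and labels.
examples = [
    {"sequence": "ACGTACGTAC", "label": 0},
    {"sequence": "GGCATTCGAA", "label": 1},
]

# "my_benchmark/my_task" is a hypothetical folder layout under the benchmark root.
task_dir = Path(tempfile.mkdtemp()) / "my_benchmark" / "my_task"
task_dir.mkdir(parents=True)
write_records(task_dir / "train.json", examples)

loaded = read_records(task_dir / "train.json")
print(len(loaded), "records loaded")
```

Validating the columns when reading the file back catches a misspelled field name before a long benchmark run starts.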

Detailed documentation on writing the custom tokenizer and model wrapper will be available after the formal release of the OmniGenome package.
Until then, please refer to the existing benchmark examples to implement your own benchmarking pipeline.


## Prepare the benchmark environment
Before running the benchmark, install the following packages in addition to PyTorch.
PyTorch installation instructions are available at https://pytorch.org/get-started/locally/.
```bash
pip install omnigenome findfile autocuda metric-visualizer transformers
```
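
After installing, a quick check that the packages can be found may save a failed run later. The sketch below is only a convenience and is not part of OmniGenome. The import names are assumed to match the pip package names, with `metric-visualizer` assumed to be importable as `metric_visualizer`, and `check_installed` is a hypothetical helper.

```python
import importlib.util


def check_installed(module_names):
    """Return {module_name: bool}, True when the module can be located."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# Assumed import names; adjust them if a package uses a different module name.
required = [
    "torch",
    "omnigenome",
    "findfile",
    "autocuda",
    "metric_visualizer",
    "transformers",
]

for name, found in check_installed(required).items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```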


## Import the required packages

```python
from omnigenome import AutoBench
import autocuda

# 1. Define the root folder of the benchmark datasets
root = "RGB"  # Abbreviation of the RNA genome benchmark

# 2. Define the model and tokenizer paths
model_name_or_path = "anonymous8/OmniGenome-52M"

# 3. Initialize AutoBench
# Select an available CUDA device based on your hardware
device = autocuda.auto_cuda()
auto_bench = AutoBench(
    bench_root=root,
    model_name_or_path=model_name_or_path,
    device=device,
    overwrite=True,
)

# 4. Run the benchmark
# Each downstream task has a predefined AutoBenchConfig, which is intended to keep comparisons fair.
# However, it is sometimes necessary to adjust the config for a dataset,
# for instance `max_length` and the batch size when running benchmarks with limited resources.
# To adjust the config, override the values in the `AutoBenchConfig` class.

# The example below overrides the config through keyword arguments to `run`.
batch_size = 8
epochs = 10
seeds = [42, 43, 44]
auto_bench.run(epochs=epochs, batch_size=batch_size, seeds=seeds)

# 5. Benchmark checkpointing
# If the benchmark is interrupted, the results obtained so far are saved and available for further execution,
# e.g.,
# AutoBench(bench_root=root, model_name_or_path=model_name_or_path, device=device, overwrite=True).run()

# Or clear the checkpoint:
# auto_bench.run(epochs=epochs, batch_size=batch_size, overwrite=True)
```

[2024-07-14 16:12:38] (0.0.7alpha) Benchmark: RGB does not exist. Search online for available benchmarks.


Downloading benchmark: 57MB [00:02, 26.38MB/s]                        


[2024-07-14 16:12:51] (0.0.7alpha) Loaded benchmarks:  ['RNA-mRNA', 'RNA-SNMD', 'RNA-SNMR', 'RNA-SSP-Archive2', 'RNA-SSP-bpRNA', 'RNA-SSP-rnastralign', 'RNA-SSP-Rfam']
[2024-07-14 16:12:51] (0.0.7alpha) Benchmark Root: __OMNIGENOME_DATA__/benchmarks/RGB
Benchmark List: ['RNA-mRNA', 'RNA-SNMD', 'RNA-SNMR', 'RNA-SSP-Archive2', 'RNA-SSP-bpRNA', 'RNA-SSP-rnastralign', 'RNA-SSP-Rfam']
Model Name or Path: anonymous8/OmniGenome-52M
Tokenizer: None
Device: cuda:0
Metric Visualizer Path: __OMNIGENOME_DATA__-benchmarks-RGB-anonymous8-OmniGenome-52M.mv
BenchConfig Details: <module 'bench_metadata' from '__OMNIGENOME_DATA__/benchmarks/RGB/metadata.py'>

[2024-07-14 16:12:51] (0.0.7alpha) Override epochs with 10 according to the input kwargs
[2024-07-14 16:12:51] (0.0.7alpha) Override batch_size with 8 according to the input kwargs
[2024-07-14 16:12:51] (0.0.7alpha) Override seeds with [42, 43, 44] according to the input kwargs
[2024-07-14 16:12:51] (0.0.7alpha) AutoBench Config for RNA-mRNA: task_



tokenizer_config.json:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


vocab.txt:   0%|          | 0.00/91.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/732 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

configuration_omnigenome.py:   0%|          | 0.00/13.4k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/anonymous8/OmniGenome-52M:
- configuration_omnigenome.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
You are using a model of type omnigenome to instantiate a model of type mprna. This is not supported for all configurations of models and can yield errors.


modeling_omnigenome.py:   0%|          | 0.00/75.3k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/anonymous8/OmniGenome-52M:
- modeling_omnigenome.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors:   0%|          | 0.00/211M [00:00<?, ?B/s]

Some weights of the model checkpoint at anonymous8/OmniGenome-52M were not used when initializing OmniGenomeModel: ['classifier.bias', 'classifier.weight', 'dense.bias', 'dense.weight', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing OmniGenomeModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing OmniGenomeModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of OmniGenomeModel were not initialized from the model checkpoint at anonymous8/OmniGenome-52M and are newly initialized: ['OmniGenome.pooler.dense.bias', 'OmniGenome.pooler.dense.weight']
You

[2024-07-14 16:13:01] (0.0.7alpha) Model Name: OmniGenomeModelForTokenRegression
Model Metadata: {'library_name': 'OmniGenome', 'omnigenome_version': '0.0.7alpha', 'torch_version': '2.1.2+cu12.1+gita8e7c98cb95ff97bb30a728c6b2a1ce6bff946eb', 'transformers_version': '4.42.0.dev0', 'model_cls': 'OmniGenomeModelForTokenRegression', 'tokenizer_cls': 'OmniGenomeTokenizer', 'model_name': 'OmniGenomeModelForTokenRegression'}
Base Model Name: anonymous8/OmniGenome-52M
Model Type: omnigenome
Model Architecture: ['OmniGenomeModel', 'OmniGenomeForTokenClassification', 'OmniGenomeForMaskedLM', 'OmniGenomeModelForSeq2SeqLM', 'OmniGenomeForTSequenceClassification', 'OmniGenomeForTokenClassification', 'OmniGenomeForSeq2SeqLM']
Model Parameters: 52.453345 M
Model Config: OmniGenomeConfig {
  "OmniGenomefold_config": null,
  "_name_or_path": "anonymous8/OmniGenome-52M",
  "architectures": [
    "OmniGenomeModel",
    "OmniGenomeForTokenClassification",
    "OmniGenomeForMaskedLM",
    "OmniGenomeModelFo

100%|██████████| 1728/1728 [00:00<00:00, 2420.29it/s]


[2024-07-14 16:13:02] (0.0.7alpha) {'avg': 109.0, 'max': 109, 'min': 109}
[2024-07-14 16:13:02] (0.0.7alpha) Preview of the first two samples in the dataset:
{'input_ids': tensor([0, 6, 6, 4, 4, 4, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
        9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 4,
        9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 4, 9, 4, 9, 4, 9, 4, 9, 4, 4, 6, 5, 6,
        5, 6, 4, 6, 9, 9, 5, 6, 5, 9, 5, 6, 9, 6, 5, 4, 4, 4, 4, 6, 4, 4, 4, 5,
        4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1,

100%|██████████| 192/192 [00:00<00:00, 2501.64it/s]

[2024-07-14 16:13:02] (0.0.7alpha) {'avg': 109.0, 'max': 109, 'min': 109}
[2024-07-14 16:13:02] (0.0.7alpha) Preview of the first two samples in the dataset:
{'input_ids': tensor([0, 6, 6, 4, 4, 4, 6, 9, 9, 6, 6, 4, 5, 9, 6, 9, 9, 9, 9, 6, 4, 9, 9, 6,
        6, 9, 4, 6, 4, 9, 9, 9, 6, 4, 6, 5, 4, 4, 4, 6, 5, 9, 9, 4, 6, 4, 9, 9,
        9, 6, 9, 5, 4, 6, 9, 9, 4, 6, 6, 4, 9, 6, 6, 9, 5, 9, 6, 4, 5, 5, 4, 6,
        6, 9, 9, 9, 9, 9, 5, 6, 4, 4, 6, 5, 9, 9, 6, 4, 4, 4, 4, 6, 4, 4, 4, 5,
        4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1,


Testing: 100%|██████████| 24/24 [00:02<00:00,  9.02it/s]


[2024-07-14 16:13:05] (0.0.7alpha) {'mc_rmse': 0.958329087975486}


Epoch 1/10 Loss: 0.6314:   9%|▉         | 19/216 [00:04<00:51,  3.80it/s]


KeyboardInterrupt: 
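
The run above was stopped by hand. As a rough picture of what resuming from saved results can look like, the sketch below keeps finished (task, seed) results in a JSON file and skips them on the next run. This is a generic pattern under stated assumptions, not OmniGenome's implementation: `run_one`, `run_with_checkpoint`, the checkpoint file name, and the meaning given to `overwrite` here are all hypothetical, and the returned metric is a dummy value rather than a real result.

```python
import json
import tempfile
from pathlib import Path


def run_one(task, seed):
    """Hypothetical stand-in for training and evaluating one task with one seed."""
    return {"metric": 0.0}  # dummy value, not a real result


def run_with_checkpoint(tasks, seeds, checkpoint_path, overwrite=False):
    """Run every (task, seed) pair, skipping pairs already saved in the checkpoint."""
    path = Path(checkpoint_path)
    results = {}
    # In this sketch, overwrite=True means "ignore any saved results".
    if path.exists() and not overwrite:
        results = json.loads(path.read_text(encoding="utf-8"))

    executed = []
    for task in tasks:
        for seed in seeds:
            key = f"{task}/{seed}"
            if key in results:
                continue  # finished in an earlier run
            results[key] = run_one(task, seed)
            executed.append(key)
            # Save after every pair so an interruption loses at most one run.
            path.write_text(json.dumps(results), encoding="utf-8")
    return results, executed


checkpoint = Path(tempfile.mkdtemp()) / "bench_checkpoint.json"
tasks = ["RNA-mRNA", "RNA-SNMD"]
seeds = [42, 43, 44]

_, first = run_with_checkpoint(tasks, seeds, checkpoint)
_, second = run_with_checkpoint(tasks, seeds, checkpoint)
print(len(first), "pairs run the first time;", len(second), "pairs run on resume")
```

Saving after each (task, seed) pair trades a little extra disk I/O for losing at most one run when the process is interrupted.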