# Genomic Foundation Model Auto-Benchmarking
This script is used to auto-benchmark the Genomic Foundation Model on diversified downstream tasks. 
We have automated the benchmark pipeline based on the OmniGenome package. 
Once your foundation model is trained, you can use this script to evaluate the performance of the model. 
The script will automatically load the datasets, preprocess the data, and evaluate the model on the tasks. 
The script will output the performance of the model on each task.

## [Optional] Prepare your own benchmark datasets
We have provided a set of benchmark datasets in the tutorials, you can use them to evaluate the performance of the model.
If you want to evaluate the model on your own datasets, you can prepare the datasets in the following steps:
1. Prepare the datasets in the following format:
    - The datasets should be in the `json` format.
    - The datasets should contain two columns: `sequence` and `label`.
    - The `sequence` column should contain the DNA sequences.
    - The `label` column should contain the labels of the sequences.
2. Save the datasets in a folder like the existing benchmark datasets. This folder is referred to as the `root` in the script.
3. Place the model and tokenizer in an accessible folder.
4. Sometimes the tokenizer does not work well with the datasets, you can write a custom tokenizer and model wrapper in the `omnigenome_wrapper.py` file.
More detailed documentation on how to write the custom tokenizer and model wrapper will be provided.

## Prepare the benchmark environment
Before running the benchmark, you need to install the following required packages in addition to PyTorch and other dependencies.
Find the installation instructions for PyTorch at https://pytorch.org/get-started/locally/.
```bash
pip install omnigenome, findfile, autocuda, metric-visualizer, transformers
```

## Import the required packages

In [3]:
from omnigenome import AutoBench
import autocuda

ModuleNotFoundError: No module named 'transformers'

## 1. Define the root folder of the benchmark datasets
Define the root where the benchmark datasets are stored.

In [2]:
root = 'RGB'  # Abbreviation of the RNA genome benchmark

## 2. Define the model and tokenizer paths
Provide the path to the model and tokenizer.

In [3]:
model_name_or_path = 'anonymous8/OmniGenome-52M'

## 3. Initialize the AutoBench
Select the available CUDA device based on your hardware.

In [4]:
device = autocuda.auto_cuda()
auto_bench = AutoBench(
    bench_root=root,
    model_name_or_path=model_name_or_path,
    device="cuda",
    overwrite=True,
)

[2024-10-03 12:17:28] (0.1.1alpha) Benchmark: RGB does not exist. Search online for available benchmarks.
[2024-10-03 12:17:28] (0.1.1alpha) Loaded benchmarks:  ['RNA-mRNA', 'RNA-SNMD', 'RNA-SNMR', 'RNA-SSP-Archive2', 'RNA-SSP-rnastralign', 'RNA-SSP-bpRNA', 'RNA-SSP-bpRNA-2000-90']
[2024-10-03 12:17:28] (0.1.1alpha) Benchmark Root: __OMNIGENOME_DATA__/benchmarks/RGB
Benchmark List: ['RNA-mRNA', 'RNA-SNMD', 'RNA-SNMR', 'RNA-SSP-Archive2', 'RNA-SSP-rnastralign', 'RNA-SSP-bpRNA', 'RNA-SSP-bpRNA-2000-90']
Model Name or Path: anonymous8/OmniGenome-52M
Tokenizer: None
Device: cuda
Metric Visualizer Path: __OMNIGENOME_DATA__-benchmarks-RGB-anonymous8-OmniGenome-52M.mv
BenchConfig Details: <module 'bench_metadata' from '__OMNIGENOME_DATA__/benchmarks/RGB/metadata.py'>



## 4. Run the benchmark
The downstream tasks have predefined configurations for fair comparison.
However, sometimes you might need to adjust the configuration based on your dataset or resources.
For instance, adjusting the `max_length` or batch size.
To adjust the configuration, you can override parameters in the `AutoBenchConfig` class.

In [5]:
batch_size = 8
epochs = 10
seeds = [42, 43, 44]
auto_bench.run(epochs=epochs, batch_size=batch_size, seeds=seeds)

[2024-10-03 12:17:28] (0.1.1alpha) Override epochs with 10 according to the input kwargs
[2024-10-03 12:17:28] (0.1.1alpha) Override batch_size with 8 according to the input kwargs
[2024-10-03 12:17:28] (0.1.1alpha) Override seeds with [42, 43, 44] according to the input kwargs
[2024-10-03 12:17:28] (0.1.1alpha) AutoBench Config for RNA-mRNA: task_name: RNA-mRNA
task_type: token_regression
label2id: None
num_labels: 3
epochs: 10
learning_rate: 2e-05
weight_decay: 1e-05
batch_size: 8
max_length: 110
seeds: [42, 43, 44]
compute_metrics: <function compute_metrics at 0x0000023D3692C550>
train_file: __OMNIGENOME_DATA__/benchmarks/RGB\RNA-mRNA/train.json
test_file: __OMNIGENOME_DATA__/benchmarks/RGB\RNA-mRNA/test.json
valid_file: None
dataset_cls: <class 'config.Dataset'>
model_cls: <class 'omnigenome.src.model.regression.model.OmniGenomeModelForTokenRegression'>


You are using a model of type omnigenome to instantiate a model of type mprna. This is not supported for all configurations of models and can yield errors.
Some weights of OmniGenomeModel were not initialized from the model checkpoint at anonymous8/OmniGenome-52M and are newly initialized: ['OmniGenome.pooler.dense.bias', 'OmniGenome.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[2024-10-03 12:17:32] (0.1.1alpha) Model Name: OmniGenomeModelForTokenRegression
Model Metadata: {'library_name': 'OmniGenome', 'omnigenome_version': '0.1.1alpha', 'torch_version': '2.4.1+cpu+cuNone+git38b96d3399a695e704ed39b60dac733c3fbf20e2', 'transformers_version': '4.45.1', 'model_cls': 'OmniGenomeModelForTokenRegression', 'tokenizer_cls': 'EsmTokenizer', 'model_name': 'OmniGenomeModelForTokenRegression'}
Base Model Name: anonymous8/OmniGenome-52M
Model Type: omnigenome
Model Architecture: None
Model Parameters: 52.453345 M
Model Config: OmniGenomeConfig {
  "OmniGenomefold_config": null,
  "_name_or_path": "anonymous8/OmniGenome-52M",
  "attention_probs_dropout_prob": 0.0,
  "auto_map": {
    "AutoConfig": "anonymous8/OmniGenome-52M--configuration_omnigenome.OmniGenomeConfig",
    "AutoModel": "anonymous8/OmniGenome-52M--modeling_omnigenome.OmniGenomeModel",
    "AutoModelForMaskedLM": "anonymous8/OmniGenome-52M--modeling_omnigenome.OmniGenomeForMaskedLM",
    "AutoModelForSeq2Seq

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1728/1728 [00:00<00:00, 3097.83it/s]


[2024-10-03 12:17:33] (0.1.1alpha) {'avg_seq_len': 109.0, 'max_seq_len': 109, 'min_seq_len': 109, 'avg_label_len': 112.0, 'max_label_len': 112, 'min_label_len': 112}
[2024-10-03 12:17:33] (0.1.1alpha) Preview of the first two samples in the dataset:
{'input_ids': tensor([0, 6, 6, 4, 4, 4, 4, 9, 4, 4, 9, 9, 4, 5, 5, 6, 9, 6, 5, 5, 9, 5, 5, 4,
        5, 6, 4, 4, 4, 6, 9, 4, 6, 6, 6, 4, 5, 6, 5, 5, 4, 4, 9, 5, 9, 5, 5, 4,
        9, 6, 6, 5, 6, 6, 4, 4, 6, 5, 5, 9, 6, 4, 5, 6, 6, 9, 9, 4, 4, 6, 5, 4,
        9, 6, 4, 6, 9, 9, 5, 6, 5, 9, 5, 4, 9, 6, 5, 4, 4, 4, 4, 6, 4, 4, 4, 5,
        4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 2, 1, 1, 1]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 192/192 [00:00<00:00, 2988.52it/s]

[2024-10-03 12:17:33] (0.1.1alpha) {'avg_seq_len': 109.0, 'max_seq_len': 109, 'min_seq_len': 109, 'avg_label_len': 112.0, 'max_label_len': 112, 'min_label_len': 112}
[2024-10-03 12:17:33] (0.1.1alpha) Preview of the first two samples in the dataset:
{'input_ids': tensor([0, 6, 6, 4, 4, 4, 4, 9, 9, 5, 4, 4, 4, 5, 5, 4, 9, 6, 5, 9, 9, 5, 4, 6,
        9, 5, 4, 6, 5, 9, 6, 4, 9, 6, 5, 4, 5, 4, 4, 9, 5, 6, 9, 9, 9, 9, 9, 4,
        4, 4, 5, 6, 6, 4, 9, 9, 9, 4, 4, 9, 4, 4, 9, 4, 4, 9, 4, 4, 4, 6, 5, 6,
        6, 9, 5, 6, 9, 9, 5, 6, 5, 6, 6, 9, 5, 6, 5, 4, 4, 4, 4, 6, 4, 4, 4, 5,
        4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 2, 1, 1, 1]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 


  self.scaler = GradScaler()


AssertionError: Torch not compiled with CUDA enabled

## 5. Benchmark Checkpointing
Whenever the benchmark is interrupted, the results will be saved and available for further execution.
You can also clear the checkpoint to start fresh:
```python
AutoBench(bench_root=root, model_name_or_path=model_name_or_path, device=device, overwrite=True).run()
```