# Large-scale Benchmarking with OmniGenBench using LoRA Fine-tuning
This notebook demonstrates how to fine-tune Genomic Foundation Models (GFMs) with Low-Rank Adaptation (LoRA) using the **OmniGenBench** framework. LoRA adapts a large model to a specific task while training only a small fraction of its parameters, which keeps compute and memory costs low.

## What is LoRA?

LoRA is a parameter-efficient fine-tuning (PEFT) technique that significantly reduces the number of trainable parameters. Instead of fine-tuning the entire model, LoRA freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into the layers of the Transformer architecture. Because only these small matrices receive gradients and optimizer state, memory use and training time drop substantially, which can make it feasible to fine-tune large models on consumer-grade hardware.
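
To make the saving concrete, the sketch below counts trainable parameters for a single weight matrix. It assumes the standard LoRA formulation, in which a frozen `d x k` weight matrix receives an update `B @ A`, with `B` of shape `d x r` and `A` of shape `r x k`. The function names and the 768-dimensional layer size are illustrative assumptions, not values taken from any model in this notebook.

```python
def full_param_count(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k matrix is fine-tuned."""
    return d * k


def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA update: B (d x r) plus A (r x k)."""
    return r * (d + k)


# Illustrative sizes (assumption): a square 768 x 768 projection with rank 8.
d, k, r = 768, 768, 8
full = full_param_count(d, k)
lora = lora_param_count(d, k, r)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.2%}")
# prints "full: 589824, LoRA: 12288, ratio: 2.08%"
```

For this layer the adapter trains roughly 2% of the parameters that full fine-tuning would update; the exact ratio depends on the layer shapes and on the rank `r`.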

### Advantages of LoRA:
- **Reduced Computational Cost**: Fewer trainable parameters mean faster training and lower GPU memory requirements.
- **Easy Task Switching**: The original pre-trained model remains unchanged. You can have multiple small LoRA adapters for different tasks and switch between them on the fly.
- **Comparable Performance**: LoRA can achieve performance comparable to full fine-tuning while being much more efficient.

## Notebook Structure

This notebook is organized into the following sections:

1.  **Setup & Installation**: Ensures all required libraries are installed.
2.  **Import Libraries**: Loads the necessary Python libraries.
3.  **Configuration**: Defines all key parameters, including model selection, benchmark choice, training settings, and LoRA configurations.
4.  **Model-Specific Loading**: Contains the logic for loading different types of GFMs and their tokenizers.
5.  **Running LoRA Fine-tuning**: Demonstrates how to initialize `AutoBench` and run the fine-tuning process with a single command.
6.  **Multi-Model LoRA Fine-tuning (Optional)**: Shows how to automate the fine-tuning process for a list of models.

## 1. Setup & Installation

First, make sure all the required packages are installed. The install command below is commented out; uncomment and run it only if a package is missing.

In [None]:
# !pip install omnigenbench transformers peft accelerate bitsandbytes
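
One way to check whether the install step is needed is to test which packages are importable. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and it assumes each package's import name matches its pip name, which holds for the five packages listed above.

```python
from importlib.util import find_spec

# Packages from the install command above.
REQUIRED_PACKAGES = ["omnigenbench", "transformers", "peft", "accelerate", "bitsandbytes"]


def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All required packages are installed.")
```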

## 2. Import Libraries

Import all the necessary libraries for the benchmark.

In [None]:
import random
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
from omnigenbench import AutoBench

print("Libraries imported successfully.")

## 3. Configuration

This section contains all the settings for the LoRA fine-tuning experiment. You can easily modify these parameters to test different models, benchmarks, or LoRA settings.

In [None]:
# --- General Settings ---
BENCHMARK = "GUE"  # Benchmark suite to use, e.g., "GUE", "RGB"
BATCH_SIZE = 32
PATIENCE = 3
EPOCHS = 20
MAX_EXAMPLES = 1000  # Use a smaller number for quick testing, set to None for all data
SEED = random.randint(0, 1000)  # Random seed; replace with a fixed integer for reproducible runs

# --- Model Selection ---
# Choose the Genomic Foundation Model (GFM) to fine-tune
GFM_TO_TUNE = 'yangheng/OmniGenome-52M'

# List of available GFMs for testing
AVAILABLE_GFMS = [
    'yangheng/OmniGenome-52M',
    'yangheng/OmniGenome-186M',
    'yangheng/OmniGenome-v1.5',
    'zhihan1996/DNABERT-2-117M',
    'LongSafari/hyenadna-large-1m-seqlen-hf',
    'InstaDeepAI/nucleotide-transformer-v2-100m-multi-species',
    'kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16',
    'multimolecule/rnafm',

    # Evo models require the `evo` package; install it according to the official documentation
    # 'arcinstitute/evo-1-131k-base',
    # 'SpliceBERT-510nt',
]

# --- LoRA Configuration ---
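# This dictionary maps a model's short name (the part after the last "/") to its LoRA settings.
# `r`: The rank of the update matrices.
# `lora_alpha`: The scaling factor (the update is scaled by lora_alpha / r).
# `lora_dropout`: The dropout probability for LoRA layers.
# `target_modules`: The modules (e.g., attention blocks) to apply LoRA to.
LORA_CONFIGS = {
    # Fallback for models without a dedicated entry (e.g., OmniGenome-52M).
    # Assumption: the model uses BERT-style module names ("key", "value", "dense");
    # adjust `target_modules` if your model names its layers differently.
    "default": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"], "bias": "none"
    },
    "OmniGenome-186M": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"], "bias": "none"
    },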
    "caduceus-ph_seqlen-131k_d_model-256_n_layer-16": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": ["in_proj", "x_proj", "out_proj"], "bias": "none"
    },
    "rnamsm": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": ["q_proj", "v_proj", "out_proj"], "bias": "none"
    },
    "rnafm": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"], "bias": "none"
    },
    "rnabert": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"], "bias": "none"
    },
    "agro-nucleotide-transformer-1b": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"], "bias": "none"
    },
    "SpliceBERT-510nt": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"], "bias": "none"
    },
    "DNABERT-2-117M": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": ["Wqkv", "dense"], "bias": "none"
    },
    "3utrbert": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"], "bias": "none"
    },
    "hyenadna-large-1m-seqlen-hf": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": ["in_proj", "out_proj"], "bias": "none"
    },
    "nucleotide-transformer-v2-100m-multi-species": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": ["key", "value", "dense"], "bias": "none"
    },
    "evo-1-131k-base": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": [
            "Wqkv", "out_proj",
            "mlp",
            "projections",
            "out_filter_dense"
        ],
        "bias": "none"
    },
    "evo-1.5-8k-base": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": [
            "Wqkv", "out_proj",
            "l1", "l2", "l3",
            "projections",
            "out_filter_dense"
        ],
        "bias": "none"
    },
    "evo-1-8k-base": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": [
            "Wqkv", "out_proj",
            "l1", "l2", "l3",
            "projections",
            "out_filter_dense"
        ],
        "bias": "none"
    },
    "evo2_7b": {
        "r": 8, "lora_alpha": 32, "lora_dropout": 0.1,
        "target_modules": [
            "Wqkv", "out_proj",
            "l1", "l2", "l3",
            # "projections",
            "out_filter_dense"
        ],
        "bias": "none"
    },
}

print("Configuration loaded:")
print(f"  - GFM to Tune: {GFM_TO_TUNE}")
print(f"  - Benchmark: {BENCHMARK}")
print(f"  - Epochs: {EPOCHS}")
print(f"  - LoRA Config: {LORA_CONFIGS.get(GFM_TO_TUNE.split('/')[-1], LORA_CONFIGS['default'])}")
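
The lookup pattern used in this notebook (strip the organization prefix from the model ID, then fall back to a default entry) can be sketched as a small helper. `resolve_lora_config` and `example_configs` are hypothetical names introduced for illustration. Note that `dict.get(key, fallback)` evaluates its second argument eagerly, so an expression such as `LORA_CONFIGS['default']` requires a `"default"` key to exist even when a model-specific entry is found; the helper below raises a descriptive error instead.

```python
def resolve_lora_config(model_id, configs, fallback_key="default"):
    """Return (matched_key, config) for a Hugging Face style model ID."""
    short_name = model_id.split("/")[-1]  # e.g. "zhihan1996/DNABERT-2-117M" -> "DNABERT-2-117M"
    if short_name in configs:
        return short_name, configs[short_name]
    if fallback_key in configs:
        return fallback_key, configs[fallback_key]
    raise KeyError(f"No LoRA config for '{short_name}' and no '{fallback_key}' entry")


# A reduced config table for illustration only (assumption: mirrors the structure above).
example_configs = {
    "default": {"r": 8, "lora_alpha": 32, "target_modules": ["key", "value", "dense"]},
    "DNABERT-2-117M": {"r": 8, "lora_alpha": 32, "target_modules": ["Wqkv", "dense"]},
}

for model_id in ["zhihan1996/DNABERT-2-117M", "yangheng/OmniGenome-52M"]:
    key, cfg = resolve_lora_config(model_id, example_configs)
    scaling = cfg["lora_alpha"] / cfg["r"]  # LoRA scales its update by lora_alpha / r
    print(f"{model_id}: using '{key}' entry, scaling = {scaling}")
```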

## 4. Model-Specific Loading

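Different GFMs may require specific loading procedures. This function handles the special cases, in particular `multimolecule` and `evo` models, which need custom tokenizers or model classes. For every other model it returns the model name unchanged and leaves the tokenizer as `None`, so that `AutoBench` loads both.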

In [None]:
def load_gfm_and_tokenizer(gfm_name):
    """Loads a GFM and its tokenizer, handling special cases."""
    print(f"\nLoading model and tokenizer for: {gfm_name}")
    
    if 'multimolecule' in gfm_name:
        from multimolecule import RnaTokenizer, AutoModelForTokenPrediction
        tokenizer = RnaTokenizer.from_pretrained(gfm_name)
        model = AutoModelForTokenPrediction.from_pretrained(gfm_name, trust_remote_code=True).base_model
        print("Loaded multimolecule model with custom classes.")
        
    elif 'evo-1' in gfm_name:
        # Using transformers to load Evo models
        config = AutoConfig.from_pretrained(gfm_name, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(gfm_name, config=config, trust_remote_code=True).backbone
        tokenizer = AutoTokenizer.from_pretrained(gfm_name, trust_remote_code=True)
        tokenizer.pad_token_id = tokenizer.pad_token_type_id
        
        # Patch for the unembedding layer
        model.config = config
        model.config.pad_token_id = tokenizer.pad_token_id
        model.unembed.unembed = lambda x: x
        print("Loaded Evo model with custom patching.")
        
    else:
        # Default loading for most Hugging Face models
        tokenizer = None  # Let AutoBench handle it
        model = gfm_name
        print("Using standard model name for AutoBench.")
        
    return model, tokenizer

print("Model loading function defined.")
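
The routing in `load_gfm_and_tokenizer` depends only on substring matches against the model name. The sketch below isolates that logic so it can be checked without downloading any weights; `loader_kind` is a hypothetical helper that mirrors the branch conditions above and is not part of OmniGenBench.

```python
def loader_kind(gfm_name: str) -> str:
    """Mirror the branch order of load_gfm_and_tokenizer without loading anything."""
    if "multimolecule" in gfm_name:
        return "multimolecule"
    if "evo-1" in gfm_name:
        return "evo"
    return "default"


for name in [
    "multimolecule/rnafm",
    "arcinstitute/evo-1-131k-base",
    "zhihan1996/DNABERT-2-117M",
    "evo2_7b",
]:
    print(f"{name} -> {loader_kind(name)}")
```

Because the Evo branch matches the substring `evo-1`, a name such as `evo2_7b` falls through to the default path, so it would need its own branch before it could be loaded with Evo-specific patching.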

## 5. Running LoRA Fine-tuning

Now, we'll execute the LoRA fine-tuning process for the selected model. `AutoBench` handles the entire workflow, from data loading and preprocessing to training and evaluation.

In [None]:
# Load the selected model and tokenizer
model, tokenizer = load_gfm_and_tokenizer(GFM_TO_TUNE)

# Initialize AutoBench
print(f"\nInitializing AutoBench for benchmark: {BENCHMARK}")
bench = AutoBench(
    benchmark=BENCHMARK,
    model_name_or_path=model,
    tokenizer=tokenizer,
    overwrite=True,
    trainer='native',  # 'native' or 'accelerate'
    autocast='fp16',  # 'fp16', 'bf16', or 'fp32'
    device='cuda',
)

# Get the appropriate LoRA config
lora_config = LORA_CONFIGS.get(GFM_TO_TUNE.split('/')[-1], LORA_CONFIGS['default'])

# Run the benchmark with LoRA fine-tuning
print(f"\nStarting LoRA fine-tuning for {GFM_TO_TUNE}...")
bench.run(
    batch_size=BATCH_SIZE,
    gradient_accumulation_steps=1,
    patience=PATIENCE,
    max_examples=MAX_EXAMPLES,
    seeds=SEED,
    epochs=EPOCHS,
    lora_config=lora_config, # Pass the LoRA config here
)

print("\n🎉 LoRA fine-tuning complete!")
print("Check the 'autobench_logs' and 'autobench_evaluations' directories for results.")
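
After a run, the output directories named above can be inspected programmatically. This is a minimal sketch using only `pathlib`; `list_result_files` is a hypothetical helper, and the layout and file formats inside the directories are not assumed, so it simply lists whatever files exist.

```python
from pathlib import Path


def list_result_files(directories):
    """Map each directory name to a sorted list of the files found beneath it."""
    found = {}
    for directory in directories:
        path = Path(directory)
        if path.is_dir():
            found[directory] = sorted(str(p) for p in path.rglob("*") if p.is_file())
        else:
            found[directory] = []  # Directory not created (yet)
    return found


results = list_result_files(["autobench_logs", "autobench_evaluations"])
for directory, files in results.items():
    print(f"{directory}: {len(files)} file(s)")
    for file_path in files:
        print(f"  {file_path}")
```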

## 6. Multi-Model LoRA Fine-tuning (Optional)

The following section demonstrates how to automate the LoRA fine-tuning process for a list of GFMs. Uncomment and run this cell to compare the performance of multiple models with LoRA.

In [None]:
# # Uncomment this cell to run LoRA fine-tuning for multiple models

# print("Starting multi-model LoRA fine-tuning...")
# print("="*50)

# for gfm in AVAILABLE_GFMS:
#     try:
#         # Load model and tokenizer
#         model, tokenizer = load_gfm_and_tokenizer(gfm)

#         # Initialize AutoBench
#         print(f"\nInitializing AutoBench for {gfm} on {BENCHMARK}")
#         bench = AutoBench(
#             benchmark=BENCHMARK,
#             model_name_or_path=model,
#             tokenizer=tokenizer,
#             overwrite=True,
#             trainer='native',
#             autocast='fp16',
#             device='cuda',
#         )

#         # Get the appropriate LoRA config
#         lora_config = LORA_CONFIGS.get(gfm.split('/')[-1], LORA_CONFIGS['default'])

#         # Run the benchmark
#         print(f"\nStarting LoRA fine-tuning for {gfm}...")
#         bench.run(
#             batch_size=BATCH_SIZE,
#             patience=PATIENCE,
#             max_examples=MAX_EXAMPLES,
#             seeds=SEED,
#             epochs=EPOCHS,
#             lora_config=lora_config,
#         )
#         print(f"\n✅ Finished fine-tuning for {gfm}.")
#         print("="*50)

#     except Exception as e:
#         print(f"\n❌ An error occurred while processing {gfm}: {e}")
#         print("="*50)
#         continue

# print("\n🎉 All models have been processed!")
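
The loop above prints each outcome as it goes; when comparing many models it can help to collect the outcomes into a summary as well. The sketch below shows the same try/except pattern with a stand-in runner. `run_all` and `fake_run` are hypothetical names, and `fake_run` only simulates a failure; in practice the callable would wrap the `AutoBench` setup and `bench.run(...)` call from the cell above.

```python
def run_all(models, run_one):
    """Call run_one for every model and record success or the error message."""
    results = {}
    for name in models:
        try:
            run_one(name)
            results[name] = "ok"
        except Exception as exc:  # Keep going so one failure does not stop the sweep
            results[name] = f"failed: {exc}"
    return results


def fake_run(name):
    """Stand-in for a real fine-tuning run (assumption: for illustration only)."""
    if "broken" in name:
        raise RuntimeError("simulated failure")


summary = run_all(["model-a", "broken-model", "model-b"], fake_run)
for name, status in summary.items():
    print(f"{name}: {status}")
```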