# Model Benchmarking
This notebook fine-tunes published genomic language models (DNABERT, HyenaDNA, Nucleotide Transformer) for one epoch on each Genomic Benchmarks dataset and reports accuracy, MCC, and F1 on the test split.

**Requirements:**
- Google Colab with GPU runtime
- Google Cloud Storage bucket with Genomic Benchmarks datasets
- GCS authentication

In [None]:
!pip install transformers torch pandas scikit-learn google-cloud-storage tqdm


In [None]:
# ============================================
# Cell 2: Setup Google Cloud authentication
# ============================================
from google.colab import auth
auth.authenticate_user()

In [None]:
# ============================================
# Cell 3: Import all libraries
# ============================================
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import DataLoader, Dataset
from sklearn.metrics import accuracy_score, matthews_corrcoef, f1_score, confusion_matrix
from google.cloud import storage
import io
from tqdm.notebook import tqdm
import json
import os
import gc

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Using device: cuda
GPU: NVIDIA A100-SXM4-40GB


In [None]:
# ============================================
# Cell 4: Define dataset class for DNA sequences
# ============================================
class DNADataset(Dataset):
    """Dataset class for DNA sequences"""
    def __init__(self, sequences, labels=None):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        if self.labels is not None:
            return self.sequences[idx], self.labels[idx]
        return self.sequences[idx]
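
As a rough guide to epoch length: a `DataLoader` wrapping this dataset with the default `drop_last=False` yields ceil(N / batch_size) batches. The sketch below is plain Python; the helper name `batches_per_epoch` is hypothetical, and the sample counts and batch sizes are the ones reported for `human_ocr_ensembl` in the Nucleotide Transformer output at the end of this notebook.

```python
import math

def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches a DataLoader yields when drop_last=False (the default)."""
    return math.ceil(num_samples / batch_size)

# human_ocr_ensembl: 139,804 training and 34,952 test sequences
train_batches = batches_per_epoch(139_804, 8)   # training batch size used for Nucleotide Transformer
test_batches = batches_per_epoch(34_952, 16)    # evaluation batch size
print(train_batches, test_batches)
```

These counts should match the totals shown in the tqdm progress bars for that dataset.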

In [None]:
# ============================================
# Cell 5: Helper functions for GCS
# ============================================
def load_csv_from_gcs(bucket_name, file_path):
    """Load CSV from Google Cloud Storage"""
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(file_path)
    # download_as_string() is a deprecated alias of download_as_bytes()
    data = blob.download_as_bytes()
    data_file = io.StringIO(data.decode("utf-8"))
    df = pd.read_csv(data_file)
    return df

def save_results_to_gcs(bucket_name, file_path, results_dict):
    """Save results dictionary as JSON to GCS"""
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(file_path)
    blob.upload_from_string(json.dumps(results_dict, indent=2))
    print(f"Results saved to gs://{bucket_name}/{file_path}")

In [None]:
# ============================================
# Cell 6: DNABERT specific functions
# ============================================
def prepare_dnabert_input(sequence, k=6):
    """
    Convert a DNA sequence to the space-separated, overlapping k-mer string DNABERT expects.
    The DNA_bert_6 checkpoint used below was pretrained on 6-mers, hence k=6 by default.
    """
    sequence = sequence.upper()
    # Slide a window of width k one base at a time; sequences shorter than k yield ""
    return ' '.join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def benchmark_dnabert(train_df, test_df, dataset_name, bucket_name, max_length=512):
    """
    Benchmark DNABERT on a genomic dataset
    """
    print(f"\n{'='*50}")
    print(f"Benchmarking DNABERT on {dataset_name}")
    print(f"{'='*50}")

    # Load DNABERT model and tokenizer
    print("Loading DNABERT model...")
    model_name = "zhihan1996/DNA_bert_6"  # This is the original DNABERT with 6-mer

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Determine number of classes from the data
    num_classes = len(train_df['label'].unique())
    print(f"Number of classes: {num_classes}")

    # Load model with correct number of output classes
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_classes,
        trust_remote_code=True,
        ignore_mismatched_sizes=True  # This allows loading with different num_labels
    )
    model = model.to(device)

    # Prepare data
    print("Preparing sequences for DNABERT (k-mer tokenization)...")
    train_sequences = [prepare_dnabert_input(seq) for seq in tqdm(train_df['sequence'].values, desc="Processing train")]
    test_sequences = [prepare_dnabert_input(seq) for seq in tqdm(test_df['sequence'].values, desc="Processing test")]

    train_labels = train_df['label'].values
    test_labels = test_df['label'].values

    # Fine-tuning DNABERT
    print("\nFine-tuning DNABERT...")
    model.train()

    # Create optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

    # Training parameters
    batch_size = 16  # Small batch size to avoid memory issues
    num_epochs = 1  # Quick fine-tuning

    # Create dataset
    train_dataset = DNADataset(train_sequences, train_labels)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    # Training loop
    for epoch in range(num_epochs):
        total_loss = 0
        progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}")

        for batch_sequences, batch_labels in progress_bar:
            # Tokenize batch
            inputs = tokenizer(
                batch_sequences,
                padding=True,
                truncation=True,
                max_length=max_length,
                return_tensors="pt"
            ).to(device)

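            # The default collate already returns labels as a tensor; as_tensor avoids
            # the copy-construct UserWarning that torch.tensor() raises on tensor input
            labels = torch.as_tensor(batch_labels).to(device)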

            # Forward pass
            outputs = model(**inputs, labels=labels)
            loss = outputs.loss

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            progress_bar.set_postfix({'loss': loss.item()})

        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1} - Average loss: {avg_loss:.4f}")

    # Evaluation
    print("\nEvaluating on test set...")
    model.eval()

    all_predictions = []
    all_labels = []

    test_dataset = DNADataset(test_sequences, test_labels)
    test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

    with torch.no_grad():
        for batch_sequences, batch_labels in tqdm(test_loader, desc="Evaluating"):
            inputs = tokenizer(
                batch_sequences,
                padding=True,
                truncation=True,
                max_length=max_length,
                return_tensors="pt"
            ).to(device)

            outputs = model(**inputs)
            predictions = torch.argmax(outputs.logits, dim=-1)

            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(batch_labels)

    # Calculate metrics
    accuracy = accuracy_score(all_labels, all_predictions)
    mcc = matthews_corrcoef(all_labels, all_predictions)
    f1 = f1_score(all_labels, all_predictions, average='weighted' if num_classes > 2 else 'binary')

    print(f"\nResults for {dataset_name}:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"MCC: {mcc:.4f}")
    print(f"F1 Score: {f1:.4f}")

    # Confusion matrix
    print("\nConfusion Matrix:")
    print(confusion_matrix(all_labels, all_predictions))

    # Save results
    results = {
        'dataset': dataset_name,
        'model': 'DNABERT',
        'accuracy': float(accuracy),
        'mcc': float(mcc),
        'f1': float(f1),
        'num_train_samples': len(train_df),
        'num_test_samples': len(test_df),
        'num_classes': num_classes
    }

    # Save to GCS
    results_path = f"benchmark_results/dnabert/{dataset_name}_results.json"
    save_results_to_gcs(bucket_name, results_path, results)

    # Clear GPU memory
    del model
    torch.cuda.empty_cache()
    gc.collect()

    return results
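
Overlapping k-mer tokenization barely shortens the input: a sequence of length L produces L - k + 1 tokens, so `max_length=512` starts truncating at a little over 500 bases. The sketch below is plain Python and makes two assumptions: that the tokenizer adds two special tokens (`[CLS]` and `[SEP]`), and that the sequence lengths are the first-sequence lengths printed by the pipeline for some datasets (146, 200, 500). The 600-base case and the helper names are hypothetical.

```python
def to_kmers(sequence: str, k: int = 6) -> list[str]:
    """Overlapping k-mers, mirroring prepare_dnabert_input above."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def dnabert_token_count(seq_len: int, k: int = 6, special_tokens: int = 2) -> int:
    """k-mer tokens plus special tokens (assumed to be [CLS] and [SEP])."""
    return max(seq_len - k + 1, 0) + special_tokens

print(to_kmers("acgtacgt"))

MAX_LENGTH = 512  # default max_length of benchmark_dnabert
for seq_len in (146, 200, 500, 600):
    n_tokens = dnabert_token_count(seq_len)
    status = "truncated" if n_tokens > MAX_LENGTH else "fits"
    print(f"{seq_len} bases -> {n_tokens} tokens ({status})")
```

Under these assumptions the three observed lengths fit within 512 tokens, while anything much longer than 500 bases would lose its tail to truncation.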


In [None]:
# ============================================
# Cell 7: HyenaDNA benchmark function
# ============================================
def benchmark_hyena_dna(train_df, test_df, dataset_name, bucket_name, max_length=1024):
    """
    Benchmark HyenaDNA on a genomic dataset
    """
    print(f"\n{'='*50}")
    print(f"Benchmarking HyenaDNA on {dataset_name}")
    print(f"{'='*50}")

    print("Loading HyenaDNA model...")

    # Option 1 is tried first; Option 2 is the fallback.
    # All imports come from Cell 3. Re-importing torch or
    # AutoModelForSequenceClassification inside this function would make those
    # names function-local, so any use before the import line would raise
    # UnboundLocalError (which the except below would silently turn into a fallback).
    try:
        # Option 1: kuleshov-group checkpoint
        model_name = "kuleshov-group/hyenadna-small-32k-seqlen"

        # Load tokenizer with trust_remote_code
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            trust_remote_code=True
        )

        # Load model for sequence classification
        model = AutoModelForSequenceClassification.from_pretrained(
            model_name,
            num_labels=len(train_df['label'].unique()),
            trust_remote_code=True,
            ignore_mismatched_sizes=True
        )

    except Exception as e:
        print(f"Failed to load kuleshov-group model: {e}")
        print("Trying alternative HyenaDNA model...")

        # Option 2: Try LongSafari with custom loading
        try:
            # For LongSafari models, we need to handle them differently
            model_name = "LongSafari/hyenadna-small-32k-seqlen-hf"  # HF-compatible version

            # Create a simple character-level tokenizer for DNA.
            # These token ids are ad hoc and are not guaranteed to match the
            # vocabulary the checkpoint was pretrained with.
            class DNATokenizer:
                def __init__(self):
                    self.vocab = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'N': 4, '[PAD]': 5}
                    self.pad_token_id = 5

                def __call__(self, sequences, padding=True, truncation=True, max_length=1024, return_tensors="pt"):
                    if isinstance(sequences, str):
                        sequences = [sequences]

                    encoded = []
                    for seq in sequences:
                        seq = seq.upper()[:max_length] if truncation else seq.upper()
                        tokens = [self.vocab.get(c, 4) for c in seq]  # Default to N for unknown

                        # Pads every sequence to max_length, not to the longest in the batch
                        if padding and len(tokens) < max_length:
                            tokens += [self.pad_token_id] * (max_length - len(tokens))

                        encoded.append(tokens)

                    if return_tensors == "pt":
                        return {'input_ids': torch.tensor(encoded)}
                    return {'input_ids': encoded}

            tokenizer = DNATokenizer()

            # Try to load as a standard transformer model
            model = AutoModelForSequenceClassification.from_pretrained(
                model_name,
                num_labels=len(train_df['label'].unique()),
                trust_remote_code=True,
                ignore_mismatched_sizes=True
            )

        except Exception as e2:
            print(f"Could not load HyenaDNA model: {e2}")
            print("Skipping HyenaDNA for this dataset")
            return {
                'dataset': dataset_name,
                'model': 'HyenaDNA',
                'accuracy': 0,
                'mcc': 0,
                'f1': 0,
                'error': str(e2)
            }

    model = model.to(device)

    # Prepare sequences (HyenaDNA uses character-level)
    print("Preparing sequences for HyenaDNA...")
    train_sequences = train_df['sequence'].str.upper().values
    test_sequences = test_df['sequence'].str.upper().values
    train_labels = train_df['label'].values
    test_labels = test_df['label'].values

    # Fine-tuning
    print("\nFine-tuning HyenaDNA...")
    model.train()

    learning_rate = 1e-4
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    batch_size = 16
    num_epochs = 1

    print(f"Training parameters:")
    print(f"  - Batch size: {batch_size}")
    print(f"  - Epochs: {num_epochs}")
    print(f"  - Learning rate: {learning_rate}")
    print(f"  - Training samples: {len(train_df)}")

    train_dataset = DNADataset(train_sequences, train_labels)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    for epoch in range(num_epochs):
        total_loss = 0
        correct = 0
        total = 0
        errors = 0

        progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}")

        for batch_sequences, batch_labels in progress_bar:
            try:
                # Tokenizer returns a dict with 'input_ids' key
                inputs = tokenizer(
                    batch_sequences.tolist() if hasattr(batch_sequences, 'tolist') else list(batch_sequences),
                    padding=True,
                    truncation=True,
                    max_length=max_length,
                    return_tensors="pt"
                )

                # Move input_ids to device properly
                input_ids = inputs['input_ids'].to(device)
                # Labels arrive from the DataLoader as a tensor; as_tensor avoids a copy warning
                labels = torch.as_tensor(batch_labels).to(device)

                # Pass input_ids to model
                outputs = model(input_ids=input_ids, labels=labels)
                loss = outputs.loss

                predictions = torch.argmax(outputs.logits, dim=-1)
                correct += (predictions == labels).sum().item()
                total += labels.size(0)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_loss += loss.item()
                current_acc = correct / total if total > 0 else 0
                progress_bar.set_postfix({
                    'loss': f'{loss.item():.4f}',
                    'acc': f'{current_acc:.4f}',
                    'errors': errors
                })
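            except Exception as e:
                errors += 1
                # Report only the first failure so real problems are visible without flooding the output
                if errors == 1:
                    print(f"First training batch error: {e}")
                continue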

        if total > 0:
            avg_loss = total_loss / max(1, len(train_loader) - errors)
            final_train_acc = correct / total
            print(f"Epoch {epoch+1} - Average loss: {avg_loss:.4f}, Training accuracy: {final_train_acc:.4f}")
            if errors > 0:
                print(f"  (Encountered {errors} batch errors during training)")

    # Evaluation
    print("\nEvaluating on test set...")
    model.eval()

    all_predictions = []
    all_labels = []
    eval_errors = 0

    test_dataset = DNADataset(test_sequences, test_labels)
    test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

    with torch.no_grad():
        for batch_sequences, batch_labels in tqdm(test_loader, desc="Evaluating"):
            try:
                # Tokenizer returns a dict with 'input_ids' key
                inputs = tokenizer(
                    batch_sequences.tolist() if hasattr(batch_sequences, 'tolist') else list(batch_sequences),
                    padding=True,
                    truncation=True,
                    max_length=max_length,
                    return_tensors="pt"
                )

                # Move input_ids to device properly
                input_ids = inputs['input_ids'].to(device)

                outputs = model(input_ids=input_ids)
                predictions = torch.argmax(outputs.logits, dim=-1)

                all_predictions.extend(predictions.cpu().numpy())
                all_labels.extend(batch_labels)
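            except Exception as e:
                eval_errors += 1
                # Report only the first failure; skipped batches are excluded from the metrics
                if eval_errors == 1:
                    print(f"First evaluation batch error: {e}")
                continue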

    if eval_errors > 0:
        print(f"  (Encountered {eval_errors} batch errors during evaluation)")

    if len(all_predictions) > 0:
        accuracy = accuracy_score(all_labels, all_predictions)
        mcc = matthews_corrcoef(all_labels, all_predictions)
        f1 = f1_score(all_labels, all_predictions, average='weighted' if len(np.unique(all_labels)) > 2 else 'binary')

        print(f"\nResults for {dataset_name}:")
        print(f"Accuracy: {accuracy:.4f}")
        print(f"MCC: {mcc:.4f}")
        print(f"F1 Score: {f1:.4f}")

        print("\nConfusion Matrix:")
        cm = confusion_matrix(all_labels, all_predictions)
        print(cm)
    else:
        accuracy = mcc = f1 = 0
        print("No valid predictions made")

    results = {
        'dataset': dataset_name,
        'model': 'HyenaDNA',
        'accuracy': float(accuracy),
        'mcc': float(mcc),
        'f1': float(f1),
        'num_train_samples': len(train_df),
        'num_test_samples': len(test_df),
        'num_classes': len(train_df['label'].unique()),
        'learning_rate': learning_rate,
        'batch_size': batch_size,
        'num_epochs': num_epochs
    }

    results_path = f"benchmark_results/hyenadna/{dataset_name}_results.json"
    save_results_to_gcs(bucket_name, results_path, results)

    del model
    torch.cuda.empty_cache()
    gc.collect()

    return results
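
The fallback `DNATokenizer` above pads every sequence to `max_length` (1024 by default), even though the pipeline reports sequence lengths of a few hundred bases for some datasets, so much of each batch may be padding. The sketch below contrasts that with padding to the longest sequence in the batch. It is plain Python, not the HyenaDNA tokenizer: `encode_batch` and its `pad_to` argument are hypothetical names, and the token ids are simply copied from the fallback class, so they may not match the vocabulary a checkpoint was pretrained with.

```python
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # same ids as the fallback DNATokenizer
PAD_ID = 5

def encode_batch(sequences, max_length=1024, pad_to="longest"):
    """Encode DNA strings to id lists, truncating to max_length and then padding."""
    encoded = [
        [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()[:max_length]]
        for seq in sequences
    ]
    if pad_to == "max_length":
        target = max_length                          # behaviour of the fallback class
    else:
        target = max(len(ids) for ids in encoded)    # pad only as far as needed
    return [ids + [PAD_ID] * (target - len(ids)) for ids in encoded]

batch = ["ACGT", "acgtnnx"]  # the unknown base "x" maps to N
longest = encode_batch(batch)
fixed = encode_batch(batch, max_length=16, pad_to="max_length")
print(longest)
print([len(ids) for ids in fixed])
```

Padding to the longest sequence keeps the tensor rectangular while avoiding most of the wasted positions; whether the model handles pad ids correctly without an attention mask is a separate question this sketch does not address.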

In [None]:
# ============================================
# Cell 7b: Nucleotide Transformer benchmark function
# ============================================
def benchmark_nucleotide_transformer(train_df, test_df, dataset_name, bucket_name, max_length=512):
    """
    Benchmark Nucleotide Transformer on a genomic dataset
    """
    print(f"\n{'='*50}")
    print(f"Benchmarking Nucleotide Transformer on {dataset_name}")
    print(f"{'='*50}")

    print("Loading Nucleotide Transformer model...")

    # Use the 500M human reference model (best for human genomic tasks)
    model_name = "InstaDeepAI/nucleotide-transformer-500m-human-ref"
    # Alternative options:
    # "InstaDeepAI/nucleotide-transformer-500m-1000g" - trained on 1000 genomes
    # "InstaDeepAI/nucleotide-transformer-2.5b-1000g" - larger but slower

    try:
        # Load tokenizer with trust_remote_code
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            trust_remote_code=True
        )

        # Determine number of classes
        num_classes = len(train_df['label'].unique())
        print(f"Number of classes: {num_classes}")

        # Load model for sequence classification
        model = AutoModelForSequenceClassification.from_pretrained(
            model_name,
            num_labels=num_classes,
            trust_remote_code=True,
            ignore_mismatched_sizes=True
        )

    except Exception as e:
        print(f"Error loading Nucleotide Transformer: {e}")
        print("Skipping Nucleotide Transformer for this dataset")
        return {
            'dataset': dataset_name,
            'model': 'NucleotideTransformer',
            'accuracy': 0,
            'mcc': 0,
            'f1': 0,
            'error': str(e)
        }

    model = model.to(device)

    # Prepare sequences (the Nucleotide Transformer tokenizer splits raw DNA into
    # 6-mer tokens itself, so no manual k-mer step is needed)
    print("Preparing sequences for Nucleotide Transformer...")
    train_sequences = train_df['sequence'].str.upper().values
    test_sequences = test_df['sequence'].str.upper().values
    train_labels = train_df['label'].values
    test_labels = test_df['label'].values

    # Fine-tuning
    print("\nFine-tuning Nucleotide Transformer...")
    model.train()

    # Same learning rate as DNABERT (3e-5); note that HyenaDNA above uses 1e-4
    learning_rate = 3e-5
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    batch_size = 8  # Smaller batch size as NT is larger
    num_epochs = 1

    print(f"Training parameters:")
    print(f"  - Batch size: {batch_size}")
    print(f"  - Epochs: {num_epochs}")
    print(f"  - Learning rate: {learning_rate}")
    print(f"  - Training samples: {len(train_df)}")

    train_dataset = DNADataset(train_sequences, train_labels)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    for epoch in range(num_epochs):
        total_loss = 0
        correct = 0
        total = 0

        progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}")

        for batch_sequences, batch_labels in progress_bar:
            try:
                # Tokenize batch
                inputs = tokenizer(
                    list(batch_sequences),
                    padding=True,
                    truncation=True,
                    max_length=max_length,
                    return_tensors="pt"
                ).to(device)

                # Labels arrive from the DataLoader as a tensor; as_tensor avoids a copy warning
                labels = torch.as_tensor(batch_labels).to(device)

                # Forward pass
                outputs = model(**inputs, labels=labels)
                loss = outputs.loss

                predictions = torch.argmax(outputs.logits, dim=-1)
                correct += (predictions == labels).sum().item()
                total += labels.size(0)

                # Backward pass
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_loss += loss.item()
                current_acc = correct / total if total > 0 else 0
                progress_bar.set_postfix({
                    'loss': f'{loss.item():.4f}',
                    'acc': f'{current_acc:.4f}'
                })

            except RuntimeError as e:
                if "out of memory" in str(e):
                    print(f"WARNING: Out of memory, skipping batch")
                    if hasattr(torch.cuda, 'empty_cache'):
                        torch.cuda.empty_cache()
                    continue
                else:
                    raise e

        if total > 0:
            avg_loss = total_loss / len(train_loader)
            final_train_acc = correct / total
            print(f"Epoch {epoch+1} - Average loss: {avg_loss:.4f}, Training accuracy: {final_train_acc:.4f}")

    # Evaluation
    print("\nEvaluating on test set...")
    model.eval()

    all_predictions = []
    all_labels = []

    test_dataset = DNADataset(test_sequences, test_labels)
    test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)  # Larger batch for eval

    with torch.no_grad():
        for batch_sequences, batch_labels in tqdm(test_loader, desc="Evaluating"):
            try:
                inputs = tokenizer(
                    list(batch_sequences),
                    padding=True,
                    truncation=True,
                    max_length=max_length,
                    return_tensors="pt"
                ).to(device)

                outputs = model(**inputs)
                predictions = torch.argmax(outputs.logits, dim=-1)

                all_predictions.extend(predictions.cpu().numpy())
                all_labels.extend(batch_labels)

            except RuntimeError as e:
                if "out of memory" in str(e):
                    print(f"WARNING: Out of memory during eval, skipping batch")
                    if hasattr(torch.cuda, 'empty_cache'):
                        torch.cuda.empty_cache()
                    continue
                else:
                    raise e

    if len(all_predictions) > 0:
        accuracy = accuracy_score(all_labels, all_predictions)
        mcc = matthews_corrcoef(all_labels, all_predictions)
        f1 = f1_score(all_labels, all_predictions, average='weighted' if num_classes > 2 else 'binary')

        print(f"\nResults for {dataset_name}:")
        print(f"Accuracy: {accuracy:.4f}")
        print(f"MCC: {mcc:.4f}")
        print(f"F1 Score: {f1:.4f}")

        print("\nConfusion Matrix:")
        cm = confusion_matrix(all_labels, all_predictions)
        print(cm)
    else:
        accuracy = mcc = f1 = 0
        print("No valid predictions made")

    results = {
        'dataset': dataset_name,
        'model': 'NucleotideTransformer',
        'accuracy': float(accuracy),
        'mcc': float(mcc),
        'f1': float(f1),
        'num_train_samples': len(train_df),
        'num_test_samples': len(test_df),
        'num_classes': num_classes,
        'learning_rate': learning_rate,
        'batch_size': batch_size,
        'num_epochs': num_epochs
    }

    # Save to GCS
    results_path = f"benchmark_results/nucleotide_transformer/{dataset_name}_results.json"
    save_results_to_gcs(bucket_name, results_path, results)

    # Clear GPU memory
    del model
    torch.cuda.empty_cache()
    gc.collect()

    return results
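
For binary tasks, all three reported metrics can be recomputed from the printed confusion matrix, which is a useful sanity check on saved results. The sketch below is plain Python (the helper name `binary_metrics` is hypothetical) and uses the matrix printed for `human_ocr_ensembl` in the output at the end of this notebook, read in scikit-learn's layout: rows are true labels, columns are predictions, so the matrix is `[[TN, FP], [FN, TP]]`.

```python
import math

def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, F1 (positive class), and MCC from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}

# [[14345  3131]
#  [ 9332  8144]]
metrics = binary_metrics(tn=14345, fp=3131, fn=9332, tp=8144)
print({name: round(value, 4) for name, value in metrics.items()})
```

The rounded values agree with the accuracy (0.6434), F1 (0.5665), and MCC (0.3068) printed for that dataset. The low F1 relative to accuracy comes from recall: fewer than half of the positive examples are recovered.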

In [None]:
# ============================================
# Cell 8: Main benchmarking pipeline
# ============================================
def run_benchmarks(bucket_name, datasets, models_to_run=('dnabert', 'hyenadna')):
    """
    Run benchmarks on multiple datasets
    """
    all_results = []

    for dataset_name in datasets:
        print(f"\n{'='*60}")
        print(f"Processing dataset: {dataset_name}")
        print(f"{'='*60}")

        # Load data from GCS
        train_path = f"genomic_benchmarks/{dataset_name}_train.csv"
        test_path = f"genomic_benchmarks/{dataset_name}_test.csv"

        print(f"Loading data from GCS...")
        train_df = load_csv_from_gcs(bucket_name, train_path)
        test_df = load_csv_from_gcs(bucket_name, test_path)

        print(f"Train samples: {len(train_df)}")
        print(f"Test samples: {len(test_df)}")
        print(f"Sequence length: {len(train_df.iloc[0]['sequence'])}")

        # Run DNABERT
        if 'dnabert' in models_to_run:
            try:
                dnabert_results = benchmark_dnabert(train_df, test_df, dataset_name, bucket_name)
                all_results.append(dnabert_results)
            except Exception as e:
                print(f"Error running DNABERT on {dataset_name}: {e}")

        # Run HyenaDNA
        if 'hyenadna' in models_to_run:
            try:
                hyena_results = benchmark_hyena_dna(train_df, test_df, dataset_name, bucket_name)
                all_results.append(hyena_results)
            except Exception as e:
                print(f"Error running HyenaDNA on {dataset_name}: {e}")

        # Run Nucleotide Transformer
        if 'nucleotide_transformer' in models_to_run:
            try:
                nt_results = benchmark_nucleotide_transformer(train_df, test_df, dataset_name, bucket_name)
                all_results.append(nt_results)
            except Exception as e:
                print(f"Error running Nucleotide Transformer on {dataset_name}: {e}")

    # Save all results
    all_results_path = "benchmark_results/all_results.json"
    save_results_to_gcs(bucket_name, all_results_path, all_results)

    # Print summary
    print("\n" + "="*60)
    print("BENCHMARK SUMMARY")
    print("="*60)

    results_df = pd.DataFrame(all_results)
    print(results_df.to_string())

    return results_df
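
Because a model that fails to load returns a result with zeroed metrics plus an `error` key, a flat table of `all_results` can make failed runs look like real scores of 0. The sketch below shows one way to separate them when summarizing. It is plain Python with a hypothetical helper (`summarize`); the two accuracy values are taken from the Nucleotide Transformer output at the end of this notebook, and the failed HyenaDNA entry is invented purely to illustrate the error shape.

```python
def summarize(all_results, metric="accuracy"):
    """Group results as {dataset: {model: value}} and list failed runs separately."""
    table, failures = {}, []
    for result in all_results:
        if "error" in result:
            # Failed loads carry placeholder zeros; keep them out of the score table
            failures.append((result["dataset"], result["model"]))
            continue
        table.setdefault(result["dataset"], {})[result["model"]] = result[metric]
    return table, failures

example_results = [
    {"dataset": "human_ocr_ensembl", "model": "NucleotideTransformer", "accuracy": 0.6434},
    {"dataset": "demo_human_or_worm", "model": "NucleotideTransformer", "accuracy": 0.9230},
    # Hypothetical failed run, in the shape the benchmark functions return on a load error
    {"dataset": "demo_human_or_worm", "model": "HyenaDNA", "accuracy": 0, "error": "load failed"},
]

table, failures = summarize(example_results)
print(table)
print(failures)
```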

In [None]:
# ============================================
# Cell 9: Run the benchmarks!
# ============================================
# Configuration
BUCKET_NAME = "YOUR_BUCKET"  # Your GCS bucket

# List of datasets to benchmark
DATASETS = [
    "human_ocr_ensembl",
    "demo_coding_vs_intergenomic_seqs",
    "demo_human_or_worm",
    "human_enhancers_cohn",
    "human_enhancers_ensembl",
    "human_ensembl_regulatory",
    "human_nontata_promoters",
]

# Which models to run
MODELS = ['dnabert', 'hyenadna', 'nucleotide_transformer']

# Run benchmarks
results_df = run_benchmarks(BUCKET_NAME, DATASETS, MODELS)


Processing dataset: human_ocr_ensembl
Loading data from GCS...
Train samples: 139804
Test samples: 34952
Sequence length: 146

Benchmarking Nucleotide Transformer on human_ocr_ensembl
Loading Nucleotide Transformer model...
Number of classes: 2


Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at InstaDeepAI/nucleotide-transformer-500m-human-ref and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Preparing sequences for Nucleotide Transformer...

Fine-tuning Nucleotide Transformer...
Training parameters:
  - Batch size: 8
  - Epochs: 1
  - Learning rate: 3e-05
  - Training samples: 139804


Epoch 1/1:   0%|          | 0/17476 [00:00<?, ?it/s]



Epoch 1 - Average loss: 0.6192, Training accuracy: 0.6515

Evaluating on test set...


Evaluating:   0%|          | 0/2185 [00:00<?, ?it/s]


Results for human_ocr_ensembl:
Accuracy: 0.6434
MCC: 0.3068
F1 Score: 0.5665

Confusion Matrix:
[[14345  3131]
 [ 9332  8144]]
Results saved to gs://minformer_data/benchmark_results/nucleotide_transformer/human_ocr_ensembl_results.json

Processing dataset: demo_coding_vs_intergenomic_seqs
Loading data from GCS...
Train samples: 75000
Test samples: 25000
Sequence length: 200

Benchmarking Nucleotide Transformer on demo_coding_vs_intergenomic_seqs
Loading Nucleotide Transformer model...
Number of classes: 2




Preparing sequences for Nucleotide Transformer...

Fine-tuning Nucleotide Transformer...
Training parameters:
  - Batch size: 8
  - Epochs: 1
  - Learning rate: 3e-05
  - Training samples: 75000


Epoch 1/1:   0%|          | 0/9375 [00:00<?, ?it/s]



Epoch 1 - Average loss: 0.3356, Training accuracy: 0.8557

Evaluating on test set...


Evaluating:   0%|          | 0/1563 [00:00<?, ?it/s]


Results for demo_coding_vs_intergenomic_seqs:
Accuracy: 0.8965
MCC: 0.7931
F1 Score: 0.8958

Confusion Matrix:
[[11294  1206]
 [ 1381 11119]]
Results saved to gs://minformer_data/benchmark_results/nucleotide_transformer/demo_coding_vs_intergenomic_seqs_results.json

Processing dataset: demo_human_or_worm
Loading data from GCS...
Train samples: 75000
Test samples: 25000
Sequence length: 200

Benchmarking Nucleotide Transformer on demo_human_or_worm
Loading Nucleotide Transformer model...
Number of classes: 2




Preparing sequences for Nucleotide Transformer...

Fine-tuning Nucleotide Transformer...
Training parameters:
  - Batch size: 8
  - Epochs: 1
  - Learning rate: 3e-05
  - Training samples: 75000


Epoch 1/1:   0%|          | 0/9375 [00:00<?, ?it/s]



Epoch 1 - Average loss: 0.2407, Training accuracy: 0.9033

Evaluating on test set...




Results for demo_human_or_worm:
Accuracy: 0.9230
MCC: 0.8464
F1 Score: 0.9216

Confusion Matrix:
[[11752   748]
 [ 1178 11322]]
Results saved to gs://minformer_data/benchmark_results/nucleotide_transformer/demo_human_or_worm_results.json

Processing dataset: human_enhancers_cohn
Loading data from GCS...
Train samples: 20843
Test samples: 6948
Sequence length: 500

Benchmarking Nucleotide Transformer on human_enhancers_cohn
Loading Nucleotide Transformer model...
Number of classes: 2




Preparing sequences for Nucleotide Transformer...

Fine-tuning Nucleotide Transformer...
Training parameters:
  - Batch size: 8
  - Epochs: 1
  - Learning rate: 3e-05
  - Training samples: 20843





Epoch 1 - Average loss: 0.5962, Training accuracy: 0.6783

Evaluating on test set...




Results for human_enhancers_cohn:
Accuracy: 0.6908
MCC: 0.4011
F1 Score: 0.7320

Confusion Matrix:
[[1866 1608]
 [ 540 2934]]
Results saved to gs://minformer_data/benchmark_results/nucleotide_transformer/human_enhancers_cohn_results.json

Processing dataset: human_enhancers_ensembl
Loading data from GCS...
Train samples: 123872
Test samples: 30970
Sequence length: 420

Benchmarking Nucleotide Transformer on human_enhancers_ensembl
Loading Nucleotide Transformer model...
Number of classes: 2




Preparing sequences for Nucleotide Transformer...

Fine-tuning Nucleotide Transformer...
Training parameters:
  - Batch size: 8
  - Epochs: 1
  - Learning rate: 3e-05
  - Training samples: 123872





Epoch 1 - Average loss: 0.4744, Training accuracy: 0.7721

Evaluating on test set...




Results for human_enhancers_ensembl:
Accuracy: 0.8031
MCC: 0.6207
F1 Score: 0.8222

Confusion Matrix:
[[10779  4706]
 [ 1391 14094]]
Results saved to gs://minformer_data/benchmark_results/nucleotide_transformer/human_enhancers_ensembl_results.json

Processing dataset: human_ensembl_regulatory
Loading data from GCS...
Train samples: 231348
Test samples: 57713
Sequence length: 600

Benchmarking Nucleotide Transformer on human_ensembl_regulatory
Loading Nucleotide Transformer model...
Number of classes: 3




Preparing sequences for Nucleotide Transformer...

Fine-tuning Nucleotide Transformer...
Training parameters:
  - Batch size: 8
  - Epochs: 1
  - Learning rate: 3e-05
  - Training samples: 231348





Epoch 1 - Average loss: 0.1986, Training accuracy: 0.9181

Evaluating on test set...




Results for human_ensembl_regulatory:
Accuracy: 0.9208
MCC: 0.8816
F1 Score: 0.9213

Confusion Matrix:
[[18805     0  2573]
 [  233 16881   362]
 [ 1337    67 17455]]
Results saved to gs://minformer_data/benchmark_results/nucleotide_transformer/human_ensembl_regulatory_results.json

Processing dataset: human_nontata_promoters
Loading data from GCS...
Train samples: 27097
Test samples: 9034
Sequence length: 251

Benchmarking Nucleotide Transformer on human_nontata_promoters
Loading Nucleotide Transformer model...
Number of classes: 2




Preparing sequences for Nucleotide Transformer...

Fine-tuning Nucleotide Transformer...
Training parameters:
  - Batch size: 8
  - Epochs: 1
  - Learning rate: 3e-05
  - Training samples: 27097





Epoch 1 - Average loss: 0.3808, Training accuracy: 0.8309

Evaluating on test set...




Results for human_nontata_promoters:
Accuracy: 0.8619
MCC: 0.7369
F1 Score: 0.8620

Confusion Matrix:
[[3889  230]
 [1018 3897]]
Results saved to gs://minformer_data/benchmark_results/nucleotide_transformer/human_nontata_promoters_results.json

Processing dataset: human_ocr_ensembl
Loading data from GCS...
Train samples: 139804
Test samples: 34952
Sequence length: 146

Benchmarking Nucleotide Transformer on human_ocr_ensembl
Loading Nucleotide Transformer model...
Number of classes: 2




Preparing sequences for Nucleotide Transformer...

Fine-tuning Nucleotide Transformer...
Training parameters:
  - Batch size: 8
  - Epochs: 1
  - Learning rate: 3e-05
  - Training samples: 139804





Epoch 1 - Average loss: 0.6189, Training accuracy: 0.6514

Evaluating on test set...




Results for human_ocr_ensembl:
Accuracy: 0.6651
MCC: 0.3478
F1 Score: 0.6024

Confusion Matrix:
[[14375  3101]
 [ 8606  8870]]
Results saved to gs://minformer_data/benchmark_results/nucleotide_transformer/human_ocr_ensembl_results.json
Results saved to gs://minformer_data/benchmark_results/all_results.json

BENCHMARK SUMMARY
                            dataset                  model  accuracy       mcc        f1  num_train_samples  num_test_samples  num_classes  learning_rate  batch_size  num_epochs
0                 human_ocr_ensembl  NucleotideTransformer  0.643425  0.306815  0.566519             139804             34952            2        0.00003           8           1
1  demo_coding_vs_intergenomic_seqs  NucleotideTransformer  0.896520  0.793118  0.895791              75000             25000            2        0.00003           8           1
2                demo_human_or_worm  NucleotideTransformer  0.922960  0.846421  0.921612              75000             25000            2