# Baseline Evaluation of Embedding Models

This notebook establishes a baseline for evaluating several embedding models on reasoning-intensive and standard retrieval benchmarks. We will evaluate:

1.  **Late-Interaction Models**: Using the `pylate` library.
2.  **Dense Models**: Using the `sentence-transformers` library.

The primary benchmarks are:
- **BRIGHT Benchmark**: A suite of reasoning-intensive retrieval tasks.
- **NanoBEIR Benchmark**: A collection of smaller, standard retrieval tasks for quick evaluation.

## Target Baseline: `Reason-ModernColBERT` Performance

Before we begin, let's establish the performance of the `lightonai/Reason-ModernColBERT` model, which we aim to recreate. The following nDCG@10 scores on the BRIGHT benchmark are taken from its official Hugging Face model card. This table serves as our reference point.

| Model / Metric | Biology | Earth | Economics | Psychology | Robotics | Stackoverflow | Sustainable | Leetcode | Pony | AoPS | Theorem - Q | Theorem - T | **Full Mean** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Reason-ModernColBERT (150M) | **33.25** | **41.02** | **24.93** | **30.73** | **21.12** | 20.62 | 20.31 | 31.07 | 8.51 | 9.17 | 19.51 | 11.24 | **22.62** |

### 1. Setup: Imports and Configuration

In [1]:
!pip install pylate sentence-transformers datasets beir

Collecting pylate
  Downloading pylate-1.2.0-py3-none-any.whl.metadata (16 kB)
Collecting beir
  Downloading beir-2.2.0-py3-none-any.whl.metadata (28 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-4.0.2-py3-none-any.whl.metadata (13 kB)
Collecting datasets
  Downloading datasets-4.0.0-py3-none-any.whl.metadata (19 kB)
Collecting voyager>=2.0.9 (from pylate)
  Downloading voyager-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.9 kB)
Collecting sqlitedict>=2.1.0 (from pylate)
  Downloading sqlitedict-2.1.0.tar.gz (21 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting transformers==4.48.2 (from pylate)
  Downloading transformers-4.48.2-py3-none-any.whl.metadata (44 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.4/44.4 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting ujson==5.10.0 (from pylate)
  Downloading ujson-5.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.me

In [2]:
import logging
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import torch
import random
import traceback
import gc

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from pylate import models as pylate_models, retrieve as pylate_retrieve, indexes as pylate_indexes
from pylate.evaluation import NanoBEIREvaluator
from tqdm.autonotebook import tqdm

# --- Configuration ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Set to True to sample a small subset of data for quick testing
SAMPLE_DATA = True
# Number of documents to sample (if SAMPLE_DATA is True)
SAMPLE_DOCS = 300  # Reduced from 500 to 300 for better memory usage
# Number of queries to sample (if SAMPLE_DATA is True)
SAMPLE_QUERIES = 20

### 2. Evaluation Framework Abstraction

To handle different model types (late-interaction vs. dense) cleanly, we'll create a simple abstraction. This framework will consist of a base class and two specialized subclasses.

- **`BaseEvaluator`**: Defines the common interface for all evaluators.
- **`ColBERTEvaluator`**: Handles late-interaction models using `pylate`.
- **`DenseEvaluator`**: Handles standard dense models using `sentence-transformers`.

In [3]:
class BaseEvaluator:
    """Abstract base class for model evaluation."""
    def __init__(self, model_id, device=None):
        self.model_id = model_id

        # Intelligently choose device - use CPU for larger models that might cause OOM
        if device:
            self.device = device
        else:
            # Force CPU for large models like ModernColBERT to avoid OOM
            if "ModernColBERT" in model_id:
                self.device = 'cpu'
            else:
                self.device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'

        self.model = self._load_model()
        logger.info(f"Initialized {self.__class__.__name__} for model: {self.model_id} on device: {self.device}")

    def _load_model(self):
        raise NotImplementedError

    def evaluate(self, corpus, queries, qrels):
        raise NotImplementedError

class ColBERTEvaluator(BaseEvaluator):
    """Evaluator for ColBERT-style late-interaction models."""
    def _load_model(self):
        return pylate_models.ColBERT(self.model_id, device=self.device)

    def evaluate(self, corpus, queries, qrels, batch_size=16):
        doc_ids = list(corpus.keys())
        documents = [corpus[doc_id]["text"] for doc_id in doc_ids]
        doc_embeddings = self.model.encode(documents, is_query=False, show_progress_bar=True, batch_size=batch_size)

        index = pylate_indexes.Voyager(
            index_folder="output/voyager_temp_index",
            index_name="temp_index",
            override=True
        )
        index.add_documents(documents_ids=doc_ids, documents_embeddings=doc_embeddings)

        retriever = pylate_retrieve.ColBERT(index=index)
        query_ids = list(queries.keys())
        query_texts = [queries[qid] for qid in query_ids]
        query_embeddings = self.model.encode(query_texts, is_query=True, show_progress_bar=True, batch_size=batch_size)

        results = retriever.retrieve(queries_embeddings=query_embeddings, k=100)

        beir_results = {qid: {hit['id']: hit['score'] for hit in results[i]} for i, qid in enumerate(query_ids)}

        evaluator = EvaluateRetrieval()
        scores = evaluator.evaluate(qrels, beir_results, k_values=[1, 5, 10])
        logger.info(f"Evaluation scores for {self.model_id}: {scores}")
        return scores

class DenseEvaluator(BaseEvaluator):
    """Evaluator for standard dense retrieval models."""
    def _load_model(self):
        model_kwargs = {}

        # Only use trust_remote_code for models that need it
        if "Alibaba-NLP" in self.model_id or "Qwen" in self.model_id:
            model_kwargs['trust_remote_code'] = True

        try:
            return SentenceTransformer(self.model_id, device=self.device, **model_kwargs)
        except ValueError as e:
            if "qwen3" in str(e).lower():
                # Special handling for Qwen3 model architecture
                error_msg = (f"Error loading {self.model_id}: {e}\n\n"
                           f"It appears you need a newer version of transformers for Qwen3 support.\n"
                           f"Try: pip install --upgrade transformers\n"
                           f"Or for the latest development version: pip install git+https://github.com/huggingface/transformers.git")
                logger.error(error_msg)
                raise ValueError(error_msg)
            else:
                # Re-raise other errors
                raise

    def evaluate(self, corpus, queries, qrels, batch_size=32):
        doc_ids = list(corpus.keys())
        documents = [corpus[doc_id].get("title", "") + " " + corpus[doc_id].get("text", "") for doc_id in doc_ids]
        query_ids = list(queries.keys())
        query_texts = [queries[qid] for qid in query_ids]

        doc_embeddings = self.model.encode(documents, convert_to_tensor=True, show_progress_bar=True, batch_size=batch_size)
        query_embeddings = self.model.encode(query_texts, convert_to_tensor=True, show_progress_bar=True, batch_size=batch_size)

        results = semantic_search(query_embeddings, doc_embeddings, top_k=100)

        beir_results = {qid: {doc_ids[hit['corpus_id']]: hit['score'] for hit in results[i]} for i, qid in enumerate(query_ids)}

        evaluator = EvaluateRetrieval()
        scores = evaluator.evaluate(qrels, beir_results, k_values=[1, 5, 10])
        logger.info(f"Evaluation scores for {self.model_id}: {scores}")
        return scores

### 3. BRIGHT Benchmark Evaluation

Now, let's run the evaluation on a few tasks from the BRIGHT benchmark. We'll select `biology`, `stackoverflow`, and `leetcode` as representative tasks.

In [None]:
import time
import datetime
import traceback
import gc
import psutil
import os

def get_memory_usage():
    """Get current memory usage of the process"""
    process = psutil.Process(os.getpid())
    return {
        "ram_gb": process.memory_info().rss / (1024 ** 3),
        "ram_percent": psutil.virtual_memory().percent
    }

def get_gpu_memory_usage():
    """Get current GPU memory usage if available"""
    if torch.cuda.is_available():
        return {
            "allocated_gb": torch.cuda.memory_allocated() / (1024 ** 3),
            "reserved_gb": torch.cuda.memory_reserved() / (1024 ** 3),
            "max_allocated_gb": torch.cuda.max_memory_allocated() / (1024 ** 3)
        }
    return {"gpu_available": False}

def log_with_timestamp(message, level="INFO"):
    """Log a message with timestamp and log level"""
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    prefix = f"[{timestamp}] [{level}]"
    print(f"{prefix} {message}")

def log_memory_status(tag=""):
    """Log current memory usage with optional tag"""
    mem = get_memory_usage()
    log_with_timestamp(f"Memory Status {tag} - RAM: {mem['ram_gb']:.2f} GB ({mem['ram_percent']}%)", "MEMORY")

    if torch.cuda.is_available():
        gpu_mem = get_gpu_memory_usage()
        log_with_timestamp(f"GPU Memory {tag} - Allocated: {gpu_mem['allocated_gb']:.2f} GB, "
                         f"Reserved: {gpu_mem['reserved_gb']:.2f} GB, "
                         f"Max: {gpu_mem['max_allocated_gb']:.2f} GB", "MEMORY")

In [4]:
def load_bright_dataset(task_name):
    """Load a BRIGHT dataset from Hugging Face and format it for evaluation."""
    try:
        logger.info(f"Loading BRIGHT dataset for task: {task_name}")
        # Removed trust_remote_code=True as it's no longer supported
        docs_ds = load_dataset("xlangai/BRIGHT", "documents", split=task_name)
        examples_ds = load_dataset("xlangai/BRIGHT", "examples", split=task_name)

        corpus = {str(doc["id"]): {"text": doc["content"]} for doc in docs_ds}
        queries = {str(ex["id"]): ex["query"] for ex in examples_ds}
        qrels = {str(ex["id"]): {str(gid): 1 for gid in ex.get("gold_ids", [])} for ex in examples_ds}

        if SAMPLE_DATA:
            logger.info(f"Sampling → {SAMPLE_QUERIES} queries and their relevant documents.")

            valid_query_ids = [qid for qid, rels in qrels.items() if rels]
            if len(valid_query_ids) > SAMPLE_QUERIES:
                sampled_qids = random.sample(valid_query_ids, SAMPLE_QUERIES)
            else:
                sampled_qids = valid_query_ids

            queries = {qid: queries[qid] for qid in sampled_qids}
            qrels = {qid: qrels[qid] for qid in sampled_qids}

            relevant_doc_ids = set()
            for rels in qrels.values():
                relevant_doc_ids.update(rels.keys())

            all_doc_ids = list(corpus.keys())
            non_relevant_doc_ids = [doc_id for doc_id in all_doc_ids if doc_id not in relevant_doc_ids]
            num_distractors = min(len(non_relevant_doc_ids), SAMPLE_DOCS - len(relevant_doc_ids))
            if num_distractors > 0:
                final_doc_ids = relevant_doc_ids.union(random.sample(non_relevant_doc_ids, num_distractors))
            else:
                final_doc_ids = relevant_doc_ids

            corpus = {doc_id: corpus[doc_id] for doc_id in final_doc_ids}

        if not qrels:
            logger.warning(f"No relevance judgments found for task '{task_name}' after sampling/filtering.")
            return None, None, None

        logger.info(f"Loaded '{task_name}' dataset with {len(corpus)} documents, {len(queries)} queries, and {len(qrels)} relevance judgments.")
        return corpus, queries, qrels

    except Exception as e:
        traceback.print_exc()
        logger.error(f"Error loading BRIGHT dataset for task {task_name}: {e}")
        return None, None, None

In [5]:
# Define models to evaluate
MODELS_TO_TEST = [
    {"model_id": "lightonai/GTE-ModernColBERT-v1", "evaluator_class": ColBERTEvaluator, "type": "Late-Interaction"},
    {"model_id": "lightonai/Reason-ModernColBERT", "evaluator_class": ColBERTEvaluator, "type": "Late-Interaction"},
    {"model_id": "jinaai/jina-embeddings-v3", "evaluator_class": DenseEvaluator, "type": "Dense"},
    {"model_id": "Qwen/Qwen3-Embedding-0.6B", "evaluator_class": DenseEvaluator, "type": "Dense"},
]

# Define all BRIGHT tasks
BRIGHT_TASKS = [
    "biology",
    "earth_science",
    "economics",
    "psychology",
    "robotics",
    "stackoverflow",
    "sustainable_living",
    "leetcode",
    "pony",
    "aops",
    "theoremqa_questions",
    "theoremqa_theorems"
]

In [None]:
# Run evaluation on BRIGHT benchmark
bright_results = []

for task in BRIGHT_TASKS:
    logger.info(f"===== Evaluating on BRIGHT task: {task} =====")
    corpus, queries, qrels = load_bright_dataset(task)

    if not corpus or not queries or not qrels:
        logger.warning(f"Could not load or format BRIGHT task {task}. Skipping.")
        continue

    for model_info in MODELS_TO_TEST:
        evaluator = None
        try:
            logger.info(f"Evaluating model {model_info['model_id']} on task {task}")
            evaluator = model_info["evaluator_class"](model_info["model_id"])

            # Adjust batch size for ColBERT models to avoid OOM
            batch_size = 8 if "ModernColBERT" in model_info["model_id"] else 32

            scores = evaluator.evaluate(corpus, queries, qrels, batch_size=batch_size)
            ndcg_scores, _, _, _ = scores  # BEIR returns a tuple of dicts
            ndcg_at_10 = ndcg_scores.get("NDCG@10", 0.0) * 100

            bright_results.append({
                "Model": model_info["model_id"],
                "Type": model_info["type"],
                "Task": task.capitalize(),
                "nDCG@10": ndcg_at_10
            })
            logger.info(f"Model: {model_info['model_id']}, Task: {task}, nDCG@10: {ndcg_at_10:.2f}")
        except Exception as e:
            logger.error(f"Failed to evaluate model {model_info['model_id']} on task {task}. Error: {e}")
            # Still add a row with NaN for the score to maintain structure
            bright_results.append({
                "Model": model_info["model_id"],
                "Type": model_info["type"],
                "Task": task.capitalize(),
                "nDCG@10": float('nan')
            })
        finally:
            # Clean up memory
            if evaluator and hasattr(evaluator, 'model'):
                del evaluator.model
            del evaluator
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
            elif torch.backends.mps.is_available():
                torch.mps.empty_cache()

# Create DataFrame from results
bright_df = pd.DataFrame(bright_results)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

biology-00000-of-00001.parquet:   0%|          | 0.00/11.0M [00:00<?, ?B/s]

earth_science-00000-of-00001.parquet:   0%|          | 0.00/23.1M [00:00<?, ?B/s]

economics-00000-of-00001.parquet:   0%|          | 0.00/11.0M [00:00<?, ?B/s]

psychology-00000-of-00001.parquet:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

robotics-00000-of-00001.parquet:   0%|          | 0.00/7.87M [00:00<?, ?B/s]

stackoverflow-00000-of-00001.parquet:   0%|          | 0.00/39.5M [00:00<?, ?B/s]

(…)ustainable_living-00000-of-00001.parquet:   0%|          | 0.00/11.7M [00:00<?, ?B/s]

pony-00000-of-00001.parquet:   0%|          | 0.00/1.13M [00:00<?, ?B/s]

leetcode-00000-of-00001.parquet:   0%|          | 0.00/211M [00:00<?, ?B/s]

aops-00000-of-00001.parquet:   0%|          | 0.00/65.3M [00:00<?, ?B/s]

(…)heoremqa_theorems-00000-of-00001.parquet:   0%|          | 0.00/7.63M [00:00<?, ?B/s]

Generating biology split:   0%|          | 0/57359 [00:00<?, ? examples/s]

Generating earth_science split:   0%|          | 0/121249 [00:00<?, ? examples/s]

Generating economics split:   0%|          | 0/50220 [00:00<?, ? examples/s]

Generating psychology split:   0%|          | 0/52835 [00:00<?, ? examples/s]

Generating robotics split:   0%|          | 0/61961 [00:00<?, ? examples/s]

Generating stackoverflow split:   0%|          | 0/107081 [00:00<?, ? examples/s]

Generating sustainable_living split:   0%|          | 0/60792 [00:00<?, ? examples/s]

Generating pony split:   0%|          | 0/7894 [00:00<?, ? examples/s]

Generating leetcode split:   0%|          | 0/413932 [00:00<?, ? examples/s]

Generating aops split:   0%|          | 0/188002 [00:00<?, ? examples/s]

Generating theoremqa_theorems split:   0%|          | 0/23839 [00:00<?, ? examples/s]

Generating theoremqa_questions split:   0%|          | 0/188002 [00:00<?, ? examples/s]

biology-00000-of-00001.parquet:   0%|          | 0.00/201k [00:00<?, ?B/s]

earth_science-00000-of-00001.parquet:   0%|          | 0.00/184k [00:00<?, ?B/s]

economics-00000-of-00001.parquet:   0%|          | 0.00/220k [00:00<?, ?B/s]

psychology-00000-of-00001.parquet:   0%|          | 0.00/184k [00:00<?, ?B/s]

robotics-00000-of-00001.parquet:   0%|          | 0.00/179k [00:00<?, ?B/s]

stackoverflow-00000-of-00001.parquet:   0%|          | 0.00/250k [00:00<?, ?B/s]

(…)ustainable_living-00000-of-00001.parquet:   0%|          | 0.00/218k [00:00<?, ?B/s]

pony-00000-of-00001.parquet:   0%|          | 0.00/27.7k [00:00<?, ?B/s]

leetcode-00000-of-00001.parquet:   0%|          | 0.00/169k [00:00<?, ?B/s]

aops-00000-of-00001.parquet:   0%|          | 0.00/1.34M [00:00<?, ?B/s]

(…)heoremqa_theorems-00000-of-00001.parquet:   0%|          | 0.00/124k [00:00<?, ?B/s]

(…)eoremqa_questions-00000-of-00001.parquet:   0%|          | 0.00/1.49M [00:00<?, ?B/s]

Generating biology split:   0%|          | 0/103 [00:00<?, ? examples/s]

Generating earth_science split:   0%|          | 0/116 [00:00<?, ? examples/s]

Generating economics split:   0%|          | 0/103 [00:00<?, ? examples/s]

Generating psychology split:   0%|          | 0/101 [00:00<?, ? examples/s]

Generating robotics split:   0%|          | 0/101 [00:00<?, ? examples/s]

Generating stackoverflow split:   0%|          | 0/117 [00:00<?, ? examples/s]

Generating sustainable_living split:   0%|          | 0/108 [00:00<?, ? examples/s]

Generating pony split:   0%|          | 0/112 [00:00<?, ? examples/s]

Generating leetcode split:   0%|          | 0/142 [00:00<?, ? examples/s]

Generating aops split:   0%|          | 0/111 [00:00<?, ? examples/s]

Generating theoremqa_theorems split:   0%|          | 0/76 [00:00<?, ? examples/s]

Generating theoremqa_questions split:   0%|          | 0/194 [00:00<?, ? examples/s]

modules.json:   0%|          | 0.00/216 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/663 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/596M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/581 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/393k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/115 [00:00<?, ?B/s]

Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:05<00:00,  5.25s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:06<00:00,  6.06s/it]


modules.json:   0%|          | 0.00/216 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/596M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/581 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/393k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/115 [00:00<?, ?B/s]

Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:08<00:00,  8.58s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:33<00:00, 33.49s/it]


modules.json:   0%|          | 0.00/378 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/464 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

ERROR:__main__:Failed to evaluate model jinaai/jina-embeddings-v3 on task biology. Error: No module named 'custom_st'


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/215 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]

ERROR:__main__:Error loading Qwen/Qwen3-Embedding-0.6B: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

It appears you need a newer version of transformers for Qwen3 support.
Try: pip install --upgrade transformers
Or for the latest development version: pip install git+https://github.com/huggingface/transformers.git
ERROR:__main__:Failed to evaluate model Qwen/Qwen3-Embedding-0.6B on task biology. Error: Error loading Qwen/Qwen3-Embed

Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:05<00:00,  5.98s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:05<00:00,  5.82s/it]


Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:36<00:00, 36.58s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [01:03<00:00, 63.56s/it]
ERROR:__main__:Failed to evaluate model jinaai/jina-embeddings-v3 on task earth_science. Error: No module named 'custom_st'
ERROR:__main__:Error loading Qwen/Qwen3-Embedding-0.6B: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

It appears you need a newer version of transformers for Qwen3 support.
Try: pip install --upgrade transformers
Or for the latest devel

Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:07<00:00,  7.67s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:06<00:00,  6.45s/it]


Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:11<00:00, 11.64s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:25<00:00, 25.53s/it]
ERROR:__main__:Failed to evaluate model jinaai/jina-embeddings-v3 on task economics. Error: No module named 'custom_st'
ERROR:__main__:Error loading Qwen/Qwen3-Embedding-0.6B: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

It appears you need a newer version of transformers for Qwen3 support.
Try: pip install --upgrade transformers
Or for the latest developme

Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:05<00:00,  5.81s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:05<00:00,  5.93s/it]


Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:09<00:00,  9.82s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:59<00:00, 59.34s/it]
ERROR:__main__:Failed to evaluate model jinaai/jina-embeddings-v3 on task psychology. Error: No module named 'custom_st'
ERROR:__main__:Error loading Qwen/Qwen3-Embedding-0.6B: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

It appears you need a newer version of transformers for Qwen3 support.
Try: pip install --upgrade transformers
Or for the latest developm

Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:05<00:00,  5.77s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:05<00:00,  5.91s/it]


Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:08<00:00,  8.19s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:23<00:00, 23.97s/it]
ERROR:__main__:Failed to evaluate model jinaai/jina-embeddings-v3 on task robotics. Error: No module named 'custom_st'
ERROR:__main__:Error loading Qwen/Qwen3-Embedding-0.6B: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

It appears you need a newer version of transformers for Qwen3 support.
Try: pip install --upgrade transformers
Or for the latest developmen

Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:07<00:00,  7.62s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:06<00:00,  6.22s/it]


Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:30<00:00, 30.66s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:29<00:00, 29.93s/it]
ERROR:__main__:Failed to evaluate model jinaai/jina-embeddings-v3 on task stackoverflow. Error: No module named 'custom_st'
ERROR:__main__:Error loading Qwen/Qwen3-Embedding-0.6B: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

It appears you need a newer version of transformers for Qwen3 support.
Try: pip install --upgrade transformers
Or for the latest devel

Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:03<00:00,  3.14s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:05<00:00,  5.68s/it]


Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:05<00:00,  5.03s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:47<00:00, 47.76s/it]
ERROR:__main__:Failed to evaluate model jinaai/jina-embeddings-v3 on task sustainable_living. Error: No module named 'custom_st'
ERROR:__main__:Error loading Qwen/Qwen3-Embedding-0.6B: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

It appears you need a newer version of transformers for Qwen3 support.
Try: pip install --upgrade transformers
Or for the latest 

Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:08<00:00,  8.96s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:06<00:00,  6.15s/it]


Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:13<00:00, 13.74s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:38<00:00, 38.36s/it]
ERROR:__main__:Failed to evaluate model jinaai/jina-embeddings-v3 on task leetcode. Error: No module named 'custom_st'
ERROR:__main__:Error loading Qwen/Qwen3-Embedding-0.6B: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

It appears you need a newer version of transformers for Qwen3 support.
Try: pip install --upgrade transformers
Or for the latest developmen

Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:02<00:00,  2.70s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:05<00:00,  5.67s/it]


Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:03<00:00,  3.40s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:21<00:00, 21.76s/it]
ERROR:__main__:Failed to evaluate model jinaai/jina-embeddings-v3 on task pony. Error: No module named 'custom_st'
ERROR:__main__:Error loading Qwen/Qwen3-Embedding-0.6B: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

It appears you need a newer version of transformers for Qwen3 support.
Try: pip install --upgrade transformers
Or for the latest development ve

Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:08<00:00,  8.47s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:06<00:00,  6.16s/it]


Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:12<00:00, 12.77s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:26<00:00, 26.95s/it]
ERROR:__main__:Failed to evaluate model jinaai/jina-embeddings-v3 on task aops. Error: No module named 'custom_st'
ERROR:__main__:Error loading Qwen/Qwen3-Embedding-0.6B: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

It appears you need a newer version of transformers for Qwen3 support.
Try: pip install --upgrade transformers
Or for the latest development ve

Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

Adding documents to the index (bs=2000): 100%|██████████| 1/1 [00:08<00:00,  8.39s/it]


Encoding queries (bs=8):   0%|          | 0/3 [00:00<?, ?it/s]

Retrieving documents (bs=50): 100%|██████████| 1/1 [00:06<00:00,  6.16s/it]


Encoding documents (bs=8):   0%|          | 0/38 [00:00<?, ?it/s]

#### Visualize BRIGHT Results

In [None]:
if not bright_df.empty:
    # Plot per-task performance
    plt.figure(figsize=(16, 10))
    sns.barplot(data=bright_df, x="Task", y="nDCG@10", hue="Model")
    plt.title("BRIGHT Benchmark Results (nDCG@10)", fontsize=16)
    plt.ylabel("nDCG@10 Score")
    plt.xlabel("Benchmark Task")
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()

    # Calculate and plot average performance across tasks
    model_avg = bright_df.groupby('Model')['nDCG@10'].mean().reset_index()
    model_avg['Task'] = 'Average'

    plt.figure(figsize=(10, 6))
    sns.barplot(data=model_avg, x="Model", y="nDCG@10")
    plt.title("Average Performance Across All BRIGHT Tasks", fontsize=16)
    plt.ylabel("Average nDCG@10 Score")
    plt.xlabel("Model")
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()

    # Print summary statistics
    print("\nSummary Statistics:")
    summary = bright_df.pivot_table(
        values='nDCG@10',
        index='Model',
        columns='Task',
        aggfunc='mean'
    )
    summary['Average'] = summary.mean(axis=1)
    display(summary.round(2))
else:
    print("No results to display for the BRIGHT benchmark.")

### 4. NanoBEIR Benchmark Evaluation

Next, we'll use the `NanoBEIREvaluator` from `pylate` to quickly evaluate the models on a subset of the BEIR benchmark. This provides a good signal for general retrieval performance.

In [None]:
nanobeir_results = []

for model_info in MODELS_TO_TEST:
    logger.info(f"===== Evaluating on NanoBEIR with model: {model_info['model_id']} =====")
    evaluator = None
    model = None
    try:
        if model_info["evaluator_class"] == ColBERTEvaluator:
            dataset_name = "scifact"
            url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset_name}.zip"
            data_path = util.download_and_unzip(url, f"beir-data/{dataset_name}")
            corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

            if SAMPLE_DATA:
                doc_ids = list(corpus.keys())
                sample_doc_ids = doc_ids[:min(SAMPLE_DOCS, len(doc_ids))]
                sample_corpus = {doc_id: corpus[doc_id] for doc_id in sample_doc_ids}

                query_ids = list(queries.keys())
                sample_query_ids = query_ids[:min(SAMPLE_QUERIES, len(query_ids))]
                sample_queries = {qid: queries[qid] for qid in sample_query_ids}

                sample_qrels = {}
                for qid in sample_query_ids:
                    if qid in qrels:
                        sample_qrels[qid] = {doc_id: score for doc_id, score in qrels[qid].items() if doc_id in sample_doc_ids}

                corpus, queries, qrels = sample_corpus, sample_queries, sample_qrels
                logger.info(f"Sampled {len(corpus)} documents and {len(queries)} queries for evaluation")

            evaluator = model_info["evaluator_class"](model_info["model_id"])
            batch_size = 8 if "ModernColBERT" in model_info["model_id"] else 32
            scores = evaluator.evaluate(corpus, queries, qrels, batch_size=batch_size)
            ndcg_score = scores[0].get("NDCG@10", 0.0) * 100
            task_name = "SciFact (Sampled)" if SAMPLE_DATA else "SciFact"
        else:
            if SAMPLE_DATA:
                logger.warning("Sampling is not supported for the standard NanoBEIREvaluator. Running on the full suite.")
            evaluator = NanoBEIREvaluator()

            device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
            model_kwargs = {}
            if "Alibaba-NLP" in model_info["model_id"] or "Qwen" in model_info["model_id"]:
                model_kwargs['trust_remote_code'] = True
            model = SentenceTransformer(model_info["model_id"], device=device, **model_kwargs)

            scores = evaluator(model)
            ndcg_score = scores['nDCG@10']
            task_name = "NanoBEIR (Avg)"

        nanobeir_results.append({
            "Model": model_info["model_id"],
            "Type": model_info["type"],
            "Task": task_name,
            "nDCG@10": ndcg_score
        })
        logger.info(f"Model: {model_info['model_id']}, {task_name} nDCG@10: {ndcg_score:.2f}")
    except Exception as e:
        logger.error(f"Failed to evaluate model {model_info['model_id']} on NanoBEIR. Error: {e}")
        # Still add a row with NaN for the score
        nanobeir_results.append({
            "Model": model_info["model_id"],
            "Type": model_info["type"],
            "Task": "NanoBEIR",
            "nDCG@10": float('nan')
        })
    finally:
        # Clean up memory
        del evaluator
        del model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        elif torch.backends.mps.is_available():
            torch.mps.empty_cache()

nanobeir_df = pd.DataFrame(nanobeir_results)

#### Visualize NanoBEIR Results

In [None]:
if not nanobeir_df.empty:
    plt.figure(figsize=(12, 6))
    sns.barplot(data=nanobeir_df, x="Model", y="nDCG@10", hue="Task")
    plt.title("NanoBEIR Benchmark Results (nDCG@10)", fontsize=16)
    plt.ylabel("nDCG@10 Score")
    plt.xlabel("Model")
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()

    # Print summary statistics
    print("\nNanoBEIR Summary:")
    display(nanobeir_df.pivot_table(values='nDCG@10', index='Model', columns='Task', aggfunc='mean').round(2))
else:
    print("No results to display for the NanoBEIR benchmark.")

### 5. Combined Results and Analysis

In [None]:
# Combine results from both benchmarks
if not bright_df.empty and not nanobeir_df.empty:
    # Calculate average scores per model
    bright_avg = bright_df.groupby('Model')['nDCG@10'].mean().reset_index()
    bright_avg['Benchmark'] = 'BRIGHT'

    nanobeir_avg = nanobeir_df.groupby('Model')['nDCG@10'].mean().reset_index()
    nanobeir_avg['Benchmark'] = 'NanoBEIR'

    combined_avg = pd.concat([bright_avg, nanobeir_avg])

    # Plot combined results
    plt.figure(figsize=(14, 8))
    sns.barplot(data=combined_avg, x="Model", y="nDCG@10", hue="Benchmark")
    plt.title("Comparison of Model Performance Across Benchmarks", fontsize=16)
    plt.ylabel("Average nDCG@10 Score")
    plt.xlabel("Model")
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()

    # Print final summary
    print("\nFinal Performance Summary:")
    summary = combined_avg.pivot_table(values='nDCG@10', index='Model', columns='Benchmark')
    summary['Overall Average'] = summary.mean(axis=1)
    display(summary.round(2).sort_values('Overall Average', ascending=False))
else:
    print("Insufficient data to create combined analysis.")