# BRIGHT and NanoBEIR Benchmark Evaluation

This notebook evaluates embedding models on the BRIGHT benchmark, following the reference implementation linked below, with an optional NanoBEIR comparison. It covers both late-interaction (ColBERT-based) and dense models.

## Reference Implementation
This notebook follows the methodology from the official evaluation script: https://gist.github.com/NohTow/3f27d2816b92d5c76f0e63aa7757cf4b

## Target Models
- **Late-Interaction Models**: `lightonai/Reason-ModernColBERT`, `lightonai/GTE-ModernColBERT-v1`
- **Dense Models**: `jinaai/jina-embeddings-v3`, `Qwen/Qwen3-Embedding-0.6B`

## Benchmarks
- **BRIGHT**: A comprehensive benchmark for reasoning-intensive retrieval tasks
- **NanoBEIR**: A collection of standard retrieval tasks for comparison

## 1. Install Required Dependencies

First, we need to install the required libraries. This includes `mteb` for loading the BRIGHT tasks and `pylate` for the ColBERT evaluation.

In [None]:
# Install required packages
!pip install mteb pylate srsly psutil torch sentence-transformers matplotlib seaborn pandas

## 2. Imports and Configuration

In [1]:
import time
import datetime
import traceback
import gc
import psutil
import os
import json
import numpy as np
from typing import Dict, List, Any, Tuple, Union, Optional

import mteb
import srsly
import torch
import torch.nn.functional as F
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer

# Import PyLate modules for ColBERT evaluation
from pylate import evaluation, indexes, models, retrieve

# Set up configuration
DEBUG_MODE = True  # Set to False for full evaluation
OUTPUT_DIR = "bright_evaluation_results"
os.makedirs(OUTPUT_DIR, exist_ok=True)



## 3. Memory and Logging Utilities

These utilities help track memory usage and provide detailed logging throughout the evaluation process.

In [2]:
def get_memory_usage():
    """Get current memory usage of the process"""
    process = psutil.Process(os.getpid())
    return {
        "ram_gb": process.memory_info().rss / (1024 ** 3),
        "ram_percent": psutil.virtual_memory().percent
    }

def get_gpu_memory_usage():
    """Get current GPU memory usage if available"""
    if torch.cuda.is_available():
        return {
            "allocated_gb": torch.cuda.memory_allocated() / (1024 ** 3),
            "reserved_gb": torch.cuda.memory_reserved() / (1024 ** 3),
            "max_allocated_gb": torch.cuda.max_memory_allocated() / (1024 ** 3)
        }
    return {"gpu_available": False}

def log_with_timestamp(message, level="INFO"):
    """Log a message with timestamp and log level"""
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    prefix = f"[{timestamp}] [{level}]"
    print(f"{prefix} {message}")

def log_memory_status(tag=""):
    """Log current memory usage with optional tag"""
    mem = get_memory_usage()
    log_with_timestamp(f"Memory Status {tag} - RAM: {mem['ram_gb']:.2f} GB ({mem['ram_percent']}%)", "MEMORY")
    
    if torch.cuda.is_available():
        gpu_mem = get_gpu_memory_usage()
        log_with_timestamp(f"GPU Memory {tag} - Allocated: {gpu_mem['allocated_gb']:.2f} GB, "
                         f"Reserved: {gpu_mem['reserved_gb']:.2f} GB, "
                         f"Max: {gpu_mem['max_allocated_gb']:.2f} GB", "MEMORY")

def cleanup_memory():
    """Clean up memory and GPU cache"""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
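    # torch.backends.mps.is_available() is the availability check that also exists on older PyTorch builds
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()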
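
The logging helpers above are called by hand before and after each step. As a minimal sketch of how that pattern could be wrapped, the context manager below logs the start and duration of a stage in the same `[timestamp] [LEVEL] message` layout. The names `format_log_line` and `timed_stage` are hypothetical and are not used elsewhere in this notebook.

```python
import contextlib
import datetime
import time


def format_log_line(message, level="INFO", now=None):
    """Build a line in the "[timestamp] [LEVEL] message" layout used above."""
    now = now or datetime.datetime.now()
    return f"[{now.strftime('%Y-%m-%d %H:%M:%S')}] [{level}] {message}"


@contextlib.contextmanager
def timed_stage(name, sink=print):
    """Log when a stage starts and how long it took, even if it raises."""
    sink(format_log_line(f"{name} started", "STAGE"))
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        sink(format_log_line(f"{name} finished in {elapsed:.2f}s", "STAGE"))


# Usage: collect the log lines in a list instead of printing them
lines = []
with timed_stage("toy encoding", sink=lines.append):
    sum(range(100_000))  # stand-in for real work
```

Passing `sink=print` (the default) reproduces the console output style of `log_with_timestamp`.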

## 4. Model Configuration

Define the models to evaluate and their configurations.

In [3]:
# Models to evaluate - temporarily remove problematic ones
MODELS_TO_TEST = [
    {
        "name": "Reason-ModernColBERT",
        "model_id": "lightonai/Reason-ModernColBERT",
        "query_length": 128,
        "type": "ColBERT",
        "device": "mps"  # Apple Silicon GPU (MPS); change to "cpu" if you hit OOM issues
    },
    {
        "name": "GTE-ModernColBERT-v1",
        "model_id": "lightonai/GTE-ModernColBERT-v1",
        "query_length": 128,
        "type": "ColBERT",
        "device": "mps"  # Apple Silicon GPU (MPS); change to "cpu" if you hit OOM issues
    }
    # Temporarily comment out problematic models
    # {
    #     "name": "Jina-Embeddings-v3",
    #     "model_id": "jinaai/jina-embeddings-v3",
    #     "type": "Dense",
    #     "device": "cuda" if torch.cuda.is_available() else "cpu"
    # },
    # {
    #     "name": "Qwen3-Embedding-0.6B",
    #     "model_id": "Qwen/Qwen3-Embedding-0.6B",
    #     "type": "Dense",
    #     "device": "cuda" if torch.cuda.is_available() else "cpu",
    #     "trust_remote_code": True
    # }
]

# Check if we're in Google Colab and adjust device settings
try:
    import google.colab
    IN_COLAB = True
    log_with_timestamp("Running in Google Colab environment", "CONFIG")
    
    # Check available GPU
    if torch.cuda.is_available():
        gpu_info = !nvidia-smi
        log_with_timestamp(f"GPU available: {torch.cuda.get_device_name(0)}", "CONFIG")
        
        # Update device settings for models that can run on GPU
        for model in MODELS_TO_TEST:
            if model["type"] == "Dense":
                model["device"] = "cuda"
    else:
        log_with_timestamp("No GPU available in Colab", "WARNING")
except ImportError:
    IN_COLAB = False
    log_with_timestamp("Not running in Google Colab environment", "CONFIG")

[2025-09-01 18:36:08] [CONFIG] Not running in Google Colab environment
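
Because the pipeline reads `name`, `model_id`, and `type` with plain indexing, a typo in a config entry only surfaces once evaluation starts. The following is a minimal sketch of an up-front check. `validate_model_config` and the constant names are hypothetical, and the set of accepted devices is an assumption based on the values that appear in this notebook.

```python
REQUIRED_KEYS = {"name", "model_id", "type"}  # read with [] by the pipeline
KNOWN_TYPES = {"ColBERT", "Dense"}            # the two branches used in this notebook
KNOWN_DEVICES = {"cpu", "cuda", "mps"}        # assumption: devices seen in the configs


def validate_model_config(config):
    """Return a list of problems found in one model config (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if "type" in config and config["type"] not in KNOWN_TYPES:
        problems.append(f"unknown type: {config['type']!r}")
    if config.get("device", "cpu") not in KNOWN_DEVICES:
        problems.append(f"unexpected device: {config['device']!r}")
    return problems


good = {
    "name": "Reason-ModernColBERT",
    "model_id": "lightonai/Reason-ModernColBERT",
    "query_length": 128,
    "type": "ColBERT",
    "device": "mps",
}
bad = {"name": "Broken", "type": "Sparse", "device": "tpu"}

for cfg in (good, bad):
    print(cfg["name"], validate_model_config(cfg))
```

Running such a check over `MODELS_TO_TEST` before any model is loaded keeps configuration mistakes separate from evaluation failures.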


## 5. BRIGHT Evaluation Functions

These functions implement the BRIGHT evaluation following the author's reference implementation.

In [4]:
def evaluate_colbert_on_bright(model_config: Dict[str, Any], eval_sets: Optional[List[str]] = None) -> Dict[str, Dict[str, Any]]:
    """
    Evaluate a ColBERT model on BRIGHT benchmark following the author's methodology
    
    Args:
        model_config: Dictionary with model configuration
        eval_sets: List of evaluation sets to run (if None, runs all available)
    
    Returns:
        Dictionary with results for each eval_set
    """
    model_name = model_config["model_id"]
    query_length = model_config.get("query_length", 128)
    model_short_name = model_config["name"]
    device = model_config.get("device", "cpu")
    
    log_with_timestamp(f"Starting evaluation for {model_name}", "START")
    log_memory_status("initial")
    
    # Load BRIGHT tasks using MTEB
    log_with_timestamp("Loading BRIGHT tasks from MTEB", "DATA")
    tasks = mteb.get_tasks(tasks=["BrightRetrieval"])
    task = tasks[0]
    task.load_data()
    
    log_with_timestamp(f"Loaded BRIGHT task with eval sets: {list(task.queries.keys())}", "DATA")
    
    # Determine which eval sets to run
    if eval_sets is None:
        available_sets = list(task.queries.keys())
        if DEBUG_MODE:
            # In debug mode, just run first 2 sets
            eval_sets_to_run = available_sets[:2]
            log_with_timestamp(f"Debug mode: Running on {eval_sets_to_run}", "DEBUG")
        else:
            eval_sets_to_run = available_sets
    else:
        eval_sets_to_run = eval_sets
    
    # Initialize model
    log_with_timestamp(f"Initializing model: {model_name} with query_length={query_length}", "MODEL")
    model = models.ColBERT(
        model_name_or_path=model_name,
        query_length=query_length,
        device=device
    )
    log_with_timestamp("Model loaded successfully", "MODEL")
    
    results = {}
    
    for eval_set in eval_sets_to_run:
        eval_start_time = time.time()
        log_with_timestamp(f"Starting evaluation on set: {eval_set}", "EVAL")
        log_memory_status(f"before {eval_set}")
        
        try:
            # Create output directory
            output_dir = os.path.join(OUTPUT_DIR, f"{model_short_name}_ir")
            os.makedirs(output_dir, exist_ok=True)
            
            # Check if results already exist
            result_file = os.path.join(
                output_dir,
                f"{task.metadata.name}_{eval_set.replace('/', '_')}_evaluation_scores_qlen{query_length}.json"
            )
            
            if os.path.exists(result_file):
                log_with_timestamp(f"Results already exist for {eval_set}. Loading existing results.", "SKIP")
                with open(result_file, 'r') as f:
                    evaluation_scores = json.load(f)
                results[eval_set] = evaluation_scores
                continue
            
            # Create index
            log_with_timestamp(f"Creating PLAID index for {eval_set}", "INDEX")
            index = indexes.PLAID(
                override=True,
                nbits=4,
                index_name=f"{task.metadata.name}_{eval_set}_{model_short_name}_{query_length}_4bits_ir",
            )
            
            # Get documents and queries
            corpus = task.corpus[eval_set]["standard"] 
            queries = task.queries[eval_set]["standard"]
            qrels = task.relevant_docs[eval_set]["standard"]
            
            # Handle excluded docs - check if it exists
            excluded_docs = {}
            if "excluded" in task.relevant_docs[eval_set]:
                excluded_docs = task.relevant_docs[eval_set]["excluded"]
                log_with_timestamp(f"Found excluded docs for {eval_set}", "DATA")
            else:
                # Create N/A entries for all queries if no excluded docs
                excluded_docs = {qid: "N/A" for qid in queries.keys()}
                log_with_timestamp(f"No excluded docs found for {eval_set}, using N/A", "DATA")
            
            log_with_timestamp(f"Dataset stats - Docs: {len(corpus)}, Queries: {len(queries)}, Qrels: {len(qrels)}", "DATA")
            
            # Sample for debug mode
            if DEBUG_MODE:
                max_docs = 1000
                max_queries = 50
                
                # Sample documents
                doc_ids = list(corpus.keys())
                if len(doc_ids) > max_docs:
                    sampled_doc_ids = doc_ids[:max_docs]
                    corpus = {doc_id: corpus[doc_id] for doc_id in sampled_doc_ids}
                    log_with_timestamp(f"Sampled corpus to {len(corpus)} documents", "DEBUG")
                
                # Sample queries
                query_ids = list(queries.keys())
                if len(query_ids) > max_queries:
                    sampled_query_ids = query_ids[:max_queries]
                    queries = {qid: queries[qid] for qid in sampled_query_ids}
                    qrels = {qid: qrels[qid] for qid in sampled_query_ids if qid in qrels}
                    excluded_docs = {qid: excluded_docs.get(qid, "N/A") for qid in sampled_query_ids}
                    log_with_timestamp(f"Sampled to {len(queries)} queries", "DEBUG")
            
            # Encode documents
            log_with_timestamp("Encoding documents...", "ENCODE")
            documents_embeddings = model.encode(
                sentences=list(corpus.values()),
                batch_size=50,  # Smaller batch size to avoid OOM
                is_query=False,
                show_progress_bar=True,
            )
            log_with_timestamp("Document encoding completed", "ENCODE")
            
            # Add documents to index
            log_with_timestamp("Adding documents to index...", "INDEX")
            index.add_documents(
                documents_ids=list(corpus.keys()),
                documents_embeddings=documents_embeddings,
            )
            log_with_timestamp("Documents added to index", "INDEX")
            
            # Create retriever
            retriever = retrieve.ColBERT(index=index)
            
            # Encode queries
            log_with_timestamp("Encoding queries...", "ENCODE")
            queries_embeddings = model.encode(
                sentences=list(queries.values()),
                is_query=True,
                show_progress_bar=True,
                batch_size=16,  # Smaller batch size for queries
            )
            log_with_timestamp("Query encoding completed", "ENCODE")
            
            # Retrieve
            log_with_timestamp("Retrieving results...", "RETRIEVE")
            scores = retriever.retrieve(queries_embeddings=queries_embeddings, k=100)
            log_with_timestamp("Retrieval completed", "RETRIEVE")
            
            # Filter excluded documents
            log_with_timestamp("Filtering excluded documents...", "FILTER")
            filtered_scores = []
            
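            # Pair each result list with its own query id rather than relying on
            # excluded_docs and queries sharing the same iteration order
            for qid, query_scores in zip(queries.keys(), scores):
                excluded_ids = excluded_docs.get(qid, "N/A")
                # Some splits have no excluded ids
                if excluded_ids == "N/A" or not excluded_ids:
                    filtered_scores.append(query_scores)
                    continue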
                
                filtered_query_scores = []
                for query_score in query_scores:
                    if query_score["id"] in excluded_ids:
                        continue
                    filtered_query_scores.append(query_score)
                filtered_scores.append(filtered_query_scores)
            
            log_with_timestamp("Exclusion filtering completed", "FILTER")
            
            # Evaluate
            log_with_timestamp("Computing evaluation metrics...", "METRICS") 
            evaluation_scores = evaluation.evaluate(
                scores=filtered_scores,
                qrels=qrels,
                queries=list(queries.keys()),
                metrics=["map", "ndcg@1", "ndcg@10", "ndcg@100", "recall@10", "recall@100"],
            )
            log_with_timestamp("Evaluation completed", "METRICS")
            
            # Save results
            srsly.write_json(result_file, evaluation_scores)
            log_with_timestamp(f"Results saved to {result_file}", "SAVE")
            
            # Store results
            results[eval_set] = evaluation_scores
            
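            # Log key scores (a metric may be a plain float or a {"mean": ...} dict)
            ndcg_raw = evaluation_scores.get("ndcg@10", 0.0)
            if isinstance(ndcg_raw, dict):
                ndcg_raw = ndcg_raw.get("mean", 0.0)
            ndcg_10 = ndcg_raw * 100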
            eval_time = time.time() - eval_start_time
            log_with_timestamp(f"Completed {eval_set}: nDCG@10 = {ndcg_10:.4f} (took {eval_time:.2f}s)", "SUCCESS")
            
        except Exception as e:
            error_msg = f"Error evaluating {model_name} on {eval_set}: {str(e)}"
            log_with_timestamp(error_msg, "ERROR")
            log_with_timestamp(f"Traceback:\n{traceback.format_exc()}", "ERROR")
            
            # Store error result
            results[eval_set] = {
                "error": str(e),
                "ndcg@10": {"mean": 0.0}
            }
        
        # Cleanup memory after each eval set
        try:
            cleanup_memory()
            log_with_timestamp("Memory cleanup completed", "CLEANUP")
            log_memory_status(f"after {eval_set}")
        except Exception as cleanup_error:
            log_with_timestamp(f"Cleanup warning: {str(cleanup_error)}", "WARNING")
    
    # Cleanup model
    del model
    cleanup_memory()
    
    log_with_timestamp(f"Completed evaluation for {model_name}", "COMPLETE")
    return results

def evaluate_dense_model_mteb(model_config: Dict[str, Any]) -> Dict[str, Dict[str, float]]:
    """
    Evaluate a dense model using MTEB's BRIGHT task directly
    
    Args:
        model_config: Dictionary with model configuration
        
    Returns:
        Dictionary with results for each eval_set
    """
    model_id = model_config["model_id"]
    model_name = model_config["name"]
    device = model_config.get("device", "cpu")
    trust_remote_code = model_config.get("trust_remote_code", False)
    
    log_with_timestamp(f"Starting MTEB evaluation for dense model {model_id}", "START")
    log_memory_status("initial")
    
    results = {}
    
    try:
        # Load model with special handling for Jina
        log_with_timestamp(f"Loading model {model_id} on {device}", "MODEL")
        
        if "jina" in model_id.lower():
            # Special handling for Jina models - use transformers directly
            tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
            model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
            model.to(device)
            
            # Create a wrapper for sentence transformers compatibility
            class JinaModelWrapper:
                def __init__(self, model, tokenizer, device):
                    self.model = model
                    self.tokenizer = tokenizer
                    self.device = device
                    
                def encode(self, sentences, batch_size=32, **kwargs):
                    embeddings = []
                    for i in range(0, len(sentences), batch_size):
                        batch = sentences[i:i+batch_size]
                        inputs = self.tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=512)
                        inputs = {k: v.to(self.device) for k, v in inputs.items()}
                        
                        with torch.no_grad():
                            outputs = self.model(**inputs)
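                            # Mean-pool over real tokens only, so padding does not dilute the embedding
                            hidden = outputs.last_hidden_state
                            mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
                            embeddings_batch = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
                            embeddings_batch = F.normalize(embeddings_batch, p=2, dim=1)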
                            embeddings.append(embeddings_batch.cpu())
                    
                    return torch.cat(embeddings, dim=0).numpy()
            
            model_wrapper = JinaModelWrapper(model, tokenizer, device)
            log_with_timestamp("Jina model loaded with custom wrapper", "MODEL")
            
        else:
            # Standard loading for other models
            model_kwargs = {}
            if trust_remote_code:
                model_kwargs["trust_remote_code"] = True
                
            model_wrapper = SentenceTransformer(model_id, device=device, **model_kwargs)
            log_with_timestamp("Model loaded successfully", "MODEL")
        
        # Load the BRIGHT task through MTEB
        log_with_timestamp("Loading BRIGHT task", "DATA")
        tasks = mteb.get_tasks(tasks=["BrightRetrieval"])
        task = tasks[0]
        
        # Determine which splits to evaluate. The ColBERT path above reads
        # task.corpus[eval_set]["standard"], so "standard" is assumed to be the
        # split name here as well (the per-domain eval sets are MTEB subsets).
        eval_splits = ["standard"]
        if DEBUG_MODE:
            log_with_timestamp(f"Debug mode: Using only {eval_splits} split", "DEBUG")
            
        # Run evaluation using MTEB framework
        log_with_timestamp(f"Running MTEB evaluation on {len(eval_splits)} splits", "EVAL")
        output_folder = os.path.join(OUTPUT_DIR, f"mteb_{model_name}")
        os.makedirs(output_folder, exist_ok=True)
        
        # Create MTEB evaluation
        evaluation_mteb = mteb.MTEB(tasks=[task])
        mteb_results = evaluation_mteb.run(
            model_wrapper,
            output_folder=output_folder,
            eval_splits=eval_splits,
            overwrite_results=True
        )
        
        # Process results
        log_with_timestamp("Processing evaluation results", "RESULTS")
        
        # Assumption: mteb 1.x, where run() returns TaskResult objects whose
        # .scores maps split -> list of per-subset score dicts
        task_result = mteb_results[0]  # First (and only) task
        
        for split, subset_scores in task_result.scores.items():
            for subset in subset_scores:
                eval_set = subset.get("hf_subset", split)
                results[eval_set] = {}
                for metric, value in subset.items():
                    if isinstance(value, bool) or not isinstance(value, (int, float)):
                        continue  # Skip metadata such as hf_subset and languages
                    key = "ndcg@10" if metric.lower() == "ndcg_at_10" else metric.lower()
                    results[eval_set][key] = {"mean": value}
                
                # Save results
                result_file = os.path.join(output_folder, f"{eval_set}_results.json")
                with open(result_file, 'w') as f:
                    json.dump(results[eval_set], f, indent=2)
                log_with_timestamp(f"Results for {eval_set} saved to {result_file}", "SAVE")
                
                # Log key scores
                ndcg_10 = results[eval_set].get("ndcg@10", {}).get("mean", 0.0) * 100
                log_with_timestamp(f"Completed {eval_set}: nDCG@10 = {ndcg_10:.4f}", "SUCCESS")
                
    except Exception as e:
        error_msg = f"Error in MTEB evaluation for {model_id}: {str(e)}"
        log_with_timestamp(error_msg, "ERROR")
        log_with_timestamp(f"Traceback:\n{traceback.format_exc()}", "ERROR")
        results["error"] = {"error": str(e), "ndcg@10": {"mean": 0.0}}
    
    finally:
        # Clean up
        if 'model' in locals():
            del model
        if 'model_wrapper' in locals():
            del model_wrapper
        cleanup_memory()
        log_with_timestamp("Memory cleanup completed", "CLEANUP")
        log_memory_status("final")
        
    log_with_timestamp(f"Completed MTEB evaluation for {model_id}", "COMPLETE")
    return results
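
The exclusion step in `evaluate_colbert_on_bright` removes, for each query, any retrieved document that BRIGHT marks as excluded for that query. The standalone sketch below shows the same idea on toy data, keyed by query id. `filter_excluded` is a hypothetical helper, and the `{"id": ..., "score": ...}` result layout is taken from how the code above reads `query_score["id"]`.

```python
def filter_excluded(query_ids, scores, excluded_docs):
    """Drop excluded document ids from each query's ranked results.

    query_ids     -- query ids in the same order as `scores`
    scores        -- per query, a ranked list of {"id": ..., "score": ...} dicts
    excluded_docs -- query id -> collection of doc ids to drop (entry may be
                     missing, empty, or the "N/A" placeholder used above)
    """
    filtered = []
    for qid, ranked in zip(query_ids, scores):
        excluded = excluded_docs.get(qid) or ()
        if excluded == "N/A":
            excluded = ()
        excluded = set(excluded)
        filtered.append([hit for hit in ranked if hit["id"] not in excluded])
    return filtered


# Toy data: q1 excludes d1, q2 has no exclusions
toy_query_ids = ["q1", "q2"]
toy_scores = [
    [{"id": "d1", "score": 0.9}, {"id": "d2", "score": 0.8}],
    [{"id": "d3", "score": 0.7}],
]
toy_excluded = {"q1": ["d1"], "q2": "N/A"}

toy_filtered = filter_excluded(toy_query_ids, toy_scores, toy_excluded)
print(toy_filtered)
```

Looking up exclusions by query id keeps the filter correct even if the exclusion mapping and the query list are ordered differently.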

## 6. Main Evaluation Pipeline

This is the main function to run the complete BRIGHT benchmark evaluation.

In [5]:
def run_bright_benchmark():
    """Run the complete BRIGHT benchmark evaluation"""
    log_with_timestamp("Starting BRIGHT benchmark evaluation", "START")
    
    # Set up logging file
    log_file = os.path.join(OUTPUT_DIR, "bright_evaluation_log.txt")
    with open(log_file, "w") as f:
        f.write(f"BRIGHT Evaluation Log - Started at {datetime.datetime.now()}\n")
        f.write(f"Debug mode: {DEBUG_MODE}\n\n")
    
    all_results = {}
    results_summary = []
    
    for model_config in MODELS_TO_TEST:
        log_with_timestamp(f"Evaluating model: {model_config['name']}", "MODEL")
        
        try:
            # Evaluate based on model type
            if model_config["type"] == "ColBERT":
                model_results = evaluate_colbert_on_bright(model_config)
            else:  # Dense models
                model_results = evaluate_dense_model_mteb(model_config)
                
            all_results[model_config["name"]] = model_results
            
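            # Extract scores for summary (a metric may be a plain float or a {"mean": ...} dict)
            for eval_set, scores in model_results.items():
                ndcg_raw = 0.0 if "error" in scores else scores.get("ndcg@10", 0.0)
                if isinstance(ndcg_raw, dict):
                    ndcg_raw = ndcg_raw.get("mean", 0.0)
                ndcg_10 = ndcg_raw * 100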
                
                results_summary.append({
                    "Model": model_config["name"],
                    "EvalSet": eval_set,
                    "nDCG@10": ndcg_10
                })
                
        except Exception as e:
            log_with_timestamp(f"Failed to evaluate {model_config['name']}: {str(e)}", "ERROR")
            all_results[model_config["name"]] = {"error": str(e)}
            
            # Add to log file
            with open(log_file, "a") as f:
                f.write(f"\nERROR: {model_config['name']} at {datetime.datetime.now()}\n")
                f.write(f"Error: {str(e)}\n")
                f.write(traceback.format_exc() + "\n\n")
    
    # Create results DataFrame
    results_df = pd.DataFrame(results_summary)
    
    # Save complete results
    results_file = os.path.join(OUTPUT_DIR, "bright_evaluation_complete_results.json")
    with open(results_file, 'w') as f:
        json.dump(all_results, f, indent=2)
    log_with_timestamp(f"Complete results saved to {results_file}", "SAVE")
    
    # Save summary
    summary_file = os.path.join(OUTPUT_DIR, "bright_evaluation_summary.csv")
    results_df.to_csv(summary_file, index=False)
    log_with_timestamp(f"Summary saved to {summary_file}", "SAVE")
    
    return results_df, all_results
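
The summary is built from nDCG@10 values scaled to percent. For readers who want to sanity-check a number by hand, here is a minimal sketch of nDCG@k with linear gains and a log2 discount. It is an illustration only: the scores in this notebook come from the evaluation libraries, whose implementations may differ in detail (for example in tie handling), and the helper names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance values."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids -- retrieved document ids, best first
    qrels          -- doc id -> graded relevance for this query
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy query with two relevant documents, retrieved at ranks 2 and 3
toy_qrels = {"d1": 1, "d3": 1}
toy_ranking = ["d2", "d1", "d3"]
toy_ndcg = ndcg_at_k(toy_ranking, toy_qrels, k=10)
print(f"nDCG@10 = {toy_ndcg * 100:.2f}%")
```

A ranking that places every relevant document first scores 1.0 (100%), and a query with no relevant documents is scored 0.0 in this sketch.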

## 7. Results Visualization

Functions to visualize the benchmark results.

In [6]:
def visualize_bright_results(results_df):
    """Create visualizations for BRIGHT results"""
    if results_df.empty:
        log_with_timestamp("No results to visualize", "WARNING")
        return
    
    log_with_timestamp("Creating visualizations", "PLOT")
    
    # Per-task performance
    plt.figure(figsize=(15, 8))
    sns.barplot(data=results_df, x="EvalSet", y="nDCG@10", hue="Model")
    plt.title("BRIGHT Benchmark Results by Evaluation Set", fontsize=16)
    plt.ylabel("nDCG@10 (%)")
    plt.xlabel("Evaluation Set")
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Model')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'bright_results_by_task.png'), dpi=300, bbox_inches='tight')
    plt.show()
    
    # Average performance
    model_avg = results_df.groupby('Model')['nDCG@10'].mean().reset_index()
    
    plt.figure(figsize=(10, 6))
    sns.barplot(data=model_avg, x="Model", y="nDCG@10")
    plt.title("Average BRIGHT Performance Across All Tasks", fontsize=16)
    plt.ylabel("Average nDCG@10 (%)")
    plt.xlabel("Model")
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'bright_results_average.png'), dpi=300, bbox_inches='tight')
    plt.show()
    
    # Summary table
    print("\nBRIGHT Evaluation Summary:")
    summary_table = results_df.pivot_table(
        values='nDCG@10', 
        index='Model', 
        columns='EvalSet', 
        aggfunc='mean'
    )
    summary_table['Average'] = summary_table.mean(axis=1)
    display(summary_table.round(2))
    
    # Save summary table
    summary_table.to_csv(os.path.join(OUTPUT_DIR, 'bright_summary_table.csv'))
    
    log_with_timestamp("Visualizations completed", "COMPLETE")

## 8. Run the Evaluation

In [7]:
# Log the start of the pipeline
log_with_timestamp("Starting BRIGHT evaluation pipeline", "MAIN")

# Run the evaluation
results_df, all_results = run_bright_benchmark()

# Create visualizations
visualize_bright_results(results_df)

# Print final summary
if not results_df.empty:
    best_model = results_df.groupby('Model')['nDCG@10'].mean().idxmax()
    best_score = results_df.groupby('Model')['nDCG@10'].mean().max()
    log_with_timestamp(f"Best performing model: {best_model} with average nDCG@10 = {best_score:.2f}%", "SUMMARY")

log_with_timestamp("BRIGHT evaluation pipeline completed", "MAIN")

[2025-09-01 18:36:26] [MAIN] Starting BRIGHT evaluation pipeline
[2025-09-01 18:36:26] [START] Starting BRIGHT benchmark evaluation
[2025-09-01 18:36:26] [MODEL] Evaluating model: Reason-ModernColBERT
[2025-09-01 18:36:26] [START] Starting evaluation for lightonai/Reason-ModernColBERT
[2025-09-01 18:36:26] [MEMORY] Memory Status initial - RAM: 0.13 GB (81.1%)
[2025-09-01 18:36:26] [DATA] Loading BRIGHT tasks from MTEB



KeyboardInterrupt



## 9. NanoBEIR Evaluation (Optional)

This section provides code to evaluate models on the NanoBEIR benchmark for comparison.

In [None]:
def evaluate_nanobeir(model_config):
    """Evaluate a model on the NanoBEIR benchmark"""
    log_with_timestamp(f"Starting NanoBEIR evaluation for {model_config['name']}", "START")
    
    try:
        if model_config["type"] == "ColBERT":
            # For ColBERT models, we need to use PyLate
            model = models.ColBERT(
                model_name_or_path=model_config["model_id"],
                query_length=model_config.get("query_length", 128),
                device=model_config.get("device", "cpu")
            )
            
            # Use PyLate's NanoBEIR evaluator
            from pylate.evaluation import NanoBEIREvaluator
            evaluator = NanoBEIREvaluator()
            results = evaluator(model)
            
        else:  # Dense models
            # For dense models, use SentenceTransformer
            model_kwargs = {}
            if model_config.get("trust_remote_code", False):
                model_kwargs["trust_remote_code"] = True
                
            model = SentenceTransformer(
                model_config["model_id"],
                device=model_config.get("device", "cpu"),
                **model_kwargs
            )
            
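            # Dense models use the sentence-transformers NanoBEIR evaluator;
            # PyLate's evaluator is meant for multi-vector (ColBERT) models
            from sentence_transformers.evaluation import NanoBEIREvaluator
            evaluator = NanoBEIREvaluator()
            results = evaluator(model)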
            
        # Save results
        output_dir = os.path.join(OUTPUT_DIR, "nanobeir")
        os.makedirs(output_dir, exist_ok=True)
        
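        result_file = os.path.join(output_dir, f"{model_config['name']}_nanobeir_results.json")
        
        # The evaluator returns prefixed metric names rather than a bare "nDCG@10".
        # Assumption: the aggregated score is the key containing "mean" and ending
        # in "ndcg@10"; expose it under a stable "nDCG@10" key for the summary.
        if "nDCG@10" not in results:
            ndcg_keys = [k for k in results if "mean" in k.lower() and k.lower().endswith("ndcg@10")]
            results["nDCG@10"] = float(results[ndcg_keys[0]]) if ndcg_keys else 0.0
        
        with open(result_file, 'w') as f:
            json.dump(results, f, indent=2, default=float)  # default=float handles NumPy scalars
            
        log_with_timestamp(f"NanoBEIR results for {model_config['name']} saved to {result_file}", "SAVE")
        log_with_timestamp(f"NanoBEIR nDCG@10: {results['nDCG@10']:.4f}", "RESULTS")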
        
        return results
        
    except Exception as e:
        error_msg = f"Error in NanoBEIR evaluation for {model_config['name']}: {str(e)}"
        log_with_timestamp(error_msg, "ERROR")
        log_with_timestamp(f"Traceback:\n{traceback.format_exc()}", "ERROR")
        return {"error": str(e), "nDCG@10": 0.0}
    
    finally:
        # Clean up
        if 'model' in locals():
            del model
        cleanup_memory()
        log_with_timestamp("Memory cleanup completed", "CLEANUP")
        
def run_nanobeir_benchmark():
    """Run NanoBEIR benchmark for all models"""
    log_with_timestamp("Starting NanoBEIR benchmark evaluation", "START")
    
    results = []
    
    for model_config in MODELS_TO_TEST:
        log_with_timestamp(f"Evaluating {model_config['name']} on NanoBEIR", "MODEL")
        
        try:
            model_results = evaluate_nanobeir(model_config)
            
            # Extract scores
            if "error" not in model_results:
                ndcg_10 = model_results.get("nDCG@10", 0.0) * 100
            else:
                ndcg_10 = 0.0
                
            results.append({
                "Model": model_config["name"],
                "Type": model_config["type"],
                "nDCG@10": ndcg_10
            })
            
        except Exception as e:
            log_with_timestamp(f"Failed to evaluate {model_config['name']} on NanoBEIR: {str(e)}", "ERROR")
            results.append({
                "Model": model_config["name"],
                "Type": model_config["type"],
                "nDCG@10": 0.0
            })
    
    # Create DataFrame
    results_df = pd.DataFrame(results)
    
    # Save results
    results_df.to_csv(os.path.join(OUTPUT_DIR, "nanobeir_results.csv"), index=False)
    
    # Visualize
    plt.figure(figsize=(12, 6))
    sns.barplot(data=results_df, x="Model", y="nDCG@10", hue="Type")
    plt.title("NanoBEIR Benchmark Results", fontsize=16)
    plt.ylabel("nDCG@10 (%)")
    plt.xlabel("Model")
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'nanobeir_results.png'), dpi=300, bbox_inches='tight')
    plt.show()
    
    # Print summary
    print("\nNanoBEIR Results:")
    display(results_df.sort_values("nDCG@10", ascending=False))
    
    log_with_timestamp("NanoBEIR evaluation completed", "COMPLETE")
    
    return results_df

In [None]:
# Uncomment to run NanoBEIR evaluation
# nanobeir_df = run_nanobeir_benchmark()

## 10. Combined Analysis (BRIGHT + NanoBEIR)

This section provides code to compare results across both benchmarks.

In [None]:
def combined_analysis(bright_df, nanobeir_df):
    """Create combined analysis of BRIGHT and NanoBEIR results"""
    if bright_df.empty or nanobeir_df.empty:
        log_with_timestamp("Missing data for combined analysis", "WARNING")
        return
    
    log_with_timestamp("Creating combined analysis", "ANALYSIS")
    
    # Calculate average BRIGHT scores per model
    bright_avg = bright_df.groupby('Model')['nDCG@10'].mean().reset_index()
    bright_avg['Benchmark'] = 'BRIGHT'
    
    # Prepare NanoBEIR data
    nanobeir_avg = nanobeir_df[['Model', 'nDCG@10']].copy()
    nanobeir_avg['Benchmark'] = 'NanoBEIR'
    
    # Combine data
    combined = pd.concat([bright_avg, nanobeir_avg])
    
    # Create comparison plot
    plt.figure(figsize=(14, 8))
    sns.barplot(data=combined, x='Model', y='nDCG@10', hue='Benchmark')
    plt.title('Comparison of Model Performance Across Benchmarks', fontsize=16)
    plt.ylabel('Average nDCG@10 (%)')
    plt.xlabel('Model')
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'combined_benchmark_comparison.png'), dpi=300, bbox_inches='tight')
    plt.show()
    
    # Create summary table
    summary = combined.pivot_table(values='nDCG@10', index='Model', columns='Benchmark')
    summary['Overall Average'] = summary.mean(axis=1)
    
    print('\nCombined Benchmark Summary:')
    display(summary.round(2).sort_values('Overall Average', ascending=False))
    
    # Save summary
    summary.to_csv(os.path.join(OUTPUT_DIR, 'combined_benchmark_summary.csv'))
    
    log_with_timestamp("Combined analysis completed", "COMPLETE")

In [None]:
# Uncomment to run combined analysis if both benchmarks have been evaluated
# if 'nanobeir_df' in locals():
#     combined_analysis(results_df, nanobeir_df)

## 11. Conclusion

This notebook implements an evaluation of embedding models on the BRIGHT benchmark, following the reference implementation linked at the top. In summary:

1. **Methodology**: MTEB loads the BRIGHT tasks, and PyLate with PLAID indexing handles the ColBERT models, as in the reference implementation. With `DEBUG_MODE = True`, only the first two eval sets are run and the corpus and queries are truncated, so debug-mode scores are not comparable to full-benchmark numbers.

2. **Models Evaluated**:
   - Late-Interaction: `Reason-ModernColBERT`, `GTE-ModernColBERT-v1`
   - Dense: `jina-embeddings-v3`, `Qwen3-Embedding-0.6B` (currently commented out in `MODELS_TO_TEST`)

3. **Key Results**: The summary tables and visualizations are produced by Section 8.

4. **Implementation Details**:
   - Proper memory management to avoid OOM issues
   - Comprehensive logging for debugging
   - Robust error handling
   - Results saved in standard formats
   - Visualizations for easy comparison

All results are saved in the `bright_evaluation_results` directory for further analysis.