# RAG System Evaluation Pipeline on RAGBench Dataset

This notebook implements a complete RAG (Retrieval-Augmented Generation) system evaluation pipeline on the RAGBench dataset.

## Project Overview
- **Task 1**: Build evaluation pipeline for RAG system on RAGBench dataset
- **Task 2**: Model implementation and optimization across different domains

## Features
- Document chunking strategies
- Embedding models (domain-specific)
- Vector stores (FAISS/Chroma)
- Retrieval mechanisms (Dense/Sparse/Hybrid)
- LLM response generation (Groq API)
- LLM-as-judge for evaluation
- Metrics computation (Context Relevance, Utilization, Completeness, Adherence)
- Performance comparison with ground truth (RMSE, AUC-ROC)
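The last bullet names two comparison metrics. As a minimal sketch of what they measure, the block below computes RMSE and AUC-ROC in plain Python on made-up numbers. The helper names `rmse` and `auc_roc` are hypothetical, and pairing RMSE with a continuous score and AUC-ROC with a binary label is an assumption made for illustration; the notebook itself imports the scikit-learn equivalents.

```python
import math
from typing import List

def rmse(predicted: List[float], truth: List[float]) -> float:
    """Root mean squared error between two equally long score lists."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, truth)]
    return math.sqrt(sum(squared) / len(squared))

def auc_roc(labels: List[int], scores: List[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up numbers, for illustration only
score_rmse = rmse([0.2, 0.5, 0.9], [0.25, 0.4, 1.0])
label_auc = auc_roc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.1])
print(round(score_rmse, 4), label_auc)
```

A lower RMSE means the predicted scores sit closer to the ground-truth scores; an AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.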


## 1. Install Required Packages


In [None]:
# 1. Install required packages (Consolidated)
%pip install -q \
  datasets \
  transformers \
  sentence-transformers \
  faiss-cpu \
  chromadb \
  groq \
  rank-bm25 \
  scikit-learn \
  scipy \
  pandas \
  pyyaml \
  loguru \
  tqdm \
  python-dotenv \
  opentelemetry-sdk==1.37.0 \
  opentelemetry-api==1.37.0 \
  opentelemetry-proto==1.37.0 \
  opentelemetry-exporter-otlp-proto-common==1.37.0

print("✅ All packages installed successfully!")

✅ All packages installed successfully!


## 2. Import Libraries


In [None]:
import os
import json
import re
import time
import numpy as np
import pandas as pd
import warnings
import torch
import nltk
import faiss
from typing import List, Dict, Tuple, Optional, Any
from tqdm import tqdm
from nltk.tokenize import sent_tokenize
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, CrossEncoder
from rank_bm25 import BM25Okapi
from groq import Groq
from sklearn.metrics import mean_squared_error, roc_auc_score
from google.colab import userdata

warnings.filterwarnings('ignore')
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True) # Added to fix the LookupError




True

In [None]:
# Step 2.1: Clear existing keys to ensure fresh setup
if 'GROQ_API_KEY' in os.environ:
    del os.environ['GROQ_API_KEY']
    print("Environment variable cleared.")
print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


## 3. Configuration

Set your configuration here:


In [None]:
import itertools
try:
    GROQ_API_KEY = userdata.get('GROQ_API_KEY')
except Exception:
    # Placeholder fallback: replace with a real key, or add GROQ_API_KEY to Colab secrets
    GROQ_API_KEY = "GROQ_API_KEY"

# ========== DATASET CONFIG ==========
DATASET_REPOSITORY = "galileo-ai/ragbench"
DOMAIN = "customer_support"
SUBSETS = ["delucionqa", "emanual", "techqa"]
SPLIT = "test"

# ========== MODEL SELECTION ==========
EMBEDDING_MODEL = "BAAI/bge-large-en-v1.5"
CROSS_ENCODER_MODEL = "BAAI/bge-reranker-v2-m3"
GENERATION_MODEL = "llama-3.1-8b-instant"
JUDGE_MODEL = "llama-3.3-70b-versatile"

CONFIG_MATRIX = {
    "RETRIEVAL_TYPE": ["hybrid", "vector", "bm25"],
    "HYBRID_ALPHA": [0.3, 0.5, 0.7],     # 0.3=Keyword Heavy, 0.7=Semantic Heavy
    "TOP_K_FINAL": [5, 10, 15],         # Depth of context
    "USE_CROSS_ENCODER": [True, False], # Importance of re-ranking
    "GENERATION_TEMPERATURE": [0.0],    # Kept at 0.0 for maximum factuality in support
    "MAX_CONTEXT_LENGTH": [2048, 4096]  # Test if longer context helps 'emanual'
}

# ========== HELPER: GENERATE ALL COMBINATIONS ==========
def get_configurations(matrix):
    """Generates a list of all possible parameter combinations."""
    keys, values = zip(*matrix.items())
    # Clean up: Alpha only matters if retrieval is 'hybrid'
    configs = [dict(zip(keys, v)) for v in itertools.product(*values)]

    # Filter logic: if RETRIEVAL_TYPE is not hybrid, HYBRID_ALPHA is irrelevant
    unique_configs = []
    for c in configs:
        if c['RETRIEVAL_TYPE'] != 'hybrid':
            c['HYBRID_ALPHA'] = None # Standardize non-hybrid configs
        if c not in unique_configs:
            unique_configs.append(c)
    return unique_configs

all_configs = get_configurations(CONFIG_MATRIX)

# ========== OUTPUT ==========
RESULTS_DIR = "/content/results"
os.makedirs(RESULTS_DIR, exist_ok=True)

print(f"🚀 System Initialized for RAGBench Domain: {DOMAIN}")
print(f"📂 Subset Focus: {', '.join(SUBSETS)}")
print(f"📊 Total Unique Experiment Combinations: {len(all_configs)}")
print(f"📁 Results Path: {RESULTS_DIR}")

🚀 System Initialized for RAGBench Domain: customer_support
📂 Subset Focus: delucionqa, emanual, techqa
📊 Total Unique Experiment Combinations: 60
📁 Results Path: /content/results
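The figure of 60 follows from the grid: the full Cartesian product has 3 × 3 × 3 × 2 × 1 × 2 = 108 entries, but the three alpha values collapse into one for the two non-hybrid retrieval types, leaving 36 hybrid + 12 vector + 12 bm25 configurations. The sketch below re-derives that count on a copy of the grid; `count_unique_configs` is a hypothetical helper, not part of the notebook.

```python
import itertools

# Copy of the grid above, used only to re-derive the count
matrix = {
    "RETRIEVAL_TYPE": ["hybrid", "vector", "bm25"],
    "HYBRID_ALPHA": [0.3, 0.5, 0.7],
    "TOP_K_FINAL": [5, 10, 15],
    "USE_CROSS_ENCODER": [True, False],
    "GENERATION_TEMPERATURE": [0.0],
    "MAX_CONTEXT_LENGTH": [2048, 4096],
}

def count_unique_configs(grid):
    """Counts combinations after nulling HYBRID_ALPHA for non-hybrid retrieval."""
    keys = list(grid)
    seen = set()
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        if cfg["RETRIEVAL_TYPE"] != "hybrid":
            cfg["HYBRID_ALPHA"] = None  # alpha is irrelevant without blending
        seen.add(tuple(cfg.items()))    # a set removes the duplicates
    return len(seen)

total = count_unique_configs(matrix)
print(total)  # 60
```

Using a set of tuples for deduplication avoids the quadratic `c not in unique_configs` scan, which only matters if the grid grows much larger.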


## 3.1 Experiment Configuration

Define multiple experiments to run automatically:


In [None]:
# Updated Experiment Grid with specialized Customer Care strategies
EXPERIMENTS = [
    # --- BASELINE: Sentence-based (High Coherence for General Q&A) ---
    {
        "name": "cs_sentence_hybrid_delucionqa",
        "subset": "delucionqa",
        "chunking_strategy": "sentence_based",
        "retrieval_type": "hybrid",
        "domain": "customer_support",
        "embedding_model": EMBEDDING_MODEL,
        "top_k_final": 10,
        "use_cross_encoder": True,
        "hybrid_alpha": 0.5 # Balanced for intent + keywords
    },

    # --- TECHNICAL PRECISION: Fixed-size + BM25 (Best for Error Codes/Tech Specs) ---
    {
        "name": "cs_fixed_bm25_techqa",
        "subset": "techqa",
        "chunking_strategy": "fixed_size",
        "retrieval_type": "bm25", # Testing pure keyword match for tech support
        "domain": "customer_support",
        "embedding_model": EMBEDDING_MODEL,
        "top_k_final": 5, # Smaller K for higher precision
        "use_cross_encoder": False
    },

    # --- COMPLEX STRUCTURE: Recursive (Best for Hierarchical Manuals) ---
    {
        "name": "cs_recursive_hybrid_emanual",
        "subset": "emanual",
        "chunking_strategy": "recursive",
        "retrieval_type": "hybrid",
        "domain": "customer_support",
        "embedding_model": EMBEDDING_MODEL,
        "top_k_final": 15, # Larger K to capture multi-part instructions
        "use_cross_encoder": True,
        "hybrid_alpha": 0.4 # Slightly favors keywords for part names
    },

    # --- SEMANTIC INTELLIGENCE: (Best for nuanced troubleshooting) ---
    {
        "name": "cs_semantic_hybrid_emanual",
        "subset": "emanual",
        "chunking_strategy": "semantic",
        "retrieval_type": "hybrid",
        "domain": "customer_support",
        "embedding_model": EMBEDDING_MODEL,
        "top_k_final": 10,
        "use_cross_encoder": True,
        "hybrid_alpha": 0.6 # Favors semantic meaning
    },

    # --- HYBRID RECOVERY: TechQA with Recursive + Hybrid ---
    {
        "name": "cs_recursive_hybrid_techqa",
        "subset": "techqa",
        "chunking_strategy": "recursive",
        "retrieval_type": "hybrid",
        "domain": "customer_support",
        "embedding_model": EMBEDDING_MODEL,
        "top_k_final": 10,
        "use_cross_encoder": True,
        "hybrid_alpha": 0.3 # Strong keyword bias (alpha 0.3) for tech troubleshooting
    }
]

EXPERIMENTS = [exp for exp in EXPERIMENTS if not (exp['chunking_strategy'] == 'semantic' and exp['retrieval_type'] == 'bm25')]

print(f"✅ Defined {len(EXPERIMENTS)} optimized experiments.")

✅ Defined 5 optimized experiments.


In [None]:
class ContextRepacker:
    """
    Optimizes the placement of retrieved chunks within the context window
    to mitigate 'Lost in the Middle' issues and stay within token limits.
    """
    def __init__(self, max_context_length: int = 4096):
        self.max_context_length = max_context_length

    def _truncate_to_limit(self, candidates: List[str], max_chars: int) -> List[str]:
        """Ensures total character count stays within limits."""
        packed = []
        cur_len = 0
        for text in candidates:
            if cur_len + len(text) > max_chars:
                break
            packed.append(text)
            cur_len += len(text)
        return packed

    def pack_forward(self, candidates: List[str], max_chars: int) -> str:
        """Standard order: most relevant first. Good for 'techqa' exact matches."""
        selected = self._truncate_to_limit(candidates, max_chars)
        return "\n\n".join(selected).strip()

    def pack_reverse(self, candidates: List[str], max_chars: int) -> str:
        """Reverse order: most relevant last. Useful if the LLM has a strong recency bias."""
        selected = self._truncate_to_limit(list(reversed(candidates)), max_chars)
        return "\n\n".join(selected).strip()

    def pack_sides(self, candidates: List[str], max_chars: int) -> str:
        """
        Places the most relevant chunks at the start and end of the context and
        the least relevant ones in the middle.
        Intended for 'emanual', where instructions might span multiple non-sequential chunks.
        """
        # Apply the character budget first, keeping the highest-ranked chunks
        selected = self._truncate_to_limit(candidates, max_chars)
        front, back = [], []
        for rank, text in enumerate(selected):
            # Alternate by rank: 1st, 3rd, 5th... fill the front; 2nd, 4th... fill the back
            if rank % 2 == 0:
                front.append(text)
            else:
                back.append(text)
        # Reverse the back half so the 2nd-ranked chunk ends up last
        return "\n\n".join(front + back[::-1]).strip()

    def repack(self, candidates: List[str], strategy: str = "sides", max_chars: Optional[int] = None) -> str:
        if max_chars is None:
            max_chars = self.max_context_length * 3  # Roughly 3 chars per token fallback

        strategy = strategy.lower()
        if strategy == "reverse": return self.pack_reverse(candidates, max_chars)
        if strategy == "sides": return self.pack_sides(candidates, max_chars)
        return self.pack_forward(candidates, max_chars)

# Initialize global repacker based on Config Matrix
repacker = ContextRepacker(max_context_length=max(CONFIG_MATRIX["MAX_CONTEXT_LENGTH"]))
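
The "sides" idea is easier to see on labels than on text. The sketch below shows one common way to realize it: chunks alternate between the front and the back by rank, so the weakest ones land in the middle. `reorder_sides` is a hypothetical standalone helper, and its ordering is not guaranteed to match what `ContextRepacker.pack_sides` produces.

```python
from typing import List

def reorder_sides(ranked: List[str]) -> List[str]:
    """Puts the highest-ranked items at both ends and the lowest-ranked in the middle."""
    front, back = [], []
    for rank, item in enumerate(ranked):
        # Even ranks (1st, 3rd, ...) go to the front, odd ranks (2nd, 4th, ...) to the back
        (front if rank % 2 == 0 else back).append(item)
    return front + back[::-1]

order = reorder_sides(["r1", "r2", "r3", "r4", "r5"])
print(order)  # ['r1', 'r3', 'r5', 'r4', 'r2']
```

Here `r1` and `r2`, the two most relevant chunks, occupy the first and last positions, while `r5` sits in the middle.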

## 4. Helper Functions

### 4.1 Document Chunking


In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class DocumentChunker:
    """
    Intelligently splits documents into manageable chunks based on the experiment
    configuration. Supports technical manual structures and semantic shifts.
    """
    def __init__(self, strategy: str, embedding_model=None, config: Optional[Dict] = None):
        self.strategy = strategy
        self.embedding_model = embedding_model
        self.config = config or {}

        # Default hyper-parameters for Customer Support domain
        self.chunk_size = self.config.get("chunk_size", 500)      # Words/Tokens approx
        self.chunk_overlap = self.config.get("chunk_overlap", 50) # Context preservation
        self.breakpoint_percentile = self.config.get("breakpoint_percentile", 95)

    def chunk_documents(self, documents: List[str]) -> List[Dict]:
        """Processes a list of documents and returns structured chunks."""
        all_chunks = []
        for doc_id, doc_text in enumerate(documents):
            # Route to specific strategy
            if self.strategy == "semantic":
                chunks = self._semantic_chunking(doc_text)
            elif self.strategy == "recursive":
                chunks = self._recursive_chunking(doc_text)
            elif self.strategy == "sentence_based":
                chunks = self._sentence_based_chunking(doc_text)
            else: # fixed_size baseline
                chunks = self._fixed_size_chunking(doc_text)

            for chunk_id, text in enumerate(chunks):
                all_chunks.append({
                    "chunk_id": f"doc_{doc_id}_ch_{chunk_id}",
                    "text": text,
                    "doc_id": doc_id,
                    "strategy": self.strategy
                })
        return all_chunks

    def _fixed_size_chunking(self, text: str) -> List[str]:
        """Simple word-count based split. High speed, low context awareness."""
        words = text.split()
        chunks = []
        step = max(1, self.chunk_size - self.chunk_overlap)
        for i in range(0, len(words), step):
            chunk = " ".join(words[i : i + self.chunk_size])
            if chunk.strip():
                chunks.append(chunk)
        return chunks

    def _sentence_based_chunking(self, text: str) -> List[str]:
        """Splits by NLTK sentences to avoid mid-sentence breaks."""
        sentences = sent_tokenize(text)
        chunks, current_chunk, current_len = [], [], 0
        for sent in sentences:
            sent_words = len(sent.split())
            if current_len + sent_words > self.chunk_size and current_chunk:
                chunks.append(" ".join(current_chunk))
                current_chunk, current_len = [], 0
            current_chunk.append(sent)
            current_len += sent_words
        if current_chunk:
            chunks.append(" ".join(current_chunk))
        return chunks

    def _recursive_chunking(self, text: str) -> List[str]:
        """
        Hierarchical split (Paragraph -> Sentence).
        Ideal for 'emanual' where structure (steps) must be preserved.
        """
        # First split by double newlines (paragraphs)
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        final_chunks = []

        for para in paragraphs:
            if len(para.split()) <= self.chunk_size:
                final_chunks.append(para)
            else:
                # If paragraph is too large, split into sentences
                sentences = sent_tokenize(para)
                curr, curr_len = [], 0
                for s in sentences:
                    s_len = len(s.split())
                    if curr_len + s_len > self.chunk_size and curr:
                        final_chunks.append(" ".join(curr))
                        curr, curr_len = [], 0
                    curr.append(s)
                    curr_len += s_len
                if curr: final_chunks.append(" ".join(curr))
        return final_chunks

    def _semantic_chunking(self, text: str) -> List[str]:
        """
        Uses embeddings to detect topical shifts.
        Highly effective for 'delucionqa' conversational data.
        """
        sentences = sent_tokenize(text)
        if len(sentences) < 2:
            return [text]

        # 1. Embed sentences using the model from Step 3
        embeddings = self.embedding_model.encode(sentences)

        # 2. Calculate distances between consecutive sentences
        distances = []
        for i in range(len(embeddings) - 1):
            sim = cosine_similarity([embeddings[i]], [embeddings[i+1]])[0][0]
            distances.append(1 - sim) # Higher distance = Semantic shift

        # 3. Determine threshold for breakpoints
        threshold = np.percentile(distances, self.breakpoint_percentile)

        # 4. Group sentences
        chunks = []
        current_chunk = [sentences[0]]
        for i, dist in enumerate(distances):
            if dist > threshold:
                chunks.append(" ".join(current_chunk))
                current_chunk = [sentences[i+1]]
            else:
                current_chunk.append(sentences[i+1])

        if current_chunk:
            chunks.append(" ".join(current_chunk))
        return chunks

print("✅ DocumentChunker initialized with all strategies.")

✅ DocumentChunker initialized with all strategies.
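
The percentile threshold in `_semantic_chunking` controls how many breakpoints survive. The sketch below replays the grouping step on made-up distances to show the effect: at the 95th percentile only the single largest jump splits the text, while a lower percentile also catches the second jump. `percentile` and `group_by_breakpoints` are hypothetical pure-Python stand-ins for `np.percentile` (default linear interpolation) and the grouping loop.

```python
from typing import List

def percentile(values: List[float], q: float) -> float:
    """Linear-interpolation percentile, the default behaviour of numpy.percentile."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def group_by_breakpoints(sentences: List[str], distances: List[float], threshold: float) -> List[List[str]]:
    """Starts a new chunk whenever the distance to the next sentence exceeds the threshold."""
    chunks, current = [], [sentences[0]]
    for i, dist in enumerate(distances):
        if dist > threshold:
            chunks.append(current)
            current = [sentences[i + 1]]
        else:
            current.append(sentences[i + 1])
    chunks.append(current)
    return chunks

# Made-up cosine distances between 7 consecutive sentences (two clear topic shifts)
sentences = [f"s{i}" for i in range(7)]
distances = [0.10, 0.12, 0.55, 0.11, 0.09, 0.60]

strict = group_by_breakpoints(sentences, distances, percentile(distances, 95))
loose = group_by_breakpoints(sentences, distances, percentile(distances, 60))
print(len(strict), len(loose))
```

With six distances, the 95th-percentile threshold falls between the two largest values, so the 0.55 shift is not treated as a breakpoint. For short documents a lower `breakpoint_percentile` may therefore be worth testing.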


### 4.2 Embedding Model


In [None]:
def load_models_for_experiment(exp: Dict) -> Tuple[Optional[SentenceTransformer], Optional[int], Optional[CrossEncoder]]:
    """
    Dynamically loads embedding and reranking models based on experiment requirements.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    embedding_model, embedding_dim, cross_encoder = None, None, None

    # Load Embedding Model for Vector/Hybrid retrieval or Semantic Chunking
    if exp.get("retrieval_type") in ["vector", "hybrid"] or exp.get("chunking_strategy") == "semantic":
        model_name = exp.get("embedding_model", EMBEDDING_MODEL)
        embedding_model = SentenceTransformer(model_name, device=device)
        embedding_dim = embedding_model.get_sentence_embedding_dimension()
        print(f"  - Loaded Embedding Model: {model_name} (Dim: {embedding_dim})")

    # Load Cross-Encoder for Reranking
    if exp.get("use_cross_encoder", False):
        ce_name = CROSS_ENCODER_MODEL
        cross_encoder = CrossEncoder(ce_name, device=device)
        print(f"  - Loaded Cross-Encoder: {ce_name}")

    return embedding_model, embedding_dim, cross_encoder

### 4.3 Vector Store (FAISS)


In [None]:
class FAISSVectorStore:
    """
    FAISS-based vector database for high-speed dense retrieval.
    """
    def __init__(self, embedding_dim: int, index_type: str = "InnerProduct"):
        if index_type == "InnerProduct":
            self.index = faiss.IndexFlatIP(embedding_dim)
        else:
            self.index = faiss.IndexFlatL2(embedding_dim)

        self.index_type = index_type
        self.metadata = []

    def add_documents(self, embeddings: np.ndarray, metadatas: List[Dict]):
        embeddings = embeddings.astype("float32")
        if self.index_type == "InnerProduct":
            faiss.normalize_L2(embeddings)
        self.index.add(embeddings)
        self.metadata.extend(metadatas)

    def search(self, query_embedding: np.ndarray, top_k: int) -> List[Dict]:
        q_emb = query_embedding.reshape(1, -1).astype("float32")
        if self.index_type == "InnerProduct":
            faiss.normalize_L2(q_emb)

        scores, indices = self.index.search(q_emb, top_k)
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx != -1:
                hit = self.metadata[idx].copy()
                hit["dense_score"] = float(score)
                results.append(hit)
        return results
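
The store normalizes vectors before adding and searching because the inner product of two unit-length vectors equals their cosine similarity. The sketch below checks that identity in plain Python on a made-up pair of vectors; the helper names are hypothetical and no FAISS call is involved.

```python
import math
from typing import List

def l2_normalize(v: List[float]) -> List[float]:
    """Scales a vector to unit length (what faiss.normalize_L2 does per row)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ip_after_norm = dot(l2_normalize(a), l2_normalize(b))
print(round(ip_after_norm, 2), round(cosine(a, b), 2))
```

Both values agree, which is why an `IndexFlatIP` index over normalized embeddings ranks results by cosine similarity.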

### 4.4 Retrieval Mechanisms with Cross-Encoder Reranking


In [None]:
class HybridRetrieverWithRerank:
    """
    Unified retrieval engine that manages BM25, Vector Search,
    and Cross-Encoder Reranking logic.
    """
    def __init__(self, config: Dict, vector_store: FAISSVectorStore,
                 embedding_model: SentenceTransformer, cross_encoder: Optional[CrossEncoder]):
        self.retrieval_type = config.get("retrieval_type", "hybrid")
        self.vector_store = vector_store
        self.embedding_model = embedding_model
        self.cross_encoder = cross_encoder
        self.hybrid_alpha = config.get("hybrid_alpha", 0.5)
        self.bm25 = None
        self.corpus_chunks = []

    def set_corpus(self, chunks: List[Dict]):
        """Initializes BM25 index with the provided document chunks."""
        self.corpus_chunks = chunks
        tokenized_corpus = [c["text"].lower().split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized_corpus)

    def retrieve(self, query: str, top_k_final: int = 10) -> List[Dict]:
        # Step 1: Broad Retrieval (get a pool of candidates)
        pool_size = 50
        candidates = {}

        # A. Sparse Retrieval (BM25)
        if self.retrieval_type in ["bm25", "hybrid"]:
            bm25_scores = self.bm25.get_scores(query.lower().split())
            top_indices = np.argsort(bm25_scores)[::-1][:pool_size]
            for i in top_indices:
                hit = self.corpus_chunks[i].copy()
                hit["sparse_score"] = float(bm25_scores[i])
                candidates[hit["chunk_id"]] = hit

        # B. Dense Retrieval (Vector)
        if self.retrieval_type in ["vector", "hybrid"]:
            q_emb = self.embedding_model.encode(query, convert_to_numpy=True)
            vector_hits = self.vector_store.search(q_emb, pool_size)
            for hit in vector_hits:
                cid = hit["chunk_id"]
                if cid in candidates:
                    candidates[cid]["dense_score"] = hit["dense_score"]
                else:
                    candidates[cid] = hit

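        # Step 2: Hybrid Scoring (Alpha Blending)
        hits = list(candidates.values())
        if self.retrieval_type == "hybrid":
            # Min-max normalize each score type first: raw BM25 scores are unbounded,
            # while cosine similarities lie in [-1, 1], so unscaled blending lets BM25 dominate.
            for key in ("sparse_score", "dense_score"):
                vals = [h[key] for h in hits if key in h]
                lo = min(vals) if vals else 0.0
                span = (max(vals) - lo) if vals else 0.0
                for h in hits:
                    # A candidate missing from one retriever contributes 0 for that component
                    h[key + "_norm"] = (h[key] - lo) / span if key in h and span > 0 else 0.0
            for h in hits:
                # Weighted blend: alpha weights the dense (semantic) component
                h["combined_score"] = (self.hybrid_alpha * h["dense_score_norm"]) + ((1 - self.hybrid_alpha) * h["sparse_score_norm"])
            hits = sorted(hits, key=lambda x: x["combined_score"], reverse=True)
        else:
            # Sort by the single score produced by the active retriever
            score_key = "sparse_score" if self.retrieval_type == "bm25" else "dense_score"
            hits = sorted(hits, key=lambda x: x.get(score_key, 0), reverse=True)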

        # Step 3: Cross-Encoder Reranking (Precision Layer)
        if self.cross_encoder and hits:
            # Rerank the top 25 candidates; the final cut to top_k_final happens at return
            rerank_pool = hits[:25]
            ce_inputs = [[query, h["text"]] for h in rerank_pool]
            ce_scores = self.cross_encoder.predict(ce_inputs)
            for h, s in zip(rerank_pool, ce_scores):
                h["ce_score"] = float(s)
            hits = sorted(rerank_pool, key=lambda x: x["ce_score"], reverse=True)

        return hits[:top_k_final]
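
Alpha blending only behaves as a semantic/keyword dial when both score types share a scale. The sketch below uses made-up scores to contrast blending raw values with blending min-max normalized values; `minmax` and `blend` are hypothetical helpers written for this illustration, not methods of the class above.

```python
from typing import Dict, List

def minmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescales scores to [0, 1]; a constant list maps to 0.0 (no ranking signal)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span > 0 else 0.0 for k, v in scores.items()}

def blend(sparse: Dict[str, float], dense: Dict[str, float], alpha: float) -> List[str]:
    """Ranks ids by alpha * dense + (1 - alpha) * sparse, best first."""
    combined = {k: alpha * dense[k] + (1 - alpha) * sparse[k] for k in sparse}
    return sorted(combined, key=combined.get, reverse=True)

# Made-up scores: BM25 is unbounded, cosine similarity is at most 1
sparse = {"A": 12.0, "B": 4.0, "C": 8.0}
dense = {"A": 0.60, "B": 0.90, "C": 0.70}

ranking_raw = blend(sparse, dense, alpha=0.7)
ranking_norm = blend(minmax(sparse), minmax(dense), alpha=0.7)
print(ranking_raw, ranking_norm)
```

Even with `alpha=0.7`, the raw blend simply reproduces the BM25 order because the sparse scores are an order of magnitude larger. After normalization, the semantically closest chunk `B` ranks first, as a semantic-heavy alpha would suggest.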

In [None]:
# Global Groq Client
groq_client = Groq(api_key=GROQ_API_KEY)

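def generate_response(question: str, retrieved_chunks: List[Dict], exp_config: Dict):
    """
    Uses Groq to generate a final support answer based on repacked context.
    Note: section 4.5 redefines generate_response with a different signature;
    once that cell runs, the later definition is the one in effect.
    """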
    # 1. Repack context using the strategy defined for the domain
    # Use 'sides' for manuals, 'forward' for troubleshooting
    repack_strategy = "sides" if exp_config.get("subset") == "emanual" else "forward"
    max_len = exp_config.get("MAX_CONTEXT_LENGTH", 3000)

    context_text = repacker.repack(
        [c["text"] for c in retrieved_chunks],
        strategy=repack_strategy,
        max_chars=max_len
    )

    # 2. Construct Domain-Specific Prompts
    system_prompt = (
        "You are an expert Customer Support Assistant. "
        "Your task is to provide accurate, helpful answers based ONLY on the provided context. "
        "If the answer is not in the context, state that you do not have enough information."
    )

    user_prompt = f"### CONTEXT ###\n{context_text}\n\n### QUESTION ###\n{question}\n\n### ANSWER ###"

    try:
        response = groq_client.chat.completions.create(
            model=GENERATION_MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.0, # Factuality first
            max_tokens=1024
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"Pipeline Error: {str(e)}"

print("✅ Step 5 (Retrieval & Generation) complete. Pipeline flow validated!")

✅ Step 5 (Retrieval & Generation) complete. Pipeline flow validated!


### 4.5 LLM Generator and Judge


In [None]:
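# Configuration Constants
# Note: GENERATION_MODEL below rebinds the value set in Section 3 (the generator
# switches from "llama-3.1-8b-instant" to the 70B model), and GENERATION_TEMPERATURE
# of 0.7 differs from the 0.0 listed in CONFIG_MATRIX.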
JUDGE_TEMPERATURE = 0.0
GENERATION_MODEL = "llama-3.3-70b-versatile"
JUDGE_MODEL = "llama-3.3-70b-versatile"
GENERATION_TEMPERATURE = 0.7
GENERATION_MAX_TOKENS = 1024

def generate_response(question: str, retrieved_chunks: List[Dict], model: str = GENERATION_MODEL) -> str:
    """Generate response using LLM with side-packed context to mitigate lost-in-the-middle."""

    # 1. Prepare context using the Repacker defined in Step 3.2
    raw_texts = [c["text"] for c in retrieved_chunks]
    context_text = repacker.repack(raw_texts, strategy="sides", max_chars=3000)

    system_prompt = """You are a helpful customer support assistant.
    Use only the information from the context documents to answer the question.
    If the context doesn't contain the answer, explicitly state that you don't have enough information."""

    user_prompt = f"""Context Documents:
{context_text}

Question: {question}

Answer:"""

    try:
        response = groq_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=GENERATION_TEMPERATURE,
            max_tokens=GENERATION_MAX_TOKENS
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"❌ Error generating response: {e}")
        return ""

def _alpha_key(n: int) -> str:
    """Maps 0 -> 'a', 25 -> 'z', 26 -> 'aa', so keys stay alphabetic past 26 sentences."""
    key = ""
    n += 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        key = chr(97 + rem) + key
    return key

def format_docs_for_judge(chunks: List[Dict]) -> str:
    """Formats chunks into a numbered sentence list (0a, 0b, ...) for the judge."""
    formatted = []
    for i, chunk in enumerate(chunks):
        sentences = sent_tokenize(chunk['text'])
        for j, sent in enumerate(sentences):
            key = f"{i}{_alpha_key(j)}" # e.g., 0a, 0b
            formatted.append(f"{key}. {sent}")
    return "\n".join(formatted)

def format_response_for_judge(response: str) -> str:
    """Formats the LLM response into keys (a., b., ...) for the judge."""
    sentences = sent_tokenize(response)
    return "\n".join([f"{_alpha_key(i)}. {sent}" for i, sent in enumerate(sentences)])

def extract_attributes(question: str, retrieved_chunks: List[Dict], response: str) -> Dict:
    """Extract evaluation attributes using LLM-as-judge with JSON mode."""

    if not response:
        return {"overall_supported": False, "explanation": "No response generated"}

    # Format inputs for the detailed Judge Prompt
    formatted_docs = format_docs_for_judge(retrieved_chunks)
    formatted_response = format_response_for_judge(response)

    # Prompt remains identical to your provided schema
    prompt = f"""[I asked someone to answer a question based on one or more documents.
Your task is to review their response and assess whether or not each sentence
in that response is supported by text in the documents. And if so, which
sentences in the documents provide that support. You will also tell me which
of the documents contain useful information for answering the question, and
which of the documents the answer was sourced from.
Here are the documents, each of which is split into sentences. Alongside each
sentence is an associated key, such as '0a.' or '0b.' that you can use to refer
to it:]

Documents:
{formatted_docs}

Question:
{question}
'''
Here is their response, split into sentences. Alongside each sentence is
an associated key, such as 'a.' or 'b.' that you can use to refer to it. Note
that these keys are unique to the response, and are not related to the keys
in the documents:
'''
Response:
{formatted_response}

You must respond with a JSON object matching this schema:
''' {{
  "relevance_explanation": string,
  "all_relevant_sentence_keys": [string],
  "overall_supported_explanation": string,
  "overall_supported": boolean,
  "sentence_support_information": [
    {{
      "response_sentence_key": string,
      "explanation": string,
      "supporting_sentence_keys": [string],
      "fully_supported": boolean
    }}
  ],
  "all_utilized_sentence_keys": [string]
}}
'''
The relevance_explanation field is a string explaining which documents
contain useful information for answering the question. Provide a step-by-step
breakdown of information provided in the documents and how it is useful for
answering the question.
The all_relevant_sentence_keys field is a list of all document sentence keys
(e.g. '0a') that are relevant to the question. Include every sentence that is
useful and relevant to the question, even if it was not used in the response,
or if only parts of the sentence are useful. Ignore the provided response when
making this judgement and base your judgement solely on the provided documents
and question. Omit sentences that, if removed from the document, would not
impact someone's ability to answer the question.
The overall_supported_explanation field is a string explaining why the response
*as a whole* is or is not supported by the documents. In this field, provide a
step-by-step breakdown of the claims made in the response and the support (or
lack thereof) for those claims in the documents. Begin by assessing each claim
separately, one by one; don't make any remarks about the response as a whole
until you have assessed all the claims in isolation.
The overall_supported field is a boolean indicating whether the response as a
whole is supported by the documents. This value should reflect the conclusion
you drew at the end of your step-by-step breakdown in overall_supported_explanation.
In the sentence_support_information field, provide information about the support
*for each sentence* in the response.
The sentence_support_information field is a list of objects, one for each sentence
in the response. Each object MUST have the following fields:
- response_sentence_key: a string identifying the sentence in the response.
This key is the same as the one used in the response above.
- explanation: a string explaining why the sentence is or is not supported by the
documents.
- supporting_sentence_keys: keys (e.g. '0a') of sentences from the documents that
support the response sentence. If the sentence is not supported, this list MUST
be empty. If the sentence is supported, this list MUST contain one or more keys.
In special cases where the sentence is supported, but not by any specific sentence,
you can use the string "supported_without_sentence" to indicate that the sentence
is generally supported by the documents. Consider cases where the sentence is
expressing inability to answer the question due to lack of relevant information in
the provided context as "supported_without_sentence". In cases where the sentence
is making a general statement (e.g. outlining the steps to produce an answer, or
summarizing previously stated sentences, or a transition sentence), use the
string "general". In cases where the sentence is correctly stating a well-known fact,
like a mathematical formula, use the string "well_known_fact". In cases where the
sentence is performing numerical reasoning (e.g. addition, multiplication), use
the string "numerical_reasoning".
- fully_supported: a boolean indicating whether the sentence is fully supported by
the documents.
  - This value should reflect the conclusion you drew at the end of your step-by-step
  breakdown in explanation.
  - If supporting_sentence_keys is an empty list, then fully_supported must be false.
  - Otherwise, use fully_supported to clarify whether everything in the response
  sentence is fully supported by the document text indicated in supporting_sentence_keys
  (fully_supported = true), or whether the sentence is only partially or incompletely
  supported by that document text (fully_supported = false).
The all_utilized_sentence_keys field is a list of all sentence keys (e.g. '0a') that
were used to construct the answer. Include every sentence that either directly supported
the answer, or was implicitly used to construct the answer, even if it was not used
in its entirety. Omit sentences that were not used, and could have been removed from
the documents without affecting the answer.
You must respond with a valid JSON string.  Use escapes for quotes, e.g. '\\"', and
newlines, e.g. '\\n'. Do not write anything before or after the JSON string. Do not
wrap the JSON string in backticks or Markdown code fences.
As a reminder: your task is to review the response and assess which documents contain
useful information pertaining to the question, and how each sentence in the response
is supported by the text in the documents."""

    try:
        judge_response = groq_client.chat.completions.create(
            model=JUDGE_MODEL,
            messages=[
                {"role": "system", "content": "You are an expert evaluator for RAG systems. Respond ONLY in valid JSON."},
                {"role": "user", "content": prompt}
            ],
            temperature=JUDGE_TEMPERATURE,
            max_tokens=2048,
            response_format={"type": "json_object"}
        )

        result_text = judge_response.choices[0].message.content.strip()
        return json.loads(result_text)
    except Exception as e:
        print(f"❌ Evaluation Error: {e}")
        return {"overall_supported": False, "sentence_support_information": []}

print("✅ Step Complete: LLM-as-Judge engine is ready.")

✅ Step Complete: LLM-as-Judge engine is ready.
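
The judge call above falls back to a default dict when the API call or `json.loads` fails. A complementary guard, sketched below, normalizes whatever text comes back so that downstream metric code always finds the keys it reads. This is a minimal sketch, assuming a model may still wrap its JSON in a code fence or omit fields; `parse_judge_output` and `JUDGE_DEFAULTS` are hypothetical names, not part of the notebook.

```python
import json
import re
from typing import Any, Dict

# Hypothetical defaults: the fields the metric functions read from the judge output.
JUDGE_DEFAULTS: Dict[str, Any] = {
    "all_relevant_sentence_keys": [],
    "all_utilized_sentence_keys": [],
    "sentence_support_information": [],
    "overall_supported": False,
}

FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable


def parse_judge_output(raw: str) -> Dict[str, Any]:
    """Parse judge text into a dict that always contains the expected keys."""
    # Strip an optional leading/trailing code fence (with or without a json tag).
    pattern = rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$"
    cleaned = re.sub(pattern, "", raw.strip())
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    # Copy the default lists so callers never share mutable state.
    result = {k: (list(v) if isinstance(v, list) else v) for k, v in JUDGE_DEFAULTS.items()}
    result.update(parsed)
    return result
```

Wiring this in would mean calling `parse_judge_output(result_text)` in place of the bare `json.loads(result_text)`; whether the extra leniency is wanted depends on how strictly malformed judge output should be counted as a failure.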


### 4.6 Metrics Computation


In [None]:
import re
from typing import List, Dict

def count_sentences(documents: List[str]) -> int:
    """
    Counts total sentences in the retrieved context using the same logic
    as the Judge to ensure denominator consistency.
    """
    total = 0
    for doc in documents:
        # Using a slightly more robust split that matches Step 5.6 logic
        sentences = re.split(r"(?<=[.!?])\s+", doc)
        total += len([s for s in sentences if s.strip()])
    return total

# ---------- Optimized Customer Care Metrics ----------

def compute_context_precision(relevant_keys: List[str], total_sentences: int) -> float:
    """
    Metric: Contextual Relevancy / Precision
    In Support: High score means the retriever is 'Signal' focused.
    Low score means 'Noise' (the retriever brought in 50 sentences but only 2 were useful).
    """
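    if total_sentences == 0:
        return 0.0
    # De-duplicate keys and cap at 1.0: the judge may repeat a key, or key more
    # sentences than the regex split in count_sentences() finds.
    return min(1.0, len(set(relevant_keys)) / total_sentences)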


def compute_faithfulness(sentence_support_info: List[Dict]) -> float:
    """
    Metric: Faithfulness (Groundedness)
    In Support: Prevents 'Creative Hallucinations'.
    Measures % of the generated response sentences that are fully backed by docs.
    """
    if not sentence_support_info:
        return 0.0

    # We count 'fully_supported' as per the Judge's boolean verdict
    supported_count = sum(1 for s in sentence_support_info if s.get("fully_supported", False))
    return supported_count / len(sentence_support_info)


def compute_completeness(relevant_keys: List[str], utilized_keys: List[str]) -> float:
    """
    Metric: Context Recall (Completeness)
    In Support: Did the bot leave out the 'Warning' or 'Safety' steps?
    Measures if the bot used all relevant information found in retrieval.
    """
    rel_set = set(relevant_keys)
    util_set = set(utilized_keys)

    if not rel_set:
        # If there was no relevant info in context, and the bot (correctly)
        # said "I don't know", completeness is 1.0.
        return 1.0

    # Intersection of what was relevant vs what was actually used in the answer
    return len(rel_set & util_set) / len(rel_set)


# ---------- Aggregate Execution ----------

def compute_all_metrics(attributes: Dict, retrieved_chunks: List[Dict]) -> Dict[str, float]:
    """
    Calculates the final quality suite for a customer support interaction.
    Input 'attributes' comes directly from extract_attributes (Step 5.6).
    """
    raw_texts = [c["text"] for c in retrieved_chunks]
    total_context_sentences = count_sentences(raw_texts)

    relevant_keys = attributes.get("all_relevant_sentence_keys", [])
    utilized_keys = attributes.get("all_utilized_sentence_keys", [])
    support_info = attributes.get("sentence_support_information", [])

    # Calculate metrics
    precision = compute_context_precision(relevant_keys, total_context_sentences)
    faithfulness = compute_faithfulness(support_info)
    completeness = compute_completeness(relevant_keys, utilized_keys)

    return {
        "retrieval_precision": round(precision, 4),
        "faithfulness_score": round(faithfulness, 4),
        "answer_completeness": round(completeness, 4),
        "overall_score": round((precision + faithfulness + completeness) / 3, 4)
    }

print("✅ Step 5.7 Complete: Metrics aligned with LLM-Judge attributes.")

✅ Step 5.7 Complete: Metrics aligned with LLM-Judge attributes.


## 5. Load RAGBench Dataset


In [None]:
import pandas as pd
from datasets import load_dataset

# Configuration from previous steps
# DATASET_REPOSITORY = "galileo-ai/ragbench"
# SUBSETS = ["techqa", "emanual"]
# SPLIT = "test"

def load_and_prepare_ragbench(repository: str, subsets: List[str], split: str) -> pd.DataFrame:
    """Loads specified RAGBench subsets and prepares them for the experiment loop."""

    target_subsets = subsets if isinstance(subsets, list) else [subsets]
    print(f"üì° Accessing RAGBench repository: {repository}...")

    datasets_list = []

    for subset_name in target_subsets:
        try:
            print(f"   -> Loading Subset: {subset_name} | Split: {split}")
            # Load the specific configuration (subset) from HuggingFace
            ds = load_dataset(repository, name=subset_name, split=split)
            df_subset = pd.DataFrame(ds)

            # Essential for continuity: Tag the domain (e.g., Tech Support vs. Manuals)
            df_subset['subset_source'] = subset_name

            # RAGBench stores each sample's documents as a list of strings.
            # Make sure the column exists under the name the Retriever logic (Step 5.1) expects.
            if 'documents' not in df_subset.columns and 'context' in df_subset.columns:
                df_subset.rename(columns={'context': 'documents'}, inplace=True)

            datasets_list.append(df_subset)

        except Exception as e:
            print(f"⚠️ Subset '{subset_name}' failed to load: {e}")

    if not datasets_list:
        raise ValueError("❌ No data loaded. Check repository name and subset keys.")

    # 1. Concatenate all subsets
    df = pd.concat(datasets_list, ignore_index=True, sort=False)

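    # 2. Cleanup: Remove duplicate questions to avoid wasting LLM credits/time.
    # ignore_index=True keeps the row index contiguous after rows are dropped.
    initial_len = len(df)
    df.drop_duplicates(subset=['question'], inplace=True, ignore_index=True)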

    # 3. Handle data types for downstream processing
    # Ensure 'documents' is always a list of dicts/strings as expected by the Retriever
    def clean_docs(docs):
        if isinstance(docs, list): return docs
        return [str(docs)] # Fallback if data is a single string

    df['documents'] = df['documents'].apply(clean_docs)

    print(f"✅ Data Ready! Loaded {len(df)} unique questions across {len(target_subsets)} domains.")
    if initial_len > len(df):
        print(f"✂️ Deduped {initial_len - len(df)} overlapping entries.")

    return df

# Execute Loader
df_ragbench = load_and_prepare_ragbench(DATASET_REPOSITORY, SUBSETS, SPLIT)

# Preview for validation
display(df_ragbench[['subset_source', 'question', 'documents']].head(3))

📡 Accessing RAGBench repository: galileo-ai/ragbench...
   -> Loading Subset: delucionqa | Split: test
   -> Loading Subset: emanual | Split: test
   -> Loading Subset: techqa | Split: test
✅ Data Ready! Loaded 315 unique questions across 3 domains.
✂️ Deduped 315 overlapping entries.


| | subset_source | question | documents |
|---|---|---|---|
| 0 | delucionqa | What if I fail to latch the tailgate properly? | [ Closing To close the tailgate, lift upward u... |
| 1 | delucionqa | What kind of safety features are implemented i... | [ Some of the most important safety features i... |
| 2 | delucionqa | When will the Automatic SOS be triggered? | [ Automatic SOS — If Equipped Automatic SOS is... |


In [None]:
import pandas as pd
from datasets import load_dataset

# Determine which subsets to load (Defaults to techqa if not set)
subsets_to_process = SUBSETS if 'SUBSETS' in globals() else ["techqa"]
split_to_load = SPLIT if 'SPLIT' in globals() else "test"

print(f"üì° Loading RAGBench: {DATASET_REPOSITORY} | Split: {split_to_load}...")

datasets_to_load = []
for subset_name in subsets_to_process:
    try:
        print(f"   -> Loading subset: {subset_name}...")
        ds = load_dataset(DATASET_REPOSITORY, name=subset_name, split=split_to_load)
        temp_df = pd.DataFrame(ds)

        # Continuity: Track source for domain-specific metric breakdown later
        temp_df['subset_source'] = subset_name
        datasets_to_load.append(temp_df)
    except Exception as e:
        print(f"⚠️ Warning: Could not load {subset_name}: {e}")

if not datasets_to_load:
    raise ValueError("❌ No datasets were successfully loaded. Check your DATASET_REPOSITORY.")

# 1. Master Concatenation
df = pd.concat(datasets_to_load, ignore_index=True)

# 2. Duplicate Removal (Crucial for multi-domain overlap)
initial_len = len(df)
df.drop_duplicates(subset=['question'], inplace=True, ignore_index=True)

# 3. Document Normalization (Continuity with Chunker Step 5.1)
# RAGBench docs come as a list of strings. We ensure the Chunker gets a clean list of Dicts
def normalize_documents(docs):
    if isinstance(docs, list):
        # Convert list of strings to our pipeline's expected format: List[Dict]
        return [{"text": str(d)} for d in docs]
    elif isinstance(docs, str):
        return [{"text": docs}]
    return []

df['documents'] = df['documents'].apply(normalize_documents)

print(f"✅ Dataset ready! Shape: {df.shape}")
print(f"📊 Columns: {df.columns.tolist()}")

# Preview the data structure
display(df[['subset_source', 'question', 'documents']].head())

📡 Loading RAGBench: galileo-ai/ragbench | Split: test...
   -> Loading subset: delucionqa...
   -> Loading subset: emanual...
   -> Loading subset: techqa...
✅ Dataset ready! Shape: (315, 27)
📊 Columns: ['id', 'question', 'documents', 'response', 'generation_model_name', 'annotating_model_name', 'dataset_name', 'documents_sentences', 'response_sentences', 'sentence_support_information', 'unsupported_response_sentence_keys', 'adherence_score', 'overall_supported_explanation', 'relevance_explanation', 'all_relevant_sentence_keys', 'all_utilized_sentence_keys', 'trulens_groundedness', 'trulens_context_relevance', 'ragas_faithfulness', 'ragas_context_relevance', 'gpt3_adherence', 'gpt3_context_relevance', 'gpt35_utilization', 'relevance_score', 'utilization_score', 'completeness_score', 'subset_source']


| | subset_source | question | documents |
|---|---|---|---|
| 0 | delucionqa | What if I fail to latch the tailgate properly? | [{'text': ' Closing To close the tailgate, lif... |
| 1 | delucionqa | What kind of safety features are implemented i... | [{'text': ' Some of the most important safety ... |
| 2 | delucionqa | When will the Automatic SOS be triggered? | [{'text': ' Automatic SOS — If Equipped Automa... |
| 3 | delucionqa | What happens if I accidentally push the SOS Ca... | [{'text': ' Connected Services SOS FAQs — If E... |
| 4 | delucionqa | What is the DEF? | [{'text': ' Adding Diesel Exhaust Fluid The DE... |


## 6. Extract All Documents and Build Retriever


In [None]:
print("üîç Extracting unique technical documents from the dataset...")

all_documents = []
seen_docs = set()

# RAGBench normalized format: List[Dict] from Step 6.1
for doc_list in df['documents']:
    for doc_entry in doc_list:
        doc_text = doc_entry["text"].strip()
        if doc_text and doc_text not in seen_docs:
            all_documents.append(doc_text)
            seen_docs.add(doc_text)

print("✅ Extraction Complete!")
print(f"📊 Total unique documents to be indexed: {len(all_documents)}")
print(f"📉 Questions per unique document: {len(df) / len(all_documents):.2f}")

🔍 Extracting unique technical documents from the dataset...
✅ Extraction Complete!
📊 Total unique documents to be indexed: 1105
📉 Questions per unique document: 0.29


In [None]:
### Step 7: Text Processing and Indexing

In [None]:
import torch
import nltk
from sentence_transformers import SentenceTransformer, CrossEncoder

DEFAULT_EXPERIMENT = EXPERIMENTS[0] # Use the first experiment as a default configuration

CHUNKING_STRATEGY = DEFAULT_EXPERIMENT.get("chunking_strategy", "sentence_based")
RETRIEVAL_TYPE = DEFAULT_EXPERIMENT.get("retrieval_type", "hybrid")
USE_CROSS_ENCODER = DEFAULT_EXPERIMENT.get("use_cross_encoder", True)

# CHUNK_SIZE and CHUNK_OVERLAP are not directly in EXPERIMENTS config, using DocumentChunker defaults
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

# 1. Environment Check
device = "cuda" if torch.cuda.is_available() else "cpu"

# 2. Models & Components
print(f"üöÄ Loading models on {device}...")
embedding_model = SentenceTransformer(EMBEDDING_MODEL, device=device)
EMBEDDING_DIM = embedding_model.get_sentence_embedding_dimension()

cross_encoder = None
if USE_CROSS_ENCODER:
    cross_encoder = CrossEncoder(CROSS_ENCODER_MODEL, device=device)

# Initialize Chunker & Store
chunker = DocumentChunker(CHUNKING_STRATEGY, {
    "chunk_size": CHUNK_SIZE,
    "chunk_overlap": CHUNK_OVERLAP
})
vector_store = FAISSVectorStore(EMBEDDING_DIM, "InnerProduct")

# 3. Processing & Indexing
print("✂️ Chunking and Embedding...")
chunks_metadata = chunker.chunk_documents(all_documents)
chunk_texts = [c['text'] for c in chunks_metadata]

# Generate dense vectors
embeddings = embedding_model.encode(chunk_texts, show_progress_bar=True, batch_size=32)

# 4. Finalize Knowledge Base
vector_store.add_documents(embeddings.astype('float32'), chunks_metadata)
print(f"üèÅ Vector store complete. Total entries: {vector_store.index.ntotal}")

🚀 Loading models on cpu...


✂️ Chunking and Embedding...



🏁 Vector store complete. Total entries: 1439
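
The index above is created with the `"InnerProduct"` metric. Inner product only ranks by cosine similarity when the vectors are L2-normalized, and this part of the notebook does not show whether `FAISSVectorStore` normalizes internally. The plain-Python sketch below illustrates the difference on toy vectors; the helper names are illustrative only.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


query = [3.0, 4.0]
doc_aligned = [1.5, 2.0]  # same direction as the query, small norm
doc_long = [10.0, 0.0]    # different direction, large norm

# Raw inner product rewards vector length: the misaligned document wins.
raw_aligned = dot(query, doc_aligned)  # 12.5
raw_long = dot(query, doc_long)        # 30.0

# After normalization the inner product is the cosine similarity,
# and the aligned document ranks first (about 1.0 versus about 0.6).
cos_aligned = dot(l2_normalize(query), l2_normalize(doc_aligned))
cos_long = dot(l2_normalize(query), l2_normalize(doc_long))
```

If the store does not normalize, two options are passing `normalize_embeddings=True` to `SentenceTransformer.encode` for both chunks and queries, or calling `faiss.normalize_L2` on the float32 arrays before adding and searching.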


In [None]:
from rank_bm25 import BM25Okapi

class DenseRetriever:
    def __init__(self, vector_store, embedding_model):
        self.vector_store = vector_store
        self.embedding_model = embedding_model

    def retrieve(self, query: str, top_k: int) -> List[Dict]:
        q_emb = self.embedding_model.encode([query]).astype('float32')
        # search() returns the full metadata dicts, which the Judge needs downstream
        return self.vector_store.search(q_emb, top_k)

class SparseRetriever:
    def __init__(self, chunks: List[Dict]):
        self.chunks = chunks
        tokenized_corpus = [c['text'].lower().split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized_corpus)

    def retrieve(self, query: str, top_k: int) -> List[Dict]:
        q_tokens = query.lower().split()
        scores = self.bm25.get_scores(q_tokens)
        top_idx = np.argsort(scores)[::-1][:top_k]
        return [self.chunks[i] for i in top_idx]

class HybridRetrieverWithRerank:
    def __init__(self, vector_store, embedding_model, cross_encoder, chunks):
        self.dense = DenseRetriever(vector_store, embedding_model)
        self.sparse = SparseRetriever(chunks)
        self.cross_encoder = cross_encoder

    def retrieve(self, query: str, top_k: int) -> List[Dict]:
        # 1. Broad retrieval pool
        d_res = self.dense.retrieve(query, top_k * 2)
        s_res = self.sparse.retrieve(query, top_k * 2)

        # Deduplicate based on text
        candidates = {c['text']: c for c in (d_res + s_res)}.values()
        candidate_list = list(candidates)

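        # 2. Cross-Encoder Rerank (skipped when no cross-encoder is configured
        #    or when neither retriever returned anything)
        if self.cross_encoder is None or not candidate_list:
            return candidate_list[:top_k]
        pairs = [[query, c['text']] for c in candidate_list]
        scores = self.cross_encoder.predict(pairs)

        # 3. Sort & Slice
        ranked = sorted(zip(candidate_list, scores), key=lambda x: x[1], reverse=True)
        return [res[0] for res in ranked[:top_k]]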

In [None]:
# 7.3 Unified Retriever Initialization
print(f"⚙️ Building {RETRIEVAL_TYPE} retriever...")

if RETRIEVAL_TYPE == "dense":
    retriever = DenseRetriever(vector_store, embedding_model)
elif RETRIEVAL_TYPE == "sparse":
    retriever = SparseRetriever(chunks_metadata)
elif RETRIEVAL_TYPE == "hybrid":
    retriever = HybridRetrieverWithRerank(
        vector_store, embedding_model, cross_encoder, chunks_metadata
    )
else:
    raise ValueError(f"Unknown retrieval type: {RETRIEVAL_TYPE}")

print("✅ Pipeline Configured. Ready for Experiment Run.")

⚙️ Building hybrid retriever...
✅ Pipeline Configured. Ready for Experiment Run.
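
The hybrid retrievers in this notebook merge dense and sparse candidates whose raw scores (inner product versus BM25) are not on a comparable scale. One common way to fuse two ranked lists without comparing scores is reciprocal rank fusion (RRF), which the sketch below implements over plain chunk identifiers. The function name, the optional per-list weights, and the constant `k=60` are illustrative choices, not taken from the notebook.

```python
from typing import Dict, List, Optional


def reciprocal_rank_fusion(
    rankings: List[List[str]],
    weights: Optional[List[float]] = None,
    k: int = 60,
    top_k: int = 5,
) -> List[str]:
    """Fuse ranked lists: each item scores sum(weight / (k + rank)) over the lists."""
    weights = weights or [1.0] * len(rankings)
    scores: Dict[str, float] = {}
    for weight, ranking in zip(weights, rankings):
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + weight / (k + rank)
    # Highest fused score first; ties keep first-seen order (sorted is stable).
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


dense_ranking = ["c1", "c2", "c3"]
sparse_ranking = ["c3", "c1", "c4"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking], top_k=4)
# c1 and c3 appear in both lists, so they outrank c2 and c4.
```

Passing `weights=[alpha, 1 - alpha]` gives an alpha-weighted variant in which `alpha=1.0` reduces to the dense ordering.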


### 7.1 Multi-Experiment Runner (Run this if RUN_MULTI_EXPERIMENTS = True)


In [None]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer

RUN_MULTI_EXPERIMENTS = True # Define the flag to enable multi-experiment runs

# 1. UPDATED HYBRID RETRIEVER (alpha-weighted Reciprocal Rank Fusion)
class HybridRetriever:
    def __init__(self, dense, sparse, alpha=0.5, rrf_k=60):
        self.dense = dense
        self.sparse = sparse
        self.alpha = alpha  # 0.0 = Pure Sparse, 1.0 = Pure Dense
        self.rrf_k = rrf_k  # RRF smoothing constant; 60 is the customary default

    def retrieve(self, query: str, top_k: int) -> List[Dict]:
        dense_hits = self.dense.retrieve(query, top_k * 2)
        sparse_hits = self.sparse.retrieve(query, top_k * 2)

        # Score each chunk by its alpha-weighted reciprocal rank in both lists.
        # A plain union sliced to top_k would only ever return the dense hits.
        scores, by_text = {}, {}
        for weight, hits in ((self.alpha, dense_hits), (1.0 - self.alpha, sparse_hits)):
            for rank, hit in enumerate(hits, start=1):
                by_text.setdefault(hit["text"], hit)  # Deduplicate by text
                scores[hit["text"]] = scores.get(hit["text"], 0.0) + weight / (self.rrf_k + rank)

        ranked = sorted(scores, key=scores.get, reverse=True)
        return [by_text[text] for text in ranked[:top_k]]

# 2. THE MAIN RUNNER
if RUN_MULTI_EXPERIMENTS:
    all_experiment_summary = []

    print(f"üöÄ Starting Benchmark on {len(EXPERIMENTS)} Configurations...")

    for exp in EXPERIMENTS:
        print(f"\n▶️ Running: {exp['name']} (Subset: {exp['subset']})")

        # A. Setup Experiment-Specific Pipeline
        # Load the specific embedding model defined in Step 3 Config
        model = SentenceTransformer(exp['embedding_model'], device=device)

        # Initialize the specific chunker
        chunk_cfg = {"chunk_size": CHUNK_SIZE, "chunk_overlap": CHUNK_OVERLAP}
        chunker = DocumentChunker(exp['chunking_strategy'], chunk_cfg)

        # B. Process Knowledge Base (Unique to this experiment's strategy)
        # Reuse the normalized `df` from the loading step, filtered to this subset
        exp_df = df[df['subset_source'] == exp['subset']].copy()

        # Flatten and de-duplicate; dict.fromkeys keeps first-seen order, so chunk
        # order stays reproducible across runs (set() ordering is not stable)
        unique_texts = list(dict.fromkeys(d['text'] for docs in exp_df['documents'] for d in docs))
        chunks_metadata = chunker.chunk_documents(unique_texts)

        # Index in FAISS
        vectors = model.encode([c['text'] for c in chunks_metadata], show_progress_bar=False)
        vs = FAISSVectorStore(model.get_sentence_embedding_dimension(), "InnerProduct")
        vs.add_documents(vectors.astype('float32'), chunks_metadata)

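        # C. Initialize Retriever based on Type
        d_ret = DenseRetriever(vs, model)
        if exp['retrieval_type'] == "hybrid":
            s_ret = SparseRetriever(chunks_metadata)
            active_retriever = HybridRetriever(d_ret, s_ret, exp.get('alpha', 0.5))
        elif exp['retrieval_type'] == "sparse":
            # BM25-only configurations use the sparse retriever on its own
            active_retriever = SparseRetriever(chunks_metadata)
        else:
            active_retriever = d_ret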

        # D. Evaluation Loop
        results_storage = []
        # Evaluation sample size (N=5 for testing, N=50+ for production)
        eval_size = min(5, len(exp_df))

        for _, row in tqdm(exp_df.head(eval_size).iterrows(), total=eval_size, desc="Benchmarking"):
            query = row['question']

            # 1. Retrieve
            retrieved = active_retriever.retrieve(query, exp['top_k_final'])

            # 2. Generate (Continuity with Step 5.6)
            response = generate_response(query, retrieved)

            # 3. Judge (LLM-as-a-judge metrics)
            attr = extract_attributes(query, retrieved, response)
            metrics = compute_all_metrics(attr, retrieved)

            results_storage.append(metrics)

        # E. Aggregate Results
        avg_metrics = pd.DataFrame(results_storage).mean().to_dict()
        all_experiment_summary.append({
            "Experiment": exp['name'],
            "Subset": exp['subset'],
            "Strategy": exp['chunking_strategy'],
            **avg_metrics
        })

    # 3. FINAL LEADERBOARD
    leaderboard = pd.DataFrame(all_experiment_summary)
    display(leaderboard.sort_values(by="faithfulness_score", ascending=False))

🚀 Starting Benchmark on 5 Configurations...

▶️ Running: cs_sentence_hybrid_delucionqa (Subset: delucionqa)

▶️ Running: cs_fixed_bm25_techqa (Subset: techqa)
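
Each sample in the evaluation loop triggers two LLM calls (generation and judging), so a long run can hit API rate limits. A common mitigation is to retry with exponential backoff. The helper below is a minimal sketch: `call_with_backoff` is a hypothetical name, and it retries on any exception for brevity, whereas a real run would more likely catch only the client's rate-limit or transient network errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait base_delay * 2**(attempt - 1) seconds and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

Inside the loop this could wrap either call, for example `call_with_backoff(lambda: generate_response(query, retrieved))`. The `sleep` parameter is injectable so the delay schedule can be checked without actually waiting.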


## 9. Save Results


In [None]:
if SAVE_RESULTS:
    import os
    os.makedirs(RESULTS_DIR, exist_ok=True)

    # Save per-sample results
    per_sample_data = []
    for r in results['per_sample']:
        per_sample_data.append({
            'id': r['id'],
            'question': r['question'],
            'context_relevance': r['metrics']['context_relevance'],
            'context_utilization': r['metrics']['context_utilization'],
            'completeness': r['metrics']['completeness'],
            'adherence': r['metrics']['adherence'],
            'response': r['response']
        })

    per_sample_df = pd.DataFrame(per_sample_data)
    per_sample_df.to_csv(f"{RESULTS_DIR}/per_sample_results.csv", index=False)
    print(f"✅ Saved per-sample results to {RESULTS_DIR}/per_sample_results.csv")

    # Save comparison results
    comparison_df = pd.DataFrame(comparison_results).T
    comparison_df.to_csv(f"{RESULTS_DIR}/comparison_results.csv")
    print(f"✅ Saved comparison results to {RESULTS_DIR}/comparison_results.csv")

    # Save full results as JSON (without full documents to save space)
    results_summary = {
        'config': {
            'domain': DOMAIN,
            'subset': SUBSET,
            'chunking_strategy': CHUNKING_STRATEGY,
            'retrieval_type': RETRIEVAL_TYPE,
            'embedding_model': EMBEDDING_MODEL
        },
        'comparison': comparison_results,
        'predicted_metrics_summary': {
            metric: {
                'mean': np.mean(scores),
                'std': np.std(scores),
                'min': np.min(scores),
                'max': np.max(scores)
            }
            for metric, scores in results['predicted_metrics'].items()
        }
    }

    with open(f"{RESULTS_DIR}/results_summary.json", 'w') as f:
        json.dump(results_summary, f, indent=2)
    print(f"✅ Saved results summary to {RESULTS_DIR}/results_summary.json")

    print(f"\n📊 All results saved to {RESULTS_DIR}/")


## 10. Visualization of Results


In [None]:
import matplotlib.pyplot as plt

# Create visualization of metrics comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()

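# Use the same keys as `ground_truth` and results['predicted_metrics'];
# otherwise the membership check below never matches and every panel stays empty.
metrics_list = ['relevance_score', 'utilization_score', 'completeness_score', 'adherence_score']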

for idx, metric in enumerate(metrics_list):
    if metric in results['predicted_metrics'] and metric in ground_truth:
        pred = results['predicted_metrics'][metric]
        gt = ground_truth[metric]

        min_len = min(len(pred), len(gt))
        pred = pred[:min_len]
        gt = gt[:min_len]

        axes[idx].scatter(gt, pred, alpha=0.5)
        axes[idx].plot([0, 1], [0, 1], 'r--', label='Perfect prediction')
        axes[idx].set_xlabel('Ground Truth')
        axes[idx].set_ylabel('Predicted')
        axes[idx].set_title(f'{metric.replace("_", " ").title()}')
        axes[idx].legend()
        axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(f"{RESULTS_DIR}/metrics_comparison.png", dpi=150, bbox_inches='tight')
plt.show()

print("✅ Visualization saved!")


## 11. Summary and Next Steps

### Summary
- ✅ Dataset loaded and processed
- ✅ Retriever built with all documents
- ✅ Evaluation completed on all questions
- ✅ Metrics computed and compared with ground truth
- ✅ Results saved

### Next Steps for Optimization
1. **Experiment with different chunking strategies**: Try fixed_size, paragraph_based, sliding_window
2. **Try domain-specific embedding models**: Use BioBERT for biomedical, Legal-BERT for legal, etc.
3. **Experiment with retrieval types**: Try sparse (BM25) or hybrid search
4. **Vary Top-K values**: Test with different numbers of retrieved documents
5. **Try different LLM models**: Experiment with different generation models
6. **Scale to other domains**: Once optimized for one domain, extend to others

### Tips for Kaggle
- Use Kaggle Secrets to store your Groq API key securely
- Enable GPU if needed for faster embedding generation
- Use Kaggle's persistent storage for saving results
- Submit as a Kaggle Notebook for version control


In [None]:
# Define helper retriever classes needed for the conditional logic
class DenseRetriever:
    def __init__(self, vector_store, embedding_model):
        self.vector_store = vector_store
        self.embedding_model = embedding_model

    def retrieve(self, query: str, top_k: int) -> List[Dict]:
        q_emb = self.embedding_model.encode([query]).astype('float32')
        # search() returns a single list of metadata dicts
        return self.vector_store.search(q_emb, top_k)

class SparseRetriever:
    def __init__(self, all_documents_for_bm25: List[str]):
        self.all_documents = all_documents_for_bm25
        tokenized_corpus = [doc.lower().split() for doc in all_documents_for_bm25]
        self.bm25 = BM25Okapi(tokenized_corpus)

    def retrieve(self, query: str, top_k: int) -> List[str]:
        q_tokens = query.lower().split()
        bm25_scores = self.bm25.get_scores(q_tokens)
        top_bm25_idx = np.argsort(bm25_scores)[::-1][:top_k]
        bm25_texts = [self.all_documents[i] for i in top_bm25_idx]
        return bm25_texts

class HybridRetriever:
    def __init__(self, dense_retriever, sparse_retriever, hybrid_alpha: float):
        self.dense_retriever = dense_retriever
        self.sparse_retriever = sparse_retriever
        self.hybrid_alpha = hybrid_alpha # Alpha for blending, though simple union for now

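    def retrieve(self, query: str, top_k: int) -> List[str]:
        # Retrieve candidates from both dense and sparse
        dense_results = self.dense_retriever.retrieve(query, top_k)
        sparse_results = self.sparse_retriever.retrieve(query, top_k)

        # Interleave the two rankings and de-duplicate by text. Dense hits may be
        # metadata dicts, which are unhashable, so set() cannot be used on them.
        combined, seen = [], set()
        for i in range(max(len(dense_results), len(sparse_results))):
            for hits in (dense_results, sparse_results):
                if i < len(hits):
                    text = hits[i]["text"] if isinstance(hits[i], dict) else hits[i]
                    if text not in seen:
                        seen.add(text)
                        combined.append(text)
        return combined[:top_k]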


# Initialize retriever based on configuration
if RETRIEVAL_TYPE == "dense":
    retriever = DenseRetriever(vector_store, embedding_model)
elif RETRIEVAL_TYPE == "sparse":
    retriever = SparseRetriever(all_documents)
elif RETRIEVAL_TYPE == "hybrid":
    if USE_CROSS_ENCODER:
        # Assuming HybridRetrieverWithRerank is the intended 'hybrid' with reranking
        retriever = HybridRetrieverWithRerank(vector_store, embedding_model, cross_encoder, chunks_metadata)
    else:
        # If no cross-encoder, use the simpler HybridRetriever defined above
        dense_ret = DenseRetriever(vector_store, embedding_model)
        sparse_ret = SparseRetriever(all_documents)
        retriever = HybridRetriever(dense_ret, sparse_ret, HYBRID_ALPHA)
else:
    raise ValueError(f"Unknown retrieval type: {RETRIEVAL_TYPE}")

# Ensure the corpus is set for BM25 in case of sparse or hybrid retrieval that uses it
# HybridRetrieverWithRerank needs set_corpus called after all_documents is available
if hasattr(retriever, 'set_corpus') and retriever.all_texts == []: # Only set if not already set
    retriever.set_corpus(all_documents)

print(f"✅ Retriever initialized: {RETRIEVAL_TYPE}")

## 7. Run Evaluation Pipeline


In [None]:
ground_truth = {}
metrics = ['relevance_score', 'utilization_score', 'completeness_score', 'adherence_score']
for metric in metrics:
    if metric in df.columns:
        ground_truth[metric] = df[metric].tolist()
    else:
        print(f"⚠️ Warning: {metric} not found in dataset")
        ground_truth[metric] = []

print("Ground truth scores available:")
for metric, scores in ground_truth.items():
    if scores:
        print(f"  {metric}: {len(scores)} samples, mean={np.mean(scores):.4f}")
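
The project overview lists RMSE and AUC-ROC as the measures for comparing predicted scores with these ground-truth columns. The sketch below spells out what each one computes, in plain Python on made-up numbers. It assumes RMSE is applied to the continuous scores and AUC-ROC to a binary label such as adherence; in the notebook itself the `mean_squared_error` and `roc_auc_score` functions imported in section 2 would normally do this work.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between two equally long score lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def auc_roc(labels: Sequence[bool], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for label, s in zip(labels, scores) if label]
    neg = [s for label, s in zip(labels, scores) if not label]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up values for illustration only.
gt_relevance = [0.2, 0.5, 0.9]
pred_relevance = [0.3, 0.5, 0.7]
relevance_rmse = rmse(gt_relevance, pred_relevance)

gt_adherence = [True, True, False, False]
pred_adherence = [0.9, 0.4, 0.5, 0.1]
adherence_auc = auc_roc(gt_adherence, pred_adherence)  # 3 of 4 positive/negative pairs ordered correctly
```

The pairwise AUC loop is quadratic in the number of samples, which is acceptable for a few hundred rows but is another reason to prefer the scikit-learn implementation at scale.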


In [None]:
# Initialize results storage
results = {
    'per_sample': [],
    'predicted_metrics': {
        'relevance_score': [],
        'utilization_score': [],
        'completeness_score': [],
        'adherence_score': []
    }
}

print(f"Starting evaluation on {len(df)} samples...")
print("This may take a while depending on dataset size and API rate limits...")
