# RAG Evaluation and Observability with Weights & Biases (W&B)

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Build** a complete RAG pipeline using LangChain v1.0+
2. **Understand** why evaluation is critical for production RAG systems
3. **Create** a "Golden Dataset" for systematic evaluation
4. **Use W&B Weave** to track experiments and enable observability
5. **Interpret** LLM-as-a-Judge metrics (Faithfulness, Answer Relevance)
6. **Debug** RAG failures using per-question analysis
7. **Iterate** on RAG configurations using experiment comparison

---

## 🎯 Why Evaluate RAG Systems?

Before we dive into building, let's understand **why evaluation matters**.

### The Problem: "It Looks Good" is Not Enough

When you test a chatbot manually, you might ask 5-10 questions and think: *"The answers seem reasonable!"* But in production:

- You can't manually check thousands of queries
- Users will ask questions you never anticipated
- Small changes (new documents, different LLM) can break things silently

### What Can Go Wrong in RAG?

| Failure Type | Description | Example |
|--------------|-------------|---------|
| **Retrieval Failure** | Wrong documents were fetched | User asks about "Python" (the language) but retrieves documents about "python" (the snake) |
| **Hallucination** | LLM invents information not in documents | LLM confidently states a date that doesn't exist in your PDFs |
| **Irrelevant Answer** | Answer is technically correct but doesn't address the question | User asks "How do I install X?" and gets "X was developed in 2020..." |
| **Context Window Overflow** | Too many chunks stuffed into prompt | The LLM gets confused or ignores important context |

### The Solution: Systematic Evaluation

We need:
1. **A benchmark** (Golden Dataset) with known correct answers
2. **Automated metrics** that can score answers at scale
3. **Observability** to trace what happened at each step
4. **Experiment tracking** to compare different configurations

> 💡 **Key Insight**: Evaluation is not just about quality; it's about **confidence**. You need to know *when* your system will fail, not just hope it won't.

---

## 🔭 Introduction to LLM Observability with W&B

### What is Observability?

**Observability** is the ability to understand what's happening *inside* your system by examining its *outputs*. For LLM applications, this means:

- **Traces**: The complete journey of a request (query → retrieval → generation → response)
- **Metrics**: Quantitative measurements (latency, token count, quality scores)
- **Logs**: Detailed records of inputs, outputs, and intermediate steps

### Why Observability for RAG?

RAG systems are **multi-step pipelines**. When something goes wrong, you need to know:

```
User Query → [Embedding] → [Retrieval] → [Prompt Construction] → [LLM Generation] → Response
     ↓            ↓             ↓                ↓                     ↓              ↓
  Logged?      Traced?      What docs?      What prompt?          What output?    Scored?
```

Without observability, debugging is like finding a needle in a haystack.
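
To make the idea of a trace concrete, here is a minimal sketch in plain Python. It is not how Weave works internally; the `traced` decorator, the `TRACE_LOG` list, and the two stub functions are hypothetical names invented for illustration. The sketch records the inputs, output, and latency of each pipeline step, which is the kind of record a tracing tool collects for you.

```python
import functools
import time

# Hypothetical in-memory trace store; a real tool sends these records to a backend.
TRACE_LOG = []

def traced(step_name):
    """Record inputs, output, and latency of each call under the given step name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = func(*args, **kwargs)
            TRACE_LOG.append({
                "step": step_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": output,
                "latency_s": time.perf_counter() - start,
            })
            return output
        return wrapper
    return decorator

@traced("retrieval")
def fake_retrieve(query):
    # Stand-in for a vector search; returns canned text.
    return [f"chunk about {query}"]

@traced("generation")
def fake_generate(query, docs):
    # Stand-in for an LLM call.
    return f"Answer to '{query}' using {len(docs)} chunk(s)"

docs = fake_retrieve("BPE")
answer = fake_generate("BPE", docs)
for record in TRACE_LOG:
    print(record["step"], "->", record["output"])
```

With one record per step, a bad answer can be traced back to the step that produced the first bad output.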

### W&B Weave for LLM Observability

**W&B Weave** is Weights & Biases' toolkit for LLM observability and evaluation. It provides:

| Feature | Description |
|---------|-------------|
| **Tracing** | Automatically capture LLM calls and chain executions |
| **Evaluation** | Run LLM-as-a-Judge metrics with custom scorers |
| **Datasets** | Store and version your evaluation datasets |
| **Dashboard** | Visual interface at wandb.ai to explore results |
| **Collaboration** | Share experiments with your team |

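> 📌 **In this notebook**, we use W&B Weave's `@weave.op()` decorator to trace RAG executions and the custom LLM-as-a-Judge scorers, then log aggregate scores and result tables to W&B with `wandb.log()`. Weave also provides a `weave.Evaluation` class for running scorers over a dataset; this notebook runs the scorers in a plain loop instead.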

---

## Step 0: Install Required Packages

Before we begin, we need to install the required packages. The two cells below install:

- **`langchain`**: The core LangChain framework
- **`langchain-classic`**: Legacy chain constructors (`create_retrieval_chain` and related helpers) imported in Step 2
- **`langchain-community`**: Community integrations (document loaders, etc.)
- **`langchain-openai`**: OpenAI-specific components (embeddings, chat models)
- **`langchain-chroma`**: Chroma vector store integration
- **`langchain-text-splitters`**: Text splitters such as `RecursiveCharacterTextSplitter`
- **`pypdf`**: PDF parsing library
- **`weave`**: W&B's LLM observability and evaluation toolkit
- **`wandb`**: Weights & Biases core package
- **`python-dotenv`**: Environment variable management

> ⚠️ **Important**: After running these cells, you may need to **restart the kernel** to ensure all packages are properly loaded.

In [None]:
# Install required packages (comment this line out once they are installed)
!pip install -q langchain langchain-classic langchain-community langchain-openai langchain-chroma langchain-text-splitters pypdf python-dotenv

print("✅ LangChain packages installed!")

In [None]:
# Install W&B Weave for LLM evaluation and observability
!pip install -q weave wandb

print("✅ W&B Weave installed - restart kernel if this is your first time")

---

## Step 1: Environment Setup

We need to configure our API keys to authenticate with OpenAI and W&B. This notebook supports both:

- **Google Colab**: Uses `google.colab.userdata` to securely access keys stored in Colab Secrets
- **Local Execution**: Uses `python-dotenv` to load keys from a `.env` file

### Setting Up Your API Keys

**For local development**, create a `.env` file in this directory with:
```
OPENAI_API_KEY=your-openai-key-here
WANDB_API_KEY=your-wandb-key-here
```

**For Colab**, add your keys to Colab Secrets with the names `OPENAI_API_KEY` and `WANDB_API_KEY`.

> 💡 **Get your W&B API key** at: https://wandb.ai/authorize

In [None]:
import os
import sys

# Configuration
MODEL = "gpt-4o-mini"  # The LLM model to use
db_name = "vector_db"  # Directory name for the vector store
WANDB_PROJECT = "rag-evaluation"  # W&B project name

# Option 1 (Google Colab): uncomment to read the keys from Colab Secrets
# from google.colab import userdata
# os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
# os.environ["WANDB_API_KEY"] = userdata.get('WANDB_API_KEY')

# Option 2 (local execution): uncomment to load the keys from a .env file
# from dotenv import load_dotenv
# load_dotenv()

# Verify API keys are set
if os.environ.get("OPENAI_API_KEY"):
    print("✅ OPENAI_API_KEY loaded successfully")
else:
    print("⚠️ Warning: OPENAI_API_KEY not found. Please set it in your .env file or environment.")

if os.environ.get("WANDB_API_KEY"):
    print("✅ WANDB_API_KEY loaded successfully")
else:
    print("⚠️ Warning: WANDB_API_KEY not found. Get yours at https://wandb.ai/authorize")

In [None]:
import weave
import wandb
import pandas as pd

# Initialize W&B Weave - this enables tracing and evaluation
weave.init(WANDB_PROJECT)

print(f"✅ W&B Weave initialized with project: {WANDB_PROJECT}")
print(f"💡 View your experiments at: https://wandb.ai/{wandb.api.default_entity}/{WANDB_PROJECT}")

### RAG Chain Architecture

Here's the complete flow of our RAG system:

```
┌─────────────────────────────────────────────────────┐
│  RAG Chain Flow                                     │
├─────────────────────────────────────────────────────┤
│                                                     │
│  1. User Question + Chat History                    │
│     ↓                                               │
│  2. History-Aware Retriever                         │
│     (Reformulates question to be standalone)        │
│     ↓                                               │
│  3. Vector Store Search (Chroma)                    │
│     (Finds top-k most similar chunks)               │
│     ↓                                               │
│  4. Question-Answer Chain                           │
│     (LLM generates answer using retrieved context)  │
│     ↓                                               │
│  5. Final Response                                  │
│                                                     │
└─────────────────────────────────────────────────────┘
```

**Key Components:**
- **History-Aware Retriever**: Handles follow-up questions by reformulating them
- **Vector Store**: Stores embeddings and performs semantic search
- **Stuff Documents Chain**: "Stuffs" all retrieved docs into the LLM prompt
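
The five steps in the diagram can be sketched with plain functions before any LangChain code is involved. This is only an illustration of the data flow: `run_rag` and the three `stub_*` functions are hypothetical, and the stubs stand in for the LLM and the vector store.

```python
from typing import Callable

def run_rag(question: str, chat_history: list, reformulate: Callable,
            search: Callable, generate: Callable, k: int = 5) -> dict:
    """Wire the pipeline steps together; each step is passed in as a callable."""
    standalone = reformulate(question, chat_history)  # step 2: standalone question
    chunks = search(standalone, k)                    # step 3: top-k retrieval
    answer = generate(standalone, chunks)             # step 4: grounded generation
    return {                                          # step 5: final response
        "input": question,
        "standalone_question": standalone,
        "context": chunks,
        "answer": answer,
    }

def stub_reformulate(question, chat_history):
    # A real implementation asks an LLM; this stub only appends the last topic.
    return f"{question} (topic: {chat_history[-1]})" if chat_history else question

def stub_search(query, k):
    # A real implementation queries the vector store; this stub returns fixed text.
    corpus = ["First placeholder chunk.", "Second placeholder chunk.", "Third placeholder chunk."]
    return corpus[:k]

def stub_generate(question, chunks):
    # A real implementation calls the LLM with the chunks as context.
    return f"Based on {len(chunks)} chunk(s): ..."

result = run_rag("What are its main features?", ["SecLM"],
                 stub_reformulate, stub_search, stub_generate, k=2)
print(result["standalone_question"])
print(result["answer"])
```

Keeping the steps as separate callables mirrors why tracing is useful here: each one has its own input and output that can be inspected independently.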

---

## Step 2: Import Dependencies

Now we import all the necessary modules from LangChain and other libraries. Here's what each import does:

### Document Processing
- **`DirectoryLoader`**: Loads multiple files from a directory
- **`PyPDFLoader`**: Parses PDF files into text
- **`RecursiveCharacterTextSplitter`**: Splits text into chunks while respecting natural boundaries

### Embeddings & Vector Store
- **`OpenAIEmbeddings`**: Converts text to vector embeddings using OpenAI's models
- **`Chroma`**: A fast, open-source vector database

### LLM & Chains
- **`ChatOpenAI`**: OpenAI's chat models (GPT-4, etc.)
- **`create_history_aware_retriever`**: Creates a retriever that understands conversation context
- **`create_retrieval_chain`**: Combines retrieval and generation into a single chain
- **`create_stuff_documents_chain`**: Creates a chain that "stuffs" documents into the prompt

### Prompts & Messages
- **`ChatPromptTemplate`**: Templates for structured prompts
- **`MessagesPlaceholder`**: Placeholder for conversation history
- **`HumanMessage` / `AIMessage`**: Message types for chat history

In [None]:
import glob
import os
import json

# Document loading and processing
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Embeddings and LLM
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# Vector store
from langchain_chroma import Chroma

# Chains for RAG
from langchain_classic.chains import create_history_aware_retriever, create_retrieval_chain
from langchain_classic.chains.combine_documents import create_stuff_documents_chain

# Prompts and messages
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage

print("✅ All imports successful!")

---

## Step 3: Load Documents

The first step in building a RAG application is loading your documents. We use:

- **`glob.glob()`**: To find all PDF files in the `pdfs/` directory and current directory
- **`PyPDFLoader`**: To parse each PDF and extract text content

### Document Structure

Each loaded document contains:
- **`page_content`**: The actual text content
- **`metadata`**: Information about the document (source file, page number, etc.)

> 📁 **Note**: Place your PDF files in a `pdfs/` subdirectory or in the same directory as this notebook.
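
As a rough picture of that structure, the sketch below uses a small dataclass as a stand-in for a loaded page. `PageDocument` is a hypothetical class (not LangChain's `Document`), the file names are made up, and the `source` and `page` metadata keys are assumed to resemble what a PDF loader typically attaches.

```python
from dataclasses import dataclass, field

@dataclass
class PageDocument:
    """Simplified stand-in for one loaded page: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Made-up pages for illustration only
pages = [
    PageDocument("Placeholder text of page one.", {"source": "pdfs/example_a.pdf", "page": 0}),
    PageDocument("Placeholder text of page two.", {"source": "pdfs/example_a.pdf", "page": 1}),
    PageDocument("Placeholder text of another file.", {"source": "pdfs/example_b.pdf", "page": 0}),
]

# Count pages per source file, a quick sanity check after loading
pages_per_source = {}
for page in pages:
    source = page.metadata["source"]
    pages_per_source[source] = pages_per_source.get(source, 0) + 1
print(pages_per_source)
```

The metadata travels with every chunk created later, which is what lets you trace an answer back to a file and page.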

In [None]:
# !pip install gdown -q

# !gdown 12hCcDOBYO0A3q2eFstMQnNqcCX2WYCZq -O pdfs.zip
# !unzip -q pdfs.zip -d pdfs
# !rm pdfs.zip
# print("✓ PDFs extracted to ./pdfs folder")

In [None]:
# Find all PDF files in the pdfs/ subdirectory and current directory
folders = glob.glob("pdfs/*.pdf") + glob.glob("*.pdf")

if not folders:
    print("⚠️ No PDF files found. Please add PDF files to the 'pdfs/' directory or current directory.")
else:
    print(f"📄 Found {len(folders)} PDF file(s)")

# Load all documents
documents = []
for file_path in folders:
    loader = PyPDFLoader(file_path)
    docs = loader.load()
    for doc in docs:
        # Add custom metadata to track source file
        doc.metadata["source_file"] = os.path.basename(file_path)
        documents.append(doc)

print(f"✅ Loaded {len(documents)} pages from {len(folders)} file(s)")

---

## Step 4: Split Documents into Chunks

LLMs have a **context window limit** (maximum tokens they can process at once). Additionally, for effective retrieval, we want to find *specific* relevant passages, not entire documents.

We use **`RecursiveCharacterTextSplitter`** which:
- Splits text hierarchically (paragraphs → sentences → words)
- Tries to keep semantically related text together
- Creates overlapping chunks to preserve context at boundaries

### Key Parameters

| Parameter | Value | Description |
|-----------|-------|--------------|
| `chunk_size` | 1000 | Maximum characters per chunk |
| `chunk_overlap` | 200 | Characters shared between adjacent chunks |
| `add_start_index` | True | Tracks the position of each chunk in the original document |
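
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sketch that cuts fixed-size character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers natural boundaries such as paragraph breaks, so its chunks will differ); `sliding_window_chunks` is a hypothetical helper for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk["start_index"], chunk["text"])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.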

In [None]:
# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Max characters per chunk
    chunk_overlap=200,     # Overlap between chunks for context continuity
    add_start_index=True   # Track position in original document
)

# Split documents into chunks
chunks = text_splitter.split_documents(documents)
print(f"✅ Split {len(documents)} pages into {len(chunks)} chunks")

# Show example chunk
if chunks:
    print("\n📝 Example Chunk:")
    print("-" * 50)
    print(chunks[0].page_content[:300] + "...")
    print("-" * 50)
    print(f"Metadata: {chunks[0].metadata}")

---

## Step 5: Create Embeddings and Vector Store

### What are Embeddings?

**Embeddings** are numerical representations (vectors) of text that capture semantic meaning. Texts with similar meanings will have vectors that are close together in the embedding space.

### What is a Vector Store?

A **Vector Store** is a specialized database optimized for:
- Storing high-dimensional vectors
- Performing fast similarity searches
- Enabling "semantic search" (finding text by meaning, not just keywords)
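
A toy version of similarity search helps make this concrete. The sketch below ranks texts by cosine similarity, which is one common choice; the metric a real vector store uses depends on its configuration. The 3-dimensional vectors are invented for illustration, and `top_k` is a hypothetical helper, not a Chroma API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Invented toy "embeddings"; real embedding vectors have far more dimensions
toy_store = {
    "BPE merges frequent symbol pairs": [0.9, 0.1, 0.0],
    "Subword tokenization handles rare words": [0.8, 0.3, 0.1],
    "Pythons are large snakes": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a question about BPE

nearest = top_k(query_vec, toy_store, k=2)
for text, score in nearest:
    print(f"{score:.3f}  {text}")
```

A brute-force scan like this is fine for three vectors; a vector store exists to make the same lookup fast over many thousands of chunks.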

### Our Setup

- **`OpenAIEmbeddings`**: Uses OpenAI's `text-embedding-3-small` model (fast and cost-effective)
- **`Chroma`**: Open-source vector database that persists to disk

> 💡 **Tip**: The embeddings are stored locally, so subsequent runs will be faster as you won't need to re-embed documents.

In [None]:
# Initialize embedding model
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')

# Reuse the persisted vector store if one exists, to save time and embedding cost.
# NOTE: An existing store is loaded as-is. Delete the `vector_db` folder after
# changing the PDFs or the chunking parameters to force re-indexing.

if os.path.exists(db_name):
    # Load existing vector store
    vectorstore = Chroma(
        persist_directory=db_name,
        embedding_function=embeddings
    )
    print(f"✅ Loaded existing vector store: {db_name}")
    try:
        count = vectorstore._collection.count()
        print(f"📊 Document count: {count}")
    except Exception:
        print("📊 Could not get document count")
else:
    # Create new vector store
    print(f"🆕 Creating new vector store: {db_name}...")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=db_name
    )
    print(f"✅ Vector store created with {vectorstore._collection.count()} documents")

---

## Step 6: Build the RAG Chain

Now we create the complete RAG pipeline using **LangChain Expression Language (LCEL)**. The chain consists of two main components:

### 1. History-Aware Retriever

This component reformulates the user's question to be **standalone** (understandable without context). 

**Example:**
- Chat history: "Tell me about SecLM"
- Follow-up: "What are its main features?"
- Reformulated: "What are the main features of SecLM?"

### 2. Question-Answer Chain

This component:
1. Takes the retrieved documents and the question
2. "Stuffs" the documents into the prompt as context
3. Generates a grounded answer using the LLM
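
The "stuffing" step is simple enough to sketch by hand: every retrieved chunk is joined into one block of text and substituted for a `{context}` placeholder. `stuff_prompt` below is a hypothetical helper that mimics the idea, not the implementation of `create_stuff_documents_chain`, and the chunk texts are placeholders.

```python
def stuff_prompt(system_template, docs, question):
    """Join all retrieved chunks and substitute them for {context} in the system prompt."""
    context = "\n\n".join(docs)
    return [
        {"role": "system", "content": system_template.format(context=context)},
        {"role": "user", "content": question},
    ]

template = "Answer using only the context below.\n\n{context}"
retrieved = ["First retrieved chunk.", "Second retrieved chunk."]
messages = stuff_prompt(template, retrieved, "What are the main features of SecLM?")
print(messages[0]["content"])
```

Because every chunk goes into a single prompt, the prompt grows with `k` and `chunk_size`, which is how the context window overflow failure described earlier can arise.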

In [None]:
# 1. Initialize the LLM
llm = ChatOpenAI(temperature=0, model_name=MODEL)

In [None]:
# 2. Create a retriever from the vector store
# k=5 means we retrieve the top 5 most relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

In [None]:
# 3. Define the contextualization prompt
# This prompt helps reformulate questions based on chat history
contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, "
    "just reformulate it if needed and otherwise return it as is."
)

contextualize_q_prompt = ChatPromptTemplate.from_messages([
    ("system", contextualize_q_system_prompt),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Create the history-aware retriever
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

In [None]:
# 4. Define the QA prompt
# This prompt instructs the LLM how to use the retrieved context
qa_system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, just say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", qa_system_prompt),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Create the question-answer chain
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

In [None]:
# 5. Combine into the final RAG chain
rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

print("✅ RAG chain created successfully!")

---

## Step 7: Test the RAG Chain

Let's test our RAG chain with a simple query. The chain will:

1. Take the user's question
2. Retrieve relevant document chunks from the vector store
3. Generate a response based on the retrieved context

The response object contains:
- **`answer`**: The generated response
- **`context`**: The retrieved document chunks used to generate the answer

In [None]:
# Create a traced RAG function using W&B Weave
@weave.op()
def rag_query(question: str, chat_history: list = None) -> dict:
    """Execute a RAG query with W&B tracing."""
    if chat_history is None:
        chat_history = []
    
    response = rag_chain.invoke({
        "input": question,
        "chat_history": chat_history
    })
    
    return {
        "answer": response["answer"],
        "context": [doc.page_content for doc in response["context"]]
    }

print("✅ RAG query function with W&B tracing created!")

In [None]:
# Test the RAG chain
query = "What is the main topic of these documents?"
response = rag_query(query)

print("❓ Question:", query)
print("\n💬 Answer:", response["answer"])
print("\n✅ This query was traced in W&B! Check your dashboard.")

## 📝 Step 8: Create Golden Evaluation Dataset

### 🧠 Educational Context: The "Golden Dataset"

To scientifically evaluate a RAG system, we cannot just "eyeball" a few answers. We need a **benchmark**, often called a "Golden Dataset" or "Ground Truth" set.

#### What makes a good evaluation dataset?
1.  **Diversity**: Questions should cover different topics within your documents.
2.  **Complexity**: Include simple fact lookups ("What is X?") and reasoning questions ("Compare X and Y").
3.  **Ground Truth**: You must provide the *ideal* answer. In this notebook the reference is stored and logged next to each generated answer for side-by-side review; the two judge metrics in Step 10 score the answer against the question and the retrieved context, not against the reference.

**Measurement Goals**:
*   **Retrieval Quality**: Did the system find the right page in the PDF?
*   **Generation Quality**: Did the LLM answer accurately based on that page?

👇 **Action**: The code below creates a list of dictionaries, where each item has a `question` and a `ground_truth` answer.
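
Before spending API calls on a dataset, it can help to check that every row is usable. The sketch below is one possible sanity check; `validate_golden_dataset` is a hypothetical helper and the sample rows are invented to trigger each problem.

```python
def validate_golden_dataset(rows, required=("question", "ground_truth")):
    """Return a list of human-readable problems; an empty list means the rows look usable."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Every required field must be present and non-empty
        for key in required:
            if not str(row.get(key, "")).strip():
                problems.append(f"row {i}: missing or empty '{key}'")
        # Questions that differ only in case or spacing count as duplicates
        question = str(row.get("question", "")).strip().lower()
        if question and question in seen:
            problems.append(f"row {i}: duplicate question")
        seen.add(question)
    return problems

sample_rows = [
    {"question": "What is BPE?", "ground_truth": "A subword tokenization method."},
    {"question": "what is bpe?", "ground_truth": "Duplicate of the first question."},
    {"question": "What is MTEB?", "ground_truth": ""},
]
problems = validate_golden_dataset(sample_rows)
print(problems)
```

An empty list from this check says nothing about whether the answers are correct; it only catches structural mistakes early.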

In [None]:
# Golden dataset: Questions with ground truth answers
# Based on your PDFs: Word Embeddings, BPE, NMT, MTEB
eval_data = [
    {
        "question": "What is Byte Pair Encoding (BPE)?",
        "ground_truth": "BPE is a data compression technique that iteratively replaces the most frequent pair of bytes/symbols with a new symbol. It's used for subword tokenization in NLP tasks like machine translation."
    },
    {
        "question": "What complexity result did Kozma and Voderholzer prove about optimal pair encoding?",
        "ground_truth": "They proved that optimal pair encoding is APX-complete, meaning it admits no polynomial-time approximation scheme unless P=NP."
    },
    {
        "question": "What is the distributional hypothesis in NLP?",
        "ground_truth": "Words that appear in similar contexts tend to have similar meanings. This principle, suggested by Harris (1954), underlies modern word embeddings."
    },
    {
        "question": "What is the Vector Space Model (VSM)?",
        "ground_truth": "The VSM represents words and documents as vectors in high-dimensional space, enabling mathematical operations like cosine similarity for information retrieval. Generally attributed to Salton (1975)."
    },
    {
        "question": "Who introduced the GloVe word embedding model and when?",
        "ground_truth": "GloVe (Global Vectors for Word Representation) was introduced by Pennington et al. in 2014."
    },
    {
        "question": "What is the main contribution of Neural Network Language Models (NNLMs)?",
        "ground_truth": "NNLMs, pioneered by Bengio et al. (2003), reframed language modeling as unsupervised learning and introduced embedding layers that project words into dense vector spaces."
    },
    {
        "question": "What benchmark is used to evaluate text embedding models across multiple languages?",
        "ground_truth": "The Massive Text Embedding Benchmark (MTEB) evaluates embedding models across multiple languages and diverse NLP tasks."
    },
    {
        "question": "What is the key advantage of subword tokenization in neural machine translation?",
        "ground_truth": "Subword tokenization (like BPE) enables open-vocabulary translation, handling rare words and achieving better compression while maintaining translation quality."
    }
]

eval_df = pd.DataFrame(eval_data)
print(f"✅ Golden dataset ready: {len(eval_df)} evaluation questions")
print("📄 Covering: Word Embeddings, BPE, NMT, Vector Models")
eval_df[["question"]].head()

## 🔍 Step 9: Run RAG Inference

Now we need to generate answers using our RAG pipeline for each question in the golden dataset.

In [None]:
results = []
print("🔍 Running RAG evaluation inference...\n")

for idx, row in eval_df.iterrows():
    try:
        # Use the traced RAG function
        response = rag_query(row["question"])
        
        results.append({
            "question": row["question"],
            "ground_truth": row["ground_truth"],
            "answer": response["answer"],
            "contexts": response["context"]  # Required for faithfulness metric
        })
        
        print(f"  ✓ Q{idx+1}: {row['question'][:70]}...")
        
    except Exception as e:
        print(f"  ✗ Q{idx+1} failed: {e}")
        continue

results_df = pd.DataFrame(results)
print(f"\n✅ Inference complete: {len(results_df)}/{len(eval_df)} questions answered")
results_df[["question", "answer"]].head(3)

---

## 📏 RAG Evaluation Metrics Reference

Before we run evaluation, let's understand the metrics we'll use.

### How LLM-as-a-Judge Works

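Traditional metrics like **BLEU** or **ROUGE** compare word overlap. But for conversational AI:
- "The capital of France is Paris" and "Paris is the capital city of France" mean the same thing, yet an overlap metric scores them as only a partial match, and a paraphrase that uses different vocabulary scores far lower.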

**LLM-as-a-Judge** uses a powerful LLM (like GPT-4) to evaluate responses semantically.
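
The weakness of overlap metrics is easy to demonstrate. The sketch below computes a unigram-overlap F1, a simplified stand-in for metrics in the ROUGE family rather than a faithful implementation of any of them. The two sentences are invented examples that say roughly the same thing with almost no shared words.

```python
from collections import Counter

def unigram_f1(reference, candidate):
    """Token-overlap F1 between two whitespace-tokenized, lowercased strings."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "BPE merges the most frequent symbol pair"
paraphrase = "It repeatedly fuses the commonest adjacent tokens"
score = unigram_f1(reference, paraphrase)
print(f"{score:.3f}")
```

Only the word "the" is shared, so the score lands near the bottom of the 0 to 1 range even though a human reader would accept the paraphrase.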

### Metrics We Use

| Metric | Question the Judge Asks | Score Range | What It Measures |
|--------|------------------------|-------------|-------------------|
| **Faithfulness** | "Is the answer supported *only* by the retrieved context?" | 1-5 | Anti-hallucination: did the LLM make things up? |
| **Answer Relevance** | "Does the answer actually address the user's question?" | 1-5 | Is the response on-topic and helpful? |

## 📊 Step 10: W&B Weave LLM-as-a-Judge Evaluation

Now we'll create custom scorer functions for W&B Weave evaluation. These scorers use GPT-4o-mini as the judge to evaluate:

1. **Faithfulness**: Is the answer based solely on the retrieved context?
2. **Answer Relevance**: Does the answer address the user's question?

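Rather than relying on built-in metrics, we define the scorers as ordinary Python functions. Each one is decorated with `@weave.op()`, so every judge call is traced alongside the RAG calls.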
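
Judge models do not always return clean JSON, so the parsing step deserves some care. The sketch below is one possible approach, not part of Weave or the OpenAI SDK: `parse_judge_output` is a hypothetical helper that pulls the first JSON object out of a reply and reports a parse failure as `None` instead of substituting a mid-scale score.

```python
import json
import re

def parse_judge_output(raw, default_score=None):
    """Extract score and justification from a judge reply that may wrap JSON in extra text."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            score = int(parsed["score"])
            if 1 <= score <= 5:  # accept only scores on the documented scale
                return {"score": score, "justification": str(parsed.get("justification", ""))}
        except (ValueError, KeyError, TypeError):
            pass
    # Signal the failure explicitly and keep the raw reply for inspection
    return {"score": default_score, "justification": raw}

wrapped = 'Verdict: {"score": 4, "justification": "Mostly supported."} Hope this helps.'
print(parse_judge_output(wrapped))
print(parse_judge_output("I think this is fine."))
```

Returning `None` on failure keeps unparseable replies out of the mean, whereas a silent default score would shift the aggregate without leaving a trace.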

In [None]:
from openai import OpenAI

# Initialize OpenAI client for LLM-as-Judge
client = OpenAI()

def llm_judge(prompt: str) -> str:
    """Call GPT-4o-mini to judge a response and return its raw text reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an expert evaluator. Provide a score from 1-5 and a brief justification."},
            {"role": "user", "content": prompt}
        ],
        temperature=0,
        # Request a bare JSON object so json.loads() in the scorers can parse the reply
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content

@weave.op()
def faithfulness_scorer(question: str, answer: str, contexts: list) -> dict:
    """
    Evaluates if the answer is faithful to the retrieved context.
    Score 5 = fully supported by context, no hallucination
    Score 1 = significant hallucination, answer not in context
    """
    # The context is truncated to 3000 characters below to limit prompt size
    context_text = "\n\n".join(contexts) if contexts else "No context provided"
    
    prompt = f"""Evaluate FAITHFULNESS: Is the answer supported ONLY by the provided context?

CONTEXT:
{context_text[:3000]}

QUESTION: {question}

ANSWER: {answer}

Score from 1-5:
- 5: Answer is completely supported by context, no external information used
- 4: Answer is mostly supported, minor additions that are reasonable
- 3: Answer is partially supported, some unsupported claims
- 2: Answer has significant unsupported claims
- 1: Answer is mostly hallucinated, not in context

Respond with JSON: {{"score": <1-5>, "justification": "<brief reason>"}}"""
    
    result = llm_judge(prompt)
    try:
        # Parse JSON response
        parsed = json.loads(result)
        return {"faithfulness_score": parsed.get("score", 3), "faithfulness_reason": parsed.get("justification", "")}
    except (ValueError, AttributeError, TypeError):
        # Fallback if the reply is not a JSON object: neutral score 3, raw reply kept as the reason
        return {"faithfulness_score": 3, "faithfulness_reason": result}

@weave.op()
def relevance_scorer(question: str, answer: str) -> dict:
    """
    Evaluates if the answer is relevant to the question.
    Score 5 = directly answers the question
    Score 1 = completely off-topic
    """
    prompt = f"""Evaluate RELEVANCE: Does the answer directly address the question?

QUESTION: {question}

ANSWER: {answer}

Score from 1-5:
- 5: Answer directly and completely addresses the question
- 4: Answer mostly addresses the question with minor gaps
- 3: Answer partially addresses the question
- 2: Answer barely addresses the question
- 1: Answer is completely off-topic

Respond with JSON: {{"score": <1-5>, "justification": "<brief reason>"}}"""
    
    result = llm_judge(prompt)
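    try:
        parsed = json.loads(result)
        return {"relevance_score": parsed.get("score", 3), "relevance_reason": parsed.get("justification", "")}
    except (ValueError, AttributeError, TypeError):
        # Fallback if the reply is not a JSON object: neutral score 3, raw reply kept as the reason
        return {"relevance_score": 3, "relevance_reason": result}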

print("✅ LLM-as-Judge scorers defined!")

In [None]:
# Run evaluation on all results
print("🔍 Running LLM-as-a-Judge evaluation...\n")

evaluation_results = []

for idx, row in results_df.iterrows():
    print(f"  Evaluating Q{idx+1}: {row['question'][:50]}...")
    
    # Get faithfulness score
    faith_result = faithfulness_scorer(
        question=row['question'],
        answer=row['answer'],
        contexts=row['contexts']
    )
    
    # Get relevance score
    rel_result = relevance_scorer(
        question=row['question'],
        answer=row['answer']
    )
    
    evaluation_results.append({
        "question": row['question'],
        "ground_truth": row['ground_truth'],
        "answer": row['answer'],
        "faithfulness": faith_result['faithfulness_score'],
        "faithfulness_reason": faith_result['faithfulness_reason'],
        "relevance": rel_result['relevance_score'],
        "relevance_reason": rel_result['relevance_reason']
    })

eval_results_df = pd.DataFrame(evaluation_results)
print("\n✅ Evaluation complete!")

# Calculate and display aggregate metrics
avg_faithfulness = eval_results_df['faithfulness'].mean()
avg_relevance = eval_results_df['relevance'].mean()

print("\n📊 AGGREGATE METRICS (1-5 scale, higher is better):")
print("=" * 60)
print(f"  Faithfulness Mean: {avg_faithfulness:.3f}")
print(f"  Relevance Mean:    {avg_relevance:.3f}")

In [None]:
# Log results to W&B
current_config = {
    "chunk_size": 1000,
    "chunk_overlap": 200,
    "retrieval_k": 5,
    "embedding_model": "text-embedding-3-small",
    "llm_model": MODEL,
    "judge_model": "gpt-4o-mini"
}

# Log configuration and metrics to W&B
wandb.init(project=WANDB_PROJECT, name="RAG_Baseline_v1", config=current_config, reinit=True)

# Log aggregate metrics
wandb.log({
    "faithfulness_mean": avg_faithfulness,
    "relevance_mean": avg_relevance,
    "num_questions": len(eval_results_df)
})

# Log the evaluation table
eval_table = wandb.Table(dataframe=eval_results_df)
wandb.log({"evaluation_results": eval_table})

# Log the golden dataset
golden_table = wandb.Table(dataframe=eval_df)
wandb.log({"golden_dataset": golden_table})

print("\n🎉 Results logged to W&B!")
print(f"🌐 View at: {wandb.run.get_url()}")

wandb.finish()

## 🔬 Step 11: Per-Question Breakdown

### Debugging Individual Failures

Averages hide details. To improve your system, you must look at **individual failures**.

#### How to interpret this table:
1.  **Low Faithfulness, High Relevance**: The model gave a good-sounding answer, but it wasn't in the document! This is dangerous (Hallucination).
2.  **High Faithfulness, Low Relevance**: The model quoted the document perfectly, but it didn't answer the user's question.
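
These two patterns can be turned into a small triage rule. The sketch below is an illustration only: `diagnose` is a hypothetical helper, the 3.5 threshold matches the one used in the cells that follow, and the suggested next steps are rough starting points rather than a definitive debugging procedure.

```python
def diagnose(faithfulness, relevance, threshold=3.5):
    """Map a pair of 1-5 judge scores to a coarse failure category."""
    faithful = faithfulness >= threshold
    relevant = relevance >= threshold
    if faithful and relevant:
        return "ok"
    if not faithful and relevant:
        return "possible hallucination: check the retrieved context and the QA prompt"
    if faithful and not relevant:
        return "off-target answer: check retrieval and question reformulation"
    return "both low: inspect the full trace"

for scores in [(5, 5), (2, 5), (5, 2), (1, 2)]:
    print(scores, "->", diagnose(*scores))
```

Sorting questions into these buckets first tells you whether to look at the retriever, the prompt, or both before opening individual traces.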

In [None]:
# Display per-question breakdown
print("📋 PER-QUESTION PERFORMANCE:\n")

breakdown = eval_results_df[["question", "faithfulness", "relevance"]].copy()
breakdown.columns = ["Question", "Faithfulness", "Relevance"]

# Highlight low-scoring questions (< 3.5)
# Note: Styler.map requires pandas 2.1+; on older versions use Styler.applymap instead
styled = breakdown.style.map(
    lambda x: 'background-color: #ffcccc' if isinstance(x, (int, float)) and x < 3.5 else '',
    subset=["Faithfulness", "Relevance"]
)

styled

In [None]:
# Identify and analyze low-scoring questions
THRESHOLD = 3.5  # Scores below this are concerning

print("🔍 FAILURE ANALYSIS:")
print("=" * 50)

# Find questions with low faithfulness
low_faith = eval_results_df[eval_results_df['faithfulness'] < THRESHOLD]

if len(low_faith) > 0:
    print(f"\n⚠ Found {len(low_faith)} question(s) with Faithfulness < {THRESHOLD}:")
    for idx, row in low_faith.iterrows():
        print(f"\n  Q: {row['question'][:70]}...")
        print(f"  Score: {row['faithfulness']}")
        print(f"  Reason: {row['faithfulness_reason'][:100]}...")
else:
    print("✅ No low-faithfulness questions found!")

# Find questions with low relevance
low_rel = eval_results_df[eval_results_df['relevance'] < THRESHOLD]

if len(low_rel) > 0:
    print(f"\n⚠ Found {len(low_rel)} question(s) with Relevance < {THRESHOLD}:")
    for idx, row in low_rel.iterrows():
        print(f"\n  Q: {row['question'][:70]}...")
        print(f"  Score: {row['relevance']}")
        print(f"  Reason: {row['relevance_reason'][:100]}...")
else:
    print("✅ No low-relevance questions found!")

---

## 🎯 Student Challenge

Now it's your turn to experiment!

### Challenge 1: Tune the Chunking Strategy

**Hypothesis**: Smaller chunks might capture specific details better.

**Task**:
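1. Go back to **Step 4** and change `chunk_size` from 1000 to 500
2. Delete the `vector_db` folder to force re-indexing (otherwise Step 5 reloads the old chunks)
3. Update `current_config` and the run name in Step 10 so the new run is labeled with the settings it actually used
4. Re-run the entire pipeline
5. Compare results in the W&B dashboard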
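
When comparing runs, it helps if each run's name encodes its settings and if metric changes are computed rather than eyeballed. The sketch below shows one way to do both; `run_name_from_config` and `metric_deltas` are hypothetical helpers, the config keys mirror `current_config` from Step 10, and the metric values are placeholders, not results from this notebook.

```python
def run_name_from_config(config, prefix="RAG"):
    """Build a readable run name from the tuned parameters."""
    return f"{prefix}_cs{config['chunk_size']}_co{config['chunk_overlap']}_k{config['retrieval_k']}"

def metric_deltas(baseline, candidate):
    """Difference (candidate - baseline) for every metric present in both runs."""
    return {name: round(candidate[name] - baseline[name], 3)
            for name in baseline if name in candidate}

experiment_config = {"chunk_size": 500, "chunk_overlap": 200, "retrieval_k": 5}
run_name = run_name_from_config(experiment_config)
print(run_name)

# Placeholder numbers purely to show the arithmetic
baseline_metrics = {"faithfulness_mean": 4.0, "relevance_mean": 4.5}
candidate_metrics = {"faithfulness_mean": 4.25, "relevance_mean": 4.5}
deltas = metric_deltas(baseline_metrics, candidate_metrics)
print(deltas)
```

With only eight questions in the golden dataset, small differences between runs may be noise, so treat a delta as a hint to look at the per-question table rather than as proof.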

### Challenge 2: Expand the Golden Dataset

**Task**: Add 3 new questions to the evaluation dataset:

1. One **factual question** ("When was X published?")
2. One **reasoning question** ("Compare X and Y")
3. One **out-of-scope question** (something NOT in your documents)

### Challenge 3: Try Different Prompts

**Task**: Modify the `qa_system_prompt` to be more strict about hallucination:
- Add: "If the information is not in the context, say 'I don't know'"
- Re-run evaluation and compare faithfulness scores

---

## 📚 Next Steps & Resources

### Further Reading

| Resource | Description | Link |
|----------|-------------|------|
| **W&B Weave Docs** | Complete guide to LLM evaluation | [docs.wandb.ai/weave](https://docs.wandb.ai/weave) |
| **W&B Evaluations Guide** | In-depth tutorial on evaluations | [docs.wandb.ai/weave/guides](https://docs.wandb.ai/weave/guides/core-types/evaluations) |
| **RAGAS Framework** | Alternative evaluation framework | [ragas.io](https://docs.ragas.io/) |
| **LangSmith** | LangChain's observability platform | [docs.smith.langchain.com](https://docs.smith.langchain.com/) |

---

## 🏁 Summary

In this notebook, you learned:

- ✅ RAG systems need **systematic evaluation**, not just manual testing
- ✅ **Golden Datasets** provide the ground truth for benchmarking
- ✅ **LLM-as-a-Judge** enables semantic evaluation at scale
- ✅ **W&B Weave** provides tracing, evaluation, and dashboards
- ✅ **Faithfulness** measures grounding in the retrieved context (low scores flag hallucination); **Relevance** measures whether the answer addresses the question
- ✅ Per-question analysis helps you **debug specific failures**
- ✅ Experiment tracking in W&B helps you **iterate on configurations**

**Remember**: A RAG system is only as good as your ability to measure and improve it!