# LLM Evaluations for RAG Systems

Given the stochastic nature of Large Language Models (LLMs), establishing robust evaluation criteria is crucial for building confidence in their performance. For Retrieval-Augmented Generation (RAG) systems, comprehensive evaluation requires assessing both the retrieval and generation components to ensure system reliability and accuracy.

## Background

In the 101 RAG Hands-On Training, we showed how LLM judges can be used to evaluate RAG systems.

- **[Evaluation Documentation Reference](https://docs.google.com/document/d/1Rg1QXZ5Cg0aX8hYvRrvevY1uz6lPpZkaasoqW7Pcm9o/edit?tab=t.0#heading=h.jjijsv4v12qe)** 
- **[Evaluation Code Reference](./../workshop-101/eval_rag.py)** 

## Workshop Objectives

In this notebook, we will explore advanced evaluation techniques using two powerful libraries:
- **[Ragas](https://github.com/explodinggradients/ragas)** 
- **[Google Gen AI Evaluation Service](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview)** 

These tools will help you implement systematic evaluation workflows to measure and improve your RAG system's performance across various metrics and use cases.

## Ragas

Ragas is an open-source library, published under the Apache 2.0 license, that provides a toolkit for evaluating and optimizing LLM applications. It offers specialized metrics and evaluation frameworks that make it easier to assess LLM generations.

### Installation

You can install Ragas using uv (our preferred package manager):

```bash
uv add ragas
```

Alternatively, you can install it with pip:

```bash
pip install ragas
```

### Setting up Ragas

Install the LangChain integration for Vertex AI so that Vertex AI models can be used as the evaluator LLM in Ragas:

```bash
uv add langchain-google-vertexai
```

In [None]:
from ragas.llms import LangchainLLMWrapper
from langchain_google_vertexai import ChatVertexAI

# Define global constants for project and location
PROJECT_ID = "weave-ai-sandbox"
LOCATION = "us-central1"

evaluator_llm = LangchainLLMWrapper(
    ChatVertexAI(
        model="gemini-2.5-flash",
        project=PROJECT_ID,
        location=LOCATION,
    )
)

In [None]:
# Import additional modules for vector store integration
from pathlib import Path
from google import genai
from vector_store import MilvusVectorStore

# Initialize GenAI Client for vector store operations
genai_client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

In [None]:
# Import the complete RAG system from app_201.py
from app_201 import generate_chat_response, read_prompt_from_file
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference

# Initialize the context precision metric
context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

In [None]:
# Utility functions for systematic RAG evaluation using the complete app_201 pipeline


def create_evaluation_sample_with_rag(
    user_input: str,
    vector_store: MilvusVectorStore,
    genai_client,
    collection_name: str = "weave_docs",
    top_k: int = 5,
    system_prompt_version: str = "v1",
    chat_history: list | None = None,
) -> SingleTurnSample:
    """
    Create a SingleTurnSample using the complete RAG pipeline from app_201.py.

    This function:
    1. Retrieves relevant contexts from vector store
    2. Generates an actual response using the RAG system
    3. Returns both for evaluation

    Args:
        user_input: The user's question/query
        vector_store: The initialized MilvusVectorStore instance
        genai_client: The GenAI client for response generation
        collection_name: Name of the collection to retrieve from
        top_k: Number of top similar documents to retrieve
        system_prompt_version: Version of system prompt to use
        chat_history: Previous conversation history

    Returns:
        SingleTurnSample with real RAG-generated response
    """
    if chat_history is None:
        chat_history = []

    # Step 1: Retrieve relevant contexts from the vector store
    retrieved_contexts = vector_store.retrieve(
        query=user_input, collection_name=collection_name, top_k=top_k, verbose=False
    )

    # Step 2: Generate actual response using the RAG system from app_201.py
    system_prompt = read_prompt_from_file(system_prompt_version)

    response = generate_chat_response(
        client=genai_client,
        system_prompt=system_prompt,
        user_message=user_input,
        chat_history=chat_history,
        context_snippets=retrieved_contexts,
        model="gemini-2.5-flash",
        verbose=False,
    )

    return SingleTurnSample(
        user_input=user_input,
        response=response,  # Generated by the RAG pipeline, not a mock
        retrieved_contexts=retrieved_contexts,
    )


def create_evaluation_sample_mock_response(
    user_input: str,
    response: str,
    vector_store: MilvusVectorStore,
    collection_name: str = "weave_docs",
    top_k: int = 5,
) -> SingleTurnSample:
    """
    Create a SingleTurnSample for evaluation using a predefined (mock) response.

    Use this when you want to test against known/expected responses.
    """
    retrieved_contexts = vector_store.retrieve(
        query=user_input, collection_name=collection_name, top_k=top_k, verbose=False
    )

    return SingleTurnSample(
        user_input=user_input,
        response=response,
        retrieved_contexts=retrieved_contexts,
    )


async def evaluate_context_precision(
    sample: SingleTurnSample, metric: LLMContextPrecisionWithoutReference
) -> float:
    """
    Evaluate context precision for a given sample.

    Args:
        sample: The SingleTurnSample to evaluate
        metric: The context precision metric instance

    Returns:
        The context precision score
    """
    return await metric.single_turn_ascore(sample)


async def batch_evaluate_samples(
    samples: list[SingleTurnSample], metric: LLMContextPrecisionWithoutReference
) -> list[float]:
    """
    Evaluate multiple samples and return their scores.

    Args:
        samples: List of SingleTurnSample instances
        metric: The context precision metric instance

    Returns:
        List of context precision scores
    """
    scores = []
    for i, sample in enumerate(samples):
        score = await evaluate_context_precision(sample, metric)
        scores.append(score)
        print(f"Sample {i + 1}: Context Precision = {score:.3f}")

    return scores


print("Enhanced evaluation utility functions defined successfully!")
print("✓ create_evaluation_sample_with_rag() - Uses complete RAG pipeline")
print("✓ create_evaluation_sample_mock_response() - Uses mock responses")
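
The `batch_evaluate_samples()` helper above awaits one sample at a time, which keeps the output readable but means total runtime grows linearly with the number of samples. A minimal sketch of a concurrent variant follows. It assumes only that the metric exposes an async `single_turn_ascore()` method, as the Ragas metric used in this notebook does. The names `batch_evaluate_concurrently`, `max_concurrency`, and `StubMetric` are hypothetical and exist only for illustration; the stub stands in for a real metric so the example runs without calling an LLM.

```python
import asyncio


async def batch_evaluate_concurrently(samples, metric, max_concurrency: int = 4) -> list[float]:
    """Score samples concurrently; results come back in input order."""
    # The semaphore caps the number of in-flight judge calls (hypothetical default of 4).
    semaphore = asyncio.Semaphore(max_concurrency)

    async def score_one(sample) -> float:
        async with semaphore:
            return await metric.single_turn_ascore(sample)

    # asyncio.gather preserves the order of its arguments in the result list.
    return await asyncio.gather(*(score_one(sample) for sample in samples))


class StubMetric:
    """Stand-in for a Ragas metric so this sketch runs without an LLM."""

    async def single_turn_ascore(self, sample) -> float:
        # Longer samples finish later, so completion order differs from input order.
        await asyncio.sleep(0.01 * len(sample))
        return float(len(sample))


# asyncio.run is used so the sketch also works as a plain script.
demo_scores = asyncio.run(batch_evaluate_concurrently(["aaa", "bb", "c"], StubMetric()))
print(demo_scores)
```

Inside a notebook cell an event loop is already running, so replace the `asyncio.run(...)` call with `await batch_evaluate_concurrently(evaluation_samples, context_precision)`. Keep `max_concurrency` low enough to stay within your model's rate limits.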

### Integration with app_201.py RAG System

The utility functions above wrap the **complete RAG pipeline** from `app_201.py`, so we can evaluate actual system responses rather than only mock responses.

**Key Components from app_201.py:**
- **`generate_chat_response()`**: The main function that generates responses using retrieved context
- **`read_prompt_from_file()`**: Loads the system prompt that guides response generation
- **System prompts**: The actual prompts used in production (v1, v2, v3)
- **Chat history handling**: Maintains conversation context
- **GenAI client configuration**: Same model and parameters as production

**Two Evaluation Approaches:**
1. **Real RAG Evaluation**: Uses the complete pipeline to generate actual responses
2. **Mock Response Evaluation**: Tests against known/expected responses (useful for regression testing)

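Samples built either way carry the query, the retrieved contexts, and the response, so the same samples can be scored with retrieval metrics (such as context precision, used below) and with generation metrics.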

In [None]:
# Initialize the Milvus vector store (same logic as app_201.py)
def init_vector_store(
    client: genai.Client, collection_name: str = "weave_docs", reingest: bool = False
) -> MilvusVectorStore:
    """Initialize the Milvus vector store and ingest documents if needed."""
    base_dir = Path.cwd()  # Using current working directory for notebook context
    doc_paths = [str(base_dir / "data" / "waml.md")]
    vector_db_path = base_dir / "vector_db" / "milvus.db"

    # Ensure the parent directory exists
    vector_db_path.parent.mkdir(parents=True, exist_ok=True)

    # Initialize Milvus vector store
    vector_store = MilvusVectorStore(
        vector_db_path=str(vector_db_path), genai_client=client
    )

    # Create collection if it doesn't exist or if reingestion is forced
    if reingest or not vector_store.milvus_client.has_collection(collection_name):
        vector_store.create_collection(doc_paths, collection_name=collection_name)

    return vector_store


# Initialize the vector store
print("Initializing Milvus vector store...")
vector_store = init_vector_store(
    genai_client, collection_name="weave_docs", reingest=False
)
print("Vector store initialized successfully!")

### Retriever Evaluation 

In the 101 workshop, we demonstrated how to evaluate the retrieval system's ability to rank relevant chunks ahead of irrelevant ones. That evaluation was modeled on the Ragas Context Precision metric.

**References:**
- Code reference to base implementation: [Base implementation](./../workshop-101/eval_rag.py#115)
- Ragas documentation: [Context Precision metric](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/)

Before implementing the code, take a moment to read the Ragas documentation to understand how context precision is calculated.
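
To make the calculation concrete, here is a minimal pure-Python sketch of the formula as described in the Ragas documentation: Precision@k is accumulated at every rank k that holds a relevant chunk, and the sum is divided by the number of relevant chunks in the top K. In Ragas the per-chunk relevance verdicts come from the evaluator LLM; here they are hard-coded, and the function name `context_precision_at_k` is hypothetical. The library's own implementation may differ in small details.

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """Rank-weighted precision over the top-K retrieved chunks.

    relevance[i] is 1 if the chunk at rank i + 1 was judged relevant, else 0.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # No relevant chunk was retrieved at all

    weighted_sum = 0.0
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision@k, counted only at ranks that hold a relevant chunk
            weighted_sum += hits / rank
    return weighted_sum / total_relevant


# Same two relevant chunks, ranked early versus late (illustrative verdicts)
early = context_precision_at_k([1, 0, 1, 0, 0])
late = context_precision_at_k([0, 0, 1, 0, 1])
print(f"relevant chunks ranked early: {early:.3f}")
print(f"relevant chunks ranked late:  {late:.3f}")
```

Both rankings contain the same number of relevant chunks, but the first scores higher because those chunks appear nearer the top. This is why the metric measures ranking quality rather than only how many relevant chunks were retrieved.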

Now, let's implement the Ragas version of this metric to evaluate retrieval performance:

In [None]:
# Example 1: Using the complete RAG pipeline from app_201.py
VECTOR_TOP_K = 5  # Number of similar documents to retrieve

# Test query related to WAML documentation
user_query = "What is the WAML file and where does it live?"

print("=== Evaluating with REAL RAG System (app_201.py pipeline) ===")

# Create sample using the complete RAG pipeline
rag_sample = create_evaluation_sample_with_rag(
    user_input=user_query,
    vector_store=vector_store,
    genai_client=genai_client,
    collection_name="weave_docs",
    top_k=VECTOR_TOP_K,
    system_prompt_version="v1",
)

print(f"User Query: {rag_sample.user_input}")
print(f"RAG Generated Response: {rag_sample.response}")
print(f"Retrieved {len(rag_sample.retrieved_contexts)} contexts")

# Evaluate context precision for the RAG-generated response
rag_context_precision_score = await context_precision.single_turn_ascore(rag_sample)
print(f"Context Precision Score (RAG): {rag_context_precision_score:.3f}")

print("\n" + "=" * 60)
print("=== Comparison: Mock Response vs RAG Response ===")

# Mock response for comparison
mock_response = """The WAML™ is the `.weave.yaml` file that lives in the root of all service repos. It defines how a service is deployed and contains configuration details like the friendly name, GitHub slug, owner, Slack channel, namespace, and deployment specifications."""

# Create sample with mock response
mock_sample = create_evaluation_sample_mock_response(
    user_input=user_query,
    response=mock_response,
    vector_store=vector_store,
    collection_name="weave_docs",
    top_k=VECTOR_TOP_K,
)

# Evaluate context precision for the mock response
mock_context_precision_score = await context_precision.single_turn_ascore(mock_sample)

print(f"Context Precision Score (Mock): {mock_context_precision_score:.3f}")
print(f"Context Precision Score (RAG):  {rag_context_precision_score:.3f}")
print(f"Difference: {rag_context_precision_score - mock_context_precision_score:.3f}")

In [None]:
# Example: Batch evaluation using the complete RAG system

# Define test queries (no need for expected responses since we'll generate them)
test_queries = [
    "What is the purpose of the slug field in WAML?",
    "How do you configure Slack notifications in WAML?",
    "What is the deploy section used for in WAML?",
    "How do you specify the owner of a repository in WAML?",
    "What are external links used for in WAML?",
]

print("Creating evaluation samples using the complete RAG pipeline...")
evaluation_samples = []

for i, query in enumerate(test_queries):
    print(f"Generating response for query {i + 1}: {query[:50]}...")

    # Generate actual responses using the RAG system
    sample = create_evaluation_sample_with_rag(
        user_input=query,
        vector_store=vector_store,
        genai_client=genai_client,
        collection_name="weave_docs",
        top_k=VECTOR_TOP_K,
        system_prompt_version="v1",
    )

    evaluation_samples.append(sample)
    print(f"✓ Generated response: {sample.response[:100]}...")

print(
    f"\nCreated {len(evaluation_samples)} evaluation samples with REAL RAG responses!"
)
print("Each sample contains:")
print("  - User query")
print("  - RAG-generated response (using app_201.py pipeline)")
print("  - Retrieved contexts from vector store")

In [None]:
import statistics

scores = await batch_evaluate_samples(evaluation_samples, context_precision)

# Calculate and display statistics
avg_score = statistics.mean(scores)
print("\n--- Evaluation Results ---")
print(f"Average Context Precision: {avg_score:.3f}")
print(f"Min Score: {min(scores):.3f}")
print(f"Max Score: {max(scores):.3f}")
print(f"Standard Deviation: {statistics.stdev(scores):.3f}")
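
Aggregate statistics can hide individual queries where retrieval performs poorly. The sketch below pairs each query with its score and lists the ones that fall below a cutoff. The helper name `summarize_scores` and the `threshold` default of 0.7 are hypothetical, and the scores in the example are made up for illustration; they are not results from this notebook.

```python
import statistics


def summarize_scores(queries: list[str], scores: list[float], threshold: float = 0.7) -> dict:
    """Aggregate scores and list the queries scoring below `threshold`."""
    if len(queries) != len(scores):
        raise ValueError("queries and scores must have the same length")

    low = [(query, score) for query, score in zip(queries, scores) if score < threshold]
    return {
        "mean": statistics.mean(scores),
        # stdev needs at least two data points, so fall back to 0.0 for one
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        # Worst-scoring queries first
        "below_threshold": sorted(low, key=lambda pair: pair[1]),
    }


# Made-up scores, for illustration only
summary = summarize_scores(["q1", "q2", "q3"], [0.9, 0.5, 0.6])
for query, score in summary["below_threshold"]:
    print(f"{query}: {score:.3f}")
```

In this notebook the call would be `summarize_scores(test_queries, scores)`. Queries that land in `below_threshold` are good candidates for inspecting the retrieved contexts by hand.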