# LLM Evaluations for RAG Systems

Given the stochastic nature of Large Language Models (LLMs), establishing robust evaluation criteria is crucial for building confidence in their performance. For Retrieval-Augmented Generation (RAG) systems, comprehensive evaluation requires assessing both the retrieval and generation components to ensure system reliability and accuracy.

## Background

In the 101 RAG Hands-On Training, we demonstrated how LLM judges can be utilized to evaluate RAG systems effectively.

- **[Evaluation Documentation Reference](https://docs.google.com/document/d/1Rg1QXZ5Cg0aX8hYvRrvevY1uz6lPpZkaasoqW7Pcm9o/edit?tab=t.0#heading=h.jjijsv4v12qe)** 
- **[Evaluation Code Reference](./../../workshop-101/eval_rag.py)** 

## Workshop Objectives

In this notebook, we explore advanced evaluation techniques using **[Ragas](https://github.com/explodinggradients/ragas)**. The goal is to help you build systematic evaluation workflows that measure and improve your RAG system's performance across different metrics and use cases.

## Ragas

Ragas is an open-source library published under the Apache 2.0 license that provides a comprehensive toolkit for evaluating and optimizing LLM applications. It offers specialized metrics and evaluation frameworks, making it easier to assess LLM generations.

### Installation

You can install Ragas using uv (our preferred package manager):

```bash
uv add ragas
```

Alternatively, you can install it with pip:

```bash
pip install ragas
```

### Setting up Ragas

Install the LangChain wrapper for Vertex AI to use Vertex AI models in Ragas:

```bash
uv add langchain-google-vertexai
```

In [None]:
from ragas.llms import LangchainLLMWrapper
from langchain_google_vertexai import ChatVertexAI

# Define global constants for project and location
PROJECT_ID = "weave-ai-sandbox"
LOCATION = "us-central1"

# The LangChain wrapper is required to use Vertex AI models with Ragas
evaluator_llm = LangchainLLMWrapper(
    ChatVertexAI(
        model="gemini-2.5-flash",
        project=PROJECT_ID,
        location=LOCATION,
    )
)

### Retriever Evaluation 

In the 101 workshop, we demonstrated how context precision can be used to evaluate the retriever's ranking of relevant chunks. Context Precision measures the degree to which the relevant chunks in the retrieved context for a given query are ranked above the irrelevant ones.
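
To make the ranking idea concrete, here is a minimal sketch of the rank-weighted calculation as the Ragas documentation describes Context Precision@K: the mean of Precision@k taken over the ranks that hold a relevant chunk. The function name `context_precision_at_k` and the hand-supplied relevance verdicts are assumptions for illustration; in Ragas the verdicts come from the LLM judge, and the library's implementation may differ in detail.

```python
def context_precision_at_k(relevance: list[bool]) -> float:
    """Rank-weighted precision over a ranked list of relevance verdicts.

    relevance[i] is True when the chunk at rank i + 1 was judged relevant.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            # Precision@rank only contributes at ranks holding a relevant chunk.
            weighted_sum += relevant_seen / rank
    # No relevant chunk retrieved: define the score as 0 to avoid dividing by zero.
    return weighted_sum / relevant_seen if relevant_seen else 0.0


# The same two relevant chunks, ranked first versus ranked last.
relevant_first = context_precision_at_k([True, True, False])
relevant_last = context_precision_at_k([False, True, True])
```

Both lists contain two relevant chunks, yet `relevant_first` evaluates to 1.0 while `relevant_last` is lower, because an irrelevant chunk at rank 1 drags down every Precision@k that follows it.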

Before implementing the code, take a moment to review the Ragas documentation to understand how it calculates context precision. Also experiment with Context Recall, which measures how much of the relevant information was successfully retrieved.

**References:**
- Code reference to base implementation: [Base implementation](./../../workshop-101/eval_rag.py#115)
- Ragas documentation: [Context Precision metric](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/)
- Ragas documentation: [Context Recall metric](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall/)

In [None]:
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference

# Example usage when you do not have a reference response
context_precision_without_reference = LLMContextPrecisionWithoutReference(
    llm=evaluator_llm
)
sample_without_reference = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",  # Experiment with changing the response here
    retrieved_contexts=[
        "The Eiffel Tower is located in Paris."
    ],  # Experiment with adding irrelevant contexts here
)

await context_precision_without_reference.single_turn_ascore(sample_without_reference)

In [None]:
from ragas.metrics import LLMContextPrecisionWithReference

# Example usage when you have a reference response
context_precision_with_reference = LLMContextPrecisionWithReference(llm=evaluator_llm)
sample_with_reference = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=[
        "The Eiffel Tower is located in Paris."
    ],  # Experiment with adding irrelevant contexts here
)

await context_precision_with_reference.single_turn_ascore(sample_with_reference)

In [None]:
from ragas.metrics import LLMContextRecall

# Higher recall means fewer relevant documents were left out.
context_recall = LLMContextRecall(llm=evaluator_llm)
sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=[
        "Paris is the capital of France.",
        "The Eiffel Tower is located in Paris.",  # Experiment by removing this relevant context
    ],
)

await context_recall.single_turn_ascore(sample)

The metrics above use LLMs to judge the retrieved context, but you can also use non-LLM metrics if you know the exact relevance of each context. See the non-LLM-based metrics in the Ragas documentation for details.
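
As a rough illustration of the non-LLM idea, the sketch below scores retrieval against a known list of relevant contexts using exact matches after light normalisation. The helper `exact_match_context_scores` is hypothetical and is not a Ragas API; the library's own non-LLM metrics may use fuzzier string similarity, so check the documentation before relying on this behaviour. Note that the precision computed here ignores rank order.

```python
def exact_match_context_scores(
    retrieved_contexts: list[str], reference_contexts: list[str]
) -> tuple[float, float]:
    """Return (precision, recall) based on exact matches after normalisation."""

    def normalise(text: str) -> str:
        # Lowercase and collapse whitespace so trivial differences do not matter.
        return " ".join(text.lower().split())

    retrieved = {normalise(c) for c in retrieved_contexts}
    reference = {normalise(c) for c in reference_contexts}
    hits = retrieved & reference
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall


precision, recall = exact_match_context_scores(
    retrieved_contexts=[
        "Paris is the capital of France.",
        "The Eiffel Tower is located in Paris.",
    ],
    reference_contexts=["The Eiffel Tower is located in Paris."],
)
```

Here the single relevant context was retrieved, so recall is 1.0, while precision is 0.5 because one of the two retrieved contexts is not in the reference list.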

### Generation Evaluation 

Once the retrieved context is relevant and complete, the next step is to evaluate the quality of the responses the RAG system generates.

In the 101 RAG Hands-On Training, we also demonstrated how an LLM judge can be used to evaluate answer quality.

**References:**
- Code reference to answer quality evaluation: [Answer Quality](./../../workshop-101/eval_rag.py#44)
- Ragas documentation: [Faithfulness Metric](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/)
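
The Faithfulness metric referenced above is documented as the share of claims in the response that the retrieved context supports. The sketch below shows only that final ratio, assuming the claims have already been extracted and judged; `ClaimVerdict` and `faithfulness_score` are hypothetical names, and the verdicts are labelled by hand rather than by an LLM.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # True when the retrieved context backs the claim


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of claims supported by the context (0.0 when there are no claims)."""
    if not verdicts:
        return 0.0
    return sum(v.supported for v in verdicts) / len(verdicts)


# Hand-labelled verdicts for a response that gets the birth date wrong.
verdicts = [
    ClaimVerdict("Einstein was born in Germany.", supported=True),
    ClaimVerdict("Einstein was born on 20 March 1879.", supported=False),
]
score = faithfulness_score(verdicts)
```

One of the two claims is supported, so `score` is 0.5. A response that repeats only what the context says would score 1.0.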

In [None]:
from ragas.metrics import Faithfulness

# Faithfulness metric measures how factually consistent a response is with the retrieved context.
# It ranges from 0 to 1, with higher scores indicating better consistency.
sample = SingleTurnSample(
    user_input="Where and when was Einstein born?",
    response="Einstein was born in Germany on 14th March 1879.",  # Experiment with changing the response here
    retrieved_contexts=[
        "Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time."
    ],
)
scorer = Faithfulness(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)

### Aspect Critique

Ragas has an evaluation concept called Aspect Critique that assesses submissions against predefined or custom aspects. The output of an aspect critique is binary, indicating whether or not the submission aligns with the defined aspect. The evaluation takes the response (the 'answer') as input.

**References:**
- Ragas documentation: [Aspect Critique](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/aspect_critic/)
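
Because the verdict is binary, a single noisy judgement can flip the result. One common way to steady it is to ask the judge several times and keep the majority answer. The sketch below shows only that voting step; `majority_verdict` is a hypothetical helper, not part of Ragas, and resolving ties to 0 is an assumption made here for the sake of a conservative default.

```python
from collections import Counter


def majority_verdict(verdicts: list[int]) -> int:
    """Collapse repeated binary verdicts (0 or 1) into one by majority vote."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    if any(v not in (0, 1) for v in verdicts):
        raise ValueError("verdicts must be 0 or 1")
    counts = Counter(verdicts)
    # Ties resolve to 0 (fail); using an odd number of runs avoids ties entirely.
    return 1 if counts[1] > counts[0] else 0


# Three judge runs on the same sample, two of which passed.
final = majority_verdict([1, 0, 1])
```

Each extra run is another LLM call, so the number of repetitions is a trade-off between stability and cost.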


In [None]:
from ragas.metrics import AspectCritic

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
)

scorer = AspectCritic(
    name="correctness",
    definition="Does the response accurately answer the user's question?",
    llm=evaluator_llm,
)
await scorer.single_turn_ascore(sample)

#### Integration with the RAG Chatbot System

Now we'll integrate the **RAG Pipeline** from `app_201.py` so that we evaluate real output from the RAG system instead of hand-written sample data.

**Key Components from app_201.py:**
- **`generate_chat_response()`**: The main function that generates responses using retrieved context
- **`read_prompt_from_file()`**: Loads the system prompt that guides response generation
- **System prompts**: The actual prompts used in production
- **Chat history handling**: Maintains conversation context
- **GenAI client configuration**: Same model and parameters as production

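Reusing these components means the evaluation scores reflect the responses your RAG system actually produces, with the same prompts, model, and retrieval settings.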

In [None]:
# Import the complete RAG system from app_201.py
from app_201 import generate_chat_response, read_prompt_from_file, init_vector_store
from vector_store import MilvusVectorStore
from google import genai

# Initialize GenAI Client
genai_client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

# NOTE: You cannot simultaneously run the chatbot app and this evaluation notebook
vector_store = init_vector_store(
    genai_client, collection_name="weave_docs", reingest=False
)  # Set reingest to True if you want to reingest the documents

In [None]:
from typing import Optional


# Utility function for systematic RAG evaluation using the complete app_201 pipeline
def create_evaluation_sample_with_rag(
    user_input: str,
    vector_store: MilvusVectorStore,
    genai_client,
    rag_llm_model: str = "gemini-2.5-flash",
    collection_name: str = "weave_docs",
    top_k: int = 5,
    system_prompt_version: str = "v1",
    chat_history: Optional[list] = None,
) -> SingleTurnSample:
    """
    Create a SingleTurnSample using the complete RAG pipeline from app_201.py.

    This function:
    1. Retrieves relevant contexts from vector store
    2. Generates an actual response using the RAG system
    3. Returns both for evaluation

    Args:
        user_input: The user's question/query
        vector_store: The initialized MilvusVectorStore instance
        genai_client: The GenAI client for response generation
        rag_llm_model: The LLM model to use for RAG response generation
        collection_name: Name of the collection to retrieve from
        top_k: Number of top similar documents to retrieve
        system_prompt_version: Version of system prompt to use
        chat_history: Previous conversation history

    Returns:
        SingleTurnSample with real RAG-generated response
    """
    if chat_history is None:
        chat_history = []

    # Step 1: Retrieve relevant contexts from the vector store
    retrieved_contexts = vector_store.retrieve(
        query=user_input, collection_name=collection_name, top_k=top_k, verbose=False
    )

    # Step 2: Generate actual response using the RAG system from app_201.py
    system_prompt = read_prompt_from_file(system_prompt_version)

    response = generate_chat_response(
        client=genai_client,
        system_prompt=system_prompt,
        user_message=user_input,
        chat_history=chat_history,
        context_snippets=retrieved_contexts,
        model=rag_llm_model,
        verbose=False,
    )

    return SingleTurnSample(
        user_input=user_input,
        response=response,
        retrieved_contexts=retrieved_contexts,
    )

In [None]:
RAG_LLM_MODEL = "gemini-2.5-flash-lite"
VECTOR_TOP_K = 5

# Test query related to WAML documentation
user_query = "What is the WAML file and where does it live?"

# Create sample using the complete RAG pipeline
rag_sample = create_evaluation_sample_with_rag(
    user_input=user_query,
    vector_store=vector_store,
    genai_client=genai_client,
    rag_llm_model=RAG_LLM_MODEL,
    collection_name="weave_docs",
    top_k=VECTOR_TOP_K,
    system_prompt_version="v1",
)

print(f"User Query: {rag_sample.user_input}")
print(f"RAG Generated Response: {rag_sample.response}")
print(f"Retrieved {len(rag_sample.retrieved_contexts)} contexts")

# Evaluate context precision for the RAG-generated response
rag_context_precision_score = (
    await context_precision_without_reference.single_turn_ascore(rag_sample)
)
print(f"Context Precision Score (RAG): {rag_context_precision_score:.3f}")

In [None]:
# Example: Batch evaluation using the complete RAG system
test_queries = [
    "What is the purpose of the slug field in WAML?",
    "How do you configure Slack notifications in WAML?",
    "What is the deploy section used for in WAML?",
    "How do you specify the owner of a repository in WAML?",
    "What are external links used for in WAML?",
]

print("Creating evaluation samples using the complete RAG pipeline...")
evaluation_samples = []

for i, query in enumerate(test_queries):
    print(f"Generating response for query {i + 1}: {query[:50]}...")

    # Generate actual responses using the RAG system
    sample = create_evaluation_sample_with_rag(
        user_input=query,
        vector_store=vector_store,
        genai_client=genai_client,
        rag_llm_model=RAG_LLM_MODEL,
        collection_name="weave_docs",
        top_k=VECTOR_TOP_K,
        system_prompt_version="v1",
    )

    evaluation_samples.append(sample)
    print(f"✓ Generated response: {sample.response[:10]}...")

print(
    f"\nCreated {len(evaluation_samples)} evaluation samples with REAL RAG responses!"
)
print("Each sample contains:")
print("  - User query")
print("  - RAG-generated response (using app_201.py pipeline)")
print("  - Retrieved contexts from vector store")

In [None]:
import statistics
import asyncio


async def evaluate_sample(sample):
    """Evaluate a single sample and return the score."""
    return await context_precision_without_reference.single_turn_ascore(sample)


# Run the evaluations concurrently; asyncio.gather returns scores in input order
evaluation_tasks = [evaluate_sample(sample) for sample in evaluation_samples]
scores = await asyncio.gather(*evaluation_tasks)

avg_score = statistics.mean(scores)
print("\n--- Evaluation Results ---")
print(f"Average Context Precision: {avg_score:.3f}")
print(f"Min Score: {min(scores):.3f}")
print(f"Max Score: {max(scores):.3f}")
print(f"Standard Deviation: {statistics.stdev(scores):.3f}")
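
Launching every evaluation at once can run into API rate limits as the test set grows. A minimal sketch of one way to cap the number of in-flight judge calls with `asyncio.Semaphore` follows; `score_with_limit` and its `max_concurrency` default are assumptions for illustration, not part of Ragas or of the workshop code.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence
from typing import Any


async def score_with_limit(
    samples: Sequence[Any],
    score_fn: Callable[[Any], Awaitable[float]],
    max_concurrency: int = 3,
) -> list[float]:
    """Score samples concurrently with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def score_one(sample: Any) -> float:
        # Wait for a free slot before calling the (rate-limited) judge.
        async with semaphore:
            return await score_fn(sample)

    # gather preserves input order, so scores[i] belongs to samples[i].
    return list(await asyncio.gather(*(score_one(s) for s in samples)))
```

In this notebook the call would look like `scores = await score_with_limit(evaluation_samples, evaluate_sample)`; outside a notebook, wrap the call in `asyncio.run(...)`.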

## Next Steps and Experimentation

Congratulations! You've successfully implemented RAG system evaluation using Ragas. Now it's time to experiment and further optimize your system.

### Suggested Experiments

#### 1. **Expand Your Data Sources**
- Add more documents to your vector store beyond the current WAML documentation.
- Try different document types (e.g., web pages, structured data).

#### 2. **Optimize Chunking Strategy**
- Experiment with different chunk sizes in `chunking.py`.
- Try different chunking strategies. Refer to the [Pinecone Reference](https://www.pinecone.io/learn/chunking-strategies/).
- Measure how chunking affects both retrieval and generation quality.
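
As a starting point for these experiments, here is a minimal sketch of a fixed-size character chunker with overlap. It is a generic illustration and makes no claim about what `chunking.py` actually contains; the function name `chunk_text` and its default sizes are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
# chunks == ["abcd", "cdef", "efgh", "ghij"]
```

Re-ingesting with a different `chunk_size` or `overlap` and re-running the same test queries lets you compare retrieval scores across settings.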

#### 3. **Comprehensive Metric Evaluation**
- Implement additional Ragas metrics.
- Create reference answers for your test queries to enable metrics that require ground truth.

#### 4. **Prompt Engineering Optimization**
- Experiment with different system prompts.
- Optimize prompts based on evaluation results.
- Test how different prompts affect metric scores.
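
When comparing two prompt versions, it helps to score the same queries with both and look at per-query differences as well as the averages. The sketch below assumes you already have two score lists in matching query order; `compare_prompt_versions` is a hypothetical helper, and the numbers in the example are placeholders, not measured results.

```python
import statistics


def compare_prompt_versions(
    baseline_scores: list[float], candidate_scores: list[float]
) -> dict[str, float]:
    """Summarise a paired comparison; both lists must cover the same queries in order."""
    if not baseline_scores or len(baseline_scores) != len(candidate_scores):
        raise ValueError("score lists must be non-empty and the same length")
    deltas = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    return {
        "baseline_mean": statistics.mean(baseline_scores),
        "candidate_mean": statistics.mean(candidate_scores),
        "mean_delta": statistics.mean(deltas),
        "queries_improved": sum(d > 0 for d in deltas),
        "queries_regressed": sum(d < 0 for d in deltas),
    }


# Placeholder scores for illustration only.
summary = compare_prompt_versions([0.5, 1.0, 0.0], [1.0, 1.0, 0.5])
```

With only a handful of queries, a small change in the mean can easily be judge noise, so treat differences as a hint rather than a conclusion until the test set is larger.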

### ⚠️ **Cost Considerations**

**Important Warning**: All the evaluation metrics in this notebook use LLMs (Gemini models) as judges, so every evaluation run incurs API costs. Running evaluations repeatedly can lead to significant expenses.

**Cost Management Tips**:
- Start with small test sets (5–10 samples) before scaling up.
- Avoid running evaluations repeatedly on the same configuration.
- Test the system qualitatively on a few samples to ensure it is improving before running the full evaluation suite.
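
One way to avoid paying twice for the same judgement is to cache scores under a key built from everything that influences them. The sketch below is a minimal in-memory version; `score_cache_key` and `ScoreCache` are hypothetical names, and which fields belong in the key is an assumption you should adapt to your setup.

```python
import hashlib
import json
from typing import Optional


def score_cache_key(
    metric_name: str,
    user_input: str,
    response: str,
    retrieved_contexts: list[str],
    judge_model: str,
) -> str:
    """Build a stable key from the inputs that influence the judge's score."""
    payload = json.dumps(
        {
            "metric": metric_name,
            "user_input": user_input,
            "response": response,
            "retrieved_contexts": retrieved_contexts,
            "judge_model": judge_model,
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ScoreCache:
    """In-memory score cache; persisting it to disk is left out of this sketch."""

    def __init__(self) -> None:
        self._scores: dict[str, float] = {}

    def get(self, key: str) -> Optional[float]:
        return self._scores.get(key)

    def put(self, key: str, score: float) -> None:
        self._scores[key] = score
```

Because generated responses vary between runs, cache hits mostly occur when you re-score samples you have already generated, for example after re-running an evaluation cell. Check `cache.get(key)` before calling the judge and `cache.put(key, score)` afterwards.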

### Google GenAI Evaluation 

Google also provides an evaluation suite for LLM generations. Refer to [Gen AI Evaluation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview) for more details.