# LLM Evaluations for RAG Systems

Given the stochastic nature of Large Language Models (LLMs), establishing robust evaluation criteria is crucial for building confidence in their performance. For Retrieval-Augmented Generation (RAG) systems, comprehensive evaluation requires assessing both the retrieval and generation components to ensure system reliability and accuracy.

## Background

In the 101 RAG Hands-On Training, we showed how LLM judges can be used to evaluate RAG systems. The material from that session is linked below:

- **[Evaluation Documentation Reference](https://docs.google.com/document/d/1Rg1QXZ5Cg0aX8hYvRrvevY1uz6lPpZkaasoqW7Pcm9o/edit?tab=t.0#heading=h.jjijsv4v12qe)** 
- **[Evaluation Code Reference](./../workshop-101/eval_rag.py)** 

## Workshop Objectives

In this notebook, we will explore more advanced evaluation techniques using two tools:
- **[Ragas](https://github.com/explodinggradients/ragas)** 
- **[Google Gen AI Evaluation Service](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview)** 

These tools will help you implement systematic evaluation workflows to measure and improve your RAG system's performance across various metrics and use cases.

## Ragas

Ragas is an open-source library, published under the Apache 2.0 license, that provides a toolkit for evaluating and optimizing LLM applications. It offers specialized metrics and evaluation workflows that make it easier to assess LLM generations.

### Installation

You can install Ragas using uv (our preferred package manager):

```bash
uv add ragas
```

Alternatively, you can install it with pip:

```bash
pip install ragas
```

### Setting up Ragas

Install the LangChain integration for Vertex AI so that Ragas can use Vertex AI models as the evaluator LLM:

```bash
uv add langchain-google-vertexai
```

In [None]:
from ragas.llms import LangchainLLMWrapper
from langchain_google_vertexai import ChatVertexAI

# Define global constants for project and location
PROJECT_ID = "weave-ai-sandbox"
LOCATION = "us-central1"

evaluator_llm = LangchainLLMWrapper(
    ChatVertexAI(
        model="gemini-2.5-flash",
        project=PROJECT_ID,
        location=LOCATION,
    )
)

### Retriever Evaluation 

In the 101 workshop, we evaluated how well the retrieval system ranks relevant chunks above irrelevant ones. That evaluation was based on the Ragas metric called Context Precision.

**References:**
- Code reference to base implementation: [Base implementation](./../workshop-101/eval_rag.py#115)
- Ragas documentation: [Context Precision metric](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/)

Before implementing the code, take a moment to read the Ragas documentation to understand how Ragas calculates context precision.
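As a rough illustration of the calculation described in the Ragas documentation, the sketch below computes Context Precision@K from a list of binary relevance verdicts, one per retrieved chunk in rank order. The function name `context_precision_at_k` and the `verdicts` input are hypothetical and used only for illustration. In Ragas itself the verdicts come from the evaluator LLM, and the library's implementation may differ in details such as numerical smoothing.

```python
def context_precision_at_k(verdicts: list[int]) -> float:
    """Average Precision@k over the ranks that hold a relevant chunk.

    verdicts[i] is 1 if the chunk at rank i + 1 was judged relevant, else 0.
    (Hypothetical helper for illustration; not part of the Ragas API.)
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            relevant_seen += 1
            # Precision@rank = relevant chunks so far / chunks seen so far
            weighted_sum += relevant_seen / rank
    # With no relevant chunks at all, the score is defined here as 0.0
    return weighted_sum / relevant_seen if relevant_seen else 0.0


# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2
print(context_precision_at_k([1, 0, 1]))
# The single relevant chunk is ranked second: (1/2) / 1
print(context_precision_at_k([0, 1]))
```

Because each relevant chunk contributes the precision at its own rank, the score drops when relevant chunks are pushed down the list, which is why the metric reflects ranking quality and not only whether relevant chunks were retrieved.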

Now, let's implement the Ragas version of this metric to evaluate retrieval performance:

In [None]:
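from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference

# LLM-judged context precision, scored by the evaluator LLM configured above
context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

# No reference answer is supplied: the judge checks each retrieved context
# against the generated response to decide whether it was relevant.
sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

# Top-level await works inside a notebook cell; the result is a score between 0 and 1.
await context_precision.single_turn_ascore(sample)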