<a href="https://colab.research.google.com/github/syedarhamraza/google-dev-colab/blob/main/LLM_as_a_Judge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build and Evaluate RAG with Gemini using LLM as a JUDGE

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/irum-zahra-awan/geneai/blob/main/LLM_as_a_Judge.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/irum-zahra-awan/geneai/blob/main/LLM_as_a_Judge.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>    

| Author |
| --- |
| [Irum Zahra](https://github.com/irum-zahra-awan/) |

# **Overview: RAG EVALUATION**
Evaluating the quality of Large Language Model (LLM) outputs, especially within Retrieval-Augmented Generation (RAG) systems, can be challenging, time-consuming, and subjective.

Yet evaluation is crucial for ensuring that a RAG system delivers **accurate**, **relevant**, and **grounded** responses. Traditional metrics still have their place, but using LLMs as "judges" offers a more nuanced and scalable approach, and Google's Gemini models make this kind of evaluation straightforward to implement.

This hands-on workshop introduces the **"LLM as a Judge"** paradigm, a powerful and increasingly popular method for automating and standardizing RAG evaluation.

You will learn how to leverage the advanced capabilities of Google's Gemini LLM to act as an impartial and effective judge, providing consistent and nuanced assessments of RAG system performance.

### **Why "LLM as a Judge" for RAG?**

Traditional RAG evaluation often relies on strict string matching or human annotation, which can be time-consuming, prone to subjectivity, and may not fully capture the semantic quality of an LLM's output. LLMs, when appropriately prompted, can act as intelligent evaluators. For RAG systems, this technique is particularly useful for assessing key aspects like:

*   **Context Relevance:** Does the retrieved information truly align with the user's query?
*   **Groundedness/Faithfulness:** Is the generated response accurately derived from the retrieved context, minimizing hallucinations?
*   **Answer Coherence and Fluency:** Beyond factual correctness, is the output well-structured, easy to understand, and natural-sounding?
*   **Completeness:** Does the answer address all aspects of the user's query based on the provided context?





### **What You Will Learn:**


*   **Understanding "LLM as Judge":** Grasp the core concepts, benefits, and limitations of using LLMs for evaluation.
*   **RAG Evaluation Metrics:** Explore specific metrics relevant to RAG systems (Context Relevance, Groundedness, Answer Relevancy) and how to evaluate them using an LLM judge.
* **Designing Effective Evaluation Prompts:** Learn the art of crafting robust and unbiased prompts that guide Gemini to perform accurate assessments.
* **Implementing LLM-based Evaluation with Gemini:** Get hands-on experience using the Gemini API to set up your own LLM judge for RAG systems.
* **Setting up Evaluation Pipelines:** Discover how to integrate LLM-as-Judge into your RAG development workflow for continuous evaluation and improvement.
* **Analyzing and Interpreting Judge Results:** Understand how to interpret scores and feedback from your Gemini judge to identify areas for RAG system optimization.
* **Best Practices for Reliable Evaluation:** Gain insights into strategies for minimizing bias, ensuring consistency, and validating your LLM judge's performance.

### **Implementing with Gemini Models**

Gemini's powerful understanding and generation capabilities make it an excellent choice for building LLM-as-a-Judge systems for RAG evaluation. Here's how you can leverage them:

* **Define Your Metrics & Rubric:** Clearly outline what you want to evaluate (e.g., context precision, answer accuracy, absence of hallucination). For each metric, create a detailed rubric that the Gemini model will use to score the RAG output.

* **Craft Effective Prompts:** This is key. Design prompts that instruct Gemini to act as a judge, providing it with the query, the retrieved context, and the RAG-generated answer. The prompt should clearly state the evaluation criteria and the desired output format (e.g., a numerical score, a binary "pass/fail," or a qualitative explanation).

* **Utilize Gemini API:** Integrate with the Gemini API to send your RAG outputs and evaluation prompts. You can choose a model based on your needs and budget: a Pro model for complex reasoning or a Flash model for faster, cheaper evaluations. This notebook uses `gemini-2.0-flash-001` to generate answers and `gemini-2.5-pro` as the judge.

* **Automate and Iterate:** Set up an automated pipeline to run your RAG outputs through the Gemini-powered judge. Collect the evaluation scores and feedback, which can then inform your RAG system's improvement.
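
As a rough illustration of the first two steps, a rubric can be kept as plain data and rendered into the judge prompt. The sketch below is only one possible arrangement: the `RUBRIC` dictionary, the `build_judge_prompt` helper, and the wording of the criteria are hypothetical names chosen for this example and are not part of any Gemini API.

```python
# Hypothetical rubric: metric name -> guiding question for the judge.
RUBRIC = {
    "Context Relevance": "Does the retrieved context align with the user's query?",
    "Groundedness": "Is the answer derived only from the retrieved context?",
}

def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Render the rubric and the RAG inputs into a single judge prompt."""
    criteria = "\n".join(
        f"{i}. {name} (Score 1-5): {guide}"
        for i, (name, guide) in enumerate(RUBRIC.items(), start=1)
    )
    return (
        "You are an impartial evaluator of a RAG system.\n"
        "Score each criterion from 1 (worst) to 5 (best) and justify the score.\n\n"
        f"Criteria:\n{criteria}\n\n"
        f"Question:\n{question}\n\n"
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n"
    )

# Placeholder inputs; in practice these come from the RAG pipeline.
prompt = build_judge_prompt(
    question="Which cVDPV type is most prevalent?",
    context="(retrieved chunks go here)",
    answer="(RAG answer goes here)",
)
print(prompt)
```

Keeping the rubric separate from the prompt text makes it easier to add or reword a criterion without touching the rest of the pipeline.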

# **USE CASE: MEDICAL RESEARCH**

This notebook showcases a practical application of Retrieval-Augmented Generation (RAG) within the specialized domain of medical research, specifically using research papers on polio as a knowledge base.

The core of the RAG system lies in its ability to first retrieve relevant information from these documents and then use that information to generate a comprehensive answer to a user's query. This process ensures that the generated answers are not only accurate but also grounded in the provided context, effectively minimizing the risk of a hallucinated response.

A key focus of this notebook is the **evaluation of the RAG system's output**. To this end, it introduces the concept of an **"LLM as a Judge."** This innovative approach uses a separate Large Language Model to act as an impartial evaluator of the RAG system's generated answers.

The "judge" LLM is tasked with assessing the faithfulness and relevance of the answers by comparing them against the source documents. This evaluation process is crucial for ensuring the reliability and trustworthiness of the RAG system, providing a robust mechanism for quality control and continuous improvement.

#### INSTALL

In [None]:
%pip install requests beautifulsoup4 pypdf langchain -q -U
%pip install google-cloud-aiplatform langchain-google-vertexai faiss-cpu langchain-community -q -U


### **Part 1: Downloading and Processing Papers with Python**

To use the information within these papers, we first need to get their content into our Python environment. All four sources used here are PDFs, so we will write a script to:


* **Fetch the content:** We'll use the `requests` library to download each PDF.

* **Parse the text:** We'll use `pypdf` to extract the text from each page of the downloaded PDFs. (`BeautifulSoup` would serve the same role for HTML pages, but it is not needed for these sources.)

* **Chunk the text:** Large language models have a limited context window (the amount of text they can consider at once), and retrieval works best on small, focused passages. To handle this, we break the text into smaller, overlapping **chunks**. The overlap reduces the chance that semantic meaning is lost at chunk boundaries. We use the `RecursiveCharacterTextSplitter` from langchain, a standard tool for this task.

This entire process prepares the raw data for the next crucial step: creating vector embeddings.
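
To make the idea of overlapping chunks concrete, here is a deliberately simplified sketch that slides a fixed-size character window over the text. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraph and line breaks; the `chunk_text` helper and the sample string are assumptions made for illustration only.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
chunks_demo = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for piece in chunks_demo:
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, which is what lets a sentence that straddles a boundary appear intact in at least one chunk.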



In [None]:
import requests
from bs4 import BeautifulSoup
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pypdf import PdfReader


import sys
from google.colab import auth
import vertexai

from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.vectorstores import FAISS

In [None]:
#-- Define PDF URLs and a directory for text files --#
pdf_sources = [
    # CDC MMWR update on vaccine-derived poliovirus outbreaks
    {"name": "cdc_update_2024", "url": "https://www.cdc.gov/mmwr/volumes/73/wr/pdfs/mm7341a1-H.pdf"},
    # Paper on polio as an endemic disease in Pakistan
    {"name": "Polio_endemic_disease_pakistan", "url": "https://ecommons.aku.edu/cgi/viewcontent.cgi?article=1297&context=pakistan_fhs_son"},
    # GPEI Polio Eradication Strategy 2022-2026
    {"name": "polio_eradication_strategy", "url": "https://polioeradication.org/wp-content/uploads/2022/06/Polio-Eradication-Strategy-2022-2026-Delivering-on-a-Promise.pdf"},
    # WHO polio vaccine position paper references (March 2016)
    {"name": "who_poilio_vaccine", "url": "https://cdn.who.int/media/docs/default-source/immunization/position_paper_documents/polio/who-pp-polio-mar2016-references.pdf?sfvrsn=f4e72554_2"}
]

# Create a directory to store the processed text
output_dir = "polio_papers"
os.makedirs(output_dir, exist_ok=True)


In [None]:
#-- Function to download and extract text from a PDF --#
def download_and_extract_pdf_text(pdf_info):
    """Downloads a PDF from a URL and extracts its text content."""
    pdf_filename = f"{pdf_info['name']}.pdf"
    text_filename = os.path.join(output_dir, f"{pdf_info['name']}.txt")

    try:
        # Download the PDF
        response = requests.get(pdf_info['url'], stream=True, timeout=60)
        response.raise_for_status()
        with open(pdf_filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print(f"Successfully downloaded {pdf_filename}")

        # Extract text from the downloaded PDF
        reader = PdfReader(pdf_filename)
        text = ""
        for page in reader.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"

        # Save the extracted text and return it
        with open(text_filename, "w", encoding="utf-8") as f:
            f.write(text)
        print(f"Successfully extracted and saved text to {text_filename}")
        return text

    except requests.exceptions.RequestException as e:
        print(f"Error downloading {pdf_filename}: {e}")
    except Exception as e:
        print(f"Error processing {pdf_filename}: {e}")
    return "" # Return empty string on failure

#--  Execute the download and extraction for all PDFs --#
all_texts = []
for pdf in pdf_sources:
    extracted_text = download_and_extract_pdf_text(pdf)
    if extracted_text:
        all_texts.append(extracted_text)

#--  Combine all extracted text into one string --#
full_text = "\n\n--- NEW DOCUMENT ---\n\n".join(all_texts)
print(f"\nTotal length of combined text: {len(full_text)} characters")

Successfully downloaded cdc_update_2024.pdf
Successfully extracted and saved text to polio_papers/cdc_update_2024.txt
Successfully downloaded Polio_endemic_disease_pakistan.pdf
Successfully extracted and saved text to polio_papers/Polio_endemic_disease_pakistan.txt
Successfully downloaded polio_eradication_strategy.pdf
Successfully extracted and saved text to polio_papers/polio_eradication_strategy.txt




Successfully downloaded who_poilio_vaccine.pdf
Successfully extracted and saved text to polio_papers/who_poilio_vaccine.txt

Total length of combined text: 478443 characters


### **Part 2: Storing Information in a Vector Database**

A vector database allows us to perform semantic search. Instead of just searching for keywords, we can search for concepts and meanings. Here's how it works:

* **Embedding Model:** We use a model from Vertex AI to convert our text chunks into numerical representations called vectors or embeddings. Each vector is a list of numbers that captures the semantic meaning of the text. Chunks with similar meanings will have vectors that are "close" to each other in mathematical space.

* **Vector Store:** We need a place to store these vectors and a way to search through them efficiently. FAISS is a lightweight and highly efficient library for this purpose. It's perfect for a Colab environment as it runs in memory and doesn't require a separate database server.

* **Storing:** The script will take each text chunk, pass it to the Vertex AI embedding model to get a vector, and then store that vector (along with the original text chunk) in our FAISS index.


This setup is the core of our RAG system. It allows us to take a user's question, find the most relevant chunks of information from our research papers, and use them to generate a factual, context-aware answer.
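
The notion of vectors being "close" can be sketched with a toy example. The three-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions), and cosine similarity is used as one common measure of closeness; the FAISS index built in this notebook may rank by a different distance, so treat this as a conceptual sketch rather than a description of FAISS internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings" keyed by the text they stand for.
toy_index = {
    "OPV vaccination campaigns": [0.9, 0.1, 0.0],
    "cVDPV2 outbreak response": [0.7, 0.6, 0.1],
    "Tobacco use among students": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.5, 0.0]  # pretend embedding of a polio outbreak question

def top_k(query, index, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

top_matches = top_k(query_vector, toy_index, k=2)
print(top_matches)
# ['cVDPV2 outbreak response', 'OPV vaccination campaigns']
```

The unrelated tobacco text points in a different direction and is ranked last, which mirrors how semantic search filters out off-topic chunks.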

In [None]:
# Chunk the text
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,  # The size of each chunk in characters
    chunk_overlap=200 # Number of characters to overlap between chunks
)
chunks = text_splitter.split_text(full_text)

print(f"\nTotal number of text chunks: {len(chunks)}")
print("Sample chunk:")
print(chunks[0])


Total number of text chunks: 395
Sample chunk:
Morbidity and Mortality Weekly Report
U.S. Centers for Disease Control and Prevention
Weekly / Vol. 73 / No. 41 O ctober 17, 2024
INSIDE
917 T obacco Product Use Among Middle and High 
School Students — National Youth Tobacco Survey, 
United States, 2024
925 C overage with Selected Vaccines and Exemption 
Rates Among Children in Kindergarten — United 
States, 2023–24 School Year
933 Not es from the Field: Enhanced Surveillance for 
Raccoon Rabies Virus Variant and Vaccination of 
Wildlife for Management — Omaha, Nebraska, 
October 2023–July 2024 
936 QuickStats
Continuing Education examination available at  
https://www.cdc.gov/mmwr/mmwr_continuingEducation.html
Update on Vaccine-Derived Poliovirus Outbreaks —  
Worldwide, January 2023–June 2024
Apophia Namageyo-Funa, PhD1; Sharon A. Greene, PhD1; Elizabeth Henderson2; Mohamed A. T raoré3; Shahzad Shaukat, PhD3; John Paul Bigouette, PhD1; 
Jaume Jorba, PhD2; Eric Wiesen, DrPH1; Omotayo Bo

##### SET UP VERTEX AI PROJECT

In [None]:
# Authenticate user
auth.authenticate_user()

In [None]:
# Initialize Vertex AI (authentication was done in the previous cell)

# Define your Google Cloud project (replace with your own project ID)
PROJECT_ID = "gen-lang-client-0043347732"  # @param {type:"string"}
LOCATION = "us-central1" # @param {type:"string"}

# Initialize Vertex AI
vertexai.init(project=PROJECT_ID, location=LOCATION)

In [None]:
import time
from google.api_core.exceptions import ResourceExhausted

# Set up the embedding model and FAISS vector store
embeddings = VertexAIEmbeddings(model_name="text-embedding-004")

def add_with_retry(store, texts, max_retries=5):
    """Add texts to the store, backing off exponentially on 429 quota errors."""
    for attempt in range(max_retries):
        try:
            store.add_texts(texts)
            return
        except ResourceExhausted:
            wait = 5 * 2 ** attempt
            print(f"Quota exceeded; retrying in {wait} seconds...")
            time.sleep(wait)
    raise RuntimeError("Embedding quota still exhausted after retries")

print("Creating vector store... This may take a few minutes.")
# FAISS.from_texts(chunks, embeddings) would embed every chunk in one call,
# which can exceed the embedding quota, so we seed the index with the first
# chunk and add the rest one at a time.
vector_store = FAISS.from_texts([chunks[0]], embeddings)

for i, chunk in enumerate(chunks[1:]):
    time.sleep(1)  # Small delay to stay under the quota
    add_with_retry(vector_store, [chunk])
    print(f"Processed chunk {i+2}/{len(chunks)}")  # Progress indicator

print("Vector store created successfully!")



Creating vector store... This may take a few minutes.
Processed chunk 2/395
Processed chunk 3/395
Processed chunk 4/395
Processed chunk 5/395
Processed chunk 6/395




ResourceExhausted: 429 Quota exceeded for aiplatform.googleapis.com/online_prediction_requests_per_base_model with base model: textembedding-gecko. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai.

##### Chunking & Retrieval: Sample Query

In [None]:
# Test the vector store with a sample query
# Let's see what information it retrieves for a sample question
sample_query = "What are the main challenges in polio eradication?"
retrieved_docs = vector_store.similarity_search(sample_query, k=2) # Get the top 2 most relevant chunks

print(f"\n--- Sample Retrieval for query: '{sample_query}' ---")
for i, doc in enumerate(retrieved_docs):
    print(f"\n--- Relevant Chunk {i+1} ---")
    print(doc.page_content)
    print("--------------------")




--- Sample Retrieval for query: 'What are the main challenges in polio eradication?' ---

--- Relevant Chunk 1 ---
immunity by overcoming barriers to reaching children.
Introduction
Live, attenuated oral poliovirus vaccine (OPV) induces 
long-term protection against paralytic disease, and limits virus 
shedding in vaccinated persons with infection (1). Circulating 
vaccine-derived poliovirus (cVDPVs)* outbreaks occur when 
OPV-related strains undergo prolonged circulation in com -
munities with very low immunity against polioviruses, and the 
genetically reverted virus has regained neurovirulence (vaccine-
derived poliovirus [VDPV] emergence) (2,3). After declaration 
of wild poliovirus type 2 eradication in 2015, and in an effort 
to lower the risk for cVDPV type 2 (cVDPV2) outbreaks, 
immunization programs in countries using OPV switched 
from using trivalent OPV (tOPV) (containing types 1, 2, and 
3 Sabin strains) in routine and supplementary immunization 
activities (SIAs) to biva

### **Part 4: Generating Questions for the LLM**


Based on the content of these papers, here are 6 questions that require the LLM to synthesize information from multiple sources.


* **Q1:** Based on the GPEI's eradication strategy and the challenges of polio as an endemic disease in Pakistan, what specific strategic objectives are most crucial for finally stopping transmission in the country?

* **Q2:** How do the WHO's recommendations on polio vaccines align with the CDC's latest update on circulating vaccine-derived poliovirus (cVDPV) outbreaks?

* **Q3:** According to the GPEI strategy and the CDC update, what are the primary risks associated with cVDPV, and how does this challenge the goal of "delivering on a promise" of polio eradication?

* **Q4:** Synthesizing information from the WHO vaccine document and the paper on polio in Pakistan, what are the key logistical and social challenges to achieving high vaccination coverage in endemic regions?

* **Q5:** What is the global strategy for responding to a poliovirus outbreak, and how might that strategy be specifically adapted for an endemic setting like Pakistan, considering the information from all four documents?

* **Q6:** According to the provided documents, what are the primary reasons for the persistence of circulating vaccine-derived poliovirus (cVDPV) outbreaks in 2023-2024, and which type is most prevalent?

### **Part 5: Answering Queries with a RAG System**

This is where everything comes together. We will now build the complete RAG pipeline to answer our generated questions.

* **Retriever:** The vector store we built (FAISS) acts as our retriever. When a user asks a question, the retriever's job is to quickly find and "retrieve" the most relevant text chunks from our source documents.

* **Prompt Template:** We don't just send the user's question to the LLM. We create a structured prompt. This prompt instructs the LLM on how to behave (e.g., "be a helpful assistant"), provides the retrieved text chunks as context, and then presents the user's question. This guides the model to base its answer only on the information we've provided.

* **LLM:** We use a powerful Gemini model from Vertex AI as the "brain" of our operation. It will receive the formatted prompt (with context) and generate a coherent, human-like answer.

* **Chain:** We use langchain to tie these components together into a `RetrievalQA` chain. This chain automates the entire process: a question goes in, and a context-aware answer comes out.

This RAG approach is generally more reliable than simply asking the LLM a question directly because it grounds the model's response in our specific source material. That reduces the risk of hallucinations (made-up information), provided the retriever actually surfaces the relevant passages.
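
Conceptually, the "stuff" strategy used by the chain amounts to joining the retrieved chunks and substituting them into the prompt template. The sketch below imitates that step in plain Python; `stuff_prompt`, `demo_template`, and the two sample chunks are hypothetical stand-ins, and the real chain handles this internally.

```python
def stuff_prompt(template: str, retrieved_chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and substitute them into the prompt template."""
    context = "\n\n".join(retrieved_chunks)
    return template.format(context=context, question=question)

demo_template = (
    "Use only the context below to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)
# Placeholder chunks standing in for what the retriever would return.
demo_chunks = [
    "cVDPV outbreaks occur in communities with very low immunity.",
    "nOPV2 supply has been periodically restricted.",
]
final_prompt = stuff_prompt(demo_template, demo_chunks, "Why do cVDPV outbreaks persist?")
print(final_prompt)
```

Because every retrieved chunk is placed in a single prompt, the number of chunks returned by the retriever directly controls how much context the model sees.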



In [None]:
# Set up the LLM and the QA Chain
from langchain_google_vertexai import VertexAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Initialize the Gemini LLM
llm = VertexAI(model_name="gemini-2.0-flash-001", temperature=0.1)

##### Prompt Template: Response Generation

In [None]:
# Create a prompt template
prompt_template = """
You are a helpful assistant specialized in summarizing information from medical research papers.
Use the following pieces of context to answer the question at the end.
If you don't know the answer from the context, just say that you don't know, don't try to make up an answer.
Be concise and provide the answer based only on the provided text.

Context:
{context}

Question: {question}

Answer:
"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [None]:
# Create the RetrievalQA Chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

##### Response Q1

In [None]:
# Ask one of our generated questions
question_to_ask = """Based on the GPEI's eradication strategy and the challenges of polio as an endemic disease in Pakistan,
                      what specific strategic objectives are most crucial for finally stopping transmission in the country?"""

print(f"Asking question: {question_to_ask}")
result = qa_chain.invoke({"query": question_to_ask})

# Print the results
print("\n--- Generated Answer ---")
print(result["result"])

print("\n--- Source Documents Used ---")
for doc in result["source_documents"]:
    print(f"\n- Source: {doc.page_content[:200]}...") # Print snippet of the source

Asking question: Based on the GPEI's eradication strategy and the challenges of polio as an endemic disease in Pakistan, 
                      what specific strategic objectives are most crucial for finally stopping transmission in the country?





--- Generated Answer ---
I am sorry, but the text does not contain information about the GPEI's eradication strategy, polio as an endemic disease in Pakistan or the strategic objectives for stopping transmission in the country.


--- Source Documents Used ---

- Source: immunity by overcoming barriers to reaching children.
Introduction
Live, attenuated oral poliovirus vaccine (OPV) induces 
long-term protection against paralytic disease, and limits virus 
shedding in...

- Source: Health Organization Polio Information System and Global 
Polio Laboratory Network, this report describes global polio 
outbreaks due to cVDPVs during January 2023–June 2024 
and updates previous repor...

- Source: mended vaccine for cVDPV2 outbreak response (5). However, 
nOPV2 supply has been periodically restricted because of 
manufacturing delays, including during a period in early 2024. 
Despite the goal of...

- Source: Morbidity and Mortality Weekly Report
U.S. Centers for Disease Control and Preventi

##### Response Q6

In [None]:
#--  Ask one of our generated questions --#
question_to_ask6 = "According to the provided documents, what are the primary reasons for the persistence of circulating vaccine-derived poliovirus (cVDPV) outbreaks in 2023-2024, and which type is most prevalent?"

print(f"Asking question: {question_to_ask6}")
result6 = qa_chain.invoke({"query": question_to_ask6})

#--  Print the results --#
print("\n--- Generated Answer ---")
print(result6["result"])

print("\n--- Source Documents Used ---")
for doc in result6["source_documents"]:
    print(f"\n- Source: {doc.page_content[:200]}...") # Print snippet of the source

Asking question: According to the provided documents, what are the primary reasons for the persistence of circulating vaccine-derived poliovirus (cVDPV) outbreaks in 2023-2024, and which type is most prevalent?

--- Generated Answer ---
The persistence of cVDPV outbreaks is primarily due to delayed implementation of outbreak response campaigns, low-quality campaigns, and barriers to reaching children to increase population immunity. cVDPV type 2 (cVDPV2) outbreaks are the most prevalent.


--- Source Documents Used ---

- Source: Health Organization Polio Information System and Global 
Polio Laboratory Network, this report describes global polio 
outbreaks due to cVDPVs during January 2023–June 2024 
and updates previous repor...

- Source: Morbidity and Mortality Weekly Report
U.S. Centers for Disease Control and Prevention
Weekly / Vol. 73 / No. 41 O ctober 17, 2024
INSIDE
917 T obacco Product Use Among Middle and High 
School Students...

- Source: mended vaccine for cVDPV2 outbreak

### **Part 6: Using an LLM as a Judge**

How do we know if our RAG system is providing good answers? We can evaluate them manually, but this is time-consuming. An advanced and powerful technique is to use another **LLM as an impartial "judge."**

* **The Judge's Task:** We give the judge LLM a very specific set of instructions. Its job is not to answer the original question, but to evaluate the answer generated by our RAG system.

* **Evaluation Criteria:** We define clear criteria for the judge. In this notebook, we ask it to assess four aspects:

    * **Context Relevance:** Do the retrieved source documents actually pertain to the user's question?

    * **Groundedness/Faithfulness:** Is the generated answer fully supported by the provided source documents? It should not contain information that isn't in the context.

    * **Answer Coherence and Fluency:** Is the answer well-structured and easy to understand?

    * **Completeness:** Does the answer address all aspects of the user's question?

* **Structured Output:** We instruct the judge to provide a score from 1 to 5 and a brief justification for each criterion. This makes the evaluation easy to interpret.

Using an LLM Judge automates the evaluation process, allowing us to quickly assess the quality of our RAG system's responses. It's a key part of building robust and reliable AI systems.



In [None]:
# Set up the Judge LLM and Prompt Template
judge_llm = VertexAI(model_name="gemini-2.5-pro", temperature=0.0)

##### Prompt Template: LLM Judge

In [None]:
multi_criteria_judge_prompt = """
You are an expert, impartial evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to evaluate a generated answer based on a given question and the source context documents.

Please evaluate the provided response based on the following four criteria. For each criterion, provide a score from 1 to 5 (where 1 is worst and 5 is best) and a brief justification for your score.

**Evaluation Criteria:**

1.  **Context Relevance (Score 1-5):**
    - Does the retrieved context truly align with the user's query?
    - Are the source documents pertinent to answering the question?
    - **Justification:**

2.  **Groundedness/Faithfulness (Score 1-5):**
    - Is the generated response accurately derived from the retrieved context?
    - Does it avoid making up information (hallucinations)?
    - **Justification:**

3.  **Answer Coherence and Fluency (Score 1-5):**
    - Is the output well-structured, grammatically correct, and easy to understand?
    - Does it sound natural?
    - **Justification:**

4.  **Completeness (Score 1-5):**
    - Does the answer address all aspects of the user's query based on the provided context?
    - **Justification:**

**Input:**

---
**Original Question:**
{question}

---
**Source Documents (Context):**
{context}

---
**Generated Answer:**
{answer}
---

**Evaluation Output:**

Please provide your evaluation in the format specified above.
"""

##### Evaluation: Q1

In [None]:
# Prepare the inputs for the judge
# We will use the results from the previous step
question = result["query"]
context_docs = "\n\n".join([doc.page_content for doc in result["source_documents"]])
generated_answer = result["result"]

# Format the input for the judge
judge_input = multi_criteria_judge_prompt.format(
    question=question,
    context=context_docs,
    answer=generated_answer
)

In [None]:
# Get the evaluation from the judge
print("\n--- ⚖️ Submitting to LLM Judge for Evaluation ⚖️ ---")
evaluation = judge_llm.invoke(judge_input)

# Print the judge's verdict
print("\n--- Judge's Evaluation ---")
print(evaluation)


--- ⚖️ Submitting to LLM Judge for Evaluation ⚖️ ---





--- Judge's Evaluation ---
**Evaluation Output:**

**1. Context Relevance (Score 1-5):**
- **Score:** 1
- **Justification:** The retrieved context is a general report on worldwide vaccine-derived poliovirus (cVDPV) outbreaks, with a focus on Africa. It does not contain any specific information about the GPEI's strategy, the situation in Pakistan, or polio as an endemic disease in that country. The source documents are entirely irrelevant to the user's specific query.

**2. Groundedness/Faithfulness (Score 1-5):**
- **Score:** 5
- **Justification:** The generated answer is perfectly faithful to the provided context. It correctly identifies that the source documents do not contain the information required to answer the question about Pakistan or the GPEI's strategy. It avoids hallucination and accurately reports on the absence of relevant data.

**3. Answer Coherence and Fluency (Score 1-5):**
- **Score:** 5
- **Justification:** The answer is a single, well-formed sentence that is gramm

##### Evaluation: Q6

In [None]:
# Prepare the inputs for the judge
# We will use the results from the previous step
question = result6["query"]
context_docs = "\n\n".join([doc.page_content for doc in result6["source_documents"]])
generated_answer = result6["result"]

# Format the input for the judge
judge_input = multi_criteria_judge_prompt.format(
    question=question,
    context=context_docs,
    answer=generated_answer
)

In [None]:
# Get the evaluation from the judge
print("\n--- ⚖️ Submitting to LLM Judge for Evaluation ⚖️ ---")
evaluation = judge_llm.invoke(judge_input)

# Print the judge's verdict
print("\n--- Judge's Evaluation ---")
print(evaluation)


--- ⚖️ Submitting to LLM Judge for Evaluation ⚖️ ---

--- Judge's Evaluation ---
**Evaluation Output:**

**1. Context Relevance (Score 1-5):**
- **Score:** 5
- **Justification:** The provided source documents are excerpts from a CDC report titled "Update on Vaccine-Derived Poliovirus Outbreaks — Worldwide, January 2023–June 2024." This is perfectly aligned with the user's question about the reasons for cVDPV outbreaks during that specific time frame. The context is highly pertinent and directly contains the necessary information.

**2. Groundedness/Faithfulness (Score 1-5):**
- **Score:** 5
- **Justification:** The generated answer is entirely faithful to the source documents. Each point is directly supported by the text:
    - "Delayed implementation of outbreak response campaigns and low-quality campaigns" is stated verbatim in the context.
    - "barriers to reaching children to increase population immunity" is a direct synthesis of the phrase "overcoming barriers to reaching child

##### Evaluation: Wrong Answer

In [None]:
# We use the same setup as before, but with a manually written unsupported answer.

# --- Inputs for the judge ---
# Reuse the question and context from the Q6 RAG query, which the answer below targets.
question = result6["query"]
context_docs = "\n\n".join([doc.page_content for doc in result6["source_documents"]])

# Here is our manually crafted unsupported answer
unsupported_answer = "The primary reasons for the persistence of cVDPV outbreaks are low immunization coverage in certain areas and interruptions to vaccination campaigns. The most prevalent type is cVDPV2. A new, more resilient strain, cVDPV3, also emerged in late 2024 in South America, causing significant concern."

# --- Format the prompt for the judge ---
judge_input = multi_criteria_judge_prompt.format(
    question=question,
    context=context_docs,
    answer=unsupported_answer
)

# --- Get the evaluation ---
print("\n--- ⚖️ Submitting MODIFIED Answer to LLM Judge ⚖️ ---")
evaluation = judge_llm.invoke(judge_input)

# --- Print the judge's verdict ---
print("\n--- Judge's Evaluation of Unsupported Answer ---")
print(evaluation)


--- ⚖️ Submitting MODIFIED Answer to LLM Judge ⚖️ ---

--- Judge's Evaluation of Unsupported Answer ---
**Evaluation Output:**

**1. Context Relevance (Score 1-5):**
- **Score:** 1
- **Justification:** The provided context is a global update on vaccine-derived poliovirus outbreaks. The user's question is highly specific, asking about the GPEI's strategy and challenges related to polio in Pakistan. The source documents do not mention Pakistan or the GPEI's specific strategy, making them almost entirely irrelevant for answering the user's query.

**2. Groundedness/Faithfulness (Score 1-5):**
- **Score:** 1
- **Justification:** The generated answer contains a significant hallucination. The statement, "A new, more resilient strain, cVDPV3, also emerged in late 2024 in South America, causing significant concern," is completely fabricated and not supported by the source documents, which only discuss cVDPV1 and cVDPV2 up to June 2024.

**3. Answer Coherence and Fluency (Score 1-5):**
- **Sco

### END

**What We Learned**

A simple "supported" or "not supported" verdict is good, but a truly robust evaluation requires a more nuanced approach. To thoroughly assess the RAG system's performance, we used an LLM Judge to score the output against four distinct criteria. This gave us a much clearer picture of the system's strengths and weaknesses.

* **Context Relevance:** This checks if the documents retrieved from our vector database are actually relevant to the user's question. **A low score here indicates a problem with our retriever (the vector search).**

* **Groundedness/Faithfulness:** This is the core of hallucination detection. It verifies that the generated answer is strictly based on the information present in the retrieved context and does not invent facts. **A low score means the LLM is hallucinating.**

* **Answer Coherence and Fluency:** This assesses the quality of the generated text itself. Is the answer well-written, grammatically correct, and easy for a human to understand? **An answer can be factual but poorly structured.**

* **Completeness:** This checks if the answer addresses all parts of the user's question. For example, if the question asks for "reasons and the most prevalent type," **a complete answer must address both points.**

By instructing our LLM Judge to provide a score and reasoning for each criterion, we can pinpoint exactly where our RAG system excels or fails, allowing for more targeted improvements.
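
To act on the scores programmatically, the judge's text can be parsed into numbers. The sketch below assumes the judge keeps the Markdown layout seen in the evaluations above (a bold numbered criterion line followed by a `- **Score:** N` line); the `parse_scores` helper, the abbreviated sample text, and the threshold of 2 are assumptions made for illustration, and a judge that formats its reply differently would need a different pattern.

```python
import re

def parse_scores(evaluation: str) -> dict[str, int]:
    """Extract {criterion: score} pairs from the judge's Markdown output."""
    pattern = re.compile(
        r"\*\*\d+\.\s*(?P<name>[^(*]+?)\s*\(Score 1-5\):\*\*\s*"
        r"-\s*\*\*Score:\*\*\s*(?P<score>[1-5])"
    )
    return {m.group("name"): int(m.group("score")) for m in pattern.finditer(evaluation)}

# Abbreviated sample in the same layout as the judge output shown earlier.
sample_evaluation = """
**1. Context Relevance (Score 1-5):**
- **Score:** 1
- **Justification:** ...

**2. Groundedness/Faithfulness (Score 1-5):**
- **Score:** 5
- **Justification:** ...
"""

scores = parse_scores(sample_evaluation)
low_scores = [name for name, score in scores.items() if score <= 2]
print(scores)
print("Needs attention:", low_scores)
```

In this sample, the low Context Relevance score points at the retriever rather than the generator, which is the kind of targeted signal described above.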