# RAG Evaluation Using RAGAS

## Introduction to Evaluation
Evaluating a Retrieval-Augmented Generation (RAG) system means assessing both the retrieval component (how well relevant documents are fetched from a knowledge base) and the generation component (how accurate, relevant, and coherent the generated responses are). A RAG system combines a retriever (e.g., one backed by a vector store like FAISS) with a language model (e.g., AzureChatOpenAI) to provide contextually informed responses, which reduces issues such as hallucinations (incorrect or fabricated information).

## Why Do We Use Evaluation?
Evaluation is critical for the following reasons:

- Quality Assurance: Ensures the RAG system delivers accurate, relevant, and trustworthy responses.
- System Improvement: Identifies weaknesses in retrieval (e.g., irrelevant documents) or generation (e.g., unfaithful answers), guiding optimizations like better embeddings or prompt engineering.
- Performance Monitoring: Quantifies system performance to track improvements or regressions over time.
- Stakeholder Confidence: Provides metrics to demonstrate the system's reliability to stakeholders or end-users.

The RAGAS framework (Retrieval Augmented Generation Assessment) is used to evaluate RAG systems. It provides metrics such as:
- Faithfulness: Measures if the generated answer is factually grounded in the retrieved context.
- Answer Relevancy: Assesses if the answer directly addresses the user's query.
- Context Precision: Checks if the retrieved context contains relevant information with minimal noise.
- Context Recall: Ensures all necessary information is retrieved (requires ground truth).
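
To make the metric definitions above concrete, here is a minimal sketch of three of them as plain ratios. It assumes the per-claim and per-chunk judgments are already available as booleans; in RAGAS those judgments come from an LLM, so the hand-labelled lists and the function names below are purely illustrative. Answer relevancy is left out because it depends on embeddings.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported)


def context_recall_score(reference_claim_found):
    """Fraction of ground-truth claims that can be found in the retrieved context."""
    return sum(reference_claim_found) / len(reference_claim_found)


def context_precision_score(chunk_relevant):
    """Average precision over ranked chunks: relevant chunks ranked early score higher."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


# Hypothetical hand-labelled judgments for one question (True = supported / relevant)
print(faithfulness_score([True, True, False, True]))
print(context_recall_score([True, False]))
print(context_precision_score([True, False, True]))
```

A low faithfulness score with high context recall would point at the generation step, while low context recall points at retrieval.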

This notebook sets up a RAG system using AzureChatOpenAI, AzureOpenAIEmbeddings, and FAISS, generates a synthetic test dataset, and evaluates the system using RAGAS.

The setup cells below install the required packages and load environment variables (e.g., API keys) from a .env file for secure configuration.

In [1]:
!python -m pip install pymupdf faiss-cpu ragas --quiet

In [9]:
!python -m pip install rapidfuzz --quiet

In [2]:
import os
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import AzureOpenAIEmbeddings
from langchain_openai import AzureChatOpenAI
from dotenv import load_dotenv

load_dotenv()

MODEL_NAME = "gpt4o"
EMBEDDING_MODEL_NAME = "text-embedding-3-small"
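
Because the Azure clients are configured entirely through the environment, a quick check after `load_dotenv()` can catch a missing key before the first API call fails. The sketch below is an optional addition, not part of the original notebook; the variable names are the ones the `langchain_openai` Azure classes read by default as far as I know, so adjust them if your .env uses different keys.

```python
import os

# Assumed variable names (defaults read by langchain_openai's Azure classes);
# change these if your .env file uses different keys.
REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "OPENAI_API_VERSION")


def missing_env_vars(env=os.environ, required=REQUIRED_VARS):
    """Return the required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```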


Initializes AzureChatOpenAI for response generation and AzureOpenAIEmbeddings for creating document embeddings.

In [3]:
llm = AzureChatOpenAI(azure_deployment=MODEL_NAME)

embeddings = AzureOpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)

dir_path = r"datasets/supply_chain"
index_path = r"VectorDB_Chroma/faiss"

## Document Loading
This section loads PDF documents from a directory and splits them into chunks.

In [16]:
def load_documents():
    """
    Load PDF documents from the specified directory using PyMuPDFLoader.
    
    Returns:
        list: A list of loaded documents.
    
    Raises:
        FileNotFoundError: If the directory does not exist.
        Exception: For other loading errors.
    """
    if not os.path.exists(dir_path):
        raise FileNotFoundError(f"Directory not found: {dir_path}")
    # Loader errors propagate unchanged to the caller
    loader = DirectoryLoader(dir_path, loader_cls=PyMuPDFLoader)
    return loader.load()

def split_documents(documents):
    """
    Split documents into smaller chunks using RecursiveCharacterTextSplitter.
    
    Args:
        documents (list): List of documents to split.
    
    Returns:
        list: A list of document chunks. Returns empty list if no documents.
    """
    try:
        if not documents:
            return []
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
        return text_splitter.split_documents(documents)
    except Exception as e:
        print(f"Error splitting documents: {str(e)}")
        return []

# Load and split documents
documents = load_documents()
documents = split_documents(documents)
print(f"Loaded and split {len(documents)} document chunks.")

Loaded and split 113 document chunks.
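
To illustrate what `chunk_size` and `chunk_overlap` control, the sketch below splits a string into fixed-size character windows that share a few characters with their neighbours. It is a simplified stand-in: RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and sentences, which this hypothetical `sliding_window_chunks` helper does not do.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Toy values for readability; the notebook uses chunk_size=4000, chunk_overlap=200
chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary is still fully visible in at least one chunk as long as it is shorter than the overlap.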


## Vector Store Creation

This section creates a FAISS vector store from the document chunks, or loads a previously saved one, and exposes it as a retriever.

In [17]:
def create_vectorstore(documents):
    """
    Create and save a new FAISS vector store from documents.
    
    Args:
        documents (list): List of document objects to convert to vectors.
    
    Returns:
        None

    Raises:
        Exception: If embedding the documents or saving the index fails.
    """
    os.makedirs(index_path, exist_ok=True)
    # Let errors propagate instead of returning them, so failures are not silently ignored
    vectorstore = FAISS.from_documents(documents, embeddings)
    print("Vector Store created Successfully")
    save_vectorstore(vectorstore)

def save_vectorstore(vectorstore):
    """
    Save the FAISS vector store to the specified path.
    
    Args:
        vectorstore (FAISS): The vector store to save.
    
    Returns:
        None

    Raises:
        Exception: If the index cannot be written to disk.
    """
    vectorstore.save_local(index_path)
    print("vector Store saved successfully")

def load_vectorstore():
    """
    Load an existing FAISS vector store.
    
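    Returns:
        FAISS: The loaded vector store.

    Raises:
        Exception: If the index cannot be read from disk.
    """
    print("loading vector Store...")
    # The saved index includes pickled data, hence allow_dangerous_deserialization;
    # only load index files you created yourself
    vs = FAISS.load_local(index_path, embeddings=embeddings, allow_dangerous_deserialization=True)
    print("loaded successfully")
    return vs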

# Load or create vector store
if os.path.exists(index_path) and any(os.listdir(index_path)):
    vectorstore = load_vectorstore()
    vectorstore_retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
    print("Vector store loaded successfully.")
else:
    create_vectorstore(documents)
    vectorstore = load_vectorstore()
    print(vectorstore)
    vectorstore_retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
    print("Created and loaded new vector store.")

loading vector Store...
loaded successfully
Vector store loaded successfully.


### Explanation:

- Document Loading: Uses DirectoryLoader with PyMuPDFLoader to load PDFs from the datasets/supply_chain directory.
- Document Splitting: Splits documents into chunks (4000 characters, 200 overlap) for efficient retrieval.
- Vector Store: Creates a FAISS index from document embeddings or loads an existing one from the index directory.
- Retriever: Configures the vector store as a retriever, fetching the top 5 relevant documents for a query.
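
The retriever step above amounts to a nearest-neighbour search over embedding vectors. Here is a brute-force sketch of that idea using Euclidean distance, which to my understanding is the default metric of the LangChain FAISS wrapper. The 2-D vectors and the `top_k_by_l2` helper are hypothetical; FAISS performs the same kind of search far more efficiently on high-dimensional embeddings.

```python
import math


def top_k_by_l2(query_vec, doc_vecs, k):
    """Return indices of the k document vectors closest to the query (Euclidean distance)."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Hypothetical 2-D "embeddings"; real embeddings have hundreds or thousands of dimensions
doc_vecs = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1), (-1.0, 0.0)]
print(top_k_by_l2((1.0, 0.0), doc_vecs, k=2))
```

With `k=5`, as configured in the notebook, every query returns five chunks even when fewer are truly relevant, which is one reason context precision is worth measuring.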

## RAG Chain Setup
This section defines a RAG chain that validates queries, retrieves relevant documents, and generates answers using the AzureChatOpenAI model.

In [18]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

def validate_query(query):
    """
    Validates a user's query by ensuring it is not empty and has at least 15 characters.
    
    Args:
        query (str): The input query.
    
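    Returns:
        str: The query if valid.

    Raises:
        ValueError: If the query is empty or shorter than 15 characters.
    """
    # Raising (rather than returning the message) stops an error string
    # from being sent to the model as if it were the user's question
    if not query:
        raise ValueError("Query cannot be empty, enter a valid query.")
    if len(query) < 15:
        raise ValueError("Query is too short, enter a valid query.")
    return query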

def create_rag_chain(query, relevant_documents):
    """
    Creates and executes a RAG chain to answer a query using retrieved documents.
    
    Args:
        query (str): The user query.
        relevant_documents (list): List of retrieved document chunks.
    
    Returns:
        str: The generated response or an error message.
    """
    try:
        prompt_template = """
        Only based on the provided documents, answer the question in points. Do not mention from which document the answer is derived.
        Question: {query}
        Documents: {docs}
        Note: You are a supply chain assistant. If the query is not related to supply chain or the documents do not provide the necessary information, return "Invalid Query".
        """
        prompt = ChatPromptTemplate.from_template(prompt_template)
        valid_query = validate_query(query)
        rag_chain = prompt | llm | StrOutputParser()
        return rag_chain.invoke({"query": valid_query, "docs": relevant_documents})
    except Exception as e:
        return str(e)

# Test the RAG chain
query = "What is Supply Chain?"
relevant_documents = vectorstore_retriever.invoke(query)
response = create_rag_chain(query, relevant_documents)
print("RAG Chain Response:")
print(response)

RAG Chain Response:
- A supply chain is a network of partners that collectively convert a basic commodity into a finished product valued by end-customers.
- It involves managing the flow of materials and information from raw material production to the end-user and includes processes such as purchasing, manufacturing, and distribution.
- Each partner in the supply chain is responsible for a process that adds value to the product, transforming inputs into outputs.
- Supply chain management encompasses planning and controlling all business processes that link partners in the supply chain to meet the needs of the end-customer.
- The supply chain can be viewed as a system where all processes interact, and disruptions in one part can affect the entire network.
- Logistics is a key component of supply chain management, focusing on coordinating material and information flows across the supply chain.
- The supply chain is often described in terms of upstream (buy side) and downstream (sell side

### Explanation:

- Query Validation: Ensures the query is non-empty and at least 15 characters long.
- RAG Chain: Constructs a prompt that instructs the model to answer in bullet points, using only the retrieved documents, and to return "Invalid Query" if the query is unrelated to supply chain or unsupported by the documents.
- Execution: Combines the prompt, AzureChatOpenAI model, and string output parser to generate a response.
- Test: Runs a sample query to verify the RAG chain's functionality.
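
One detail of the execution step: the chain receives the list of Document objects as-is, so their string representation, metadata included, is what ends up in the `{docs}` slot of the prompt. A possible refinement, sketched below under that assumption, is to join only the chunk texts first. `FakeDocument` and `format_docs` are hypothetical names; the stand-in class only mimics the `page_content` attribute of a LangChain Document so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the same page_content attribute as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, separator="\n\n"):
    """Join chunk texts only, dropping metadata, for insertion into the prompt."""
    return separator.join(doc.page_content.strip() for doc in docs)


docs = [
    FakeDocument("A supply chain links partners. ", {"page": 0}),
    FakeDocument("Logistics coordinates material flow.", {"page": 3}),
]
print(format_docs(docs))
```

Passing `format_docs(relevant_documents)` as the `docs` value would keep file paths and PDF metadata out of the prompt, at the cost of changing the exact responses shown in this notebook.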

## Generating Synthetic Test Data with RAGAS
To evaluate the RAG system, we need a test dataset with questions, answers, contexts, and ground truth. RAGAS's TestsetGenerator can create synthetic data from documents.

In [19]:
documents

[Document(metadata={'producer': 'GPL Ghostscript 10.00.0', 'creator': '', 'creationdate': "D:20250417060039Z00'00'", 'source': 'datasets/supply_chain/chapter_1.pdf', 'file_path': 'datasets/supply_chain/chapter_1.pdf', 'total_pages': 30, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': "D:20250417060039Z00'00'", 'trapped': '', 'modDate': "D:20250417060039Z00'00'", 'creationDate': "D:20250417060039Z00'00'", 'page': 0}, page_content='CHAPTER 1\nLogistics and the supply chain\nIntroduction\nA car takes only 20 hours or so to assemble, and a couple more days are needed\nto ship it to the customer via the dealers. So why does it take more than a month\nfor a manufacturer to make and deliver the car I want? And why are the products\nI want to buy so often unavailable on the shelf at the local supermarket? These\nare questions that go to the heart of logistics management and strategy. Supply\nchains today are slow and costly compared with what they will

In [22]:
import random
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Wrap AzureChatOpenAI for RAGAS compatibility
evaluator_llm = LangchainLLMWrapper(llm)
evaluator_embeddings = LangchainEmbeddingsWrapper(embeddings)
# Configure the test set generator
testset_generator = TestsetGenerator(
    llm=evaluator_llm,
    embedding_model=evaluator_embeddings
)

# Randomly sample a subset of documents (e.g., 25 of the 113 chunks loaded above)
sample_size = 25  # Adjust based on your needs
random.seed(42)  # For reproducibility
sampled_documents = random.sample(documents, min(sample_size, len(documents)))

# Generate a small test dataset (10 requested samples)
testset = testset_generator.generate_with_langchain_docs(sampled_documents, testset_size=10)


Applying HeadlinesExtractor: 100%|██████████| 12/12 [00:19<00:00,  1.64s/it]
Applying HeadlineSplitter: 100%|██████████| 15/15 [00:00<00:00, 11472.39it/s]
Applying SummaryExtractor:   4%|▍         | 1/23 [00:03<01:17,  3.51s/it]Property 'summary' already exists in node '9c3f60'. Skipping!
Property 'summary' already exists in node '315506'. Skipping!
Property 'summary' already exists in node '1e0e02'. Skipping!
Property 'summary' already exists in node 'd91129'. Skipping!
Applying SummaryExtractor:   9%|▊         | 2/23 [00:26<05:09, 14.73s/it]Property 'summary' already exists in node '2b1d1b'. Skipping!
Applying SummaryExtractor:  74%|███████▍  | 17/23 [00:27<00:07,  1.18s/it]Property 'summary' already exists in node '05604f'. Skipping!
Applying SummaryExtractor:  78%|███████▊  | 18/23 [00:31<00:06,  1.37s/it]Property 'summary' already exists in node 'b9fb3b'. Skipping!
Property 'summary' already exists in node 'e653f1'. S

In [26]:
testset.samples

[TestsetSample(eval_sample=SingleTurnSample(user_input="What role did the Supply Chain Director play in BTC's promotional strategy?", retrieved_contexts=None, reference_contexts=['38 Chapter 2 • Putting the end-customer ﬁrst or buy too much with the consequent write-downs in the January sale. The position is further complicated in a national chain where demand patterns will be different store by store and region by region. Many retailers allocate their Christmas merchandise to individual stores on the basis of previous year’s sales for the particular product category and hope for the best. A lean design supply chain is unable to cope with such spiky demand, which will be affected further by marketing efforts and the latest fad. Retailers therefore need to be particu- larly agile in their approach in order to satisfy unknown demand. Boots The Chemists (BTC) – the leading UK health and beauty retailer – has approached this problem by outsourcing speciﬁc Christmas merchandise deliveries. 

In [27]:

# Convert test dataset to evaluation format
eval_data = {
    "question": [],
    "answer": [],
    "contexts": [],
    "ground_truth": []
}

for testcase in testset.samples:
    relevant_docs = vectorstore_retriever.invoke(testcase.eval_sample.user_input)
    answer = create_rag_chain(testcase.eval_sample.user_input, relevant_docs)
    eval_data["question"].append(testcase.eval_sample.user_input)
    eval_data["answer"].append(answer)
    eval_data["contexts"].append([doc.page_content for doc in relevant_docs])
    eval_data["ground_truth"].append(testcase.eval_sample.reference)

print(f"Generated {len(eval_data['question'])} test cases.")

Generated 6 test cases.


In [28]:
eval_data

{'question': ["What role did the Supply Chain Director play in BTC's promotional strategy?",
  'How BTC deal with the problem of unpredictable demand in their supply chain?',
  'Wot is Boots The Chemists doin to handle demand?',
  'What logistics challenges BTC have for promotions and events?',
  'What logistics challenges BTC face when doing promotions and events?',
  'Wht r the logstics chllenges of mntng promtions and evnts at a rtailer such as BTC?'],
 'answer': ["- The Supply Chain Director played a crucial role in transforming BTC's promotional strategy by addressing logistical challenges associated with promotions.\n- A dedicated promotions team was established under the Supply Chain Director's guidance to oversee the overall promotional plan and ensure timely delivery of products and materials.\n- The director facilitated a trial where logistics staff in regional distribution centers handled the preparation for promotions, improving efficiency and reducing reliance on luck.\n- 

### Explanation:

- TestsetGenerator: Uses AzureChatOpenAI (wrapped for RAGAS) to generate test cases, with AzureOpenAIEmbeddings for document embeddings.
- Test Data Generation: Randomly samples up to 25 document chunks (seeded for reproducibility) and requests 10 test cases from them; this run yielded 6.
- Evaluation Dataset: For each test case, retrieves relevant documents, generates an answer using the RAG chain, and collects the question, answer, contexts, and ground truth.
- Output: Stores the data in a dictionary format suitable for RAGAS evaluation.
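
Since the evaluation data is assembled by hand in a loop, a small structural check before handing it to RAGAS can surface problems early. The sketch below assumes the same four keys used in this notebook; `check_eval_data` is a hypothetical helper, not a RAGAS function.

```python
def check_eval_data(eval_data):
    """Return a list of structural problems; an empty list means the data looks consistent."""
    problems = []
    lengths = {key: len(values) for key, values in eval_data.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"Column lengths differ: {lengths}")
    for i, contexts in enumerate(eval_data.get("contexts", [])):
        if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
            problems.append(f"Row {i}: contexts must be a list of strings")
    for i, truth in enumerate(eval_data.get("ground_truth", [])):
        if not truth:
            problems.append(f"Row {i}: ground_truth is empty")
    return problems


# Hypothetical one-row example using the same keys as the notebook
sample = {
    "question": ["What is a supply chain?"],
    "answer": ["- A network of partners."],
    "contexts": [["A supply chain is a network of partners."]],
    "ground_truth": ["A network of partners that deliver a product."],
}
print(check_eval_data(sample))
```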

## RAG Evaluation with RAGAS
This section evaluates the RAG system using RAGAS metrics: faithfulness, answer relevancy, context precision, and context recall.

In [29]:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Convert evaluation data to Hugging Face Dataset
eval_dataset = Dataset.from_dict(eval_data)

# Wrap AzureChatOpenAI for RAGAS compatibility
evaluator_llm = LangchainLLMWrapper(llm)
evaluator_embeddings = LangchainEmbeddingsWrapper(embeddings)

# Run evaluation
results = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,       # Checks if the answer is grounded in the context
        answer_relevancy,   # Checks if the answer addresses the question
        context_precision,  # Checks if retrieved context is relevant
        context_recall      # Checks if all necessary information is retrieved
    ],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    show_progress=True
)

# Print evaluation results
print("RAGAS Evaluation Results:")
print(results)

Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]Exception raised in Job[5]: IndexError(list index out of range)
Evaluating:   4%|▍         | 1/24 [00:02<00:52,  2.26s/it]Exception raised in Job[9]: IndexError(list index out of range)
Exception raised in Job[13]: IndexError(list index out of range)
Evaluating:  17%|█▋        | 4/24 [00:21<01:55,  5.77s/it]Exception raised in Job[1]: IndexError(list index out of range)
Evaluating:  46%|████▌     | 11/24 [00:39<00:45,  3.53s/it]Exception raised in Job[17]: IndexError(list index out of range)
Evaluating:  50%|█████     | 12/24 [00:44<00:47,  3.95s/it]Exception raised in Job[21]: IndexError(list index out of range)
Evaluating: 100%|██████████| 24/24 [01:32<00:00,  3.84s/it]


RAGAS Evaluation Results:
{'faithfulness': 0.8859, 'answer_relevancy': nan, 'context_precision': 0.9000, 'context_recall': 0.8333}


### Explanation:

- Dataset Conversion: Converts the evaluation data into a Hugging Face Dataset for RAGAS.
- LLM Wrapper: Wraps AzureChatOpenAI with LangchainLLMWrapper for compatibility with RAGAS.
- Metrics: Evaluates the RAG system on:
    - Faithfulness: Ensures answers are factually consistent with the context.
    - Answer Relevancy: Measures how well answers address the query.
    - Context Precision: Assesses the relevance of retrieved documents.
    - Context Recall: Checks if all necessary information is retrieved (uses ground truth).
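- Results: Outputs a score between 0 and 1 for each metric, where higher is better. In this run, six evaluation jobs raised IndexError and answer_relevancy is reported as nan, so that metric should be read as missing rather than as a score.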
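
When a metric comes back as nan, it helps to separate it from the valid scores before computing any summary, because a plain average would itself become nan. The sketch below treats the results as an ordinary dictionary, which is an assumption made for illustration; the object returned by `evaluate` may offer its own accessors. `summarize_scores` is a hypothetical helper.

```python
import math


def summarize_scores(scores):
    """Split metric scores into valid values and metrics that came back as NaN."""
    valid = {name: value for name, value in scores.items() if not math.isnan(value)}
    failed = [name for name, value in scores.items() if math.isnan(value)]
    mean = sum(valid.values()) / len(valid) if valid else float("nan")
    return valid, failed, mean


# Scores copied from the run above
scores = {
    "faithfulness": 0.8859,
    "answer_relevancy": float("nan"),
    "context_precision": 0.9000,
    "context_recall": 0.8333,
}
valid, failed, mean = summarize_scores(scores)
print(f"Metrics without a score: {failed}")
print(f"Mean of valid metrics: {mean:.4f}")
```

Reporting the failed metrics explicitly keeps a partial run from being mistaken for a complete evaluation.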