__RAG Quality Benchmarking with RAGAs__

The goal of this project is to evaluate a RAG pipeline using an automated, LLM-as-a-judge. Therefore, determining the faithfulness and answer relevancy of the generated answers. 

__RAGAs (Retrieval- Augmented Generation Assessment System)__
- Framework designed to evaluate the responses of a RAG system
- Acts as a judge to determine the quality of the answers

__Key Areas of Performance__
1. Faithfulness (Is the answer true to the source/private database)
2. Answer Relevancy (Does the answer fit the question?) 
3. Context Retrieval Quality (Did it find the right information?)




The next cell imports the necessary libraries needed.

In [51]:
# Import necessary libraries
import pandas as pd
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
import os
pd.set_option('display.max_colwidth', None)

The next cell sets up the evaluator, which will judge the responses generated by a RAG system. In this example we are using the OpenAI model **GPT-3.5 Turbo** with a temperature of zero. 

**Temperature** controls the models creativity. A value of 0 will introduce less randomness to the model's output. This is desired since the evaluator should be provide consistent scoring. 

Note, the connection to this model private and therefore running this notebook will not work unless you set up your own connection. 

In [42]:
# Define the models RAGAs will use as the 'Judge'
# RAGAs uses an LLM to judge the quality of other LLM outputs.
evaluator_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
evaluator_embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

This notebook is focused on evaluating RAG outputs only. Therefore, we will not implement a RAG system from scratch but evaluate mock data representing RAG system.


The mock data below represents sample **questions, answers, contexts, and ground truths** gathered from a RAG system. 

The mock data includes three sets of records, two of which are good and one is bad. The goal of this project is to determine if our RAG evaluator, an LLM, is able to spot the bad record. Below is an example of a "Good" record which represent a well generated response to a given question as well as a bad example. 



**Good**

1. **Question:** What is the Capital of France.


2. **Answer:** The Capital of France is Paris, famous for the Eiffel Tower. 


3. **Contexts:** The main city of France, Paris, has many landmarks. Paris is a major European center. 


4. **Ground Truth:** Paris is the Capital of France.


**Bad** 

1. **Question:** When was the World Wide Web invented.


2. **Answer:** <span style="color: red;">The context does not specify the exact invention date of the World Wide Web.</span>   


3. **Contexts:** Tim Berners-Lee created a distributed information system at CERN in 1989.


4. **Ground Truth:** The World Wide Web was invented in 1989 by Tim Berners-Lee at CERN.

&nbsp;
&nbsp;

The "Bad" example represents a failure by the RAG to provide a correct answer, even though it had the necessary information to answer correctly. The __Contexts__ provided to the RAG system include "Tim Berners-Lee created a distributed information system at CERN in 1989.", therefore the RAG system should have responsed that the World Wide Web was created in 1989. 


In [43]:
# This data simulates the output of your RAG system after running a few test questions.
# The 'ground_truths' are the human-written, correct answers (required for some metrics).
mock_data = {
    'question': [
        "What is the capital of France?",
        "How much does a light-year measure in kilometers?",
        "When was the World Wide Web invented?",
    ],
    'answer': [
        "The capital of France is Paris, famous for the Eiffel Tower.",
        "The light-year measures about 9.461 trillion kilometers.",
        "The context does not specify the exact invention date of the World Wide Web.",
    ],
    # 'contexts' are the specific chunks the RAG system retrieved.
    'contexts': [
        ["The main city of France, Paris, has many landmarks. Paris is a major European center."],
        ["A light-year is the distance light travels in one year, which is 9.461e12 kilometers."],
        ["Tim Berners-Lee created a distributed information system at CERN in 1989."], # Intentionally missing an explicit date
    ],
    # 'ground_truths' are the perfect, human-verified answers (important for recall-based metrics)
    'ground_truths': [
        ["Paris is the capital of France."],
        ["A light-year is approximately $9.461 \times 10^{12}$ kilometers (9.461 trillion)."],
        ["The World Wide Web was invented in 1989 by Tim Berners-Lee at CERN."],
    ]
}

# Convert the dictionary into a RAGAs-compatible Dataset
eval_dataset = Dataset.from_dict(mock_data)

The following cell provides the evaluator with the mock data and specifies faithfulness and answer relevancy as the metrics for evaluation.

In [44]:
# --- Define the metrics to use ---
metrics_to_evaluate = [
    faithfulness,
    answer_relevancy,
]

# --- Run the Evaluation ---
print("Starting RAGAs evaluation...")
results = evaluate(
    dataset=eval_dataset,
    metrics=metrics_to_evaluate,
    llm=evaluator_llm,
    embeddings=evaluator_embeddings
)

# Convert results to a pandas DataFrame for clear viewing
results_df = results.to_pandas()

Starting RAGAs evaluation...


Evaluating: 100%|██████████| 6/6 [00:09<00:00,  1.58s/it]


The following cell prints the results of the evaluator.

In [None]:

results_df.head()

Unnamed: 0,user_input,retrieved_contexts,response,faithfulness,answer_relevancy
0,What is the capital of France?,"[The main city of France, Paris, has many landmarks. Paris is a major European center.]","The capital of France is Paris, famous for the Eiffel Tower.",0.0,1.0
1,How much does a light-year measure in kilometers?,"[A light-year is the distance light travels in one year, which is 9.461e12 kilometers.]",The light-year measures about 9.461 trillion kilometers.,1.0,0.939068
2,When was the World Wide Web invented?,[Tim Berners-Lee created a distributed information system at CERN in 1989.],The context does not specify the exact invention date of the World Wide Web.,1.0,0.0


__Faithfulness__ 
- Measures if the generated answer is supported by the retrieved contexts. 
- 1 = perfect score. The answer is supported in the retrieved contexts.
- 0 = worst score. The answer is a hallucination.


__Answer Relevancy__ 
- Measures if the answer directly addresses the question. 
- 1 = perfect score. The answer addresses the question fully. 
- 0 = worst score. The answer is incomplete or does not answer the question at hand. 


Based on this understanding of Faithfulness and Answer Relevancy. 

__Question 1: What is the Capital of France ?__
- **Retrieved Contexts:** "The __main__ city of France, Paris"
- **Response:** "The __capital__ of France is Paris" 
- **Faithfulness:** The answer is not supported by the retrieve contexts due to the difference in __main__ and __capital__. The retrieved contexts don't specify that Paris is the capital of France. It only mentions that Paris is a main City. 
- **Answer Relevancy:** The answer completely addresses the questions by stating that Paris is the capital of France. 


__Question 2: How much does a light-year measure in Kilometers ?__
- **Retrieved Contexts:** "A light-year is the distance light travels in one year, which is 9.461e12 kilometers"
- **Response:** "The light-year measures about 9.461 trillion kilometers." 
- **Faithfulness:** The answer is supported by the retrieved contexts. It specifically mentions the 9.461 trillion kilometers seen in the retrieved contexts.
- **Answer Relevancy:** The answer addresses the questions by stating the exact distance. The score is not 1, but relatively high at .939.  


__Question 3: When was the World Wide Web invented ?__
- **Retrieved** Contexts: "Tim Berners-Lee created a distributed information system at CERN in 1989"
- **Response:** __"The context does not specify the exact invention date of the World Wide Web"__ 
- **Faithfulness:** The answer is supported by the retrieved contexts because the question asks specifically for the __World Wide Web__, however the retrieved contexts only mentions the __distributed information system__. 
- **Answer Relevancy:** The answer fails to provide an answer for the question.   


__Conclusion__
- The evaluator succeeds in pointing out specific issues in the answers. However, it fails to understand synonyms within the text. 

- __First Example:__ The difference in the words "main" and "capital" are not the same in the eyes of the evaluator. However, this is useful information and can help us better the context provided we provide to the RAG system. 

- __Third Example:__ The difference in the phrases "World Wide Web" and "Distributed Information System" are not the same in the eyes of the evaluator. However, this is useful information and can help us better the context provided we provide to the RAG system. 





__Additional Considerations__

In the real world, RAGAs are fully trained and automated assurance testers for AI Chatbots. This is crucial for startups and companies due liability and costs of bad answers. A company does not have time to proof read a million questions manually to ensure the RAG system is not hallucinating, therefore they must have an automated solution. Using an LLM as a judge reduces the risk of hallucinations and cost of manually proofreading responses. RAGAs evaluate based on faithfulness, answer relevancy and context precision. This notebook only covered faithfulness and answer relevancy. However, context precision measures if the LLM was able to find and use only the most useful documents in order to answer the question at hand. If context precision is poor, companies spend more resources and time in order to generate a correct response. RAGAs are also used in safeguarding users from new models. A new model will undergo RAGAs tests and if faithfulness, answer relevancy or context precision drop, users will not be exposed to this model as it is unsafe to use.  


