# Evaluating RAGs through Vijil Evaluate

Retrieval Augmented Generation (RAG) is a popular framework for building generative AI applications in which users submit queries through a chat interface and receive answers grounded in a specific knowledge base, typically composed of chunked documents.

A RAG generates an answer in two stages:
1. **Retrieval**: a vector search is performed over the knowledge base, and the top-k document chunks closest to the input query by distance in the embedding space are retrieved.
2. **Generation**: the retrieved contexts and the original question are supplied to a Large Language Model (LLM), which generates the final answer for the end user.
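The two stages above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 2-dimensional embeddings are made up, cosine distance is one common choice of distance, and the helper names (`retrieve_top_k`, `build_prompt`) and the prompt layout are hypothetical, not part of Vijil Evaluate.

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, chunk_vecs, chunks, k=2):
    """Retrieval stage: return the k chunks closest to the query embedding."""
    ranked = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_distance(query_vec, pair[1]),
    )
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, contexts):
    """Generation stage input: combine retrieved contexts with the question."""
    context_block = "\n\n".join(contexts)
    return f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"

# Toy knowledge base with made-up 2-d embeddings (assumption for illustration).
chunks = [
    "Revenue grew 10% in 2022.",
    "The CEO joined in 2015.",
    "Net income was $5M in 2022.",
]
chunk_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
query_vec = [1.0, 0.0]

top_chunks = retrieve_top_k(query_vec, chunk_vecs, chunks, k=2)
prompt = build_prompt("How much did revenue grow in 2022?", top_chunks)
```

In a real pipeline the embeddings would come from an embedding model and `prompt` would be sent to the LLM; Vijil Evaluate scores the answers that the LLM produces from such inputs.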

Vijil Evaluate enables you to evaluate LLMs for RAG capabilities. Given a set of questions, the list of contexts that a vector search over the knowledge base would return for each question, and the ground truth (or 'golden') answers to the questions, Vijil Evaluate uses a number of metrics to evaluate the quality of the answers generated by the LLM component, as well as the likelihood that a generated answer is a hallucination.

Vijil Evaluate currently supports four metrics to evaluate the generation stage in a RAG pipeline. In this notebook, we show you how to compute these metrics with Vijil Evaluate.

## Correctness Metrics

To measure the correctness of the LLM-generated answers, we use the following traditional NLP metrics:

- BLEU
- METEOR
- BERTScore

Each metric measures the similarity between an LLM-generated answer and the ground truth ('golden') answer and returns a score between 0 and 1. A higher score indicates greater similarity to the golden answer.
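To give a feel for how an overlap-based score behaves, here is a minimal sketch of clipped unigram precision, one of the building blocks of BLEU. It is not BLEU itself (which combines clipped n-gram precisions for several n with a brevity penalty), and it is not how Vijil Evaluate computes its scores; the function name and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference.

    Counts are clipped so that a repeated candidate token is credited
    at most as many times as it occurs in the reference.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return clipped / len(cand_tokens)

golden = "revenue grew 10% in 2022"
exact = unigram_precision("revenue grew 10% in 2022", golden)      # identical answer
partial = unigram_precision("revenue fell 10% in 2021", golden)    # 3 of 5 tokens match
unrelated = unigram_precision("the ceo joined recently", golden)   # no tokens match
```

Note that `partial` still scores 0.6 even though the answer is factually wrong, which is why surface-overlap metrics are usually read alongside a semantic metric such as BERTScore.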


## Hallucination Metrics

We use the [HHEM](https://huggingface.co/vectara/hallucination_evaluation_model) hallucination evaluation classifier to estimate the likelihood that a generated response is hallucinated. We supply the generated response and the concatenated contexts to the model and take, as the final score, the output probability that the two input strings are consistent with each other. HHEM scores range from 0 to 1; a higher score means that the response is more faithful to the context (has fewer hallucinations).
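The scoring flow described above (concatenate the contexts, pair them with the response, read off a consistency score) can be sketched as follows. The consistency function here is a toy token-overlap placeholder standing in for the HHEM classifier, and the names `faithfulness_score` and `toy_consistency`, as well as the single-space separator used for concatenation, are assumptions for illustration only.

```python
from typing import Callable, List

def faithfulness_score(
    response: str,
    contexts: List[str],
    consistency_fn: Callable[[str, str], float],
) -> float:
    """Score a response against its retrieved contexts.

    The contexts are concatenated into one string (separator is an
    assumption) and passed with the response to a consistency scorer
    that returns a value between 0 and 1.
    """
    premise = " ".join(contexts)
    return consistency_fn(premise, response)

def toy_consistency(premise: str, hypothesis: str) -> float:
    """Placeholder scorer (NOT HHEM): share of response tokens found in the contexts."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    hits = sum(tok in premise_tokens for tok in hypothesis_tokens)
    return hits / len(hypothesis_tokens)

contexts = ["Revenue grew 10% in 2022.", "Net income was $5M in 2022."]
supported = faithfulness_score("Revenue grew 10% in 2022.", contexts, toy_consistency)
unsupported = faithfulness_score("Revenue doubled in 2019.", contexts, toy_consistency)
```

In practice, `consistency_fn` would wrap a call to the actual classifier; the point of the sketch is only the shape of the inputs and the direction of the score (higher means more faithful).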

## Evaluating Domain-specific Question Answering

In the example below, we use the [financebench](https://huggingface.co/datasets/PatronusAI/financebench) benchmark dataset to evaluate how reliably `gpt-4o-mini` answers questions in the financial domain.

We have already loaded the benchmark as an evaluation harness in Vijil Evaluate. Now we simply create an evaluation of the given LLM on this harness.

In [4]:
# !pip install vijil

# import and instantiate the client
from vijil import Vijil
client = Vijil()

# create the evaluation
client.evaluations.create(
    model_hub="openai",
    model_name="gpt-4o-mini",
    model_params={"temperature": 0},
    harnesses=["financebench"],
)

{'id': '395f15be-b0f0-4060-957b-1c325eaa9f89', 'status': 'CREATED'}

You can use the `get_status` method to keep track of the progress of the evaluation.

In [52]:
client.evaluations.get_status('395f15be-b0f0-4060-957b-1c325eaa9f89')

{'id': '395f15be-b0f0-4060-957b-1c325eaa9f89',
 'status': 'IN_PROGRESS',
 'total_test_count': 600,
 'completed_test_count': 600,
 'error_test_count': 0,
 'total_response_count': 600,
 'completed_response_count': 300,
 'error_response_count': 0,
 'total_generation_time': '28.000000',
 'average_generation_time': '2.9250000000000000',
 'score': None,
 'hub': 'openai',
 'model': 'gpt-4o',
 'url': '',
 'created_at': 1726530926,
 'created_by': '7ad3420b-2c22-4f07-a8f4-ab6c334c1421',
 'completed_at': None,
 'team_id': 'jGMHAXYmQ3RO59ebD0Bruupx8ClqhGTC@clients'}
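If you would rather wait for the evaluation in a script than check it by hand, a simple polling loop is one option. The sketch below assumes that any status other than `CREATED` and `IN_PROGRESS` (the two values shown above) means the evaluation has stopped running; the helper `wait_for_evaluation` and the `COMPLETED` status used in the stand-in are hypothetical and not taken from the Vijil client.

```python
import time

# Statuses seen in this notebook that indicate the evaluation is still running.
PENDING_STATUSES = {"CREATED", "IN_PROGRESS"}

def wait_for_evaluation(get_status, evaluation_id, poll_seconds=30.0, max_polls=120):
    """Poll get_status(evaluation_id) until the status is no longer pending.

    get_status is any callable returning a dict with a 'status' key,
    for example client.evaluations.get_status.
    """
    for _ in range(max_polls):
        status = get_status(evaluation_id)
        if status["status"] not in PENDING_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"evaluation {evaluation_id} still pending after {max_polls} polls")

# Stand-in for the real client so the sketch runs offline.
# 'COMPLETED' is a hypothetical terminal status name.
_fake_responses = iter([
    {"status": "CREATED"},
    {"status": "IN_PROGRESS"},
    {"status": "COMPLETED"},
])
final = wait_for_evaluation(lambda _id: next(_fake_responses), "example-id", poll_seconds=0)
```

With the real client, the call would look like `wait_for_evaluation(client.evaluations.get_status, '395f15be-b0f0-4060-957b-1c325eaa9f89')`.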

After the evaluation finishes, you can use the following code to obtain the four metrics.

In [6]:
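import pandas as pd

# fetch the evaluation summary and keep only the probe-level rows
df = client.evaluations.summarize('395f15be-b0f0-4060-957b-1c325eaa9f89')
df = df[df.level == "probe"]

# strip the harness prefix from the metric names and rescale scores with (100 - s) / 100
pd.DataFrame({
    "metric": df.level_name.apply(lambda s: s.replace("FinanceBench, metric ", "")),
    "score": df.score.apply(lambda s: (100 - s) / 100)
})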

|   | metric | score |
|---|--------|-------|
| 2 | BLEU | 0.0445 |
| 3 | HHEM | 0.1242 |
| 4 | METEOR | 0.3284 |
| 5 | BERTScore | 0.5008 |


Although the BLEU and METEOR scores are very low, BERTScore indicates a moderate degree (about 50%) of semantic overlap between the generated responses and the golden answers. The average HHEM score (0.12) is much closer to 0 than to 1, which suggests that the generated responses may contain significant hallucinations.



If you are developing your own RAG system and have your own dataset of prompts, contexts, and desired responses, you can use Vijil Evaluate to evaluate your system on that dataset in the same way. Please reach out to contact@vijil.ai to learn more.