# Evaluating RAGs through Vijil Evaluate

Retrieval Augmented Generation (RAG) is a popular framework for building generative AI applications, in which a user submits queries through a chat interface and receives answers grounded in a specific knowledge base, typically composed of chunked documents.

A RAG system generates an answer in two stages:
1. **Retrieval**: a vector search is performed over the knowledge base, and the top-k document chunks closest to the input query, by distance in the embedding space, are retrieved.
2. **Generation**: Retrieved contexts and the original question are supplied to a Large Language Model (LLM), which generates the final answer for the end user.
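
To make the two stages concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the closeness measure and uses hand-written toy embeddings; the helper names (`retrieve_top_k`, `build_prompt`) and the prompt wording are illustrative only and are not part of Vijil Evaluate.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, chunks, k=2):
    """Retrieval stage: return the k chunk texts most similar to the query.

    `chunks` is a list of (text, embedding) pairs.
    """
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, chunk[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

def build_prompt(question, contexts):
    """Generation stage input: combine retrieved contexts with the question."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )

# Toy knowledge base with made-up 3-dimensional embeddings
toy_chunks = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("The Seine flows through Paris.", [0.7, 0.3, 0.1]),
    ("Bananas are rich in potassium.", [0.0, 0.1, 0.9]),
]
toy_query_vec = [1.0, 0.0, 0.0]

top_contexts = retrieve_top_k(toy_query_vec, toy_chunks, k=2)
prompt = build_prompt("What is the capital of France?", top_contexts)
```

In a real pipeline the embeddings would come from an embedding model and `prompt` would be sent to the LLM; both steps are omitted here.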

Vijil Evaluate enables you to evaluate LLMs for RAG capabilities. It takes as input a set of questions, the list of contexts that vector search over the knowledge base returns for each question, and the ground truth (or 'golden') answers to those questions. From these, Vijil Evaluate computes a number of metrics that assess the quality of the retrieved contexts, the quality of the answers generated by the LLM component, and the likelihood that a generated answer is a hallucination.

Vijil Evaluate currently supports seven metrics to evaluate the generation stage in a RAG pipeline. In this notebook, we show you how to implement these metrics.

## Retrieval Metrics

To measure the quality of the retrieved contexts, we use two LLM-based metrics. Each produces a score between 0 and 1.

- **Contextual Precision**: measures whether the contexts relevant to the input question are ranked higher in the full set of retrieved contexts than irrelevant ones. A higher score indicates greater alignment in ranking.
- **Contextual Recall**: measures the extent to which the retrieved contexts align with the golden answers. A higher score indicates greater alignment with the golden answer.

Currently, we use `gpt-4o` as the judge LLM in these metrics.
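
The exact formula behind Contextual Precision is not spelled out here. As an illustration of what "relevant contexts ranked higher" can mean numerically, the sketch below uses an average-precision-style score, which is one common formulation and only an assumption about how such a metric may be computed. It takes per-context relevance verdicts (which a judge LLM would supply) in retrieval order; the function name is hypothetical.

```python
def rank_weighted_precision(relevance):
    """Average-precision-style score for a ranked list of relevance verdicts.

    `relevance` is a list of booleans in retrieval order (True = relevant).
    Each relevant item contributes precision@rank; the sum is divided by the
    number of relevant items. Returns 0.0 if nothing is relevant.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            weighted_sum += relevant_seen / rank
    return weighted_sum / relevant_seen if relevant_seen else 0.0

# Same two relevant contexts, ranked first versus ranked last
well_ranked = rank_weighted_precision([True, True, False, False])
poorly_ranked = rank_weighted_precision([False, False, True, True])
```

With the relevant contexts at the top the score is 1.0; with the same contexts at the bottom it drops to 5/12, even though the set of retrieved contexts is identical.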

## Generation Metrics

Our generation metrics are divided into three categories, each measuring a different capability of the LLM in a RAG system.

### Correctness

To measure correctness of the answers generated by a RAG system, we use an LLM-based answer correctness metric that compares the similarity of a generated answer with the ground truth 'golden' answer, and provides a score between 0 and 1. A higher score indicates greater similarity to the golden answer.

Besides LLM-based answer correctness, we also offer traditional NLP metrics---BLEU, ROUGE, and BERTScore---for this purpose.
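
To give a feel for what the traditional overlap-based metrics capture, here is a deliberately simplified unigram-overlap F1 in the spirit of ROUGE-1. It is a sketch under our own assumptions (whitespace tokenization, lowercasing, no stemming) and is not the implementation used by Vijil Evaluate or by any ROUGE library.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """F1 over overlapping unigrams between a candidate and a reference answer."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: each token counts at most as often as in both texts
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

partial_match = unigram_f1("the cat", "the cat sat on the mat")
```

Here every candidate token appears in the reference (precision 1.0), but only two of the six reference tokens are covered (recall 1/3), giving an F1 of 0.5. Unlike the LLM-based metric, this kind of score rewards surface overlap and cannot recognize paraphrases.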

### Relevancy

Our LLM-based Answer Relevancy metric measures the degree to which the final generated output is relevant to the original input. It produces a score between 0 and 1, with a higher score indicating higher relevancy.


### Hallucination

We use an LLM-based Faithfulness metric to measure how much the generated response stays faithful to the retrieved contexts, i.e. the opposite of hallucination. This metric produces scores from 0 to 1, where a higher score means that the response is more faithful to the context (has fewer hallucinations).
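
One way to think about a faithfulness score is as the share of claims in the response that the retrieved contexts support. The sketch below illustrates that idea only; how Vijil Evaluate extracts and judges claims is not described here, so the function names are hypothetical and the toy substring judge merely stands in for an LLM judge.

```python
def supported_fraction(claims, contexts, is_supported):
    """Share of claims that the judge deems supported by the contexts.

    `is_supported(claim, contexts)` is any callable returning True or False.
    Returns None when there are no claims, since the ratio is undefined.
    """
    if not claims:
        return None
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)

def substring_judge(claim, contexts):
    """Toy judge: a claim is supported if it appears verbatim in a context."""
    return any(claim.lower() in context.lower() for context in contexts)

toy_contexts = ["Paris is the capital of France and lies on the Seine."]
toy_claims = [
    "Paris is the capital of France",   # stated in the context
    "Paris has ten million residents",  # not stated anywhere
]
toy_faithfulness = supported_fraction(toy_claims, toy_contexts, substring_judge)
```

One of the two claims is supported, so the toy score is 0.5; the unsupported claim is what a faithfulness metric would flag as a likely hallucination.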


## RAG Metrics on Your Own Inputs and Outputs

In the example below, we use a placeholder dataset and evaluate the reliability of the answers it contains. To do so, we load the dataset and preprocess it into a list of dicts, the format Vijil Evaluate expects as input.

In [None]:
import pandas as pd
import ast

qk = pd.read_csv("dataset.csv") # replace with your own dataset

# preprocess, assume original columns are named as question, bot_answer, ground_truth, actual_context
qk1 = [
    {
        "question": row["question"],
        "response": row["bot_answer"],
        "ground_truth": row["ground_truth"],
        "contexts": ast.literal_eval(row["actual_context"]),
    }
    for _, row in qk.iterrows()
]
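
The preprocessing above assumes that every `actual_context` cell holds a valid Python list literal; `ast.literal_eval` raises an error otherwise. If your CSV is less tidy, a more forgiving parser along the following lines may help. This is a sketch: `parse_contexts` is a hypothetical helper, and its fallbacks (an empty list for non-string cells, a single-item list for unparseable text) are design choices you may want to change.

```python
import ast

def parse_contexts(raw):
    """Turn a CSV cell into a list of context strings."""
    if not isinstance(raw, str):
        # e.g. NaN produced by pandas for an empty cell
        return []
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Not a Python literal: treat the whole cell as one context
        return [raw]
    if isinstance(parsed, (list, tuple)):
        return [str(item) for item in parsed]
    return [str(parsed)]

parsed_list = parse_contexts("['first chunk', 'second chunk']")
parsed_plain = parse_contexts("just one plain-text context")
parsed_empty = parse_contexts(float("nan"))
```

To use it, replace `ast.literal_eval(row["actual_context"])` with `parse_contexts(row["actual_context"])` in the list comprehension.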

We then instantiate the Vijil Python client. Note that our Python client uses an API token, loaded from the environment variable `VIJIL_API_KEY`. Please make sure you have [fetched an API key](https://docs.vijil.ai/setup.html#authentication-using-api-keys) from the UI and stored it in your `.env` file.

In [None]:
import sys
import os
sys.path.append("..")

from dotenv import load_dotenv
load_dotenv()

from vijil import Vijil
client = Vijil()
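
If client creation fails, a quick check like the one below can confirm whether the key was actually loaded into the environment. It uses only the standard library; `api_key_is_set` is a hypothetical helper, not part of the Vijil client.

```python
import os

def api_key_is_set(var_name="VIJIL_API_KEY", env=None):
    """Return True if the variable exists and is not blank.

    `env` defaults to the process environment; pass a dict to test the logic.
    """
    env = os.environ if env is None else env
    return bool(env.get(var_name, "").strip())

if not api_key_is_set():
    print("VIJIL_API_KEY is not set; check your .env file.")
```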



### Calculate Metrics
Using the detection endpoint in Vijil Evaluate, you can call the RAG metrics on your dataset. Each of the metrics has a set of required fields that you need to provide. We first store the fields below, then iterate through all metrics to call each of them on the dataset provided.

In [3]:
required_fields = {
    "ContextualRecall": ["ground_truth", "contexts"],
    "AnswerRelevancy": ["question", "response"],
    "ContextualPrecision": ["question", "ground_truth", "contexts"],
    "Correctness": ["question", "response", "ground_truth"],
    "Faithfulness": ["question", "response", "contexts"]
}
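
Before submitting jobs, it can be worth checking that every row actually carries the fields each metric requires. The sketch below shows one way to do this; `find_missing_fields` and the example data are hypothetical, and treating `None`, empty strings, and empty lists as missing is our own assumption.

```python
def find_missing_fields(rows, fields_by_metric):
    """Return {metric: [(row_index, field), ...]} for absent or empty fields."""
    problems = {}
    for metric, fields in fields_by_metric.items():
        missing = [
            (ix, field)
            for ix, row in enumerate(rows)
            for field in fields
            if row.get(field) in (None, "", [])
        ]
        if missing:
            problems[metric] = missing
    return problems

# Small stand-ins for required_fields and qk1
example_fields = {
    "AnswerRelevancy": ["question", "response"],
    "Faithfulness": ["question", "response", "contexts"],
}
example_rows = [
    {"question": "q1", "response": "r1", "contexts": ["c1"]},
    {"question": "q2", "response": "r2", "contexts": []},  # no contexts retrieved
]
example_problems = find_missing_fields(example_rows, example_fields)
```

In the notebook you would call `find_missing_fields(qk1, required_fields)` and fix or drop the reported rows before creating detections.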

In [4]:
dets = []
for metric, fields in required_fields.items():
    print("Calculating metric", metric)
    detection = client.detections.create(
        detector_id=f"llm.{metric}",
        detector_inputs=[
            {
                field: row[field]
                for field in fields
            }
            for row in qk1
        ]
    )
    dets.append(detection)

Calculating metric ContextualRecall
Calculating metric AnswerRelevancy
Calculating metric ContextualPrecision
Calculating metric Correctness
Calculating metric Faithfulness


A detection job is created for each metric, and the responses, each containing a job ID, are stored in the `dets` list.

### Summarize Metrics
Now we simply fetch the results of the detection jobs and average the scores for each metric.

In [5]:
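# fetch the results of each detection job
det_dfs = [
    client.detections.describe(detection["id"])
    for detection in dets
]
det_df = pd.concat(det_dfs)
# keep only the metric name, i.e. the last dot-separated part of the detector ID
det_df['detector_id'] = det_df['detector_id'].apply(lambda x: x.split(".")[-1])
# extract the numeric score from each detector output
det_df['detector_output'] = det_df['detector_output'].apply(lambda x: x['score'])
# average score per metric
det_df.groupby("detector_id")['detector_output'].mean()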

detector_id
AnswerRelevancy        0.752301
ContextualPrecision    0.404399
ContextualRecall       0.938071
Correctness            0.720812
Faithfulness           0.860111
Name: detector_output, dtype: float64

According to the retrieval metrics, the retrieved contexts demonstrate a high degree of recall, but low precision---meaning that while contexts relevant to the ground truth are retrieved consistently, irrelevant contexts get returned with them as well. According to the generation metrics, the RAG-generated answers display a high degree of faithfulness, while being moderately correct and relevant.

Finally, we add the per-row scores to the original records and save them to a new CSV file.

In [None]:
# collate the per-metric score for each row of qk1
def score_per_metric(metric, dataset):
    # match detector_input by first required field
    scores = []
    metric_df = det_df[det_df['detector_id'] == metric]
    field = required_fields[metric][0]
    for row in dataset:
        try:
            score = metric_df[metric_df['detector_input'].apply(lambda x: x[field] == row[field])]['detector_output'].values[0]
            # convert NumPy scalars to plain Python numbers
            score = score.item() if hasattr(score, "item") else score
        except IndexError:
            # no detection result matched this row
            score = None
        scores.append(score)
    return scores

# add scores to qk1
for metric in required_fields.keys():
    metric_score = score_per_metric(metric, qk1)
    for ix, _ in enumerate(qk1):
        qk1[ix][metric] = metric_score[ix]
        
# save as csv
qk1_df = pd.DataFrame(qk1)
qk1_df.to_csv("dataset with scores.csv", index=False)