# Benchmarking RAG Pipelines With A `LabelledRagDataset`

The `LabelledRagDataset` is meant to be used for evaluating any given RAG pipeline, for which there could be several configurations (e.g., the choice of `LLM` and the values of `similarity_top_k`, `chunk_size`, and others). We've likened this abstraction to traditional machine learning datasets, where `X` features are meant to predict a ground-truth label `y`. In this case, we use the `query` as well as the retrieved `contexts` as the "features", and the answer to the query, called `reference_answer`, as the ground-truth label.

Such datasets consist of observations, or examples. A `LabelledRagDataset` is made up of a set of `LabelledRagDataExample`'s.
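To make the analogy concrete, here is a minimal plain-Python sketch that does not require `llama_index`. The `ToyExample` class and the `split_features_and_label` helper are hypothetical names introduced only for illustration; they mirror the fields described above but are not part of the library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyExample:
    """Hypothetical stand-in mirroring the fields of a LabelledRagDataExample."""

    query: str
    reference_contexts: List[str]
    reference_answer: str


def split_features_and_label(
    example: ToyExample,
) -> Tuple[Tuple[str, List[str]], str]:
    """Return (X, y): query and contexts as features, the answer as the label."""
    X = (example.query, example.reference_contexts)
    y = example.reference_answer
    return X, y


toy = ToyExample(
    query="This is a test query, is it not?",
    reference_contexts=["This is a sample context"],
    reference_answer="Yes it is.",
)
X, y = split_features_and_label(toy)
print(X)
print(y)
```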

### The `LabelledRagDataExample` Class

In [None]:
from llama_index.llama_dataset import LabelledRagDataExample, CreatedByType

# constructing a LabelledRagDataExample
query = "This is a test query, is it not?"
query_by = CreatedByType.AI
reference_answer = "Yes it is."
reference_answer_by = CreatedByType.HUMAN
reference_contexts = ["This is a sample context"]

rag_example = LabelledRagDataExample(
    query=query,
    query_by=query_by,
    reference_contexts=reference_contexts,
    reference_answer=reference_answer,
    reference_answer_by=reference_answer_by,
)

The `LabelledRagDataExample` is a `dataclass`, so converting to and from `json` or `dict` is possible.

In [None]:
print(rag_example.to_json())

{"query": "This is a test query, is it not?", "query_by": "ai", "reference_contexts": ["This is a sample context"], "reference_answer": "Yes it is.", "reference_answer_by": "human"}


In [None]:
LabelledRagDataExample.from_json(rag_example.to_json())

LabelledRagDataExample(query='This is a test query, is it not?', query_by=<CreatedByType.AI: 'ai'>, reference_contexts=['This is a sample context'], reference_answer='Yes it is.', reference_answer_by=<CreatedByType.HUMAN: 'human'>)

In [None]:
rag_example.to_dict()

{'query': 'This is a test query, is it not?',
 'query_by': <CreatedByType.AI: 'ai'>,
 'reference_contexts': ['This is a sample context'],
 'reference_answer': 'Yes it is.',
 'reference_answer_by': <CreatedByType.HUMAN: 'human'>}

In [None]:
LabelledRagDataExample.from_dict(rag_example.to_dict())

LabelledRagDataExample(query='This is a test query, is it not?', query_by=<CreatedByType.AI: 'ai'>, reference_contexts=['This is a sample context'], reference_answer='Yes it is.', reference_answer_by=<CreatedByType.HUMAN: 'human'>)

Let's create a second example, so we can have a (slightly) more interesting `LabelledRagDataset`.

In [None]:
query = "This is a test query, is it so?"
reference_answer = "I think yes, it is."
reference_contexts = ["This is a second sample context"]

rag_example_2 = LabelledRagDataExample(
    query=query,
    query_by=query_by,
    reference_contexts=reference_contexts,
    reference_answer=reference_answer,
    reference_answer_by=reference_answer_by,
)

### The `LabelledRagDataset` Class

In [None]:
from llama_index.llama_dataset.rag import LabelledRagDataset

rag_dataset = LabelledRagDataset(examples=[rag_example, rag_example_2])

There is a convenience method to view the dataset as a `pandas.DataFrame`.

In [None]:
rag_dataset.to_pandas()

|   | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | This is a test query, is it not? | [This is a sample context] | Yes it is. | human | ai |
| 1 | This is a test query, is it so? | [This is a second sample context] | I think yes, it is. | human | ai |


#### Serialization

To persist and load the dataset to and from disk, there are the `save_json` and `from_json` methods.

In [None]:
rag_dataset.save_json("rag_dataset.json")

In [None]:
reload_rag_dataset = LabelledRagDataset.from_json("rag_dataset.json")

In [None]:
reload_rag_dataset.to_pandas()

|   | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | This is a test query, is it not? | [This is a sample context] | Yes it is. | human | ai |
| 1 | This is a test query, is it so? | [This is a second sample context] | I think yes, it is. | human | ai |


### Predicting and Evaluation

For this section, we'll first create a `LabelledRagDataset` using a synthetic generator. We will use an LLM (`gpt-3.5-turbo` in the code below) to produce both the `query` and `reference_answer` for the synthetic `LabelledRagDataExample`'s.

NOTE: if one has queries, reference answers, and contexts over a text corpus, then it is not necessary to use data synthesis to be able to predict and subsequently evaluate said predictions.
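As a rough illustration of that note, the sketch below reshapes pre-existing records into the field layout printed by the example's JSON output earlier. The record keys (`question`, `answer`, `passages`) and the `to_example_dict` helper are hypothetical, and marking both fields as created by `"human"` is an assumption about where such records come from.

```python
import json
from typing import Dict, List

# Hypothetical pre-existing records over a text corpus.
existing_records = [
    {
        "question": "This is a test query, is it not?",
        "answer": "Yes it is.",
        "passages": ["This is a sample context"],
    },
]


def to_example_dict(record: Dict) -> Dict:
    """Map a record onto the field names used by a labelled RAG example."""
    return {
        "query": record["question"],
        "query_by": "human",  # assumption: the queries were written by people
        "reference_contexts": list(record["passages"]),
        "reference_answer": record["answer"],
        "reference_answer_by": "human",  # assumption: so were the answers
    }


example_dicts: List[Dict] = [to_example_dict(r) for r in existing_records]
print(json.dumps(example_dicts[0], indent=2))
```

Each dictionary's fields could then be passed to the `LabelledRagDataExample` constructor shown earlier, with `CreatedByType.HUMAN` in place of the `"human"` strings.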

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Load documents and build index
documents = SimpleDirectoryReader(
    input_files=["data/paul_graham_essay_truncated.txt"]
).load_data()
index = VectorStoreIndex.from_documents(documents)

The `RagDatasetGenerator` can be built over a set of documents to generate `LabelledRagDataExample`'s.

In [None]:
# generate questions against chunks
from llama_index.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms import OpenAI
from llama_index import ServiceContext

# set context for llm provider
gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3)
)

# instantiate a RagDatasetGenerator
dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    service_context=gpt_35_context,
    num_questions_per_chunk=2,  # set the number of questions per node
)

In [None]:
len(dataset_generator.nodes)

2

In [None]:
# since there are 2 nodes, there should be a total of 4 questions
rag_dataset = dataset_generator.generate_dataset_from_nodes()

In [None]:
rag_dataset.to_pandas()

|   | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | What were the two main things that the author ... | [What I Worked On\n\nFebruary 2021\n\nBefore c... | Before college, the two main things that the a... | ai | ai |
| 1 | What factors influenced the author's decision ... | [What I Worked On\n\nFebruary 2021\n\nBefore c... | The factors that influenced the author's decis... | ai | ai |
| 2 | In the context of the given information, what ... | [I couldn't have put this into words when I wa... | The two factors that influenced the author's d... | ai | ai |
| 3 | How did learning Lisp expand the author's conc... | [I couldn't have put this into words when I wa... | Learning Lisp expanded the author's concept of... | ai | ai |


In [None]:
rag_dataset.save_json("rag_dataset.json")

#### Predicting

Let's step back for a second to set the scene before making actual predictions. Recall that the point of the `LabelledRagDataset` is to benchmark any given RAG pipeline that is built over the same source documents (in this case, `paul_graham_essay_truncated.txt`).

So, let's emulate that situation now by creating a simple RAG pipeline (i.e., index, then query engine) over the same source text data file.

In [None]:
documents = SimpleDirectoryReader(
    input_files=["data/paul_graham_essay_truncated.txt"]
).load_data()
index = VectorStoreIndex.from_documents(documents)

In [None]:
query_engine = index.as_query_engine()

A `LabelledRagDataset` has a method called `make_predictions_with` that takes a `QueryEngine` as input and produces predictions (i.e., generates responses to the queries). Specifically, it returns a `RagPredictionDataset` made up of a set of `RagExamplePrediction`'s, which store the generated response as well as the contexts retrieved by the retriever of the RAG pipeline.
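Conceptually, making predictions amounts to running every query through the pipeline and keeping each response together with its retrieved contexts. The plain-Python sketch below illustrates only that idea; it is not the library's implementation, and `ToyPrediction`, `make_toy_predictions`, and `echo_engine` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ToyPrediction:
    """Hypothetical stand-in for a RagExamplePrediction."""

    response: str
    contexts: List[str] = field(default_factory=list)


# A hypothetical "query engine": maps a query to (response, retrieved contexts).
ToyQueryEngine = Callable[[str], Tuple[str, List[str]]]


def make_toy_predictions(
    queries: List[str], engine: ToyQueryEngine
) -> List[ToyPrediction]:
    """Run each query through the engine and collect one prediction per query."""
    predictions = []
    for query in queries:
        response, contexts = engine(query)
        predictions.append(ToyPrediction(response=response, contexts=contexts))
    return predictions


def echo_engine(query: str) -> Tuple[str, List[str]]:
    """Stub engine returning a canned answer and a single fake context."""
    return f"Answer to: {query}", [f"Context for: {query}"]


toy_predictions = make_toy_predictions(["q1", "q2"], echo_engine)
print(len(toy_predictions))
print(toy_predictions[0])
```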

In [None]:
prediction_dataset = rag_dataset.make_predictions_with(
    query_engine=query_engine
)

In [None]:
# taking a peek at a single RagExamplePrediction
pred = prediction_dataset.predictions[0]

print(f"FIRST 100 CHARS of RESPONSE:\n{pred.response[:100]}...")
print("\n=================")
for ix, c in enumerate(pred.contexts):
    print(f"TOP {ix} RETRIEVAL:\n{c[:100]}...\n")
    print("=================")

FIRST 100 CHARS of RESPONSE:
The author worked on writing and programming before college. In terms of their outcomes, the author ...

TOP 0 RETRIEVAL:
What I Worked On

February 2021

Before college the two main things I worked on, outside of school, ...

TOP 1 RETRIEVAL:
I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking phi...



Just as with `LabelledRagDataset`'s, you can save predictions to and load them from a JSON file.

In [None]:
prediction_dataset.save_json("prediction_dataset.json")

In [None]:
from llama_index.llama_dataset import RagPredictionDataset

reloaded_predictions = RagPredictionDataset.from_json(
    "prediction_dataset.json"
)

In [None]:
reloaded_predictions.to_pandas()

|   | response | contexts |
|---|---|---|
| 0 | The author worked on writing and programming b... | [What I Worked On\n\nFebruary 2021\n\nBefore c... |
| 1 | The author's decision to switch from studying ... | [I couldn't have put this into words when I wa... |
| 2 | The two factors that influenced the author's d... | [I couldn't have put this into words when I wa... |
| 3 | Learning Lisp expanded the author's concept of... | [I couldn't have put this into words when I wa... |


#### Evaluation

Now that we have our predictions, we can perform evaluations on two dimensions:

1. The generated response: how well the predicted response matches the reference answer.
2. The retrieved contexts: how well the retrieved contexts for the prediction match the reference contexts.

NOTE: For retrieved contexts, we are unable to use standard retrieval metrics such as `hit rate` and `mean reciprocal rank`, because they require the same index that was used to generate the ground-truth data. However, a `LabelledRagDataset` need not have been created from an index at all. As such, we will use `semantic similarity` between the prediction's contexts and the reference contexts as a measure of goodness.
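For reference, the two retrieval metrics mentioned above are usually defined over node (or document) identifiers, which is why the retrieved and expected items must come from the same index. The sketch below uses the common single-relevant-item definitions with made-up ids; `hit_rate` and `mean_reciprocal_rank` are hypothetical helper names, not `llama_index` functions.

```python
from typing import List


def hit_rate(retrieved: List[List[str]], expected: List[str]) -> float:
    """Fraction of queries whose expected id appears among the retrieved ids."""
    hits = sum(1 for ids, target in zip(retrieved, expected) if target in ids)
    return hits / len(expected)


def mean_reciprocal_rank(retrieved: List[List[str]], expected: List[str]) -> float:
    """Average of 1 / rank of the expected id (0 when it was not retrieved)."""
    total = 0.0
    for ids, target in zip(retrieved, expected):
        if target in ids:
            total += 1.0 / (ids.index(target) + 1)
    return total / len(expected)


# Made-up node ids for three queries (top-2 retrieval each).
retrieved_ids = [["n1", "n2"], ["n3", "n1"], ["n2", "n4"]]
expected_ids = ["n1", "n1", "n5"]

print(hit_rate(retrieved_ids, expected_ids))
print(mean_reciprocal_rank(retrieved_ids, expected_ids))
```

Both functions compare identifiers, so they carry no meaning when the reference contexts were produced outside the index being evaluated.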

In [None]:
import tqdm

For evaluating the response, we will use the LLM-As-A-Judge pattern. Specifically, we will use `CorrectnessEvaluator`, `FaithfulnessEvaluator` and `RelevancyEvaluator`.

For evaluating the goodness of the retrieved contexts we will use `SemanticSimilarityEvaluator`.
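Semantic similarity is commonly computed as the cosine similarity between the embedding vectors of two texts. Assuming that is the measure in use here (an assumption about the evaluator's default behaviour), the core computation looks like the sketch below. The three-dimensional vectors are made up; real embeddings come from an embedding model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Made-up "embeddings" of the retrieved and the reference contexts.
retrieved_embedding = [0.2, 0.7, 0.1]
reference_embedding = [0.25, 0.65, 0.05]

print(round(cosine_similarity(retrieved_embedding, reference_embedding), 4))
```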

In [None]:
# instantiate the gpt-4 judges
from llama_index.llms import OpenAI
from llama_index import ServiceContext
from llama_index.evaluation import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    SemanticSimilarityEvaluator,
)

judges = {}

judges["correctness"] = CorrectnessEvaluator(
    service_context=ServiceContext.from_defaults(
        llm=OpenAI(temperature=0, model="gpt-4"),
    )
)

judges["relevancy"] = RelevancyEvaluator(
    service_context=ServiceContext.from_defaults(
        llm=OpenAI(temperature=0, model="gpt-4"),
    )
)

judges["faithfulness"] = FaithfulnessEvaluator(
    service_context=ServiceContext.from_defaults(
        llm=OpenAI(temperature=0, model="gpt-4"),
    )
)

judges["semantic_similarity"] = SemanticSimilarityEvaluator(
    service_context=ServiceContext.from_defaults(
        llm=OpenAI(temperature=0, model="gpt-4"),
    )
)

Loop through the (`labelled_example`, `prediction`) pairs and perform the evaluations on each of them individually.

In [None]:
evals = {
    "correctness": [],
    "relevancy": [],
    "faithfulness": [],
    "context_similarity": [],
}

for example, prediction in tqdm.tqdm(
    zip(rag_dataset.examples, prediction_dataset.predictions)
):
    correctness_result = await judges["correctness"].aevaluate(
        query=example.query,
        response=prediction.response,
        reference=example.reference_answer,
    )

    relevancy_result = judges["relevancy"].evaluate(
        query=example.query,
        response=prediction.response,
        contexts=prediction.contexts,
    )

    faithfulness_result = judges["faithfulness"].evaluate(
        query=example.query,
        response=prediction.response,
        contexts=prediction.contexts,
    )

    semantic_similarity_result = judges["semantic_similarity"].evaluate(
        query=example.query,
        response="\n".join(prediction.contexts),
        reference="\n".join(example.reference_contexts),
    )

    evals["correctness"].append(correctness_result)
    evals["relevancy"].append(relevancy_result)
    evals["faithfulness"].append(faithfulness_result)
    evals["context_similarity"].append(semantic_similarity_result)

4it [00:33,  8.36s/it]


Now, we can use our notebook utility functions to view these evaluations.

In [None]:
import pandas as pd
from llama_index.evaluation.notebook_utils import (
    get_eval_results_df,
)

_, mean_correctness_df = get_eval_results_df(
    ["base_rag"] * len(evals["correctness"]),
    evals["correctness"],
    metric="correctness",
)
_, mean_relevancy_df = get_eval_results_df(
    ["base_rag"] * len(evals["relevancy"]),
    evals["relevancy"],
    metric="relevancy",
)
_, mean_faithfulness_df = get_eval_results_df(
    ["base_rag"] * len(evals["faithfulness"]),
    evals["faithfulness"],
    metric="faithfulness",
)
_, mean_context_similarity_df = get_eval_results_df(
    ["base_rag"] * len(evals["context_similarity"]),
    evals["context_similarity"],
    metric="context_similarity",
)

mean_scores_df = pd.concat(
    [
        mean_correctness_df.reset_index(),
        mean_relevancy_df.reset_index(),
        mean_faithfulness_df.reset_index(),
        mean_context_similarity_df.reset_index(),
    ],
    axis=0,
    ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])

In [None]:
mean_scores_df

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.75 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.974206 |


On this toy example, we see that the basic RAG pipeline performs quite well against the evaluation benchmark (`rag_dataset`)!
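The `mean_*_score` rows in the table above are per-metric averages over the evaluated examples. Here is a plain-Python sketch of that aggregation, assuming each evaluation result exposes a numeric score. The scores below are made-up placeholders for a hypothetical run, not the values behind the table, and `mean_scores` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, List

# Made-up per-example scores for a hypothetical run (not the results above).
toy_scores: Dict[str, List[float]] = {
    "correctness": [4.0, 5.0, 4.5, 3.5],
    "relevancy": [1.0, 1.0, 0.0, 1.0],
    "faithfulness": [1.0, 1.0, 1.0, 1.0],
}


def mean_scores(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each metric's scores, naming keys like the summary table."""
    return {f"mean_{name}_score": mean(values) for name, values in scores.items()}


summary = mean_scores(toy_scores)
for metric, value in summary.items():
    print(f"{metric}: {value:.2f}")
```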