# Benchmarking RAG Pipelines With A `LabelledRagDataset`

The `LabelledRagDataset` is meant for evaluating any given RAG pipeline, for which there could be several configurations (e.g., the choice of `LLM` and the values of `similarity_top_k`, `chunk_size`, and others). We've likened this abstraction to traditional machine learning datasets, where `X` features are meant to predict a ground-truth label `y`. In this case, the `query` and the retrieved `contexts` serve as the "features", and the answer to the query, called the `reference_answer`, serves as the ground-truth label.

And of course, such datasets are composed of observations, or examples. A `LabelledRagDataset` is made up of a set of `LabelledRagDataExample`'s.
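
To make the analogy concrete, here is a minimal sketch that uses a plain dataclass rather than the library's classes. `ToyRagExample` and `to_xy` are hypothetical names introduced only for illustration; they mirror the fields described above but are not part of `llama_index`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyRagExample:
    """Hypothetical stand-in for one labelled RAG example (not the library class)."""

    query: str
    reference_contexts: List[str]
    reference_answer: str


def to_xy(example: ToyRagExample) -> Tuple[Tuple[str, List[str]], str]:
    """Split an example into "features" (query, contexts) and the "label" (answer)."""
    features = (example.query, example.reference_contexts)
    label = example.reference_answer
    return features, label


toy = ToyRagExample(
    query="This is a test query, is it not?",
    reference_contexts=["This is a sample context"],
    reference_answer="Yes it is.",
)
X, y = to_xy(toy)
print(X)
print(y)
```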

In this notebook, we will show how one can construct a `LabelledRagDataset` from scratch. Please note that the alternative would be to simply download a community-supplied `LabelledRagDataset` from `llama-hub` and evaluate/benchmark your own RAG pipeline on it.

### The `LabelledRagDataExample` Class

In [None]:
from llama_index.llama_dataset import (
    LabelledRagDataExample,
    CreatedByType,
    CreatedBy,
)

# constructing a LabelledRagDataExample
query = "This is a test query, is it not?"
query_by = CreatedBy(type=CreatedByType.AI, model_name="gpt-4")
reference_answer = "Yes it is."
reference_answer_by = CreatedBy(type=CreatedByType.HUMAN)
reference_contexts = ["This is a sample context"]

rag_example = LabelledRagDataExample(
    query=query,
    query_by=query_by,
    reference_contexts=reference_contexts,
    reference_answer=reference_answer,
    reference_answer_by=reference_answer_by,
)

The `LabelledRagDataExample` is a Pydantic `Model`, so converting it to `json` or `dict` (and back again) is possible.

In [None]:
print(rag_example.json())

{"query": "This is a test query, is it not?", "query_by": {"model_name": "gpt-4", "type": "ai"}, "reference_contexts": ["This is a sample context"], "reference_answer": "Yes it is.", "reference_answer_by": {"model_name": "", "type": "human"}}


In [None]:
LabelledRagDataExample.parse_raw(rag_example.json())

LabelledRagDataExample(query='This is a test query, is it not?', query_by=CreatedBy(model_name='gpt-4', type=<CreatedByType.AI: 'ai'>), reference_contexts=['This is a sample context'], reference_answer='Yes it is.', reference_answer_by=CreatedBy(model_name='', type=<CreatedByType.HUMAN: 'human'>))

In [None]:
rag_example.dict()

{'query': 'This is a test query, is it not?',
 'query_by': {'model_name': 'gpt-4', 'type': <CreatedByType.AI: 'ai'>},
 'reference_contexts': ['This is a sample context'],
 'reference_answer': 'Yes it is.',
 'reference_answer_by': {'model_name': '',
  'type': <CreatedByType.HUMAN: 'human'>}}

In [None]:
LabelledRagDataExample.parse_obj(rag_example.dict())

LabelledRagDataExample(query='This is a test query, is it not?', query_by=CreatedBy(model_name='gpt-4', type=<CreatedByType.AI: 'ai'>), reference_contexts=['This is a sample context'], reference_answer='Yes it is.', reference_answer_by=CreatedBy(model_name='', type=<CreatedByType.HUMAN: 'human'>))
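
The round trips above rely on Pydantic. As a sanity check on the idea itself, the following sketch performs the same dict and JSON round trips with a hypothetical stdlib dataclass (`ToyExample`) and confirms that nothing is lost along the way. It is only an illustration under that assumption and does not reproduce the library's serialization code.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ToyExample:
    """Hypothetical stand-in; the real class is a Pydantic model."""

    query: str
    reference_contexts: List[str]
    reference_answer: str


original = ToyExample(
    query="This is a test query, is it not?",
    reference_contexts=["This is a sample context"],
    reference_answer="Yes it is.",
)

# dict round trip: object -> dict -> object
from_dict = ToyExample(**asdict(original))

# JSON round trip: object -> JSON string -> dict -> object
from_json = ToyExample(**json.loads(json.dumps(asdict(original))))

print(from_dict == original, from_json == original)  # prints: True True
```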

Let's create a second example, so we can have a (slightly) more interesting `LabelledRagDataset`.

In [None]:
query = "This is a test query, is it so?"
reference_answer = "I think yes, it is."
reference_contexts = ["This is a second sample context"]

rag_example_2 = LabelledRagDataExample(
    query=query,
    query_by=query_by,
    reference_contexts=reference_contexts,
    reference_answer=reference_answer,
    reference_answer_by=reference_answer_by,
)

### The `LabelledRagDataset` Class

In [None]:
from llama_index.llama_dataset.rag import LabelledRagDataset

rag_dataset = LabelledRagDataset(examples=[rag_example, rag_example_2])

There is a convenience method to view the dataset as a `pandas.DataFrame`.

In [None]:
rag_dataset.to_pandas()

Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,"This is a test query, is it not?",[This is a sample context],Yes it is.,human,ai (gpt-4)
1,"This is a test query, is it so?",[This is a second sample context],"I think yes, it is.",human,ai (gpt-4)


#### Serialization

To persist and load the dataset to and from disk, there are the `save_json` and `from_json` methods.

In [None]:
rag_dataset.save_json("rag_dataset.json")

In [None]:
reload_rag_dataset = LabelledRagDataset.from_json("rag_dataset.json")

In [None]:
reload_rag_dataset.to_pandas()

Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,"This is a test query, is it not?",[This is a sample context],Yes it is.,human,ai (gpt-4)
1,"This is a test query, is it so?",[This is a second sample context],"I think yes, it is.",human,ai (gpt-4)


### Predicting and Evaluation

For this section, we'll first create a `LabelledRagDataset` using a synthetic generator. Ultimately, we will use an OpenAI LLM (`gpt-3.5-turbo` in the code below) to produce both the `query` and `reference_answer` for the synthetic `LabelledRagDataExample`'s.

NOTE: if one has queries, reference answers, and contexts over a text corpus, then it is not necessary to use data synthesis to be able to predict and subsequently evaluate said predictions.

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Load documents and build index
documents = SimpleDirectoryReader(
    input_files=["data/paul_graham_essay_truncated.txt"]
).load_data()
index = VectorStoreIndex.from_documents(documents)

The `RagDatasetGenerator` can be built over a set of documents to generate `LabelledRagDataExample`'s.

In [None]:
# generate questions against chunks
from llama_index.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms import OpenAI
from llama_index import ServiceContext

# set context for llm provider
gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3)
)

# instantiate a RagDatasetGenerator
dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    service_context=gpt_35_context,
    num_questions_per_chunk=2,  # number of questions to generate per chunk (node)
)

In [None]:
len(dataset_generator.nodes)

2

In [None]:
# 2 nodes x 2 questions per chunk, so there should be 4 questions in total
rag_dataset = dataset_generator.generate_dataset_from_nodes()

In [None]:
rag_dataset.to_pandas()

Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,How did the availability of microcomputers cha...,[What I Worked On\n\nFebruary 2021\n\nBefore c...,The availability of microcomputers changed the...,ai (gpt-3.5-turbo),ai (gpt-3.5-turbo)
1,What factors influenced Paul Graham's decision...,[What I Worked On\n\nFebruary 2021\n\nBefore c...,Two factors influenced Paul Graham's decision ...,ai (gpt-3.5-turbo),ai (gpt-3.5-turbo)
2,"How did the novel ""The Moon is a Harsh Mistres...",[I couldn't have put this into words when I wa...,"The novel ""The Moon is a Harsh Mistress"" and t...",ai (gpt-3.5-turbo),ai (gpt-3.5-turbo)
3,Why did the author choose to learn Lisp as a p...,[I couldn't have put this into words when I wa...,The author chose to learn Lisp as a programmin...,ai (gpt-3.5-turbo),ai (gpt-3.5-turbo)


In [None]:
rag_dataset.save_json("rag_dataset.json")

In [None]:
reload_ragdataset = LabelledRagDataset.from_json("rag_dataset.json")

#### Predicting

Let's step back for a second to set the scene before making actual predictions. Recall that the point of the `LabelledRagDataset` is to benchmark any given RAG pipeline built over the same source documents (in this case, `paul_graham_essay_truncated.txt`).

So, let's emulate that situation now by creating a simple RAG pipeline (i.e., index, then query engine) over the same source text data file.

In [None]:
documents = SimpleDirectoryReader(
    input_files=["data/paul_graham_essay_truncated.txt"]
).load_data()
index = VectorStoreIndex.from_documents(documents)

In [None]:
query_engine = index.as_query_engine()

A `LabelledRagDataset` has a method called `make_predictions_with` that takes a `QueryEngine` as input and produces predictions (i.e., generates responses to the queries). Specifically, it returns a `RagPredictionDataset` composed of a set of `RagExamplePrediction`'s, which store the generated response as well as the contexts retrieved by the retriever of the RAG pipeline. An async counterpart, `amake_predictions_with`, is also available; both are shown below.

In [None]:
prediction_dataset = await rag_dataset.amake_predictions_with(
    query_engine=query_engine, show_progress=True
)

100%|█████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.17it/s]


In [None]:
prediction_dataset = rag_dataset.make_predictions_with(
    query_engine=query_engine, show_progress=True
)

100%|█████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.35it/s]
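
Conceptually, making predictions amounts to running every query in the dataset through the query engine and recording the response together with the retrieved contexts. The sketch below illustrates that loop with hypothetical stand-ins (`ToyPrediction`, `toy_query_engine`, `make_predictions`). It is an assumption about the general shape of the process, not the library's implementation, and it makes no LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyPrediction:
    """Hypothetical stand-in for a single prediction record."""

    response: str
    contexts: List[str]


# A toy "query engine": takes a query, returns (response, retrieved contexts).
QueryFn = Callable[[str], Tuple[str, List[str]]]


def toy_query_engine(query: str) -> Tuple[str, List[str]]:
    """Return a canned response and a fixed context; a real engine would call an LLM."""
    return f"Echo: {query}", ["This is a sample context"]


def make_predictions(queries: List[str], engine: QueryFn) -> List[ToyPrediction]:
    """Run each query through the engine, keeping predictions in query order."""
    predictions = []
    for query in queries:
        response, contexts = engine(query)
        predictions.append(ToyPrediction(response=response, contexts=contexts))
    return predictions


queries = [
    "This is a test query, is it not?",
    "This is a test query, is it so?",
]
toy_predictions = make_predictions(queries, toy_query_engine)
print(len(toy_predictions))
print(toy_predictions[0].response)
```

In this sketch the predictions keep the same order as the queries, so each one can later be paired with its reference answer by position.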


In [None]:
# taking a peek at a single RagExamplePrediction
pred = prediction_dataset.predictions[0]

print(f"FIRST 100 CHARS of RESPONSE:\n{pred.response[:100]}...")
print("\n=================")
for ix, c in enumerate(pred.contexts):
    print(f"TOP {ix} RETRIEVAL:\n{c[:100]}...\n")
    print("=================")

FIRST 100 CHARS of RESPONSE:
The availability of microcomputers changed the way people could interact with computers and engage i...

TOP 0 RETRIEVAL:
What I Worked On

February 2021

Before college the two main things I worked on, outside of school, ...

TOP 1 RETRIEVAL:
I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking phi...



Just as with `LabelledRagDataset`'s, you can save predictions to, and load them from, a JSON file.

In [None]:
prediction_dataset.save_json("prediction_dataset.json")

In [None]:
from llama_index.llama_dataset import RagPredictionDataset

reloaded_predictions = RagPredictionDataset.from_json(
    "prediction_dataset.json"
)

In [None]:
reloaded_predictions.to_pandas()

Unnamed: 0,response,contexts
0,The availability of microcomputers changed the...,[What I Worked On\n\nFebruary 2021\n\nBefore c...
1,Paul Graham's decision to switch from studying...,[I couldn't have put this into words when I wa...
2,"The novel ""The Moon is a Harsh Mistress"" and t...",[I couldn't have put this into words when I wa...
3,The author chose to learn Lisp as a programmin...,[I couldn't have put this into words when I wa...
