# Downloading a LlamaDataset from LlamaHub

You can browse our available benchmark datasets via [llamahub.ai](https://llamahub.ai/). This notebook guide shows how to download a dataset and its source text documents. In particular, `download_llama_dataset` downloads the evaluation dataset (i.e., a `LabelledRagDataset`) as well as the `Document` objects for the source text files that were used to build the evaluation dataset in the first place.

Finally, this notebook also demonstrates the end-to-end workflow of downloading an evaluation dataset, making predictions on it with your own RAG pipeline (query engine), and then evaluating those predictions.

In [None]:
from llama_index.llama_dataset import download_llama_dataset

# download the evaluation dataset and its source documents
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./paul_graham"
)

In [None]:
rag_dataset.to_pandas()

Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,How did the availability of microcomputers cha...,[What I Worked On\n\nFebruary 2021\n\nBefore c...,The availability of microcomputers changed the...,ai (gpt-3.5-turbo),ai (gpt-3.5-turbo)
1,What factors influenced Paul Graham's decision...,[What I Worked On\n\nFebruary 2021\n\nBefore c...,Two factors influenced Paul Graham's decision ...,ai (gpt-3.5-turbo),ai (gpt-3.5-turbo)
2,"How did the novel ""The Moon is a Harsh Mistres...",[I couldn't have put this into words when I wa...,"The novel ""The Moon is a Harsh Mistress"" and t...",ai (gpt-3.5-turbo),ai (gpt-3.5-turbo)
3,Why did the author choose to learn Lisp as a p...,[I couldn't have put this into words when I wa...,The author chose to learn Lisp as a programmin...,ai (gpt-3.5-turbo),ai (gpt-3.5-turbo)


With `documents`, you can build your own RAG pipeline, generate predictions, and evaluate them to compare against the benchmarks listed in the dataset's `DatasetCard` on [llamahub.ai](https://llamahub.ai/).

### Predictions

In [None]:
from llama_index import VectorStoreIndex

# a basic RAG pipeline, uses service context defaults
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

You can now create predictions and perform evaluation manually or download the `PredictAndEvaluatePack` to do this for you in a single line of code.

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
# manually generate a prediction for every example in the dataset
prediction_dataset = await rag_dataset.amake_predictions_with(
    query_engine=query_engine, show_progress=True
)

100%|█████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.26it/s]


In [None]:
prediction_dataset.to_pandas()

Unnamed: 0,response,contexts
0,The availability of microcomputers changed the...,[What I Worked On\n\nFebruary 2021\n\nBefore c...
1,Paul Graham's decision to switch from studying...,[I couldn't have put this into words when I wa...
2,"The novel ""The Moon is a Harsh Mistress"" and t...",[I couldn't have put this into words when I wa...
3,The author chose to learn Lisp as a programmin...,[I couldn't have put this into words when I wa...


### Evaluation

Now that we have our predictions, we can perform evaluations on two dimensions:

1. The generated response: how well the predicted response matches the reference answer.
2. The retrieved contexts: how well the retrieved contexts for the prediction match the reference contexts.

NOTE: For retrieved contexts, we cannot use standard retrieval metrics such as `hit rate` and `mean reciprocal rank`, because they require the same index that was used to generate the ground-truth data. However, a `LabelledRagDataset` does not even need to be created from an index. Instead, we use the `semantic similarity` between the prediction's contexts and the reference contexts as a measure of goodness.
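To illustrate why these metrics depend on a shared index, here is a minimal sketch of how hit rate and mean reciprocal rank are commonly computed. Both compare retrieved node IDs against expected node IDs, so the IDs must come from the same index. The function names and node IDs below are hypothetical and are not part of the llama_index API.

```python
def hit_rate(retrieved_ids, expected_ids):
    """Fraction of queries with at least one expected ID among the retrieved IDs."""
    hits = [
        any(node_id in expected for node_id in retrieved)
        for retrieved, expected in zip(retrieved_ids, expected_ids)
    ]
    return sum(hits) / len(hits)


def mean_reciprocal_rank(retrieved_ids, expected_ids):
    """Average of 1 / rank of the first expected ID (0 if none is retrieved)."""
    reciprocal_ranks = []
    for retrieved, expected in zip(retrieved_ids, expected_ids):
        reciprocal_rank = 0.0
        for rank, node_id in enumerate(retrieved, start=1):
            if node_id in expected:
                reciprocal_rank = 1.0 / rank
                break
        reciprocal_ranks.append(reciprocal_rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Hypothetical node IDs: one ranked list of retrieved IDs per query,
# and one set of expected (ground-truth) IDs per query.
retrieved_ids = [["n3", "n1", "n7"], ["n2", "n9", "n4"]]
expected_ids = [{"n1"}, {"n5"}]

hr = hit_rate(retrieved_ids, expected_ids)
mrr = mean_reciprocal_rank(retrieved_ids, expected_ids)
```

If the predictions come from a different index, the retrieved IDs can never match the expected IDs, even when the retrieved text is identical, which is why a text-level measure is used here.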

In [None]:
import tqdm

To evaluate the response, we will use the LLM-as-a-judge pattern. Specifically, we will use `CorrectnessEvaluator`, `FaithfulnessEvaluator` and `RelevancyEvaluator`.

For evaluating the goodness of the retrieved contexts we will use `SemanticSimilarityEvaluator`.
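As a rough intuition for what a semantic similarity score measures, the sketch below computes cosine similarity between two embedding vectors. It assumes cosine similarity is the comparison in use, and the vectors are toy values rather than real embeddings. The `cosine_similarity` helper is written here for illustration and is not imported from llama_index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the embeddings of the predicted contexts
# and the reference contexts (hypothetical values).
predicted_embedding = [1.0, 2.0, 2.0]
reference_embedding = [2.0, 4.0, 4.0]
unrelated_embedding = [2.0, -1.0, 0.0]

same_direction = cosine_similarity(predicted_embedding, reference_embedding)
orthogonal = cosine_similarity(predicted_embedding, unrelated_embedding)
```

Vectors pointing in the same direction score close to 1, while orthogonal vectors score 0, so a higher score suggests the retrieved contexts are closer in meaning to the reference contexts.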

In [None]:
# instantiate the gpt-4 judges (the semantic similarity judge uses the defaults)
from llama_index.llms import OpenAI
from llama_index import ServiceContext
from llama_index.evaluation import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    SemanticSimilarityEvaluator,
)

judges = {}

judges["correctness"] = CorrectnessEvaluator(
    service_context=ServiceContext.from_defaults(
        llm=OpenAI(temperature=0, model="gpt-4"),
    )
)

judges["relevancy"] = RelevancyEvaluator(
    service_context=ServiceContext.from_defaults(
        llm=OpenAI(temperature=0, model="gpt-4"),
    )
)

judges["faithfulness"] = FaithfulnessEvaluator(
    service_context=ServiceContext.from_defaults(
        llm=OpenAI(temperature=0, model="gpt-4"),
    )
)

judges["semantic_similarity"] = SemanticSimilarityEvaluator(
    service_context=ServiceContext.from_defaults()
)

Loop through the (`labelled_example`, `prediction`) pairs and perform the evaluations on each of them individually.

In [None]:
evals = {
    "correctness": [],
    "relevancy": [],
    "faithfulness": [],
    "context_similarity": [],
}

for example, prediction in tqdm.tqdm(
    zip(rag_dataset.examples, prediction_dataset.predictions)
):
    correctness_result = judges["correctness"].evaluate(
        query=example.query,
        response=prediction.response,
        reference=example.reference_answer,
    )

    relevancy_result = judges["relevancy"].evaluate(
        query=example.query,
        response=prediction.response,
        contexts=prediction.contexts,
    )

    faithfulness_result = judges["faithfulness"].evaluate(
        query=example.query,
        response=prediction.response,
        contexts=prediction.contexts,
    )

    semantic_similarity_result = judges["semantic_similarity"].evaluate(
        query=example.query,
        response="\n".join(prediction.contexts),
        reference="\n".join(example.reference_contexts),
    )

    evals["correctness"].append(correctness_result)
    evals["relevancy"].append(relevancy_result)
    evals["faithfulness"].append(faithfulness_result)
    evals["context_similarity"].append(semantic_similarity_result)

4it [00:36,  9.10s/it]
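The loop above pairs examples and predictions by position, and `zip` silently stops at the shorter of the two sequences. A minimal sketch of a guard against mismatched lengths follows. The `paired` helper is hypothetical and only illustrates the check.

```python
def paired(examples, predictions):
    """Pair items by position, failing loudly if the lengths differ."""
    if len(examples) != len(predictions):
        raise ValueError(
            f"Got {len(examples)} examples but {len(predictions)} predictions"
        )
    return list(zip(examples, predictions))


# Hypothetical stand-ins for the labelled examples and the predictions.
pairs = paired(["query 1", "query 2"], ["response 1", "response 2"])
```

With such a check in place, a partially failed prediction run raises an error instead of producing scores for only some of the examples.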


Now, we can use our notebook utility functions to view these evaluations.

In [None]:
import pandas as pd
from llama_index.evaluation.notebook_utils import (
    get_eval_results_df,
)

_, mean_correctness_df = get_eval_results_df(
    ["base_rag"] * len(evals["correctness"]),
    evals["correctness"],
    metric="correctness",
)
_, mean_relevancy_df = get_eval_results_df(
    ["base_rag"] * len(evals["relevancy"]),
    evals["relevancy"],
    metric="relevancy",
)
_, mean_faithfulness_df = get_eval_results_df(
    ["base_rag"] * len(evals["faithfulness"]),
    evals["faithfulness"],
    metric="faithfulness",
)
_, mean_context_similarity_df = get_eval_results_df(
    ["base_rag"] * len(evals["context_similarity"]),
    evals["context_similarity"],
    metric="context_similarity",
)

mean_scores_df = pd.concat(
    [
        mean_correctness_df.reset_index(),
        mean_relevancy_df.reset_index(),
        mean_faithfulness_df.reset_index(),
        mean_context_similarity_df.reset_index(),
    ],
    axis=0,
    ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])

In [None]:
mean_scores_df

metrics,base_rag
mean_correctness_score,5.0
mean_relevancy_score,1.0
mean_faithfulness_score,1.0
mean_context_similarity_score,0.974579


On this toy example, we see that the basic RAG pipeline performs quite well against the evaluation benchmark (`rag_dataset`)!
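Each row in the table of mean scores is an average of per-example scores for one metric. The sketch below shows that aggregation in plain Python, under the assumption that every evaluation produced a numeric score. The score values are hypothetical and are not the results from this notebook.

```python
from statistics import mean

# Hypothetical per-example scores for two metrics.
scores = {
    "correctness": [5.0, 4.5, 5.0, 4.0],
    "relevancy": [1.0, 1.0, 0.0, 1.0],
}

# Average each metric and name the result the way the table above does.
mean_scores = {
    f"mean_{metric}_score": mean(values) for metric, values in scores.items()
}
```

With only a handful of examples, a single low score shifts the mean noticeably, so results on a dataset this small should be read as a sanity check rather than a benchmark.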