# Answer Relevancy Evaluation

In this notebook, we demonstrate how to use the `AnswerRelevancyEvaluator` class to measure how relevant a generated answer is to a given user query. This evaluator returns a `score` between 0 and 1, as well as generated `feedback` explaining the score. A higher score means higher relevancy. In particular, we prompt the judge LLM to take a step-by-step approach to scoring relevancy by answering the following two questions about a generated answer to a query:

1. Does the provided response match the subject matter of the user's query?
2. Does the provided response attempt to address the focus or perspective on the subject matter taken on by the user's query?

Each question is worth 1 point, so a perfect evaluation yields 2/2, which the evaluator reports as a `score` of 1.0.
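
To make the scoring concrete, here is a minimal sketch of how a point total could be turned into a score between 0 and 1. It assumes the judge's feedback ends with `[RESULT]` followed by an integer, as in the feedback shown at the end of this notebook, and that the score is that integer divided by the maximum number of points. The helper `parse_relevancy_score` and its `max_points` parameter are hypothetical names for illustration, not the library's implementation. The example calls use `max_points=2`, matching the `[RESULT] 2` feedback that accompanies a score of 1.0 in the results.

```python
import re
from typing import Optional


def parse_relevancy_score(feedback: str, max_points: int) -> Optional[float]:
    """Hypothetical helper: read the integer after '[RESULT]' and normalize it to [0, 1]."""
    match = re.search(r"\[RESULT\]\s*(\d+)", feedback)
    if match is None:
        # The judge did not follow the expected output format.
        return None
    points = int(match.group(1))
    # Clamp to max_points so a malformed total cannot exceed 1.0.
    return min(points, max_points) / max_points


print(parse_relevancy_score("... detailed feedback ... [RESULT] 2", max_points=2))
print(parse_relevancy_score("... detailed feedback ... [RESULT] 1", max_points=2))
print(parse_relevancy_score("no result marker here", max_points=2))
```

Returning `None` when the marker is missing keeps a formatting failure by the judge distinguishable from a genuine score of 0.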

In [None]:
import nest_asyncio
from tqdm.asyncio import tqdm_asyncio

nest_asyncio.apply()

In [None]:
def displayify_df(df):
    """For pretty displaying DataFrame in a notebook."""
    display_df = df.style.set_properties(
        **{
            "inline-size": "300px",
            "overflow-wrap": "break-word",
        }
    )
    display(display_df)

### Download the dataset (`LabelledRagDataset`)

For this demonstration, we will use a llama-dataset provided through our [llama-hub](https://llamahub.ai).

In [None]:
from llama_index.llama_dataset import download_llama_dataset
from llama_index import VectorStoreIndex

# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset(
    "EvaluatingLlmSurveyPaperDataset", "./data"
)

In [None]:
rag_dataset.to_pandas()[:5]

| | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | What are the potential risks associated with l... | [Evaluating Large Language Models: A\nComprehe... | According to the context information, the pote... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 1 | How does the survey categorize the evaluation ... | [Evaluating Large Language Models: A\nComprehe... | The survey categorizes the evaluation of LLMs ... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 2 | What are the different types of reasoning disc... | [Contents\n1 Introduction 4\n2 Taxonomy and Ro... | The different types of reasoning discussed in ... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 3 | How is toxicity evaluated in language models a... | [Contents\n1 Introduction 4\n2 Taxonomy and Ro... | Toxicity is evaluated in language models accor... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 4 | In the context of specialized LLMs evaluation,... | [5.1.3 Alignment Robustness . . . . . . . . . ... | In the context of specialized LLMs evaluation,... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |


Next, we build a RAG over the same source documents used to create the `rag_dataset`.

In [None]:
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

With our RAG (i.e., `query_engine`) defined, we can make predictions (i.e., generate responses to the queries) with it over the `rag_dataset`.

In [None]:
prediction_dataset = await rag_dataset.amake_predictions_with(
    predictor=query_engine, batch_size=100, show_progress=True
)

Batch processing of predictions: 100%|████████████████████| 100/100 [00:06<00:00, 15.77it/s]
Batch processing of predictions: 100%|████████████████████| 100/100 [00:06<00:00, 16.28it/s]
Batch processing of predictions: 100%|██████████████████████| 76/76 [00:05<00:00, 13.11it/s]
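
The three progress bars above (100, 100, and 76 items) reflect the `batch_size=100` argument. As a rough illustration, and assuming the examples are simply processed in consecutive chunks of at most `batch_size` items, the split can be sketched as follows. The helper `split_into_batches` is a hypothetical name used only for this sketch, not a function from the library.

```python
from typing import List, Sequence


def split_into_batches(items: Sequence, batch_size: int) -> List[Sequence]:
    """Hypothetical helper: cut a sequence into consecutive chunks of at most batch_size items."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


# 276 stand-in examples, matching the size of the dataset used in this notebook.
examples = list(range(276))
batches = split_into_batches(examples, batch_size=100)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```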


### Evaluating Answer Relevancy

We first need to define our evaluator (i.e., `AnswerRelevancyEvaluator`):

In [None]:
# instantiate the gpt-4 judge
from llama_index.llms import OpenAI
from llama_index import ServiceContext
from llama_index.evaluation import AnswerRelevancyEvaluator

judge = AnswerRelevancyEvaluator(
    service_context=ServiceContext.from_defaults(
        llm=OpenAI(temperature=0, model="gpt-4"),
    )
)

Now, we can use our evaluator to make evaluations by looping through all of the `(example, prediction)` pairs.

In [None]:
eval_tasks = []
for example, prediction in zip(
    rag_dataset.examples, prediction_dataset.predictions
):
    eval_tasks.append(
        judge.aevaluate(
            query=example.query,
            response=prediction.response,
            contexts=prediction.contexts,
            sleep_time_in_seconds=1.0,
        )
    )

In [None]:
eval_results = await tqdm_asyncio.gather(*eval_tasks)

100%|█████████████████████████████████████████████████████| 276/276 [00:35<00:00,  7.80it/s]
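
The pattern used above is to first collect un-awaited coroutines in a list and then run them concurrently with a single gather call. Below is a minimal, self-contained sketch of the same pattern using only the standard library. `fake_evaluate` is a hypothetical stand-in for the judge call and simply sleeps briefly before returning a fixed score.

```python
import asyncio
from typing import Dict, List


async def fake_evaluate(query: str, delay: float = 0.01) -> Dict[str, object]:
    """Hypothetical stand-in for an evaluator call: waits briefly, then returns a result."""
    await asyncio.sleep(delay)
    return {"query": query, "score": 1.0}


async def run_all(queries: List[str]) -> List[Dict[str, object]]:
    # Build the coroutines first (nothing runs yet), then await them together.
    tasks = [fake_evaluate(q) for q in queries]
    return await asyncio.gather(*tasks)


results = asyncio.run(run_all(["q1", "q2", "q3"]))
result_queries = [r["query"] for r in results]
print(result_queries)
```

`asyncio.gather` returns results in the order the awaitables were passed in, which is why the evaluation results can later be lined up with the dataset examples by position.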


### Taking a look at the evaluation results

Here we use a utility function to convert the list of `EvaluationResult` objects into a more notebook-friendly format: a pandas DataFrame.

In [None]:
from llama_index.evaluation.notebook_utils import get_eval_results_df

deep_df, mean_df = get_eval_results_df(
    names=["baseline"] * len(eval_results), results_arr=eval_results
)

The above utility also provides the mean score across all of the evaluations in `mean_df`.

In [None]:
mean_df.T

| rag | scores |
|---|---|
| baseline | 0.922101 |


We can look at the raw distribution of the scores by calling `value_counts()` on the `scores` column of `deep_df`.

In [None]:
deep_df["scores"].value_counts()

scores
1.0    240
0.5     29
0.0      7
Name: count, dtype: int64
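
As a quick sanity check, the mean score reported earlier can be recomputed from these counts alone. The sketch below uses plain Python and only the three counts printed above.

```python
# Score distribution copied from the value_counts() output above.
score_counts = {1.0: 240, 0.5: 29, 0.0: 7}

total = sum(score_counts.values())
# Weighted mean: each score value weighted by how often it occurs.
mean_score = sum(score * count for score, count in score_counts.items()) / total
perfect_fraction = score_counts[1.0] / total

print(total)
print(round(mean_score, 6))
print(round(perfect_fraction, 3))
```

The recomputed mean rounds to 0.922101, which agrees with the value in `mean_df`.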

For the most part, the default RAG does fairly well at generating answers that are relevant to the query: 240 of the 276 responses (about 87%) receive a perfect score. We can take a closer look by viewing the records of `deep_df`.

In [None]:
displayify_df(deep_df.head(2))

| | rag | query | answer | scores | feedbacks |
|---|---|---|---|---|---|
| 0 | baseline | What are the potential risks associated with large language models (LLMs) according to the context information? | LLMs present potential risks such as private data leaks and the generation of inappropriate, harmful, or misleading content. The rapid progress of LLMs also raises concerns about the potential emergence of superintelligent systems without adequate safeguards. | 1.0 | 1. The response does match the subject matter of the user's query. It provides information about the potential risks associated with large language models (LLMs), which is exactly what the user asked for. 2. The response also attempts to address the focus or perspective on the subject matter taken on by the user's query. It provides specific examples of risks, such as private data leaks and the generation of inappropriate, harmful, or misleading content, and also mentions concerns about the potential emergence of superintelligent systems without adequate safeguards. [RESULT] 2 |
| 1 | baseline | How does the survey categorize the evaluation of LLMs and what are the three major groups mentioned? | The survey categorizes the evaluation of LLMs by providing a well-structured taxonomy framework. The three major groups mentioned in the survey are knowledge and reasoning, alignment evaluation, and safety evaluation. | 1.0 | 1. The response does match the subject matter of the user's query. The user asked about how a survey categorizes the evaluation of LLMs and the response provides information on this, stating that the survey uses a well-structured taxonomy framework for this purpose. 2. The response also addresses the focus or perspective on the subject matter taken on by the user's query. The user asked about the three major groups mentioned in the survey and the response provides this information, stating that the three major groups are knowledge and reasoning, alignment evaluation, and safety evaluation. [RESULT] 2 |


And, of course, you can apply any filters you like. For example, to look at the examples that yielded less-than-perfect results:

In [None]:
displayify_df(deep_df[deep_df["scores"] < 1].head(5))

| | rag | query | answer | scores | feedbacks |
|---|---|---|---|---|---|
| 35 | baseline | In the evaluation of online shopping models, what is the notable difference between humans and language models in terms of performance? | The notable difference between humans and language models in the evaluation of online shopping models is that humans outperform language models in all metrics. | 0.5 | 1. The response does match the subject matter of the user's query, which is about the difference in performance between humans and language models in the context of online shopping models. 2. The response attempts to address the focus of the user's query, but it does not provide specific details or examples to support the claim that humans outperform language models in all metrics. The user's query seems to be asking for a more detailed or nuanced explanation of the differences in performance. [RESULT] 1 |
| 46 | baseline | What are the four macroscopic perspectives used to categorize ethics and morality evaluations in the context of LLMs? | The four macroscopic perspectives used to categorize ethics and morality evaluations in the context of LLMs are expert-defined ethics and morality, evaluation with expert-defined ethics and morality, evaluation with crowdsourced ethics and morality, and evaluation with crowdsourced ethics and morality. | 0.5 | 1. The response does match the subject matter of the user's query. It provides information about the four macroscopic perspectives used to categorize ethics and morality evaluations in the context of LLMs. 2. The response does attempt to address the focus or perspective on the subject matter taken on by the user's query. However, it seems to repeat the same two categories twice, which may be a mistake. The user asked for four distinct perspectives, but the response only provides two unique ones. [RESULT] 1 |
| 61 | baseline | How does the CBBQ evaluation method extend the BBQ approach? What additional categories are included in CBBQ? | The CBBQ evaluation method extends the BBQ approach by including additional categories in its evaluation framework. These additional categories are not explicitly mentioned in the given context information. | 0.5 | 1. The response does match the subject matter of the user's query, which is about how the CBBQ evaluation method extends the BBQ approach. The response correctly states that the CBBQ method extends the BBQ approach by including additional categories in its evaluation framework. 2. However, the response does not fully address the focus or perspective on the subject matter taken on by the user's query. The user specifically asked about what additional categories are included in CBBQ, but the response does not provide this information. [RESULT] 1 |
| 84 | baseline | How does the BigToM benchmark align human Theory-of-Mind reasoning capabilities by controlling different variables and conditions in the causal graph? | The BigToM benchmark aligns human Theory-of-Mind reasoning capabilities by controlling different variables and conditions in the causal graph. | 0.5 | 1. The response does match the subject matter of the user's query. It mentions the BigToM benchmark, human Theory-of-Mind reasoning capabilities, and controlling different variables and conditions in the causal graph, which are all elements present in the query. 2. However, the response does not attempt to address the focus or perspective on the subject matter taken on by the user's query. The user is asking for an explanation of how the BigToM benchmark aligns human Theory-of-Mind reasoning capabilities by controlling different variables and conditions in the causal graph. The response merely restates the query without providing any additional information or explanation. [RESULT] 1 |
| 121 | baseline | Who are the authors of the paper titled "Frontier AI regulation: Managing emerging risks to public safety"? | Based on the given context information, it is not possible to determine the authors of the paper titled "Frontier AI regulation: Managing emerging risks to public safety." | 0.5 | 1. The response does match the subject matter of the user's query, which is about the authors of a specific paper. 2. However, the response does not attempt to address the focus or perspective on the subject matter taken on by the user's query. The user is asking for specific information, namely the authors of a specific paper, and the response does not provide this information. [RESULT] 1 |
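
Other filters follow the same boolean-indexing pattern. The sketch below uses a small toy DataFrame with made-up rows (not data from this evaluation) that has the same `scores` column name as `deep_df`, and shows how one might isolate imperfect results or only the zero-score results.

```python
import pandas as pd

# Toy stand-in for deep_df; the rows are invented purely for illustration.
toy_df = pd.DataFrame(
    {
        "query": ["q1", "q2", "q3", "q4"],
        "scores": [1.0, 0.5, 0.0, 0.5],
    }
)

# Rows with a less-than-perfect score.
imperfect = toy_df[toy_df["scores"] < 1]
# Rows where the judge awarded no points at all.
zero_only = toy_df[toy_df["scores"] == 0]

imperfect_queries = imperfect["query"].tolist()
zero_queries = zero_only["query"].tolist()
print(imperfect_queries)
print(zero_queries)
```

Starting with the zero-score rows is often a practical way to find the responses most in need of attention.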
