### RAG Evaluation and Meta-Evaluation with GroUSE
#### Overview
###### This tutorial introduces GroUSE, a framework for evaluating Retrieval-Augmented Generation (RAG) pipelines, focusing on the final stage: Grounded Question Answering (GQA). It demonstrates how to use Large Language Models (LLMs) to assess GQA answers across four distinct metrics and guides you through customizing your own Judge LLM using GroUSE unit tests.

#### Motivation
###### Manually evaluating RAG pipeline outputs can be challenging. The GroUSE framework leverages LLMs with finely tuned prompts to address all potential failure modes in Grounded Question Answering. GroUSE unit tests are used to identify the most effective prompts to optimize the performance of these evaluators.

#### Key Components
###### Answer Relevancy evaluation
###### Completeness evaluation
###### Faithfulness evaluation
###### Usefulness evaluation
###### Judge LLM Customization

#### Method Details
##### The task we want to assess: Grounded Question Answering
###### Grounded Question Answering (GQA) is usually the last step of a RAG pipeline: given a question and a set of documents retrieved from the corpus, an LLM must generate an answer. We expect the LLM to cite which document each piece of information comes from, as depicted below. When the documents contain no precise answer, the LLM should say so in its answer. In that case, if some related information is available in the documents, the LLM can add it to the answer to show that the corpus is not completely off-topic with respect to the question.

##### Evaluation Metrics
###### Each answer is evaluated according to six metrics. The first four metrics are each computed with a call to an evaluator LLM. Positive acceptance and negative rejection are deduced from the first four.

###### 1. Answer Relevancy
###### Answer relevancy assesses the relevance of the information provided in the answer regarding the question, using a Likert scale (1 to 5).

###### 2. Completeness
###### Completeness uses a Likert scale (1 to 5) to evaluate whether all relevant information from the documents is present in the answer.

###### 3. Faithfulness
###### Faithfulness is a binary score that checks if all facts in the answer are accurate and correctly attributed to the corresponding document.

###### 4. Usefulness
###### When the answer states that no references can answer the question but additional information is provided, usefulness is a binary score that determines if the provided additional information is still useful.

###### 5. Positive Acceptance
###### Percentage of samples in which the model answered when it was supposed to, that is, when the references contained an answer.

###### 6. Negative Rejection
###### Percentage of samples in which the model refrained from answering when nothing in the documents allows the question to be answered.

#### Benefits of the approach
###### The GroUSE framework comprehensively addresses the seven failure modes of Grounded Question Answering, providing a thorough evaluation of your RAG pipeline's final stage.

#### Implementation details
###### Answer Relevancy, Completeness, Faithfulness and Usefulness are evaluated using GPT-4 as the default model, as it was the best-performing model we tested. Positive acceptance and negative rejection can be deduced from the answer relevancy and completeness results, as these can have None values when no references contain answers to the question.
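
###### The deduction described above can be sketched in plain Python. This is a minimal illustration, not the GroUSE implementation: it assumes that answer relevancy is None when the answer declines to respond and that completeness is None when the references contain no answer. The function names and the sample scores are hypothetical.

```python
from typing import List, Optional, Tuple


def acceptance_and_rejection(
    answer_relevancy: Optional[int], completeness: Optional[int]
) -> Tuple[Optional[int], Optional[int]]:
    """Return (positive_acceptance, negative_rejection) flags for one sample.

    Assumption (not confirmed by the library): a None answer relevancy means
    the answer declined to respond, and a None completeness means the
    references contain no answer to the question.
    """
    answered = answer_relevancy is not None
    answerable = completeness is not None
    if answerable:
        # An answer was expected: only positive acceptance is defined.
        return (1 if answered else 0), None
    # No answer was expected: only negative rejection is defined.
    return None, (0 if answered else 1)


def rate(flags: List[Optional[int]]) -> Optional[float]:
    """Average the flags that are defined, ignoring None entries."""
    defined = [flag for flag in flags if flag is not None]
    return sum(defined) / len(defined) if defined else None


# Made-up (answer_relevancy, completeness) pairs, for illustration only.
scores = [(5, 4), (None, 3), (None, None), (2, None)]
flags = [acceptance_and_rejection(r, c) for r, c in scores]
positive_acceptance = rate([p for p, _ in flags])
negative_rejection = rate([n for _, n in flags])
print(positive_acceptance, negative_rejection)
```

###### Each flag is left undefined (None) for samples where it does not apply, so each percentage is computed only over the samples where it is meaningful.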

#### Conclusion
###### The GroUSE framework provides a comprehensive set of evaluation metrics to assess the performance of Grounded Question Answering models. By addressing seven key failure modes, it enables developers to thoroughly evaluate and improve their RAG pipelines. The use of LLM-based judges, such as GPT-4, automates this evaluation process. To tailor the framework to your specific needs, you can develop a custom LLM evaluator and validate its performance using GroUSE unit tests.

### Tutorial
### Import libraries

In [None]:
import os

import nest_asyncio

from grouse import (
    EvaluationSample,
    GroundedQAEvaluator,
    meta_evaluate_pipeline
)

#### Avoid nested asyncio loops inside notebooks (this line is not needed if you run the code in a Python script)

In [None]:
nest_asyncio.apply()

#### Setup your API key
#### For this tutorial, you need access to the OpenAI API and an OpenAI API key. You can get one here.

In [None]:
os.environ["OPENAI_API_KEY"] = input("Add OpenAI API key")
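
###### As an optional variation, the key can be read with a hidden prompt so that it is not echoed into the notebook output, and only when it is not already set. This is a sketch using the standard library `getpass` module; the helper name `ensure_api_key` and the placeholder value are hypothetical and not part of GroUSE.

```python
import os
from getpass import getpass


def ensure_api_key(env=os.environ, name="OPENAI_API_KEY", ask=getpass):
    """Set env[name] from a hidden prompt only when it is missing or empty."""
    if not env.get(name):
        env[name] = ask(f"Enter {name}: ")
    return env[name]


# Demonstration on a throwaway dict with a fake prompt, so nothing is typed
# here and the real environment is left untouched.
demo_env = {}
ensure_api_key(demo_env, ask=lambda prompt: "sk-example")
print(sorted(demo_env))
```

###### In the notebook, calling `ensure_api_key()` with no arguments would prompt once and store the key in `os.environ`.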

#### Initialize the evaluator
###### The default model used is GPT-4. Prompts are adapted to this model, so if you want to have the best results, keep using the default model.

In [None]:
evaluator = GroundedQAEvaluator()

### Evaluate a good answer
###### An LLM has given a good answer to a question related to the Eiffel Tower, given some contexts from the Eiffel Tower Wikipedia page. Let's evaluate the answer and check that everything is okay.

In [None]:
good_sample = EvaluationSample(
    input="Where is the Eiffel Tower located?",
    actual_output="The Eiffel Tower stands in the Champs de Mars in Paris.[1]",
    expected_output="In the Champs de Mars in Paris. [1]",
    references=[
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France"
    ]
)

result = evaluator.evaluate(eval_samples=[good_sample]).evaluations[0]

print("Answer Relevancy (1 to 5):", result.answer_relevancy.answer_relevancy)
print("Justification:", result.answer_relevancy.answer_relevancy_justification)
print("Completeness (1 to 5):", result.completeness.completeness)
print("Justification:", result.completeness.completeness_justification)
print("Faithfulness (0 or 1):", result.faithfulness.faithfulness)
print("Justification:", result.faithfulness.faithfulness_justification)
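
###### The print statements in this tutorial all follow the same pattern: `result.<metric>.<metric>` for the score and `result.<metric>.<metric>_justification` for the explanation. A small helper can exploit that pattern. This is only a sketch: `describe_metric` is a hypothetical name, and the `SimpleNamespace` object below is a hand-written stand-in for a real evaluation result, used so the example runs without calling the API.

```python
from types import SimpleNamespace


def describe_metric(result, metric: str, scale: str) -> str:
    """Format the score and justification of one metric as two lines."""
    block = getattr(result, metric)
    score = getattr(block, metric)
    justification = getattr(block, f"{metric}_justification")
    label = metric.replace("_", " ").title()
    return f"{label} ({scale}): {score}\nJustification: {justification}"


# Stand-in with made-up values that mimics the attribute layout used above.
fake_result = SimpleNamespace(
    faithfulness=SimpleNamespace(
        faithfulness=1,
        faithfulness_justification="All facts are supported by [1].",
    ),
)
print(describe_metric(fake_result, "faithfulness", "0 or 1"))
```

###### With a real result, the same call would look like `print(describe_metric(result, "answer_relevancy", "1 to 5"))`.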

#### How does it behave with an irrelevant answer?

In [None]:
irrelevant_sample = EvaluationSample(
    input="Where is the Eiffel Tower located?",
    actual_output="The Eiffel Tower is mainly made of puddle iron.[2]",
    expected_output="In the Champs de Mars in Paris.[1]",
    references=[
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France",
        "The puddle iron (wrought iron) of the Eiffel Tower weighs 7,300 tonnes,[70] and the addition of lifts, shops and antennae have brought the total weight to approximately 10,100 tonnes."
    ]
)

result = evaluator.evaluate(eval_samples=[irrelevant_sample]).evaluations[0]

print("Answer Relevancy (1 to 5):", result.answer_relevancy.answer_relevancy)
print("Justification:", result.answer_relevancy.answer_relevancy_justification)

#### Evaluation of an incomplete sample

In [None]:
incomplete_sample = EvaluationSample(
    input="Who criticized the Eiffel Tower project in 1889?",
    actual_output=(
        "The tower was criticized by those who did not believe it was feasible and some artists.[1]"
    ),
    expected_output=(
        "The tower was criticized by those who did not believe it was feasible and those who objected on artistic grounds.[1] "
        "An artist committee was created to protest against the construction of the tower, led by the prominent architect "
        "Charles Garnier and including some of the most important figures of the arts, "
        "such as William-Adolphe Bouguereau, Guy de Maupassant, Charles Gounod and Jules Massenet. [2]"
    ),
    references=[
        "The proposed tower had been a subject of controversy, drawing criticism from those who did not believe it was feasible and those who objected on artistic grounds.",
        (
            "It came to a head as work began at the Champ de Mars: a \"Committee of Three Hundred\" "
            "(one member for each metre of the tower's height) was formed, led by the prominent architect "
            "Charles Garnier and including some of the most important figures of the arts, "
            "such as William-Adolphe Bouguereau, Guy de Maupassant, Charles Gounod and Jules Massenet."
        ),
        "A petition called \"Artists against the Eiffel Tower\" was sent to the Minister of Works and Commissioner for the Exposition, Adolphe Alphand, and it was published by Le Temps on 14 February 1887"
    ]
)

result = evaluator.evaluate(eval_samples=[incomplete_sample]).evaluations[0]

print("Completeness (1 to 5):", result.completeness.completeness)
print("Justification:", result.completeness.completeness_justification)

#### Evaluation of an unfaithful sample

In [None]:
unfaithful_sample = EvaluationSample(
    input="Where is the Eiffel Tower located?",
    actual_output="The Eiffel Tower is located at Rue Rabelais in Paris.[1][2]",
    expected_output="In the Champs de Mars in Paris.[1]",
    references=[
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France",
        "Gustave Eiffel died in his apartment at Rue Rabelais in Paris."
    ]
)

result = evaluator.evaluate(eval_samples=[unfaithful_sample]).evaluations[0]

print("Faithfulness (0 or 1):", result.faithfulness.faithfulness)
print("Justification:", result.faithfulness.faithfulness_justification)

#### Evaluation of information utility in case there is no answer to the question in the references