# Evaluating RAG pipelines

This section is divided into two parts:
1. Part A: Deep Explanation
2. Part B: Practical Examples



## Part A: Deep Explanation 

Haystack provides a wide range of Evaluators, which perform two types of evaluation:

1. Model-Based Evaluation
2. Statistical Evaluation



### 1. Model-Based Evaluation

Model-based evaluation uses a language model to check the results of a Pipeline.


##### Using LLM for Evaluation
A strong ("golden") large language model, such as one of OpenAI's GPT models (for example, GPT-4), is used for this evaluation. It evaluates a RAG pipeline by receiving the Pipeline's results, and sometimes additional information, along with a prompt that outlines the evaluation criteria.
This approach does not need labels for the outputs, and it is easy to use.

Using an LLM as an evaluator is very flexible because it exposes a number of metrics. Each metric is ultimately a well-crafted prompt that tells the LLM how to evaluate and score results.

Common metrics include:

1. Faithfulness
2. Context Relevance 
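
To make the idea that a metric is ultimately a prompt more concrete, here is a minimal sketch of how such a prompt could be assembled and how a judge's reply could be turned into a score. The function names, the prompt wording, and the JSON reply format are illustrative assumptions, not Haystack's actual implementation, and the reply is hand-written rather than produced by a real LLM.

```python
import json


def build_judge_prompt(instructions: str, question: str, context: str) -> str:
    """Assemble a hypothetical evaluation prompt for a judge LLM."""
    return (
        f"{instructions}\n"
        f"Question: {question}\n"
        f"Context: {context}\n"
        'Reply with JSON of the form {"statements": [...], "statement_scores": [...]}.'
    )


def score_from_reply(reply: str) -> float:
    """Parse the judge's JSON reply and return the share of statements scored 1."""
    parsed = json.loads(reply)
    scores = parsed["statement_scores"]
    return sum(scores) / len(scores) if scores else 0.0


prompt = build_judge_prompt(
    instructions="Split the context into statements and mark each one as relevant (1) or not (0).",
    question="Who created the Python language?",
    context="Python was created by Guido van Rossum.",
)

# Hand-written reply standing in for a real LLM response (assumption).
fake_reply = '{"statements": ["Python was created by Guido van Rossum."], "statement_scores": [1]}'
judge_score = score_from_reply(fake_reply)
```

Swapping the `instructions` text is what would turn this sketch from one metric into another, which is why the approach is so flexible.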

##### Small Cross-Encoder Models for Evaluation
Alongside LLMs for evaluation, we can use small cross-encoder models. These models can calculate, for example, semantic answer similarity. In contrast to metrics based on LLMs, metrics based on smaller models don't require an API key from a model provider.

This method is faster and cheaper to run, but it is less flexible in terms of which aspects you can evaluate. You can only evaluate what the small model was trained to evaluate.


#### Model-Based Evaluation Pipelines in Haystack

There are two ways of performing model-based evaluation in Haystack, both of which leverage Pipeline and Evaluator components:

1. Create and run an evaluation Pipeline independently. This means you have to provide the required inputs to the evaluation Pipeline manually. This is the recommended approach because we can store the results of our RAG pipeline and try out different evaluation metrics afterward without needing to re-run the RAG pipeline every time.

2. Add an Evaluator component to the end of the RAG pipeline. This means we run both the RAG Pipeline and its evaluation in a single pipeline.run() call.
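
The first approach depends on keeping the RAG outputs around so they can be evaluated later. Below is a minimal sketch of one way to do that with a JSON file. The helper names `save_rag_outputs` and `load_rag_outputs`, the file name, and the record layout are assumptions made for illustration; Haystack does not prescribe a storage format.

```python
import json
import tempfile
from pathlib import Path


def save_rag_outputs(path, questions, contexts, predicted_answers):
    """Write the RAG pipeline's inputs and outputs to a JSON file."""
    record = {
        "questions": questions,
        "contexts": contexts,
        "predicted_answers": predicted_answers,
    }
    Path(path).write_text(json.dumps(record, indent=2), encoding="utf-8")


def load_rag_outputs(path):
    """Read the stored record back so it can be passed to any evaluator later."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp_dir:
    outputs_path = Path(tmp_dir) / "rag_outputs.json"  # hypothetical file name
    save_rag_outputs(
        outputs_path,
        questions=["Who created the Python language?"],
        contexts=[["Python was created by Guido van Rossum."]],
        predicted_answers=["Guido van Rossum"],
    )
    stored = load_rag_outputs(outputs_path)
```

The keys of `stored` were chosen to match the keyword arguments that the evaluators in this section accept, so a record like this could be unpacked into an evaluator's `run()` call.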

##### Model-based Evaluation of Retrieved Documents

##### ContextRelevanceEvaluator
This evaluator uses an LLM to evaluate whether contexts are relevant to a question. It does not require ground truth labels.

The component breaks up the context into multiple statements and checks whether each statement is relevant for answering a question. The final score for the context relevance is a number from 0.0 to 1.0 and represents the proportion of statements that are relevant to the provided question.
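
The score calculation described above can be sketched in a few lines. This illustrates the arithmetic only and is not Haystack's code; the function name `proportion_relevant` and the choice to return 0.0 for an empty list are assumptions.

```python
def proportion_relevant(statement_scores):
    """Return the share of statements the judge marked as relevant (1)."""
    if not statement_scores:
        # Assumption: a context with no statements scores 0.0.
        return 0.0
    return sum(statement_scores) / len(statement_scores)


# One relevant and one irrelevant statement.
score_mixed = proportion_relevant([1, 0])
# A single statement, judged relevant.
score_all = proportion_relevant([1])
```

Here `score_mixed` is 0.5 and `score_all` is 1.0, the same pattern that appears in the `statement_scores` and `score` fields of the evaluator output.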

You can pass examples to the evaluator; they are sent to the LLM as few-shot examples in the prompt:

```
[{
	"inputs": {
		"questions": "What is the capital of Italy?", "contexts": ["Rome is the capital of Italy."],
	},
	"outputs": {
		"statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
		"statement_scores": [1, 0],
	},
}]
```



##### Usage

A. On its own

In [20]:


from haystack.components.evaluators import ContextRelevanceEvaluator

questions = ['What makes both Python and Javascript excellent?', 'Who created the Python Language', "What are people's feelings towards Javascript?"]
contexts = [
    [
        'Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects.',
        "Javascript and Python both have received a lot backlashes but yet keeps waxing strong."
    ]
]
for question in questions:
    print("Question: ", question)
    # OpenAI is the only supported model
    evaluator = ContextRelevanceEvaluator(raise_on_failure=True)
    result = evaluator.run(questions=[question], contexts=contexts)

    print(result['score'])
    print(result['individual_scores'])
    
    # Notice the statement_score
    print(result['results'])
    print("\n\n")
    

Question:  What makes both Python and Javascript excellent?


100%|██████████| 1/1 [00:02<00:00,  2.07s/it]


0.5
[0.5]
[{'statements': ['Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects.', 'Javascript and Python both have received a lot backlashes but yet keeps waxing strong.'], 'statement_scores': [1, 0], 'score': 0.5}]



Question:  Who created the Python Language


100%|██████████| 1/1 [00:01<00:00,  1.82s/it]


0.5
[0.5]
[{'statements': ['Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language.', 'Javascript and Python both have received a lot backlashes but yet keeps waxing strong.'], 'statement_scores': [1, 0], 'score': 0.5}]



Question:  What are people's feelings towards Javascript?


100%|██████████| 1/1 [00:00<00:00,  1.24it/s]

1.0
[1.0]
[{'statements': ['Javascript and Python both have received a lot backlashes but yet keeps waxing strong.'], 'statement_scores': [1], 'score': 1.0}]








B. In a Pipeline

In this example, we use the ContextRelevanceEvaluator and the FaithfulnessEvaluator together in a pipeline to evaluate the responses and contexts (the content of documents) returned by a RAG pipeline for the provided questions.

This is an example of how we can run multiple metrics after we receive the context.

In [35]:
from haystack import Pipeline
from haystack.components.evaluators import ContextRelevanceEvaluator, FaithfulnessEvaluator

pipeline = Pipeline()
context_relevance_evaluator = ContextRelevanceEvaluator()
faithfulness_evaluator = FaithfulnessEvaluator() # evaluates generated/extracted answers, more on this in the next section
pipeline.add_component("context_relevance_evaluator", context_relevance_evaluator)
pipeline.add_component("faithfulness_evaluator", faithfulness_evaluator)

questions = ["Who created the Python Language?"]
contexts = [
    [
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects."
    ],
]
predicted_answers = ["Python is a high-level general-purpose programming language that was created by George Lucas"]

result = pipeline.run({
    "context_relevance_evaluator": {
        "questions": questions, "contexts": contexts
    },
    "faithfulness_evaluator": {
        "questions": questions,
        "contexts": contexts,
        "predicted_answers": predicted_answers
    }
})

print("\nIndividual Scores")
for evaluator in result:
    print(evaluator , " => ", result[evaluator]['individual_scores'])
    print("Statement:")
    for ev_result in result[evaluator]['results']:
        print(ev_result['statements'])
    
print("\nScore")
for evaluator in result:
    print(evaluator , " => ", result[evaluator]['score'])
    print("Statement:")
    for ev_result in result[evaluator]['results']:
        print(ev_result['statements'])
    


100%|██████████| 1/1 [00:00<00:00,  1.10it/s]
100%|██████████| 1/1 [00:00<00:00,  1.03it/s]


Individual Scores
context_relevance_evaluator  =>  [1.0]
Statement:
['Python, created by Guido van Rossum in the late 1980s.']
faithfulness_evaluator  =>  [0.5]
Statement:
['Python is a high-level general-purpose programming language.', 'Python was created by Guido van Rossum in the late 1980s.']

Score
context_relevance_evaluator  =>  1.0
Statement:
['Python, created by Guido van Rossum in the late 1980s.']
faithfulness_evaluator  =>  0.5
Statement:
['Python is a high-level general-purpose programming language.', 'Python was created by Guido van Rossum in the late 1980s.']






##### Model-based Evaluation of Generated or Extracted Answers
##### FaithfulnessEvaluator (aka groundedness)

This uses an LLM to evaluate whether a generated answer can be inferred from the provided contexts. It does not require ground truth labels.

The metric is sometimes called groundedness or hallucination.

The FaithfulnessEvaluator component can be used to evaluate answers generated by a Haystack pipeline, such as a RAG pipeline, without ground truth labels.

The component splits the generated answer into statements and checks each of them against the provided context, with an LLM. A higher faithfulness score is better, and it indicates that a larger number of statements in the generated answers can be inferred from the contexts. 

This score can be used to better understand how often and when the Generator in a RAG pipeline hallucinates.
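
As a rough sketch of how per-statement judgments could roll up into the reported numbers: each answer gets the mean of its statement scores, and the overall score is the mean across answers. The function name and the averaging across answers are assumptions made for illustration, and the statement scores are made up.

```python
from statistics import mean


def aggregate_faithfulness(per_answer_statement_scores):
    """Combine 0/1 statement scores into per-answer scores and one overall score."""
    individual_scores = [
        float(mean(scores)) if scores else 0.0
        for scores in per_answer_statement_scores
    ]
    overall = float(mean(individual_scores)) if individual_scores else 0.0
    return {"individual_scores": individual_scores, "score": overall}


# Made-up judgments: the first answer has one unsupported statement,
# the second answer is fully supported by its context.
report = aggregate_faithfulness([[1, 0], [1, 1, 1]])
```

In this made-up case the first answer scores 0.5, which flags it as the place to look for a hallucinated statement.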



##### Usage

A. On its own

The example below uses a FaithfulnessEvaluator component to evaluate a predicted answer that was generated for a given question and context. It returns a score of 0.5 because it detects two statements in the answer, of which only one is correct.

In [37]:
from haystack.components.evaluators import FaithfulnessEvaluator

questions = ["Who created the Python language?"]
contexts = [
    [
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects."
    ],
]
predicted_answers = ["Python is a high-level general-purpose programming language that was created by George Lucas."]
evaluator = FaithfulnessEvaluator()
result = evaluator.run(
        questions=questions, 
        contexts=contexts, 
        predicted_answers=predicted_answers
    )

print(result["individual_scores"])

print(result["score"])

print(result["results"])

  0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [00:01<00:00,  1.08s/it]

[0.5]
0.5
[{'statements': ['Python is a high-level general-purpose programming language.', 'Python was created by Guido van Rossum in the late 1980s.'], 'statement_scores': [1, 0], 'score': 0.5}]







B. In a Pipeline

The pipeline example in the ContextRelevanceEvaluator section already runs the FaithfulnessEvaluator inside a pipeline, so it is not repeated here to avoid using extra API credits.

##### SASEvaluator (Semantic Answer Similarity)

SASEvaluator evaluates answers predicted by pipelines using ground truth labels. It checks the semantic similarity of a predicted answer and the ground truth answer using a fine-tuned language model. The metric is called Semantic Answer Similarity.

The evaluator uses a bi-encoder or a cross-encoder model. By default, it uses the `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` model.

NOTE: Only one predicted answer is compared to one ground truth answer at a time. The component does not support multiple ground truth answers for the same question or multiple answers predicted for the same question.

The metric was introduced in this paper: https://arxiv.org/abs/2108.06130
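
For intuition about the bi-encoder case, the sketch below embeds nothing itself; it only shows the comparison step, assuming the two answers have already been turned into vectors and are compared with cosine similarity. The three-dimensional vectors are made up for illustration, whereas a real model produces embeddings with hundreds of dimensions. A cross-encoder works differently, scoring the answer pair directly.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up "embeddings" (assumption): identical answers map to identical vectors.
berlin_truth = [0.9, 0.1, 0.0]
berlin_pred = [0.9, 0.1, 0.0]
lyon_pred = [0.5, 0.5, 0.2]

same = cosine_similarity(berlin_truth, berlin_pred)
different = cosine_similarity(berlin_truth, lyon_pred)
```

Identical vectors give a similarity of (almost exactly) 1, while different vectors give a lower value, which is why a matching answer scores close to 1.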

##### Usage

A. On its own

The example below compares two predicted answers to their ground truth answers. We need to call `warm_up()` before `run()` to load the model.

In [38]:
from haystack.components.evaluators import SASEvaluator

# model is from huggingface
sas_evaluator = SASEvaluator(model="sentence-transformers/paraphrase")
sas_evaluator.warm_up()
result = sas_evaluator.run(
    ground_truth_answers=["Berlin", "Paris"],
    predicted_answers=["Berlin", "Lyon"]
)
print(result['individual_scores'])

print(result['score'])



[0.9999999403953552, 0.5174765586853027]
0.758738249540329




B. In a Pipeline

Below is an example where we use an `AnswerExactMatchEvaluator` and a `SASEvaluator` in a pipeline to evaluate two predicted answers against their ground truth answers.

Running a pipeline instead of the individual components simplifies calculating more than one metric.


In [40]:
from haystack import Pipeline
from haystack.components.evaluators import AnswerExactMatchEvaluator, SASEvaluator

pipeline = Pipeline()
em_evaluator = AnswerExactMatchEvaluator()
sas_evaluator = SASEvaluator()

pipeline.add_component("em_evaluator", em_evaluator)
pipeline.add_component("sas_evaluator", sas_evaluator)


ground_truth_answers = ["Berlin", "Paris"]
predicted_answers = ["Berlin", "Lyon"]

result = pipeline.run({
    "em_evaluator": {
        "ground_truth_answers": ground_truth_answers,
        "predicted_answers": predicted_answers
    },
    "sas_evaluator": {
        "ground_truth_answers": ground_truth_answers,
        "predicted_answers": predicted_answers,
    }
})


print("\nIndividual Scores") 

for evaluator in result:
    print(result[evaluator]['individual_scores'])
   
print("\nScore") 

for evaluator in result:
    print(result[evaluator]['score'])




Individual Scores
[1, 0]
[0.9999999403953552, 0.5174765586853027]

Score
0.5
0.758738249540329
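
The numbers in this output can be reproduced by hand. The sketch below assumes that exact match is a plain string comparison and that each overall score is the mean of the individual scores; `exact_match_scores` is a hypothetical helper, not the Haystack component, and the SAS values are copied from the output above.

```python
def exact_match_scores(ground_truth_answers, predicted_answers):
    """Return 1 for each predicted answer identical to its ground truth, else 0."""
    if len(ground_truth_answers) != len(predicted_answers):
        raise ValueError("Both lists must have the same length.")
    return [int(truth == pred) for truth, pred in zip(ground_truth_answers, predicted_answers)]


em_individual = exact_match_scores(["Berlin", "Paris"], ["Berlin", "Lyon"])
em_score = sum(em_individual) / len(em_individual)

# Individual SAS scores copied from the output above.
sas_individual = [0.9999999403953552, 0.5174765586853027]
sas_score = sum(sas_individual) / len(sas_individual)
```

Exact match gives "Lyon" no credit at all, while SAS still assigns it about 0.52 because both answers are French cities and therefore semantically related.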


##### RagasEvaluator

RAGAS is a framework that helps you evaluate RAG pipelines. 

Learn more about RAGAS here: https://docs.ragas.io/en/latest/index.html

Supported Metrics

- `ANSWER_CORRECTNESS`: grades the accuracy of the generated answer when compared to the ground truth.
- `FAITHFULNESS`: grades how factual the generated response was.
- `ANSWER_SIMILARITY`: grades how similar the generated answer is to the ground truth answer specified.
- `CONTEXT_PRECISION`: grades if the answer has any additional irrelevant information for the question asked.
- `CONTEXT_UTILIZATION`: grades to what extent the generated answer uses the provided context.
- `CONTEXT_RECALL`: grades how complete the generated response was for the question specified
- `ASPECT_CRITIQUE`: grades generated answers based on custom aspects on a binary scale
- `CONTEXT_RELEVANCY`: grades how relevant the provided context was for the question specified.
- `ANSWER_RELEVANCY`: grades how relevant the generated response is given the question.

Supported models include:
- All GPT models from OpenAI
- Google VertexAI Models
- Azure OpenAI Models
- Amazon Bedrock Models

##### Usage

1. You can use the `RagasEvaluator` while providing correct `metric_params` for the metric you are using.
2. Run the `RagasEvaluator`, either on its own or in a pipeline, by providing the expected input for the metric you are using.
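
Because each metric expects different inputs, it can help to check them before spending API credits. The sketch below covers only the three metrics used in the examples of this section, with the input names taken from those examples. The helper `missing_inputs` and the lookup table are hypothetical and not part of the integration.

```python
# Input names per metric, as used in this section's examples.
REQUIRED_INPUTS = {
    "CONTEXT_RELEVANCY": {"questions", "contexts"},
    "CONTEXT_PRECISION": {"questions", "contexts", "ground_truths"},
    "ASPECT_CRITIQUE": {"questions", "contexts", "responses"},
}


def missing_inputs(metric_name, inputs):
    """Return the sorted names of required inputs that were not provided."""
    required = REQUIRED_INPUTS[metric_name]
    return sorted(required - set(inputs))


incomplete = missing_inputs(
    "CONTEXT_PRECISION",
    {"questions": ["Who created the Python language?"], "contexts": [["..."]]},
)
complete = missing_inputs(
    "CONTEXT_RELEVANCY",
    {"questions": ["Who created the Python language?"], "contexts": [["..."]]},
)
```

Here `incomplete` lists `ground_truths` as missing, while `complete` is empty.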

##### Examples
##### Evaluate Context Relevance
Create a context-relevance evaluation pipeline



In [43]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.ragas import RagasEvaluator, RagasMetric

pipeline = Pipeline()
evaluator = RagasEvaluator(
    metric=RagasMetric.CONTEXT_RELEVANCY
)
pipeline.add_component("evaluator", evaluator)


# send this data into the evaluator
results = pipeline.run({
    "evaluator": {
        "questions": [
            "When was the Rhodes Statue built?",  # question 1
            "Where is the Pyramid of Giza" # question 2
        ],
        "contexts": [
            ["Context for question 1"], # context for question 1
            ["Context for question 2"] # context for question 2
        ]
    }
})

results

Evaluating: 100%|██████████| 2/2 [00:00<00:00,  3.57it/s]


{'evaluator': {'results': [[{'name': 'context_relevancy', 'score': 1.0}],
   [{'name': 'context_relevancy', 'score': 1.0}]]}}


##### Evaluate Context Precision and Aspect Critique
Create a context-precision and aspect-critique evaluation pipeline.

In [48]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.ragas import RagasEvaluator, RagasMetric

pipeline = Pipeline()
evaluator_context = RagasEvaluator(
    metric=RagasMetric.CONTEXT_PRECISION
)
evaluator_aspect = RagasEvaluator(
    metric=RagasMetric.ASPECT_CRITIQUE,
    metric_params={
        "name": "valid_for_children",
        "definition": "Is this answer problematic for children?", 
        "strictness": 3
    }
)
pipeline.add_component("evaluator_context", evaluator_context)
pipeline.add_component("evaluator_aspect", evaluator_aspect)

In [49]:
QUESTIONS = ["Which is the most popular global sport?", "Who created the Python language?"]
CONTEXTS = [["The popularity of sports can be measured in various ways, including TV viewership, social media presence, number of participants, and economic impact. Football is undoubtedly the world's most popular sport with major events like the FIFA World Cup and sports personalities like Ronaldo and Messi, drawing a followership of more than 4 billion people."], 
                 ["Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects."]]
RESPONSES = ["Football is the most popular sport with around 4 billion followers worldwide", "Python language was created by Guido van Rossum."]
GROUND_TRUTHS = ["Football is the most popular sport", "Python language was created by Guido van Rossum."]

results = pipeline.run({
    "evaluator_context": {
        "questions": QUESTIONS, "contexts": CONTEXTS, "ground_truths": GROUND_TRUTHS
    },
    "evaluator_aspect": {
        "questions": QUESTIONS, "contexts": CONTEXTS, "responses": RESPONSES
    }
})

import pprint
pprint.pprint(results)

Evaluating: 100%|██████████| 2/2 [00:01<00:00,  1.44it/s]
Evaluating: 100%|██████████| 2/2 [00:01<00:00,  1.56it/s]


{'evaluator_aspect': {'results': [[{'name': 'valid_for_children', 'score': 1}],
                                  [{'name': 'valid_for_children',
                                    'score': 0}]]},
 'evaluator_context': {'results': [[{'name': 'context_precision',
                                     'score': 0.9999999999}],
                                   [{'name': 'context_precision',
                                     'score': 0.9999999999}]]}}
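
The nested result shown above has one list per question, each holding a dictionary per metric. A small helper can flatten it into something easier to scan or tabulate. `scores_by_evaluator` is a hypothetical convenience function, and `sample` is copied from the output above.

```python
def scores_by_evaluator(results):
    """Map each evaluator name to a list of {metric name: score}, one entry per question."""
    flat = {}
    for evaluator_name, payload in results.items():
        flat[evaluator_name] = [
            {metric["name"]: metric["score"] for metric in per_question}
            for per_question in payload["results"]
        ]
    return flat


# Copied from the pipeline output above.
sample = {
    "evaluator_aspect": {"results": [[{"name": "valid_for_children", "score": 1}],
                                     [{"name": "valid_for_children", "score": 0}]]},
    "evaluator_context": {"results": [[{"name": "context_precision", "score": 0.9999999999}],
                                      [{"name": "context_precision", "score": 0.9999999999}]]},
}

flat = scores_by_evaluator(sample)
```

The position in each list corresponds to the position of the question in `QUESTIONS`, so `flat["evaluator_aspect"][1]` is the aspect score for the second question.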


##### DeepEvalEvaluator

DeepEval is a simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, and RAGAS, which use LLMs and various other NLP models that run locally on your machine.

DeepEval also gives you the option of using custom models, including models running on your own machine:
https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm

It can also be integrated into CI/CD pipelines.
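
One way a CI/CD job could use evaluation scores is as a quality gate that fails the build when any sample falls below a threshold. The sketch below is a generic, Pytest-style illustration and does not use DeepEval's own test helpers. The function names, the scores, and the 0.7 threshold are all made up.

```python
def failing_cases(scores, threshold):
    """Return the indexes of samples whose score is below the threshold."""
    return [index for index, score in enumerate(scores) if score < threshold]


def test_faithfulness_threshold():
    # Hypothetical scores; in a real job these would come from an evaluator run.
    scores = [0.9, 0.8, 1.0]
    assert failing_cases(scores, threshold=0.7) == []


# Example of a run that would fail the gate: sample 1 is below the threshold.
flagged = failing_cases([0.9, 0.4], threshold=0.7)
```

Returning the indexes rather than a single boolean makes it easier to see which questions caused the failure.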

Supported Metrics:
- `ANSWER_RELEVANCY`: grades how relevant the answer was to the question specified.
- `FAITHFULNESS`: grades how factual the generated response was.
- `CONTEXTUAL_PRECISION`: grades if the answer has any additional irrelevant information for the question asked.
- `CONTEXTUAL_RECALL`: grades how complete the generated response was for the question specified.
- `CONTEXTUAL_RELEVANCE`: grades how relevant the provided context was for the question specified.


##### Usage

1. You can use the `DeepEvalEvaluator` while providing correct `metric_params` for the metric you are using.
2. Run the `DeepEvalEvaluator`, either on its own or in a pipeline, by providing the expected input for the metric you are using.

##### Examples
##### Evaluate Faithfulness

In [2]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

pipeline = Pipeline()
evaluator = DeepEvalEvaluator(
    metric=DeepEvalMetric.FAITHFULNESS,
    metric_params={"model": "gpt-4"}
)
pipeline.add_component("evaluator", evaluator)

results = pipeline.run({
    "evaluator": {
        "questions": ["When was the Rhodes Statue built?", "Where is the Pyramid of Giza" ],
        "contexts": [ ["Context for question 1"], ["Context for question 2"] ],
        "responses": ["Response for question 1", "Response for question 2"]
    }
})

## Part B: Practical Examples


We will evaluate RAG pipelines with both the model-based and the statistical metrics available in Haystack's evaluation offering:

1. Build a pipeline that answers medical questions based on PubMed data
2. Build an evaluation pipeline that makes use of some metrics like Document MRR and Answer Faithfulness 
3. Run the RAG pipeline and evaluate the output with our evaluation pipeline
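
Since Document MRR (Mean Reciprocal Rank) is one of the metrics named above, here is a minimal sketch of the standard formula: for each question, take 1 divided by the rank of the first relevant document in the retrieved list, then average over all questions. The function names and document ids are hypothetical, and this is not the Haystack component itself.

```python
def reciprocal_rank(ground_truth_docs, retrieved_docs):
    """Return 1/rank of the first retrieved document that is relevant, or 0.0 if none is."""
    for rank, doc in enumerate(retrieved_docs, start=1):
        if doc in ground_truth_docs:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(all_ground_truth, all_retrieved):
    """Average the reciprocal rank over all questions."""
    ranks = [
        reciprocal_rank(truth, retrieved)
        for truth, retrieved in zip(all_ground_truth, all_retrieved)
    ]
    return sum(ranks) / len(ranks) if ranks else 0.0


# Question 1: the relevant document is ranked second (reciprocal rank 0.5).
# Question 2: the relevant document is ranked first (reciprocal rank 1.0).
mrr = mean_reciprocal_rank(
    [{"doc_a"}, {"doc_x"}],
    [["doc_b", "doc_a"], ["doc_x", "doc_y"]],
)
```

In this toy case `mrr` is 0.75; a value closer to 1 means relevant documents tend to appear near the top of the retrieved list.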