# Evaluating RAG pipelines

This section is divided into two parts:

1. Part A: Deep Explanation
2. Part B: Practical Examples


## Part A: Deep Explanation


Haystack provides a wide range of Evaluators, which perform two types of evaluation:

1. Model-Based Evaluation
2. Statistical Evaluation


### 1. Model-Based Evaluation

Model-based evaluation uses a language model to check the results of a Pipeline.

##### Using LLM for Evaluation

This approach uses a "golden" large language model, such as OpenAI's GPT-4, to evaluate a RAG pipeline. The LLM is given the Pipeline's results, sometimes along with additional information, and a prompt that outlines the evaluation criteria.
This approach does not need labels for the outputs, and it is easy to use.

Using an LLM as an evaluator is very flexible because it exposes a number of metrics. Each metric is ultimately a well-crafted prompt that describes to the LLM how to evaluate and score results.

Common metrics include:

1. Faithfulness
2. Context Relevance

##### Small Cross-Encoder Models for Evaluation

Alongside LLMs for evaluation, we can use small cross-encoder models. These models can calculate, for example, semantic answer similarity. In contrast to metrics based on LLMs, metrics based on smaller models don't require an API key from a model provider.

This method is faster and cheaper to run, but it is less flexible in terms of which aspects you can evaluate: you can only evaluate what the small model was trained to evaluate.


#### Model-Based Evaluation Pipelines in Haystack

There are two ways of performing model-based evaluation in Haystack, both of which leverage Pipelines and Evaluator components:

1. Create and run an evaluation Pipeline independently. This means you have to provide the required inputs to the evaluation Pipeline manually. This is the recommended approach because you can store the results of your RAG pipeline and try out different evaluation metrics afterward without re-running the RAG pipeline every time.

2. Add an Evaluator component to the end of the RAG pipeline. This means you run both the RAG Pipeline and its evaluation in a single `pipeline.run()` call.
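The first approach relies on keeping the RAG outputs around. A minimal sketch of that idea, using only the standard library: the file name `rag_outputs.json` and the sample data are hypothetical, and the dictionary keys simply mirror the evaluator inputs used later in this section.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical outputs collected from one run of a RAG pipeline.
rag_outputs = {
    "questions": ["Who created the Python language?"],
    "contexts": [["Python was created by Guido van Rossum in the late 1980s."]],
    "predicted_answers": ["Guido van Rossum created Python."],
}

# Persist the outputs once ...
results_path = Path(tempfile.gettempdir()) / "rag_outputs.json"
results_path.write_text(json.dumps(rag_outputs, indent=2), encoding="utf-8")

# ... and reload them later, as often as needed, to feed any evaluator.
stored = json.loads(results_path.read_text(encoding="utf-8"))
evaluator_inputs = {
    "questions": stored["questions"],
    "contexts": stored["contexts"],
    "predicted_answers": stored["predicted_answers"],
}
print(evaluator_inputs["questions"])
```

Because the stored outputs are plain data, a new metric can be tried without paying for another RAG run.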


##### Model-based Evaluation of Retrieved Documents

##### ContextRelevanceEvaluator

This evaluator uses an LLM to evaluate whether contexts are relevant to a question. It does not require ground truth labels.

The component breaks up the context into multiple statements and checks whether each statement is relevant for answering a question. The final score for the context relevance is a number from 0.0 to 1.0 and represents the proportion of statements that are relevant to the provided question.
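The final score described above is a simple proportion. The sketch below shows the arithmetic on a list of per-statement scores; `proportion_relevant` is a hypothetical helper, not part of Haystack, and returning 0.0 for an empty list is an assumption.

```python
def proportion_relevant(statement_scores: list[int]) -> float:
    """Return the share of statements marked relevant (1) among all statements."""
    if not statement_scores:
        return 0.0  # assumption: no statements means a score of 0.0
    return sum(statement_scores) / len(statement_scores)


print(proportion_relevant([1, 0]))  # 0.5
print(proportion_relevant([1]))     # 1.0
```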

You can pass examples to the evaluator; they are sent to the LLM as few-shot examples in the prompt:

```
[{
	"inputs": {
		"questions": "What is the capital of Italy?", "contexts": ["Rome is the capital of Italy."],
	},
	"outputs": {
		"statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
		"statement_scores": [1, 0],
	},
}]
```


##### Usage

A. On its own


In [1]:
from haystack.components.evaluators import ContextRelevanceEvaluator

questions = [
    "What makes both Python and Javascript excellent?",
    "Who created the Python Language",
    "What are people's feelings towards Javascript?",
]
contexts = [
    [
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects.",
        "Javascript and Python both have received a lot backlashes but yet keeps waxing strong.",
    ]
]
# The evaluator uses an OpenAI model by default and expects OPENAI_API_KEY to be set.
# Create it once and reuse it for every question.
evaluator = ContextRelevanceEvaluator(raise_on_failure=True)
for question in questions:
    print("Question: ", question)
    result = evaluator.run(questions=[question], contexts=contexts)

    print(result["score"])
    print(result["individual_scores"])

    # Notice the statement_score
    print(result["results"])
    print("\n\n")

Question:  What makes both Python and Javascript excellent?


100%|██████████| 1/1 [00:01<00:00,  1.77s/it]


0.5
[0.5]
[{'statements': ['Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects.', 'Javascript and Python both have received a lot backlashes but yet keeps waxing strong.'], 'statement_scores': [1, 0], 'score': 0.5}]



Question:  Who created the Python Language


100%|██████████| 1/1 [00:01<00:00,  1.02s/it]


1.0
[1.0]
[{'statements': ['Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language.'], 'statement_scores': [1], 'score': 1.0}]



Question:  What are people's feelings towards Javascript?


100%|██████████| 1/1 [00:00<00:00,  1.10it/s]

1.0
[1.0]
[{'statements': ['Javascript and Python both have received a lot backlashes but yet keeps waxing strong.'], 'statement_scores': [1], 'score': 1.0}]








B. In a Pipeline

In this example, we use the `ContextRelevanceEvaluator` and the `FaithfulnessEvaluator` together in a pipeline to evaluate the responses and contexts (the content of documents) received by a RAG pipeline for the provided questions.

This is an example of how we can run multiple metrics once we have the contexts.


In [2]:
from haystack import Pipeline
from haystack.components.evaluators import (
    ContextRelevanceEvaluator,
    FaithfulnessEvaluator,
)

pipeline = Pipeline()
context_relevance_evaluator = ContextRelevanceEvaluator()
# Evaluates generated/extracted answers; more on this in the next section.
faithfulness_evaluator = FaithfulnessEvaluator()
pipeline.add_component("context_relevance_evaluator", context_relevance_evaluator)
pipeline.add_component("faithfulness_evaluator", faithfulness_evaluator)

questions = ["Who created the Python Language?"]
contexts = [
    [
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects."
    ],
]
predicted_answers = [
    "Python is a high-level general-purpose programming language that was created by George Lucas"
]

result = pipeline.run(
    {
        "context_relevance_evaluator": {"questions": questions, "contexts": contexts},
        "faithfulness_evaluator": {
            "questions": questions,
            "contexts": contexts,
            "predicted_answers": predicted_answers,
        },
    }
)

print("\nIndividual Scores")
for evaluator in result:
    print(evaluator, " => ", result[evaluator]["individual_scores"])
    print("Statement:")
    for ev_result in result[evaluator]["results"]:
        print(ev_result["statements"])

print("\nScore")
for evaluator in result:
    print(evaluator, " => ", result[evaluator]["score"])
    print("Statement:")
    for ev_result in result[evaluator]["results"]:
        print(ev_result["statements"])

100%|██████████| 1/1 [00:00<00:00,  1.14it/s]
100%|██████████| 1/1 [00:00<00:00,  1.03it/s]


Individual Scores
context_relevance_evaluator  =>  [1.0]
Statement:
['Python, created by Guido van Rossum in the late 1980s.']
faithfulness_evaluator  =>  [0.5]
Statement:
['Python is a high-level general-purpose programming language.', 'Python was created by Guido van Rossum in the late 1980s.']

Score
context_relevance_evaluator  =>  1.0
Statement:
['Python, created by Guido van Rossum in the late 1980s.']
faithfulness_evaluator  =>  0.5
Statement:
['Python is a high-level general-purpose programming language.', 'Python was created by Guido van Rossum in the late 1980s.']





##### Model-based Evaluation of Generated or Extracted Answers

##### FaithfulnessEvaluator (aka groundedness)

This uses an LLM to evaluate whether a generated answer can be inferred from the provided contexts. It does not require ground truth labels.

The metric is sometimes called groundedness or hallucination.

The `FaithfulnessEvaluator` component can be used to evaluate answers generated by a Haystack pipeline, such as a RAG pipeline, without ground truth labels.

The component splits the generated answer into statements and checks each of them against the provided context, with an LLM. A higher faithfulness score is better, and it indicates that a larger number of statements in the generated answers can be inferred from the contexts.

This score can be used to better understand how often and when the Generator in a RAG pipeline hallucinates.


##### Usage

A. On its own

The example below uses a `FaithfulnessEvaluator` component to evaluate a predicted answer generated for a provided question and context. It returns a score of 0.5 because it detects two statements in the answer, of which only one is supported by the context.


In [3]:
from haystack.components.evaluators import FaithfulnessEvaluator

questions = ["Who created the Python language?"]
contexts = [
    [
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects."
    ],
]
predicted_answers = [
    "Python is a high-level general-purpose programming language that was created by George Lucas."
]
evaluator = FaithfulnessEvaluator()
result = evaluator.run(
    questions=questions, contexts=contexts, predicted_answers=predicted_answers
)

print(result["individual_scores"])

print(result["score"])

print(result["results"])

100%|██████████| 1/1 [00:01<00:00,  1.09s/it]

[0.5]
0.5
[{'statements': ['Python is a high-level general-purpose programming language.', 'Python was created by Guido van Rossum in the late 1980s.'], 'statement_scores': [1, 0], 'score': 0.5}]





B. In a Pipeline

See the pipeline example in the `ContextRelevanceEvaluator` section above, which already runs the `FaithfulnessEvaluator` alongside it. It is not repeated here to avoid spending additional API credits.


##### SASEvaluator (Semantic Answer Similarity)

SASEvaluator evaluates answers predicted by pipelines using ground truth labels. It checks the semantic similarity of a predicted answer and the ground truth answer using a fine-tuned language model. The metric is called Semantic Answer Similarity.

The evaluator uses a bi-encoder or a cross-encoder model. By default it uses the `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` model.

NOTE: Only one predicted answer is compared to one ground truth answer at a time. The component does not support multiple ground truth answers for the same question or multiple answers predicted for the same question.

Reference paper: https://arxiv.org/abs/2108.06130
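For the bi-encoder case, a common way to compare two answers is to embed each one and take the cosine similarity of the two vectors. The sketch below illustrates only that last step; the three-dimensional vectors are hypothetical stand-ins for real embeddings, and no model is loaded.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings; a real model produces far more dimensions.
ground_truth_embedding = [0.9, 0.1, 0.0]
close_prediction = [0.8, 0.2, 0.1]
distant_prediction = [0.0, 0.2, 0.9]

close_score = cosine_similarity(ground_truth_embedding, close_prediction)
distant_score = cosine_similarity(ground_truth_embedding, distant_prediction)
print(close_score > distant_score)  # True
```

A prediction that points in nearly the same direction as the ground truth scores close to 1, while an unrelated one scores much lower.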


##### Usage

A. On its own

The example below compares two predicted answers to their ground truth answers. We need to call `warm_up()` before `run()` to load the model.


In [6]:
from haystack.components.evaluators import SASEvaluator

# model is from huggingface
sas_evaluator = SASEvaluator()
sas_evaluator.warm_up()
result = sas_evaluator.run(
    ground_truth_answers=["Berlin", "Paris"], predicted_answers=["Berlin", "Lyon"]
)
print(result["individual_scores"])

print(result["score"])

[0.9999999403953552, 0.5174765586853027]
0.758738249540329


B. In a Pipeline

Below is an example where we use an `AnswerExactMatchEvaluator` and a `SASEvaluator` in a pipeline to evaluate two predicted answers against their ground truth answers.

Running a pipeline instead of the individual components simplifies calculating more than one metric.


In [7]:
from haystack import Pipeline
from haystack.components.evaluators import AnswerExactMatchEvaluator, SASEvaluator

pipeline = Pipeline()
em_evaluator = AnswerExactMatchEvaluator()
sas_evaluator = SASEvaluator()

pipeline.add_component("em_evaluator", em_evaluator)
pipeline.add_component("sas_evaluator", sas_evaluator)


ground_truth_answers = ["Berlin", "Paris"]
predicted_answers = ["Berlin", "Lyon"]

result = pipeline.run(
    {
        "em_evaluator": {
            "ground_truth_answers": ground_truth_answers,
            "predicted_answers": predicted_answers,
        },
        "sas_evaluator": {
            "ground_truth_answers": ground_truth_answers,
            "predicted_answers": predicted_answers,
        },
    }
)


print("\nIndividual Scores")

for evaluator in result:
    print(result[evaluator]["individual_scores"])

print("\nScore")

for evaluator in result:
    print(result[evaluator]["score"])


Individual Scores
[1, 0]
[0.9999999403953552, 0.5174765586853027]

Score
0.5
0.758738249540329


##### RagasEvaluator

RAGAS is a framework that helps you evaluate RAG pipelines.

Learn more about RAGAS here: https://docs.ragas.io/en/latest/index.html

Supported Metrics

- `ANSWER_CORRECTNESS`: grades the accuracy of the generated answer when compared to the ground truth.
- `FAITHFULNESS`: grades how factual the generated response was.
- `ANSWER_SIMILARITY`: grades how similar the generated answer is to the ground truth answer specified.
- `CONTEXT_PRECISION`: grades if the answer has any additional irrelevant information for the question asked.
- `CONTEXT_UTILIZATION`: grades to what extent the generated answer uses the provided context.
- `CONTEXT_RECALL`: grades how complete the generated response was for the question specified.
- `ASPECT_CRITIQUE`: grades generated answers based on custom aspects on a binary scale.
- `CONTEXT_RELEVANCY`: grades how relevant the provided context was for the question specified.
- `ANSWER_RELEVANCY`: grades how relevant the generated response is given the question.

Supported models include:

- All GPT models from OpenAI
- Google VertexAI Models
- Azure OpenAI Models
- Amazon Bedrock Models


##### Usage

1. You can use the `RagasEvaluator` while providing correct `metric_params` for the metric you are using.
2. Run the `RagasEvaluator`, either on its own or in a pipeline, by providing the expected input for the metric you are using.

##### Examples

##### Evaluate Context Relevance

Create a context-relevance evaluation pipeline


In [8]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.ragas import (
    RagasEvaluator,
    RagasMetric,
)

pipeline = Pipeline()
evaluator = RagasEvaluator(metric=RagasMetric.CONTEXT_RELEVANCY)
pipeline.add_component("evaluator", evaluator)


# send this data into the evaluator
results = pipeline.run(
    {
        "evaluator": {
            "questions": [
                "When was the Rhodes Statue built?",  # question 1
                "Where is the Pyramid of Giza",  # question 2
            ],
            "contexts": [
                ["Context for question 1"],  # context for question 1
                ["Context for question 2"],  # context for question 2
            ],
        }
    }
)

results

Evaluating: 100%|██████████| 2/2 [00:00<00:00,  3.32it/s]


{'evaluator': {'results': [[{'name': 'context_relevancy', 'score': 1.0}],
   [{'name': 'context_relevancy', 'score': 1.0}]]}}

##### Evaluate Context Precision and Aspect Critique

Create a context precision and aspect critique evaluation pipeline


In [9]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.ragas import (
    RagasEvaluator,
    RagasMetric,
)

pipeline = Pipeline()
evaluator_context = RagasEvaluator(metric=RagasMetric.CONTEXT_PRECISION)
evaluator_aspect = RagasEvaluator(
    metric=RagasMetric.ASPECT_CRITIQUE,
    metric_params={
        "name": "valid_for_children",
        "definition": "Is this answer problematic for children?",
        "strictness": 3,
    },
)
pipeline.add_component("evaluator_context", evaluator_context)
pipeline.add_component("evaluator_aspect", evaluator_aspect)

In [10]:
QUESTIONS = [
    "Which is the most popular global sport?",
    "Who created the Python language?",
]
CONTEXTS = [
    [
        "The popularity of sports can be measured in various ways, including TV viewership, social media presence, number of participants, and economic impact. Football is undoubtedly the world's most popular sport with major events like the FIFA World Cup and sports personalities like Ronaldo and Messi, drawing a followership of more than 4 billion people."
    ],
    [
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects."
    ],
]
RESPONSES = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]
GROUND_TRUTHS = [
    "Football is the most popular sport",
    "Python language was created by Guido van Rossum.",
]

results = pipeline.run(
    {
        "evaluator_context": {
            "questions": QUESTIONS,
            "contexts": CONTEXTS,
            "ground_truths": GROUND_TRUTHS,
        },
        "evaluator_aspect": {
            "questions": QUESTIONS,
            "contexts": CONTEXTS,
            "responses": RESPONSES,
        },
    }
)

import pprint

pprint.pprint(results)

Evaluating: 100%|██████████| 2/2 [00:01<00:00,  1.70it/s]
Evaluating: 100%|██████████| 2/2 [00:01<00:00,  1.76it/s]


{'evaluator_aspect': {'results': [[{'name': 'valid_for_children', 'score': 1}],
                                  [{'name': 'valid_for_children',
                                    'score': 0}]]},
 'evaluator_context': {'results': [[{'name': 'context_precision',
                                     'score': 0.9999999999}],
                                   [{'name': 'context_precision',
                                     'score': 0.9999999999}]]}}


##### DeepEvalEvaluator

DeepEval is a simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., which use LLMs and various other NLP models that run locally on your machine for evaluation.

DeepEval also gives you the option of using custom models, including models running on your own machine:
https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm

It can also be integrated into CI/CD pipelines.

Supported Metrics:

- `ANSWER_RELEVANCY`: grades how relevant the answer was to the question specified
- `FAITHFULNESS`: grades how factual the generated response was.
- `CONTEXTUAL_PRECISION`: grades if the answer has any additional irrelevant information for the question asked.
- `CONTEXTUAL_RECALL`: grades how complete the generated response was for the question specified.
- `CONTEXTUAL_RELEVANCE`: grades how relevant the provided context was for the question specified.


##### Usage

1. You can use the `DeepEvalEvaluator` while providing correct `metric_params` for the metric you are using.
2. Run the `DeepEvalEvaluator`, either on its own or in a pipeline, by providing the expected input for the metric you are using.

##### Examples

##### Evaluate Faithfulness


In [11]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import (
    DeepEvalEvaluator,
    DeepEvalMetric,
)

pipeline = Pipeline()
evaluator = DeepEvalEvaluator(
    metric=DeepEvalMetric.FAITHFULNESS, metric_params={"model": "gpt-4"}
)
pipeline.add_component("evaluator", evaluator)

results = pipeline.run(
    {
        "evaluator": {
            "questions": [
                "When was the Rhodes Statue built?",
                "Where is the Pyramid of Giza",
            ],
            "contexts": [["Context for question 1"], ["Context for question 2"]],
            "responses": ["Response for question 1", "Response for question 2"],
        }
    }
)

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...






Metrics Summary

  - ✅ Faithfulness (score: 1, threshold: 0.0, strict: False, evaluation model: gpt-4, reason: The score is 1.00 because there are no contradictions, indicating that the actual output is perfectly aligned with the information presented in the retrieval context., error: None)

For test case:

  - input: When was the Rhodes Statue built?
  - actual output: Response for question 1
  - expected output: None
  - context: None
  - retrieval context: ['Context for question 1']


Metrics Summary

  - ✅ Faithfulness (score: 1, threshold: 0.0, strict: False, evaluation model: gpt-4, reason: The score is 1.00 because there are no contradictions present, indicating perfect alignment between the actual output and the retrieval context., error: None)

For test case:

  - input: Where is the Pyramid of Giza
  - actual output: Response for question 2
  - expected output: None
  - context: None
  - retrieval context: ['Context for question 2']


Overall Metric Pass Rates

Faithfulness



### 2. Statistical Evaluation

#### Statistical Evaluation Pipelines in Haystack

Statistical Evaluation compares ground truth labels with pipeline predictions, typically using metrics such as precision or recall.

Here, the ground truth labels of expected answers are compared to the pipeline's predictions. This approach is mostly used with extractive models. For assessing answers generated by an LLM, model-based evaluation is recommended instead, as it can incorporate measures of semantic similarity or coherence and is better suited to evaluating predictions that might differ in wording from the ground truth labels.

##### Statistical Evaluation of Retrieved Documents

##### DocumentRecallEvaluator

Recall measures how often the correct document was among the retrieved documents over a set of queries.

The evaluator checks how many of the ground truth documents were retrieved.

Modes

- `RecallMode.SINGLE_HIT`: any one of the ground truth documents needs to be retrieved to count as a correct retrieval with a recall score of 1. A single retrieved document can achieve the full score.

- `RecallMode.MULTI_HIT`: all of the ground truth documents need to be retrieved to count as a correct retrieval with a recall score of 1.
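The difference between the two modes can be sketched in a few lines of plain Python. The helpers below are hypothetical and compare documents by their text content. Giving partial credit in multi-hit mode (the share of ground truth documents retrieved) is an assumption of this sketch, and both functions assume a non-empty ground truth list.

```python
def single_hit_recall(ground_truth: list[str], retrieved: list[str]) -> float:
    """1.0 if at least one ground truth document was retrieved, else 0.0."""
    return 1.0 if set(ground_truth) & set(retrieved) else 0.0


def multi_hit_recall(ground_truth: list[str], retrieved: list[str]) -> float:
    """Share of ground truth documents that appear among the retrieved ones."""
    truths = set(ground_truth)
    return len(truths & set(retrieved)) / len(truths)


ground_truth = ["9th century", "9th"]
retrieved = ["9th century", "10th century"]

print(single_hit_recall(ground_truth, retrieved))  # 1.0
print(multi_hit_recall(ground_truth, retrieved))   # 0.5
```

With only one of the two ground truth documents retrieved, single-hit mode already gives the full score, while multi-hit mode does not.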


##### Usage

On its own


In [15]:
from haystack import Document
from haystack.components.evaluators import DocumentRecallEvaluator
from haystack.components.evaluators.document_recall import RecallMode


evaluator = DocumentRecallEvaluator(mode=RecallMode.SINGLE_HIT)
result = evaluator.run(
    ground_truth_documents=[
        [Document(content="France")], [Document(content="9th century")]
    ],
    retrieved_documents=[
        [Document(content="France")], [Document(content="9th century"), Document(content="10th century")]
    ]
)

print(result['individual_scores'])
print(result['score'])

[1.0, 1.0]
1.0


In a Pipeline

In this example, we use a `DocumentRecallEvaluator` and a `DocumentMRREvaluator` in a pipeline to evaluate the documents retrieved for two queries against the ground truth documents. Running a pipeline instead of the individual components simplifies calculating more than one metric.

In [46]:
from haystack import Document, Pipeline
from haystack.components.evaluators import DocumentMRREvaluator, DocumentRecallEvaluator

pipeline = Pipeline()
mrr_evaluator = DocumentMRREvaluator()
recall_evaluator = DocumentRecallEvaluator()

pipeline.add_component("mrr_evaluator", mrr_evaluator)
pipeline.add_component("recall_evaluator", recall_evaluator)

ground_truth_documents = [
    [Document(content="France")], 
    [Document(content="9th century"), Document(content="9th")],
]

retrieved_documents = [
    [Document(content="France")],
    [Document(content="9th century"), Document(content="10th century"), Document(content="9th")]
]

result = pipeline.run({
    "mrr_evaluator": {
        "ground_truth_documents": ground_truth_documents,
        "retrieved_documents": retrieved_documents,
    },
    "recall_evaluator": {
        "ground_truth_documents": ground_truth_documents,
        "retrieved_documents": retrieved_documents
    }
})

for evaluator in result:
    print(result[evaluator]['individual_scores'])
    
for evaluator in result:
    print(result[evaluator]['score'])

[1.0, 1.0]
[1.0, 1.0]
1.0
1.0



##### DocumentMRREvaluator (Mean Reciprocal Rank)

In contrast to the recall metric, mean reciprocal rank takes the position of the top correctly retrieved documents (the "rank") into account. 

It checks at what rank ground truth documents appear in the list of retrieved documents. The metric is called mean reciprocal rank (MRR).

A higher mean reciprocal rank is better and indicates that relevant documents appear at an earlier position in the list of retrieved documents.
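As a rough illustration of the metric, the sketch below computes the reciprocal rank per query and averages it. The helper names are hypothetical, documents are compared by their text content, and a query with no relevant retrieved document is assumed to score 0.

```python
def reciprocal_rank(ground_truth: list[str], retrieved: list[str]) -> float:
    """Return 1 / rank of the first retrieved document found in the ground truth."""
    truths = set(ground_truth)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in truths:
            return 1.0 / rank
    return 0.0  # assumption: no relevant document retrieved scores 0


def mean_reciprocal_rank(
    all_ground_truth: list[list[str]], all_retrieved: list[list[str]]
) -> float:
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(g, r) for g, r in zip(all_ground_truth, all_retrieved)]
    return sum(scores) / len(scores)


ground_truths = [["France"], ["9th century", "9th"]]
retrieved = [["France"], ["10th century", "9th century", "9th"]]

print(mean_reciprocal_rank(ground_truths, retrieved))  # 0.75
```

The second query finds its first relevant document at rank 2, so it contributes 0.5 and pulls the mean down to 0.75.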

##### Usage

On its own

The example below evaluates documents retrieved for two queries. For the first query, there is one ground truth document and one retrieved document.

For the second query, there are two ground truth documents and three retrieved documents.

In [37]:
from haystack import Document
from haystack.components.evaluators import DocumentMRREvaluator

evaluator = DocumentMRREvaluator()
result = evaluator.run(
    ground_truth_documents=[ 
                    [Document(content="France")] ,
                    [Document(content="9th century"), Document(content="9th")]
    ],
    retrieved_documents=[ 
                    [Document(content="France")] ,
                    [Document(content="9th century"), Document(content="10th"), Document(content="9th")]
    ]
)

print(result['individual_scores'])
print(result['score'])       

[1.0, 1.0]
1.0


In a Pipeline

This is the same pipeline as in the `DocumentRecallEvaluator` section, with the evaluator names printed next to the scores.

In [44]:
from haystack import Document, Pipeline
from haystack.components.evaluators import DocumentMRREvaluator, DocumentRecallEvaluator

pipeline = Pipeline()
mrr_evaluator = DocumentMRREvaluator()
recall_evaluator = DocumentRecallEvaluator()

pipeline.add_component("mrr_evaluator", mrr_evaluator)
pipeline.add_component("recall_evaluator", recall_evaluator)

ground_truth_documents = [
    [Document(content="France")], 
    [Document(content="9th century"), Document(content="9th")],
]

retrieved_documents = [
    [Document(content="France")],
    [Document(content="9th century"), Document(content="10th century"), Document(content="9th")]
]

result = pipeline.run({
    "mrr_evaluator": {
        "ground_truth_documents": ground_truth_documents,
        "retrieved_documents": retrieved_documents,
    },
    "recall_evaluator": {
        "ground_truth_documents": ground_truth_documents,
        "retrieved_documents": retrieved_documents
    }
})

for evaluator in result:
    print(evaluator, result[evaluator]['individual_scores'])
    
print('\n')
for evaluator in result:
    print(evaluator, result[evaluator]['score'])

mrr_evaluator [1.0, 1.0]
recall_evaluator [1.0, 1.0]


mrr_evaluator 1.0
recall_evaluator 1.0



##### DocumentMAPEvaluator (Mean Average Precision)

This component evaluates retrieved documents against ground truth documents. A higher mean average precision is better, indicating that the list of retrieved documents contains many relevant documents and only a few non-relevant documents or none at all.
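One common formulation of average precision for a single query is sketched below: at every rank where a relevant document appears, take the precision up to that rank, then average those values. The helper is hypothetical, compares documents by text content, and normalizes by the number of relevant documents retrieved, which is an assumption of this sketch.

```python
def average_precision(ground_truth: list[str], retrieved: list[str]) -> float:
    """Average the precision values measured at each rank holding a relevant document."""
    truths = set(ground_truth)
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in truths:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: normalize by the number of relevant documents retrieved.
    return precision_sum / hits if hits else 0.0


ground_truth = ["9th century", "9th"]
retrieved = ["9th century", "10th century", "9th"]

# Relevant hits at ranks 1 and 3: (1/1 + 2/3) / 2, roughly 0.833
print(average_precision(ground_truth, retrieved))
```

Mean average precision is then the mean of this value over all queries.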

##### Usage

On its own

The example below evaluates documents retrieved for two queries. The first query has one ground truth document and one retrieved document.
The second query has two ground truth documents and three retrieved documents.

In [39]:
from haystack import Document
from haystack.components.evaluators import DocumentMAPEvaluator

evaluator = DocumentMAPEvaluator()
result = evaluator.run(
    ground_truth_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="9th")],
    ],
    retrieved_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="10th century"), Document(content="9th")]
    ]
)

print(result['individual_scores'])
print(result['score'])

[1.0, 0.8333333333333333]
0.9166666666666666



In a Pipeline

In [45]:
from haystack import Document, Pipeline
from haystack.components.evaluators import DocumentMRREvaluator, DocumentMAPEvaluator

pipeline = Pipeline()
mrr_evaluator = DocumentMRREvaluator()
map_evaluator = DocumentMAPEvaluator()
pipeline.add_component("mrr_evaluator", mrr_evaluator)
pipeline.add_component("map_evaluator", map_evaluator)

ground_truth_documents = [
    [Document(content="France")],
    [Document(content="9th century"), Document(content="9th")]
]
retrieved_documents = [
    [Document(content="France")],
    [Document(content="9th century"), Document(content="10th century"), Document(content="9th")]
]

result = pipeline.run({
    "mrr_evaluator": {
        "ground_truth_documents": ground_truth_documents,
        "retrieved_documents": retrieved_documents,
    },
    "map_evaluator": {
        "ground_truth_documents": ground_truth_documents,
        "retrieved_documents": retrieved_documents,
    }
})

for evaluator in result:
    print(evaluator, result[evaluator]['individual_scores'])
print('\n')
for evaluator in result:
    print(evaluator, result[evaluator]['score'])

mrr_evaluator [1.0, 1.0]
map_evaluator [1.0, 0.8333333333333333]


mrr_evaluator 1.0
map_evaluator 0.9166666666666666


#### Statistical Evaluation of Extracted or Generated Answers

##### AnswerExactMatchEvaluator

This component checks character by character whether a predicted answer exactly matches the ground truth answer. This metric is called exact match.

This is useful for evaluating an extractive question answering pipeline against ground truth labels.

`AnswerExactMatchEvaluator` is not suited to evaluating answers generated by LLMs. Use `FaithfulnessEvaluator` or `SASEvaluator` instead.

One predicted answer is compared to one ground truth answer at a time.

If the two answers are not identical, the individual score is 0.
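The exact match logic is simple enough to sketch in plain Python. The helper `exact_match_scores` is hypothetical and only mirrors the description above: each pair scores 1 when the strings are identical and 0 otherwise, and the overall score is the mean.

```python
def exact_match_scores(
    ground_truth_answers: list[str], predicted_answers: list[str]
) -> list[int]:
    """Score each pair 1 if the two strings are identical, else 0."""
    return [int(truth == pred) for truth, pred in zip(ground_truth_answers, predicted_answers)]


scores = exact_match_scores(["Berlin", "Paris"], ["Berlin", "Lyon"])
print(scores)                     # [1, 0]
print(sum(scores) / len(scores))  # 0.5
```

Note that even a difference in casing or trailing whitespace makes a pair score 0 under this comparison.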


##### Usage

On its own


In [47]:
from haystack.components.evaluators import AnswerExactMatchEvaluator

evaluator = AnswerExactMatchEvaluator()
result = evaluator.run(
    ground_truth_answers=["Berlin", "Paris"],
    predicted_answers=["Berlin", "Lyon"]
)

print(result['individual_scores'])
print(result['score'])

[1, 0]
0.5



In a Pipeline

In [50]:
from haystack import Pipeline
from haystack.components.evaluators import AnswerExactMatchEvaluator

# SASEvaluator uses a bi-encoder or a cross-encoder model
from haystack.components.evaluators import SASEvaluator

pipeline = Pipeline()
em_evaluator = AnswerExactMatchEvaluator()
sas_evaluator = SASEvaluator()
pipeline.add_component("em_evaluator", em_evaluator)
pipeline.add_component("sas_evaluator", sas_evaluator)

ground_truth_answers = ["Berlin", "Paris"]
predicted_answers = ["Berlin", "Lyon"]

result = pipeline.run({
    "em_evaluator": {
        "ground_truth_answers": ground_truth_answers,
        "predicted_answers": predicted_answers
    },
    "sas_evaluator": {
        "ground_truth_answers": ground_truth_answers,
        "predicted_answers": predicted_answers,
    },
})

for evaluator in result:
    print(evaluator, result[evaluator]['individual_scores'])
print('\n')
for evaluator in result:
    print(evaluator, result[evaluator]['score'])

em_evaluator [1, 0]
sas_evaluator [0.9999999403953552, 0.5174765586853027]


em_evaluator 0.5
sas_evaluator 0.758738249540329


## Part B: Practical Examples

We will evaluate a RAG pipeline with both the model-based and the statistical metrics available in Haystack's evaluation offering:

1. Build a pipeline that answers medical questions based on PubMed data
2. Build an evaluation pipeline that makes use of some metrics like Document MRR and Answer Faithfulness
3. Run the RAG pipeline and evaluate the output with our evaluation pipeline


In [51]:
from datasets import load_dataset
from haystack import Document

dataset = load_dataset("vblagoje/PubMedQA_instruction", split="train")
dataset = dataset.select(range(1000))

Downloading readme: 100%|██████████| 498/498 [00:00<00:00, 1.14MB/s]
Downloading data: 100%|██████████| 274M/274M [00:09<00:00, 29.7MB/s] 
Downloading data: 100%|██████████| 986k/986k [00:00<00:00, 2.74MB/s]
Generating train split: 100%|██████████| 272458/272458 [00:02<00:00, 124627.70 examples/s]
Generating test split: 100%|██████████| 1000/1000 [00:00<00:00, 105284.00 examples/s]


In [52]:
all_documents = [Document(content=doc["context"]) for doc in dataset]
all_questions = [doc['instruction'] for doc in dataset]
all_ground_truth_answers = [doc['response'] for doc in dataset]

#### Build an Indexing Pipeline

In [55]:
from typing import List
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

indexing = Pipeline()
indexing.add_component(instance=document_embedder, name="document_embedder")
indexing.add_component(instance=document_writer, name="document_writer")

indexing.connect("document_embedder.documents", "document_writer.documents")
indexing.run({
    "document_embedder": {
        "documents": all_documents
    }
})

Batches: 100%|██████████| 32/32 [00:01<00:00, 16.63it/s]


{'document_writer': {'documents_written': 1000}}

#### RAG Pipeline

In [57]:
import os
from dotenv import load_dotenv

load_dotenv()

from haystack.components.builders import AnswerBuilder, PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

if os.getenv("OPENAI_API_KEY") is None:
    raise ValueError("OPENAI_API_KEY is required")

template = """
You have to answer the following question based on the given context information only.

Context:

{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:

"""

rag_pipeline = Pipeline()
rag_pipeline.add_component("query_embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
rag_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=template) )
rag_pipeline.add_component("generator", OpenAIGenerator(model="gpt-3.5-turbo"))
rag_pipeline.add_component("answer_builder", AnswerBuilder())

rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "generator")
rag_pipeline.connect("generator.replies", "answer_builder.replies")
rag_pipeline.connect("generator.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x752c1406c7d0>
🚅 Components
  - query_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - generator: OpenAIGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - query_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> generator.prompt (str)
  - generator.replies -> answer_builder.replies (List[str])
  - generator.meta -> answer_builder.meta (List[Dict[str, Any]])

#### Asking a Question

In [59]:
question = "Do high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome?"

response = rag_pipeline.run({
    "query_embedder": {
        "text": question
    },
    "prompt_builder": {
        "question": question
    },
    "answer_builder": {
        "query": question
    }
})


print(response["answer_builder"]["answers"][0].data)

Batches: 100%|██████████| 1/1 [00:00<00:00, 111.75it/s]


Yes, high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome. Patients with high PCT levels on postoperative day 2 had higher International Normalized Ratio values on postoperative day 5, suffered more from primary graft non-function, had a longer stay in the pediatric intensive care unit and on mechanical ventilation.


#### Evaluate the Pipeline

We will use the following metrics to evaluate the pipeline:

1. Document Mean Reciprocal Rank
2. Semantic Answer Similarity
3. Faithfulness
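
Document Mean Reciprocal Rank rewards retrievers that rank a relevant document near the top: for each query, take 1 divided by the rank of the first relevant document (0 if none is retrieved), then average over all queries. The sketch below is a simplified illustration on plain strings; `mean_reciprocal_rank` is a hypothetical helper, whereas Haystack's `DocumentMRREvaluator` operates on `Document` objects.

```python
from typing import List


def mean_reciprocal_rank(ground_truth: List[List[str]], retrieved: List[List[str]]) -> float:
    """Hypothetical helper: average of 1/rank of the first relevant hit per query."""
    reciprocal_ranks = []
    for relevant, ranked in zip(ground_truth, retrieved):
        score = 0.0  # stays 0 when no relevant document was retrieved
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                score = 1.0 / rank
                break
        reciprocal_ranks.append(score)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


mrr = mean_reciprocal_rank(
    ground_truth=[["doc_a"], ["doc_b"], ["doc_c"]],
    retrieved=[["doc_a", "doc_x"], ["doc_x", "doc_b"], ["doc_x", "doc_y"]],
)
print(mrr)  # (1 + 1/2 + 0) / 3 = 0.5
```

A score of 1.0 would mean the ground truth document was ranked first for every query.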

In [60]:
import random

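# Sample 25 aligned (question, answer, document) triples, then unzip them into three tuples
sampled = random.sample(list(zip(all_questions, all_ground_truth_answers, all_documents)), 25)
questions, ground_truth_answers, ground_truth_docs = zip(*sampled)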
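
Because `random.sample` draws a different subset on every run, the evaluation scores will vary between runs. One option, sketched here as an assumption rather than something the original notebook does, is to sample with a seeded `random.Random` instance. The seed value 42 and the toy lists are arbitrary placeholders.

```python
import random

# Toy stand-ins for all_questions, all_ground_truth_answers and all_documents.
toy_questions = [f"question {i}" for i in range(100)]
toy_answers = [f"answer {i}" for i in range(100)]
toy_documents = [f"document {i}" for i in range(100)]


def sample_triples(seed: int, k: int = 25):
    """Draw k aligned (question, answer, document) triples with a fixed seed."""
    rng = random.Random(seed)  # independent generator; does not touch global state
    triples = rng.sample(list(zip(toy_questions, toy_answers, toy_documents)), k)
    return tuple(zip(*triples))  # unzip into three tuples of length k


questions_a, answers_a, docs_a = sample_triples(seed=42)
questions_b, answers_b, docs_b = sample_triples(seed=42)
print(questions_a == questions_b)  # True: same seed, same sample
```

Zipping before sampling keeps each question paired with its own answer and document, which the evaluators rely on.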

In [62]:
rag_answers = []
retrieved_docs = []

for question in list(questions):
    response = rag_pipeline.run({
        "query_embedder": {
            "text": question
        },
        "prompt_builder": {
            "question": question
        },
        "answer_builder": {
            "query": question
        }
    })
    print(f"Question: {question}")
    print("Answer from pipeline: ")
    print(response["answer_builder"]["answers"][0].data)
    print("\n-----------------------------------\n")
    
    rag_answers.append(response["answer_builder"]["answers"][0].data)
    retrieved_docs.append(response["answer_builder"]["answers"][0].documents)

Batches: 100%|██████████| 1/1 [00:00<00:00, 33.58it/s]


Question: Does biolimus-eluting stent with biodegradable polymer improve clinical outcomes in patients with acute myocardial infarction?
Answer from pipeline: 
Yes, the biolimus-eluting stent (BES) with biodegradable polymer significantly reduces patient-oriented composite endpoint (POCE) and major adverse cardiac events (MACE) compared to the sirolimus-eluting stent (SES) in patients with acute myocardial infarction (AMI) at the 5-year follow-up.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 50.31it/s]


Question: Does chloroquine enhance temozolomide cytotoxicity in malignant gliomas by blocking autophagy?
Answer from pipeline: 
Yes, chloroquine enhances temozolomide cytotoxicity in malignant gliomas by blocking autophagy.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 53.26it/s]


Question: Is head-shaft angle a risk factor for hip displacement in children with cerebral palsy?
Answer from pipeline: 
Yes, based on the context information provided, the head-shaft angle (HSA) is indeed identified as a risk factor for hip displacement in children with cerebral palsy. The study found that a 10-degree difference in HSA resulted in a 1.6-times higher risk of hip displacement in children with a higher HSA, even when age, migration percentage (MP), and Gross Motor Function Classification System (GMFCS) level were taken into account.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 53.26it/s]


Question: Is high-flow-mediated constriction in adults influenced by biomarkers of cardiovascular and metabolic risk?
Answer from pipeline: 
Yes, high-flow-mediated constriction (H-FMC) in adults is influenced by biomarkers of cardiovascular and metabolic risk. In the study mentioned in the context, H-FMC was observed in approximately 69% of adult participants, and it was found to be related to body composition and biomarkers of cardiovascular and metabolic risk such as total body mass, fat mass, body mass index, glucose, insulin, and lipids. Participants with certain cardiovascular risk factors showed significantly higher epicardial adipose tissue (EAT) thickness and carotid intima media thickness (CIMT), which are biomarkers associated with cardiovascular risk. These findings suggest that biomarkers of cardiovascular and metabolic risk can influence high-flow-mediated constriction in adults.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 51.06it/s]


Question: Is fibromyalgia associated with coronary heart disease : a population-based cohort study?
Answer from pipeline: 
Yes, based on the information provided in the context, fibromyalgia is associated with an increased risk of coronary heart disease (CHD). The study used a matched-cohort design and analyzed data from the Longitudinal Health Insurance Database 2000 in Taiwan. It was found that patients with fibromyalgia had a significantly higher subsequent risk of a CHD event compared to patients without fibromyalgia (hazard ratio, 2.11; 95% confidence interval, 1.46-3.05; P < 0.001).

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 53.56it/s]


Question: Do increased litter size and suckling intensity inhibit KiSS-1 mRNA expression in rat arcuate nucleus?
Answer from pipeline: 
Yes, increased litter size and suckling intensity inhibit KiSS-1 mRNA expression in the rat arcuate nucleus. The expression of KiSS-1 mRNA in the arcuate nucleus was decreased as the litter size and intensity of the suckling stimulus were increased. The effect of suckling intensity on the expression of KiSS-1 mRNA was more pronounced than that of litter size.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 52.80it/s]


Question: Is tPH1 A218 allele associated with suicidal behavior in Turkish population?
Answer from pipeline: 
Yes, the tPH1 A218 allele is associated with suicidal behavior in the Turkish population, as the frequency of the A allele was significantly higher in suicide attempters compared to healthy controls in the study mentioned in the context information. The study found that the A allele frequency was 46.33% in suicide attempters, compared to 35.71% in healthy controls (p=0.0357).

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 51.92it/s]


Question: Are hair Cortisol Concentrations in Adolescent Girls with Anorexia Nervosa Lower Compared to Healthy and Psychiatric Controls?
Answer from pipeline: 
Yes, according to the provided context information, hair Cortisol Concentrations in Adolescent Girls with Anorexia Nervosa are lower compared to Healthy controls(p=0.030) and Psychiatric controls. This was determined through an analysis of hair cortisol concentration as a marker for long-term integrated cortisol secretion in female patients with AN compared to female healthy controls (HC) and female psychiatric controls (PC).

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 53.12it/s]


Question: Is shorter time to target temperature associated with poor neurologic outcome in post-arrest patients treated with targeted temperature management?
Answer from pipeline: 
Yes, shorter time from initiation of cooling to target temperature ("induction") was associated with worse neurologic outcome in post-arrest patients treated with targeted temperature management. The study found that induction time >300 minutes was associated with good neurologic outcome compared to those with induction time <120 minutes.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 52.95it/s]


Question: Do plasma levels of Galectin-9 reflect disease severity in malaria infection?
Answer from pipeline: 
Yes, plasma levels of Galectin-9 (Gal-9) reflect disease severity in malaria infection. The study found that Gal-9 levels were higher in severe malaria cases compared to uncomplicated cases at day 0 and day 7. This suggests that higher levels of Gal-9 in plasma are associated with more severe forms of malaria infection.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 52.79it/s]


Question: Is plasma MicroRNA-126-5p Associated with the Complexity and Severity of Coronary Artery Disease in Patients with Stable Angina Pectoris?
Answer from pipeline: 
Yes, plasma MicroRNA-126-5p levels were found to be significantly down-regulated in patients with stable angina pectoris who had multi-vessel disease and higher SYNTAX scores, indicating an association with the complexity and severity of coronary artery disease in these patients.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 53.80it/s]


Question: Do trends in outpatient MRI seem to reflect recent reimbursement cuts?
Answer from pipeline: 
Yes, trends in outpatient MRI show that office volume steadily declined while hospital outpatient department (HOPD) volume steadily increased, indicating a shift of outpatient MRI from private offices to HOPDs. This shift could potentially be a response to recent reimbursement cuts.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 52.54it/s]


Question: Is postural change during venous blood collection a major source of bias in clinical chemistry testing?
Answer from pipeline: 
Yes, postural change during venous blood collection is a major source of bias in clinical chemistry testing. The study mentioned in the context found that parameters such as hemoglobin, hematocrit, albumin, alkaline phosphatase, and others exhibited meaningful increases when participants changed from a supine position to a sitting or standing position. This indicates that the posture during blood collection can significantly affect the results of clinical chemistry testing.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 54.63it/s]


Question: Are plant communities on infertile soils less sensitive to climate change?
Answer from pipeline: 
Based on the multi-decadal study conducted in the western USA, it was found that plant communities on infertile soils (serpentine) were not less sensitive to climate change. The study showed that overstorey cover, rather than soil fertility, was a significant covariate of community change over time. Additionally, the community mean specific leaf area showed less change over time in serpentine communities, indicating that they were not less sensitive to climate change. Therefore, plant communities on infertile soils do not appear to be less sensitive to climate change based on the evidence provided.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 57.13it/s]


Question: Does direct transfer of HRPII-magnetic bead complexes to malaria rapid diagnostic tests significantly improve test sensitivity?
Answer from pipeline: 
Yes, the direct transfer of HRPII-magnetic bead complexes to malaria rapid diagnostic tests significantly improves test sensitivity. The limit of detection of the Paracheck Pf RDT brand was improved by 21-fold, resulting in a limit of detection below 1 parasite/µL using this method.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 54.07it/s]


Question: Do evolution of clinical features in possible DLB depending on FP-CIT SPECT result?
Answer from pipeline: 
Yes, the evolution of clinical features in patients with possible DLB did vary depending on the (123)I-FP-CIT SPECT scan result. Patients with abnormal imaging had a significant increase in Unified Parkinson's Disease Rating Scale (UPDRS) score over time compared to those with normal imaging. There was relatively little evolution of the rest of the DLB features regardless of the imaging result.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 132.20it/s]


Question: Are statin medications associated with a lower probability of having an abnormal screening prostate-specific antigen result?
Answer from pipeline: 
Yes, statin medications are associated with a lower probability of having an abnormal screening prostate-specific antigen result. The percentages of men with PSA results exceeding commonly used thresholds of >2.5, >4.0, and >6.5 ng/mL were lower in men using statin medications, and the adjusted relative risks of having a PSA level >4.0 ng/mL were lower in men prescribed with higher doses of statins compared to non-statin users.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 251.19it/s]


Question: Is negative feedback loop of cholesterol regulation impaired in the livers of patients with Alagille syndrome?
Answer from pipeline: 
No, the negative feedback loop of cholesterol regulation is not impaired in the livers of patients with Alagille syndrome. The expression of mature SREBP2 protein was not suppressed in these patients, indicating that the regulation of cholesterol synthesis and uptake is functioning as normal.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 53.29it/s]


Question: Does obesity increase risk of anticoagulation reversal failure with prothrombin complex concentrate in those with intracranial hemorrhage?
Answer from pipeline: 
Yes, obesity was identified as a factor associated with anticoagulation reversal failure after the first dose of prothrombin complex concentrate in patients with warfarin-related acute intracranial hemorrhage. Patients who were obese (body mass index > 30 kg/m(2)) had a higher likelihood of anticoagulation reversal failure compared to those who were not obese.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 281.12it/s]


Question: Does implementation of the acute care surgery model provide benefits in the surgical treatment of the acute appendicitis?
Answer from pipeline: 
Yes, implementation of the acute care surgery (ACS) model provides benefits in the surgical treatment of acute appendicitis. The study found that the overall emergency department (ED) length of stay was significantly shorter in the ACS model compared to the pre-ACS model. Additionally, hospital length of stay (LOS) was also significantly shorter in the ACS model. Therefore, the ACS model improves surgical efficiency and quality outcomes in the treatment of acute appendicitis.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 66.62it/s]


Question: Do [ Perinatal variables from newborns of Aymara mothers suggest a genetic adaptation to high altitude ]?
Answer from pipeline: 
Yes, perinatal variables from newborns of Aymara mothers suggest a genetic adaptation to high altitude, as women with Aymara ancestry gave birth to children with higher gestational age, weight, and cranial circumference, indicating some level of adaptation to living at high altitudes.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 51.50it/s]


Question: Does prospective international cohort study demonstrate inability of interim PET to predict treatment failure in diffuse large B-cell lymphoma?
Answer from pipeline: 
No, the prospective international cohort study demonstrates that interim PET (I-PET) is able to predict treatment failure in diffuse large B-cell lymphoma. The study found that a positive I-PET result was associated with significantly lower event-free survival and overall survival rates compared to a negative I-PET result.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 52.63it/s]


Question: Does macrolide Resistance in Treponema pallidum correlate With 23S rDNA Mutations in Recently Isolated Clinical Strains?
Answer from pipeline: 
Yes, the high rates of 23S rDNA mutations in Treponema pallidum isolated from syphilis patients do correlate with macrolide resistance, as demonstrated by the failure of azithromycin to cure rabbits infected with strains containing these mutations in a recent study.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 54.93it/s]


Question: Does early tracheostomy in trauma patients save time and money?
Answer from pipeline: 
Yes, early tracheostomy in trauma patients does save time and money. Patients in the early tracheostomy group had significantly shorter TICU (Trauma Intensive Care Unit) LOS (Length of Stay) and significantly fewer ventilator days compared to the late tracheostomy group. Additionally, early tracheostomy patients had significantly less Ventilator-Associated Pneumonia (VAP). The study also mentioned that cost for services was calculated using average daily billing rates at the institution, indicating potential cost savings with early tracheostomy.

-----------------------------------



Batches: 100%|██████████| 1/1 [00:00<00:00, 51.88it/s]


Question: Is self-reported physical activity in smoking pre-cessation a protective factor against relapse for all?
Answer from pipeline: 
Based on the context provided, the study evaluated the impact of self-reported physical activity (PA) in precessation on smoking relapse. After adjusting for potential confounders, it was found that PA was not associated with smoking relapse. Therefore, self-reported physical activity in smoking pre-cessation is not a protective factor against relapse for all. Other factors such as self-efficacy level, absence of professional activity, previous attempts to quit, and alcohol use disorders were associated with smoking relapse.

-----------------------------------



##### Perform the Evaluation
We then run the three evaluators on the 25 sampled questions and answers.

In [64]:
from haystack.components.evaluators.document_mrr import DocumentMRREvaluator
from haystack.components.evaluators.faithfulness import FaithfulnessEvaluator
from haystack.components.evaluators.sas_evaluator import SASEvaluator

eval_pipeline = Pipeline()
eval_pipeline.add_component("doc_mrr_evaluator", DocumentMRREvaluator())
eval_pipeline.add_component("faithfulness", FaithfulnessEvaluator())
eval_pipeline.add_component("sas_evaluator", SASEvaluator())

results = eval_pipeline.run({
    "doc_mrr_evaluator": {
        "ground_truth_documents": list([d] for d in ground_truth_docs),
        "retrieved_documents": retrieved_docs
    },
    "faithfulness": {
        "questions": list(questions),
        # contexts are expected as lists of strings, so pass each document's text
        "contexts": list([d.content] for d in ground_truth_docs),
        "predicted_answers": rag_answers
    },
    "sas_evaluator": {
        "predicted_answers": rag_answers,
        "ground_truth_answers": list(ground_truth_answers)
    },
})

100%|██████████| 25/25 [00:40<00:00,  1.64s/it]


##### Constructing an Evaluation Report

In [65]:
from haystack.evaluation.eval_run_result import EvaluationRunResult

inputs = {
    "question": list(questions),
    "context": list([d.content] for d in ground_truth_docs),
    "answer": list(ground_truth_answers),
    "predicted_answer": rag_answers
}

evaluation_result = EvaluationRunResult(run_name="pubmed_rag_pipeline", inputs=inputs, results=results)
evaluation_result.score_report()


| | score |
|---|---|
| doc_mrr_evaluator | 1.0 |
| faithfulness | 0.96 |
| sas_evaluator | 0.760169 |


##### Convert the Report into a Pandas DataFrame

In [66]:
results_df = evaluation_result.to_pandas()
results_df

| | question | context | answer | predicted_answer | doc_mrr_evaluator | faithfulness | sas_evaluator |
|---|---|---|---|---|---|---|---|
| 0 | Does biolimus-eluting stent with biodegradable... | [To investigate clinical outcomes of coronary ... | BES, compared with SES, significantly improved... | Yes, the biolimus-eluting stent (BES) with bio... | 1.0 | 1.0 | 0.654321 |
| 1 | Does chloroquine enhance temozolomide cytotoxi... | [In a recent clinical trial, patients with new... | Taken together, these results demonstrate that... | Yes, chloroquine enhances temozolomide cytotox... | 1.0 | 0.0 | 0.705631 |
| 2 | Is head-shaft angle a risk factor for hip disp... | [Hip dislocation in children with cerebral pal... | A high HSA appears to be a risk factor for hip... | Yes, based on the context information provided... | 1.0 | 1.0 | 0.835098 |
| 3 | Is high-flow-mediated constriction in adults i... | [During reactive hyperemia, the brachial arter... | Increased body mass, fat mass, and body mass i... | Yes, high-flow-mediated constriction (H-FMC) i... | 1.0 | 1.0 | 0.833109 |
| 4 | Is fibromyalgia associated with coronary heart... | [We examined whether patients with a diagnosis... | An association between fibromyalgia and CHD ap... | Yes, based on the information provided in the ... | 1.0 | 1.0 | 0.747827 |
| 5 | Do increased litter size and suckling intensit... | [The effect of litter size and suckling intens... | Increased litter size and suckling intensity d... | Yes, increased litter size and suckling intens... | 1.0 | 1.0 | 0.801956 |
| 6 | Is tPH1 A218 allele associated with suicidal b... | [Serotonergic dysfunction is implicated in dep... | Our results provide evidence that A allele of ... | Yes, the tPH1 A218 allele is associated with s... | 1.0 | 1.0 | 0.920216 |
| 7 | Are hair Cortisol Concentrations in Adolescent... | [In anorexia nervosa (AN) hypercortisolism has... | We find lower HCC in AN, compared to HC and PC... | Yes, according to the provided context informa... | 1.0 | 1.0 | 0.610489 |
| 8 | Is shorter time to target temperature associat... | [Time to achieve target temperature varies sub... | In this multicenter cohort of post-arrest TTM ... | Yes, shorter time from initiation of cooling t... | 1.0 | 1.0 | 0.712108 |
| 9 | Do plasma levels of Galectin-9 reflect disease... | [Galectin-9 (Gal-9) is a β-galactoside-binding... | Gal-9 is released during acute malaria, and re... | Yes, plasma levels of Galectin-9 (Gal-9) refle... | 1.0 | 1.0 | 0.790772 |


##### Filter to the top 3 and bottom 3 scores for semantic answer similarity (sas_evaluator)

In [67]:
import pandas as pd
top_3 = results_df.nlargest(3, 'sas_evaluator')
bottom_3 = results_df.nsmallest(3, 'sas_evaluator')
pd.concat([top_3, bottom_3])

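
The same kind of selection can be checked without pandas. The sketch below uses `heapq.nlargest` and `heapq.nsmallest` from the standard library on the ten `sas_evaluator` values from the rows displayed earlier. The remaining 15 rows are not shown, so this is only an illustration and not the result of the cell above.

```python
import heapq

# (row index, sas_evaluator) pairs copied from the ten rows displayed earlier.
sas_scores = [
    (0, 0.654321), (1, 0.705631), (2, 0.835098), (3, 0.833109), (4, 0.747827),
    (5, 0.801956), (6, 0.920216), (7, 0.610489), (8, 0.712108), (9, 0.790772),
]

# Rank rows by their score (second tuple element), not by their index.
top_3 = heapq.nlargest(3, sas_scores, key=lambda row: row[1])
bottom_3 = heapq.nsmallest(3, sas_scores, key=lambda row: row[1])

print(top_3)     # [(6, 0.920216), (2, 0.835098), (3, 0.833109)]
print(bottom_3)  # [(7, 0.610489), (0, 0.654321), (1, 0.705631)]
```

Reading the lowest-scoring rows first is usually the quickest way to find answers that are phrased very differently from the ground truth.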