In [1]:
%load_ext autoreload
%autoreload 2

# **Using a custom retrieval prompt and evaluating the results with** `ir-measures`
`ir-measures` is a library that provides a common interface to many Information Retrieval evaluation tools, such as [`pytrec-eval`](https://github.com/terrierteam/pytrec_eval) and [Ranx](https://github.com/AmenRa/ranx).

This notebook shows how to use a custom prompt with RAGElo's `CustomPromptEvaluator` to evaluate documents retrieved by a search system, and how to use the annotations generated by the LLM as relevance judgments (qrels) to measure the performance of the retrieval pipeline with `ir-measures`.

In [79]:
%pip install ir-measures


## 1. Import packages and prepare evaluators

As we are using the `CustomPromptEvaluator` retrieval evaluator, we will provide it with our own prompt, as well as the configuration needed to build the prompts properly.

In [None]:
from ragelo import Query, Document, get_retrieval_evaluator, get_llm_provider

llm_provider = get_llm_provider("openai", model_name="gpt-4o-mini", max_tokens=2048)

prompt = """
You are an expert annotator for a search engine, rating the relevance of \
search results to a given query submitted by a user.

Given a user query, you should provide a relevance score on an integer \
scale from 0 to 3, with the following meanings:

0: The document is not relevant to the query.
1: The document is slightly relevant to the query.
2: The document is relevant to the query, but doesn't cover all aspects of it.
3: The document is highly relevant to the query and completely fulfills the user's information needs.

Assume that you are writing a report on the subject of the topic. If you would \
not use any of the information contained in the document in such a report, but \
the document still covers the right topic, mark it 1. \
If you would use any of the information contained in the document in such a report, mark it 2.
If the document is primarily about the topic, or contains vital information about \
the topic, mark it 3. Otherwise, mark it 0.

# Query
{query}

# Retrieved document
{document}

# Instructions

Think step by step about how you would evaluate the relevance of the document to the query. \
Your output should be a JSON dictionary with two keys: "relevance" and "explanation". \
The "relevance" key should contain the integer relevance score. \
The "explanation" key should contain a string explaining why you rated the document as you did. \
"""

evaluator = get_retrieval_evaluator(
    "custom_prompt",
    llm_provider=llm_provider,
    prompt=prompt,
    answer_format_retrieval_evaluator="multi_field_json",  # Ask for the answer as a JSON object with multiple fields; we are interested in all of them.
    scoring_keys_retrieval_evaluator=["relevance", "explanation"],  # The keys we want to extract from the answer.
    n_processes=20,  # How many threads to use when evaluating the retrieved documents. That many parallel calls will be made to OpenAI.
    rich_print=True,  # Whether or not to use rich to print colorful outputs.
)
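The prompt above asks the LLM for a JSON dictionary with the keys `relevance` and `explanation`. As a rough illustration of what consuming such an answer involves, the sketch below parses a made-up response with the standard `json` module. The `raw_answer` string and the `parse_answer` helper are hypothetical and are not part of RAGElo, whose own parsing may differ.

```python
import json

# Hypothetical raw answer, shaped the way the prompt requests (not real LLM output).
raw_answer = '{"relevance": 2, "explanation": "Covers bumble bee gene expression, but not homeotic shifts."}'


def parse_answer(raw: str, keys: list[str]) -> dict:
    """Parse a JSON answer and keep only the requested keys."""
    parsed = json.loads(raw)
    missing = [key for key in keys if key not in parsed]
    if missing:
        raise ValueError(f"Answer is missing keys: {missing}")
    return {key: parsed[key] for key in keys}


answer = parse_answer(raw_answer, ["relevance", "explanation"])
print(answer["relevance"], "-", answer["explanation"])
```

Failing loudly on a missing key makes malformed answers easy to spot before they end up in the qrels.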


### Load the queries and the documents to evaluate
We are using example queries and documents from the [SciRepEval dataset](https://huggingface.co/datasets/allenai/scirepeval/viewer/search/train).

In [68]:
query = Query(qid="query_1", query="bumble bee homeotic shift")


# Add some documents retrieved by the different search engines
doc_1 = Document(
    did="doc_1",
    text="Significance Mimicry among bumble bees has driven them to diversify and converge in their color patterns, making them a replicate rich system for connecting genes to traits. Here, we discover that mimetic color variation in a bumble bee is driven by changes in Hox gene expression. Hox genes are master regulators of numerous segment specific morphologies and thus are some of the most conserved developmental genes across animals. In these bees, the posterior Hox gene Abd B is upregulated in a more anterior location to impart phenotypic change. This homeotic shift happens late in development, when nonspecific effects are minimized, thus availing these genes for color pattern diversification. Similar mimetic color patterns were inferred to use different mutations, suggesting diverse routes to mimicry. Natural phenotypic radiations, with their high diversity and convergence, are well suited for informing how genomic changes translate to natural phenotypic variation. New genomic tools enable discovery in such traditionally nonmodel systems. Here, we characterize the genomic basis of color pattern variation in bumble bees (Hymenoptera, Apidae, Bombus) a group that has undergone extensive convergence of setal color patterns as a result of Mullerian mimicry. In western North America, multiple species converge on local mimicry patterns through parallel shifts of midabdominal segments from red to black. Using genome wide association, we establish that a cis regulatory locus between the abdominal fate determining Hox genes, abd A and Abd B, controls the red black color switch in a western species, Bombus melanopygus. Gene expression analysis reveals distinct shifts in Abd B aligned with the duration of setal pigmentation at the pupal adult transition. This results in atypical anterior Abd B expression, a late developmental homeotic shift. Changing expression of Hox genes can have widespread effects, given their important role across segmental phenotypes; however, the late timing reduces this pleiotropy, making Hox genes suitable targets. Analysis of this locus across mimics and relatives reveals that other species follow independent genetic routes to obtain the same phenotypes.",
)
doc_2 = Document(
    did="doc_2",
    text="Bumble bees are declining worldwide, their vital ecosystem services are diminishing and underlying mechanisms are species specific and multifaceted. This has sparked an increase in long term assessments of historical collections that provide valuable information about population trends and shifts in distributions. However, museums specimens also contain important ecological information, including rarely measured morphological traits. Trait based assessments of museums specimens provide additional information on underlying mechanisms of population trends, by tracking changes over time. Here, we used museum specimens of four Bombus species, spanning a timeframe of 125 years to: (i) compare body size of declining and increasing species, (ii) assess intra specific trends over the last century, and (iii) investigate shifts in geographical distribution over time. We found that declining Bombus species were larger than increasing ones. All four species were smaller in current time than a century ago. Intra specific size declines were more pronounced for larger bodied species. With our sampling, declining and increasing species showed an upward shift in elevation, and declining species showed an additional geographic shift in recent times as compared to historic records. Intra specific body size declines may represent species adaptation to unfavorable environmental conditions, and may be a useful metric to complement traditional species vulnerability assessments. We highlight the utility of incorporating trait based assessments into future studies investigating species declines.",
)
doc_3 = Document(
    did="doc_3",
    text="Author(s) Rothman, Jason Advisor(s) McFrederick, Quinn Abstract: Bees are important insect pollinators in both agricultural and natural settings who may encounter toxicants while foraging on plants growing in contaminated soils. How these chemicals affect the bee microbiome, which confers many health benefits to the host, is an important but understudied aspect of pollinator health. Through a combination of 16S rRNA gene sequencing, LC MS metabolomics, ICP OES spectroscopy, quantitative PCR, culturing, microbiome manipulation, and whole organism exposure studies, I attempt to establish the effects that toxicants have on social bees and their associated microbes. The microbiome of animals has been shown to reduce metalloid toxicity, so I exposed microbiome inoculated or uninoculated bumble bees to 0.75 mg/L selenate and found that inoculated bees survive longer when compared to uninoculated bees. I also showed that selenate exposure altered the composition of the bumble bee microbiome and that the growth of two major gut symbionts Snodgrassella alvi and Lactobacillus bombicola was unaffected by this exposure. Due to the pervasiveness of environmental pollution in bee habitats, I exposed bumble bees to cadmium, copper, selenate, imidacloprid, and hydrogen peroxide and found that each of these compounds can be lethal to bees. I also showed that most of these chemicals can affect the diversity of the bee microbiome and that there is interstrain variation in toxicant tolerance genes in the major bee symbionts Snodgrassella alvi and Gilliamella apicola. As exposure to cadmium or selenate has been shown to affect animal associated microbes, I assayed the effects of these chemicals on honey bees and observed shifts in the bee microbiome at multiple timepoints. I also found that exposure to selenate and cadmium changes the overall bee metabolome and may cause oxidative damage to proteins and lipids. Lastly, I found that bee associated bacteria can bioaccumulate cadmium but generally not selenate. In this dissertation I demonstrated that bee associated bacteria are generally robust to toxicant exposure, but that chemicals can alter the composition of both bumble bee and honey bee microbiomes. I also show that toxicants affect bee metabolism, and that the bee microbiome plays an important role in maintaining host health when challenged with toxicants.",
)
doc_4 = Document(
    did="doc_4",
    text="abstract: Insects maximize their fitness by exhibiting predictable and adaptive seasonal patterns in response to changing environmental conditions. These seasonal patterns are often expressed even when insects are kept in captivity, suggesting they are functionally and evolutionary important. In this study we examined whether workers of the eusocial bumble bee Bombus impatiens maintained a seasonal signature when kept in captivity. We used an integrative approach and compared worker egg laying, ovarian activation, body size and mass, lipid content in the fat body, cold tolerance and expression of genes related to cold tolerance, metabolism, and stress throughout colony development. We found that bumble bee worker physiology and gene expression patterns shift from reproductive like to diapause like as the colony ages. Workers eclosing early in the colony cycle had increased egg laying and ovarian activation, and reduced cold tolerance, body size, mass, and lipid content in the fat body, in line with a reproductive like profile, while late eclosing workers exhibited the opposite characteristics. Furthermore, expression patterns of genes associated with reproduction and diapause differed between early and late eclosing workers, partially following the physiological patterns. We suggest that a seasonal signature, innate to individual workers, the queen or the colony is used by workers as a social cue determining the phenology of the colony and discuss possible implications for understanding reproductive division of labor in bumble bee colonies and the evolutionary divergence of female castes in the genus Bombus.",
)
doc_5 = Document(
    did="doc_5",
    text="ABSTRACT Insects maximize their fitness by exhibiting predictable and adaptive seasonal patterns in response to changing environmental conditions. These seasonal patterns are often expressed even when insects are kept in captivity, suggesting they are functionally and evolutionarily important. In this study, we examined whether workers of the eusocial bumble bee Bombus impatiens maintained a seasonal signature when kept in captivity. We used an integrative approach and compared worker egg laying, ovarian activation, body size and mass, lipid content in the fat body, cold tolerance and expression of genes related to cold tolerance, metabolism and stress throughout colony development. We found that bumble bee worker physiology and gene expression patterns shift from reproductive like to diapause like as the colony ages. Workers eclosing early in the colony cycle had increased egg laying and ovarian activation, and reduced cold tolerance, body size, mass and lipid content in the fat body, in line with a reproductive like profile, while late eclosing workers exhibited the opposite characteristics. Furthermore, expression patterns of genes associated with reproduction and diapause differed between early and late eclosing workers, partially following the physiological patterns. We suggest that a seasonal signature, innate to individual workers, the queen or the colony, is used by workers as a social cue determining the phenology of the colony and discuss possible implications for understanding reproductive division of labor in bumble bee colonies and the evolutionary divergence of female castes in the genus Bombus. Summary: Bumblebee workers exhibit a physiological signature (innate to workers, queen or the colony) corresponding to colony age with a shift towards a diapause like profile in late eclosing workers.",
)

# Add the documents retrieved by the first search engine
# Imagine that the first search engine retrieved the documents with the following ranking and scores:
retrieved_by_search_engine_1 = [(doc_1, 0.9), (doc_5, 0.7), (doc_3, 0.2)]

# In some cases, a search engine may not provide the scores of the retrieved documents, just their ranking.
# If that's the case, we can pass the list of documents in the order they were retrieved,
# and their scores will be derived from their rank.
retrieved_by_search_engine_2 = [doc_2, doc_4, doc_5]

query.add_retrieved_docs(retrieved_by_search_engine_1, agent="search_engine_1")
query.add_retrieved_docs(retrieved_by_search_engine_2, agent="search_engine_2")

# Evaluate the documents retrieved by the search engines for the query
queries = evaluator.batch_evaluate([query])

### Let's see what the evaluations look like
- Each query has a dictionary with the retrieved documents and their scores.
- Each document may have been retrieved by more than one system, so we also keep the score that each system attributed to that document.


In [67]:
query = queries[0]
print(f'🔎 Query text: "{query.query}"')
print(f"📚 {len(query.retrieved_docs)} documents retrieved by all agents")
print("-" * 80)
for document_id in query.retrieved_docs:
    document = query.retrieved_docs[document_id]
    retrieved_by = document.retrieved_by
    explanation = document.evaluation.answer["explanation"]
    relevance = document.evaluation.answer["relevance"]
    print(f"📜 Document {document_id}:")
    print(f"\t📝 Text: {document.text[:100]} (...)")
    print(f'\t💭 LLM\'s reasoning: "{explanation[:100]}')
    print(f"\t🎯 Relevance: {relevance}")
    print(f"\t🕵️ Retrieved by: {len(retrieved_by)} agent(s): {', '.join(retrieved_by.keys())}")
    for agent in retrieved_by:
        print(f"\t\t🔍 {agent} score: {retrieved_by[agent]}")
    print("-" * 80)

🔎 Query text: "bumble bee homeotic shift"
📚 5 documents retrieved by all agents
--------------------------------------------------------------------------------
📜 Document doc_1:
	📝 Text: Significance Mimicry among bumble bees has driven them to diversify and converge in their color patt (...)
	💭 LLM's reasoning: "The document discusses the role of Hox genes in driving color pattern variation in bumble bees, spec
	🎯 Relevance: 3
	🕵️ Retrieved by: 1 agent(s): search_engine_1
		🔍 search_engine_1 score: 0.9
--------------------------------------------------------------------------------
📜 Document doc_5:
	📝 Text: ABSTRACT Insects maximize their fitness by exhibiting predictable and adaptive seasonal patterns in  (...)
	💭 LLM's reasoning: "The document discusses the physiological changes in bumble bee workers and their seasonal patterns, 
	🎯 Relevance: 2
	🕵️ Retrieved by: 2 agent(s): search_engine_1, search_engine_2
		🔍 search_engine_1 score: 0

## Evaluate the results

### Formatting the results
Most Information Retrieval evaluation frameworks rely on TREC-style evaluation. In this format, each query has its own set of relevance judgments, called the qrels. A qrels object is usually represented as a dictionary, such as:
```python
qrels = {
    "query_id": {
        "doc_id": relevance,
        "doc_id": relevance,
        ...
    },
    ...
}
```
Where `query_id` is the id of the query, `doc_id` is the id of the document that was considered for that query, and `relevance` is the relevance of the document to the query.

On the other hand, each retrieval system produces a ranking of documents in response to a query. The list of results that a retrieval system returns for each query is usually called a `run`, and is also represented as a dictionary, such as:
```python
run_system_1 = {
    "query_id": {
        "doc_id": score,
        "doc_id": score,
        ...
    },
    ...
}
```
Where `query_id` is the id of the query, `doc_id` is the id of the document that was considered for that query, and `score` is the score attributed to the document by the retrieval system.

As in this example we only have one query, the run and qrels dictionaries will have only one key each.
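To make the two structures concrete, here is a minimal sketch of what the qrels and runs for this notebook's single query could look like as plain dictionaries. The relevance values and scores are copied from the outputs printed in this notebook. The `run_from_ranking` helper is hypothetical: it assumes that a ranking without scores is turned into reciprocal-rank scores (1, 1/2, 1/3), which is consistent with the scores printed for `search_engine_2` but is not confirmed to be how RAGElo computes them.

```python
# Relevance judgments for the single query, as assigned by the LLM in this notebook.
qrels = {"query_1": {"doc_1": 3, "doc_2": 1, "doc_3": 1, "doc_4": 1, "doc_5": 2}}

# search_engine_1 reported explicit retrieval scores.
run_system_1 = {"query_1": {"doc_1": 0.9, "doc_5": 0.7, "doc_3": 0.2}}


def run_from_ranking(query_id, ranked_doc_ids):
    """Hypothetical helper: score each document as 1 / rank (assumption, see above)."""
    scores = {doc_id: 1 / rank for rank, doc_id in enumerate(ranked_doc_ids, start=1)}
    return {query_id: scores}


# search_engine_2 only reported a ranking.
run_system_2 = run_from_ranking("query_1", ["doc_2", "doc_4", "doc_5"])
print(run_system_2)
```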

### Calculating the metrics

Some of the most common metrics used in Information Retrieval are the Precision@K (P@K) and normalized Discounted Cumulative Gain (nDCG). These metrics are calculated by the `ir-measures` library, which takes the qrels and run dictionaries as input and returns the evaluation results.

P@K (P@3 in our example, since each system retrieves only 3 documents) measures the fraction of the top-K results that are considered relevant (i.e., that have a relevance score greater than 0 in the qrels). It takes into account neither the order of the documents within the top K nor their graded relevance scores, only whether each score is above 0.
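The definition of P@K can be sketched in a few lines of plain Python. This is an illustrative re-implementation and not the code `ir-measures` runs. The `precision_at_k` function and its `min_rel` parameter are names chosen for this example, and the judgments and scores are copied from this notebook's outputs.

```python
def precision_at_k(qrels, run, k, min_rel=1):
    """Fraction of the top-k documents whose judged relevance is at least min_rel."""
    top_k = sorted(run, key=run.get, reverse=True)[:k]
    hits = sum(1 for doc_id in top_k if qrels.get(doc_id, 0) >= min_rel)
    return hits / k


# Judgments and retrieval scores for the single query in this notebook.
qrels = {"doc_1": 3, "doc_2": 1, "doc_3": 1, "doc_4": 1, "doc_5": 2}
run_1 = {"doc_1": 0.9, "doc_5": 0.7, "doc_3": 0.2}
run_2 = {"doc_2": 1.0, "doc_4": 0.5, "doc_5": 1 / 3}

p3_system_1 = precision_at_k(qrels, run_1, k=3)
p3_system_2 = precision_at_k(qrels, run_2, k=3)
print(p3_system_1, p3_system_2)  # 1.0 1.0
```

Both systems reach a P@3 of 1.0 because every retrieved document has a relevance above 0, even though the two rankings differ a lot in quality.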

nDCG, on the other hand, measures the quality of the ranking by comparing it to an ideal ranking. In our example, as `doc_1` has a relevance score of 3 and `doc_5` a relevance score of 2, the ideal ranking would be [`doc_1`, `doc_5`, `doc_3`, `doc_4`, `doc_2`], with the last three being interchangeable since they all have a relevance score of 1. The nDCG can be interpreted as how close the ranking produced by the retrieval system is to this ideal ranking.
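As a worked example, the sketch below computes nDCG by hand for both rankings. It assumes the trec_eval-style formulation with linear gain and a log2 rank discount, with the ideal ranking built from all judged documents. That assumption reproduces the nDCG values printed in this notebook (0.8535 and 0.4715), but the code is an illustration and not the `ir-measures` implementation.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(qrels, run):
    """DCG of the run's ranking divided by the DCG of the ideal ranking of all judged documents."""
    ranked = sorted(run, key=run.get, reverse=True)
    gains = [qrels.get(doc_id, 0) for doc_id in ranked]
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Judgments and retrieval scores for the single query in this notebook.
qrels = {"doc_1": 3, "doc_2": 1, "doc_3": 1, "doc_4": 1, "doc_5": 2}
run_1 = {"doc_1": 0.9, "doc_5": 0.7, "doc_3": 0.2}
run_2 = {"doc_2": 1.0, "doc_4": 0.5, "doc_5": 1 / 3}

ndcg_system_1 = ndcg(qrels, run_1)
ndcg_system_2 = ndcg(qrels, run_2)
print(f"{ndcg_system_1:.4f} {ndcg_system_2:.4f}")  # 0.8535 0.4715
```

The first system does not reach 1.0 because the ideal ranking also contains `doc_4` and `doc_2`, which it never retrieved.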


For more information, please refer to the [ir-measures documentation](https://ir-measur.es/en/latest/).



In [188]:
import ir_measures
from ir_measures import nDCG, MAP, P

qrels = query.get_qrels()
runs = query.get_runs()

results_system_1 = ir_measures.calc_aggregate(
    [P@3, nDCG], # The metrics we want to calculate. By adding the @K, we specify the cut-off for the metric.
    qrels, # The relevance judgements generated by the evaluator.
    runs["search_engine_1"] # The ranking of the documents generated by the search engine.
)
results_system_2 = ir_measures.calc_aggregate(
    [P@3, nDCG], # The metrics we want to calculate. By adding the @K, we specify the cut-off for the metric.
    qrels, # The relevance judgements generated by the evaluator.
    runs["search_engine_2"] # The ranking of the documents generated by the search engine.
)


In [192]:
# Let's sort the results of the search engines by their retrieval scores
sorted_results_1 = sorted(runs["search_engine_1"]["query_1"].items(), key=lambda x: x[1], reverse=True)
sorted_results_2 = sorted(runs["search_engine_2"]["query_1"].items(), key=lambda x: x[1], reverse=True)

print(f'📊 Search engine 1 rankings for query 🔎 "{query.query}":')
print(f"\t1️⃣ {sorted_results_1[0][0]} (score = {sorted_results_1[0][1]}), relevance = {qrels['query_1'][sorted_results_1[0][0]]}")
print(f"\t2️⃣ {sorted_results_1[1][0]} (score = {sorted_results_1[1][1]}), relevance = {qrels['query_1'][sorted_results_1[1][0]]}")
print(f"\t3️⃣ {sorted_results_1[2][0]} (score = {sorted_results_1[2][1]}), relevance = {qrels['query_1'][sorted_results_1[2][0]]}")
print("📈 Metrics:")
print(f"\tP@3:  {results_system_1[P@3]:.4f}")
print(f"\tnDCG: {results_system_1[nDCG]:.4f}")
print("-" * 80)
print(f'📊 Search engine 2 rankings for query 🔎 "{query.query}":')
print(f"\t1️⃣ {sorted_results_2[0][0]} (score = {sorted_results_2[0][1]}), relevance = {qrels['query_1'][sorted_results_2[0][0]]}")
print(f"\t2️⃣ {sorted_results_2[1][0]} (score = {sorted_results_2[1][1]}), relevance = {qrels['query_1'][sorted_results_2[1][0]]}")
print(f"\t3️⃣ {sorted_results_2[2][0]} (score = {sorted_results_2[2][1]}), relevance = {qrels['query_1'][sorted_results_2[2][0]]}")
print("📈 Metrics:")
print(f"\tP@3:  {results_system_2[P@3]:.4f}")
print(f"\tnDCG: {results_system_2[nDCG]:.4f}")

📊 Search engine 1 rankings for query 🔎 "bumble bee homeotic shift":
	1️⃣ doc_1 (score = 0.9), relevance = 3
	2️⃣ doc_5 (score = 0.7), relevance = 2
	3️⃣ doc_3 (score = 0.2), relevance = 1
📈 Metrics:
	P@3:  1.0000
	nDCG: 0.8535
--------------------------------------------------------------------------------
📊 Search engine 2 rankings for query 🔎 "bumble bee homeotic shift":
	1️⃣ doc_2 (score = 1.0), relevance = 1
	2️⃣ doc_4 (score = 0.5), relevance = 1
	3️⃣ doc_5 (score = 0.3333333333333333), relevance = 2
📈 Metrics:
	P@3:  1.0000
	nDCG: 0.4715


### Comparing the results and varying the relevance threshold
Note that, despite both systems having the same precision, the ranking that the first system produces is considerably better than the second one, with the documents retrieved being, on average, more relevant. This is better captured by the nDCG metric, which takes into account the relevance of the documents and their order in the ranking.

One phenomenon that we have observed is that LLMs tend to _overestimate_ the quality of the retrieved documents when compared to human annotators. One way to compensate for this is to apply a stricter relevance threshold when evaluating the results:

In [195]:
qrels = query.get_qrels(relevance_threshold=2) # Set the minimum relevance threshold to 2
runs = query.get_runs()
from ir_measures import P

results_system_1 = ir_measures.calc_aggregate([P@3, nDCG], qrels, runs["search_engine_1"])
results_system_2 = ir_measures.calc_aggregate([P@3, nDCG], qrels, runs["search_engine_2"])

sorted_results_1 = sorted(runs["search_engine_1"]["query_1"].items(), key=lambda x: x[1], reverse=True)
sorted_results_2 = sorted(runs["search_engine_2"]["query_1"].items(), key=lambda x: x[1], reverse=True)

print(f'📊 Search engine 1 rankings for query 🔎 "{query.query}":')
print(f"\t1️⃣ {sorted_results_1[0][0]} (score = {sorted_results_1[0][1]}), relevance = {qrels['query_1'][sorted_results_1[0][0]]}")
print(f"\t2️⃣ {sorted_results_1[1][0]} (score = {sorted_results_1[1][1]}), relevance = {qrels['query_1'][sorted_results_1[1][0]]}")
print(f"\t3️⃣ {sorted_results_1[2][0]} (score = {sorted_results_1[2][1]}), relevance = {qrels['query_1'][sorted_results_1[2][0]]}")
print("📈 Metrics:")
print(f"\tP@3:  {results_system_1[P@3]:.4f}")
print(f"\tnDCG: {results_system_1[nDCG]:.4f}")
print("-" * 80)
print(f'📊 Search engine 2 rankings for query 🔎 "{query.query}":')
print(f"\t1️⃣ {sorted_results_2[0][0]} (score = {sorted_results_2[0][1]}), relevance = {qrels['query_1'][sorted_results_2[0][0]]}")
print(f"\t2️⃣ {sorted_results_2[1][0]} (score = {sorted_results_2[1][1]}), relevance = {qrels['query_1'][sorted_results_2[1][0]]}")
print(f"\t3️⃣ {sorted_results_2[2][0]} (score = {sorted_results_2[2][1]}), relevance = {qrels['query_1'][sorted_results_2[2][0]]}")
print("📈 Metrics:")
print(f"\tP@3:  {results_system_2[P@3]:.4f}")
print(f"\tnDCG: {results_system_2[nDCG]:.4f}")


📊 Search engine 1 rankings for query 🔎 "bumble bee homeotic shift":
	1️⃣ doc_1 (score = 0.9), relevance = 3
	2️⃣ doc_5 (score = 0.7), relevance = 2
	3️⃣ doc_3 (score = 0.2), relevance = 0
📈 Metrics:
	P@3:  0.6667
	nDCG: 1.0000
--------------------------------------------------------------------------------
📊 Search engine 2 rankings for query 🔎 "bumble bee homeotic shift":
	1️⃣ doc_2 (score = 1.0), relevance = 0
	2️⃣ doc_4 (score = 0.5), relevance = 0
	3️⃣ doc_5 (score = 0.3333333333333333), relevance = 2
📈 Metrics:
	P@3:  0.3333
	nDCG: 0.2346


Note that the nDCG score for the first system is now 1.0 (up from 0.8535), as it ranks the two documents that remain relevant in their optimal order. The P@3 values are also more meaningful, as documents that the LLM considered only slightly relevant now count as irrelevant.
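The effect of the threshold on P@3 can be reproduced by hand. In the sketch below, `apply_threshold` is a hypothetical helper that sets every judgment below the threshold to 0, mirroring the relevance values printed above. It is an assumption about the observable behaviour of `get_qrels(relevance_threshold=2)`, not RAGElo's actual implementation.

```python
def apply_threshold(qrels, threshold):
    """Hypothetical helper: judgments below the threshold become 0 (assumption, see above)."""
    return {doc_id: (rel if rel >= threshold else 0) for doc_id, rel in qrels.items()}


def precision_at_k(qrels, run, k):
    """Fraction of the top-k documents with a relevance above 0."""
    top_k = sorted(run, key=run.get, reverse=True)[:k]
    return sum(1 for doc_id in top_k if qrels.get(doc_id, 0) > 0) / k


# Original LLM judgments and retrieval scores from this notebook.
qrels = {"doc_1": 3, "doc_2": 1, "doc_3": 1, "doc_4": 1, "doc_5": 2}
run_1 = {"doc_1": 0.9, "doc_5": 0.7, "doc_3": 0.2}
run_2 = {"doc_2": 1.0, "doc_4": 0.5, "doc_5": 1 / 3}

strict_qrels = apply_threshold(qrels, threshold=2)
strict_p3_system_1 = precision_at_k(strict_qrels, run_1, k=3)
strict_p3_system_2 = precision_at_k(strict_qrels, run_2, k=3)
print(f"{strict_p3_system_1:.4f} {strict_p3_system_2:.4f}")  # 0.6667 0.3333
```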