# **Evaluating multiple RAG pipelines with RAGElo**
Unlike other LLM and RAG evaluation frameworks that score each LLM answer individually, RAGElo compares **pairs** of answers in an Elo-style tournament.

The idea is that, without a golden answer, LLMs can't reliably judge an answer in isolation. Judging whether answer A is better than answer B is a more reliable signal of quality, and makes it easier to decide whether one RAG pipeline is better than another.

RAGElo evaluates RAG pipelines in three steps:
1. Gather all documents retrieved by all agents and annotate their relevance to the user's query.
2. For each question, generate "_games_" between agents, asking the judging LLM to analyze if one agent is better than another, rather than assigning individual scores to each answer.
3. With these games, compute the Elo score for each agent, creating a final ranking of agents.

Importantly, RAGElo is **agnostic** to your pipeline: it does not call your RAG system directly, and it works with any framework or pipeline you use. When RAGElo is used as a library, as we do here, you create the `Query` objects with your agents' answers, as we show below.

## 1. Import packages and set up the OpenAI API key

In [5]:
from __future__ import annotations

import glob
import os
import pickle
import random
from getpass import getpass

import openai
import pandas as pd

from ragelo import (
    get_agent_ranker,
    get_answer_evaluator,
    get_llm_provider,
    get_retrieval_evaluator,
)

# RAGElo is built around Experiments, each of which holds multiple Queries.
from ragelo import Experiment

if not (openai_api_key := os.environ.get("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key


## 3. Load the queries, documents retrieved and answers generated by your RAG pipelines
The simplest way to load queries, documents and answers from multiple pipelines is to pass their CSV paths in the `Experiment` initialization:

In [6]:
experiment = Experiment("RAGELo_evaluation", queries_csv_path="./data/queries.csv")

In [13]:
experiment = Experiment(experiment_name="RAGELo_evaluation", queries_csv_path="./data/queries.csv")
# queries = load_answers_from_multiple_csvs(
#     csvs,  # A list of all the CSVs with the answers produced by our RAG pipelines
#     query_text_col="question",  # this tells RAGelo which column of the CSV has the query itself.
# )
# query_ids = {q.query: q.qid for q in queries}
# query_dict = {q.qid: q for q in queries}

# print(f"Loaded {len(queries)} queries from {len(csvs)} files.")

Depending on how you store your queries, answers and documents, you may need to parse them into the `Query` object.

Here, we assume that the answers are stored in a CSV file with the following format, extracted from a RAG pipeline based on Qdrant:
- `question`: The query itself
- `answer`: The answer generated by the RAG pipeline
- `contexts`: A string with the documents retrieved by the RAG pipeline, in the following format:
    - `document:<document_text>`
    - `source:<document_source>`
    - `relevance:<document_relevance>`

In [5]:

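def parse_docs(raw_docs: str) -> list[tuple[str, str]]:
    """Parse the `contexts` string into (source, text) tuples, one per line."""
    documents = []
    for d in raw_docs.split("\n"):
        if not d.strip():
            continue  # Skip blank lines, which would otherwise raise an IndexError
        # Keep what follows each marker and drop any trailing brackets
        doc_text = d.split("document:", maxsplit=1)[1].strip("]")
        doc_source = d.split("source:", maxsplit=1)[1].strip("]")
        documents.append((doc_source, doc_text))
    return documents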


for csv in csvs:
    df = pd.read_csv(csv)
    for i, row in df.iterrows():
        query_id = query_ids[row["question"]]
        answer = row["answer"]
        docs = parse_docs(row["contexts"])
        query = query_dict[query_id]
        for doc_source, doc_text in docs:
            query.add_retrieved_doc(
                doc_text,
                doc_source,  # Here, we use the source as the id of the documents
            )

## 4. Prepare the Evaluators
RAGElo uses _evaluators_ as judges. We will instantiate a **retrieval evaluator**, an **answer evaluator** and an **agent ranker**, each with its corresponding settings.

In [6]:
# The LLM provider can be shared across all evaluators.
llm_provider = get_llm_provider("openai", model_name="gpt-4o-mini", max_tokens=2048)

# The DomainExpertEvaluator and the PairwiseDomainExpertEvaluator mimic a persona that is an expert in a given field from a specific company.
# The evaluators will add this persona and company information to the prompts when evaluating the answers.
kwargs = {
    "llm_provider": llm_provider,
    "expert_in": "the details of how to better use the Qdrant vector database and vector search engine",
    "company": "Qdrant",
    "n_processes": 20, # How many threads to use when evaluating the retrieved documents. Will do that many parallel calls to OpenAI.
    "rich_print": True, # Whether or not to use rich to print colorful outputs.
    "force": True, # Whether or not to overwrite any existing files.
}

retrieval_evaluator = get_retrieval_evaluator(
    "domain_expert",
    **kwargs,
)

answer_evaluator = get_answer_evaluator(
    "domain_expert",
    **kwargs,
    bidirectional=False, # Whether or not to evaluate the answers in both directions.
    n_games_per_query=20, # The number of games to play for each query.
    document_relevance_threshold=2, # The minimum relevance score a document needs to have to be considered relevant.
)

# The Elo ranker doesn't need an LLM. Here, we instantiate it with the basic settings.
# We can also instantiate evaluators and rankers by passing the configuration directly to the get_agent_ranker or get_*_evaluator functions:
elo_ranker = get_agent_ranker(
    "elo",
    verbose=True,  # We want to see the final ranking, not only write to an output file.
    k=32,  # The k-factor for the Elo ranking algorithm
    initial_score=1000,  # Initial score for the agents. This will be updated after each game.
    rounds=1000, # Number of tournaments to play
    **kwargs,
)
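
To make the `n_games_per_query` and `bidirectional` settings concrete, here is a minimal sketch of how pairwise games for a single query could be drawn. It is an illustration under assumptions, not RAGElo's implementation: `sample_games` and its arguments are hypothetical names, and the way `bidirectional` doubles the games is a guess at the intent of "both directions".

```python
from __future__ import annotations

import itertools
import random


def sample_games(
    agents: list[str],
    n_games: int,
    bidirectional: bool = False,
    seed: int | None = None,
) -> list[tuple[str, str]]:
    """Draw up to `n_games` (agent_a, agent_b) pairs for a single query."""
    rng = random.Random(seed)
    # Every unordered pair of distinct agents is a candidate game.
    pairs = list(itertools.combinations(agents, 2))
    rng.shuffle(pairs)
    pairs = pairs[:n_games]
    if bidirectional:
        # Assumption: "both directions" means replaying each game with A and B swapped.
        pairs += [(b, a) for a, b in pairs]
    return pairs


agents = [f"agent_{i}" for i in range(6)]
games = sample_games(agents, n_games=20, seed=0)
print(len(games))  # 15: six agents only allow 6 * 5 / 2 unordered pairs
```

With six agents, asking for 20 games per query can yield at most 15 distinct pairings, since each pair of agents meets only once when `bidirectional` is off.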



## 5. Call the evaluators
Now, we actually run the evaluators. 

Note that, as we go, we add information to each `Query` object instead of just dumping everything into CSV or JSON files. This is by design. The `Query` object is also a Pydantic model, so it can be easily serialized to JSON by calling `query.model_dump_json()` or pickled by calling `pickle.dumps(query)`.
This also avoids re-evaluating the same document multiple times for the same query. The evaluator will also not re-evaluate a query (or answer) that was already evaluated, unless the `force` parameter is set to `True` in its configuration.
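
The skip-unless-`force` behaviour can be pictured with a small sketch. The `Item` class, `evaluate_all` function and `fake_judge` below are hypothetical stand-ins written for this illustration; they are not part of RAGElo.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for a retrieved document or an answer."""

    text: str
    evaluation: str | None = None  # None until a judge has evaluated it


def evaluate_all(items: list[Item], judge, force: bool = False) -> int:
    """Run `judge` on the items that need it; return how many calls were made."""
    calls = 0
    for item in items:
        if item.evaluation is not None and not force:
            continue  # Already evaluated: reuse the stored result
        item.evaluation = judge(item.text)
        calls += 1
    return calls


def fake_judge(text: str) -> str:
    """Placeholder for an LLM call."""
    return "2"


items = [Item("doc about storage"), Item("doc about indexing", evaluation="1")]
first = evaluate_all(items, fake_judge)  # Only the unevaluated item is judged
second = evaluate_all(items, fake_judge)  # Nothing left to do
forced = evaluate_all(items, fake_judge, force=True)  # Everything is re-judged
print(first, second, forced)  # 1 0 2
```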

As the focus of RAGElo is to evaluate the quality of RAG pipelines, the __retrieval__ component is extremely important. The agents' answers are evaluated based not only on the quality of the documents each agent retrieved, but on the quality of all the documents retrieved by any agent. The intuition is that if the corpus contains many relevant documents, potentially retrieved by other agents, the evaluation should take them into account even if a specific agent did not retrieve them.

When evaluating an answer or a pair of answers, the LLM sees _all_ the relevant documents retrieved by any agent, and compares the answers against that shared pool of documents.
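
As a rough sketch of that pooling step, the snippet below merges the documents retrieved by every agent for one query and keeps those at or above a relevance threshold (compare `document_relevance_threshold` above). The data and the `pooled_relevant_docs` helper are made up for illustration and do not reflect RAGElo's internals.

```python
from __future__ import annotations

# Hypothetical data for a single query: what each agent retrieved,
# and the relevance score the retrieval evaluator gave each document.
retrieved_by_agent = {
    "agent_a": ["doc_1", "doc_2"],
    "agent_b": ["doc_2", "doc_3"],
}
relevance = {"doc_1": 0, "doc_2": 2, "doc_3": 1}


def pooled_relevant_docs(
    retrieved: dict[str, list[str]],
    relevance: dict[str, int],
    threshold: int,
) -> list[str]:
    """Union of all agents' documents whose relevance is at least `threshold`."""
    pool: list[str] = []
    for docs in retrieved.values():
        for doc_id in docs:
            if doc_id not in pool and relevance[doc_id] >= threshold:
                pool.append(doc_id)
    return pool


print(pooled_relevant_docs(retrieved_by_agent, relevance, threshold=2))  # ['doc_2']
print(pooled_relevant_docs(retrieved_by_agent, relevance, threshold=1))  # ['doc_2', 'doc_3']
```

Both agents' answers would then be judged against the same pool, so an agent that missed `doc_2` is still compared with the information it contains.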

In [7]:
# Evaluate all the retrieved documents for all the queries
queries = retrieval_evaluator.batch_evaluate(queries) 

# As the evaluator is a PairwiseDomainExpertEvaluator, it will create random pairs of agent's answers for the same query and evaluate them.
queries = answer_evaluator.batch_evaluate(queries)


In [8]:
with open("queries.pkl", "wb") as f:  # Save the results to disk to avoid re-evaluating everything
    pickle.dump(queries, f)

### Let's see what is happening under the hood

- A query object contains the query itself, the documents retrieved by all the agents, the answers generated by each agent and the pairwise games that the AnswerEvaluator generated.
- There are three kinds of objects that can be evaluated: retrieved documents, agent answers and pairwise games.
- Each evaluable object has an `evaluation` object that, after being evaluated, will contain the raw LLM output for the object and a parsed version of it, according to the evaluator's settings.
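
The relationships in the list above can be sketched with plain dataclasses. These are simplified stand-ins, not RAGElo's actual classes: the attribute names mirror the ones used in the next cell, while the container types (for example `answers` being a list) are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Evaluation:
    raw_answer: str  # Full LLM output
    answer: str  # Parsed result, e.g. a relevance score or "A" / "B" / "C"


@dataclass
class Document:
    text: str
    evaluation: Evaluation | None = None


@dataclass
class AgentAnswer:
    agent: str
    text: str
    evaluation: Evaluation | None = None


@dataclass
class PairwiseGame:
    agent_a_answer: AgentAnswer
    agent_b_answer: AgentAnswer
    evaluation: Evaluation | None = None


@dataclass
class Query:
    query: str
    retrieved_docs: dict[str, Document] = field(default_factory=dict)
    answers: list[AgentAnswer] = field(default_factory=list)
    pairwise_games: list[PairwiseGame] = field(default_factory=list)


def winner(game: PairwiseGame) -> str | None:
    """Map the judge's verdict ("A", "B" or "C" for a tie) to an agent name."""
    if game.evaluation is None:
        return None  # Not evaluated yet
    verdicts = {
        "A": game.agent_a_answer.agent,
        "B": game.agent_b_answer.agent,
        "C": "TIE",
    }
    return verdicts[game.evaluation.answer]


a = AgentAnswer("agent_a", "answer from A")
b = AgentAnswer("agent_b", "answer from B")
game = PairwiseGame(a, b, Evaluation(raw_answer="B is more complete.", answer="B"))
q = Query("example question", answers=[a, b], pairwise_games=[game])
print(winner(q.pairwise_games[0]))  # agent_b
```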

In [13]:

query = random.choice(queries)

print("🔎 The query object:")
print(f'\t💬 Query text: "{query.query}"')
print(f"\t📚 {len(query.retrieved_docs)} retrieved documents by all agents")
average_relevance = sum([int(d.evaluation.answer) for d in query.retrieved_docs.values()]) / len(
    query.retrieved_docs
)
print(f"\t📊 Average relevance score of the retrieved documents : {average_relevance}")
print(f"\t🕵️ {len(query.answers)} Agents answered the query")
print(f"\t🏆 {len(query.pairwise_games)} games were evaluated for this query")
document = random.choice(list(query.retrieved_docs.values()))
print("-" * 80)
print("📜 The document object:")
print(f'\t📄 Document text: "{document.text[:100]}" (...)')
print("-" * 80)
print("📈 Document's evaluation:")
document_evaluation = document.evaluation
print(
    f'\t💭 LLM\'s reasoning for the evaluation: "{document_evaluation.raw_answer[:100]}" (...)'
)
print(
    f"\t💯 Document's relevance score (between 0 and 2): {document_evaluation.answer}"
)
print("-" * 80)
print("🆚 Pairwise games played:")
game = random.choice(query.pairwise_games)
print(
    f"\tGame between agents 🕵️{game.agent_a_answer.agent} 🆚 🕵️{game.agent_b_answer.agent}"
)
print(
    f'\t💭 LLM\'s reasoning for the evaluation: "{game.evaluation.raw_answer[:100]} (...)"'
)
best_agent = game.evaluation.answer
if best_agent == "A":
    best_agent = game.agent_a_answer.agent
elif best_agent == "B":
    best_agent = game.agent_b_answer.agent
elif best_agent == "C":
    best_agent = "TIE"
print(f"\t💯 Game's winner: {best_agent}")


🔎 The query object:
	💬 Query text: "What is significance of ‘on_disk_payload’ setting?"
	📚 8 retrieved documents by all agents
	📊 Average relevance score of the retrieved documents : 0.5
	🕵️ 6 Agents answered the query
	🏆 15 games were evaluated for this query
--------------------------------------------------------------------------------
📜 The document object:
	📄 Document text: "--- title: Storage weight: 80 aliases: - ../storage --- # Storage All data within one collection is " (...)
--------------------------------------------------------------------------------
📈 Document's evaluation:
	💭 LLM's reasoning for the evaluation: "To evaluate the relevance of the retrieved document passage to the user query regarding the signific" (...)
	💯 Document's relevance score (between 0 and 2): 1
--------------------------------------------------------------------------------
🆚 Pairwise games played:
	Game between agents 🕵️rag_response_512_5 🆚 🕵️rag_response_512_4
	💭 LLM's reasoning for the evalu

## 6. Rank the agents
Based on the results of the games played, we now run the Elo ranker to determine which agent wins the tournament.

Elo updates depend on the order in which the games are processed, so the scores can vary slightly from one tournament run to another. Therefore, we re-run the tournament multiple times and average the results to get a more stable ranking.

In [14]:
elo_ranks = elo_ranker.run(queries)
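
For readers unfamiliar with Elo, the following is a minimal, self-contained sketch of the standard Elo update averaged over several shuffled tournaments. It assumes that each round replays the same games in a random order; the function names and the example games are hypothetical, and RAGElo's own ranker may differ in its details.

```python
from __future__ import annotations

import random
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def play_tournament(games, k=32, initial_score=1000.0, rng=None):
    """Process (agent_a, agent_b, score_a) games once and return final ratings."""
    ratings = defaultdict(lambda: initial_score)
    order = list(games)
    if rng is not None:
        rng.shuffle(order)  # Elo is order-dependent, so each round uses a new order
    for agent_a, agent_b, score_a in order:
        delta = k * (score_a - expected_score(ratings[agent_a], ratings[agent_b]))
        ratings[agent_a] += delta
        ratings[agent_b] -= delta  # Zero-sum: B loses exactly what A gains
    return dict(ratings)


def average_elo(games, rounds=100, seed=0, k=32, initial_score=1000.0):
    """Average the final ratings over `rounds` shuffled tournaments."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(rounds):
        result = play_tournament(games, k=k, initial_score=initial_score, rng=rng)
        for agent, rating in result.items():
            totals[agent] += rating
    return {agent: total / rounds for agent, total in totals.items()}


# score_a is 1.0 if A wins, 0.0 if B wins and 0.5 for a tie.
example_games = [
    ("pipeline_x", "pipeline_y", 1.0),
    ("pipeline_x", "pipeline_z", 1.0),
    ("pipeline_y", "pipeline_z", 0.5),
]
averaged = average_elo(example_games)
ranking = sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)
for agent, rating in ranking:
    print(f"{agent}: {rating:.1f}")
```

Because every update is zero-sum, the ratings always add up to the number of agents times `initial_score`, which makes for a quick sanity check on any ranking.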