# **Evaluating multiple RAG pipelines with RAGElo**
Unlike other LLM and RAG evaluation frameworks that score each LLM answer on its own, RAGElo compares **pairs** of answers in an Elo-style tournament.

The idea is that, without a golden answer, an LLM can't reliably judge an individual answer in isolation. Deciding whether answer A is better than answer B is a more reliable signal of quality, and it makes it easier to decide whether a new RAG pipeline is better than another.

RAGElo works in three steps when evaluating a RAG pipeline:
1. Gather all documents retrieved by all agents and annotate their relevance to the user's query.
2. For each question, generate "_games_" between agents, asking the judging LLM to analyze if one agent is better than another, rather than assigning individual scores to each answer.
3. With these games, compute the Elo score for each agent, creating a final ranking of agents.
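
To make step 3 concrete, here is a minimal sketch of the textbook Elo update for a single game. It illustrates the general technique only and is not RAGElo's internal implementation; the function names `expected_score` and `elo_update` are made up for this example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one game.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two agents start level; agent A wins the game.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```

A win against an equally rated opponent moves half of the k-factor from the loser to the winner, and a tie between equally rated agents changes nothing.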

Importantly, RAGElo is **agnostic** to your pipeline: it will _not_ call your RAG system directly, so it works with any framework or pipeline you use. When used as a library, as we do here, a collection of queries is managed by an `Experiment` object, which we initialize after the setup step.

## 1. Import packages and set up the OpenAI API key

In [53]:
import os
import random
import json
from getpass import getpass

import openai

from ragelo import (
    get_agent_ranker,
    get_answer_evaluator,
    get_llm_provider,
    get_retrieval_evaluator,
)

# RAGElo is built around Experiments, each of which holds multiple Queries.
from ragelo import Experiment

if not (openai_api_key := os.environ.get("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key


## 3. Load the queries, documents retrieved and answers generated by your RAG pipelines
The simplest way to load queries, documents and answers from multiple pipelines is to pass their CSV paths in the `Experiment` initialization:

In [67]:
experiment = Experiment(
    experiment_name="RAGELo_evaluation",
    queries_csv_path="./data/queries.csv",
    documents_csv_path="./data/documents.csv",
    answers_csv_path="./data/answers.csv",
    csv_agent_col="agent", 
    verbose=True,
)

## 4. Prepare the Evaluators
RAGElo uses _evaluators_ as judges. We will instantiate a **retrieval evaluator**, an **answer evaluator** and an **agent ranker** with their corresponding settings.

In [69]:
# The LLM provider can be shared across all evaluators.
llm_provider = get_llm_provider("openai", model_name="gpt-4o-mini")

# The DomainExpertEvaluator and the PairwiseDomainExpertEvaluator mimic a persona that is an expert in a given field from a specific company.
# The evaluators will add this persona and company information to the prompts when evaluating the answers.
kwargs = {
    "llm_provider": llm_provider,
    "expert_in": "the details of how to better use the Qdrant vector database and vector search engine",
    "company": "Qdrant",
    "n_processes": 20, # How many threads to use when evaluating the retrieved documents. Will do that many parallel calls to OpenAI.
    "rich_print": True, # Whether or not to use rich to print colorful outputs.
    # "force": True, # Whether or not to overwrite any existing files.
}

retrieval_evaluator = get_retrieval_evaluator(
    "domain_expert",
    **kwargs,
)

answer_evaluator = get_answer_evaluator(
    "domain_expert",
    **kwargs,
    bidirectional=False, # Whether or not to evaluate the answers in both directions.
    n_games_per_query=20, # The number of games to play for each query.
    document_relevance_threshold=2, # The minimum relevance score a document needs to have to be considered relevant.
)

# The Elo ranker doesn't need an LLM. Here, we instantiate it with the basic settings.
# We can also instantiate evaluators and rankers by passing the configuration directly to the get_agent_ranker or get_*_evaluator functions:
elo_ranker = get_agent_ranker(
    "elo",
    k=32,  # The k-factor for the Elo ranking algorithm
    initial_score=1000,  # Initial score for the agents. This will be updated after each game.
    rounds=1000, # Number of tournaments to play
    **kwargs,
)



## 5. Call the evaluators
Now, we actually run the evaluators. 

As the evaluators run, we write their intermediate outputs to disk. In this example, each evaluation output is written as a new line in `.ragelo_cache/RAGELo_evaluation_results.jsonl`. If something breaks, we can reload the experiment without re-running the evaluations that were already completed.

After each evaluator finishes running, it also dumps the whole `Experiment` state to `./ragelo_cache/RAGELo_evaluation.json`. This lets us reload the experiment later and reuse it somewhere else.
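
The resume behavior described above can be sketched with an append-only JSONL file. This is a generic illustration of the pattern, not RAGElo's own cache code: the record layout (`id`, `result`), the helper names and the temporary file path are all assumptions made for the example.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the ids of items already recorded in a JSONL cache file."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def evaluate_all(items, judge, path):
    """Evaluate items, appending one JSON line per result and skipping cached ones.

    Returns how many times the (expensive) judge was actually called.
    """
    done = load_done(path)
    calls = 0
    with open(path, "a", encoding="utf-8") as f:
        for item_id, payload in items.items():
            if item_id in done:
                continue  # already evaluated in a previous run
            result = judge(payload)
            f.write(json.dumps({"id": item_id, "result": result}) + "\n")
            f.flush()  # keep the file current in case a later call fails
            calls += 1
    return calls


# Hypothetical items; `len` stands in for an LLM call.
cache_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
items = {"q1-d1": "doc text 1", "q1-d2": "doc text 2"}
first_run = evaluate_all(items, judge=len, path=cache_path)
second_run = evaluate_all(items, judge=len, path=cache_path)
print(first_run, second_run)  # 2 0
```

The second call makes no judge calls because every item is already in the file, which is the property that makes an interrupted run cheap to resume.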

As the focus of RAGElo is to evaluate the quality of RAG pipelines, the **retrieval** component is extremely important. The answers of the agents are evaluated based not only on the quality of the documents they retrieved themselves, but on the quality of all the documents retrieved by any agent. The intuition here is that if there are many relevant documents in the corpus, potentially retrieved by other agents, the evaluation should take these into account, even if a specific agent did not retrieve them.

When evaluating a (pair of) answer(s), the LLM sees _all_ the relevant documents retrieved by any agent and compares the answers against that shared pool.
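
A minimal sketch of this pooling idea follows, assuming plain dictionaries rather than RAGElo's objects. The function `pool_relevant_documents` and the sample data are hypothetical; the threshold of 2 mirrors the `document_relevance_threshold=2` setting used earlier.

```python
def pool_relevant_documents(retrieved_by_agent, relevance, threshold=2):
    """Return every document retrieved by any agent that meets the threshold."""
    pooled = set()
    for doc_ids in retrieved_by_agent.values():
        pooled.update(doc_ids)
    # Documents without a relevance score are treated as not relevant.
    return sorted(d for d in pooled if relevance.get(d, 0) >= threshold)


# Hypothetical retrieval results for one query.
retrieved_by_agent = {
    "agent_0": ["d1", "d2"],
    "agent_1": ["d2", "d3"],
}
relevance = {"d1": 0, "d2": 2, "d3": 2}

pooled = pool_relevant_documents(retrieved_by_agent, relevance)
print(pooled)  # ['d2', 'd3']
```

Here `agent_0` never retrieved `d3`, yet its answer would still be judged against a pool that contains it.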

In [70]:
# Evaluate all the retrieved documents for all the queries
retrieval_evaluator.evaluate_experiment(experiment) 
# As the evaluator is a PairwiseDomainExpertEvaluator, it will create random pairs of agents' answers for the same query and evaluate them.
answer_evaluator.evaluate_experiment(experiment)
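
To give an intuition for how random pairs of answers could be drawn for one query, here is a small sketch. It is an assumption about the general approach, not RAGElo's actual sampling code: `sample_games` and its `seed` parameter are invented for the example, while `n_games` and `bidirectional` loosely mirror the `n_games_per_query` and `bidirectional` settings above.

```python
import itertools
import random


def sample_games(agents, n_games, bidirectional=False, seed=0):
    """Pick up to n_games random pairs of agents for a single query."""
    pairs = list(itertools.combinations(agents, 2))
    if bidirectional:
        # Also play each pair with the A/B positions swapped.
        pairs += [(b, a) for a, b in pairs]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    return pairs[:n_games]


agents = [f"agent_{i}" for i in range(6)]
games = sample_games(agents, n_games=20)
print(len(games))  # 15
```

With six agents there are only 15 distinct unordered pairs, so a cap of 20 games per query is never reached unless games are played in both directions.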

### Let's see what is happening under the hood
An experiment contains multiple queries, each with its own documents, answers and pairwise games.
We will select a random query and take a look at its contents and the evaluations of the documents, answers and pairwise games.

In more detail:
- An experiment holds `Query` objects. Each one contains the query itself, the documents retrieved by all the agents (`retrieved_docs`), the answers generated by each agent (`answers`) and the pairwise games that the AnswerEvaluator generated (`pairwise_games`).
- Each document, answer and pairwise game may have an evaluation (`.evaluation`) that contains the raw LLM output for the object and a parsed version of it, according to the evaluator's settings.

In [71]:

query = random.choice(list(experiment.queries.values()))

print("🔎 The query object:")
print(f'\t💬 Query text: "{query.query}"')
print(f"\t📚 {len(query.retrieved_docs)} retrieved documents by all agents")
average_relevance = sum([int(d.evaluation.answer) for d in query.retrieved_docs.values()]) / len(
    query.retrieved_docs
)
print(f"\t📊 Average relevance score of the retrieved documents : {average_relevance}")
print(f"\t🕵️ {len(query.answers)} Agents answered the query")
print(f"\t🏆 {len(query.pairwise_games)} games were evaluated for this query")
document = random.choice(list(query.retrieved_docs.values()))
print("-" * 80)
print("📜 The document object:")
print(f'\t📄 Document text: "{document.text[:100]}" (...)')
print("\t📈 Document's evaluation:")
document_evaluation = document.evaluation
print(
    f'\t\t💭 LLM\'s raw output for the evaluation (reasoning): "{document_evaluation.raw_answer[:100]}" (...)'
)
print(
    f"\t\t💯 Document's relevance score (between 0 and 2): {document_evaluation.answer}"
)
print("-" * 80)
print("🆚 Pairwise games played:")
game = random.choice(query.pairwise_games)
llm_raw_answer = json.loads(game.evaluation.raw_answer)
print(
    f"\tGame between agents 🕵️{game.agent_a_answer.agent} 🆚 🕵️{game.agent_b_answer.agent}"
)
print(
    f'\t💭 LLM\'s reasoning for the quality of the answer of agent A: "{llm_raw_answer["answer_a_reasoning"][:100]} (...)"'
)
print(
    f'\t💭 LLM\'s reasoning for the quality of the answer of agent B: "{llm_raw_answer["answer_b_reasoning"][:100]} (...)"'
)
print(f'\t💭 LLM\'s reasoning when comparing the two answers: "{llm_raw_answer["comparison_reasoning"][:100]} (...)"')
best_agent = game.evaluation.answer
if best_agent == "A":
    best_agent = game.agent_a_answer.agent
elif best_agent == "B":
    best_agent = game.agent_b_answer.agent
elif best_agent == "C":
    best_agent = "TIE"
print(f"\t💯 Game's winner: {best_agent}")


🔎 The query object:
	💬 Query text: "What is vaccum optimizer ?"
	📚 5 retrieved documents by all agents
	📊 Average relevance score of the retrieved documents : 0.4
	🕵️ 6 Agents answered the query
	🏆 15 games were evaluated for this query
--------------------------------------------------------------------------------
📜 The document object:
	📄 Document text: "--- title: Optimize Resources weight: 11 aliases: - ../tutorials/optimize --- # Optimize Qdrant Diff" (...)
	📈 Document's evaluation:
		💭 LLM's raw output for the evaluation (reasoning): "To evaluate the relevance of the retrieved document passage in relation to the user query "What is v" (...)
		💯 Document's relevance score (between 0 and 2): 0.0
--------------------------------------------------------------------------------
🆚 Pairwise games played:
	Game between agents 🕵️agent_1 🆚 🕵️agent_3
	💭 LLM's reasoning for the quality of the answer of agent A: "Assistant A provides a clear and concise explanation of the Vacuum Optimizer in

## 6. Rank the agents
Based on the results of the games played, we now run the Elo ranker to determine which agent wins the tournament.

The result of a single tournament can vary slightly from one run to the next. Therefore, we run the tournament multiple times and average the results to get a more stable ranking.
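
The averaging idea can be sketched as follows. This is an illustration under stated assumptions rather than RAGElo's ranker: it assumes each tournament replays the same games in a freshly shuffled order, and the helpers `run_tournament` and `average_ratings` as well as the toy games are made up for the example.

```python
import random
import statistics


def run_tournament(games, agents, rng, k=32.0, initial_score=1000.0):
    """Play all games once, in random order, and return the final ratings."""
    order = list(games)
    rng.shuffle(order)
    ratings = {agent: initial_score for agent in agents}
    for agent_a, agent_b, score_a in order:
        exp_a = 1.0 / (1.0 + 10 ** ((ratings[agent_b] - ratings[agent_a]) / 400.0))
        delta = k * (score_a - exp_a)
        ratings[agent_a] += delta
        ratings[agent_b] -= delta
    return ratings


def average_ratings(games, agents, rounds=100, seed=0):
    """Average each agent's final rating over several shuffled tournaments."""
    rng = random.Random(seed)
    runs = [run_tournament(games, agents, rng) for _ in range(rounds)]
    return {agent: statistics.mean(run[agent] for run in runs) for agent in agents}


# Toy games as (agent A, agent B, score for A): 1.0 = A wins, 0.5 = tie, 0.0 = B wins.
agents = ["agent_a", "agent_b", "agent_c"]
games = [("agent_a", "agent_b", 1.0), ("agent_b", "agent_c", 1.0), ("agent_a", "agent_c", 1.0)]

averaged = average_ratings(games, agents)
ranking = sorted(agents, key=averaged.get, reverse=True)
print(ranking[0], ranking[-1])  # agent_a agent_c
```

Because each Elo update depends on the current ratings, replaying the same games in a different order gives slightly different final scores; averaging over many orders smooths this out.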

In [73]:
elo_ranker = get_agent_ranker(
    "elo",
    k=32,  # The k-factor for the Elo ranking algorithm
    initial_score=1000,  # Initial score for the agents. This will be updated after each game.
    rounds=1000, # Number of tournaments to play
    **kwargs,
)

elo_ranker.run(experiment)

EloTournamentResult(agents=['agent_1', 'agent_4', 'agent_0', 'agent_3', 'agent_5', 'agent_2'], scores={'agent_1': 1295.5, 'agent_4': 689.5, 'agent_0': 822.2, 'agent_3': 815.6, 'agent_5': 1157.3, 'agent_2': 1072.2}, games_played={'agent_1': 500, 'agent_4': 500, 'agent_0': 500, 'agent_3': 500, 'agent_5': 500, 'agent_2': 500}, wins={'agent_4': 210, 'agent_3': 210, 'agent_5': 240, 'agent_1': 270, 'agent_2': 250, 'agent_0': 210}, loses={'agent_1': 210, 'agent_0': 240, 'agent_3': 240, 'agent_5': 220, 'agent_4': 270, 'agent_2': 210}, ties={'agent_0': 50, 'agent_5': 40, 'agent_3': 50, 'agent_4': 20, 'agent_1': 20, 'agent_2': 40}, std_dev={'agent_1': 210.61873136072205, 'agent_4': 163.4290365877496, 'agent_0': 238.2627960886886, 'agent_3': 232.20602920682313, 'agent_5': 133.84995330593134, 'agent_2': 252.6985555953971}, total_games=1500, total_tournaments=10)