In [1]:
%load_ext autoreload
%autoreload 2

# **Evaluating multiple RAG pipelines with RAGElo**
Unlike other LLM and RAG evaluation frameworks that score each LLM answer in isolation, RAGElo compares **pairs** of answers in an Elo-style tournament.

The idea is that, without a golden answer, LLMs can't really judge an individual answer in isolation. Judging whether answer A is better than answer B is a more reliable signal of quality, and makes it easier to decide whether a new RAG pipeline is better than another.

RAGElo evaluates a RAG pipeline in three steps:
1. Gather all documents retrieved by all agents and annotate their relevance to the user's query.
2. For each question, generate "_games_" between agents, asking the judging LLM to analyze if one agent is better than another, rather than assigning individual scores to each answer.
3. With these games, compute the Elo score for each agent, creating a final ranking of agents.
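
Step 3 relies on an Elo-style rating update. The sketch below shows the standard Elo formula for a single game. It is an illustration only, not RAGElo's actual implementation. The function names `expected_score` and `update_elo` are hypothetical, and the values `k=32` and an initial rating of 1000 mirror the ranker settings used later in this notebook.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Return both ratings after one game.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    expected_b = 1.0 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b


# Two agents start at 1000; agent A wins one game.
rating_a, rating_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(rating_a, rating_b)  # 1016.0 984.0
```

Because the two expected scores sum to one, whatever the winner gains the loser gives up, so the total rating in the pool stays constant.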

Importantly, RAGElo is **agnostic** to your pipeline: it never calls your RAG system directly, and it works with any framework or pipeline you use. When RAGElo is used as a library, as it is here, you create the `Query` objects from your agents' answers, as we show below.

## 1. Import packages

In [2]:
import glob
import os
from getpass import getpass

import openai
import pandas as pd

from ragelo import (
    Query,
    get_agent_ranker,
    get_answer_evaluator,
    get_llm_provider,
    get_retrieval_evaluator,
)
from ragelo.types.configurations import (
    DomainExpertEvaluatorConfig,
    PairwiseDomainExpertEvaluatorConfig,
)
from ragelo.utils import load_answers_from_multiple_csvs


## 2. Set up the OpenAI API key

In [4]:
if not (openai_api_key := os.environ.get("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key


## 3. Load the answers generated by your RAG pipelines

In [5]:
data_folder = "../data/"
csvs = glob.glob(f"{data_folder}rag_response_*.csv")

queries = load_answers_from_multiple_csvs(
    csvs,  # A list of all the CSVs with the answers produced by our RAG pipelines.
    query_text_col="question",  # This tells RAGElo which column of the CSV holds the query itself.
)
query_ids = {q.query: q.qid for q in queries}
query_dict = {q.qid: q for q in queries}

### Parse the documents into the queries

Note that the retrieved documents are tied to the query itself, not to a single answer. In a RAG system, this means that if one pipeline failed to retrieve a relevant document that another pipeline did retrieve, the first performed _worse_ than the second.
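
To make the pooling idea concrete, here is a small sketch that uses plain Python data instead of RAGElo objects. The agent names, query ids and document sources are made up for illustration.

```python
from collections import defaultdict

# Hypothetical retrieval results: (query id, agent, document source) triples.
retrieved = [
    ("q1", "agent_a", "docs/indexing.md"),
    ("q1", "agent_b", "docs/indexing.md"),
    ("q1", "agent_b", "docs/optimizer.md"),
    ("q2", "agent_a", "docs/search.md"),
]

# Pool documents per query, ignoring which agent retrieved them.
pool = defaultdict(set)
for query_id, _agent, source in retrieved:
    pool[query_id].add(source)

# Each pooled document needs only one relevance judgment per query.
for query_id in sorted(pool):
    print(query_id, sorted(pool[query_id]))
```

In this toy data, `docs/indexing.md` is judged once for `q1` even though two agents retrieved it, and `agent_a` can be penalized for missing `docs/optimizer.md` if that document turns out to be relevant.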

In [6]:
def parse_docs(raw_docs) -> list[tuple[str, str]]:
    docs = raw_docs.split("\n")
    documents = []
    for d in docs:
        doc_text = d.split("document:", maxsplit=1)[1]
        doc_source = d.split("source:", maxsplit=1)[1]
        documents.append((doc_source, doc_text))
    return documents


for csv_path in csvs:
    df = pd.read_csv(csv_path)
    for _, row in df.iterrows():
        # Look up the Query object that matches this row's question.
        query = query_dict[query_ids[row["question"]]]
        docs = parse_docs(row["contexts"])
        for doc_source, doc_text in docs:
            query.add_retrieved_doc(
                doc_text,
                doc_source,  # Here, we use the source as the id of the documents
            )

## 4. Prepare the Evaluators
RAGElo uses _evaluators_ as judges. We will instantiate a **retrieval evaluator**, an **answer evaluator**, and an **agent ranker**, each with its corresponding settings.

In [7]:
# The LLM provider can be shared across all evaluators.
llm_provider = get_llm_provider("openai", model_name="gpt-4o", max_tokens=2048)


# The DomainExpertEvaluator mimics a persona that is an expert in a given field.
# The evaluator will add this persona to the prompt when evaluating the answers.
retrieval_evaluator_config = DomainExpertEvaluatorConfig(
    expert_in="the details of how to better use the Qdrant vector database and vector search engine",
    company="Qdrant",
    n_processes=20,  # How many threads to use when evaluating the retrieved documents. Will do that many parallel calls to OpenAI.
)

retrieval_evaluator = get_retrieval_evaluator(
    llm_provider=llm_provider, config=retrieval_evaluator_config
)



# The PairwiseDomainExpertEvaluator is an Answer evaluator similar to the one above, but evaluates pairs of answers.
answer_evaluator_config = PairwiseDomainExpertEvaluatorConfig(
    expert_in="the details of how to better use the Qdrant vector database and vector search engine",
    company="Qdrant",
    n_processes=20,
    bidirectional=True,  # LLMs may have a positional bias, so evaluate each game with the agents' answers in both positions (left and right).
    n_games_per_query=10,  # The maximum number of games per query to generate. In this case, 10 random games between agents will be evaluated.
    document_relevance_threshold=2,  # The minimum relevance score for a document to be considered relevant. Documents with a lower score are not considered in the evaluation.
)


answer_evaluator = get_answer_evaluator(
    llm_provider=llm_provider, config=answer_evaluator_config
)

# The Elo ranker doesn't need an LLM. Here, we instantiate it with the basic settings.
# We can also instantiate evaluators and rankers by passing the settings directly to get_agent_ranker or the get_*_evaluator functions, as done here:
elo_ranker = get_agent_ranker(
    "elo",
    verbose=True,  # We want to see the final ranking, not only write to an output file.
    k=32,  # The k-factor for the Elo ranking algorithm
    initial_score=1000,  # Initial score for the agents. This will be updated after each game.
)
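
The `n_games_per_query` and `bidirectional` settings above describe how games are drawn. The sketch below illustrates the general idea with plain Python. It is an assumption about how such sampling could work, not RAGElo's code: the helper `sample_games` is hypothetical, and whether RAGElo counts mirrored games toward the per-query limit is not specified here.

```python
import itertools
import random


def sample_games(agents, n_games, bidirectional=True, seed=0):
    """Pick up to n_games random agent pairs; optionally add the mirrored order."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(agents, 2))
    chosen = rng.sample(all_pairs, min(n_games, len(all_pairs)))
    games = []
    for left, right in chosen:
        games.append((left, right))
        if bidirectional:
            # Swap positions so each answer is judged on both sides.
            games.append((right, left))
    return games


agents = ["agent_a", "agent_b", "agent_c", "agent_d"]
games = sample_games(agents, n_games=3)
print(len(games))  # 6: three sampled pairs, each in both orders
```

Judging both orders doubles the number of LLM calls, which is the price paid for cancelling out a judge that systematically favors the first or second answer it reads.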


## 5. Call the evaluators
Now, we actually run the evaluators. 

Note that, as we go, we are adding information to each `Query` object, instead of just dumping everything into CSVs or JSON files. This is by design. The `Query` object is also a Pydantic model, so it can be serialized to JSON with `query.model_dump_json()` or pickled with `pickle.dumps(query)`.
Keeping everything on the `Query` object also avoids re-evaluating the same document multiple times for the same query. Likewise, an evaluator will not re-evaluate a query (or answer) that has already been evaluated, unless the `force` parameter is set to `True` in its configuration.
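
The skip-unless-forced behaviour can be pictured with a short sketch. Plain dictionaries stand in for RAGElo objects, and `evaluate_all` and `judge` are hypothetical names; only the `force` flag comes from the text above.

```python
def evaluate_all(items, judge, force=False):
    """Run judge on each item, skipping items that already hold a result."""
    calls = 0
    for item in items:
        if item.get("evaluation") is not None and not force:
            continue  # Reuse the stored result instead of calling the judge again.
        item["evaluation"] = judge(item["text"])
        calls += 1
    return calls


docs = [{"text": "a"}, {"text": "b", "evaluation": 2}]
print(evaluate_all(docs, judge=len))              # 1: only the first item is new
print(evaluate_all(docs, judge=len))              # 0: everything is already evaluated
print(evaluate_all(docs, judge=len, force=True))  # 2: force re-evaluates all items
```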

Because RAGElo focuses on the quality of RAG pipelines, the __retrieval__ component is extremely important: the agents' answers are evaluated against not only the documents they retrieved themselves, but all the documents retrieved by any agent. The intuition here is that if there are many relevant documents in the corpus, potentially retrieved by other agents, the evaluation should take these into account, even if a specific agent did not retrieve them.

When evaluating a (pair of) answer(s), the LLM sees all the relevant documents retrieved by all agents, and compares the answers against that shared pool rather than against each agent's own documents.
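
As a rough illustration of how a relevance threshold narrows the shared pool before answers are compared, consider the sketch below. The documents and scores are invented, and the filtering shown is an assumption about the general mechanism rather than RAGElo's code; the threshold value matches `document_relevance_threshold=2` from the configuration above.

```python
# Hypothetical pooled documents for one query, with relevance scores (0 to 2)
# like those produced by a retrieval evaluator.
pooled_docs = [
    {"source": "docs/a.md", "retrieved_by": {"agent_a"}, "relevance": 2},
    {"source": "docs/b.md", "retrieved_by": {"agent_b"}, "relevance": 1},
    {"source": "docs/c.md", "retrieved_by": {"agent_a", "agent_b"}, "relevance": 0},
]

document_relevance_threshold = 2

# Keep documents at or above the threshold, regardless of who retrieved them.
context_docs = [
    doc for doc in pooled_docs if doc["relevance"] >= document_relevance_threshold
]

# Both agents are judged against the same context, even though only
# agent_a retrieved the one document that passes the threshold.
print([doc["source"] for doc in context_docs])  # ['docs/a.md']
```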

In [11]:
import pickle

# Reload previously evaluated queries if a checkpoint exists.
if os.path.exists("queries.pkl"):
    with open("queries.pkl", "rb") as f:
        queries = pickle.load(f)

In [9]:
# Evaluate all the retrieved documents for all the queries.
queries = retrieval_evaluator.batch_evaluate(queries)
# As the evaluator is a PairwiseDomainExpertEvaluator, it creates random pairs of agents' answers for the same query and evaluates them.
queries = answer_evaluator.batch_evaluate(queries)

In [12]:
with open("queries.pkl", "wb") as f:  # Save the results to disk to avoid re-evaluating everything if something goes wrong.
    pickle.dump(queries, f)
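
The load and save cells above can be folded into one checkpoint helper. This is a minimal sketch using only the standard library; `load_or_compute` is a hypothetical name, and the demo writes to a temporary directory rather than to `queries.pkl`.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled object at path, or compute it and save it there."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Demo with a throwaway file; in the notebook, compute would run the evaluators.
checkpoint = os.path.join(tempfile.mkdtemp(), "queries.pkl")
first = load_or_compute(checkpoint, lambda: ["evaluated queries"])
second = load_or_compute(checkpoint, lambda: ["never computed"])
print(first == second)  # True: the second call reads the checkpoint
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.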

### Let's see what is happening under the hood

- A `Query` object contains the query itself, the documents retrieved by all the agents, the answers generated by each agent, and the pairwise games that the answer evaluator generated.
- A `Query` holds three kinds of objects that can be evaluated: retrieved documents, agent answers, and pairwise games.
- Each evaluable object has an `evaluation` object that, after being evaluated, will contain the raw LLM output for the object and a parsed version of it, according to the evaluator's settings.
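
One way to use this structure is to tally the relevance scores across all queries. The sketch below relies only on the attributes shown in this notebook (`retrieved_docs`, `evaluation`, `answer`) and runs on stand-in objects, so it does not need RAGElo installed. It assumes that `evaluation` is `None` for a document that has not been evaluated yet; `relevance_histogram` and `make_doc` are hypothetical helpers.

```python
from collections import Counter
from types import SimpleNamespace


def relevance_histogram(queries):
    """Count relevance scores over all retrieved documents, skipping unevaluated ones."""
    counts = Counter()
    for query in queries:
        for doc in query.retrieved_docs:
            if doc.evaluation is None:
                continue  # Assumed marker for "not evaluated yet".
            counts[doc.evaluation.answer] += 1
    return counts


def make_doc(score=None):
    """Build a stand-in document that mimics only the attributes used above."""
    evaluation = None if score is None else SimpleNamespace(raw_answer="...", answer=score)
    return SimpleNamespace(evaluation=evaluation)


fake_queries = [
    SimpleNamespace(retrieved_docs=[make_doc(2), make_doc(0)]),
    SimpleNamespace(retrieved_docs=[make_doc(2), make_doc()]),
]
print(relevance_histogram(fake_queries))
```

Iterating over whatever documents exist, instead of indexing fixed positions, also avoids `IndexError` when a query has fewer documents than expected.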

In [13]:
import random 
query = random.choice(queries)

print("The query object:")
print(f'\tQuery text: "{query.query}"')
print(f'\t{len(query.retrieved_docs)} retrieved documents by all agents')
print(f"\t{len(query.answers)} Agents answered the query")
print(f"\t{len(query.pairwise_games)} games were evaluated for this query")
document = random.choice(query.retrieved_docs)
print("-" * 80)
print("The document object:")
print(f'\tDocument text: "{document.text[:100]}" (...)')
print("-" * 80)
print("Document's evaluation:")
document_evaluation = document.evaluation
print(f'\tLLM\'s reasoning for the evaluation: "{document_evaluation.raw_answer[:100]}" (...)')
print(f"\tDocument's relevance score (between 0 and 2): {document_evaluation.answer}")

The query object:
	Query text: "How do you use ‘ordering’ parameter?"
	7 retrieved documents by all agents
	6 Agents answered the query
	15 games were evaluated for this query
--------------------------------------------------------------------------------
The document object:
	Document text: "can be freely reordered. - `medium` ordering serializes all write operations through a dynamically e" (...)
--------------------------------------------------------------------------------
Document's evaluation:
	LLM's reasoning for the evaluation: "Given the user query "How do you use ‘ordering’ parameter?" and the provided document passage, the r" (...)
	Document's relevance score (between 0 and 2): 2


## 6. Analyze the results
Let's look at the evaluations produced by the Evaluators:

In [27]:
query.pairwise_games

[PairwiseGame(evaluation=AnswerEvaluatorResult(qid='What is vaccum optimizer ?', raw_answer='Both Assistant A and Assistant B provide a correct and comprehensive explanation of the Vacuum Optimizer in Qdrant. They both explain that the Vacuum Optimizer is used to manage the accumulation of deleted records in a segment repository, which are marked as deleted but not immediately removed to minimize disk access. Over time, these records can slow down the system, and the Vacuum Optimizer is triggered based on conditions defined in the configuration file.\n\n**Comparison of the Responses:**\n1. **Content and Detail:**\n   - Both assistants describe the basic functionality and purpose of the Vacuum Optimizer similarly.\n   - Assistant B provides additional details about the parameters that can be set for the Vacuum Optimizer, such as the minimal fraction of deleted vectors and the minimal number of vectors in a segment required for optimization. This additional information is useful for unde

In [13]:
# Inspect the document sampled above; hard-coded indices such as queries[5].retrieved_docs[2] can be out of range.
print("LLM Reasoning:")
print(document.evaluation.raw_answer)
print("LLM score:")
print(document.evaluation.answer)

In [14]:
elo_ranker.run(queries)

------- Agent Scores by Elo Agent Ranker -------
