In [1]:
%load_ext autoreload
%autoreload 2

# **Evaluating multiple RAG pipelines with RAGElo**
Unlike other LLM and RAG evaluation frameworks that score each LLM answer in isolation, RAGElo compares **pairs** of answers in an Elo-style tournament.

The idea is that, without a golden answer, LLMs can't reliably judge an individual answer in isolation. Deciding whether answer A is better than answer B is a more reliable signal of quality, and it makes it easier to decide whether a new RAG pipeline is better than another.

RAGElo works in three steps when evaluating a RAG pipeline:
1. Gather all documents retrieved by all agents and annotate their relevance to the user's query.
2. For each question, generate "_games_" between agents, asking the judging LLM to decide whether one agent is better than another, rather than assigning individual scores to each answer.
3. With these games, compute the Elo score for each agent, creating a final ranking of agents.
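
To make step 3 concrete, here is a minimal sketch of the textbook Elo update applied to a list of games. It is not RAGElo's own implementation: the starting rating of 1000, the K-factor of 32, the agent names, and the helper names `expected_score` and `update_elo` are all assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict[str, float], games, k: float = 32.0) -> dict[str, float]:
    """Apply the Elo update once per game and return the new ratings.

    Each game is (agent_a, agent_b, result), where result is
    1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    ratings = dict(ratings)  # do not mutate the caller's dict
    for agent_a, agent_b, result in games:
        exp_a = expected_score(ratings[agent_a], ratings[agent_b])
        exp_b = 1.0 - exp_a
        ratings[agent_a] += k * (result - exp_a)
        ratings[agent_b] += k * ((1.0 - result) - exp_b)
    return ratings


# Hypothetical outcomes, as a judging LLM might produce them.
example_games = [
    ("agent_a", "agent_b", 1.0),  # agent_a wins
    ("agent_b", "agent_c", 0.5),  # tie
    ("agent_a", "agent_c", 1.0),  # agent_a wins
]
initial_ratings = {name: 1000.0 for name in ("agent_a", "agent_b", "agent_c")}
final_ratings = update_elo(initial_ratings, example_games)
elo_ranking = sorted(final_ratings, key=final_ratings.get, reverse=True)
print(elo_ranking)
```

Because every game moves the two ratings by equal and opposite amounts, the sum of all ratings stays constant; only the relative order changes.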

Importantly, RAGElo is **agnostic** to your pipeline: it does not call your RAG system directly, and it works with any framework or pipeline you use. When used as a library, as we do here, you create the `Query` objects with your agents' answers, as we show below.

## 1. Import packages

In [2]:
import glob
import os
from getpass import getpass

import openai
import pandas as pd
from ragelo import (
    Query,
    get_agent_ranker,
    get_answer_evaluator,
    get_llm_provider,
    get_retrieval_evaluator,
)
from ragelo.types.configurations import (
    DomainExpertEvaluatorConfig,
    PairwiseDomainExpertEvaluatorConfig,
)
from ragelo.utils import load_answers_from_multiple_csvs


## 2. Set up the OpenAI key

In [3]:
if not (openai_api_key := os.environ.get("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

🔑 Enter your OpenAI API key:  ········


## 3. Load the answers generated by your RAG pipelines

In [5]:
data_folder = "../data/"
csvs = glob.glob(f"{data_folder}rag_response_*.csv")

In [6]:
queries = load_answers_from_multiple_csvs(
    csvs,  # A list of all the CSVs with the answers produced by our RAG pipelines.
    query_text_col="question",  # This tells RAGElo which column of the CSV holds the query text.
)
query_ids = {q.query: q.qid for q in queries}
query_dict = {q.qid: q for q in queries}


### Parse the documents into the queries

Note that the retrieved documents are tied to the query itself, not to a single answer. In a RAG system, this means that if one pipeline failed to retrieve a relevant document that another pipeline did retrieve, the first pipeline performed _worse_ than the second.
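
As a rough illustration of this pooling idea, the sketch below merges the documents retrieved by two pipelines into one pool per query and lists what each pipeline missed. The pipeline names, query IDs, and document IDs are hypothetical, and this is plain Python rather than anything RAGElo does internally.

```python
from collections import defaultdict

# Hypothetical retrieval results: pipeline -> query id -> retrieved document ids.
retrieved_by_pipeline = {
    "pipeline_a": {"q1": ["doc_1", "doc_2"]},
    "pipeline_b": {"q1": ["doc_2", "doc_3"]},
}

# Pool every document retrieved by any pipeline under its query.
doc_pool: dict[str, set[str]] = defaultdict(set)
for per_query in retrieved_by_pipeline.values():
    for qid, doc_ids in per_query.items():
        doc_pool[qid].update(doc_ids)

# For each pipeline, list the pooled documents it failed to retrieve.
missed_docs = {
    pipeline: {
        qid: sorted(doc_pool[qid] - set(per_query.get(qid, [])))
        for qid in doc_pool
    }
    for pipeline, per_query in retrieved_by_pipeline.items()
}
print(missed_docs)
```

Once the pooled documents are annotated for relevance, a relevant document that appears in a pipeline's `missed_docs` entry counts against that pipeline.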

In [7]:
def parse_docs(raw_docs) -> list[tuple[str, str]]:
    docs = raw_docs.split("\n")
    documents = []
    for d in docs:
        doc_text = d.split("document:", maxsplit=1)[1]
        doc_source = d.split("source:", maxsplit=1)[1]
        documents.append((doc_source, doc_text))
    return documents

for csv in csvs:
    df = pd.read_csv(csv)
    for i, row in df.iterrows():
        query_id = query_ids[row["question"]]
        answer = row["answer"]
        docs = parse_docs(row["contexts"])
        query = query_dict[query_id]
        for doc_source, doc_text in docs:
            query.add_retrieved_doc(doc_text, doc_source)  # Here, we use the source as the id of the document.

## 4. Prepare the Evaluators
RAGElo uses _evaluators_ as judges. We will instantiate a **retrieval evaluator**, an **answer evaluator**, and an **agent ranker** with their corresponding settings.

In [10]:
# The LLM provider can be shared across all evaluators.
llm_provider = get_llm_provider("openai", model_name="gpt-4o", max_tokens=2048)

# The DomainExpertEvaluator mimics a persona that is an expert in a given field
retrieval_evaluator_config = DomainExpertEvaluatorConfig(
    expert_in="the details of how to better use the Qdrant vector database and vector search engine",
    company = "Qdrant",
    n_processes=20, # How many threads to use when evaluating the retrieved documents. Will do that many parallel calls to OpenAI.
)
retrieval_evaluator = get_retrieval_evaluator(llm_provider=llm_provider, config=retrieval_evaluator_config)


# The PairwiseDomainExpertEvaluator is an answer evaluator similar to the one above, but it evaluates pairs of answers.
answer_evaluator_config = PairwiseDomainExpertEvaluatorConfig(
    expert_in="the details of how to better use the Qdrant vector database and vector search engine",
    company = "Qdrant",
    n_processes=20,
    n_games_per_query = 10, # The maximum number of games to generate per query. In this case, up to 10 pairwise games will be generated for each query.
)

answer_evaluator = get_answer_evaluator(llm_provider=llm_provider, config=answer_evaluator_config)

# The Elo ranker doesn't need an LLM. Here, we instantiate it with the basic settings.
elo_ranker = get_agent_ranker("elo")
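
To give an intuition for what a cap like `n_games_per_query` implies, the sketch below enumerates every unordered pair of agents for one query and keeps at most a fixed number of them. RAGElo's actual sampling strategy may differ; the helper `sample_games`, the fixed seed, and the agent names are assumptions made for this example.

```python
import itertools
import random


def sample_games(agents: list[str], n_games_per_query: int, seed: int = 0) -> list[tuple[str, str]]:
    """Return at most n_games_per_query distinct agent pairs for one query."""
    all_pairs = list(itertools.combinations(sorted(agents), 2))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rng.shuffle(all_pairs)
    return all_pairs[:n_games_per_query]


# Six hypothetical agents give 15 possible pairs, so the cap of 10 applies.
many_agents = [f"agent_{i}" for i in range(6)]
capped_games = sample_games(many_agents, n_games_per_query=10)

# Three agents give only 3 possible pairs, fewer than the cap.
few_agents = ["agent_a", "agent_b", "agent_c"]
uncapped_games = sample_games(few_agents, n_games_per_query=10)

print(len(capped_games), len(uncapped_games))
```

With n agents there are n(n-1)/2 possible pairs per query, so the cap mainly matters when many pipelines are compared and judging every pair would require too many LLM calls.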


## 5. Call the evaluators
Now, we call each evaluator. Note that they all modify the same `queries` object, adding more information to it as they go. This also avoids the need to keep extensive CSVs or JSONs with intermediate results.

In [12]:
queries = retrieval_evaluator.batch_evaluate(queries)
queries = answer_evaluator.batch_evaluate(queries)

## 6. Analyze the results
Let's look at the evaluations produced by the Evaluators:

In [13]:
print("LLM Reasoning:")
print(queries[5].retrieved_docs[2].evaluation.raw_answer)
print("LLM score:")
print(queries[5].retrieved_docs[2].evaluation.answer)

LLM Reasoning:


IndexError: list index out of range

In [14]:
elo_ranker.run(queries)

------- Agent Scores by Elo Agent Ranker -------
