In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from ragelo.utils import load_answers_from_multiple_csvs
import glob
import os
from getpass import getpass
import openai

In [4]:
data_folder = "../data/"
csvs = glob.glob(f"{data_folder}rag_response_*.csv")

In [5]:
if not (openai_api_key := os.environ.get("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

RAGelo is independent of your retrieval pipeline. For each agent/pipeline, it needs only the answers the agent produced and the documents it retrieved while building them.
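As a rough illustration of that input, the sketch below builds one agent's file in memory with the column names this notebook reads (`question`, `answer`, `contexts`). The row values and the `document: ... source: ...` layout inside `contexts` are invented for illustration, so the real files may look different.

```python
import csv
import io

# Column names taken from the notebook; the row values are invented.
FIELDS = ["question", "answer", "contexts"]

sample_rows = [
    {
        "question": "What is a payload index?",
        "answer": "It speeds up filtering on a payload field.",
        # Assumed layout: one retrieved document per line.
        "contexts": (
            "document: Payload indexes speed up filters. source: docs/indexing.md\n"
            "document: Collections hold points. source: docs/collections.md"
        ),
    },
]

# Write the rows as CSV text, as one per-agent response file might hold them.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(sample_rows)

# Read the rows back: each one carries an answer plus the documents behind it.
buffer.seek(0)
loaded_rows = list(csv.DictReader(buffer))
for row in loaded_rows:
    n_docs = len(row["contexts"].split("\n"))
    print(f"{row['question']!r}: 1 answer, {n_docs} retrieved documents")
```

One file per agent in this shape is enough for the loading cells that follow; nothing about the retriever itself has to be exposed.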

In [6]:
queries = load_answers_from_multiple_csvs(csvs, query_text_col="question")
query_ids = {q.query: q.qid for q in queries}
query_dict = {q.qid: q for q in queries}

In [7]:
import pandas as pd


def parse_docs(raw_docs) -> list[tuple[str, str]]:
    """Split a raw `contexts` cell into (source, text) pairs, one per line."""
    documents = []
    for d in raw_docs.split("\n"):
        if not d.strip():
            continue  # skip blank lines, which carry no document
        doc_text = d.split("document:", maxsplit=1)[1]
        doc_source = d.split("source:", maxsplit=1)[1]
        documents.append((doc_source, doc_text))
    return documents


# Attach the documents each pipeline retrieved to the matching query.
for csv_path in csvs:
    df = pd.read_csv(csv_path)
    for _, row in df.iterrows():
        query = query_dict[query_ids[row["question"]]]
        for doc_source, doc_text in parse_docs(row["contexts"]):
            query.add_retrieved_doc(doc_text, doc_source)
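`parse_docs` splits each line twice, so whichever marker comes first keeps the other field in its result. If that matters for your data, a stricter parser may help. The sketch below is one possibility, assuming each line looks like `document: <text> source: <path>`. That layout, and the helper names `parse_context_line` and `parse_contexts`, are hypothetical and not part of RAGelo.

```python
import re
from typing import Optional

# Assumed line layout (hypothetical): "document: <text> source: <path>"
LINE_PATTERN = re.compile(
    r"document:\s*(?P<text>.*?)\s*source:\s*(?P<source>.*)$", re.DOTALL
)


def parse_context_line(line: str) -> Optional[tuple[str, str]]:
    """Return (source, text) for one line, or None if the line does not match."""
    match = LINE_PATTERN.search(line)
    if match is None:
        return None
    return match.group("source").strip(), match.group("text").strip()


def parse_contexts(raw: str) -> list[tuple[str, str]]:
    """Parse every non-blank line of a contexts cell, skipping malformed lines."""
    parsed = []
    for line in raw.split("\n"):
        if not line.strip():
            continue
        pair = parse_context_line(line)
        if pair is not None:
            parsed.append(pair)
    return parsed


example = (
    "document: Qdrant stores vectors. source: docs/a.md\n"
    "\n"
    "document: Payload indexes speed up filters. source: docs/b.md"
)
parsed_example = parse_contexts(example)
print(parsed_example)
```

Silently skipping malformed lines is a design choice; raising an error instead would surface formatting problems earlier.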


## Evaluate retrieved documents

In [8]:
from ragelo import (
    Query,
    get_answer_evaluator,
    get_llm_provider,
    get_retrieval_evaluator,
)
from ragelo.types.configurations import DomainExpertEvaluatorConfig

llm_provider = get_llm_provider("openai", model_name="gpt-4o", max_tokens=2048)

In [9]:
retrieval_evaluator_config = DomainExpertEvaluatorConfig(
    expert_in="the details of how to better use the Qdrant vector database and vector search engine",
    rich_print=True,
    n_processes=20,
)
retrieval_evaluator = get_retrieval_evaluator(llm_provider=llm_provider, config=retrieval_evaluator_config)

In [10]:
import pickle

# To re-run the evaluation instead of loading cached results, uncomment:
# queries = retrieval_evaluator.batch_evaluate(queries)
with open("queries.pkl", "rb") as f:
    queries = pickle.load(f)
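The cell above loads previously saved results rather than re-running the evaluation. A minimal sketch of that compute-once, load-afterwards pattern follows. The helper `load_or_compute` and the stand-in `expensive_evaluation` are hypothetical names, and a temporary directory is used so the example leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_compute(cache_path: Path, compute: Callable[[], Any]) -> Any:
    """Return the pickled result at cache_path, computing and saving it if absent."""
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result


# Stand-in for an expensive call such as a batch evaluation (hypothetical).
calls = []


def expensive_evaluation() -> list[str]:
    calls.append(1)
    return ["evaluated query 1", "evaluated query 2"]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "queries.pkl"
    first = load_or_compute(cache_file, expensive_evaluation)   # computes and saves
    second = load_or_compute(cache_file, expensive_evaluation)  # loads from disk
```

Only unpickle files you created or trust, since loading a pickle can execute arbitrary code.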

Let's look at the evaluations produced by the LLM:


In [11]:
print("LLM Reasoning:")
print(queries[5].retrieved_docs[2].evaluation.raw_answer)
print("LLM score:")
print(queries[5].retrieved_docs[2].evaluation.answer)

LLM Reasoning:
The user query asks about the purpose of the function `CreatePayloadIndexAsync` in the context of using the Qdrant vector database. The query specifically seeks to understand the function's role or utility, likely within the broader operations of managing or manipulating data in the database.

The provided document passage, however, does not directly mention or discuss the `CreatePayloadIndexAsync` function. Instead, the passage details various operations related to managing payloads and points in the Qdrant database, such as setting, deleting, and updating payloads and points. It includes code snippets and examples of operations like `SetPayload`, `DeletePayload`, `ClearPayload`, and others, which are part of batch operations in managing data points.

Given the guidelines for relevance:
- **Not Relevant**: The document does not address the `CreatePayloadIndexAsync` function or its purpose. While it discusses related functionalities within the same system (Qdrant), it do
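Beyond inspecting a single document, it can help to see how the scores are distributed overall. The sketch below tallies them using only the attribute names that appear in the cell above (`retrieved_docs`, `evaluation`, `answer`). It runs on stand-in objects with invented scores, and it assumes `retrieved_docs` is iterable and that unevaluated documents have `evaluation` set to `None`; the real objects may differ.

```python
from collections import Counter
from types import SimpleNamespace


def score_counts(queries) -> Counter:
    """Count how often each score appears across all evaluated documents."""
    counts = Counter()
    for query in queries:
        for doc in query.retrieved_docs:
            evaluation = getattr(doc, "evaluation", None)
            if evaluation is None:
                continue  # document has not been evaluated yet
            counts[evaluation.answer] += 1
    return counts


# Stand-in objects that mimic only the attributes used above (hypothetical data).
def fake_doc(score):
    evaluation = None if score is None else SimpleNamespace(answer=score)
    return SimpleNamespace(evaluation=evaluation)


fake_queries = [
    SimpleNamespace(retrieved_docs=[fake_doc(0), fake_doc(1), fake_doc(None)]),
    SimpleNamespace(retrieved_docs=[fake_doc(1), fake_doc(2)]),
]

counts = score_counts(fake_queries)
print(dict(counts))
```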

In [12]:
from ragelo.types.configurations import PairwiseDomainExpertEvaluatorConfig

answer_evaluator_config = PairwiseDomainExpertEvaluatorConfig(
    expert_in="the details of how to better use the Qdrant vector database and vector search engine",
    company="Qdrant",
    rich_print=True,
    n_processes=20,
)

In [13]:
answer_evaluator = get_answer_evaluator(llm_provider=llm_provider, config=answer_evaluator_config)

In [14]:
# queries = answer_evaluator.batch_evaluate(queries)

In [15]:
print(queries[5].pairwise_games[0].evaluation.raw_answer)

Both Assistant A and Assistant B provide answers regarding the purpose of the 'CreatePayloadIndexAsync' function. They both explain that it is used to create a keyword payload index for a specific field in a collection, which facilitates efficient indexing and retrieval of payload data associated with the field.

Assistant A cites "documentation/guides/multiple-partitions.md" as the source, which is relevant to the question as indicated by the relevance score of 1. Assistant B, however, cites "documentation/concepts/collections.md" as the source, which has a relevance score of 0, indicating that it is not relevant to the question.

The key difference between the two responses lies in the accuracy and relevance of the source cited. Assistant A's response is supported by a relevant document, making it more reliable and trustworthy. Assistant B, despite providing a similar explanation, cites a non-relevant document, which undermines the credibility of the response.

Given that both assist

In [18]:
from ragelo import get_agent_ranker
elo_ranker = get_agent_ranker("elo")

In [192]:
elo_ranker.run(queries)

------- Agent Scores by Elo Agent Ranker -------
rag_response_512_5: 1061.0
rag_response_512_5_reranked: 987.0
rag_response_512_3: 973.0
rag_response_512_4_reranked: 953.0
rag_response_512_3_reranked: 949.0
rag_response_512_4: 930.0
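For intuition about how pairwise results turn into scores like these, here is a minimal sketch of the standard Elo update. The K-factor of 32, the starting rating of 1000, and the two games are assumptions chosen for illustration; RAGelo's ranker may use different constants and may order or repeat games differently.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b); score_a is 1 for a win, 0.5 tie, 0 loss."""
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Hypothetical games between two agents that both start at 1000.
ratings = {"agent_a": 1000.0, "agent_b": 1000.0}
games = [("agent_a", "agent_b", 1.0), ("agent_a", "agent_b", 0.5)]
for a, b, score_a in games:
    ratings[a], ratings[b] = update_ratings(ratings[a], ratings[b], score_a)

print(ratings)
```

Each update moves the two ratings by equal and opposite amounts, so the total stays constant and only the spread between agents changes.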
