# **Evaluating conversational agents with RAGElo**
RAGElo can also evaluate whole conversations between a user and an agent against a set of given use cases and objectives. As with simple QA evaluation, RAGElo can compare **pairs** of conversations in an Elo-style tournament.

To gather the user-bot conversations that will be used for evaluation, we can also simulate the user's behaviour with an LLM agent that tries to execute a given use case against the RAG conversational agent.

## 1. Import packages

In [1]:
from __future__ import annotations

import os
from getpass import getpass

import openai

from ragelo import (
    get_agent_ranker,
    get_answer_evaluator,
    get_llm_provider,
    get_retrieval_evaluator,
    Query
)



## 2. Set up the OpenAI key

In [2]:
if not (openai_api_key := os.environ.get("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

## 3. Set up a sample conversational Query

In [3]:
queries: list[Query] = [
    Query.parse_obj(
        {
            "qid": 0,
            "query": (
                "I want to get familiar with neural search. "
                "I want to start with the basics of what it means and how it works, "
                "and eventually dive into what methods are used, which models, "
                "and how to create a neural search system."
            ),
            "retrieved_docs": {
                "doc_1": {
                    "did": "doc_1",
                    "text": "Neural search is a type of search that uses neural networks to improve the accuracy and relevance of search results.",
                },
                "doc_2": {
                    "did": "doc_2",
                    "text": "Keyword vs Neural Search: What's the Difference?\n\nKeyword search is the traditional method of searching for information on the web. It involves entering a query into a search engine and receiving a list of results that match the query. Neural search, on the other hand, uses neural networks to understand the meaning of the query and provide more relevant results.",
                },
                "doc_3": {
                    "did": "doc_3",
                    "text": "The methods used in neural search include natural language processing (NLP), deep learning, and machine learning. These methods are used to understand the meaning of the query and provide more relevant results.",
                },
                "doc_4": {
                    "did": "doc_4",
                    "text": "The embeddings are very important in neural search. They are used to represent the documents and queries in a high-dimensional space where similar documents and queries are close to each other.",
                },
                "doc_5": {
                    "did": "doc_5",
                    "text": "To create a neural search system, you need to start by collecting a large amount of relevance data. This data will be used to train the neural network to understand the meaning of the queries and provide relevant results.",
                },
                "doc_6": {
                    "did": "doc_6",
                    "text": "Techniques to collect data for neural search training include the generation of synthetic data, data augmentation, and active learning.",
                },
                "doc_7": {
                    "did": "doc_7",
                    "text": "Transformer models are commonly used as based models in neural search systems, including e5, mistral, Llama, and others.",
                },
            },
            "pairwise_games": [
                {
                    "agent_a_answer": {
                        "agent": "agent_a",
                        "conversation": [
                            {
                                "sender": "user",
                                "content": "I would like to get familiar with neural search",
                            },
                            {
                                "sender": "agent_a",
                                "content": "Sure! Neural search is a type of search that uses neural networks to improve the accuracy and relevance of search results[doc_1].",
                            },
                            {
                                "sender": "user",
                                "content": "How does it work?",
                            },
                            {
                                "sender": "agent_a",
                                "content": "It works by using neural models to get embeddings of the documents and queries in a high-dimensional space where similar documents and queries are close to each other[doc_4].",
                            },
                            {
                                "sender": "user",
                                "content": "I see, what methods are used in neural search?",
                            },
                            {
                                "sender": "agent_a",
                                "content": "The methods used in neural search include natural language processing (NLP), deep learning, and machine learning[doc_3].",
                            },
                            {
                                "sender": "user",
                                "content": "How can I create a neural search system?",
                            },
                            {
                                "sender": "agent_a",
                                "content": "To create a neural search system, you need to start by collecting a large amount of relevance data. This data will be used to train the neural network to understand the meaning of the queries and provide relevant results.",
                            },
                            {
                                "sender": "user",
                                "content": "What models are used in neural search?",
                            },
                            {
                                "sender": "agent_a",
                                "content": "Transformer models are commonly used as based models in neural search systems, including e5, mistral, Llama, and others[doc_7].",
                            },
                        ]
                    },
                    "agent_b_answer": {
                        "agent": "agent_b",
                        "conversation": [
                            {
                                "sender": "user",
                                "content": "I would like to get familiar with neural search",
                            },
                            {
                                "sender": "agent_b",
                                "content": "Sure! What would you like to know about neural search?",
                            },
                            {
                                "sender": "user",
                                "content": "Let's start with the basics of what it means and how it works.",
                            },
                            {
                                "sender": "agent_b",
                                "content": "It is a type of search that uses neural networks to improve the accuracy and relevance of search results[doc_1]. Opposed to keyword search, neural search uses neural networks to understand the meaning of the query and provide more relevant results[doc_2]. In order to do this, it uses natural language processing (NLP), deep learning, and machine learning[doc_3]. Specifically, it works by getting embeddings of the documents and queries in a high-dimensional space where similar documents and queries are close to each other[doc_4].",
                            },
                            {
                                "sender": "user",
                                "content": "What kind of models are used in neural search?",
                            },
                            {
                                "sender": "agent_b",
                                "content": "Transformer models are commonly used as based models in neural search systems, including e5, mistral, Llama, and others[doc_7].",
                            },
                        ]
                    }
                },
            ]
        }
    )
]


## 4. Prepare the Evaluators
RAGElo uses _evaluators_ as judges. We will instantiate a **retrieval evaluator**, a **chat evaluator** and an **agent ranker**, each with its corresponding settings.

In [4]:
llm_provider = get_llm_provider("openai", model_name="gpt-4o", max_tokens=2048)

kwargs = {
    "llm_provider": llm_provider,
    "rich_print": True,
    "has_citations": True,
    "include_annotations": True,
    "include_raw_documents": True,
    "force": True,
}

retrieval_evaluator = get_retrieval_evaluator(
    "reasoner",
    **kwargs,
)

chat_evaluator = get_answer_evaluator(
    "chat_pairwise",
    **kwargs,
    factors=(
        "- The agent should provide clear and accurate answers with references to documents that support the answer.\n"
        "- The agent should be able to switch context according to the user needs and use new documents to respond to diverse user's requests."
    ),
    bidirectional=False,
    n_games_per_query=20,
)

elo_ranker = get_agent_ranker(
    "elo",
    verbose=True,
    k=32,
    initial_score=1000,
    rounds=1000,
    **kwargs,
)



## 5. Call the evaluators
Now, we actually run the evaluators. 

Note that, as we go, the evaluators add information to each `Query` object instead of just dumping everything into CSV or JSON files. This is by design. `Query` is a Pydantic model, so it can be serialized to a JSON string by calling `query.model_dump_json()` or pickled by calling `pickle.dumps(query)`.
Keeping the results on the object also avoids re-evaluating the same document multiple times for the same query. An evaluator will not re-evaluate a query (or answer) that was already evaluated unless the `force` parameter is set to `True` in its configuration.
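
The skip-unless-forced behaviour described above can be pictured with a minimal sketch in plain Python. This is not RAGElo's implementation: the `batch_evaluate` function, the dictionary layout and the `judge` callable below are all hypothetical stand-ins chosen for illustration.

```python
# Hypothetical sketch of "do not re-evaluate unless force=True".
# None of these names come from RAGElo; they only illustrate the idea.

def batch_evaluate(items, judge, force=False):
    """Evaluate items in place and return how many judge calls were made."""
    calls = 0
    for item in items:
        already_done = item.get("evaluation") is not None
        if already_done and not force:
            continue  # reuse the stored evaluation
        item["evaluation"] = judge(item["text"])
        calls += 1
    return calls


def judge(text):
    # Stand-in for an LLM call.
    return f"judged: {text}"


items = [
    {"text": "doc one", "evaluation": None},
    {"text": "doc two", "evaluation": None},
]

first_run = batch_evaluate(items, judge)               # evaluates both items
second_run = batch_evaluate(items, judge)              # nothing left to do
forced_run = batch_evaluate(items, judge, force=True)  # re-evaluates everything
print(first_run, second_run, forced_run)  # 2 0 2
```

Because the evaluation is stored on the item itself, a second pass costs no additional judge calls unless it is forced.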

Because RAGElo focuses on evaluating the quality of RAG pipelines, the __retrieval__ component is extremely important. The agents' answers are evaluated based not only on the quality of the documents each agent retrieved itself, but also on the quality of all the documents retrieved by any agent. The intuition is that if the corpus contains many relevant documents, potentially retrieved by other agents, the evaluation should take them into account even if a specific agent did not retrieve them.

When evaluating a chat or a pair of chats, the LLM sees all the relevant documents retrieved by all agents, so it can judge each answer against the full pool of relevant documents rather than only those a single agent retrieved.
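
To make the pooling idea concrete, here is a small sketch in plain Python. It is only an illustration under assumptions: the `pool_relevant_docs` helper, the per-agent retrieval lists and the relevance labels are hypothetical and do not reflect RAGElo's internal data structures or the judgments produced in this notebook.

```python
# Hypothetical sketch: build the pool of relevant documents across all agents.

def pool_relevant_docs(retrieved_by_agent, relevance):
    """Return the relevant doc ids retrieved by any agent, without duplicates."""
    pooled = []
    seen = set()
    for doc_ids in retrieved_by_agent.values():
        for did in doc_ids:
            if did not in seen and relevance.get(did, False):
                seen.add(did)
                pooled.append(did)
    return pooled


# Made-up retrieval results and relevance labels, for illustration only.
retrieved_by_agent = {
    "agent_a": ["doc_1", "doc_4", "doc_3"],
    "agent_b": ["doc_1", "doc_2", "doc_7"],
}
relevance = {
    "doc_1": True,
    "doc_2": True,
    "doc_3": True,
    "doc_4": True,
    "doc_7": False,
}

pooled = pool_relevant_docs(retrieved_by_agent, relevance)
print(pooled)  # ['doc_1', 'doc_4', 'doc_3', 'doc_2']
```

In this toy setup, `agent_a` is judged against `doc_2` as well, even though only `agent_b` retrieved it.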

In [None]:
queries = retrieval_evaluator.batch_evaluate(queries) 

queries = chat_evaluator.batch_evaluate(queries)

In [6]:
print(f"Relevance evaluation for doc1:\n{queries[0].retrieved_docs['doc_1'].evaluation.raw_answer}")

Relevance evaluation for doc1:
Somewhat relevant: The document provides a basic definition of neural search but does not cover the methods, models, or how to create a neural search system as requested by the user.


In [16]:
print("-"*80)
print(f"Pairwise game evaluation:\n{queries[0].pairwise_games[0].evaluation.raw_answer}")
print("-"*80)
print(f"Winner: {queries[0].pairwise_games[0].evaluation.answer}")

--------------------------------------------------------------------------------
Pairwise game evaluation:
**Evaluation of Assistant A:**

1. **User Intent Satisfaction:**
   - The assistant provides a basic definition of neural search in response to the user's request for the basics. It states that neural search uses neural networks to improve search results [doc_1].
   - When asked how it works, Assistant A explains that it uses embeddings to represent documents and queries in a high-dimensional space [doc_4]. This is relevant to understanding how neural search operates.
   - For methods used in neural search, it mentions natural language processing (NLP), deep learning, and machine learning [doc_3], which aligns with the user's intent to learn about methods.
   - In response to how to create a neural search system, it mentions the need for relevance data to train the neural network, which is a relevant point but lacks depth [doc_5].
   - Finally, it identifies transformer models as 

## 6. Rank the agents
Based on the results of the games played, we now run the Elo ranker to determine which agent wins the tournament.

Re-running the tournament can produce small variations in the scores, so we run it multiple times and average the results to get a more stable ranking.

In [9]:
elo_ranks = elo_ranker.run(queries)
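
As a rough illustration of what averaging over repeated tournaments means, the sketch below applies the standard Elo update in plain Python. It is not RAGElo's implementation: the function names, the shuffling of the game order in each round and the sample game results are assumptions, and only `k=32` and `initial_score=1000` mirror the ranker settings used above.

```python
import random
from statistics import mean


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def run_tournament(games, k=32, initial_score=1000, seed=None):
    """Play all games once, in a shuffled order, and return the final ratings."""
    rng = random.Random(seed)
    shuffled = list(games)
    rng.shuffle(shuffled)
    ratings = {}
    for agent_a, agent_b, score_a in shuffled:
        ra = ratings.setdefault(agent_a, float(initial_score))
        rb = ratings.setdefault(agent_b, float(initial_score))
        ea = expected_score(ra, rb)
        # score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
        ratings[agent_a] = ra + k * (score_a - ea)
        ratings[agent_b] = rb + k * ((1.0 - score_a) - (1.0 - ea))
    return ratings


def average_ratings(games, rounds=100, k=32, initial_score=1000):
    """Average the final ratings over several shuffled tournaments."""
    history = {}
    for round_idx in range(rounds):
        result = run_tournament(games, k=k, initial_score=initial_score, seed=round_idx)
        for agent, rating in result.items():
            history.setdefault(agent, []).append(rating)
    return {agent: mean(values) for agent, values in history.items()}


# Made-up results: agent_b wins two games out of three.
games = [
    ("agent_a", "agent_b", 0.0),
    ("agent_a", "agent_b", 0.0),
    ("agent_a", "agent_b", 1.0),
]
avg = average_ratings(games, rounds=50)
print(sorted(avg.items(), key=lambda kv: kv[1], reverse=True))
```

Since Elo updates depend on the order in which games are processed, a single pass can rank agents slightly differently from another pass; averaging over shuffled rounds smooths this out.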