## Evaluate RAG with LlamaIndex

- Understanding Retrieval Augmented Generation (RAG).
- Building RAG with LlamaIndex.
- Evaluating RAG with LlamaIndex.

### Stages within RAG

There are five key stages within RAG, each of which will be part of any larger application you build:

- **Loading**: this refers to getting your data from where it lives – whether it’s text files, PDFs, another website, a database, or an API – into your pipeline. LlamaHub provides hundreds of connectors to choose from.

- **Indexing**: this means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.

- **Storing**: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.

- **Querying**: for any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries and hybrid strategies.

- **Evaluation**: a critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful and fast your responses to queries are.
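To make the indexing and querying stages concrete, here is a minimal sketch of top-k retrieval over vector embeddings. The three-dimensional vectors, the chunk ids, and the `top_k` helper are all made up for illustration. In a real pipeline the embeddings come from an embedding model and LlamaIndex performs the search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the ids of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda chunk_id: cosine_similarity(query_vec, index[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy "index": chunk id -> embedding vector.
toy_index = {
    "chunk_childhood": [0.9, 0.1, 0.0],
    "chunk_startups": [0.1, 0.9, 0.1],
    "chunk_painting": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a question about childhood.
query_vec = [0.8, 0.2, 0.1]

print(top_k(query_vec, toy_index, k=2))
```

The query vector points in nearly the same direction as `chunk_childhood`, so that chunk ranks first. The retriever used later in this notebook does the same kind of ranking with `similarity_top_k=2`.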

![alt text](image.png)

In [24]:
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio

nest_asyncio.apply()

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.llms.openai import OpenAI


import os
import pandas as pd

### Download Data

In [31]:
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/dbredvick/paul-graham-to-kindle/refs/heads/main/paul_graham_essays.txt' -o 'data/paul_graham/paul_graham_essay.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3003k  100 3003k    0     0  3686k      0 --:--:-- --:--:-- --:--:-- 3685k


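### Set the OpenAI API Key

In [32]:
# Fail early with a clear message if the key is not set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this notebook.")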

### Load Data and Build Index

In [33]:
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# Define an LLM
llm = OpenAI(model="gpt-4o")

# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

### Build a QueryEngine and Start Querying

In [34]:
query_engine = vector_index.as_query_engine()  # retrieves the top 2 chunks by default
response_vector = query_engine.query("What did the author do growing up?")

In [35]:
response_vector.response

'The author found writing serious, intellectual stuff like the famous writers exciting, but later realized that many famous writers actually produced subpar work.'

In [36]:
# Let's check the text in each of these retrieved nodes.
# First retrieved node
response_vector.source_nodes[0].get_text()

"I should have known that was a danger sign. And in fact I found my stories pretty boring; what excited me was the idea of writing serious, intellectual stuff like the famous writers.  \n  \nNow I have enough experience to realize that those famous writers actually sucked. Plenty of famous people do; in the short term, the quality of one's work is only a small component of fame. I should have been less worried about doing something that seemed cool, and just done something I liked. That's the actual road to coolness anyway.  \n  \nA key ingredient in many projects, almost a project on its own, is to find good books. Most books are bad. Nearly all textbooks are bad. \\[9\\] So don't assume a subject is to be learned from whatever book on it happens to be closest. You have to search actively for the tiny number of good books.  \n  \nThe important thing is to get out there and do stuff. Instead of waiting to be taught, go out and learn.  \n  \nYour life doesn't have to be shaped by admiss

In [37]:
# Second retrieved node
response_vector.source_nodes[1].get_text()

'On a log scale I was midway between crib and globe. A suburban street was just the right size. But as I grew older, suburbia started to feel suffocatingly fake.  \n  \nLife can be pretty good at 10 or 20, but it\'s often frustrating at 15. This is too big a problem to solve here, but certainly one reason life sucks at 15 is that kids are trapped in a world designed for 10 year olds.  \n  \nWhat do parents hope to protect their children from by raising them in suburbia? A friend who moved out of Manhattan said merely that her 3 year old daughter "saw too much." Off the top of my head, that might include: people who are high or drunk, poverty, madness, gruesome medical conditions, sexual behavior of various degrees of oddness, and violent anger.  \n  \nI think it\'s the anger that would worry me most if I had a 3 year old. I was 29 when I moved to New York and I was surprised even then. I wouldn\'t want a 3 year old to see some of the disputes I saw. It would be too frightening. A lot o

## Evaluation

- **Retrieval Evaluation**: This assesses the accuracy and relevance of the information retrieved by the system.
- **Response Evaluation**: This measures the quality and appropriateness of the responses generated by the system based on the retrieved information.

In [None]:
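# Use the LLM to generate questions for each node. Each question is paired
# with the node it came from, which serves as ground truth for retrieval evaluation.
qa_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=2
)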

 11%|█         | 281/2513 [05:24<42:58,  1.16s/it]  

### Retrieval Evaluation

#### Hit Rate
- Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents.

#### Mean Reciprocal Rank (MRR)
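- For each query, MRR looks at the rank of the highest-placed relevant document and takes the reciprocal of that rank (1 for first place, 1/2 for second, and 0 if no relevant document is retrieved). The final score is the mean of these reciprocal ranks over all queries.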
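The two metrics above can be sketched in plain Python. This is an illustrative sketch, not LlamaIndex's own implementation: the function names, the document ids, and the toy data are assumptions made for the example.

```python
def hit_rate(retrieved_per_query, expected_per_query):
    """Fraction of queries with at least one expected document in the retrieved list."""
    hits = 0
    for retrieved, expected in zip(retrieved_per_query, expected_per_query):
        if any(doc_id in expected for doc_id in retrieved):
            hits += 1
    return hits / len(retrieved_per_query)


def mean_reciprocal_rank(retrieved_per_query, expected_per_query):
    """Mean of 1/rank of the first relevant document (0 when none is retrieved)."""
    total = 0.0
    for retrieved, expected in zip(retrieved_per_query, expected_per_query):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in expected:
                total += 1.0 / rank
                break
    return total / len(retrieved_per_query)


# Hypothetical top-2 retrieval results for three queries.
retrieved = [["a", "b"], ["c", "d"], ["e", "f"]]
# Hypothetical ground-truth relevant documents for the same queries.
expected = [{"a"}, {"d"}, {"x"}]

print(hit_rate(retrieved, expected))              # 2 of 3 queries have a hit
print(mean_reciprocal_rank(retrieved, expected))  # (1 + 1/2 + 0) / 3 = 0.5
```

The first query finds its document at rank 1, the second at rank 2, and the third misses entirely, so the hit rate is 2/3 while MRR is 0.5. MRR is lower because it penalizes the second query for ranking the relevant document below the top spot.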

In [None]:
retriever = vector_index.as_retriever(similarity_top_k=2)

In [None]:
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

# Evaluate
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

In [None]:
# Let's define a function to display the Retrieval evaluation results in table format.
def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

display_results("OpenAI Embedding Retriever", eval_results)

### Response Evaluation

#### FaithfulnessEvaluator
- Measures whether the response from a query engine is supported by the retrieved source nodes, which is useful for detecting hallucinated responses.

#### RelevancyEvaluator
- Measures whether the response and the source nodes together match the query.
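To show which inputs each check compares, here is a deliberately crude stand-in that uses word overlap. It is not how LlamaIndex evaluates responses: the real evaluators ask an LLM to judge. The helper names, thresholds, and example sentences are all hypothetical.

```python
import re


def _words(text):
    """Lowercased set of word tokens in a string."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def toy_faithfulness(response, contexts, threshold=0.6):
    """Pass if most response words also appear in the retrieved contexts."""
    response_words = _words(response)
    if not response_words:
        return False
    context_words = set().union(*(_words(c) for c in contexts))
    supported = len(response_words & context_words) / len(response_words)
    return supported >= threshold


def toy_relevancy(query, response, contexts, threshold=0.3):
    """Pass if enough query words appear in the response plus contexts."""
    query_words = _words(query)
    if not query_words:
        return False
    evidence_words = _words(response).union(*(_words(c) for c in contexts))
    matched = len(query_words & evidence_words) / len(query_words)
    return matched >= threshold


# Hypothetical retrieved context and candidate responses.
contexts = ["The cat slept on the warm windowsill all afternoon."]
grounded = "The cat slept on the windowsill."
hallucinated = "A dog chased two squirrels."

print(toy_faithfulness(grounded, contexts))
print(toy_faithfulness(hallucinated, contexts))
print(toy_relevancy("Where did the cat sleep?", grounded, contexts))
print(toy_relevancy("How tall is Mount Everest?", grounded, contexts))
```

Faithfulness only needs the response and its source nodes, while relevancy also needs the query. The evaluator calls later in this notebook take the same inputs.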


In [None]:
# Get the list of queries from the dataset created above

queries = list(qa_dataset.queries.values())

In [None]:
# gpt-3.5-turbo generates the responses
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")

# gpt-4 acts as the judge in the evaluators
gpt4 = OpenAI(temperature=0, model="gpt-4")

In [None]:
# Build the index and a query engine that answers with gpt-3.5-turbo
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine(llm=gpt35)

In [None]:
from llama_index.core.evaluation import FaithfulnessEvaluator
faithfulness_gpt4 = FaithfulnessEvaluator(llm=gpt4)

In [None]:
eval_query = queries[10]

eval_query

In [None]:
response_vector = query_engine.query(eval_query)

In [None]:
# Compute faithfulness evaluation

eval_result = faithfulness_gpt4.evaluate_response(response=response_vector)

In [None]:
# Check whether the response passed the evaluation.
eval_result.passing

#### Relevancy Evaluator


In [None]:
from llama_index.core.evaluation import RelevancyEvaluator

relevancy_gpt4 = RelevancyEvaluator(llm=gpt4)

# Pick a query
query = queries[10]

query

In [None]:
# Generate response.
# response_vector has response and source nodes (retrieved context)
response_vector = query_engine.query(query)

# Relevancy evaluation
eval_result = relevancy_gpt4.evaluate_response(
    query=query, response=response_vector
)

# Check whether the response passed the evaluation.
eval_result.passing

In [None]:
# You can get the feedback for the evaluation.
eval_result.feedback

#### Batch Evaluator

We have now run the faithfulness and relevancy evaluations independently. LlamaIndex also provides `BatchEvalRunner` to run multiple evaluations over a batch of queries.

In [None]:
from llama_index.core.evaluation import BatchEvalRunner

# Let's pick the first 10 queries to evaluate
batch_eval_queries = queries[:10]

# Initialize BatchEvalRunner to compute the faithfulness and relevancy evaluations.
runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
    workers=8,
)

# Compute evaluation
eval_results = await runner.aevaluate_queries(
    query_engine, queries=batch_eval_queries
)

In [None]:
# Let's get faithfulness score

faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])

faithfulness_score

In [None]:
# Let's get relevancy score

relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])

relevancy_score
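The two score cells above compute the same thing: the fraction of results whose `passing` flag is true. A small helper makes that explicit. The `ToyResult` class and the sample values below are hypothetical stand-ins for the evaluation results returned by the runner, so the sketch runs without any API calls.

```python
from dataclasses import dataclass


@dataclass
class ToyResult:
    """Hypothetical stand-in for an evaluation result with a `passing` flag."""
    passing: bool


def pass_rate(results):
    """Fraction of results that passed; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(result.passing for result in results) / len(results)


# Made-up results keyed by evaluator name, mirroring the shape of eval_results.
toy_eval_results = {
    "faithfulness": [ToyResult(True), ToyResult(True), ToyResult(False), ToyResult(True)],
    "relevancy": [ToyResult(True), ToyResult(False), ToyResult(False), ToyResult(True)],
}

scores = {name: pass_rate(results) for name, results in toy_eval_results.items()}
print(scores)  # {'faithfulness': 0.75, 'relevancy': 0.5}
```

Guarding against an empty list avoids a division by zero when no queries were evaluated.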
