# LLM and RAG Evaluation

Sources: [1](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/)  

LLMs are trained on enormous bodies of data but they aren’t trained on your data. Retrieval-Augmented Generation (RAG) solves this problem by adding your data to the data LLMs already have access to. You will see references to RAG frequently in this documentation.  
In RAG, your data is loaded and prepared for queries or “indexed”. User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response.  
Even if what you’re building is a chatbot or an agent, you’ll want to know RAG techniques for getting data into your application.  

Evaluation and benchmarking are crucial concepts in LLM development. To improve the performance of an LLM app (RAG, agents), you must have a way to measure it.

LlamaIndex offers key modules to measure the quality of generated results. We also offer key modules to measure retrieval quality.

### 1. Response Evaluation:  
Does the response match the retrieved context? Does it also match the query? Does it match the reference answer or guidelines?

### 2. Retrieval Evaluation:  
Are the retrieved sources relevant to the query?

This section describes how the evaluation components within LlamaIndex work.

---  
## 1. Response Evaluation

Evaluating generated results is difficult: unlike in traditional machine learning, the predicted result isn't a single number, and it can be hard to define quantitative metrics. LlamaIndex offers LLM-based evaluation modules to measure the quality of results. These use a "gold" LLM (e.g. GPT-4) to decide whether the predicted answer is correct in a variety of ways. Note that many of these evaluation modules do not require ground-truth labels. Evaluation can be done with some combination of the query, context, and response, combined with LLM calls.

These evaluation modules are in the following forms:

+ ### Correctness: Whether the generated answer matches that of the reference answer given the query (requires labels).
+ ### Semantic Similarity: Whether the predicted answer is semantically similar to the reference answer (requires labels).
+ ### Faithfulness: Whether the answer is faithful to the retrieved contexts (in other words, whether there is hallucination).
+ ### Context Relevancy: Whether retrieved context is relevant to the query.
+ ### Answer Relevancy: Whether the generated answer is relevant to the query.
+ ### Guideline Adherence: Whether the predicted answer adheres to specific guidelines.

+ Question Generation: In addition to evaluating queries, LlamaIndex can also use your data to generate questions to evaluate on. This means that you can automatically generate questions, and then run an evaluation pipeline to test if the LLM can actually answer questions accurately using your data.

---
## 2. Retrieval Evaluation

We also provide modules to help evaluate retrieval independently.

The concept of retrieval evaluation is not new: given a dataset of questions and ground-truth rankings, we can evaluate retrievers using ranking metrics such as mean reciprocal rank (MRR), hit rate, and precision.
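
To make the ranking metrics concrete, here is a minimal sketch of hit rate and MRR in plain Python. It does not use LlamaIndex's own retrieval evaluation classes; the function names, the question ids, and the document ids are all hypothetical, and it assumes each question has exactly one expected document.

```python
from typing import Dict, List


def hit_rate(retrieved: Dict[str, List[str]], expected: Dict[str, str], k: int) -> float:
    """Fraction of questions whose expected document id appears in the top-k results."""
    hits = sum(1 for q, doc_id in expected.items() if doc_id in retrieved[q][:k])
    return hits / len(expected)


def mean_reciprocal_rank(retrieved: Dict[str, List[str]], expected: Dict[str, str]) -> float:
    """Average of 1/rank of the expected id; a miss contributes 0."""
    total = 0.0
    for q, doc_id in expected.items():
        ranking = retrieved[q]
        if doc_id in ranking:
            total += 1.0 / (ranking.index(doc_id) + 1)  # ranks are 1-based
    return total / len(expected)


# Hypothetical toy data: one expected document per question (assumption).
expected = {"q1": "doc_a", "q2": "doc_b", "q3": "doc_c"}
retrieved = {
    "q1": ["doc_a", "doc_x", "doc_y"],  # expected id at rank 1
    "q2": ["doc_x", "doc_b", "doc_y"],  # expected id at rank 2
    "q3": ["doc_x", "doc_y", "doc_z"],  # expected id not retrieved
}

print("hit rate@3:", hit_rate(retrieved, expected, k=3))
print("MRR:", mean_reciprocal_rank(retrieved, expected))
```

With this toy data, two of the three questions are hits, and the reciprocal ranks are 1, 1/2, and 0, so the MRR is their average.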

The core retrieval evaluation steps revolve around the following:

+ ### Dataset generation: Given an unstructured text corpus, synthetically generate (question, context) pairs.  
+ ### Retrieval Evaluation: Given a retriever and a set of questions, evaluate retrieved results using ranking metrics.  
--- 

#### Installing Packages

In [2]:
!pip install -q openai
!pip install -q llama-index
!pip install -q llama-index-experimental
!pip install -q llama-index-llms-openai

#### Importing Packages

In [31]:
import os
import openai

#os.environ["OPENAI_API_KEY"] = "<the key>"
openai.api_key = os.environ["OPENAI_API_KEY"]

import sys
import shutil
import glob
from pathlib import Path

import warnings
warnings.filterwarnings('ignore')

import pandas as pd


import llama_index

## Llamaindex readers
from llama_index.core import SimpleDirectoryReader

## LlamaIndex Index Types
from llama_index.core import ListIndex
from llama_index.core import VectorStoreIndex
from llama_index.core import TreeIndex
from llama_index.core import KeywordTableIndex
from llama_index.core import SimpleKeywordTableIndex
from llama_index.core import DocumentSummaryIndex
from llama_index.core import KnowledgeGraphIndex
from llama_index.experimental.query_engine import PandasQueryEngine

## LlamaIndex Context Managers
from llama_index.core import StorageContext
from llama_index.core import load_index_from_storage
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.core.response_synthesizers import ResponseMode
from llama_index.core.schema import Node

## LlamaIndex Callbacks
from llama_index.core.callbacks import CallbackManager
from llama_index.core.callbacks import LlamaDebugHandler

In [3]:
import logging

#logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
#logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

#### Defining Models

In [55]:
models = """gpt-4, gpt-4-32k, gpt-4-1106-preview, gpt-4-0125-preview, gpt-4-turbo-preview, 
gpt-4-vision-preview, gpt-4-1106-vision-preview, gpt-4-turbo-2024-04-09, gpt-4-turbo, gpt-4-0613, 
gpt-4-32k-0613, gpt-4-0314, gpt-4-32k-0314, gpt-3.5-turbo, gpt-3.5-turbo-16k, gpt-3.5-turbo-0125, 
gpt-3.5-turbo-1106, gpt-3.5-turbo-0613, gpt-3.5-turbo-16k-0613, gpt-3.5-turbo-0301, text-davinci-003, 
text-davinci-002, gpt-3.5-turbo-instruct, text-ada-001, text-babbage-001, text-curie-001, ada, babbage, 
curie, davinci, gpt-35-turbo-16k, gpt-35-turbo, gpt-35-turbo-0125, gpt-35-turbo-1106, gpt-35-turbo-0613, 
gpt-35-turbo-16k-0613""".split()
models = [m.strip(", ") for m in models]
models

['gpt-4',
 'gpt-4-32k',
 'gpt-4-1106-preview',
 'gpt-4-0125-preview',
 'gpt-4-turbo-preview',
 'gpt-4-vision-preview',
 'gpt-4-1106-vision-preview',
 'gpt-4-turbo-2024-04-09',
 'gpt-4-turbo',
 'gpt-4-0613',
 'gpt-4-32k-0613',
 'gpt-4-0314',
 'gpt-4-32k-0314',
 'gpt-3.5-turbo',
 'gpt-3.5-turbo-16k',
 'gpt-3.5-turbo-0125',
 'gpt-3.5-turbo-1106',
 'gpt-3.5-turbo-0613',
 'gpt-3.5-turbo-16k-0613',
 'gpt-3.5-turbo-0301',
 'text-davinci-003',
 'text-davinci-002',
 'gpt-3.5-turbo-instruct',
 'text-ada-001',
 'text-babbage-001',
 'text-curie-001',
 'ada',
 'babbage',
 'curie',
 'davinci',
 'gpt-35-turbo-16k',
 'gpt-35-turbo',
 'gpt-35-turbo-0125',
 'gpt-35-turbo-1106',
 'gpt-35-turbo-0613',
 'gpt-35-turbo-16k-0613']

In [46]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

#model="gpt-3.5-turbo"
model="gpt-4"
#model="gpt-4-turbo"

Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
embed_model = Settings.embed_model
Settings.llm = OpenAI(temperature=0, model=model)
llm = Settings.llm

In [16]:
import nest_asyncio
nest_asyncio.apply()

## 1. Response Evaluation

### 1.1 Correctness

In [48]:
from llama_index.core.evaluation import CorrectnessEvaluator
evaluator = CorrectnessEvaluator(llm=llm)

In [49]:
query = ("Can you explain the theory of relativity proposed by Albert Einstein in detail?")

reference = """
Certainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, 
published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a 
vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).
General relativity, published in 1915, extended these ideas to include the effects of gravity. According to general relativity, gravity is not a force between 
masses, as described by Newton's theory of gravity, but rather the result of the warping of space and time by mass and energy. Massive objects, such as 
planets and stars, cause a curvature in spacetime, and smaller objects follow curved paths in response to this curvature. This concept is often illustrated 
using the analogy of a heavy ball placed on a rubber sheet, causing it to create a depression that other objects (representing smaller masses) naturally move 
towards.
In essence, general relativity provided a new understanding of gravity, explaining phenomena like the bending of light by gravity (gravitational lensing) and the precession of the orbit of Mercury. It has been confirmed through numerous experiments and observations and has become a fundamental theory in modern physics.
"""

response = """
Certainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, 
published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a 
vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) 
and mass (m).
However, general relativity, published in 1915, extended these ideas to include the effects of magnetism. According to general relativity, 
gravity is not a force between masses but rather the result of the warping of space and time by magnetic fields generated by massive objects. 
Massive objects, such as planets and stars, create magnetic fields that cause a curvature in spacetime, and smaller objects follow curved paths 
in response to this magnetic curvature. This concept is often illustrated using the analogy of a heavy ball placed on a rubber sheet with magnets 
underneath, causing it to create a depression that other objects (representing smaller masses) naturally move towards due to magnetic attraction.
"""

In [50]:
result = evaluator.evaluate(query=query, response=response, reference=reference,)
print(result.score)
print(result.feedback)

2.5
The generated answer is relevant to the user query and starts off correctly by explaining the two components of Einstein's theory of relativity: special relativity and general relativity. However, it contains a significant mistake when it starts discussing general relativity. The generated answer incorrectly states that general relativity includes the effects of magnetism and that gravity is the result of the warping of space and time by magnetic fields. This is incorrect. General relativity describes gravity as the result of the warping of space and time by mass and energy, not magnetic fields. This is a significant error that affects the overall correctness of the answer.


### 1.2 Semantic Similarity

In [52]:
from llama_index.core.evaluation import SemanticSimilarityEvaluator
evaluator = SemanticSimilarityEvaluator()

In [53]:
# This evaluator only uses `response` and `reference`, passing in query does not influence the evaluation
# query = 'What is the color of the sky'

response = "The sky is typically blue"
reference = """The color of the sky can vary depending on several factors, including time of day, weather conditions, and location.

During the day, when the sun is in the sky, the sky often appears blue. 
This is because of a phenomenon called Rayleigh scattering, where molecules and particles in the Earth's atmosphere scatter sunlight in all directions, and blue light is scattered more than other colors because it travels as shorter, smaller waves. 
This is why we perceive the sky as blue on a clear day.
"""

result = await evaluator.aevaluate(response=response, reference=reference,)
print("Score: ", result.score)
print("Passing: ", result.passing)  # default similarity threshold is 0.8

Score:  0.8741614884630503
Passing:  True


In [54]:
response = "Sorry, I do not have sufficient context to answer this question."
reference = """The color of the sky can vary depending on several factors, including time of day, weather conditions, and location.

During the day, when the sun is in the sky, the sky often appears blue. 
This is because of a phenomenon called Rayleigh scattering, where molecules and particles in the Earth's atmosphere scatter sunlight in all directions, and blue light is scattered more than other colors because it travels as shorter, smaller waves. 
This is why we perceive the sky as blue on a clear day.
"""

result = await evaluator.aevaluate(response=response, reference=reference,)
print("Score: ", result.score)
print("Passing: ", result.passing)  # default similarity threshold is 0.8

Score:  0.7213441101430746
Passing:  False


#### Customization

In [55]:
from llama_index.core.embeddings import resolve_embed_model
evaluator = SemanticSimilarityEvaluator(embed_model=embed_model, similarity_threshold=0.6,)

In [56]:
response = "The sky is yellow."
reference = "The sky is blue."

result = await evaluator.aevaluate(response=response, reference=reference,)
print("Score: ", result.score)
print("Passing: ", result.passing)

Score:  0.9406303029427779
Passing:  True


Note that a high score does not imply that the answer is correct.  
Embedding similarity primarily captures the notion of "relevancy". Since both the response and the reference discuss "the sky" and a color, they are scored as semantically similar even though they contradict each other.
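
The following sketch shows the idea behind a threshold on embedding similarity, assuming the score is the cosine similarity between two embedding vectors and that a result passes when the score is at least the threshold. The three-dimensional vectors are made up for illustration and are not real model embeddings; `cosine_similarity` and `passes` are hypothetical helper names, not LlamaIndex functions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes(score: float, threshold: float = 0.8) -> bool:
    """Assumed pass rule: the score must reach the threshold."""
    return score >= threshold


# Made-up toy "embeddings" (assumption): two sentences about the sky's color
# point in nearly the same direction, while a refusal points elsewhere.
sky_blue = [0.9, 0.4, 0.1]
sky_yellow = [0.9, 0.3, 0.2]
refusal = [0.1, 0.2, 0.95]

score_close = cosine_similarity(sky_blue, sky_yellow)
score_far = cosine_similarity(sky_blue, refusal)

print(score_close, passes(score_close))  # contradictory but on-topic: passes
print(score_far, passes(score_far))      # off-topic: fails
```

The toy vectors reproduce the behavior seen above: a contradictory but on-topic answer clears the threshold, which is why this metric is better read as a relatedness check than as a correctness check.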

### 1.3 Faithfulness

The `FaithfulnessEvaluator` module measures whether the response from a query engine is supported by any of the source nodes.  
This is useful for detecting whether the response was hallucinated.  
The data is extracted from the [New York City](https://en.wikipedia.org/wiki/New_York_City) Wikipedia page.

In [57]:
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Response,
)
from llama_index.core.evaluation import FaithfulnessEvaluator
from llama_index.core.node_parser import SentenceSplitter
import pandas as pd
pd.set_option("display.max_colwidth", 0)

In [58]:
evaluator_gpt4 = FaithfulnessEvaluator(llm=llm)

In [63]:
documents = SimpleDirectoryReader(input_files=["../Data/nyc_text.txt"]).load_data()
splitter = SentenceSplitter(chunk_size=512)
vector_index = VectorStoreIndex.from_documents(documents, transformations=[splitter])

In [64]:
from llama_index.core.evaluation import EvaluationResult

In [65]:
# define jupyter display function
def display_eval_df(response: Response, eval_result: EvaluationResult) -> None:
    if response.source_nodes == []:
        print("no response!")
        return
    eval_df = pd.DataFrame(
        {
            "Response": str(response),
            "Source": response.source_nodes[0].node.text[:1000] + "...",
            "Evaluation Result": "Pass" if eval_result.passing else "Fail",
            "Reasoning": eval_result.feedback,
        },
        index=[0],
    )
    eval_df = eval_df.style.set_properties(
        **{
            "inline-size": "600px",
            "overflow-wrap": "break-word",
        },
        subset=["Response", "Source"]
    )
    display(eval_df)

To run an evaluation, pass the `Response` object returned by the query to the evaluator's `.evaluate_response()` method.  
Let's evaluate the outputs of the `vector_index`.

In [67]:
query_engine = vector_index.as_query_engine()
response_vector = query_engine.query("How did New York City get its name?")
eval_result = evaluator_gpt4.evaluate_response(response=response_vector)
display_eval_df(response_vector, eval_result)

| | Response | Source | Evaluation Result | Reasoning |
|---|---|---|---|---|
| 0 | New York City was named in 1664 in honor of the Duke of York, who later became King James II of England. His elder brother, King Charles II, had granted him the territory of New Netherland, which included the city of New Amsterdam, when England took control of it from the Dutch. The city was briefly regained by the Dutch in 1673 and renamed New Orange, but it has been continuously named New York since November 1674. | The city came under British control in 1664 and was renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city was regained by the Dutch in July 1673 and was renamed New Orange for one year and three months; the city has been continuously named New York since November 1674. New York City was the capital of the United States from 1785 until 1790, and has been the largest U.S. city since 1790. The Statue of Liberty greeted millions of immigrants as they came to the U.S. by ship in the late 19th and early 20th centuries, and is a symbol of the U.S. and its ideals of liberty and peace. In the 21st century, New York City has emerged as a global node of creativity, entrepreneurship, and as a symbol of freedom and cultural diversity. The New York Times has won the most Pulitzer Prizes for journalism and remains the U.S. media's "newspaper of record". In 2019, New York City was voted the greatest city in the world in a survey of over 30,000 p... | Pass | YES |


#### Benchmark on Generated Question

Now let's generate a few more questions so that we have more to evaluate, and run a small benchmark.

In [68]:
from llama_index.core.evaluation import DatasetGenerator

question_generator = DatasetGenerator.from_documents(documents)
eval_questions = question_generator.generate_questions_from_nodes(5)
eval_questions

['What is the population of New York City as of 2020?',
 'How does the population of New York City compare to that of Los Angeles?',
 'What is the geographical location of New York City within the state of New York?',
 'How many people live within 250 mi (400 km) of New York City?',
 'What are the five boroughs of New York City and the respective counties they are coextensive with?']

In [69]:
import asyncio

def evaluate_query_engine(query_engine, questions):
    c = [query_engine.aquery(q) for q in questions]
    results = asyncio.run(asyncio.gather(*c))
    print("finished query")

    total_correct = 0
    for r in results:
        # evaluate with gpt 4
        eval_result = (
            1 if evaluator_gpt4.evaluate_response(response=r).passing else 0
        )
        total_correct += eval_result

    return total_correct, len(results)

In [70]:
vector_query_engine = vector_index.as_query_engine()
correct, total = evaluate_query_engine(vector_query_engine, eval_questions[:5])

print(f"score: {correct}/{total}")

finished query
score: 5/5


### 1.4 Context Relevancy and Answer Relevancy

### 1.5 Guideline Adherence

### 1.6 Question Generation