Reference
https://cookbook.openai.com/examples/evaluation/evaluate_rag_with_llamaindex

In [6]:
!pip install llama-index



In [7]:
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio

nest_asyncio.apply()

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.llms.openai import OpenAI

import os
import pandas as pd

In [8]:
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Load Data and Build Index**

In [9]:
documents = SimpleDirectoryReader("/content/drive/MyDrive/PDF").load_data()

# Define an LLM
# llm = gemma_lm

# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

**Build a QueryEngine and start querying.**

In [10]:
query_engine = vector_index.as_query_engine()

**Check response**

In [11]:
response_vector = query_engine.query("What is difference between computer vision and Deep Learning?")

In [12]:
response_vector.response

"Computer vision is the computer's ability to extract information and insights from images and videos, while deep learning is a method in artificial intelligence that teaches computers to process data in a way inspired by the human brain."

In [13]:
response_vector = query_engine.query("Who is captain of indian football team?")

In [14]:
response_vector.response

"I'm unable to provide an answer to that question as it is not related to the context provided."

**By default, the query engine retrieves the two most similar nodes**

In [15]:
# First retrieved node
response_vector.source_nodes[0].get_text()

'Virtual\nassistants\nsuch\nas\nAmazon\nAlexa\nand\nautomatic\ntranscription\nsoftware\nuse\nspeech\nrecognition\nto\ndo\nthe\nfollowing\ntasks:\n●\nAssist\ncall\ncenter\nagents\nand\nautomatically\nclassify\ncalls.\n●\nConvert\nclinical\nconversations\ninto\ndocumentation\nin\nreal\ntime.\n●\nAccurately\nsubtitle\nvideos\nand\nmeeting\nrecordings\nfor\na\nwider\ncontent\nreach.\nNatural\nlanguage\nprocessing\nComputers\nuse\ndeep\nlearning\nalgorithms\nto\ngather\ninsights\nand\nmeaning\nfrom\ntext\ndata\nand\ndocuments\n.\nThis\nability\nto\nprocess\nnatural,\nhuman-created\ntext\nhas\nseveral\nuse\ncases,\nincluding\nin\nthese\nfunctions:\n●\nAutomated\nvirtual\nagents\nand\nchatbots\n●\nAutomatic\nsummarization\nof\ndocuments\nor\nnews\narticles\n●\nBusiness\nintelligence\nanalysis\nof\nlong-form\ndocuments,\nsuch\nas\nemails\nand\nforms\n●\nIndexing\nof\nkey\nphrases\nthat\nindicate\nsentiment,\nsuch\nas\npositive\nand\nnegative\ncomments\non\nsocial\nmedia\nRecommendation\nengine

In [16]:
# Second retrieved node
response_vector.source_nodes[1].get_text()

"You\ncan\ngroup\nthese\nvarious\nuse\ncases\nof\ndeep\nlearning\ninto\nfour\nbroad\ncategories—computer\nvision,\nspeech\nrecognition,\nnatural\nlanguage\nprocessing\n(NLP),\nand\nrecommendation\nengines.\nComputer\nvision\nComputer\nvision\nis\nthe\ncomputer's\nability\nto\nextract\ninformation\nand\ninsights\nfrom\nimages\nand\nvideos.\nComputers\ncan\nuse\ndeep\nlearning\ntechniques\nto\ncomprehend\nimages\nin\nthe\nsame\nway\nthat\nhumans\ndo.\nComputer\nvision\nhas\nseveral\napplications,\nsuch\nas\nthe\nfollowing:\n●\nContent\nmoderation\nto\nautomatically\nremove\nunsafe\nor\ninappropriate\ncontent\nfrom\nimage\nand\nvideo\narchives\n●\nFacial\nrecognition\nto\nidentify\nfaces\nand\nrecognize\nattributes\nlike\nopen\neyes,\nglasses,\nand\nfacial\nhair\n●\nImage\nclassification\nto\nidentify\nbrand\nlogos,\nclothing,\nsafety\ngear ,\nand\nother\nimage\ndetails\nSpeech\nrecognition\nDeep\nlearning\nmodels\ncan\nanalyze\nhuman\nspeech\ndespite\nvarying\nspeech\npatterns,\npitch,\n

In [17]:
response_vector.source_nodes[2].get_text()

IndexError: list index out of range

You can change the number of retrieved nodes by passing similarity_top_k when building the query engine:
**vector_index.as_query_engine(similarity_top_k=k)**
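
To make the effect of similarity_top_k concrete, here is a minimal pure-Python sketch of top-k retrieval by cosine similarity. It is not LlamaIndex's implementation; the node ids, the 2-d vectors, and the helper names `cosine` and `top_k` are all hypothetical and only illustrate the idea of keeping the k best-scoring nodes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=2):
    """Return the ids of the k nodes most similar to the query, best first."""
    ranked = sorted(
        node_vecs,
        key=lambda node_id: cosine(query_vec, node_vecs[node_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings for four chunks (real embeddings have many more dimensions)
node_vecs = {
    "node_a": [1.0, 0.0],
    "node_b": [0.8, 0.6],
    "node_c": [0.0, 1.0],
    "node_d": [-1.0, 0.0],
}
query_vec = [0.9, 0.1]

print(top_k(query_vec, node_vecs, k=2))  # two nodes, like the default above
print(top_k(query_vec, node_vecs, k=3))  # a larger k returns one more node
```

With k=2 the list has exactly two entries, which is why indexing a third source node fails in the cell above.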

### Evaluation

In a RAG system, evaluation focuses on two critical aspects:

**Retrieval Evaluation:** This assesses the accuracy and relevance of the information retrieved by the system.

**Response Evaluation:** This measures the quality and appropriateness of the responses generated by the system based on the retrieved information.

LlamaIndex provides a generate_question_context_pairs function that uses an LLM to generate questions from each chunk and pairs every question with the chunk it came from. The resulting dataset can be used for both Retrieval and Response Evaluation of the RAG system.
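
As a rough picture of what such a dataset holds, the sketch below builds question-context pairs with plain dictionaries. It is an illustration under assumptions, not the library's code: `fake_questions` is a hypothetical stand-in for the LLM call, and the `queries` / `corpus` / `relevant_docs` layout mirrors how the dataset is understood to be organized (only `queries` is actually used in this notebook).

```python
import uuid


def fake_questions(text, n=2):
    """Hypothetical stand-in for the LLM that writes n questions about a chunk."""
    return [f"Question {i + 1} about: {text[:30]}" for i in range(n)]


def build_qa_pairs(chunks, num_questions_per_chunk=2):
    """Pair every generated question with the id of the chunk it came from."""
    queries, corpus, relevant_docs = {}, {}, {}
    for node_id, text in chunks.items():
        corpus[node_id] = text
        for question in fake_questions(text, num_questions_per_chunk):
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # The source chunk is the expected retrieval result for this question
            relevant_docs[query_id] = [node_id]
    return {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs}


# Hypothetical chunks standing in for the parsed nodes
chunks = {
    "node-1": "Deep learning is a method in artificial intelligence.",
    "node-2": "Computer vision extracts information from images and videos.",
}
dataset = build_qa_pairs(chunks)
print(len(dataset["queries"]))  # chunks x questions per chunk
```

Because each question remembers its source chunk, a retriever can later be scored on whether it returns that chunk for the question.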

In [18]:
from llama_index.llms.openai import OpenAI
qa_dataset = generate_question_context_pairs(
    nodes,
    llm=OpenAI(),
    num_questions_per_chunk=2
)

100%|██████████| 4/4 [00:26<00:00,  6.60s/it]


In [19]:
list(qa_dataset.queries.values())

['How does deep learning differ from traditional methods of processing data, and how is it inspired by the human brain?',
 'Provide examples of everyday products and emerging technologies that utilize deep learning technology, and explain how it enhances their functionality.',
 'How do businesses utilize deep learning models in various applications, and what are some examples of industries where deep learning is commonly used?',
 'Provide examples of specific applications of deep learning in self-driving cars, defense systems, medical image analysis, and factory operations, highlighting the importance of automation and detection capabilities in each scenario.',
 'How can deep learning techniques be applied in computer vision, and what are some specific applications of computer vision mentioned in the text?',
 'Explain the role of speech recognition in utilizing deep learning models, and provide examples of tasks that virtual assistants and automatic transcription software can perform u

**Retrieval Evaluation:**

In [20]:
retriever = vector_index.as_retriever(similarity_top_k=2)

**Hit Rate:**

Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it’s about how often our system gets it right within the top few guesses.

**Mean Reciprocal Rank (MRR):**

For each query, MRR evaluates the system’s accuracy by looking at the rank of the highest-placed relevant document. Specifically, it’s the average of the reciprocals of these ranks across all the queries. So, if the first relevant document is the top result, the reciprocal rank is 1; if it’s second, the reciprocal rank is 1/2, and so on.
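
The two definitions above can be sketched in a few lines of plain Python. This is an illustrative calculation on made-up data, not the RetrieverEvaluator implementation; the node ids and the helpers `hit` and `reciprocal_rank` are hypothetical.

```python
def hit(retrieved_ids, expected_ids):
    """1.0 if any expected node appears among the retrieved nodes, else 0.0."""
    return 1.0 if any(node_id in expected_ids for node_id in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids, expected_ids):
    """1 / rank of the first relevant node, or 0.0 if none was retrieved."""
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id in expected_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical (retrieved, expected) pairs for four queries with top_k=2
results = [
    (["n1", "n2"], {"n1"}),  # relevant node at rank 1
    (["n3", "n2"], {"n2"}),  # relevant node at rank 2
    (["n4", "n5"], {"n6"}),  # relevant node not retrieved
    (["n7", "n8"], {"n7"}),  # relevant node at rank 1
]

hit_rate = sum(hit(r, e) for r, e in results) / len(results)
mrr = sum(reciprocal_rank(r, e) for r, e in results) / len(results)

print(hit_rate)  # 3 hits out of 4 queries
print(mrr)       # (1 + 0.5 + 0 + 1) / 4
```

MRR can never exceed the hit rate: every hit contributes 1 to the hit rate but at most 1 to the reciprocal-rank sum.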

In [21]:
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

In [22]:
# Evaluate
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

In [23]:
def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

In [24]:
display_results("OpenAI Embedding Retriever", eval_results)

| | Retriever Name | Hit Rate | MRR |
|---|---|---|---|
| 0 | OpenAI Embedding Retriever | 0.875 | 0.8125 |


An MRR lower than the hit rate indicates that the relevant node is retrieved but not always ranked first. MRR could be improved with a reranker, which reorders the retrieved nodes by relevance.
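
The reported scores can be unpacked with a little arithmetic. The sketch below assumes the dataset holds 8 queries (4 chunks with 2 questions each, a count the notebook never prints) and uses the fact that with similarity_top_k=2 a hit can only occur at rank 1 or rank 2.

```python
n_queries = 8  # assumption: 4 chunks x 2 questions per chunk
hit_rate = 0.875
mrr = 0.8125

# Number of queries whose source chunk was retrieved at all
hits = round(hit_rate * n_queries)

# With top_k=2: hits = rank1 + rank2 and mrr * n_queries = rank1 + rank2 / 2,
# so subtracting the two equations gives rank2 / 2.
rank2 = round(2 * (hits - mrr * n_queries))
rank1 = hits - rank2
misses = n_queries - hits

print(rank1, rank2, misses)
```

Under that assumption, the numbers correspond to six queries answered at rank 1, one at rank 2, and one miss.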

**Response Evaluation:**

FaithfulnessEvaluator: Measures whether the response from a query engine is supported by the retrieved source nodes. This is useful for detecting hallucinated responses.

RelevancyEvaluator: Measures whether the response and the source nodes together match the query.

In [25]:
# Get the list of queries from the above created dataset

queries = list(qa_dataset.queries.values())

In [40]:
queries

['How does deep learning differ from traditional methods of processing data, and how is it inspired by the human brain?',
 'Provide examples of everyday products and emerging technologies that utilize deep learning technology, and explain how it enhances their functionality.',
 'How do businesses utilize deep learning models to analyze data and make predictions in various applications? Provide examples from the context information.',
 'In what ways are deep learning models used in different industries such as automotive, aerospace, manufacturing, electronics, and medical research? Provide specific examples of applications mentioned in the text.',
 'How can deep learning techniques be applied in computer vision, and what are some specific applications of computer vision mentioned in the context information?',
 'Explain how speech recognition technology utilizes deep learning models, and provide examples of tasks that can be accomplished using speech recognition in various industries.',


**Faithfulness Evaluator**

In [26]:
# gpt-3.5-turbo
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context_gpt35 = ServiceContext.from_defaults(llm=gpt35)

# gpt-4
gpt4 = OpenAI(temperature=0, model="gpt-4")
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)



Create a QueryEngine that uses the gpt-3.5-turbo service_context to generate the response for a query.

In [27]:
vector_index = VectorStoreIndex(nodes, service_context=service_context_gpt35)
query_engine = vector_index.as_query_engine()

In [28]:
from llama_index.core.evaluation import FaithfulnessEvaluator
faithfulness_gpt35 = FaithfulnessEvaluator(service_context=service_context_gpt35)

In [29]:
eval_query = queries[3]
eval_query

'Provide examples of specific applications of deep learning in self-driving cars, defense systems, medical image analysis, and factory operations, highlighting the importance of automation and detection capabilities in each scenario.'

In [30]:
response_vector = query_engine.query(eval_query)

In [31]:
# Compute faithfulness evaluation

eval_result = faithfulness_gpt35.evaluate_response(response=response_vector)

In [32]:
# Inspect the full evaluation result (contexts, response, passing flag, feedback)
eval_result

EvaluationResult(query=None, contexts=['Deep\nlearning\nmodels\nare\ncomputer\nfiles\nthat\ndata\nscientists\nhave\ntrained\nto\nperform\ntasks\nusing\nan\nalgorithm\nor\na\npredefined\nset\nof\nsteps.\nBusinesses\nuse\ndeep\nlearning\nmodels\nto\nanalyze\ndata\nand\nmake\npredictions\nin\nvarious\napplications.\nWhat\nare\nthe\nuses\nof\ndeep\nlearning?\nDeep\nlearning\nhas\nseveral\nuse\ncases\nin\nautomotive,\naerospace,\nmanufacturing,\nelectronics,\nmedical\nresearch,\nand\nother\nfields.\nThese\nare\nsome\nexamples\nof\ndeep\nlearning:\n●\nSelf-driving\ncars\nuse\ndeep\nlearning\nmodels\nto\nautomatically\ndetect\nroad\nsigns\nand\npedestrians.\n●\nDefense\nsystems\nuse\ndeep\nlearning\nto\nautomatically\nflag\nareas\nof\ninterest\nin\nsatellite\nimages.\n●\nMedical\nimage\nanalysis\nuses\ndeep\nlearning\nto\nautomatically\ndetect\ncancer\ncells\nfor\nmedical\ndiagnosis.\n●\nFactories\nuse\ndeep\nlearning\napplications\nto\nautomatically\ndetect\nwhen\npeople\nor\nobjects\nare\nw

In [33]:
# The passing attribute tells whether the response passed the evaluation.
eval_result.passing

True

In [34]:
eval_result.response

'Self-driving cars utilize deep learning models to automatically detect road signs and pedestrians, enhancing safety and navigation. Defense systems leverage deep learning for automatic flagging of areas of interest in satellite images, aiding in surveillance and security measures. In medical image analysis, deep learning is employed to automatically detect cancer cells, facilitating early and accurate medical diagnosis. Factories utilize deep learning applications to automatically detect when people or objects are within an unsafe distance of machines, ensuring worker safety and operational efficiency. The automation and detection capabilities of deep learning in these scenarios significantly enhance performance, accuracy, and efficiency in various critical applications.'

**Relevancy Evaluator**

In [35]:
from llama_index.core.evaluation import RelevancyEvaluator

relevancy_gpt35 = RelevancyEvaluator(service_context=service_context_gpt35)

In [36]:
# Pick a query
query = queries[3]

query

'Provide examples of specific applications of deep learning in self-driving cars, defense systems, medical image analysis, and factory operations, highlighting the importance of automation and detection capabilities in each scenario.'

In [37]:
# Generate response.
# response_vector has response and source nodes (retrieved context)
response_vector = query_engine.query(query)

# Relevancy evaluation
eval_result = relevancy_gpt35.evaluate_response(
    query=query, response=response_vector
)

In [38]:
eval_result.feedback

'YES'

In [66]:
eval_result.passing

True

In [68]:
from llama_index.core.evaluation import BatchEvalRunner

# Let's pick the first 2 queries to do evaluation
batch_eval_queries = queries[:2]

# Initialize BatchEvalRunner to compute Faithfulness and Relevancy Evaluation.
runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt35, "relevancy": relevancy_gpt35},
    workers=8,
)

# Compute evaluation
eval_results = await runner.aevaluate_queries(
    query_engine, queries=batch_eval_queries
)

In [69]:
# Let's get faithfulness score

faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])

faithfulness_score

1.0

In [70]:
# Let's get relevancy score

relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])

relevancy_score


1.0

A faithfulness score of 1.0 means the evaluator judged every answer in the batch to be supported by the retrieved context, i.e. it flagged no hallucinations. Note that the batch here contains only two queries.

A relevancy score of 1.0 means the evaluator judged every answer, together with its retrieved context, to match the corresponding query.
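
The two score cells above repeat the same pass-rate calculation. A small helper, sketched here with stand-in result objects, computes it once per evaluator. `FakeResult` and `pass_rate` are hypothetical names; the only thing taken from the notebook is that each evaluation result exposes a `passing` attribute.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical stand-in for an evaluation result with a passing flag."""
    passing: bool


def pass_rate(results):
    """Fraction of results whose passing flag is truthy."""
    if not results:
        raise ValueError("no results to score")
    return sum(bool(result.passing) for result in results) / len(results)


# Made-up batch results keyed by evaluator name, shaped like the runner output above
eval_results = {
    "faithfulness": [FakeResult(True), FakeResult(True)],
    "relevancy": [FakeResult(True), FakeResult(False)],
}

scores = {name: pass_rate(results) for name, results in eval_results.items()}
print(scores)
```

With only a handful of queries each result moves the score by a large step, so a bigger batch gives a more stable estimate.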