#### Deep Evaluation of RAG Systems using deepeval

This notebook uses the deepeval library to evaluate a Retrieval-Augmented Generation (RAG) system. It first builds a small RAG application, then shows how to define test cases and score them with deepeval metrics.

##### Build a basic RAG application on CSV data

In [3]:
import os
import sys
from dotenv import load_dotenv
load_dotenv()
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.vectorstores import FAISS

In [4]:
#Load csv data
file_path="data/customers-100.csv"
import pandas as pd
data=pd.read_csv(file_path)
data.head(2)

|  | Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | Chile | 229.077.5154 | 397.884.0519x718 | zunigavanessa@smith.info | 2020-08-24 | http://www.stephenson.com/ |
| 1 | 2 | 1Ef7b82A4CAAD10 | Preston | Lozano | Vega-Gentry | East Jimmychester | Djibouti | 5153435776 | 686-620-1820x944 | vmata@colon.com | 2021-04-23 | http://www.hobbs.com/ |


In [6]:
#Embeddings
embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")



In [7]:
#Vectorstore
from langchain_community.docstore.in_memory import InMemoryDocstore
import faiss
index=faiss.IndexFlatL2(len(embeddings.embed_query(" ")))
vector_store=FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={}
)
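
`faiss.IndexFlatL2` is an exact (brute-force) index: a query is compared against every stored vector by squared Euclidean distance. The idea can be sketched in plain Python as follows. The 3-dimensional vectors are made up for illustration, and `flat_l2_search` is a hypothetical helper, not part of FAISS.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(stored, query, k=2):
    """Return the k (distance, position) pairs closest to the query.

    Every stored vector is scored, which is what makes a flat index exact.
    """
    scored = [(squared_l2(vec, query), pos) for pos, vec in enumerate(stored)]
    return sorted(scored)[:k]


# Toy vectors standing in for document embeddings (illustrative values only).
stored_vectors = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
]
nearest = flat_l2_search(stored_vectors, query=[0.9, 0.1, 0.0], k=2)
print(nearest)  # closest first: position 1, then position 0
```

Scanning every vector is acceptable for a few hundred rows like this CSV; larger collections usually call for an approximate index instead.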

In [5]:
## Document loaders
loader=CSVLoader(file_path)
data=loader.load()

In [8]:
vector_store.add_documents(documents=data)

['680d5dd0-f1c1-45f0-943b-22c7ea31066b',
 '6c0516b3-f49e-4a7a-aab7-a5f1efced207',
 '55db149f-2825-49a8-8ed6-562d8a4f9ba8',
 '06256581-18e6-4a1f-8199-01b8930e29f4',
 '804f2d12-8126-4354-abf7-16b43df2c12c',
 'dfd18fb3-86b5-43b1-906e-9985c27e64ad',
 '816ed9f3-4d9d-4cf0-96cc-5dab612d4e67',
 'e134199b-3256-4421-bc8f-d0a506420d31',
 '64c15675-2de0-40e8-ad12-10fe6633f5a0',
 '18ca37d8-944b-45b5-99d9-fd9f4adbe7e4',
 'b15d6547-de63-44c9-a1b5-4220be24e23f',
 '3a475319-2ed2-4a85-b354-f238404ffe7c',
 '2325c1ef-159a-47d5-a286-20c3823a3a90',
 '5030e0df-aff7-4abb-a6f2-360859060931',
 'bc755678-491a-491a-afb5-47fd019e1b47',
 'ee306d9b-5a76-4289-b2b7-4a2efee4ce31',
 'ffe7d1a1-3fb0-4ed9-90a4-aef2798fff80',
 '685289c8-fa5b-4b96-8032-e0a09cc1203b',
 '20f14c8d-32ef-4481-800b-a758b2b376c1',
 '06a9c1fb-f9b2-4d3c-99f6-75184503bac0',
 '9e415726-a940-4fd9-8f64-b7180240023a',
 '00067b5a-92ea-4435-a607-2cca8b7cc2fc',
 '1a7a1903-891d-4dee-b04d-c0da06e1a09d',
 '20d009fd-adcf-4cee-8a27-c95b2e36b8ba',
 'd325bce6-be80-

In [9]:
#Retriever
retriever=vector_store.as_retriever(search_kwargs={'k':2})

groq_api_key=os.getenv("GROQ_API_KEY")
llm=ChatGroq(groq_api_key=groq_api_key,model_name="Llama3-8b-8192")

from langchain_core.prompts import PromptTemplate
prompt=PromptTemplate(
    template=""" 
    You are an assistant for question-answering tasks.
    Use the following pieces of retrieved context to answer
    the question. If you don't know the answer, say that you don't know.
    Keep the answer concise.
    {context}
    Question:{question}
    """,
    input_variables=['context','question']
)

#Building chain
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

def format_docs(retrieved_docs):
    context_text="\n".join(doc.page_content for doc in retrieved_docs)
    return context_text

parallel_chain=RunnableParallel({
    'context':retriever | RunnableLambda(format_docs),
    'question': RunnablePassthrough()
})

parser = StrOutputParser()

rag_chain = parallel_chain | prompt | llm | parser
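
To make the data flow of this chain explicit, here is a dependency-free sketch in which plain functions stand in for each stage. `fake_retriever` and `fake_llm` are hypothetical stubs, so this only mirrors the shape of the pipeline and is not how LangChain implements it.

```python
def fake_retriever(question):
    # Stub: a real retriever would return the k most similar documents.
    return ["First Name: Sheryl\nLast Name: Baxter\nCompany: Rasmussen Group"]


def format_context(chunks):
    # Same idea as format_docs: join the retrieved text into one string.
    return "\n".join(chunks)


def build_prompt(inputs):
    # Fill the template slots with the context and the original question.
    return f"{inputs['context']}\nQuestion:{inputs['question']}"


def fake_llm(prompt_text):
    # Stub: reports the prompt length instead of calling a model.
    return {"content": f"(model reply to {len(prompt_text)} characters of prompt)"}


def parse_output(message):
    # Like a string output parser: keep only the text of the reply.
    return message["content"]


def run_chain(question):
    # Parallel step: the same question feeds both branches of the dict.
    inputs = {
        "context": format_context(fake_retriever(question)),
        "question": question,
    }
    return parse_output(fake_llm(build_prompt(inputs)))


print(run_chain("which company does sheryl Baxter work for?"))
```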

In [10]:
answer=rag_chain.invoke('which company does sheryl Baxter work for?')
print(answer)

Sheryl Baxter works for Rasmussen Group.


##### Default RAG metrics using the evaluate function

In [2]:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric
)
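
`FaithfulnessMetric` and the three contextual metrics imported above score the retrieved chunks as well as the answer, so their test cases need a `retrieval_context` list of strings. One possible way to collect those fields is sketched below. `build_case_fields` and `StubDoc` are hypothetical names, and the two lambdas are stubs standing in for `retriever.invoke` and `rag_chain.invoke`.

```python
from dataclasses import dataclass


@dataclass
class StubDoc:
    """Stand-in for a LangChain Document; only page_content is used."""
    page_content: str


def build_case_fields(question, retrieve_fn, answer_fn, expected_output=None):
    """Collect the keyword arguments for one test case.

    retrieve_fn and answer_fn are injected so the same helper works with
    real components or with stubs.
    """
    docs = retrieve_fn(question)
    return {
        "input": question,
        "actual_output": answer_fn(question),
        "expected_output": expected_output,
        "retrieval_context": [doc.page_content for doc in docs],
    }


# Stubbed usage; with the real stack the dict could be unpacked as
# LLMTestCase(**build_case_fields(q, retriever.invoke, rag_chain.invoke)).
fields = build_case_fields(
    "which company does sheryl Baxter work for?",
    retrieve_fn=lambda q: [StubDoc("Company: Rasmussen Group")],
    answer_fn=lambda q: "Sheryl Baxter works for Rasmussen Group.",
    expected_output="Rasmussen Group",
)
print(fields["retrieval_context"])
```

Note that calling the retriever and the chain separately runs retrieval twice; that is fine for a sketch, but a chain that returns its retrieved documents would avoid the duplicate lookup.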



In [11]:
metric = AnswerRelevancyMetric(
    threshold=0.7,
    include_reason=True
)

test_case = LLMTestCase(
    input = "which company does sheryl Baxter work for?",
    actual_output="Sheryl Baxter works for Rasmussen Group"
)

#Run metric as standalone
metric.measure(test_case)
print(metric.score)
print(metric.reason)

1.0
The score is 1.00 because the output is perfectly relevant and directly answers the input without any irrelevant statements. Great job on maintaining such high relevancy!
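
The reason mentions irrelevant statements because, as deepeval documents it, answer relevancy is the share of statements in the output that the judge model considers relevant to the input. A rough sketch of that arithmetic follows, with hand-written verdicts in place of the LLM judge. `relevancy_score` is a hypothetical helper, and counting only `"yes"` verdicts as relevant is a simplifying assumption.

```python
def relevancy_score(verdicts):
    """Fraction of statements judged relevant.

    Assumption: only a "yes" verdict counts as relevant.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    relevant = sum(1 for verdict in verdicts if verdict == "yes")
    return relevant / len(verdicts)


# One statement, judged relevant: the situation in the run above.
single = relevancy_score(["yes"])
# Illustrative case: two of four statements are off-topic.
mixed = relevancy_score(["yes", "no", "yes", "no"])
print(single, mixed)
```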


In [12]:
# Alternatively, run the metric through the evaluate function
evaluate(test_cases=[test_case],metrics=[metric])

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:03,  3.13s/test case]



Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because the provided output perfectly matches the input question with high precision and no irrelevant statements., error: None)

For test case:

  - input: which company does sheryl Baxter work for?
  - actual output: Sheryl Baxter works for Rasmussen Group
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass Rates

Answer Relevancy: 100.00% pass rate







EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Answer Relevancy', threshold=0.7, success=True, score=1.0, reason='The score is 1.00 because the provided output perfectly matches the input question with high precision and no irrelevant statements.', strict_mode=False, evaluation_model='gpt-4o', error=None, evaluation_cost=0.0034175000000000004, verbose_logs='Statements:\n[\n    "Sheryl Baxter works for Rasmussen Group."\n] \n \nVerdicts:\n[\n    {\n        "verdict": "yes",\n        "reason": null\n    }\n]')], conversational=False, multimodal=False, input='which company does sheryl Baxter work for?', actual_output='Sheryl Baxter works for Rasmussen Group', expected_output=None, context=None, retrieval_context=None, additional_metadata=None)], confident_link=None)
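
The summary rests on two checks: a metric passes when its score reaches the threshold, and the pass rate is the share of test cases that passed. A minimal sketch of that bookkeeping follows. `metric_passes` and `pass_rate` are hypothetical helpers, the inclusive `>=` comparison is an assumption about deepeval's non-strict mode, and the second score in the last call is invented for illustration.

```python
def metric_passes(score, threshold):
    # Assumption: in non-strict mode a score equal to the threshold passes.
    return score >= threshold


def pass_rate(results):
    """Percentage of (score, threshold) pairs that pass for one metric."""
    if not results:
        return 0.0
    passed = sum(metric_passes(score, threshold) for score, threshold in results)
    return 100.0 * passed / len(results)


# The single Answer Relevancy result from the run above.
print(f"Answer Relevancy: {pass_rate([(1.0, 0.7)]):.2f}% pass rate")

# Adding a made-up failing case halves the rate.
print(f"{pass_rate([(1.0, 0.7), (0.4, 0.7)]):.2f}% pass rate")
```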

In [13]:
## GEval metric
from deepeval.metrics import GEval

correctness_metric = GEval(
    name="correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT]
)

test_case = LLMTestCase(
    input="The dog chased the cat up the tree, who ran up the tree?",
    actual_output="It depends, some might consider the cat, while others might argue the dog.",
    expected_output="The cat."
)

evaluate(test_cases=[test_case], metrics=[correctness_metric])

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:04,  4.43s/test case]



Metrics Summary

  - ❌ correctness (GEval) (score: 0.23650124897758448, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The actual output introduces ambiguity by suggesting both the cat and the dog, while the expected output clearly states the cat ran up the tree., error: None)

For test case:

  - input: The dog chased the cat up the tree, who ran up the tree?
  - actual output: It depends, some might consider the cat, while others might argue the dog.
  - expected output: The cat.
  - context: None
  - retrieval context: None


Overall Metric Pass Rates

correctness (GEval): 0.00% pass rate
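
The fractional score is consistent with G-Eval's probability-weighted scoring, in which the judge's candidate scores are averaged by the probability the model assigns to each. A sketch of that calculation follows. The 0-10 scale, the `weighted_geval_score` helper, and the probabilities are all assumptions for illustration; they are not values from this run and do not reproduce its score.

```python
def weighted_geval_score(score_probs, max_score=10):
    """Probability-weighted average of candidate scores, scaled to 0-1.

    score_probs maps each candidate score to the probability the judge
    assigned to it (hypothetical inputs in this sketch).
    """
    total_prob = sum(score_probs.values())
    if total_prob <= 0:
        raise ValueError("probabilities must sum to a positive value")
    expected = sum(score * prob for score, prob in score_probs.items()) / total_prob
    return expected / max_score


# Made-up distribution: the judge leans towards a low score.
example = weighted_geval_score({2: 0.7, 3: 0.2, 4: 0.1})
print(round(example, 3))
```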







EvaluationResult(test_results=[TestResult(name='test_case_0', success=False, metrics_data=[MetricData(name='correctness (GEval)', threshold=0.5, success=False, score=0.23650124897758448, reason='The actual output introduces ambiguity by suggesting both the cat and the dog, while the expected output clearly states the cat ran up the tree.', strict_mode=False, evaluation_model='gpt-4o', error=None, evaluation_cost=0.00225, verbose_logs='Criteria:\nDetermine whether the actual output is factually correct based on the expected output. \n \nEvaluation Steps:\n[\n    "Compare the factual details in the actual output against the expected output.",\n    "Identify any discrepancies between the actual output and the expected output.",\n    "Evaluate whether the actual output provides accurate information as specified in the expected output.",\n    "Determine if the actual output meets the factual correctness criteria based on the expected output."\n]')], conversational=False, multimodal=False, i