# Evaluating RAG

In this notebook, we explore evaluation techniques for Retrieval-Augmented Generation (RAG) applications. In particular, we want a set of metrics for evaluating the RAG functionality of CMS CHAT. Rather than relying on a subjective "the answer feels right/wrong", we want to generate a score (or rather a set of scores) that quantifies our RAG's quality.

But what is RAG quality? Let's dive in.

First, let's define RAG. In a nutshell, RAG is a method for supplementing an LLM with extra context so that it generates tailored outputs. Documents relevant to the query are retrieved and added to the prompt sent to the base LLM.
Here is a typical RAG architecture:

<img src="./img/rag-system.jpg" alt="alt text" width="600"/>

There are several moving parts in this flow, and we need to measure the quality of each step. Is our retriever accurate? Is our LLM generating factual answers based on the retrieved documents? This notebook looks at two metrics: answer relevancy and faithfulness.
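
To make the retriever question concrete, here is a minimal sketch of two common retrieval scores, precision@k and recall@k. The function names and document IDs are hypothetical and are not part of CMS CHAT or deepeval; the sketch assumes we have a hand-labelled set of relevant document IDs for each query.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are labelled relevant."""
    top_k = list(retrieved)[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Divide by the number of documents actually returned (at most k).
    return hits / len(top_k)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear among the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    top_k = set(list(retrieved)[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical retriever output and ground-truth labels for a single query.
retrieved_ids = ["doc_3", "doc_7", "doc_1", "doc_9"]
relevant_ids = {"doc_1", "doc_3", "doc_5"}

print(precision_at_k(retrieved_ids, relevant_ids, k=3))
print(recall_at_k(retrieved_ids, relevant_ids, k=3))
```

Scores like these only cover the retrieval step; the generation step needs separate metrics.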


In [None]:
!pip install deepeval
!pip install python-dotenv
!pip install datasets
!pip install instructor
!pip install "anthropic[bedrock]"

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [None]:
from datasets import load_dataset
ds = load_dataset("rajpurkar/squad_v2")

In [None]:

ds.set_format(type='pandas')
ds['train'][:].to_csv(os.environ["SQUAD_DATASET_CSV_PATH"],index=False)
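
Each SQuAD v2 record carries a `question`, a `context` passage, and an `answers` dictionary whose `text` list is empty for unanswerable questions. As a rough sketch, a record could be mapped to the fields an evaluation test case needs as shown below. `squad_row_to_case` is a hypothetical helper, the example row is hand-written rather than read from the dataset, and the sketch assumes raw record dictionaries (before the pandas format is applied).

```python
from typing import Any, Dict, Optional


def squad_row_to_case(row: Dict[str, Any]) -> Dict[str, Any]:
    """Map one SQuAD v2 record to input / retrieval_context / expected_output."""
    answer_texts = list(row["answers"]["text"])
    # SQuAD v2 marks unanswerable questions with an empty answer list.
    expected: Optional[str] = answer_texts[0] if answer_texts else None
    return {
        "input": row["question"],
        "retrieval_context": [row["context"]],
        "expected_output": expected,
    }


# Hand-written row in the SQuAD v2 layout (illustrative only).
example_row = {
    "question": "What kind of device is the iPod?",
    "context": "The iPod is a line of portable media players designed by Apple Inc.",
    "answers": {"text": ["portable media players"], "answer_start": [22]},
}
unanswerable_row = {
    "question": "Who invented the iPod in 1950?",
    "context": example_row["context"],
    "answers": {"text": [], "answer_start": []},
}

case = squad_row_to_case(example_row)
no_answer_case = squad_row_to_case(unanswerable_row)
print(case)
print(no_answer_case["expected_output"])
```

The resulting dictionaries use the same field names as the `LLMTestCase` arguments that appear later in this notebook, so they could be unpacked into test cases once the application's actual output is added.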


In [1]:
from anthropic import AnthropicBedrock

client = AnthropicBedrock()

message = client.messages.create(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello, world"}]
)
print(message.content)

[TextBlock(text="Hello! I'm Claude, an AI assistant created by Anthropic.", type='text')]


In [2]:
import instructor
from anthropic import AnthropicBedrock
from deepeval.models import DeepEvalBaseLLM
from pydantic import BaseModel

class AWSBedrock(DeepEvalBaseLLM):
    """Custom deepeval evaluation model backed by Claude on AWS Bedrock."""

    def __init__(self):
        self.model = AnthropicBedrock()

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        chat_model = self.load_model()
        # instructor wraps the client so the response is parsed into `schema`.
        instructor_client = instructor.from_anthropic(chat_model)
        response = instructor_client.messages.create(
            model="anthropic.claude-3-sonnet-20240229-v1:0",
            max_tokens=1024,
            # This system prompt matches the extraction demo in the next cell.
            system="You are a world class AI that excels at extracting user data from a sentence",
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=schema,
        )
        return response

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        # Delegates to the synchronous call, so it blocks while waiting for the model.
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "AWS Bedrock Claude Sonnet 3.0"
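
`a_generate` above calls the synchronous `generate` directly, so each call occupies the event loop until the model responds. One possible alternative, sketched here with a stand-in function rather than a real Bedrock call, is to push the blocking call onto a worker thread with `asyncio.to_thread` (Python 3.9+). `slow_generate` and `a_slow_generate` are hypothetical names, and whether this suits the Bedrock client is an assumption that would need checking.

```python
import asyncio
import time
from typing import List


def slow_generate(prompt: str) -> str:
    """Stand-in for a blocking model call (hypothetical, no network access)."""
    time.sleep(0.1)
    return f"echo: {prompt}"


async def a_slow_generate(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(slow_generate, prompt)


async def main() -> List[str]:
    # Both calls are awaited together; gather preserves the argument order.
    return await asyncio.gather(
        a_slow_generate("first"),
        a_slow_generate("second"),
    )


results = asyncio.run(main())
print(results)
```

Inside a notebook, where an event loop is already running, `await main()` would replace the `asyncio.run(main())` line.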

In [4]:
class UserInfo(BaseModel):
    name: str
    age: int

custom_llm = AWSBedrock()
print(custom_llm.generate("John Doe is 30 years old",UserInfo))

name='John Doe' age=30


In [4]:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase,LLMTestCaseParams

custom_llm = AWSBedrock()

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output of your LLM application
    actual_output="We offer a 30-day full refund at no extra cost.",
    expected_output="You're eligible for a free full refund within 30 days of purchase.",
)

answer_relevancy_metric = AnswerRelevancyMetric(model=custom_llm,threshold=0.7)

answer_relevancy_metric.measure(test_case)

print(answer_relevancy_metric.score)
print(answer_relevancy_metric.reason)

1.0
The score is 1.00 because there were no irrelevant statements made in the actual output in response to the input "What if these shoes don't fit?". The output directly and concisely addresses the question asked, without any extraneous or irrelevant information. Well done!
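
The reason string above suggests the score reflects how many statements in the output were judged relevant. As a rough sketch, assuming the score is simply the fraction of relevant statements, the arithmetic and the threshold check look like this. The verdict list is made up, and `relevancy_score` is a hypothetical helper, not a deepeval function.

```python
from typing import List


def relevancy_score(verdicts: List[bool]) -> float:
    """Fraction of statements judged relevant (True) out of all statements."""
    if not verdicts:
        # Nothing to score; returning 0.0 is a choice made for this sketch.
        return 0.0
    return sum(verdicts) / len(verdicts)


# Made-up verdicts: three of four statements were judged relevant.
example_verdicts = [True, True, False, True]
threshold = 0.7  # same threshold as the metric configured above

score = relevancy_score(example_verdicts)
passed = score >= threshold
print(score, passed)
```

With one irrelevant statement out of four, the score drops to 0.75 but still clears the 0.7 threshold; a second irrelevant statement would bring it to 0.5 and fail the check.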


In [5]:
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric

faithfulness_metric = FaithfulnessMetric(
    threshold = 0.7,
    model = custom_llm,
    include_reason = True
)

test_case_2 = LLMTestCase(
    input = "What kind of device is the iPod?",
    retrieval_context = ['''The iPod is a line of portable media players and multi-purpose pocket computers designed and marketed by Apple Inc. The first line was released on October 23, 2001, about 8½ months after iTunes (Macintosh version) was released. The most recent iPod redesigns were announced on July 15, 2015. There are three current versions of the iPod: the ultra-compact iPod Shuffle, the compact iPod Nano and the touchscreen iPod Touch.'''],
    actual_output = "portable media players"
)

evaluate([test_case_2],[faithfulness_metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...



Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:21, 21.59s/test case]



Metrics Summary

  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: AWS Bedrock Claude Sonnet 3.0, reason: Since there are no contradictions listed, the faithfulness score of 1.00 seems appropriate. The actual output appears to align perfectly with the retrieval context. Well done! Keep up the great work., error: None)

For test case:

  - input: What kind of device is the iPod?
  - actual output: portable media players
  - expected output: None
  - context: None
  - retrieval context: ['The iPod is a line of portable media players and multi-purpose pocket computers designed and marketed by Apple Inc. The first line was released on October 23, 2001, about 8½ months after iTunes (Macintosh version) was released. The most recent iPod redesigns were announced on July 15, 2015. There are three current versions of the iPod: the ultra-compact iPod Shuffle, the compact iPod Nano and the touchscreen iPod Touch.']


Overall Metric Pass Rates

Faithfulness: 100.00




EvaluationResult(test_results=[TestResult(success=True, metrics_data=[MetricData(name='Faithfulness', threshold=0.7, success=True, score=1.0, reason='Since there are no contradictions listed, the faithfulness score of 1.00 seems appropriate. The actual output appears to align perfectly with the retrieval context. Well done! Keep up the great work.', strict_mode=False, evaluation_model='AWS Bedrock Claude Sonnet 3.0', error=None, evaluation_cost=None, verbose_logs='Truths (limit=None):\n[\n    "The iPod is a line of portable media players and multi-purpose pocket computers.",\n    "The iPod was designed and marketed by Apple Inc.",\n    "The first iPod line was released on October 23, 2001.",\n    "The iPod release was about 8½ months after the release of the Macintosh version of iTunes.",\n    "The most recent iPod redesigns were announced on July 15, 2015.",\n    "The three current versions of the iPod are the iPod Shuffle, the iPod Nano and the iPod Touch.",\n    "The iPod Shuffle is