# Summarization evaluation

If your use case deals with creating summaries, you must ensure that your GenAI app produces "good" summaries:
- Summaries that are factually aligned with the original text
- Summaries that include important information from the original text

We want to quantify how good a generated summary is. Using the **Question Answer Generation (QAG) Score** covered previously, we can calculate a factual alignment score and an inclusion score, and combine them into a final summarization score. The inclusion score is the percentage of assessment questions for which both the summary and the original document answer 'yes'. Checking both sides ensures that the summary not only includes key information from the original text but also represents it accurately. A higher inclusion score indicates a more comprehensive and faithful summary, one that captures the crucial points and details of the original content.
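To make the inclusion score concrete, here is a minimal sketch in plain Python. It assumes the judge's answers to each assessment question have already been collected as lists of `'yes'` / `'no'` / `'idk'` strings. The function names `inclusion_score` and `summarization_score` are hypothetical, and combining the two scores with `min` is an assumption made for illustration, since this section does not spell out how the final score is formed.

```python
def inclusion_score(original_answers, summary_answers):
    """Fraction of assessment questions answered 'yes' for both texts."""
    if len(original_answers) != len(summary_answers):
        raise ValueError("Both answer lists must cover the same questions")
    if not original_answers:
        return 0.0
    both_yes = sum(
        1
        for orig, summ in zip(original_answers, summary_answers)
        if orig == "yes" and summ == "yes"
    )
    return both_yes / len(original_answers)


def summarization_score(alignment, inclusion):
    # Assumption: the weaker of the two scores decides the final result,
    # so a summary has to do well on both alignment and inclusion.
    return min(alignment, inclusion)


# Hypothetical judge answers for four assessment questions
original_answers = ["yes", "yes", "yes", "yes"]
summary_answers = ["yes", "no", "yes", "idk"]

inclusion = inclusion_score(original_answers, summary_answers)
print(inclusion)  # 0.5
print(summarization_score(alignment=0.9, inclusion=inclusion))  # 0.5
```

Here two of the four questions are answered 'yes' for both texts, so the inclusion score is 0.5, and under the `min` assumption it also caps the final score.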

We will use Anthropic's Claude 3 Sonnet model on Amazon Bedrock as our *LLM-as-a-judge*. Let's prepare and define the necessary parts.

In [None]:
!pip install deepeval
!pip install python-dotenv
!pip install instructor
!pip install "anthropic[bedrock]"

Store all API keys and credentials in a `.env` file, then load them:

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

True
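Since `load_dotenv()` returns `True` whenever it finds a `.env` file, regardless of what the file contains, a quick check that the expected variables are actually set can save a confusing authentication error later. The sketch below is one way to do that. The helper `missing_env_vars` is hypothetical, and the variable names are the standard AWS environment variables, which we assume are the ones your `.env` file defines; adjust the list if you authenticate differently (for example with a named profile).

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Assumption: credentials are supplied through these standard AWS variables
REQUIRED_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All expected environment variables are set")
```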

Now let's make sure that we are authenticated and can call the Amazon Bedrock service.

In [2]:
from anthropic import AnthropicBedrock

client = AnthropicBedrock()

message = client.messages.create(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    max_tokens=1024,
    messages=[
        {
            "role": "user", 
            "content": "Hey, how are you?"
        }
    ]
)
print(message.content)

[TextBlock(text="Hello! As an AI language model, I don't have feelings or emotions, but I'm operating properly and ready to assist you with any questions or tasks you may have. How can I help you today?", type='text')]


To use Claude 3 Sonnet on Amazon Bedrock as the judge LLM within the `deepeval` framework, we need to implement a custom LLM class. Because the class is custom, we also need to ensure that it responds in a properly structured JSON format. We will use the [`instructor`](https://python.useinstructor.com/) Python library to enforce structured LLM output.

In [3]:
import instructor
from anthropic import AnthropicBedrock
from deepeval.models import DeepEvalBaseLLM
from pydantic import BaseModel

class AWSBedrock(DeepEvalBaseLLM):
    """Custom deepeval judge backed by Claude 3 Sonnet on Amazon Bedrock."""

    def __init__(self):
        # Credentials and region are picked up from the environment loaded earlier
        self.model = AnthropicBedrock()

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: type[BaseModel]) -> BaseModel:
        chat_model = self.load_model()
        # instructor wraps the client so the reply is parsed into `schema`
        instructor_client = instructor.from_anthropic(chat_model)
        response = instructor_client.messages.create(
            model="anthropic.claude-3-sonnet-20240229-v1:0",
            max_tokens=1024,
            system="You are a world class AI that excels at extracting data from a sentence",
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=schema,
        )
        return response

    async def a_generate(self, prompt: str, schema: type[BaseModel]) -> BaseModel:
        # Reuses the synchronous call, so it blocks the event loop while waiting
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "AWS Bedrock Claude Sonnet 3.0"
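To illustrate why structured output matters for a judge model, the sketch below shows the kind of check that a schema performs on a raw reply. It is a standard-library illustration of the idea only, not how `instructor` or `deepeval` are implemented: the `Verdict` dataclass, the `parse_verdict` helper, and the `verdict` / `reason` field names are all hypothetical. In the class above, the same role is played by the Pydantic `schema` passed as `response_model`.

```python
import json
from dataclasses import dataclass

ALLOWED_VERDICTS = {"yes", "no", "idk"}


@dataclass
class Verdict:
    verdict: str
    reason: str


def parse_verdict(raw: str) -> Verdict:
    """Parse a judge reply and fail loudly if it does not match the schema."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    missing = {"verdict", "reason"} - data.keys()
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    if data["verdict"] not in ALLOWED_VERDICTS:
        raise ValueError(f"Unexpected verdict: {data['verdict']!r}")
    return Verdict(verdict=data["verdict"], reason=str(data["reason"]))


# A well-formed reply parses cleanly
print(parse_verdict('{"verdict": "no", "reason": "Not mentioned in the summary"}'))

# A free-text reply is rejected instead of silently skewing the score
try:
    parse_verdict("I think the answer is probably yes.")
except ValueError as exc:
    print("Rejected:", exc)
```

Without a check like this, a free-text reply from the judge would either crash the metric or be misread, which is why the custom class delegates parsing to a schema-aware client.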



In [6]:
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
from deepeval import evaluate

custom_llm = AWSBedrock()

test_case = LLMTestCase(
    input="Some long and boring paragraph that needs to be summarized by the LLM for the purpsoes of a test. Let's create a negative example where Input text is not summarized correctly",
    actual_output="A completely wrong summary",
)

summarization_metric = SummarizationMetric(
    model=custom_llm,
    threshold=0.7,
    include_reason=True
)

evaluate([test_case], [summarization_metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:26, 26.30s/test case]



Metrics Summary

  - ❌ Summarization (score: 0.0, threshold: 0.7, strict: False, evaluation model: AWS Bedrock Claude Sonnet 3.0, reason: The score is 0.00 because there is no original text provided to summarize from, so the summary cannot be evaluated for accuracy or completeness. However, since no contradictions or extra information are listed, the summary has not introduced any obvious errors. The inability to answer certain questions likely stems from the lack of content in the original text rather than a flaw in the summary itself., error: None)

For test case:

  - input: Some long and boring paragraph that needs to be summarized by the LLM for the purpsoes of a test. Let's create a negative example where Input text is not summarized correctly
  - actual output: A completely wrong summary
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass Rates

Summarization: 0.00% pass rate







EvaluationResult(test_results=[TestResult(success=False, metrics_data=[MetricData(name='Summarization', threshold=0.7, success=False, score=0.0, reason='The score is 0.00 because there is no original text provided to summarize from, so the summary cannot be evaluated for accuracy or completeness. However, since no contradictions or extra information are listed, the summary has not introduced any obvious errors. The inability to answer certain questions likely stems from the lack of content in the original text rather than a flaw in the summary itself.', strict_mode=False, evaluation_model='AWS Bedrock Claude Sonnet 3.0', error=None, evaluation_cost=None, verbose_logs='Truths (limit=None):\n[\n    "The given text does not contain any factual statements."\n] \n \nClaims:\n[\n    "A completely wrong summary"\n] \n \nAssessment Questions:\n[\n    "Does the given text contain the phrase \'long and boring paragraph\'?",\n    "Is the text intended to be summarized by an LLM?",\n    "Does the 

---
As we can see, our LLM-as-a-judge assigns the expected score of 0.0 to the deliberately wrong summary. This is below the 0.7 threshold, so the test case fails as intended.