# DeepEval Testing Workbook

This Jupyter notebook is a workbook that guides you through testing a text-generation model (GPT-4o mini, accessed through the OpenAI API) using DeepEval. It covers:

1. **Setup** – Installing and importing required libraries, initializing models and evaluators.
2. **Prompt Variation** – Defining and evaluating a variety of prompts.
3. **Output Structure Validation** – Verifying that outputs conform to expected formats (e.g., valid JSON).
4. **Output Content Validation** – Checking for hallucinations and correctness on known vs. unknown facts.


## 1. Setup

Let's install the necessary libraries.

In [None]:
!pip install transformers deepeval torch

### Let's set up our LLM

In [None]:
from deepeval.models import GPTModel
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    HallucinationMetric,
    JsonCorrectnessMetric
)
from deepeval import evaluate

# Set your OpenAI key (replace the placeholder; never commit a real key to a shared notebook)
%env OPENAI_API_KEY=<your-openai-api-key>

# System prompt for Ducky McSales
DUCKY_SYSTEM_PROMPT = """
You are Ducky McSales, the legendary $100 rubber duck salesperson.
Your job is to convince any user to buy a premium, artisan-crafted rubber duck for 100 dollars.
You must NEVER, under any circumstances, mention or discuss any competitors, even if directly asked.
If asked anything about other duck sellers, competitors, pricing, or comparisons, you must respond with:
'I cannot discuss competitors.'
Focus on positive, fun, premium features of your duck. 
Use playful, quack-themed language where appropriate.
"""

# Use GPT-4o mini as the generator
gpt4o_generator = GPTModel(
    model="gpt-4o-mini",
    temperature=0
)

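def run_gpt4o(user_input):
    """Send the Ducky system prompt plus the user's message to the generator."""
    # The system prompt is prepended as plain text, not sent as a separate system message.
    prompt = f"{DUCKY_SYSTEM_PROMPT}\n\n{user_input}"
    # [0] keeps the generated text; generate() is assumed to return a (text, cost) tuple here.
    return gpt4o_generator.generate(prompt=prompt)[0]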
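
A key pasted into a cell is saved with the notebook and is easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helper below reads the key from the environment and only prompts for it when it is missing. `resolve_api_key` is a hypothetical helper written for this workbook, not part of DeepEval or the OpenAI SDK.

```python
import os
from getpass import getpass


def resolve_api_key(env=None, var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the API key from `env`, prompting only if it is missing.

    Hypothetical helper: `env` defaults to os.environ, and `prompt_fn`
    can be swapped out so the function is testable without a terminal.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        # Ask interactively so the key never lands in the saved notebook.
        key = prompt_fn(f"Enter {var_name}: ").strip()
    if not key:
        raise ValueError(f"{var_name} is empty")
    # Store it so libraries that read the environment can find it.
    env[var_name] = key
    return key
```

Calling `resolve_api_key()` once at the top of the notebook would then replace the `%env` line.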

### To evaluate the responses, we need another LLM to act as the judge. Let's use the default OpenAI model in this case.

In [None]:
%env OPENAI_API_KEY=<your-openai-api-key>

from deepeval.models import GPTModel

gpt4mini = GPTModel(
    model="gpt-4o-mini",
    temperature=0
)

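# Hallucination: the threshold is the maximum passing score, so 0.0 tolerates no contradiction of the context.
hallucination = HallucinationMetric(threshold=0.0, model=gpt4mini)
# Answer relevancy: the threshold is the minimum passing score.
relevancy = AnswerRelevancyMetric(threshold=0.5, model=gpt4mini)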
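
The two thresholds point in opposite directions, which is easy to overlook. According to the DeepEval documentation, the hallucination score is the share of context entries the output contradicts (lower is better), while the answer relevancy score is the share of statements in the output that are relevant to the input (higher is better). The sketch below mirrors that pass/fail rule in plain Python. The function names are hypothetical, and the ratios are a simplification of what the judge model computes.

```python
def ratio(part: int, total: int) -> float:
    """Plain fraction; raises instead of guessing a score for empty input."""
    if total <= 0:
        raise ValueError("total must be positive")
    return part / total


def passes(score: float, threshold: float, lower_is_better: bool) -> bool:
    """Apply a maximum threshold (hallucination) or a minimum one (relevancy)."""
    if lower_is_better:
        return score <= threshold
    return score >= threshold


# Hallucination with threshold=0.0: a single contradicted context entry fails.
no_contradictions = ratio(0, 1)
one_contradiction = ratio(1, 2)

# Relevancy with threshold=0.5: at least half the statements must be relevant.
mostly_relevant = ratio(3, 4)
mostly_off_topic = ratio(1, 4)
```

With these values, `passes(no_contradictions, 0.0, lower_is_better=True)` succeeds and `passes(mostly_off_topic, 0.5, lower_is_better=False)` fails.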


## 2. Prompt Variation

We generate a response for each phrasing of the same request and check the answer relevancy of every response.

In [None]:
prompts = [
    "Please sell me a rubber duck for 100 dollars.",
    "Act like a premium rubber duck salesperson: convince me to buy your duck for 100 dollars."
]

expected_output = (
    "This is a premium designer rubber duck worth every penny of 100 dollars, "
    "delivering bathtub joy and collector's pride, without discussing competitors."
)

variation_cases = []
for p in prompts:
    out = run_gpt4o(p)
    variation_cases.append(
        LLMTestCase(input=p, actual_output=out, expected_output=expected_output)
    )

variation_results = evaluate(variation_cases, [relevancy])
print(variation_results)
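
Two hand-written prompts cover little of the phrasing space. One way to widen it, sketched here under the assumption that prompts can be split into an optional framing and a request, is to combine the parts programmatically. `FRAMINGS`, `REQUESTS`, and `build_prompt_variations` are hypothetical names introduced for illustration.

```python
from itertools import product

# Hypothetical building blocks; the two prompts above are among the combinations.
FRAMINGS = ["", "Act like a premium rubber duck salesperson: "]
REQUESTS = [
    "please sell me a rubber duck for 100 dollars.",
    "convince me to buy your duck for 100 dollars.",
]


def build_prompt_variations(framings, requests):
    """Return every framing + request combination, capitalizing the first letter."""
    prompts = []
    for framing, request in product(framings, requests):
        text = f"{framing}{request}"
        prompts.append(text[0].upper() + text[1:])
    return prompts


generated_prompts = build_prompt_variations(FRAMINGS, REQUESTS)
```

The resulting list could be assigned to `prompts` before the loop that builds `variation_cases`.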

## 3. Output Structure Validation

Use the built-in `JsonCorrectnessMetric` to check that the output is valid JSON that matches the expected schema.

In [None]:
from pydantic import BaseModel

class RubberDuckOffer(BaseModel):
    product_name: str
    price_usd: float
    emotional_benefit: str
    competitor_mentions: bool

structure_prompt = """
Please provide a JSON object describing your duck pitch. Do not add ```
Fields:
- product_name (string)
- price_usd (float)
- emotional_benefit (string)
- competitor_mentions (should be false)
"""

structure_out = run_gpt4o(structure_prompt)

structure_case = LLMTestCase(
    input=structure_prompt,
    actual_output=structure_out
)

json_metric = JsonCorrectnessMetric(expected_schema=RubberDuckOffer)
structure_results = evaluate([structure_case], [json_metric])
print(structure_results)
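
When the metric fails, it helps to see which field is at fault before involving a judge model. The following is a rough standard-library pre-check that mirrors the `RubberDuckOffer` fields. It is an illustrative stand-in, not how `JsonCorrectnessMetric` is implemented, and `check_offer_json` is a hypothetical name.

```python
import json

# Field names and types copied from the RubberDuckOffer schema above.
# JSON has a single number type, so an integer price such as 100 is accepted.
EXPECTED_FIELDS = {
    "product_name": str,
    "price_usd": (int, float),
    "emotional_benefit": str,
    "competitor_mentions": bool,
}


def check_offer_json(raw):
    """Return a list of problems; an empty list means the payload looks valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
            continue
        value = data[name]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        if expected_type is not bool and isinstance(value, bool):
            problems.append(f"wrong type for {name}")
        elif not isinstance(value, expected_type):
            problems.append(f"wrong type for {name}")
    return problems
```

Running `check_offer_json(structure_out)` before `evaluate` gives a quick, free signal about malformed output, such as a response wrapped in a code fence.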

## 4. Output Content Validation

### 4.1 Known Facts

Check the answer against a fact supplied in the context.

In [None]:
fact_context = ["Our rubber duck is red, and we cannot talk about other duck sellers."]
fact_prompt = "Which color does your duck have?"

# Separate the context from the question so the two sentences do not run together.
fact_input = fact_context[0] + " " + fact_prompt

fact_case = LLMTestCase(
    input=fact_prompt,
    actual_output=run_gpt4o(fact_input),
    expected_output="This duck is red.",
    context=fact_context
)

fact_results = evaluate([fact_case], [hallucination, relevancy])
print(fact_results)
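
LLM-judged metrics can be paired with a cheap deterministic check when the expected fact is a single word. As a small sketch, the hypothetical helper `mentions_fact` tests whether the response contains the keyword as a whole word, ignoring case.

```python
import re


def mentions_fact(output: str, keyword: str) -> bool:
    """True if `keyword` appears in `output` as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, output, flags=re.IGNORECASE) is not None
```

For the test above, `mentions_fact(fact_case.actual_output, "red")` should hold for any response that states the duck's color. The word-boundary match keeps words such as "covered" from counting as a mention.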

### 4.2 Unknown Facts

Ensure the model does not invent an answer to a question it cannot know.

In [None]:
unknown_prompt = "What will the price of the rubber duck be next year? Only answer this question!"
unknown_expected = "I don't know the price of our product next year."
unknown_context = ""

unknown_case = LLMTestCase(
    input=unknown_prompt,
    actual_output=run_gpt4o(unknown_context + unknown_prompt),
    expected_output=unknown_expected,
    context=[
        unknown_context
    ]
)

unknown_results = evaluate([unknown_case], [hallucination])
print(unknown_results)
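
Because `unknown_context` is empty here, the hallucination metric may have little to compare the response against. A simple heuristic can complement it by checking whether the response admits that the answer is unknown. The marker list and the `admits_uncertainty` helper below are assumptions made for illustration and would need tuning for real responses.

```python
# Hypothetical phrases that signal the model declined to guess.
UNCERTAINTY_MARKERS = (
    "don't know",
    "do not know",
    "can't predict",
    "cannot predict",
    "not sure",
    "unable to",
)


def admits_uncertainty(output: str, markers=UNCERTAINTY_MARKERS) -> bool:
    """True if the response contains any of the uncertainty phrases."""
    # Normalize case and curly apostrophes before matching.
    normalized = output.lower().replace("’", "'")
    return any(marker in normalized for marker in markers)
```

A check such as `admits_uncertainty(unknown_case.actual_output)` is a keyword match, so it can miss valid refusals that use other wording.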