# DeepEval Testing Workbook

This Jupyter notebook is structured as a workbook that guides you through testing a text-generation model (GPT-2) with DeepEval (the deepeval Python package). It covers:

1. **Setup** – Installing and importing required libraries, initializing models and evaluators.
2. **Prompt Variation** – Defining and evaluating a variety of prompts.
3. **Output Structure Validation** – Verifying that outputs conform to expected formats (e.g., valid JSON).
4. **Output Content Validation** – Checking for hallucinations and correctness on known vs. unknown facts.


## 1. Setup

Let's install the necessary libraries.

In [None]:
!pip install transformers deepeval torch

### Let's set up our LLM (GPT-2)

In [None]:
from transformers import pipeline, set_seed
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    HallucinationMetric,
    JsonCorrectnessMetric
)
from deepeval import evaluate

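# Initialize the model and fix the random seed so sampled outputs are reproducible
generator = pipeline('text-generation', model='gpt2')
set_seed(42)

def run_gpt2(prompt, answer_length=30):
    """Generate a continuation for `prompt` and return it without the prompt text."""
    # max_length counts the prompt tokens plus the newly generated tokens
    text = generator(prompt, max_length=answer_length)[0]['generated_text']
    return text.replace(prompt, '')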
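
One subtlety in `run_gpt2`: `str.replace` removes every occurrence of the prompt, not only the leading copy, so a completion that happens to repeat the prompt loses text. A minimal sketch of a prefix-only alternative follows; `strip_prompt` is a hypothetical helper name, not part of transformers or deepeval.

```python
def strip_prompt(prompt: str, generated: str) -> str:
    """Return the continuation only, removing the prompt if it is a leading prefix."""
    if generated.startswith(prompt):
        return generated[len(prompt):]
    # The model output did not start with the prompt; return it unchanged.
    return generated


# Made-up text standing in for a model output that repeats the prompt.
prompt = "The capital of France is "
full_text = "The capital of France is Paris. The capital of France is large."

print(strip_prompt(prompt, full_text))   # only the leading copy is removed
print(full_text.replace(prompt, ""))     # every copy is removed
```

If this behaviour is preferred, the helper could be applied to the `generated_text` value inside `run_gpt2` in place of `.replace`.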

### To evaluate the responses, we need a second LLM as a judge. Let's use OpenAI's gpt-4o-mini in this case.

In [None]:
# Set your own OpenAI API key here; do not commit a real key with the notebook
%env OPENAI_API_KEY=<your-openai-api-key>

from deepeval.models import GPTModel

gpt4mini = GPTModel(
    model="gpt-4o-mini",
    temperature=0
)

hallucination = HallucinationMetric(threshold=0.0, model=gpt4mini)
relevancy = AnswerRelevancyMetric(threshold=0.5, model=gpt4mini)
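
Since the judge model needs an OpenAI API key, it can help to fail early with a clear message when the key is missing, and to avoid printing it in full. The following is a small standard-library sketch; `require_api_key` and `mask` are hypothetical helper names introduced here for illustration.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in environment variable `name`, or raise if it is unset."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; define it before running the evaluation cells")
    return key


def mask(key: str) -> str:
    """Hide all but the last four characters so the key can be shown in logs."""
    if len(key) <= 4:
        return "*" * len(key)
    return "*" * (len(key) - 4) + key[-4:]


# Example with a made-up variable name and value, so no real key is involved.
os.environ["DEMO_API_KEY"] = "demo-1234"
print(mask(require_api_key("DEMO_API_KEY")))
```

Calling `require_api_key()` with no arguments before constructing the metrics would surface a missing key immediately rather than partway through an evaluation run.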


## 2. Prompt Variation

We generate completions for several phrasings of the same request and score each one with the answer relevancy metric.

In [None]:
prompts = [
    "A summary of climate change: ",
    "In a single concise sentence, a summary of the key aspects of climate change:\n",
    "Causes and effects of climate change (1 sentence):\n",
]

expected_output = "Climate change is driven by the accumulation of greenhouse gases, such as carbon dioxide, in the atmosphere where these gases trap heat, leading to global warming, rising sea levels, and extreme weather."

# Create test cases
variation_cases = []
for p in prompts:
    out = run_gpt2(p, 100)
    variation_cases.append(
        LLMTestCase(input=p, actual_output=out, expected_output=expected_output)
    )

# Evaluate with relevance metric
variation_results = evaluate(variation_cases, [relevancy])
print(variation_results)
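
Each relevancy score above costs a call to the judge model. As a rough local sanity check before spending those calls, one could measure how many words of the expected output appear in a completion. This is a swapped-in word-overlap heuristic, not deepeval's relevancy metric, and `keyword_overlap` is a hypothetical helper name.

```python
import re


def keyword_overlap(expected: str, actual: str) -> float:
    """Fraction of distinct words in `expected` that also occur in `actual`."""
    def words(text: str) -> set:
        return set(re.findall(r"[a-z]+", text.lower()))

    expected_words = words(expected)
    if not expected_words:
        return 0.0
    return len(expected_words & words(actual)) / len(expected_words)


# Made-up strings for illustration; in the notebook these would be
# expected_output and each generated completion.
print(keyword_overlap("greenhouse gases trap heat", "Gases trap heat."))
```

A very low overlap for every prompt variant may suggest the completions are off-topic, though the LLM-judged metric remains the actual evaluation.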

## 3. Output Structure Validation

Use the built-in JsonCorrectnessMetric to check that the output is valid JSON that matches an expected schema.

In [None]:
from pydantic import BaseModel

structure_prompt = "A JSON object with fields 'text' and 'label': {"
# The prompt ends with the opening brace, so re-attach it to the continuation;
# the instruction text itself must not be part of the string being validated.
structure_out = "{" + run_gpt2(structure_prompt, 100)

# Schema matching the two fields requested in the prompt
# (both are assumed to be strings here)
class ExpectedJsonStructure(BaseModel):
    text: str
    label: str

structure_case = LLMTestCase(
    input=structure_prompt,
    actual_output=structure_out
)

json_metric = JsonCorrectnessMetric(expected_schema=ExpectedJsonStructure)
structure_results = evaluate([structure_case], [json_metric])
print(structure_results)
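
To see locally why an output fails structure validation, a plain `json` check can list the problems before the metric runs. The sketch below uses only the standard library and assumes both requested fields, 'text' and 'label', should be strings; `check_json_fields` is a hypothetical helper, not a deepeval function.

```python
import json


def check_json_fields(raw: str, required: dict) -> list:
    """Return a list of problems found in `raw`; an empty list means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for field, expected_type in required.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems


required_fields = {"text": str, "label": str}

# Made-up candidate outputs for illustration.
print(check_json_fields('{"text": "hi", "label": "greeting"}', required_fields))
print(check_json_fields('{"text": "hi"}', required_fields))
print(check_json_fields('{"text": ', required_fields))
```

GPT-2 was not trained to follow formatting instructions, so failures here are likely; the list of problems makes it easier to tell truncated JSON from a missing field.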

## 4. Output Content Validation

### 4.1 Known Facts

Check against an expected answer.

In [None]:
fact_context = ["France is a country located in Western Europe. The capital and largest city of France is Paris. Lyon is another major city in France. "]
fact_prompt = "The capital of France is "
# Prepend the context so GPT-2 sees the supporting facts before the question
fact_input = fact_context[0] + fact_prompt

fact_case = LLMTestCase(
    input=fact_prompt,
    actual_output=run_gpt2(fact_input, 100),
    expected_output="Paris",
    context=fact_context
)

fact_results = evaluate([fact_case], [hallucination, relevancy])
print(fact_results)
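
The two metrics read their thresholds in opposite directions. As far as the deepeval documentation describes it, the hallucination threshold is a maximum (lower scores are better), while the answer relevancy threshold is a minimum (higher scores are better); this is worth confirming against the installed version. The sketch below only illustrates that pass/fail logic with made-up scores; `metric_passes` is a hypothetical helper, not a deepeval function.

```python
def metric_passes(score: float, threshold: float, lower_is_better: bool) -> bool:
    """Apply a threshold as a maximum (lower_is_better) or as a minimum."""
    if lower_is_better:
        return score <= threshold
    return score >= threshold


# Hallucination-style check with threshold=0.0: any contradiction fails.
print(metric_passes(0.0, 0.0, lower_is_better=True))
print(metric_passes(0.5, 0.0, lower_is_better=True))

# Relevancy-style check with threshold=0.5: at least half relevant passes.
print(metric_passes(0.7, 0.5, lower_is_better=False))
print(metric_passes(0.4, 0.5, lower_is_better=False))
```

Under this reading, the `threshold=0.0` used for the hallucination metric in the setup is the strictest possible setting.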

### 4.2 Unknown Facts

Check whether the model hallucinates an answer when asked a question that cannot be answered.

In [None]:
unknown_prompt = "The name of the person who won the Nobel Prize in Physics in 3025 is"
unknown_expected = "This date is in the future, no way to tell"

unknown_case = LLMTestCase(
    input=unknown_prompt,
    actual_output=run_gpt2(unknown_prompt),
    expected_output=unknown_expected,
    context=[unknown_expected]
)

unknown_results = evaluate([unknown_case], [hallucination, relevancy])
print(unknown_results)