# Deep Eval Testing Workbook

This Jupyter notebook is structured as a workbook to guide you through testing a text-generation model (GPT-2) using deep-eval. It covers:

1. **Setup** – Installing and importing required libraries, initializing models and evaluators.
2. **Prompt Variation** – Defining and evaluating a variety of prompts.
3. **Output Structure Validation** – Verifying that outputs conform to expected formats (e.g., valid JSON).
4. **Output Content Validation** – Checking for hallucinations and correctness on known vs. unknown facts.


## 1. Setup

Let's begin by setting up our environment. We need to install the necessary libraries for deep-eval and transformers.
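Before installing anything, it can help to see which of the three packages are already importable. The sketch below is one way to check, using only the standard library; the helper name `missing_packages` is a hypothetical name introduced here, not part of any of these libraries.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The three libraries this workbook relies on.
required = ["transformers", "deepeval", "torch"]
missing = missing_packages(required)
if missing:
    print("Install these first:", " ".join(missing))
else:
    print("All required packages are available.")
```

Anything the check reports as missing still has to be installed with pip in the cell below.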

In [None]:
# TODO: Install the required libraries: transformers, deepeval, and torch
# Use pip to install them in the cell below.

### Let's set up our LLM (GPT-2)

We will use GPT-2 as the language model to test. We need to import the necessary classes from the `transformers` library and initialize the pipeline. Research how to set a seed for reproducibility in the `transformers` pipeline.

In [None]:
from transformers import pipeline, set_seed
# TODO: Import LLMTestCase and the necessary metrics (AnswerRelevancyMetric, HallucinationMetric, JsonCorrectnessMetric) from deepeval
# TODO: Import the evaluate function from deepeval

# Initialize model
generator = pipeline('text-generation', model='gpt2')
# TODO: Set the seed for reproducibility (e.g., 42)

### To evaluate the responses, we need another LLM. Let's use the default OpenAI model in this case.

Deep-eval uses another LLM to evaluate the output of the model being tested. We will use a GPT model for this. You will need to set your OpenAI API key as an environment variable. Research how to set environment variables in a Jupyter notebook.

We will then initialize the GPT model and the `HallucinationMetric` and `AnswerRelevancyMetric` from deep-eval.
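One way to avoid pasting a key into the notebook itself is to read it from the environment and prompt for it only when it is missing. The following is a minimal sketch under that assumption; `ensure_api_key` and its `prompt_fn` parameter are hypothetical names introduced for illustration.

```python
import os
from getpass import getpass

def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting once if it is not set."""
    key = os.environ.get(var_name)
    if not key:
        # getpass hides the typed value, so the key never appears in cell output.
        key = prompt_fn(f"Enter {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} must not be empty")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` in a cell before the evaluator model is created keeps the secret out of the saved notebook file.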

In [None]:
# Replace the placeholder with your own key; never commit a real key to a shared notebook.
%env OPENAI_API_KEY=YOUR_OPENAI_API_KEY

from deepeval.models import GPTModel

gpt4mini = GPTModel(
    model="gpt-4o-mini",
    temperature=0
)

# TODO: Initialize the HallucinationMetric with a threshold of 0.0 and the gpt4mini model; assign it to `hallucination`
# TODO: Initialize the AnswerRelevancyMetric with a threshold of 0.5 and the gpt4mini model; assign it to `relevancy`

## 2. Prompt Variation

We will now explore how different prompts affect the generated output and evaluate the answer relevancy using deep-eval. 

Below are a few example prompts and an expected output. Your task is to:
1. Generate output for each prompt using the GPT-2 model.
2. Create `LLMTestCase` objects for each prompt, including the input, generated output, and the `expected_output`.
3. Use the `evaluate` function with the `relevancy` metric to assess how relevant each generated answer is to its prompt.
4. Print the evaluation results.
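The first step depends on the shape of what a text-generation pipeline returns: a list with one dictionary per returned sequence, each holding a `generated_text` string that by default begins with the prompt. The sketch below wraps that lookup in a hypothetical helper, `generate_text`, and exercises it with `fake_generator`, a stand-in that only mimics the return shape so the example runs without downloading GPT-2.

```python
def generate_text(generator, prompt, max_new_tokens=40):
    """Call a text-generation callable and return the first generated string."""
    outputs = generator(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1)
    return outputs[0]["generated_text"]

def fake_generator(prompt, **kwargs):
    # Hypothetical stand-in: echoes the prompt and appends a fixed continuation.
    return [{"generated_text": prompt + "a long-term shift in temperatures."}]

print(generate_text(fake_generator, "A summary of climate change: "))
```

In the notebook, the `generator` pipeline created during setup would be passed in place of `fake_generator`.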

In [None]:
prompts = [
    "A summary of climate change: ",
    "In a single concise sentence, summarization of the key aspects of climate change:\n",
    "Causes and effects of climate change (1 sentence):\n",
]

expected_output = "Climate change is driven by the accumulation of greenhouse gases, such as carbon dioxide, in the atmosphere where these gases trap heat, leading to global warming, rising sea levels, and extreme weather."

# Create test cases
variation_cases = []
for p in prompts:
    # TODO: Generate output for the current prompt using the generator model. Limit the output length.
    out = None # Replace with your code
    # TODO: Append an LLMTestCase to variation_cases with the prompt as input, the generated output, and the expected_output.
    pass # Replace with your code

# TODO: Evaluate the variation_cases using the relevancy metric
variation_results = None # Replace with your code
# TODO: Print the variation_results

## 3. Output Structure Validation

It's often important to ensure that the model's output adheres to a specific structure, such as JSON. We can use the built-in `JsonCorrectnessMetric` in deep-eval to validate this.

Your task is to:
1. Define a Pydantic model that represents the expected JSON structure.
2. Generate output from the GPT-2 model using a prompt that asks for JSON output.
3. Create an `LLMTestCase` with the prompt as input and the generated output.
4. Initialize the `JsonCorrectnessMetric` with your defined expected schema.
5. Evaluate the test case using the `json_metric`.
6. Print the evaluation results.
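Because GPT-2 tends to echo the prompt and keep writing after any closing brace, the raw output is rarely a clean JSON document. As a quick local sanity check before running the metric, the sketch below pulls the first decodable JSON object out of a string using only the standard library. The helper names `extract_first_json_object` and `has_string_fields` are hypothetical, and this is not how `JsonCorrectnessMetric` works internally.

```python
import json

def extract_first_json_object(text):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None

def has_string_fields(obj, fields=("text", "label")):
    """Check that obj is a dict whose listed fields are all strings."""
    return isinstance(obj, dict) and all(isinstance(obj.get(f), str) for f in fields)

sample = "A JSON object with fields 'text' and 'label': {\"text\": \"great movie\", \"label\": \"positive\"} and so on"
print(has_string_fields(extract_first_json_object(sample)))
```

A result of `None` from the extractor is a hint that the metric is also likely to fail on that output.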

In [None]:
from pydantic import BaseModel
from typing import Any, Dict

structure_prompt = "A JSON object with fields 'text' and 'label': {"
# TODO: Generate output for the structure_prompt using the generator model. Limit the output length.
structure_out = None # Replace with your code


# TODO: Define a Pydantic model named ExpectedJsonStructure with fields 'text' (str) and 'label' (str).
# Research Pydantic models for defining data structures.
class ExpectedJsonStructure(BaseModel):
    pass # Replace with your code

structure_case = LLMTestCase(
    input=structure_prompt,
    actual_output=structure_out
)

# TODO: Initialize the JsonCorrectnessMetric with the expected_schema set to ExpectedJsonStructure.
json_metric = None # Replace with your code
# TODO: Evaluate the structure_case using the json_metric
structure_results = None # Replace with your code
# TODO: Print the structure_results

## 4. Output Content Validation

Now, let's validate the content of the model's output, specifically focusing on factual correctness and the absence of hallucinations.

### 4.1 Known Facts

We will check if the model can correctly answer questions based on provided context. Your task is to:
1. Define a context and a prompt related to a known fact.
2. Generate output from the GPT-2 model using the context and prompt.
3. Define the expected output for the known fact.
4. Create an `LLMTestCase` including the input, generated output, expected output, and the context. Research how to include context in an `LLMTestCase`.
5. Evaluate the test case using the `hallucination` and `relevancy` metrics.
6. Print the evaluation results.
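Two details in the steps above are easy to get wrong: the generated text usually starts with the full input, and the context must be passed as a list of strings. The sketch below shows one way to handle both. `strip_prompt` and `build_case_fields` are hypothetical helpers, and the returned dictionary is meant to be unpacked into `LLMTestCase(**fields)`, assuming the `input`, `actual_output`, `expected_output`, and `context` keyword arguments.

```python
def strip_prompt(generated, prompt):
    """Remove the echoed prompt from the start of the generated text."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    return generated.strip()

def build_case_fields(prompt, generated, full_input, expected, context):
    """Collect keyword arguments intended for LLMTestCase(**fields)."""
    return {
        "input": prompt,
        "actual_output": strip_prompt(generated, full_input),
        "expected_output": expected,
        "context": list(context),  # the context is a list of strings
    }

demo_context = ["The capital and largest city of France is Paris. "]
demo_prompt = "The capital of France is "
demo_input = demo_context[0] + demo_prompt
# Hypothetical model output: the full input followed by a continuation.
demo_generated = demo_input + "Paris, which is"
fields = build_case_fields(demo_prompt, demo_generated, demo_input, "Paris", demo_context)
print(fields["actual_output"])
```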

In [None]:
fact_context = ["France is a country located in Western Europe. The capital and largest city of France is Paris. Lyon is another major city in France. "]
fact_prompt = "The capital of France is "
fact_input = fact_context[0] + fact_prompt
# TODO: Generate output for the fact_input using the generator model. Limit the output length.
fact_out = None # Replace with your code

fact_expected = "Paris"

# TODO: Create an LLMTestCase with the fact_prompt as input, the generated output (removing the input part from the output), the fact_expected output, and the fact_context.
fact_case = None # Replace with your code

# TODO: Evaluate the fact_case using the hallucination and relevancy metrics.
fact_results = None # Replace with your code
# TODO: Print the fact_results

### 4.2 Unknown Facts

Finally, we will test if the model hallucinates when asked about information it could not possibly know (e.g., future events). Your task is to:
1. Define a prompt for an unknown fact.
2. Generate output from the GPT-2 model using this prompt.
3. Define an expected output that indicates the information is unknown or in the future.
4. Create an `LLMTestCase` with the prompt as input, the generated output (removing the input part from the output), the expected output, and a context indicating the information is unavailable. Research how context can influence hallucination evaluation.
5. Evaluate the test case using the `hallucination` and `relevancy` metrics.
6. Print the evaluation results.

In [None]:
unknown_prompt = "The name of the person who won the Nobel Prize in Physics in 3025 is"
# TODO: Generate output for the unknown_prompt using the generator model. Limit the output length.
unknown_out = None # Replace with your code

unknown_expected = "This date is in the future, so there is no way to tell."
# TODO: Create an LLMTestCase with the unknown_prompt as input, the generated output (removing the input part from the output), the unknown_expected output, and a context list containing the unknown_expected string.
unknown_case = None # Replace with your code

# TODO: Evaluate the unknown_case using the hallucination and relevancy metrics.
unknown_results = None # Replace with your code
# TODO: Print the unknown_results