# LLM Testing: From Manual to Automated

By the end of this notebook, you will be able to:

1. Understand why automated testing is essential for LLM applications
2. Convert manual tests (like those you did in ChatGPT) into automated Python tests
3. Use pytest to organize and run test suites
4. Write parameterized tests to test multiple scenarios efficiently
5. Validate structured outputs using Pydantic models
6. Interpret test results and debug failures

---

## 🎯 Context: What You've Done Before (Manual Testing)

Previously, you manually tested LLMs in ChatGPT by:
- Asking factual questions and checking answers
- Testing prompt sensitivity by rephrasing questions
- Checking for bias in responses
- Evaluating consistency across multiple queries

### Why This Doesn't Scale

Imagine you're building an **IT Support Chatbot** for a company helpdesk. You need to test:
- ✅ Does it correctly identify ticket severity?
- ✅ Does it provide accurate troubleshooting steps?
- ✅ Does it extract ticket information correctly?
- ✅ Is it consistent across similar queries?

**Problem:** Testing these manually every time you update your prompt or model is:
- ⏰ Time-consuming
- 🐛 Error-prone
- 📈 Not scalable (what about 100 test cases?)
- 🔄 Hard to reproduce

### The Solution: Automated Testing

**Same tests you did in ChatGPT, now in code!**

Benefits:
- 🚀 Run hundreds of tests in minutes, unattended
- 🔁 Reproducible results
- 🤖 Integrate into CI/CD pipelines
- 📊 Generate test reports automatically
- 🛡️ Catch regressions when you change prompts or models

---

Let's get started! 🚀

## 1. Environment Setup

First, we'll install the required packages and set up our OpenAI API access.

In [None]:
# Install required packages
# The version specifier is quoted so the shell does not treat ">" as an output redirect
!pip install openai pytest==8.3.4 pytest-html==4.1.1 "pydantic>=2.11.0" -q

In [None]:
# Import required libraries
import os
from openai import OpenAI
import pytest
from pydantic import BaseModel, Field
import json
from typing import List, Optional

print("✅ All imports successful!")

### API Key Setup

You'll need an OpenAI API key to run these tests.

**How to get your API key:**
1. Go to [platform.openai.com](https://platform.openai.com)
2. Sign in or create an account
3. Navigate to API Keys section
4. Create a new secret key
5. Copy it and use it below

**Two ways to provide your API key:**

**Option 1: Colab Secrets (Recommended - More Secure)**
- Click the 🔑 key icon in the left sidebar
- Add a new secret with name: `OPENAI_API_KEY`
- Paste your API key as the value
- Enable "Notebook access" toggle
- Run the cell below - it will automatically load from secrets

**Option 2: Enter when prompted**
- Just run the cell below
- You'll be prompted to enter your API key
- The key will be hidden as you type

**💰 Cost Note:** We'll use the `gpt-5-nano` model, which is very cost-effective for testing. These examples will cost less than $0.01 to run.

In [None]:
# Configure OpenAI API key
# Method 1: Try to get API key from Colab secrets (recommended)
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    print("✅ API key loaded from Colab secrets")
except Exception:
    # Method 2: Manual input (fallback when not running in Colab or the secret is unavailable)
    from getpass import getpass
    print("💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY")
    OPENAI_API_KEY = getpass("Enter your OpenAI API Key: ")

# Validate that the API key is set before using it
if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == "":
    raise ValueError("❌ ERROR: No API key provided!")

# Set the API key as an environment variable (the test files read it from here)
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print("✅ Authentication configured!")

# Configure which OpenAI model to use
OPENAI_MODEL = "gpt-5-nano"  # Using gpt-5-nano for cost efficiency
print(f"🤖 Selected Model: {OPENAI_MODEL}")

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

## 2. Your First Automated Test

Let's start simple. Remember when you manually tested factual accuracy in ChatGPT in the Basic Testing document? You asked questions like "What is the capital of Australia?" and checked if the response was "Canberra".

**Manual Test (What you did in 00_Basic_Testing.md):**
1. Open ChatGPT
2. Type: "What is the capital of Australia?"
3. Read response
4. Check if it says "Canberra" (not Sydney!)

**Automated Test (What we'll do now):**
Same thing, but in code! This means:
- You don't have to manually type the question each time
- The checking happens automatically
- You can rerun this test as often as you like with a single command
- You get consistent, reproducible results


We will create a helper function that sends your question to the LLM API and returns the text response.

  Think of this function as similar to a REST API client in traditional automated testing. Just as you might use `requests.post()` to call a web service and get JSON back, this function calls OpenAI's API with your prompt and receives the LLM's text response.


In [None]:
# Helper function to call the LLM
def ask_llm(prompt: str, model: str = "gpt-5-nano") -> str:
    """
    Send a prompt to the LLM and return the response.

    Args:
        prompt: The question or instruction to send to the LLM
        model: The model to use (default: gpt-5-nano)

    Returns:
        The LLM's response as a string
    """
    response = client.responses.create(
        model=model,
        input=prompt
    )
    return response.output_text

# Test it out!
response = ask_llm("What is the capital of Australia?")
print(f"LLM Response: {response}")

### Plain Python Test Function

Now we will create a test function that automates what you previously did manually in ChatGPT: it
- asks 'What is the capital of Australia?'
- receives the LLM's response
- automatically checks whether 'Canberra' appears in the answer

Notice the structure follows the AAA pattern (Arrange-Act-Assert):
- **Arrange**: We prepare our test input (the question string), just like setting up test data in a unit test
- **Act**: We execute the action being tested (calling the LLM), similar to invoking the function or API endpoint you're testing
- **Assert**: We verify the result meets our expectations using Python's `assert` statement


❗ The key difference from testing traditional software is that we check for the presence of the correct answer (`'Canberra' in response`) rather than exact string equality, because LLMs generate natural language that varies in phrasing. An LLM might say 'The capital is Canberra' or 'Canberra is the capital city', but both contain the factually correct answer we're looking for.
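
A plain substring check can also match by accident: `"22" in response` passes for a reply that mentions port 2222. One possible way to tighten it, sketched here with a hypothetical helper `contains_answer` (not part of the original notebook), is to match the expected answer as a whole word and ignore case:

```python
import re

def contains_answer(response: str, expected: str) -> bool:
    """Return True if `expected` appears in `response` as a whole word.

    Hypothetical helper for illustration; the notebook itself uses plain `in` checks.
    The match is case-insensitive and refuses to match inside a longer word or number.
    """
    pattern = r"(?<!\w)" + re.escape(expected) + r"(?!\w)"
    return re.search(pattern, response, flags=re.IGNORECASE) is not None

print(contains_answer("The capital is Canberra.", "canberra"))   # True
print(contains_answer("SSH listens on port 2222 here.", "22"))   # False
```

Whether the stricter check is worth it depends on the question: for single-word facts it removes false positives, while for free-form answers the simple `in` check is usually enough.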

In [None]:
def test_australia_capital_knowledge():
    """
    Test: Does the LLM know what the capital of Australia is?
    Expected: Response should contain "Canberra" (not Sydney!)

    This is the automated version of the manual test from 00_Basic_Testing.md
    where we tested if the LLM correctly identifies Canberra as the capital.
    """
    # Arrange: Set up the test data (same question from manual testing)
    question = "What is the capital of Australia?"

    # Act: Call the LLM (same as typing in ChatGPT, but automated)
    response = ask_llm(question)

    # Assert: Check if the response is correct
    # We check that "Canberra" appears in the response
    assert "Canberra" in response, f"Expected 'Canberra' in response, but got: {response}"

    print("✅ Test passed! LLM correctly identified Canberra as the capital of Australia")

# Run the test
test_australia_capital_knowledge()

## 3. Building a Test Suite with Pytest

In traditional software testing, you might organize related tests into a
  test suite. For example, all API tests in one file, all database tests in
  another.

We'll do the same here: organize all IT support chatbot tests into one file called `test_it_support.py`. Pytest will automatically discover and run all functions whose names start with `test_`, much as JUnit discovers methods annotated with `@Test` and NUnit discovers methods marked with the `[Test]` attribute.






In [None]:
%%writefile test_it_support.py

import os
from openai import OpenAI
import pytest
import json

# Initialize OpenAI client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def ask_llm(prompt: str, model: str = "gpt-5-nano") -> str:
    """Send a prompt to the LLM and return the response."""
    response = client.responses.create(
        model=model,
        input=prompt
    )
    return response.output_text


# Test 1: Technical Fact Testing
def test_technical_fact():
    """
    Test: Does the LLM know basic IT facts?
    This is like asking factual questions in ChatGPT and checking the answer.
    """
    question = "What is the default SSH port? Answer with just the number."
    response = ask_llm(question)

    # Check if response contains the correct port
    assert "22" in response, f"Expected SSH port 22, but got: {response}"


# Test 2: Ticket Severity Assessment
def test_ticket_severity_assessment():
    """
    Test: Can the LLM correctly assess ticket severity?
    Critical issues should be flagged as HIGH priority.
    """
    ticket_description = """
    Ticket: Database server is completely down. All employees cannot access
    customer data. Production environment is affected.

    Classify this ticket's severity as: LOW, MEDIUM, or HIGH.
    Answer with just one word.
    """

    response = ask_llm(ticket_description)

    # A complete database outage should be HIGH severity
    assert "HIGH" in response.upper(), f"Expected HIGH severity, but got: {response}"


# Test 3: Consistency Testing
def test_consistency():
    """
    Test: Does the LLM give consistent answers to the same question?
    This is like asking the same question multiple times in ChatGPT.
    """
    question = "What is the purpose of a firewall in network security? Answer in one sentence."

    # Ask the same question twice
    response1 = ask_llm(question)
    response2 = ask_llm(question)

    # Check that both responses mention key concepts about firewalls
    assert "firewall" in response1.lower(), f"Response 1 missing 'firewall': {response1}"
    assert "firewall" in response2.lower(), f"Response 2 missing 'firewall': {response2}"

    # Check semantic similarity (both should mention blocking/filtering/protecting)
    keywords = ["block", "filter", "control", "protect", "monitor"]
    assert any(kw in response1.lower() for kw in keywords), f"Response 1 missing key concepts: {response1}"
    assert any(kw in response2.lower() for kw in keywords), f"Response 2 missing key concepts: {response2}"


# Test 4: Structured Output (JSON)
def test_structured_output_json():
    """
    Test: Can the LLM extract structured information from a ticket?
    This tests if the LLM can parse unstructured text into structured data.
    """
    ticket_text = """
    Ticket #7823: User Sarah Chen reports that she cannot print from her laptop.
    The printer shows as offline. Priority: Medium. Category: Hardware.

    Extract the following information as JSON:
    - ticket_id
    - user_name
    - issue_summary
    - priority
    - category

    Return ONLY valid JSON, no other text.
    """

    response = ask_llm(ticket_text)

    # Parse the JSON response
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        pytest.fail(f"Response is not valid JSON: {response}")

    # Verify all required fields are present
    required_fields = ["ticket_id", "user_name", "issue_summary", "priority", "category"]
    for field in required_fields:
        assert field in data, f"Missing required field: {field}"

    # Verify correctness of extracted data
    assert "7823" in str(data["ticket_id"]), f"Wrong ticket_id: {data['ticket_id']}"
    assert "Sarah" in data["user_name"] or "Chen" in data["user_name"], f"Wrong user: {data['user_name']}"


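# Note: %%writefile reports "Writing test_it_support.py" on its own.
# A print() here would be saved into the file and run every time pytest imports it.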

Let's break down the 4 tests we just created. Each one tests a different capability of our IT Support chatbot:

**Test 1: `test_technical_fact()` (Basic Fact Checking)**

- **What it tests:** Can the LLM recall basic IT knowledge (SSH port number)?
- **Why it matters:** Your chatbot needs accurate factual knowledge to help users. If it can't remember that SSH uses port 22, it can't provide reliable troubleshooting advice.
- **Testing approach:** We ask a specific factual question and check if the correct answer ("22") appears anywhere in the response.
- **Traditional testing parallel:** Like testing if a configuration API returns the correct default port values.

---

**Test 2: `test_ticket_severity_assessment()` (Classification Logic)**

- **What it tests:** Can the LLM correctly evaluate the severity of an IT issue?
- **Why it matters:** In a real helpdesk system, tickets need proper prioritization. A complete database outage affecting all employees should always be classified as HIGH priority, not MEDIUM or LOW.
- **Testing approach:** We provide a clearly critical scenario (database down, production affected) and verify the LLM classifies it as HIGH severity.
- **Traditional testing parallel:** Like testing a business rules engine that categorizes insurance claims by risk level: you need to verify it correctly identifies high-risk scenarios.

---

**Test 3: `test_consistency()` (Stability Check)**

- **What it tests:** Does the LLM give semantically consistent answers when asked the same question multiple times?
- **Why it matters:** Users expect reliable answers. If someone asks "What is a firewall?" twice and gets contradictory explanations, they'll lose trust in your chatbot.
- **Testing approach:** We ask the same question twice and verify both responses mention core concepts (the word "firewall" itself, plus relevant actions like "block" or "protect"). We don't require identical wording, just consistent meaning.
- **Traditional testing parallel:** Like running the same API request 10 times and verifying you get logically equivalent results each time (same data, possibly different JSON formatting).

---

**Test 4: `test_structured_output_json()` (Data Extraction)**

- **What it tests:** Can the LLM extract structured information from unstructured text and return it as valid JSON?
- **Why it matters:** Real systems need structured data for databases, APIs, and workflows. If a user submits a ticket via email or chat, you need to extract the ticket ID, username, issue description, priority, and category into a database record.
- **Testing approach:** We provide a natural language ticket description and ask the LLM to extract specific fields as JSON. We then verify:
  - (1) the response is valid JSON (parseable),
  - (2) all required fields are present,
  - (3) the extracted values are correct (ticket_id contains "7823", user_name contains "Sarah" or "Chen").
- **Traditional testing parallel:** Like testing a parser that extracts structured order data from customer emails: you verify it correctly identifies order number, customer name, items, and shipping address.

**Note about Test 4:** We're manually checking each field one by one. This works, but imagine doing this for 10+ fields across 20+ tests. That's a lot of repetitive code! In Section 5, we'll learn a better way to validate structured outputs using **Pydantic**.
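
One practical wrinkle with check (1): even when asked for "ONLY valid JSON", models sometimes wrap the payload in a Markdown code fence, which makes `json.loads` fail on otherwise correct output. The sketch below shows one way to tolerate that. The helper name `parse_llm_json` is hypothetical, and whether you want this leniency or prefer the test to fail is a judgment call.

```python
import json

FENCE = "`" * 3  # three backticks, i.e. a Markdown code-fence marker

def parse_llm_json(response: str):
    """Parse JSON from an LLM reply, tolerating a surrounding Markdown code fence.

    Hypothetical helper for illustration. Raises json.JSONDecodeError if the
    remaining text is still not valid JSON.
    """
    text = response.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag such as "json") ...
        _, _, text = text.partition("\n")
        # ... and everything from the closing fence onward.
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)

fenced_reply = FENCE + 'json\n{"ticket_id": "7823"}\n' + FENCE
print(parse_llm_json(fenced_reply))
```

In Test 4 this would replace the direct `json.loads(response)` call; the `except json.JSONDecodeError` branch stays the same.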

### Run the Test Suite

Now let's run all 4 tests with pytest!

In [None]:
# Run pytest with verbose output
# -v = verbose (show each test name)
# -s = show print statements
!pytest test_it_support.py -v -s

You just ran 4 automated tests! Notice how pytest:
- Discovered all functions starting with `test_`
- Ran each test independently
- Showed which passed or failed

**This is the same testing you did manually in ChatGPT, but now:**
- ⚡ Takes seconds instead of minutes
- 🔁 Repeatable on demand
- 📊 Generates reports automatically
- 🤖 Can run in CI/CD pipelines

-------

**🎓 Understanding Test Flakiness in LLM Testing**

  **Did Test 3 fail when you ran it?** If so, congratulations, you just
  experienced your first **flaky LLM test**!

  Take a look at the failure message. You'll likely see something like:

```python
> assert "firewall" in response1.lower(), f"Response 1 missing 'firewall': {response1}"
```



  **What happened?**
  - The LLM probably gave a **correct, accurate definition** of a firewall
  - But it didn't use the word "firewall" in its response
  - Our test checked for the word "firewall", so it failed

  **Is this a bug in our test?**
No! Checking for "firewall" is actually a **good practice** in professional LLM testing. We want responses to be **grounded** in the question's terminology. If someone asks "What is a firewall?", we expect the answer to mention "firewall".

  **Why does this happen?**
  - LLMs are **non-deterministic** - they generate slightly different responses
   each time
  - Sometimes they avoid repetition (won't say "A firewall is a firewall
  that...")
  - The same prompt can yield different (but equally correct) answers



  This is a **real challenge** in production LLM systems. We can handle it in
  several ways:

  1. **Accept the flakiness** - Run the test multiple times in CI/CD
  2. **Loosen the assertions** - Check for functional keywords instead of
  "firewall"
  3. **Improve the prompt** - Make it more likely the term appears ("In one
  sentence, explain: What does a **firewall** do?")
  4. **Use probabilistic testing** - Run test 10 times, accept 8/10 pass rate
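
Option 4 can be sketched in a few lines. The example below is a minimal illustration, not a definitive implementation: `pass_rate` is a hypothetical helper, and the LLM is replaced by ten canned replies so the code runs without an API key. In a real test the check function would call `ask_llm` instead.

```python
from typing import Callable

def pass_rate(check: Callable[[], bool], runs: int = 10) -> float:
    """Run `check` `runs` times and return the fraction of runs that passed."""
    passed = sum(1 for _ in range(runs) if check())
    return passed / runs

# Stand-in for real LLM calls: 8 canned replies mention "firewall", 2 do not.
canned_replies = iter(
    ["A firewall filters incoming and outgoing network traffic."] * 8
    + ["It blocks unauthorized access to a private network."] * 2
)

def firewall_mentioned() -> bool:
    # In a real test this would be: "firewall" in ask_llm(question).lower()
    return "firewall" in next(canned_replies).lower()

rate = pass_rate(firewall_mentioned, runs=10)
assert rate >= 0.8, f"Pass rate too low: {rate:.0%}"
print(f"Pass rate: {rate:.0%}")  # Pass rate: 80%
```

The threshold (8 out of 10 here) is a trade-off: a higher value catches regressions sooner but fails more often on harmless variation, and every extra run costs another API call.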



## 4. Parameterized Testing

  Imagine you want to test if your IT support chatbot knows standard network
  ports. You could write separate test functions like this:

  ```python
  def test_http_port():
      response = ask_llm("What port does HTTP use?")
      assert "80" in response

  def test_https_port():
      response = ask_llm("What port does HTTPS use?")
      assert "443" in response

  def test_ssh_port():
      response = ask_llm("What port does SSH use?")
      assert "22" in response

  def test_ftp_port():
      response = ask_llm("What port does FTP use?")
      assert "21" in response

  def test_smtp_port():
      response = ask_llm("What port does SMTP use?")
      assert "25" in response
 ```

This approach has several problems:
  - 🔁 Repetitive code: Same test logic repeated 5 times
  - 📝 Hard to maintain: If you need to change the test logic, you must update 5 functions
  - 😫 Tedious to expand: Adding a new port test means writing another entire function
  - 📊 Poor reporting: If tests fail, you can't easily see patterns (e.g., "LLM knows common ports but fails on less common ones")

**The Solution: Parameterized Tests**

Parameterized tests let you write the test logic once and run it with multiple sets of data. Think of it like a loop, but specifically designed for testing.

Traditional testing parallel:
  - In JUnit (Java), you'd use `@ParameterizedTest` with `@ValueSource` or `@CsvSource`
  - In NUnit (C#), you'd use `[TestCase]` attributes
  - In pytest (Python), you use `@pytest.mark.parametrize`

**How It Works**

Instead of 5 separate functions, you write ONE function with a decorator that supplies different test data.
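
As a minimal sketch of that idea, the example below swaps `ask_llm` for a hypothetical stand-in (`fake_llm`, backed by a small lookup table) so it runs without an API key. Only `@pytest.mark.parametrize` itself is real pytest API; the other names are made up for illustration.

```python
import pytest

# Stand-in for ask_llm so the sketch runs offline (an assumption for illustration).
KNOWN_PORTS = {"HTTP": "80", "HTTPS": "443", "SSH": "22"}

def fake_llm(protocol: str) -> str:
    """Pretend LLM: answers in a sentence, like a real model might."""
    return f"{protocol} uses port {KNOWN_PORTS[protocol]}."

@pytest.mark.parametrize("protocol,expected_port", [
    ("HTTP", "80"),
    ("HTTPS", "443"),
    ("SSH", "22"),
])
def test_default_port(protocol, expected_port):
    # pytest calls this function once per tuple in the list above,
    # binding the tuple's values to the two named arguments.
    assert expected_port in fake_llm(protocol)
```

The first argument to the decorator names the function parameters; the second is the list of value tuples. Adding a test case means adding one tuple.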




### Example: Testing Multiple IT Knowledge Questions

We will write three parameterized tests. Let's break down what each test does:

**Test 1: `test_port_knowledge()`** (Knowledge Testing at Scale)

  **What it tests:** Does the LLM know standard network port numbers for 5
  common protocols?

  **Why parameterize?**
  - Testing one port proves nothing about the LLM's broader knowledge
  - Testing 5 ports reveals if it knows common protocols consistently
  - Easy to expand: add DNS (port 53), MySQL (3306), etc. by adding one line

  **The test cases:**
  - HTTP → 80
  - HTTPS → 443
  - SSH → 22
  - FTP → 21
  - SMTP → 25

**Test 2: `test_ticket_classification()`** (Multi-Category Classification)

  **What it tests:** Can the LLM correctly categorize 5 different IT issues
  into the right category (Hardware, Software, Network, Access)?

  **Why parameterize?**
  - Each IT issue has a clear expected category
  - Tests if the LLM can distinguish between issue types (not just memorize one
   category)
  - Real helpdesks have dozens of categories, and this approach scales easily

  **The test cases:**
  - "Laptop screen is cracked" → Hardware (physical device problem)
  - "Forgot my password" → Access (authentication issue)
  - "Website loading slowly for all users" → Network (connectivity problem)
  - "Excel keeps crashing" → Software (application issue)
  - "Need permission for shared folder" → Access (authorization issue)

  **Why these specific examples?** They test if the LLM can differentiate
  between similar-sounding categories (both password reset and folder
  permissions are "Access" issues, even though one is authentication and the
  other is authorization).

**Test 3: `test_troubleshooting_advice()`** (Flexible Keyword Matching)

  **What it tests:** Does the LLM provide relevant troubleshooting advice that
  mentions appropriate technical concepts?

  **Why parameterize?**
  - Each IT problem has different relevant keywords
  - We're not testing for exact phrasing (too brittle for LLMs)
  - We check if the advice is "in the right ballpark" by looking for
  domain-relevant terms

  **The test cases and their keyword lists:**
  - "Can't connect to WiFi" → ["wifi", "wireless", "router", "modem", "connection"]
  - "Printer is offline" → ["printer", "print", "device", "cable", "driver"]
  - "Computer is very slow" → ["slow", "memory", "cpu", "task", "process", "resource"]
  - "Email won't send" → ["email", "smtp", "server", "account", "credential"]
  - "Can't install software" → ["install", "permission", "administrator", "compatibility", "space"]

  **Important difference from Tests 1 & 2:** We check if **ANY** of the
  keywords appear (not all of them). Why? Because good troubleshooting advice
  might say "check your WiFi router" OR "restart your modem". Both are valid,
  but they use different keywords.

  **The two-part assertion:**
  1. **Keyword check:** `assert len(found_keywords) > 0`. Did the advice
  mention at least one relevant concept?
  2. **Length check:** `assert len(response) > 20`. Did the LLM actually give
  advice, or just echo the problem back?

In [None]:
%%writefile test_parameterized.py

import os
from openai import OpenAI
import pytest

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def ask_llm(prompt: str, model: str = "gpt-5-nano") -> str:
    """Send a prompt to the LLM and return the response."""
    response = client.responses.create(
        model=model,
        input=prompt
    )
    return response.output_text


# Parameterized test for IT knowledge
@pytest.mark.parametrize("question,expected_answer", [
    ("What port does HTTP use? Answer with just the number.", "80"),
    ("What port does HTTPS use? Answer with just the number.", "443"),
    ("What port does SSH use? Answer with just the number.", "22"),
    ("What port does FTP use? Answer with just the number.", "21"),
    ("What port does SMTP use? Answer with just the number.", "25"),
])
def test_port_knowledge(question, expected_answer):
    """
    Test: Does the LLM know standard network ports?
    This ONE test function runs 5 times with different data!
    """
    response = ask_llm(question)
    assert expected_answer in response, f"Expected '{expected_answer}' in response, got: {response}"


# Parameterized test for ticket classification
@pytest.mark.parametrize("issue_description,expected_category", [
    ("My laptop screen is cracked and won't display anything.", "Hardware"),
    ("I forgot my password and can't log into the system.", "Access"),
    ("The company website is loading very slowly for all users.", "Network"),
    ("Excel keeps crashing when I try to open large files.", "Software"),
    ("I need permission to access the shared finance folder.", "Access"),
])
def test_ticket_classification(issue_description, expected_category):
    """
    Test: Can the LLM correctly categorize different types of IT issues?
    Categories: Hardware, Software, Network, Access
    """
    prompt = f"""
    Categorize this IT support issue into ONE of these categories:
    Hardware, Software, Network, Access

    Issue: {issue_description}

    Answer with just the category name.
    """

    response = ask_llm(prompt)
    assert expected_category.lower() in response.lower(), \
        f"Expected category '{expected_category}', but got: {response}"


# Parameterized test for troubleshooting advice
# NOTE: We use multiple possible keywords since LLMs may phrase advice differently
@pytest.mark.parametrize("problem,expected_keywords", [
    ("User can't connect to WiFi", ["wifi", "wireless", "router", "modem", "connection"]),
    ("Printer is offline", ["printer", "print", "device", "cable", "driver"]),
    ("Computer is very slow", ["slow", "memory", "cpu", "task", "process", "resource"]),
    ("Email won't send", ["email", "smtp", "server", "account", "credential"]),
    ("Can't install software", ["install", "permission", "administrator", "compatibility", "space"]),
])
def test_troubleshooting_advice(problem, expected_keywords):
    """
    Test: Does the LLM provide relevant troubleshooting advice?
    We check if the advice mentions at least ONE relevant keyword.
    This is more flexible than checking for exact keywords!
    """
    prompt = f"Provide one troubleshooting step for this issue: {problem}"
    response = ask_llm(prompt)

    response_lower = response.lower()

    # Check if ANY of the expected keywords appear in the response
    found_keywords = [kw for kw in expected_keywords if kw in response_lower]

    assert len(found_keywords) > 0, \
        f"Expected advice to mention one of {expected_keywords}, but got: {response}"

    # Also check that we got actual advice (not just echoing the problem)
    assert len(response) > 20, f"Response too short to be useful advice: {response}"


# Note: %%writefile reports "Writing test_parameterized.py" on its own.
# A print() here would be saved into the file and run every time pytest imports it.

What happens when pytest runs this:
  1. Pytest reads the `@pytest.mark.parametrize` decorator
  2. It sees 5 sets of test data (5 tuples in the list)
  3. It runs `test_port_knowledge()` 5 separate times, each time with different
  values for `question` and `expected_answer`
  4. Each run appears as a separate test case in the report
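
By default pytest builds each case's name in the report from its parameter values, which gets long when the parameter is a whole question. If you want shorter names, `pytest.param(..., id=...)` lets you label each case. The sketch below assumes a hypothetical stand-in `fake_llm` so it runs offline; the test name and case list are illustrative.

```python
import pytest

# Each case carries a short, human-readable id for the test report.
PORT_CASES = [
    pytest.param("What port does HTTP use? Answer with just the number.", "80", id="http"),
    pytest.param("What port does SSH use? Answer with just the number.", "22", id="ssh"),
]

def fake_llm(question: str) -> str:
    """Stand-in for ask_llm (an assumption so the sketch runs without an API key)."""
    return "80" if "HTTP" in question else "22"

@pytest.mark.parametrize("question,expected_answer", PORT_CASES)
def test_port_knowledge_with_ids(question, expected_answer):
    assert expected_answer in fake_llm(question)
```

The report would then list `test_port_knowledge_with_ids[http]` and `test_port_knowledge_with_ids[ssh]`, and `pytest -k ssh` would select only the SSH case.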



In [None]:
# Run the parameterized tests
# Notice how ONE test function becomes MANY test cases!
!pytest test_parameterized.py -v



Look at what just happened:
- We wrote **3 test functions**
- But pytest ran **15 test cases** (5 + 5 + 5)
- Each test case appears separately in the report
- If one case fails, others still run

**Benefits:**
- ✍️ Less code duplication
- 📝 Easier to add new test cases (just add to the list)
- 🔍 Clearer which specific inputs failed
- 🎯 More comprehensive coverage

## 5. Testing Structured Outputs with Pydantic

### Remember Test 4 from Section 3?

In that test, we validated JSON output manually:

```python
# We had to check EVERYTHING by hand:
required_fields = ["ticket_id", "user_name", "issue_summary", "priority", "category"]
for field in required_fields:
    assert field in data, f"Missing required field: {field}"

assert "7823" in str(data["ticket_id"])
assert "Sarah" in data["user_name"] or "Chen" in data["user_name"]
```

**That works, but imagine doing this for:**
- 10 fields instead of 5
- 20 different test functions
- Complex constraints (email format, value ranges, enums like "LOW"/"MEDIUM"/"HIGH")

You'd write HUNDREDS of lines of repetitive validation code. There's a better way that we will show you in this section.

---

Also, we've mostly tested LLMs that return **natural language text**:
- "The capital of Australia is Canberra"
- "SSH uses port 22"
- "This ticket should be classified as HIGH priority"

But real-world applications need **structured data** for:
- 💾 **Saving to databases** (INSERT INTO tickets VALUES ...)
- 🔄 **API integrations** (sending JSON to other systems)
- 📊 **Data processing** (filtering, sorting, aggregating)
- 🎯 **Workflow automation** (if priority == "HIGH", send alert)

**The problem:** LLMs are creative and inconsistent. Ask them to extract a ticket as JSON, and you might get:

**Good output:**
```json
{"ticket_id": "TICK-5678", "priority": "HIGH"}
```

**Bad outputs:**
```json
{"ticketID": "5678", "Priority": "high"}  // Wrong field names, wrong case
{ticket_id: "TICK-5678"}                   // Missing quotes (invalid JSON)
{"ticket_id": 5678}                        // Wrong data type (number not string)
{"ticket_id": "TICK-5678", "priority": "urgent"}  // Invalid value (not LOW/MEDIUM/HIGH)
```

**In traditional testing:** When you test a REST API, you know the JSON structure is controlled by the backend code. It's consistent and predictable.

**In LLM testing:** The LLM generates the JSON on-the-fly. You must validate that:
1. It's valid JSON (parseable)
2. All required fields are present
3. Field names match exactly what you expect
4. Data types are correct (string vs number vs boolean)
5. Values meet constraints (e.g., priority must be "LOW", "MEDIUM", or "HIGH")
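
To make the five checks concrete, here is roughly what they look like when written by hand with only the standard library, for just two fields. The function name `first_problem` and its messages are hypothetical. The point of the sketch is how much code even a tiny schema needs.

```python
import json

ALLOWED_PRIORITIES = ("LOW", "MEDIUM", "HIGH")

def first_problem(raw: str):
    """Return a description of the first failed check, or None if all checks pass."""
    try:
        data = json.loads(raw)                        # 1. valid JSON
    except json.JSONDecodeError:
        return "invalid JSON"
    if not isinstance(data, dict):
        return "not a JSON object"
    for field in ("ticket_id", "priority"):           # 2 and 3. required fields, exact names
        if field not in data:
            return f"missing field: {field}"
    if not isinstance(data["ticket_id"], str):        # 4. data types
        return "ticket_id is not a string"
    if data["priority"] not in ALLOWED_PRIORITIES:    # 5. value constraints
        return f"invalid priority: {data['priority']}"
    return None

print(first_problem('{"ticket_id": "TICK-5678", "priority": "HIGH"}'))    # None
print(first_problem('{"ticketID": "5678", "Priority": "high"}'))          # missing field: ticket_id
print(first_problem('{"ticket_id": "TICK-5678", "priority": "urgent"}'))  # invalid priority: urgent
```

Every new field adds more branches like these, which is the repetition a schema library removes.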

---

### The Solution: Pydantic for Schema Validation

**Pydantic** is a Python library that provides **data validation using Python type hints**. Think of it as:
- **JSON Schema** for Python
- **TypeScript interfaces** with runtime validation
- **Database schema constraints** applied to Python objects


Pydantic automatically checks that:
- ✅ All required fields are present
- ✅ Field types are correct (string, int, etc.)
- ✅ Values match constraints (e.g., priority must be LOW/MEDIUM/HIGH)

When validation fails, it raises an error with a clear message naming each offending field.

### How Pydantic Works: A Complete Example

Let's see Pydantic in action with a support ticket model.

This model has 7 fields that validate different aspects of the extracted ticket data:

  1. **ticket_id** (`str`): Validates that the ticket has a unique identifier (must be text to preserve prefixes like "TICK-" and leading zeros)
  2. **user** (`str`): Ensures the reporter's name is extracted (required field, must be text)
  3. **email** (`str`): Checks that a contact email address is present (required for follow-up communication)
  4. **issue** (`str`): Validates that the problem description was captured (required field containing the issue details)
  5. **priority** (`Literal["LOW", "MEDIUM", "HIGH"]`): Enforces strict priority values matching database constraints (rejects invalid values like "urgent" or "high priority")
  6. **category** (`str`): Ensures the ticket is categorized for routing to the correct support team
  7. **status** (`Literal["OPEN", "IN_PROGRESS", "RESOLVED", "CLOSED"]`): Validates ticket status, defaulting to "OPEN" if the LLM does not provide one

In [None]:
from pydantic import BaseModel, Field, ValidationError
from typing import Literal
import json

# Define the structure we expect from the LLM
class SupportTicket(BaseModel):
    """A structured support ticket with all required fields."""
    ticket_id: str = Field(..., description="Ticket ID (e.g., 'TICK-1234')")
    user: str = Field(..., description="Full name of the user")
    email: str = Field(..., description="User's email address")
    issue: str = Field(..., description="Brief description of the issue")
    priority: Literal["LOW", "MEDIUM", "HIGH"] = Field(..., description="Ticket priority")
    category: str = Field(..., description="Issue category")
    status: Literal["OPEN", "IN_PROGRESS", "RESOLVED", "CLOSED"] = Field(default="OPEN")

# Example: Let's test extraction
ticket_text = """
Ticket ID: TICK-5678
From: Alice Johnson (alice.johnson@company.com)
Issue: Cannot access VPN from home. Getting "connection timeout" error.
Priority: HIGH
Category: Network/VPN

Extract this ticket information as JSON with fields: ticket_id, user, email, issue, priority, category, status.
Return ONLY valid JSON, no other text.
"""

response = ask_llm(ticket_text)
print("LLM Response:")
print(response)
print("\n" + "="*50 + "\n")

# Parse and validate
try:
    data = json.loads(response)
    ticket = SupportTicket(**data)
    print("✅ Valid ticket structure!")
    print("\nTicket Details:")
    print(f"  ID: {ticket.ticket_id}")
    print(f"  User: {ticket.user}")
    print(f"  Email: {ticket.email}")
    print(f"  Issue: {ticket.issue}")
    print(f"  Priority: {ticket.priority}")
    print(f"  Category: {ticket.category}")
    print(f"  Status: {ticket.status}")
except json.JSONDecodeError as e:
    print(f"❌ Invalid JSON: {e}")
except ValidationError as e:
    print(f"❌ Validation failed: {e}")
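
Even when asked to return only JSON, models sometimes wrap the reply in a Markdown code fence, which makes `json.loads` fail before Pydantic ever runs. The helper below is a minimal sketch of one way to tolerate that; `extract_json` and the sample replies are hypothetical names and data made up for this illustration, not part of the notebook's test files.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to embed

def extract_json(text: str) -> dict:
    """Parse JSON from an LLM reply, tolerating an optional Markdown code fence."""
    cleaned = text.strip()
    # Match a fence (optionally labelled "json") that wraps the whole reply
    pattern = FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE
    match = re.fullmatch(pattern, cleaned, flags=re.DOTALL | re.IGNORECASE)
    if match:
        cleaned = match.group(1)
    return json.loads(cleaned)

# Hypothetical replies, with and without a fence
fenced = FENCE + 'json\n{"ticket_id": "TICK-5678", "priority": "HIGH"}\n' + FENCE
plain = '{"ticket_id": "TICK-5678", "priority": "HIGH"}'

print(extract_json(fenced))
print(extract_json(plain))
```

Whether to tolerate fences or treat them as a test failure is a design choice: a strict test keeps pressure on the prompt, while a tolerant parser keeps the test focused on the extracted fields.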

### What Happened During Validation

When we call `SupportTicket(**data)`, Pydantic conceptually performs this validation sequence:

```
1. Check presence → Are all required fields present? (those declared with ...)
                   ↓ Pass
2. Check types → Is ticket_id a string? Is priority a string?
                   ↓ Pass
3. Check constraints → Is priority one of ["LOW", "MEDIUM", "HIGH"]?
                   ↓ Pass
4. Apply defaults → If status missing, set to "OPEN"
                   ↓ Pass
5. Create object → SupportTicket instance with validated data ✅
```

**If ANY step fails → `ValidationError` with detailed message**
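
To see what that detailed message contains, here is a minimal sketch. `MiniTicket` is a hypothetical two-field model invented for this illustration, and the example assumes Pydantic v2, where `ValidationError.errors()` returns one dictionary per failing field.

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

class MiniTicket(BaseModel):
    """Hypothetical two-field model, used only for this illustration."""
    ticket_id: str
    priority: Literal["LOW", "MEDIUM", "HIGH"]

# ticket_id is missing and priority is not one of the allowed values
bad_data = {"priority": "urgent"}

try:
    MiniTicket(**bad_data)
except ValidationError as e:
    failures = e.errors()
    for err in failures:
        # "loc" is a tuple path to the field; "type" is a short error code
        print(err["loc"], err["type"], err["msg"])
```

Pydantic reports every failing field in a single `ValidationError` rather than stopping at the first problem, which helps when an LLM response is wrong in more than one place.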

Let's take a look at the result:

In the run shown here, the LLM extracted all the ticket information and returned valid JSON containing all 7 fields. However, **Pydantic validation failed** because the `status` field contained "open" (lowercase) instead of the expected "OPEN" (uppercase). LLM output varies between runs, so your result may differ.

This is exactly the kind of subtle error Pydantic is designed to catch: the data looks correct to a human, the JSON is valid, and all fields are present, but the value doesn't meet the strict constraints your database or API likely requires. Without Pydantic, this lowercase "open" could pass through your tests unnoticed and cause a database constraint violation or API rejection in production.
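
If you would rather accept harmless case differences than fail the test, one option is to normalize the value before the `Literal` check runs. The sketch below assumes Pydantic v2's `field_validator`; `StrictStatus` and `LenientStatus` are hypothetical models made up for this comparison.

```python
from typing import Literal
from pydantic import BaseModel, ValidationError, field_validator

Status = Literal["OPEN", "IN_PROGRESS", "RESOLVED", "CLOSED"]

class StrictStatus(BaseModel):
    """Rejects anything that is not an exact match."""
    status: Status = "OPEN"

class LenientStatus(BaseModel):
    """Uppercases string input before the Literal check."""
    status: Status = "OPEN"

    @field_validator("status", mode="before")
    @classmethod
    def uppercase_status(cls, value):
        # Runs before the Literal check; non-strings pass through untouched
        return value.strip().upper() if isinstance(value, str) else value

try:
    StrictStatus(status="open")
    strict_ok = True
except ValidationError:
    strict_ok = False

lenient = LenientStatus(status="open")
print(strict_ok, lenient.status)
```

Normalizing hides the formatting slip from your test results, so it suits cases where the downstream system does not care about case. If it does care, keeping the strict model and fixing the prompt is the safer choice.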

## 6. Generating Test Reports

With the `pytest-html` plugin installed earlier, pytest can generate HTML reports showing all test results.

Let's run all our tests and create a report:

In [None]:
# Run all tests and generate an HTML report
!pytest test_it_support.py test_parameterized.py -v --html=report.html --self-contained-html

The report is saved as `report.html`. You can download it from Colab and open it in your browser to see:
- Summary of passed/failed tests
- Execution time for each test
- Detailed error messages for failures
- Environment information

This is perfect for sharing test results with your team! 📊

## 7. Exercises 🎓

Now it's your turn! Complete these exercises to practice what you've learned.

### Exercise 1: Convert Manual Tests to Automated Tests

Think of 3 questions you manually tested in ChatGPT. Convert them to automated pytest functions.


In [None]:
# Exercise 1: Write your 3 tests here

def test_your_question_1():
    """
    TODO: Test your first IT support question
    """
    pass  # Replace with your test code

def test_your_question_2():
    """
    TODO: Test your second IT support question
    """
    pass  # Replace with your test code

def test_your_question_3():
    """
    TODO: Test your third IT support question
    """
    pass  # Replace with your test code

# Run your tests
# !pytest -v

### Exercise 2: Parameterized Test for Common IT Issues

Create a parameterized test that checks if the LLM provides appropriate solutions for 5 common IT problems.

**Requirements:**
- Use `@pytest.mark.parametrize`
- Test at least 5 different IT issues
- Check that solutions contain relevant keywords

In [None]:
#%%writefile exercise_2.py
# Exercise 2: Your parameterized test here

import os
from openai import OpenAI
import pytest

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def ask_llm(prompt: str) -> str:
    response = client.responses.create(
        model="gpt-5-nano",
        input=prompt
    )
    return response.output_text

# TODO: Write your parameterized test here
# @pytest.mark.parametrize(...)
# def test_it_solutions(...):
#     pass

In [None]:
# Run your parameterized test
# !pytest exercise_2.py -v

### Exercise 3: Pydantic Model for Incident Report

Create a Pydantic model for an `IncidentReport` and test if the LLM can extract it correctly.

**Required fields:**
- `incident_id`: string
- `reporter`: string
- `timestamp`: string
- `severity`: "MINOR" | "MAJOR" | "CRITICAL"
- `affected_systems`: list of strings
- `description`: string
- `resolution_time_estimate`: string

In [None]:
# Exercise 3: Define your Pydantic model and test

from pydantic import BaseModel, Field
from typing import List, Literal
import json

# TODO: Define IncidentReport model
class IncidentReport(BaseModel):
    pass  # Add your fields here

# TODO: Write a test that:
# 1. Provides incident text to the LLM
# 2. Asks LLM to extract as JSON
# 3. Validates with your IncidentReport model

def test_incident_extraction():
    pass  # Your test code here

### Exercise 4: Consistency Testing

Test if the LLM gives consistent troubleshooting advice across 3 runs.

**Requirements:**
- Pick an IT issue (e.g., "Computer won't start")
- Ask for troubleshooting steps 3 times
- Check that all 3 responses mention the same key concepts
- Hint: Define a set of expected keywords and check they appear in all responses

In [None]:
# Exercise 4: Consistency test

def test_troubleshooting_consistency():
    """
    TODO: Test consistency of troubleshooting advice
    """
    pass  # Your test code here

### Exercise 5: Intentionally Failing Tests

Write 3 tests that will FAIL because they test edge cases or difficult scenarios.

**Example edge cases:**
- Ambiguous ticket descriptions
- Mixed priority signals in text
- Tickets with missing information
- Very technical jargon

Then, improve your prompts to make the tests pass!

In [None]:
# Exercise 5: Write failing tests, then fix them

def test_edge_case_1():
    """
    TODO: Test an edge case that initially fails
    """
    pass

def test_edge_case_2():
    """
    TODO: Test another edge case
    """
    pass

def test_edge_case_3():
    """
    TODO: Test a third edge case
    """
    pass

# Step 1: Run and watch them fail
# Step 2: Improve your prompts
# Step 3: Run again and see them pass!

## 9. Best Practices for LLM Testing

### ✅ DO:

1. **Write clear, specific prompts** - Reduces ambiguity
2. **Test one thing at a time** - Easier to debug failures
3. **Use descriptive test names** - `test_ticket_severity()` not `test_1()`
4. **Add helpful assertion messages** - Explain what you expected
5. **Test edge cases** - Empty inputs, very long inputs, ambiguous cases
6. **Use Pydantic for structured outputs** - Automatic validation
7. **Group related tests** - Use separate test files for different features
8. **Check for key concepts, not exact strings** - LLMs may phrase answers differently
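
As a rough illustration of items 4 and 8 together, the sketch below checks for key concepts instead of an exact string and reports which ones are missing. `missing_keywords` is a hypothetical helper, and the canned response stands in for a real `ask_llm()` call so the example runs offline.

```python
def missing_keywords(response: str, expected: list[str]) -> list[str]:
    """Return the expected keywords that do not appear in the response (case-insensitive)."""
    text = response.lower()
    return [kw for kw in expected if kw.lower() not in text]

def test_password_reset_mentions_key_concepts():
    # Canned reply standing in for a real ask_llm() call
    response = (
        "Go to the login page, click 'Forgot password', "
        "and check your email for a reset link."
    )
    missing = missing_keywords(response, ["password", "reset", "email"])
    # The message names the missing concepts, so a failure is easy to diagnose
    assert not missing, f"Response is missing expected concepts: {missing}"

# Called directly here for demonstration; pytest would discover it by name
test_password_reset_mentions_key_concepts()
```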

### ❌ DON'T:

1. **Don't expect exact string matches** - LLMs vary in phrasing
2. **Don't test creative tasks too strictly** - Allow for variation
3. **Don't ignore flaky tests** - Investigate and fix them
4. **Don't test too many things in one test** - Keep tests focused
5. **Don't forget to test error cases** - What if API fails?
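
For the last point, one way to test an API failure without calling the real API is to pass in a function that raises. This is only a sketch: `safe_ask_llm`, `failing_ask`, and the `"SERVICE_UNAVAILABLE"` fallback are hypothetical names chosen for illustration, and the notebook's own `ask_llm()` has no such error handling.

```python
def safe_ask_llm(ask, prompt: str, fallback: str = "SERVICE_UNAVAILABLE") -> str:
    """Call ask(prompt); return a fallback string if the call raises."""
    try:
        return ask(prompt)
    except Exception:
        # A real app would likely log the error and catch narrower exception types
        return fallback

def failing_ask(prompt: str) -> str:
    # Stand-in for ask_llm() during an outage
    raise ConnectionError("simulated API outage")

def test_api_failure_returns_fallback():
    result = safe_ask_llm(failing_ask, "My laptop won't boot")
    assert result == "SERVICE_UNAVAILABLE", f"Expected fallback, got: {result!r}"

# Called directly here for demonstration; pytest would discover it by name
test_api_failure_returns_fallback()
```

Passing the LLM call in as an argument keeps the test fast and deterministic, because no network request is made.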




## 10. Key Takeaways & Next Steps

### 🎉 What You've Learned

1. **Manual → Automated**: You transformed your ChatGPT testing workflow into automated Python tests
2. **Pytest basics**: Writing tests, running test suites, reading output
3. **Parameterization**: Testing multiple scenarios with minimal code
4. **Structured testing**: Using Pydantic to validate LLM outputs
5. **Best practices**: How to write maintainable, effective LLM tests


### 💡 Pro Tips

1. **Start small**: Begin with a few critical tests, then expand
2. **Run tests often**: Integrate into your development workflow
3. **Track test coverage**: Aim for 80%+ coverage of critical paths
4. **Share reports**: Use HTML reports to communicate with non-technical team members
5. **Iterate on prompts**: Use failing tests to improve your prompts

---

## 📝 Additional Resources

- [Pytest Documentation](https://docs.pytest.org/)
- [Pydantic Documentation](https://docs.pydantic.dev/)
- [OpenAI API Documentation](https://platform.openai.com/docs/)

