# LLM Testing: From Manual to Automated

By the end of this notebook, you will be able to:

1. Understand why automated testing is essential for LLM applications
2. Convert manual tests (like those you did in ChatGPT) into automated Python tests
3. Use pytest to organize and run test suites
4. Write parameterized tests to test multiple scenarios efficiently
5. Validate structured outputs using Pydantic models
6. Interpret test results and debug failures

---

## 🎯 Context: What You've Done Before (Manual Testing)

Previously, you manually tested LLMs in ChatGPT by:
- Asking factual questions and checking answers
- Testing prompt sensitivity by rephrasing questions
- Checking for bias in responses
- Evaluating consistency across multiple queries

### Why This Doesn't Scale

Imagine you're building an **IT Support Chatbot** for a company helpdesk. You need to test:
- ✅ Does it correctly identify ticket severity?
- ✅ Does it provide accurate troubleshooting steps?
- ✅ Does it extract ticket information correctly?
- ✅ Is it consistent across similar queries?

**Problem:** Testing these manually every time you update your prompt or model is:
- ⏰ Time-consuming
- 🐛 Error-prone
- 📈 Not scalable (what about 100 test cases?)
- 🔄 Hard to reproduce

### The Solution: Automated Testing

**Same tests you did in ChatGPT, now in code!**

Benefits:
- 🚀 Run hundreds of tests with a single command
- 🔁 Repeatable test procedure (same prompts and checks every run)
- 🤖 Integrate into CI/CD pipelines
- 📊 Generate test reports automatically
- 🛡️ Catch regressions when you change prompts or models

---

Let's get started! 🚀

## 1. Environment Setup

First, we'll install the required packages and set up our OpenAI API access.

In [None]:
# Install required packages
# This may take a minute - you'll see progress bars
!pip install openai pytest==8.3.4 pytest-html==4.1.1 "pydantic>=2.11.0" -q

In [None]:
# Import required libraries
import os
from openai import OpenAI
import pytest
from pydantic import BaseModel, Field
import json
from typing import List, Optional

print("✅ All imports successful!")

### API Key Setup

You'll need an OpenAI API key to run these tests.

**How to get your API key:**
1. Go to [platform.openai.com](https://platform.openai.com)
2. Sign in or create an account
3. Navigate to API Keys section
4. Create a new secret key
5. Copy it and use it below

**Two ways to provide your API key:**

**Option 1: Colab Secrets (Recommended - More Secure)**
- Click the 🔑 key icon in the left sidebar
- Add a new secret with name: `OPENAI_API_KEY`
- Paste your API key as the value
- Enable "Notebook access" toggle
- Run the cell below - it will automatically load from secrets

**Option 2: Enter when prompted**
- Just run the cell below
- You'll be prompted to enter your API key
- The key will be hidden as you type

**💰 Cost Note:** We'll use the `gpt-5-nano` model, which is very cost-effective for testing. These examples will cost less than $0.01 to run.

In [None]:
# Configure OpenAI API key
# Method 1: Try to get API key from Colab secrets (recommended)
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    print("✅ API key loaded from Colab secrets")
except Exception:
    # Method 2: Manual input (fallback when not in Colab or the secret is unavailable)
    from getpass import getpass
    print("💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY")
    OPENAI_API_KEY = getpass("Enter your OpenAI API Key: ")

# Validate that the API key is set before using it
if not OPENAI_API_KEY or not OPENAI_API_KEY.strip():
    raise ValueError("❌ ERROR: No API key provided!")

# Set the API key as an environment variable (the test files read it from here)
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print("✅ Authentication configured!")

# Configure which OpenAI model to use
OPENAI_MODEL = "gpt-5-nano"  # Using gpt-5-nano for cost efficiency
print(f"🤖 Selected Model: {OPENAI_MODEL}")

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

## 2. Your First Automated Test

Let's start simple. Remember when you manually tested IT knowledge in ChatGPT?

**Manual Test (What you did before):**
1. Open ChatGPT
2. Type: "What port does HTTPS use?"
3. Read response
4. Check if it says "443"
5. ✅ or ❌

**Automated Test (What we'll do now):**
Same thing, but in code!



In [None]:
# Helper function to call the LLM
def ask_llm(prompt: str, model: str = OPENAI_MODEL) -> str:
    """
    Send a prompt to the LLM and return the response.

    Args:
        prompt: The question or instruction to send to the LLM
        model: The model to use (default: gpt-5-nano)

    Returns:
        The LLM's response as a string
    """
    response = client.responses.create(
        model=model,
        input=prompt
    )
    return response.output_text

# Test it out!
response = ask_llm("What port does HTTPS use? Answer with just the number.")
print(f"LLM Response: {response}")

### Plain Python Test Function

Here's our first test as a simple Python function:

In [None]:
def test_https_port_knowledge():
    """
    Test: Does the LLM know what port HTTPS uses?
    Expected: Response should contain "443"
    """
    # Arrange: Set up the test data
    question = "What port does HTTPS use? Answer with just the number."

    # Act: Call the LLM
    response = ask_llm(question)

    # Assert: Check if the response is correct
    assert "443" in response, f"Expected '443' in response, but got: {response}"

    print("✅ Test passed! LLM correctly identified HTTPS port as 443")

# Run the test
test_https_port_knowledge()

### Understanding the Test Structure

Every good test follows the **AAA pattern**:

1. **Arrange:** Set up your test data (the question)
2. **Act:** Perform the action (call the LLM)
3. **Assert:** Check if the result matches expectations

**Key concept:** The `assert` statement is how we check correctness:
- If the condition is `True` → Test passes ✅
- If the condition is `False` → Test fails ❌ and shows error message

**Note about consistency:** Since gpt-5-nano doesn't support temperature control, responses may vary slightly between runs. We design our tests to check for key concepts rather than exact string matches.
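
One caveat with substring checks such as `"443" in response`: they also pass when the digits are part of a longer number (for example `4430`). A minimal sketch of a stricter check is shown below. The helper name `mentions_number` is hypothetical and not part of pytest or the OpenAI SDK.

```python
import re


def mentions_number(response: str, expected: int) -> bool:
    """Return True if `expected` appears as a standalone number in the response."""
    # Collect every run of digits, so "4430" is one number rather than "443" plus "0".
    numbers = re.findall(r"\d+", response)
    return str(expected) in numbers


# A sentence-style answer still passes
assert mentions_number("HTTPS uses port 443 by default.", 443)
# A longer number that merely contains the digits does not
assert not mentions_number("Try port 4430 instead.", 443)
```

In a test, this could replace the plain substring check, for example `assert mentions_number(response, 443)`.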

## 3. Building a Test Suite with Pytest

Now let's organize multiple tests using **pytest**, a professional testing framework.

Pytest automatically:
- Discovers all functions starting with `test_`
- Runs each one independently, so one failure doesn't stop the others
- Reports which passed or failed
- Shows detailed error messages

Let's create a test file with multiple IT support tests!

In [None]:
%%writefile test_it_support.py
# test_it_support.py
# This file contains automated tests for our IT Support LLM

import os
from openai import OpenAI
import pytest
from pydantic import BaseModel, Field
import json

# Initialize OpenAI client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def ask_llm(prompt: str, model: str = "gpt-5-nano") -> str:
    """Send a prompt to the LLM and return the response."""
    response = client.responses.create(
        model=model,
        input=prompt
    )
    return response.output_text


# Test 1: Technical Fact Testing
def test_technical_fact():
    """
    Test: Does the LLM know basic IT facts?
    This is like asking factual questions in ChatGPT and checking the answer.
    """
    question = "What is the default SSH port? Answer with just the number."
    response = ask_llm(question)

    # Check if response contains the correct port
    assert "22" in response, f"Expected SSH port 22, but got: {response}"


# Test 2: Ticket Severity Assessment
def test_ticket_severity_assessment():
    """
    Test: Can the LLM correctly assess ticket severity?
    Critical issues should be flagged as HIGH priority.
    """
    ticket_description = """
    Ticket: Database server is completely down. All employees cannot access
    customer data. Production environment is affected.

    Classify this ticket's severity as: LOW, MEDIUM, or HIGH.
    Answer with just one word.
    """

    response = ask_llm(ticket_description)

    # A complete database outage should be HIGH severity
    assert "HIGH" in response.upper(), f"Expected HIGH severity, but got: {response}"


# Test 3: Consistency Testing
def test_consistency():
    """
    Test: Does the LLM give consistent answers to the same question?
    This is like asking the same question multiple times in ChatGPT.
    """
    question = "What is the purpose of a firewall in network security? Answer in one sentence."

    # Ask the same question twice
    response1 = ask_llm(question)
    response2 = ask_llm(question)

    # Check that both responses mention key concepts about firewalls
    assert "firewall" in response1.lower(), f"Response 1 missing 'firewall': {response1}"
    assert "firewall" in response2.lower(), f"Response 2 missing 'firewall': {response2}"

    # Rough consistency check via shared keywords (both should mention blocking/filtering/protecting)
    keywords = ["block", "filter", "control", "protect", "monitor"]
    assert any(kw in response1.lower() for kw in keywords), f"Response 1 missing key concepts: {response1}"
    assert any(kw in response2.lower() for kw in keywords), f"Response 2 missing key concepts: {response2}"


# Test 4: Structured Output (JSON)
def test_structured_output_json():
    """
    Test: Can the LLM extract structured information from a ticket?
    This tests if the LLM can parse unstructured text into structured data.
    """
    ticket_text = """
    Ticket #7823: User Sarah Chen reports that she cannot print from her laptop.
    The printer shows as offline. Priority: Medium. Category: Hardware.

    Extract the following information as JSON:
    - ticket_id
    - user_name
    - issue_summary
    - priority
    - category

    Return ONLY valid JSON, no other text.
    """

    response = ask_llm(ticket_text)

    # Parse the JSON response
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        pytest.fail(f"Response is not valid JSON: {response}")

    # Verify all required fields are present
    required_fields = ["ticket_id", "user_name", "issue_summary", "priority", "category"]
    for field in required_fields:
        assert field in data, f"Missing required field: {field}"

    # Verify correctness of extracted data
    assert "7823" in str(data["ticket_id"]), f"Wrong ticket_id: {data['ticket_id']}"
    assert "Sarah" in data["user_name"] or "Chen" in data["user_name"], f"Wrong user: {data['user_name']}"


# Test 5: Structured Output with Pydantic Validation
class SupportTicket(BaseModel):
    """Pydantic model for a support ticket."""
    ticket_id: str = Field(..., description="Unique ticket identifier")
    user: str = Field(..., description="Name of the user who reported the issue")
    issue: str = Field(..., description="Brief description of the issue")
    priority: str = Field(..., description="Priority level: LOW, MEDIUM, or HIGH")
    category: str = Field(..., description="Category of the issue")

def test_structured_output_validation():
    """
    Test: Can the LLM produce output that passes strict validation?
    We use Pydantic to ensure the output has the right structure AND data types.
    """
    ticket_text = """
    Ticket #9234: John Smith cannot access his email. Getting "authentication failed" error.
    This is blocking his work. Priority: High. Category: Email.

    Extract the information as JSON with these exact fields:
    - ticket_id (string)
    - user (string)
    - issue (string)
    - priority (string: LOW, MEDIUM, or HIGH)
    - category (string)

    Return ONLY valid JSON, no other text.
    """

    response = ask_llm(ticket_text)

    # Parse JSON
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        pytest.fail(f"Response is not valid JSON: {response}")

    # Validate with Pydantic model
    try:
        ticket = SupportTicket(**data)
    except Exception as e:
        pytest.fail(f"Response doesn't match SupportTicket model: {e}\nData: {data}")

    # Verify semantic correctness
    assert "9234" in ticket.ticket_id, f"Wrong ticket_id: {ticket.ticket_id}"
    assert "John" in ticket.user or "Smith" in ticket.user, f"Wrong user: {ticket.user}"
    assert ticket.priority.upper() in ["LOW", "MEDIUM", "HIGH"], f"Invalid priority: {ticket.priority}"


# Note: %%writefile saves this cell to test_it_support.py; the code only runs when pytest imports the file.

### Run the Test Suite

Now let's run all 5 tests with pytest!

In [None]:
# Run pytest with verbose output
# -v = verbose (show each test name)
# -s = show print statements
!pytest test_it_support.py -v -s

### 🎉 Congratulations!

You just ran 5 automated tests! Notice how pytest:
- Discovered all functions starting with `test_`
- Ran each test independently
- Reported each test as `PASSED` or `FAILED`
- Displayed helpful error messages for failures

**This is the same testing you did manually in ChatGPT, but now:**
- ⚡ Takes one command instead of minutes of copy-pasting
- 🔁 Repeatable on demand (the same prompts and checks every run, even though LLM wording can vary)
- 📊 Generates reports automatically
- 🤖 Can run in CI/CD pipelines
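
A practical caveat for the two JSON tests above: even when asked for "ONLY valid JSON", a model may wrap its answer in a Markdown code fence, which makes `json.loads` fail. The sketch below shows one way to tolerate that. The helper `parse_llm_json` is a hypothetical name introduced here, and the pattern only handles a single fence around the whole response.

```python
import json
import re

# Matches a response that is entirely one fenced block, with an optional language tag.
FENCE_RE = re.compile(r"^`{3}[a-zA-Z]*\s*\n(.*?)\n?`{3}\s*$", re.DOTALL)


def parse_llm_json(response: str):
    """Parse JSON from an LLM response, removing a surrounding code fence if present."""
    text = response.strip()
    match = FENCE_RE.match(text)
    if match:
        text = match.group(1)  # keep only the content between the fences
    return json.loads(text)    # still raises json.JSONDecodeError on invalid JSON


fence = "`" * 3
fenced_reply = f'{fence}json\n{{"ticket_id": "7823"}}\n{fence}'
assert parse_llm_json(fenced_reply) == {"ticket_id": "7823"}
assert parse_llm_json('{"ticket_id": "7823"}') == {"ticket_id": "7823"}
```

Whether to accept fenced output or treat it as a failure is a design choice: if downstream code requires raw JSON, keeping the strict `json.loads(response)` check may be the better test.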

## 4. Parameterized Testing

What if you want to test multiple similar scenarios? You could write 10 separate test functions... or use **parameterized tests**!

**Parameterized tests** let you run the same test logic with different input data.

### Example: Testing Multiple IT Knowledge Questions

In [None]:
%%writefile test_parameterized.py
# test_parameterized.py
# Demonstrates parameterized testing with pytest

import os
from openai import OpenAI
import pytest

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def ask_llm(prompt: str, model: str = "gpt-5-nano") -> str:
    """Send a prompt to the LLM and return the response."""
    response = client.responses.create(
        model=model,
        input=prompt
    )
    return response.output_text


# Parameterized test for IT knowledge
@pytest.mark.parametrize("question,expected_answer", [
    ("What port does HTTP use? Answer with just the number.", "80"),
    ("What port does HTTPS use? Answer with just the number.", "443"),
    ("What port does SSH use? Answer with just the number.", "22"),
    ("What port does FTP use? Answer with just the number.", "21"),
    ("What port does SMTP use? Answer with just the number.", "25"),
])
def test_port_knowledge(question, expected_answer):
    """
    Test: Does the LLM know standard network ports?
    This ONE test function runs 5 times with different data!
    """
    response = ask_llm(question)
    assert expected_answer in response, f"Expected '{expected_answer}' in response, got: {response}"


# Parameterized test for ticket classification
@pytest.mark.parametrize("issue_description,expected_category", [
    ("My laptop screen is cracked and won't display anything.", "Hardware"),
    ("I forgot my password and can't log into the system.", "Access"),
    ("The company website is loading very slowly for all users.", "Network"),
    ("Excel keeps crashing when I try to open large files.", "Software"),
    ("I need permission to access the shared finance folder.", "Access"),
])
def test_ticket_classification(issue_description, expected_category):
    """
    Test: Can the LLM correctly categorize different types of IT issues?
    Categories: Hardware, Software, Network, Access
    """
    prompt = f"""
    Categorize this IT support issue into ONE of these categories:
    Hardware, Software, Network, Access

    Issue: {issue_description}

    Answer with just the category name.
    """

    response = ask_llm(prompt)
    assert expected_category.lower() in response.lower(), \
        f"Expected category '{expected_category}', but got: {response}"


# Parameterized test for troubleshooting advice
# NOTE: We use multiple possible keywords since LLMs may phrase advice differently
@pytest.mark.parametrize("problem,expected_keywords", [
    ("User can't connect to WiFi", ["wifi", "wireless", "router", "modem", "connection"]),
    ("Printer is offline", ["printer", "print", "device", "cable", "driver"]),
    ("Computer is very slow", ["slow", "memory", "cpu", "task", "process", "resource"]),
    ("Email won't send", ["email", "smtp", "server", "account", "credential"]),
    ("Can't install software", ["install", "permission", "administrator", "compatibility", "space"]),
])
def test_troubleshooting_advice(problem, expected_keywords):
    """
    Test: Does the LLM provide relevant troubleshooting advice?
    We check if the advice mentions at least ONE relevant keyword.
    This is more flexible than checking for exact keywords!
    """
    prompt = f"Provide one troubleshooting step for this issue: {problem}"
    response = ask_llm(prompt)

    response_lower = response.lower()

    # Check if ANY of the expected keywords appear in the response
    found_keywords = [kw for kw in expected_keywords if kw in response_lower]

    assert len(found_keywords) > 0, \
        f"Expected advice to mention one of {expected_keywords}, but got: {response}"

    # Also check that we got actual advice (not just echoing the problem)
    assert len(response) > 20, f"Response too short to be useful advice: {response}"


# Note: %%writefile saves this cell to test_parameterized.py; the code only runs when pytest imports the file.

In [None]:
# Run the parameterized tests
# Notice how ONE test function becomes MANY test cases!
!pytest test_parameterized.py -v

### 🚀 Power of Parameterization

Look at what just happened:
- We wrote **3 test functions**
- But pytest ran **15 test cases** (5 + 5 + 5)
- Each test case appears separately in the report
- If one case fails, others still run

**Benefits:**
- ✍️ Less code duplication
- 📝 Easier to add new test cases (just add to the list)
- 🔍 Clearer which specific inputs failed
- 🎯 More comprehensive coverage
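
To make it even clearer which input failed, each case can be given a short id with `pytest.param(..., id=...)`, so the report shows names like `test_port_knowledge[https]` instead of the full question text. The following is a minimal sketch. `fake_ask_llm` is a hypothetical offline stand-in for `ask_llm`, used here only so the example runs without an API key.

```python
import pytest

# Each case carries a human-readable id that appears in the pytest report.
PORT_CASES = [
    pytest.param("What port does HTTP use? Answer with just the number.", "80", id="http"),
    pytest.param("What port does HTTPS use? Answer with just the number.", "443", id="https"),
    pytest.param("What port does SSH use? Answer with just the number.", "22", id="ssh"),
]


def fake_ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for ask_llm: looks up the protocol named in the prompt."""
    answers = {"HTTPS": "443", "HTTP": "80", "SSH": "22"}
    for protocol, port in answers.items():
        if f" {protocol} " in prompt:
            return port
    return "unknown"


@pytest.mark.parametrize("question,expected_answer", PORT_CASES)
def test_port_knowledge(question, expected_answer):
    response = fake_ask_llm(question)
    assert expected_answer in response, f"Expected '{expected_answer}', got: {response}"
```

With real tests, replacing `fake_ask_llm` with `ask_llm` keeps the same structure, and a single case can then be selected by id using pytest's `-k` option.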

## 5. Testing Structured Outputs with Pydantic

In real applications, you often need LLMs to return structured data (JSON). But how do you test that the structure is correct?

**Pydantic** helps you:
1. Define the expected structure (schema)
2. Validate that the LLM output matches the schema
3. Get type checking and helpful error messages

### Example: Support Ticket Model

In [None]:
from pydantic import BaseModel, Field, ValidationError
from typing import Literal
import json

# Define the structure we expect from the LLM
class SupportTicket(BaseModel):
    """A structured support ticket with all required fields."""
    ticket_id: str = Field(..., description="Ticket ID (e.g., 'TICK-1234')")
    user: str = Field(..., description="Full name of the user")
    email: str = Field(..., description="User's email address")
    issue: str = Field(..., description="Brief description of the issue")
    priority: Literal["LOW", "MEDIUM", "HIGH"] = Field(..., description="Ticket priority")
    category: str = Field(..., description="Issue category")
    status: Literal["OPEN", "IN_PROGRESS", "RESOLVED", "CLOSED"] = Field(default="OPEN")

# Example: Let's test extraction
ticket_text = """
Ticket ID: TICK-5678
From: Alice Johnson (alice.johnson@company.com)
Issue: Cannot access VPN from home. Getting "connection timeout" error.
Priority: HIGH
Category: Network/VPN

Extract this ticket information as JSON with fields: ticket_id, user, email, issue, priority, category, status.
Return ONLY valid JSON, no other text.
"""

response = ask_llm(ticket_text)
print("LLM Response:")
print(response)
print("\n" + "="*50 + "\n")

# Parse and validate
try:
    data = json.loads(response)
    ticket = SupportTicket(**data)
    print("✅ Valid ticket structure!")
    print(f"\nTicket Details:")
    print(f"  ID: {ticket.ticket_id}")
    print(f"  User: {ticket.user}")
    print(f"  Email: {ticket.email}")
    print(f"  Issue: {ticket.issue}")
    print(f"  Priority: {ticket.priority}")
    print(f"  Category: {ticket.category}")
    print(f"  Status: {ticket.status}")
except json.JSONDecodeError as e:
    print(f"❌ Invalid JSON: {e}")
except ValidationError as e:
    print(f"❌ Validation failed: {e}")

### Why Pydantic is Powerful for Testing

Pydantic automatically checks:
- ✅ All required fields are present
- ✅ Field types are correct (string, int, etc.)
- ✅ Values match constraints (e.g., priority must be LOW/MEDIUM/HIGH)
- ✅ Provides clear error messages when validation fails

**Without Pydantic:** You'd need to manually write checks for each field.  
**With Pydantic:** One line (`SupportTicket(**data)`) does it all! 🎉
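
To see what a failure looks like without calling the API, the sketch below validates hand-written dictionaries against a trimmed-down model. Both `Ticket` and the helper `validation_errors` are hypothetical names used for illustration. The helper returns the names of the fields that failed, which can make assertion messages easier to read.

```python
from typing import List, Literal

from pydantic import BaseModel, ValidationError


class Ticket(BaseModel):
    """Trimmed-down ticket model, for illustration only."""
    ticket_id: str
    priority: Literal["LOW", "MEDIUM", "HIGH"]


def validation_errors(data: dict) -> List[str]:
    """Return the names of fields that fail validation (empty list if valid)."""
    try:
        Ticket(**data)
    except ValidationError as exc:
        # Each error dict has a "loc" tuple describing where the problem is.
        return [".".join(str(part) for part in err["loc"]) for err in exc.errors()]
    return []


# Valid data produces no errors
assert validation_errors({"ticket_id": "TICK-1", "priority": "HIGH"}) == []
# A value outside the allowed Literal options is reported against its field
assert validation_errors({"ticket_id": "TICK-1", "priority": "urgent"}) == ["priority"]
# A missing required field is reported too
assert validation_errors({"priority": "LOW"}) == ["ticket_id"]
```

Note that the `Literal` check is case-sensitive, so a model answer of `"High"` fails validation unless the prompt pins the casing or the value is normalized first.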

## 6. Generating Test Reports

With the `pytest-html` plugin (installed in the setup step), pytest can generate an HTML report showing all test results.

Let's run all our tests and create a report:

In [None]:
# Run all tests and generate an HTML report
!pytest test_it_support.py test_parameterized.py -v --html=report.html --self-contained-html

The report is saved as `report.html`. You can download it from Colab and open it in your browser to see:
- Summary of passed/failed tests
- Execution time for each test
- Detailed error messages for failures
- Environment information

This is perfect for sharing test results with your team! 📊

## 7. Exercises 🎓

Now it's your turn! Complete these exercises to practice what you've learned.

### Exercise 1: Convert Manual Tests to Automated Tests

Think of 3 IT support questions you manually tested in ChatGPT. Convert them to automated pytest functions.

**Example topics:**
- What causes a "DNS not found" error?
- How do you reset a Windows password?
- What's the difference between RAM and storage?

Write your tests below:

In [None]:
# Exercise 1: Write your 3 tests here

def test_your_question_1():
    """
    TODO: Test your first IT support question
    """
    pass  # Replace with your test code

def test_your_question_2():
    """
    TODO: Test your second IT support question
    """
    pass  # Replace with your test code

def test_your_question_3():
    """
    TODO: Test your third IT support question
    """
    pass  # Replace with your test code

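# Run your tests by calling them directly in the notebook, e.g.:
# test_your_question_1()
# pytest only collects tests from files, so to use `!pytest -v`,
# first save this cell to a file with %%writefile (as in Section 3).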

### Exercise 2: Parameterized Test for Common IT Issues

Create a parameterized test that checks if the LLM provides appropriate solutions for 5 common IT problems.

**Requirements:**
- Use `@pytest.mark.parametrize`
- Test at least 5 different IT issues
- Check that solutions contain relevant keywords

In [None]:
%%writefile exercise_2.py
# Exercise 2: Your parameterized test here

import os
from openai import OpenAI
import pytest

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def ask_llm(prompt: str) -> str:
    response = client.responses.create(
        model="gpt-5-nano",
        input=prompt
    )
    return response.output_text

# TODO: Write your parameterized test here
# @pytest.mark.parametrize(...)
# def test_it_solutions(...):
#     pass

In [None]:
# Run your parameterized test
# !pytest exercise_2.py -v

### Exercise 3: Pydantic Model for Incident Report

Create a Pydantic model for an `IncidentReport` and test if the LLM can extract it correctly.

**Required fields:**
- `incident_id`: string
- `reporter`: string
- `timestamp`: string
- `severity`: "MINOR" | "MAJOR" | "CRITICAL"
- `affected_systems`: list of strings
- `description`: string
- `resolution_time_estimate`: string

In [None]:
# Exercise 3: Define your Pydantic model and test

from pydantic import BaseModel, Field
from typing import List, Literal
import json

# TODO: Define IncidentReport model
class IncidentReport(BaseModel):
    pass  # Add your fields here

# TODO: Write a test that:
# 1. Provides incident text to the LLM
# 2. Asks LLM to extract as JSON
# 3. Validates with your IncidentReport model

def test_incident_extraction():
    pass  # Your test code here

### Exercise 4: Consistency Testing

Test if the LLM gives consistent troubleshooting advice across 3 runs.

**Requirements:**
- Pick an IT issue (e.g., "Computer won't start")
- Ask for troubleshooting steps 3 times
- Check that all 3 responses mention the same key concepts
- Hint: Define a set of expected keywords and check they appear in all responses

In [None]:
# Exercise 4: Consistency test

def test_troubleshooting_consistency():
    """
    TODO: Test consistency of troubleshooting advice
    """
    pass  # Your test code here

### Exercise 5: Intentionally Failing Tests

Write 3 tests that will FAIL because they test edge cases or difficult scenarios.

**Example edge cases:**
- Ambiguous ticket descriptions
- Mixed priority signals in text
- Tickets with missing information
- Very technical jargon

Then, improve your prompts to make the tests pass!

In [None]:
# Exercise 5: Write failing tests, then fix them

def test_edge_case_1():
    """
    TODO: Test an edge case that initially fails
    """
    pass

def test_edge_case_2():
    """
    TODO: Test another edge case
    """
    pass

def test_edge_case_3():
    """
    TODO: Test a third edge case
    """
    pass

# Step 1: Run and watch them fail
# Step 2: Improve your prompts
# Step 3: Run again and see them pass!

## 8. Best Practices for LLM Testing

### ✅ DO:

1. **Write clear, specific prompts** - Reduces ambiguity
2. **Test one thing at a time** - Easier to debug failures
3. **Use descriptive test names** - `test_ticket_severity()` not `test_1()`
4. **Add helpful assertion messages** - Explain what you expected
5. **Test edge cases** - Empty inputs, very long inputs, ambiguous cases
6. **Use Pydantic for structured outputs** - Automatic validation
7. **Group related tests** - Use separate test files for different features
8. **Check for key concepts, not exact strings** - LLMs may phrase answers differently

### ❌ DON'T:

1. **Don't expect exact string matches** - LLMs vary in phrasing
2. **Don't test creative tasks too strictly** - Allow for variation
3. **Don't ignore flaky tests** - Investigate and fix them
4. **Don't test too many things in one test** - Keep tests focused
5. **Don't forget to test error cases** - What if API fails?
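
As an illustration of the last point, error handling can be tested without a real outage by swapping in a fake that fails on purpose. The sketch below is one possible approach under stated assumptions: `ask_with_retry` and `FlakyLLM` are hypothetical names, and a real wrapper would likely catch narrower exception types and wait between attempts.

```python
from typing import Callable, Optional


def ask_with_retry(ask: Callable[[str], str], prompt: str, attempts: int = 3) -> str:
    """Call `ask(prompt)`, retrying on any exception up to `attempts` times."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return ask(prompt)
        except Exception as exc:  # assumption: any error is treated as retryable
            last_error = exc
    raise RuntimeError(f"LLM call failed after {attempts} attempts") from last_error


class FlakyLLM:
    """Hypothetical stand-in for the API: fails a set number of times, then answers."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated outage")
        return "443"


def test_recovers_after_transient_failures():
    llm = FlakyLLM(failures=2)
    assert ask_with_retry(llm, "What port does HTTPS use?") == "443"
    assert llm.calls == 3  # two failures, then one success


def test_gives_up_when_api_stays_down():
    llm = FlakyLLM(failures=10)
    try:
        ask_with_retry(llm, "What port does HTTPS use?", attempts=2)
    except RuntimeError:
        pass  # expected: the wrapper reports the failure after exhausting retries
    else:
        raise AssertionError("expected RuntimeError")
    assert llm.calls == 2
```

Because the fake is passed in as an argument, these tests cost nothing to run and behave the same on every run.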

### Testing Strategy

For each LLM feature, test:
1. **Happy path** - Normal, expected inputs
2. **Edge cases** - Unusual but valid inputs
3. **Error cases** - Invalid inputs
4. **Consistency** - Check for key concepts across multiple runs
5. **Performance** - Response time (if critical)
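
For the performance item, a rough sketch of a response-time check is shown below. The helper `timed`, the stand-in `fake_llm`, and the 2-second budget are all assumptions made for illustration. A real budget depends on the model and on network conditions.

```python
import time
from typing import Callable, Tuple


def timed(fn: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Call `fn(prompt)` and return its result along with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(prompt)
    return result, time.perf_counter() - start


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for ask_llm that simulates a short delay."""
    time.sleep(0.01)
    return "HIGH"


def test_response_time_within_budget():
    response, seconds = timed(fake_llm, "Classify this ticket's severity.")
    assert "HIGH" in response
    assert seconds < 2.0, f"Response took {seconds:.2f}s, budget is 2.0s"
```

Timing assertions against a live API tend to be flaky, so it may be worth keeping them in a separate test file from the correctness checks.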



## 9. Key Takeaways & Next Steps

### 🎉 What You've Learned

1. **Manual → Automated**: You transformed your ChatGPT testing workflow into automated Python tests
2. **Pytest basics**: Writing tests, running test suites, reading output
3. **Parameterization**: Testing multiple scenarios with minimal code
4. **Structured testing**: Using Pydantic to validate LLM outputs
5. **Best practices**: How to write maintainable, effective LLM tests


### 💡 Pro Tips

1. **Start small**: Begin with a few critical tests, then expand
2. **Run tests often**: Integrate into your development workflow
3. **Track test coverage**: Aim for 80%+ coverage of critical paths
4. **Share reports**: Use HTML reports to communicate with non-technical team members
5. **Iterate on prompts**: Use failing tests to improve your prompts

---

## 📝 Additional Resources

- [Pytest Documentation](https://docs.pytest.org/)
- [Pydantic Documentation](https://docs.pydantic.dev/)
- [OpenAI API Documentation](https://platform.openai.com/docs/)

