# Structured Outputs with Pydantic - Demo

This demo shows how to generate and validate structured JSON outputs from AI models, using Pydantic for type enforcement and fail-safe design. We'll process corporate healthcare data and use AI to generate insights.

## What You'll Learn
- The difference between naive string parsing and structured outputs
- How to create robust Pydantic models for data validation
- Using OpenAI's structured output capabilities for reliable AI analysis
- Handling complex data types and validation errors
- Real-world application: Structured health claims processing

## Business Scenario
A corporate health insurance provider processes employee health claims. We need to:
- Validate employee information
- Process health claims with proper categorization
- Generate structured summaries for analysis

In [1]:
# Setup and Imports
import json
import os
from datetime import datetime
from typing import List, Optional
from enum import Enum

from pydantic import BaseModel, Field, ValidationError
from dotenv import load_dotenv
from openai import OpenAI

# Load environment and initialize OpenAI client
load_dotenv()
client = OpenAI(
    base_url="https://openai.vocareum.com/v1",
    api_key=os.getenv("OPENAI_API_KEY")
)

print("✅ Environment setup complete!")
print("✅ Using Vocareum OpenAI endpoint")
print(f"✅ API key loaded: {'YES' if os.getenv('OPENAI_API_KEY') else 'NO'}")

✅ Environment setup complete!
✅ Using Vocareum OpenAI endpoint
✅ API key loaded: YES


In [2]:
# Define Pydantic Models - Type-Safe Data Structures

class EmployeeProfile(BaseModel):
    """A validated employee profile with type enforcement."""
    name: str = Field(..., min_length=1, max_length=100, description="Employee's full name")
    employee_id: str = Field(..., pattern=r'^E\d{6}$', description="Employee ID format: E followed by 6 digits")
    department: str = Field(..., min_length=1, max_length=50, description="Department name")
    email: str = Field(..., pattern=r'^[^@]+@[^@]+\.[^@]+$', description="Valid email address")
    tenure_years: int = Field(..., ge=0, le=60, description="Years with company")

class ClaimType(str, Enum):
    """Enumeration for health claim types."""
    MEDICAL = "medical"
    DENTAL = "dental"
    VISION = "vision"
    PREVENTIVE = "preventive"

class HealthClaim(BaseModel):
    """A validated health insurance claim record."""
    claim_id: str = Field(..., pattern=r'^C\d{8}$', description="Claim ID format: C followed by 8 digits")
    claim_type: ClaimType = Field(..., description="Type of health claim")
    amount: float = Field(..., gt=0, le=100000, description="Claim amount (must be positive, max $100k)")
    description: str = Field(..., min_length=10, max_length=500, description="Claim description")
    date_submitted: datetime = Field(..., description="Claim submission date")
    status: str = Field(..., pattern=r'^(pending|approved|denied)$', description="Claim status")

class ClaimsSummary(BaseModel):
    """Complete health claims summary for an employee."""
    employee: EmployeeProfile = Field(..., description="Employee information")
    claims: List[HealthClaim] = Field(..., description="List of health claims")
    total_claims_amount: float = Field(..., ge=0, description="Total amount of all claims")
    approved_claims_count: int = Field(..., ge=0, description="Number of approved claims")
    pending_claims_count: int = Field(..., ge=0, description="Number of pending claims")
    average_claim_value: float = Field(..., ge=0, description="Average claim value")

print("✅ Pydantic models defined with validation rules!")

✅ Pydantic models defined with validation rules!
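`ClaimsSummary` accepts `total_claims_amount` and the count fields independently of `claims`, so a summary whose totals disagree with its claims would still validate. One possible way to close that gap is a Pydantic `model_validator`. The sketch below uses trimmed, hypothetical models (`MiniClaim`, `CheckedSummary`) rather than the full ones above, and the 0.005 tolerance is an assumption.

```python
import math
from typing import List

from pydantic import BaseModel, Field, ValidationError, model_validator


class MiniClaim(BaseModel):
    """Trimmed, illustrative stand-in for HealthClaim."""
    amount: float = Field(..., gt=0)
    status: str = Field(..., pattern=r"^(pending|approved|denied)$")


class CheckedSummary(BaseModel):
    """Hypothetical variant of ClaimsSummary that cross-checks derived fields."""
    claims: List[MiniClaim]
    total_claims_amount: float = Field(..., ge=0)
    approved_claims_count: int = Field(..., ge=0)

    @model_validator(mode="after")
    def totals_match_claims(self) -> "CheckedSummary":
        # Recompute the derived values from the claims themselves.
        expected_total = sum(c.amount for c in self.claims)
        if not math.isclose(self.total_claims_amount, expected_total, abs_tol=0.005):
            raise ValueError(f"total_claims_amount should be {expected_total:.2f}")
        expected_approved = sum(1 for c in self.claims if c.status == "approved")
        if self.approved_claims_count != expected_approved:
            raise ValueError(f"approved_claims_count should be {expected_approved}")
        return self


claims = [
    {"amount": 2500.50, "status": "approved"},
    {"amount": 1200.00, "status": "approved"},
    {"amount": 350.00, "status": "pending"},
]

# Consistent totals validate normally.
ok = CheckedSummary(claims=claims, total_claims_amount=4050.50, approved_claims_count=2)

# A total that disagrees with the claims is rejected.
try:
    CheckedSummary(claims=claims, total_claims_amount=9999.0, approved_claims_count=2)
    inconsistent_rejected = False
except ValidationError as e:
    inconsistent_rejected = True
    print(e.errors()[0]["msg"])
```

This kind of check matters most when the summary numbers come from a model reply rather than from local arithmetic.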


In [3]:
# 1. Naive Approach - String Parsing (Problematic)

def naive_json_extraction():
    """Demonstrate the problems with naive JSON parsing from text responses."""
    
    prompt = """
    Create an employee profile in JSON format with:
    - name: John Smith
    - employee_id: E123456
    - department: Engineering
    - email: john.smith@company.com
    - tenure_years: 5
    
    Return only the JSON, no additional text.
    """
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1
    )
    
    raw_response = response.choices[0].message.content
    print("Raw response from model:")
    print(repr(raw_response))
    
    try:
        json_text = raw_response
        parsed_data = json.loads(json_text)
        print("\n✅ Successfully parsed JSON:")
        print(json.dumps(parsed_data, indent=2))
        return parsed_data
    except json.JSONDecodeError as e:
        print(f"\n❌ JSON parsing failed: {e}")
        return None

# Test the naive approach
print("Testing Naive Approach:")
print("=" * 60)
naive_result = naive_json_extraction()

Testing Naive Approach:
Raw response from model:
'```json\n{\n    "name": "John Smith",\n    "employee_id": "E123456",\n    "department": "Engineering",\n    "email": "john.smith@company.com",\n    "tenure_years": 5\n}\n```'

❌ JSON parsing failed: Expecting value: line 1 column 1 (char 0)
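The parse fails at character 0 because the reply starts with a Markdown code fence rather than `{`. A common workaround is to strip the fence before calling `json.loads`, roughly as sketched below. The helper name `extract_json` and the regular expression are illustrative assumptions, and this remains brittle: it only handles a reply that is entirely one fenced block or entirely bare JSON.

```python
import json
import re

# Matches a reply that is a single fenced block, optionally tagged "json".
FENCE_RE = re.compile(r"^```(?:json)?\s*\n(.*?)\n```\s*$", re.DOTALL)


def extract_json(text: str) -> dict:
    """Strip one surrounding Markdown code fence, if present, then parse."""
    stripped = text.strip()
    match = FENCE_RE.match(stripped)
    if match:
        stripped = match.group(1)
    return json.loads(stripped)


# Shortened version of the raw response shown above.
raw = '```json\n{\n    "name": "John Smith",\n    "tenure_years": 5\n}\n```'
parsed = extract_json(raw)
print(parsed)
```

Even when this succeeds, nothing checks that the fields or their types are what the application expects.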


In [4]:
# 2. Structured Output Approach - Reliable and Type-Safe

def generate_employee_profile_structured() -> EmployeeProfile:
    """Generate an employee profile using OpenAI's JSON mode, then validate it with Pydantic."""
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system", 
                "content": "You are a helpful HR system that generates employee profiles in JSON format."
            },
            {
                "role": "user", 
                "content": """Create an employee profile for a senior software engineer named Sarah Johnson, 
                employee ID E345678, from the Engineering department, email sarah.johnson@company.com, 
                with 8 years tenure.
                
                Return ONLY a flat JSON object with these exact fields:
                - name: string
                - employee_id: string (format: E followed by 6 digits)
                - department: string
                - email: string
                - tenure_years: number
                """
            }
        ],
        response_format={"type": "json_object"},
        temperature=0.7
    )
    
    try:
        raw_json = response.choices[0].message.content
        profile_data = json.loads(raw_json)
        
        # Validate and create the Pydantic model
        employee_profile = EmployeeProfile(**profile_data)
        return employee_profile
    
    except ValidationError as e:
        print(f"❌ Validation error: {e}")
        raise
    except json.JSONDecodeError as e:
        print(f"❌ JSON parsing error: {e}")
        raise

# Test structured output
print()
print("Testing Structured Output Approach:")
print("=" * 60)
try:
    structured_profile = generate_employee_profile_structured()
    print("✅ Successfully generated structured employee profile:")
    print(structured_profile.model_dump_json(indent=2))
except Exception as e:
    print(f"Error: {e}")


Testing Structured Output Approach:
✅ Successfully generated structured employee profile:
{
  "name": "Sarah Johnson",
  "employee_id": "E345678",
  "department": "Engineering",
  "email": "sarah.johnson@company.com",
  "tenure_years": 8
}
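`response_format={"type": "json_object"}` asks the API for syntactically valid JSON, but it does not by itself enforce field names or types, which is why the Pydantic step does the real checking. As a minimal sketch, the `json.loads` plus `EmployeeProfile(**profile_data)` pair can be collapsed into `model_validate_json`, and the model can emit its own JSON Schema, which could then be pasted into the prompt. The `raw_json` string below is a hand-written stand-in for the API reply, so no network call is made.

```python
from pydantic import BaseModel, Field, ValidationError


class EmployeeProfile(BaseModel):
    name: str = Field(..., min_length=1, max_length=100)
    employee_id: str = Field(..., pattern=r"^E\d{6}$")
    department: str = Field(..., min_length=1, max_length=50)
    email: str = Field(..., pattern=r"^[^@]+@[^@]+\.[^@]+$")
    tenure_years: int = Field(..., ge=0, le=60)


# Stand-in for response.choices[0].message.content (assumption: no API call here).
raw_json = (
    '{"name": "Sarah Johnson", "employee_id": "E345678", '
    '"department": "Engineering", "email": "sarah.johnson@company.com", '
    '"tenure_years": 8}'
)

# Parse and validate in a single step.
profile = EmployeeProfile.model_validate_json(raw_json)
print(profile.employee_id)

# The model can describe itself as JSON Schema, constraints included.
schema = EmployeeProfile.model_json_schema()
print(schema["required"])
print(schema["properties"]["tenure_years"])

# Malformed JSON is reported as a ValidationError as well.
try:
    EmployeeProfile.model_validate_json("not json")
    malformed_rejected = False
except ValidationError:
    malformed_rejected = True
```

With this form, a single `except ValidationError` covers both malformed JSON and schema violations.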


In [5]:
# 3. Complex Data Types - Creating a Complete Claims Summary

def create_sample_claims_summary():
    """Create a complete claims summary with multiple claims for validation."""
    
    # Create employee profile
    employee = EmployeeProfile(
        name="Michael Chen",
        employee_id="E987654",
        department="Finance",
        email="michael.chen@company.com",
        tenure_years=6
    )
    
    # Create sample claims
    claims = [
        HealthClaim(
            claim_id="C12345678",
            claim_type=ClaimType.MEDICAL,
            amount=2500.50,
            description="Annual physical examination and preventive screening at primary care facility",
            date_submitted=datetime(2024, 1, 15),
            status="approved"
        ),
        HealthClaim(
            claim_id="C87654321",
            claim_type=ClaimType.DENTAL,
            amount=1200.00,
            description="Root canal treatment and dental restoration completed by certified dentist",
            date_submitted=datetime(2024, 2, 20),
            status="approved"
        ),
        HealthClaim(
            claim_id="C55555555",
            claim_type=ClaimType.VISION,
            amount=350.00,
            description="New eyeglasses and comprehensive eye examination at optical center",
            date_submitted=datetime(2024, 3, 10),
            status="pending"
        ),
    ]
    
    # Calculate summary statistics
    total_amount = sum(claim.amount for claim in claims)
    approved_count = sum(1 for claim in claims if claim.status == "approved")
    pending_count = sum(1 for claim in claims if claim.status == "pending")
    average_value = total_amount / len(claims) if claims else 0
    
    # Create summary with validation
    try:
        summary = ClaimsSummary(
            employee=employee,
            claims=claims,
            total_claims_amount=total_amount,
            approved_claims_count=approved_count,
            pending_claims_count=pending_count,
            average_claim_value=average_value
        )
        return summary
    except ValidationError as e:
        print(f"❌ Validation error: {e}")
        return None

# Create and display sample summary
print()
print("Creating Complex Data Structure with Validation:")
print("=" * 60)
summary = create_sample_claims_summary()
if summary:
    print("✅ Successfully created validated claims summary:")
    print(summary.model_dump_json(indent=2))


Creating Complex Data Structure with Validation:
✅ Successfully created validated claims summary:
{
  "employee": {
    "name": "Michael Chen",
    "employee_id": "E987654",
    "department": "Finance",
    "email": "michael.chen@company.com",
    "tenure_years": 6
  },
  "claims": [
    {
      "claim_id": "C12345678",
      "claim_type": "medical",
      "amount": 2500.5,
      "description": "Annual physical examination and preventive screening at primary care facility",
      "date_submitted": "2024-01-15T00:00:00",
      "status": "approved"
    },
    {
      "claim_id": "C87654321",
      "claim_type": "dental",
      "amount": 1200.0,
      "description": "Root canal treatment and dental restoration completed by certified dentist",
      "date_submitted": "2024-02-20T00:00:00",
      "status": "approved"
    },
    {
      "claim_id": "C55555555",
      "claim_type": "vision",
      "amount": 350.0,
      "description": "New eyeglasses and comprehensive eye examination at opti

In [6]:
# 4. AI-Generated Structured Analysis

def generate_claims_analysis():
    """Use AI to generate a free-text analysis of the validated claims data."""
    
    # Create a claims summary
    summary = create_sample_claims_summary()
    
    if not summary:
        print("Failed to create claims summary")
        return
    
    prompt = f"""
    Analyze the following employee health claims data and provide:
    1. A brief summary of the employee's claim history
    2. Key observations about claim types and amounts
    3. Status of pending claims
    4. Recommendations for the HR team
    
    Employee: {summary.employee.name} ({summary.employee.employee_id})
    Department: {summary.employee.department}
    Tenure: {summary.employee.tenure_years} years
    
    Claims Summary:
    - Total Claims Amount: ${summary.total_claims_amount:,.2f}
    - Approved Claims: {summary.approved_claims_count}
    - Pending Claims: {summary.pending_claims_count}
    - Average Claim Value: ${summary.average_claim_value:,.2f}
    
    Claims Details:
    """
    
    for claim in summary.claims:
        prompt += f"\n    - {claim.claim_type.value.upper()}: ${claim.amount:,.2f} ({claim.status})"
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an HR analytics expert providing insights on employee health claims."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7
    )
    
    return response.choices[0].message.content

# Generate and display analysis
print()
print("AI-Generated Claims Analysis:")
print("=" * 60)
analysis = generate_claims_analysis()
if analysis:
    print(analysis)


AI-Generated Claims Analysis:
### 1. Brief Summary of the Employee's Claim History:
Michael Chen, an employee in the Finance department with a tenure of 6 years, has submitted a total of three health claims to date. The total amount claimed is $4,050.50. Out of these, two claims have been approved, totaling $3,700.50, while one claim is currently pending review.

### 2. Key Observations about Claim Types and Amounts:
- **Claim Types**: Michael has submitted claims in three categories: Medical, Dental, and Vision.
  - **Medical Claim**: $2,500.50 (approved) - This is the highest claim amount and represents a significant portion of the total claims.
  - **Dental Claim**: $1,200.00 (approved) - This is the second-largest claim and also fully approved.
  - **Vision Claim**: $350.00 (pending) - This claim is currently under review and is the smallest in value.

- **Total Claims Amount**: $4,050.50 indicates that health-related expenses for Michael have been relatively high, particularly du
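The analysis above comes back as free-form Markdown, so downstream code cannot rely on its shape. If a structured analysis were wanted, one option would be to request JSON and validate it against a model such as the one sketched here. `ClaimsAnalysis` and its field names are hypothetical and not part of the original demo, and `raw_reply` is a hand-written stand-in for a model reply built from the figures shown above.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class ClaimsAnalysis(BaseModel):
    """Hypothetical schema mirroring the four items requested in the prompt."""
    summary: str = Field(..., min_length=10)
    observations: List[str] = Field(..., min_length=1)
    pending_status: str
    recommendations: List[str] = Field(..., min_length=1)


# Hand-written stand-in for a JSON reply (assumption: no API call here).
raw_reply = """
{
  "summary": "Three claims totaling $4,050.50; two approved, one pending.",
  "observations": ["The medical claim is the largest at $2,500.50."],
  "pending_status": "One vision claim for $350.00 is under review.",
  "recommendations": ["Follow up on the pending vision claim."]
}
"""

analysis = ClaimsAnalysis.model_validate_json(raw_reply)
for item in analysis.recommendations:
    print("-", item)

# A reply that omits sections is rejected instead of silently accepted.
try:
    ClaimsAnalysis.model_validate_json('{"summary": "Too little structure here."}')
    missing = []
except ValidationError as e:
    missing = [err["loc"][0] for err in e.errors()]
    print("Missing fields:", missing)
```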

In [7]:
# 5. Error Handling - Demonstrating Validation Failures

def demonstrate_validation_error():
    """Show how Pydantic catches invalid data before it causes problems."""
    
    print("Attempting to create employee with invalid data:")
    print()
    
    invalid_data = {
        "name": "Alex Turner",
        "employee_id": "INVALID123",  # Wrong format, should be E######
        "department": "Operations",
        "email": "not-an-email",  # Invalid email
        "tenure_years": 150  # Exceeds maximum of 60
    }
    
    try:
        EmployeeProfile(**invalid_data)
    except ValidationError as e:
        print("❌ Validation caught multiple errors:")
        print()
        for error in e.errors():
            print(f"  Field: {error['loc'][0]}")
            print(f"  Error: {error['msg']}")
            print(f"  Value: {error['input']}")
            print()
        print("✅ Data validation prevented invalid data from entering the system!")
    else:
        # Only reached if the invalid data was unexpectedly accepted
        print("⚠️ Unexpected: invalid data passed validation.")

demonstrate_validation_error()

Attempting to create employee with invalid data:

❌ Validation caught multiple errors:

  Field: employee_id
  Error: String should match pattern '^E\d{6}$'
  Value: INVALID123

  Field: email
  Error: String should match pattern '^[^@]+@[^@]+\.[^@]+$'
  Value: not-an-email

  Field: tenure_years
  Error: Input should be less than or equal to 60
  Value: 150

✅ Data validation prevented invalid data from entering the system!
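For a fail-safe pipeline it can help to turn these errors into data rather than printed text, so the caller can log them, reject the record, or feed them back to the model in a retry prompt. The helper below is a minimal sketch under that assumption. `try_parse_profile` is a hypothetical name, and the model is a trimmed copy of `EmployeeProfile`.

```python
from typing import List, Optional, Tuple

from pydantic import BaseModel, Field, ValidationError


class EmployeeProfile(BaseModel):
    """Trimmed copy of the demo model, kept short for illustration."""
    employee_id: str = Field(..., pattern=r"^E\d{6}$")
    email: str = Field(..., pattern=r"^[^@]+@[^@]+\.[^@]+$")
    tenure_years: int = Field(..., ge=0, le=60)


def try_parse_profile(data: dict) -> Tuple[Optional[EmployeeProfile], List[str]]:
    """Return (profile, []) on success or (None, error messages) on failure."""
    try:
        return EmployeeProfile.model_validate(data), []
    except ValidationError as e:
        # One "field: message" string per failed constraint.
        return None, [f"{err['loc'][0]}: {err['msg']}" for err in e.errors()]


profile, problems = try_parse_profile(
    {"employee_id": "INVALID123", "email": "not-an-email", "tenure_years": 150}
)
for line in problems:
    print(line)

good_profile, no_problems = try_parse_profile(
    {"employee_id": "E123456", "email": "alex.turner@company.com", "tenure_years": 3}
)
```

Returning the messages keeps the decision about what to do next (retry, reject, alert) with the caller.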
