# Structured Outputs with Pydantic - Solution

This notebook demonstrates how to generate and validate structured JSON outputs from AI models using Pydantic for type enforcement and fail-safe design. We'll process real financial data from CSV files and use AI to generate insights.

## What You'll Learn
- The difference between naive string parsing and structured outputs
- How to create robust Pydantic models for data validation
- Processing CSV data with automatic type validation
- Using OpenAI's JSON mode (`response_format`) together with Pydantic validation for reliable AI analysis
- Handling complex data types and validation errors

## Key Benefits of Structured Outputs
- **Type Safety**: Automatic validation of data types
- **Reliability**: Responses arrive as valid JSON and are checked against a schema before use
- **Maintainability**: Clear data schemas
- **Error Handling**: Graceful failure recovery
- **Real-world Application**: CSV processing + AI analysis

## Sample Data
This notebook uses `sample_financial_data.csv` with realistic financial transactions for 3 users:
- **John Smith** (Software Engineer, 35 years old)
- **Sarah Chen** (Tech Startup Employee, 28 years old)
- **Michael Johnson** (Senior Manager, 42 years old)

In [2]:
# Setup and Imports
import json
import os
import csv
import pandas as pd
from datetime import datetime
from typing import List, Optional, Dict, Any
from enum import Enum
from collections import defaultdict

from pydantic import BaseModel, Field, ValidationError
from dotenv import load_dotenv
from openai import OpenAI

# Load environment and initialize OpenAI client
load_dotenv()
client = OpenAI(
    base_url="https://openai.vocareum.com/v1",
    api_key=os.getenv("OPENAI_API_KEY")
)

print("✅ Environment setup complete!")

✅ Environment setup complete!


In [3]:
# 1. Naive Approach - String Parsing (Problematic)

def naive_json_extraction():
    """Demonstrate the problems with naive JSON parsing from text responses."""
    
    prompt = """
    Create a user profile with the following information in JSON format:
    - name: John Doe
    - age: 30
    - email: john@example.com
    - is_active: true
    
    Return only the JSON, no additional text.
    """
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1
    )
    
    # This approach is fragile - what if the model adds extra text?
    raw_response = response.choices[0].message.content
    print("Raw response from model:")
    print(repr(raw_response))
    
    try:
        # Attempt to parse JSON from the raw text as-is.
        # Note: no cleanup is applied here, so a markdown code fence around the JSON makes json.loads fail
        json_text = raw_response
        
        parsed_data = json.loads(json_text)
        print("\n✅ Successfully parsed JSON (after handling markdown):")
        print(json.dumps(parsed_data, indent=2))
        return parsed_data
    except json.JSONDecodeError as e:
        print(f"\n❌ JSON parsing failed even after cleaning: {e}")
        print(f"Attempted to parse: {repr(json_text)}")
        return None

# Test the naive approach
naive_result = naive_json_extraction()

Raw response from model:
'```json\n{\n    "name": "John Doe",\n    "age": 30,\n    "email": "john@example.com",\n    "is_active": true\n}\n```'

❌ JSON parsing failed even after cleaning: Expecting value: line 1 column 1 (char 0)
Attempted to parse: '```json\n{\n    "name": "John Doe",\n    "age": 30,\n    "email": "john@example.com",\n    "is_active": true\n}\n```'
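
The failure above comes from the markdown code fence wrapped around the JSON. As a minimal sketch of the cleanup step the naive approach would need, the hypothetical helper `strip_code_fence` below (not part of the original notebook) removes a single surrounding fence before parsing. It assumes the whole response is one fenced block; any other extra text would still break it, which is the fragility this section is illustrating.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example embeds cleanly

# Opening fence with an optional language tag, the payload, then the closing fence.
FENCE_RE = re.compile(
    rf"^{FENCE}[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?{FENCE}$",
    re.DOTALL,
)


def strip_code_fence(text: str) -> str:
    """Return the text inside a single markdown code fence, or the text unchanged."""
    stripped = text.strip()
    match = FENCE_RE.match(stripped)
    return match.group(1).strip() if match else stripped


# Same shape as the raw response shown above.
raw = (
    FENCE
    + 'json\n{\n    "name": "John Doe",\n    "age": 30,\n'
    + '    "email": "john@example.com",\n    "is_active": true\n}\n'
    + FENCE
)

parsed = json.loads(strip_code_fence(raw))
print(parsed["name"], parsed["age"], parsed["is_active"])
```

Even with this helper, nothing checks that `age` is a number or that `email` looks like an email, which is where the Pydantic models in the next cell come in.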


In [4]:
# 2. Define Pydantic Models - Type-Safe Data Structures

class UserProfile(BaseModel):
    """A validated user profile with type enforcement."""
    name: str = Field(..., min_length=1, max_length=100, description="User's full name")
    age: int = Field(..., ge=0, le=150, description="User's age in years")
    email: str = Field(..., pattern=r'^[^@]+@[^@]+\.[^@]+$', description="Valid email address")
    is_active: bool = Field(default=True, description="Whether the user account is active")
    created_at: Optional[datetime] = Field(default=None, description="Account creation timestamp")

class TransactionType(str, Enum):
    """Enumeration for transaction types."""
    INCOME = "income"
    EXPENSE = "expense"
    TRANSFER = "transfer"

class FinancialTransaction(BaseModel):
    """A validated financial transaction record."""
    amount: float = Field(..., gt=0, description="Transaction amount (must be positive)")
    transaction_type: TransactionType = Field(..., description="Type of transaction")
    description: str = Field(..., min_length=1, max_length=200, description="Transaction description")
    date: datetime = Field(..., description="Transaction date and time")
    category: str = Field(..., min_length=1, description="Transaction category")

class FinancialSummary(BaseModel):
    """A complete financial summary with multiple transactions."""
    user: UserProfile = Field(..., description="User information")
    transactions: List[FinancialTransaction] = Field(..., description="List of transactions")
    total_income: float = Field(..., ge=0, description="Total income amount")
    total_expenses: float = Field(..., ge=0, description="Total expenses amount")
    net_balance: float = Field(..., description="Net balance (income - expenses)")

print("✅ Pydantic models defined with validation rules!")

✅ Pydantic models defined with validation rules!
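
To see what these validation rules do in practice, here is a small sketch assuming Pydantic v2. `Profile` is a hypothetical, trimmed-down stand-in for `UserProfile`, and `collect_errors` is an illustrative helper that reports which fields failed and why.

```python
from typing import List, Tuple

from pydantic import BaseModel, Field, ValidationError


class Profile(BaseModel):
    """Hypothetical, trimmed-down stand-in for UserProfile."""
    name: str = Field(..., min_length=1)
    age: int = Field(..., ge=0, le=150)


def collect_errors(payload: dict) -> List[Tuple[str, str]]:
    """Return (field, error type) pairs, or an empty list if the payload is valid."""
    try:
        Profile(**payload)
        return []
    except ValidationError as exc:
        # Each entry in exc.errors() carries the field location and an error type.
        return [
            (".".join(str(part) for part in err["loc"]), err["type"])
            for err in exc.errors()
        ]


print(collect_errors({"name": "Sarah Chen", "age": 28}))  # valid payload
print(collect_errors({"name": "", "age": 200}))           # both fields violate a constraint
```

Pydantic reports every failing field in one pass rather than stopping at the first problem, which makes bad rows or bad model responses easier to diagnose.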


In [5]:
# 3. Structured Output Approach - Reliable and Type-Safe

def generate_user_profile_structured() -> UserProfile:
    """Generate a user profile using OpenAI's JSON mode, then validate it with Pydantic."""
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system", 
                "content": "You are a helpful assistant that generates realistic user profiles in JSON format."
            },
            {
                "role": "user", 
                "content": """Create a user profile for a software engineer named Sarah Chen, age 28, who works at a tech startup. Her email is sarah.chen@example.com. 

Return ONLY a flat JSON object with these exact fields:
- name: string
- age: number  
- email: string
- is_active: boolean
- created_at: ISO datetime string or null
"""
            }
        ],
        response_format={"type": "json_object"},
        temperature=0.7
    )
    
    # Parse the structured response directly into our Pydantic model
    try:
        raw_json = response.choices[0].message.content
        profile_data = json.loads(raw_json)
        
        # Validate and create the Pydantic model
        user_profile = UserProfile(**profile_data)
        return user_profile
    
    except ValidationError as e:
        print(f"❌ Validation error: {e}")
        raise
    except json.JSONDecodeError as e:
        print(f"❌ JSON parsing error: {e}")
        raise

# Test structured output
try:
    structured_profile = generate_user_profile_structured()
    print("✅ Successfully generated structured user profile:")
    print(structured_profile.model_dump_json(indent=2))
except Exception as e:
    print(f"Error: {e}")

✅ Successfully generated structured user profile:
{
  "name": "Sarah Chen",
  "age": 28,
  "email": "sarah.chen@example.com",
  "is_active": true,
  "created_at": "2023-10-01T12:34:56Z"
}
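
JSON mode only promises syntactically valid JSON, so the schema check is still Pydantic's job. The sketch below shows one fail-safe variant of the parsing step, assuming Pydantic v2: `model_validate_json` parses and validates in a single call, and the hypothetical `parse_profile` helper returns `None` instead of raising. `Profile` is again an illustrative stand-in for `UserProfile`.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Profile(BaseModel):
    """Hypothetical stand-in for UserProfile with a reduced set of fields."""
    name: str = Field(..., min_length=1)
    age: int = Field(..., ge=0, le=150)
    is_active: bool = True


def parse_profile(raw_json: str) -> Optional[Profile]:
    """Parse and validate in one step; return None instead of raising."""
    try:
        return Profile.model_validate_json(raw_json)
    except ValidationError as exc:
        # In Pydantic v2, malformed JSON is also reported as a ValidationError here.
        print(f"Rejected payload with {exc.error_count()} validation error(s)")
        return None


good = parse_profile('{"name": "Sarah Chen", "age": 28}')
bad_schema = parse_profile('{"name": "Sarah Chen", "age": -1}')
bad_json = parse_profile("not json at all")
print(good, bad_schema, bad_json)
```

Whether to return `None`, retry the request, or re-raise depends on the caller; the notebook's own function re-raises, which is a reasonable default while developing.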


In [6]:
# 4. Complex Data Types - Processing CSV Data with Pydantic Validation

def load_and_process_financial_csv(csv_path: str = "sample_financial_data.csv") -> Dict[str, FinancialSummary]:
    """Load financial data from CSV and create validated FinancialSummary objects for each user."""
    
    # Read CSV data
    try:
        df = pd.read_csv(csv_path)
        print(f"✅ Loaded {len(df)} transactions from CSV")
        print("📊 Sample data:")
        print(df.head(3).to_string(index=False))
    except FileNotFoundError:
        print(f"❌ CSV file not found: {csv_path}")
        return {}
    
    # Group transactions by user
    user_summaries = {}
    user_groups = df.groupby(['user_name', 'user_age', 'user_email', 'user_active'])
    
    for (name, age, email, is_active), group in user_groups:
        try:
            # Create user profile
            user_profile = UserProfile(
                name=name,
                age=int(age),
                email=email,
                is_active=bool(is_active),
                created_at=datetime.now()
            )
            
            # Process transactions for this user
            transactions = []
            total_income = 0.0
            total_expenses = 0.0
            
            for _, row in group.iterrows():
                # Create transaction with validation
                transaction = FinancialTransaction(
                    amount=float(row['amount']),
                    transaction_type=TransactionType(row['transaction_type']),
                    description=row['description'],
                    date=datetime.fromisoformat(row['date']),
                    category=row['category']
                )
                transactions.append(transaction)
                
                # Calculate totals
                if transaction.transaction_type == TransactionType.INCOME:
                    total_income += transaction.amount
                elif transaction.transaction_type == TransactionType.EXPENSE:
                    total_expenses += transaction.amount
            
            # Create financial summary
            net_balance = total_income - total_expenses
            summary = FinancialSummary(
                user=user_profile,
                transactions=transactions,
                total_income=total_income,
                total_expenses=total_expenses,
                net_balance=net_balance
            )
            
            user_summaries[name] = summary
            
        except ValidationError as e:
            print(f"❌ Validation error for user {name}: {e}")
        except Exception as e:
            print(f"❌ Error processing user {name}: {e}")
    
    return user_summaries

# Load and process the CSV data
try:
    financial_summaries = load_and_process_financial_csv()
    
    print(f"\n✅ Successfully processed {len(financial_summaries)} users")
    
    # Display results for each user
    for user_name, summary in financial_summaries.items():
        print(f"\n📊 Financial Summary for {user_name}:")
        print(f"   Email: {summary.user.email}")
        print(f"   Age: {summary.user.age}")
        print(f"   Transactions: {len(summary.transactions)}")
        print(f"   Total Income: ${summary.total_income:,.2f}")
        print(f"   Total Expenses: ${summary.total_expenses:,.2f}")
        print(f"   Net Balance: ${summary.net_balance:,.2f}")
        
        # Show transaction breakdown by category
        category_totals = defaultdict(float)
        for transaction in summary.transactions:
            if transaction.transaction_type == TransactionType.EXPENSE:
                category_totals[transaction.category] += transaction.amount
        
        if category_totals:
            print(f"   Top Expense Categories:")
            for category, amount in sorted(category_totals.items(), key=lambda x: x[1], reverse=True)[:3]:
                print(f"     • {category}: ${amount:,.2f}")
        
        # Export one user's data as JSON to show structure
        if user_name == "John Smith":
            print(f"\n🔍 Sample JSON structure for {user_name}:")
            print(summary.model_dump_json(indent=2)[:500] + "...")
            
except Exception as e:
    print(f"Error: {e}")

✅ Loaded 19 transactions from CSV
📊 Sample data:
 user_name  user_age           user_email  user_active  amount transaction_type              description                date  category
John Smith        35 john.smith@email.com         True  5000.0           income Software Engineer Salary 2024-08-01T09:00:00    salary
John Smith        35 john.smith@email.com         True  1500.0          expense     Monthly Rent Payment 2024-08-01T10:30:00   housing
John Smith        35 john.smith@email.com         True   120.5          expense         Grocery Shopping 2024-08-02T14:15:00 groceries

✅ Successfully processed 3 users

📊 Financial Summary for John Smith:
   Email: john.smith@email.com
   Age: 35
   Transactions: 7
   Total Income: $5,000.00
   Total Expenses: $1,826.55
   Net Balance: $3,173.45
   Top Expense Categories:
     • housing: $1,500.00
     • groceries: $120.50
     • utilities: $85.30

🔍 Sample JSON structure for John Smith:
{
  "user": {
    "name": "John S

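One detail in the CSV cell is worth a closer look: `bool(is_active)` works here because pandas has already parsed the `user_active` column into real booleans. If a flag ever arrives as text, `bool("False")` evaluates to `True`, since any non-empty string is truthy. The sketch below uses a hypothetical `parse_flag` helper (not part of the notebook) and an assumed set of accepted spellings to show a stricter conversion.

```python
# Assumed spellings; adjust to whatever the real data source emits.
TRUE_VALUES = {"true", "1", "yes", "y"}
FALSE_VALUES = {"false", "0", "no", "n"}


def parse_flag(value) -> bool:
    """Convert a CSV cell to bool, rejecting anything ambiguous."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    raise ValueError(f"Cannot interpret {value!r} as a boolean flag")


print(bool("False"))        # the pitfall: a non-empty string is truthy
print(parse_flag("False"))  # the stricter helper reads the text instead
```

Raising on unknown values keeps bad rows visible instead of silently marking an account as active.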
In [7]:
# 5. AI-Enhanced Data Analysis with Structured Outputs

def analyze_financial_data_with_ai(user_summary: FinancialSummary) -> Dict[str, Any]:
    """Use AI to generate insights about financial data, returned as a parsed JSON dict (JSON mode only, no schema validation)."""
    
    # Create a summary of the user's financial data
    data_summary = {
        "user_profile": {
            "name": user_summary.user.name,
            "age": user_summary.user.age,
            "email": user_summary.user.email
        },
        "financial_overview": {
            "total_income": user_summary.total_income,
            "total_expenses": user_summary.total_expenses,
            "net_balance": user_summary.net_balance,
            "transaction_count": len(user_summary.transactions)
        },
        "expense_categories": {}
    }
    
    # Calculate category breakdown
    category_totals = defaultdict(float)
    for transaction in user_summary.transactions:
        if transaction.transaction_type == TransactionType.EXPENSE:
            category_totals[transaction.category] += transaction.amount
    
    data_summary["expense_categories"] = dict(category_totals)
    
    prompt = f"""
    Analyze this financial data and provide insights in JSON format:
    
    {json.dumps(data_summary, indent=2)}
    
    Provide analysis in this JSON structure:
    {{
        "financial_health_score": <number 1-10>,
        "key_insights": ["insight1", "insight2", "insight3"],
        "spending_patterns": ["pattern1", "pattern2"],
        "recommendations": ["recommendation1", "recommendation2", "recommendation3"],
        "budget_categories": {{
            "highest_expense": "category_name",
            "potential_savings": "category_name"
        }}
    }}
    """
    
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "You are a financial advisor AI. Analyze financial data and provide insights in JSON format."
                },
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"},
            temperature=0.3
        )
        
        analysis = json.loads(response.choices[0].message.content)
        return analysis
        
    except Exception as e:
        print(f"❌ Error generating AI analysis: {e}")
        return {}

# Demonstrate AI analysis for each user
if financial_summaries:
    print("\n🤖 AI-Powered Financial Analysis:")
    print("=" * 50)
    
    for user_name, summary in financial_summaries.items():
        print(f"\n📊 Analysis for {user_name}:")
        
        analysis = analyze_financial_data_with_ai(summary)
        if analysis:
            print(f"💯 Financial Health Score: {analysis.get('financial_health_score', 'N/A')}/10")
            
            print("\n🔍 Key Insights:")
            for insight in analysis.get('key_insights', []):
                print(f"   • {insight}")
            
            print("\n📈 Spending Patterns:")
            for pattern in analysis.get('spending_patterns', []):
                print(f"   • {pattern}")
            
            print("\n💡 Recommendations:")
            for rec in analysis.get('recommendations', []):
                print(f"   • {rec}")
            
            budget_info = analysis.get('budget_categories', {})
            if budget_info:
                print(f"\n💸 Budget Analysis:")
                if 'highest_expense' in budget_info:
                    print(f"   • Highest expense category: {budget_info['highest_expense']}")
                if 'potential_savings' in budget_info:
                    print(f"   • Potential savings in: {budget_info['potential_savings']}")
        
        print("-" * 50)


🤖 AI-Powered Financial Analysis:

📊 Analysis for John Smith:
💯 Financial Health Score: 8/10

🔍 Key Insights:
   • John has a strong net balance of $3173.45, indicating good financial health.
   • Total expenses are only 36.5% of total income, suggesting effective expense management.
   • Housing is the largest expense category, accounting for 82.6% of total expenses.

📈 Spending Patterns:
   • Housing costs dominate monthly expenses.
   • Minimal spending on food and transportation indicates potential for lifestyle adjustments.

💡 Recommendations:
   • Consider reviewing housing expenses to identify potential savings.
   • Explore opportunities to reduce grocery costs through meal planning or bulk buying.
   • Allocate a portion of the net balance towards savings or investments for future growth.

💸 Budget Analysis:
   • Highest expense category: housing
   • Potential savings in: groceries
--------------------------------------------------
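
Percentages quoted in the model's insights are generated text, so it can be worth recomputing them from the validated summary. A quick check using John Smith's totals from the output above (the `percent` helper is illustrative, not part of the notebook):

```python
def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(part / whole * 100, 1)


# Figures taken from the financial summary printed earlier for John Smith.
total_income = 5000.00
total_expenses = 1826.55
housing = 1500.00

expense_share_of_income = percent(total_expenses, total_income)
housing_share_of_expenses = percent(housing, total_expenses)

print(f"Expenses as share of income:  {expense_share_of_income}%")
print(f"Housing as share of expenses: {housing_share_of_expenses}%")
```

With these inputs the expense share matches the 36.5% in the insight, while the housing share works out to roughly 82.1% rather than the 82.6% the model reported. Small drifts like this are a reason to compute figures in code and let the model only interpret them.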


# Key Takeaways and Best Practices

## ✅ Advantages of Structured Outputs + Pydantic

1. **Type Safety**: Pydantic checks field types and constraints every time a model is constructed
2. **Reliability**: JSON mode returns syntactically valid JSON, and Pydantic rejects responses that do not match the schema
3. **Maintainability**: Clear data schemas make code easier to understand
4. **Error Handling**: Validation errors name the failing field and constraint, which helps with debugging
5. **Performance**: No need for complex string parsing logic
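
One place these advantages could be applied further is the AI analysis step, which returns a plain dictionary. As a sketch under Pydantic v2, the hypothetical `FinancialAnalysis` and `BudgetCategories` models below mirror the JSON structure requested in the analysis prompt, and `validate_analysis` is an illustrative helper that rejects responses that drift from it.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class BudgetCategories(BaseModel):
    """Hypothetical model for the budget_categories object in the prompt."""
    highest_expense: str
    potential_savings: str


class FinancialAnalysis(BaseModel):
    """Hypothetical schema mirroring the JSON structure requested in the prompt."""
    financial_health_score: float = Field(..., ge=1, le=10)
    key_insights: List[str]
    spending_patterns: List[str]
    recommendations: List[str]
    budget_categories: BudgetCategories


def validate_analysis(payload: dict) -> Optional[FinancialAnalysis]:
    """Return a validated analysis, or None if the payload breaks the schema."""
    try:
        return FinancialAnalysis.model_validate(payload)
    except ValidationError as exc:
        print(f"Analysis rejected: {exc.error_count()} validation error(s)")
        return None


# Sample payload abbreviated from the analysis output shown earlier.
sample = {
    "financial_health_score": 8,
    "key_insights": ["Housing is the largest expense category."],
    "spending_patterns": ["Housing costs dominate monthly expenses."],
    "recommendations": ["Consider reviewing housing expenses to identify potential savings."],
    "budget_categories": {"highest_expense": "housing", "potential_savings": "groceries"},
}

analysis = validate_analysis(sample)
out_of_range = validate_analysis({**sample, "financial_health_score": 11})
print(analysis.financial_health_score if analysis else None, out_of_range)
```

With a model like this, the display code can use attribute access instead of `.get(...)` with defaults, and a missing or out-of-range field fails loudly at the boundary.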

## 📊 Comparison: Naive vs Structured Approach

| Aspect | Naive Parsing | Structured + Pydantic |
|--------|---------------|----------------------|
| **Reliability** | ❌ Fragile to format changes | ✅ Valid JSON from JSON mode, schema checked by Pydantic |
| **Type Safety** | ❌ No validation | ✅ Automatic type checking |
| **Error Handling** | ❌ Generic JSON errors | ✅ Detailed validation messages |
| **Maintainability** | ❌ Hard to debug | ✅ Clear schemas and validation |
| **Performance** | ❌ String parsing overhead | ✅ Direct object creation |

