# Structured Outputs with Pydantic - Exercise

This notebook teaches you how to generate and validate structured JSON outputs from AI models, using Pydantic for type enforcement and fail-safe design. You'll process realistic sample financial data from a CSV file and use AI to generate insights.

## What You'll Learn
- The difference between naive string parsing and structured outputs
- How to create robust Pydantic models for data validation
- Processing CSV data with automatic type validation
- Using OpenAI's structured output capabilities for reliable AI analysis
- Handling complex data types and validation errors
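
To make the first point concrete, here is a minimal sketch of why parsing a raw text reply is fragile. The two reply strings are hypothetical examples of what a model might return, not recorded outputs, and `try_parse` is a helper invented for this illustration.

```python
import json

# Hypothetical replies: the same payload, once bare and once wrapped in a
# markdown code fence, as chat models often do.
FENCE = "`" * 3
bare_reply = '{"name": "John Doe", "age": 30}'
fenced_reply = f"{FENCE}json\n{bare_reply}\n{FENCE}"


def try_parse(text: str):
    """Return the parsed dict, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(try_parse(bare_reply))    # parses into a dict
print(try_parse(fenced_reply))  # None: the fence characters are not JSON
```

The payload is identical in both cases; only the wrapping differs, which is exactly the kind of variation naive parsing cannot absorb.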

## Exercise Goals
By the end of this notebook, you'll be able to:
1. Create Pydantic models with proper validation rules
2. Process CSV data with type safety
3. Generate reliable structured outputs from AI models
4. Handle validation errors gracefully
5. Choose the right approach for your use case
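
As a preview of goals 1 and 4, the sketch below shows `Field()` constraints and graceful error handling on a small model. `Contact` and `safe_contact` are hypothetical names used only for illustration; they are not part of the exercise solution.

```python
from pydantic import BaseModel, Field, ValidationError


class Contact(BaseModel):
    """Hypothetical model, used only to illustrate Field() constraints."""
    name: str = Field(..., min_length=1, max_length=100)
    age: int = Field(..., ge=0, le=150)
    is_active: bool = True


def safe_contact(data: dict):
    """Return a validated Contact, or None when validation fails."""
    try:
        return Contact(**data)
    except ValidationError as exc:
        print(f"Rejected {data}: {len(exc.errors())} error(s)")
        return None


valid = safe_contact({"name": "Ada", "age": "36"})  # the string "36" is coerced to int
invalid = safe_contact({"name": "", "age": 200})    # violates min_length and le
```

Returning `None` instead of raising is one possible design; in a pipeline you might prefer to collect the rejected inputs for later inspection.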

## Sample Data
This notebook uses `sample_financial_data.csv` with realistic financial transactions for 3 users:
- **John Smith** (Software Engineer, 35 years old)
- **Sarah Chen** (Tech Startup Employee, 28 years old)
- **Michael Johnson** (Senior Manager, 42 years old)
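
Every cell read from a CSV file arrives as a string, so validation is where types get enforced. The sketch below assumes a simplified, made-up set of columns and rows; the actual layout of `sample_financial_data.csv` may differ, and `Row` is a hypothetical model.

```python
import csv
import io
from datetime import datetime

from pydantic import BaseModel, Field, ValidationError

# Made-up rows for illustration only; the real file's columns may differ.
SAMPLE_CSV = """user_name,amount,date,category
John Smith,4200.00,2024-01-15T09:00:00,Salary
John Smith,-50.00,2024-01-16T18:30:00,Groceries
Sarah Chen,not-a-number,2024-01-17T08:00:00,Rent
"""


class Row(BaseModel):
    """Hypothetical row schema: every CSV cell starts out as a string."""
    user_name: str = Field(..., min_length=1)
    amount: float = Field(..., gt=0)
    date: datetime
    category: str = Field(..., min_length=1)


valid_rows, rejected = [], []
for raw in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    try:
        valid_rows.append(Row(**raw))  # strings are coerced to float/datetime
    except ValidationError:
        rejected.append(raw)           # keep the bad row for inspection

print(f"{len(valid_rows)} valid, {len(rejected)} rejected")
```

The negative amount and the non-numeric amount are both rejected, while the first row comes back with a real `float` and `datetime`.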

## Instructions
Look for `# YOUR CODE HERE` comments and follow the hints to complete the implementation.

In [1]:
# Setup and Imports
import json
import os
import csv
import pandas as pd
from datetime import datetime
from typing import List, Optional, Dict, Any
from enum import Enum
from collections import defaultdict

from pydantic import BaseModel, Field, ValidationError
from dotenv import load_dotenv
from openai import OpenAI

# Load environment and initialize OpenAI client
load_dotenv()
client = OpenAI(
    base_url="https://openai.vocareum.com/v1",
    api_key=os.getenv("OPENAI_API_KEY")
)

print("✅ Environment setup complete!")

✅ Environment setup complete!


In [None]:
# 1. Naive Approach - String Parsing (Problematic)

def naive_json_extraction():
    """Demonstrate the problems with naive JSON parsing from text responses."""
    
    prompt = """
    Create a user profile with the following information in JSON format:
    - name: John Doe
    - age: 30
    - email: john@example.com
    - is_active: true
    
    Return only the JSON, no additional text.
    """
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1
    )
    
    # This approach is fragile - what if the model adds extra text?
    raw_response = response.choices[0].message.content
    print("Raw response from model:")
    print(repr(raw_response))
    
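    try:
        # Naive cleanup: strip a markdown code fence if the model added one
        json_text = raw_response.strip().strip("`")
        if json_text.startswith("json"):
            # Drop the language tag left over from the opening fence
            json_text = json_text[len("json"):]

        parsed_data = json.loads(json_text)
        print("\n✅ Successfully parsed JSON (after handling markdown):")
        print(json.dumps(parsed_data, indent=2))
        return parsed_data
    except json.JSONDecodeError as e:
        print(f"\n❌ JSON parsing failed even after cleaning: {e}")
        print(f"Attempted to parse: {repr(json_text)}")
        return None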

# Test the naive approach - run this to see potential issues
naive_result = naive_json_extraction()

In [None]:
# 2. Define Pydantic Models - Type-Safe Data Structures

class UserProfile(BaseModel):
    """A validated user profile with type enforcement."""
    # YOUR CODE HERE: Define fields with proper types and validation using Field()
    # Hint: Use Field(..., validation_params) for each field
    # - name: str with min_length=1, max_length=100
    # - age: int with ge=0, le=150 (greater/equal, less/equal)
    # - email: str with pattern validation for email format (use pattern= instead of regex=)
    # - is_active: bool with default=True
    # - created_at: Optional[datetime] with default=None
    pass

class TransactionType(str, Enum):
    """Enumeration for transaction types."""
    INCOME = "income"
    EXPENSE = "expense"
    TRANSFER = "transfer"

class FinancialTransaction(BaseModel):
    """A validated financial transaction record."""
    # YOUR CODE HERE: Define these fields with validation:
    # - amount: float (must be greater than 0, use gt=0)
    # - transaction_type: TransactionType 
    # - description: str with min_length=1, max_length=200
    # - date: datetime
    # - category: str with min_length=1
    pass

class FinancialSummary(BaseModel):
    """A complete financial summary with multiple transactions."""
    # YOUR CODE HERE: Define these fields:
    # - user: UserProfile
    # - transactions: List[FinancialTransaction]
    # - total_income: float (ge=0)
    # - total_expenses: float (ge=0)
    # - net_balance: float
    pass

print("✅ Pydantic models defined with validation rules!")

In [None]:
# 3. Structured Output Approach - Reliable and Type-Safe

def generate_user_profile_structured() -> UserProfile:
    """Generate a user profile using OpenAI's structured output feature."""
    
    # YOUR CODE HERE: Create an OpenAI completion with structured output
    # Hint: Use client.chat.completions.create() with:
    # - model="gpt-4o-mini"
    # - messages with system and user prompts (MUST include the word "JSON"!)
    # - response_format={"type": "json_object"} for JSON mode (guarantees valid JSON, but not a particular schema)
    # - temperature=0.7
    # 
    # Important: Your prompt should specify the exact JSON structure needed
    # and include clear field requirements for the UserProfile model
    
    response = None  # Replace with your OpenAI API call
    
    # YOUR CODE HERE: Parse and validate the response
    # Hint: After getting the response:
    # 1. Extract raw_json from response.choices[0].message.content
    # 2. Parse with json.loads()
    # 3. Create UserProfile(**profile_data) with validation
    # 4. Handle ValidationError and json.JSONDecodeError exceptions
    
    return None  # Replace with your UserProfile instance

# YOUR CODE HERE: Uncomment and test when you've implemented the function above
# try:
#     structured_profile = generate_user_profile_structured()
#     print("✅ Successfully generated structured user profile:")
#     print(structured_profile.model_dump_json(indent=2))
# except Exception as e:
#     print(f"Error: {e}")

In [None]:
# 4. Complex Data Types - Processing CSV Data with Pydantic Validation

def load_and_process_financial_csv(csv_path: str = "sample_financial_data.csv") -> Dict[str, FinancialSummary]:
    """Load financial data from CSV and create validated FinancialSummary objects for each user."""
    
    # Read CSV data
    try:
        # YOUR CODE HERE: Use pandas to read the CSV file
        # Hint: Use pd.read_csv(csv_path)
        df = None  # Replace with your pandas read operation
        
        print(f"✅ Loaded {len(df)} transactions from CSV")
        print("📊 Sample data:")
        print(df.head(3).to_string(index=False))
    except FileNotFoundError:
        print(f"❌ CSV file not found: {csv_path}")
        return {}
    
    # Group transactions by user
    user_summaries = {}
    
    # YOUR CODE HERE: Group the data by user information
    # Hint: Use df.groupby(['user_name', 'user_age', 'user_email', 'user_active'])
    user_groups = None  # Replace with your groupby operation
    
    for (name, age, email, is_active), group in user_groups:
        try:
            # YOUR CODE HERE: Create user profile with validation
            # Hint: Create UserProfile instance with the grouped data
            user_profile = None  # Replace with UserProfile creation
            
            # Process transactions for this user
            transactions = []
            total_income = 0.0
            total_expenses = 0.0
            
            for _, row in group.iterrows():
                # YOUR CODE HERE: Create transaction with validation
                # Hint: Create FinancialTransaction instance from row data
                # Remember to handle datetime parsing with datetime.fromisoformat()
                transaction = None  # Replace with FinancialTransaction creation
                
                transactions.append(transaction)
                
                # YOUR CODE HERE: Calculate totals based on transaction type
                # Hint: Check transaction.transaction_type and add to appropriate total
                pass
            
            # YOUR CODE HERE: Create financial summary
            # Hint: Calculate net_balance and create FinancialSummary instance
            net_balance = 0.0  # Calculate this
            summary = None  # Create FinancialSummary instance
            
            user_summaries[name] = summary
            
        except ValidationError as e:
            print(f"❌ Validation error for user {name}: {e}")
        except Exception as e:
            print(f"❌ Error processing user {name}: {e}")
    
    return user_summaries

# YOUR CODE HERE: Uncomment and test when you've implemented the functions above
# try:
#     financial_summaries = load_and_process_financial_csv()
#     
#     print(f"\n✅ Successfully processed {len(financial_summaries)} users")
#     
#     # Display results for each user
#     for user_name, summary in financial_summaries.items():
#         print(f"\n📊 Financial Summary for {user_name}:")
#         print(f"   Email: {summary.user.email}")
#         print(f"   Age: {summary.user.age}")
#         print(f"   Transactions: {len(summary.transactions)}")
#         print(f"   Total Income: ${summary.total_income:,.2f}")
#         print(f"   Total Expenses: ${summary.total_expenses:,.2f}")
#         print(f"   Net Balance: ${summary.net_balance:,.2f}")
#         
#         # Show transaction breakdown by category
#         category_totals = defaultdict(float)
#         for transaction in summary.transactions:
#             if transaction.transaction_type == TransactionType.EXPENSE:
#                 category_totals[transaction.category] += transaction.amount
#         
#         if category_totals:
#             print("   Top Expense Categories:")
#             for category, amount in sorted(category_totals.items(), key=lambda x: x[1], reverse=True)[:3]:
#                 print(f"     • {category}: ${amount:,.2f}")
#             
# except Exception as e:
#     print(f"Error: {e}")

In [None]:
# 5. AI-Enhanced Data Analysis with Structured Outputs

def analyze_financial_data_with_ai(user_summary: FinancialSummary) -> Dict[str, Any]:
    """Use AI to generate insights about financial data, with structured validation."""
    
    # Create a summary of the user's financial data
    data_summary = {
        "user_profile": {
            "name": user_summary.user.name,
            "age": user_summary.user.age,
            "email": user_summary.user.email
        },
        "financial_overview": {
            "total_income": user_summary.total_income,
            "total_expenses": user_summary.total_expenses,
            "net_balance": user_summary.net_balance,
            "transaction_count": len(user_summary.transactions)
        },
        "expense_categories": {}
    }
    
    # YOUR CODE HERE: Calculate category breakdown
    # Hint: Create a dictionary of expense categories and their totals
    # Use defaultdict(float) and iterate through transactions
    category_totals = defaultdict(float)
    # Add your category calculation logic here
    
    data_summary["expense_categories"] = dict(category_totals)
    
    prompt = f"""
    Analyze this financial data and provide insights in JSON format:
    
    {json.dumps(data_summary, indent=2)}
    
    Provide analysis in this JSON structure:
    {{
        "financial_health_score": <number 1-10>,
        "key_insights": ["insight1", "insight2", "insight3"],
        "spending_patterns": ["pattern1", "pattern2"],
        "recommendations": ["recommendation1", "recommendation2", "recommendation3"],
        "budget_categories": {{
            "highest_expense": "category_name",
            "potential_savings": "category_name"
        }}
    }}
    """
    
    try:
        # YOUR CODE HERE: Create OpenAI completion for analysis
        # Hint: Use structured output with the prompt above
        # System message should identify you as a financial advisor
        response = None  # Replace with your API call
        
        # YOUR CODE HERE: Parse and return the analysis
        analysis = None  # Parse the JSON response
        return analysis
        
    except Exception as e:
        print(f"❌ Error generating AI analysis: {e}")
        return {}

# YOUR CODE HERE: Uncomment when ready to test AI analysis
# if 'financial_summaries' in locals() and financial_summaries:
#     print("\n🤖 AI-Powered Financial Analysis:")
#     print("=" * 50)
#     
#     for user_name, summary in financial_summaries.items():
#         print(f"\n📊 Analysis for {user_name}:")
#         
#         analysis = analyze_financial_data_with_ai(summary)
#         if analysis:
#             print(f"💯 Financial Health Score: {analysis.get('financial_health_score', 'N/A')}/10")
#             
#             print("\n🔍 Key Insights:")
#             for insight in analysis.get('key_insights', []):
#                 print(f"   • {insight}")
#             
#             print("\n📈 Spending Patterns:")
#             for pattern in analysis.get('spending_patterns', []):
#                 print(f"   • {pattern}")
#             
#             print("\n💡 Recommendations:")
#             for rec in analysis.get('recommendations', []):
#                 print(f"   • {rec}")
#             
#             budget_info = analysis.get('budget_categories', {})
#             if budget_info:
#                 print("\n💸 Budget Analysis:")
#                 if 'highest_expense' in budget_info:
#                     print(f"   • Highest expense category: {budget_info['highest_expense']}")
#                 if 'potential_savings' in budget_info:
#                     print(f"   • Potential savings in: {budget_info['potential_savings']}")
#         
#         print("-" * 50)

# Key Takeaways and Best Practices

## ✅ Advantages of Structured Outputs + Pydantic

1. **Type Safety**: Automatic validation ensures data integrity
2. **Reliability**: JSON mode returns syntactically valid JSON, and Pydantic validation catches responses that do not match the schema
3. **Maintainability**: Clear data schemas make code easier to understand
4. **Error Handling**: Detailed validation errors help with debugging
5. **Performance**: No need for complex string parsing logic
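
To illustrate the error-handling point, the sketch below inspects what a `ValidationError` actually contains. `Payment` is a hypothetical model, and the exact wording of each message depends on your Pydantic version.

```python
from pydantic import BaseModel, Field, ValidationError


class Payment(BaseModel):
    """Hypothetical model used to show what a validation error contains."""
    amount: float = Field(..., gt=0)
    category: str = Field(..., min_length=1)


try:
    Payment(amount=-5, category="")
except ValidationError as exc:
    # Each entry names the failing field ("loc") and explains why ("msg").
    failures = {err["loc"][0]: err["msg"] for err in exc.errors()}
    for field, message in failures.items():
        print(f"{field}: {message}")
```

Because every failure is reported per field, you can log or surface all problems in a record at once instead of fixing them one exception at a time.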

## 📊 Comparison: Naive vs Structured Approach

| Aspect | Naive Parsing | Structured + Pydantic |
|--------|---------------|----------------------|
| **Reliability** | ❌ Fragile to format changes | ✅ Guaranteed format compliance |
| **Type Safety** | ❌ No validation | ✅ Automatic type checking |
| **Error Handling** | ❌ Generic JSON errors | ✅ Detailed validation messages |
| **Maintainability** | ❌ Hard to debug | ✅ Clear schemas and validation |
| **Performance** | ❌ String parsing overhead | ✅ Direct object creation |
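
The type-safety row can be seen directly by parsing the same JSON text both ways. This is a minimal sketch assuming Pydantic v2 (`model_validate_json`); `Kind` and `Entry` are hypothetical names, and the payload is hand-written rather than real model output.

```python
import json
from datetime import datetime
from enum import Enum

from pydantic import BaseModel


class Kind(str, Enum):
    INCOME = "income"
    EXPENSE = "expense"


class Entry(BaseModel):
    """Hypothetical model; the payload below is hand-written for illustration."""
    kind: Kind
    amount: float
    date: datetime


payload = '{"kind": "expense", "amount": 12.5, "date": "2024-03-01T10:30:00"}'

as_dict = json.loads(payload)                  # plain dict: date and kind stay strings
as_model = Entry.model_validate_json(payload)  # typed: Kind enum and datetime

print(type(as_dict["date"]).__name__, type(as_model.date).__name__)
```

With the plain dict, a typo such as `"expence"` would pass silently; with the model, it would raise a `ValidationError` at the boundary.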
