# 📊 Extracting Structured Information from Unstructured Text

Welcome! This comprehensive tutorial will teach you how to extract structured, actionable data from unstructured text using OpenAI's API.

---

## 📚 What You'll Learn

By the end of this tutorial, you'll be able to:

1. **Extract structured data** from messy, unstructured text (emails, tickets, reports)
2. **Choose the right format** - CSV, JSON, or Pydantic models for your use case
3. **Parse complex information** - Handle nested data, arrays, and multiple items
4. **Validate extractions** - Ensure data quality and catch errors early
5. **Save results** - Store extracted data in files for downstream use
6. **Handle edge cases** - Deal with missing info, contradictions, and ambiguity
7. **Build production systems** - Create robust, scalable extraction pipelines

---

## 🎯 Why Extract Structured Data?

### The Business Problem

In IT support and services, information flows in **unstructured formats**:

- 📧 **Emails**: "Hi, my laptop won't start. I think it's the battery. Can someone help? - John from Marketing"
- 💬 **Chat messages**: "printer broken room 304 need toner asap"
- 📞 **Verbal reports**: "Sarah mentioned something about the network being slow in Building B"
- 📝 **Handwritten notes**: Notes from a phone call or site visit

But to **take action**, this information needs to be **structured**:

```json
{
  "user_name": "John",
  "department": "Marketing",
  "issue": "Laptop won't start",
  "suspected_cause": "Battery",
  "urgency": "medium"
}
```

### Current Challenges

❌ **Manual data entry is:**
- **Slow** - Takes time away from actual problem-solving
- **Error-prone** - Typos, missed fields, inconsistent formatting
- **Hard to scale** - Can't keep up with high ticket volumes
- **Inconsistent** - Different people extract different information

### The Solution: LLM-Powered Extraction

✅ **Large Language Models can:**
- Extract key information accurately and consistently
- Handle natural language variations ("urgent", "asap", "critical")
- Infer missing information from context
- Structure data in your required format (JSON, CSV, database schema)
- Process hundreds of items in minutes

---

## 🔑 Key Concepts

### Structured vs. Unstructured Data

**Unstructured Data:**
- Free-form text, no fixed format
- Examples: Emails, chat messages, documents, verbal reports
- Hard for computers to process directly

**Structured Data:**
- Organized in a predefined format
- Examples: Database tables, JSON objects, CSV files
- Easy to query, analyze, and integrate with other systems

### Data Extraction vs. Data Parsing

**Data Extraction:**
- Identifying and pulling out specific information from unstructured text
- Requires understanding context and meaning
- Example: Finding user name, issue type, urgency from an email

**Data Parsing:**
- Converting extracted information into a structured format
- Example: Creating a JSON object with extracted fields

LLMs excel at **both** - they understand context AND can output structured formats.
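
To make the distinction concrete, here is a minimal sketch that separates the two steps. Pattern matching stands in for the LLM and only works because the employee ID and email have rigid formats. Fields such as the user's name or the urgency need the contextual understanding described above. The sample message and the regular expressions are illustrative assumptions, not part of the tutorial's pipeline.

```python
import json
import re

message = (
    "I'm Jennifer Martinez from Sales (employee ID: EMP-5834). "
    "Reach me at jennifer.martinez@company.com."
)

# Extraction: locate the pieces of interest in free text.
employee_id = re.search(r"EMP-\d{4}", message)
email = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", message)

# Parsing: arrange what was found into a structured record.
record = {
    "employee_id": employee_id.group(0) if employee_id else None,
    "contact_email": email.group(0) if email else None,
}
print(json.dumps(record, indent=2))
```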

### Token Costs for Extraction

**Good news:** Extraction tasks work well with cheaper models!

- **gpt-5-nano**: $0.05/1M input tokens, $0.40/1M output tokens
- A typical support ticket extraction:
  - Input: ~300 tokens (the email/ticket)
  - Output: ~100 tokens (structured JSON)
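  - Cost: ~$0.000055 (300 × $0.05/1M input + 100 × $0.40/1M output), well under a hundredth of a cent per ticket!

💡 **Key takeaway:** At these rates, 10,000 tickets cost roughly $0.55 before retries, so model cost is rarely the bottleneck for extraction.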
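
The per-ticket figure can be reproduced with a few lines of arithmetic. This sketch assumes the prices and token counts listed above. `estimate_cost` is a helper name introduced here for illustration. It ignores any reasoning tokens the model may generate, which would add to the output side.

```python
# Per-million-token rates quoted above for gpt-5-nano.
INPUT_PRICE_PER_M = 0.05   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.40  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000

per_ticket = estimate_cost(input_tokens=300, output_tokens=100)
print(f"Per ticket: ${per_ticket:.6f}")
print(f"Per 10,000 tickets: ${per_ticket * 10_000:.2f}")
```

Once you have real responses, swap the assumed token counts for the values reported in `response.usage` to get a more realistic estimate.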

---

# 🔧 Setup

Let's configure the environment and install required libraries.

## 📦 Install Dependencies

We'll install five libraries:
- **openai**: Official OpenAI Python client for API access
- **pydantic**: Data validation and settings management using Python type hints
- **email-validator**: Email validation for Pydantic's EmailStr type
- **tqdm**: Progress bars for batch processing
- **pandas**: Data manipulation and CSV handling

In [None]:
!pip install -q openai pydantic email-validator tqdm pandas

## 🔑 API Key Configuration

You have two methods to provide your API key:

**Method 1 (Recommended)**: Use Colab Secrets
1. Click the 🔑 icon in the left sidebar
2. Click "Add new secret"
3. Name: `OPENAI_API_KEY`
4. Value: Your OpenAI API key
5. Enable notebook access

**Method 2 (Fallback)**: Manual input when prompted

Run the cell below to configure authentication:

In [None]:
import os

# Configure OpenAI API key
# Method 1: Try to get API key from Colab secrets (recommended)
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    print("✅ API key loaded from Colab secrets")
except Exception:
    # Method 2: Manual input (fallback) when not running in Colab or the secret is unavailable
    from getpass import getpass
    print("💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY")
    OPENAI_API_KEY = getpass("Enter your OpenAI API Key: ")

# Validate the key before using it
if not OPENAI_API_KEY or not OPENAI_API_KEY.strip():
    raise ValueError("❌ ERROR: No API key provided!")

# Set the API key as an environment variable
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print("✅ Authentication configured!")

# Configure which OpenAI model to use
OPENAI_MODEL = "gpt-5-nano"  # Using gpt-5-nano for cost efficiency
print(f"🤖 Selected Model: {OPENAI_MODEL}")

## 📚 Import Required Libraries

Now let's import all the libraries we'll use throughout this tutorial:

In [None]:
from openai import OpenAI
import json
import csv
from datetime import datetime
from typing import Optional, List, Dict, Any
from enum import Enum
from pydantic import BaseModel, EmailStr, Field, field_validator
import pandas as pd
from tqdm import tqdm
from IPython.display import display

# Initialize the OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

print("✅ All libraries imported successfully!")
print("✅ OpenAI client initialized!")

---

# 📋 Understanding Output Formats

LLMs can extract data into different formats. Each format has specific use cases in IT support.

We'll explore three main formats:
1. **CSV** - Simple tabular data
2. **JSON** - Complex, nested structures
3. **Pydantic Models** - Type-safe, validated data

---

## Format A: CSV (Comma-Separated Values)

### 📖 When to Use CSV

**Best for:** Simple tabular data with similar items

**Use cases in IT:**
- 📦 Asset inventories and equipment lists
- 🖥️ Hardware tracking spreadsheets
- 📊 Simple databases that need Excel compatibility
- 📈 Reports for non-technical stakeholders

**Pros:**
- ✅ Easy to open in Excel or Google Sheets
- ✅ Simple structure, widely supported
- ✅ Human-readable and editable

**Cons:**
- ❌ No nesting (can't represent complex relationships)
- ❌ All values are strings (no native data types)
- ❌ Harder to represent one-to-many relationships
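
The "all values are strings" point is easy to see with a round trip through each format. This is a small standard-library sketch. The `record` dictionary is made up for illustration.

```python
import csv
import io
import json

record = {"room": "Conference Room A", "quantity": 2, "in_service": True}

# Round-trip through CSV: every value comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))

# Round-trip through JSON: numbers and booleans keep their types.
json_row = json.loads(json.dumps(record))

print(csv_row)
print(json_row)
```

Tools such as pandas infer column types when reading a CSV, but that is a guess made by the reader. The file itself stores only text.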

### 💻 Practical Example: Conference Room Inventory

In [None]:
# Mock data: Unstructured description of conference room equipment
inventory_description = """
Conference Room A has a Samsung 65-inch display (serial: SAMS-2024-001), 
a Logitech Rally camera (serial: LOG-CAM-445), and a Poly Studio phone system (serial: POLY-899-X).

Conference Room B contains two Dell OptiPlex 7090 computers (serials: DELL-PC-1023 and DELL-PC-1024),
an LG 55-inch display (serial: LG-DSP-3301), and a Jabra Speak 750 speakerphone (serial: JAB-750-229).

Conference Room C is equipped with a Microsoft Surface Hub 2S (serial: MSFT-HUB-8821),
a Cisco Webex Room Kit (serial: CISCO-WX-4492), and an HP laptop (serial: HP-LT-9933).
"""

# Extraction prompt for CSV format
extraction_prompt = f"""
Extract equipment inventory information from the text below and format it as CSV.

Use these exact column headers: room,device_type,manufacturer,model,serial_number

Rules:
- One row per device
- Include header row
- Use commas as separators
- No quotes around values unless they contain commas

Text:
{inventory_description}
"""

print("🔄 Extracting inventory data to CSV format...\n")

# Make API call using Responses API
response = client.responses.create(
    model=OPENAI_MODEL,
    input=extraction_prompt,
    text={"verbosity": "low"}  # Low verbosity for structured output
)

csv_output = response.output_text.strip()

print("📊 Extracted CSV Data:")
print("=" * 80)
print(csv_output)
print("=" * 80)

# Save to file
csv_file_path = "/content/inventory_data.csv"
with open(csv_file_path, 'w') as f:
    f.write(csv_output)

print(f"\n💾 Saved to: {csv_file_path}")
print(f"📊 Tokens used: {response.usage.total_tokens}")

In [None]:
# Verify the CSV is valid by parsing it back
print("🔍 Verifying CSV format...\n")

# Method 1: Using pandas
df = pd.read_csv(csv_file_path)
print("✅ CSV is valid and readable!\n")
print(f"📊 Found {len(df)} devices across {df['room'].nunique()} rooms\n")
print("Preview:")

# Display as DataFrame (better formatting in Jupyter)
display(df)
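
Reading the file back confirms that it parses, but a model can still return a wrong header or a short row. One way to catch that before saving is a structural check with the standard `csv` module. This is a sketch. `check_csv_text` and the sample output are hypothetical and only illustrate the idea.

```python
import csv
import io

EXPECTED_COLUMNS = ["room", "device_type", "manufacturer", "model", "serial_number"]

def check_csv_text(csv_text: str, expected_columns: list[str]) -> list[str]:
    """Return a list of problems found in csv_text (an empty list means it looks fine)."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    if not rows:
        return ["no rows found"]
    header = [column.strip() for column in rows[0]]
    if header != expected_columns:
        problems.append(f"unexpected header: {header}")
    # Data rows start on line 2 of the CSV text.
    for line_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected_columns):
            problems.append(
                f"line {line_number}: expected {len(expected_columns)} fields, got {len(row)}"
            )
    return problems

# Hypothetical model output: the second data row is missing its serial number.
sample_output = """room,device_type,manufacturer,model,serial_number
Conference Room A,Display,Samsung,65-inch display,SAMS-2024-001
Conference Room A,Camera,Logitech,Rally
"""
print(check_csv_text(sample_output, EXPECTED_COLUMNS))
```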

---

## Format B: JSON (JavaScript Object Notation)

### 📖 When to Use JSON

**Best for:** Complex, nested data structures

**Use cases in IT:**
- 🎫 Support tickets with multiple fields and categories
- 👤 User information with device specifications
- 🐛 Error reports with nested details
- 🔗 API integration and data exchange
- 📁 Configuration files and settings

**Pros:**
- ✅ Flexible structure, supports nesting
- ✅ Native data types (strings, numbers, booleans, arrays, objects)
- ✅ Standard format for APIs and modern applications
- ✅ Easy to parse in any programming language

**Cons:**
- ❌ More verbose than CSV
- ❌ Less human-friendly for simple tables

💡 **Key Point:** JSON is the preferred format for most extraction tasks due to its flexibility.

### 💻 Practical Example: Support Ticket Extraction

In [None]:
# Mock data: Unstructured support ticket email
support_ticket = """
From: jennifer.martinez@company.com
Subject: Urgent - Cannot Access Email

Hi IT Support,

I'm Jennifer Martinez from the Sales department (employee ID: EMP-5834). 
My Outlook keeps crashing whenever I try to open it. I've tried restarting 
my computer twice but the problem persists.

This is really urgent because I need to send quotes to clients today. 
My phone number is 555-0192 if you need to call me.

Please help ASAP!

Thanks,
Jennifer
"""

# Extraction prompt for JSON format
extraction_prompt = f"""
Extract support ticket information from the email below and format it as JSON.

Include these fields:
- user_name: Full name of the user
- department: User's department
- employee_id: Employee ID if mentioned
- contact_email: Email address
- contact_phone: Phone number if mentioned
- issue_summary: Brief summary of the issue
- application: The application having issues
- urgency: Low, Medium, High, or Critical (infer from context)
- actions_tried: List of troubleshooting steps user already attempted

Email:
{support_ticket}

Respond with ONLY valid JSON, no additional text.
"""

print("🔄 Extracting ticket data to JSON format...\n")

# Make API call
response = client.responses.create(
    model=OPENAI_MODEL,
    input=extraction_prompt,
    text={"verbosity": "low"}
)

json_output = response.output_text.strip()

# Parse and pretty-print the JSON
ticket_data = json.loads(json_output)

print("📋 Extracted JSON Data:")
print("=" * 80)
print(json.dumps(ticket_data, indent=2))
print("=" * 80)

# Save to file
json_file_path = "/content/ticket_data.json"
with open(json_file_path, 'w') as f:
    json.dump(ticket_data, f, indent=2)

print(f"\n💾 Saved to: {json_file_path}")
print(f"📊 Tokens used: {response.usage.total_tokens}")

In [None]:
# Verify and demonstrate JSON parsing
print("🔍 Verifying JSON format and demonstrating access...\n")

# Read back from file
with open(json_file_path, 'r') as f:
    loaded_data = json.load(f)

print("✅ JSON is valid!\n")
print("Accessing specific fields:")
print(f"  User: {loaded_data['user_name']}")
print(f"  Department: {loaded_data['department']}")
print(f"  Issue: {loaded_data['issue_summary']}")
print(f"  Urgency: {loaded_data['urgency']}")
print(f"  Actions tried: {', '.join(loaded_data['actions_tried'])}")
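
The extraction cell above calls `json.loads` directly on the model's reply. That fails if the model wraps the JSON in a Markdown code fence despite the "ONLY valid JSON" instruction. A tolerant parser might look like the sketch below. `parse_json_response` is a hypothetical helper, not part of the OpenAI SDK.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

def parse_json_response(text: str) -> dict:
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag such as json).
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # Drop the closing fence if present.
        if cleaned.rstrip().endswith(FENCE):
            cleaned = cleaned.rstrip()[: -len(FENCE)]
    return json.loads(cleaned)

# A fenced reply and a plain reply both parse to the same kind of object.
fenced_reply = FENCE + "json\n" + '{"user_name": "Jennifer Martinez"}' + "\n" + FENCE
print(parse_json_response(fenced_reply))
print(parse_json_response('{"urgency": "High"}'))
```

Replies that are not JSON at all still raise `json.JSONDecodeError`, so callers should be prepared to catch it.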

---

## Format C: Pydantic Models (Type-Safe Validation)

### 📖 When to Use Pydantic

**What it is:** Python library for data validation using type hints

**Key benefit:** Automatic validation against a defined schema

**When useful:**
- 🏭 **Production systems** feeding databases
- 🔗 **API integration** requiring specific formats
- 🐛 **Early error detection** before data reaches critical systems
- 👥 **Team collaboration** with clear data contracts
- 📊 **Type safety** ensuring fields are correct types

### 🔍 Comparison: JSON vs. Pydantic

Let's see the difference between plain JSON (no validation) and Pydantic (validated):

#### ❌ Problem: Plain JSON Without Validation

In [None]:
# Example: Plain JSON - problems discovered later
bad_ticket = {
    "user_name": "John",  # Missing last name
    "employee_id": "12345",  # Wrong format (should be EMP-####)
    "contact_email": "john.company.com",  # Invalid email (missing @)
    "urgency": "super urgent",  # Invalid value (should be Low/Medium/High/Critical)
    # Missing required field: issue_summary
}

print("❌ Plain JSON - No validation happens:")
print(json.dumps(bad_ticket, indent=2))
print("\n⚠️ This bad data could be saved to database, causing errors later!")

#### ✅ Solution: Pydantic Model with Validation

Let's create a Pydantic model that validates our data:

In [None]:
# Define urgency levels as an Enum
class UrgencyLevel(str, Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"
    CRITICAL = "Critical"

# Define the Pydantic model for a support ticket
class SupportTicket(BaseModel):
    # Required fields
    user_name: str = Field(..., min_length=2, description="Full name of the user")
    employee_id: str = Field(..., pattern=r'^EMP-\d{4}$', description="Employee ID in format EMP-####")
    contact_email: EmailStr = Field(..., description="Valid email address")
    issue_summary: str = Field(..., min_length=10, description="Brief summary of the issue")
    urgency: UrgencyLevel = Field(..., description="Urgency level")
    
    # Optional fields with defaults
    department: Optional[str] = None
    contact_phone: Optional[str] = None
    application: Optional[str] = None
    actions_tried: List[str] = Field(default_factory=list)
    
    # Custom validator for phone numbers
    @field_validator('contact_phone')
    @classmethod
    def validate_phone(cls, v):
        if v and len(v.replace('-', '').replace(' ', '')) < 7:
            raise ValueError('Phone number too short')
        return v

print("✅ Pydantic model defined with validation rules!")

#### 🧪 Test 1: Invalid Data (Validation Catches Errors)

In [None]:
# Try to create ticket with bad data
print("🧪 Testing with INVALID data...\n")

from pydantic import ValidationError

try:
    bad_ticket_validated = SupportTicket(
        user_name="J",  # Too short
        employee_id="12345",  # Wrong format
        contact_email="invalid-email",  # Invalid email
        issue_summary="Broken",  # Too short
        urgency="super urgent"  # Invalid urgency level
    )
except ValidationError as e:
    print("❌ Validation failed (as expected):")
    print(f"\nError type: {type(e).__name__}")
    print("\nErrors found:")
    
    # Each entry in e.errors() describes one failing field
    for error in e.errors():
        field = error['loc'][0]
        message = error['msg']
        print(f"  • {field}: {message}")
    
    print("\n✅ Validation prevented bad data from being processed!")

#### 🧪 Test 2: Valid Data (Validation Succeeds)

Now let's use the LLM to extract data in a Pydantic-compatible format:

In [None]:
# Mock data for extraction
ticket_email = """
From: robert.chen@company.com
Subject: VPN Connection Issues - Need Help

Hello IT,

This is Robert Chen from Engineering (EMP-7821). My VPN client keeps 
disconnecting every 5-10 minutes. I've already tried:
- Restarting the VPN client
- Rebooting my laptop
- Checking my internet connection

This is blocking my work as I need to access the development servers.
Please treat this as high priority.

You can reach me at 555-0198.

Thanks,
Robert
"""

# Extraction prompt that ensures Pydantic-compatible format
extraction_prompt = f"""
Extract support ticket information and format as JSON matching this schema:

Required fields:
- user_name (string, min 2 chars): Full name
- employee_id (string, format: EMP-####): Employee ID  
- contact_email (string): Valid email address
- issue_summary (string, min 10 chars): Brief issue description
- urgency (string): Must be exactly one of: "Low", "Medium", "High", "Critical"

Optional fields:
- department (string or null): Department name
- contact_phone (string or null): Phone number
- application (string or null): Application name
- actions_tried (array of strings): Steps user already tried

Email:
{ticket_email}

Return ONLY valid JSON, no additional text.
"""

print("🔄 Extracting ticket with Pydantic validation...\n")

# Extract data
response = client.responses.create(
    model=OPENAI_MODEL,
    input=extraction_prompt,
    text={"verbosity": "low"}
)

extracted_json = response.output_text.strip()
extracted_data = json.loads(extracted_json)

print("📋 Extracted JSON:")
print(json.dumps(extracted_data, indent=2))
print()

# Validate with Pydantic
try:
    validated_ticket = SupportTicket(**extracted_data)
    print("✅ Pydantic validation PASSED!\n")
    
    print("Validated ticket details:")
    print(f"  User: {validated_ticket.user_name}")
    print(f"  Employee ID: {validated_ticket.employee_id}")
    print(f"  Email: {validated_ticket.contact_email}")
    print(f"  Urgency: {validated_ticket.urgency.value}")
    print(f"  Issue: {validated_ticket.issue_summary}")
    print(f"  Actions tried: {len(validated_ticket.actions_tried)} steps")
    
    # Save validated ticket
    validated_file_path = "/content/validated_ticket.json"
    with open(validated_file_path, 'w') as f:
        json.dump(validated_ticket.model_dump(), f, indent=2)
    
    print(f"\n💾 Saved validated ticket to: {validated_file_path}")
    print(f"📊 Tokens used: {response.usage.total_tokens}")
    
except Exception as e:
    print(f"❌ Validation failed: {e}")
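
When validation fails, a common follow-up is to ask the model again and include the error message in the new prompt. The sketch below shows one way to structure that loop. It is not a prescribed pattern. `extract_with_retry`, `fake_model`, and `require_urgency` are hypothetical names, and the model call is replaced by a stand-in so the example runs offline.

```python
import json
from typing import Callable

def extract_with_retry(
    call_model: Callable[[str], str],
    prompt: str,
    validate: Callable[[dict], None],
    max_attempts: int = 3,
) -> dict:
    """Call the model until its reply parses as a JSON object and passes validate()."""
    current_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        reply = call_model(current_prompt)
        try:
            data = json.loads(reply)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            validate(data)  # expected to raise ValueError on bad data
            return data
        except ValueError as error:  # json.JSONDecodeError subclasses ValueError
            last_error = error
            # Feed the error back so the next attempt can correct it.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {error}\n"
                "Return corrected JSON only."
            )
    raise RuntimeError(f"No valid extraction after {max_attempts} attempts: {last_error}")

# Stand-in for a real API call: the first reply is malformed, the second is valid.
replies = iter(['{"urgency": "High"', '{"urgency": "High"}'])

def fake_model(prompt: str) -> str:
    return next(replies)

def require_urgency(data: dict) -> None:
    if data.get("urgency") not in {"Low", "Medium", "High", "Critical"}:
        raise ValueError("urgency must be Low, Medium, High, or Critical")

result = extract_with_retry(fake_model, "Extract the ticket as JSON.", require_urgency)
print(result)
```

In the notebook, `call_model` would wrap `client.responses.create(...)` and `validate` could construct `SupportTicket(**data)`. Retries cost extra tokens, so keep `max_attempts` small.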

### 📊 Pydantic Benefits Summary

**Pydantic provides:**

✅ **Type safety** - Fields must be correct types  
✅ **Format validation** - Email, patterns, length checks  
✅ **Required field checking** - No missing critical data  
✅ **Enum validation** - Only allowed values accepted  
✅ **Custom validators** - Business logic validation  
✅ **Clear error messages** - Know exactly what's wrong  
✅ **IDE support** - Auto-completion and type hints  

💡 **When to use:** Production systems, database integration, team projects requiring data contracts
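
Enum validation is strict: a reply of `"high"` or `"asap"` is rejected even though the intent is clear. If that turns out to be too strict for your data, one option is a small normalization step before validation. This is a sketch. The synonym table is an assumption made for illustration and should be adapted to the wording your users actually use.

```python
from typing import Optional

ALLOWED_URGENCY = ("Low", "Medium", "High", "Critical")

# Hypothetical synonym table, not taken from any real ticket data.
URGENCY_SYNONYMS = {
    "urgent": "High",
    "asap": "High",
    "emergency": "Critical",
    "normal": "Medium",
}

def normalize_urgency(raw: Optional[str]) -> Optional[str]:
    """Map a free-form urgency string onto an allowed level, or return None."""
    if not raw:
        return None
    key = raw.strip().lower()
    for level in ALLOWED_URGENCY:
        if key == level.lower():
            return level
    return URGENCY_SYNONYMS.get(key)

print(normalize_urgency("HIGH"))
print(normalize_urgency(" asap "))
print(normalize_urgency("super urgent"))
```

Returning `None` for unknown values, rather than guessing, leaves the decision to the validation layer or a human reviewer.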

---

# 💼 Practical Examples

Now let's apply what we've learned to realistic IT support scenarios.

---

## Example 1: Support Ticket Parsing (Comprehensive)

We'll extract information from support tickets, starting with basic extraction and progressing to complex nested structures.

### Part A: Basic Ticket Extraction

Extract standard ticket fields from a user email:

In [None]:
# Realistic mock data: User email about laptop issue
laptop_issue_email = """
From: sarah.williams@company.com
Date: 2025-01-15 09:23 AM
Subject: Laptop Battery Problem - Urgent

Hi Support Team,

I'm Sarah Williams from the Marketing department, employee number EMP-4567.
My work laptop (Dell Latitude 5420) isn't holding a charge anymore. The battery 
drains completely within an hour even when I'm just using Word and email.

I have a client presentation tomorrow afternoon at 2 PM and really need this 
working by then. I've tried using a different power outlet and the charger 
seems to work fine (the charging light comes on).

Could someone please help? You can reach me at ext. 4523 or my cell 555-0167.

Thank you!
Sarah Williams
Marketing Department
"""

# Extraction prompt
extraction_prompt = f"""
Extract support ticket information from this email and return as JSON.

Include these fields:
- user_name: Full name
- department: Department name
- employee_id: Employee ID
- contact_email: Email address
- contact_phone: Phone number (if mentioned)
- device_info: Device description (manufacturer and model)
- issue_summary: Concise summary of the issue
- issue_details: Detailed description
- urgency: Low, Medium, High, or Critical (infer from context and deadline)
- deadline_context: Any time-sensitive information
- troubleshooting_done: What user already tried

Email:
{laptop_issue_email}

Return ONLY valid JSON.
"""

print("🔄 Extracting basic ticket information...\n")

# Make API call
response = client.responses.create(
    model=OPENAI_MODEL,
    input=extraction_prompt,
    text={"verbosity": "medium"}
)

ticket_json = response.output_text.strip()
ticket_data = json.loads(ticket_json)

print("📋 Extracted Ticket Data:")
print("=" * 80)
print(json.dumps(ticket_data, indent=2))
print("=" * 80)

# Save to file
basic_ticket_path = "/content/basic_ticket.json"
with open(basic_ticket_path, 'w') as f:
    json.dump(ticket_data, f, indent=2)

print(f"\n💾 Saved to: {basic_ticket_path}")
print(f"📊 Tokens used: {response.usage.total_tokens}")

# Display key insights
print("\n🔍 Key Insights:")
print(f"  • User: {ticket_data['user_name']} ({ticket_data['department']})")
print(f"  • Issue: {ticket_data['issue_summary']}")
print(f"  • Urgency: {ticket_data['urgency']}")
print(f"  • Deadline: {ticket_data['deadline_context']}")
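
The insight lines above index the dictionary directly, so a field the model leaves out raises `KeyError`. Checking for gaps first makes the failure explicit. This is a sketch. `missing_fields` and the sample ticket are hypothetical.

```python
def missing_fields(data: dict, required: list[str]) -> list[str]:
    """Return the required field names that are absent or empty in data."""
    return [name for name in required if data.get(name) in (None, "", [])]

# Hypothetical extraction result where the model omitted the deadline.
sample_ticket = {
    "user_name": "Sarah Williams",
    "department": "Marketing",
    "issue_summary": "Laptop battery drains within an hour",
    "urgency": "High",
}
required = ["user_name", "department", "issue_summary", "urgency", "deadline_context"]

gaps = missing_fields(sample_ticket, required)
if gaps:
    print(f"Missing fields: {', '.join(gaps)}")

# .get() with a default avoids KeyError for optional fields.
print(f"Deadline: {sample_ticket.get('deadline_context', 'not provided')}")
```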

### Part B: Advanced - Nested Device Specifications

Now let's extract more complex data with nested objects and arrays:

In [None]:
# Complex mock data: Workstation issue with detailed system specs
workstation_issue = """
From: michael.thompson@company.com
Subject: Workstation Performance Issues - Graphics Freezing

Hello IT,

I'm Michael Thompson, CAD Engineer in the Design department (EMP-9012).

My workstation is experiencing severe performance problems. The system specs are:
- Dell Precision 7920 Tower (Service Tag: 5XYZ789)
- Intel Xeon Gold 6248R processor, 24 cores, running at 3.0 GHz
- 128GB DDR4 RAM, ECC memory
- NVIDIA Quadro RTX 5000 with 16GB GDDR6
- Two storage drives: 1TB NVMe SSD (OS drive) and 4TB SATA SSD (data drive)
- Running Windows 11 Pro for Workstations, version 23H2

The screen freezes when I'm rendering 3D models in SolidWorks. Sometimes the 
entire application crashes. I suspect it might be the graphics card overheating 
because I hear the fans going crazy.

This is blocking a project deadline on Friday. Please help!

Phone: 555-0184
Email: michael.thompson@company.com

Thanks,
Michael
"""

# Extraction prompt for nested structure - simplified and clearer
extraction_prompt = f"""
Extract support ticket information with nested device specifications from the email below.

IMPORTANT: Return ONLY valid JSON with NO extra text before or after. Use this exact structure:

{{
  "user_name": "full name",
  "department": "department name",
  "employee_id": "employee ID",
  "contact_email": "email address",
  "contact_phone": "phone number",
  "issue_summary": "brief issue summary",
  "suspected_cause": "suspected cause if mentioned",
  "urgency": "Low or Medium or High or Critical",
  "affected_application": "application name",
  "device": {{
    "manufacturer": "device manufacturer",
    "model": "device model",
    "service_tag": "service tag",
    "processor": {{
      "brand": "processor brand",
      "model": "processor model",
      "cores": 24,
      "speed_ghz": 3.0
    }},
    "ram": {{
      "capacity_gb": 128,
      "type": "DDR4 ECC"
    }},
    "gpu": {{
      "manufacturer": "GPU manufacturer",
      "model": "GPU model",
      "memory_gb": 16
    }},
    "storage": [
      {{
        "capacity_tb": 1.0,
        "type": "NVMe SSD",
        "purpose": "OS drive"
      }},
      {{
        "capacity_tb": 4.0,
        "type": "SATA SSD",
        "purpose": "data drive"
      }}
    ],
    "operating_system": {{
      "name": "Windows 11 Pro for Workstations",
      "version": "23H2"
    }}
  }}
}}

Email to extract from:
{workstation_issue}

Return ONLY the JSON object with no additional text.
"""

print("🔄 Extracting complex nested ticket information...\n")

# Make API call with lower verbosity for cleaner JSON output
response = client.responses.create(
    model=OPENAI_MODEL,
    input=extraction_prompt,
    text={"verbosity": "low"}  # Low verbosity for cleaner structured output
)

complex_json = response.output_text.strip()

# Try to parse JSON with error handling
try:
    complex_data = json.loads(complex_json)
    
    print("📋 Extracted Complex Nested Data:")
    print("=" * 80)
    print(json.dumps(complex_data, indent=2))
    print("=" * 80)
    
    # Save to file
    complex_ticket_path = "/content/complex_ticket_nested.json"
    with open(complex_ticket_path, 'w') as f:
        json.dump(complex_data, f, indent=2)
    
    print(f"\n💾 Saved to: {complex_ticket_path}")
    print(f"📊 Tokens used: {response.usage.total_tokens}")
    
    # Demonstrate accessing nested data
    print("\n🔍 Accessing Nested Data:")
    print(f"  • User: {complex_data['user_name']}")
    print(f"  • Device: {complex_data['device']['manufacturer']} {complex_data['device']['model']}")
    print(f"  • CPU: {complex_data['device']['processor']['brand']} {complex_data['device']['processor']['model']}")
    print(f"  • RAM: {complex_data['device']['ram']['capacity_gb']}GB {complex_data['device']['ram']['type']}")
    print(f"  • GPU: {complex_data['device']['gpu']['manufacturer']} {complex_data['device']['gpu']['model']}")
    print(f"  • Storage drives: {len(complex_data['device']['storage'])}")
    for i, drive in enumerate(complex_data['device']['storage'], 1):
        print(f"    - Drive {i}: {drive['capacity_tb']}TB {drive['type']} ({drive['purpose']})")

except json.JSONDecodeError as e:
    print(f"❌ JSON Parsing Error: {e}")
    print("\n📄 Raw response from API:")
    print("=" * 80)
    print(complex_json)
    print("=" * 80)
    print("\n💡 Tip: The model returned invalid JSON. Try adjusting the prompt or using a different model.")

### 📊 Key Teaching Points

From Example 1, we learned:

**Flat vs. Nested Structures:**
- ✅ Flat structure: Simple tickets with top-level fields only
- ✅ Nested structure: Complex tickets with related data grouped in objects

**Handling Arrays:**
- ✅ Storage devices are represented as an array of objects
- ✅ Each item in the array has the same structure
- ✅ Easy to iterate through and process programmatically

**Extracting Implied Information:**
- ✅ Urgency inferred from context ("blocking project deadline on Friday" → High)
- ✅ Suspected cause identified from symptoms described
- ✅ Drive purpose categorized (OS drive vs. data drive)

💡 **Best Practice:** Use nested structures when data has clear relationships (device → components). This keeps data organized and easier to work with.
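
Nested data sometimes has to go into a flat destination such as a CSV row or a database table. A recursive flattener with dotted keys is one way to bridge the two. This is a sketch. `flatten` is a hypothetical helper and the sample is a trimmed version of the device structure above.

```python
def flatten(data, parent_key: str = "", separator: str = ".") -> dict:
    """Flatten nested dicts and lists into a single-level dict with dotted keys.

    List items are keyed by their index. Empty containers produce no keys.
    """
    if isinstance(data, dict):
        pairs = data.items()
    elif isinstance(data, list):
        pairs = enumerate(data)
    else:
        return {parent_key: data}
    items = {}
    for key, value in pairs:
        full_key = f"{parent_key}{separator}{key}" if parent_key else str(key)
        items.update(flatten(value, full_key, separator))
    return items

device = {
    "manufacturer": "Dell",
    "processor": {"cores": 24, "speed_ghz": 3.0},
    "storage": [
        {"capacity_tb": 1.0, "type": "NVMe SSD"},
        {"capacity_tb": 4.0, "type": "SATA SSD"},
    ],
}
flat = flatten({"device": device})
for key, value in flat.items():
    print(f"{key} = {value}")
```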

---

## Files Saved:
- `/content/inventory_data.csv` - Conference room equipment inventory
- `/content/ticket_data.json` - Basic support ticket extraction
- `/content/validated_ticket.json` - Pydantic-validated ticket
- `/content/basic_ticket.json` - Laptop issue ticket with standard fields
- `/content/complex_ticket_nested.json` - Workstation ticket with nested device specs

## What You've Learned:
✅ Why structured data extraction matters in IT support  
✅ When to use CSV, JSON, or Pydantic models  
✅ How to extract data using gpt-5-nano  
✅ Validation techniques with Pydantic  
✅ Handling nested structures and arrays  
✅ Saving extracted data to files  

---


## Example 2: Multiple Error Extraction (JSON Array)

Let's extract multiple items from a single message into a structured array.

In [None]:
# Mock data: User reporting multiple different errors
multiple_errors_report = """
From: david.kumar@company.com
Subject: Multiple System Errors Today - Please Help

Hi IT Team,

I've been experiencing several different errors on my computer today (EMP-3392):

1. Around 9 AM, I got a Windows error saying "DRIVER_IRQL_NOT_LESS_OR_EQUAL" 
   with a blue screen. The computer restarted automatically.

2. At 10:30 AM, Adobe Acrobat crashed with error code 0xc0000005 when I tried 
   to open a PDF file from a client.

3. Just before lunch, my network printer showed "Error 49.FF04" and stopped 
   printing completely. Other people can still print to it.

4. Around 2 PM, I got another blue screen with "SYSTEM_SERVICE_EXCEPTION" error.

5. Finally, at 3:15 PM, Outlook gave me error 0x80040600 and won't send emails now.

I'm really frustrated because this is affecting my work. These errors seem random 
but I'm worried something serious is wrong with my computer.

Please help!
David Kumar
Sales Department
"""

# Extraction prompt for array structure
extraction_prompt = f"""
Extract ALL errors from this user report and format as JSON.

Return JSON with this structure:
{{
  "user_name": "string",
  "employee_id": "string", 
  "department": "string",
  "report_date": "infer from context or use null",
  "total_errors": number,
  "user_sentiment": "string (frustrated/concerned/neutral/etc)",
  "errors": [
    {{
      "error_number": number,
      "error_code": "string or null",
      "error_message": "string",
      "source": "string (Windows/Application name/Hardware)",
      "timestamp_context": "string (time mentioned in report)",
      "error_type": "string (Blue Screen/Application Crash/Hardware Error/etc)"
    }}
  ]
}}

Extract each error into a separate array item with consistent structure.

User report:
{multiple_errors_report}

Return ONLY valid JSON.
"""

print("🔄 Extracting multiple errors into JSON array...\n")

# Make API call
response = client.responses.create(
    model=OPENAI_MODEL,
    input=extraction_prompt,
    text={"verbosity": "medium"}
)

errors_json = response.output_text.strip()
errors_data = json.loads(errors_json)

print("📋 Extracted Multiple Errors:")
print("=" * 80)
print(json.dumps(errors_data, indent=2))
print("=" * 80)

# Save to file
multiple_errors_path = "/content/multiple_errors.json"
with open(multiple_errors_path, 'w') as f:
    json.dump(errors_data, f, indent=2)

print(f"\n💾 Saved to: {multiple_errors_path}")
print(f"📊 Tokens used: {response.usage.total_tokens}")

# Display insights
print("\n🔍 Analysis:")
print(f"  • User: {errors_data['user_name']} ({errors_data['department']})")
print(f"  • Total errors reported: {errors_data['total_errors']}")
print(f"  • User sentiment: {errors_data['user_sentiment']}")
print("\n  Error breakdown:")
for error in errors_data['errors']:
    print(f"    {error['error_number']}. {error['error_type']} - {error['source']}")
    print(f"       Time: {error['timestamp_context']}")
    if error['error_code']:
        print(f"       Code: {error['error_code']}")

### 📊 Key Teaching Point: Extracting Multiple Items

**What we learned:**
- ✅ How to extract MULTIPLE similar items into a structured array
- ✅ Each array item has consistent structure (same fields)
- ✅ LLM can identify and separate distinct errors from narrative text
- ✅ We can include metadata (total count, sentiment) alongside the array
- ✅ Easy to process programmatically (loop through errors)

💡 **Use case:** Any scenario with multiple similar items - error logs, equipment lists, action items from meetings, multiple user requests in one message.
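
Because the model produces both the `total_errors` count and the array itself, the two can disagree. A quick consistency check is cheap, and the array is easy to summarize afterwards. This is a sketch. `check_error_report` and the sample report are hypothetical, with fields trimmed from the schema in the prompt.

```python
from collections import Counter

def check_error_report(report: dict) -> list[str]:
    """Return consistency problems found in an extracted error report."""
    problems = []
    errors = report.get("errors", [])
    if report.get("total_errors") != len(errors):
        problems.append(
            f"total_errors is {report.get('total_errors')} but {len(errors)} errors were listed"
        )
    numbers = [error.get("error_number") for error in errors]
    if numbers != list(range(1, len(errors) + 1)):
        problems.append(f"error_number values are not 1..{len(errors)}: {numbers}")
    return problems

# Hypothetical extraction result: the count says 5 but only 4 errors were listed.
sample_report = {
    "total_errors": 5,
    "errors": [
        {"error_number": 1, "error_type": "Blue Screen"},
        {"error_number": 2, "error_type": "Application Crash"},
        {"error_number": 3, "error_type": "Hardware Error"},
        {"error_number": 4, "error_type": "Blue Screen"},
    ],
}

print(check_error_report(sample_report))

# Summarize the array: how many errors of each type were reported?
type_counts = Counter(error["error_type"] for error in sample_report["errors"])
print(type_counts.most_common())
```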

---

## Example 3: Hardware Inventory (CSV Format)

Now let's extract equipment information into CSV format - perfect for importing into spreadsheets or inventory systems.

In [None]:
# Mock data: Description of conference room equipment with various details
conference_rooms_description = """
Conference Room Inventory Report:

Executive Boardroom (3rd Floor):
- Two 75-inch Samsung QN75 displays mounted on the wall (serials: SAMQN-8821, SAMQN-8822)
- One Logitech Rally Bar camera system, serial number LOGI-RB-3345
- Polycom RealPresence Trio 8800 conference phone, serial POLY-8800-991
- Dell OptiPlex 7090 PC for presentations, serial DELL-OPT-4455

Room 2A (2nd Floor):
- Single 65-inch LG OLED display, serial LG-OLED-2293  
- Jabra PanaCast camera, serial JAB-PC-7721
- Microsoft Surface Hub 2S 50-inch, serial MSFT-HUB-1203
- Two wireless presentation adapters, serials WRLSS-001 and WRLSS-002

Training Room B (1st Floor):
- Four 55-inch Sony Bravia displays (serials: SONY-BR-5501, SONY-BR-5502, SONY-BR-5503, SONY-BR-5504)
- Cisco Webex Room Kit Pro, serial CISCO-WX-8832
- Three HP EliteDesk 800 computers (serials: HP-ED-9901, HP-ED-9902, HP-ED-9903)
- One portable projector - Epson PowerLite, serial EPSON-PL-4456

Small Meeting Room 105:
- Single 43-inch Dell monitor, serial DELL-MON-3344
- Logitech MeetUp camera, serial LOGI-MU-6655
- One laptop docking station, serial DOCK-STN-2234
"""

# Extraction prompt for CSV
extraction_prompt = f"""
Extract conference room equipment inventory as CSV.

Use these exact column headers:
room,device_type,manufacturer,model,serial_number,location_detail,quantity

Rules:
- One row per device item
- Include header row
- device_type should be category like "Display", "Camera", "Computer", "Phone System", etc.
- location_detail should include floor or mounting info if mentioned
- quantity should be 1 for individual items
- Use commas as separators
- Don't use quotes unless value contains comma

Text:
{conference_rooms_description}

Return ONLY the CSV data, no additional text.
"""

print("🔄 Extracting conference room inventory to CSV...\n")

# Make API call
response = client.responses.create(
    model=OPENAI_MODEL,
    input=extraction_prompt,
    text={"verbosity": "low"}
)

csv_output = response.output_text.strip()

print("📊 Extracted CSV Data:")
print("=" * 100)
print(csv_output)
print("=" * 100)

# Save to file
inventory_csv_path = "/content/conference_room_inventory.csv"
with open(inventory_csv_path, 'w') as f:
    f.write(csv_output)

print(f"\n💾 Saved to: {inventory_csv_path}")
print(f"📊 Tokens used: {response.usage.total_tokens}")

In [None]:
# Verify and analyze the CSV
print("🔍 Verifying CSV and analyzing inventory...\n")

# Read with pandas
df = pd.read_csv(inventory_csv_path)

print("✅ CSV is valid!\n")
print(f"📊 Inventory Summary:")
print(f"  • Total items: {len(df)}")
print(f"  • Rooms covered: {df['room'].nunique()}")
print(f"  • Device types: {df['device_type'].nunique()}")
print(f"\n🏢 Items per room:")
print(df.groupby('room').size().to_string())
print(f"\n🖥️ Device type breakdown:")
print(df.groupby('device_type').size().to_string())

print(f"\n📋 Full Inventory:")

# Display as DataFrame (better formatting in Jupyter)
display(df)

print(f"\n💡 This CSV can be imported into Excel, Google Sheets, or an inventory database!")
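A successful `pd.read_csv` call does not by itself show that the model followed the requested layout. Here is a rough sketch of a stricter check using the standard `csv` module, assuming the seven column headers from the prompt above. `check_csv_structure` is an illustrative helper name, and the sample text is a made-up model response with a deliberately short row.

```python
import csv
import io

# Header requested in the extraction prompt
EXPECTED_HEADER = ["room", "device_type", "manufacturer", "model",
                   "serial_number", "location_detail", "quantity"]

def check_csv_structure(csv_text, expected_header=EXPECTED_HEADER):
    """Return a list of problems: wrong header or rows with the wrong column count."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    if not rows:
        return ["CSV is empty"]

    problems = []
    header = [h.strip() for h in rows[0]]
    if header != expected_header:
        problems.append(f"unexpected header: {header}")

    # Data rows start on line 2 of the CSV text
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected_header):
            problems.append(
                f"line {line_no}: expected {len(expected_header)} columns, got {len(row)}"
            )
    return problems


# Hypothetical model output; the last row is missing its quantity column
sample_csv = """room,device_type,manufacturer,model,serial_number,location_detail,quantity
Room 2A,Display,LG,OLED 65-inch,LG-OLED-2293,2nd Floor,1
Room 2A,Camera,Jabra,PanaCast,JAB-PC-7721,2nd Floor
"""
print(check_csv_structure(sample_csv))
```

Running a check like this on `csv_output` before writing the file would let you re-prompt or flag the response instead of saving a malformed CSV.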

---

# 🛡️ Data Validation

Extracted data needs validation before use in production systems.

---

## 🎯 Why Validation Matters

**The reality of LLM extraction:**
- ✅ LLMs are very good but **not 100% accurate**
- ✅ Input data may be incomplete or ambiguous
- ✅ Downstream systems expect specific formats
- ✅ Early detection prevents costly errors later

**Benefits of validation:**
- 🐛 **Catch issues early** - Before bad data reaches critical systems
- 🔄 **Provide feedback** - Know what's missing or incorrect
- 📊 **Quality metrics** - Track extraction accuracy over time
- 🚨 **Trigger human review** - Flag uncertain extractions

Let's explore three key validation techniques:

## Technique 1: Required Fields Check

Ensure all critical fields are present and not empty.

In [None]:
def check_required_fields(data, required_fields):
    """
    Check if all required fields are present and not empty.
    
    Args:
        data (dict): Extracted data dictionary
        required_fields (list): List of field names that must be present
        
    Returns:
        dict: Validation result with is_valid flag and missing_fields list
    """
    missing_fields = []
    
    for field in required_fields:
        # Check if field exists
        if field not in data:
            missing_fields.append(field)
        # Check if field is empty (None, empty string, empty list)
        elif data[field] is None or data[field] == "" or data[field] == []:
            missing_fields.append(field)
    
    return {
        "is_valid": len(missing_fields) == 0,
        "missing_fields": missing_fields,
        "message": "All required fields present" if len(missing_fields) == 0 else f"Missing fields: {', '.join(missing_fields)}"
    }

# Example: Incomplete ticket data
incomplete_ticket = {
    "user_name": "Alex Johnson",
    "employee_id": "",  # Empty!
    "contact_email": "alex.johnson@company.com",
    "issue_summary": None,  # Missing!
    # department field completely missing
}

print("🧪 Testing Required Fields Validation\n")
print("Ticket data:")
print(json.dumps(incomplete_ticket, indent=2))

# Define what fields are required
required = ["user_name", "employee_id", "contact_email", "issue_summary", "department"]

# Validate
result = check_required_fields(incomplete_ticket, required)

print(f"\n📋 Validation Result:")
print(f"  Valid: {result['is_valid']}")
print(f"  Message: {result['message']}")

if not result['is_valid']:
    print(f"\n❌ Cannot process this ticket - missing critical information!")
    print(f"  Action needed: Request missing fields from user")
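Model output sometimes contains a placeholder such as "N/A" or a whitespace-only string where a value could not be found, and `check_required_fields` would accept both as present. One possible extension is sketched below, on the assumption that such placeholders should count as missing. The `PLACEHOLDERS` set and both helper names are illustrative choices, not part of the tutorial.

```python
# Assumed placeholder strings (compared case-insensitively); adjust to your data
PLACEHOLDERS = {"", "n/a", "na", "none", "null", "unknown"}

def is_effectively_empty(value):
    """Treat None, empty containers, blank strings and placeholder strings as empty."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip().lower() in PLACEHOLDERS
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False  # numbers and booleans (including 0 and False) count as present

def find_missing_fields(data, required_fields):
    """Return the required fields that are absent or effectively empty."""
    return [f for f in required_fields if is_effectively_empty(data.get(f))]


ticket = {
    "user_name": "  ",            # whitespace only
    "employee_id": "N/A",         # placeholder
    "contact_email": "alex.johnson@company.com",
    "retry_count": 0,             # a legitimate zero, not missing
}
required = ["user_name", "employee_id", "contact_email", "retry_count", "department"]
print(find_missing_fields(ticket, required))
```

Note the explicit handling of `0` and `False`: a plain truthiness test would wrongly report them as missing.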

## Technique 2: Data Type & Format Validation

Check if data matches expected formats (email, employee ID, asset tag, etc.).

In [None]:
import re

def validate_data_formats(data):
    """
    Validate data formats for common IT fields.
    
    Args:
        data (dict): Extracted ticket data
        
    Returns:
        dict: Validation results with specific format errors
    """
    errors = []
    
    # Email format validation
    if 'contact_email' in data and data['contact_email']:
        email = data['contact_email']
        if '@' not in email or '.' not in email.split('@')[-1]:
            errors.append(f"Invalid email format: {email}")
    
    # Employee ID format validation (EMP-####)
    if 'employee_id' in data and data['employee_id']:
        emp_id = data['employee_id']
        if not re.match(r'^EMP-\d{4}$', emp_id):
            errors.append(f"Invalid employee ID format: {emp_id} (expected: EMP-####)")
    
    # Asset tag validation (reasonable length)
    if 'asset_tag' in data and data['asset_tag']:
        asset_tag = data['asset_tag']
        if len(asset_tag) < 5 or len(asset_tag) > 20:
            errors.append(f"Asset tag length invalid: {asset_tag} (expected: 5-20 chars)")
    
    # Phone number validation (at least 7 digits)
    if 'contact_phone' in data and data['contact_phone']:
        phone = data['contact_phone']
        digits_only = re.sub(r'\D', '', phone)  # Remove non-digits
        if len(digits_only) < 7:
            errors.append(f"Phone number too short: {phone}")
    
    return {
        "is_valid": len(errors) == 0,
        "errors": errors,
        "message": "All formats valid" if len(errors) == 0 else f"Found {len(errors)} format error(s)"
    }

# Example: Ticket with format errors
ticket_with_errors = {
    "user_name": "Patricia Lee",
    "employee_id": "12345",  # Wrong format (missing EMP- prefix)
    "contact_email": "patricia.lee.company.com",  # Missing @
    "contact_phone": "555",  # Too short
    "asset_tag": "PC",  # Too short
    "issue_summary": "Computer won't start"
}

print("🧪 Testing Format Validation\n")
print("Ticket data:")
print(json.dumps(ticket_with_errors, indent=2))

# Validate formats
result = validate_data_formats(ticket_with_errors)

print(f"\n📋 Format Validation Result:")
print(f"  Valid: {result['is_valid']}")
print(f"  Message: {result['message']}")

if not result['is_valid']:
    print(f"\n❌ Format Errors Detected:")
    for i, error in enumerate(result['errors'], 1):
        print(f"  {i}. {error}")
    print(f"\n⚠️ Action needed: Data needs correction before processing")
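As the number of fields grows, a chain of `if` blocks becomes harder to maintain. A table-driven variant is sketched here as one alternative. The `FORMAT_RULES` mapping and `validate_with_rules` are hypothetical names, only the `EMP-####` pattern comes from the tutorial, and the email pattern is a deliberately loose assumption rather than a full address validator.

```python
import re

# Field name -> (compiled pattern, human-readable description)
FORMAT_RULES = {
    "employee_id": (re.compile(r"EMP-\d{4}"), "EMP-####"),
    "contact_email": (re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"), "name@domain.tld"),
}

def validate_with_rules(data, rules=FORMAT_RULES):
    """Check each present, non-empty field against its pattern.

    fullmatch requires the whole value to match, so no ^ or $ anchors are needed.
    """
    errors = []
    for field, (pattern, description) in rules.items():
        value = data.get(field)
        if value and not pattern.fullmatch(str(value)):
            errors.append(f"{field}: {value!r} does not match {description}")
    return errors


print(validate_with_rules({
    "employee_id": "12345",
    "contact_email": "patricia.lee@company.com",
}))
```

Adding a new format check then means adding one entry to the mapping instead of another branch in the function.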

## Technique 3: Confidence Scoring

Ask the LLM to rate its own confidence and identify uncertain extractions.

In [None]:
# Mock data: Vague, incomplete ticket
vague_ticket = """
From: someone@company.com
Subject: Help

Something's wrong with my computer. It's not working right.
Can you fix it?
"""

# Extraction with confidence scoring
extraction_prompt = f"""
Extract support ticket information from this email. Include a confidence score.

Return JSON with these fields:
- user_name: Extract if possible, use null if not found
- employee_id: Extract if mentioned, use null otherwise
- contact_email: Email address
- issue_summary: Brief summary of the issue
- urgency: Low/Medium/High/Critical (infer from context)
- confidence_score: Your confidence in this extraction (0-100)
- missing_fields: List of critical information that's missing
- notes: Any concerns or ambiguities about the extraction

Email:
{vague_ticket}

Return ONLY valid JSON.
"""

print("🔄 Extracting vague ticket with confidence scoring...\n")

response = client.responses.create(
    model=OPENAI_MODEL,
    input=extraction_prompt,
    text={"verbosity": "medium"}
)

extraction_json = response.output_text.strip()
extraction_data = json.loads(extraction_json)

print("📋 Extracted Data with Confidence:")
print("=" * 80)
print(json.dumps(extraction_data, indent=2))
print("=" * 80)

# Check confidence threshold
CONFIDENCE_THRESHOLD = 70

print(f"\n🎯 Confidence Analysis:")
print(f"  Confidence Score: {extraction_data.get('confidence_score', 0)}/100")
print(f"  Threshold: {CONFIDENCE_THRESHOLD}/100")

if extraction_data.get('confidence_score', 0) < CONFIDENCE_THRESHOLD:
    print(f"\n⚠️ LOW CONFIDENCE - Human Review Required!")
    print(f"  Missing fields: {', '.join(extraction_data.get('missing_fields', []))}")
    print(f"  Notes: {extraction_data.get('notes', 'N/A')}")
    print(f"\n  Action: Request more information from user")
else:
    print(f"\n✅ High confidence - Safe to process automatically")

print(f"\n📊 Tokens used: {response.usage.total_tokens}")
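The cell above passes the model output straight to `json.loads` and compares `confidence_score` to the threshold as if it were always a number. Neither is guaranteed: a response may arrive wrapped in a Markdown code fence, and the score may come back as `null` or a string. The sketch below shows one defensive approach. Both helper names are hypothetical, and it treats a self-reported score as a routing heuristic rather than a calibrated probability.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker

def parse_model_json(raw_text):
    """Parse JSON from model output, tolerating a surrounding multi-line code fence.

    Returns (data, error): data is the parsed object or None, error is a message or None.
    """
    text = raw_text.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        # Drop the opening fence line (which may carry a language tag) and the closing fence
        text = "\n".join(text.splitlines()[1:-1])
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, f"Invalid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"

def needs_review(data, threshold=70):
    """Send to human review when the score is missing, non-numeric, or below the threshold."""
    score = data.get("confidence_score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return True
    return score < threshold


# Hypothetical fenced response
fenced_output = FENCE + 'json\n{"user_name": null, "confidence_score": 35}\n' + FENCE
data, error = parse_model_json(fenced_output)
print(data, error)
print("Needs review:", needs_review(data))
```

Returning an error message instead of raising keeps a single bad response from stopping the notebook, and the caller decides whether to retry or escalate.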

---

# 💾 File Generation

Extracted data should be saved for downstream use, record-keeping, and team sharing.

---

## 🎯 Why Save Extracted Data?

**Use cases for saved files:**
- 📊 **Import to other systems** - Databases, ticketing systems, spreadsheets
- 📁 **Record keeping** - Maintain audit trails and historical records
- 🔄 **Batch processing** - Process multiple extractions together
- 👥 **Team sharing** - Share structured data with colleagues
- 📈 **Analysis** - Aggregate data for reporting and insights

Let's explore different file saving patterns:

## Pattern 1: Saving JSON Files with Timestamps

Create unique filenames with timestamps to avoid overwriting data.

In [None]:
def save_json_with_timestamp(data, prefix="extracted_data"):
    """
    Save JSON data with timestamp in filename.
    
    Args:
        data (dict): Data to save
        prefix (str): Filename prefix
        
    Returns:
        str: Path to saved file
    """
    # Generate timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    # Create filename
    filename = f"/content/{prefix}_{timestamp}.json"
    
    # Save file
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)
    
    return filename

# Example: Save a ticket extraction
example_ticket = {
    "ticket_id": "TKT-2024-0157",
    "user_name": "Emily Chen",
    "employee_id": "EMP-8821",
    "department": "Finance",
    "issue_summary": "Excel crashes when opening large files",
    "urgency": "Medium",
    "extracted_at": datetime.now().isoformat()
}

print("💾 Saving JSON with timestamp...\n")

# Save the file
file_path = save_json_with_timestamp(example_ticket, prefix="ticket_extraction")

print(f"✅ Saved to: {file_path}")
print(f"\n📋 File contents:")
with open(file_path, 'r') as f:
    print(f.read())

# Demonstrate saving multiple extractions
print("\n" + "="*80)
print("Saving multiple extractions...\n")

import time

for i in range(3):
    ticket = {
        "ticket_id": f"TKT-2024-{100 + i}",
        "user_name": f"User {i+1}",
        "issue": f"Issue {i+1}"
    }
    path = save_json_with_timestamp(ticket, prefix="ticket")
    print(f"  Saved: {path}")
    time.sleep(1)  # Small delay to ensure different timestamps

print("\n✅ Each file has unique timestamp - no overwrites!")
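A second-resolution timestamp is unique only if saves are at least a second apart, which is why the loop above needs `time.sleep(1)`. A sketch of one way to drop that requirement follows: add microseconds and a short random suffix to the name. `save_json_unique` is a hypothetical variant of `save_json_with_timestamp`, and the temporary directory stands in for `/content/`.

```python
import json
import tempfile
import uuid
from datetime import datetime
from pathlib import Path

def save_json_unique(data, directory, prefix="extracted_data"):
    """Save data as JSON under a name that stays unique even within the same second."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")  # %f adds microseconds
    suffix = uuid.uuid4().hex[:8]                            # short random tie-breaker
    path = Path(directory) / f"{prefix}_{timestamp}_{suffix}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    return str(path)


# Three saves in a tight loop, no sleep required
out_dir = tempfile.mkdtemp()  # stand-in for /content/
paths = [
    save_json_unique({"ticket_id": f"TKT-2024-{100 + i}"}, out_dir, prefix="ticket")
    for i in range(3)
]
for p in paths:
    print("Saved:", p)
```

The timestamp keeps the files sortable by creation time, while the random suffix covers parallel workers that might write in the same microsecond.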

## Pattern 2: Saving CSV Files

Convert extracted data to CSV format for spreadsheet compatibility.

In [None]:
def save_to_csv(data_list, filename, fieldnames=None):
    """
    Save list of dictionaries as CSV file.
    
    Args:
        data_list (list): List of dictionaries to save
        filename (str): Output filename
        fieldnames (list): Optional list of field names (uses first dict keys if None)
        
    Returns:
        str: Path to saved file
    """
    if not data_list:
        raise ValueError("data_list cannot be empty")
    
    # Use provided fieldnames or extract from first dictionary
    if fieldnames is None:
        fieldnames = list(data_list[0].keys())
    
    # Ensure filename includes path
    if not filename.startswith('/content/'):
        filename = f"/content/{filename}"
    
    # Write CSV
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data_list)
    
    return filename

# Example: Save inventory data
inventory_items = [
    {
        "room": "Conference A",
        "device_type": "Display",
        "manufacturer": "Samsung",
        "model": "QN75",
        "serial_number": "SAM-001",
        "quantity": 1
    },
    {
        "room": "Conference A",
        "device_type": "Camera",
        "manufacturer": "Logitech",
        "model": "Rally Bar",
        "serial_number": "LOG-445",
        "quantity": 1
    },
    {
        "room": "Conference B",
        "device_type": "Display",
        "manufacturer": "LG",
        "model": "OLED65",
        "serial_number": "LG-2293",
        "quantity": 1
    }
]

print("💾 Saving inventory data to CSV...\n")

csv_path = save_to_csv(
    inventory_items, 
    "inventory_export.csv",
    fieldnames=["room", "device_type", "manufacturer", "model", "serial_number", "quantity"]
)

print(f"✅ Saved to: {csv_path}")

# Read and display
df = pd.read_csv(csv_path)
print(f"\n📊 CSV Contents ({len(df)} rows):")

# Display as DataFrame (better formatting in Jupyter)
display(df)

print(f"\n💡 This CSV can be opened in Excel or imported into inventory systems!")
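`save_to_csv` takes its default column list from the first dictionary only. With extracted data, later rows may carry keys the first row lacks, and `csv.DictWriter` raises `ValueError` on unexpected keys by default. A minimal sketch of one way to handle that is to build the header from the union of all keys and fill gaps with empty strings. The helper names and sample rows below are illustrative.

```python
import csv
import io

def collect_fieldnames(rows):
    """Union of keys across all rows, in first-seen order."""
    seen = {}
    for row in rows:
        for key in row:
            seen.setdefault(key, None)
    return list(seen)

def rows_to_csv_text(rows, fieldnames=None):
    """Serialize dictionaries to CSV text; missing values become empty cells."""
    fieldnames = fieldnames or collect_fieldnames(rows)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Rows with different keys; one value contains a comma and gets quoted automatically
rows = [
    {"room": "Conference A", "device_type": "Display", "serial_number": "SAM-001"},
    {"room": "Conference B", "device_type": "Camera", "location_detail": "Wall, left side"},
]
print(rows_to_csv_text(rows))
```

Letting the `csv` module do the quoting is also safer than asking the model to follow quoting rules in the prompt.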

## Pattern 3: Appending to Existing Files

Build up a log file with multiple extractions over time.

In [None]:
def append_to_ticket_log(ticket_data, log_file="/content/ticket_log.json"):
    """
    Append ticket data to existing log file.
    
    Args:
        ticket_data (dict): Ticket data to append
        log_file (str): Path to log file
        
    Returns:
        dict: Status with total count
    """
    # Add timestamp to ticket
    ticket_data['logged_at'] = datetime.now().isoformat()
    
    # Check if file exists
    if os.path.exists(log_file):
        # Load existing data
        with open(log_file, 'r') as f:
            log_data = json.load(f)
    else:
        # Create new log structure
        log_data = {
            "log_created": datetime.now().isoformat(),
            "tickets": []
        }
    
    # Append new ticket
    log_data["tickets"].append(ticket_data)
    log_data["total_tickets"] = len(log_data["tickets"])
    log_data["last_updated"] = datetime.now().isoformat()
    
    # Save back to file
    with open(log_file, 'w') as f:
        json.dump(log_data, f, indent=2)
    
    return {
        "success": True,
        "total_tickets": log_data["total_tickets"],
        "log_file": log_file
    }

print("📝 Demonstrating ticket log appending...\n")

# Simulate processing multiple tickets
tickets_to_process = [
    {
        "ticket_id": "TKT-001",
        "user_name": "Alice Smith",
        "issue": "Password reset needed",
        "urgency": "Low"
    },
    {
        "ticket_id": "TKT-002", 
        "user_name": "Bob Johnson",
        "issue": "VPN not connecting",
        "urgency": "High"
    },
    {
        "ticket_id": "TKT-003",
        "user_name": "Carol Davis",
        "issue": "Printer offline",
        "urgency": "Medium"
    }
]

# Process and log each ticket
for ticket in tickets_to_process:
    result = append_to_ticket_log(ticket)
    print(f"✅ Logged {ticket['ticket_id']} - Total tickets in log: {result['total_tickets']}")
    time.sleep(0.5)  # Small delay

# Read and display final log
print(f"\n📋 Final Ticket Log:")
print("=" * 80)
with open("/content/ticket_log.json", 'r') as f:
    log_contents = json.load(f)

print(f"Log created: {log_contents['log_created']}")
print(f"Last updated: {log_contents['last_updated']}")
print(f"Total tickets: {log_contents['total_tickets']}\n")

print("Tickets in log:")
for i, ticket in enumerate(log_contents['tickets'], 1):
    print(f"  {i}. {ticket['ticket_id']}: {ticket['issue']} (Urgency: {ticket['urgency']})")

print("=" * 80)
print(f"\n💡 Log file grows with each append - perfect for ongoing ticket tracking!")
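`append_to_ticket_log` reads and rewrites the whole file on every call, so the cost grows with the log, and an interrupted write can leave the file unreadable. A common alternative is the JSON Lines layout, where each record is one line and new records are written in append mode. The sketch below assumes that layout is acceptable downstream. The helper names are hypothetical, and a temporary directory stands in for `/content/`.

```python
import json
import os
import tempfile
from datetime import datetime

def append_jsonl(record, log_file):
    """Append one record as a single JSON line; existing lines are never rewritten."""
    # Copy the record so the caller's dictionary is not modified
    entry = dict(record, logged_at=datetime.now().isoformat())
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def read_jsonl(log_file):
    """Load every non-blank line of the log as a dictionary."""
    with open(log_file, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


log_path = os.path.join(tempfile.mkdtemp(), "ticket_log.jsonl")
for ticket in [{"ticket_id": "TKT-001"}, {"ticket_id": "TKT-002"}]:
    append_jsonl(ticket, log_path)

records = read_jsonl(log_path)
print(f"Total tickets in log: {len(records)}")
```

The trade-off is that summary fields such as `total_tickets` are no longer stored in the file and have to be computed when reading.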

---

# ✅ Progress Recap

## Sections Covered:
5. ✅ Practical Examples (Continued)
   - Example 2: Multiple Error Extraction (JSON Arrays)
   - Example 3: Hardware Inventory (CSV Format)
6. ✅ Data Validation
   - Technique 1: Required Fields Check
   - Technique 2: Data Type & Format Validation
   - Technique 3: Confidence Scoring
7. ✅ File Generation
   - Pattern 1: JSON Files with Timestamps
   - Pattern 2: CSV File Export
   - Pattern 3: Appending to Log Files

## Additional Files Saved:
- `/content/multiple_errors.json` - Multiple error extraction from single report
- `/content/conference_room_inventory.csv` - Conference room equipment inventory
- `/content/ticket_extraction_YYYYMMDD_HHMMSS.json` - Timestamped ticket files
- `/content/inventory_export.csv` - Exported inventory data
- `/content/ticket_log.json` - Appended ticket log

## What You've Learned:
✅ Extract multiple items into structured arrays  
✅ Generate CSV files for spreadsheet import  
✅ Validate extracted data (required fields, formats, confidence)  
✅ Save data with timestamps for uniqueness  
✅ Append to log files for ongoing tracking  
✅ Handle low-confidence extractions appropriately  

---

# 🔄 Batch Processing

Often you need to process multiple items at once - backlogs, audits, migrations. Let's build a batch processor with progress tracking.

---

## 🎯 Why Batch Processing?

**Use cases:**
- 📬 **Process email backlog** - Extract data from hundreds of support emails
- 📊 **Audit existing tickets** - Re-extract data with improved prompts
- 🔄 **Data migration** - Convert old formats to new structured data
- 📈 **Bulk analysis** - Extract insights from large document sets

**Benefits:**
- ⚡ More efficient than one-by-one processing
- 📊 Progress tracking with visual feedback
- 🐛 Error tracking - separate successes from failures
- 📁 Organized output with summary reports

In [None]:
# Create array of 5 different mock support tickets
ticket_batch = [
    """From: anna.rodriguez@company.com
    Subject: Laptop Battery Draining Fast
    Hi, I'm Anna Rodriguez (EMP-3401) from Marketing. My laptop battery only lasts 30 minutes now.
    This is urgent as I have client meetings all day tomorrow. Please help!""",
    
    """From: kevin.park@company.com
    Subject: Cannot Connect to WiFi
    Kevin Park here, Sales dept, EMP-5623. My laptop won't connect to the office WiFi.
    I've tried restarting but it still doesn't work. Medium priority.""",
    
    """From: lisa.chen@company.com  
    Subject: Printer Not Working
    Lisa Chen, Finance, EMP-7834. The printer on floor 3 shows 'Paper Jam' but there's no paper stuck.
    Multiple people are affected. Needs fixing ASAP!""",
    
    """From: marcus.johnson@company.com
    Subject: Slow Computer Performance
    Marcus Johnson, IT Support team member EMP-2019. My workstation is running extremely slow.
    All applications take forever to load. Low priority but annoying.""",
    
    """From: sophia.williams@company.com
    Subject: Email Account Locked
    Sophia Williams, HR department, EMP-9102. I got locked out after entering wrong password.
    Need access urgently for payroll processing today. CRITICAL!"""
]

def batch_extract_tickets(ticket_list):
    """
    Process multiple tickets in batch with progress tracking.
    
    Args:
        ticket_list (list): List of ticket text strings
        
    Returns:
        dict: Results with successes, failures, and summary
    """
    results = []
    failures = []
    
    # Process with progress bar
    for i, ticket_text in enumerate(tqdm(ticket_list, desc="Processing tickets")):
        try:
            # Extraction prompt
            extraction_prompt = f"""
            Extract ticket information from this email. Return ONLY valid JSON.
            
            {{
              "ticket_id": "TKT-{datetime.now().strftime('%Y')}-{str(i+1).zfill(4)}",
              "user_name": "full name",
              "employee_id": "employee ID",
              "department": "department name",
              "contact_email": "email address",
              "issue_summary": "brief summary",
              "urgency": "Low/Medium/High/Critical"
            }}
            
            Email:
            {ticket_text}
            """
            
            # Make API call
            response = client.responses.create(
                model=OPENAI_MODEL,
                input=extraction_prompt,
                text={"verbosity": "low"}
            )
            
            # Parse JSON
            extracted_data = json.loads(response.output_text.strip())
            
            # Add metadata
            extracted_data['processed_at'] = datetime.now().isoformat()
            extracted_data['batch_index'] = i + 1
            
            results.append(extracted_data)
            
        except Exception as e:
            failures.append({
                "batch_index": i + 1,
                "error": str(e),
                "ticket_preview": ticket_text[:100] + "..."
            })
    
    return {
        "total_processed": len(ticket_list),
        "successful": len(results),
        "failed": len(failures),
        # Guard against division by zero when the batch is empty
        "success_rate": f"{(len(results)/len(ticket_list)*100):.1f}%" if ticket_list else "0.0%",
        "results": results,
        "failures": failures
    }

print("🔄 Starting batch processing of 5 tickets...\n")

# Process the batch
batch_results = batch_extract_tickets(ticket_batch)

print(f"\n✅ Batch Processing Complete!\n")
print(f"📊 Summary:")
print(f"  • Total tickets: {batch_results['total_processed']}")
print(f"  • Successful: {batch_results['successful']}")
print(f"  • Failed: {batch_results['failed']}")
print(f"  • Success rate: {batch_results['success_rate']}")

# Display successful extractions
if batch_results['results']:
    print(f"\n📋 Successfully Extracted Tickets:")
    for ticket in batch_results['results']:
        print(f"  {ticket['ticket_id']}: {ticket['user_name']} - {ticket['issue_summary']} ({ticket['urgency']})")

# Display failures if any
if batch_results['failures']:
    print(f"\n❌ Failed Extractions:")
    for failure in batch_results['failures']:
        print(f"  Ticket {failure['batch_index']}: {failure['error']}")
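In `batch_extract_tickets`, a ticket that fails once goes straight into `failures`, even when the cause is temporary. A retry wrapper with exponential backoff is one common way to reduce such failures. This is a sketch under stated assumptions: `call_with_retries` is a hypothetical helper, `flaky_extract` is a stand-in for the real API call, and the wrapper retries on every exception, whereas a real pipeline would probably retry only on transient errors such as timeouts or rate limits.

```python
import time

def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func, retrying on any exception with exponential backoff.

    Re-raises the last exception once max_attempts is exhausted.
    The sleep parameter is injectable so the demo below does not actually wait.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...


# Stand-in for the extraction call: fails twice, then succeeds
attempts = {"count": 0}

def flaky_extract(text):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return {"issue_summary": text}

delays = []  # records the requested waits instead of sleeping
result = call_with_retries(flaky_extract, "VPN not connecting", sleep=delays.append)
print(result, delays)
```

Inside the batch loop, the API call and JSON parsing could be moved into one function and passed to the wrapper, so that only tickets failing every attempt end up in `failures`.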