# **Advanced Output Parsing with LangChain**

## **Learning Objectives**
By the end of this notebook, you will be able to:
- Master all types of output parsers in LangChain
- Handle complex nested data structures
- Implement custom parsers for specific needs
- Fix parsing errors and handle edge cases
- Use structured output with function calling
- Validate and transform parsed data

## **Why This Matters: Structured Data from Unstructured Text**

**In Production Systems:**
- Convert LLM responses to database records
- Extract structured data for APIs
- Ensure type safety and validation

**In Data Processing:**
- Parse documents into structured formats
- Extract entities and relationships
- Transform text into actionable data

**In Application Integration:**
- Feed LLM outputs to downstream systems
- Maintain data contracts between components
- Enable reliable automation

## **Prerequisites**
- Completed notebooks 00-03
- Understanding of JSON and data structures
- Basic knowledge of Pydantic (helpful but not required)

## **Setup: Install and Import Dependencies**

Run this cell first to set up your environment:

In [None]:
# Install required packages
!pip install -q langchain langchain-openai pydantic python-dotenv pandas

# Import necessary modules
import os
import json
from dotenv import load_dotenv
from typing import List, Dict, Optional, Union
from datetime import datetime
from enum import Enum

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import (
    StrOutputParser,
    JsonOutputParser,
    PydanticOutputParser,
    CommaSeparatedListOutputParser,
)
# StructuredOutputParser and ResponseSchema live in the langchain package, not langchain_core
from langchain.output_parsers import (
    OutputFixingParser,
    RetryOutputParser,
    StructuredOutputParser,
    ResponseSchema,
)
from pydantic import BaseModel, Field, validator, ValidationError

# Load environment variables
load_dotenv()

# Initialize LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Verify setup
if os.getenv("OPENAI_API_KEY"):
    print("✅ Environment ready! Let's master output parsing.")
else:
    print("⚠️ Please set your OPENAI_API_KEY")

---

## **Instructor Activity 1: Advanced Pydantic Parsing**

**Concept**: Use Pydantic's advanced features for complex data validation and transformation.

### **Example 1: Nested Models with Validation**

**Problem**: Parse complex nested structures with validation
**Expected Output**: Validated nested data models

In [None]:
# Empty cell for demonstration

<details>
<summary>Solution</summary>

```python
from pydantic import BaseModel, Field, validator
from typing import List, Optional
from datetime import datetime
from enum import Enum
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# **Define enums for valid values**
class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class TaskStatus(str, Enum):
    TODO = "todo"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    BLOCKED = "blocked"

# **Define nested models**
class Subtask(BaseModel):
    title: str = Field(description="Subtask title")
    completed: bool = Field(description="Whether subtask is completed")
    estimated_hours: float = Field(description="Estimated hours to complete", gt=0, le=40)

class Assignee(BaseModel):
    name: str = Field(description="Person's name")
    email: str = Field(description="Email address")
    department: str = Field(description="Department name")
    
    @validator('email')
    def email_must_be_valid(cls, v):
        if '@' not in v:
            raise ValueError('Invalid email address')
        return v.lower()

class ProjectTask(BaseModel):
    task_id: str = Field(description="Unique task identifier")
    title: str = Field(description="Task title")
    description: str = Field(description="Detailed task description")
    priority: Priority = Field(description="Task priority level")
    status: TaskStatus = Field(description="Current task status")
    assignee: Assignee = Field(description="Person assigned to the task")
    subtasks: List[Subtask] = Field(description="List of subtasks")
    due_date: str = Field(description="Due date in YYYY-MM-DD format")
    tags: List[str] = Field(description="Task tags for categorization")
    estimated_total_hours: Optional[float] = Field(None, description="Total estimated hours")
    
    @validator('due_date')
    def validate_date_format(cls, v):
        try:
            datetime.strptime(v, '%Y-%m-%d')
        except ValueError:
            raise ValueError('Date must be in YYYY-MM-DD format')
        return v
    
    @validator('subtasks')
    def at_least_one_subtask(cls, v):
        if len(v) == 0:
            raise ValueError('At least one subtask required')
        return v
    
    @validator('estimated_total_hours', always=True)
    def calculate_total_hours(cls, v, values):
        if 'subtasks' in values:
            return sum(st.estimated_hours for st in values['subtasks'])
        return v

# **Create parser**
task_parser = PydanticOutputParser(pydantic_object=ProjectTask)

# **Create prompt**
prompt = ChatPromptTemplate.from_template(
    """Extract project task information from this description:
    
    {description}
    
    {format_instructions}
    """
)

# **Build chain**
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = prompt | llm | task_parser

# **Test with complex task description**
description = """Task #DEV-123: Implement User Authentication System

This is a high-priority task to implement a complete user authentication system.
It's currently in progress and assigned to John Smith (john.smith@company.com) from the Engineering department.

The task needs to be completed by 2024-04-15 and includes:
1. Design database schema (3 hours) - completed
2. Implement login API (5 hours) - not completed
3. Create password reset flow (4 hours) - not completed
4. Add two-factor authentication (6 hours) - not completed

Tags: security, backend, authentication, api
"""

# **Parse the task**
result = chain.invoke({
    "description": description,
    "format_instructions": task_parser.get_format_instructions()
})

print("Complex Nested Model Parsing:")
print("=" * 50)
print(f"📋 Task: {result.title}")
print(f"🔑 ID: {result.task_id}")
print(f"📊 Status: {result.status.value}")
print(f"🔴 Priority: {result.priority.value}")
print(f"📅 Due: {result.due_date}")
print(f"\n👤 Assignee:")
print(f"  Name: {result.assignee.name}")
print(f"  Email: {result.assignee.email}")
print(f"  Dept: {result.assignee.department}")
print(f"\n📝 Subtasks ({len(result.subtasks)}):")
for i, subtask in enumerate(result.subtasks, 1):
    status = "✅" if subtask.completed else "⏳"
    print(f"  {i}. {status} {subtask.title} ({subtask.estimated_hours}h)")
print(f"\n⏰ Total Estimated Hours: {result.estimated_total_hours}")
print(f"🏷️ Tags: {', '.join(result.tags)}")

print("\n✅ Complex nested validation successful!")
```

**Advanced Pydantic features:**
- Nested model composition
- Enum validation for fixed choices
- Custom validators for business logic
- Computed fields from other fields

</details>
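
The solution above uses Pydantic's V1-style `@validator` decorator, which Pydantic 2.x still accepts but marks as deprecated. Below is a minimal sketch of the same two ideas (a normalizing field check and a total computed from the subtasks) in the 2.x style. It assumes Pydantic 2.x is installed, and the trimmed-down `Subtask` and `Task` models are illustrative stand-ins, not the notebook's full `ProjectTask`:

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError, field_validator, model_validator


class Subtask(BaseModel):
    title: str
    estimated_hours: float = Field(gt=0, le=40)


class Task(BaseModel):
    # Hypothetical, reduced version of ProjectTask for illustration only.
    assignee_email: str
    subtasks: List[Subtask]
    estimated_total_hours: Optional[float] = None

    @field_validator("assignee_email")
    @classmethod
    def email_must_contain_at(cls, v: str) -> str:
        # Same deliberately loose check as the notebook: only look for "@".
        if "@" not in v:
            raise ValueError("Invalid email address")
        return v.lower()

    @model_validator(mode="after")
    def fill_total_hours(self) -> "Task":
        # Runs after all fields are validated, so subtasks are Subtask objects.
        self.estimated_total_hours = sum(st.estimated_hours for st in self.subtasks)
        return self


task = Task(
    assignee_email="John.Smith@Company.com",
    subtasks=[
        {"title": "Design database schema", "estimated_hours": 3},
        {"title": "Implement login API", "estimated_hours": 5},
    ],
)
print(task.assignee_email, task.estimated_total_hours)

try:
    Task(assignee_email="not-an-email", subtasks=[])
except ValidationError as exc:
    print(f"{exc.error_count()} validation error(s)")
```

Because nothing here calls an LLM, the same pattern is a convenient way to unit-test validators before wiring a model into a chain.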

### **Example 2: Handling Lists and Optional Fields**

**Problem**: Parse variable structures with optional data
**Expected Output**: Flexible data models

In [None]:
# Empty cell for demonstration

<details>
<summary>Solution</summary>

```python
from pydantic import BaseModel, Field
from typing import List, Optional, Union
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate

# **Define flexible models**
class ContactInfo(BaseModel):
    type: str = Field(description="Contact type: email, phone, social")
    value: str = Field(description="Contact value")
    preferred: bool = Field(default=False, description="Is this preferred contact method")

class Address(BaseModel):
    street: Optional[str] = Field(None, description="Street address")
    city: str = Field(description="City name")
    state: Optional[str] = Field(None, description="State/Province")
    country: str = Field(description="Country")
    postal_code: Optional[str] = Field(None, description="Postal/ZIP code")

class Experience(BaseModel):
    company: str = Field(description="Company name")
    position: str = Field(description="Job title")
    duration: str = Field(description="Duration of employment")
    responsibilities: List[str] = Field(default_factory=list, description="Key responsibilities")

class Skill(BaseModel):
    name: str = Field(description="Skill name")
    level: str = Field(description="Proficiency level: beginner, intermediate, expert")
    years: Optional[int] = Field(None, description="Years of experience")

class UserProfile(BaseModel):
    name: str = Field(description="Full name")
    title: Optional[str] = Field(None, description="Professional title")
    contacts: List[ContactInfo] = Field(description="Contact information")
    address: Optional[Address] = Field(None, description="Physical address")
    skills: List[Skill] = Field(default_factory=list, description="Professional skills")
    experience: List[Experience] = Field(default_factory=list, description="Work experience")
    languages: List[str] = Field(default_factory=list, description="Languages spoken")
    certifications: Optional[List[str]] = Field(None, description="Professional certifications")
    bio: Optional[str] = Field(None, description="Short biography")

# **Create parser**
profile_parser = PydanticOutputParser(pydantic_object=UserProfile)

# **Create prompt**
prompt = ChatPromptTemplate.from_template(
    """Extract user profile information from this text:
    
    {text}
    
    Note: Some fields might be missing - that's okay.
    {format_instructions}
    """
)

# **Build chain**
chain = prompt | llm | profile_parser

# **Test with different profile completeness**
profiles = [
    """Sarah Johnson is a Senior Data Scientist based in San Francisco, CA, USA.
    You can reach her at sarah@email.com (preferred) or phone: 555-0123.
    She has expert-level Python skills (5 years) and intermediate SQL knowledge.
    Previously worked at TechCorp as Lead Analyst for 3 years, managing data pipelines.
    She speaks English and Spanish, and holds AWS and Google Cloud certifications.""",
    
    """Mike Chen - Software Developer
    Contact: mike.chen@gmail.com
    Lives in Toronto, Canada
    Knows JavaScript (expert), React (intermediate), Node.js (intermediate)"""
]

print("Flexible Model Parsing with Optional Fields:")
print("=" * 50)

for i, profile_text in enumerate(profiles, 1):
    print(f"\n📄 Profile {i}:")
    
    result = chain.invoke({
        "text": profile_text,
        "format_instructions": profile_parser.get_format_instructions()
    })
    
    print(f"👤 Name: {result.name}")
    if result.title:
        print(f"💼 Title: {result.title}")
    
    print(f"📞 Contacts:")
    for contact in result.contacts:
        pref = " (preferred)" if contact.preferred else ""
        print(f"  - {contact.type}: {contact.value}{pref}")
    
    if result.address:
        print(f"📍 Location: {result.address.city}, {result.address.country}")
    
    if result.skills:
        print(f"🎯 Skills:")
        for skill in result.skills:
            years = f" ({skill.years} years)" if skill.years else ""
            print(f"  - {skill.name}: {skill.level}{years}")
    
    if result.certifications:
        print(f"🏆 Certifications: {', '.join(result.certifications)}")
    
    print(f"\nFields populated: {sum(1 for field, value in result.dict().items() if value is not None)}")

print("\n✅ Flexible parsing handles missing fields gracefully!")
```

**Optional field benefits:**
- Handles incomplete data gracefully
- Provides defaults where appropriate
- Flexible for real-world data
- No validation errors when optional fields are missing

</details>
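
One caveat about the `Fields populated` count in the solution: it only checks `is not None`, so list fields that fell back to their empty default still count as populated. Below is a small offline sketch (no LLM call) that counts only fields carrying real data. The `Profile` model and the `populated_fields` helper are hypothetical, trimmed-down illustrations:

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Profile(BaseModel):
    # Hypothetical, reduced stand-in for the notebook's UserProfile.
    name: str
    title: Optional[str] = None
    languages: List[str] = Field(default_factory=list)
    certifications: Optional[List[str]] = None


def populated_fields(model: BaseModel) -> List[str]:
    """Return names of fields that hold data (not None and not an empty list)."""
    return [name for name, value in dict(model).items() if value not in (None, [])]


sparse = Profile(name="Mike Chen")
full = Profile(
    name="Sarah Johnson",
    title="Senior Data Scientist",
    languages=["English", "Spanish"],
    certifications=["AWS"],
)

print(populated_fields(sparse))
print(populated_fields(full))
```

Whether an empty list should count as "populated" is a design choice. The sketch treats it as missing, which matches how a reader would judge profile completeness.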

### **Example 3: Union Types and Polymorphic Models**

**Problem**: Parse data that can be one of several types
**Expected Output**: Polymorphic data handling

In [None]:
# Empty cell for demonstration

<details>
<summary>Solution</summary>

```python
from pydantic import BaseModel, Field, validator
from typing import Union, List, Literal, Optional
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate

# **Define different event types**
class MeetingEvent(BaseModel):
    event_type: Literal["meeting"] = "meeting"
    title: str
    participants: List[str]
    duration_minutes: int
    meeting_link: Optional[str] = None

class DeadlineEvent(BaseModel):
    event_type: Literal["deadline"] = "deadline"
    title: str
    deliverable: str
    responsible_person: str

class ReminderEvent(BaseModel):
    event_type: Literal["reminder"] = "reminder"
    title: str
    message: str
    priority: str

# **Union type for any event**
Event = Union[MeetingEvent, DeadlineEvent, ReminderEvent]

class Calendar(BaseModel):
    date: str = Field(description="Date in YYYY-MM-DD format")
    events: List[Event] = Field(description="List of events for the day")
    total_events: int = Field(default=0)
    
    @validator('total_events', always=True)
    def count_events(cls, v, values):
        if 'events' in values:
            return len(values['events'])
        return 0

# **Create parser**
calendar_parser = PydanticOutputParser(pydantic_object=Calendar)

# **Create prompt**
prompt = ChatPromptTemplate.from_template(
    """Extract calendar events from this schedule:
    
    {schedule}
    
    Identify the type of each event (meeting, deadline, or reminder).
    {format_instructions}
    """
)

# **Test with mixed event types**
schedule = """Schedule for 2024-04-20:

- Team standup meeting at 9 AM with John, Sarah, and Mike for 30 minutes (Zoom link: zoom.us/123)
- DEADLINE: Submit quarterly report - Sarah is responsible
- Product demo meeting at 2 PM with clients for 60 minutes
- REMINDER: Review pull requests (high priority)
- DEADLINE: Deploy v2.0 to production - Mike is responsible
"""

chain = prompt | llm | calendar_parser
result = chain.invoke({
    "schedule": schedule,
    "format_instructions": calendar_parser.get_format_instructions()
})

print("Polymorphic Event Parsing:")
print("=" * 50)
print(f"📅 Date: {result.date}")
print(f"📊 Total Events: {result.total_events}\n")

for i, event in enumerate(result.events, 1):
    print(f"Event {i} - Type: {event.event_type}")
    print(f"  Title: {event.title}")
    
    if event.event_type == "meeting":
        print(f"  Participants: {', '.join(event.participants)}")
        print(f"  Duration: {event.duration_minutes} min")
        if event.meeting_link:
            print(f"  Link: {event.meeting_link}")
    elif event.event_type == "deadline":
        print(f"  Deliverable: {event.deliverable}")
        print(f"  Responsible: {event.responsible_person}")
    elif event.event_type == "reminder":
        print(f"  Message: {event.message}")
        print(f"  Priority: {event.priority}")
    print()

print("✅ Union types handle different event structures!")
```

**Union type advantages:**
- Handle multiple data shapes
- Type-safe polymorphism
- Type discrimination through the `event_type` literal on each model
- Clean data modeling

</details>
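
With a plain `Union`, Pydantic has to work out which model fits each event. Below is a minimal sketch that makes the choice explicit by declaring `event_type` as a discriminator. It assumes Pydantic 1.9 or later and Python 3.9 or later (for `typing.Annotated`). The two-model union and the sample payload are reduced illustrations of the notebook's `Calendar`:

```python
from typing import Annotated, List, Literal, Union

from pydantic import BaseModel, Field


class MeetingEvent(BaseModel):
    event_type: Literal["meeting"]
    title: str
    duration_minutes: int


class DeadlineEvent(BaseModel):
    event_type: Literal["deadline"]
    title: str
    responsible_person: str


# The discriminator tells Pydantic to pick the model from event_type directly.
Event = Annotated[Union[MeetingEvent, DeadlineEvent], Field(discriminator="event_type")]


class Calendar(BaseModel):
    date: str
    events: List[Event]


calendar = Calendar(
    date="2024-04-20",
    events=[
        {"event_type": "meeting", "title": "Team standup", "duration_minutes": 30},
        {"event_type": "deadline", "title": "Quarterly report", "responsible_person": "Sarah"},
    ],
)

for event in calendar.events:
    print(type(event).__name__, "-", event.title)
```

A discriminator also tends to give clearer error messages: an unknown `event_type` is reported as such instead of as a failure against every member of the union.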

---

## **Learner Activity 1: Practice Advanced Parsing**

**Practice Focus**: Create complex Pydantic models with validation

### **Exercise 1: Build an Invoice Parser**

**Task**: Create a parser for invoice data with line items
**Expected Output**: Structured invoice with calculations

In [None]:
# Your code here
# TODO: Create Pydantic models for:
# - LineItem (product, quantity, unit_price, total)
# - Invoice (invoice_number, date, customer, items, subtotal, tax, total)
# Add validation for calculations

<details>
<summary>Solution</summary>

```python
from pydantic import BaseModel, Field, validator
from typing import List, Optional
from datetime import datetime
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate

class LineItem(BaseModel):
    product: str = Field(description="Product name")
    quantity: int = Field(description="Quantity ordered", gt=0)
    unit_price: float = Field(description="Price per unit", gt=0)
    total: float = Field(description="Line total")
    
    @validator('total')
    def validate_total(cls, v, values):
        if 'quantity' in values and 'unit_price' in values:
            expected = values['quantity'] * values['unit_price']
            if abs(v - expected) > 0.01:  # Allow small rounding differences
                return expected  # Auto-correct
        return v

class Customer(BaseModel):
    name: str = Field(description="Customer name")
    company: Optional[str] = Field(None, description="Company name")
    email: str = Field(description="Customer email")

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice number")
    date: str = Field(description="Invoice date in YYYY-MM-DD")
    customer: Customer = Field(description="Customer information")
    items: List[LineItem] = Field(description="Line items")
    subtotal: float = Field(description="Subtotal before tax")
    tax_rate: float = Field(description="Tax rate as decimal", ge=0, le=1)
    tax_amount: float = Field(description="Tax amount")
    total: float = Field(description="Total amount due")
    
    @validator('subtotal')
    def calculate_subtotal(cls, v, values):
        if 'items' in values:
            calculated = sum(item.total for item in values['items'])
            return calculated
        return v
    
    @validator('tax_amount')
    def calculate_tax(cls, v, values):
        if 'subtotal' in values and 'tax_rate' in values:
            return values['subtotal'] * values['tax_rate']
        return v
    
    @validator('total')
    def calculate_total(cls, v, values):
        if 'subtotal' in values and 'tax_amount' in values:
            return values['subtotal'] + values['tax_amount']
        return v

# **Create parser and prompt**
invoice_parser = PydanticOutputParser(pydantic_object=Invoice)

prompt = ChatPromptTemplate.from_template(
    """Extract invoice information from this text:
    
    {invoice_text}
    
    {format_instructions}
    """
)

# **Test with invoice text**
invoice_text = """Invoice #INV-2024-001
Date: April 15, 2024

Bill To:
John Doe
Acme Corporation
john@acme.com

Items:
1. Premium Widget - Qty: 5 @ $29.99 each
2. Standard Gadget - Qty: 10 @ $15.50 each
3. Deluxe Tool - Qty: 2 @ $89.00 each

Tax Rate: 8.5%
"""

chain = prompt | llm | invoice_parser
result = chain.invoke({
    "invoice_text": invoice_text,
    "format_instructions": invoice_parser.get_format_instructions()
})

print("Invoice Parsing with Calculations:")
print("=" * 50)
print(f"📄 Invoice: {result.invoice_number}")
print(f"📅 Date: {result.date}")
print(f"\n👤 Customer:")
print(f"  Name: {result.customer.name}")
if result.customer.company:
    print(f"  Company: {result.customer.company}")
print(f"  Email: {result.customer.email}")

print(f"\n📦 Line Items:")
for item in result.items:
    print(f"  • {item.product}: {item.quantity} × ${item.unit_price:.2f} = ${item.total:.2f}")

print(f"\n💰 Totals:")
print(f"  Subtotal: ${result.subtotal:.2f}")
print(f"  Tax ({result.tax_rate*100:.1f}%): ${result.tax_amount:.2f}")
print(f"  Total Due: ${result.total:.2f}")

print("\n✅ Invoice parsed with automatic calculations!")
```

**What you learned:**
- Validators for automatic calculations
- Nested customer model
- Financial data validation
- Self-correcting totals

</details>
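
To check the parsed result by hand, the expected figures for the sample invoice can be worked out directly. The sketch below uses `Decimal` to avoid float rounding. Rounding the tax to cents with half-up rounding is an assumption here, since the notebook does not state a rounding rule:

```python
from decimal import Decimal, ROUND_HALF_UP

# (product, quantity, unit price) taken from the sample invoice text
items = [
    ("Premium Widget", 5, Decimal("29.99")),
    ("Standard Gadget", 10, Decimal("15.50")),
    ("Deluxe Tool", 2, Decimal("89.00")),
]
tax_rate = Decimal("0.085")
cent = Decimal("0.01")

# Line totals: 149.95 + 155.00 + 178.00
subtotal = sum((qty * price for _, qty, price in items), Decimal("0"))
# Assumed rounding rule: round tax to the nearest cent, halves up
tax = (subtotal * tax_rate).quantize(cent, rounding=ROUND_HALF_UP)
total = subtotal + tax

print(f"Subtotal: ${subtotal}")
print(f"Tax:      ${tax}")
print(f"Total:    ${total}")
```

The float-based validators in the solution do not round the tax, so their internal values can differ from these in the fractions of a cent, even though the `:.2f` formatting prints the same amounts.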

---

## **Instructor Activity 2: Error Handling and Recovery**

**Concept**: Handle parsing errors gracefully and implement retry logic.

### **Example 1: Output Fixing Parser**

**Problem**: Automatically fix malformed LLM outputs
**Expected Output**: Corrected and parsed data

In [None]:
# Empty cell for demonstration

<details>
<summary>Solution</summary>

```python
from langchain.output_parsers import OutputFixingParser
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
from typing import List

# **Define expected structure**
class ProductReview(BaseModel):
    product_name: str = Field(description="Name of the product")
    rating: int = Field(description="Rating from 1 to 5", ge=1, le=5)
    pros: List[str] = Field(description="List of pros")
    cons: List[str] = Field(description="List of cons")
    summary: str = Field(description="Brief summary")

# **Create base parser**
base_parser = JsonOutputParser(pydantic_object=ProductReview)

# **Wrap with OutputFixingParser**
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
fixing_parser = OutputFixingParser.from_llm(parser=base_parser, llm=llm)

# **Test with malformed outputs**
malformed_outputs = [
    # Missing quotes around keys
    """{product_name: "Laptop", rating: 4, pros: ["Fast", "Lightweight"], cons: ["Expensive"], summary: "Good laptop"}""",
    
    # Invalid JSON with trailing comma
    """{
        "product_name": "Phone",
        "rating": 5,
        "pros": ["Great camera", "Long battery",],
        "cons": ["No headphone jack"],
        "summary": "Excellent phone"
    }""",
    
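    # Missing "summary" field (still valid JSON: JsonOutputParser does not
    # validate against the schema, so the base parser accepts this one)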
    """{
        "product_name": "Tablet",
        "rating": 3,
        "pros": ["Good screen"],
        "cons": ["Slow processor"]
    }"""
]

print("Output Fixing Parser Demo:")
print("=" * 50)

for i, malformed in enumerate(malformed_outputs, 1):
    print(f"\n🔧 Test {i}: Fixing malformed output")
    print(f"Input: {malformed[:50]}...")
    
    try:
        # Try base parser first (expected to fail on syntactically invalid JSON)
        result = base_parser.parse(malformed)
        print("✅ Base parser succeeded (no fixing needed)")
    except Exception as e:
        print(f"❌ Base parser failed: {str(e)[:50]}")
        
        # Use fixing parser
        try:
            fixed_result = fixing_parser.parse(malformed)
            print(f"✅ Fixing parser succeeded!")
            print(f"  Product: {fixed_result['product_name']}")
            print(f"  Rating: {fixed_result['rating']}/5")
            if 'summary' in fixed_result:
                print(f"  Summary: {fixed_result['summary']}")
        except Exception as e2:
            print(f"❌ Fixing parser also failed: {str(e2)[:50]}")

print("\n💡 OutputFixingParser uses an LLM to fix malformed outputs!")
```

**Output fixing benefits:**
- Handles malformed JSON
- Can fill in missing fields when the base parser validates the schema (for example `PydanticOutputParser`)
- Fixes syntax errors
- Improves reliability

</details>
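
The three test strings fail in different ways: the first two are not valid JSON at all, while the third is valid JSON that merely lacks a field. Below is a stdlib-only sketch that separates those two kinds of failure. The `REQUIRED_KEYS` set and the `diagnose` helper are illustrative names, not LangChain APIs, and the sample strings are shortened versions of the ones in the solution:

```python
import json

REQUIRED_KEYS = {"product_name", "rating", "pros", "cons", "summary"}


def diagnose(raw: str) -> str:
    """Classify a raw LLM output as a syntax error, a schema gap, or OK."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"syntax error: {exc.msg}"
    if not isinstance(data, dict):
        return "not a JSON object"
    missing = sorted(REQUIRED_KEYS - data.keys())
    if missing:
        return f"missing fields: {', '.join(missing)}"
    return "ok"


samples = [
    # Unquoted keys
    '{product_name: "Laptop", rating: 4}',
    # Trailing comma inside the list
    '{"product_name": "Phone", "pros": ["Great camera",]}',
    # Valid JSON, but no "summary"
    '{"product_name": "Tablet", "rating": 3, "pros": ["Good screen"], "cons": ["Slow processor"]}',
]

for raw in samples:
    print(diagnose(raw))
```

A syntax-only parser catches the first two cases. Catching the third requires a parser that validates against the schema.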

### **Example 2: Retry Parser with Better Instructions**

**Problem**: Retry parsing with improved prompts
**Expected Output**: Successful parsing after retry

In [None]:
# Empty cell for demonstration

<details>
<summary>Solution</summary>

```python
from langchain.output_parsers import RetryOutputParser
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field, validator

# **Define strict model**
class FinancialReport(BaseModel):
    company_name: str = Field(description="Company name")
    ticker: str = Field(description="Stock ticker symbol")
    revenue: float = Field(description="Revenue in millions", gt=0)
    profit_margin: float = Field(description="Profit margin as decimal", ge=0, le=1)
    year_over_year_growth: float = Field(description="YoY growth as decimal")
    recommendation: str = Field(description="Buy, Hold, or Sell")
    
    @validator('ticker')
    def ticker_uppercase(cls, v):
        return v.upper()
    
    @validator('recommendation')
    def valid_recommendation(cls, v):
        if v.lower() not in ['buy', 'hold', 'sell']:
            raise ValueError('Recommendation must be Buy, Hold, or Sell')
        return v.capitalize()

# **Create parsers**
base_parser = PydanticOutputParser(pydantic_object=FinancialReport)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# **Create retry parser**
retry_parser = RetryOutputParser.from_llm(
    parser=base_parser,
    llm=llm,
    max_retries=2
)

# **Create initial prompt (might produce errors)**
initial_prompt = ChatPromptTemplate.from_template(
    """Extract financial data from: {text}
    Format as JSON."""
)

# **Test text**
financial_text = """TechCorp (ticker: TECH) reported strong Q4 results with revenue of $450 million,
up 23% year-over-year. The company maintained a healthy profit margin of 18.5%.
Analysts recommend this as a strong buy opportunity."""

print("Retry Parser Demo:")
print("=" * 50)

# **First attempt (might fail due to lack of format instructions)**
chain = initial_prompt | llm
initial_completion = chain.invoke({"text": financial_text})

print("Initial LLM Output:")
print(initial_completion.content[:200])

try:
    # Try to parse initial output
    result = base_parser.parse(initial_completion.content)
    print("\n✅ Initial parsing succeeded!")
except Exception as e:
    print(f"\n❌ Initial parsing failed: {str(e)[:100]}")
    
    # Use retry parser with the original prompt
    print("\n🔄 Retrying with the original prompt and the failed completion...")
    
    result = retry_parser.parse_with_prompt(
        initial_completion.content,
        prompt_value=initial_prompt.format_prompt(text=financial_text)
    )
    
    print("✅ Retry succeeded!")

# **Display parsed result**
print("\n📊 Parsed Financial Report:")
print(f"Company: {result.company_name} ({result.ticker})")
print(f"Revenue: ${result.revenue}M")
print(f"Profit Margin: {result.profit_margin*100:.1f}%")
print(f"YoY Growth: {result.year_over_year_growth*100:.1f}%")
print(f"Recommendation: {result.recommendation}")

print("\n💡 RetryOutputParser re-sends the prompt and the failed completion to the LLM!")
```

**Retry parser advantages:**
- Automatic retry on failure
- Re-sends the original prompt together with the failed completion
- Configurable retry attempts
- Preserves original context

</details>
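
Conceptually, a retry parser wraps a parse attempt in a loop: on failure it goes back to the model with the original prompt and the failed completion. Below is a stdlib-only sketch of that control flow. `fake_llm` stands in for a real model call and `parse_with_retry` is a hypothetical helper, not the `RetryOutputParser` implementation:

```python
import json
from typing import Callable, List


def parse_with_retry(prompt: str, completion: str,
                     llm: Callable[[str], str], max_retries: int = 2) -> dict:
    """Parse completion as JSON, asking the model to try again on failure."""
    for attempt in range(max_retries + 1):
        try:
            return json.loads(completion)
        except json.JSONDecodeError:
            if attempt == max_retries:
                raise  # out of retries: surface the last parsing error
            # Re-send the original prompt together with the failed completion.
            retry_prompt = (
                f"Prompt:\n{prompt}\nCompletion:\n{completion}\n\n"
                "The completion did not satisfy the prompt. Please try again:"
            )
            completion = llm(retry_prompt)
    raise RuntimeError("unreachable")


# Stub model (assumption for the demo): prose first, valid JSON on the next call.
responses: List[str] = ["Sure! Here is the data.", '{"ticker": "TECH", "revenue": 450.0}']


def fake_llm(_prompt: str) -> str:
    return responses.pop(0)


result = parse_with_retry("Extract financial data as JSON.", "TechCorp did well.", fake_llm)
print(result)
```

Because the retry prompt contains only what the original prompt said, a retry is most likely to help when that prompt already spells out the expected format.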

---

## **Learner Activity 2: Practice Error Handling**

**Practice Focus**: Implement robust parsing with error recovery

### **Exercise 1: Build a Fault-Tolerant Parser**

**Task**: Create a parser that handles various error conditions
**Expected Output**: Reliable parsing despite errors

In [None]:
# Your code here
# TODO: Create a parser that:
# 1. Tries to parse normally
# 2. Uses OutputFixingParser if that fails
# 3. Falls back to a simpler structure if needed

<details>
<summary>Solution</summary>

```python
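from langchain.output_parsers import OutputFixingParser
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
from typing import Dict, List, Optional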

# **Define models with different complexity levels**
class DetailedProduct(BaseModel):
    name: str
    category: str
    price: float
    features: List[str]
    specifications: Dict[str, str]
    warranty_years: int

class SimpleProduct(BaseModel):
    name: str
    category: Optional[str] = None
    price: Optional[float] = None
    description: Optional[str] = None

def fault_tolerant_parse(text: str, llm):
    """Parse with multiple fallback levels"""
    
    # Level 1: Try detailed parsing
    detailed_parser = PydanticOutputParser(pydantic_object=DetailedProduct)
    
    prompt1 = ChatPromptTemplate.from_template(
        """Extract detailed product information:
        {text}
        
        {format_instructions}
        """
    )
    
    try:
        chain1 = prompt1 | llm | detailed_parser
        result = chain1.invoke({
            "text": text,
            "format_instructions": detailed_parser.get_format_instructions()
        })
        print("✅ Level 1: Detailed parsing succeeded")
        return result, "detailed"
    except Exception as e:
        print(f"⚠️ Level 1 failed: {str(e)[:50]}")
    
    # Level 2: Try with OutputFixingParser
    fixing_parser = OutputFixingParser.from_llm(
        parser=detailed_parser,
        llm=llm
    )
    
    try:
        chain2 = prompt1 | llm | fixing_parser
        result = chain2.invoke({
            "text": text,
            "format_instructions": detailed_parser.get_format_instructions()
        })
        print("✅ Level 2: Fixing parser succeeded")
        return result, "fixed"
    except Exception as e:
        print(f"⚠️ Level 2 failed: {str(e)[:50]}")
    
    # Level 3: Fall back to simple structure
    simple_parser = PydanticOutputParser(pydantic_object=SimpleProduct)
    
    prompt3 = ChatPromptTemplate.from_template(
        """Extract basic product information (name, category, price if available):
        {text}
        
        {format_instructions}
        """
    )
    
    try:
        chain3 = prompt3 | llm | simple_parser
        result = chain3.invoke({
            "text": text,
            "format_instructions": simple_parser.get_format_instructions()
        })
        print("✅ Level 3: Simple parsing succeeded")
        return result, "simple"
    except Exception as e:
        print(f"❌ All levels failed: {str(e)[:50]}")
        return None, "failed"

# **Test with various inputs**
test_inputs = [
    """UltraPhone X: A premium smartphone in the electronics category for $999.
    Features: 5G, OLED display, Triple camera, Face ID.
    Specs: Screen 6.7", Battery 4500mAh, Storage 256GB.
    Comes with 2-year warranty.""",
    
    """BasicWidget - Some kind of gadget that costs about fifty bucks"""
]

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

print("Fault-Tolerant Parsing Demo:")
print("=" * 50)

for i, text in enumerate(test_inputs, 1):
    print(f"\n📝 Input {i}: {text[:50]}...")
    result, level = fault_tolerant_parse(text, llm)
    
    if result:
        print(f"\n📊 Parsed at level: {level}")
        print(f"  Product: {result.name}")
        if hasattr(result, 'features'):
            print(f"  Features: {len(result.features)} found")
        if hasattr(result, 'price') and result.price:
            print(f"  Price: ${result.price}")

print("\n✅ Fault-tolerant parsing handles various input qualities!")
```

**What you learned:**
- Multi-level fallback strategy
- Graceful degradation
- Handling incomplete data
- Robust error recovery

</details>
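
The three levels in `fault_tolerant_parse` follow one pattern: try a list of strategies in order and return the first that succeeds. Below is a stdlib-only sketch of that pattern. `first_successful` and the two toy strategies are hypothetical names used for illustration, standing in for the LLM-backed chains:

```python
import json
import re
from typing import Any, Callable, List, Optional, Tuple

Strategy = Tuple[str, Callable[[str], Any]]


def first_successful(text: str, strategies: List[Strategy]) -> Tuple[Optional[Any], str]:
    """Run strategies in order; return (result, level_name) for the first success."""
    for name, parse in strategies:
        try:
            return parse(text), name
        except Exception as exc:  # broad on purpose: any failure triggers the next level
            print(f"{name} failed: {exc}")
    return None, "failed"


def strict_json(text: str) -> dict:
    # Strictest level: the whole text must be valid JSON.
    return json.loads(text)


def price_only(text: str) -> dict:
    # Much simpler target: just find a dollar amount.
    match = re.search(r"\$(\d+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError("no price found")
    return {"price": float(match.group(1))}


strategies = [("detailed", strict_json), ("simple", price_only)]

print(first_successful('{"name": "UltraPhone X", "price": 999}', strategies))
print(first_successful("UltraPhone X costs $999", strategies))
print(first_successful("Some kind of gadget", strategies))
```

Returning the level name alongside the result, as the solution does, lets downstream code decide how much to trust the parsed data.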

---

## **Instructor Activity 3: Custom Parsers and Transformations**

**Concept**: Create custom parsers for specific formats and transformations.

### **Example 1: Custom Table Parser**

**Problem**: Parse tabular data from text
**Expected Output**: Structured table data

In [None]:
# Empty cell for demonstration

<details>
<summary>Solution</summary>

```python
from langchain_core.output_parsers import BaseOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from typing import List, Dict
import pandas as pd

class TableOutputParser(BaseOutputParser[pd.DataFrame]):
    """Custom parser for table data"""
    
    def parse(self, text: str) -> pd.DataFrame:
        """Parse text into a pandas DataFrame"""
        
        lines = text.strip().split('\n')
        
        # Find table boundaries (lines with | characters)
        table_lines = [line for line in lines if '|' in line]
        
        if not table_lines:
            raise ValueError("No table found in output")
        
        # Parse header
        header = [cell.strip() for cell in table_lines[0].split('|') if cell.strip()]
        
        # Skip separator line if present
        data_start = 1
        if len(table_lines) > 1 and all(c in '-|' for c in table_lines[1].replace(' ', '')):
            data_start = 2
        
        # Parse data rows
        data = []
        for line in table_lines[data_start:]:
            row = [cell.strip() for cell in line.split('|') if cell.strip()]
            if len(row) == len(header):
                data.append(row)
        
        # Create DataFrame
        df = pd.DataFrame(data, columns=header)
        
        # Try to convert numeric columns
        for col in df.columns:
            try:
                df[col] = pd.to_numeric(df[col])
            except (ValueError, TypeError):
                pass  # Keep as string
        
        return df
    
    def get_format_instructions(self) -> str:
        return """Format your response as a markdown table with | separators.
Example:
| Column1 | Column2 | Column3 |
|---------|---------|----------|
| Value1  | Value2  | Value3   |
| Value4  | Value5  | Value6   |"""

# **Use the custom parser**
table_parser = TableOutputParser()

prompt = ChatPromptTemplate.from_template(
    """Create a comparison table for these items:
    {items}
    
    Include columns for: Name, Category, Price, Rating, Availability
    
    {format_instructions}
    """
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = prompt | llm | table_parser

# **Test with product comparison**
items = """Compare these laptops:
1. MacBook Pro - Premium laptop, $2499, highly rated
2. Dell XPS - Business laptop, $1599, good reviews
3. HP Pavilion - Budget laptop, $699, decent for basics
4. ThinkPad X1 - Business ultrabook, $1899, excellent keyboard
"""

result = chain.invoke({
    "items": items,
    "format_instructions": table_parser.get_format_instructions()
})

print("Custom Table Parser Result:")
print("=" * 50)
print("\n📊 Parsed DataFrame:")
print(result)
print(f"\nShape: {result.shape}")
print(f"Columns: {list(result.columns)}")
print(f"\nData types:")
print(result.dtypes)

# **Perform DataFrame operations**
if 'Price' in result.columns:
    # Extract numeric price if formatted as $X,XXX.
    # astype(str) covers the case where to_numeric already converted the column;
    # regex=False treats "$" as a literal character on every pandas version.
    result['Price_Numeric'] = (
        result['Price'].astype(str)
        .str.replace('$', '', regex=False)
        .str.replace(',', '', regex=False)
        .astype(float)
    )
    print(f"\n💰 Average Price: ${result['Price_Numeric'].mean():.2f}")
    print(f"Price Range: ${result['Price_Numeric'].min():.2f} - ${result['Price_Numeric'].max():.2f}")

print("\n✅ Custom parser converts text to pandas DataFrame!")
```

**Custom parser benefits:**
- Parse specific formats
- Direct to pandas DataFrame
- Custom validation logic
- Type conversion

</details>
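
The parsing logic in `TableOutputParser.parse` can be exercised without an LLM or pandas. Below is a stdlib-only sketch of the same steps: keep lines containing `|`, read the header, skip the separator row, and collect rows of matching width. `parse_markdown_table` is a hypothetical helper that returns a list of dicts instead of a DataFrame, and the sample table values are made up for the demo:

```python
from typing import Dict, List


def parse_markdown_table(text: str) -> List[Dict[str, str]]:
    """Parse a pipe-delimited markdown table into a list of row dicts."""
    table_lines = [line for line in text.strip().split("\n") if "|" in line]
    if not table_lines:
        raise ValueError("No table found in output")

    def cells(line: str) -> List[str]:
        # Strip the outer pipes first so empty inner cells are preserved.
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    header = cells(table_lines[0])
    rows = []
    for line in table_lines[1:]:
        stripped = line.replace(" ", "")
        if stripped and all(ch in "-:|" for ch in stripped):
            continue  # separator row such as |---|:---:|
        row = cells(line)
        if len(row) == len(header):
            rows.append(dict(zip(header, row)))
    return rows


# Made-up sample output, including prose before the table and one empty cell
sample = """Here is the comparison:

| Name | Price | Rating |
|------|-------|--------|
| MacBook Pro | $2499 | 4.8 |
| HP Pavilion | $699 |  |
"""

for row in parse_markdown_table(sample):
    print(row)
```

One difference from the notebook's parser: this sketch keeps empty cells, whereas filtering out empty cells (as `parse` does) makes a row with a blank value shorter than the header, so that row is silently dropped.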

---

## **Summary & Next Steps**

### **What You've Learned**
✅ Advanced Pydantic models with nested structures and validation  
✅ Handling optional fields and union types  
✅ Error recovery with OutputFixingParser and RetryOutputParser  
✅ Custom parsers for specific formats  
✅ Table parsing and DataFrame conversion  
✅ Multi-level fallback strategies  

### **Key Takeaways**
1. **Pydantic provides powerful validation** - Use validators for business logic
2. **Handle errors gracefully** - Multiple fallback levels ensure reliability
3. **Custom parsers for custom formats** - Build parsers for your specific needs
4. **Union types handle variability** - Parse different data shapes safely
5. **Tables to DataFrames** - Convert text tables for analysis

### **What's Next?**
In the next notebook (`05_document_loading.ipynb`), you'll learn:
- Loading documents from various sources
- Text splitting strategies
- Metadata extraction
- Handling different file formats
- Preprocessing for embeddings

### **Resources**
- [LangChain Output Parsers](https://python.langchain.com/docs/modules/model_io/output_parsers/)
- [Pydantic Documentation](https://docs.pydantic.dev/)
- [JSON Schema](https://json-schema.org/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)

---

🎉 **Congratulations!** You've mastered advanced output parsing! You can now extract and validate complex structured data from LLM outputs.