# JSON Data Processing with LangChain

## Overview
This notebook demonstrates comprehensive approaches to loading and processing JSON data using LangChain's document loaders. JSON is one of the most common data formats in modern applications, and understanding how to effectively process it for RAG systems is crucial for building robust AI applications.

## What You'll Learn
1. **JSON Structure Analysis** - Understanding complex nested JSON data
2. **JSONLoader Usage** - Using jq schemas for selective data extraction
3. **Custom Processing** - Building intelligent JSON processors for complex structures
4. **JSONL Handling** - Processing JSON Lines format for streaming data
5. **Production Strategies** - Best practices for JSON processing in RAG pipelines

## Prerequisites
```bash
uv add install langchain-community jq
```

## Common JSON Use Cases in RAG
- API responses and web service data
- Configuration files and settings
- Log files and event streams
- Product catalogs and inventories
- User profiles and social media data
- IoT sensor data and telemetry

### Json Parsing And Processing

In [10]:
"""
JSON Data Processing Setup

This module sets up the environment for processing JSON data files.
We'll create sample JSON data that represents real-world complex structures
commonly found in business applications and APIs.

Author: Data Science Team
Date: 2024
"""

# Import necessary libraries for JSON manipulation and file handling
import json  # For JSON parsing and generation
import os    # For operating system interface and directory operations

# Create directory structure for storing our sample JSON files
# exist_ok=True prevents error if directory already exists
os.makedirs("data/json_files", exist_ok=True)
print("✅ Directory structure created successfully")
print("📁 Ready to process JSON and JSONL files")

✅ Directory structure created successfully
📁 Ready to process JSON and JSONL files


In [11]:
# Creating Complex Nested JSON Sample Data
# ==========================================
"""
This section creates realistic sample JSON data that demonstrates common patterns
found in business applications:
- Nested objects and arrays
- Mixed data types (strings, numbers, arrays, objects)
- Multiple levels of hierarchy
- Real-world business entities (employees, projects, departments)

This structure represents a typical company API response or configuration file.
"""

# Define complex nested JSON data structure
# This represents a realistic business scenario with multiple data relationships
json_data = {
    "company": "TechCorp",  # Company name (string)
    
    # Array of employee objects - demonstrates nested arrays with complex objects
    "employees": [
        {
            "id": 1,  # Unique identifier (number)
            "name": "John Doe",  # Employee name (string)
            "role": "Software Engineer",  # Job role (string)
            "skills": ["Python", "JavaScript", "React"],  # Array of skills (strings)
            
            # Nested array of project objects
            "projects": [
                {"name": "RAG System", "status": "In Progress"},
                {"name": "Data Pipeline", "status": "Completed"}
            ]
        },
        {
            "id": 2,
            "name": "Jane Smith",
            "role": "Data Scientist",
            "skills": ["Python", "Machine Learning", "SQL"],
            
            # Different project structure for variety
            "projects": [
                {"name": "ML Model", "status": "In Progress"},
                {"name": "Analytics Dashboard", "status": "Planning"}
            ]
        }
    ],
    
    # Nested object structure - demonstrates hierarchical data organization
    "departments": {
        "engineering": {  # Department-specific nested object
            "head": "Mike Johnson",  # Department head (string)
            "budget": 1000000,  # Budget amount (number)
            "team_size": 25  # Team size (number)
        },
        "data_science": {  # Another department with same structure
            "head": "Sarah Williams",
            "budget": 750000,
            "team_size": 15
        }
    }
}

print("✅ Complex nested JSON data structure created")
print("📊 Structure includes:")
print("  • Company metadata")
print("  • Employee profiles with skills and projects")
print("  • Department information with budgets")
print("  • Nested arrays and objects at multiple levels")

✅ Complex nested JSON data structure created
📊 Structure includes:
  • Company metadata
  • Employee profiles with skills and projects
  • Department information with budgets
  • Nested arrays and objects at multiple levels


In [12]:
# Inspect the created JSON data structure
# This will display the complete nested structure for understanding
json_data

{'company': 'TechCorp',
 'employees': [{'id': 1,
   'name': 'John Doe',
   'role': 'Software Engineer',
   'skills': ['Python', 'JavaScript', 'React'],
   'projects': [{'name': 'RAG System', 'status': 'In Progress'},
    {'name': 'Data Pipeline', 'status': 'Completed'}]},
  {'id': 2,
   'name': 'Jane Smith',
   'role': 'Data Scientist',
   'skills': ['Python', 'Machine Learning', 'SQL'],
   'projects': [{'name': 'ML Model', 'status': 'In Progress'},
    {'name': 'Analytics Dashboard', 'status': 'Planning'}]}],
 'departments': {'engineering': {'head': 'Mike Johnson',
   'budget': 1000000,
   'team_size': 25},
  'data_science': {'head': 'Sarah Williams',
   'budget': 750000,
   'team_size': 15}}}

In [None]:
# Save Complex JSON Data to File
# ===============================
"""
This section saves our structured JSON data to a file for processing.
The 'indent=2' parameter creates human-readable formatting with proper indentation.

This demonstrates how real-world JSON files are typically structured and stored.
"""

# Save the nested JSON data to a file with proper formatting
json_file_path = 'data/json_files/company_data.json'

with open(json_file_path, 'w', encoding='utf-8') as f:
    json.dump(json_data, f, indent=2, ensure_ascii=False)

print(f"✅ Complex JSON data saved to: {json_file_path}")
print("📁 File contains nested employee and department data")
print("🔍 Ready for JSONLoader processing with jq queries")

In [None]:
# Create JSON Lines (JSONL) Format Sample Data
# ==============================================
"""
JSON Lines is a format where each line is a valid JSON object.
This format is commonly used for:
- Streaming data processing
- Log files and event streams
- Large datasets that can't fit in memory
- API responses with multiple records

Each line represents a separate event or record that can be processed independently.
"""

# Define sample event data for JSONL format
# This represents a typical event tracking or logging scenario
jsonl_data = [
    # User login event - contains timestamp, event type, and user ID
    {"timestamp": "2024-01-01T08:00:00Z", "event": "user_login", "user_id": 123, "ip_address": "192.168.1.1"},
    
    # Page view event - includes additional page information
    {"timestamp": "2024-01-01T08:01:00Z", "event": "page_view", "user_id": 123, "page": "/home", "duration": 30},
    
    # Purchase event - contains transaction details
    {"timestamp": "2024-01-01T08:05:00Z", "event": "purchase", "user_id": 123, "amount": 99.99, "product": "Premium Plan"},
    
    # Search event - demonstrates different event structure
    {"timestamp": "2024-01-01T08:03:00Z", "event": "search", "user_id": 123, "query": "machine learning", "results": 42}
]

# Save as JSONL file - each JSON object on a separate line
jsonl_file_path = 'data/json_files/events.jsonl'

with open(jsonl_file_path, 'w', encoding='utf-8') as f:
    for item in jsonl_data:
        # Write each JSON object as a single line
        f.write(json.dumps(item) + '\n')

print(f"✅ JSONL data saved to: {jsonl_file_path}")
print(f"📊 Created {len(jsonl_data)} event records")
print("📝 Each line contains a separate JSON object")
print("🔍 Ready for streaming and batch processing")

## Json Processing Stratergies

In [None]:
# Method 1: Using JSONLoader with jq Schema for Selective Data Extraction
# ========================================================================
"""
JSONLoader with jq schema allows precise extraction of specific parts of JSON data.
jq is a powerful query language for JSON that enables complex filtering and transformation.

Common jq patterns:
- '.employees[]' - Extract each employee from the employees array
- '.departments.engineering' - Extract specific nested object
- '.employees[] | select(.role == "Engineer")' - Filter with conditions
- '.employees[].skills[]' - Flatten nested arrays

Pros: Precise data extraction, supports complex queries
Cons: Requires jq knowledge, can be complex for beginners
"""

from langchain_community.document_loaders import JSONLoader
import json

print("1️⃣ JSONLoader - Extract Specific Fields with jq Schema")
print("-" * 55)

try:
    # Strategy 1: Extract employee information using jq schema
    # The jq query '.employees[]' extracts each employee as a separate document
    employee_loader = JSONLoader(
        file_path='data/json_files/company_data.json',
        jq_schema='.employees[]',  # jq query to extract each employee
        text_content=False  # Get full JSON objects instead of text representation
    )

    # Load documents - each employee becomes a separate Document object
    employee_docs = employee_loader.load()
    
    print(f"✅ Successfully loaded {len(employee_docs)} employee documents")
    print(f"📊 Each document represents one employee record")
    print(f"📄 First employee preview: {employee_docs[0].page_content[:200]}...")
    
    # Display metadata information
    print(f"🏷️  Metadata keys: {list(employee_docs[0].metadata.keys())}")
    
    # Show all employee documents for analysis
    print(f"\n📋 All Employee Documents:")
    for i, doc in enumerate(employee_docs):
        print(f"  Employee {i+1}: {len(doc.page_content)} characters")
    
    employee_docs  # Display the actual documents

except FileNotFoundError:
    print("❌ Error: JSON file not found. Please run the data creation cells first.")
except ImportError:
    print("❌ Import Error: Missing jq dependency")
    print("💡 Install jq: pip install jq")
except Exception as e:
    print(f"❌ Error loading JSON: {e}")
    print("💡 Tip: Ensure 'langchain-community' is installed")

1️⃣ JSONLoader - Extract specific fields
Loaded 2 employee documents
First employee: {"id": 1, "name": "John Doe", "role": "Software Engineer", "skills": ["Python", "JavaScript", "React"], "projects": [{"name": "RAG System", "status": "In Progress"}, {"name": "Data Pipeline", "status"...
[Document(metadata={'source': '/home/bjit/Desktop/Storage01/SelfDevelopment/Rag_Course/Data_Ingest_Parsing/data/json_files/company_data.json', 'seq_num': 1}, page_content='{"id": 1, "name": "John Doe", "role": "Software Engineer", "skills": ["Python", "JavaScript", "React"], "projects": [{"name": "RAG System", "status": "In Progress"}, {"name": "Data Pipeline", "status": "Completed"}]}'), Document(metadata={'source': '/home/bjit/Desktop/Storage01/SelfDevelopment/Rag_Course/Data_Ingest_Parsing/data/json_files/company_data.json', 'seq_num': 2}, page_content='{"id": 2, "name": "Jane Smith", "role": "Data Scientist", "skills": ["Python", "Machine Learning", "SQL"], "projects": [{"name": "ML Model", "status

In [None]:
# Method 2: Custom JSON Processing for Complex Structures
# =======================================================
"""
This section demonstrates custom JSON processing that provides more control over
how complex nested JSON data is converted into documents for RAG applications.

Benefits of custom processing:
- Intelligent content formatting optimized for LLM understanding
- Rich metadata creation for filtering and search
- Context preservation across nested relationships
- Flexible document structures based on data types
"""

from typing import List
from langchain_core.documents import Document

print("\n2️⃣ Custom JSON Processing - Intelligent Structure Handling")
print("-" * 60)

def process_json_intelligently(filepath: str) -> List[Document]:
    """
    Process JSON with intelligent flattening and context preservation.
    
    This function creates structured documents from complex JSON data with
    enhanced formatting and metadata for better RAG performance.
    
    Args:
        filepath (str): Path to the JSON file to process
        
    Returns:
        List[Document]: List of Document objects with structured content
        
    Features:
        - Context-aware document creation
        - Structured content formatting for readability
        - Rich metadata for filtering and search
        - Relationship preservation between nested entities
    
    Example:
        docs = process_json_intelligently("company_data.json")
        print(f"Created {len(docs)} documents")
    """
    # Load and parse the JSON file
    with open(filepath, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    documents = []
    
    # Strategy 1: Create comprehensive employee profile documents
    # This preserves relationships between employees, their skills, and projects
    for emp in data.get('employees', []):
        # Create structured, human-readable content
        # This format is optimized for LLM understanding and Q&A
        content = f"""Employee Profile:
Name: {emp['name']}
Role: {emp['role']}
Employee ID: {emp['id']}
Skills: {', '.join(emp['skills'])}

Current Projects:"""
        
        # Add project information with context
        for proj in emp.get('projects', []):
            content += f"\n- {proj['name']} (Status: {proj['status']})"
        
        # Add company context
        content += f"\n\nCompany: {data.get('company', 'Unknown')}"
        
        # Create document with comprehensive metadata
        doc = Document(
            page_content=content,
            metadata={
                'source': filepath,
                'data_type': 'employee_profile',
                'employee_id': emp['id'],
                'employee_name': emp['name'],
                'role': emp['role'],
                'skills': emp['skills'],
                'project_count': len(emp.get('projects', [])),
                'company': data.get('company', 'Unknown'),
                'content_type': 'structured_profile'
            }
        )
        documents.append(doc)
    
    # Strategy 2: Create department summary documents
    # This provides organizational context and hierarchy
    departments = data.get('departments', {})
    if departments:
        for dept_name, dept_info in departments.items():
            content = f"""Department Information:
Department: {dept_name.title()}
Department Head: {dept_info.get('head', 'Not specified')}
Budget: ${dept_info.get('budget', 0):,}
Team Size: {dept_info.get('team_size', 0)} employees
Company: {data.get('company', 'Unknown')}

This department is part of {data.get('company', 'the organization')} with a budget allocation of ${dept_info.get('budget', 0):,}."""
            
            doc = Document(
                page_content=content,
                metadata={
                    'source': filepath,
                    'data_type': 'department_info',
                    'department_name': dept_name,
                    'department_head': dept_info.get('head', 'Not specified'),
                    'budget': dept_info.get('budget', 0),
                    'team_size': dept_info.get('team_size', 0),
                    'company': data.get('company', 'Unknown'),
                    'content_type': 'organizational_data'
                }
            )
            documents.append(doc)
    
    return documents

# Test the custom processing function
try:
    custom_docs = process_json_intelligently('data/json_files/company_data.json')
    
    print(f"✅ Custom JSON processing completed")
    print(f"📊 Created {len(custom_docs)} documents")
    
    # Analyze document types
    doc_types = {}
    for doc in custom_docs:
        doc_type = doc.metadata.get('data_type', 'unknown')
        if doc_type not in doc_types:
            doc_types[doc_type] = 0
        doc_types[doc_type] += 1
    
    print(f"\n📋 Document Type Analysis:")
    for doc_type, count in doc_types.items():
        print(f"  • {doc_type}: {count} documents")
    
    # Show example of custom-processed document
    print(f"\n📄 Example custom document:")
    print(f"Content:\n{custom_docs[0].page_content[:300]}...")
    print(f"Metadata keys: {list(custom_docs[0].metadata.keys())}")

except Exception as e:
    print(f"❌ Error in custom JSON processing: {e}")


2️⃣ Custom JSON Processing


In [None]:
# Test the custom JSON processing function directly
# This will return the list of documents for inspection
process_json_intelligently("data/json_files/company_data.json")

[Document(metadata={'source': 'data/json_files/company_data.json', 'data_type': 'employee_profile', 'employee_id': 1, 'employee_name': 'John Doe', 'role': 'Software Engineer'}, page_content='Employee Profile:\n        Name: John Doe\n        Role: Software Engineer\n        Skills: Python, JavaScript, React\n\n        Projects:\n- RAG System (Status: In Progress)\n- Data Pipeline (Status: Completed)'),
 Document(metadata={'source': 'data/json_files/company_data.json', 'data_type': 'employee_profile', 'employee_id': 2, 'employee_name': 'Jane Smith', 'role': 'Data Scientist'}, page_content='Employee Profile:\n        Name: Jane Smith\n        Role: Data Scientist\n        Skills: Python, Machine Learning, SQL\n\n        Projects:\n- ML Model (Status: In Progress)\n- Analytics Dashboard (Status: Planning)')]

In [13]:
# Method 3: Processing JSON Lines (JSONL) Format
# ==============================================
"""
JSON Lines format is commonly used for streaming data, logs, and large datasets.
Each line is a separate JSON object, allowing for efficient processing of
large files without loading everything into memory.

Benefits of JSONL:
- Memory efficient for large datasets
- Supports streaming processing
- Easy to append new records
- Common in data pipelines and APIs
"""

print("3️⃣ JSONL Processing - Event Stream Data")
print("-" * 40)

def process_jsonl_events(filepath: str) -> List[Document]:
    """
    Process JSONL file with event-specific document creation.
    
    This function handles streaming JSON data where each line represents
    a separate event or record that should be processed independently.
    
    Args:
        filepath (str): Path to the JSONL file to process
        
    Returns:
        List[Document]: Documents created from JSONL events
        
    Features:
        - Line-by-line processing for memory efficiency
        - Event-specific content formatting
        - Temporal metadata for chronological queries
        - Event type categorization
    """
    documents = []
    
    # Process JSONL file line by line (memory efficient)
    with open(filepath, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            if line.strip():  # Skip empty lines
                try:
                    # Parse each line as a separate JSON object
                    event_data = json.loads(line.strip())
                    
                    # Create event-specific content based on event type
                    event_type = event_data.get('event', 'unknown')
                    
                    if event_type == 'user_login':
                        content = f"""User Login Event:
Timestamp: {event_data.get('timestamp', 'Unknown')}
User ID: {event_data.get('user_id', 'Unknown')}
IP Address: {event_data.get('ip_address', 'Unknown')}
Event Type: Login

A user successfully logged into the system from IP {event_data.get('ip_address', 'unknown address')}."""
                    
                    elif event_type == 'page_view':
                        content = f"""Page View Event:
Timestamp: {event_data.get('timestamp', 'Unknown')}
User ID: {event_data.get('user_id', 'Unknown')}
Page: {event_data.get('page', 'Unknown')}
Duration: {event_data.get('duration', 0)} seconds
Event Type: Page View

User viewed {event_data.get('page', 'a page')} for {event_data.get('duration', 0)} seconds."""
                    
                    elif event_type == 'purchase':
                        content = f"""Purchase Event:
Timestamp: {event_data.get('timestamp', 'Unknown')}
User ID: {event_data.get('user_id', 'Unknown')}
Product: {event_data.get('product', 'Unknown')}
Amount: ${event_data.get('amount', 0)}
Event Type: Purchase

User completed a purchase of {event_data.get('product', 'unknown product')} for ${event_data.get('amount', 0)}."""
                    
                    elif event_type == 'search':
                        content = f"""Search Event:
Timestamp: {event_data.get('timestamp', 'Unknown')}
User ID: {event_data.get('user_id', 'Unknown')}
Query: "{event_data.get('query', 'Unknown')}"
Results: {event_data.get('results', 0)} found
Event Type: Search

User searched for "{event_data.get('query', 'unknown query')}" and found {event_data.get('results', 0)} results."""
                    
                    else:
                        # Generic event processing
                        content = f"Event Data:\n" + "\n".join([f"{k}: {v}" for k, v in event_data.items()])
                    
                    # Create document with event metadata
                    doc = Document(
                        page_content=content,
                        metadata={
                            'source': filepath,
                            'line_number': line_num,
                            'event_type': event_type,
                            'timestamp': event_data.get('timestamp', ''),
                            'user_id': event_data.get('user_id', ''),
                            'data_type': 'event_stream',
                            'content_type': 'temporal_event',
                            **{k: v for k, v in event_data.items() if k not in ['timestamp', 'event', 'user_id']}
                        }
                    )
                    documents.append(doc)
                    
                except json.JSONDecodeError as e:
                    print(f"⚠️  Warning: Invalid JSON on line {line_num}: {e}")
                    continue
    
    return documents

# Test JSONL processing
try:
    jsonl_docs = process_jsonl_events('data/json_files/events.jsonl')
    
    print(f"✅ JSONL processing completed")
    print(f"📊 Created {len(jsonl_docs)} event documents")
    
    # Analyze event types
    event_types = {}
    for doc in jsonl_docs:
        event_type = doc.metadata.get('event_type', 'unknown')
        if event_type not in event_types:
            event_types[event_type] = 0
        event_types[event_type] += 1
    
    print(f"\n📋 Event Type Distribution:")
    for event_type, count in event_types.items():
        print(f"  • {event_type}: {count} events")
    
    # Show example event document
    print(f"\n📄 Example event document:")
    print(f"Content:\n{jsonl_docs[0].page_content}")
    print(f"Metadata keys: {list(jsonl_docs[0].metadata.keys())}")

except FileNotFoundError:
    print("❌ Error: JSONL file not found. Please run the JSONL creation cell first.")
except Exception as e:
    print(f"❌ Error in JSONL processing: {e}")

3️⃣ JSONL Processing - Event Stream Data
----------------------------------------
✅ JSONL processing completed
📊 Created 3 event documents

📋 Event Type Distribution:
  • user_login: 1 events
  • page_view: 1 events
  • purchase: 1 events

📄 Example event document:
Content:
User Login Event:
Timestamp: 2024-01-01
User ID: 123
IP Address: Unknown
Event Type: Login

A user successfully logged into the system from IP unknown address.
Metadata keys: ['source', 'line_number', 'event_type', 'timestamp', 'user_id', 'data_type', 'content_type']


In [14]:
# JSON Processing Methods Comparison and Analysis
# ===============================================
"""
This section compares different JSON processing approaches to help you choose
the best strategy for your specific use case and data structure.
"""

print("📊 JSON Processing Methods Comparison")
print("=" * 42)

# Compare results if all methods have been executed
try:
    if 'employee_docs' in locals() and 'custom_docs' in locals() and 'jsonl_docs' in locals():
        print("\n1️⃣ JSONLoader with jq Schema:")
        print(f"  • Documents created: {len(employee_docs)}")
        print(f"  • Content approach: Direct JSON extraction")
        print(f"  • Best for: Specific field extraction from known structures")
        print(f"  • Metadata richness: {len(employee_docs[0].metadata) if employee_docs else 0} fields")
        
        print("\n2️⃣ Custom JSON Processing:")
        print(f"  • Documents created: {len(custom_docs)}")
        print(f"  • Content approach: Structured, human-readable formatting")
        print(f"  • Best for: Complex data with relationships and context")
        print(f"  • Metadata richness: {len(custom_docs[0].metadata) if custom_docs else 0} fields")
        
        print("\n3️⃣ JSONL Event Processing:")
        print(f"  • Documents created: {len(jsonl_docs)}")
        print(f"  • Content approach: Event-specific formatted content")
        print(f"  • Best for: Streaming data and temporal events")
        print(f"  • Metadata richness: {len(jsonl_docs[0].metadata) if jsonl_docs else 0} fields")
        
        # Performance analysis
        print(f"\n📈 Performance Analysis:")
        json_memory = sum(len(doc.page_content) for doc in employee_docs) if employee_docs else 0
        custom_memory = sum(len(doc.page_content) for doc in custom_docs) if custom_docs else 0
        jsonl_memory = sum(len(doc.page_content) for doc in jsonl_docs) if jsonl_docs else 0
        
        print(f"  • JSONLoader memory footprint: {json_memory} characters")
        print(f"  • Custom processing footprint: {custom_memory} characters")
        print(f"  • JSONL processing footprint: {jsonl_memory} characters")
        
    else:
        print("⚠️  Run all processing methods first to see comparison")
        
except Exception as e:
    print(f"❌ Error in comparison analysis: {e}")

print(f"\n💡 Method Selection Guidelines:")
print(f"  📋 JSONLoader + jq:")
print(f"     - Quick extraction of specific fields")
print(f"     - Known JSON structure with consistent schema")
print(f"     - Minimal processing overhead required")

print(f"  🧠 Custom Processing:")
print(f"     - Production RAG systems requiring high-quality documents")
print(f"     - Complex nested data with business relationships")
print(f"     - Need for human-readable, context-rich content")

print(f"  📡 JSONL Processing:")
print(f"     - Event streams and temporal data")
print(f"     - Large datasets processed in streaming fashion")
print(f"     - Log files and API response sequences")

📊 JSON Processing Methods Comparison
⚠️  Run all processing methods first to see comparison

💡 Method Selection Guidelines:
  📋 JSONLoader + jq:
     - Quick extraction of specific fields
     - Known JSON structure with consistent schema
     - Minimal processing overhead required
  🧠 Custom Processing:
     - Production RAG systems requiring high-quality documents
     - Complex nested data with business relationships
     - Need for human-readable, context-rich content
  📡 JSONL Processing:
     - Event streams and temporal data
     - Large datasets processed in streaming fashion
     - Log files and API response sequences


In [15]:
# Advanced JSON Processing Techniques
# ====================================
"""
This section demonstrates advanced techniques for handling complex JSON scenarios
that are commonly encountered in production RAG systems.
"""

def process_nested_json_with_flattening(data: dict, parent_key: str = '', separator: str = '.') -> dict:
    """
    Flatten nested JSON structures for easier processing.
    
    This utility function converts deeply nested JSON into a flat structure
    while preserving the hierarchical relationships in the key names.
    
    Args:
        data (dict): Nested JSON data to flatten
        parent_key (str): Parent key for recursive processing
        separator (str): Separator to use between nested keys
        
    Returns:
        dict: Flattened dictionary with concatenated keys
        
    Example:
        nested = {"user": {"profile": {"name": "John"}}}
        flattened = flatten_json(nested)
        # Result: {"user.profile.name": "John"}
    """
    items = []
    
    for key, value in data.items():
        new_key = f"{parent_key}{separator}{key}" if parent_key else key
        
        if isinstance(value, dict):
            # Recursively flatten nested dictionaries
            items.extend(process_nested_json_with_flattening(value, new_key, separator).items())
        elif isinstance(value, list):
            # Handle arrays by enumerating items
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    items.extend(process_nested_json_with_flattening(item, f"{new_key}{separator}{i}", separator).items())
                else:
                    items.append((f"{new_key}{separator}{i}", item))
        else:
            # Add primitive values directly
            items.append((new_key, value))
    
    return dict(items)

def create_contextual_json_documents(filepath: str, chunk_strategy: str = 'entity') -> List[Document]:
    """
    Create JSON documents with advanced contextual processing strategies.
    
    This function implements multiple strategies for creating documents from
    JSON data based on different business logic approaches.
    
    Args:
        filepath (str): Path to JSON file
        chunk_strategy (str): Strategy for document creation
            - 'entity': One document per business entity
            - 'flat': Flatten and create comprehensive documents
            - 'contextual': Preserve relationships and context
            
    Returns:
        List[Document]: Optimized documents for RAG systems
    """
    with open(filepath, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    documents = []
    
    if chunk_strategy == 'entity':
        # Strategy: Create focused documents per business entity
        for emp in data.get('employees', []):
            content = f"""Employee: {emp['name']} (ID: {emp['id']})
Role: {emp['role']}
Skills: {', '.join(emp['skills'])}
Active Projects: {len(emp.get('projects', []))}
Company: {data.get('company', 'Unknown')}"""
            
            doc = Document(
                page_content=content,
                metadata={
                    'entity_type': 'employee',
                    'entity_id': emp['id'],
                    'strategy': 'entity_focused',
                    **{k: v for k, v in emp.items() if k not in ['projects']}
                }
            )
            documents.append(doc)
            
    elif chunk_strategy == 'flat':
        # Strategy: Flatten entire structure and create comprehensive documents
        flat_data = process_nested_json_with_flattening(data)
        
        # Group related flat keys into documents
        employee_data = {k: v for k, v in flat_data.items() if 'employees' in k}
        dept_data = {k: v for k, v in flat_data.items() if 'departments' in k}
        
        if employee_data:
            content = "Employee Information:\n" + "\n".join([f"{k}: {v}" for k, v in employee_data.items()])
            doc = Document(
                page_content=content,
                metadata={'data_type': 'flattened_employees', 'strategy': 'flattened'}
            )
            documents.append(doc)
        
        if dept_data:
            content = "Department Information:\n" + "\n".join([f"{k}: {v}" for k, v in dept_data.items()])
            doc = Document(
                page_content=content,
                metadata={'data_type': 'flattened_departments', 'strategy': 'flattened'}
            )
            documents.append(doc)
            
    elif chunk_strategy == 'contextual':
        # Strategy: Preserve context and create relationship-aware documents
        for emp in data.get('employees', []):
            # Find department context
            dept_context = ""
            for dept_name, dept_info in data.get('departments', {}).items():
                if emp.get('role', '').lower() in ['software engineer', 'data scientist']:
                    if dept_name == 'engineering' and 'engineer' in emp.get('role', '').lower():
                        dept_context = f"\nDepartment Context: {dept_name.title()} (Head: {dept_info.get('head')}, Budget: ${dept_info.get('budget', 0):,})"
                    elif dept_name == 'data_science' and 'data scientist' in emp.get('role', '').lower():
                        dept_context = f"\nDepartment Context: {dept_name.title()} (Head: {dept_info.get('head')}, Budget: ${dept_info.get('budget', 0):,})"
            
            content = f"""Employee Profile with Context:
Name: {emp['name']}
Role: {emp['role']}
Skills: {', '.join(emp['skills'])}
Projects: {[p['name'] for p in emp.get('projects', [])]}
Company: {data.get('company', 'Unknown')}{dept_context}

This employee works in {emp['role']} capacity with expertise in {', '.join(emp['skills'])}."""
            
            doc = Document(
                page_content=content,
                metadata={
                    'strategy': 'contextual',
                    'employee_id': emp['id'],
                    'has_dept_context': bool(dept_context),
                    'context_richness': 'high'
                }
            )
            documents.append(doc)
    
    return documents

print("🔬 Advanced JSON Processing Techniques")
print("=" * 38)

# Test different strategies
strategies = ['entity', 'flat', 'contextual']
results = {}

try:
    for strategy in strategies:
        docs = create_contextual_json_documents('data/json_files/company_data.json', strategy)
        results[strategy] = docs
        
        print(f"\n📊 Strategy: {strategy.upper()}")
        print(f"  • Documents created: {len(docs)}")
        print(f"  • Avg content length: {sum(len(d.page_content) for d in docs) / len(docs) if docs else 0:.1f}")
        print(f"  • Example content preview:")
        if docs:
            print(f"    {docs[0].page_content[:100]}...")
    
    print(f"\n💡 Strategy Recommendations:")
    print(f"  🎯 Entity Strategy: Best for focused queries about specific entities")
    print(f"  📊 Flat Strategy: Good for comprehensive searches across all data")
    print(f"  🧠 Contextual Strategy: Optimal for relationship-aware questions")
    
except Exception as e:
    print(f"❌ Error in advanced processing: {e}")

# Display flattening example
print(f"\n🔍 JSON Flattening Example:")
sample_nested = {"user": {"profile": {"name": "John", "details": {"age": 30}}}}
flattened = process_nested_json_with_flattening(sample_nested)
print(f"  Original: {sample_nested}")
print(f"  Flattened: {flattened}")

🔬 Advanced JSON Processing Techniques

📊 Strategy: ENTITY
  • Documents created: 2
  • Avg content length: 122.5
  • Example content preview:
    Employee: John Doe (ID: 1)
Role: Software Engineer
Skills: Python, JavaScript, React
Active Projects...

📊 Strategy: FLAT
  • Documents created: 2
  • Avg content length: 486.0
  • Example content preview:
    Employee Information:
employees.0.id: 1
employees.0.name: John Doe
employees.0.role: Software Engine...

📊 Strategy: CONTEXTUAL
  • Documents created: 2
  • Avg content length: 336.5
  • Example content preview:
    Employee Profile with Context:
Name: John Doe
Role: Software Engineer
Skills: Python, JavaScript, Re...

💡 Strategy Recommendations:
  🎯 Entity Strategy: Best for focused queries about specific entities
  📊 Flat Strategy: Good for comprehensive searches across all data
  🧠 Contextual Strategy: Optimal for relationship-aware questions

🔍 JSON Flattening Example:
  Original: {'user': {'profile': {'name': 'John', 'details': {'a

In [16]:
# Performance Analysis and Production Utilities
# ==============================================
"""
This section provides performance analysis and utility functions for production
JSON processing in RAG systems.
"""

import sys
import time
from datetime import datetime

def analyze_json_processing_performance():
    """
    Analyze memory usage and processing characteristics of different JSON approaches.
    
    Returns:
        dict: Performance metrics for different processing methods
    """
    performance_metrics = {}
    
    # Analyze JSONLoader documents if loaded
    if 'employee_docs' in locals() and employee_docs:
        performance_metrics['json_loader'] = {
            'method': 'JSONLoader with jq',
            'document_count': len(employee_docs),
            'avg_doc_length': sum(len(doc.page_content) for doc in employee_docs) / len(employee_docs),
            'total_memory': sum(sys.getsizeof(doc.page_content) for doc in employee_docs),
            'metadata_richness': len(employee_docs[0].metadata)
        }
    
    # Analyze custom processing documents if loaded
    if 'custom_docs' in locals() and custom_docs:
        performance_metrics['custom_processing'] = {
            'method': 'Custom JSON Processing',
            'document_count': len(custom_docs),
            'avg_doc_length': sum(len(doc.page_content) for doc in custom_docs) / len(custom_docs),
            'total_memory': sum(sys.getsizeof(doc.page_content) for doc in custom_docs),
            'metadata_richness': len(custom_docs[0].metadata)
        }
    
    # Analyze JSONL processing documents if loaded
    if 'jsonl_docs' in locals() and jsonl_docs:
        performance_metrics['jsonl_processing'] = {
            'method': 'JSONL Event Processing',
            'document_count': len(jsonl_docs),
            'avg_doc_length': sum(len(doc.page_content) for doc in jsonl_docs) / len(jsonl_docs),
            'total_memory': sum(sys.getsizeof(doc.page_content) for doc in jsonl_docs),
            'metadata_richness': len(jsonl_docs[0].metadata)
        }
    
    return performance_metrics

def validate_json_structure(data: dict, required_fields: List[str]) -> tuple[bool, List[str]]:
    """
    Validate JSON structure against required fields.
    
    Args:
        data (dict): JSON data to validate
        required_fields (List[str]): List of required field paths (dot notation)
        
    Returns:
        tuple: (is_valid, missing_fields)
        
    Example:
        valid, missing = validate_json_structure(data, ['company', 'employees.0.name'])
    """
    missing_fields = []
    
    for field_path in required_fields:
        parts = field_path.split('.')
        current_data = data
        
        try:
            for part in parts:
                if part.isdigit():
                    # Array index
                    current_data = current_data[int(part)]
                else:
                    # Object key
                    current_data = current_data[part]
        except (KeyError, IndexError, TypeError):
            missing_fields.append(field_path)
    
    return len(missing_fields) == 0, missing_fields

def create_production_json_processor(validation_rules: dict = None):
    """
    Factory function to create production-ready JSON processors.
    
    Args:
        validation_rules (dict): Optional validation rules for JSON structure
        
    Returns:
        function: Configured JSON processor function
    """
    def process_json_production(filepath: str, strategy: str = 'intelligent') -> tuple[List[Document], dict]:
        """
        Production JSON processor with error handling and metrics.
        
        Args:
            filepath (str): Path to JSON file
            strategy (str): Processing strategy to use
            
        Returns:
            tuple: (documents, processing_metrics)
        """
        start_time = time.time()
        processing_metrics = {
            'start_time': datetime.now().isoformat(),
            'file_path': filepath,
            'strategy': strategy,
            'success': False,
            'error_message': None,
            'documents_created': 0,
            'processing_time_seconds': 0
        }
        
        try:
            # Load and validate JSON
            with open(filepath, 'r', encoding='utf-8') as f:
                data = json.load(f)
            
            # Optional validation
            if validation_rules:
                required_fields = validation_rules.get('required_fields', [])
                is_valid, missing = validate_json_structure(data, required_fields)
                if not is_valid:
                    processing_metrics['error_message'] = f"Missing required fields: {missing}"
                    return [], processing_metrics
            
            # Process based on strategy
            if strategy == 'intelligent':
                documents = process_json_intelligently(filepath)
            elif strategy == 'flat':
                documents = create_contextual_json_documents(filepath, 'flat')
            elif strategy == 'contextual':
                documents = create_contextual_json_documents(filepath, 'contextual')
            else:
                documents = process_json_intelligently(filepath)  # Default fallback
            
            # Update metrics
            processing_metrics['success'] = True
            processing_metrics['documents_created'] = len(documents)
            processing_metrics['processing_time_seconds'] = time.time() - start_time
            
            return documents, processing_metrics
            
        except Exception as e:
            processing_metrics['error_message'] = str(e)
            processing_metrics['processing_time_seconds'] = time.time() - start_time
            return [], processing_metrics
    
    return process_json_production

print("⚡ Performance Analysis and Production Utilities")
print("=" * 48)

# Run performance analysis
try:
    metrics = analyze_json_processing_performance()
    
    if metrics:
        print("\n📊 Processing Method Performance:")
        for method_key, data in metrics.items():
            print(f"\n{data['method']}:")
            print(f"  • Documents: {data['document_count']}")
            print(f"  • Avg length: {data['avg_doc_length']:.1f} chars")
            print(f"  • Memory: {data['total_memory']} bytes")
            print(f"  • Metadata fields: {data['metadata_richness']}")
        
        print(f"\n💡 Performance Insights:")
        print(f"  • Custom processing creates more readable documents")
        print(f"  • JSONL processing is memory efficient for large datasets")
        print(f"  • Rich metadata enables better filtering and search")
        
    else:
        print("⚠️  Run the JSON processing methods first to see performance metrics")
        
except Exception as e:
    print(f"❌ Error in performance analysis: {e}")

# Test production processor
print(f"\n🏭 Production JSON Processor Example:")
try:
    # Create production processor with validation
    validation_rules = {
        'required_fields': ['company', 'employees']
    }
    
    production_processor = create_production_json_processor(validation_rules)
    
    # Process JSON with metrics
    prod_docs, prod_metrics = production_processor('data/json_files/company_data.json', 'intelligent')
    
    print(f"✅ Production processing completed:")
    print(f"  • Success: {prod_metrics['success']}")
    print(f"  • Documents: {prod_metrics['documents_created']}")
    print(f"  • Time: {prod_metrics['processing_time_seconds']:.3f}s")
    print(f"  • Strategy: {prod_metrics['strategy']}")
    
except Exception as e:
    print(f"❌ Error in production processor: {e}")

print(f"\n🎯 Production Best Practices:")
print(f"  • Always validate JSON structure before processing")
print(f"  • Include error handling and recovery mechanisms")
print(f"  • Monitor processing time and memory usage")
print(f"  • Use appropriate chunking strategies for large files")
print(f"  • Implement caching for frequently accessed data")

⚡ Performance Analysis and Production Utilities
⚠️  Run the JSON processing methods first to see performance metrics

🏭 Production JSON Processor Example:
✅ Production processing completed:
  • Success: True
  • Documents: 2
  • Time: 0.001s
  • Strategy: intelligent

🎯 Production Best Practices:
  • Always validate JSON structure before processing
  • Include error handling and recovery mechanisms
  • Monitor processing time and memory usage
  • Use appropriate chunking strategies for large files
  • Implement caching for frequently accessed data


## Best Practices for JSON Processing

### When to Use Each Approach

1. **JSONLoader with jq Schema**
   - ✅ Quick extraction of specific fields from known structures
   - ✅ Consistent JSON schemas with predictable structure
   - ✅ Minimal processing overhead for simple extractions
   - ❌ Complex business logic or relationship preservation

2. **Custom JSON Processing**
   - ✅ Production RAG systems requiring high-quality documents
   - ✅ Complex nested structures with business relationships
   - ✅ Need for human-readable, context-rich content
   - ✅ Rich metadata requirements for advanced filtering

3. **JSONL Processing**
   - ✅ Event streams and temporal data processing
   - ✅ Large datasets requiring memory-efficient processing
   - ✅ Log files and API response sequences
   - ✅ Streaming data pipelines

### Common Challenges and Solutions

1. **Large JSON File Performance**
   - **Problem**: Memory issues with large nested JSON files
   - **Solution**: Implement streaming JSON parsing or chunk processing
   - **Code Pattern**: Process top-level arrays in batches

2. **Complex Nested Structures**
   - **Problem**: Deep nesting makes extraction difficult
   - **Solution**: Use flattening utilities or recursive processing
   - **Code Pattern**: Implement path-based field extraction

3. **Inconsistent Schema**
   - **Problem**: JSON structures vary across files
   - **Solution**: Implement schema validation and flexible processing
   - **Code Pattern**: Use try-catch blocks with fallback strategies

4. **jq Query Complexity**
   - **Problem**: Complex jq queries are hard to maintain
   - **Solution**: Use custom processing for complex logic
   - **Code Pattern**: Reserve jq for simple field extractions

## Summary and Next Steps

### What We Learned

In this notebook, we explored comprehensive strategies for processing JSON data:

1. **JSONLoader with jq**: Precise field extraction using query language
2. **Custom Processing**: Intelligent document creation with rich context
3. **JSONL Processing**: Efficient streaming data handling
4. **Advanced Techniques**: Flattening, contextual processing, and validation
5. **Production Utilities**: Error handling, performance monitoring, and best practices

### Key Takeaways

- **Choose the right approach**: Simple JSONLoader for basic extraction, custom processing for production
- **Context matters**: Preserve relationships and business logic in document structure
- **Metadata is crucial**: Rich metadata enables advanced filtering and retrieval
- **Validation is essential**: Always validate JSON structure before processing
- **Performance considerations**: Memory usage and processing time vary significantly between methods

### Production Checklist

- [ ] Implement JSON schema validation
- [ ] Add comprehensive error handling
- [ ] Create context-rich, human-readable documents
- [ ] Include actionable metadata for filtering
- [ ] Monitor processing performance and memory usage
- [ ] Implement caching for frequently accessed files
- [ ] Use streaming processing for large JSON files

### Next Steps

1. **Try with your own JSON data**: Apply these techniques to your real datasets
2. **Build a JSON processing pipeline**: Create automated workflows for different JSON types
3. **Optimize for your queries**: Test different document structures with your specific questions
4. **Scale for production**: Implement batch processing, monitoring, and error recovery

### Resources for Further Learning

- [jq Manual](https://stedolan.github.io/jq/manual/) - Complete jq query language reference
- [LangChain JSONLoader Documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/json)
- [JSON Schema Specification](https://json-schema.org/) - For validation and structure definition
- [Streaming JSON Processing](https://github.com/dcmoura/spyql) - Tools for large JSON file processing