# CSV and Excel Processing with LangChain

## Overview
This notebook demonstrates comprehensive approaches to loading and processing structured data files (CSV and Excel) using LangChain's document loaders. We'll explore multiple strategies for converting tabular data into documents suitable for RAG applications.

## What You'll Learn
1. **CSV Processing** - Different strategies for converting CSV data to documents
2. **Excel Processing** - Handling multi-sheet Excel files and complex data structures
3. **Custom Processing** - Building intelligent document loaders for structured data
4. **Best Practices** - Optimizing structured data for vector search and retrieval

## Prerequisites
```bash
uv add install langchain-community pandas openpyxl xlrd
```

## Use Cases
- Product catalogs and inventory systems
- Financial data and reports
- Customer databases
- Survey and research data
- Any tabular data for Q&A systems

### CSV And EXcel files- Structured Data

In [10]:
"""
CSV and Excel Data Processing Setup

This module sets up the environment for processing structured data files.
We'll create sample data and demonstrate different loading strategies.

Author: Data Science Team
Date: 2024
"""

# Import necessary libraries for data manipulation and file handling
import pandas as pd  # For data manipulation and analysis
import os  # For operating system interface and directory operations

# Create directory structure for storing our sample files
# exist_ok=True prevents error if directory already exists
os.makedirs("data/structured_files", exist_ok=True)
print("✅ Directory structure created successfully")
print("📁 Ready to process CSV and Excel files")

✅ Directory structure created successfully
📁 Ready to process CSV and Excel files


In [11]:
# Creating Sample Product Data
# ============================
"""
This section creates realistic sample data for demonstrating CSV and Excel processing.
The data structure represents a typical product inventory system with various data types.
"""

# Define sample product data with realistic business information
# This structure represents common e-commerce or inventory data
data = {
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],  # Product names
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Electronics'],  # Categories
    'Price': [999.99, 29.99, 79.99, 299.99, 89.99],  # Prices in USD
    'Stock': [50, 200, 150, 75, 100],  # Available quantities
    'Description': [  # Detailed product descriptions for text processing
        'High-performance laptop with 16GB RAM and 512GB SSD',
        'Wireless optical mouse with ergonomic design',
        'Mechanical keyboard with RGB backlighting',
        '27-inch 4K monitor with HDR support',
        '1080p webcam with noise cancellation'
    ]
}

# Convert dictionary to pandas DataFrame for easier manipulation
df = pd.DataFrame(data)

# Save the data as CSV file - most common format for structured data
csv_file_path = 'data/structured_files/products.csv'
df.to_csv(csv_file_path, index=False)  # index=False prevents adding row numbers

print(f"✅ Created sample CSV file: {csv_file_path}")
print(f"📊 Data shape: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"📝 Sample data preview:")
print(df.head(2))  # Show first 2 rows

✅ Created sample CSV file: data/structured_files/products.csv
📊 Data shape: 5 rows, 5 columns
📝 Sample data preview:
  Product     Category   Price  Stock  \
0  Laptop  Electronics  999.99     50   
1   Mouse  Accessories   29.99    200   

                                         Description  
0  High-performance laptop with 16GB RAM and 512G...  
1       Wireless optical mouse with ergonomic design  


In [12]:
# Creating Excel File with Multiple Sheets
# =========================================
"""
This section demonstrates creating Excel files with multiple worksheets.
Multi-sheet Excel files are common in business environments and require
special handling for document processing.
"""

# Create Excel file with multiple sheets using ExcelWriter
excel_file_path = 'data/structured_files/inventory.xlsx'

with pd.ExcelWriter(excel_file_path, engine='openpyxl') as writer:
    # Sheet 1: Product details (main data)
    df.to_excel(writer, sheet_name='Products', index=False)
    
    # Sheet 2: Category summary (aggregated data)
    # This demonstrates how real business files often contain multiple data views
    summary_data = {
        'Category': ['Electronics', 'Accessories'],
        'Total_Items': [3, 2],  # Count of products per category
        'Total_Value': [1389.97, 109.98]  # Sum of prices per category
    }
    summary_df = pd.DataFrame(summary_data)
    summary_df.to_excel(writer, sheet_name='Summary', index=False)

print(f"✅ Created Excel file: {excel_file_path}")
print(f"📊 Created with 2 sheets: 'Products' and 'Summary'")
print(f"💡 This simulates real-world multi-sheet business files")

✅ Created Excel file: data/structured_files/inventory.xlsx
📊 Created with 2 sheets: 'Products' and 'Summary'
💡 This simulates real-world multi-sheet business files


## CSV Processing

In [13]:
# Import LangChain Document Loaders for Structured Data
# =====================================================
"""
This section imports the necessary document loaders for processing CSV files.
Different loaders provide different approaches to converting tabular data into documents.
"""

from langchain_community.document_loaders import (
    CSVLoader,  # Standard CSV loader - converts each row to a document
    UnstructuredCSVLoader  # Advanced CSV loader with more flexible parsing
)

In [14]:
# Method 1: Using CSVLoader - Row-based Document Creation
# ======================================================
"""
CSVLoader is the simplest approach to convert CSV data into documents.
Each row becomes a separate document with all columns as content.

Pros: Simple, fast, preserves all data
Cons: May create verbose documents, limited customization
"""

print("1️⃣ CSVLoader - Row-based Documents")
print("-" * 40)

try:
    # Initialize CSVLoader with file path and configuration
    csv_loader = CSVLoader(
        file_path='data/structured_files/products.csv',
        encoding='utf-8',  # Ensure proper character encoding
        csv_args={
            'delimiter': ',',    # Comma-separated values
            'quotechar': '"',   # Quote character for text fields
        }
    )

    # Load documents - each row becomes a Document object
    csv_docs = csv_loader.load()
    
    # Display results and analysis
    print(f"✅ Successfully loaded {len(csv_docs)} documents (one per row)")
    print(f"📊 Total documents: {len(csv_docs)}")
    
    # Show first document structure
    print(f"\n📄 First document preview:")
    print(f"Content: {csv_docs[0].page_content}")
    print(f"Metadata: {csv_docs[0].metadata}")
    
    # Analyze document characteristics
    avg_length = sum(len(doc.page_content) for doc in csv_docs) / len(csv_docs)
    print(f"\n📈 Statistics:")
    print(f"  • Average document length: {avg_length:.1f} characters")
    print(f"  • Document source: {csv_docs[0].metadata.get('source', 'unknown')}")

except FileNotFoundError:
    print("❌ Error: CSV file not found. Please run the data creation cells first.")
except Exception as e:
    print(f"❌ Error loading CSV: {e}")
    print("💡 Tip: Ensure 'langchain-community' is installed: pip install langchain-community")

1️⃣ CSVLoader - Row-based Documents
----------------------------------------
✅ Successfully loaded 5 documents (one per row)
📊 Total documents: 5

📄 First document preview:
Content: Product: Laptop
Category: Electronics
Price: 999.99
Stock: 50
Description: High-performance laptop with 16GB RAM and 512GB SSD
Metadata: {'source': 'data/structured_files/products.csv', 'row': 0}

📈 Statistics:
  • Average document length: 116.8 characters
  • Document source: data/structured_files/products.csv


In [15]:
# Method 2: Custom CSV Processing for Enhanced Control
# ===================================================
"""
This section demonstrates custom CSV processing that provides more control over
how tabular data is converted into documents for RAG applications.

This approach allows for:
- Intelligent content formatting
- Rich metadata creation
- Custom document structures
- Better optimization for vector search
"""

from typing import List
from langchain_core.documents import Document

print("\n2️⃣ Custom CSV Processing")
print("-" * 30)

def process_csv_intelligently(filepath: str) -> List[Document]:
    """
    Process CSV with intelligent document creation strategy.
    
    This function creates structured documents from CSV data with enhanced
    formatting and metadata for better RAG performance.
    
    Args:
        filepath (str): Path to the CSV file to process
        
    Returns:
        List[Document]: List of Document objects with structured content
        
    Features:
        - Structured content formatting for better readability
        - Rich metadata for filtering and search
        - Product-specific document creation
        - Optimized for Q&A applications
    
    Example:
        docs = process_csv_intelligently("products.csv")
        print(f"Created {len(docs)} documents")
    """
    # Read the CSV file into a pandas DataFrame
    df = pd.read_csv(filepath)
    documents = []
    
    # Strategy: Create one document per row with structured content
    for idx, row in df.iterrows():
        # Create human-readable, structured content
        # This format is optimized for LLM understanding and retrieval
        content = f"""Product Information:
Name: {row['Product']}
Category: {row['Category']}
Price: ${row['Price']}
Stock: {row['Stock']} units
Description: {row['Description']}"""
        
        # Create document with comprehensive metadata
        # Metadata enables filtering, categorization, and advanced retrieval
        doc = Document(
            page_content=content,
            metadata={
                'source': filepath,
                'row_index': idx,
                'product_name': row['Product'],
                'category': row['Category'],
                'price': row['Price'],
                'stock_level': row['Stock'],
                'data_type': 'product_info',
                'content_type': 'structured_data'
            }
        )
        documents.append(doc)
    
    return documents

# Test the custom processing function
try:
    custom_docs = process_csv_intelligently('data/structured_files/products.csv')
    
    print(f"✅ Custom processing completed")
    print(f"📊 Created {len(custom_docs)} documents")
    
    # Show example of improved document structure
    print(f"\n📄 Custom document example:")
    print(f"Content:\n{custom_docs[0].page_content}")
    print(f"\nMetadata keys: {list(custom_docs[0].metadata.keys())}")
    
    # Compare content quality
    print(f"\n🔍 Content Quality Analysis:")
    print(f"  • Structured format: ✅ Human-readable")
    print(f"  • Rich metadata: ✅ {len(custom_docs[0].metadata)} fields")
    print(f"  • Search optimization: ✅ Key-value format")

except Exception as e:
    print(f"❌ Error in custom processing: {e}")


2️⃣ Custom CSV Processing
------------------------------
✅ Custom processing completed
📊 Created 5 documents

📄 Custom document example:
Content:
Product Information:
Name: Laptop
Category: Electronics
Price: $999.99
Stock: 50 units
Description: High-performance laptop with 16GB RAM and 512GB SSD

Metadata keys: ['source', 'row_index', 'product_name', 'category', 'price', 'stock_level', 'data_type', 'content_type']

🔍 Content Quality Analysis:
  • Structured format: ✅ Human-readable
  • Rich metadata: ✅ 8 fields
  • Search optimization: ✅ Key-value format


In [16]:
# Test the custom CSV processing function directly
# This will return the list of documents for inspection
process_csv_intelligently('data/structured_files/products.csv')

[Document(metadata={'source': 'data/structured_files/products.csv', 'row_index': 0, 'product_name': 'Laptop', 'category': 'Electronics', 'price': 999.99, 'stock_level': 50, 'data_type': 'product_info', 'content_type': 'structured_data'}, page_content='Product Information:\nName: Laptop\nCategory: Electronics\nPrice: $999.99\nStock: 50 units\nDescription: High-performance laptop with 16GB RAM and 512GB SSD'),
 Document(metadata={'source': 'data/structured_files/products.csv', 'row_index': 1, 'product_name': 'Mouse', 'category': 'Accessories', 'price': 29.99, 'stock_level': 200, 'data_type': 'product_info', 'content_type': 'structured_data'}, page_content='Product Information:\nName: Mouse\nCategory: Accessories\nPrice: $29.99\nStock: 200 units\nDescription: Wireless optical mouse with ergonomic design'),
 Document(metadata={'source': 'data/structured_files/products.csv', 'row_index': 2, 'product_name': 'Keyboard', 'category': 'Accessories', 'price': 79.99, 'stock_level': 150, 'data_type

In [17]:
# CSV Processing Strategies Comparison
# ====================================
"""
This section compares different CSV processing approaches to help you
choose the right strategy for your specific use case.
"""

print("📊 CSV Processing Strategies Comparison")
print("=" * 45)

print("\n1️⃣ Row-based Processing (CSVLoader):")
print("  ✅ Simple one-row-one-document mapping")
print("  ✅ Good for record lookups and searches")
print("  ✅ Fast processing with minimal overhead")
print("  ❌ Loses table structure and relationships")
print("  ❌ May create verbose or redundant content")

print("\n2️⃣ Intelligent Custom Processing:")
print("  ✅ Preserves data relationships and context")
print("  ✅ Creates structured, readable content")
print("  ✅ Rich metadata for advanced filtering")
print("  ✅ Better performance in Q&A systems")
print("  ⚠️  Requires more development effort")

print("\n💡 Recommendation:")
print("  • Use CSVLoader for: Simple data extraction and basic search")
print("  • Use Custom Processing for: Production RAG systems and complex queries")

📊 CSV Processing Strategies Comparison

1️⃣ Row-based Processing (CSVLoader):
  ✅ Simple one-row-one-document mapping
  ✅ Good for record lookups and searches
  ✅ Fast processing with minimal overhead
  ❌ Loses table structure and relationships
  ❌ May create verbose or redundant content

2️⃣ Intelligent Custom Processing:
  ✅ Preserves data relationships and context
  ✅ Creates structured, readable content
  ✅ Rich metadata for advanced filtering
  ✅ Better performance in Q&A systems
  ⚠️  Requires more development effort

💡 Recommendation:
  • Use CSVLoader for: Simple data extraction and basic search
  • Use Custom Processing for: Production RAG systems and complex queries


## Excel Processing

Excel files present unique challenges for document processing:
- Multiple worksheets with different data structures
- Complex formatting and merged cells
- Mixed data types within sheets
- Metadata embedded in sheet structure

We'll explore different strategies for handling these complexities.

In [19]:
# Method 1: Using UnstructuredExcelLoader
# =======================================
"""
UnstructuredExcelLoader can handle Excel files with multiple sheets,
but it treats the entire file as a single document unit.

Pros: Simple setup, handles multiple sheets automatically
Cons: Limited control over sheet-specific processing
"""

from langchain_community.document_loaders import UnstructuredExcelLoader

print("1️⃣ UnstructuredExcelLoader - File-level Processing")
print("-" * 50)

try:
    # Initialize the Excel loader
    excel_loader = UnstructuredExcelLoader('data/structured_files/inventory.xlsx')
    
    # Load the entire Excel file as documents
    excel_docs = excel_loader.load()
    
    print(f"✅ Successfully loaded Excel file")
    print(f"📊 Number of documents: {len(excel_docs)}")
    print(f"📄 Content preview (first 300 chars):")
    print(f"{excel_docs[0].page_content[:300]}...")
    print(f"🏷️  Metadata: {excel_docs[0].metadata}")

except ImportError:
    print("❌ Import Error: Missing dependencies")
    print("💡 Install required packages: pip install unstructured openpyxl")
except Exception as e:
    print(f"❌ Error loading Excel file: {e}")

1️⃣ UnstructuredExcelLoader - File-level Processing
--------------------------------------------------
❌ Import Error: Missing dependencies
💡 Install required packages: pip install unstructured openpyxl
❌ Import Error: Missing dependencies
💡 Install required packages: pip install unstructured openpyxl


In [20]:
# Method 2: Custom Excel Processing with Sheet-Specific Handling
# =============================================================
"""
This approach provides granular control over each Excel sheet,
allowing for different processing strategies per sheet type.

Benefits:
- Sheet-aware document creation
- Contextual metadata for each sheet
- Different processing strategies per sheet type
- Better document organization and retrieval
"""

def process_excel_by_sheets(filepath: str) -> List[Document]:
    """
    Process Excel file with sheet-specific document creation.
    
    This function reads each sheet separately and creates optimized
    documents based on the sheet's data type and structure.
    
    Args:
        filepath (str): Path to the Excel file
        
    Returns:
        List[Document]: Documents with sheet-aware processing
        
    Features:
        - Individual sheet processing
        - Sheet-specific content formatting
        - Rich metadata including sheet information
        - Optimized for multi-sheet business files
    
    Example:
        docs = process_excel_by_sheets("inventory.xlsx")
        for doc in docs:
            print(f"Sheet: {doc.metadata['sheet_name']}")
    """
    documents = []
    
    # Read all sheets from Excel file into a dictionary
    # key = sheet name, value = DataFrame
    excel_data = pd.read_excel(filepath, sheet_name=None)
    
    # Process each sheet individually
    for sheet_name, df in excel_data.items():
        print(f"📊 Processing sheet: '{sheet_name}'")
        
        # Determine processing strategy based on sheet name/content
        if sheet_name.lower() == 'products':
            # Product sheet: Create one document per product
            for idx, row in df.iterrows():
                content = f"""Product Record from {sheet_name} Sheet:
Product: {row['Product']}
Category: {row['Category']}
Price: ${row['Price']}
Stock: {row['Stock']} units
Description: {row['Description']}"""
                
                doc = Document(
                    page_content=content,
                    metadata={
                        'source': filepath,
                        'sheet_name': sheet_name,
                        'sheet_type': 'product_catalog',
                        'row_index': idx,
                        'product_name': row['Product'],
                        'category': row['Category'],
                        'data_type': 'product_record'
                    }
                )
                documents.append(doc)
                
        elif sheet_name.lower() == 'summary':
            # Summary sheet: Create aggregated documents
            content = f"""Category Summary from {sheet_name} Sheet:\n"""
            for idx, row in df.iterrows():
                content += f"""
Category: {row['Category']}
Total Items: {row['Total_Items']}
Total Value: ${row['Total_Value']}"""
            
            doc = Document(
                page_content=content,
                metadata={
                    'source': filepath,
                    'sheet_name': sheet_name,
                    'sheet_type': 'summary_report',
                    'data_type': 'aggregated_summary',
                    'categories_count': len(df)
                }
            )
            documents.append(doc)
        
        else:
            # Generic sheet processing for unknown sheet types
            content = f"Data from {sheet_name} Sheet:\n"
            content += df.to_string(index=False)
            
            doc = Document(
                page_content=content,
                metadata={
                    'source': filepath,
                    'sheet_name': sheet_name,
                    'sheet_type': 'generic_data',
                    'rows': len(df),
                    'columns': len(df.columns),
                    'data_type': 'tabular_data'
                }
            )
            documents.append(doc)
    
    return documents

print("\n2️⃣ Custom Excel Processing - Sheet-aware")
print("-" * 45)

try:
    # Process Excel file with custom sheet handling
    excel_custom_docs = process_excel_by_sheets('data/structured_files/inventory.xlsx')
    
    print(f"\n✅ Custom Excel processing completed")
    print(f"📊 Created {len(excel_custom_docs)} documents from Excel sheets")
    
    # Analyze documents by sheet
    sheet_analysis = {}
    for doc in excel_custom_docs:
        sheet_name = doc.metadata['sheet_name']
        sheet_type = doc.metadata['sheet_type']
        
        if sheet_name not in sheet_analysis:
            sheet_analysis[sheet_name] = {'count': 0, 'type': sheet_type}
        sheet_analysis[sheet_name]['count'] += 1
    
    print(f"\n📋 Sheet Processing Summary:")
    for sheet, info in sheet_analysis.items():
        print(f"  • {sheet}: {info['count']} documents ({info['type']})")
    
    # Show example document
    print(f"\n📄 Example document from Products sheet:")
    products_doc = next(doc for doc in excel_custom_docs if doc.metadata['sheet_name'] == 'Products')
    print(f"Content:\n{products_doc.page_content}")
    print(f"Metadata keys: {list(products_doc.metadata.keys())}")

except Exception as e:
    print(f"❌ Error in custom Excel processing: {e}")


2️⃣ Custom Excel Processing - Sheet-aware
---------------------------------------------
📊 Processing sheet: 'Products'
📊 Processing sheet: 'Summary'

✅ Custom Excel processing completed
📊 Created 6 documents from Excel sheets

📋 Sheet Processing Summary:
  • Products: 5 documents (product_catalog)
  • Summary: 1 documents (summary_report)

📄 Example document from Products sheet:
Content:
Product Record from Products Sheet:
Product: Laptop
Category: Electronics
Price: $999.99
Stock: 50 units
Description: High-performance laptop with 16GB RAM and 512GB SSD
Metadata keys: ['source', 'sheet_name', 'sheet_type', 'row_index', 'product_name', 'category', 'data_type']


In [21]:
# Performance Analysis and Comparison
# ====================================
"""
This section compares the performance characteristics of different
structured data processing approaches to help you make informed decisions
for production RAG systems.
"""

import sys
import time

def analyze_structured_data_performance():
    """
    Analyze memory usage and processing characteristics of loaded documents.
    
    Returns:
        dict: Performance metrics for different processing methods
    """
    performance_metrics = {}
    
    # Analyze CSV documents if loaded
    if 'csv_docs' in locals() and csv_docs:
        performance_metrics['csv_standard'] = {
            'method': 'CSVLoader',
            'document_count': len(csv_docs),
            'avg_doc_length': sum(len(doc.page_content) for doc in csv_docs) / len(csv_docs),
            'total_memory': sum(sys.getsizeof(doc.page_content) for doc in csv_docs),
            'metadata_richness': len(csv_docs[0].metadata)
        }
    
    # Analyze custom CSV documents if loaded
    if 'custom_docs' in locals() and custom_docs:
        performance_metrics['csv_custom'] = {
            'method': 'Custom CSV Processing',
            'document_count': len(custom_docs),
            'avg_doc_length': sum(len(doc.page_content) for doc in custom_docs) / len(custom_docs),
            'total_memory': sum(sys.getsizeof(doc.page_content) for doc in custom_docs),
            'metadata_richness': len(custom_docs[0].metadata)
        }
    
    # Analyze Excel documents if loaded
    if 'excel_custom_docs' in locals() and excel_custom_docs:
        performance_metrics['excel_custom'] = {
            'method': 'Custom Excel Processing',
            'document_count': len(excel_custom_docs),
            'avg_doc_length': sum(len(doc.page_content) for doc in excel_custom_docs) / len(excel_custom_docs),
            'total_memory': sum(sys.getsizeof(doc.page_content) for doc in excel_custom_docs),
            'metadata_richness': len(excel_custom_docs[0].metadata)
        }
    
    return performance_metrics

# Run performance analysis
print("⚡ Structured Data Processing Performance Analysis")
print("=" * 55)

try:
    metrics = analyze_structured_data_performance()
    
    if metrics:
        for method_key, data in metrics.items():
            print(f"\n📊 {data['method']}:")
            print(f"  • Documents created: {data['document_count']}")
            print(f"  • Average document length: {data['avg_doc_length']:.1f} characters")
            print(f"  • Total memory usage: {data['total_memory']} bytes")
            print(f"  • Metadata fields: {data['metadata_richness']}")
        
        print(f"\n💡 Key Insights:")
        print(f"  • Custom processing creates more structured, readable documents")
        print(f"  • Rich metadata enables better filtering and retrieval")
        print(f"  • Sheet-aware processing provides contextual information")
        
    else:
        print("⚠️  Run the document loading cells first to see performance metrics")
        
except Exception as e:
    print(f"❌ Error in performance analysis: {e}")

print(f"\n🎯 Best Practices for Production:")
print(f"  • Use custom processing for better document quality")
print(f"  • Include rich metadata for advanced filtering")
print(f"  • Consider document chunking for very large datasets")
print(f"  • Implement caching for frequently accessed files")

⚡ Structured Data Processing Performance Analysis
⚠️  Run the document loading cells first to see performance metrics

🎯 Best Practices for Production:
  • Use custom processing for better document quality
  • Include rich metadata for advanced filtering
  • Consider document chunking for very large datasets
  • Implement caching for frequently accessed files


## Best Practices for Structured Data Processing

### When to Use Each Approach

1. **CSVLoader (Standard)**
   - ✅ Quick prototyping and simple data extraction
   - ✅ Small datasets with homogeneous structure
   - ✅ When processing speed is priority over content quality
   - ❌ Complex business logic or data relationships

2. **Custom Processing**
   - ✅ Production RAG systems requiring high-quality documents
   - ✅ Complex data structures with multiple relationships
   - ✅ Need for rich metadata and filtering capabilities
   - ✅ Integration with business-specific document formats

3. **Excel Sheet-Aware Processing**
   - ✅ Multi-sheet business files with different data types
   - ✅ When sheet structure provides important context
   - ✅ Need for sheet-specific processing strategies
   - ✅ Complex business reports and dashboards

### Common Challenges and Solutions

1. **Large File Performance**
   - **Problem**: Memory issues with large CSV/Excel files
   - **Solution**: Implement chunking and streaming processing
   - **Code Pattern**: Process files in batches of 1000 rows

2. **Data Quality Issues**
   - **Problem**: Missing values, inconsistent formats
   - **Solution**: Add data validation and cleaning steps
   - **Code Pattern**: Validate data types before document creation

3. **Metadata Optimization**
   - **Problem**: Too much or too little metadata
   - **Solution**: Include only actionable metadata for retrieval
   - **Code Pattern**: Focus on filterable and searchable fields

In [22]:
# Practical Utility Functions for Production RAG Systems
# ======================================================
"""
This section provides reusable utility functions for structured data processing
in production RAG applications.
"""

def validate_and_clean_data(df: pd.DataFrame, required_columns: List[str]) -> pd.DataFrame:
    """
    Validate and clean DataFrame before processing.
    
    Args:
        df (pd.DataFrame): Input DataFrame to validate
        required_columns (List[str]): List of required column names
        
    Returns:
        pd.DataFrame: Cleaned and validated DataFrame
        
    Raises:
        ValueError: If required columns are missing
    """
    # Check for required columns
    missing_columns = set(required_columns) - set(df.columns)
    if missing_columns:
        raise ValueError(f"Missing required columns: {missing_columns}")
    
    # Clean data
    df_clean = df.copy()
    
    # Handle missing values
    for col in df_clean.columns:
        if df_clean[col].dtype == 'object':  # String columns
            df_clean[col] = df_clean[col].fillna('Unknown')
        else:  # Numeric columns
            df_clean[col] = df_clean[col].fillna(0)
    
    # Remove duplicate rows
    df_clean = df_clean.drop_duplicates()
    
    print(f"✅ Data validation completed")
    print(f"  • Original rows: {len(df)}")
    print(f"  • Cleaned rows: {len(df_clean)}")
    print(f"  • Removed duplicates: {len(df) - len(df_clean)}")
    
    return df_clean

def create_enhanced_documents(df: pd.DataFrame, 
                           content_template: str,
                           metadata_fields: List[str]) -> List[Document]:
    """
    Create enhanced documents with customizable content and metadata.
    
    Args:
        df (pd.DataFrame): Source data
        content_template (str): Template for document content with {column} placeholders
        metadata_fields (List[str]): Column names to include as metadata
        
    Returns:
        List[Document]: Enhanced document objects
        
    Example:
        template = "Product: {Product}\nPrice: ${Price}\nDescription: {Description}"
        fields = ['Product', 'Category', 'Price']
        docs = create_enhanced_documents(df, template, fields)
    """
    documents = []
    
    for idx, row in df.iterrows():
        # Create content using template
        try:
            content = content_template.format(**row.to_dict())
        except KeyError as e:
            print(f"⚠️  Warning: Missing column {e} in template")
            content = str(row.to_dict())
        
        # Create metadata
        metadata = {
            'row_index': idx,
            'source_type': 'structured_data',
            'processing_timestamp': pd.Timestamp.now().isoformat()
        }
        
        # Add specified metadata fields
        for field in metadata_fields:
            if field in row.index:
                metadata[field.lower().replace(' ', '_')] = row[field]
        
        doc = Document(
            page_content=content,
            metadata=metadata
        )
        documents.append(doc)
    
    return documents

# Example usage of utility functions
print("🔧 Utility Functions for Production RAG")
print("=" * 40)

# Test data validation
try:
    # Load test data
    test_df = pd.read_csv('data/structured_files/products.csv')
    
    # Validate data
    required_cols = ['Product', 'Category', 'Price', 'Description']
    clean_df = validate_and_clean_data(test_df, required_cols)
    
    # Create enhanced documents
    template = """Product Information:
Name: {Product}
Category: {Category}
Price: ${Price}
Stock: {Stock} units available
Description: {Description}

This product is in the {Category} category and costs ${Price}."""
    
    metadata_fields = ['Product', 'Category', 'Price', 'Stock']
    enhanced_docs = create_enhanced_documents(clean_df, template, metadata_fields)
    
    print(f"\n✅ Created {len(enhanced_docs)} enhanced documents")
    print(f"📄 Example enhanced document:")
    print(f"Content:\n{enhanced_docs[0].page_content[:200]}...")
    print(f"Metadata: {list(enhanced_docs[0].metadata.keys())}")
    
except Exception as e:
    print(f"❌ Error in utility functions: {e}")

🔧 Utility Functions for Production RAG
✅ Data validation completed
  • Original rows: 5
  • Cleaned rows: 5
  • Removed duplicates: 0

✅ Created 5 enhanced documents
📄 Example enhanced document:
Content:
Product Information:
Name: Laptop
Category: Electronics
Price: $999.99
Stock: 50 units available
Description: High-performance laptop with 16GB RAM and 512GB SSD

This product is in the Electronics ca...
Metadata: ['row_index', 'source_type', 'processing_timestamp', 'product', 'category', 'price', 'stock']


## Summary and Next Steps

### What We Learned

In this notebook, we explored comprehensive strategies for processing structured data:

1. **CSV Processing**: From simple CSVLoader to intelligent custom processing
2. **Excel Processing**: Sheet-aware processing for complex business files
3. **Performance Analysis**: Understanding trade-offs between different approaches
4. **Production Utilities**: Reusable functions for data validation and enhancement

### Key Takeaways

- **Choose the right approach**: Simple loaders for quick prototyping, custom processing for production
- **Metadata is crucial**: Rich metadata enables advanced filtering and retrieval
- **Structure matters**: Preserve data relationships and context in document format
- **Validation is essential**: Always validate and clean data before processing

### Production Checklist

- [ ] Implement data validation and cleaning
- [ ] Create structured, human-readable content
- [ ] Include rich metadata for filtering
- [ ] Add error handling and logging
- [ ] Consider performance and memory usage
- [ ] Implement caching for large files

### Next Steps

1. **Try with your own data**: Apply these techniques to your structured datasets
2. **Build a data pipeline**: Create automated processing workflows
3. **Optimize for retrieval**: Test different document structures with your queries
4. **Scale for production**: Implement batch processing and monitoring

### Resources for Further Learning

- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [LangChain Document Loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/)
- [Structured Data Best Practices](https://docs.llamaindex.ai/en/stable/examples/data_connectors/)