# Word Document Processing with LangChain

## Overview
This notebook demonstrates different approaches to loading and processing Word (.docx) documents using LangChain's document loaders. We'll explore two main methods:

1. **Docx2txtLoader** - Simple text extraction from Word documents
2. **UnstructuredWordDocumentLoader** - Advanced parsing that preserves document structure and metadata

## Prerequisites
Before running this notebook, ensure you have the following packages installed:
```bash
uv add install langchain-community docx2txt python-docx unstructured
```

## Learning Objectives
- Understand different document loading strategies
- Compare simple vs. structured document parsing
- Learn how to handle metadata and document elements
- Best practices for document preprocessing in RAG pipelines

### Word Document Processing

In [None]:
"""
Word Document Processing with LangChain Document Loaders

This module demonstrates two different approaches for loading Word documents:
1. Docx2txtLoader: Simple text extraction
2. UnstructuredWordDocumentLoader: Advanced parsing with structure preservation

Author: SAIB AHMED
Date: 2024
"""

# Import necessary document loaders from LangChain Community
from langchain_community.document_loaders import (
    Docx2txtLoader,  # Simple text extraction from .docx files
    UnstructuredWordDocumentLoader  # Advanced parsing with structure preservation
)

In [None]:
# Method 1: Using Docx2txtLoader
# ==================================
# Docx2txtLoader provides simple text extraction from Word documents
# Pros: Fast, lightweight, simple to use
# Cons: Loses document structure and formatting information

print("1Ô∏è‚É£ Using Docx2txtLoader")
print("-" * 40)

try:pip 
    # Initialize the Docx2txtLoader with the path to your Word document
    docx_loader = Docx2txtLoader("data/word_files/proposal.docx")
    
    # Load the document - returns a list of Document objects
    docs = docx_loader.load()
    
    # Display basic information about the loaded document
    print(f"‚úÖ Successfully loaded {len(docs)} document(s)")
    print(f"üìÑ Document type: Word Document (.docx)")
    print(f"üìù Content preview (first 200 chars): {docs[0].page_content[:200]}...")
    print(f"üè∑Ô∏è  Document metadata: {docs[0].metadata}")
    
except FileNotFoundError:
    print("‚ùå Error: The specified file was not found. Please check the file path.")
except Exception as e:
    print(f"‚ùå Error loading document: {e}")
    print("üí° Tip: Make sure 'docx2txt' package is installed: pip install docx2txt")

1Ô∏è‚É£ Using Docx2txtLoader
----------------------------------------
‚úÖ Successfully loaded 1 document(s)
üìÑ Document type: Word Document (.docx)
üìù Content preview (first 200 chars): Project Proposal: RAG Implementation

Executive Summary

This proposal outlines the implementation of a Retrieval-Augmented Generation system for our organization.

Objectives

Key objectives include:...
üè∑Ô∏è  Document metadata: {'source': 'data/word_files/proposal.docx'}


In [11]:
# Method 2: Using UnstructuredWordDocumentLoader
# ================================================
# UnstructuredWordDocumentLoader provides advanced parsing capabilities
# Pros: Preserves document structure, extracts metadata, handles various elements
# Cons: Slower processing, requires additional dependencies

print("\n2Ô∏è‚É£ Using UnstructuredWordDocumentLoader")
print("-" * 50)

try:
    # Initialize UnstructuredWordDocumentLoader with mode="elements"
    # mode="elements" returns each document element as a separate Document object
    unstructured_loader = UnstructuredWordDocumentLoader(
        "data/word_files/proposal.docx", 
        mode="elements"
    )
    
    # Load the document and extract all elements
    unstructured_docs = unstructured_loader.load()
    
    # Display summary information about the loaded elements
    print(f"‚úÖ Successfully loaded {len(unstructured_docs)} document elements")
    
    # Show details for the first 3 elements to understand the structure
    for i, doc in enumerate(unstructured_docs[:3]):
        print(f"\nElement {i+1}:")
        print(f"  üìã Type: {doc.metadata.get('category', 'unknown')}")
        print(f"  üìù Content preview: {doc.page_content[:100]}...")
        
        # Show additional metadata if available
        if len(doc.metadata) > 1:
            print(f"  üè∑Ô∏è  Metadata keys: {list(doc.metadata.keys())}")

except FileNotFoundError:
    print("‚ùå Error: The specified file was not found. Please check the file path.")
except ImportError as e:
    print(f"‚ùå Import Error: {e}")
    print("üí° Tip: Make sure required packages are installed:")
    print("   pip install python-docx unstructured")
except Exception as e:
    print(f"‚ùå Unexpected error: {e}") 


2Ô∏è‚É£ Using UnstructuredWordDocumentLoader
--------------------------------------------------
‚úÖ Successfully loaded 20 document elements

Element 1:
  üìã Type: Title
  üìù Content preview: Project Proposal: RAG Implementation...
  üè∑Ô∏è  Metadata keys: ['source', 'category_depth', 'file_directory', 'filename', 'last_modified', 'languages', 'filetype', 'category', 'element_id']

Element 2:
  üìã Type: Title
  üìù Content preview: Executive Summary...
  üè∑Ô∏è  Metadata keys: ['source', 'category_depth', 'file_directory', 'filename', 'last_modified', 'languages', 'filetype', 'category', 'element_id']

Element 3:
  üìã Type: NarrativeText
  üìù Content preview: This proposal outlines the implementation of a Retrieval-Augmented Generation system for our organiz...
  üè∑Ô∏è  Metadata keys: ['source', 'category_depth', 'file_directory', 'filename', 'last_modified', 'languages', 'filetype', 'parent_id', 'category', 'element_id']


In [12]:
# Inspect the unstructured documents in detail
# This will display all the parsed elements from the Word document
# Each element represents a different part of the document (paragraphs, titles, etc.)
unstructured_docs

[Document(metadata={'source': 'data/word_files/proposal.docx', 'category_depth': 0, 'file_directory': 'data/word_files', 'filename': 'proposal.docx', 'last_modified': '2025-09-12T11:09:23', 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': 'bb0410bfd160ef866f8d4357b0949db2'}, page_content='Project Proposal: RAG Implementation'),
 Document(metadata={'source': 'data/word_files/proposal.docx', 'category_depth': 0, 'file_directory': 'data/word_files', 'filename': 'proposal.docx', 'last_modified': '2025-09-12T11:09:23', 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': 'c0f844859abf08d9506856b3aed4a719'}, page_content='Executive Summary'),
 Document(metadata={'source': 'data/word_files/proposal.docx', 'category_depth': 0, 'file_directory': 'data/word_files', 'filename': 'proposal.docx', 'last_modified': '2

In [13]:
# Comparison and Analysis
# ========================
"""
This section compares the two document loading methods and analyzes their outputs.
It helps you understand when to use each approach based on your specific requirements.
"""

print("üìä Document Loading Methods Comparison")
print("=" * 50)

# Compare basic statistics between the two methods
if 'docs' in locals() and 'unstructured_docs' in locals():
    print(f"Docx2txtLoader results:")
    print(f"  ‚Ä¢ Number of documents: {len(docs)}")
    print(f"  ‚Ä¢ Content length: {len(docs[0].page_content) if docs else 0} characters")
    print(f"  ‚Ä¢ Metadata keys: {list(docs[0].metadata.keys()) if docs else []}")
    
    print(f"\nUnstructuredWordDocumentLoader results:")
    print(f"  ‚Ä¢ Number of elements: {len(unstructured_docs)}")
    print(f"  ‚Ä¢ Total content length: {sum(len(doc.page_content) for doc in unstructured_docs)} characters")
    print(f"  ‚Ä¢ Element types: {set(doc.metadata.get('category', 'unknown') for doc in unstructured_docs)}")
    
    print(f"\nüí° Key Insights:")
    print(f"  ‚Ä¢ Structured parsing found {len(unstructured_docs)} separate elements")
    print(f"  ‚Ä¢ Simple parsing treats the entire document as one unit")
    print(f"  ‚Ä¢ Structured parsing preserves document hierarchy and element types")
    
else:
    print("‚ö†Ô∏è  Run the previous cells first to compare the methods")

print(f"\nüéØ When to use each method:")
print(f"  üìù Docx2txtLoader:")
print(f"     - Quick text extraction for search or basic analysis")
print(f"     - When document structure is not important")
print(f"     - For simple RAG applications with minimal preprocessing")
print(f"  üèóÔ∏è  UnstructuredWordDocumentLoader:")
print(f"     - When document structure matters (headings, paragraphs, etc.)")
print(f"     - For advanced RAG systems that need element-level processing")
print(f"     - When you need metadata about document elements")

üìä Document Loading Methods Comparison
Docx2txtLoader results:
  ‚Ä¢ Number of documents: 1
  ‚Ä¢ Content length: 743 characters
  ‚Ä¢ Metadata keys: ['source']

UnstructuredWordDocumentLoader results:
  ‚Ä¢ Number of elements: 20
  ‚Ä¢ Total content length: 680 characters
  ‚Ä¢ Element types: {'NarrativeText', 'UncategorizedText', 'Table', 'ListItem', 'Title'}

üí° Key Insights:
  ‚Ä¢ Structured parsing found 20 separate elements
  ‚Ä¢ Simple parsing treats the entire document as one unit
  ‚Ä¢ Structured parsing preserves document hierarchy and element types

üéØ When to use each method:
  üìù Docx2txtLoader:
     - Quick text extraction for search or basic analysis
     - When document structure is not important
     - For simple RAG applications with minimal preprocessing
  üèóÔ∏è  UnstructuredWordDocumentLoader:
     - When document structure matters (headings, paragraphs, etc.)
     - For advanced RAG systems that need element-level processing
     - When you need metadata 

In [None]:
# Performance and Memory Considerations
# ======================================
"""
This section analyzes the performance and memory usage of both methods.
Understanding these aspects is crucial for production RAG applications.
"""

import sys
import time

def analyze_document_loader_performance():
    """
    Analyzes and compares the performance characteristics of both document loaders.
    pip 
    Returns:
        dict: Performance metrics for both loaders
    """
    performance_metrics = {}
    
    # Test Docx2txtLoader performance
    if 'docs' in locals():
        performance_metrics['docx2txt'] = {
            'memory_usage': sys.getsizeof(docs),
            'document_count': len(docs),
            'avg_doc_size': len(docs[0].page_content) if docs else 0
        }
    
    # Test UnstructuredWordDocumentLoader performance  
    if 'unstructured_docs' in locals():
        total_memory = sum(sys.getsizeof(doc.page_content) for doc in unstructured_docs)
        performance_metrics['unstructured'] = {
            'memory_usage': total_memory,
            'element_count': len(unstructured_docs),
            'avg_element_size': total_memory / len(unstructured_docs) if unstructured_docs else 0
        }
    
    return performance_metrics

# Analyze performance if both methods have been executed
if 'docs' in locals() and 'unstructured_docs' in locals():
    metrics = analyze_document_loader_performance()
    
    print("‚ö° Performance Analysis")
    print("=" * 25)
    
    print(f"üìä Memory Usage:")
    print(f"  ‚Ä¢ Docx2txtLoader: {metrics.get('docx2txt', {}).get('memory_usage', 0)} bytes")
    print(f"  ‚Ä¢ UnstructuredLoader: {metrics.get('unstructured', {}).get('memory_usage', 0)} bytes")
    
    print(f"\nüîç Document Granularity:")
    print(f"  ‚Ä¢ Simple loader: {metrics.get('docx2txt', {}).get('document_count', 0)} document(s)")
    print(f"  ‚Ä¢ Structured loader: {metrics.get('unstructured', {}).get('element_count', 0)} element(s)")
    
    print(f"\nüí° Recommendations:")
    print(f"  ‚Ä¢ For large documents: Use Docx2txtLoader for faster processing")
    print(f"  ‚Ä¢ For structured analysis: Use UnstructuredWordDocumentLoader")
    print(f"  ‚Ä¢ For production RAG: Consider document chunking strategies")
    
else:
    print("‚ö†Ô∏è  Run the document loading cells first to see performance analysis")

‚ö° Performance Analysis
üìä Memory Usage:
  ‚Ä¢ Docx2txtLoader: 0 bytes
  ‚Ä¢ UnstructuredLoader: 0 bytes

üîç Document Granularity:
  ‚Ä¢ Simple loader: 0 document(s)
  ‚Ä¢ Structured loader: 0 element(s)

üí° Recommendations:
  ‚Ä¢ For large documents: Use Docx2txtLoader for faster processing
  ‚Ä¢ For structured analysis: Use UnstructuredWordDocumentLoader
  ‚Ä¢ For production RAG: Consider document chunking strategies


## Best Practices and Troubleshooting

### Common Issues and Solutions

1. **Module Import Errors**
   - `ModuleNotFoundError: No module named 'docx2txt'` ‚Üí Install with `pip install docx2txt`
   - `ModuleNotFoundError: No module named 'docx'` ‚Üí Install with `pip install python-docx`
   - `ModuleNotFoundError: No module named 'unstructured'` ‚Üí Install with `pip install unstructured`

2. **File Path Issues**
   - Use absolute paths or ensure relative paths are correct
   - Check file permissions and accessibility
   - Verify the file exists and is a valid .docx format

3. **Performance Considerations**
   - For large documents (>10MB): Consider using Docx2txtLoader for speed
   - For structured analysis: UnstructuredWordDocumentLoader is worth the extra processing time
   - Implement chunking strategies for very large documents

### Production RAG Pipeline Recommendations

- **Document Preprocessing**: Always validate document format before processing
- **Error Handling**: Implement robust exception handling for file operations
- **Chunking Strategy**: Split large documents into manageable chunks (500-1000 tokens)
- **Metadata Preservation**: Use UnstructuredWordDocumentLoader when structure matters
- **Monitoring**: Track processing time and memory usage in production

In [15]:
# Utility Functions for Document Processing
# ==========================================
"""
This section provides utility functions for document processing that can be reused 
in production RAG applications.
"""

def load_document_safe(file_path: str, method: str = "docx2txt"):
    """
    Safely load a Word document using the specified method.
    
    Args:
        file_path (str): Path to the Word document
        method (str): Loading method - either "docx2txt" or "unstructured"
    
    Returns:
        tuple: (documents_list, success_flag, error_message)
    
    Example:
        docs, success, error = load_document_safe("proposal.docx", "unstructured")
        if success:
            print(f"Loaded {len(docs)} document elements")
        else:
            print(f"Error: {error}")
    """
    try:
        if method == "docx2txt":
            loader = Docx2txtLoader(file_path)
        elif method == "unstructured":
            loader = UnstructuredWordDocumentLoader(file_path, mode="elements")
        else:
            return None, False, f"Unsupported method: {method}"
        
        documents = loader.load()
        return documents, True, None
        
    except FileNotFoundError:
        return None, False, f"File not found: {file_path}"
    except ImportError as e:
        return None, False, f"Missing dependency: {e}"
    except Exception as e:
        return None, False, f"Unexpected error: {e}"

def analyze_document_structure(documents: list):
    """
    Analyze the structure of loaded documents.
    
    Args:
        documents (list): List of Document objects
    
    Returns:
        dict: Analysis results including statistics and structure info
    """
    if not documents:
        return {"error": "No documents provided"}
    
    analysis = {
        "total_documents": len(documents),
        "total_content_length": sum(len(doc.page_content) for doc in documents),
        "avg_content_length": sum(len(doc.page_content) for doc in documents) / len(documents),
        "metadata_keys": set(),
        "element_types": set()
    }
    
    # Analyze metadata and element types
    for doc in documents:
        analysis["metadata_keys"].update(doc.metadata.keys())
        if "category" in doc.metadata:
            analysis["element_types"].add(doc.metadata["category"])
    
    return analysis

# Example usage of utility functions
print("üîß Document Processing Utilities")
print("=" * 35)

# You can test these functions with your documents
print("üí° Use these utility functions in your RAG pipeline:")
print("  ‚Ä¢ load_document_safe() - Safe document loading with error handling")
print("  ‚Ä¢ analyze_document_structure() - Analyze document structure and metadata")
print("\nExample usage:")
print("  docs, success, error = load_document_safe('your_file.docx', 'unstructured')")
print("  if success:")
print("      analysis = analyze_document_structure(docs)")
print("      print(f'Found {analysis[\"total_documents\"]} elements')")

üîß Document Processing Utilities
üí° Use these utility functions in your RAG pipeline:
  ‚Ä¢ load_document_safe() - Safe document loading with error handling
  ‚Ä¢ analyze_document_structure() - Analyze document structure and metadata

Example usage:
  docs, success, error = load_document_safe('your_file.docx', 'unstructured')
  if success:
      analysis = analyze_document_structure(docs)
      print(f'Found {analysis["total_documents"]} elements')


## Summary and Next Steps

### What We Learned

In this notebook, we explored two different approaches for processing Word documents:

1. **Docx2txtLoader**: Simple, fast text extraction ideal for basic RAG applications
2. **UnstructuredWordDocumentLoader**: Advanced parsing that preserves document structure and metadata

### Key Takeaways

- **Choose the right tool**: Use Docx2txtLoader for speed, UnstructuredWordDocumentLoader for structure
- **Error handling is crucial**: Always implement proper exception handling for file operations
- **Consider performance**: Memory usage and processing time scale differently between methods
- **Structure matters**: Structured parsing enables more sophisticated document analysis

### Next Steps

1. **Experiment with your own documents**: Try both methods with different document types
2. **Implement chunking strategies**: Split large documents for better processing
3. **Build a document pipeline**: Combine these techniques with text embedding and retrieval
4. **Optimize for production**: Consider caching, parallel processing, and monitoring

### Resources for Further Learning

- [LangChain Document Loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/)
- [Unstructured Library Documentation](https://docs.unstructured.io/)
- [RAG Pipeline Best Practices](https://docs.llamaindex.ai/en/stable/optimizing/building_rag.html)