# OCR-NLP Pipeline Walkthrough

This notebook demonstrates how to use the OCR-NLP Pipeline to extract structured information from documents such as PDFs, invoices, and forms.

## Setup

First, let's import the necessary modules and set up the pipeline.

In [None]:
import sys
import os
import json
from pathlib import Path
from pprint import pprint

# Add the parent directory to the path so we can import the modules
sys.path.append('..')

# Import the pipeline modules
from src.pipeline import PipelineBuilder

## 1. Creating a Pipeline

The pipeline is built using the `PipelineBuilder` class, which allows for easy configuration of the OCR engine, NLP processing, and entity extraction components.

In [None]:
# Create a pipeline with default configuration
pipeline = PipelineBuilder().build()

# Alternatively, you can customize the pipeline
custom_pipeline = (
    PipelineBuilder()
    .with_ocr_engine('tesseract', lang='eng')
    .with_ocr_dpi(300)
    .with_language('en')
    .with_entity_extractor('spacy', model_name='en_core_web_sm')
    .build()
)
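The chained `with_...` calls above follow the fluent builder pattern: each configuration method records a setting and returns the builder itself, so calls can be strung together before `build()`. The sketch below illustrates the idea with a hypothetical `SketchPipelineBuilder` class. It is not the repository's `PipelineBuilder`, and its defaults and returned dictionary are assumptions made for illustration only.

```python
class SketchPipelineBuilder:
    """Hypothetical stand-in that illustrates the fluent builder pattern."""

    def __init__(self):
        # Default values are assumptions for this sketch, not the real defaults
        self._config = {"ocr_engine": "tesseract", "ocr_dpi": 300, "language": "en"}

    def with_ocr_engine(self, name, **options):
        self._config["ocr_engine"] = name
        self._config["ocr_options"] = options
        return self  # returning self is what makes chaining possible

    def with_ocr_dpi(self, dpi):
        self._config["ocr_dpi"] = dpi
        return self

    def build(self):
        # A real builder would construct pipeline components here;
        # this sketch just returns a copy of the collected settings
        return dict(self._config)


config = (
    SketchPipelineBuilder()
    .with_ocr_engine("tesseract", lang="eng")
    .with_ocr_dpi(150)
    .build()
)
print(config)
```

Settings that are never overridden keep their defaults, which is why `PipelineBuilder().build()` can work without any configuration calls.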

## 2. Processing a Document

Now let's process a sample document. For this demo, we'll use one of the sample PDFs included in the repository.

In [None]:
# Path to sample document
sample_dir = Path('../data/pdf_samples')
output_dir = Path('../outputs')

# Create output directory if it doesn't exist
output_dir.mkdir(parents=True, exist_ok=True)

# List available sample documents
sample_docs = list(sample_dir.glob('*.pdf'))
print(f"Available sample documents: {[doc.name for doc in sample_docs]}")

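# Select a sample document (prefer an invoice, otherwise the first PDF found)
if not sample_docs:
    raise FileNotFoundError(f"No PDF files found in {sample_dir.resolve()}")
sample_doc = next((doc for doc in sample_docs if 'invoice' in doc.name.lower()), sample_docs[0])
print(f"Selected document: {sample_doc.name}")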

### Note on OCR Dependencies

Running the full OCR pipeline requires Tesseract OCR to be installed on your system. Since this is a demo, we'll load pre-processed sample outputs instead to demonstrate the pipeline's capabilities.

In [None]:
# Load pre-processed sample output
sample_output_path = Path(f'../data/sample_outputs/{sample_doc.stem}_results.json')

result = None  # stays None if no pre-processed results are available
if sample_output_path.exists():
    with open(sample_output_path, 'r', encoding='utf-8') as f:
        result = json.load(f)
    print(f"Loaded pre-processed results from {sample_output_path}")
else:
    print(f"No pre-processed results found for {sample_doc.name}")
    print("To generate sample outputs, run: python scripts/generate_sample_outputs.py")
    print("The remaining cells need these results, so generate them before continuing.")
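To decide between live OCR and the pre-processed outputs, it can help to check whether a Tesseract executable is on the `PATH`. The following is a minimal sketch using only the standard library. The helper name `tesseract_available` is hypothetical, and the check assumes the executable is named `tesseract`.

```python
import shutil


def tesseract_available():
    """Return True if a `tesseract` executable is found on the PATH."""
    return shutil.which("tesseract") is not None


if tesseract_available():
    print("Tesseract found; live OCR should be possible.")
else:
    print("Tesseract not found; use the pre-processed sample outputs.")
```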

## 3. Analyzing the Results

Now let's examine the results of the OCR and entity extraction process.

In [None]:
# Display document metadata
print("Document Metadata:")
pprint(result['document'])

# Display OCR text sample (first 200 characters)
print("\nOCR Text Sample:")
ocr_text = result['ocr']['text']
print(ocr_text[:200] + ("..." if len(ocr_text) > 200 else ""))

# Display text analysis
print("\nText Analysis:")
pprint(result['text_analysis'])
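The cells in this notebook read four top-level keys from the result (`document`, `ocr`, `text_analysis`, `entities`) plus `ocr.text`. A quick check for missing keys can give a clearer error than a `KeyError` later on. The sketch below assumes only the keys used in this notebook; the helper name `missing_result_keys` is hypothetical, and the real result may contain more fields.

```python
def missing_result_keys(result):
    """Return the expected keys that are missing from a result dictionary."""
    expected = ("document", "ocr", "text_analysis", "entities")
    missing = [key for key in expected if key not in result]
    # Only inspect the nested OCR block if it is present and is a dictionary
    if isinstance(result.get("ocr"), dict) and "text" not in result["ocr"]:
        missing.append("ocr.text")
    return missing


# Made-up, deliberately incomplete result for illustration
example = {"document": {}, "ocr": {}, "entities": []}
print(missing_result_keys(example))  # ['text_analysis', 'ocr.text']
```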

## 4. Exploring Extracted Entities

One of the key features of the pipeline is entity extraction. Let's examine the entities that were extracted from the document.

In [None]:
# Group entities by type
entities_by_type = {}
for entity in result['entities']:
    entity_type = entity['type']
    if entity_type not in entities_by_type:
        entities_by_type[entity_type] = []
    entities_by_type[entity_type].append(entity)

# Display entities by type
print(f"Extracted Entities ({len(result['entities'])}):\n")
for entity_type, entities in entities_by_type.items():
    print(f"{entity_type} ({len(entities)})")
    for entity in entities[:3]:  # Show up to 3 entities per type
        print(f"  - {entity['text']} (confidence: {entity['confidence']})")
    
    if len(entities) > 3:
        print(f"  - ... and {len(entities) - 3} more")
    print()
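Each entity carries a `confidence` value, so low-confidence extractions can be dropped while grouping. The sketch below assumes `confidence` is a number where higher means more reliable; the threshold of 0.8 and the sample entities are made up for illustration.

```python
from collections import defaultdict


def group_confident_entities(entities, min_confidence=0.8):
    """Group entities by type, keeping only those at or above the threshold."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["confidence"] >= min_confidence:
            grouped[entity["type"]].append(entity)
    return dict(grouped)


# Made-up entities that mirror the fields used in this notebook
sample_entities = [
    {"type": "DATE", "text": "2024-01-15", "confidence": 0.95},
    {"type": "ORG", "text": "Example Corp", "confidence": 0.55},
    {"type": "DATE", "text": "2024-02-01", "confidence": 0.81},
]
filtered = group_confident_entities(sample_entities)
print({t: [e["text"] for e in es] for t, es in filtered.items()})
# {'DATE': ['2024-01-15', '2024-02-01']}
```

Using `defaultdict(list)` removes the need for the explicit "key not present" check in the grouping loop.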

## 5. Visualizing Entity Extraction

Let's create a simple visualization of the extracted entities.

In [None]:
import matplotlib.pyplot as plt

# Count entities by type
entity_types = list(entities_by_type.keys())
entity_counts = [len(entities_by_type[t]) for t in entity_types]

# Create a bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(entity_types, entity_counts, color='skyblue')
plt.xlabel('Entity Type')
plt.ylabel('Count')
plt.title('Extracted Entities by Type')
plt.xticks(rotation=45, ha='right')
# Add count labels on top of bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, height + 0.1,
             f'{int(height)}', ha='center', va='bottom')

# Adjust the layout after all labels have been added
plt.tight_layout()
plt.show()
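When matplotlib is not available (for example, in a terminal-only environment), the same counts can be shown as a plain-text chart. This is a minimal sketch; the `text_bar_chart` helper and the sample counts are hypothetical.

```python
def text_bar_chart(counts, width=30):
    """Render counts as horizontal text bars, largest first."""
    if not counts:
        return ""
    largest = max(counts.values()) or 1  # avoid division by zero
    label_width = max(len(label) for label in counts)
    lines = []
    for label, count in sorted(counts.items(), key=lambda item: -item[1]):
        # Scale each bar relative to the largest count; show at least one mark
        bar = "#" * max(1, round(count / largest * width)) if count else ""
        lines.append(f"{label.ljust(label_width)} | {bar} {count}")
    return "\n".join(lines)


# Made-up counts for illustration
print(text_bar_chart({"DATE": 4, "ORG": 2, "MONEY": 6}, width=12))
```

With the real data, the input would be `{t: len(entities_by_type[t]) for t in entities_by_type}`.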

## 6. Creating a Structured Document Summary

Now let's create a structured summary of the document based on the extracted entities.

In [None]:
def create_document_summary(entities_by_type):
    """Create a structured summary of the document based on extracted entities."""

    def first_text(entity_type):
        """Return the text of the first entity of the given type, or None."""
        entities = entities_by_type.get(entity_type)
        return entities[0]['text'] if entities else None

    summary = {}

    # Infer the document type from the entity types that are present
    if 'INVOICE_NUM' in entities_by_type:
        summary['document_type'] = 'Invoice'

        # Extract invoice details
        summary['invoice_number'] = first_text('INVOICE_NUM')
        summary['date'] = first_text('DATE')
        summary['total_amount'] = first_text('TOTAL')

        # Extract vendor/customer information
        summary['vendor'] = first_text('ORG')
        summary['customer'] = first_text('PERSON')

        # Extract line items by pairing the i-th product, quantity, and price;
        # zip stops at the shortest of the three lists
        if all(t in entities_by_type for t in ('PRODUCT', 'QUANTITY', 'MONEY')):
            summary['line_items'] = [
                {'product': p['text'], 'quantity': q['text'], 'price': m['text']}
                for p, q, m in zip(
                    entities_by_type['PRODUCT'],
                    entities_by_type['QUANTITY'],
                    entities_by_type['MONEY'],
                )
            ]
    elif 'PERSON' in entities_by_type and 'ADDRESS' in entities_by_type:
        summary['document_type'] = 'Form'

        # Extract form details
        summary['person'] = first_text('PERSON')
        summary['address'] = first_text('ADDRESS')
        summary['date'] = first_text('DATE')
        summary['organization'] = first_text('ORG')
        summary['tax_id'] = first_text('TAX_ID')
    else:
        summary['document_type'] = 'Unknown'

    return summary

# Create and display document summary
document_summary = create_document_summary(entities_by_type)
print("Document Summary:")
pprint(document_summary)
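The line-item logic in `create_document_summary` pairs products, quantities, and prices by position and stops at the shortest list, so unmatched entities are silently dropped. One way to surface such mismatches is to pad the shorter lists instead. The sketch below uses `itertools.zip_longest`; the `pair_line_items` helper and the sample values are hypothetical.

```python
from itertools import zip_longest


def pair_line_items(products, quantities, prices):
    """Pair values by position, filling gaps with None instead of dropping them."""
    return [
        {"product": p, "quantity": q, "price": m}
        for p, q, m in zip_longest(products, quantities, prices)
    ]


# Made-up values: two products but only one price was extracted
items = pair_line_items(["Widget", "Gadget"], ["2", "1"], ["$10.00"])
incomplete = [item for item in items if None in item.values()]
print(incomplete)
# [{'product': 'Gadget', 'quantity': '1', 'price': None}]
```

Positional pairing also assumes the three entity lists appear in the same order as the rows of the invoice, which may not hold for every layout.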

## 7. Exporting Structured Data

Finally, let's export the structured data to a JSON file.

In [None]:
# Export document summary to JSON
summary_output_path = output_dir / f"{sample_doc.stem}_summary.json"
with open(summary_output_path, 'w', encoding='utf-8') as f:
    json.dump(document_summary, f, indent=2, ensure_ascii=False)

print(f"Document summary exported to: {summary_output_path}")
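To confirm that an export of this kind round-trips cleanly, the file can be read back and compared with the original dictionary. The sketch below writes to a temporary directory with the same `json.dump` options as above; the summary values are made up. It also shows that `ensure_ascii=False` keeps non-ASCII characters readable in the file.

```python
import json
import tempfile
from pathlib import Path

# Made-up summary with a non-ASCII character
summary = {"document_type": "Invoice", "vendor": "Café Example"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "summary.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)
    raw_text = path.read_text(encoding="utf-8")

print(reloaded == summary)  # True
print("Café" in raw_text)   # True: the character is stored as-is, not escaped
```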

## Conclusion

In this notebook, we've demonstrated how to use the OCR-NLP Pipeline to extract structured information from documents. The pipeline combines OCR, text preprocessing, and entity extraction to convert unstructured documents into structured data that can be easily analyzed and processed.

Key features demonstrated:
- Creating and configuring the pipeline
- Processing documents
- Analyzing extracted entities
- Visualizing entity extraction results
- Creating structured document summaries
- Exporting structured data to JSON

For more advanced usage, refer to the documentation and examples in the repository.