# Cell 1: **TEI XML Document Processing with AI Integration**

*Please note that the majority of this codebase (approximately 90%) was AI-generated by the large language model `anthropic/claude-sonnet-4`.*
*The author's contributions included the initial design, code review, prompts, integration, and comprehensive testing.*

*Inspired by the article https://aiucd2025.dlls.univr.it/assets/pdf/papers/94.pdf*

## **Overview**

This Jupyter notebook provides a comprehensive tool for **TEI XML document processing** and **semantic annotation** of historical documents and manuscripts. It's designed to help digital humanities researchers, archivists, and scholars efficiently convert PDF documents into properly structured TEI XML format with integrated summaries and metadata.

Leveraging **OpenRouter AI models** for text extraction, structure analysis, and TEI encoding, the notebook enables automated processing of historical documents while maintaining scholarly standards for digital editions.

---

### **Technical Approach**

#### **AI-Powered Document Analysis**
- Uses **OpenAI GPT-4o** (`openai/gpt-4o`) for OCR and text extraction from page images
- Employs a **DeepSeek R1 distilled model** (`deepseek/deepseek-r1-distill-qwen-32b`) for summaries and structured TEI XML generation
- Performs **document boundary detection** with optional human verification
- Enables **automated metadata extraction** and summary generation
- **Strength**: Excellent for complex historical documents with varied layouts
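
As a rough illustration of the OCR step, the sketch below builds an OpenRouter chat-completions payload that pairs a text prompt with a base64-encoded page image. The helper name `build_ocr_payload` is hypothetical; the model ID and message layout mirror the request this notebook sends later, and nothing is transmitted here.

```python
import base64

def build_ocr_payload(image_bytes, prompt, model="openai/gpt-4o"):
    """Build a chat-completions payload with one text part and one image part.

    Hypothetical helper for illustration; it only assembles the request body.
    """
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    data_url = f"data:image/jpeg;base64,{encoded}"
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }

# Three placeholder bytes stand in for a real JPEG page image
payload = build_ocr_payload(b"\xff\xd8\xff", "Extract all text from this page.")
```

Keeping payload construction separate from the HTTP call makes the request shape easy to inspect before any API credit is spent.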

### **Key Features**

#### **Document Processing**
- Extracts text from **PDF documents**, falling back to vision-model OCR when a page yields 100 characters or fewer of embedded text
- Handles both **text-based** and **image-based** PDFs automatically
- Maintains **spatial layout** and formatting information
- Supports **multi-document** PDFs with intelligent boundary detection
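
The text-based versus image-based decision can be sketched as a per-page check on how much embedded text a page yields. The 100-character threshold is the one used in the extraction code later in this notebook; the function names `needs_ocr` and `plan_extraction` are hypothetical.

```python
MIN_TEXT_CHARS = 100  # threshold this notebook uses for "good" embedded text

def needs_ocr(embedded_text, min_chars=MIN_TEXT_CHARS):
    """Return True when a page's embedded text is missing or too short to trust."""
    return not embedded_text or len(embedded_text.strip()) <= min_chars

def plan_extraction(pages_text):
    """Label each page 'direct' (use embedded text) or 'ocr' (render and OCR)."""
    return ["ocr" if needs_ocr(text) else "direct" for text in pages_text]

plan = plan_extraction(["", "x" * 150, "   short   ", None])
```

A fixed character count is a coarse heuristic: a page holding only a short note would be sent to OCR even though its embedded text is complete.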

#### **TEI XML Generation**
- Creates **scholarly-standard TEI XML** with proper structure
- Integrates **detailed summaries** directly into TEI metadata
- Preserves **correspondence metadata** (sender, recipient, dates)
- Includes **comprehensive document descriptions** and annotations
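
To make the correspondence metadata concrete, here is a minimal sketch that builds a TEI `<correspDesc>` fragment with the standard library. It is not how this notebook produces its XML (that is delegated to the LLM); the helper names `tei` and `build_corresp_desc` and the sample names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI elements without a prefix

def tei(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"

def build_corresp_desc(sender, recipient, sent_date=None):
    """Return a <correspDesc> fragment as a string (hypothetical helper)."""
    corresp = ET.Element(tei("correspDesc"))
    sent = ET.SubElement(corresp, tei("correspAction"), {"type": "sent"})
    ET.SubElement(sent, tei("persName")).text = sender
    if sent_date:  # expected in YYYY-MM-DD form
        ET.SubElement(sent, tei("date"), {"when": sent_date}).text = sent_date
    received = ET.SubElement(corresp, tei("correspAction"), {"type": "received"})
    ET.SubElement(received, tei("persName")).text = recipient
    return ET.tostring(corresp, encoding="unicode")

# Sample values are invented for the example
xml_fragment = build_corresp_desc("Müller & Sohn", "Anna Schmidt", "1923-04-17")
```

Building the fragment through an XML API rather than string formatting means special characters such as `&` are escaped automatically.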

#### **Quality Assurance**
- **Human verification** option for document boundary detection
- **Fallback mechanisms** for robust processing
- **Detailed logging** for process tracking and debugging
- **Validation** and error handling throughout the pipeline
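
One simple form the validation and fallback steps can take is a well-formedness check that decides between the LLM output and a fallback document. The sketch below assumes hypothetical helpers `is_well_formed` and `choose_output`; note that it only checks XML well-formedness, which is weaker than validation against the TEI schema.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML; this does not check the TEI schema."""
    try:
        minidom.parseString(xml_text)
        return True
    except (ExpatError, ValueError, TypeError):
        return False

def choose_output(llm_xml, fallback_xml):
    """Prefer the LLM output, but fall back when it cannot be parsed."""
    return llm_xml if is_well_formed(llm_xml) else fallback_xml

good = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text/></TEI>'
broken = "<TEI><text></TEI>"  # <text> is never closed
result = choose_output(broken, good)
```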

---

### **How It Works**
1. **PDF Analysis**: Documents are analyzed for text content and structure
2. **Boundary Detection**: AI determines document separations with optional human verification  
3. **Text Extraction**: Advanced OCR and text extraction preserving layout
4. **TEI Generation**: Structured XML creation with integrated metadata and summaries
5. **Final Assembly**: Creation of organized PDF output with original pages and TEI XML

---

### **Getting Started**
1. Install the required Python libraries
2. Configure your **OpenRouter API key** 
3. Set up input and output directories
4. Run the processing pipeline on your PDF collection
5. Review generated TEI XML files and final organized PDF
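
For step 2, one option is to read the key from an environment variable instead of pasting it into a cell, so it does not end up in a shared copy of the notebook. This is a sketch under assumptions: the variable name `OPENROUTER_API_KEY` and the helper `load_api_key` are illustrative choices, not something the notebook requires.

```python
import os

def load_api_key(env_var="OPENROUTER_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the pipeline."
        )
    return key

# Usage (raises RuntimeError if the variable is not set):
# OPENROUTER_API_KEY = load_api_key()
```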

---

## **TEI XML and Digital Humanities Standards**

This notebook implements **Text Encoding Initiative (TEI)** standards for digital humanities research, enabling:

1. **Scholarly Digital Editions**: Proper encoding of historical documents
2. **Metadata Integration**: Comprehensive document descriptions and summaries
3. **Structural Preservation**: Maintaining original document layout and formatting
4. **Research Accessibility**: Creating searchable, annotated digital collections

# Cell 2: Import Libraries and Check for Missing Dependencies

In [None]:
# Standard library imports (should always be available)
import os, io, re, json, tempfile, base64, math, textwrap, warnings, logging
from datetime import datetime
from xml.dom import minidom

# Check and import third-party libraries with error handling
missing_packages = []

try:
    import requests
except ImportError:
    missing_packages.append("requests")

try:
    import fitz  # PyMuPDF
except ImportError:
    missing_packages.append("PyMuPDF")

try:
    import numpy as np
except ImportError:
    missing_packages.append("numpy")

try:
    from tqdm import tqdm  # Import the function, not the module
except ImportError:
    missing_packages.append("tqdm")

try:
    import matplotlib
    matplotlib.use('Agg')  # Use non-interactive backend to reduce warnings
    import matplotlib.pyplot as plt
except ImportError:
    missing_packages.append("matplotlib")

try:
    from PIL import Image, ImageDraw, ImageFont
except ImportError:
    missing_packages.append("Pillow")

try:
    from PyPDF2 import PdfReader, PdfWriter
except ImportError:
    missing_packages.append("PyPDF2")

try:
    from reportlab.lib.pagesizes import letter, A4
    from reportlab.pdfgen import canvas
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont
    from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
except ImportError:
    missing_packages.append("reportlab")

# Configuration
warnings.filterwarnings("ignore", category=UserWarning)
# Note: TLS certificate verification is left enabled; disabling it (for example
# via PYTHONHTTPSVERIFY=0) would expose API traffic to interception

# Report results
if missing_packages:
    print("❌ Missing required packages:")
    for package in missing_packages:
        print(f"   - {package}")
    print("\n📦 Install missing packages with:")
    print("!pip install", " ".join(missing_packages))
    print("\n🔄 After installation, restart the kernel and run this cell again.")
else:
    print("✅ All libraries imported successfully!")
    print("📚 Ready to process TEI XML documents!")
    
    # Test tqdm import specifically
    try:
        test_list = [1, 2, 3]
        list(tqdm(test_list, desc="Testing tqdm"))
        print("✅ tqdm is working correctly!")
    except Exception as e:
        print(f"❌ tqdm import issue: {e}")

# Cell 3: Configuration Settings

In [None]:
# Configuration - These will be set up later with user input
OPENROUTER_API_KEY = ""  # Will be configured in the last cell
INPUT_FOLDER = "./tei"  # Replace with your input folder path
OUTPUT_PDF_PATH = "./tei/output.pdf"  # Replace with your output file path
TEI_OUTPUT_FOLDER = "./tei/tei_xml"  # TEI XML output folder
ENABLE_HUMAN_VERIFICATION = True  # Set to False for fully automated processing

# Setup logging
logging.basicConfig(
    level=logging.INFO, 
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('tei_processing.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

print("⚙️ Configuration loaded!")
print(f"📁 Input folder: {INPUT_FOLDER}")
print(f"📄 Output PDF: {OUTPUT_PDF_PATH}")
print(f"🗃️ TEI XML folder: {TEI_OUTPUT_FOLDER}")
print(f"👤 Human verification: {'Enabled' if ENABLE_HUMAN_VERIFICATION else 'Disabled'}")

# Cell 4: TEI-Aware OCR Prompt (Editable)

In [None]:
# TEI-AWARE OCR PROMPT - Edit this cell to customize OCR behavior
TEI_OCR_PROMPT = """Extract all the text from this document page with precise structure preservation for TEI XML annotation.

CRITICAL: Maintain exact spatial layout and formatting including:
- Line breaks and paragraph boundaries
- Header information (letterhead, addresses) - mark as [HEADER]
- Date and place of writing - mark as [DATELINE]
- Salutation - mark as [SALUTATION]
- Main body text with exact paragraph breaks
- Closing formulas - mark as [CLOSING]
- Signatures - mark as [SIGNATURE]
- Any marginal notes or additions - mark as [MARGIN NOTE]
- Postscripts - mark as [POSTSCRIPT]

Preserve indentation, line spacing, and any special formatting.
Use line breaks exactly as they appear in the document.
Mark structural elements clearly for TEI encoding."""

print("📝 TEI-aware OCR prompt configured!")
print("🔧 You can edit this cell to customize how the AI extracts text structure")

# Cell 5: Summary Generation Prompt (Editable)

In [None]:
# SUMMARY GENERATION PROMPT - Edit this cell to customize summary format
SUMMARY_PROMPT_TEMPLATE = """Create a detailed summary in German of this document. Structure it with clear sections:

**Dokumentart:** [Document type]
**Hauptthema und Zweck:** [Main topic and purpose]
**Schlüsselpersonen:** [Key persons]
**Wichtige Daten und Orte:** [Important dates and places]
**Hauptinhalt und Entscheidungen:** [Main content and decisions]
**Historischer Kontext:** [Historical context if apparent]
**Besondere Bemerkungen:** [Special remarks or notable features]

Each section should be a separate paragraph. Write in German.

Document Text:
{text}"""

print("📋 Summary generation prompt configured!")
print("🔧 You can edit this cell to customize summary structure and language")

# Cell 6: TEI XML Generation Prompt (Editable)

In [None]:
# TEI XML GENERATION PROMPT - Edit this cell to customize TEI encoding behavior
TEI_XML_PROMPT_TEMPLATE = """You are an expert TEI XML encoder. Your task is to transform the provided document text into well-formed TEI XML. You MUST prioritize the complete preservation of ALL structural elements, spatial formatting, and content from the original document.

CRITICAL REQUIREMENTS FOR OUTPUT:
1. **Exact Structural Preservation:** Preserve all original paragraph breaks, line breaks, and spatial layout (indentation, spacing).
2. **Structural Guidance:** Use explicit structural markers (e.g., "[HEADER]", "[DATELINE]", "[SIGNATURE]") provided within the document text to guide your TEI encoding.
3. **Detailed Summary Integration:** Integrate the provided `detailed_summary` into the `<sourceDesc>` section as specified.
4. **Well-Formed TEI XML:** The output MUST be valid TEI XML conforming to the specified schema and element usage.

TEI STRUCTURE RULES:

* **`<teiHeader>` Element (Full Metadata Structure):**
    * `<fileDesc>`: Describes the electronic file and the source document.
        * `<titleStmt>`: Document title and responsibility.
        * `<publicationStmt>`: Information about the publication of the TEI XML file.
        * `<sourceDesc>`: DESCRIPTION OF THE ORIGINAL SOURCE DOCUMENT.
            * MUST include basic source information (e.g., `<p>` element).
            * MUST include a `<note type="summary" xml:lang="de">` containing the `detailed_summary` provided.
            * Optionally include physical description if relevant (e.g., `<p>`).
    * `<profileDesc>`: Provides a profile of the document.
        * `<correspDesc>`: **Correspondence Metadata (CRITICAL for Sender Address):**
            * `<correspAction type="sent">`: Describes the sending action.
                * `<persName>`: Sender's name.
                * `<placeName>`: Place of sending.
                * `<date when="YYYY-MM-DD">`: Date of sending (format as YYYY-MM-DD).
                * **`<address>`: SENDER'S ADDRESS (REQUIRED if present in source):**
                    * Use this element within `<correspAction type="sent">` to encapsulate the sender's address.
                    * Within `<address>`, use `<addrLine>` for individual lines of the address.
                    * Example for address lines: `<address><addrLine>Heidelberg</addrLine><addrLine>Hauptstraße 15</addrLine></address>`
            * `<correspAction type="received">`: Describes the receiving action.
                * `<persName>`: Recipient's name.
        * `<abstract>`: A brief abstract of the document's content.
        * `<textClass>`: (Optional, but good practice if keywords/taxonomy are derivable)
    * `<revisionDesc>`: Processing information, including timestamps.

* **`<text><body>` Element (Document Content):**
    * `<div type="letter">`: Main container for the letter content.
    * `<head>`: For letterheads and general headers indicated by "[HEADER]".
    * `<dateline>`: For dates and places typically found at the beginning of a letter, indicated by "[DATELINE]".
    * `<salute>`: For salutations, indicated by "[SALUTATION]".
    * `<p>`: For ALL paragraphs.
        * Preserve exact line breaks within paragraphs using `<lb/>`.
        * Maintain original indentation using `<space dim="horizontal" extent="X"/>` (estimate X as number of character spaces).
    * `<closer>`: For closing formulas, indicated by "[CLOSING]".
    * `<signed>`: For signatures, indicated by "[SIGNATURE]".
    * `<postscript>`: For postscripts, indicated by "[POSTSCRIPT]".
    * `<note place="margin">`: For marginal notes, indicated by "[MARGIN NOTE]".

* **TEXT PRESERVATION & SPECIAL ENCODING:**
    * `<unclear>`: For text that is unclear or difficult to decipher in the original.
    * `<gap reason="illegible"/>`: For completely unreadable or missing portions of text.
    * `<supplied>`: For editorial additions or text supplied by the encoder for clarity.
    * **Crucial Formatting Preservation:** Ensure that ALL line breaks are represented by `<lb/>` and all significant horizontal spacing (indentation) by `<space dim="horizontal" extent="X"/>`. Maintain the original paragraph structure precisely.

SUMMARY FORMATTING:
- Structure the German summary with clear paragraph breaks
- Use <p> elements for each section
- Mark the section headers as bold with <hi rend="bold"> (TEI has no <strong> element)
- Example: <p><hi rend="bold">Dokumentart:</hi> Brief</p>

ADDRESS HANDLING:
- Recipient addresses at the top of letters should be encoded as <address> blocks
- Use <addrLine> for each line of the address
- Place recipient address in both the header <correspAction type="received"> and in the body if it appears there
- Do not use <space> and <lb/> for addresses - use proper <address><addrLine> structure

DETAILED SUMMARY TO INTEGRATE (in German):
{detailed_summary}

Document to encode (with structural markers):
{text}

Generate the complete TEI XML for the provided document, ensuring exact structural and spatial preservation, correct metadata encoding (especially the sender's address in `<correspAction type="sent">` within an `<address>` element), and seamless integration of the detailed summary."""

print("🏗️ TEI XML generation prompt configured!")
print("🔧 You can edit this cell to customize TEI encoding rules and structure")

# Cell 7: Fallback TEI Template (Editable)

In [None]:
# FALLBACK TEI TEMPLATE - Edit this cell to customize fallback XML structure
def create_fallback_tei_template(text, filename, detailed_summary, found_date="Unknown"):
    """Fallback TEI template when LLM generation fails"""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
    <fileDesc>
    <titleStmt>
        <title>Document from {filename}</title>
        <respStmt>
        <resp>TEI encoding</resp>
        <name>Automated TEI processor</name>
        </respStmt>
    </titleStmt>
    <publicationStmt>
        <p>Unpublished document - digitized and encoded for research purposes</p>
    </publicationStmt>
    <sourceDesc>
        <p>Digitized from original document: {filename}</p>
        <note type="summary" xml:lang="de">
        {detailed_summary}
        </note>
    </sourceDesc>
    </fileDesc>
    <profileDesc>
    <correspDesc>
        <correspAction type="sent">
        <persName>Unknown</persName>
        <placeName>Unknown</placeName>
        <date{' when="' + found_date + '"' if found_date != 'Unknown' else ''}>{found_date}</date>
        </correspAction>
        <correspAction type="received">
        <persName>Unknown</persName>
        </correspAction>
    </correspDesc>
    <abstract>
        <p>Document extracted from {filename} with automated TEI encoding.</p>
    </abstract>
    </profileDesc>
    <revisionDesc>
    <change when="{datetime.now().strftime('%Y-%m-%d')}" who="#automated-processor">
        Automatic TEI encoding with structure preservation and summary integration
    </change>
    </revisionDesc>
</teiHeader>
<text>
    <body>
    <div type="letter">
        <!-- Processed paragraphs will be inserted here -->
    </div>
    </body>
</text>
</TEI>"""

print("🛠️ Fallback TEI template configured!")
print("🔧 You can edit this cell to customize the backup TEI structure")

# Cell 8: Metadata Page Template (Editable)

In [None]:
# METADATA PAGE CONTENT - Edit this cell to customize PDF metadata layout
def format_metadata_content(metadata):
    """Format metadata content without HTML tags and with better structure"""
    
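    # Strip markup from the summary but keep line breaks, because
    # create_metadata_page splits the summary into sections on '\n'
    import html
    import re
    summary = metadata.get('summary', 'Unknown')
    # Turn paragraph ends into line breaks, then remove the remaining tags
    summary = re.sub(r'</p\s*>', '\n', summary)
    summary = re.sub(r'<[^>]+>', '', summary)
    # Decode XML/HTML entities such as &lt;, &gt; and &amp;
    summary = html.unescape(summary)
    # Collapse whitespace within each line and drop empty lines
    lines = [re.sub(r'\s+', ' ', line).strip() for line in summary.splitlines()]
    summary = '\n'.join(line for line in lines if line)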
    
    return {
        'source_file': metadata.get('source_file', 'Unknown'),
        'date': metadata.get('date', 'Unknown'),
        'sender': metadata.get('sender', 'Unknown'),
        'recipient': metadata.get('recipient', 'Unknown'),
        'tei_file': metadata.get('tei_file', 'Unknown'),
        'clean_summary': summary
    }

print("📄 Metadata page template configured!")
print("🔧 You can edit this cell to customize how metadata appears in the final PDF")

# Cell 9: TEI PDF Processor Class - Part 1 (Core Methods)

In [None]:
class TEIPDFProcessor:
    def __init__(self, api_key):
        self.api_key = api_key
        self.openrouter_url = "https://openrouter.ai/api/v1/chat/completions"
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "HTTP-Referer": "https://your-app-website.com",
            "X-Title": "TEI XML Annotation Application"
        }
        
        # Create TEI output directory if it doesn't exist
        os.makedirs(TEI_OUTPUT_FOLDER, exist_ok=True)

    def extract_text_from_image_tei(self, image_path):
        """Extract text from an image using openai/gpt-4o via OpenRouter with the TEI-aware prompt"""
        try:
            # Encode image to base64
            with open(image_path, "rb") as image_file:
                base64_image = base64.b64encode(image_file.read()).decode('utf-8')
                
            data_url = f"data:image/jpeg;base64,{base64_image}"
            
            # Prepare the TEI-aware request with structure preservation
            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": TEI_OCR_PROMPT  # Using the configurable prompt from Cell 4
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": data_url
                            }
                        }
                    ]
                }
            ]
            
            payload = {
                "model": "openai/gpt-4o",
                "messages": messages
            }
            
            # Send request to OpenRouter
            response = requests.post(self.openrouter_url, headers=self.headers, json=payload, timeout=120)
            response.raise_for_status()
            
            result = response.json()
            extracted_text = result['choices'][0]['message']['content']
            
            return extracted_text
        except Exception as e:
            logger.error(f"Error in image text extraction: {e}")
            return ""

    def process_with_llm(self, prompt, model="deepseek/deepseek-r1-distill-qwen-32b"):
        """Process text with LLM via OpenRouter"""
        try:
            payload = {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.1  # Lower temperature for more consistent XML output
            }
            
            response = requests.post(self.openrouter_url, headers=self.headers, json=payload, timeout=120)
            response.raise_for_status()
            
            result = response.json()
            return result['choices'][0]['message']['content']
        except Exception as e:
            logger.error(f"LLM API error: {e}")
            return ""

print("🔧 TEI PDF Processor - Core methods loaded!")

# Cell 10: TEI PDF Processor Class - Part 2 (Text Extraction)

In [None]:
def extract_text_from_pdf(self, pdf_path):
    """Extract text from a PDF, using embedded text where available and openai/gpt-4o OCR otherwise"""
    logger.info(f"Extracting text from {pdf_path}")
    try:
        # First try to extract text directly via PyPDF2
        pdf_reader = PdfReader(pdf_path)
        pages_text = []
        
        for page_idx, page in enumerate(pdf_reader.pages):
            try:
                text = page.extract_text()
                
                # If direct text extraction yields good results, use it
                if text and len(text.strip()) > 100:  # Arbitrary threshold for "good" text
                    pages_text.append(text)
                    continue
                
                # Otherwise, use the vision model via OpenRouter for image-based extraction
                logger.info(f"Using openai/gpt-4o for image-based text extraction on page {page_idx+1}")
                
                # Convert the PDF page to image
                img = self.get_pdf_page_as_image(pdf_path, page_idx)
                if img is None:
                    pages_text.append("")
                    continue
                
                # Save image to a temporary file
                with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as temp_file:
                    img.save(temp_file.name)
                    temp_path = temp_file.name
                
                # Extract text using the vision model with the TEI-aware prompt
                try:
                    extracted_text = self.extract_text_from_image_tei(temp_path)
                    pages_text.append(extracted_text)
                except Exception as e:
                    logger.error(f"Error extracting text with openai/gpt-4o: {e}")
                    pages_text.append("")
                finally:
                    # Clean up temp file, even if extraction raised
                    try:
                        os.unlink(temp_path)
                    except OSError:
                        pass
                
            except Exception as e:
                logger.error(f"Error processing page {page_idx+1}: {e}")
                pages_text.append("")
            
        return pages_text
    except Exception as e:
        logger.error(f"Error extracting text from PDF: {e}")
        return []

def get_pdf_page_as_image(self, pdf_path, page_idx, zoom=2.0):
    """Convert a PDF page to a PIL Image"""
    try:
        # The context manager closes the document even if rendering fails
        with fitz.open(pdf_path) as doc:
            page = doc.load_page(page_idx)

            # Increase resolution with the zoom factor
            matrix = fitz.Matrix(zoom, zoom)
            pixmap = page.get_pixmap(matrix=matrix)

            # Convert to PIL Image (copies the pixel data out of the pixmap)
            img = Image.frombytes("RGB", [pixmap.width, pixmap.height], pixmap.samples)

        return img
    except Exception as e:
        logger.error(f"Error converting PDF page to image: {e}")
        return None

# Add methods to the TEIPDFProcessor class
TEIPDFProcessor.extract_text_from_pdf = extract_text_from_pdf
TEIPDFProcessor.get_pdf_page_as_image = get_pdf_page_as_image

print("📄 TEI PDF Processor - Text extraction methods loaded!")

# Cell 11: TEI PDF Processor Class - Part 3 (Summary and TEI Generation)

In [None]:
def generate_detailed_summary(self, text, filename):
    """Generate a detailed summary using LLM"""
    
    prompt = SUMMARY_PROMPT_TEMPLATE.format(text=text[:6000])  # Using configurable template
    
    try:
        response = self.process_with_llm(prompt)
    except Exception as e:
        logger.error(f"Error generating detailed summary: {e}")
        response = ""
    # process_with_llm returns "" on API errors, so check for an empty response too
    if not response.strip():
        return f"Detaillierte Zusammenfassung konnte nicht generiert werden. Dokument: {filename}"
    return response.strip()

def generate_tei_xml(self, text, filename, detailed_summary):
    """Use LLM to generate TEI XML from document text with structure preservation and summary integration"""
    
    prompt = TEI_XML_PROMPT_TEMPLATE.format(
        detailed_summary=detailed_summary,
        text=text
    )  # Using configurable template
    
    response = self.process_with_llm(prompt, model="deepseek/deepseek-r1-distill-qwen-32b")
    
    # Clean and format the XML response
    try:
        # Extract XML from response if wrapped in markdown or other text
        xml_match = re.search(r'<TEI\b[^>]*>.*</TEI>', response, re.DOTALL)
        if xml_match:
            xml_content = xml_match.group()
        else:
            # If no TEI root element is found, assume the entire response is XML
            # and strip any markdown code fences
            xml_content = re.sub(r'```(?:xml)?\n?', '', response).strip()
        
        # Pretty print the XML; parseString raises if the output is empty or
        # not well-formed, which routes to the fallback below
        parsed = minidom.parseString(xml_content)
        pretty_xml = parsed.toprettyxml(indent="  ")
        # Drop the blank lines that toprettyxml inserts
        cleaned_lines = [line for line in pretty_xml.split('\n') if line.strip()]
        return '\n'.join(cleaned_lines)
            
    except Exception as e:
        logger.error(f"Error parsing TEI XML: {e}")
        # Return a basic TEI structure if parsing fails
        return self.create_fallback_tei(text, filename, detailed_summary)

def create_fallback_tei(self, text, filename, detailed_summary):
    """Create a basic TEI structure if LLM output fails, including detailed summary"""
    from xml.sax.saxutils import escape
    
    # Preserve structure even in fallback
    paragraphs = text.split('\n\n')
    tei_paragraphs = []
    
    for para in paragraphs[:5]:  # Limit to first 5 paragraphs
        if para.strip():
            # Escape XML special characters, then convert line breaks to <lb/> tags
            lines = [escape(line) for line in para.split('\n')]
            formatted_para = '<lb/>'.join(lines)
            tei_paragraphs.append(f"        <p>{formatted_para}</p>")
    
    # Extract basic metadata for fallback
    date_patterns = [
        r'(\d{1,2}\.?\d{1,2}\.?\d{2,4})',
        r'(\d{4}-\d{1,2}-\d{1,2})',
        r'(\d{1,2}\s+\w+\s+\d{2,4})'
    ]
    
    found_date = "Unknown"
    for pattern in date_patterns:
        match = re.search(pattern, text)
        if match:
            found_date = match.group(1)
            break
    
    template = create_fallback_tei_template(text, filename, detailed_summary, found_date)
    # Insert the encoded paragraphs at the template's placeholder comment
    placeholder = "<!-- Processed paragraphs will be inserted here -->"
    if tei_paragraphs:
        template = template.replace(placeholder, '\n'.join(tei_paragraphs).strip())
    return template

# Add methods to the TEIPDFProcessor class
TEIPDFProcessor.generate_detailed_summary = generate_detailed_summary
TEIPDFProcessor.generate_tei_xml = generate_tei_xml
TEIPDFProcessor.create_fallback_tei = create_fallback_tei

print("🏗️ TEI PDF Processor - Summary and TEI generation methods loaded!")

# Cell 12: TEI PDF Processor Class - Part 4 (Metadata and File Operations)

In [None]:
def extract_tei_metadata(self, tei_xml, text, filename):
    """Extract metadata from TEI XML for PDF sorting"""
    try:
        # Extract metadata from TEI XML
        date_match = re.search(r'<date[^>]*when="([^"]*)"', tei_xml)
        sender_match = re.search(r'<correspAction[^>]*type="sent"[^>]*>.*?<persName[^>]*>([^<]*)</persName>', tei_xml, re.DOTALL)
        recipient_match = re.search(r'<correspAction[^>]*type="received"[^>]*>.*?<persName[^>]*>([^<]*)</persName>', tei_xml, re.DOTALL)
        
        # Extract summary from TEI XML (it's now integrated in the XML)
        summary_match = re.search(r'<note[^>]*type="summary"[^>]*>(.*?)</note>', tei_xml, re.DOTALL)
        
        date = date_match.group(1) if date_match else "Unknown"
        sender = sender_match.group(1).strip() if sender_match else "Unknown"
        recipient = recipient_match.group(1).strip() if recipient_match else "Unknown"
        summary = summary_match.group(1).strip() if summary_match else "Summary not found in TEI XML"
        
        return {
            "date": date,
            "sender": sender,
            "recipient": recipient,
            "summary": summary
        }
        
    except Exception as e:
        logger.error(f"Error extracting TEI metadata: {e}")
        # Fallback: generate summary if not in XML
        detailed_summary = self.generate_detailed_summary(text, filename)
        return {
            "date": "Unknown",
            "sender": "Unknown", 
            "recipient": "Unknown",
            "summary": detailed_summary
        }

def save_tei_xml(self, tei_xml, original_filename):
    """Save TEI XML to file with original PDF filename"""
    # Create XML filename based on original PDF name
    base_name = os.path.splitext(original_filename)[0]
    xml_filename = f"{base_name}.xml"
    xml_path = os.path.join(TEI_OUTPUT_FOLDER, xml_filename)
    
    try:
        with open(xml_path, 'w', encoding='utf-8') as f:
            f.write(tei_xml)
        logger.info(f"Saved TEI XML with integrated summary: {xml_path}")
        return xml_path
    except Exception as e:
        logger.error(f"Error saving TEI XML: {e}")
        return None

# Add methods to the TEIPDFProcessor class
TEIPDFProcessor.extract_tei_metadata = extract_tei_metadata
TEIPDFProcessor.save_tei_xml = save_tei_xml

print("💾 TEI PDF Processor - Metadata and file operations loaded!")

# Cell 13: TEI PDF Processor Class - Part 5 (PDF Generation)

In [None]:
def create_metadata_page(self, metadata):
    """Create a beautifully formatted PDF page with detailed metadata and summary"""
    pdf_buffer = io.BytesIO()
    c = canvas.Canvas(pdf_buffer, pagesize=letter)
    
    # Page dimensions
    width, height = letter
    
    # Clean and format metadata
    clean_data = format_metadata_content(metadata)
    
    # Header
    y_position = height - 60
    c.setFont("Helvetica-Bold", 18)
    c.setFillColorRGB(0.2, 0.3, 0.6)  # Dark blue
    c.drawCentredString(width/2, y_position, "TEI Document Metadata & Summary")
    
    # Decorative line
    y_position -= 15
    c.setStrokeColorRGB(0.2, 0.3, 0.6)
    c.setLineWidth(2)
    c.line(100, y_position, width-100, y_position)
    
    y_position -= 40
    
    # Document Information Section
    c.setFillColorRGB(0, 0, 0)  # Black
    c.setFont("Helvetica-Bold", 14)
    c.drawString(80, y_position, "📄 Document Information")
    y_position -= 25
    
    # Metadata fields with better formatting
    metadata_fields = [
        ("Source File:", clean_data['source_file']),
        ("Date:", clean_data['date']),
        ("Sender:", clean_data['sender']),
        ("Recipient:", clean_data['recipient']),
        ("TEI XML File:", clean_data['tei_file'])
    ]
    
    c.setFont("Helvetica", 11)
    for label, value in metadata_fields:
        c.setFont("Helvetica-Bold", 11)
        c.drawString(100, y_position, label)
        c.setFont("Helvetica", 11)
        # Handle long text that might overflow
        if len(str(value)) > 60:
            # Split long text into multiple lines
            lines = textwrap.wrap(str(value), 60)
            c.drawString(200, y_position, lines[0])
            for i, line in enumerate(lines[1:], 1):
                y_position -= 12
                c.drawString(200, y_position, line)
        else:
            c.drawString(200, y_position, str(value))
        y_position -= 18
    
    y_position -= 20
    
    # Summary Section Header
    c.setFont("Helvetica-Bold", 14)
    c.drawString(80, y_position, "📋 Document Summary")
    y_position -= 25
    
    # Summary content with proper text wrapping
    summary = clean_data['clean_summary']
    c.setFont("Helvetica", 10)
    
    # Split summary into sections if it contains structured content
    summary_sections = []
    current_section = ""
    
    # Look for bold section headers (converted from <strong> tags)
    lines = summary.split('\n')
    for line in lines:
        line = line.strip()
        if line:
            # Treat a short line that contains a colon as a section header
            if ':' in line and len(line) < 100:
                if current_section:
                    summary_sections.append(current_section.strip())
                current_section = line + "\n"
            else:
                current_section += line + " "
    
    if current_section:
        summary_sections.append(current_section.strip())
    
    # If no sections found, treat as one block
    if not summary_sections:
        summary_sections = [summary]
    
    # Draw summary sections
    for section in summary_sections:
        if y_position < 100:  # Check if we need a new page
            c.showPage()
            y_position = height - 60
            c.setFont("Helvetica-Bold", 14)
            c.drawString(80, y_position, "Document Summary (continued)")
            y_position -= 25
            c.setFont("Helvetica", 10)
        
        # Check if this is a section header
        if ':' in section[:50] and len(section.split('\n')[0]) < 100:
            lines = section.split('\n', 1)
            header = lines[0]
            content = lines[1] if len(lines) > 1 else ""
            
            # Draw section header in bold
            c.setFont("Helvetica-Bold", 10)
            c.drawString(100, y_position, header)
            y_position -= 15
            
            # Draw content
            if content:
                c.setFont("Helvetica", 10)
                wrapped_lines = textwrap.wrap(content.strip(), 80)
                for line in wrapped_lines:
                    if y_position < 50:
                        c.showPage()
                        y_position = height - 60
                        c.setFont("Helvetica", 10)  # showPage() resets the font
                    c.drawString(120, y_position, line)
                    y_position -= 12
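        else:
            # Regular paragraph (reset the font in case a bold header was drawn last)
            c.setFont("Helvetica", 10)
            wrapped_lines = textwrap.wrap(section, 80)
            for line in wrapped_lines:
                if y_position < 50:
                    c.showPage()
                    y_position = height - 60
                    c.setFont("Helvetica", 10)  # showPage() resets the font
                c.drawString(120, y_position, line)
                y_position -= 12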
        
        y_position -= 8  # Extra space between sections
    
    # Footer
    c.setFont("Helvetica", 8)
    c.setFillColorRGB(0.5, 0.5, 0.5)  # Gray
    footer_text = f"Generated on {datetime.now().strftime('%Y-%m-%d %H:%M')} | TEI XML Processing Tool"
    c.drawCentredString(width/2, 30, footer_text)  # Fixed method name
    
    c.save()
    pdf_buffer.seek(0)
    return pdf_buffer

# Replace the old create_tei_xml_pages method with a stub, since TEI XML is no longer included in the PDF
def create_tei_xml_pages(self, tei_xml, metadata):
    """This method is disabled - TEI XML content not included in final PDF"""
    print("ℹ️  TEI XML pages skipped - only metadata and summary included in PDF")
    return None

# Add methods to the TEIPDFProcessor class
TEIPDFProcessor.create_metadata_page = create_metadata_page
TEIPDFProcessor.create_tei_xml_pages = create_tei_xml_pages

print("📄 Fixed PDF generation methods loaded!")
print("✅ ReportLab method names corrected (drawCentredString)")
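
The summary-splitting heuristic inside `create_metadata_page` is easier to check when it is pulled out of the drawing code. The sketch below is a minimal standalone version of that step; the function name `split_summary_sections` and the `max_header_len` parameter are illustrative and not part of the notebook.

```python
def split_summary_sections(summary, max_header_len=100):
    """Group summary lines into sections, starting a new one at each header-like line."""
    sections = []
    current = ""
    for raw_line in summary.split("\n"):
        line = raw_line.strip()
        if not line:
            continue
        # Same heuristic as the cell above: a short line containing a colon is a header.
        if ":" in line and len(line) < max_header_len:
            if current:
                sections.append(current.strip())
            current = line + "\n"
        else:
            # Body lines are joined with spaces so they can be re-wrapped later.
            current += line + " "
    if current:
        sections.append(current.strip())
    # With no usable lines, fall back to the whole summary as one block.
    return sections or [summary]


example = "Context:\nA letter about a land sale.\nIt spans two pages.\nPeople:\nAnna, Karl"
for section in split_summary_sections(example):
    header, _, body = section.partition("\n")
    print(repr(header), "->", repr(body))
```

Note that the heuristic also treats short body lines containing a colon (for example a time such as "10:30") as headers, so summaries with that kind of content may be split more often than intended.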

# Cell 14: TEI PDF Processor Class - Part 6 (Document Boundary Detection)

In [None]:
def determine_document_boundaries(self, page_texts, enable_human_verification=True):
    """Use LLM to determine if pages belong to the same document with optional human verification"""
    documents = []
    current_doc = [0]  # Start with the first page
    
    print(f"\n{'='*60}")
    print("DOCUMENT BOUNDARY DETECTION")
    print(f"{'='*60}")
    print(f"Processing {len(page_texts)} pages...")
    
    if enable_human_verification:
        print("\nHuman verification is ENABLED.")
        print("You will be asked to verify each LLM decision.")
        print("Commands: 'y'=confirm, 'n'=override, 's'=skip remaining, 'q'=quit")
        print(f"{'='*60}")
    
    skip_verification = False
    
    for i in range(1, len(page_texts)):
        print(f"\n{'='*40}")
        print(f"Comparing Page {i} with Page {i+1}")
        print(f"{'='*40}")
        
        # Show text previews
        print(f"\nPAGE {i} (first 1000 chars):")
        print("-" * 30)
        print(page_texts[i-1][:1000] + "..." if len(page_texts[i-1]) > 1000 else page_texts[i-1])
        
        print(f"\nPAGE {i+1} (first 1000 chars):")
        print("-" * 30)
        print(page_texts[i][:1000] + "..." if len(page_texts[i]) > 1000 else page_texts[i])
        
        # Get LLM decision
        prompt = f"""
        Determine if these two texts are from the same document/letter or different documents.
        Consider: writing style, sender/recipient continuity, colors of document, page numbering, dates, salutations, closings,
        and typical letter structure patterns.

        TEXT 1:
        {page_texts[i-1][:10000]}

        TEXT 2:
        {page_texts[i][:10000]}

        Answer with only 'SAME' or 'DIFFERENT'.
        """
        
        print("\n🤖 LLM is analyzing...")
        llm_response = self.process_with_llm(prompt).strip().upper()
        llm_decision = 'SAME' if 'SAME' in llm_response else 'DIFFERENT'
        
        print(f"🤖 LLM Decision: {llm_decision}")
        
        # Human verification
        final_decision = llm_decision
        if enable_human_verification and not skip_verification:
            while True:
                user_input = input(f"\n👤 LLM says '{llm_decision}'. Confirm? (y/n/s/q): ").lower().strip()
                
                if user_input == 'y':
                    print(f"✅ Confirmed: {llm_decision}")
                    final_decision = llm_decision
                    break
                elif user_input == 'n':
                    # Override LLM decision
                    override_decision = 'DIFFERENT' if llm_decision == 'SAME' else 'SAME'
                    print(f"🔄 Overridden: {llm_decision} → {override_decision}")
                    final_decision = override_decision
                    break
                elif user_input == 's':
                    print("⏭️  Skipping remaining verifications. Using LLM decisions.")
                    skip_verification = True
                    final_decision = llm_decision
                    break
                elif user_input == 'q':
                    print("❌ User quit. Using LLM decisions for remaining pages.")
                    enable_human_verification = False
                    final_decision = llm_decision
                    break
                else:
                    print("❓ Invalid input. Please enter 'y' (confirm), 'n' (override), 's' (skip), or 'q' (quit)")
        
        # Apply the final decision
        if final_decision == 'SAME':
            current_doc.append(i)
            print(f"📄 Pages {i} and {i+1} belong to the SAME document")
        else:
            documents.append(current_doc)
            current_doc = [i]
            print(f"📄 Pages {i} and {i+1} are DIFFERENT documents")
            print(f"📁 Document completed: Pages {documents[-1]}")
    
    # Add the last document
    if current_doc:
        documents.append(current_doc)
        print(f"📁 Final document: Pages {current_doc}")
    
    print(f"\n{'='*60}")
    print("DOCUMENT BOUNDARY DETECTION COMPLETE")
    print(f"{'='*60}")
    print(f"Total documents detected: {len(documents)}")
    for idx, doc in enumerate(documents, 1):
        print(f"Document {idx}: Pages {doc} ({len(doc)} pages)")
    print(f"{'='*60}")
    
    return documents

def determine_document_boundaries_batch(self, page_texts):
    """Batch version without human verification for automated processing"""
    return self.determine_document_boundaries(page_texts, enable_human_verification=False)

# Add methods to the TEIPDFProcessor class
TEIPDFProcessor.determine_document_boundaries = determine_document_boundaries
TEIPDFProcessor.determine_document_boundaries_batch = determine_document_boundaries_batch

print("🔍 TEI PDF Processor - Document boundary detection loaded!")
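
The grouping logic in `determine_document_boundaries` can be tried without any LLM calls or prompts. The sketch below separates two steps: reading a verdict out of a free-form model reply, and turning the per-pair verdicts into page groups. Both helper names (`parse_verdict`, `group_pages`) are hypothetical, and the "last verdict word wins" rule is an assumption that differs from the substring check used in the cell above; it may suit models that explain their reasoning before answering.

```python
import re


def parse_verdict(response):
    """Return 'SAME' or 'DIFFERENT' based on the last verdict word in the reply."""
    words = re.findall(r"\b(SAME|DIFFERENT)\b", response.upper())
    # Assumption: if the model reasons first, the final verdict word is the answer.
    # A reply with no verdict word falls back to 'DIFFERENT', as in the cell above.
    return words[-1] if words else "DIFFERENT"


def group_pages(verdicts):
    """Turn verdicts for consecutive page pairs into lists of 0-based page indices."""
    documents = []
    current = [0]  # the first page always opens the first document
    for i, verdict in enumerate(verdicts, start=1):
        if verdict == "SAME":
            current.append(i)
        else:
            documents.append(current)
            current = [i]
    documents.append(current)
    return documents


# Three replies describe the boundaries between four pages.
replies = ["same", "The dates differ, so: DIFFERENT", "SAME"]
print(group_pages([parse_verdict(r) for r in replies]))
```

As in the original method, an empty list of verdicts still yields one document containing page index 0, so callers with zero pages should check for that case first.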

# Cell 15: TEI PDF Processor Class - Part 7 (Main Processing Function)

In [None]:
def process_and_sort_pdfs(self, input_folder, output_pdf_path):
    """Main function to process and sort PDFs with TEI XML generation and summary integration"""
    # Ensure output directory exists (dirname is empty when the path is a bare filename)
    output_dir = os.path.dirname(output_pdf_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
        
    pdf_files = [f for f in os.listdir(input_folder) if f.lower().endswith('.pdf')]
    
    if not pdf_files:
        logger.error(f"No PDF files found in {input_folder}")
        return None
        
    logger.info(f"Found {len(pdf_files)} PDF files to process with TEI XML annotation and summary integration")
    documents_data = []
    
    # Process each PDF file
    for pdf_file in tqdm(pdf_files, desc="Processing PDF files with TEI annotation and summary"):
        try:
            pdf_path = os.path.join(input_folder, pdf_file)
            
            # Extract text from PDF (with TEI-aware method)
            page_texts = self.extract_text_from_pdf(pdf_path)
            if not page_texts:
                logger.warning(f"No text found in {pdf_file}, skipping")
                continue
            
            # Determine document boundaries with human verification
            document_groups = self.determine_document_boundaries(page_texts, enable_human_verification=ENABLE_HUMAN_VERIFICATION)
            
            # Process each document group
            for doc_idx, doc_group in enumerate(document_groups):
                # Combine text for the entire document
                combined_text = "\n\n".join([page_texts[i] for i in doc_group])
                
                # Generate detailed summary first
                detailed_summary = self.generate_detailed_summary(combined_text, pdf_file)
                logger.info(f"Generated detailed summary for {pdf_file}")
                
                # Generate TEI XML with structure preservation and summary integration
                tei_xml = self.generate_tei_xml(combined_text, pdf_file, detailed_summary)
                logger.info(f"Generated TEI XML with integrated summary for {pdf_file}")
                
                # Create unique filename for multi-document PDFs
                if len(document_groups) > 1:
                    tei_filename = f"{os.path.splitext(pdf_file)[0]}_doc{doc_idx+1}.xml"
                else:
                    tei_filename = f"{os.path.splitext(pdf_file)[0]}.xml"
                
                # Save TEI XML with integrated summary
                xml_path = self.save_tei_xml(tei_xml, tei_filename)
                
                # Extract metadata from TEI XML (summary is now in the XML)
                metadata = self.extract_tei_metadata(tei_xml, combined_text, pdf_file)
                metadata['source_file'] = pdf_file
                metadata['tei_file'] = tei_filename if xml_path else "Error saving TEI"
                
                # Create document entry with page indices
                documents_data.append({
                    'metadata': metadata,
                    'pdf_path': pdf_path,
                    'pdf_filename': pdf_file,
                    'page_indices': doc_group,
                    'tei_xml': tei_xml,
                    'tei_path': xml_path
                })
                
        except Exception as e:
            logger.error(f"Error processing {pdf_file}: {e}")
            continue
    
    # Sort documents by date
    def sort_by_date(doc):
        date_str = doc['metadata'].get('date', 'Unknown')
        if date_str == "Unknown":
            return datetime.max  # Put documents with unknown dates at the end
        
        # Try different date formats
        for fmt in ('%Y-%m-%d', '%d.%m.%Y', '%d/%m/%Y', '%B %d, %Y', '%Y-%m', '%Y'):
            try:
                return datetime.strptime(date_str, fmt)
            except ValueError:
                continue
        
        return datetime.max  # Default if no format matches
    
    sorted_documents = sorted(documents_data, key=sort_by_date)
    logger.info(f"Sorted {len(sorted_documents)} documents chronologically with TEI XML and integrated summaries")
    
    # Create the final merged PDF
    output_pdf = PdfWriter()
    
    for doc in tqdm(sorted_documents, desc="Creating final PDF with metadata, summary and original pages"):
        try:
            # Add clean, formatted metadata and summary page
            logger.info(f"Adding formatted metadata page for {doc['metadata']['source_file']}")
            metadata_buffer = self.create_metadata_page(doc['metadata'])
            metadata_reader = PdfReader(metadata_buffer)
            for page in metadata_reader.pages:
                output_pdf.add_page(page)
            
            # Skip TEI XML pages - we only want clean metadata and summary
            logger.info(f"Skipping TEI XML pages - clean format requested")
            
            # Add original document pages
            logger.info(f"Adding original pages for {doc['metadata']['source_file']}")
            pdf_reader = PdfReader(doc['pdf_path'])
            for page_idx in doc['page_indices']:
                try:
                    if page_idx < len(pdf_reader.pages):
                        output_pdf.add_page(pdf_reader.pages[page_idx])
                        logger.info(f"Added original page {page_idx+1}")
                    else:
                        logger.warning(f"Page index {page_idx} out of range for {doc['pdf_path']}")
                except Exception as e:
                    logger.error(f"Error adding original page {page_idx+1}: {e}")
                    continue
                    
        except Exception as e:
            logger.error(f"Error processing document {doc['metadata']['source_file']}: {e}")
            continue
    
    # Write the final merged PDF
    try:
        with open(output_pdf_path, 'wb') as f:
            output_pdf.write(f)
        
        logger.info(f"Successfully created merged PDF at {output_pdf_path}")
        logger.info(f"TEI XML files with integrated summaries saved in {TEI_OUTPUT_FOLDER}")
        
    except Exception as e:
        logger.error(f"Error writing final PDF: {e}")
        return None
    
    return output_pdf_path

# Add method to the TEIPDFProcessor class
TEIPDFProcessor.process_and_sort_pdfs = process_and_sort_pdfs

print("🚀 Updated main processing function loaded!")
print("✅ PDF will now contain: 1) Clean formatted metadata & summary, 2) Original pages")
print("📄 TEI XML content removed from PDF (still saved as separate .xml files)")
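
The chronological ordering relies on a sort key that tries several date formats and pushes anything unparseable to the end. The sketch below isolates that key so its behavior can be inspected on sample strings; `DATE_FORMATS` and `date_sort_key` are illustrative names, and the extra guard for non-string values is an addition that the notebook's `sort_by_date` does not have.

```python
from datetime import datetime

# Same formats, in the same order, as the sort_by_date helper above.
# Note: %B matches the locale's full month name, so it depends on the locale.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y", "%B %d, %Y", "%Y-%m", "%Y")


def date_sort_key(date_str):
    """Return a datetime for sorting; unparseable values sort last."""
    if not isinstance(date_str, str):
        return datetime.max  # added guard: missing or non-string dates go last
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(date_str.strip(), fmt)
        except ValueError:
            continue
    return datetime.max


dates = ["Unknown", "03.05.1921", "1920", "1921-04"]
print(sorted(dates, key=date_sort_key))
```

Partial dates resolve to the first day of the period (`"1921-04"` becomes 1 April 1921), and a slash-separated date such as `03/05/1921` is read day-first because `%d/%m/%Y` is the only slash format in the list.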

# Cell 16: Main Processing Function

In [None]:
def main():
    """Main function to run the TEI PDF processor"""
    print("=" * 70)
    print("TEI XML PDF Processing with Clean Formatted Output")
    print("=" * 70)
    
    logger.info(f"Starting TEI XML PDF processing with clean summary integration")
    logger.info(f"Input folder: {INPUT_FOLDER}")
    logger.info(f"Output PDF: {OUTPUT_PDF_PATH}")
    logger.info(f"TEI XML output folder: {TEI_OUTPUT_FOLDER}")
    
    # Validate input folder
    if not os.path.exists(INPUT_FOLDER):
        logger.error(f"Input folder does not exist: {INPUT_FOLDER}")
        print(f"❌ Error: Input folder does not exist: {INPUT_FOLDER}")
        print("Please create the folder or update the INPUT_FOLDER path in Cell 3")
        return
    
    try:
        processor = TEIPDFProcessor(OPENROUTER_API_KEY)
        result = processor.process_and_sort_pdfs(INPUT_FOLDER, OUTPUT_PDF_PATH)
        
        if result:
            print("=" * 70)
            print("✅ TEI XML processing completed successfully!")
            print(f"📄 Output PDF: {result}")
            print(f"🗃️ TEI XML files: {TEI_OUTPUT_FOLDER}")
            print("")
            print("📋 PDF Structure per document:")
            print("1. ✨ Clean formatted metadata page with summary (no HTML tags)")
            print("2. 📄 Original document pages")
            print("")
            print("📁 Separate TEI XML files with complete markup saved to:")
            print(f"   {TEI_OUTPUT_FOLDER}")
            print("")
            print("🔗 Benefits of this approach:")
            print("   • Clean, readable PDF for human review")
            print("   • Complete TEI XML files for computational analysis")
            print("   • Best of both worlds: readability + machine processing")
            print("=" * 70)
        else:
            print("=" * 70)
            print("❌ TEI XML processing failed!")
            print("Check the logs for detailed error information.")
            print("=" * 70)
            
    except Exception as e:
        logger.error(f"Fatal error in main: {e}")
        print(f"❌ Fatal error: {e}")

print("🎯 Updated main processing function loaded!")
print("✨ Ready to create clean, formatted PDFs with separate TEI XML files!")

# Cell 17: Setup API Key and Run

In [None]:
# Setup API Key and Execute Processing
# Get your free API key from: https://openrouter.ai/

# Try to get API key from environment first
api_key = os.getenv('OPENROUTER_API_KEY')

if not api_key:
    print("🔑 OpenRouter API Key Setup")
    print("=" * 50)
    print("To use this tool, you need a free API key from OpenRouter.")
    print("🌐 Get your key at: https://openrouter.ai/")
    print("💡 OpenRouter provides access to multiple AI models including:")
    print("   - OpenAI GPT-4o (for OCR)")
    print("   - DeepSeek models (for TEI generation)")
    print("   - Claude models (for analysis)")
    print("")
    
    api_key = input("🔐 Enter your OpenRouter API key: ").strip()

if api_key and api_key.startswith('sk-or-v1-'):
    print("✅ API key configured!")
    
    # Update the global configuration
    OPENROUTER_API_KEY = api_key
    
    print("🚀 Starting TEI XML processing...")
    print("📁 Make sure your PDF files are in:", INPUT_FOLDER)
    
    # Create input directory if it doesn't exist
    if not os.path.exists(INPUT_FOLDER):
        os.makedirs(INPUT_FOLDER)
        print(f"📁 Created input directory: {INPUT_FOLDER}")
        print("📄 Please add your PDF files to this directory and run this cell again.")
    else:
        # Check if there are PDF files
        pdf_files = [f for f in os.listdir(INPUT_FOLDER) if f.lower().endswith('.pdf')]
        if pdf_files:
            print(f"📄 Found {len(pdf_files)} PDF files to process")
            
            # Run the main processing
            main()
        else:
            print(f"📁 No PDF files found in {INPUT_FOLDER}")
            print("📄 Please add PDF files to the input directory and run this cell again.")
            
elif api_key:
    print("❌ Invalid API key format. OpenRouter keys should start with 'sk-or-v1-'")
    print("🌐 Get a valid key from: https://openrouter.ai/")
else:
    print("❌ Need API key to continue. Get one from: https://openrouter.ai/")
    print("")
    print("💡 Usage Instructions:")
    print("1. Sign up at https://openrouter.ai/")
    print("2. Get your API key (starts with 'sk-or-v1-')")
    print("3. Run this cell and enter your key when prompted")
    print("4. Add PDF files to the input directory")
    print("5. The tool will process them into TEI XML format")
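
The key lookup in this cell (environment variable first, then a prompt, then a prefix check) can also be written as one small function, which makes it testable without typing a key. The sketch below is one possible arrangement, not the notebook's own code: `resolve_api_key`, its `env` and `ask` parameters, and the use of `getpass` (which hides the typed key instead of echoing it) are all assumptions made for illustration.

```python
import os
from getpass import getpass

KEY_PREFIX = "sk-or-v1-"  # prefix that the cell above checks for


def resolve_api_key(env=os.environ, ask=getpass):
    """Return a key from the environment or a prompt; raise if the format looks wrong."""
    key = (env.get("OPENROUTER_API_KEY") or "").strip()
    if not key:
        # Only prompt when the environment does not provide a key.
        key = ask("Enter your OpenRouter API key: ").strip()
    if not key.startswith(KEY_PREFIX):
        raise ValueError(f"OpenRouter keys are expected to start with {KEY_PREFIX!r}")
    return key


# Demonstration with a fake environment, so nothing is read from the keyboard.
print(resolve_api_key(env={"OPENROUTER_API_KEY": "sk-or-v1-example"}))
```

Passing `env` and `ask` as parameters keeps the function free of global state, so the same code path can be exercised in a notebook, a script, or a test.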