# Text Pattern Analysis Tool for Documents
## A Computational Approach to Document Analysis

### Overview
This tool provides a systematic way to analyze patterns and terms across historical documents (PDF and DOCX formats), so that researchers can track the occurrence and context of specific concepts over time.

### Key Features
1. **Document Processing**
   - Handles both PDF and DOCX formats
   - Extracts text while maintaining page references
   - Processes documents with year-based naming convention (e.g., "1946-document-name.pdf")

2. **Pattern Search**
   - Customizable search patterns using regular expressions
   - Case-insensitive matching
   - Root word and variation detection
   - Contextual excerpt extraction

3. **Analysis & Visualization**
   - Chronological pattern distribution
   - Pattern frequency analysis
   - Page-specific references
   - Context preservation
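
The year-based naming convention listed above can be illustrated with a short sketch. The helper name `parse_year_and_title` is hypothetical and not part of the notebook; it assumes filenames of the form `YYYY-title.ext`.

```python
import re
from pathlib import Path

def parse_year_and_title(filename):
    """Split a name like '1946-document-name.pdf' into (year, title)."""
    stem = Path(filename).stem  # Drop the file extension
    match = re.match(r'(\d{4})-(.+)', stem)
    if match is None:
        return None  # Filename does not follow the year-based convention
    return match.group(1), match.group(2)

print(parse_year_and_title("1946-document-name.pdf"))
print(parse_year_and_title("notes.docx"))
```

Files that do not start with a four-digit year return `None`, which mirrors how the main processing loop skips such files with a warning.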

In [None]:
# Cell 1 - Installation and Imports (Detailed):
# Install required packages for document processing and data analysis
!pip install PyMuPDF python-docx matplotlib pandas
import os
import re
import fitz  # PyMuPDF
from docx import Document
import matplotlib.pyplot as plt
import pandas as pd
from collections import defaultdict
import html
from IPython.display import display, HTML
# Verify imports
for module in [os, re, fitz, Document, plt, pd, defaultdict]:
    print(f"Successfully imported {module.__name__}")

# Search Pattern Configuration (Cell 2A)

## Overview
This cell allows you to define search patterns for text analysis in PDF and DOCX documents. You can search for single words, word variations, or multiple terms simultaneously.

## Pattern Structure Explained

### Basic Components
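- `r'...'` : Raw string, so backslashes reach the regex engine unchanged
- `(?i)` : Case-insensitive flag
- `\b` : Word boundary (start of the word)
- `\w*` : Zero or more word characters (letters, digits, underscore)
- `\b` : Word boundary (end of the word)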
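
As a quick check of how these components interact, the following sketch runs one pattern against an invented sample sentence. The sentence is made up for illustration and does not come from the corpus.

```python
import re

pattern = r'(?i)\bmysti\w*\b'
sample = "Mystik, mystische Erfahrung und Mystizismus; aber nicht Optimystik."

# findall returns every non-overlapping match as a string
matches = re.findall(pattern, sample)
print(matches)
```

The leading `\b` is what keeps the invented word "Optimystik" out of the results: the root appears inside it, but not at the start of a word.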

### Single Pattern Example
```python
patterns = [r'(?i)\bmysti\w*\b']  # Searches for: mystic, mystical, mysticism, etc.

patterns = [r'(?i)\bkirche\b']  # Matches exactly "kirche" (case insensitive)

patterns = [r'(?i)\bkirch\w*\b']  # Matches: kirche, kirchlich, kirchlichen, etc.
```



# Bible Verse Pattern Matching Guide

## Overview
The regex pattern for Bible verses needs to be customized to the specific abbreviations and citation styles found in your OCR documents. Different historical periods, denominations, and languages may use varying citation formats.
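
One way to see whether a pattern fits your sources is to run it against a handful of citations copied from them. The sketch below uses shortened, hypothetical book lists and invented sample citations; it is a checking aid under those assumptions, not the notebook's final pattern.

```python
import re

# Hypothetical, shortened book lists; extend them to match your sources
single_books = r'(?:Ps|Joh|Matt|Röm)'
numbered_books = r'(?:Mos|Kor)'

# \u2014 is the long dash sometimes used in verse ranges
bible_re = re.compile(
    rf'(?i)(?:{single_books}\.?\s|[1-5]\s*{numbered_books}\.)'
    r'\s*\d+(?:\s*,\s*\d+(?:\s*[-\u2014]\s*\d+)?)?'
)

# Invented sample citations in different styles
samples = ["Joh. 3, 16", "1 Kor. 13, 4-7", "Röm 8, 28", "Römer 8, 28"]
coverage = {s: bible_re.fullmatch(s) is not None for s in samples}

for sample, ok in coverage.items():
    print(f"{sample!r}: {'matched' if ok else 'NOT matched'}")
```

The unabbreviated form "Römer 8, 28" is not matched, which shows the kind of gap that has to be closed by adding abbreviations or spellings to the book lists.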

## Basic Pattern Template
```python
# Separate book lists for easy modification
single_books = r'(?:Ps|Joh|Matt|Luk|Mark|Röm|Kor|Tim|Petr|Thess|Offb|Hes|Jes|Jer|Spr|Pred|Hld|Kol|Phil|Gal|Eph)'
numbered_books = r'(?:Mos|Joh|Kor|Tim|Petr|Thess)'

# Customizable Bible reference pattern, assembled from the two lists above
bible_pattern = (
    rf'(?i)(?:{single_books}\.?\s|[1-5]\s*{numbered_books}\.)'
    r'\s*\d+(?:\s*,\s*\d+(?:\s*[-—]\s*\d+)?)?'
)
```


In [None]:
# Cell 2 A - Configuration Setup (Detailed):
# Define your search patterns here
# Format: [r'(?i)\bword\w*\b']
# (?i) - case insensitive
# \b - word boundary
# \w* - followed by any word characters

# Example: [r'(?i)\bmysti\w*\b'] will match: mystical, mysticism, etc.

# The regex pattern for Bible verses should be adapted to the abbreviations and citation styles in your documents.
# Example of a basic pattern for Bible verses: [r'(?i)(?:(?:Ps|Joh|Matt|Luk|Mark|Röm|Kor|Tim|Petr|Thess|Offb|Hes|Jes|Jer|Spr|Pred|Hld|Kol|Phil|Gal|Eph)\.|[1-5]\s*(?:Mos|Joh|Kor|Tim|Petr|Thess)\.)\s*\d+(?:\s*,\s*\d+(?:\s*[-—]\s*\d+)?)?']


# single search terms

# patterns = [r'(?i)\bmysti\w*\b']  # <-- Modify this line to change search terms

# multiple search terms
patterns = [
    r'(?i)\bmysti\w*\b',    # First search term
    r'(?i)\bleid\w*\b',     # Second search term
    r'(?i)\brechtfert\w*\b',  # Third search term
    # For Bible verses, uncomment the following line
    #r'(?i)(?:(?:Ps|Joh|Matt|Luk|Mark|Apostelgesch|Röm|Rö|Kor|Tim|Petr|Thess|Offb|Hes|Jes|Jer|Spr|Pred|Hld|Kol|Phil|Gal|Eph)\.?|[1-5]\s*(?:Mos|Joh|Kor|Tim|Petr|Thess)\.?)\s*?\d+(?:\s*,\s*\d+(?:\s*[-—]\s*\d+)?)?',
]


# Verify pattern setup
print("Current search patterns:")
for i, pattern in enumerate(patterns, 1):
    print(f"Pattern {i}: {pattern}")


In [None]:
# Cell 2 B - Configuration Setup (Detailed):
# Define the working directory where files are located
directory = './data'
print(f"Working directory: {directory}")

# Initialize counting dictionaries
pattern_counts_by_year_title_page = {pattern: {} for pattern in patterns}
total_pattern_counts = {pattern: 0 for pattern in patterns}

# Define page offsets for specific documents
# Used to skip front matter, covers, etc.
file_start_pages = {
    '1946-jahrbuch-des-lutherbundes_searchable.pdf': 5,  # Skip first 5 pages
    '1947-jahrbuch-des-lutherbundes_searchable.pdf': 4,  # Skip first 4 pages
    '1948-jahrbuch-des-lutherbundes_searchable.pdf': 5   # Skip first 5 pages
}

# Verify setup
print("\nInitialization complete:")
print(f"Number of patterns: {len(patterns)}")
print(f"Number of files with custom page offsets: {len(file_start_pages)}")


In [None]:
# Cell 3 - PDF Processing Functions (Detailed):

def extract_text_from_pdf(pdf_path):
    """
    Extract text and page numbers from PDF files.
    
    Args:
        pdf_path (str): Path to PDF file
        
    Returns:
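        tuple: (pdf_text, page_numbers)
            - pdf_text: Dict mapping adjusted page numbers (counted after the
              skipped front matter) to text content
            - page_numbers: Dict mapping each adjusted page number to the page
              number reported for it (currently the same value)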
    """
    pdf_document = fitz.open(pdf_path)
    pdf_text = {}
    page_numbers = {}
    
    filename = os.path.basename(pdf_path)
    skip_pages = file_start_pages.get(filename, 0)
    
    for page_num in range(pdf_document.page_count):
        page = pdf_document.load_page(page_num)
        text = page.get_text().strip()
        
        if text and page_num >= skip_pages:
            adjusted_page_num = page_num - skip_pages + 1
            pdf_text[adjusted_page_num] = text
            page_numbers[adjusted_page_num] = adjusted_page_num
    
    pdf_document.close()
    return pdf_text, page_numbers

def print_pdf_contents(directory, max_initial_items=5):
    """
    Display contents of PDFs in a scrollable element with expandable sections.
    
    Args:
        directory (str): Path to directory containing PDFs
        max_initial_items (int): Number of items to show initially
    """
    # CSS for styling
    css = """
    <style>
        .pdf-container {
            max-height: 500px;
            overflow-y: auto;
            border: 1px solid #ccc;
            padding: 10px;
            margin: 10px 0;
            font-family: monospace;
        }
        .file-section {
            margin-bottom: 20px;
            border-bottom: 1px solid #eee;
        }
        .file-header {
            font-weight: bold;
            color: #2c3e50;
            margin: 10px 0;
        }
        .page-preview {
            margin-left: 20px;
            color: #34495e;
        }
        .show-more {
            color: blue;
            cursor: pointer;
            text-decoration: underline;
        }
        .hidden {
            display: none;
        }
    </style>
    """
    
    # JavaScript for show more functionality
    javascript = """
    <script>
        function toggleContent(fileId) {
            var content = document.getElementById(fileId);
            var button = document.getElementById('btn-' + fileId);
            if (content.classList.contains('hidden')) {
                content.classList.remove('hidden');
                button.innerHTML = 'Show less';
            } else {
                content.classList.add('hidden');
                button.innerHTML = 'Show more';
            }
        }
    </script>
    """
    
    html_content = [css, javascript, '<div class="pdf-container">']
    
    for idx, filename in enumerate(os.listdir(directory)):
        if filename.endswith('.pdf'):
            try:
                pdf_path = os.path.join(directory, filename)
                file_id = f'file-{idx}'
                
                html_content.append(f'<div class="file-section">')
                html_content.append(f'<div class="file-header">Processing: {html.escape(filename)}</div>')
                html_content.append(f'Starting page offset: {file_start_pages.get(filename, 0)}')
                html_content.append('<hr>')
                
                pdf_text, page_numbers = extract_text_from_pdf(pdf_path)
                
                # Show initial items
                for i, page_num in enumerate(sorted(pdf_text.keys())[:max_initial_items]):
                    text = pdf_text[page_num]
                    html_content.append(
                        f'<div class="page-preview">Page {page_numbers[page_num]}: '
                        f'{html.escape(text[:50])}...</div>'
                    )
                
                # Add remaining items in hidden div if there are more pages
                if len(pdf_text) > max_initial_items:
                    html_content.append(
                        f'<div id="{file_id}" class="hidden">'
                    )
                    for page_num in sorted(pdf_text.keys())[max_initial_items:]:
                        text = pdf_text[page_num]
                        html_content.append(
                            f'<div class="page-preview">Page {page_numbers[page_num]}: '
                            f'{html.escape(text[:50])}...</div>'
                        )
                    html_content.append('</div>')
                    html_content.append(
                        f'<p><a class="show-more" id="btn-{file_id}" '
                        f'onclick="toggleContent(\'{file_id}\')">Show more</a></p>'
                    )
                
            except Exception as e:
                html_content.append(f'<div class="error">Error processing {html.escape(filename)}: {str(e)}</div>')
    
    html_content.append('</div>')
    
    # Display the HTML
    display(HTML(''.join(html_content)))

# Execute PDF processing with scrollable output
print_pdf_contents(directory)


In [None]:
# Cell 4 - Advanced Page Number Detection (Detailed):
def extract_potential_page_numbers(page, filename):
    """
    Extract page numbers with customized settings for each file.
    
    Args:
        page (fitz.Page): PDF page object
        filename (str): Name of the PDF file
    
    Returns:
        list: List of detected page numbers
    """
    # Default settings for page number detection

    """
    Dimension Reference:
    - Width (100 points) ≈ 3.5 cm
    - Height (50 points) ≈ 1.8 cm

    These dimensions create search boxes in each corner that are:
    - 3.5 cm wide
    - 1.8 cm high

    You can adjust these values based on your needs:
    - For larger search areas: increase the values
    - For smaller search areas: decrease the values

    Example adjustments:
    - For 5cm width: use 142 points (5 cm × 72/2.54)
    - For 2cm height: use 57 points (2 cm × 72/2.54)
    """
    default_settings = {
        'HIGHT': 50,  # Height of search area (approximately 1.8 cm)
        'WIDTH': 100,  # Width of search area (approximately 3.5 cm)
        'patterns': [r'\b\d{1,3}\b']  # Match 1-3 digit numbers
    }
    
    # Custom settings for specific files
    file_settings = {
        '1948-jahrbuch-des-lutherbundes_searchable.pdf': {
            'HIGHT': 50,
            'WIDTH': 100,
            'patterns': [r'\b\d{1,3}\b'],
            'regions': ['top_left', 'bottom_right']  # Page regions to search for page numbers
        }
    }
    
    # Get appropriate settings for current file
    settings = file_settings.get(filename, default_settings)
    
    # Define search regions on page
    all_regions = {
        'top_left': (0, 0, settings['WIDTH'], settings['HIGHT']),
        'top_center': (page.rect.width/2 - settings['WIDTH']/2, 0, 
                      page.rect.width/2 + settings['WIDTH']/2, settings['HIGHT']),
        'top_right': (page.rect.width - settings['WIDTH'], 0, 
                     page.rect.width, settings['HIGHT']),
        # Bottom regions
        'bottom_left': (0, page.rect.height - settings['HIGHT'],
                       settings['WIDTH'], page.rect.height),
        'bottom_center': (page.rect.width/2 - settings['WIDTH']/2, 
                         page.rect.height - settings['HIGHT'],
                         page.rect.width/2 + settings['WIDTH']/2, 
                         page.rect.height),
        'bottom_right': (page.rect.width - settings['WIDTH'], 
                        page.rect.height - settings['HIGHT'],
                        page.rect.width, 
                        page.rect.height),
    }

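    # Extract and validate numbers, limited to the configured regions if given
    numbers = []
    for region_name in settings.get('regions', all_regions):
        coords = all_regions.get(region_name)
        if coords is None:  # Skip region names that are not defined above
            continue
        text_block = page.get_text("text", clip=coords)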
        for pattern in settings['patterns']:
            found = re.findall(pattern, text_block)
            for match in found:
                try:
                    num = int(match)
                    if 0 < num < 1000:  # Reasonable page number range
                        numbers.append(num)
                except ValueError:
                    continue
    
    return list(set(numbers))  # Return unique numbers only

def validate_page_numbers(page_numbers, total_pages):
    """
    Validate and correct page numbers.
    
    Args:
        page_numbers (dict): Detected page numbers
        total_pages (int): Total pages in document
    
    Returns:
        dict: Corrected page numbers
    """
    corrected_numbers = {}
    
    for page, numbers in sorted(page_numbers.items()):
        # Default to physical page number if no numbers found
        if not numbers:
            corrected_numbers[page] = page
            continue
            
        # Validate detected numbers (ints from the detector, or digit strings)
        for num in numbers:
            if isinstance(num, str):
                if not num.isdigit():
                    continue
                num = int(num)
            if 0 < num <= total_pages + 50:
                corrected_numbers[page] = num
                break
        
        # Use first available number if no valid number found
        if page not in corrected_numbers:
            corrected_numbers[page] = numbers[0]
    
    return corrected_numbers
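
The dimension reference in Cell 4 converts between centimetres and PDF points at 72 points per inch. A minimal sketch of that arithmetic follows; the helper names `cm_to_points` and `points_to_cm` are hypothetical and not defined elsewhere in the notebook.

```python
POINTS_PER_INCH = 72
CM_PER_INCH = 2.54

def cm_to_points(cm):
    """Convert centimetres to PDF points, rounded to the nearest point."""
    return round(cm * POINTS_PER_INCH / CM_PER_INCH)

def points_to_cm(points):
    """Convert PDF points to centimetres, rounded to one decimal place."""
    return round(points * CM_PER_INCH / POINTS_PER_INCH, 1)

print(cm_to_points(5), cm_to_points(2))       # Search box sizes from the examples
print(points_to_cm(100), points_to_cm(50))    # Default width and height in cm
```

Values computed this way can be used for the width and height entries of the per-file settings when the page numbers sit further from the page corners.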


In [None]:
# Cell 5 - Page Detection Implementation (Detailed):

def print_detected_pages(pdf_directory, max_initial_items=3):
    """
    Display detected page numbers in a scrollable element with expandable sections.
    
    Args:
        pdf_directory (str): Path to directory containing PDFs
        max_initial_items (int): Number of items to show initially per file
    """
    css = """
    <style>
        .detection-container {
            max-height: 500px;
            overflow-y: auto;
            border: 1px solid #ccc;
            padding: 15px;
            margin: 10px 0;
            font-family: monospace;
            background-color: #f8f9fa;
        }
        .file-section {
            margin-bottom: 20px;
            border-bottom: 1px solid #eee;
        }
        .file-header {
            color: #2c3e50;
            font-weight: bold;
            margin: 10px 0;
            padding: 5px;
            background-color: #e9ecef;
            border-radius: 3px;
        }
        .page-info {
            margin-left: 20px;
            color: #34495e;
            padding: 2px 0;
        }
        .error-message {
            color: #dc3545;
            margin-left: 20px;
        }
        .summary {
            margin-top: 15px;
            padding: 10px;
            background-color: #e9ecef;
            border-radius: 3px;
        }
        .show-more {
            color: #007bff;
            cursor: pointer;
            text-decoration: underline;
            margin-left: 20px;
            font-size: 0.9em;
        }
        .hidden {
            display: none;
        }
    </style>
    """
    
    javascript = """
    <script>
        function toggleContent(fileId) {
            var content = document.getElementById(fileId);
            var button = document.getElementById('btn-' + fileId);
            if (content.classList.contains('hidden')) {
                content.classList.remove('hidden');
                button.innerHTML = 'Show less';
            } else {
                content.classList.add('hidden');
                button.innerHTML = 'Show more';
            }
        }
    </script>
    """
    
    html_content = [css, javascript, '<div class="detection-container">']
    html_content.append('<h3>Starting page detection process...</h3>')
    
    processed_files = 0
    total_pages_processed = 0
    
    for file_idx, filename in enumerate(os.listdir(pdf_directory)):
        if filename.endswith(".pdf"):
            try:
                doc = fitz.open(os.path.join(pdf_directory, filename))
                file_id = f'file-{file_idx}'
                
                # Add file section
                html_content.append('<div class="file-section">')
                html_content.append(f'<div class="file-header">Processing: {html.escape(filename)}</div>')
                html_content.append(f'<div class="page-info">Document has {len(doc)} pages</div>')
                
                # Process pages
                page_numbers = {}
                
                # Show initial items
                for page_num in range(min(max_initial_items, len(doc))):
                    page = doc.load_page(page_num)
                    detected_numbers = extract_potential_page_numbers(page, filename)
                    page_numbers[page_num] = detected_numbers
                    html_content.append(
                        f'<div class="page-info">Page {page_num + 1}: '
                        f'Detected numbers: {html.escape(str(detected_numbers))}</div>'
                    )
                    total_pages_processed += 1
                
                # Add remaining items in hidden div
                if len(doc) > max_initial_items:
                    html_content.append(f'<div id="{file_id}" class="hidden">')
                    for page_num in range(max_initial_items, len(doc)):
                        page = doc.load_page(page_num)
                        detected_numbers = extract_potential_page_numbers(page, filename)
                        page_numbers[page_num] = detected_numbers
                        html_content.append(
                            f'<div class="page-info">Page {page_num + 1}: '
                            f'Detected numbers: {html.escape(str(detected_numbers))}</div>'
                        )
                        total_pages_processed += 1
                    html_content.append('</div>')
                    html_content.append(
                        f'<div><a class="show-more" id="btn-{file_id}" '
                        f'onclick="toggleContent(\'{file_id}\')">Show more</a></div>'
                    )
                
                doc.close()
                processed_files += 1
                html_content.append('</div>')  # Close file-section
                
            except Exception as e:
                html_content.append(
                    f'<div class="error-message">Error processing {html.escape(filename)}: {str(e)}</div>'
                )
                continue
    
    # Add summary section
    html_content.append(
        f'''
        <div class="summary">
            <strong>Processing Summary:</strong><br>
            Files processed: {processed_files}<br>
            Total pages processed: {total_pages_processed}
        </div>
        '''
    )
    
    html_content.append('</div>')
    
    # Display the HTML
    display(HTML(''.join(html_content)))

def run_page_detection(pdf_directory):
    """
    Run page detection with directory validation.
    """
    if not os.path.exists(pdf_directory):
        display(HTML(
            '<div style="color: red; padding: 10px; border: 1px solid red;">'
            f'Error: Directory not found: {html.escape(pdf_directory)}'
            '</div>'
        ))
        return
    
    print_detected_pages(pdf_directory)

# Set directory and run
pdf_directory = "./data"  # Replace with your actual directory path
run_page_detection(pdf_directory)


# Adjusting Context Size in Pattern Matching

## Overview
In the pattern matching function, you can control how much surrounding text (context) is displayed around each found pattern by adjusting the `context_size` parameter.
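
The effect of `context_size` can be seen on a short invented sentence. The helper name `context_window` is hypothetical; it applies the same slicing as the pattern matching function, with the window clamped to the bounds of the text.

```python
import re

def context_window(text, match, context_size):
    """Return the match plus up to context_size characters on each side."""
    start = max(match.start() - context_size, 0)
    end = min(match.end() + context_size, len(text))
    return text[start:end]

text = "Die Mystik des Mittelalters war vielfältig."
match = re.search(r'(?i)\bmysti\w*\b', text)

small = context_window(text, match, 4)    # Four characters on each side
large = context_window(text, match, 300)  # Window exceeds the text length
print(small)
print(large)
```

With a window larger than the text, the whole text is returned, so short pages never raise an error. Smaller values give tighter excerpts, larger values give more surrounding sentences.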

## Current Setting
```python
context_size = 300  # Characters before and after match
```


In [None]:
# Cell 6 - Text Extraction and Pattern Matching (Detailed):
def extract_text_from_docx(docx_path):
    """
    Extract text from Word documents.
    
    Args:
        docx_path (str): Path to DOCX file
    
    Returns:
        str: Combined text from all paragraphs, or an empty string
            if the document cannot be read (the error is printed)
    """
    try:
        doc = Document(docx_path)
        full_text = []
        for para in doc.paragraphs:
            full_text.append(para.text)
        return ' '.join(full_text)
    except Exception as e:
        print(f"Error reading DOCX file: {e}")
        return ""

def find_patterns(text, patterns, page_number):
    """
    Find patterns in text with context.
    
    Args:
        text (str): Text to search in
        patterns (list): List of regex patterns
        page_number (int): Current page number
    
    Returns:
        list: Tuples of (match, context, page_number)
    """
    matches = []
    context_size = 300  # Characters before and after match
    
    for pattern in patterns:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            # Extract context around match
            start = max(match.start() - context_size, 0)
            end = min(match.end() + context_size, len(text))
            context = text[start:end]
            
            matches.append((match.group(), context, page_number))
    
    return matches

# Initialize storage
content_by_year = defaultdict(str)
print("Text extraction functions initialized")


In [None]:
# Cell 7 - Main Processing Loop (Detailed):
# Process all files in directory
print("Starting file processing...")

for filename in os.listdir(directory):
    if filename.endswith(('.pdf', '.docx')):
        try:
            file_path = os.path.join(directory, filename)
            print(f"\nProcessing: {filename}")
            
            # Extract year and title from filename
            year_match = re.match(r'\d{4}', filename)
            if not year_match:
                print(f"Warning: No year found in filename {filename}")
                continue
                
            year = year_match.group()
            # Derive the title by stripping the year prefix and file extensions
            title = filename.replace(f"{year}-", "").replace("_searchable.pdf", "").replace(".pdf", "").replace(".docx", "")
            
            # Initialize year in content_by_year if not exists
            if year not in content_by_year:
                content_by_year[year] = ""
            
            # Extract text based on file type
            if filename.endswith('.pdf'):
                pdf_text, page_numbers = extract_text_from_pdf(file_path)
                text = ' '.join(pdf_text.values())
            else:  # DOCX file
                text = extract_text_from_docx(file_path)
            
            # Process patterns
            all_matches = []
            if filename.endswith('.pdf'):
                for page_num, page_text in pdf_text.items():
                    actual_page_num = page_numbers.get(page_num, page_num)
                    matches = find_patterns(page_text, patterns, actual_page_num)
                    all_matches.extend(matches)
            else:
                matches = find_patterns(text, patterns, None)
                all_matches.extend(matches)
            
            # Store results with document header
            content_by_year[year] += (
                f"\n{'=' * 80}\n"
                f"Document: {filename}\n"
                f"Year: {year}\n"
                f"\n{'=' * 80}\n"
            )
            
            for match, context, page_num in all_matches:
                content_by_year[year] += (
                    f"\nTitle: {title}\n"
                    f"Year: {year}\n"
                    f"Pattern Found: {match}\n"
                    f"\nContext:\n"
                    f"{context}\n"
                    f"Page Number: {page_num}\n"
                    f"{'-' * 80}\n"
                )
                
            # Update pattern counts
            for pattern in patterns:
                for match, _, page_num in all_matches:
                    if re.match(pattern, match, re.IGNORECASE):
                        # Initialize nested dictionaries if they don't exist
                        if pattern not in pattern_counts_by_year_title_page:
                            pattern_counts_by_year_title_page[pattern] = {}
                        if year not in pattern_counts_by_year_title_page[pattern]:
                            pattern_counts_by_year_title_page[pattern][year] = {}
                        if title not in pattern_counts_by_year_title_page[pattern][year]:
                            pattern_counts_by_year_title_page[pattern][year][title] = {
                                'pages': [], 'count': 0
                            }
                        if page_num:
                            pattern_counts_by_year_title_page[pattern][year][title]['pages'].append(page_num)
                        pattern_counts_by_year_title_page[pattern][year][title]['count'] += 1
                        total_pattern_counts[pattern] += 1
                        
        except Exception as e:
            print(f"Error processing {filename}: {str(e)}")
            import traceback
            print(traceback.format_exc())
            continue

print("\nFile processing completed")
print("\nPattern counts summary:")
for pattern, counts in pattern_counts_by_year_title_page.items():
    print(f"\nPattern: {pattern}")
    for year, year_data in counts.items():
        for title, data in year_data.items():
            print(f"Year: {year}, Title: {title}, Count: {data['count']}, Pages: {data['pages']}")


In [None]:
# Cell 8 - Output Generation:
# Save the combined output text as a single file, ordered by year
output_file_path = os.path.join(directory, 'combined_output_test.txt')
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    for year in sorted(content_by_year.keys()):
        output_file.write(f"\n\n--- Year: {year} ---\n\n")
        output_file.write(content_by_year[year])
    
    # Append total counts of each pattern to the file
    output_file.write("\nTotal counts of each pattern across all files:\n")
    output_file.write("=" * 80 + "\n")
    for pattern, count in total_pattern_counts.items():
        output_file.write(f"{pattern}: {count}\n")
    output_file.write("=" * 80 + "\n")

print(f"Output saved to {output_file_path}")

In [None]:
# Cell 9 - Results Display:
# Create highlighted HTML output

# Create the output content with highlighting
output_content = ""
result_count = 0
max_results = 10

# Add content to the string, ordered by year
for year in sorted(content_by_year.keys()):
    if result_count >= max_results:
        break
        
    output_content += f"\n\n--- Year: {year} ---\n\n"
    year_content = content_by_year[year]
    
    # Split content into individual results
    sections = year_content.split("=" * 80)
    for section in sections:
        if result_count >= max_results:
            break
            
        if section.strip():  # if section is not empty
            output_content += section + "=" * 80 + "\n"
            result_count += 1

# Add total counts of each pattern
output_content += "\nTotal counts of each pattern across all files:\n"
output_content += "=" * 80 + "\n"
for pattern, count in total_pattern_counts.items():
    output_content += f"{pattern}: {count}\n"
output_content += "=" * 80 + "\n"

# Add note about truncated results
if result_count >= max_results:
    output_content = f"Showing first {max_results} results out of total matches.\n\n" + output_content

# Highlight patterns and labels
def highlight_patterns_and_labels(text, patterns):
    highlighted_text = text
    
    # Highlight patterns - using more contrast-friendly colors
    for pattern in patterns:
        matches = re.finditer(pattern, highlighted_text, re.IGNORECASE)
        matches = list(matches)
        for match in reversed(matches):
            start, end = match.span()
            match_text = highlighted_text[start:end]
            highlighted_text = (
                highlighted_text[:start] + 
                f'<span style="color: #FFB6C1; font-weight: bold; background-color: #4A312B; padding: 0 2px;">{match_text}</span>' + 
                highlighted_text[end:]
            )
    
    # Highlight labels with higher contrast colors
    label_styles = {
        "Pattern Found:": "color: #ADD8E6; font-weight: bold; background-color: #2F4F4F; padding: 2px 5px; border-radius: 3px;",
        "Context:": "color: #90EE90; font-weight: bold; background-color: #2F4F2F; padding: 2px 5px; border-radius: 3px;",
        "Page Number:": "color: #DDA0DD; font-weight: bold; background-color: #4B0082; padding: 2px 5px; border-radius: 3px;"
    }
    
    for label, style in label_styles.items():
        highlighted_text = highlighted_text.replace(
            label,
            f'<span style="{style}">{label}</span>'
        )
    
    return highlighted_text

# Apply highlighting
highlighted_content = highlight_patterns_and_labels(output_content, patterns)

# Format for display with improved styling
formatted_output = highlighted_content.replace('\n', '<br>')
formatted_output = f'''
<div style="font-family: 'Courier New', monospace; 
            background-color: #1E1E1E; 
            border-radius: 8px; 
            margin: 10px 0;">
    <div style="color: #D3D3D3; 
                padding: 10px; 
                border-bottom: 1px solid #404040;">
        Showing first {max_results} results. Scroll to view content.
    </div>
    <div style="max-height: 500px; 
                overflow-y: auto; 
                background-color: #1E1E1E; 
                border: 1px solid #404040; 
                border-radius: 0 0 8px 8px;">
        <pre style="white-space: pre-wrap; 
                    font-family: 'Courier New', monospace; 
                    line-height: 1.5; 
                    color: #D3D3D3; 
                    padding: 15px; 
                    margin: 0;
                    overflow-x: auto;">
            {formatted_output}
        </pre>
    </div>
</div>
'''

# Add CSS to ensure consistent rendering
css_styles = '''
<style>
    .jupyter-notebook .output_html {
        background-color: transparent !important;
    }
    .output_area pre {
        background-color: #1E1E1E !important;
    }
    .output_scroll {
        box-shadow: none !important;
    }
</style>
'''

# Display in Jupyter notebook with consistent styling
display(HTML(css_styles + formatted_output))

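# Save the truncated preview to its own file so the full output from Cell 8 is not overwritten
# (the preview filename is an assumption; choose any name that suits your workflow)
output_file_path = os.path.join(directory, 'combined_output_preview.txt')
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    # output_content is still plain text here; HTML is only added for display
    output_file.write(output_content)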



In [None]:
# Cell 10 - Data Analysis:
try:
    # Convert the counts to a DataFrame for easier plotting while including page numbers
    df_list = []
    for pattern, year_data in pattern_counts_by_year_title_page.items():
        for year, title_data in year_data.items():
            for title, page_info in title_data.items():
                df_list.append({
                    'Pattern': pattern, 
                    'Year': year, 
                    'Title': title, 
                    'Page': page_info['pages'], 
                    'Count': page_info['count']
                })

    if not df_list:
        print("No matches found in any documents.")
    else:
        df = pd.DataFrame(df_list)
        
        # Display DataFrame with better formatting
        pd.set_option('display.max_columns', None)
        pd.set_option('display.expand_frame_repr', False)
        pd.set_option('display.max_colwidth', None)
        
        print("\nDetailed Pattern Analysis DataFrame:")
        print("=====================================")
        print(df.to_string())
        print("\n")

        # Aggregate counts by year, title, pattern with page information
        df_agg = df.groupby(['Pattern', 'Year', 'Title']).agg({
            # Flatten page entries (lists or single values), de-duplicate, and sort for stable output
            'Page': lambda x: sorted(set(item for sublist in x for item in (sublist if isinstance(sublist, list) else [sublist]))),
            'Count': 'sum'
        }).reset_index()

        # Calculate total matches per title
        total_matches_per_title = df_agg.groupby(['Year', 'Title'])['Count'].sum().reset_index(name='TotalMatches')

        # Merge the total matches back into df_agg
        df_agg = df_agg.merge(total_matches_per_title, on=['Year', 'Title'])

        # Create the summary column
        df_agg['Year-Title-Page-Count'] = df_agg.apply(
            lambda x: f"{x['Year']}: {x['Title']} (Total: {x['TotalMatches']}) Pages: {', '.join(map(str, x['Page']))}", 
            axis=1
        )

        # Final display
        print("\nAggregated Results:")
        print("===================")
        print(df_agg[['Pattern', 'Year-Title-Page-Count', 'Count']])

except Exception as e:
    print(f"Error in data analysis: {str(e)}")
    import traceback
    print(traceback.format_exc())
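
The aggregation in Cell 10 merges page references and sums counts for each (Pattern, Year, Title) group. The sketch below shows the same idea in plain Python on made-up rows, so the logic can be checked without pandas; the `rows` data and the `aggregate` function are illustrative assumptions, not the notebook's own code.

```python
from collections import defaultdict

# Hypothetical rows in the shape Cell 10 builds before creating the DataFrame
rows = [
    {'Pattern': 'mysti', 'Year': '1946', 'Title': 'doc-a', 'Page': [3, 1], 'Count': 2},
    {'Pattern': 'mysti', 'Year': '1946', 'Title': 'doc-a', 'Page': [3, 7], 'Count': 2},
    {'Pattern': 'ritual', 'Year': '1950', 'Title': 'doc-b', 'Page': 4, 'Count': 1},
]

def aggregate(rows):
    """Merge page lists and sum counts per (Pattern, Year, Title) key."""
    pages = defaultdict(set)
    counts = defaultdict(int)
    for row in rows:
        key = (row['Pattern'], row['Year'], row['Title'])
        page = row['Page']
        # A page entry may be a list or a single value; treat both uniformly
        pages[key].update(page if isinstance(page, list) else [page])
        counts[key] += row['Count']
    # Sort pages so the summary reads in page order
    return {key: (sorted(pages[key]), counts[key]) for key in counts}

for key, (page_list, count) in aggregate(rows).items():
    print(key, page_list, count)
```

Using a set per group removes duplicate page numbers, and sorting afterwards keeps the page list in reading order.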


In [None]:
# Cell 11 - Visualization:
# Aggregate data by year first
yearly_data = df_agg.copy()
yearly_data['Year'] = yearly_data['Year'].astype(str)  # Use the existing Year column as string labels
yearly_totals = yearly_data.groupby(['Year', 'Pattern'])['Count'].sum().reset_index()

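# Create the stacked bar plot on one explicit figure so figsize takes effect
# (calling plt.figure() and then DataFrame.plot() without ax= leaves an empty extra figure)
fig, ax = plt.subplots(figsize=(14, 8))
df_pivot = yearly_totals.pivot(index='Year', columns='Pattern', values='Count').fillna(0)
df_pivot.plot(kind='bar', stacked=True, ax=ax)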

# Customize the plot
plt.title('Pattern Occurrences by Year', fontsize=14, pad=20)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Total Occurrences', fontsize=12)
plt.xticks(rotation=45)

# Enhance the legend
plt.legend(title='Search Patterns', 
          bbox_to_anchor=(1.05, 1), 
          loc='upper left',
          borderaxespad=0.)

# Add total counts as text
total_counts_text = "Total counts across all years:\n"
total_counts_text += "\n".join([f"{pattern}: {count}" 
                               for pattern, count in total_pattern_counts.items()])

# Position the text box
plt.text(1.05, 0.5, total_counts_text,
         bbox=dict(facecolor='white', alpha=0.8, edgecolor='gray'),
         transform=ax.transAxes,
         verticalalignment='center')

# Add value labels on the bars
for c in ax.containers:
    # Add labels only for non-zero values
    labels = [f'{v:.0f}' if v > 0 else '' for v in c.datavalues]
    ax.bar_label(c, labels=labels, label_type='center')

# Adjust layout
plt.tight_layout()
plt.subplots_adjust(right=0.85)

# Save and display the plot
stacked_bar_path = os.path.join(directory, 'yearly_pattern_distribution.png')
plt.savefig(stacked_bar_path, dpi=300, bbox_inches='tight')
plt.show()

# Create and save detailed Excel report
# (requires the openpyxl package, which is not in Cell 1's install list)
excel_path = os.path.join(directory, 'pattern_analysis_report.xlsx')
with pd.ExcelWriter(excel_path, engine='openpyxl') as writer:
    # Save yearly totals
    yearly_totals_wide = yearly_totals.pivot(index='Year', 
                                           columns='Pattern', 
                                           values='Count').fillna(0)
    yearly_totals_wide.to_excel(writer, sheet_name='Yearly Totals')
    
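    # Save detailed analysis; page lists are joined into strings because
    # openpyxl cannot write Python lists to a cell
    df_export = df.copy()
    df_export['Page'] = df_export['Page'].apply(
        lambda p: ', '.join(map(str, p)) if isinstance(p, (list, tuple, set)) else p
    )
    df_export.to_excel(writer, sheet_name='Detailed Analysis', index=False)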
    
    # Format the Excel file
    workbook = writer.book
    for sheet_name in workbook.sheetnames:
        worksheet = workbook[sheet_name]
        for column in worksheet.columns:
            cells = list(column)
            # Column width is the longest cell text plus a small margin; empty cells are ignored
            max_length = max(
                (len(str(cell.value)) for cell in cells if cell.value is not None),
                default=0
            )
            worksheet.column_dimensions[cells[0].column_letter].width = max_length + 2

print(f"\nAnalysis has been saved to: {excel_path}")

# Display summary statistics
print("\nYearly Pattern Distribution:")
print(yearly_totals_wide)
