# Main Pipeline Hub - PDF Processing & Experimentation

This notebook serves as the central experimentation and orchestration hub for the Chicago historical documents processing pipeline.

## Workflow Overview
1. **Load and Inspect PDFs** - Verify text extraction works correctly
2. **Chunk Text** - Split documents into manageable pieces (~500 words)
3. **Summarization & Testing** - Test Ollama prompts and settings
4. **Prototype Pipeline** - Test full workflow: PDF → Chunk → Summarize → Save
5. **Documentation** - Document approach, assumptions, and special handling

## Notes
- Some PDFs may have unusual formatting - always verify extraction
- Test different prompts and chunk sizes before committing to scripts
- Use this notebook to debug issues interactively


In [None]:
# Setup: Import libraries and configure paths
import sys
import json
from pathlib import Path
from IPython.display import display, Markdown

# Add Chicago directory to Python path
# Assumes the notebook is run from the project root (the folder containing Chicago/)
project_root = Path.cwd()
chicago_dir = project_root / "Chicago"
sys.path.insert(0, str(chicago_dir))

print(f"✓ Project root: {project_root}")
print(f"✓ Chicago directory: {chicago_dir}")
print(f"✓ Data/Raw directory: {chicago_dir / 'Data' / 'Raw'}")


In [None]:
# Import pipeline functions
from pdf_pipeline import (
    extract_pdf_text,
    chunk_text,
    summarize_with_ollama,
    process_pdf,
    ask_question_ollama
)

print("✓ All pipeline functions imported")


## Step 1: Load and Inspect PDFs

First, let's see what PDFs are available and inspect their contents to verify text extraction works correctly.
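
Beyond eyeballing the first few hundred characters, a few cheap heuristics can flag a bad extraction. The sketch below is illustrative only: `extraction_report` is a hypothetical helper (not part of `pdf_pipeline`), and the 100-character and 0.6 alphabetic-ratio thresholds are assumptions to tune against your own PDFs.

```python
def extraction_report(text: str, min_chars: int = 100, min_alpha_ratio: float = 0.6) -> dict:
    """Return simple heuristics describing the quality of extracted text.

    Hypothetical helper: both thresholds are assumptions, not tuned values.
    """
    total = len(text)
    if total == 0:
        return {"chars": 0, "words": 0, "alpha_ratio": 0.0, "suspicious": True}

    # Share of non-whitespace characters that are letters; scanned or
    # garbled PDFs often produce mostly symbols and digits.
    non_space = sum(not ch.isspace() for ch in text)
    alpha = sum(ch.isalpha() for ch in text)
    alpha_ratio = alpha / non_space if non_space else 0.0

    return {
        "chars": total,
        "words": len(text.split()),
        "alpha_ratio": round(alpha_ratio, 3),
        "suspicious": total < min_chars or alpha_ratio < min_alpha_ratio,
    }
```

Calling `extraction_report(raw_text)` after extraction gives a quick signal for whether a document may need OCR or a different extraction method.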


In [None]:
# List available PDFs in Data/Raw/
raw_dir = chicago_dir / "Data" / "Raw"
pdf_files = list(raw_dir.glob("*.pdf"))

print(f"Found {len(pdf_files)} PDF file(s):\n")
for i, pdf in enumerate(pdf_files, 1):
    size_mb = pdf.stat().st_size / (1024 * 1024)
    print(f"{i}. {pdf.name} ({size_mb:.2f} MB)")

# Store the first PDF for processing (or select one)
if pdf_files:
    selected_pdf = pdf_files[0]
    print(f"\n✓ Selected PDF: {selected_pdf.name}")
else:
    print("\n⚠ No PDFs found. Please add PDFs to Chicago/Data/Raw/")
    selected_pdf = None


In [None]:
# Extract and inspect text from selected PDF
if selected_pdf:
    print(f"Extracting text from: {selected_pdf.name}\n")
    raw_text = extract_pdf_text(selected_pdf)
    
    print(f"✓ Extracted {len(raw_text)} characters")
    print(f"✓ Extracted {len(raw_text.split())} words")
    # splitlines() avoids a backslash inside the f-string expression
    print(f"✓ Extracted {len(raw_text.splitlines())} lines")
    
    # Show first 500 characters to verify extraction quality
    print("\n" + "="*60)
    print("FIRST 500 CHARACTERS (to verify extraction):")
    print("="*60)
    print(raw_text[:500])
    print("...")
    
    # Check for potential issues
    if len(raw_text) < 100:
        print("\n⚠ WARNING: Very little text extracted. PDF may have:")
        print("  - Scanned images (needs OCR)")
        print("  - Unusual formatting")
        print("  - Protected/encrypted content")
else:
    raw_text = None
    print("No PDF selected")


## Step 2: Chunk Text

Split the document into manageable chunks (~500 words each) for easier processing and summarization.
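
For reference, word-based chunking can be as simple as the sketch below. This is not the implementation of `chunk_text` in `pdf_pipeline`, which is not shown here; `chunk_by_words` is a hypothetical stand-in that only illustrates the idea of fixed-size word windows.

```python
def chunk_by_words(text: str, max_words: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most max_words words.

    Hypothetical illustration; whitespace is normalized to single spaces.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

With this approach only the last chunk can be shorter than `max_words`, and chunk boundaries ignore sentence and paragraph breaks.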


In [None]:
# Test chunking with different sizes
if raw_text:
    # Test with default 500 words
    chunks_500 = chunk_text(raw_text, max_tokens=500)
    print(f"✓ Created {len(chunks_500)} chunks (~500 words each)")
    if chunks_500:  # guard against division by zero on whitespace-only text
        avg_words = sum(len(c.split()) for c in chunks_500) / len(chunks_500)
        print(f"  Average chunk size: {avg_words:.1f} words")
    
    # Show first chunk as example
    if chunks_500:
        print("\n" + "="*60)
        print("FIRST CHUNK PREVIEW:")
        print("="*60)
        print(f"Words: {len(chunks_500[0].split())}")
        print(f"Characters: {len(chunks_500[0])}")
        print(f"\nContent (first 300 chars):\n{chunks_500[0][:300]}...")
    
    # Store chunks for next steps
    chunks = chunks_500
else:
    chunks = []
    print("No text to chunk")


## Step 3: Summarization & Testing

Test Ollama summarization on a sample chunk. This allows you to:
- Test different prompts
- Adjust settings
- Verify output quality
- Debug issues before processing all chunks
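
To experiment with prompts and settings outside `summarize_with_ollama`, one option is to call Ollama's local REST endpoint directly. The sketch below assumes Ollama is running on its default port (11434) and that the `llama3.1:8b` model is pulled; `build_request` and `generate` are hypothetical helpers, and whether `summarize_with_ollama` uses this endpoint internally is not known from this notebook.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(chunk: str, model: str = "llama3.1:8b",
                  bullets: int = 3, temperature: float = 0.2) -> dict:
    """Build a non-streaming /api/generate payload for one chunk."""
    prompt = (
        f"Summarize the following historical text in {bullets} concise bullet points.\n\n"
        f"{chunk}"
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"temperature": temperature},
    }


def generate(payload: dict, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """POST the payload to Ollama and return the generated text."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

Keeping payload construction separate from the network call makes it easy to compare prompt variants, for example `generate(build_request(test_chunk, bullets=2))` against `generate(build_request(test_chunk, bullets=4))`.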


In [None]:
# Test summarization on first chunk (if available)
if chunks:
    print("Testing Ollama summarization on first chunk...\n")
    test_chunk = chunks[0]
    
    print("="*60)
    print("ORIGINAL CHUNK (first 200 words):")
    print("="*60)
    words = test_chunk.split()[:200]
    print(" ".join(words) + "...")
    
    print("\n" + "="*60)
    print("CALLING OLLAMA FOR SUMMARIZATION...")
    print("="*60)
    
    # This may take a moment
    summary = summarize_with_ollama(test_chunk)
    
    print("\n" + "="*60)
    print("GENERATED SUMMARY:")
    print("="*60)
    print(summary)
    
    print("\n✓ Summarization test complete")
else:
    print("No chunks available for testing")


## Step 4: Prototype Full Pipeline

Run the complete workflow: PDF → Extract → Chunk → Summarize → Save JSON

**Note:** This processes all chunks and may take some time depending on:
- Number of chunks
- Ollama response time
- PDF size
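
Before committing to a full run, it can help to time a few chunks and extrapolate. The sketch below is a rough estimate only: `estimate_runtime` is a hypothetical helper, and it assumes per-chunk latency is roughly constant, which may not hold if chunk lengths vary widely.

```python
import time
from typing import Callable


def estimate_runtime(chunks: list[str], summarize: Callable[[str], str],
                     sample_size: int = 3) -> float:
    """Time the first few chunks and extrapolate to all chunks (in seconds)."""
    sample = chunks[:sample_size]
    if not sample:
        return 0.0

    start = time.perf_counter()
    for chunk in sample:
        summarize(chunk)
    per_chunk = (time.perf_counter() - start) / len(sample)

    # Assumption: remaining chunks take about as long as the sampled ones.
    return per_chunk * len(chunks)
```

For example, `estimate_runtime(chunks, summarize_with_ollama)` returns an approximate total in seconds, at the cost of summarizing the sampled chunks once.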


In [None]:
# Run full pipeline on selected PDF
if selected_pdf:
    print(f"Running full pipeline on: {selected_pdf.name}\n")
    print("This will:")
    print("  1. Extract text from PDF")
    print("  2. Chunk into ~500 word pieces")
    print("  3. Summarize each chunk with Ollama")
    print("  4. Save to Data/processed/ as JSON\n")
    
    # Uncomment to run full pipeline:
    # enhanced_chunks = process_pdf(selected_pdf, save_chunks=True)
    # print(f"\n✓ Pipeline complete! Processed {len(enhanced_chunks)} chunks")
    
    print("⚠ Uncomment the code above to run the full pipeline")
else:
    print("No PDF selected")


## Step 4b: Process Full PDF (After Testing)

Once you've verified the test works, you can process the full PDF or multiple PDFs.


In [None]:
# Process full PDF with engineering_pipeline (uncomment after testing)
# from engineering_pipeline import main

# Process single PDF fully (all chunks)
# results = main(pdf_path=selected_pdf, append=True, max_chunks_override=None)

# Or process multiple PDFs
# pdf_files = list((chicago_dir / "Data" / "Raw").glob("*.pdf"))
# results = main(pdf_path=pdf_files, append=True)

# print(f"\n✓ Processed {len(results)} total chunks")


## Step 5: Load and Inspect Processed Chunks

Load previously processed chunks from JSON files to inspect results or use for querying.
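
Since the cells below index fields such as `id`, `chunk_position`, `summary`, `text`, and `pdf_path`, a quick schema check can catch incomplete records before they raise a `KeyError`. The sketch below is a hypothetical helper; the required key set is inferred from the fields this notebook reads, not from a documented schema.

```python
# Assumption: these are the fields every processed chunk should carry,
# inferred from the keys accessed elsewhere in this notebook.
REQUIRED_KEYS = {"id", "chunk_position", "summary", "text", "pdf_path"}


def find_invalid_chunks(chunks: list[dict]) -> list[tuple[int, list[str]]]:
    """Return (index, missing_keys) for every chunk lacking a required key."""
    problems = []
    for index, chunk in enumerate(chunks):
        missing = sorted(REQUIRED_KEYS - set(chunk))
        if missing:
            problems.append((index, missing))
    return problems
```

An empty result from `find_invalid_chunks(loaded_chunks)` means every record has the expected fields.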


In [None]:
# Process PDFs incrementally using engineering_pipeline
from engineering_pipeline import main, list_available_pdfs

# List all available PDFs
all_pdfs = list_available_pdfs()
print(f"Available PDFs: {len(all_pdfs)}")
for i, pdf in enumerate(all_pdfs, 1):
    size_mb = pdf.stat().st_size / (1024 * 1024)
    print(f"  {i}. {pdf.name} ({size_mb:.2f} MB)")

# Process specific PDFs (uncomment and modify as needed)
# Example: Process first PDF only
# if all_pdfs:
#     results = main(pdf_path=all_pdfs[0], append=True, max_chunks_override=5)  # Test mode
#     # results = main(pdf_path=all_pdfs[0], append=True)  # Full processing

# Example: Process multiple PDFs
# selected = [all_pdfs[0], all_pdfs[1]]  # Select which ones
# results = main(pdf_path=selected, append=True)

print("\n💡 Tip: Use engineering_pipeline.py interactively for easier selection")
print("   Run: cd Chicago && python engineering_pipeline.py")


In [None]:
# Load processed chunks from Data/processed/
processed_dir = chicago_dir / "Data" / "processed"
json_files = list(processed_dir.glob("*_chunks.json"))

print(f"Found {len(json_files)} processed chunk file(s):\n")
for i, json_file in enumerate(json_files, 1):
    with open(json_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(f"{i}. {json_file.name}")
    print(f"   - {len(data)} chunks")
    print(f"   - Source: {data[0]['pdf_path'] if data else 'N/A'}\n")

# Load the first file as example
if json_files:
    with open(json_files[0], 'r', encoding='utf-8') as f:
        loaded_chunks = json.load(f)
    
    print(f"âœ“ Loaded {len(loaded_chunks)} chunks from {json_files[0].name}")
    
    # Show sample chunk
    if loaded_chunks:
        sample = loaded_chunks[0]
        print("\n" + "="*60)
        print("SAMPLE PROCESSED CHUNK:")
        print("="*60)
        print(f"ID: {sample['id']}")
        print(f"Position: {sample['chunk_position']}")
        print(f"\nSummary:\n{sample['summary']}")
        print(f"\nText preview (first 200 chars):\n{sample['text'][:200]}...")
else:
    loaded_chunks = []
    print("No processed chunks found. Run the pipeline first.")


## Step 6: Test Query/Retrieval

Test the retrieval system by asking questions about the processed chunks.
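
As a baseline to compare against, a plain keyword scorer over the `summary` and `text` fields is easy to reason about. The sketch below is not how `ask_question_ollama` works (its internals are not shown here); `score_chunk` and `top_matches` are hypothetical helpers that count case-insensitive substring hits.

```python
def score_chunk(query: str, chunk: dict) -> int:
    """Count occurrences of each query term in the chunk's summary and text."""
    terms = query.lower().split()
    haystack = f"{chunk.get('summary', '')} {chunk.get('text', '')}".lower()
    # Substring counting: "fire" also matches "fireman" (a known limitation).
    return sum(haystack.count(term) for term in terms)


def top_matches(query: str, chunks: list[dict], k: int = 3) -> list[dict]:
    """Return up to k chunks with a non-zero score, best first."""
    scored = [(score_chunk(query, chunk), chunk) for chunk in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```

If a query such as `"fire"` returns nothing from `top_matches(query, loaded_chunks)`, the term likely does not appear in the processed text at all, which helps separate retrieval problems from extraction problems.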


In [None]:
# Test query functionality
if 'loaded_chunks' in locals() and loaded_chunks:
    # Test queries
    test_queries = [
        "mayor chicago",
        "architecture",
        "history",
        "fire"
    ]
    
    print("Testing retrieval with sample queries:\n")
    for query in test_queries:
        print(f"\n{'='*60}")
        print(f"QUERY: '{query}'")
        print('='*60)
        result = ask_question_ollama(query, loaded_chunks)
        if result:
            print("\n✓ Found match")
        else:
            print("\n✗ No match found")
else:
    print("No chunks loaded. Process a PDF first or load existing chunks.")


## Documentation & Notes

Use this section to document:
- Special handling for specific PDFs
- Issues encountered and solutions
- Optimal settings discovered
- Assumptions and approach

### Known Issues & Solutions

**Issue:** Some PDFs have unusual formatting
- **Solution:** Verify extraction in Step 1, adjust extraction method if needed

**Issue:** Ollama may be slow for large documents
- **Solution:** Process in batches, save progress frequently
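
One way to save progress is to write a checkpoint file every few chunks and resume from it on the next run. The sketch below is a minimal illustration, not part of the existing pipeline: `summarize_with_checkpoint`, the checkpoint path, and the record layout are all assumptions, and it presumes the chunk list is identical between runs.

```python
import json
from pathlib import Path
from typing import Callable


def summarize_with_checkpoint(chunks: list[str], summarize: Callable[[str], str],
                              checkpoint_path: str, save_every: int = 5) -> list[dict]:
    """Summarize chunks in order, resuming from a JSON checkpoint if present."""
    path = Path(checkpoint_path)
    # Resume: records already on disk are assumed to match chunks[0:len(done)].
    done = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []

    for index in range(len(done), len(chunks)):
        done.append({
            "chunk_position": index,
            "text": chunks[index],
            "summary": summarize(chunks[index]),
        })
        # Write every save_every records, and always after the last chunk.
        if len(done) % save_every == 0 or index == len(chunks) - 1:
            path.write_text(
                json.dumps(done, ensure_ascii=False, indent=2), encoding="utf-8"
            )
    return done
```

If a run is interrupted, at most `save_every - 1` summaries are lost, and rerunning the same call picks up where the checkpoint left off.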

**Issue:** Chunks may split sentences awkwardly
- **Solution:** Consider sentence-aware chunking for better summaries
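
A sentence-aware chunker might look like the sketch below. It is a hypothetical alternative to `chunk_text`, not a drop-in replacement: `chunk_by_sentences` uses a naive regex that splits after `.`, `!`, or `?` followed by whitespace, so abbreviations such as "Mr." will be treated as sentence ends.

```python
import re


def chunk_by_sentences(text: str, max_words: int = 500) -> list[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Chunks produced this way vary in length but never cut a sentence in half, which may give the summarizer cleaner input.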

### Optimal Settings

- **Chunk size:** 500 words works well for most documents
- **Ollama model:** llama3.1:8b provides good balance of speed and quality
- **Summary format:** 2-4 bullet points keeps summaries concise

### Next Steps

Once confident with the workflow:
1. Refactor tested code into `engineering_pipeline.py`
2. Update `pdf_pipeline.py` with any improvements
3. Enhance `query_chunks.py` and `retrieval_v2.py` based on testing
