# Workspace Ingestion - PDF Parsing Experiments

This notebook explores PDF parsing with PyMuPDF (`pymupdf`) for the workspace creation phase.

## Goals
- Parse PDF into pages
- Extract text blocks with bounding boxes
- Extract tables
- Extract figures/images with bounding boxes
- Build section hierarchy

In [1]:
import pymupdf
from pathlib import Path
import json

## Load PDF

In [2]:
pdf_path = Path("../resources/chess.pdf")
doc = pymupdf.open(pdf_path)

print(f"Number of pages: {len(doc)}")
print(f"Metadata: {doc.metadata}")

Number of pages: 95
Metadata: {'format': 'PDF 1.4', 'title': 'gutenberg.org/cache/epub/33870/pg33870.txt', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36', 'producer': 'Skia/PDF m140', 'creationDate': "D:20251002094832+00'00'", 'modDate': "D:20251002094832+00'00'", 'trapped': '', 'encryption': None}


## Extract Text Blocks from First Page

Each text block carries a bounding box (x0, y0, x1, y1). Font info and text content live on the spans nested under the block's lines.
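
To make that nesting concrete, here is a minimal sketch that walks a block list shaped like the `"dict"` output and tallies span font sizes. The `iter_spans` helper and the `sample_blocks` data are hypothetical illustrations, not part of this notebook's pipeline; the keys used (`type`, `lines`, `spans`, `text`, `font`, `size`) are the ones PyMuPDF emits in dict output.

```python
from collections import Counter


def iter_spans(blocks):
    """Yield (text, font, size) for every span in dict-style text blocks."""
    for block in blocks:
        if block.get("type") != 0:  # type 0 = text, type 1 = image
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span["text"], span["font"], span["size"]


# Hypothetical sample shaped like page.get_text("dict")["blocks"]
sample_blocks = [
    {
        "type": 0,
        "bbox": (33.75, 37.45, 456.35, 138.8),
        "lines": [
            {"spans": [{"text": "CHAPTER I", "font": "Courier", "size": 14.0}]},
            {"spans": [{"text": "Some body text.", "font": "Courier", "size": 10.0}]},
        ],
    },
    {"type": 1, "bbox": (0, 0, 10, 10)},  # image block: no "lines" key
]

# Count how often each (rounded) font size occurs across all spans
size_counts = Counter(round(size, 1) for _, _, size in iter_spans(sample_blocks))
print(size_counts.most_common())
```

On a real page, passing `page.get_text("dict")["blocks"]` instead of `sample_blocks` should give a quick picture of which font sizes dominate.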

In [None]:
page = doc[0]
blocks = page.get_text("dict")["blocks"]

print(f"Total blocks on page 0: {len(blocks)}")
print("\nFirst 3 blocks:")
for i, block in enumerate(blocks[:3]):
    print(f"\nBlock {i}:")
    print(json.dumps(block, indent=2, default=str))

## Extract Text with Bounding Boxes

In [4]:
def extract_text_blocks(page):
    """Extract text blocks with bounding boxes from a page."""
    text_blocks = []
    blocks = page.get_text("dict")["blocks"]
    
    for block_idx, block in enumerate(blocks):
        if block["type"] == 0:  # text block
            bbox = block["bbox"]
            text = ""
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    text += span["text"]
                text += "\n"
            
            text_blocks.append({
                "block_id": block_idx,
                "bbox": bbox,
                "text": text.strip()
            })
    
    return text_blocks

# Test on first page
text_blocks = extract_text_blocks(page)
print(f"Extracted {len(text_blocks)} text blocks")
print("\nFirst text block:")
print(json.dumps(text_blocks[0], indent=2))

Extracted 22 text blocks

First text block:
{
  "block_id": 0,
  "bbox": [
    33.75,
    37.4498291015625,
    456.353759765625,
    138.7994384765625
  ],
  "text": "The Project Gutenberg eBook of Chess Fundamentals\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook."
}


## Extract Images/Figures

In [5]:
def extract_images(page, page_num):
    """Extract images with bounding boxes from a page."""
    images = []
    image_list = page.get_images()
    
    for img_idx, img in enumerate(image_list):
        xref = img[0]
        # Get image bounding box
        img_rects = page.get_image_rects(xref)
        
        for rect in img_rects:
            images.append({
                "image_id": f"page_{page_num}_img_{img_idx}",
                "xref": xref,
                "bbox": tuple(rect),
                "page": page_num
            })
    
    return images

# Test on first page
images = extract_images(page, 0)
print(f"Extracted {len(images)} images")
if images:
    print("\nFirst image:")
    print(json.dumps(images[0], indent=2, default=str))

Extracted 0 images


## Detect Tables

PyMuPDF has built-in table detection via `Page.find_tables()`.

In [6]:
def extract_tables(page, page_num):
    """Extract tables from a page."""
    tables = []
    table_finder = page.find_tables()
    
    for table_idx, table in enumerate(table_finder.tables):
        tables.append({
            "table_id": f"page_{page_num}_table_{table_idx}",
            "bbox": tuple(table.bbox),
            "rows": table.row_count,
            "cols": table.col_count,
            "data": table.extract(),
            "page": page_num
        })
    
    return tables

# Test on first page
tables = extract_tables(page, 0)
print(f"Extracted {len(tables)} tables")
if tables:
    print("\nFirst table:")
    print(json.dumps(tables[0], indent=2, default=str))

Extracted 0 tables


## Process All Pages

In [7]:
def parse_document(doc):
    """Parse entire document extracting all units."""
    workspace = {
        "doc_id": "chess_pdf",
        "num_pages": len(doc),
        "pages": []
    }
    
    for page_num, page in enumerate(doc):
        
        page_data = {
            "page_num": page_num,
            "text_blocks": extract_text_blocks(page),
            "images": extract_images(page, page_num),
            "tables": extract_tables(page, page_num)
        }
        
        workspace["pages"].append(page_data)
    
    return workspace

# Parse the full document
workspace = parse_document(doc)

# Summary stats
total_text_blocks = sum(len(p["text_blocks"]) for p in workspace["pages"])
total_images = sum(len(p["images"]) for p in workspace["pages"])
total_tables = sum(len(p["tables"]) for p in workspace["pages"])

print("Document parsed successfully!")
print(f"Total pages: {workspace['num_pages']}")
print(f"Total text blocks: {total_text_blocks}")
print(f"Total images: {total_images}")
print(f"Total tables: {total_tables}")

Document parsed successfully!
Total pages: 95
Total text blocks: 1943
Total images: 0
Total tables: 0


## Inspect Sample Page

In [8]:
# Look at page 0 in detail
sample_page = workspace["pages"][0]
print("Page 0 has:")
print(f"  - {len(sample_page['text_blocks'])} text blocks")
print(f"  - {len(sample_page['images'])} images")
print(f"  - {len(sample_page['tables'])} tables")
print("\nFirst text block content:")
print(sample_page["text_blocks"][0]["text"][:200] + "...")

Page 0 has:
  - 22 text blocks
  - 0 images
  - 0 tables

First text block content:
The Project Gutenberg eBook of Chess Fundamentals
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
wh...


## Next Steps

- Build section hierarchy from headings (font size analysis)
- Generate unit IDs and content hashes
- Store in database (SQLite)
- Create full-text search index
- Add metadata (image captions, table surrogates)
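
As a starting point for the unit ID, content hash, and SQLite items above, here is a rough sketch under stated assumptions. The ID scheme (`doc_id:p<page>:<kind>:<index>`), the `units` table layout, and the helpers `make_unit` and `store_units` are all hypothetical choices, not decisions this notebook has made. The sample data mimics the shape returned by `parse_document`, and an in-memory database stands in for a real file.

```python
import hashlib
import sqlite3


def make_unit(doc_id, page_num, kind, index, text):
    """Build a unit record with a positional ID and a SHA-256 content hash."""
    unit_id = f"{doc_id}:p{page_num}:{kind}:{index}"  # assumed ID scheme
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {
        "unit_id": unit_id,
        "page": page_num,
        "kind": kind,
        "content_hash": content_hash,
        "text": text,
    }


def store_units(conn, units):
    """Create the (assumed) units table if needed and upsert all units."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS units ("
        "unit_id TEXT PRIMARY KEY, page INTEGER, kind TEXT, "
        "content_hash TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO units VALUES "
        "(:unit_id, :page, :kind, :content_hash, :text)",
        units,
    )
    conn.commit()


# Hypothetical mini workspace in the same shape parse_document() returns
sample_workspace = {
    "doc_id": "chess_pdf",
    "pages": [
        {
            "page_num": 0,
            "text_blocks": [
                {"block_id": 0, "bbox": [33.75, 37.45, 456.35, 138.8],
                 "text": "The Project Gutenberg eBook of Chess Fundamentals"},
                {"block_id": 1, "bbox": [33.75, 150.0, 456.35, 170.0],
                 "text": "Title: Chess Fundamentals"},
            ],
        }
    ],
}

units = [
    make_unit(sample_workspace["doc_id"], p["page_num"], "text",
              b["block_id"], b["text"])
    for p in sample_workspace["pages"]
    for b in p["text_blocks"]
]

conn = sqlite3.connect(":memory:")  # swap for a file path to persist
store_units(conn, units)
stored = conn.execute("SELECT COUNT(*) FROM units").fetchone()[0]
print(f"Stored {stored} units")
```

Hashing the text rather than the position means a re-parse can detect which units changed even when their IDs stay the same. A full-text index could later be layered on top of the `text` column.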