# Library Install

The install step uses the conda package manager to install three packages: pdfplumber, pymupdf, and pillow. The first two work with PDF files in Python; the third handles images.

1) The first package, pdfplumber, extracts content from PDFs: text, words, and tables, together with their positions on the page. It is an extraction library and does not modify PDF files.

2) The second package, pymupdf (imported as `fitz`), provides a lower-level interface to PDF files. It can render pages to images, extract data from specific parts of a page, and change the structure of a document, for example by adding or removing pages.

3) The third package, pillow (imported as `PIL`), reads and writes image files and performs operations on them such as resizing or cropping.

Together, these packages cover PDF text extraction, page rendering, and image handling in Python, and they can be combined to automate data-extraction tasks.
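
Two of the install names differ from the import names (`pymupdf` is imported as `fitz`, `pillow` as `PIL`), which is a common source of confusion after installing. A minimal sketch for checking which of the three can be found, using only the standard library; the helper name `check_packages` is hypothetical and not part of this notebook:

```python
from importlib.util import find_spec

# Install name -> import name for the three packages used here.
IMPORT_NAMES = {
    "pdfplumber": "pdfplumber",
    "pymupdf": "fitz",
    "pillow": "PIL",
}

def check_packages(names=IMPORT_NAMES):
    """Return {install_name: bool} telling whether each module can be located."""
    return {pkg: find_spec(mod) is not None for pkg, mod in names.items()}

if __name__ == "__main__":
    for pkg, ok in check_packages().items():
        print(f"{pkg:<12} {'found' if ok else 'MISSING'}")
```

This only checks that the modules can be located; it does not import them or report versions.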


# PDF Ingestion Playground (Spiral 0 — Component A)
_Generated 2025-10-06 02:09_

This notebook helps you prototype PDF ingestion:
1) Load a PDF and gather metadata
2) Render page images at a chosen DPI (via **PyMuPDF / `fitz`**)
3) Extract text/layout spans (via **pdfplumber**)

It also includes inline **assertion tests** to mimic unit tests before moving to `pytest`.

> Tip: Place one or more real PDFs in a `data/` folder next to this notebook, e.g., `data/sample.pdf`.

## 0) Environment check
Quick imports + version print. If something fails, ensure your `eqscribe` environment is active and includes `pymupdf`, `pdfplumber`, and `pillow`.

In [None]:

import sys, platform
print(f'Python: {sys.version.split()[0]}  |  Platform: {platform.platform()}')

try:
    import fitz  # PyMuPDF
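    # __doc__ may be None or empty; fall back instead of raising inside the import check.
    doc_lines = (getattr(fitz, '__doc__', None) or '').strip().splitlines()
    print('PyMuPDF (fitz):', doc_lines[0] if doc_lines else 'version unknown')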
except Exception as e:
    print('PyMuPDF import error:', e)

try:
    import pdfplumber
    print('pdfplumber:', getattr(pdfplumber, "__version__", "unknown"))
except Exception as e:
    print('pdfplumber import error:', e)

try:
    from PIL import Image
    print('Pillow (PIL): OK')
except Exception as e:
    print('Pillow import error:', e)


## 1) Utilities & Data Classes
We define a small `PdfDoc` class and utility functions for loading, rendering, and layout extraction.
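- **Coordinate notes:** Native PDF coordinates are in points (1/72 inch) with the origin at the bottom-left of the page; image pixels have their origin at the top-left. pdfplumber reports `top` and `bottom` as distances from the top of the page, so those values need a flip (`y = page_height - top`) to match the bottom-left convention. The helpers below convert between points and pixels.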
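
The point-to-pixel conversion can be checked by hand before any PDF is loaded. The sketch below is a standalone illustration, not the notebook's own helper: `make_transforms` is a hypothetical name, and the page is assumed to be US Letter (612 x 792 pt) rendered at 300 dpi.

```python
PT_PER_INCH = 72.0

def make_transforms(page_height_pt, dpi):
    """Build converters between bottom-left PDF points and top-left image pixels."""
    scale = dpi / PT_PER_INCH  # pixels per point

    def pdf_to_px(x_pt, y_pt):
        # Flip y because PDF y grows upward while pixel y grows downward.
        return round(x_pt * scale), round((page_height_pt - y_pt) * scale)

    def px_to_pdf(x_px, y_px):
        return x_px / scale, page_height_pt - y_px / scale

    return pdf_to_px, px_to_pdf

# Assumed page: US Letter, 792 pt tall, rendered at 300 dpi.
pdf_to_px, px_to_pdf = make_transforms(page_height_pt=792.0, dpi=300)

# A point one inch from the left edge and one inch below the top edge.
print(pdf_to_px(72.0, 720.0))  # (300, 300)
print(px_to_pdf(300, 300))     # back to roughly (72.0, 720.0), up to float rounding
```

Rounding to whole pixels means the round trip is only exact to within half a pixel in general.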

In [None]:

from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import List, Tuple, Dict, Any, Optional

import fitz  # PyMuPDF
import pdfplumber
from PIL import Image

PT_PER_INCH = 72.0

@dataclass
class PdfDoc:
    path: Path
    num_pages: int
    dpi: int = 300

def load_pdf(path: str | Path, dpi: int = 300) -> PdfDoc:
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f'PDF not found: {path}')
    with pdfplumber.open(str(path)) as pdf:
        num_pages = len(pdf.pages)
        if num_pages == 0:
            raise ValueError('PDF has zero pages')
    return PdfDoc(path=path, num_pages=num_pages, dpi=dpi)

def page_size_points(doc: PdfDoc, i: int) -> Tuple[float, float]:
    with fitz.open(str(doc.path)) as pdf:
        page = pdf[i]
        rect = page.rect  # in points
        return float(rect.width), float(rect.height)

def page_image(doc: PdfDoc, i: int, dpi: Optional[int]=None) -> Image.Image:
    assert 0 <= i < doc.num_pages, 'page index out of range'
    dpi = dpi or doc.dpi
    scale = dpi / PT_PER_INCH
    with fitz.open(str(doc.path)) as pdf:
        page = pdf[i]
        mat = fitz.Matrix(scale, scale)
        pix = page.get_pixmap(matrix=mat, alpha=False)
        img = Image.frombytes('RGB', [pix.width, pix.height], pix.samples)
        return img

def pdf_to_px_transform(doc: PdfDoc, i: int, dpi: Optional[int]=None):
    dpi = dpi or doc.dpi
    w_pt, h_pt = page_size_points(doc, i)
    sx = dpi / PT_PER_INCH
    sy = dpi / PT_PER_INCH
    def pdf_to_px(x_pt: float, y_pt: float) -> Tuple[int,int]:
        x_px = int(round(x_pt * sx))
        y_px = int(round((h_pt - y_pt) * sy))
        return x_px, y_px
    def px_to_pdf(x_px: int, y_px: int) -> Tuple[float,float]:
        x_pt = x_px / sx
        y_pt = h_pt - (y_px / sy)
        return x_pt, y_pt
    return pdf_to_px, px_to_pdf

def page_layout(doc: PdfDoc, i: int) -> List[Dict[str, Any]]:
    assert 0 <= i < doc.num_pages, 'page index out of range'
    spans: List[Dict[str, Any]] = []
    with pdfplumber.open(str(doc.path)) as pdf:
        page = pdf.pages[i]
        try:
            words = page.extract_words()
        except Exception:
            words = []
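        # pdfplumber measures 'top'/'bottom' downward from the top of the page;
        # flip them so bbox_pdf is (x0, y0, x1, y1) with a bottom-left origin.
        page_h = float(page.height)
        for w in words:
            spans.append({
                'text': w.get('text', ''),
                'bbox_pdf': (float(w['x0']), page_h - float(w['bottom']),
                             float(w['x1']), page_h - float(w['top'])),
                'page_index': i,
            })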
    return spans
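
pdfplumber's `extract_words()` reports `top` and `bottom` measured downward from the top of the page, while the coordinate note above uses a bottom-left origin. The relationship between the two conventions can be sketched on a hand-written word dict. The helper name `word_bbox_bottom_left` is hypothetical and the word values are made up for illustration:

```python
def word_bbox_bottom_left(word, page_height_pt):
    """Convert a pdfplumber-style word dict to (x0, y0, x1, y1) with y measured upward."""
    x0, x1 = float(word["x0"]), float(word["x1"])
    y0 = page_height_pt - float(word["bottom"])  # lower edge of the word
    y1 = page_height_pt - float(word["top"])     # upper edge of the word
    return x0, y0, x1, y1

# Stand-in for one item of page.extract_words(); the numbers are invented.
word = {"text": "Hello", "x0": 72.0, "x1": 110.0, "top": 72.0, "bottom": 84.0}
print(word_bbox_bottom_left(word, page_height_pt=792.0))  # (72.0, 708.0, 110.0, 720.0)
```

After the flip, `y0 < y1` always holds, which matches the usual `(x0, y0, x1, y1)` reading of a bounding box in PDF space.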


## 2) Configure a test PDF path

In [None]:

from pathlib import Path
PDF_PATH = Path('data/sample.pdf')  # <-- change this to your real file
print('PDF_PATH =', PDF_PATH.resolve())
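
If `data/sample.pdf` does not exist yet, it can help to list whatever PDFs are in the folder before picking one. A minimal sketch, assuming the `data/` layout from the tip above; `find_pdfs` is a hypothetical helper name:

```python
from pathlib import Path

def find_pdfs(folder="data"):
    """Return PDF files in `folder`, sorted by path; empty list if the folder is missing."""
    root = Path(folder)
    if not root.is_dir():
        return []
    # Match the extension case-insensitively so 'scan.PDF' is included too.
    return sorted(p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".pdf")

if __name__ == "__main__":
    pdfs = find_pdfs()
    print(f"{len(pdfs)} PDF(s) found:", [p.name for p in pdfs])
```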


## 3) Inline tests: load, count pages, render first page

In [None]:

doc = load_pdf(PDF_PATH, dpi=300)
print('Loaded PDF:', doc.path.name, '| pages =', doc.num_pages, '| dpi =', doc.dpi)
assert doc.num_pages > 0, 'Expected at least 1 page'

img0 = page_image(doc, 0)
display(img0)
w, h = img0.size
print('Page 0 image size:', (w, h))
assert w > 500 and h > 700, 'Rendered image seems too small for 300 dpi'
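
The size threshold in the assertion is deliberately loose: at 300 dpi, 500 px corresponds to 120 pt (about 1.67 in) and 700 px to 168 pt (about 2.33 in). For a tighter expectation, the pixel size can be estimated from the page size in points. This is a sketch assuming a US Letter page (612 x 792 pt); `expected_pixel_size` is a hypothetical helper, and the renderer's own rounding may differ by a pixel.

```python
PT_PER_INCH = 72.0

def expected_pixel_size(width_pt, height_pt, dpi):
    """Approximate rendered size in pixels for a page of the given size in points."""
    scale = dpi / PT_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)

# Assumed page size: US Letter, 612 x 792 pt.
for dpi in (150, 200, 300):
    print(dpi, expected_pixel_size(612.0, 792.0, dpi))
# 150 (1275, 1650)
# 200 (1700, 2200)
# 300 (2550, 3300)
```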


## 4) Layout extraction on page 0

In [None]:

spans0 = page_layout(doc, 0)
print('spans on page 0:', len(spans0))
print(spans0[:10])
assert isinstance(spans0, list), 'Expected list of spans'


## 5) Coordinate transforms sanity check

In [None]:

from PIL import ImageDraw

if spans0:
    def span_width(s): 
        x0, y0, x1, y1 = s['bbox_pdf']
        return x1 - x0
    span = max(spans0[:50], key=span_width)
    x0, y0, x1, y1 = span['bbox_pdf']

    pdf2px, _ = pdf_to_px_transform(doc, 0)
    x0p, y0p = pdf2px(x0, y0)
    x1p, y1p = pdf2px(x1, y1)

    img = img0.copy()
    draw = ImageDraw.Draw(img)
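    # ImageDraw.rectangle expects [(left, top), (right, bottom)], so order the corners.
    left, right = sorted((x0p, x1p))
    top, bottom = sorted((y0p, y1p))
    draw.rectangle([(left, top), (right, bottom)], outline='red', width=3)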
    display(img)

    assert 0 <= x0p < img.width and 0 <= x1p <= img.width
    assert 0 <= y1p < img.height and 0 <= y0p <= img.height
else:
    print('No spans found (likely a scanned PDF) — transform demo skipped.')


## 6) Rendering multiple pages (quick visual pass)

In [None]:

N = min(3, doc.num_pages)
thumbs = []
for i in range(N):
    img = page_image(doc, i)
    thumbs.append(img.resize((img.width//3, img.height//3)))
display(*thumbs)
print(f'Rendered {N} pages at {doc.dpi} dpi.')


## 7) (Optional) Save rendered pages to disk

In [None]:

from pathlib import Path
out_dir = Path('outputs/pages')
out_dir.mkdir(parents=True, exist_ok=True)

for i in range(min(5, doc.num_pages)):
    img = page_image(doc, i)
    out_path = out_dir / f'page_{i+1:03d}.png'
    img.save(out_path)
print('Saved up to 5 pages into:', out_dir.resolve())


## 8) Next steps
- Try different DPIs (150, 200, 300) and compare performance vs. clarity.
- Try a scanned PDF to observe `page_layout` returning an empty list (expected).
- Once comfortable, we’ll lift these functions into `src/pdf_ingest.py` and write `pytest` tests.
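
For the DPI comparison, a small timing harness keeps the measurements consistent. The sketch below is one possible approach, not part of the notebook: `time_render` and `fake_render` are hypothetical names, and `fake_render` is only a stand-in workload so the block runs without a PDF.

```python
import time

def time_render(render, dpis=(150, 200, 300), repeats=3):
    """Return {dpi: best wall-clock seconds over `repeats` calls of render(dpi)}."""
    results = {}
    for dpi in dpis:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            render(dpi)
            best = min(best, time.perf_counter() - start)
        results[dpi] = best
    return results

def fake_render(dpi):
    # Stand-in workload whose size grows with dpi**2, as the pixel count does.
    return bytearray((dpi * dpi) // 10)

if __name__ == "__main__":
    for dpi, seconds in time_render(fake_render).items():
        print(f"{dpi} dpi: {seconds * 1000:.3f} ms")
```

In the notebook, the stand-in could be replaced with something like `lambda dpi: page_image(doc, 0, dpi=dpi)`. Taking the best of several runs reduces noise from one-off delays such as disk caching.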