# TOC Analysis Explorer

Interactive walkthrough of the TOC analysis pipeline (`src/pdf_parser/`).  
Run each cell to see how a PDF's structure is extracted, classified, and split into sections.

The notebook follows the same flow as `_get_toc()` → `resolve_content_pages()` → `resolve_content_sections()` in `resolve_content.py`.

See [docs/toc_analysis.md](../docs/toc_analysis.md) for the full reference.

In [None]:
# Setup — add src/ to path so we can import pdf_parser
import sys
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parent / "src"))


In [None]:
# Choose your PDF — uncomment one line or set your own path
INPUTS = Path.cwd().parent / "inputs"

## Books
# PDF_PATH = str(INPUTS / "little-book-deep-learning.pdf")                      # #15 — Rich 3-level TOC, front matter, 185 pages
PDF_PATH = str(INPUTS / "2019BurkovTheHundred-pageMachineLearning.pdf")     # #4  — Sparse TOC (~3% coverage), falls back to infer_toc
# PDF_PATH = str(INPUTS / "Book-all-in-one.pdf")                              # #6  — Math Foundation of RL, deep TOC, 450 pages

## Research papers — embedded TOC
# PDF_PATH = str(INPUTS / "attention-is-all-you-need.pdf")                    # #7  — Transformer, 15 pages, 67% coverage
# PDF_PATH = str(INPUTS / "gpt3-few-shot-learners.pdf")                      # #9  — GPT-3, 75 pages, 84% coverage

## Research papers — no embedded TOC (pure infer_toc)
# PDF_PATH = str(INPUTS / "alexnet.pdf")                                      # #13 — AlexNet, 9 pages, zero bookmarks
# PDF_PATH = str(INPUTS / "adam-optimizer.pdf")                               # #14 — Adam, 15 pages, zero bookmarks

print(f"PDF: {PDF_PATH}")


---
## 1. Embedded TOC extraction

Reads the PDF's built-in bookmarks via PyMuPDF's `Document.get_toc()`.  
Each entry has a **level** (nesting depth), **title**, and **page** (0-indexed).

In [None]:
from shared.pdf_parser.extract_toc import extract_toc
import pymupdf

entries = extract_toc(PDF_PATH)
doc = pymupdf.open(PDF_PATH)
total_pages = len(doc)
doc.close()

print(f"Total pages: {total_pages}")
print(f"Embedded entries: {len(entries)}")
print(f"Levels found: {sorted(set(e.level for e in entries)) if entries else 'none'}")
print()

if entries:
    for e in entries:
        indent = "  " * (e.level - 1)
        print(f"  {indent}L{e.level}  p.{e.page:<4} {e.title}")
else:
    print("  No embedded TOC — will fall back to Docling inference.")


---
## 2. Page coverage check

The coverage ratio determines whether to trust the embedded TOC or fall back to Docling inference.  
**Formula**: `max_page / total_pages`, where `max_page` is the highest page any entry points to. If the ratio is below `min_coverage` (default 30%), the embedded TOC is considered too sparse.
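
As a worked example of the formula, here is a minimal sketch with made-up numbers. `coverage_ratio` and `passes_coverage` are hypothetical helper names used only for illustration; they are not functions from `pdf_parser`.

```python
def coverage_ratio(max_page: int, total_pages: int) -> float:
    """Highest referenced page divided by the page count (0.0 for an empty PDF)."""
    return max_page / total_pages if total_pages else 0.0


def passes_coverage(max_page: int, total_pages: int, min_coverage: float = 0.3) -> bool:
    """True when the embedded TOC reaches far enough into the document."""
    return coverage_ratio(max_page, total_pages) >= min_coverage


# Hypothetical book: bookmarks stop at page 4 of 150 pages (about 2.7%)
sparse = passes_coverage(max_page=4, total_pages=150)
# Hypothetical paper: bookmarks reach page 10 of 15 pages (about 66.7%)
dense = passes_coverage(max_page=10, total_pages=15)
print(sparse, dense)  # False True
```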

In [None]:
min_coverage = 0.3  # configurable — try changing this

if entries:
    max_page = max(e.page for e in entries)
    coverage = max_page / total_pages
    passes = max_page >= total_pages * min_coverage

    print(f"Highest page referenced: {max_page} / {total_pages - 1}")
    print(f"Coverage: {coverage:.1%}")
    print(f"Threshold: {min_coverage:.0%}")
    print(f"Result: {'PASS — using embedded TOC' if passes else 'FAIL — falling back to Docling inference'}")
else:
    print("No embedded entries — coverage is 0%, falling back to Docling inference.")


---
## 3. Inferred TOC (Docling fallback)

When the embedded TOC is missing or too sparse, `_get_toc()` falls back to `infer_toc()`.  
Docling's layout analysis detects headings (`TITLE`/`SECTION_HEADER`), and a hierarchy is then assigned by one of two rules:
1. **Section numbering** (e.g. "3.1.2" → level = dot-count + 1), used when the majority of headings are numbered
2. **Font-size fallback**, used otherwise: each heading's font size is looked up with PyMuPDF; the largest size becomes L1 and all others L2
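
The numbering rule can be sketched as follows. This is a minimal illustration, not the actual `infer_toc` implementation: the regex, the helper names `level_from_numbering` and `mostly_numbered`, and the strict more-than-half reading of "majority" are all assumptions. The font-size fallback is not shown.

```python
import re
from typing import Optional

# Assumed pattern: a leading run of dot-separated integers, e.g. "3", "3.1", "3.1.2",
# optionally followed by a trailing dot, then whitespace or end of string.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?(?=\s|$)")


def level_from_numbering(title: str) -> Optional[int]:
    """Return dot-count + 1 for a numbered heading, or None if it is unnumbered."""
    match = _NUMBERING.match(title)
    if match is None:
        return None
    return match.group(1).count(".") + 1


def mostly_numbered(titles: list) -> bool:
    """Assumed reading of 'majority': strictly more than half carry a number."""
    numbered = sum(level_from_numbering(t) is not None for t in titles)
    return numbered * 2 > len(titles)


print(level_from_numbering("3.1.2 Scaled Dot-Product Attention"))  # 3
print(level_from_numbering("References"))                          # None
```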

This cell always runs inference for exploration, even if the embedded TOC passed coverage.

In [None]:
from shared.pdf_parser.infer_toc import infer_toc

infer_toc.cache_clear()  # clear cache so we get fresh results
inferred = infer_toc(PDF_PATH)

if inferred:
    print(f"Inferred {len(inferred)} entries:")
    print()
    for e in inferred:
        indent = "  " * (e.level - 1)
        print(f"  {indent}L{e.level}  p.{e.page:<4} {e.title}")
else:
    print("No headings inferred.")


---
## 4. TOC source decision (`_get_toc`)

This mirrors the logic in `resolve_content._get_toc()`:  
the embedded TOC is used if it passes the coverage check; otherwise the inferred TOC is used.

In [None]:
from shared.pdf_parser.resolve_content import _get_toc

toc_entries = _get_toc(PDF_PATH, min_coverage=min_coverage)

# Determine which source was chosen (same logic as _get_toc)
if entries and max(e.page for e in entries) >= total_pages * min_coverage:
    source = "embedded"
else:
    source = "inferred (Docling)"

print(f"Source: {source}")
print(f"Entries: {len(toc_entries)}")
print(f"Levels: {sorted(set(e.level for e in toc_entries))}")
print()
for e in toc_entries:
    indent = "  " * (e.level - 1)
    print(f"  {indent}L{e.level}  p.{e.page:<4} {e.title}")


---
## 5. Entry classification

Each TOC entry is classified as **front matter**, **back matter**, **preamble**, or **content**.  
Front and back matter are skipped; preamble and content are included in the output.
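
The skip rule amounts to a filter on the kind label. The sketch below uses the four kind strings from this notebook; the `Entry` dataclass is a hypothetical stand-in for the pipeline's entry type, and the sample entries are made up and labelled by hand rather than by `classify_entry`.

```python
from dataclasses import dataclass

# Kinds that survive into the output, per the description above
INCLUDED_KINDS = {"preamble", "content"}


@dataclass
class Entry:  # hypothetical stand-in for the pipeline's TOC entry type
    title: str
    page: int
    kind: str  # "front", "back", "preamble", or "content"


def kept_entries(entries: list) -> list:
    """Drop front and back matter; keep preamble and content."""
    return [e for e in entries if e.kind in INCLUDED_KINDS]


# Made-up entries, labelled by hand for illustration
sample = [
    Entry("Contents", 2, "front"),
    Entry("Preface", 4, "preamble"),
    Entry("1 Introduction", 8, "content"),
    Entry("Index", 140, "back"),
]
print([e.title for e in kept_entries(sample)])  # ['Preface', '1 Introduction']
```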

In [None]:
from shared.pdf_parser.classify_entry import classify_entry

# Classify entries from step 4 (toc_entries already set by _get_toc)
for e in toc_entries:
    e.kind = classify_entry(e)

# Map each kind to an icon for the printout
KIND_ICON = {"front": "⊘", "back": "⊘", "preamble": "◈", "content": "●"}

print(f"Using: {source} TOC ({len(toc_entries)} entries)")
print()
for e in toc_entries:
    indent = "  " * (e.level - 1)
    icon = KIND_ICON.get(e.kind, "?")
    print(f"  {icon} {indent}L{e.level}  p.{e.page:<4} {e.title:<40} [{e.kind}]")

print()
print("Legend: ● content  ◈ preamble  ⊘ skipped (front/back matter)")


---
## 6. Content range resolution

Determines the page range of actual content by finding the first preamble/content entry  
and the last entry before back matter begins. Mirrors `resolve_content_pages()`.
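
One way to picture the range logic is the sketch below. It is not the code in `resolve_content.py`: the `Entry` dataclass and `content_range` are hypothetical, entries are assumed to be in document order, and the choice to end content on the page before the first back-matter entry is an assumption about what "before back matter begins" means.

```python
from dataclasses import dataclass


@dataclass
class Entry:  # hypothetical stand-in for the pipeline's TOC entry type
    title: str
    page: int  # 0-indexed
    kind: str  # "front", "back", "preamble", or "content"


def content_range(entries: list, total_pages: int) -> tuple:
    """Return (start_page, end_page), both inclusive and 0-indexed."""
    kept = [e for e in entries if e.kind in {"preamble", "content"}]
    if not kept:
        # Assumption: with nothing classified as content, keep every page
        return 0, total_pages - 1
    start = kept[0].page
    last_kept_page = kept[-1].page
    # Back-matter entries that begin at or after the last kept entry
    back_pages = [e.page for e in entries if e.kind == "back" and e.page >= last_kept_page]
    if not back_pages:
        return start, total_pages - 1
    # Assumption: content ends on the page before back matter begins,
    # but never before the page of the last kept entry.
    return start, max(last_kept_page, min(back_pages) - 1)


# Made-up 150-page book
book = [
    Entry("Contents", 2, "front"),
    Entry("Preface", 4, "preamble"),
    Entry("1 Introduction", 8, "content"),
    Entry("2 Methods", 60, "content"),
    Entry("Bibliography", 130, "back"),
    Entry("Index", 140, "back"),
]
print(content_range(book, total_pages=150))  # (4, 129)
```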

In [None]:
from shared.pdf_parser.resolve_content import resolve_content_pages

cr = resolve_content_pages(PDF_PATH, min_coverage=min_coverage)

content_pages = cr.end_page - cr.start_page + 1

print(f"Content range: pages {cr.start_page}–{cr.end_page} ({content_pages} of {cr.total_pages} pages)")
print()
if cr.skipped_front:
    print("Skipped front matter:")
    for s in cr.skipped_front:
        print(f"  {s}")
else:
    print("No front matter skipped.")
print()
if cr.skipped_back:
    print("Skipped back matter:")
    for s in cr.skipped_back:
        print(f"  {s}")
else:
    print("No back matter skipped.")

print()
pct = content_pages / cr.total_pages * 100
bar = "░" * cr.start_page + "█" * content_pages + "░" * (cr.total_pages - cr.end_page - 1)
# Downsample the bar to at most ~60 chars (ceiling division keeps long PDFs within the limit)
scale = max(1, -(-cr.total_pages // 60))
bar_scaled = bar[::scale]
print(f"Page map: [{bar_scaled}]")
print(f"          █ = content ({pct:.0f}%)  ░ = skipped")


---
## 7. Section splitting

Splits the content range into sections using `resolve_content_sections()`.  
Try different `max_level` values to see coarser vs finer splitting.
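
To build intuition for `max_level`, the sketch below splits a made-up outline by heading depth. It is an assumption about how level-based splitting might work, not the body of `resolve_content_sections()`: `Heading` and `split_sections` are hypothetical names, and token-based subdivision (`max_tokens`) is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Heading:  # hypothetical stand-in for a classified TOC entry
    level: int
    title: str
    page: int  # 0-indexed


def split_sections(headings: list, max_level: int, end_page: int) -> list:
    """Return (title, start_page, end_page) for headings with level <= max_level."""
    starts = [h for h in headings if h.level <= max_level]
    sections = []
    for i, h in enumerate(starts):
        # A section runs until the page before the next kept heading,
        # or to the end of the content range for the last one.
        next_start = starts[i + 1].page if i + 1 < len(starts) else end_page + 1
        sections.append((h.title, h.page, max(h.page, next_start - 1)))
    return sections


# Made-up outline with content ending on page 40
outline = [
    Heading(1, "1 Intro", 8),
    Heading(2, "1.1 Background", 10),
    Heading(1, "2 Method", 20),
    Heading(2, "2.1 Setup", 22),
]
coarse = split_sections(outline, max_level=1, end_page=40)
fine = split_sections(outline, max_level=2, end_page=40)
print(len(coarse), len(fine))  # 2 4
```

In this sketch a higher `max_level` only adds split points, so the fine-grained sections always nest inside the coarse ones.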

In [None]:
from shared.pdf_parser.resolve_content import resolve_content_sections

max_level = 1       # try 1, 2, 3
max_tokens = 24000  # set to None to disable auto-subdivision

sections = resolve_content_sections(PDF_PATH, max_level=max_level, max_tokens=max_tokens, min_coverage=min_coverage)

print(f"max_level={max_level}, max_tokens={max_tokens}")
print(f"Sections: {len(sections)}")
print()

doc = pymupdf.open(PDF_PATH)
for i, s in enumerate(sections):
    pages = s.end_page - s.start_page + 1
    # Rough token estimate, assuming ~4 characters per token
    chars = sum(len(doc[p].get_text()) for p in range(s.start_page, s.end_page + 1))
    tokens = chars // 4
    indent = "  " * (s.level - 1)
    print(f"  {i:>2}. {indent}L{s.level}  p.{s.start_page}–{s.end_page} ({pages:>3} pp, ~{tokens:,} tok)  {s.title}")
doc.close()


---
## 8. Compare splitting granularity

Side-by-side comparison of `max_level=1` vs `max_level=2` to see how splitting changes.

In [None]:
s1 = resolve_content_sections(PDF_PATH, max_level=1, max_tokens=max_tokens, min_coverage=min_coverage)
s2 = resolve_content_sections(PDF_PATH, max_level=2, max_tokens=max_tokens, min_coverage=min_coverage)

print(f"{'max_level=1':>40}  │  max_level=2")
print(f"{'─' * 40}──┼──{'─' * 40}")

max_rows = max(len(s1), len(s2))
for i in range(max_rows):
    left = f"p.{s1[i].start_page}–{s1[i].end_page} {s1[i].title}" if i < len(s1) else ""
    right = f"p.{s2[i].start_page}–{s2[i].end_page} {s2[i].title}" if i < len(s2) else ""
    print(f"  {left:>38}  │  {right}")

print(f"\n  {'Sections: ' + str(len(s1)):>38}  │  Sections: {len(s2)}")
