# Section Hierarchy Detection

This notebook builds on the parsed PDF data to detect headings and create a hierarchical section structure.

## Goals:
- Analyze font sizes and styles to identify headings
- Classify blocks as H1, H2, H3, or body text
- Build a table of contents (TOC)
- Assign `section_path` to each block

In [12]:
import pymupdf
from pathlib import Path
import json
from collections import defaultdict, Counter
from typing import List, Dict, Optional

## Load PDF and Extract Font Information

In [13]:
pdf_path = Path("../resources/chess.pdf")
doc = pymupdf.open(pdf_path)

print(f"Loaded: {len(doc)} pages")

Loaded: 95 pages


## Extract Blocks with Font Metadata

We need font size, font name, and other styling info to detect headings.

In [14]:
def extract_blocks_with_fonts(page, page_num):
    """Extract text blocks with detailed font information."""
    blocks_data = []
    blocks = page.get_text("dict")["blocks"]
    
    for block_idx, block in enumerate(blocks):
        if block["type"] == 0:  # text block
            bbox = block["bbox"]
            text_parts = []
            font_sizes = []
            font_names = []
            is_bold = []
            
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    text_parts.append(span["text"])
                    font_sizes.append(span["size"])
                    font_names.append(span["font"])
                    is_bold.append("bold" in span["font"].lower())  # case-insensitive font-name check
            
            if text_parts:
                text = " ".join(text_parts).strip()
                # Use the largest span size in the block (despite the variable name, this is a max, not an average)
                avg_font_size = max(font_sizes) if font_sizes else 12.0
                primary_font = font_names[0] if font_names else "unknown"
                has_bold = any(is_bold)
                
                blocks_data.append({
                    "page_num": page_num,
                    "block_idx": block_idx,
                    "bbox": bbox,
                    "text": text,
                    "font_size": avg_font_size,
                    "font_name": primary_font,
                    "is_bold": has_bold,
                    "char_count": len(text)
                })
    
    return blocks_data

# Extract from all pages
all_blocks = []
for page_num in range(len(doc)):
    page = doc[page_num]
    blocks = extract_blocks_with_fonts(page, page_num)
    all_blocks.extend(blocks)

print(f"Extracted {len(all_blocks)} blocks with font metadata")

Extracted 1943 blocks with font metadata
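
The font name is only one signal for bold text. PyMuPDF span dictionaries also carry a `flags` integer in which bit 4 (value 16) marks bold, so a slightly more robust check could combine both signals. A minimal sketch, assuming a hypothetical helper `span_is_bold` that is not part of this notebook:

```python
BOLD_FLAG = 1 << 4  # bit 4 of a PyMuPDF span's "flags" field marks bold text


def span_is_bold(span: dict) -> bool:
    """Return True if a span looks bold by flag or by font name.

    `span` is expected to look like the span dicts returned by
    page.get_text("dict"); missing keys are treated as "not bold".
    """
    by_flag = bool(span.get("flags", 0) & BOLD_FLAG)
    by_name = "bold" in span.get("font", "").lower()
    return by_flag or by_name


# Hand-written spans for illustration (not taken from chess.pdf)
print(span_is_bold({"font": "Helvetica", "flags": 16}))   # bold by flag
print(span_is_bold({"font": "Times-BOLD", "flags": 0}))   # bold by name
print(span_is_bold({"font": "Courier", "flags": 4}))      # serif flag only
```

Whether the flag adds anything for this particular PDF is untested here; it mainly helps with fonts whose names do not mention their weight.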


## Analyze Font Size Distribution

In [15]:
# Get font size distribution
font_sizes = [b["font_size"] for b in all_blocks]
size_counts = Counter([round(fs, 1) for fs in font_sizes])

print("Font size distribution (top 10):")
for size, count in size_counts.most_common(10):
    print(f"  {size}pt: {count} blocks")

# Find the most common (body text) size
body_text_size = size_counts.most_common(1)[0][0]
print(f"\nMost common (body text) size: {body_text_size}pt")

Font size distribution (top 10):
  9.8pt: 1753 blocks
  8.0pt: 190 blocks

Most common (body text) size: 9.8pt
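
Only two sizes appear, and the less common one is smaller than the body size, so font size by itself cannot flag any heading in this PDF. A small sketch that quantifies this, using a hypothetical helper `share_above_body` and the counts copied from the output above:

```python
from collections import Counter


def share_above_body(size_counts: Counter, body_size: float, min_delta: float = 0.5) -> float:
    """Fraction of blocks whose font size exceeds body_size by more than min_delta."""
    total = sum(size_counts.values())
    if total == 0:
        return 0.0
    larger = sum(n for size, n in size_counts.items() if size - body_size > min_delta)
    return larger / total


# Counts copied from the distribution printed above
observed_counts = Counter({9.8: 1753, 8.0: 190})
print(share_above_body(observed_counts, body_size=9.8))  # 0.0
```

A share of zero means a size-based rule would never fire, which is one reason to lean on text patterns for this document.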


## Classify Blocks as Headings or Body Text

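Heuristic:
- Text patterns are checked first: `CHAPTER`/`PART` prefixes, numbered prefixes such as `1. `, all-caps lines, and `_underscored_` lines. The font sizes above are nearly uniform (only 9.8pt and 8.0pt), so size alone cannot separate headings from body text in this PDF.
- Larger font size than body text → potential heading (fallback for PDFs with real typography)
- Bold text → potential heading
- Short text (< 100 chars) → more likely a heading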

In [16]:
def classify_block(block, body_size):
    """Classify a block as heading level or body text.
    
    For plain text PDFs (like Project Gutenberg exports), use pattern-based detection
    since font properties are uniform.
    """
    font_size = block["font_size"]
    is_bold = block["is_bold"]
    char_count = block["char_count"]
    text = block["text"]
    
    # Skip very short blocks (likely page numbers, etc.)
    if char_count < 5:
        return "skip"
    
    # Pattern-based heading detection (for uniform font PDFs)
    is_all_caps = text.isupper() and len(text) > 5
    is_short = len(text) < 100
    has_chapter = text.startswith("CHAPTER") or text.startswith("PART")
    has_number_prefix = len(text) > 2 and text[0].isdigit() and ". " in text[:5]
    is_underscored = text.startswith("_") and text.endswith("_")
    
    # Filter out common non-heading patterns
    skip_patterns = [
        "illustration",
        "http://", "https://",
        "gutenberg",
        "***",
        "copyright",
        "printed in",
        "/",  # Date patterns like 02/10/2025 (note: this skips any text containing a slash)
    ]
    
    is_noise = any(pattern in text.lower() for pattern in skip_patterns)
    
    if is_noise:
        return "skip"
    
    # Classify headings by pattern
    if has_chapter:
        return "h1"
    elif has_number_prefix and is_short:
        # Numbered sections like "1. SOME SIMPLE MATES"
        return "h2"
    elif is_all_caps and is_short and char_count > 10:
        # All caps titles
        return "h1"
    elif is_underscored and is_short:
        # Emphasized with underscores
        return "h3"
    
    # Font-based detection (fallback for PDFs with proper typography)
    size_diff = font_size - body_size
    
    if size_diff > 4:
        return "h1"
    elif size_diff > 2:
        return "h2"
    elif size_diff > 0.5 and (is_bold or char_count < 80):
        return "h3"
    elif is_bold and char_count < 60:
        return "h3"
    
    return "body"

# Classify all blocks
for block in all_blocks:
    block["type"] = classify_block(block, body_text_size)

# Count classifications
type_counts = Counter([b["type"] for b in all_blocks])
print("Block classification:")
for block_type, count in type_counts.items():
    print(f"  {block_type}: {count}")

Block classification:
  skip: 410
  body: 1039
  h1: 66
  h3: 6
  h2: 422
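
To see which pattern flags fire for a given string, the checks from `classify_block` can be recomputed in isolation. This is a sketch with a hypothetical helper `heading_flags`; the first three samples come from the PDF's headings, while the last one is a made-up move-list line, included on the assumption that such lines exist in the book and may account for part of the large `h2` count.

```python
def heading_flags(text: str) -> dict:
    """Recompute the pattern flags used by classify_block for one string."""
    return {
        "is_all_caps": text.isupper() and len(text) > 5,
        "is_short": len(text) < 100,
        "has_chapter": text.startswith("CHAPTER") or text.startswith("PART"),
        "has_number_prefix": len(text) > 2 and text[0].isdigit() and ". " in text[:5],
        "is_underscored": text.startswith("_") and text.endswith("_"),
    }


samples = [
    "CHAPTER I",
    "1. SOME SIMPLE MATES",
    "_New York_",
    "1. P-K4 P-K4",  # hypothetical move-list line, not copied from the PDF
]
for sample in samples:
    active = [name for name, on in heading_flags(sample).items() if on]
    print(f"{sample!r}: {active}")
```

Any short line that starts with a digit followed by `. ` sets `has_number_prefix`, so the rule cannot tell a numbered section title from other numbered lines.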


## Inspect Sample Headings

In [17]:
# Show first few headings of each type
for heading_type in ["h1", "h2", "h3"]:
    headings = [b for b in all_blocks if b["type"] == heading_type]
    print(f"\n{heading_type.upper()} examples ({len(headings)} total):")
    for h in headings[:5]:
        print(f"  Page {h['page_num']}: {h['text'][:80]}")


H1 examples (66 total):
  Page 0: CHESS FUNDAMENTALS
  Page 0: JOSE R. CAPABLANCA
  Page 0: _CHESS CHAMPION OF THE WORLD_
  Page 0: NEW YORK HARCOURT, BRACE & WORLD, INC. LONDON: G. BELL AND SONS, LTD.
  Page 0: HARCOURT, BRACE & WORLD, INC.

H2 examples (422 total):
  Page 1: 1. SOME SIMPLE MATES                                       3
  Page 1: 2. PAWN PROMOTION                                          9
  Page 1: 3. PAWN ENDINGS                                           13
  Page 1: 4. SOME WINNING POSITIONS IN THE MIDDLE-GAME              19
  Page 1: 5. RELATIVE VALUE OF THE PIECES                           24

H3 examples (6 total):
  Page 0: _Seventeenth Printing_
  Page 1: _New York_
  Page 1: _Sept. 1, 1934_
  Page 15: _A unit that holds two._
  Page 27: _The winning of a Pawn among good players of even strength often means the winni
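
The `h2` examples above are lines from the book's contents page, so each title still carries its padding and printed page number. One possible cleanup, sketched with a hypothetical helper `split_toc_line` and assuming the title and page number are always separated by at least two spaces:

```python
import re
from typing import Optional, Tuple

# A title, then a run of two or more spaces, then a page number at the end
TOC_LINE = re.compile(r"^(?P<title>.*?\S)\s{2,}(?P<page>\d+)$")


def split_toc_line(text: str) -> Tuple[str, Optional[int]]:
    """Split 'TITLE      12' into ('TITLE', 12); return (text, None) otherwise."""
    stripped = text.strip()
    match = TOC_LINE.match(stripped)
    if match is None:
        return stripped, None
    return match.group("title"), int(match.group("page"))


print(split_toc_line("1. SOME SIMPLE MATES                                       3"))
print(split_toc_line("CHESS FUNDAMENTALS"))
```

The first call returns the bare title with page `3`; the second has no trailing number, so it comes back unchanged with `None`.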


## Build Section Hierarchy

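Walk through the blocks in reading order, track the most recent H1, H2, and H3 heading, and record a `section_path` of the form `H1 > H2 > H3` for every block. A new heading clears all deeper levels.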

In [18]:
def build_section_tree(blocks):
    """Build hierarchical section structure from classified blocks."""
    # Track current section path at each level
    current_path = [None, None, None]  # [h1, h2, h3]
    section_map = {}  # (page_num, block_idx) -> section_path
    toc = []  # Table of contents
    
    for block in blocks:
        block_key = (block["page_num"], block["block_idx"])
        
        if block["type"] == "h1":
            current_path[0] = block["text"]
            current_path[1] = None
            current_path[2] = None
            toc.append({"level": 1, "title": block["text"], "page": block["page_num"]})
            
        elif block["type"] == "h2":
            current_path[1] = block["text"]
            current_path[2] = None
            toc.append({"level": 2, "title": block["text"], "page": block["page_num"]})
            
        elif block["type"] == "h3":
            current_path[2] = block["text"]
            toc.append({"level": 3, "title": block["text"], "page": block["page_num"]})
        
        # Build section path from current hierarchy
        path_parts = [p for p in current_path if p is not None]
        section_path = " > ".join(path_parts) if path_parts else None
        section_map[block_key] = section_path
    
    return section_map, toc

section_map, toc = build_section_tree(all_blocks)
print(f"Built section hierarchy with {len(toc)} TOC entries")

Built section hierarchy with 494 TOC entries
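
The path-tracking logic is easier to follow on a handful of blocks. The sketch below uses a hypothetical, simplified function `trace_paths` and toy data, and returns only the path per block rather than a map and a TOC:

```python
from typing import List, Optional

LEVELS = {"h1": 0, "h2": 1, "h3": 2}


def trace_paths(blocks: List[dict]) -> List[Optional[str]]:
    """Return the section path for each block, in order."""
    current: List[Optional[str]] = [None, None, None]
    paths = []
    for block in blocks:
        level = LEVELS.get(block["type"])
        if level is not None:
            current[level] = block["text"]
            # A new heading clears every deeper level
            for deeper in range(level + 1, len(current)):
                current[deeper] = None
        parts = [p for p in current if p is not None]
        paths.append(" > ".join(parts) if parts else None)
    return paths


# Toy blocks for illustration only
toy = [
    {"type": "body", "text": "front matter"},
    {"type": "h1", "text": "CHAPTER I"},
    {"type": "h2", "text": "1. SOME SIMPLE MATES"},
    {"type": "body", "text": "..."},
    {"type": "h1", "text": "CHAPTER II"},
    {"type": "body", "text": "..."},
]
for block, path in zip(toy, trace_paths(toy)):
    print(f"{block['type']:>4}: {path}")
```

Blocks that appear before the first heading get `None`, and the second `h1` drops the earlier `h2` from the path.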


## Display Table of Contents

In [19]:
print("TABLE OF CONTENTS\n" + "="*50)
for entry in toc[:30]:  # Show first 30 entries
    indent = "  " * (entry["level"] - 1)
    print(f"{indent}{entry['title'][:60]} (p.{entry['page']})")

if len(toc) > 30:
    print(f"\n... and {len(toc) - 30} more entries")

TABLE OF CONTENTS
CHESS FUNDAMENTALS (p.0)
JOSE R. CAPABLANCA (p.0)
_CHESS CHAMPION OF THE WORLD_ (p.0)
NEW YORK HARCOURT, BRACE & WORLD, INC. LONDON: G. BELL AND S (p.0)
HARCOURT, BRACE & WORLD, INC. (p.0)
    _Seventeenth Printing_ (p.0)
J. R. CAPABLANCA (p.1)
    _New York_ (p.1)
    _Sept. 1, 1934_ (p.1)
LIST OF CONTENTS (p.1)
PART I (p.1)
CHAPTER I (p.1)
  1. SOME SIMPLE MATES                                       3 (p.1)
  2. PAWN PROMOTION                                          9 (p.1)
  3. PAWN ENDINGS                                           13 (p.1)
  4. SOME WINNING POSITIONS IN THE MIDDLE-GAME              19 (p.1)
  5. RELATIVE VALUE OF THE PIECES                           24 (p.1)
  6. GENERAL STRATEGY OF THE OPENING                        25 (p.1)
  7. CONTROL OF THE CENTRE                                  28 (p.1)
  8. TRAPS                                                  32 (p.1)
CHAPTER II (p.1)
FURTHER PRINCIPLES IN END-GAME PLAY (p.1)
  9. A CARDINAL PRINCIPLE  

## Assign Section Paths to All Blocks

In [21]:
# Attach section_path to each block
for block in all_blocks:
    block_key = (block["page_num"], block["block_idx"])
    block["section_path"] = section_map.get(block_key)

# Count blocks with section paths
blocks_with_sections = sum(1 for b in all_blocks if b["section_path"])
print(f"Assigned section paths to {blocks_with_sections}/{len(all_blocks)} blocks")

Assigned section paths to 1934/1943 blocks


## Inspect Blocks with Section Paths

In [22]:
# Show sample blocks with their section paths
body_blocks = [b for b in all_blocks if b["type"] == "body" and b["section_path"]]

print("Sample body text blocks with section paths:\n")
for block in body_blocks[10:15]:  # Skip the first 10 body blocks to get past some of the preamble
    print(f"Page {block['page_num']}")
    print(f"Section: {block['section_path']}")
    print(f"Text: {block['text'][:100]}...")
    print()

Sample body text blocks with section paths:

Page 2
Section: FURTHER OPENINGS AND MIDDLE-GAMES > 31. SOME SALIENT POINTS ABOUT PAWNS                      143
Text: 32. SOME POSSIBLE DEVELOPMENTS FROM A RUY LOPEZ   (showing the weakness of a backward Q B P; the   p...

Page 3
Section: ILLUSTRATIVE GAMES
Text: GAME....

Page 3
Section: ILLUSTRATIVE GAMES
Text: 1. QUEEN'S GAMBIT DECLINED (MATCH, 1909)                159       White: F. J. Marshall. Black: J. R...

Page 3
Section: ILLUSTRATIVE GAMES
Text: 2. QUEEN'S GAMBIT DECLINED (SAN SEBASTIAN, 1911)        163       White: A. K. Rubinstein. Black: J....

Page 3
Section: ILLUSTRATIVE GAMES
Text: 3. IRREGULAR DEFENCE (HAVANA, 1913)                     169       White: D. Janowski. Black: J. R. C...



## Export Enhanced Workspace

Save the blocks with section paths for use in database creation.

In [23]:
workspace_enhanced = {
    "doc_id": "chess_pdf",
    "num_pages": len(doc),
    "blocks": all_blocks,
    "toc": toc
}

# Save to file
output_path = Path("workspace_with_sections.json")
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(workspace_enhanced, f, indent=2)

print(f"Saved enhanced workspace to {output_path}")
print(f"  - {len(all_blocks)} blocks")
print(f"  - {len(toc)} TOC entries")
print(f"  - {blocks_with_sections} blocks with section paths")

Saved enhanced workspace to workspace_with_sections.json
  - 1943 blocks
  - 494 TOC entries
  - 1934 blocks with section paths
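
One detail to keep in mind when reading the file back: JSON has no tuple type, so if `bbox` is a tuple in memory it comes back as a list. A self-contained sketch with a made-up, single-block workspace (the values are illustrative, not from `chess.pdf`):

```python
import json
import tempfile
from pathlib import Path

# Toy workspace shaped like workspace_enhanced; values are made up
workspace = {
    "doc_id": "example",
    "num_pages": 1,
    "blocks": [
        {
            "page_num": 0,
            "block_idx": 0,
            "bbox": (0.0, 1.0, 2.0, 3.0),
            "text": "CHAPTER I",
            "type": "h1",
            "section_path": "CHAPTER I",
        }
    ],
    "toc": [{"level": 1, "title": "CHAPTER I", "page": 0}],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "workspace.json"
    path.write_text(json.dumps(workspace, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(type(workspace["blocks"][0]["bbox"]).__name__)  # tuple
print(type(loaded["blocks"][0]["bbox"]).__name__)     # list
```

Code that consumes the saved workspace should therefore compare `bbox` values element by element, or convert them back to tuples after loading.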


## Next Steps

- Design SQLite database schema
- Create storage layer to persist this workspace
- Build FTS5 full-text search index
- Test retrieval with queries