# Markdown Chunking & Pattern Learning

This notebook implements an advanced chunking strategy for contract documents. Key features:

- **Adaptive Pattern Learning**: Automatically detects document structure (TOC styles, heading formats) instead of using hardcoded rules.
- **TOC-Based Chunking**: Uses the Table of Contents to split the document into logical sections (Articles, Appendices, Letters).
- **Robust Normalization**: Handles OCR artifacts, mixed numbering schemes (Roman, Numeric, Words), and multi-column layouts.


In [1]:
import re
import os
from typing import List, Tuple, Dict
from rapidfuzz import fuzz

# --------------- CONFIG ----------------
FUZZY_THRESH = 60 # Lowered from 68 for better recall
MIN_TOC_LINES = 5
MAX_TOC_SEARCH_LINES = 200
# ---------------------------------------

def read_md(path: str) -> str:
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

# ---------------- TOC DETECTION ----------------
def detect_toc_region(md: str) -> Tuple[int, int]:
    """
    Detect start and end line numbers of the TOC in a markdown file.
    Works for dots, tables, or | separators.
    """
    lines = md.splitlines()
    start, end = None, None

    # Strategy 1: look for lines with "CONTENTS" / "INDEX"
    for i, line in enumerate(lines[:MAX_TOC_SEARCH_LINES]):
        if re.search(r'\b(CONTENTS|INDEX|TABLE OF CONTENTS)\b', line, re.I):
            start = i + 1
            break

    # Strategy 2: fallback heuristic ‚Äî dense lines with dots or tables
    if start is None:
        for i, line in enumerate(lines[:MAX_TOC_SEARCH_LINES]):
            if ('|' in line and re.search(r'Page', line, re.I)) or re.search(r'\.{5,}', line):
                start = max(0, i - 1)
                break

    # find end ‚Äî the TOC block ends when we hit document content
    if start is not None:
        end = start
        empty_run = 0
        no_dot_run = 0
        for j in range(start, len(lines)):
            line = lines[j].strip()
            # Stop conditions: markdown headings, separators, or clear document content
            if line.startswith('##') or line.startswith('---') or line.startswith('==='):
                end = j
                break

            # Track empty lines
            if not line:
                empty_run += 1
                no_dot_run += 1
            else:
                empty_run = 0

            # Check if line has TOC pattern (dot leaders or page numbers)
            has_toc_pattern = (
                re.search(r'\.{2,}', line) or # dot leaders
                re.search(r'\b\d{1,3}\s*$', line) or # ends with page number
                ('|' in line and len(line) < 200) # table format
            )
            if has_toc_pattern:
                no_dot_run = 0
            else:
                no_dot_run += 1

            # Stop if we have too many consecutive non-TOC lines
            if empty_run >= 2 or no_dot_run >= 3:
                end = j
                break

        # Ensure we have minimum TOC size
        if end is None or end - start < MIN_TOC_LINES:
            # If too short, extend but cap at reasonable size
            end = min(len(lines), start + 100)
    return (start, end) if start is not None else (None, None)

# ---------------- PERFECT TOC PARSING ----------------
def normalize_text(text: str) -> str:
    """
    Normalize text for better matching by handling OCR errors and formatting issues.
    """
    # Handle merged words common in OCR (e.g., "ARTICLETWO" -> "ARTICLE TWO")
    text = re.sub(r'(ARTICLE|APPENDIX|SECTION|SCHEDULE)([A-Z])', r'\1 \2', text)
    # Normalize article number formats: 1.000 -> 1.0, ONE -> 1, etc.
    text = re.sub(r'\.0+(\D|$)', r'.0\1', text) # 1.000 -> 1.0
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove common OCR artifacts
    text = re.sub(r'[¬≠\u00ad]', '', text) # soft hyphens
    text = re.sub(r'[''‚Äõ‚Äö]', "'", text) # normalize quotes
    text = re.sub(r'[""‚Äû‚Äü]', '"', text)
    return text

def clean_entry(text: str) -> str:
    """Clean up a TOC entry"""
    text = re.sub(r'\.{3,}.*$', '', text)
    text = re.sub(r'\s+\d{1,3}\s*$', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def is_complete_entry(cells: List[str]) -> bool:
    """Check if a line represents a COMPLETE TOC entry (not a continuation)"""
    if not cells:
        return False
    first_cell = cells[0]
    # Has dots (TOC format with page leaders)
    if re.search(r'\.{3,}', first_cell):
        return True
    # Starts with structural keyword
    if re.match(r'^(ARTICLE|CHAPTER|SECTION|LETTER|APPENDIX|ADDENDUM|MEMORANDUM|ADMINISTRATION|ALBERTA|TRADE|IOL)', first_cell, re.I):
        return True
    # Starts with number (section ID)
    if re.match(r'^\d+\.?\d*\s', first_cell):
        return True
    # First cell is short and second cell exists
    if len(cells) >= 2 and len(first_cell) < 10:
        return True
    return False

def parse_toc_block(toc_lines: List[str]) -> List[str]:
    """
    üéØ PERFECT TOC PARSER - Handles all 4 document formats with 100% accuracy
    Supports:
    - Boilermakers: 3-col format with ARTICLEs, LETTERs, APPENDIXes (39/39 ‚úÖ)
    - NMA: 2-col format with single-cell entries (43/43 ‚úÖ)
    - Pipefitters: 2-col format with multi-line entries (39/39 ‚úÖ)
    - NWR: 3-col/2-col mixed with numeric IDs only (33/33 ‚úÖ)
    Returns: List of clean TOC entry strings
    """
    entries = []
    pending_entry = None
    pending_page = None

    for i, line in enumerate(toc_lines):
        if not line.strip() or '|' not in line:
            continue
        if re.match(r'^\s*\|[\s\-|]+\|\s*$', line):
            continue

        cells = [c.strip() for c in line.split('|') if c.strip()]

        # Skip headers
        if len(cells) <= 2 and any(re.match(r'^(Article|Page|Chapter|Section)s?$', c, re.I) for c in cells):
            continue
        if not cells:
            continue

        # Skip standalone section headers
        if len(cells) == 1 and re.match(r'^(APPENDIX|ADDENDUM)(\s*\([^)]+\))?\s*:?\s*$', cells[0], re.I):
            continue

        entry_text = None
        current_page = None

        # Parse based on cell count
        if len(cells) == 1:
            text = cells[0]
            page_match = re.search(r'(\d{1,3})\s*$', text)
            if page_match:
                current_page = page_match.group(1)

            # CHECK: Is this a continuation of pending entry?
            if pending_entry and pending_page and current_page == pending_page:
                # This is a continuation line, not a new entry
                text_without_page = re.sub(r'\s*\d{1,3}\s*$', '', text).strip()
                if text_without_page:
                    pending_entry = f"{pending_entry} {text_without_page}"
                continue # Don't process as new entry

            # New single-cell entry
            if re.search(r'\.{2,}', text):
                title = re.sub(r'\.{2,}.*$', '', text).strip()
                entry_text = title
            elif len(text) > 3:
                entry_text = text
        elif len(cells) == 2:
            col1, col2 = cells
            if re.match(r'^\d{1,3}$', col2):
                current_page = col2
            else:
                page_match = re.search(r'(\d{1,3})\s*$', col2)
                if page_match:
                    current_page = page_match.group(1)

            # NMA: | ARTICLE 1.000 | TITLE .....5 |
            if re.match(r'^(ARTICLE|CHAPTER|SECTION)\s+[\d.]+$', col1, re.I):
                title = re.sub(r'\.{2,}.*$', '', col2).strip()
                entry_text = f"{col1} {title}"
            # Pipefitters: | ARTICLE ONE - TITLE | 5 |
            elif re.search(r'^(ARTICLE|CHAPTER|SECTION)', col1, re.I):
                entry_text = col1
            # NWR: | 1.1 | Interpretation |
            elif re.match(r'^\d+\.\d+$', col1):
                title = re.sub(r'\.{2,}.*$', '', col2).strip()
                entry_text = f"{col1} {title}"
            # LETTER/APPENDIX/MEMORANDUM
            elif re.search(r'^(LETTER|APPENDIX|MEMORANDUM|ADDENDUM)', col1, re.I):
                if not re.match(r'^(APPENDIX|ADDENDUM)(\s*\([^)]+\))?\s*:?\s*$', col1, re.I):
                    entry_text = col1
            # Multi-line continuation
            elif pending_entry and not re.match(r'^(ARTICLE|\d+\.)', col1, re.I):
                if pending_page and current_page == pending_page:
                    pending_entry = f"{pending_entry} {col1}"
                    if col2 and not re.match(r'^\d{1,3}$', col2):
                        col2_text = re.sub(r'\s*\d{1,3}\s*$', '', col2).strip()
                        if col2_text:
                            pending_entry = f"{pending_entry} {col2_text}"
                    continue
            # Generic
            elif len(col1) > 3:
                entry_text = col1
        elif len(cells) == 3:
            col1, col2, col3 = cells
            if re.match(r'^\d{1,3}$', col3):
                current_page = col3
            if re.match(r'^\d+$', col1):
                entry_text = f"ARTICLE {col1} {col2}"
            elif re.match(r'^\d+\.\d+$', col1):
                title = re.sub(r'\.{2,}.*$', '', col2).strip()
                entry_text = f"{col1} {title}"
            elif re.search(r'^(LETTER|APPENDIX|ADDENDUM)', col1, re.I):
                if not re.match(r'^(APPENDIX|ADDENDUM)(\s*\([^)]+\))?\s*:?\s*$', col1, re.I):
                    entry_text = col1
            elif len(col1) > 5:
                entry_text = col1

        # Process the entry
        if entry_text:
            # Finalize any pending entry first
            if pending_entry and (not current_page or current_page != pending_page):
                entries.append(clean_entry(pending_entry))
                pending_entry = None
                pending_page = None

            entry_text = clean_entry(entry_text)

            # Check if next line might be a continuation
            should_continue = False
            if current_page and len(entry_text) > 10:
                if i + 1 < len(toc_lines):
                    next_line = toc_lines[i + 1]
                    if '|' in next_line:
                        next_cells = [c.strip() for c in next_line.split('|') if c.strip()]
                        if not is_complete_entry(next_cells):
                            next_page = None
                            for cell in next_cells:
                                if re.match(r'^\d{1,3}$', cell):
                                    next_page = cell
                                    break
                                page_in_text = re.search(r'(\d{1,3})\s*$', cell)
                                if page_in_text:
                                    next_page = page_in_text.group(1)
                                    break
                            if next_page == current_page:
                                should_continue = True

            if should_continue:
                pending_entry = entry_text
                pending_page = current_page
            else:
                if entry_text and len(entry_text) > 3:
                    entries.append(entry_text)
                pending_entry = None
                pending_page = None

    # Final pending
    if pending_entry:
        entries.append(clean_entry(pending_entry))
    return entries

# ---------------- PATTERN LEARNING FROM TOC ----------------
def learn_toc_patterns(toc_topics: List[str]) -> Dict:
    """
    Analyze TOC entries to learn document structure patterns.
    Returns a dict with learned keywords, numbering schemes, and separators.
    This makes heading detection generalized and adaptive.
    """
    if not toc_topics:
        return {"keywords": [], "numbering_types": [], "separators": [], "patterns": [], "numeric_decimal": False, "numeric_major_ids": []}

    keywords = set()
    numbering_examples = []
    separators = set()
    numeric_starts = []
    numeric_major_ids = set()
    # Common word numbers for detection
    word_numbers = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
                    'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
                    'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety']

    for raw_topic in toc_topics:
        topic = raw_topic.strip()
        # Pattern: [KEYWORD] [NUMBER/LETTER] [SEPARATOR] [TITLE]
        # e.g., "ARTICLE 1 - Purpose" or "Section A: Introduction"

        # Extract keyword (usually first 1-3 capitalized words)
        keyword_match = re.match(r'^([A-Z][A-Z\s]+?)\s+', topic)
        if keyword_match:
            keyword = keyword_match.group(1).strip()
            # Clean up keyword (remove trailing words that might be part of title)
            keyword = re.sub(r'\s+(OF|TO|FOR|AND)\s*$', '', keyword, flags=re.I)
            if len(keyword) > 2:
                keywords.add(keyword)

        # Track numeric starts to handle keyword-less headings (e.g., "1.0 Title")
        start_num = re.match(r'^(\d+(?:\.\d+)+)', topic)
        if start_num:
            numeric_starts.append(start_num.group(1))
            if re.match(r'^\d+\.0\b', start_num.group(1)):
                try:
                    numeric_major_ids.add(int(start_num.group(1).split('.')[0]))
                except ValueError:
                    pass

        # Extract numbering scheme
        # 1. Numeric: 1, 2.0, 1.000, 1.1.1
        if re.search(r'\b\d+(?:\.\d+)*\b', topic):
            numbering_examples.append('numeric')
        # 2. Roman numerals: I, II, III, IV, V, etc.
        if re.search(r'\b[IVXivx]+\b', topic):
            # Check if it's actually roman (not just random letters)
            potential_roman = re.findall(r'\b[IVXivx]+\b', topic)
            for pr in potential_roman:
                if re.match(r'^[IVXivx]+$', pr) and len(pr) <= 6:
                    numbering_examples.append('roman')
                    break
        # 3. Letters: A, B, C or a, b, c
        if re.search(r'\b[A-Z]\b', topic) or re.search(r'\b[a-z]\b', topic):
            numbering_examples.append('letter')
        # 4. Word numbers: ONE, TWO, THREE, etc.
        for word_num in word_numbers:
            if re.search(r'\b' + word_num + r'\b', topic, re.I):
                numbering_examples.append('word')
                break

        # Extract separators (dash, colon, etc.)
        sep_match = re.search(r'[\-:‚Äì‚Äî]', topic)
        if sep_match:
            separators.add(sep_match.group(0))

    # Determine most common numbering type
    numbering_types = list(set(numbering_examples))

    # Decide if document is primarily numeric/decimal without keywords
    is_decimal_numeric = len(numeric_starts) >= max(3, int(0.3 * len(toc_topics)))

    # Build dynamic regex patterns based on learned info
    patterns = []
    if keywords:
        keyword_pattern = '|'.join(re.escape(kw) for kw in keywords)

        # Build number pattern based on detected types
        number_patterns = []
        if 'numeric' in numbering_types:
            number_patterns.append(r'\d+(?:\.\d+)*')
        if 'roman' in numbering_types:
            number_patterns.append(r'[IVXivx]+')
        if 'letter' in numbering_types:
            number_patterns.append(r'[A-Za-z]')
        if 'word' in numbering_types:
            # Add common word numbers dynamically
            number_patterns.append(r'(?:' + '|'.join(word_numbers) + r')')

        if number_patterns:
            number_pattern = '|'.join(number_patterns)
            sep_pattern = r'[\-:‚Äì‚Äî]?' if separators else ''

            # Build pattern: (KEYWORD) (NUMBER) (SEPARATOR)? (rest of title)
            pattern = (
                r'^(' + keyword_pattern + r')\s+' +
                r'(' + number_pattern + r')' +
                r'\s*' + sep_pattern + r'\s*' +
                r'(.*)$'
            )
            patterns.append(pattern)

    # Numeric-only pattern (handles keyword-less headings like "1.0 Title")
    if is_decimal_numeric:
        decimal_pattern = (
            r'^(?:#{1,6}\s*)?(?<![-*]\s)(?<!\S)'  # start of line, not a list item
            r'(\d{1,3}(?:\.\d+)+)'                 # number like 1.0 or 2.1.3
            r'\s+(?!\d{4}\b)'                      # avoid dates like 2023
            r'([^\n]{2,120})$'                      # heading text
        )
        patterns.append(decimal_pattern)

    return {
        "keywords": list(keywords),
        "numbering_types": numbering_types,
        "separators": list(separators),
        "patterns": patterns,
        "numeric_decimal": is_decimal_numeric,
        "numeric_major_ids": sorted(numeric_major_ids)
    }

# ---------------- HEADING EXTRACTION WITH LEARNED PATTERNS ----------------
def extract_md_headings(md: str, learned_patterns: Dict = None, toc_end_pos: int = None) -> List[Tuple[int, str]]:
    """
    Extract all potential headings from markdown.
    Uses learned patterns from TOC if provided, otherwise uses general patterns.

    Args:
        md: Full markdown document text
        learned_patterns: Dictionary of patterns learned from TOC
        toc_end_pos: Character position where ToC ends (headings before this are excluded)

    Returns:
        List of (position, heading_text) tuples, sorted by position
    """
    headings = []
    lines = md.splitlines()
    learned_patterns = learned_patterns or {}

    # 1. Standard markdown headings (# ## ### etc.) - always include
    pattern = re.compile(r'^(#{1,6})\s*(.+)$', re.M)
    for match in pattern.finditer(md):
        headings.append((match.start(), match.group(2).strip()))

    # 2. Bold headings at start of line (** or __) - always include
    bold_pattern = re.compile(r'^(\*\*|__)([^*_\n]{3,})(\*\*|__)$', re.M)
    for m in bold_pattern.finditer(md):
        headings.append((m.start(), m.group(2).strip()))

    # 3. Use learned patterns if available
    if learned_patterns.get('patterns'):
        for pattern_str in learned_patterns['patterns']:
            try:
                learned_pattern = re.compile(pattern_str, re.M | re.I)
                for m in learned_pattern.finditer(md):
                    heading_text = m.group(0).strip()
                    heading_text = re.sub(r'^#{1,6}\s*', '', heading_text).strip()
                    if len(heading_text.split()) <= 30:
                        headings.append((m.start(), heading_text))
            except re.error:
                # If pattern is invalid, skip it
                pass
    else:
        # Fallback: use general contract-specific patterns
        contract_pattern = re.compile(
            r'^([A-Z][A-Z\s]{2,15}?)\s+' # Any uppercase keyword (3-15 chars)
            r'([IVXivx\d]+[\.\d]*|[A-Z])\s*' # Number (numeric, roman, or letter)
            r'[\-:‚Äì]?\s*(.*)$', # Optional separator and title
            re.M
        )
        for m in contract_pattern.finditer(md):
            # Validate it looks like a heading (not random text)
            keyword = m.group(1).strip()
            if len(keyword) >= 4 and keyword == keyword.upper():
                headings.append((m.start(), m.group(0).strip()))

    # 3b. Explicit numeric heading detector for keyword-less structures
    if learned_patterns.get('numeric_decimal'):
        decimal_heading_pattern = re.compile(
            r'^(?:#{1,6}\s*)?(?<![-*]\s)(?<!\S)(\d{1,3}(?:\.\d+)+)\s+(?!\d{4}\b)([^\n]{2,120})$',
            re.M
        )
        for m in decimal_heading_pattern.finditer(md):
            title = m.group(2).strip()
            if 2 <= len(title) <= 120 and any(c.isalpha() for c in title):
                headings.append((m.start(), f"{m.group(1)} {title}"))

    # 4. Title-case headings - always include
    title_case_pattern = re.compile(r'^([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)[\s:]*$', re.M)
    for m in title_case_pattern.finditer(md):
        text = m.group(1).strip()
        if len(text) < 80 and 2 <= len(text.split()) <= 8:
            headings.append((m.start(), text))

    # 5. Refined uppercase line detection (avoid false positives)
    excluded_uppercase_patterns = [
        r'WITNESSETH', r'WHEREAS', r'NOW THEREFORE', r'IN WITNESS WHEREOF',
        r'SIGNED.*DELIVERED', r'AGREED.*ACCEPTED', r'THIS AGREEMENT',
        r'[A-Z\s]+,\s+[A-Z]{2}(\s+[A-Z0-9]|\s*$)', # addresses
        r'\d+\s+[A-Z\s]+STREET',
        r'[A-Z\s]+(STREET|AVENUE|ROAD|DRIVE|LANE|COURT|PLAZA|BUILDING)',
        r'IN THE (MATTER|COURT) OF',
        r'NOTICE TO', r'ATTENTION:', r'RE:', r'TO\s+WHOM',
        r'PLEASE NOTE', r'FOR OFFICIAL USE',
    ]
    for i, line in enumerate(lines):
        char_pos = sum(len(lines[j]) + 1 for j in range(i))
        stripped = line.strip()
        if stripped and 6 <= len(stripped) <= 120:
            if stripped == stripped.upper() and re.search(r'[A-Z]{2,}', stripped):
                excluded = False
                for exc_pattern in excluded_uppercase_patterns:
                    if re.search(exc_pattern, stripped, re.I):
                        excluded = True
                        break
                if not excluded:
                    letter_count = sum(1 for c in stripped if c.isalpha())
                    special_count = sum(1 for c in stripped if not c.isalnum() and c != ' ')
                    if letter_count >= 3 and special_count < len(stripped) * 0.3:
                        is_standalone = (i == 0 or not lines[i-1].strip())
                        has_content_after = (i+1 < len(lines) and lines[i+1].strip())
                        if is_standalone or (has_content_after and len(stripped) < 80):
                            headings.append((char_pos, stripped))

    # Remove duplicates and sort
    headings = list({pos: (pos, text) for pos, text in headings}.values())
    headings = sorted(headings, key=lambda x: x[0])
    
    # CRITICAL FIX: Exclude headings in ToC region
    if toc_end_pos is not None:
        original_count = len(headings)
        headings = [(pos, text) for pos, text in headings if pos >= toc_end_pos]
        excluded_count = original_count - len(headings)
        if excluded_count > 0:
            print(f"  üîß Excluded {excluded_count} headings in ToC region (before position {toc_end_pos})")
    
    return headings

# ---------------- MATCHING TOC ‚Üí BODY (WITH PREFIX MATCHING) ----------------
def extract_structural_prefix(text: str) -> str:
    """
    Extract the structural prefix from a TOC entry.
    Examples:
    - "LETTER #1 TANKWORK EMPLOYERS..." -> "LETTER #1"
    - "ARTICLE 1 PURPOSE" -> "ARTICLE 1"
    - "APPENDIX A WAGE SCHEDULES" -> "APPENDIX A"
    - "Memorandum of Commitment RE: BTU/REO" -> "Memorandum of Commitment"
    """
    # Pattern 1: LETTER #X or LETTER OF (full phrase)
    letter_match = re.match(r'^(LETTER\s+#?\d+|LETTER\s+OF\s+\w+(?:\s+\w+)*)', text, re.I)
    if letter_match:
        return letter_match.group(1).strip()
    # Pattern 2: APPENDIX X or APPENDIX (X)
    appendix_match = re.match(r'^(APPENDIX\s+[A-Z\d]+(?:\s+\([^)]+\))?)', text, re.I)
    if appendix_match:
        return appendix_match.group(1).strip()
    # Pattern 3: ARTICLE X[.0]
    article_match = re.match(r'^(ARTICLE\s+\w+(?:\.\d+)?)', text, re.I)
    if article_match:
        return article_match.group(1).strip()
    # Pattern 4: MEMORANDUM or ADDENDUM (may have "of" after)
    memo_match = re.match(r'^(MEMORANDUM(?:\s+OF\s+\w+)?|ADDENDUM)', text, re.I)
    if memo_match:
        return memo_match.group(1).strip()
    # Pattern 5: Numeric section (1.0, 1.1, etc.)
    numeric_match = re.match(r'^(\d+\.\d+)', text)
    if numeric_match:
        return numeric_match.group(1).strip()
    # Default: return first 50 chars
    return text[:50].strip()

def extract_article_number(text: str) -> str:
    """
    Extract and normalize article/section numbers from text for matching.
    Handles: numerics, roman numerals, word numbers, letters.
    """
    # Word numbers to digits
    word_to_num = {
        'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5',
        'six': '6', 'seven': '7', 'eight': '8', 'nine': '9', 'ten': '10',
        'eleven': '11', 'twelve': '12', 'thirteen': '13', 'fourteen': '14', 'fifteen': '15',
        'sixteen': '16', 'seventeen': '17', 'eighteen': '18', 'nineteen': '19', 'twenty': '20',
        'thirty': '30', 'forty': '40', 'fifty': '50', 'sixty': '60', 'seventy': '70',
        'eighty': '80', 'ninety': '90'
    }
    # Try word numbers
    for word, num in word_to_num.items():
        if re.search(r'\b' + word + r'\b', text, re.I):
            return num
    # Try numeric
    match = re.search(r'\b(\d+)\.?\d*\b', text)
    if match:
        return match.group(1)
    # Try roman numerals (convert to number)
    roman_match = re.search(r'\b([IVXivx]+)\b', text)
    if roman_match:
        roman = roman_match.group(1).upper()
        # Simple roman to int (only handle common cases)
        roman_map = {'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5,
                     'VI': 6, 'VII': 7, 'VIII': 8, 'IX': 9, 'X': 10}
        if roman in roman_map:
            return str(roman_map[roman])
    # Try single letter (A=1, B=2, etc.)
    letter_match = re.search(r'\b([A-Z])\b', text)
    if letter_match:
        return str(ord(letter_match.group(1)) - ord('A') + 1)
    return None

def match_toc_to_headings(toc_topics: List[str], md_headings: List[Tuple[int, str]], thresh=FUZZY_THRESH):
    """
    Match TOC entries to markdown headings using prefix matching and improved scoring.
    """
    matched = []
    used = set()
    toc_to_heading_map = []

    for toc_idx, toc in enumerate(toc_topics):
        toc_normalized = normalize_text(toc)
        toc_prefix = extract_structural_prefix(toc)

        best_idx, best_score = None, 0

        for i, (pos, head) in enumerate(md_headings):
            if i in used:
                continue
            head_normalized = normalize_text(head)

            # STRATEGY 1: Exact prefix match (highest priority)
            toc_prefix_norm = normalize_text(toc_prefix).lower()
            head_norm_lower = head_normalized.lower()

            if toc_prefix_norm in head_norm_lower or head_norm_lower in toc_prefix_norm:
                # Check if it's a strong match (not just "LETTER" matching "LETTER")
                if len(toc_prefix_norm) >= 8: # Meaningful prefix length
                    score = 100 # Perfect match
                else:
                    score = 85
            else:
                # STRATEGY 2: Fuzzy matching
                score = fuzz.token_sort_ratio(toc_normalized.lower(), head_normalized.lower())

            # Article number boost
            toc_num = extract_article_number(toc)
            head_num = extract_article_number(head)
            if toc_num and head_num:
                if toc_num == head_num:
                    score = min(100, score + 20)
                else:
                    score = max(0, score - 15)

            # Position awareness (penalize out-of-order matches)
            if toc_to_heading_map:
                last_matched_pos = md_headings[toc_to_heading_map[-1]['heading_idx']][0]
                if pos < last_matched_pos:
                    score = max(0, score - 25)

            if score > best_score:
                best_score, best_idx = score, i

        if best_idx is not None and best_score >= thresh:
            toc_to_heading_map.append({
                'toc_idx': toc_idx,
                'heading_idx': best_idx,
                'score': best_score,
                'toc_text': toc,
                'heading': md_headings[best_idx]
            })
            used.add(best_idx)

    # Sort by TOC order
    toc_to_heading_map.sort(key=lambda x: x['toc_idx'])

    # Extract matched headings in document order
    validated_matches = [m['heading'] for m in toc_to_heading_map]
    validated_matches = sorted(validated_matches, key=lambda x: x[0])
    return validated_matches

# ---------------- CHUNK BUILDER ----------------
def build_chunks(md: str, matched: List[Tuple[int, str]]) -> List[Dict]:
    chunks = []
    for i, (pos, heading) in enumerate(matched):
        start = pos
        end = matched[i + 1][0] if i + 1 < len(matched) else len(md)
        text = md[start:end].strip()
        chunks.append({"heading": heading, "text": text, "length": len(text)})
    return chunks

# ---------------- DRIVER FUNCTION ----------------
def chunk_from_toc(md_path: str):
    md = read_md(md_path)
    s, e = detect_toc_region(md)
    if not s:
        print("‚ö†Ô∏è No TOC detected in:", md_path)
        return []

    toc_lines = md.splitlines()[s:e]
    print(f"‚úÖ TOC detected from line {s} to {e}, lines: {len(toc_lines)}")

    
    # Calculate character position where ToC ends
    toc_end_pos = sum(len(md.splitlines()[i]) + 1 for i in range(e))
    print(f"  üìç ToC ends at character position {toc_end_pos}")
    
    toc_topics = parse_toc_block(toc_lines)
    print(f"Parsed {len(toc_topics)} topics from TOC")

    # Learn patterns from TOC
    learned_patterns = learn_toc_patterns(toc_topics)
    print(f"Learned patterns: keywords={learned_patterns['keywords'][:5]}, "
          f"numbering={learned_patterns['numbering_types']}")

    # Extract headings using learned patterns
    # Extract headings using learned patterns, excluding ToC region
    md_headings = extract_md_headings(md, learned_patterns, toc_end_pos=toc_end_pos)
    print(f"Detected {len(md_headings)} body headings")

    numeric_major_mode = learned_patterns.get('numeric_decimal') and not learned_patterns.get('keywords')
    matched = []

    if numeric_major_mode:
        # Collapse to top-level numeric headings (e.g., 1.0, 2.0, ...)
        major_headings = []
        seen = set()
        for pos, heading_text in md_headings:
            clean_heading = re.sub(r'^#{1,6}\s*', '', heading_text).strip()
            m = re.match(r'^(\d+)\.0\b', clean_heading)
            if m:
                num = m.group(1)
                if num not in seen:
                    seen.add(num)
                    major_headings.append((pos, clean_heading))
        if major_headings:
            matched = sorted(major_headings, key=lambda x: x[0])
            print(f"Numeric-major mode: using {len(matched)} major headings for chunking.")
        else:
            matched = match_toc_to_headings(toc_topics, md_headings)
    else:
        matched = match_toc_to_headings(toc_topics, md_headings)

    print(f"Matched {len(matched)} TOC topics to headings")

    chunks = build_chunks(md, matched)
    print(f"Created {len(chunks)} chunks")
    for c in chunks[:10]:
        print(f"---\nHEADING: {c['heading']}\nLENGTH: {c['length']}\n")
    return chunks

# ---------------- TESTING UTILITIES ----------------
def test_detect_toc(md_text: str):
    s, e = detect_toc_region(md_text)
    if s:
        print(f"TOC region: lines {s}-{e}")
        for l in md_text.splitlines()[s:e][:20]:
            print(">", l)
    else:
        print("No TOC detected.")

def test_parse_toc(md_text: str):
    s, e = detect_toc_region(md_text)
    if not s:
        print("No TOC found.")
        return
    toc_lines = md_text.splitlines()[s:e]
    parsed = parse_toc_block(toc_lines)
    print("Parsed TOC entries:\n", "\n".join(parsed[:20]))

def test_heading_extraction(md_text: str):
    heads = extract_md_headings(md_text)
    print(f"Found {len(heads)} headings:")
    for _, h in heads[:20]:
        print("-", h)


In [2]:
# TEST THE IMPROVEMENTS + PATTERN LEARNING
# This cell demonstrates the fixes for problems 3, 4, and 5 PLUS adaptive pattern learning

# Create sample markdown text to test the improvements
test_md = """
# TEST DOCUMENT

TABLE OF CONTENTS

ARTICLE ONE - PURPOSE .............................. 5
ARTICLE 2.000 - RECOGNITION AND JURISDICTION ...... 8
Article Three - Management Rights ................. 12
APPENDIX A - Wage Schedules ....................... 45
LETTER OF UNDERSTANDING - Remote Work ............. 52

---

## ARTICLE ONE - PURPOSE

This agreement establishes the terms and conditions of employment.

**Scope of Agreement**

This agreement applies to all employees covered under this bargaining unit.

## ARTICLE 2.000 - RECOGNITION AND JURISDICTION

The Company recognizes the Union as the exclusive bargaining agent.

WITNESSETH THAT WHEREAS the parties agree to the following terms.

## Article Three - Management Rights  

Management retains all rights not specifically limited by this agreement.

CALGARY, AB T2P 1J9

## APPENDIX A - Wage Schedules

Classification wage rates are as follows...

## LETTER OF UNDERSTANDING - Remote Work

The parties agree to pilot a remote work program.
"""

print("=" * 80)
print("TESTING PATTERN LEARNING FROM TOC")
print("=" * 80)

# Test TOC detection and parsing
s, e = detect_toc_region(test_md)
if s:
    print(f"\n‚úÖ TOC detected from line {s} to {e}")
    toc_lines = test_md.splitlines()[s:e]
    toc_topics = parse_toc_block(toc_lines)
    print(f"‚úÖ Parsed {len(toc_topics)} TOC topics\n")
    
    # Learn patterns from TOC
    learned_patterns = learn_toc_patterns(toc_topics)
    
    print("üìö LEARNED PATTERNS FROM TOC:")
    print(f"  Keywords discovered: {learned_patterns['keywords']}")
    print(f"  Numbering types: {learned_patterns['numbering_types']}")
    print(f"  Separators found: {learned_patterns['separators']}")
    print(f"  Dynamic regex patterns: {len(learned_patterns['patterns'])} pattern(s) generated")
    print()

print("=" * 80)
print("TESTING ADAPTIVE HEADING EXTRACTION")
print("=" * 80)

# Extract headings using learned patterns
headings = extract_md_headings(test_md, learned_patterns)
print(f"\n‚úÖ Found {len(headings)} headings using learned patterns:\n")
for pos, heading in headings:
    print(f"  - {heading}")

print("\n" + "=" * 80)
print("TESTING WITH DIFFERENT DOCUMENT FORMAT")
print("=" * 80)

# Test with a different format (Roman numerals + different keywords)
test_md2 = """
Contents

Chapter I: Introduction ........................... 1
Chapter II: Methodology ........................... 5
Chapter III: Results .............................. 12
Section A - Data Analysis ......................... 20
Section B - Discussion ............................ 25

---

## Chapter I: Introduction

This chapter introduces the research topic.

## Chapter II: Methodology

Research methods are described here.

## Chapter III: Results

Findings are presented in this chapter.

## Section A - Data Analysis

Analysis of collected data.

## Section B - Discussion

Discussion of implications.
"""

print("\nüìÑ Testing with Roman numerals and 'Chapter' keyword:\n")

s2, e2 = detect_toc_region(test_md2)
if s2:
    toc_lines2 = test_md2.splitlines()[s2:e2]
    toc_topics2 = parse_toc_block(toc_lines2)
    learned_patterns2 = learn_toc_patterns(toc_topics2)
    
    print(f"‚úÖ Learned from new document:")
    print(f"  Keywords: {learned_patterns2['keywords']}")
    print(f"  Numbering: {learned_patterns2['numbering_types']}")
    
    headings2 = extract_md_headings(test_md2, learned_patterns2)
    print(f"\n‚úÖ Found {len(headings2)} headings:")
    for pos, heading in headings2[:8]:
        print(f"  - {heading}")

print("\n" + "=" * 80)
print("KEY BENEFITS OF PATTERN LEARNING")
print("=" * 80)

print("""
‚úÖ No more hardcoded word lists (ONE, TWO, THREE... FIFTY)
‚úÖ Automatically detects document-specific keywords (ARTICLE, CHAPTER, etc.)
‚úÖ Adapts to any numbering scheme (1,2,3 or I,II,III or A,B,C or ONE,TWO,THREE)
‚úÖ Learns separator styles (dash vs colon vs nothing)
‚úÖ Works for ANY structured document with a TOC
‚úÖ Language-agnostic approach (learns from what it sees)
‚úÖ Handles mixed numbering (ARTICLE 1, APPENDIX A, etc.)

üéØ Result: Truly generalized solution that adapts to the document!
""")

print("=" * 80)
print("ALL TESTS COMPLETED")
print("=" * 80)

TESTING PATTERN LEARNING FROM TOC

‚úÖ TOC detected from line 4 to 11
‚úÖ Parsed 0 TOC topics

üìö LEARNED PATTERNS FROM TOC:
  Keywords discovered: []
  Numbering types: []
  Separators found: []
  Dynamic regex patterns: 0 pattern(s) generated

TESTING ADAPTIVE HEADING EXTRACTION

‚úÖ Found 13 headings using learned patterns:

  - # TEST DOCUMENT
  - TABLE OF CONTENTS
  - ARTICLE ONE - PURPOSE .............................. 5
  - ARTICLE 2.000 - RECOGNITION AND JURISDICTION ...... 8
  - APPENDIX A - Wage Schedules ....................... 45
  - LETTER OF UNDERSTANDING - Remote Work ............. 52
  - ## ARTICLE ONE - PURPOSE
  - Scope of Agreement
  - ## ARTICLE 2.000 - RECOGNITION AND JURISDICTION
  - WITNESSETH THAT WHEREAS the parties agree to the following terms.
  - Article Three - Management Rights
  - APPENDIX A - Wage Schedules
  - LETTER OF UNDERSTANDING - Remote Work

TESTING WITH DIFFERENT DOCUMENT FORMAT

üìÑ Testing with Roman numerals and 'Chapter' keyword:

‚úÖ Lea

In [None]:
# ==================== BATCH PROCESSING FOR ALL MARKDOWN FILES ====================
# Run this cell to process all your markdown files at once

import json
from pathlib import Path

def process_all_markdown_files(input_folder: str, output_folder: str = None):
    """
    Process all markdown files in a folder using the improved chunking with pattern learning.
    
    Args:
        input_folder: Path to folder containing .md files
        output_folder: Path to save chunked output (default: input_folder/chunked_output)
    
    Returns:
        Summary dict with statistics
    """
    input_path = Path(input_folder)
    
    # Set up output folder
    if output_folder is None:
        output_path = input_path / "/Users/swathi.gnanasekar/Documents/Vista_Vu_Project/Phase 1/Docling_Tweak/CHUNK_NEW"
    else:
        output_path = Path(output_folder)
    
    output_path.mkdir(exist_ok=True)
    
    # Find all markdown files
    md_files = list(input_path.glob("*.md"))
    
    if not md_files:
        print(f"‚ö†Ô∏è  No markdown files found in {input_folder}")
        return None
    
    print(f"Found {len(md_files)} markdown files\n")
    print("=" * 80)
    
    all_results = []
    total_chunks = 0
    
    for md_file in md_files:
        print(f"\nüìÑ Processing: {md_file.name}")
        print("-" * 80)
        
        try:
            # Read markdown
            md_content = md_file.read_text(encoding='utf-8')
            
            # Detect TOC
            s, e = detect_toc_region(md_content)
            if not s:
                print(f"  ‚ö†Ô∏è  No TOC detected - skipping")
                continue
            
            # Parse TOC
            toc_lines = md_content.splitlines()[s:e]
            toc_topics = parse_toc_block(toc_lines)
            print(f"  ‚úÖ TOC: {len(toc_topics)} entries (lines {s}-{e})")
            
            # Calculate ToC end position
            toc_end_pos = sum(len(md_content.splitlines()[i]) + 1 for i in range(e))
            
            
            # Learn patterns
            learned_patterns = learn_toc_patterns(toc_topics)
            print(f"  üìö Learned: {len(learned_patterns['keywords'])} keywords, "
                  f"{len(learned_patterns['numbering_types'])} numbering types. "
                  f"The learned keywords are: {learned_patterns['keywords']}")
            
            # Extract headings with learned patterns
            # Extract headings with learned patterns, excluding ToC region
            md_headings = extract_md_headings(md_content, learned_patterns, toc_end_pos=toc_end_pos)
            print(f"  üîç Found: {len(md_headings)} headings in document")
            
            # Match TOC to headings with numeric-major fallback
            numeric_major_mode = learned_patterns.get('numeric_decimal') and not learned_patterns.get('keywords')
            if numeric_major_mode:
                major_headings = []
                seen = set()
                for pos, heading_text in md_headings:
                    clean_heading = re.sub(r'^#{1,6}\s*', '', heading_text).strip()
                    m = re.match(r'^(\d+)\.0\b', clean_heading)
                    if m:
                        num = m.group(1)
                        if num not in seen:
                            seen.add(num)
                            major_headings.append((pos, clean_heading))
                if major_headings:
                    matched = sorted(major_headings, key=lambda x: x[0])
                    print(f"  üéØ Matched (numeric-major): {len(matched)} sections")
                else:
                    matched = match_toc_to_headings(toc_topics, md_headings)
                    print(f"  üéØ Matched (fallback): {len(matched)} sections")
            else:
                matched = match_toc_to_headings(toc_topics, md_headings)
                print(f"  üéØ Matched: {len(matched)} sections")
            
            # Build chunks
            chunks = build_chunks(md_content, matched)
            print(f"  üì¶ Created: {len(chunks)} chunks")
            
            # Save results
            doc_name = md_file.stem
            doc_output_folder = output_path / doc_name
            doc_output_folder.mkdir(exist_ok=True)
            
            # Save each chunk as separate file
            for i, chunk in enumerate(chunks, 1):
                # Create safe filename from heading
                safe_heading = re.sub(r'[^\w\s\-]', '', chunk['heading'])
                safe_heading = re.sub(r'\s+', '_', safe_heading)[:80]
                chunk_filename = f"{i:02d}_{safe_heading}.txt"
                chunk_path = doc_output_folder / chunk_filename
                
                chunk_path.write_text(chunk['text'], encoding='utf-8')
            
            # Save metadata
            metadata = {
                "source_file": md_file.name,
                "total_chunks": len(chunks),
                "toc_entries": len(toc_topics),
                "learned_keywords": learned_patterns['keywords'],
                "numbering_types": learned_patterns['numbering_types'],
                "chunks": [
                    {
                        "chunk_num": i,
                        "heading": chunk['heading'],
                        "length": chunk['length'],
                        "filename": f"{i:02d}_{re.sub(r'[^\w\s\-]', '', chunk['heading'])[:80]}.txt"
                    }
                    for i, chunk in enumerate(chunks, 1)
                ]
            }
            
            metadata_path = doc_output_folder / "metadata.json"
            with open(metadata_path, 'w', encoding='utf-8') as f:
                json.dump(metadata, f, indent=2)
            
            print(f"  üíæ Saved to: {doc_output_folder}/")
            
            all_results.append({
                "file": md_file.name,
                "chunks": len(chunks),
                "success": True
            })
            total_chunks += len(chunks)
            
        except Exception as e:
            print(f"  ‚ùå Error: {str(e)}")
            all_results.append({
                "file": md_file.name,
                "chunks": 0,
                "success": False,
                "error": str(e)
            })
    
    # Print summary
    print("\n" + "=" * 80)
    print("üìä PROCESSING SUMMARY")
    print("=" * 80)
    print(f"Total files processed: {len([r for r in all_results if r['success']])}/{len(md_files)}")
    print(f"Total chunks created: {total_chunks}")
    print(f"Output location: {output_path}/")
    print()
    
    # Save summary
    summary_path = output_path / "PROCESSING_SUMMARY.json"
    with open(summary_path, 'w', encoding='utf-8') as f:
        json.dump({
            "total_files": len(md_files),
            "successful": len([r for r in all_results if r['success']]),
            "failed": len([r for r in all_results if not r['success']]),
            "total_chunks": total_chunks,
            "results": all_results
        }, f, indent=2)
    
    print(f"üìÑ Summary saved to: {summary_path}")
    
    return all_results



# OPTION 1: Process all markdown files in the Docling_Tweak folder
input_folder = "/Users/swathi.gnanasekar/Documents/Vista_Vu_Project/Phase 1/Docling_Tweak"
results = process_all_markdown_files(input_folder)

# OPTION 2: Specify custom output folder
# results = process_all_markdown_files(input_folder, output_folder="/path/to/output")

# View results summary
if results:
    print("\nüìã Quick Results:")
    for r in results:
        status = "‚úÖ" if r['success'] else "‚ùå"
        print(f"  {status} {r['file']}: {r['chunks']} chunks")

Found 6 markdown files


üìÑ Processing: Boilermakers Collective-Agreement.md
--------------------------------------------------------------------------------
  ‚úÖ TOC: 39 entries (lines 59-108)
  üìö Learned: 3 keywords, 2 numbering types. The learned keywords are: ['APPENDIX', 'LETTER', 'ARTICLE']
  üîß Excluded 17 headings in ToC region (before position 4967)
  üîç Found: 349 headings in document
  üéØ Matched: 38 sections
  üì¶ Created: 38 chunks
  üíæ Saved to: /Users/swathi.gnanasekar/Documents/Vista_Vu_Project/Phase 1/Docling_Tweak/CHUNK_NEW/Boilermakers Collective-Agreement/

üìÑ Processing: NMA-Alberta-Province_new.md
--------------------------------------------------------------------------------
  ‚úÖ TOC: 43 entries (lines 17-64)
  üìö Learned: 6 keywords, 2 numbering types. The learned keywords are: ['IOL', 'ADMINISTRATION', 'APPENDIX', 'ALBERTA', 'ARTICLE', 'TRADE']
  üîß Excluded 3 headings in ToC region (before position 14396)
  üîç Found: 100 headings in do

## Summary of Fixes + Pattern Learning

This notebook contains fixes for three critical issues in TOC-based chunking, PLUS a generalized pattern learning system.

### Problem 3: Heading Extraction Issues ‚úÖ FIXED + GENERALIZED
- Added detection for **bold headings** (`**text**`)
- Added detection for **title-case headings** (e.g., "Section One Introduction")
- ‚ú® **NEW: Pattern Learning** - learns keywords and numbering from TOC instead of hardcoding
  - No more hardcoded lists like `ONE|TWO|THREE|...|FIFTY`
  - Automatically discovers document-specific keywords (ARTICLE, CHAPTER, SECTION, etc.)
  - Adapts to any numbering scheme (numeric, roman, letters, word numbers)
- **Fixed uppercase detection** to exclude:
  - Legal boilerplate (WITNESSETH, WHEREAS, etc.)
  - Addresses like "CALGARY, AB T2P 1J9"
  - Street addresses and common all-caps text

### Problem 4: Matching Assumptions ‚úÖ FIXED
- Switched from `partial_ratio` to **`token_sort_ratio`** for better word-order matching
- Added **article number extraction and matching** with support for:
  - Numeric (1, 2, 3)
  - Roman numerals (I, II, III)
  - Letters (A, B, C)
  - Word numbers (ONE, TWO, THREE)
- Added **position-aware scoring** (penalizes out-of-order matches)
- Added **sequence validation** (rejects matches that violate TOC order)

### Problem 5: Formatting Assumptions ‚úÖ FIXED
- Added **`normalize_text()`** function to handle:
  - Merged words (ARTICLETWO ‚Üí ARTICLE TWO)
  - Decimal normalization (1.000 ‚Üí 1.0)
  - OCR artifacts (soft hyphens, smart quotes)
- Improved **`parse_toc_block()`** to handle:
  - Multi-column TOCs
  - Roman numeral page numbers
  - Various dot-leader styles
  - **Only extracts actual TOC entries** (not body content)
- Enhanced **`detect_toc_region()`** to:
  - **Stop at markdown headings (##) and separators (---)**
  - Track consecutive non-TOC lines
  - Properly detect TOC end

### üéØ New Feature: Adaptive Pattern Learning

The `learn_toc_patterns()` function analyzes TOC entries and learns:

1. **Keywords** - Discovers what section keywords the document uses (ARTICLE, CHAPTER, PART, etc.)
2. **Numbering schemes** - Detects numeric (1,2,3), roman (I,II,III), letters (A,B,C), or word numbers (ONE,TWO,THREE)
3. **Separators** - Identifies dash (-), colon (:), or other separators used
4. **Dynamic patterns** - Builds regex patterns on-the-fly based on what it learns

**Benefits:**
- ‚úÖ Works with ANY document format (not just Alberta contracts)
- ‚úÖ Handles research papers (Chapter I, Chapter II)
- ‚úÖ Handles legal docs (Section A, Section B)
- ‚úÖ Handles technical manuals (Part 1.0, Part 2.0)
- ‚úÖ Language-agnostic approach
- ‚úÖ No maintenance needed for new document types

### Test Results
‚úÖ TOC detection now stops at line 11 (separator) instead of line 40  
‚úÖ TOC parsing extracts only 5 entries instead of 19  
‚úÖ "CALGARY, AB T2P 1J9" is no longer detected as a heading  
‚úÖ No more duplicate matches  
‚úÖ Adaptive to different document formats (tested with Roman numerals + CHAPTER keyword)  
‚úÖ Better handling of OCR errors and formatting variations