Below is a set of rules for the `MarkdownChunkingStrategy` class, written to handle the structure of Markdown documents. The rules cover **chunk size management**, **content type preservation**, **structural integrity**, **duplicate prevention**, and **automatic detection of headers and footers**. They also address three specific requirements: **hard length limits**, **table splitting with full headers**, and **automatic identification of repeating headers and footers**.

# Rules
---

## **1. Chunk Size Management**

### **1.1. Minimum Length (`min_chunk_len`)**
- **Definition:** The minimum number of characters a chunk must contain.
- **Default:** `512`
- **Rule:** 
  - Ensure each chunk is at least `min_chunk_len` characters long.
  - Merge smaller chunks with adjacent ones to meet this requirement.
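
The merge rule can be sketched as follows. This is a minimal illustration, assuming chunks are plain strings and that a short chunk is folded into the chunk before it; `merge_short` is a hypothetical helper name, not a method of the class.

```python
from typing import List


def merge_short(chunks: List[str], min_chunk_len: int = 512) -> List[str]:
    """Fold chunks shorter than min_chunk_len into a neighbouring chunk."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and (len(chunk) < min_chunk_len or len(merged[-1]) < min_chunk_len):
            # Either the new chunk or the previous one is too short: join them.
            merged[-1] = merged[-1] + "\n\n" + chunk
        else:
            merged.append(chunk)
    return merged
```

Note that this sketch does not check the merged result against `hard_max_len`; a full implementation would need to do so before joining.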

### **1.2. Soft Maximum Length (`soft_max_len`)**
- **Definition:** The preferred upper limit for chunk size.
- **Default:** `1024`
- **Rule:** 
  - Aim to keep chunks below `soft_max_len`.
  - Allow a chunk to exceed this limit slightly when that is needed to keep a structural element intact, but never beyond `hard_max_len`.

### **1.3. Hard Maximum Length (`hard_max_len`)**
- **Definition:** The absolute maximum number of characters a chunk can contain.
- **Default:** `2048`
- **Rule:** 
  - Strictly enforce that no chunk exceeds `hard_max_len`.
  - If a chunk exceeds this limit, split it at logical boundaries (e.g., sentence endings, newlines) regardless of content type.
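
One way to enforce the hard limit is shown below. It is a sketch under stated assumptions: `split_hard` is a hypothetical function name, a sentence ending is approximated as `". "`, and the fallback when no boundary exists is a plain cut at the limit.

```python
from typing import List


def split_hard(text: str, hard_max_len: int = 2048) -> List[str]:
    """Split text so that no piece is longer than hard_max_len characters."""
    pieces: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + hard_max_len, len(text))
        if end < len(text):
            # Prefer the last sentence ending in the window, then the last newline.
            cut = text.rfind(". ", start, end)
            cut = cut + 1 if cut != -1 else text.rfind("\n", start, end)
            if cut > start:
                end = cut
        pieces.append(text[start:end])
        start = end  # resume exactly where the previous piece stopped
    return pieces
```

Because each piece is a contiguous slice, joining the pieces reproduces the input exactly, which makes the function easy to test.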

---

## **2. Content Type Preservation**

### **2.1. Headings**
- **Identification:** Detect headings using various Markdown syntaxes (e.g., `#`, `##`, numbered headings).
- **Rule:** 
  - Preserve all heading levels and formats.
  - When splitting, ensure that headings remain intact and appropriately formatted within chunks.

### **2.2. Tables**
- **Identification:** Detect Markdown tables by the presence of pipe characters (`|`) and separator lines (e.g., `---`).
- **Rules:** 
  - **No Split Rule:** Do not split tables across chunks unless they exceed `hard_max_len`.
  - **Splitting with Full Headers:** If a table must be split due to the hard length limit, ensure that each resulting chunk includes the full table header to maintain clarity and structure.

### **2.3. Footnotes and Citations**
- **Identification:** Detect footnote definitions (e.g., `[^1]: Footnote text`) and citations (e.g., `[@citation]`).
- **Rules:** 
  - Keep footnote markers and their corresponding definitions within the same chunk.
  - Avoid splitting footnote definitions across multiple chunks.
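
The check that markers and definitions stay together could look like the sketch below. The regular expressions and the helper name `missing_definitions` are assumptions made for illustration; they cover the `[^label]` syntax shown above and not citation syntax.

```python
import re

# Definition line such as "[^1]: Footnote text" (label is any text without spaces).
FOOTNOTE_DEF = re.compile(r"^\[\^([^\]\s]+)\]:\s+(.*)$", re.MULTILINE)
# Marker such as "[^1]" that is not immediately followed by a colon.
FOOTNOTE_REF = re.compile(r"\[\^([^\]\s]+)\](?!:)")


def missing_definitions(chunk: str) -> set:
    """Return footnote labels referenced in the chunk but not defined in it."""
    defined = {m.group(1) for m in FOOTNOTE_DEF.finditer(chunk)}
    referenced = {m.group(1) for m in FOOTNOTE_REF.finditer(chunk)}
    return referenced - defined
```

A non-empty result signals that the chunk boundary separated a marker from its definition and should be moved.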

### **2.4. Images and Links**
- **Identification:** Detect Markdown image syntax (`![Alt Text](url)`) and links (`[Text](url)`).
- **Rules:** 
  - Ensure that images and links are not split across chunks.
  - Maintain the integrity of image and link syntax within chunks.

### **2.5. Code Blocks and Inline Code**
- **Identification:** Detect fenced code blocks (e.g., ```` ```python ````) and inline code (e.g., `` `code` ``).
- **Rules:** 
  - **Code Blocks:** Do not split within a fenced code block unless the block alone exceeds `hard_max_len` (rule 1.3 takes precedence). Include both opening and closing fences within the same chunk.
  - **Inline Code:** Avoid splitting inline code snippets across chunks. Ensure that both backticks are within the same chunk.
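
A fence-aware scan for split candidates might be sketched as follows. It assumes blank lines are the candidate split points and simplifies fence matching (a closing fence only has to use the same character and be at least as long as the opener); `safe_split_lines` is a hypothetical name.

```python
import re
from typing import List

# Opening or closing fence: up to three spaces, then 3+ backticks or tildes.
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})")


def safe_split_lines(text: str) -> List[int]:
    """Return indices of blank lines that are outside any fenced code block."""
    safe: List[int] = []
    open_fence = None  # fence string that opened the current block, if any
    for i, line in enumerate(text.split("\n")):
        m = FENCE.match(line)
        if m:
            marker = m.group(1)
            if open_fence is None:
                open_fence = marker
            elif marker[0] == open_fence[0] and len(marker) >= len(open_fence):
                open_fence = None
        elif open_fence is None and not line.strip():
            safe.append(i)
    return safe
```

Blank lines inside a code block are skipped, so a splitter that only cuts at the returned indices never separates an opening fence from its closing fence.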

### **2.6. Lists**
- **Identification:** Detect ordered (`1.`, `2.`, ...) and unordered lists (`-`, `*`, `+`).
- **Rules:** 
  - Do not split lists across chunks.
  - Preserve list hierarchies and indentation levels.
  - If a list exceeds `hard_max_len`, consider splitting between major sections or logical breaks, ensuring individual list items remain intact.

### **2.7. Blockquotes**
- **Identification:** Detect blockquotes using the `>` symbol.
- **Rules:** 
  - Keep entire blockquotes within the same chunk.
  - Preserve nesting and indentation within blockquotes.

### **2.8. Embedded HTML**
- **Identification:** Detect embedded HTML tags within the Markdown.
- **Rules:** 
  - Do not split embedded HTML elements across chunks.
  - Ensure that opening and closing HTML tags are contained within the same chunk.

### **2.9. Tables of Contents (TOC)**
- **Identification:** Detect TOCs, typically represented as nested lists with links to headings.
- **Rules:** 
  - Optionally exclude TOCs from chunking or place them in a separate chunk.
  - Ensure that TOC links correspond to headings within the same or subsequent chunks.

### **2.10. YAML Front Matter**
- **Identification:** Detect YAML front matter enclosed within `---` at the beginning of the document.
- **Rules:** 
  - Keep YAML front matter intact within a single chunk, preferably the first chunk.
  - Do not split YAML front matter across multiple chunks.
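
Front matter can be separated from the body before chunking, for example as sketched here. The pattern and the name `split_front_matter` are illustrative assumptions; the sketch only recognizes a block that starts on the very first line of the document.

```python
import re
from typing import Tuple

# "---" on the first line, optional content, then a line consisting of "---".
FRONT_MATTER = re.compile(r"\A---[ \t]*\n(.*?\n)?---[ \t]*(\n|\Z)", re.DOTALL)


def split_front_matter(text: str) -> Tuple[str, str]:
    """Return (front_matter, body); front_matter is '' when none is present."""
    m = FRONT_MATTER.match(text)
    if not m:
        return "", text
    return text[: m.end()], text[m.end():]
```

The front matter string can then be emitted as part of the first chunk or stored as metadata, while only the body goes through the normal splitting rules.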

### **2.11. Emphasis and Formatting**
- **Identification:** Detect bold (`**bold**`), italics (`*italics*`), strikethrough (`~~text~~`), and other formatting.
- **Rules:** 
  - Avoid splitting emphasis markers across chunks.
  - Ensure that formatting syntax remains intact within chunks.

### **2.12. Horizontal Rules and Page Breaks**
- **Identification:** Detect horizontal rules (`---`, `***`, `___`) and page breaks.
- **Rules:** 
  - Preserve horizontal rules and page breaks within the same chunk.
  - Use them as natural split points for chunking.

---

## **3. Logical Split Points**

### **3.1. Natural Language Boundaries**
- **Rule:** 
  - Prefer splitting at sentence endings (e.g., `.` followed by a space) or newline characters (`\n`) to maintain readability and coherence.

### **3.2. Exclusion of Key Elements**
- **Rule:** 
  - Do not split within structural elements such as tables, code blocks, lists, blockquotes, images, links, and embedded HTML to preserve their functionality and appearance.

---

## **4. Markdown Integrity**

### **4.1. Consistent Formatting**
- **Rule:** 
  - Ensure that all Markdown syntax within each chunk is correctly formatted.
  - Maintain proper heading levels, list structures, and other formatting conventions.

### **4.2. Content Type Classification**
- **Rule:** 
  - Classify blocks of text (e.g., paragraphs, tables, headings) to apply appropriate formatting and prevent structural issues post-chunking.

---

## **5. Duplicate Prevention**

### **5.1. Exact Duplicates**
- **Rule:** 
  - Utilize hashing (e.g., MD5) to identify and exclude exact duplicate content blocks, ensuring each chunk contains unique information.
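
A hash-based filter for exact duplicates can be sketched as below. Collapsing whitespace before hashing is an added assumption, so that re-wrapped copies of the same text count as duplicates; `drop_exact_duplicates` is a hypothetical helper name.

```python
import hashlib
from typing import Iterable, List


def drop_exact_duplicates(blocks: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each block, comparing by content hash."""
    seen = set()
    unique: List[str] = []
    for block in blocks:
        # Assumption: collapse whitespace so re-wrapped copies hash the same.
        key = " ".join(block.split())
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(block)
    return unique
```

MD5 is used here only as a fast fingerprint, not for security; any stable hash would serve the same purpose.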

### **5.2. Fuzzy Duplicates**
- **Rule:** 
  - (Optional) Implement fuzzy duplicate detection using similarity metrics to identify and manage near-duplicate content.
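
One possible similarity metric is the ratio from the standard library's `difflib.SequenceMatcher`. The sketch below assumes a threshold on a 0 to 1 scale and compares each block against every kept block, which is quadratic and only suitable for small inputs; `is_near_duplicate` is a hypothetical name.

```python
from difflib import SequenceMatcher
from typing import List


def is_near_duplicate(block: str, kept: List[str], threshold: float = 0.95) -> bool:
    """True if block is at least `threshold` similar to any already-kept block."""
    return any(
        SequenceMatcher(None, block, other).ratio() >= threshold for other in kept
    )
```

For larger corpora, a fingerprinting scheme would likely be a better fit than pairwise comparison.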

---

## **6. Header and Footer Management**

### **6.1. Automatic Identification of Headers and Footers**
- **Rule:** 
  - Automatically detect and remove repeating headers and footers without relying on predefined templates.
  - Use heuristics such as consistent phrases, patterns (e.g., page numbers), and positional information (e.g., top and bottom lines) to identify headers and footers.

### **6.2. Removal of Identified Headers and Footers**
- **Rule:** 
  - Exclude detected headers and footers from chunk content to avoid redundancy and irrelevant information within chunks.
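
The positional heuristic can be sketched as follows, assuming the input is a list of page texts and that a header or footer is the first or last non-empty line of a page. The function names and the `min_share` parameter are illustrative assumptions.

```python
from collections import Counter
from typing import List, Set


def find_repeating_lines(pages: List[str], min_share: float = 0.6) -> Set[str]:
    """Return first/last lines that recur on at least min_share of the pages."""
    counts: Counter = Counter()
    for page in pages:
        lines = [ln.strip() for ln in page.split("\n") if ln.strip()]
        if lines:
            # Count each candidate once per page, even if first == last.
            counts.update({lines[0], lines[-1]})
    needed = max(2, int(min_share * len(pages)))
    return {line for line, n in counts.items() if n >= needed}


def strip_repeating_lines(page: str, repeating: Set[str]) -> str:
    """Drop every line of the page whose stripped text is a known header/footer."""
    return "\n".join(ln for ln in page.split("\n") if ln.strip() not in repeating)
```

Lines that vary per page, such as "Page 3 of 10", are not caught by exact matching; replacing digits with a placeholder before counting is one way to extend the sketch.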

---

## **7. Table Splitting with Full Headers**

### **7.1. Handling Oversized Tables**
- **Rule:** 
  - If a table exceeds `hard_max_len`, split it into smaller tables.
  - Ensure that each split table includes the full header row to maintain context and readability.

### **7.2. Preserving Table Integrity**
- **Rule:** 
  - Maintain proper Markdown table syntax within each split to ensure tables render correctly.
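
Splitting with repeated headers can be sketched as below. The sketch assumes the first two lines of the table are the header row and the separator row, and it never breaks a single row, so one very long row can still exceed the limit; `split_table` is a hypothetical name.

```python
from typing import List


def split_table(table: str, hard_max_len: int = 2048) -> List[str]:
    """Split a Markdown table into pieces that each repeat the header rows."""
    lines = table.strip().split("\n")
    header, rows = lines[:2], lines[2:]  # header row + separator row
    header_text = "\n".join(header)
    pieces: List[str] = []
    current: List[str] = []
    size = len(header_text)
    for row in rows:
        # +1 accounts for the newline that joins the row to the piece.
        if current and size + 1 + len(row) > hard_max_len:
            pieces.append("\n".join(header + current))
            current, size = [], len(header_text)
        current.append(row)
        size += 1 + len(row)
    if current or not pieces:
        pieces.append("\n".join(header + current))
    return pieces
```

Each returned piece is a complete table in its own right, so it renders correctly even when read in isolation.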

---

## **8. Additional Enhancements**

### **8.1. Embedded Media Handling**
- **Rule:** 
  - Place large media embeds (e.g., videos, high-resolution images) in separate chunks to prevent inflating the size of other content.

### **8.2. Cross-Chunk References**
- **Rule:** 
  - Maintain functional references between chunks (e.g., links pointing to sections in other chunks).
  - Include identifiers or anchors in links that correspond to target chunks.

### **8.3. Metadata Preservation**
- **Rule:** 
  - Retain essential metadata (e.g., titles, authors, dates) within chunks or alongside them for processing and rendering purposes.

### **8.4. Accessibility Considerations**
- **Rule:** 
  - Preserve accessibility features such as alt text for images, proper heading hierarchies, and descriptive link texts within chunks.

### **8.5. Handling Custom Markdown Extensions**
- **Rule:** 
  - Detect and preserve syntax introduced by Markdown extensions or plugins (e.g., task lists, diagrams, LaTeX equations).
  - Ensure that extended syntax is not split across chunks to maintain functionality.

### **8.6. Consistent Line Endings and Whitespace**
- **Rule:** 
  - Normalize line endings and manage whitespace to ensure consistent formatting across chunks.
  - Remove excessive trailing spaces or blank lines that could disrupt Markdown structure.
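
A simple normalization pass might look like this. It is a sketch with two known limitations: it strips trailing spaces everywhere, which also removes Markdown's two-space hard line breaks, and it does not skip fenced code blocks. `normalize_whitespace` is a hypothetical name.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Normalize line endings, strip trailing spaces, and cap blank-line runs."""
    # Convert Windows (CRLF) and old Mac (CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Remove spaces and tabs at the end of every line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # End the text with exactly one newline.
    return text.strip("\n") + "\n"
```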

### **8.7. Error Detection and Recovery**
- **Rule:** 
  - Implement syntax validation within each chunk to identify and rectify potential Markdown errors.
  - Provide fallback mechanisms to handle or skip malformed sections without halting the entire chunking process.

### **8.8. Concurrency and Performance Optimization**
- **Rule:** 
  - Utilize parallel processing (e.g., `ProcessPoolExecutor`) to handle multiple pages or large documents efficiently.
  - Ensure that the chunking process scales effectively with document size.

### **8.9. Testing and Validation**
- **Rule:** 
  - Develop comprehensive test cases covering various Markdown structures and edge cases.
  - Regularly validate that chunks render correctly across different Markdown parsers and platforms.

### **8.10. Customizable Chunking Rules**
- **Rule:** 
  - Allow users to define or adjust chunking behavior based on specific requirements or preferences.
  - Enable rule extensions to accommodate additional or customized chunking strategies.

---

## **9. Summary of Comprehensive Rules**

### **9.1. Chunk Size Constraints**
- **Always** adhere to `hard_max_len`, never exceeding it under any circumstances.
- **Aim** to keep chunks between `min_chunk_len` and `soft_max_len`.
- **Merge** smaller chunks to meet `min_chunk_len`.
- **Split** oversized chunks at logical boundaries, ensuring no chunk exceeds `hard_max_len`.

### **9.2. Structural Integrity**
- **Preserve** all Markdown structural elements (headings, tables, lists, code blocks, etc.) within chunks.
- **Never split** critical elements unless absolutely necessary (as per `hard_max_len`).
- **Ensure** that any split tables include full headers for clarity.

### **9.3. Automatic Header and Footer Detection**
- **Identify** repeating headers and footers automatically using heuristics (patterns, consistent phrases, positions).
- **Remove** detected headers and footers from chunk content to avoid redundancy.

### **9.4. Markdown Syntax Preservation**
- **Maintain** proper Markdown formatting within each chunk.
- **Avoid breaking** Markdown syntax across chunks to ensure correct rendering.

### **9.5. Duplicate Content Management**
- **Detect and exclude** exact duplicates using hashing techniques.
- **Optionally** implement fuzzy duplicate detection for near-duplicate content.

### **9.6. Metadata and References**
- **Preserve** essential metadata within or alongside chunks.
- **Maintain** functional cross-chunk references and links.

### **9.7. Accessibility and Usability**
- **Ensure** accessibility features are intact within chunks.
- **Preserve** logical heading hierarchies to support assistive technologies.

### **9.8. Performance and Scalability**
- **Optimize** chunking processes for efficiency, especially for large documents.
- **Leverage** concurrency to enhance performance.

### **9.9. Customization and Flexibility**
- **Allow** users to customize chunking behavior based on specific needs.
- **Support** multiple Markdown flavors and custom extensions.

### **9.10. Robust Error Handling**
- **Implement** mechanisms to detect and recover from malformed Markdown.
- **Log** errors with sufficient context for troubleshooting.

---

## **Implementation Considerations**

To effectively implement these comprehensive rules within the `MarkdownChunkingStrategy` class, consider the following strategies:

1. **Enhanced Pattern Recognition:**
   - Utilize advanced regular expressions and heuristics to automatically detect headers, footers, and other structural elements without relying on predefined templates.

2. **Context-Aware Splitting:**
   - Develop logic that is aware of the current parsing context (e.g., inside a table, code block) to prevent inappropriate splits.

3. **Dynamic Header Insertion:**
   - When splitting tables, dynamically insert header rows into each new table chunk to maintain clarity and structure.

4. **Concurrency Optimization:**
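   - Implement parallel processing judiciously, ensuring that worker processes do not rely on shared mutable state (for example, the set of hashes used for duplicate detection) and that resources are used efficiently.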

5. **Comprehensive Testing:**
   - Create extensive test suites that cover a wide range of Markdown features and edge cases to validate the robustness of the chunking strategy.

6. **Configurability:**
   - Design the class to accept configurable parameters and extension points, allowing users to tailor the chunking behavior to their specific needs.

7. **Logging and Reporting:**
   - Incorporate detailed logging to track the chunking process, identify issues, and provide insights for debugging and optimization.

---

## **Conclusion**

With these rules in place, the `MarkdownChunkingStrategy` class is intended to respect **chunk size constraints** strictly, maintain **structural integrity**, identify and remove **headers and footers** automatically, and keep **Markdown syntax** unbroken across chunks. The rules also address **performance optimization**, **accessibility**, and **flexibility**, so the strategy can serve a wide range of Markdown processing applications.

Implementing these rules will result in well-organized, coherent, and structurally sound chunks, facilitating efficient processing, rendering, and analysis of Markdown documents.

In [None]:
import sys
import json
from pathlib import Path
from IPython.display import Markdown, display

In [3]:
# kph/kph/document_processing/chunking/chunker.py

import re
import hashlib
from typing import List, Set, Tuple, Optional, Generator
from abc import ABC, abstractmethod
from unstructured.documents.elements import Element
from unstructured.chunking.title import chunk_by_title
from concurrent.futures import ProcessPoolExecutor, as_completed
import logging

# Module-level logger used by the chunking classes below.
logger = logging.getLogger(__name__)


class ChunkingStrategy(ABC):
    @abstractmethod
    def chunk(self, items: List[Element]) -> List[Tuple[int, List[int], str]]:
        """
        Splits a list of elements into chunks and returns a list of tuples.
        Args:
            items (List[Element]): The list of elements to be chunked.
        Returns:
            List[Tuple[int, List[int], str]]: A list of tuples where each tuple contains:
            - An integer representing the chunk number.
            - A list of integers representing the page numbers of the chunk.
            - A string representing the content of the chunk.
        """

        pass


class DocumentChunker:
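    def __init__(self, config: dict):
        # NOTE: `config` is read with attribute access below (config.chunking.strategy),
        # so it must be an attribute-style config object rather than a plain dict.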
        self.config = config
        strategy_name = self.config.chunking.strategy
        if strategy_name == "by_page":
            self.strategy = PageChunkingStrategy()
            logger.info(f"Using {strategy_name} chunking strategy")
        elif strategy_name == "by_character":
            max_len = self.config.chunking.by_character.max_len
            overlap = self.config.chunking.by_character.overlap
            self.strategy = CharachterChunkingStrategy(max_len, overlap)
            logger.info(f"Using {strategy_name} chunking strategy")
        elif strategy_name == "by_title":
            self.strategy = TitleChunkingStrategy(
                combine_text_under_n_chars=self.config.chunking.by_title.combine_text_under_n_chars,
                include_orig_elements=self.config.chunking.by_title.include_orig_elements,
                max_characters=self.config.chunking.by_title.max_characters,
                multipage_sections=self.config.chunking.by_title.multipage_sections,
                new_after_n_chars=self.config.chunking.by_title.new_after_n_chars,
                overlap=self.config.chunking.by_title.overlap,
                overlap_all=self.config.chunking.by_title.overlap_all,
            )
            logger.info(f"Using {strategy_name} chunking strategy")
        elif strategy_name == "markdown":
            self.strategy = MarkdownChunkingStrategy(
                min_chunk_len=self.config.chunking.markdown.min_chunk_len,
                soft_max_len=self.config.chunking.markdown.soft_max_len,
                hard_max_len=self.config.chunking.markdown.hard_max_len,
            )
            logger.info(f"Using {strategy_name} chunking strategy")
        elif strategy_name == "custom":
            self.strategy = CustomChunkingStrategy()
            logger.info(f"Using {strategy_name} chunking strategy")
        else:
            raise ValueError(f"Unknown chunking strategy: {strategy_name}")

    def chunk(self, items: List[Element]) -> List[Tuple[int, List[int], str]]:
        return self.strategy.chunk(items)


class TitleChunkingStrategy(ChunkingStrategy):
    def __init__(
        self,
        combine_text_under_n_chars: Optional[int] = None,
        include_orig_elements: Optional[
            bool
        ] = True,  # Default to True for accurate page tracking
        max_characters: Optional[int] = None,
        multipage_sections: Optional[bool] = True,  # Default to True
        new_after_n_chars: Optional[int] = None,
        overlap: Optional[int] = None,
        overlap_all: Optional[bool] = None,
    ):
        self.combine_text_under_n_chars = combine_text_under_n_chars
        self.include_orig_elements = include_orig_elements
        self.max_characters = max_characters
        self.multipage_sections = multipage_sections
        self.new_after_n_chars = new_after_n_chars
        self.overlap = overlap
        self.overlap_all = overlap_all

    def get_item_text(self, item: Element) -> str:
        if "unstructured.documents.elements.Table" in str(type(item)):
            return item.metadata.text_as_html
        else:
            return item.text

    def chunk(self, items: List[Element]) -> List[Tuple[int, List[int], str]]:
        # Use the chunk_by_title function to chunk the elements based on their title.
        chunked_elements = chunk_by_title(
            elements=items,
            combine_text_under_n_chars=self.combine_text_under_n_chars,
            include_orig_elements=self.include_orig_elements,
            max_characters=self.max_characters,
            multipage_sections=self.multipage_sections,
            new_after_n_chars=self.new_after_n_chars,
            overlap=self.overlap,
            overlap_all=self.overlap_all,
        )

        chunks = []
        chunk_no = 1

        for chunk_element in chunked_elements:
            content = self.get_item_text(chunk_element)

            # Get the last page number from the chunk's metadata
            last_chunk_page_number = getattr(
                chunk_element.metadata, "page_number", None
            )

            # Initialize page_numbers list
            page_numbers = []

            if self.include_orig_elements:
                # Extract original elements and find their page numbers
                orig_elements = getattr(chunk_element.metadata, "orig_elements", [])
                page_numbers = [
                    getattr(elem.metadata, "page_number", None)
                    for elem in orig_elements
                    if getattr(elem.metadata, "page_number", None) is not None
                ]

                if page_numbers:
                    # Get unique sorted page numbers
                    page_numbers = sorted(set(page_numbers))
                else:
                    # If no page numbers found in original elements, fallback
                    if last_chunk_page_number is not None:
                        page_numbers = [last_chunk_page_number]
            else:
                # If orig_elements not included, use last_chunk_page_number
                if last_chunk_page_number is not None:
                    page_numbers = [last_chunk_page_number]

            # Append the chunk number, page number list, and content
            chunks.append((chunk_no, page_numbers, content))

            # Increment the chunk number
            chunk_no += 1

        return chunks


class CustomChunkingStrategy(ChunkingStrategy):
    def chunk(self, items: List[Element]) -> List[Tuple[int, List[int], str]]:
        logger.info("Using custom chunking strategy from plugin")
        chunks = plugin_manager.hook.chunk_elements(elements=items)
        if chunks is not None:
            return chunks
        else:
            raise ValueError("No plugin provided chunks via 'chunk_elements' hook.")


class MarkdownChunkingStrategy:
    def __init__(
        self,
        min_chunk_len: int = 512,
        soft_max_len: int = 1024,
        hard_max_len: int = 2048,
        known_header_footer_phrase: str = "YOUR KNOWN HEADER FOOTER PHRASE",
    ):
        self.min_chunk_len = min_chunk_len
        self.soft_max_len = soft_max_len
        self.hard_max_len = hard_max_len
        self.known_header_footer_phrase = known_header_footer_phrase

        self.heading_patterns = [
            re.compile(r"^([0-9]+\s*[-–.]\s+.*|[IVXLC]+\s*[-–.]\s+.*)"),
            re.compile(r"^[-=~]{3,}\s*$"),
            re.compile(r"^#{1,6}\s+.*"),
        ]

        self.footnote_definition_pattern = re.compile(r"^\d+\.\s+.*")
        self.footnote_marker_pattern = re.compile(r"\[\d+\]|[⁰¹²³⁴⁵⁶⁷⁸⁹]")
        self.image_pattern = re.compile(r"!\[.*\]\(.*\)")
        self.link_pattern = re.compile(r"\[.*?\]\((https?://|www\.).*?\)")
        self.page_number_pattern = re.compile(r"^\s*\d+\s*$")
        self.header_footer_pattern = re.compile(
            rf"^.*{re.escape(self.known_header_footer_phrase)}.*$", re.IGNORECASE
        )

        self.existing_hashes: Set[str] = set()

    def is_duplicate_exact(self, new_block: str) -> bool:
        block_hash = hashlib.md5(new_block.encode("utf-8")).hexdigest()
        if block_hash in self.existing_hashes:
            return True
        self.existing_hashes.add(block_hash)
        return False

    def is_duplicate_fuzzy(self, new_block: str, threshold: float = 95.0) -> bool:
        # Not implemented
        return False

    def is_table(self, block: str) -> bool:
        lines = block.strip().split("\n")
        if len(lines) < 2:
            return False

        has_pipes = any("|" in line for line in lines)
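        # Check for a Markdown table separator line. Spaces between cells are
        # allowed (e.g. "| --- | :-: |") and at least one dash is required.
        has_separator = any(
            re.fullmatch(r"[\s|:-]*-[\s|:-]*", line.strip()) for line in lines
        )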
        return has_pipes and has_separator

    def is_heading(self, block: str) -> bool:
        block = block.strip()
        for pattern in self.heading_patterns:
            if pattern.match(block):
                return True
        return False

    def is_footnote_definition(self, block: str) -> bool:
        return bool(self.footnote_definition_pattern.match(block.strip()))

    def is_footnote_marker(self, block: str) -> bool:
        return bool(self.footnote_marker_pattern.search(block))

    def is_image(self, block: str) -> bool:
        return bool(self.image_pattern.search(block))

    def is_link(self, block: str) -> bool:
        return bool(self.link_pattern.search(block))

    def remove_headers_footers(self, text: str) -> str:
        lines = text.split("\n")
        cleaned_lines = []
        for line in lines:
            if self.page_number_pattern.match(line):
                continue
            if self.header_footer_pattern.match(line):
                continue
            if "Navigation menu" in line or "Table of Contents" in line:
                continue
            cleaned_lines.append(line)
        return "\n".join(cleaned_lines)

    def classify_block(self, block: str) -> str:
        block = block.strip()
        if self.is_table(block):
            return "table"
        elif self.is_footnote_definition(block):
            return "footnote_definition"
        elif self.is_footnote_marker(block):
            return "footnote_marker"
        elif self.is_heading(block):
            return "heading"
        elif self.is_image(block):
            return "image"
        elif self.is_link(block):
            return "link"
        else:
            return "paragraph"

    def split_blocks(self, text: str) -> List[Tuple[str, str]]:
        paragraphs = re.split(r"\n\s*\n", text.strip())
        result_blocks = []
        for para in paragraphs:
            if not para.strip():
                continue
            block_type = self.classify_block(para)
            result_blocks.append((para.strip(), block_type))
        return result_blocks

    def find_split_positions(self, text: str, max_len: int) -> List[int]:
        positions = []
        start = 0
        while start < len(text):
            end = min(start + max_len, len(text))
            if end == len(text):
                positions.append(len(text))
                break
            else:
                # Find a suitable split point
                split_pos = text.rfind(". ", start, end)
                if split_pos == -1 or split_pos <= start:
                    split_pos = text.rfind("\n", start, end)
                if split_pos == -1 or split_pos <= start:
                    split_pos = end
                else:
                    split_pos += 1
                positions.append(split_pos)
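                # Resume exactly at the split point; skipping a character here
                # would let the next segment grow to max_len + 1.
                start = split_pos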
        return positions

    def format_markdown(self, block: str) -> str:
        if self.is_image(block) or self.is_link(block):
            return block.strip()
        if self.is_heading(block):
            if block.strip().startswith("#"):
                return block.strip()
            else:
                return f"## {block.strip()}"
        if self.is_footnote_definition(block):
            return f"_{block.strip()}_"
        if self.is_footnote_marker(block):
            return block.strip()
        if self.is_table(block):
            return self.format_markdown_table(block)
        return block.strip()

    def format_markdown_table(self, block: str) -> str:
        return block.strip()

    def split_large_table(self, table_text: str) -> List[str]:
        """
        Previously used for splitting large tables. Now we do not chunk tables at all.
        Always return the table as-is.
        """
        formatted_table = self.format_markdown_table(table_text)
        return [formatted_table]

    def merge_short_chunks(
        self, chunks: List[Tuple[int, List[int], str]]
    ) -> List[Tuple[int, List[int], str]]:
        merged_chunks = []
        buffer = None

        for chunk in chunks:
            chunk_no, pages, content = chunk
            content_length = len(content)
            if content_length < self.min_chunk_len:
                if buffer is None:
                    buffer = chunk
                else:
                    merged_content = buffer[2] + "\n\n" + content
                    merged_pages = sorted(set(buffer[1] + pages))
                    merged_chunks.append((buffer[0], merged_pages, merged_content))
                    buffer = None
            else:
                if buffer is not None:
                    merged_content = buffer[2] + "\n\n" + content
                    merged_pages = sorted(set(buffer[1] + pages))
                    merged_chunks.append((buffer[0], merged_pages, merged_content))
                    buffer = None
                else:
                    merged_chunks.append(chunk)

        if buffer is not None:
            merged_chunks.append(buffer)

        final_chunks = []
        for idx, chunk in enumerate(merged_chunks, start=1):
            final_chunks.append((idx, chunk[1], chunk[2]))
        return final_chunks

    def get_elements(
        self, pages: List[dict]
    ) -> List[Tuple[str, int, int, List[int], str]]:
        elements = []
        position = 0
        consecutive_pages = []
        assigned_pages = set()

        for page in pages:
            try:
                content = page["page_text"]
                page_number = page["page_no"]
            except KeyError as e:
                print(f"Missing key in page data: {e}")
                continue

            if page_number in assigned_pages:
                continue

            content = self.remove_headers_footers(content)
            content_length = len(content)

            if self.min_chunk_len <= content_length <= self.soft_max_len:
                consecutive_pages.append((content, page_number))
            else:
                if consecutive_pages:
                    combined_content = "\n\n".join([cp[0] for cp in consecutive_pages])
                    combined_pages = [cp[1] for cp in consecutive_pages]
                    elements.append(
                        (
                            combined_content,
                            position,
                            position + len(combined_content),
                            combined_pages,
                            "combined_pages",
                        )
                    )
                    position += len(combined_content) + 1
                    assigned_pages.update(combined_pages)
                    consecutive_pages = []

                blocks = self.split_blocks(content)
                for block_text, block_type in blocks:
                    if self.is_duplicate_exact(block_text):
                        continue
                    block_start = position
                    block_end = position + len(block_text)
                    elements.append(
                        (block_text, block_start, block_end, [page_number], block_type)
                    )
                    position = block_end + 1
                    assigned_pages.add(page_number)

        if consecutive_pages:
            combined_content = "\n\n".join([cp[0] for cp in consecutive_pages])
            combined_pages = [cp[1] for cp in consecutive_pages]
            elements.append(
                (
                    combined_content,
                    position,
                    position + len(combined_content),
                    combined_pages,
                    "combined_pages",
                )
            )
            position += len(combined_content) + 1
            assigned_pages.update(combined_pages)

        return elements

    def chunk_elements(
        self, elements: List[Tuple[str, int, int, List[int], str]]
    ) -> List[Tuple[int, List[int], str]]:
        # Tables are no longer split, so every element (table or not) is passed
        # through unchanged; split_large_table is not used here.
        final_elements = list(elements)

        chunks = []
        current_chunk = ""
        current_chunk_length = 0
        current_chunk_pages: Set[int] = set()
        chunk_no = 1
        i = 0
        n = len(final_elements)

        def finalize_chunk():
            nonlocal chunks, current_chunk, current_chunk_length, current_chunk_pages, chunk_no
            if current_chunk.strip():
                chunks.append(
                    (chunk_no, sorted(current_chunk_pages), current_chunk.strip())
                )
                chunk_no += 1
            current_chunk = ""
            current_chunk_length = 0
            current_chunk_pages = set()

        while i < n:
            block, _, _, block_pages, block_type = final_elements[i]
            block_length = len(block)
            formatted_block = self.format_markdown(block)

            # If this is a table, we ignore soft/hard max lengths and treat it as a single chunk.
            if block_type == "table":
                # Finalize current chunk first
                finalize_chunk()
                # Place the entire table in a single chunk
                current_chunk = formatted_block + "\n\n"
                current_chunk_length = block_length + 2
                current_chunk_pages = set(block_pages)
                # Immediately finalize this chunk
                finalize_chunk()
                i += 1
                continue

            # For non-table content:
            if block_length > self.hard_max_len:
                # A non-table block that exceeds hard_max_len is still split,
                # because only tables are exempt from chunking.
                split_positions = self.find_split_positions(block, self.hard_max_len)
                start_pos = 0
                # Append len(block) so the text after the last split position is
                # kept; this is a no-op if find_split_positions already ends there.
                for pos in list(split_positions) + [len(block)]:
                    split_text = block[start_pos:pos].strip()
                    if split_text:
                        formatted_split = self.format_markdown(split_text)
                        if (
                            current_chunk_length + len(split_text) + 2
                            > self.soft_max_len
                        ):
                            # Flush before starting a new chunk so the current
                            # content is never overwritten; chunks that end up
                            # too short are handled later by merge_short_chunks.
                            finalize_chunk()
                        current_chunk += formatted_split + "\n\n"
                        current_chunk_length += len(split_text) + 2
                        current_chunk_pages.update(block_pages)
                    start_pos = pos + 1
                i += 1
                continue

            # If the block does not fit in the current chunk, flush the chunk
            # first. Flushing unconditionally avoids silently discarding a
            # chunk shorter than min_chunk_len; short chunks are merged later
            # by merge_short_chunks.
            if current_chunk_length + block_length + 2 > self.soft_max_len:
                finalize_chunk()
            # Append the block to the (possibly freshly emptied) chunk.
            current_chunk += formatted_block + "\n\n"
            current_chunk_length += block_length + 2
            current_chunk_pages.update(block_pages)
            i += 1

        # finalize last chunk
        finalize_chunk()

        return chunks

    def chunk(self, pages: List[dict]) -> List[Tuple[int, List[int], str]]:
        elements = self.get_elements(pages)
        chunks = self.chunk_elements(elements)
        chunks = self.merge_short_chunks(chunks)
        return chunks

    def process_page(self, page: dict) -> List[Tuple[str, int, int, List[int], str]]:
        elements = []
        position = 0
        try:
            content = page["page_text"]
            page_number = page["page_no"]
        except KeyError as e:
            print(f"Missing key in page data: {e}")
            return elements

        content = self.remove_headers_footers(content)
        content_length = len(content)

        if self.min_chunk_len <= content_length <= self.soft_max_len:
            elements.append(
                (
                    content,
                    position,
                    position + len(content),
                    [page_number],
                    "combined_pages",
                )
            )
            position += len(content) + 1
        else:
            blocks = self.split_blocks(content)
            for block_text, block_type in blocks:
                if self.is_duplicate_exact(block_text):
                    continue
                block_start = position
                block_end = position + len(block_text)
                elements.append(
                    (block_text, block_start, block_end, [page_number], block_type)
                )
                position = block_end + 1

        return elements

    def get_elements_parallel(
        self, pages: List[dict]
    ) -> List[Tuple[str, int, int, List[int], str]]:
        elements = []
        with ProcessPoolExecutor() as executor:
            futures = [executor.submit(self.process_page, page) for page in pages]
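            # Collect results in submission order so elements stay in page order;
            # as_completed would yield them in whatever order the workers finish.
            for future in futures:
                page_elements = future.result()
                elements.extend(page_elements)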
        return elements
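
The greedy packing idea behind `chunk_elements` can be looked at in isolation. The sketch below is a simplified stand-in, not the class's own code: `pack_blocks` is a hypothetical helper that ignores tables, the hard limit, page tracking and Markdown formatting, and it assumes the chunk is flushed whenever the next block would push it past `soft_max_len`, so no block is ever dropped.

```python
from typing import List


def pack_blocks(blocks: List[str], soft_max_len: int = 1024) -> List[str]:
    """Greedily pack blocks into chunks without dropping any block.

    Hypothetical helper for illustration only; not part of the class above.
    """
    chunks: List[str] = []
    current = ""
    for block in blocks:
        candidate_len = len(current) + len(block) + 2  # +2 for the "\n\n" separator
        if current and candidate_len > soft_max_len:
            # The block does not fit: flush what we have before starting over.
            chunks.append(current.strip())
            current = ""
        current += block + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks


blocks = ["a" * 300, "b" * 900, "c" * 100]
packed = pack_blocks(blocks, soft_max_len=1024)
print([len(c) for c in packed])  # prints [300, 1002]
```

The first chunk here is shorter than a typical `min_chunk_len`; in this sketch that is left to a later merge pass rather than handled while packing.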


In [8]:
file_path = "/home/saeed/code/personal/rag/outputs/db-re-sonnet/processed/parsed/audi-report-2020-desktop.parsed.json"
with open(file_path, "r") as f:
    doc = json.load(f)

# display(Markdown(doc["doc_md_text"]))

# Select two pages: list indices 10 and 11 (page_no 11 and 12)
page = doc["pages"][10: 12]
page

[{'page_no': 11,
  'page_text': '# Page 29-30 Content Extraction\n\n## Left Page (29)\n\n### Motorsport is and will remain an established part of the Audi strategy\n\nAudi resolves to enter the Dakar Rally for the first time ever in 2022 with a prototype. The vehicle\'s drive concept combines an electric powertrain with a high-voltage battery, which can be charged as needed via an energy converter in the form of a highly efficient TFSI engine. In the future, marathon rally races will spearhead our Audi works team\'s motorsport activities.\n\n[Timeline showing months from Jan to Dec with Nov highlighted]\n\n[Image description: A dark silhouette of an Audi racing vehicle with red accent lighting, captioned "Audi is taking on one of the biggest challenges there is in motorsport: Dakar Rally 2022"]\n\n## Right Page (30)\n\n### Audi increases budget for electric mobility through 2025\n\nWith its investment planning for the next five years, the company is accelerating the transformation to b

In [9]:
# Instantiate the strategy
strategy = MarkdownChunkingStrategy(min_chunk_len=512, soft_max_len=1024, hard_max_len=2048)

# Get the chunks
chunks = strategy.chunk(page)

# Print min, max, and avg chunk lengths
print(f"Number of chunks: {len(chunks)}")
print(f"Min chunk length: {min(len(chunk) for _, _, chunk in chunks)}")
print(f"Max chunk length: {max(len(chunk) for _, _, chunk in chunks)}")
print(f"Average chunk length: {sum(len(chunk) for _, _, chunk in chunks) // len(chunks)}")

# Print the chunk with max length
max_chunk_len = max(len(chunk) for _, _, chunk in chunks)
max_chunk = next((chunk for _, _, chunk in chunks if len(chunk) == max_chunk_len), None)
# print(f"Chunk with max length: {max_chunk}")

Number of chunks: 6
Min chunk length: 262
Max chunk length: 977
Average chunk length: 769


In [10]:
# Display the chunks
for chunk_no, page_numbers, content in chunks[:100]:
    print(f"Chunk {chunk_no}, Chunk Len: {len(content)} , Pages: {page_numbers}")
    # print(content)
    # print("-" * 40)
    display(Markdown(content))
    print("-" * 40)

Chunk 1, Chunk Len: 842 , Pages: [11]


# Page 29-30 Content Extraction

## Left Page (29)

### Motorsport is and will remain an established part of the Audi strategy

Audi resolves to enter the Dakar Rally for the first time ever in 2022 with a prototype. The vehicle's drive concept combines an electric powertrain with a high-voltage battery, which can be charged as needed via an energy converter in the form of a highly efficient TFSI engine. In the future, marathon rally races will spearhead our Audi works team's motorsport activities.

[Timeline showing months from Jan to Dec with Nov highlighted]

[Image description: A dark silhouette of an Audi racing vehicle with red accent lighting, captioned "Audi is taking on one of the biggest challenges there is in motorsport: Dakar Rally 2022"]

## Right Page (30)

### Audi increases budget for electric mobility through 2025

----------------------------------------
Chunk 2, Chunk Len: 977 , Pages: [11, 12]


With its investment planning for the next five years, the company is accelerating the transformation to becoming a provider of connected and sustainable premium mobility. A total amount of around EUR 35 billion will ensure that despite the difficult economic environment, upfront expenditures will remain at a high level particularly with respect to future vehicle projects. Around EUR 17 billion of the investments are allocated to electric mobility, hybridization and digitalization.

[Image description: Close-up detail shot of a black Audi vehicle showing the illuminated Audi rings logo and carbon fiber detailing, with text overlay "Audi is pursuing a product initiative with the clear focus on electric mobility" and a technical chart showing "Audi e-tron GT quattro combined electric power consumption in kWh/100 km: 19.6–18.8 (NEDC); combined CO2 emissions in g/km: 0"]

# How is Audi shaping the FUTURE

# Brief portrait

At a quick glance:
Group performance in 2020.

----------------------------------------
Chunk 3, Chunk Len: 955 , Pages: [12]


The Audi Group, with its brands Audi, Lamborghini and Ducati, is one of the most successful manufacturers of automobiles and motorcycles in the premium and supercar segment.

Audi has been a fully owned subsidiary of the Volkswagen Group since November 16, 2020. Until this time, the latter held around 99.64 percent of the share capital of AUDI AG.

In 2020, the Audi Group delivered 1,692,773 (1,845,573)¹ cars of the Audi brand, 7,430 (8,205) sports cars of the Lamborghini brand and 48,042 (53,183) motorcycles of the Ducati brand to customers.

86,860 (90,640) people were working for the company all over the world as of December 31, 2020; 59,817 (62,377) of them in Germany.

Audi (headquarters: Ingolstadt) is present in more than 100 markets worldwide and produced at 18 sites¹ in 12 countries in 2020.

The current overview of sites for 2021 can be found → here.

[Map showing Sites in Europe with detailed information about two main locations:]

----------------------------------------
Chunk 4, Chunk Len: 798 , Pages: [12]


Ingolstadt, Germany
AUDI AG
With 338,055 (441,608) cars built in 2020, the headquarters in Ingolstadt is the second largest production site in the Audi Group. The plant in the heart of Bavaria is not only a production facility, but is also home to the Audi Group head office and Technical Development. Audi has 43,142 (44,458) employees here, making it the region's largest employer. Measures have already been implemented at the Ingolstadt site to prevent 70 percent of the CO₂ emissions that would otherwise have been produced.

Audi models produced at the site:
- Q2,
- SQ2, A3 Sedan,
- A3 Sportback, S3 Sedan,
- S3 Sportback,
- RS 3 Sportback,
- RS 3 Sedan, A4 Avant,
- A4 Sedan, S4 Sedan,
- S4 Avant, RS 4 Avant,
- A5 Coupé, A5 Sportback,
- S5 Coupé, S5 Sportback,
- RS 5 Coupé, RS 5 Sportback

----------------------------------------
Chunk 5, Chunk Len: 785 , Pages: [12]


Neckarsulm, Germany
AUDI AG, Audi Sport GmbH
The Audi Neckarsulm site has a long tradition and the most diverse range of products. In 2020, 157,230 (177,209) cars rolled off the production lines here. Audi Sport GmbH (formerly quattro GmbH) has had its headquarters here since 1983. Various measures in place at the site currently result in around 70 percent of all CO₂ emissions otherwise produced here. The company produces the Audi e-tron GT quattro, the RS e-tron GT and Audi Sport models at Böllinger Höfe at the Neckarsulm site.

Audi models produced at the site:
- A4 Sedan,
- A5 Cabriolet, S5 Cabriolet,
- A6 Avant, A6 Sedan,
- A6 allroad, S6 Sedan,
- S6 Avant, A7 Sportback,
- S7 Sportback, RS7 Sportback,
- A8, A8 L, S8, S8 L,
- R8 Coupé, R8 Spyder,
- e-tron GT, RS e-tron GT

----------------------------------------
Chunk 6, Chunk Len: 262 , Pages: [12]


¹ The figures in brackets represent the respective prior-year figures
² Since 2020, the Audi Brussels site has also been included in emissions → see page 51 ff. The direct, first and second-level models are not declared specifically
³ Status of December 31, 2020

----------------------------------------
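
In this run, Chunk 6 (262 characters) is shorter than the configured `min_chunk_len` of 512. A small check like the one below can flag such cases automatically. It is a sketch under assumptions: `find_out_of_bounds` is a hypothetical helper, and the sample data are placeholder strings with the chunk lengths reported above, not the real chunk text or page numbers.

```python
from typing import List, Tuple

# (chunk_no, page_numbers, content), the tuple shape returned by chunk()
Chunk = Tuple[int, List[int], str]


def find_out_of_bounds(
    chunks: List[Chunk], min_len: int = 512, hard_max: int = 2048
) -> Tuple[List[int], List[int]]:
    """Return the chunk numbers that are too short and those that are too long."""
    too_short = [no for no, _, text in chunks if len(text) < min_len]
    too_long = [no for no, _, text in chunks if len(text) > hard_max]
    return too_short, too_long


# Placeholder chunks with the lengths from the output above (pages are dummies).
lengths = [842, 977, 955, 798, 785, 262]
sample = [(i, [0], "x" * n) for i, n in enumerate(lengths, start=1)]

too_short, too_long = find_out_of_bounds(sample)
print(too_short, too_long)  # prints [6] []
```

With real data, the call would be `find_out_of_bounds(chunks, strategy.min_chunk_len, strategy.hard_max_len)`.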
