# Markdown Header-based Splitting

This notebook splits the extracted markdown content from `one_degree/one_degree_policy.md` into semantic chunks for RAG.
It uses a two-step process:
1. **Structure Splitting**: Splits by Markdown headers (#, ##, ###) to preserve document structure and capture metadata.
2. **Content Splitting**: Further splits large sections into chunks of at most 600 characters, with a 100-character overlap, so that each chunk fits within embedding model limits.

**Note:** `strip_headers=True` is used to remove headers from the content body, as they are already captured in the metadata.
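
To make the first step concrete, here is a minimal pure-Python sketch of header-based splitting. It is a simplified stand-in written for illustration, not LangChain's implementation: the helper name `split_by_headers` and the toy sample text are hypothetical, and unlike a real splitter it does not treat fenced code blocks specially. It shows how each body of text inherits the headers above it as metadata, and how the header lines themselves are dropped from the content.

```python
import re

# Hypothetical mapping that mirrors the metadata keys used in this notebook.
HEADER_KEYS = {1: "Section_Name", 2: "Chapter_Name", 3: "Clause_Name"}


def split_by_headers(text):
    """Return (metadata, body) pairs, one per header-delimited section.

    Simplified illustration only: lines starting with '#' inside fenced
    code blocks would be mistaken for headers.
    """
    sections = []
    current_meta = {}
    body_lines = []

    def flush():
        body = "\n".join(body_lines).strip()
        if body:  # a header followed directly by a sub-header has no body
            sections.append((dict(current_meta), body))
        body_lines.clear()

    for line in text.splitlines():
        match = re.match(r"^(#{1,3})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header resets its own level and every deeper level.
            for deeper in range(level, 4):
                current_meta.pop(HEADER_KEYS[deeper], None)
            current_meta[HEADER_KEYS[level]] = match.group(2).strip()
        else:
            body_lines.append(line)  # header lines are never added to the body
    flush()
    return sections


# Toy input, not taken from the policy document.
sample = """# Section A
## 1. Cover
### 1.3 First Clause
Toy body text for the first clause.
### 1.4 Second Clause
Toy body text for the second clause.
"""

sections = split_by_headers(sample)
for meta, body in sections:
    print(meta, "->", body)
```

Both toy sections carry the same `Section_Name` and `Chapter_Name` but different `Clause_Name` values, which is the property the later metadata filter relies on.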

In [20]:
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
import os

print("Libraries loaded successfully")

Libraries loaded successfully


In [21]:
# 1. Load the Markdown File
input_file = "one_degree/one_degree_policy.md"

try:
    # Read as UTF-8 explicitly so the result does not depend on the platform default
    with open(input_file, "r", encoding="utf-8") as f:
        markdown_content = f.read()
    print(f"Loaded {len(markdown_content)} characters from {input_file}")
except FileNotFoundError:
    print(f"Error: {input_file} not found. Please ensure the file exists.")
    markdown_content = ""

Loaded 46352 characters from one_degree/one_degree_policy.md


In [22]:
if markdown_content:
    # 2. Structure Splitting (MarkdownHeaderTextSplitter)
    headers_to_split_on = [
        ("#", "Section_Name"),      # Level 1 Header
        ("##", "Chapter_Name"),     # Level 2 Header
        ("###", "Clause_Name"),     # Level 3 Header
    ]

    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
        strip_headers=True  # Keep headers in metadata only; do not repeat them in the chunk content
    )

    md_header_splits = markdown_splitter.split_text(markdown_content)
    print(f"Splits by Header: {len(md_header_splits)}")

    # 3. Content Splitting (RecursiveCharacterTextSplitter)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=600,       # Target chunk size
        chunk_overlap=100,    # Overlap to preserve context
        separators=["\n\n", "\n", "。", "；", " ", ""]  # Tried in order: paragraph, line, CJK full stop, CJK semicolon, space, character
    )

    final_splits = text_splitter.split_documents(md_header_splits)
    print(f"Final Semantic Chunks: {len(final_splits)}")

Splits by Header: 81
Final Semantic Chunks: 133
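
A quick sanity check on the result can be sketched as follows. By default `RecursiveCharacterTextSplitter` measures `chunk_size` in characters, whereas embedding model limits are usually stated in tokens, so the 600-character budget is only a proxy. The helper name `find_oversized` and the toy strings are hypothetical and used only for illustration.

```python
def find_oversized(texts, limit=600):
    """Return (index, length) for every text longer than `limit` characters."""
    return [(i, len(t)) for i, t in enumerate(texts) if len(t) > limit]


# Toy strings standing in for chunk contents.
toy_chunks = ["a" * 120, "b" * 600, "c" * 750]
print(find_oversized(toy_chunks))
```

In the notebook this could be called as `find_oversized([c.page_content for c in final_splits])`. A non-empty result would point to chunks that none of the configured separators could break below the limit.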


In [None]:
# 4. Inspect Specific Results (e.g., Cancer)
if 'final_splits' in locals() and final_splits:
    print("--- Inspecting Specific Chunks (e.g. Cancer) ---")
    
    # Filter for chunks whose Clause_Name metadata contains 'Cancer'
    cancer_chunks = [c for c in final_splits if 'Cancer' in c.metadata.get('Clause_Name', '')]
    
    if cancer_chunks:
        for i, chunk in enumerate(cancer_chunks):
            print(f"\n[Cancer Chunk {i+1}]")
            print(f"Metadata: {chunk.metadata}")
            print(f"Content: {chunk.page_content[:300]}...")  # Preview the first 300 characters
    else:
        print("No chunks found with 'Cancer' in metadata.")

    print("\n--- General Sample Chunk Inspection (Last 5) ---")
    # Check the last five chunks, from the end backwards, to see deeper header nesting
    for i in range(1, 6):
        if len(final_splits) >= i:
            sample_chunk = final_splits[-i]
            print(f"\n[Last Chunk -{i}]")
            print(f"Metadata: {sample_chunk.metadata}")
            print(f"Content Preview:\n{sample_chunk.page_content[:100]}...")
else:
    print("No splits generated to inspect.")

--- Inspecting Specific Chunks (e.g. Cancer) ---

[Cancer Chunk 1]
Metadata: {'Section_Name': 'Section A: What You Get From Your Cover', 'Chapter_Name': '1. What Your Policy Covers', 'Clause_Name': '1.3 Cancer Cash Benefit'}
Content: **1.3.1** Benefit Definition  
Subject to exclusions under Your Policy, if Your Pet is diagnosed with cancer **for the first time in The Pet’s lifetime**, We will pay You an **one-off** cancer cash benefit as stated in Your Policy Schedule.  
**1.3.2** Financial Mechanics (Standalone Payment)  
* **...

[Cancer Chunk 2]
Metadata: {'Section_Name': 'Section A: What You Get From Your Cover', 'Chapter_Name': '1. What Your Policy Covers', 'Clause_Name': '1.3 Cancer Cash Benefit'}
Content: * **Non-Deduction**: The payment of this benefit shall **not** affect or reduce the available Annual Limit during that Period of Insurance.  
**1.3.3** Eligibility Criteria  
To be eligible for this cash benefit, **ALL** of the following conditions must be met:  
1.  **Waiting