# Document Preprocessing for Knowledge Tuning

## Overview

This notebook demonstrates a complete document preprocessing pipeline that prepares seed data for **knowledge tuning** with sdg-hub.

## What This Notebook Does

This preprocessing pipeline transforms raw documents (PDFs, Word docs, etc.) into seed data for synthetic data generation:

1. **Document Parsing**: Converts raw documents to structured markdown format
2. **Chunking**: Splits documents into manageable chunks while preserving structure and context
3. **Seed Data Creation**: Formats chunks with in-context learning (ICL) templates for effective knowledge tuning
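
To make the chunking step concrete, here is a minimal sketch of overlapping word windows on a toy input. `toy_chunks` is a hypothetical helper written for illustration only; it is not the `chunk_markdown` function defined later in this notebook, although it follows the same idea of repeating the last few words of one chunk at the start of the next.

```python
from typing import List


def toy_chunks(words: List[str], size: int, overlap: int) -> List[List[str]]:
    """Hypothetical helper: fixed-size word windows that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


words = "a b c d e f g h".split()
for chunk in toy_chunks(words, size=4, overlap=2):
    print(" ".join(chunk))
# a b c d
# c d e f
# e f g h
```

Each chunk starts with the last two words of the previous one, which is how context is carried across chunk boundaries.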

## Prerequisites

- The existing InstructLab document parser (`docparser_v2.py`) and its parsing configuration (`docling_v2_config.yaml`)
- Raw PDF documents in the `document_collection/` directory
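
Before running the pipeline, it can help to confirm that these prerequisites are in place. The sketch below is one possible pre-flight check, assuming the relative paths used in Step 1 (`../instructlab/...`) and PDF files with a `.pdf` extension; `missing_prerequisites` is a hypothetical helper, so adjust the paths to your checkout.

```python
from pathlib import Path
from typing import List


def missing_prerequisites(notebook_dir: str = ".") -> List[str]:
    """Hypothetical helper: describe each prerequisite that cannot be found."""
    base = Path(notebook_dir)
    problems = []
    # Relative paths mirror the ones passed to the parser in Step 1
    for rel in ("../instructlab/docparser_v2.py",
                "../instructlab/docling_v2_config.yaml"):
        if not (base / rel).is_file():
            problems.append(f"missing file: {rel}")
    # Assumption: source documents are PDFs with a lowercase .pdf extension
    docs_dir = base / "document_collection"
    pdfs = list(docs_dir.glob("*.pdf")) if docs_dir.is_dir() else []
    if not pdfs:
        problems.append("no PDF files in document_collection/")
    return problems


for problem in missing_prerequisites():
    print(problem)
```

An empty result means the parser script, its configuration, and at least one PDF were found.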


In [None]:
# Step 1: Document Processing Pipeline
# Define the directory containing raw documents to be processed
data_dir = "document_collection/"

# Run the document parser to convert the documents to markdown
# --input-dir: Directory containing the source documents
# --output-dir: Directory where the processed markdown files are saved
#               (here the same directory as the input)
# -c: Configuration file specifying the parsing parameters
!python ../instructlab/docparser_v2.py --input-dir {data_dir} --output-dir {data_dir} -c ../instructlab/docling_v2_config.yaml

In [None]:
# Step 2: Install Required Dependencies
# Install packages needed for document processing and text chunking.
# In a fresh environment, run this cell before Step 1, since the parsing step relies on docling.

%pip install docling markdown-it-py
%pip install --upgrade transformers

In [None]:
# Step 3: Load Processed Document
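import glob
import os

# The docling step above writes markdown for every PDF in data_dir.
# This example loads only the first file (sorted by name so the choice is reproducible).
md_files = sorted(glob.glob(os.path.join(data_dir, "*.md")))
if not md_files:
    raise FileNotFoundError(f"No markdown files found in {data_dir}; run Step 1 first")
with open(md_files[0], "r", encoding="utf-8") as f:
    text = f.read()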

In [None]:
# Step 4: Text Chunking and Dataset Creation

from markdown_it import MarkdownIt
from typing import List
import datasets


def chunk_markdown(text: str, max_tokens: int = 200, overlap: int = 50) -> List[str]:
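    """
    Splits Markdown text into word-count-based chunks.

    The text is first parsed into block-level elements (headings, paragraphs,
    lists, tables, code, blockquotes) and their text content is concatenated.
    Chunks are then cut every `max_tokens` words, so a chunk boundary can fall
    inside a block. Consecutive chunks share `overlap` words.

    Args:
        text: The markdown text to be chunked
        max_tokens: Maximum number of whitespace-separated words per chunk
            (words, not model tokens)
        overlap: Number of words repeated at the start of the next chunk

    Returns:
        List of text chunks with the specified overlap
    """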

    # Initialize markdown parser to understand document structure
    md = MarkdownIt()
    tokens = md.parse(text)

    # Group token content into block-level segments.
    # Container tokens (*_open / *_close) mark block boundaries; text lives in
    # "inline" tokens, and code lives in standalone "fence"/"code_block" tokens.
    blocks = []
    buf = []
    for tok in tokens:
        if tok.block and tok.type.endswith(("_open", "_close")):
            # Flush at every boundary so standalone blocks (e.g. fenced code)
            # are not discarded when the next block opens
            if buf:
                blocks.append("\n".join(buf).strip())
                buf = []
        elif tok.content:
            buf.append(tok.content)
    if buf:
        blocks.append("\n".join(buf).strip())

    # Guard against an overlap that would stop the window from advancing
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")

    # Split blocks into chunks with overlap to maintain context continuity
    chunks = []
    current_words = []
    for block in blocks:
        for w in block.split():
            current_words.append(w)
            if len(current_words) >= max_tokens:
                # Emit a complete chunk
                chunks.append(" ".join(current_words))
                # Seed the next chunk with the last `overlap` words of this one
                current_words = current_words[-overlap:] if overlap > 0 else []

    # Add the remaining words as the final chunk, unless they are only the
    # overlap carried over from the previous chunk (nothing new to emit)
    if current_words and (not chunks or len(current_words) > overlap):
        chunks.append(" ".join(current_words))

    return chunks


chunks = chunk_markdown(text, max_tokens=5000, overlap=1000)


# Prepare seed data for the SDG-Hub knowledge pipeline.
#
# The seed data requires the following fields:
#   - document_outline: A concise title or summary that accurately represents the entire document.
#     For documents covering multiple themes, consider providing multiple outlines (one per section).
#   - icl_document: A representative sample extract from the document. This may include tables, code snippets, definitions, etc.
#   - icl_query_1, icl_query_2, icl_query_3: Three questions based on the icl_document sample.
#   - domain: The domain or subject area of the document.
#
# The code below creates a Hugging Face Dataset from the document chunks,
# adds the ICL fields above to every entry, and saves the result as a JSONL file.

seed_data = datasets.Dataset.from_dict({"document": chunks})

icl = {
    "document_outline": "The document contains excerpts from FINTRAC regulations designed to combat money laundering and terrorist financing in Canada",
    "icl_document": "## Overview\n\nThis guidance came into effect on June 1, 2021.\n\n\nThis guidance explains the methods that can be used by reporting entities\n(REs) to verify the identity of a person or an entity.\n\n\n## 1. Meaning of verifying the identity of a person or an entity\n\nIt means to use the methods described in this guidance to ensure that the\ninformation in an identification document or from other informational\nsources matches the information that the person or entity provided.\n\n\nVerifying identity is a foundational element of Canada's anti-money\nlaundering and anti-terrorist financing regime and a key component of an\nRE's relationship with clients. It helps you to know your clients and to\nunderstand and assess any risk that may be associated to their\ntransactions or activities.\n\n\n## 2. How to verify the identity of a person\n\nYou can use any of the 5 methods described below to identify a person:\n\n- 2.1 Government-issued photo identification method\n\n- 2.2 Credit file method\n\n- 2.3 Dual-process method\n\n- 2.4 Affiliate or member method\n\n- 2.5 Reliance method\n",
    "icl_query_1": "In Canada, what are the methods for verifying someone's identity?",
    "icl_query_2": "In Canada, why is it important to confirm a client's identity?",
    "icl_query_3": "In Canada, can I use the reliance method to verify the identity of a person?",
    "domain": "Finance",
}

# Attach the same ICL fields to every document chunk.
# To use different ICL examples per chunk, return a chunk-specific dict from the mapping function instead.
seed_data = seed_data.map(lambda x: icl)

# Save the seed data to a JSONL file for downstream use
seed_data.to_json("seed_data.jsonl", orient="records", lines=True)

## Next Steps

- The `seed_data.jsonl` file is now ready for the knowledge tuning pipeline.
- Continue with the [knowledge generation](knowledge_generation.ipynb) notebook.
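
Before moving on, a quick sanity check of the saved file can catch missing fields early. The sketch below assumes the field names described in this notebook (`document` plus the ICL fields); `find_incomplete_records` is a hypothetical helper and not part of sdg-hub.

```python
import json
from pathlib import Path
from typing import Dict, List

# Field names taken from the seed data description in this notebook
REQUIRED_FIELDS = (
    "document", "document_outline", "icl_document",
    "icl_query_1", "icl_query_2", "icl_query_3", "domain",
)


def find_incomplete_records(lines: List[str]) -> Dict[int, List[str]]:
    """Map 1-based line number -> required fields that are missing or empty."""
    problems = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
        if missing:
            problems[number] = missing
    return problems


seed_path = Path("seed_data.jsonl")
if seed_path.exists():
    with seed_path.open(encoding="utf-8") as f:
        print(find_incomplete_records(f.readlines()))
```

An empty dictionary means every record carries all of the listed fields.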