# Create seed data

## Overview

This notebook combines chunks of a source document with in-context learning (ICL) fields to create a `seed_data.jsonl` file for the [knowledge generation notebook](../03_Knowledge_Generation/Knowledge_Generation.ipynb).

## Prerequisites

- The URL of the source document.
- A snippet of the source document that is approximately 500 tokens in size. This snippet is used as the `icl_document` in the following code.


## Install dependencies

In [None]:
!pip install -qqU .

## Set up paths and directories

In [None]:
from pathlib import Path

WORKSPACE = Path.cwd().parent  # Path to the workspace directory

OUTPUT_DIR = WORKSPACE / "output" / "step_02"

OUTPUT_DIR.mkdir(
    parents=True, exist_ok=True
)  # Create output directory if it does not exist


DOCLING_OUTPUT_DIR = OUTPUT_DIR / "docling_output"
DOCLING_OUTPUT_DIR.mkdir(
    parents=True, exist_ok=True
)  # Create docling output directory if it does not exist


## Generate Docling document

Convert the source document into Markdown format by using Docling. In this example, the source document is a guidance webpage on the Financial Transactions and Reports Analysis Centre of Canada (FINTRAC) website, which the code labels `BMO_data`.

For documentation on supported file types and usage, see [the Docling documentation](https://docling-project.github.io/docling/usage/supported_formats/).


#### Source document:
- [FINTRAC guidance webpage](https://fintrac-canafe.canada.ca/guidance-directives/client-clientele/Guide11/11-eng)
- 🚨 [Terms and Conditions](https://www.canada.ca/en/transparency/terms.html)


In [None]:
import glob

from docling.document_converter import DocumentConverter

WEB_URLS = [
    (
        "BMO_data",
        "https://fintrac-canafe.canada.ca/guidance-directives/client-clientele/Guide11/11-eng",
    )
]

converter = DocumentConverter()

for name, url in WEB_URLS:
    result = converter.convert(url)
    result.document.save_as_markdown(f"{DOCLING_OUTPUT_DIR}/{name}.md")


print(
    f"Number of md files in {DOCLING_OUTPUT_DIR}: ",
    len(glob.glob(f"{DOCLING_OUTPUT_DIR}/*.md")),
)

## Load the converted document

In [None]:
with open(glob.glob(f"{DOCLING_OUTPUT_DIR}/*.md")[0]) as f:
    text = f.read()

## Utility functions

In [None]:
import json
from typing import List

from markdown_it import MarkdownIt


def chunk_markdown(text: str, max_tokens: int = 200, overlap: int = 50) -> List[str]:
    """
    Splits Markdown text into chunks at block-level elements
    (headings, paragraphs, lists, tables, code, blockquotes).
    Adds overlap (in words) between all consecutive chunks.

    Args:
        text: The markdown text to be chunked
        max_tokens: Maximum number of words per chunk
        overlap: Number of overlapping words between consecutive chunks

    Returns:
        List of text chunks with specified overlap
    """

    # Initialize the Markdown parser to understand the document structure
    md = MarkdownIt()
    tokens = md.parse(text)

    # To ensure that you do not split the text in the middle of headings or lists,
    # group tokens into block-level segments to preserve the Markdown structure
    blocks = []
    buf = []
    for tok in tokens:
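        if tok.block and tok.type.endswith("_open"):
            # Flush content from self-contained blocks (for example, fenced code)
            # so that it is not discarded when the next block opens
            if buf:
                blocks.append("\n".join(buf).strip())
            buf = []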
        elif tok.block and tok.type.endswith("_close"):
            if buf:
                blocks.append("\n".join(buf).strip())
                buf = []
        elif tok.content:
            buf.append(tok.content)
    if buf:
        blocks.append("\n".join(buf).strip())

    # Split blocks into chunks with overlap to maintain context continuity
    chunks = []
    current_words = []
    for block in blocks:
        words = block.split()
        for w in words:
            current_words.append(w)
            if len(current_words) >= max_tokens:
                # Emit a complete chunk
                chunks.append(" ".join(current_words))
                # Prepare next buffer with overlap from the end of this chunk
                # to ensure context continuity between chunks
                current_words = current_words[-overlap:] if overlap > 0 else []

    # Add any remaining words as the final chunk
    if current_words:
        chunks.append(" ".join(current_words))

    return chunks


def save_chunks_to_jsonl(chunks, filename):
    """
    Save a list of strings to a JSONL file where each line is a JSON object
    with the key 'chunk'. Returns the path to the saved file.

    Args:
        chunks (list of str): List of text chunks to save.
        filename (str): Path to the output .jsonl file (string or Path).

    Returns:
        pathlib.Path: Path to the saved file.
    """
    path = Path(filename)
    with path.open("w", encoding="utf-8") as f:
        for chunk in chunks:
            json_line = json.dumps({"chunk": chunk}, ensure_ascii=False)
            f.write(json_line + "\n")
    print(f"Saved {len(chunks)} chunks to {path}")
    return path

## Chunk Markdown

The `chunk_markdown` utility function chunks the Markdown file into smaller pieces.

Run the following code to split the Markdown text into chunks of at most `max_tokens` words, where consecutive chunks share `overlap` words:

In [None]:
chunks = chunk_markdown(text, max_tokens=5000, overlap=1000)
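To see how `max_tokens` and `overlap` interact, the following minimal sketch reproduces only the word-window step of `chunk_markdown` on a short list of words. The `window_words` helper is hypothetical and is not part of this notebook; it skips Markdown parsing entirely.

```python
from typing import List


def window_words(words: List[str], max_words: int, overlap: int) -> List[str]:
    """Hypothetical helper: the windowing rule of chunk_markdown, without Markdown parsing."""
    chunks = []
    current = []
    for w in words:
        current.append(w)
        if len(current) >= max_words:
            # Emit a full chunk
            chunks.append(" ".join(current))
            # Carry the last `overlap` words into the next chunk
            current = current[-overlap:] if overlap > 0 else []
    # Emit whatever is left as the final chunk
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = window_words([str(n) for n in range(10)], max_words=4, overlap=1)
print(demo)
```

In this run, each chunk starts with the last word of the previous chunk, and the final chunk contains only the carried-over word. Keep `overlap` smaller than `max_tokens`, because otherwise the carried-over words alone already fill the next chunk.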

## (Optional) Save chunks to an intermediate chunks.jsonl file

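You can save the chunks to an intermediate `chunks.jsonl` file and use this file to tweak the chunks before you create the seed dataset. The code in the Load chunks section reads `chunks.jsonl` from the output directory, so skip this step only if that file already exists.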

In [None]:
chunks_path = save_chunks_to_jsonl(chunks, f"{OUTPUT_DIR}/chunks.jsonl")
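If you edit `chunks.jsonl` by hand, it can help to read the file back and confirm that every line still parses. The following sketch assumes the one-object-per-line format that `save_chunks_to_jsonl` writes; the `load_chunks_from_jsonl` helper is hypothetical and is not part of this notebook.

```python
import json
from pathlib import Path
from typing import List


def load_chunks_from_jsonl(filename) -> List[str]:
    """Hypothetical helper: read a file that has one {"chunk": ...} object per line."""
    path = Path(filename)
    chunks = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # Skip blank lines
                # Raises an error if the line is not valid JSON or has no "chunk" key
                chunks.append(json.loads(line)["chunk"])
    return chunks
```

For example, `load_chunks_from_jsonl(chunks_path)` should return the same list of strings that you saved, unless you changed the file in between.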

## (Optional) Review size of chunks

For this example, each chunk should contain 6,000 to 8,000 tokens. Run the following code to print a warning for every chunk, other than the final chunk, that falls outside this range. If any chunks are outside the range, merge or split them until they fit.

In [None]:
import tiktoken

min_tokens = 6000
max_tokens = 8000
enc = tiktoken.get_encoding("cl100k_base")  # Create the encoder once and reuse it

for i, chunk in enumerate(chunks, start=1):
    token_count = len(enc.encode(chunk))
    is_final_chunk = i == len(chunks)
    # Warn about every chunk outside the range, except the final chunk
    if not is_final_chunk and not (min_tokens <= token_count <= max_tokens):
        print(
            f"\033[31mWARNING: Chunk {i} ({chunk[:30]} ... {chunk[-30:]}) {token_count} tokens\033[0m"
        )
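One possible way to merge undersized chunks is a greedy pass that appends each chunk to the previous one while the previous one is still too small. The following sketch is an illustration only: the `merge_small_chunks` and `count_words` helpers are hypothetical, the default size measure counts words rather than tokens, and the sketch does not enforce an upper limit.

```python
from typing import Callable, List


def count_words(chunk: str) -> int:
    """Hypothetical default size measure: whitespace-separated words."""
    return len(chunk.split())


def merge_small_chunks(
    chunks: List[str],
    min_size: int,
    size_of: Callable[[str], int] = count_words,
) -> List[str]:
    """Hypothetical helper: merge a chunk into its predecessor while the predecessor is below min_size."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and size_of(merged[-1]) < min_size:
            # The previous chunk is still too small, so extend it
            merged[-1] = merged[-1] + "\n\n" + chunk
        else:
            merged.append(chunk)
    return merged
```

To measure size in tokens instead of words, you could pass a counter that is based on the encoder from the preceding cell, for example `size_of=lambda c: len(enc.encode(c))`. If the chunks were created with overlap, the merged text repeats the overlapping words, so review the result before you use it.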

## Load chunks

In [None]:
from datasets import load_dataset

chunks_files = [f"{OUTPUT_DIR}/chunks.jsonl"]

# Load the dataset from the JSON file
chunks = (
    load_dataset("json", data_files=chunks_files)
    .rename_columns({"chunk": "document"})
    .select_columns("document")
)
# chunks is a DatasetDict. By default, the dataset for the chunks is put in the "train" split in the DatasetDict
chunks = chunks["train"]

## Set ICL fields

The seed data requires the following fields:
   - `document_outline`: A concise title or summary that accurately represents the entire document. For documents that cover multiple themes, provide an outline for each section.
   - `domain`: The domain or subject area of the document.
   - `icl_document`: A representative sample of approximately 500 tokens that is extracted from the document. The sample can include paragraphs, bulleted lists, tables, code snippets, and definitions.
   - `icl_query_1`, `icl_query_2`, `icl_query_3`: Three questions that are based on the `icl_document` sample.

In [None]:
document_outline = "FINTRAC's compliance guidance"

domain = "Finance"

icl_document = """You can use a Canadian credit file as one of the two pieces of information required to verify the identity of a person under the dual-process method. Specifically, you can use it to confirm the person's name and address, name and date of birth, or to confirm the person's name and confirm that the person has a credit card account or a loan account. If you use a credit file as one of the information pieces for the dual-process method, it must have existed for at least six months.

You must use a second source, for example, a property tax assessment, to confirm the second category of information. In this instance, the two reliable sources are the Canadian credit bureau that provided the credit file information and the municipal government that issued the property tax assessment. The information from these two sources must match the information provided by the person.

You can also refer to information from a Canadian credit bureau if it acts as an aggregator that compiles information from different reliable sources (often referred to as tradelines). In this instance, the Canadian credit bureau must provide information from **two** independent tradelines, where each tradeline confirms one of the two categories of information required to verify the identity of a person under this method. In this instance, **each tradeline is a distinct source; the credit bureau is not the source**.
"""

icl_query_1 = "What specific information from a Canadian credit file can I use to verify a person's identity under the dual-process method?"
icl_query_2 = "What are the requirements for the second source of information when using a credit file as one of the two pieces for identity verification?"
icl_query_3 = "When a Canadian credit bureau acts as an aggregator of information from multiple tradelines, what conditions must be met for the information to satisfy the dual-process method requirements?"


icl = {
    "document_outline": document_outline,
    "icl_document": icl_document,
    "icl_query_1": icl_query_1,
    "icl_query_2": icl_query_2,
    "icl_query_3": icl_query_3,
    "domain": domain,
}
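Before you map the ICL fields onto the chunks, you might want to confirm that none of the required fields is missing or blank. The following sketch checks the six field names that are listed in this section; the `find_missing_icl_fields` helper and the `REQUIRED_ICL_FIELDS` constant are hypothetical and are not part of this notebook.

```python
from typing import Dict, List

# The field names come from the list of required fields in this section
REQUIRED_ICL_FIELDS = (
    "document_outline",
    "domain",
    "icl_document",
    "icl_query_1",
    "icl_query_2",
    "icl_query_3",
)


def find_missing_icl_fields(icl: Dict[str, str]) -> List[str]:
    """Hypothetical helper: return the required fields that are absent, not strings, or blank."""
    return [
        name
        for name in REQUIRED_ICL_FIELDS
        if not isinstance(icl.get(name), str) or not icl[name].strip()
    ]
```

An empty list from `find_missing_icl_fields(icl)` means that every required field holds non-blank text.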

## Map ICL fields to document chunks and write the `seed_data.jsonl` file

In [None]:
# Map the ICL fields to each document chunk (if you want to use the same ICL for all, as shown here)
seed_data = chunks.map(lambda x: icl)

# Save the seed data to a JSONL file 
seed_data.to_json(f"{OUTPUT_DIR}/seed_data.jsonl", orient="records", lines=True)

The `seed_data.jsonl` file is now ready for you to use in the next step of the knowledge tuning example workflow.
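To illustrate the shape of each record, the following plain-Python sketch attaches the same ICL fields to every chunk, which is what the mapping step does with the `datasets` library. The `build_seed_records` helper and the sample values are hypothetical; the real file should contain one such JSON object per line, with all six ICL fields.

```python
import json
from typing import Dict, List


def build_seed_records(documents: List[str], icl: Dict[str, str]) -> List[Dict[str, str]]:
    """Hypothetical helper: attach the same ICL fields to every document chunk."""
    return [{"document": doc, **icl} for doc in documents]


# Sample values for illustration only; a real record carries all six ICL fields
records = build_seed_records(
    ["chunk one", "chunk two"],
    {"domain": "Finance", "icl_query_1": "Example question?"},
)
for record in records:
    print(json.dumps(record, ensure_ascii=False))
```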

## Next step

- Open the [Knowledge Generation](../03_Knowledge_Generation/Knowledge_Generation.ipynb) notebook.