# Seed Data Creation

## Overview

This notebook takes chunks of a source document and combines them with in-context learning (ICL) fields to create a `seed_data.jsonl` file for the [knowledge generation notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/knoweldge_e2e/examples/knowledge_tuning/enhanced_summary_knowledge_tuning/knowledge_generation.ipynb).

## Prerequisites

- Markdown (.md) file(s) of the source document.
- A snippet of the source document of roughly 500 tokens. It is used as the `icl_document` below.

## Install Required Dependencies

In [None]:
!pip install -qq datasets tiktoken docling markdown-it-py

## Setup Paths and Directories

In [None]:
from pathlib import Path


WORKSPACE = Path.cwd().parent  # Path to the workspace directory

OUTPUT_DIR = WORKSPACE / "output" / "step_01"

OUTPUT_DIR.mkdir(
    parents=True, exist_ok=True
)  # Create output directory if it doesn't exist


DOCLING_OUTPUT_DIR = OUTPUT_DIR / "docling_output"
DOCLING_OUTPUT_DIR.mkdir(
    parents=True, exist_ok=True
)  # Create docling output directory if it doesn't exist

## Generate Docling Document

Convert the source document (PDF, DOCX, HTML, etc.) into markdown format using Docling.


This example walks through converting the BMO webpage from a URL to markdown using Docling.
For supported file types and usage, see the [Docling documentation](https://docling.readthedocs.io/en/latest/).


#### Data Description
- Source Document: [BMO Webpage](https://fintrac-canafe.canada.ca/guidance-directives/client-clientele/Guide11/11-eng)
    - 🚨 [Terms and Conditions](https://www.canada.ca/en/transparency/terms.html)


NOTE: If you already have the docling markdown file(s) of the source document, you can skip this step.

In [None]:
import glob
from docling.document_converter import DocumentConverter

# Each entry is (output file name without extension, source URL)
WEB_URLS = [
    ("BMO_data", "https://fintrac-canafe.canada.ca/guidance-directives/client-clientele/Guide11/11-eng"),
]

converter = DocumentConverter()

# Convert each URL and save the result as <name>.md in the Docling output directory
for name, url in WEB_URLS:
    result = converter.convert(url)
    result.document.save_as_markdown(f"{DOCLING_OUTPUT_DIR}/{name}.md")


print(f"Number of md files in {DOCLING_OUTPUT_DIR}: ", len(glob.glob(f'{DOCLING_OUTPUT_DIR}/*.md')))

## Load Converted Document

In [None]:
# If you are starting from a Docling JSON file instead of markdown, the following lines convert Docling JSON -> .md
# converter = DocumentConverter()
# result = converter.convert("document_collection/ibm-annual-report/ibm-annual-report-2024.json")
# result.document.save_as_markdown("document_collection/ibm-annual-report/ibm-annual-report-2024.md")
# print("Markdown saved to document_collection/ibm-annual-report/ibm-annual-report-2024.md")


# The Docling step above writes one markdown file per source to DOCLING_OUTPUT_DIR.
# This example reads only the first file found (sorted for a deterministic choice).
with open(sorted(glob.glob(f'{DOCLING_OUTPUT_DIR}/*.md'))[0], 'r', encoding='utf-8') as f:
    text = f.read()
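
The cell above reads only one markdown file. If the Docling step produced several files, a minimal sketch along these lines could load each of them separately. The helper name `load_markdown_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Dict


def load_markdown_files(directory) -> Dict[str, str]:
    """Return {file stem: file text} for every .md file in `directory`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    texts = {}
    # sorted() keeps the iteration order deterministic across runs
    for md_path in sorted(Path(directory).glob("*.md")):
        texts[md_path.stem] = md_path.read_text(encoding="utf-8")
    return texts
```

With the names used in this notebook, `load_markdown_files(DOCLING_OUTPUT_DIR)["BMO_data"]` would hold the text of the converted example page.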

## Utility Functions

In [None]:
import json
from typing import List

from markdown_it import MarkdownIt


def chunk_markdown(
    text: str,
    max_tokens: int = 200,
    overlap: int = 50
) -> List[str]:
    """
    Extracts the text of block-level Markdown elements (headings,
    paragraphs, lists, tables, code, blockquotes) and groups it into chunks.
    Sizes are measured in whitespace-separated words, not model tokens.
    Adds overlap (in words) between all consecutive chunks.

    Args:
        text: The markdown text to be chunked
        max_tokens: Maximum number of words per chunk
        overlap: Number of overlapping words between consecutive chunks;
            should be smaller than max_tokens

    Returns:
        List of text chunks with specified overlap
    """

    # Initialize markdown parser to understand document structure
    md = MarkdownIt()
    tokens = md.parse(text)

    # Collect the text content of each block-level segment (heading, paragraph,
    # list item, code block, etc.). Markup such as "#" or "-" is not retained.
    blocks = []
    buf = []
    for tok in tokens:
        if tok.block and tok.type.endswith(("_open", "_close")):
            # Block boundary: flush buffered content so that standalone blocks
            # (e.g. fenced code) are not discarded when the next block opens
            if buf:
                blocks.append("\n".join(buf).strip())
                buf = []
        elif tok.content:
            buf.append(tok.content)
    if buf:
        blocks.append("\n".join(buf).strip())

    # Split blocks into chunks with overlap to maintain context continuity
    chunks = []
    current_words = []
    for block in blocks:
        words = block.split()
        for w in words:
            current_words.append(w)
            if len(current_words) >= max_tokens:
                # Emit a complete chunk
                chunks.append(" ".join(current_words))
                # Prepare next buffer with overlap from the end of this chunk
                # This ensures context continuity between chunks
                current_words = current_words[-overlap:] if overlap > 0 else []

    # Add any remaining words as the final chunk
    if current_words:
        chunks.append(" ".join(current_words))

    return chunks


def save_chunks_to_jsonl(chunks, filename):
    """
    Save a list of strings to a JSONL file where each line is a JSON object
    with the key 'chunk'. Returns the Path to the saved file.

    Args:
        chunks (list of str): List of text chunks to save.
        filename (str): Path to the output .jsonl file (string or Path).

    Returns:
        pathlib.Path: Path to the saved file.
    """
    path = Path(filename)
    with path.open('w', encoding='utf-8') as f:
        for chunk in chunks:
            json_line = json.dumps({"chunk": chunk}, ensure_ascii=False)
            f.write(json_line + '\n')
    print(f"Saved {len(chunks)} chunks to {path}")
    return path

## Chunk Markdown

The markdown text is broken into chunks of at most `max_tokens` words, with `overlap` words shared between consecutive chunks.

Use the `chunk_markdown` utility function to chunk the markdown file into smaller pieces.

In [None]:
chunks = chunk_markdown(text, max_tokens=5000, overlap=1000)
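
To see how `max_tokens` and `overlap` interact, the following sketch applies the same word-window logic used inside `chunk_markdown` to a short list of words. It is a simplified illustration that skips the markdown parsing step, and the function name `window_words` is hypothetical.

```python
from typing import List


def window_words(words: List[str], max_words: int, overlap: int) -> List[List[str]]:
    """Group words into windows of `max_words`, repeating the last `overlap`
    words of each window at the start of the next one."""
    windows = []
    current = []
    for w in words:
        current.append(w)
        if len(current) >= max_words:
            windows.append(current)
            # Seed the next window with the tail of the one just emitted
            current = current[-overlap:] if overlap > 0 else []
    # Whatever is left over becomes the final, shorter window
    if current:
        windows.append(current)
    return windows


words = [f"w{i}" for i in range(1, 10)]  # w1 .. w9
for window in window_words(words, max_words=4, overlap=1):
    print(window)
# ['w1', 'w2', 'w3', 'w4']
# ['w4', 'w5', 'w6', 'w7']
# ['w7', 'w8', 'w9']
```

Each window starts with the last word of the previous one, so with `max_tokens=5000` and `overlap=1000` each chunk after the first contributes about 4000 new words.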

## (Optional) Save Chunks to intermediate chunks.jsonl

The intermediate `chunks.jsonl` file can be used to tweak chunks before proceeding to seed dataset creation.

In [None]:
chunks_path = save_chunks_to_jsonl(chunks, f"{OUTPUT_DIR}/chunks.jsonl")
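
One way to tweak the intermediate file is to load it back into Python, edit the list of chunks, and write it out again. The sketch below uses only the standard library and assumes the file layout produced by `save_chunks_to_jsonl` (one `{"chunk": ...}` object per line). The helpers `load_chunks_from_jsonl` and `merge_adjacent` are hypothetical and not part of the original notebook.

```python
import json
from pathlib import Path
from typing import List


def load_chunks_from_jsonl(filename) -> List[str]:
    """Read the 'chunk' value from each non-empty line of a JSONL file."""
    chunks = []
    with Path(filename).open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                chunks.append(json.loads(line)["chunk"])
    return chunks


def merge_adjacent(chunks: List[str], index: int) -> List[str]:
    """Return a new list in which chunk `index` and chunk `index + 1` are joined."""
    merged = chunks[index] + " " + chunks[index + 1]
    return chunks[:index] + [merged] + chunks[index + 2:]
```

After editing, the list can be written back with `save_chunks_to_jsonl`. Because consecutive chunks share their overlap words, a merged chunk contains that shared text twice unless it is trimmed by hand.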

## (Optional) Review size of Chunks

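Chunks should be between 6K and 8K tokens in length. Chunks outside this range (excluding the final chunk) should be merged or split. Note that `chunk_markdown` measures chunk size in words, while the check below counts tokens using the `cl100k_base` encoding.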

In [None]:
import tiktoken

min_tokens = 6000
max_tokens = 8000

# Build the encoder once rather than on every iteration
enc = tiktoken.get_encoding("cl100k_base")

for i, chunk in enumerate(chunks, start=1):
    token_count = len(enc.encode(chunk))
    out_of_range = token_count < min_tokens or token_count > max_tokens
    # The final chunk is exempt because it holds whatever text remains
    if out_of_range and i != len(chunks):
        print(f"\033[31mWARNING: Chunk {i} ({chunk[:30]} ... {chunk[-30:]}) {token_count} tokens\033[0m")
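
When the review flags a chunk as too large, it can be split into smaller pieces. The following is a minimal sketch of one possible approach, dividing a chunk into pieces of near-equal word count. The helper `split_evenly` is hypothetical, and it counts whitespace-separated words rather than tokens, so the pieces should be rechecked with the token review above.

```python
import math
from typing import List


def split_evenly(chunk: str, parts: int) -> List[str]:
    """Split `chunk` on whitespace into at most `parts` pieces of
    near-equal word count. Line breaks are replaced by single spaces."""
    words = chunk.split()
    if not words:
        return []
    # Round up so that no more than `parts` pieces are produced
    size = math.ceil(len(words) / parts)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```

For example, `split_evenly("a b c d e", 2)` returns `["a b c", "d e"]`. The pieces do not overlap, unlike the chunks produced by `chunk_markdown`.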

## Load Chunks

In [None]:
from datasets import load_dataset

chunks_files = [f"{OUTPUT_DIR}/chunks.jsonl"]

# Load the dataset from the JSONL file, rename the 'chunk' column to 'document', and keep only that column
chunks = load_dataset("json", data_files=chunks_files).rename_columns({'chunk': 'document'}).select_columns('document')
# load_dataset returns a DatasetDict; by default the data is placed in the "train" split
chunks = chunks['train']

## Set ICL Fields

The seed data requires the following fields:
   - `document_outline`: A concise title or summary that accurately represents the entire document.
     For documents covering multiple themes, consider providing multiple outlines (one per section).
   - `domain`: The domain or subject area of the document.
   - `icl_document`: A ~500 token representative sample extracted from the document. This may include paragraphs, bulleted lists, tables, code snippets, definitions, etc.
   - `icl_query_1`, `icl_query_2`, `icl_query_3`: Three questions based on the `icl_document` sample.

In [None]:
document_outline = "International Business Machines (IBM) annual company earnings report 2024"

domain = "Finance"

icl_document = """In 2024, we reported $62.8 billion in revenue, income from continuing operations of $6.0 billion, 
which includes the impact of the pension settlement charges of $3.1 billion ($2.4 billion net of tax), 
and operating (non-GAAP) earnings of $9.7 billion, which excludes the impact of the pension settlement charges. 
Refer to "Organization of Information," for additional information. 
Diluted earnings per share from continuing operations was $6.42 as reported, 
including an impact of $2.57 from the pension settlement charges, and diluted earnings per share was $10.33 on an operating (non-GAAP) basis. 
We generated $13.4 billion in cash from operations and $12.7 billion in free cash flow, and returned $6.1 billion to shareholders in dividends. 
We are pleased with the progress we made in 2024, delivering revenue growth in our re-positioned business and strong cash flow generation. 
Our 2024 performance demonstrates the success of our focused strategy, enhanced portfolio and sustainable revenue growth. 
We increased our investment in innovation and talent and completed eleven acquisitions in 2024, 
strengthening our hybrid cloud and AI capabilities, all while continuing to return value to shareholders through our dividend.

Total revenue grew 1.4 percent year to year as reported and 3 percent adjusted for currency compared to the prior year, 
led by our Software performance. Software revenue increased 8.3 percent as reported and 9.0 percent adjusted for currency, 
with strength across our portfolio. Hybrid Platform & Solutions increased 8.1 percent as reported and 8.7 percent adjusted for currency, 
reflecting growth across all lines of business with double-digit revenue growth in Red Hat and Automation. 
Transaction Processing increased 8.7 percent as reported and 9.6 percent adjusted for currency, with growth in both recurring and transactional revenue. 
Consulting revenue decreased 0.9 percent as reported but grew 0.6 percent adjusted for currency, 
and continued to be impacted by a dynamic market environment as clients reprioritized spending. 
Infrastructure decreased 3.9 percent year to year as reported and 2.7 percent adjusted for currency, reflecting product cycle dynamics.
"""

icl_query_1 = "What was the 2024 revenue in billions of dollars?"
icl_query_2 = "How much did Infrastructure decrease year to year?"
icl_query_3 = "What did the IBM 2024 performance demonstrate?"


icl = {
    "document_outline": document_outline,
    "icl_document": icl_document,
    "icl_query_1": icl_query_1,
    "icl_query_2": icl_query_2,
    "icl_query_3": icl_query_3,
    "domain": domain,
}
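
Before mapping the ICL fields onto the chunks, it can help to confirm that none of the required fields is missing or blank. The check below is a minimal sketch; `REQUIRED_ICL_FIELDS` and `missing_icl_fields` are hypothetical names, with the field list taken from the requirements stated above.

```python
REQUIRED_ICL_FIELDS = (
    "document_outline",
    "domain",
    "icl_document",
    "icl_query_1",
    "icl_query_2",
    "icl_query_3",
)


def missing_icl_fields(icl: dict) -> list:
    """Return the required field names that are absent, non-string, or blank."""
    missing = []
    for name in REQUIRED_ICL_FIELDS:
        value = icl.get(name)
        if not isinstance(value, str) or not value.strip():
            missing.append(name)
    return missing
```

Calling `missing_icl_fields(icl)` on the dictionary defined above should return an empty list. Any names it returns need to be filled in before writing the seed data.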

## Map ICL Fields to Document Chunks and Write `seed_data.jsonl`

In [None]:
# Map the ICL fields to each document chunk (if you want to use the same ICL for all, as shown here)
seed_data = chunks.map(lambda x: icl)

# Save the seed data to a JSONL file for downstream use
seed_data.to_json(f'{OUTPUT_DIR}/seed_data.jsonl', orient='records', lines=True)
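
`Dataset.map` merges the dictionary returned by the function into each row, so every record in `seed_data.jsonl` ends up with the `document` column plus the six ICL fields. The plain-Python sketch below mimics that merge on toy data to show the resulting record shape. All values are placeholders, not real output.

```python
import json

# Toy stand-ins for the chunk rows and the ICL dictionary (placeholder values)
rows = [{"document": "first chunk text"}, {"document": "second chunk text"}]
icl = {
    "document_outline": "Example outline",
    "icl_document": "Example snippet",
    "icl_query_1": "Question 1?",
    "icl_query_2": "Question 2?",
    "icl_query_3": "Question 3?",
    "domain": "Example domain",
}

# Each output record is the original row plus every ICL field
seed_records = [{**row, **icl} for row in rows]

# One JSON object per line, matching the JSONL layout
for record in seed_records:
    print(json.dumps(record))
```

To use different ICL fields for different chunks, the mapped function could return a different dictionary depending on the row it receives.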

### Next Steps:
- The `seed_data.jsonl` file is now ready for the knowledge tuning pipeline.
- Continue with the [knowledge generation](../02_Knowledge_Generation/Knowledge_Generation.ipynb) notebook.