# Data Preprocessing

This notebook covers the preprocessing steps required to prepare the `seed.jsonl` dataset for Synthetic Data Generation (SDG):

1. Configuring the paths
2. Loading the files into memory
3. Document Conversion
4. Chunking
5. Saving the processed data

In [1]:
from pathlib import Path
import json

In [2]:
WORKSPACE = Path.cwd().parent # Path to the workspace directory

SOURCE_DOCUMENT_DIR = WORKSPACE / "source_documents"
OUTPUT_DIR = WORKSPACE / "output" / "step_01"

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)  # Create output directory if it doesn't exist

In [3]:
available_files = SOURCE_DOCUMENT_DIR.glob("**/*.pdf")
available_files = list(available_files)

print(f"Total PDF files found: {len(available_files)} \n")
print("Available Files:")

for file in available_files:
    print(f"\t{file.stem}.pdf")

Total PDF files found: 2 

Available Files:
	2022-nfl-rulebook.pdf
	2023-nfl-rulebook.pdf
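
The order in which `glob` yields files is not guaranteed, and an empty source directory only becomes visible later as an empty chunk list. A minimal sketch of a helper that sorts the matches and fails early is shown here; `find_pdfs` is a hypothetical name and is not part of this notebook's utilities.

```python
from pathlib import Path


def find_pdfs(source_dir: Path) -> list[Path]:
    """Return all PDFs under source_dir (searched recursively), sorted by path."""
    pdfs = sorted(source_dir.glob("**/*.pdf"))
    if not pdfs:
        # Fail early instead of silently producing no chunks later on
        raise FileNotFoundError(f"No PDF files found under {source_dir}")
    return pdfs
```

In this notebook it could stand in for the two `available_files` lines, for example `available_files = find_pdfs(SOURCE_DOCUMENT_DIR)`.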


## Document Conversion

The source documents are PDFs. We use `docling` to read them and convert them into docling's document format.

Configure the docling pipeline:

In [4]:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Start from docling's default PDF pipeline options
pipeline_options = PdfPipelineOptions()
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)




Convert each document to the docling format and save the result as JSON:

In [5]:
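# Create the output directory once, before the loop
docling_output_dir = OUTPUT_DIR / "docling_output"
docling_output_dir.mkdir(parents=True, exist_ok=True)

confidence_report = {}
for file in available_files:
    conv_result = doc_converter.convert(file)

    # Keep the per-document confidence report for the quality check below
    confidence_report[file.stem] = conv_result.confidence

    # Save the converted document as JSON so the chunking step can reload it
    document_dict = conv_result.document.export_to_dict()
    output_file = docling_output_dir / f"{file.stem}.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(document_dict, f, indent=4)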



2025-09-19 11:45:36,666 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-09-19 11:45:36,697 - INFO - Going to convert document batch...
2025-09-19 11:45:36,698 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e647edf348883bed75367b22fbe60347
2025-09-19 11:45:36,705 - INFO - Loading plugin 'docling_defaults'
2025-09-19 11:45:36,706 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-09-19 11:45:36,713 - INFO - Loading plugin 'docling_defaults'
2025-09-19 11:45:36,715 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-09-19 11:45:36,910 - INFO - Accelerator device: 'mps'
2025-09-19 11:45:40,102 - INFO - Accelerator device: 'mps'
2025-09-19 11:45:42,062 - INFO - Accelerator device: 'mps'
2025-09-19 11:45:42,693 - INFO - Processing document 2022-nfl-rulebook.pdf
2025-09-19 11:45:47,522 - INFO - Finished converting document 2022-nfl-rulebook.pdf in 10.86 sec.
2025-09-19 11:45:47,533 - INFO - detected 
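
Conversion is the slow part of this notebook: the log above reports about 11 seconds for a single rulebook. When re-running the notebook, it may be useful to skip PDFs whose JSON output already exists. The following is a sketch under that assumption; `pending_conversions` is a hypothetical helper, and it only checks that the output file exists, not that it is complete or up to date.

```python
from pathlib import Path


def pending_conversions(pdf_files: list[Path], json_dir: Path) -> list[Path]:
    """Return the PDFs that have no matching <stem>.json file in json_dir yet."""
    return [pdf for pdf in pdf_files if not (json_dir / f"{pdf.stem}.json").exists()]
```

The conversion loop could then iterate over `pending_conversions(available_files, OUTPUT_DIR / "docling_output")` instead of `available_files`. Note that skipped files would not appear in `confidence_report`.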

In [6]:
# Use a separate loop variable so the confidence_report dict is not overwritten
for file, report in confidence_report.items():
    print(f"Conversion confidence for {file}:")

    print(f"Average confidence: \033[1m{report.mean_grade.name}\033[0m (score {report.mean_score:.3f})")

    # Collect the pages that scored below the document's mean score
    low_score_pages = [
        page
        for page, page_report in report.pages.items()
        if page_report.mean_score < report.mean_score
    ]

    # Display page numbers starting from 1
    print(f"Pages that scored lower than average: {', '.join(str(x + 1) for x in low_score_pages)}")
    
    print()

Conversion confidence for 2022-nfl-rulebook:
Average confidence: EXCELLENT (score 0.917)
Pages that scored lower than average: 1, 2

Conversion confidence for 2023-nfl-rulebook:
Average confidence: EXCELLENT (score 0.938)
Pages that scored lower than average: 1, 2, 6
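
Comparing each page with the document mean flags some pages for almost any document, even when every page converted well. A sketch of an alternative check against a fixed cutoff follows. It takes a plain mapping of page index to score so that it runs without docling; the name `pages_below` and the `0.8` cutoff are assumptions chosen for illustration, not docling defaults.

```python
def pages_below(page_scores: dict[int, float], threshold: float = 0.8) -> list[int]:
    """Return page numbers (index + 1) whose score is below threshold."""
    return sorted(page + 1 for page, score in page_scores.items() if score < threshold)


# Hypothetical scores keyed by page index, for illustration only
print(pages_below({0: 0.62, 1: 0.79, 2: 0.95}))
```

With a real report from the cell above, the mapping could be built as `{page: r.mean_score for page, r in report.pages.items()}`, where `report` is one value of the `confidence_report` dict.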



## Chunking

Chunk the converted documents with docling's `HybridChunker`.

In [7]:
from docling.chunking import HybridChunker

In [8]:
chunker = HybridChunker()

In [9]:
all_chunks = []

json_files = (OUTPUT_DIR / "docling_output").glob("*.json")

converter = DocumentConverter()

for file in json_files:
    conv_result = converter.convert(file)

    chunks = chunker.chunk(conv_result.document)
    chunks = list(chunks)
    print(f"Total chunks created for {file.stem}: {len(chunks)}")

    for chunk in chunks:
        all_chunks.append({
            "chunk": chunker.contextualize(chunk),
            "file": file.stem,
            "metadata": chunk.meta.export_json_dict(),
        })

chunks_file_path = OUTPUT_DIR / "chunks.jsonl"
with open(chunks_file_path, "w", encoding="utf-8") as file:
    for chunk in all_chunks:
        json.dump(chunk, file)
        file.write("\n")
print(f"Path of chunks JSON is: {chunks_file_path.resolve()}")


2025-09-19 11:45:53,035 - INFO - detected formats: [<InputFormat.JSON_DOCLING: 'json_docling'>]
2025-09-19 11:45:53,042 - INFO - Going to convert document batch...
2025-09-19 11:45:53,043 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2025-09-19 11:45:53,043 - INFO - Processing document 2023-nfl-rulebook.json
2025-09-19 11:45:53,044 - INFO - Finished converting document 2023-nfl-rulebook.json in 0.01 sec.
2025-09-19 11:45:53,119 - INFO - detected formats: [<InputFormat.JSON_DOCLING: 'json_docling'>]
2025-09-19 11:45:53,123 - INFO - Going to convert document batch...
2025-09-19 11:45:53,124 - INFO - Processing document 2022-nfl-rulebook.json
2025-09-19 11:45:53,124 - INFO - Finished converting document 2022-nfl-rulebook.json in 0.01 sec.


Total chunks created for 2023-nfl-rulebook: 25
Total chunks created for 2022-nfl-rulebook: 22
Path of chunks JSON is: /Users/scharan/projects/red-hat-ai-examples/examples/knowledge-tuning/output/step_01/chunks.jsonl
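
Before sampling, it can help to look at how chunk sizes are distributed, since very short chunks tend to make poor seed contexts. The sketch below summarises character lengths of the `"chunk"` field written above. `chunk_length_summary` is a hypothetical helper, and character counts are only a rough proxy for the token counts the chunker works with.

```python
import statistics


def chunk_length_summary(chunks: list[dict]) -> dict[str, float]:
    """Summarise character lengths of the "chunk" field across records."""
    lengths = [len(record["chunk"]) for record in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "median": 0, "max": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }
```

Calling `chunk_length_summary(all_chunks)` after the chunking cell gives a quick overview without printing every chunk.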


View random chunks

In [10]:
import random

NUM_CHUNKS_TO_VIEW = 5

# Sample without replacement, capped at the number of available chunks
sample = random.sample(all_chunks, min(len(all_chunks), NUM_CHUNKS_TO_VIEW))

for i, chunk in enumerate(sample, start=1):
    print(f"== Randomly selected chunk {i}: ==========\n\n{chunk['chunk']}\n\n")


A.R. 15.262 Double Challenge
Second-and-10 on B35. With 12:00 remaining in the fourth quarter, QBA1 rolls out and throws a pass to A2 at the back of the end zone that is ruled incomplete. Team A challenges that the pass was complete, but replays show that A2 only got one foot down inbounds and the call on the field is upheld. While the Referee is making his announcement, a new replay comes up that shows the QB stepping on the sideline at the B40 before releasing the pass. Team B challenges the play.
Ruling: Reviewable. Both teams can challenge the same play. A's ball third-and-15 on the B40, reset the clock to the time when the QB stepped out of bounds, and start on the snap. A team cannot challenge the same play twice. It is important that all reviewable aspects of a play are confirmed by replay regardless of what is being challenged. Team A is charged with a challenge and a timeout.



A.R. 15.263 Next legal snap
First-and-10 on A20. In the third quarter, A2 catches a pass at the A4

Read the chunks back from the saved file and then randomly select the chunks to use as seed data for SDG.

In [11]:
chunks = []
with open(chunks_file_path, "r", encoding="utf-8") as file:
    for line in file:
        chunks.append(json.loads(line))

NUM_SEED_EXAMPLES = 5  # Number of chunks to select as seed examples

# Cap the sample size so random.sample does not fail when fewer chunks exist
selected_chunks = random.sample(chunks, min(len(chunks), NUM_SEED_EXAMPLES))

selected_chunks_path = OUTPUT_DIR / "selected_chunks.jsonl"
with open(selected_chunks_path, "w", encoding="utf-8") as file:
    for chunk in selected_chunks:
        json.dump(chunk, file)
        file.write("\n")
print(f"Path of selected chunks JSON is: {selected_chunks_path.resolve()}")



Path of selected chunks JSON is: /Users/scharan/projects/red-hat-ai-examples/examples/knowledge-tuning/output/step_01/selected_chunks.jsonl
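
`random.sample` draws a different selection on every run, so the seed chunks, and everything generated from them, change between runs. If reproducibility matters, a seeded generator is one option. This is a sketch: `sample_seed_chunks` and the default seed `42` are assumptions, not part of the notebook's utilities.

```python
import random


def sample_seed_chunks(chunks: list[dict], k: int, seed: int = 42) -> list[dict]:
    """Pick up to k chunks reproducibly, using a private random generator."""
    rng = random.Random(seed)  # does not touch the global random state
    return rng.sample(chunks, min(k, len(chunks)))
```

Using a private `random.Random` instance keeps the selection stable without affecting other code that relies on the global `random` module.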


Generate Q&A pairs for each chunk selected above.

In [12]:
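import os

# Define your API credentials and model details.
# Read the key from the environment rather than hardcoding it in the notebook.
# The variable name API_KEY is an assumption; this raises KeyError if it is unset.
API_KEY = os.environ["API_KEY"]
ENDPOINT = "https://granite-3-3-8b-instruct-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com:443/v1"
MODEL_NAME = "granite-3-3-8b-instruct"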

In [13]:
CUSTOMISATION_PROMPT = "Generate at least 5 seed examples in the format specified below."

In [14]:
from ai_tools.usecase.knowledge_tuning.utils import generate_seed_examples

generate_seed_examples(
    "",
    selected_chunks_path,
    OUTPUT_DIR,
    API_KEY,
    ENDPOINT,
    MODEL_NAME,
    "DOMAIN",
    "SUMMARY",
    CUSTOMISATION_PROMPT,
)

Filtering smaller chunks out of chunks from document 2023-nfl-rulebook
Filtering smaller chunks out of chunks from document 2022-nfl-rulebook


  0%|          | 0/5 [00:00<?, ?it/s]2025-09-19 11:46:14,292 - INFO - HTTP Request: POST https://granite-3-3-8b-instruct-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com/v1/completions "HTTP/1.1 200 OK"
 20%|██        | 1/5 [00:19<01:18, 19.65s/it]2025-09-19 11:46:33,574 - INFO - HTTP Request: POST https://granite-3-3-8b-instruct-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com/v1/completions "HTTP/1.1 200 OK"
 40%|████      | 2/5 [00:38<00:58, 19.43s/it]2025-09-19 11:46:45,865 - INFO - HTTP Request: POST https://granite-3-3-8b-instruct-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com/v1/completions "HTTP/1.1 200 OK"
 60%|██████    | 3/5 [00:51<00:32, 16.17s/it]2025-09-19 11:47:05,104 - INFO - HTTP Request: POST https://granite-3-3-8b-instruct-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com/v1/completions "HTTP/1.1 200 OK"
 80%|████████  | 4/5 [01:10<00:17, 17.38s/it]2025-09-19 11:47:24,466 - INFO - HTTP Request: POST https://granite-3-3

Status for Q&A generation for  is: Status.SUCCESS





PosixPath('/Users/scharan/projects/red-hat-ai-examples/examples/knowledge-tuning/output/step_01/qna.yaml')

In [15]:
from ai_tools.usecase.knowledge_tuning.utils import view_seed_example

In [16]:
view_seed_example(OUTPUT_DIR / "qna.yaml", 0)

Context:
A.R. 15.258 Foul negates score
Fourth-and-goal on B8. With 3:43 remaining in the fourth quarter, QBA1's pass to A8 is ruled complete in the end zone, but A7 is penalized for offensive holding. Replays show that the ball hit the ground before A8 possessed it.
Ruling: Reviewable. Incomplete pass, B's ball first-and-10 on B8, decline holding foul. Only the Replay Official can initiate a review of this play since the ruling on the field was a touchdown nullified by a penalty.

Question: Who initiated the review of the play?
Answer: Only the Replay Official can initiate a review of this play.

Question: What was the ruling on the field and what was the corrected ruling after review?
Answer: The ruling on the field was a touchdown due to QBA1's pass to A8, but it was
nullified by a holding penalty against A7. After review, the corrected ruling
was an incomplete pass, B's ball first-and-10 on B8, with the holding foul
declined.

Question: Why was the initial ruling of a touchdown ove

In [17]:
from ai_tools.usecase.knowledge_tuning.utils import review_seed_examples_file


review_seed_examples_file(OUTPUT_DIR / "qna.yaml", min_seed_examples=4, num_qa_pairs=3)

Reviewing seed examples file at /Users/scharan/projects/red-hat-ai-examples/examples/knowledge-tuning/output/step_01/qna.yaml
Found contribution summary...
Found 'domain'...
Seed Example 1 contains expected number (3) of 'question_and_answers'...

ERROR! Seed Examples validation failed with the following issues:
- 'seed_examples' should contain at least 4 examples, found 1. Please add 3 more seed example(s)
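
The review above fails because the file holds one seed example where four are required. A small check along the same lines can be run on already-parsed YAML content. This is a sketch, not the implementation of `review_seed_examples_file`: the key names `seed_examples` and `question_and_answers` are taken from the messages above, the rest of the schema is assumed, and `count_seed_examples` is a hypothetical helper.

```python
def count_seed_examples(qna: dict, num_qa_pairs: int = 3) -> tuple[int, list[int]]:
    """Return (number of seed examples, 1-based indices with too few Q&A pairs)."""
    examples = qna.get("seed_examples") or []
    short = [
        i
        for i, example in enumerate(examples, start=1)
        if len(example.get("question_and_answers") or []) < num_qa_pairs
    ]
    return len(examples), short
```

The dictionary would come from loading `qna.yaml` with a YAML parser, for instance PyYAML's `yaml.safe_load`; that step is left out so the sketch has no third-party dependency.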




In [18]:
# !pip3 install -qq datasets transformers

In [19]:
from ai_tools.usecase.knowledge_tuning.create_seed_dataset import get_seed_dataset

In [20]:
seed_data = get_seed_dataset(OUTPUT_DIR, OUTPUT_DIR)


You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Map: 100%|██████████| 25/25 [00:00<00:00, 2600.31 examples/s]
Filter: 100%|██████████| 25/25 [00:00<00:00, 2093.47 examples/s]
Map: 100%|██████████| 22/22 [00:00<00:00, 3679.51 examples/s]
Filter: 100%|██████████| 22/22 [00:00<00:00, 2229.88 examples/s]


In [21]:
seed_data

Dataset({
    features: ['document', 'document_outline', 'document_title', 'domain', 'icl_document', 'icl_query_1', 'icl_response_1', 'icl_query_2', 'icl_response_2', 'icl_query_3', 'icl_response_3'],
    num_rows: 43
})

In [22]:
seed_data.to_json(OUTPUT_DIR / "final_seed_data.jsonl", orient="records", lines=True)

Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 21.44ba/s]


85155
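
As a final sanity check, the saved JSONL file can be read back to confirm that every record carries the columns the next step relies on. The sketch below uses only the standard library; `check_jsonl_columns` is a hypothetical helper, and which columns are actually required downstream is an assumption to adjust.

```python
import json
from pathlib import Path


def check_jsonl_columns(path: Path, required: set[str]) -> int:
    """Return the record count; raise ValueError if a record lacks a required key."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            missing = required - set(json.loads(line))
            if missing:
                raise ValueError(f"Line {line_no} is missing {sorted(missing)}")
            count += 1
    return count
```

For example, `check_jsonl_columns(OUTPUT_DIR / "final_seed_data.jsonl", {"document", "domain", "icl_document"})` should return the same number of rows as the dataset shown above.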