# Data Preprocessing

Complete the following preprocessing steps to prepare the `seed.jsonl` dataset for Synthetic Data Generation (SDG):

1. Configure the paths
2. Load the files into memory
3. Convert the document
4. Chunk the document
5. Generate question and answer data from randomly selected chunks

**NOTE**

- You need the following information for the model that generates the question and answer pairs:
   - An OpenAI-compatible endpoint
   - The model's API key
   - The model's name
- You can adjust parameters (for example, number of chunks or the QnA prompt).
- For large documents or datasets, monitor RAM and disk usage.
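
Free disk space can be checked between steps with a short helper. The following is a minimal sketch that uses only the Python standard library; the helper name `free_disk_gib` and the 5 GiB threshold are illustrative assumptions, not part of this notebook. RAM usage is easier to watch with operating system tools.

```python
import shutil
from pathlib import Path


def free_disk_gib(path: Path) -> float:
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / 1024**3


# Hypothetical threshold; adjust it to the size of your documents.
MIN_FREE_GIB = 5.0

free = free_disk_gib(Path.cwd())
if free < MIN_FREE_GIB:
    print(f"Warning: only {free:.1f} GiB free on disk")
else:
    print(f"{free:.1f} GiB free on disk")
```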

## Configure the paths

In [None]:
import json
import random
from pathlib import Path

from docling.chunking import HybridChunker
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

from ai_tools.usecase.knowledge_tuning.create_seed_dataset import \
    get_seed_dataset
from ai_tools.usecase.knowledge_tuning.utils import (generate_seed_examples,
                                                     review_seed_examples_file,
                                                     view_seed_example)

WORKSPACE = Path.cwd().parent  # Path to the workspace directory

SOURCE_DOCUMENT_DIR = WORKSPACE / "source_documents"
OUTPUT_DIR = WORKSPACE / "output" / "step_01"

OUTPUT_DIR.mkdir(
    parents=True, exist_ok=True
)  # Create output directory if it doesn't exist

## Load the files into memory

In [None]:
available_files = SOURCE_DOCUMENT_DIR.glob("**/*.pdf")
available_files = list(available_files)

print(f"Total PDF files found: {len(available_files)} \n")
print("Available Files:")

for file in available_files:
    print(f"\t{file.stem}.pdf")
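
On case-sensitive file systems, the `**/*.pdf` pattern does not match files whose extension is written in upper case (for example, `report.PDF`). If your source documents might use mixed-case extensions, a case-insensitive search is one option. The following is a minimal sketch; the helper name `find_pdfs` is hypothetical and not part of this notebook.

```python
from pathlib import Path


def find_pdfs(root: Path) -> list[Path]:
    """Recursively collect PDF files under `root`, ignoring extension case."""
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

With this helper, `available_files = find_pdfs(SOURCE_DOCUMENT_DIR)` could replace the `glob` call.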

## Convert the document

The source documents are in PDF format. Before you can chunk a document, you must convert it from PDF to `docling` format.

Configure the docling pipeline:

In [None]:
pipeline_options = PdfPipelineOptions()
doc_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)


Read and convert the PDF documents to `docling` format:

In [None]:
confidence_report = {}
for file in available_files:
    conv_result = doc_converter.convert(file)

    document = conv_result.document
    confidence_report[file.stem] = conv_result.confidence

    document_dict = document.export_to_dict()

    (OUTPUT_DIR / "docling_output").mkdir(parents=True, exist_ok=True)
    output_file = OUTPUT_DIR / "docling_output" / f"{file.stem}.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(document_dict, f, indent=4)

In [None]:
# Use separate loop variables so that the confidence_report dictionary is not
# overwritten and this cell can be run more than once
for file_stem, report in confidence_report.items():
    print(f"Conversion confidence for {file_stem}:")

    print(
        f"Average confidence: \033[1m{report.mean_grade.name}\033[0m (score {report.mean_score:.3f})"
    )

    # Collect the pages whose score is below the document average
    low_score_pages = []
    for page, page_report in report.pages.items():
        if page_report.mean_score < report.mean_score:
            low_score_pages.append(page)

    print(
        f"Pages that scored lower than average: {', '.join(str(x + 1) for x in low_score_pages)}"
    )

    print()
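
The cell above flags every page whose mean score is below the document-wide mean. The same idea can be shown on plain numbers. In this sketch, the `page_scores` dictionary is made-up illustration data keyed by zero-based page index; it is not output from `docling`, and the document mean is computed here as a simple average of the page scores, which is an assumption for the example.

```python
from statistics import mean

# Made-up scores for illustration only (zero-based page index -> mean score).
page_scores = {0: 0.95, 1: 0.60, 2: 0.88, 3: 0.71}

document_mean = mean(page_scores.values())

# Keep pages that scored below the document mean, reported as 1-based page numbers.
low_score_pages = [
    page + 1 for page, score in page_scores.items() if score < document_mean
]

print(f"Document mean: {document_mean:.3f}")
print(f"Pages below the mean: {', '.join(str(p) for p in low_score_pages)}")
```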

## Chunk the document

Use the `docling` `HybridChunker` to split each converted document into chunks:

In [None]:
chunker = HybridChunker()

all_chunks = []

json_files = (OUTPUT_DIR / "docling_output").glob("*.json")

converter = DocumentConverter()

for file in json_files:
    conv_result = converter.convert(file)

    chunks = chunker.chunk(conv_result.document)
    chunks = list(chunks)
    print(f"Total chunks created for {file.stem}: {len(chunks)}")

    for chunk in chunks:
        all_chunks.append(
            {
                "chunk": chunker.contextualize(chunk),
                "file": file.stem,
                "metadata": chunk.meta.export_json_dict(),
            }
        )

chunks_file_path = OUTPUT_DIR / "chunks.jsonl"
with open(chunks_file_path, "w", encoding="utf-8") as file:
    for chunk in all_chunks:
        json.dump(chunk, file)
        file.write("\n")
    print(f"Chunks saved to: {chunks_file_path.resolve()}")

View a random sample of chunks:

In [None]:
NUM_CHUNKS_TO_VIEW = 5


sample = random.sample(all_chunks, min(len(all_chunks), NUM_CHUNKS_TO_VIEW))

for i, chunk in enumerate(sample, start=1):
    print(f"== Randomly selected chunk {i}: ==========\n\n{chunk['chunk']}\n\n")

Read the chunks from the saved file and then randomly select the chunks to use as seed data for SDG:

In [None]:
chunks = []
with open(chunks_file_path, encoding="utf-8") as file:
    for line in file:
        chunk = json.loads(line)
        chunks.append(chunk)

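NUM_SEED_EXAMPLES = 5  # Number of chunks to select as seed examples
# Cap the sample size so that random.sample does not raise a ValueError
# when fewer than NUM_SEED_EXAMPLES chunks are available
selected_chunks = random.sample(chunks, min(len(chunks), NUM_SEED_EXAMPLES))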
selected_chunks_path = OUTPUT_DIR / "selected_chunks.jsonl"
with open(selected_chunks_path, "w", encoding="utf-8") as file:
    for chunk in selected_chunks:
        json.dump(chunk, file)
        file.write("\n")
    print(f"Selected chunks saved to: {selected_chunks_path.resolve()}")
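
Because `random.sample` draws a different selection on each run, the seed chunks change whenever the cell is re-executed. If a repeatable selection is preferred, one option is a dedicated seeded generator. The following is a minimal sketch; the helper name `sample_chunks` and the seed value `42` are illustrative assumptions.

```python
import random


def sample_chunks(chunks: list, k: int, seed: int = 42) -> list:
    """Return up to `k` chunks, chosen reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(chunks, min(len(chunks), k))
```

For example, `selected_chunks = sample_chunks(chunks, NUM_SEED_EXAMPLES)` returns the same chunks on every run as long as the chunk list and the seed stay the same.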

## Generate question and answer data for each chunk

In the next cell, enter the following information for the model that generates the question and answer pairs:
- The model's API key
- An OpenAI-compatible endpoint
- The model's name

In [None]:
# Define your API credentials and model details
API_KEY = ""  # Replace with your actual API key
ENDPOINT = "https://granite-3-3-8b-instruct-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com:443/v1"
MODEL_NAME = "granite-3-3-8b-instruct"
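
Hard-coding an API key in a notebook makes it easy to commit the key by accident. One common alternative is to read the key from an environment variable. The following is a sketch that assumes a variable named `SDG_API_KEY`; both that name and the helper `load_api_key` are hypothetical.

```python
import os


def load_api_key(var_name: str = "SDG_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running this cell"
        )
    return key
```

With the variable exported in the shell that starts the notebook server, `API_KEY = load_api_key()` can replace the empty string above.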

Define the prompt that customizes question and answer (QnA) generation:

In [None]:
CUSTOMISATION_PROMPT = "Generate at least 5 seed examples in the format specified below."

For each randomly selected chunk, use an LLM with this prompt to generate QnA pairs and save them to the `qna.yaml` file:

In [None]:
generate_seed_examples(
    "",
    selected_chunks_path,
    OUTPUT_DIR,
    API_KEY,
    ENDPOINT,
    MODEL_NAME,
    "DOMAIN",
    "SUMMARY",
    CUSTOMISATION_PROMPT,
)

View a seed example from the generated `qna.yaml` file:

In [None]:
view_seed_example(OUTPUT_DIR / "qna.yaml", 0)

Review the generated QnA pairs:

- Check for the presence of required fields.
- Check the number of seed examples generated.
- Check the number of QnA pairs generated for each seed example.

In [None]:
review_seed_examples_file(OUTPUT_DIR / "qna.yaml", min_seed_examples=4, num_qa_pairs=3)

In [None]:
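# Build the seed dataset and save it as JSON Lines
seed_data = get_seed_dataset(OUTPUT_DIR, OUTPUT_DIR)
seed_data.to_json(OUTPUT_DIR / "final_seed_data.jsonl", orient="records", lines=True)
seed_data  # Last expression in the cell, so the dataset is displayed as output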
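
Before moving on, it can be worth confirming that the saved file is valid JSON Lines, that is, one JSON object per line. The following is a minimal sketch that uses only the standard library; the helper name `count_jsonl_records` is hypothetical.

```python
import json
from pathlib import Path


def count_jsonl_records(path: Path) -> int:
    """Count records in a JSON Lines file, raising ValueError on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines, such as a trailing newline
            # json.JSONDecodeError is a subclass of ValueError
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"Line {line_number} is not a JSON object")
            count += 1
    return count
```

For example, `count_jsonl_records(OUTPUT_DIR / "final_seed_data.jsonl")` returns the number of records written by the previous cell.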


## Next Step

[Synthetic Data Processing](../02_Knowledge_Generation/README.md)