# Enhanced Summary Knowledge Tuning - Data Generation

## Overview

This notebook demonstrates how to generate high-quality knowledge tuning datasets using the SDG Hub framework. It creates multiple types of document augmentations and corresponding question-answer pairs that can be used to train or fine-tune language models for enhanced summarization and knowledge extraction capabilities.

## What This Notebook Does

This notebook will:

1. **Generate Four Types of Knowledge Tuning Datasets**:
   - **Extractive Summaries**: Concise summaries that extract key information directly from source documents
   - **Detailed Summaries**: Comprehensive summaries that provide thorough coverage of document content
   - **Key Facts**: Structured fact extraction with corresponding Q&A pairs
   - **Document-Based Q&A**: Question-answer pairs generated directly from document content

2. **Output Structured Training Data**:
   - Each augmentation is saved as a JSONL dataset.
   - Follow [knowledge_mixing](knowledge_mixing.ipynb) to convert the outputs into a training dataset.

## Prerequisites

- SDG Hub installed and configured
- Environment variables set up (see [.env.example](.env.example)). Specifically, set the model provider, the seed data path, and the output path.
- Document pre-processing completed (run [document_pre_processing.ipynb](document_pre_processing.ipynb) first)

```bash
git clone https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git
cd sdg_hub
pip install ".[examples]"
# Then copy .env.example to .env and set the model endpoint and generation/mixing parameters
```
**⚠️ If you haven't already, run the document pre-processing notebook to create the seed data.**

## Next Steps

After running this notebook, use [knowledge_mixing](knowledge_mixing.ipynb) to combine and curate the generated datasets for final model training.


## Setup Paths and Directories

In [None]:
from pathlib import Path

WORKSPACE = Path.cwd().parent  # Path to the workspace directory

OUTPUT_DIR = WORKSPACE / "output" / "step_02"

OUTPUT_DIR.mkdir(
    parents=True, exist_ok=True
)  # Create the output directory if it doesn't exist




## Teacher Model Configuration

In [None]:
MODEL_NAME = "openai/llama-4-scout-17b-16e-w4a16"
API_KEY = ""  # Provide your API key here
ENDPOINT = "https://llama-4-scout-17b-16e-w4a16-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com:443/v1"

## SDG Configuration

In [None]:
# Required to run the flow with async mode
import nest_asyncio
nest_asyncio.apply()  


RUN_ON_VALIDATION_SET = True  # Whether to run the model on the validation set
SEED_DATA_SUBSAMPLE = 0  # Number of seed rows to keep for faster testing; 0 uses the full dataset

SEED_DATA_FILE = (
    WORKSPACE / "output" / "step_01" / "seed_data.jsonl"
)  # Path to the seed data file generated in step 1


ENABLE_REASONING = False  # Whether to enable reasoning in the model
NUMBER_OF_SUMMARIES = 10  # Number of summaries to generate
MAX_CONCURRENCY = 50  # Maximum number of concurrent requests to the model



if not SEED_DATA_FILE.exists():
    raise FileNotFoundError(
        f"\nSeed data file not found: {SEED_DATA_FILE}\nPlease run step 1 to generate the seed data, or provide the correct path to the seed data file."
    )

In [None]:
# Standard Library
import os

# Third Party
from datasets import load_dataset
from dotenv import load_dotenv

# First Party
from sdg_hub import Flow, FlowRegistry

# Load environment variables from .env file
load_dotenv()
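
The teacher model settings are hardcoded in the configuration cell, while the prerequisites mention a `.env` file. Below is a minimal sketch of reading those settings from the environment and falling back to the hardcoded values. The variable names `SDG_MODEL_NAME`, `SDG_API_KEY`, and `SDG_ENDPOINT` are hypothetical and are not taken from `.env.example`; rename them to match your own file.

```python
import os


def read_model_settings(env=None, defaults=None):
    """Return model, api_key and api_base, preferring environment values.

    The environment variable names are hypothetical placeholders.
    """
    env = os.environ if env is None else env
    defaults = defaults or {}
    names = {
        "model": "SDG_MODEL_NAME",   # hypothetical variable name
        "api_key": "SDG_API_KEY",    # hypothetical variable name
        "api_base": "SDG_ENDPOINT",  # hypothetical variable name
    }
    settings = {}
    for key, var in names.items():
        # An unset or empty variable falls back to the provided default
        settings[key] = env.get(var) or defaults.get(key, "")
    return settings
```

The returned keys match the keyword arguments this notebook passes to `flow.set_model_config`, so one possible use is `flow.set_model_config(**read_model_settings(defaults={"model": MODEL_NAME, "api_key": API_KEY, "api_base": ENDPOINT}))`.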

In [None]:
# Load the seed data generated in step 1
quality_corpus = load_dataset('json', data_files=f"{SEED_DATA_FILE}", split='train')

if SEED_DATA_SUBSAMPLE > 0:
    quality_corpus = quality_corpus.select(range(SEED_DATA_SUBSAMPLE))
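
If `SEED_DATA_SUBSAMPLE` is larger than the number of rows in the seed data, `select(range(SEED_DATA_SUBSAMPLE))` may fail with an out-of-range index error. A small guard, sketched here with a hypothetical helper name `subsample_indices`, clamps the request to the dataset size:

```python
def subsample_indices(total_rows, subsample):
    """Return the row indices to keep; 0 (or less) means keep everything."""
    if subsample <= 0:
        return range(total_rows)
    # Clamp so that asking for more rows than exist stays within bounds
    return range(min(subsample, total_rows))
```

It could be used as `quality_corpus = quality_corpus.select(subsample_indices(len(quality_corpus), SEED_DATA_SUBSAMPLE))`.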

## Run SDG
- This creates a knowledge flow from the provided YAML file.
- We run it on a small dataset for demo purposes.
- For large-scale generation, please use the Python command provided in the next cell.
- You can analyze the generated data to ensure its quality is similar to the provided QnA pairs.

#### Discover the available generation flows

In [None]:
# Auto-discover all available flows (no setup needed!)
FlowRegistry.discover_flows()

# List available flows
flows = FlowRegistry.list_flows()
print(f"Available flows: {flows}")

# You can also search the flows by tag
qa_flows = FlowRegistry.search_flows(tag="question-generation")
print(f"QA flows: {qa_flows}")

### Generate Extractive Summaries Dataset

In [None]:
# Generate data for extractive summary
flow_name = "Extractive Summary Knowledge Tuning Dataset Generation Flow"
flow_path = FlowRegistry.get_flow_path(flow_name)
flow = Flow.from_yaml(flow_path)

# Set model configuration
flow.set_model_config(
    model=MODEL_NAME,
    api_base=ENDPOINT,
    api_key=API_KEY,
)

# Set runtime parameters for the flow's blocks
if ENABLE_REASONING:
    # Increase max tokens to accommodate reasoning content
    runtime_params = {
        'question_generation': {'max_tokens': 1024}, 
        'gen_extractive_summary': {'n': NUMBER_OF_SUMMARIES, 'max_tokens': 6000}
        }
else:
    runtime_params = {
        'gen_extractive_summary': {
            'n': NUMBER_OF_SUMMARIES
        }
    }

extractive_summary_generated_data = flow.generate(quality_corpus, runtime_params=runtime_params, max_concurrency=MAX_CONCURRENCY)

extractive_summary_generated_data.to_json(os.path.join(OUTPUT_DIR, 'extractive_summary', 'gen.jsonl'), orient='records', lines=True)

print(f"✓ Extractive summary: {len(extractive_summary_generated_data)} records")

print(f"✓ Columns: {list(extractive_summary_generated_data.column_names)}")

### Generate Detailed Summary Dataset

In [None]:
# Generate similar data for Detailed Summary
flow_name = "Detailed Summary Knowledge Tuning Dataset Generation Flow"
flow_path = FlowRegistry.get_flow_path(flow_name)
flow = Flow.from_yaml(flow_path)

# Set model configuration
flow.set_model_config(
    model=MODEL_NAME,
    api_base=ENDPOINT,
    api_key=API_KEY,
)

if ENABLE_REASONING:
    # Increase max tokens to accommodate reasoning content
    runtime_params = {
        'question_generation': {'max_tokens': 1024}, 
        'gen_detailed_summary': {'n': NUMBER_OF_SUMMARIES, 'max_tokens': 6000}
        }
else:
    runtime_params = {
        'gen_detailed_summary': {
            'n': NUMBER_OF_SUMMARIES
        }
    }

# Generate data for detailed summary
detailed_summary_generated_data = flow.generate(quality_corpus, runtime_params=runtime_params, max_concurrency=MAX_CONCURRENCY)

detailed_summary_generated_data.to_json(os.path.join(OUTPUT_DIR, 'detailed_summary', 'gen.jsonl'), orient='records', lines=True)

print(f"✓ Detailed summary: {len(detailed_summary_generated_data)} records")

print(f"✓ Columns: {list(detailed_summary_generated_data.column_names)}")

### Generate Key Facts Dataset

In [None]:
# Generate similar data for key facts 
flow_name = "Key Facts Knowledge Tuning Dataset Generation Flow"
flow_path = FlowRegistry.get_flow_path(flow_name)
flow = Flow.from_yaml(flow_path)

# Set model configuration
flow.set_model_config(
    model=MODEL_NAME,
    api_base=ENDPOINT,
    api_key=API_KEY,
)

runtime_params = {}

if ENABLE_REASONING:
    # Increase max tokens for Question Generation to accommodate reasoning content
    runtime_params = {
        'generate_key_fact_qa': {'max_tokens': 6000}, 
        }

# Generate data for key facts
key_facts_generated_data = flow.generate(quality_corpus, runtime_params=runtime_params, max_concurrency=MAX_CONCURRENCY)

key_facts_generated_data.to_json(os.path.join(OUTPUT_DIR, 'key_facts_to_qa', 'gen.jsonl'), orient='records', lines=True)

print(f"✓ Key facts: {len(key_facts_generated_data)} records")

print(f"✓ Columns: {list(key_facts_generated_data.column_names)}")

### Generate Document-Based Q&A Dataset

In [None]:
flow_name = "Document Based Knowledge Tuning Dataset Generation Flow"
flow_path = FlowRegistry.get_flow_path(flow_name)
flow = Flow.from_yaml(flow_path)

# Set model configuration
flow.set_model_config(
    model=MODEL_NAME,
    api_base=ENDPOINT,
    api_key=API_KEY,
)

runtime_params = {}

if ENABLE_REASONING:
    # Increase max tokens to accommodate reasoning content
    runtime_params = {
        'question_generation': {'max_tokens': 2048}, 
        }

document_based_generated_data = flow.generate(quality_corpus, runtime_params=runtime_params, max_concurrency=MAX_CONCURRENCY)
    
document_based_generated_data.to_json(os.path.join(OUTPUT_DIR, 'document_based_qa', 'gen.jsonl'), orient='records', lines=True)

print(f"✓ Document based: {len(document_based_generated_data)} records")

print(f"✓ Columns: {list(document_based_generated_data.column_names)}")
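
The four generation cells above differ mainly in which blocks receive runtime parameters and which extra `max_tokens` values apply when `ENABLE_REASONING` is set. As a sketch only, the hypothetical helper `build_runtime_params` below (not part of SDG Hub) merges a base configuration with reasoning-only overrides; the block names and values in the example are the ones used in the extractive summary cell.

```python
def build_runtime_params(base=None, reasoning_overrides=None, enable_reasoning=False):
    """Merge per-block base parameters with reasoning-only overrides."""
    # Copy the base so the caller's dictionaries are never mutated
    params = {block: dict(values) for block, values in (base or {}).items()}
    if enable_reasoning:
        for block, values in (reasoning_overrides or {}).items():
            params.setdefault(block, {}).update(values)
    return params


# Example: the extractive summary configuration with reasoning enabled
NUMBER_OF_SUMMARIES = 10
extractive_params = build_runtime_params(
    base={"gen_extractive_summary": {"n": NUMBER_OF_SUMMARIES}},
    reasoning_overrides={
        "question_generation": {"max_tokens": 1024},
        "gen_extractive_summary": {"max_tokens": 6000},
    },
    enable_reasoning=True,
)
```

With `enable_reasoning=False` the same call returns only the base entry, which matches the `else` branch of the original cell.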

🎉 You now have all four types of document augmentations (detailed summaries, extractive summaries, key facts, and document-based Q&A) along with their corresponding QA pairs.

✅ Next steps:
   - Combine and curate these datasets to prepare your final training data.
   - For detailed guidance on post-processing, mixing, and formatting the data for model training (including conversion to messages format), please refer to [knowledge_mixing.ipynb](knowledge_mixing.ipynb).
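
Before moving on to mixing, it can help to confirm that every flow actually wrote its output. The sketch below counts records in each `gen.jsonl` file, assuming the directory layout used by the `to_json` calls in this notebook (`<OUTPUT_DIR>/<augmentation>/gen.jsonl`) and one JSON record per line. The helper name `count_jsonl_records` is hypothetical.

```python
from pathlib import Path

# Sub-directory names used by the to_json calls in this notebook
AUGMENTATIONS = [
    "extractive_summary",
    "detailed_summary",
    "key_facts_to_qa",
    "document_based_qa",
]


def count_jsonl_records(output_dir, names):
    """Count non-empty lines in <output_dir>/<name>/gen.jsonl for each name."""
    counts = {}
    for name in names:
        path = Path(output_dir) / name / "gen.jsonl"
        if not path.exists():
            counts[name] = None  # file missing: that flow has not been run yet
            continue
        with path.open(encoding="utf-8") as handle:
            counts[name] = sum(1 for line in handle if line.strip())
    return counts
```

Calling `count_jsonl_records(OUTPUT_DIR, AUGMENTATIONS)` returns a dictionary of record counts, with `None` for any augmentation whose file is missing.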