# Question Generation and Dataset Building

This notebook demonstrates how to:
1. **Generate Questions**: Create multiple questions per chunk using LLMs or heuristics
2. **Build Training Pairs**: Create (query, passage) pairs for fine-tuning
3. **Hard Negative Mining**: Find similar but non-matching chunks for contrastive learning
4. **Train/Val Split**: Split data with grouped splitting to avoid leakage
5. **Save Dataset**: Write the pairs and splits to timestamped files

This workflow creates the training dataset needed for fine-tuning EmbeddingGemma on PDF content.


In [1]:
# Import functions from src modules
from src.data.pdf_to_chunks_pipeline import pdf_to_chunks
from src.data.question_generation_dataset import (
    generate_questions_dataset,
    train_val_split_pairs,
    save_questions_dataset
)
from src.llm.question_generation import get_question_generator, QuestionGenConfig
from src.pipelines.embed_chunks import embed_chunks
import pandas as pd
import numpy as np




## Step 1: Load or Generate Chunks

If you've already run notebook 09, you can load the chunks from the saved parquet file. Otherwise, we'll generate them here.


In [3]:
# Option 1: Load chunks from previously saved parquet file
# chunks_df = pd.read_parquet("data/processed/pdf_chunks_YYYYMMDD_HHMMSS.parquet")

# Option 2: Generate chunks from PDF (if not already done)
pdf_path = "data/mobydick.pdf"  # Update this path
chunks_df = pdf_to_chunks(
    pdf_path=pdf_path,
    max_tokens=512,
    overlap_tokens=64,
    min_tokens=128
)

print(f"Loaded {len(chunks_df)} chunks")
chunks_df.head()


Saved 709 chunks to: data/processed/pdf_chunks_20260109_184228.parquet
Saved metadata to: data/processed/pdf_chunks_metadata_20260109_184228.json
Statistics: {'num_pages': 344, 'num_sections': 22, 'num_chunks': 709, 'total_tokens': 334063, 'avg_tokens_per_chunk': 471.17489421720734, 'min_tokens': 1, 'median_tokens': 490.0, 'max_tokens': 583}
Loaded 709 chunks


| | doc_id | chunk_id | section_id | section_title | page_start | page_end | token_count | char_count | text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | mobydick | mobydick::sec000::chunk00000 | sec_000 | *** START OF THE PROJECT GUTENBERG EBOOK MOBY ... | 1 | 1 | 20 | 69 | *** START OF THE PROJECT GUTENBERG EBOOK MOBY ... |
| 1 | mobydick | mobydick::sec001::chunk00000 | sec_001 | MOBY-DICK; | 1 | 1 | 15 | 43 | MOBY-DICK;\nor, THE WHALE By Herman Melville |
| 2 | mobydick | mobydick::sec002::chunk00000 | sec_002 | CONTENTS | 1 | 1 | 1 | 8 | CONTENTS |
| 3 | mobydick | mobydick::sec003::chunk00000 | sec_003 | ETYMOLOGY. | 1 | 6 | 493 | 1623 | ETYMOLOGY EXTRACTS (Supplied by a Sub-Sub-Libr... |
| 4 | mobydick | mobydick::sec003::chunk00001 | sec_003 | ETYMOLOGY. | 1 | 6 | 508 | 1700 | CHAPTER 52 The Albatross CHAPTER 53 The Gam CH... |


## Step 2: Generate Questions

We'll generate multiple questions per chunk using a question generator. The system supports:
- **HuggingFace Models**: Uses text-generation models (e.g., DialoGPT) to generate questions
- **Heuristic Fallback**: Deterministic question generation if LLM is unavailable

### Question Generation Strategy

Each chunk will have multiple questions generated, creating diverse query-passage pairs. This diversity helps the model learn robust embeddings.
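
To make the fallback idea concrete, here is a minimal sketch of a deterministic, template-based generator. It is not the heuristic used by `src.llm.question_generation`, whose logic is not shown in this notebook; the function name `heuristic_questions`, the templates, and the toy chunk are all assumptions chosen for illustration.

```python
import re

def heuristic_questions(passage: str, section_title: str, n: int = 3) -> list[str]:
    """Hypothetical deterministic fallback: fill fixed templates from the chunk."""
    # Use the first sentence (at most 8 words) as a rough topic hint
    first_sentence = re.split(r"(?<=[.!?])\s+", passage.strip(), maxsplit=1)[0]
    topic = " ".join(first_sentence.split()[:8]).rstrip(".!?")
    templates = [
        f"What does the section '{section_title}' say about {topic}?",
        f"Which passage begins with '{topic}'?",
        f"What is discussed in '{section_title}'?",
    ]
    return templates[:n]

# Toy chunk (made up for this example) with the same keys the notebook uses
chunks = [
    {
        "chunk_id": "doc::sec001::chunk00000",
        "section_title": "Loomings",
        "text": "Call me Ishmael. Some years ago, never mind how long precisely.",
    },
]

# One (query, passage) pair per generated question
pairs = [
    {"chunk_id": c["chunk_id"], "query": q, "passage": c["text"]}
    for c in chunks
    for q in heuristic_questions(c["text"], c["section_title"], n=3)
]
```

Every pair keeps the `chunk_id` of its source chunk, which is what later makes a grouped split possible.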


In [None]:
# Configure question generation
# The system will try HuggingFace first, fall back to heuristic if needed
question_gen_config = QuestionGenConfig(
    model_name="google/gemma-3-27b-it",  # 27B instruction-tuned model; needs substantial GPU memory, swap in a smaller model if required
    num_questions_per_chunk=3,  # Generate 3 questions per chunk
    max_new_tokens=100,
    temperature=0.7
)

# Get question generator (automatically falls back to heuristic if HF fails)
question_generator = get_question_generator(question_gen_config)

# Generate questions for all chunks
# This creates (query, passage) pairs
pairs_df = generate_questions_dataset(
    chunks_df,
    question_generator=question_generator,
    num_questions_per_chunk=3
)

print(f"Generated {len(pairs_df)} query-passage pairs")
print(f"Average questions per chunk: {len(pairs_df) / len(chunks_df):.2f}")
pairs_df.head()


Loading question generation model: google/gemma-3-27b-it
Using device: cuda



## Step 3: Examine Generated Questions

Let's look at some example question-passage pairs to verify quality.


In [None]:
# Display example pairs
print("Example Query-Passage Pairs:")
print("=" * 60)

for idx in range(min(5, len(pairs_df))):
    pair = pairs_df.iloc[idx]
    print(f"\nPair {idx + 1}:")
    print(f"  Query:    '{pair['query']}'")
    print(f"  Passage:  '{pair['passage'][:150]}...'")
    print(f"  Section:  {pair['section_title']}")
    print(f"  Pages:    {pair['page_start']}-{pair['page_end']}")


## Step 4: Hard Negative Mining (Optional)

Hard negative mining finds chunks that are semantically similar but not the correct answer to a query. This improves contrastive learning by providing challenging negative examples.

### How Hard Negatives Work

1. Embed all chunks (if not already done)
2. For each query, find the most similar chunks (excluding the correct passage)
3. Add these as negative examples to the training data
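
The three steps above can be sketched on toy vectors with plain NumPy. This is an illustrative version under stated assumptions, not the notebook's implementation: `mine_hard_negatives` is a hypothetical helper, the 2-D embeddings are made up, and `positive_idx` is assumed to hold the row position of each query's correct chunk.

```python
import numpy as np

def mine_hard_negatives(query_emb, chunk_emb, positive_idx, k=3):
    """Return, per query, the positions of the k most similar non-positive chunks."""
    # L2-normalise so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = chunk_emb / np.linalg.norm(chunk_emb, axis=1, keepdims=True)
    sims = q @ c.T  # shape: (n_queries, n_chunks)
    # Mask each query's own positive so it can never be selected
    sims[np.arange(len(q)), positive_idx] = -np.inf
    # Sort by descending similarity and keep the first k columns
    return np.argsort(-sims, axis=1)[:, :k]

# Toy data: 4 chunks and 2 queries in a 2-D embedding space
chunk_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query_emb = np.array([[1.0, 0.05], [0.1, 1.0]])
positive_idx = np.array([0, 2])  # correct chunk position for each query

negatives = mine_hard_negatives(query_emb, chunk_emb, positive_idx, k=2)
print(negatives.tolist())  # [[1, 2], [1, 0]]
```

Computing the full similarity matrix in one matrix product avoids a Python loop over queries; for corpora much larger than a single book, an approximate nearest-neighbour index would be the usual substitute.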


In [None]:
# Step 4a: Embed chunks if not already done
# (You can skip this if embeddings were generated in notebook 09)
from src.models.embedding_pipeline import load_embeddinggemma_model, embed_texts
import torch

tokenizer, model = load_embeddinggemma_model()
device = next(model.parameters()).device

# Embed all chunks in batches
chunk_texts = chunks_df['text'].tolist()
batch_size = 64
all_embeddings = []

for i in range(0, len(chunk_texts), batch_size):
    batch = chunk_texts[i:i+batch_size]
    batch_embeddings = embed_texts(batch, model, tokenizer, device=device, max_length=512)
    # Move to CPU before converting; calling .numpy() directly fails on CUDA tensors
    all_embeddings.append(batch_embeddings.detach().cpu().numpy())

chunk_embeddings = np.vstack(all_embeddings)
print(f"Embedded {len(chunk_embeddings)} chunks")

# Step 4b: Find hard negatives for each query
from sklearn.metrics.pairwise import cosine_similarity

# Embed queries in batches, mirroring the chunk loop above
query_texts = pairs_df['query'].tolist()
query_embeddings = []

for i in range(0, len(query_texts), batch_size):
    batch = query_texts[i:i+batch_size]
    query_emb = embed_texts(batch, model, tokenizer, device=device, max_length=512)
    query_embeddings.append(query_emb.detach().cpu().numpy())

query_embeddings = np.vstack(query_embeddings)

# Map each chunk_id to its row position so lookups do not depend on DataFrame index labels
chunk_pos = {cid: pos for pos, cid in enumerate(chunks_df['chunk_id'])}

# For each query, find similar chunks (excluding the correct passage)
hard_negatives = []
for pos, (_, row) in enumerate(pairs_df.iterrows()):
    # Use the row position, not the index label, to address the embedding matrix
    query_emb = query_embeddings[pos:pos+1]
    similarities = cosine_similarity(query_emb, chunk_embeddings)[0]
    
    # Get the row position of the correct chunk
    correct_chunk_idx = chunk_pos[row['chunk_id']]
    
    # Find top similar chunks (excluding correct one)
    top_indices = np.argsort(similarities)[::-1]
    top_indices = [i for i in top_indices if i != correct_chunk_idx][:3]  # Top 3 hard negatives
    
    hard_negatives.append({
        'query': row['query'],
        'positive': row['passage'],
        'hard_negatives': [chunks_df.iloc[i]['text'] for i in top_indices]
    })

print(f"Found hard negatives for {len(hard_negatives)} queries")
print("\nExample hard negative:")
print(f"  Query: {hard_negatives[0]['query']}")
print(f"  Positive: {hard_negatives[0]['positive'][:100]}...")
print(f"  Hard negatives: {len(hard_negatives[0]['hard_negatives'])} chunks")


## Step 5: Train/Val Split

We'll split the data into training and validation sets using **grouped splitting**. This ensures that all pairs from the same chunk go to the same split, preventing data leakage.

### Why Grouped Splitting?

With a random split over pairs, questions generated from the same chunk can land in both the train and val sets. The model then trains on a passage and is evaluated on that same passage, so validation metrics overstate how well it generalizes to unseen text. Grouped splitting prevents this by assigning each chunk, with all of its questions, to exactly one split.
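
A grouped split can be sketched in a few lines of pandas and NumPy. The internals of `train_val_split_pairs` are not shown in this notebook, so the helper below (`grouped_split`, a hypothetical name) is only one plausible way to do it: shuffle the unique group ids, reserve a fraction of them for validation, and route every pair by its group.

```python
import numpy as np
import pandas as pd

def grouped_split(pairs, group_col="chunk_id", val_ratio=0.2, seed=42):
    """Assign whole groups to train or val so that no group spans both."""
    # Sort the unique ids first so the shuffle is reproducible for a given seed
    groups = np.array(sorted(pairs[group_col].unique()))
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(groups)
    n_val = max(1, int(round(len(shuffled) * val_ratio)))
    val_groups = shuffled[:n_val].tolist()
    is_val = pairs[group_col].isin(val_groups)
    return pairs[~is_val].reset_index(drop=True), pairs[is_val].reset_index(drop=True)

# Toy data: 10 chunks with 3 questions each
toy = pd.DataFrame({
    "chunk_id": [f"c{i}" for i in range(10) for _ in range(3)],
    "query": [f"q{i}_{j}" for i in range(10) for j in range(3)],
})
train_toy, val_toy = grouped_split(toy, val_ratio=0.2)

# Leakage check: the two splits must share no chunk_id
overlap = set(train_toy["chunk_id"]) & set(val_toy["chunk_id"])
print(len(train_toy), len(val_toy), len(overlap))  # 24 6 0
```

The same disjointness check can be run on the real `train_df` and `val_df` after the split cell below.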


In [None]:
# Split into train/val with grouped splitting
train_df, val_df = train_val_split_pairs(
    pairs_df,
    val_ratio=0.2,  # 20% for validation
    random_state=42,
    group_by='chunk_id'  # Group by chunk to prevent leakage
)

print(f"Training pairs: {len(train_df)}")
print(f"Validation pairs: {len(val_df)}")
print(f"\nTrain/Val split statistics:")
print(f"  Train chunks: {train_df['chunk_id'].nunique()}")
print(f"  Val chunks: {val_df['chunk_id'].nunique()}")


## Step 6: Save Dataset

Save all the generated pairs to timestamped files for use in training.
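
For reference, a timestamped filename in the same `YYYYMMDD_HHMMSS` style as the chunk files from Step 1 can be built as sketched below. The helper `timestamped_path` and the stem `qa_pairs_train` are hypothetical; the actual names and file formats written by `save_questions_dataset` are not shown here.

```python
from datetime import datetime
from pathlib import Path

def timestamped_path(out_dir, stem, suffix, now=None):
    """Hypothetical helper: build '<out_dir>/<stem>_<YYYYMMDD_HHMMSS><suffix>'."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(out_dir) / f"{stem}_{stamp}{suffix}"

# A fixed datetime is passed only to make the example reproducible
example = timestamped_path(
    "data/processed", "qa_pairs_train", ".csv",
    now=datetime(2026, 1, 9, 18, 42, 28),
)
print(example.name)  # qa_pairs_train_20260109_184228.csv
```

Because the stamp sorts lexicographically in chronological order, the most recent file for a given stem is the last one in a sorted directory listing.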


In [None]:
# Save all datasets to timestamped files
saved_paths = save_questions_dataset(
    pairs_df,
    train_df=train_df,
    val_df=val_df
)

print("Saved files:")
for key, path in saved_paths.items():
    print(f"  {key}: {path}")


## Summary

This notebook demonstrated:
1. ✅ Question generation from chunks (using LLM or heuristics)
2. ✅ Building query-passage pairs for training
3. ✅ Hard negative mining for improved contrastive learning
4. ✅ Grouped train/val splitting to prevent data leakage
5. ✅ Saving datasets to timestamped files

**Next Steps:**
- Proceed to notebook `11_LoRA_FineTune_EmbeddingGemma_on_PDF_QA.ipynb` to fine-tune the model
- The training CSV files are ready to use with the existing training framework
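- Hard negatives from Step 4 live only in the in-memory `hard_negatives` list; they are not passed to `save_questions_dataset` above, so persist them separately if you plan to use them in advanced training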
