# Synthetic Query Generation Tutorial: Building RAG Evaluation Datasets

## Learning Objectives

By completing this tutorial, you will be able to:
- ✅ Extract salient facts from recipes using LLM prompting
- ✅ Generate realistic natural language queries from facts
- ✅ Use two-step LLM prompting (facts → queries) for better quality
- ✅ Implement parallel processing with ThreadPoolExecutor
- ✅ Validate query quality and filter low-quality examples
- ✅ Optimize costs for large-scale query generation
- ✅ Export queries in evaluation-ready format

## Prerequisites

- Completed [RAG Evaluation Concepts](rag_evaluation_concepts.md)
- Have processed recipe dataset ready
- Understanding of parallel processing concepts

## Estimated Time

**Execution Time:** 20-30 minutes  
**Cost:** ~$0.50-2.00 for 100 queries (GPT-4o-mini)

---

## Setup

In [1]:
# Standard library imports
import os
import sys
import json
import random
from pathlib import Path
from typing import List, Dict, Any, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed

# Data manipulation
import pandas as pd
import numpy as np

# Progress tracking
from tqdm.notebook import tqdm

# LLM API
import litellm

# Environment configuration
from dotenv import load_dotenv
load_dotenv()

# Set random seed
random.seed(42)
np.random.seed(42)

print("✅ Setup complete")

✅ Setup complete


In [2]:
# ========================================
# CONFIGURATION: Demo vs Full Mode
# ========================================

# Set DEMO_MODE = False to generate full dataset
DEMO_MODE = True  # Default: Quick demo for tutorial

if DEMO_MODE:
    MAX_RECIPES = 10  # Process 10 recipes
    MAX_WORKERS = 5
    print("🚀 DEMO MODE: Generating small query sample")
    print(f"   Recipes: {MAX_RECIPES} | Workers: {MAX_WORKERS}")
    print(f"   Expected queries: ~{MAX_RECIPES} queries")
    print(f"   Estimated cost: $0.05-0.15 | Time: ~1-2 minutes")
else:
    MAX_RECIPES = 100  # Process 100 recipes
    MAX_WORKERS = 10
    print("📊 FULL MODE: Generating comprehensive query dataset")
    print(f"   Recipes: {MAX_RECIPES} | Workers: {MAX_WORKERS}")
    print(f"   Expected queries: ~{MAX_RECIPES} queries")
    print(f"   Estimated cost: $1.50-2.00 | Time: ~3-5 minutes")

print("\n💡 To switch modes, change DEMO_MODE in this cell and re-run notebook")

🚀 DEMO MODE: Generating small query sample
   Recipes: 10 | Workers: 5
   Expected queries: ~10 queries
   Estimated cost: $0.05-0.15 | Time: ~1-2 minutes

💡 To switch modes, change DEMO_MODE in this cell and re-run notebook


## Part 1: Load Processed Recipes

Load the recipe dataset produced by `scripts/process_recipes.py`:

In [3]:
# Load processed recipes
recipes_path = 'data/processed_recipes.json'

if not os.path.exists(recipes_path):
    print(f"❌ File not found: {recipes_path}")
    print("Run scripts/process_recipes.py first to create processed recipes.")
else:
    with open(recipes_path, 'r') as f:
        recipes = json.load(f)
    
    print(f"✅ Loaded {len(recipes)} recipes")
    print(f"\n📋 Sample recipe:")
    sample = recipes[0]
    print(f"ID: {sample['id']}")
    print(f"Name: {sample['name']}")
    print(f"Ingredients: {len(sample.get('ingredients', []))}")
    print(f"Steps: {len(sample.get('steps', []))}")

✅ Loaded 200 recipes

📋 Sample recipe:
ID: 65007
Name: 5 cheese crab lasagna with roasted garlic and vegetables
Ingredients: 24
Steps: 108


## Part 2: Two-Step Query Generation Strategy

### Why Two Steps?

**One-step approach (naive):**
- Prompt: "Generate a query for this recipe"
- Result: Generic queries ("How to make [recipe name]?")

**Two-step approach (better):**
1. **Step 1:** Extract salient facts (specific, technical details)
2. **Step 2:** Generate query targeting those facts
- Result: Specific queries ("What air fryer temperature for crispy chicken?")

### Step 1: Extract Salient Facts

**Goal:** Identify 1-2 specific technical details that:
- Are difficult to generate from scratch
- Are clearly answerable by this recipe
- Test retrieval capability (not just generation)

In [4]:
def format_recipe_for_llm(recipe: Dict[str, Any]) -> str:
    """Format recipe for LLM processing."""
    formatted = f"**{recipe.get('name', 'Unknown')}**\n"
    
    if recipe.get('description'):
        formatted += f"Description: {recipe['description']}\n"
    
    if recipe.get('minutes'):
        formatted += f"Cooking time: {recipe['minutes']} minutes\n"
    
    if recipe.get('ingredients'):
        formatted += f"Ingredients: {', '.join(recipe['ingredients'][:10])}\n"
    
    if recipe.get('steps'):
        steps_text = ' '.join(recipe['steps'][:5])  # First 5 steps
        formatted += f"Steps: {steps_text[:500]}...\n"
    
    return formatted

def extract_salient_facts(recipe: Dict[str, Any], model: str = "gpt-4o-mini") -> str:
    """Extract salient facts from recipe using LLM."""
    
    recipe_text = format_recipe_for_llm(recipe)
    
    prompt = f"""Analyze this recipe and identify 1-2 specific, technical details that would be difficult to generate from scratch but are clearly answerable by this exact recipe. Focus on:

1. **Specific cooking techniques/methods** (e.g., "marinate for 4 hours", "bake at 375°F for exactly 25 minutes")
2. **Appliance settings** (e.g., "air fryer at 400°F for 12 minutes", "pressure cook for 8 minutes")  
3. **Ingredient preparation details** (e.g., "slice onions paper-thin", "whip cream to soft peaks")
4. **Timing specifics** (e.g., "rest dough for 30 minutes", "simmer for 45 minutes")
5. **Temperature precision** (e.g., "internal temp 165°F", "oil heated to 350°F")

Return the most distinctive fact(s) that someone might specifically search for:

Recipe:
{recipe_text}

Salient Fact(s):"""
    
    try:
        response = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=200
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"❌ Error extracting facts: {e}")
        return ""

# Demo: Extract facts from sample recipe
print("🧪 Extracting salient facts from sample recipe...\n")
sample_recipe = recipes[0]
print(f"Recipe: {sample_recipe['name']}\n")

salient_facts = extract_salient_facts(sample_recipe)
print(f"✅ Salient Facts:\n{salient_facts}")

🧪 Extracting salient facts from sample recipe...

Recipe: 5 cheese crab lasagna with roasted garlic and vegetables

✅ Salient Facts:
1. **Specific Cooking Technique/Method**: "Roast garlic: place oven rack on second notch, turn oven to 375°F." This detail specifies both the oven temperature and the positioning of the rack, which are crucial for properly roasting garlic.

2. **Ingredient Preparation Detail**: "Cut tops off of the heads of garlic and discard excess skin." This instruction provides a precise method for preparing the garlic, which is essential for the roasting process and affects the final flavor and texture of the dish.
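
The response above is a numbered list rather than a single fact. If you want to target one fact per query, a small parser can split it. The following is a minimal sketch, assuming the model keeps the `1.`, `2.` numbering at the start of a line; `split_numbered_facts` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import List

def split_numbered_facts(text: str) -> List[str]:
    """Split text like '1. ...\\n\\n2. ...' into separate fact strings."""
    # Split wherever a line starts with "<digits>. " (optionally indented).
    parts = re.split(r"^[ \t]*\d+\.\s+", text, flags=re.MULTILINE)
    # Drop the empty piece before the first marker and trim whitespace.
    return [part.strip() for part in parts if part.strip()]

example = (
    '1. **Technique**: "Roast garlic at 375 degrees."\n\n'
    '2. **Prep**: "Cut tops off the heads of garlic."'
)
for fact in split_numbered_facts(example):
    print(fact)
```

Text without numbering comes back as a single-item list, so the helper is safe to apply to every response.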


### Step 2: Generate Realistic Query

**Goal:** Create a natural, conversational query that:
- Sounds like a real person asking
- Focuses on the salient fact
- Is challenging (requires this recipe to answer)
- Avoids mentioning recipe name directly
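
The last requirement can be checked automatically after generation. The sketch below flags a query when most of the recipe name's words appear in it; the function name `mentions_recipe_name` and the 0.8 overlap threshold are assumptions chosen for illustration, not values from this tutorial.

```python
import re
from typing import Set

def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def mentions_recipe_name(query: str, recipe_name: str, min_overlap: float = 0.8) -> bool:
    """Return True if at least `min_overlap` of the name's tokens occur in the query."""
    name_tokens = _tokens(recipe_name)
    if not name_tokens:
        return False
    overlap = len(name_tokens & _tokens(query)) / len(name_tokens)
    return overlap >= min_overlap

name = "5 cheese crab lasagna with roasted garlic and vegetables"
leaky = "How do I make 5 cheese crab lasagna with roasted garlic and vegetables?"
clean = "What's the best way to prepare garlic for roasting, and what oven setup should I use?"
print(mentions_recipe_name(leaky, name), mentions_recipe_name(clean, name))
```

A token-overlap check is crude (it ignores word order and stemming), but it is cheap enough to run on every generated query before export.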

In [5]:
def generate_realistic_query(recipe: Dict[str, Any], salient_fact: str, model: str = "gpt-4o-mini") -> str:
    """Generate realistic user query from salient fact."""
    
    recipe_name = recipe.get('name', 'Unknown Recipe')
    ingredients = ', '.join(recipe.get('ingredients', [])[:5])
    
    prompt = f"""Create a realistic, specific user query that a home cook might ask, which can ONLY be answered well by this exact recipe. The query should:

1. Sound natural and conversational (like a real person asking)
2. Focus on the specific technical detail: "{salient_fact}"
3. Be challenging - requiring this exact recipe's information to answer properly
4. Avoid mentioning the recipe name directly

Context:
- Recipe: {recipe_name}
- Key ingredients: {ingredients}
- Salient fact: {salient_fact}

Examples of good query styles:
- "What temperature and time for air fryer frozen chicken tenders?"
- "How long should I marinate beef for Korean bulgogi?"
- "What's the exact oven temperature for crispy roasted vegetables?"
- "How do I get the right consistency for homemade pasta dough?"

Generate ONE specific query:"""
    
    try:
        response = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=100
        )
        return response.choices[0].message.content.strip().strip('"')
    except Exception as e:
        print(f"❌ Error generating query: {e}")
        return ""

# Demo: Generate query from salient facts
print("🧪 Generating query from salient facts...\n")
print(f"Salient Facts: {salient_facts}\n")

query = generate_realistic_query(sample_recipe, salient_facts)
print(f"✅ Generated Query:\n{query}")

🧪 Generating query from salient facts...

Salient Facts: 1. **Specific Cooking Technique/Method**: "Roast garlic: place oven rack on second notch, turn oven to 375°F." This detail specifies both the oven temperature and the positioning of the rack, which are crucial for properly roasting garlic.

2. **Ingredient Preparation Detail**: "Cut tops off of the heads of garlic and discard excess skin." This instruction provides a precise method for preparing the garlic, which is essential for the roasting process and affects the final flavor and texture of the dish.

✅ Generated Query:
What's the best way to prepare garlic for roasting, especially in terms of cutting it and what oven setup I should use?


## Part 3: Parallel Query Generation

Each recipe needs two LLM calls (fact extraction, then query generation), so the work is I/O-bound and parallelizes well with `ThreadPoolExecutor`. The cell below defines a per-recipe worker and a batch driver, then runs them on a 10-recipe sample. To run at the scale configured in the setup cell, pass `MAX_WORKERS` and `MAX_RECIPES` to `batch_generate_queries` after it has been defined.

In [7]:
def process_single_recipe(recipe: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Process a single recipe to generate a query."""
    try:
        # Step 1: Extract salient facts
        salient_fact = extract_salient_facts(recipe)
        
        if not salient_fact or len(salient_fact.strip()) < 10:
            return None
        
        # Step 2: Generate query
        query = generate_realistic_query(recipe, salient_fact)
        
        if not query or len(query.strip()) < 10:
            return None
        
        return {
            "query": query,
            "salient_fact": salient_fact,
            "source_recipe_id": recipe['id'],
            "source_recipe_name": recipe['name'],
            "ingredients": recipe.get('ingredients', []),
            "cooking_time": recipe.get('minutes', 0),
            "tags": recipe.get('tags', [])
        }
        
    except Exception as e:
        print(f"❌ Error processing recipe {recipe.get('id', 'unknown')}: {e}")
        return None

def batch_generate_queries(recipes: List[Dict[str, Any]],
                           max_workers: int = 10,
                           max_recipes: Optional[int] = None) -> List[Dict[str, Any]]:
    """Generate queries in parallel using ThreadPoolExecutor."""
    
    if max_recipes:
        recipes = recipes[:max_recipes]
    
    queries = []
    
    print(f"📊 Processing {len(recipes)} recipes with {max_workers} workers...\n")
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all tasks
        futures = {executor.submit(process_single_recipe, recipe): recipe 
                   for recipe in recipes}
        
        # Process completed tasks with progress bar
        for future in tqdm(as_completed(futures), total=len(futures), desc="Generating queries"):
            try:
                result = future.result()
                if result:
                    queries.append(result)
            except Exception as e:
                print(f"❌ Task failed: {e}")
    
    return queries

# Demo: Generate queries for 10 recipes
print("⏳ Generating queries for 10 sample recipes...\n")
print("⚠️  Cost estimate: ~$0.05-0.10\n")

sample_queries = batch_generate_queries(recipes, max_workers=5, max_recipes=10)

print(f"\n✅ Generated {len(sample_queries)} queries")
print(f"Success rate: {len(sample_queries)/10:.1%}")

⏳ Generating queries for 10 sample recipes...

⚠️  Cost estimate: ~$0.05-0.10

📊 Processing 10 recipes with 5 workers...



Generating queries:   0%|          | 0/10 [00:00<?, ?it/s]


✅ Generated 10 queries
Success rate: 100.0%
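
With several workers calling the API at once, transient failures such as rate limits become more likely, and the functions above simply return an empty string when a call fails. One common mitigation is retrying with exponential backoff. The sketch below is library-agnostic; `call_with_retries` and its parameters are hypothetical names, and the `sleep` argument exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retries(fn: Callable[[], T],
                      max_attempts: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")

# Demo with a function that fails twice before succeeding
state = {"calls": 0}

def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

print(call_with_retries(flaky, base_delay=0.01))
```

Because `extract_salient_facts` and `generate_realistic_query` catch exceptions internally, a wrapper like this would need to go around the `litellm.completion` call itself to have any effect.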


## Part 4: Quality Review

### Review Generated Queries

Manually inspect a sample to ensure quality:

In [8]:
# Display sample queries
print("📋 Sample Generated Queries:\n")

for i, query_data in enumerate(sample_queries[:5], 1):
    print(f"{'='*80}")
    print(f"Query {i}: {query_data['query']}")
    print(f"Source Recipe: {query_data['source_recipe_name']}")
    print(f"Salient Fact: {query_data['salient_fact']}")
    print()

📋 Sample Generated Queries:

Query 1: I'm trying to make a baked good with a nice, crumbly texture, but I'm not sure how to properly incorporate the butter into the flour mixture. What technique should I use, and what’s the best baking temperature and time to ensure it turns out perfectly?
Source Recipe: all purpose quick mix with 28 variations
Salient Fact: 1. **Baking Temperature and Time**: The recipe specifies to "bake at 350 degrees for 30 minutes or until done." This precise temperature and timing detail is crucial for achieving the desired texture and doneness of the baked goods.

2. **Preparation Technique**: The instruction to "cut butter or margarine into flour mixture until it resembles coarse cornmeal" is a specific technique that involves a method known as "cutting in." This detail is important for achieving the right consistency in the mixture, which can be difficult to determine without this guidance.

Query 2: What water temperature should I use for my bread dough 

### Quality Validation Checks

In [10]:
def validate_query_quality(queries: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Validate query quality with automated checks."""
    if not queries:
        raise ValueError("queries must be non-empty")  # avoids division by zero below

    # Length checks
    avg_query_length = np.mean([len(q['query'].split()) for q in queries])
    min_query_length = min([len(q['query'].split()) for q in queries])
    max_query_length = max([len(q['query'].split()) for q in queries])
    
    # Check for diversity (unique queries)
    unique_queries = len(set(q['query'].lower() for q in queries))
    diversity_rate = unique_queries / len(queries)
    
    # Check for question marks (conversational style)
    queries_with_questions = sum(1 for q in queries if '?' in q['query'])
    question_rate = queries_with_questions / len(queries)
    
    return {
        'total_queries': len(queries),
        'avg_query_length_words': avg_query_length,
        'min_query_length_words': min_query_length,
        'max_query_length_words': max_query_length,
        'unique_queries': unique_queries,
        'diversity_rate': diversity_rate,
        'queries_with_questions': queries_with_questions,
        'question_rate': question_rate
    }

# Validate quality
quality_stats = validate_query_quality(sample_queries)

print("📊 Quality Statistics:\n")
for key, value in quality_stats.items():
    if isinstance(value, float):
        if 'rate' in key:
            print(f"{key}: {value:.1%}")
        else:
            print(f"{key}: {value:.1f}")
    else:
        print(f"{key}: {value}")

# Quality assessment
print("\n✅ Quality Assessment:")
if quality_stats['avg_query_length_words'] > 8:
    print("✅ Query length is good (detailed, specific)")
else:
    print("⚠️  Queries may be too short")

if quality_stats['diversity_rate'] > 0.95:
    print("✅ High diversity (queries are unique)")
else:
    print("⚠️  Some duplicate queries detected")

if quality_stats['question_rate'] > 0.7:
    print("✅ Good conversational style (question format)")
else:
    print("⚠️  Many queries are not in question format")

📊 Quality Statistics:

total_queries: 10
avg_query_length_words: 30.4
min_query_length_words: 18
max_query_length_words: 54
unique_queries: 10
diversity_rate: 100.0%
queries_with_questions: 10
question_rate: 100.0%

✅ Quality Assessment:
✅ Query length is good (detailed, specific)
✅ High diversity (queries are unique)
✅ Good conversational style (question format)
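
The statistics above describe the batch but do not remove anything. A filtering pass could look like the sketch below, which drops queries outside a word-count range and case-insensitive duplicates. The name `filter_queries` and the default bounds of 5 and 60 words are assumptions for illustration; tune them against your own data.

```python
from typing import Any, Dict, List

def filter_queries(queries: List[Dict[str, Any]],
                   min_words: int = 5,
                   max_words: int = 60) -> List[Dict[str, Any]]:
    """Keep records whose query length is in range and not a duplicate."""
    kept: List[Dict[str, Any]] = []
    seen = set()
    for record in queries:
        text = str(record.get("query", "")).strip()
        n_words = len(text.split())
        if not (min_words <= n_words <= max_words):
            continue  # too short or too long
        key = text.lower()
        if key in seen:
            continue  # duplicate, ignoring case
        seen.add(key)
        kept.append(record)
    return kept

demo = [
    {"query": "How long should the dough rest before baking?"},
    {"query": "how long should the dough rest before baking?"},
    {"query": "Too short?"},
]
print(len(filter_queries(demo)))
```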


## Part 5: Cost Optimization

### Cost Analysis

In [11]:
# Estimate costs for different scales
cost_per_query_gpt4o_mini = 0.015  # $0.015 per query (2 calls × ~$0.0075 avg)
cost_per_query_gpt4o = 0.05        # $0.05 per query (2 calls × ~$0.025 avg)

scales = [10, 50, 100, 200, 500]

print("💰 Cost Estimates:\n")
print(f"{'Num Queries':<15} {'GPT-4o-mini':<15} {'GPT-4o'}")
print("-" * 45)

for scale in scales:
    cost_mini = scale * cost_per_query_gpt4o_mini
    cost_4o = scale * cost_per_query_gpt4o
    print(f"{scale:<15} ${cost_mini:<14.2f} ${cost_4o:.2f}")

print("\n📝 Notes:")
print("- Use GPT-4o-mini for cost efficiency (recommended)")
print("- Use GPT-4o if query quality is critical")
print("- Parallel processing reduces time but not cost")

💰 Cost Estimates:

Num Queries     GPT-4o-mini     GPT-4o
---------------------------------------------
10              $0.15           $0.50
50              $0.75           $2.50
100             $1.50           $5.00
200             $3.00           $10.00
500             $7.50           $25.00

📝 Notes:
- Use GPT-4o-mini for cost efficiency (recommended)
- Use GPT-4o if query quality is critical
- Parallel processing reduces time but not cost
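
The flat per-query figures above are rough averages. If you know your typical prompt and response sizes, the same estimate can be derived from token counts. This is a sketch only: `estimate_cost_usd` is a hypothetical helper, and the prices in the example are placeholders, not real provider prices, so substitute current values from your provider's pricing page.

```python
def estimate_cost_usd(num_queries: int,
                      input_tokens_per_call: int,
                      output_tokens_per_call: int,
                      input_price_per_million: float,
                      output_price_per_million: float,
                      calls_per_query: int = 2) -> float:
    """Estimate total cost from token counts and per-million-token prices."""
    cost_per_call = (
        input_tokens_per_call * input_price_per_million
        + output_tokens_per_call * output_price_per_million
    ) / 1_000_000
    # Two calls per query by default: fact extraction + query generation
    return num_queries * calls_per_query * cost_per_call

# Placeholder prices (NOT real): $1.00 input and $2.00 output per million tokens
total = estimate_cost_usd(100, 600, 150, 1.00, 2.00)
print(f"Estimated cost for 100 queries: ${total:.2f}")
```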


### Tips for Cost Optimization

1. **Start small**: Generate 20-30 queries, validate quality
2. **Iterate on prompts**: Refine prompts before scaling to 100+
3. **Use mini models**: Under the estimates above, GPT-4o-mini is about 70% cheaper per query, with comparable quality for this task
4. **Cache results**: Save intermediate results to avoid re-processing
5. **Filter early**: Remove invalid recipes before LLM calls
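
Tip 4 can be sketched as a small on-disk cache keyed by recipe ID, so that re-running the notebook skips recipes that already have a result. The helper name `cached_generate` and the `cache/queries` directory are assumptions for this example; `generate` stands in for any function shaped like `process_single_recipe`.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Optional

def cached_generate(recipe: Dict[str, Any],
                    generate: Callable[[Dict[str, Any]], Optional[Dict[str, Any]]],
                    cache_dir: str = "cache/queries") -> Optional[Dict[str, Any]]:
    """Return the cached result for this recipe, or compute and store it."""
    path = Path(cache_dir) / f"{recipe['id']}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = generate(recipe)
    if result is not None:  # only successful results are cached
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Failed recipes are deliberately not cached, so a later run gets another chance at them. If you change a prompt, clear the cache directory, since cached results reflect the old prompt.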

## Part 6: Export for Evaluation

Save queries in evaluation-ready format:

In [12]:
# Create data directory if needed
os.makedirs('data', exist_ok=True)

# Save queries
output_path = 'data/synthetic_queries.json'

with open(output_path, 'w') as f:
    json.dump(sample_queries, f, indent=2)

print(f"✅ Saved {len(sample_queries)} queries to {output_path}")

# Also save as CSV for easy viewing
queries_df = pd.DataFrame(sample_queries)
queries_df[['query', 'source_recipe_name', 'salient_fact']].to_csv('data/synthetic_queries.csv', index=False)

print(f"✅ Saved CSV to data/synthetic_queries.csv")

✅ Saved 10 queries to data/synthetic_queries.json
✅ Saved CSV to data/synthetic_queries.csv
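
Each exported record pairs a query with the `source_recipe_id` it was generated from, which serves as the ground-truth document for retrieval evaluation. As a rough illustration of how the file might be consumed, the sketch below computes recall@k. The `retrieve` callable is a hypothetical stand-in for your own retriever (for example BM25) and is assumed to return a ranked list of recipe IDs.

```python
from typing import Any, Callable, Dict, List

def recall_at_k(queries: List[Dict[str, Any]],
                retrieve: Callable[[str, int], List[Any]],
                k: int = 5) -> float:
    """Fraction of queries whose source recipe appears in the top-k results."""
    if not queries:
        return 0.0
    hits = 0
    for record in queries:
        top_ids = retrieve(record["query"], k)[:k]
        if record["source_recipe_id"] in top_ids:
            hits += 1
    return hits / len(queries)

# Toy retriever that always returns the same ranking, for demonstration only
def toy_retrieve(query: str, k: int) -> List[int]:
    return [65007, 12345][:k]

records = [
    {"query": "How do I prep garlic for roasting?", "source_recipe_id": 65007},
    {"query": "What water temperature for dough?", "source_recipe_id": 99999},
]
print(recall_at_k(records, toy_retrieve, k=2))
```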


## Part 7: Full-Scale Generation (Optional)

⚠️ **Cost Warning**: Generating 100 queries costs ~$1.50-2.00 with GPT-4o-mini

Uncomment and run to generate full evaluation dataset:

In [13]:
# # Uncomment to generate 100 queries
# print("⏳ Generating 100 queries for full evaluation...\n")
# print("⚠️  Estimated cost: $1.50-2.00")
# print("⏱️  Estimated time: 3-5 minutes\n")
#
# full_queries = batch_generate_queries(recipes, max_workers=10, max_recipes=100)
#
# print(f"\n✅ Generated {len(full_queries)} queries")
# print(f"Success rate: {len(full_queries)/100:.1%}")
#
# # Save
# with open('data/synthetic_queries_full.json', 'w') as f:
#     json.dump(full_queries, f, indent=2)
#
# print("✅ Saved to data/synthetic_queries_full.json")

## Summary

### What We've Accomplished

✅ Loaded and explored processed recipe dataset  
✅ Implemented two-step query generation (facts → queries)  
✅ Built parallel processing pipeline with ThreadPoolExecutor  
✅ Generated sample queries with quality validation  
✅ Analyzed costs and optimization strategies  
✅ Exported queries in evaluation-ready format  

### Next Steps

1. **Scale up**: Generate 100+ queries for comprehensive evaluation
2. **Manual review**: Validate 10-20% of queries for quality
3. **Retrieval evaluation**: Use [Retrieval Metrics Tutorial](retrieval_metrics_tutorial.md)
4. **BM25 evaluation**: Run `scripts/evaluate_retrieval.py`
5. **Analyze results**: Identify query types that work well vs. poorly

### Key Takeaways

- ✅ **Two-step generation produces better queries** - Extracting facts first (facts → queries) yields more specific queries than direct generation
- ✅ **Parallel processing enables scale** - Up to roughly 10x speedup with 10 ThreadPoolExecutor workers on these I/O-bound API calls, subject to provider rate limits
- ✅ **Quality validation is critical** - Check length, diversity, conversational style
- ✅ **Cost optimization matters** - Use GPT-4o-mini, start small, iterate
- ✅ **Salient facts drive specificity** - Technical details make queries challenging

---

**Tutorial Status:** ✅ Complete  
**Last Updated:** 2025-10-29  
**Maintainer:** AI Evaluation Course Team