# RAGAS Synthetic Data Generation with LangGraph Agent Graph

This notebook reproduces the RAGAS synthetic data generation (SDG) steps with a LangGraph agent graph in place of the RAGAS Knowledge Graph approach. The implementation uses the Evol-Instruct method, which rewrites seed questions into more demanding variants, and applies three types of evolution:

1. **Simple Evolution**: Making questions more specific and detailed
2. **Multi-Context Evolution**: Creating questions that require information from multiple sources
3. **Reasoning Evolution**: Generating questions that require complex reasoning and analysis
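
One way to picture the three evolution types is as prompt templates applied to a seed question. The sketch below is illustrative only: the template wording and the `build_evolution_prompt` helper are assumptions, not the prompts used inside `ragas_langgraph_sdg`. The dictionary keys match the `evolution_type` values that appear in the results later in the notebook.

```python
# Hypothetical prompt templates; the wording is an assumption for illustration.
EVOLUTION_PROMPTS = {
    "simple": (
        "Rewrite the question so it is more specific and detailed.\n"
        "Question: {question}"
    ),
    "multi_context": (
        "Rewrite the question so that answering it requires information "
        "from more than one of the passages below.\n"
        "Passages:\n{contexts}\n"
        "Question: {question}"
    ),
    "reasoning": (
        "Rewrite the question so that answering it requires multi-step "
        "reasoning over the source material.\n"
        "Question: {question}"
    ),
}


def build_evolution_prompt(evolution_type, question, contexts=()):
    """Fill the template for one evolution type (hypothetical helper)."""
    if evolution_type not in EVOLUTION_PROMPTS:
        raise ValueError(f"Unknown evolution type: {evolution_type!r}")
    # Templates without a {contexts} placeholder simply ignore this value.
    joined = "\n".join(f"- {c}" for c in contexts)
    return EVOLUTION_PROMPTS[evolution_type].format(question=question, contexts=joined)


prompt = build_evolution_prompt(
    "multi_context",
    "Who are the authors of the working paper?",
    contexts=["passage one", "passage two"],
)
print(prompt)
```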

## Features

- **LangGraph Workflow**: Uses LangGraph's stateful agent system for orchestration
- **Evol-Instruct Method**: Implements question evolution strategies
- **Multiple Evolution Types**: Handles simple, multi-context, and reasoning evolution
- **Context Retrieval**: Automatically retrieves relevant contexts for questions
- **Answer Generation**: Generates comprehensive answers based on retrieved contexts
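
As a rough mental model of the stateful orchestration, the sketch below runs a fixed sequence of node functions over a shared state dictionary, with each node returning only the keys it updates. It is plain Python rather than the LangGraph API, and the node names, state keys, and stub logic are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

State = Dict[str, list]


def generate_questions(state: State) -> State:
    # Stub: one seed question per chunk (a real node would call an LLM).
    seeds = [f"What does chunk {i} say?" for i, _ in enumerate(state["chunks"])]
    return {"seed_questions": seeds}


def evolve_questions(state: State) -> State:
    # Every seed question is evolved once per evolution type.
    evolved = [
        {"evolution_type": kind, "original_question": q, "question": f"[{kind}] {q}"}
        for kind in ("simple", "multi_context", "reasoning")
        for q in state["seed_questions"]
    ]
    return {"evolved_questions": evolved}


def run_pipeline(nodes: List[Callable[[State], State]], state: State) -> State:
    """Run nodes in order, merging each partial update into the shared state."""
    for node in nodes:
        state = {**state, **node(state)}
    return state


final_state = run_pipeline(
    [generate_questions, evolve_questions],
    {"chunks": ["chunk a", "chunk b"]},
)
print(len(final_state["evolved_questions"]))  # 2 seeds x 3 evolution types = 6
```

The same fan-out explains the counts in the run shown later in this notebook: 15 seed questions evolved three ways give 45 evolved questions.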

## Output Format

The system generates three main outputs:
- **Evolved Questions**: List of questions with IDs and evolution types
- **Answers**: Question-answer pairs with IDs
- **Contexts**: Relevant contexts for each question
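
The shape of these outputs can be sketched with typed dictionaries. The field names below are taken from how the later cells access the data (`id`, `question_id`, `evolution_type`, and so on); the `TypedDict` class names and the sample values are assumptions for illustration.

```python
from typing import List, TypedDict


class EvolvedQuestion(TypedDict):
    id: str
    question: str
    evolution_type: str  # "simple", "multi_context", or "reasoning"
    original_question: str


class Answer(TypedDict):
    question_id: str
    answer: str


class Context(TypedDict):
    question_id: str
    context: str


class SyntheticData(TypedDict):
    evolved_questions: List[EvolvedQuestion]
    answers: List[Answer]
    contexts: List[Context]


# Made-up sample values, for illustration only.
example: SyntheticData = {
    "evolved_questions": [
        {
            "id": "q-1",
            "question": "Evolved?",
            "evolution_type": "simple",
            "original_question": "Seed?",
        }
    ],
    "answers": [{"question_id": "q-1", "answer": "An answer."}],
    "contexts": [{"question_id": "q-1", "context": "A retrieved chunk."}],
}

# Answers and contexts point back to a question through question_id.
answer_by_id = {a["question_id"]: a["answer"] for a in example["answers"]}
print(answer_by_id[example["evolved_questions"][0]["id"]])
```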


## Setup and Dependencies


In [4]:
# Import required libraries
import os
import getpass
from typing import List, Dict
import pandas as pd

# Set up environment variables
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

# Set project name
from uuid import uuid4
os.environ["LANGCHAIN_PROJECT"] = f"RAGAS-LangGraph-SDG-{uuid4().hex[0:8]}"

print("✅ Environment setup complete!")


✅ Environment setup complete!


## Load Documents


In [5]:
# Load documents using LangChain document loaders
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader

# Load PDF documents from the data directory.
# PyMuPDFLoader returns one Document per PDF page, so len(docs) counts pages, not files.
path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

print(f"✅ Loaded {len(docs)} documents")
print(f"📄 Document sources: {[doc.metadata.get('source', 'Unknown') for doc in docs]}")

# Display sample document info
if docs:
    sample_doc = docs[0]
    print(f"\n📋 Sample document preview:")
    print(f"   Title: {sample_doc.metadata.get('title', 'N/A')}")
    print(f"   Pages: {sample_doc.metadata.get('total_pages', 'N/A')}")
    print(f"   Content length: {len(sample_doc.page_content)} characters")
    print(f"   Content preview: {sample_doc.page_content[:200]}...")


✅ Loaded 64 documents
📄 Document sources: ['data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeopleuseai.pdf', 'data/howpeo

## Initialize the LangGraph SDG System


In [6]:
# Import our custom LangGraph SDG system
from ragas_langgraph_sdg import RagasLangGraphSDG

# Initialize the SDG system with custom parameters
sdg_system = RagasLangGraphSDG(
    llm_model="gpt-4o-mini",  # Use GPT-4o mini for cost efficiency
    embedding_model="text-embedding-3-small",  # Efficient embedding model
    chunk_size=1000,  # Reasonable chunk size for processing
    chunk_overlap=100  # Good overlap for context preservation
)

print("✅ RagasLangGraphSDG system initialized!")
print(f"📊 Configuration:")
print(f"   - LLM Model: gpt-4o-mini")
print(f"   - Embedding Model: text-embedding-3-small")
print(f"   - Chunk Size: 1000")
print(f"   - Chunk Overlap: 100")


✅ RagasLangGraphSDG system initialized!
📊 Configuration:
   - LLM Model: gpt-4o-mini
   - Embedding Model: text-embedding-3-small
   - Chunk Size: 1000
   - Chunk Overlap: 100


## Generate Synthetic Data


In [7]:
# Generate synthetic data using the LangGraph workflow
print("🚀 Starting synthetic data generation...")
print("This will use the Evol-Instruct method with three evolution strategies:")
print("1. Simple Evolution - Making questions more specific")
print("2. Multi-Context Evolution - Questions requiring multiple sources")
print("3. Reasoning Evolution - Questions requiring complex reasoning")
print()

# Run the generation process
synthetic_data = sdg_system.generate_synthetic_data(docs)

print("\n🎉 Synthetic data generation completed!")
print(f"📊 Results summary:")
print(f"   - Evolved Questions: {len(synthetic_data['evolved_questions'])}")
print(f"   - Answers: {len(synthetic_data['answers'])}")
print(f"   - Contexts: {len(synthetic_data['contexts'])}")


🚀 Starting synthetic data generation...
This will use the Evol-Instruct method with three evolution strategies:
1. Simple Evolution - Making questions more specific
2. Multi-Context Evolution - Questions requiring multiple sources
3. Reasoning Evolution - Questions requiring complex reasoning

🚀 Starting synthetic data generation...
📄 Preparing documents...
✅ Prepared 155 document chunks
❓ Generating simple questions...
✅ Generated 15 simple questions
🔄 Applying simple evolution...
✅ Applied simple evolution to 15 questions
🔄 Applying multi-context evolution...
✅ Applied multi-context evolution to 15 questions
🔄 Applying reasoning evolution...
✅ Applied reasoning evolution to 15 questions
📚 Generating contexts...
✅ Generated contexts for 45 questions
💬 Generating answers...
✅ Generated answers for 45 questions
✅ Synthetic data generation completed!
📊 Generated 45 evolved questions
📊 Generated 45 answers
📊 Generated 45 contexts

🎉 Synthetic data generation completed!
📊 Results summary:
   - Evolved Questions: 45
   - Answers: 45
   - Contexts: 45


## Analyze Results


In [8]:
# Create DataFrames for easier analysis
questions_df = pd.DataFrame(synthetic_data['evolved_questions'])
answers_df = pd.DataFrame(synthetic_data['answers'])
contexts_df = pd.DataFrame(synthetic_data['contexts'])

print("📋 EVOLVED QUESTIONS ANALYSIS")
print("="*50)

# Analyze evolution types
evolution_counts = questions_df['evolution_type'].value_counts()
print(f"\n🔄 Evolution Type Distribution:")
for evolution_type, count in evolution_counts.items():
    print(f"   {evolution_type}: {count} questions")

print(f"\n📝 Sample Questions by Evolution Type:")

# Show examples of each evolution type
for evolution_type in questions_df['evolution_type'].unique():
    sample_questions = questions_df[questions_df['evolution_type'] == evolution_type].head(2)
    print(f"\n🔹 {evolution_type.upper()} EVOLUTION:")
    for idx, row in sample_questions.iterrows():
        print(f"   Q: {row['question']}")
        if row['original_question']:
            print(f"   Original: {row['original_question']}")
        print()

print("\n📊 DATA QUALITY METRICS")
print("="*50)

# Calculate some basic metrics
avg_question_length = questions_df['question'].str.len().mean()
avg_answer_length = answers_df['answer'].str.len().mean()
avg_context_length = contexts_df['context'].str.len().mean()

print(f"📏 Average Question Length: {avg_question_length:.1f} characters")
print(f"📏 Average Answer Length: {avg_answer_length:.1f} characters")
print(f"📏 Average Context Length: {avg_context_length:.1f} characters")

# Check for missing data (a positive value means questions without an answer or context)
missing_answers = len(questions_df) - len(answers_df)
missing_contexts = len(questions_df) - len(contexts_df)

print(f"\n🔍 Data Completeness:")
print(f"   Questions: {len(questions_df)}")
print(f"   Answers: {len(answers_df)} (missing: {missing_answers})")
print(f"   Contexts: {len(contexts_df)} (missing: {missing_contexts})")


📋 EVOLVED QUESTIONS ANALYSIS

🔄 Evolution Type Distribution:
   simple: 15 questions
   multi_context: 15 questions
   reasoning: 15 questions

📝 Sample Questions by Evolution Type:

🔹 SIMPLE EVOLUTION:
   Q: Could you provide the names of the authors and their affiliations for the working paper titled "How People Use ChatGPT," as well as any relevant publication details such as the date of release and the journal or platform where it was published?
   Original: Who are the authors of the working paper titled "How People Use ChatGPT"?

   Q: Which specific institution is responsible for publishing the working paper series referenced in the document, and what is the title or focus of that particular series?
   Original: What institution published the working paper series mentioned in the document?


🔹 MULTI_CONTEXT EVOLUTION:
   Q: Based on the insights provided in the working paper titled "How People Use ChatGPT," compare the authors' perspectives on user engagement with ChatGPT to the

## Display Complete Results


In [9]:
# Display the complete synthetic dataset
print("📋 COMPLETE SYNTHETIC DATASET")
print("="*80)

# Create a combined view
combined_data = []

for _, question_row in questions_df.iterrows():
    question_id = question_row['id']
    
    # Find corresponding answer and context
    answer_row = answers_df[answers_df['question_id'] == question_id]
    context_row = contexts_df[contexts_df['question_id'] == question_id]
    
    combined_item = {
        'id': question_id,
        'evolution_type': question_row['evolution_type'],
        'question': question_row['question'],
        'original_question': question_row.get('original_question', ''),
        'answer': answer_row['answer'].iloc[0] if not answer_row.empty else 'No answer',
        'context': context_row['context'].iloc[0] if not context_row.empty else 'No context'
    }
    combined_data.append(combined_item)

# Display first few complete examples
print(f"\n📝 COMPLETE EXAMPLES (showing first 3):")
print("="*80)

for i, item in enumerate(combined_data[:3], 1):
    print(f"\n🔹 EXAMPLE {i} - {item['evolution_type'].upper()} EVOLUTION")
    print(f"ID: {item['id']}")
    print(f"Question: {item['question']}")
    if item['original_question']:
        print(f"Original: {item['original_question']}")
    print(f"Answer: {item['answer'][:200]}..." if len(item['answer']) > 200 else f"Answer: {item['answer']}")
    print(f"Context: {item['context'][:200]}..." if len(item['context']) > 200 else f"Context: {item['context']}")
    print("-" * 80)

print(f"\n📊 TOTAL DATASET SIZE: {len(combined_data)} complete examples")


📋 COMPLETE SYNTHETIC DATASET

📝 COMPLETE EXAMPLES (showing first 3):

🔹 EXAMPLE 1 - SIMPLE EVOLUTION
ID: 0bc6555f-92e9-4dc4-93a7-4ad67c21aa35
Question: Could you provide the names of the authors and their affiliations for the working paper titled "How People Use ChatGPT," as well as any relevant publication details such as the date of release and the journal or platform where it was published?
Original: Who are the authors of the working paper titled "How People Use ChatGPT"?
Answer: The working paper titled "How People Use ChatGPT" is authored by the following individuals:

- Aaron Chatterji
- Thomas Cunningham
- David J. Deming
- Zoe Hitzig
- Christopher Ong
- Carl Yan Shan
- Ke...
Context: NBER WORKING PAPER SERIES
HOW PEOPLE USE CHATGPT
Aaron Chatterji
Thomas Cunningham
David J. Deming
Zoe Hitzig
Christopher Ong
Carl Yan Shan
Kevin Wadman
Working Paper 34255
http://www.nber.org/papers/...
--------------------------------------------------------------------------------

🔹 EXAMPLE 2 

## Export Results


In [10]:
# Export the synthetic data to various formats for further use
import json
from datetime import datetime

# Create export directory (os is already imported in the setup cell)
export_dir = "synthetic_data_output"
os.makedirs(export_dir, exist_ok=True)

# Export to JSON; an explicit encoding keeps the file portable across platforms
json_filename = f"{export_dir}/langgraph_sdg_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(json_filename, 'w', encoding='utf-8') as f:
    json.dump(synthetic_data, f, indent=2)

print(f"✅ Exported to JSON: {json_filename}")

# Export individual components to CSV
questions_df.to_csv(f"{export_dir}/evolved_questions.csv", index=False)
answers_df.to_csv(f"{export_dir}/answers.csv", index=False)
contexts_df.to_csv(f"{export_dir}/contexts.csv", index=False)

print(f"✅ Exported questions to: {export_dir}/evolved_questions.csv")
print(f"✅ Exported answers to: {export_dir}/answers.csv")
print(f"✅ Exported contexts to: {export_dir}/contexts.csv")

# Export combined dataset
combined_df = pd.DataFrame(combined_data)
combined_df.to_csv(f"{export_dir}/complete_dataset.csv", index=False)
print(f"✅ Exported complete dataset to: {export_dir}/complete_dataset.csv")

print(f"\n📁 All files saved in: {export_dir}/")
print(f"📊 Dataset contains {len(combined_data)} complete question-answer-context triplets")


✅ Exported to JSON: synthetic_data_output/langgraph_sdg_results_20251005_172617.json
✅ Exported questions to: synthetic_data_output/evolved_questions.csv
✅ Exported answers to: synthetic_data_output/answers.csv
✅ Exported contexts to: synthetic_data_output/contexts.csv
✅ Exported complete dataset to: synthetic_data_output/complete_dataset.csv

📁 All files saved in: synthetic_data_output/
📊 Dataset contains 45 complete question-answer-context triplets


## Summary

This implementation successfully reproduces RAGAS Synthetic Data Generation using a LangGraph Agent Graph approach with the following key features:

### ✅ **Completed Requirements**

1. **LangGraph Agent Graph**: Uses LangGraph's stateful workflow system instead of Knowledge Graphs
2. **Evol-Instruct Method**: Implements three evolution strategies:
   - **Simple Evolution**: Makes questions more specific and detailed
   - **Multi-Context Evolution**: Creates questions requiring information from multiple sources
   - **Reasoning Evolution**: Generates questions requiring complex reasoning and analysis
3. **Input Handling**: Takes a list of LangChain Documents as input
4. **Output Format**: Produces the required three output types:
   - List of evolved questions with IDs and evolution types
   - List of question-answer pairs with IDs
   - List of question-context pairs with IDs

### 🔧 **Technical Implementation**

- **Workflow Orchestration**: LangGraph manages the entire pipeline from document preparation to final output
- **Agent-Based Architecture**: Specialized agents handle different tasks (question generation, evolution, context retrieval, answer generation)
- **Vector-Based Context Retrieval**: Uses a Qdrant vector store to retrieve the document chunks most relevant to each question
- **Modular Design**: Easy to extend with additional evolution strategies or agents
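
Vector-based retrieval ranks chunks by the similarity between a query embedding and each chunk embedding. The sketch below is a pure-Python stand-in, not the Qdrant client: the toy 3-dimensional vectors and the `top_k` helper are assumptions, and a real run would use embeddings produced by the configured embedding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the texts of the k chunks most similar to the query (hypothetical helper)."""
    ranked = sorted(
        chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy chunk embeddings, for illustration only.
chunks = [
    ("authors and affiliations", [0.9, 0.1, 0.0]),
    ("usage statistics", [0.1, 0.9, 0.0]),
    ("methodology", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], chunks, k=2))
# ['authors and affiliations', 'usage statistics']
```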

### 📊 **Generated Data Quality**

- **Diverse Question Types**: Covers simple factual questions through multi-source and reasoning questions (15 of each evolution type in this run)
- **Comprehensive Contexts**: Retrieves relevant document chunks for each question
- **Context-Grounded Answers**: Generates answers from the retrieved contexts; answer correctness is not separately evaluated in this notebook
- **Complete Dataset**: All 45 questions have corresponding answers and contexts
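
Comparing row counts alone would not catch a question whose answer is missing while another question's answer is duplicated. A stricter check, sketched below with made-up records, compares the sets of IDs. The `find_missing` helper is hypothetical, and the field names follow the ones used in the analysis cells.

```python
def find_missing(questions, answers, contexts):
    """Return question IDs that lack an answer or a context."""
    question_ids = {q["id"] for q in questions}
    answered = {a["question_id"] for a in answers}
    with_context = {c["question_id"] for c in contexts}
    return {
        "missing_answers": sorted(question_ids - answered),
        "missing_contexts": sorted(question_ids - with_context),
    }


# Made-up records: two answers exist, but both belong to q-1.
questions = [{"id": "q-1"}, {"id": "q-2"}]
answers = [
    {"question_id": "q-1", "answer": "..."},
    {"question_id": "q-1", "answer": "..."},
]
contexts = [
    {"question_id": "q-1", "context": "..."},
    {"question_id": "q-2", "context": "..."},
]

print(find_missing(questions, answers, contexts))
# {'missing_answers': ['q-2'], 'missing_contexts': []}
```

With the notebook's data this could be called as `find_missing(synthetic_data['evolved_questions'], synthetic_data['answers'], synthetic_data['contexts'])`, assuming each entry is a dictionary with those keys.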

This approach offers a flexible alternative to RAGAS's Knowledge Graph method. In the run above it produced 45 question-answer-context triplets, split evenly across the three evolution types.
