# Data Generation for Evaluation

This notebook generates test data for evaluating the FAQ agent.
An LLM writes realistic student questions from sampled FAQ entries, the agent answers them, and each interaction is logged for later evaluation.

## Setup

In [1]:
# Install dependencies (minsearch and tqdm are imported later in this notebook)
%pip install pydantic-ai python-dotenv requests python-frontmatter minsearch tqdm

Note: you may need to restart the kernel to use updated packages.


In [1]:
from dotenv import load_dotenv
import os
import json
import random
from pathlib import Path
from pydantic import BaseModel
from pydantic_ai import Agent

# Load environment variables
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY not found. Set it in .env file.")

print("API Key loaded:", OPENAI_API_KEY[:6] + "...")

API Key loaded: sk-pro...


## Load Repository Data

First, download the `DataTalksClub/faq` repository as a zip archive and parse every markdown file in it, including its frontmatter metadata.

In [2]:
import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries containing file content and metadata
    """
    prefix = 'https://codeload.github.com' 
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url, timeout=30)  # avoid hanging indefinitely on a stalled connection
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []

    # The context manager closes the archive even if an unexpected error occurs
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        for file_info in zf.infolist():
            filename = file_info.filename

            # Keep only markdown files (.md and .mdx)
            if not filename.lower().endswith(('.md', '.mdx')):
                continue

            try:
                with zf.open(file_info) as f_in:
                    content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
            except Exception as e:
                print(f"Error processing {filename}: {e}")
                continue

    return repository_data
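
The archive-filtering step can be tried without any network access. The sketch below builds a tiny in-memory zip and applies the same `.md`/`.mdx` suffix check; the helper names and the file names inside the archive are hypothetical and exist only for illustration.

```python
import io
import zipfile


def list_markdown_members(zip_bytes):
    """Return the names of .md/.mdx members in a zip archive given as bytes."""
    names = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            # Case-insensitive suffix check, as in read_repo_data
            if info.filename.lower().endswith(('.md', '.mdx')):
                names.append(info.filename)
    return names


def build_demo_archive():
    """Build a small in-memory archive with hypothetical file names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as zf:
        zf.writestr('faq-main/README.MD', '# FAQ')
        zf.writestr('faq-main/docs/intro.mdx', 'Intro')
        zf.writestr('faq-main/scripts/build.py', 'print("hi")')
    return buffer.getvalue()


markdown_files = list_markdown_members(build_demo_archive())
print(markdown_files)
```

Only the two markdown members are kept; the Python script is skipped.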

In [3]:
# Load FAQ data
dtc_faq = read_repo_data('DataTalksClub', 'faq')
de_dtc_faq = [d for d in dtc_faq if 'data-engineering' in d['filename']]

print(f"Loaded {len(de_dtc_faq)} data engineering FAQ documents")

Loaded 449 data engineering FAQ documents
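
The filter above is a plain substring test on the full path, so it would also match a file whose own name happens to contain `data-engineering`. If that ever becomes a problem, one possible stricter variant checks directory components only. This is a sketch under assumptions: the paths below are hypothetical and may not reflect the real repository layout, and `in_course_dir` is an illustrative helper, not part of the notebook.

```python
from pathlib import PurePosixPath

# Hypothetical records shaped like the output of read_repo_data
records = [
    {'filename': 'faq-main/data-engineering-zoomcamp/module-1/docker.md'},
    {'filename': 'faq-main/machine-learning-zoomcamp/notes-on-data-engineering.md'},
    {'filename': 'faq-main/mlops-zoomcamp/setup.md'},
]


def in_course_dir(filename, keyword):
    """True if any directory component (not the file name itself) contains keyword."""
    directories = PurePosixPath(filename).parts[:-1]
    return any(keyword in part for part in directories)


# Substring match on the whole path (the approach used in the notebook)
loose = [r['filename'] for r in records if 'data-engineering' in r['filename']]

# Match on directory names only
strict = [r['filename'] for r in records if in_course_dir(r['filename'], 'data-engineering')]

print(len(loose), len(strict))
```

With these sample paths the loose filter keeps two files and the strict one keeps only the file inside the course directory.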


## Question Generation

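The generator agent receives a JSON list of FAQ entries and is asked to return one realistic question per entry. A Pydantic schema constrains the output to a list of strings.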

In [4]:
# Define the output schema for questions
class QuestionsList(BaseModel):
    questions: list[str]

# Define the question generation prompt
question_generation_prompt = """
You are helping to create test questions for an AI agent that answers questions about a data engineering course.

Based on the provided FAQ content, generate realistic questions that students might ask.

The questions should:
- Be natural and varied in style
- Range from simple to complex
- Include both specific technical questions and general course questions

Generate one question for each record.
""".strip()

# Create the question generator agent
question_generator = Agent(
    name="question_generator",
    instructions=question_generation_prompt,
    model='gpt-4o-mini',
    output_type=QuestionsList
)
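
To see what the `QuestionsList` schema accepts, it can be exercised directly with Pydantic, independent of any model call. This is a minimal sketch assuming Pydantic v2 (`model_validate_json`); the JSON payloads are made up for illustration and are not real model output.

```python
from pydantic import BaseModel, ValidationError


class QuestionsList(BaseModel):
    questions: list[str]


# A payload in the expected shape: one object with a "questions" list of strings
valid_payload = '{"questions": ["How do I install Docker?", "What is dbt?"]}'
parsed = QuestionsList.model_validate_json(valid_payload)
print(parsed.questions)

# A payload without the required "questions" key is rejected
invalid_payload = '{"items": []}'
try:
    QuestionsList.model_validate_json(invalid_payload)
    rejected = False
except ValidationError:
    rejected = True

print("Invalid payload rejected:", rejected)
```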

In [5]:
# Sample documents and generate questions
num_samples = 10
sample = random.sample(de_dtc_faq, num_samples)
prompt_docs = [d['content'] for d in sample]
prompt = json.dumps(prompt_docs)

result = await question_generator.run(prompt)
questions = result.output.questions

print(f"Generated {len(questions)} questions:")
for i, q in enumerate(questions, 1):
    print(f"{i}. {q}")

Generated 10 questions:
1. How can I fix the Docker error related to configuration settings?
2. Can I partition more than one column in BigQuery tables?
3. What mounting path should I use for PostgreSQL in Docker?
4. Where should I place my SSH config file in my file system?
5. How do I configure the Spark BigQuery connector automatically?
6. What data types should be used to avoid schema inconsistency issues in BigQuery?
7. What should I do if my data loading process is too slow when using a script for yellow taxi data?
8. How do I install Java and Spark using SDKMAN for my project?
9. What Python version is compatible with Spark 3.0.3 to avoid inconsistencies?
10. Can I create an external table in BigQuery using Parquet files stored in a GCP Bucket?
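
Two practical details may be worth handling here: `random.sample` gives a different sample on every run, and the prompt asks for one question per record without anything enforcing it. The sketch below shows one way to address both. The helper names are hypothetical, the documents are placeholders, and pairing by position assumes the model kept the input order, which is not guaranteed.

```python
import random


def sample_reproducibly(items, k, seed):
    """Draw k items with a private RNG so the same seed yields the same sample."""
    rng = random.Random(seed)
    return rng.sample(items, k)


def pair_questions_with_docs(questions, docs):
    """Pair questions with source documents by position, failing on a count mismatch."""
    if len(questions) != len(docs):
        raise ValueError(f"Expected {len(docs)} questions, got {len(questions)}")
    return list(zip(questions, docs))


# Placeholder documents standing in for FAQ entries
docs = ['doc about Docker', 'doc about BigQuery', 'doc about Spark', 'doc about Kafka']

first_sample = sample_reproducibly(docs, 2, seed=42)
second_sample = sample_reproducibly(docs, 2, seed=42)
print(first_sample == second_sample)

# Placeholder questions standing in for the generator output
pairs = pair_questions_with_docs(['Question A?', 'Question B?'], first_sample)
for question, doc in pairs:
    print(f"{question} <- {doc}")
```

Keeping the question-to-document pairing makes it possible to check later whether the agent retrieved the entry a question was generated from.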


## Generate Test Data with Agent

Now we'll run the agent on these questions and log the interactions.

In [6]:
# Import the agent and logging functionality
import sys
sys.path.append('../aihero/course')

from search_agent import init_agent
from search_tools import SearchTool
from logs import log_interaction_to_file
from minsearch import Index

# Initialize the search index
faq_index = Index(
    text_fields=["question", "content"],
    keyword_fields=[]
)
faq_index.fit(de_dtc_faq)

# Initialize the agent
agent = init_agent(
    index=faq_index,
    repo_owner='DataTalksClub',
    repo_name='faq'
)

print("Agent initialized and ready")

Agent initialized and ready
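
The index is fitted on the `question` and `content` fields, so it may be useful to check how many records actually carry each of them before fitting. The following is a small sketch with a hypothetical helper and made-up records; in the notebook it would be called on `de_dtc_faq`.

```python
from collections import Counter


def field_coverage(records, fields):
    """Count how many records have a non-empty string value for each field."""
    coverage = Counter()
    for record in records:
        for field in fields:
            value = record.get(field)
            if isinstance(value, str) and value.strip():
                coverage[field] += 1
    return {field: coverage[field] for field in fields}


# Made-up records shaped like parsed FAQ entries
demo_records = [
    {'question': 'How do I start Docker?', 'content': 'Run the daemon.', 'filename': 'a.md'},
    {'content': 'Readme text without frontmatter', 'filename': 'b.md'},
    {'question': '', 'content': '   ', 'filename': 'c.md'},
]

coverage = field_coverage(demo_records, ['question', 'content'])
print(coverage)
```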


In [7]:
# Run the agent on each question and log results
from tqdm.auto import tqdm

generated_logs = []

for q in tqdm(questions, desc="Generating test data"):
    print(f"\nQuestion: {q}")
    
    result = await agent.run(user_prompt=q)
    print(f"Answer: {result.output[:200]}...")
    
    log_file = log_interaction_to_file(
        agent,
        result.new_messages(),
        source='ai-generated'
    )
    
    generated_logs.append(str(log_file))
    print(f"Logged to: {log_file}")

print(f"\nGenerated {len(generated_logs)} test interactions")

Generating test data:   0%|          | 0/10 [00:00<?, ?it/s]


Question: How can I fix the Docker error related to configuration settings?
Answer: To fix Docker errors related to configuration settings, here are some common issues and their solutions based on your platform:

1. **Docker Won't Start on Windows (10/11)**:
   - Ensure that you are ...
Logged to: logs/gh_agent_20251012_053857_86df26.json

Question: Can I partition more than one column in BigQuery tables?
Answer: No, BigQuery does not support partitioning on more than one column. You can only partition a table based on one column at a time according to the official documentation.

For more details, you can ref...
Logged to: logs/gh_agent_20251012_053909_298bad.json

Question: What mounting path should I use for PostgreSQL in Docker?
Answer: When using PostgreSQL in Docker, you can use the following mount path for your database storage:

1. On Windows, it's recommended to use one of the following options for your volume mapping:
   ```bas...
Logged to: logs/gh_agent_20251012_053913_c60

## Save Generated Questions

Save the generated questions and their corresponding log files for later evaluation.

In [8]:
from datetime import datetime

# Save questions and log file paths
output_data = {
    "questions": questions,
    "log_files": generated_logs,
    "num_samples": num_samples,
    "generated_at": datetime.now().isoformat()
}

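output_path = Path('../logs/generated_test_data.json')
output_path.parent.mkdir(parents=True, exist_ok=True)  # create ../logs if it does not exist yet
with output_path.open('w', encoding='utf-8') as f:
    json.dump(output_data, f, indent=2)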

print(f"Saved test data metadata to {output_path}")

Saved test data metadata to ../logs/generated_test_data.json
