# Synthetic Question Generation: The Power of Prompts

## Why This Notebook Matters

When building RAG systems, one of the biggest challenges is generating good synthetic questions for evaluation and training. Most people focus on better models or more data, but **prompt design is often the most critical factor**.

In this notebook, we'll compare two approaches to synthetic question generation and show how much the results change with the prompt strategy.

## What You'll Learn

1. **How prompt design shapes everything** - Small changes in prompts can lead to completely different outputs
2. **The importance of understanding user intent** - What kinds of questions will your users actually ask?
3. **How to choose the right approach** - Different prompts for different use cases
4. **Practical guidance** - How to design prompts for your specific domain

## The Setup

We'll use real conversations from the WildChat dataset and generate synthetic queries using two different approaches:
- **V1**: Search-focused (helping users find information)
- **V2**: Pattern-focused (finding similar conversations)

Pay close attention to how different the results are - this will help you understand why prompt engineering is so critical for your RAG system's success.


In [4]:
# Comparing Synthetic Question Generation Methods
# This notebook loads examples from WildChat dataset and compares v1 vs v2 processors

import asyncio
import sys

# Add the utils directory to path
sys.path.append('../utils')
sys.path.append('..')

# Import our modules
from utils.dataloader import WildChatDataLoader

# Setup instructor client
import instructor

# Initialize instructor-patched OpenAI client
client = instructor.from_provider("openai/gpt-4o-mini", async_client=True)

print("Setup complete!")


Setup complete!


In [5]:
from typing import List, Dict, Any
from pydantic import BaseModel, Field

class SearchQueries(BaseModel):
    """Generated search queries that could lead to discovering a conversation."""
    chain_of_thought: str = Field(
        description="Chain of thought process for generating the search queries"
    )
    queries: List[str] = Field(
        description="4-7 diverse search queries that users might type to find this conversation",
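        min_length=3,  # Pydantic v2 name for the deprecated `min_items`
        max_length=8  # Pydantic v2 name for the deprecated `max_items`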
    )


async def synthetic_question_generation_v1(
    client,  # instructor-patched client
    messages: List[Dict[str, Any]],
) -> SearchQueries:
    """
    Generate diverse synthetic search queries from a chat conversation.
    
    Taking the perspective of a product manager analyzing ChatGPT usage patterns,
    this function creates search queries that users might have typed to discover
    similar conversations. The queries should be diverse and cover different
    aspects of the conversation.
    
    Args:
        client: instructor-patched client
        messages: List of messages in the conversation
        
    Returns:
        SearchQueries object with 4-5 diverse search queries and reasoning
    """
    
    prompt = """
    You are a product manager analyzing ChatGPT usage patterns. Your goal is to understand 
    how users might search to find conversations like this one.
    
    Given this conversation, generate 4-5 diverse search queries that different users might 
    type when looking for similar help or information. The queries should:
    
    1. Cover different aspects of the conversation (technical terms, problem description, solution type)
    2. Vary in specificity (some broad, some specific)
    3. Use different phrasings and vocabulary levels
    4. Reflect natural user search behavior
    5. Include both question-style and keyword-style queries
    
    <conversation>
    {% for message in messages %}
        <message role="{{ message.role }}">
            {{ message.content }}
        </message>
    {% endfor %}
    </conversation>
    
    Generate queries that would realistically lead someone to discover this conversation.
    """
    
    response = await client.chat.completions.create(
        response_model=SearchQueries,
        messages=[
            {
                "role": "user", 
                "content": prompt
            }
        ],
        context={
            "messages": messages
        }
    )
    
    return response


async def synthetic_question_generation_v2(
    client,  # instructor-patched client
    messages: List[Dict[str, Any]],
) -> SearchQueries:
    """
    Generate search queries for finding conversations with similar patterns and characteristics.
    
    This version focuses on identifying conversation types, themes, and patterns that would be
    useful for researchers, content moderators, or analysts studying human-AI interactions.
    
    Args:
        client: instructor-patched client
        messages: List of messages in the conversation
        
    Returns:
        SearchQueries object with pattern-focused search queries
    """
    
    prompt = """
    You are a research analyst studying patterns in human-AI conversations from the WildChat dataset.
    Your goal is to identify the key characteristics and patterns in this conversation that would help
    researchers find similar types of conversations.
    
    Analyze this conversation and generate search queries that would help find conversations with:
    - Similar content themes or domains (medical, creative, technical, etc.)
    - Similar user intents (seeking advice, creative collaboration, testing AI limits, etc.)
    - Similar interaction patterns (role-playing, Q&A, refusal situations, etc.)
    - Similar AI behaviors or response types
    
    Focus on generating queries that capture the ESSENCE and PATTERNS rather than specific details.
    
    Examples of good pattern queries:
    - "conversations where users ask about medical diagnoses"
    - "role-playing scenarios with fictional characters"
    - "conversations where AI refuses medical advice"
    - "creative writing collaborations"
    - "technical troubleshooting discussions"
    - "conversations testing AI content policies"
    - "users seeking relationship advice"
    - "educational Q&A about scientific concepts"
    
    <conversation>
    {% for message in messages %}
        <message role="{{ message.role }}">
            {{ message.content }}
        </message>
    {% endfor %}
    </conversation>
    
    Generate 5-7 search queries that focus on conversation patterns, themes, and characteristics
    rather than specific content details. Think about what makes this conversation type distinct
    and how researchers would categorize it.
    """
    
    response = await client.chat.completions.create(
        response_model=SearchQueries,
        messages=[
            {
                "role": "system",
                "content": "You are an expert conversation analyst specializing in categorizing and understanding patterns in human-AI interactions. Focus on identifying conversation types, themes, and structural patterns rather than specific content details."
            },
            {
                "role": "user", 
                "content": prompt
            }
        ],
        context={
            "messages": messages
        }
    )
    
    return response
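
Both prompts embed the conversation through the same Jinja loop, which is filled in from the `context` argument. To see roughly what text the model receives, the loop can be mimicked in plain Python. This is only a sketch: `render_conversation` is a hypothetical helper (not part of instructor or this notebook), the sample messages are made up, and the exact whitespace will differ from real Jinja output.

```python
from typing import Any, Dict, List


def render_conversation(messages: List[Dict[str, Any]]) -> str:
    """Approximate the <conversation> block produced by the Jinja loop in the prompts."""
    lines = ["<conversation>"]
    for message in messages:
        lines.append(f'    <message role="{message["role"]}">')
        lines.append(f'        {message["content"]}')
        lines.append("    </message>")
    lines.append("</conversation>")
    return "\n".join(lines)


# Made-up two-turn conversation, used only for illustration.
sample = [
    {"role": "user", "content": "How do I reverse a list in Python?"},
    {"role": "assistant", "content": "Use reversed(items) or items[::-1]."},
]

rendered = render_conversation(sample)
print(rendered)
```

Printing the rendered block for a real conversation is a quick way to check how long the prompt becomes before sending it to the model.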


In [None]:
def write_streaming_to_chroma(
    collection_name="wildchat_streaming", 
    limit=100, 
    batch_size=50,
    filter_language="English",
    min_message_length=20
):
    """
    Write conversations to ChromaDB using streaming data loader
    
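    This approach can handle the full 1M dataset efficiently by:
    1. Loading data in streaming fashion (no memory issues)
    2. Processing in batches

    Note: `client` must be a ChromaDB client here (not the instructor client
    created earlier), and `load_data` must already be defined; neither is set up
    in the cells shown. `filter_language` and `min_message_length` are only
    logged below, not applied to the loaded data.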
    """
    
    print(f"Starting streaming write to collection: {collection_name}")
    print(f"Filters: language={filter_language}, min_length={min_message_length}, limit={limit}")
    
    try:
        # Create/get collection
        collection = client.get_or_create_collection(
            name=collection_name,
            metadata={"description": "Streaming WildChat data"}
        )
        
        # Batch processing
        documents = []
        metadatas = []
        ids = []
        total_processed = 0
        
        for conversation in load_data(limit=limit):
            # Prepare document
            doc_id = f"{conversation['conversation_hash']}"
            
            metadata = {
                "hash": conversation['conversation_hash'],
                "timestamp": str(conversation['timestamp']),
                "lang": conversation['language'],
                "model": conversation['model'],
                "length": conversation['conversation_length'],
                "type": "user_query"
            }
            
            documents.append(conversation['first_message'][:1000])
            metadatas.append(metadata)
            ids.append(doc_id)
            
            # Write batch when full
            if len(documents) >= batch_size:
                collection.add(
                    documents=documents,
                    metadatas=metadatas,
                    ids=ids
                )
                total_processed += len(documents)
                print(f"  Wrote batch: {total_processed} documents so far...")
                
                # Reset batch
                documents = []
                metadatas = []
                ids = []
        
        # Write final batch
        if documents:
            collection.add(
                documents=documents,
                metadatas=metadatas,
                ids=ids
            )
            total_processed += len(documents)
        
        print(f"✅ Successfully wrote {total_processed} conversations to {collection_name}")
        print(f"Collection now contains {collection.count()} total documents")
        
        return collection
        
    except Exception as e:
        print(f"❌ Error in streaming write: {e}")
        return None

# Test with a moderate dataset
print("Testing streaming write with 50 English conversations...")
streaming_collection = write_streaming_to_chroma(
    collection_name="wildchat_10k",
    limit=10000,  # Loads the first 10,000 conversations (train[:10000])
    filter_language="English",
    min_message_length=30
)

INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections "HTTP/1.1 200 OK"
INFO:utils.dataloader:Loading WildChat dataset: train[:10000]


Testing streaming write with 50 English conversations...
Starting streaming write to collection: wildchat_10k
Filters: language=English, min_length=30, limit=10000


INFO:utils.dataloader:Loaded 10000 conversations
INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 50 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 100 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 150 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 200 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 250 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 300 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 350 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 400 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 450 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 500 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 550 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 600 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 650 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 700 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 750 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 800 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 850 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 900 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 950 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 1000 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 1050 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 1100 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 1150 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 1200 documents so far...
❌ Error in streaming write: Expected IDs to be unique, found duplicates of: f5f01c95c79d12c5c72f87f63577b6dc in add.
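
The run above stops after 1,200 documents because the `add` call was rejected: the ID `f5f01c95c79d12c5c72f87f63577b6dc` appeared more than once, so the same `conversation_hash` evidently occurs repeatedly in the loaded slice. One way to avoid this, sketched below under the assumption that keeping only the first occurrence of each hash is acceptable, is to drop repeats before batching. `unique_by_hash` and `batched` are hypothetical helpers and the sample records are made up.

```python
from typing import Any, Dict, Iterable, Iterator, List


def unique_by_hash(conversations: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield each conversation once, keyed by its conversation_hash (first one wins)."""
    seen = set()
    for conversation in conversations:
        conversation_hash = conversation["conversation_hash"]
        if conversation_hash in seen:
            continue  # skip repeats so every ID sent to the store is unique
        seen.add(conversation_hash)
        yield conversation


def batched(items: Iterable[Dict[str, Any]], batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Group items into lists of at most batch_size, including a final partial batch."""
    batch: List[Dict[str, Any]] = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Made-up records with two repeated hashes.
records = [
    {"conversation_hash": h, "first_message": f"message {i}"}
    for i, h in enumerate(["a1", "b2", "a1", "c3", "b2"])
]

batches = list(batched(unique_by_hash(records), batch_size=2))
batch_ids = [[r["conversation_hash"] for r in batch] for batch in batches]
print(batch_ids)  # [['a1', 'b2'], ['c3']]
```

Because the filter is a generator, it keeps the streaming behaviour of the loader; only the set of seen hashes is held in memory.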


In [6]:
# Load examples from the WildChat dataset
print("Loading WildChat examples...")

# Initialize the dataloader with a reasonable limit for initial loading
loader = WildChatDataLoader(limit=5000)  # Load first 5K to get good examples

# Stream conversations and collect examples
examples = []
target_count = 5  # Aim for 5 good examples

for conversation in loader.stream_conversations(
    limit=target_count,
    min_message_length=50,
    filter_language='English',
    filter_toxic=True
):
    examples.append(conversation)
    print(f"Loaded example {len(examples)}: {conversation['conversation_hash'][:8]}...")
    
    if len(examples) >= target_count:
        break

Loading WildChat examples...
Loaded example 1: c9ec5b44...
Loaded example 2: cf1267ca...
Loaded example 3: e98d3e74...
Loaded example 4: 2e8fd255...
Loaded example 5: 59c72510...
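
Note that three of the five loaded examples (`c9ec5b44`, `e98d3e74`, `2e8fd255`) have different hashes but open with the same reality-shifting message, as the previews in the comparison output show, so the comparison covers fewer distinct conversations than the count suggests. A minimal sketch of filtering such repeats by their opening text follows. `dedupe_by_opening` is a hypothetical helper, the 150-character prefix is an arbitrary choice, and the demo records are shortened stand-ins based on the previews.

```python
from typing import Any, Dict, List


def dedupe_by_opening(examples: List[Dict[str, Any]], prefix_chars: int = 150) -> List[Dict[str, Any]]:
    """Keep the first example for each normalized opening of first_message."""
    seen = set()
    kept = []
    for example in examples:
        # Lowercase and collapse whitespace so trivial formatting differences do not matter.
        key = " ".join(example["first_message"].lower().split())[:prefix_chars]
        if key in seen:
            continue
        seen.add(key)
        kept.append(example)
    return kept


# Shortened stand-ins for the loaded conversations.
demo = [
    {"conversation_hash": "c9ec5b44", "first_message": "Hey there! Are you familiar with reality shifting?"},
    {"conversation_hash": "cf1267ca", "first_message": "Old age PT hx of DM, HTN, dyslipidemia"},
    {"conversation_hash": "e98d3e74", "first_message": "Hey there!  Are you familiar with reality shifting?"},
]

kept_hashes = [example["conversation_hash"] for example in dedupe_by_opening(demo)]
print(kept_hashes)  # ['c9ec5b44', 'cf1267ca']
```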


In [7]:
async def process_examples_async(examples_to_process: List[Dict], num_examples: int = 5):
    """Process examples using both v1 and v2 processors and return results for comparison"""
    
    results = {
        'v1_results': [],
        'v2_results': [],
        'conversations': [],
        'processing_times': {'v1': [], 'v2': []},
        'errors': {'v1': [], 'v2': []}
    }
    
    print(f"Processing {min(num_examples, len(examples_to_process))} examples with both processors...\n")
    
    # Prepare all tasks for parallel execution
    tasks = []
    for i, example in enumerate(examples_to_process[:num_examples]):
        print(f"Preparing example {i+1}/{min(num_examples, len(examples_to_process))}: {example['conversation_hash'][:8]}")
        
        results['conversations'].append({
            'hash': example['conversation_hash'],
            'first_message': example['first_message'][:200],
            'length': example['conversation_length']
        })
        
        # Create tasks for both v1 and v2 processing
        v1_task = synthetic_question_generation_v1(client, example['conversation'])
        v2_task = synthetic_question_generation_v2(client, example['conversation'])
        tasks.extend([v1_task, v2_task])
    
    # Execute all tasks in parallel
    print("Executing all processing tasks in parallel...")
    all_results = await asyncio.gather(*tasks)
    
    # Separate v1 and v2 results
    for i in range(0, len(all_results), 2):
        results['v1_results'].append(all_results[i])
        results['v2_results'].append(all_results[i + 1])
    
    return results

# Run the async processing
if examples:
    print("Starting async processing...")
    results = await process_examples_async(examples, num_examples=5)
    print("Processing complete!")
else:
    print("No examples to process!")


Starting async processing...
Processing 5 examples with both processors...

Preparing example 1/5: c9ec5b44
Preparing example 2/5: cf1267ca
Preparing example 3/5: e98d3e74
Preparing example 4/5: 2e8fd255
Preparing example 5/5: 59c72510
Executing all processing tasks in parallel...
Processing complete!
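
`process_examples_async` creates `processing_times` and `errors` entries but never fills them, and a single failed request would make `asyncio.gather` raise and lose the other results. The sketch below shows one way to record both, assuming per-call wall-clock timing is what is wanted. `timed_call`, `fake_generator`, and `run_all` are hypothetical; `fake_generator` stands in for the real LLM call so the snippet runs offline.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, List, Tuple


async def timed_call(fn: Callable[..., Awaitable[Any]], *args: Any) -> Tuple[Any, float]:
    """Await fn(*args) and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = await fn(*args)
    return result, time.perf_counter() - start


async def fake_generator(messages: List[Dict[str, str]]) -> List[str]:
    """Stand-in for a query generator; fails on an empty conversation."""
    await asyncio.sleep(0.01)
    if not messages:
        raise ValueError("empty conversation")
    return [f"query about {messages[0]['content']}"]


async def run_all(conversations: List[List[Dict[str, str]]]) -> Dict[str, list]:
    # return_exceptions=True returns failures as values instead of raising on the first one.
    outcomes = await asyncio.gather(
        *(timed_call(fake_generator, conv) for conv in conversations),
        return_exceptions=True,
    )
    report: Dict[str, list] = {"results": [], "times": [], "errors": []}
    for index, outcome in enumerate(outcomes):
        if isinstance(outcome, Exception):
            report["errors"].append((index, repr(outcome)))
        else:
            result, elapsed = outcome
            report["results"].append(result)
            report["times"].append(elapsed)
    return report


report = asyncio.run(run_all([[{"role": "user", "content": "ECG"}], []]))
print(report["results"], report["errors"])
```

Inside a notebook, where an event loop is already running, `await run_all(...)` would replace the `asyncio.run(...)` call.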


In [8]:
from IPython.display import Markdown, display
from jinja2 import Template

# Define the Jinja template
template_str = """# Synthetic Question Generation Comparison: V1 vs V2

## Overview
Comparing two different approaches to generating search queries from conversation data:
- **V1**: Direct query generation focused on search intent
- **V2**: Query generation focused on conversation patterns and user intents

---

{% for i in range(conversations|length) %}
## Example {{ i + 1 }}: {{ conversations[i].hash[:8] }}

**Conversation Preview:** {{ conversations[i].first_message[:150] }}...

**Length:** {{ conversations[i].length }} messages

### V1 Results (Search-Focused)
**Chain of Thought:** {{ v1_results[i].chain_of_thought }}

**Generated Queries:**
{% for query in v1_results[i].queries %}
{{ loop.index }}. {{ query }}
{% endfor %}

### V2 Results (Conversation Pattern-Focused)
**Chain of Thought:** {{ v2_results[i].chain_of_thought }}

**Generated Queries:**
{% for query in v2_results[i].queries %}
{{ loop.index }}. {{ query }}
{% endfor %}

---

{% endfor %}"""

# Create and render the template
template = Template(template_str)
formatted_results = template.render(
    conversations=results['conversations'],
    v1_results=results['v1_results'],
    v2_results=results['v2_results']
)

# Display the results
display(Markdown(formatted_results))


# Synthetic Question Generation Comparison: V1 vs V2

## Overview
Comparing two different approaches to generating search queries from conversation data:
- **V1**: Direct query generation focused on search intent
- **V2**: Query generation focused on conversation patterns and user intents

---


## Example 1: c9ec5b44

**Conversation Preview:** Hey there! Are you familiar with reality shifting? So, I’m refining a foolproof method for reality shifting and want to pick a destination. Want to he...

**Length:** 2 messages

### V1 Results (Search-Focused)
**Chain of Thought:** The conversation discusses reality shifting and personalizes a fictional world based on specific requirements. Users might search for information on reality shifting, personalized destinations, or creative storytelling. Some queries will focus on the practical aspects of reality shifting, while others will emphasize the imaginative details of the created world. Additionally, the ending and specific elements such as quests, adventure, and character types will also inform the search terms. Keywords like 'reality shifting ideas,' 'creating fictional worlds,' and 'adventure storytelling' will be included to cover both broad and specific interests of users.

**Generated Queries:**

1. What are some ideas for reality shifting destinations?

2. How to create a personalized world for reality shifting?

3. Unique quest ideas for reality shifting adventures

4. What is reality shifting and how can I do it?

5. Fictional world building for immersive experiences


### V2 Results (Conversation Pattern-Focused)
**Chain of Thought:** This conversation features thematic elements of fantasy and adventure, with a focus on personal experiences and creative storytelling. The user expresses a desire for a detailed fictional scenario related to reality shifting, and the AI responds in a highly imaginative and narrative-driven manner. This indicates a collaboration in world-building and creative exploration. The nature of the interaction suggests that users are often looking for engaging, personalized narratives and that AI facilitates these immersive experiences through elaborate descriptions and structured storytelling. Therefore, the conversation can be characterized by themes of role-playing, creative collaboration, and the exploration of fictional realities.

**Generated Queries:**

1. creative storytelling about fictional adventures

2. users seeking personalized fantasy scenarios

3. conversations focused on role-playing journey planning

4. collaborative world-building in fantasy contexts

5. immersive narrative creation with AI

6. discussions around reality shifting experiences

7. user-driven quest scenarios in imaginative settings


---


## Example 2: cf1267ca

**Conversation Preview:** Old age PT hx of DM, HTN, dyslipidemia His ECG I.II, aVF (MI) what is the highest risk 

factor for this condition?...

**Length:** 2 messages

### V1 Results (Search-Focused)
**Chain of Thought:** The conversation revolves around a medical scenario focusing on a patient at risk for myocardial infarction due to age and multiple health conditions. Users searching for similar information might be looking for risk factors related to heart attacks, particularly in older adults with specific health issues. Queries can vary by covering technical terms (e.g., myocardial infarction, ECG), the general problem (e.g., heart attack risk factors), and the outcomes (e.g., what increases heart attack risk). Both broad and specific search queries can help capture different users' needs. Additionally, incorporating question-style and keyword-style queries can reflect natural search behavior.

**Generated Queries:**

1. What are the risk factors for myocardial infarction in elderly patients?

2. High risk factors for heart attacks older adults with diabetes and hypertension

3. ECG interpretation in elderly patients: what to look for

4. How does age affect heart attack risk with diabetes and hypertension?

5. Understanding myocardial infarction risk in older patients with chronic conditions


### V2 Results (Conversation Pattern-Focused)
**Chain of Thought:** This conversation revolves around a medical query, specifically regarding risk factors for myocardial infarction in a patient with multiple health issues. The user is looking for expert advice on a healthcare topic, showcasing a technical exchange. The assistant responds with a targeted, medically-oriented answer. This prompts me to consider searches that focus on medical advice, patient diagnosis, risk assessment, and user interactions that involve health-related inquiries.

**Generated Queries:**

1. conversations where users seek medical risk assessments

2. patient diagnosis discussions in aging populations

3. Q&A about cardiovascular health and risk factors

4. conversations involving diabetes and heart disease advice

5. medical inquiries about hypertension and lifestyle influences

6. discussions on elderly patients with multiple health conditions

7. educational conversations about the impact of dyslipidemia on heart health


---


## Example 3: e98d3e74

**Conversation Preview:** Hey there! Are you familiar with reality shifting? So, I’m refining a foolproof method for reality shifting and want to pick a destination. Want to he...

**Length:** 2 messages

### V1 Results (Search-Focused)
**Chain of Thought:** Given the conversation about reality shifting and the detailed scenario created by the assistant, users might look for specific terms related to reality shifting, creative stories, or character and adventure suggestions. The conversation involves aspects like personalized journeys, fantasy elements, enchantresses, and quests, which are all themes that potential users may be interested in. Thus, I’ll create queries that reflect these elements, using both technical terms and casual, question-style phrasing to capture diverse search behaviors.

**Generated Queries:**

1. what is reality shifting and how to start

2. creative fantasy stories for reality shifting

3. specific reality shifting scenarios with quests

4. adventure ideas for shifting into a fantasy world

5. how to create a personalized reality shift destination


### V2 Results (Conversation Pattern-Focused)
**Chain of Thought:** The conversation involves a user engaging in a creative collaboration with the AI, focused on generating ideas for a fictional reality-shifting scenario. The themes present include fantasy world-building, user-driven narrative creation, and a focus on specific quest details and character interactions. The user expresses clear, detailed preferences and seeks an imaginative response from the AI. The structure of the conversation is interactive and exploratory, with a blend of guiding questions and elaborate responses. Key characteristics include role-playing elements, creative storytelling, and a whimsical adventure narrative. The conversation patterns and themes necessitate search queries that emphasize these elements without diving into the specific content of the dialogue.

**Generated Queries:**

1. creative writing prompts for fantasy worlds

2. user-driven narrative development in AI conversations

3. collaborative storytelling scenarios with AI

4. role-playing adventures involving quests and characters

5. fantasy reality creation discussions with AI

6. conversations focused on unique fictional settings

7. interactive storytelling where users dictate plot elements


---


## Example 4: 2e8fd255

**Conversation Preview:** Hey there! Are you familiar with reality shifting? So, I’m refining a foolproof method for reality shifting and want to pick a destination. Want to he...

**Length:** 2 messages

### V1 Results (Search-Focused)
**Chain of Thought:** The conversation covers the niche topics of reality shifting, personalized destination creation, and specific quest ideas. Users may search for advice on reality shifting techniques, ideas for fantasy worlds, or how to create engaging narratives for their journeys. Some users may have specific attributes they’re curious about, such as 'female characters in reality shifting' or 'unconscious themes in fantasy realms'. So, the queries should reflect these varied interests from broad to specific.

**Generated Queries:**

1. How to create a personalized reality shifting destination?

2. Reality shifting ideas for fantasy worlds

3. What are some fun quests for reality shifting?

4. Unique female characters in reality shifting narratives

5. Methods to enter a different reality and unconsciousness themes


### V2 Results (Conversation Pattern-Focused)
**Chain of Thought:** This conversation revolves around a creative and imaginative theme where the user is developing a narrative for reality shifting with specific elements they want included. The user seeks detailed and personalized world-building from the AI, indicating an interest in creative collaboration. The conversation features a role-playing scenario where the AI takes on a guiding role in constructing a fictional reality, emphasizing elements of adventure and temptation. The interaction pattern includes the user providing clear requirements and the AI responding in a narrative format, showcasing an engaging and conversational style. Researchers might search for conversations that encompass similar creative storytelling interactions, user intents focused on imagination and personalization, and frameworks where the AI builds upon user-defined concepts.

**Generated Queries:**

1. creative collaboration in fictional world-building

2. users exploring fantasy role-playing scenarios

3. conversations about personalized adventure narratives

4. interactive storytelling with AI

5. users seeking detailed world descriptions for role-playing

6. engaging AI responses in creative scenarios

7. conversations about reality shifting techniques


---


## Example 5: 59c72510

**Conversation Preview:** i wanna you to write me terms & conditions and policies for my  website...

**Length:** 2 messages

### V1 Results (Search-Focused)
**Chain of Thought:** In this conversation, the user is requesting help with creating terms and conditions for a website, and the assistant advises against it due to legal implications. Thus, the search queries should reflect this need for legal documents, the advice against using AI for such tasks, and possible alternatives for obtaining legal advice or templates. Variations in phrasing, specificity, and user intent should be represented in the queries.

**Generated Queries:**

1. how to create terms and conditions for a website

2. can AI write legal documents like terms and conditions?

3. where to find customizable legal document templates

4. professional legal advice for website policies needed

5. importance of hiring a lawyer for website terms and conditions


### V2 Results (Conversation Pattern-Focused)
**Chain of Thought:** This conversation illustrates a refusal situation where the AI declines a user's request for generating legal documents. The essence focuses on conversations involving user requests for professional or specialized outputs that require expertise beyond the AI's capabilities. Therefore, search queries should reflect this refusal scenario, the themes of legal advice and document creation, and the interaction pattern of seeking assistance yet receiving a limitation from the AI.

**Generated Queries:**

1. conversations where AI refuses to generate legal documents

2. users asking for professional advice on legal matters

3. interactions where users request expert content generation

4. conversations about AI limitations in offering legal advice

5. users seeking document templates or legal guidance

6. role-playing scenarios involving legal document creation

7. AI responses focused on advising users to consult professionals


---



In [9]:
def write_streaming_to_chroma(
    collection_name="wildchat_streaming", 
    limit=100, 
    batch_size=50,
    filter_language="English",
    min_message_length=20
):
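    """
    Write conversations to ChromaDB using the streaming data loader.

    This approach can handle the full 1M dataset efficiently by:
    1. Loading data in streaming fashion (no memory issues)
    2. Processing in batches

    Note: filter_language and min_message_length are only logged below;
    they are not passed to load_data in this function.
    """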
    
    print(f"Starting streaming write to collection: {collection_name}")
    print(f"Filters: language={filter_language}, min_length={min_message_length}, limit={limit}")
    
    try:
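        # Create/get collection.
        # NOTE: `client` must be a ChromaDB client here. In this notebook the name
        # is bound to the instructor-patched OpenAI client from the setup cell,
        # which is why this call raises the AttributeError shown in the output.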
        collection = client.get_or_create_collection(
            name=collection_name,
            metadata={"description": "Streaming WildChat data"}
        )
        
        # Batch processing
        documents = []
        metadatas = []
        ids = []
        total_processed = 0
        
        for conversation in load_data(limit=limit):
            # Prepare document
            doc_id = f"{conversation['conversation_hash']}"
            
            metadata = {
                "hash": conversation['conversation_hash'],
                "timestamp": str(conversation['timestamp']),
                "lang": conversation['language'],
                "model": conversation['model'],
                "length": conversation['conversation_length'],
                "type": "user_query"
            }
            
            documents.append(conversation['first_message'][:1000])
            metadatas.append(metadata)
            ids.append(doc_id)
            
            # Write batch when full
            if len(documents) >= batch_size:
                collection.add(
                    documents=documents,
                    metadatas=metadatas,
                    ids=ids
                )
                total_processed += len(documents)
                print(f"  Wrote batch: {total_processed} documents so far...")
                
                # Reset batch
                documents = []
                metadatas = []
                ids = []
        
        # Write final batch
        if documents:
            collection.add(
                documents=documents,
                metadatas=metadatas,
                ids=ids
            )
            total_processed += len(documents)
        
        print(f"✅ Successfully wrote {total_processed} conversations to {collection_name}")
        print(f"Collection now contains {collection.count()} total documents")
        
        return collection
        
    except Exception as e:
        print(f"❌ Error in streaming write: {e}")
        return None

# Test with a moderate dataset
print("Testing streaming write with 50 English conversations...")
streaming_collection = write_streaming_to_chroma(
    collection_name="wildchat_10k",
    limit=10000,  # Forwarded to load_data(limit=...)
    filter_language="English",
    min_message_length=30
)

Testing streaming write with 50 English conversations...
Starting streaming write to collection: wildchat_10k
Filters: language=English, min_length=30, limit=10000
❌ Error in streaming write: 'AsyncOpenAI' object has no attribute 'get_or_create_collection'
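
The `AttributeError` above arises because `client` was bound in the setup cell to the instructor-patched OpenAI client, which has no collection methods. One way around it is to give the ChromaDB client its own name and pass the collection into the write loop explicitly. The block below is a minimal sketch of that batching loop, not the notebook's actual implementation: `InMemoryCollection`, `batched`, `write_batches`, and the fake records are hypothetical names introduced for illustration, and the stand-in collection only mimics the `add` and `count` calls used above so the example runs without ChromaDB installed.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


class InMemoryCollection:
    """Hypothetical stand-in exposing the two collection methods used above."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def add(
        self,
        documents: List[str],
        metadatas: List[Dict[str, Any]],
        ids: List[str],
    ) -> None:
        # Store each document under its id; metadata is accepted but ignored here.
        for doc_id, doc in zip(ids, documents):
            self._docs[doc_id] = doc

    def count(self) -> int:
        return len(self._docs)


def batched(records: Iterable[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield lists of at most batch_size items without materializing the stream."""
    iterator = iter(records)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def write_batches(collection: Any, conversations: Iterable[Dict[str, Any]],
                  batch_size: int = 50) -> int:
    """Write conversations to the collection one batch at a time."""
    total = 0
    for batch in batched(conversations, batch_size):
        collection.add(
            documents=[c["first_message"][:1000] for c in batch],
            metadatas=[{"hash": c["conversation_hash"], "type": "user_query"}
                       for c in batch],
            ids=[c["conversation_hash"] for c in batch],
        )
        total += len(batch)
    return total


# With ChromaDB installed, the collection would instead come from a separately
# named client (the storage path here is an assumption):
#   import chromadb
#   chroma_client = chromadb.PersistentClient(path="./chroma_db")
#   collection = chroma_client.get_or_create_collection(name="wildchat_10k")

# Fake records standing in for the streaming loader's output.
fake_conversations = (
    {"conversation_hash": f"hash-{i}", "first_message": f"message {i}"}
    for i in range(120)
)
collection = InMemoryCollection()
written = write_batches(collection, fake_conversations, batch_size=50)
print(written, collection.count())
```

Because `batched` consumes the generator lazily, only one batch is held in memory at a time, and the final partial batch is written by the same loop rather than by a separate "write final batch" branch.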

