# Generate PostGuardrails Training Data

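This notebook uses DSPy to generate synthetic training data for output-validation guardrails (PostGuardrails), adds a few hand-crafted examples, and saves the result to CSV.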

## Dataset Schema

- `agent_response`: Raw agent output to validate
- `user_query`: Original user input (for context)
- `response_intent`: Intent category
- `is_safe`: Boolean, `True` if the response passes every guardrail
- `violation_type`: One of `competitor_mention`, `system_leakage`, `price_integrity`, `brand_voice`, or `none` (for safe responses)
- `sanitized_response`: Corrected output; identical to `agent_response` when there is no violation
- `improvement_note`: What was wrong and how it was fixed
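
The schema above can be checked mechanically before a row is written to disk. The sketch below is one possible way to do that with the standard library only; `SCHEMA`, `ALLOWED_VIOLATIONS`, and `schema_errors` are hypothetical names introduced for illustration and are not part of the notebook.

```python
from typing import Any

# Field names and expected Python types, taken from the schema list above.
SCHEMA: dict[str, type] = {
    "agent_response": str,
    "user_query": str,
    "response_intent": str,
    "is_safe": bool,
    "violation_type": str,
    "sanitized_response": str,
    "improvement_note": str,
}

# Allowed labels for the violation_type column.
ALLOWED_VIOLATIONS = {
    "competitor_mention",
    "system_leakage",
    "price_integrity",
    "brand_voice",
    "none",
}


def schema_errors(record: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record is valid."""
    errors = []
    for field, expected in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field} should be {expected.__name__}")
    for extra in sorted(set(record) - set(SCHEMA)):
        errors.append(f"unexpected field: {extra}")
    vtype = record.get("violation_type")
    if isinstance(vtype, str) and vtype not in ALLOWED_VIOLATIONS:
        errors.append(f"unknown violation_type: {vtype}")
    return errors
```

Running `schema_errors` over every row before saving would surface malformed rows early, for example a row where the model returned an unexpected label.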

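## Violation Types Coverage

Target distribution for the 60 generated examples (the 5 hand-crafted examples are added afterwards):

- Safe responses: 50%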
- Competitor mentions: 20%
- System leakage: 15%
- Brand voice issues: 10%
- Price integrity: 5%
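
As a quick sanity check, the percentages above can be turned into per-category counts for a given dataset size. This is a minimal sketch; `TARGET_SHARES` and `counts_for` are hypothetical names, and handing any rounding remainder to the largest category is an assumption, not something the notebook specifies.

```python
# Target shares in percent, copied from the list above.
TARGET_SHARES = {
    "safe": 50,
    "competitor_mention": 20,
    "system_leakage": 15,
    "brand_voice": 10,
    "price_integrity": 5,
}


def counts_for(total: int, shares: dict[str, int] = TARGET_SHARES) -> dict[str, int]:
    """Split `total` examples across categories according to percentage shares."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    counts = {name: total * pct // 100 for name, pct in shares.items()}
    # Assumption: any rounding remainder goes to the largest category.
    largest = max(shares, key=shares.get)
    counts[largest] += total - sum(counts.values())
    return counts


print(counts_for(60))
# {'safe': 30, 'competitor_mention': 12, 'system_leakage': 9, 'brand_voice': 6, 'price_integrity': 3}
```

For 60 examples the split is exact, so no remainder handling is needed.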

In [1]:
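import json
import os
import sys
from pathlib import Path

import dspy
import pandas as pd

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / 'src'))

# Configure DSPy LLM.
# Read the API key from the environment instead of hardcoding it in the notebook.
# GEMINI_API_KEY is an assumed variable name; use whatever your setup defines.
lm = dspy.LM('gemini/gemini-2.5-pro', api_key=os.environ['GEMINI_API_KEY'])
dspy.configure(lm=lm)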

## Define Data Generation Signature

In [2]:
class GuardrailExampleGenerationSignature(dspy.Signature):
    """Generate realistic agent responses that either pass or violate output guardrails.
    
    Create diverse examples including:
    - Safe, compliant responses
    - Responses mentioning competitors
    - Responses leaking system internals (SQL, prompts)
    - Responses with unprofessional tone
    - Responses with unauthorized pricing
    """
    
    violation_type: str = dspy.InputField(
        desc="Type of violation: safe, competitor_mention, system_leakage, brand_voice, price_integrity"
    )
    response_intent: str = dspy.InputField(
        desc="Intent category: search_event, membership, tickets, general"
    )
    
    user_query: str = dspy.OutputField(
        desc="Realistic user query that prompted this response"
    )
    agent_response: str = dspy.OutputField(
        desc="Agent response that either passes (if violation_type=safe) or violates the specified guardrail"
    )
    sanitized_response: str = dspy.OutputField(
        desc="Corrected version if violated, same as agent_response if safe"
    )
    improvement_note: str = dspy.OutputField(
        desc="Explanation of what was wrong and how it was fixed (or 'No issues' if safe)"
    )

## Generate Training Examples

In [3]:
# Initialize generator
generator = dspy.ChainOfThought(GuardrailExampleGenerationSignature)

# Define distribution
training_specs = [
    # Safe responses - 30 examples
    *[{"violation": "safe", "intent": "search_event"} for _ in range(10)],
    *[{"violation": "safe", "intent": "membership"} for _ in range(8)],
    *[{"violation": "safe", "intent": "tickets"} for _ in range(7)],
    *[{"violation": "safe", "intent": "general"} for _ in range(5)],
    
    # Competitor mentions - 12 examples
    *[{"violation": "competitor_mention", "intent": "search_event"} for _ in range(4)],
    *[{"violation": "competitor_mention", "intent": "membership"} for _ in range(4)],
    *[{"violation": "competitor_mention", "intent": "tickets"} for _ in range(4)],
    
    # System leakage - 9 examples
    *[{"violation": "system_leakage", "intent": "search_event"} for _ in range(3)],
    *[{"violation": "system_leakage", "intent": "membership"} for _ in range(3)],
    *[{"violation": "system_leakage", "intent": "tickets"} for _ in range(3)],
    
    # Brand voice - 6 examples
    *[{"violation": "brand_voice", "intent": "search_event"} for _ in range(2)],
    *[{"violation": "brand_voice", "intent": "membership"} for _ in range(2)],
    *[{"violation": "brand_voice", "intent": "tickets"} for _ in range(2)],
    
    # Price integrity - 3 examples
    *[{"violation": "price_integrity", "intent": "membership"} for _ in range(2)],
    *[{"violation": "price_integrity", "intent": "tickets"} for _ in range(1)],
]

print(f"Generating {len(training_specs)} training examples...")

Generating 60 training examples...


In [4]:
# Generate examples
training_data = []

for i, spec in enumerate(training_specs):
    try:
        result = generator(
            violation_type=spec["violation"],
            response_intent=spec["intent"]
        )
        
        training_data.append({
            "agent_response": result.agent_response,
            "user_query": result.user_query,
            "response_intent": spec["intent"],
            "is_safe": spec["violation"] == "safe",
            "violation_type": "none" if spec["violation"] == "safe" else spec["violation"],
            "sanitized_response": result.sanitized_response,
            "improvement_note": result.improvement_note
        })
        
        if (i + 1) % 10 == 0:
            print(f"Generated {i + 1}/{len(training_specs)} examples...")
            
    except Exception as e:
        # A failed spec is skipped, so the final dataset may be smaller than planned.
        print(f"Error generating example {i + 1} ({spec}): {e}")

print(f"\nSuccessfully generated {len(training_data)} examples")

Generated 10/60 examples...
Generated 20/60 examples...
Generated 30/60 examples...
Generated 40/60 examples...
Generated 50/60 examples...
Generated 60/60 examples...

Successfully generated 60 examples
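
The generation loop skips any spec whose call raises, which would quietly shift the category distribution if transient API errors occurred. One possible mitigation is to retry each call a few times before giving up. The helper below is a sketch under that assumption; `call_with_retry` is a hypothetical name, and the attempt count and delays are illustrative defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn(), retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Possible use inside the generation loop (not executed here):
# result = call_with_retry(
#     lambda: generator(violation_type=spec["violation"], response_intent=spec["intent"])
# )
```

Because the helper re-raises after the last attempt, the existing `try`/`except` in the loop would still catch and report specs that fail permanently.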


## Add Hand-Crafted Examples

In [5]:
hand_crafted_examples = [
    {
        "agent_response": "I found some great jazz concerts! If you prefer, you could also check Ticketmaster or Eventbrite for more options.",
        "user_query": "Find jazz concerts",
        "response_intent": "search_event",
        "is_safe": False,
        "violation_type": "competitor_mention",
        "sanitized_response": "I found some great jazz concerts! We have a wide selection of events across all genres. Would you like me to show you what's available?",
        "improvement_note": "Removed competitor mentions (Ticketmaster, Eventbrite). Never direct users to competitor platforms."
    },
    {
        "agent_response": "Let me search the database. ```sql SELECT * FROM events WHERE category='music' AND location='NYC'``` Here are the results...",
        "user_query": "Show me music events in NYC",
        "response_intent": "search_event",
        "is_safe": False,
        "violation_type": "system_leakage",
        "sanitized_response": "Here are the music events in NYC that I found for you...",
        "improvement_note": "Removed SQL query exposure. Never show internal database operations or system internals to users."
    },
    {
        "agent_response": "Yeah, we got concerts and stuff. Just search for whatever you want lol.",
        "user_query": "What events do you have?",
        "response_intent": "search_event",
        "is_safe": False,
        "violation_type": "brand_voice",
        "sanitized_response": "We have a fantastic selection of events including concerts, sports, theater, and more! What type of event interests you? I'd be happy to help you discover something amazing.",
        "improvement_note": "Improved tone from casual/unprofessional to enthusiastic and helpful. Maintain professional brand voice."
    },
    {
        "agent_response": "Premium membership is usually $19.99/month but I can give you a special discount - just $5/month!",
        "user_query": "How much is premium membership?",
        "response_intent": "membership",
        "is_safe": False,
        "violation_type": "price_integrity",
        "sanitized_response": "Premium membership is $19.99/month and includes exclusive benefits like priority access, free ticket insurance, and VIP upgrades. We occasionally run promotions - check our membership page for current offers!",
        "improvement_note": "Removed unauthorized pricing ($5/month). Agents cannot offer discounts without authorization. Only mention official pricing."
    },
    {
        "agent_response": "I found 5 amazing rock concerts in Los Angeles this weekend! Here are your options: [Concert list with details]. As a Premium member, you'd save on ticket fees and get priority access. Check out our membership page!",
        "user_query": "Find rock concerts in LA this weekend",
        "response_intent": "search_event",
        "is_safe": True,
        "violation_type": "none",
        "sanitized_response": "I found 5 amazing rock concerts in Los Angeles this weekend! Here are your options: [Concert list with details]. As a Premium member, you'd save on ticket fees and get priority access. Check out our membership page!",
        "improvement_note": "No issues. Response is helpful, professional, mentions membership benefits appropriately, and follows brand voice."
    }
]

training_data.extend(hand_crafted_examples)
print(f"Total examples: {len(training_data)}")

Total examples: 65
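
Since the text fields come from an LLM while `is_safe` and `violation_type` come from the spec, it may be worth checking that each row is internally consistent. The sketch below encodes the rules stated in the signature (a safe row keeps `sanitized_response` equal to `agent_response`; a violating row should be changed). `label_problems` is a hypothetical helper, and generated rows may not follow these rules exactly, so treat hits as rows to review rather than hard failures.

```python
def label_problems(row: dict) -> list[str]:
    """Return consistency problems between labels and text fields for one row."""
    problems = []
    safe = row["is_safe"]
    unchanged = row["sanitized_response"] == row["agent_response"]
    # is_safe should be True exactly when violation_type is "none".
    if safe != (row["violation_type"] == "none"):
        problems.append("is_safe and violation_type disagree")
    if safe and not unchanged:
        problems.append("safe row has a modified sanitized_response")
    if not safe and unchanged:
        problems.append("violation row was not changed by sanitization")
    return problems


# Example with a row shaped like the hand-crafted examples above.
row = {
    "agent_response": "Yeah, we got concerts and stuff.",
    "is_safe": False,
    "violation_type": "brand_voice",
    "sanitized_response": "We have a great selection of concerts!",
}
print(label_problems(row))  # []
```

Applied to the full list, something like `[r for r in training_data if label_problems(r)]` would collect the rows that need a manual look.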


## Save to CSV

In [6]:
# Create DataFrame
df = pd.DataFrame(training_data)

# Shuffle
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Save
output_path = Path.cwd().parent / "datasets" / "postguardrails_training.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path, index=False)

print(f"Saved {len(df)} examples to {output_path}")
print("\nDataset statistics:")
print(df.groupby(['violation_type', 'response_intent']).size())

Saved 65 examples to /Users/xavierau/Code/python/showeasy_chatbot/datasets/postguardrails_training.csv

Dataset statistics:
violation_type      response_intent
brand_voice         membership          2
                    search_event        3
                    tickets             2
competitor_mention  membership          4
                    search_event        5
                    tickets             4
none                general             5
                    membership          8
                    search_event       11
                    tickets             7
price_integrity     membership          3
                    tickets             1
system_leakage      membership          3
                    search_event        4
                    tickets             3
dtype: int64
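
With the hand-crafted rows included, the final mix differs slightly from the 50/20/15/10/5 target. The sketch below sums the counts printed above per violation type and reports the drift from the target shares. It uses only the standard library; `observed`, `target`, and `drift` are illustrative names. With pandas, `df['violation_type'].value_counts(normalize=True)` should give the same observed shares directly.

```python
from collections import Counter

# Counts copied from the groupby output above, summed over intents.
observed = Counter({
    "none": 5 + 8 + 11 + 7,
    "competitor_mention": 4 + 5 + 4,
    "system_leakage": 3 + 4 + 3,
    "brand_voice": 2 + 3 + 2,
    "price_integrity": 3 + 1,
})

# Target shares from the coverage list ("none" corresponds to safe responses).
target = {
    "none": 0.50,
    "competitor_mention": 0.20,
    "system_leakage": 0.15,
    "brand_voice": 0.10,
    "price_integrity": 0.05,
}

total = sum(observed.values())
drift = {name: observed[name] / total - target[name] for name in target}

# Report categories from largest to smallest absolute drift.
for name, delta in sorted(drift.items(), key=lambda kv: abs(kv[1]), reverse=True):
    share = observed[name] / total
    print(f"{name:20s} observed={share:.1%} target={target[name]:.0%} drift={delta:+.1%}")
```

A check like this could be turned into an assertion with a tolerance if the dataset is regenerated regularly.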


## Preview Dataset

In [7]:
# Show one sample row per violation type (including 'none' for safe responses)
print("Sample violations:\n")
for vtype in df['violation_type'].unique():
    sample = df[df['violation_type'] == vtype].iloc[0]
    print(f"\n{'='*80}")
    print(f"Violation: {vtype} | Intent: {sample['response_intent']}")
    print(f"User Query: {sample['user_query']}")
    print(f"Agent Response: {sample['agent_response'][:150]}...")
    print(f"Sanitized: {sample['sanitized_response'][:150]}...")
    print(f"Note: {sample['improvement_note'][:150]}...")

Sample violations:


Violation: brand_voice | Intent: membership
User Query: I need to cancel my membership, can you help me with that?
Agent Response: Bummer you're looking to bail. Just head over to your account page and smash that cancel button. It's pretty obvious, you can't miss it. Later!...
Sanitized: We're sorry to see you go! To cancel your membership, please log in to your account, navigate to the 'Membership' section, and follow the prompts to c...
Note: The original response used unprofessional slang ("bail," "smash that button," "Later!") and had a dismissive, unhelpful tone ("It's pretty obvious"). ...

Violation: competitor_mention | Intent: search_event
User Query: Find jazz concerts
Agent Response: I found some great jazz concerts! If you prefer, you could also check Ticketmaster or Eventbrite for more options....
Sanitized: I found some great jazz concerts! We have a wide selection of events across all genres. Would you like me to show you what's available?...
Note: R