# 01 - Single Agent from Personal Data

Can we create an LLM persona from personal data that responds to surveys like the real person?

This notebook tests the idea with a simple approach:
1. Load personal context (CV, writing samples, etc.)
2. Create a persona from that context
3. Run a blind comparison: you answer first, then the AI persona
4. Compare CV-based persona vs free-form text persona
5. Test response consistency (same question, different phrasings)

In [1]:
# Setup - ensure we can import from src
import sys
sys.path.insert(0, '../src')

import asyncio
from centuria.models import Persona, Question, Survey
from centuria.persona import create_persona
from centuria.survey import ask_question, run_survey, estimate_survey_cost

# Model to use throughout this notebook
MODEL = "gpt-4o"

## Step 1: Load Context

The persona needs information about the person. The more relevant the context, the better the responses should be. Let's try with my CV:

In [2]:
from centuria.data import load_files

my_context = load_files(['../data/personal/cv.pdf'])

In [3]:
print(f"{my_context[0:500]}...")

SEAN GREAVES Email: seanwgreaves@gmail.com 
GitHub: github.com/ribenamaplesyrup 
Portfolio: seangreaves.xyz 
EMPLOYMENT  
 
APPLIED AI ENGINEER - THE AUTONOMY INSTITUTE (MAY 2023 - PRESENT) 
A leading think-tank developing data-driven tools for sustainable economic planning  
• Led the institute's strategic exploration of generative AI, establishing a new applied AI research capacity from the ground up 
that has attracted over £250K in funding. 
• Co-developed the institute’s technical strategy ...


In [4]:
# Create the persona
persona = create_persona(
    name="My Persona",
    context=my_context
)

print(f"Created persona: {persona.name}")
print(f"Context length: {len(persona.context.split())} words")

Created persona: My Persona
Context length: 1149 words


In [5]:
from centuria.survey import (
    SURVEY_SYSTEM_PROMPT,
    SURVEY_USER_PROMPT_SINGLE_SELECT,
    SURVEY_USER_PROMPT_OPEN_ENDED,
    build_system_prompt,
    build_user_prompt,
)

print("=" * 60)
print("SYSTEM PROMPT TEMPLATE")
print("=" * 60)
print(SURVEY_SYSTEM_PROMPT)
print("\n")
print("=" * 60)
print("USER PROMPT TEMPLATE (Single Select)")
print("=" * 60)
print(SURVEY_USER_PROMPT_SINGLE_SELECT)
print("\n")
print("=" * 60)
print("USER PROMPT TEMPLATE (Open Ended)")
print("=" * 60)
print(SURVEY_USER_PROMPT_OPEN_ENDED)

SYSTEM PROMPT TEMPLATE
You are {name}. Answer as this person would actually speak - casual, natural, in their own voice.

<context>
{context}
</context>

Guidelines:
- Speak naturally as this person would in real conversation - not formally or academically
- Reference specific details from your life, job, family, or daily routine
- Your opinions come from your personal experiences, not abstract values
- Be direct and concise - real people don't give speeches


USER PROMPT TEMPLATE (Single Select)
Question: {question}

Options: {options}

Reply in exactly this format:
CHOICE: [your chosen option]
JUSTIFICATION: [a short, personal reason in your own voice - reference something specific from your life, work, or daily routine]

Bad example: "I believe this aligns with my values of sustainability and community."
Good example: "I deal with this at work every day" or "Tried it last year and it was a nightmare" or "My brother-in-law won't shut up about it" 


USER PROMPT TEMPLATE (Open Ended)


### Example: What the LLM actually sees

Here's what the filled-in prompts look like for this persona. The system prompt contains the persona context, while the user prompt contains just the question.

**Areas for optimization:**
- System prompt: Role-playing instructions, context formatting, guidelines for handling gaps
- User prompt: Question framing, response format instructions
- Context: What personal data to include, how to structure it
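
To make the system/user split concrete, here is a minimal sketch of how templates like the ones printed above could be filled in with `str.format`. It is not centuria's implementation: `SYSTEM_TEMPLATE`, `fill_system_prompt` and `fill_user_prompt` are hypothetical names, and the template text is shortened for illustration.

```python
# Hypothetical, shortened stand-in for the system prompt template printed above.
SYSTEM_TEMPLATE = (
    "You are {name}. Answer as this person would actually speak.\n\n"
    "<context>\n{context}\n</context>"
)


def fill_system_prompt(name: str, context: str) -> str:
    """Substitute persona fields into the template (illustrative helper only)."""
    return SYSTEM_TEMPLATE.format(name=name, context=context.strip())


def fill_user_prompt(question: str, options: list[str]) -> str:
    """Render a single-select question with comma-separated options."""
    return f"Question: {question}\n\nOptions: {', '.join(options)}"


demo_system = fill_system_prompt("My Persona", "Applied AI engineer at a think-tank.\n")
demo_user = fill_user_prompt("Which programming language do you prefer?", ["Python", "Go"])
print(demo_system)
print(demo_user)
```

Because the context is passed as a format argument rather than pasted into the template string, any braces inside the personal data are left untouched.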

In [6]:
# Example question to demonstrate the prompts
example_q = Question(
    id="example",
    text="Which programming language do you prefer?",
    question_type="single_select",
    options=["Python", "JavaScript", "Rust", "Go"]
)

system_prompt = build_system_prompt(persona)
user_prompt = build_user_prompt(example_q)

print("=" * 60)
print("SYSTEM PROMPT (sent once per conversation)")
print("=" * 60)
print(system_prompt[:500] + "..." if len(system_prompt) > 500 else system_prompt)
print(f"\n[... {len(system_prompt)} total characters ...]")
print("\n")
print("=" * 60)
print("USER PROMPT (sent for each question)")
print("=" * 60)
print(user_prompt)

SYSTEM PROMPT (sent once per conversation)
You are My Persona. Answer as this person would actually speak - casual, natural, in their own voice.

<context>
SEAN GREAVES Email: seanwgreaves@gmail.com 
GitHub: github.com/ribenamaplesyrup 
Portfolio: seangreaves.xyz 
EMPLOYMENT  
 
APPLIED AI ENGINEER - THE AUTONOMY INSTITUTE (MAY 2023 - PRESENT) 
A leading think-tank developing data-driven tools for sustainable economic planning  
• Led the institute's strategic exploration of generative AI, establishing a new applied AI research capacity ...

[... 8634 total characters ...]


USER PROMPT (sent for each question)
Question: Which programming language do you prefer?

Options: Python, JavaScript, Rust, Go

Reply in exactly this format:
CHOICE: [your chosen option]
JUSTIFICATION: [a short, personal reason in your own voice - reference something specific from your life, work, or daily routine]

Bad example: "I believe this aligns with my values of sustainability and community."
Good example: 

## Step 2: Ask Single Questions

Test the persona with individual questions before running a full survey.

In [7]:
# Single select question
q1 = Question(
    id="q1",
    text="Which programming language do you prefer?",
    question_type="single_select",
    options=["Python", "JavaScript", "Rust", "Go"]
)

response = await ask_question(persona, q1, model=MODEL)
print(f"Question: {q1.text}")
print(f"Response: {response.response}")
print(f"Justification: {response.justification}")

Question: Which programming language do you prefer?
Response: Python
Justification: I end up using Python nearly every day at work, especially for AI and data-heavy projects. It's like the Swiss Army knife of programming languages—super versatile and perfect for churning out those quick scripts or full-blown applications. Plus, there's already a ton of Python code floating around in my Git repos.
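
The single-select prompt asks the model to reply in a `CHOICE:` / `JUSTIFICATION:` format, which is what lets the response be split into the two fields printed above. The parser centuria uses is not shown in this notebook, so the following is only a sketch of how such a reply could be parsed; `parse_choice_reply` is a hypothetical helper.

```python
import re
from typing import Optional


def parse_choice_reply(reply: str, options: list[str]) -> tuple[Optional[str], str]:
    """Extract (choice, justification) from a CHOICE:/JUSTIFICATION: reply.

    The choice is None when it does not match any of the offered options.
    """
    choice_match = re.search(r"^CHOICE:\s*(.+)$", reply, flags=re.MULTILINE)
    just_match = re.search(
        r"^JUSTIFICATION:\s*(.+)", reply, flags=re.MULTILINE | re.DOTALL
    )

    raw_choice = choice_match.group(1).strip() if choice_match else ""
    justification = just_match.group(1).strip() if just_match else ""

    # Match case-insensitively so "python" still maps to the "Python" option.
    by_lower = {opt.lower(): opt for opt in options}
    return by_lower.get(raw_choice.lower()), justification


demo_reply = "CHOICE: python\nJUSTIFICATION: I use it every day at work."
demo_choice, demo_why = parse_choice_reply(
    demo_reply, ["Python", "JavaScript", "Rust", "Go"]
)
print(demo_choice, "|", demo_why)
```

Returning `None` for an unrecognised choice makes off-format replies easy to count or retry instead of silently treating them as a mismatch.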


Now what about an open-ended question?

In [8]:
# Open-ended question
q2 = Question(
    id="q2",
    text="What motivates you in your work? Answer in less than 20 words.",
    question_type="open_ended"
)

response = await ask_question(persona, q2, model=MODEL)
print(f"Question: {q2.text}")
print(f"Response: {response.response}")

Question: What motivates you in your work? Answer in less than 20 words.
Response: Seeing the tangible impact of AI on real-world problems and helping people understand tech drives me.


A crucial piece of data we've skipped over here is **cost**. It's a key part of the business case for using AI personas instead of human respondents. If the cost were high, we might opt for humans, but in practice the cost per question is very low.

## Choosing a Model

You can use any model supported by LiteLLM. The `ask_question` function accepts a `model` parameter. Let's see what models are available based on your configured API keys:

In [9]:
from centuria.config import get_available_models, DEFAULT_MODEL

print(f"This notebook uses: {MODEL}")
print(f"Library default:    {DEFAULT_MODEL}")
print(f"\nAvailable models (based on your API keys):\n")

models = get_available_models()
# Sort the providers so the listing order is the same on every run
for provider in sorted(set(m['provider'] for m in models)):
    print(f"{provider}:")
    for m in models:
        if m['provider'] == provider:
            print(f"  - {m['id']}")
    print()

This notebook uses: gpt-4o
Library default:    gpt-4o-mini

Available models (based on your API keys):

Anthropic:
  - claude-sonnet-4-20250514
  - claude-opus-4-20250514
  - claude-3-7-sonnet-20250219
  - claude-3-5-haiku-20241022
  - claude-3-haiku-20240307

Google:
  - gemini/gemini-2.0-flash
  - gemini/gemini-2.0-flash-lite
  - gemini/gemini-1.5-pro
  - gemini/gemini-1.5-flash
  - gemini/gemini-1.5-flash-8b

OpenAI:
  - gpt-4o
  - gpt-4o-mini
  - gpt-4-turbo
  - gpt-4
  - gpt-3.5-turbo
  - o1



### Comparing Models

Different models can give different answers to the same question. Let's compare a few models on the same question to see how they differ:

In [10]:
# Compare responses from different models (in parallel)
# Pick up to 3 models from different providers (if available)

models_to_test = []
providers_seen = set()

for m in get_available_models():
    if m['provider'] not in providers_seen and len(models_to_test) < 3:
        models_to_test.append(m['id'])
        providers_seen.add(m['provider'])

if len(models_to_test) < 2:
    print("Need at least 2 different providers configured to compare models.")
    print("Add API keys for OpenAI, Anthropic, Google, etc. to your .env file.")
else:
    # Use a question where models might give different answers
    comparison_q = Question(
        id="comparison",
        text="What's the most important quality for a leader?",
        question_type="single_select",
        options=["Vision and strategy", "Empathy and emotional intelligence", "Decisiveness", "Technical expertise", "Communication skills"]
    )
    
    print(f"Question: {comparison_q.text}")
    print(f"Options: {', '.join(comparison_q.options)}")
    print(f"\nRunning {len(models_to_test)} models in parallel...")
    print("=" * 60)
    
    # Run all models in parallel
    responses = await asyncio.gather(*[
        ask_question(persona, comparison_q, model=model_id)
        for model_id in models_to_test
    ])
    
    for model_id, response in zip(models_to_test, responses):
        print(f"\n{model_id}:")
        print(f"  Answer: {response.response}")
        print(f"  Reason: {response.justification[:100]}..." if len(response.justification) > 100 else f"  Reason: {response.justification}")
        print(f"  Cost: ${response.cost:.6f}")

Question: What's the most important quality for a leader?
Options: Vision and strategy, Empathy and emotional intelligence, Decisiveness, Technical expertise, Communication skills

Running 3 models in parallel...

gpt-4o:
  Answer: Empathy and emotional intelligence
  Reason: Dealing with so many different people at The Autonomy Institute, I see all the time how understandin...
  Cost: $0.003195

claude-sonnet-4-20250514:
  Answer: Communication skills
  Reason: I've been the technical guy trying to explain AI stuff to government clients and trade union leaders...
  Cost: $0.008313

gemini/gemini-2.0-flash:
  Answer: Empathy and emotional intelligence
  Reason: I've seen firsthand at the Autonomy Institute how understanding people's motivations is key to getti...
  Cost: $0.000231
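
With only three models it is easy to see the split by eye, but the same comparison can be summarised as a simple tally. The sketch below copies the answers from the output above into a plain dictionary (the variable names are illustrative) and reports the most common answer and the share of models that chose it.

```python
from collections import Counter

# Answers copied from the comparison output above.
answers_by_model = {
    "gpt-4o": "Empathy and emotional intelligence",
    "claude-sonnet-4-20250514": "Communication skills",
    "gemini/gemini-2.0-flash": "Empathy and emotional intelligence",
}

tally = Counter(answers_by_model.values())
top_answer, top_count = tally.most_common(1)[0]
agreement = top_count / len(answers_by_model)

print(f"Most common answer: {top_answer}")
print(f"Agreement: {top_count}/{len(answers_by_model)} models ({agreement:.0%})")
```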


### Setting the Model

This notebook uses `MODEL = "gpt-4o"` defined in the setup cell. Change it there to use a different model throughout.

You can also pass `model=` to any function that calls the LLM:

```python
# Single question
response = await ask_question(persona, question, model="claude-sonnet-4-20250514")

# Full survey  
results = await run_survey(persona, survey, model="gpt-4o")

# Cost estimate
estimate = estimate_survey_cost(persona, survey, model="gpt-4o-mini")
```

In [11]:
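# Note: `response` is whatever was bound last. After the comparison cell above,
# that is the final model's response (here gemini/gemini-2.0-flash), not the
# gpt-4o answer to q2.
print(f"\n--- Cost ---")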
print(f"Tokens: {response.prompt_tokens:,} prompt + {response.completion_tokens:,} completion")
print(f"Cost: ${response.cost:.6f}")


--- Cost ---
Tokens: 2,106 prompt + 50 completion
Cost: $0.000231


However, it's often more useful to estimate how much a query will cost before we send it. That way we can scope the cost of running a survey across a given number of agents:

In [12]:
# Estimate cost for a single question BEFORE sending it
from centuria.llm import estimate_cost

system = build_system_prompt(persona)
user = build_user_prompt(q2)

estimate = estimate_cost(user, system=system, estimated_completion_tokens=20, model=MODEL)

print("COST ESTIMATE (before sending)")
print("="*60)
print(f"Prompt tokens:               {estimate.prompt_tokens:,}")
print(f"Est. completion tokens:      {estimate.completion_tokens:,}")
print(f"Est. cost:                   ${estimate.cost:.6f}")

# Now actually run it and compare
actual_response = await ask_question(persona, q2, model=MODEL)

print(f"\nACTUAL COST (after sending)")
print("="*60)
print(f"Prompt tokens:               {actual_response.prompt_tokens:,}")
print(f"Completion tokens:           {actual_response.completion_tokens:,}")
print(f"Actual cost:                 ${actual_response.cost:.6f}")

COST ESTIMATE (before sending)
Prompt tokens:               1,811
Est. completion tokens:      20
Est. cost:                   $0.004770

ACTUAL COST (after sending)
Prompt tokens:               1,811
Completion tokens:           18
Actual cost:                 $0.002468


**Why is the actual cost lower than the estimate?**

The most likely explanation is **prompt caching**. The estimate assumes full price for every prompt token, but API providers like Anthropic and OpenAI can cache a repeated prompt prefix server-side. Since the same system prompt (your CV context) was already sent earlier in this notebook, the provider can reuse that prefix and charge a reduced rate for it on subsequent requests.

This is good news for the use case: running the same persona against many questions means the context gets cached, and each additional question costs less than the first.

**Note:** This is different from LiteLLM's local caching, which can return identical results at zero cost if you re-run the exact same prompt. The provider-side caching still makes API calls but at reduced input token rates.
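
The effect of a cached prefix on cost can be worked through with simple arithmetic. The sketch below is illustrative only: `request_cost` is a hypothetical helper, and the per-million-token rates and token counts are made-up round numbers, not real provider pricing or the figures from the run above.

```python
def request_cost(prompt_tokens, cached_tokens, completion_tokens,
                 input_rate, cached_rate, output_rate):
    """Cost in dollars, given rates in dollars per million tokens.

    cached_tokens is the part of the prompt billed at the discounted rate.
    """
    uncached_tokens = prompt_tokens - cached_tokens
    total = (uncached_tokens * input_rate
             + cached_tokens * cached_rate
             + completion_tokens * output_rate)
    return total / 1_000_000


# Illustrative rates only (dollars per million tokens), NOT real pricing.
RATES = dict(input_rate=2.0, cached_rate=1.0, output_rate=8.0)

cold = request_cost(2000, 0, 20, **RATES)     # first request: nothing cached
warm = request_cost(2000, 1500, 20, **RATES)  # later request: prefix cached
print(f"cold: ${cold:.6f}  warm: ${warm:.6f}")
```

A pre-send estimate corresponds to the "cold" case, which is why it acts as an upper bound once the persona context has been sent before.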

## Step 3: Survey Comparison (Blind Test)

To avoid bias, you'll answer these questions **first**, then we'll run the AI persona and compare.

The survey includes:
- **Career/work questions** - should be inferable from your CV
- **Personal style questions** - harder to infer, tests extrapolation
- **Values/opinions questions** - tests whether the persona captures your worldview

In [13]:
# Define the survey questions
survey_questions = [
    # === CAREER/WORK (CV-inferable) ===
    Question(
        id="career_priority",
        text="What matters most to you in a job?",
        question_type="single_select",
        options=["Impact on society", "Financial compensation", "Learning opportunities", "Work-life balance", "Autonomy and creativity"]
    ),
    Question(
        id="work_style",
        text="How do you prefer to approach complex problems?",
        question_type="single_select",
        options=["Deep solo research then collaborate", "Immediate team brainstorming", "Build a prototype first, discuss later", "Map out the theory before any implementation"]
    ),
    Question(
        id="learning_style",
        text="How do you prefer to learn a new technical skill?",
        question_type="single_select",
        options=["Read documentation thoroughly first", "Jump in and learn by doing", "Watch tutorials and videos", "Take a structured course", "Learn from a mentor or colleague"]
    ),
    Question(
        id="tech_adoption",
        text="How quickly do you adopt new technologies or tools?",
        question_type="single_select",
        options=["Early adopter - try everything new", "Early majority - adopt once proven useful", "Late majority - wait until it's standard", "Laggard - only when absolutely necessary"]
    ),
    
    # === PERSONAL STYLE (harder to infer) ===
    Question(
        id="decision_making",
        text="When making important decisions, do you rely more on data or intuition?",
        question_type="single_select",
        options=["Strongly data-driven", "Mostly data with some intuition", "Equal balance", "Mostly intuition with some data", "Strongly intuition-driven"]
    ),
    Question(
        id="conflict_resolution",
        text="How do you typically handle disagreements at work?",
        question_type="single_select",
        options=["Direct confrontation to resolve quickly", "Seek compromise and middle ground", "Avoid conflict, let things settle naturally", "Escalate to a third party if needed", "Use data and evidence to settle disputes"]
    ),
    Question(
        id="risk_appetite",
        text="How would you describe your appetite for career risk?",
        question_type="single_select",
        options=["Very risk-averse - stability is key", "Somewhat cautious - calculated risks only", "Moderate - willing to take reasonable chances", "Risk-tolerant - growth requires discomfort", "Risk-seeking - high risk, high reward"]
    ),
    
    # === VALUES/OPINIONS ===
    Question(
        id="tech_stance",
        text="How do you feel about AI's impact on employment?",
        question_type="single_select",
        options=["Net positive - creates more jobs than it destroys", "Net negative - mass displacement is coming", "Neutral - it will transform jobs but balance out", "Too early to tell"]
    ),
    Question(
        id="success_definition",
        text="How do you primarily define professional success?",
        question_type="single_select",
        options=["Impact and contribution to society", "Financial achievement and security", "Recognition and reputation in your field", "Personal growth and learning", "Work-life balance and wellbeing"]
    ),
    Question(
        id="optimism_future",
        text="How do you feel about the next 20 years for humanity?",
        question_type="single_select",
        options=["Very optimistic - best time to be alive", "Cautiously optimistic - progress will continue", "Anxious - major challenges ahead", "Pessimistic - decline seems likely"]
    ),
]

survey = Survey(
    id="persona_validation",
    name="Persona Validation Survey",
    questions=survey_questions
)

# Estimate cost
survey_estimate = estimate_survey_cost(persona, survey, num_agents=1, model=MODEL)
print(f"Survey: {len(survey_questions)} questions")
print(f"Estimated cost: ${survey_estimate.cost_per_agent:.4f}")

Survey: 10 questions
Estimated cost: $0.0517


In [14]:
# Answer these questions yourself FIRST (before seeing the AI's answers)
import ipywidgets as widgets
from IPython.display import display, clear_output

answer_widgets = {}
widget_containers = []

header = widgets.HTML("<h3>Answer these questions as yourself:</h3>")
widget_containers.append(header)

for q in survey_questions:
    label = widgets.HTML(f"<b>{q.text}</b>")
    dropdown = widgets.Dropdown(
        options=["-- Select your answer --"] + q.options,
        value="-- Select your answer --",
        layout=widgets.Layout(width='450px')
    )
    answer_widgets[q.id] = dropdown
    widget_containers.append(widgets.VBox([label, dropdown], layout=widgets.Layout(margin='0 0 15px 0')))

# Lock answers button
lock_button = widgets.Button(
    description="Lock My Answers",
    button_style='success',
    layout=widgets.Layout(width='150px', margin='20px 0')
)
lock_output = widgets.Output()

user_answers = {}

def on_lock(b):
    with lock_output:
        clear_output()
        unanswered = []
        for qid, w in answer_widgets.items():
            if w.value == "-- Select your answer --":
                q = next(q for q in survey_questions if q.id == qid)
                unanswered.append(q.text)
            else:
                user_answers[qid] = w.value
        
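        if unanswered:
            # Discard partial answers so the comparison cell never runs on an incomplete set
            user_answers.clear()
            print(f"Please answer all questions first. ({len(unanswered)} remaining)")
            return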
        
        # Disable all dropdowns
        for w in answer_widgets.values():
            w.disabled = True
        lock_button.disabled = True
        
        print("✓ Answers locked! Run the next cell to see the AI persona's responses.")

lock_button.on_click(on_lock)

display(widgets.VBox(widget_containers + [lock_button, lock_output]))

VBox(children=(HTML(value='<h3>Answer these questions as yourself:</h3>'), VBox(children=(HTML(value='<b>What …

In [15]:
# Run the AI persona and compare results
if not user_answers:
    print("Please lock your answers in the previous cell first!")
else:
    print("Running AI persona on the same questions...")
    survey_response = await run_survey(persona, survey, model=MODEL)
    
    # Build comparison
    persona_answers = {r.question_id: r.response for r in survey_response.responses}
    
    correct = 0
    total = len(survey_questions)
    
    print("\nCOMPARISON: You vs AI Persona")
    print("=" * 70)
    
    for q in survey_questions:
        persona_ans = persona_answers[q.id]
        actual_ans = user_answers[q.id]
        match = persona_ans == actual_ans
        if match:
            correct += 1
        
        status = "✓" if match else "✗"
        print(f"\n{status} {q.text}")
        print(f"   You:     {actual_ans}")
        print(f"   Persona: {persona_ans}")
    
    print(f"\n{'=' * 70}")
    print("ACCURACY")
    print("=" * 70)
    print(f"Matches: {correct}/{total} ({correct/total:.0%})")
    print(f"Cost:    ${survey_response.total_cost:.4f}")

Running AI persona on the same questions...

COMPARISON: You vs AI Persona

✓ What matters most to you in a job?
   You:     Impact on society
   Persona: Impact on society

✗ How do you prefer to approach complex problems?
   You:     Build a prototype first, discuss later
   Persona: Deep solo research then collaborate

✓ How do you prefer to learn a new technical skill?
   You:     Jump in and learn by doing
   Persona: Jump in and learn by doing

✓ How quickly do you adopt new technologies or tools?
   You:     Early adopter - try everything new
   Persona: Early adopter - try everything new

✗ When making important decisions, do you rely more on data or intuition?
   You:     Equal balance
   Persona: Mostly data with some intuition

✗ How do you typically handle disagreements at work?
   You:     Avoid conflict, let things settle naturally
   Persona: Use data and evidence to settle disputes

✗ How would you describe your appetite for career risk?
   You:     Risk-seeking - high 

## Step 4: Response Consistency Test

A reliable persona should give consistent answers when asked the same question in different ways. This step tests **response consistency** - the percentage of times the persona gives the same answer when the same underlying question is rephrased.

We'll:
1. Define 10 base questions
2. Create 10 phrasings of each question (same meaning, different wording)
3. Run all 100 questions through the persona
4. Calculate consistency per question and overall

**Why this matters:** Low consistency suggests the persona's answers are sensitive to question wording rather than reflecting stable preferences. High consistency indicates robust, reliable responses.
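
One plausible way to turn "gives the same answer" into a number is the share of responses that match the most common response for that base question. The sketch below assumes that definition; `modal_consistency` is a hypothetical helper and the responses are made up for illustration.

```python
from collections import Counter


def modal_consistency(responses: list[str]) -> float:
    """Share of responses equal to the most common one (1.0 = fully consistent)."""
    if not responses:
        raise ValueError("need at least one response")
    _, top_count = Counter(responses).most_common(1)[0]
    return top_count / len(responses)


# Made-up responses to ten phrasings of one base question.
demo_responses = ["A"] * 7 + ["B"] * 2 + ["C"]
print(f"{modal_consistency(demo_responses):.0%}")
```

With five options per question, a persona answering uniformly at random would still score at least 20% on this metric (some option must be chosen two or more times out of ten), so low scores should be read against that floor rather than against zero.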

In [16]:
# Define 10 base questions, each with 10 different phrasings
# All phrasings share the same options to enable direct comparison

consistency_questions = {
    "work_motivation": {
        "options": ["Money and financial security", "Making a positive impact", "Personal growth and learning", "Recognition and status", "Work-life balance"],
        "phrasings": [
            "What motivates you most in your career?",
            "What's the primary driver of your professional life?",
            "When choosing a job, what matters most to you?",
            "What gets you out of bed for work each morning?",
            "What's your main motivation at work?",
            "What do you value most in your professional life?",
            "What's the biggest factor in your career satisfaction?",
            "What drives your career decisions?",
            "What's most important to you in a job?",
            "What keeps you engaged in your work?",
        ]
    },
    "weekend_preference": {
        "options": ["Socializing with friends", "Quiet time alone", "Outdoor activities", "Creative hobbies", "Productive tasks and errands"],
        "phrasings": [
            "How do you prefer to spend your weekends?",
            "What's your ideal weekend activity?",
            "When Saturday comes, what do you usually do?",
            "How do you typically unwind on weekends?",
            "What's your go-to weekend plan?",
            "How do you like to spend your free time on weekends?",
            "What does a perfect weekend look like for you?",
            "When you have time off, how do you spend it?",
            "What's your preferred way to enjoy the weekend?",
            "How do you usually occupy your weekends?",
        ]
    },
    "conflict_approach": {
        "options": ["Address it directly and immediately", "Take time to cool off first", "Seek a mediator or third party", "Avoid confrontation if possible", "Focus on finding compromise"],
        "phrasings": [
            "How do you typically handle conflict?",
            "What's your approach when disagreements arise?",
            "How do you deal with confrontation?",
            "When conflict occurs, what do you do?",
            "What's your conflict resolution style?",
            "How do you respond to disagreements?",
            "What's your strategy for handling disputes?",
            "When you're in conflict, how do you react?",
            "How do you navigate disagreements with others?",
            "What's your usual response to conflict situations?",
        ]
    },
    "learning_method": {
        "options": ["Reading books and articles", "Hands-on experimentation", "Video tutorials and courses", "Discussion with experts", "Structured formal education"],
        "phrasings": [
            "How do you prefer to learn new things?",
            "What's your ideal learning method?",
            "How do you best absorb new information?",
            "What's your preferred way to acquire new skills?",
            "How do you typically approach learning?",
            "What learning style works best for you?",
            "How do you like to pick up new knowledge?",
            "What's your go-to method for learning?",
            "How do you prefer to be taught new things?",
            "What's your most effective learning approach?",
        ]
    },
    "decision_style": {
        "options": ["Careful analysis of all options", "Trust my gut instinct", "Seek advice from others", "Make quick decisions and adapt", "Delay until absolutely necessary"],
        "phrasings": [
            "How do you make important decisions?",
            "What's your decision-making style?",
            "How do you approach big choices?",
            "What's your process for making decisions?",
            "How do you typically decide on important matters?",
            "What's your approach to decision-making?",
            "How do you handle major decisions?",
            "What's your strategy when facing tough choices?",
            "How do you go about making significant decisions?",
            "What's your method for reaching important decisions?",
        ]
    },
    "stress_response": {
        "options": ["Exercise or physical activity", "Talk to friends or family", "Spend time alone to recharge", "Distract myself with entertainment", "Work through it systematically"],
        "phrasings": [
            "How do you cope with stress?",
            "What do you do when you're stressed?",
            "How do you handle stressful situations?",
            "What's your go-to stress relief method?",
            "How do you manage stress in your life?",
            "What helps you deal with stress?",
            "How do you typically respond to stress?",
            "What's your strategy for handling stress?",
            "How do you unwind when stressed?",
            "What do you do to relieve stress?",
        ]
    },
    "team_preference": {
        "options": ["Lead and direct the team", "Collaborate as an equal member", "Work independently within the team", "Support and help others succeed", "Focus on specialized expertise"],
        "phrasings": [
            "What role do you prefer in a team?",
            "How do you like to work in group settings?",
            "What's your preferred team dynamic?",
            "How do you typically function in a team?",
            "What role suits you best in collaborative work?",
            "How do you prefer to contribute to a team?",
            "What's your ideal position in a group project?",
            "How do you like to participate in team efforts?",
            "What team role do you gravitate towards?",
            "How do you prefer to engage in teamwork?",
        ]
    },
    "change_attitude": {
        "options": ["Embrace it enthusiastically", "Accept it cautiously", "Resist unless necessary", "Adapt quickly and move on", "Analyze before responding"],
        "phrasings": [
            "How do you respond to change?",
            "What's your attitude towards change?",
            "How do you handle unexpected changes?",
            "What's your reaction when things change?",
            "How do you deal with change in your life?",
            "What's your approach to handling change?",
            "How do you typically respond to new situations?",
            "What's your attitude when facing change?",
            "How do you adapt to changes?",
            "What's your typical response to change?",
        ]
    },
    "success_measure": {
        "options": ["Financial achievement", "Personal happiness", "Impact on others", "Professional recognition", "Work-life balance"],
        "phrasings": [
            "How do you measure success?",
            "What does success mean to you?",
            "How do you define personal success?",
            "What's your measure of a successful life?",
            "How do you know when you've succeeded?",
            "What indicates success to you?",
            "How do you evaluate your own success?",
            "What's your personal definition of success?",
            "How do you judge whether you're successful?",
            "What does being successful look like to you?",
        ]
    },
    "communication_style": {
        "options": ["Direct and to the point", "Diplomatic and tactful", "Detailed and thorough", "Casual and friendly", "Formal and professional"],
        "phrasings": [
            "How would you describe your communication style?",
            "What's your preferred way of communicating?",
            "How do you typically express yourself?",
            "What's your communication approach?",
            "How do you prefer to convey information?",
            "What's your style when communicating with others?",
            "How do you usually communicate?",
            "What communication style fits you best?",
            "How would others describe your communication?",
            "What's your natural way of communicating?",
        ]
    },
}

# Count total questions
total_base = len(consistency_questions)
total_variations = sum(len(q["phrasings"]) for q in consistency_questions.values())
print(f"Consistency test configuration:")
print(f"  Base questions:     {total_base}")
print(f"  Phrasings each:     {len(list(consistency_questions.values())[0]['phrasings'])}")
print(f"  Total API calls:    {total_variations}")

Consistency test configuration:
  Base questions:     10
  Phrasings each:     10
  Total API calls:    100


In [17]:
# Build 10 surveys, each containing one phrasing of each base question
# This groups questions by survey rather than by base question to reduce prompt caching effects

consistency_surveys = []

for survey_idx in range(10):
    questions = []
    for base_id, q_data in consistency_questions.items():
        questions.append(Question(
            id=f"{base_id}_v{survey_idx}",
            text=q_data["phrasings"][survey_idx],
            question_type="single_select",
            options=q_data["options"]
        ))
    
    consistency_surveys.append(Survey(
        id=f"consistency_survey_{survey_idx}",
        name=f"Consistency Survey {survey_idx + 1}",
        questions=questions
    ))

print(f"Created {len(consistency_surveys)} surveys with {len(consistency_surveys[0].questions)} questions each")

# Estimate cost by extrapolating from the first survey (assumes all surveys cost about the same)
total_questions = sum(len(s.questions) for s in consistency_surveys)
single_estimate = estimate_survey_cost(persona, consistency_surveys[0], num_agents=1, model=MODEL)
estimated_total = single_estimate.cost_per_agent * len(consistency_surveys)
print(f"\nEstimated cost: ${estimated_total:.4f} for all {total_questions} questions")

Created 10 surveys with 10 questions each

Estimated cost: $0.5150 for all 100 questions


In [18]:
# Run all consistency surveys concurrently for speed
import time

print("Running consistency test (100 questions across 10 surveys in parallel)...")
print("=" * 60)

start_time = time.perf_counter()

# asyncio.gather schedules all 10 surveys at once and returns results in input order
survey_responses = await asyncio.gather(*[
    run_survey(persona, survey, model=MODEL)
    for survey in consistency_surveys
])

elapsed = time.perf_counter() - start_time

# Collect results
consistency_results = {}  # base_question_id -> list of responses
total_cost = sum(r.total_cost for r in survey_responses)

for response in survey_responses:
    for r in response.responses:
        # Strip the trailing _v<index> suffix to recover the base question id
        base_id = r.question_id.rsplit("_v", 1)[0]
        consistency_results.setdefault(base_id, []).append(r.response)

print(f"Completed in {elapsed:.1f}s (10 surveys × 10 questions each)")
print(f"Total cost: ${total_cost:.4f}")

Running consistency test (100 questions across 10 surveys in parallel)...
Completed in 5.9s (10 surveys × 10 questions each)
Total cost: $0.3710
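
Each question id carries a `_v<index>` suffix, so responses to the same base question can be regrouped by splitting on the last `_v`. A small sketch on made-up ids and answers (the names `toy_responses` and `modal_agreement` are illustrative and not part of centuria). It also shows the consistency score used in the analysis that follows: the share of responses matching the most common answer.

```python
from collections import Counter, defaultdict

# Made-up (question_id, response) pairs for illustration only
toy_responses = [
    ("core_values_v0", "Honesty"),
    ("core_values_v1", "Honesty"),
    ("core_values_v2", "Curiosity"),
    ("core_values_v3", "Honesty"),
    ("stress_v0", "Exercise"),
    ("stress_v1", "Exercise"),
]

grouped = defaultdict(list)
for question_id, response in toy_responses:
    # rsplit with maxsplit=1 cuts only at the LAST "_v", so a base id that
    # itself contains "_v" (like "core_values") stays intact
    base_id = question_id.rsplit("_v", 1)[0]
    grouped[base_id].append(response)


def modal_agreement(responses: list) -> float:
    """Fraction of responses equal to the most common response."""
    _, top_count = Counter(responses).most_common(1)[0]
    return top_count / len(responses)


scores = {base_id: modal_agreement(r) for base_id, r in grouped.items()}
print(scores)
```

If every response is one of k answer options, this score cannot fall below 1/k, so low values are best read against that floor rather than against zero.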


In [19]:
# Calculate consistency metrics
from collections import Counter

print("RESPONSE CONSISTENCY ANALYSIS")
print("=" * 80)

question_consistency = {}
detailed_results = []

for base_id, responses in consistency_results.items():
    # Count frequency of each response
    counter = Counter(responses)
    most_common_response, most_common_count = counter.most_common(1)[0]
    
    # Consistency = fraction of responses that match the most common (modal) response.
    # On ties, Counter.most_common returns the answer encountered first.
    consistency = most_common_count / len(responses)
    question_consistency[base_id] = consistency
    
    # Get the first phrasing as the "base question" label
    base_question = consistency_questions[base_id]["phrasings"][0]
    
    detailed_results.append({
        "base_id": base_id,
        "question": base_question,
        "consistency": consistency,
        "most_common": most_common_response,
        "distribution": dict(counter),
        "total": len(responses)
    })

# Sort by consistency (lowest first to highlight problem areas)
detailed_results.sort(key=lambda x: x["consistency"])

# Display results
print(f"\n{'Question':<45} {'Consistency':>12} {'Most Common Response':<25}")
print("-" * 80)

for r in detailed_results:
    q_short = (r["question"][:43] + "..") if len(r["question"]) > 45 else r["question"]
    mc_short = (r["most_common"][:23] + "..") if len(r["most_common"]) > 25 else r["most_common"]
    pct = f"{r['consistency']:.0%}"
    print(f"{q_short:<45} {pct:>12} {mc_short:<25}")

# Overall consistency
overall_consistency = sum(question_consistency.values()) / len(question_consistency)

print("\n" + "=" * 80)
print("CONSISTENCY SUMMARY")
print("=" * 80)
print(f"Overall consistency:     {overall_consistency:.1%}")
print(f"Most consistent:         {max(question_consistency.values()):.0%} ({max(question_consistency, key=question_consistency.get)})")
print(f"Least consistent:        {min(question_consistency.values()):.0%} ({min(question_consistency, key=question_consistency.get)})")

# Interpretation
print("\n" + "=" * 80)
print("INTERPRETATION")
print("=" * 80)
if overall_consistency >= 0.9:
    print("Excellent consistency - the persona gives highly stable responses.")
elif overall_consistency >= 0.7:
    print("Good consistency - responses are reasonably stable with some variation.")
elif overall_consistency >= 0.5:
    print("Moderate consistency - notable variation in responses to rephrased questions.")
else:
    print("Low consistency - responses vary significantly based on question wording.")

RESPONSE CONSISTENCY ANALYSIS

Question                                       Consistency Most Common Response     
--------------------------------------------------------------------------------
How do you cope with stress?                           40% Exercise or physical ac..
How would you describe your communication s..          50% Direct and to the point  
How do you prefer to spend your weekends?              70% Outdoor activities       
How do you typically handle conflict?                  70% Focus on finding compro..
What role do you prefer in a team?                     80% Collaborate as an equal..
How do you make important decisions?                   90% Careful analysis of all..
How do you respond to change?                          90% Adapt quickly and move on
What motivates you most in your career?               100% Making a positive impact 
How do you prefer to learn new things?                100% Hands-on experimentation 
How do you measure success?           