# 02 - Synthetic Persona Generation

This notebook develops the workflow for creating synthetic personas:
1. Extract context from personal data files using LLM analysis
2. Generate synthetic personas with realistic data files
3. Document LLM biases and workarounds

In [2]:
import sys
sys.path.insert(0, '../src')

import asyncio
from pathlib import Path
import json

from centuria.models import Persona, Question, Survey
from centuria.persona import create_persona, SyntheticIdentity, generate_synthetic_files, list_available_file_types
from centuria.survey import ask_question, run_survey
from centuria.data import load_files, process_personal_folder
from centuria.llm import complete

---
## Part 1: Context Extraction from Real Data

Extract a structured profile from personal data files, then synthesize a context statement.

The prompts include anti-sycophancy rules that rule out flattering language such as "accomplished" or "key player". See `centuria/data/extractors.py`.

In [3]:
profile, context_statement = await process_personal_folder('../data/personal')

print(f"Name: {profile.name}")
print(f"Role: {profile.current_role} ({profile.industry})")
print(f"Political lean: {profile.political_lean}")
print(f"\nContext statement ({len(context_statement)} chars):")
print(context_statement)

Name: Sean Greaves
Role: Applied AI Engineer (Technology/AI Research)
Political lean: left

Context statement (2632 chars):
Sean Greaves is currently employed as an Applied AI Engineer at The Autonomy Institute, a think tank focused on data-driven tools for sustainable economic planning. He has spent the past five years in the technology and AI research industry, building a career that includes a variety of roles in machine learning and software development. Greaves holds a Master's degree in Design Informatics and has secured a range of positions, including researcher, machine learning solutions engineer at Seldon and IBM, and an intern at IXDS.

Throughout his career, Greaves has co-developed technical strategies, secured funding for projects, and authored publications that have been cited by governmental bodies and various media outlets. His key skills include applied AI, machine learning, software development, data analysis, technical writing, mentorship, and public speaking. He ha
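
One way to spot-check whether the anti-sycophancy rules held is to scan the context statement for flattering terms. The sketch below is an assumption-laden illustration: `find_flattery` and `FLATTERY_TERMS` are hypothetical names, and only the first two terms come from this notebook; the rest are placeholders, not the actual rules in `extractors.py`.

```python
import re

# Hypothetical watch list: the first two terms are the examples quoted above;
# the others are illustrative additions, not the rules in extractors.py.
FLATTERY_TERMS = ["accomplished", "key player", "renowned", "visionary", "exceptional"]

def find_flattery(text: str, terms=FLATTERY_TERMS) -> dict[str, int]:
    """Return {term: count} for each watch-list term found in text (case-insensitive)."""
    hits = {}
    for term in terms:
        # \b keeps a term from matching inside a longer word
        count = len(re.findall(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE))
        if count:
            hits[term] = count
    return hits

sample = "An accomplished engineer and key player. Accomplished speaker."
print(find_flattery(sample))  # {'accomplished': 2, 'key player': 1}
```

Running `find_flattery(context_statement)` on the extracted statement gives a quick signal; an empty dict does not prove the text is neutral, only that none of the listed terms appear.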

---
## Part 2: LLM Bias in Identity Generation

When asked to generate a random person with no further constraints, LLMs tend to default to educated, progressive, urban professionals, and they often reuse the same names.

In [4]:
# Naive prompt without debiasing
NAIVE_PROMPT = """Generate a realistic synthetic identity. Return JSON:
{"name": "Full Name", "age": "integer", "gender": "string", "location": "City, Country",
 "occupation": "Job", "industry": "Industry", "education": "Education level",
 "political_lean": "Political orientation", "personality_sketch": "2-3 sentences"}
Return ONLY the JSON."""

print("Generating 5 identities with NAIVE prompt (in parallel)...\n")

# Run all 5 in parallel
results = await asyncio.gather(*[complete(NAIVE_PROMPT) for _ in range(5)])

for i, result in enumerate(results, 1):
    text = result.content.strip()
    if text.startswith("```"):
        # Strip the Markdown fence and an optional "json" language tag (Python 3.9+)
        text = text.split("```")[1].removeprefix("json").strip()
    p = json.loads(text)
    print(f"{i}. {p['name']}, {p['age']}, {p['gender']}")
    print(f"   {p['occupation']} | {p['education']}")
    print(f"   Political: {p['political_lean']}\n")

Generating 5 identities with NAIVE prompt (in parallel)...

1. Emily Carter, 29, Female
   Marketing Specialist | Bachelor's Degree in Communication
   Political: Liberal

2. Emily Johnson, 34, Female
   Graphic Designer | Bachelor's Degree in Fine Arts
   Political: Liberal

3. Emily Carter, 32, Female
   Graphic Designer | Bachelor's Degree in Fine Arts
   Political: Liberal

4. Samantha Lewis, 34, Female
   Software Developer | Bachelor's Degree in Computer Science
   Political: Moderate

5. Samantha Greene, 29, Female
   Digital Marketing Specialist | Bachelor's Degree in Communications
   Political: Liberal
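
The fence-stripping step recurs in several cells, so it may be worth pulling into a helper. The following is a minimal sketch, not part of `centuria`: `parse_json_reply` is a hypothetical name, and it assumes the reply contains at most one fenced block.

```python
import json
import re

# Build the fence marker programmatically so this snippet contains no literal fence
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)

def parse_json_reply(text: str):
    """Parse JSON from an LLM reply, with or without a Markdown code fence."""
    match = FENCED_BLOCK.search(text)
    if match:
        text = match.group(1)  # keep only the body of the first fenced block
    return json.loads(text.strip())  # raises json.JSONDecodeError on bad input

fenced = FENCE + 'json\n{"name": "Test Person", "age": 40}\n' + FENCE
print(parse_json_reply(fenced))
print(parse_json_reply('{"name": "Test Person", "age": 40}'))
```

Because the helper searches for a fence rather than requiring the reply to start with one, it also tolerates a short preamble before the JSON.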



### Model Comparison: gpt-4o-mini vs gpt-4o

Does the larger model show less bias? Let's compare 10 identities from each model:

In [5]:
# Compare gpt-4o-mini vs gpt-4o on the same naive prompt (10 each, in parallel)
from collections import Counter

async def generate_identities(model: str, n: int = 10):
    """Generate n identities with a model using the naive prompt."""
    results = await asyncio.gather(*[
        complete(NAIVE_PROMPT, model=model) for _ in range(n)
    ])
    
    identities = []
    for r in results:
        text = r.content.strip()
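        if text.startswith("```"):
            # Strip the Markdown fence and an optional "json" language tag (Python 3.9+)
            text = text.split("```")[1].removeprefix("json").strip()
        try:
            identities.append(json.loads(text))
        except json.JSONDecodeError:
            pass  # Skip replies that are not valid JSON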
    return identities

print("Generating 10 identities with each model (in parallel)...\n")

mini_identities, full_identities = await asyncio.gather(
    generate_identities("gpt-4o-mini", 10),
    generate_identities("gpt-4o", 10),
)

def analyze_bias(identities, label):
    """Analyze bias patterns in generated identities."""
    print(f"\n{'='*60}")
    print(f"{label} ({len(identities)} identities)")
    print('='*60)
    
    # Age distribution
    ages = [p.get('age', 0) for p in identities if isinstance(p.get('age'), int) or str(p.get('age', '')).isdigit()]
    ages = [int(a) for a in ages]
    if ages:
        print(f"\nAge: min={min(ages)}, max={max(ages)}, avg={sum(ages)/len(ages):.0f}")
    
    # Gender distribution
    genders = Counter(str(p.get('gender', '')).lower() for p in identities)
    print(f"Gender: {dict(genders)}")
    
    # Education (degree vs non-degree)
    edu_keywords_degree = ['bachelor', 'master', 'phd', 'degree', 'university', 'college']
    edu_keywords_no_degree = ['high school', 'vocational', 'trade', 'diploma', 'ged', 'some college']
    
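    # Check the no-degree keywords first so "some college" is not also counted as a degree
    educations = [str(p.get('education', '')).lower() for p in identities]
    no_degree = sum(1 for e in educations if any(k in e for k in edu_keywords_no_degree))
    has_degree = sum(1 for e in educations
                     if not any(k in e for k in edu_keywords_no_degree)
                     and any(k in e for k in edu_keywords_degree))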
    print(f"Education: {has_degree} with degree, {no_degree} without (reality: ~35% have degrees)")
    
    # Political distribution
    politics = Counter(str(p.get('political_lean', '')).lower()[:20] for p in identities)
    print(f"Politics: {dict(politics)}")
    
    # Occupation types
    tech_keywords = ['software', 'engineer', 'developer', 'data', 'product', 'design', 'tech', 'ai', 'ml']
    tech_jobs = sum(1 for p in identities 
                    if any(k in str(p.get('occupation', '')).lower() for k in tech_keywords))
    print(f"Tech/professional jobs: {tech_jobs}/{len(identities)}")
    
    # Show sample
    print(f"\nSample identities:")
    for i, p in enumerate(identities[:5], 1):
        print(f"  {i}. {p.get('name', '?')}, {p.get('age', '?')}, {p.get('gender', '?')}")
        print(f"     {p.get('occupation', '?')} | {str(p.get('education', '?'))[:40]}")
        print(f"     Political: {p.get('political_lean', '?')}")

analyze_bias(mini_identities, "GPT-4o-mini")
analyze_bias(full_identities, "GPT-4o")

Generating 10 identities with each model (in parallel)...


GPT-4o-mini (10 identities)

Age: min=28, max=34, avg=30
Gender: {'female': 9, 'non-binary': 1}
Education: 10 with degree, 0 without (reality: ~35% have degrees)
Politics: {'liberal': 6, 'moderate': 2, 'progressive': 2}
Tech/professional jobs: 7/10

Sample identities:
  1. Emma Johnson, 32, Female
     Software Engineer | Master's Degree in Computer Science
     Political: Liberal
  2. Emma Johnson, 34, female
     Data Analyst | Master's Degree in Data Science
     Political: Moderate
  3. Emily Carter, 29, Female
     Graphic Designer | Bachelor's Degree in Fine Arts
     Political: Liberal
  4. Jordan Miller, 28, Non-binary
     Software Developer | Bachelor's Degree in Computer Science
     Political: Progressive
  5. Emily Carter, 29, Female
     Marketing Specialist | Bachelor's Degree in Communications
     Political: Moderate

GPT-4o (10 identities)

Age: min=32, max=34, avg=34
Gender: {'female': 9, 'non-binary': 1}
Ed
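
Beyond keyword counts, the lack of variety can be put into numbers. The sketch below uses two generic measures, the share of unique values and Shannon entropy; `unique_ratio` and `shannon_entropy` are hypothetical helper names, and the inputs are the five GPT-4o-mini sample identities printed above.

```python
import math
from collections import Counter

def shannon_entropy(values) -> float:
    """Shannon entropy in bits; 0.0 means every value is identical."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def unique_ratio(values) -> float:
    """Share of distinct values; 1.0 means no repeats. Expects a non-empty input."""
    values = list(values)
    return len(set(values)) / len(values)

# The five GPT-4o-mini sample identities shown in the output above
names = ["Emma Johnson", "Emma Johnson", "Emily Carter", "Jordan Miller", "Emily Carter"]
genders = ["female", "female", "female", "non-binary", "female"]

print(f"Unique names: {unique_ratio(names):.2f}")          # 0.60
print(f"Gender entropy: {shannon_entropy(genders):.2f} bits")  # 0.72
```

For reference, a perfectly even two-way split would score 1.00 bit, so lower values indicate a more lopsided distribution. With samples this small the numbers are only indicative.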

### Batch Generation as Alternative

Asking for N profiles in a single request lets the model see the whole set at once, which nudges it to vary attributes across profiles:

In [6]:
BATCH_PROMPT = """Generate 20 realistic identities representing diverse adults.

Include diversity in: gender (50/50), age (18-80), education (only 35% have degrees),
occupation (retail, trades, healthcare, transport - not just professionals),
location (cities, suburbs, small towns), politics (conservatives, liberals, apolitical).

Think of 20 random people: cashier, plumber, retired teacher, nurse, truck driver, etc.

Return JSON array with: name, age, gender, location, occupation, industry, education,
political_lean, personality_sketch. Return ONLY the JSON array."""

print("Generating 20 profiles in ONE batch...\n")

result = await complete(BATCH_PROMPT)
text = result.content.strip()
if text.startswith("```"):
    # Strip the Markdown fence and an optional "json" language tag (Python 3.9+)
    text = text.split("```")[1].removeprefix("json").strip()
batch = json.loads(text)

for i, p in enumerate(batch, 1):
    print(f"{i:2}. {p['name']}, {p['age']}, {p['gender']} - {p['occupation']}")
    print(f"    {p['education']} | {p['political_lean']}")

Generating 20 profiles in ONE batch...

 1. Liam Johnson, 22, male - cashier
    high school diploma | liberal
 2. Emma Smith, 34, female - nurse
    nursing degree | liberal
 3. Harold Davis, 45, male - plumber
    apprenticeship | conservative
 4. Sophia Lee, 29, female - graphic designer
    bachelor's degree | liberal
 5. Nathan Brown, 63, male - retired teacher
    master's degree | apolitical
 6. Ava Wilson, 19, female - barista
    some college | liberal
 7. James Taylor, 39, male - truck driver
    high school diploma | conservative
 8. Isabella Martinez, 50, female - local diner owner
    high school diploma | liberal
 9. Ethan Garcia, 25, male - construction worker
    some college | conservative
10. Charlotte Clark, 75, female - retired librarian
    master's degree | liberal
11. Oliver Johnson, 30, male - software tester
    bachelor's degree | apolitical
12. Mia Rodriguez, 29, female - social worker
    bachelor's degree | liberal
13. Jack Wilson, 42, male - mechanic
    h

**Inspect manually:** Is there genuine diversity in age, education, occupation, and politics? If batch works, use it. Otherwise, statistical sampling from census data is needed.
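
If batch generation falls short, the sampling fallback could look roughly like this. It is a sketch under loud assumptions: the gender split, age range, and degree share echo the targets in `BATCH_PROMPT`, the political weights are placeholders rather than census figures, and each field is drawn independently, which real populations do not satisfy.

```python
import random

# Placeholder marginals. Gender and degree shares mirror the BATCH_PROMPT targets;
# the political weights are illustrative only and should be replaced with real data.
MARGINALS = {
    "gender": (["male", "female"], [0.5, 0.5]),
    "has_degree": ([True, False], [0.35, 0.65]),
    "political_lean": (["conservative", "liberal", "apolitical"], [1, 1, 1]),
}

def sample_demographics(rng: random.Random) -> dict:
    """Draw one demographic skeleton; every field is sampled independently."""
    person = {
        field: rng.choices(values, weights=weights, k=1)[0]
        for field, (values, weights) in MARGINALS.items()
    }
    person["age"] = rng.randint(18, 80)  # uniform for simplicity; real age data is not uniform
    return person

rng = random.Random(42)  # seeded so the draw is reproducible
people = [sample_demographics(rng) for _ in range(1000)]
degree_share = sum(p["has_degree"] for p in people) / len(people)
print(f"Degree share: {degree_share:.2f}")
```

The sampled skeleton would then be handed to the LLM only to fill in the name and personality sketch, so the model never chooses the demographics itself.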

---
## Part 3: File-First Persona Generation

Recommended workflow:
1. **Specify demographics** (from batch generation or census sampling)
2. **Generate data files** (CV, reading list, subscriptions)
3. **Extract persona** using the same pipeline as real data

In [7]:
print("Available file types:", list(list_available_file_types().keys()))

# Manually specify an identity - the naive prompt above is unlikely to produce a profile like this
identity = SyntheticIdentity(
    name="Dale Hutchins",
    age=58,
    gender="male",
    location="Tulsa, Oklahoma, USA",
    occupation="Long-haul truck driver",
    industry="Transportation",
    education="High school diploma",
    political_lean="Conservative Republican, voted Trump, skeptical of mainstream media",
    personality_sketch="Spends weeks on the road listening to talk radio. Devoted grandfather who collects model trains and grills competition brisket. Attends Baptist church when home."
)

output_dir = Path('../data/synthetic/dale_hutchins')
print(f"\nGenerating files for: {identity.name}, {identity.age}")
print(f"  {identity.occupation} | {identity.education}")
print(f"  Political: {identity.political_lean}")

Available file types: ['cv', 'linkedin_summary', 'work_calendar', 'email_signature', 'twitter_bio', 'instagram_bio', 'facebook_about', 'reddit_profile', 'tiktok_profile', 'nextdoor_activity', 'reading_list', 'subscriptions', 'spotify_favorites', 'netflix_history', 'youtube_subscriptions', 'podcast_subscriptions', 'amazon_wishlist', 'shopping_history', 'grocery_list', 'loyalty_programs', 'google_reviews', 'product_reviews', 'yelp_reviews', 'notes_snippet', 'bookmarks', 'text_messages', 'voicemail_greeting', 'fitness_tracker', 'health_goals', 'travel_history', 'location_history', 'bank_categories', 'charity_donations', 'home_description', 'pet_profile', 'vehicle_info', 'recipe_collection', 'dating_profile', 'forum_posts', 'event_attendance']

Generating files for: Dale Hutchins, 58
  Long-haul truck driver | High school diploma
  Political: Conservative Republican, voted Trump, skeptical of mainstream media


In [8]:
saved_files = await generate_synthetic_files(
    identity=identity,
    output_dir=output_dir,
    file_types=["cv", "subscriptions", "reading_list"],
)

print("Generated files:")
for file_type, path in saved_files.items():
    content = path.read_text()
    print(f"\n--- {path.name} ---")
    print(content[:800] + ("..." if len(content) > 800 else ""))

Generated files:

--- cv.txt ---
**Dale Hutchins**  
Tulsa, Oklahoma, USA  
dale.hutchins@email.com  

---

**Professional Summary**  
Dedicated and experienced long-haul truck driver with over 30 years in the transportation industry. Proven track record of safely delivering goods across the country while maintaining compliance with all regulations. Committed to providing excellent customer service and managing logistics for timely deliveries.  

---

**Work Experience**  

**Long-Haul Truck Driver**  
ABC Freight Lines, Tulsa, Oklahoma  
March 2015 - Present  
- Transported various types of freight across the continental United States, averaging 3,000 miles per week.  
- Maintained a clean driving record with zero accidents or violations over seven years.  
- Conducted pre-trip and post-trip inspections to ensure vehicle...

--- subscriptions.txt ---
- "Sword and Scale" (True Crime Podcast)
- "Criminal" (True Crime Podcast)
- "The Joe Rogan Experience" (Comedy/Interview Podcast)
- "Th

**Note:** File generation prompts include anti-overfitting rules - most content should be non-political (entertainment, hobbies, practical stuff). A conservative doesn't only read conservative books.

In [9]:
import uuid

synth_profile, synth_context = await process_personal_folder(str(output_dir))

persona = create_persona(
    name=identity.name,
    context=synth_context,
    persona_id=str(uuid.uuid4()),
)

print(f"Created persona: {persona.name}")
print(f"Extracted political lean: {synth_profile.political_lean}")
print(f"\nContext statement:")
print(synth_context)

Created persona: Dale Hutchins
Extracted political lean: right

Context statement:
Dale Hutchins is a long-haul truck driver based in Tulsa, Oklahoma, with over 30 years of experience in the transportation industry. He holds a valid Class A Commercial Driver's License (CDL) and possesses a strong understanding of Department of Transportation (DOT) regulations and safety compliance. Hutchins has an extensive work history that includes his current role at ABC Freight Lines, where he has been employed since March 2015, consistently delivering freight across the continental United States while maintaining a clean driving record with zero accidents or violations. His previous positions include working at XYZ Logistics and Local Transport Services, where he also focused on safe deliveries and efficient route management.

Hutchins graduated from Tulsa High School in 1983 with a high school diploma. In terms of intellectual interests, he engages with a range of topics including true crime, coo

In [10]:
test_questions = [
    Question(id="q1", text="What's your view on the role of government in the economy?",
             question_type="single_select",
             options=["Minimal intervention", "Active regulation", "Mixed approach", "Fundamental restructuring"]),
    Question(id="q2", text="Which party do you typically vote for?",
             question_type="single_select",
             options=["Republican", "Democrat", "Independent", "Don't vote"]),
    Question(id="q3", text="What motivates you most in your work?",
             question_type="open_ended"),
]

print(f"Testing {persona.name} (3 questions in parallel):\n")

# Run all questions in parallel
responses = await asyncio.gather(*[ask_question(persona, q) for q in test_questions])

for q, response in zip(test_questions, responses):
    print(f"Q: {q.text}")
    print(f"A: {response.response}\n")

Testing Dale Hutchins (3 questions in parallel):

Q: What's your view on the role of government in the economy?
A: Mixed approach

Q: Which party do you typically vote for?
A: Republican

Q: What motivates you most in your work?
A: You know, what really drives me in this job is the satisfaction of getting the freight where it needs to go, on time and safe. I've been at it for over 30 years, and there's a good feeling that comes with knowing you're reliable. Plus, I like the freedom of the open road and the chance to see different parts of the country. Can't beat that! And if I can keep a clean driving record while doing it, that just adds to the motivation.



This highlights the value of basing personas on real people: they are likely to be more representative and more useful than generated ones. LLMs are much better at inferring persona attributes from existing data than at generating them from scratch, which favours any organisation that already holds a database of real personas. As this notebook shows, it is also important to be able to audit the opinions a persona expresses and check that they are realistic. In effect the personas are performing, and we should approach them as a **film/theatre critic** would a performer.
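
One simple audit is to ask a persona the same closed question several times and see how stable the answer is. The sketch below only covers the tallying step; `answer_stability` is a hypothetical helper, and the `runs` list is made-up input for illustration, not output from this notebook.

```python
from collections import Counter

def answer_stability(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of runs that gave it."""
    if not answers:
        raise ValueError("answers must not be empty")
    # Normalise case and whitespace so trivially different strings count together
    counts = Counter(a.strip().lower() for a in answers)
    top, n = counts.most_common(1)[0]
    return top, n / len(answers)

# Made-up answers standing in for repeated runs of one single-select question
runs = ["Republican", "Republican", "Independent", "republican "]
print(answer_stability(runs))  # ('republican', 0.75)
```

In the notebook, the answers could be collected by gathering `ask_question(persona, q)` several times and reading `.response` from each result, as in the cell above. A low share suggests the persona's position on that question is not well anchored by its context statement.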