# 02 - Synthetic Persona Generation

This notebook develops the workflow for creating synthetic personas:
1. Extract context from personal data files using LLM analysis
2. Generate synthetic personas with realistic data files
3. Document LLM biases and workarounds

In [1]:
import sys
sys.path.insert(0, '../src')

from pathlib import Path
import json

from centuria.models import Persona, Question, Survey
from centuria.persona import (
    create_persona,
    SyntheticIdentity,
    generate_synthetic_files,
    list_available_file_types,
)
from centuria.survey import ask_question, run_survey
from centuria.data import load_files, process_personal_folder
from centuria.llm import complete

---
## Part 1: Context Extraction from Real Data

Extract a structured profile from personal data files, then synthesize a context statement.

The prompts include anti-sycophancy rules that ban flattering language ("accomplished", "key player", etc.). See `centuria/data/extractors.py`.
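
One way to spot-check that these rules are holding is to scan a generated context statement for flattering terms. The sketch below is an illustration only, not the check used in `centuria/data/extractors.py`: the helper `find_flattery` is hypothetical, and only the first two entries of the term list come from the examples quoted above.

```python
import re

# Hypothetical term list: only "accomplished" and "key player" come from the
# notebook text; the remaining entries are illustrative additions.
FLATTERING_TERMS = ["accomplished", "key player", "visionary", "world-class"]


def find_flattery(text: str, terms: list[str] = FLATTERING_TERMS) -> list[str]:
    """Return the terms that occur in text as whole words or phrases (case-insensitive)."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


sample = "Sean is an accomplished engineer and a Key Player in the team."
print(find_flattery(sample))
```

In the notebook, calling `find_flattery(context_statement)` after the extraction cell would return an empty list if none of the listed terms slipped through.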

In [2]:
profile, context_statement = await process_personal_folder('../data/personal')

print(f"Name: {profile.name}")
print(f"Role: {profile.current_role} ({profile.industry})")
print(f"Political lean: {profile.political_lean}")
print(f"\nContext statement ({len(context_statement)} chars):")
print(context_statement)

Name: Sean Greaves
Role: Applied AI Engineer (Technology/AI Research)
Political lean: heterodox

Context statement (2760 chars):
Sean Greaves is an Applied AI Engineer based in the United Kingdom, currently working at The Autonomy Institute. He has approximately five years of experience in the technology and AI research industry. Sean holds a master's degree in Design Informatics. His key skills include AI, machine learning, data analysis, research, software development, MLOps, and technical strategy.

Sean's professional background is varied, having worked in roles at both start-ups like Seldon and larger organizations such as IBM. His work has typically intersected technology with societal planning, as demonstrated by his previous research fellowships and projects that address topics like economic planning and corporate ownership networks. In his current role, Sean is involved in developing data-driven tools for sustainable economic planning and has been active in strategic explorati

---
## Part 2: LLM Bias in Identity Generation

When asked to generate random people, LLMs default to educated, progressive, urban professionals.

In [3]:
# Naive prompt without debiasing
NAIVE_PROMPT = """Generate a realistic synthetic identity. Return JSON:
{"name": "Full Name", "age": "integer", "gender": "string", "location": "City, Country",
 "occupation": "Job", "industry": "Industry", "education": "Education level",
 "political_lean": "Political orientation", "personality_sketch": "2-3 sentences"}
Return ONLY the JSON."""

print("Generating 5 identities with NAIVE prompt...\n")

for i in range(5):
    result = await complete(NAIVE_PROMPT)
    text = result.content.strip()
    if text.startswith("```"):
        # Strip a Markdown code fence and its optional "json" language tag
        text = text.split("```")[1].replace("json", "", 1).strip()
    p = json.loads(text)
    print(f"{i+1}. {p['name']}, {p['age']}, {p['gender']}")
    print(f"   {p['occupation']} | {p['education']}")
    print(f"   Political: {p['political_lean']}\n")

Generating 5 identities with NAIVE prompt...

1. Sophia Turner, 34, Female
   Project Manager | Master's Degree in Business Administration
   Political: Liberal

2. Aiden Thompson, 34, Male
   Software Engineer | Bachelor's Degree in Computer Science
   Political: Moderate

3. Samuel Thompson, 34, male
   Software Engineer | Bachelor's Degree in Computer Science
   Political: Centrist

4. Jordan Michaels, 34, Non-binary
   Software Developer | Bachelor's Degree in Computer Science
   Political: Liberal

5. Amelia Stevens, 34, female
   Product Manager | Bachelor's Degree in Computer Science
   Political: Liberal



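**Typical results:** Across runs the identities collapse onto one narrow profile (for example all female, all 35, all Master's degrees, all progressive/environmental). In the run above, every identity is 34, four of the five hold a Bachelor's degree in Computer Science, and none is conservative. This is a sample of "people who write content that trains LLMs", not a representative one.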
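
A side note on parsing: the cell above strips a Markdown code fence inline before calling `json.loads`. If the fence has no language tag, `replace("json", "", 1)` would instead remove the first "json" that appears inside the payload. A slightly more defensive version could look like the sketch below. `parse_json_reply` is a hypothetical helper, not part of `centuria`, and it assumes the reply is either bare JSON or a single fenced block.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker, built up to keep this example readable


def parse_json_reply(text: str):
    """Parse JSON from an LLM reply, tolerating one optional Markdown code fence."""
    text = text.strip()
    if text.startswith(FENCE):
        # Drop the whole opening fence line, which may carry a tag such as "json".
        first_newline = text.find("\n")
        body = text[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence and anything after it.
        closing = body.rfind(FENCE)
        if closing != -1:
            body = body[:closing]
        text = body.strip()
    return json.loads(text)


tagged = FENCE + "json\n" + '{"name": "json smith"}' + "\n" + FENCE
untagged = FENCE + "\n" + '{"name": "json smith"}' + "\n" + FENCE
print(parse_json_reply(tagged))
print(parse_json_reply(untagged))
```

Because only the fence lines are removed, text inside the payload is left untouched in both the tagged and untagged cases.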

### Batch Generation as Alternative

Asking for N profiles in one request pushes the LLM to consider the distribution across the whole set, rather than returning the same most-likely person on every call:

In [4]:
BATCH_PROMPT = """Generate 20 realistic identities representing diverse adults.

Include diversity in: gender (50/50), age (18-80), education (only 35% have degrees),
occupation (retail, trades, healthcare, transport - not just professionals),
location (cities, suburbs, small towns), politics (conservatives, liberals, apolitical).

Think of 20 random people: cashier, plumber, retired teacher, nurse, truck driver, etc.

Return JSON array with: name, age, gender, location, occupation, industry, education,
political_lean, personality_sketch. Return ONLY the JSON array."""

print("Generating 20 profiles in ONE batch...\n")

result = await complete(BATCH_PROMPT)
text = result.content.strip()
if text.startswith("```"):
    # Strip a Markdown code fence and its optional "json" language tag
    text = text.split("```")[1].replace("json", "", 1).strip()
batch = json.loads(text)

for i, p in enumerate(batch, 1):
    print(f"{i:2}. {p['name']}, {p['age']}, {p['gender']} - {p['occupation']}")
    print(f"    {p['education']} | {p['political_lean']}")

Generating 20 profiles in ONE batch...

 1. Olivia Smith, 29, Female - Cashier
    High School Diploma | Apolitical
 2. Michael Johnson, 45, Male - Plumber
    Vocational Training | Conservative
 3. Emma Brown, 68, Female - Retired Teacher
    Bachelor's Degree | Liberal
 4. Noah Davis, 33, Male - Nurse
    Associate's Degree in Nursing | Liberal
 5. Sophia Martinez, 58, Female - Truck Driver
    High School Diploma | Conservative
 6. Liam Lee, 24, Male - Retail Associate
    Some College | Apolitical
 7. Isabella Rodriguez, 39, Female - Electrician
    Vocational Training | Conservative
 8. James Kim, 72, Male - Retired Bus Driver
    High School Diploma | Apolitical
 9. Ava Patel, 50, Female - Nurse
    Bachelor's Degree | Liberal
10. Benjamin Wilson, 27, Male - Carpenter
    Vocational Training | Liberal
11. Charlotte Nguyen, 62, Female - Retired Nurse
    Associate's Degree in Nursing | Conservative
12. William Garcia, 31, Male - Taxi Driver
    High School Diploma | Apolitical
13.

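**Inspect manually:** Is there genuine diversity in age, education, occupation, and politics? Note that the first five occupations above repeat the prompt's examples in order (cashier, plumber, retired teacher, nurse, truck driver), so also check whether the model is echoing the prompt rather than varying on its own. If batch works, use it. Otherwise, statistical sampling from census data is needed.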
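
A quick tally can support the manual inspection. The sketch below counts values per field and reports the age range; `summarize` is a hypothetical helper, and the sample rows are the first four profiles transcribed from the output above. Values are lower-cased before counting so that "Male" and "male" are treated as the same answer.

```python
from collections import Counter


def summarize(profiles, fields=("gender", "education", "political_lean")):
    """Count normalised values per categorical field, plus the (min, max) age."""
    summary = {
        field: Counter(str(p.get(field, "missing")).strip().lower() for p in profiles)
        for field in fields
    }
    # Keep only ages that are plain integers (LLM output may return strings).
    ages = [int(p["age"]) for p in profiles if str(p.get("age", "")).isdigit()]
    summary["age_range"] = (min(ages), max(ages)) if ages else None
    return summary


# First four rows transcribed from the batch output above.
sample = [
    {"name": "Olivia Smith", "age": 29, "gender": "Female",
     "education": "High School Diploma", "political_lean": "Apolitical"},
    {"name": "Michael Johnson", "age": 45, "gender": "Male",
     "education": "Vocational Training", "political_lean": "Conservative"},
    {"name": "Emma Brown", "age": 68, "gender": "Female",
     "education": "Bachelor's Degree", "political_lean": "Liberal"},
    {"name": "Noah Davis", "age": 33, "gender": "Male",
     "education": "Associate's Degree in Nursing", "political_lean": "Liberal"},
]

for field, value in summarize(sample).items():
    print(field, value)
```

In the notebook, `summarize(batch)` would give the same tallies for the full batch, which can then be compared with the targets stated in the prompt (for example the 50/50 gender split).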

---
## Part 3: File-First Persona Generation

Recommended workflow:
1. **Specify demographics** (from batch generation or census sampling)
2. **Generate data files** (CV, reading list, subscriptions)
3. **Extract persona** using the same pipeline as real data
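
For step 1, census-style sampling could be sketched as drawing each demographic field from a weighted distribution. Everything below is an assumption for illustration: `sample_demographics` is a hypothetical helper, the gender and degree weights simply reuse the targets from the batch prompt, and the age bands are placeholders rather than real census figures. Sampling each field independently is also a simplification, since real populations have correlated attributes.

```python
import random

# PLACEHOLDER weights for illustration only, NOT real census figures.
MARGINALS = {
    "gender": {"female": 0.5, "male": 0.5},
    "age_band": {"18-29": 0.20, "30-44": 0.25, "45-64": 0.33, "65-80": 0.22},
    "education": {"no degree": 0.65, "degree": 0.35},
}


def sample_demographics(n, marginals=MARGINALS, seed=None):
    """Draw n demographic rows, sampling every field independently by weight."""
    rng = random.Random(seed)  # local generator so a seed makes runs repeatable
    rows = []
    for _ in range(n):
        row = {}
        for field, dist in marginals.items():
            values = list(dist.keys())
            weights = list(dist.values())
            row[field] = rng.choices(values, weights=weights, k=1)[0]
        rows.append(row)
    return rows


for row in sample_demographics(3, seed=0):
    print(row)
```

Each sampled row could then seed the demographic fields of a `SyntheticIdentity`, leaving the name and personality sketch to the LLM.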

In [5]:
print("Available file types:", list(list_available_file_types().keys()))

# Manually specify an identity - LLMs would never generate this unprompted
identity = SyntheticIdentity(
    name="Dale Hutchins",
    age=58,
    gender="male",
    location="Tulsa, Oklahoma, USA",
    occupation="Long-haul truck driver",
    industry="Transportation",
    education="High school diploma",
    political_lean="Conservative Republican, voted Trump, skeptical of mainstream media",
    personality_sketch="Spends weeks on the road listening to talk radio. Devoted grandfather who collects model trains and grills competition brisket. Attends Baptist church when home."
)

output_dir = Path('../data/synthetic/dale_hutchins')
print(f"\nGenerating files for: {identity.name}, {identity.age}")
print(f"  {identity.occupation} | {identity.education}")
print(f"  Political: {identity.political_lean}")

Available file types: ['cv', 'linkedin_summary', 'work_calendar', 'email_signature', 'twitter_bio', 'instagram_bio', 'facebook_about', 'reddit_profile', 'tiktok_profile', 'nextdoor_activity', 'reading_list', 'subscriptions', 'spotify_favorites', 'netflix_history', 'youtube_subscriptions', 'podcast_subscriptions', 'amazon_wishlist', 'shopping_history', 'grocery_list', 'loyalty_programs', 'google_reviews', 'product_reviews', 'yelp_reviews', 'notes_snippet', 'bookmarks', 'text_messages', 'voicemail_greeting', 'fitness_tracker', 'health_goals', 'travel_history', 'location_history', 'bank_categories', 'charity_donations', 'home_description', 'pet_profile', 'vehicle_info', 'recipe_collection', 'dating_profile', 'forum_posts', 'event_attendance']

Generating files for: Dale Hutchins, 58
  Long-haul truck driver | High school diploma
  Political: Conservative Republican, voted Trump, skeptical of mainstream media


In [6]:
saved_files = await generate_synthetic_files(
    identity=identity,
    output_dir=output_dir,
    file_types=["cv", "subscriptions", "reading_list"],
)

print("Generated files:")
for file_type, path in saved_files.items():
    content = path.read_text()
    print(f"\n--- {path.name} ---")
    print(content[:800] + ("..." if len(content) > 800 else ""))

Generated files:

--- cv.txt ---
```
Dale Hutchins
Tulsa, Oklahoma, USA
Email: dale.hutchins58@gmail.com

Professional Summary
---------------------
Dedicated and experienced long-haul truck driver with over 30 years in the transportation industry. Known for a strong work ethic, safety-first approach, and excellent time management skills. Consistently maintains a clean driving record and ensures timely deliveries across the continental United States.

Work Experience
----------------
Long-Haul Truck Driver
Swift Transportation, Tulsa, OK
January 2005 - Present
- Safely operate and manage routes for nationwide deliveries, averaging over 120,000 miles per year.
- Maintain accurate logbooks and adhere to DOT regulations.
- Foster strong relationships with dispatchers and ensure clear communication to optimize delivery schedu...

--- subscriptions.txt ---
1. Crime Junkie (podcast)
2. The Joe Rogan Experience (podcast)
3. The Dan Le Batard Show with Stugotz (podcast)
4. Grillin’ Guys Radio 

**Note:** File generation prompts include anti-overfitting rules - most content should be non-political (entertainment, hobbies, practical stuff). A conservative doesn't only read conservative books.

In [7]:
import uuid

synth_profile, synth_context = await process_personal_folder(str(output_dir))

persona = create_persona(
    name=identity.name,
    context=synth_context,
    persona_id=str(uuid.uuid4()),
)

print(f"Created persona: {persona.name}")
print(f"Extracted political lean: {synth_profile.political_lean}")
print(f"\nContext statement:")
print(synth_context)

Created persona: Dale Hutchins
Extracted political lean: center

Context statement:
Dale Hutchins is a long-haul truck driver based in Tulsa, Oklahoma, within the age range of 55 to 60. He has been working in the transportation industry for 35 years, demonstrating a deep familiarity with the profession's practical and regulatory aspects. Currently employed by Swift Transportation, Dale manages nationwide deliveries and covers over 120,000 miles annually. His responsibilities include maintaining accurate logbooks, adhering to Department of Transportation regulations, and fostering communication with dispatchers. His earlier roles include positions at J.B. Hunt Transport Services and Local Freight Services, where he built a reputation for reliability and assisted in training new drivers, highlighting his capacity for mentorship in safe driving practices.

Dale completed his high school education at Tulsa High School, earning his diploma in 1983. He does not hold any formal higher educati

In [8]:
test_questions = [
    Question(id="q1", text="What's your view on the role of government in the economy?",
             question_type="single_select",
             options=["Minimal intervention", "Active regulation", "Mixed approach", "Fundamental restructuring"]),
    Question(id="q2", text="Which party do you typically vote for?",
             question_type="single_select",
             options=["Republican", "Democrat", "Independent", "Don't vote"]),
    Question(id="q3", text="What motivates you most in your work?",
             question_type="open_ended"),
]

print(f"Testing {persona.name}:\n")
for q in test_questions:
    response = await ask_question(persona, q)
    print(f"Q: {q.text}")
    print(f"A: {response.response}\n")

Testing Dale Hutchins:

Q: What's your view on the role of government in the economy?
A: Mixed approach

Q: Which party do you typically vote for?
A: Independent

Q: What motivates you most in your work?
A: Well, what really keeps me goin' is the open road itself. I've got a thing for the freedom and solitude out there. I also take pride in knowin' I'm part of what keeps the country movin'. There's a real satisfaction in gettin' goods where they need to be, on time and in one piece. Plus, I'm always learnin', and that's important to me. Every trip has its challenges, keeps it interestin'.



This highlights the value of basing personas on real people, since they are likely to be much more representative and useful. LLMs are much better at inferring persona attributes from existing data than at generating them: Dale was specified as a conservative Republican, yet the extracted political lean is "center" and the persona answers "Independent". This strengthens the case for any organisation that already holds a database of real personas. As this notebook shows, it is also important to be able to audit the opinions a persona expresses to check that they are realistic. Personas are essentially performing, and we should approach them as a **film/theatre critic** would a performer.
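
The audit described above can be sketched as a comparison between the answers one would expect from the specified identity and the answers the persona actually gives. `audit_responses` is a hypothetical helper, and the expectation that Dale answers "Republican" to `q2` is an assumption derived from his specified political lean. The observed answer is the one printed in the survey output above.

```python
def audit_responses(expected: dict[str, set[str]], observed: dict[str, str]) -> list[str]:
    """Describe every question whose observed answer is outside the expected set."""
    mismatches = []
    for qid, acceptable in expected.items():
        answer = observed.get(qid)
        if answer not in acceptable:
            mismatches.append(
                f"{qid}: got {answer!r}, expected one of {sorted(acceptable)}"
            )
    return mismatches


# Assumed expectation, based on the specified "Conservative Republican" lean.
expected = {"q2": {"Republican"}}
# Answer taken from the survey output above.
observed = {"q2": "Independent"}

for line in audit_responses(expected, observed):
    print(line)
```

An empty result would mean the persona stayed in character for the audited questions. A non-empty one, as here, flags where the file-first pipeline lost part of the intended identity.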