# 02 - Synthetic Persona Generation

This notebook develops the workflow for creating synthetic personas:
1. Extract context from personal data files using LLM analysis
2. Generate synthetic personas with realistic data files
3. Document LLM biases and workarounds

In [None]:
import sys
sys.path.insert(0, '../src')

from pathlib import Path
import json

from centuria.models import Persona, Question, Survey
from centuria.persona import create_persona, SyntheticIdentity, generate_synthetic_files, list_available_file_types
from centuria.survey import ask_question, run_survey
from centuria.data import load_files, process_personal_folder
from centuria.llm import complete

---
## Part 1: Context Extraction from Real Data

Extract a structured profile from personal data files, then synthesize a context statement.

The prompts include anti-sycophancy rules to avoid flattering language ("accomplished", "key player", etc). See `centuria/data/extractors.py`.
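
One way to spot-check whether the anti-sycophancy rules held is to scan the generated context statement for flattering terms. The sketch below uses a hypothetical helper, `find_flattery`, which is not part of `centuria`. Only "accomplished" and "key player" come from the note above; the other terms are illustrative assumptions.

```python
import re

# Illustrative term list: only the first two entries are named in this notebook,
# the rest are assumptions added for the example.
FLATTERY_TERMS = ["accomplished", "key player", "visionary", "world-class"]

def find_flattery(text, terms=FLATTERY_TERMS):
    """Return the flattering terms found in text (case-insensitive, whole-word match)."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found

sample = "An accomplished engineer and key player in her team."
print(find_flattery(sample))  # ['accomplished', 'key player']
```

An empty result does not prove the statement is neutral; it only shows that none of the listed terms appear.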

In [None]:
profile, context_statement = await process_personal_folder('../data/personal')

print(f"Name: {profile.name}")
print(f"Role: {profile.current_role} ({profile.industry})")
print(f"Political lean: {profile.political_lean}")
print(f"\nContext statement ({len(context_statement)} chars):")
print(context_statement)

---
## Part 2: LLM Bias in Identity Generation

When asked to generate random people, LLMs default to educated, progressive, urban professionals.

In [None]:
# Naive prompt without debiasing
NAIVE_PROMPT = """Generate a realistic synthetic identity. Return JSON:
{"name": "Full Name", "age": "integer", "gender": "string", "location": "City, Country",
 "occupation": "Job", "industry": "Industry", "education": "Education level",
 "political_lean": "Political orientation", "personality_sketch": "2-3 sentences"}
Return ONLY the JSON."""

print("Generating 5 identities with NAIVE prompt...\n")

for i in range(5):
    result = await complete(NAIVE_PROMPT)
    text = result.content.strip()
    if text.startswith("```"):
        # Strip a Markdown code fence (and an optional "json" tag) if the model added one
        text = text.split("```")[1].removeprefix("json").strip()
    p = json.loads(text)
    print(f"{i+1}. {p['name']}, {p['age']}, {p['gender']}")
    print(f"   {p['occupation']} | {p['education']}")
    print(f"   Political: {p['political_lean']}\n")

**Typical results:** every identity is female, aged 35, holds a Master's degree, and leans progressive/environmental. The model is effectively sampling from "people who write content that trains LLMs", not from a representative population.
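
The homogeneity described above can be quantified rather than eyeballed. The following is a minimal sketch, assuming the parsed identities are collected into a list of dicts. `dominant_share` is a hypothetical helper and the `demo` records are made up for illustration; they are not real model output.

```python
from collections import Counter

def dominant_share(profiles, field):
    """Return the most common value for `field` and the fraction of profiles that share it."""
    counts = Counter(str(p.get(field, "missing")).strip().lower() for p in profiles)
    value, n = counts.most_common(1)[0]
    return value, n / len(profiles)

# Made-up records for illustration only; in the notebook these would be
# the dicts parsed from each LLM response.
demo = [
    {"gender": "Female", "age": 35},
    {"gender": "female", "age": 35},
    {"gender": "Female", "age": 34},
    {"gender": "Male", "age": 35},
]

for field in ("gender", "age"):
    value, share = dominant_share(demo, field)
    print(f"{field}: {value!r} in {share:.0%} of profiles")
```

A dominant share close to 100% on several fields at once is the signature of the default-persona bias.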

### Batch Generation as Alternative

Asking for N profiles in a single request lets the model see the whole set at once, which nudges it to spread attributes across the batch instead of repeating its default:

In [4]:
BATCH_PROMPT = """Generate 20 realistic identities representing diverse adults.

Include diversity in: gender (50/50), age (18-80), education (only 35% have degrees),
occupation (retail, trades, healthcare, transport - not just professionals),
location (cities, suburbs, small towns), politics (conservatives, liberals, apolitical).

Think of 20 random people: cashier, plumber, retired teacher, nurse, truck driver, etc.

Return JSON array with: name, age, gender, location, occupation, industry, education,
political_lean, personality_sketch. Return ONLY the JSON array."""

print("Generating 20 profiles in ONE batch...\n")

result = await complete(BATCH_PROMPT)
text = result.content.strip()
if text.startswith("```"):
    # Strip a Markdown code fence (and an optional "json" tag) if the model added one
    text = text.split("```")[1].removeprefix("json").strip()
batch = json.loads(text)

for i, p in enumerate(batch, 1):
    print(f"{i:2}. {p['name']}, {p['age']}, {p['gender']} - {p['occupation']}")
    print(f"    {p['education']} | {p['political_lean']}")

Generating 20 profiles in ONE batch...

 1. Alice Smith, 24, female - waitress
    high school diploma | liberal
 2. John Doe, 65, male - retired teacher
    bachelor's degree | conservative
 3. Sarah Johnson, 36, female - nurse
    associate degree | liberal
 4. Tom Martinez, 51, male - truck driver
    high school diploma | apolitical
 5. Rachel Kim, 29, female - plumber
    vocational training | moderate
 6. Michael Brown, 59, male - janitor
    some college | conservative
 7. Emily Davis, 40, female - hairdresser
    certification | liberal
 8. Jacob Wilson, 31, male - bartender
    high school diploma | apolitical
 9. Linda Hernandez, 72, female - retired nurse
    bachelor's degree | conservative
10. Daniel Garcia, 44, male - electrician
    vocational training | moderate
11. Megan Patel, 27, female - cashier
    high school diploma | moderate
12. Chris Lee, 34, male - auto mechanic
    some college | liberal
13. Jessica White, 48, female - bus driver
    high school diploma | ap

**Inspect manually:** Is there genuine diversity in age, education, occupation, and politics? If batch works, use it. Otherwise, statistical sampling from census data is needed.
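
If batch generation falls short, the sampling fallback could look roughly like the sketch below. The marginals are the targets stated in the batch prompt (50/50 gender, 35% with degrees, ages 18-80), not real census figures; a real run would substitute published census tables. `MARGINALS` and `sample_demographics` are hypothetical names, and each attribute is sampled independently, which ignores correlations such as age with education.

```python
import random

# Target marginals copied from the batch prompt above, NOT real census data.
MARGINALS = {
    "gender": {"female": 0.5, "male": 0.5},
    "has_degree": {True: 0.35, False: 0.65},
}

def sample_demographics(n, seed=0):
    """Draw n demographic records, sampling each attribute independently."""
    rng = random.Random(seed)  # seeded for reproducible draws
    people = []
    for _ in range(n):
        person = {
            field: rng.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]
            for field, dist in MARGINALS.items()
        }
        person["age"] = rng.randint(18, 80)  # uniform over the prompt's age range
        people.append(person)
    return people

for person in sample_demographics(3):
    print(person)
```

Each sampled record would then be used to fill in the demographic fields of an identity, leaving the LLM to write only the name and personality sketch.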

---
## Part 3: File-First Persona Generation

Recommended workflow:
1. **Specify demographics** (from batch generation or census sampling)
2. **Generate data files** (CV, reading list, subscriptions)
3. **Extract persona** using the same pipeline as real data

In [5]:
print("Available file types:", list(list_available_file_types().keys()))

# Manually specify an identity - LLMs would never generate this unprompted
identity = SyntheticIdentity(
    name="Dale Hutchins",
    age=58,
    gender="male",
    location="Tulsa, Oklahoma, USA",
    occupation="Long-haul truck driver",
    industry="Transportation",
    education="High school diploma",
    political_lean="Conservative Republican, voted Trump, skeptical of mainstream media",
    personality_sketch="Spends weeks on the road listening to talk radio. Devoted grandfather who collects model trains and grills competition brisket. Attends Baptist church when home."
)

output_dir = Path('../data/synthetic/dale_hutchins')
print(f"\nGenerating files for: {identity.name}, {identity.age}")
print(f"  {identity.occupation} | {identity.education}")
print(f"  Political: {identity.political_lean}")

Available file types: ['cv', 'reading_list', 'subscriptions', 'spotify_favorites', 'twitter_bio', 'linkedin_summary', 'notes_snippet', 'bookmarks']

Generating files for: Dale Hutchins, 58
  Long-haul truck driver | High school diploma
  Political: Conservative Republican, voted Trump, skeptical of mainstream media


In [6]:
saved_files = await generate_synthetic_files(
    identity=identity,
    output_dir=output_dir,
    file_types=["cv", "subscriptions", "reading_list"],
)

print("Generated files:")
for file_type, path in saved_files.items():
    content = path.read_text()
    print(f"\n--- {path.name} ---")
    print(content[:800] + ("..." if len(content) > 800 else ""))

Generated files:

--- cv.txt ---
```
Dale Hutchins
Tulsa, Oklahoma, USA
Email: dale.hutchins58@gmail.com

Professional Summary:
Experienced long-haul truck driver with over 30 years in the transportation industry. Skilled in providing efficient and timely delivery services across the United States. Committed to safety, punctuality, and customer satisfaction while maintaining a clean driving record.

Work Experience:

Long-Haul Truck Driver
ABC Logistics, Tulsa, OK
January 2010 - Present
- Operate a variety of commercial trucks for nationwide deliveries.
- Maintain accurate logs and records of driving hours and vehicle inspections.
- Ensure adherence to all Department of Transportation regulations and safety standards.
- Consistently achieve superior customer satisfaction through timely delivery and professionalism.

Truck...

--- subscriptions.txt ---
1. The Joe Rogan Experience
2. Crime Junkie
3. Sports Radio 1170: The Tulsa Sports Animal
4. Country Hits Radio (e.g., SiriusXM The High

**Note:** File generation prompts include anti-overfitting rules - most content should be non-political (entertainment, hobbies, practical stuff). A conservative doesn't only read conservative books.

In [7]:
import uuid

synth_profile, synth_context = await process_personal_folder(str(output_dir))

persona = create_persona(
    name=identity.name,
    context=synth_context,
    persona_id=str(uuid.uuid4()),
)

print(f"Created persona: {persona.name}")
print(f"Extracted political lean: {synth_profile.political_lean}")
print(f"\nContext statement:")
print(synth_context)

Created persona: Dale Hutchins
Extracted political lean: right

Context statement:
Dale Hutchins is a long-haul truck driver based in Tulsa, Oklahoma, within the age range of 55 to 60 years. He possesses over 35 years of experience in the transportation industry, having held positions at multiple logistics companies. His career began with Tulsa Logistics from March 1988 to May 1995, followed by a role at XYZ Freight Company from June 1995 to December 2009, and he has been with ABC Logistics since January 2010. Hutchins is responsible for operating various commercial trucks for nationwide deliveries, maintaining accurate driving logs, adhering to Department of Transportation regulations, and ensuring customer satisfaction through timely and professional service.

Hutchins holds a high school diploma from Union High School in Tulsa, Oklahoma, which he completed in 1983. His skill set includes a broad knowledge of DOT regulations and safety protocols, GPS navigation, route planning, and e

In [8]:
test_questions = [
    Question(id="q1", text="What's your view on the role of government in the economy?",
             question_type="single_select",
             options=["Minimal intervention", "Active regulation", "Mixed approach", "Fundamental restructuring"]),
    Question(id="q2", text="Which party do you typically vote for?",
             question_type="single_select",
             options=["Republican", "Democrat", "Independent", "Don't vote"]),
    Question(id="q3", text="What motivates you most in your work?",
             question_type="open_ended"),
]

print(f"Testing {persona.name}:\n")
for q in test_questions:
    response = await ask_question(persona, q)
    print(f"Q: {q.text}")
    print(f"A: {response.response}\n")

Testing Dale Hutchins:

Q: What's your view on the role of government in the economy?
A: Mixed approach

Q: Which party do you typically vote for?
A: Republican

Q: What motivates you most in your work?
A: What motivates me most in my work is the satisfaction of delivering on time and knowing I’ve contributed to keeping everything running smoothly across the country. The freedom of the open road and the responsibility of ensuring cargo gets to its destination safely are rewarding. Plus, maintaining great relationships with customers and exceeding their expectations with timely deliveries really keeps me going. It's all about being reliable and efficient, just like I value in any system—be it my career, model railroading, or life in general.



This highlights the value of basing personas on real people, since they are likely to be more representative and more useful. LLMs are much better at inferring persona attributes from existing data than at generating them from scratch, which favours any organisation that already holds a database of real profiles. As this notebook shows, it is also important to be able to audit the opinions a persona expresses and check that they are realistic. The personas are essentially performing, and we should assess them the way a **film/theatre critic** would assess a performer.
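
A first pass at that kind of audit can be partly mechanical. The sketch below is hypothetical and works on plain dicts rather than `centuria` objects: a reviewer records which answers they consider plausible for the specified identity, and `audit_answers` flags anything outside that set. The answers are copied from the test run above; the plausible sets are an assumed reviewer judgement, not ground truth.

```python
def audit_answers(answers, plausible):
    """Return the question ids whose answer falls outside the reviewer's plausible set."""
    flagged = []
    for qid, answer in answers.items():
        allowed = plausible.get(qid)
        # Questions without a recorded judgement (e.g. open-ended ones) are skipped
        if allowed is not None and answer not in allowed:
            flagged.append(qid)
    return flagged

# Single-select answers copied from the Dale Hutchins test run above
answers = {"q1": "Mixed approach", "q2": "Republican"}

# Hypothetical reviewer judgement of what fits the specified identity
plausible = {
    "q1": {"Minimal intervention", "Mixed approach"},
    "q2": {"Republican", "Independent"},
}

print(audit_answers(answers, plausible))  # []
```

Open-ended answers, such as the response to q3, still need a human read, which is where the critic's eye comes in.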