---
title: SGO — Semantic Gradient Optimization
emoji: 📊
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
---
You're launching a product. You think the landing page is good. But who have you actually asked?
You could run a survey — but that takes weeks and you'd need to find the right people. You could ask an LLM — but one LLM opinion isn't a market. You could A/B test — but you need traffic first, and you don't know what to test.
SGO lets you ask 50 realistic people what they think — in 3 minutes, for $0.10.
It builds a representative panel from census-grounded synthetic personas, has each one score your thing from their perspective, then asks "what would change your mind?" — producing a priority-ranked list of what to fix first.
```
You:  "Here's my landing page. Here's my target market."

SGO:  "47 evaluators scored you. Avg 5.3/10.
       Solo devs love it (7.2). Enterprise is blocked (3.1).
       #1 concern: no SOC2. #2: no free tier.

       Gradient:
         +2.1  Add self-hosted option
         +1.8  Add free tier         ← biggest universal win
         +1.4  Get SOC2 certified
         +0.6  Drop price            ← not actually the blocker"
```
Anything someone else evaluates.
| What you're optimizing | Who evaluates it | What you learn |
|---|---|---|
| Product — landing page, pricing | Buyer personas by company size, role, budget | Which segments convert, which are blocked, and why |
| Resume — CV + cover letter | Hiring managers at startups vs. enterprises | What stands out, what's a red flag, what to lead with |
| Pitch — investor deck | VCs and angels at different stages | Whether the story lands, what questions they'd ask |
| Policy — proposed regulation | Stakeholders by role, income, geography | Who supports it, who opposes, what compromise works |
| Content — blog post, video | Readers at different expertise levels | Whether it hits the right level, what's confusing |
| Profile — professional bio, personal brand | Population sample by age, education, occupation | How different demographics perceive you |
SGO ships with a 1M-person census-grounded dataset (Nemotron-Personas-USA) with structured demographics (age, sex, education, occupation, marital status, US geography) plus rich narrative fields — professional persona, skills and expertise, career goals, hobbies, cultural background, and personality. The narratives naturally encode things like seniority, industry, technical depth, and decision-making style, even though those aren't separate columns.
This means most domains work out of the box — the LLM evaluates from the persona's full context, not just the demographic fields. For highly specialized panels (e.g., Series B VCs, enterprise procurement officers), SGO can generate personas via LLM with explicit stratification constraints. See limitations on generated vs. census-grounded panels.
In each case, SGO tells you where you stand, what's working, what's not, and what specific change would help the most — broken down by audience segment.
```shell
git clone https://github.com/xuy/sgo.git && cd sgo
cp .env.example .env   # Add your LLM API key (any OpenAI-compatible provider)
uv sync
uv run --extra web python web/app.py
# Opens at http://localhost:8000
```

The web interface walks you through the full pipeline: describe your entity, build a panel, evaluate, find the highest-impact changes, and audit your panel for cognitive biases.
### Alternative: use as a Claude Code skill
```shell
git clone https://github.com/xuy/sgo.git ~/.claude/skills/sgo
cd ~/.claude/skills/sgo && cp .env.example .env && uv sync
```

Then run:

```shell
/sgo                              # Interactive — it asks what you're optimizing
/sgo entities/my_product.md       # Start with an existing entity
/sgo "optimize my landing page"   # Start from a description
```
### CLI-only usage (no web interface)
```shell
uv run python scripts/setup_data.py   # Download Nemotron personas (once, ~2GB)
# Then use scripts directly: evaluate.py, counterfactual.py, bias_audit.py, compare.py
# See AGENT.md for the full pipeline reference
```

You describe what you're optimizing and what your goal is. SGO builds a diverse panel, has each one react, then focuses on the persuadable middle — the people who are almost convinced — to find what would tip them toward your goal.
SGO does not try to please everyone. People who scored 1–3 are not your audience — their feedback is informational, not actionable. The system focuses on moving the people who are close to yes.
Five steps:
- Describe your entity and goal — what an evaluator would see, and what outcome you're optimizing for
- Build a panel — 30–80 evaluators, stratified to cover the segments that matter
- Evaluate — each evaluator scores 1–10. Results are segmented: champions (8+), persuadable (4–7), not-for-them (1–3)
- Find directions for your goal — the persuadable middle re-evaluates hypothetical changes. With a goal, evaluators are weighted by relevance (VJP)
- Act and re-run — make the top change, re-evaluate against the same panel, track improvement over time
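The segmentation in step 3 can be sketched in a few lines. This is an illustrative helper, not SGO's actual API; the evaluator names and the `segment` function are hypothetical:

```python
def segment(scores):
    """Split evaluator scores (1-10) into the three bands SGO reports:
    champions (8+), persuadable (4-7), not-for-them (1-3)."""
    bands = {"champions": [], "persuadable": [], "not_for_them": []}
    for evaluator, score in scores.items():
        if score >= 8:
            bands["champions"].append(evaluator)
        elif score >= 4:
            bands["persuadable"].append(evaluator)
        else:
            bands["not_for_them"].append(evaluator)
    return bands

# Hypothetical panel: only the persuadable band feeds the gradient probe.
scores = {"solo_dev": 8, "startup_em": 6, "enterprise_cto": 3, "data_analyst": 5}
print(segment(scores))
```

Only the `persuadable` band is re-queried in step 4; champions are already won and the 1–3 band is informational only.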
The key insight is step 4. The probe produces a ranked list of changes sorted by how much they'd move the persuadable middle toward your goal. SGO calls this the semantic gradient — technically a vector-Jacobian product when a goal is specified.
### Example: what the gradient looks like
Each row is an evaluator. Each column is a hypothetical change. Each cell is the score delta.
| Add free tier | Get SOC2 | Self-hosted | Open-core | Case studies | |
|---|---|---|---|---|---|
| Solo dev | +2 | +1 | 0 | +1 | +3 |
| Startup EM | +1 | +3 | -1 | +2 | +4 |
| Enterprise CTO | 0 | +1 | +2 | +1 | +2 |
| Data analyst | +1 | +2 | 0 | 0 | +3 |
| Average | +1.0 | +1.8 | +0.3 | +1.0 | +3.0 |
The column averages tell you what to fix first. "Case studies" has the highest average impact. "Self-hosted" helps enterprise but slightly hurts startups — a tradeoff, not a pure win.
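The "what to fix first" ranking is just the column means of this matrix, sorted. A minimal sketch using the table's numbers (variable names are illustrative, not SGO's internals):

```python
# Score-delta matrix from the example table: rows = evaluators, cols = changes.
changes = ["free_tier", "soc2", "self_hosted", "open_core", "case_studies"]
J = [
    [+2, +1,  0, +1, +3],   # solo dev
    [+1, +3, -1, +2, +4],   # startup EM
    [ 0, +1, +2, +1, +2],   # enterprise CTO
    [+1, +2,  0,  0, +3],   # data analyst
]

# Unweighted gradient = column means; sorting gives the priority list.
n = len(J)
gradient = {c: sum(row[j] for row in J) / n for j, c in enumerate(changes)}
ranked = sorted(gradient.items(), key=lambda kv: kv[1], reverse=True)
for change, avg in ranked:
    print(f"{avg:+.2f}  {change}")
```

A per-segment breakdown of the same matrix is what surfaces tradeoffs like "self-hosted": a positive average can hide a negative cell.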
SGO uses NVIDIA Nemotron-Personas-USA — 1 million synthetic Americans whose demographics match real US census distributions. Each persona includes detailed narratives: professional background, skills, career goals, hobbies, cultural background, and personality.
This matters because when you ask an LLM to "generate 50 diverse personas," you get 5–6 archetypes with surface variation — mostly coastal, college-educated, and tech-adjacent. You can't audit what's missing. Census-grounded personas give you the construction worker in suburban Illinois and the quilter in rural Texas, because census data says those people exist.
The principle: define the population before the measurement, not after.
Nemotron covers age, sex, education, occupation, geography, and marital status as structured fields — plus rich narratives about each person's career, skills, values, and lifestyle. That's enough to directly evaluate anything consumer-facing: products, profiles, content, policy.
But what about domains the dataset doesn't explicitly cover — like "enterprise CTOs" or "Series B investors"? There are four ways to get there, from most grounded to most flexible:
1. Filter by what's already there. A Nemotron persona with occupation: software_developer, education: graduate, age: 38 and a professional narrative describing team leadership is a plausible engineering manager evaluating your developer tool. You just filter and let the narrative do the work.
2. Reframe the evaluation prompt. Same persona, different lens. Instead of "would you buy this?", ask "you're evaluating this tool for your team — would you champion it internally?" The persona's professional context, skills, and decision-making style naturally shape the answer.
3. Enrich with a situational overlay. Add context that the persona doesn't have: "You are [full Nemotron persona]. You work at a 50-person Series A startup. Your team's tooling budget is $2k/month. You've been burned by vendor lock-in before." The demographic grounding stays real; the professional situation is augmented.
4. Generate from scratch, using Nemotron as a quality bar. For truly specialized roles (VC partners, procurement officers, regulatory lawyers), generate personas via LLM — but use Nemotron personas as few-shot examples so the output matches the depth and internal consistency of the dataset. SGO's generate_cohort.py does this with an explicit warning about the quality tradeoff.
Each step trades some census grounding for more domain specificity. For most use cases, steps 1–2 are enough.
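Options 1 and 3 can be sketched together. The field names below are illustrative, not Nemotron's exact schema, and the overlay text is the example from option 3:

```python
# Hypothetical persona records with structured fields + a narrative field.
personas = [
    {"occupation": "software_developer", "education": "graduate", "age": 38,
     "professional_persona": "Leads a platform team of six engineers..."},
    {"occupation": "construction_worker", "education": "high_school", "age": 45,
     "professional_persona": "Runs residential framing crews..."},
]

# Option 1: filter by structured fields for plausible engineering managers.
managers = [p for p in personas
            if p["occupation"] == "software_developer" and p["age"] >= 35]

# Option 3: add a situational overlay before prompting; the demographic
# grounding stays real, the professional situation is augmented.
overlay = ("You work at a 50-person Series A startup. Your team's tooling "
           "budget is $2k/month. You've been burned by vendor lock-in before.")
prompts = [f"You are: {p['professional_persona']}\n{overlay}" for p in managers]
print(len(managers), "matching personas")
```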
### SaaS product launch — full walkthrough
A seed-stage startup launching "Acme API," a managed data pipeline tool. The landing page says: 200+ connectors, pay-as-you-go at $0.01/sync, SOC2 pending, $99/mo starter, 3-person team.
40 buyer personas stratified by company size (solo → enterprise), role (IC engineer → CTO → data analyst), budget, and tech stack.
```
Solo devs:      avg 7.2  ← love it
Startups:       avg 5.8  ← cautious
Enterprise:     avg 3.1  ← blocked
Non-technical:  avg 4.5  ← confused
```
| Rank | Avg Δ | Change |
|---|---|---|
| 1 | +2.1 | Add self-hosted / VPC option |
| 2 | +1.8 | Add free tier (1,000 syncs/mo) |
| 3 | +1.4 | SOC2 certified (not pending) |
| 4 | +1.2 | Open-core positioning |
| 5 | +1.0 | Add 3 named customer case studies |
| 6 | +0.6 | Drop price to $49/mo |
Insight: Price isn't the blocker. Trust and deployment model are.
Ship the free tier. Re-evaluate. Score moves from 5.3 → 6.1. Then get SOC2. Score moves to 7.0. Each step verified against the same panel.
| Version | Change | Avg | % positive | Concerns |
|---|---|---|---|---|
| v1 | baseline | 5.3 | 0% | price, trust |
| v2 | + free tier | 6.1 | 12% | trust |
| v3 | + SOC2 | 7.0 | 28% | (none) |
LLM evaluators don't exhibit cognitive biases at human-realistic levels — they may be too rational (under-biased) or show biases in the wrong patterns (mis-biased). Since real expert panels are biased, matching their behavior means matching their bias profile, not eliminating bias.
SGO includes a bias audit inspired by CoBRA (Liu et al., CHI'26 Best Paper), which uses validated social science experiments to measure and control cognitive biases in LLM agents.
bias_audit.py runs three probes through the same LLM + persona pipeline SGO uses for evaluation:
| Probe | What it tests | Human baseline |
|---|---|---|
| Framing | Same entity, gain-framed vs. loss-framed — do evaluators shift scores based on rhetoric vs. substance? | ~30% shift (Tversky & Kahneman, 1981) |
| Authority | Entity with/without credibility signals (SOC2, press, logos) — how much do credentials move the needle? | ~20% sensitivity in evaluation contexts |
| Order | Same entity, sections reordered — does information order anchor scores? | Should be ~0% |
```shell
uv run python scripts/bias_audit.py \
    --entity entities/my_product.md \
    --cohort data/cohort.json \
    --probes framing authority order \
    --sample 10
```

Output: `results/bias_audit/report.md` — per-probe shift %, gap vs. human baselines, and whether the panel is over-biased, under-biased, or well-calibrated.
If the audit reveals bias gaps, add --bias-calibration to your evaluation run:
```shell
uv run python scripts/evaluate.py \
    --entity entities/my_product.md \
    --cohort data/cohort.json \
    --tag calibrated \
    --bias-calibration
```

This appends bias-aware instructions to the evaluation prompt — reducing framing, authority, and order artifacts while preserving realistic human-level biases. The goal is not to eliminate bias but to match the type and magnitude of biases that real expert panels exhibit.
The gap between SGO and real expert panels has three components:
| Gap | What it means | How SGO addresses it |
|---|---|---|
| Knowledge | Does the LLM know what an expert knows? | Persona enrichment, narrative context |
| Preference | Does it weight factors correctly? | Stratification, prompt design |
| Bias | Does it exhibit human-realistic cognitive biases? | Bias audit + calibration (CoBRA-inspired) |
- Directional, not definitive — this is synthetic research. Treat results as strong hypotheses, not proof. Validate important decisions with real users.
- LLM biases — evaluators inherit the model's cultural blind spots. Results skew toward what the LLM thinks people think. Use `bias_audit.py` to measure and `--bias-calibration` to mitigate.
- Independent evaluators — each persona scores in isolation. Real-world opinions are social — people influence each other. SGO doesn't capture herd effects.
- Not all changes add up — two changes that each score +1.5 might not give +3.0 together. Test combinations explicitly.
## Technical details
### The Semantic Gradient
SGO computes a Jacobian matrix of score deltas — how each evaluator's score would shift for each hypothetical change:

Jᵢⱼ = f(θ + Δⱼ, xᵢ) − f(θ, xᵢ)
### Goal-weighted gradient (VJP)
The key insight: not all evaluators matter equally. A luxury brand shouldn't optimize for budget shoppers. A dating profile shouldn't optimize for incompatible matches.
SGO uses a goal vector v that weights each evaluator by their relevance to your objective. The gradient is a vector-Jacobian product:

∇ⱼ = Σᵢ vᵢ · Jᵢⱼ

where vᵢ is the goal-relevance weight for evaluator i (0 = irrelevant, 1 = ideal target).
Without a goal, v = [1/n, ...] — uniform weights, optimizing for universal appeal. With a goal like "close enterprise deals", enterprise CTOs get v ≈ 1 and solo hobbyists get v ≈ 0.
The LLM assigns goal-relevance weights automatically by evaluating each persona against your stated objective. This means the gradient tells you "what changes move you toward your goal", not "what changes make everyone like you more".
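The effect of goal weighting can be shown on a small slice of the earlier example matrix. The weights below are illustrative; in SGO the LLM assigns them:

```python
# Goal-weighted gradient (VJP) sketch: same Jacobian idea as above, but
# evaluators are weighted by goal relevance instead of averaged uniformly.
changes = ["free_tier", "soc2", "self_hosted"]
J = [
    [+2, +1,  0],   # solo dev
    [+1, +3, -1],   # startup EM
    [ 0, +1, +2],   # enterprise CTO
]
# Goal: "close enterprise deals" -> the enterprise CTO dominates the weights.
v = [0.1, 0.3, 1.0]

total = sum(v)
vjp = {c: sum(v[i] * J[i][j] for i in range(len(J))) / total
       for j, c in enumerate(changes)}
# Under this goal, SOC2 and self-hosted outrank the free tier, even though
# the free tier wins on unweighted average appeal.
```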
### What to probe
Only probe changes you'd actually make:
| Category | Examples | Probe? |
|---|---|---|
| Presentation — framing, tone, emphasis | Rewrite headline, reorder features | Yes |
| Actionable — real changes with real cost | Add free tier, get SOC2 | Yes |
| Fixed — can't change | History, sunk costs | No |
| Boundary — won't change | Values, ethics, mission | No |
### Notation
| Symbol | Meaning |
|---|---|
| θ | Entity you control |
| x | Evaluator persona |
| g | Goal — what you're optimizing for |
| f(θ, x) | LLM evaluation → score + reasoning |
| v_i | Goal-relevance weight for evaluator i |
| Δⱼ | Hypothetical change |
| Jᵢⱼ | Score delta: evaluator i, change j |
| ∇ⱼ | Goal-weighted gradient (VJP): impact of change j toward goal g |
## Project Structure

```
├── README.md                 # This file
├── AGENT.md                  # Execution guide for AI agents
├── SKILL.md                  # Claude Code skill definition
├── pyproject.toml            # Dependencies
├── .env.example              # API key template
├── scripts/
│   ├── setup_data.py         # Download Nemotron personas (once)
│   ├── persona_loader.py     # Load + filter
│   ├── stratified_sampler.py
│   ├── generate_cohort.py    # LLM-generate personas (fallback)
│   ├── evaluate.py           # Scorer (supports --bias-calibration)
│   ├── counterfactual.py     # Semantic gradient probe
│   ├── bias_audit.py         # CoBRA-inspired cognitive bias measurement
│   └── compare.py            # Cross-run diff
├── web/
│   ├── app.py                # FastAPI backend (primary entry point)
│   └── static/index.html     # Single-page frontend
├── templates/                # Entity + changes templates
├── entities/                 # Your documents (gitignored)
├── data/                     # Cohorts (gitignored)
└── results/                  # Run outputs (gitignored)
```
## License

MIT