PICON — Persona Interrogation framework for Consistency evaluation

PICON is a framework that automatically interviews and evaluates LLM-based personas across three dimensions: Consistency, External Verifiability, and Stability.

Provide your own persona (system prompt), and PICON will interrogate it from multiple angles, analyze the responses, and produce quantitative scores.


Quick Start

# 1. Install
pip install git+https://github.com/willystumblr/picon.git

# 2. Set API keys (minimum required)
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..." 
export SERPER_API_KEY="..."   # for external verification (serper.dev)

# 3. Run
picon --agent_model gemini/gemini-2.5-flash --agent_name "John" \
      --agent_persona "You are a 35-year-old software engineer living in Seoul." \
      --num_turns 5 --num_sessions 1 --do_eval

Or via Python:

from dotenv import load_dotenv
load_dotenv()
import picon

result = picon.run_interview(
    name="John",
    model="gemini/gemini-2.5-flash",
    persona="You are a 35-year-old software engineer living in Seoul.",
    num_turns=5,
    num_sessions=1,
)
print(result["persona_stats"])

Project Structure

picon/                       # Core package
├── __init__.py              #   Public API: run(), interview(), evaluate(), run_interview(), run_evaluation()
├── api.py                   #   Implements the public API functions above; defines PiconResult
├── config.py                #   Default settings (model names, turn counts) & prompt path helpers
├── schemas.py               #   Pydantic models for interview state (Turn, Action, State, etc.)
├── utils.py                 #   Shared helpers — LLM calls via litellm, JSON I/O
│
├── agents/                  #   AI agents that participate in the interview
│   ├── agent_factory.py     #     Creates agent objects by type name (e.g. "questioner")
│   ├── questioner_agent.py  #     Dynamically generates follow-up interview questions
│   ├── extractor_agent.py   #     Extracts entities (people, places, facts) from answers
│   ├── claim_agent.py       #     Extracts verifiable claims from answers
│   ├── evaluator.py         #     Judges consistency/contradiction between repeated answers
│   ├── web_search_agent.py  #     Decides whether to web-search a claim, then verifies it
│   ├── kg_agent.py          #     Builds knowledge-graph triplets (subject, predicate, object)
│   └── prompts/             #     System prompt files (.txt) for each agent
│
├── env/                     #   Interview environment — orchestrates the full interview flow
│   ├── interrogation_env.py #     Core orchestrator: coordinates agents across turns
│   ├── interviewee_simulator/
│   │   ├── generic_agent_simulator.py  # Connects to any OpenAI-compatible LLM endpoint
│   │   ├── simulator_factory.py        # Factory: creates the right simulator by baseline name
│   │   └── persona_prompt_builders.py  # Converts various persona formats into system prompts
│   ├── personas/            #     Bundled persona baselines (HumanSimulacra, PersonaHub)
│   └── test_env/            #     Standalone test harnesses for individual agents
│
└── tools/                   #   External API integrations
    ├── web_search.py        #     Web search (Serper, Tavily, Google) + page parsing & BM25 ranking
    └── address_locator.py   #     Address validation via Google Geocoding API

main.py                      # CLI entry point (thin wrapper around picon.api)
servers/                     # Wrapping servers — expose CharacterAI, HumanSimulacra, etc. as OpenAI-compatible APIs
web_interview/               # Web UI for collecting human interviews (Next.js frontend + FastAPI backend)
scripts/                     # Batch run scripts

How It Works — Interview Pipeline

1. Get-to-Know        Ask predefined demographic questions (WVS dataset)
       |
2. Main Interrogation Each turn runs this agent chain:
       |
       |-- Questioner    Generate the next question based on conversation history
       |-- Interviewee   The persona under evaluation answers the question
       |-- Extractor     Pull out entities and verifiable claims from the answer
       |-- Web Search    Fact-check extracted claims against the web
       '-- Evaluator     Compare this answer with previous answers for consistency
       |
3. Repeat Phase        Re-ask the get-to-know questions to measure stability
       |
4. Finalize            Compute all evaluation scores and save results
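
The per-turn agent chain in step 2 can be sketched as a plain loop. This is an illustrative sketch only: the agent callables and the history layout here are hypothetical stand-ins, not PICON's real interface (see "Advanced: Using Components Directly" for the actual classes).

```python
def run_turn(history, questioner, interviewee, extractor, web_search, evaluator):
    """One Main Interrogation turn, with agents modeled as plain callables."""
    question = questioner(history)               # 1. generate next question
    answer = interviewee(question)               # 2. persona under evaluation answers
    claims = extractor(answer)                   # 3. pull verifiable claims from the answer
    verdicts = [web_search(c) for c in claims]   # 4. fact-check each claim against the web
    score = evaluator(answer, history)           # 5. compare with previous answers
    history.append({"q": question, "a": answer,
                    "claims": claims, "verdicts": verdicts, "score": score})
    return history
```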

Installation

Option A — Clone the repo (development / full access)

git clone https://github.com/willystumblr/picon.git
cd picon
pip install -e .        # basic
pip install -e ".[all]" # full (CharacterAI, Google GenAI, etc.)

Option B — Install as a package (no repo needed)

pip install git+https://github.com/willystumblr/picon.git
# full extras:
pip install "picon[all] @ git+https://github.com/willystumblr/picon.git"

Environment Variables

Option A: copy the template and fill in your keys.

cp .env.example .env

Option B: create a .env file manually (or export variables in your shell).

# .env
OPENAI_API_KEY=sk-...
SERPER_API_KEY=...

Variable             Purpose
GEMINI_API_KEY       Gemini model calls
GOOGLE_API_KEY       Google API (same value as GEMINI_API_KEY)
OPENAI_API_KEY       OpenAI model calls
ANTHROPIC_API_KEY    Anthropic model calls
SERPER_API_KEY       Web search for external verification
GOOGLE_GEOCODE       Address validation (Google Geocoding API)
GOOGLE_CLAIM_SEARCH  Fact-check search (Google Custom Search API)
GOOGLE_CX_ID         Custom Search Engine ID
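
A quick pre-flight check can catch missing keys before a long run. This helper is not part of PICON; it is a small sketch that checks the minimum set of keys the Quick Start assumes.

```python
import os

# The keys the Quick Start treats as the minimum; extend for your providers.
REQUIRED = ["OPENAI_API_KEY", "GEMINI_API_KEY", "SERPER_API_KEY"]

def missing_keys(env=os.environ):
    """Return the required API keys that are unset or empty."""
    return [k for k in REQUIRED if not env.get(k)]

# Usage before a run:
#   if missing_keys():
#       raise SystemExit(f"Missing env vars: {', '.join(missing_keys())}")
```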

Usage

PICON can be used as a Python library or via the CLI. Any system that exposes an OpenAI-compatible /v1/chat/completions endpoint can be evaluated.


Python API (recommended)

Works with both Option A and B — just pip install and import.

One-shot: interview + evaluation

from dotenv import load_dotenv
load_dotenv()  # load API keys from .env

import picon

result = picon.run(
    persona="You are a 35-year-old software engineer living in Seoul...",
    name="John",
    model="gpt-5",
    num_turns=20,
    num_sessions=2,
    do_eval=True,
)
print(result.eval_scores)
result.save("results/john.json")

Separate interview and evaluation

import picon

# Step 1: interview
interview_result = picon.run_interview(
    name="John",
    model="gpt-5",
    persona="You are a 35-year-old software engineer...",
    num_turns=20,
    num_sessions=2,
)

# Step 2: evaluation (optional)
persona_stats = picon.run_evaluation(interview_result, eval_factors=["internal", "external"])
print(persona_stats)

run_interview() parameters

Parameter         Description                            Default
name              Interviewee name                       (required)
model             Model for the persona                  (required)
persona           System prompt (string or .txt path)    ""
api_base          OpenAI-compatible API URL              None (litellm routing)
api_key           API key for the persona endpoint       None
num_turns         Number of interview turns              30 (from DEFAULT_CONFIG)
num_sessions      Number of repeated sessions            2 (from DEFAULT_CONFIG)
questioner_model  Questioner agent model                 DEFAULT_CONFIG
extractor_model   Extractor agent model                  DEFAULT_CONFIG
web_search_model  Web search agent model                 DEFAULT_CONFIG
evaluator_model   Evaluator agent model                  DEFAULT_CONFIG
nhd_model         NHD detector model                     DEFAULT_CONFIG
questioner_port   Port for self-hosted questioner        None
extractor_port    Port for self-hosted extractor         None
web_search_port   Port for self-hosted web search        None
evaluator_port    Port for self-hosted evaluator         None
nhd_port          Port for self-hosted NHD               None
output_dir        Output directory                       data/results
question_seed     Random seed for question order         42

Default model values are defined in DEFAULT_CONFIG in picon/config.py.


CLI

Three equivalent ways to invoke the CLI:

picon ...            # after pip install (Option A or B)
python -m picon ...  # always works if picon package is installed
python main.py ...   # only when repo is cloned (Option A)

Prompt-Based (LLM-Generated / DeepPersona / Twin-2K-500)

# Cloud API
picon \
    --agent_model gpt-5 \
    --agent_persona persona.txt \
    --agent_name "John" \
    --num_turns 20 --num_sessions 2 --do_eval


Fine-Tuned Model (OpenCharacter / ConsistentLLM)

# OpenCharacter
picon \
    --agent_api_base http://localhost:8123/v1 \
    --agent_model openai/willystumblr/opencharacter-sft-llama-3-8b-instruct \
    --agent_persona "You are a kind-hearted librarian named Alice..." \
    --agent_name "Alice" \
    --num_turns 20 --num_sessions 2 --do_eval

# ConsistentLLM — start the wrapping server first
python servers/consistent_llm_server.py \
    --port 8003 \
    --model_path /path/to/llama-8b-sft-ppo-prompt \
    --persona "You are a consistent persona..." \
    --name "John"

picon \
    --agent_api_base http://localhost:8003/v1 \
    --agent_model consistent_llm \
    --agent_name "John" \
    --num_turns 20 --num_sessions 2 --do_eval

RAG / Multi-Agent (HumanSimulacra)

# 1) Start the wrapping server
python servers/human_simulacra_server.py \
    --port 8002 \
    --character_name "Mary Jones" \
    --model gpt-5

# 2) Run the interview
picon \
    --agent_api_base http://localhost:8002/v1 \
    --agent_model human_simulacra \
    --agent_name "Mary Jones" \
    --num_turns 20 --num_sessions 2 --do_eval

Service (CharacterAI)

# 1) Start the wrapping server
python servers/characterai_server.py \
    --port 8001 \
    --character_id "ZTvEvhHRJs9KEe_NjwHoZEJFAAZ5nUV3UkTaMpNE7rY"

# 2) Run the interview
picon \
    --agent_api_base http://localhost:8001/v1 \
    --agent_model characterai \
    --agent_name "Jordan Peterson" \
    --num_turns 20 --num_sessions 2 --do_eval

Self-hosted vLLM endpoint

picon \
    --agent_api_base http://localhost:8000/v1 \
    --agent_model meta-llama/Llama-3-8B \
    --agent_persona "You are a 30-year-old teacher named Jane..." \
    --agent_name "Jane" \
    --num_turns 20 --num_sessions 2 --do_eval

CLI Options

Option              Description                            Default
--agent_model       Model for the persona                  (required)
--agent_persona     System prompt (string or .txt path)    None
--agent_name        Interviewee name                       Agent
--agent_api_base    OpenAI-compatible API URL              None (litellm routing)
--num_turns         Number of interview turns              DEFAULT_CONFIG
--num_sessions      Number of repeated sessions            DEFAULT_CONFIG
--do_eval           Run evaluation after interview         False
--eval_factors      Evaluation factors to run              None (all)
--questioner_model  Questioner agent model                 DEFAULT_CONFIG
--evaluator_model   Evaluator agent model                  DEFAULT_CONFIG
--output_dir        Output directory                       DEFAULT_CONFIG
Advanced: Using Components Directly

from picon.agents import get_agent
from picon.env import InterrogationEnv
from picon.tools import SerperSearch, GoogleGeocodeValidate
from picon.config import get_prompt_path

agents = {
    "questioner": get_agent("questioner", get_prompt_path("questioner.txt"), model="gpt-5"),
    "extractor":  get_agent("entity_extractor", get_prompt_path("entity_extractor.txt"), model="gpt-5.1"),
    "web_search": get_agent("web_search", get_prompt_path("websearch_prompt.txt"), model="gpt-5"),
    "evaluator":  get_agent("evaluator", get_prompt_path("evaluator_prompt.txt"), model="gpt-5"),
}

tools = {
    "serper_search": SerperSearch(api_key="your-key"),
    "google_geocode_validate": GoogleGeocodeValidate(api_key="your-key"),
}

env = InterrogationEnv(
    agents=agents,
    tools=tools,
    max_turns=20,
    baseline_name="generic_agent",
    model="gpt-5",
    persona="You are ...",
    name="CustomAgent",
)

env.reset()
done = False
while not done:
    state, done = env.step()
env.finalize()
eval_result = env.evaluate([env.state.history])
env.shutdown()

Evaluation Metrics

Metric                        Description
Internal Responsiveness       Relevance of answers to questions
Internal Consistency          Consistency of answers to repeated questions
Internal Harmonic Mean        Harmonic mean of Responsiveness and Consistency
External Coverage             Fraction of turns containing at least one verifiable claim (|T_c| / T)
External Non-refutation Rate  Macro-averaged per-turn rate of claims not refuted by web evidence
External Consistency (EC)     Harmonic mean of Coverage and Non-refutation Rate
Inter-session Stability       Answer stability across sessions
Intra-session Stability       Answer stability within a session
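
Both harmonic-mean metrics combine two rates in [0, 1]. A minimal sketch of the computation, inferred from the table above (the function and variable names are mine, not PICON's):

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two rates in [0, 1]; defined as 0 if either rate is 0."""
    return 2 * a * b / (a + b) if a and b else 0.0

# External Consistency example:
coverage = 15 / 20         # |T_c| / T: 15 of 20 turns contained a verifiable claim
non_refutation = 0.9       # 90% of claims were not refuted by web evidence
ec = harmonic_mean(coverage, non_refutation)
```

The harmonic mean penalizes imbalance: a persona that makes almost no verifiable claims cannot score well on External Consistency even if every claim it does make survives fact-checking.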

License

TBD
