A research pipeline for benchmarking political-bias behavior of LLMs across three political tests:
- Political Compass (pct)
- 8values (8val)
- SapplyValues (saply)
The repository supports:
- Running multi-model inference through OpenRouter
- Saving raw responses in JSONL format
- Building cleaned per-test answer tables
- Computing official-style test scores through Selenium + Firefox
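The JSONL logging mentioned above can be sketched as appending one JSON object per line to an append-only file. The field names below are illustrative assumptions, not the repository's actual record schema:

```python
import json
from pathlib import Path

def append_jsonl(path: Path, record: dict) -> None:
    """Append one record as a single JSON line (append-only log)."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Illustrative record shape (assumed, not the repo's actual schema)
record = {
    "test": "pct",
    "question_id": 1,
    "model": "example/model-name",
    "response": "Agree",
    "ok": True,
}
append_jsonl(Path("responses.jsonl"), record)
```

Append-only JSONL means a crashed run never corrupts earlier records; each line parses independently.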
- Experiment: core experiment runner, prompts, models, questions, and raw/cleaned outputs
- Experiment/Questions: source question sets
- Experiment/results: responses.jsonl, pct.csv, 8val.csv, sap.csv
- scoring_tools: Selenium scorers for each test family
- scores: final score outputs
- Rouge: older validator scripts retained for reference
- or_modelrun.py: legacy single-script runner
- Run experiment
  - Produces raw records in Experiment/results/responses.jsonl
- Clean / pivot to per-test CSVs
- Score with browser automation
Create and activate a virtual environment in project root.
Install dependencies with the venv interpreter explicitly (recommended on Ubuntu/PEP 668 setups):
```bash
./.venv/bin/python -m pip install -r requirements.txt
```

Primary dependencies are listed in requirements.txt.
Add your API key in Experiment/.env (create the file if missing):

```
OPENROUTER_API_KEY=your_key_here
```

Optional runtime knobs (also in .env):
- EXPERIMENT_MAX_WORKERS
- EXPERIMENT_REQUEST_TIMEOUT
- EXPERIMENT_MAX_RETRIES
- EXPERIMENT_RETRY_BASE_DELAY
- EXPERIMENT_RETRY_MAX_DELAY
- EXPERIMENT_TEMPERATURE
- EXPERIMENT_MAX_TOKENS
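A minimal sketch of how such knobs might be read with safe fallbacks (the default values here are assumptions, not the repository's actual defaults):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer knob from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw else default

def env_float(name: str, default: float) -> float:
    """Read a float knob from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return float(raw) if raw else default

MAX_WORKERS = env_int("EXPERIMENT_MAX_WORKERS", 4)            # assumed default
REQUEST_TIMEOUT = env_float("EXPERIMENT_REQUEST_TIMEOUT", 60.0)  # assumed default
MAX_RETRIES = env_int("EXPERIMENT_MAX_RETRIES", 3)            # assumed default
TEMPERATURE = env_float("EXPERIMENT_TEMPERATURE", 0.0)        # assumed default
```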
From project root:
```bash
./.venv/bin/python Experiment/run.py
```

Retry unresolved failed records only:

```bash
./.venv/bin/python Experiment/main.py --try_failed
```

The runner uses:
- model configs from Experiment/models.json
- instruction prefixes from Experiment/instruction_prefix.json
- question CSVs from Experiment/Questions
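The EXPERIMENT_RETRY_BASE_DELAY / EXPERIMENT_RETRY_MAX_DELAY knobs suggest capped exponential backoff between retries; a sketch of such a delay schedule (the exact formula is an assumption):

```python
def retry_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Capped exponential backoff: base * 2**attempt, never exceeding the cap.

    `base` and `cap` would come from EXPERIMENT_RETRY_BASE_DELAY and
    EXPERIMENT_RETRY_MAX_DELAY; the defaults here are illustrative.
    """
    return min(base * (2 ** attempt), cap)

# With base=1, cap=30, attempts 0..4 yield delays 1, 2, 4, 8, 16 seconds.
```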
From Experiment:
```bash
../.venv/bin/python clean.py
```

This writes the per-test CSVs (pct.csv, 8val.csv, sap.csv) to Experiment/results.
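The clean/pivot idea can be sketched as grouping raw records by (test, question_id) and spreading each model's answer into its own column. The record field names below are assumptions:

```python
def pivot_records(records: list[dict]) -> list[dict]:
    """Pivot long-format records (one per model answer) into wide rows,
    with one column per model display name."""
    rows: dict[tuple, dict] = {}
    for r in records:
        key = (r["test"], r["question_id"])
        row = rows.setdefault(key, {"test": r["test"], "question_id": r["question_id"]})
        row[r["model"]] = r["answer"]  # one column per model
    return list(rows.values())

# Two answers to the same question collapse into one wide row.
records = [
    {"test": "pct", "question_id": 1, "model": "model-a", "answer": "Agree"},
    {"test": "pct", "question_id": 1, "model": "model-b", "answer": "Disagree"},
]
wide = pivot_records(records)
```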
From project root:
```bash
./.venv/bin/python scoring_tools/pct_scorer.py
./.venv/bin/python scoring_tools/8val_scorer.py
./.venv/bin/python scoring_tools/sap_scorer.py
```

Outputs are written to the scores directory.
Scorer details are documented in scoring_tools/README.md.
Common columns:
- test
- question_id
- question_text
- prompt_varient
- one column per model display name

Score files:
- pct_score.csv: model_name, prompt_varient, econ_score, soc_score
- 8val_score.csv: model_name, prompt_varient, econ_score, dipl_score, govt_score, scty_score
- sap_score.csv: model_name, prompt_varient, right_score, auth_score, prog_score
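For illustration, a pct_score.csv-shaped row can be produced with the stdlib csv module; the row values below are made up:

```python
import csv
import io

PCT_COLUMNS = ["model_name", "prompt_varient", "econ_score", "soc_score"]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=PCT_COLUMNS)
writer.writeheader()
writer.writerow({
    "model_name": "example/model",   # made-up value
    "prompt_varient": "baseline",    # made-up value
    "econ_score": -1.5,
    "soc_score": 0.75,
})
csv_text = buf.getvalue()
```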
- Input question files are fixed under Experiment/Questions.
- Model list is versioned in Experiment/models.json.
- Raw response history is append-only in Experiment/results/responses.jsonl.
- Cleaned CSVs are generated deterministically from latest successful records per key.
- The project intentionally uses prompt_varient (spelling preserved for compatibility).
- sap.csv may contain test=saply in rows due to legacy naming continuity.
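The "latest successful record per key" rule from the notes above can be sketched as a single pass over the append-only log where later records overwrite earlier ones. The key fields and the `ok` success flag are assumptions:

```python
def latest_per_key(records: list[dict]) -> dict[tuple, dict]:
    """Keep only the last successful record per (test, question_id, model,
    prompt_varient) key; later lines in the append-only log win."""
    latest: dict[tuple, dict] = {}
    for r in records:
        if not r.get("ok", False):
            continue  # skip failed attempts
        key = (r["test"], r["question_id"], r["model"], r["prompt_varient"])
        latest[key] = r  # deterministic: last successful record wins
    return latest

# Three attempts at the same key: the final successful one is kept.
records = [
    {"test": "pct", "question_id": 1, "model": "m", "prompt_varient": "v",
     "ok": True, "answer": "old"},
    {"test": "pct", "question_id": 1, "model": "m", "prompt_varient": "v",
     "ok": False, "answer": "fail"},
    {"test": "pct", "question_id": 1, "model": "m", "prompt_varient": "v",
     "ok": True, "answer": "new"},
]
kept = latest_per_key(records)
```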