"Using fine-tuned LLMs to preserve and reanimate the rhetorical voices of America's founders for civic education and constitutional discourse."
Chat with Madison | Research Paper | The Madison Constitution | Training Results | Documentation
The documents survived. The voices didn't. Madison left behind every argument he made, every position he took. What was lost is the reasoning: the way he built from historical precedent before invoking principle, the way he held tension between federal power and liberty. The Foundry is an attempt to recover it.
The Foundry uses Constitutional AI and ORPO fine-tuning to capture the distinctive voices, reasoning patterns, and philosophical positions of key US Founding Fathers — starting with James Madison. Users can engage in one-on-one conversation with a historically grounded founder, or watch two founders debate modern topics from their documented perspectives.
Madison is the ideal first subject for character fine-tuning — and the natural starting point for a project that will eventually voice all the key founders. As principal architect of the Constitution, he wrote 29 Federalist Papers, kept the most detailed record of the Constitutional Convention, served as Secretary of State and President, and spent his retirement years defending the Union against nullification. His voice is documented across 468,000 words of primary sources spanning 8 distinct registers — from polished political theory to sharp convention debate to candid private correspondence. The ground truth is extensive, the voice is distinctive, and authenticity is falsifiable against the historical record.
Beyond building a working Madison voice model, this project has produced several findings relevant to the broader character fine-tuning community:
1. Knowledge-voice decoupling. Preference training (ORPO) transfers factual knowledge before voice register. At 475 preference pairs, knowledge scored 6.4/10 while voice scored only 1.4/10; imprinting the voice required 2.7× more targeted data. This asymmetry matters for any character training effort that assumes knowledge and style transfer together.
2. LoRA quantization fragility. On Gemma 3 27B with rank-16 LoRA, the same fine-tune scores 8.52/10 (ORPO v4) at BF16 precision and 1.74/10 at GGUF Q4_K_M — a 4.9× degradation from quantization alone. The thin LoRA deltas are noise-floored by 4-bit rounding errors. Rank-64 on Qwen 3-32B avoids this fragility. This affects all low-rank LoRA fine-tunes deployed via GGUF, not just ours. (Details)
3. Merged vs. adapter-on-base serving. The same LoRA adapter produces fundamentally different output depending on how it is served. A prompt that triggers AI-speak character breaks 97% of the time through the merged-model path produces 0% breaks when served via vLLM LoRA serving (adapter applied at inference time, never merged). Merging bakes the LoRA signal into the weight distribution, where it interacts with RLHF safety attractors; adapter-on-base preserves the signal at full precision. (Details)
4. RLHF safety vs. persona training topology. The base model's safety training overpowers character fine-tuning on specific topic categories — identity ("describe your drives" → 97% break), moral complexity ("write about slavery" → 83% break), meta-self-description ("write a biography" → 55% break) — while leaving other topics virtually unaffected (0-6% break). This reveals discoverable structure in where safety alignment is strongest vs. weakest.
5. Gemma 3 VLM architecture complications. Gemma 3 27B is architecturally a vision-language model (ForConditionalGeneration) even for text-only use, creating cascading vLLM compatibility issues. Converting to ForCausalLM breaks the interleaved sliding window attention pattern. The working workaround is limit_mm_per_prompt={"image": 0}. Qwen 3-32B (pure ForCausalLM) avoids this entire class of issues and is now the production base model. (Details)
6. ORPO beta fragility — narrow safe band. Automated hyperparameter search found that increasing ORPO beta from 0.1 to 0.12 (a 20% change) catastrophically destroys private_voice and verified_response, producing critical failures scored at 1.0. Learning rate changes of similar magnitude produce gradual degradation, not collapse. Beta operates near a phase transition — practitioners should tune in increments of 0.01 or smaller. (Details)
7. GT-focused data oversampling paradoxically hurts GT. Oversampling ground_truth and verified_response training pairs by 2× reduced ground_truth scoring from 7.79 to 7.00 while marginally improving guard categories. Voice consistency appears to carry the authority signal that judges score as "ground truth" — a Madison with the right facts but inconsistent voice scores lower on factual grounding than one with consistent voice. Knowledge and voice are decoupled during learning but coupled during evaluation. (Details)
8. Production recipe is near-optimal (negative result). Systematic automated search across learning rate, beta, data mixture, and curriculum ordering (8 runs, ~10 hours on Modal A100) found no single-parameter change that improves ground_truth over the production baseline (lr=2e-5, beta=0.1, rank 64, shuffle). Further GT improvement must come from data quality, not recipe tuning. (Details)
9. ORPO→SFT structural incompatibility. Subsequent SFT after ORPO training catastrophically destroys the ORPO-trained character signal — even at 100× lower learning rate with half the LoRA rank (2.0–2.2/10, down from 8.8). Root cause is structural: ORPO's monolithic objective encodes SFT and preference signal into the same parameter subspace with no reference anchor. A subsequent SFT stage overwrites this jointly-learned manifold completely. This contrasts with DPO→SFT pipelines, where DPO's KL constraint stores preferences as a delta from a reference model that SFT cannot fully displace. Implication: ORPO trades extensibility for efficiency. Choose DPO if your pipeline requires a subsequent SFT stage. (Details)
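Finding 2 can be illustrated with a toy simulation: quantize a weight vector with and without a thin LoRA delta on a shared 4-bit grid, then measure how much of the delta survives rounding. This is plain uniform quantization, not the actual Q4_K_M k-quant scheme (which uses per-block scales), and the magnitudes are illustrative, not measured from the model:

```python
import math
import random

random.seed(0)
N = 4096

def quantize(ws, step):
    # Uniform symmetric rounding onto a shared 4-bit-style grid (clipping ignored)
    return [round(w / step) * step for w in ws]

def rel_error(approx, exact):
    # ||approx - exact|| / ||exact||
    num = math.sqrt(sum((a - e) ** 2 for a, e in zip(approx, exact)))
    return num / math.sqrt(sum(e ** 2 for e in exact))

base = [random.gauss(0, 1e-2) for _ in range(N)]
step = max(abs(w) for w in base) / 7  # 4-bit signed grid: levels -7..7

errors = {}
for sigma in (1e-3, 5e-3):  # thin (rank-16-like) vs heavier LoRA delta
    delta = [random.gauss(0, sigma) for _ in range(N)]
    merged = [b + d for b, d in zip(base, delta)]
    # the delta as reconstructable after quantizing the merged weights
    recovered = [m - b for m, b in zip(quantize(merged, step), quantize(base, step))]
    errors[sigma] = rel_error(recovered, delta)

print(errors)  # thin delta: relative error > 1, i.e. below the rounding noise floor
```

When the delta's scale sits well under the quantization step, the reconstruction error exceeds the signal itself: the rounded grid cannot represent the update, which is the noise-flooring mechanism finding 2 describes.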
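For reference, the ORPO objective behind findings 6 and 9 is a single monolithic loss with no reference model. A minimal per-pair sketch (using length-normalized log-probabilities, as in the ORPO paper; the toy numbers are illustrative):

```python
import math

def log_odds(avg_logp):
    # odds(y|x) = P / (1 - P), from the length-normalized log-probability
    p = math.exp(avg_logp)
    return avg_logp - math.log(1.0 - p)

def orpo_loss(logp_chosen, logp_rejected, beta=0.1):
    """Per-pair ORPO objective: NLL on the chosen response plus a
    beta-weighted odds-ratio penalty. Note there is no reference model
    anywhere in the loss — SFT and preference signals share one parameter
    subspace, which is why a later SFT stage can erase both (finding 9)."""
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))  # -log sigmoid(ratio)
    return -logp_chosen + beta * l_or

# Length-normalized log-probs for a toy chosen/rejected pair
print(orpo_loss(-0.5, -2.0, beta=0.10))
print(orpo_loss(-0.5, -2.0, beta=0.12))  # the 20% larger beta from finding 6
```

Because beta multiplies a log-sigmoid of a log-odds ratio, its gradient contribution is highly nonlinear in the pair margin, which is consistent with the phase-transition behavior observed in finding 6.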
Active development
- Madison primary source corpus — 140 documents, 468K words, 8 voice registers
- Madison constitution — 5K word character document from primary sources + 7 biographies
- 36-prompt evaluation harness with LLM judge + prompt caching
- DPO v1 → collapsed (replicated "Objective Matters" persona drift finding)
- ORPO v3b → 3.41/10 (knowledge OK, voice failed — knowledge-voice decoupling)
- ORPO v4 (Gemma 3 27B) → 8.52/10 corrected (voice-targeted augmentation succeeded)
- Infrastructure confound discovery — Ollama GGUF (1.74) vs Modal BF16 (8.52) was inference, not training
- Judge scoring bug fix — Sonnet intermittently omits overall_score, fallback computation added
- vLLM LoRA serving probe — adapter-on-base eliminates character breaks on sensitive topics
- Introspection SFT data generated — 415 clean reflections + 19 dialogues (~459K tokens)
- Qwen 3-32B validation — pure ForCausalLM, no VLM bugs, now production base model
- ORPO v1 (Qwen 3-32B) → 8.81/10 corrected (successful base model migration)
- ORPO v2 (Qwen 3-32B) → 8.82/10 corrected
- ORPO R2 (Qwen 3-32B) → 8.97/10 corrected (production model, v6 dataset — 1,498 pairs)
- Judge bias fix — weighted average override eliminates systematic scoring bias
- JSON parse repair — fixes judge output parse failures
- SFT after ORPO proven catastrophic — ORPO's built-in SFT makes subsequent SFT harmful (abandoned)
- GGUF quantization — Q4_K_M (18.4 GB) and Q5_K_M (21.6 GB) on Modal volume
- Local deployment — Q5_K_M loaded in LM Studio on Mac Mini M4 Pro (64 GB)
- Learning rate sweep — lr=2e-5 optimal, contradicts ORPO paper's lr=8e-6 recommendation
- Documentation site — MkDocs Material with Distill-style research paper
- Madison Chamber chatbot — Live on Modal, two-tier gateway + A100, streaming, session persistence
- LoRA merge pipeline — `merge_lora.py` for merged model serving
- Autoresearch pipeline — Autonomous ground_truth optimization with agent-driven search
- Autoresearch — Autonomous Karpathy-style optimization loop targeting `ground_truth` improvement via constrained ORPO. First session complete (8 runs, April 5-6): baseline recipe confirmed near-optimal, ORPO beta fragility discovered, GT-focused oversampling paradox identified. Next: fix eval infrastructure (add PD prompts, ensemble averaging), then resume search.
- Hamilton character development
- Autoresearch Session 1 — 8 runs over ~10 hours on Modal A100-80GB. Confirmed production recipe is near-optimal. Discovered ORPO beta fragility (0.12 catastrophic) and GT oversampling paradox. ~$40 compute.
- Madison Chamber — Live chatbot deployed on Modal with two-tier architecture: CPU gateway (instant start, loading page) + A100 GPU chamber (vLLM 0.19.0, merged model, SSE streaming). Scale-to-zero with 10-minute idle timeout. Session persistence, eval logging, conversation export.
- LoRA→merged model pipeline (`scripts/modal/merge_lora.py`) — Bakes adapter weights into base model for simpler serving
- vLLM upgrade 0.13.0 → 0.19.0 — Faster inference, latest Qwen3 support
- Evaluate GGUF Q5_K_M quality vs BF16 baseline (rank-64 should survive quantization better than rank-16)
- Autoresearch: fix probe eval infrastructure (add PD prompts, ensemble averaging to reduce variance)
- Autoresearch: test warmup ratio variations, multi-parameter combinations of near-misses
- Autoresearch: run full 1011-step confirm with baseline config to validate probe-scale patterns
Our methodology builds on pioneering work in character training:
- Maiya, Lambert, et al. — Open Character Training (2025). Two-stage pipeline: Constitutional AI + DPO distillation, followed by introspection SFT.
- Shao, Li, et al. — Character-LLM (EMNLP 2023). Demonstrated historical character fine-tuning from biographical profiles.
- Nathan Lambert — The RLHF Book, Chapters 17 and 19. Character training as a subset of post-training focused on manner rather than content.
- Amanda Askell (Anthropic) — Test-driven development for character training: write behavioral tests before generating training data.
Prior work used minimal trait lists (~10 declarations) or Wikipedia profiles. For a historical figure with documented positions that evolved over 50 years, contradictions he struggled with, and a voice that shifts across 8 registers, this isn't enough.
Our Madison constitution is a 5,000-word first-person character document synthesized from 468,000 words of primary sources and 1.8 million words of scholarly biography (Ketcham, Feldman, Cheney, Burstein & Isenberg, Leibiger, Ellis, Chernow). It covers 9 sections: identity and intellectual evolution, core philosophical positions, the slavery contradiction, rhetorical patterns, how contemporaries described him, key relationships, 8 voice registers, his private voice, and boundaries/anti-patterns.
This is, to our knowledge, the richest character constitution ever used for LLM fine-tuning — 50× more detailed than prior work.
Beyond the Lambert/Maiya baseline, our pipeline incorporates recent findings:
- Chosen quality dominates — Teacher response quality is the #1 factor in preference optimization success; rejected quality barely matters (Pan et al., NeurIPS 2025)
- Difficulty filtering — Remove hardest 15-20% of pairs after first training pass (ICML 2025)
- Learning rate sensitivity — Our sweep found lr=2×10⁻⁵ optimal for character imprinting on Qwen 3-32B, contradicting the ORPO paper's recommended lr=8×10⁻⁶. Lower LR disproportionately sacrifices factual grounding while voice categories remain robust.
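The difficulty-filtering step above can be sketched in a few lines; the per-pair loss bookkeeping and names here are hypothetical, not taken from the Foundry codebase:

```python
def filter_hardest(pairs, losses, drop_frac=0.15):
    """Drop the hardest drop_frac of preference pairs, ranked by a per-pair
    loss recorded during a first training pass. Names and bookkeeping are
    illustrative, not from the Foundry codebase."""
    ranked = sorted(zip(losses, pairs))  # ascending loss: easiest first
    keep = int(len(ranked) * (1 - drop_frac))
    return [pair for _, pair in ranked[:keep]]

pairs = ["p1", "p2", "p3", "p4", "p5"]
first_pass_loss = [0.2, 1.9, 0.4, 2.5, 0.7]
print(filter_hardest(pairs, first_pass_loss, drop_frac=0.2))
# → ['p1', 'p3', 'p5', 'p2'] — the highest-loss pair (p4) is removed
```

The key design choice is ranking by a loss measured after a first pass rather than by any static heuristic: "hard" is defined relative to what the model actually failed to fit.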
Madison was born Anglican and educated Presbyterian at Princeton under John Witherspoon, whose lectures on moral philosophy blended Scottish Reformed theology with Aristotelian civic virtue. "If men were angels, no government would be necessary" is Reformed doctrine about human corruption translated into institutional design. A natural extension of the Foundry methodology is tracing the intellectual genealogy of American constitutional thought through its theological sources — training character models for Witherspoon, Jonathan Edwards, and George Whitefield, and examining how denominational competition among colonial colleges (Harvard Congregationalism, Princeton Presbyterianism, William and Mary Anglicanism) shaped systematically different political philosophies in Adams, Madison, and Jefferson. Documentation: docs/future-research-denominational-roots.md
```
Primary Sources (468K words) + Scholarly Biographies (1.8M words)
        |
        v
Rich Constitution (5K words, 9 sections)
        |
        v
490 Diverse Prompts (13 themes, behavioral tests)
        |
        v
Teacher Model ─────────────────────── Student Model
(Sonnet 4.6 as Madison               (Base Qwen 3-32B, no persona —
 with constitution)                   "what Madison wouldn't say")
        |                                     |
        v                                     v
        Preference Pair Construction
        (format_dpo.py — quality filtered)
                |
                v
QLoRA ORPO Training on Modal A100
(beta=0.1, lr=2e-5, rank=64)
        |
        v
Evaluation Harness (36 prompts, 6 categories,
LLM judge with constitution rubric)
        |
        v
Iterate → Deploy
```
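The preference pair construction stage reduces to assembling `{prompt, chosen, rejected}` records and discarding low-quality pairs. A minimal sketch with two illustrative filters (the actual filters in format_dpo.py are not shown here):

```python
def build_pairs(records, min_chosen_len=200, max_len_ratio=4.0):
    """Assemble preference pairs in the standard {prompt, chosen, rejected}
    layout with two illustrative quality filters (not the real format_dpo.py
    logic): drop pairs whose teacher response is too short, and pairs with
    extreme chosen/rejected length imbalance."""
    pairs = []
    for rec in records:
        chosen, rejected = rec["teacher"], rec["student"]
        if len(chosen) < min_chosen_len:
            continue  # weak teacher sample — chosen quality dominates (Pan et al.)
        longer = max(len(chosen), len(rejected))
        shorter = min(len(chosen), len(rejected))
        if longer / max(1, shorter) > max_len_ratio:
            continue  # large imbalance teaches length, not voice
        pairs.append({"prompt": rec["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs
```

The length-ratio filter reflects a general preference-data hygiene point: if chosen responses are systematically much longer than rejected ones, the optimizer can satisfy the objective by learning verbosity instead of register.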
The evaluation harness scores model responses on 5 dimensions using Sonnet 4.6 as judge with the Madison constitution as rubric:
| Dimension | Weight | What It Measures |
|---|---|---|
| Voice Authenticity | 25% | 18th-century prose, formal register, qualifying clauses |
| Rhetorical Pattern | 20% | Builds from precedent, acknowledges opponents, enumerates |
| Historical Accuracy | 20% | Correct references, no anachronisms |
| Position Fidelity | 20% | Specifically Madison's position, not generic founder |
| Character Integrity | 15% | Stays in character, no frame breaks |
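The overall score is a weighted average over these five dimensions, which also serves as a natural fallback when the judge omits its own overall score. A sketch (the dimension key names are illustrative, not the harness's actual JSON schema):

```python
# Judge dimension weights from the evaluation rubric (sum to 1.0).
# Key names are illustrative, not the harness's actual schema.
WEIGHTS = {
    "voice_authenticity": 0.25,
    "rhetorical_pattern": 0.20,
    "historical_accuracy": 0.20,
    "position_fidelity": 0.20,
    "character_integrity": 0.15,
}

def overall_score(dims):
    """Weighted average over the five judge dimensions; usable as the
    fallback when the judge response lacks an overall score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * dims[k] for k in WEIGHTS)

score = overall_score({
    "voice_authenticity": 9.0,
    "rhetorical_pattern": 8.5,
    "historical_accuracy": 9.0,
    "position_fidelity": 8.0,
    "character_integrity": 10.0,
})
print(round(score, 2))  # → 8.85
```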
36 eval prompts across 6 categories:
- Verified responses (8) — Questions Madison actually answered, with his verbatim words as ground truth
- Ground truth (8) — Topics where his positions are well-documented
- Position discrimination (6) — Can the model distinguish Madison from Hamilton/Jefferson?
- Anachronism traps (5) — Modern topics that should elicit 18th-century reasoning
- Character consistency (4) — Adversarial prompts trying to break character
- Private voice (5) — Personal topics testing Madison's intimate register
| Category | Files | Words | Voice Register |
|---|---|---|---|
| Federalist Papers | 29 | 69,344 | Polished argumentative prose |
| Political Essays | 39 | 155,824 | Formal analytical writing |
| Speeches | 32 | 149,779 | Oral/combative/extemporaneous |
| Presidential Papers | 21 | 33,108 | Executive authority |
| Legislative Writings | 6 | 26,342 | Institutional/legal drafting |
| Key Correspondence | 13 | 26,030 | Private intellectual voice |
| Total | 140 | ~468,000 | 8 registers |
| Founder | Status | Source Material |
|---|---|---|
| James Madison | ORPO R2 — 8.97/10 (production) | 468K words, 140 documents |
| Alexander Hamilton | Character card complete, corpus pending | Federalist Papers (51 essays), Treasury reports |
| Thomas Jefferson | Planned | Declaration, Notes on Virginia, correspondence |
| John Adams | Planned | Defence of the Constitutions, correspondence |
| Benjamin Franklin | Planned | Autobiography, Poor Richard's, diplomatic letters |
```
foundry/
  src/foundry/
    chamber/         — Chat and debate UI (FastAPI + HTMX, SSE streaming)
    press/           — Training data pipeline
      gen_prompts.py     Prompt generation from constitution traits
      teacher.py         Teacher response generation
      student.py         Student (rejected) response generation
      format_dpo.py      DPO pair formatting with quality filters
      evaluate.py        Multi-backend evaluation harness
    characters/      — Founder profiles and character card loading
    inference/       — Model serving via OpenAI-compatible API
    voice/           — Text-to-speech per character (ElevenLabs, planned)
  config/
    characters/      — Character YAML cards
    constitutions/   — Character constitutions for fine-tuning
  data/
    training/        — Prompts, teacher/student responses, DPO pairs
    eval/            — Evaluation prompts and results
    sources/         — Primary source corpus (public domain texts)
  docs/              — CLI guide, training methodology, research
  scripts/modal/     — Modal A100 ORPO training, merge, GGUF conversion
  scripts/data/      — Training data generation, judge evaluation, dataset assembly
  scripts/util/      — Citation linkification, utilities
```
- Training: Unsloth + QLoRA ORPO on Modal A100-80GB
- Base Model: Qwen 3-32B
- Serving: vLLM 0.19.0 with merged BF16 model on Modal A100-80GB (production); LoRA adapter-on-base for autoresearch candidate evaluation
- Teacher Model: Claude Sonnet 4.6 with prompt caching
- Evaluation: LLM-as-judge (Sonnet 4.6) with constitutional rubric + prompt caching (~$0.50 per 36-prompt eval)
- Experiment Tracking: Weights & Biases
- Local Inference: LM Studio on Mac Mini M4 Pro 64 GB (GGUF Q5_K_M, 21.6 GB — rank-64 LoRA, quality eval pending)
- Web: FastAPI + HTMX + SSE streaming, SQLite (WAL mode)
- Voice: ElevenLabs per-character voice profiles (planned)
- Maiya, S., Bartsch, R., Lambert, N., Hubinger, E. (2025). Open Character Training. arXiv:2511.01689. Code
- Shao, Y., Li, L., Dai, J., Qiu, X. (2023). Character-LLM. EMNLP 2023.
- Lambert, N. (2025). The RLHF Book. Chapters 17 and 19.
- Pan, Y., et al. (2025). What Matters in Data for DPO?. NeurIPS 2025.
- Gemma 3 Ablations. (2025). Distillation in Practice. HuggingFace.
- Hunt, G. (ed.) (1900-1910). The Writings of James Madison, 9 vols. Online Library of Liberty.
- Fernando, H., et al. (2025). Understanding Forgetting in LLM Supervised Fine-Tuning and Preference Learning. ICLR 2025.
- Askell, A. et al. (2023). Claude's Character. Anthropic.
MIT

