Why pyrrho? • Results • Roadmap • Usage • Docs • GitHub • 🤗 HuggingFace
**Query:** "Has the company achieved profitability?"
**Sources:** [1] "Posted its first profitable quarter, net income $4M." [2] "Recorded a quarterly loss of $12M, third consecutive losing quarter."

| ❌ Standard governance (constraint + sklearn cascade) | 🛡️ `pyrrho-modernbert-base-v1` |
|---|---|
| 5 LLM calls. 108 hand-crafted features. | 1 forward pass. No features. No LLM. |
| Verdict: TRUSTWORTHY (misses the conflict) | Verdict: DISPUTED (correct, P(D)=0.55) |
| Latency: ~1–2 s on CPU | Latency: ~30 ms on CPU (INT8 ONNX) |
| Requires: local LLM or paid cloud API | Requires: nothing; self-contained |
✅ A 150 MB CPU-friendly classifier that beats the prior pipeline by +7.43 accuracy points at a ~50× speedup, with no LLM dependency at inference.
> [!IMPORTANT]
> The model lives on 🤗 HuggingFace as `yafitzdev/pyrrho-modernbert-base-v1`. Drop it into any RAG pipeline that needs a governance gate.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1").eval()

query = "Has the company achieved profitability?"
contexts = [
    "Posted its first profitable quarter, net income $4M.",
    "Recorded a quarterly loss of $12M, third consecutive losing quarter.",
]

# Single input string: the question followed by the numbered sources
text = f"Question: {query}\n\nSources:\n" + "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, 1))

with torch.no_grad():
    logits = model(**tokenizer(text, return_tensors="pt", truncation=True)).logits[0]
    probs = torch.softmax(logits, dim=-1).numpy()

print({"ABSTAIN": probs[0], "DISPUTED": probs[1], "TRUSTWORTHY": probs[2]})
# → DISPUTED ≈ 0.55
```

For production CPU inference at ~30 ms/query, use the INT8 ONNX variant via `optimum`. Full usage in the model card.
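A minimal loading sketch, assuming the quantized graph ships as `model_quantized.onnx` (the file named later in this README); the model card is the authoritative reference:

```python
# Sketch: INT8 ONNX inference with optimum + onnxruntime.
# Assumption: the HF repo ships the quantized graph as model_quantized.onnx.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = ORTModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1",
    file_name="model_quantized.onnx",
)

# Same "Question: ...\n\nSources:\n[1] ..." format as the quickstart above
inputs = tokenizer(
    "Question: Has the company achieved profitability?\n\nSources:\n[1] ...",
    return_tensors="pt",
    truncation=True,
)
probs = torch.softmax(model(**inputs).logits[0], dim=-1)
```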
Most RAG governance is either (a) a black-box LLM call ("ask GPT-4 if these sources support the answer" – slow, expensive, non-deterministic) or (b) a feature-engineered classifier (~108 hand-crafted signals fed into sklearn – cheap but capped at ~79% accuracy on hard benchmarks). I built pyrrho to replace both with a single fine-tuned encoder that runs at ~30 ms on CPU and beats both approaches on a public benchmark.
The architecture call: a pure encoder (ModernBERT-base, 149M params), not a generative SLM, not an LLM. For 3-class classification over a constrained label space, encoder + INT8 ONNX is 50–100× faster on CPU than the same task with a generative model, and loses no accuracy when the labels are categorical and the input fits in 4K tokens (as RAG retrievals almost always do).
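A rough way to check the latency story on your own hardware (this times the plain PyTorch path; the ~30 ms figure refers to the faster INT8 ONNX variant):

```python
# Sketch: rough CPU latency probe for the encoder path. Numbers vary by machine.
import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1"
).eval()

inputs = tokenizer("Question: ...\n\nSources:\n[1] ...", return_tensors="pt")
with torch.no_grad():
    model(**inputs)  # warm-up pass
    start = time.perf_counter()
    for _ in range(20):
        model(**inputs)
print(f"{(time.perf_counter() - start) / 20 * 1000:.1f} ms / forward pass")
```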
It's the model that powers governance in fitz-sage (the RAG library) and is benchmarked against fitz-gov (2,920 adversarial cases, 5-fold CV). The three projects form a triangle: benchmark, models, library.
Yan Fitzner (LinkedIn, GitHub).
Release v1 – `pyrrho-modernbert-base-v1` vs the published fitz-sage v0.11 sklearn baseline. 3-seed mean ± std on the fitz-gov V5.1 eval hold-out (584 cases, stratified 20% from tier1_core):
| Metric | pyrrho v1 | sklearn baseline | Δ |
|---|---|---|---|
| Overall accuracy | 86.13 ± 0.86 % | 78.7 % | +7.43 |
| False-trustworthy rate (safety) | 5.27 ± 0.21 % | 5.7 % | −0.43 (safer) |
| Trustworthy recall | 79.38 ± 1.64 % | 70.0 % | +9.38 |
| Disputed recall | 94.81 ± 1.28 % | 86.1 % | +8.71 |
| Abstain recall | 92.94 ± 1.11 % | 86.5 % | +6.44 |
| CPU inference (estimated) | ~30 ms | ~500–2000 ms (5 LLM calls) | ~50× faster |
| External dependencies | none | requires LLM | self-contained |
Every margin is multiple standard deviations above seed noise – not a lucky-run artifact. Independently verifiable by running the published model against the published benchmark: `load_dataset("yafitzdev/fitz-gov")` + `AutoModelForSequenceClassification.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")`.
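A verification sketch along those lines (the split and field names `"test"`, `"text"`, `"label"` are assumptions; check the dataset card for the actual schema):

```python
# Sketch: score the published model on the published benchmark.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ds = load_dataset("yafitzdev/fitz-gov")
tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1"
).eval()

correct = total = 0
for case in ds["test"]:  # assumed split name
    enc = tokenizer(case["text"], return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred = int(model(**enc).logits.argmax(-1))
    correct += pred == case["label"]  # assumed field names
    total += 1
print(f"accuracy: {correct / total:.2%}")
```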
> [!NOTE]
> Known limitation: the model occasionally classifies multi-source-convergence cases (multiple authoritative sources agreeing within measurement tolerance) as DISPUTED – ~57% error on this fitz-gov subcategory (n=7). A fix via augmented training data is planned for v2. Documented in the model card.
No LLM dependency 🪶 → Model card
Standard governance pipelines route every query through 5+ LLM calls to extract constraint signals (contradiction detection, evidence sufficiency, causal attribution, …) before the classifier even fires.
`pyrrho` reads the raw query and contexts and emits a verdict in one forward pass. No cloud API spend, no GPU swap, no rate limits.
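A minimal integration sketch (`govern` is a hypothetical helper, not a published API; the input format mirrors the quickstart above):

```python
# Sketch: wiring pyrrho as an inline governance gate in a RAG pipeline.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["ABSTAIN", "DISPUTED", "TRUSTWORTHY"]
tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1"
).eval()

def govern(query: str, contexts: list[str]) -> str:
    """Return ABSTAIN / DISPUTED / TRUSTWORTHY for a query plus retrieved contexts."""
    text = f"Question: {query}\n\nSources:\n" + "\n".join(
        f"[{i}] {c}" for i, c in enumerate(contexts, 1)
    )
    with torch.no_grad():
        logits = model(**tokenizer(text, return_tensors="pt", truncation=True)).logits
    return LABELS[int(logits.argmax(-1))]

# Only hand retrievals to the generator when the gate allows it:
# if govern(query, contexts) == "TRUSTWORTHY": answer = generate(query, contexts)
```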
Beats the baseline by 7 points 📈 → Benchmark
86.13% accuracy vs 78.7% for the prior constraint+sklearn pipeline, on the same 2,920-case `fitz-gov` benchmark. The biggest gain is on trustworthy recall (+9.4 pts) – the bucket where hand-crafted features couldn't read positive evidence-agreement signals. Attention over raw text can.
Safer than the baseline 🛡️
False-trustworthy rate (the production safety metric: how often a confident hallucination path gets greenlit) is 5.27%, below the prior pipeline's 5.7%. Threshold calibration on top can push this lower at a small accuracy cost.
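A minimal sketch of such a gate (`TAU` is a hypothetical operating point, not a published value; tune it on a validation split):

```python
# Sketch: a confidence threshold on top of the classifier to push the
# false-trustworthy rate lower, at a small accuracy cost.
import numpy as np

LABELS = ["ABSTAIN", "DISPUTED", "TRUSTWORTHY"]
TAU = 0.80  # hypothetical threshold; tune on held-out data

def gated_verdict(probs: np.ndarray) -> str:
    verdict = LABELS[int(probs.argmax())]
    # Demand extra confidence before greenlighting TRUSTWORTHY; otherwise
    # downgrade to ABSTAIN (trading a little accuracy for safety).
    if verdict == "TRUSTWORTHY" and probs[2] < TAU:
        return "ABSTAIN"
    return verdict
```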
Production-grade CPU inference ⚡ → INT8 ONNX
~30 ms per query on commodity CPU after INT8 dynamic quantization. Ship the 150 MB `model_quantized.onnx` and serve governance inline – no GPU, no API, no LLM. Fits into latency-sensitive RAG paths that previously couldn't afford a governance step.
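For reference, a sketch of how such a file can be produced with `optimum`'s dynamic quantizer (the published repo already ships the quantized file, so this is only needed for your own variants; the `avx512_vnni` config is one assumed CPU target among several valid ones):

```python
# Sketch: INT8 dynamic quantization of the encoder with optimum.
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the PyTorch checkpoint to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1", export=True
)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False)  # dynamic INT8
quantizer.quantize(save_dir="onnx-int8", quantization_config=qconfig)
```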
Reproducible end-to-end 🔬
Training data, model weights, and the evaluation pipeline are all public. The `final_metrics.json` and `manifest.json` that ship alongside the weights pin: git commit, pip freeze, hardware, seed, training duration. Anyone can re-run the smoke test (`pytest tests/test_smoke.py`) against the published model.
Cross-linked with the triangle 🔗
Benchmark: `fitz-gov`. Models: `yafitzdev/pyrrho-*`. Production library: `fitz-sage`. Each reinforces the others – `fitz-gov` defines the eval contract, `pyrrho` ships the models, `fitz-sage` consumes them in production.
Two tracks. Track A ships into fitz-sage as the default governance backend. Track B is a HuggingFace portfolio of generative SLMs that prove the architecture generalizes – every one CPU-runnable (≤8 GB RAM at Q4).
Track A – production encoders (CPU-only):

| Model | Params | Status |
|---|---|---|
| `pyrrho-modernbert-base-v1` | 149M | ✅ live on HF |
| `pyrrho-modernbert-base-v2-long` | 149M | planned – long-context augmentation |
| `pyrrho-deberta-v3-large-v1` | 435M | planned – accuracy-mode variant |
Track B – generative SLMs (all CPU-runnable):

| Model | Params | Status |
|---|---|---|
| `pyrrho-qwen3.5-0.8b-v1` | 0.8B dense | planned – first SLM, validates the multi-source-convergence hypothesis |
| `pyrrho-qwen3.5-2b-v1` | 2B dense | planned |
| `pyrrho-lfm2.5-1.2b-v1` | 1.2B Liquid hybrid | planned – non-transformer architecture variant |
| `pyrrho-gemma-4-E2B-v1` | 2.3B dense | planned – cross-family transformer anchor |
| `pyrrho-qwen3.5-4b-v1` | 4B dense | planned |
| `pyrrho-gemma-4-E4B-v1` | 4.5B dense | planned |
| `pyrrho-phi-4-mini-v1` | 3.8B dense | planned – synthetic-data architecture probe |
| `pyrrho-lfm2-8b-a1b-v1` | 8B / 1B-active MoE | planned – CPU-runnable MoE |
Sidecar: `pyrrho-grounding-modernbert-base-v1` – answer-level grounding/hallucination detection. Companion to the governance head.
Full release roadmap and rationale in docs/PROJECT.md §10.
📦 Repository structure

```
pyrrho/
├── README.md           – you are here
├── CLAUDE.md           – project conventions (HANDOFF/LOG update rules, banned models, style)
├── LICENSE             – Apache 2.0
├── pyproject.toml      – Python deps; encoder / slm / hub / dev extras
├── docs/
│   ├── INDEX.md        – reading-order entry point for any new contributor
│   ├── HANDOFF.md      – current status snapshot (overwritten as state changes)
│   ├── LOG.md          – append-only project history
│   ├── PROJECT.md      – full vision, model picks, roadmap, training recipes
│   ├── METHODOLOGY.md  – end-to-end pipeline; release gates; W&B conventions
│   └── SETUP.md        – RTX 5090 / Blackwell / Windows specifics
├── src/pyrrho/         – Python package: data, metrics, training, manifest
├── scripts/            – all CLI scripts (train, eval, sweep, compare, push, …)
├── configs/
│   ├── encoder/        – ModernBERT-base, DeBERTa-v3-large (3-class + 4-class)
│   ├── slm/            – Qwen3.5-2B, LFM2.5-1.2B, LFM2-8B-A1B MoE
│   └── sweep_grids/    – hyperparameter sweep grids
├── tests/              – pytest suites (smoke regression guard)
├── data/               – (gitignored) processed splits from prepare_data.py
└── outputs/            – (gitignored) training runs, checkpoints, eval reports
```
📦 Train your own pyrrho variant from scratch
Reproduces the published numbers end-to-end. Requires an RTX 50-series GPU (see docs/SETUP.md for Blackwell / Windows / WSL2 specifics).
```bash
# 1. Install
git clone https://github.com/yafitzdev/pyrrho.git
cd pyrrho
python -m venv .venv && source .venv/bin/activate   # or .venv\Scripts\Activate.ps1 on Windows
pip install torch --index-url https://download.pytorch.org/whl/cu128   # Blackwell wheels
pip install -e ".[encoder,hub,dev]"

# 2. Prepare data – either pull from the published HF dataset,
#    or use a local clone of yafitzdev/fitz-gov.
python scripts/prepare_data.py --fitz-gov ../fitz-gov/data --output data/processed

# 3. Verify the environment (driver / CUDA / bitsandbytes / Blackwell)
python scripts/verify_env.py

# 4. Train release #1 (~80–500 s on RTX 5090 depending on contention)
python scripts/train_encoder.py --config configs/encoder/modernbert_base.yaml --no-wandb

# 5. Multi-seed validation – produces the published mean ± std
python scripts/run_seeds.py --seeds 42 1337 7

# 6. Full per-breakdown evaluation (per domain / difficulty / reasoning_type / subcategory)
python scripts/eval_report.py --checkpoint outputs/multi_seed/seed_42/checkpoint-XXX

# 7. Compare to the sklearn baseline OR an existing pyrrho release
python scripts/compare_runs.py baseline outputs/multi_seed/summary.json

# 8. Smoke-test regression guard (10 handcrafted cases)
pytest tests/test_smoke.py -v
```

Full methodology, release gates, and W&B conventions in docs/METHODOLOGY.md.
| Document | Purpose |
|---|---|
| `docs/INDEX.md` | Fresh session entry point. Reading order for any new contributor. |
| `docs/HANDOFF.md` | Current status snapshot – what's trained, headline metrics, next actions. |
| `docs/LOG.md` | Append-only project history (findings, decisions, experiments). |
| `docs/PROJECT.md` | Full plan: vision, model picks, training recipes, roadmap. |
| `docs/METHODOLOGY.md` | End-to-end model-development pipeline, release gates, W&B conventions. |
| `docs/SETUP.md` | RTX 5090 / Blackwell / Windows environment specifics. |
- `fitz-sage` – production RAG library that uses `pyrrho` for governance.
- `fitz-gov` – 2,920-case benchmark for RAG epistemic honesty; the dataset `pyrrho` is trained and evaluated against. Also on HF: `yafitzdev/fitz-gov`.
The three projects form a triangle: fitz-gov defines the eval contract, pyrrho produces the models, fitz-sage consumes them in production.
Apache 2.0 – see LICENSE.