Why pyrrho? • Results • Roadmap • Usage • Docs • GitHub • 🤗 HuggingFace
**Query:** "Has the company achieved profitability?"
**Sources:** [1] "Posted its first profitable quarter, net income $4M." [2] "Recorded a quarterly loss of $12M, third consecutive losing quarter."

| ❌ Standard governance (constraint + sklearn cascade) | 🛡️ `pyrrho-modernbert-base-v1` |
|---|---|
| 5 LLM calls. 108 hand-crafted features. | 1 forward pass. No features. No LLM. |
| Verdict: TRUSTWORTHY (misses the conflict) | Verdict: DISPUTED (correct, P(D)=0.55) |
| Latency: ~1–2 s on CPU | Latency: ~30 ms on CPU (INT8 ONNX) |
| Requires: local LLM or paid cloud API | Requires: nothing; self-contained |
✅ A 150 MB CPU-friendly classifier that beats the prior pipeline by +7.43 accuracy points at a ~50× speedup, with no LLM dependency at inference.
> [!IMPORTANT]
> The model lives on 🤗 HuggingFace as `yafitzdev/pyrrho-modernbert-base-v1`. Drop it into any RAG pipeline that needs a governance gate.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1").eval()

query = "Has the company achieved profitability?"
contexts = [
    "Posted its first profitable quarter, net income $4M.",
    "Recorded a quarterly loss of $12M, third consecutive losing quarter.",
]

# Single input string: the question followed by the numbered sources
text = f"Question: {query}\n\nSources:\n" + "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, 1))

with torch.no_grad():
    logits = model(**tokenizer(text, return_tensors="pt", truncation=True)).logits[0]
    probs = torch.softmax(logits, dim=-1).numpy()

print({"ABSTAIN": probs[0], "DISPUTED": probs[1], "TRUSTWORTHY": probs[2]})
# → DISPUTED ≈ 0.55
```

For production CPU inference at ~30 ms/query, use the INT8 ONNX variant via `optimum`. Full usage in the model card.
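A minimal loading sketch, assuming the quantized graph ships as `model_quantized.onnx` (the file named later in this README); the model card is the authoritative reference:

```python
# Sketch: INT8 ONNX inference with optimum + onnxruntime.
# Assumption: the HF repo ships the quantized graph as model_quantized.onnx.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = ORTModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1",
    file_name="model_quantized.onnx",
)

# Same "Question: ...\n\nSources:\n[1] ..." format as the quickstart above
inputs = tokenizer(
    "Question: Has the company achieved profitability?\n\nSources:\n[1] ...",
    return_tensors="pt",
    truncation=True,
)
probs = torch.softmax(model(**inputs).logits[0], dim=-1)
```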
Most RAG governance is either (a) a black-box LLM call ("ask GPT-4 if these sources support the answer" – slow, expensive, non-deterministic) or (b) a feature-engineered classifier (~108 hand-crafted signals fed into sklearn – cheap but capped at ~79% accuracy on hard benchmarks). I built pyrrho to replace both with a single fine-tuned encoder that runs at ~30 ms on CPU and beats both approaches on a public benchmark.
The architecture call: a pure encoder (ModernBERT-base, 149M params), not a generative SLM, not an LLM. For 3-class classification over a constrained label space, encoder + INT8 ONNX is 50–100× faster on CPU than the same task with a generative model, and loses no accuracy when the labels are categorical and the input fits in 4K tokens (as RAG retrievals almost always do).
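A rough way to check the latency story on your own hardware (this times the plain PyTorch path; the ~30 ms figure refers to the faster INT8 ONNX variant):

```python
# Sketch: rough CPU latency probe for the encoder path. Numbers vary by machine.
import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1"
).eval()

inputs = tokenizer("Question: ...\n\nSources:\n[1] ...", return_tensors="pt")
with torch.no_grad():
    model(**inputs)  # warm-up pass
    start = time.perf_counter()
    for _ in range(20):
        model(**inputs)
print(f"{(time.perf_counter() - start) / 20 * 1000:.1f} ms / forward pass")
```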
It's the model that powers governance in fitz-sage (the RAG library) and is benchmarked against fitz-gov (2,920 adversarial cases, 5-fold CV). The three projects form a triangle: benchmark, models, library.
Yan Fitzner (LinkedIn, GitHub).
Release v1 – `pyrrho-modernbert-base-v1` vs the published fitz-sage v0.11 sklearn baseline. 3-seed mean ± std on the fitz-gov V5.1 eval hold-out (584 cases, stratified 20% from tier1_core):
| Metric | pyrrho v1 | sklearn baseline | Δ |
|---|---|---|---|
| Overall accuracy | 86.13 ± 0.86 % | 78.7 % | +7.43 |
| False-trustworthy rate (safety) | 5.27 ± 0.21 % | 5.7 % | −0.43 (safer) |
| Trustworthy recall | 79.38 ± 1.64 % | 70.0 % | +9.38 |
| Disputed recall | 94.81 ± 1.28 % | 86.1 % | +8.71 |
| Abstain recall | 92.94 ± 1.11 % | 86.5 % | +6.44 |
| CPU inference (estimated) | ~30 ms | ~500–2000 ms (5 LLM calls) | ~50× faster |
| External dependencies | none | requires LLM | self-contained |
Every margin is multiple standard deviations above seed noise – not a lucky-run artifact. Independently verifiable by running the published model against the published benchmark: `load_dataset("yafitzdev/fitz-gov")` + `AutoModelForSequenceClassification.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")`.
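A verification sketch along those lines (the split and field names `"test"`, `"text"`, `"label"` are assumptions; check the dataset card for the actual schema):

```python
# Sketch: score the published model on the published benchmark.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ds = load_dataset("yafitzdev/fitz-gov")
tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1"
).eval()

correct = total = 0
for case in ds["test"]:  # assumed split name
    enc = tokenizer(case["text"], return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred = int(model(**enc).logits.argmax(-1))
    correct += pred == case["label"]  # assumed field names
    total += 1
print(f"accuracy: {correct / total:.2%}")
```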
> [!NOTE]
> Known limitation: the model occasionally classifies multi-source-convergence cases (multiple authoritative sources agreeing within measurement tolerance) as DISPUTED – ~57% error on this fitz-gov subcategory (n=7). A fix via augmented training data is planned for v2. Documented in the model card.
No LLM dependency 🪶 → Model card
Standard governance pipelines route every query through 5+ LLM calls to extract constraint signals (contradiction detection, evidence sufficiency, causal attribution, …) before the classifier even fires.
`pyrrho` reads the raw query and contexts and emits a verdict in one forward pass. No cloud API spend, no GPU swap, no rate limits.
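A minimal integration sketch (`govern` is a hypothetical helper, not a published API; the input format mirrors the quickstart above):

```python
# Sketch: wiring pyrrho as an inline governance gate in a RAG pipeline.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["ABSTAIN", "DISPUTED", "TRUSTWORTHY"]
tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1"
).eval()

def govern(query: str, contexts: list[str]) -> str:
    """Return ABSTAIN / DISPUTED / TRUSTWORTHY for a query plus retrieved contexts."""
    text = f"Question: {query}\n\nSources:\n" + "\n".join(
        f"[{i}] {c}" for i, c in enumerate(contexts, 1)
    )
    with torch.no_grad():
        logits = model(**tokenizer(text, return_tensors="pt", truncation=True)).logits
    return LABELS[int(logits.argmax(-1))]

# Only hand retrievals to the generator when the gate allows it:
# if govern(query, contexts) == "TRUSTWORTHY": answer = generate(query, contexts)
```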
Beats the baseline by 7 points 📈 → Benchmark
86.13% accuracy vs 78.7% for the prior constraint+sklearn pipeline, on the same 2,920-case `fitz-gov` benchmark. The biggest gain is on trustworthy recall (+9.4 pts) – the bucket where hand-crafted features couldn't read positive evidence-agreement signals. Attention over raw text can.
Safer than the baseline 🛡️
False-trustworthy rate (the production safety metric: how often a confident hallucination path gets greenlit) is 5.27%, below the prior pipeline's 5.7%. Threshold calibration on top can push this lower at a small accuracy cost.
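A minimal sketch of such a gate (`TAU` is a hypothetical operating point, not a published value; tune it on a validation split):

```python
# Sketch: a confidence threshold on top of the classifier to push the
# false-trustworthy rate lower, at a small accuracy cost.
import numpy as np

LABELS = ["ABSTAIN", "DISPUTED", "TRUSTWORTHY"]
TAU = 0.80  # hypothetical threshold; tune on held-out data

def gated_verdict(probs: np.ndarray) -> str:
    verdict = LABELS[int(probs.argmax())]
    # Demand extra confidence before greenlighting TRUSTWORTHY; otherwise
    # downgrade to ABSTAIN (trading a little accuracy for safety).
    if verdict == "TRUSTWORTHY" and probs[2] < TAU:
        return "ABSTAIN"
    return verdict
```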
Production-grade CPU inference ⚡ → INT8 ONNX
~30 ms per query on commodity CPU after INT8 dynamic quantization. Ship the 150 MB `model_quantized.onnx` and serve governance inline – no GPU, no API, no LLM. Fits into latency-sensitive RAG paths that previously couldn't afford a governance step.
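For reference, a sketch of how such a file can be produced with `optimum`'s dynamic quantizer (the published repo already ships the quantized file, so this is only needed for your own variants; the `avx512_vnni` config is one assumed CPU target among several valid ones):

```python
# Sketch: INT8 dynamic quantization of the encoder with optimum.
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the PyTorch checkpoint to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1", export=True
)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False)  # dynamic INT8
quantizer.quantize(save_dir="onnx-int8", quantization_config=qconfig)
```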
Reproducible end-to-end 🔬
Training data, model weights, and the evaluation pipeline are all public. The `final_metrics.json` and `manifest.json` that ship alongside the weights pin: git commit, pip freeze, hardware, seed, training duration. Anyone can re-run the smoke test (`pytest tests/test_smoke.py`) against the published model.
Cross-linked with the triangle 🔗
Benchmark: `fitz-gov`. Models: `yafitzdev/pyrrho-*`. Production library: `fitz-sage`. Each reinforces the others – `fitz-gov` defines the eval contract, `pyrrho` ships the models, `fitz-sage` consumes them in production.
Two tracks. Track A ships into fitz-sage as the default governance backend. Track B is a HuggingFace portfolio of generative SLMs that prove the architecture generalizes – every one CPU-runnable (≤8 GB RAM at Q4).
Track A – production encoders (CPU-only):

| Model | Params | Status |
|---|---|---|
| `pyrrho-modernbert-base-v1` | 149M | ✅ live on HF |
| `pyrrho-modernbert-base-v2-long` | 149M | planned – long-context augmentation |
| `pyrrho-deberta-v3-large-v1` | 435M | planned – accuracy-mode variant |
Track B – generative SLMs (all CPU-runnable):

| Model | Params | Status |
|---|---|---|
| `pyrrho-qwen3.5-0.8b-v1` | 0.8B dense | planned – first SLM, validates the multi-source-convergence hypothesis |
| `pyrrho-qwen3.5-2b-v1` | 2B dense | planned |
| `pyrrho-lfm2.5-1.2b-v1` | 1.2B Liquid hybrid | planned – non-transformer architecture variant |
| `pyrrho-gemma-4-E2B-v1` | 2.3B dense | planned – cross-family transformer anchor |
| `pyrrho-qwen3.5-4b-v1` | 4B dense | planned |
| `pyrrho-gemma-4-E4B-v1` | 4.5B dense | planned |
| `pyrrho-phi-4-mini-v1` | 3.8B dense | planned – synthetic-data architecture probe |
| `pyrrho-lfm2-8b-a1b-v1` | 8B / 1B-active MoE | planned – CPU-runnable MoE |
Sidecar: `pyrrho-grounding-modernbert-base-v1` – answer-level grounding/hallucination detection. Companion to the governance head.
Full release roadmap and rationale in docs/PROJECT.md §10.
📦 Repository structure

```
pyrrho/
├── README.md           – you are here
├── CLAUDE.md           – project conventions (HANDOFF/LOG update rules, banned models, style)
├── LICENSE             – Apache 2.0
├── pyproject.toml      – Python deps; encoder / slm / hub / dev extras
├── docs/
│   ├── INDEX.md        – reading-order entry point for any new contributor
│   ├── HANDOFF.md      – current status snapshot (overwritten as state changes)
│   ├── LOG.md          – append-only project history
│   ├── PROJECT.md      – full vision, model picks, roadmap, training recipes
│   ├── METHODOLOGY.md  – end-to-end pipeline; release gates; W&B conventions
│   └── SETUP.md        – RTX 5090 / Blackwell / Windows specifics
├── src/pyrrho/         – Python package: data, metrics, training, manifest
├── scripts/            – all CLI scripts (train, eval, sweep, compare, push, …)
├── configs/
│   ├── encoder/        – ModernBERT-base, DeBERTa-v3-large (3-class + 4-class)
│   ├── slm/            – Qwen3.5-2B, LFM2.5-1.2B, LFM2-8B-A1B MoE
│   └── sweep_grids/    – hyperparameter sweep grids
├── tests/              – pytest suites (smoke regression guard)
├── data/               – (gitignored) processed splits from prepare_data.py
└── outputs/            – (gitignored) training runs, checkpoints, eval reports
```
📦 Train your own pyrrho variant from scratch
Reproduces the published numbers end-to-end. Requires an RTX 50-series GPU (see docs/SETUP.md for Blackwell / Windows / WSL2 specifics).
```bash
# 1. Install
git clone https://github.com/yafitzdev/pyrrho.git
cd pyrrho
python -m venv .venv && source .venv/bin/activate   # or .venv\Scripts\Activate.ps1 on Windows
pip install torch --index-url https://download.pytorch.org/whl/cu128   # Blackwell wheels
pip install -e ".[encoder,hub,dev]"

# 2. Prepare data – either pull from the published HF dataset,
#    or use a local clone of yafitzdev/fitz-gov.
python scripts/prepare_data.py --fitz-gov ../fitz-gov/data --output data/processed

# 3. Verify the environment (driver / CUDA / bitsandbytes / Blackwell)
python scripts/verify_env.py

# 4. Train release #1 (~80–500 s on RTX 5090 depending on contention)
python scripts/train_encoder.py --config configs/encoder/modernbert_base.yaml --no-wandb

# 5. Multi-seed validation – produces the published mean ± std
python scripts/run_seeds.py --seeds 42 1337 7

# 6. Full per-breakdown evaluation (per domain / difficulty / reasoning_type / subcategory)
python scripts/eval_report.py --checkpoint outputs/multi_seed/seed_42/checkpoint-XXX

# 7. Compare to the sklearn baseline OR an existing pyrrho release
python scripts/compare_runs.py baseline outputs/multi_seed/summary.json

# 8. Smoke-test regression guard (10 handcrafted cases)
pytest tests/test_smoke.py -v
```

Full methodology, release gates, and W&B conventions in docs/METHODOLOGY.md.
| Document | Purpose |
|---|---|
| `docs/INDEX.md` | Fresh session entry point. Reading order for any new contributor. |
| `docs/HANDOFF.md` | Current status snapshot – what's trained, headline metrics, next actions. |
| `docs/LOG.md` | Append-only project history (findings, decisions, experiments). |
| `docs/PROJECT.md` | Full plan: vision, model picks, training recipes, roadmap. |
| `docs/METHODOLOGY.md` | End-to-end model-development pipeline, release gates, W&B conventions. |
| `docs/SETUP.md` | RTX 5090 / Blackwell / Windows environment specifics. |
- `fitz-sage` – production RAG library that uses `pyrrho` for governance.
- `fitz-gov` – 2,920-case benchmark for RAG epistemic honesty; the dataset `pyrrho` is trained and evaluated against. Also on HF: `yafitzdev/fitz-gov`.
The three projects form a triangle: fitz-gov defines the eval contract, pyrrho produces the models, fitz-sage consumes them in production.
Apache 2.0 – see LICENSE.