# pyrrho

Fine-tuned classification models that decide when your RAG should answer, without an LLM call.


Why pyrrho? • Results • Roadmap • Usage • Docs • GitHub • 🤗 HuggingFace



```
Query: "Has the company achieved profitability?"
Sources:
  [1] "Posted its first profitable quarter, net income $4M."
  [2] "Recorded a quarterly loss of $12M, third consecutive losing quarter."

❌ Standard governance (constraint + sklearn cascade)
   5 LLM calls. 108 hand-crafted features.
   Verdict: TRUSTWORTHY  (misses the conflict)
   Latency: ~1–2 s on CPU
   Requires: local LLM or paid cloud API

🛡️ pyrrho-modernbert-base-v1
   1 forward pass. No features. No LLM.
   Verdict: DISPUTED  (correct, P(D)=0.55)
   Latency: ~30 ms on CPU (INT8 ONNX)
   Requires: nothing; self-contained
```

→ A 150 MB CPU-friendly classifier that beats the prior pipeline by +7.43 accuracy points at a ~50× speedup, with no LLM dependency at inference.


## 🚀 Where to start

> [!IMPORTANT]
> The model lives on 🤗 HuggingFace as yafitzdev/pyrrho-modernbert-base-v1. Drop it into any RAG pipeline that needs a governance gate.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1").eval()

query = "Has the company achieved profitability?"
contexts = [
    "Posted its first profitable quarter, net income $4M.",
    "Recorded a quarterly loss of $12M, third consecutive losing quarter.",
]

# Concatenate the query and numbered sources into the single input format the model expects.
text = f"Question: {query}\n\nSources:\n" + "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, 1))

with torch.no_grad():
    logits = model(**tokenizer(text, return_tensors="pt", truncation=True)).logits[0]
    probs = torch.softmax(logits, dim=-1).numpy()
print({"ABSTAIN": probs[0], "DISPUTED": probs[1], "TRUSTWORTHY": probs[2]})
# → DISPUTED ≈ 0.55
```

For production CPU inference at ~30 ms/query, use the INT8 ONNX variant via optimum. Full usage in the model card.
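A minimal sketch of that path, assuming the repo ships the quantized weights as `model_quantized.onnx` (the file name is taken from this README; confirm availability in the model card):

```python
# Illustrative ONNX path via optimum's onnxruntime integration.
# The file_name value is an assumption -- check the model card.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
import torch

model_id = "yafitzdev/pyrrho-modernbert-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(model_id, file_name="model_quantized.onnx")

inputs = tokenizer(text, return_tensors="pt", truncation=True)  # `text` built as in the snippet above
probs = torch.softmax(model(**inputs).logits[0], dim=-1)
```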


## About

Most RAG governance is either (a) a black-box LLM call ("ask GPT-4 if these sources support the answer"), which is slow, expensive, and non-deterministic, or (b) a feature-engineered classifier (~108 hand-crafted signals fed into sklearn), which is cheap but capped at ~79% accuracy on hard benchmarks. I built pyrrho to replace both with a single fine-tuned encoder that runs at ~30 ms on CPU and beats both approaches on a public benchmark.

The architecture call: a pure encoder (ModernBERT-base, 149M params), not a generative SLM and not an LLM. For 3-class classification over a constrained label space, an encoder with INT8 ONNX is 50–100× faster on CPU than the same task with a generative model, and loses no accuracy when the labels are categorical and the input fits in 4K tokens (as RAG retrievals almost always do).
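To sanity-check the latency claim on your own hardware, a rough timing loop over the quickstart's `tokenizer`, `model`, and `text` (illustrative only; numbers vary with CPU, thread settings, and sequence length):

```python
# Rough CPU latency check for the PyTorch model; load the INT8 ONNX
# variant via optimum the same way to reproduce the quantized numbers.
import time
import torch

inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    model(**inputs)  # warm-up pass
    start = time.perf_counter()
    for _ in range(20):
        model(**inputs)
print(f"mean latency: {(time.perf_counter() - start) / 20 * 1000:.1f} ms")
```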

It's the model that powers governance in fitz-sage (the RAG library) and is benchmarked against fitz-gov (2,920 adversarial cases, 5-fold CV). The three projects form a triangle: benchmark, models, library.

Yan Fitzner (LinkedIn, GitHub).


## Headline results

Release v1: pyrrho-modernbert-base-v1 vs the published fitz-sage v0.11 sklearn baseline. 3-seed mean ± std on the fitz-gov V5.1 eval hold-out (584 cases, stratified 20% from tier1_core):

| Metric | pyrrho v1 | sklearn baseline | Δ |
|---|---|---|---|
| Overall accuracy | 86.13 ± 0.86 % | 78.7 % | +7.43 |
| False-trustworthy rate (safety) | 5.27 ± 0.21 % | 5.7 % | −0.43 (safer) |
| Trustworthy recall | 79.38 ± 1.64 % | 70.0 % | +9.38 |
| Disputed recall | 94.81 ± 1.28 % | 86.1 % | +8.71 |
| Abstain recall | 92.94 ± 1.11 % | 86.5 % | +6.44 |
| CPU inference (estimated) | ~30 ms | ~500–2000 ms (5 LLM calls) | ~50× faster |
| External dependencies | none | requires LLM | self-contained |

Every margin is multiple standard deviations larger than seed noise, not a lucky-run artifact. Independently verifiable by running the published model against the published benchmark: `load_dataset("yafitzdev/fitz-gov")` + `AutoModelForSequenceClassification.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")`.
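A sketch of that verification loop; the split name and column names (`query`, `contexts`, `label`) are assumptions here, so adjust them to the actual fitz-gov schema:

```python
# Assumption-laden sketch: re-score the published model on the published
# benchmark. Split and column names are guesses; check the dataset card.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

ds = load_dataset("yafitzdev/fitz-gov", split="test")  # split name assumed
tok = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
mdl = AutoModelForSequenceClassification.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1").eval()

correct = 0
for ex in ds:
    text = f"Question: {ex['query']}\n\nSources:\n" + "\n".join(
        f"[{i}] {c}" for i, c in enumerate(ex["contexts"], 1)
    )
    with torch.no_grad():
        pred = mdl(**tok(text, return_tensors="pt", truncation=True)).logits.argmax(-1).item()
    correct += int(pred == ex["label"])  # integer class labels assumed
print(f"accuracy: {correct / len(ds):.4f}")
```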

> [!NOTE]
> Known limitation: the model occasionally classifies multi-source-convergence cases (multiple authoritative sources agreeing within measurement tolerance) as DISPUTED, with ~57% error on this fitz-gov subcategory (n=7). To be fixed in v2 with augmented training data. Documented in the model card.


## Why pyrrho?

### No LLM dependency 🪶 → Model card

Standard governance pipelines route every query through 5+ LLM calls to extract constraint signals (contradiction detection, evidence sufficiency, causal attribution, …) before the classifier even fires. pyrrho reads the raw query and contexts and emits a verdict in one forward pass. No cloud API spend, no GPU swap, no rate limits.

### Beats the baseline by 7 points 📊 → Benchmark

86.13% accuracy vs 78.7% for the prior constraint+sklearn pipeline, on the same 2,920-case fitz-gov benchmark. The biggest gain is on trustworthy recall (+9.4 pts), the bucket where hand-crafted features couldn't read positive evidence-agreement signals. Attention over raw text can.

### Safer than the baseline 🛡️

False-trustworthy rate (the production safety metric: how often a confident hallucination path gets greenlit) is 5.27%, below the prior pipeline's 5.7%. Threshold calibration on top can push this lower at a small accuracy cost.
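A minimal sketch of such a gate (the threshold is a placeholder, not a published operating point): only greenlight TRUSTWORTHY when its probability clears a calibrated bar, otherwise fall back to ABSTAIN.

```python
# Illustrative confidence gate on top of the classifier's softmax output.
# TAU is a placeholder; a real deployment would calibrate it on held-out data.
TAU = 0.8
LABELS = ["ABSTAIN", "DISPUTED", "TRUSTWORTHY"]

def gated_verdict(probs) -> str:
    pred = int(probs.argmax())
    if LABELS[pred] == "TRUSTWORTHY" and probs[pred] < TAU:
        return "ABSTAIN"  # not confident enough to greenlight
    return LABELS[pred]
```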

### Production-grade CPU inference ⚡ → INT8 ONNX

~30 ms per query on commodity CPU after INT8 dynamic quantization. Ship the 150 MB `model_quantized.onnx` and serve governance inline: no GPU, no API, no LLM. Fits into latency-sensitive RAG paths that previously couldn't afford a governance step.
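For reference, INT8 dynamic quantization of an exported ONNX encoder generally looks like this with onnxruntime; a generic sketch, not necessarily the project's exact export script:

```python
# Generic dynamic-quantization sketch with onnxruntime; file names are
# placeholders, and the project's own export pipeline may differ.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",             # FP32 ONNX export of the encoder
    model_output="model_quantized.onnx",  # INT8 weights, roughly 4x smaller
    weight_type=QuantType.QInt8,
)
```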

### Reproducible end-to-end 🔬

Training data, model weights, and the evaluation pipeline are all public. The final_metrics.json and manifest.json that ship alongside the weights pin: git commit, pip freeze, hardware, seed, training duration. Anyone can re-run the smoke test (pytest tests/test_smoke.py) against the published model.

### Cross-linked with the triangle 🔗

Benchmark: fitz-gov. Models: yafitzdev/pyrrho-*. Production library: fitz-sage. Each reinforces the others: fitz-gov defines the eval contract, pyrrho ships the models, fitz-sage consumes them in production.


## Family roadmap

Two tracks. Track A ships into fitz-sage as the default governance backend. Track B is a HuggingFace portfolio of generative SLMs that prove the architecture generalizes; every one is CPU-runnable (≤8 GB RAM at Q4).


**Track A: production encoders (CPU-only)**

| Model | Params | Status |
|---|---|---|
| pyrrho-modernbert-base-v1 | 149M | ✅ live on HF |
| pyrrho-modernbert-base-v2-long | 149M | planned (long-context augmentation) |
| pyrrho-deberta-v3-large-v1 | 435M | planned (accuracy-mode variant) |

**Track B: generative SLMs (all CPU-runnable)**

| Model | Params | Status |
|---|---|---|
| pyrrho-qwen3.5-0.8b-v1 | 0.8B dense | planned (first SLM; validates the multi-source-convergence hypothesis) |
| pyrrho-qwen3.5-2b-v1 | 2B dense | planned |
| pyrrho-lfm2.5-1.2b-v1 | 1.2B Liquid hybrid | planned (non-transformer architecture variant) |
| pyrrho-gemma-4-E2B-v1 | 2.3B dense | planned (cross-family transformer anchor) |
| pyrrho-qwen3.5-4b-v1 | 4B dense | planned |
| pyrrho-gemma-4-E4B-v1 | 4.5B dense | planned |
| pyrrho-phi-4-mini-v1 | 3.8B dense | planned (synthetic-data architecture probe) |
| pyrrho-lfm2-8b-a1b-v1 | 8B / 1B-active MoE | planned (CPU-runnable MoE) |

Sidecar: pyrrho-grounding-modernbert-base-v1, for answer-level grounding/hallucination detection; a companion to the governance head.

Full release roadmap and rationale in docs/PROJECT.md §10.


## 📦 Repository structure

```
pyrrho/
├── README.md           ← you are here
├── CLAUDE.md           ← project conventions (HANDOFF/LOG update rules, banned models, style)
├── LICENSE             ← Apache 2.0
├── pyproject.toml      ← Python deps; encoder / slm / hub / dev extras
├── docs/
│   ├── INDEX.md        ← reading-order entry point for any new contributor
│   ├── HANDOFF.md      ← current status snapshot (overwritten as state changes)
│   ├── LOG.md          ← append-only project history
│   ├── PROJECT.md      ← full vision, model picks, roadmap, training recipes
│   ├── METHODOLOGY.md  ← end-to-end pipeline; release gates; W&B conventions
│   └── SETUP.md        ← RTX 5090 / Blackwell / Windows specifics
├── src/pyrrho/         ← Python package: data, metrics, training, manifest
├── scripts/            ← all CLI scripts (train, eval, sweep, compare, push, …)
├── configs/
│   ├── encoder/        ← ModernBERT-base, DeBERTa-v3-large (3-class + 4-class)
│   ├── slm/            ← Qwen3.5-2B, LFM2.5-1.2B, LFM2-8B-A1B MoE
│   └── sweep_grids/    ← hyperparameter sweep grids
├── tests/              ← pytest suites (smoke regression guard)
├── data/               ← (gitignored) processed splits from prepare_data.py
└── outputs/            ← (gitignored) training runs, checkpoints, eval reports
```

## 📦 Train your own pyrrho variant from scratch

Reproduces the published numbers end-to-end. Requires an RTX 50-series GPU (see docs/SETUP.md for Blackwell / Windows / WSL2 specifics).

```sh
# 1. Install
git clone https://github.com/yafitzdev/pyrrho.git
cd pyrrho
python -m venv .venv && source .venv/bin/activate   # or .venv\Scripts\Activate.ps1 on Windows
pip install torch --index-url https://download.pytorch.org/whl/cu128   # Blackwell wheels
pip install -e ".[encoder,hub,dev]"

# 2. Prepare data: either pull from the published HF dataset,
#    or use a local clone of yafitzdev/fitz-gov.
python scripts/prepare_data.py --fitz-gov ../fitz-gov/data --output data/processed

# 3. Verify the environment (driver / CUDA / bitsandbytes / Blackwell)
python scripts/verify_env.py

# 4. Train release #1 (~80–500 s on RTX 5090 depending on contention)
python scripts/train_encoder.py --config configs/encoder/modernbert_base.yaml --no-wandb

# 5. Multi-seed validation: produces the published mean ± std
python scripts/run_seeds.py --seeds 42 1337 7

# 6. Full per-breakdown evaluation (per domain / difficulty / reasoning_type / subcategory)
python scripts/eval_report.py --checkpoint outputs/multi_seed/seed_42/checkpoint-XXX

# 7. Compare to the sklearn baseline OR an existing pyrrho release
python scripts/compare_runs.py baseline outputs/multi_seed/summary.json

# 8. Smoke test regression guard (10 handcrafted cases)
pytest tests/test_smoke.py -v
```

Full methodology, release gates, and W&B conventions in docs/METHODOLOGY.md.


## Documentation

| Document | Purpose |
|---|---|
| docs/INDEX.md | Fresh session entry point. Reading order for any new contributor. |
| docs/HANDOFF.md | Current status snapshot: what's trained, headline metrics, next actions. |
| docs/LOG.md | Append-only project history (findings, decisions, experiments). |
| docs/PROJECT.md | Full plan: vision, model picks, training recipes, roadmap. |
| docs/METHODOLOGY.md | End-to-end model-development pipeline, release gates, W&B conventions. |
| docs/SETUP.md | RTX 5090 / Blackwell / Windows environment specifics. |

## Related projects

  • fitz-sage β€” production RAG library that uses pyrrho for governance.
  • fitz-gov β€” 2,980-case benchmark for RAG epistemic honesty. The dataset pyrrho is trained and evaluated against. Also on HF: yafitzdev/fitz-gov.

The three projects form a triangle: fitz-gov defines the eval contract, pyrrho produces the models, fitz-sage consumes them in production.


## License

Apache 2.0; see LICENSE.
