FairBench

Does your voice AI grade everyone fairly? FairBench feeds an AI the same strong answer under different accents, names, and genders, then checks whether it passes everyone at the same rate. When it doesn't, FairBench tells you why: the speech‑to‑text misheard an accent, or the AI judge under‑credited a name.

Built on Pipecat (voice), NVIDIA Nemotron (open‑weights ASR + LLM), Gradium (TTS), Twilio (phone), and Cekura (live‑call testing). Runs end‑to‑end with no API keys on synthetic, public test personas.

Live demo

Console: https://shelterless-ultrasonically-wes.ngrok-free.dev

The URL is an ngrok tunnel to the local dashboard (ngrok http 3000), so it is reachable only while ./scripts/dev.ps1 is running. For Twilio phone calls, tunnel the API instead (ngrok http 8000) and set PUBLIC_BASE_URL in .env to that https URL.

The problem

Voice AI now scores people — recruiter phone screens, clinical skill checks, language tests. The scary part: the same correct answer can pass in one voice and fail in another, and nobody notices.

A spoken evaluation can be unfair two very different ways, and FairBench separates them:

| The two ways a spoken test can be unfair | What it looks like |
| --- | --- |
| 🎙️ Transcription bias: the speech‑to‑text mishears an accent, so a correct answer is garbled before it's ever scored | accent fails with a high mishear rate |
| ⚖️ Grader bias: the AI judge scores the same words lower because of a perceived identity (e.g. the name) | name fails with no mishear gap |

Gender is a control group that should stay flat — if it moves, the test is noisy and we don't trust the result. That separation is the whole product.

How it works

A person (or a test persona) gives a spoken answer; FairBench grades it on a skills checklist, then tests whether the grade is fair using the EEOC 80% (four‑fifths) rule: a group whose pass rate is below 80% of the top group's pass rate is a legal red flag.
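The 80% check can be sketched in a few lines of Python. This is a minimal illustration, not the code in fairbench/core/integrity.py: the `GroupResult` type, the `fairness_ratios` and `flagged_groups` helpers, and the counts in the usage note are all assumptions made for the example.

```python
from dataclasses import dataclass

FOUR_FIFTHS = 0.80  # EEOC four-fifths threshold


@dataclass(frozen=True)
class GroupResult:
    """Pass/fail counts for one group (hypothetical shape)."""
    group: str
    passed: int
    total: int

    def __post_init__(self):
        if self.total <= 0:
            raise ValueError("each group needs at least one graded answer")

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def fairness_ratios(results):
    """Map each group to its pass rate divided by the top group's pass rate."""
    top_rate = max(r.pass_rate for r in results)
    if top_rate == 0:
        raise ValueError("no group passed, so the ratio is undefined")
    return {r.group: r.pass_rate / top_rate for r in results}


def flagged_groups(results, threshold=FOUR_FIFTHS):
    """Return the groups whose ratio falls strictly below the threshold."""
    ratios = fairness_ratios(results)
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)
```

With made-up counts of 50/50, 37/50, and 46/50 passes, the ratios come out to 1.00, 0.74, and 0.92, so only the 37/50 group is flagged. A group at exactly 0.80 is not flagged, since the rule is "less than 80%".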

The key move: because the same competent answer is delivered for every group, any gap in who passes is the AI's flaw — not the person's. FairBench then attributes each gap to the transcription (speech‑to‑text) or the judge (LLM), so you fix the right thing. It proves itself offline with no keys, and the same grading path runs live on real phone calls.
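One way to picture the attribution step is as a rule over two numbers per group: the fairness ratio and how much more often that group's speech is misheard than the baseline group's. The sketch below is an assumption about the shape of that rule, not FairBench's actual logic; `Cause`, `attribute_gap`, and the 0.05 mishear-gap threshold are hypothetical.

```python
from enum import Enum


class Cause(Enum):
    NONE = "no flagged gap"
    TRANSCRIPTION = "transcription (speech-to-text)"
    JUDGE = "judge (LLM grader)"


def attribute_gap(
    fairness_ratio,
    mishear_rate,
    baseline_mishear_rate,
    *,
    ratio_threshold=0.80,        # four-fifths line
    mishear_gap_threshold=0.05,  # hypothetical tolerance, chosen for illustration
):
    """Attribute a flagged pass-rate gap to the transcriber or the judge."""
    if fairness_ratio >= ratio_threshold:
        return Cause.NONE
    # A flagged group that is also misheard far more often points at the ASR.
    mishear_gap = mishear_rate - baseline_mishear_rate
    if mishear_gap > mishear_gap_threshold:
        return Cause.TRANSCRIPTION
    # Same words reached the grader, yet the group still fails: the judge.
    return Cause.JUDGE
```

Under these assumptions, a group at ratio 0.70 with a 30% mishear rate against a 5% baseline is attributed to transcription, while a group at 0.74 with no mishear gap is attributed to the judge.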

Demo

▶ Watch the demo video — ~60s walkthrough of the console (Cekura battery → bias audit → Improve before/after).

You hear the same competent answer in two different voices — then watch the agent pass one and fail the other on identical words. That gap is the product: FairBench measures it, attributes it to the transcriber or the judge, and (in the Improve tab) closes the grader half in one click while the transcription gap correctly stays flagged.

Results: catching and closing grader bias

The Improve tab runs the matched battery before vs. after a grader fix in one click. On the pharmacy scenario:

| Metric (pharmacy scenario) | Before fix | After fix |
| --- | --- | --- |
| Name‑origin fairness ratio, Latino | 0.74 (flagged) | 0.92 (clear) |
| Name‑origin fairness ratio, South‑Asian | 0.78 (flagged) | 1.00 (clear) |
| Name‑origin groups flagged | 2 | 0 |
| Grader "wrongly failed" rate | 17.1% | 14.8% |
| Accent (transcription) groups flagged | 4 | 4 (unchanged, by design) |
| Gender control | clean | clean |

The accent gap correctly persists because a grader fix can't fix a transcription problem — and gender stays clean throughout. Proving those two behaviors at once is exactly what makes the result trustworthy.

How it's built

Pipecat (voice). The live call is a Pipecat pipeline: transport → Nemotron ASR → Nemotron LLM → Gradium TTS → transport. One bot() entrypoint serves both the browser (WebRTC) and a real phone (Twilio WebSocket). On hang‑up, the captured transcript is graded and saved, so a live call flows into the same audit as the offline test.
Nemotron (open weights). Nemotron Speech Streaming is the ASR; Nemotron 3 Super is both the LLM behind the in‑call persona and the grader behind the checklist. Using an open‑weights judge is the point: a closed judge you can't inspect can't be audited for bias.
Cekura (live‑call testing). Cekura drives a matched battery of real calls against the agent: identical competent scripts, varied only by accent and name. FairBench maps every call to a pass/fail row and renders the same fairness audit on real calls.

The fairness engine itself — fairness ratios, grader reliability, and the transcription‑vs‑judge attribution — lives in fairbench/core/integrity.py. The bias is never hard‑coded: it emerges from the real grader running over mis‑transcribed text, and the test suite asserts the engine catches it, attributes it correctly, and raises no false alarm on the gender control.

What we learned

The judge itself is a bias surface. Most eval tooling treats the LLM grader as ground truth. Holding the spoken performance constant across groups made it obvious the grader can disagree with itself by name alone — which is why attribution (transcriber vs. judge) became the core of the project.
"Speak a line, then listen" is subtle in a voice pipeline. The first build made the agent monologue past its own greeting; the fix was to speak a fixed TTSSpeakFrame and yield the turn instead of running the LLM immediately.
Transport parity is where the time goes. Browser (WebRTC) and phone (Twilio) need different "first‑utterance" triggers; unifying them behind one worker took more care than the pipeline itself.
Keyless‑by‑default was worth it. Making the whole audit prove itself offline (no API keys) means the demo, the tests, and the bias detection run anywhere — and the result is reproducible.

Notes on the tools & models

Honest, firsthand notes from building on each.

NVIDIA Nemotron (open weights).

Did well: the OpenAI‑compatible endpoint let one adapter serve both the conversational persona and the rubric grader — almost no glue code. Open weights are what made an inspectable judge possible, which is the entire premise of the audit: you can't audit a closed judge for bias.
Could be better: strict structured/JSON output when grading was the main friction — we wanted a guaranteed JSON verdict and had to defend against drift. Clearer guidance on NEMOTRON_ENABLE_THINKING for grading consistency would help, and since the streaming ASR's accent robustness is literally what we measure, published accent benchmarks would let us calibrate.
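Defending against JSON drift can look something like the sketch below, which pulls the first well-formed verdict object out of a reply that may also contain prose or a code fence. The verdict schema (`passed`, `reason`) and the `parse_verdict` helper are assumptions for illustration; the grader's real schema is not shown in this README.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    """Hypothetical grader verdict: a boolean pass flag plus a free-text reason."""
    passed: bool
    reason: str


def parse_verdict(raw: str) -> Optional[Verdict]:
    """Return the first valid verdict found in the model output, else None."""
    decoder = json.JSONDecoder()
    # Try every "{" as a possible start, so prose or fences around the JSON
    # (or stray braces before it) do not break parsing.
    for match in re.finditer(r"\{", raw):
        try:
            obj, _ = decoder.raw_decode(raw, match.start())
        except json.JSONDecodeError:
            continue
        # Require a real boolean; "yes"/"true" strings count as drift.
        if isinstance(obj, dict) and isinstance(obj.get("passed"), bool):
            return Verdict(passed=obj["passed"], reason=str(obj.get("reason", "")))
    return None
```

Returning `None` rather than guessing lets the caller retry or mark the row as ungraded, so a malformed reply never silently counts as a pass or a fail.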

Cekura (self‑improvement loops).

Worked well: the matched accent/name battery maps cleanly onto fairness‑ratio rows, and pairing it with our transcription‑vs‑judge attribution made every fix targeted instead of guesswork — exactly the self‑improvement loop the platform is built for.
Could be better: clearer run status/result polling semantics (knowing precisely when a battery is "done" and fetchable) and more consistent metric naming between the dashboard and the API, so mapping a run to pass/fail rows is unambiguous.

Pipecat (voice orchestration).

Worked well: one build_worker served both browser (SmallWebRTCTransport) and a real phone (Twilio Media Streams) by swapping only the transport — phone + web from one pipeline. Hooking disconnect to grade‑on‑hangup was clean.
Friction — scripted greeting then listen: seeding the opening line and queuing LLMRunFrame made the agent monologue past its greeting; we had to speak a TTSSpeakFrame and then wait. A documented recipe for "speak a fixed line, then yield the turn" would save hours.
Friction — transport parity: WebRTC fires on_client_ready, but Twilio has no equivalent, so we kicked the greeting on on_client_connected. A consistent "first‑utterance" hook across transports would remove the guesswork.
Friction — transcript dedup: onBotOutput fires per‑word and per‑sentence; we filtered on aggregated_by === "sentence" for a clean transcript.
Gotcha: an embedded/Electron browser passes getUserMedia but sends no audio — the call connects and the agent hears silence. A transport‑level "no inbound audio" warning would have saved real debugging time.

Get it running

Quickest start — the keyless demo (≈60 seconds)

The product proves itself offline with no API keys. This runs a real audit through the actual grader and writes the report to disk.

python -m venv .venv
.venv\Scripts\activate            # Windows  (use: source .venv/bin/activate on macOS/Linux)
pip install -e ".[dev]"

fairbench demo                    # writes reports/audit.{md,json} + sessions — no keys needed
pytest -q                         # runs the test suite, incl. the bias‑detection checks

fairbench demo runs:

same answer → spoken in many accents/names → accent mishearing → REAL grader → fairness audit

The bias emerges from the real grader running over mis‑transcribed text — it is not hard‑coded. The tests assert the engine catches it, attributes it correctly, and raises no false alarm on the untouched gender control.
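A toy version of that loop shows how a gap can emerge without being hard-coded. Everything here is a stand-in: the keyword checklist, the substitution-based mishearing model, and the keyword-matching `grade` function are assumptions for illustration, whereas the real demo runs the project's own simulator and grader.

```python
# Hypothetical checklist: keywords a competent answer must contain.
CHECKLIST = ("verify", "interaction", "pharmacist")

# The one competent answer that every persona delivers.
ANSWER = "I verify your identity, explain the interaction, and loop in the pharmacist."

# Toy mishearing model: per-accent word substitutions (made up for illustration).
MISHEARINGS = {
    "accent_a": {},
    "accent_b": {"pharmacist": "farm assist"},
}


def transcribe(answer, accent):
    """Simulate speech-to-text by applying the accent's substitutions."""
    for spoken, heard in MISHEARINGS[accent].items():
        answer = answer.replace(spoken, heard)
    return answer


def grade(transcript):
    """Pass only if every checklist keyword survives in the transcript."""
    text = transcript.lower()
    return all(keyword in text for keyword in CHECKLIST)


def run_battery():
    """Grade the same answer once per accent; the grader never sees the accent."""
    return {accent: grade(transcribe(ANSWER, accent)) for accent in MISHEARINGS}
```

Here `grade` contains no accent-specific logic, yet `run_battery()` passes `accent_a` and fails `accent_b`, because the checklist keyword was lost in transcription before grading.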

Full console (backend + dashboard)

The console is a tabbed web app backed by the FastAPI server.

# Windows — one command starts the API (:8000) and the dashboard (:3000)
./scripts/dev.ps1

…or run the two processes yourself:

pip install -e ".[bot,server,dev]"
fairbench-server                          # API on http://localhost:8000
cd dashboard; npm install; npm run dev    # dashboard on http://localhost:3000

Then open http://localhost:3000. The dashboard proxies /api/* to the backend, so there's no CORS setup. Even with the backend down it still shows the last saved audit, so you always see something real.

New to the terminology? Every screen has tooltips, and the console header has a "What do these terms mean?" glossary that explains every term in plain English.

Talk to it with your voice

In the browser — open the Live call tab, allow your mic, and speak. (Use a real browser like Chrome/Edge at http://localhost:3000, not an in‑IDE preview, so the mic works.)
On a real phone (Twilio) — add TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, and TWILIO_PHONE_NUMBER to .env. In the Live call tab, switch to phone mode, type your number, and hit "Call me" — FairBench dials your phone and the agent answers. For the full interactive agent (not just the opening line), expose the server with ngrok http 8000 and set PUBLIC_BASE_URL to that https URL. On a Twilio trial account, verify your number first. The graded call shows up in Sessions.

Use cases

Anywhere a voice AI decides who passes. Three that ship as scenarios:

💼 Recruiting phone screens

An AI screens candidates for a role. FairBench pushes the same strong answer through the screen under many accents and names.

Catches: candidates failing because the transcriber dropped a key word, or being under‑scored by the judge purely by name.
So what: a phone screen is a legal hiring step. A group passing less than 80% as often as the top group is a disparity you'd have to defend; FairBench catches it before it becomes a liability.

💊 Pharmacy & clinical skill checks

A pharmacy tech handles a confused medication pickup. FairBench replays the same safe response (verify identity → explain the interaction → loop in the pharmacist) across accents and names.

Catches: a mis‑heard word vs. a judge that under‑credits the identical script by name.
So what: failing a competent tech over an accent is both a fairness and a patient‑safety problem. The score should reflect the answer, not the voice.

🩺 Nursing & licensure checks

A nurse escalates a deteriorating patient. FairBench replays the same urgent, correct hand‑off across demographic variants.

Catches: whether a missed step is the transcriber mishearing an accent or the grader discounting the nurse.
So what: when the same competent escalation passes for some and fails for others, the assessment is broken — not the nurse.

23 scenarios ship in data/synthetic/scenarios/: the pharmacy and nursing cases plus 20 hiring roles across tech, sales, healthcare, retail, finance, logistics, education, and hospitality. Pick any in the Audit, Live call, or Improve tabs.

Using the console (the 5 tabs)

| Tab | What it does |
| --- | --- |
| Audit | Pick a scenario, run the test, and read the verdict: who passed, the size of each gap, and what's causing it (transcription vs. judge), with a visible 0.80 fairness line. |
| Live call | Practice a real spoken conversation (browser or phone) with live coaching as you talk; hang up to get scored on the same checklist. |
| Improve | One click runs the test before vs. after a fix and proves the gap closed, while the transcription gap correctly persists and gender stays clean. |
| Real calls | Runs the matched battery as real calls via Cekura and renders the identical audit on live results. (Advanced: fetch an existing Cekura run by id.) |
| Sessions | Browse every graded conversation: score, scorecard, and transcript. |

Reference

Stack & environment

| Piece | Service | Env vars |
| --- | --- | --- |
| Speech‑to‑text | Nemotron Speech Streaming | `NVIDIA_ASR_URL` |
| LLM (persona + grader) | Nemotron 3 Super | `NEMOTRON_LLM_URL`, `NEMOTRON_LLM_MODEL` |
| Text‑to‑speech | Gradium | `GRADIUM_API_KEY` |
| Phone | Twilio | `TWILIO_ACCOUNT_SID`, `TWILIO_AUTH_TOKEN`, `TWILIO_PHONE_NUMBER` |
| Live‑call testing | Cekura | `CEKURA_API_KEY` |

Copy .env.example to .env. The Nemotron URLs are prefilled with the event‑day endpoints; the keyless demo and offline audit need none of these.
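A quick way to see which live integrations a given .env would enable is to check the variables from the table above. The helper below is a hypothetical sketch, not part of the FairBench CLI; only the variable names come from this README. Since the keyless demo needs none of them, a missing value only means that integration stays off.

```python
import os

# Env vars per piece, copied from the stack table.
REQUIRED_ENV = {
    "Speech-to-text": ("NVIDIA_ASR_URL",),
    "LLM (persona + grader)": ("NEMOTRON_LLM_URL", "NEMOTRON_LLM_MODEL"),
    "Text-to-speech": ("GRADIUM_API_KEY",),
    "Phone": ("TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN", "TWILIO_PHONE_NUMBER"),
    "Live-call testing": ("CEKURA_API_KEY",),
}


def missing_env(env=None):
    """Return {piece: [unset or empty variable names]} for incomplete pieces."""
    env = os.environ if env is None else env
    missing = {}
    for piece, names in REQUIRED_ENV.items():
        absent = [name for name in names if not env.get(name)]
        if absent:
            missing[piece] = absent
    return missing


if __name__ == "__main__":
    for piece, names in missing_env().items():
        print(f"{piece}: missing {', '.join(names)}")
```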

CLI

| Command | What it does |
| --- | --- |
| `fairbench demo` | Keyless self‑validating loop → audit + sessions |
| `fairbench audit` | Same audit (offline sim by default); flags: `--demo`, `--results FILE`, `--live` |
| `fairbench battery` | Build the matched bias battery |
| `fairbench rubric` | Show the active skills checklist |
| `fairbench cekura fetch ID` | Pull a completed Cekura result → audit |
| `fairbench cekura run --agent-url URL` | Start a Cekura accent battery against a live agent |
| `fairbench-server` | Run the FastAPI backend (:8000) |

Project layout

fairbench/
  core/        # checklist, grader, scenario, fairness engine
  sim/         # transcript generator + accent mishearing model + offline loop
  adapters/    # Nemotron LLM, Gradium TTS, Cekura client, Pipecat factories
  bot/         # text demo + live Pipecat pipeline
  server/      # FastAPI backend (audit, live call, improve, cekura, twilio)
data/synthetic/   # checklists, personas (bias battery), scenarios, demo sessions
dashboard/        # Next.js console
reports/          # generated audit.md / audit.json

The Nemotron LLM/STT Pipecat service classes are vendored from NVIDIA's YC starter under fairbench/integrations/pipecat/; the Nemotron/Gradium endpoints are the starter's.

Why it's different

Most eval tools tell you whether an agent passed. FairBench tells you whether the pass/fail is trustworthy, and separates the two ways it can be biased — the transcription vs. the judge — so a real disparity becomes a fixable, defensible finding instead of a mystery.

License

Proprietary.
