A benchmark for multi-turn debate judgment in large language models.
Current argumentation benchmarks evaluate argument quality in isolation -- a single text scored along rhetorical or logical dimensions. DebateFlow tests whether LLMs can judge multi-turn debates: given a four-turn transcript and a scoring rubric, predict the winner and score each side along dimensions that require attending to the full arc of the exchange.
Each debate follows the Karl Popper format: four turns (Affirmative opening, Negative response, Affirmative rebuttal, Negative closing) on a stated resolution. Debates are generated synthetically via LLM-vs-LLM, with one side optionally receiving an injected weakness (weak evidence, argument dropping, logical gaps, or burden-of-proof failure). This gives each debate a known ground-truth failure mode for fine-grained error analysis.
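As an illustration, here is roughly what a single benchmark item might contain. The field names below are hypothetical, not the actual Pydantic schema in `models.py`:

```python
# Hypothetical benchmark item -- field names are illustrative only,
# not the actual schema defined in src/debateflow/models.py.
example_debate = {
    "resolution": "Governments should ban targeted political advertising.",  # invented example
    "category": "policy",
    "turns": [
        {"side": "affirmative", "role": "opening", "text": "..."},
        {"side": "negative", "role": "response", "text": "..."},
        {"side": "affirmative", "role": "rebuttal", "text": "..."},
        {"side": "negative", "role": "closing", "text": "..."},
    ],
    # Known ground truth from generation, used for fine-grained error analysis.
    "injected_weakness": {"side": "negative", "type": "argument_dropping"},
}
```

Each side is scored along five rubric dimensions: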
| Dimension | What it measures |
|---|---|
| Clash engagement | Did each side address the opponent's arguments or talk past them? |
| Burden fulfillment | Did each side meet its burden of proof? |
| Rebuttal quality | Specificity and depth of refutations |
| Argument extension | Did arguments develop across turns, or merely repeat the opening? |
| Strategic adaptation | Did speakers adjust their approach based on the opponent's actual moves? |
The last two dimensions are central to competitive debate judging but absent from existing argument quality taxonomies.
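A judgment therefore consists of a winner call plus per-side scores on every dimension. A minimal sketch of what such an output might look like (the 1-5 scale and key names are assumptions; the real score-level anchors live in `plans/SPEC.md`):

```python
# Hypothetical judge output -- the scale and keys are assumptions,
# not the format the benchmark actually requires.
example_judgment = {
    "winner": "affirmative",
    "scores": {
        "affirmative": {
            "clash_engagement": 4,
            "burden_fulfillment": 4,
            "rebuttal_quality": 3,
            "argument_extension": 4,
            "strategic_adaptation": 3,
        },
        "negative": {
            "clash_engagement": 2,  # dropped the affirmative's main line of argument
            "burden_fulfillment": 3,
            "rebuttal_quality": 3,
            "argument_extension": 2,
            "strategic_adaptation": 2,
        },
    },
}
```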
Repository layout:

```
pyproject.toml              Project config and dependencies
resolutions.yaml            12 seed resolutions (policy, values, empirical)
plans/
  SPEC.md                   Benchmark specification
  PLAN.md                   Implementation plan
  VOICE-SPEC.md             Voice synthesis spec
  TELEGRAM-JUDGING-SPEC.md  Telegram judging interface spec
src/debateflow/
  models.py                 Pydantic data models
  providers.py              LLM provider factory (Anthropic + OpenAI)
  prompts.py                System prompts and weakness injection templates
  generator.py              4-turn debate generation pipeline
  compile.py                JSONL compilation and statistics
  publish.py                HuggingFace Hub publication
  dataset_card.py           Dataset card template
  cli.py                    Typer CLI entry point
  server.py                 Annotation server with on-demand TTS
  voice.py                  ElevenLabs TTS wrapper
  telegram_judging.py       Telegram judging session management
  agreement.py              Inter-annotator agreement computation
  static/
    annotate.html           Browser-based annotation tool
    review.html             Annotation review tool
output/
  debates/                  Generated debate JSON files
  annotations/              Human annotation JSON files
  audio/                    Cached TTS audio (MP3)
tests/
  test_models.py
  test_prompts.py
```
Requires Python 3.11+ and uv.
```bash
git clone <repo-url> && cd debateflow
uv sync
```

Copy `.env.example` to `.env` and fill in the keys you need:

```
DF_ANTHROPIC_API_KEY=...   # for debate generation (Anthropic models)
DF_OPENAI_API_KEY=...      # for debate generation (OpenAI models)
DF_ELEVENLABS_API_KEY=...  # for voice synthesis (annotation server)
DF_HF_TOKEN=...            # for publishing to HuggingFace Hub
DF_HF_REPO=...             # e.g. your-username/debateflow
```
Not all keys are needed for every task: generation requires the relevant LLM provider key(s), the annotation server needs ElevenLabs only for voice playback, and publishing requires the HuggingFace token and repo.
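If you want to sanity-check which keys are set before running a task, a small sketch along these lines works (the task-to-key mapping mirrors the sentence above; it assumes the variables are exported or otherwise loaded into the environment):

```python
import os

# Which DF_* keys each task needs; generation needs whichever provider you use.
required = {
    "generate": ["DF_ANTHROPIC_API_KEY", "DF_OPENAI_API_KEY"],
    "serve": ["DF_ELEVENLABS_API_KEY"],  # only needed for voice playback
    "publish": ["DF_HF_TOKEN", "DF_HF_REPO"],
}

for task, keys in required.items():
    missing = [k for k in keys if not os.environ.get(k)]
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{task}: {status}")
```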
```bash
# Generate 10 debates with default models
uv run debateflow generate -n 10

# Use specific models per side
uv run debateflow generate -n 5 \
  --aff-provider anthropic --aff-model claude-sonnet-4-20250514 \
  --neg-provider openai --neg-model gpt-4o

# Filter by topic category or force a weakness type
uv run debateflow generate -n 5 --category values
uv run debateflow generate -n 3 --weakness argument_dropping

# View dataset statistics
uv run debateflow stats

# Compile individual JSONs into a single JSONL
uv run debateflow compile
```
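Once compiled, the dataset is plain JSONL and easy to inspect. A minimal sketch, assuming the compiled file ends up under `output/` (the exact path and record fields are assumptions -- check where `debateflow compile` writes its output):

```python
import json
from pathlib import Path

# Path is an assumption, not a documented default.
jsonl_path = Path("output/debates.jsonl")

debates = [json.loads(line) for line in jsonl_path.read_text().splitlines() if line.strip()]
print(f"loaded {len(debates)} debates")
```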
The annotation tool runs in the browser. Start the server:

```bash
uv run debateflow serve
```

Then open http://localhost:5733. The server:
- Serves the annotation UI at `/`
- Loads debates from `output/debates/` (click "Load from Server" on the setup screen)
- Provides on-demand text-to-speech via ElevenLabs -- click Play on any turn to hear it spoken, or Play All for sequential playback
- Caches synthesized audio to `output/audio/` so repeated plays don't hit the API
Enter your annotator ID, load debates, and score each one. Annotations download as JSON files that go into `output/annotations/`.

Voice playback is optional -- annotation works without an ElevenLabs key; the Play buttons just won't function.
```bash
# Check annotation progress
uv run debateflow annotate-status

# Compute inter-annotator agreement (needs 2+ annotators on same debates)
uv run debateflow annotate-agreement
```
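The `annotate-agreement` command does this for you; if you want an independent spot-check on winner calls, here is a generic two-annotator sketch (plain Cohen's kappa, not necessarily the metric `agreement.py` implements):

```python
# Generic Cohen's kappa over two annotators' winner calls.
# Stand-alone illustration, not the project's agreement.py.
def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

print(cohen_kappa(["aff", "neg", "aff", "neg"], ["aff", "neg", "neg", "neg"]))
```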
```bash
# Dry run -- generates JSONL and dataset card locally
uv run debateflow publish --repo your-username/debateflow --dry-run

# Push to HuggingFace Hub
uv run debateflow publish --repo your-username/debateflow --public
```
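Once published, the dataset can be pulled back with the `datasets` library (the repo id below is the same placeholder as in the publish command; split and column names depend on how the JSONL was compiled):

```python
from datasets import load_dataset

# Repo id is a placeholder -- substitute the repo you published to.
ds = load_dataset("your-username/debateflow")
print(ds)
```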
See `plans/` for the full specifications:

- SPEC.md — benchmark design, rubric dimensions, and score-level anchors
- PLAN.md — implementation plan for the generation pipeline
- VOICE-SPEC.md — ElevenLabs voice synthesis for spoken debate playback
- TELEGRAM-JUDGING-SPEC.md — Telegram-based annotation flow via OpenClaw
```bash
uv run pytest tests/
```