Occursus-Claude — Multi-Model LLM Benchmark

A benchmarking platform that tests whether multi-LLM synthesis pipelines produce better results than single-model baselines. Supports 4 providers (Ollama, OpenAI, Anthropic, Gemini) with 18 orchestration strategies and dual blind judging.

Thesis

Can combining multiple LLMs through structured pipelines (debate, merge, verification, decomposition) consistently outperform a single frontier model answering directly?

Features

4 LLM Providers: Ollama (local), OpenAI (GPT-4o), Anthropic (Claude), Google Gemini
18 Pipeline Architectures: From single-call baseline to 12-call tournament elimination
Dual Blind Judging: Claude + GPT judge independently on a 0-100 scale, scores averaged
Auto Model Assignment: Toggle models on/off; the tool assigns them to pipeline roles automatically
Real-Time Streaming: Server-Sent Events show results as they complete
4 Task Suites: Smoke (quick), Core (standard), Stress (hard), Thesis (ceiling-breaking)
Settings UI: Configure API keys, base URLs, and test provider connections from the browser
Export & History: CSV/JSON export, persistent run history, rerun failed cells
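
As a rough illustration of the dual blind judging feature, the sketch below averages two independent 0-100 scores. The `JudgeScore` type and `combine_scores` function are hypothetical names chosen for this example; the actual logic in `judge.py` may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class JudgeScore:
    judge: str    # hypothetical label, e.g. "claude" or "gpt"
    score: float  # 0-100 scale, as described in the feature list


def combine_scores(scores: list[JudgeScore]) -> float:
    """Average independent judge scores after validating the 0-100 range."""
    if not scores:
        raise ValueError("at least one judge score is required")
    for s in scores:
        if not 0 <= s.score <= 100:
            raise ValueError(f"{s.judge} returned out-of-range score {s.score}")
    return mean(s.score for s in scores)


final = combine_scores([JudgeScore("claude", 82.0), JudgeScore("gpt", 74.0)])
print(final)  # 78.0
```

Rejecting out-of-range scores before averaging keeps a malformed judge reply from silently skewing the result.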

Pipeline Tiers

Tier	Pipelines	Strategy
1	Single, Best of 3, Sample & Vote	Baseline and simple selection
2	Merge Full, Critique Then Merge, Ranked Merge	Multi-persona synthesis
3	Debate, Dissent, Red Team/Blue Team, Expert Routing, Constraint Checker	Adversarial and specialist
4	Chain of Verification, Iterative Refinement, Mixture of Agents	Deep reasoning loops
5	Persona Council, Adversarial Decomposition, Reverse Engineer, Tournament	Experimental heavy
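
To make the Tier 1 selection idea concrete, here is a minimal sketch of a sample-and-vote strategy, assuming it means "sample several answers and keep the most frequent one". That interpretation, the `sample_and_vote` name, and the stub generator are assumptions for illustration, not the code in `pipelines/best_of_n.py`.

```python
from collections import Counter
from typing import Callable


def sample_and_vote(generate: Callable[[str], str], prompt: str, n: int = 3) -> str:
    """Call `generate` n times and return the most frequent normalized answer."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [generate(prompt).strip().lower() for _ in range(n)]
    # Counter.most_common orders ties by first occurrence, so the earliest
    # sampled answer wins a tie.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Stand-in for a real model call: returns canned answers in order.
canned = iter(["Paris", "paris ", "Lyon"])
result = sample_and_vote(lambda _prompt: next(canned), "Capital of France?")
print(result)  # paris
```

Exact-match voting only suits short, closed-form answers; free-form responses would need a judge or similarity measure to group equivalent answers.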

Quick Start

Prerequisites

Python 3.10+
Ollama (optional, for local models)
API keys for cloud providers (optional)

Install

git clone https://github.com/rich1398/Multi-Model-Benchmarking.git
cd Multi-Model-Benchmarking
pip install -r requirements.txt

Run

python app.py

Open http://localhost:8000 in your browser.

Configure API Keys

1. Open the Provider Settings panel in the UI
2. Enter API keys for OpenAI, Anthropic, and/or Gemini
3. Click Test to verify each connection
4. Click Save Settings

Or use environment variables:

cp .env.example .env
# Edit .env with your API keys

Task Suites

Suite	Tasks	Difficulty	Purpose
`smoke_tasks.json`	5	Easy	Quick validation
`core_tasks.json`	12	Easy-Medium	Standard benchmark
`stress_tasks.json`	8	Hard	Complex reasoning
`thesis_tasks.json`	8	Very Hard	Ceiling-breaking tasks designed to differentiate pipelines

The thesis tasks include cross-domain synthesis, multi-file code refactoring, constraint satisfaction, needle-in-haystack analysis, and multi-requirement system design. They were chosen as problems where a single model is expected to struggle, so that differences between pipelines have room to show.

Architecture

app.py              FastAPI server, API endpoints, benchmark orchestration
llm_client.py       Unified async client for all 4 providers, auto-routing, retry logic
role_assigner.py    Auto-assigns enabled models to pipeline roles
judge.py            Dual blind judge (Claude + GPT), 0-100 scale
config.py           YAML + JSON settings, API key management
models.py           Immutable dataclasses (frozen)
db.py               SQLite persistence with WAL mode
pipelines/
  base.py           Abstract pipeline interface
  single.py         Tier 1: Single-call baseline
  best_of_n.py      Tier 1: Selection strategies
  merge.py          Tier 2: Multi-persona synthesis
  debate.py         Tier 3: Adversarial debate
  routing.py        Tier 3: Expert routing + constraint checker
  deep.py           Tier 4: Verification, refinement, MoA
  experimental.py   Tier 5: Council, decomposition, tournament
static/
  index.html        Single-page UI
  app.js            Frontend logic (SSE streaming, charts, modals)
  style.css         Dark theme
tasks/              JSON task definition files
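
The layout above mentions frozen dataclasses and SQLite in WAL mode. The sketch below shows how those two pieces typically fit together; the `RunResult` record, the `results` table, and `open_db` are hypothetical and are not taken from `models.py` or `db.py`.

```python
import os
import sqlite3
import tempfile
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class RunResult:  # hypothetical record; the real models.py may differ
    pipeline: str
    task_id: str
    score: float


def open_db(path: str) -> sqlite3.Connection:
    """Open a file-backed database and switch it to write-ahead logging."""
    conn = sqlite3.connect(path)
    # WAL needs a file-backed database; in-memory databases stay in "memory" mode.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (pipeline TEXT, task_id TEXT, score REAL)"
    )
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_db(os.path.join(tmp, "bench.db"))
    row = RunResult("single", "smoke-1", 71.5)
    conn.execute(
        "INSERT INTO results VALUES (?, ?, ?)",
        (row.pipeline, row.task_id, row.score),
    )
    conn.commit()
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    stored = conn.execute("SELECT pipeline, task_id, score FROM results").fetchall()
    conn.close()

try:
    row.score = 99.0  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False

print(mode, stored, mutated)
```

WAL mode lets readers proceed while a write is in progress, which fits a server that streams results while a benchmark run is still writing rows.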

How It Works

1. Toggle models on/off in the UI (6 preset models across 4 providers)
2. Select pipelines and tasks to benchmark
3. Click Run. The tool auto-assigns models to roles:
   - Claude/GPT as primary generator and synthesizer
   - Secondary cloud model as critic and alternative generator
   - Gemini for diversity in multi-model pipelines
   - Ollama for speed when many models are enabled
4. Dual judge: both Claude and GPT score each response blind on a 0-100 scale, and the two scores are averaged
5. View results: real-time score matrix, charts, drill-down, thesis banner
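
The role assignment described in the steps above can be pictured with a small sketch. It assumes a fixed provider preference order and only three roles; the `Model` type, `PRIMARY_ORDER`, and `assign_roles` are hypothetical and simplify whatever `role_assigner.py` actually does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str      # hypothetical model label
    provider: str  # "anthropic", "openai", "gemini", or "ollama"


# Assumed preference order for the primary role, following the steps above.
PRIMARY_ORDER = ["anthropic", "openai", "gemini", "ollama"]


def assign_roles(enabled: list[Model]) -> dict[str, Model]:
    """Map enabled models to roles, reusing the primary when only one is enabled."""
    if not enabled:
        raise ValueError("enable at least one model")
    # sorted() is stable, so models from the same provider keep their input order.
    ranked = sorted(enabled, key=lambda m: PRIMARY_ORDER.index(m.provider))
    primary = ranked[0]
    critic = ranked[1] if len(ranked) > 1 else primary
    return {"generator": primary, "synthesizer": primary, "critic": critic}


roles = assign_roles(
    [
        Model("gemini-model", "gemini"),
        Model("claude-model", "anthropic"),
        Model("gpt-model", "openai"),
    ]
)
print({role: m.name for role, m in roles.items()})
# {'generator': 'claude-model', 'synthesizer': 'claude-model', 'critic': 'gpt-model'}
```

Falling back to the primary model for every role keeps single-model runs valid, at the cost of losing the diversity that multi-model pipelines are meant to test.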

License

MIT
