The first ladder for AI operators. 1v1 ranked matches where you direct Claude Haiku 4.5 to fix a Python bug under a 40,000-token cap and a 10-minute clock. Hidden pytest decides the winner. Glicko-2 underneath, Bronze through Challenger on top.
Live at https://rankedllm.com. Closed beta cohort 1 in flight.
condition-a/          The task pool
  tasks/              10 validated bug-fix challenges (csv-merger,
                      timezone-window, regex-validator, pagination,
                      json-key-mismatch, sql-injection, async-race,
                      lru-cache-keying, recursion-bound, config-loader).
                      Each task: starter/, hidden_tests/, reference_solution/.
  validate_tasks.sh   Verifies every task: starter fails, reference passes.
  PROTOCOL.md         The closed-beta participant brief.
  run.py              Synthetic Condition A driver (no API needed).
src/llm_ranked/       Python platform
  glicko.py           Glicko-2 rating math
  scoring.py          Match-result resolution
  matchmaker.py       Swiss-style pairing
  analysis.py         Spearman + Condition A decision rule
  orchestrator.py     Full Condition-A run driver
  synthetic.py        Skill simulator for methodology validation
  harness/            Match harness: tools, agent loop, scorer
web/                  The Next.js web app served at rankedllm.com
  app/                Pages + API routes
  lib/                glicko, scorer, agent, tasks, store, invites
  components/         SiteHeader, SiteFooter
tests/                Python unit tests (32 passing)
scripts/              Utilities
docs/design/          The locked v0 spec
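
For readers new to the rating math behind glicko.py, here is a minimal sketch of the Glicko-2 expected-score step, following the published Glicko-2 formulas. It is not the repository's code: the function names (`to_glicko2`, `g`, `expected_score`) are hypothetical, and the full rating update (volatility iteration, new RD) is left out.

```python
import math

# Scale constant from the published Glicko-2 description.
GLICKO2_SCALE = 173.7178


def to_glicko2(rating: float, rd: float) -> tuple[float, float]:
    """Convert a display rating and RD to the internal Glicko-2 scale (mu, phi)."""
    return (rating - 1500.0) / GLICKO2_SCALE, rd / GLICKO2_SCALE


def g(phi: float) -> float:
    """Weight that discounts an opponent whose rating is uncertain (large phi)."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def expected_score(rating: float, opp_rating: float, opp_rd: float) -> float:
    """Expected score (0..1) against one opponent.

    The player's own RD does not appear in this step, so it is not a parameter.
    """
    mu = (rating - 1500.0) / GLICKO2_SCALE
    mu_j, phi_j = to_glicko2(opp_rating, opp_rd)
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))


if __name__ == "__main__":
    # A 1500-rated operator facing three opponents of differing certainty.
    for opp_rating, opp_rd in [(1400, 30), (1550, 100), (1700, 300)]:
        print(opp_rating, opp_rd, round(expected_score(1500, opp_rating, opp_rd), 3))
```

The higher an opponent's RD, the closer `g` pulls the expectation toward 0.5, so a win against an uncertain rating moves the winner less than a win against a well-established one.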
# Python harness
python3 -m venv .venv
.venv/bin/pip install -e ".[dev]" pytest-asyncio
.venv/bin/pytest
bash condition-a/validate_tasks.sh
# Web app
npm install --prefix web
npm run dev --prefix web
# → http://localhost:3000

ANTHROPIC_API_KEY   Claude calls
BETA_PASSPHRASE     Shared closed-beta gate
ADMIN_TOKEN         For /api/admin endpoints
KV_REST_API_URL     Upstash Redis (auto-set by Vercel integration)
KV_REST_API_TOKEN   Upstash auth
Without KV_REST_API_*, the store falls back to in-memory storage: fine for local
dev, not for production.
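
The fallback rule can be sketched as follows. This is an illustration only: the real store lives in the TypeScript web app (web/lib), and it is shown here in Python to match the harness. The `select_store_backend` function and the backend labels are hypothetical, and the sketch assumes both KV_REST_API_URL and KV_REST_API_TOKEN must be set for the Upstash path to be used.

```python
import os
from typing import Mapping, Optional


def select_store_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "upstash" when both KV variables are set, else "memory".

    Assumption: a missing or empty value for either variable triggers the
    in-memory fallback. The labels are illustrative, not the app's own names.
    """
    env = os.environ if env is None else env
    if env.get("KV_REST_API_URL") and env.get("KV_REST_API_TOKEN"):
        return "upstash"
    return "memory"


if __name__ == "__main__":
    backend = select_store_backend()
    if backend == "memory":
        # In-memory state is lost on restart, so it only suits local dev.
        print("KV_REST_API_* not set: using in-memory store (local dev only)")
    else:
        print("Using Upstash Redis store")
```

Logging which backend was selected at startup, as in the usage above, makes an accidental in-memory deployment easier to spot.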
Python 3.11+, pytest, ruff. Next.js 16, TypeScript strict, Tailwind v4, Anthropic SDK, @vercel/kv (Upstash), Resend. Vitest for TS unit tests. Vercel deployment.
See CONTRIBUTING.md. Three tracks:
- New tasks for the hidden pool (condition-a/tasks/)
- Tooling + UI (web/, src/llm_ranked/harness/)
- Rating-system research (src/llm_ranked/analysis.py, synthetic.py)
MIT. See LICENSE.