A minimal agent harness you can read in 30 minutes — and the honest story of taking it from 17% to 50% pass rate on hard coding tasks.
After the Claude Code source leak (March 31, 2026), we learned that the harness around an LLM matters more than the model. Stanford showed a 6× gap on the same model with different harnesses. But production harnesses are 512K+ lines. This one is ~400.
We ran the same 2 hardest coding tasks through 7 frontier models on the same harness. No prompt tuning per model โ identical system prompt, identical tools.
| Model | refactor | bugfix | Score | Tokens |
|---|---|---|---|---|
| GPT-5.4 | ✅ 7 turns | ✅ 6 turns | 2/2 | 58K |
| Claude Sonnet 4.6 | ✅ 10 turns | ✅ 14 turns | 2/2 | 51K |
| GPT-5.4-mini | ✅ 10 turns | ✅ 8 turns | 2/2 | 96K |
| Claude Opus 4.6 | ❌ | ✅ 12 turns | 1/2 | 75K |
| GPT-5.3-codex | ❌ | ✅ 7 turns | 1/2 | 91K |
| Gemini-3-Flash | ❌ | ❌ | 0/2 | 143K |
| Gemini-3-Pro | ❌ | ❌ | 0/2 | 71K |
GPT-5.4 wins on speed (13 total turns). Claude Sonnet wins on cost (51K tokens). Gemini 3 fails both tasks. Opus loses to Sonnet — more expensive ≠ better.
These same 2 tasks scored 0/9 on DeepSeek-V3.2. Model selection has 10ร more impact than any harness optimization.
We ran 6 hard coding tasks (refactoring, test generation, feature implementation, debugging, bug hunting, cross-file analysis) on DeepSeek-V3.2 with natural language prompts — no hand-holding, no step-by-step instructions. Each task was run 3 times to test reliability.
| Task | 3-Trial Rate | Status |
|---|---|---|
| analysis | 3/3 (100%) | ✅ Reliable |
| debug | 3/3 (100%) | ✅ Reliable |
| feature | 2/3 (67%) | ⚠️ Mostly reliable |
| test-gen | 1/3 (33%) | ⚠️ Unreliable |
| refactor | 0/3 (0%) | ❌ Fails |
| bugfix | 0/3 (0%) | ❌ Fails |

Reliable (≥2/3): 3/6 = 50%
We started at 1/6 (17%) and iterated through 5 rounds:
| Round | Change | Result | What We Learned |
|---|---|---|---|
| 0 | Baseline — no optimizations | 1/6 (17%) | Most tasks hit turn limit |
| 1 | Added planning prompt, raised max_turns to 20 | 2/6 (33%) | Planning helps, but not enough |
| 2 | Added fuzzy edit hints, smarter error messages | 4/6 (67%) | Edit recovery is huge — showing nearby text when exact match fails |
| 3 | Step-by-step task prompts | 5/6 (83%) | But this was cheating — teaching to the test |
| 4 | Claimed "100%" | 6/6 | Was lying to ourselves — changed tasks and only ran once |
| 5 | Honest retest: natural prompts, 3 trials each | 3/6 (50%) | The real number |
- **Installing ripgrep (+15%)** — The grep tool was returning errors on every call because `rg` wasn't installed. The model wasted 3-5 turns per task working around it. Lesson: your harness is only as good as its environment.
- **Fuzzy edit hints (+10%)** — When `old_string` doesn't match, instead of just "not found", show the nearest matching line. The model can copy the exact text and retry. Lesson: help the model recover from errors instead of just reporting them.
- **Tighter system prompt (+8%)** — Removed fluffy instructions. The key rules that actually mattered:
  - "Use grep to find things. NEVER read an entire file."
  - "Read with start_line+limit (max 30 lines)."
  - "After writing code, ALWAYS run it immediately."
- **Clean state between runs (+5%)** — Previous agent runs left modified files that confused future runs. Lesson: agent state pollution is a real problem in harness testing.
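The fuzzy edit hint can be sketched in a few lines with `difflib` from the standard library (illustrative only — the repo's actual `edit_file` tool may be implemented differently):

```python
import difflib

def edit_file(path: str, old_string: str, new_string: str) -> str:
    """Replace the first occurrence of old_string, or return a recovery hint."""
    with open(path) as f:
        text = f.read()
    if old_string in text:
        with open(path, "w") as f:
            f.write(text.replace(old_string, new_string, 1))
        return "ok"
    # Recovery hint: surface the closest existing lines so the model can
    # copy the exact text and retry, instead of stalling on "not found".
    target = old_string.splitlines()[0] if old_string else ""
    near = difflib.get_close_matches(target, text.splitlines(), n=3, cutoff=0.5)
    if near:
        return "old_string not found. Nearest matching lines:\n" + "\n".join(near)
    return "old_string not found, and nothing similar exists in the file."
```

The key design choice is that the error message itself is actionable: the model's next tool call can reuse the quoted text verbatim.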
- **Environment bootstrap** — Added tokens to every prompt but didn't reduce turns on hard tasks. The model already knows how to run `ls` and `which python3`.
- **File index** — Same story. On a small project, the model finds files fine without an index.
- **Verbose planning prompts** — "Think step-by-step before acting" sounds good, but DeepSeek ignores it and starts calling tools immediately anyway.
refactor (0/3): The prompt "make it configurable" is too abstract. The model doesn't know it should: (1) add a dataclass field, (2) grep for hardcoded values, (3) replace each one. This is a model comprehension limit, not a harness problem.
bugfix (0/3): "Fix the error handling" requires understanding the relationship between `tool_grep`'s return value and the agent loop's tool_result construction. The model reads the code but can't connect the dots across 200+ lines.
test-gen (1/3): Works sometimes. Fails when the model gets the import path wrong (`from harness import ...` without a `sys.path.insert`). A harness-level fix would be to automatically inject the project root into PYTHONPATH.
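That PYTHONPATH fix could look roughly like this (a sketch; `run_python` is a hypothetical helper, not part of the repo):

```python
import os
import subprocess

def run_python(cmd: list[str], project_root: str) -> subprocess.CompletedProcess:
    """Run a Python command with the project root prepended to PYTHONPATH,
    so model-generated tests can `from harness import ...` without needing
    to emit their own sys.path.insert boilerplate."""
    env = os.environ.copy()
    env["PYTHONPATH"] = project_root + os.pathsep + env.get("PYTHONPATH", "")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)
```

The harness, not the model, owns the environment — consistent with the ripgrep lesson above.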
| # | Technique | LOC | Measured Impact |
|---|---|---|---|
| 1 | Agent Loop — ReAct cycle with tool calling | 40 | Foundation |
| 2 | Tool System — 6 structured tools > shell | 60×6 | Baseline requirement |
| 3 | Cache Boundary — stable prefix + dynamic suffix | 20 | -87% cost (on multi-turn) |
| 4 | Environment Bootstrap | 30 | Negligible on small projects |
| 5 | Three-Layer Memory — lazy skill loading | 50 | -95% context (on large projects) |
| 6 | Circuit Breakers — fuses on every loop | 15 | Prevents $$ disasters |
| 7 | Critic Permissions — safety before bash | 40 | Blocks 100% dangerous commands |
| 8 | Fuzzy Edit Recovery — show nearby text on failure | 25 | +10% pass rate |
Note: Techniques 3-5 matter on large projects and long sessions. On our small benchmark they showed no measurable improvement. That's honest — we measure what we can prove.
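As an illustration of Technique 3, the message list can be laid out so everything that changes per call comes last (a sketch assuming an OpenAI-style chat API; `build_messages` is a hypothetical helper):

```python
def build_messages(system_rules: str, file_index: str,
                   history: list, dynamic_state: str) -> list:
    """Provider prompt caches match on a byte-identical prefix, so the
    cacheable part must stay stable and the per-call part must come after it."""
    return [
        # STABLE PREFIX: identical every turn -> eligible for a cache hit
        {"role": "system", "content": system_rules + "\n\n" + file_index},
        # conversation so far: append-only, so the prefix only ever grows
        *history,
        # DYNAMIC SUFFIX: per-call info (cwd, timestamp, last tool result)
        {"role": "user", "content": dynamic_state},
    ]
```

The moment any timestamp or counter leaks into the system prompt, every call becomes a cache miss — hence "cache boundary" as a named technique.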
```bash
git clone https://github.com/wang2-lat/micro-harness.git
cd micro-harness
pip install anthropic openai google-genai

# Offline tests (no API key needed)
python3 benchmarks/test_offline.py

# With DeepSeek via any OpenAI-compatible API:
OPENAI_API_KEY=your-key OPENAI_BASE_URL=https://api.example.com/v1 \
MODEL=DeepSeek-V3.2 python3 src/openai_harness.py "your task here"

# With Gemini (free):
GEMINI_API_KEY=your-key python3 src/gemini_harness.py "your task here"

# Run the reliability benchmark yourself:
python3 benchmarks/run_gemini_benchmark.py
```

Supports 3 backends: Anthropic Claude, Google Gemini, any OpenAI-compatible API (DeepSeek, GLM, Kimi, etc.)
`examples/finfars.py` replaces generic tools with domain-specific ones:

```bash
python3 examples/finfars.py AAPL
# → 8-section equity research report with real SEC/Yahoo Finance data
```

Domain tools: `fetch_filing` (SEC EDGAR), `fetch_company_facts`, `fetch_stock_price` (Yahoo), `search_news`, `compute_metrics`. All free APIs, no auth required for data fetching.
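For reference, one of those domain tools expressed as an OpenAI-style function schema might look like this (a hypothetical shape — the actual definitions in `examples/finfars.py` may differ):

```python
# Hypothetical schema for the fetch_stock_price tool; field names follow the
# OpenAI tool-calling convention, not necessarily the repo's exact code.
FETCH_STOCK_PRICE = {
    "type": "function",
    "function": {
        "name": "fetch_stock_price",
        "description": "Fetch recent daily closing prices for a ticker (Yahoo Finance).",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "e.g. AAPL"},
                "days": {"type": "integer", "description": "lookback window", "default": 30},
            },
            "required": ["ticker"],
        },
    },
}
```

Swapping the six generic coding tools for schemas like this is the whole customization surface — the agent loop itself stays untouched.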
```
User Task
    │
    ▼
┌──────────────────────────────┐
│        System Prompt         │
│  ┌────────────────────────┐  │
│  │ STABLE PREFIX (cache)  │  │ ← Technique 3
│  │ + File Index (L1)      │  │ ← Technique 5
│  └────────────────────────┘  │
│  ┌────────────────────────┐  │
│  │ DYNAMIC (per-call)     │  │
│  └────────────────────────┘  │
└──────────────────────────────┘
    │
    ▼
┌──────────────────────────────┐
│  AGENT LOOP                  │ ← Technique 1
│  for turn in max_turns:      │ ← Technique 6 (fuse)
│    response = llm(msgs)      │
│    if done: break            │
│    for tool in calls:        │
│      critic_check(tool)      │ ← Technique 7
│      result = execute()      │ ← Technique 2
│      if edit_failed:         │
│        show_nearby_text()    │ ← Technique 8 (recovery)
│    msgs.append(result)       │
└──────────────────────────────┘
```
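The loop in the diagram expands to something like the following (an illustrative sketch, not the repo's exact code — `llm`, `tools`, and the banned-command list are stand-ins):

```python
def critic_check(call: dict) -> bool:
    """Toy critic (Technique 7): block obviously destructive bash commands."""
    if call["name"] != "bash":
        return True
    banned = ("rm -rf", "git push --force", "sudo")
    return not any(b in call["args"].get("cmd", "") for b in banned)

def run_agent(llm, tools: dict, task: str,
              max_turns: int = 20, token_budget: int = 200_000) -> str:
    """Minimal ReAct cycle (Technique 1) with turn and cost fuses (Technique 6)."""
    msgs = [{"role": "user", "content": task}]
    spent = 0
    for turn in range(max_turns):                  # fuse #1: turn limit
        response = llm(msgs)
        spent += response["tokens"]
        if spent > token_budget:                   # fuse #2: cost limit
            return "aborted: token budget exceeded"
        if response.get("done"):
            return response["text"]
        for call in response["tool_calls"]:
            if critic_check(call):                 # safety gate before execution
                result = tools[call["name"]](**call["args"])  # Technique 2
            else:
                result = "blocked by critic"
            msgs.append({"role": "tool", "content": str(result)})
    return "aborted: turn limit"
```

Every exit path is bounded — the loop cannot spend unlimited money or turns no matter what the model does.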
We tried 3 approaches to crack the remaining 3 tasks. All failed.
| Approach | What We Did | Result | Why It Failed |
|---|---|---|---|
| Switch model | Gemini 2.5 Flash | 0/3 | Faster but not smarter. Gave up on bugfix after 3 turns |
| Dual-agent | Planner agent + executor agent | 0/3 | Planner read lots of code but generated vague, non-executable steps |
| More turns | Raised max_turns to 25 | 0/9 | More turns = more flailing, not more progress |
Real conclusion: These 3 tasks (abstract refactoring, cross-function bugfix, stable test generation) exceed the capability boundary of DeepSeek-V3.2 and Gemini 2.5 Flash. They likely need Claude Sonnet/Opus-level reasoning, or a fundamentally different harness architecture (code graph + symbolic reasoning).
50% reliable pass rate is the honest ceiling for this model + harness combination.
A harness is a multiplier. The model is the base. If the base is zero, no multiplier helps.
```
DeepSeek + no optimization        → 0/9
DeepSeek + planning prompt        → 0/3
DeepSeek + fuzzy edit             → 0/3
DeepSeek + 25 turns               → 0/3
DeepSeek + Claude's plan          → 0/3 (16x more expensive)
DeepSeek + everything combined    → 0/3

Claude Sonnet + zero optimization → ✅ 10 turns, 23K tokens
GPT-5.4 + zero optimization       → ✅ 7 turns, 27K tokens
```
24 attempts across 6 prompt/architecture variants. All failed on DeepSeek. Both Claude and GPT passed first try with no optimization. Prompt engineering cannot compensate for model capability gaps.
`src/router.py` auto-selects the best model based on task type:
```
$ python3 src/router.py
"Compare harness.py and openai_harness.py" → analysis → DeepSeek (cheap)
"Make truncation configurable"             → refactor → Claude Sonnet (smart)
"Write tests for critic_check"             → test     → GLM-4.7 (fast)
"Fix the error handling"                   → bugfix   → Claude Sonnet (smart)
```
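A router like this can be as simple as keyword matching (a toy sketch reproducing the routing shown above; the actual logic in `src/router.py` may differ):

```python
def route(task: str) -> str:
    """Map a task description to a model tier by keyword (illustrative only)."""
    t = task.lower()
    # hard cross-file reasoning -> expensive, capable model
    if any(k in t for k in ("fix", "bug", "refactor", "configurable")):
        return "claude-sonnet"
    # test generation -> fast, cheap model that handles boilerplate well
    if "test" in t:
        return "glm-4.7"
    # everything else (analysis, comparison) -> cheapest model
    return "deepseek-v3.2"
```

The point is not the classifier's sophistication but the economics: routing only the two hardest task types to the expensive model is what produced the -61% token reduction.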
Real-world benchmark (6 GitHub Issue-style tasks):
| | All-DeepSeek | With Router | Change |
|---|---|---|---|
| Pass rate | 2/6 (33%) | 3/6 (50%) | +50% |
| Total tokens | 596K | 234K | -61% |
The router sent bugfix/refactor to Claude (both passed), tests to GLM (1 turn vs 14), and kept simple tasks on DeepSeek.
The remaining 3/6 failures are real unsolved problems:
- Abstract task decomposition — How to make a model understand "make it configurable" without spelling out the steps
- Cross-function reasoning — How to help models connect code across 200+ lines
- Stable test generation — Automatic PYTHONPATH injection, import resolution
These are open research problems. PRs welcome.
MIT