A minimal agent harness you can read in 30 minutes — and the honest story of taking it from 17% to 50% pass rate on hard coding tasks.
After the Claude Code source leak (March 31, 2026), we learned that the harness around an LLM matters more than the model. Stanford showed a 6× gap on the same model with different harnesses. But production harnesses are 512K+ lines. This one is ~400.
We ran the same 2 hardest coding tasks through 7 frontier models on the same harness. No prompt tuning per model โ identical system prompt, identical tools.
| Model | refactor | bugfix | Score | Tokens |
|---|---|---|---|---|
| GPT-5.4 | ✅ 7 turns | ✅ 6 turns | 2/2 | 58K |
| Claude Sonnet 4.6 | ✅ 10 turns | ✅ 14 turns | 2/2 | 51K |
| GPT-5.4-mini | ✅ 10 turns | ✅ 8 turns | 2/2 | 96K |
| Claude Opus 4.6 | ❌ | ✅ 12 turns | 1/2 | 75K |
| GPT-5.3-codex | ❌ | ✅ 7 turns | 1/2 | 91K |
| Gemini-3-Flash | ❌ | ❌ | 0/2 | 143K |
| Gemini-3-Pro | ❌ | ❌ | 0/2 | 71K |
GPT-5.4 wins on speed (13 total turns). Claude Sonnet wins on cost (51K tokens). Gemini 3 fails both tasks. Opus loses to Sonnet — more expensive ≠ better.
These same 2 tasks scored 0/9 on DeepSeek-V3.2. Model selection has 10ร more impact than any harness optimization.
We ran 6 hard coding tasks (refactoring, test generation, feature implementation, debugging, bug hunting, cross-file analysis) on DeepSeek-V3.2 with natural language prompts — no hand-holding, no step-by-step instructions. Each task was run 3 times to test reliability.
| Task | 3-Trial Rate | Status |
|---|---|---|
| analysis | 3/3 (100%) | ✅ Reliable |
| debug | 3/3 (100%) | ✅ Reliable |
| feature | 2/3 (67%) | ⚠️ Mostly reliable |
| test-gen | 1/3 (33%) | ⚠️ Unreliable |
| refactor | 0/3 (0%) | ❌ Fails |
| bugfix | 0/3 (0%) | ❌ Fails |

Reliable (≥2/3): 3/6 = 50%
We started at 1/6 (17%) and iterated through 5 rounds:
| Round | Change | Result | What We Learned |
|---|---|---|---|
| 0 | Baseline — no optimizations | 1/6 (17%) | Most tasks hit turn limit |
| 1 | Added planning prompt, raised max_turns to 20 | 2/6 (33%) | Planning helps, but not enough |
| 2 | Added fuzzy edit hints, smarter error messages | 4/6 (67%) | Edit recovery is huge — showing nearby text when exact match fails |
| 3 | Step-by-step task prompts | 5/6 (83%) | But this was cheating — teaching to the test |
| 4 | Claimed "100%" | 6/6 | Was lying to ourselves — changed tasks and only ran once |
| 5 | Honest retest: natural prompts, 3 trials each | 3/6 (50%) | The real number |
- **Installing ripgrep (+15%)** — The grep tool was returning errors on every call because `rg` wasn't installed. The model wasted 3-5 turns per task working around it. Lesson: your harness is only as good as its environment.
- **Fuzzy edit hints (+10%)** — When `old_string` doesn't match, instead of just "not found", show the nearest matching line. The model can copy the exact text and retry. Lesson: help the model recover from errors instead of just reporting them.
- **Tighter system prompt (+8%)** — Removed fluffy instructions. The key rules that actually mattered:
  - "Use grep to find things. NEVER read an entire file."
  - "Read with start_line+limit (max 30 lines)."
  - "After writing code, ALWAYS run it immediately."
- **Clean state between runs (+5%)** — Previous agent runs left modified files that confused future runs. Lesson: agent state pollution is a real problem in harness testing.
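The fuzzy edit hint can be sketched in a few lines with `difflib` from the standard library (illustrative only — the repo's actual `edit_file` tool may be implemented differently):

```python
import difflib

def edit_file(path: str, old_string: str, new_string: str) -> str:
    """Replace the first occurrence of old_string, or return a recovery hint."""
    with open(path) as f:
        text = f.read()
    if old_string in text:
        with open(path, "w") as f:
            f.write(text.replace(old_string, new_string, 1))
        return "ok"
    # Recovery hint: surface the closest existing lines so the model can
    # copy the exact text and retry, instead of stalling on "not found".
    target = old_string.splitlines()[0] if old_string else ""
    near = difflib.get_close_matches(target, text.splitlines(), n=3, cutoff=0.5)
    if near:
        return "old_string not found. Nearest matching lines:\n" + "\n".join(near)
    return "old_string not found, and nothing similar exists in the file."
```

The key design choice is that the error message itself is actionable: the model's next tool call can reuse the quoted text verbatim.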
- **Environment bootstrap** — Added tokens to every prompt but didn't reduce turns on hard tasks. The model already knows how to run `ls` and `which python3`.
- **File index** — Same story. On a small project, the model finds files fine without an index.
- **Verbose planning prompts** — "Think step-by-step before acting" sounds good, but DeepSeek ignores it and starts calling tools immediately anyway.
refactor (0/3): The prompt "make it configurable" is too abstract. The model doesn't know it should: (1) add a dataclass field, (2) grep for hardcoded values, (3) replace each one. This is a model comprehension limit, not a harness problem.
bugfix (0/3): "Fix the error handling" requires understanding the relationship between `tool_grep`'s return value and the agent loop's tool_result construction. The model reads the code but can't connect the dots across 200+ lines.
test-gen (1/3): Works sometimes. Fails when the model gets the import path wrong (`from harness import ...` without a `sys.path.insert`). A harness-level fix would be to automatically inject the project root into PYTHONPATH.
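That PYTHONPATH fix could look roughly like this (a sketch; `run_python` is a hypothetical helper, not part of the repo):

```python
import os
import subprocess

def run_python(cmd: list[str], project_root: str) -> subprocess.CompletedProcess:
    """Run a Python command with the project root prepended to PYTHONPATH,
    so model-generated tests can `from harness import ...` without needing
    to emit their own sys.path.insert boilerplate."""
    env = os.environ.copy()
    env["PYTHONPATH"] = project_root + os.pathsep + env.get("PYTHONPATH", "")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)
```

The harness, not the model, owns the environment — consistent with the ripgrep lesson above.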
| # | Technique | LOC | Measured Impact |
|---|---|---|---|
| 1 | Agent Loop — ReAct cycle with tool calling | 40 | Foundation |
| 2 | Tool System — 6 structured tools > shell | 60×6 | Baseline requirement |
| 3 | Cache Boundary — stable prefix + dynamic suffix | 20 | -87% cost (on multi-turn) |
| 4 | Environment Bootstrap | 30 | Negligible on small projects |
| 5 | Three-Layer Memory — lazy skill loading | 50 | -95% context (on large projects) |
| 6 | Circuit Breakers — fuses on every loop | 15 | Prevents $$ disasters |
| 7 | Critic Permissions — safety before bash | 40 | Blocks 100% dangerous commands |
| 8 | Fuzzy Edit Recovery — show nearby text on failure | 25 | +10% pass rate |
Note: Techniques 3-5 matter on large projects and long sessions. On our small benchmark they showed no measurable improvement. That's honest — we measure what we can prove.
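As an illustration of Technique 3, the message list can be laid out so everything that changes per call comes last (a sketch assuming an OpenAI-style chat API; `build_messages` is a hypothetical helper):

```python
def build_messages(system_rules: str, file_index: str,
                   history: list, dynamic_state: str) -> list:
    """Provider prompt caches match on a byte-identical prefix, so the
    cacheable part must stay stable and the per-call part must come after it."""
    return [
        # STABLE PREFIX: identical every turn -> eligible for a cache hit
        {"role": "system", "content": system_rules + "\n\n" + file_index},
        # conversation so far: append-only, so the prefix only ever grows
        *history,
        # DYNAMIC SUFFIX: per-call info (cwd, timestamp, last tool result)
        {"role": "user", "content": dynamic_state},
    ]
```

The moment any timestamp or counter leaks into the system prompt, every call becomes a cache miss — hence "cache boundary" as a named technique.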
```bash
git clone https://github.com/wang2-lat/micro-harness.git
cd micro-harness
pip install anthropic openai google-genai

# Offline tests (no API key needed)
python3 benchmarks/test_offline.py

# With DeepSeek via any OpenAI-compatible API:
OPENAI_API_KEY=your-key OPENAI_BASE_URL=https://api.example.com/v1 \
MODEL=DeepSeek-V3.2 python3 src/openai_harness.py "your task here"

# With Gemini (free):
GEMINI_API_KEY=your-key python3 src/gemini_harness.py "your task here"

# Run the reliability benchmark yourself:
python3 benchmarks/run_gemini_benchmark.py
```

Supports 3 backends: Anthropic Claude, Google Gemini, any OpenAI-compatible API (DeepSeek, GLM, Kimi, etc.)
`examples/finfars.py` replaces generic tools with domain-specific ones:

```bash
python3 examples/finfars.py AAPL
# → 8-section equity research report with real SEC/Yahoo Finance data
```

Domain tools: `fetch_filing` (SEC EDGAR), `fetch_company_facts`, `fetch_stock_price` (Yahoo), `search_news`, `compute_metrics`. All free APIs, no auth required for data fetching.
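For reference, one of those domain tools expressed as an OpenAI-style function schema might look like this (a hypothetical shape — the actual definitions in `examples/finfars.py` may differ):

```python
# Hypothetical schema for the fetch_stock_price tool; field names follow the
# OpenAI tool-calling convention, not necessarily the repo's exact code.
FETCH_STOCK_PRICE = {
    "type": "function",
    "function": {
        "name": "fetch_stock_price",
        "description": "Fetch recent daily closing prices for a ticker (Yahoo Finance).",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "e.g. AAPL"},
                "days": {"type": "integer", "description": "lookback window", "default": 30},
            },
            "required": ["ticker"],
        },
    },
}
```

Swapping the six generic coding tools for schemas like this is the whole customization surface — the agent loop itself stays untouched.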
```
User Task
    │
    ▼
┌──────────────────────────────┐
│        System Prompt         │
│  ┌────────────────────────┐  │
│  │ STABLE PREFIX (cache)  │  │ ← Technique 3
│  │ + File Index (L1)      │  │ ← Technique 5
│  └────────────────────────┘  │
│  ┌────────────────────────┐  │
│  │ DYNAMIC (per-call)     │  │
│  └────────────────────────┘  │
└──────────────────────────────┘
    │
    ▼
┌──────────────────────────────┐
│  AGENT LOOP                  │ ← Technique 1
│  for turn in max_turns:      │ ← Technique 6 (fuse)
│    response = llm(msgs)      │
│    if done: break            │
│    for tool in calls:        │
│      critic_check(tool)      │ ← Technique 7
│      result = execute()      │ ← Technique 2
│      if edit_failed:         │
│        show_nearby_text()    │ ← Technique 8 (recovery)
│    msgs.append(result)       │
└──────────────────────────────┘
```
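The loop in the diagram expands to something like the following (an illustrative sketch, not the repo's exact code — `llm`, `tools`, and the banned-command list are stand-ins):

```python
def critic_check(call: dict) -> bool:
    """Toy critic (Technique 7): block obviously destructive bash commands."""
    if call["name"] != "bash":
        return True
    banned = ("rm -rf", "git push --force", "sudo")
    return not any(b in call["args"].get("cmd", "") for b in banned)

def run_agent(llm, tools: dict, task: str,
              max_turns: int = 20, token_budget: int = 200_000) -> str:
    """Minimal ReAct cycle (Technique 1) with turn and cost fuses (Technique 6)."""
    msgs = [{"role": "user", "content": task}]
    spent = 0
    for turn in range(max_turns):                  # fuse #1: turn limit
        response = llm(msgs)
        spent += response["tokens"]
        if spent > token_budget:                   # fuse #2: cost limit
            return "aborted: token budget exceeded"
        if response.get("done"):
            return response["text"]
        for call in response["tool_calls"]:
            if critic_check(call):                 # safety gate before execution
                result = tools[call["name"]](**call["args"])  # Technique 2
            else:
                result = "blocked by critic"
            msgs.append({"role": "tool", "content": str(result)})
    return "aborted: turn limit"
```

Every exit path is bounded — the loop cannot spend unlimited money or turns no matter what the model does.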
We tried 3 approaches to crack the remaining 3 tasks. All failed.
| Approach | What We Did | Result | Why It Failed |
|---|---|---|---|
| Switch model | Gemini 2.5 Flash | 0/3 | Faster but not smarter. Gave up on bugfix after 3 turns |
| Dual-agent | Planner agent + executor agent | 0/3 | Planner read lots of code but generated vague, non-executable steps |
| More turns | Raised max_turns to 25 | 0/9 | More turns = more flailing, not more progress |
Real conclusion: These 3 tasks (abstract refactoring, cross-function bugfix, stable test generation) exceed the capability boundary of DeepSeek-V3.2 and Gemini 2.5 Flash. They likely need Claude Sonnet/Opus-level reasoning, or a fundamentally different harness architecture (code graph + symbolic reasoning).
50% reliable pass rate is the honest ceiling for this model + harness combination.
A harness is a multiplier. The model is the base. If the base is zero, no multiplier helps.
```
DeepSeek + no optimization        → 0/9
DeepSeek + planning prompt        → 0/3
DeepSeek + fuzzy edit             → 0/3
DeepSeek + 25 turns               → 0/3
DeepSeek + Claude's plan          → 0/3 (16x more expensive)
DeepSeek + everything combined    → 0/3

Claude Sonnet + zero optimization → ✅ 10 turns, 23K tokens
GPT-5.4 + zero optimization       → ✅ 7 turns, 27K tokens
```
24 attempts across 6 prompt/architecture variants. All failed on DeepSeek. Both Claude and GPT passed first try with no optimization. Prompt engineering cannot compensate for model capability gaps.
`src/router.py` auto-selects the best model based on task type:
```
$ python3 src/router.py
"Compare harness.py and openai_harness.py" → analysis → DeepSeek (cheap)
"Make truncation configurable"             → refactor → Claude Sonnet (smart)
"Write tests for critic_check"             → test     → GLM-4.7 (fast)
"Fix the error handling"                   → bugfix   → Claude Sonnet (smart)
```
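A router like this can be as simple as keyword matching (a toy sketch reproducing the routing shown above; the actual logic in `src/router.py` may differ):

```python
def route(task: str) -> str:
    """Map a task description to a model tier by keyword (illustrative only)."""
    t = task.lower()
    # hard cross-file reasoning -> expensive, capable model
    if any(k in t for k in ("fix", "bug", "refactor", "configurable")):
        return "claude-sonnet"
    # test generation -> fast, cheap model that handles boilerplate well
    if "test" in t:
        return "glm-4.7"
    # everything else (analysis, comparison) -> cheapest model
    return "deepseek-v3.2"
```

The point is not the classifier's sophistication but the economics: routing only the two hardest task types to the expensive model is what produced the -61% token reduction.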
Real-world benchmark (6 GitHub Issue-style tasks):
| | All-DeepSeek | With Router | Change |
|---|---|---|---|
| Pass rate | 2/6 (33%) | 3/6 (50%) | +50% |
| Total tokens | 596K | 234K | -61% |
The router sent bugfix/refactor to Claude (both passed), tests to GLM (1 turn vs 14), and kept simple tasks on DeepSeek.
The remaining 3/6 failures are real unsolved problems:
- Abstract task decomposition — How to make a model understand "make it configurable" without spelling out the steps
- Cross-function reasoning — How to help models connect code across 200+ lines
- Stable test generation — Automatic PYTHONPATH injection, import resolution
These are open research problems. PRs welcome.
MIT