micro-harness 🎯

English | 简体中文

A minimal agent harness you can read in 30 minutes, and the honest story of taking it from a 17% to a 50% pass rate on hard coding tasks.

After the Claude Code source leak (March 31, 2026), we learned that the harness around an LLM matters more than the model. Stanford showed a 6× gap on the same model with different harnesses. But production harnesses are 512K+ lines. This one is ~400.

7-Model Battle (the headline result)

We ran the 2 hardest coding tasks through 7 frontier models on the same harness. No prompt tuning per model: identical system prompt, identical tools.

| Model | refactor | bugfix | Score | Tokens |
|---|---|---|---|---|
| GPT-5.4 | ✓ 7 turns | ✓ 6 turns | 2/2 | 58K |
| Claude Sonnet 4.6 | ✓ 10 turns | ✓ 14 turns | 2/2 | 51K |
| GPT-5.4-mini | ✓ 10 turns | ✓ 8 turns | 2/2 | 96K |
| Claude Opus 4.6 | ✗ | ✓ 12 turns | 1/2 | 75K |
| GPT-5.3-codex | ✗ | ✓ 7 turns | 1/2 | 91K |
| Gemini-3-Flash | ✗ | ✗ | 0/2 | 143K |
| Gemini-3-Pro | ✗ | ✗ | 0/2 | 71K |

GPT-5.4 wins on speed (13 total turns). Claude Sonnet wins on cost (51K tokens). Gemini 3 fails both tasks. Opus loses to Sonnet: more expensive ≠ better.

These same 2 tasks scored 0/9 on DeepSeek-V3.2. Model selection has 10× more impact than any harness optimization.

Real Benchmark Results (no BS)

We ran 6 hard coding tasks (refactoring, test generation, feature implementation, debugging, bug hunting, cross-file analysis) on DeepSeek-V3.2 with natural language prompts: no hand-holding, no step-by-step instructions. Each task was run 3 times to test reliability.

The Honest Numbers

Task          3-Trial Rate    Status
─────────────────────────────────────
analysis      3/3  (100%)     ✓ Reliable
debug         3/3  (100%)     ✓ Reliable
feature       2/3  (67%)      ✓ Mostly reliable
test-gen      1/3  (33%)      ✗ Unreliable
refactor      0/3  (0%)       ✗ Fails
bugfix        0/3  (0%)       ✗ Fails

Reliable (≥2/3): 3/6 = 50%

The Journey (17% → 50%)

We started at 1/6 (17%) and iterated through 5 rounds:

| Round | Change | Result | What We Learned |
|---|---|---|---|
| 0 | Baseline, no optimizations | 1/6 (17%) | Most tasks hit the turn limit |
| 1 | Added planning prompt, raised max_turns to 20 | 2/6 (33%) | Planning helps, but not enough |
| 2 | Added fuzzy edit hints, smarter error messages | 4/6 (67%) | Edit recovery is huge: show nearby text when an exact match fails |
| 3 | Step-by-step task prompts | 5/6 (83%) | But this was cheating: teaching to the test |
| 4 | Claimed "100%" | 6/6 | We were lying to ourselves: we changed tasks and only ran once |
| 5 | Honest retest: natural prompts, 3 trials each | 3/6 (50%) | The real number |

What Actually Moved the Needle

  1. Installing ripgrep (+15%): The grep tool was returning errors on every call because rg wasn't installed. The model wasted 3-5 turns per task working around it. Lesson: your harness is only as good as its environment.

  2. Fuzzy edit hints (+10%): When old_string doesn't match, instead of just "not found", show the nearest matching line. The model can copy the exact text and retry. Lesson: help the model recover from errors instead of just reporting them.
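A minimal sketch of such a recovery hint using stdlib difflib (the function name is hypothetical; the repo's actual implementation may differ):

```python
import difflib

def nearest_match_hint(old_string: str, file_text: str) -> str:
    """When an exact edit target is not found, suggest the closest line
    so the model can copy it verbatim and retry."""
    lines = file_text.splitlines()
    close = difflib.get_close_matches(old_string, lines, n=1, cutoff=0.5)
    if close:
        return f"old_string not found. Nearest line: {close[0]!r}"
    return "old_string not found and no similar line exists."
```

Returning the nearest line verbatim is what makes the retry cheap: the model pastes it back as the new old_string instead of guessing.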

  3. Tighter system prompt (+8%): Removed fluffy instructions. The key rules that actually mattered:

    • "Use grep to find things. NEVER read an entire file."
    • "Read with start_line+limit (max 30 lines)."
    • "After writing code, ALWAYS run it immediately."
  4. Clean state between runs (+5%): Previous agent runs left modified files that confused future runs. Lesson: agent state pollution is a real problem in harness testing.
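A workspace reset for point 4 can be as small as two git commands (a sketch, assuming the benchmark project is a git repo; the repo's actual cleanup may differ):

```python
import subprocess

def reset_workspace(repo_dir: str) -> None:
    """Discard leftover modifications from a previous agent run so each
    benchmark trial starts from a clean tree."""
    # Restore all tracked files to their committed state.
    subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir, check=True)
    # Remove untracked files and directories the agent created.
    subprocess.run(["git", "clean", "-fd"], cwd=repo_dir, check=True)
```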

What Didn't Help (Or Made Things Worse)

  • Environment bootstrap: Added tokens to every prompt but didn't reduce turns on hard tasks. The model already knows how to run ls and which python3.
  • File index: Same story. On a small project, the model finds files fine without an index.
  • Verbose planning prompts: "Think step-by-step before acting" sounds good, but DeepSeek ignores it and starts calling tools immediately anyway.

What Still Fails and Why

refactor (0/3): The prompt "make it configurable" is too abstract. The model doesn't know it should: (1) add a dataclass field, (2) grep for hardcoded values, (3) replace each one. This is a model comprehension limit, not a harness problem.

bugfix (0/3): "Fix the error handling" requires understanding the relationship between tool_grep's return value and the agent loop's tool_result construction. The model reads the code but can't connect the dots across 200+ lines.

test-gen (1/3): Works sometimes. Fails when the model gets the import path wrong (from harness import ... without sys.path.insert). A harness-level fix would be to automatically inject the project root into PYTHONPATH.
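That harness-level fix could look roughly like this (a sketch; the function name and placement are assumptions, not the repo's code):

```python
import os
import sys

def inject_project_root(project_root: str) -> None:
    """Make `from harness import ...` resolve inside generated tests by
    putting the project root on sys.path and PYTHONPATH (for subprocesses)."""
    if project_root not in sys.path:
        sys.path.insert(0, project_root)
    existing = os.environ.get("PYTHONPATH", "")
    parts = existing.split(os.pathsep) if existing else []
    if project_root not in parts:
        os.environ["PYTHONPATH"] = os.pathsep.join([project_root, *parts])
```

Calling this before the agent runs generated tests removes the whole class of `ModuleNotFoundError` failures without changing the model's output.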

The 8 Techniques

| # | Technique | LOC | Measured Impact |
|---|---|---|---|
| 1 | Agent Loop: ReAct cycle with tool calling | 40 | Foundation |
| 2 | Tool System: 6 structured tools > shell | 60×6 | Baseline requirement |
| 3 | Cache Boundary: stable prefix + dynamic suffix | 20 | -87% cost (on multi-turn) |
| 4 | Environment Bootstrap | 30 | Negligible on small projects |
| 5 | Three-Layer Memory: lazy skill loading | 50 | -95% context (on large projects) |
| 6 | Circuit Breakers: fuses on every loop | 15 | Prevents $$ disasters |
| 7 | Critic Permissions: safety check before bash | 40 | Blocks 100% of dangerous commands |
| 8 | Fuzzy Edit Recovery: show nearby text on failure | 25 | +10% pass rate |

Note: Techniques 3-5 matter on large projects and long sessions. On our small benchmark they showed no measurable improvement. That's honest: we measure what we can prove.
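To make Techniques 6 and 7 concrete, a fuse plus a bash critic can fit in a few lines (a sketch; the patterns and constants here are illustrative, not the repo's actual rules):

```python
import re

MAX_TURNS = 20       # fuse: hard cap on agent-loop iterations
MAX_TOOL_CALLS = 50  # fuse: hard cap on total tool calls per task

# Patterns a critic could refuse before any bash command executes.
DANGEROUS = [
    r"\brm\s+-rf\s+/",        # recursive delete from root
    r"\bgit\s+push\s+--force", # history rewrite on a shared remote
    r"\bcurl\b.*\|\s*sh",      # pipe-to-shell installs
]

def critic_check(command: str) -> bool:
    """Return True if the bash command looks safe to execute."""
    return not any(re.search(p, command) for p in DANGEROUS)
```

The point of the fuse constants is that every loop in the harness has a hard upper bound, so a confused model can waste at most a fixed budget rather than an unbounded one.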

Quick Start

git clone https://github.com/wang2-lat/micro-harness.git
cd micro-harness
pip install anthropic openai google-genai

# Offline tests (no API key needed)
python3 benchmarks/test_offline.py

# With DeepSeek via any OpenAI-compatible API:
OPENAI_API_KEY=your-key OPENAI_BASE_URL=https://api.example.com/v1 \
MODEL=DeepSeek-V3.2 python3 src/openai_harness.py "your task here"

# With Gemini (free):
GEMINI_API_KEY=your-key python3 src/gemini_harness.py "your task here"

# Run the reliability benchmark yourself:
python3 benchmarks/run_gemini_benchmark.py

Supports 3 backends: Anthropic Claude, Google Gemini, and any OpenAI-compatible API (DeepSeek, GLM, Kimi, etc.).

Example: FinFars, a Vertical Harness for Equity Research

examples/finfars.py replaces generic tools with domain-specific ones:

python3 examples/finfars.py AAPL
# → 8-section equity research report with real SEC/Yahoo Finance data

Domain tools: fetch_filing (SEC EDGAR), fetch_company_facts, fetch_stock_price (Yahoo), search_news, compute_metrics. All free APIs, no auth required for data fetching.
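For illustration, one of those domain tools might be declared with an OpenAI-style function schema like this (a hypothetical shape; the real definitions live in examples/finfars.py):

```python
# Hypothetical OpenAI-style tool schema for a FinFars domain tool.
FETCH_STOCK_PRICE = {
    "type": "function",
    "function": {
        "name": "fetch_stock_price",
        "description": "Fetch recent price data for a ticker from Yahoo Finance.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "e.g. AAPL"},
                "period": {"type": "string", "description": "e.g. 1mo, 1y"},
            },
            "required": ["ticker"],
        },
    },
}
```

Swapping the generic file tools for a handful of schemas like this is the entire difference between the coding harness and the research harness.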

Architecture

User Task
    │
    ▼
┌────────────────────────────┐
│  System Prompt             │
│  ┌──────────────────────┐  │
│  │ STABLE PREFIX (cache)│  │ ← Technique 3
│  │ + File Index (L1)    │  │ ← Technique 5
│  └──────────────────────┘  │
│  ┌──────────────────────┐  │
│  │ DYNAMIC (per-call)   │  │
│  └──────────────────────┘  │
└────────────────────────────┘
    │
    ▼
┌────────────────────────────┐
│   AGENT LOOP               │ ← Technique 1
│   for turn in max_turns:   │ ← Technique 6 (fuse)
│     response = llm(msgs)   │
│     if done: break         │
│     for tool in calls:     │
│       critic_check(tool)   │ ← Technique 7
│       result = execute()   │ ← Technique 2
│       if edit_failed:      │
│         show_nearby_text() │ ← Technique 8 (recovery)
│       msgs.append(result)  │
└────────────────────────────┘
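The stable-prefix/dynamic split in the diagram can be sketched in a few lines (illustrative names; assumes a provider that caches by prompt prefix, as Anthropic's prompt caching does):

```python
# The stable prefix must be byte-identical across calls so the provider's
# prefix cache can reuse it; everything that changes goes after it.
STABLE_PREFIX = (
    "You are a coding agent.\n"
    "Use grep to find things. NEVER read an entire file.\n"
)

def build_messages(file_index: str, task: str, history: list) -> list:
    """Assemble messages with the cache boundary between the stable
    system prompt and the per-call dynamic content."""
    system = STABLE_PREFIX + "\n# File index\n" + file_index
    return [{"role": "system", "content": system},
            *history,
            {"role": "user", "content": task}]
```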

What We Tried and Failed (Documented Honestly)

We tried 3 approaches to crack the remaining 3 tasks. All failed.

| Approach | What We Did | Result | Why It Failed |
|---|---|---|---|
| Switch model | Gemini 2.5 Flash | 0/3 | Faster but not smarter; gave up on bugfix after 3 turns |
| Dual-agent | Planner agent + executor agent | 0/3 | Planner read lots of code but generated vague, non-executable steps |
| More turns | Raised max_turns to 25 | 0/9 | More turns = more flailing, not more progress |

Real conclusion: These 3 tasks (abstract refactoring, cross-function bugfix, stable test generation) exceed the capability boundary of DeepSeek-V3.2 and Gemini 2.5 Flash. They likely need Claude Sonnet/Opus-level reasoning, or a fundamentally different harness architecture (code graph + symbolic reasoning).

50% reliable pass rate is the honest ceiling for this model + harness combination.

The Bottom Line

A harness is a multiplier. The model is the base. If the base is zero, no multiplier helps.

DeepSeek + no optimization      → 0/9
DeepSeek + planning prompt      → 0/3
DeepSeek + fuzzy edit           → 0/3
DeepSeek + 25 turns             → 0/3
DeepSeek + Claude's plan        → 0/3  (16x more expensive)
DeepSeek + everything combined  → 0/3

Claude Sonnet + zero optimization → ✓  10 turns, 23K tokens
GPT-5.4 + zero optimization       → ✓  7 turns, 27K tokens

24 attempts across 6 prompt/architecture variants. All failed on DeepSeek. Both Claude and GPT passed first try with no optimization. Prompt engineering cannot compensate for model capability gaps.

Model Router (saves 61% tokens)

src/router.py auto-selects the best model based on task type:

$ python3 src/router.py
"Compare harness.py and openai_harness.py"  → analysis → DeepSeek (cheap)
"Make truncation configurable"              → refactor → Claude Sonnet (smart)
"Write tests for critic_check"              → test     → GLM-4.7 (fast)
"Fix the error handling"                    → bugfix   → Claude Sonnet (smart)
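A minimal sketch of how such routing might work, using keyword heuristics (the model labels and keyword lists are assumptions; the real src/router.py may classify differently):

```python
# Hypothetical task-type → model routing table.
ROUTES = {
    "analysis": "DeepSeek-V3.2",   # cheap
    "refactor": "claude-sonnet",   # smart
    "test":     "glm-4.7",         # fast
    "bugfix":   "claude-sonnet",   # smart
}

# First matching keyword set wins (dicts preserve insertion order).
KEYWORDS = {
    "analysis": ("compare", "explain", "analyze"),
    "refactor": ("refactor", "configurable", "restructure"),
    "test":     ("test", "tests", "pytest"),
    "bugfix":   ("fix", "bug", "error"),
}

def route(task: str) -> str:
    """Return the model to use for a task based on simple keyword matching."""
    lowered = task.lower()
    for kind, words in KEYWORDS.items():
        if any(w in lowered for w in words):
            return ROUTES[kind]
    return ROUTES["analysis"]  # default to the cheap model
```

Keyword routing is crude, but the table above suggests even a crude classifier captures most of the savings, since the expensive model only needs to see the two task types the cheap one fails.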

Real-world benchmark (6 GitHub Issue-style tasks):

| Metric | All-DeepSeek | With Router | Change |
|---|---|---|---|
| Pass rate | 2/6 (33%) | 3/6 (50%) | +50% |
| Total tokens | 596K | 234K | -61% |

The router sent bugfix/refactor to Claude (both passed), tests to GLM (1 turn vs 14), and kept simple tasks on DeepSeek.

What's Next

The remaining 3/6 failures are real unsolved problems:

  1. Abstract task decomposition: how to make a model understand "make it configurable" without spelling out the steps
  2. Cross-function reasoning: how to help models connect code across 200+ lines
  3. Stable test generation: automatic PYTHONPATH injection, import resolution

These are open research problems. PRs welcome.

License

MIT
