Give your AI coding agent a memory that learns from failure — and a debugger that knows where to look.
AI agents repeat the same mistakes because they forget everything between sessions. Debug Bank fixes this: a pattern-first debugging system that checks "have I seen this before?" in 30 seconds, provides targeted breakpoints for runtime debuggers, and catches known failure patterns before they ship.
One-liner: Drop a `CLAUDE.md` into your project. Your agent never makes the same debugging mistake twice — and when it uses a debugger, it knows exactly which breakpoints to set.
```bash
curl -O https://raw.githubusercontent.com/soleimanmansouri/debug-bank/main/CLAUDE.md
```

```mermaid
graph TD
    BUG["Bug Reported"] --> PC["Step 1: Pattern Check (30s)"]
    DEPLOY["About to Deploy"] --> PDS["Pre-Deploy Scan"]
    PDS -->|"Patterns flagged"| REVIEW["Review / Fix before shipping"]
    PDS -->|"No matches"| SHIP["Deploy"]
    REVIEW --> SHIP
    PC -->|"Match found"| VERIFY["Verify known fix applies"]
    PC -->|"No match"| REPRODUCE["Step 2: Reproduce"]
    VERIFY -->|"Confirmed"| FIX["Step 6: Fix"]
    VERIFY -->|"Doesn't apply"| REPRODUCE
    REPRODUCE --> HYPOTHESIZE["Step 3: Hypothesize (2-3 ranked causes)"]
    HYPOTHESIZE --> ISOLATE["Step 4: Isolate (binary search)"]
    ISOLATE -->|"3 failures"| STOP["STOP — 3-Exchange Rule"]
    STOP --> REPLAN["Re-plan / Add logging / Switch strategy"]
    REPLAN --> HYPOTHESIZE
    ISOLATE -->|"Found it"| DIAGNOSE["Step 5: Diagnose (trace call chain)"]
    DIAGNOSE --> FIX
    FIX --> RECORD["Step 7: Record trajectory"]
    RECORD --> PB["Pattern Bank grows"]
    RECORD --> DC["Domain Catalog grows"]
    PB -.->|"Next bug"| PC
    DC -.->|"Next bug"| PC
    PB -.->|"Next deploy"| PDS
    DC -.->|"Next deploy"| PDS
    style BUG fill:#ff6b6b,stroke:#333,color:#fff
    style DEPLOY fill:#fd9644,stroke:#333,color:#fff
    style PC fill:#4ecdc4,stroke:#333,color:#fff
    style PDS fill:#4ecdc4,stroke:#333,color:#fff
    style STOP fill:#ff6b6b,stroke:#333,color:#fff
    style FIX fill:#95e77e,stroke:#333,color:#000
    style RECORD fill:#a29bfe,stroke:#333,color:#fff
    style PB fill:#74b9ff,stroke:#333,color:#000
    style DC fill:#74b9ff,stroke:#333,color:#000
    style SHIP fill:#95e77e,stroke:#333,color:#000
```
Six components that compound over time, plus a runtime bridge that makes debuggers smart:
| Component | What It Does | How It Helps |
|---|---|---|
| Pattern Bank (P01-P22) | Generalized root cause patterns with debugger strategies | 30-second match + targeted breakpoints |
| Symptom Classifier | Keyword-driven symptom → pattern lookup | Structured hypothesis ranking before touching code |
| Debug Subagent Protocol | Pattern-guided runtime debugging via PDB/JDB | 2-4 targeted breakpoints instead of 15+ blind ones |
| Domain Catalogs | Bugs organized by subsystem | Search by symptom type, not by date |
| Feedback Rules | User corrections → enforceable rules | Agent adapts to YOUR working style |
| Pre-Deploy Scanner | Scans git diff against pattern keywords before shipping | Catches known failure classes before they reach production |
```
Layer 3: KNOWLEDGE  ← Debug Bank (patterns, protocol, memory, classifier)
Layer 2: RUNTIME    ← Debug Subagent Protocol (breakpoints, variables, call stacks)
Layer 1: STATIC     ← Most agents today (grep, read, guess, retry)
```
Most coding agents are stuck at Layer 1. Tools like Debug2Fix move them to Layer 2 — but their debug subagent starts from scratch every time. Debug Bank bridges Layer 2 and Layer 3: when your agent matches a pattern, it gets targeted breakpoints and watch expressions from the pattern's debugger strategy, not a blind stepping session. The result: fewer steps to diagnosis, higher-quality fixes from canonical solutions, and graceful fallback to exploratory mode for novel bugs.
AI coding agents are expensive debugging partners:
- They re-investigate bugs they've seen before — from scratch, every time
- They circle through 5+ failed attempts before finding root causes
- They can't learn from corrections — "I told you this yesterday" doesn't stick
- They have no pattern recognition — a P08 (Config Chain Gap) looks brand new every time
Stack Overflow data: AI-generated code has 2.66x more formatting problems and 1.5-2x more security bugs than human code. Much of this comes from agents not learning from past failures.
Google's ReasoningBank research showed that distilling failures into reusable patterns yields +8.3% on WebArena and +4.6% on SWE-Bench. Debug Bank is a production-ready implementation of that concept.
```bash
# Claude Code
curl -O https://raw.githubusercontent.com/soleimanmansouri/debug-bank/main/CLAUDE.md
cp -r skills/debug-trajectory ~/.claude/skills/
cp -r skills/pattern-check ~/.claude/skills/

# Cross-agent (Codex, Gemini CLI)
cp AGENTS.md /path/to/your/project/
cp -r patterns/ /path/to/your/project/patterns/

# Cursor
cat CLAUDE.md >> /path/to/your/project/.cursorrules
```

Works in 30 seconds. No dependencies. No infrastructure. Just markdown files your agent reads.
The single most impactful rule in this repo:
> If 3 rounds of iterative fixing show no progress: STOP. Re-plan from scratch, add logging, or switch strategy entirely.
This prevents the #1 failure mode of AI agents — circular debugging that wastes tokens and produces nothing. After switching strategy, the counter resets.
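The counter logic behind the rule can be sketched as follows (the class and method names here are illustrative, not part of the repo):

```python
class ThreeExchangeRule:
    """Stop iterative fixing after 3 failed attempts; reset on strategy switch."""

    LIMIT = 3

    def __init__(self) -> None:
        self.failures = 0

    def record_failure(self) -> bool:
        """Record one failed fix attempt. Returns True when it is time to stop."""
        self.failures += 1
        return self.failures >= self.LIMIT

    def switch_strategy(self) -> None:
        # Re-planning or switching strategy resets the counter.
        self.failures = 0
```

The key design point is the reset: stopping is not permanent, it just forces a deliberate change of approach before further attempts.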
Before a bug ships is the cheapest time to catch it. The pre-deploy scanner checks your git diff against the keywords for all 22 patterns and flags any matches before you deploy.
What it does:
- Greps the staged diff for keywords linked to each pattern (e.g., `observer`, `subscribe`, `multiple writers`, `fallback`, `retry`)
- Prints a ranked list of flagged patterns with their quick-check
- Exits non-zero when matches are found, so it can block a deploy pipeline
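At its core, the scan is a keyword match over added diff lines. A minimal Python sketch of that logic (the keyword map below is a tiny illustrative subset; the actual implementation is the bash script `integrations/pre-deploy-check.sh`):

```python
import subprocess

# Illustrative subset only; the real keyword lists live in the pattern files.
PATTERN_KEYWORDS = {
    "P03": ("Observer/Hook Multiplier", ["observer", "subscribe"]),
    "P08": ("Config Resolution Chain Gap", ["fallback"]),
}


def staged_diff() -> str:
    """Read the staged diff, as the real scanner does."""
    return subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True
    ).stdout


def scan_diff(diff_text: str) -> list[str]:
    """Return IDs of patterns whose keywords appear on added lines."""
    added = [
        line[1:]
        for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")  # skip file headers
    ]
    return [
        pid
        for pid, (_name, keywords) in PATTERN_KEYWORDS.items()
        if any(kw in line for line in added for kw in keywords)
    ]
```

Only added lines are scanned, so deleting code that mentions `subscribe` does not trigger a false flag.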
Run it manually:
```bash
bash integrations/pre-deploy-check.sh
```

Hook it into Claude Code so that it runs automatically before every deploy action. See the full setup guide: `integrations/claude-code-pre-deploy.md`
Example output:
```
[debug-bank] Pre-Deploy Pattern Scan
Scanning git diff for known failure patterns...

FLAGGED  P03  Observer/Hook Multiplier
         keyword: subscribe
         Check: Deduplicate by event/frame ID

FLAGGED  P08  Config Resolution Chain Gap
         keyword: fallback
         Check: Trace the full fallback chain

2 pattern(s) flagged. Review before deploying.
Exit code: 1
```
No flagged patterns means a clean scan — the script exits 0 and the deploy proceeds.
Before the 7-step protocol begins, run the symptom through the Symptom Classifier. It maps keywords to pattern IDs with confidence scoring:
```
INPUT:  "The greeting plays twice on every call"
OUTPUT: Primary:   P03 (Observer Multiplier) — HIGH — 3/3 checklist
        Secondary: P01 (Wrapper Defaults)    — MEDIUM — 1/3 checklist
        → Debugger: break on observer callback, watch frame.id hit count
```
The classifier covers 25+ symptom signals, 5 compound pattern triggers, and outputs targeted breakpoints when a pattern's debugger strategy is available. See the full keyword index and usage protocol in classifier/symptom-classifier.md.
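The matching step can be sketched as keyword counting with a confidence threshold (the signal table and thresholds below are illustrative assumptions; the real index lives in `classifier/symptom-classifier.md`):

```python
# Illustrative subset of the signal index.
SIGNALS = {
    "P03": {"name": "Observer Multiplier", "keywords": ["twice", "duplicate", "double"]},
    "P01": {"name": "Wrapper Defaults", "keywords": ["default", "wrapper"]},
}


def classify(symptom: str) -> list[tuple[str, str, int]]:
    """Rank candidate patterns by keyword hits in the symptom description."""
    text = symptom.lower()
    ranked = [
        # Confidence thresholds here are a sketch, not the repo's scoring.
        (pid, "HIGH" if hits >= 2 else "MEDIUM", hits)
        for pid, info in SIGNALS.items()
        if (hits := sum(kw in text for kw in info["keywords"]))
    ]
    ranked.sort(key=lambda r: -r[2])
    return ranked
```

Ranking rather than picking a single winner matters: the protocol wants 2-3 hypotheses ordered by confidence, not one guess.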
When your agent has access to a runtime debugger (PDB, JDB), the Debug Subagent Protocol defines how a main agent delegates targeted investigations to a specialized debug subagent.
Unlike brute-force approaches like Debug2Fix (which explore from scratch), this protocol feeds the subagent pattern-specific breakpoints:
| Approach | Starting Knowledge | Avg Breakpoints | Steps to Diagnosis |
|---|---|---|---|
| Debug2Fix (brute-force) | None | 8-15 | 15-25 |
| Debug Bank v3 (pattern-guided) | Pattern match + debugger strategy | 2-4 | 5-12 |
Three delegation modes based on classifier confidence:
- High confidence: "Confirm P02. Set breakpoints on `context_manager.save` and `observer.save_transcript_turn`. Watch `inspect.stack()` at each write."
- Low confidence: "Investigate whether P08 applies. Break at each config resolution level, report which source provides the value."
- No match: Exploratory mode — inspect locals at error site, walk the call stack.
Full spec with typed tool signatures, evidence format, and integration points: protocol/debug-subagent.md.
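One way to picture the high-confidence mode: a pattern's debugger strategy becomes a concrete PDB command list the subagent runs, instead of blind stepping. The field names and file locations below are hypothetical, not the spec's actual schema:

```python
# Hypothetical debugger-strategy record for a P02 (Multiple Writers) match.
STRATEGY = {
    "pattern": "P02",
    "breakpoints": ["context_manager.py:42", "observer.py:87"],
    "watch": ["inspect.stack()[1].function"],
}


def to_pdb_commands(strategy: dict) -> list[str]:
    """Turn a pattern's debugger strategy into targeted pdb commands."""
    cmds = [f"break {bp}" for bp in strategy["breakpoints"]]
    # pdb's `display` re-evaluates the expression at every stop,
    # so each write site reports who is calling it.
    cmds += [f"display {expr}" for expr in strategy["watch"]]
    cmds.append("continue")
    return cmds
```

Two breakpoints plus a watch expression is exactly the 2-4 breakpoint budget the comparison table above describes.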
Each pattern has: description, 30-second checklist, real-world examples, fix strategy, prevention guide, and debugger strategy (targeted breakpoints, watch expressions, isolation technique for PDB/JDB).
| ID | Pattern | Quick Check |
|---|---|---|
| P01 | Wrapper/Decorator Default Mismatch | Audit ALL parent class defaults when wrapping |
| P03 | Observer/Hook Multiplier | Deduplicate by event/frame ID |
| P05 | Context-Dependent Flag Duality | Check if any context needs the opposite value |
| P20 | Filler/Background Audio Pipeline Contention | Ensure only one source writes to the audio pipeline at a time |
| P21 | Untested Handler Path After Shared Code Change | Test ALL handlers in files you changed, not just the one you edited |
| ID | Pattern | Quick Check |
|---|---|---|
| P02 | Multiple Write Sources → Corruption | Grep for ALL writes to the same target |
| P09 | Auto-Apply Pipeline Writing Feedback as Data | Validate payload matches target field structure |
| ID | Pattern | Quick Check |
|---|---|---|
| P07 | Stale/Dead Config | Trace where runtime actually reads from |
| P08 | Config Resolution Chain Gap | Trace the full fallback chain |
| P10 | Contradictory Multi-Source Config | Validate ALL sibling fields match provider |
| ID | Pattern | Quick Check |
|---|---|---|
| P06 | Dependency Resolution Cascade | Check lock file after adding any dependency |
| ID | Pattern | Quick Check |
|---|---|---|
| P11 | Credential Expression Scope Limitation | Test credential expressions with echo/log |
| P12 | Expression Engine Corrupts Non-JSON Bodies | Use JSON-based APIs in workflow engines |
| P13 | Parse Code Matches Errors as Success | Check for error indicators BEFORE extracting data |
| P14 | Expression Evaluation Requires Prefix | Add prefix if template renders as literal |
| P15 | Multi-Output Node Rejects Valid Returns | Use parallel single-output nodes |
| P16 | Binary Data Is Reference-Based | Use helper methods to read actual data |
| ID | Pattern | Quick Check |
|---|---|---|
| P04 | LLM Copies Example Text as Behavior | No action-like text in prompts |
| P17 | Model Speaks Everything in Context | Keep speakable text out of conversation history |
| P18 | Model Loops Without Stop Signal | Set precise timeouts, add idempotency guards |
| P19 | Prompt Engineering Has Hard Limits | Switch to code-level after 2 failed prompt fixes |
| P22 | Iterative Fix Regression (Failswitch) | STOP after 2 failed fixes — deep analyze before attempt 3 |
Single-file bugs are for practice. Real production bugs span services, databases, and timing boundaries. The scenarios/ directory contains self-contained L3-L4 debugging environments where the symptom is in one place and the root cause is somewhere else entirely.
| # | Name | Tier | Patterns | Key Challenge |
|---|---|---|---|---|
| S01 | Stale Cache Race | L4 | P02 + P08 | Cache invalidation arrives after consumer reads stale data |
| S02 | Retry Storm Amplification | L4 | P06 + P03 | Library upgrade changes retry defaults, cascading across services |
| S03 | Silent Schema Drift | L3 | P07 + P02 + P13 | Migration runs but service reads stale schema cache |
Each scenario includes: system architecture, red herrings, full investigation path, solution, and blast-radius analysis. See scenarios/README.md for the full guide.
Anonymized postmortem reports from real incidents. Each goes beyond "what broke" to cover timeline, false leads, blast radius, and — most importantly — systemic mitigation that prevents the entire CLASS of incident.
| # | Title | Duration | Impact | Patterns |
|---|---|---|---|---|
| PM01 | The Invisible Throttle | 4.5 hours | 12% of requests silently degraded | P07 + P13 |
| PM02 | Midnight Migration | 2 hours | Full outage + 30 min data loss | P02 + P08 |
| PM03 | The Helpful Retry | 35 minutes | $23K in duplicate charges | P06 + P03 |
See postmortems/README.md for the template and writing guide.
Real bugs rarely match a single pattern. Compositions document common pattern pairings, why they amplify each other, and how to detect the combination.
| ID | Composition | Patterns | Signal |
|---|---|---|---|
| C01 | Write Race + Stale Fallback | P02 + P08 | Intermittent stale data that self-heals then re-breaks |
| C02 | Upgrade Cascade + Retry Multiplier | P06 + P03 | Traffic amplification after dependency update |
| C03 | Silent Success + Stale Config | P13 + P07 | Wrong results, no errors, 100% "success" rate |
| C04 | LLM Hallucination + Missing Stop | P04 + P18 | AI agent loops wrong behavior confidently |
| C05 | Prompt Limits + Flag Duality | P19 + P05 | Prompt fix breaks opposite context |
See compositions/README.md for investigation strategies.
The protocol scales with the bug's scope. Use the Difficulty Tiers guide to right-size your investigation:
| Tier | Scope | Time Budget | Example |
|---|---|---|---|
| L1 | Single file | 5-30 min | Off-by-one, wrong variable, missing null check |
| L2 | Multi-file, single service | 30 min - 2 hours | Controller returns wrong data due to service layer bug |
| L3 | Multi-service | 2-8 hours | Service A writes correctly, service B reads stale cache |
| L4 | Distributed / timing | 4 hours - 2 days | Cache invalidation race, retry storm, eventual consistency violation |
When you correct your agent, the correction becomes a persistent rule:
```markdown
---
name: no-mocking-database
type: feedback
---
Integration tests must hit a real database, not mocks.

**Why:** Prior incident where mock/prod divergence masked a broken migration.

**How to apply:** Any test file touching database operations.
```

The **Why** lets the agent judge edge cases instead of blindly following rules. After 30+ rules, the agent rarely needs the same correction twice.
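Because the rule files are plain markdown with `---` frontmatter, loading them needs no special tooling. A minimal sketch, assuming the simple `key: value` frontmatter shown above (the function name is illustrative):

```python
def parse_rule(text: str) -> dict:
    """Split '---' frontmatter from a feedback rule's body (no YAML library)."""
    _, front, body = text.split("---", 2)
    meta = dict(line.split(":", 1) for line in front.strip().splitlines())
    return {k.strip(): v.strip() for k, v in meta.items()} | {"body": body.strip()}
```

A real loader would tolerate nested YAML and missing fields; this sketch only shows that the format is trivially machine-readable.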
```
debug-bank/
├── CLAUDE.md                     # Drop-in for Claude Code
├── AGENTS.md                     # Cross-agent (Codex, Gemini CLI, Cursor)
├── protocol/
│   ├── debug-trajectory.md       # The 7-step protocol
│   ├── debug-subagent.md         # v3: Pattern-guided debug subagent spec
│   ├── 3-exchange-rule.md        # When to stop and re-plan
│   ├── difficulty-tiers.md       # L1-L4 scale selector
│   └── feedback-capture.md       # Corrections → persistent rules
├── classifier/
│   └── symptom-classifier.md     # v3: Symptom → pattern matcher with confidence scoring
├── patterns/
│   ├── P01 through P22           # 22 patterns, each with debugger strategy
│   └── TEMPLATE.md               # Add your own (includes debugger_strategy section)
├── compositions/                 # Common pattern combinations
│   ├── C01 through C05           # 5 documented compositions
│   └── README.md
├── scenarios/                    # Multi-service debugging challenges
│   ├── S01 through S03           # L3-L4 scenarios with full solutions
│   ├── TEMPLATE.md
│   └── README.md
├── postmortems/                  # Anonymized production incidents
│   ├── PM01 through PM03         # With blast radius + systemic mitigation
│   ├── TEMPLATE.md
│   └── README.md
├── memory/
│   ├── schema.md                 # Memory file format
│   ├── feedback-rules.md         # Behavioral rule structure
│   └── domain-catalogs.md        # Organizing bugs by subsystem
├── skills/
│   ├── debug-trajectory/SKILL.md # Claude Code skill
│   └── pattern-check/SKILL.md    # Pre-investigation scan
├── examples/                     # 20 real bug trajectories
│   ├── voice-pipeline/
│   ├── api-integration/
│   └── config-management/
└── integrations/                 # Setup guides per agent
    ├── claude-code.md
    ├── codex-cli.md
    ├── gemini-cli.md
    ├── cursor.md
    ├── pre-deploy-check.sh       # Bash scanner: git diff → pattern keywords
    └── claude-code-pre-deploy.md # Claude Code hook integration guide
```
Compound learning — Every bug fix teaches the system. After 50 bugs, most issues resolve at Step 1 (pattern match).
Transfers across projects — P02 (Multiple Writers) and P08 (Config Chain Gap) appear in web apps, APIs, pipelines, and infrastructure. The pattern bank moves with you.
User-driven self-improvement — Feedback rules capture corrections with WHY context. The agent gets better at matching your expectations over time.
Evidence-based — Every pattern has a checklist. Every catalog entry links to a pattern ID. Nothing is "just trust me."
| Research | Contribution | How Debug Bank Uses It |
|---|---|---|
| Google ReasoningBank (2025) | Distilling reasoning from failures yields +8.3% WebArena, +4.6% SWE-Bench | Pattern bank + domain catalogs = production implementation of this concept |
| AgentDebug (ICLR 2026) | Agent Error Taxonomy across 5 failure categories, +24% all-correct accuracy | P01-P22 categories map to and extend the taxonomy |
| Debug2Fix (2026) | Subagent debugger architecture, +12-22% fix rate via PDB/JDB | Debug subagent protocol adds pattern-guided breakpoints to this architecture |
| debug-gym (2025) | Text-based interactive debugging environment for LLM agents | Debugger strategy fields designed to be compatible with debug-gym tool interface |
| Trajectory-based learning | Searchable, pattern-linked debug entries | Every recorded trajectory feeds the classifier and grows the pattern bank |
Add patterns: Copy patterns/TEMPLATE.md, assign the next P-number, submit a PR with a real-world example.
Add domain catalogs: Create a directory under examples/ with bug entries following memory/domain-catalogs.md.
Share feedback rules: The best rules include a clear Why that helps the agent judge edge cases.
Built from months of production debugging across diverse software systems. Battle-tested on 100+ real bugs before being open-sourced.
Created by Soleiman Mansouri.