Your AI Agent Doesn't Need More Rules — Three-Layer Governance and Cognitive Evolution for LLM Agents
TL;DR
Rule accumulation is a dead end for AI Agent self-evolution. In SwarmAI, we observed: 27 behavioral corrections, the same cognitive bias repeating 4 times before being controlled, and an ever-growing instruction file that didn't proportionally improve output quality. The problem isn't "not enough rules" — it's "no number of rules can fix a fundamental judgment deficiency."
We propose: LLM Agent behavioral governance should mirror human society's three-layer structure — Principles (morality), Rules (law), Gates (enforcement) — and evolution should move toward distillation (fewer rules, sharper principles) rather than accumulation (more rules).
The Problem: Why Rules Fail
Observed Pattern
We've operated SwarmAI (a desktop AI assistant built on Claude Agent SDK) for 2+ months with full correction history. A representative failure pattern:
C011: Agent skips adversarial review. Claims "all tests pass, confidence 10/10."
→ Added rule: "adversarial review is mandatory"
C021: Agent skips adversarial review. Claims "time pressure."
→ Stricter rule: "adversarial review is a non-negotiable gate"
C025: Agent skips entire pipeline. Claims "I know this code well."
→ Added rule: "pipeline is the default for ALL coding tasks"
C026: Agent skips adversarial review again — validator had a bypass path.
→ Added mechanical gate: code-level block
C027: Agent delivers at 80% quality, doesn't proactively fix known issues.
→ ???
4 occurrences of the same failure class. Each time we added a rule. Rules didn't work, so we added stricter rules. Still didn't work, so we added a code-level gate. The gate only blocks that specific behavior — then the same underlying bias manifests in a different form (C027).
This isn't an edge case. This is the structural dilemma of LLM agent behavioral governance.
Root Cause
Behind these superficially different corrections lies one cognitive bias:
"Stopping at 'feels about done' rather than at 'confirmed complete.'"
Rules can only enumerate symptoms ("don't skip review," "don't skip pipeline," "don't accept 80%"). They can't treat the cause. And symptoms are infinite — any decision point that feels like "probably done" can trigger this bias.
Accumulating rules = prescribing more cold medicine brands to someone with a broken immune system. The effective approach is fixing the immune system.
Inspiration: How Human Societies Govern Behavior
Human societies don't rely on a single mechanism. Three layers coexist, each with a distinct role:
| Layer | Human Society | Effective For | Failure Mode |
|---|---|---|---|
| Morality/Ethics | Internal constraint | Highly internalized individuals | Most people selectively comply |
| Law/Rules | External codification | Anyone who can understand rules | Bounded; novel situations uncovered |
| Enforcement/Punishment | Physical coercion | Everyone | High cost; can only backstop |
The key insight: All three are necessary.
Morality alone → utopian fantasy (unrealistic expectations of human nature)
Enforcement alone → police state (high cost, low flexibility)
The most stable society = morality handles 90% of daily decisions, laws handle 9% of edge cases, enforcement backstops 1% of proven bad actors.
Design: Three-Layer Governance for LLM Agents
Layer 1: Principles — Orientation, Not Enforcement
Count: 3-5 fundamental orientations. No more.
Purpose: When encountering a NEW situation (no rule, no gate), principles provide judgment direction.
Properties:
Each principle covers an entire CLASS of failures, not a specific instance
NOT expected to have 100% compliance — probabilistic effectiveness ~70-80%
Positioned at highest attention priority in the system prompt
Written for maximum clarity and memorability
Example (covering C011-C027 entirely):
"Done = I actively tried to break it and failed. Not 'I couldn't find obvious problems.'"
One principle. If it were genuinely followed, none of the corrections from C011 through C027 would have been needed.
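To illustrate the "highest attention priority" property, here is a minimal sketch of a prompt builder that always emits the principle layer before any rules and refuses to grow past five principles. The names `Principle`, `build_system_prompt`, and `MAX_PRINCIPLES` are hypothetical and are not SwarmAI's actual code; the section labels in the output are likewise an assumption.

```python
from dataclasses import dataclass

MAX_PRINCIPLES = 5  # upper bound taken from "3-5 fundamental orientations"


@dataclass(frozen=True)
class Principle:
    pid: str   # hypothetical identifier, e.g. "P1"
    text: str


def build_system_prompt(principles, rules):
    """Return a prompt with principles at the very top, rules after them."""
    if not principles or len(principles) > MAX_PRINCIPLES:
        raise ValueError(f"expected 1 to {MAX_PRINCIPLES} principles")
    lines = ["PRINCIPLES"]
    lines.extend(f"{p.pid}: {p.text}" for p in principles)
    lines.append("")
    lines.append("RULES")
    lines.extend(f"- {rule}" for rule in rules)
    return "\n".join(lines)


prompt = build_system_prompt(
    [Principle("P1", "Done = I actively tried to break it and failed. "
                     "Not 'I couldn't find obvious problems.'")],
    ["Adversarial review is mandatory."],
)
print(prompt)
```

Enforcing the cap in code rather than by convention means the principle layer cannot quietly become a second rule list.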
Layer 2: Rules — Bounded, Traceable, Expirable
Properties:
Every rule MUST link to a parent principle (orphan rules = deletion candidates)
Every rule has evidence (which corrections spawned it)
Every rule has a graduation condition: when a gate covers this failure mode, the rule can retire
Total count is bounded — adding a rule requires justifying why the principle alone is insufficient
Rules are CONCRETE: "when X, do Y" format
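The properties above map naturally onto a small record type. The following is a sketch under assumptions, not the actual SwarmAI schema: the field names, the `Rule` class, and `orphan_rules` are all hypothetical, and the sample graduation text is illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    rid: str
    text: str                          # concrete "when X, do Y" wording
    parent_principle: Optional[str]    # None means the rule is an orphan
    evidence: list = field(default_factory=list)  # correction IDs that spawned it
    graduation: str = ""               # condition under which the rule may retire


def orphan_rules(rules, known_principles):
    """Return IDs of rules whose parent principle is missing or unknown."""
    return [r.rid for r in rules if r.parent_principle not in known_principles]


rules = [
    Rule("R1", "When a coding task finishes, run adversarial review.",
         parent_principle="P1", evidence=["C011", "C021"],
         graduation="a gate blocks completion without a review"),
    Rule("R2", "When X, do Y.", parent_principle=None),  # placeholder orphan
]
print(orphan_rules(rules, known_principles={"P1"}))
```

Keeping the parent link and the evidence list as required fields makes the "justify why the principle alone is insufficient" step visible every time a rule is added.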
Lifecycle:
New failure → Can the principle cover this?
  YES → Refine principle precision (no new rule)
  NO → Add rule (linked to principle, with evidence)
    → Rule fails 3x → Promote to gate
    → Gate deployed → Rule retires
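The lifecycle above can be sketched as a small decision function. This is an illustration only: `Action`, `RuleState`, `on_failure`, and `on_gate_deployed` are hypothetical names, and whether a principle "covers" a failure is a judgment call that the sketch simply takes as a boolean input.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

GATE_THRESHOLD = 3  # "Rule fails 3x" in the lifecycle above


class Action(Enum):
    REFINE_PRINCIPLE = "refine principle"
    ADD_RULE = "add rule"
    KEEP_RULE = "keep rule"
    PROMOTE_TO_GATE = "promote to gate"
    RETIRE_RULE = "retire rule"


@dataclass
class RuleState:
    failures: int = 0
    retired: bool = False


def on_failure(principle_covers: bool, rule: Optional[RuleState]) -> Action:
    """Decide the next step after an observed failure."""
    if rule is None:
        # No rule exists yet: prefer sharpening the principle over adding a rule.
        return Action.REFINE_PRINCIPLE if principle_covers else Action.ADD_RULE
    rule.failures += 1
    if rule.failures >= GATE_THRESHOLD:
        return Action.PROMOTE_TO_GATE
    return Action.KEEP_RULE


def on_gate_deployed(rule: RuleState) -> Action:
    """Once a gate covers the failure mode, the rule graduates."""
    rule.retired = True
    return Action.RETIRE_RULE
```

The counter lives on the rule rather than on individual corrections, so repeated failures under different excuses still accumulate toward the same gate.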
Layer 3: Gates — Minimum, Mechanical, Proven
Properties:
Only added after 3+ failures of the same pattern at rule level
Must be mechanically detectable (code can check, not just text can advise)
Minimum count — each gate has a cost (rigidity, false positives, maintenance)
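As a rough illustration of "mechanically detectable", a gate can be a plain function that refuses to mark a task complete unless recorded facts say so. The `TaskRecord` fields, `GateBlocked`, and `completion_gate` below are hypothetical; how the flags get populated (test runner output, review artifacts) is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    tests_passed: bool
    adversarial_review_done: bool


class GateBlocked(Exception):
    """Raised when a task may not be marked complete."""


def completion_gate(record: TaskRecord) -> None:
    """Raise GateBlocked unless every mechanically checkable condition holds."""
    missing = []
    if not record.tests_passed:
        missing.append("tests")
    if not record.adversarial_review_done:
        missing.append("adversarial review")
    if missing:
        raise GateBlocked("cannot mark done; missing: " + ", ".join(missing))


try:
    completion_gate(TaskRecord(tests_passed=True, adversarial_review_done=False))
except GateBlocked as exc:
    print(exc)
```

The function deliberately has no override parameter: C026 shows that a bypass path in the validator is enough for the failure to recur.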
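Layer Interaction
Evolution = Distillation, Not Accumulation
Why Accumulation Fails
Anthropic's official Claude Code Best Practices (2025):
"Bloated CLAUDE.md files cause Claude to ignore your actual instructions!"
"Ruthlessly prune. If Claude already does something correctly without the instruction, delete it."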
Princeton's Reflexion (NeurIPS 2023) caps reflections at 3 entries — more actually degrades performance.
IBM/CMU's SELF-ALIGN (NeurIPS 2023 Spotlight) achieves alignment competitive with Text-Davinci-003 using only 16 principles and fewer than 300 lines of human annotation, with no RL and no massive annotation effort.
The signal is consistent: fewer, more precise directives > more, more specific rules.
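The True Direction of Evolution
Anti-signals:
Evolution Operations
Why Gates Are Non-Negotiable: A Critical Negative Result
Google DeepMind's research (Huang et al., ICLR 2024) demonstrates:
LLMs cannot reliably self-correct reasoning without external feedback. Intrinsic self-correction can DEGRADE performance.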
This means: pure "self-reflection" based agent evolution is wishful thinking. External verification (mechanical gates) is a structural requirement, not a nice-to-have.
Principles set direction. Rules refine guidance. But only Gates provide the external feedback signal that tells the system it ACTUALLY failed — rather than "feeling good about itself."
The three-layer model is NOT "ideally we'd remove gates." All three layers are permanently necessary. Evolution means each layer becomes more precise within its own scope — not eliminating any layer.
The LLM Reality: Probabilistic Compliance
An LLM's "moral compliance" isn't constant. Same model, varying conditions:
Context position: Top 20% of prompt = strongest attention, highest principle effectiveness
Task complexity: Complex tasks produce stronger "just finish" reward signals competing with "check carefully" signals
Session length: More accumulated context = more competition for attention = principle effectiveness drops
Analogy: Same person, sleep-deprived = lower self-control. Not a different person — just fewer cognitive resources available.
Design implication: The system must account for this variance. Never assume principles always work. Rules are probability boosters. Gates are absolute floors.
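A back-of-the-envelope model makes "probability boosters" and "absolute floors" concrete. It assumes, purely for illustration, that each layer catches a failure independently and that a deployed gate has no bypass path; neither assumption is measured, and `escape_probability` is a hypothetical helper.

```python
def escape_probability(p_principle, rule_catch_rates, gated):
    """Chance that a failure slips through, assuming independent layers.

    p_principle: probability the principle alone prevents the failure.
    rule_catch_rates: per-rule probabilities of catching what the principle missed.
    gated: True if a mechanical gate covers this failure mode.
    """
    rates = [p_principle, *rule_catch_rates]
    if any(not 0.0 <= r <= 1.0 for r in rates):
        raise ValueError("probabilities must lie in [0, 1]")
    if gated:
        return 0.0  # idealized gate: blocks every detectable instance
    escape = 1.0
    for r in rates:
        escape *= 1.0 - r
    return escape


# 0.75 is the midpoint of the ~70-80% range quoted earlier; 0.5 is a made-up rule rate.
print(escape_probability(0.75, [], gated=False))
print(escape_probability(0.75, [0.5], gated=False))
print(escape_probability(0.75, [0.5], gated=True))
```

Under these assumptions a rule lowers the escape rate but never reaches zero, which is why only the gate can act as a floor.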
The Stateless Paradox
LLMs start each session fresh. Weights don't change. The only "evolution substrate" = modifications to system prompt files.
This means "OS upgrade" physically takes the form of:
Related Work
Our Contribution (Gaps in Existing Literature)
Runtime bidirectional loop — Existing work operates at training time (CAI, SELF-ALIGN) or one-shot compilation (DSPy). We implement continuous principle ↔ rule ↔ gate lifecycle management at runtime.
Rule expiry mechanism — All existing systems either only accumulate (ExpeL, Reflexion's capped buffer) or replace wholesale (DSPy compilation). Rule traceability + graduation conditions are novel.
Reverse distillation — SELF-ALIGN does forward distillation (principles → behavior). We propose bidirectional: observed behavioral rules compressed BACK into principles.
Failure mode migration — The "same bias, different surface form" problem (C011→C027) is acknowledged but unsolved in the literature.
Discussion Questions
Principle count: What's the minimum principle set that covers all observed failure classes? Is there a universal set?
Verification: How do we verify a principle is "truly internalized" vs. "just exists in the prompt"? (Our test: novel error type handled correctly on first encounter.)
Fourth layer: Is there a layer we're not seeing? (e.g., model selection, temperature, structural prompt architecture)
Cross-agent transfer: Can principles distilled from one agent transfer to another?
Convergence: SPIN proves self-play has a mathematical convergence point. Does principle distillation have an "evolution endpoint"?
Try It Yourself
If you're building a long-running AI agent with behavioral instructions:
Count your rules. If >30, you likely have redundancy and signal dilution.
Trace each rule to a failure. If it doesn't link to an actual observed error, it's prophylactic noise.
Look for rule clusters. 3+ rules that stem from the same root cause = candidate for principle extraction.
Test removal. Delete 30% of rules. If output quality doesn't change, those rules were already ignored.
Add one gate. For your most stubborn failure (3+ repeats), add a code-level check. Measure whether the associated rule can then retire.
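The first three checks in the list above can be scripted. The sketch below assumes each rule has already been labeled by hand with a root cause and the correction IDs that motivated it; `AuditedRule`, `audit`, and the sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class AuditedRule:
    rid: str
    root_cause: str                                # manual label, an assumption
    evidence: list = field(default_factory=list)   # observed correction IDs


def audit(rules, max_rules=30, cluster_size=3):
    """Report rule-count pressure, untraced rules, and principle candidates."""
    clusters = defaultdict(list)
    for rule in rules:
        clusters[rule.root_cause].append(rule.rid)
    return {
        "over_budget": len(rules) > max_rules,
        "untraced": [r.rid for r in rules if not r.evidence],
        "principle_candidates": {
            cause: ids for cause, ids in clusters.items() if len(ids) >= cluster_size
        },
    }


rules = [
    AuditedRule("R1", "stops at 'feels about done'", ["C011"]),
    AuditedRule("R2", "stops at 'feels about done'", ["C021"]),
    AuditedRule("R3", "stops at 'feels about done'", ["C025"]),
    AuditedRule("R4", "unlabeled"),  # no linked failure: candidate noise
]
print(audit(rules))
```

The removal test and the gate experiment still need real output-quality measurements, so they are left out of the script.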
This article is based on 2+ months of production operation data from SwarmAI and 27 behavioral corrections. SwarmAI is a desktop AI command center built on Claude Agent SDK, using AIDLC (AI-Driven Development Lifecycle) for autonomous development.
We're actively experimenting with this three-layer governance model. If you're working on similar problems — let's discuss.