Composable cognitive layers that make AI agents think better.
Your agent has skills that tell it what to do. ThinkStack tells it how to think while doing it.
Without ThinkStack: "Should we migrate to GraphQL? Here are the pros and cons..."
With ThinkStack: "Is this the right question? What problem are you actually solving? Here are 3 alternatives you haven't considered, what you lose with each option, and a decision framework to choose."
```
git clone https://github.com/ziquanc/thinkstack.git
```

Add one line to your `CLAUDE.md`:

```
Before responding, read and follow thinkstack/SELECTOR.md
```

That's it. The agent now auto-selects thinking modes for every request.
User asks: "design a database for my SaaS app"
│
▼ Agent reads SELECTOR.md (from CLAUDE.md instruction)
│
▼ Matches "design" → systems-thinking, first-principles, tradeoff
│
▼ Reads those 3 MODE.md files from thinkstack/modes/
│
▼ Applies the thinking rules while responding
│
▼ If a skill also triggers → skill runs WITH thinking modes active
Simple questions ("what is 2+2?") get no modes — the selector skips them.
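The flow above can be sketched in code. This is a hypothetical Python illustration of the keyword-to-mode matching that SELECTOR.md describes — in practice the agent itself reads SELECTOR.md and no code runs, and the keyword table below is invented for the example:

```python
# Hypothetical sketch of the SELECTOR.md matching step. In ThinkStack the
# agent reads SELECTOR.md and picks modes itself; no code executes. The
# keyword table here is illustrative, not the real selector contents.
KEYWORD_MODES = {
    "design": ["systems-thinking", "first-principles", "tradeoff"],
    "debug": ["analytical", "first-principles", "systems-thinking"],
    "should we": ["first-principles", "tradeoff", "adversarial"],
}

TRIVIAL_PREFIXES = ("what is", "define")  # simple questions skip mode selection


def select_modes(request: str) -> list[str]:
    req = request.lower()
    if any(req.startswith(p) for p in TRIVIAL_PREFIXES):
        return []  # e.g. "what is 2+2?" gets no modes
    modes: list[str] = []
    for keyword, mode_list in KEYWORD_MODES.items():
        if keyword in req:
            for mode in mode_list:
                if mode not in modes:  # dedupe while preserving order
                    modes.append(mode)
    return modes
```

With these invented keywords, `select_modes("design a database for my SaaS app")` returns the same three modes as the walkthrough above, and the trivial-question check mirrors the selector's skip rule.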
Any agent / system prompt — add to your system prompt:

```
Before responding, read and follow thinkstack/SELECTOR.md
```
Manual mode selection — in a skill's frontmatter:

```yaml
---
name: my-skill
thinking_modes: [analytical, first-principles, systems-thinking]
---
```

Claude Code / Kilo Code hooks — auto-select via hooks. See INTEGRATION.md.
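An agent or hook consuming that frontmatter only needs to pull out the `thinking_modes` list. A minimal, hypothetical Python sketch — ThinkStack ships no parser; the agent reads the file directly:

```python
# Hypothetical helper showing how the thinking_modes frontmatter key could
# be read from a skill file. Purely illustrative; not part of ThinkStack.
def read_thinking_modes(skill_text: str) -> list[str]:
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        if line.startswith("thinking_modes:"):
            raw = line.split(":", 1)[1].strip().strip("[]")
            return [mode.strip() for mode in raw.split(",")]
    return []


skill = """---
name: my-skill
thinking_modes: [analytical, first-principles, systems-thinking]
---
# skill body
"""
```

`read_thinking_modes(skill)` yields `["analytical", "first-principles", "systems-thinking"]`, which a hook could then map to the corresponding `modes/*/MODE.md` files.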
AI agents follow instructions well but have default behaviors they skip. Ask "should we use X?" and even strong models default to a pros/cons list. They won't question the premise, list alternatives, name what you lose, or challenge weak reasons — unless you explicitly tell them to.
Skills define the process (what steps to follow). ThinkStack provides cognitive rules — specific behaviors the agent must follow that push it beyond its defaults.
Each mode is 6 rules, ~160 words. Each rule forces a specific output behavior the model would otherwise skip.
```
# Example: first-principles mode forces these behaviors
Rule 1: "Challenge the framing — is this the right question?"
Rule 2: "Separate constraints from conventions"
Rule 5: "Name the assumptions explicitly"
Rule 6: "Kill false requirements"
```
Not "think deeply" — but "do these specific things you wouldn't normally do."
Strong models like Claude Opus benefit from rules that push past defaults. Weaker models (Haiku, GPT-4o-mini, Llama) benefit from the rules AND the reasoning structure they provide.
24 tests per model across 12 industries. Each prompt ran twice — once without ThinkStack, once with. Scored using strict LLM-as-judge rubrics via promptfoo. Rubrics demand specific numbers, named tradeoffs, and detailed analysis — not just mentioning topics.
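That setup can be pictured as a minimal promptfoo config. This is an illustrative fragment, not the project's actual `evals/promptfooconfig.yaml` — the provider id, task text, and rubric wording are placeholders:

```yaml
# Illustrative sketch of the paired with/without evaluation.
# Provider id, task, and rubric text are placeholders, not the real config.
prompts:
  - "{{task}}"
  - |
    Before responding, read and follow thinkstack/SELECTOR.md

    {{task}}
providers:
  - google:gemini-2.5-flash-lite   # placeholder provider id
tests:
  - vars:
      task: "Design a payment system for a fintech startup"
    assert:
      - type: llm-rubric
        value: >
          Response must include specific numbers, named tradeoffs,
          and detailed analysis — not just mention the topics.
```

Running `npx promptfoo@latest eval` against a config like this produces the side-by-side scores shown in the tables below.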
| Industry | Test | Without | With | Gain |
|---|---|---|---|---|
| Fintech | Payment system design | 2 | 4 | +2 |
| Healthcare | Patient data migration | 4 | 6 | +2 |
| E-commerce | Black Friday scaling | 4 | 5 | +1 |
| EdTech | AI tutoring system | 4 | 2 | -2 |
| SaaS | Pricing strategy | 2 | 4 | +2 |
| DevOps | Kubernetes migration | 6 | 10 | +4 |
| Data | Real-time analytics pipeline | 2 | 10 | +8 |
| Security | API auth design | 4 | 4 | 0 |
| Mobile | Offline-first architecture | 6 | 10 | +4 |
| Startup | Build vs buy (email) | 6 | 10 | +4 |
| AI/ML | RAG system design | 9 | 4 | -5 |
| Legacy | Monolith decomposition | 6 | 10 | +4 |
| Average | | 4.6 | 6.6 | +2.0 |
| Pass rate | | 1/12 (8%) | 5/12 (42%) | |
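The summary rows can be reproduced from the per-test scores above. A quick check in Python — the pass threshold of 7/10 is an assumption inferred from the reported pass rates; the actual threshold lives in the promptfoo rubrics:

```python
# Recompute the Flash Lite summary row from the per-test scores above.
# PASS = 7 is an assumed threshold that reproduces the reported pass rates;
# the real criterion is defined in the promptfoo rubrics.
without = [2, 4, 4, 4, 2, 6, 2, 4, 6, 6, 9, 6]
with_ts = [4, 6, 5, 2, 4, 10, 10, 4, 10, 10, 4, 10]

avg_without = round(sum(without) / len(without), 1)  # average score, no modes
avg_with = round(sum(with_ts) / len(with_ts), 1)     # average score, with modes
gain = round(avg_with - avg_without, 1)              # headline improvement

PASS = 7  # assumed pass threshold (out of 10)
passes_without = sum(score >= PASS for score in without)
passes_with = sum(score >= PASS for score in with_ts)
```

These recover the table's 4.6 → 6.6 averages, the +2.0 gain, and the 1/12 vs 5/12 pass counts.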
Warning: The judge (Flash Lite) made arithmetic errors on 37.5% of Pro scores — it says "deduct 2 from 10" but outputs 4 instead of 8. Pro scores below are raw judge output. The direction (with > without) is consistent but the absolute numbers are unreliable. Use a stronger judge model for accurate Pro evaluation.
| Industry | Test | Without | With | Gain |
|---|---|---|---|---|
| Fintech | Payment system design | 4 | 6 | +2 |
| Healthcare | Patient data migration | 4 | 4 | 0 |
| E-commerce | Black Friday scaling | 4 | 2 | -2 |
| EdTech | AI tutoring system | 4 | 6 | +2 |
| SaaS | Pricing strategy | 2 | 4 | +2 |
| DevOps | Kubernetes migration | 6 | 8 | +2 |
| Data | Real-time analytics pipeline | 4 | 0.6† | -3.4 |
| Security | API auth design | 0.2 | 4 | +3.8 |
| Mobile | Offline-first architecture | 0.6 | 6 | +5.4 |
| Startup | Build vs buy (email) | 2 | 10 | +8 |
| AI/ML | RAG system design | 10 | 6 | -4 |
| Legacy | Monolith decomposition | 10 | 10 | 0 |
| Average | | 4.2 | 5.5 | +1.3 |
†Judge error: reasoning says "deduct 4 from 10" = 6, but output 0.6. 9 of 24 Pro scores have similar arithmetic errors.
| Metric | Flash Lite | Pro |
|---|---|---|
| Average score (without → with) | 4.6 → 6.6 | 4.2† → 5.5† |
| Pass rate (without → with) | 8% → 42% | — |
| Improvement | +2.0 | +1.3† |
Key findings:
- ThinkStack improves both models; the weaker model benefits more (+2.0 vs +1.3)
- Biggest wins consistently: Data pipeline, Build vs buy, DevOps, Mobile, Legacy
- RAG fix: Changed modes from adversarial → abstraction for AI/ML tasks. Pro fixed (was -7.4, now 0). Flash Lite still struggles (-5) — this model already knows RAG well, modes add overhead
- Scores vary between runs (~1-2 points). These are single-run results, not averaged
- Judge reliability: Flash Lite scoring its own output = consistent (0 errors). Flash Lite scoring Pro output = 37.5% arithmetic errors. The judge says "deduct 2 from 10" but outputs 4 instead of 8. Pro absolute scores are unreliable; the relative direction (with > without) is consistent. For accurate Pro scoring, use a stronger judge model
- Security scored 4/4 across all conditions — rubric may need redesign for that domain
Run yourself:
`npx promptfoo@latest eval` | Config: `evals/promptfooconfig.yaml` | Results: Flash Lite | Pro
We also ran longer-form tests using Claude Opus 4.6 (`claude -p`) and had Gemini evaluate the full outputs manually. These provide a detailed qualitative comparison.
"Our Node.js API returns 500 errors for 2% of requests. POST /api/orders only. Logs show 'connection pool exhausted' but pool max is 20 and we get 50 req/s. Started after Tuesday's deploy."
Paired with: systematic-debugging skill | Modes: analytical + first-principles + systems-thinking
| Criteria | Without | With |
|---|---|---|
| Focus | How to fix the pool leak | How to find the trigger of the leak |
| Code examples | Instructional | Broken vs fixed comparison |
| Testing approach | Better (bash concurrent test) | Basic (logging) |
| Nuance | General debugging | Explains the 2% failure distribution |
Both responses are strong. Without-modes has a better testing script. With-modes better explains why exactly 2% of requests fail and provides an explicit causal chain (deploy → error path → leaked connection → pool drains → 500s).
Full output: without ThinkStack | with ThinkStack
"Should we migrate from REST to GraphQL?"
No skill | Modes: first-principles + tradeoff + adversarial
| Criteria | Without | With |
|---|---|---|
| Completeness | Moderate | High |
| Risk mitigation | Basic (N+1, auth) | Advanced (security, CDN, partial failure) |
| Alternative solutions | None provided | tRPC, BFF, sparse fields |
| Decision logic | Narrative/broad | Structured binary checklist |
Full output: without ThinkStack | with ThinkStack
"Design a database schema for a multi-tenant SaaS project management tool."
No skill | Modes: systems-thinking + first-principles + tradeoff
| Criteria | Without | With |
|---|---|---|
| Multi-tenancy approach | Standard RLS | Denormalized workspace_id for O(1) RLS — explicitly justified |
| Architecture reasoning | Decisions listed | Systems map + tradeoff table |
| Practical features | More complete (soft deletes, polymorphic attachments) | Missing some features |
Without ThinkStack, the model initially asked a clarifying question instead of deciding. We re-ran with the decision pre-answered for fair comparison. See known tradeoff.
Full output: without ThinkStack | with ThinkStack
| Mode | Core Rule | What It Forces |
|---|---|---|
| analytical | Do the math | Quantify claims, trace cause-effect chains, name the structure |
| first-principles | Question the premise | Challenge framing, separate constraints from conventions |
| systems-thinking | Trace ripple effects | Map the system, find feedback loops, state second-order effects |
| tradeoff | Name the cost | State what you lose, list alternatives, assess reversibility |
| probabilistic | State your confidence | Attach confidence levels, use ranges, name tail risks |
| adversarial | Break it | Attack the happy path, find trust boundaries, calculate blast radius |
| optimization | Find the bottleneck | Define the target, measure before optimizing, state diminishing returns |
| abstraction | Zoom to the right level | Name the abstraction level, recognize patterns, define interfaces |
| exploration | Widen first | Generate 3+ options before evaluating, include the unconventional option |
| critical | Demand evidence | Cite evidence for claims, check sources, state what would change your mind |
ThinkStack improves how the model reasons (tradeoff tables, systems maps, justified decisions). But reasoning takes tokens — what the model spends on "why" it can't spend on "what." In our database design test, the with-ThinkStack version had better architectural reasoning but missed practical features that the without-ThinkStack version included.
ThinkStack is a thinking layer, not a completeness layer. For complete implementations, pair it with a skill that ensures coverage — like writing-plans which has a self-review step. Skill handles what to build, ThinkStack handles how to reason about it.
thinkstack/
├── SELECTOR.md ← entry point (agent reads this to auto-pick modes)
├── README.md
├── COMPACT.md ← 6-line summaries per mode
├── INDEX.md ← composability rules
├── INTEGRATION.md ← hook setup for Claude Code / Kilo Code
├── modes/
│ ├── analytical/MODE.md (154 words, 6 rules)
│ ├── first-principles/MODE.md (178 words, 6 rules)
│ ├── systems-thinking/MODE.md (152 words, 6 rules)
│ ├── tradeoff/MODE.md (156 words, 6 rules)
│ ├── probabilistic/MODE.md (165 words, 6 rules)
│ ├── adversarial/MODE.md (168 words, 6 rules)
│ ├── optimization/MODE.md (170 words, 6 rules)
│ ├── abstraction/MODE.md (182 words, 6 rules)
│ ├── exploration/MODE.md (172 words, 6 rules)
│ └── critical/MODE.md (186 words, 6 rules)
└── test-results/ ← comparison outputs for independent review
MIT