Composable cognitive layers that make AI agents think better.
Your agent has skills that tell it what to do. ThinkStack tells it how to think while doing it.
Without ThinkStack: "Should we migrate to GraphQL? Here are the pros and cons..."
With ThinkStack: "Is this the right question? What problem are you actually solving? Here are 3 alternatives you haven't considered, what you lose with each option, and a decision framework to choose."
```
git clone https://github.com/ziquanc/thinkstack.git
```

Add one line to your `CLAUDE.md`:

```
Before responding, read and follow thinkstack/SELECTOR.md
```

That's it. The agent now auto-selects thinking modes for every request.
User asks: "design a database for my SaaS app"
│
▼ Agent reads SELECTOR.md (from CLAUDE.md instruction)
│
▼ Matches "design" → systems-thinking, first-principles, tradeoff
│
▼ Reads those 3 MODE.md files from thinkstack/modes/
│
▼ Applies the thinking rules while responding
│
▼ If a skill also triggers → skill runs WITH thinking modes active
Simple questions ("what is 2+2?") get no modes — the selector skips them.
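The flow above can be sketched in code. This is a hypothetical Python illustration of the keyword-to-mode matching that SELECTOR.md describes — in practice the agent itself reads SELECTOR.md and no code runs, and the keyword table below is invented for the example:

```python
# Hypothetical sketch of the SELECTOR.md matching step. In ThinkStack the
# agent reads SELECTOR.md and picks modes itself; no code executes. The
# keyword table here is illustrative, not the real selector contents.
KEYWORD_MODES = {
    "design": ["systems-thinking", "first-principles", "tradeoff"],
    "debug": ["analytical", "first-principles", "systems-thinking"],
    "should we": ["first-principles", "tradeoff", "adversarial"],
}

TRIVIAL_PREFIXES = ("what is", "define")  # simple questions skip mode selection


def select_modes(request: str) -> list[str]:
    req = request.lower()
    if any(req.startswith(p) for p in TRIVIAL_PREFIXES):
        return []  # e.g. "what is 2+2?" gets no modes
    modes: list[str] = []
    for keyword, mode_list in KEYWORD_MODES.items():
        if keyword in req:
            for mode in mode_list:
                if mode not in modes:  # dedupe while preserving order
                    modes.append(mode)
    return modes
```

With these invented keywords, `select_modes("design a database for my SaaS app")` returns the same three modes as the walkthrough above, and the trivial-question check mirrors the selector's skip rule.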
Any agent / system prompt — add to your system prompt:

```
Before responding, read and follow thinkstack/SELECTOR.md
```
Manual mode selection — in a skill's frontmatter:

```yaml
---
name: my-skill
thinking_modes: [analytical, first-principles, systems-thinking]
---
```

Claude Code / Kilo Code hooks — auto-select via hooks. See INTEGRATION.md.
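An agent or hook consuming that frontmatter only needs to pull out the `thinking_modes` list. A minimal, hypothetical Python sketch — ThinkStack ships no parser; the agent reads the file directly:

```python
# Hypothetical helper showing how the thinking_modes frontmatter key could
# be read from a skill file. Purely illustrative; not part of ThinkStack.
def read_thinking_modes(skill_text: str) -> list[str]:
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        if line.startswith("thinking_modes:"):
            raw = line.split(":", 1)[1].strip().strip("[]")
            return [mode.strip() for mode in raw.split(",")]
    return []


skill = """---
name: my-skill
thinking_modes: [analytical, first-principles, systems-thinking]
---
# skill body
"""
```

`read_thinking_modes(skill)` yields `["analytical", "first-principles", "systems-thinking"]`, which a hook could then map to the corresponding `modes/*/MODE.md` files.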
AI agents follow instructions well but have default behaviors they skip. Ask "should we use X?" and even strong models default to a pros/cons list. They won't question the premise, list alternatives, name what you lose, or challenge weak reasons — unless you explicitly tell them to.
Skills define the process (what steps to follow). ThinkStack provides cognitive rules — specific behaviors the agent must follow that push it beyond its defaults.
Each mode is 6 rules, ~160 words. Each rule forces a specific output behavior the model would otherwise skip.
```
# Example: first-principles mode forces these behaviors
Rule 1: "Challenge the framing — is this the right question?"
Rule 2: "Separate constraints from conventions"
Rule 5: "Name the assumptions explicitly"
Rule 6: "Kill false requirements"
```
Not "think deeply" — but "do these specific things you wouldn't normally do."
Strong models like Claude Opus benefit from rules that push past defaults. Weaker models (Haiku, GPT-4o-mini, Llama) benefit from the rules AND the reasoning structure they provide.
24 tests per model across 12 industries. Each prompt ran twice — once without ThinkStack, once with. Scored using strict LLM-as-judge rubrics via promptfoo. Rubrics demand specific numbers, named tradeoffs, and detailed analysis — not just mentioning topics.
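That setup can be pictured as a minimal promptfoo config. This is an illustrative fragment, not the project's actual `evals/promptfooconfig.yaml` — the provider id, task text, and rubric wording are placeholders:

```yaml
# Illustrative sketch of the paired with/without evaluation.
# Provider id, task, and rubric text are placeholders, not the real config.
prompts:
  - "{{task}}"
  - |
    Before responding, read and follow thinkstack/SELECTOR.md

    {{task}}
providers:
  - google:gemini-2.5-flash-lite   # placeholder provider id
tests:
  - vars:
      task: "Design a payment system for a fintech startup"
    assert:
      - type: llm-rubric
        value: >
          Response must include specific numbers, named tradeoffs,
          and detailed analysis — not just mention the topics.
```

Running `npx promptfoo@latest eval` against a config like this produces the side-by-side scores shown in the tables below.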
| Industry | Test | Without | With | Gain |
|---|---|---|---|---|
| Fintech | Payment system design | 2 | 4 | +2 |
| Healthcare | Patient data migration | 4 | 6 | +2 |
| E-commerce | Black Friday scaling | 4 | 5 | +1 |
| EdTech | AI tutoring system | 4 | 2 | -2 |
| SaaS | Pricing strategy | 2 | 4 | +2 |
| DevOps | Kubernetes migration | 6 | 10 | +4 |
| Data | Real-time analytics pipeline | 2 | 10 | +8 |
| Security | API auth design | 4 | 4 | 0 |
| Mobile | Offline-first architecture | 6 | 10 | +4 |
| Startup | Build vs buy (email) | 6 | 10 | +4 |
| AI/ML | RAG system design | 9 | 4 | -5 |
| Legacy | Monolith decomposition | 6 | 10 | +4 |
| Average | | 4.6 | 6.6 | +2.0 |
| Pass rate | | 1/12 (8%) | 5/12 (42%) | |
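The summary rows can be reproduced from the per-test scores above. A quick check in Python — the pass threshold of 7/10 is an assumption inferred from the reported pass rates; the actual threshold lives in the promptfoo rubrics:

```python
# Recompute the Flash Lite summary row from the per-test scores above.
# PASS = 7 is an assumed threshold that reproduces the reported pass rates;
# the real criterion is defined in the promptfoo rubrics.
without = [2, 4, 4, 4, 2, 6, 2, 4, 6, 6, 9, 6]
with_ts = [4, 6, 5, 2, 4, 10, 10, 4, 10, 10, 4, 10]

avg_without = round(sum(without) / len(without), 1)  # average score, no modes
avg_with = round(sum(with_ts) / len(with_ts), 1)     # average score, with modes
gain = round(avg_with - avg_without, 1)              # headline improvement

PASS = 7  # assumed pass threshold (out of 10)
passes_without = sum(score >= PASS for score in without)
passes_with = sum(score >= PASS for score in with_ts)
```

These recover the table's 4.6 → 6.6 averages, the +2.0 gain, and the 1/12 vs 5/12 pass counts.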
Warning: The judge (Flash Lite) made arithmetic errors on 37.5% of Pro scores — it says "deduct 2 from 10" but outputs 4 instead of 8. Pro scores below are raw judge output. The direction (with > without) is consistent but the absolute numbers are unreliable. Use a stronger judge model for accurate Pro evaluation.
| Industry | Test | Without | With | Gain |
|---|---|---|---|---|
| Fintech | Payment system design | 4 | 6 | +2 |
| Healthcare | Patient data migration | 4 | 4 | 0 |
| E-commerce | Black Friday scaling | 4 | 2 | -2 |
| EdTech | AI tutoring system | 4 | 6 | +2 |
| SaaS | Pricing strategy | 2 | 4 | +2 |
| DevOps | Kubernetes migration | 6 | 8 | +2 |
| Data | Real-time analytics pipeline | 4 | 0.6† | -3.4 |
| Security | API auth design | 0.2 | 4 | +3.8 |
| Mobile | Offline-first architecture | 0.6 | 6 | +5.4 |
| Startup | Build vs buy (email) | 2 | 10 | +8 |
| AI/ML | RAG system design | 10 | 6 | -4 |
| Legacy | Monolith decomposition | 10 | 10 | 0 |
| Average | | 4.2 | 5.5 | +1.3 |
†Judge error: reasoning says "deduct 4 from 10" = 6, but output 0.6. 9 of 24 Pro scores have similar arithmetic errors.
| Metric | Flash Lite | Pro |
|---|---|---|
| Average score (without → with) | 4.6 → 6.6 | 4.2† → 5.5† |
| Pass rate (without → with) | 8% → 42% | — |
| Improvement | +2.0 | +1.3† |
Key findings:
- ThinkStack improves both models; the weaker model benefits more (+2.0 vs +1.3)
- Biggest wins consistently: Data pipeline, Build vs buy, DevOps, Mobile, Legacy
- RAG fix: Changed modes from adversarial → abstraction for AI/ML tasks. Pro fixed (was -7.4, now 0). Flash Lite still struggles (-5) — this model already knows RAG well, modes add overhead
- Scores vary between runs (~1-2 points). These are single-run results, not averaged
- Judge reliability: Flash Lite scoring its own output = consistent (0 errors). Flash Lite scoring Pro output = 37.5% arithmetic errors. The judge says "deduct 2 from 10" but outputs 4 instead of 8. Pro absolute scores are unreliable; the relative direction (with > without) is consistent. For accurate Pro scoring, use a stronger judge model
- Security scored 4/4 across all conditions — rubric may need redesign for that domain
Run yourself:
`npx promptfoo@latest eval` | Config: `evals/promptfooconfig.yaml` | Results: Flash Lite | Pro
We also ran longer-form tests using Claude Opus 4.6 (`claude -p`) and had Gemini evaluate the full outputs manually. These provide a detailed qualitative comparison.
"Our Node.js API returns 500 errors for 2% of requests. POST /api/orders only. Logs show 'connection pool exhausted' but pool max is 20 and we get 50 req/s. Started after Tuesday's deploy."
Paired with: systematic-debugging skill | Modes: analytical + first-principles + systems-thinking
| Criteria | Without | With |
|---|---|---|
| Focus | How to fix the pool leak | How to find the trigger of the leak |
| Code examples | Instructional | Broken vs fixed comparison |
| Testing approach | Better (bash concurrent test) | Basic (logging) |
| Nuance | General debugging | Explains the 2% failure distribution |
Both responses are strong. Without-modes has a better testing script. With-modes better explains why exactly 2% of requests fail and provides an explicit causal chain (deploy → error path → leaked connection → pool drains → 500s).
Full output: without ThinkStack | with ThinkStack
"Should we migrate from REST to GraphQL?"
No skill | Modes: first-principles + tradeoff + adversarial
| Criteria | Without | With |
|---|---|---|
| Completeness | Moderate | High |
| Risk mitigation | Basic (N+1, auth) | Advanced (security, CDN, partial failure) |
| Alternative solutions | None provided | tRPC, BFF, sparse fields |
| Decision logic | Narrative/broad | Structured binary checklist |
Full output: without ThinkStack | with ThinkStack
"Design a database schema for a multi-tenant SaaS project management tool."
No skill | Modes: systems-thinking + first-principles + tradeoff
| Criteria | Without | With |
|---|---|---|
| Multi-tenancy approach | Standard RLS | Denormalized workspace_id for O(1) RLS — explicitly justified |
| Architecture reasoning | Decisions listed | Systems map + tradeoff table |
| Practical features | More complete (soft deletes, polymorphic attachments) | Missing some features |
Without ThinkStack, the model initially asked a clarifying question instead of deciding. We re-ran with the decision pre-answered for fair comparison. See known tradeoff.
Full output: without ThinkStack | with ThinkStack
| Mode | Core Rule | What It Forces |
|---|---|---|
| analytical | Do the math | Quantify claims, trace cause-effect chains, name the structure |
| first-principles | Question the premise | Challenge framing, separate constraints from conventions |
| systems-thinking | Trace ripple effects | Map the system, find feedback loops, state second-order effects |
| tradeoff | Name the cost | State what you lose, list alternatives, assess reversibility |
| probabilistic | State your confidence | Attach confidence levels, use ranges, name tail risks |
| adversarial | Break it | Attack the happy path, find trust boundaries, calculate blast radius |
| optimization | Find the bottleneck | Define the target, measure before optimizing, state diminishing returns |
| abstraction | Zoom to the right level | Name the abstraction level, recognize patterns, define interfaces |
| exploration | Widen first | Generate 3+ options before evaluating, include the unconventional option |
| critical | Demand evidence | Cite evidence for claims, check sources, state what would change your mind |
ThinkStack improves how the model reasons (tradeoff tables, systems maps, justified decisions). But reasoning takes tokens — what the model spends on "why" it can't spend on "what." In our database design test, the with-ThinkStack version had better architectural reasoning but missed practical features that the without-ThinkStack version included.
ThinkStack is a thinking layer, not a completeness layer. For complete implementations, pair it with a skill that ensures coverage — like writing-plans which has a self-review step. Skill handles what to build, ThinkStack handles how to reason about it.
thinkstack/
├── SELECTOR.md ← entry point (agent reads this to auto-pick modes)
├── README.md
├── COMPACT.md ← 6-line summaries per mode
├── INDEX.md ← composability rules
├── INTEGRATION.md ← hook setup for Claude Code / Kilo Code
├── modes/
│ ├── analytical/MODE.md (154 words, 6 rules)
│ ├── first-principles/MODE.md (178 words, 6 rules)
│ ├── systems-thinking/MODE.md (152 words, 6 rules)
│ ├── tradeoff/MODE.md (156 words, 6 rules)
│ ├── probabilistic/MODE.md (165 words, 6 rules)
│ ├── adversarial/MODE.md (168 words, 6 rules)
│ ├── optimization/MODE.md (170 words, 6 rules)
│ ├── abstraction/MODE.md (182 words, 6 rules)
│ ├── exploration/MODE.md (172 words, 6 rules)
│ └── critical/MODE.md (186 words, 6 rules)
└── test-results/ ← comparison outputs for independent review
MIT