ThinkStack

Composable cognitive layers that make AI agents think better.

Your agent has skills that tell it what to do. ThinkStack tells it how to think while doing it.

Without ThinkStack:  "Should we migrate to GraphQL? Here are the pros and cons..."
With ThinkStack:     "Is this the right question? What problem are you actually solving?
                      Here are 3 alternatives you haven't considered, what you lose with
                      each option, and a decision framework to choose."

Setup

1. Clone into your project

git clone https://github.com/ziquanc/thinkstack.git

2. Add one line to your CLAUDE.md

Before responding, read and follow thinkstack/SELECTOR.md

That's it. The agent now auto-selects thinking modes for every request.

How it works

User asks: "design a database for my SaaS app"
  │
  ▼  Agent reads SELECTOR.md (from CLAUDE.md instruction)
  │
  ▼  Matches "design" → systems-thinking, first-principles, tradeoff
  │
  ▼  Reads those 3 MODE.md files from thinkstack/modes/
  │
  ▼  Applies the thinking rules while responding
  │
  ▼  If a skill also triggers → skill runs WITH thinking modes active

Simple questions ("what is 2+2?") get no modes — the selector skips them.
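The selection step above can be sketched as a simple keyword match. This is a hypothetical illustration; the real trigger phrases and mode mappings live in thinkstack/SELECTOR.md:

```python
# Hypothetical sketch of the selection step above; the actual trigger
# phrases and mode mappings are defined in thinkstack/SELECTOR.md.
KEYWORD_MODES = {
    "design": ["systems-thinking", "first-principles", "tradeoff"],
    "debug": ["analytical", "first-principles", "systems-thinking"],
    "should we": ["first-principles", "tradeoff", "adversarial"],
}

def select_modes(request: str) -> list[str]:
    """Return the thinking modes whose trigger phrases appear in the request."""
    text = request.lower()
    selected: list[str] = []
    for trigger, modes in KEYWORD_MODES.items():
        if trigger in text:
            for mode in modes:
                if mode not in selected:  # avoid duplicates when triggers overlap
                    selected.append(mode)
    return selected

print(select_modes("design a database for my SaaS app"))
# -> ['systems-thinking', 'first-principles', 'tradeoff']
print(select_modes("what is 2+2?"))  # -> [] (simple questions get no modes)
```

Requests that match no trigger get an empty mode list, which is how simple questions skip the system entirely.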

Other platforms

Any agent / system prompt — add to your system prompt:

Before responding, read and follow thinkstack/SELECTOR.md

Manual mode selection — in a skill frontmatter:

---
name: my-skill
thinking_modes: [analytical, first-principles, systems-thinking]
---

Claude Code / Kilo Code hooks — auto-select via hooks. See INTEGRATION.md.


The Problem

AI agents follow instructions well, but they also carry defaults that skip important moves. Ask "should we use X?" and even strong models produce a pros/cons list. They won't question the premise, list alternatives, name what you lose, or challenge weak reasons unless you explicitly tell them to.

Skills define the process (what steps to follow). ThinkStack provides cognitive rules — specific behaviors the agent must follow that push it beyond its defaults.

How Modes Work

Each mode is 6 rules, ~160 words. Each rule forces a specific output behavior the model would otherwise skip.

# Example: excerpt of the behaviors the first-principles mode forces:
Rule 1: "Challenge the framing — is this the right question?"
Rule 2: "Separate constraints from conventions"
Rule 5: "Name the assumptions explicitly"
Rule 6: "Kill false requirements"

Not "think deeply" — but "do these specific things you wouldn't normally do."

Strong models like Claude Opus benefit from rules that push past defaults. Weaker models (Haiku, GPT-4o-mini, Llama) benefit from the rules AND the reasoning structure they provide.
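Since each mode is just a markdown file, applying the selected modes amounts to reading each MODE.md into the working context. A minimal sketch, assuming the modes/&lt;name&gt;/MODE.md layout shown under Project Structure (the `compose_context` helper is ours, for illustration):

```python
from pathlib import Path

def compose_context(modes: list[str], root: str = "thinkstack") -> str:
    """Concatenate the MODE.md files for the selected modes into one
    context block (sketch; error handling omitted)."""
    parts = []
    for mode in modes:
        parts.append((Path(root) / "modes" / mode / "MODE.md").read_text())
    return "\n\n".join(parts)
```

With three modes at ~160 words each, the injected context stays under 500 words, which is why the overhead is small.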


Tested Results

Automated Evaluation (12 industries, 2 models, strict rubrics)

24 tests per model across 12 industries. Each prompt ran twice — once without ThinkStack, once with. Scored using strict LLM-as-judge rubrics via promptfoo. Rubrics demand specific numbers, named tradeoffs, and detailed analysis — not just mentioning topics.

[Chart: ThinkStack evaluation across 12 industries and 2 models]

Gemini 3.1 Flash Lite (weaker model)

| Industry | Test | Without | With | Gain |
|---|---|---:|---:|---:|
| Fintech | Payment system design | 2 | 4 | +2 |
| Healthcare | Patient data migration | 4 | 6 | +2 |
| E-commerce | Black Friday scaling | 4 | 5 | +1 |
| EdTech | AI tutoring system | 4 | 2 | -2 |
| SaaS | Pricing strategy | 2 | 4 | +2 |
| DevOps | Kubernetes migration | 6 | 10 | +4 |
| Data | Real-time analytics pipeline | 2 | 10 | +8 |
| Security | API auth design | 4 | 4 | 0 |
| Mobile | Offline-first architecture | 6 | 10 | +4 |
| Startup | Build vs buy (email) | 6 | 10 | +4 |
| AI/ML | RAG system design | 9 | 4 | -5 |
| Legacy | Monolith decomposition | 6 | 10 | +4 |
| **Average** | | 4.6 | 6.6 | +2.0 |
| **Pass rate** | | 1/12 (8%) | 5/12 (42%) | |
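As a sanity check, the summary rows can be recomputed from the per-industry scores above. The 8-out-of-10 pass threshold below is our inference (it reproduces the reported 1/12 and 5/12); the README does not state the actual threshold:

```python
# Recompute the Flash Lite summary rows from the per-industry scores above.
without = [2, 4, 4, 4, 2, 6, 2, 4, 6, 6, 9, 6]
with_ts = [4, 6, 5, 2, 4, 10, 10, 4, 10, 10, 4, 10]

avg_without = sum(without) / len(without)  # 4.58..., reported as 4.6
avg_with = sum(with_ts) / len(with_ts)     # 6.58..., reported as 6.6
print(round(avg_without, 1), round(avg_with, 1))  # 4.6 6.6

# Assumed pass threshold of 8/10 (our inference, not stated in the README):
PASS = 8
print(sum(s >= PASS for s in without), sum(s >= PASS for s in with_ts))  # 1 5
```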

Gemini 3.1 Pro (stronger model)

Warning: The judge (Flash Lite) made arithmetic errors on 37.5% of Pro scores — it says "deduct 2 from 10" but outputs 4 instead of 8. Pro scores below are raw judge output. The direction (with > without) is consistent but the absolute numbers are unreliable. Use a stronger judge model for accurate Pro evaluation.

| Industry | Test | Without | With | Gain |
|---|---|---:|---:|---:|
| Fintech | Payment system design | 4 | 6 | +2 |
| Healthcare | Patient data migration | 4 | 4 | 0 |
| E-commerce | Black Friday scaling | 4 | 2 | -2 |
| EdTech | AI tutoring system | 4 | 6 | +2 |
| SaaS | Pricing strategy | 2 | 4 | +2 |
| DevOps | Kubernetes migration | 6 | 8 | +2 |
| Data | Real-time analytics pipeline | 4 | 0.6† | -3.4 |
| Security | API auth design | 0.2 | 4 | +3.8 |
| Mobile | Offline-first architecture | 0.6 | 6 | +5.4 |
| Startup | Build vs buy (email) | 2 | 10 | +8 |
| AI/ML | RAG system design | 10 | 6 | -4 |
| Legacy | Monolith decomposition | 10 | 10 | 0 |
| **Average** | | 4.2 | 5.5 | +1.3 |

†Judge error: reasoning says "deduct 4 from 10" = 6, but output 0.6. 9 of 24 Pro scores have similar arithmetic errors.
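This kind of judge arithmetic error can be caught mechanically by parsing the stated deduction out of the judge's reasoning and comparing it with the emitted score. A hypothetical sketch (the `check_judge_score` helper is ours, not part of promptfoo):

```python
import re

def check_judge_score(reasoning: str, emitted: float, tol: float = 1e-9) -> bool:
    """Return False when the emitted score contradicts the deduction
    stated in the judge's own reasoning (hypothetical validator)."""
    m = re.search(r"deduct\s+(\d+(?:\.\d+)?)\s+from\s+(\d+(?:\.\d+)?)",
                  reasoning.lower())
    if not m:
        return True  # no stated deduction, nothing to check
    expected = float(m.group(2)) - float(m.group(1))
    return abs(expected - emitted) <= tol

# The error described above: "deduct 4 from 10" should give 6, not 0.6
print(check_judge_score("Deduct 4 from 10 for missing tradeoffs.", 0.6))  # False
print(check_judge_score("Deduct 4 from 10 for missing tradeoffs.", 6.0))  # True
```

A check like this, run over the 24 judge outputs, would have flagged the 9 inconsistent Pro scores automatically.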

Summary across models

| Metric | Flash Lite | Pro |
|---|---|---|
| Average score (without → with) | 4.6 → 6.6 | 4.2† → 5.5 |
| Pass rate (without → with) | 8% → 42% | |
| Improvement | +2.0 | +1.3 |

Key findings:

  • ThinkStack improves both models; the weaker model benefits more (+2.0 vs +1.3)
  • Biggest wins consistently: Data pipeline, Build vs buy, DevOps, Mobile, Legacy
  • RAG fix: Changed modes from adversarial → abstraction for AI/ML tasks. Pro fixed (was -7.4, now 0). Flash Lite still struggles (-5) — this model already knows RAG well, modes add overhead
  • Scores vary between runs (~1-2 points). These are single-run results, not averaged
  • Judge reliability: Flash Lite scoring its own output = consistent (0 errors). Flash Lite scoring Pro output = 37.5% arithmetic errors. The judge says "deduct 2 from 10" but outputs 4 instead of 8. Pro absolute scores are unreliable; the relative direction (with > without) is consistent. For accurate Pro scoring, use a stronger judge model
  • Security scored 4/4 across all conditions — rubric may need redesign for that domain

Run yourself: npx promptfoo@latest eval | Config: evals/promptfooconfig.yaml | Results: Flash Lite | Pro

Manual Evaluation (Claude Opus 4.6 outputs, Gemini-scored)

We also ran longer-form tests using Claude Opus 4.6 (claude -p) and had Gemini evaluate the full outputs manually. These provide detailed qualitative comparison.


Test 1: Debugging

"Our Node.js API returns 500 errors for 2% of requests. POST /api/orders only. Logs show 'connection pool exhausted' but pool max is 20 and we get 50 req/s. Started after Tuesday's deploy."

Paired with: systematic-debugging skill | Modes: analytical + first-principles + systems-thinking

| Criteria | Without | With |
|---|---|---|
| Focus | How to fix the pool leak | How to find the trigger of the leak |
| Code examples | Instructional | Broken vs fixed comparison |
| Testing approach | Better (bash concurrent test) | Basic (logging) |
| Nuance | General debugging | Explains the 2% failure distribution |

Both responses are strong. Without-modes has a better testing script. With-modes better explains why exactly 2% of requests fail and provides an explicit causal chain (deploy → error path → leaked connection → pool drains → 500s).

Full output: without ThinkStack | with ThinkStack


Test 2: Decision Making

"Should we migrate from REST to GraphQL?"

No skill | Modes: first-principles + tradeoff + adversarial

| Criteria | Without | With |
|---|---|---|
| Completeness | Moderate | High |
| Risk mitigation | Basic (N+1, auth) | Advanced (security, CDN, partial failure) |
| Alternative solutions | None provided | tRPC, BFF, sparse fields |
| Decision logic | Narrative/broad | Structured binary checklist |

Full output: without ThinkStack | with ThinkStack


Test 3: Database Design

"Design a database schema for a multi-tenant SaaS project management tool."

No skill | Modes: systems-thinking + first-principles + tradeoff

| Criteria | Without | With |
|---|---|---|
| Multi-tenancy approach | Standard RLS | Denormalized workspace_id for O(1) RLS, explicitly justified |
| Architecture reasoning | Decisions listed | Systems map + tradeoff table |
| Practical features | More complete (soft deletes, polymorphic attachments) | Missing some features |

Without ThinkStack, the model initially asked a clarifying question instead of deciding. We re-ran with the decision pre-answered for fair comparison. See known tradeoff.

Full output: without ThinkStack | with ThinkStack


Available Modes

| Mode | Core Rule | What It Forces |
|---|---|---|
| analytical | Do the math | Quantify claims, trace cause-effect chains, name the structure |
| first-principles | Question the premise | Challenge framing, separate constraints from conventions |
| systems-thinking | Trace ripple effects | Map the system, find feedback loops, state second-order effects |
| tradeoff | Name the cost | State what you lose, list alternatives, assess reversibility |
| probabilistic | State your confidence | Attach confidence levels, use ranges, name tail risks |
| adversarial | Break it | Attack the happy path, find trust boundaries, calculate blast radius |
| optimization | Find the bottleneck | Define the target, measure before optimizing, state diminishing returns |
| abstraction | Zoom to the right level | Name the abstraction level, recognize patterns, define interfaces |
| exploration | Widen first | Generate 3+ options before evaluating, include the unconventional option |
| critical | Demand evidence | Cite evidence for claims, check sources, state what would change your mind |

Known tradeoff: reasoning depth vs feature completeness

ThinkStack improves how the model reasons (tradeoff tables, systems maps, justified decisions). But reasoning takes tokens — what the model spends on "why" it can't spend on "what." In our database design test, the with-ThinkStack version had better architectural reasoning but missed practical features that the without-ThinkStack version included.

ThinkStack is a thinking layer, not a completeness layer. For complete implementations, pair it with a skill that ensures coverage — like writing-plans which has a self-review step. Skill handles what to build, ThinkStack handles how to reason about it.

Project Structure

thinkstack/
├── SELECTOR.md              ← entry point (agent reads this to auto-pick modes)
├── README.md
├── COMPACT.md               ← 6-line summaries per mode
├── INDEX.md                 ← composability rules
├── INTEGRATION.md           ← hook setup for Claude Code / Kilo Code
├── modes/
│   ├── analytical/MODE.md       (154 words, 6 rules)
│   ├── first-principles/MODE.md (178 words, 6 rules)
│   ├── systems-thinking/MODE.md (152 words, 6 rules)
│   ├── tradeoff/MODE.md         (156 words, 6 rules)
│   ├── probabilistic/MODE.md    (165 words, 6 rules)
│   ├── adversarial/MODE.md      (168 words, 6 rules)
│   ├── optimization/MODE.md     (170 words, 6 rules)
│   ├── abstraction/MODE.md      (182 words, 6 rules)
│   ├── exploration/MODE.md      (172 words, 6 rules)
│   └── critical/MODE.md         (186 words, 6 rules)
└── test-results/                ← comparison outputs for independent review

License

MIT
