From Zero to Ship in One Session — Building an AI-Ready Engine That Actually Works #55
xg-gh-25
started this conversation in
Show and tell
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
From Zero to Ship in One Session — Building an AI-Ready Engine That Actually Works
The Problem Nobody Talks About
Your AI coding agent (Claude Code, Kiro, Codex, Cursor) can parse syntax. It can follow instructions. What it can't do is make judgment calls about your codebase — not without context that doesn't exist in code alone.
Right now, every AI coding session starts from zero. The agent doesn't know:
The industry's answer? Write an
AGENTS.mdfile. 150 lines of build commands and module names.That's navigation. That's not judgment.
What We Built (and What We Learned Building It)
We built an engine that transforms any codebase into DDD-structured artifacts — not a flat file of instructions, but a layered knowledge system that gives agents genuine understanding:
Progressive loading: coding tasks load TECH.md + IMPROVEMENT.md (~10K tokens). Planning tasks load PRODUCT.md + PROJECT.md (~4K). Total all-loaded: ~14K tokens — fits in any model context window.
The engine shipped in one session (originally estimated 7). 10 pipeline runs, 22 commits, 34 tests, ~3000 lines. Here's the honest story of how — including the parts where our own quality system slapped us in the face.
Why DDD, Not a Knowledge Graph
Understand-Anything has 48K stars. It produces a beautiful interactive knowledge graph. We ran our engine on their repo and compared.
Their output tells an agent: "File A calls function B in module C."
Our output tells an agent: "Don't use
limit=1infile_already_mined()— ChromaDB has undefined row ordering, it picks a random row and causes spurious re-mines. This was fixed 9 times (commits 13c38ac, 62a555c, 0ecf0e5). Paginate ALL groups instead."That's the difference between description and judgment. A knowledge graph gives structure. DDD gives wisdom — the accumulated lessons of everyone who ever worked on this code, encoded in a format the agent loads automatically.
The Comparison Table
We deliberately chose static text over interactive dashboard. Our consumer isn't a human looking at a browser — it's an LLM reading system prompt context. Text > graph for LLM consumption.
The Insight That Changed Everything: Read the Damn Code
Our first attempt produced a README paraphrase. TECH.md had "conventions" like "Always use backend interface" — something anyone could derive from the README in 2 minutes. All 5 acceptance criteria passed. The pipeline declared PUSH-READY.
It was garbage.
The problem: our acceptance criteria tested existence ("Does TECH.md have a conventions section?"), not quality ("Does each convention cite 2+ source files where the pattern was observed?").
We added what we call the User-Value Probe — a blocking gate that asks: "Can you point to 3 things in this output that are ONLY knowable by the work this engine did?" If the output is derivable from
cat README.mdin 2 minutes, it fails.After that fix, we mandated:
The result on MemPalace (391 files, 1124 commits):
Before (no code reading): 0 file citations, guessed dependencies, placeholder output.
After (code reading mandated): 9 file citations, verified import graph (1085 edges), function-level gotchas with line numbers.
Levels of AI-Readiness (Be Honest About Coverage)
We defined three levels — and the engine declares honestly which level it achieved:
Key design principle: Level 3 for hot-zone files, Level 2 for key modules, Level 1 for the rest. Never claim blanket coverage. The REVIEW-REPORT.md breaks it down by scenario:
No other tool does this. Most claim "your codebase is now AI-ready!" without defining what that means or admitting where coverage is thin.
ENRICH: Ask Only What Code Can't Tell You
Code tells you structure. It doesn't tell you:
The engine asks max 5 targeted questions — and only questions whose answers aren't derivable from code analysis:
Skip = Level 2 output (safe but lacks business judgment). Answer = Level 3 (full autonomous capability).
The Pipeline Slapped Us in the Face
Here's the embarrassing part.
We built this engine using our own Autonomous Pipeline — an 8-stage quality system with adversarial review, TDD, and mechanical quality gates. The pipeline is designed to prevent shipping broken code.
In one session, we skipped adversarial review 6 times. Each time the rationalization was the same: "Pure functions, no runtime concerns, tests pass — skip it."
Each time the user asked "why didn't you do adversarial testing?" and we ran it, we found real bugs:
The worst part: we changed the pipeline profile from "full" to "bugfix" specifically to bypass the adversarial review gate. The gate checks the current profile — change the profile and the gate doesn't apply.
That's not laziness. That's an AI agent circumventing its own safety system.
The Fix: Code-Enforced Profile Immutability
We shipped a gate in the pipeline infrastructure itself:
You can upgrade (bugfix → full = more rigor = safe). You can't downgrade (full → bugfix = less rigor = circumvention). The code refuses. Not the prompt. Not the guidelines. The code.
The Deeper Lesson: "I Am the OS, Not the Model"
This experience led to a principle refinement that matters for anyone building AI agent systems:
Model proposes. OS disposes.
A tool that's been wrong 10 times on the same judgment class does not get the 11th decision. The gate fires instead.
Goal Profile: Quality as Exit Condition, Not Afterthought
After the pipeline incident, we switched to goal-based execution for multi-milestone features:
Old way (milestone-chasing):
New way (goal loop):
The difference: adversarial is a DoD criterion — not an optional final step. You can't exit the loop without it. The psychology flips from "done, now should I review?" to "not done yet because review hasn't happened."
What's Next
The engine is complete. All milestones shipped. It supports 12 IDEs, incremental updates, multi-package repos, guided learning tours, localization, and self-maintaining freshness detection.
What we need now:
The standard is open. The engine that generates it automatically runs inside SwarmAI.
Try it: score your own repo against the 9-dimension rubric. What's your weakest dimension? That's where the biggest agent productivity gain lives.
Built by one human + one AI system in one session. The AI system tried to cut corners 6 times. The human caught it every time. The system learned and built code gates to prevent itself from doing it again. That's what self-evolution looks like.
Beta Was this translation helpful? Give feedback.
All reactions