- Claude Code with Opus 4.6 as the main agent
- Claude Code Skills for workflow orchestration — each step in the workflow is a skill
- Claude Tasks for agent task management (replaced Beads)
- Agent Teams for parallel implementation with cross-agent communication (replaced sub-agents)
tl;dr: Same as before — it's way easier to iterate on a plan than on code. Front-load the arguing, only implement once there's nothing left to debate. What changed is the tooling got native, the models got smarter, and the feedback loops got tighter.
This seems like a lot of steps, but...
- Most steps are hands-off — you kick them off and wait
- You're not reading anything until it's been through 1-2 refinement passes
- For smaller stuff, skip steps — straightforward tasks go straight to implementation
A few quick notes:
- Tasks are for the agents, not for you. We still use Linear to manage bugs, features, tasks. Claude Tasks just lets us break up the work into small, outcome-focused pieces that an individual agent can pick up. Each task has acceptance criteria and starting points — the agent figures out the how.
- Context windows still matter, but less. Opus 4.6 is meaningfully better at maintaining coherence at depth. The team lead (supervisor) can chew through ~30 tasks before degrading, mostly because all the context lives in the tasks themselves, not in the conversation. Workers get rotated out after 1-3 tasks depending on weight to keep them in the smart window.
- Feedback loops are critical. Give your agents a linter, a type checker, and a test runner. TDD is baked into the worker prompt — they write a failing test, make it pass, then commit. Type checking is a hard gate on every diff review. Results are 2-5x better when agents can self-correct.
- Capture learnings, not just code. If an agent makes a bad choice, don't just fix it — run /remembering-learnings to mine what went wrong and curate it into CLAUDE.md so future agents don't repeat it.
Six weeks ago I was using OpenCode, Beads for task management, and a rotating bench of models (Opus, GPT-5.2, Gemini 3 Pro, GLM-4.7) as sub-agents. The theory was model diversity — different models catch different things.
Three things changed:
- Back to Claude Code. Anthropic shipped Tasks and then Teams, which make Claude Code a plugin-free E2E solution for the brainstorming -> execution flow. For me, that's worth the loss of multi-model swarming on reviews and design. It may come back; TBD.
- Beads → Claude Tasks. Swapped from Beads to Tasks because it's one less piece of overhead to manage. When I (infrequently) want to look at / annotate tasks, claude-task-viewer is great.
- Sub-agents → Agent Teams. Sub-agents were fire-and-forget — they couldn't talk to each other or to a supervisor. Agent Teams give you real coordination: a team lead reviews every diff, blocks/unblocks tasks, rotates workers when context degrades, and escalates when it can't resolve something.
**Outcome**
The philosophical shift: I stopped telling agents how to implement things and started telling them what done looks like. That's only possible because the models got good enough to bridge that gap reliably.
I'm also spending about 30% of my time iterating on harnesses, skills, and workflow tooling vs. actually shipping features. That ratio feels right — every hour invested in tightening the feedback loop pays for itself many times over. Usually this comes in fits and starts: during work on a feature, I'll see something to improve and either roll it into the current PR or make a new worktree and upstream it right then.
/brainstorming — give a high-level explanation of what you're trying to achieve. Work back-and-forth with the agent to design the solution.
- Agent explores the codebase first (sub-agents for deep research)
- Asks clarifying questions one at a time, multiple choice when possible
- Proposes 2-3 approaches with tradeoffs, leads with a recommendation
- Presents the design in sections, validates each before moving on
- Output: a PRD in docs/plans/ with problem statement, user stories, implementation decisions, and testing strategy
/improving-plans <path> — run 1-2x to iterate on the plan.
- Deep context gathering — reads every referenced file, explores related code, checks tests
- Critiques across five dimensions: clarity, architecture, practical implementation, API design, alternatives
- Caps at 5 new ideas per pass (no gold plating) and skips anything already covered in prior reviews
- Interactive — proposes approaches with tradeoffs, you pick
- Outputs a versioned plan (v2, v3...) and a review report so future passes don't retread the same ground
- Same rule as before: I don't bother reading the plan until it's been through at least one review pass
- Catching a bad design in a plan costs 30 seconds. Catching it in code costs an hour.
/creating-tasks <path> — converts the plan into Claude Tasks.
- Identifies the tracer bullet first — thinnest end-to-end slice that proves the approach
- Each task is outcome-focused: context, acceptance criteria, interface contracts, starting points
- Dependencies are explicit (blockedBy) — the dependency graph is the implementation sequence
- Cross-task integrity check: no overlapping AC, correct cross-references, shared prerequisites owned by exactly one task
- Uses sub-agents to answer its own questions; only surfaces things that genuinely need human judgment
- I rarely read the tasks in detail — skim for glaring mistakes (wrong assumptions, bad sequencing)
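To make the outcome-focused task shape concrete, here's a sketch — the field names are illustrative, not the actual Claude Tasks schema:

```typescript
// Illustrative only — not the real Claude Tasks schema.
interface Task {
  id: string;
  outcome: string;              // what "done" looks like, not how
  acceptanceCriteria: string[];
  blockedBy: string[];          // explicit dependencies
}

const tasks: Task[] = [
  { id: "T1", outcome: "Tracer bullet: thinnest end-to-end slice", acceptanceCriteria: ["happy path works"], blockedBy: [] },
  { id: "T2", outcome: "Persist results", acceptanceCriteria: ["survives restart"], blockedBy: ["T1"] },
  { id: "T3", outcome: "Expose via CLI", acceptanceCriteria: ["exit code reflects status"], blockedBy: ["T1"] },
];

// A task is ready once everything in blockedBy is done — the graph
// itself is the implementation sequence.
function readyTasks(done: Set<string>, all: Task[]): Task[] {
  return all.filter((t) => !done.has(t.id) && t.blockedBy.every((d) => done.has(d)));
}
```

With an empty done-set, only the tracer bullet is ready; once T1 lands, T2 and T3 unblock in parallel.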
/implementing-tasks — go for a walk.
- Spawns an agent team: one team lead (coordinator) + workers (one per task, max 5 concurrent)
- Team lead plans waves by dependency graph, assigns tasks, monitors file conflicts
- Each worker: claims a task, explores the codebase, implements with TDD, type-checks, commits, reports back
- Team lead reviews every diff against quality standards (karpathy guidelines, test anti-patterns, codebase rules)
- Workers get rotated out after 1-3 tasks depending on weight — fresh context = better output
- Approve → next task. Reject → specific feedback, fix, resubmit. Two rejections → escalate to human.
- Full type check between waves to prevent debt from compounding
- You're not needed here. For a reasonably sized feature, this runs for an hour+ unattended.
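My mental model of the team lead's wave planning is a topological layering of the dependency graph, capped at the concurrency limit — a sketch of the idea, not the actual scheduler:

```typescript
// Sketch of wave planning: each wave is the set of currently
// unblocked tasks, capped at the worker limit (max 5 concurrent
// per the workflow above).
function planWaves(deps: Record<string, string[]>, maxConcurrent = 5): string[][] {
  const waves: string[][] = [];
  const done = new Set<string>();
  const remaining = new Set(Object.keys(deps));
  while (remaining.size > 0) {
    const wave = [...remaining]
      .filter((t) => deps[t].every((d) => done.has(d)))
      .slice(0, maxConcurrent);
    if (wave.length === 0) throw new Error("dependency cycle");
    for (const t of wave) {
      remaining.delete(t);
      done.add(t);
    }
    waves.push(wave);
  }
  return waves;
}
```

A diamond-shaped graph (B and C both depend on A, D depends on both) plans out as three waves, with B and C running in parallel in the middle.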
/polishing — self-review before PR.
- Spawns a team of four specialized review agents in parallel:
- Lint & Types — mechanical fixes, commits directly
- Slop & Comments — AI slop, comment noise, commits directly
- Test Quality — test anti-patterns, vitest gotchas, commits directly
- Design & Correctness — architecture, correctness (report only)
- Team lead triages findings: fix what's unambiguous, escalate what needs human judgment
- Final gate: lint + type check must pass before finishing
- Output: summary of fixes + escalation doc for anything that needs your eyes
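The triage step in the polish pass boils down to partitioning findings by whether they need human judgment — here's a sketch with hypothetical data shapes (the reviewer names come from the list above; the rest is made up for illustration):

```typescript
// Hypothetical finding shape — illustrative, not the real format.
type Finding = { reviewer: string; issue: string; unambiguous: boolean };

const findings: Finding[] = [
  { reviewer: "lint-types", issue: "unused import", unambiguous: true },
  { reviewer: "test-quality", issue: "assertion on mock internals", unambiguous: true },
  { reviewer: "design-correctness", issue: "cache layer may be premature", unambiguous: false },
];

// Team lead triage: fix what's unambiguous, escalate the rest.
function triage(all: Finding[]): { fix: Finding[]; escalate: Finding[] } {
  return {
    fix: all.filter((f) => f.unambiguous),
    escalate: all.filter((f) => !f.unambiguous),
  };
}
```

The first three reviewers commit their fixes directly because their findings are almost always in the unambiguous bucket; design findings are report-only precisely because they usually aren't.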
Put on your QA hat. Actually use the feature.
- /debugging for anything weird — four-phase framework: investigate → analyze patterns → hypothesize → fix. No fixes without root cause understanding.
- Create new tasks for issues found, loop back through implement → polish as needed
- Bonus points: give agents tools to QA autonomously (CLI harnesses, browser-use MCP, etc.)
- QA on the user experience side is still the longest pole in the tent. For CLIs and APIs it's fast. For UIs, agent-controlled browsers are... fine. Getting better.
/remembering-learnings — mine what agents learned and curate it into CLAUDE.md.
- Collects learnings from three sources: commit message footers, team lead notes, review findings
- Deduplicates, filters through "would a fresh agent benefit from knowing this?"
- Presents a proposed CLAUDE.md diff for approval before writing
- This is how your CLAUDE.md stays alive. Agent mistakes become future guardrails.
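The collect-dedupe-filter pipeline can be sketched like this — the three sources come from the list above, but the data shape and dedup key are my assumptions:

```typescript
// Hypothetical learning record — the sources are from the workflow,
// the shape is illustrative.
type Learning = {
  source: "commit-footer" | "lead-notes" | "review-findings";
  text: string;
};

// Deduplicate by normalized text. The "would a fresh agent benefit
// from knowing this?" filter is judgment, so here it's just a
// placeholder predicate.
function curate(raw: Learning[], worthKeeping: (l: Learning) => boolean = () => true): string[] {
  const seen = new Set<string>();
  const kept: string[] = [];
  for (const l of raw) {
    const key = l.text.trim().toLowerCase();
    if (seen.has(key) || !worthKeeping(l)) continue;
    seen.add(key);
    kept.push(l.text.trim());
  }
  return kept;
}
```

The output becomes the proposed CLAUDE.md diff, which you approve or reject before anything is written.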
After all that, review the code once in GitHub.
- By now you've had automated design critique, team-supervised implementation, specialized polish passes, and manual QA
- This review is a sanity check, not the first line of defense
/summarizing-plan to consolidate planning artifacts into a single source-of-record doc — cut the user stories and version history, keep the decisions and architecture
Not everything fits the plan-then-implement workflow. For broad, repetitive tasks — migrations, bulk refactors, grep-and-swap operations — I pull out an actual Ralph loop.
Example: switching a test runner across 100+ files. Planned approach didn't work well (too many edge cases, plan goes stale immediately). Instead:
```sh
# ralph.sh — pick a file, swap it, verify, next
while true; do
  claude -p "you are done when the checks are green and the test passes"
done
```

20 minutes of fiddling to get the loop right, then let it chew through overnight. These are the kinds of tasks where telling the agent exactly what to do at the file level is actually the right call — the opposite of the outcome-focused approach above.
I'm also writing more harnesses — small CLIs and eval tools that aren't for me; they're for agents to verify their own work. An eval harness that lets agents diagnose and benchmark agent output. A test runner wrapper that gives better error context. These are the tools that make that time split worth it.
These aren't steps in the workflow — they're loaded automatically by other skills or used ad hoc:
- testing — TDD workflow, vitest patterns, anti-patterns. Loaded by workers during implementation and by reviewers during polish.
- karpathy-guidelines — behavioral guardrails against common LLM coding mistakes (overcomplication, speculative code, touching things you shouldn't). Loaded during diff review.
- reviewing-comments — comment hygiene. Remove narrator comments, keep "why" comments. Loaded during polish.
- summarizing-plan — consolidate versioned plan docs into a single source-of-record after shipping.
- debugging — four-phase debugging framework. Available anytime, most useful during QA.
- Better QA automation — agent-controlled browsers are functional but slow. The gap between "can verify a CLI" and "can verify a UI" is still huge.
- Cross-session memory beyond CLAUDE.md — learnings capture is good but there's more signal in the conversation that gets lost.
- Evaluating agent output quality systematically, not just vibes.