- Claude Code with Opus 4.6 as the main agent
- Claude Code Skills for workflow orchestration — each step in the workflow is a skill
- Claude Tasks for agent task management (replaced Beads)
- Agent Teams for parallel implementation with cross-agent communication (replaced sub-agents)
tl;dr: Same as before — it's way easier to iterate on a plan than on code. Front-load the arguing, only implement once there's nothing left to debate. What changed is the tooling got native, the models got smarter, and the feedback loops got tighter.
This seems like a lot of steps, but...
- Most steps are hands-off — you kick them off and wait
- You're not reading anything until it's been through 1-2 refinement passes
- For smaller stuff, skip steps — straightforward tasks go straight to implementation
A few quick notes:
- Tasks are for the agents, not for you. We still use Linear to manage bugs, features, tasks. Claude Tasks just lets us break up the work into small, outcome-focused pieces that an individual agent can pick up. Each task has acceptance criteria and starting points — the agent figures out the how.
- Context windows still matter, but less. Opus 4.6 is meaningfully better at maintaining coherence at depth. The team lead (supervisor) can chew through ~30 tasks before degrading, mostly because all the context lives in the tasks themselves, not in the conversation. Workers get rotated out after 1-3 tasks depending on weight to keep them in the smart window.
- Feedback loops are critical. Give your agents a linter, a type checker, and a test runner. TDD is baked into the worker prompt — they write a failing test, make it pass, then commit. Type checking is a hard gate on every diff review. Results are 2-5x better when agents can self-correct.
- Capture learnings, not just code. If an agent makes a bad choice, don't just fix it — run /remembering-learnings to mine what went wrong and curate it into CLAUDE.md so future agents don't repeat it.
Six weeks ago I was using OpenCode, Beads for task management, and a rotating bench of models (Opus, GPT-5.2, Gemini 3 Pro, GLM-4.7) as sub-agents. The theory was model diversity — different models catch different things.
Three things changed:
- Back to Claude Code. Anthropic shipped Tasks and then Teams, which make Claude Code a plugin-free E2E solution for the brainstorming -> execution flow. For me, that's worth the loss of multi-model swarming on reviews and design. It may come back; TBD.
- Beads → Claude Tasks. Swapped from Beads to Tasks because it's one less piece of overhead to manage. When I (infrequently) want to look at / annotate tasks, claude-task-viewer is great.
- Sub-agents → Agent Teams. Sub-agents were fire-and-forget — they couldn't talk to each other or to a supervisor. Agent Teams give you real coordination: a team lead reviews every diff, blocks/unblocks tasks, rotates workers when context degrades, and escalates when it can't resolve something.
**Outcome**
The philosophical shift: I stopped telling agents how to implement things and started telling them what done looks like. That's only possible because the models got good enough to bridge that gap reliably.
I'm also spending about 30% of my time iterating on harnesses, skills, and workflow tooling vs. actually shipping features. That ratio feels right — every hour invested in tightening the feedback loop pays for itself many times over. Usually this comes in fits and starts: during work on a feature, I'll see something to improve and either roll it into the current PR or make a new worktree and upstream it right then.
/brainstorming — give a high-level explanation of what you're trying to achieve. Work back-and-forth with the agent to design the solution.
- Agent explores the codebase first (sub-agents for deep research)
- Asks clarifying questions one at a time, multiple choice when possible
- Proposes 2-3 approaches with tradeoffs, leads with a recommendation
- Presents the design in sections, validates each before moving on
- Output: a PRD in docs/plans/ with problem statement, user stories, implementation decisions, and testing strategy
/improving-plans <path> — run 1-2x to iterate on the plan.
- Deep context gathering — reads every referenced file, explores related code, checks tests
- Critiques across five dimensions: clarity, architecture, practical implementation, API design, alternatives
- Caps at 5 new ideas per pass (no gold plating) and skips anything already covered in prior reviews
- Interactive — proposes approaches with tradeoffs, you pick
- Outputs a versioned plan (v2, v3...) and a review report so future passes don't retread the same ground
- Same rule as before: I don't bother reading the plan until it's been through at least one review pass
- Catching a bad design in a plan costs 30 seconds. Catching it in code costs an hour.
/creating-tasks <path> — converts the plan into Claude Tasks.
- Identifies the tracer bullet first — thinnest end-to-end slice that proves the approach
- Each task is outcome-focused: context, acceptance criteria, interface contracts, starting points
- Dependencies are explicit (blockedBy) — the dependency graph is the implementation sequence
- Cross-task integrity check: no overlapping AC, correct cross-references, shared prerequisites owned by exactly one task
- Uses sub-agents to answer its own questions; only surfaces things that genuinely need human judgment
- I rarely read the tasks in detail — skim for glaring mistakes (wrong assumptions, bad sequencing)
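To make the outcome-focused task shape concrete, here's a sketch — the field names are illustrative, not the actual Claude Tasks schema:

```typescript
// Illustrative only — not the real Claude Tasks schema.
interface Task {
  id: string;
  outcome: string;              // what "done" looks like, not how
  acceptanceCriteria: string[];
  blockedBy: string[];          // explicit dependencies
}

const tasks: Task[] = [
  { id: "T1", outcome: "Tracer bullet: thinnest end-to-end slice", acceptanceCriteria: ["happy path works"], blockedBy: [] },
  { id: "T2", outcome: "Persist results", acceptanceCriteria: ["survives restart"], blockedBy: ["T1"] },
  { id: "T3", outcome: "Expose via CLI", acceptanceCriteria: ["exit code reflects status"], blockedBy: ["T1"] },
];

// A task is ready once everything in blockedBy is done — the graph
// itself is the implementation sequence.
function readyTasks(done: Set<string>, all: Task[]): Task[] {
  return all.filter((t) => !done.has(t.id) && t.blockedBy.every((d) => done.has(d)));
}
```

With an empty done-set, only the tracer bullet is ready; once T1 lands, T2 and T3 unblock in parallel.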
/implementing-tasks — go for a walk.
- Spawns an agent team: one team lead (coordinator) + workers (one per task, max 5 concurrent)
- Team lead plans waves by dependency graph, assigns tasks, monitors file conflicts
- Each worker: claims a task, explores the codebase, implements with TDD, type-checks, commits, reports back
- Team lead reviews every diff against quality standards (karpathy guidelines, test anti-patterns, codebase rules)
- Workers get rotated out after 1-3 tasks depending on weight — fresh context = better output
- Approve → next task. Reject → specific feedback, fix, resubmit. Two rejections → escalate to human.
- Full type check between waves to prevent debt from compounding
- You're not needed here. For a reasonably sized feature, this runs for an hour+ unattended.
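My mental model of the team lead's wave planning is a topological layering of the dependency graph, capped at the concurrency limit — a sketch of the idea, not the actual scheduler:

```typescript
// Sketch of wave planning: each wave is the set of currently
// unblocked tasks, capped at the worker limit (max 5 concurrent
// per the workflow above).
function planWaves(deps: Record<string, string[]>, maxConcurrent = 5): string[][] {
  const waves: string[][] = [];
  const done = new Set<string>();
  const remaining = new Set(Object.keys(deps));
  while (remaining.size > 0) {
    const wave = [...remaining]
      .filter((t) => deps[t].every((d) => done.has(d)))
      .slice(0, maxConcurrent);
    if (wave.length === 0) throw new Error("dependency cycle");
    for (const t of wave) {
      remaining.delete(t);
      done.add(t);
    }
    waves.push(wave);
  }
  return waves;
}
```

A diamond-shaped graph (B and C both depend on A, D depends on both) plans out as three waves, with B and C running in parallel in the middle.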
/polishing — self-review before PR.
- Spawns a team of four specialized review agents in parallel:
- Lint & Types — mechanical fixes, commits directly
- Slop & Comments — AI slop, comment noise, commits directly
- Test Quality — test anti-patterns, vitest gotchas, commits directly
- Design & Correctness — architecture, correctness (report only)
- Team lead triages findings: fix what's unambiguous, escalate what needs human judgment
- Final gate: lint + type check must pass before finishing
- Output: summary of fixes + escalation doc for anything that needs your eyes
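The triage step in the polish pass boils down to partitioning findings by whether they need human judgment — here's a sketch with hypothetical data shapes (the reviewer names come from the list above; the rest is made up for illustration):

```typescript
// Hypothetical finding shape — illustrative, not the real format.
type Finding = { reviewer: string; issue: string; unambiguous: boolean };

const findings: Finding[] = [
  { reviewer: "lint-types", issue: "unused import", unambiguous: true },
  { reviewer: "test-quality", issue: "assertion on mock internals", unambiguous: true },
  { reviewer: "design-correctness", issue: "cache layer may be premature", unambiguous: false },
];

// Team lead triage: fix what's unambiguous, escalate the rest.
function triage(all: Finding[]): { fix: Finding[]; escalate: Finding[] } {
  return {
    fix: all.filter((f) => f.unambiguous),
    escalate: all.filter((f) => !f.unambiguous),
  };
}
```

The first three reviewers commit their fixes directly because their findings are almost always in the unambiguous bucket; design findings are report-only precisely because they usually aren't.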
Put on your QA hat. Actually use the feature.
- /debugging for anything weird — four-phase framework: investigate → analyze patterns → hypothesize → fix. No fixes without root cause understanding.
- Create new tasks for issues found, loop back through implement → polish as needed
- Bonus points: give agents tools to QA autonomously (CLI harnesses, browser-use MCP, etc.)
- QA on the user experience side is still the longest pole in the tent. For CLIs and APIs it's fast. For UIs, agent-controlled browsers are... fine. Getting better.
/remembering-learnings — mine what agents learned and curate it into CLAUDE.md.
- Collects learnings from three sources: commit message footers, team lead notes, review findings
- Deduplicates, filters through "would a fresh agent benefit from knowing this?"
- Presents a proposed CLAUDE.md diff for approval before writing
- This is how your CLAUDE.md stays alive. Agent mistakes become future guardrails.
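The collect-dedupe-filter pipeline can be sketched like this — the three sources come from the list above, but the data shape and dedup key are my assumptions:

```typescript
// Hypothetical learning record — the sources are from the workflow,
// the shape is illustrative.
type Learning = {
  source: "commit-footer" | "lead-notes" | "review-findings";
  text: string;
};

// Deduplicate by normalized text. The "would a fresh agent benefit
// from knowing this?" filter is judgment, so here it's just a
// placeholder predicate.
function curate(raw: Learning[], worthKeeping: (l: Learning) => boolean = () => true): string[] {
  const seen = new Set<string>();
  const kept: string[] = [];
  for (const l of raw) {
    const key = l.text.trim().toLowerCase();
    if (seen.has(key) || !worthKeeping(l)) continue;
    seen.add(key);
    kept.push(l.text.trim());
  }
  return kept;
}
```

The output becomes the proposed CLAUDE.md diff, which you approve or reject before anything is written.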
After all that, review the code once in GitHub.
- By now you've had automated design critique, team-supervised implementation, specialized polish passes, and manual QA
- This review is a sanity check, not the first line of defense
/summarizing-plan to consolidate planning artifacts into a single source-of-record doc — cut the user stories and version history, keep the decisions and architecture
Not everything fits the plan-then-implement workflow. For broad, repetitive tasks — migrations, bulk refactors, grep-and-swap operations — I pull out an actual Ralph loop.
Example: switching a test runner across 100+ files. Planned approach didn't work well (too many edge cases, plan goes stale immediately). Instead:
```sh
# ralph.sh — pick a file, swap it, verify, next
while true; do
  claude -p "you are done when the checks are green and the test passes"
done
```

20 minutes of fiddling to get the loop right, then let it chew through overnight. These are the kinds of tasks where telling the agent exactly what to do at the file level is actually the right call — the opposite of the outcome-focused approach above.
I'm also writing more harnesses — small CLIs and eval tools that aren't for me; they're for agents to verify their own work. An eval harness that lets agents diagnose and benchmark agent output. A test runner wrapper that gives better error context. These are the tools that make that time split worth it.
These aren't steps in the workflow — they're loaded automatically by other skills or used ad hoc:
- testing — TDD workflow, vitest patterns, anti-patterns. Loaded by workers during implementation and by reviewers during polish.
- karpathy-guidelines — behavioral guardrails against common LLM coding mistakes (overcomplication, speculative code, touching things you shouldn't). Loaded during diff review.
- reviewing-comments — comment hygiene. Remove narrator comments, keep "why" comments. Loaded during polish.
- summarizing-plan — consolidate versioned plan docs into a single source-of-record after shipping.
- debugging — four-phase debugging framework. Available anytime, most useful during QA.
- Better QA automation — agent-controlled browsers are functional but slow. The gap between "can verify a CLI" and "can verify a UI" is still huge.
- Cross-session memory beyond CLAUDE.md — learnings capture is good but there's more signal in the conversation that gets lost.
- Evaluating agent output quality systematically, not just vibes.