v0.0.3
Added
- Added a first-run setup card: on startup tomte auto-detects your OS and package manager and, if an important external tool is missing (currently
`git`), shows the exact install command for your platform (`winget`/`scoop`/`choco`, `brew`, `apt`/`dnf`/`pacman`/…, or a download link). It only shows the command and never runs an installer, and no card appears when the environment is ready.
- Added an OS terminal window/tab title: tomte names the window
`tomte` on launch and `tomte — <task>` after the first prompt, resetting to `tomte` on `/clear` so the next prompt re-titles it, so several tomte sessions are easy to tell apart. Cross-platform via crossterm (`SetConsoleTitle` on Windows, the OSC title escape on macOS/Linux/Windows Terminal); the task text is the first prompt line with control characters stripped, so a crafted prompt can't inject its own terminal escape.
- Added an optional focus to
`/compact`: `/compact <what to keep>` steers the summary toward the topic you name (e.g. `/compact the auth refactor and the failing test`) while still producing a self-contained summary, so a compaction at 85% doesn't drop the thread you care about. A bare `/compact` is unchanged, an auto-compaction never carries a steer, and a blank focus (`/compact ` with only whitespace after it) is treated as no focus; the steer is consumed once per run so it can't leak into the next summary. (The instruction is appended to the existing compaction prompt; history replacement and checkpoint handling are untouched.)
- Added
`/rewind`: restore the session to an earlier turn. It truncates the conversation back to a chosen turn and reverts the file edits made since (the custodian you can follow and undo). A checkpoint is recorded at every turn; `/rewind` opens a picker of them (newest first), each row showing its blast radius before you commit (`… · drops N later turns · reverts M files`, Pillar 1), and selecting one reverts each touched file to its pre-turn content, newest-edit-first so stacked edits to one file collapse to a single restore. A file you changed outside tomte is reported and left as-is, never clobbered; `run_shell` side effects can't be undone and are counted in the calm summary (`↩ rewound to: … · reverted N files · M shell effects could not be undone`). In-session only: checkpoints reset on `/clear`, `/compact`, and `/resume`, since they index the runtime undo stack. Reuses the same atomic-restore + staleness guard as `/undo`; edits since a checkpoint are tracked by a monotonic counter so the capped undo stack's eviction can't miscount.
- Added a live thinking display: while the model reasons, tomte now shows the reasoning text in muted italic (like Claude Code) so you can follow its thought, then collapses it to a compact
`Thought for Xs` line the moment the answer starts. On by default; `/thoughts off` (or `show_thinking: false` in `config.json`) hides the text and keeps only the spinner's `thinking` cue, `/thoughts on` brings it back. Provider-agnostic: it renders whatever reasoning the active model streams (Anthropic thinking, OpenAI reasoning), so it carries across a model switch. (`/thinking` is unchanged and still picks reasoning effort; `/thoughts` is the new display toggle. The reasoning was already captured per turn; this surfaces it instead of suppressing it.)
- Added
`tomte --continue` (`-c`): resume the most recent session in the current directory immediately, skipping the `/resume` picker (parity with `claude --continue`). It reuses the exact restore path the picker uses (history, reasoning, and the active goal), and a directory with no saved session starts fresh with a one-line note instead of erroring. `tomte resume` still opens the picker to choose among older sessions.
- Added the Proof Capsule: "done means verified."
`/prove` (in a session) and `tomte prove` (headless) collect an evidence bundle the CLI gathers itself: the files git reports changed, plus the real exit codes of the project's own verification scripts (test, typecheck, lint, build), which tomte runs and observes. The model never supplies these numbers and can't fabricate a green capsule; at most it explains one the CLI already collected. The card reads ✅ Verified / ❌ Not verified / ⚠️ Unverified, lists each check (`✅ test passed cargo test`), shows a failing check's output tail, and ends with a one-line reproduce command. A check the project could define but doesn't (a Node project with no `typecheck` script) surfaces as a deterministic "⚠️ not verified", never silently dropped, so an absent test suite can't masquerade as a passing one. The toolchain is auto-detected per ecosystem: Rust (`cargo test`/`check`/`clippy`/`build`), Node (`package.json` scripts via the detected `npm`/`pnpm`/`yarn`/`bun`, resolving `typecheck`/`type-check`/`tsc` aliases), Go (`go test`/`vet`/`build ./...`), and Python (`pytest`/`mypy`/`ruff`, each present only when its tool is on PATH). `tomte prove` exits non-zero when any check fails, so it can gate a commit hook or CI step; `tomte prove --json` emits the capsule for scripting. In the TUI the collection runs on a background task (it can shell out to a full build/test suite) so the UI keeps animating, and a second `/prove` while one is in flight is a no-op. Cross-platform (runs each script through `cmd /C` on Windows, `sh -c` elsewhere; secret-looking env vars are scrubbed from the child as everywhere else).
- Added the Repo Twin / Context X-Ray: a verifiable map of the repository the agent (and you) can trust.
`tomte twin` builds five indexes straight from the source (file/import graph, symbol/function graph, test→source map, git recent-change map, and project conventions from AGENTS.md/README/docs) and caches them as JSON beside the memory/decision stores, rebuilding automatically when the working tree changes (`--rebuild` forces a fresh scan, `--json` emits the summary). `tomte why-context <file|symbol|file:line>` is the headline query: given a seed (a file, a stack-trace location, or a symbol name) it prints the files a maintainer would pull into context, each tagged with the index it came from (import / symbol / test / git / recorded decision), and the nearby files it deliberately leaves out, each with the reason it's unreachable. Every claim is grounded in a real import edge, definition, test, commit, or decision, never an invented "this project uses pattern X": the symbol graph only traces globally-distinctive names and skips method/field accesses (so a generic name like `append` can't manufacture a false reference), and recorded decisions on the seed are shown with a freshness flag (fresh / drifted / stale) so you can see which memory has gone out of date. Multi-language (Rust, JavaScript/TypeScript, Python, Go) and cross-platform, with no native database: pure Rust/JSON. The same map is reachable from inside a session: `/twin [--rebuild]` shows the index summary and `/why-context <seed>` (alias `/xray`) runs the X-ray query, each on a blocking task off the UI thread so the first full scan doesn't freeze the session.
- Added the Handoff capsule,
`tomte handoff` (and `/handoff` in a session): the shift report that lets the next session pick the house up where this one left it, whether that's a colleague, tomorrow's you, or a different model entirely. The decision trail is cross-model on purpose, and this is the door it walks through. One paste-ready markdown capsule collects, from real state and never from a model's summary: where the tree stands (branch, HEAD, dirty files capped with an "and N more", recent commits), why things are the way they are (the newest recorded decisions with who recorded them, a pointer to the full trail, and a drift-watch line, anchors holding / healed / needing eyes, from the same reconcile `/why --reconcile` runs), the twin's five-index map summary, and the top of the Repo Pulse. Sections degrade gracefully (outside a git repo it says so; an empty trail points at `record_decision`) and it ends with the house rule (`tomte prove`: done means verified). `--json` for scripts, `--out HANDOFF.md` to write a file; in the TUI the collection runs off the UI thread.
- Added Repo Pulse,
`tomte pulse` (and `/pulse` in a session): which files are most likely to break next, scored from the Repo Twin's own indexes with a formula printed right on the card, `risk = commits in the recent git window × (import fan-in + 1) × 2 when no test covers the file`. Change heat says the code is in play, blast radius says others lean on it, and a missing test means the regression lands silently. Every factor is a real index entry (a `GitStat`, an `ImportEdge`, a `TestEdge`), so the verdict is reproducible: rerun it, get the same card, argue with the numbers instead of a model's vibe. The card lists the top 10 with the most recent commit subject for each, plus two vitals: how many hot source files have no covering test, and the file with the widest import fan-in (shown only at ≥2 importers, since a Rust mod tree gives every file exactly one declaring parent). Test files and non-source files are never scored; ties break by path so equal twins render byte-identical cards; `--json` emits the report for scripts and `--rebuild` re-scans first. Costs nothing beyond the cached twin load: pure index math, no shell-outs, no model.
- Added Claude Fable 5 (
`claude-fable-5`, GA June 9, 2026), Anthropic's new top tier above Opus, to the model catalog: it appears first in the Anthropic picker `/login` list (tagged `1M ctx · most capable · top tier`), carries its published facts (1M context window by default, 128K max output, adaptive-only thinking, `xhigh`/`max` effort honoured instead of clamped), prices at the published $10/$50 per MTok in `/cost` (cache read $1.00, 5m cache write $12.50, the same 0.1×/1.25× rule as the other Claude tiers), and `model: fable` in a subagent file resolves to the concrete id like `opus`/`sonnet`/`haiku` do. The request shape was already safe for Fable's API surface: tomte never sends `thinking: {"type":"disabled"}` (Fable rejects it; adaptive thinking is always on, so the param is simply omitted when no effort is set), never sends sampling parameters to Anthropic models, and an unknown future Claude id still routes to the forward-compatible adaptive shape. The same family rule recognizes dated `claude-fable-5-*` snapshots and prices the limited-availability `claude-mythos-5` at its published $10/$50.
- Added the Agent Tournament: "don't trust one agent; make several compete."
`tomte race "<task>" --agents N` (default 4, max 8) runs the task with N contestants, varying model (`--models a,b` round-robins a list), reasoning effort, and style, with the last contestant always the conservative minimal-patch entry, each in its own git worktree branched from HEAD, so contestants can never touch each other's tree or yours. Each contestant runs through the existing headless `tomte run` (sandboxed `workspace-write`, JSON event stream captured). The judge is deterministic and decides the winner; an LLM is never the referee. It measures the evidence itself per worktree: the project's own test/typecheck/lint/build checks via the Proof Capsule collector, diff size and files touched from `git diff --numstat`, whether a regression test was added (test-path conventions or added `#[test]`/`def test_`/`it(`-style markers in the diff), and how many risky shell commands the contestant ran, classified by the same `classify_danger` guard the live agent uses. Ranking is tiered so a clever-but-broken patch can never beat a working one (verified > checks-failed > no-change/errored), then scored within the tier (+verified, +added-test, −per-file, −per-line, −per-risky-command), with a smallest-diff tie-break; every reason on the card is generated from the measured numbers, so the verdict is reproducible. The winning patch is saved beside the project's tomte state (`race-winner-<label>.patch`) and `--apply` applies it to your working tree; `--json` emits the full report; worktrees are always torn down, even when a contestant errors or times out.
- Added the Commit Seal,
`tomte seal`: the Proof Capsule that travels with the commit. `tomte prove` answers "is the tree verified right now?" and the answer evaporates the moment anyone moves on; `seal` notarizes it onto the commit itself. Run at a clean HEAD (a dirty tree is refused with the changes named, since a seal describes a commit, not a drifting tree), it collects the capsule the CLI gathers itself (real exit codes of the project's own test/typecheck/lint/build; a model is never consulted) and attaches it as a git note under `refs/notes/tomte-seal`, bound to the commit and tree ids it was collected at. Notes are ordinary git objects, so the proof is pushed/fetched with the history it certifies (`git push origin refs/notes/tomte-seal`): a teammate's clone, CI, or tomorrow's machine reads the same seal. `tomte seal show [rev]` prints a commit's seal (with a ⚠ warning when the note answers for a different commit than it hangs on); `tomte seal verify [rev]` is the CI gate: exit 0 only when the commit is sealed, the binding matches (a copied note or one whose JSON was edited onto another commit never verifies), and the sealed capsule is green with at least one check actually run; "nothing failed because nothing ran" doesn't gate. A red capsule still seals (a seal is a notarized observation, not an award) and `seal` mirrors `prove`'s non-zero exit on failure; re-sealing replaces the note. `--json` on all three for scripts; headless by design, like `rounds`: sealing wants a clean tree, which is a commit-time posture, not a mid-session one.
- Added
`why_context` as an agent tool: the Repo Twin's Context X-Ray was only reachable through user commands (`tomte why-context`, `/why-context`); the model itself still picked context by grepping around. It is now a read-only built-in the agent calls on its own: pass a seed (a file, a stack-trace `file:line`, or a symbol) and it returns the connected files with the index each claim came from, plus the nearby files to skip. The system prompt's tool-discipline list teaches the timing (call it FIRST on a seeded task in a large or unfamiliar area); subagents can whitelist it (`why_context`, aliases `xray`/`why-context`), and its arguments accept the spellings other models reach for (`query`/`file`/`symbol`/`path`/`target`).
- Added
`/prove explain`, the explain layer on the Proof Capsule: after the CLI collects the card (real git state, real exit codes), one agent turn interprets it, covering what the verdict does and does not prove for the work in this session, the residual risks the checks cannot see, and what to verify by hand before shipping. The model never supplies or alters the numbers; it only reads the capsule the CLI already gathered. A bare `/prove` is unchanged, and an unknown argument prints usage instead of running.
- Added live progress to
`tomte race`: a tournament can run many minutes, and until now the CLI was silent from launch to the final card. Each contestant's lifecycle is now narrated on stderr with elapsed timestamps (`agent-b (gpt-5.5) started`, `finished its run in 4s`, `verifying — running the project's own checks…`, `checks: 3 passed, 0 failed`, plus worktree-creation failures), so you can see who is still running and who is being verified. Progress goes to stderr so `--json` stdout stays clean for piping.
- Added markdown link rendering in the chat:
`[text](https://…)` now draws the label accent-underlined with the target kept visible in dim parens (a terminal can't click, so hiding the URL would lose the one thing the reader needs). Only the safe shape linkifies: a matched `[label](url)` whose target carries a real scheme (`http://`, `https://`, `mailto:`); indexing like `arr[i](x)`, footnote `[1]`, and relative targets stay literal, same as the existing matched-pair rule for `*`/`` ` ``.
- Added the remaining everyday markdown to the chat renderer:
`~~strikethrough~~` (matched-pair rule, so a home path `~/src` or an unterminated `~~` stays literal), thematic breaks (`---`, `***`, `___`, and spaced `- - -` render as a faint horizontal rule instead of literal dashes; two dashes or mixed markers stay prose), and GFM task-list items (`- [ ]`/`- [x]` render as `☐`/`✓` glyphs mirroring the todo panel's marks; a non-checkbox bracket body keeps the plain bullet).
- Added description search to every picker: typing what a command does now finds it (
`reasoning` surfaces the effort row), not just its name/key; powers the slash menu, `/model`, `/resume`, `/rewind`, and the rest.
- Added an Environment block to the system prompt, carrying the runtime facts a model can't guess and reliably gets wrong when left to its training data: the working directory and git standing (branch + HEAD at session start, or "not a git repository"), the platform and architecture, which shell
`run_shell` actually executes through (`cmd /C` on Windows, with a nudge toward Windows-console syntax and an explicit `powershell -Command` escape hatch; `sh -c` elsewhere), and today's date with an instruction to trust it over the training cutoff when reasoning about "latest" versions. Each absent fact used to cost real turns (bash syntax sent to cmd.exe, stale version reasoning); now the harness states them up front. The block is marker-delimited and range-replaced in place on re-apply: it sits before the memory/decision-trail/skill blocks, so the truncate-style strip the other blocks use would have destroyed them. Applied at session start (TUI, resume, and headless `chat`) and on every `refresh_system_context`.
- Strengthened the system prompt's discipline rails: a new Git & version control section (the repo's state belongs to the user: never commit/push/tag/amend unless asked; read
`git status`/`diff`/`log` and match the repo's message convention before a requested commit; stage specific files, never blanket `git add -A`; never force-push, rewrite published history, or `--no-verify` past a failing hook; `gh pr create` for PRs with a reviewer-trustworthy summary), a new Security stance (help defend code freely, refuse harm-purposed code with a one-line alternative, never echo or commit secrets), and two tool-discipline rules that close real failure loops: a tool call the user denied is a decision, never to be re-issued unchanged, and a hook that blocks or annotates a call speaks for the user, not an obstacle to route around.
- Added Night Rounds,
`tomte rounds`: the custodian's read-only inspection walk, named for what the tomte of the stories actually does at night. One command re-checks every store tomte already keeps and reports what changed since the last walk: the Repo Twin is rebuilt fresh (never trusted from cache, because an inspection that trusts yesterday's map defeats itself) with a Δ line for file/test-edge counts; the Pulse is scored uncapped and diffed against the recorded baseline, so the card lists risers (`mainloop.rs 18→31 (+13)`) and files that turned hot-and-untested since last rounds; the decision trail is reconciled (the same drift watch `handoff` runs) with any GONE/AMBIGUOUS anchor named on the card; TODO/FIXME marks added since the baseline are listed with `file:line` (matched by file+text, so a mark that merely moved lines never reads as new); and the Proof Capsule pass re-runs the project's own test/typecheck/lint/build with real exit codes (`--no-proof` skips it). The baseline lives beside the memory/decision/twin stores and updates every walk. Exit semantics make it a CI morning gate: red (a decision whose anchored line is gone or ambiguous, or a failed check) exits 1; an amber walk (new TODOs, risers) still exits 0 and says the marks are worth a look; a clean walk says "A quiet night — nothing out of order." Every line is computed from real indexes and real exit codes; a model is never consulted, so two walks over the same tree say the same thing. Read-only by design (the opposite of the background-autonomy lane): rounds never edits the tree. `--json` for scripts, `--out rounds.md` to keep the morning report as a file.
- Added the
`esc to interrupt` affordance to the busy spinner: Esc has always cancelled a running turn, but the only place that said so was `/help`; cancellation is now discoverable at the exact moment it's needed, on the very line that shows the turn running.
- Made the status line's
`? for shortcuts` hint true: a bare `?` on an empty composer now shows the same card as `/help` (it previously just typed a `?` into the input). With any text in the composer, `?` stays an ordinary character.
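
The Repo Pulse entry above prints its scoring formula on the card, so it can be sketched in a few lines. This is only an illustration of that stated formula and the path tie-break: `FileStats`, `risk`, and `rank` are hypothetical names, and the real implementation reads its inputs from the twin's `GitStat`, `ImportEdge`, and `TestEdge` entries rather than from a hand-built struct.

```rust
/// Hypothetical per-file inputs (not tomte's real types).
#[derive(Debug, Clone, PartialEq)]
pub struct FileStats {
    pub path: String,
    pub recent_commits: u64, // commits inside the recent git window
    pub import_fan_in: u64,  // number of files that import this one
    pub has_test: bool,      // true when some test covers the file
}

/// risk = commits x (fan-in + 1) x 2 when no test covers the file.
pub fn risk(f: &FileStats) -> u64 {
    let untested_factor = if f.has_test { 1 } else { 2 };
    f.recent_commits * (f.import_fan_in + 1) * untested_factor
}

/// Highest risk first; equal scores break by path so the order is stable.
pub fn rank(mut files: Vec<FileStats>, top: usize) -> Vec<(String, u64)> {
    files.sort_by(|a, b| risk(b).cmp(&risk(a)).then_with(|| a.path.cmp(&b.path)));
    files
        .into_iter()
        .take(top)
        .map(|f| {
            let score = risk(&f);
            (f.path, score)
        })
        .collect()
}
```

Calling `rank(files, 10)` on the same inputs always yields the same ordered list, which is the property the entry relies on when it says equal twins render identical cards.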
Changed
- Changed Ctrl+C to a double-press quit guard: a single reflexive Ctrl+C (the terminal copy/cancel habit) used to kill the whole session instantly, mid-turn included, while inside an open approval card it did nothing at all. Now the first press clears the composer (stashing any draft into the ↑ recall history) and arms a two-second window with a
`ctrl+c again to quit` status hint; only a second press inside the window exits, and a press after the window lapses re-arms instead of quitting. The rule is uniform (identical idle, busy, in pickers, and under the approval/conscience cards), and any other key disarms the guard. Ctrl+D on an empty composer still quits immediately (the deliberate EOF idiom).
- Changed Esc on an idle composer to stash the cleared draft into the ↑ recall history instead of discarding it, so a reflexive Esc on a long half-written prompt is no longer an unrecoverable loss. (Esc while a turn is running still cancels the turn, unchanged.)
- Switched the default renderer back to the full-screen alternate screen (input pinned to the bottom, in-app scroll), reverting 0.0.2's inline default; the inline native-scrollback viewport is now opt-in via
`TOMTE_INLINE=1`. A mouse-wheel scroll now clears any in-app text selection instead of mis-tracking it onto unrelated rows.
- Changed the live fleet view to show a sub-agent's cumulative output tokens (
`· 1.2k tokens ·`) instead of a raw tool-call step count, a truer signal of how much work a child has done. The parent now forwards each sub-agent's token usage as it streams.
- Changed
`tomte race` to state its honest isolation posture up front on a platform without an OS sandbox (Windows): one stderr line before the start banner notes that contestants are still worktree-isolated and dangerous commands stay hard-blocked, but other shell side effects are not filesystem/network-confined there. On Linux/macOS (Landlock/seatbelt) nothing changes.
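
The double-press Ctrl+C guard described in this list is a small state machine, sketched below under stated assumptions. `QuitGuard`, `CtrlCOutcome`, and `QUIT_WINDOW` are hypothetical names, the exact boundary behaviour at the two-second mark is a guess, and the composer clearing and draft stashing that accompany the first press are not modelled.

```rust
use std::time::{Duration, Instant};

/// Two-second arming window, as the changelog entry describes.
const QUIT_WINDOW: Duration = Duration::from_secs(2);

#[derive(Debug, PartialEq)]
pub enum CtrlCOutcome {
    /// First press, or a press after the window lapsed: show the hint.
    Armed,
    /// Second press inside the window: exit the session.
    Quit,
}

/// Hypothetical guard; `armed_at` is None while disarmed.
#[derive(Default)]
pub struct QuitGuard {
    armed_at: Option<Instant>,
}

impl QuitGuard {
    /// Handle a Ctrl+C observed at `now`.
    pub fn on_ctrl_c(&mut self, now: Instant) -> CtrlCOutcome {
        match self.armed_at {
            Some(t) if now.saturating_duration_since(t) < QUIT_WINDOW => CtrlCOutcome::Quit,
            _ => {
                // Not armed, or the window lapsed: re-arm instead of quitting.
                self.armed_at = Some(now);
                CtrlCOutcome::Armed
            }
        }
    }

    /// Any other key disarms the guard.
    pub fn on_other_key(&mut self) {
        self.armed_at = None;
    }
}
```

Passing the timestamp in, rather than reading the clock inside the guard, keeps the rule testable without sleeping; an event loop would call `on_ctrl_c(Instant::now())`.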
Fixed
- Made the approval card show the same arguments the history records: the card used to display the model's raw spelling (
`cmd=…`, `filePath=…`) while the transcript recorded the canonical fold (`command=…`, `path=…`), so what the user approved and what the session recorded could read as two different calls. The card now renders the canonical shape with absent-field `null` placeholders stripped, so it shows exactly what the call carries and nothing else; tools without a canonical mapping pass through unchanged.
- Fixed house-rules surfacing for alias-spelled edits: the Pillar 5 recall card looked up a file's recorded decisions only under the
`path` key, so an edit the tool happily executes when spelled `filePath`/`file_path` silently skipped the lookup and the house rules never appeared. The lookup now accepts every alias the executing tool's deserializer accepts (the same consumer-parity rule that already fixed the conscience pre-check).
- Fixed the stream-truncation error message for a tool call whose name is whitespace-only: it interpolated the raw spaces (
tool ` `) instead of `<missing>`.
- Fixed
`undo_last_edit`/`/undo` refusing to unwind edits interleaved across files: after restoring a file, the staleness-guard refresh only looked at the top undo entry, so with an edit order of A → B → A the second undo of A was refused as "file has been modified since the edit" (the guard read tomte's own earlier restore as an external edit). The refresh now finds the newest remaining entry for the restored file wherever it sits in the stack; a genuine external edit still trips the guard.
- The numbered code preview under `● Write(src/main.rs)` rendered every line in one bright color, while the assistant's own fenced code blocks were already syntect-highlighted, so the same code read differently depending on who printed it. The preview now runs through the same syntect pipeline and theme, resolving the language from the target file's extension (the fence-token aliases `rust`, `ts`, `py`, … apply), with the code-block background bed for the same panel look; a path with no recognizable extension degrades to the plain-text syntax, never an error. The highlighter is fed line-by-line in order, so multi-line constructs (block comments, raw strings) keep their state across the preview.
- Fixed the streaming stutter: in a long session the chat visibly hitched on every tool event, because each one (a call starting, its args streaming, its result landing, a pre-flight card attaching) invalidated the render cache and forced the next frame to re-wrap and re-syntax-highlight the entire transcript; an agentic turn does that several times per tool call, so the cost landed as continuous 50–300 ms hiccups exactly while the agent worked. The cache is now a stable-prefix cache: everything before the live turn is wrapped once, validated each frame by a cheap fingerprint fold (so
`/resume`, `/rewind`, and `/clear` still invalidate it naturally), and extended append-only when a turn settles, while only the live turn re-wraps per frame. No event handler manages the cache at all anymore, so this class of regression can't come back by forgetting an invalidation site; a cache-vs-fresh equivalence test pins the rendered frames byte-identical across a scripted streaming turn.
- Smoothed the paint pipeline, fixing three smaller sources of shimmer/jank together: every frame's terminal writes (the inline scrollback commit plus the diff) are now bracketed in a DECSET 2026 synchronized update so the terminal paints them atomically instead of mid-write (auto-follow shifts the whole chat region every frame, so unsynchronized paints showed as tearing; terminals without the mode ignore the markers), stdout now goes through a 256 KB `BufWriter` so a frame reaches the terminal as one write instead of the line-buffered dribble, and inline mode's
`insert_before` now uses ratatui's `scrolling-regions` feature, inserting history via terminal scroll regions instead of clearing and redrawing the live viewport, which blinked on every committed block. Input latency is tightened in the same pass: a keystroke arriving mid-stream draws immediately instead of waiting behind the 16 ms frame budget, and a frame the budget deferred now paints on the budget's remainder instead of up to 80 ms late when the stream happens to go quiet.
- Made tool-call argument parsing tolerate a double-encoded payload: some models (and some providers' streaming layers) stringify the arguments object twice, so
`arguments` arrives as a JSON string whose content is the real object; tomte bounced that back as "arguments must be a JSON object, got string", costing a round-trip. The payload is now unwrapped exactly one level when (and only when) the inner text parses to an object; a bare string, a double-encoded array, and invalid JSON still get the existing self-correct error with the tool's schema hint. Provider-agnostic, so a model added later benefits automatically.
- Closed the conscience-check gap that tolerance opened: the Pillar 5 pre-check parsed an edit call's arguments more strictly than the tool phase that executes them, so a double-encoded payload (now unwrapped and run) or an alias spelling the tools themselves accept (
`filePath`, `oldString`, …) made the pre-check silently skip, so an edit could land against a recorded decision without its conscience card. The pre-check now resolves arguments through the same parse and alias fold as execution, so whatever spelling runs is also what the conscience sees.
- Made the race judge count what a contestant actually ran:
`count_risky_commands` matched the literal tool name `run_shell` and the literal `command` field, but a contestant's model may spell the tool `bash`/`shell` (the registry resolves those at execution), double-encode the arguments, or use the `cmd` alias (the executing agent tolerates all three). Such a contestant ran the risky command without the judge counting it, skewing the deterministic score in its favor. The judge now canonicalises the tool name and parses arguments with the same tolerance as the agent it replays.
- Made the Repo Twin fully deterministic: two index nondeterminisms could make "rerun it, get the same card" untrue. A Go package import resolved to whichever
`.go` file a `HashSet` iteration happened to visit first (so rebuilds could flip import edges and Pulse fan-in counts), and a convention rule that mentioned several seed tokens cited whichever token the set yielded first. Both now pick the lexicographically-smallest candidate, so equal trees build byte-identical twins.
- Fixed
`why-context` reporting an absolute-path seed as "missing": a pasted stack-trace location is usually absolute (`C:/repo/src/x.rs:88`, `/home/me/repo/src/x.rs:88`), but the twin stores root-relative paths, so the documented stack-trace use case never resolved. The repo root is now stripped from an absolute seed before matching.
- Stopped a closed pipe reading as a crash:
`tomte twin | head -3` (or any evidence command piped into `head`/`Select-Object -First` in a script) made the stdlib's print macros panic with "failed printing to stdout: Broken pipe", exiting 101/-1 for a completely routine shell pattern. A panic whose payload is exactly that pipe-closed abort (Unix `Broken pipe`, Windows `os error 232`) now exits 0, since the consumer closing the pipe means it has all the output it wanted, while every other panic still reaches the default hook and aborts loudly. Matters most for `tomte prove --json`/`twin`/`why-context`/`why`/`blame` in CI pipelines.
- Closed a
`run_shell` destructive-command classifier bypass: an inline-code interpreter hidden behind a command-wrapper (`env python3 -c '…'`, `sudo node -e '…'`, `command perl -e '…'`, `nice ruby -e '…'`, `env awk 'BEGIN{system(…)}'`) slipped past the guard, because the inline-interpreter and process-substitution checks inspected only the first command word of a segment and saw the wrapper, not the interpreter. They now peel known wrappers (`env`/`sudo`/`xargs`/`command`/`exec`/`nice`/`nohup`/`time`/…) the same way the pipe-into-interpreter guard already did, so the interpreter behind them is classified, and the same peel covers a shell handed a process substitution behind a wrapper (`env bash <(curl …)`). Without the fix the payload auto-ran under a matching `run_shell(…:*)` grant or bypass mode, and on Windows (no OS filesystem sandbox) ran unconfined; a benign wrapped command (`env NODE_ENV=prod node server.js`, `sudo apt-get install nodejs`) stays unflagged.
- Closed a Windows-specific destructive-command bypass: cmd.exe accepts switch clusters with no separating space (
`del /s/q dir`, `rd /s/q dir`), but the recursive-delete guard tested for `/s` as its own whitespace token, so the glued form was never recognized. It now detects the `/s` switch inside a `/`-joined single-letter cluster, while still leaving a real path like `/usr/s` and a non-`/s` switch (`del /q file`) unflagged. Matters most on Windows, where the sandbox does not confine the filesystem.
- Closed a cross-provider credential clobber on OAuth sign-in: completing a ChatGPT/Codex or Claude login did
`load_auth → set this provider's tokens → save_auth` without the shared `REFRESH_LOCK` that `ensure_fresh` holds, so a sibling provider's token refresh landing in that window had its freshly-rotated, single-use refresh token reverted by the login's write-back (bricking it until a manual re-login). Both login completions now take the lock for just the load→save tail, acquired after the browser wait and code exchange, so a pending login never blocks refreshes while the user is still authorizing.
- Hardened
`config.json` to owner-only on Windows, matching `auth.json` and the Unix `0o600`: a literal `providers.<id>.api_key` in `config.json` is a real credential, but the Windows write path used a plain `std::fs::write` under the inherited `%APPDATA%` ACL while `auth.json` strips inheritance and grants only the owner. The config save now applies the same `icacls` owner-only grant (the dir with inheritance, then the file), reusing `auth`'s helpers; Unix and the `api_key_env` path are unchanged.
- Widened the child-process secret-env scrub to broker/cache connection strings that embed inline
`user:pass@host` credentials (`REDIS_URL`, `MONGODB_URI`/`MONGODB_URL`, `AMQP_URL`, `RABBITMQ_URL`, `CELERY_BROKER_URL`), matching the existing `DATABASE_URL` rule, so a prompt-injected `run_shell`/hook/MCP command can't read them via `env`. Matched by specific name (not a blanket `_URL`/`_URI`), so non-secret `REDIS_HOST`/`MONGO_HOST` siblings still pass through.
- Fixed the welcome card's right border misaligning when the workspace path or model id contains wide CJK/emoji characters: the text column measured and trimmed its body by code-point count while the column itself is sized in display columns (the sprite column already used display width), so a wide glyph overran the column and pushed the
`│` out. It now measures and truncates by display width (`unicode-width`), keeping the box flush.
- Fixed
`/cost` over-estimating Opus 4.5/4.6/4.7/4.8 spend by 3x: the price table charged every Opus id the original $15/$75 per MTok, but Opus 4.5 and later have been published at $5/$25 (verified against the June 2026 model docs). The Opus rate is now version-gated: 4.5+ bills at $5/$25 (cache read $0.50, write $6.25), while Opus 4.1 and older (including the dated bare-major `claude-opus-4-20250514`, whose version string doesn't parse) keep the original $15/$75.
- Fixed
`/cost` (and `tomte cost`) under-estimating spend for an OpenAI `-codex`/`-chat-latest` model variant: the price table carried those variants only for `gpt-5.3`, so `gpt-5.5-codex`/`gpt-5.4-codex` (real ids the model catalog recognizes) fell through to the unknown-model fallback rate, a ~4x under-estimate. They now price at their base family's published rate, completing the existing gpt-5.3 pattern.
- Fixed long multi-line pastes still firing off partial messages on Windows: the burst-coalescing drain bailed the instant the input channel was momentarily empty (
`now_or_never`), so a paste the OS delivered in chunks split across loop turns and a stray newline submitted mid-paste. The drain now waits briefly (`PASTE_COALESCE_GAP`) for the next event, so the whole block lands in the composer as one message. (Windows emits no bracketed-paste event, so this key-burst path is the fix there.)
- Fixed pasting an image on Windows: a
`Win+Shift+S` screenshot (or any clipboard bitmap) now attaches via Alt+V. arboard's CF_DIB reader misses Snip & Sketch / screenshot bitmaps, so on Windows tomte reads the clipboard image through PowerShell's `System.Windows.Forms.Clipboard::GetImage()` then `Save(.., Png)` (the same mechanism Claude Code uses); macOS and Linux keep arboard's native reader. Alt+V also reports an empty clipboard instead of doing nothing.
- Fixed the TUI getting laggier the longer a session ran: the alt-screen renderer re-wrapped and cloned the entire transcript every frame (a full-transcript clone at ~60fps while streaming, plus on idle redraws and every mouse-move). It now materializes only the visible window (O(viewport) per frame instead of O(transcript)), reuses the cached prefix while streaming, and stores the prefix split as an index so the wrapped transcript lives in memory once. Mirrors how Claude Code's Ink `<Static>` renders history once and only redraws the live region.
- Fixed
`glob`/`grep` hanging for minutes on a machine without ripgrep (common on Windows): the no-ripgrep fallback walked the whole tree skipping only `.git`, so every pattern crawled `node_modules`, `target`, `.next`, `build` (tens of thousands of files). It now honors `.gitignore`/`.ignore` via the `ignore` crate (ripgrep's own engine), so those directories are skipped and search is fast and correct whether or not `rg` is installed.
- Fixed
`dispatch_agent` erroring with "subagent `general-purpose` not found in any agents directory": `general-purpose` (the default `subagent_type`, and what Claude-Code-trained models pass) is now a built-in that needs no file on disk (all built-in tools, the parent's own system prompt). A fresh install with no agent files no longer fails every `dispatch_agent` call; a `general-purpose.md` under any agents root still overrides the built-in.
- Fixed
`dispatch_agent` aborting a fan-out with "subagent `code-explorer` not found": a `subagent_type` the model guessed now resolves instead of hard-failing. Three more built-ins ship with no file needed: `Explore` (read-only code search), `Plan` (read-only architecture planning), and `code-reviewer` (read-only review); a loosely-spelled name (`code-explorer`, `explorer`, `planner`, `security-reviewer`, …) folds onto the closest built-in; a name that still matches nothing falls back to `general-purpose` with a note telling the model the valid types instead of erroring; and the system prompt now lists the built-in types so the model names a real one up front. A same-named file under any agents root still overrides a built-in.
- Closed a
`run_shell` override-gate bypass: a destructive command fused to a shell control operator with no surrounding space (`ls;rm -rf /`, `true&&rm -rf /`, `cd /tmp&&rm -rf /`, `dir&del /s …`, `(rm -rf /)`) slipped past every destructive-command guard, because the classifier tokenized on whitespace alone and kept the command word glued inside one token (`ls;rm`, never recognized as `rm`). It now splits on `;` / `&` / `|` / newline / `()` as well, so the fused command is recognized and still has to clear the override prompt; without the fix it auto-ran under a `run_shell(…:*)` grant or `--dangerously-skip-permissions`, and on Windows (no OS filesystem sandbox) ran unconfined. The same pass also flags an inline-interpreter flag glued to its program (`python -c'…'`, `node -e'…'`), and routine operator chains (`cargo build && cargo test`) stay unflagged.
- Closed the matching indirect-prompt-injection gap on an MCP server's error result:
`tools/call` fenced a success result in `<untrusted-mcp-output>` (framework markers neutralized, a forged closing tag broken), but the `isError` path returned the server's text raw, so a compromised server could smuggle a directive through an error. Both paths now go through the same fence.
- Fixed
`edit_file`/`multi_edit` silently editing the wrong occurrence in a mixed CRLF+LF file: `read_file` strips `\r`, so the model's LF `old_string` can't tell a CRLF region from an LF region, and counting only the verbatim-LF form reported a single match and edited the LF copy even when the CRLF one was meant. Both encodings are now counted, so an ambiguous cross-encoding target trips the uniqueness gate (which asks for more surrounding context) instead of guessing; a single-encoding file is unaffected.
- Hardened the decision trail (
`decisions.jsonl`, the cross-model moat) against concurrent-session data loss and crash truncation: two tomte sessions open in the same project share one trail, and `reconcile` rewrote the whole file from a snapshot taken before it scanned the working tree, so a decision the other session appended in that window was silently clobbered; `reconcile` now re-reads and re-heals immediately before the rewrite, shrinking that window to the rename itself. Each append is now a single `O_APPEND` write (so two appenders can't interleave a half-line) and is `fsync`'d, and the reconcile rewrite flushes its staging file and the directory before/after the atomic rename, bringing the trail up to the same durability bar the session and config writers already hold.
- Closed a prompt-exfiltration vector in project config: a cloned repo's
`.tomte/config.json` could set `model` (or `fallback_models`) to a built-in-preset prefix like `openrouter/…` or `groq/…`, which routes every prompt and all file context to that third-party endpoint using the user's own `<PROVIDER>_API_KEY`. The `providers` key was already blocked from a project overlay for exactly this reason, but `model` reached the same endpoints and wasn't; a project `model`/`fallback_models` entry whose prefix resolves to a built-in or user-configured provider is now dropped with a warning (bare ids and native `openai/` / `anthropic/` specs are still honored).
- Fixed
`tomte doctor` / `/doctor` falsely reporting a valid MCP server or hook as "not found on PATH" on Windows: its `which` check only looked for `.exe`, but `npx`, `prettier`, `pnpm` and friends are `.cmd`/`.bat` shims that the runtime spawns fine via PATH×PATHEXT. The check now searches the same `PATHEXT` set the runtime uses, so it stops flagging a working setup (`git.exe` and the search tools were never affected).
- Fixed
`/rewind` to an earlier turn over-reverting after an `/undo`: `/undo` (and the `undo_last_edit` tool) popped the undo stack without lowering the edits-since counter each checkpoint records, so a later `/rewind` counted the already-undone edits as still pending and reverted that many entries from the top of the stack, reaching past the chosen checkpoint into edits made before it and rolling those files back too (and the picker's blast-radius preview over-counted the same way). `/undo` now drops that counter in lockstep (eviction of the capped stack's oldest entry still doesn't, so the count survives eviction as before), so `/rewind` reverts exactly the edits still live since the chosen turn and the preview shows the true file count. Host-side bookkeeping only; no model- or OS-specific behavior.
- Closed the last unserialized writers of
`auth.json`: `/logout`, `/apikey`, and the API-key activation in both login flows did a bare `load → mutate → save` while a token refresh (which holds the shared `REFRESH_LOCK` across its own load→network→save) could be in flight in the same process; the interleave could revert a freshly-rotated single-use refresh token (bricking that credential on its next refresh) or, mirror-image, the refresh's merge could re-persist OAuth tokens the user had just logged out. All four writers now go through a new locked `auth::mutate_auth` (same lock as the refreshes), and `ensure_fresh` now treats "this provider's slot vanished from disk mid-refresh" as a logout: the fresh access token still serves the in-flight turn, but the credential is never written back, so a logout from another process can't be silently resurrected either.
- Widened the child-process secret-env scrub to the
`*_PASS` family (`DB_PASS`, `SMTP_PASS`, `REDIS_PASS`, …): `PASSWORD`/`PASSWD`/`*_PWD` were caught but the equally common `_PASS` suffix leaked into every spawned shell/hook/MCP child. The required underscore spares `BYPASS`/`COMPASS`-style names.
- Gave
`settings.json` the same owner-only enforcement `auth.json`/`config.json` already get: `tomte mcp add --env KEY=VALUE` can store a real credential there, but the write path set `0o600` on Unix only; on Windows it now applies the same `icacls` owner-only grant before the atomic rename. `load_auth` also repairs a too-broad `auth.json` ACL on Windows now (once per process: `icacls` shells out, and the Unix-style cheap mode check has no Windows analogue), covering a credential file restored from a backup with a broad inherited ACL.
- Rejected Windows reserved device names (
`NUL`, `CON`, `PRN`, `AUX`, `COM1`–`COM9`, `LPT1`–`LPT9`, with or without an extension like `CON.txt`) in the file-tool path resolver: Win32 resolves them to devices regardless of directory, so they slipped past the lexical sandbox check and a write would target the console or the null sink instead of a file in the workspace. Ordinary names that merely start with a reserved word (`console.txt`, `common.rs`, `com10.txt`) still resolve; non-Windows platforms are unchanged (`con` is a legal filename there).
- Flagged string-BUILT PowerShell execution in the destructive-command classifier: a
`-Command` payload is normally plain PowerShell the flattened token scan reads, but `[scriptblock]::Create(…)`, a `FromBase64String` decode, or `[char]`-array `-join` assembly composes the real command at runtime where no token rule can see it. When PowerShell is invoked anywhere in the command those builders now classify as inline code (an approval prompt, never an auto-run); naming them in another program's arguments (`grep -r frombase64string src/`) stays unflagged. Closes the gap next to the already-covered `-EncodedCommand`.
- Named persisted
`run_shell` allow-rules after the program that will actually match: the "Allow `<prog>` in this project" label took the raw first word of the command, so a quoted or path-qualified spelling (`"git" status`, `/usr/bin/git pull`) persisted a rule labeled `"git"`/`git` inconsistently with the matcher's own quote-stripping/basename normalization. The label now runs through the same `program_name` the matcher uses, so what the user agrees to is exactly what later auto-runs.
- Fixed the approval and conscience modals clipping their own options off-screen: both counted logical lines while
`Wrap` produced more visual rows, so a long `args` preview or decision text pushed the Allow/Deny (or Abort/Supersede) rows and the key hint below the frame, leaving the user staring at a prompt with no visible choices. The two cards now share one modal engine: every row is pre-truncated to the popup's width (span-aware, display-column based), the popup is sized to its real row count, and when the terminal is too short the context rows are trimmed first, so the options and the key hint always win. The conscience card's decision text and conflict reason wrap (capped at 3 and 2 rows) instead of being cut to one line, since they are the substance of that choice.
- Fixed wide CJK/emoji text overrunning its row in the fleet panel, the todo panel, the queued-message list, and the
`Bash(…)` header: those truncations counted characters while the rows are budgeted in display columns, so a wide glyph cost two columns but was counted as one and the text spilled past its slot (the same class of bug the welcome card fix covered). All four now truncate by display width via the shared unicode-width helper.
- Aligned the last two stray text colors onto the calm palette: the markdown table header and the
`/context` headline each carried their own near-white RGB literal (235,235,240 / 230,230,235), one and two notches off `TEXT_BRIGHT` (231,231,231). Both now use the palette constant, so "bright ink" is one color everywhere; the documented exceptions (per-provider auth dots, `/context` category swatches, the buddy sprite) are untouched. The inline-code amber moved into the palette as a documented exception (`INLINE_CODE`/`INLINE_CODE_BG`) instead of loose literals in the markdown renderer.
- Made the status line degrade gracefully on a narrow terminal: the right side (auth dot · model · effort · context gauge · cwd) used to clip mid-text when it outgrew the row; it now drops segments from the tail in priority order (cwd first, then the gauge, then effort), so the auth dot and the model name always survive.
- Budgeted the fleet row so a long sub-agent prompt can no longer push the live metrics (`· activity · tokens · elapsed`) off the right edge: the metrics tail is the signal, so it keeps its full width and the prompt absorbs the squeeze with an ellipsis.
- Truncated picker rows to the popup's inner width: a long session title or description used to clip dead at the border; the title is never cut below its own length and the description absorbs the squeeze with an ellipsis. The generic-tool header's argument preview (`compact_args`) also cuts by display width now, completing the chars→columns sweep.
- Fixed
`/undo` (and the `undo_last_edit` tool) refusing the second of two stacked edits to the same file: restoring the first edit rewrote the file with a fresh mtime, so the next undo entry's staleness guard read tomte's own restore as an outside change and bailed with "file has been modified since the edit." After a restore the next same-file entry's `(mtime, size)` snapshot is refreshed to what was just written, so a stacked edit unwinds all the way back one step at a time; a genuine outside edit still moves the mtime and is still refused, so the anti-clobber guard is intact. (Cross-platform: mtime resolution differs per OS, but the refresh keys off the file tomte itself just wrote.)
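
The `run_shell` override-gate fix in the list above comes down to splitting on shell control operators before looking for a command word. The following is a minimal sketch of that idea, not tomte's actual classifier: the function names and the `DESTRUCTIVE` word list are hypothetical, and the sketch ignores quoting and flags (it would treat a `;` inside a quoted string as an operator, and flags any `rm` rather than only recursive ones).

```rust
/// Hypothetical list of command words this sketch treats as destructive.
const DESTRUCTIVE: &[&str] = &["rm", "del", "rd", "rmdir"];

/// Split a command line into simple commands on the control operators named
/// in the fix (`;`, `&`, `|`, newline, parentheses) instead of on whitespace
/// alone. Doubled operators such as `&&` just produce empty pieces, which are
/// dropped.
fn split_commands(line: &str) -> Vec<&str> {
    line.split(|c: char| matches!(c, ';' | '&' | '|' | '\n' | '(' | ')'))
        .map(str::trim)
        .filter(|piece| !piece.is_empty())
        .collect()
}

/// The first whitespace-delimited word of every simple command.
fn command_words(line: &str) -> Vec<&str> {
    split_commands(line)
        .into_iter()
        .filter_map(|cmd| cmd.split_whitespace().next())
        .collect()
}

/// True when any simple command starts with a destructive word, even if that
/// word was glued to an operator in the original text.
fn needs_override_prompt(line: &str) -> bool {
    command_words(line)
        .iter()
        .any(|word| DESTRUCTIVE.contains(&word.to_ascii_lowercase().as_str()))
}
```

With whitespace-only tokenizing, `ls;rm -rf /` yields the first token `ls;rm`, which matches nothing; after the operator split the same input yields the command words `ls` and `rm`, while `cargo build && cargo test` yields only `cargo` twice and stays unflagged.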
Tests
- Locked Markdown table border alignment with a regression test: an audit of the wide-CJK/long-word column-shrink + cell-wrap path found it already keeps every rendered row at one display width (`md_cell_width` measures display width, the shrink floor stays at 3, and `textwrap` breaks CJK and long words), so the box borders can't drift; the test guards that against a future change. No behavior change.
- Added regression tests for the OAuth PKCE primitive (
`auth/pkce.rs`, previously untested), locking that the code verifier and challenge are unpadded base64url within RFC 7636's length bound, the challenge is exactly the `S256` transform (`base64url(SHA-256(verifier))`), and the verifier/CSRF-state nonces differ per draw. No behavior change; guards the OAuth security primitive against a future refactor.
- Added regression tests locking two control-flow-critical argument-alias maps: the
`goal_update` status family (every `blocked`/`in_progress`/`complete` spelling, so a dropped alias can't silently strand an active `/goal`) and the `dispatch_agent` spawn-mode family (read-only spellings confine a sub-agent to plan mode, edit/write spellings don't). No behavior change.
- Added a security regression test for the todo `<system-reminder>` sanitizer (`safe_system_reminder_text`, previously untested): model-controlled todo text injected into the per-turn reminder has its `<`/`>`/`&` escaped (so it can't forge or break out of the reminder block), control characters collapsed to spaces, and length capped by character count; the test locks all three. No behavior change.
- Added unit tests for the OpenAI strict-schema transform (
`tools/schema.rs::strict_parameters_schema`, previously untested), locking that an optional property becomes nullable AND is added to `required`, `additionalProperties` is forced false, nested objects and array `items` are made strict recursively, and an optional enum gains a `null` variant. A regression here would make OpenAI reject every tool call with a 400. No behavior change.
- Added unit tests for the OpenAI Responses stream value extractors (
`openai/stream/value.rs`, previously only exercised end-to-end), locking that a tool-argument fragment keeps a real value verbatim (including a streamed bare `null`) while dropping the empty `null`/`""`/`{}`/`[]` placeholders, and that the text extractor flattens part arrays and picks the `text`/`content`/`delta` keys. Guards the streamed-tool-call contract prior fixes established. No behavior change.
- Added unit tests for the per-turn todo `<system-reminder>` builder (`agent/todo_reminder.rs::todo_reminder_text`, previously untested), locking that an empty list yields no reminder, each status renders (only in-progress carries its `active:` form), the list is capped with an `N more todo(s) omitted` summary, and model-controlled todo text is escaped so it can't forge or break out of the reminder block. No behavior change.
- Added a unit test for the model-facing todo-status alias parser (
`TodoStatus::parse`, previously untested), locking every `pending`/`in_progress`/`completed` spelling (incl. trim, case, and `-`/space normalization) and that an unknown value is rejected rather than guessed. No behavior change.
- Added CLI-crate unit tests for the
`/export` and `/commit` helpers (`tui/app/prompts.rs`, previously untested), locking the safe-fence invariant (`markdown_fence_for` always opens a fence longer than the longest backtick run inside the content, so an embedded code block can't close the export early), that `/commit` and `/commit-push-pr` carry the full git-safety protocol plus any user extra, and the all-todos-completed predicate. No behavior change.
- Added unit tests for the slash-command parser and message-token estimator (
`tui/app/slash.rs`, previously untested), locking that `split_slash_command` splits head/arg on the first whitespace, trims, and slices on a char boundary (a multibyte arg like `compact 中文` can't panic it), and that `estimate_messages_tokens` is the chars/4 estimate that skips the Welcome/Rich widgets. No behavior change.
- Extended the composer
`TextInput` tests to the previously-untested cursor-navigation primitives (`tui/input.rs`): `backspace` (removes a full multibyte char, no-op at start), `delete_word_left`, `move_left`/`move_right` (whole-char steps), `move_up`/`move_down` (preserve the display column across lines), and `move_home`/`move_end`. These are the byte-vs-char-boundary slicing paths most prone to a panic; the tests lock them. No behavior change.
- Added unit tests for the assistant-block invariant helpers and screen/cwd resolution (
`tui/app/blocks.rs`, `tui/app/helpers.rs`, previously untested), locking that `finish_open_assistant_block`/`rotate_assistant_block` close open blocks and drop empty ones (no stale empty stanza, exactly one fresh open block), `last_assistant_mut_open` returns only an open block, `initial_screen` routes to login only when fully unauthenticated, and `resolve_cwd_arg` accepts only an existing directory. No behavior change.
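
The safe-fence invariant that the `/export` tests lock can be illustrated with a short sketch. This is an assumption about how a helper like `markdown_fence_for` might be written, not the code in `tui/app/prompts.rs`; `longest_backtick_run` is a hypothetical helper name, and the three-backtick minimum is assumed from Markdown's usual fence length.

```rust
/// Length of the longest run of consecutive backticks in `content`.
/// (Hypothetical helper for this sketch.)
fn longest_backtick_run(content: &str) -> usize {
    let mut longest: usize = 0;
    let mut current: usize = 0;
    for ch in content.chars() {
        if ch == '`' {
            current += 1;
            longest = longest.max(current);
        } else {
            current = 0;
        }
    }
    longest
}

/// Build a fence that is strictly longer than any backtick run inside
/// `content`, and never shorter than three backticks, so nothing in the
/// content can be read as the closing fence.
fn markdown_fence_for(content: &str) -> String {
    let len = (longest_backtick_run(content) + 1).max(3);
    "`".repeat(len)
}
```

Content with no backticks gets the ordinary three-backtick fence, while content that already holds a three-backtick code block is wrapped in a four-backtick fence, so the embedded block cannot close the export early.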