Release v0.0.3 · ryan-mt/tomte

0.0.3

Added

Added a first-run setup card — on startup tomte auto-detects your OS and package manager and, if an important external tool is missing (currently git), shows the exact install command for your platform (winget/scoop/choco, brew, apt/dnf/pacman/…, or a download link). It only shows the command — it never runs an installer — and no card appears when the environment is ready.
Added an OS terminal window/tab title — tomte names the window tomte on launch and tomte — <task> after the first prompt, resetting to tomte on /clear so the next prompt re-titles it, so several tomte sessions are easy to tell apart. Cross-platform via crossterm (SetConsoleTitle on Windows, the OSC title escape on macOS/Linux/Windows Terminal); the task text is the first prompt line with control characters stripped, so a crafted prompt can't inject its own terminal escape.
Added an optional focus to /compact — /compact <what to keep> steers the summary toward the topic you name (e.g. /compact the auth refactor and the failing test) while still producing a self-contained summary, so a compaction at 85% doesn't drop the thread you care about. A bare /compact is unchanged, an auto-compaction never carries a steer, and a blank focus (/compact ) is treated as no focus; the steer is consumed once per run so it can't leak into the next summary. (The instruction is appended to the existing compaction prompt; history replacement and checkpoint handling are untouched.)
Added /rewind — restore the session to an earlier turn: it truncates the conversation back to a chosen turn AND reverts the file edits made since (the custodian you can follow and undo). A checkpoint is recorded at every turn; /rewind opens a picker of them (newest first), each row showing its blast radius before you commit — … · drops N later turns · reverts M files (Pillar 1) — and selecting one reverts each touched file to its pre-turn content, newest-edit-first so stacked edits to one file collapse to a single restore. A file you changed outside tomte is reported and left as-is, never clobbered; run_shell side effects can't be undone and are counted in the calm summary (↩ rewound to: … · reverted N files · M shell effects could not be undone). In-session only — checkpoints reset on /clear, /compact, and /resume, since they index the runtime undo stack. Reuses the same atomic-restore + staleness guard as /undo; edits since a checkpoint are tracked by a monotonic counter so the capped undo stack's eviction can't miscount.
Added a live thinking display — while the model reasons, tomte now shows the reasoning text in muted italic (like Claude Code) so you can follow its thought, then collapses it to a compact Thought for Xs line the moment the answer starts. On by default; /thoughts off (or show_thinking: false in config.json) hides the text and keeps only the spinner's thinking cue, /thoughts on brings it back. Provider-agnostic — it renders whatever reasoning the active model streams (Anthropic thinking, OpenAI reasoning), so it carries across a model switch. (/thinking is unchanged — it still picks reasoning effort; /thoughts is the new display toggle. The reasoning was already captured per turn; this surfaces it instead of suppressing it.)
Added tomte --continue (-c) — resume the most recent session in the current directory immediately, skipping the /resume picker (parity with claude --continue). It reuses the exact restore path the picker uses (history, reasoning, and the active goal), and a directory with no saved session starts fresh with a one-line note instead of erroring. tomte resume still opens the picker to choose among older sessions.
Added the Proof Capsule — "done means verified." /prove (in a session) and tomte prove (headless) collect an evidence bundle the CLI gathers itself: the files git reports changed, plus the real exit codes of the project's own verification scripts — test, typecheck, lint, build — which tomte runs and observes. The model never supplies these numbers and can't fabricate a green capsule; at most it explains one the CLI already collected. The card reads ✅ Verified / ❌ Not verified / ⚠️ Unverified, lists each check (✅ test passed cargo test), shows a failing check's output tail, and ends with a one-line reproduce command. A check the project could define but doesn't (a Node project with no typecheck script) surfaces as a deterministic "⚠️ not verified", never silently dropped — so an absent test suite can't masquerade as a passing one. The toolchain is auto-detected per ecosystem: Rust (cargo test/check/clippy/build), Node (package.json scripts via the detected npm/pnpm/yarn/bun, resolving typecheck/type-check/tsc aliases), Go (go test/vet/build ./...), and Python (pytest/mypy/ruff, each present only when its tool is on PATH). tomte prove exits non-zero when any check fails, so it can gate a commit hook or CI step; tomte prove --json emits the capsule for scripting. In the TUI the collection runs on a background task (it can shell out to a full build/test suite) so the UI keeps animating, and a second /prove while one is in flight is a no-op. Cross-platform (runs each script through cmd /C on Windows, sh -c elsewhere; secret-looking env vars are scrubbed from the child as everywhere else).
Added the Repo Twin / Context X-Ray — a verifiable map of the repository the agent (and you) can trust. tomte twin builds five indexes straight from the source — file/import graph, symbol/function graph, test→source map, git recent-change map, and project conventions (AGENTS.md/README/docs) — and caches them as JSON beside the memory/decision stores, rebuilding automatically when the working tree changes (--rebuild forces a fresh scan, --json emits the summary). tomte why-context <file|symbol|file:line> is the headline query: given a seed — a file, a stack-trace location, or a symbol name — it prints the files a maintainer would pull into context, each tagged with the index it came from (import / symbol / test / git / recorded decision), and the nearby files it deliberately leaves out, each with the reason it's unreachable. Every claim is grounded in a real import edge, definition, test, commit, or decision — never an invented "this project uses pattern X": the symbol graph only traces globally-distinctive names and skips method/field accesses (so a generic name like append can't manufacture a false reference), and recorded decisions on the seed are shown with a freshness flag (fresh / drifted / stale) so you can see which memory has gone out of date. Multi-language (Rust, JavaScript/TypeScript, Python, Go) and cross-platform, with no native database — pure Rust/JSON. The same map is reachable from inside a session: /twin [--rebuild] shows the index summary and /why-context <seed> (alias /xray) runs the X-ray query, each on a blocking task off the UI thread so the first full scan doesn't freeze the session.
Added the Handoff capsule — tomte handoff (and /handoff in a session): the shift report that lets the next session pick the house up where this one left it, whether that's a colleague, tomorrow's you, or a different model entirely — the decision trail is cross-model on purpose, and this is the door it walks through. One paste-ready markdown capsule collects, from real state and never from a model's summary: where the tree stands (branch, HEAD, dirty files capped with an "and N more", recent commits), why things are the way they are (the newest recorded decisions with who recorded them, a pointer to the full trail, and a drift-watch line — anchors holding / healed / needing eyes — from the same reconcile /why --reconcile runs), the twin's five-index map summary, and the top of the Repo Pulse. Sections degrade gracefully (outside a git repo it says so; an empty trail points at record_decision) and it ends with the house rule: tomte prove — done means verified. --json for scripts, --out HANDOFF.md to write a file; in the TUI the collection runs off the UI thread.
Added Repo Pulse — tomte pulse (and /pulse in a session): which files are most likely to break next, scored from the Repo Twin's own indexes with a formula printed right on the card — risk = commits in the recent git window × (import fan-in + 1) × 2 when no test covers the file. Change heat says the code is in play, blast radius says others lean on it, and a missing test means the regression lands silently — every factor is a real index entry (a GitStat, an ImportEdge, a TestEdge), so the verdict is reproducible: rerun it, get the same card, argue with the numbers instead of a model's vibe. The card lists the top 10 with the most recent commit subject for each, plus two vitals — how many hot source files have no covering test, and the file with the widest import fan-in (shown only at ≥2 importers, since a Rust mod tree gives every file exactly one declaring parent). Test files and non-source files are never scored; ties break by path so equal twins render byte-identical cards; --json emits the report for scripts and --rebuild re-scans first. Costs nothing beyond the cached twin load — pure index math, no shell-outs, no model.
Added Claude Fable 5 (claude-fable-5, GA June 9, 2026) — Anthropic's new top tier above Opus — to the model catalog: it appears first in the Anthropic picker/login list (tagged 1M ctx · most capable · top tier), carries its published facts (1M context window by default, 128K max output, adaptive-only thinking, xhigh/max effort honoured instead of clamped), prices at the published $10/$50 per MTok in /cost (cache read $1.00, 5m cache write $12.50 — the same 0.1×/1.25× rule as the other Claude tiers), and model: fable in a subagent file resolves to the concrete id like opus/sonnet/haiku do. The request shape was already safe for Fable's API surface: tomte never sends thinking: {"type":"disabled"} (Fable rejects it — adaptive thinking is always on, so the param is simply omitted when no effort is set), never sends sampling parameters to Anthropic models, and an unknown future Claude id still routes to the forward-compatible adaptive shape. The same family rule recognizes dated claude-fable-5-* snapshots and prices the limited-availability claude-mythos-5 at its published $10/$50.
Added the Agent Tournament — "don't trust one agent; make several compete." tomte race "<task>" --agents N (default 4, max 8) runs the task with N contestants — varying model (--models a,b round-robins a list), reasoning effort, and style, with the last contestant always the conservative minimal-patch entry — each in its own git worktree branched from HEAD, so contestants can never touch each other's tree or yours. Each contestant runs through the existing headless tomte run (sandboxed workspace-write, JSON event stream captured). The judge is deterministic and decides the winner — an LLM is never the referee: it measures the evidence itself per worktree — the project's own test/typecheck/lint/build checks via the Proof Capsule collector, diff size and files touched from git diff --numstat, whether a regression test was added (test-path conventions or added #[test]/def test_/it(-style markers in the diff), and how many risky shell commands the contestant ran, classified by the same classify_danger guard the live agent uses. Ranking is tiered so a clever-but-broken patch can never beat a working one (verified > checks-failed > no-change/errored), then scored within the tier (+verified, +added-test, −per-file, −per-line, −per-risky-command), with a smallest-diff tie-break; every reason on the card is generated from the measured numbers, so the verdict is reproducible. The winning patch is saved beside the project's tomte state (race-winner-<label>.patch) and --apply applies it to your working tree; --json emits the full report; worktrees are always torn down, even when a contestant errors or times out.
Added the Commit Seal — tomte seal: the Proof Capsule that travels with the commit. tomte prove answers "is the tree verified right now?" and the answer evaporates the moment anyone moves on; seal notarizes it onto the commit itself. Run at a clean HEAD (a dirty tree is refused with the changes named — a seal describes a commit, not a drifting tree), it collects the capsule the CLI gathers itself (real exit codes of the project's own test/typecheck/lint/build; a model is never consulted) and attaches it as a git note under refs/notes/tomte-seal, bound to the commit and tree ids it was collected at. Notes are ordinary git objects, so the proof is pushed/fetched with the history it certifies (git push origin refs/notes/tomte-seal) — a teammate's clone, CI, or tomorrow's machine reads the same seal. tomte seal show [rev] prints a commit's seal (with a ⚠ warning when the note answers for a different commit than it hangs on); tomte seal verify [rev] is the CI gate — exit 0 only when the commit is sealed, the binding matches (a copied note or one whose JSON was edited onto another commit never verifies), and the sealed capsule is green with at least one check actually run — "nothing failed because nothing ran" doesn't gate. A red capsule still seals (a seal is a notarized observation, not an award) and seal mirrors prove's non-zero exit on failure; re-sealing replaces the note. --json on all three for scripts; headless by design, like rounds — sealing wants a clean tree, which is a commit-time posture, not a mid-session one.
Added why_context as an agent tool — the Repo Twin's Context X-Ray was only reachable through user commands (tomte why-context, /why-context); the model itself still picked context by grepping around. It is now a read-only built-in the agent calls on its own: pass a seed (a file, a stack-trace file:line, or a symbol) and it returns the connected files with the index each claim came from, plus the nearby files to skip. The system prompt's tool-discipline list teaches the timing (call it FIRST on a seeded task in a large or unfamiliar area); subagents can whitelist it (why_context, aliases xray/why-context), and its arguments accept the spellings other models reach for (query/file/symbol/path/target).
Added /prove explain — the explain layer on the Proof Capsule: after the CLI collects the card (real git state, real exit codes), one agent turn interprets it — what the verdict does and does not prove for the work in this session, the residual risks the checks cannot see, and what to verify by hand before shipping. The model never supplies or alters the numbers; it only reads the capsule the CLI already gathered. A bare /prove is unchanged, and an unknown argument prints usage instead of running.
Added live progress to tomte race — a tournament can run many minutes, and until now the CLI was silent from launch to the final card. Each contestant's lifecycle is now narrated on stderr with elapsed timestamps (agent-b (gpt-5.5) started, finished its run in 4s, verifying — running the project's own checks…, checks: 3 passed, 0 failed, plus worktree-creation failures), so you can see who is still running and who is being verified. Progress goes to stderr so --json stdout stays clean for piping.
Added markdown link rendering in the chat — [text](https://…) now draws the label accent-underlined with the target kept visible in dim parens (a terminal can't click, so hiding the url would lose the one thing the reader needs). Only the safe shape linkifies: a matched [label](url) whose target carries a real scheme (http://, https://, mailto:); indexing like arr[i](x), footnote [1], and relative targets stay literal, same as the existing matched-pair rule for */`.
Added the remaining everyday markdown to the chat renderer — ~~strikethrough~~ (matched-pair rule, so a home path ~/src or an unterminated ~~ stays literal), thematic breaks (---, ***, ___, spaced - - - render as a faint horizontal rule instead of literal dashes; two dashes or mixed markers stay prose), and GFM task-list items (- [ ] / - [x] render as ☐/✓ glyphs mirroring the todo panel's marks; a non-checkbox bracket body keeps the plain bullet).
Added description search to every picker — typing what a command does now finds it (reasoning surfaces the effort row), not just its name/key; powers the slash menu, /model, /resume, /rewind, and the rest.
Added an Environment block to the system prompt — the runtime facts a model can't guess and reliably gets wrong when left to its training data: the working directory and git standing (branch + HEAD at session start, or "not a git repository"), the platform and architecture, which shell run_shell actually executes through (cmd /C on Windows — with a nudge toward Windows-console syntax and an explicit powershell -Command escape hatch — sh -c elsewhere), and today's date with an instruction to trust it over the training cutoff when reasoning about "latest" versions. Each absent fact used to cost real turns (bash syntax sent to cmd.exe, stale version reasoning); now the harness states them up front. The block is marker-delimited and range-replaced in place on re-apply — it sits before the memory/decision-trail/skill blocks, so the truncate-style strip the other blocks use would have destroyed them. Applied at session start (TUI, resume, and headless chat) and on every refresh_system_context.
Strengthened the system prompt's discipline rails — a new Git & version control section (the repo's state belongs to the user: never commit/push/tag/amend unless asked; read git status/diff/log and match the repo's message convention before a requested commit; stage specific files, never blanket git add -A; never force-push, rewrite published history, or --no-verify past a failing hook; gh pr create for PRs with a reviewer-trustworthy summary), a new Security stance (help defend code freely, refuse harm-purposed code with a one-line alternative, never echo or commit secrets), and two tool-discipline rules that close real failure loops: a tool call the user denied is a decision — never re-issue it unchanged — and a hook that blocks or annotates a call speaks for the user, not an obstacle to route around.
Added Night Rounds — tomte rounds: the custodian's read-only inspection walk, named for what the tomte of the stories actually does at night. One command re-checks every store tomte already keeps and reports what changed since the last walk: the Repo Twin is rebuilt fresh (never trusted from cache — an inspection that trusts yesterday's map defeats itself) with a Δ line for file/test-edge counts; the Pulse is scored uncapped and diffed against the recorded baseline, so the card lists risers (mainloop.rs 18→31 (+13)) and files that turned hot-and-untested since last rounds; the decision trail is reconciled (the same drift watch handoff runs) with any GONE/AMBIGUOUS anchor named on the card; TODO/FIXME marks added since the baseline are listed with file:line (matched by file+text, so a mark that merely moved lines never reads as new); and the Proof Capsule pass re-runs the project's own test/typecheck/lint/build with real exit codes (--no-proof skips it). The baseline lives beside the memory/decision/twin stores and updates every walk. Exit semantics make it a CI morning gate: red — a decision whose anchored line is gone or ambiguous, or a failed check — exits 1; an amber walk (new TODOs, risers) still exits 0 and says the marks are worth a look; a clean walk says "A quiet night — nothing out of order." Every line is computed from real indexes and real exit codes — a model is never consulted, so two walks over the same tree say the same thing. Read-only by design (the opposite of the background-autonomy lane): rounds never edits the tree. --json for scripts, --out rounds.md to keep the morning report as a file.
Added the esc to interrupt affordance to the busy spinner — Esc has always cancelled a running turn, but the only place that said so was /help; cancellation is now discoverable at the exact moment it's needed, on the very line that shows the turn running.
Made the status line's ? for shortcuts hint true — a bare ? on an empty composer now shows the same card as /help (it previously just typed a ? into the input). With any text in the composer, ? stays an ordinary character.

Changed

Changed Ctrl+C to a double-press quit guard — a single reflexive Ctrl+C (the terminal copy/cancel habit) used to kill the whole session instantly, mid-turn included, while inside an open approval card it did nothing at all. Now the first press clears the composer (stashing any draft into the ↑ recall history) and arms a two-second window with a ctrl+c again to quit status hint; only a second press inside the window exits, and a press after the window lapses re-arms instead of quitting. The rule is uniform — identical idle, busy, in pickers, and under the approval/conscience cards — and any other key disarms the guard. Ctrl+D on an empty composer still quits immediately (the deliberate EOF idiom).
Changed Esc on an idle composer to stash the cleared draft into the ↑ recall history instead of discarding it — a reflexive Esc on a long half-written prompt is no longer an unrecoverable loss. (Esc while a turn is running still cancels the turn, unchanged.)
Switched the default renderer back to the full-screen alternate screen (input pinned to the bottom, in-app scroll), reverting 0.0.2's inline default; the inline native-scrollback viewport is now opt-in via TOMTE_INLINE=1. A mouse-wheel scroll now clears any in-app text selection instead of mis-tracking it onto unrelated rows.
Changed the live fleet view to show a sub-agent's cumulative output tokens (· 1.2k tokens ·) instead of a raw tool-call step count — a truer signal of how much work a child has done. The parent now forwards each sub-agent's token usage as it streams.
Changed tomte race to state its honest isolation posture up front on a platform without an OS sandbox (Windows): one stderr line before the start banner notes that contestants are still worktree-isolated and dangerous commands stay hard-blocked, but other shell side effects are not filesystem/network-confined there. On Linux/macOS (Landlock/seatbelt) nothing changes.

Fixed

Made the approval card show the same arguments the history records — the card used to display the model's raw spelling (cmd=…, filePath=…) while the transcript recorded the canonical fold (command=…, path=…), so what the user approved and what the session recorded could read as two different calls. The card now renders the canonical shape with absent-field null placeholders stripped, so it shows exactly what the call carries and nothing else; tools without a canonical mapping pass through unchanged.
Fixed house-rules surfacing for alias-spelled edits — the Pillar 5 recall card looked up a file's recorded decisions only under the path key, so an edit the tool happily executes when spelled filePath/file_path silently skipped the lookup and the house rules never appeared. The lookup now accepts every alias the executing tool's deserializer accepts (the same consumer-parity rule that already fixed the conscience pre-check).
Fixed the stream-truncation error message for a tool call whose name is whitespace-only — it interpolated the raw spaces (tool ` ` ) instead of `<missing>`.
Fixed undo_last_edit//undo refusing to unwind edits interleaved across files — after restoring a file, the staleness-guard refresh only looked at the TOP undo entry, so with an edit order of A → B → A the second undo of A was refused as "file has been modified since the edit" (the guard read tomte's own earlier restore as an external edit). The refresh now finds the newest remaining entry for the restored file wherever it sits in the stack; a genuine external edit still trips the guard. — the numbered code preview under ● Write(src/main.rs) rendered every line in one bright color, while the assistant's own fenced code blocks were already syntect-highlighted, so the same code read differently depending on who printed it. The preview now runs through the same syntect pipeline and theme, resolving the language from the target file's extension (the fence-token aliases — rust, ts, py, … — apply), with the code-block background bed for the same panel look; a path with no recognizable extension degrades to the plain-text syntax, never an error. The highlighter is fed line-by-line in order, so multi-line constructs (block comments, raw strings) keep their state across the preview.
Fixed the streaming stutter — in a long session the chat visibly hitched on every tool event, because each one (a call starting, its args streaming, its result landing, a pre-flight card attaching) invalidated the render cache and forced the next frame to re-wrap and re-syntax-highlight the entire transcript; an agentic turn does that several times per tool call, so the cost landed as continuous 50–300 ms hiccups exactly while the agent worked. The cache is now a stable-prefix cache: everything before the live turn is wrapped once, validated each frame by a cheap fingerprint fold (so /resume, /rewind, and /clear still invalidate it naturally), and extended append-only when a turn settles — while only the live turn re-wraps per frame. No event handler manages the cache at all anymore, so this class of regression can't come back by forgetting an invalidation site; a cache-vs-fresh equivalence test pins the rendered frames byte-identical across a scripted streaming turn.
Smoothed the paint pipeline — three smaller sources of shimmer/jank, fixed together: every frame's terminal writes (the inline scrollback commit plus the diff) are now bracketed in a DECSET 2026 synchronized update so the terminal paints them atomically instead of mid-write (auto-follow shifts the whole chat region every frame, so unsynchronized paints showed as tearing; terminals without the mode ignore the markers), stdout now goes through a 256 KB BufWriter so a frame reaches the terminal as one write instead of the line-buffered dribble, and inline mode's insert_before now uses ratatui's scrolling-regions feature — inserting history via terminal scroll regions instead of clearing and redrawing the live viewport, which blinked on every committed block. Input latency is tightened in the same pass: a keystroke arriving mid-stream draws immediately instead of waiting behind the 16 ms frame budget, and a frame the budget deferred now paints on the budget's remainder instead of up to 80 ms late when the stream happens to go quiet.
Made tool-call argument parsing tolerate a double-encoded payload — some models (and some providers' streaming layers) stringify the arguments object twice, so arguments arrives as a JSON string whose content is the real object; tomte bounced that back as "arguments must be a JSON object, got string", costing a round-trip. The payload is now unwrapped exactly one level when (and only when) the inner text parses to an object — a bare string, a double-encoded array, and invalid JSON still get the existing self-correct error with the tool's schema hint. Provider-agnostic, so a model added later benefits automatically.
Closed the conscience-check gap that tolerance opened — the Pillar 5 pre-check parsed an edit call's arguments more strictly than the tool phase that executes them: a double-encoded payload (now unwrapped and run) or an alias spelling the tools themselves accept (filePath, oldString, …) made the pre-check silently skip, so an edit could land against a recorded decision without its conscience card. The pre-check now resolves arguments through the same parse and alias fold as execution, so whatever spelling runs is also what the conscience sees.
Made the race judge count what a contestant actually ran — count_risky_commands matched the literal tool name run_shell and the literal command field, but a contestant's model may spell the tool bash/shell (the registry resolves those at execution), double-encode the arguments, or use the cmd alias (the executing agent tolerates all three). Such a contestant ran the risky command without the judge counting it, skewing the deterministic score in its favor. The judge now canonicalises the tool name and parses arguments with the same tolerance as the agent it replays.
Made the Repo Twin fully deterministic — two index nondeterminisms could make "rerun it, get the same card" untrue: a Go package import resolved to whichever .go file a HashSet iteration happened to visit first (so rebuilds could flip import edges and Pulse fan-in counts), and a convention rule that mentioned several seed tokens cited whichever token the set yielded first. Both now pick the lexicographically-smallest candidate, so equal trees build byte-identical twins.
Fixed why-context reporting an absolute-path seed as "missing" — a pasted stack-trace location is usually absolute (C:/repo/src/x.rs:88, /home/me/repo/src/x.rs:88), but the twin stores root-relative paths, so the documented stack-trace use case never resolved. The repo root is now stripped from an absolute seed before matching.
Stopped a closed pipe reading as a crash — tomte twin | head -3 (or any evidence command piped into head/Select-Object -First in a script) made the stdlib's print macros panic with "failed printing to stdout: Broken pipe", exiting 101/-1 for a completely routine shell pattern. A panic whose payload is exactly that pipe-closed abort (Unix Broken pipe, Windows os error 232) now exits 0 — the consumer closing the pipe means it has all the output it wanted — while every other panic still reaches the default hook and aborts loudly. Matters most for tomte prove --json / twin / why-context / why / blame in CI pipelines.
Closed a run_shell destructive-command classifier bypass — an inline-code interpreter hidden behind a command-wrapper (env python3 -c '…', sudo node -e '…', command perl -e '…', nice ruby -e '…', env awk 'BEGIN{system(…)}') slipped past the guard, because the inline-interpreter and process-substitution checks inspected only the first command word of a segment and saw the wrapper, not the interpreter. They now peel known wrappers (env/sudo/xargs/command/exec/nice/nohup/time/…) the same way the pipe-into-interpreter guard already did, so the interpreter behind them is classified — and the same peel covers a shell handed a process substitution behind a wrapper (env bash <(curl …)). Without the fix the payload auto-ran under a matching run_shell(…:*) grant or bypass mode, and on Windows (no OS filesystem sandbox) ran unconfined; a benign wrapped command (env NODE_ENV=prod node server.js, sudo apt-get install nodejs) stays unflagged.
Closed a Windows-specific destructive-command bypass — cmd.exe accepts switch clusters with no separating space (del /s/q dir, rd /s/q dir), but the recursive-delete guard tested for /s as its own whitespace token, so the glued form was never recognized. It now detects the /s switch inside a /-joined single-letter cluster, while still leaving a real path like /usr/s and a non-/s switch (del /q file) unflagged. Matters most on Windows, where the sandbox does not confine the filesystem.
Closed a cross-provider credential clobber on OAuth sign-in — completing a ChatGPT/Codex or Claude login did load_auth → set this provider's tokens → save_auth without the shared REFRESH_LOCK that ensure_fresh holds, so a sibling provider's token refresh landing in that window had its freshly-rotated, single-use refresh token reverted by the login's write-back (bricking it until a manual re-login). Both login completions now take the lock for just the load→save tail — acquired after the browser wait and code exchange, so a pending login never blocks refreshes while the user is still authorizing.
Hardened config.json to owner-only on Windows, matching auth.json and the Unix 0o600 — a literal providers.<id>.api_key in config.json is a real credential, but the Windows write path used a plain std::fs::write under the inherited %APPDATA% ACL while auth.json strips inheritance and grants only the owner. The config save now applies the same icacls owner-only grant (the dir with inheritance, then the file), reusing auth's helpers; Unix and the api_key_env path are unchanged.
Widened the child-process secret-env scrub to broker/cache connection strings that embed inline user:pass@host credentials (REDIS_URL, MONGODB_URI/MONGODB_URL, AMQP_URL, RABBITMQ_URL, CELERY_BROKER_URL), matching the existing DATABASE_URL rule, so a prompt-injected run_shell/hook/MCP command can't read them via env. Matched by specific name (not a blanket _URL/_URI), so non-secret REDIS_HOST/MONGO_HOST siblings still pass through.
Fixed the welcome card's right border misaligning when the workspace path or model id contains wide CJK/emoji characters — the text column measured and trimmed its body by code-point count while the column itself is sized in display columns (the sprite column already used display width), so a wide glyph overran the column and pushed the │ out. It now measures and truncates by display width (unicode-width), keeping the box flush.
Fixed /cost over-estimating Opus 4.5/4.6/4.7/4.8 spend by 3x — the price table charged every Opus id the original $15/$75 per MTok, but Opus 4.5 and later have been published at $5/$25 (verified against the June 2026 model docs). The Opus rate is now version-gated: 4.5+ bills at $5/$25 (cache read $0.50, write $6.25), while Opus 4.1 and older — including the dated bare-major claude-opus-4-20250514, whose version string doesn't parse — keep the original $15/$75.
Fixed /cost (and tomte cost) under-estimating spend for an OpenAI -codex / -chat-latest model variant — the price table carried those variants only for gpt-5.3, so gpt-5.5-codex / gpt-5.4-codex (real ids the model catalog recognizes) fell through to the unknown-model fallback rate, a ~4x under-estimate. They now price at their base family's published rate, completing the existing gpt-5.3 pattern.
Fixed long multi-line pastes still firing off partial messages on Windows — the burst-coalescing drain bailed the instant the input channel was momentarily empty (now_or_never), so a paste the OS delivered in chunks split across loop turns and a stray newline submitted mid-paste. The drain now waits briefly (PASTE_COALESCE_GAP) for the next event, so the whole block lands in the composer as one message. (Windows emits no bracketed-paste event, so this key-burst path is the fix there.)
Fixed pasting an image on Windows — a Win+Shift+S screenshot (or any clipboard bitmap) now attaches via Alt+V. arboard's CF_DIB reader misses Snip & Sketch / screenshot bitmaps, so on Windows tomte reads the clipboard image through PowerShell's System.Windows.Forms.Clipboard::GetImage() then Save(.., Png) (the same mechanism Claude Code uses); macOS and Linux keep arboard's native reader. Alt+V also reports an empty clipboard instead of doing nothing.
Fixed the TUI getting laggier the longer a session ran — the alt-screen renderer re-wrapped and cloned the entire transcript every frame (a full-transcript clone at ~60fps while streaming, plus on idle redraws and every mouse-move). It now materializes only the visible window — O(viewport) per frame instead of O(transcript) — reuses the cached prefix while streaming, and stores the prefix split as an index so the wrapped transcript lives in memory once. Mirrors how Claude Code's Ink <Static> renders history once and only redraws the live region.
Fixed glob/grep hanging for minutes on a machine without ripgrep (common on Windows) — the no-ripgrep fallback walked the whole tree skipping only .git, so every pattern crawled node_modules, target, .next, build (tens of thousands of files). It now honors .gitignore/.ignore via the ignore crate (ripgrep's own engine), so those directories are skipped and search is fast and correct whether or not rg is installed.
Fixed dispatch_agent erroring with "subagent general-purpose not found in any agents directory" — general-purpose (the default subagent_type, and what Claude-Code-trained models pass) is now a built-in that needs no file on disk: all built-in tools, the parent's own system prompt. A fresh install with no agent files no longer fails every dispatch_agent call; a general-purpose.md under any agents root still overrides the built-in.
Fixed dispatch_agent aborting a fan-out with "subagent code-explorer not found" — a subagent_type the model guessed now resolves instead of hard-failing. Three more built-ins ship with no file needed: Explore (read-only code search), Plan (read-only architecture planning), and code-reviewer (read-only review); a loosely-spelled name (code-explorer, explorer, planner, security-reviewer, …) folds onto the closest built-in; a name that still matches nothing falls back to general-purpose with a note telling the model the valid types instead of erroring; and the system prompt now lists the built-in types so the model names a real one up front. A same-named file under any agents root still overrides a built-in.
Closed a run_shell override-gate bypass — a destructive command fused to a shell control operator with no surrounding space (ls;rm -rf /, true&&rm -rf /, cd /tmp&&rm -rf /, dir&del /s …, (rm -rf /)) slipped past every destructive-command guard, because the classifier tokenized on whitespace alone and kept the command word glued inside one token (ls;rm, never recognized as rm). It now splits on ; / & / | / newline / () as well, so the fused command is recognized and still clears the override prompt — without the fix it auto-ran under a run_shell(…:*) grant or --dangerously-skip-permissions, and on Windows (no OS filesystem sandbox) ran unconfined. The same pass also flags an inline-interpreter flag glued to its program (python -c'…', node -e'…'), and routine operator chains (cargo build && cargo test) stay unflagged.
Closed the matching indirect-prompt-injection gap on an MCP server's error result — tools/call fenced a success result in <untrusted-mcp-output> (framework markers neutralized, a forged closing tag broken), but the isError path returned the server's text raw, so a compromised server could smuggle a directive through an error. Both paths now go through the same fence.
Fixed edit_file / multi_edit silently editing the wrong occurrence in a mixed CRLF+LF file — read_file strips \r, so the model's LF old_string can't tell a CRLF region from an LF region, and counting only the verbatim-LF form reported a single match and edited the LF copy even when the CRLF one was meant. Both encodings are now counted, so an ambiguous cross-encoding target trips the uniqueness gate (which asks for more surrounding context) instead of guessing; a single-encoding file is unaffected.
Hardened the decision trail (decisions.jsonl, the cross-model moat) against concurrent-session data loss and crash truncation — two tomte sessions open in the same project share one trail, and reconcile rewrote the whole file from a snapshot taken before it scanned the working tree, so a decision the other session appended in that window was silently clobbered; reconcile now re-reads and re-heals immediately before the rewrite, shrinking that window to the rename itself. Each append is now a single O_APPEND write (so two appenders can't interleave a half-line) and is fsync'd, and the reconcile rewrite flushes its staging file and the directory before/after the atomic rename — bringing the trail up to the same durability bar the session and config writers already hold.
Closed a prompt-exfiltration vector in project config — a cloned repo's .tomte/config.json could set model (or fallback_models) to a built-in-preset prefix like openrouter/… or groq/…, which routes every prompt and all file context to that third-party endpoint using the user's own <PROVIDER>_API_KEY. The providers key was already blocked from a project overlay for exactly this reason, but model reached the same endpoints and wasn't; a project model/fallback_models entry whose prefix resolves to a built-in or user-configured provider is now dropped with a warning (bare ids and native openai//anthropic/ specs are still honored).
Fixed tomte doctor / /doctor falsely reporting a valid MCP server or hook as "not found on PATH" on Windows — its which check only looked for .exe, but npx, prettier, pnpm and friends are .cmd/.bat shims that the runtime spawns fine via PATH×PATHEXT. The check now searches the same PATHEXT set the runtime uses, so it stops flagging a working setup (git.exe and the search tools were never affected).
Fixed /rewind to an earlier turn over-reverting after an /undo — /undo (and the undo_last_edit tool) popped the undo stack without lowering the edits-since counter each checkpoint records, so a later /rewind counted the already-undone edits as still pending and reverted that many entries from the top of the stack, reaching past the chosen checkpoint into edits made before it and rolling those files back too (and the picker's blast-radius preview over-counted the same way). /undo now drops that counter in lockstep — eviction of the capped stack's oldest entry still doesn't, so the count survives eviction as before — so /rewind reverts exactly the edits still live since the chosen turn and the preview shows the true file count. Host-side bookkeeping only; no model- or OS-specific behavior.
Closed the last unserialized writers of auth.json — /logout, /apikey, and the API-key activation in both login flows did a bare load → mutate → save while a token refresh (which holds the shared REFRESH_LOCK across its own load→network→save) could be in flight in the same process; the interleave could revert a freshly-rotated single-use refresh token (bricking that credential on its next refresh) or, mirror-image, the refresh's merge could re-persist OAuth tokens the user had just logged out. All four writers now go through a new locked auth::mutate_auth (same lock as the refreshes), and ensure_fresh now treats "this provider's slot vanished from disk mid-refresh" as a logout — the fresh access token still serves the in-flight turn, but the credential is never written back, so a logout from another process can't be silently resurrected either.
Widened the child-process secret-env scrub to the *_PASS family (DB_PASS, SMTP_PASS, REDIS_PASS, …) — PASSWORD/PASSWD/*_PWD were caught but the equally common _PASS suffix leaked into every spawned shell/hook/MCP child. The required underscore spares BYPASS/COMPASS-style names.
Gave settings.json the same owner-only enforcement auth.json/config.json already get — tomte mcp add --env KEY=VALUE can store a real credential there, but the write path set 0o600 on Unix only; on Windows it now applies the same icacls owner-only grant before the atomic rename. load_auth also repairs a too-broad auth.json ACL on Windows now (once per process — icacls shells out, and the Unix-style cheap mode check has no Windows analogue), covering a credential file restored from a backup with a broad inherited ACL.
Rejected Windows reserved device names (NUL, CON, PRN, AUX, COM1–COM9, LPT1–LPT9, with or without an extension like CON.txt) in the file-tool path resolver — Win32 resolves them to devices regardless of directory, so they slipped past the lexical sandbox check and a write would target the console or the null sink instead of a file in the workspace. Ordinary names that merely start with a reserved word (console.txt, common.rs, com10.txt) still resolve; non-Windows platforms are unchanged (con is a legal filename there).
Flagged string-BUILT PowerShell execution in the destructive-command classifier — a -Command payload is normally plain PowerShell the flattened token scan reads, but [scriptblock]::Create(…), a FromBase64String decode, or [char]-array -join assembly composes the real command at runtime where no token rule can see it. When PowerShell is invoked anywhere in the command those builders now classify as inline code (an approval prompt, never an auto-run); naming them in another program's arguments (grep -r frombase64string src/) stays unflagged. Closes the gap next to the already-covered -EncodedCommand.
Named persisted run_shell allow-rules after the program that will actually match — the "Allow <prog> in this project" label took the raw first word of the command, so a quoted or path-qualified spelling ("git" status, /usr/bin/git pull) persisted a rule labeled "git"/git inconsistently with the matcher's own quote-stripping/basename normalization. The label now runs through the same program_name the matcher uses, so what the user agrees to is exactly what later auto-runs.
Fixed the approval and conscience modals clipping their own options off-screen — both counted logical lines while Wrap produced more visual rows, so a long args preview or decision text pushed the Allow/Deny (or Abort/Supersede) rows and the key hint below the frame, leaving the user staring at a prompt with no visible choices. The two cards now share one modal engine: every row is pre-truncated to the popup's width (span-aware, display-column based), the popup is sized to its real row count, and when the terminal is too short the context rows are trimmed first — the options and the key hint always win. The conscience card's decision text and conflict reason wrap (capped at 3 and 2 rows) instead of being cut to one line, since they are the substance of that choice.
Fixed wide CJK/emoji text overrunning its row in the fleet panel, the todo panel, the queued-message list, and the Bash(…) header — those truncations counted characters while the rows are budgeted in display columns, so a wide glyph cost two columns but was counted as one and the text spilled past its slot (the same class of bug the welcome card fix covered). All four now truncate by display width via the shared unicode-width helper.
Aligned the last two stray text colors onto the calm palette — the markdown table header and the /context headline each carried their own near-white RGB literal (235,235,240 / 230,230,235), one and two notches off TEXT_BRIGHT (231,231,231). Both now use the palette constant, so "bright ink" is one color everywhere; the documented exceptions (per-provider auth dots, /context category swatches, the buddy sprite) are untouched. The inline-code amber moved into the palette as a documented exception (INLINE_CODE/INLINE_CODE_BG) instead of loose literals in the markdown renderer.
Made the status line degrade gracefully on a narrow terminal — the right side (auth dot · model · effort · context gauge · cwd) used to clip mid-text when it outgrew the row; it now drops segments from the tail in priority order (cwd first, then the gauge, then effort), so the auth dot and the model name always survive.
Budgeted the fleet row so a long sub-agent prompt can no longer push the live metrics (· activity · tokens · elapsed) off the right edge — the metrics tail is the signal, so it keeps its full width and the prompt absorbs the squeeze with an ellipsis.
Truncated picker rows to the popup's inner width — a long session title or description used to clip dead at the border; the title is never cut below its own length and the description absorbs the squeeze with an ellipsis. The generic-tool header's argument preview (compact_args) also cuts by display width now, completing the chars→columns sweep.
Fixed /undo (and the undo_last_edit tool) refusing the second of two stacked edits to the same file — restoring the first edit rewrote the file with a fresh mtime, so the next undo entry's staleness guard read tomte's own restore as an outside change and bailed with "file has been modified since the edit." After a restore the next same-file entry's (mtime, size) snapshot is refreshed to what was just written, so a stacked edit unwinds all the way back one step at a time; a genuine outside edit still moves the mtime and is still refused, so the anti-clobber guard is intact. (Cross-platform: mtime resolution differs per OS, but the refresh keys off the file tomte itself just wrote.)

Tests

Locked Markdown table border alignment with a regression test — an audit of the wide-CJK/long-word column-shrink + cell-wrap path found it already keeps every rendered row at one display width (md_cell_width measures display width, the shrink floor stays at 3, and textwrap breaks CJK and long words), so the box borders can't drift; the test guards that against a future change. No behavior change.
Added regression tests for the OAuth PKCE primitive (auth/pkce.rs, previously untested) — locking that the code verifier and challenge are unpadded base64url within RFC 7636's length bound, the challenge is exactly the S256 transform (base64url(SHA-256(verifier))), and the verifier/CSRF-state nonces differ per draw. No behavior change; guards the OAuth security primitive against a future refactor.
Added regression tests locking two control-flow-critical argument-alias maps: the goal_update status family (every blocked / in_progress / complete spelling, so a dropped alias can't silently strand an active /goal) and the dispatch_agent spawn-mode family (read-only spellings confine a sub-agent to plan mode, edit/write spellings don't). No behavior change.
Added a security regression test for the todo <system-reminder> sanitizer (safe_system_reminder_text, previously untested) — model-controlled todo text injected into the per-turn reminder has its < / > / & escaped (so it can't forge or break out of the reminder block), control characters collapsed to spaces, and length capped by character count; the test locks all three. No behavior change.
Added unit tests for the OpenAI strict-schema transform (tools/schema.rs::strict_parameters_schema, previously untested) — locking that an optional property becomes nullable AND is added to required, additionalProperties is forced false, nested objects and array items are strictened recursively, and an optional enum gains a null variant. A regression here would make OpenAI reject every tool call with a 400. No behavior change.
Added unit tests for the OpenAI Responses stream value extractors (openai/stream/value.rs, previously only exercised end-to-end) — locking that a tool-argument fragment keeps a real value verbatim (including a streamed bare null) while dropping the empty null / "" / {} / [] placeholders, and that the text extractor flattens part arrays and picks the text / content / delta keys. Guards the streamed-tool-call contract prior fixes established. No behavior change.
Added unit tests for the per-turn todo <system-reminder> builder (agent/todo_reminder.rs::todo_reminder_text, previously untested) — locking that an empty list yields no reminder, each status renders (only in-progress carries its active: form), the list is capped with an N more todo(s) omitted summary, and model-controlled todo text is escaped so it can't forge or break out of the reminder block. No behavior change.
Added a unit test for the model-facing todo-status alias parser (TodoStatus::parse, previously untested) — locking every pending / in_progress / completed spelling (incl. trim, case, and -/space normalization) and that an unknown value is rejected rather than guessed. No behavior change.
Added CLI-crate unit tests for the /export and /commit helpers (tui/app/prompts.rs, previously untested) — locking the safe-fence invariant (markdown_fence_for always opens a fence longer than the longest backtick run inside the content, so an embedded code block can't close the export early), that /commit and /commit-push-pr carry the full git-safety protocol plus any user extra, and the all-todos-completed predicate. No behavior change.
Added unit tests for the slash-command parser and message-token estimator (tui/app/slash.rs, previously untested) — locking that split_slash_command splits head/arg on the first whitespace, trims, and slices on a char boundary (a multibyte arg like compact 中文 can't panic it), and that estimate_messages_tokens is the chars/4 estimate that skips the Welcome/Rich widgets. No behavior change.
Extended the composer TextInput tests to the previously-untested cursor-navigation primitives (tui/input.rs) — backspace (removes a full multibyte char, no-op at start), delete_word_left, move_left/move_right (whole-char steps), move_up/move_down (preserve the display column across lines), and move_home/move_end. These are the byte-vs-char-boundary slicing paths most prone to a panic; the tests lock them. No behavior change.
Added unit tests for the assistant-block invariant helpers and screen/cwd resolution (tui/app/blocks.rs, tui/app/helpers.rs, previously untested) — locking that finish_open_assistant_block / rotate_assistant_block close open blocks and drop empty ones (no stale empty stanza, exactly one fresh open block), last_assistant_mut_open returns only an open block, initial_screen routes to login only when fully unauthenticated, and resolve_cwd_arg accepts only an existing directory. No behavior change.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.0.3

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

0.0.3

Added

Changed

Fixed

Tests

Uh oh!