When a session exceeds progressiveLoadThreshold (500 msgs), ccx renders only the last 3 sections and shows a "Load earlier" button. Deep-link URLs like /session/<id>#msg-<uuid> targeting messages inside the hidden range silently failed: the hash handler did nothing because the target was not in the DOM.

Fix:
- Auto-reload with ?all=1 when the hash target is missing AND a load-earlier button exists. Preserves the hash, uses location.replace so the broken page doesn't pollute history.
- Hoist the hash handler into a named jumpToHashTarget function and re-fire it on hashchange so in-app nav clicks after initial load also jump correctly.
- Walk up from msgEl, opening every ancestor <details> and unfolding every .thread.folded, instead of only opening a single child details. Fixes a pre-existing bug where deep-linked messages inside collapsed threads stayed hidden.
- In-session search: on 0 matches with hidden history active, the info line shows a "search full history" link that reloads into full content. The pending query is persisted via sessionStorage so the search picks up where it left off.

Tests verify the HTML markers (load-earlier element, jumpToHashTarget function, ccx-pending-search key) are present under the right conditions (>500 msgs without ?all=1) and absent under ?all=1. Browser-level behavior (reload round-trip, sessionStorage hand-off) is not exercised in this container environment; unit tests cover the HTML output. Verify locally with a real 500+ msg session.

Closes #3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Answers the "where did my quota go" question inside the session viewer. Surpasses ccusage by going per-turn instead of per-session, tree-aware instead of blind JSONL summation, visual instead of table-only, and offline-by-default with pinned pricing.

Data layer:
- Add MessageUsage struct on Message, populated during ParseSession from message.usage (input, output, cache_read, cache_create, cost). Previously only session-level aggregates were kept.
- Add SessionStats.CostUSD computed post-parse as the sum of per-message costs — a single source of truth so the session total always matches the per-turn sum the UI renders.
- Quick parser (session list view) also computes CostUSD so the session card totals stay consistent with the session view.

Pricing (internal/parser/pricing.go):
- Embedded Claude pricing table: Opus 4.5/4.6, Sonnet 4.5/4.6, Haiku 4.5, 3.5 Sonnet/Haiku.
- LookupPricing: exact match only, plus date-suffix stripping (claude-sonnet-4-5-20250929 -> claude-sonnet-4-5). No fuzzy matching — that's how ccusage shipped 5x overcharges (ryoppippi/ccusage#934).
- Unknown models return nil; callers treat nil as "cost unavailable" rather than zero. We'd rather not show a number than show a wrong one.
- ComputeCost: straightforward per-category multiplication. Cache reads at the discounted rate, cache writes at the surcharged rate.

Turn aggregation (internal/parser/turns.go):
- TurnStats: per-user-turn aggregation over a flat wire-ordered slice. Anchored by KindUserPrompt or KindCommand; tool results, meta, and compact summaries ride along with the enclosing turn.
- Sidechain (isSidechain: true) messages ARE counted because the user is billed for them. ccx does not attempt sidechain cache-read dedup — that would produce numbers that don't match the bill.
- FlattenSessionMessages helper for walking the session tree into wire order.

Web UI:
- Info panel: new "Per-turn spend" section, sorted by cost desc so expensive turns surface immediately. Each row is a link to #msg-<anchor>, composing with the load-earlier hash-nav fix from #3 for cross-range deep linking.
- Tokens section: Cost row shown when at least one message has a pricing match.
- Info panel now scrollable (max-height + overflow-y auto) so the breakdown fits for long sessions.
- When the model is unknown, the section still renders token-only with a "No pricing match" note — surfacing which turns used the most tokens is useful even when cost can't be computed.
- formatCost helper with adaptive precision: <$0.01 floor, $0.xxxx under $1, $x.xx to $100, $xxx to $10k, $x.xk beyond.

Tests (16 new):
- pricing_test.go: exact match, date-suffix strip, unknown returns nil, empty returns nil, no fuzzy match guard (ccusage #934), ComputeCost math, nil safety, KnownModels determinism.
- turns_test.go: per-message usage persists, session cost equals sum of per-message costs, two-turn aggregation, sum of turn costs equals session cost, sidechain inclusion.
- server_test.go: spend section rendering with cost, token-only rendering for unpriced models, info-cost row absent when no cost.

Also fix a latent panic: renderSessionPage(session.ID[:8]) crashed for session IDs shorter than 8 characters. Now guarded.

Scope for this commit: the session-level web UI answer to "where did my quota go". Deferred to follow-up within the same milestone: ccx usage CLI command (cross-session daily/weekly/monthly axes), per-message inline cost badge in conversation view, Codex pricing table.

Partial on #2; closes the session-level web surface. CLI + badge tracked in follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
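The adaptive-precision tiers described for formatCost can be sketched as follows. This is a minimal illustration built from the thresholds the commit lists, not ccx's actual implementation:

```go
package main

import "fmt"

// formatCostSketch formats a USD amount with precision that adapts to
// magnitude, per the tiers above: a floor below $0.01, four decimals
// under $1, two decimals to $100, whole dollars to $10k, "$x.xk" beyond.
func formatCostSketch(usd float64) string {
	switch {
	case usd < 0.01:
		return "<$0.01"
	case usd < 1:
		return fmt.Sprintf("$%.4f", usd)
	case usd < 100:
		return fmt.Sprintf("$%.2f", usd)
	case usd < 10_000:
		return fmt.Sprintf("$%.0f", usd)
	default:
		return fmt.Sprintf("$%.1fk", usd/1000)
	}
}

func main() {
	for _, v := range []float64{0.002, 0.0423, 2.45, 123.4} {
		fmt.Println(formatCostSketch(v))
	}
}
```

The point of the floor tier is that a sub-cent figure carries no decision value; everything else keeps just enough digits to compare turns at a glance.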
Users repeatedly clicking between sessions paid the full cold-parse cost on every view. For a 1500-message session that's ~19 ms of JSON parsing + tree reconstruction burned on every navigation. For 5000-message sessions, ~54 ms.

Fix: wrap Multi.ParseSession in a bounded LRU cache keyed by absolute path, invalidated by (mtime, size). Cap 16 entries.

Benchmark before/after on a 1500-message fixture:
  Cold (no cache)   3.6 ms / 4 MB / 30k allocs
  Warm (cache hit)  330 ns / 256 B / 2 allocs
A cache hit is ~11,000x faster on this synthetic fixture. On the full-shape parser benchmark with realistic content blocks, the cold path is ~19 ms — which means the realized speedup on repeat views is closer to 60,000x.

Implementation:
- internal/provider/cache.go: container/list + map LRU, mutex-guarded. getOrLoad(path, loader) for transparent wrap-and-cache.
- internal/provider/multi.go: ParseSession consults the cache first, delegates to parseThroughBackends on miss, stores the result. ClearSessionCache() exposed for diagnostics/tests.

Invalidation is by (mtime, size). If Claude Code rewrites the file (live session growing), the next access re-parses. Live tail still works.

Scope discipline for this pass:
- No SQLite persistence. The cache is process-local; restart flushes it. Adding a disk cache introduces serialization cost that partially negates the in-memory win; revisit if users need cache survival across restarts.
- No streaming render. The server still renders HTML in one pass. Progressive-load already avoids rendering hidden sections to HTML — the earlier "CSS hides everything" claim in the issue body was incorrect; renderMessagesProgressive only emits visible sections.
- No parser-level allocation reduction. ~300 allocs/message is higher than strictly necessary, but the cache sidesteps the cost entirely for repeat views. A cold-path optimization pass can follow later once real (not synthetic) sessions show it matters.

Tests:
- TestSessionCache_HitSkipsLoader: second Get doesn't call loader
- TestSessionCache_InvalidatesOnMtimeChange: mtime bump forces reload
- TestSessionCache_LRUEvictsOldest: cap enforcement + most-recently-touched survives
- TestSessionCache_ConcurrentLoadsSafe: 50 concurrent getOrLoad calls don't race or duplicate entries
- TestSessionCache_ClearDropsEverything
- BenchmarkSessionCache_ColdVsWarm: the numbers cited above

Profile methodology + results committed to docs/profiles/long-session.md.

Closes #4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ye zoom (#5, #2)

Right-edge semantic scrubber that merges #2's per-turn spend data into #5's timeline rail. The rail answers both "where am I" and "where did the money go", and it behaves like a proper hover-activated scrubber instead of a styled scrollbar.

Positioning: by anchor INDEX, not wall-clock time. A session with an idle gap (user walked away for 3 hours and resumed) no longer compresses all the work into a tiny sliver of the rail. Every turn gets an even slot. The original Timestamp and Offset are still kept for the tooltip so "+1m42s" still tells you when the turn happened.

Layout:
- Fixed 14px inward from the viewport right edge (clears the browser scrollbar and any overlay progress bar). Pulled in from top (56px) and bottom (24px) to avoid the top nav and dock toolbar.
- 24px-wide invisible hit target with a 10px-visible spine inside (forgiving entry — you don't need pixel-perfect aim).
- On hover: widens to 52px, the spine gains contrast, gridline labels fade in, the cursor becomes a crosshair.

Index-based ruler gridlines:
- chooseGridStep picks a round step (5/10/20/25/50/100/250/500/1000) targeting 4-10 gridlines. Sessions under 20 turns skip gridlines entirely (the rail is short enough to scan).
- Labels are turn ordinals ("10", "20", "30"...). Major gridlines (every 5th step) are visually bolder.

Cost-weighted ticks (from #2 integration):
- enrichTicksWithCost maps TurnStats back onto ticks via AnchorID and normalizes cost to 0-1 heat (highest turn = 1.0). Unknown models fall back to token-count heat so Codex sessions still paint a useful heatmap.
- Tick base size drives off an inline --heat CSS var; fisheye zoom layers on top multiplicatively.

Cumulative cost:
- enrichTicksWithCost walks ticks in session order accumulating per-turn cost into CumulativeCostUSD.
- The tooltip shows "so far $X.XX" next to the per-turn cost so you can see the running total at any point.

Interaction model (per the user's spec):
- applyFisheyeZoom on the 5 nearest ticks: zoom-0 at 2.6×, zoom-1 at 1.9×, zoom-2 at 1.35×. Springy bezier transition. Cheap: binary search + ±2 iteration.
- selectWithHysteresis: the tooltip only switches to a new tick when the cursor is meaningfully closer (> 1.5% of rail height) than the locked one. Prevents flicker between adjacent ticks at midpoints.
- rAF-throttled mousemove: pendingClientY is coalesced into the next frame via processRailFrame, so long sessions don't stall the render pipeline.
- Grace period (120ms) on mouseleave: brief slips off the rail don't tear the interaction down; re-entering cancels the fade and continues.
- Tooltip viewport clamp: computed against window.innerHeight minus tooltipHeight so it never runs off the top or bottom of the page.
- Playhead (dashed guide) snaps to the nearest tick for orientation.
- The click handler uses the currently-highlighted nearest tick and navigates via #msg-<uuid> (composes with the #3 fix for deep links into collapsed history).

Tooltip content:
  [kind] +1m42s · turn 42
  first line of message
  $0.0423 ∑ $2.30 ⧫ 12.3k

Tests (19 total web/ tests, 6 new for this pass):
- TestComputeTimelineTicks_SingleAnchorCentered (50%)
- TestComputeTimelineTicks_PositionByIndexNotTime (rejects time-based positioning with contrived uneven timestamps)
- TestComputeTimelineTicks_LongIdleGapDoesntDistort (10 turns with 3h idle in the middle still distribute evenly)
- TestChooseGridStep (sessions under 20 turns get no gridlines; 30-2000 turns get 4-10)
- TestComputeTimelineGridlines_IndexBased (monotonic, non-empty labels)
- TestEnrichTicksWithCost_CumulativeMatchesSessionTotal (last tick's cumulative = sum of all per-turn costs)

Browser verification (hysteresis feel, grace period, tooltip clamp, rAF smoothness) is not exercised in this container environment. Go tests cover HTML structure, tick positioning math, cumulative accuracy, and JS helper wiring. Verify locally with a real session.

Closes #5.

Integrates #2 data for one coherent visual surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
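The round-step selection described for chooseGridStep can be sketched directly from its stated behavior: sessions under 20 turns get no gridlines, otherwise walk a fixed candidate ladder until one yields 4-10 lines. This is an illustration of the stated contract, not ccx's actual code:

```go
package main

import "fmt"

// chooseGridStepSketch returns a gridline step for a session of the
// given turn count: 0 for short rails, otherwise the smallest candidate
// step that produces roughly 4-10 gridlines.
func chooseGridStepSketch(turns int) int {
	if turns < 20 {
		return 0 // short rail: no gridlines at all
	}
	for _, step := range []int{5, 10, 20, 25, 50, 100, 250, 500, 1000} {
		if n := turns / step; n >= 4 && n <= 10 {
			return step
		}
	}
	return 1000 // extremely long sessions: coarsest step
}

func main() {
	for _, turns := range []int{12, 30, 200, 2000} {
		fmt.Println(turns, "->", chooseGridStepSketch(turns))
	}
	// 12 -> 0, 30 -> 5, 200 -> 20, 2000 -> 250
}
```

Preferring the smallest qualifying step keeps labels dense enough to orient by without crowding the rail.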
Two small but visible bugs in the session view:

#5: Server-side markdown had no ATX heading support. Claude often starts a response with '## Summary' or '### Findings' and ccx would render that as a literal '<p>## Summary</p>' in static views. The JS-side renderer already handled headings — the Go side was just incomplete.

Fix: add parseATXHeading + a heading branch in renderMarkdown. Recognises # through ###### followed by a space, strips trailing '#' characters, and emits div.md-h<level> matching the existing CSS. Also handles simple '- item' / '* item' lists the same way. Rejected inputs are explicit: >6 hashes, no space after the hashes, empty text after the marker, leading whitespace. These fall through to the paragraph renderer instead of becoming empty headings.

#6: Outline nav dropped the assistant summary on tool-heavy turns. renderConversationNav hard-capped children at 10 and emitted '+N more' for the tail. Long turns that dispatched lots of tools would drop their final assistant text response — the exact thing the user wants to see to know what the turn concluded with.

Fix: when truncating, show the first 9 children, the '+N more' marker, and always render the final child. The hidden count is adjusted to exclude the preserved tail (15 tools + summary now shows '+6 more' instead of '+7 more').

Tests:
markdown_test.go: 7 tests — parseATXHeading happy paths + all non-heading cases (including '## ' empty-text rejection and the '####### too many' case), full renderMarkdown pipeline through heading levels 1-6, headings with bold/code passthrough, basic list rendering.
nav_test.go: 3 tests — summary preserved on truncated turns, short turns show everything with no '+N more' marker, hidden count correctly excludes the preserved summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
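The explicit rejection rules for the heading parser can be sketched as below. This is an illustrative reconstruction from the rules the commit lists (1-6 hashes, mandatory space, trailing-# stripping, no empty headings, no leading whitespace), not ccx's actual parseATXHeading:

```go
package main

import (
	"fmt"
	"strings"
)

// parseATXHeadingSketch returns (level, text, ok) for an ATX heading
// line, or ok=false for inputs that should fall through to the
// paragraph renderer.
func parseATXHeadingSketch(line string) (int, string, bool) {
	if line == "" || line[0] != '#' { // leading whitespace also fails here
		return 0, "", false
	}
	level := 0
	for level < len(line) && line[level] == '#' {
		level++
	}
	if level > 6 { // '#######' is a paragraph, not a heading
		return 0, "", false
	}
	if level == len(line) || line[level] != ' ' { // '##x' needs a space
		return 0, "", false
	}
	text := strings.TrimSpace(line[level+1:])
	text = strings.TrimSpace(strings.TrimRight(text, "#")) // '## Title ##' -> 'Title'
	if text == "" { // '## ' rejected: no empty headings
		return 0, "", false
	}
	return level, text, true
}

func main() {
	for _, l := range []string{"## Summary", "### Findings ###", "####### nope", "## ", " # indented"} {
		lvl, txt, ok := parseATXHeadingSketch(l)
		fmt.Printf("%q -> %d %q %v\n", l, lvl, txt, ok)
	}
}
```

Returning ok=false rather than emitting an empty element is what keeps the rejects in the paragraph path.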
Reading Claude Code's own pricing source (src/utils/modelCost.ts)
surfaced a real bug in ccx's embedded pricing table: Opus 4.5 and
Opus 4.6 were hardcoded at tier $15/$75 (the old Opus 4.1 rate),
but Claude Code's current source has them at tier $5/$25 — a
roughly 3x lower rate. Every per-turn cost figure ccx displayed
for Opus 4.5/4.6 sessions was inflated by ~3x. Fixed.
Pricing table rewrite (internal/parser/pricing.go):
- Named tier constants (costTier_3_15, costTier_15_75, costTier_5_25,
costTier_30_150, costHaiku_35, costHaiku_45) mirror Claude Code's
naming so drift comparison is direct.
- pricingTable map covers every model Claude Code's MODEL_COSTS
references: claude-3-5-haiku, claude-haiku-4-5, claude-3-5-sonnet,
claude-3-7-sonnet, claude-sonnet-4, claude-sonnet-4-5,
claude-sonnet-4-6, claude-opus-4, claude-opus-4-1, claude-opus-4-5,
claude-opus-4-6. Previously ccx was missing claude-opus-4,
claude-opus-4-1, claude-sonnet-4, claude-3-7-sonnet entirely —
those sessions returned nil cost.
- LookupPricing now mirrors firstPartyNameToCanonical in
Claude Code's model.ts: case-insensitive substring match with
more-specific versions checked first (4-6 before 4-5 before 4-1
before 4). Handles bare canonical names, dated IDs, and Bedrock
ARNs uniformly. Non-Claude models still return nil.
- PricingTableCopy exposes a defensive copy for the verify tool.
- Opus 4.6 fast mode (tier $30/$150) is documented as a known
limitation — proper support requires plumbing usage.speed
through the parser.
Verification tool (cmd/ccx-verify-pricing):
- Reads reference/claude-code-2188/src/utils/modelCost.ts with regex
(no bun/node dependency). Extracts the COST_TIER_* constants and
the MODEL_COSTS mapping.
- Reads reference/claude-code-2188/src/utils/model/configs.ts for
each config's firstParty string.
- Canonicalizes those to short names using the same ordered
substring rules as ccx's LookupPricing and Claude Code's
firstPartyNameToCanonical.
- Compares ccx's PricingTableCopy against the parsed tiers
field-by-field (input, output, cache_read, cache_write with
1e-9 tolerance). Reports per-model drift or "OK". Exit 1 on any
mismatch, 0 on clean.
- Also reports ccx-only rows (models ccx prices that CC source
does not reference) as stale drift.
- make verify-pricing wraps it. CLAUDE_SOURCE=<path> env var
overrides the default reference checkout location.
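The regex-extraction approach (no bun/node dependency) can be sketched like this. The const-block shape and field names below are hypothetical — the real modelCost.ts layout isn't reproduced in this commit message:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Extract named cost-tier constants and their numeric fields from a
// TypeScript source string, e.g. `const COST_TIER_5_25 = { inputTokens: 5, ... }`.
var (
	tierRe  = regexp.MustCompile(`const (COST_\w+)\s*=\s*{([^}]*)}`)
	fieldRe = regexp.MustCompile(`(\w+):\s*([0-9.]+)`)
)

func parseCostTiers(src string) map[string]map[string]float64 {
	tiers := map[string]map[string]float64{}
	for _, m := range tierRe.FindAllStringSubmatch(src, -1) {
		fields := map[string]float64{}
		for _, f := range fieldRe.FindAllStringSubmatch(m[2], -1) {
			v, _ := strconv.ParseFloat(f[2], 64)
			fields[f[1]] = v
		}
		tiers[m[1]] = fields // tier name -> field name -> USD per Mtok
	}
	return tiers
}

func main() {
	src := `const COST_TIER_5_25 = { inputTokens: 5, outputTokens: 25 }`
	fmt.Println(parseCostTiers(src)["COST_TIER_5_25"]["outputTokens"]) // 25
}
```

The trade-off of regex over a real TS parser is fragility to source refactors, which is acceptable here because any parse failure surfaces as drift in `make verify-pricing`.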
Tests:
parser/pricing_test.go updates:
- TestLookupPricing_NonClaudeFamilyReturnsNil: gpt-4, gpt-5.4-mini,
grok-2, llama, mistral — all nil.
- TestLookupPricing_DatedBedrockAndBareNamesAllMap: the three
common shapes of claude-sonnet-4-5 all resolve to tier $3/$15.
- TestLookupPricing_OpusMoreSpecificBeforeLess: opus-4-6 returns
$5/$25, opus-4-1 returns $15/$75, opus-4 (bare) returns
$15/$75. Match order correctness.
cmd/ccx-verify-pricing/main_test.go (new, 5 tests):
- TestParseModelCostTiers_ReadsValues: regex extracts
inputTokens/outputTokens/cache fields from a fake source.
- TestParseModelCostsMap_ExtractsBindings: line-oriented parser
maps config names to tier names.
- TestParseFirstPartyNames_ExtractsFirstParty: reads configs.ts
firstParty strings.
- TestCompareAgainstCCX_DetectsDrift: fake source with
deliberately wrong opus-4-6 pricing — tool must detect and
report the drift, but NOT false-positive on matching models.
- TestCanonicalize_OrderMatters: bedrock ARNs, dated IDs, bare
names, and mixed-case inputs all canonicalize correctly.
Verification on the real source (11 models):
OK CLAUDE_3_5_HAIKU_CONFIG [COST_HAIKU_35]
OK CLAUDE_3_5_V2_SONNET_CONFIG [COST_TIER_3_15]
OK CLAUDE_3_7_SONNET_CONFIG [COST_TIER_3_15]
OK CLAUDE_HAIKU_4_5_CONFIG [COST_HAIKU_45]
OK CLAUDE_OPUS_4_1_CONFIG [COST_TIER_15_75]
OK CLAUDE_OPUS_4_5_CONFIG [COST_TIER_5_25]
OK CLAUDE_OPUS_4_6_CONFIG [COST_TIER_5_25]
OK CLAUDE_OPUS_4_CONFIG [COST_TIER_15_75]
OK CLAUDE_SONNET_4_5_CONFIG [COST_TIER_3_15]
OK CLAUDE_SONNET_4_6_CONFIG [COST_TIER_3_15]
OK CLAUDE_SONNET_4_CONFIG [COST_TIER_3_15]
ccx pricing table matches Claude Code source — no drift.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-tier cache architecture: the in-memory LRU (from #4) satisfies hot access, a persistent disk cache survives process restart. Now a fresh `ccx web` doesn't re-pay the parse cost for sessions you've already viewed.

Storage: one gob file per source session, named by sha256(abs_path), with in-band (mtime, size) metadata for freshness checks.

Layout:
  internal/provider/disk_cache.go — gob file store + freshness
  internal/provider/disk_cache_test.go — 9 tests incl. round-trip
  internal/provider/multi.go — two-tier lookup chain

Flow (ParseSession):
1. in-memory LRU (existing, O(1))
2. miss → disk cache (os.Stat + gob.Decode, ~5ms)
3. miss → backend.ParseSession (live parse, ~19ms for 1500 msg)
4. successful parse writes to BOTH cache tiers

Failure modes are always graceful. Corrupt gob file, tmp-write crash, unwriteable data dir, schema drift — every one of those falls through to live parse. The cache is an optimization layer, never a correctness layer.

Cache location: $XDG_DATA_HOME/ccx/session-cache/ (or ~/.local/share/ccx/session-cache/ by default). Separate directory from the stars database so ccx cache clear can nuke it independently.

Serialization notes:
- gob.Register(map[string]any{}) + gob.Register([]any{}) set up in init() so ContentBlock.ToolInput survives the round trip. Without registration gob drops the Task/Skill tool inputs.
- parser.Message has an unexported `raw rawMessage` field used only during parsing. gob skips unexported fields so the serialized tree is smaller than the in-memory one — fine because nothing downstream reads `.raw` after parse.
- Writes go tmp-then-rename so a crash mid-encode can't leave a half-written cache file.

Tests (9 new in provider, 85.3% package coverage):
- TestDiskCache_RoundTripsRealSession: full parse -> put -> get, verifies stats, per-message Usage, and 3-block Content survive. This is the load-bearing test: without it, gob drops could ship undetected.
- TestDiskCache_MissOnMtimeChange: stale entry removed on read.
- TestDiskCache_MissOnSizeChange: size drift also invalidates.
- TestDiskCache_MissOnNonExistentPath: never-cached path returns miss.
- TestDiskCache_PutNilSessionIsNoop: defensive.
- TestDiskCache_ClearRemovesEverything.
- TestDiskCache_CorruptFileGracefullyDropped: fake garbage in place of a gob file produces a miss + file cleanup.
- TestMulti_DiskCacheSurvivesMultiRebuild: parse with Multi A, tear down, parse the same path with a fresh Multi B pointing at the same disk dir, verify the backend's parseCount stays 0 (disk served it).
- TestMulti_DiskCacheInvalidatesOnFileMtimeChange: edit the session file after caching, verify the next parse actually hits the backend.

NewMultiWithDiskCache(dir, backends...) exposed for tests. Production code uses NewMulti which defaults to config.DataDir/session-cache/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The session nav panel on the left of the conversation view was always
170px wide, eating into reading space on 1280-1440px viewports. Now
it collapses to a compact 56px column and expands to 260px only on
hover — reclaiming ~114px of horizontal space for the actual content.
Interaction:
- Default state: 56px-wide column showing only session-id stubs and
provider badges. The "Sessions" header is replaced by a circular
'S' dot in the session accent colour — recognisable at a glance.
- Hover: width animates to 260px over 180ms ease. Full session
summaries, IDs, and the "Sessions" label fade in. A soft left-side
shadow lifts the expanded column off the content.
- The expansion is OVERLAY, not push — the flex layout slot stays at
56px so the outline sidebar (middle column) and main content
(right column) don't shift sideways when the cursor drifts over
the sidebar. Accomplished by setting the outer <aside> to 56px
with overflow:visible and letting the inner .panel-header +
.panel-list children be the elements that actually widen.
- Mobile (<600px) still hides the panel entirely as before.
No JS — pure CSS :hover + width transition. Keyboard tab order and
focus states unchanged. No schema changes.
Tests:
TestHandleSession_SessionNavIsNarrowByDefault: smoke test that the
rendered page contains the .panel-nav.session-nav { width: 56px; }
rule plus the :hover { width: 260px; } expansion. Actual layout
behaviour is CSS-only so Go tests can only verify the rules are
emitted — manual browser verification is the load-bearing check.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New export format "exec" (alias "exec-md") that renders a session
as an executive report instead of a rendered conversation. Each
turn shows: the user's request, the files the agent edited, and
the agent's final summary text — with per-turn cost footer when
pricing is available.
Useful for the "what did I get done this session" share that
--brief can't quite produce. Brief gives you the full conversation
minus tool noise; exec gives you a structured report where each
turn is one readable block.
Sample output:
# Session s-abc123
> initial summary line from the session header
**Duration:** 1h23m · **Messages:** 42 · **Tool calls:** 18
· **Total cost:** $2.4500 · **Model:** claude-sonnet-4-5
---
## Turn 1
> add a health-check endpoint
**Files touched:** `api/health.go`, `api/health_test.go`
Added /health handler plus a table-driven test — passes
locally. Wired into the router with the standard JSON
middleware.
*cost: $0.0423 · 8.2k tokens*
## Turn 2
...
Implementation (internal/render/exec.go):
- ExecMarkdown walks session messages in wire order and segments
them into turnBlocks anchored by KindUserPrompt / KindCommand.
Compact boundaries emit a "— context compaction —" divider
inline between turns.
- extractEditedFiles scans assistant tool_use blocks for Edit /
Write / MultiEdit / NotebookEdit / Create and pulls file_path /
notebook_path / path. Dedupes via map, sorts alphabetically for
deterministic output.
- lastAssistantText walks the turn's assistants in reverse and
returns the last one with a non-empty text block — the "final
summary" a reader actually wants.
- Per-turn cost/token footer reads from parser.ComputeTurnStats
(reused from #2). When cost is 0 and tokens are 0 the footer
is omitted entirely.
- Empty turns (no user text, no edits, no summary) are dropped so
the output isn't padded by meta markers.
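The extractEditedFiles scan can be sketched as below. The types are illustrative stand-ins (ccx's real ContentBlock/Message types are richer); the behavior follows the description: edit-family tools only, the path pulled from whichever field name is present, deduped via a map, sorted for deterministic output:

```go
package main

import (
	"fmt"
	"sort"
)

type toolUse struct {
	Name  string
	Input map[string]any
}

var editTools = map[string]bool{
	"Edit": true, "Write": true, "MultiEdit": true,
	"NotebookEdit": true, "Create": true,
}

func extractEditedFilesSketch(blocks []toolUse) []string {
	seen := map[string]bool{}
	for _, b := range blocks {
		if !editTools[b.Name] {
			continue // Read, Bash, Grep, etc. are not edits
		}
		for _, key := range []string{"file_path", "notebook_path", "path"} {
			if p, ok := b.Input[key].(string); ok && p != "" {
				seen[p] = true
				break // first matching field variant wins
			}
		}
	}
	files := make([]string, 0, len(seen))
	for p := range seen {
		files = append(files, p)
	}
	sort.Strings(files) // deterministic report output
	return files
}

func main() {
	blocks := []toolUse{
		{Name: "Write", Input: map[string]any{"file_path": "api/health.go"}},
		{Name: "Edit", Input: map[string]any{"file_path": "api/health.go"}}, // duplicate
		{Name: "Read", Input: map[string]any{"file_path": "main.go"}},       // ignored
		{Name: "NotebookEdit", Input: map[string]any{"notebook_path": "nb.ipynb"}},
	}
	fmt.Println(extractEditedFilesSketch(blocks))
}
```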
Export plumbing:
- render/export.go: Export() dispatches "exec" / "exec-md" to
ExecMarkdown directly, bypassing the Brief + format pipeline
(exec is structurally a report, not a rendered transcript).
- cmd/export.go: formatToExt maps "exec" / "exec-md" to
"-exec.md" so the default output filename is session-<id>-exec.md.
Tests (14 new, render pkg 80.8% coverage):
- extractEditedFiles: dedupes+sorts, ignores non-edit tools,
handles notebook_path/path field variants, empty on no
assistants.
- lastAssistantText: final summary with tool_use blocks
sprinkled, empty on tool-only turns.
- ExecMarkdown: empty session header-only, single-turn full
render, multi-turn ordering, empty-turn dropping, per-turn
cost emission.
- Export dispatch: "exec" and "exec-md" aliases produce
identical output.
- formatDurationShort / formatTokenCountShort helpers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Claude Code and Codex share the abstractions the timeline rail and
per-turn cost UI need (turns, assistant responses, tool calls, token
usage) — but Codex's wire format is different enough that ccx was
returning nil Usage on every Codex assistant message. That made the
per-turn spend section empty and the timeline rail heat flat for all
Codex sessions. Fix: extend MessageUsage for GPT-5's reasoning field,
add GPT-5/5.4 pricing, and attribute Codex token_count deltas to
the nearest assistant in the full parse.
Shape unification:
- parser.MessageUsage gains a ReasoningTokens field. Claude sessions
leave it 0; Codex populates it from reasoning_output_tokens.
- parser.MessageUsage.Total() now sums all 5 categories.
- ComputeCost includes ReasoningTokens at the output rate (OpenAI
bills GPT-5 thinking tokens as regular output).
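The reasoning-billed-as-output rule can be sketched as a small variant of per-category cost multiplication. Struct names here are illustrative; the rates are per million tokens as in the tiers below:

```go
package main

import "fmt"

type usage struct {
	Input, Output, CacheRead, CacheCreate, Reasoning int64
}

type pricing struct {
	Input, Output, CacheRead, CacheWrite float64 // USD per Mtok
}

func computeCostSketch(u usage, p pricing) float64 {
	const mtok = 1_000_000
	return float64(u.Input)/mtok*p.Input +
		// reasoning tokens are billed at the output rate (OpenAI model)
		float64(u.Output+u.Reasoning)/mtok*p.Output +
		float64(u.CacheRead)/mtok*p.CacheRead +
		float64(u.CacheCreate)/mtok*p.CacheWrite
}

func main() {
	// Illustrative tier: $10/$80 with cache reads at half the input rate
	// and no cache-write surcharge, mirroring the description below.
	gpt54 := pricing{Input: 10, Output: 80, CacheRead: 5}
	// 1M reasoning tokens at $80/Mtok contribute exactly $80.
	fmt.Println(computeCostSketch(usage{Reasoning: 1_000_000}, gpt54))
}
```

Folding Reasoning into the Output term (rather than a fifth rate) is what lets Claude sessions, which leave the field at 0, flow through the same function unchanged.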
GPT-5 / GPT-5.4 pricing (parser/pricing.go):
- Three new tiers: costGPT54_10_80 ($10/$80), costGPT54Mini_025_2
($0.25/$2), costGPT54Nano_005_04 ($0.05/$0.4). Cache reads at ~50%
of input per OpenAI's cached-prompt pricing. Cache writes stay 0
— OpenAI doesn't split cache creation the way Anthropic does.
- pricingTable adds gpt-5, gpt-5-mini, gpt-5-nano, gpt-5.4,
gpt-5.4-mini, gpt-5.4-nano rows.
- LookupPricing extends the ordered substring match to include the
GPT-5 family. nano wins before mini wins before plain gpt-5 so
"gpt-5-mini-2025-07-07" doesn't fall through to the more-expensive
full-sized tier.
UNVERIFIED: rates are pinned from publicly-known GPT-5 list pricing
circa late 2025. There's no equivalent of Claude Code's modelCost.ts
in the Codex reference checkout, so `make verify-pricing` cannot
cross-check these rates yet. If OpenAI rebases, update the tier
constants manually. Tracked as a known limitation.
Codex backend per-message attribution (codex/backend.go):
- tokenUsageTotals struct extended with ReasoningOutputTokens.
- Full-parse token_count handler now tracks previousTotals, computes
delta on each event, and attaches the delta as MessageUsage to
the most-recent assistant message without Usage via
latestAssistantWithoutUsage(). Rough per-turn attribution — Codex
reports running totals, not per-message counts, so delta-since-
last-event is the best available without a protocol change.
- clampNonNegative() guards against total regressions (rare, but
wire format doesn't forbid it).
- Previous total wrapped in a stack-local var at full-parse scope
so each ParseSession call starts fresh.
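The running-totals-to-deltas step can be sketched as below: Codex reports cumulative token counts per token_count event, so each event's usage is current minus previous, clamped non-negative. Names are illustrative, not the backend's exact identifiers:

```go
package main

import "fmt"

type totals struct {
	Input, Output, ReasoningOutput int64
}

func clampNonNegative(n int64) int64 {
	if n < 0 {
		return 0 // the wire format doesn't forbid regressions; never go negative
	}
	return n
}

// deltaSince converts two running-total snapshots into the usage
// attributable to the events between them.
func deltaSince(prev, cur totals) totals {
	return totals{
		Input:           clampNonNegative(cur.Input - prev.Input),
		Output:          clampNonNegative(cur.Output - prev.Output),
		ReasoningOutput: clampNonNegative(cur.ReasoningOutput - prev.ReasoningOutput),
	}
}

func main() {
	var prev totals // fresh per ParseSession call
	for _, cur := range []totals{
		{Input: 1000, Output: 200, ReasoningOutput: 50},
		{Input: 2500, Output: 450, ReasoningOutput: 50},
	} {
		d := deltaSince(prev, cur)
		fmt.Println(d.Input, d.Output, d.ReasoningOutput)
		prev = cur
	}
}
```

Keeping `prev` as a stack-local at full-parse scope (rather than backend state) is what guarantees each ParseSession starts from zero.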
Tests:
parser/pricing_test.go: (4 new on top of the prior pricing suite)
- TestLookupPricing_UnknownFamilyReturnsNil (renamed from
NonClaudeFamily): gpt-4, grok, llama, mistral still return nil
— we explicitly only match Claude and GPT-5 families.
- TestLookupPricing_GPT5Family: every model/canonical pair
(gpt-5, gpt-5.4, gpt-5-mini, gpt-5.4-nano, uppercase variant)
resolves to the correct tier. Match-order correctness: mini/nano
win before plain gpt-5.
- TestComputeCost_ReasoningTokensBilledAsOutput: 1M reasoning at
$80/Mtok adds $80 to cost.
codex/backend_usage_test.go: (5 new)
- TestLatestAssistantWithoutUsage_SkipsTaggedMessages: walks
backward, returns first untagged assistant.
- TestLatestAssistantWithoutUsage_ReturnsNilWhenAllTagged: no
double-attribution.
- TestLatestAssistantWithoutUsage_ReturnsNilWhenNoAssistants.
- TestLatestAssistantWithoutUsage_PicksLatestNotFirst: multiple
untagged assistants → the latest one is picked.
- TestClampNonNegative: negative deltas zero out.
Downstream effects (automatic):
- #2 per-turn spend breakdown now populates for Codex sessions.
- #5 timeline rail heat colors work for Codex ticks.
- #7 exec export shows Codex cost footers.
- `make verify-pricing` still clean on the Claude side; Codex
verification is an acknowledged follow-up.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…th canonicalization

Addresses review round 1 findings from linus-code-reviewer on the 10-commit v0.next stack. Two HIGH severity bugs were ship-blockers.

HIGH #1: Pricing match pollution for GPT-5 variants.

Previous LookupPricing used plain strings.Contains with no delimiter awareness. That matched "gpt-5" inside "gpt-51" (which would be a hypothetical GPT-5.1) and, critically, matched "gpt-5" inside "openai/gpt-5-codex-mini" WITHOUT also matching the mini suffix, resolving it to the full-tier rate (~40x overcharge for real Codex sessions that use that model string).

Fix: delimited substring matching via two helpers:

- containsDelimited(name, family): requires the next character after family to be one of "-./_ :" (or end of string). Rejects "gpt-5" inside "gpt-51" and "claude-opus-4" inside "claude-opus-42".
- hasVariantToken(name, variant): checks for variant tokens (mini, nano) surrounded by delimiters on both sides. Matches "-mini-", "-mini/", and "-mini" at end of string; does NOT match "cuminary" or "terminal".

The GPT-5 match now follows a family-then-variant pattern:

- Check whether the name contains gpt-5.4 or gpt-5 (delimited).
- Within the family, check for nano/mini tokens (delimited).
- Fall through to the full tier.

Added adversarial test cases from the review:

- openai/gpt-5-codex-mini → gpt-5.4-mini tier (was full tier)
- gpt-5.4-turbo-mini-v2 → gpt-5.4-mini tier
- anthropic/gpt-5.4-nano-preview → gpt-5.4-nano tier
- gpt-51 → nil (was full tier)
- gpt-5o → nil
- claude-opus-42 → nil (was claude-opus-4 tier)
- claude-sonnet-40 → nil

Also applied the delimited pattern to the Claude family, since the same class of bug could appear for any future hypothetical "claude-opus-4-7" or "claude-sonnet-4-7" variant that wasn't in the table.

HIGH #2: Codex per-message attribution lost multi-assistant turns.

Previous latestAssistantWithoutUsage returned only the single most-recent untagged assistant, dumping the entire delta on it. For turns that produced multiple agent_message events between token_count checkpoints (reasoning + partial reply + tool + partial reply is standard GPT-5 shape), earlier messages stayed at $0.00 forever and the last one got everything.

Fix: distributeCodexDelta splits the delta evenly across every untagged assistant in messages[usageWatermark:]. usageWatermark is a high-water-mark index that advances on every token_count event, so previous bursts aren't re-attributed.

- Even split by count, not by per-message output size, because Codex doesn't expose that; this is the best available without a protocol change. The per-turn UI aggregation reassembles correctly.
- The integer-division remainder is absorbed by the first message so sum(per-message) == delta exactly. Verified by TestDistributeCodexDelta_RemainderGoesToFirst.
- Guards: silently drops the delta when there are no untagged targets (no panic), and skips already-tagged messages on subsequent events (no double-attribution).

Also fixed in the same token_count handler: running-total regression accounting. Previously clampNonNegative swallowed negative deltas but left previousTotals at the inflated value, so a regression-then-recovery would double-count (100→200→150→250 would emit deltas 100, 0, 100; it should emit 100, 0, 50). previousTotals now uses max() as a floor via a maxInt helper.

MEDIUM: Tests for Codex attribution were shape-assertions only. Added 5 tests that exercise the actual behavior:

- TestDistributeCodexDelta_EvenSplitAcrossUntagged (reasoning + agent_message both get half the delta, both get non-zero cost)
- TestDistributeCodexDelta_RemainderGoesToFirst (delta preserved under integer division)
- TestDistributeCodexDelta_SkipsTaggedMessages (prior attribution not overwritten)
- TestDistributeCodexDelta_SilentlyDropsWithNoUntaggedTargets
- TestMaxInt

MEDIUM: The in-memory LRU key was not canonicalized. Multi.ParseSession now passes absPath (computed once at the top) to cache.getOrLoad instead of the raw filePath. Two callers with different representations of the same file ("./s.jsonl" vs "/wd/s.jsonl") now hit the same cache slot. Correctness was preserved before the fix (this was only a cache-hit-rate regression), but worth fixing.

MEDIUM: gofmt for backend.go. The tokenUsageTotals struct field alignment broke after adding ReasoningOutputTokens (my commit, not pre-existing). Fixed; gofmt -l now passes cleanly.

All existing tests still pass. 9 new adversarial/coverage tests added. Full suite green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
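The two helpers from HIGH #1 can be sketched as a minimal standalone version. Only the delimiter rules come from the commit message; the linear-scan structure, the `delims` constant, and the demo model strings are illustrative assumptions, not the actual ccx implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// Delimiters that may legally border a family or variant token
// (assumed set, taken from the commit message's "-./_ :").
const delims = "-./_ :"

// containsDelimited reports whether name contains family immediately
// followed by a delimiter or end of string, so "gpt-5" matches
// "openai/gpt-5-codex" but not "gpt-51".
func containsDelimited(name, family string) bool {
	for i := 0; i+len(family) <= len(name); i++ {
		if name[i:i+len(family)] != family {
			continue
		}
		end := i + len(family)
		if end == len(name) || strings.ContainsRune(delims, rune(name[end])) {
			return true
		}
	}
	return false
}

// hasVariantToken reports whether variant ("mini", "nano") appears with
// delimiters on BOTH sides, so "-mini-" and a trailing "-mini" match but
// "terminal" never does.
func hasVariantToken(name, variant string) bool {
	for i := 0; i+len(variant) <= len(name); i++ {
		if name[i:i+len(variant)] != variant {
			continue
		}
		leftOK := i == 0 || strings.ContainsRune(delims, rune(name[i-1]))
		end := i + len(variant)
		rightOK := end == len(name) || strings.ContainsRune(delims, rune(name[end]))
		if leftOK && rightOK {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsDelimited("openai/gpt-5-codex-mini", "gpt-5")) // family matches, '-' follows
	fmt.Println(containsDelimited("gpt-51", "gpt-5"))                  // rejected: '1' is not a delimiter
	fmt.Println(hasVariantToken("openai/gpt-5-codex-mini", "mini"))    // variant token at end of string
}
```

The byte-wise scan is safe here because every token in the pricing table is ASCII; a table with non-ASCII model names would need rune-aware indexing.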
Round 2 review noted that round 1's own pricing_test.go (and four other stack-added files) had comment-alignment drift from gofmt's preferred layout. Ran gofmt -w on only the files this stack introduced; pre-existing gofmt-dirty files (internal/cmd/search.go, internal/cmd/view.go, internal/provider/codex/helpers_test.go, internal/provider/codex/parse_branches_test.go) are left alone so the diff stays scoped to v0.next work.

Files reformatted:

- cmd/ccx-verify-pricing/main.go
- internal/parser/pricing_test.go
- internal/parser/session_test.go
- internal/provider/disk_cache_test.go
- internal/web/markdown_test.go

All tests still pass. No behavior changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ignore Codex-only rows in Claude pricing verification
- keep early token deltas, reasoning tokens, and sidechain work in the parent turn
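A minimal sketch of the Codex attribution scheme described in the round 1 fix, combining the even split with remainder-to-first and the max()-floored running totals. The semantics (watermark, untagged targets, silent drop, regression floor) follow the commit message; the `message` struct, the token field, and the helper names `distributeDelta`/`nextDelta` are simplified stand-ins, not the real ccx types.

```go
package main

import "fmt"

// message is a simplified stand-in for ccx's parsed message type.
type message struct {
	role         string
	outputTokens int64
	tagged       bool // already received a share from an earlier token_count event
}

// distributeDelta splits a token_count delta evenly across every untagged
// assistant message at or after the watermark. The integer-division
// remainder goes to the first target so sum(per-message) == delta exactly.
// With no untagged targets the delta is dropped silently (no panic).
func distributeDelta(msgs []message, watermark int, delta int64) {
	var targets []int
	for i := watermark; i < len(msgs); i++ {
		if msgs[i].role == "assistant" && !msgs[i].tagged {
			targets = append(targets, i)
		}
	}
	if len(targets) == 0 {
		return
	}
	share := delta / int64(len(targets))
	rem := delta % int64(len(targets))
	for n, i := range targets {
		amount := share
		if n == 0 {
			amount += rem // first message absorbs the remainder
		}
		msgs[i].outputTokens += amount
		msgs[i].tagged = true
	}
}

// nextDelta turns a (possibly regressing) running total into a non-negative
// delta, flooring the previous total with max so a regression-then-recovery
// is not double-counted: totals 100, 200, 150, 250 yield 100, 100, 0, 50.
func nextDelta(previous *int64, total int64) int64 {
	delta := total - *previous
	if delta < 0 {
		delta = 0 // clamp the regression
	}
	if total > *previous {
		*previous = total // max() floor: never let the baseline drop
	}
	return delta
}

func main() {
	msgs := []message{{role: "user"}, {role: "assistant"}, {role: "assistant"}}
	distributeDelta(msgs, 0, 101)
	fmt.Println(msgs[1].outputTokens, msgs[2].outputTokens) // remainder lands on the first target
}
```

Because every target is marked tagged, a second delta for the same burst finds no untagged messages and is not re-attributed, which is the no-double-attribution guard the tests pin.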
Follow-up for the round 2 review is on Fixed:
Verification:
Closes the v0.next — complete product pass milestone (#6).

Summary
Twelve commits that ship the five v0.next issues plus one real pricing bug the reference-driven verification tool surfaced:
b6797c8, 6385a6b, fe540a8, 7208560, 6b9fad0, dbdba60, 290a6f1, 0357a50, 729f86a, a637914, c884786, ae9e127

The load-bearing catches
Opus 4.5 / 4.6 overcharge (dbdba60). The embedded pricing table had these at tier $15/$75 (the Opus 4/4.1 rate). Claude Code's own src/utils/modelCost.ts has them at tier $5/$25, roughly 3× lower. Every per-turn cost figure ccx ever displayed for Opus 4.5/4.6 sessions was inflated by ~3×. Fixed. cmd/ccx-verify-pricing (new) parses the Claude Code source with regex and diffs against our pricing table, make verify-pricing wires it into the Makefile, and unit tests feed it a fake source with deliberate drift to prove the detector works.

GPT-5 pricing match pollution (c884786). Round 1 review found that LookupPricing used plain strings.Contains with no delimiter awareness. That made openai/gpt-5-codex-mini resolve to the full GPT-5 tier instead of the mini tier, a ~40× overcharge for Codex mini sessions. Replaced with containsDelimited + hasVariantToken helpers that require delimiter boundaries. Adversarial tests from the review are pinned.

Codex multi-assistant attribution (c884786). Round 1 also caught that per-message attribution dropped tokens whenever a turn produced multiple agent_message events between token_count checkpoints (reasoning → tool → reply is standard GPT-5 shape). Earlier messages in the burst showed $0.00. Fixed with a usageWatermark index + distributeCodexDelta that evenly splits the delta across every untagged assistant.

Rail shipping details
The timeline rail went through three redesigns during review:
Per-turn spend breakdown (#2)
The info panel now shows a Per-turn spend section sorted by cost desc. Each row links to #msg-<anchor> and composes with the load-earlier fix, so deep-links into collapsed history auto-reload. Session-level Cost row in the Tokens section. Single source of truth: session.Stats.CostUSD == sum(message.Usage.CostUSD) by construction.

Performance (#4 + #9)
Two-tier cache:
- In-memory LRU, invalidated on (mtime, size) change.
- Disk cache at $XDG_DATA_HOME/ccx/session-cache/<sha256>.gob. Survives ccx web restart. Corrupt files dropped + swept.
- gob.Register(map[string]any{}) + gob.Register([]any{}) registered in init() so ContentBlock.ToolInput round-trips cleanly.

Codex support (#8)
Unified data layer for features across Claude Code and Codex:
- MessageUsage.ReasoningTokens field for GPT-5 reasoning output
- token_count event deltas with distributeCodexDelta

Tests
Full suite green. Coverage by package:
- internal/parser
- internal/provider
- internal/render
- internal/web
- cmd/ccx-verify-pricing

New tests:
Reviews
Round 1 (linus-code-reviewer) caught two HIGH severity bugs: the pricing match pollution and the Codex multi-assistant attribution loss. Both fixed in c884786 with adversarial tests pinned.

Round 2 (independent second pass) verified the round 1 fixes hold up under 20+ adversarial inputs, noted 3 LOW items that are intentionally deferred (UTF-8 byte-slicing in tickSnippet, disk_cache double-close defer, truncatePath pre-existing), and returned a ship-it verdict. Cleared the remaining gofmt drift (ae9e127).

Known limitations (documented, not ship-blockers)
- usage.speed is not accounted for; requires plumbing the speed field through the parser.
- No modelCost.ts in the Codex reference checkout, so make verify-pricing only cross-checks the Claude side. Update internal/parser/pricing.go manually if OpenAI rebases.
- Bare ## lines are dropped by the new server-side markdown header parser. Valid per CommonMark but visually useless.

Test plan
🤖 Generated with Claude Code